Information Theory: Comparing the surprisal of words that differ in frequency

This is a fairly broad question, and I'm not sure whether cstheory would be a better place for it.

How can I compare the conditional surprisal of words that vary in frequency?

$ S(w \mid context) = -\log p(w \mid context) = -\log \frac{count(w,context)}{count(context)} $

The joint count $ count(w,context)$ depends on the overall frequency of the word $ w$, because it can be factored as $ p(context \mid w) \cdot count(w)$. All else being equal, this means a more frequent word will tend to have a lower surprisal.

Is there a way to compare the surprisal of words with different counts/frequencies, i.e., to control for frequency? Do I just divide $ count(w,context)$ by $ count(w)$ instead of $ count(context)$, so that the count is normalized by the word's own frequency?
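
To make the two quantities concrete, here is a minimal Python sketch on made-up toy data. The `pairs` list, the function names `surprisal` and `word_normalized`, and the choice of base-2 logarithms are all assumptions for illustration; none of them come from a real corpus or library.

```python
import math
from collections import Counter

# Toy (context, word) observations; purely illustrative, not real corpus data.
pairs = [
    ("the", "cat"), ("the", "cat"), ("the", "dog"), ("the", "aardvark"),
    ("a", "cat"), ("a", "dog"), ("a", "dog"), ("my", "cat"),
]

pair_count = Counter(pairs)                      # count(w, context)
context_count = Counter(c for c, _ in pairs)     # count(context)
word_count = Counter(w for _, w in pairs)        # count(w)


def surprisal(word, context):
    """S(w|context) = -log2(count(w, context) / count(context))."""
    return -math.log2(pair_count[(context, word)] / context_count[context])


def word_normalized(word, context):
    """-log2(count(w, context) / count(w)): the normalization asked about above."""
    return -math.log2(pair_count[(context, word)] / word_count[word])


for w in ("cat", "aardvark"):
    print(w, word_count[w], surprisal(w, "the"), word_normalized(w, "the"))
```

In this toy data the frequent word "cat" has a lower surprisal after "the" than the rare word "aardvark", which is the effect described above. Note that $ count(w,context)/count(w)$ is an estimate of $ p(context \mid w)$, so the second function measures how surprising the context is given the word, which may or may not be the kind of frequency control I am after.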