The capacity of a discrete memoryless channel is given by the maximum of the mutual information over all possible input probability distributions. That is

\begin{align} C &= \max_{p_X} I(X;Y) \\ &= \max_{p_X} \big[ H(Y) - H(Y|X) \big] \end{align}

Since $H(Y|X) = \sum_x p_X(x)\, H(Y \mid X = x)$, where each term $H(Y \mid X = x)$ is fixed by the channel, it seems that whenever these per-input entropies are all equal (as in a symmetric channel), $H(Y|X)$ is a constant that has nothing to do with $p_X$. In that case, maximizing the mutual information appears to be equivalent to maximizing $H(Y)$, i.e., we want to choose the $p_X$ that induces a uniform distribution over the output alphabet.
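As a sanity check on that reasoning, here is a minimal numerical sketch (the crossover probability $f = 0.1$ and the grid resolution are my own arbitrary choices, not from the lecture): for a binary symmetric channel, a brute-force sweep over input distributions does land on the uniform input, and the induced output distribution is uniform as well.

```python
import math

def h2(q):
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def bsc_mi(p, f):
    """I(X;Y) for a binary symmetric channel with crossover prob f.

    p = P(X=1); then P(Y=1) = p(1-f) + (1-p)f, and H(Y|X) = h2(f)
    regardless of p, so here the H(Y|X) term really is input-independent.
    """
    py1 = p * (1 - f) + (1 - p) * f
    return h2(py1) - h2(f)

f = 0.1
grid = [i / 10000 for i in range(10001)]
p_star = max(grid, key=lambda p: bsc_mi(p, f))
C = bsc_mi(p_star, f)
py1 = p_star * (1 - f) + (1 - p_star) * f

print(f"optimal P(X=1) = {p_star:.4f}")   # 0.5000: uniform input
print(f"capacity = {C:.4f} bits")         # 1 - h2(0.1) = 0.5310
print(f"output P(Y=1) = {py1:.4f}")       # 0.5000: uniform output
```

So for this symmetric channel the intuition checks out: capacity $1 - h_2(f)$ is achieved by the uniform input, which makes the output uniform.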

**Incorrect claim:** If the capacity of the channel is achieved, then the distribution over the output alphabet must be uniform.

In this lecture, a remark is made at 15:00: the lecturer notes that this is not true for all channels, and that there exist channels whose capacity is achieved without even using the full output alphabet. Can anyone give an example of such a channel, along with some general intuition about when the claim fails?
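One channel worth experimenting with (offered as an illustration, not necessarily the one the lecturer had in mind) is the Z-channel: input 0 is received noiselessly, while input 1 flips to 0 with probability $f$. Here $H(Y|X) = p\, h_2(f)$ genuinely depends on $p = P(X=1)$, and a brute-force sweep (parameter $f = 0.5$ and the grid are arbitrary choices) shows the capacity-achieving output distribution is far from uniform:

```python
import math

def h2(q):
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def z_mi(p, f):
    """I(X;Y) for a Z-channel: 0 -> 0 always, 1 -> 0 with prob f.

    p = P(X=1); then P(Y=1) = p(1-f), and H(Y|X) = p * h2(f),
    which depends on the input distribution, unlike the symmetric case.
    """
    return h2(p * (1 - f)) - p * h2(f)

f = 0.5
grid = [i / 10000 for i in range(10001)]
p_star = max(grid, key=lambda p: z_mi(p, f))
C = z_mi(p_star, f)
py1 = p_star * (1 - f)

print(f"optimal P(X=1) = {p_star:.4f}")                      # 0.4000
print(f"capacity = {C:.4f} bits")                            # 0.3219
print(f"output P(Y=0), P(Y=1) = {1 - py1:.4f}, {py1:.4f}")   # 0.8000, 0.2000
```

For $f = 0.5$ the optimum is $P(X=1) = 0.4$, giving output distribution $(0.8, 0.2)$: a capacity of roughly $0.32$ bits achieved with a decidedly non-uniform output. The intuition this suggests: once $H(Y|X)$ varies with the input distribution, pushing $H(Y)$ all the way to its maximum can cost more in the $H(Y|X)$ term than it gains.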