On p. 7 of the book "Introduction to Information Retrieval" (by Manning et al.), the authors explain how, given a collection of text documents, an inverted index is built by tokenizing the documents, sorting the resulting (term, docID) pairs, merging multiple occurrences of the same pair, and grouping the pairs for each term into a posting list. It is then mentioned that "Because a term generally occurs in a number of documents, this data organization already reduces the storage requirements of the index." My question is: how are the storage requirements reduced? It seems to me that if a term occurs in $k$ documents, then the IDs of all $k$ documents must still appear in the linked list for this term.
If a term occurs multiple times in the same document, then because the inverted index mentions a docID at most once in a term’s list, there is a reduction in storage. But this condition is different from the condition that a term occurs in multiple (distinct) documents.
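To make the question concrete, here is a minimal Python sketch of the construction as I understand it. The function name `build_inverted_index`, the toy documents, and the whitespace tokenizer are my own assumptions for illustration and are not taken from the book, which uses linked lists rather than Python lists.

```python
from itertools import groupby


def build_inverted_index(docs):
    """Build a toy inverted index from a dict mapping docID -> text.

    Returns the raw (term, docID) pairs, the sorted de-duplicated pairs,
    and the index mapping each term to its sorted list of docIDs.
    """
    # Step 1: tokenize (naively, on whitespace) into (term, docID) pairs.
    pairs = [
        (term, doc_id)
        for doc_id, text in docs.items()
        for term in text.lower().split()
    ]
    # Steps 2 and 3: sort the pairs and merge repeated (term, docID) pairs.
    unique_pairs = sorted(set(pairs))
    # Step 4: group the pairs by term to form one posting list per term.
    index = {
        term: [doc_id for _, doc_id in group]
        for term, group in groupby(unique_pairs, key=lambda pair: pair[0])
    }
    return pairs, unique_pairs, index


# Hypothetical three-document collection; "a" occurs twice in document 1.
docs = {1: "a b a", 2: "a c", 3: "a b"}
pairs, unique_pairs, index = build_inverted_index(docs)

print(len(pairs))         # raw pairs, including the repeated ("a", 1)
print(len(unique_pairs))  # pairs left after merging duplicates
print(index)              # one posting list per term
```

In this toy collection the term "a" occurs in $k = 3$ documents, and its posting list still holds all three docIDs. The only entry that disappears is the repeated ("a", 1) pair, which is the within-document case described above.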