# Set data structure for data too large to fit into memory

I’m trying to solve the following exercise:

Given N data items and memory that can hold M/B blocks of size B. Describe a data structure that needs at most N/B blocks of external memory and allows us to answer questions of the form “$$s \in S$$?” using $$log_BN$$ I/O operations.

We may assume a static set, i.e. we do not care for efficient insertion or deletion of elements to or from the set and do not care for efficient creation of the data structure. Furthermore, M and B are known.

If the data is ordered, I could sort the data using e.g. MergeSort for external memory, then partition the sorted data into N/B sections and create a B-tree, where each node consists of a block of memory with elements containing a value of the set and an external memory location pointing to the block which contains all elements larger or equal to the value and smaller than the value of the next element in the parent block. In that case we would need $$N/B$$ external memory locations to store the block and $$log_B N$$ I/O operations to search for a value.

My question is about how one would handle non-ordered data? I don’t see how one could encode the data as a numeric value and then use the B-tree construction.