Suppose you have $ n$ nonequal sets $ S_1, \ldots, S_n$ and some constant $ 0 \le k < n$ . The goal of *set clustering* is to find a partition of the set $ \mathbf{S} = \{S_1, \ldots, S_n\}$ such that the sum of the *total distance* for each subset of $ \mathbf{S}$ is minimized and such that $ \mathrm{cardinality}(\mathbf{S}) = k$ (in reality, this latter constraint is not quite so tight, but the size of $ \mathbf{S}$ must be less than $ k$ , and hopefully near it). The *total distance* $ d$ of a set $ X \in \mathbf{S}$ is $ d(X) = \sum_{A, B \in X} \mathrm{cardinality}(A \ominus B)$ where $ \ominus$ is symmetric difference. Assume for the purposes of the problem that the sets consume $ O(1)$ space (so symmetric difference can be computed in constant time).

Is there a good greedy linear(ithmic) heuristic for this problem? Is there any literature on this problem, or similar ones?

All I’ve come up with so far is an $ O(n^2 \log n)$ heuristic that looks like:

- Set $ I = \{1, \ldots, n\}$
- Choose some $ i \in I$
- Emit a cluster $ C \subset I$ containing all $ m \in I$ such that $ \mathrm{cardinality}(S_i \ominus S_m) \le d_\mathrm{max}$ .
- Set $ I = I – C$
- Go to step 2.

where this process occurs in each step of a binary search that finds the best value of $ d_\mathrm{max}$ for a given $ k$ .

I was thinking that if there were some way to sort the list of sets such that nearby sets in the list have small symmetric difference, then a linearithmic solution might be easy to write.