# Clustering sets by set difference

Suppose you have $$n$$ nonequal sets $$S_1, \ldots, S_n$$ and some constant $$0 \le k < n$$. The goal of set clustering is to find a partition of the set $$\mathbf{S} = \{S_1, \ldots, S_n\}$$ such that the sum of the total distance for each subset of $$\mathbf{S}$$ is minimized and such that $$\mathrm{cardinality}(\mathbf{S}) = k$$ (in reality, this latter constraint is not quite so tight, but the size of $$\mathbf{S}$$ must be less than $$k$$, and hopefully near it). The total distance $$d$$ of a set $$X \in \mathbf{S}$$ is $$d(X) = \sum_{A, B \in X} \mathrm{cardinality}(A \ominus B)$$ where $$\ominus$$ is symmetric difference. Assume for the purposes of the problem that the sets consume $$O(1)$$ space (so symmetric difference can be computed in constant time).

Is there a good greedy linear(ithmic) heuristic for this problem? Is there any literature on this problem, or similar ones?

All I’ve come up with so far is an $$O(n^2 \log n)$$ heuristic that looks like:

1. Set $$I = \{1, \ldots, n\}$$
2. Choose some $$i \in I$$
3. Emit a cluster $$C \subset I$$ containing all $$m \in I$$ such that $$\mathrm{cardinality}(S_i \ominus S_m) \le d_\mathrm{max}$$.
4. Set $$I = I – C$$
5. Go to step 2.

where this process occurs in each step of a binary search that finds the best value of $$d_\mathrm{max}$$ for a given $$k$$.

I was thinking that if there were some way to sort the list of sets such that nearby sets in the list have small symmetric difference, then a linearithmic solution might be easy to write.