# Clustering data for compression with PCA

If I have datapoints in a high dimensional space and want to find a (linear) subspace onto which a data-set projects well, I can use PCA and then discard less important dimensions of the new basis to get compressed datapoints. However, often the data can be projected onto lower dimensional spaces with much smaller error if one first separates them into a couple of classes and then performs PCA for each class individually. What kind of algorithm can find such clusters? Just clustering based on distance in the high dimensional space won’t be very useful:

Example:

If I’d just cluster first based on distance in the high-dimensional space, I would arrive at the bad clustering. There are 5 clusters and the green and red clusters don’t project very well onto a 2D subspace.

As a human looking at the data, I see however that if I separate the data as indicated, red and blue will project very well onto a plane each and green will project very well onto a line, so I can run PCA for each group individually and store the red data points with 2 values each and the gree ones with 1 value each (plus a 2bit index on each datapoint to label which group it belongs to) and get a very low error upon uncompressing.

How can I automate this clustering based on how well it will project onto as low-imensional subspaces as possible?

Something like minimize E = SumOverClusters(SumOverPoints(SquaredDist(projected_point, original_point)) * (number_dims_projected / number_dims_original)) + C * number_of_clusters

What technique is well suited to do that?

(edit: while the example shows a 3d space, I’m more interested in doing that in about 64dimensional spaces)