Originally posted in stack-overflow but was told to post here.
Context: I am doing a PCA on a MxN (N >> M) matrix with some invalid values located in the matrix. I cannot infer these values, so I need to remove all of them, which means I need to delete the whole corresponding row or column. Of course I want to keep the maximum amount of data. The invalid entries represent ~30% of data, but most of it is completly fill in a few lines, few of it is scattered in the rest of the matrix.
Some possible approches:
Similar to this problem , where I format my matrix such that valid data entries are equal to 1 and invalid entries to a huge negative number. However, all proposed solutions are of exponential complexity and my problem is simpler.
Computing the ratio (invalid data / valid data) for each row or column, and deleting the highest ratio(s). Recompute the ratios for the sub-matrix and remove the highest(s) ratios. (not sure how many lines or columns we can remove safely in one step), and so on until there is no invalid data left. It seems like an okay solution, but I am unsure it always gives the optimal solution.
My guess is that it is a standard data analysis problem, but surprisingly I could not find a solution online.