# Optimize sorting matrix entries by row and column

I am writing a routine to store an $$M$$-by-$$N$$ sparse matrix in a balanced binary tree. The insertion routine calls a comparison function to determine where a new matrix entry $$(i,j)$$ should be inserted in the tree. I am defining the ordering of the matrix elements simliar to how C handles a row-major ordered dense matrix. In particular, a matrix element $$(i_a,j_a)$$ comes before $$(i_b,j_b)$$ if: $$N i_a + j_a < N i_b + j_b$$ I need to write a routine which returns $$-1$$ if $$(i_a,j_a)$$ is less than $$(i_b,j_b)$$, $$+1$$ if $$(i_a,j_a)$$ is greater than $$(i_b,j_b)$$ and $$0$$ if they are equal. I want this routine to be as fast as possible since I will be calling it hundreds of millions of times.

I’ve tried two options so far, which I implement as macros to avoid the overhead of a function call.

Option 1: Comparing rows first and then compare columns

``#define COMPARE(ia,ja,ib,jb) (ia < ib ? -1 : (ia > ib ? 1 : (ja < jb ? -1 : ja > jb))) ``

The worst case for this option is 4 comparisons between integers.

Option 2: Compare full linear index

``#define COMPARE(ia,ja,ib,jb) (ia*N + ja < ib*N + jb ? -1 : ia*N + ja > ib*N + jb) ``

The worst case for this option is 2 multiplies, 2 adds, and 2 comparisons (assuming the compiler optimizes so that the quantities $$N i_a + j_a$$ and $$N i_b + j_b$$ are computed only once).

So far, Option 1 is a clear winner, and runs faster by around 20-30% on tests I have done.

My question is: can anyone suggest a way to optimize this even more, by somehow improving on Option 1, or perhaps suggesting a completely different method of sorting matrix elements in the tree?

Profiling has shown that a significant amount of time is spent in this comparison function, so it is worth it to me to make it as fast as possible.

Another idea I have is to use 2 levels of binary trees. The high-level tree uses only row indices for sorting. Then each node in that tree contains a pointer to another tree which uses the column indices for sorting. This way the comparison function only needs to compare two integers instead of 4. But having 2 levels of tree would make things more complicated. I have not implemented or benchmarked this approach yet.