Optimize sorting matrix entries by row and column

I am writing a routine to store an $ M$ -by-$ N$ sparse matrix in a balanced binary tree. The insertion routine calls a comparison function to determine where a new matrix entry $ (i,j)$ should be inserted in the tree. I am defining the ordering of the matrix elements simliar to how C handles a row-major ordered dense matrix. In particular, a matrix element $ (i_a,j_a)$ comes before $ (i_b,j_b)$ if: $ $ N i_a + j_a < N i_b + j_b $ $ I need to write a routine which returns $ -1$ if $ (i_a,j_a)$ is less than $ (i_b,j_b)$ , $ +1$ if $ (i_a,j_a)$ is greater than $ (i_b,j_b)$ and $ 0$ if they are equal. I want this routine to be as fast as possible since I will be calling it hundreds of millions of times.

I’ve tried two options so far, which I implement as macros to avoid the overhead of a function call.

Option 1: Comparing rows first and then compare columns

#define COMPARE(ia,ja,ib,jb) (ia < ib ? -1 : (ia > ib ? 1 : (ja < jb ? -1 : ja > jb))) 

The worst case for this option is 4 comparisons between integers.

Option 2: Compare full linear index

#define COMPARE(ia,ja,ib,jb) (ia*N + ja < ib*N + jb ? -1 : ia*N + ja > ib*N + jb) 

The worst case for this option is 2 multiplies, 2 adds, and 2 comparisons (assuming the compiler optimizes so that the quantities $ N i_a + j_a$ and $ N i_b + j_b$ are computed only once).

So far, Option 1 is a clear winner, and runs faster by around 20-30% on tests I have done.

My question is: can anyone suggest a way to optimize this even more, by somehow improving on Option 1, or perhaps suggesting a completely different method of sorting matrix elements in the tree?

Profiling has shown that a significant amount of time is spent in this comparison function, so it is worth it to me to make it as fast as possible.

Another idea I have is to use 2 levels of binary trees. The high-level tree uses only row indices for sorting. Then each node in that tree contains a pointer to another tree which uses the column indices for sorting. This way the comparison function only needs to compare two integers instead of 4. But having 2 levels of tree would make things more complicated. I have not implemented or benchmarked this approach yet.