In the paper Comparison Based Sorting for Systems with Multiple GPUs, the authors describe selecting a pivot element with respect to the partition on the first GPU (and its mirrored counterpart on the other GPU's partition). That pivot element is crucial for merging the two partitions, given that we have already sorted them locally on each GPU.

However, the pseudo-code for that pivot selection, as given in the paper, does not seem to tell the whole story: when implementing it 1:1, the selected pivot element is off by a few elements in some cases, depending on the input, i.e. on the number of elements to sort and therefore the number of elements per partition (the chunk of data each GPU receives).

To be more specific, the problem, as I understand it, is that the while loop exits too early: the stride is reduced to zero before the correct pivot element has been found. The approach is essentially a binary search, where the range in which the pivot can fall is halved on each iteration.

Can anyone spot what needs to be done here?

Here is a C++ implementation of the pivot selection:

```cpp
size_t SelectPivot(const std::vector<int> &a, const std::vector<int> &b) {
  size_t pivot = a.size() / 2;
  size_t stride = pivot / 2;
  while (stride > 0) {
    if (a[a.size() - pivot - 1] < b[pivot]) {
      // Check whether the neighbouring elements confirm the pivot.
      if (a[a.size() - pivot - 2] < b[pivot + 1] &&
          a[a.size() - pivot] > b[pivot - 1]) {
        return pivot;
      } else {
        pivot = pivot - stride;  // pivot is too large, move left
      }
    } else {
      pivot = pivot + stride;  // pivot is too small, move right
    }
    stride = stride / 2;  // halve the search range
  }
  return pivot;
}
```

P.S.: I tried ceiling the stride so as not to skip candidates when the stride is odd, but this moved the indices out of the bounds of the arrays, and even after handling those cases by clipping to the array bounds, the pivot was not always correct.
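For reference, a brute-force check like the following can be compared against `SelectPivot`. This helper is my own sketch, not from the paper: it assumes the pivot is meant to be the number of elements to exchange between the two partitions, and it uses `<=` for ties where the pseudo-code uses strict comparisons.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference helper (my assumption, not from the paper):
// returns the smallest p such that swapping the last p elements of a with
// the first p elements of b leaves every element that ends up in a <= every
// element that ends up in b. Assumes a and b are sorted ascending and have
// the same length n.
size_t BruteForcePivot(const std::vector<int> &a, const std::vector<int> &b) {
  const size_t n = a.size();
  for (size_t p = 0; p <= n; ++p) {
    // Max of the kept part of a must not exceed min of the kept part of b.
    const bool kept_ok = (p == n) || (a[n - p - 1] <= b[p]);
    // Max of the elements moved out of b must not exceed
    // min of the elements moved out of a.
    const bool moved_ok = (p == 0) || (b[p - 1] <= a[n - p]);
    if (kept_ok && moved_ok) return p;
  }
  return n;  // not reached for sorted inputs of equal length
}
```

Running both functions over many randomly generated sorted arrays should make it easy to collect the inputs for which the binary search exits early.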