I’m writing a b*tree library in Rust. I’m thinking it might be better to make the purposeful decision to only implement half of the b* optimizations. (And not because it is easier, although it is.)
According to my reading B* is 2 optimizations (for insertion anyway):
- If a node is full but a sibling has free space, move something over to the sibling before adding, then no recursion is necessary. With circular buffers this is really fast.
- If a node is full and has a full sibling, split them into 3 nodes, distribute the existing elements as 2/3 2/3 2/3 among them.
Without the rule 2 optimization, when a node is full, it is just split into 2 nodes with a 1/2 1/2 distribution between them.
I see how theoretically rule 2 is interesting, it guarantees that every node in the tree is always 2/3 full (except for the root and for very small trees).
But for a downside, consider the use case where you usually add the nodes in order (for example, adding 1, 2, 3, 4, 5….mostly in order). When you keep rule 2 then the tree is almost always exactly 2/3 full and no better; because, after the sibling gets it allotted 2/3, it is never looked at again. But without rule 2, the tree (at least the leaves anyway) is nearly 100% full all the time. Because the last node which is full gets split into 2 nodes, they each get 1/2 of the elements, and the previous sibling remains full and untouched. On the subsequent insertions, due to rule 1, the last 2 nodes fill up to be 100% full again before splitting.
And about the case with random order insertions, with rule 1 and without rule 2, the worse case scenario (I think really rare and difficult to construct) is a 50% fill rate and the best case is still the 100% fill rate. The average I would guess is still the same as with rule 2, since the tree is still usually between 2/3 and 100% full.
For a tree with large nodes (approaching page size), a possible problem would be increasing the number of page loads because of the need to check siblings. But it seems to me that even though rule 2 reduces the variance in the density of the tree, it doesn’t do much to actually increase the absolute density of the tree, that is moreso accomplished by rule 1. So I would expect the average number of page accesses to be about the same or probably less.
Can someone give me a good reason to keep rule 2 when inserting into a B* Tree?