Let’s Invent B(+)-Trees

Let’s invent B(+)-trees.

Say we have some things, and we want to be able to store and look them up in
order. We could just put them in a big sorted array:

  10,20,30,40,50,60,70

This is great, but it’s hard to mutate — to insert or delete an element, we
might have to move all the others!
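
For instance, here’s a minimal sketch of that insert in C (purely illustrative,
names are mine): every element after the insertion point has to shift by one.

  #include <string.h>

  /* Insert `key` into a sorted array `a` of `n` ints (capacity must exceed n).
     In the worst case this shifts all n existing elements. */
  static void sorted_array_insert(int *a, int n, int key) {
      int i = 0;
      while (i < n && a[i] < key)
          i++;
      memmove(&a[i + 1], &a[i], (size_t)(n - i) * sizeof a[0]);
      a[i] = key;
  }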

So, instead, let’s split our array into fixed-size blocks — which can be in any
order — and keep references to the blocks in sorted order:

  [10,20,30,40] [50,60,70,--]

Each block is allowed to have some empty space, which we can use for insertions.

If we inserted 55 above, we’d get:

  [10,20,30,40] [50,55,60,70]

So we only need to move up to a block's worth of elements to insert into a block.
What if the block we want to insert into is full? We can split a block into two
halves. Say we want to insert 15 above:

  [10,20,30,40] [50,55,60,70]
  [10,20,--,--] [30,40,--,--] [50,55,60,70]
  [10,15,20,--] [30,40,--,--] [50,55,60,70]

When we split a full block, it becomes half-empty. In order not to have too many
blocks, we don’t permit blocks to be any emptier than that, which means at most
50% of block space is wasted (but see below).
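
Here’s a sketch of those two block operations in C (the block size and names
are mine, just for illustration): inserting into a block that has room, and
splitting a full block into two half-full ones.

  #define BLOCK_CAP 4

  typedef struct {
      int n;                   /* number of keys currently in the block */
      int keys[BLOCK_CAP];     /* kept sorted; keys[n..] is empty space */
  } Block;

  /* Insert into a block that still has room: shifts at most one block's
     worth of keys. */
  static void block_insert(Block *b, int key) {
      int i = b->n;
      while (i > 0 && b->keys[i - 1] > key) {
          b->keys[i] = b->keys[i - 1];
          i--;
      }
      b->keys[i] = key;
      b->n++;
  }

  /* Split a full block: move its upper half into `right`, leaving both
     halves half-full. */
  static void block_split(Block *left, Block *right) {
      int half = left->n / 2;
      right->n = left->n - half;
      for (int i = 0; i < right->n; i++)
          right->keys[i] = left->keys[half + i];
      left->n = half;
  }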

(Deletion is mostly like insertion: If, after deleting an element, a block is
less than half-full, we either merge it with a neighbor, if the two together
fit in one block, or move an element over from a neighbor. If it has no
neighbors, we just leave it as it is.)
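
The rebalancing decision might look like this, reusing the Block type from the
sketch above (illustrative; it assumes `neighbor` holds the keys immediately
after `b`’s):

  /* After a delete has left `b` less than half-full, either borrow a key
     from the neighbor or merge the two blocks into one. */
  static void block_rebalance(Block *b, Block *neighbor) {
      if (b->n >= BLOCK_CAP / 2)
          return;                          /* still at least half-full */
      if (neighbor->n > BLOCK_CAP / 2) {
          /* Neighbor has a key to spare: move its smallest one over. */
          block_insert(b, neighbor->keys[0]);
          for (int i = 1; i < neighbor->n; i++)
              neighbor->keys[i - 1] = neighbor->keys[i];
          neighbor->n--;
      } else {
          /* Both blocks fit in one: merge into `b`; the caller then drops
             the now-empty neighbor. */
          for (int i = 0; i < neighbor->n; i++)
              b->keys[b->n++] = neighbor->keys[i];
          neighbor->n = 0;
      }
  }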

The next question is: How do we store the sorted list of blocks, of which
there might be many? We can just use the same structure: Blocks indexing blocks
indexing values (until the top-level list of blocks fits in a single block).
This is just a tree structure. The bottom layer of blocks, the one holding the
actual data, stays as before; the layers above it store pointers to the layer
below.

To insert into a full leaf node, you split it into two, and insert the new node
into its parent; if its parent is full, you split it into two, and so on up to
the root. If the root is full, you split it into two, and create a new root
pointing to the old root and the new node (which is how the tree grows).

(A note on the root: When the root is a leaf, it can have as few as zero items,
since you might just not have data. If it’s an internal node, it can have as few
as two items; if it ever only has one child, you can replace it with its own
child, which is how the tree shrinks.)
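
Sketching that structure and the root’s two special cases in C (node layout and
names are mine; keys are left out, since the text only gets to them below):

  #include <stdlib.h>

  #define MAX_CHILDREN 4

  typedef struct TNode {
      int is_leaf;
      int n;                               /* children, or values in a leaf */
      struct TNode *children[MAX_CHILDREN];
  } TNode;

  /* The old root split: the new root has exactly two children. */
  static TNode *grow_root(TNode *old_root, TNode *new_sibling) {
      TNode *root = calloc(1, sizeof *root);
      root->is_leaf = 0;
      root->n = 2;
      root->children[0] = old_root;
      root->children[1] = new_sibling;
      return root;
  }

  /* An internal root is down to one child: replace it with that child. */
  static TNode *shrink_root(TNode *root) {
      if (!root->is_leaf && root->n == 1) {
          TNode *child = root->children[0];
          free(root);
          return child;                    /* the tree loses one level */
      }
      return root;
  }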

The above is skipping over an important detail: We don’t just want blocks to be
ordered, we want to be able to look our things up by key.
So for each block we want to know the range of keys it can contain. If a block
contains keys, that’s easy — the range is [first,last]. If a block contains
blocks, we could store the full range of each subblock —

    (l1,r1)   (l2,r2)   (l3,r3)   (l4,r4)
  [   b1,       b2,       b3,       b4   ]

— but we really just care about finding the right block, so l1 and r4 are
irrelevant, and each pair r1/l2, r2/l3, r3/l4 is redundant — a single key
anywhere in between works, since it only tells us which side to descend into.
So we just need to store n-1 keys:

         l2       l3     l4
  [  b1,     b2,     b3,     b4   ]
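
With the separator keys in place, lookup just walks down the tree, picking a
child by key at each level. Here’s a self-contained sketch in C (the layout and
names are mine): a leaf’s keys are the data itself, and an internal node’s
first n-1 keys separate its n children.

  #define CAP 4

  typedef struct BNode {
      int is_leaf;
      int n;                        /* values in a leaf, children otherwise   */
      int keys[CAP];                /* leaf: the values; else: n-1 separators */
      struct BNode *children[CAP];  /* internal nodes only                    */
  } BNode;

  /* At each internal node, descend into the first child whose separator is
     greater than `key`; at the leaf, scan for the key itself. */
  static int btree_contains(const BNode *node, int key) {
      while (!node->is_leaf) {
          int i = 0;
          while (i < node->n - 1 && key >= node->keys[i])
              i++;
          node = node->children[i];
      }
      for (int i = 0; i < node->n; i++)
          if (node->keys[i] == key)
              return 1;
      return 0;
  }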

This got linked more widely than I expected, so I’ll add a few notes:

  • The main distinction between a B-tree and a B+-tree is that in B+ trees, data
    is stored only in leaves (the last layer), and the keys in the other layers
    are redundant copies. In classic B-trees, the internal layers have their own
    keys and values. There are many small variations you can make in this family
    of data structures, but the main idea is the same.

  • B-trees are particularly suited to on-disk storage, which is what they were
    originally designed for, but they’re also good for in-memory storage.
    Nowadays there’s rarely a reason to use binary trees, whose nodes don’t even
    fill a cache line.

  • Deletion as described above is deceptively complicated — by far the trickiest
    part to implement correctly. One way to simplify it is “relaxed deletion”:
    just let nodes get underfull, and only delete them when they’re completely
    empty (a small sketch follows after these notes). The theoretical guarantees
    of this aren’t as good, but in practice it works well (sometimes even better
    than “proper” deletion), and the total tree size is still bounded by the
    total number of inserts.

  • One of the cleanest (and fastest!) implementations I’ve seen is
    this one by Per Vognsen in under 200 lines of C++ (using relaxed
    deletion as described above). There’s also a longer version with
    iterators.

  • An interesting observation is that this is one of the most efficient data
    structures for ordered key-value maps; but with up to a block’s worth of
    items (say, 32), a B-tree is just a sorted array, probably with linear
    search! For only a small number of items, an array is optimal in practice.
    For a medium number of items, a two-level tree (just an array of blocks) is
    simpler than a full B-tree and works well. A regular B-tree naturally turns
    into one of these at low sizes, of course.

  • I wrote a C implementation of uint64_t→uint64_t B+ trees that was faster than
    anything I compared it to on just about all the workloads I tried. It’s not
    in a state to be published right now, but I’m planning to clean it up at some
    point; let me know if you’re interested.
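
Here is what the relaxed deletion mentioned above might look like at the leaf
level, reusing the Block type from the earlier sketch (illustrative only): just
remove the key, and report whether the block is now empty so the caller can
unlink it from its parent.

  /* Relaxed deletion: no merging or borrowing; only completely empty blocks
     get removed from their parent. Returns 1 if the block is now empty. */
  static int block_delete_relaxed(Block *b, int key) {
      int i = 0;
      while (i < b->n && b->keys[i] != key)
          i++;
      if (i == b->n)
          return 0;                        /* key wasn't there */
      for (; i + 1 < b->n; i++)
          b->keys[i] = b->keys[i + 1];
      b->n--;
      return b->n == 0;
  }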
