Exploring the design space of binary search trees

Rudi Theunissen
Last updated: 31 Aug 2023
Repository / Benchmarks / Animations / Discussion

The study of binary search trees has been a fundamental topic in computer science going back to the 1950s. Anyone who has studied computer science or prepared for a technical programming interview is familiar with them, and many leading courses on algorithms and data structures teach them, including Princeton,[1] Stanford,[2] and UC Berkeley.[3] These courses usually teach the same selection of algorithms in a similar way, though never exactly the same. The motivation for this project, in part, was to innovate the way we teach fundamental algorithms and data structures, specifically binary search trees. Speaking from experience, some students might not realize how interesting this space can be because they are too distracted by seemingly arbitrary rules and acronyms.

Over the years, some binary search tree algorithms have come along that are both more intuitive and more efficient than the ones we usually learn about, but good implementations are often difficult to find and therefore difficult to teach. This project attempts to explore as many of these algorithms as possible to produce a reference that is easy to teach and learn from. Being mostly light on theory, we focus instead on experimentation and measurement. A few of the algorithms appear to be implemented for the first time, and some original ideas are also considered.

Part 1 briefly covers foundational ideas like memory allocation, linear lists, position, recursion, traversal, balance, persistence, and concurrency. This introduction explores a pedagogical approach to binary search trees as a natural extension of linked lists. Since linked lists are usually covered before binary search trees, the idea to implement the same interface as before but now with a different structure is a nice segue, cohesive and consistent.

Part 2 covers definitions of balance and explores various algorithms to globally balance a binary search tree. Topics to cover include tree rotations, partitioning, complexity, and the binary logarithm. This part is an opportunity to become more familiar with the binary tree structure and its measurements.

Part 3 explores self-balancing trees, which is where most of the research and discussion usually happens. There are many strategies to consider, including rank-balanced trees, weight-balanced trees, randomized trees, finger trees, and self-adjusting trees. Some weak and relaxed variants are also implemented and evaluated. There are multiple benchmarks and measurements to determine which strategies are better than others in different situations, and whether there might be a case to prefer other algorithms to the ones typically found in textbooks and lectures.

Part 1Introduction

Abstract memory

A computer program uses memory to keep information in mind while working on a problem, similar to the human concept of working memory. When more memory is needed, a program can ask an allocator to reserve some, and in return receive a unique address to access that memory. All the memory that was allocated by a program is usually freed by the time the program exits, but during its lifetime there can be many requests to allocate, read, write, and free memory.

These memory interactions all have an energy cost because work must be done to track allocated memory, resolve addresses, and write to memory. A program is considered efficient when it meets its functional requirements using only a small amount of energy relative to the complexity of the problem it solves. Some problems can not avoid frequent interaction with memory, but an inefficient algorithm wastes energy by interacting with memory more often than it needs to.

This field of research relates to the organization of information in memory, focusing on the organization of the information rather than the information itself. The goal is to design structures in memory to organize information in useful ways, such as to be ordered, searchable, or countable. The information and the structures that organize it usually exist together in memory, and programs can use these structures to organize information as needed, rather than interacting with memory directly.

There are structures that organize information all around us, like filing cabinets and books and vinyl records. Consider that an empty notebook contains no information, but the structure is there for information to exist. A step-by-step cooking recipe might be written as a numbered list to suggest to the reader to follow the steps in order, one at a time. The paper on which the recipe is written is the memory, the recipe is the information, the numbered list is the structure.

Using computer memory as a design material might seem intangible, but every bit of information exists physically on hardware no differently. The computer, however, does not know anything about semantics and relationships — there is only one large pool of memory to interact with, and our task is to provide structures as a layer of abstraction between the program and the memory.

Linear lists

This project explores specifically the implementation of a linear list, which is a one-dimensional ordered sequence of a known length, indexed by position. Lists are fundamental and are often introduced as the first abstract data type when learning about data structures. So many things are lists... search results, queues, events or transactions occurring over time, even the text of this article is an ordered sequence of characters arranged in a specific order — the same characters in different positions would be gibberish, so the position of each character in the sequence is important.

For a data structure to qualify as a list, it must support the operations that describe the list data type. A program using a given list implementation might not know exactly how the structure works internally, only that it behaves like a list.

The following operations describe a reasonable list data type:

  • Select produces the value at a given position.
  • Update replaces the value at a given position.
  • Insert introduces a new value at a given position.
  • Delete removes the value at a given position.
  • Join combines two lists into a new list.
  • Split separates a list at a given position.

Arrays

Perhaps the most common implementation of a list in practice is the array, which is a block of adjacent fixed-size memory cells, allocated all at once, and indexed directly as an offset from the front of the block. The memory allocated by an array is the number of cells multiplied by the information capacity of each cell. The array is the structure and the values stored in each cell are the information.

01234567ABCDEFG

Figure 1: An array with a size of 7 and a capacity of 8.

To illustrate this, we can use a sheet of grid paper to represent an allocated array where every square is a cell in which a single character could be stored. The sheet of grid paper models the array and the grid squares that have characters written in them form the information as a sequence. The sequence of characters recorded on the sheet of grid paper has semantic meaning to the person or program asking us to record them, but the structure itself is not aware of that meaning.

Data structures usually offer theoretical bounds for every operation to indicate how long an operation might take or how much memory it might allocate, and is usually proportional to some measurable property of the structure, such as size. Some data structures are particularly efficient for certain operations, but less so for others.

Arrays are great for accessing a value by position because there are no empty spaces between values in the array, which allows the memory address to be calculated as an offset from the address of the array itself. This is called constant time, because the cost of the calculation is the same regardless of the size of the array. Inserting a value at the end of the array is very efficient also, simply write into the next empty cell.

Consider however the case where a value is inserted somewhere within the array, or in the worst-case at the beginning of the array. Since there can be no empty cells between values, space must be created by moving every value from that position one step towards the back of the array. The cost of this movement of memory is linear in the size of the array, because the upper-bound on the amount of memory to move is exactly the size of the array. Deletion is similarly costly because it leaves an empty cell, which requires space to be compacted in exactly the opposite way.

What makes an array dynamic is the ability to increase its capacity by reallocating the entire sequence into a larger block of memory, thereby copying every value from the original array to the new block. Doing so is linear in the size of the array but occurs infrequently enough to amortize the cost, allowing many low-cost operations to make up for infrequent expensive ones. Common ways to avoid this cost are to predict a good upper-bound for the initial capacity to avoid reallocation entirely, or to always double the previous capacity to effectively amortize the cost of reallocation.

Linked lists

Found in every textbook on algorithms and data structures is the linked list, which consists of linked nodes that each contain one value and a pointer to the next node in the sequence. The structure provides a pointer to the first node in the sequence, thereby allowing sequential access by following one link at a time from there. The last node of the list points to an empty node, illustrated as .

To insert a new value at a given position within the list, start at the first node and follow links one node at a time until the number of links followed equals the search position. At this point we can allocate a node for the new value, point the previous node to the new node, and point the new node to the current node.

ABCDEFG

Figure 2: A linked list of size 7, illustrating the empty node at end as .

Binary search trees

Instead of starting at the front of the list, what if we started the search at a node in the middle? Nodes to the right of this median point to the next node in the sequence, as they did before, but nodes to the left of the median now point to the previous node towards the start of the sequence. The node in the middle should always know its position in the sequence since all operations pass through it.

Applying the same strategy recursively to the left and right sides of the median node produces a structure known as a binary search tree. Every node now has two links, one to its left subtree and another to its right subtree. A node might link to an empty subtree, known as leaf node, but the pointers themselves are still there because they are part of the allocated node structure. There is an empty subtree at the start of the sequence, between every node, and at the end of the sequence.

ABCDEFGABCDEFGABCDEFG

Figure 3: An illustration of the transformation from a linked list to a binary search tree. The median node now has two pointers: one going left and another going right. The initial median node, in this case D, is at the top of the tree and is commonly referred to as the root node. The horizontal coordinates of the nodes stay the same as the nodes line up vertically.

Figure 4: A complete binary tree, not showing leaf nodes. This tree seems large but has only 127 nodes — what would a tree of 1,000,000 nodes and ∼20 levels look like? Notice the repeating patterns and that the number of nodes per horizontal level is twice the number of nodes of the level above.

The fundamental rule that defines a binary search tree is that the nodes of the left subtree occur sequentially before the root node, and the nodes of the right subtree occur sequentially after the root node. This creates a total linear ordering of all values, and the same sequence as the original linked list can be produced from the tree structure. We do this with an inorder tree traversal, which visits every node in sequential order.

inorder (p *Node, visit func(*Node)) {
    if p == nil {
        return
    }
    inorder(p.l, visit)
    visit(p)
    inorder(p.r, visit)
}

(tree *Tree) inorder (visit func(*Node)) {
    inorder(tree.root, visit)
}

Starting at the root, perform an inorder traversal of the left subtree, then visit the root, then perform an inorder traversal of the right subtree. The first node visited by this algorithm is therefore the left-most node of the tree, and the last node visited is the right-most node. As an exercise, try to apply this algorithm manually to the tree at the bottom of Figure 3 starting at the root node D.

Leaf insertion

The leaves of a binary search tree are where new nodes are attached to the structure. To insert a new value at a given position, start by searching for the leaf corresponding to that position, then replace it with a new node allocated for the value, thereby also allocating two new leaves. This is called leaf insertion and forms the basis of many other insertion algorithms.

Figure 5: The tree on the left shows the search path to insert a new node into the tree, starting from the root node down to a leaf node indicated by . The tree on the right shows the resulting tree with a new node in place of the previous leaf node. Keep in mind that the new node would have two leaf nodes.

Relative position

We now have a binary tree data structure with nodes pointing to other nodes, but it is not yet clear how to search for a node by position. Searching through a linked list is simple because we traverse and count exactly one node at a time. The information obtained by counting is the position of every node along the way, where the relative position of every node is exactly 1.

When we follow a branch in a binary search tree, we effectively skip all the nodes of the other branch. From the root of the tree, the first branch skips about half the nodes, then the second branch skips about half again, then half again until we find the node at the position we are searching for.

This is known as binary search, which is a well-known algorithm independent of linked nodes and tree structures. For example, ask someone to think of a number between 1 and some large number n. How many guesses would we need to guess it correctly, if after every guess were are hinted to go lower or higher?

The answer is no more than log2(n)\log_2(n) by using binary search. Start in the middle, then to the middle of the left half if the guess was too high, or to the middle of the right half if the guess was too low. Repeating this step until we guess correctly closes in on the number by halving the search space every time until there is only one choice left.

To implement binary search in a binary tree, we need to determine whether to branch to the left or to the right; do we skip the left half or the right half? Branching to the left seeks towards the start of the sequence by skipping all the nodes of the right subtree, and branching to the right seeks towards the end of the sequence by skipping all the nodes of the left subtree.

Since the pointers only reference in one direction away from the root, there is no way back up the search path once a branch is taken. To determine which branch to take requires that we know the position of the current node relative to its tree, which is exactly the number of nodes in the left subtree, or the number of nodes that would be skipped by branching to the right.

Branching to the exact median at every step is not necessary for binary search to be effective. Even a somewhat average approximation of each median might only increase the length of the search path by a few steps. The subtrees might also not divide neatly down the middle, which is only possible when the size of the tree is exactly 2n12^n-1. We therefore can not assume that the current node along the search path is the exact median of its subtree. The only way to know the number of nodes in the left subtree is to either count them, or store that information as part of the node structure.

Perhaps the most common approach to solve this problem is to store the size of every node as part of its structure, which is equal to the number of nodes reachable from the node, including itself. Adding a new node to the tree requires then that we increment the size field of every node along the path from the root to the new node. To determine the size of the left subtree, and therefore the relative position of a node, we can follow its left link to read the size field of that node, wherein lies a weakness: we always follow the left link even when the search ends up branching to the right. Ideally, to avoid unnecessary interaction with memory, the only nodes to read from memory are the nodes along the search path. This strategy benefits from being symmetric, and being able to derive the size of a tree from the size of its root.

Another idea is to store in each node its position relative to its parent. This approach results in a left node having a negative position equal to one less than the negative size of its right subtree, and a right node having a positive position one more than the size of its left subtree. Assuming that the root node always has a positive position, the absolute position of any node is then equal to the sum of all relative positions along its path from the root. This strategy is symmetrical, intuitive, and provides one bonus insight: a node is known to be a left or right descendant based on the sign of its position. However, there are two downsides to this approach: (i) we require one bit to store the sign of the position, thereby halving the utilization of the size field, and (ii) the resulting algorithms require in some cases additional checks and arguments to indicate and normalize node orientation. Insertion using this strategy would increment the size field when a search path branches left at a right node, and symmetrically decrement the size when branching right at a left node. For example, inserting a node at the start of the sequence would increment the size of the root because the size of its left subtree is increasing, then descend along the left-most path without incrementing the size of any other nodes since they all store the size of their respective right subtrees. Similarly, inserting a node at the end of the sequence would not increment the size of any node at all because all the nodes on the right-most path, including the root, store the size of their left subtrees unaffected by an insertion to the right. In 2004, Schmücker[4] proposed a list implementation using this approach to the Apache Commons Collections, which is still part of the source at the time of this writing.

Alternatively, we could store specifically the size of the left subtree, as suggested by Knuth[5] and Crane[6]. Keeping a separate size field in the outer tree structure as well as the reference to the root node then allows us to track the current subtree size at each step along the search path. The size of the right subtree is then one less than the size of the current subtree minus the size of its left subtree. This approach allows us to know both the left and right subtree sizes without the need to access either of them. Intuitively, this pulls the subtree size information one level up in the tree. Insertion then only increments the size field of a node when branching to the left, because inserting to the right would not affect the size of the left subtree. This approach therefore reduces memory interaction in exchange for a slightly more complex tree structure and some asymmetry.

This project uses the approach where every node stores the size of its left subtree.

Node {
    x Data
    s Size
    l *Node
    r *Node
}

Tree {
    root *Node
    size Size
}

These are the node and tree structures used by the algorithms in this project. A node consists of a data field, a size field, and two pointers to other or empty nodes. A tree consists of a pointer to the root node, if any, and the current size of the tree.

search (p *Node, i Position) *Node {
   for {
      if i == p.s {
         return p
      }
      if i < p.s {
         p = p.l
      } else {
         i = i - p.s - 1
         p = p.r
      }
   }
}

When the position we are searching for is equal to the number of nodes in the left subtree, we have found the node we were looking for. Skipping all the nodes of the left subtree would skip ahead to exactly this position in the sequence. Otherwise, when the position is less than the size of the left subtree, we need to seek towards the start of the sequence because our current position is still too far ahead. In this case, follow the link to the left and continue the search. Otherwise, if the position is greater than the size of the left subtree, we know that we can reject the entire left subtree because even if we skipped all those nodes we would still need to seek further ahead in the sequence, and therefore the node we are looking for must be somewhere in the right subtree. In this case, reduce the search position by the size of the left subtree, including the current node, then descend to the right to continue the search.

Path length

The path length of a node is the number of links to follow to reach it from the root of the tree, or one less than the number of nodes along its path from the root. The total path length is the sum of the path length from the root to every other node, and the average path length is the total path length divided by the number of nodes. Starting from halfway along some path and turning either left or right effectively halves the maximum path length of that path. During the transformation of a linked list to a binary search tree, as shown in Figure 3, whenever a new branch is created to the middle of a sublist it halves the maximum path length of that sublist. What is the resulting maximum path length of the tree? This question is effectively asking how many times the size of the tree can be divided by 2 before reaching 1.

This is known as the binary logarithm, written as log2\log_{2}. The binary log is frequently encountered in computer science because it relates so closely to binary numbers. For example, the number of bits required to write an integer n in binary is log2(n)+1\left\lfloor\log_{2}(n)\right\rfloor+1. The same reasoning applies to decimal numbers, where the number of digits required to write an integer n in decimal is log10(n)+1\left\lfloor \log_{10}(n) \right\rfloor + 1, or the number of times we can divide by 10 until we get to 1. A binary number therefore shortens by 1 bit when we divide by 2 and a decimal number shortens by 1 digit when we divide by 10.

Deletion

The general strategy to delete a node from a binary search tree is to search for the node to be deleted, then to replace it by the join of its subtrees. A valid join function combines the left and right subtrees so that all the nodes in the left subtree still occur before all the nodes in the right subtree, while sharing a common root node. The problem to solve is to determine from which subtree the root of the join should come from, and exactly how to produce it.

In 1961, Hibbard[7] proposed a simple algorithm to join two subtrees: when the right subtree is empty, the result of the join is the left subtree as-is, otherwise when the right subtree is not empty, delete its left-most node and use that node as the joining root. This algorithm therefore always produces the joining root from the right subtree if possible. Hibbard’s paper was remarkable in that it contained one of the first formal theorems about algorithms and still features frequently in lectures and textbooks.

ZZZZZZZZABCDEFGABCDEFGABCEFGABCEFG

Figure 6: The root node D is being deleted. This requires a join of the subtrees rooted at B and F to replace D. There are two choices for the joining root: the tree in the bottom left shows the result of using the right-most node of the left subtree, therefore C, and the tree in the bottom right shows the result of using the left-most node of the right subtree, therefore E. Notice that the subtree ZZ of the joining root is never accessed and replaces the joining root at the bottom of the path.

Hibbard's algorithm is not symmetric — it never considers whether the left subtree is empty, only the right. Knuth[5] noticed this asymmetry and suggested an adjustment to symmetrically use the right subtree as-is when the left subtree is empty. This algorithm improves the symmetry but is still biased to one side because the joining root is always taken from the same side when neither subtree is empty.

In 1983, Eppinger[8] suggested a symmetric variant of Hibbard's algorithm that randomly prefers the left or the right with equal probability. Their experiments involve random insertions and deletions and measure the average path length over time. They found that the average path length actually improves when using their symmetric approach, eventually stabilizing to be better than that of random trees.

In 1989, Culberson and Munro[9][10] published an extensive study into the long-term behavior of random insertions and deletions. They presented evidence that the use of the Hibbard or Knuth algorithms, when coupled with random insertions, causes the average path length to increase after many iterations, then stabilize. They also mention the symmetric approach by Eppinger and show that the same idea when applied to Knuth's algorithm yields even better results.

Popular textbooks are inconsistent in their approach to deletion in binary search trees. For example, in the well-known Introduction to Algorithms, Cormen, Leiserson, Rivest, and Stein[11] teach Knuth's algorithm, as well as Drozdek[12] who attributes the algorithm to both Hibbard and Knuth. Sedgewick and Wayne[13] teach Hibbard's algorithm, but their code appears to implement Knuth's algorithm. Goodrich, Tamassia and Goldwasser[14] teach Hibbard's algorithm but flips it to always prefer the left subtree instead. Skiena,[15] Weiss,[16] and Manber[17] all teach Hibbard's algorithm.

So far, none of the algorithms consider subtree size to determine from which side to produce the joining root. Given that subtree size must be known to search by position, we can make use of this information to determine whether to produce the root from the left or the right subtree.

Consider the node to be deleted and its position relative to its median — a node to the left of its median has a larger right subtree, and a node to the right of its median has a larger left subtree. The difference between these subtree sizes can be thought of as the distance to the ideal branch. This distance can be reduced slightly by always producing the root from the larger subtree.

The inverse of this strategy is to instead always prefer the smaller subtree. While this might seem clearly counter-productive, consider that the path length to the joining root is likely to be shorter in the smaller subtree than the larger subtree. Generally, a lower total traversal distance corresponds to fewer interactions with memory since fewer nodes need to be accessed. In this case, while the average path length of the resulting tree might be longer than other strategies, the join itself might resolve faster by accessing fewer nodes on average.

A related idea is a practical combination of preferring the larger subtree and the randomized symmetry of Eppinger's algorithm. To make the determination, generate a random number no greater than the sum of the two subtree sizes, then prefer the left subtree if that number is less than the size of the left subtree. When the left subtree is smaller, there is a lower probability that the generated number will be less than its size, and therefore a higher probability that the larger subtree will be chosen. This results in a lower average path length than the symmetric algorithm of Eppinger, and both the Hibbard and Knuth algorithms. This is to be expected, because the Eppinger algorithm has a 50% chance of choosing the larger subtree, and the randomized strategy still has some chance of choosing the smaller subtree.

To compare the resulting average path length of each strategy, a random tree of 1,000 nodes was constructed and measured across 1,000,000,000 random insertions and deletions. The results show that always preferring the larger subtree results in the lowest average path length. All the evaluated algorithms appear to stabilize with an average path length that is logarithmic in the size of the tree.

1 2 3 4 5 6 7 8 9 101.001.011.021.031.041.051.061.071.081.091.101.111.121.131.141.151.161.171.181.191.20Average Path Length / log2(Size = 1000)Random insertions and deletions / 108HibbardKnuthRandomizedSymmetricSmallerLarger

Persistence

A persistent data structure can create many independent versions of itself over time. Any version can be modified to produce a new version without affecting existing versions. To achieve this, a program can preserve the current version by copying it before making changes to produce a new version.

Reference counting

To avoid copying all the nodes when copying a tree, we can allow trees to share common subtrees in memory. Some tree far in the future might still reference a node that was allocated for a much earlier version. We need to keep track of this, so we store in every node a reference count to indicate the number of other trees that also reference that node.

When the reference count is zero, it suggests that no other trees share that node, so the tree has ownership and may modify it. Otherwise, when the reference count is greater than zero, it suggests that at least one other tree shares that node in memory. In this case, modifying the node would unintentionally also modify the other trees that reference it, so the node must first be detached from its shared state. This can be done by replacing the node with a copy, setting the reference count of the copy to zero, sharing the left and right subtrees by incrementing their reference counts, and decrementing the reference count of the initial node since it is no longer referenced by this tree.

Path copying

Consider now that a node is being modified in the depths of some tree, but that node is shared so must first be copied. How can the tree know about the copy of the node being modified? How could its root reach it? When a node replaces another node on the search path, there is a parent node that must be modified to now point to the copy instead, so it too must be copied. This cascades all the way up to the root.

Generally, all paths that lead to a modification must be copied. Every node has a unique path from the root, so for any operation we can mark the nodes that need to be modified and trace their paths back to the root; it is these paths that need to be copied so that they all belong to the same new version of the tree that includes the result of the modification.

Binary search trees are especially well-suited for persistence because the amount of information to copy is bounded by the maximum path length, which is usually logarithmic in the size of the tree. When this is the case, for example, a tree of 1,000,000 nodes would only need to copy ∼20 nodes to produce a new, completely independent version.

Figure 8: An illustration of path copying. The tree in the top left highlights a search path to insert a new node, the tree in the center left shows the deletion of a node, and the tree in the bottom left shows an insertion along the left-most path. Notice that the resulting tree in the bottom right is composed from subtrees of previous versions.

Other persistent tree data structures such as B-trees[18][19] and RRB-trees[20][21][22] pack several values into each node and have more outgoing links per node. The motivation is usually (i) to reduce memory allocation by not requiring a new node for every value, (ii) to reduce path length by having more branches, and (iii) to read multiple values at a time from memory during search. In a persistent context, these nodes must be copied no differently, but instead of copying just one value we need to copy all the values in the node. Search paths may be shorter but there is more information to copy.

The number of values to copy can be estimated by mlogb(n)m\log_{b}(n), where m is the number of values per node, b is the number of branches per node, and n is the number of nodes in the tree. This is not an exact science of course, because actual path lengths may vary and some nodes with larger capacities might not be full. This expression therefore only provides an idea of the expected growth of information to copy as size increases.

Binary trees use m=1m = 1 and b=2b = 2, one value, two branches. B-trees in the Rust standard library use m=11m = 11 and b=12b = 12, and RRB-trees use m=32m = 32 and b=5b = 5.

1 10 100 1000 10000 0 20 40 60 80 100 120 140 160 180 200log2(x)11 * log12(x)32 * log5(x)

Parent pointers

Many implementations of binary search trees use a node structure that includes a third link: the parent pointer, which points back up towards the root. Path copying requires that all paths that lead to a modification must be copied, but parent pointers create cycles such that all paths lead to all nodes. Copying every node is not feasible, so parent pointers are not compatible with path copying and therefore not part of our node structure. Avoiding parent pointers also reduces memory interaction because there are fewer pointers to maintain, and no need to account for the pointer in the node structure.

Figure 10: An illustration of a binary tree that uses nodes with parent pointers.

Concurrency

A data structure that supports concurrency can run multiple independent operations at the same time. This is not exactly parallelism, which is where the work of a single operation is divided up to be worked on together. To illustrate this in a binary search tree, as shown in Figure 11, consider one operation that searches for an existing node by position, and another that inserts a new node. These operations may start at the same time, but only one of them can win the race to the root node and lock others out. The other operation must wait for the node to be unlocked to gain access.

The search operation branches to the left, locks that node, and only then unlocks the root node. The insert operation is now clear to access the root and lock it. The second operation may be locked out again if it also branches to the left, but if it branches to the right then the two operations become independent.

OPERATION 1OPERATION 2

Figure 11: Two concurrent operations, one is a search and the other is an insertion. The first operation branches to the left at the root, and the second operation branches to the right at the root. Assuming that both operations are top-down, as soon as they branch in different directions they become independent, allowing them to operate concurrently.

Recursion

Binary tree algorithms often use recursion to implement operations in two phases: a search phase descending from the root, and a maintenance phase ascending back to the root. Another operation would need to wait for the algorithm to go all the way down then come all the way back up to make final adjustments before unlocking the node. For this reason, we explore whenever possible the potential to avoid recursion by combining the search and maintenance phases into a single top-down phase in constant space. Recursive implementations are still valuable because they are often simpler to implement and help to verify top-down variants.

Part 2Restoring Balance

Consider what happens to the tree structure when we repeatedly insert nodes at the start of the sequence. Eventually, the structure becomes unbalanced because too many branches are too far from their median. Balance is a measure of low average path length, and a linked list can be thought of as a tree with the worst possible balance.

To restore balance, we need a way to lower the average path length without changing the sequence of the nodes. Some strategies spend a lot of energy to restore perfect balance, while others spend less energy by being more relaxed. We can evaluate this as a trade-off between the cost of balancing and the benefits of balance.

GFEDCBAABCDEFG

Figure 12: An unbalanced tree on the left, and the same tree but balanced on the right.

Rotations

A tree rotation is a local structure adjustment that rearranged links between nodes without changing their order. The effect of a rotation can be observed by focusing on the root of the rotation and the change in the composition of its subtrees.

There are two symmetrical rotations: a right rotation which moves the root node down to the right and the left node up to the right to take its place, and a left rotation which moves the root node down to the left and the right node up to the left to take its place.

ZZZZZZZZROTATE RIGHTROTATE LEFTABCABC

Figure 13: From the left tree to the right tree is a right rotation: push C down to the right, pull A up into its place, move B across to the right. From the right tree to the left tree is a left rotation: push A down to the left, pull C up into its place, move B across to the left. Notice that C has both A and B in its left subtree on the left, then after the right rotation only has B. Both trees have the same sequence, A B C. Subtrees marked ZZ are not accessed or modified.

rotateR (p *Node) *Node
   l = p.l
   p.l = l.r
   l.r = p
   p.s = p.s - l.s - 1
   return l
}

rotateL (p *Node) *Node
   r = p.r
   p.r = r.l
   r.l = p
   r.s = r.s + p.s + 1
   return r
}

A well-known strategy to restore balance using rotations is the Day-Stout-Warren algorithm, or simply DSW, which was designed by Stout and Warren[23] in 1986 based on work by Day[24] in 1976. This algorithm first transforms a tree into a linked list using right rotations, then transforms that linked list into a perfectly balanced tree using left rotations — all done in linear time and constant space.

Partitioning

In 1980, Stephenson[25] presented an algorithm that always inserts a new node as the root of a tree, commonly referred to as root insertion. This is done by splitting the tree in two: nodes that occur before the leaf position and nodes that occur after the leaf position, then setting those subtrees as the left and right subtrees of the new node.

A related algorithm is partition, which rearranges the tree to make the node at a given position the root while preserving the order of all nodes. To partition a tree, start from the root and follow the standard search algorithm. Along the search path, when branching to the left, the current node wil end up to the right of the partitioned root, and all further nodes along the search path will end up to the left of the current node. This applies symmetrically when branching to the right.

Using this information, the algorithm builds two subtrees top-down, the left partition and the right partition. When the search branches to the left, attach the current node to the left of the left-most node of the right partition, preserving the right subtree of the current node. When the search branches to the right, attach the current node to the right of the right-most node of the left partition, preserving its left subtree.

When the search reaches the node that is to become the root, attach its current left subtree to the right of the right-most node of the left partition, and attach its current right subtree to the left of the left-most node of the right partition. Then, set its left subtree to be the root of the left partition, and set its right subtree to be the root of the right partition. Finally, set this node as the root of the tree.

partition (p *Node, i Position) *Node {
   n = Node{s: i}
   l = &n
   r = &n
   for i ≠ p.s {
      if i < p.s {
         r.l = p
         p.s = p.s - i - 1
           r = r.l
           p = p.l
      } else {
         l.r = p
           i = i - p.s - 1
           l = l.r
           p = p.r
      }
   }
   r.l = p.r
   l.r = p.l
   p.l = n.r
   p.r = n.l
   p.s = n.s
   return p
}
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZABGFCDECABGFDE

Figure 14: The tree at the bottom is the result of partitioning the tree at the top at C. The search path is indicated by arrows, and subtrees marked ZZ are not accessed or modified.

Median balance

Definition

A node is median-balanced if the size of its left and right subtrees differ by no more than 1.

To determine if a node is median-balanced, we can add 1 to the size of its smaller subtree to see if it becomes greater than or equal to the size of its larger subtree. Without loss of generality, assume that xyx \leq y.

Median(x,y)=(x+1)y\text{Median}(x,y)=(x+1)\geq y

A median-balanced tree is perfectly balanced because every branch divides the search space exactly in half. In 2017, Muusse[26] published an algorithm to balance a binary search tree that uses partition to replace every node by its underlying median to produce a median-balanced tree.

Partitioning a tree around its median distributes the nodes evenly between its left and right subtrees, but there might be nodes deeper within those subtrees that are not median-balanced. Repeating the partitioning recursively in the left and right subtrees results in a median-balanced tree.

balance (p *Node, s Size) *Node {
   if p == nil {
      return p
   }
   if !balanced(p) {
      p = partition(p, s / 2)
   }
   p.l = balance(p.l, size(p.l))
   p.r = balance(p.r, size(p.r))
   return p
}
            

This algorithm has several useful properties: (i) it is general for any definition of balance based on subtree size, (ii) it operates top-down in constant space, (iii) it is particularly efficient when the tree is already somewhat-balanced, (iv) subtree balancing is independent so could be done parallel, and (v) the balancing process could be stopped or paused partway through without invalidating the structure.

There are however some node arrangements that are not median-balanced but have the same average path length as if they were median-balanced. Consider the following trees that both form the same sequence of 5 nodes with an average path length of 6/5.

MEDIAN-BALANCEDMEDIAN-BALANCEDABCDEABCDE

Figure 15: Two trees with the same average path length and maximum path length, where one is median-balanced and the other is not. To calculate average path length, first count the total path length then divide by the size of the tree. The tree on the left has path lengths A:1 B:0 C:2 D:1 E:2 for a total of 6, has 5 nodes, and is not median-balanced. The tree on the right has path lengths A:2 B:1 C:0 D:1 E:2 for a total of 6, also has 5 nodes, and is median-balanced.

The first tree is not median-balanced because 5/2 is 2, so the median node is C but the root is currently B. The median-balancing strategy would partition at B to make C the root, but the average path length stays the same. This partitioning step is therefore redundant because it did not improve the balance of the tree.

Height balance

Definition

The height of a tree is the length of the longest path starting from its root.

Definition

A node is height-balanced if the height of its left and right subtrees differ by no more than 1.

We can continue to use the same partition-based balancing algorithm but change the definition of balance to only partition if the node is not already height-balanced — every node that is not height-balanced is partitioned to become median-balanced. This results in a height-balanced tree because every node that is already height-balanced is left as such, and every node that is not height-balanced becomes median-balanced.

The problem to solve is to determine whether a node is already height-balanced. Muusse[26] solves this by using a strict definition of height-balance where both the minimum and maximum path lengths of two subtrees differ by no more than 1. Then, by pretending that both subtrees are already strictly height-balanced, we can compare their minimum and maximum heights.

A binary tree is perfect if every node has either two subtrees or two empty subtrees. The height of a perfect binary tree is log2(s+1)1\log_{2}(s+1)-1, which is intuitive since every level of the tree has twice the number of nodes as the one above it and the number of branches to follow from the root to the bottom of the tree is the number of times that the size can be divided by 2 before reaching 1.

1371531

Figure 16: Perfect binary trees of size 2n12^n-1.

When a tree is strictly height-balanced, all the levels of the tree are full except for maybe the bottom level, where some nodes might have only 1 subtree. The bottom level of the smaller subtree is emptied with floor, and the bottom level of the larger subtree is completed with ceil, giving the minimum and maximum heights of the two subtrees. We can then determine if the node is height-balanced by comparing these heights.

Height(x,y)=log2(max(x,y)+1)log2(min(x,y)+1)1\text{Height}(x,y) = \lceil \log_2(\max(x,y)+1) \rceil - \lfloor \log_2(\min(x,y)+1) \rfloor \leq 1

This function, as published, can be simplified with an identity of the discrete binary logarithm where log2(s+1)log2(s)+1\lceil\log_{2}(s+1)\rceil\equiv\lfloor\log_{2}(s)\rfloor+1. This allows both sides of the inequality to be expressed in terms of the floor of the log. The result is the same whether the last level is completed with ceil or emptied with floor and incremented.

Height(x,y)=log2(x+1)log2(y)\text{Height}(x,y) = \lfloor \log_{2}(x+1) \rfloor \geq \lfloor \log_{2}(y) \rfloor

Weight balance

Definition

The weight of a node is the number of reachable leaves, which is one more than the number of nodes.

The general category of weight-balanced trees require some definition of a bound between the number of leaves in the left and right subtrees. A median-balanced node is perfectly weight-balanced because there is an ideal distribution in the number of nodes between its left and right subtrees. A simple definition of weight-balance is to choose 2 as a constant factor, so that a node is balanced if the weight of the smaller subtree is at least half the weight of the larger subtree.

Weight(x,y)=(x+1)(y+1)/2\text{Weight}(x, y) = (x + 1) \geq (y+1) / 2

A somewhat overlooked variant of weight-balance was introduced by Roura[27] in 2001, which directly compares the binary logarithm of the weights of the left and right subtrees, rather than the weights directly.

Definition

A node is logarithmically weight-balanced if the binary log of the weights of its subtrees differ by no more than 1.

Log(x,y)=log2(y+1)log2(x+1)1=log2(x+1)log2(y+1)1=log2(x+1)log2(y+1)log2(2)=log2(x+1)log2((y+1)/2)\begin{aligned}\text{Log}(x,y) = &\lfloor\log_{2}(y+1)\rfloor - \lfloor\log_{2}(x+1)\rfloor \leq 1\\ = &\lfloor\log_{2}(x+1)\rfloor \geq \lfloor\log_{2}(y+1)\rfloor - 1\\ = &\lfloor\log_{2}(x+1)\rfloor \geq \lfloor\log_{2}(y+1)\rfloor - \lfloor\log_{2}(2)\rfloor\\ = &\lfloor\log_{2}(x+1)\rfloor \geq \lfloor \log_{2}((y+1)/2) \rfloor\end{aligned}

The most-significant-bit of a binary number, or MSB, is the bit-position of the left-most bit set to 1. The MSB gives the number of bits required to represent the integer in binary since all the bits to the left of the MSB will be 0 and do not affect the value. Comparing the MSB is therefore equivalent to comparing the binary log. A node is therefore logarithmically weight-balanced if the MSB of the weight of one subtree is at most one step away from the MSB of the weight of the other subtree.

BALANCEDNOT BALANCED8482

Figure 17: A node with subtree weights 8 and 4 is logarithmically weight-balanced because log2(8)=3\lfloor\log_{2}(8)\rfloor = 3 and log2(4)=2\lfloor\log_{2}(4)\rfloor = 2, which do not differ by more than 1. On the other hand, a node with subtree weights 8 and 2 is not balanced because log2(2)=1\lfloor\log_{2}(2)\rfloor = 1, which is too far away from 3.

Comparing the binary log

Generally, we require a function that determines whether the MSB of one integer is less than the MSB of another. The first candidate is the function used by Roura,[27] the second is by Chan,[28] and the third is from Hacker's Delight by Warren.[29] These functions allow us to compare the binary logarithm without the need to compute it.

log2(x)log2(y)xy || y(x&y)1xy || x(xy)x(¬x&y)\begin{align*}\lfloor\log_{2}(x)\rfloor \geq \lfloor\log_{2}(y)\rfloor &\equiv x \geq y \text{ || } y \leq (x \mathbin{\&} y) \ll 1\\ &\equiv x \geq y \text{ || } x \geq (x \oplus y)\\ &\equiv x \geq (\lnot{x} \mathbin{\&} y)\end{align*}

So far we have four definitions of balance based on the binary logarithm. There appears to be a pattern where a stronger definition of balance is relaxed by the logarithm of the same terms — what a neat relationship.

Median(x,y)=(x+1)yHeight(x,y)=log2(x+1)log2(y)Weight(x,y)=(x+1)(y+1)/2Log(x,y)=log2(x+1)log2((y+1)/2)\begin{align*}\text{Median}(x, y) &&=&& &(x+1) &\geq& && y\\\text{Height}(x,y) &&=&& \lfloor\log_{2} &(x+1)\rfloor &\geq& &&\lfloor\log_{2}(y)\rfloor\\\text{Weight}(x, y) &&=&& &(x+1) &\geq& &&(y+1)/2\\\text{Log}(x, y) &&=&& \lfloor\log_{2}&(x+1)\rfloor &\geq& &&\lfloor\log_{2}((y+1)/2) \rfloor\end{align*}

Using the bitwise function of Warren to compare the binary log and left shifts for division yields the following expanded solutions:

Median(x,y)=(x+1)(y)Height(x,y)=(x+1)(y)&¬(x+1)Weight(x,y)=(x+1)((y+1)1)Log(x,y)=(x+1)((y+1)1)&¬(x+1)\begin{align*}\text{Median}(x, y) &= (x+1) \geq (y)\\\text{Height}(x,y) &= (x+1) \geq (y)\mathbin{\&}\lnot{(x+1)}\\\text{Weight}(x, y) &= (x+1) \geq ((y+1)\gg1)\\\text{Log}(x, y) &= (x+1) \geq ((y+1)\gg1)\mathbin{\&}\lnot{(x+1)}\end{align*}

Cost-optimized balance

There is one other design candidate that does not follow the same pattern. In 2000, Cho and Sahni[30] introduced the idea of cost-optimized search trees, which are trees where the average path length can not be reduced by a rotation anywhere in the tree.

Even though their definition considers rotations, we can apply this concept to partition-based balancing without the need to consider rotations at all. In this context, the only determination to make is whether a node is balanced or not. To determine whether a node is cost-optimized, we need to compare two levels of subtree sizes.

Definition

A node is cost-optimized if the size of each subtree is greater than or equal to the size of both subtrees of the other subtree.

Cost(p)= size(l)size(rl),size(l)size(rr), size(r)size(lr),size(r)size(ll)\begin{align*}\text{Cost(p)} = & \text{ size}(l) \geq \text{size}(rl), \text{size}(l) \geq \text{size}(rr), \\ & \text{ size}(r) \geq \text{size}(lr), \text{size}(r) \geq \text{size}(ll)\end{align*}

Balancer analysis

Measurements were made in size increments of 1,000 up to 10,000,000. Many random trees were grown consistently at each size increment and balanced by each strategy. The results were generated on an Intel i5 13600K with 32GB of DDR5 RAM.

Partition Count

The total partition count is the number of nodes that required partitioning at each size increment. Dividing the total partition count by the size of the tree at each increment indicates the fraction of the tree that required partitioning. A horizontal line indicates that the total partition count is linear in the size of the tree.

1 2 3 4 5 6 7 8 9 100.100.150.200.250.300.350.40Partition Count / SizeSize × 106Median