Exploring the design space of binary search trees

Rudi Theunissen
Last updated: 31 Aug 2023
Repository / Benchmarks / Animations / Discussion

The study of binary search trees has been a fundamental topic in computer science going back to the 1950s. Anyone who has studied computer science or prepared for a technical programming interview is familiar with them, and many leading courses on algorithms and data structures teach them, including Princeton,[1] Stanford,[2] and UC Berkeley.[3] These courses usually teach the same selection of algorithms in a similar way, though never exactly the same. The motivation for this project, in part, was to innovate the way we teach fundamental algorithms and data structures, specifically binary search trees. Speaking from experience, some students might not realize how interesting this space can be because they are too distracted by seemingly arbitrary rules and acronyms.

Over the years, some binary search tree algorithms have come along that are both more intuitive and more efficient than the ones we usually learn about, but good implementations are often difficult to find and therefore difficult to teach. This project attempts to explore as many of these algorithms as possible to produce a reference that is easy to teach and learn from. Being mostly light on theory, we focus instead on experimentation and measurement. A few of the algorithms appear to be implemented for the first time, and some original ideas are also considered.

Part 1 briefly covers foundational ideas like memory allocation, linear lists, position, recursion, traversal, balance, persistence, and concurrency. This introduction explores a pedagogical approach to binary search trees as a natural extension of linked lists. Since linked lists are usually covered before binary search trees, implementing the same interface as before with a different underlying structure makes for a cohesive and consistent segue.

Part 2 covers definitions of balance and explores various algorithms to globally balance a binary search tree. Topics to cover include tree rotations, partitioning, complexity, and the binary logarithm. This part is an opportunity to become more familiar with the binary tree structure and its measurements.

Part 3 explores self-balancing trees, which is where most of the research and discussion usually happens. There are many strategies to consider, including rank-balanced trees, weight-balanced trees, randomized trees, finger trees, and self-adjusting trees. Some weak and relaxed variants are also implemented and evaluated. There are multiple benchmarks and measurements to determine which strategies are better than others in different situations, and whether there might be a case to prefer other algorithms to the ones typically found in textbooks and lectures.

Part 1: Introduction

Abstract memory

A computer program uses memory to keep information in mind while working on a problem, similar to the human concept of working memory. When more memory is needed, a program can ask an allocator to reserve some, and in return receive a unique address to access that memory. All the memory that was allocated by a program is usually freed by the time the program exits, but during its lifetime there can be many requests to allocate, read, write, and free memory.

These memory interactions all have an energy cost because work must be done to track allocated memory, resolve addresses, and write to memory. A program is considered efficient when it meets its functional requirements using only a small amount of energy relative to the complexity of the problem it solves. Some problems can not avoid frequent interaction with memory, but an inefficient algorithm wastes energy by interacting with memory more often than it needs to.

This field of research relates to the organization of information in memory, focusing on the organization of the information rather than the information itself. The goal is to design structures in memory to organize information in useful ways, such as to be ordered, searchable, or countable. The information and the structures that organize it usually exist together in memory, and programs can use these structures to organize information as needed, rather than interacting with memory directly.

There are structures that organize information all around us, like filing cabinets and books and vinyl records. Consider that an empty notebook contains no information, but the structure is there for information to exist. A step-by-step cooking recipe might be written as a numbered list to suggest to the reader to follow the steps in order, one at a time. The paper on which the recipe is written is the memory, the recipe is the information, the numbered list is the structure.

Using computer memory as a design material might seem intangible, but every bit of information exists physically on hardware just the same. The computer, however, does not know anything about semantics and relationships — there is only one large pool of memory to interact with, and our task is to provide structures as a layer of abstraction between the program and the memory.

Linear lists

This project explores specifically the implementation of a linear list, which is a one-dimensional ordered sequence of a known length, indexed by position. Lists are fundamental and are often introduced as the first abstract data type when learning about data structures. So many things are lists... search results, queues, events or transactions occurring over time, even the text of this article is an ordered sequence of characters arranged in a specific order — the same characters in different positions would be gibberish, so the position of each character in the sequence is important.

For a data structure to qualify as a list, it must support the operations that describe the list data type. A program using a given list implementation might not know exactly how the structure works internally, only that it behaves like a list.

The following operations describe a reasonable list data type:

  • Select produces the value at a given position.
  • Update replaces the value at a given position.
  • Insert introduces a new value at a given position.
  • Delete removes the value at a given position.
  • Join combines two lists into a new list.
  • Split separates a list at a given position.
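As a point of reference, these operations can be written down as an interface. The following is only a sketch in Go, with an assumed generic value type V and illustrative method names rather than an interface taken from the project.

type List[V any] interface {
    Select(i int) V                 // produces the value at position i
    Update(i int, v V)              // replaces the value at position i
    Insert(i int, v V)              // introduces v at position i
    Delete(i int)                   // removes the value at position i
    Join(other List[V]) List[V]     // combines two lists into a new list
    Split(i int) (List[V], List[V]) // separates the list at position i
}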

Arrays

Perhaps the most common implementation of a list in practice is the array, which is a block of adjacent fixed-size memory cells, allocated all at once, and indexed directly as an offset from the front of the block. The memory allocated by an array is the number of cells multiplied by the information capacity of each cell. The array is the structure and the values stored in each cell are the information.


Figure 1: An array with a size of 7 and a capacity of 8.

To illustrate this, we can use a sheet of grid paper to represent an allocated array where every square is a cell in which a single character could be stored. The sheet of grid paper models the array and the grid squares that have characters written in them form the information as a sequence. The sequence of characters recorded on the sheet of grid paper has semantic meaning to the person or program asking us to record them, but the structure itself is not aware of that meaning.

Data structures usually offer theoretical bounds for every operation to indicate how long an operation might take or how much memory it might allocate; these bounds are usually proportional to some measurable property of the structure, such as its size. Some data structures are particularly efficient for certain operations, but less so for others.

Arrays are great for accessing a value by position because there are no empty spaces between values in the array, which allows the memory address to be calculated as an offset from the address of the array itself. This is called constant time, because the cost of the calculation is the same regardless of the size of the array. Inserting a value at the end of the array is also very efficient: simply write into the next empty cell.

Consider however the case where a value is inserted somewhere within the array, or in the worst case at the beginning of the array. Since there can be no empty cells between values, space must be created by moving every value from that position one step towards the back of the array. The cost of this movement of memory is linear in the size of the array, because the upper bound on the amount of memory to move is exactly the size of the array. Deletion is similarly costly because it leaves an empty cell, which requires space to be compacted in exactly the opposite way.

What makes an array dynamic is the ability to increase its capacity by reallocating the entire sequence into a larger block of memory, thereby copying every value from the original array to the new block. Doing so is linear in the size of the array but occurs infrequently enough to amortize the cost, allowing many low-cost operations to make up for infrequent expensive ones. Common ways to avoid this cost are to predict a good upper-bound for the initial capacity to avoid reallocation entirely, or to always double the previous capacity to effectively amortize the cost of reallocation.
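To make the reallocation strategy concrete, the following sketch shows one way a dynamic array insert could look in Go. The Array type and its fields are illustrative assumptions, not code from the project.

// Array is a dynamic array: cells is the allocated block of memory,
// and size counts how many cells are currently in use.
type Array[V any] struct {
    cells []V
    size  int
}

// Insert places v at position i, doubling the capacity first if the block is full.
func (a *Array[V]) Insert(i int, v V) {
    if a.size == len(a.cells) {
        capacity := 2 * len(a.cells)
        if capacity == 0 {
            capacity = 1
        }
        grown := make([]V, capacity)
        copy(grown, a.cells[:a.size]) // reallocation copies every value across
        a.cells = grown
    }
    copy(a.cells[i+1:a.size+1], a.cells[i:a.size]) // shift values to make room
    a.cells[i] = v
    a.size++
}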

Linked lists

Found in every textbook on algorithms and data structures is the linked list, which consists of linked nodes that each contain one value and a pointer to the next node in the sequence. The structure provides a pointer to the first node in the sequence, thereby allowing sequential access by following one link at a time from there. The last node of the list points to an empty node, illustrated in Figure 2.

To insert a new value at a given position within the list, start at the first node and follow links one node at a time until the number of links followed equals the search position. At this point we can allocate a node for the new value, point the previous node to the new node, and point the new node to the current node.


Figure 2: A linked list of size 7, showing the empty node at the end.
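A sketch of that insertion in Go, using an illustrative node type rather than the structures used later in this project; pointing to the link itself keeps the relinking the same whether the insertion happens at the front or in the middle.

// ListNode holds one value and a pointer to the next node in the sequence.
type ListNode[V any] struct {
    x    V
    next *ListNode[V]
}

type LinkedList[V any] struct {
    head *ListNode[V]
}

// Insert places v at position i by following one link per position,
// so the cost is linear in the insertion position.
func (list *LinkedList[V]) Insert(i int, v V) {
    p := &list.head
    for ; i > 0; i-- {
        p = &(*p).next // count one node at a time until the position is reached
    }
    *p = &ListNode[V]{x: v, next: *p} // the previous link now points to the new node
}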

Binary search trees

Instead of starting at the front of the list, what if we started the search at a node in the middle? Nodes to the right of this median point to the next node in the sequence, as they did before, but nodes to the left of the median now point to the previous node towards the start of the sequence. The node in the middle should always know its position in the sequence since all operations pass through it.

Applying the same strategy recursively to the left and right sides of the median node produces a structure known as a binary search tree. Every node now has two links, one to its left subtree and another to its right subtree. A node might link to an empty subtree, known as a leaf node, but the pointers themselves are still there because they are part of the allocated node structure. There is an empty subtree at the start of the sequence, between every node, and at the end of the sequence.


Figure 3: An illustration of the transformation from a linked list to a binary search tree. The median node now has two pointers: one going left and another going right. The initial median node, in this case D, is at the top of the tree and is commonly referred to as the root node. The horizontal coordinates of the nodes stay the same as the nodes line up vertically.

Figure 4: A complete binary tree, not showing leaf nodes. This tree seems large but has only 127 nodes — what would a tree of 1,000,000 nodes and ∼20 levels look like? Notice the repeating patterns and that the number of nodes per horizontal level is twice the number of nodes of the level above.

The fundamental rule that defines a binary search tree is that the nodes of the left subtree occur sequentially before the root node, and the nodes of the right subtree occur sequentially after the root node. This creates a total linear ordering of all values, and the same sequence as the original linked list can be produced from the tree structure. We do this with an inorder tree traversal, which visits every node in sequential order.

inorder (p *Node, visit func(*Node)) {
    if p == nil {
        return
    }
    inorder(p.l, visit)
    visit(p)
    inorder(p.r, visit)
}

(tree *Tree) inorder (visit func(*Node)) {
    inorder(tree.root, visit)
}

Starting at the root, perform an inorder traversal of the left subtree, then visit the root, then perform an inorder traversal of the right subtree. The first node visited by this algorithm is therefore the left-most node of the tree, and the last node visited is the right-most node. As an exercise, try to apply this algorithm manually to the tree at the bottom of Figure 3 starting at the root node D.

Leaf insertion

The leaves of a binary search tree are where new nodes are attached to the structure. To insert a new value at a given position, start by searching for the leaf corresponding to that position, then replace it with a new node allocated for the value, thereby also allocating two new leaves. This is called leaf insertion and forms the basis of many other insertion algorithms.

Figure 5: The tree on the left shows the search path to insert a new node into the tree, starting from the root node down to a leaf node. The tree on the right shows the resulting tree with a new node in place of the previous leaf node. Keep in mind that the new node would have two leaf nodes.

Relative position

We now have a binary tree data structure with nodes pointing to other nodes, but it is not yet clear how to search for a node by position. Searching through a linked list is simple because we traverse and count exactly one node at a time. The information obtained by counting is the position of every node along the way, where the relative position of every node is exactly 1.

When we follow a branch in a binary search tree, we effectively skip all the nodes of the other branch. From the root of the tree, the first branch skips about half the nodes, then the second branch skips about half again, then half again until we find the node at the position we are searching for.

This is known as binary search, which is a well-known algorithm independent of linked nodes and tree structures. For example, ask someone to think of a number between 1 and some large number n. How many guesses would we need to guess it correctly, if after every guess we are hinted to go lower or higher?

The answer is no more than \log_2(n) by using binary search. Start in the middle, then to the middle of the left half if the guess was too high, or to the middle of the right half if the guess was too low. Repeating this step until we guess correctly closes in on the number by halving the search space every time until there is only one choice left.
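A sketch of the guessing game in Go, where the hint is modeled as a function that reports whether the secret number is higher than the guess; both the function and its name are assumptions for illustration. The loop halves the remaining range on every guess, so it runs at most about \log_2(n) times.

// guess finds the secret number in 1..n by binary search, assuming that
// higher truthfully reports whether the secret is greater than the guess.
func guess(n int, higher func(int) bool) int {
    lo, hi := 1, n
    for lo < hi {
        mid := lo + (hi-lo)/2
        if higher(mid) {
            lo = mid + 1 // the secret is in the upper half
        } else {
            hi = mid // the secret is mid or in the lower half
        }
    }
    return lo
}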

To implement binary search in a binary tree, we need to determine whether to branch to the left or to the right; do we skip the left half or the right half? Branching to the left seeks towards the start of the sequence by skipping all the nodes of the right subtree, and branching to the right seeks towards the end of the sequence by skipping all the nodes of the left subtree.

Since the pointers only reference in one direction away from the root, there is no way back up the search path once a branch is taken. To determine which branch to take requires that we know the position of the current node relative to its tree, which is exactly the number of nodes in the left subtree, or the number of nodes that would be skipped by branching to the right.

Branching to the exact median at every step is not necessary for binary search to be effective. Even a somewhat average approximation of each median might only increase the length of the search path by a few steps. The subtrees might also not divide neatly down the middle, which is only possible when the size of the tree is exactly 2^n-1. We therefore can not assume that the current node along the search path is the exact median of its subtree. The only way to know the number of nodes in the left subtree is to either count them, or store that information as part of the node structure.

Perhaps the most common approach to solve this problem is to store the size of every node as part of its structure, which is equal to the number of nodes reachable from the node, including itself. Adding a new node to the tree requires then that we increment the size field of every node along the path from the root to the new node. To determine the size of the left subtree, and therefore the relative position of a node, we can follow its left link to read the size field of that node, wherein lies a weakness: we always follow the left link even when the search ends up branching to the right. Ideally, to avoid unnecessary interaction with memory, the only nodes to read from memory are the nodes along the search path. This strategy benefits from being symmetric, and being able to derive the size of a tree from the size of its root.

Another idea is to store in each node its position relative to its parent. This approach results in a left node having a negative position equal to one less than the negative size of its right subtree, and a right node having a positive position one more than the size of its left subtree. Assuming that the root node always has a positive position, the absolute position of any node is then equal to the sum of all relative positions along its path from the root. This strategy is symmetrical, intuitive, and provides one bonus insight: a node is known to be a left or right descendant based on the sign of its position. However, there are two downsides to this approach: (i) we require one bit to store the sign of the position, thereby halving the utilization of the size field, and (ii) the resulting algorithms require in some cases additional checks and arguments to indicate and normalize node orientation. Insertion using this strategy would increment the size field when a search path branches left at a right node, and symmetrically decrement the size when branching right at a left node. For example, inserting a node at the start of the sequence would increment the size of the root because the size of its left subtree is increasing, then descend along the left-most path without incrementing the size of any other nodes since they all store the size of their respective right subtrees. Similarly, inserting a node at the end of the sequence would not increment the size of any node at all because all the nodes on the right-most path, including the root, store the size of their left subtrees unaffected by an insertion to the right. In 2004, Schmücker[4] proposed a list implementation using this approach to the Apache Commons Collections, which is still part of the source at the time of this writing.

Alternatively, we could store specifically the size of the left subtree, as suggested by Knuth[5] and Crane[6]. Keeping a separate size field in the outer tree structure as well as the reference to the root node then allows us to track the current subtree size at each step along the search path. The size of the right subtree is then one less than the size of the current subtree minus the size of its left subtree. This approach allows us to know both the left and right subtree sizes without the need to access either of them. Intuitively, this pulls the subtree size information one level up in the tree. Insertion then only increments the size field of a node when branching to the left, because inserting to the right would not affect the size of the left subtree. This approach therefore reduces memory interaction in exchange for a slightly more complex tree structure and some asymmetry.

This project uses the approach where every node stores the size of its left subtree.

Node {
    x Data
    s Size
    l *Node
    r *Node
}

Tree {
    root *Node
    size Size
}

These are the node and tree structures used by the algorithms in this project. A node consists of a data field, a size field, and two pointers to other or empty nodes. A tree consists of a pointer to the root node, if any, and the current size of the tree.

search (p *Node, i Position) *Node {
   for {
      if i == p.s {
         return p
      }
      if i < p.s {
         p = p.l
      } else {
         i = i - p.s - 1
         p = p.r
      }
   }
}

When the position we are searching for is equal to the number of nodes in the left subtree, we have found the node we were looking for. Skipping all the nodes of the left subtree would skip ahead to exactly this position in the sequence. Otherwise, when the position is less than the size of the left subtree, we need to seek towards the start of the sequence because our current position is still too far ahead. In this case, follow the link to the left and continue the search. Otherwise, if the position is greater than the size of the left subtree, we know that we can reject the entire left subtree because even if we skipped all those nodes we would still need to seek further ahead in the sequence, and therefore the node we are looking for must be somewhere in the right subtree. In this case, reduce the search position by the size of the left subtree, including the current node, then descend to the right to continue the search.
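Combining this search with leaf insertion gives a top-down insert that also maintains the size fields: only the nodes where the search branches to the left have their size incremented. This is a sketch assuming the Node and Tree structures above, not necessarily the project's exact implementation.

// insert places value v at position i (0 <= i <= size) by descending to the
// corresponding leaf and replacing it with a new node.
func (tree *Tree) insert(i Position, v Data) {
    p := &tree.root // the link to rewrite at each step
    for *p != nil {
        q := *p
        if i <= q.s {
            q.s++ // the new node joins q's left subtree
            p = &q.l
        } else {
            i = i - q.s - 1 // skip the left subtree and q itself
            p = &q.r
        }
    }
    *p = &Node{x: v} // replace the leaf; the new node's own subtrees are empty
    tree.size++
}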

Path length

The path length of a node is the number of links to follow to reach it from the root of the tree, or one less than the number of nodes along its path from the root. The total path length is the sum of the path length from the root to every other node, and the average path length is the total path length divided by the number of nodes. Starting from halfway along some path and turning either left or right effectively halves the maximum path length of that path. During the transformation of a linked list to a binary search tree, as shown in Figure 3, whenever a new branch is created to the middle of a sublist it halves the maximum path length of that sublist. What is the resulting maximum path length of the tree? This question is effectively asking how many times the size of the tree can be divided by 2 before reaching 1.

This is known as the binary logarithm, written as \log_{2}. The binary log is frequently encountered in computer science because it relates so closely to binary numbers. For example, the number of bits required to write an integer n in binary is \lfloor\log_{2}(n)\rfloor+1. The same reasoning applies to decimal numbers, where the number of digits required to write an integer n in decimal is \lfloor\log_{10}(n)\rfloor+1, or the number of times we can divide by 10 until we get to 1. A binary number therefore shortens by 1 bit when we divide by 2 and a decimal number shortens by 1 digit when we divide by 10.
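In code, the floor of the binary logarithm is one less than the number of bits needed to write the number in binary, which the Go standard library exposes directly; a small sketch:

import "math/bits"

// floorLog2 returns floor(log2(n)) for n > 0, since bits.Len(n) is the
// number of bits required to represent n, which is floor(log2(n)) + 1.
func floorLog2(n uint) int {
    return bits.Len(n) - 1
}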

Deletion

The general strategy to delete a node from a binary search tree is to search for the node to be deleted, then to replace it by the join of its subtrees. A valid join function combines the left and right subtrees so that all the nodes in the left subtree still occur before all the nodes in the right subtree, while sharing a common root node. The problem to solve is to determine from which subtree the root of the join should come from, and exactly how to produce it.

In 1961, Hibbard[7] proposed a simple algorithm to join two subtrees: when the right subtree is empty, the result of the join is the left subtree as-is, otherwise when the right subtree is not empty, delete its left-most node and use that node as the joining root. This algorithm therefore always produces the joining root from the right subtree if possible. Hibbard’s paper was remarkable in that it contained one of the first formal theorems about algorithms and still features frequently in lectures and textbooks.
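A recursive sketch of deletion by position using Hibbard's join, assuming the node structure above; deleteMin, join, and deleteAt are illustrative helpers, and a tree-level wrapper would also replace the root and decrement the tree's size.

// deleteMin removes the left-most node of a non-empty subtree and returns
// both that node and the remaining subtree.
func deleteMin(p *Node) (min, rest *Node) {
    if p.l == nil {
        return p, p.r
    }
    p.s-- // the left subtree is about to lose one node
    min, p.l = deleteMin(p.l)
    return min, p
}

// join combines subtrees l and r, where ls is the size of l: when r is empty
// the result is l as-is, otherwise the left-most node of r becomes the root.
func join(l, r *Node, ls Size) *Node {
    if r == nil {
        return l
    }
    root, rest := deleteMin(r)
    root.l, root.r, root.s = l, rest, ls
    return root
}

// deleteAt removes the node at position i and returns the new subtree root.
func deleteAt(p *Node, i Position) *Node {
    if i == p.s {
        return join(p.l, p.r, p.s)
    }
    if i < p.s {
        p.s--
        p.l = deleteAt(p.l, i)
    } else {
        p.r = deleteAt(p.r, i-p.s-1)
    }
    return p
}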


Figure 6: The root node D is being deleted. This requires a join of the subtrees rooted at B and F to replace D. There are two choices for the joining root: the tree in the bottom left shows the result of using the right-most node of the left subtree, therefore C, and the tree in the bottom right shows the result of using the left-most node of the right subtree, therefore E. Notice that the subtree ZZ of the joining root is never accessed and replaces the joining root at the bottom of the path.

Hibbard's algorithm is not symmetric — it never considers whether the left subtree is empty, only the right. Knuth[5] noticed this asymmetry and suggested an adjustment to symmetrically use the right subtree as-is when the left subtree is empty. This algorithm improves the symmetry but is still biased to one side because the joining root is always taken from the same side when neither subtree is empty.

In 1983, Eppinger[8] suggested a symmetric variant of Hibbard's algorithm that randomly prefers the left or the right with equal probability. Their experiments involve random insertions and deletions and measure the average path length over time. They found that the average path length actually improves when using their symmetric approach, eventually stabilizing to be better than that of random trees.

In 1989, Culberson and Munro[9][10] published an extensive study into the long-term behavior of random insertions and deletions. They presented evidence that the use of the Hibbard or Knuth algorithms, when coupled with random insertions, causes the average path length to increase after many iterations, then stabilize. They also mention the symmetric approach by Eppinger and show that the same idea when applied to Knuth's algorithm yields even better results.

Popular textbooks are inconsistent in their approach to deletion in binary search trees. For example, in the well-known Introduction to Algorithms, Cormen, Leiserson, Rivest, and Stein[11] teach Knuth's algorithm, as does Drozdek,[12] who attributes the algorithm to both Hibbard and Knuth. Sedgewick and Wayne[13] teach Hibbard's algorithm, but their code appears to implement Knuth's algorithm. Goodrich, Tamassia, and Goldwasser[14] teach Hibbard's algorithm but flip it to always prefer the left subtree instead. Skiena,[15] Weiss,[16] and Manber[17] all teach Hibbard's algorithm.

So far, none of the algorithms consider subtree size to determine from which side to produce the joining root. Given that subtree size must be known to search by position, we can make use of this information to determine whether to produce the root from the left or the right subtree.

Consider the node to be deleted and its position relative to its median — a node to the left of its median has a larger right subtree, and a node to the right of its median has a larger left subtree. The difference between these subtree sizes can be thought of as the distance to the ideal branch. This distance can be reduced slightly by always producing the root from the larger subtree.

The inverse of this strategy is to instead always prefer the smaller subtree. While this might seem clearly counter-productive, consider that the path length to the joining root is likely to be shorter in the smaller subtree than the larger subtree. Generally, a lower total traversal distance corresponds to fewer interactions with memory since fewer nodes need to be accessed. In this case, while the average path length of the resulting tree might be longer than other strategies, the join itself might resolve faster by accessing fewer nodes on average.

A related idea is a practical combination of preferring the larger subtree and the randomized symmetry of Eppinger's algorithm. To make the determination, generate a random number no greater than the sum of the two subtree sizes, then prefer the left subtree if that number is less than the size of the left subtree. When the left subtree is smaller, there is a lower probability that the generated number will be less than its size, and therefore a higher probability that the larger subtree will be chosen. This results in a lower average path length than the symmetric algorithm of Eppinger, and both the Hibbard and Knuth algorithms. This is to be expected, because the Eppinger algorithm has a 50% chance of choosing the larger subtree, and the randomized strategy still has some chance of choosing the smaller subtree.
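The choice itself is small; a sketch using math/rand, where x and y are the sizes of the left and right subtrees and the function name is an assumption:

import "math/rand"

// preferLeft draws a number no greater than x+y and prefers the left subtree
// when that number is less than x, so the larger subtree is chosen more often.
func preferLeft(x, y Size) bool {
    return Size(rand.Intn(int(x+y)+1)) < x
}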

To compare the resulting average path length of each strategy, a random tree of 1,000 nodes was constructed and measured across 1,000,000,000 random insertions and deletions. The results show that always preferring the larger subtree results in the lowest average path length. All the evaluated algorithms appear to stabilize with an average path length that is logarithmic in the size of the tree.

[Chart: Average Path Length / log2(Size = 1000) versus random insertions and deletions (×10^8), comparing the Hibbard, Knuth, Randomized, Symmetric, Smaller, and Larger strategies.]

Persistence

A persistent data structure can create many independent versions of itself over time. Any version can be modified to produce a new version without affecting existing versions. To achieve this, a program can preserve the current version by copying it before making changes to produce a new version.

Reference counting

To avoid copying all the nodes when copying a tree, we can allow trees to share common subtrees in memory. Some tree far in the future might still reference a node that was allocated for a much earlier version. We need to keep track of this, so we store in every node a reference count to indicate the number of other trees that also reference that node.

When the reference count is zero, it suggests that no other trees share that node, so the tree has ownership and may modify it. Otherwise, when the reference count is greater than zero, it suggests that at least one other tree shares that node in memory. In this case, modifying the node would unintentionally also modify the other trees that reference it, so the node must first be detached from its shared state. This can be done by replacing the node with a copy, setting the reference count of the copy to zero, sharing the left and right subtrees by incrementing their reference counts, and decrementing the reference count of the initial node since it is no longer referenced by this tree.
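A sketch of that copy-on-write step, assuming the node structure is extended with a reference count field named refs, which is not part of the structure shown earlier:

// detach returns a node that this tree is allowed to modify: the node itself
// when it is unshared, otherwise a copy that shares the original's subtrees.
func detach(p *Node) *Node {
    if p.refs == 0 {
        return p // not shared: safe to modify in place
    }
    q := &Node{x: p.x, s: p.s, l: p.l, r: p.r} // the copy starts with refs == 0
    if q.l != nil {
        q.l.refs++ // the subtrees are now shared by one more tree
    }
    if q.r != nil {
        q.r.refs++
    }
    p.refs-- // this tree no longer references the original node
    return q
}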

Path copying

Consider now that a node is being modified in the depths of some tree, but that node is shared so must first be copied. How can the tree know about the copy of the node being modified? How could its root reach it? When a node replaces another node on the search path, there is a parent node that must be modified to now point to the copy instead, so it too must be copied. This cascades all the way up to the root.

Generally, all paths that lead to a modification must be copied. Every node has a unique path from the root, so for any operation we can mark the nodes that need to be modified and trace their paths back to the root; it is these paths that need to be copied so that they all belong to the same new version of the tree that includes the result of the modification.

Binary search trees are especially well-suited for persistence because the amount of information to copy is bounded by the maximum path length, which is usually logarithmic in the size of the tree. When this is the case, for example, a tree of 1,000,000 nodes would only need to copy ∼20 nodes to produce a new, completely independent version.

Figure 8: An illustration of path copying. The tree in the top left highlights a search path to insert a new node, the tree in the center left shows the deletion of a node, and the tree in the bottom left shows an insertion along the left-most path. Notice that the resulting tree in the bottom right is composed from subtrees of previous versions.
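Putting reference counting and path copying together, a persistent insert can be sketched recursively: every node along the search path is detached, and therefore copied if shared, before it is modified, so older versions are never disturbed. This assumes the detach helper sketched above and the leaf insertion described earlier.

// insertPersistent returns the root of a new version with v inserted at
// position i; nodes off the search path are shared with previous versions.
func insertPersistent(p *Node, i Position, v Data) *Node {
    if p == nil {
        return &Node{x: v}
    }
    p = detach(p) // copy this node first if it is shared
    if i <= p.s {
        p.s++
        p.l = insertPersistent(p.l, i, v)
    } else {
        p.r = insertPersistent(p.r, i-p.s-1, v)
    }
    return p
}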

Other persistent tree data structures such as B-trees[18][19] and RRB-trees[20][21][22] pack several values into each node and have more outgoing links per node. The motivation is usually (i) to reduce memory allocation by not requiring a new node for every value, (ii) to reduce path length by having more branches, and (iii) to read multiple values at a time from memory during search. In a persistent context, these nodes must be copied no differently, but instead of copying just one value we need to copy all the values in the node. Search paths may be shorter but there is more information to copy.

The number of values to copy can be estimated by m\log_{b}(n), where m is the number of values per node, b is the number of branches per node, and n is the number of nodes in the tree. This is not an exact science of course, because actual path lengths may vary and some nodes with larger capacities might not be full. This expression therefore only provides an idea of the expected growth of information to copy as size increases.

Binary trees use m = 1 and b = 2, one value, two branches. B-trees in the Rust standard library use m = 11 and b = 12, and RRB-trees use m = 32 and b = 5.

[Chart: estimated number of values to copy as size increases, comparing log2(x), 11 * log12(x), and 32 * log5(x).]

Parent pointers

Many implementations of binary search trees use a node structure that includes a third link: the parent pointer, which points back up towards the root. Path copying requires that all paths that lead to a modification must be copied, but parent pointers create cycles such that all paths lead to all nodes. Copying every node is not feasible, so parent pointers are not compatible with path copying and therefore not part of our node structure. Avoiding parent pointers also reduces memory interaction because there are fewer pointers to maintain, and no need to account for the pointer in the node structure.

Figure 10: An illustration of a binary tree that uses nodes with parent pointers.

Concurrency

A data structure that supports concurrency can run multiple independent operations at the same time. This is not exactly parallelism, which is where the work of a single operation is divided up to be worked on together. To illustrate this in a binary search tree, as shown in Figure 11, consider one operation that searches for an existing node by position, and another that inserts a new node. These operations may start at the same time, but only one of them can win the race to the root node and lock others out. The other operation must wait for the node to be unlocked to gain access.

The search operation branches to the left, locks that node, and only then unlocks the root node. The insert operation is now clear to access the root and lock it. The second operation may be locked out again if it also branches to the left, but if it branches to the right then the two operations become independent.


Figure 11: Two concurrent operations, one is a search and the other is an insertion. The first operation branches to the left at the root, and the second operation branches to the right at the root. Assuming that both operations are top-down, as soon as they branch in different directions they become independent, allowing them to operate concurrently.
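One common way to implement this is hand-over-hand locking, where the lock on a child is acquired before the lock on its parent is released. A sketch, assuming each node embeds a mutex named lock and that the position is valid; locking around the root pointer itself is left out for brevity.

// searchLocked finds the node at position i while holding at most two locks
// at a time, so operations that branch apart can proceed concurrently.
func (tree *Tree) searchLocked(i Position) *Node {
    p := tree.root
    p.lock.Lock()
    for i != p.s {
        var next *Node
        if i < p.s {
            next = p.l
        } else {
            i = i - p.s - 1
            next = p.r
        }
        next.lock.Lock() // lock the child before releasing the parent
        p.lock.Unlock()
        p = next
    }
    p.lock.Unlock()
    return p
}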

Recursion

Binary tree algorithms often use recursion to implement operations in two phases: a search phase descending from the root, and a maintenance phase ascending back to the root. Another operation would need to wait for the algorithm to go all the way down then come all the way back up to make final adjustments before unlocking the node. For this reason, we explore whenever possible the potential to avoid recursion by combining the search and maintenance phases into a single top-down phase in constant space. Recursive implementations are still valuable because they are often simpler to implement and help to verify top-down variants.

Part 2: Restoring Balance

Consider what happens to the tree structure when we repeatedly insert nodes at the start of the sequence. Eventually, the structure becomes unbalanced because too many branches are too far from their median. Balance is a measure of low average path length, and a linked list can be thought of as a tree with the worst possible balance.

To restore balance, we need a way to lower the average path length without changing the sequence of the nodes. Some strategies spend a lot of energy to restore perfect balance, while others spend less energy by being more relaxed. We can evaluate this as a trade-off between the cost of balancing and the benefits of balance.


Figure 12: An unbalanced tree on the left, and the same tree but balanced on the right.

Rotations

A tree rotation is a local structural adjustment that rearranges links between nodes without changing their order. The effect of a rotation can be observed by focusing on the root of the rotation and the change in the composition of its subtrees.

There are two symmetrical rotations: a right rotation which moves the root node down to the right and the left node up to the right to take its place, and a left rotation which moves the root node down to the left and the right node up to the left to take its place.


Figure 13: From the left tree to the right tree is a right rotation: push C down to the right, pull A up into its place, move B across to the right. From the right tree to the left tree is a left rotation: push A down to the left, pull C up into its place, move B across to the left. Notice that C has both A and B in its left subtree on the left, then after the right rotation only has B. Both trees have the same sequence, A B C. Subtrees marked ZZ are not accessed or modified.

rotateR (p *Node) *Node {
   l = p.l
   p.l = l.r
   l.r = p
   p.s = p.s - l.s - 1
   return l
}

rotateL (p *Node) *Node {
   r = p.r
   p.r = r.l
   r.l = p
   r.s = r.s + p.s + 1
   return r
}

A well-known strategy to restore balance using rotations is the Day-Stout-Warren algorithm, or simply DSW, which was designed by Stout and Warren[23] in 1986 based on work by Day[24] in 1976. This algorithm first transforms a tree into a linked list using right rotations, then transforms that linked list into a perfectly balanced tree using left rotations — all done in linear time and constant space.
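A sketch of DSW in terms of the rotation functions above, which also keep the size fields correct through every rotation; the helper names and the temporary pseudo-root are assumptions made for this sketch, following the usual presentation of the algorithm.

import "math/bits"

// treeToVine rotates every left child up until the tree is a right-leaning
// linked list (a vine), in linear time and constant space.
func treeToVine(root *Node) *Node {
    pseudo := &Node{r: root}
    tail := pseudo
    for tail.r != nil {
        if tail.r.l != nil {
            tail.r = rotateR(tail.r) // move the left child up
        } else {
            tail = tail.r // no left child: this node is in its final order
        }
    }
    return pseudo.r
}

// compress performs count left rotations along the right spine.
func compress(pseudo *Node, count int) {
    scanner := pseudo
    for i := 0; i < count; i++ {
        scanner.r = rotateL(scanner.r)
        scanner = scanner.r
    }
}

// vineToTree turns a vine of the given size into a perfectly balanced tree.
func vineToTree(root *Node, size int) *Node {
    pseudo := &Node{r: root}
    leaves := size + 1 - 1<<(bits.Len(uint(size+1))-1) // nodes on the lowest level
    compress(pseudo, leaves)
    size -= leaves
    for size > 1 {
        size /= 2
        compress(pseudo, size)
    }
    return pseudo.r
}

func dsw(tree *Tree) {
    tree.root = vineToTree(treeToVine(tree.root), int(tree.size))
}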

Partitioning

In 1980, Stephenson[25] presented an algorithm that always inserts a new node as the root of a tree, commonly referred to as root insertion. This is done by splitting the tree in two: nodes that occur before the leaf position and nodes that occur after the leaf position, then setting those subtrees as the left and right subtrees of the new node.

A related algorithm is partition, which rearranges the tree to make the node at a given position the root while preserving the order of all nodes. To partition a tree, start from the root and follow the standard search algorithm. Along the search path, when branching to the left, the current node will end up to the right of the partitioned root, and all further nodes along the search path will end up to the left of the current node. This applies symmetrically when branching to the right.

Using this information, the algorithm builds two subtrees top-down, the left partition and the right partition. When the search branches to the left, attach the current node to the left of the left-most node of the right partition, preserving the right subtree of the current node. When the search branches to the right, attach the current node to the right of the right-most node of the left partition, preserving its left subtree.

When the search reaches the node that is to become the root, attach its current left subtree to the right of the right-most node of the left partition, and attach its current right subtree to the left of the left-most node of the right partition. Then, set its left subtree to be the root of the left partition, and set its right subtree to be the root of the right partition. Finally, set this node as the root of the tree.

partition (p *Node, i Position) *Node {
   n = Node{s: i}
   l = &n
   r = &n
   for i != p.s {
      if i < p.s {
         r.l = p
         p.s = p.s - i - 1
         r = r.l
         p = p.l
      } else {
         l.r = p
         i = i - p.s - 1
         l = l.r
         p = p.r
      }
   }
   r.l = p.r
   l.r = p.l
   p.l = n.r
   p.r = n.l
   p.s = n.s
   return p
}

Figure 14: The tree at the bottom is the result of partitioning the tree at the top at C. The search path is indicated by arrows, and subtrees marked ZZ are not accessed or modified.

Median balance

Definition

A node is median-balanced if the size of its left and right subtrees differ by no more than 1.

To determine if a node is median-balanced, we can add 1 to the size of its smaller subtree to see if it becomes greater than or equal to the size of its larger subtree. Without loss of generality, assume that x \leq y.

\text{Median}(x,y) = (x+1) \geq y

A median-balanced tree is perfectly balanced because every branch divides the search space exactly in half. In 2017, Muusse[26] published an algorithm to balance a binary search tree that uses partition to replace every node by its underlying median to produce a median-balanced tree.

Partitioning a tree around its median distributes the nodes evenly between its left and right subtrees, but there might be nodes deeper within those subtrees that are not median-balanced. Repeating the partitioning recursively in the left and right subtrees results in a median-balanced tree.

balance (p *Node, s Size) *Node {
   if p == nil {
      return p
   }
   if !balanced(p) {
      p = partition(p, s / 2)
   }
   p.l = balance(p.l, size(p.l))
   p.r = balance(p.r, size(p.r))
   return p
}

This algorithm has several useful properties: (i) it is general for any definition of balance based on subtree size, (ii) it operates top-down in constant space, (iii) it is particularly efficient when the tree is already somewhat-balanced, (iv) subtree balancing is independent, so it could be done in parallel, and (v) the balancing process could be stopped or paused partway through without invalidating the structure.
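For reference, the balanced and size helpers used above might look like the following sketch, here using the median-balance rule. Because each node only stores the size of its left subtree, counting a whole subtree walks its right spine; an implementation could instead pass subtree sizes down, as balance already does with s, to avoid recounting.

// size counts the nodes of a subtree by walking its right spine, adding each
// node and its stored left-subtree size along the way; nil counts as zero.
func size(p *Node) Size {
    s := Size(0)
    for p != nil {
        s += p.s + 1
        p = p.r
    }
    return s
}

// balanced reports whether p satisfies the chosen balance rule; shown here is
// median balance, where the subtree sizes differ by no more than 1.
func balanced(p *Node) bool {
    x, y := p.s, size(p.r)
    if x > y {
        x, y = y, x
    }
    return x+1 >= y
}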

There are however some node arrangements that are not median-balanced but have the same average path length as if they were median-balanced. Consider the following trees that both form the same sequence of 5 nodes with an average path length of 6/5.


Figure 15: Two trees with the same average path length and maximum path length, where one is median-balanced and the other is not. To calculate average path length, first count the total path length then divide by the size of the tree. The tree on the left has path lengths A:1 B:0 C:2 D:1 E:2 for a total of 6, has 5 nodes, and is not median-balanced. The tree on the right has path lengths A:2 B:1 C:0 D:1 E:2 for a total of 6, also has 5 nodes, and is median-balanced.

The first tree is not median-balanced because 5/2 is 2, so the median node is C but the root is currently B. The median-balancing strategy would partition at B to make C the root, but the average path length stays the same. This partitioning step is therefore redundant because it did not improve the balance of the tree.

Height balance

Definition

The height of a tree is the length of the longest path starting from its root.

Definition

A node is height-balanced if the height of its left and right subtrees differ by no more than 1.

We can continue to use the same partition-based balancing algorithm but change the definition of balance to only partition if the node is not already height-balanced — every node that is not height-balanced is partitioned to become median-balanced. This results in a height-balanced tree because every node that is already height-balanced is left as such, and every node that is not height-balanced becomes median-balanced.

The problem to solve is to determine whether a node is already height-balanced. Muusse[26] solves this by using a strict definition of height-balance where both the minimum and maximum path lengths of two subtrees differ by no more than 1. Then, by pretending that both subtrees are already strictly height-balanced, we can compare their minimum and maximum heights.

A binary tree is perfect if every node has either two subtrees or two empty subtrees. The height of a perfect binary tree is \log_{2}(s+1)-1, which is intuitive since every level of the tree has twice the number of nodes as the one above it and the number of branches to follow from the root to the bottom of the tree is the number of times that the size can be divided by 2 before reaching 1.


Figure 16: Perfect binary trees of size 2^n-1.

When a tree is strictly height-balanced, all the levels of the tree are full except for maybe the bottom level, where some nodes might have only 1 subtree. The bottom level of the smaller subtree is emptied with floor, and the bottom level of the larger subtree is completed with ceil, giving the minimum and maximum heights of the two subtrees. We can then determine if the node is height-balanced by comparing these heights.

\text{Height}(x,y) = \lceil \log_2(\max(x,y)+1) \rceil - \lfloor \log_2(\min(x,y)+1) \rfloor \leq 1

This function, as published, can be simplified with an identity of the discrete binary logarithm where \lceil\log_{2}(s+1)\rceil \equiv \lfloor\log_{2}(s)\rfloor + 1. This allows both sides of the inequality to be expressed in terms of the floor of the log. The result is the same whether the last level is completed with ceil or emptied with floor and incremented.

\text{Height}(x,y) = \lfloor \log_{2}(x+1) \rfloor \geq \lfloor \log_{2}(y) \rfloor

Weight balance

Definition

The weight of a node is the number of reachable leaves, which is one more than the number of nodes.

The general category of weight-balanced trees requires some definition of a bound between the number of leaves in the left and right subtrees. A median-balanced node is perfectly weight-balanced because there is an ideal distribution in the number of nodes between its left and right subtrees. A simple definition of weight-balance is to choose 2 as a constant factor, so that a node is balanced if the weight of the smaller subtree is at least half the weight of the larger subtree.

\text{Weight}(x, y) = (x + 1) \geq (y+1)/2

A somewhat overlooked variant of weight-balance was introduced by Roura[27] in 2001, which directly compares the binary logarithm of the weights of the left and right subtrees, rather than the weights directly.

Definition

A node is logarithmically weight-balanced if the binary log of the weights of its subtrees differ by no more than 1.

\begin{aligned}
\text{Log}(x,y) &= \lfloor\log_{2}(y+1)\rfloor - \lfloor\log_{2}(x+1)\rfloor \leq 1\\
&= \lfloor\log_{2}(x+1)\rfloor \geq \lfloor\log_{2}(y+1)\rfloor - 1\\
&= \lfloor\log_{2}(x+1)\rfloor \geq \lfloor\log_{2}(y+1)\rfloor - \lfloor\log_{2}(2)\rfloor\\
&= \lfloor\log_{2}(x+1)\rfloor \geq \lfloor \log_{2}((y+1)/2) \rfloor
\end{aligned}

The most-significant-bit of a binary number, or MSB, is the bit-position of the left-most bit set to 1. The MSB gives the number of bits required to represent the integer in binary since all the bits to the left of the MSB will be 0 and do not affect the value. Comparing the MSB is therefore equivalent to comparing the binary log. A node is therefore logarithmically weight-balanced if the MSB of the weight of one subtree is at most one step away from the MSB of the weight of the other subtree.


Figure 17: A node with subtree weights 8 and 4 is logarithmically weight-balanced because \lfloor\log_{2}(8)\rfloor = 3 and \lfloor\log_{2}(4)\rfloor = 2, which do not differ by more than 1. On the other hand, a node with subtree weights 8 and 2 is not balanced because \lfloor\log_{2}(2)\rfloor = 1, which is too far away from 3.

Comparing the binary log

Generally, we require a function that determines whether the MSB of one integer is less than the MSB of another. The first candidate is the function used by Roura,[27] the second is by Chan,[28] and the third is from Hacker's Delight by Warren.[29] These functions allow us to compare the binary logarithm without the need to compute it.

\begin{align*}
\lfloor\log_{2}(x)\rfloor \geq \lfloor\log_{2}(y)\rfloor &\equiv x \geq y \text{ || } y \leq (x \mathbin{\&} y) \ll 1\\
&\equiv x \geq y \text{ || } x \geq (x \oplus y)\\
&\equiv x \geq (\lnot{x} \mathbin{\&} y)
\end{align*}

So far we have four definitions of balance based on the binary logarithm. There appears to be a pattern where a stronger definition of balance is relaxed by the logarithm of the same terms — what a neat relationship.

\begin{align*}
\text{Median}(x, y) &= (x+1) \geq y\\
\text{Height}(x, y) &= \lfloor\log_{2}(x+1)\rfloor \geq \lfloor\log_{2}(y)\rfloor\\
\text{Weight}(x, y) &= (x+1) \geq (y+1)/2\\
\text{Log}(x, y) &= \lfloor\log_{2}(x+1)\rfloor \geq \lfloor\log_{2}((y+1)/2)\rfloor
\end{align*}

Using the bitwise function of Warren to compare the binary log and left shifts for division yields the following expanded solutions:

\begin{align*}
\text{Median}(x, y) &= (x+1) \geq (y)\\
\text{Height}(x, y) &= (x+1) \geq (y) \mathbin{\&} \lnot{(x+1)}\\
\text{Weight}(x, y) &= (x+1) \geq (y+1) \gg 1\\
\text{Log}(x, y) &= (x+1) \geq ((y+1) \gg 1) \mathbin{\&} \lnot{(x+1)}
\end{align*}
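In Go, these four checks can be written directly, using the &^ (AND NOT) operator in place of the negation and bitwise AND, and a right shift for the halving. This is a sketch assuming x ≤ y are the two subtree sizes and that Size is an unsigned integer type.

// Each function reports whether a node with smaller subtree size x and
// larger subtree size y is balanced under the corresponding rule.
func median(x, y Size) bool    { return x+1 >= y }
func height(x, y Size) bool    { return x+1 >= y&^(x+1) }
func weight(x, y Size) bool    { return x+1 >= (y+1)>>1 }
func logWeight(x, y Size) bool { return x+1 >= ((y+1)>>1)&^(x+1) }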

Cost-optimized balance

There is one other design candidate that does not follow the same pattern. In 2000, Cho and Sahni[30] introduced the idea of cost-optimized search trees, which are trees where the average path length can not be reduced by a rotation anywhere in the tree.

Even though their definition considers rotations, we can apply this concept to partition-based balancing without the need to consider rotations at all. In this context, the only determination to make is whether a node is balanced or not. To determine whether a node is cost-optimized, we need to compare two levels of subtree sizes.

Definition

A node is cost-optimized if the size of each subtree is greater than or equal to the size of both subtrees of the other subtree.

\begin{align*}
\text{Cost}(p) = {} & \text{size}(l) \geq \text{size}(rl),\ \text{size}(l) \geq \text{size}(rr),\\
& \text{size}(r) \geq \text{size}(lr),\ \text{size}(r) \geq \text{size}(ll)
\end{align*}
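As a sketch, the check reads directly from the definition, assuming that leaves are nil pointers and reusing a size helper like the one sketched earlier to count whole subtrees.

// costOptimized reports whether neither subtree of p is smaller than either
// subtree of the other subtree.
func costOptimized(p *Node) bool {
    var ll, lr, rl, rr Size
    if p.l != nil {
        ll, lr = p.l.s, size(p.l.r)
    }
    if p.r != nil {
        rl, rr = p.r.s, size(p.r.r)
    }
    l, r := size(p.l), size(p.r)
    return l >= rl && l >= rr && r >= ll && r >= lr
}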

Balancer analysis

Measurements were made in size increments of 1,000 up to 10,000,000. Many random trees were grown consistently at each size increment and balanced by each strategy. The results were generated on an Intel i5 13600K with 32GB of DDR5 RAM.

Partition Count

The total partition count is the number of nodes that required partitioning at each size increment. Dividing the total partition count by the size of the tree at each increment indicates the fraction of the tree that required partitioning. A horizontal line indicates that the total partition count is linear in the size of the tree.

[Chart: Partition Count / Size versus Size (×10^6) for Median, Height, Weight, Log, and Cost.]

The expectation was that height-balance partitions fewer nodes than median-balance because it avoids some redundant partitioning, but the results indicate that this is not the case — height-balance actually partitions more nodes than median-balance.

On the other hand, the weight-balance strategies partition the fewest nodes overall: around 15% for logarithmic weight-balance, 16% for linear weight-balance, and between 30% and 40% for median-balance and height-balance. Cost-optimized balance is somewhere in-between at around 28%.

Partition Depth

The total partition depth is the total number of links followed by partition at each size increment. This corresponds to the total number of additional nodes visited during the balancing operation. Dividing the total partition depth by the size of the tree at each increment indicates this total as a fraction of the size of the tree. A horizontal line therefore indicates that the number of nodes visited is linear in the size of the tree.

[Chart: Partition Depth / Size versus Size (×10^6) for Median, Height, Weight, Log, and Cost.]

The results indicate that height-balance visits fewer nodes than median-balance on average, with both visiting in total ∼1 additional node for every node in the tree. Cost-optimized balance visits fewer additional nodes at ∼0.88, then linear weight-balance at ∼0.62, then logarithmic weight-balance with the lowest at ∼0.56.

[Chart: Partition Depth / Partition Count versus Size (×10^6) for Median, Height, Weight, Log, and Cost.]

Average Path Length

The average path length of each resulting tree measures how well-balanced they are, where lower is better. Dividing this average by the log of the size of the tree at each increment gives the results as a factor of the binary log.

[Chart: Average Path Length / log2(Size) versus Size (×10^6) for Median, Height, Weight, Log, and Cost.]

The average path length of all strategies appears to approach \log_2(n), which for the final tree of size 10,000,000 translates to an average path length of ∼23. The results show that median-balance, height-balance and DSW all result in optimal balance, with cost-optimized balance only slightly suboptimal. Using linear weight-balance results in a slightly lower average path length than logarithmic weight-balance, but only by ∼2%.

Notice that the relative difference between all strategies is very small. The average path length at the final increment ranges between 17.9 and 18.6, making the percentage difference between the least and most-balanced trees only ∼4%.

Maximum Path Length

The maximum path length of each resulting tree measures the worst-case search cost at each size increment, i.e. height. Dividing this by the log of the size of the tree at each increment gives the results as a factor of the binary log.

[Chart: Maximum Path Length / log2(Size) versus Size (×10^6) for Median, Height, Weight, Log, Cost, and DSW.]

The maximum path length is \log_2(n) for median-balance, height-balance, and DSW because only the lowest level in the tree has missing nodes, which is optimal. The lowest level of the tree eventually gets completed as the size increases, after which a new level is started which increases the height by 1. The new level is twice the size of the previous level, so it takes twice as long for the height to increase again.

There is a ∼30% difference in height between the least and most-balanced strategies: cost-optimized balance has a height of ∼1.06, linear weight-balance ∼1.18, and logarithmic weight-balance ∼1.30.

Duration

Trees were also created in size increments of 100,000 up to 10,000,000 to measure the amount of time it takes each strategy to balance random trees on actual hardware. Keep in mind that time-based measurements are system-dependent so the results could be entirely different on other machines.

[Charts: Duration in milliseconds / Size, and Total Duration in seconds, versus Size (×10^6) for Median, Height, Weight, Log, Cost, and DSW.]

The results show that median-balance is marginally faster than height-balance, which is somewhat unexpected. Looking at duration cumulatively, it appears that the total duration is practically identical between median-balance, height-balance and cost-optimized balance.

The two weight-balance strategies are both ∼25% faster than median-balance and height-balance and ∼30% faster than DSW, with logarithmic weight-balance having a slight edge. This suggests that linear weight-balance is a good choice because it takes about the same amount of time as logarithmic weight-balance but achieves better balance. However, logarithmic weight-balance partitions fewer nodes and visits fewer nodes overall, which might offer some additional benefit on other machines.

Cost-optimized balance has a lower partition count, lower partition depth, and is slightly faster than height-balance but slightly slower than median-balance. The overhead is likely from the memory interaction of the multi-node size comparison to determine cost-optimized balance.

DSW appears to be consistently slower to balance large random trees, but the algorithm has some advantages over partition-based balancing that should not be discounted. Both algorithms use constant space, but DSW runs in linear time while partition-based balancing is theoretically O(n log n), though it could perhaps be proven linear with high probability in the case of random trees. The two phases of DSW are also useful procedures individually: to convert any binary search tree into a linked list, and to convert a linked list into a balanced tree. DSW also does not rely on subtree size, which may not be available in some cases.

Generally, balancing by partition is an excellent strategy when trees are already somewhat-balanced. The weight-balance strategies are the fastest and the resulting balance is only slightly suboptimal, suggesting a good trade in practice when balancing is frequent. Surprisingly, the benchmarks suggest that median-balance is likely the best choice to balance random trees when strict height-balance is a requirement or when balancing is infrequent. DSW is the best choice to balance trees that are likely to be linear lists, or when subtree size is not available.

Part 3Self-Balancing Trees

Restoring balance to an entire tree is powerful but often disproportionately expensive because the entire tree must be traversed. There is also the question of when to balance to avoid balancing too often or not balancing often enough. Instead of either balancing the entire tree or not balancing at all, we can restore balance incrementally over time and thereby spread the balancing costs across updates.

For example, an algorithm can make structural adjustments along the search path during an update to restore balance locally. A valid balancing algorithm does not change the ordering of the nodes and always returns the tree in a balanced state. A program can then know that a given tree is balanced because it is always balanced.

This section explores various strategies to maintain balance incrementally. Some strategies make many structural changes while others are more selective; some maintain strict balance while others do not technically balance at all. The goal is to determine which strategy is likely to be a good choice in a particular situation by considering the nature of balance and the work required to maintain it.

To compare strategies, we use an operation that starts by inserting 1,000,000 values, then alternates between insertions and deletions for a total of 10,000,000 updates. An access pattern defines where updates should occur in the sequence throughout the operation. For example, the maximum access pattern always updates at the end of the sequence, i.e. the maximum position. The results presented are averaged across five access patterns: the maximum distribution, a uniformly random distribution, a normal distribution with \langle \mu = 0.5, \sigma = 0.15 \rangle, a skewed beta distribution with \langle \alpha = 100, \beta = 50 \rangle, and a Zipf distribution with \langle s = 1.25 \rangle.

Rank-balanced trees

The original self-balancing binary search tree was published in 1962 as an algorithm for the organization of information by Adelson-Velsky and Landis.[31] AVL trees store the height of every node as part of the node structure, then use rotations along the search path to maintain height-balance. The most common approach to implementing an AVL tree is to use recursive algorithms, where all operations start from the root, traverse along a search path, then backtrack to the root along the same path.
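
As a rough illustration of the stored height and the rotation at the heart of AVL balancing, consider the following sketch in Go. The node layout and names here are illustrative, not the project's actual structures.

type avlNode struct {
   l, r   *avlNode
   height int
}

func height(n *avlNode) int {
   if n == nil {
      return -1 // an empty subtree has height -1
   }
   return n.height
}

// rotateRight promotes the left child and recomputes the two affected
// heights from their (unchanged) subtrees; max is the Go built-in (1.21+).
func rotateRight(p *avlNode) *avlNode {
   l := p.l
   p.l = l.r
   l.r = p
   p.height = 1 + max(height(p.l), height(p.r))
   l.height = 1 + max(height(l.l), height(l.r))
   return l
}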

Knuth[5] describes how balancing in an AVL tree is only necessary at the topmost node affected by balancing, commonly referred to as the safe node.[32] This node is either the root or the last node along the search path with subtree heights not equal to each other. After a new node is inserted at the bottom of the tree, the path from the safe node is traversed again to update node heights, then balanced at the safe node using a subset of the bottom-up deletion balancing algorithm. This algorithm is not purely top-down because a section of the search path is traversed twice, but space is constant and everything above the safe node is safe to modify concurrently.

Since the invention of AVL trees, many self-balancing strategies have been proposed, the most common being the red-black tree of Guibas and Sedgewick[33] in 1978. Instead of directly storing the height of every node as in AVL trees, red-black trees use a color field in the node structure to track whether a node is red or black. The color red was chosen because it was the best-looking color produced by the laser printer available to the authors at the time, and they used red and black pens to draw trees by hand. A set of rules based on these colors determines where to make rotations along the search path to maintain balance.

Figure 18: A valid red-black tree of 32 nodes. Notice that every path to a leaf from the root has an equal number of black nodes, and no red node links to another red node.

Red-black trees are not technically height-balanced, but the color-based rules result in a maximum height bound of 2\log_{2}(n) where n is the size of the tree. The same bound for AVL trees is 1.44\log_{2}(n), so AVL trees are generally more balanced but rotate more often to maintain that balance.

Inserting into a red-black tree requires at most two rotations, and deleting requires at most three. Inserting into an AVL tree also requires at most two rotations, but deletion may require a rotation at every node along the search path. The advantages of red-black trees are that the balancing overhead per operation is constant, and that only 1 bit of information is needed to store the color.

A common response to the recurring question of whether to use an AVL tree or a red-black tree is to use AVL trees when mostly searching because the balance is better, but to use red-black trees when mostly inserting and deleting because they use fewer rotations. This is, however, a misconception, because inserting and deleting both require a search from the root to the bottom of the tree, which is affected by path length all the same.

Red-black trees are very common and generally considered good enough. Many implementations of the C++ STL use them for sorted sets, also Java's TreeMap,[34] and parts of the Linux kernel[35]. However, most implementations are recursive, many use parent pointers, and attempts to implement them often result in frustration.

In 2013, Haeupler, Sen & Tarjan[32] introduced rank-balanced trees, which unify many height-based balanced trees under a single framework. Rank-balanced trees store in each node an integer rank and use various rank-based rules to define AVL trees, red-black trees, and others in a consistent way.

A new strategy emerged from this framework: the weak AVL tree, which effectively combines the insertion of AVL trees with the deletion of red-black trees such that the height bound degrades gracefully from that of an AVL tree as the number of deletions increases, and is never worse than that of a red-black tree. Insertion and deletion in a weak AVL tree require at most two rotations, which is fewer than AVL trees and red-black trees. Concurrency is well-supported in weak AVL trees because both insertion and deletion can be implemented purely top-down.

In 2018, Sen, Tarjan & Kim[36] introduced relaxed rank-balanced trees, which only balance after insertion, not deletion. The balance of relaxed rank-balanced trees generally improves as nodes are inserted, and degrades slowly as nodes are deleted. A subtle property of relaxed rank-balanced trees is that their height bound is proportional to the number of insertions, rather than the size of the tree.

This project compares 10 different rank-balanced trees.

[Plot: Average Path Length / log2(Size) vs. Operations / 10⁶; series: AVLBottomUp, AVLTopDown, RedBlackBottomUp, RedBlackTopDown]

The average path length of the two AVL variants is exactly the same, and the average path length of the top-down red-black tree is better than the bottom-up red-black tree. Experiments indicate that top-down red-black trees make up to 6 rotations during either insert or delete, but still amortized O(1).

[Plot: Duration in milliseconds vs. Operations / 10⁶; series: AVLBottomUp, AVLTopDown, RedBlackBottomUp, RedBlackTopDown]

The bottom-up red-black tree is clearly the slowest of this group. The performance of the top-down AVL tree and the top-down red-black tree is practically equivalent, and only slightly faster than the bottom-up AVL tree. Currently, both red-black tree variants implement deletion bottom-up, so there is a decent chance that the top-down red-black tree can see further improvement when deletion is implemented top-down.

[Plot: Average Path Length / log2(Size) vs. Operations / 10⁶; series: AVLBottomUp, RedBlackBottomUp, AVLWeakBottomUp, AVLWeakTopDown]

The balance of weak AVL trees deteriorates as the number of deletions increases, eventually matching the balance of red-black trees. Here, the average path length of the top-down variant appears to be worse than the bottom-up variant, which is the opposite of the same observation in red-black trees.

[Plot: Duration in milliseconds vs. Operations / 10⁶; series: AVLBottomUp, AVLTopDown, AVLWeakBottomUp, AVLWeakTopDown]

The top-down AVL tree is faster overall, and the top-down weak AVL tree is slower than both the bottom-up variant and the bottom-up AVL tree. The differences are small but consistent throughout the operation.

[Plot: Average Path Length / log2(Size) vs. Operations / 10⁶; series: AVLBottomUp, AVLWeakTopDown, AVLRelaxedBottomUp, AVLRelaxedTopDown]

This plot shows how the balance of relaxed AVL trees continues to deteriorate over time, but does not exceed \log_2(n) even after many more deletions than there were nodes in the tree. Their balance should stabilize, however, since insertions generally improve balance and deletions always produce the joining root from the larger subtree, as shown in Figure 7.

[Plot: Duration in milliseconds vs. Operations / 10⁶; series: AVLBottomUp, AVLWeakTopDown, AVLRelaxedBottomUp, AVLRelaxedTopDown]

This plot shows that relaxed AVL trees are much faster than the other AVL variants. Notice that the relative difference in performance remains constant even though balance deteriorates over time. The performance of the bottom-up and top-down variants of relaxed AVL trees appears to be equivalent.

[Plot: Average Path Length / log2(Size) vs. Operations / 10⁶; series: RedBlackBottomUp, RedBlackTopDown, RedBlackRelaxedBottomUp, RedBlackRelaxedTopDown]

This plot shows how the balance of relaxed red-black trees continues to deteriorate over time, much like relaxed AVL trees. The balance of top-down relaxed red-black trees is better than the bottom-up variant, which correlates with standard red-black trees since they use the exact same insertion algorithm.

[Plot: Duration in milliseconds vs. Operations / 10⁶; series: RedBlackBottomUp, RedBlackTopDown, RedBlackRelaxedBottomUp, RedBlackRelaxedTopDown]

The top-down relaxed red-black tree is clearly the fastest red-black tree variant.

Weight-balanced trees

Implementing a linear list using a binary search tree requires that we know the size of each subtree along the search path. It seems ideal then to use this information for balancing, rather than heights or ranks stored additionally in each node.

In 1972, Nievergelt and Reingold[37] introduced trees of bounded balance, or BB[α] trees, where weight-balance is maintained by keeping subtree weights within a factor of α. They proved that BB[α] trees have a maximum height of 1+\log_{1/(1-\alpha)}(n+1).

The original insertion algorithm was purely top-down, but Blum and Mehlhorn[38] proved in 1980 that the valid range provided for α was incorrect. Their supplied proof, however, only works for bottom-up balancing. In 1993, Lai and Wood[39] described top-down algorithms and proved their correctness for \frac{2}{11} \lt \alpha \leq 1-\frac{\sqrt2}{2}.

The valid range for the α parameter is complex and its evaluation involves rational arithmetic or integer multiplication that could overflow. Ideally there exists a weight-balanced strategy that can keep the weights of subtrees within a reasonable bound without the need to perform complex calculations.

Roura[27] introduced logarithmic binary search trees or LBSTs in 2001, which use simple bitwise operations to evaluate weight-balance. The maximum height of a logarithmic weight-balanced tree is the same as red-black trees, 2\log_{2}(n). The original bottom-up algorithm uses subtree size as weight, but Roura mentions in the final remarks that it is also possible to use the number of leaves as weight.
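
The following sketch shows the kind of bitwise test such a tree can use, assuming the rule that the floored binary logarithms of two sibling weights may differ by at most one; the exact rule and the definition of weight follow Roura's paper, so this is an illustration rather than the project's implementation. It uses bits.Len from the standard library "math/bits" package, where bits.Len(w) equals ⌊log2(w)⌋ + 1 for w > 0.

// logWeightBalanced reports whether two sibling weights are within one
// position of each other in terms of their most significant bit.
func logWeightBalanced(wl, wr uint) bool {
   a, b := bits.Len(wl), bits.Len(wr)
   return a <= b+1 && b <= a+1
}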

Frias[40] published top-down logarithmic weight-balanced trees in 2005, under advice from Roura. Their algorithms are more complex than the top-down algorithms of Lai and Wood as a result of using subtree size as weight. Using the standard definition of weight rather than subtree size simplifies their algorithms and aligns them with the top-down algorithms of Lai and Wood.

Hirai and Yamamoto[41] published an investigation of weight-balanced trees in 2011, wherein they introduce two balancing parameters: Δ as the constant factor that determines if two subtrees are weight-balanced, and Γ to determine the type of rotation that is needed to restore balance.

\begin{align*}\Delta = \frac{1-\alpha}{\alpha}\qquad\qquad\alpha = \frac{1}{\Delta+1}\\\\\Gamma = \frac{1}{1-\alpha}\qquad\qquad\Gamma = \frac{\Delta+1}{\Delta}\end{align*}

Balancing parameters are considered feasible if they always produce valid weight-balanced trees. Hirai and Yamamoto show that the feasible space for \langle\Delta,\Gamma\rangle is actually a non-empty polytope, suggesting that the linear dependency between Δ and Γ implied by α is not necessary. The feasible parameter space is therefore expanded by separating the two independent variables coupled linearly in α.

Hirai and Yamamoto include a short section on logarithmic weight-balance, stating that "for mathematical reliability, logarithmic weight-balanced trees are simpler". They benchmarked the original bottom-up implementation against other bottom-up weight-balanced trees and found their performance to be about the same.

Barth and Wagner[42] evaluated top-down weight-balanced trees in 2019, but only mention logarithmic weight-balance in the introduction, without further evaluation. They contribute detailed benchmarks across a range of \langle\Delta,\Gamma\rangle, but acknowledge that "it is unclear what the feasible polytope looks like". They conclude with a suggestion for further study, that "in the future, it would be interesting to determine the space of feasible balancing parameters for top-down weight-balanced trees similar to how Hirai and Yamamoto have done for bottom-up weight-balanced trees."

To produce such a map, we simulate random trees across a wide range of rational Δ and Γ parameters. Whenever a node is not weight-balanced after an operation, it proves that a counter-example exists for that \langle\Delta,\Gamma\rangle pair, which makes it infeasible. The plotted area represents all feasible balancing parameter pairs for top-down weight-balanced trees.

[Plot: the feasible region in the (Δ, Γ) plane, bounded by Γ ≤ (Δ – 1) and Γ ≥ (Δ + 1) / Δ]

Figure 19: The polytope of feasible \langle\Delta,\Gamma\rangle parameters for top-down weight-balanced trees.

The shape appears to be a more detailed illustration of the polytope that Hirai and Yamamoto found for bottom-up weight-balanced trees. The curve along the bottom is the corresponding \langle\Delta,\Gamma\rangle linear bound proven for α by Lai and Wood. \langle3,2\rangle is the only integer pair in the feasible space, which Barth and Wagner describe as "overall fairly performant", notably faster than red-black trees.

This project compares 4 weight-balanced trees: bottom-up and top-down variants that use \langle3,2\rangle, the original logarithmic weight-balanced tree that uses size as weight, and a top-down logarithmic weight-balanced tree.

[Plot: Average Path Length / log2(Size) vs. Operations / 10⁶; series: LBSTBottomUp, LBSTTopDown, WBSTBottomUp, WBSTTopDown]

The original bottom-up logarithmic weight-balanced tree has the lowest average path length due to its use of size as weight. The top-down variant is only slightly worse, and better than both the top-down and bottom-up weight-balanced trees that use \langle3,2\rangle.

[Plot: Duration in milliseconds vs. Operations / 10⁶; series: LBSTBottomUp, LBSTTopDown, WBSTBottomUp, WBSTTopDown]

The original bottom-up logarithmic weight-balanced tree is generally faster than the other weight-balanced variants, likely due to its lower average path length. The top-down variants appear to start out faster but become slower over time, even though balance does not deteriorate. Why is that? Perhaps further optimization of the top-down algorithms and repeated benchmarks could shed some light.

Scapegoat trees

Exploring further we find the idea of partial rebuilding by Overmars[43] in 1983. Further development produced the general balance trees of Andersson[44][45] in 1989 and 1999, followed by the scapegoat tree of Galperin and Rivest[46] in 1993.

A tree in which every node is weight-balanced has a provable maximum height bound. Therefore, when the height of a tree exceeds this bound there must be at least one node that is not weight-balanced. Galperin and Rivest refer to this node as a scapegoat.

The basic idea of partial rebuilding is to find a node that is not balanced and then rebuild that subtree into a balanced tree, thereby restoring the height bound. Intuitively, the height of a tree that was not balanced decreases when it becomes balanced. This strategy combines the ideas of weight-balance and height-balance by allowing some nodes to not be weight-balanced, as long as the height of the tree stays within the height bound.

Figure 20: A tree with a large perfectly-balanced left subtree and an empty right subtree. Balanced, or not?

The original scapegoat tree uses a balancing parameter α, where \frac{1}{2} \leq \alpha \lt 1 and height is bounded by \lfloor\log_{(1/\alpha)}(n)\rfloor+1. A node with subtree sizes sl and sr has a total size of s = sl + sr + 1, and is considered α-weight-balanced if sl \leq \alpha * s and sr \leq \alpha * s.
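
As a direct sketch of this test in Go, with sl and sr as the subtree sizes and alpha as the balancing parameter (the function name is illustrative):

// alphaWeightBalanced reports whether a node with subtree sizes sl and sr
// is α-weight-balanced as defined above.
func alphaWeightBalanced(sl, sr int, alpha float64) bool {
   s := float64(sl + sr + 1)
   return float64(sl) <= alpha*s && float64(sr) <= alpha*s
}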

The insertion algorithm described by Galperin and Rivest is a two-pass recursive algorithm: search by descending from the root to attach a new node at the bottom of the tree, then backtrack bottom-up to find a scapegoat that is not α-weight-balanced, if any, and rebuild it. There must be at least one such node along the search path when the resulting height of the insertion exceeds the maximum height bound. The suggested rebuilding algorithm recursively flattens the tree into a list, then rotates that list back into a perfectly-balanced tree, similar to the Day-Stout-Warren algorithm.

There are a few downsides to this strategy: (i) the recursion goes all the way back to the root even when the height bound was not exceeded, (ii) the α-parameter imposes an awkward fractional log base, and (iii) the rebuilding algorithm does not consider that the subtree is likely already somewhat-balanced, given that the height only just exceeded the bound.

Recursion can be avoided by looking for a scapegoat top-down along the search path before it is known what the height will be. The scapegoat is then already in-hand if and when the height bound is exceeded at the bottom of the tree. Evaluating weight-balance optimistically is often redundant however, because the result is not needed when the height remains valid. This is okay, given that other weight-balanced trees already evaluate balance at every node along the search path anyway.

There can be many nodes along the search path that qualify as a scapegoat, offering a choice to use the first one closest to the root, or to use the last one furthest away from the root. For a given operation, accepting the first scapegoat closest to the root allows the algorithm to skip further checks for weight-balance along the search path, whereas using the last scapegoat furthest from the root must keep checking weight-balance because it is not known whether another will be found along the way.

Choosing the scapegoat furthest from the root is a clear choice when using recursion because once you pass one by on the way up to the root there is no guarantee that another will come along. For the algorithm to alternatively use the scapegoat closest to the root, it would need to maintain a reference to the last scapegoat found, which is now potentially far down the tree again since we are back at the root. Updating the references from the root to this node is then cumbersome. The intuition is also that subtrees are smaller around the lower levels of the tree, so one should prefer to rebuild smaller subtrees which require less work.

There are however some hints to suggest that rebuilding larger subtrees closer to the root is likely to produce better results. Overmars originally suggested to use the node closest to the root, and Galperin and Rivest write that "this heuristic performed better than choosing the first weight-unbalanced ancestor [from the bottom] to be the scapegoat". Even so, all available references and implementations of scapegoat trees use the recursive method and choose the first scapegoat encountered bottom-up.

Experiments confirm that it is more efficient to prefer the first scapegoat encountered top-down, therefore closest to the root. When a partial rebuild is required somewhere, we may as well balance more nodes than is strictly necessary to further delay the next rebuild.

The α-parameter can be avoided by using logarithmic weight-balance. The maximum height of a tree where every node is logarithmically weight-balanced is 2\log_{2}(n), so there must be a node along the search path that is not logarithmically weight-balanced when the height exceeds this bound after an insertion. The equivalent α for this bound is \frac{1}{\sqrt{2}}.

Using logarithmic weight-balance, a tree of size n requires a partial rebuild somewhere along the search path when the height h is greater than 2\log_{2}(n).

\begin{aligned}\text{Exceeds}(h, n)\ &\equiv h \gt 2\log_{2}(n)\\ &\equiv n \lt 2^{(h + 1)/2}\\ &\equiv n \lt 1 \ll ((h + 1) \gg 1)\end{aligned}
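
The last line of this derivation translates directly to a small Go sketch, where the height exceeds the bound when the size is less than 2 raised to the power ⌈h/2⌉:

func exceeds(h, n int) bool {
   return n < 1<<((h+1)>>1)
}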

The final sharp edge to resolve is the choice of algorithm to rebuild the scapegoat. We can assert that the subtree is already somewhat-balanced, and we know that restoring logarithmic weight-balance at every node in that subtree would restore the height bound. Restoring balance using a partition-based strategy is therefore a good choice since most nodes should not require partitioning.

Relaxed deletion in scapegoat trees

Alongside a reference to the root node and the current size of the tree, a scapegoat tree usually stores a maximum size field, which is the maximum size that the tree has reached since the root was rebuilt. The theory suggests that the entire tree should be rebuilt when the current size becomes less than α times the maximum size, then to reset the maximum size to the current size.

Occasionally rebuilding the tree when many deletions have occurred ensures that the height remains logarithmic in the size of the tree. Alternatively, we can use the concept of relaxed balance from rank-balanced trees to only balance after an insertion, not after a deletion. The intuition is that height does not increase when a node is deleted, and a following insertion will inevitably restore the height bound anyway, possibly even at the root all the same. The height is then at most logarithmic in the number of insertions rather than the size of the tree.
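
As a sketch of the standard post-deletion rule described above, which the relaxed variant simply omits, assuming a Tree with illustrative root, size, and maxSize fields and a rebuild routine:

func afterDelete(t *Tree, alpha float64) {
   if float64(t.size) < alpha*float64(t.maxSize) {
      t.root = rebuild(t.root) // rebuild the entire tree into balanced form
      t.maxSize = t.size
   }
}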

This project compares 2 relaxed weight-balanced trees: one which is equivalent to the \langle3,2\rangle weight-balance rule even though the Γ parameter is not used, and another that uses logarithmic weight-balance.

[Plot: Average Path Length / log2(Size) vs. Operations / 10⁶; series: LBSTBottomUp, LBSTRelaxed, WBSTBottomUp, WBSTRelaxed]

This plot shows in more detail how the balance of the strict weight-balanced trees remains mostly constant. The relaxed variants allow balance to deteriorate until the height bound is exceeded, which causes a partial rebuild somewhere in the tree, resulting in a drop in average path length. Notice that the average path length of the relaxed weight-balanced trees does not exceed \log_2(n). Can this be proven?

[Plot: Duration in milliseconds vs. Operations / 10⁶; series: LBSTBottomUp, LBSTRelaxed, WBSTBottomUp, WBSTRelaxed]

This plot indicates that the relaxed weight-balance variants are generally slower than their strict counterparts. The relaxed logarithmic weight-balanced tree is consistently faster than the linear weight-balance variant, likely due to its lower average path length, and therefore lower search cost.

Randomly-balanced trees

All the strategies explored so far always result in the same structure for the same access pattern. As a result, they are consistent and predictable. This is not always a benefit, however. For example, an adversary could exploit predictability to gain information about the system. Some predictable strategies are also history-dependent, where information about previous operations can be derived from the current structure. While these concerns are uncommon in practice, unpredictability and history-independence are essential features in some applications.

Probabilistic balancing is effective regardless, and some of the most common data structures in practice use randomness in some way. Focusing on binary search trees, we explore four strategies that use randomness to keep the tree balanced.

In 1989, Seidel and Aragon[47] introduced the treap data structure, which combines the ideas of a tree and a heap. When a new node is allocated, the algorithm generates a uniformly-distributed random number as its rank, and stores that rank as part of the node structure. The only invariant to maintain is that the rank of a node must be greater than or equal to the ranks of its subtrees. The root of the tree therefore has the highest rank. With high likelihood this results in a balanced tree, because it becomes increasingly unlikely that a new rank introduced at the bottom of the tree is greater than the ranks above it, so most nodes end up around the bottom of the tree.

The algorithms to maintain the rank invariant are intuitive and simple to implement. There are two fundamental algorithms: split, which partitions a node into two separate trees at a given position, and join which combines two subtrees into one. These algorithms are similar to partition, operating top-down in constant space.

To insert a node, start by generating a new random rank for it, then follow the search path from the root until a node is reached with a lower rank than the new rank. Split this node into the left and right subtrees of the new node, then replace it with the new node. When the search reaches the bottom of the tree, it suggests that the rank of the new node is the minimum of its path, and therefore safe to attach there. This algorithm is exactly the root insertion algorithm of Stephenson[25], on which partition is based. To delete a node, simply find the node to delete then replace it by the join of its subtrees.


Figure 21: The two trees at the top are joined to produce the tree at the bottom, in descending rank order. The opposite is a split of the tree at the bottom between D and X to produce the tree at the top. Ranks are indicated above the nodes.

func split(p *Node, i Position) (*Node, *Node) {
   n := Node{} // sentinel node whose links collect the two results
   l := &n
   r := &n
   for p != nil {
      if i <= p.s {
         p.s = p.s - i // i nodes leave this node's left subtree
         r.l = p
         r = r.l
         p = p.l
      } else {
         i = i - p.s - 1
         l.r = p
         l = l.r
         p = p.r
      }
   }
   l.r = nil
   r.l = nil
   return n.r, n.l // (left, right)
}
            
func join(l, r *Node, sl Size) (root *Node) {
   p := &root
   for {
      if l == nil { *p = r; return }
      if r == nil { *p = l; return }

      if rank(l) < rank(r) {
         r.s = r.s + sl // all of l joins r's left subtree
         *p = r
         p = &r.l
         r = *p
      } else {
         sl = sl - l.s - 1 // l's root and left subtree stay on this side
         *p = l
         p = &l.r
         l = *p
      }
   }
}
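
Putting these together, a rough sketch of positional treap insertion might look as follows, assuming the node layout of the listings above plus an illustrative rank field, a value field x, and a randomRank helper; this is a sketch of the idea rather than the project's actual code.

func insert(root *Node, i Position, x Value) *Node {
   n := &Node{x: x, rank: randomRank()}
   p := &root
   // Descend while the current node outranks the new node, tracking the
   // position and growing the left-subtree sizes the new node will join.
   for *p != nil && (*p).rank >= n.rank {
      if i <= (*p).s {
         (*p).s = (*p).s + 1
         p = &(*p).l
      } else {
         i = i - (*p).s - 1
         p = &(*p).r
      }
   }
   // The first node with a lower rank, or the empty link at the bottom, is
   // split into the new node's subtrees, and the new node takes its place.
   n.l, n.r = split(*p, i)
   n.s = i
   *p = n
   return root
}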
            

In 1997, Martínez and Roura[48] introduced randomized binary search trees, which are based on subtree size and thereby avoid the need to generate and store a rank in every node. The algorithms are very similar to that of treaps, but the method by which random choices are made is different. Consider insertion, where a treap has some new rank in hand and is searching for a node with a lower rank along the search path to split and replace. Randomized trees similarly search for a node to split and replace, but require a different method to determine which node to split and replace.

Instead of generating a single random variable across the entire range, randomized trees generate a random number between zero and the size of the subtree at each step, then split and replace if the subtree size is equal to the generated random number. Subtrees tend to become smaller further down the tree, so it becomes increasingly likely to generate a random number equal to the size of the current subtree as the search continues. Most insertions therefore occur around the bottom of the tree.
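
A sketch of that random choice in Go, where s is the size of the current subtree: rand.Intn(s+1) is uniform on [0, s], so the test succeeds with probability 1/(s+1), in which case the new node splits and replaces the root of this subtree. This uses the standard library "math/rand", and the function name is illustrative.

func replaceAtRoot(s int) bool {
   return rand.Intn(s+1) == s
}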

Both treaps and randomized trees produce exactly the same result — random trees, so we can expect the average and maximum path lengths to be similar. The one advantage of treaps is that they require only one random variable per operation, where randomized trees require bounded randomness at every step along the search path. On the other hand, treaps require a rank field as part of the node structure, which is not required by randomized trees as they make use of the available subtree size information.

One strategy that always makes an appearance when probabilistic data structures are discussed is the skip list of Pugh[49] from 1989. While not technically a tree, skip lists are often used as an alternative to binary search trees because they have similar performance bounds and are generally considered simpler to implement. Skip lists are popular in practice, for example: RocksDB, maintained by Meta and based on LevelDB by Google, uses a skip list for its memory buffer,[50] Redis uses skip lists for sorted sets[51], and Apache HBase uses skip lists as the default in-memory store.

In 2018, Tarjan, Levy and Timmel[52] introduced zip trees, which are binary search trees that are isomorphic to skip lists, and simplify the mapping between skip lists and binary search trees by Dean and Jones[53] in 2007. Zip trees are essentially treaps, except that ranks are drawn from a geometric distribution instead of a uniform distribution, and rank ties are only allowed on the right. This strategy requires fewer bits of randomness than treaps, and the equivalence to skip lists is a novel characteristic.

The geometric rank distribution used by zip trees is the same as in skip lists, which is equivalent to the number of times a coin-flip lands on heads before landing on tails. The probability of each successive coin-flip landing on heads is half the probability of the previous flip. This aligns with binary search trees where approximately half the nodes are in the lowest level, a quarter of the nodes are one level up, all the way to the root level with only one node. The root node therefore has the highest-equal rank, corresponding to the longest streak of successive coin-flips landing on heads so far.
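
A geometric rank can be drawn from a single random word, as in this sketch: each bit of a uniform random word acts as a fair coin, and counting the zeros before the first one gives the number of heads before the first tails. This uses "math/rand" and "math/bits", and the function name is illustrative.

func geometricRank() int {
   return bits.TrailingZeros64(rand.Uint64())
}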

In 1988, Tarjan and van Wyk[54] discussed a homogenous finger search tree, where the pointers along the left-most and right-most paths of the tree, referred to as the spines, are reversed such that the left pointers of nodes along the left spine and the right pointers of nodes along the right spine point up to their parent instead. Access to the tree is then through the head and the tail nodes, rather than the root. This provides access from the start and the end of the sequence searching towards the middle, rather than starting in the middle searching outward. This results in a longer average path length, but access close to the start or the end of the sequence is more efficient.

Searching for a node in a homogenous finger search tree starts at either the head or the tail of the tree and proceeds upward towards the root, until the search is about to ascend too far, at which point it descends inward. This inward branch is the same branch that would be taken by a conventional top-down search for the same node. Balancing is maintained as usual in the interior tree at the branching point, which has the usual pointer convention, then rotated up along the spine as far as needed until balance is known to be restored.

The ideal balancing strategy for finger search trees is then one that does not need to traverse all the way up to the root every time. Strategies like the standard AVL rule and weight-balanced trees are not great here, because they may require rotations all along the search path and therefore must traverse all the way up to the root. Other rank-balanced strategies like red-black trees and weak AVL trees are a great choice because their balancing procedure can stop as soon as a terminating case is handled. The idea of reversing the pointers along the left and right spines of the tree was also explored by Blandford and Blelloch[55] in 2001, who used a treap as the underlying balancing strategy, on which this project bases its implementation.

Reversing the nodes along the spine creates a problem for persistence by path copying. The root node maintains the size of its original left subtree, which is used to determine whether to search from the head or the tail. When the size of the left subtree of the root changes as a result of an insertion or deletion from the head, all paths that lead to the root must be copied, which includes the entire right spine. To resolve this, we break the links pointing up to the root, and use the left and right pointers of the root to reference the head and tail nodes.

Another adjustment is to always have the nodes along the spine store the size of their interior subtrees. This is already in place for the right spine because nodes store the size of their left subtree already, but the sizes along the left spine must be flipped. This breaks the convention a bit and is certainly more complex.


Figure 22: A variant of a homogenous finger search tree with access from either the head or the tail. None of the nodes reachable from the head or the tail are reachable from the other. The root stores the size of the left subtree, and the links of the root can be used as the pointers to the head and the tail.

[Plot: Average Path Length / log2(Size) vs. Operations / 10⁶; series: TreapTopDown, TreapFingerTree, Randomized, Zip]

The average path lengths of treaps and randomized trees are equivalent, and better than both zip trees and finger treaps. The balance of zip trees is slightly worse because there is an inherent bias due to their tie-breaking rule, which the inventors chose to provide an isomorphism between zip trees and skip lists. As expected, finger treaps have longer paths on average since most of the nodes are reached through the top of the inverted spine.

[Plot: Maximum Path Length / log2(Size) vs. Operations / 10⁶; series: TreapTopDown, TreapFingerTree, Randomized, Zip]

This plot shows that zip trees and finger treaps have an equivalent maximum path length, and as expected so do treaps and randomized trees. The worst-case height of red-black trees and logarithmic weight-balanced trees is 2\log_{2}(n), which is lower than all the probabilistic trees evaluated here.

[Plot: Duration in milliseconds vs. Operations / 10⁶; series: TreapTopDown, TreapFingerTree, Randomized, Zip]

The results suggest that top-down treaps are the fastest randomly-balanced tree. This is a good result because they are also the simplest to implement. Interestingly, the performance of the others is practically equivalent, even though their average and maximum path lengths vary significantly.

Self-adjusting trees

In 1985, Sleator and Tarjan introduced self-adjusting binary search trees, also known as splay trees. This strategy is not exactly self-balancing, because splay trees do not define balance in any way. Instead, they adjust to the access pattern by moving frequently accessed nodes closer to the root.

All operations on a splay tree use a fundamental algorithm called splay, which is similar to partition in that it rearranges the tree to make the node at a given position the root, but also reduces the path length of every node along the search path. The splay algorithm uses the same linking primitives as partition, but considers nodes in pairs and makes a rotation whenever two links are followed in the same direction.

In 1994, Sleator[56] published a top-down splay algorithm that also maintains subtree sizes. Their algorithm requires two additional loops to update size fields along the left-most path of the right subtree and the right-most path of the left subtree. However, these loops can be avoided by storing the size of the left subtree specifically.

The balance of splay trees is entirely arbitrary. Consider that inserting a new node into a splay tree results in that node becoming the root, so repeatedly inserting at the end of the sequence would result in one long leftward path equivalent to a linked list. Accessing the start of the sequence would then traverse the entire tree while making right rotations to bring all those nodes closer to the root, eventually making the left-most node the root of the tree.

The dynamic optimality conjecture from the original paper, if true, suggests that splay trees are a form of universally efficient search tree: in an amortized sense and to within a constant factor, no other form of search tree can beat them.

The biggest risk of splay trees is that they might be terribly unbalanced at any time. What is good is that the algorithm makes long search paths shorter, so if a path was long for one access, it will be shorter for the next. This is not necessarily true in a persistent context, because an unbalanced splay tree might be used as the base for other versions and thereby repeatedly traverse long paths.

Consider that a persistent update adjusts the structure of the tree, making copies of nodes as needed to not modify the current version, which means the base version stays unbalanced. Producing new versions from a potentially unbalanced state is therefore a risk to consider when using splay trees in a persistent context. To avoid this risk, we could balance the entire tree before branching new versions from it.

[Plot: Average Path Length / log2(Size) vs. Operations / 10⁶; series: Splay, AVLBottomUp, LBSTRelaxed, TreapTopDown]

This plot only shows the uniform distribution. The average path length of splay trees appears to be approximately equivalent to finger treaps when the access pattern is uniform, around 2\log_{2}(n).

[Plot: Duration in milliseconds vs. Operations / 10⁶; series: Splay, AVLBottomUp, LBSTRelaxed, TreapTopDown]

This plot only shows the uniform distribution. Splay trees are generally slower, but only by a constant factor. However, a uniform distribution is not where splay trees are expected to shine, since they excel at skewed access patterns where a particular section of the structure is accessed frequently.

[Plot: log2(Average Path Length / log2(Size)) vs. Operations / 10⁶; series: Splay, AVLBottomUp, LBSTRelaxed, TreapTopDown]

This plot is logarithmic in the y-axis because the average path length of splay trees is so large that it would otherwise not fit the plot. The repeating pattern is due to the maximum access pattern which alternates between inserting at the end and deleting at the start.

[Plot: Duration in milliseconds vs. Operations / 10⁶; series: Splay, AVLBottomUp, LBSTRelaxed, TreapTopDown]

Splay trees are no longer the slowest when averaged across all five access patterns. Albeit not a strong one, this result is an empirical validation of the dynamic optimality conjecture.

Height-balanced trees

On a completely different track, Prokopec and Odersky introduced conc-tree lists in 2015, which are functional height-balanced binary search trees originally introduced by the Fortress language.[57] Nodes in a conc-tree either (i) store a value but have no links, or (ii) do not store a value but link to two non-empty subtrees. Additionally, like in AVL trees, every node stores its height, and the height difference between a node's two subtrees is no greater than one. In effect, the values of the sequence are stored in the leaf nodes, and all non-leaf nodes are used only for searching purposes.

Conc-trees are simple to implement and do not use rotations or explicit copying. Instead, they make use of functional composition and smart constructors: rather than copying and modifying an existing node, they allocate and compose a new node by combining properties of other nodes. Intuitively, they always construct a new path and make no attempt to modify an existing one.
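
A sketch of a conc-tree node and a smart constructor based on this description, where a leaf stores a value and no links, and an inner node stores two non-empty subtrees and its height; the field names are illustrative, and max is the Go built-in (1.21+).

type conc struct {
   l, r   *conc
   height int
   value  any // only meaningful in a leaf
}

// inner composes a new node from two existing subtrees; nothing is modified,
// so existing paths are shared rather than copied.
func inner(l, r *conc) *conc {
   return &conc{l: l, r: r, height: 1 + max(l.height, r.height)}
}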

[Plot: Duration in milliseconds vs. Operations / 10⁶; series: Conc, Splay, AVLBottomUp, AVLRelaxedBottomUp, RedBlackBottomUp, LBSTBottomUp, LBSTRelaxed, TreapTopDown]

This plot shows persistent insertion and deletion, averaged across all access patterns. The results suggest that bottom-up logarithmic weight-balanced trees and relaxed balance generally do well in persistent contexts. Splay trees and conc-tree lists appear to be the slowest, followed by bottom-up red-black trees.

Conclusion

Logarithmic weight-balance is excellent. There are three simple implementations that each offer useful properties in practice. The bottom-up variant generally has a low average path length, great performance, and is a safe choice for a persistent implementation. The top-down variant does not use recursion, and is therefore a good candidate for a concurrent implementation. The relaxed variant also has good support for concurrency and persistence, but further study is required to explore the extent of its potential.

Restoring logarithmic weight-balance with partition is a useful tool to convert a tree with unknown or alternative balance to any of these variants. When subtree size is not available, DSW can be used to restore balance and populate subtree size information by setting the size of every node to zero during the first phase of the algorithm.

Pedagogically, the logarithmic weight-balance rule introduces the binary logarithm and bitwise operations with a clear practical application. The algorithms are intuitive, simple to program, and easy to teach. Approaching the subject in the following order builds on various concepts in a cohesive way: (i) using the logarithmic weight-balance rule to restore balance globally to teach the balancing rule and the partition algorithm, (ii) then using partial rebuilding to teach insertion, simple deletion, and amortized complexity, (iii) then to balance incrementally using rotations.

Using the balancing information as a positional index is a useful property which can also be applied to other use-cases, such as sorted sets that provide positional access. Most current red-black tree or skip list implementations can be replaced by a logarithmic weight-balanced tree to achieve better balance and better performance, and to gain positional access as a bonus, all with a simpler algorithm.

Work in Progress

References

1. Princeton, COS226: Data Structures and Algorithms, 2023.
2. Stanford, CS166: Data Structures, 2023.
3. UC Berkeley, CS 61B: Data Structures, 2023.
4. Jörg Schmücker, Apache Commons Collections, "TreeList", 2004.
5. Donald E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, 1972.
6. Clark A. Crane, Linear lists and priority queues as balanced binary trees, 1972.
7. Thomas N. Hibbard, Some combinatorial properties of certain trees with applications to searching and sorting, 1961.
8. Jeffrey L. Eppinger, An empirical study of insertion and deletion in binary search trees, 1983.
9. J. C. Culberson, J. I. Munro, Explaining the behavior of binary search trees under prolonged updates: a model and simulations, 1989.
10. J. C. Culberson, The effect of asymmetric deletions on binary search trees, 1986.
11. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein, Introduction to Algorithms, 2022.
12. Adam Drozdek, Data Structures and Algorithms in C++, 4th Edition, Section 6.6, Pages 243-249, 2013.
13. Robert Sedgewick, Kevin Wayne, Algorithms, 4th Edition, Chapter 3, Page 410, 2013.
14. Michael T. Goodrich, Roberto Tamassia, Michael H. Goldwasser, Data Structures and Algorithms in Python, 2013.
15. Steven S. Skiena, The Algorithm Design Manual, 2nd Edition, Chapter 3.4, Page 81, 2008.
16. Mark Allen Weiss, Data Structures and Algorithm Analysis in C++, 4th Edition, Chapter 4.3.4, Page 139, 2014.
17. Udi Manber, Introduction to Algorithms: A Creative Approach, Chapter 4.3, Page 73, 1989.
18. R. Bayer, E. McCreight, Organization and maintenance of large ordered indexes, 1972.
19. Ohad Rodeh, B-trees, Shadowing, and Clones, 2006.
20. Phil Bagwell, Tiark Rompf, RRB-trees: efficient immutable vectors, 2012.
21. Nicolas Stucki, Tiark Rompf, Vlad Ureche, Phil Bagwell, RRB vector: a practical general purpose immutable sequence, 2015.
22. Juan Pedro Bolívar Puente, Persistence for the masses: RRB-vectors in a systems language, 2017.
23. Quentin F. Stout, Bette L. Warren, Tree rebalancing in optimal time and space, 1986.
24. A. Colin Day, Balancing a binary tree, 1976.
25. C. J. Stephenson, A method for constructing binary search trees by making insertions at the root, 1980.
26. Ivo Muusse, An algorithm for balancing a binary search tree, 2017.
27. Salvador Roura, A new method for balancing binary search trees, 2001.
28. Timothy M. Chan, Closest-point problems simplified on the RAM, 2001.
29. Henry S. Warren, Jr., Hacker's Delight, Second Edition, Ch. 5, page 106, 2013.
30. Seonghun Cho, Sartaj Sahni, A new weight balanced binary search tree, 2000.
31. G. M. Adelson-Velsky, E. M. Landis, An algorithm for the organization of information, 1962.
32. Bernhard Haeupler, Siddhartha Sen, Robert E. Tarjan, Rank-balanced trees, 2013.
33. Leo J. Guibas, Robert Sedgewick, A dichromatic framework for balanced trees, 1978.
34. Josh Bloch, Doug Lea, java.util.TreeMap, 1997.
35. Andrea Arcangeli, Linus Torvalds, include/linux/rbtree.h, 1999.
36. Siddhartha Sen, Robert E. Tarjan, David Hong Kyun Kim, Deletion without rebalancing in binary search trees, 2018.
37. J. Nievergelt, E. M. Reingold, Binary search trees of bounded balance, 1972.
38. N. Blum, K. Mehlhorn, On the average number of rebalancing operations in weight-balanced trees, 1980.
39. Tony W. Lai, Derick Wood, A top-down updating algorithm for weight-balanced trees, 1993.
40. L. Frias, Extending STL maps using LBSTs, 2005.
41. Yoichi Hirai, Kazuhiko Yamamoto, Balancing weight-balanced trees, 2011.
42. Lukas Barth, Dorothea Wagner, Engineering top-down weight-balanced trees, 2019.
43. M. Overmars, The Design of Dynamic Data Structures, 1983.
44. A. Andersson, Improving partial rebuilding by using simple balance criteria, 1989.
45. A. Andersson, General balance trees, 1999.
46. Igal Galperin, Ronald L. Rivest, Scapegoat trees, 1993.
47. Raimund Seidel, Cecilia R. Aragon, Randomized search trees, 1996.
48. Conrado Martínez, Salvador Roura, Randomized Binary Search Trees, 1997.
49. William Pugh, Skip Lists: A Probabilistic Alternative to Balanced Trees, 1989.
50. Jeffrey Dean, Sanjay Ghemawat, RocksDB, 2011.
51. Salvatore Sanfilippo, Redis ZSET, 2009.
52. Robert E. Tarjan, Caleb C. Levy, Stephen Timmel, Zip trees, 2018.
53. Brian C. Dean, Zachary H. Jones, Exploring the duality between skip lists and binary search trees, 2007.
54. Robert E. Tarjan, Christopher J. van Wyk, An O(n log log n)-time algorithm for triangulating a simple polygon, 1988.
55. Dan Blandford, Guy Blelloch, Functional Set Operations with Treaps, 2001.
56. D. Sleator, An implementation of top-down splaying with sizes, 1994.
57. Aleksandar Prokopec, Martin Odersky, Conc-trees for functional and parallel programming, 2015.