Exploring the design space of binary search trees
Rudi Theunissen
Last updated: 1 May 2024
The study of binary search trees has been a fundamental topic in computer science going back to the 1950s. Anyone who has studied computer science or prepared for a technical programming interview is familiar with them, and many leading courses on algorithms and data structures teach them, including Princeton,[1] Stanford,[2] and UC Berkeley.[3] These courses usually teach the same selection of algorithms in a similar way, though never exactly the same. The motivation for this project, in part, was to innovate the way we teach fundamental algorithms and data structures, specifically binary search trees. Speaking from experience, some students might not realize how interesting this space can be because they are too distracted by seemingly arbitrary rules and acronyms.
Over the years, some binary search tree algorithms have come along that are both more intuitive and more efficient than the ones we usually learn about, but good implementations are often difficult to find and therefore difficult to teach. This project attempts to explore as many of these algorithms as possible to produce a reference that is easy to teach and learn from. Being mostly light on theory, we focus instead on experimentation and measurement. A few of the algorithms appear to be implemented for the first time, and some original ideas are also considered.
Part 1 briefly covers foundational ideas like memory allocation, linear lists, position, recursion, traversal, balance, persistence, and concurrency. This introduction explores a pedagogical approach to binary search trees as a natural extension of linked lists. Since linked lists are usually covered before binary search trees, the idea to implement the same interface as before but now with a different structure is a nice segue, cohesive and consistent.
Part 2 covers definitions of balance and explores various algorithms to globally balance a binary search tree. Topics to cover include tree rotations, partitioning, complexity, and the binary logarithm. This part is an opportunity to become more familiar with the binary tree structure and its measurements.
Part 3 explores self-balancing trees, which is where most of the research and discussion usually happens. There are many strategies to consider, including rank-balanced trees, weight-balanced trees, randomized trees, finger trees, and self-adjusting trees. Some weak and relaxed variants are also implemented and evaluated. There are multiple benchmarks and measurements to determine which strategies are better than others in different situations, and whether there might be a case to prefer other algorithms to the ones typically found in textbooks and lectures.
Part 1: Introduction
Abstract memory
A computer program uses memory to keep information in mind while working on a problem, similar to the human concept of working memory. When more memory is needed, a program can ask an allocator to reserve some, and in return receive a unique address to access that memory. All the memory that was allocated by a program is usually freed by the time the program exits, but during its lifetime there can be many requests to allocate, read, write, and free memory.
These memory interactions all have an energy cost because work must be done to track allocated memory, resolve addresses, and write to memory. A program is considered efficient when it meets its functional requirements using only a small amount of energy relative to the complexity of the problem it solves. Some problems can not avoid frequent interaction with memory, but an inefficient algorithm wastes energy by interacting with memory more often than it needs to.
This field of research relates to the organization of information in memory, focusing on the organization of the information rather than the information itself. The goal is to design structures in memory to organize information in useful ways, such as to be ordered, searchable, or countable. The information and the structures that organize it usually exist together in memory, and programs can use these structures to organize information as needed, rather than interacting with memory directly.
There are structures that organize information all around us, like filing cabinets and books and vinyl records. Consider that an empty notebook contains no information, but the structure is there for information to exist. A step-by-step cooking recipe might be written as a numbered list to suggest to the reader to follow the steps in order, one at a time. The paper on which the recipe is written is the memory, the recipe is the information, the numbered list is the structure.
Using computer memory as a design material might seem intangible, but every bit of information exists physically on hardware no differently. The computer, however, does not know anything about semantics and relationships — there is only one large pool of memory to interact with, and our task is to provide structures as a layer of abstraction between the program and the memory.
Linear lists
This project explores specifically the implementation of a linear list, which is a one-dimensional ordered sequence of a known length, indexed by position. Lists are fundamental and are often introduced as the first abstract data type when learning about data structures. So many things are lists... search results, queues, events or transactions occurring over time, even the text of this article is an ordered sequence of characters arranged in a specific order — the same characters in different positions would be gibberish, so the position of each character in the sequence is important.
For a data structure to qualify as a list, it must support the operations that describe the list data type. A program using a given list implementation might not know exactly how the structure works internally, only that it behaves like a list.
The following operations describe a reasonable list data type:
- Select produces the value at a given position.
- Update replaces the value at a given position.
- Insert introduces a new value at a given position.
- Delete removes the value at a given position.
- Join combines two lists into a new list.
- Split separates a list at a given position.
Arrays
Perhaps the most common implementation of a list in practice is the array, which is a block of adjacent fixed-size memory cells, allocated all at once, and indexed directly as an offset from the front of the block. The memory allocated by an array is the number of cells multiplied by the information capacity of each cell. The array is the structure and the values stored in each cell are the information.
To illustrate this, we can use a sheet of grid paper to represent an allocated array where every square is a cell in which a single character could be stored. The sheet of grid paper models the array and the grid squares that have characters written in them form the information as a sequence. The sequence of characters recorded on the sheet of grid paper has semantic meaning to the person or program asking us to record them, but the structure itself is not aware of that meaning.
Data structures usually offer theoretical bounds for every operation to indicate how long it might take or how much memory it might allocate, and these bounds are usually proportional to some measurable property of the structure, such as size. Some data structures are particularly efficient for certain operations, but less so for others.
Arrays are great for accessing a value by position because there are no empty spaces between values in the array, which allows the memory address to be calculated as an offset from the address of the array itself. This is called constant time, because the cost of the calculation is the same regardless of the size of the array. Inserting a value at the end of the array is also very efficient: simply write into the next empty cell.
Consider however the case where a value is inserted somewhere within the array, or in the worst-case at the beginning of the array. Since there can be no empty cells between values, space must be created by moving every value from that position one step towards the back of the array. The cost of this movement of memory is linear in the size of the array, because the upper-bound on the amount of memory to move is exactly the size of the array. Deletion is similarly costly because it leaves an empty cell, which requires space to be compacted in exactly the opposite way.
What makes an array dynamic is the ability to increase its capacity by reallocating the entire sequence into a larger block of memory, thereby copying every value from the original array to the new block. Doing so is linear in the size of the array but occurs infrequently enough to amortize the cost, allowing many low-cost operations to make up for infrequent expensive ones. Common ways to avoid this cost are to predict a good upper-bound for the initial capacity to avoid reallocation entirely, or to always double the previous capacity to effectively amortize the cost of reallocation.
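To make the mechanics concrete, here is a minimal sketch in Go (not from the original article) of insertion into a dynamic array, with the shift that creates space and the occasional doubling of capacity. Go's built-in append already behaves much like this; the sketch only makes the steps explicit.

// insert places value x at position i, shifting later values one step back.
// When the array is full, it is reallocated with double the capacity,
// amortizing the cost of reallocation across many insertions.
func insert(a []int, i int, x int) []int {
    if len(a) == cap(a) {
        b := make([]int, len(a), 2*cap(a)+1) // reallocate into a larger block
        copy(b, a)
        a = b
    }
    a = a[:len(a)+1]
    copy(a[i+1:], a[i:]) // move every value from position i one step back
    a[i] = x
    return a
}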
Linked lists
Found in every textbook on algorithms and data structures is the linked list, which consists of linked nodes that each contain one value and a pointer to the next node in the sequence. The structure provides a pointer to the first node in the sequence, thereby allowing sequential access by following one link at a time from there. The last node of the list points to an empty node, nil, which marks the end of the sequence.
To insert a new value at a given position within the list, start at the first node and follow links one node at a time until the number of links followed equals the search position. At this point we can allocate a node for the new value, point the previous node to the new node, and point the new node to the current node.
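As a minimal sketch in Go (assuming a valid position and a hypothetical ListNode type), the insertion just described might look like this:

// A linked-list node holds one value and a pointer to the next node.
type ListNode struct {
    x    int
    next *ListNode
}

// insert places value x at position i and returns the (possibly new) head.
func insert(head *ListNode, i int, x int) *ListNode {
    if i == 0 {
        return &ListNode{x: x, next: head}
    }
    p := head
    for n := 1; n < i; n++ { // follow links until just before position i
        p = p.next
    }
    p.next = &ListNode{x: x, next: p.next}
    return head
}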
Binary search trees
Instead of starting at the front of the list, what if we started the search at a node in the middle? Nodes to the right of this median point to the next node in the sequence, as they did before, but nodes to the left of the median now point to the previous node towards the start of the sequence. The node in the middle should always know its position in the sequence since all operations pass through it.
Applying the same strategy recursively to the left and right sides of the median node produces a structure known as a binary search tree. Every node now has two links, one to its left subtree and another to its right subtree. A node might link to an empty subtree, known as a leaf node, but the pointers themselves are still there because they are part of the allocated node structure. There is an empty subtree at the start of the sequence, between every node, and at the end of the sequence.
The fundamental rule that defines a binary search tree is that the nodes of the left subtree occur sequentially before the root node, and the nodes of the right subtree occur sequentially after the root node. This creates a total linear ordering of all values, and the same sequence as the original linked list can be produced from the tree structure. We do this with an inorder tree traversal, which visits every node in sequential order.
inorder (p *Node, visit func(*Node)) {
    if p == nil {
        return
    }
    inorder(p.l, visit)
    visit(p)
    inorder(p.r, visit)
}
Leaf insertion
The leaves of a binary search tree are where new nodes are attached to the structure. To insert a new value at a given position, start by searching for the leaf corresponding to that position, then replace it with a new node allocated for the value, thereby also allocating two new leaves. This is called leaf insertion and forms the basis of many other insertion algorithms.
Relative position
We now have a binary tree data structure with nodes pointing to other nodes, but it is not yet clear how to search for a node by position. Searching through a linked list is simple because we traverse and count exactly one node at a time. The information obtained by counting is the position of every node along the way, where the relative position of every node is exactly 1.
When we follow a branch in a binary search tree, we effectively skip all the nodes of the other branch. From the root of the tree, the first branch skips about half the nodes, then the second branch skips about half again, then half again until we find the node at the position we are searching for.
This is known as binary search, which is a well-known algorithm independent of linked nodes and tree structures. For example, ask someone to think of a number between 1 and some large number n. How many guesses would we need to guess it correctly, if after every guess we are told to go lower or higher?
The answer is no more than ⌊log₂ n⌋ + 1 by using binary search. Start in the middle, then move to the middle of the left half if the guess was too high, or to the middle of the right half if the guess was too low. Repeating this step until we guess correctly closes in on the number by halving the search space every time until there is only one choice left.
To implement binary search in a binary tree, we need to determine whether to branch to the left or to the right; do we skip the left half or the right half? Branching to the left seeks towards the start of the sequence by skipping all the nodes of the right subtree, and branching to the right seeks towards the end of the sequence by skipping all the nodes of the left subtree.
Since the pointers only reference in one direction away from the root, there is no way back up the search path once a branch is taken. To determine which branch to take requires that we know the position of the current node relative to its tree, which is exactly the number of nodes in the left subtree, or the number of nodes that would be skipped by branching to the right.
Branching to the exact median at every step is not necessary for binary search to be effective. Even a somewhat average approximation of each median might only increase the length of the search path by a few steps. The subtrees might also not divide neatly down the middle, which is only possible when the size of the tree is exactly one less than a power of 2. We therefore can not assume that the current node along the search path is the exact median of its subtree. The only way to know the number of nodes in the left subtree is to either count them, or store that information as part of the node structure.
Perhaps the most common approach to solve this problem is to store the size of every node as part of its structure, which is equal to the number of nodes reachable from the node, including itself. Adding a new node to the tree requires then that we increment the size field of every node along the path from the root to the new node. To determine the size of the left subtree, and therefore the relative position of a node, we can follow its left link to read the size field of that node, wherein lies a weakness: we always follow the left link even when the search ends up branching to the right. Ideally, to avoid unnecessary interaction with memory, the only nodes to read from memory are the nodes along the search path. This strategy benefits from being symmetric, and being able to derive the size of a tree from the size of its root.
Another idea is to store in each node its position relative to its parent. This approach results in a left node having a negative position equal to one less than the negative size of its right subtree, and a right node having a positive position one more than the size of its left subtree. Assuming that the root node always has a positive position, the absolute position of any node is then equal to the sum of all relative positions along its path from the root. This strategy is symmetrical, intuitive, and provides one bonus insight: a node is known to be a left or right descendant based on the sign of its position. However, there are two downsides to this approach: (i) we require one bit to store the sign of the position, thereby halving the utilization of the size field, and (ii) the resulting algorithms require in some cases additional checks and arguments to indicate and normalize node orientation. Insertion using this strategy would increment the size field when a search path branches left at a right node, and symmetrically decrement the size when branching right at a left node. For example, inserting a node at the start of the sequence would increment the size of the root because the size of its left subtree is increasing, then descend along the left-most path without incrementing the size of any other nodes since they all store the size of their respective right subtrees. Similarly, inserting a node at the end of the sequence would not increment the size of any node at all because all the nodes on the right-most path, including the root, store the size of their left subtrees unaffected by an insertion to the right. In 2004, Schmücker[4] proposed a list implementation using this approach to the Apache Commons Collections, which is still part of the source at the time of this writing.
Alternatively, we could store specifically the size of the left subtree, as suggested by Knuth[5] and Crane[6]. Keeping a separate size field in the outer tree structure as well as the reference to the root node then allows us to track the current subtree size at each step along the search path. The size of the right subtree is then one less than the size of the current subtree minus the size of its left subtree. This approach allows us to know both the left and right subtree sizes without the need to access either of them. Intuitively, this pulls the subtree size information one level up in the tree. Insertion then only increments the size field of a node when branching to the left, because inserting to the right would not affect the size of the left subtree. This approach therefore reduces memory interaction in exchange for a slightly more complex tree structure and some asymmetry.
This project uses the approach where every node stores the size of its left subtree.
Node {
    x Data
    s Size
    l *Node
    r *Node
}

Tree {
    root *Node
    size Size
}
search (p *Node, i Position) *Node {
    loop {
        if i == p.s {
            return p
        }
        if i < p.s {
            p = p.l
        } else {
            i = i - p.s - 1
            p = p.r
        }
    }
}
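To connect leaf insertion with the size field, here is a recursive sketch in Go (an illustration under the definitions above, with Position and Size assumed to share the same integer type, rather than the article's own implementation, which favors top-down algorithms). The outer Tree would also increment its size field.

// insert adds value x at position i using leaf insertion, incrementing the
// left-subtree size of every node where the search branches to the left.
func insert(p *Node, i Position, x Data) *Node {
    if p == nil {
        return &Node{x: x} // replace the leaf with a new node
    }
    if i <= p.s {
        p.s = p.s + 1 // the left subtree is about to grow
        p.l = insert(p.l, i, x)
    } else {
        p.r = insert(p.r, i-p.s-1, x)
    }
    return p
}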
Path length
The path length of a node is the number of links to follow to reach it from the root of the tree, or one less than the number of nodes along its path from the root. The total path length is the sum of the path length from the root to every other node, and the average path length is the total path length divided by the number of nodes. Starting from halfway along some path and turning either left or right effectively halves the maximum path length of that path. During the transformation of a linked list to a binary search tree, as shown in Figure 3, whenever a new branch is created to the middle of a sublist it halves the maximum path length of that sublist. What is the resulting maximum path length of the tree? This question is effectively asking how many times the size of the tree can be divided by 2 before reaching 1.
This is known as the binary logarithm, written as log₂ n. The binary log is frequently encountered in computer science because it relates so closely to binary numbers. For example, the number of bits required to write an integer n in binary is ⌊log₂ n⌋ + 1. The same reasoning applies to decimal numbers, where the number of digits required to write an integer n in decimal is ⌊log₁₀ n⌋ + 1, one more than the number of times we can divide n by 10 before reaching 1. A binary number therefore shortens by 1 bit when we divide by 2 and a decimal number shortens by 1 digit when we divide by 10.
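In Go, for example, the floor of the binary log can be read directly off the most-significant bit using the math/bits package (a small illustration, not part of the original article):

// log2 returns the floor of the binary logarithm of n, which is one less
// than the number of bits needed to write n in binary.
func log2(n uint64) int {
    return bits.Len64(n) - 1 // bits.Len64(1000000) is 20, so log2 is 19
}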
Deletion
The general strategy to delete a node from a binary search tree is to search for the node to be deleted, then to replace it by the join of its subtrees. A valid join function combines the left and right subtrees so that all the nodes in the left subtree still occur before all the nodes in the right subtree, while sharing a common root node. The problem to solve is to determine from which subtree the root of the join should come, and exactly how to produce it.
In 1961, Hibbard[7] proposed a simple algorithm to join two subtrees: when the right subtree is empty, the result of the join is the left subtree as-is, otherwise when the right subtree is not empty, delete its left-most node and use that node as the joining root. This algorithm therefore always produces the joining root from the right subtree if possible. Hibbard’s paper was remarkable in that it contained one of the first formal theorems about algorithms and still features frequently in lectures and textbooks.
Hibbard's algorithm is not symmetric — it never considers whether the left subtree is empty, only the right. Knuth[5] noticed this asymmetry and suggested an adjustment to symmetrically use the right subtree as-is when the left subtree is empty. This algorithm improves the symmetry but is still biased to one side because the joining root is always taken from the same side when neither subtree is empty.
In 1983, Eppinger[8] suggested a symmetric variant of Hibbard's algorithm that randomly prefers the left or the right with equal probability. Their experiments involve random insertions and deletions and measure the average path length over time. They found that the average path length actually improves when using their symmetric approach, eventually stabilizing to be better than that of random trees.
In 1989, Culberson and Munro[9][10] published an extensive study into the long-term behavior of random insertions and deletions. They presented evidence that the use of the Hibbard or Knuth algorithms, when coupled with random insertions, causes the average path length to increase after many iterations, then stabilize. They also mention the symmetric approach by Eppinger and show that the same idea when applied to Knuth's algorithm yields even better results.
Popular textbooks are inconsistent in their approach to deletion in binary search trees. For example, in the well-known Introduction to Algorithms, Cormen, Leiserson, Rivest, and Stein[11] teach Knuth's algorithm, as does Drozdek,[12] who attributes the algorithm to both Hibbard and Knuth. Sedgewick and Wayne[13] teach Hibbard's algorithm, but their code appears to implement Knuth's algorithm. Goodrich, Tamassia and Goldwasser[14] teach Hibbard's algorithm but flip it to always prefer the left subtree instead. Skiena,[15] Weiss,[16] and Manber[17] all teach Hibbard's algorithm.
So far, none of the algorithms consider subtree size to determine from which side to produce the joining root. Given that subtree size must be known to search by position, we can make use of this information to determine whether to produce the root from the left or the right subtree.
Consider the node to be deleted and its position relative to its median — a node to the left of its median has a larger right subtree, and a node to the right of its median has a larger left subtree. The difference between these subtree sizes can be thought of as the distance to the ideal branch. This distance can be reduced slightly by always producing the root from the larger subtree.
The inverse of this strategy is to instead always prefer the smaller subtree. While this might seem clearly counter-productive, consider that the path length to the joining root is likely to be shorter in the smaller subtree than the larger subtree. Generally, a lower total traversal distance corresponds to fewer interactions with memory since fewer nodes need to be accessed. In this case, while the average path length of the resulting tree might be longer than other strategies, the join itself might resolve faster by accessing fewer nodes on average.
A related idea is a practical combination of preferring the larger subtree and the randomized symmetry of Eppinger's algorithm. To make the determination, generate a random number no greater than the sum of the two subtree sizes, then prefer the left subtree if that number is less than the size of the left subtree. When the left subtree is smaller, there is a lower probability that the generated number will be less than its size, and therefore a higher probability that the larger subtree will be chosen. This results in a lower average path length than the symmetric algorithm of Eppinger, and both the Hibbard and Knuth algorithms. This is to be expected, because the Eppinger algorithm has a 50% chance of choosing the larger subtree, and the randomized strategy still has some chance of choosing the smaller subtree.
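A recursive sketch in Go of deletion by position using this kind of size-based join, always producing the joining root from the larger subtree (hypothetical helper names, the same Node fields as before; the caller passes the total tree size and decrements it afterwards):

// remove deletes the node at position i from a subtree of size s and returns
// the new subtree root. It assumes 0 <= i < s, so the node exists.
func remove(p *Node, s Size, i Position) *Node {
    if i < p.s {
        sl := p.s
        p.s = p.s - 1 // the left subtree is about to shrink
        p.l = remove(p.l, sl, i)
        return p
    }
    if i > p.s {
        p.r = remove(p.r, s-p.s-1, i-p.s-1)
        return p
    }
    return join(p.l, p.r, p.s, s-p.s-1)
}

// join combines subtrees l of size sl and r of size sr, producing the
// joining root from the larger of the two.
func join(l, r *Node, sl, sr Size) *Node {
    if l == nil {
        return r
    }
    if r == nil {
        return l
    }
    if sl > sr {
        rest, root := detachMax(l) // right-most node of the left subtree
        root.l = rest
        root.r = r
        root.s = sl - 1
        return root
    }
    rest, root := detachMin(r) // left-most node of the right subtree
    root.l = l
    root.r = rest
    root.s = sl
    return root
}

// detachMin removes the left-most node of p and returns the remaining
// subtree along with the detached node.
func detachMin(p *Node) (rest, min *Node) {
    if p.l == nil {
        return p.r, p
    }
    p.s = p.s - 1
    p.l, min = detachMin(p.l)
    return p, min
}

// detachMax removes the right-most node of p and returns the remaining
// subtree along with the detached node.
func detachMax(p *Node) (rest, max *Node) {
    if p.r == nil {
        return p.l, p
    }
    p.r, max = detachMax(p.r)
    return p, max
}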
To compare the resulting average path length of each strategy, a random tree of 1,000 nodes was constructed and measured across 1,000,000,000 random insertions and deletions. The results show that always preferring the larger subtree results in the lowest average path length. All the evaluated algorithms appear to stabilize with an average path length that is logarithmic in the size of the tree.
Persistence
A persistent data structure can create many independent versions of itself over time. Any version can be modified to produce a new version without affecting existing versions. To achieve this, a program can preserve the current version by copying it before making changes to produce a new version.
Reference counting
To avoid copying all the nodes when copying a tree, we can allow trees to share common subtrees in memory. Some tree far in the future might still reference a node that was allocated for a much earlier version. We need to keep track of this, so we store in every node a reference count to indicate the number of other trees that also reference that node.
When the reference count is zero, it suggests that no other trees share that node, so the tree has ownership and may modify it. Otherwise, when the reference count is greater than zero, it suggests that at least one other tree shares that node in memory. In this case, modifying the node would unintentionally also modify the other trees that reference it, so the node must first be detached from its shared state. This can be done by replacing the node with a copy, setting the reference count of the copy to zero, sharing the left and right subtrees by incrementing their reference counts, and decrementing the reference count of the initial node since it is no longer referenced by this tree.
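A sketch in Go of this copy-on-write step, assuming a refs field added to the Node structure (not shown in the earlier definition):

// detach returns a node that is safe to modify: the node itself if it is
// owned by this tree, otherwise a copy that shares the same subtrees.
func detach(p *Node) *Node {
    if p.refs == 0 {
        return p // not shared, safe to modify in place
    }
    q := &Node{x: p.x, s: p.s, l: p.l, r: p.r}
    if q.l != nil {
        q.l.refs = q.l.refs + 1 // the subtrees are now also shared by the copy
    }
    if q.r != nil {
        q.r.refs = q.r.refs + 1
    }
    p.refs = p.refs - 1 // this tree no longer references the original
    return q
}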
Path copying
Consider now that a node is being modified in the depths of some tree, but that node is shared so must first be copied. How can the tree know about the copy of the node being modified? How could its root reach it? When a node replaces another node on the search path, there is a parent node that must be modified to now point to the copy instead, so it too must be copied. This cascades all the way up to the root.
Generally, all paths that lead to a modification must be copied. Every node has a unique path from the root, so for any operation we can mark the nodes that need to be modified and trace their paths back to the root; it is these paths that need to be copied so that they all belong to the same new version of the tree that includes the result of the modification.
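For example, a persistent update that replaces the value at a given position might copy its search path like this (a sketch building on the detach helper above):

// update replaces the value at position i, copying every shared node along
// the search path so that other versions of the tree are unaffected.
func update(p *Node, i Position, x Data) *Node {
    p = detach(p) // copy the node if it is shared
    if i < p.s {
        p.l = update(p.l, i, x)
    } else if i > p.s {
        p.r = update(p.r, i-p.s-1, x)
    } else {
        p.x = x
    }
    return p
}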
Binary search trees are especially well-suited for persistence because the amount of information to copy is bounded by the maximum path length, which is usually logarithmic in the size of the tree. When this is the case, for example, a tree of 1,000,000 nodes would only need to copy ∼20 nodes to produce a new, completely independent version.
Other persistent tree data structures such as B-trees[18][19] and RRB-trees[20][21][22] pack several values into each node and have more outgoing links per node. The motivation is usually (i) to reduce memory allocation by not requiring a new node for every value, (ii) to reduce path length by having more branches, and (iii) to read multiple values at a time from memory during search. In a persistent context, these nodes must be copied no differently, but instead of copying just one value we need to copy all the values in the node. Search paths may be shorter but there is more information to copy.
The number of values to copy can be estimated by m · log_b(n), where m is the number of values per node, b is the number of branches per node, and n is the number of nodes in the tree. This is not an exact science of course, because actual path lengths may vary and some nodes with larger capacities might not be full. This expression therefore only provides an idea of the expected growth of information to copy as size increases.
Binary trees use m = 1 and b = 2, one value, two branches. B-trees in the Rust standard library use m = 11 and b = 12, and RRB-trees use m = 32 and b = 32.
Parent pointers
Many implementations of binary search trees use a node structure that includes a third link: the parent pointer, which points back up towards the root. Path copying requires that all paths that lead to a modification must be copied, but parent pointers create cycles such that all paths lead to all nodes. Copying every node is not feasible, so parent pointers are not compatible with path copying and therefore not part of our node structure. Avoiding parent pointers also reduces memory interaction because there are fewer pointers to maintain, and no need to account for the pointer in the node structure.
Concurrency
A data structure that supports concurrency can run multiple independent operations at the same time. This is not exactly parallelism, which is where the work of a single operation is divided up to be worked on together. To illustrate this in a binary search tree, as shown in Figure 11, consider one operation that searches for an existing node by position, and another that inserts a new node. These operations may start at the same time, but only one of them can win the race to the root node and lock others out. The other operation must wait for the node to be unlocked to gain access.
The search operation branches to the left, locks that node, and only then unlocks the root node. The insert operation is now clear to access the root and lock it. The second operation may be locked out again if it also branches to the left, but if it branches to the right then the two operations become independent.
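This hand-over-hand pattern, often called lock coupling, might look like the following sketch in Go, assuming each node carries a mutex field mu that is not part of the earlier Node definition, and that the position is valid:

// lockedSearch descends to the node at position i, never holding more than
// two locks at a time: the child is locked before the parent is released.
func lockedSearch(t *Tree, i Position) *Node {
    p := t.root
    p.mu.Lock()
    for i != p.s {
        var next *Node
        if i < p.s {
            next = p.l
        } else {
            i = i - p.s - 1
            next = p.r
        }
        next.mu.Lock() // lock the child before releasing the parent
        p.mu.Unlock()
        p = next
    }
    p.mu.Unlock()
    return p
}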
Recursion
Binary tree algorithms often use recursion to implement operations in two phases: a search phase descending from the root, and a maintenance phase ascending back to the root. Another operation would need to wait for the algorithm to go all the way down then come all the way back up to make final adjustments before unlocking the node. For this reason, we explore whenever possible the potential to avoid recursion by combining the search and maintenance phases into a single top-down phase in constant space. Recursive implementations are still valuable because they are often simpler to implement and help to verify top-down variants.
Part 2: Restoring Balance
Consider what happens to the tree structure when we repeatedly insert nodes at the start of the sequence. Eventually, the structure becomes unbalanced because too many branches are too far from their median. Balance is a measure of low average path length, and a linked list can be thought of as a tree with the worst possible balance.
To restore balance, we need a way to lower the average path length without changing the sequence of the nodes. Some strategies spend a lot of energy to restore perfect balance, while others spend less energy by being more relaxed. We can evaluate this as a trade-off between the cost of balancing and the benefits of balance.
Rotations
A tree rotation is a local structural adjustment that rearranges links between nodes without changing their order. The effect of a rotation can be observed by focusing on the root of the rotation and the change in the composition of its subtrees.
There are two symmetrical rotations: a right rotation which moves the root node down to the right and the left node up to the right to take its place, and a left rotation which moves the root node down to the left and the right node up to the left to take its place.
rotateR (p *Node) *Node {
    l = p.l
    p.l = l.r
    l.r = p
    p.s = p.s - l.s - 1
    return l
}

rotateL (p *Node) *Node {
    r = p.r
    p.r = r.l
    r.l = p
    r.s = r.s + p.s + 1
    return r
}
A well-known strategy to restore balance using rotations is the Day-Stout-Warren algorithm, or simply DSW, which was designed by Stout and Warren[23] in 1986 based on work by Day[24] in 1976. This algorithm first transforms a tree into a linked list using right rotations, then transforms that linked list into a perfectly balanced tree using left rotations — all done in linear time and constant space.
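As a sketch of the first phase in Go (assuming the rotateR routine above, translated to Go), the tree is flattened into a right-leaning vine; the second phase, omitted here, then compresses the vine into a balanced tree with passes of left rotations.

// treeToVine flattens a tree into a right-leaning vine (a linked list)
// using repeated right rotations, returning the first node of the vine.
func treeToVine(root *Node) *Node {
    dummy := &Node{r: root}
    tail := dummy
    p := root
    for p != nil {
        if p.l != nil {
            p = rotateR(p) // lift the left child above p
            tail.r = p
        } else {
            tail = p // p is now in its final place on the vine
            p = p.r
        }
    }
    return dummy.r
}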
Partitioning
In 1980, Stephenson[25] presented an algorithm that always inserts a new node as the root of a tree, commonly referred to as root insertion. This is done by splitting the tree in two: nodes that occur before the leaf position and nodes that occur after the leaf position, then setting those subtrees as the left and right subtrees of the new node.
A related algorithm is partition, which rearranges the tree to make the node at a given position the root while preserving the order of all nodes. To partition a tree, start from the root and follow the standard search algorithm. Along the search path, when branching to the left, the current node will end up to the right of the partitioned root, and all further nodes along the search path will end up to the left of the current node. This applies symmetrically when branching to the right.
Using this information, the algorithm builds two subtrees top-down, the left partition and the right partition. When the search branches to the left, attach the current node to the left of the left-most node of the right partition, preserving the right subtree of the current node. When the search branches to the right, attach the current node to the right of the right-most node of the left partition, preserving its left subtree.
When the search reaches the node that is to become the root, attach its current left subtree to the right of the right-most node of the left partition, and attach its current right subtree to the left of the left-most node of the right partition. Then, set its left subtree to be the root of the left partition, and set its right subtree to be the root of the right partition. Finally, set this node as the root of the tree.
partition (p *Node, i Position) *Node {
    n = Node{s: i}
    l = &n
    r = &n
    for i ≠ p.s {
        if i < p.s {
            r.l = p
            p.s = p.s - i - 1
            r = r.l
            p = p.l
        } else {
            l.r = p
            i = i - p.s - 1
            l = l.r
            p = p.r
        }
    }
    r.l = p.r
    l.r = p.l
    p.l = n.r
    p.r = n.l
    p.s = n.s
    return p
}
Median balance
Definition
A node is median-balanced if the size of its left and right subtrees differ by no more than 1.
To determine if a node is median-balanced, we can add 1 to the size of its smaller subtree to see if it becomes greater than or equal to the size of its larger subtree. Without loss of generality, assume that sl ≤ sr; the node is then median-balanced if sl + 1 ≥ sr.
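Expressed as a small Go sketch, where sl and sr are the sizes of a node's left and right subtrees:

// medianBalanced reports whether the two subtree sizes differ by at most 1.
func medianBalanced(sl, sr Size) bool {
    if sl > sr {
        sl, sr = sr, sl // let sl be the smaller subtree
    }
    return sl+1 >= sr
}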
A median-balanced tree is perfectly balanced because every branch divides the search space exactly in half. In 2017, Muusse[26] published an algorithm to balance a binary search tree that uses partition to replace every node by its underlying median to produce a median-balanced tree.
Partitioning a tree around its median distributes the nodes evenly between its left and right subtrees, but there might be nodes deeper within those subtrees that are not median-balanced. Repeating the partitioning recursively in the left and right subtrees results in a median-balanced tree.
balance (p *Node, s Size) *Node {
    if p == nil {
        return p
    }
    if !balanced(p) {
        p = partition(p, s / 2)
    }
    p.l = balance(p.l, size(p.l))
    p.r = balance(p.r, size(p.r))
    return p
}
This algorithm has several useful properties: (i) it is general for any definition of balance based on subtree size, (ii) it operates top-down in constant space, (iii) it is particularly efficient when the tree is already somewhat-balanced, (iv) subtree balancing is independent so could be done in parallel, and (v) the balancing process could be stopped or paused partway through without invalidating the structure.
There are however some node arrangements that are not median-balanced but have the same average path length as if they were median-balanced. Consider the following trees that both form the same sequence of 5 nodes with an average path length of 6/5.
The first tree is not median-balanced because ⌊5/2⌋ is 2, so the median is the node at position 2, which is not the current root. The median-balancing strategy would partition at that position to make the median the root, but the average path length stays the same. This partitioning step is therefore redundant because it did not improve the balance of the tree.
Height balance
Definition
The height of a tree is the length of the longest path starting from its root.
Definition
A node is height-balanced if the height of its left and right subtrees differ by no more than 1.
We can continue to use the same partition-based balancing algorithm but change the definition of balance to only partition if the node is not already height-balanced — every node that is not height-balanced is partitioned to become median-balanced. This results in a height-balanced tree because every node that is already height-balanced is left as such, and every node that is not height-balanced becomes median-balanced.
The problem to solve is to determine whether a node is already height-balanced. Muusse[26] solves this by using a strict definition of height-balance where both the minimum and maximum path lengths of two subtrees differ by no more than 1. Then, by pretending that both subtrees are already strictly height-balanced, we can compare their minimum and maximum heights.
A binary tree is perfect if every node has either two subtrees or two empty subtrees. The height of a perfect binary tree is ⌊log₂ n⌋, which is intuitive since every level of the tree has twice the number of nodes as the one above it and the number of branches to follow from the root to the bottom of the tree is the number of times that the size can be divided by 2 before reaching 1.
When a tree is strictly height-balanced, all the levels of the tree are full except for maybe the bottom level, where some nodes might have only 1 subtree. The bottom level of the smaller subtree is emptied with floor, and the bottom level of the larger subtree is completed with ceil, giving the minimum and maximum heights of the two subtrees. We can then determine if the node is height-balanced by comparing these heights.
This function, as published, can be simplified with an identity of the discrete binary logarithm, where ⌈log₂(n + 1)⌉ = ⌊log₂ n⌋ + 1 for any n ≥ 1. This allows both sides of the inequality to be expressed in terms of the floor of the log. The result is the same whether the last level is completed with ceil or emptied with floor and incremented.
Weight balance
Definition
The weight of a node is the number of reachable leaves, which is one more than the number of nodes.
The general category of weight-balanced trees requires some definition of a bound between the number of leaves in the left and right subtrees. A median-balanced node is perfectly weight-balanced because there is an ideal distribution in the number of nodes between its left and right subtrees. A simple definition of weight-balance is to choose 2 as a constant factor, so that a node is balanced if the weight of the smaller subtree is at least half the weight of the larger subtree.
A somewhat overlooked variant of weight-balance was introduced by Roura[27] in 2001, which directly compares the binary logarithm of the weights of the left and right subtrees, rather than the weights directly.
Definition
A node is logarithmically weight-balanced if the binary log of the weights of its subtrees differ by no more than 1.
The most-significant-bit of a binary number, or MSB, is the bit-position of the left-most bit set to 1. The MSB gives the number of bits required to represent the integer in binary since all the bits to the left of the MSB will be 0 and do not affect the value. Comparing the MSB is therefore equivalent to comparing the binary log. A node is therefore logarithmically weight-balanced if the MSB of the weight of one subtree is at most one step away from the MSB of the weight of the other subtree.
Comparing the binary log
Generally, we require a function that determines whether the MSB of one integer is less than the MSB of another. The first candidate is the function used by Roura,[27] the second is by Chan,[28] and the third is from Hacker's Delight by Warren.[29] These functions allow us to compare the binary logarithm without the need to compute it.
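As an illustration of this kind of test (a sketch in Go using a well-known exclusive-or trick, not necessarily identical to any of the three published functions):

// msbLess reports whether the most-significant bit of a is strictly lower
// than that of b, i.e. whether floor(log2(a)) < floor(log2(b)).
func msbLess(a, b Size) bool {
    return a < b && a < (a^b)
}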
So far we have four definitions of balance based on the binary logarithm. There appears to be a pattern where a stronger definition of balance is relaxed by the logarithm of the same terms — what a neat relationship.
Using the bitwise function of Warren to compare the binary logarithm, and bit-shifts for division yields the following expanded solutions:
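A sketch of the two weight-balance tests in this style, using the msbLess comparison above (an approximation in Go rather than the original listing):

// weightBalanced is the linear test: the smaller weight must be at least
// half the larger weight.
func weightBalanced(wl, wr Size) bool {
    if wl > wr {
        wl, wr = wr, wl
    }
    return wl >= wr>>1
}

// logWeightBalanced is the logarithmic test: the binary logs of the weights
// may differ by no more than 1, so msb(wr) must not exceed msb(2*wl)
// (assuming weights stay well below the top bit).
func logWeightBalanced(wl, wr Size) bool {
    if wl > wr {
        wl, wr = wr, wl
    }
    return !msbLess(wl<<1, wr)
}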
Cost-optimized balance
There is one other design candidate that does not follow the same pattern. In 2000, Cho and Sahni[30] introduced the idea of cost-optimized search trees, which are trees where the average path length can not be reduced by a rotation anywhere in the tree.
Even though their definition considers rotations, we can apply this concept to partition-based balancing without the need to consider rotations at all. In this context, the only determination to make is whether a node is balanced or not. To determine whether a node is cost-optimized, we need to compare two levels of subtree sizes.
Definition
A node is cost-optimized if the size of each subtree is greater than or equal to the size of both subtrees of the other subtree.
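A sketch of this test in Go, for a node p whose subtree size is s (treating the size of an empty subtree as 0):

// costOptimized reports whether each subtree of p is at least as large as
// both subtrees of the other subtree.
func costOptimized(p *Node, s Size) bool {
    sl := p.s         // size of the left subtree
    sr := s - p.s - 1 // size of the right subtree
    if p.l != nil && (sr < p.l.s || sr < sl-p.l.s-1) {
        return false // a right rotation would reduce the average path length
    }
    if p.r != nil && (sl < p.r.s || sl < sr-p.r.s-1) {
        return false // a left rotation would reduce the average path length
    }
    return true
}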
Balancer analysis
Measurements were made in size increments of 1,000 up to 10,000,000. Many random trees were grown consistently at each size increment and balanced by each strategy. The results were generated on an Intel i5 13600K with 32GB of DDR5 RAM.
Partition Count
The total partition count is the number of nodes that required partitioning at each size increment. Dividing the total partition count by the size of the tree at each increment indicates the fraction of the tree that required partitioning. A horizontal line indicates that the total partition count is linear in the size of the tree.
The expectation was that height-balance partitions fewer nodes than median-balance because it avoids some redundant partitioning, but the results indicate that this is not the case — height-balance actually partitions more nodes than median-balance.
On the other hand, the weight-balance strategies partition the fewest nodes overall: around 15% for logarithmic weight-balance, 16% for linear weight-balance, and between 30% and 40% for median-balance and height-balance. Cost-optimized balance is somewhere in-between at around 28%.
Partition Depth
The total partition depth is the total number of links followed by partition at each size increment. This corresponds to the total number of additional nodes visited during the balancing operation. Dividing the total partition depth by the size of the tree at each increment indicates this total as a fraction of the size of the tree. A horizontal line therefore indicates that the number of nodes visited is linear in the size of the tree.
The results indicate that height-balance visits fewer nodes than median-balance on average, with both visiting in total ∼1 additional node for every node in the tree. Cost-optimized balance visits fewer additional nodes at ∼0.88, then linear weight-balance at ∼0.62, then logarithmic weight-balance with the lowest at ∼0.56.
Average Path Length
The average path length of each resulting tree measures how well-balanced they are, where lower is better. Dividing this average by the log of the size of the tree at each increment gives the results as a factor of the binary log.
The average path length of all strategies appears to approach the binary log of the size, which for the final tree of size 10,000,000 translates to an average path length of ∼23. The results show that median-balance, height-balance and DSW all result in optimal balance, and cost-optimized balance only slightly suboptimal. Using linear weight-balance results in a slightly lower average path length than logarithmic weight-balance, but only by ∼2%.
Notice that the relative difference between all strategies is very small. The average path length at the final increment ranges between 17.9 and 18.6, making the percentage difference between the least and most-balanced trees only ∼4%.
Maximum Path Length
The maximum path length of each resulting tree measures the worst-case search cost at each size increment, i.e. height. Dividing this by the log of the size of the tree at each increment gives the results as a factor of the binary log.
The maximum path length is ⌊log₂ n⌋ for median-balance, height-balance, and DSW because only the lowest level in the tree has missing nodes, which is optimal. The lowest level of the tree eventually gets completed as the size increases, after which a new level is started which increases the height by 1. The new level is twice the size of the previous level, so it takes twice as long for the height to increase again.
There is a ∼30% difference in height between the least and most-balanced strategies: cost-optimized balance has a height factor of ∼1.06, linear weight-balance ∼1.18, and logarithmic weight-balance ∼1.30.
Duration
Trees were also created in size increments of 100,000 up to 10,000,000 to measure the amount of time it takes each strategy to balance random trees on actual hardware. Keep in mind that time-based measurements are system-dependent so the results could be entirely different on other machines.
The results show that median-balance is marginally faster than height-balance, which is somewhat unexpected. Looking at duration cumulatively, it appears that the total duration is practically identical between median-balance, height-balance and cost-optimized balance.
The two weight-balance strategies are both ∼25% faster than median-balance and height-balance and ∼30% faster than DSW, with logarithmic weight-balance having a slight edge. This suggests that linear weight-balance is a good choice because it takes about the same amount of time as logarithmic weight-balance but achieves better balance. However, logarithmic weight-balance partitions fewer nodes and visits fewer nodes overall, which might offer some additional benefit on other machines.
Cost-optimized balance has a lower partition count, lower partition depth, and is slightly faster than height-balance but slightly slower than median-balance. The overhead is likely from the memory interaction of the multi-node size comparison to determine cost-optimized balance.
DSW appears to be consistently slower to balance large random trees, but the algorithm has some advantages over partition-based balancing that should not be discounted. Both algorithms use constant space, but DSW is O(n) and partition-based balancing is theoretically O(n log n), though could perhaps be proven linear with high probability in the case of random trees. The two phases of DSW are also useful procedures individually: to convert any binary search tree into a linked list, and to convert a linked list into a balanced tree. DSW also does not rely on subtree size, which may not be available in some cases.
Generally, balancing by partition is an excellent strategy when trees are already somewhat-balanced. The weight-balance strategies are the fastest and the resulting balance is only slightly suboptimal, suggesting a good trade in practice when balancing is frequent. Surprisingly, the benchmarks suggest that median-balance is likely the best choice to balance random trees when strict height-balance is a requirement or when balancing is infrequent. DSW is the best choice to balance trees that are likely to be linear lists, or when subtree size is not available.
Part 3: Self-Balancing Trees
Restoring balance to an entire tree is powerful but often disproportionately expensive because the entire tree must be traversed. There is also the question of when to balance to avoid balancing too often or not balancing often enough. Instead of either balancing the entire tree or not balancing at all, we can restore balance incrementally over time and thereby spread the balancing costs across updates.
For example, an algorithm can make structural adjustments along the search path during an update to restore balance locally. A valid balancing algorithm does not change the ordering of the nodes and always returns the tree in a balanced state. A program can then know that a given tree is balanced because it is always balanced.
This section explores various strategies to maintain balance incrementally. Some strategies make many structural changes while others are more selective; some maintain strict balance while others do not technically balance at all. The goal is to determine which strategy is likely to be a good choice in a particular situation by considering the nature of balance and the work required to maintain it.
To compare strategies, we use an operation that starts by inserting 1,000,000 values, then alternates between insertions and deletions for a total of 10,000,000 updates. An access pattern defines where updates should occur in the sequence throughout the operation. For example, the maximum access pattern always updates at the end of the sequence, i.e. the maximum position. The results presented are averaged across five access patterns: the maximum distribution, a uniformly random distribution, a normal distribution, a skewed beta distribution, and a Zipf distribution.
Rank-balanced trees
The original self-balancing binary search tree was published in 1962 as an algorithm for the organization of information by Adelson-Velsky and Landis.[31] AVL trees store the height of every node as part of the node structure, then use rotations along the search path to maintain height-balance. The most common approach to implement an AVL tree is to use recursive algorithms, where all operations start from the root, traverse along a search path, then backtrack to the root along the search path.
Knuth[5] describes how balancing in an AVL tree is only necessary at the topmost node affected by balancing, commonly referred to as the safe node.[32] This node is either the root or the last node along the search path with subtree heights not equal to each other. After a new node is inserted at the bottom of the tree, the path from the safe node is traversed again to update node heights, then balanced at the safe node using a subset of the bottom-up deletion balancing algorithm. This algorithm is not purely top-down because a section of the search path is traversed twice, but space is constant and everything above the safe node is safe to modify concurrently.
Since the invention of AVL trees, many self-balancing strategies have been proposed, the most common being the red-black tree of Guibas and Sedgewick[33] in 1978. Instead of directly storing the height of every node as in AVL trees, red-black trees use a color field in the node structure to track whether a node is red or black. The color red was chosen because it was the best-looking color produced by the laser printer available to the authors at the time, and they used red and black pens to draw trees by hand. A set of rules based on these colors determine where to make rotations along the search path to maintain balance.
Red-black trees are not technically height-balanced, but the color-based rules result in a maximum height bound of 2·log₂(n + 1), where n is the size of the tree. The same bound for AVL trees is approximately 1.44·log₂ n, so AVL trees are generally more balanced but rotate more often to maintain that balance.
Inserting into a red-black tree requires at most two rotations, and deleting requires at most three. Inserting into an AVL tree also requires at most two rotations, but deletion may require a rotation at every node along the search path. The advantages of red-black trees are that the balancing overhead per operation is constant, and that only 1 bit of information is needed to store the color.
A common response to the recurring question of whether to use an AVL tree or a red-black tree is to use AVL trees when mostly searching because the balance is better, but to use red-black trees when mostly inserting and deleting because they use fewer rotations. This is however a misconception, because inserting and deleting both require a search from the root to the bottom of the tree, which is affected by path length all the same.
Red-black trees are very common and generally considered good enough. Many implementations of the C++ STL use them for sorted sets, also Java's TreeMap,[34] and parts of the Linux kernel[35]. However, most implementations are recursive, many use parent pointers, and attempts to implement them often result in frustration.
In 2013, Haeupler, Sen & Tarjan[32] introduced rank-balanced trees, which unify many height-based balanced trees under a single framework. Rank-balanced trees store in each node an integer rank and use various rank-based rules to define AVL trees, red-black trees, and others in a consistent way.
A new strategy emerged from this framework: the weak AVL tree, which effectively combines the insertion of AVL trees with the deletion of red-black trees such that the height bound degrades gracefully from that of an AVL tree as the number of deletions increases, and is never worse than that of a red-black tree. Insertion and deletion in a weak AVL tree require at most two rotations, which is fewer than AVL trees and red-black trees. Concurrency is well-supported in weak AVL trees because both insertion and deletion can be implemented purely top-down.
In 2018, Sen, Tarjan & Kim[36] introduced relaxed rank-balanced trees, which only balance after insertion, not deletion. The balance of relaxed rank-balanced trees generally improves as nodes are inserted, and degrades slowly as nodes are deleted. A subtle property of relaxed rank-balanced trees is that their height bound is a function of the number of insertions, rather than the size of the tree.
This project compares 10 different rank-balanced trees.
Weight-balanced trees
Implementing a linear list using a binary search tree requires that we know the size of each subtree along the search path. It seems ideal then to use this information for balancing, rather than heights or ranks stored additionally in each node.
In 1972, Nievergelt and Reingold[37] introduced trees of bounded balance, or BB[α] trees, where weight-balance is maintained by keeping subtree weights within a factor of α. They proved that BB[α] trees have a maximum height that is logarithmic in the size of the tree, with a constant factor of 1/log₂(1/(1 − α)).
The original insertion algorithm was purely top-down, but Blum and Mehlhorn[38] proved in 1980 that the valid range provided for α was incorrect. Their supplied proof, however, only works for bottom-up balancing. In 1993, Lai and Wood[39] described top-down algorithms and proved their correctness for a restricted range of α.
The valid range for the α parameter is complex and its evaluation involves rational arithmetic or integer multiplication that could overflow. Ideally there exists a weight-balanced strategy that can keep the weights of subtrees within a reasonable bound without the need to perform complex calculations.
Roura[27] introduced logarithmic binary search trees or LBSTs in 2001, which use simple bitwise operations to evaluate weight-balance. The maximum height of a logarithmic weight-balanced tree is the same as that of red-black trees, about 2·log₂ n. The original bottom-up algorithm uses subtree size as weight, but Roura mentions in the final remarks that it is also possible to use the number of leaves as weight.
Frias[40] published top-down logarithmic weight-balanced trees in 2005, under advice from Roura. Their algorithms are more complex than the top-down algorithms of Lai and Wood as a result of using subtree size as weight. Using the standard definition of weight rather than subtree size simplifies their algorithms and makes them consistent with the top-down algorithms of Lai and Wood.
Hirai and Yamamoto[41] published an investigation of weight-balanced trees in 2011, wherein they introduce two balancing parameters: Δ as the constant factor that determines if two subtrees are weight-balanced, and Γ to determine the type of rotation that is needed to restore balance.
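To give a rough idea of how the two parameters are typically used (a sketch in Go based on the general scheme in the weight-balanced tree literature, not a reproduction of Hirai and Yamamoto's exact conditions), with the integer pair Δ = 3, Γ = 2 as an example:

const (
    Delta = 3 // bound on the ratio between sibling weights
    Gamma = 2 // decides between a single and a double rotation
)

// balancedWeights reports whether sibling weights wl and wr are within the
// Delta bound of each other.
func balancedWeights(wl, wr Size) bool {
    return Delta*wl >= wr && Delta*wr >= wl
}

// singleRotation reports whether a single rotation restores balance when one
// subtree is too heavy, based on the weights of that subtree's inner (wi)
// and outer (wo) subtrees; otherwise a double rotation is needed.
func singleRotation(wi, wo Size) bool {
    return wi < Gamma*wo
}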
Balancing parameters are considered feasible if they always produce valid weight-balanced trees. Hirai and Yamamoto show that the feasible space for (Δ, Γ) is actually a non-empty polytope, suggesting that the linear dependency between Δ and Γ implied by α is not necessary. The feasible parameter space is therefore expanded by separating the two independent variables coupled linearly in α.
Hirai and Yamamoto include a short section on logarithmic weight-balance, stating that "for mathematical reliability, logarithmic weight-balanced trees are simpler". They benchmarked the original bottom-up implementation against other bottom-up weight-balanced trees and found their performance to be about the same.
Barth and Wagner[42] evaluated top-down weight-balanced trees in 2019, but only mention logarithmic weight-balance in the introduction, without further evaluation. They contribute detailed benchmarks across a range of ⟨Δ, Γ⟩ parameters, but acknowledge that "it is unclear what the feasible polytope looks like". They conclude with a suggestion for further study, that "in the future, it would be interesting to determine the space of feasible balancing parameters for top-down weight-balanced trees similar to how Hirai and Yamamoto have done for bottom-up weight-balanced trees."
To produce this plot, we simulate random trees across a wide range of rational Δ and Γ parameters. Whenever a node is not weight-balanced after an operation, a counter-example exists for that pair, which makes it infeasible. The plotted area represents all feasible balancing parameter pairs for top-down weight-balanced trees.
The shape appears to be a more detailed illustration of the polytope that Hirai and Yamamoto found for bottom-up weight-balanced trees. The curve along the bottom corresponds to the linear bound proven for α by Lai and Wood. ⟨Δ, Γ⟩ = ⟨3, 2⟩ is the only integer pair in the feasible space, which Barth and Wagner describe as "overall fairly performant", notably faster than red-black trees.
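The per-node check behind these simulations might look like the following sketch, assuming the usual weight-balance condition that the heavier subtree weighs at most Δ times the lighter one, and a hypothetical weight helper that returns subtree size plus one, so an empty subtree has weight one. Γ only selects the type of rotation applied during the operation itself, so feasibility is refuted against Δ alone.

// weightBalanced walks the whole tree and reports whether every node
// satisfies the Δ condition; a false result is a counter-example that rules
// the candidate ⟨Δ, Γ⟩ pair out. The weight helper is hypothetical.
func weightBalanced(p *Node, delta float64) bool {
    if p == nil {
        return true
    }
    wl, wr := float64(weight(p.l)), float64(weight(p.r))
    if wl > delta*wr || wr > delta*wl {
        return false
    }
    return weightBalanced(p.l, delta) && weightBalanced(p.r, delta)
}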
This project compares 4 weight-balanced trees: bottom-up and top-down variants that use ⟨Δ, Γ⟩ = ⟨3, 2⟩, the original logarithmic weight-balanced tree that uses size as weight, and a top-down logarithmic weight-balanced tree.
Scapegoat trees
Exploring further we find the idea of partial rebuilding by Overmars[43] in 1983. Further development produced the general balance trees of Andersson[44][45] in 1989 and 1999, followed by the scapegoat tree of Galperin and Rivest[46] in 1993.
A tree in which every node is weight-balanced has a provable maximum height bound. Therefore, when the height of a tree exceeds this bound there must be at least one node that is not weight-balanced. Galperin and Rivest refer to this node as a scapegoat.
The basic idea of partial rebuilding is to find a node that is not balanced and then rebuild that subtree into a balanced tree, thereby restoring the height bound. Intuitively, the height of a tree that was not balanced decreases when it becomes balanced. This strategy combines the ideas of weight-balance and height-balance by allowing some nodes to not be weight-balanced, as long as the height of the tree stays within the height bound.
The original scapegoat tree uses a balancing parameter α, where 1/2 < α < 1, and height is bounded by log(n) / log(1/α). A node with subtree sizes sl and sr has a total size of s = sl + sr + 1, and is considered α-weight-balanced if sl ≤ α·s and sr ≤ α·s.
The insertion algorithm described by Galperin and Rivest is a two-pass recursive algorithm: search by descending from the root to attach a new node at the bottom of the tree, then backtrack bottom-up to find a node that is not α-weight-balanced, if any, and rebuild its subtree. There must be at least one such node along the search path when the resulting height of the insertion exceeds the maximum height bound. The suggested rebuilding algorithm recursively flattens the subtree into a list, then rotates that list back into a perfectly-balanced tree, similar to the Day-Stout-Warren algorithm.
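The rebuilding step can be sketched as follows, assuming nodes store the size of their left subtree in an s field as in the code elsewhere in this project; rather than the rotation-based approach, this version flattens the subtree into a right-leaning list in symmetric order and then builds a perfectly balanced tree directly from that list. The helper names are illustrative.

// flatten produces the nodes of p in symmetric order as a right-leaning
// list, followed by tail.
func flatten(p, tail *Node) *Node {
    if p == nil {
        return tail
    }
    p.r = flatten(p.r, tail)
    return flatten(p.l, p)
}

// build consumes the first size nodes of the list and returns a perfectly
// balanced tree of those nodes along with the remainder of the list.
func build(head *Node, size Size) (*Node, *Node) {
    if size == 0 {
        return nil, head
    }
    left, root := build(head, size/2)       // first half becomes the left subtree
    rest := root.r                          // the list continues after the root
    root.l, root.s = left, size/2           // s is the size of the left subtree
    root.r, rest = build(rest, size-size/2-1)
    return root, rest
}

Rebuilding a scapegoat of known size is then build(flatten(scapegoat, nil), size), ignoring the empty remainder.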
There are a few downsides to this strategy: (i) the recursion goes all the way back to the root even when the height bound was not exceeded, (ii) the α-parameter imposes an awkward fractional log base, and (iii) the rebuilding algorithm does not consider that the subtree is likely already somewhat-balanced, given that the height only just exceeded the bound.
Recursion can be avoided by looking for a scapegoat top-down along the search path, before it is known what the height will be. The scapegoat is then already in hand if and when the height bound is exceeded at the bottom of the tree. Evaluating weight-balance optimistically is often redundant, however, because the result is not needed when the height bound still holds. This is acceptable, given that other weight-balanced trees already evaluate balance at every node along the search path anyway.
There can be many nodes along the search path that qualify as a scapegoat, offering a choice to use the first one closest to the root, or to use the last one furthest away from the root. For a given operation, accepting the first scapegoat closest to the root allows the algorithm to skip further checks for weight-balance along the search path, whereas using the last scapegoat furthest from the root must keep checking weight-balance because it is not known whether another will be found along the way.
Choosing the scapegoat furthest from the root is a natural choice when using recursion, because once a scapegoat is passed on the way back up there is no guarantee that another will come along. To instead use the scapegoat closest to the root, the algorithm would need to keep a reference to the last scapegoat found, which by the time the recursion returns to the root is potentially far down the tree again, and updating the links from the root down to that node is cumbersome. The intuition is also that subtrees around the lower levels of the tree are smaller, so one should prefer to rebuild smaller subtrees, which require less work.
There are however some hints to suggest that rebuilding larger subtrees closer to the root is likely to produce better results. Overmars originally suggested to use the node closest to the root, and Galperin and Rivest write that "this heuristic performed better than choosing the first weight-unbalanced ancestor [from the bottom] to be the scapegoat". Even so, all available references and implementations of scapegoat trees use the recursive method and choose the first scapegoat encountered bottom-up.
Experiments confirm that it is more efficient to prefer the first scapegoat encountered top-down, therefore closest to the root. When a partial rebuild is required somewhere, we may as well balance more nodes than is strictly necessary to further delay the next rebuild.
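The top-down search can be sketched as follows, assuming the size of the tree is known at the root, a position-based descent as elsewhere in this project, and a hypothetical balanced predicate that stands in for whichever weight-balance rule is in use; the bookkeeping of the insertion itself is omitted. The function remembers the link to the first unbalanced node on the path so that a rebuilt subtree can be swapped in if the height bound turns out to be exceeded.

// firstScapegoat descends towards position i, remembering the first node on
// the path that is not weight-balanced. It returns the link that points to
// that node and the size of its subtree, or nil if every node on the path is
// balanced.
func firstScapegoat(root **Node, size Size, i Position) (link **Node, linkSize Size) {
    for p := root; *p != nil; {
        sl := (*p).s        // size of the left subtree
        sr := size - sl - 1 // size of the right subtree
        if link == nil && !balanced(sl, sr) {
            link, linkSize = p, size
        }
        if i <= sl {
            p, size = &(*p).l, sl
        } else {
            i, p, size = i-sl-1, &(*p).r, sr
        }
    }
    return
}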
The α-parameter can be avoided by using logarithmic weight-balance. The maximum height of a tree in which every node is logarithmically weight-balanced is 2·log₂(n), so there must be a node along the search path that is not logarithmically weight-balanced when the height exceeds this bound after an insertion. The equivalent α for this bound is 1/√2 ≈ 0.707.
Using logarithmic weight-balance, a tree of size n requires a partial rebuild somewhere along the search path when the height h is greater than 2·⌊log₂(n)⌋.
The final sharp edge to resolve is the choice of algorithm to rebuild the scapegoat. We can assert that the subtree is already somewhat-balanced, and we know that restoring logarithmic weight-balance at every node in that subtree would restore the height bound. Restoring balance using a partition-based strategy is therefore a good choice since most nodes should not require partitioning.
Relaxed deletion in scapegoat trees
Alongside a reference to the root node and the current size of the tree, a scapegoat tree usually stores a maximum size field: the maximum size the tree has reached since the root was last rebuilt. The theory suggests rebuilding the entire tree when the current size becomes less than α times the maximum size, then resetting the maximum size to the current size.
Occasionally rebuilding the tree when many deletions have occurred ensures that the height remains logarithmic in the size of the tree. Alternatively, we can use the concept of relaxed balance from rank-balanced trees to only balance after an insertion, not after a deletion. The intuition is that height does not increase when a node is deleted, and a following insertion will inevitably restore the height bound anyway, possibly even at the root all the same. The height is then at most logarithmic in the number of insertions rather than the size of the tree.
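A minimal sketch of the classical bookkeeping described above, assuming a tree object that stores the root, the current size, and the maximum size since the last full rebuild; the field names, the alpha constant, and the helpers are all illustrative. The relaxed variant simply omits the check and never rebuilds after a deletion.

func (t *Tree) Delete(i Position) {
    t.root = deleteAt(t.root, i) // hypothetical removal of the node at position i
    t.size--
    if float64(t.size) < alpha*float64(t.max) {
        t.root = rebuild(t.root, t.size) // rebuild the entire tree
        t.max = t.size
    }
}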
This project compares 2 relaxed weight-balanced trees: one equivalent to the ⟨3, 2⟩ weight-balance rule even though the Γ parameter is not used, and another that uses logarithmic weight-balance.
Randomly-balanced trees
All the strategies explored so far always result in the same structure for the same access pattern, which makes them consistent and predictable. This is not always a benefit, however. For example, an adversary could exploit predictability to gain information about the system. Some predictable strategies are also history-dependent, where information about previous operations can be derived from the current structure. While these concerns are uncommon in practice, unpredictability and history-independence are essential features in some applications.
Probabilistic balancing is effective regardless, and some of the most common data structures in practice use randomness in some way. Focusing on binary search trees, we explore four strategies that use randomness to keep the tree balanced.
In 1989, Seidel and Aragon[47] introduced the treap data structure, which combines the ideas of a tree and a heap. When a new node is allocated, the algorithm generates a uniformly-distributed random number as its rank, and stores that rank as part of the node structure. The only invariant to maintain is that the rank of a node must be greater than or equal to the ranks of its subtrees, so the root of the tree has the highest rank. This produces a balanced tree with high likelihood, because it becomes increasingly unlikely that a new rank introduced at the bottom of the tree is greater than the ranks above it, so most nodes end up near the bottom of the tree.
The algorithms to maintain the rank invariant are intuitive and simple to implement. There are two fundamental algorithms: split, which partitions a node into two separate trees at a given position, and join which combines two subtrees into one. These algorithms are similar to partition, operating top-down in constant space.
To insert a node, start by generating a new random rank for it, then follow the search path from the root until a node is reached with a lower rank than the new rank. Split this node into the left and right subtrees of the new node, then replace it with the new node. When the search reaches the bottom of the tree, it suggests that the rank of the new node is the minimum of its path, and therefore safe to attach there. This algorithm is exactly the root insertion algorithm of Stephenson[25], on which partition is based. To delete a node, simply find the node to delete then replace it by the join of its subtrees.
func split(p *Node, i Position) (*Node, *Node) {
    n := Node{}
    l, r := &n, &n
    for p != nil {
        if i <= p.s {
            p.s = p.s - i // i nodes leave this left subtree
            r.l = p
            r = r.l
            p = p.l
        } else {
            i = i - p.s - 1
            l.r = p
            l = l.r
            p = p.r
        }
    }
    l.r = nil
    r.l = nil
    return n.r, n.l // the left and right trees of the split
}
func join(l, r *Node, sl Size) (root *Node) {
    p := &root
    for {
        if l == nil {
            *p = r
            return
        }
        if r == nil {
            *p = l
            return
        }
        if rank(l) < rank(r) {
            r.s = r.s + sl // all of l joins r's left subtree
            *p = r
            p = &r.l
            r = *p
        } else {
            sl = sl - l.s - 1 // continue with l's right subtree
            *p = l
            p = &l.r
            l = *p
        }
    }
}
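Insertion and deletion then compose directly from split and join. The sketch below follows the description above, assuming ranks are read through the same rank function used by join and that the new node arrives with its random rank already generated; the names are illustrative.

// insert descends from the root until it reaches a subtree whose root has a
// lower rank than the new node, then splits that subtree into the new node's
// left and right subtrees and puts the new node in its place.
func insert(root *Node, i Position, x *Node) *Node {
    p := &root
    for *p != nil && rank(*p) >= rank(x) {
        if i <= (*p).s {
            (*p).s = (*p).s + 1 // the new node lands in this left subtree
            p = &(*p).l
        } else {
            i = i - (*p).s - 1
            p = &(*p).r
        }
    }
    x.l, x.r = split(*p, i)
    x.s = i // i nodes of the split subtree end up on the left
    *p = x
    return root
}

// remove finds the node at position i and replaces it with the join of its
// subtrees.
func remove(root *Node, i Position) *Node {
    p := &root
    for {
        if i < (*p).s {
            (*p).s = (*p).s - 1
            p = &(*p).l
        } else if i > (*p).s {
            i = i - (*p).s - 1
            p = &(*p).r
        } else {
            x := *p
            *p = join(x.l, x.r, x.s)
            return root
        }
    }
}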
In 1997, Martínez and Roura[48] introduced randomized binary search trees, which are based on subtree size and thereby avoid the need to generate and store a rank in every node. The algorithms are very similar to those of treaps, but the method by which random choices are made is different. Consider insertion, where a treap has a new rank in hand and searches along the path for a node with a lower rank to split and replace. Randomized trees similarly search for a node to split and replace, but decide which node that is in a different way.
Instead of generating a single random variable across the entire range, randomized trees generate a random number between zero and the size of the subtree at each step, then split and replace if the subtree size is equal to the generated random number. Subtrees tend to become smaller further down the tree, so it becomes increasingly more likely to generate a random number equal to the size of the current subtree as the search continues. Most insertions therefore occur around the bottom of the tree.
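The decision made at each node along the way might look like the following sketch, assuming the size of the current subtree is known while descending and using math/rand for illustration: the new node replaces the root of an n-node subtree with probability 1/(n+1).

import "math/rand"

// replaceHere reports whether the new node should become the root of the
// current subtree, which happens with probability 1/(size+1).
func replaceHere(size Size) bool {
    return rand.Intn(int(size)+1) == int(size)
}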
Both treaps and randomized trees produce the same kind of result, random trees, so we can expect their average and maximum path lengths to be similar. The one advantage of treaps is that they require only one random number per operation, whereas randomized trees require bounded randomness at every step along the search path. On the other hand, treaps require a rank field as part of the node structure, which randomized trees avoid by making use of the available subtree size information.
One strategy that always makes an appearance when probabilistic data structures are discussed is the skip list of Pugh[49] from 1989. While not technically a tree, skip lists are often used as an alternative to binary search trees because they have similar performance bounds and are generally considered simpler to implement. Skip lists are popular in practice, for example: RocksDB, maintained by Meta and based on LevelDB by Google, uses a skip list for its memory buffer,[50] Redis uses skip lists for sorted sets[51], and Apache HBase uses skip lists as the default in-memory store.
In 2018, Tarjan, Levy and Timmel[52] introduced zip trees, which are binary search trees that are isomorphic to skip lists, and simplify the mapping between skip lists and binary search trees by Dean and Jones[53] in 2007. Zip trees are essentially treaps, except that ranks are drawn from a geometric distribution instead of a uniform distribution, and rank ties are only allowed on the right. This strategy requires fewer bits of randomness than treaps, and the equivalence to skip lists is a novel characteristic.
The geometric rank distribution used by zip trees is the same as in skip lists, and is equivalent to the number of times a coin-flip lands on heads before landing on tails. The probability of reaching each successive rank is therefore half that of the previous rank. This aligns with binary search trees, where approximately half the nodes are in the lowest level, a quarter of the nodes are one level up, and so on to the root level with only one node. The root node therefore has the highest rank, possibly shared, corresponding to the longest streak of successive coin-flips landing on heads so far.
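Drawing such a rank takes only a few lines, sketched here with math/rand as the source of randomness: count fair coin-flips until the first tails, so a rank of k occurs with probability 1/2^(k+1).

import "math/rand"

// geometricRank returns the number of heads flipped before the first tails.
func geometricRank() (rank int) {
    for rand.Intn(2) == 1 {
        rank = rank + 1
    }
    return
}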
In 1988, Tarjan and van Wyk[54] discussed a homogeneous finger search tree, where the pointers along the left-most and right-most paths of the tree, referred to as the spines, are reversed, such that the left pointers of nodes along the left spine and the right pointers of nodes along the right spine point up to their parent instead. Access to the tree is then through the head and tail nodes rather than the root. This provides access from the start and the end of the sequence searching towards the middle, rather than starting in the middle and searching outward. The average path length is longer as a result, but access close to the start or the end of the sequence is more efficient.
Searching for a node in a homogeneous finger search tree starts at either the head or the tail of the tree and proceeds upward towards the root, until the search is about to ascend too far, at which point it descends inward. This inward branch is the same branch that a conventional top-down search would take for the same node. Balance is maintained as usual in the interior tree at the branching point, which follows the usual pointer convention, and rebalancing then continues up along the spine as far as needed until balance is known to be restored.
The ideal balancing strategy for finger search trees is then one that does not need to traverse all the way up to the root every time. Strategies like the standard AVL rule and weight-balanced trees are not great here, because they may require rotations all along the search path and therefore must traverse all the way up to the root. Other rank-balanced strategies like red-black trees and weak AVL trees are a great choice because their balancing procedure can stop as soon as a terminating case is handled. The idea of reversing the pointers along the left and right spines of the tree was also explored by Blandford and Blelloch[55] in 2001, who used a treap as the underlying balancing strategy, on which this project bases its implementation.
Reversing the nodes along the spine creates a problem for persistence by path copying. The root node maintains the size of its original left subtree, which is used to determine whether to search from the head or the tail. When the size of the left subtree of the root changes as a result of an insertion or deletion from the head, all paths that lead to the root must be copied, which includes the entire right spine. To resolve this, we break the links pointing up to the root, and use the left and right pointers of the root to reference the head and tail nodes.
Another adjustment is to always have the nodes along the spine store the size of their interior subtrees. This is already in place for the right spine because nodes store the size of their left subtree already, but the sizes along the left spine must be flipped. This breaks the convention a bit and is certainly more complex.
Self-adjusting trees
In 1985, Sleator and Tarjan introduced self-adjusting binary search trees, also known as splay trees. This strategy is not exactly self-balancing, because splay trees do not define balance in any way. Instead, they adjust to the access pattern by moving frequently accessed nodes closer to the root.
All operations on a splay tree use a fundamental algorithm called splay, which is similar to partition in that it rearranges the tree to make the node at a given position the root, but also reduces the path length of every node along the search path. The splay algorithm uses the same linking primitives as partition, but considers nodes in pairs and makes a rotation whenever two links are followed in the same direction.
In 1994, Sleator[56] published a top-down splay algorithm that also maintains subtree sizes. Their algorithm requires two additional loops to update size fields along the left-most path of the right subtree and the right-most path of the left subtree. However, these loops can be avoided by storing the size of the left subtree specifically.
Splay trees make no guarantee of balance. Consider that inserting a new node into a splay tree makes that node the root, so repeatedly inserting at the end of the sequence results in one long leftward path, equivalent to a linked list. Accessing the start of the sequence would then traverse the entire tree while making right rotations to bring those nodes closer to the root, eventually making the left-most node the root of the tree.
The dynamic optimality conjecture from the original paper, if true, suggests that splay trees are a form of universally efficient search tree: in an amortized sense and to within a constant factor, no other form of search tree can beat them.
The biggest risk of splay trees is that they might be badly unbalanced at any given time. The saving grace is that the algorithm makes long search paths shorter, so a path that was long for one access will be shorter for the next. This is not necessarily true in a persistent context, because an unbalanced splay tree might be used as the base for other versions, which then repeatedly traverse the same long paths.
Consider that a persistent update adjusts the structure of the tree, making copies of nodes as needed to not modify the current version, which means the base version stays unbalanced. Producing new versions from a potentially unbalanced state is therefore a risk to consider when using splay trees in a persistent context. To avoid this risk, we could balance the entire tree before branching new versions from it.
Height-balanced trees
On a completely different track, Prokopec and Odersky introduced conc-tree lists in 2015, which are functional height-balanced binary search trees building on an abstraction originally introduced by the Fortress language.[57] Nodes in a conc-tree either (i) store a value but have no links, or (ii) do not store a value but link to two non-empty subtrees. Additionally, as in AVL trees, every node stores its height, and the heights of two sibling subtrees differ by no more than one. In effect, the values of the sequence are stored in the leaf nodes, and all non-leaf nodes are used only for searching purposes.
Conc-trees are simple to implement and use neither rotations nor explicit copying. Instead, they make use of functional composition and smart constructors: rather than copying and modifying an existing node, they allocate and compose a new node by combining properties of other nodes. Intuitively, they always construct a new path and never modify an existing one.
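A sketch of the idea, assuming a node type in the spirit of the description above; the names are illustrative, the Value type is hypothetical, and the balancing cases of concatenation are omitted. The smart constructor allocates a new internal node from two existing subtrees instead of modifying either of them.

// A leaf stores a value and has height zero; an internal node stores no
// value and links two non-empty subtrees.
type Conc struct {
    l, r   *Conc
    height int
    value  Value // meaningful only in a leaf
}

// conc is a smart constructor: it never modifies l or r, it only allocates a
// new node that combines them.
func conc(l, r *Conc) *Conc {
    h := l.height
    if r.height > h {
        h = r.height
    }
    return &Conc{l: l, r: r, height: h + 1}
}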
Conclusion
Logarithmic weight-balance is excellent. There are three simple implementations that each offer useful properties in practice. The bottom-up variant generally has a low average path length and great performance, and is a safe choice for a persistent implementation. The top-down variant does not use recursion, and is therefore a good candidate for a concurrent implementation. The relaxed variant also has good support for concurrency and persistence, but further study is required to explore the extent of its potential.
Restoring logarithmic weight-balance with partition is a useful tool to convert a tree with unknown or alternative balance to any of these variants. When subtree size is not available, DSW can be used to restore balance and populate subtree size information by setting the size of every node to zero during the first phase of the algorithm.
Pedagogically, the logarithmic weight-balance rule introduces the binary logarithm and bitwise operations with a clear practical application. The algorithms are intuitive, simple to program, and easy to teach. Approaching the subject in the following order builds on various concepts in a cohesive way: (i) using the logarithmic weight-balance rule to restore balance globally to teach the balancing rule and the partition algorithm, (ii) then using partial rebuilding to teach insertion, simple deletion, and amortized complexity, (iii) then to balance incrementally using rotations.
Using the balancing information as a positional index is a useful property which can also be applied to other use-cases, such as sorted sets with positional access. Most current red-black tree or skip list implementations could be replaced by a logarithmic weight-balanced tree to achieve better balance and better performance, with positional access as a bonus, all with a simpler algorithm.
Work in Progress
- Implement AA trees, left-leaning 2-3, and left-leaning red-black trees.
- Fix splay trees causing stack overflow when calculating average path length.
- A finger tree that uses the relaxed AVL rule.
- A section on practical considerations, memory hierarchy etc.
- Concurrent benchmarks.
- Catalogue of experiments in the sandbox.
- Clear instructions for contributing and interacting with the project.
- Top-down red-black tree deletion.
- Refactoring, commenting, and clean-up.
- Automated PDF rendering.
- Better page-breaks for printing.
- Accessibility improvements.
- Dark theme option.
- Sorted set implementations and benchmarks.
- Can deep reinforcement learning come up with a strategy?
- References need work.
- A lot of editing.
- More tests.