How Prolly Trees Enable Version-Controlled Databases

The Foundation: B-Trees

Modern databases and filesystems rely heavily on B-trees, a self-balancing tree data structure that maintains sorted data and allows efficient insertion, deletion, and search operations. B-trees are particularly well-suited for storage systems that interact with block devices (like hard drives or SSDs) because they minimize disk reads by keeping a high branching factor. This design ensures that even large datasets can be accessed with a small number of I/O operations.

How Prolly Trees Enable Version-Controlled Databases

What Makes B-Trees Efficient

The key to a B-tree's efficiency lies in its ability to hold many keys per node, reducing tree height. For example, a typical B-tree with a branching factor of 100 can store tens of millions of entries in just three or four levels. Each node corresponds to a disk block, and reading a node requires only one disk seek. This makes B-trees a cornerstone of relational databases like MySQL, PostgreSQL, and many file systems.

Introducing Prolly Trees

While B-trees excel at current-state queries, they struggle with version control—tracking changes over time while preserving historical states. Enter Prolly trees (short for probabilistic B-trees), a variant that introduces a layer of indirection to enable efficient branching and merging of data structures. The core innovation is that nodes are identified not by fixed block addresses but by content hashes, making them immutable. When a change occurs, only the affected path from leaf to root is rewritten, and unchanged nodes are reused via hash pointers.

How Prolly Trees Differ

In a traditional B-tree, a node update modifies the block in place, destroying the previous version. A Prolly tree, by contrast, creates new nodes for every modified piece of the tree; old nodes remain untouched and accessible through stored hash pointers. This structure allows efficient diffing, merging, and branching—exactly what version control systems need. Moreover, because node boundaries are determined probabilistically (using rolling hashes of key ranges), Prolly trees naturally produce balanced splits without explicit rebalancing algorithms.

Dolt: A Version-Controlled Database

Dolt is an open-source, Apache 2.0-licensed project that marries the concepts of Git with a full SQL database. It treats an entire database like a Git repository, allowing users to commit, push, pull, and merge data just as they would code. The secret sauce behind Dolt's version control capabilities is its use of Prolly trees.

How Dolt Leverages Prolly Trees

Dolt stores the entire dataset as a single Prolly tree (or a set of trees for different tables). Each row is a key-value pair, and the tree structure maps primary keys to their corresponding values. When a user executes a UPDATE, INSERT, or DELETE statement, Dolt constructs a new Prolly tree representing the new state, reusing unchanged nodes from the prior state. This process is both space-efficient (unchanged data is shared) and time-efficient (only changed paths are recalculated). The same mechanism enables branch creation: each branch points to the root hash of its latest tree, and merging two branches involves comparing hash pointers to find common ancestors and apply changes.

Because Prolly trees support snapshot isolation, Dolt can provide transactional consistency across versions. A query at a specific commit always sees a consistent state, even if concurrent writes are occurring. This makes Dolt ideal for applications like collaborative data editing, auditing, and reproducible data pipelines.

Potential Applications Beyond Dolt

The data structure used by Dolt—Prolly trees—could inspire other projects seeking efficient version control. For instance:

File systems that support versioning (e.g., git-like file storage)
Collaborative editing platforms where multiple users modify large datasets concurrently
Backup and archiving systems that need to store incremental snapshots with low overhead
Blockchain and distributed ledger technologies that require tamper-evident historical records

The probabilistic nature of node boundaries also simplifies distributed setups: since nodes are content-addressed, they can be cached or replicated without central coordination. Any project that benefits from immutable, merkle-like trees could adopt Prolly trees as a foundation.

In summary, Prolly trees extend the proven efficiency of B-trees with the version-control capabilities of content-addressed storage. By doing so, they unlock new possibilities for data management—and Dolt shows just how powerful this combination can be.

Tags: