Proposal for an efficient merkle tree implementation #4413
I think the specifics of our garbage collection are such that we do not need refcounts in the trie at all. Specifically, when we apply a block, we already store in the storage the set of nodes in the merkle tree that were added and removed during the state transition. In the context of the datastructure described above, it would be two vectors of u48s. Now, with the persistent tree, when a block is applied and the state changes, the nodes that got replaced are not immediately scheduled for deletion (i.e. they are not immediately inserted into the LSM tree). Instead, they are inserted when the block itself is garbage collected.
So no refcounting is necessary.
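A minimal sketch of that bookkeeping, with a plain `Vec` standing in for the LSM tree of deletes described in the proposal below (types and names are illustrative):

```rust
// Sketch: per applied block, remember which trie node IDs were created and which
// were replaced; only when the block itself is garbage collected are the replaced
// IDs handed over for deletion. Types and names are illustrative.
struct BlockTrieChanges {
    added: Vec<u64>,   // u48 IDs of nodes written while applying this block
    removed: Vec<u64>, // u48 IDs of nodes this block's state transition replaced
}

/// Called when the block itself is garbage collected, not when it is applied.
fn on_block_gc(changes: &BlockTrieChanges, deletes_lsm: &mut Vec<u64>) {
    deletes_lsm.extend_from_slice(&changes.removed);
}
```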
---
The proposal iterates from a writes-only, never-deletes persistent tree, to a tree with writes and deletes but no persistence, to (potentially) a persistent tree with both writes and deletes (though the last one is heavily underdeveloped).
The idea is inspired by this paper: https://diem-developers-components.netlify.app/papers/jellyfish-merkle-tree/2021-01-14.pdf
Writes only, persistent
I assume that in the entire existence of the blockchain no more than 2^48 merkle nodes will exist at the same time (including nodes that are scheduled for deletion, but are not yet GCed). If one doesn't believe in this assumption, all the calculations below need to be multiplied by corresponding constants.
With an assumption of no more than 2^48 merkle nodes existing at the same time, each merkle node can be assigned a u48 ID. In this proposal the ID also happens to be an index into a large flat file. Each merkle tree node has the following representation:
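(A sketch of one possible layout; the field names and the exact split between `children` and `extra` are assumptions, but the sizes match the numbers below.)

```rust
// Sketch only: 16 six-byte (u48) child IDs plus a 32-byte `extra` region
// add up to exactly 128 bytes, so a 4KB page holds 32 such nodes.
#[repr(C)]
struct Node {
    children: [[u8; 6]; 16], // u48 ID of each of the 16 children; 96 bytes total
    extra: [u8; 32],         // whatever the patricia tree needs, e.g. a cached node hash
}
```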
where `extra` can be used to maintain whatever data could be useful for the patricia tree. Such a node is exactly 128 bytes long, and a 4KB page fits 32 such nodes. With 16-ary nodes, assuming the tree is more or less balanced, the depth is on the order of 12.
Whenever a leaf is changed, it results in 12 nodes being updated. Since the tree is persistent, the "update" means that the 12 existing nodes are cloned, one of the `children` in each is changed, and the clones are written as new nodes. E.g. say the tree right now is as follows (for simplicity I draw a binary tree, but in practice it would be 16-ary):
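(Only the part of the tree relevant to the example is sketched; node 24's own subtree is omitted.)

```
        27
       /  \
     20    24
    /  \
  17    18
```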
And the first available ID is 32. Then if we want to change the node 18, we first write the new version of the leaf as node 32, and then do the following:

```
nodes[33].children = [17, 32]
nodes[34].children = [33, 24]
```
So the tree now is:
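(Again only the relevant part is sketched; the old nodes 18, 20 and 27 still sit in the file at their old offsets.)

```
        34
       /  \
     33    24
    /  \
  17    32
```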
These nodes are written into the file at offsets 32 * 128, 33 * 128 and 34 * 128.
The nodes are never moved, and can always be found in the file at the offset `node_id * 128`. The node 27, in particular, still exists in the file, and the entire tree as of the time when 27 was the root can be traversed from it.
Note that since the IDs that are being written are consecutive, and a single leaf change results in ~12 consecutive writes, an update of a full merkle path results in writing 12 * 128 = 1536 consecutive bytes, or, in other words, less than a page of data.
To emphasize, updating a leaf results in reading from 12 random places on disk, and writing to a single place on disk: 12 random reads, one sequential write.
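A minimal sketch of this path-copying update, with an in-memory `Vec` standing in for the flat file (the node type is simplified to u64 child IDs and all names are illustrative):

```rust
// Minimal sketch of the path-copying update: clone every node on the path from
// the root to the changed leaf, give the clones fresh consecutive IDs, and append
// them to the flat file (a Vec stands in for the file here).
#[derive(Clone)]
struct Node {
    children: [u64; 16],
}

/// Replaces the leaf reached from `root` by following `path` (one nibble per level)
/// and returns the ID of the new root. The ~12 new nodes get consecutive IDs, so
/// they land next to each other in the file: 12 random reads, one sequential write.
fn update_leaf(nodes: &mut Vec<Node>, root: u64, path: &[usize], new_leaf: Node) -> u64 {
    // Walk down, remembering which child slot was taken at each level.
    let mut walk = Vec::with_capacity(path.len());
    let mut cur = root;
    for &nibble in path {
        walk.push((cur, nibble));
        cur = nodes[cur as usize].children[nibble];
    }
    let _replaced_leaf = cur; // ID of the old leaf; it is not deleted immediately

    // Write the new leaf at the next free ID.
    let mut new_child = nodes.len() as u64;
    nodes.push(new_leaf);

    // Clone the path bottom-up; each clone points at the freshly written node below it.
    for &(node_id, nibble) in walk.iter().rev() {
        let mut clone = nodes[node_id as usize].clone();
        clone.children[nibble] = new_child;
        new_child = nodes.len() as u64;
        nodes.push(clone);
    }
    new_child // ID of the new root
}
```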
Writes and deletes, not persistent
Naturally, the tree described above can only grow. This is not acceptable: the tree needs to be garbage collected. In this section I present a simple way to implement a non-persistent tree that garbage collects nodes that were replaced, though not immediately.
E.g. in the tree example above once 32, 33 and 34 are written, a process to garbage collect 18, 20 and 27 will immediately kick off.
The way the garbage collection works is the following: at all times the list of all the IDs (which are u48s) that were deleted is maintained in an LSM tree. I.e.:

1. Newly deleted IDs are appended to the current page of deleted IDs (a 4KB page fits ~682 such u48 IDs).
2. Once the current page is full, it becomes a sorted run one page long.
3. Whenever two sorted runs of the same size exist, they are merged into a single sorted run of twice the size, so that there's always exactly one sorted run of each size.
How to merge the sorted runs in a predictable fashion with predictable overhead
If (3) above is implemented naively, at some point it will result in merging two sorted runs of size 2^32 or something like this, which cannot happen in the foreground. This is how the merging is actually implemented:
Below, `R(8)` would indicate a sorted run of size 8 that is ready, and `T(8, 4)` would mean that two sorted runs of size 8 are being merged into a sorted run of size 16, and 4 pages are already merged (and thus 12 more pages are to go). On disk the head of `R(8)` is represented as one page ordinal -- the head of the corresponding linked list -- and the head of `T(8, 4)` is represented as four page ordinals: the head of the future linked list (i.e. the first of the four pages already merged), the tail of the currently merged linked list (i.e. the fourth of the four pages already merged), and the two heads of the remainders of the two lists that are being merged.

One step of the merge takes the heads of the remainders (i.e. the first page of each of the two remainders), merges those two pages, resulting again in two pages, appends the smaller one at the end of the merged set, while the larger one stays at the beginning of one of the two sorted runs being merged. After such an operation `T(8, 4)` becomes `T(8, 5)`.
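A sketch of the bookkeeping this implies, assuming one such record is kept per sorted-run size (the type and field names are illustrative):

```rust
// Sketch of the per-size bookkeeping described above; names are illustrative.
type PageOrdinal = u64;

/// R(n): a ready sorted run, a linked list of pages identified by its head page.
struct Ready {
    head: PageOrdinal,
}

/// T(n, k): two sorted runs of size n being merged, with k result pages produced.
struct Merging {
    merged_head: PageOrdinal, // first page of the future merged linked list
    merged_tail: PageOrdinal, // last page produced so far; the next page is appended here
    remainder_a: PageOrdinal, // head of what is left of the first input run
    remainder_b: PageOrdinal, // head of what is left of the second input run
    pages_merged: u64,        // the k in T(n, k), kept here for clarity
}

/// The state kept for one sorted-run size.
enum RunState {
    Ready(Ready),
    Merging(Merging),
}
```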
In regards to "there's always exactly one sorted run of each size", `T(8, 4)` is considered to be a sorted run of size 8, not 16 (i.e. it's possible to have both `T(8, 4)` and `R(16)`, but not possible to have simultaneously `T(8, 4)` and `R(8)`).

Whenever a full page of deletes is written, each in-progress merge advances by one pair of merge steps. E.g. if the current state is `T(4, 6), T(16, 2), T(32, 4)`, it will become `R(8), T(16, 4), T(32, 6)` (`T(4, 6)` after two merges would become `T(4, 8)`, which is just `R(8)`).

Thus, for each page of deletes we need to perform up to 48 pairs of merges (one pair per sorted-run size). Note that two merges can be implemented as 3 page reads and 3 page writes (not four), thus we need to perform 144 page writes per full page of deletes written.
The heartbeat doesn't need to happen all at once; we can spread it in time, starting from the largest sorted runs and then proceeding with the smaller ones. Thus, as we populate the page of deletes (which fits ~682 deletes), we perform one pair of merges per approximately every 14 deletes, which results in 3 page writes.
14 deletes is close to the number of deletes per a full merkle path update (which for a balanced tree is 12), thus a full update of the tree results in up to three pages written to maintain the LSM tree.
Amortized cost is 1.5 pages / full merkle path update, because each sorted run is being merged half the time, and stays ready half the time. The constant can be reduced further if we observe that the sorted runs can be easily compressed: the larger sorted runs contain compact sorted lists of integers, and the differences between consecutive integers naturally fit into fewer than 6 bytes.
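A sketch of that compression, assuming a simple delta-plus-varint encoding (the encoding choice is an assumption, not part of the proposal):

```rust
// Sketch of the compression idea: store a sorted run as its first ID followed by
// the gaps between consecutive IDs, each gap encoded as a variable-length integer.
// For large, dense runs the gaps are tiny, so most entries take 1-2 bytes instead of 6.
fn encode_run(sorted_ids: &[u64]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u64;
    for &id in sorted_ids {
        let mut gap = id - prev; // IDs are sorted, so the gap is non-negative
        prev = id;
        // LEB128-style varint: 7 bits per byte, high bit set on all but the last byte.
        loop {
            let byte = (gap & 0x7f) as u8;
            gap >>= 7;
            if gap == 0 {
                out.push(byte);
                break;
            }
            out.push(byte | 0x80);
        }
    }
    out
}
```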
That gives us the list of deletes. We don't actually delete the nodes right away, though. Instead we maintain a free list of pages with merkle nodes that are more than half-empty, together with bitmasks of the deleted spots. Initially the free list is empty, and whenever a new page is needed to write merkle nodes, a new page is allocated. However, if during a merge of two sorted runs in the LSM tree we observe that there's a sequence of more than 16 IDs deleted in the same page, we remove such IDs from the sorted run, and put the page (and the bitmask that we can derive at this point) into the free list.
When a new page is needed for the merkle nodes, it is fetched from the free list, and up to 16 nodes are written into this page. Given that a single merkle update requires 12 new merkle nodes, with this change we still write less than one page on average, and at most two pages in the worst case.
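A sketch of the reclamation check described above (the constants and names are illustrative):

```rust
// Sketch of the free-list bookkeeping: while merging two sorted runs of deleted
// IDs, IDs that fall into the same page come out of the merge consecutively; if
// more than 16 of a page's 32 slots are deleted, drop those IDs from the run and
// put the page on the free list instead.
const NODES_PER_PAGE: u64 = 32;     // 4KB page / 128-byte nodes
const RECLAIM_THRESHOLD: u32 = 16;  // "more than half-empty"

struct FreePage {
    page: u64,          // page ordinal in the flat file of merkle nodes
    deleted_slots: u32, // bitmask of the deleted slots within the page
}

/// Takes the deleted IDs belonging to one page and decides whether the page
/// should go on the free list instead of keeping the IDs in the sorted run.
fn maybe_reclaim(page: u64, deleted_ids_in_page: &[u64]) -> Option<FreePage> {
    let mut mask: u32 = 0;
    for &id in deleted_ids_in_page {
        debug_assert_eq!(id / NODES_PER_PAGE, page);
        mask |= 1u32 << (id % NODES_PER_PAGE);
    }
    if mask.count_ones() > RECLAIM_THRESHOLD {
        Some(FreePage { page, deleted_slots: mask })
    } else {
        None // keep the IDs in the sorted run for now
    }
}
```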
Thus, the total overhead, including both the writes and maintaining the LSM tree, is:
- on average: 1 page of merkle nodes + 1.5 pages of LSM tree updates = 2.5 pages written;
- at most: 2 pages of merkle nodes + 3 pages of LSM tree updates = 5 pages written.
Writes and deletes, persistent
If we want writes and deletes while maintaining persistence, it becomes harder. The high-level idea would be the same as what we do today: maintain a refcount for each node, and delete a node whenever it has 0 refs.
One way to do that would be to maintain an LSM tree of refcount changes, not deletes. Note that the LSM tree would be the only source of refcounts; there's no actual hash table or any other storage that stores them. In such an LSM tree each entry is a 6-byte ID plus some number of bytes (say also 6) for the refcount delta. When such a tree is merged, if the two sorted runs being merged have the same key, the new sorted run contains it only once, and the value is the sum of the deltas.
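A minimal sketch of that merge rule, with sorted runs represented as slices of `(id, delta)` pairs (names are illustrative):

```rust
// Sketch of merging two sorted runs of (node ID, refcount delta) entries:
// equal IDs are kept once, with their deltas summed. What to do when the
// accumulated refcount reaches zero is discussed in the text below.
fn merge_refcount_runs(a: &[(u64, i64)], b: &[(u64, i64)]) -> Vec<(u64, i64)> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::with_capacity(a.len() + b.len());
    while i < a.len() || j < b.len() {
        if j == b.len() || (i < a.len() && a[i].0 < b[j].0) {
            out.push(a[i]);
            i += 1;
        } else if i == a.len() || b[j].0 < a[i].0 {
            out.push(b[j]);
            j += 1;
        } else {
            // Same node ID in both runs: keep it once, with the summed delta.
            out.push((a[i].0, a[i].1 + b[j].1));
            i += 1;
            j += 1;
        }
    }
    out
}
```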
If the longest sorted run observes the value of the refcount as 0, the node is deleted (this follows from the fact that the longest sorted run has the changes that precede in time any changes that are in the shorter sorted runs, and if a prefix of refcount changes makes the refcount zero, the node is deleted). Correspondingly, if the longest sorted run observes a large consecutive set of nodes with refcount zero, it can mark the page as half-empty, similarly to the previous section.
This solution also has the problem of determining when to reduce the refcount. In the most naive solution it is decreased when a node gets overwritten (i.e. when the page containing it becomes half-empty, is then fetched to be the current page, and a new node is written on top of it). This, however, has a huge problem: if a particular node with refcount 0 stays in a page that is otherwise untouched and never becomes half-empty, the entire subtree of that node is not GCed. So something smarter is needed here.
There's a 2x overhead from the fact that the elements are not 6 bytes, but rather 12 bytes long.
But also, naively, there's a 16x overhead from the fact that whenever we update a tree node, we need to take references on all its children (while we only needed to mark one for deletion in the previous section). There's likely a way to avoid this 16x overhead, but I do not have a complete solution yet.
Closing thoughts
There's a way to get by with the non-persistent tree if we replace `prev_hash` in the blocks with `last_final_hash`, but it has a large number of issues associated with it (starting with one extra block of delay for transactions to execute, continuing with the fact that once a block becomes final we now need to apply multiple chunks, potentially lots of them, not just one, making the performance unpredictable, and ending with the fact that nodes not maintaining the persistent tree cannot respond to state sync requests). So likely a way to make the refcounts work is needed.