Proposal for an efficient merkle tree implementation #4413
I think the specifics of our garbage collection are such that we do not need refcounts in the trie at all. Specifically, when we apply a block, we already store in the storage the set of nodes in the merkle tree that were added and removed during the state transition. In the context of the datastructure described above, it would be two vectors of u48s. Now, with the persistent tree, when a block is applied and the state changes, the nodes that got replaced are not immediately scheduled for deletion (i.e. they are not immediately inserted into the LSM tree). Instead, they are inserted when the block itself is garbage collected.
So no refcounting is necessary.
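A minimal sketch of that bookkeeping, with a plain `Vec` standing in for the LSM tree of deletes described in the proposal below (types and names are illustrative):

```rust
// Sketch: per applied block, remember which trie node IDs were created and which
// were replaced; only when the block itself is garbage collected are the replaced
// IDs handed over for deletion. Types and names are illustrative.
struct BlockTrieChanges {
    added: Vec<u64>,   // u48 IDs of nodes written while applying this block
    removed: Vec<u64>, // u48 IDs of nodes this block's state transition replaced
}

/// Called when the block itself is garbage collected, not when it is applied.
fn on_block_gc(changes: &BlockTrieChanges, deletes_lsm: &mut Vec<u64>) {
    deletes_lsm.extend_from_slice(&changes.removed);
}
```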
---
The proposal iterates from a writes-only, never-deletes persistent tree, to a tree with writes and deletes but no persistence, to (potentially) a persistent tree with both writes and deletes (though the last one is heavily underdeveloped).
The idea is inspired by this paper: https://diem-developers-components.netlify.app/papers/jellyfish-merkle-tree/2021-01-14.pdf
Writes only, persistent
I assume that in the entire existence of the blockchain no more than 2^48 merkle nodes will exist at the same time (including nodes that are scheduled for deletion, but are not yet GCed). If one doesn't believe in this assumption, all the calculations below need to be multiplied by corresponding constants.
With an assumption of no more than 2^48 merkle nodes existing at the same time, each merkle node can be assigned a u48 ID. In this proposal the ID also happens to be an index into a large flat file. Each merkle tree node has the following representation:
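(A sketch of one possible layout; the field names and the exact split between `children` and `extra` are assumptions, but the sizes match the numbers below.)

```rust
// Sketch only: 16 six-byte (u48) child IDs plus a 32-byte `extra` region
// add up to exactly 128 bytes, so a 4KB page holds 32 such nodes.
#[repr(C)]
struct Node {
    children: [[u8; 6]; 16], // u48 ID of each of the 16 children; 96 bytes total
    extra: [u8; 32],         // whatever the patricia tree needs, e.g. a cached node hash
}
```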
where `extra` can be used to maintain whatever data could be useful for the patricia tree. Such a node is exactly 128 bytes long, and a 4KB page fits 32 such nodes. With 16-ary nodes, assuming the tree is more or less balanced, the depth is on the order of 12.
Whenever a leaf is changed, it results in 12 nodes being updated. Since the tree is persistent, the "update" means that the 12 existing nodes are cloned, one of the `children` in each is changed, and the clones are written as new nodes. E.g. say the tree right now is as follows (for simplicity I draw a binary tree, but in practice it would be 16-ary):
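(Only the part of the tree relevant to the example is sketched; node 24's own subtree is omitted.)

```
        27
       /  \
     20    24
    /  \
  17    18
```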
And the first available ID is 32. Then if we want to change the node 18, we first write the new version of the leaf as node 32, and then do the following:

```
nodes[33].children = [17, 32]
nodes[34].children = [33, 24]
```
So the tree now is:
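(Again only the relevant part is sketched; the old nodes 18, 20 and 27 still sit in the file at their old offsets.)

```
        34
       /  \
     33    24
    /  \
  17    32
```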
These nodes are written into the file at offsets 32 * 128, 33 * 128 and 34 * 128.
The nodes are never moved, and can always be found in the file at the offset `node_id * 128`. The node 27, in particular, still exists in the file, and the entire tree as of the time when 27 was the root can be traversed from it.
Note that since the IDs that are being written are consecutive, and a single leaf change results in ~12 consecutive writes, an update of a full merkle path results in writing 12 * 128 = 1536 consecutive bytes, or, in other words, less than a page of data.
To emphasize, updating a leaf results in reading from 12 random places on disk, and writing to a single place on disk: 12 random reads, one sequential write.
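A minimal sketch of this path-copying update, with an in-memory `Vec` standing in for the flat file (the node type is simplified to u64 child IDs and all names are illustrative):

```rust
// Minimal sketch of the path-copying update: clone every node on the path from
// the root to the changed leaf, give the clones fresh consecutive IDs, and append
// them to the flat file (a Vec stands in for the file here).
#[derive(Clone)]
struct Node {
    children: [u64; 16],
}

/// Replaces the leaf reached from `root` by following `path` (one nibble per level)
/// and returns the ID of the new root. The ~12 new nodes get consecutive IDs, so
/// they land next to each other in the file: 12 random reads, one sequential write.
fn update_leaf(nodes: &mut Vec<Node>, root: u64, path: &[usize], new_leaf: Node) -> u64 {
    // Walk down, remembering which child slot was taken at each level.
    let mut walk = Vec::with_capacity(path.len());
    let mut cur = root;
    for &nibble in path {
        walk.push((cur, nibble));
        cur = nodes[cur as usize].children[nibble];
    }
    let _replaced_leaf = cur; // ID of the old leaf; it is not deleted immediately

    // Write the new leaf at the next free ID.
    let mut new_child = nodes.len() as u64;
    nodes.push(new_leaf);

    // Clone the path bottom-up; each clone points at the freshly written node below it.
    for &(node_id, nibble) in walk.iter().rev() {
        let mut clone = nodes[node_id as usize].clone();
        clone.children[nibble] = new_child;
        new_child = nodes.len() as u64;
        nodes.push(clone);
    }
    new_child // ID of the new root
}
```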
Writes and deletes, not persistent
Naturally, the tree described above can only grow. This is not acceptable: the tree needs to be garbage collected. In this section I present a simple way to implement a non-persistent tree that garbage collects nodes that were replaced, though not immediately.
E.g. in the tree example above once 32, 33 and 34 are written, a process to garbage collect 18, 20 and 27 will immediately kick off.
The way the garbage collection works is the following: at all times the list of all the IDs (which are u48s) that were deleted is maintained in an LSM tree. I.e.:

1. Newly deleted IDs are appended to the current page of deleted IDs (a 4KB page fits ~682 such u48 IDs).
2. Once the current page is full, it becomes a sorted run one page long.
3. Whenever two sorted runs of the same size exist, they are merged into a single sorted run of twice the size, so that there's always exactly one sorted run of each size.
How to merge the sorted runs in a predictable fashion with predictable overhead
If (3) above is implemented naively, at some point it will result in merging two sorted runs of size 2^32 or something like this, which cannot happen in the foreground. This is how the merging is actually implemented:
Below, `R(8)` would indicate a sorted run of size 8 that is ready, and `T(8, 4)` would mean that two sorted runs of size 8 are being merged into a sorted run of size 16, and 4 pages are already merged (and thus 12 more pages are to go). On disk the head of `R(8)` is represented as one page ordinal -- the head of the corresponding linked list -- and the head of `T(8, 4)` is represented as four page ordinals: the head of the future linked list (i.e. the first of the four pages already merged), the tail of the currently merged linked list (i.e. the fourth of the four pages already merged), and the two heads of the remainders of the two lists that are being merged.

One step of the merge takes the heads of the remainders (i.e. the first page of each of the two remainders), merges those two pages, resulting again in two pages, appends the smaller one at the end of the merged set, while the larger one stays at the beginning of one of the two sorted runs being merged. After such an operation `T(8, 4)` becomes `T(8, 5)`.
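A sketch of the bookkeeping this implies, assuming one such record is kept per sorted-run size (the type and field names are illustrative):

```rust
// Sketch of the per-size bookkeeping described above; names are illustrative.
type PageOrdinal = u64;

/// R(n): a ready sorted run, a linked list of pages identified by its head page.
struct Ready {
    head: PageOrdinal,
}

/// T(n, k): two sorted runs of size n being merged, with k result pages produced.
struct Merging {
    merged_head: PageOrdinal, // first page of the future merged linked list
    merged_tail: PageOrdinal, // last page produced so far; the next page is appended here
    remainder_a: PageOrdinal, // head of what is left of the first input run
    remainder_b: PageOrdinal, // head of what is left of the second input run
    pages_merged: u64,        // the k in T(n, k), kept here for clarity
}

/// The state kept for one sorted-run size.
enum RunState {
    Ready(Ready),
    Merging(Merging),
}
```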
In regards to "there's always exactly one sorted run of each size", `T(8, 4)` is considered to be a sorted run of size 8, not 16 (i.e. it's possible to have both `T(8, 4)` and `R(16)`, but not possible to have simultaneously `T(8, 4)` and `R(8)`).

Whenever a full page of deletes is written, each in-progress merge advances by one pair of merge steps. E.g. if the current state is `T(4, 6), T(16, 2), T(32, 4)`, it will become `R(8), T(16, 4), T(32, 6)` (`T(4, 6)` after two merges would become `T(4, 8)`, which is just `R(8)`).

Thus, for each page of deletes we need to perform up to 48 pairs of merges (one pair per sorted-run size). Note that two merges can be implemented as 3 page reads and 3 page writes (not four), thus we need to perform 144 page writes per full page of deletes written.
The heartbeat doesn't need to happen all at once; we can spread it in time, starting from the largest sorted runs and then proceeding with the smaller ones. Thus, as we populate the page of deletes (which fits ~682 deletes), we perform one pair of merges per approximately every 14 deletes, which results in 3 page writes.
14 deletes is close to the number of deletes per a full merkle path update (which for a balanced tree is 12), thus a full update of the tree results in up to three pages written to maintain the LSM tree.
Amortized cost is 1.5 pages / full merkle path update, because each sorted run is being merged half the time, and stays ready half the time. The constant can be reduced further if we observe that the sorted runs can be easily compressed: the larger sorted runs contain compact sorted lists of integers, and the differences between consecutive integers naturally fit into fewer than 6 bytes.
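A sketch of that compression, assuming a simple delta-plus-varint encoding (the encoding choice is an assumption, not part of the proposal):

```rust
// Sketch of the compression idea: store a sorted run as its first ID followed by
// the gaps between consecutive IDs, each gap encoded as a variable-length integer.
// For large, dense runs the gaps are tiny, so most entries take 1-2 bytes instead of 6.
fn encode_run(sorted_ids: &[u64]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u64;
    for &id in sorted_ids {
        let mut gap = id - prev; // IDs are sorted, so the gap is non-negative
        prev = id;
        // LEB128-style varint: 7 bits per byte, high bit set on all but the last byte.
        loop {
            let byte = (gap & 0x7f) as u8;
            gap >>= 7;
            if gap == 0 {
                out.push(byte);
                break;
            }
            out.push(byte | 0x80);
        }
    }
    out
}
```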
That gives us the list of deletes. We don't actually delete the nodes right away, though. Instead we maintain a free list of pages with merkle nodes that are more than half-empty, together with bitmasks of the deleted spots. Initially the free list is empty, and whenever a new page is needed to write merkle nodes, a new page is allocated. However, if during a merge of two sorted runs in the LSM tree we observe that there's a sequence of more than 16 IDs deleted in the same page, we remove such IDs from the sorted run, and put the page (and the bitmask that we can derive at this point) into the free list.
When a new page is needed for the merkle nodes, it is fetched from the free list, and up to 16 nodes are written into this page. Given that a single merkle update requires 12 new merkle nodes, with this change we still write less than one page on average, and at most two pages in the worst case.
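A sketch of the reclamation check described above (the constants and names are illustrative):

```rust
// Sketch of the free-list bookkeeping: while merging two sorted runs of deleted
// IDs, IDs that fall into the same page come out of the merge consecutively; if
// more than 16 of a page's 32 slots are deleted, drop those IDs from the run and
// put the page on the free list instead.
const NODES_PER_PAGE: u64 = 32;     // 4KB page / 128-byte nodes
const RECLAIM_THRESHOLD: u32 = 16;  // "more than half-empty"

struct FreePage {
    page: u64,          // page ordinal in the flat file of merkle nodes
    deleted_slots: u32, // bitmask of the deleted slots within the page
}

/// Takes the deleted IDs belonging to one page and decides whether the page
/// should go on the free list instead of keeping the IDs in the sorted run.
fn maybe_reclaim(page: u64, deleted_ids_in_page: &[u64]) -> Option<FreePage> {
    let mut mask: u32 = 0;
    for &id in deleted_ids_in_page {
        debug_assert_eq!(id / NODES_PER_PAGE, page);
        mask |= 1u32 << (id % NODES_PER_PAGE);
    }
    if mask.count_ones() > RECLAIM_THRESHOLD {
        Some(FreePage { page, deleted_slots: mask })
    } else {
        None // keep the IDs in the sorted run for now
    }
}
```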
Thus, the total overhead, including both the writes and maintaining the LSM tree, is:
- on average: 1 page of merkle nodes + 1.5 pages of LSM tree updates = 2.5 pages written;
- at most: 2 pages of merkle nodes + 3 pages of LSM tree updates = 5 pages written.
Writes and deletes, persistent
If we want writes and deletes while maintaining persistence, it becomes harder. The high-level idea would be the same as what we do today: maintain a refcount for each node, and delete a node whenever it has 0 refs.
One way to do that would be to maintain an LSM tree of refcount changes, not deletes. Note that the LSM tree would be the only source of refcounts; there's no actual hash table or any other storage that stores them. In such an LSM tree each entry is a 6-byte ID plus some number of bytes (say also 6) for the refcount delta. When such a tree is merged, if the two sorted runs being merged have the same key, the new sorted run contains it only once, and the value is the sum of the deltas.
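A minimal sketch of that merge rule, with sorted runs represented as slices of `(id, delta)` pairs (names are illustrative):

```rust
// Sketch of merging two sorted runs of (node ID, refcount delta) entries:
// equal IDs are kept once, with their deltas summed. What to do when the
// accumulated refcount reaches zero is discussed in the text below.
fn merge_refcount_runs(a: &[(u64, i64)], b: &[(u64, i64)]) -> Vec<(u64, i64)> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::with_capacity(a.len() + b.len());
    while i < a.len() || j < b.len() {
        if j == b.len() || (i < a.len() && a[i].0 < b[j].0) {
            out.push(a[i]);
            i += 1;
        } else if i == a.len() || b[j].0 < a[i].0 {
            out.push(b[j]);
            j += 1;
        } else {
            // Same node ID in both runs: keep it once, with the summed delta.
            out.push((a[i].0, a[i].1 + b[j].1));
            i += 1;
            j += 1;
        }
    }
    out
}
```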
If the longest sorted run observes the value of the refcount as 0, the node is deleted (this follows from the fact that the longest sorted run has the changes that precede in time any changes that are in the shorter sorted runs, and if a prefix of refcount changes makes the refcount zero, the node is deleted). Correspondingly, if the longest sorted run observes a large consecutive set of nodes with refcount zero, it can mark the page as half-empty, similarly to the previous section.
This solution also has the problem of determining when to reduce the refcount. In the most naive solution it is decreased when a node gets overwritten (i.e. when the page containing it becomes half-empty, is then fetched to be the current page, and a new node is written on top of it). This, however, has a huge problem: if a particular node with refcount 0 stays in a page that is otherwise untouched and never becomes half-empty, the entire subtree of that node is not GCed. So something smarter is needed here.
There's a 2x overhead from the fact that the elements are not 6 bytes, but rather 12 bytes long.
But also, naively, there's a 16x overhead from the fact that whenever we update a tree node, we need to take references on all its children (while we only needed to mark one for deletion in the previous section). There's likely a way to avoid this 16x overhead, but I do not have a complete solution yet.
Closing thoughts
There's a way to get by with the non-persistent tree if we replace `prev_hash` in the blocks with `last_final_hash`, but it has a large number of issues associated with it (starting with one extra block of delay for transactions to execute, continuing with the fact that once a block becomes final we now need to apply multiple chunks, potentially lots of them, not just one, making the performance unpredictable, and ending with the fact that nodes not maintaining the persistent tree cannot respond to state sync requests). So likely a way to make the refcounts work is needed.