The doc in [dupsort.md](https://github.com/ledgerwatch/erigon/blob/devel/docs/programmers_guide/dupsort.md#erigon) points to a non-existent `AbstractKV.md`. As best as I can tell, `AbstractKV.md` was reworked and renamed in commit 0bc61c06edd1d2de3d9376736e0e4f4f4e7b9ed1, and was later moved into erigon-lib. This commit simply repoints the doc at the new location. Co-authored-by: Jason Yellick <jason@enya.ai>
DupSort feature explanation
If a KV database has no concept of "Buckets/Tables/Collections", then all keys must carry a "Prefix". For example, to store block bodies and headers in the same database, the keys need the prefixes 'b' and 'h':
b1->encoded_block1
b2->encoded_block2
b3->encoded_block3
...
h1->encoded_header1
h2->encoded_header2
h3->encoded_header3
...
Of course, this overhead of 1 byte per key is not very big. But if the DB provides a concept of named "Buckets/Tables/Collections", then you can create 2 tables, 'b' and 'h', and store the keys there without prefixes. Physically, each table name is stored only once (not once per key).
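The trade-off between per-key prefixes and named tables can be illustrated by counting stored key bytes. This is a minimal sketch using plain Go maps as a stand-in for a database; `keyBytes` and `namedKeyBytes` are hypothetical helpers invented for the illustration, not part of MDBX or Erigon.

```go
package main

import "fmt"

// keyBytes sums the lengths of all keys in one table.
func keyBytes(table map[string]string) int {
	n := 0
	for k := range table {
		n += len(k)
	}
	return n
}

// namedKeyBytes sums key bytes across named tables,
// counting each table name exactly once.
func namedKeyBytes(tables map[string]map[string]string) int {
	n := 0
	for name, t := range tables {
		n += len(name) + keyBytes(t)
	}
	return n
}

func main() {
	// Layout 1: one flat namespace, every key carries a 1-byte prefix.
	flat := map[string]string{
		"b1": "encoded_block1", "b2": "encoded_block2",
		"h1": "encoded_header1", "h2": "encoded_header2",
	}
	// Layout 2: two named tables, keys stored without a prefix.
	tables := map[string]map[string]string{
		"b": {"1": "encoded_block1", "2": "encoded_block2"},
		"h": {"1": "encoded_header1", "2": "encoded_header2"},
	}
	fmt.Println(keyBytes(flat), namedKeyBytes(tables)) // 8 6
}
```

With only two keys per table the saving is small, but it grows by one prefix length for every additional key.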
But take 1 step further and introduce a concept of named "Sub-Buckets/Sub-Tables/Sub-Collections": it then becomes possible to physically store longer prefixes only once.
Let's look at ChangeSets. If block N changed account A from value X to Y:
ChangeSet -> bigEndian(N) -> A -> X
- ChangeSet - name of Table
- bigEndian(N) - name of Sub-Table
- A - key inside Sub-Table
- X - value inside Sub-Table
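The text does not say why the block number is encoded as big-endian, but a likely reason is that big-endian bytes sort in the same order as the numbers they represent, so a sorted key space visits blocks in ascending order. The sketch below checks that property; the 8-byte width and the helper name `blockKey` are assumptions made for the example.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// blockKey encodes a block number as 8 big-endian bytes (assumed width).
func blockKey(n uint64) []byte {
	k := make([]byte, 8)
	binary.BigEndian.PutUint64(k, n)
	return k
}

func main() {
	// Big-endian: byte-wise comparison agrees with numeric order.
	fmt.Println(bytes.Compare(blockKey(255), blockKey(256)) < 0) // true

	// Little-endian would sort 256 before 255.
	le255, le256 := make([]byte, 8), make([]byte, 8)
	binary.LittleEndian.PutUint64(le255, 255)
	binary.LittleEndian.PutUint64(le256, 256)
	fmt.Println(bytes.Compare(le255, le256) < 0) // false
}
```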
MDBX supports
MDBX supports "tables" (it calls them DBI) and "sub-tables" (DupSort DBI).
#MDBX_DUPSORT
Duplicate keys may be used in the database. (Or, from another perspective,
keys may have multiple data items, stored in sorted order.) By default
keys must be unique and may have only a single data item.
MDBX stores keys in a tree (B+Tree), and the keys of sub-tables in a sub-tree (which is linked to the tree of the table).
Finding the value of 1 key can still be done with a single method:
subTableName, keyInSubTable, value := db.Get(tableName, subTableName, keyInSubTable)
A common pattern to iterate over a whole 'normal' table (without sub-tables), in pseudocode:
cursor := transaction.OpenCursor(tableName)
for k, v := cursor.Seek(key); k != nil; k, v = cursor.Next() {
    // logic works with 'k' and 'v' variables
}
Iterating over a table with sub-tables:
cursor := transaction.OpenCursor(tableName)
for k, v := cursor.SeekDup(subTableName, keyInSubTable); k != nil; k, v = cursor.Next() {
    // logic works with 'subTableName', 'k' and 'v' variables
}
Straightforward enough. There is no performance penalty (only a gain from the smaller database size).
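To make the "one key, many sorted data items" idea concrete, here is a toy in-memory model of a DupSort table. It is only a sketch: `dupSortTable`, `Put`, and `SeekDup` are names invented for this example and do not mirror the MDBX or Erigon API.

```go
package main

import (
	"fmt"
	"sort"
)

// dupSortTable is a toy model of a DupSort table: each key
// (the "sub-table name") owns a sorted set of data items.
type dupSortTable map[string][]string

// Put inserts item under key, keeping the items sorted and unique.
func (t dupSortTable) Put(key, item string) {
	items := t[key]
	i := sort.SearchStrings(items, item)
	if i < len(items) && items[i] == item {
		return // already present
	}
	items = append(items, "")
	copy(items[i+1:], items[i:]) // shift the tail right by one slot
	items[i] = item
	t[key] = items
}

// SeekDup returns the items of key that are >= from, in sorted order.
func (t dupSortTable) SeekDup(key, from string) []string {
	items := t[key]
	return items[sort.SearchStrings(items, from):]
}

func main() {
	table := dupSortTable{}
	table.Put("block1", "accB")
	table.Put("block1", "accA")
	table.Put("block2", "accC")
	fmt.Println(table.SeekDup("block1", "accA")) // [accA accB]
	fmt.Println(table.SeekDup("block1", "accB")) // [accB]
}
```

The key "block1" is held once no matter how many items are stored under it, which is where the size saving comes from.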
MDBX in-depth
Max key size: 2022 bytes (the same limit applies to keys of a sub-table)
Let's look at ChangeSets. If block N changed account A from value X to Y:
ChangeSet -> bigEndian(N) -> A -> X
- ChangeSet - name of Table
- bigEndian(N) - name of Sub-Table
- A - key inside Sub-Table
- X - value inside Sub-Table
------------------------------------------------------------------------------------------
table          | sub-table-name | keyAndValueJoinedTogether (no 'value' column)
------------------------------------------------------------------------------------------
'ChangeSets'   |                |
               | {1}            | {A}+{X}
               |                | {A2}+{X2}
               | {2}            | {A3}+{X3}
               |                | {A4}+{X4}
               | ...            | ...
It's a bit unexpected, but it doesn't change much. All operations still work:
subTableName, keyAndValueJoinedTogether := cursor.Get(subTableName, keyInSubTable)
{N}, {A}+{X} := cursor.Seek({N}, {A})
You need to separate 'A' and 'X' manually. But it unlocks a bunch of new features! You can iterate over all changes in block N in sorted order. You can read just 1 exact change, even if the block changed many megabytes of state.
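Separating 'A' from 'X' is simple when 'A' has a known fixed length. The sketch below assumes a 20-byte key (the length of an Ethereum address) and models the sorted items of one block as a slice. `join`, `split`, and `seek` are hypothetical helpers, not Erigon functions.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

const addrLen = 20 // assumed fixed key length (an Ethereum address is 20 bytes)

// join builds the stored item: key bytes followed by value bytes.
func join(addr [addrLen]byte, value []byte) []byte {
	return append(append([]byte{}, addr[:]...), value...)
}

// split separates a stored item back into key and value.
// The item must be at least addrLen bytes long.
func split(item []byte) (addr, value []byte) {
	return item[:addrLen], item[addrLen:]
}

// seek binary-searches sorted items for the first one >= addr and
// reports whether that item really belongs to addr.
func seek(items [][]byte, addr []byte) (value []byte, ok bool) {
	i := sort.Search(len(items), func(i int) bool {
		return bytes.Compare(items[i], addr) >= 0
	})
	if i == len(items) || !bytes.HasPrefix(items[i], addr) {
		return nil, false
	}
	_, value = split(items[i])
	return value, true
}

func main() {
	a, b := [addrLen]byte{0x0a}, [addrLen]byte{0x0b}
	// Items of one block, already sorted as a DupSort table would keep them.
	items := [][]byte{join(a, []byte("X")), join(b, []byte("X2"))}
	v, ok := seek(items, b[:])
	fmt.Printf("%s %v\n", v, ok) // X2 true
}
```

Because the items are sorted by their leading key bytes, one exact change can be found without reading the other changes of the block.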
And the format of StorageChangeSetBucket, where Loc is the location hash (the key of the storage):
------------------------------------------------------------------------------------------
table            | sub-table-name | keyAndValueJoinedTogether (no 'value' column)
------------------------------------------------------------------------------------------
'StorageChanges' |                |
                 | {1}+{A}+{inc1} | {Loc1}+{X}
                 |                | {Loc2}+{X2}
                 |                | {Loc3}+{X3}
                 | {2}+{A}+{inc1} | {Loc4}+{X4}
                 |                | {Loc5}+{X5}
                 |                | {Loc6}+{X6}
                 |                | ...
Because the column "keyAndValueJoinedTogether" is stored as a key, it is subject to the same size limit as any other key (the max key size given above).
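The composite sub-table name {N}+{A}+{inc} from the StorageChanges layout can be built and parsed with fixed offsets. The field widths below (8-byte block number, 20-byte address, 8-byte incarnation) are assumptions made for this sketch, and `storageKey` and `parseStorageKey` are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	blockLen = 8  // assumed: block number as 8 big-endian bytes
	addrLen  = 20 // assumed: account address
	incLen   = 8  // assumed: incarnation as 8 big-endian bytes
)

// storageKey builds the sub-table name {N}+{A}+{inc}.
func storageKey(block uint64, addr [addrLen]byte, inc uint64) []byte {
	k := make([]byte, blockLen+addrLen+incLen)
	binary.BigEndian.PutUint64(k, block)
	copy(k[blockLen:], addr[:])
	binary.BigEndian.PutUint64(k[blockLen+addrLen:], inc)
	return k
}

// parseStorageKey reverses storageKey. The input must be exactly
// blockLen+addrLen+incLen bytes long.
func parseStorageKey(k []byte) (block uint64, addr [addrLen]byte, inc uint64) {
	block = binary.BigEndian.Uint64(k[:blockLen])
	copy(addr[:], k[blockLen:blockLen+addrLen])
	inc = binary.BigEndian.Uint64(k[blockLen+addrLen:])
	return
}

func main() {
	addr := [addrLen]byte{0xaa}
	k := storageKey(2, addr, 1)
	block, gotAddr, inc := parseStorageKey(k)
	fmt.Println(len(k), block, gotAddr == addr, inc) // 36 2 true 1
}
```

Under these assumed widths the sub-table name is 36 bytes, and it is stored once for all storage locations changed under that block, account, and incarnation.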
MDBX, can you do better?
By default, MDBX stores a small piece of metadata (the size of the data) for each key. Indices, by nature, store a great many keys.
If all keys in a sub-table (DupSort DBI) have the same size, MDBX can store much less metadata.
(Remember that the "keys in a sub-table" are "keyAndValueJoinedTogether": it is this joined item that must have the same size.) MDBX calls this
feature DupFixed (you can add this flag to the table configuration).
#MDB_DUPFIXED
This flag may only be used in combination with #MDB_DUPSORT. This option
tells the library that the data items for this database are all the same
size, which allows further optimizations in storage and retrieval. When
all data items are the same size, the #MDB_GET_MULTIPLE, #MDB_NEXT_MULTIPLE
and #MDB_PREV_MULTIPLE cursor operations may be used to retrieve multiple
items at once.
It means that in 1 db call you can Get/Put up to 4 KB of sub-table keys.
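When every item has the same size, a batch of items can be handed over as one contiguous buffer and cut apart by offset, with no per-item length field. The sketch below models only that splitting step. `splitFixed` is a hypothetical helper, and the buffer stands in for whatever a multi-item read returns; it does not call MDBX.

```go
package main

import "fmt"

// splitFixed cuts a buffer of back-to-back fixed-size items into slices.
// No per-item length prefix is needed because every item has the same size.
func splitFixed(page []byte, itemSize int) ([][]byte, error) {
	if itemSize <= 0 || len(page)%itemSize != 0 {
		return nil, fmt.Errorf("buffer of %d bytes is not a multiple of item size %d",
			len(page), itemSize)
	}
	items := make([][]byte, 0, len(page)/itemSize)
	for off := 0; off < len(page); off += itemSize {
		items = append(items, page[off:off+itemSize])
	}
	return items, nil
}

func main() {
	page := []byte("AAAXBBBYCCCZ") // three 4-byte items obtained in one call
	items, err := splitFixed(page, 4)
	if err != nil {
		panic(err)
	}
	for _, it := range items {
		fmt.Printf("%s\n", it)
	}
}
```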
Erigon
The goal of this article is to illustrate tricky concepts with examples. Further reading here
Erigon supports multiple typed cursors; see the KV Readme.md