Change set docs (#2062)

This commit is contained in:
Alex Sharov 2021-05-31 15:29:46 +07:00 committed by GitHub
parent ee13ad17fa
commit a4ff299afb
4 changed files with 196 additions and 226 deletions


@@ -18,8 +18,8 @@ var DBSchemaVersionMDBX = types.VersionReply{Major: 2, Minor: 0, Patch: 0}
// "Plain State" - state where keys aren't hashed. "CurrentState" - same, but keys are hashed. "PlainState" is used for block execution. "CurrentState" is used mostly for Merkle root calculation.
// "incarnation" - uint64 number - how many times the given account was SelfDestruct'ed.
/*PlainStateBucket
Logical layout:
/*
PlainStateBucket logical layout:
Contains Accounts:
key - address (unhashed)
value - account encoded for storage
@@ -45,21 +45,37 @@ Physical layout:
const PlainStateBucket = "PLAIN-CST2"
const PlainStateBucketOld1 = "PLAIN-CST"
//PlainContractCodeBucket -
//key - address+incarnation
//value - code hash
var PlainContractCodeBucket = "PLAIN-contractCode"
/*
AccountChangeSetBucket and StorageChangeSetBucket store PlainStateBucket changes in logical format:
key - blockNum_u64 + key_in_plain_state
value - value_in_plain_state_before_blockNum_changes
Example: if block N changed account A from value X to Y, then:
AccountChangeSetBucket has the record: bigEndian(N) + A -> X
PlainStateBucket has the record: A -> Y
See also: docs/programmers_guide/db_walkthrough.MD#table-history-of-accounts
As you can see, if block N changes many accounts, then all records share the repetitive prefix `bigEndian(N)`.
MDBX can store such a prefix only once, via its DupSort feature (see `docs/programmers_guide/dupsort.md`).
Both buckets are DupSort-ed and have this physical format:
AccountChangeSetBucket:
key - blockNum_u64
value - address + account(encoded)
StorageChangeSetBucket:
key - blockNum_u64 + address + incarnation_u64
value - plain_storage_key + value
*/
var AccountChangeSetBucket = "PLAIN-ACS"
var StorageChangeSetBucket = "PLAIN-SCS"
const (
//PlainContractCodeBucket -
//key - address+incarnation
//value - code hash
PlainContractCodeBucket = "PLAIN-contractCode"
// AccountChangeSetBucket keeps changesets of accounts ("plain state")
// key - encoded timestamp(block number)
// value - encoded ChangeSet{k - address, v - account(encoded)}.
AccountChangeSetBucket = "PLAIN-ACS"
// StorageChangeSetBucket keeps changesets of storage ("plain state")
// key - encoded timestamp(block number)
// value - encoded ChangeSet{k - plainCompositeKey(for storage) v - originalValue(common.Hash)}.
StorageChangeSetBucket = "PLAIN-SCS"
//HashedAccountsBucket
// key - address hash
@@ -72,8 +88,8 @@ const (
CurrentStateBucketOld2 = "CST2"
)
/*AccountsHistoryBucket and StorageHistoryBucket
History index designed to serve next 2 type of requests:
/*
AccountsHistoryBucket and StorageHistoryBucket - indices designed to serve the following 2 types of requests:
1. what is the smallest block number >= X at which account A changed
2. get the last shard of A - to append new block numbers there
@@ -95,6 +111,8 @@ If `db.Seek(A+bigEndian(X))` returns non-last shard -
If `db.Seek(A+bigEndian(X))` returns last shard -
then we go to PlainState: db.Get(PlainState, A)
see also: docs/programmers_guide/db_walkthrough.MD#table-change-sets
AccountsHistoryBucket:
key - address + shard_id_u64
value - roaring bitmap - list of blocks where it changed


@@ -0,0 +1,159 @@
DupSort feature explanation
===========================
If a KV database has no concept of "Buckets/Tables/Collections", then all keys must carry a "Prefix". For example, to store
block bodies and headers one needs to use the `b` and `h` prefixes:
```
b1->encoded_block1
b2->encoded_block2
b3->encoded_block3
...
h1->encoded_header1
h2->encoded_header2
h3->encoded_header3
...
```
Of course, this 1-byte-per-key overhead is not very big. But if the DB provides the concept of named
"Buckets/Tables/Collections", then one can create 2 tables, `b` and `h`, and store the keys there without prefixes. Physically, the table
names will be stored only once (not once per key).
Going 1 step further - introducing the concept of named "Sub-Buckets/Sub-Tables/Sub-Collections" - would allow
storing longer prefixes physically only once.
Let's look at ChangeSets. If block N changed account A from value X to Y:
`ChangeSet -> bigEndian(N) -> A -> X`
- `ChangeSet` - name of Table
- `bigEndian(N)` - name of Sub-Table
- `A` - key inside Sub-Table
- `X` - value inside Sub-Table
MDBX supports
-------------
MDBX supports "tables" (it uses the name DBI) and supports "sub-tables" (DupSort DBI).
```
#MDBX_DUPSORT
Duplicate keys may be used in the database. (Or, from another perspective,
keys may have multiple data items, stored in sorted order.) By default
keys must be unique and may have only a single data item.
```
MDBX stores keys in a tree (B+Tree), and keys of sub-tables in a sub-tree (which is linked to the table's tree).
Finding the value of 1 key can still be done with a single method:
```
subTableName, keyInSubTable, value := db.Get(tableName, subTableName, keyInSubTable)
```
A common pattern for iterating over a whole 'normal' table (without sub-tables), in pseudocode:
```
cursor := transaction.OpenCursor(tableName)
for k, v := cursor.Seek(key); k != nil; k, v = cursor.Next() {
// logic works with 'k' and 'v' variables
}
```
Iterating over a table with sub-tables:
```
cursor := transaction.OpenCursor(tableName)
for k, v := cursor.SeekDup(subTableName, keyInSubTable); k != nil; k, v = cursor.Next() {
	// logic works with 'k' and 'v' variables
}
```
Straightforward enough. No performance penalty (only the benefit of a smaller database size).
MDBX in-depth
-------------
Max key size: 2022 bytes (same for the key of a sub-table)
Let's look at ChangeSets. If block N changed account A from value X to Y:
`ChangeSet -> bigEndian(N) -> A -> X`
- `ChangeSet` - name of Table
- `bigEndian(N)` - name of Sub-Table
- `A` - key inside Sub-Table
- `X` - value inside Sub-Table
```
------------------------------------------------------------------------------------------
table | sub-table-name | keyAndValueJoinedTogether (no 'value' column)
------------------------------------------------------------------------------------------
'ChangeSets' |
| {1} | {A}+{X}
| | {A2}+{X2}
| {2} | {A3}+{X3}
| | {A4}+{X4}
| ... | ...
```
It's a bit unexpected, but doesn't change much. All operations still work:
```
subTableName, keyAndValueJoinedTogether := cursor.Get(subTableName, keyInSubTable)
{N}, {A}+{X} := cursor.Seek({N}, {A})
```
You need to separate 'A' and 'X' manually. But this unlocks a bunch of new features!
You can iterate in sorted order over all changes in block N, and read just 1 exact change - even if the block changed many megabytes
of state.
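Separating 'A' and 'X' is a fixed-offset split, since a plain account key is a 20-byte address. A hypothetical sketch:

```go
package main

import (
	"errors"
	"fmt"
)

const addressLen = 20 // plain (unhashed) account key length

// splitChange separates a joined DupSort entry back into the account
// address and the pre-change value. Hypothetical helper, assuming the
// address+value layout described above.
func splitChange(joined []byte) (address, value []byte, err error) {
	if len(joined) < addressLen {
		return nil, nil, errors.New("entry shorter than an address")
	}
	return joined[:addressLen], joined[addressLen:], nil
}

func main() {
	joined := append(make([]byte, addressLen), 0xFF)
	a, v, err := splitChange(joined)
	fmt.Println(err == nil, len(a), len(v)) // true 20 1
}
```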
And the format of StorageChangeSetBucket:
Loc - location hash (the key of a storage slot)
```
------------------------------------------------------------------------------------------
table | sub-table-name | keyAndValueJoinedTogether (no 'value' column)
------------------------------------------------------------------------------------------
'StorageChanges' |
| {1}+{A}+{inc1} | {Loc1}+{X}
| | {Loc2}+{X2}
| | {Loc3}+{X3}
| {2}+{A}+{inc1} | {Loc4}+{X4}
| | {Loc5}+{X5}
| | {Loc6}+{X6}
| | ...
```
Because the "keyAndValueJoinedTogether" column is stored as a key, it has the same size limit: 2022 bytes
MDBX, can you do better?
------------------------
By default, MDBX stores small metadata (the size of the data) for each key. Indices by nature store very many keys.
If all keys in a sub-table (DupSort DBI) have the same size, MDBX can store much less metadata.
(Remember: a "key in a sub-table" is "keyAndValueJoinedTogether" - this whole thing must have the same size.) MDBX calls this
feature DupFixed (this flag can be added to the table configuration).
```
#MDB_DUPFIXED
This flag may only be used in combination with #MDB_DUPSORT. This option
tells the library that the data items for this database are all the same
size, which allows further optimizations in storage and retrieval. When
all data items are the same size, the #MDB_GET_MULTIPLE, #MDB_NEXT_MULTIPLE
and #MDB_PREV_MULTIPLE cursor operations may be used to retrieve multiple
items at once.
```
It means that in 1 db call you can Get/Put up to 4KB of sub-table keys.
[mdbx docs](https://github.com/erthink/libmdbx/blob/master/mdbx.h)
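A GET_MULTIPLE-style read hands back a page of equally sized entries packed back-to-back, so decoding it is a simple fixed-stride split. A hypothetical sketch (the entry size must match the DupFixed size declared for the table):

```go
package main

import "fmt"

// splitFixed slices a page of back-to-back fixed-size entries, as a
// GET_MULTIPLE-style read would return them. Hypothetical helper,
// for illustration only.
func splitFixed(page []byte, entrySize int) [][]byte {
	out := make([][]byte, 0, len(page)/entrySize)
	for i := 0; i+entrySize <= len(page); i += entrySize {
		out = append(out, page[i:i+entrySize])
	}
	return out
}

func main() {
	page := []byte{1, 1, 2, 2, 3, 3} // three 2-byte entries
	fmt.Println(len(splitFixed(page, 2))) // 3
}
```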
Erigon
---------
This article's goal is to show tricky concepts through examples. Further
reading [here](./db_walkthrough.MD#table-history-of-accounts)
Erigon supports multiple typed cursors, see [AbstractKV.md](./../../ethdb/AbstractKV.md)


@@ -1,207 +0,0 @@
Indices implementation in Erigon
====================================
Indices (inverted indices) allow searching data by multiple filters.
Here is an example: "In which blocks was account X updated?" (an account can be created/updated/deleted)
2 types of data, "accounts value" and "accounts history", need to be stored in 1 key-value database.
To avoid key collisions between the data types, the `account` and `history` prefixes are used.
To encode `created/updated/deleted` operations, the `C`, `U`, `D` markers are used.
```
// Picture 1
----------------------------------------------------
key | value
----------------------------------------------------
'account'{account1_address} | {account1_value}
'account'{account2_address} | {account2_value}
... | ...
'account'{accountN_address} | {accountN_value}
'history'{account1_address}'C' | {block_number1}
'history'{account1_address}'U' | {block_number2}
'history'{account1_address}'U' | {block_number3}
'history'{account1_address}'D' | {block_number4}
'history'{account2_address}'C' | {block_number5}
'history'{account2_address}'U' | {block_number6}
... | ...
'history'{accountN_address}'U' | {block_numberM}
```
**Observation 1**: the `account` and `history` prefixes are repeated over and over again - wasting disk space.
The complete solution is: the database supports "named buckets" - independent sub-databases - so collisions between buckets are impossible.
```
// Picture 2
--------------------------------------------------------------
bucket | key | value
--------------------------------------------------------------
'account' |
| {account1_address} | {account1_value}
| {account2_address} | {account2_value}
| ... | ...
| {accountN_address} | {accountN_value}
'history' |
| {account1_address}'C' | {block_number1}
| {account1_address}'U' | {block_number2}
| {account1_address}'U' | {block_number3}
| {account1_address}'D' | {block_number4}
| {account2_address}'C' | {block_number5}
| {account2_address}'U' | {block_number6}
| ... | ...
| {accountN_address}'U' | {block_numberM}
```
Most key-value databases (LevelDB, BadgerDB) do not provide such a feature, but some do (BoltDB, LMDB).
**Observation 2**: the 'history' bucket again has many repeated prefixes: the `{account1_address}` prefix repeats every time account1 changes.
This is the same problem as in "Observation 1" - can we use the same solution for the same problem?
Yes: the database supports "named sub-buckets" - independent sub-sub-databases - so collisions between sub-buckets are impossible.
```
// Picture 3
---------------------------------------------------------------------------
bucket | sub-bucket-name | key | value
---------------------------------------------------------------------------
'account' |
| {account1_address} | | {account1_value}
| {account2_address} | | {account2_value}
| ... | | ...
| {accountN_address} | | {accountN_value}
'history' |
| {account1_address} |
| | 'C' | {block_number1}
| | 'U' | {block_number2}
| | 'U' | {block_number3}
| | 'D' | {block_number4}
| {account2_address} |
| | 'C' | {block_number5}
| | 'U' | {block_number6}
| | ... | ...
| {accountN_address} |
| | 'U' | {block_numberM}
```
Keys don't have repetitive data anymore (the markers 'C','U','D' can be part of the sub-bucket name if needed).
All these tricks must keep the data accessible: search/iterate/insert operations must remain easy.
LMDB supports
-------------
LMDB supports "buckets" (it uses the name DBI) and supports "sub-buckets" (DupSort DBI).
```
#MDB_DUPSORT
Duplicate keys may be used in the database. (Or, from another perspective,
keys may have multiple data items, stored in sorted order.) By default
keys must be unique and may have only a single data item.
```
LMDB stores keys in a tree (B+Tree), and keys of sub-buckets in a sub-tree (which is linked to the bucket's tree).
Finding the value of 1 key can still be done with a single method:
```
subBucketName, keyInSubBucket, value := cursor.Get(subBucketName, keyInSubBucket)
```
A common pattern for iterating over a whole 'normal' bucket (without sub-buckets), in pseudocode:
```
cursor := transaction.OpenCursor(bucketName)
for k, v := cursor.Seek(key); k != nil; k, v = cursor.Next() {
// logic works with 'k' and 'v' variables
}
```
Iterating over a bucket with sub-buckets:
```
cursor := transaction.OpenCursor(bucketName)
for k, v := cursor.SeekDup(subBucketName, keyInSubBucket); k != nil; k, v = cursor.Next() {
	// logic works with 'k' and 'v' variables
}
```
Straightforward enough. No performance penalty (only the benefit of a smaller database size).
LMDB in-depth
-------------
Max key size: 551 bytes (same for the key of a sub-bucket)
Please take a look at 'Picture 3' again - it illustrates the high-level idea, but LMDB stores it in a different way.
'Picture 4' shows that a sub-bucket (DupSort DBI) has no "value"; it joins the bytes of key and value and stores them as the 'key':
```
// Picture 4
--------------------------------------------------------------------------------------
bucket | sub-bucket-name | keyAndValueJoinedTogether (no 'value' column)
--------------------------------------------------------------------------------------
'account' |
| {account1_address} | {account1_value}
| {account2_address} | {account2_value}
| ... | ...
| {accountN_address} | {accountN_value}
'history' |
| {account1_address} |
| | 'C'{block_number1}
| | 'U'{block_number2}
| | 'U'{block_number3}
| | 'D'{block_number4}
| {account2_address} |
| | 'C'{block_number5}
| | 'U'{block_number6}
| | ...
| {accountN_address} |
| | 'U'{block_numberM}
```
It's a bit unexpected, but doesn't change much. All operations still work:
```
subBucketName, keyAndValueJoinedTogether := cursor.Get(subBucketName, keyInSubBucket)
```
You may need to separate 'key' and 'value' manually. But this unlocks a bunch of new features!
Because the "keyAndValueJoinedTogether" column is sorted and stored as a key in the same tree (like normal keys),
the "value" can be used as part of your query. In 1 db command we can answer a more complex question:
"Dear DB, give me the block number where account X was updated and which is greater than or equal to N".
```
{account1_address}, 'U'{block_number2} := cursor.Seek({account1_address}, 'U'{block_number1})
// notice that in the parameter we used 'block_number1'
// but the DB had no 'U' record for this block and this account,
// so the db returned a value greater than what we requested:
// it returned 'block_number2'
```
Because the "keyAndValueJoinedTogether" column is stored as a key, it has the same size limit: 551 bytes
LMDB, can you do better?
------------------------
By default, LMDB stores small metadata (the size of the data) for each key.
Indices by nature store very many keys.
If all keys in a sub-bucket (DupSort DBI) have the same size, LMDB can store much less metadata.
(Remember: a "key in a sub-bucket" is "keyAndValueJoinedTogether" - this whole thing must have the same size.)
LMDB calls this feature DupFixed (this flag can be added to the bucket configuration).
```
#MDB_DUPFIXED
This flag may only be used in combination with #MDB_DUPSORT. This option
tells the library that the data items for this database are all the same
size, which allows further optimizations in storage and retrieval. When
all data items are the same size, the #MDB_GET_MULTIPLE, #MDB_NEXT_MULTIPLE
and #MDB_PREV_MULTIPLE cursor operations may be used to retrieve multiple
items at once.
```
It means that in 1 db call you can Get/Put up to 4KB of sub-bucket keys.
[lmdb docs](https://github.com/ledgerwatch/lmdb-go/blob/master/lmdb/lmdb.h)
Erigon
---------
This article's goal is to show tricky concepts through simple examples.
The real way Erigon stores accounts value and accounts history is a bit different and is described [here](./db_walkthrough.MD#bucket-history-of-accounts)
Erigon supports multiple typed cursors, see [AbstractKV.md](./../../ethdb/AbstractKV.md)


@@ -100,7 +100,7 @@ if err != nil {
- No internal copies/allocations. It means: 1. the app must copy keys/values before putting them into the database. 2. Data read from the db is valid only during the current transaction - copy it if you plan to use it after transaction Commit/Rollback.
- The methods .Bucket() and .Cursor() can't return nil and can't return an error.
- Bucket and Cursor are interfaces - meaning different classes can satisfy them: for example, the `LmdbCursor` and `LmdbDupSortCursor` classes both satisfy them.
If your are not familiar with "DupSort" concept, please read [indices.md](./../docs/programmers_guide/indices.md) first.
If you are not familiar with the "DupSort" concept, please read [dupsort.md](./../docs/programmers_guide/dupsort.md) first.
- If Cursor returns err!=nil then key SHOULD be != nil (can be []byte{} for example).