mirror of
https://gitlab.com/pulsechaincom/erigon-pulse.git
synced 2025-01-01 16:47:39 +00:00
139 lines
4.3 KiB
ReStructuredText
139 lines
4.3 KiB
ReStructuredText
|
===
|
||
|
ETL
|
||
|
===
|
||
|
|
||
|
ETL framework is most commonly used in [staged sync](../../eth/stagedsync).
|
||
|
|
||
|
It implements a pattern where we extract some data from a database, transform it,
|
||
|
then put it into temp files and insert back to the database in sorted order.
|
||
|
|
||
|
Inserting entries into our KV storage sorted by keys helps to minimize write
|
||
|
amplification, hence it is much faster, even considering additional I/O that
|
||
|
is generated by storing files.
|
||
|
|
||
|
It behaves similarly to enterprise [Extract, Tranform, Load] frameworks, hence the name.
|
||
|
We use temporary files because that helps keep RAM usage predictable and allows
|
||
|
using ETL on large amounts of data.
|
||
|
|
||
|
Example
|
||
|
=======
|
||
|
|
||
|
.. code-block:: go
|
||
|
|
||
|
func keyTransformExtractFunc(transformKey func([]byte) ([]byte, error)) etl.ExtractFunc {
|
||
|
return func(k, v []byte, next etl.ExtractNextFunc) error {
|
||
|
newK, err := transformKey(k)
|
||
|
if err != nil {
|
||
|
return err
|
||
|
}
|
||
|
return next(k, newK, v)
|
||
|
}
|
||
|
}
|
||
|
|
||
|
err := etl.Transform(
|
||
|
db, // database
|
||
|
dbutils.PlainStateBucket, // "from" bucket
|
||
|
dbutils.CurrentStateBucket, // "to" bucket
|
||
|
datadir, // where to store temp files
|
||
|
keyTransformExtractFunc(transformPlainStateKey), // transformFunc on extraction
|
||
|
etl.IdentityLoadFunc, // transform on load
|
||
|
etl.TransformArgs{ // additional arguments
|
||
|
Quit: quit,
|
||
|
},
|
||
|
)
|
||
|
if err != nil {
|
||
|
return err
|
||
|
}
|
||
|
|
||
|
Data Transformation
|
||
|
===================
|
||
|
|
||
|
Data could be transformed in two places along the pipeline:
|
||
|
|
||
|
* transform on extraction
|
||
|
|
||
|
* transform on loading
|
||
|
|
||
|
Transform On Extraction
|
||
|
=======================
|
||
|
|
||
|
.. code-block:: go
|
||
|
|
||
|
type ExtractFunc func(k []byte, v []byte, next ExtractNextFunc) error
|
||
|
|
||
|
Transform on extraction function receives the currenk key and value from the
|
||
|
source bucket.
|
||
|
|
||
|
Transform On Loading
|
||
|
====================
|
||
|
|
||
|
.. code-block:: go
|
||
|
|
||
|
type LoadFunc func(k []byte, value []byte, state State, next LoadNextFunc) error
|
||
|
|
||
|
As well as the current key and value, the transform on loading function
|
||
|
receives the `State` object that can receive data from the destination bucket.
|
||
|
|
||
|
That is used in index generation where we want to extend index entries with new
|
||
|
data instead of just adding new ones.
|
||
|
|
||
|
Buffer Types
|
||
|
============
|
||
|
|
||
|
Before the data is being flushed into temp files, it is getting collected into
|
||
|
a buffer until if overflows (`etl.ExtractArgs.BufferSize`).
|
||
|
|
||
|
There are different types of buffers available with different behaviour.
|
||
|
|
||
|
* `SortableSliceBuffer` -- just append (k, v1), (k, v2) onto a slice. Duplicate keys
|
||
|
will lead to duplicate entries: [(k, v1) (k, v2)].
|
||
|
|
||
|
* `SortableAppendBuffer` -- on duplicate keys: merge. (k, v1), (k, v2)
|
||
|
will lead to k: [v1 v2]
|
||
|
|
||
|
* `SortableOldestAppearedBuffer` -- on duplicate keys: keep the oldest. (k,
|
||
|
v1), (k v2) will lead to k: v1
|
||
|
|
||
|
Loading Into Database
|
||
|
=====================
|
||
|
|
||
|
We load data from the temp files into a database in batches, limited by
|
||
|
`IdealBatchSize()` of an `ethdb.Mutation`.
|
||
|
|
||
|
Saving & Restoring State
|
||
|
========================
|
||
|
|
||
|
Interrupting in the middle of loading can lead to inconsistent state in the
|
||
|
database.
|
||
|
|
||
|
To avoid that, the ETL framework allows storing progress by setting `OnLoadCommit` in `etl.TransformArgs`.
|
||
|
|
||
|
Then we can use this data to know the progress the ETL transformation made.
|
||
|
|
||
|
You can also specify `ExtractStartKey` and `ExtractEndKey` to limit the nubmer
|
||
|
of items transformed.
|
||
|
|
||
|
|
||
|
`etl.Transform` function
|
||
|
========================
|
||
|
|
||
|
The vast majority of use-cases is when we extract data from one bucket and in
|
||
|
the end, load it into another bucket. That is the use-case for `etl.Transform`
|
||
|
function.
|
||
|
|
||
|
`etl.Collector` struct
|
||
|
======================
|
||
|
|
||
|
If you want a more modular behaviour instead of just reading from the DB (like
|
||
|
generating intermediate hashes in https://github.com/ledgerwatch/turbo-geth/blob/master/core/chain_makers.go, you can use
|
||
|
`etl.Collector` struct directly.
|
||
|
|
||
|
It has a `.Collect()` method that you can provide your data to.
|
||
|
|
||
|
|
||
|
Optimizations
|
||
|
=============
|
||
|
|
||
|
* if all data fits into a single file, we don't write anything to disk and just
|
||
|
use in-memory storage.
|