This is a non-functional change which consolidates the various packages
under metrics into the top level package now that the dead code is
removed.
It is a precursor to the removal of Victoria metrics after which all
erigon metrics code will be contained in this single package.
Improve p2p error handling to propagate errors
from the origin up the call chain to the Server peer removal code
using a new PeerError type containing a DiscReason and a more detailed
description.
The origin can be tracked down using PeerErrorCode (code) and DiscReason
(reason)
which looks like this in the log:
> [TRACE] [08-28|16:33:40.205] Removing p2p peer peercount=0
url=enode://d399f4b...@1.2.3.4:30303 duration=6.901ms
err="PeerError(code=remote disconnect reason, reason=too many peers,
err=<nil>, message=Peer.run got a remote DiscReason)"
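A minimal sketch of an error type that would render like the log line above. The field names and the string-based `DiscReason` and `PeerErrorCode` types are assumptions made for illustration, not the actual p2p definitions.

```go
package main

import "fmt"

// DiscReason and PeerErrorCode are simplified stand-ins (assumptions) for
// the p2p types named in the description.
type DiscReason string

type PeerErrorCode string

// PeerError carries where the error originated (Code), the disconnect
// reason (Reason), an optional wrapped error and a detailed message.
type PeerError struct {
	Code    PeerErrorCode
	Reason  DiscReason
	Err     error
	Message string
}

// Error renders the fields in the same shape as the log line above.
func (e *PeerError) Error() string {
	return fmt.Sprintf("PeerError(code=%s, reason=%s, err=%v, message=%s)",
		e.Code, e.Reason, e.Err, e.Message)
}

// Unwrap lets errors.Is and errors.As reach the wrapped error, if any.
func (e *PeerError) Unwrap() error { return e.Err }
```

Because the type implements `Unwrap`, callers further up the chain can still match the underlying error while the peer removal code logs the full context.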
This is an update of:
https://github.com/ledgerwatch/erigon/pull/7846
which uses a local fork of victoria metrics to include the changes that
https://github.com/anshalshukla added to the original fork we were
using.
It also includes code to address the duplicate metrics issue identified
here:
https://github.com/ledgerwatch/erigon/issues/8053
It has one more associated fix which is to correctly add a metadata
label to counters; these were previously labelled as gauges.
e.g.
```
# TYPE p2p_peers counter
p2p_peers 0
```
rather than
```
# TYPE p2p_peers gauge
p2p_peers 0
```
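A small sketch of how the `# TYPE` metadata line can follow the metric kind. The `metricKind` type and the `writeMetric` and `render` helpers are hypothetical names used only for this illustration.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// metricKind is a hypothetical enum for this sketch.
type metricKind int

const (
	kindGauge metricKind = iota
	kindCounter
)

// String returns the type name used in the TYPE metadata line.
func (k metricKind) String() string {
	if k == kindCounter {
		return "counter"
	}
	return "gauge"
}

// writeMetric emits the TYPE metadata line followed by the sample line.
func writeMetric(w io.Writer, name string, kind metricKind, value uint64) {
	fmt.Fprintf(w, "# TYPE %s %s\n%s %d\n", name, kind, name, value)
}

// render returns the exposition text for one metric as a string.
func render(name string, kind metricKind, value uint64) string {
	var sb strings.Builder
	writeMetric(&sb, name, kind, value)
	return sb.String()
}
```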
---------
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
Co-authored-by: Anshal Shukla <shukla.anshal85@gmail.com>
I have added:
```go
{
ID: stages.BorHeimdall,
Description: "Download Bor-specific data from Heimdall",
Forward: func(firstCycle bool, badBlockUnwind bool, s *StageState, u Unwinder, tx kv.RwTx, logger log.Logger) error {
if badBlockUnwind {
return nil
}
return BorHeimdallForward(s, u, ctx, tx, borHeimdallCfg, true, logger)
},
Unwind: func(firstCycle bool, u *UnwindState, s *StageState, tx kv.RwTx, logger log.Logger) error {
return BorHeimdallUnwind(u, ctx, s, tx, borHeimdallCfg)
},
Prune: func(firstCycle bool, p *PruneState, tx kv.RwTx, logger log.Logger) error {
return BorHeimdallPrune(p, ctx, tx, borHeimdallCfg)
},
},
```
to MiningStages as well as Default, as otherwise bor events are not added
when the block producer creates new blocks.
There are a couple of questions I have around this implementation:
* Is this the right place to add this?
* As the state is also executed when the default stage is processed, there
is some duplicate processing for the block producing node.
* There is a duplicated call to heimdall which could be removed if the
stages share state - but it's not clear if we want to do this.
* I don't think the mining stage needs to prune as this will be
replicated in the default iteration.
This can be tested using the devnet with the following arguments:
```
--chain bor-devnet --bor.localheimdall --scenarios state-sync
```
This will generate sync events via an ethereum devnet which are
transmitted to bor chain and will be executed at the end of the snapshot
delay, which results in events generated from the bor chain. This tests
the whole sync, block generation, event lifecycle. As it needs to wait
for sprints to end after a sufficient delay it is quite slow to run.
Basically, pruning is specified by the user and defaults to 1 million
(in the PR it is set to 100 for pruning purposes). The pruning setting
for the database is stored inside the db.
The current logic is flawed, because it drops all peers that are less
synced.
It is valid to return empty responses by the eth spec.
A proper logic should penalize from the context of the sync process,
where enough "reputation" data is collected about a peer.
In order to be able to connect to erigon 2.48 peers that have
--sentry.drop-useless-peers enabled,
this adds a check to not reply with an empty headers list.
If we reply with an empty list, we're going to be considered useless and
kicked.
Once enough erigon nodes in the network are updated past this commit,
this check should be removed,
because it is totally acceptable to return an empty list by the eth
spec.
The disconnect message could either be a plain integer, or a list with
one integer element. We were encoding it as a plain integer, but
decoding as a list. Change this to be able to decode any format.
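A self-contained sketch of decoding both shapes by hand, without an RLP library. In RLP a small integer is a single byte below `0x80`, zero is `0x80`, and a list holding one single-byte item is prefixed with `0xc1`. The function name `decodeDiscReason` is an assumption for this example.

```go
package main

import "errors"

// decodeDiscReason accepts both RLP shapes of a disconnect payload:
// a plain integer (e.g. 0x04) or a one-element list (e.g. 0xc1 0x04).
func decodeDiscReason(payload []byte) (uint8, error) {
	if len(payload) == 0 {
		return 0, errors.New("empty disconnect payload")
	}
	// A list whose content is exactly one byte long is prefixed with 0xc1.
	if payload[0] == 0xc1 {
		payload = payload[1:]
	}
	if len(payload) != 1 {
		return 0, errors.New("unexpected disconnect payload length")
	}
	switch b := payload[0]; {
	case b == 0x80: // RLP encoding of integer zero (the empty string)
		return 0, nil
	case b < 0x80: // a single byte below 0x80 encodes itself
		return b, nil
	default:
		return 0, errors.New("disconnect reason is not a small integer")
	}
}
```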
Currently PropagateNewBlockHashes and BroadcastNewBlock
select a subset of all sentries by taking `Sqrt(len(sentries))`,
and then for each sentry SendMessageToRandomPeers
selects a subset of its peers by taking `Sqrt(len(peerInfos))`.
This behaviour limits the broadcast scope with a lot of peers, e.g. 100
becomes 10,
but is not great with very few peers, or if the message is very
important
to broadcast to everyone, which is the case of bor validator/proposer
nodes.
* send to all sentries in both BroadcastNewBlock and PropagateNewBlockHashes
* remove peerCountConstrained sqrt logic in SendMessageToRandomPeers
* add maxPeers provider func as a parameter to MultiClient
* default it to 10 for eth and 0 (unlimited) for bor validators
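The difference between the two selection rules can be sketched as follows. The helper names `sqrtSubset` and `cappedSubset` are hypothetical; only the square-root rule and the meaning of `maxPeers` (0 = unlimited) are taken from the description above.

```go
package main

import "math"

// sqrtSubset reproduces the old rule: message sqrt(n) of n peers.
func sqrtSubset(n int) int {
	return int(math.Sqrt(float64(n)))
}

// cappedSubset sketches the new rule: message every peer up to maxPeers,
// where maxPeers == 0 means no limit.
func cappedSubset(n, maxPeers int) int {
	if maxPeers == 0 || n < maxPeers {
		return n
	}
	return maxPeers
}
```

With 3 peers the old rule reaches only 1 of them, while a cap of 10 reaches all 3; with 100 peers both rules reach 10, and a cap of 0 reaches all 100.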
---------
Co-authored-by: Mark Holt <mark@distributed.vision>
problem: it was possible to call startSync
and start sending messages before our Status is sent
solution: wait for the sender goroutine to finish
before calling startSync
refactor handShake parameters to not require peerID and a startSync
callback
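The ordering guarantee can be sketched with a `sync.WaitGroup`. This is an illustration under assumptions: `handshakeThenSync`, `sendStatus` and `startSync` are placeholder names, and the real handshake also reads the remote Status.

```go
package main

import "sync"

// handshakeThenSync sends our Status on a separate goroutine and calls
// startSync only after that goroutine has finished.
func handshakeThenSync(sendStatus func() error, startSync func()) error {
	var (
		wg      sync.WaitGroup
		sendErr error
	)
	wg.Add(1)
	go func() {
		defer wg.Done()
		sendErr = sendStatus()
	}()
	// A real handshake would read and validate the remote Status here.
	wg.Wait() // everything below happens after the sender has returned
	if sendErr != nil {
		return sendErr
	}
	startSync()
	return nil
}
```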
Co-authored-by: Mark Holt <mark@distributed.vision>
This request is extending the devnet functionality to more fully handle
contract processing by adding support for the following calls:
* trace_call,
* trace_transaction
* debug_accountAt,
* eth_getCode
* eth_estimateGas
* eth_gasPrice
It also contains an initial rationalization of the devnet subscription
code to use the erigon client code directly rather than using its own
intermediate subscription management.
This is used to implement a general purpose block waiter - which can be
used in any scenario step - rather than being specific to transaction
processing.
This pull also contains an end to end tested sync processor for bor and
associated support services:
* Heimdall (supports sync event transfer)
* Faucet - allows the creation and funding of arbitrary test specific
accounts (cross chain)
Notes and Caveats:
* Code generation for contracts requires `--evm-version paris` for
chains which don't support push0, when using solc over 0.8.19
* The bor log processing post the application of sync events causes a
panic - this will be the subject of a separate smaller push as it is not
devnet specific
* The bor code seems to make repeated calls for the same sync events and
also reverts requests - this needs further investigation. This is the
behaviour of the current implementation and may be required - although
it does seem to generate repeat processing - which could be avoided.
Otterscan API search methods allow the user to inform the page size.
This PR adds an internal max (default == 25 results) to cap the page
size, regardless of what the user asks.
It also adds a `--ots.search.max.pagesize` CLI arg to override this max
(in both the erigon and rpcdaemon binaries).
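The cap itself amounts to a clamp. A minimal sketch, where `capPageSize` and `defaultMaxPageSize` are illustrative names and only the default of 25 comes from the text above:

```go
package main

// defaultMaxPageSize mirrors the default cap of 25 results described above.
const defaultMaxPageSize uint16 = 25

// capPageSize returns the requested page size, limited to max.
func capPageSize(requested, max uint16) uint16 {
	if requested > max {
		return max
	}
	return requested
}
```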
Adds `clear_bad_blocks` command to integration tool. This command allows
blocks that were erroneously marked as bad to be re-processed.
The command just clears the `BadHeaderNumber` table. It can be safer in
some cases than
```
./integration state_stages --unwind=<some_number>
./integration stage_headers --unwind=<some_number>
```
and can be used in cases like this one:
https://github.com/ledgerwatch/erigon/issues/7892
Command syntax:
```
./integration clear_bad_blocks --datadir=<datadir>
```
Miraculously, hive tests pass first try. YIPPIE.
Also for the future, I added `--experimental.modular` which enables a
secondary engine API for consensus separation.
Now block building is responsibility of the execution module.
This request implements an end to end Polygon state sync in the devnet.
It does this by deploying smart contracts on the L1 & L2 chains which
follow the polygon fx portal model with security checks removed to
simplify the code. The sync events generated are routed through a local
mock heimdall - to avoid the consensus process for testing purposes.
The commit also includes support code to help the delivery of additional
contract based scenarios.
We need to extract this interface from the struct.
I also need to break down the interface more, to better show which parts
the caching is used in, and to move some functions from the cache state
to the underlying.
don't merge
An update to the devnet to introduce a local heimdall to facilitate
multiple validators without the need for an external process, and hence
validator registration/staking etc.
In this initial release only span generation is supported.
It has the following changes:
* Introduction of a local grpc heimdall interface
* Allocation of accounts via a devnet account generator ()
* Introduction of 'Services' for the network config
"--chain bor-devnet --bor.localheimdall" will run a 2 validator network
with a local service
"--chain bor-devnet --bor.withoutheimdall" will run a single validator
with no heimdall service as before
---------
Co-authored-by: Alex Sharp <alexsharp@Alexs-MacBook-Pro-2.local>
This request implements the insertion of Bor ephemeral transactions into
snapshot indexes.
It does this by taking the block hash from the header index and passing
it to the transaction indexer to add an additional index entry per block
into the transaction hash -> block index.
The passed entries are currently contained in an in memory array which
is (32 * number of blocks / sprint size) bytes.
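As a rough worked example of that size formula: the helper name and the figures used below (1,000,000 blocks, a sprint size of 64) are assumptions picked only to illustrate the arithmetic.

```go
package main

// hashLen is the size in bytes of one block hash entry.
const hashLen = 32

// borTxIndexBytes estimates the in-memory array size from the formula
// above: 32 * number of blocks / sprint size.
func borTxIndexBytes(blocks, sprintSize uint64) uint64 {
	if sprintSize == 0 {
		return 0 // avoid dividing by zero on an unset sprint size
	}
	return hashLen * (blocks / sprintSize)
}
```

For 1,000,000 blocks and a sprint size of 64 this gives 15,625 entries, or 500,000 bytes.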
In addition to the functional code there is also an update to the
`dump_test.go` so that it runs `DumpBlocks` to exercise the indexing
code. To facilitate this the `InsertChain` method in `mock_sentry` has
been modified so that it can process >128 blocks.
The code in this request also includes additional bor/consensus code
with the following functions:
`CalculateSprint`
`CalculateSprintCount`
The first function is a modification of the code in erigon-lib so that
the sprints are numerically rather than lexically ordered. This code
should be migrated to erigon-lib and should have its sprint set
calculated once from its underlying map rather than this process being
repeated every calculation.
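A sketch of the numeric ordering, assuming the sprint configuration is a map from a decimal start-block string to a sprint size. The names `sprintEntry`, `sortedSprints` and `calculateSprint` are hypothetical and this is not the erigon-lib code. Sorting once and reusing the slice also illustrates the "calculated once" suggestion.

```go
package main

import (
	"sort"
	"strconv"
)

// sprintEntry pairs a start block with the sprint size in force from it.
type sprintEntry struct {
	start, size uint64
}

// sortedSprints parses the map keys as block numbers and orders the
// entries numerically, so "20" sorts before "100".
func sortedSprints(sprints map[string]uint64) ([]sprintEntry, error) {
	entries := make([]sprintEntry, 0, len(sprints))
	for k, v := range sprints {
		start, err := strconv.ParseUint(k, 10, 64)
		if err != nil {
			return nil, err
		}
		entries = append(entries, sprintEntry{start: start, size: v})
	}
	sort.Slice(entries, func(i, j int) bool {
		return entries[i].start < entries[j].start
	})
	return entries, nil
}

// calculateSprint returns the size of the last entry whose start block
// is not greater than number (0 if there is none).
func calculateSprint(entries []sprintEntry, number uint64) uint64 {
	var size uint64
	for _, e := range entries {
		if e.start > number {
			break
		}
		size = e.size
	}
	return size
}
```

With keys `"0"`, `"100"` and `"20"`, a lexical sort would place `"100"` before `"20"` and pick the wrong sprint for block 50; the numeric sort does not.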
---------
Co-authored-by: Alex Sharp <alexsharp@Alexs-MacBook-Pro-2.local>
Co-authored-by: ledgerwatch <akhounov@gmail.com>
Co-authored-by: Enrique Jose Avila Asapche <eavilaasapche@gmail.com>
Co-authored-by: Giulio <giulio.rebuffo@gmail.com>
Fixes https://github.com/ledgerwatch/erigon/issues/7814, it can be
really confusing when `erigon` and `integration` process `~` in path in
different ways.
I can't fix it for `integration` the same way it is done for `erigon`,
because in the main app it is done with the `urfave/cli` package, which
is not used in `integration`, so I've used the simplest alternative.
This PR separates ENGINE from Ethbackend. It makes it so:
1) EthBackend is not a god class
2) We can abstract away engine API so that we can make it CL-like and
enable Consensus-Execution driven design
3) Objective is Json-RPC -> Engine Consensus Module -> Execution module.
The fixes here fix a couple of issues related to devnet start-up
1. macOS threading and syscall error returns were causing multi node
start to both not wait and fail
2. On windows creating DB's with the default 2 TB mapsize causes the os
to reserve about 4GB of committed memory per DB. This may not be used -
but is reserved by the OS - so a default bor node reserves around 10GB
of storage. Starting many nodes causes the OS page file to become
exhausted.
To fix this the consensus DB's now use the node's OpenDatabase function
rather than their own, which means that the consensus DB's take notice
of the config.MdbxDBSizeLimit.
This fix leaves one 4GB committed memory allocation in the TX pool which
needs its own MapSize setting.
---------
Co-authored-by: Alex Sharp <akhounov@gmail.com>
This check-in catches errors in the node start-up code and makes sure
that the network is stopped if any node fails to start cleanly, and
that it returns an error - so that any calling code can take
appropriate action.
reasons:
- mainnet: even nodes with a small FreeList still have millions of pages
there: `GC: 46446830 5.8%`. The probability of getting into a state where
space re-use is slower than free-list growth is > 0% (we are now using a
db version which limits freelist overhead, but increases this
probability)
- polygon: size is > 8Tb
- hardware is slowly moving towards bigger pageSizes (because for
OS/Hardware, maintenance of page metadata is also not free: metadata,
lists, LRU, etc...). Macbook's default pagesize is now 16Kb. Network
disks in cloud are also likely working with 16Kb pages.
pros:
- less db fragmentation (better FS-level compression)
- less overflow pages in DB (which also reducing free-list overhead)
- smaller free-list
- bigger key-size-limit
- no 8Tb db size limit
- can setup FS - to also use bigger pagesize - it will reduce FS
overhead also
- reducing amount of page-faults during batch-reads (if FS pagesize
match)
- less write syscalls during commit (when WriteMap disabled)
cons:
- ~10% more IO: because of more RAM waste and just because need
read/write bigger pages (not all updates are co-located).
Added support tunnel to the devnet cmd. In order to get this to run I
made the following changes:
* Create a public function
* Added non root logging
I have also added commentary to the readme to explain the additional
command line arguments needed to integrate with diagnostics. In summary,
if you set the --diagnostics.url the devnet will wait for diagnostic
requests rather than exiting
---------
Co-authored-by: alex.sharov <AskAlexSharov@gmail.com>
Changes summary:
- Continue to skip the gasLimit check in ``verifyHeader`` of
``merge.go`` unless the block is pre-merge and blockGasLimitContract is
present
- Refactor ``aura.go`` a bit
- Have ``sysCall`` method customized to be able to call state (contract)
at a parent (or any other) header state
- breaks dependency from staged_sync to package with block_reader
implementation
- breaks dependency from snap_sync to package with block_reader
implementation
- breaks dependency from mining to txpool implementation
This is an update to the devnet code which introduces the concept of
configurable scenarios. This replaces the previous hard coded execution
function.
The intention is that now both the network and the operations to run on
the network can be described in a data structure which is configurable
and composable.
The operating model is to create a network and then ask it to run
scenarios:
```go
network.Run(
runCtx,
scenarios.Scenario{
Name: "all",
Steps: []*scenarios.Step{
&scenarios.Step{Text: "InitSubscriptions", Args: []any{[]requests.SubMethod{requests.Methods.ETHNewHeads}}},
&scenarios.Step{Text: "PingErigonRpc"},
&scenarios.Step{Text: "CheckTxPoolContent", Args: []any{0, 0, 0}},
&scenarios.Step{Text: "SendTxWithDynamicFee", Args: []any{recipientAddress, services.DevAddress, sendValue}},
&scenarios.Step{Text: "AwaitBlocks", Args: []any{2 * time.Second}},
},
})
```
The steps here refer to step handlers which can be defined as follows:
```go
func init() {
scenarios.MustRegisterStepHandlers(
scenarios.StepHandler(GetBalance),
)
}
func GetBalance(ctx context.Context, addr string, blockNum requests.BlockNumber, checkBal uint64) {
...
```
This commit is an initial implementation of the scenario running - which
is working, but will need to be enhanced to make it more usable &
developable.
The current version of the code is working and has been tested with the
dev network, and bor withoutheimdall. There is a multi miner bor
heimdall configuration but this is yet to be tested.
Note that by default the scenario runner picks nodes at random on the
network to send transactions to. This causes the dev network to run very
slowly as it seems to take a long time to include transactions where the
nonce is incremented across nodes. It seems to take a long time for the
nonce to catch up in the transaction pool processing. This is yet to be
investigated.
attempt to address next issue:
> when I'm having a lot of websocket connections the node is freezing
and then it needs like 10 mins to sync. Then if I keep pushing requests
it falls out of sync all the time
When testing with `Bor` consensus turned on I discovered that
`SendRawTransaction` returns a 0x000... hash when transactions are
submitted during block transitions. This turns out to be spurious in the
sense that the transaction insertion is successful.
The cause is that `ReadCurrentBlockNumber` returns a nil block number.
This in turn is caused by the following: In `accessors_chain.go` there
are two methods: `WriteHeader` and `WriteHeadHeaderHash` when the first
is called the block number is written for the header. The second writes
the header hash, but there is no guarantee when it does that the head
header will have been written yet. In fact it seems to happen sometime
later.
The problem for `SendRawTransaction` is that it begins a transaction
after inserting into the txpool. And depending on timing this
transaction may see only the `WriteHeadHeaderHash` insertion, and hence
can't read the block number.
I have mitigated this by opening the db transaction before calling the
tx pool insertion, meaning that it is more likely to have a clean view
of the DB.
I have also moved the chain id check earlier in the code - as I think
that if this is invalid the method should not try to insert transactions
in the first place.
The `ReadCurrentBlockNumber` is only used to produce a log message - so
I've changed this to not fail the whole function but to just log an
unknown sender. Which means that the hash is still returned to the
sender after a successful txpool insertion
This change adds 'any' as an alternate wildcard to '*'.
I have updated all doc references in the main erigon repo - let me know
if there is anywhere else that needs changing.
Added two new flags beacon.api.port and beacon.api.addr
Now we can listen for beacon api and get beacon genesis
---------
Co-authored-by: Giulio <giulio.rebuffo@gmail.com>
This PR does the following things:
- Updates the hardfork number of the upcoming Indore hardfork schedule
at block 36877056.
- Refactoring to `CommitStates` method of bor consensus
- Fixes a bug in triggering mining
I've added a non root logger to bor.ValidatorSet validator set. This
creates a signature change on a number of calling functions to propagate
the logger. This is mostly constrained to the bor package but impacts a
number of tests and utilities which call the validators set.
- allow store non-canonical blocks/senders
- optimize re-org: don't update/delete most of data
- allow mark chain as `Bad` - will be not visible by eth_getBlockByHash,
but can read if have hash+num
This branch is intended to allow the devnet to be used for testing
multiple consensus types beyond the default clique. It is initially being
used to test Bor consensus for polygon.
It also has the following refactoring:
### 1. Network configuration
The two node arg building functions miningNodeArgs and nonMiningNodeArgs
have been replaced with a configuration struct which is used to
configure:
```go
network := &node.Network{
DataDir: dataDir,
Chain: networkname.DevChainName,
//Chain: networkname.BorDevnetChainName,
Logger: logger,
BasePrivateApiAddr: "localhost:9090",
BaseRPCAddr: "localhost:8545",
Nodes: []node.NetworkNode{
&node.Miner{},
&node.NonMiner{},
},
}
```
and start multiple nodes
```go
network.Start()
```
Network start will create a network of nodes ensuring that all nodes are
configured with non clashing network ports set via command line
arguments on start-up.
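One way to derive non-clashing addresses from a base address is to offset the port by the node index. This is only a sketch of the idea, assuming a simple index-based offset; `offsetAddr` is a hypothetical helper and the actual devnet allocation may differ.

```go
package main

import (
	"net"
	"strconv"
)

// offsetAddr returns base with its port increased by index, so that
// ("localhost:8545", 1) yields "localhost:8546".
func offsetAddr(base string, index int) (string, error) {
	host, portStr, err := net.SplitHostPort(base)
	if err != nil {
		return "", err
	}
	port, err := strconv.Atoi(portStr)
	if err != nil {
		return "", err
	}
	return net.JoinHostPort(host, strconv.Itoa(port+index)), nil
}
```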
### 2. Request Routing
The `RequestRouter` has been updated to take a 'target' rather than
using a static dispatcher which routes to a single node on the network.
Each node in the network has its own request generator so command and
services have more flexibility in request routing and
`ExecuteAllMethods` currently takes the `node.Network` as an argument
and can pick which node (node 0 for the moment) to send requests to.
- stage_senders: don't re-calc existing senders
- stage_tx_lookup: prune less blocks per iteration - because
random-deletes are expensive. pruning must not slow-down sync.
- prune data even if --snap.stop is set
- "prune as-much-as-possible at startup" is not very good idea: at
initialCycle machine can be cold and prune will cause big downtime, no
reason to produce much freelist in 1 tx. People may also restart erigon
- because of some bug - and it will cause unexpected downtime (usually
Erigon startup very fast). So, I just remove all `initialSync`-related
logic in pruning.
- fix lost metrics about disk write byte/sec
it's step towards saving canonical and non-canonical bodies in same
table (and txs also in same own table). to reduce write amplification
(cheaper re-orgs)
PR change: reading BaseTxNum from existing snapshots instead of DB
DB will store in field body.BaseTxNum - non-canonical TxnID
Snapshots will store only canonical TxNum in field body.BaseTxNum
This implements batched state-test execution, similar to
https://github.com/ethereum/go-ethereum/pull/27318 .
Some speedtests: executing a state-test twice on current master takes
~4-5 seconds, and scales linearly.
```
Doing 2 execs old style
real 0m8.185s
user 0m8.081s
sys 0m0.110s
```
Doing `100` executions on this PR -- a few seconds of ramp-up time, but
very quick execution after that :
```
Doing 100 execs v2
real 0m5.009s
user 0m4.560s
sys 0m0.508s
```
I also tested a version where I moved the db instantiation into the top
callsite, with the `MustOpen` and `.Close` only performed once, instead
of `100` times -- however, I noticed no additional speed gains from
doing so (my branch `batched_evm_v2`).
Therefore, I suspect that the slowdown comes not from the db, but the
kzg library initialization.
we update observability in the p2p layer for handlers, and also properly
encode error codes, close streams.
---------
Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com>
Co-authored-by: Giulio <giulio.rebuffo@gmail.com>
When mining is enabled, it waits for either a new block event, or a tx
notif or recommit interval before it starts mining the first block. This
PR achieves the following things:
- Start mining immediately (subsequent blocks will be mined via the new
head channel or miner.recommit timer).
- Modifies the conditions when it needs to look for new mining work
- Don't start mining on arrival of new transactions as it can be too
frequent (only for bor consensus as of now).
- Reset timer only if some work was done previously
- always RLock all snapshots - to guarantee consistency
- introduce class View (analog of RoTx and MakeContext)
- move read methods to View object
- View object will be managed by temporal_tx
---------
Co-authored-by: Alex Sharp <alexsharp@Alexs-MacBook-Pro-2.local>
deduplicate logic
create more producer goroutines (torrent lib does limiting internally
amount of consumers/disk-readers/hashers by 2, and it's enough because
we can verify multiple files in parallel)
move flag from "downloader torrent_hashes --verify" to "downloader
--verify"
instead of converting from ssz -> struct -> ssz, it may be better to
just stay as ssz, then use methods to read the data.
this pr explores this concept, while maintaining compatibility with the
existing codebase.
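A minimal sketch of the idea, using a checkpoint-like container with two fixed-size fields (a little-endian `uint64` epoch followed by a 32-byte root). The `checkpointSSZ` type and its methods are hypothetical names for this example and not the types used in the PR.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// checkpointSSZ is a view over the SSZ bytes of a container with two
// fixed-size fields: Epoch (uint64, little-endian) then Root (32 bytes).
type checkpointSSZ []byte

const checkpointLen = 8 + 32

// newCheckpointSSZ validates the length once so the accessors can index
// into the buffer without further checks.
func newCheckpointSSZ(buf []byte) (checkpointSSZ, error) {
	if len(buf) != checkpointLen {
		return nil, errors.New("checkpoint: wrong encoded length")
	}
	return checkpointSSZ(buf), nil
}

// Epoch reads the first field straight from the encoded bytes.
func (c checkpointSSZ) Epoch() uint64 {
	return binary.LittleEndian.Uint64(c[:8])
}

// Root copies the second field out of the encoded bytes.
func (c checkpointSSZ) Root() [32]byte {
	var root [32]byte
	copy(root[:], c[8:])
	return root
}
```

Re-encoding such a value is then just returning the underlying bytes, which avoids the ssz -> struct -> ssz round trip.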