This PR has fixes for a number of instances in the bor heimdall stage
where nil headers are either ignored or inadvertently processed.
It also has a demotion of milestone related logging messages to debug
for missing blocks because the process is not at the head of the chain +
a general reduction in periodic logging to 30 secs rather than 20 to
reduce the log output on long runs.
In addition there is a refactor of persistValidatorSets to perform
validator set initiation in a seperate function. This is intended to
clarify the operation of persistValidatorSets - which is till performing
2 actions, persisting the snapshot and then using it to check the header
against synthesized validator set in the snapshot.
This PR adds support to store the transaction dependency (generated by
the block producer) in the block header for bor. This transaction
dependency will then be used by the parallel processor
([Block-STM](https://github.com/ledgerwatch/erigon/pull/7812/)).
I have created another
[PR](https://github.com/ledgerwatch/erigon-lib/pull/1064) in the
erigon-lib repo which adds the `IsParallelUniverse()` function.
# Background
Erigon currently uses a combination of Victoria Metrics and Prometheus
client for providing metrics.
We want to rationalize this and use only the Prometheus client library,
but we want to maintain the simplified Victoria Metrics methods for
constructing metrics.
This task is currently partly complete and needs to be finished to a
stage where we can remove the Victoria Metrics module from the Erigon
code base.
# Summary of changes
- Adds missing `NewCounter`, `NewSummary`, `NewHistogram`,
`GetOrCreateHistogram` functions to `erigon-lib/metrics` similar to the
interface VictoriaMetrics lib provides
- Minor tidy up for consistency inside `erigon-lib/metrics/set.go`
around return types (panic vs err consistency for funcs inside the
file), error messages, comments
- Replace all remaining usages of `github.com/VictoriaMetrics/metrics`
with `github.com/ledgerwatch/erigon-lib/metrics` - seamless (only import
changes) since interfaces match
This fixes an issue where the mumbai testnet node struggle to find
peers. Before this fix in general test peer numbers are typically around
20 in total between eth66, eth67 and eth68. For new peers some can
struggle to find even a single peer after days of operation.
These are the numbers after 12 hours or running on a node which
previously could not find any peers: eth66=13, eth67=76, eth68=91.
The root cause of this issue is the following:
- A significant number of mumbai peers around the boot node return
network ids which are different from those currently available in the
DHT
- The available nodes are all consequently busy and return 'too many
peers' for long periods
These issues case a significant number of discovery timeouts, some of
the queries will never receive a response.
This causes the discovery read loop to enter a channel deadlock - which
means that no responses are processed, nor timeouts fired. This causes
the discovery process in the node to stop. From then on it just
re-requests handshakes from a relatively small number of peers.
This check in fixes this situation with the following changes:
- Remove the deadlock by running the timer in a separate go-routine so
it can run independently of the main request processing.
- Allow the discovery process matcher to match on port if no id match
can be established on initial ping. This allows subsequent node
validation to proceed and if the node proves to be valid via the
remainder of the look-up and handshake process it us used as a valid
peer.
- Completely unsolicited responses, i.e. those which come from a
completely unknown ip:port combination continue to be ignored.
-
Reason:
- produce and seed snapshots earlier on chain tip. reduce depnedency on
"good peers with history" at p2p-network.
Some networks have no much archive peers, also ConsensusLayer clients
are not-good(not-incentivised) at serving history.
- avoiding having too much files:
more files(shards) - means "more metadata", "more lookups for
non-indexed queries", "more dictionaries", "more bittorrent
connections", ...
less files - means small files will be removed after merge (no peers for
this files).
ToDo:
[x] Recent 500K - merge up to 100K
[x] Older than 500K - merge up to 500K
[x] Start seeding 100k files
[x] Stop seeding 100k files after merge (right before delete)
In next PR:
[] Old version of Erigon must be able download recent hashes. To achieve
it - at first start erigon will download preverified hashes .toml from
s3 - if it's newer that what we have (build-in) - use it.
If HeaderDownload.VerifyHeader always returns false, the memory usage
grows at a fast pace
due to Link objects (containing headers) not deallocated even after the
link queue pruning.
This introduces _experimental_ block execution run by embedded Silkworm
API library:
- new command-line option `silkworm.path` to enable the feature by
specifying the path to the Silkworm library
- the Silkworm API shared library is dynamically loaded on-demand
- currently requires to build Silkworm library on the target machine
- available only on Linux at the moment: macOS has issue with [stack
size](https://github.com/golang/go/issues/28024) and Windows would
require [TDM-GCC-64](https://jmeubank.github.io/tdm-gcc/), both need
dedicated effort for an assessment
Fixes for label discrepancies in collector for summaries etc which have
a template which includes a quantile.
Initial native Prometheus client implementation of metrics - which is
currently turned off except for local testing and interface exports.
Closes#8078
This change is primarily intended to support go 1.21, but as a
side-effect requires updating libp2p, which in turn triggers an update
of golang.org/x/exp which creates quite a bit of (simple) churn in the
slice sorting.
This change introduces a new `cmp.Compare` function which can be used to
return an integer satisfying the compare interface for slice sorting.
In order to continue to support mplex for libp2p, the change references
github.com/libp2p/go-libp2p-mplex instead. Please see the PR at
https://github.com/libp2p/go-libp2p/pull/2498 for the official usptream
comment that indicates official support for mplex being moved to this
location.
Co-authored-by: Jason Yellick <jason@enya.ai>
Add code to the headers state to break processing if a bor milestone
rewind is detected.
The rewind processing happens in the bor/heimdall stage - this change
just avoids unnecessary header loading
if a milestone fork is likely to be detected
---------
Co-authored-by: Anshal Shukla <shukla.anshal85@gmail.com>
This is the second part of the bor milestone release it contains the
following changes:
* Initialize services
* This is a change from the initial pull request I have moved all of the
initialization to the bor engine. To facilitate this I have just passed
in the heimdall client interface, rather than the whole engine
* Stage processing
* This is also a change from the original PR - the code is contained in
the bor heimdall stage rather than in headers - the effect should be the
same, but this needs testing
---------
Co-authored-by: Mark Holt <mark@disributed.vision>
Co-authored-by: Anshal Shukla <shukla.anshal85@gmail.com>
For example, erigon on devnet8 marked a block as bad due to
"mdbx_cursor_open: cannot allocate memory":
```
[INFO] [09-12|04:57:36.041] [NewPayload] Handling new payload height=171035 hash=0x321dea00c4853ee354bebaf8aef3e63fbe06c4508271c0db4c92b0f087aedc3b
171034
[WARN] [09-12|04:57:36.069] Could not validate block err="[3/7 BlockHashes] table: Header, mdbx_cursor_open: cannot allocate memory, stack: [kv_mdbx.go:1057 kv_mdbx.
go:1069 kv_mdbx.go:1077 memory_mutation.go:473 memory_mutation.go:502 etl.go:123 etl.go:96 block_writer.go:40 stage_blockhashes.go:49 default_stages.go:457 sync.go:425 sync.go:258 s
tageloop.go:414 backend.go:476 fork_validator.go:250 fork_validator.go:156 ethereum_execution.go:151 execution_client.go:51 chain_reader.go:252 engine_server.go:741 engine_server.go
:235 engine_server.go:600 value.go:586 value.go:370 service.go:224 handler.go:494 handler.go:444 handler.go:392 handler.go:223 handler.go:316 asm_amd64.s:1598]"
[WARN] [09-12|04:57:36.069] ethereumExecutionModule.ValidateChain: chain is invalid hash=0x321dea00c4853ee354bebaf8aef3e63fbe06c4508271c0db4c92b0f087aedc3b
```
With this PR blocks are marked as bad only on genuine protocol errors.
This is a non functional change which consolidates the various packages
under metrics into the top level package now that the dead code is
removed.
It is a precursor to the removal of Victoria metrics after which all
erigon metrics code will be contained in this single package.
ergon/metrics contains a lot of unused types - the pull is removing them
prior to rationalizing between Prometheus and Victoria Metrics types.
It also have upgrades of golang-lru + sets types to v2 when possible.
This is an update of:
https://github.com/ledgerwatch/erigon/pull/7846
which uses a local fork of victoria metrics to include the changes that
https://github.com/anshalshukla added to the original for we where
using.
It also includes code to address the duplicate metrics issue identified
here:
https://github.com/ledgerwatch/erigon/issues/8053
It has one more associated fix which is to correctly add a metadata
label to counters, these where previously labelled as gauges.
e.g.
```
# TYPE p2p_peers counter
p2p_peers 0
```
rather than
```
# TYPE p2p_peers gauge
p2p_peers 0
```
---------
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
Co-authored-by: Anshal Shukla <shukla.anshal85@gmail.com>
I have added:
```go
{
ID: stages.BorHeimdall,
Description: "Download Bor-specific data from Heimdall",
Forward: func(firstCycle bool, badBlockUnwind bool, s *StageState, u Unwinder, tx kv.RwTx, logger log.Logger) error {
if badBlockUnwind {
return nil
}
return BorHeimdallForward(s, u, ctx, tx, borHeimdallCfg, true, logger)
},
Unwind: func(firstCycle bool, u *UnwindState, s *StageState, tx kv.RwTx, logger log.Logger) error {
return BorHeimdallUnwind(u, ctx, s, tx, borHeimdallCfg)
},
Prune: func(firstCycle bool, p *PruneState, tx kv.RwTx, logger log.Logger) error {
return BorHeimdallPrune(p, ctx, tx, borHeimdallCfg)
},
},
```
To MiningStages as well as Default as otherwise bor events are not added
when the block producer creates new blocks.
There are a couple of questions I have around this implementation:
* Is this the right place to add this
* As the state is also executed when the default stage is processed ther
is some duplicate processing for the block producing node.
* There is a duplicated call to heimdall which could be removed if the
stages share state - but its not clear if we want to do this.
* I don't think the mining stage needs to prune as this will be
replicated in the default iteration
This can be tested using the devnet with the following arguments:
```
--chain bor-devnet --bor.localheimdall --scenarios state-sync
```
This will generate sync events via an ethereum devnet which are
transmitted to bor chain and will be executed at the end of the snapshot
delay, which results in events generated from the bor chain. This tests
the whole sync, block generation, event lifecycle. As it needs to wait
for sprints to end after a sufficient delay it is quite slow to run.