When the sync loop first runs it suppresses block sync events both on the
initial loop iteration and when more than 1000 blocks are being processed.
This fix removes the first check, because otherwise the first block
received by the process never gets sent to the tx pool, which means no
new blocks are produced for polygon.
As well as this fix, I have also moved the gas initialization to the
txpool start method rather than prompting it with a 'synthetic block
event'.
As the txpool start method has access to the core and tx DBs, it can find
the current block and chain config internally, so rather than being
externally activated it can just do this itself on start up. This has
the advantage of making the txpool more self-contained.
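A minimal sketch of this start-up step, using hypothetical accessor types rather than the real Erigon interfaces:

```go
package txpool

// headReader is a hypothetical stand-in for the core/tx DB access the pool has.
type headReader interface {
	CurrentHeader() (gasLimit, baseFee uint64, err error)
}

// TxPool here is a stripped-down illustration, not the real pool type.
type TxPool struct {
	blockGasLimit  uint64
	pendingBaseFee uint64
	started        bool
}

// Start derives the pool's initial gas parameters from the head block it reads
// itself, instead of waiting to be primed by a synthetic block event.
func (p *TxPool) Start(db headReader) error {
	if p.started {
		return nil
	}
	gasLimit, baseFee, err := db.CurrentHeader()
	if err != nil {
		return err
	}
	p.blockGasLimit = gasLimit
	p.pendingBaseFee = baseFee
	p.started = true
	return nil
}
```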
This fixes a couple of regressions for running the uploader for mumbai:
* Now that flags have moved to a higher context they need to be set in
the context, not the flag values
* Span 0 of mumbai has a header/span mismatch for span zero sprint 0, so
the check here needs to be suppressed
When discarding a spamming transaction we need to remove the tx from the
subpools as well as the sender tx list. Otherwise the tx is ignored by
other operations and left in the subpool.
As well as the fix here, this PR also contains several changes to TX
TRACING and other logging to make it easier to see what is going on with
pool processing.
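A minimal sketch of the discard path described here, with illustrative stand-ins for the real pool structures:

```go
package txpool

// Illustrative, minimal stand-ins for the real pool structures.
type metaTx struct{ senderID uint64 }

type subPool struct{ txs map[*metaTx]struct{} }

func (s *subPool) Remove(tx *metaTx) { delete(s.txs, tx) }

type senderTxList struct{ txs []*metaTx }

func (l *senderTxList) removeTx(tx *metaTx) {
	for i, t := range l.txs {
		if t == tx {
			l.txs = append(l.txs[:i], l.txs[i+1:]...)
			return
		}
	}
}

type pool struct {
	queued  *subPool
	senders map[uint64]*senderTxList
}

// discard removes the transaction from the subpool it sits in *and* from its
// sender's tx list, so later pool operations no longer see a dangling entry.
func (p *pool) discard(tx *metaTx) {
	p.queued.Remove(tx)
	if l, ok := p.senders[tx.senderID]; ok {
		l.removeTx(tx)
	}
}
```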
This PR contains 3 fixes for the interaction between the Bor mining loop
and the TX pool which were causing the regular creation of blocks with
zero transactions.
* Mining/Tx pool block synchronization
The synchronization of the tx pool between the sync loop and the mining
loop has been changed so that both are triggered by the same event and
synchronized via a sync.Cond rather than a polling loop with a hard
coded loop limit. This means that mining now waits for the pool to be
updated from the previous block before it starts the mining process.
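A minimal sketch of this coordination, using illustrative names rather than the real Erigon types:

```go
package mining

import "sync"

// poolSync lets the mining loop wait until the tx pool has caught up with a
// given block, replacing the old polling loop with its hard-coded retry limit.
type poolSync struct {
	mu          sync.Mutex
	cond        *sync.Cond
	updatedUpTo uint64 // highest block the tx pool has been updated for
}

func newPoolSync() *poolSync {
	p := &poolSync{}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// PoolUpdated is called from the sync-loop side once the pool has processed block n.
func (p *poolSync) PoolUpdated(n uint64) {
	p.mu.Lock()
	if n > p.updatedUpTo {
		p.updatedUpTo = n
	}
	p.mu.Unlock()
	p.cond.Broadcast()
}

// WaitForBlock blocks the mining loop until the pool has been updated from block n.
func (p *poolSync) WaitForBlock(n uint64) {
	p.mu.Lock()
	for p.updatedUpTo < n {
		p.cond.Wait()
	}
	p.mu.Unlock()
}
```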
* Txpool Startup consolidated into its MainLoop
Previously the tx pool start process was dynamically triggered at
various points in the code. This has all now been moved to the start of
the main loop. This is necessary to avoid a timing hole which can leave
the mining loop hanging, waiting for a previous block broadcast which
it missed due to its delayed start.
* Mining listens for block broadcast to avoid duplicate mining
operations
The mining loop for bor has a recommit timer in case blocks are not
produced on time. However, in the case of sprint transitions where the
seal publication is delayed, this can lead to duplicate block production.
This is suppressed by introducing a `waiting` state which is exited upon
the block being broadcast from the sealing operation.
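An illustrative sketch of the `waiting` state and its interaction with the recommit timer (names and structure are assumptions, not the actual miner code):

```go
package mining

import (
	"sync"
	"time"
)

// waitState tracks whether a sealed block is still waiting to be broadcast.
type waitState struct {
	mu      sync.Mutex
	waiting bool
}

func (w *waitState) startWaiting() { w.mu.Lock(); w.waiting = true; w.mu.Unlock() }
func (w *waitState) stopWaiting()  { w.mu.Lock(); w.waiting = false; w.mu.Unlock() }
func (w *waitState) isWaiting() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.waiting
}

// miningLoop suppresses the recommit timer while waiting, and leaves the
// waiting state only once the sealing operation has broadcast the block.
func miningLoop(w *waitState, recommit time.Duration, broadcast, quit <-chan struct{}, mine func()) {
	timer := time.NewTimer(recommit)
	defer timer.Stop()
	for {
		select {
		case <-timer.C:
			if !w.isWaiting() { // no duplicate block while a sealed block is in flight
				w.startWaiting()
				mine()
			}
			timer.Reset(recommit)
		case <-broadcast: // block broadcast from the sealing operation
			w.stopWaiting()
		case <-quit:
			return
		}
	}
}
```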
Currently the mining loop is broken for the polygon chain. This PR fixes
this.
High level changes:
- Introduces new Bor<->Heimdall stage specifically for the needs of the
mining flow
- Extracts out common logic from Bor<->Heimdall sync and mining stages
into shared functions
- Removes `mine` flag for the Bor<->Heimdall sync stage
- Extends the current `StartMining` function to prefetch span zero if
needed before the mining loop is started
- Fixes Bor to read span zero (instead of span 1) from heimdall when the
span is not initially set in the local smart contract that the Spanner
uses
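A sketch of the span zero prefetch described above, with hypothetical interfaces standing in for the real Heimdall client and span storage:

```go
package stages

import "context"

// Hypothetical interfaces; the real code talks to the Heimdall client and the span DB.
type heimdallClient interface {
	FetchSpan(ctx context.Context, spanID uint64) ([]byte, error)
}

type spanStore interface {
	HasSpan(spanID uint64) (bool, error)
	PutSpan(spanID uint64, span []byte) error
}

// prefetchSpanZero fetches span zero from Heimdall if it is not already present
// locally, so the Spanner never has to fall back to span 1; StartMining would
// run a step like this before kicking off the mining loop.
func prefetchSpanZero(ctx context.Context, client heimdallClient, store spanStore) error {
	ok, err := store.HasSpan(0)
	if err != nil || ok {
		return err
	}
	span, err := client.FetchSpan(ctx, 0)
	if err != nil {
		return err
	}
	return store.PutSpan(0, span)
}
```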
Test with devnet "state-sync" scenario:
![Screenshot 2024-01-05 at 17 41 23](https://github.com/ledgerwatch/erigon/assets/94537774/34ca903a-69b8-416a-900f-a32f2d4417fa)
While working on fixing the bor mining loop I stumbled across an error
in `ChainReader.BorSpan` - a "not implemented" panic. I also hit a few
other panics due to a missing logger in `ChainReaderImpl` struct
initialisations.
This PR fixes both.
Heimdall prepares the next span a number of sprints before the current
span ends. Currently we always fetch the next span regardless of which
sprint we are in during the current span. This causes a liveness issue
due to how the Heimdall client works (it infinitely retries until it
fetches a span - this issue will be fixed in a separate PR). This PR
fixes this by matching what bor does - it fetches the next span only in
the last sprint of the current span.
Changes:
- Adds a unit test for the above
- Adds a new function BlockInLastSprintOfSpan
- Some code reorg and cleanup - moves the span num related functions
from the bor package to the span sub package for better logical grouping
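A simplified sketch of `BlockInLastSprintOfSpan` (simplified in that span and sprint lengths are plain parameters here, whereas the real function reads them from the bor chain config and also has to handle the shorter genesis span):

```go
package span

// BlockInLastSprintOfSpan reports whether blockNum falls inside the final
// sprint of its span, which is the only time the next span should be fetched
// from Heimdall.
func BlockInLastSprintOfSpan(blockNum, spanLength, sprintLength uint64) bool {
	offsetInSpan := blockNum % spanLength        // position of the block within its span
	lastSprintStart := spanLength - sprintLength // first block of the span's final sprint
	return offsetInSpan >= lastSprintStart
}
```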
Adds unit tests for:
- Bor Heimdall Stage - `checkHeaderExtraData`
  - at the end of each sprint, verifies that the validators in the header
extra data match the selected proposers from the heimdall span
  - 1 test for selected proposers length mismatch
  - 1 test for selected proposers bytes mismatch
- Bor Heimdall Stage - `persistValidatorSets`
  - verifies that each header is created by a validator in the validator
set
  - if it is not, we set the unwind point
This change introduces additional processes to manage snapshot uploading
for E2 snapshots:
## erigon snapshots upload
The `snapshots uploader` command starts a version of erigon customized
for uploading snapshot files to
a remote location.
It breaks the stage execution process after the senders stage and then
uses the snapshot stage to send headers, bodies and (in the case of
polygon) bor spans and events to snapshot files. Because this process
avoids execution it runs significantly faster than a standard erigon
configuration.
The uploader uses rclone to send seedable files (100K or 500K blocks) to
a remote storage location specified in the rclone config file.
The **uploader** is configured to minimize disk usage by doing the
following:
* It removes snapshots once they are loaded
* It aggressively prunes the database once entities are transferred to
snapshots
In addition to this it has the following performance-related features:
* It maximizes the workers allocated to snapshot processing to improve
throughput
* It can be started from scratch by downloading the latest snapshots from
the remote location to seed processing
## snapshots command
This is a standalone command for managing remote snapshots. It has the
following sub-commands:
* **cmp** - compare snapshots
* **copy** - copy snapshots
* **verify** - verify snapshots
* **manifest** - manage the manifest file in the root of remote snapshot
locations
* **torrent** - manage snapshot torrent files
1. Adds an eth/stagedsync/test package which provides a test Harness
object
2. Adds the first automated test to the bor-heimdall stage regarding
span persistence (more to come in subsequent PRs)
3. Fixes a bug in the bor-heimdall stage which was uncovered with the
test - we do not fetch span 0 when we sync straight from blockNum=0
without snapshots
4. Reorganises all mocks to be placed under ./mock sub-package within
their respective packages
"whitelisting" mechanism (list of files - stored in DB) - which
protecting us from downloading new files after upgrade/downgrade was
broken. And seems it became over-complicated with time.
I replacing it by 1 persistent flag inside downloader:
"prohibit_new_downloads.lock"
Erigon will turn downloader into this mode after
downloading/verification of first snapshots.
```
//Corner cases:
// - Erigon generated file X with hash H1. User upgraded Erigon. New version has preverified file X with hash H2. Must ignore H2 (don't send to Downloader)
// - Erigon "download once": means restart/upgrade/downgrade must not download files (and will be fast)
// - After "download once" - Erigon will produce and seed new files
```
------
`downloader --seedbox` never enters "prohibit new downloads" mode
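A minimal sketch of how such a flag can be read and written (the file name follows the text above; the helper functions and directory handling are assumptions):

```go
package downloader

import (
	"os"
	"path/filepath"
)

// ProhibitNewDownloadsLock is the persistent marker file described above.
const ProhibitNewDownloadsLock = "prohibit_new_downloads.lock"

// prohibitNewDownloads creates the flag after the first download/verification
// cycle has completed, switching the downloader into "download once" mode.
func prohibitNewDownloads(snapDir string) error {
	f, err := os.Create(filepath.Join(snapDir, ProhibitNewDownloadsLock))
	if err != nil {
		return err
	}
	return f.Close()
}

// newDownloadsProhibited reports whether the flag is present. A seedbox
// (`downloader --seedbox`) simply never consults it.
func newDownloadsProhibited(snapDir string) bool {
	_, err := os.Stat(filepath.Join(snapDir, ProhibitNewDownloadsLock))
	return err == nil
}
```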
This PR has fixes for a number of instances in the bor heimdall stage
where nil headers are either ignored or inadvertently processed.
It also demotes milestone related logging messages for missing blocks to
debug, because the process is not at the head of the chain, plus a
general reduction in periodic logging to every 30 secs rather than 20 to
reduce the log output on long runs.
In addition there is a refactor of persistValidatorSets to perform
validator set initiation in a separate function. This is intended to
clarify the operation of persistValidatorSets, which is still performing
2 actions: persisting the snapshot and then using it to check the header
against the synthesized validator set in the snapshot.
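An illustrative outline of that split, with heavily simplified stand-in types (the real code works on Bor snapshots and the stage DB):

```go
package stages

import "errors"

// Heavily simplified stand-ins for the real Bor types.
type header struct{ Signer string }

type snapshot struct{ Validators map[string]bool }

var errUnauthorizedSigner = errors.New("header signed by non-validator")

// initValidatorSet is the factored-out initialisation step; in the real code it
// loads or synthesizes the validator set rather than taking it as a slice.
func initValidatorSet(validators []string) *snapshot {
	s := &snapshot{Validators: map[string]bool{}}
	for _, v := range validators {
		s.Validators[v] = true
	}
	return s
}

// persistValidatorSets keeps its two remaining actions: persist the snapshot,
// then check the header against the synthesized validator set.
func persistValidatorSets(h *header, validators []string, persist func(*snapshot) error) error {
	snap := initValidatorSet(validators)
	if err := persist(snap); err != nil {
		return err
	}
	if !snap.Validators[h.Signer] {
		return errUnauthorizedSigner
	}
	return nil
}
```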
This PR adds support to store the transaction dependency (generated by
the block producer) in the block header for bor. This transaction
dependency will then be used by the parallel processor
([Block-STM](https://github.com/ledgerwatch/erigon/pull/7812/)).
I have created another
[PR](https://github.com/ledgerwatch/erigon-lib/pull/1064) in the
erigon-lib repo which adds the `IsParallelUniverse()` function.
# Background
Erigon currently uses a combination of Victoria Metrics and Prometheus
client for providing metrics.
We want to rationalize this and use only the Prometheus client library,
but we want to maintain the simplified Victoria Metrics methods for
constructing metrics.
This task is currently partly complete and needs to be finished to a
stage where we can remove the Victoria Metrics module from the Erigon
code base.
# Summary of changes
- Adds missing `NewCounter`, `NewSummary`, `NewHistogram`,
`GetOrCreateHistogram` functions to `erigon-lib/metrics` similar to the
interface VictoriaMetrics lib provides
- Minor tidy up for consistency inside `erigon-lib/metrics/set.go`
around return types (panic vs err consistency for funcs inside the
file), error messages, comments
- Replace all remaining usages of `github.com/VictoriaMetrics/metrics`
with `github.com/ledgerwatch/erigon-lib/metrics` - seamless (only import
changes) since interfaces match
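As an example, a constructor in this style can be a thin wrapper over the Prometheus client; the default-registry and panic-on-error choices below are assumptions made for the sketch, not a description of the actual erigon-lib/metrics code:

```go
package metrics

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// NewCounter creates and registers a counter, mirroring the VictoriaMetrics
// style of "pass a name, get a ready metric back"; it panics on registration
// failure (an illustrative choice for this sketch).
func NewCounter(name string) prometheus.Counter {
	c := prometheus.NewCounter(prometheus.CounterOpts{Name: name})
	if err := prometheus.DefaultRegisterer.Register(c); err != nil {
		panic(fmt.Errorf("register counter %q: %w", name, err))
	}
	return c
}
```

Call sites then stay one-liners, e.g. `var discardedTxs = metrics.NewCounter("txpool_discarded_total")` (metric name illustrative).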
This fixes an issue where mumbai testnet nodes struggle to find peers.
Before this fix, test peer counts were typically around 20 in total
across eth66, eth67 and eth68. Some new nodes could struggle to find even
a single peer after days of operation.
These are the numbers after 12 hours of running on a node which
previously could not find any peers: eth66=13, eth67=76, eth68=91.
The root cause of this issue is the following:
- A significant number of mumbai peers around the boot node return
network ids which are different from those currently available in the
DHT
- The available nodes are consequently all busy and return 'too many
peers' for long periods
These issues cause a significant number of discovery timeouts, and some
of the queries never receive a response.
This causes the discovery read loop to enter a channel deadlock, which
means that no responses are processed and no timeouts are fired. This
causes the discovery process in the node to stop. From then on it just
re-requests handshakes from a relatively small number of peers.
This check-in fixes this situation with the following changes:
- Remove the deadlock by running the timer in a separate go-routine so
it can run independently of the main request processing (see the sketch
after this list).
- Allow the discovery process matcher to match on port if no id match
can be established on the initial ping. This allows subsequent node
validation to proceed, and if the node proves to be valid via the
remainder of the look-up and handshake process it is used as a valid
peer.
- Completely unsolicited responses, i.e. those which come from a
completely unknown ip:port combination, continue to be ignored.
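A minimal sketch of the timer decoupling (illustrative types; the real change sits in the discovery read loop):

```go
package discover

import (
	"context"
	"errors"
	"time"
)

// packet and errTimeout are illustrative placeholders, not the real p2p types.
type packet struct{ data []byte }

var errTimeout = errors.New("discovery request timed out")

// awaitReply arms the timeout via time.AfterFunc, which fires on its own
// goroutine, so the caller only ever selects on a reply, the timeout signal or
// cancellation and never blocks on timer maintenance inside the read loop.
func awaitReply(ctx context.Context, replies <-chan packet, timeout time.Duration) (packet, error) {
	timedOut := make(chan struct{})
	timer := time.AfterFunc(timeout, func() { close(timedOut) })
	defer timer.Stop()

	select {
	case p := <-replies:
		return p, nil
	case <-timedOut:
		return packet{}, errTimeout
	case <-ctx.Done():
		return packet{}, ctx.Err()
	}
}
```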
Reason:
- produce and seed snapshots earlier on chain tip; reduce dependency on
"good peers with history" in the p2p network. Some networks do not have
many archive peers, and ConsensusLayer clients are not good (not
incentivised) at serving history.
- avoid having too many files:
more files (shards) means "more metadata", "more lookups for
non-indexed queries", "more dictionaries", "more bittorrent
connections", ...
fewer files means small files get removed after merge (there are no
peers for these files).
ToDo:
[x] Recent 500K - merge up to 100K
[x] Older than 500K - merge up to 500K
[x] Start seeding 100k files
[x] Stop seeding 100k files after merge (right before delete)
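An illustrative encoding of this merge policy (names and the exact boundary are assumptions):

```go
package snapshots

// mergeStepFor returns the target segment size for a block: segments within the
// most recent 500K blocks are merged up to 100K-block files (and seeded), while
// older segments are merged up to 500K-block files; once a small file has been
// merged away it stops being seeded and is deleted.
func mergeStepFor(blockNum, chainTip uint64) uint64 {
	const recentWindow = 500_000
	if chainTip >= recentWindow && blockNum <= chainTip-recentWindow {
		return 500_000 // older than the recent window
	}
	return 100_000 // near the chain tip
}
```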
In the next PR:
[] The old version of Erigon must be able to download recent hashes. To
achieve it, at first start erigon will download the preverified hashes
.toml from s3 - if it's newer than what we have (built-in) - use it.