When the sync loop first runs, it suppresses block sync events both on the
initial loop iteration and when more than 1000 blocks are being processed.
This fix removes the first check, because otherwise the first block
received by the process never gets sent to the tx pool, which means no
new blocks are produced for Polygon.
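A minimal sketch of the condition in question, using hypothetical names; the real sync-loop code differs:
```go
package syncloop

// shouldNotifyTxPool sketches the changed condition with assumed names; this
// is not the actual erigon sync-loop code. Before the fix the event was also
// suppressed when firstCycle was true, which dropped the very first block
// received by the process.
func shouldNotifyTxPool(firstCycle bool, blocksProcessed int) bool {
	_ = firstCycle // no longer part of the condition after the fix
	return blocksProcessed <= 1000
}
```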
As well as this fix, I have also moved the gas initialization into the
txpool start method rather than prompting it with a 'synthetic block
event'.
As the txpool start method has access to the core and tx DBs, it can find
the current block and chain config internally, so it doesn't need to be
activated externally; it can just do this itself on start-up. This has
the advantage of making the txpool more self-contained.
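A rough sketch of the shape this takes, with assumed interfaces standing in for the core and tx DB access rather than the actual erigon txpool API:
```go
package txpool

import "context"

// HeadReader and ConfigReader are assumed stand-ins for whatever the pool
// uses to read the core/tx DBs; they are not real erigon interfaces.
type HeadReader interface {
	CurrentBlockGasLimit(ctx context.Context) (uint64, error)
	CurrentBaseFee(ctx context.Context) (uint64, error)
}

type ConfigReader interface {
	ChainID(ctx context.Context) (uint64, error)
}

type Pool struct {
	chainID  uint64
	baseFee  uint64
	gasLimit uint64
	started  bool
}

// Start initialises gas/fee state from the databases directly, so no
// synthetic block event is needed to prime the pool.
func (p *Pool) Start(ctx context.Context, heads HeadReader, cfg ConfigReader) error {
	if p.started {
		return nil
	}
	id, err := cfg.ChainID(ctx)
	if err != nil {
		return err
	}
	fee, err := heads.CurrentBaseFee(ctx)
	if err != nil {
		return err
	}
	limit, err := heads.CurrentBlockGasLimit(ctx)
	if err != nil {
		return err
	}
	p.chainID, p.baseFee, p.gasLimit = id, fee, limit
	p.started = true
	return nil
}
```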
This PR contains 3 fixes for the interaction between the Bor mining loop and
the TX pool which were causing the regular creation of blocks with zero
transactions.
* Mining/Tx pool block synchronization
The synchronization of the tx pool between the sync loop and the mining
loop has been changed so that both are triggered by the same event and
synchronized via a sync.Cond rather than a polling loop with a
hard-coded loop limit. This means that mining now waits for the pool to be
updated from the previous block before it starts the mining process.
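A minimal sketch of the pattern, with illustrative names rather than the real erigon types:
```go
package txpoolsync

import "sync"

// poolProgress tracks the last block the tx pool has been updated for.
// All names here are illustrative, not the real erigon code.
type poolProgress struct {
	mu   sync.Mutex
	cond *sync.Cond
	last uint64
}

func newPoolProgress() *poolProgress {
	p := &poolProgress{}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// markUpdated is called by the sync loop once the pool has processed blockNum.
func (p *poolProgress) markUpdated(blockNum uint64) {
	p.mu.Lock()
	if blockNum > p.last {
		p.last = blockNum
	}
	p.mu.Unlock()
	p.cond.Broadcast()
}

// waitFor is called by the mining loop: it blocks until the pool has caught
// up with the parent block, instead of polling with a hard-coded retry limit.
func (p *poolProgress) waitFor(parent uint64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for p.last < parent {
		p.cond.Wait()
	}
}
```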
* Txpool Startup consolidated into its MainLoop
Previously the tx pool start process was triggered dynamically at
various points in the code. This has all now been moved to the start of
the main loop. This is necessary to avoid a timing hole which could leave
the mining loop hanging, waiting for a previous block broadcast which
it missed due to its delayed start.
* Mining listens for block broadcast to avoid duplicate mining
operations
The mining loop for bor has a recommit timer in case blocks are not
produced on time. However, in the case of sprint transitions where the
seal publication is delayed, this can lead to duplicate block production.
This is suppressed by introducing a `waiting` state which is exited upon
the block being broadcast from the sealing operation.
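A simplified sketch of the `waiting` state, with assumed channel names rather than the actual bor worker code:
```go
package mining

import "time"

// runLoop sketches how the recommit timer is suppressed while a sealing task
// is in flight. The overall shape and the names are assumptions, not the
// real bor mining loop.
func runLoop(recommit time.Duration, startWork func(), blockBroadcast <-chan struct{}, quit <-chan struct{}) {
	timer := time.NewTimer(recommit)
	defer timer.Stop()
	waiting := false

	for {
		select {
		case <-timer.C:
			if !waiting {
				startWork() // kick off a new sealing round
				waiting = true
			}
			timer.Reset(recommit)
		case <-blockBroadcast:
			// The sealed block has been broadcast; leave the waiting state so
			// the next recommit can produce a new block instead of a duplicate.
			waiting = false
			timer.Reset(recommit)
		case <-quit:
			return
		}
	}
}
```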
Currently the mining loop is broken for the polygon chain. This PR fixes
this.
High level changes:
- Introduces new Bor<->Heimdall stage specifically for the needs of the
mining flow
- Extracts out common logic from Bor<->Heimdall sync and mining stages
into shared functions
- Removes `mine` flag for the Bor<->Heimdall sync stage
- Extends the current `StartMining` function to prefetch span zero if
needed before the mining loop is started
- Fixes Bor to read span zero (instead of span 1) from heimdall when the
span is not initially set in the local smart contract that the Spanner
uses
Test with devnet "state-sync" scenario:
![Screenshot 2024-01-05 at 17 41 23](https://github.com/ledgerwatch/erigon/assets/94537774/34ca903a-69b8-416a-900f-a32f2d4417fa)
While working on fixing the bor mining loop I stumbled across an error
in `ChainReader.BorSpan` - a "not implemented" panic. I also hit a few other
panics due to a missing logger in `ChainReaderImpl` struct initialisations.
This PR fixes both.
Mdbx now takes a logger, but this had not been pushed to all callers,
meaning some still had an invalid logger.
This fixes the log propagation.
It also fixes a start-up issue for http.enabled and txpool.disable
created by a previous merge.
Users reported this error:
```
[bor.heimdall] an error while trying fetching path=clerk/event-record/list attempt=5 error="unexpected end of JSON input"
```
This may happen if:
1. Heimdall is behind and not synced - for more info check
https://github.com/maticnetwork/heimdall/pull/993
2. The header time erigon is sending is far into the future
The logs in this PR will help us see which of the two is the culprit, but
most likely it is 1. We will investigate 2. further if it ever happens.
Changes:
1. Improves logging upon heimdall client retries - prints out the full
url that failed.
2. Fixes a bug where the body was incorrectly checked for emptiness -
`len(body) == 0` vs `body == nil` (see the sketch after this list)
3. Adds a regression unit test for the bug
4. Adds a log telling users to check their heimdall process if they run
into this scenario, since that may be the culprit
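As referenced in change 2, a rough sketch of the fixed emptiness check, as a hypothetical helper rather than the actual heimdall client code (assuming the old code used the nil comparison):
```go
package heimdall

// isEmptyBody illustrates the fix: a non-nil but zero-length response body
// (e.g. []byte{}) must also be treated as empty, otherwise it is handed to
// the JSON decoder and fails with "unexpected end of JSON input".
func isEmptyBody(body []byte) bool {
	return len(body) == 0 // a plain `body == nil` check misses []byte{}
}
```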
Example output with new logs
<img width="1465" alt="Screenshot 2023-12-29 at 20 16 57"
src="https://github.com/ledgerwatch/erigon/assets/94537774/1ebfde68-aa93-41d6-889a-27bef5414f25">
Heimdall prepares the next span a number of sprints before the current
span ends. Currently we always fetch the next span regardless of which
sprint we are in during the current span. This causes a liveness issue
due to how the Heimdall client works (it infinitely retries until it
fetches a span - this issue will be fixed in a separate PR). This PR
fixes this by matching what bor does - it fetches the next span only in
the last sprint of the current span.
Changes:
- Adds a unit test for the above
- Adds a new function `BlockInLastSprintOfSpan` (sketched below)
- Some code reorg and cleanup - moves the span num related functions
from the bor package to the span sub package for better logical grouping
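A sketch of the `BlockInLastSprintOfSpan` idea under simplifying assumptions - fixed span and sprint lengths and no special handling of the shorter first span; the real values come from the bor config:
```go
package span

// Assumed constants for illustration only; on Polygon these values live in
// the bor chain config and have changed across hard forks.
const (
	spanLength   = 6400 // blocks per span
	sprintLength = 16   // blocks per sprint
)

// BlockInLastSprintOfSpan reports whether blockNum falls within the final
// sprint of its span, which is when the next span should be fetched.
func BlockInLastSprintOfSpan(blockNum uint64) bool {
	posInSpan := blockNum % spanLength
	return posInSpan >= spanLength-sprintLength
}
```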
Adds unit tests for:
- Bor Heimdall Stage - `checkHeaderExtraData`
  - at the end of each sprint, verifies that the validators in the header
extra data match the selected proposers from the heimdall span
  - 1 test for selected proposers length mismatch
  - 1 test for selected proposers bytes mismatch
- Bor Heimdall Stage - `persistValidatorSets`
  - verifies that each header is created by a validator in the validator
set
  - when a header fails this check, the unwind point is set
Add paths to the heimdall config URL when creating calls so that extra
paths needed by, for example, proxy servers are not stripped from the flag
value passed into the process.
1. Adds an eth/stagedsync/test package which provides a test Harness
object
2. Adds the first automated test to the bor-heimdall stage regarding
span persistence (more to come in subsequent PRs)
3. Fixes a bug in the bor-heimdall stage which was uncovered with the
test - we do not fetch span 0 when we sync straight from blockNum=0
without snapshots
4. Reorganises all mocks to be placed under a ./mock sub-package within
their respective packages
This PR has fixes for a number of instances in the bor heimdall stage
where nil headers are either ignored or inadvertently processed.
It also demotes milestone-related logging messages about missing blocks to
debug, since they occur because the process is not at the head of the chain,
and reduces the periodic logging interval from 20 secs to 30 to cut the log
output on long runs.
In addition there is a refactor of persistValidatorSets to perform
validator set initiation in a separate function. This is intended to
clarify the operation of persistValidatorSets, which is still performing
2 actions: persisting the snapshot, and then using it to check the header
against the synthesized validator set in the snapshot.
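Roughly the shape of the split, with stand-in types and signatures rather than the actual stage code:
```go
package borheimdall

import "errors"

// Snapshot and Header are minimal stand-ins; the real stage uses the bor
// snapshot and types.Header. The split mirrors the description: validator
// set initialisation is pulled out, while persistValidatorSets persists the
// snapshot and then checks the header signer against the synthesized set.
type Snapshot struct{ Validators map[string]bool }
type Header struct{ Signer string }

var errInvalidSigner = errors.New("header signed by address outside validator set")

// initValidatorSets builds (or loads) the initial snapshot; extracted into
// its own function by the refactor.
func initValidatorSets(genesisValidators []string) *Snapshot {
	s := &Snapshot{Validators: map[string]bool{}}
	for _, v := range genesisValidators {
		s.Validators[v] = true
	}
	return s
}

// persistValidatorSets now only persists the snapshot and validates the header.
func persistValidatorSets(persist func(*Snapshot) error, snap *Snapshot, h *Header) error {
	if err := persist(snap); err != nil {
		return err
	}
	if !snap.Validators[h.Signer] {
		return errInvalidSigner
	}
	return nil
}
```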
This PR adds support to store the transaction dependency (generated by
the block producer) in the block header for bor. This transaction
dependency will then be used by the parallel processor
([Block-STM](https://github.com/ledgerwatch/erigon/pull/7812/)).
I have created another
[PR](https://github.com/ledgerwatch/erigon-lib/pull/1064) in the
erigon-lib repo which adds the `IsParallelUniverse()` function.
I am using `https://heimdall-api-testnet.polygon.technology/` and it seems
the 5 sec timeout is sometimes not enough, even though that remote service
is working well (the node is syncing well).
Most of the timeouts come from the same endpoint:
```
[bor.heimdall] request canceled reason="context deadline exceeded" path=/milestone/lastNoAck attempt=2
```
* fix "genesis hash does not match" when dev nodes connect
The "dev" nodes need to have the same --miner.etherbase in order to
generate the same genesis ExtraData by DeveloperGenesisBlock(). Override
DevnetEtherbase global var that's used if --miner.etherbase is not
passed. (for NonBlockProducer case)
* fix missing private key for the hardcoded DevnetEtherbase
Fixes a panic if SigKey is not found. Bor non-producers will use the default
`DevnetEtherbase` while Dev nodes modify it. Save the hardcoded
DevnetEtherbase/DevnetSignPrivateKey into accounts so that SigKey can
recover it.
* refactor devnet.node to contain Node config
This avoids interface{} type casts and fixes an error with
Heimdall.validatorSet == nil
* add connection retries to rpcCall and Subscribe of requestGenerator
Fixes "connection refused" errors due to node not ready to handle early
RPC requests.
* fix deadlock in Heimdall.NodeStarted
* fix GetBlockByNumber
Fixes "cannot unmarshal string into Go struct field body.transactions of
type jsonrpc.RPCTransaction"
* demote "no of blocks on childchain is less than confirmations
required" to Info (#8626)
* demote "mismatched pending subpool size" to Debug (#8615)
* revert wiggle testing code
If HeaderDownload.VerifyHeader always returns false, memory usage grows at
a fast pace because Link objects (containing headers) are not deallocated
even after the link queue is pruned.
Add a check against the inbound headers before publishing a newly mined
block after the wait delay.
If the node received a block while it was processing transactions, or
waiting for its publish slot, do a final check that another node hasn't
already published a block.
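The final check reduces to something like the following sketch; the names are assumptions, not the real erigon code:
```go
package mining

// shouldBroadcastMined illustrates the pre-publish check described above: if
// an inbound header at or above the mined block's height has already been
// seen, another node has won the slot and the local block is dropped.
func shouldBroadcastMined(minedBlockNum, highestInboundNum uint64) bool {
	return minedBlockNum > highestInboundNum
}
```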
Fixes an issue with Polygon validators where locally mined blocks are
broadcast with invalid header hashes because the NewBlock message
constructor was removing the ReceiptHash, which contributes to the header
hash.
This results in the bor header validation code not being able to
correctly identify the signer of the header, so header validation
fails.
This also likely fixes part of the bogon-block issue which was
identified by the polygon team.
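A toy illustration of the failure mode, using a stand-in header type rather than the real one: any field that feeds the header hash must survive the NewBlock message construction, otherwise the broadcast hash no longer matches what was signed and signer recovery fails.
```go
package headerhash

import (
	"crypto/sha256"
	"encoding/binary"
)

// toyHeader is a stand-in; in the real header the receipts root is one of
// the RLP fields that the header hash (and hence the bor signature) commits to.
type toyHeader struct {
	Number      uint64
	ReceiptHash [32]byte
}

func (h toyHeader) hash() [32]byte {
	buf := make([]byte, 8, 8+len(h.ReceiptHash))
	binary.BigEndian.PutUint64(buf, h.Number)
	buf = append(buf, h.ReceiptHash[:]...)
	return sha256.Sum256(buf)
}

// Zeroing ReceiptHash when re-building the header for the NewBlock message
// yields a different hash, so validation recovers the wrong signer:
//
//	orig := toyHeader{Number: 42, ReceiptHash: someRoot}
//	stripped := orig
//	stripped.ReceiptHash = [32]byte{}
//	// orig.hash() != stripped.hash() -> broadcast hash no longer matches
```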