This fixes an issue where the mumbai testnet node struggle to find
peers. Before this fix in general test peer numbers are typically around
20 in total between eth66, eth67 and eth68. For new peers some can
struggle to find even a single peer after days of operation.
These are the numbers after 12 hours or running on a node which
previously could not find any peers: eth66=13, eth67=76, eth68=91.
The root cause of this issue is the following:
- A significant number of mumbai peers around the boot node return
network ids which are different from those currently available in the
DHT
- The available nodes are all consequently busy and return 'too many
peers' for long periods
These issues case a significant number of discovery timeouts, some of
the queries will never receive a response.
This causes the discovery read loop to enter a channel deadlock - which
means that no responses are processed, nor timeouts fired. This causes
the discovery process in the node to stop. From then on it just
re-requests handshakes from a relatively small number of peers.
This check in fixes this situation with the following changes:
- Remove the deadlock by running the timer in a separate go-routine so
it can run independently of the main request processing.
- Allow the discovery process matcher to match on port if no id match
can be established on initial ping. This allows subsequent node
validation to proceed and if the node proves to be valid via the
remainder of the look-up and handshake process it us used as a valid
peer.
- Completely unsolicited responses, i.e. those which come from a
completely unknown ip:port combination continue to be ignored.
-
This is a non functional change which consolidates the various packages
under metrics into the top level package now that the dead code is
removed.
It is a precursor to the removal of Victoria metrics after which all
erigon metrics code will be contained in this single package.
Improve p2p error handling to propagate errors
from the origin up the call chain the Server peer removal code
using a new PeerError type containing a DiscReason and a more detailed
description.
The origin can be tracked down using PeerErrorCode (code) and DiscReason
(reason)
which looks like this in the log:
> [TRACE] [08-28|16:33:40.205] Removing p2p peer peercount=0
url=enode://d399f4b...@1.2.3.4:30303 duration=6.901ms
err="PeerError(code=remote disconnect reason, reason=too many peers,
err=<nil>, message=Peer.run got a remote DiscReason)"
This is an update of:
https://github.com/ledgerwatch/erigon/pull/7846
which uses a local fork of victoria metrics to include the changes that
https://github.com/anshalshukla added to the original for we where
using.
It also includes code to address the duplicate metrics issue identified
here:
https://github.com/ledgerwatch/erigon/issues/8053
It has one more associated fix which is to correctly add a metadata
label to counters, these where previously labelled as gauges.
e.g.
```
# TYPE p2p_peers counter
p2p_peers 0
```
rather than
```
# TYPE p2p_peers gauge
p2p_peers 0
```
---------
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
Co-authored-by: Anshal Shukla <shukla.anshal85@gmail.com>
The disconnect message could either be a plain integer, or a list with
one integer element. We were encoding it as a plain integer, but
decoding as a list. Change this to be able to decode any format.
* Switch peerId from 256 to 512 bit (as in stable)
* go mod tidy
* Fix some tests
* Fixed
* Fixes
* Fix tests
* Update to erigon-lib main
Co-authored-by: Alex Sharp <alexsharp@Alexs-MacBook-Pro.local>
Co-authored-by: Alexey Sharp <alexeysharp@Alexeys-iMac.local>
* use "log" for struct fields
* use "logger" for function parameters and local vars
This is a compromise between:
1) using logger := log.New() to avoid aliasing (log := log.New())
2) and keeping it short when logging e.g.: srv.log.Info(...)
* implemented crash reporting for all goroutine panics that aren't handled explicitly
* implemented crash reporting for all goroutine panics that aren't handled explicitly
* changed node defaults back to originals after testing
* implemented panic handling for all goroutines that don't explicitly handle them, outputting the stack trace to a file in crashreports
* handling panics on all goroutines gracefully
* updated missing call
* error assignment
* implemented suggestions
* path.Join added
* implemented Evgeny's suggestions
* changed path.Join to filepath.Join for cross-platform
* added err check
* updated RecoverStackTrace to LogPanic
* updated closures
* removed call of common.Go to some goroutines
* updated scope capture
* removed testing files
* reverted back to original method, I feel like its less intrusive
* update filename for clarity
Prevents a situation where we (not running snap) connects with a peer running snap, and get stalled waiting for snap registration to succeed (which will never happen), which cause a waitgroup wait to halt shutdown
* peer: return localAddr instead of name to prevent spam
We currently use the name (which can be freely set by the peer) in several log messages.
This enables malicious actors to write spam into your geth log.
This commit returns the localAddr instead of the freely settable name.
* p2p: reduce usage of peer.Name in warn messages
* eth, p2p: use truncated names
* Update peer.go
Co-authored-by: Marius van der Wijden <m.vanderwijden@live.de>
Co-authored-by: Felix Lange <fjl@twurst.com>
The dialer limits itself to one attempt every 30s. Apply the same limit
in Server and reject peers which try to connect too eagerly. The check
against the limit happens right after accepting the connection.
Further changes in this commit ensure we pass the Server logger
down to Peer instances, discovery and dialState. Unit test logging now
works in all Server tests.
Package p2p/enode provides a generalized representation of p2p nodes
which can contain arbitrary information in key/value pairs. It is also
the new home for the node database. The "v4" identity scheme is also
moved here from p2p/enr to remove the dependency on Ethereum crypto from
that package.
Record signature handling is changed significantly. The identity scheme
registry is removed and acceptable schemes must be passed to any method
that needs identity. This means records must now be validated explicitly
after decoding.
The enode API is designed to make signature handling easy and safe: most
APIs around the codebase work with enode.Node, which is a wrapper around
a valid record. Going from enr.Record to enode.Node requires a valid
signature.
* p2p/discover: port to p2p/enode
This ports the discovery code to the new node representation in
p2p/enode. The wire protocol is unchanged, this can be considered a
refactoring change. The Kademlia table can now deal with nodes using an
arbitrary identity scheme. This requires a few incompatible API changes:
- Table.Lookup is not available anymore. It used to take a public key
as argument because v4 protocol requires one. Its replacement is
LookupRandom.
- Table.Resolve takes *enode.Node instead of NodeID. This is also for
v4 protocol compatibility because nodes cannot be looked up by ID
alone.
- Types Node and NodeID are gone. Further commits in the series will be
fixes all over the the codebase to deal with those removals.
* p2p: port to p2p/enode and discovery changes
This adapts package p2p to the changes in p2p/discover. All uses of
discover.Node and discover.NodeID are replaced by their equivalents from
p2p/enode.
New API is added to retrieve the enode.Node instance of a peer. The
behavior of Server.Self with discovery disabled is improved. It now
tries much harder to report a working IP address, falling back to
127.0.0.1 if no suitable address can be determined through other means.
These changes were needed for tests of other packages later in the
series.
* p2p/simulations, p2p/testing: port to p2p/enode
No surprises here, mostly replacements of discover.Node, discover.NodeID
with their new equivalents. The 'interesting' API changes are:
- testing.ProtocolSession tracks complete nodes, not just their IDs.
- adapters.NodeConfig has a new method to create a complete node.
These changes were needed to make swarm tests work.
Note that the NodeID change makes the code incompatible with old
simulation snapshots.
* whisper/whisperv5, whisper/whisperv6: port to p2p/enode
This port was easy because whisper uses []byte for node IDs and
URL strings in the API.
* eth: port to p2p/enode
Again, easy to port because eth uses strings for node IDs and doesn't
care about node information in any way.
* les: port to p2p/enode
Apart from replacing discover.NodeID with enode.ID, most changes are in
the server pool code. It now deals with complete nodes instead
of (Pubkey, IP, Port) triples. The database format is unchanged for now,
but we should probably change it to use the node database later.
* node: port to p2p/enode
This change simply replaces discover.Node and discover.NodeID with their
new equivalents.
* swarm/network: port to p2p/enode
Swarm has its own node address representation, BzzAddr, containing both
an overlay address (the hash of a secp256k1 public key) and an underlay
address (enode:// URL).
There are no changes to the BzzAddr format in this commit, but certain
operations such as creating a BzzAddr from a node ID are now impossible
because node IDs aren't public keys anymore.
Most swarm-related changes in the series remove uses of
NewAddrFromNodeID, replacing it with NewAddr which takes a complete node
as argument. ToOverlayAddr is removed because we can just use the node
ID directly.
* p2p: add DialRatio for configuration of inbound vs. dialed connections
* p2p: add connection flags to PeerInfo
* p2p/netutil: add SameNet, DistinctNetSet
* p2p/discover: improve revalidation and seeding
This changes node revalidation to be periodic instead of on-demand. This
should prevent issues where dead nodes get stuck in closer buckets
because no other node will ever come along to replace them.
Every 5 seconds (on average), the last node in a random bucket is
checked and moved to the front of the bucket if it is still responding.
If revalidation fails, the last node is replaced by an entry of the
'replacement list' containing recently-seen nodes.
Most close buckets are removed because it's very unlikely we'll ever
encounter a node that would fall into any of those buckets.
Table seeding is also improved: we now require a few minutes of table
membership before considering a node as a potential seed node. This
should make it less likely to store short-lived nodes as potential
seeds.
* p2p/discover: fix nits in UDP transport
We would skip sending neighbors replies if there were fewer than
maxNeighbors results and CheckRelayIP returned an error for the last
one. While here, also resolve a TODO about pong reply tokens.
p2p/simulations: introduce dialBan
- Refactor simulations/network connection getters to support
avoiding simultaneous dials between two peers If two peers dial
simultaneously, the connection will be dropped to help avoid
that, we essentially lock the connection object with a
timestamp which serves as a ban on dialing for a period of time
(dialBanTimeout).
- The connection getter InitConn can be wrapped and passed to the
nodes via adapters.NodeConfig#Reachable field and then used by
the respective services when they initiate connections. This
massively stablise the emerging connectivity when running with
hundreds of nodes bootstrapping a network.
p2p: add Inbound public method to p2p.Peer
p2p/simulations: Add server id to logs to support debugging
in-memory network simulations when multiple peers are logging.
p2p: SetupConn now returns error. The dialer checks the error and
only calls resolve if the actual TCP dial fails.
This commit introduces a network simulation framework which
can be used to run simulated networks of devp2p nodes. The
intention is to use this for testing protocols, performing
benchmarks and visualising emergent network behaviour.
Using a Timer over Ticker seems to be a lot better, though I cannot fully
account for why that it behaves so (since Ticker should be more bursty, but not
necessarily more active over time, but that may depend on how long window it
uses to decide on when to tick next)