Making the addReplyMatcher channel unbuffered sometimes makes the loop
too slow to serve parallel requests.
This is an alternative fix that keeps the channel buffered.
Problem:
Some goroutines are blocked on shutdown:
1. table close blocks on <-tab.closed // because the table loop is still pending
2. table loop blocks on <-refreshDone // because lookup shutdown blocks doRefresh
3. lookup shutdown blocks on <-it.replyCh // because it.queryfunc (findnode -
ensureBond) is blocked and not returning errClosed (if it returned and
pushed to it.replyCh, shutdown() would unblock)
4. findnode - ensureBond blocks on <-rm.errc // because the related replyMatcher
was added after loop() exited, so there is nothing to push errClosed and
unblock it
If the addReplyMatcher channel is buffered, it is possible that
UDPv4.pending() adds a new reply matcher after closeCtx.Done().
Such a reply matcher's errc result channel will never be updated, because
UDPv4.loop() has already exited at this point. Subsequent discovery
operations will then deadlock.
Solution:
Revert to an unbuffered channel.
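For illustration, here is a minimal, self-contained sketch of the race with simplified types (not the actual p2p/discover code): with a buffered channel the registration send can still succeed after the loop has exited, so the matcher's errc is never answered, while an unbuffered channel forces the errClosed path.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

var errClosed = errors.New("socket closed")

// replyMatcher is a simplified stand-in for the discovery reply matcher;
// errc is where the loop (or shutdown path) reports the final result.
type replyMatcher struct {
	errc chan error
}

// register mirrors the select used when registering a matcher: it either
// hands the matcher to the loop or, if shutdown already started, fails fast.
func register(ctx context.Context, addReplyMatcher chan *replyMatcher) *replyMatcher {
	m := &replyMatcher{errc: make(chan error, 1)}
	select {
	case addReplyMatcher <- m:
		// With an unbuffered channel this send only succeeds while the loop
		// is still receiving, so the loop is guaranteed to answer errc.
		// With a buffered channel the send can also succeed after the loop
		// has exited, leaving errc with no writer at all.
	case <-ctx.Done():
		m.errc <- errClosed
	}
	return m
}

func main() {
	// Simulate the losing side of the race: the loop has already exited,
	// but the buffered channel still accepts the matcher.
	buffered := make(chan *replyMatcher, 10)
	m := register(context.Background(), buffered)
	select {
	case err := <-m.errc:
		fmt.Println("answered:", err)
	default:
		fmt.Println("matcher accepted after shutdown; errc will never be answered")
	}

	// With an unbuffered channel and a closed context the send cannot
	// succeed, so the caller gets errClosed immediately.
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	unbuffered := make(chan *replyMatcher)
	m2 := register(ctx, unbuffered)
	fmt.Println("unbuffered + closed context:", <-m2.errc)
}
```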
- changed the communication tunnel to WebSocket in order to connect to
remote nodes
- changed the diagnostics.url flag to diagnostics.addr: the user now only
needs to enter an address, and the support command connects to it through
WebSocket
- changed the debug.urls flag to debug.addrs so the connection type between
erigon and support can be switched to WebSocket without changing the user
API
- added an automatic retry of the connection over ws if the initial
connection failed (see the sketch below)
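As a rough illustration of the auto-retry behaviour, here is a hedged sketch using gorilla/websocket; the URLs, path, and back-off are assumptions for the example, not the actual support-command code.

```go
package main

import (
	"fmt"
	"time"

	"github.com/gorilla/websocket"
)

// dialWithFallback illustrates the auto-retry idea: try the preferred URL
// first and fall back to the alternative if the dial fails.
func dialWithFallback(primary, fallback string) (*websocket.Conn, error) {
	conn, _, err := websocket.DefaultDialer.Dial(primary, nil)
	if err == nil {
		return conn, nil
	}
	fmt.Printf("dial %s failed (%v), retrying over %s\n", primary, err, fallback)
	time.Sleep(time.Second) // small pause before the retry
	conn, _, err = websocket.DefaultDialer.Dial(fallback, nil)
	return conn, err
}

func main() {
	conn, err := dialWithFallback(
		"wss://localhost:6060/debug/diag", // hypothetical secure endpoint
		"ws://localhost:6060/debug/diag",  // hypothetical plain endpoint
	)
	if err != nil {
		fmt.Println("could not connect:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected")
}
```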
# Background
Erigon currently uses a combination of Victoria Metrics and Prometheus
client for providing metrics.
We want to rationalize this and use only the Prometheus client library,
but we want to maintain the simplified Victoria Metrics methods for
constructing metrics.
This task is currently partly complete and needs to be finished to a
stage where we can remove the Victoria Metrics module from the Erigon
code base.
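As a sketch of the intended shape, the Victoria Metrics-style constructor can be kept as a thin wrapper over the Prometheus client. The names and the plain-name-only handling below are simplifications, not Erigon's actual metrics package.

```go
// Package metrics: a minimal sketch of keeping the Victoria Metrics-style
// constructors while backing them with the Prometheus client library.
package metrics

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	mu       sync.Mutex
	counters = map[string]prometheus.Counter{}
)

// GetOrCreateCounter mimics the VictoriaMetrics helper of the same name:
// repeated calls with the same name return the same registered counter.
func GetOrCreateCounter(name string) prometheus.Counter {
	mu.Lock()
	defer mu.Unlock()
	if c, ok := counters[name]; ok {
		return c
	}
	c := prometheus.NewCounter(prometheus.CounterOpts{Name: name})
	prometheus.MustRegister(c)
	counters[name] = c
	return c
}
```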
## Tests
### Functional
* Make sure that the int->float format change implied by the move from VM
to Prometheus does not impact clients (pay particular attention to block
numbers)
* Check that the prometheus/grafana dashboards defined in cmd/prometheus
are functional after the change
(see docker-compose.yml for details and
https://github.com/ledgerwatch/erigon/tree/devel/cmd/prometheus#readme)
* Confirm that the underlying go metrics are still generated
* Confirm that the following flag settings work with the new code:
--metrics, --metrics.addr, --metrics.port
* Confirm that the --metrics and --pprof settings and handler configuration
still allow metrics and pprof to share a port (see the sketch below)
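For the shared-port check, a minimal sketch of how metrics and pprof can be served from one listener; the port and metrics path are assumptions for the example, not Erigon's actual wiring.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serve the Prometheus scrape endpoint and the pprof handlers from the
	// same listener.
	http.Handle("/debug/metrics/prometheus", promhttp.Handler())
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```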
#### Float counters - scientific notation test case
![Screenshot_2023-11-07_at_15 57 21](https://github.com/ledgerwatch/erigon/assets/94537774/32f0a6f6-968b-477c-8ec8-bb1812f3e848)
![Screenshot 2023-11-15 at 16 26 56](https://github.com/ledgerwatch/erigon/assets/94537774/3f402b2e-e343-4928-9fbb-18fa4d077485)
#### Float counters - NaN test case
![Screenshot_2023-11-07_at_16 04 25](https://github.com/ledgerwatch/erigon/assets/94537774/cbf90d5d-3749-4bd7-971d-e2124e54267c)
![Screenshot 2023-11-15 at 16 28 36](https://github.com/ledgerwatch/erigon/assets/94537774/5924915e-1977-4b7f-8082-23f73d0957d5)
### Performance
* Check that the performance of the counters for the RPC call measurements
created by rpc/metrics.go is not impacted by the change.
#### RPC
Performed tests with rpcdaemon and erigon on localhost using
`eth_blockNumber`.
Ran tests with 100, 1000, and 10000 requests and got a steady 15 ms
response time.
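For reproducibility, a hedged sketch of the kind of loop used for this measurement; the endpoint, port, and request count are assumptions about the local setup.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Assumed local setup: rpcdaemon listening on the default HTTP port.
	const url = "http://localhost:8545"
	const n = 1000
	payload := []byte(`{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}`)

	start := time.Now()
	for i := 0; i < n; i++ {
		resp, err := http.Post(url, "application/json", bytes.NewReader(payload))
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}
		resp.Body.Close()
	}
	fmt.Printf("%d requests, average %v per request\n", n, time.Since(start)/n)
}
```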
#### Memory
![Screenshot 2023-11-16 at 09 58 39](https://github.com/ledgerwatch/erigon/assets/94537774/5dd956d7-903f-4bea-a460-d3644da56201)
We plan to keep increasing this default step by step.
We still see users for whom it helped to handle more RPC load.
Tradeoff: increasing this flag increases "historical RPC" throughput and
decreases "recent data RPC" throughput.
Based on https://github.com/maticnetwork/bor/pull/871 in bor, this PR
handles the import of same-difficulty chains (tie-breaker conditions) based
on their height and hash.
This PR also modifies an existing test to check different types of
side-chain import and how the canonical chain is decided.
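To make the tie-breaker idea concrete, here is a hedged sketch; the preference order shown (lower height first, then smaller hash) is an assumption for illustration only, and the authoritative rule is the one defined in the bor PR and this change.

```go
package main

import (
	"bytes"
	"fmt"
	"math/big"
)

// header is a simplified stand-in for a chain head; only the fields needed
// for the tie-breaker are included.
type header struct {
	Number *big.Int
	Hash   [32]byte
	TD     *big.Int // total difficulty
}

// shouldReorg reports whether the external head should replace the local one.
// On equal total difficulty it falls back to height and then hash (assumed
// ordering, for illustration).
func shouldReorg(local, extern header) bool {
	switch extern.TD.Cmp(local.TD) {
	case 1:
		return true
	case -1:
		return false
	}
	// Equal total difficulty: tie-break on height, then on hash.
	if c := extern.Number.Cmp(local.Number); c != 0 {
		return c < 0
	}
	return bytes.Compare(extern.Hash[:], local.Hash[:]) < 0
}

func main() {
	local := header{Number: big.NewInt(100), TD: big.NewInt(500), Hash: [32]byte{0x01}}
	extern := header{Number: big.NewInt(100), TD: big.NewInt(500), Hash: [32]byte{0x02}}
	fmt.Println(shouldReorg(local, extern)) // false: same TD and height, local hash is smaller
}
```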
This fixes an issue where mumbai testnet nodes struggle to find peers.
Before this fix, peer numbers in general testing were typically around 20
in total across eth66, eth67 and eth68, and some new nodes could struggle
to find even a single peer after days of operation.
These are the numbers after 12 hours of running on a node which previously
could not find any peers: eth66=13, eth67=76, eth68=91.
The root cause of this issue is the following:
- A significant number of mumbai peers around the boot node return
network ids which are different from those currently available in the
DHT
- The available nodes are consequently all busy and return 'too many
peers' for long periods
These issues cause a significant number of discovery timeouts, and some of
the queries never receive a response.
This causes the discovery read loop to enter a channel deadlock, which
means that no responses are processed and no timeouts are fired. The
discovery process in the node then stops; from then on it just re-requests
handshakes from a relatively small number of peers.
This check-in fixes this situation with the following changes:
- Remove the deadlock by running the timer in a separate goroutine so
it can run independently of the main request processing (see the sketch
after this list).
- Allow the discovery process matcher to match on port if no id match
can be established on the initial ping. This allows subsequent node
validation to proceed, and if the node proves to be valid via the
remainder of the look-up and handshake process it is used as a valid
peer.
- Completely unsolicited responses, i.e. those which come from a
completely unknown ip:port combination, continue to be ignored.
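A minimal sketch of the timer-in-its-own-goroutine idea, with simplified event types rather than the actual discovery loop: timeouts become ordinary events delivered on the same channel as responses, so the main loop never blocks on arming or handling a timer.

```go
package main

import (
	"fmt"
	"time"
)

type event struct{ kind string }

func main() {
	events := make(chan event)

	// Timer goroutine: fires timeouts independently of request processing.
	go func() {
		t := time.NewTicker(500 * time.Millisecond)
		defer t.Stop()
		for range t.C {
			events <- event{kind: "timeout"}
		}
	}()

	// Response goroutine: stands in for the packet read path.
	go func() {
		for i := 0; i < 3; i++ {
			time.Sleep(300 * time.Millisecond)
			events <- event{kind: "response"}
		}
	}()

	// The main loop only ever waits on the events channel, so a stalled
	// timer can no longer deadlock response processing (and vice versa):
	// both sides just queue events.
	deadline := time.After(2 * time.Second)
	for {
		select {
		case ev := <-events:
			fmt.Println("handled", ev.kind)
		case <-deadline:
			return
		}
	}
}
```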
At `turbo/jsonrpc/bor_snapshot.go:239`, the code creates a read-only
transaction and acquires a semaphore slot, but never rolls back or commits
the transaction, so the semaphore is never released. Over time this locks
up all of the semaphore's resources and nothing else can acquire it.
I added a deferred rollback of the transaction so the semaphore is released.
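A hedged sketch of the fix pattern, with illustrative types and names rather than Erigon's actual kv API: the rollback is deferred immediately after the transaction is opened, so the semaphore slot it holds is released on every return path.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/semaphore"
)

// roTx stands in for a read-only database transaction whose Rollback must be
// called to release the semaphore slot acquired when it was opened.
type roTx struct {
	sem *semaphore.Weighted
}

func beginRo(ctx context.Context, sem *semaphore.Weighted) (*roTx, error) {
	if err := sem.Acquire(ctx, 1); err != nil {
		return nil, err
	}
	return &roTx{sem: sem}, nil
}

func (tx *roTx) Rollback() { tx.sem.Release(1) }

// handler shows the fix pattern: defer the rollback immediately after the
// transaction is opened, so the slot is released on every return path.
func handler(ctx context.Context, sem *semaphore.Weighted) error {
	tx, err := beginRo(ctx, sem)
	if err != nil {
		return err
	}
	defer tx.Rollback()
	// ... read-only work using tx ...
	return nil
}

func main() {
	sem := semaphore.NewWeighted(2)
	for i := 0; i < 5; i++ {
		if err := handler(context.Background(), sem); err != nil {
			fmt.Println("call", i, "failed:", err)
			return
		}
	}
	fmt.Println("all calls completed; no semaphore slots leaked")
}
```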