lighthouse-pulse/validator_client/src
Michael Sproul 57dfcfd83a Optimise attestation selection proof signing (#4033)
## Issue Addressed

Closes #3963 (hopefully)

## Proposed Changes

Compute attestation selection proofs gradually each slot rather than in a single `join_all` at the start of each epoch. On a machine with 5k validators this replaces 5k tasks signing 5k proofs with 1 task that signs 5k/32 ~= 160 proofs each slot.
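
The per-slot figure above follows from dividing the validator count by the number of slots in an epoch. A minimal sketch of that arithmetic, assuming 32 slots per epoch and one attestation duty per validator per epoch (the function and constant names are illustrative, not taken from the Lighthouse code base):

```rust
/// Assumed number of slots per epoch (32, as used in the description above).
pub const SLOTS_PER_EPOCH: u64 = 32;

/// Average number of selection proofs to sign each slot, assuming every
/// validator has exactly one attestation duty per epoch.
pub fn proofs_per_slot(validators: u64) -> f64 {
    validators as f64 / SLOTS_PER_EPOCH as f64
}
```

For 5,000 validators this gives 156.25, which the description rounds to roughly 160.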

Based on testing with Goerli validators, this seems to reduce the average time to produce a signature by preventing Tokio and the OS from falling over each other trying to run hundreds of threads. My testing so far has been with local keystores, which run on a dynamic pool of up to 512 OS threads because they use [`spawn_blocking`](https://docs.rs/tokio/1.11.0/tokio/task/fn.spawn_blocking.html) (and we haven't changed the default).

An earlier version of this PR hyper-optimised the time-per-signature metric to the detriment of the entire system's performance (see the reverted commits). The current PR is conservative in that it avoids touching the attestation service at all. I think there's more optimising to do here, but we can come back for that in a future PR rather than expanding the scope of this one.

The new algorithm for attestation selection proofs is:

- We sign a small batch of selection proofs each slot, for slots up to 8 slots in the future. On average we'll sign one slot's worth of proofs per slot, with an 8 slot lookahead.
- The batch is signed halfway through the slot, when there is unlikely to be contention for signature production (blocks are signed before 4s, attestations at ~4-6s, aggregates from 8s onward).
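
The two rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from `duties_service.rs`: the `Duty` struct, `LOOKAHEAD_SLOTS`, `signing_offset` and `duties_to_sign` are hypothetical names, and the lookahead window is assumed to include both the current slot and the slot 8 ahead.

```rust
use std::time::Duration;

/// Hypothetical, simplified duty record (not the real Lighthouse type).
#[derive(Debug, Clone, PartialEq)]
pub struct Duty {
    pub validator_index: u64,
    pub slot: u64,
    /// True once a selection proof has been signed for this duty.
    pub has_selection_proof: bool,
}

/// Lookahead from the description: sign for slots up to 8 slots ahead.
pub const LOOKAHEAD_SLOTS: u64 = 8;

/// Delay from the start of a slot until the batch is signed (halfway through).
pub fn signing_offset(slot_duration: Duration) -> Duration {
    slot_duration / 2
}

/// Select the duties that still need a proof and fall inside the window
/// `current_slot..=current_slot + LOOKAHEAD_SLOTS` (inclusive bounds assumed).
pub fn duties_to_sign(duties: &[Duty], current_slot: u64) -> Vec<Duty> {
    let last_slot = current_slot.saturating_add(LOOKAHEAD_SLOTS);
    duties
        .iter()
        .filter(|d| !d.has_selection_proof)
        .filter(|d| d.slot >= current_slot && d.slot <= last_slot)
        .cloned()
        .collect()
}
```

Once the window has been filled, each new slot brings exactly one new slot into range, which is why a single slot's worth of proofs is signed per slot on average.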

## Performance Data

_See first comment for updated graphs_.

Graph of median signing times before this PR:

![signing_times_median](https://user-images.githubusercontent.com/4452260/221495627-3ab3c105-319f-406e-b99d-b5913e0ded9c.png)

Graph of update attesters metric (includes selection proof signing) before this PR:

![update_attesters_store](https://user-images.githubusercontent.com/4452260/221497057-01ba40e4-8148-45f6-9e21-36a9567a631a.png)

Median signing time after this PR (prototype from 12:00, updated version from 13:30):

![signing_times_median_updated](https://user-images.githubusercontent.com/4452260/221771578-47a040cc-b832-482f-9a1a-d1bd9854e00e.png)

99th percentile on signing times (bounded attestation signing from 16:55, now removed):

![signing_times_99pc](https://user-images.githubusercontent.com/4452260/221772055-e64081a8-2220-45ba-ba6d-9d7e344a5bde.png)

Attester map update timing after this PR:

![update_attesters_store_updated](https://user-images.githubusercontent.com/4452260/221771757-c8558a48-7f4e-4bb5-9929-dee177a66c1e.png)

Selection proof signings per second change:

![signing_attempts](https://user-images.githubusercontent.com/4452260/221771855-64f5da22-1655-478d-926b-810be8a3650c.png)

## Link to late blocks

I believe this is related to the slow block signings because the logs from Stakely in #3963 show these two lines almost 5 seconds apart (about 4.6 seconds):

> Feb 23 18:56:23.978 INFO Received unsigned block, slot: 5862880, service: block, module: validator_client::block_service:393
> Feb 23 18:56:28.552 INFO Publishing signed block, slot: 5862880, service: block, module: validator_client::block_service:416

The only thing that happens between those two logs is the signing of the block:

0fb58a680d/validator_client/src/block_service.rs (L410-L414)

Helpfully, Stakely noticed this issue without any Lighthouse BNs in the mix, which pointed to a clear issue in the VC.

## TODO

- [x] Further testing on testnet infrastructure.
- [x] Make the attestation signing parallelism configurable.
2023-03-05 23:43:31 +00:00
| Name | Last commit | Date |
| --- | --- | --- |
| duties_service | Validator registration request failures do not cause us to mark BNs offline (#3488) | 2022-08-29 11:35:59 +00:00 |
| http_api | merge upstream/unstable | 2022-12-28 14:43:25 -06:00 |
| http_metrics | Optimise attestation selection proof signing (#4033) | 2023-03-05 23:43:31 +00:00 |
| signing_method | Clean capella (#4019) | 2023-03-01 03:19:02 +00:00 |
| attestation_service.rs | Validator registration request failures do not cause us to mark BNs offline (#3488) | 2022-08-29 11:35:59 +00:00 |
| beacon_node_fallback.rs | Add latency measurement service to VC (#4024) | 2023-03-05 23:43:29 +00:00 |
| block_service.rs | Optimise attestation selection proof signing (#4033) | 2023-03-05 23:43:31 +00:00 |
| check_synced.rs | Remove duplicate log in BN fallback (#2116) | 2021-01-06 03:01:48 +00:00 |
| cli.rs | Add latency measurement service to VC (#4024) | 2023-03-05 23:43:29 +00:00 |
| config.rs | Add latency measurement service to VC (#4024) | 2023-03-05 23:43:29 +00:00 |
| doppelganger_service.rs | Clippy lints for rust 1.66 (#3810) | 2022-12-16 04:04:00 +00:00 |
| duties_service.rs | Optimise attestation selection proof signing (#4033) | 2023-03-05 23:43:31 +00:00 |
| graffiti_file.rs | Rust 1.54.0 lints (#2483) | 2021-07-30 01:11:47 +00:00 |
| initialized_validators.rs | Web3 signer validator definitions reloading on any request (#3801) | 2023-01-09 08:18:56 +00:00 |
| key_cache.rs | Clippy lints for rust 1.66 (#3810) | 2022-12-16 04:04:00 +00:00 |
| latency.rs | Add latency measurement service to VC (#4024) | 2023-03-05 23:43:29 +00:00 |
| lib.rs | Add latency measurement service to VC (#4024) | 2023-03-05 23:43:29 +00:00 |
| notifier.rs | Add new VC metrics for beacon node availability (#3193) | 2022-05-26 02:05:16 +00:00 |
| preparation_service.rs | Publish subscriptions to all beacon nodes (#3529) | 2022-09-28 19:53:35 +00:00 |
| signing_method.rs | cleanup | 2022-12-30 11:00:14 -05:00 |
| sync_committee_service.rs | Sync committee sign bn fallback (#3624) | 2022-11-13 22:40:43 +00:00 |
| validator_store.rs | Fixed Compiler Warnings & Failing Tests (#3771) | 2022-12-03 10:42:12 +11:00 |