lighthouse-pulse/beacon_node
Paul Hauner 44fa54004c Persist to DB after setting canonical head (#2547)
## Issue Addressed

NA

## Proposed Changes

Missed head votes on attestations is a well-known issue. The primary cause is a block getting set as the head *after* the attestation deadline.

This PR aims to shorten the overall time between "block received" and "block set as head" by:

1. Persisting the head and fork choice *after* setting the canonical head
    - Informal measurements show this takes ~200ms
 1. Pruning the op pool *after* setting the canonical head.
 1. No longer persisting the op pool to disk during `BeaconChain::fork_choice`
     - Informal measurements show this can take up to 1.2s.
     
I also add some metrics to help measure the effect of these changes.
     
Persistence changes like this run the risk of breaking assumptions downstream. However, I have considered these risks and I think we're fine here. I will describe my reasoning for each change.

## Reasoning

### Change 1:  Persisting the head and fork choice *after* setting the canonical head

For (1), although the function is called `persist_head_and_fork_choice`, it only persists:

- Fork choice
- Head tracker
- Genesis block root

Since `BeaconChain::fork_choice_internal` does not modify these values between the original time we were persisting it and the current time, I assert that the change I've made is non-substantial in terms of what ends up on-disk. There's the possibility that some *other* thread has modified fork choice in the extra time we've given it, but that's totally fine.

Since the only time we *read* those values from disk is during startup, I assert that this has no impact during runtime. 

### Change 2: Pruning the op pool after setting the canonical head

Similar to the argument above, we don't modify the op pool during `BeaconChain::fork_choice_internal` so it shouldn't matter when we prune. This change should be non-substantial.

### Change 3: No longer persisting the op pool to disk during `BeaconChain::fork_choice`

This change *is* substantial. With the proposed changes, we'll only be persisting the op pool to disk when we shut down cleanly (i.e., the `BeaconChain` gets dropped). This means we'll save disk IO and time during usual operation, but a `kill -9` or similar "crash" will probably result in an out-of-date op pool when we reboot. An out-of-date op pool can only have an impact when producing blocks or aggregate attestations/sync committees.

I think it's pretty reasonable that a crash might result in an out-of-date op pool, since:

- Crashes are fairly rare. Practically the only time I see LH suffer a full crash is when the OOM killer shows up, and that's a very serious event.
- It's generally quite rare to produce a block/aggregate immediately after a reboot. Just a few slots of runtime is probably enough to have a decent-enough op pool again.

## Additional Info

Credits to @macladson for the timings referenced here.
2021-08-31 04:48:21 +00:00
..
beacon_chain Persist to DB after setting canonical head (#2547) 2021-08-31 04:48:21 +00:00
client Upgrade dependencies (#2513) 2021-08-17 01:00:24 +00:00
eth1 Improve eth1 fallback logging (#2490) 2021-08-30 00:51:26 +00:00
eth2_libp2p Shutdown after sync (#2519) 2021-08-30 13:46:13 +00:00
genesis Upgrade dependencies (#2513) 2021-08-17 01:00:24 +00:00
http_api Fork schedule api (#2525) 2021-08-24 01:36:27 +00:00
http_metrics Upgrade dependencies (#2513) 2021-08-17 01:00:24 +00:00
network Shutdown after sync (#2519) 2021-08-30 13:46:13 +00:00
operation_pool Rust 1.54.0 lints (#2483) 2021-07-30 01:11:47 +00:00
src Shutdown after sync (#2519) 2021-08-30 13:46:13 +00:00
store Revert bad blocks on missed fork (#2529) 2021-08-30 06:41:31 +00:00
tests Altair consensus changes and refactors (#2279) 2021-07-09 06:15:32 +00:00
timer Upgrade dependencies (#2513) 2021-08-17 01:00:24 +00:00
websocket_server Server sent events (#1920) 2020-12-04 00:18:58 +00:00
Cargo.toml Add more unix signal handlers (#2486) 2021-08-30 05:19:34 +00:00