fix(net): retry whitelist peers instead of exiting#586
Conversation
|
Some ideas/questions out of curiosity: Would it make sense to randomize whitelist order on each refill? |
|
@randomlogin Thanks for reviewing my changes! Here's the answers:
|
|
Needs rebase |
|
@rustaceanrob will just rebase and mark ready for review! |
9955901 to
67db699
Compare
|
Update: @rustaceanrob the PR is rebased and ready for review. |
|
This adds a lot of convoluted logic to a feature that will be used by a small subset of users. So far I don't agree with this approach and hope we can do something similar in only a few lines of changes. There are two things I dislike a lot: 1. two vectors that should represent the same thing (a list of trusted peers) 2. adds fields to an already massive Can we just push the peers back onto the whitelist if we are in whitelist only? |
67db699 to
13966ac
Compare
|
@rustaceanrob
|
|
Why do we still need these retry intervals? This is still quite hard to follow for me. |
13966ac to
ae864ce
Compare
Whitelist-only nodes consumed their configured peers as they were dialed and never refilled, so with the default connection rotation a single-peer node eventually hit NoReachablePeers and exited, forcing a full resync. In whitelist-only mode the configured peers are now rotated rather than drained: each is popped, pushed back, and retried on the next pass, and hostnames are re-resolved every pass so reconnections follow DNS changes. An empty whitelist is still fatal, and gossip mode is unchanged.
ae864ce to
5492d87
Compare
|
After thinking about it more I'm still not convinced this is a good change conceptually. To me, if a user has configured
|
|
@rustaceanrob Yeah I see what you're saying. That does make sense.. The retry policy belongs in the consumer rather than the crate. Thats also how I kinda handle it as well. I run kyoto under a supervisor that catches the NoReachablePeers exit, rebuilds the node, and resumes from the last checkpoint via chain_state (basically your option 2). And since the rebuild re-runs add_peers, the hostnames get re-resolved on the way back up, which covers the DNS/pod-rotation case I was originally worried about. I'll close PR. Thanks again for reviewing and for the pointers! |
Whitelist-only nodes die permanently on transient peer loss: whitelist entries are consumed as they're dialed, hostnames are resolved once, and nothing ever refills the list. With the default 2h connection rotation, a single-peer node (the #566 k8s setup) hits
NoReachablePeersevery ~4 hours. We measured 54 deaths across 4 production nodes in 48 hours, each one forcing a full resync.Changes:
An allowlist says who to connect to, not how long to try. The new integration test forces 2-second rotations: on master the node dies on the second rotation, with this change it keeps syncing.