Skip to content

fix(net): retry whitelist peers instead of exiting#586

Closed
0xsiddharthks wants to merge 1 commit into
2140-dev:masterfrom
0xsiddharthks:siddharth/fix/whitelist-retry
Closed

fix(net): retry whitelist peers instead of exiting#586
0xsiddharthks wants to merge 1 commit into
2140-dev:masterfrom
0xsiddharthks:siddharth/fix/whitelist-retry

Conversation

@0xsiddharthks

@0xsiddharthks 0xsiddharthks commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Whitelist-only nodes die permanently on transient peer loss: whitelist entries are consumed as they're dialed, hostnames are resolved once, and nothing ever refills the list. With the default 2h connection rotation, a single-peer node (the #566 k8s setup) hits NoReachablePeers every ~4 hours. We measured 54 deaths across 4 production nodes in 48 hours, each one forcing a full resync.

Changes:

  • The configured whitelist is now a persistent set. When the dial queue runs dry in whitelist-only mode, it's re-seeded from the whitelist (shuffled, rate-limited to once per second).
  • Hostnames are re-resolved on every cycle, so reconnections follow DNS changes.
  • Gossip mode is unchanged, and an empty whitelist still errors immediately.

An allowlist says who to connect to, not how long to try. The new integration test forces 2-second rotations: on master the node dies on the second rotation, with this change it keeps syncing.

@randomlogin

Copy link
Copy Markdown
Contributor

Some ideas/questions out of curiosity:

Would it make sense to randomize whitelist order on each refill?
What happens if we set required peers to 2, have whitelist of 10 peers, 9 of which do not respond and only the last one is really working, will kyoto survive? Will it find that 1 whitelisted peer and 1 random one?

@0xsiddharthks

0xsiddharthks commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

@randomlogin Thanks for reviewing my changes!

Here's the answers:

  • Randomizing the order
    Yes, this totally makes sense. Just added it to the PR.
    The dial queue is now shuffled on each refill, so reconnections spread across the allowlist instead of always favoring the same entry. Initial connection order is unchanged to keep the diff focused.

  • 9 dead + 1 live with required_peers = 2
    the node survives in both modes:

  • With whitelist_only

    • each refill cycle redials the dead peers (one per tick, bounded by the handshake timeout) and reconnects to the live one.
    • One caveat: PeerMap doesn't dedupe against already-connected addresses, so to satisfy required_peers = 2 it can open a second connection to the same live peer (the same thing you'd get today by listing one peer twice). Skipping already-connected addresses when popping the queue could be a nice follow-up.
  • Without whitelist_only

    • this PR changes nothing: the whitelist is consumed once as before, so you'd get 1 whitelisted connection and the second from gossip/DNS.

@rustaceanrob

Copy link
Copy Markdown
Member

Needs rebase

@0xsiddharthks

Copy link
Copy Markdown
Contributor Author

@rustaceanrob will just rebase and mark ready for review!
Thanks for taking a look.

@0xsiddharthks 0xsiddharthks force-pushed the siddharth/fix/whitelist-retry branch from 9955901 to 67db699 Compare June 23, 2026 18:27
@0xsiddharthks 0xsiddharthks marked this pull request as ready for review June 23, 2026 18:33
@0xsiddharthks

Copy link
Copy Markdown
Contributor Author

Update: @rustaceanrob the PR is rebased and ready for review.

@rustaceanrob

Copy link
Copy Markdown
Member

This adds a lot of convoluted logic to a feature that will be used by a small subset of users. So far I don't agree with this approach and hope we can do something similar in only a few lines of changes. There are two things I dislike a lot: 1. two vectors that should represent the same thing (a list of trusted peers) 2. adds fields to an already massive PeerMap struct.

Can we just push the peers back onto the whitelist if we are in whitelist only?

@0xsiddharthks 0xsiddharthks force-pushed the siddharth/fix/whitelist-retry branch from 67db699 to 13966ac Compare June 24, 2026 12:43
@0xsiddharthks

Copy link
Copy Markdown
Contributor Author

@rustaceanrob
Thanks for the review. I have addressed the feedback.

  • The second vector and the new PeerMap fields are both gone.
  • In whitelist_only mode the whitelist is now rotated in place rather than drained. each peer is popped, pushed back, and retried on the next pass (next_whitelist_peer).

@rustaceanrob

Copy link
Copy Markdown
Member

Why do we still need these retry intervals? This is still quite hard to follow for me.

@0xsiddharthks 0xsiddharthks force-pushed the siddharth/fix/whitelist-retry branch from 13966ac to ae864ce Compare June 24, 2026 13:17
Whitelist-only nodes consumed their configured peers as they were dialed
and never refilled, so with the default connection rotation a single-peer
node eventually hit NoReachablePeers and exited, forcing a full resync.

In whitelist-only mode the configured peers are now rotated rather than
drained: each is popped, pushed back, and retried on the next pass, and
hostnames are re-resolved every pass so reconnections follow DNS changes.
An empty whitelist is still fatal, and gossip mode is unchanged.
@0xsiddharthks 0xsiddharthks force-pushed the siddharth/fix/whitelist-retry branch from ae864ce to 5492d87 Compare June 24, 2026 13:26
@rustaceanrob

rustaceanrob commented Jun 24, 2026

Copy link
Copy Markdown
Member

After thinking about it more I'm still not convinced this is a good change conceptually. To me, if a user has configured whitelist_only, it feels like the proper thing to do is to exit once that list is empty. Continually retrying implies that we expect the servers to be up. For many users, this may not be the case. For instance, a user with a home node that enables whitelist_only should exit, as they should be made aware that the home node is down. This silently retries the node. This feels like business-specific logic that should exist outside of this crate. IMO there are also a few solutions:

  1. on your side, use add_peer on the Requester to continually add your whitelist back to the node. There you can define how often you re-add peers, shuffling, etc.
  2. save the most recent block headers for each node. if it gets kill from no more reachable peers, log, and then reset from those header by using ChainState:https://docs.rs/bip157/latest/bip157/builder/struct.Builder.html#method.chain_state

@0xsiddharthks

Copy link
Copy Markdown
Contributor Author

@rustaceanrob Yeah I see what you're saying. That does make sense.. The retry policy belongs in the consumer rather than the crate.

Thats also how I kinda handle it as well. I run kyoto under a supervisor that catches the NoReachablePeers exit, rebuilds the node, and resumes from the last checkpoint via chain_state (basically your option 2).

And since the rebuild re-runs add_peers, the hostnames get re-resolved on the way back up, which covers the DNS/pod-rotation case I was originally worried about.

I'll close PR. Thanks again for reviewing and for the pointers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants