fix(net): retry whitelist peers instead of exiting by 0xsiddharthks · Pull Request #586 · 2140-dev/kyoto

0xsiddharthks · 2026-06-10T21:44:55Z

Whitelist-only nodes die permanently on transient peer loss: whitelist entries are consumed as they're dialed, hostnames are resolved once, and nothing ever refills the list. With the default 2h connection rotation, a single-peer node (the #566 k8s setup) hits NoReachablePeers every ~4 hours. We measured 54 deaths across 4 production nodes in 48 hours, each one forcing a full resync.

Changes:

The configured whitelist is now a persistent set. When the dial queue runs dry in whitelist-only mode, it's re-seeded from the whitelist (shuffled, rate-limited to once per second).
Hostnames are re-resolved on every cycle, so reconnections follow DNS changes.
Gossip mode is unchanged, and an empty whitelist still errors immediately.

An allowlist says who to connect to, not how long to try. The new integration test forces 2-second rotations: on master the node dies on the second rotation, with this change it keeps syncing.

randomlogin · 2026-06-11T00:07:47Z

Some ideas/questions out of curiosity:

Would it make sense to randomize whitelist order on each refill?
What happens if we set required peers to 2, have whitelist of 10 peers, 9 of which do not respond and only the last one is really working, will kyoto survive? Will it find that 1 whitelisted peer and 1 random one?

0xsiddharthks · 2026-06-11T03:49:39Z

@randomlogin Thanks for reviewing my changes!

Here's the answers:

Randomizing the order
Yes, this totally makes sense. Just added it to the PR.
The dial queue is now shuffled on each refill, so reconnections spread across the allowlist instead of always favoring the same entry. Initial connection order is unchanged to keep the diff focused.
9 dead + 1 live with required_peers = 2
the node survives in both modes:
With whitelist_only
- each refill cycle redials the dead peers (one per tick, bounded by the handshake timeout) and reconnects to the live one.
- One caveat: PeerMap doesn't dedupe against already-connected addresses, so to satisfy required_peers = 2 it can open a second connection to the same live peer (the same thing you'd get today by listing one peer twice). Skipping already-connected addresses when popping the queue could be a nice follow-up.
Without whitelist_only
- this PR changes nothing: the whitelist is consumed once as before, so you'd get 1 whitelisted connection and the second from gossip/DNS.

rustaceanrob · 2026-06-23T11:49:02Z

Needs rebase

0xsiddharthks · 2026-06-23T11:51:42Z

@rustaceanrob will just rebase and mark ready for review!
Thanks for taking a look.

0xsiddharthks · 2026-06-23T18:33:23Z

Update: @rustaceanrob the PR is rebased and ready for review.

rustaceanrob · 2026-06-24T08:56:47Z

This adds a lot of convoluted logic to a feature that will be used by a small subset of users. So far I don't agree with this approach and hope we can do something similar in only a few lines of changes. There are two things I dislike a lot: 1. two vectors that should represent the same thing (a list of trusted peers) 2. adds fields to an already massive PeerMap struct.

Can we just push the peers back onto the whitelist if we are in whitelist only?

0xsiddharthks · 2026-06-24T12:52:41Z

@rustaceanrob
Thanks for the review. I have addressed the feedback.

The second vector and the new PeerMap fields are both gone.
In whitelist_only mode the whitelist is now rotated in place rather than drained. each peer is popped, pushed back, and retried on the next pass (next_whitelist_peer).

rustaceanrob · 2026-06-24T13:07:53Z

Why do we still need these retry intervals? This is still quite hard to follow for me.

Whitelist-only nodes consumed their configured peers as they were dialed and never refilled, so with the default connection rotation a single-peer node eventually hit NoReachablePeers and exited, forcing a full resync. In whitelist-only mode the configured peers are now rotated rather than drained: each is popped, pushed back, and retried on the next pass, and hostnames are re-resolved every pass so reconnections follow DNS changes. An empty whitelist is still fatal, and gossip mode is unchanged.

rustaceanrob · 2026-06-24T13:37:14Z

After thinking about it more I'm still not convinced this is a good change conceptually. To me, if a user has configured whitelist_only, it feels like the proper thing to do is to exit once that list is empty. Continually retrying implies that we expect the servers to be up. For many users, this may not be the case. For instance, a user with a home node that enables whitelist_only should exit, as they should be made aware that the home node is down. This silently retries the node. This feels like business-specific logic that should exist outside of this crate. IMO there are also a few solutions:

on your side, use add_peer on the Requester to continually add your whitelist back to the node. There you can define how often you re-add peers, shuffling, etc.
save the most recent block headers for each node. if it gets kill from no more reachable peers, log, and then reset from those header by using ChainState:https://docs.rs/bip157/latest/bip157/builder/struct.Builder.html#method.chain_state

0xsiddharthks · 2026-06-25T19:42:11Z

@rustaceanrob Yeah I see what you're saying. That does make sense.. The retry policy belongs in the consumer rather than the crate.

Thats also how I kinda handle it as well. I run kyoto under a supervisor that catches the NoReachablePeers exit, rebuilds the node, and resumes from the last checkpoint via chain_state (basically your option 2).

And since the rebuild re-runs add_peers, the hostnames get re-resolved on the way back up, which covers the DNS/pod-rotation case I was originally worried about.

I'll close PR. Thanks again for reviewing and for the pointers!

0xsiddharthks force-pushed the siddharth/fix/whitelist-retry branch from 9955901 to 67db699 Compare June 23, 2026 18:27

0xsiddharthks marked this pull request as ready for review June 23, 2026 18:33

0xsiddharthks force-pushed the siddharth/fix/whitelist-retry branch from 67db699 to 13966ac Compare June 24, 2026 12:43

0xsiddharthks force-pushed the siddharth/fix/whitelist-retry branch from 13966ac to ae864ce Compare June 24, 2026 13:17

0xsiddharthks force-pushed the siddharth/fix/whitelist-retry branch from ae864ce to 5492d87 Compare June 24, 2026 13:26

0xsiddharthks closed this Jun 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(net): retry whitelist peers instead of exiting#586

fix(net): retry whitelist peers instead of exiting#586
0xsiddharthks wants to merge 1 commit into
2140-dev:masterfrom
0xsiddharthks:siddharth/fix/whitelist-retry

0xsiddharthks commented Jun 10, 2026 •

edited

Loading

Uh oh!

randomlogin commented Jun 11, 2026

Uh oh!

0xsiddharthks commented Jun 11, 2026 •

edited

Loading

Uh oh!

rustaceanrob commented Jun 23, 2026

Uh oh!

0xsiddharthks commented Jun 23, 2026

Uh oh!

0xsiddharthks commented Jun 23, 2026

Uh oh!

rustaceanrob commented Jun 24, 2026

Uh oh!

0xsiddharthks commented Jun 24, 2026

Uh oh!

rustaceanrob commented Jun 24, 2026

Uh oh!

rustaceanrob commented Jun 24, 2026 •

edited

Loading

Uh oh!

0xsiddharthks commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

0xsiddharthks commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

randomlogin commented Jun 11, 2026

Uh oh!

0xsiddharthks commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rustaceanrob commented Jun 23, 2026

Uh oh!

0xsiddharthks commented Jun 23, 2026

Uh oh!

0xsiddharthks commented Jun 23, 2026

Uh oh!

rustaceanrob commented Jun 24, 2026

Uh oh!

0xsiddharthks commented Jun 24, 2026

Uh oh!

rustaceanrob commented Jun 24, 2026

Uh oh!

rustaceanrob commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

0xsiddharthks commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

0xsiddharthks commented Jun 10, 2026 •

edited

Loading

0xsiddharthks commented Jun 11, 2026 •

edited

Loading

rustaceanrob commented Jun 24, 2026 •

edited

Loading