feat(encryption): Stage 6D-3 — CapabilityFanout helper (Voters ∪ Learners)#793
feat(encryption): Stage 6D-3 — CapabilityFanout helper (Voters ∪ Learners)#793bootjp wants to merge 4 commits into
Conversation
…ners GetCapability fan-out)
Adds the §4.1 capability fan-out helper that the 6D-6 EnableStorageEnvelope cutover RPC will call to verify every Voter ∪ Learner of every Raft group is reachable and reports encryption_capable=true before the cutover entry is proposed.
The helper:
- Walks every (group, member) pair across voters AND learners
(§4.1 / §8 row "One learner unreachable" — learners count as
hard refusals the same way voters do)
- Dedupes by FullNodeID — a node serving multiple groups is
probed exactly once
- Dials all targets concurrently with one shared timeout
(default = one RaftElectionTimeout per §4.3)
- Surfaces dial / RPC failure as Reachable=false verdicts so the
operator-facing error can name the unreachable peer
- Sets Result.OK = true iff every verdict has
Reachable && EncryptionCapable (no partial-success mode per §4.3)
Per §9 the unit tests are table-driven and cover the four pinned cases (every-capable, one-not-capable, one-unreachable, duplicate-nodes-deduplicated) plus three input-validation cases (nil DialFunc, non-positive timeout, empty snapshot).
One deliberate deviation from the design doc: the result struct is named CapabilityFanoutResult rather than FanoutResult to avoid colliding with the unrelated admin.FanoutResult that already ships in keyviz_fanout.go (same package). The contract is otherwise unchanged.
Operator-inert: this helper has no callers in-tree yet. 6D-6 wires it.
Self-review (5 lenses):
1. Data loss — N/A; helper is read-only.
2. Concurrency / distributed failures — fan-out is goroutine-per-target with a single context.WithTimeout governing the deadline. Result aggregation guarded by sync.Mutex. The dedup phase runs before goroutines launch so the closure-loop-var case ("for _, m := range" capturing m in a goroutine) cannot race; m is passed by value as the goroutine argument. No partial-success mode — a single unreachable learner is a hard "no" per §8.
3. Performance — concurrent dials bounded by len(unique FullNodeID) per cluster; no quadratic fan-out. dedupKeys map sized at len; results slice preallocated with capacity = len(dedupKeys). The hot path (uint64ToDecimal) skips fmt to avoid one allocation per dedup-key build.
4. Data consistency — N/A; helper is a synchronous probe with no side effects. Per §4.2 it deliberately does NOT cache results — every cutover probes fresh.
5. Test coverage — table-driven §9 cases all pin. Dedup contract asserted by counting per-address dial calls (stubDial.calls). Input-validation cases pin the bad-input refusal posture.
Design-doc-first: renamed `*_proposed_*` → `*_partial_*` and updated the doc to record what shipped (6D-1, 6D-2, 6D-3) and what remains (6D-4, 6D-5, 6D-6).
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR introduces the ChangesCapabilityFanout helper and tests
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/design/2026_05_18_partial_6d_enable_storage_envelope.md`:
- Line 11: The heading "### Shipped milestones (lifecycle: partial)" uses H3 and
causes a heading level increment error (MD001); change that heading to H2 (i.e.,
"## Shipped milestones (lifecycle: partial)") so it follows the document's H1
and fixes the heading-increment lint issue.
In `@internal/admin/capability_fanout.go`:
- Around line 185-187: The code currently returns early when member.Address ==
"" which silently drops that member from Verdicts; instead, mark that member in
Verdicts as failed (set its OK=false and an explanatory error/reason) so the
overall fanout fails closed, then continue to the next member. Update the block
that contains the member.Address check in internal/admin/capability_fanout.go to
populate Verdicts[member] (or Verdicts[member.ID]) with a failing verdict
(OK=false plus a message like "empty address") rather than returning, and then
continue the loop.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b781744d-304a-4780-9f7d-fd0f38351a48
📒 Files selected for processing (3)
docs/design/2026_05_18_partial_6d_enable_storage_envelope.mdinternal/admin/capability_fanout.gointernal/admin/capability_fanout_test.go
There was a problem hiding this comment.
Code Review
This pull request introduces the CapabilityFanout helper in internal/admin to verify encryption capabilities across Raft group members, which is a critical component for the storage envelope cutover. The changes include the implementation of the fan-out logic with deduplication by FullNodeID, comprehensive unit tests, and an update to the design documentation to reflect current milestone progress. Review feedback suggests ensuring that members with empty addresses are not silently dropped to avoid potential state inconsistencies and recommends pre-allocating the deduplication map to improve performance.
| if member.Address == "" { | ||
| return | ||
| } | ||
| key := capabilityFanoutDedupKey(member) | ||
| if _, exists := m[key]; exists { | ||
| return | ||
| } | ||
| m[key] = member |
There was a problem hiding this comment.
Silently skipping members with an empty address could lead to a premature cutover if a node is misconfigured in the membership view. According to the §4.1 contract, every member (voter or learner) must be checked. If a node is missing an address, it should be included in the fan-out so that the dial fails and the overall result correctly reports OK=false due to an unreachable member. This ensures that entries are not silently dropped, preventing state inconsistencies.
key := capabilityFanoutDedupKey(member)
if _, exists := m[key]; exists {
return
}
m[key] = memberReferences
- Avoid silently dropping entries during serialization or normalization to ensure they are preserved and avoid state inconsistencies.
| // for unidentified rows). Pulled out of CapabilityFanout to keep | ||
| // the orchestration body under the cyclomatic-complexity budget. | ||
| func buildCapabilityFanoutDedupSet(routes RouteSnapshot) map[string]RouteMember { | ||
| dedupKeys := make(map[string]RouteMember) |
There was a problem hiding this comment.
The dedupKeys map can be pre-allocated with a size hint to reduce reallocations during the membership walk. While Raft groups are typically small, providing a hint based on the total number of members in the snapshot is a good practice for efficiency, provided the snapshot size is already bounded.
var totalMembers int
for _, group := range routes.Groups {
totalMembers += len(group.Voters) + len(group.Learners)
}
dedupKeys := make(map[string]RouteMember, totalMembers)References
- Pre-allocating a buffer is acceptable if a strictly enforced maximum size limit is checked before allocation and the entire data must be materialized in memory.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 862e2fd3b5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if member.Address == "" { | ||
| return |
There was a problem hiding this comment.
Fail closed when a route member lacks an address
Returning early for member.Address == "" silently removes that voter/learner from the fan-out set, so CapabilityFanout can report OK=true without probing every member in the snapshot. This breaks the helper’s safety contract in malformed or partially-populated route snapshots (e.g., a member entry exists but its address is missing) because the cutover preflight may pass even though one node was never checked; this should instead produce a hard failure (validation error or an explicit unreachable verdict).
Useful? React with 👍 / 👎.
…major (fail-closed on malformed members) + minor doc + medium preallocation Three reviewer findings from round-1 of PR #793, all addressed: ## codex P1 / gemini high / coderabbit major - fail-closed on empty address Pre-fix: addCapabilityFanoutDedupTarget early-returned when member.Address == "", silently dropping malformed catalog rows from Result.Verdicts. The cutover RPC would then see fewer verdicts than members, and OK=true could sneak through against an unprobed peer (the section 4.1 contract requires every voter union learner to be reached). Fix: admit every row into the dedup map; classify empty-Address members in probeCapability as Reachable=false with a new errCapabilityFanoutMalformedMember sentinel so the cutover RPC caller-facing error can name the misconfigured member. Result.OK collapses to false the same way an unreachable real peer would. Sibling case (fully-malformed row: FullNodeID=0 AND Address=""): use a synthetic ordinal-based dedup key so two malformed rows surface as two separate verdicts rather than collapsing into one (which would hide one of the two misconfigurations). Regression tests per CLAUDE.md test-then-fix convention: - TestCapabilityFanout_EmptyAddressFailsClosed - pins empty-Address row appears as Reachable=false in Result.Verdicts with Err set, OK=false. - TestCapabilityFanout_FullyMalformedRowsNotCollapsed - pins two malformed rows yield two separate verdicts. Both tests confirmed failing on the pre-fix code; both pass on the fix. ## caller audit (semantic change) addCapabilityFanoutDedupTarget and probeCapability are both file-private. Sole callers (buildCapabilityFanoutDedupSet, runCapabilityFanoutProbes) are also in this file. No external propagation. The public-facing semantic change is "Result.Verdicts now includes malformed-row entries as Reachable=false" - this is strictly safer than the pre-fix silent drop and the OK computation in capabilityFanoutAllOK already requires Reachable=true so no caller-side change is needed. ## coderabbit minor - MD001 heading-level skip Promoted "### Shipped milestones" and "### Open milestones" to H2 in the partial design doc. The pre-fix doc went directly from H1 to H3 with no H2 in between, tripping MD001. ## gemini medium - preallocate dedup map buildCapabilityFanoutDedupSet now pre-sizes the dedup map with sum(len(group.Voters) + len(group.Learners)) as the upper bound. The map shrinks at no runtime cost when dedup collapses entries. ## test/lint evidence - go test -run TestCapabilityFanout ./internal/admin/... - all 9 tests pass (7 pre-existing + 2 new regressions) - golangci-lint run internal/admin/... - 0 issues
|
@claude review Round-1 (commit codex P1 / gemini high / coderabbit major — fail-closed on empty addressPre-fix: Fix: admit every row into the dedup map; Sibling case (FullNodeID=0 AND Address=""): synthetic ordinal-based dedup key so two malformed rows surface as two separate verdicts rather than collapsing into one and hiding the second misconfiguration. Regression tests added per CLAUDE.md test-then-fix convention:
Both confirmed failing pre-fix, passing post-fix. Caller audit: coderabbit minor — MD001 heading-level skipPromoted gemini medium — preallocate dedup map
Verification
|
|
Claude finished @bootjp's task in 1m 58s —— View job Round-1 Review
The three Round-1 fixes all land correctly. Detailed findings below. Fix 1 — Fail-closed on empty address (Codex P1 / Gemini high / CodeRabbit major) ✅Approach is correct. The fix removes the early-return guard from Sibling case (FullNodeID=0 ∧ Address=""): The synthetic ordinal key ( Two goroutines launch, both hit the empty-address branch, both produce One observation (no action needed): Both doubly-malformed verdicts carry Regression tests ( Fix 2 — MD001 heading-level skip (CodeRabbit minor) ✅
Fix 3 — Dedup map preallocation (Gemini medium) ✅
Other observations (informational only)Design doc §4.1 type name. The doc's code block still shows
Loop variable capture. The goroutine spawn at SummaryAll three Round-1 findings are addressed correctly. The two regression tests follow the project's test-then-fix convention and are comprehensive. No new issues introduced. LGTM for the implementation sub-PR shape; ready for 6D-4. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b46d1f5e24
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if closer != nil { | ||
| defer closer() | ||
| } | ||
| report, err := client.GetCapability(ctx, &pb.Empty{}) |
There was a problem hiding this comment.
Handle nil dial client before issuing RPC
If dial returns (nil, closer, nil), this line panics on client.GetCapability(...) and can crash the admin RPC path instead of producing a Reachable=false verdict. Because DialFunc is injected and not validated beyond err, the helper should fail closed when client == nil (treating it like a dial failure) so a bad dialer implementation cannot take down cutover preflight.
Useful? React with 👍 / 👎.
…t buggy DialFunc) Codex r2 P1: if a DialFunc returns (nil, nil, nil) - no error but also no client - probeCapability would panic on client.GetCapability(...) and crash the admin RPC path. The fix is a defensive nil-client check that converts the contract violation into a Reachable=false verdict with a new errCapabilityFanoutBadDialer sentinel. Regression test per CLAUDE.md test-then-fix convention: - TestCapabilityFanout_NilClientFromDialFailsClosed - injects a buggy dialer that returns (nil, nil, nil); confirmed panicking on the pre-fix code (goroutine traceback through probeCapability:264); passes on the fix. Caller audit: probeCapability is file-private, sole caller is runCapabilityFanoutProbes also in this file. No external propagation. Semantic change is "nil client from a buggy dialer now produces a Reachable=false verdict instead of a panic" - strictly safer than the pre-fix crash. Lint clean. All 10 TestCapabilityFanout_* tests pass.
|
@claude review Round-2 (commit codex r2 P1 — nil-client guard against buggy DialFuncPre-fix: if a Regression test per CLAUDE.md test-then-fix:
Caller audit: Verification
|
|
Claude finished @bootjp's task in 1m 27s —— View job Round-2 Review
The Round-2 fix (nil-client guard) is correct. Detailed analysis below. Fix — nil-client guard against buggy
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f2bbdaa376
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if report.GetFullNodeId() != 0 { | ||
| verdict.FullNodeID = report.GetFullNodeId() |
There was a problem hiding this comment.
Reject capability replies from unexpected node IDs
The probe accepts any successful GetCapability response as evidence that the target member is reachable/capable, but it never validates that the responder’s full_node_id matches the expected route member. In a stale/misrouted address scenario (or if two members accidentally share an address), one healthy node can satisfy probes for another and OK can incorrectly become true even though the intended member was never verified. This check should fail closed on ID mismatch instead of overwriting the verdict ID.
Useful? React with 👍 / 👎.
| mu.Unlock() | ||
| }(m) | ||
| } | ||
| wg.Wait() |
There was a problem hiding this comment.
Honor timeout even when a probe ignores context
CapabilityFanout documents a hard timeout bound, but runCapabilityFanoutProbes waits for every goroutine unconditionally. If a DialFunc/client implementation blocks without honoring ctx cancellation, wg.Wait() never returns and the admin path can hang indefinitely, violating the helper’s contract that it returns within timeout and reports missing peers as unreachable.
Useful? React with 👍 / 👎.
… guard + timeout-bounded helper return) Two codex r3 findings, both addressed: ## codex r3 P1 - reject capability replies from unexpected node IDs Pre-fix: probeCapability accepted any successful GetCapability response as evidence that the expected route member was reachable, even when the responder reported a different full_node_id (stale routing, two members accidentally sharing an address, etc.). Result: one healthy node could satisfy probes for another, and Result.OK could become true even though the intended member was never verified. Fix: when both the snapshot's expected FullNodeID and the response's full_node_id are non-zero AND they disagree, the verdict is Reachable=false with a new errCapabilityFanoutMismatchedResponder sentinel. The cutover RPC refuses with OK=false rather than silently passing. Regression test: TestCapabilityFanout_MismatchedResponderIDFailsClosed - snapshot says n2:9000 has FullNodeID=2, responder reports FullNodeID=99, helper now produces Reachable=false instead of the pre-fix Reachable=true. ## codex r3 P2 - honor timeout even when a probe ignores ctx Pre-fix: runCapabilityFanoutProbes did wg.Wait() unconditionally. A DialFunc/client that ignored ctx cancellation would hang the helper indefinitely, violating the §4.3 contract "Returns within `timeout`". Fix: pre-seed every dedup-key slot with a Reachable=false verdict carrying errCapabilityFanoutProbeTimeout. Each goroutine overwrites its slot when its probe finishes. After launching goroutines, the orchestrator waits on min(done, ctx.Done()) and then returns whatever slot values are present. Probes that didn't finish by the deadline surface as Reachable=false with the pre-seeded timeout error. Note on goroutine cleanup: goroutines whose probes ignore ctx continue running until they unblock on their own. The helper documents that explicitly: contract is "the HELPER returns within timeout", not "every spawned goroutine returns". A dialer/client that ignores ctx is a bug on the caller side; the helper's job is to prevent that bug from hanging the admin RPC path. Regression test: TestCapabilityFanout_TimeoutBoundsHelperReturn - injects a hanging client that ignores ctx; the pre-fix code timed out the entire 30s test process (goroutine traceback through probeCapability); the fix returns within 500ms (5x the 100ms budget; generous for CI). ## caller audit (semantic change) probeCapability and runCapabilityFanoutProbes are both file-private. The new sentinels (errCapabilityFanoutMismatchedResponder, errCapabilityFanoutProbeTimeout) are also file-private. Sole caller of probeCapability is runCapabilityFanoutProbes; sole caller of runCapabilityFanoutProbes is CapabilityFanout. The public-facing semantic change is: 1. Stale-routing scenarios that previously slipped through as OK=true now produce OK=false. Strictly safer. 2. Hung probes that previously hung the helper now produce Reachable=false verdicts. Strictly safer. capabilityFanoutAllOK already requires Reachable=true so no caller-side change is needed. ## test/lint evidence - go test -race -run TestCapabilityFanout -timeout 60s ./internal/admin/... - all 12 tests pass (10 pre-existing + 2 new regressions) under -race - golangci-lint run internal/admin/... - 0 issues
|
@claude review Round-3 (commit codex r3 P1 — reject capability replies from unexpected node IDsPre-fix: Fix: when both expected and response Regression test: codex r3 P2 — honor timeout even when a probe ignores ctxPre-fix: Fix: pre-seed every dedup-key slot with a Goroutine cleanup: goroutines that ignore ctx continue running. The helper documents this — the contract is "the HELPER returns within timeout", not "every spawned goroutine returns within timeout". A dialer that ignores ctx is a caller-side bug; the helper's job is to prevent that bug from hanging the admin RPC path. Regression test: Caller audit
Verification
|
|
Claude finished @bootjp's task in 2m 9s —— View job Round-3 Review
Both Round-3 fixes land correctly. Detailed analysis below. Fix 1 — Mismatched-responder guard (codex r3 P1) ✅Location: if member.FullNodeID != 0 && report.GetFullNodeId() != 0 && report.GetFullNodeId() != member.FullNodeID {
verdict.Err = pkgerrors.Wrapf(errCapabilityFanoutMismatchedResponder,
"%s: expected full_node_id=%d got %d", member.Address, member.FullNodeID, report.GetFullNodeId())
return verdict
}The three-clause guard correctly handles all four cases:
The second row (non-zero expected, zero response) is a deliberate design choice to accommodate pre-capability nodes that don't populate
Nil-report safety: protobuf-go getters are nil-safe, so Regression test Fix 2 — Timeout-bounded helper return (codex r3 P2) ✅Architecture: pre-seeded slot pattern in Correctness trace:
Race safety after timeout:
Documented goroutine leak posture: the doc comment on Regression test
Overall — Rounds 0–3All rounds of review feedback have been addressed and pinned with regression tests:
12 tests, all passing under |
Summary
internal/admin/capability_fanout.go— the §4.1 helper that the 6D-6 cutover RPC will call to verify every (voter ∪ learner) of every Raft group is reachable and reportsencryption_capable=truebefore the cutover entry is proposed.FullNodeIDper §4.1 (a node serving multiple groups is probed exactly once), fans dials concurrently under onecontext.WithTimeout(default = one RaftElectionTimeout per §4.3), surfaces dial / RPC failure asReachable=falseverdicts so the operator-facing error can name the unreachable peer.Result.OK = trueiff every verdict hasReachable && EncryptionCapable— no partial-success mode per §4.3.DialFunc, non-positive timeout, empty snapshot). 7 tests total, all passing.CapabilityFanoutResultrather than the design doc'sFanoutResultto avoid colliding with the unrelatedadmin.FanoutResultalready inkeyviz_fanout.go(same package).2026_05_18_proposed_6d_enable_storage_envelope.md→2026_05_18_partial_6d_enable_storage_envelope.mdand updated the doc to record what shipped (6D-1, 6D-2, 6D-3) and what remains (6D-4, 6D-5, 6D-6).Self-review (5 lenses)
context.WithTimeoutgoverning the deadline. Result aggregation guarded bysync.Mutex. Dedup phase runs before goroutines launch so the loop-var-capture race is avoided (mpassed by value as the goroutine arg). No partial-success mode — a single unreachable learner is a hard "no" per §8.len(unique FullNodeID); no quadratic fan-out. Maps + result slice preallocated.uint64ToDecimalavoidsfmtfor the dedup-key hot path.stubDial.calls). Input-validation cases pin the bad-input refusal posture.Test plan
go test ./internal/admin/...— all 7TestCapabilityFanout_*tests passgolangci-lint run internal/admin/...— 0 issuesSummary by CodeRabbit
Documentation
Infrastructure Updates
Tests