Skip to content

feat(encryption): Stage 6D-3 — CapabilityFanout helper (Voters ∪ Learners)#793

Open
bootjp wants to merge 4 commits into
mainfrom
feat/encryption-stage6d3-capability-fanout
Open

feat(encryption): Stage 6D-3 — CapabilityFanout helper (Voters ∪ Learners)#793
bootjp wants to merge 4 commits into
mainfrom
feat/encryption-stage6d3-capability-fanout

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented May 20, 2026

Summary

  • Adds internal/admin/capability_fanout.go — the §4.1 helper that the 6D-6 cutover RPC will call to verify every (voter ∪ learner) of every Raft group is reachable and reports encryption_capable=true before the cutover entry is proposed.
  • Dedupes by FullNodeID per §4.1 (a node serving multiple groups is probed exactly once), fans dials concurrently under one context.WithTimeout (default = one RaftElectionTimeout per §4.3), surfaces dial / RPC failure as Reachable=false verdicts so the operator-facing error can name the unreachable peer.
  • Result.OK = true iff every verdict has Reachable && EncryptionCapable — no partial-success mode per §4.3.
  • Adds the §9 table-driven unit tests (every-capable, one-not-capable, one-unreachable, duplicate-nodes-deduplicated) + three input-validation cases (nil DialFunc, non-positive timeout, empty snapshot). 7 tests total, all passing.
  • One deliberate deviation: the result struct is CapabilityFanoutResult rather than the design doc's FanoutResult to avoid colliding with the unrelated admin.FanoutResult already in keyviz_fanout.go (same package).
  • Design-doc lifecycle: renamed 2026_05_18_proposed_6d_enable_storage_envelope.md2026_05_18_partial_6d_enable_storage_envelope.md and updated the doc to record what shipped (6D-1, 6D-2, 6D-3) and what remains (6D-4, 6D-5, 6D-6).
  • Operator-inert: no in-tree callers yet. 6D-6 wires it.

Self-review (5 lenses)

  1. Data loss — N/A; helper is read-only.
  2. Concurrency / distributed failures — fan-out is goroutine-per-target with a single context.WithTimeout governing the deadline. Result aggregation guarded by sync.Mutex. Dedup phase runs before goroutines launch so the loop-var-capture race is avoided (m passed by value as the goroutine arg). No partial-success mode — a single unreachable learner is a hard "no" per §8.
  3. Performance — concurrent dials bounded by len(unique FullNodeID); no quadratic fan-out. Maps + result slice preallocated. uint64ToDecimal avoids fmt for the dedup-key hot path.
  4. Data consistency — N/A; synchronous probe, no side effects. Per §4.2 deliberately does NOT cache — every cutover probes fresh.
  5. Test coverage — §9 table-driven cases all pin. Dedup contract asserted by counting per-address dial calls (stubDial.calls). Input-validation cases pin the bad-input refusal posture.

Test plan

  • go test ./internal/admin/... — all 7 TestCapabilityFanout_* tests pass
  • golangci-lint run internal/admin/... — 0 issues
  • Reviewer approval (standalone helper; no callers wired yet, so no Jepsen suite applies)

Summary by CodeRabbit

  • Documentation

    • Updated Stage 6D milestone status with completed sub-milestones (6D-1 through 6D-3) and outlined remaining work items (6D-4 through 6D-6).
  • Infrastructure Updates

    • Added concurrent capability verification helper to support cluster-wide node readiness assessment.
  • Tests

    • Added comprehensive unit tests covering capability verification scenarios including success, failures, deduplication, and edge cases.

Review Change Stack

…ners GetCapability fan-out)

Adds the §4.1 capability fan-out helper that the 6D-6 EnableStorageEnvelope cutover RPC will call to verify every Voter ∪ Learner of every Raft group is reachable and reports encryption_capable=true before the cutover entry is proposed.

The helper:
- Walks every (group, member) pair across voters AND learners
  (§4.1 / §8 row "One learner unreachable" — learners count as
  hard refusals the same way voters do)
- Dedupes by FullNodeID — a node serving multiple groups is
  probed exactly once
- Dials all targets concurrently with one shared timeout
  (default = one RaftElectionTimeout per §4.3)
- Surfaces dial / RPC failure as Reachable=false verdicts so the
  operator-facing error can name the unreachable peer
- Sets Result.OK = true iff every verdict has
  Reachable && EncryptionCapable (no partial-success mode per §4.3)

Per §9 the unit tests are table-driven and cover the four pinned cases (every-capable, one-not-capable, one-unreachable, duplicate-nodes-deduplicated) plus three input-validation cases (nil DialFunc, non-positive timeout, empty snapshot).

One deliberate deviation from the design doc: the result struct is named CapabilityFanoutResult rather than FanoutResult to avoid colliding with the unrelated admin.FanoutResult that already ships in keyviz_fanout.go (same package). The contract is otherwise unchanged.

Operator-inert: this helper has no callers in-tree yet. 6D-6 wires it.

Self-review (5 lenses):

1. Data loss — N/A; helper is read-only.
2. Concurrency / distributed failures — fan-out is goroutine-per-target with a single context.WithTimeout governing the deadline. Result aggregation guarded by sync.Mutex. The dedup phase runs before goroutines launch so the closure-loop-var case ("for _, m := range" capturing m in a goroutine) cannot race; m is passed by value as the goroutine argument. No partial-success mode — a single unreachable learner is a hard "no" per §8.
3. Performance — concurrent dials bounded by len(unique FullNodeID) per cluster; no quadratic fan-out. dedupKeys map sized at len; results slice preallocated with capacity = len(dedupKeys). The hot path (uint64ToDecimal) skips fmt to avoid one allocation per dedup-key build.
4. Data consistency — N/A; helper is a synchronous probe with no side effects. Per §4.2 it deliberately does NOT cache results — every cutover probes fresh.
5. Test coverage — table-driven §9 cases all pin. Dedup contract asserted by counting per-address dial calls (stubDial.calls). Input-validation cases pin the bad-input refusal posture.

Design-doc-first: renamed `*_proposed_*` → `*_partial_*` and updated the doc to record what shipped (6D-1, 6D-2, 6D-3) and what remains (6D-4, 6D-5, 6D-6).
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 30 minutes and 8 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6a6d5375-cd1b-4219-9d4f-478376596600

📥 Commits

Reviewing files that changed from the base of the PR and between 862e2fd and 438482f.

📒 Files selected for processing (3)
  • docs/design/2026_05_18_partial_6d_enable_storage_envelope.md
  • internal/admin/capability_fanout.go
  • internal/admin/capability_fanout_test.go
📝 Walkthrough

Walkthrough

This PR introduces the CapabilityFanout helper that concurrently probes every unique Raft member in a snapshot for encryption capabilities, returning aggregated reachability and capability verdicts. It also updates the Stage 6D milestone status to reflect the completion of this 6D-3 sub-milestone.

Changes

CapabilityFanout helper and tests

Layer / File(s) Summary
Milestone status update
docs/design/2026_05_18_partial_6d_enable_storage_envelope.md
Stage 6D status changes from proposed to partial; shipped milestones 6D-1 through 6D-3 are documented (including the CapabilityFanout helper), and open milestones 6D-4 through 6D-6 are enumerated.
Data models and entry point
internal/admin/capability_fanout.go
Defines input route structures (RouteMember, RouteGroup, RouteSnapshot), output verdict structures (CapabilityVerdict, CapabilityFanoutResult), DialFunc callback type, and the exported CapabilityFanout entry point with input validation, dedup key construction, timeout context, and concurrent probe orchestration.
Dedup, concurrency, and aggregation
internal/admin/capability_fanout.go
Implements dedup-map building that folds all route members into a keyed set by FullNodeID or address, concurrent probe runner that spawns goroutines and collects verdicts under mutex, per-node probe logic that dials, calls GetCapability, and records results, aggregation predicate for OK verdict, dedup-key selection with fallback logic, and uint64 stringification helper.
Unit tests and test stubs
internal/admin/capability_fanout_test.go
Implements stubEncryptionAdminClient and stubDial for testing, then covers all-capable nodes, single incapable node, single unreachable (dial failure), dedup by FullNodeID, nil dial rejection, non-positive timeout rejection, and empty snapshot edge cases.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • bootjp/elastickv#754: The new CapabilityFanout helper calls the EncryptionAdmin client's GetCapability RPC, directly building on the Stage 5 encryption admin surface introduced there.
  • bootjp/elastickv#786: This PR's CapabilityFanout helper and unit tests implement the capability fan-out behavior specified in the Stage 6D-1 enable-storage-envelope design doc's cutover RPC contract.
  • bootjp/elastickv#762: Planning doc that mentions the CapabilityFanout helper as part of the Stage 6/5E rollout; this PR delivers the code implementation.

Poem

🐰 A fanout hops through Raft with grace,
Each node polled in its rightful place,
No double-dial, dedup's so neat,
Verdicts gathered, concurrency sweet!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically identifies the main change: adding a CapabilityFanout helper for Stage 6D-3 that probes both voters and learners.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/encryption-stage6d3-capability-fanout

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/2026_05_18_partial_6d_enable_storage_envelope.md`:
- Line 11: The heading "### Shipped milestones (lifecycle: partial)" uses H3 and
causes a heading level increment error (MD001); change that heading to H2 (i.e.,
"## Shipped milestones (lifecycle: partial)") so it follows the document's H1
and fixes the heading-increment lint issue.

In `@internal/admin/capability_fanout.go`:
- Around line 185-187: The code currently returns early when member.Address ==
"" which silently drops that member from Verdicts; instead, mark that member in
Verdicts as failed (set its OK=false and an explanatory error/reason) so the
overall fanout fails closed, then continue to the next member. Update the block
that contains the member.Address check in internal/admin/capability_fanout.go to
populate Verdicts[member] (or Verdicts[member.ID]) with a failing verdict
(OK=false plus a message like "empty address") rather than returning, and then
continue the loop.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b781744d-304a-4780-9f7d-fd0f38351a48

📥 Commits

Reviewing files that changed from the base of the PR and between c9a7870 and 862e2fd.

📒 Files selected for processing (3)
  • docs/design/2026_05_18_partial_6d_enable_storage_envelope.md
  • internal/admin/capability_fanout.go
  • internal/admin/capability_fanout_test.go

Comment thread docs/design/2026_05_18_partial_6d_enable_storage_envelope.md Outdated
Comment thread internal/admin/capability_fanout.go Outdated
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the CapabilityFanout helper in internal/admin to verify encryption capabilities across Raft group members, which is a critical component for the storage envelope cutover. The changes include the implementation of the fan-out logic with deduplication by FullNodeID, comprehensive unit tests, and an update to the design documentation to reflect current milestone progress. Review feedback suggests ensuring that members with empty addresses are not silently dropped to avoid potential state inconsistencies and recommends pre-allocating the deduplication map to improve performance.

Comment thread internal/admin/capability_fanout.go Outdated
Comment on lines +185 to +192
if member.Address == "" {
return
}
key := capabilityFanoutDedupKey(member)
if _, exists := m[key]; exists {
return
}
m[key] = member
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Silently skipping members with an empty address could lead to a premature cutover if a node is misconfigured in the membership view. According to the §4.1 contract, every member (voter or learner) must be checked. If a node is missing an address, it should be included in the fan-out so that the dial fails and the overall result correctly reports OK=false due to an unreachable member. This ensures that entries are not silently dropped, preventing state inconsistencies.

key := capabilityFanoutDedupKey(member)
if _, exists := m[key]; exists {
	return
}
m[key] = member
References
  1. Avoid silently dropping entries during serialization or normalization to ensure they are preserved and avoid state inconsistencies.

Comment thread internal/admin/capability_fanout.go Outdated
// for unidentified rows). Pulled out of CapabilityFanout to keep
// the orchestration body under the cyclomatic-complexity budget.
func buildCapabilityFanoutDedupSet(routes RouteSnapshot) map[string]RouteMember {
dedupKeys := make(map[string]RouteMember)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The dedupKeys map can be pre-allocated with a size hint to reduce reallocations during the membership walk. While Raft groups are typically small, providing a hint based on the total number of members in the snapshot is a good practice for efficiency, provided the snapshot size is already bounded.

var totalMembers int
for _, group := range routes.Groups {
	totalMembers += len(group.Voters) + len(group.Learners)
}
dedupKeys := make(map[string]RouteMember, totalMembers)
References
  1. Pre-allocating a buffer is acceptable if a strictly enforced maximum size limit is checked before allocation and the entire data must be materialized in memory.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 862e2fd3b5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/admin/capability_fanout.go Outdated
Comment on lines +185 to +186
if member.Address == "" {
return
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Fail closed when a route member lacks an address

Returning early for member.Address == "" silently removes that voter/learner from the fan-out set, so CapabilityFanout can report OK=true without probing every member in the snapshot. This breaks the helper’s safety contract in malformed or partially-populated route snapshots (e.g., a member entry exists but its address is missing) because the cutover preflight may pass even though one node was never checked; this should instead produce a hard failure (validation error or an explicit unreachable verdict).

Useful? React with 👍 / 👎.

…major (fail-closed on malformed members) + minor doc + medium preallocation

Three reviewer findings from round-1 of PR #793, all addressed:

## codex P1 / gemini high / coderabbit major - fail-closed on empty address

Pre-fix: addCapabilityFanoutDedupTarget early-returned when member.Address == "", silently dropping malformed catalog rows from Result.Verdicts. The cutover RPC would then see fewer verdicts than members, and OK=true could sneak through against an unprobed peer (the section 4.1 contract requires every voter union learner to be reached).

Fix: admit every row into the dedup map; classify empty-Address members in probeCapability as Reachable=false with a new errCapabilityFanoutMalformedMember sentinel so the cutover RPC caller-facing error can name the misconfigured member. Result.OK collapses to false the same way an unreachable real peer would.

Sibling case (fully-malformed row: FullNodeID=0 AND Address=""): use a synthetic ordinal-based dedup key so two malformed rows surface as two separate verdicts rather than collapsing into one (which would hide one of the two misconfigurations).

Regression tests per CLAUDE.md test-then-fix convention:

- TestCapabilityFanout_EmptyAddressFailsClosed - pins empty-Address row appears as Reachable=false in Result.Verdicts with Err set, OK=false.
- TestCapabilityFanout_FullyMalformedRowsNotCollapsed - pins two malformed rows yield two separate verdicts.

Both tests confirmed failing on the pre-fix code; both pass on the fix.

## caller audit (semantic change)

addCapabilityFanoutDedupTarget and probeCapability are both file-private. Sole callers (buildCapabilityFanoutDedupSet, runCapabilityFanoutProbes) are also in this file. No external propagation. The public-facing semantic change is "Result.Verdicts now includes malformed-row entries as Reachable=false" - this is strictly safer than the pre-fix silent drop and the OK computation in capabilityFanoutAllOK already requires Reachable=true so no caller-side change is needed.

## coderabbit minor - MD001 heading-level skip

Promoted "### Shipped milestones" and "### Open milestones" to H2 in the partial design doc. The pre-fix doc went directly from H1 to H3 with no H2 in between, tripping MD001.

## gemini medium - preallocate dedup map

buildCapabilityFanoutDedupSet now pre-sizes the dedup map with sum(len(group.Voters) + len(group.Learners)) as the upper bound. The map shrinks at no runtime cost when dedup collapses entries.

## test/lint evidence

- go test -run TestCapabilityFanout ./internal/admin/... - all 9 tests pass (7 pre-existing + 2 new regressions)
- golangci-lint run internal/admin/... - 0 issues
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 20, 2026

@claude review

Round-1 (commit b46d1f5e) addressed three reviewer findings from the initial round:

codex P1 / gemini high / coderabbit major — fail-closed on empty address

Pre-fix: addCapabilityFanoutDedupTarget returned early when member.Address == "", silently dropping malformed catalog rows from Result.Verdicts. The cutover RPC could then see OK=true against an unprobed peer.

Fix: admit every row into the dedup map; probeCapability classifies empty-Address members as Reachable=false with a new errCapabilityFanoutMalformedMember sentinel. The caller-facing error can now name the misconfigured member.

Sibling case (FullNodeID=0 AND Address=""): synthetic ordinal-based dedup key so two malformed rows surface as two separate verdicts rather than collapsing into one and hiding the second misconfiguration.

Regression tests added per CLAUDE.md test-then-fix convention:

  • TestCapabilityFanout_EmptyAddressFailsClosed
  • TestCapabilityFanout_FullyMalformedRowsNotCollapsed

Both confirmed failing pre-fix, passing post-fix.

Caller audit: addCapabilityFanoutDedupTarget and probeCapability are file-private; sole callers are also in this file; no external propagation. The public-facing semantic change ("Result.Verdicts now includes malformed-row entries as Reachable=false") is strictly safer than the pre-fix silent drop, and capabilityFanoutAllOK already requires Reachable=true.

coderabbit minor — MD001 heading-level skip

Promoted ### Shipped milestones and ### Open milestones to H2 in the partial design doc.

gemini medium — preallocate dedup map

buildCapabilityFanoutDedupSet now pre-sizes the dedup map with the upper-bound member count.

Verification

  • go test -run TestCapabilityFanout ./internal/admin/... — all 9 tests pass (7 pre-existing + 2 new regressions)
  • golangci-lint run internal/admin/... — 0 issues

@claude
Copy link
Copy Markdown

claude Bot commented May 20, 2026

Claude finished @bootjp's task in 1m 58s —— View job


Round-1 Review

  • Gather context and read changed files
  • Review Round-1 fixes (fail-closed empty address, MD001, dedup map preallocation)
  • Verify test coverage for new regression tests
  • Post comprehensive review

The three Round-1 fixes all land correctly. Detailed findings below.


Fix 1 — Fail-closed on empty address (Codex P1 / Gemini high / CodeRabbit major) ✅

Approach is correct. The fix removes the early-return guard from addCapabilityFanoutDedupTarget entirely; probeCapability now short-circuits at line 252–255 when member.Address == "" and returns Reachable=false + Err = errCapabilityFanoutMalformedMember. Every member now appears in Result.Verdicts; capabilityFanoutAllOK correctly reads that verdict as a failure.

Sibling case (FullNodeID=0 ∧ Address=""): The synthetic ordinal key ("synthetic:<len(m)>") gives distinct map entries to two structurally identical malformed rows. Tracing through the test fixture:

first row:  len(m)=0 → key="synthetic:0"  → inserted
second row: len(m)=1 → key="synthetic:1"  → inserted

Two goroutines launch, both hit the empty-address branch, both produce Reachable=false verdicts. len(Result.Verdicts) == 2. Correct.

One observation (no action needed): Both doubly-malformed verdicts carry FullNodeID=0, so an operator querying Verdicts by node ID cannot distinguish them. The Err strings ("full_node_id=0 has empty address") are identical too. For genuine catalog misconfiguration, the operator has no way to know which slot corresponds to which member. This is an inherent limitation of the row having no identity at all; the current behavior is strictly better than the pre-fix silent drop. A future improvement could add a SlotHint field (group ID + voter/learner index) to CapabilityVerdict, but that is beyond the §4.1 scope of this PR.

Regression tests (TestCapabilityFanout_EmptyAddressFailsClosed, TestCapabilityFanout_FullyMalformedRowsNotCollapsed) correctly follow the CLAUDE.md test-then-fix convention and pin both the single-malformed and doubly-malformed paths. The pre-fix failure case is documented in the test docstrings.


Fix 2 — MD001 heading-level skip (CodeRabbit minor) ✅

### Shipped milestones and ### Open milestones are promoted to ## ... in the partial design doc (lines 11 and 25). Both now sit at H2, directly under the H1 title. The heading hierarchy is correct throughout the rest of the doc.


Fix 3 — Dedup map preallocation (Gemini medium) ✅

buildCapabilityFanoutDedupSet now computes upperBound in a pre-pass and passes it to make. The two-pass pattern (count, then build) is the standard Go idiom for this. The comment correctly notes the map may shrink when dedup collapses entries, and the function is pulled out of CapabilityFanout as before, keeping the cyclomatic complexity budget intact.


Other observations (informational only)

Design doc §4.1 type name. The doc's code block still shows FanoutResult; the implementation uses CapabilityFanoutResult. The reason is explained in the Shipped milestones entry and in the PR description. No fix needed — the deviation is documented and justified.

capabilityFanoutDedupKey signature change. The function now takes ordinal int from the caller. This couples the caller and the function, but since addCapabilityFanoutDedupTarget is the only call site and is private to this file, the coupling is tight and safe. The function's doc comment explains all three cases clearly.

Loop variable capture. The goroutine spawn at capability_fanout.go:172–179 passes m by value as member. With Go 1.22+ loop semantics (module declares go 1.25) the capture pattern is not strictly necessary, but it's harmless and documents intent.


Summary

All three Round-1 findings are addressed correctly. The two regression tests follow the project's test-then-fix convention and are comprehensive. No new issues introduced. LGTM for the implementation sub-PR shape; ready for 6D-4.
| Branch: feat/encryption-stage6d3-capability-fanout

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b46d1f5e24

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if closer != nil {
defer closer()
}
report, err := client.GetCapability(ctx, &pb.Empty{})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Handle nil dial client before issuing RPC

If dial returns (nil, closer, nil), this line panics on client.GetCapability(...) and can crash the admin RPC path instead of producing a Reachable=false verdict. Because DialFunc is injected and not validated beyond err, the helper should fail closed when client == nil (treating it like a dial failure) so a bad dialer implementation cannot take down cutover preflight.

Useful? React with 👍 / 👎.

…t buggy DialFunc)

Codex r2 P1: if a DialFunc returns (nil, nil, nil) - no error but also no client - probeCapability would panic on client.GetCapability(...) and crash the admin RPC path. The fix is a defensive nil-client check that converts the contract violation into a Reachable=false verdict with a new errCapabilityFanoutBadDialer sentinel.

Regression test per CLAUDE.md test-then-fix convention:

- TestCapabilityFanout_NilClientFromDialFailsClosed - injects a buggy dialer that returns (nil, nil, nil); confirmed panicking on the pre-fix code (goroutine traceback through probeCapability:264); passes on the fix.

Caller audit: probeCapability is file-private, sole caller is runCapabilityFanoutProbes also in this file. No external propagation. Semantic change is "nil client from a buggy dialer now produces a Reachable=false verdict instead of a panic" - strictly safer than the pre-fix crash.

Lint clean. All 10 TestCapabilityFanout_* tests pass.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 20, 2026

@claude review

Round-2 (commit f2bbdaa3): addressed codex r2 P1.

codex r2 P1 — nil-client guard against buggy DialFunc

Pre-fix: if a DialFunc returns (nil, nil, nil) — no error but also no client — probeCapability would panic on client.GetCapability(...) and crash the admin RPC path. The fix is a defensive nil-client check that converts the contract violation into a Reachable=false verdict with a new errCapabilityFanoutBadDialer sentinel.

Regression test per CLAUDE.md test-then-fix:

  • TestCapabilityFanout_NilClientFromDialFailsClosed — injects a buggy dialer that returns (nil, nil, nil). Confirmed panicking on the pre-fix code (goroutine traceback through probeCapability:264); passes on the fix.

Caller audit: probeCapability is file-private, sole caller runCapabilityFanoutProbes is also in this file. No external propagation. The semantic change is "nil client from a buggy dialer now produces a Reachable=false verdict instead of a panic" — strictly safer than the pre-fix crash.

Verification

  • go test -run TestCapabilityFanout ./internal/admin/... — all 10 tests pass (8 pre-existing + 1 new regression + 1 from round-1)
  • golangci-lint run internal/admin/... — 0 issues

@claude
Copy link
Copy Markdown

claude Bot commented May 20, 2026

Claude finished @bootjp's task in 1m 27s —— View job


Round-2 Review

  • Load tools
  • Read changed files and diff
  • Review nil-client guard fix
  • Review regression test
  • Post comprehensive review

The Round-2 fix (nil-client guard) is correct. Detailed analysis below.


Fix — nil-client guard against buggy DialFunc

Location: capability_fanout.go:271-274

if closer != nil {
    defer closer()
}
if client == nil {
    verdict.Err = pkgerrors.Wrapf(errCapabilityFanoutBadDialer, "dial %s", member.Address)
    return verdict
}

Ordering is intentional and correct. The closer defer is registered before the nil-client check. If a buggy dialer returns (nil, someCloser, nil), the cleanup still runs when probeCapability returns. That's the right thing — a resource was allocated even though the client is nil, and skipping the close would be a leak. The test fixture uses (nil, nil, nil) which sidesteps this, but the production-code path handles the (nil, nonNilCloser, nil) case correctly too.

Sentinel value is correctly scoped. errCapabilityFanoutBadDialer is unexported, consistent with errCapabilityFanoutMalformedMember and errCapabilityFanoutBadInput. The CapabilityFanout API contract says callers check v.Reachable + v.Err != nil — they don't need to errors.Is against the sentinel. If 6D-6 ever needs to distinguish "bad dialer" from other dial failures for logging or telemetry, export the sentinel at that point; premature export now would only widen the public surface.

Reachable stays false on nil-client path. verdict.Reachable is only set to true at line 280, after the successful GetCapability call — the nil-client early return at 271-274 leaves it false. capabilityFanoutAllOK correctly rejects any verdict with Reachable=false, so Result.OK will be false. Fail-closed contract is satisfied.


Regression test — TestCapabilityFanout_NilClientFromDialFailsClosed

The test injects a dialer that returns exactly (nil, nil, nil) and asserts:

  • err == nil (no input validation error — this is a runtime/contract violation, not bad input)
  • res.OK == false
  • len(res.Verdicts) == 1
  • v.Reachable == false
  • v.Err != nil

All five assertions are necessary and sufficient to pin the regression. Per CLAUDE.md convention, the test was confirmed failing on pre-fix code (panic via probeCapability:264) and passing post-fix.

One minor test gap (no action needed now): The test doesn't exercise (nil, nonNilCloser, nil) — the case where the closer is returned alongside a nil client. The production code handles it correctly (closer deferred before nil check), but this sub-case isn't pinned. Worth noting for the 6D-6 integration test, where the real connection pool may return a non-nil closer on partial setup.


Overall

All three rounds of review feedback are now addressed:

  • Round-0: core implementation (7 tests)
  • Round-1: fail-closed empty address + MD001 + dedup map preallocation (→ 9 tests)
  • Round-2: nil-client guard against buggy DialFunc (→ 10 tests)

No new issues introduced. LGTM — ready for 6D-4 wiring.
| Branch: feat/encryption-stage6d3-capability-fanout

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f2bbdaa376

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +284 to +285
if report.GetFullNodeId() != 0 {
verdict.FullNodeID = report.GetFullNodeId()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reject capability replies from unexpected node IDs

The probe accepts any successful GetCapability response as evidence that the target member is reachable/capable, but it never validates that the responder’s full_node_id matches the expected route member. In a stale/misrouted address scenario (or if two members accidentally share an address), one healthy node can satisfy probes for another and OK can incorrectly become true even though the intended member was never verified. This check should fail closed on ID mismatch instead of overwriting the verdict ID.

Useful? React with 👍 / 👎.

Comment thread internal/admin/capability_fanout.go Outdated
mu.Unlock()
}(m)
}
wg.Wait()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor timeout even when a probe ignores context

CapabilityFanout documents a hard timeout bound, but runCapabilityFanoutProbes waits for every goroutine unconditionally. If a DialFunc/client implementation blocks without honoring ctx cancellation, wg.Wait() never returns and the admin path can hang indefinitely, violating the helper’s contract that it returns within timeout and reports missing peers as unreachable.

Useful? React with 👍 / 👎.

… guard + timeout-bounded helper return)

Two codex r3 findings, both addressed:

## codex r3 P1 - reject capability replies from unexpected node IDs

Pre-fix: probeCapability accepted any successful GetCapability response as evidence that the expected route member was reachable, even when the responder reported a different full_node_id (stale routing, two members accidentally sharing an address, etc.). Result: one healthy node could satisfy probes for another, and Result.OK could become true even though the intended member was never verified.

Fix: when both the snapshot's expected FullNodeID and the response's full_node_id are non-zero AND they disagree, the verdict is Reachable=false with a new errCapabilityFanoutMismatchedResponder sentinel. The cutover RPC refuses with OK=false rather than silently passing.

Regression test: TestCapabilityFanout_MismatchedResponderIDFailsClosed - snapshot says n2:9000 has FullNodeID=2, responder reports FullNodeID=99, helper now produces Reachable=false instead of the pre-fix Reachable=true.

## codex r3 P2 - honor timeout even when a probe ignores ctx

Pre-fix: runCapabilityFanoutProbes did wg.Wait() unconditionally. A DialFunc/client that ignored ctx cancellation would hang the helper indefinitely, violating the §4.3 contract "Returns within `timeout`".

Fix: pre-seed every dedup-key slot with a Reachable=false verdict carrying errCapabilityFanoutProbeTimeout. Each goroutine overwrites its slot when its probe finishes. After launching goroutines, the orchestrator waits on min(done, ctx.Done()) and then returns whatever slot values are present. Probes that didn't finish by the deadline surface as Reachable=false with the pre-seeded timeout error.

Note on goroutine cleanup: goroutines whose probes ignore ctx continue running until they unblock on their own. The helper documents that explicitly: contract is "the HELPER returns within timeout", not "every spawned goroutine returns". A dialer/client that ignores ctx is a bug on the caller side; the helper's job is to prevent that bug from hanging the admin RPC path.

Regression test: TestCapabilityFanout_TimeoutBoundsHelperReturn - injects a hanging client that ignores ctx; the pre-fix code timed out the entire 30s test process (goroutine traceback through probeCapability); the fix returns within 500ms (5x the 100ms budget; generous for CI).

## caller audit (semantic change)

probeCapability and runCapabilityFanoutProbes are both file-private. The new sentinels (errCapabilityFanoutMismatchedResponder, errCapabilityFanoutProbeTimeout) are also file-private. Sole caller of probeCapability is runCapabilityFanoutProbes; sole caller of runCapabilityFanoutProbes is CapabilityFanout. The public-facing semantic change is:

  1. Stale-routing scenarios that previously slipped through as OK=true now produce OK=false. Strictly safer.
  2. Hung probes that previously hung the helper now produce Reachable=false verdicts. Strictly safer.

capabilityFanoutAllOK already requires Reachable=true so no caller-side change is needed.

## test/lint evidence

- go test -race -run TestCapabilityFanout -timeout 60s ./internal/admin/... - all 12 tests pass (10 pre-existing + 2 new regressions) under -race
- golangci-lint run internal/admin/... - 0 issues
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 20, 2026

@claude review

Round-3 (commit 438482f3): addressed two codex r3 findings.

codex r3 P1 — reject capability replies from unexpected node IDs

Pre-fix: probeCapability accepted any successful GetCapability response, even when the responder's full_node_id differed from the snapshot's expected one (stale routing, shared address). One healthy node could satisfy probes for another, and Result.OK could become true despite the intended member never being verified.

Fix: when both expected and response full_node_ids are non-zero and disagree, the verdict is Reachable=false with a new errCapabilityFanoutMismatchedResponder sentinel.

Regression test: TestCapabilityFanout_MismatchedResponderIDFailsClosed.

codex r3 P2 — honor timeout even when a probe ignores ctx

Pre-fix: runCapabilityFanoutProbes did wg.Wait() unconditionally. A buggy DialFunc/client that ignored ctx cancellation would hang the helper indefinitely, violating the §4.3 "Returns within timeout" contract.

Fix: pre-seed every dedup-key slot with a Reachable=false verdict carrying errCapabilityFanoutProbeTimeout. Each goroutine overwrites its slot when its probe finishes. The orchestrator selects on min(done, ctx.Done()) and returns whatever slot values are present.

Goroutine cleanup: goroutines that ignore ctx continue running. The helper documents this — the contract is "the HELPER returns within timeout", not "every spawned goroutine returns within timeout". A dialer that ignores ctx is a caller-side bug; the helper's job is to prevent that bug from hanging the admin RPC path.

Regression test: TestCapabilityFanout_TimeoutBoundsHelperReturn — the pre-fix code hung the entire 30s test process; the fix returns within 500ms (5x budget for CI margin).

Caller audit

probeCapability, runCapabilityFanoutProbes, and the two new sentinels are all file-private. The public-facing semantic changes are: (1) stale-routing scenarios previously slipping through as OK=true now produce OK=false, and (2) hung probes that hung the helper now produce Reachable=false verdicts. Both strictly safer. capabilityFanoutAllOK already requires Reachable=true so no caller-side change is needed.

Verification

  • go test -race -run TestCapabilityFanout -timeout 60s ./internal/admin/... — all 12 tests pass (10 pre-existing + 2 new regressions) under -race
  • golangci-lint run internal/admin/... — 0 issues

@claude
Copy link
Copy Markdown

claude Bot commented May 20, 2026

Claude finished @bootjp's task in 2m 9s —— View job


Round-3 Review

  • Load tools
  • Read changed files (capability_fanout.go, capability_fanout_test.go)
  • Review mismatched-responder guard (codex r3 P1)
  • Review timeout-bounded helper return (codex r3 P2)
  • Post comprehensive review

Both Round-3 fixes land correctly. Detailed analysis below.


Fix 1 — Mismatched-responder guard (codex r3 P1) ✅

Location: capability_fanout.go:339–343

if member.FullNodeID != 0 && report.GetFullNodeId() != 0 && report.GetFullNodeId() != member.FullNodeID {
    verdict.Err = pkgerrors.Wrapf(errCapabilityFanoutMismatchedResponder,
        "%s: expected full_node_id=%d got %d", member.Address, member.FullNodeID, report.GetFullNodeId())
    return verdict
}

The three-clause guard correctly handles all four cases:

member.FullNodeID response.FullNodeId result
non-zero non-zero, disagrees Reachable=false (stale routing / shared address) ✅
non-zero zero passes through — legacy node that predates the ID field ✅
zero anything passes through — stub catalog, dedup fell back to address ✅
non-zero same non-zero passes through ✅

The second row (non-zero expected, zero response) is a deliberate design choice to accommodate pre-capability nodes that don't populate full_node_id in the response. Tightening this in future (dropping the report.GetFullNodeId() != 0 clause) would be a one-liner once the cluster is fully upgraded, and the comment makes the intent clear enough.

verdict.FullNodeID enrichment (lines 348–350): after the guard, if the response carries a non-zero ID, it is propagated to the verdict. This means CapabilityVerdict.FullNodeID reflects the actual responder identity rather than just the snapshot expectation. Correct, and consistent with the behavior for the member.FullNodeID == 0 (address-dedup) case where the response is the only source of truth.

Nil-report safety: protobuf-go getters are nil-safe, so report.GetFullNodeId() → 0 even if GetCapability returns (nil, nil). In that edge case the guard passes through (zero response ID), verdict.Reachable = true, and verdict.EncryptionCapable = false — still fail-closed overall, just via capability rather than reachability.

Regression test TestCapabilityFanout_MismatchedResponderIDFailsClosed: expected=2, response=99. The five assertions (err==nil, OK==false, len==1, Reachable==false, Err!=nil) are necessary and sufficient to pin the contract.


Fix 2 — Timeout-bounded helper return (codex r3 P2) ✅

Architecture: pre-seeded slot pattern in runCapabilityFanoutProbes (lines 203–244).

Correctness trace:

  1. All slots pre-seeded with {Reachable: false, Err: errCapabilityFanoutProbeTimeout} — any slot not overwritten by deadline returns as a timed-out verdict.
  2. Each goroutine acquires mu and writes *slots[slotKey] = verdict — value-overwrites the slot.
  3. Orchestrator selects on <-done (all probes finished) vs <-ctx.Done() (deadline fired).
  4. Final collection acquires mu, value-copies all slots into out, returns.

Race safety after timeout:

  • out contains value copies, so lingering goroutines writing to *slots[slotKey] after the orchestrator returns cannot corrupt the result.
  • The slots map and its heap-allocated CapabilityVerdict values are retained by the lingering goroutines' closures until those goroutines exit. No data corruption, just a documented goroutine leak.
  • The done goroutine (go func() { wg.Wait(); close(done) }()) is an extra goroutine per call but is cheap — it exits promptly once all probe goroutines finish.

Documented goroutine leak posture: the doc comment on runCapabilityFanoutProbes (lines 197–202) is explicit: "the contract is 'the HELPER returns within timeout', not 'every spawned goroutine returns within timeout'". This is the correct framing. The admin RPC path is protected; the dialer implementor is responsible for honoring ctx.

Regression test TestCapabilityFanout_TimeoutBoundsHelperReturn:

  • hangingEncryptionAdminClient blocks on hangCh and explicitly ignores ctx — precisely the pre-fix hang scenario.
  • defer close(hangCh) ensures the goroutine unblocks after the test, preventing a persistent goroutine leak from the test suite.
  • 500ms bound (5× the 100ms timeout) is generous enough for CI jitter; the pre-fix code would have hung for the full 30s test timeout, so any bounded return is a clear pass.
  • Asserts Reachable=false on the hung-probe verdict, confirming the pre-seeded slot value was returned.

Overall — Rounds 0–3

All rounds of review feedback have been addressed and pinned with regression tests:

Round Finding Test(s) added Status
0 Core implementation 7 baseline tests
1 Fail-closed empty address EmptyAddressFailsClosed, FullyMalformedRowsNotCollapsed
1 MD001 heading level
1 Dedup map preallocation
2 Nil client from buggy dialer NilClientFromDialFailsClosed
3 Mismatched-responder ID MismatchedResponderIDFailsClosed
3 Timeout bounds helper return TimeoutBoundsHelperReturn

12 tests, all passing under -race. LGTM — ready for 6D-4 wiring.
| Branch: feat/encryption-stage6d3-capability-fanout

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant