Skip to content

feat(encryption): Stage 6D-6c-3a - wire capability fan-out closure into EnableStorageEnvelope#830

Merged
bootjp merged 4 commits into
mainfrom
feat/encryption-6d6c3a-capability-fanout-wiring
May 25, 2026
Merged

feat(encryption): Stage 6D-6c-3a - wire capability fan-out closure into EnableStorageEnvelope#830
bootjp merged 4 commits into
mainfrom
feat/encryption-6d6c3a-capability-fanout-wiring

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented May 25, 2026

Summary

Stage 6D-6c-3a wires the §4 capability fan-out closure into main.go so the EnableStorageEnvelope cutover RPC can run its pre-flight GetCapability fan-out. Previously s.capabilityFanout == nil, so the cutover refused with the §4 "fan-out not wired" FailedPrecondition. The fan-out helper (internal/admin.CapabilityFanout, 6D-3) and the RPC option (adapter.WithEncryptionAdminCapabilityFanout, 6D-6a) already existed and are tested — this slice is pure wiring.

Doc-first per CLAUDE.md: the proposed design doc is the first commit.

What landed

  • main_encryption_fanout.gobuildCapabilityFanoutFn assembles the adapter.CapabilityFanoutFn. The closure snapshots all groups' membership (the §4.1 "voter ∪ learner of every group" contract) via each runtime's engine.Configuration(ctx); routeGroupFromConfiguration maps raftengine.Serveradmin.RouteMember keyed by etcd.DeriveNodeID(srv.ID), split voter/learner on Suffrage (empty == voter). The DialFunc dials each peer's EncryptionAdmin endpoint over a dedicated kv.GRPCConnCache (no-op cleanup — the cache owns conns).
  • buildEncryptionCapabilityFanout — owns the dedicated conn cache (not the --adminEnabled-gated admin-forward cache, so the fan-out works whenever encryption mutators are enabled), drained on ctx cancellation via eg. Returns nil when mutators are disabled. Extracted to keep startRaftServers within the cyclop budget.
  • registerEncryptionAdminServer — threads the closure through WithEncryptionAdminCapabilityFanout, gated on the same enableMutators boolean as the Proposer/LeaderView.

Fail-closed posture

A snapshot-build error (any engine.Configuration failure) fails the closure → the cutover refuses rather than proposing against a membership view it couldn't fully enumerate. Unreachable / not-capable members are handled inside CapabilityFanout (OK=false → refuse; no partial success, §4.3).

Scope boundary

The end-to-end test (single-node: Bootstrap → EnableStorageEnvelope → Put → read-back-via-envelope) lands in 6D-6c-3b on top of this wiring.

Self-review (5 lenses)

  1. Data loss — read-only control-plane wiring; no writes, no apply-path change.
  2. ConcurrencyCapabilityFanout is already concurrent + timeout-bounded; the snapshot builder runs per-RPC single-shot; the conn cache is concurrency-safe; cleanup goroutine drains it on ctx.Done().
  3. Performance — the fan-out runs once per (rare) cutover RPC, off the data path.
  4. Data consistency — the cutover semantics (6D-6a) are unchanged; fan-out fails closed on snapshot error / unreachable / not-capable.
  5. Test coveragerouteGroupFromConfiguration mapping (voter/learner split incl. empty-suffrage == voter, DeriveNodeID identity, address, multi-server, empty config) + builder non-nil. Full fan-out behavior is covered by the 6D-6c-3b e2e.

Test plan

  • go test -race .
  • golangci-lint run . (0 issues)
  • CI green

Summary by CodeRabbit

  • Documentation

    • Updated storage envelope design milestones, marking capability fan-out closure wiring and production integration as completed.
  • New Features

    • Implemented capability fan-out closure wiring for storage envelope cutover operations, ensuring fail-closed behavior when prerequisites aren't met.
  • Tests

    • Added comprehensive test coverage for encryption fan-out logic, membership snapshot construction, and route building.

Review Change Stack

bootjp added 2 commits May 25, 2026 22:33
…e wiring

Proposes wiring the §4 capability fan-out closure into main.go so the
EnableStorageEnvelope cutover RPC can run its pre-flight GetCapability
fan-out (today s.capabilityFanout is nil → cutover refuses with the §4
"not wired" FailedPrecondition).

Design: build an all-groups admin.RouteSnapshot from each runtime's
engine.Configuration(ctx), a DialFunc over a dedicated kv.GRPCConnCache
returning pb.EncryptionAdminClient (no-op cleanup, cache owns conns),
and assemble adapter.CapabilityFanoutFn; wire via
WithEncryptionAdminCapabilityFanout gated on enableMutators. Fail-closed
when the membership view can't be fully enumerated. Dedicated conn cache
(not the --adminEnabled-gated admin-forward cache) so the fan-out works
whenever encryption mutators are enabled.

e2e test deferred to 6D-6c-3b. Doc-first per CLAUDE.md; implementation
commits follow on the same PR.
…to EnableStorageEnvelope

Wires the §4 capability fan-out into main.go so the EnableStorageEnvelope
cutover RPC can run its pre-flight GetCapability fan-out (previously
s.capabilityFanout was nil → cutover refused with the §4 "not wired"
FailedPrecondition).

- main_encryption_fanout.go: buildCapabilityFanoutFn assembles the
  adapter.CapabilityFanoutFn. The closure snapshots all groups'
  membership via each runtime's engine.Configuration(ctx)
  (routeGroupFromConfiguration maps raftengine.Server → admin.RouteMember
  keyed by etcd.DeriveNodeID, split voter/learner on Suffrage with empty
  == voter), and dials each peer's EncryptionAdmin endpoint over a
  dedicated kv.GRPCConnCache (no-op cleanup; the cache owns conns). A
  snapshot-build error fails the closure → cutover refuses (fail-closed).
- buildEncryptionCapabilityFanout owns the dedicated conn cache (NOT the
  --adminEnabled-gated admin-forward cache, so the fan-out works whenever
  encryption mutators are enabled), drained on ctx cancellation via eg.
  Returns nil when mutators are disabled. Extracted as a helper to keep
  startRaftServers under the cyclop budget.
- registerEncryptionAdminServer threads the closure through
  WithEncryptionAdminCapabilityFanout, gated on the same enableMutators
  boolean as the Proposer/LeaderView.

The end-to-end Bootstrap→cutover→Put→read-back test lands in 6D-6c-3b.

Tests: routeGroupFromConfiguration voter/learner split (incl. empty
suffrage == voter), DeriveNodeID node identity, address mapping,
multi-server, empty configuration; builder returns non-nil.

Self-review (5 lenses):
- Data loss/consistency: read-only control-plane wiring; no writes, no
  apply-path change; cutover semantics unchanged (6D-6a).
- Concurrency: CapabilityFanout already concurrent + timeout-bounded;
  snapshot builder per-RPC single-shot; conn cache concurrency-safe;
  cleanup via eg on ctx.Done.
- Performance: fan-out runs once per (rare) cutover RPC, off the data
  path.
- Test coverage: mapping helper unit-tested; full fan-out behavior
  covered by the 6D-6c-3b e2e.

Doc commit (6D-6c-3a proposed) landed first; 6D milestone doc updated.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 25, 2026

Warning

Review limit reached

@bootjp, we couldn't start this review because you've used your available PR reviews for now.

Your plan includes 1 review of capacity. Refill in 51 minutes and 45 seconds.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more review capacity refills, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d37ee521-5e6e-4308-b318-207eef29bee3

📥 Commits

Reviewing files that changed from the base of the PR and between 26ecce5 and 664d8f9.

📒 Files selected for processing (3)
  • docs/design/2026_05_18_partial_6d_enable_storage_envelope.md
  • docs/design/2026_05_25_partial_6d6c3a_capability_fanout_wiring.md
  • main_encryption_fanout.go
📝 Walkthrough

Walkthrough

This PR implements the 6D-6c-3a capability fan-out wiring for EnableStorageEnvelope cutover RPCs. The change adds snapshot-based routing of live Raft membership across runtimes through a dedicated gRPC connection cache, gates the behavior behind the enableMutators condition, and wires the closure into the admin server registration to ensure fail-closed behavior.

Changes

EnableStorageEnvelope capability fan-out wiring

Layer / File(s) Summary
Design specification for capability fan-out
docs/design/2026_05_25_partial_6d6c3a_capability_fanout_wiring.md, docs/design/2026_05_18_partial_6d_enable_storage_envelope.md
Defines the 6D-6c-3a checkpoint with snapshot construction across runtimes, gRPC cache reuse, gating alignment with enableMutators, and fail-closed error handling. Updates stage 6D status to ship 6D-6c-2 and 6D-6c-3a.
Admin server registration with capability fan-out parameter
main_encryption_admin.go
Extends registerEncryptionAdminServer to accept capabilityFanout parameter and conditionally wires WithEncryptionAdminCapabilityFanout when enableMutators is active and fanout is non-nil.
Fanout helper construction and orchestration in main
main.go
startRaftServers builds encryptionCapabilityFanout via buildEncryptionCapabilityFanout before per-group server creation and threads the fanout through each registerEncryptionAdminServer call.
Fan-out closure implementation with snapshot and routing
main_encryption_fanout.go
Implements buildEncryptionCapabilityFanout, buildCapabilityFanoutFn, capabilityRouteSnapshot, routeSnapshotFromSources, and routeGroupFromConfiguration. Snapshots live Raft membership across runtimes, derives node IDs, splits voters/learners by suffrage, applies timeout bounding, and dials via dedicated kv.GRPCConnCache.
Fanout snapshot and routing tests
main_encryption_fanout_test.go
Tests routeGroupFromConfiguration voter/learner splitting and FullNodeID derivation, empty configuration handling, buildCapabilityFanoutFn closure construction, buildEncryptionCapabilityFanout disabled return, routeSnapshotFromSources group mapping and configuration error fail-closed behavior.
Admin server test signature updates
main_encryption_admin_test.go
Updates test helpers callBootstrapAgainstNewServer and TestRegisterEncryptionAdminServer_Registers to pass the new capabilityFanout parameter (nil in tests).

Sequence Diagram

sequenceDiagram
  participant startRaftServers
  participant buildEncryptionCapabilityFanout
  participant registerEncryptionAdminServer
  participant buildCapabilityFanoutFn
  participant capabilityRouteSnapshot
  participant admin.CapabilityFanout

  startRaftServers->>buildEncryptionCapabilityFanout: ctx, eg, runtimes, enableMutators
  buildEncryptionCapabilityFanout->>buildEncryptionCapabilityFanout: enableMutators ? buildCapabilityFanoutFn : nil
  buildEncryptionCapabilityFanout-->>startRaftServers: CapabilityFanoutFn or nil

  loop for each raft group
    startRaftServers->>registerEncryptionAdminServer: gs, fullNodeID, sidecarPath, enableMutators, engine, encryptionCapabilityFanout
    registerEncryptionAdminServer->>registerEncryptionAdminServer: if enableMutators && capabilityFanout != nil: wire WithEncryptionAdminCapabilityFanout
  end

  Note over buildCapabilityFanoutFn: When EnableStorageEnvelope RPC invoked:
  buildCapabilityFanoutFn->>capabilityRouteSnapshot: snapshot all runtimes
  capabilityRouteSnapshot->>capabilityRouteSnapshot: routeSnapshotFromSources: read each runtime's Configuration
  capabilityRouteSnapshot-->>buildCapabilityFanoutFn: live membership routes or error
  buildCapabilityFanoutFn->>admin.CapabilityFanout: routes, dialFn (with GRPCConnCache), timeout
  admin.CapabilityFanout-->>buildCapabilityFanoutFn: capability snapshot or error (fail-closed)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • bootjp/elastickv#812: Main PR's capability fan-out wiring is directly required by the EnableStorageEnvelope RPC in this PR, which calls the fan-out to validate cluster readiness before cutover.
  • bootjp/elastickv#793: Main PR's buildCapabilityFanoutFn wraps and routes snapshots into the CapabilityFanout internal helper that was introduced in this PR.
  • bootjp/elastickv#776: Both PRs extend the registerEncryptionAdminServer wiring behind the same enableMutators gate, with #776 introducing the gate and #830 extending it to thread the capabilityFanout closure.

Poem

🐰 A fanout across the warren hops,
Snapshots of each node it tops,
Voters, learners, routes all bright,
Fail-closed tight, enableMutators gait,
Bunny wires the cutover right! 🔒✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: wiring the capability fan-out closure into the EnableStorageEnvelope cutover RPC, corresponding to the 6D-6c-3a stage.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/encryption-6d6c3a-capability-fanout-wiring

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 25, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 25, 2026

Claude finished @bootjp's task in 3m 58s —— View job


Code Review — Stage 6D-6c-3a capability fan-out wiring

  • Read changed files and gather context
  • Review design doc changes
  • Review main_encryption_fanout.go (new file)
  • Review main_encryption_fanout_test.go (new file)
  • Review changes to main.go, main_encryption_admin.go, main_encryption_admin_test.go
  • Check cross-cutting concerns (concurrency, error handling, lifecycle)
  • Post final review

Overall the wiring is clean and the design rationale is solid. One race condition in the new file needs to be fixed before merge; the rest are lower-severity items.


Bug: data race in capabilityRouteSnapshotrt.engine read without lock

Severity: must-fix

capabilityRouteSnapshot (main_encryption_fanout.go:88-93) reads rt.engine directly:

for _, rt := range runtimes {
    if rt == nil || rt.engine == nil {
        continue
    }
    cfg, err := rt.engine.Configuration(ctx)

The comment on raftGroupRuntime in multiraft_runtime.go:19-26 is explicit:

long-lived goroutines that read after startup must go through snapshotEngine()

Close() (multiraft_runtime.go:78-81) sets engine = nil under engineMu.Lock(). The capability fan-out closure is invoked during live EnableStorageEnvelope RPCs — not at startup — so a concurrent Close() creates a data race on the engine field. The fix is to replace the direct field access with snapshotEngine():

for _, rt := range runtimes {
    eng := rt.snapshotEngine()   // nil-safe + acquires engineMu.RLock
    if eng == nil {
        continue
    }
    cfg, err := eng.Configuration(ctx)
    if err != nil {
        return admin.RouteSnapshot{}, errors.Wrapf(err, "capability fan-out: configuration for group %d", rt.spec.id)
    }
    groups = append(groups, routeGroupFromConfiguration(rt.spec.id, cfg))
}

Fix this →


Design doc lifecycle: new doc still marked "proposed" after implementation ships

Severity: should-fix (per CLAUDE.md conventions)

docs/design/2026_05_25_proposed_6d6c3a_capability_fanout_wiring.md has Status | proposed in its header table, but the implementation is landing in this PR. Per CLAUDE.md:

*_proposed_*.md — Design accepted, no implementation yet
*_partial_*.md — Some milestones shipped, proposal not yet complete

Since 6D-6c-3b (the e2e test) is still open, the file should be renamed to 2026_05_25_partial_6d6c3a_capability_fanout_wiring.md and the status updated to partial. The parent doc (2026_05_18_partial_6d_enable_storage_envelope.md) is already correctly updated to list 6D-6c-3a as shipped and 6D-6c-3b as open — just the per-PR doc needs the rename.


Test gaps

Severity: should-fix

The PR description states unit tests cover "Configuration error → closure error," but main_encryption_fanout_test.go has no such test. The fail-closed path (any Configuration() failure causes the cutover to refuse) is a load-bearing invariant worth pinning. A simple mock engine that returns an error from Configuration would cover it via capabilityRouteSnapshot directly.

Similarly, buildEncryptionCapabilityFanout's enableMutators=false → nil branch is untested. A one-liner table-driven test would lock it down.


Minor observations (non-blocking)

WaitForReady(true) in fan-out contextGRPCConnCache.ConnFor sets grpc.WaitForReady(true) as a default call option, so probes to slow/unreachable peers will block until the context deadline. The 5s capabilityFanoutTimeout bounds this correctly, and the CapabilityFanout helper applies it as a context deadline — this is fine as designed. Worth noting that on a large cluster with several unreachable peers, all probes will block in parallel until the shared deadline, not sequentially, which is the correct behavior.

capabilityFanoutTimeout placement — The design doc specifies this constant should live in main.go, but it ended up in main_encryption_fanout.go. Either location is fine; consistency with the design doc is the only consideration.

registerEncryptionAdminServer comment (main_encryption_admin.go:142-147) — The comment explains why the fan-out is gated on enableMutators. This is genuinely non-obvious (a reader might wonder why it isn't always set when capabilityFanout != nil) so it earns its place.

Conn cache error on shutdown — The eg.Go cleanup goroutine returns fanoutConnCache.Close() error to the errgroup. If closing fails (only the first error is returned when multiple conns fail), the error propagates to the errgroup. This is correct — shutdown errors should surface rather than be swallowed.


What's solid

  • The overall wiring is clean: buildEncryptionCapabilityFanoutbuildCapabilityFanoutFncapabilityRouteSnapshot + routeGroupFromConfiguration is a clear separation of concerns.
  • Dedicated conn cache (not the --adminEnabled-gated admin-forward one) is the right call — the design doc §3 explains the reasoning clearly.
  • Fail-closed posture is correctly implemented: snapshot build error short-circuits the entire closure.
  • routeGroupFromConfiguration tests cover the voter/learner split, empty-suffrage-as-voter convention, DeriveNodeID identity, and empty configuration — the critical paths for the mapping logic.
  • The capabilityFanout != nil guard in registerEncryptionAdminServer keeps the option from being set when mutators are disabled, matching the existing Proposer/LeaderView gating pattern.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the capability fan-out closure wiring for the EnableStorageEnvelope cutover RPC, introducing membership snapshotting and gRPC dialing via a dedicated connection cache. Feedback identifies an improvement opportunity to propagate the context within the dialing logic for better timeout management and suggests adding a safety check for runtime specifications to avoid potential nil pointer dereferences.

Comment thread main_encryption_fanout.go
Comment on lines +62 to +68
dial := func(_ context.Context, address string) (pb.EncryptionAdminClient, func(), error) {
conn, err := connCache.ConnFor(address)
if err != nil {
return nil, nil, errors.Wrapf(err, "capability fan-out: dial %s", address)
}
return pb.NewEncryptionAdminClient(conn), func() {}, nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The dial closure ignores the context.Context parameter passed by the fan-out helper. In accordance with project standards, context.Context should be used for managing deadlines and cancellation. The context should be propagated to the connection cache to ensure the fan-out timeout is strictly respected and to avoid goroutine leaks if a dial hangs.

Suggested change
dial := func(_ context.Context, address string) (pb.EncryptionAdminClient, func(), error) {
conn, err := connCache.ConnFor(address)
if err != nil {
return nil, nil, errors.Wrapf(err, "capability fan-out: dial %s", address)
}
return pb.NewEncryptionAdminClient(conn), func() {}, nil
}
dial := func(ctx context.Context, address string) (pb.EncryptionAdminClient, func(), error) {
conn, err := connCache.ConnFor(ctx, address)
if err != nil {
return nil, nil, errors.Wrapf(err, "capability fan-out: dial %s", address)
}
return pb.NewEncryptionAdminClient(conn), func() {}, nil
}
References
  1. When designing interfaces, use context.Context for managing deadlines and cancellation to ensure timeouts are strictly respected.

Comment thread main_encryption_fanout.go Outdated
}
cfg, err := rt.engine.Configuration(ctx)
if err != nil {
return admin.RouteSnapshot{}, errors.Wrapf(err, "capability fan-out: configuration for group %d", rt.spec.id)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Potential nil pointer dereference: rt.spec is accessed here without a prior nil check. While rt and rt.engine are checked at line 89, rt.spec is assumed to be non-nil. Although this might be an internal invariant, adding a check or ensuring rt.spec is always populated during runtime construction would improve robustness.

… gaps, doc lifecycle

Addresses the claude review on PR #830:

- must-fix (data race): capabilityRouteSnapshot read rt.engine
  directly, but the fan-out closure runs during live
  EnableStorageEnvelope RPCs concurrently with a possible Close()
  that clears the engine field under engineMu.Lock(). Now reads via
  rt.snapshotEngine() (engineMu.RLock, nil-safe), matching the
  documented contract for post-startup engine reads. Not a semantic
  change — same value, race-safe access.

- should-fix (test gaps): extracted routeSnapshotFromSources behind a
  narrow configReader interface so the fail-closed path is unit
  testable without a full engine. Added:
  - TestRouteSnapshotFromSources_ConfigurationErrorFailsClosed — a
    group Configuration error aborts the whole snapshot (cutover
    refuses rather than proposing against a partial membership view).
  - TestRouteSnapshotFromSources_MapsAllGroups — multi-group mapping.
  - TestBuildEncryptionCapabilityFanout_DisabledReturnsNil — mutators
    off → nil closure.

- should-fix (doc lifecycle): git mv the per-PR design doc
  proposed -> partial with Status: partial + a lifecycle note (the
  fan-out wiring shipped; 6D-6c-3b e2e remains).

Minor non-blocking observations (WaitForReady bounded by the 5s
timeout, const placement, gating comment, shutdown error surfacing)
were confirmed correct as-is and left unchanged.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 25, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 25, 2026

Claude finished @bootjp's task in 2m 38s —— View job


Code Review — Stage 6D-6c-3a capability fan-out wiring (Round 2)

  • Read changed files and gather context
  • Review main_encryption_fanout.go (new file)
  • Review main_encryption_fanout_test.go (new file)
  • Review changes to main.go, main_encryption_admin.go, main_encryption_admin_test.go
  • Review design doc changes
  • Check cross-cutting concerns (concurrency, error handling, lifecycle)
  • Post final review

Round-1 fixes confirmed

All three must-fix / should-fix items from the first round are addressed in the current code:

  • Race conditioncapabilityRouteSnapshot now reads each engine via rt.snapshotEngine() and stores the result in the intermediate groupConfigSource struct before calling Configuration. The comment at main_encryption_fanout.go:94-98 documents why. Correct.
  • Design doc lifecycledocs/design/2026_05_25_partial_6d6c3a_capability_fanout_wiring.md carries Status | partial, matching the CLAUDE.md convention (3b e2e still open). Correct.
  • Test gapsTestRouteSnapshotFromSources_ConfigurationErrorFailsClosed (fail-closed path) and TestBuildEncryptionCapabilityFanout_DisabledReturnsNil (disabled branch) are both present. Correct.

Response to gemini-code-assist comments

"dial closure ignores context" — Not actionable as written. kv.GRPCConnCache.ConnFor (kv/grpc_conn_cache.go:68) has signature ConnFor(addr string) — there is no context parameter to thread through. grpc.NewClient is non-blocking (lazy-connect), so the context isn't needed at dial time. The fan-out timeout is applied correctly by admin.CapabilityFanout as a child-context deadline before it makes the actual GetCapability RPC call, which is where WaitForReady(true) kicks in and is bounded by that deadline. The _ context.Context in the dial closure is the right signature — the parameter is part of the DialFunc interface contract, not a design flaw.

"potential nil pointer dereference on rt.spec" — Not an issue. spec is groupSpec by value (multiraft_runtime.go:18), not a pointer. It cannot be nil. The concern is invalid.


Minor observation

capabilityFanoutTimeout placement vs design doc §1 — The design doc says this constant should live in main.go, but it ended up in main_encryption_fanout.go. The fanout file is the more cohesive home for it (the constant only affects fan-out behavior), so the deviation from the doc is an improvement, not a regression. Worth updating the doc's §1 to match, but non-blocking.

Test imports mix stdlib and cockroachdb errorsmain_encryption_fanout_test.go:5 imports stdlib "errors" for errors.New("...") and errors.Is(...). The implementation wraps with cockroachdb/errors.Wrapf. stdlib.errors.Is correctly unwraps through cockroachdb-wrapped errors (they implement Unwrap()), so TestRouteSnapshotFromSources_ConfigurationErrorFailsClosed:130 works as intended. Functionally sound.


What's solid

The refactoring since round 1 is clean and well-reasoned:

  • groupConfigSource intermediate struct neatly separates the race-safe engine snapshot from the Configuration() call, without the awkwardness of calling snapshotEngine() inside routeSnapshotFromSources.
  • configReader interface at main_encryption_fanout.go:83-85 lets routeSnapshotFromSources be unit-tested with a stub without pulling in a real engine — and the test file uses exactly this with stubConfigReader.
  • buildEncryptionCapabilityFanout correctly owns the conn cache lifecycle (drains it on ctx.Done() via the errgroup), and correctly returns nil when mutators are disabled — keeping the fan-out inert in the non-mutator path.
  • The capabilityFanout != nil inner guard in registerEncryptionAdminServer:147 is correct belt-and-suspenders (the outer enableMutators gate already ensures nil is passed when mutators are off, but the nil check is cheap and keeps the option harmless if the call sites ever diverge).

Approve to merge. No blocking issues in the current state.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/2026_05_18_partial_6d_enable_storage_envelope.md`:
- Line 114: Replace the incorrect cross-reference link text
"2026_05_25_proposed_6d6c3a_capability_fanout_wiring.md" in
docs/design/2026_05_18_partial_6d_enable_storage_envelope.md with the actual
shipped/partial document filename used in this PR (the matching _partial_
document name), ensuring the link points to the current partial document rather
than the proposed dated filename.

In `@docs/design/2026_05_25_partial_6d6c3a_capability_fanout_wiring.md`:
- Around line 36-38: The documentation currently states "SuffrageVoter → Voters,
else Learners" which contradicts implementation: update the phrasing around
`srv.Address`/`srv.Suffrage` mapping to state that `SuffrageLearner` maps to
Learners and all other suffrage values (including empty) are treated as Voters
(i.e., routed into `admin.RouteGroup{GroupID: rt.spec.id}` as voters); mention
the explicit exceptions `SuffrageLearner` and the default "all others → Voters"
behavior to match the code.

In `@main_encryption_fanout.go`:
- Around line 101-103: The loop over runtimes calls rt.snapshotEngine() without
guarding against nil entries, so add an explicit nil check for the loop variable
rt before invoking snapshotEngine() (e.g., if rt == nil { continue }),
preserving the documented “nil runtime is skipped” behavior; update the loop
surrounding runtimes and the call to snapshotEngine() to skip nil rt values and
only call snapshotEngine() when rt is non-nil.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fe24c67a-6af9-4e8e-8e51-7588839f429f

📥 Commits

Reviewing files that changed from the base of the PR and between ad60154 and 26ecce5.

📒 Files selected for processing (7)
  • docs/design/2026_05_18_partial_6d_enable_storage_envelope.md
  • docs/design/2026_05_25_partial_6d6c3a_capability_fanout_wiring.md
  • main.go
  • main_encryption_admin.go
  • main_encryption_admin_test.go
  • main_encryption_fanout.go
  • main_encryption_fanout_test.go

Comment thread docs/design/2026_05_18_partial_6d_enable_storage_envelope.md Outdated
Comment thread docs/design/2026_05_25_partial_6d6c3a_capability_fanout_wiring.md
Comment thread main_encryption_fanout.go
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cdf5021734

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread main_encryption_fanout.go Outdated
Comment on lines +89 to +92
if rt == nil || rt.engine == nil {
continue
}
cfg, err := rt.engine.Configuration(ctx)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Read engine through snapshot accessor in fan-out path

capabilityRouteSnapshot is a long-lived runtime path (invoked by admin RPCs after startup), but it reads rt.engine directly. raftGroupRuntime documents that post-startup readers must use snapshotEngine() because Close() clears engine under engineMu (multiraft_runtime.go), so this direct read introduces a shutdown-time data race and can produce an inconsistent/partial route snapshot during cutover checks. Please load the engine via snapshotEngine() and operate on that local value instead of touching the field directly.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already addressed in round-1 (commit 26ecce5): capabilityRouteSnapshot reads each engine via rt.snapshotEngine() (engineMu.RLock, nil-safe) and stashes it in groupConfigSource before calling Configuration — the direct rt.engine field read this P1 flagged is gone. This comment was filed against the pre-fix commit (cdf5021). The current head (664d8f9) has the locked accessor at main_encryption_fanout.go:105, and the round-2 review independently confirmed the race fix.

…rections

claude round-2 approved to merge; this folds in the late-arriving
coderabbit findings and confirms the codex P1.

- coderabbit Minor (nil runtime guard): add an explicit
  `if rt == nil { continue }` before rt.snapshotEngine() in
  capabilityRouteSnapshot. snapshotEngine is already nil-receiver-safe
  (returns nil → skipped), so this is defensive clarity, not a
  behavior change.
- coderabbit Minor (doc): correct the 6D-6c-3a suffrage mapping text
  to match the implementation — SuffrageLearner → Learners, else →
  Voters (empty/unannotated suffrage counts as voter).
- coderabbit Minor (doc): fix the parent 6D doc's reference link to
  point at the renamed *_partial_* 6D-6c-3a doc.

codex P1 ("read engine through snapshot accessor") was already
resolved in round-1 (commit 26ecce5): capabilityRouteSnapshot reads
via rt.snapshotEngine() at main_encryption_fanout.go:105 — the P1 was
filed against the pre-fix commit. claude round-2 independently
confirmed the race fix is correct.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 25, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 25, 2026

Claude finished @bootjp's task in 2m 27s —— View job


Code Review — Stage 6D-6c-3a capability fan-out wiring (Round 3)

  • Gather context from previous review rounds
  • Read current state of changed files
  • Verify CodeRabbit/gemini fixes in commit 664d8f9
  • Review design docs
  • Review implementation files
  • Post final review

Round-1 and Round-2 fixes confirmed

All findings from the previous two rounds are correctly resolved in the current head (664d8f9):

  • Race condition (capabilityRouteSnapshot reading rt.engine directly) — fixed. The function now uses the groupConfigSource intermediate struct: snapshotEngine() is called under engineMu.RLock in the outer loop, the result is stored in the struct, and only then is Configuration() called on the stored reference. The nil-rt guard (if rt == nil { continue }) is also present before snapshotEngine() is called (main_encryption_fanout.go:102).
  • Design doc lifecycle2026_05_25_partial_6d6c3a_capability_fanout_wiring.md correctly carries Status | partial. The parent doc's cross-reference link also correctly points to the _partial_ filename.
  • Test gapsTestRouteSnapshotFromSources_ConfigurationErrorFailsClosed (fail-closed path) and TestBuildEncryptionCapabilityFanout_DisabledReturnsNil (disabled branch) are both present and correct.

CodeRabbit Round-2 comments confirmed addressed

  • Nil-rt guard (main_encryption_fanout.go:102) — present.
  • Doc link fix (parent doc line 114) — correct _partial_ filename.
  • Suffrage mapping text — design doc §1 now correctly reads SuffrageLearner → Learners, else → Voters, matching the implementation (main_encryption_fanout.go:143-147).

Gemini concerns (both correctly dismissed, confirmed)

  • Dial closure ignoring contextGRPCConnCache.ConnFor signature is ConnFor(addr string) (no context parameter). grpc.NewClient is non-blocking; timeout bounding is applied by admin.CapabilityFanout via context.WithTimeout(ctx, timeout) before issuing the actual GetCapability RPC. The _ context.Context in the dial closure correctly satisfies the DialFunc interface contract. No change needed.
  • rt.spec nil pointerspec is groupSpec by value (multiraft_runtime.go:18), not a pointer. Cannot be nil. Dismissed correctly.

One minor doc inconsistency (non-blocking)

Design doc §1 states: "Fan-out timeout: a const in main.go", but capabilityFanoutTimeout lives in main_encryption_fanout.go. As noted in round 2, this is actually the more cohesive location. The doc §1 sentence is slightly stale. Worth a one-line fix before or at 6D-6c-3b, but not a blocker for this PR.


Subtle correctness check: TestBuildCapabilityFanoutFn_NonNil with nil connCache

fn := buildCapabilityFanoutFn(nil, nil, capabilityFanoutTimeout)

connCache is nil here. If the returned closure were invoked with runtimes=nil, capabilityRouteSnapshot returns an empty RouteSnapshot{}. admin.CapabilityFanout with an empty snapshot returns CapabilityFanoutResult{OK: false}, nil without ever calling dial (the dedup set is empty — confirmed by reading capability_fanout.go:152-153). So the nil connCache is never dereferenced. The test is correct and safe; this is a closure-non-nil smoke test, with full invocation covered by 6D-6c-3b.


Main.go call site uses rt.engine directly — correct

startRaftServers passes rt.engine directly to registerEncryptionAdminServer. This runs at startup, single-threaded before goroutines launch — the multiraft_runtime.go comment explicitly exempts startup-path reads from the snapshotEngine() requirement. The registration function only stores the engine ref in option structs; it doesn't race with Close(). Correct as-is.


What's solid

The implementation is clean, well-scoped, and correct:

  • buildEncryptionCapabilityFanoutbuildCapabilityFanoutFncapabilityRouteSnapshot + routeGroupFromConfiguration is a clear, testable separation of concerns.
  • The dedicated kv.GRPCConnCache (separate from the --adminEnabled-gated admin-forward cache) is the right call, with the §3 rationale clearly documented.
  • Fail-closed posture is maintained end-to-end: snapshot error → closure error → cutover refuses; unreachable/not-capable → handled inside CapabilityFanoutOK=false → cutover refuses.
  • configReader interface makes routeSnapshotFromSources independently unit-testable with stubConfigReader without standing up a real engine.
  • All five self-review lenses pass: read-only wiring, concurrency-safe (snapshotEngine + timeout-bounded fan-out + conn cache mutex), off the data path, cutover semantics unchanged, and test coverage covers the critical mapping + fail-closed paths.

Approve to merge. The one outstanding doc sentence (§1 constant location) is cosmetic and can be cleaned up in 6D-6c-3b alongside the e2e test.

@bootjp bootjp merged commit f627669 into main May 25, 2026
11 checks passed
@bootjp bootjp deleted the feat/encryption-6d6c3a-capability-fanout-wiring branch May 25, 2026 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant