feat(encryption): Stage 6D-2 — ErrNodeIDCollision + ErrLocalEpochRollback primitives + probe-node-id CLI#788
feat(encryption): Stage 6D-2 — ErrNodeIDCollision + ErrLocalEpochRollback primitives + probe-node-id CLI#788bootjp wants to merge 6 commits into
Conversation
…back primitives + probe-node-id CLI Stage 6D-2 per the 6D design doc 5 (PR #786, merged). Ships the two bundled 6C-3 startup-guard primitives as pure functions in the encryption package, plus the probe-node-id CLI helper from 5.1 mitigation #2. Operator-inert until 6D-6 wires the primitives into main.go's startup-guard phase. ## What this ships ### Two new sentinel errors - ErrNodeIDCollision (errors.go): two distinct full_node_id values in the local route-catalog hash to the same 16-bit node_id, which would reuse GCM nonces under the active DEK. Triage message points at the probe-node-id CLI for pre-join verification. - ErrLocalEpochRollback (errors.go): sidecar local_epoch for the active storage DEK is <= the local writer-registry last_seen_local_epoch (strict-ahead invariant from 5.2 of the 6D design doc). Triage message points at resync-sidecar or DEK rotation. ### Two pure-function guard primitives - CheckNodeIDCollision(fullNodeIDs []uint64) error (node_id_collision.go): narrows each full_node_id via uint16(full_node_id & 0xFFFF) - same shipped mask as applier.go's nodeIDMask - and returns ErrNodeIDCollision if any two DISTINCT values share a node_id. Order- independent. Same-value-twice is NOT a collision (duplicate-membership-entry quirk; the route catalog watcher should dedupe upstream but the primitive defends correctness here). Local-state-only; no RPC, no engine access. - CheckLocalEpochRollback(registry, fullNodeID, dekID, sidecarLocalEpoch) error (local_epoch_rollback.go): reads the local Pebble writer-registry row at RegistryKey(dekID, uint16(fullNodeID & 0xFFFF)). Returns nil if no row exists (freshly-joined learner; 4.1 case-1 first-seen creates the row on first encrypted write). Returns ErrLocalEpochRollback when sidecar <= registry (the strict-ahead invariant - equality is NOT safe because the same (node_id, local_epoch) prefix would be replayed). Pebble I/O errors propagate as wrapped errors NOT marked with ErrLocalEpochRollback so the operator triages transport failure separately from a real rollback. ### CLI: probe-node-id - elastickv-admin encryption probe-node-id --full-node-id=<u64> (cmd/elastickv-admin/encryption.go): derives the 16-bit node_id from a candidate full_node_id using the same shipped narrowing convention. Operator runs it against candidates BEFORE joining a new node to verify the value is free in the cluster's allocated node_id space - the recommended mitigation per the 6D design doc 5.1 mitigation menu. Accepts decimal or 0x-prefixed hex input; prints both the 16-hex-digit canonical full_node_id and the derived node_id. ## Test coverage 11 new tests pin the load-bearing properties: node_id_collision_test.go (8 tests): - Empty - skip - SingleNode - skip - NoCollision - nil - Collision - ErrNodeIDCollision - DuplicateNotCollision - same full_node_id twice is NOT a collision - OrderIndependent - either ordering of the colliding pair catches the hit - ErrorIncludesBothIDs - operator-triage shape names both colliding values and the shared node_id local_epoch_rollback_test.go (6 tests): - NoRow - freshly-joined-learner skip - SidecarStrictlyAhead - nil (healthy case) - SidecarLessThan - ErrLocalEpochRollback (classic rollback) - SidecarEqual - ErrLocalEpochRollback (strict-ahead invariant; replay scenario) - PebbleError - underlying err propagated, NOT classified as ErrLocalEpochRollback - NarrowingMatchesShippedConvention - looks up registry row at uint16-narrowed key matching applier.go:230 encryption_test.go (5 new tests for probe-node-id): - Hex - 0x-prefixed input - Decimal - decimal input - RequiresFlag - missing --full-node-id is an error - RejectsBadInput - non-numeric/non-hex input is an error - EncryptionMain_ProbeNodeIDSubcommand - dispatch reaches runEncryptionProbeNodeID (not the unknown-subcommand default branch) ## Caller audit No semantic changes to existing functions; only additions. - New sentinels in errors.go - exported package surface additions, no caller audit needed (no existing callers to update). - New pure functions in node_id_collision.go and local_epoch_rollback.go - no callers yet (operator-inert until 6D-6 wires them). - New CLI subcommand probe-node-id - additive; no existing subcommand changed. ## Five-lens self-review 1. Data loss - net-positive. Both guards catch potential nonce-reuse scenarios BEFORE the first encrypted write that would corrupt confidentiality. 2. Concurrency / distributed failures - primitives are pure functions over local snapshots; no shared state. Run in the startup-before-serving phase before any goroutine fan-out. 3. Performance - O(N) hash map walk for CheckNodeIDCollision (N = cluster member count). Single Pebble Get for CheckLocalEpochRollback. Both paid once per process start on the happy path. 4. Data consistency - matches the shipped 16-bit narrowing convention from applier.go (nodeIDMask = 0xFFFF) and the writer-registry keying. The narrowing-matches test pins this against a future change that would diverge. 5. Test coverage - 11 new unit tests cover every documented branch + the operationally important edge cases (equality for strict-ahead, narrowing equivalence, error propagation). ## Verification - go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/ - PASS - golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ - 0 issues - go build ./... - PASS ## Plan Stage 6D-2 ships the primitives + CLI as the bundled 6C-3 piece per the design doc 7 decomposition. Next sub-PRs per 7: - 6D-3: Voters union Learners CapabilityFanout helper - 6D-4: Wire format + FSM apply path - 6D-5: 6.2 storage-layer toggle - 6D-6: EnableStorageEnvelope admin RPC + CLI + integration test + main.go wiring of THIS PR's primitives The two primitives in this PR are operator-inert until 6D-6 wires them into main.go's startup-guard phase. The probe-node-id CLI is reachable today but only computes locally - no operator action required against the cluster.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughAdds startup guards and diagnostics for storage-envelope cutover: CheckNodeIDCollision (detects 16-bit node-ID collisions), CheckLocalEpochRollback (enforces local-epoch monotonicity via the writer-registry), exported sentinel errors, and an admin CLI probe-node-id subcommand with tests. ChangesStorage-envelope startup validation
Sequence DiagramsequenceDiagram
participant Startup as Startup
participant CollisionCheck as CheckNodeIDCollision
participant EpochCheck as CheckLocalEpochRollback
participant Registry as WriterRegistry
participant CLI as elastickv-admin(probe-node-id)
Startup->>CollisionCheck: validate full_node_id list
CollisionCheck->>CollisionCheck: narrow each full_node_id to uint16
CollisionCheck-->>Startup: nil or ErrNodeIDCollision
Startup->>EpochCheck: validate sidecarLocalEpoch for a node
EpochCheck->>Registry: GetRegistryRow(DEK, uint16(full&0xFFFF))
Registry-->>EpochCheck: row / missing / error
EpochCheck-->>Startup: nil or ErrLocalEpochRollback
CLI->>CLI: parse --full-node-id (hex or decimal) via parseUint64WithRadix
CLI->>CLI: narrow to uint16 and format output
CLI-->>Operator: printed full_node_id (hex) and node_id (uint16)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@claude review Stage 6D-2 ships the two bundled 6C-3 startup-guard primitives per §5 of the 6D design doc (#786, merged):
Operator-inert until 6D-6 wires the primitives into main.go's startup-guard phase. Please validate:
Verification:
|
|
Claude finished @bootjp's task in 4m 1s —— View job Stage 6D-2 Code ReviewTasks:
Review1. Narrowing Consistency ✅All three sites correctly use
All three match Minor comment accuracy note: 2. Strict-Ahead Invariant ✅
if sidecarLocalEpoch <= registryRow.LastSeenLocalEpoch {Correctly fires on The Pebble I/O error path is also correctly handled — wrapping without marking as 3. Caller Audit ✅
4. Test Coverage ✅ (with minor count discrepancy)18 test functions total (PR description says 19 — minor miscounting):
All documented branches are covered. 5. Operator-Inert Contract ✅No changes to Issues Found🔴
|
There was a problem hiding this comment.
Code Review
This pull request introduces critical safety guards to prevent AES-GCM nonce reuse within the encryption system. Key additions include the probe-node-id administrative subcommand for verifying node ID mappings, and automated startup checks to detect node_id collisions and local_epoch rollbacks. Feedback was provided regarding a formatting discrepancy in the CLI output where the hex padding width incorrectly accounted for the '0x' prefix, which could lead to inconsistent canonical representations.
| if _, err := fmt.Fprintf(out, "full_node_id: %#016x (%d)\nnode_id: %#04x (%d)\n", | ||
| full, full, narrowed, narrowed); err != nil { |
There was a problem hiding this comment.
The format string %#016x for full_node_id and %#04x for node_id does not align with the "canonical 16-hex-digit" and "4-hex-digit" forms described in the comments and expected by the tests. In Go, the width in %#0Nx includes the 0x prefix, so %#016x only provides 14 hex digits of padding. For example, TestRunEncryptionProbeNodeID_Decimal expects 0x0000000012345678 (18 characters), but the current code would produce 0x00000012345678 (16 characters). To consistently output the full 16 and 4 hex digits respectively, use 0x%016x and 0x%04x.
| if _, err := fmt.Fprintf(out, "full_node_id: %#016x (%d)\nnode_id: %#04x (%d)\n", | |
| full, full, narrowed, narrowed); err != nil { | |
| if _, err := fmt.Fprintf(out, "full_node_id: 0x%016x (%d)\nnode_id: 0x%04x (%d)\n", | |
| full, full, narrowed, narrowed); err != nil { |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a246a56259
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if err := fs.Parse(args); err != nil { | ||
| return errors.Wrap(err, "parse probe-node-id flags") |
There was a problem hiding this comment.
Treat probe-node-id help as non-error exit
runEncryptionProbeNodeID wraps every fs.Parse error, so elastickv-admin encryption probe-node-id -h/--help returns an error and exits non-zero instead of behaving like other encryption subcommands that treat flag.ErrHelp as a successful usage request. This breaks shell/help workflows that rely on help returning success.
Useful? React with 👍 / 👎.
| if _, err := fmt.Sscanf(s[2:], "%x", &v); err != nil { | ||
| return 0, errors.Wrap(err, "hex parse") |
There was a problem hiding this comment.
Reject malformed full-node-id strings completely
Using fmt.Sscanf here accepts partial parses (e.g. 0xdeadbeefZZ or 123abc) and silently ignores trailing garbage, so mistyped --full-node-id inputs can be treated as valid and produce the wrong derived node_id. For a safety probe command, accepting truncated IDs can mislead operators into validating the wrong value before join.
Useful? React with 👍 / 👎.
…dium (go-doc attribution) Two findings from claude r1 review on PR #788: ## claude high - parseUint64WithRadix silently accepts partial input cmd/elastickv-admin/encryption.go's parseUint64WithRadix used fmt.Sscanf for both hex and decimal parsing. Sscanf returns success once it matches the format, even if the input has trailing unparsed characters: "0x1234ZZZZ" -> parses as 0x1234 without error "1234abc" -> parses as 1234 without error This was unsafe for an operator-visible identifier where every digit matters - a typo'd full_node_id with a non-hex tail would silently return a different value than the operator meant, leading the operator to join a node under the wrong node_id. Fix: switched to strconv.ParseUint, which requires the ENTIRE input to be valid for the chosen radix. Updated import set and added TestRunEncryptionProbeNodeID_RejectsPartialInput with three regression cases (hex prefix + non-hex tail, decimal prefix + letter tail, hex with embedded space). ## claude medium - minMembersForCollision placement broke go doc internal/encryption/node_id_collision.go had the minMembersForCollision constant declared IMMEDIATELY before CheckNodeIDCollision (no blank line). In Go, a doc comment preceding a declaration is attributed to that declaration, so the 38-line doc comment block was being attributed to the CONSTANT by go doc, not to the function. go doc encryption.CheckNodeIDCollision would have produced no documentation. Fix: hoisted the constant declaration ABOVE the function's doc comment, with its own short comment. Both declarations now have correct go-doc attribution. ## Gemini medium NOT addressed (verified false) Gemini r1 flagged %#016x as producing 14 hex digits (claiming the 0x prefix counted toward the width). Verified directly: Go emits 16 hex digits after the 0x prefix for %#016x (output 0x0000000012345678 = 18 chars total). My tests already assert this exact form and pass. Gemini's claim is incorrect per Go's actual format behavior; no change needed. ## Caller audit (per cron directive) parseUint64WithRadix is package-private; the only caller is runEncryptionProbeNodeID (one call site). The semantic change is "strict parse vs. silent prefix" - the strict parse is the safer posture and no existing test relied on the silent-prefix behavior (the existing RejectsBadInput test used "notanumber" which fails both implementations). minMembersForCollision is a package-private constant only used within node_id_collision.go. Hoisting the declaration is structural; no semantic change. ## Verification - go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/ - PASS - golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ - 0 issues - go build ./... - PASS No semantic changes to existing functions outside the parseUint64WithRadix bug-fix.
|
@claude review Round-1 (commit claude high —
|
|
Claude encountered an error after 0s —— View job I'll analyze this and get back to you. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9eecdb559c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // §5.2 skip-condition, this is a legitimate lifecycle | ||
| // state — the first encrypted write will create the row | ||
| // via §4.1 case-1 first-seen. | ||
| return nil |
There was a problem hiding this comment.
Preserve missing-row signal in CheckLocalEpochRollback
Returning nil when the writer-registry row is absent collapses two different states—"safe strict-ahead" and "no rollback anchor exists"—into the same success result. That makes it impossible for startup wiring to enforce the post-cutover requirement (when StorageEnvelopeActive is true) to refuse a missing (full_node_id, dek_id) row, and can let a node start without a nonce-reuse guardrail. The helper should surface this case (e.g., distinct error/signal) instead of unconditionally succeeding.
Useful? React with 👍 / 👎.
…l split per 5.2)
Codex r1 P2 on PR788: CheckLocalEpochRollback collapsed two
distinct states - "safe strict-ahead" and "no rollback anchor
exists" - into the same nil return. The 6D design doc 5.2
explicitly splits these on storage_envelope_active:
- PRE-cutover (storage_envelope_active == false): missing row
is the freshly-joined-learner case (allow startup; 4.1
case-1 first-seen creates the row on first encrypted write)
- POST-cutover (storage_envelope_active == true): missing row
means encrypted writes are happening cluster-wide with no
rollback anchor on this node (refuse startup; the guard's
raison d'etre)
The previous implementation collapsed both to nil, meaning a
6D-6 implementor wiring the guard would have no way to enforce
the post-cutover requirement without reading the registry
again. Codex's flag is valid.
## Fix: storageEnvelopeActive bool parameter
Added a `storageEnvelopeActive` bool to CheckLocalEpochRollback.
The missing-row branch now splits internally:
- false + no row -> nil (pre-cutover freshly-joined-learner)
- true + no row -> ErrLocalEpochRollback wrapped with
"storage_envelope_active=true but no
registry row" diagnostic
This matches 5.2 exactly. The caller (main.go wiring in
6D-6 per the design doc 7) passes sidecar.StorageEnvelopeActive
as the bool.
## Caller audit (per cron directive - SEMANTIC CHANGE)
CheckLocalEpochRollback is currently called only from tests
(operator-inert per 6D-2 scope). grep verified:
- internal/encryption/local_epoch_rollback_test.go (6 callers)
- No production callers (the main.go wiring is deferred to
6D-6 per the design doc 7)
All six test callers updated to pass `false` (pre-cutover) as
the new bool, preserving the existing assertion semantics.
Added one new test for the post-cutover branch:
TestCheckLocalEpochRollback_NoRow_PostCutover - verifies
missing-row + active=true returns ErrLocalEpochRollback with
the "storage_envelope_active=true" context message.
Renamed the existing TestCheckLocalEpochRollback_NoRow to
TestCheckLocalEpochRollback_NoRow_PreCutover to make the split
explicit at the test level.
## Gemini r1 format-width finding - confirmed false, no change
Format width %#016x produces 16 hex digits after the 0x prefix
(output 0x0000000012345678 = 18 chars total). Verified via direct
Go execution. No change needed.
## Verification
- go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/ - PASS
- golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ - 0 issues
- go build ./... - PASS
No semantic changes outside CheckLocalEpochRollback's signature.
|
@claude review Round-2 (commit codex P2 — preserve missing-row signal in CheckLocalEpochRollbackThe previous implementation collapsed two distinct states into nil:
The 6D design doc §5.2 explicitly splits these on FixAdded
Matches §5.2 exactly. The 6D-6 wiring will pass Caller audit (SEMANTIC CHANGE per cron directive)
New test: Gemini r1 format-width finding — verified false, no change
Verification
|
|
Claude finished @bootjp's task in 3m 37s —— View job Stage 6D-2 Code Review — Round 2Tasks:
Review1.
|
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
cmd/elastickv-admin/encryption.go (1)
95-97: ⚡ Quick winRemove the avoidable
//nolint:gosecsuppression.Line 96 can avoid the cast entirely while preserving behavior, which keeps lint policy clean.
♻️ Proposed refactor
- const nodeIDMask = 0xFFFF - narrowed := uint16(full & nodeIDMask) //nolint:gosec // masked to 16 bits; matches applier.go convention + const nodeIDMask uint64 = 0xFFFF + narrowed := full & nodeIDMaskAs per coding guidelines, "
**/*.go: Usegofmtformatting and pass linters configured in.golangci.yaml... Avoid//nolintannotations — refactor instead."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/elastickv-admin/encryption.go` around lines 95 - 97, The code suppresses gosec for a uint16 cast on narrowing full; instead make the mask have the same unsigned integer type as full (e.g., change nodeIDMask to uint64(0xFFFF) or the concrete type of full), remove the explicit uint16 cast and the //nolint:gosec comment, and use narrowed := full & nodeIDMask so the mask preserves type and behavior when passed to fmt.Fprintf (see nodeIDMask, narrowed and the fmt.Fprintf call).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/elastickv-admin/encryption.go`:
- Around line 85-87: The parse block in probe-node-id treats all parse errors as
failures; change the logic around fs.Parse(args) so that if it returns
flag.ErrHelp you return nil (success/normal help exit) instead of wrapping it;
for any other error keep returning errors.Wrap(err, "parse probe-node-id
flags"). Locate the fs.Parse call inside the probe-node-id handling code and add
an explicit check for errors.Is(err, flag.ErrHelp) or err == flag.ErrHelp before
wrapping to implement this behavior.
In `@internal/encryption/errors.go`:
- Around line 260-264: Update the stale comment for ErrLocalEpochRollback to
reflect the new pre-cutover vs post-cutover behavior: explain that missing
registry-row is still a skip condition before the cutover (when
storageEnvelopeActive=false or Active.Storage==0/bootstrap not committed) but
after cutover (storageEnvelopeActive=true) the guard now refuses startup if
there is no (full_node_id, active_storage_dek_id) registry row; mention the
related concepts so readers can find the logic (ErrLocalEpochRollback,
storageEnvelopeActive, Active.Storage, RegisterEncryptionWriter) and clarify the
different handling paths for pre-cutover vs post-cutover.
---
Nitpick comments:
In `@cmd/elastickv-admin/encryption.go`:
- Around line 95-97: The code suppresses gosec for a uint16 cast on narrowing
full; instead make the mask have the same unsigned integer type as full (e.g.,
change nodeIDMask to uint64(0xFFFF) or the concrete type of full), remove the
explicit uint16 cast and the //nolint:gosec comment, and use narrowed := full &
nodeIDMask so the mask preserves type and behavior when passed to fmt.Fprintf
(see nodeIDMask, narrowed and the fmt.Fprintf call).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b13ddfd9-60ac-44ae-9744-12c0a245b80c
📒 Files selected for processing (7)
cmd/elastickv-admin/encryption.gocmd/elastickv-admin/encryption_test.gointernal/encryption/errors.gointernal/encryption/local_epoch_rollback.gointernal/encryption/local_epoch_rollback_test.gointernal/encryption/node_id_collision.gointernal/encryption/node_id_collision_test.go
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eab8c49dd4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if sidecarLocalEpoch <= registryRow.LastSeenLocalEpoch { | ||
| return pkgerrors.Wrapf(ErrLocalEpochRollback, |
There was a problem hiding this comment.
Validate registry row belongs to requested full_node_id
CheckLocalEpochRollback looks up by uint16(fullNodeID&nodeIDMask) and then compares epochs without verifying that registryRow.FullNodeID matches the fullNodeID argument. If a different writer has (or previously had) the same 16-bit node ID, this function can treat that foreign row as this node’s rollback anchor and even return nil when sidecarLocalEpoch is greater than the foreign LastSeenLocalEpoch, bypassing the intended missing-row/post-cutover refusal for the real node. This leaves startup-guard behavior inconsistent and can defer failure to apply-time collision handling instead of failing early.
Useful? React with 👍 / 👎.
…lp handling) Two CodeRabbit minor findings from round-2 review. ## CodeRabbit minor #1 (errors.go:260-264) - stale skip-condition doc The ErrLocalEpochRollback skip-conditions comment in errors.go still listed "no registry row" as an unconditional skip. After the round-2 storageEnvelopeActive split, the missing-row case is no longer unconditional: pre-cutover it's a skip (returns nil), post-cutover it's the refusal that the guard exists to catch. Fix: rewrote the skip-conditions paragraph to enumerate just the two unconditional skips (encryption disabled, no active storage DEK) and cross-reference local_epoch_rollback.go for the §5.2 missing-row split. The comprehensive doc on CheckLocalEpochRollback is the authoritative source; errors.go now points there. ## CodeRabbit minor #2 (encryption.go:85-87) - ErrHelp handling probe-node-id wrapped flag.ErrHelp as an error, breaking the exit-zero-on-help convention used by every other encryption subcommand: - runEncryptionStatus (encryption.go:169-172) - runEncryptionRotateDEK (encryption_mutators.go:74) - runEncryptionRegisterWriter (encryption_mutators.go:119) - runEncryptionBootstrap (encryption_mutators.go:261) Operators running `elastickv-admin encryption probe-node-id -h` would get exit code 2 (the flag package default for help) which breaks shell scripts that test $?. Fix: added the standard `if errors.Is(err, flag.ErrHelp) { return nil }` short-circuit, matching the existing pattern. Added TestRunEncryptionProbeNodeID_HelpFlagExitsZero to pin the behavior. ## Verification - go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/ - PASS (24 tests total in these files) - golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ - 0 issues No semantic changes to function signatures or behavior outside of the help-flag exit code.
|
@claude review Round-3 (commit CodeRabbit minor —
|
|
Claude finished @bootjp's task in 3m 13s —— View job Stage 6D-2 Code Review — Round 3Tasks:
Review1.
|
…ullNodeID match)
Codex r3 P2 on PR788: CheckLocalEpochRollback looked up the
registry row by uint16(fullNodeID&nodeIDMask) and compared
epochs without verifying that the decoded RegistryValue's
FullNodeID matched the caller's argument. If a different writer
previously occupied the same 16-bit slot (historical
occupancy that the route-catalog watcher would not surface to
ErrNodeIDCollision), this function would treat that foreign
row as THIS node's rollback anchor. When sidecarLocalEpoch
happens to exceed the foreign LastSeenLocalEpoch, the function
would return nil and the missing-row-post-cutover refusal
would be bypassed.
## Fix: defense-in-depth FullNodeID equality check
After decoding the registry row, compare registryRow.FullNodeID
against the caller's fullNodeID. On mismatch, route through the
existing 5.2 split via a new shared helper:
- missingRegistryRowResult(storageEnvelopeActive, ...) -
pre-cutover allow startup, post-cutover refuse with
ErrLocalEpochRollback.
The mismatch is treated as "no row for THIS node" because the
node's rollback anchor genuinely doesn't exist - the foreign
writer's anchor doesn't apply to THIS node's monotonicity
contract.
This is layered defense: ErrNodeIDCollision is the primary
guard for active collisions, but a since-removed member can
leave its row behind. The FullNodeID check handles that case
correctly without depending on ErrNodeIDCollision's
route-catalog snapshot.
## Test coverage
Two new tests pin the foreign-row scenarios:
- TestCheckLocalEpochRollback_ForeignRowPreCutover - foreign
row + pre-cutover -> allow startup (next OpRegistration
will overwrite via 4.1 case 1).
- TestCheckLocalEpochRollback_ForeignRowPostCutover - foreign
row + post-cutover -> refuse with ErrLocalEpochRollback
AND uses the missing-row context message
("storage_envelope_active=true"). Includes the
"sidecar > foreign.LastSeen" sub-case which would have
masked the bug without the FullNodeID check.
The existing TestCheckLocalEpochRollback_NarrowingMatchesShippedConvention
was confusingly using the foreign-row scenario as its fixture.
Rewrote it to use the SAME full_node_id for seed and lookup
(still confirms the high-bit-aware uint16 narrowing), and
moved the cross-full_node_id case to the dedicated foreign-row
tests where it belongs.
## Caller audit (per cron directive)
CheckLocalEpochRollback's exported signature is unchanged - the
semantic change is internal (a new branch that routes through
the existing 5.2 split). No production callers (operator-inert
per 6D-2). All 6 existing test callers continue to work
unchanged.
## Verification
- go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/ - PASS (29 tests total in these files)
- golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ - 0 issues
|
@claude review Round-4 (commit codex P2 — registry row FullNodeID mismatchDefense-in-depth: after decoding the registry row, verify
Without this check, a sidecar that happens to be ahead of the foreign writer's TestsTwo new pinning tests:
The existing narrowing test was inadvertently exercising the foreign-row case as its fixture. Rewrote it to use the same Caller auditSignature unchanged; internal branch added. No production callers. All 6 existing test callers still work. Verification
|
|
Claude finished @bootjp's task in 1m 42s —— View job Stage 6D-2 Code Review — Round 4Tasks:
Review1.
|
| Item | Status |
|---|---|
strconv.ParseUint silent-prefix fix (r1) |
✅ Fixed in r1 |
go doc attribution for CheckNodeIDCollision (r1) |
✅ Fixed in r1 |
storageEnvelopeActive §5.2 split (r2) |
✅ Fixed in r2 |
errors.go stale doc (r3) |
✅ Fixed in r3 |
probe-node-id -h exit code (r3) |
✅ Fixed in r3 |
registryRow.FullNodeID mismatch (r3) |
✅ Fixed in r4 (this round) |
//nolint:gosec in encryption.go:106 (r3 nitpick) |
⚪ Not yet addressed |
One open nitpick: //nolint:gosec in encryption.go:106
CodeRabbit raised this in round-3 (and I noted it) — CLAUDE.md requires "avoid //nolint — refactor instead". The fix remains the same 2-line change:
// Before
const nodeIDMask = 0xFFFF
narrowed := uint16(full & nodeIDMask) //nolint:gosec
// After — mask typed uint64; no cast needed; fmt.Fprintf accepts any integer
const nodeIDMask uint64 = 0xFFFF
narrowed := full & nodeIDMaskThe local_epoch_rollback.go:98 and node_id_collision.go:59 nolints are harder — their callees require uint16 — so those are reasonably deferred to the narrowNodeID(uint64) uint16 helper cleanup in 6D-3+. But encryption.go:106 has no uint16-typed callee (narrowed feeds only fmt.Fprintf) and is the clean 2-liner. Either fix it here or explicitly accept a follow-up cleanup PR for all three at once.
Summary
Round-4 is correct. The FullNodeID mismatch guard is sound: missingRegistryRowResult cleanly centralises the §5.2 split, the mismatch check fires before the epoch comparison, and the sub-case that would have masked the anchor absence (sidecar > foreign.LastSeen) is explicitly pinned by the test. The narrowing test rewrite correctly separates "key derivation correctness" from "foreign-row handling". All five prior-round findings are now resolved. The one remaining item is the encryption.go:106 nolint — either close it with a 2-line fix or batch it with the narrowNodeID cleanup deferred to 6D-3+.
…ude r4 deferred item) Use the uint64 mask result directly for %#04x / %d formatting; the uint16 cast was only there for display and pulled in a gosec G115 nolint that did not exist anywhere else in the operator-tool path. The single-source narrow-and-cast convention (applier.go:230, local_epoch_rollback.go:98, node_id_collision.go:59) is unaffected - those sites genuinely need uint16 to build the 6-byte registry key. Display-only sites do not, and now do not pay the nolint cost.
|
@claude review Round-5 (commit Change// Before
const nodeIDMask = 0xFFFF
narrowed := uint16(full & nodeIDMask) //nolint:gosec // masked to 16 bits; matches applier.go convention
fmt.Fprintf(out, "...node_id: %#04x (%d)\n", full, full, narrowed, narrowed)
// After
const nodeIDMask = 0xFFFF
// keep masked as uint64: the value is only formatted for display, ...
masked := full & nodeIDMask
fmt.Fprintf(out, "...node_id: %#04x (%d)\n", full, full, masked, masked)
Why this site and not the othersThe 3 production sites still keep the convention because they need the actual
Verification
|
|
Claude finished @bootjp's task in 2m 8s —— View job Stage 6D-2 Code Review — Round 5Tasks:
Review1.
|
| Site | Reason |
|---|---|
local_epoch_rollback.go:98 |
RegistryKey(dekID, uint16) signature |
node_id_collision.go:59 |
map[uint16]uint64 key type |
applier.go:230 |
same registry-key derivation |
For these sites, a narrowNodeID(uint64) uint16 helper would consolidate the nolint to one location — but that's a deferred follow-up in 6D-3+, as the prior round established.
5. Prior-round item status
| Round | Item | Status |
|---|---|---|
| R1 | strconv.ParseUint silent-prefix |
✅ |
| R1 | go doc attribution for CheckNodeIDCollision |
✅ |
| R2 | storageEnvelopeActive §5.2 missing-row split |
✅ |
| R3 | errors.go stale skip-condition doc |
✅ |
| R3 | probe-node-id -h exits 0 |
✅ |
| R4 | registryRow.FullNodeID mismatch guard |
✅ |
| R4 | encryption.go:106 //nolint:gosec |
✅ this round |
All rounds resolved. The two remaining //nolint:gosec annotations in local_epoch_rollback.go:98 and node_id_collision.go:59 are correctly deferred to the narrowNodeID helper cleanup in 6D-3+.
Summary
Round-5 is correct. The //nolint:gosec is eliminated by keeping masked as uint64 — the only display-only site — with no change to output bytes. All seven prior-round findings are resolved. No new issues. This PR is ready to merge.
Summary
Stage 6D-2 per the 6D design doc §5 (landed in PR #786). Ships the two bundled 6C-3 startup-guard primitives as pure functions in the encryption package, plus the
probe-node-idCLI helper from §5.1 mitigation #2.Operator-inert until 6D-6 wires the primitives into
main.go's startup-guard phase. Theprobe-node-idCLI is reachable today but only computes locally — no operator action against the cluster.What this ships
Two new sentinel errors (
internal/encryption/errors.go)ErrNodeIDCollision— two distinctfull_node_idvalues hash to the same 16-bitnode_id(would reuse GCM nonces under the active DEK). Triage message points atprobe-node-idfor pre-join verification.ErrLocalEpochRollback— sidecarlocal_epoch <= registry.LastSeenLocalEpoch(strict-ahead invariant from §5.2). Triage message points atresync-sidecaror DEK rotation.Two pure-function guard primitives
CheckNodeIDCollision(fullNodeIDs []uint64) error(node_id_collision.go)CheckLocalEpochRollback(registry, fullNodeID, dekID, sidecarLocalEpoch) error(local_epoch_rollback.go)Both use the shipped
uint16(full_node_id & 0xFFFF)narrowing (matchesapplier.go::nodeIDMask). Local-state-only — no RPC, no engine access. Same-value-twice is NOT a collision (duplicate-membership-entry defense). Equality firesErrLocalEpochRollbackper the §5.2 strict-ahead invariant (replay would reuse the same(node_id, local_epoch)prefix).CLI:
elastickv-admin encryption probe-node-idPer §5.1 mitigation #2 — operator runs against a candidate
full_node_idBEFORE joining a node to verify the value is free in the cluster's allocatednode_idspace. Accepts decimal or0x-prefixed hex input; prints both the 16-hex-digit canonicalfull_node_idand the derivednode_id. No RPC; pure local computation.Test coverage
19 new tests:
node_id_collision_test.golocal_epoch_rollback_test.goencryption_test.go(new tests added)Caller audit
No semantic changes to existing functions; only additions:
errors.go— exported package-surface additions, no callers to update.probe-node-id— additive; no existing subcommand changed.Five-lens self-review
nodeIDMask = 0xFFFFconvention fromapplier.go. TheNarrowingMatchesShippedConventiontest pins this against a future change that would diverge.Verification
go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/— PASSgolangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/— 0 issuesgo build ./...— PASSPlan
Per the 6D design doc §7 decomposition, the next sub-PRs are:
CapabilityFanouthelper (used by 6D-6's pre-flight gate and by Stage 6E)RotateSubEnableStorageEnvelope = 0x04) + FSM apply path withlen(Wrapped) == 0check + sidecarStorageEnvelopeCutoverIndexwritePutAtreadsStorageEnvelopeActive)EnableStorageEnvelopeadmin RPC + CLI + integration test +main.gowiring of THIS PR's startup-guard primitivesTest plan
go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/— PASS (19 new tests + existing)golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/— 0 issuesgo build ./...— PASSSummary by CodeRabbit
New Features
Tests