Skip to content

feat(encryption): Stage 6D-2 — ErrNodeIDCollision + ErrLocalEpochRollback primitives + probe-node-id CLI#788

Open
bootjp wants to merge 6 commits into
mainfrom
feat/encryption-stage6d2-startup-guards
Open

feat(encryption): Stage 6D-2 — ErrNodeIDCollision + ErrLocalEpochRollback primitives + probe-node-id CLI#788
bootjp wants to merge 6 commits into
mainfrom
feat/encryption-stage6d2-startup-guards

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented May 18, 2026

Summary

Stage 6D-2 per the 6D design doc §5 (landed in PR #786). Ships the two bundled 6C-3 startup-guard primitives as pure functions in the encryption package, plus the probe-node-id CLI helper from §5.1 mitigation #2.

Operator-inert until 6D-6 wires the primitives into main.go's startup-guard phase. The probe-node-id CLI is reachable today but only computes locally — no operator action against the cluster.

What this ships

Two new sentinel errors (internal/encryption/errors.go)

  • ErrNodeIDCollision — two distinct full_node_id values hash to the same 16-bit node_id (would reuse GCM nonces under the active DEK). Triage message points at probe-node-id for pre-join verification.
  • ErrLocalEpochRollback — sidecar local_epoch <= registry.LastSeenLocalEpoch (strict-ahead invariant from §5.2). Triage message points at resync-sidecar or DEK rotation.

Two pure-function guard primitives

  • CheckNodeIDCollision(fullNodeIDs []uint64) error (node_id_collision.go)
  • CheckLocalEpochRollback(registry, fullNodeID, dekID, sidecarLocalEpoch) error (local_epoch_rollback.go)

Both use the shipped uint16(full_node_id & 0xFFFF) narrowing (matches applier.go::nodeIDMask). Local-state-only — no RPC, no engine access. Same-value-twice is NOT a collision (duplicate-membership-entry defense). Equality fires ErrLocalEpochRollback per the §5.2 strict-ahead invariant (replay would reuse the same (node_id, local_epoch) prefix).

CLI: elastickv-admin encryption probe-node-id

Per §5.1 mitigation #2 — operator runs against a candidate full_node_id BEFORE joining a node to verify the value is free in the cluster's allocated node_id space. Accepts decimal or 0x-prefixed hex input; prints both the 16-hex-digit canonical full_node_id and the derived node_id. No RPC; pure local computation.

Test coverage

19 new tests:

File Tests
node_id_collision_test.go Empty, SingleNode, NoCollision, Collision, DuplicateNotCollision (same value twice ≠ collision), OrderIndependent, ErrorIncludesBothIDs
local_epoch_rollback_test.go NoRow (freshly-joined-learner skip), SidecarStrictlyAhead, SidecarLessThan, SidecarEqual (strict-ahead replay case), PebbleError (not classified as rollback), NarrowingMatchesShippedConvention
encryption_test.go (new tests added) ProbeNodeID_Hex, ProbeNodeID_Decimal, RequiresFlag, RejectsBadInput, ProbeNodeIDSubcommand (dispatch reaches handler)

Caller audit

No semantic changes to existing functions; only additions:

  • New sentinels in errors.go — exported package-surface additions, no callers to update.
  • New pure functions — no callers yet (operator-inert until 6D-6).
  • New CLI subcommand probe-node-id — additive; no existing subcommand changed.

Five-lens self-review

  1. Data loss — net-positive. Both guards catch nonce-reuse scenarios BEFORE any encrypted write that would corrupt confidentiality.
  2. Concurrency / distributed failures — pure functions over local snapshots; no shared state. Run before any goroutine fan-out.
  3. Performance — O(N) map walk for collision check; single Pebble Get for rollback check. Paid once per process start on happy path.
  4. Data consistency — narrowing matches the shipped nodeIDMask = 0xFFFF convention from applier.go. The NarrowingMatchesShippedConvention test pins this against a future change that would diverge.
  5. Test coverage — 19 new unit tests cover every documented branch + operationally important edges (equality for strict-ahead, narrowing equivalence, error propagation).

Verification

  • go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/ — PASS
  • golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ — 0 issues
  • go build ./... — PASS

Plan

Per the 6D design doc §7 decomposition, the next sub-PRs are:

  • 6D-3 — Voters ∪ Learners CapabilityFanout helper (used by 6D-6's pre-flight gate and by Stage 6E)
  • 6D-4 — Wire format addition (RotateSubEnableStorageEnvelope = 0x04) + FSM apply path with len(Wrapped) == 0 check + sidecar StorageEnvelopeCutoverIndex write
  • 6D-5 — §6.2 storage-layer toggle (PutAt reads StorageEnvelopeActive)
  • 6D-6EnableStorageEnvelope admin RPC + CLI + integration test + main.go wiring of THIS PR's startup-guard primitives

Test plan

  • go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/ — PASS (19 new tests + existing)
  • golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ — 0 issues
  • go build ./... — PASS
  • Jepsen — not relevant for an operator-inert PR; the 6D-6 integration test will exercise the wired guards end-to-end

Summary by CodeRabbit

  • New Features

    • Added a CLI subcommand to display a provided full node identifier (decimal or hex) and its derived 16-bit node id.
    • Added startup validation checks that detect 16-bit node-id collisions and local-epoch rollback conditions to prevent unsafe encryption states.
  • Tests

    • Added unit and integration tests covering the new CLI parsing, collision detection, and local-epoch rollback behaviors.

Review Change Stack

…back primitives + probe-node-id CLI

Stage 6D-2 per the 6D design doc 5 (PR #786, merged). Ships the
two bundled 6C-3 startup-guard primitives as pure functions in
the encryption package, plus the probe-node-id CLI helper from
5.1 mitigation #2. Operator-inert until 6D-6 wires the
primitives into main.go's startup-guard phase.

## What this ships

### Two new sentinel errors

  - ErrNodeIDCollision (errors.go): two distinct full_node_id
    values in the local route-catalog hash to the same 16-bit
    node_id, which would reuse GCM nonces under the active DEK.
    Triage message points at the probe-node-id CLI for
    pre-join verification.
  - ErrLocalEpochRollback (errors.go): sidecar local_epoch for
    the active storage DEK is <= the local writer-registry
    last_seen_local_epoch (strict-ahead invariant from 5.2 of
    the 6D design doc). Triage message points at
    resync-sidecar or DEK rotation.

### Two pure-function guard primitives

  - CheckNodeIDCollision(fullNodeIDs []uint64) error
    (node_id_collision.go): narrows each full_node_id via
    uint16(full_node_id & 0xFFFF) - same shipped mask as
    applier.go's nodeIDMask - and returns ErrNodeIDCollision
    if any two DISTINCT values share a node_id. Order-
    independent. Same-value-twice is NOT a collision
    (duplicate-membership-entry quirk; the route catalog
    watcher should dedupe upstream but the primitive defends
    correctness here). Local-state-only; no RPC, no engine
    access.

  - CheckLocalEpochRollback(registry, fullNodeID, dekID,
    sidecarLocalEpoch) error (local_epoch_rollback.go): reads
    the local Pebble writer-registry row at RegistryKey(dekID,
    uint16(fullNodeID & 0xFFFF)). Returns nil if no row exists
    (freshly-joined learner; 4.1 case-1 first-seen creates the
    row on first encrypted write). Returns ErrLocalEpochRollback
    when sidecar <= registry (the strict-ahead invariant -
    equality is NOT safe because the same (node_id, local_epoch)
    prefix would be replayed). Pebble I/O errors propagate as
    wrapped errors NOT marked with ErrLocalEpochRollback so the
    operator triages transport failure separately from a real
    rollback.

### CLI: probe-node-id

  - elastickv-admin encryption probe-node-id --full-node-id=<u64>
    (cmd/elastickv-admin/encryption.go): derives the 16-bit
    node_id from a candidate full_node_id using the same shipped
    narrowing convention. Operator runs it against candidates
    BEFORE joining a new node to verify the value is free in
    the cluster's allocated node_id space - the recommended
    mitigation per the 6D design doc 5.1 mitigation menu.
    Accepts decimal or 0x-prefixed hex input; prints both the
    16-hex-digit canonical full_node_id and the derived
    node_id.

## Test coverage

11 new tests pin the load-bearing properties:

  node_id_collision_test.go (8 tests):
    - Empty - skip
    - SingleNode - skip
    - NoCollision - nil
    - Collision - ErrNodeIDCollision
    - DuplicateNotCollision - same full_node_id twice is NOT a
      collision
    - OrderIndependent - either ordering of the colliding pair
      catches the hit
    - ErrorIncludesBothIDs - operator-triage shape names both
      colliding values and the shared node_id

  local_epoch_rollback_test.go (6 tests):
    - NoRow - freshly-joined-learner skip
    - SidecarStrictlyAhead - nil (healthy case)
    - SidecarLessThan - ErrLocalEpochRollback (classic rollback)
    - SidecarEqual - ErrLocalEpochRollback (strict-ahead
      invariant; replay scenario)
    - PebbleError - underlying err propagated, NOT classified
      as ErrLocalEpochRollback
    - NarrowingMatchesShippedConvention - looks up registry row
      at uint16-narrowed key matching applier.go:230

  encryption_test.go (5 new tests for probe-node-id):
    - Hex - 0x-prefixed input
    - Decimal - decimal input
    - RequiresFlag - missing --full-node-id is an error
    - RejectsBadInput - non-numeric/non-hex input is an error
    - EncryptionMain_ProbeNodeIDSubcommand - dispatch reaches
      runEncryptionProbeNodeID (not the unknown-subcommand
      default branch)

## Caller audit

No semantic changes to existing functions; only additions.

  - New sentinels in errors.go - exported package surface
    additions, no caller audit needed (no existing callers to
    update).
  - New pure functions in node_id_collision.go and
    local_epoch_rollback.go - no callers yet (operator-inert
    until 6D-6 wires them).
  - New CLI subcommand probe-node-id - additive; no existing
    subcommand changed.

## Five-lens self-review

1. Data loss - net-positive. Both guards catch potential
   nonce-reuse scenarios BEFORE the first encrypted write that
   would corrupt confidentiality.

2. Concurrency / distributed failures - primitives are pure
   functions over local snapshots; no shared state. Run in the
   startup-before-serving phase before any goroutine fan-out.

3. Performance - O(N) hash map walk for CheckNodeIDCollision
   (N = cluster member count). Single Pebble Get for
   CheckLocalEpochRollback. Both paid once per process start
   on the happy path.

4. Data consistency - matches the shipped 16-bit narrowing
   convention from applier.go (nodeIDMask = 0xFFFF) and the
   writer-registry keying. The narrowing-matches test pins
   this against a future change that would diverge.

5. Test coverage - 11 new unit tests cover every documented
   branch + the operationally important edge cases (equality
   for strict-ahead, narrowing equivalence, error
   propagation).

## Verification

  - go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/ - PASS
  - golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ - 0 issues
  - go build ./... - PASS

## Plan

Stage 6D-2 ships the primitives + CLI as the bundled 6C-3 piece
per the design doc 7 decomposition. Next sub-PRs per 7:

  - 6D-3: Voters union Learners CapabilityFanout helper
  - 6D-4: Wire format + FSM apply path
  - 6D-5: 6.2 storage-layer toggle
  - 6D-6: EnableStorageEnvelope admin RPC + CLI + integration
    test + main.go wiring of THIS PR's primitives

The two primitives in this PR are operator-inert until 6D-6
wires them into main.go's startup-guard phase. The probe-node-id
CLI is reachable today but only computes locally - no operator
action required against the cluster.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 561c10d7-0267-40a7-914a-29a3717440e4

📥 Commits

Reviewing files that changed from the base of the PR and between eab8c49 and 5aaa3e0.

📒 Files selected for processing (5)
  • cmd/elastickv-admin/encryption.go
  • cmd/elastickv-admin/encryption_test.go
  • internal/encryption/errors.go
  • internal/encryption/local_epoch_rollback.go
  • internal/encryption/local_epoch_rollback_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • cmd/elastickv-admin/encryption_test.go
  • internal/encryption/errors.go
  • cmd/elastickv-admin/encryption.go

📝 Walkthrough

Walkthrough

Adds startup guards and diagnostics for storage-envelope cutover: CheckNodeIDCollision (detects 16-bit node-ID collisions), CheckLocalEpochRollback (enforces local-epoch monotonicity via the writer-registry), exported sentinel errors, and an admin CLI probe-node-id subcommand with tests.

Changes

Storage-envelope startup validation

Layer / File(s) Summary
Error sentinels
internal/encryption/errors.go
Defines exported ErrNodeIDCollision and ErrLocalEpochRollback used by startup guards.
Node-ID collision detection
internal/encryption/node_id_collision.go, internal/encryption/node_id_collision_test.go
CheckNodeIDCollision checks that distinct full_node_id values do not collide when narrowed to 16-bit node_id. Tests cover empty/single inputs, non-collision, collision detection, duplicate handling, order-independence, and error message contents.
Local epoch rollback detection
internal/encryption/local_epoch_rollback.go, internal/encryption/local_epoch_rollback_test.go
CheckLocalEpochRollback looks up the writer-registry by (DEK ID, uint16(full_node_id & 0xFFFF)) and returns ErrLocalEpochRollback when sidecar epoch ≤ LastSeenLocalEpoch or when an active envelope lacks a proving registry row. Tests include pre/post-cutover missing-row behavior, epoch comparisons, registry errors propagation, foreign-row handling, and key-narrowing convention checks.
CLI probe tool for node-ID narrowing
cmd/elastickv-admin/encryption.go, cmd/elastickv-admin/encryption_test.go
Adds probe-node-id --full-node-id (accepts 0x hex or decimal), parseUint64WithRadix helper, and tests validating canonical output, required flag behavior, malformed input rejection, partial-parse rejection, help flag, and main dispatch routing.

Sequence Diagram

sequenceDiagram
  participant Startup as Startup
  participant CollisionCheck as CheckNodeIDCollision
  participant EpochCheck as CheckLocalEpochRollback
  participant Registry as WriterRegistry
  participant CLI as elastickv-admin(probe-node-id)

  Startup->>CollisionCheck: validate full_node_id list
  CollisionCheck->>CollisionCheck: narrow each full_node_id to uint16
  CollisionCheck-->>Startup: nil or ErrNodeIDCollision

  Startup->>EpochCheck: validate sidecarLocalEpoch for a node
  EpochCheck->>Registry: GetRegistryRow(DEK, uint16(full&0xFFFF))
  Registry-->>EpochCheck: row / missing / error
  EpochCheck-->>Startup: nil or ErrLocalEpochRollback

  CLI->>CLI: parse --full-node-id (hex or decimal) via parseUint64WithRadix
  CLI->>CLI: narrow to uint16 and format output
  CLI-->>Operator: printed full_node_id (hex) and node_id (uint16)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • bootjp/elastickv#759: Implements bootstrap encryption validation that rejects duplicate full_node_id and uint16-derived node-id collisions, aligning with this PR's collision logic.
  • bootjp/elastickv#707: Design-doc changes matching node-ID collision detection and local-epoch rollback semantics implemented here.
  • bootjp/elastickv#765: Contains related applier scaffolding that halts on uint16 node-id collisions and local-epoch rollback using the same masked node-id convention.

"I hopped through bytes beneath the moon,
Checked low sixteen bits by noon,
No collisions in my burrowed den,
Epochs march onward—safe again!" 🐇✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the three main components of this PR: two new sentinel errors (ErrNodeIDCollision and ErrLocalEpochRollback) and their supporting primitives, plus the new probe-node-id CLI subcommand.
Docstring Coverage ✅ Passed Docstring coverage is 96.77% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/encryption-stage6d2-startup-guards

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 18, 2026

@claude review

Stage 6D-2 ships the two bundled 6C-3 startup-guard primitives per §5 of the 6D design doc (#786, merged):

  1. ErrNodeIDCollision + CheckNodeIDCollision — pure function over local route-catalog snapshot; uses the shipped uint16(full_node_id & 0xFFFF) narrowing (matches applier.go::nodeIDMask). Duplicate-same-value defended explicitly (catalog watcher should dedupe, primitive defends correctness).

  2. ErrLocalEpochRollback + CheckLocalEpochRollback — pure function; reads local Pebble writer-registry. Strict-ahead invariant (equality is rollback). Pebble I/O errors propagate as wrapped errors, NOT classified as ErrLocalEpochRollback (operator triages transport failure separately).

  3. elastickv-admin encryption probe-node-id CLI — per §5.1 mitigation Update dependency go to v1.21.4 #2. Decimal or 0x-hex input.

Operator-inert until 6D-6 wires the primitives into main.go's startup-guard phase.

Please validate:

  1. Narrowing consistency — the two primitives + the CLI must use the SAME 16-bit narrowing convention as the shipped writer-registry keying. Specifically: uint16(full_node_id & 0xFFFF), matching applier.go:230 and encryption_admin.go:64. The NarrowingMatchesShippedConvention test pins this against future divergence.

  2. Strict-ahead invariant for ErrLocalEpochRollback — equality is a rollback case per §5.2 of the design doc (the replay scenario). The SidecarEqual test pins this.

  3. Caller audit — both new sentinels are exported package-surface additions with no existing callers. Both new primitives have no callers yet (operator-inert). The new CLI subcommand is additive; no existing subcommand changed. No semantic changes to existing functions.

  4. Test coverage — 19 new tests across three files; every documented branch + the operationally important edges (equality, narrowing equivalence, error propagation, dispatch).

  5. Operator-inert contract — the wiring into main.go's startup-guard phase is deferred to 6D-6 per the design doc §7. This PR is pure scaffolding ready for that wiring; the probe-node-id CLI is the only operator-visible surface and it's a pure local computation (no RPC).

Verification:

  • go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/ — PASS
  • golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ — 0 issues
  • go build ./... — PASS

@claude
Copy link
Copy Markdown

claude Bot commented May 18, 2026

Claude finished @bootjp's task in 4m 1s —— View job


Stage 6D-2 Code Review

Tasks:

  • Gather context and read changed files
  • Validate narrowing consistency (uint16(full_node_id & 0xFFFF))
  • Validate strict-ahead invariant for ErrLocalEpochRollback
  • Perform caller audit
  • Validate test coverage
  • Validate operator-inert contract
  • Post final review

Review

1. Narrowing Consistency ✅

All three sites correctly use uint16(x & 0xFFFF):

Site Implementation
node_id_collision.go:59 uint16(fnid & nodeIDMask) — uses package-level nodeIDMask = 0xFFFF from applier.go:670
local_epoch_rollback.go:65 uint16(fullNodeID&nodeIDMask) — same package-level constant
encryption.go:94-95 const nodeIDMask = 0xFFFF; uint16(full & nodeIDMask) — local redefinition (correct; CLI is a different package)

All three match applier.go:230 (uint16(p.FullNodeID&nodeIDMask)). Consistent.

Minor comment accuracy note: node_id_collision.go:59's nolint comment says "matches applier.go/encryption_admin.go convention", but encryption_admin.go uses localEpochMask for a different purpose (not node ID narrowing). The reference to encryption_admin.go is a red herring; applier.go alone is the authoritative source. Not a bug, but slightly misleading for future readers.


2. Strict-Ahead Invariant ✅

local_epoch_rollback.go:85:

if sidecarLocalEpoch <= registryRow.LastSeenLocalEpoch {

Correctly fires on < (classic rollback) and == (replay case). The TestCheckLocalEpochRollback_SidecarEqual test pins the equality branch. Correct.

The Pebble I/O error path is also correctly handled — wrapping without marking as ErrLocalEpochRollback, verified by TestCheckLocalEpochRollback_PebbleError.


3. Caller Audit ✅

  • ErrNodeIDCollision, ErrLocalEpochRollback — new exports; no existing callers.
  • CheckNodeIDCollision, CheckLocalEpochRollback — new functions; no callers until 6D-6.
  • probe-node-id subcommand — additive; no existing subcommand changed; encryptionMain's switch, default, and help strings all updated consistently (encryption.go:46, encryption.go:52, encryption.go:58).

4. Test Coverage ✅ (with minor count discrepancy)

18 test functions total (PR description says 19 — minor miscounting):

  • node_id_collision_test.go: 7 tests
  • local_epoch_rollback_test.go: 6 tests
  • encryption_test.go: 5 new tests

All documented branches are covered. TestCheckLocalEpochRollback_NarrowingMatchesShippedConvention correctly pins the load-bearing invariant by seeding with one full_node_id and looking up with a different one that shares the same low 16 bits — verifying the registry key derivation matches the shipped convention.


5. Operator-Inert Contract ✅

No changes to main.go. Both primitives are pure functions over local state, no RPCs. probe-node-id is local computation only. The 6D-6 wiring is deferred as documented. Contract holds.


Issues Found

🔴 parseUint64WithRadix silently accepts partial input

encryption.go:107-117 uses fmt.Sscanf for both hex and decimal parsing. fmt.Sscanf does not require consuming the full input string — it returns success after matching the format even if trailing characters remain. This means:

  • "0x1234ZZZZ" → parses as 0x1234 without error
  • "1234abc" → parses as 1234 without error

TestRunEncryptionProbeNodeID_RejectsBadInput only tests "notanumber" (which fails because 'n' is not parseable at all), so this edge case is untested.

Fix: use strconv.ParseUint which rejects partial input:

// hex path
v, err := strconv.ParseUint(s[2:], 16, 64)

// decimal path  
v, err := strconv.ParseUint(s, 10, 64)

Fix this →

🟡 const minMembersForCollision placement breaks go doc attribution

node_id_collision.go:41-47: The 38-line comment block starting at line 7 forms one unbroken block immediately preceding const minMembersForCollision = 2 at line 45. In Go, a comment immediately preceding a declaration is attributed to that declaration by go doc. The comment block will therefore be attributed to the constant, not to CheckNodeIDCollision (which has a blank line before it at line 47). go doc encryption.CheckNodeIDCollision will produce no documentation.

Fix: hoist the constant and its short comment before the function's doc comment:

// minMembersForCollision is the smallest membership-set size at which
// two distinct full_node_id values can exist and a narrowing collision
// is possible. With 0 or 1 member there is nothing to compare.
const minMembersForCollision = 2

// CheckNodeIDCollision is the §9.1 / 6C-3 startup-guard primitive ...
func CheckNodeIDCollision(fullNodeIDs []uint64) error {

Fix this →


Summary

The five validation points all pass. The narrowing is consistent, the strict-ahead invariant is correct, the caller audit is clean, test coverage is comprehensive, and the operator-inert contract holds. The two issues above are the actionable items before merge: the strconv.ParseUint fix closes a silent truncation bug in the CLI input parser, and the constant placement fix restores go doc attribution for CheckNodeIDCollision.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces critical safety guards to prevent AES-GCM nonce reuse within the encryption system. Key additions include the probe-node-id administrative subcommand for verifying node ID mappings, and automated startup checks to detect node_id collisions and local_epoch rollbacks. Feedback was provided regarding a formatting discrepancy in the CLI output where the hex padding width incorrectly accounted for the '0x' prefix, which could lead to inconsistent canonical representations.

Comment thread cmd/elastickv-admin/encryption.go Outdated
Comment on lines +96 to +97
if _, err := fmt.Fprintf(out, "full_node_id: %#016x (%d)\nnode_id: %#04x (%d)\n",
full, full, narrowed, narrowed); err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The format string %#016x for full_node_id and %#04x for node_id does not align with the "canonical 16-hex-digit" and "4-hex-digit" forms described in the comments and expected by the tests. In Go, the width in %#0Nx includes the 0x prefix, so %#016x only provides 14 hex digits of padding. For example, TestRunEncryptionProbeNodeID_Decimal expects 0x0000000012345678 (18 characters), but the current code would produce 0x00000012345678 (16 characters). To consistently output the full 16 and 4 hex digits respectively, use 0x%016x and 0x%04x.

Suggested change
if _, err := fmt.Fprintf(out, "full_node_id: %#016x (%d)\nnode_id: %#04x (%d)\n",
full, full, narrowed, narrowed); err != nil {
if _, err := fmt.Fprintf(out, "full_node_id: 0x%016x (%d)\nnode_id: 0x%04x (%d)\n",
full, full, narrowed, narrowed); err != nil {

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a246a56259

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +84 to +85
if err := fs.Parse(args); err != nil {
return errors.Wrap(err, "parse probe-node-id flags")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat probe-node-id help as non-error exit

runEncryptionProbeNodeID wraps every fs.Parse error, so elastickv-admin encryption probe-node-id -h/--help returns an error and exits non-zero instead of behaving like other encryption subcommands that treat flag.ErrHelp as a successful usage request. This breaks shell/help workflows that rely on help returning success.

Useful? React with 👍 / 👎.

Comment thread cmd/elastickv-admin/encryption.go Outdated
Comment on lines +109 to +110
if _, err := fmt.Sscanf(s[2:], "%x", &v); err != nil {
return 0, errors.Wrap(err, "hex parse")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject malformed full-node-id strings completely

Using fmt.Sscanf here accepts partial parses (e.g. 0xdeadbeefZZ or 123abc) and silently ignores trailing garbage, so mistyped --full-node-id inputs can be treated as valid and produce the wrong derived node_id. For a safety probe command, accepting truncated IDs can mislead operators into validating the wrong value before join.

Useful? React with 👍 / 👎.

…dium (go-doc attribution)

Two findings from claude r1 review on PR #788:

## claude high - parseUint64WithRadix silently accepts partial input

cmd/elastickv-admin/encryption.go's parseUint64WithRadix used
fmt.Sscanf for both hex and decimal parsing. Sscanf returns
success once it matches the format, even if the input has
trailing unparsed characters:

  "0x1234ZZZZ" -> parses as 0x1234 without error
  "1234abc"    -> parses as 1234 without error

This was unsafe for an operator-visible identifier where every
digit matters - a typo'd full_node_id with a non-hex tail would
silently return a different value than the operator meant,
leading the operator to join a node under the wrong node_id.

Fix: switched to strconv.ParseUint, which requires the ENTIRE
input to be valid for the chosen radix. Updated import set and
added TestRunEncryptionProbeNodeID_RejectsPartialInput with
three regression cases (hex prefix + non-hex tail, decimal
prefix + letter tail, hex with embedded space).

## claude medium - minMembersForCollision placement broke go doc

internal/encryption/node_id_collision.go had the
minMembersForCollision constant declared IMMEDIATELY before
CheckNodeIDCollision (no blank line). In Go, a doc comment
preceding a declaration is attributed to that declaration, so
the 38-line doc comment block was being attributed to the
CONSTANT by go doc, not to the function. go doc encryption.CheckNodeIDCollision
would have produced no documentation.

Fix: hoisted the constant declaration ABOVE the function's doc
comment, with its own short comment. Both declarations now have
correct go-doc attribution.

## Gemini medium NOT addressed (verified false)

Gemini r1 flagged %#016x as producing 14 hex digits (claiming
the 0x prefix counted toward the width). Verified directly: Go
emits 16 hex digits after the 0x prefix for %#016x (output
0x0000000012345678 = 18 chars total). My tests already assert
this exact form and pass. Gemini's claim is incorrect per Go's
actual format behavior; no change needed.

## Caller audit (per cron directive)

parseUint64WithRadix is package-private; the only caller is
runEncryptionProbeNodeID (one call site). The semantic change
is "strict parse vs. silent prefix" - the strict parse is the
safer posture and no existing test relied on the silent-prefix
behavior (the existing RejectsBadInput test used "notanumber"
which fails both implementations).

minMembersForCollision is a package-private constant only used
within node_id_collision.go. Hoisting the declaration is
structural; no semantic change.

## Verification

  - go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/ - PASS
  - golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ - 0 issues
  - go build ./... - PASS

No semantic changes to existing functions outside the
parseUint64WithRadix bug-fix.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 18, 2026

@claude review

Round-1 (commit 9eecdb55): addressed both claude r1 findings.

claude high — parseUint64WithRadix silent-prefix bug

Switched from fmt.Sscanf to strconv.ParseUint. New regression test TestRunEncryptionProbeNodeID_RejectsPartialInput covers three cases (hex prefix + non-hex tail, decimal prefix + letter tail, hex with embedded space) that the previous Sscanf implementation would have silently truncated.

Caller audit: parseUint64WithRadix is package-private with a single caller (runEncryptionProbeNodeID). The semantic change "strict parse vs silent prefix" makes the function safer; no existing test relied on the silent-prefix behavior.

claude medium — go-doc attribution

Hoisted minMembersForCollision constant + its own short comment ABOVE the function's doc comment block. Now go doc encryption.CheckNodeIDCollision produces the intended 38-line function documentation.

Gemini r1 format-width finding — verified false, no change

Verified directly: Go's %#016x produces 16 hex digits AFTER the 0x prefix (output is 0x0000000012345678 = 18 chars total). My tests already assert this exact form and pass. Gemini's claim that the width includes the prefix is incorrect per Go's actual behavior.

Verification

  • go test -race ./internal/encryption/ ./cmd/elastickv-admin/ — PASS (21 new tests, +3 for the partial-input regression)
  • golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ — 0 issues

@claude
Copy link
Copy Markdown

claude Bot commented May 18, 2026

Claude encountered an error after 0s —— View job


I'll analyze this and get back to you.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9eecdb559c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// §5.2 skip-condition, this is a legitimate lifecycle
// state — the first encrypted write will create the row
// via §4.1 case-1 first-seen.
return nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve missing-row signal in CheckLocalEpochRollback

Returning nil when the writer-registry row is absent collapses two different states—"safe strict-ahead" and "no rollback anchor exists"—into the same success result. That makes it impossible for startup wiring to enforce the post-cutover requirement (when StorageEnvelopeActive is true) to refuse a missing (full_node_id, dek_id) row, and can let a node start without a nonce-reuse guardrail. The helper should surface this case (e.g., distinct error/signal) instead of unconditionally succeeding.

Useful? React with 👍 / 👎.

…l split per 5.2)

Codex r1 P2 on PR788: CheckLocalEpochRollback collapsed two
distinct states - "safe strict-ahead" and "no rollback anchor
exists" - into the same nil return. The 6D design doc 5.2
explicitly splits these on storage_envelope_active:

  - PRE-cutover (storage_envelope_active == false): missing row
    is the freshly-joined-learner case (allow startup; 4.1
    case-1 first-seen creates the row on first encrypted write)
  - POST-cutover (storage_envelope_active == true): missing row
    means encrypted writes are happening cluster-wide with no
    rollback anchor on this node (refuse startup; the guard's
    raison d'etre)

The previous implementation collapsed both to nil, meaning a
6D-6 implementor wiring the guard would have no way to enforce
the post-cutover requirement without reading the registry
again. Codex's flag is valid.

## Fix: storageEnvelopeActive bool parameter

Added a `storageEnvelopeActive` bool to CheckLocalEpochRollback.
The missing-row branch now splits internally:

  - false + no row -> nil (pre-cutover freshly-joined-learner)
  - true  + no row -> ErrLocalEpochRollback wrapped with
                      "storage_envelope_active=true but no
                       registry row" diagnostic

This matches 5.2 exactly. The caller (main.go wiring in
6D-6 per the design doc 7) passes sidecar.StorageEnvelopeActive
as the bool.

## Caller audit (per cron directive - SEMANTIC CHANGE)

CheckLocalEpochRollback is currently called only from tests
(operator-inert per 6D-2 scope). grep verified:

  - internal/encryption/local_epoch_rollback_test.go (6 callers)
  - No production callers (the main.go wiring is deferred to
    6D-6 per the design doc 7)

All six test callers updated to pass `false` (pre-cutover) as
the new bool, preserving the existing assertion semantics.
Added one new test for the post-cutover branch:
TestCheckLocalEpochRollback_NoRow_PostCutover - verifies
missing-row + active=true returns ErrLocalEpochRollback with
the "storage_envelope_active=true" context message.

Renamed the existing TestCheckLocalEpochRollback_NoRow to
TestCheckLocalEpochRollback_NoRow_PreCutover to make the split
explicit at the test level.

## Gemini r1 format-width finding - confirmed false, no change

Format width %#016x produces 16 hex digits after the 0x prefix
(output 0x0000000012345678 = 18 chars total). Verified via direct
Go execution. No change needed.

## Verification

  - go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/ - PASS
  - golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ - 0 issues
  - go build ./... - PASS

No semantic changes outside CheckLocalEpochRollback's signature.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 18, 2026

@claude review

Round-2 (commit eab8c49d): addressed codex r1 P2.

codex P2 — preserve missing-row signal in CheckLocalEpochRollback

The previous implementation collapsed two distinct states into nil:

  • "safe strict-ahead" (no row + pre-cutover → freshly-joined learner)
  • "no rollback anchor exists" (no row + post-cutover → cluster writing encrypted data with no monotonicity guard on this node)

The 6D design doc §5.2 explicitly splits these on storage_envelope_active; codex's flag is valid.

Fix

Added storageEnvelopeActive bool parameter to CheckLocalEpochRollback. The missing-row branch now splits:

  • false + no row → nil (pre-cutover)
  • true + no row → ErrLocalEpochRollback wrapped with "storage_envelope_active=true but no registry row"

Matches §5.2 exactly. The 6D-6 wiring will pass sidecar.StorageEnvelopeActive.

Caller audit (SEMANTIC CHANGE per cron directive)

CheckLocalEpochRollback is currently called only from tests (operator-inert per 6D-2 scope). grep -rn CheckLocalEpochRollback confirms:

  • 6 test callers — all updated to pass false (preserve existing assertions)
  • 0 production callers (6D-6 will wire it)

New test: TestCheckLocalEpochRollback_NoRow_PostCutover — verifies missing-row + active=true returns ErrLocalEpochRollback with the "storage_envelope_active=true" context. Existing TestCheckLocalEpochRollback_NoRow renamed to _PreCutover for symmetry.

Gemini r1 format-width finding — verified false, no change

%#016x produces 16 hex digits after the 0x prefix (output 0x0000000012345678 = 18 chars total). Verified via direct Go execution.

Verification

  • go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/ — PASS
  • golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ — 0 issues

@claude
Copy link
Copy Markdown

claude Bot commented May 18, 2026

Claude finished @bootjp's task in 3m 37s —— View job


Stage 6D-2 Code Review — Round 2

Tasks:

  • Gather context on Round-2 changes
  • Read changed files
  • Validate storageEnvelopeActive parameter split (§5.2 missing-row guard)
  • Verify Gemini format-width finding (%#016x hex digit count)
  • Check caller audit completeness
  • Post final review

Review

1. storageEnvelopeActive fix (codex r1 P2) ✅

local_epoch_rollback.go:90-105 correctly splits the missing-row case on the new parameter:

if !ok {
    if storageEnvelopeActive {
        return pkgerrors.Wrapf(ErrLocalEpochRollback,
            "writer-registry has no row for full_node_id=%#x dek_id=%d but storage_envelope_active=true; cannot prove nonce monotonicity for the active DEK",
            fullNodeID, activeStorageDEKID)
    }
    return nil
}
  • storageEnvelopeActive=false + no row → nil (pre-cutover; §4.1 case-1 first-seen branch will create the row) ✅
  • storageEnvelopeActive=true + no row → ErrLocalEpochRollback with context string ✅
  • Error message includes "storage_envelope_active=true" — correctly pinned by TestCheckLocalEpochRollback_NoRow_PostCutover:93

Caller audit is clean: grep confirms 6 test callers (all updated to pass false) and 0 production callers. Signature change is backward-compatible at the 6D-6 wiring point; 6D-6 will pass sidecar.StorageEnvelopeActive.

Minor: errors.go:263-264 still says the "no registry row" case is an unconditional skip condition for ErrLocalEpochRollback. After this change it's only a skip for pre-cutover. The comprehensive doc in local_epoch_rollback.go:44-75 covers the split correctly; the errors.go skip-condition note is now slightly stale. Not a correctness issue — the primitive is authoritative — but worth a follow-up line edit.


2. Gemini format-width finding — correctly rejected ✅

Gemini claimed %#016x only produces 14 hex digits because "the width in %#0Nx includes the 0x prefix." This is C's behavior, not Go's.

In Go's fmt package, the 0 flag inserts zero padding after the 0x prefix, so the width applies to the hex digit portion only. %#016x for uint64(0x12345678) produces 0x0000000012345678 (18 chars: 0x + 16 hex digits), which is exactly what TestRunEncryptionProbeNodeID_Decimal:504 asserts.

The tests passing is confirmation — the test asserts "full_node_id: 0x0000000012345678" (16 hex digits after 0x) and "node_id: 0x1234" (4 hex digits) — both correct. Gemini's suggested rewrite to 0x%016x would produce identical output, not a correction.


3. Caller audit ✅

  • CheckLocalEpochRollback signature changed: 6 test callers updated, 0 production callers
  • CheckNodeIDCollision, both sentinels — unchanged, still 0 callers outside tests
  • probe-node-id subcommand — unchanged, additive

Remaining open item

🟡 runEncryptionProbeNodeID does not handle flag.ErrHelp (codex r1 P2 #1, unaddressed)

encryption.go:85-87:

if err := fs.Parse(args); err != nil {
    return errors.Wrap(err, "parse probe-node-id flags")
}

Every other encryption subcommand returns nil for flag.ErrHelp:

  • runEncryptionStatus (encryption.go:169-172) ✅
  • runEncryptionRotateDEK (encryption_mutators.go:74) ✅
  • runEncryptionRegisterWriter (encryption_mutators.go:119) ✅
  • runEncryptionBootstrap (encryption_mutators.go:261) ✅

probe-node-id alone wraps flag.ErrHelp as an error — elastickv-admin encryption probe-node-id -h exits non-zero, breaking shell scripts that test $? on help invocations.

Fix:

if err := fs.Parse(args); err != nil {
    if errors.Is(err, flag.ErrHelp) {
        return nil
    }
    return errors.Wrap(err, "parse probe-node-id flags")
}

Add a test to match TestRunEncryptionStatus_HelpFlagExitsZero:

func TestRunEncryptionProbeNodeID_HelpFlagExitsZero(t *testing.T) {
    t.Parallel()
    var buf bytes.Buffer
    if err := runEncryptionProbeNodeID([]string{"-h"}, &buf); err != nil {
        t.Errorf("runEncryptionProbeNodeID(-h) returned %v, want nil", err)
    }
}

Fix this →


Summary

The codex P2 "preserve missing-row signal" is correctly implemented — the storageEnvelopeActive split matches §5.2 exactly, all callers are updated, and the new test pins the post-cutover refusal. The Gemini format-width finding is still correctly rejected. The one remaining P2 (help-flag exit code) was not addressed in Round-2; it's a 3-line fix with a corresponding test to match the pattern used in all other subcommands.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
cmd/elastickv-admin/encryption.go (1)

95-97: ⚡ Quick win

Remove the avoidable //nolint:gosec suppression.

Line 96 can avoid the cast entirely while preserving behavior, which keeps lint policy clean.

♻️ Proposed refactor
-	const nodeIDMask = 0xFFFF
-	narrowed := uint16(full & nodeIDMask) //nolint:gosec // masked to 16 bits; matches applier.go convention
+	const nodeIDMask uint64 = 0xFFFF
+	narrowed := full & nodeIDMask

As per coding guidelines, "**/*.go: Use gofmt formatting and pass linters configured in .golangci.yaml ... Avoid //nolint annotations — refactor instead."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/elastickv-admin/encryption.go` around lines 95 - 97, The code suppresses
gosec for a uint16 cast on narrowing full; instead make the mask have the same
unsigned integer type as full (e.g., change nodeIDMask to uint64(0xFFFF) or the
concrete type of full), remove the explicit uint16 cast and the //nolint:gosec
comment, and use narrowed := full & nodeIDMask so the mask preserves type and
behavior when passed to fmt.Fprintf (see nodeIDMask, narrowed and the
fmt.Fprintf call).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/elastickv-admin/encryption.go`:
- Around line 85-87: The parse block in probe-node-id treats all parse errors as
failures; change the logic around fs.Parse(args) so that if it returns
flag.ErrHelp you return nil (success/normal help exit) instead of wrapping it;
for any other error keep returning errors.Wrap(err, "parse probe-node-id
flags"). Locate the fs.Parse call inside the probe-node-id handling code and add
an explicit check for errors.Is(err, flag.ErrHelp) or err == flag.ErrHelp before
wrapping to implement this behavior.

In `@internal/encryption/errors.go`:
- Around line 260-264: Update the stale comment for ErrLocalEpochRollback to
reflect the new pre-cutover vs post-cutover behavior: explain that missing
registry-row is still a skip condition before the cutover (when
storageEnvelopeActive=false or Active.Storage==0/bootstrap not committed) but
after cutover (storageEnvelopeActive=true) the guard now refuses startup if
there is no (full_node_id, active_storage_dek_id) registry row; mention the
related concepts so readers can find the logic (ErrLocalEpochRollback,
storageEnvelopeActive, Active.Storage, RegisterEncryptionWriter) and clarify the
different handling paths for pre-cutover vs post-cutover.

---

Nitpick comments:
In `@cmd/elastickv-admin/encryption.go`:
- Around line 95-97: The code suppresses gosec for a uint16 cast on narrowing
full; instead make the mask have the same unsigned integer type as full (e.g.,
change nodeIDMask to uint64(0xFFFF) or the concrete type of full), remove the
explicit uint16 cast and the //nolint:gosec comment, and use narrowed := full &
nodeIDMask so the mask preserves type and behavior when passed to fmt.Fprintf
(see nodeIDMask, narrowed and the fmt.Fprintf call).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b13ddfd9-60ac-44ae-9744-12c0a245b80c

📥 Commits

Reviewing files that changed from the base of the PR and between 7a05466 and eab8c49.

📒 Files selected for processing (7)
  • cmd/elastickv-admin/encryption.go
  • cmd/elastickv-admin/encryption_test.go
  • internal/encryption/errors.go
  • internal/encryption/local_epoch_rollback.go
  • internal/encryption/local_epoch_rollback_test.go
  • internal/encryption/node_id_collision.go
  • internal/encryption/node_id_collision_test.go

Comment thread cmd/elastickv-admin/encryption.go
Comment thread internal/encryption/errors.go Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eab8c49dd4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +113 to +114
if sidecarLocalEpoch <= registryRow.LastSeenLocalEpoch {
return pkgerrors.Wrapf(ErrLocalEpochRollback,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate registry row belongs to requested full_node_id

CheckLocalEpochRollback looks up by uint16(fullNodeID&nodeIDMask) and then compares epochs without verifying that registryRow.FullNodeID matches the fullNodeID argument. If a different writer has (or previously had) the same 16-bit node ID, this function can treat that foreign row as this node’s rollback anchor and even return nil when sidecarLocalEpoch is greater than the foreign LastSeenLocalEpoch, bypassing the intended missing-row/post-cutover refusal for the real node. This leaves startup-guard behavior inconsistent and can defer failure to apply-time collision handling instead of failing early.

Useful? React with 👍 / 👎.

…lp handling)

Two CodeRabbit minor findings from round-2 review.

## CodeRabbit minor #1 (errors.go:260-264) - stale skip-condition doc

The ErrLocalEpochRollback skip-conditions comment in errors.go
still listed "no registry row" as an unconditional skip. After
the round-2 storageEnvelopeActive split, the missing-row case
is no longer unconditional: pre-cutover it's a skip (returns
nil), post-cutover it's the refusal that the guard exists to
catch.

Fix: rewrote the skip-conditions paragraph to enumerate just
the two unconditional skips (encryption disabled, no active
storage DEK) and cross-reference local_epoch_rollback.go for
the §5.2 missing-row split. The comprehensive doc on
CheckLocalEpochRollback is the authoritative source; errors.go
now points there.

## CodeRabbit minor #2 (encryption.go:85-87) - ErrHelp handling

probe-node-id wrapped flag.ErrHelp as an error, breaking the
exit-zero-on-help convention used by every other encryption
subcommand:

  - runEncryptionStatus (encryption.go:169-172)
  - runEncryptionRotateDEK (encryption_mutators.go:74)
  - runEncryptionRegisterWriter (encryption_mutators.go:119)
  - runEncryptionBootstrap (encryption_mutators.go:261)

Operators running `elastickv-admin encryption probe-node-id -h`
would get exit code 2 (the flag package default for help) which
breaks shell scripts that test $?.

Fix: added the standard `if errors.Is(err, flag.ErrHelp) { return nil }`
short-circuit, matching the existing pattern. Added
TestRunEncryptionProbeNodeID_HelpFlagExitsZero to pin the
behavior.

## Verification

  - go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/ - PASS (24 tests total in these files)
  - golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ - 0 issues

No semantic changes to function signatures or behavior outside
of the help-flag exit code.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 18, 2026

@claude review

Round-3 (commit 189cd596): closed out two CodeRabbit minors from round-2.

CodeRabbit minor — ErrLocalEpochRollback doc stale (errors.go)

After the round-2 storageEnvelopeActive split, the missing-row case is no longer an unconditional skip — pre-cutover it's a skip, post-cutover it's the refusal. Rewrote the skip-conditions comment in errors.go to enumerate only the two unconditional skips and cross-reference local_epoch_rollback.go for the §5.2 missing-row split.

CodeRabbit minor — probe-node-id -h exit code

The subcommand wrapped flag.ErrHelp as an error, breaking the exit-zero-on-help convention used by every other encryption subcommand. Added the standard if errors.Is(err, flag.ErrHelp) { return nil } short-circuit + TestRunEncryptionProbeNodeID_HelpFlagExitsZero pinning test.

Both were CodeRabbit minor (below P0/P1 threshold) but tiny fixes that close out the round.

No semantic changes to function signatures outside the help-flag exit code.

@claude
Copy link
Copy Markdown

claude Bot commented May 18, 2026

Claude finished @bootjp's task in 3m 13s —— View job


Stage 6D-2 Code Review — Round 3

Tasks:

  • Gather context on Round-3 changes
  • Read errors.go — validate stale doc fix
  • Read encryption.go — validate flag.ErrHelp fix + test
  • Check outstanding codex P2 (registry row full_node_id validation)
  • Post final review

Review

1. ErrLocalEpochRollback doc fix ✅

errors.go:260-271 now correctly describes the pre/post-cutover split: "The missing-registry-row case is NOT an unconditional skip" with the two branches enumerated. Cross-reference to local_epoch_rollback.go is correct. The prior stale phrasing is gone. Done.


2. probe-node-id -h exit code fix ✅

encryption.go:85-96: The flag.ErrHelp short-circuit is correctly placed with an explanatory comment that cites the convention across the other subcommands. TestRunEncryptionProbeNodeID_HelpFlagExitsZero at encryption_test.go:550-557 pins -h → nil. The pattern matches runEncryptionStatus exactly. Done.


Open: Codex round-2 P2 — registryRow.FullNodeID mismatch (not addressed in round-3)

local_epoch_rollback.go:107-113: after DecodeRegistryValue the code compares epochs without verifying the decoded registryRow.FullNodeID == fullNodeID.

The code in question:

registryRow, err := DecodeRegistryValue(rawVal)
// ... error check ...
if sidecarLocalEpoch <= registryRow.LastSeenLocalEpoch {
    return pkgerrors.Wrapf(ErrLocalEpochRollback, ...)
}

Why this matters: the registry key is RegistryKey(dekID, uint16(fullNodeID & 0xFFFF)) — keyed by the 16-bit truncation. If a node with a colliding full_node_id was previously a cluster member and has since been removed, its registry row persists under the same key. CheckNodeIDCollision reads the current route-catalog snapshot and would not see the departed node's full_node_id, so the collision check would pass. CheckLocalEpochRollback would then read the stale row from the departed node and compare epochs against it — a comparison that is semantically meaningless for the new node and could either falsely fire ErrLocalEpochRollback (if departed node's epoch > new node's sidecar epoch) or silently pass (if departed node's epoch < new node's sidecar epoch), defeating the missing-row post-cutover refusal (storageEnvelopeActive=true).

Fix — 4 lines after the DecodeRegistryValue call:

registryRow, err := DecodeRegistryValue(rawVal)
if err != nil {
    return pkgerrors.Wrapf(err, ...)
}
// Row belongs to a different writer under the same 16-bit slot (e.g. departed node).
// From this node's perspective the slot is unoccupied; apply the missing-row logic.
if registryRow.FullNodeID != fullNodeID {
    if storageEnvelopeActive {
        return pkgerrors.Wrapf(ErrLocalEpochRollback,
            "writer-registry slot for full_node_id=%#x dek_id=%d is occupied by a different full_node_id=%#x; cannot prove nonce monotonicity",
            fullNodeID, activeStorageDEKID, registryRow.FullNodeID)
    }
    return nil
}
if sidecarLocalEpoch <= registryRow.LastSeenLocalEpoch {

Add a companion test TestCheckLocalEpochRollback_MismatchedFullNodeID with sub-cases for pre/post-cutover to match the existing _NoRow_PreCutover / _NoRow_PostCutover pair.

Runtime risk today: zero — both guards are operator-inert until 6D-6. This can be fixed in this PR or explicitly deferred to 6D-6 with a // NOTE: caller must ensure CheckNodeIDCollision passed first comment. Given the defense-in-depth posture of the entire guard system, fixing it here is preferred.

Fix this →


Open nitpick: //nolint:gosec in encryption.go:106 (CLAUDE.md)

encryption.go:106:

const nodeIDMask = 0xFFFF
narrowed := uint16(full & nodeIDMask) //nolint:gosec

CLAUDE.md requires "Avoid //nolint — refactor instead." For the CLI code, the nolint can be eliminated cleanly because narrowed is only passed to fmt.Fprintf (no uint16-typed callee):

const nodeIDMask uint64 = 0xFFFF
narrowed := full & nodeIDMask  // uint64 bounded to 16 bits; no cast needed

local_epoch_rollback.go:83 is harder: RegistryKey takes uint16, so the narrowing cast to uint16 is required by the callee. If the project wants to eliminate the nolint there too, the cleanest approach is a package-level func narrowNodeID(full uint64) uint16 in registry.go with a single nolint — reducing duplication and centralizing the policy. But that can be a follow-up in any of 6D-3 through 6D-6 when the nolint pattern appears in more places.

For now, encryption.go:106 is the actionable one-line refactor; local_epoch_rollback.go:83 can wait.

Fix this →


Summary

The two round-3 fixes are correct: the errors.go doc now accurately describes the pre/post-cutover missing-row split, and probe-node-id -h exits 0 per the established subcommand convention. One codex P2 from round-2 remains open: CheckLocalEpochRollback does not verify registryRow.FullNodeID == fullNodeID, which can produce a wrong result for a node whose 16-bit slot was previously occupied by a since-departed peer. The fix is 4 lines + 1 test, and since the PR is operator-inert until 6D-6 there is no runtime urgency — but addressing it here keeps the primitive self-contained.

…ullNodeID match)

Codex r3 P2 on PR788: CheckLocalEpochRollback looked up the
registry row by uint16(fullNodeID&nodeIDMask) and compared
epochs without verifying that the decoded RegistryValue's
FullNodeID matched the caller's argument. If a different writer
previously occupied the same 16-bit slot (historical
occupancy that the route-catalog watcher would not surface to
ErrNodeIDCollision), this function would treat that foreign
row as THIS node's rollback anchor. When sidecarLocalEpoch
happens to exceed the foreign LastSeenLocalEpoch, the function
would return nil and the missing-row-post-cutover refusal
would be bypassed.

## Fix: defense-in-depth FullNodeID equality check

After decoding the registry row, compare registryRow.FullNodeID
against the caller's fullNodeID. On mismatch, route through the
existing 5.2 split via a new shared helper:

  - missingRegistryRowResult(storageEnvelopeActive, ...) -
    pre-cutover allow startup, post-cutover refuse with
    ErrLocalEpochRollback.

The mismatch is treated as "no row for THIS node" because the
node's rollback anchor genuinely doesn't exist - the foreign
writer's anchor doesn't apply to THIS node's monotonicity
contract.

This is layered defense: ErrNodeIDCollision is the primary
guard for active collisions, but a since-removed member can
leave its row behind. The FullNodeID check handles that case
correctly without depending on ErrNodeIDCollision's
route-catalog snapshot.

## Test coverage

Two new tests pin the foreign-row scenarios:

  - TestCheckLocalEpochRollback_ForeignRowPreCutover - foreign
    row + pre-cutover -> allow startup (next OpRegistration
    will overwrite via 4.1 case 1).
  - TestCheckLocalEpochRollback_ForeignRowPostCutover - foreign
    row + post-cutover -> refuse with ErrLocalEpochRollback
    AND uses the missing-row context message
    ("storage_envelope_active=true"). Includes the
    "sidecar > foreign.LastSeen" sub-case which would have
    masked the bug without the FullNodeID check.

The existing TestCheckLocalEpochRollback_NarrowingMatchesShippedConvention
was confusingly using the foreign-row scenario as its fixture.
Rewrote it to use the SAME full_node_id for seed and lookup
(still confirms the high-bit-aware uint16 narrowing), and
moved the cross-full_node_id case to the dedicated foreign-row
tests where it belongs.

## Caller audit (per cron directive)

CheckLocalEpochRollback's exported signature is unchanged - the
semantic change is internal (a new branch that routes through
the existing 5.2 split). No production callers (operator-inert
per 6D-2). All 6 existing test callers continue to work
unchanged.

## Verification

  - go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/ - PASS (29 tests total in these files)
  - golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ - 0 issues
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 18, 2026

@claude review

Round-4 (commit 2ca075ac): addressed codex r3 P2.

codex P2 — registry row FullNodeID mismatch

Defense-in-depth: after decoding the registry row, verify registryRow.FullNodeID == fullNodeID. On mismatch (historical occupant of the same 16-bit slot — not catchable by ErrNodeIDCollision because the route-catalog watcher doesn't see since-removed members), route through the same §5.2 split as the no-row case:

  • pre-cutover → allow startup (4.1 case-1 will overwrite the foreign row on next OpRegistration)
  • post-cutover → refuse with ErrLocalEpochRollback + missing-row context

Without this check, a sidecar that happens to be ahead of the foreign writer's LastSeenLocalEpoch would return nil and bypass the missing-anchor refusal.

Tests

Two new pinning tests:

  • ForeignRowPreCutover — foreign row + pre-cutover → nil
  • ForeignRowPostCutover — foreign row + post-cutover → ErrLocalEpochRollback with "storage_envelope_active=true" message. Includes the sidecar>foreign-LastSeen sub-case which would have masked the bug pre-fix.

The existing narrowing test was inadvertently exercising the foreign-row case as its fixture. Rewrote it to use the same full_node_id for seed + lookup (preserves the high-bit-aware narrowing assertion).

Caller audit

Signature unchanged; internal branch added. No production callers. All 6 existing test callers still work.

Verification

  • go test -race ./internal/encryption/ ./cmd/elastickv-admin/ — PASS
  • golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ — 0 issues

@claude
Copy link
Copy Markdown

claude Bot commented May 18, 2026

Claude finished @bootjp's task in 1m 42s —— View job


Stage 6D-2 Code Review — Round 4

Tasks:

  • Gather context on Round-4 changes
  • Read local_epoch_rollback.go — validate FullNodeID mismatch fix
  • Read local_epoch_rollback_test.go — validate new pinning tests
  • Verify narrowing test fixture rewrite
  • Check any remaining open items from prior rounds
  • Post final review

Review

1. missingRegistryRowResult helper ✅

Extracting the shared §5.2 split into missingRegistryRowResult (lines 82-89) is a clean DRY move. The function's doc comment at lines 76-81 correctly identifies both call sites — the !ok (no-row from Pebble) path and the new FullNodeID-mismatch path — so the invariant is documented once and applied consistently in both branches. ✅


2. FullNodeID mismatch guard ✅

local_epoch_rollback.go:128-130:

if registryRow.FullNodeID != fullNodeID {
    return missingRegistryRowResult(storageEnvelopeActive, fullNodeID, activeStorageDEKID)
}

The comment at lines 114-127 correctly explains:

  • Why the collision guard does not catch this case (route-catalog watcher sees only current members; historical occupants leave a row behind).
  • Pre-cutover behaviour (allow startup; §4.1 case-1 overwrites the foreign row on the next OpRegistration).
  • Post-cutover behaviour (refuse; no anchor for THIS node).

Routing through missingRegistryRowResult means the message the operator sees is the same "missing anchor" diagnostic regardless of whether the slot was empty or occupied by a departed peer — which is semantically correct (from this node's perspective, the anchor is absent either way). ✅


3. ForeignRowPostCutover test sub-case ✅

local_epoch_rollback_test.go:210-216 adds the exact scenario that was the risk before the fix: sidecar=100 > foreign.LastSeen=5. Pre-fix this would have fallen through to the epoch comparison, returned nil (sidecar > foreign), and the missing anchor would have gone undetected. The test comment at lines 208-215 names the failure mode precisely. ✅


4. Narrowing test fixture rewrite ✅

TestCheckLocalEpochRollback_NarrowingMatchesShippedConvention now seeds and looks up with const fullID uint64 = 0xDEADBEEF_0000AAAA — the same value for both — so it tests the key derivation convention without incidentally exercising the foreign-row path. The comment at lines 233-234 calls out the separation explicitly. ✅


5. Prior-round open items status

Item Status
strconv.ParseUint silent-prefix fix (r1) ✅ Fixed in r1
go doc attribution for CheckNodeIDCollision (r1) ✅ Fixed in r1
storageEnvelopeActive §5.2 split (r2) ✅ Fixed in r2
errors.go stale doc (r3) ✅ Fixed in r3
probe-node-id -h exit code (r3) ✅ Fixed in r3
registryRow.FullNodeID mismatch (r3) ✅ Fixed in r4 (this round)
//nolint:gosec in encryption.go:106 (r3 nitpick) ⚪ Not yet addressed

One open nitpick: //nolint:gosec in encryption.go:106

CodeRabbit raised this in round-3 (and I noted it) — CLAUDE.md requires "avoid //nolint — refactor instead". The fix remains the same 2-line change:

// Before
const nodeIDMask = 0xFFFF
narrowed := uint16(full & nodeIDMask) //nolint:gosec

// After — mask typed uint64; no cast needed; fmt.Fprintf accepts any integer
const nodeIDMask uint64 = 0xFFFF
narrowed := full & nodeIDMask

The local_epoch_rollback.go:98 and node_id_collision.go:59 nolints are harder — their callees require uint16 — so those are reasonably deferred to the narrowNodeID(uint64) uint16 helper cleanup in 6D-3+. But encryption.go:106 has no uint16-typed callee (narrowed feeds only fmt.Fprintf) and is the clean 2-liner. Either fix it here or explicitly accept a follow-up cleanup PR for all three at once.

Fix this →


Summary

Round-4 is correct. The FullNodeID mismatch guard is sound: missingRegistryRowResult cleanly centralises the §5.2 split, the mismatch check fires before the epoch comparison, and the sub-case that would have masked the anchor absence (sidecar > foreign.LastSeen) is explicitly pinned by the test. The narrowing test rewrite correctly separates "key derivation correctness" from "foreign-row handling". All five prior-round findings are now resolved. The one remaining item is the encryption.go:106 nolint — either close it with a 2-line fix or batch it with the narrowNodeID cleanup deferred to 6D-3+.

…ude r4 deferred item)

Use the uint64 mask result directly for %#04x / %d formatting; the
uint16 cast was only there for display and pulled in a gosec G115
nolint that did not exist anywhere else in the operator-tool path.

The single-source narrow-and-cast convention (applier.go:230,
local_epoch_rollback.go:98, node_id_collision.go:59) is unaffected
- those sites genuinely need uint16 to build the 6-byte registry
key. Display-only sites do not, and now do not pay the nolint cost.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 19, 2026

@claude review

Round-5 (commit 5aaa3e0b): closed out the deferred encryption.go:106 //nolint:gosec from round-4.

Change

// Before
const nodeIDMask = 0xFFFF
narrowed := uint16(full & nodeIDMask) //nolint:gosec // masked to 16 bits; matches applier.go convention
fmt.Fprintf(out, "...node_id: %#04x (%d)\n", full, full, narrowed, narrowed)

// After
const nodeIDMask = 0xFFFF
// keep masked as uint64: the value is only formatted for display, ...
masked := full & nodeIDMask
fmt.Fprintf(out, "...node_id: %#04x (%d)\n", full, full, masked, masked)

%#04x / %d format identically on uint64 and uint16 for values in [0, 0xFFFF], so output bytes are unchanged (node_id: 0x1234 and node_id: 0x5678 test fixtures both still match exactly).

Why this site and not the others

The 3 production sites still keep the convention because they need the actual uint16 type:

  • applier.go:230 — building the 6-byte registry key (dekID || uint16(full_node_id & 0xFFFF))
  • local_epoch_rollback.go:98 — same registry-key derivation
  • node_id_collision.go:59 — building the map[uint16]uint64 collision-detection map

encryption.go:106 is the only display-only site. The narrowing exists purely to show the operator the 16-bit slice in human-readable output, and the uint64 mask result formats identically — so the cast (and the nolint:gosec) was load-bearing nowhere.

Verification

  • go test ./cmd/elastickv-admin/... — all probe-node-id tests pass with unchanged output bytes (TestRunEncryptionProbeNodeID_Hex expects node_id: 0x1234, TestRunEncryptionProbeNodeID_Decimal expects node_id: 0x5678)
  • golangci-lint run cmd/elastickv-admin/... — 0 issues
  • No semantic change to caller behavior; runEncryptionProbeNodeID is a leaf and is invoked only via the probe-node-id subcommand dispatcher

@claude
Copy link
Copy Markdown

claude Bot commented May 19, 2026

Claude finished @bootjp's task in 2m 8s —— View job


Stage 6D-2 Code Review — Round 5

Tasks:

  • Gather context on Round-5 changes
  • Read cmd/elastickv-admin/encryption.go — validate nolint removal + variable rename
  • Read cmd/elastickv-admin/encryption_test.go — verify test fixtures unchanged
  • Check all prior-round items resolved
  • Post final review

Review

1. //nolint:gosec removal ✅

encryption.go:105-109:

// Before (Round-4)
const nodeIDMask = 0xFFFF
narrowed := uint16(full & nodeIDMask) //nolint:gosec
fmt.Fprintf(out, "...%#04x (%d)\n", ..., narrowed, narrowed)

// After (Round-5)
const nodeIDMask = 0xFFFF
// keep masked as uint64: the value is only formatted for display...
masked := full & nodeIDMask
fmt.Fprintf(out, "...%#04x (%d)\n", ..., masked, masked)

nodeIDMask is an untyped constant 0xFFFF. In full & nodeIDMask, full is uint64, so the constant is promoted to uint64 and masked is uint64 bounded to [0, 0xFFFF] by construction. No explicit cast, no gosec warning, CLAUDE.md's no-nolint rule is satisfied.


2. Output format invariant ✅

The PR description correctly claims %#04x (%d) produces identical output for uint64 and uint16 at the same value. Go's fmt package formats the numeric value, not the type — uint64(0x1234) and uint16(0x1234) both produce 0x1234 under %#04x. The test fixtures pin this:

  • TestRunEncryptionProbeNodeID_Hex expects "node_id: 0x1234"
  • TestRunEncryptionProbeNodeID_Decimal expects "node_id: 0x5678"

No output bytes changed.


3. Variable rename narrowedmasked

The rename is accurate: the variable is no longer a narrowed uint16 — it's a uint64 masked to 16 bits. The comment on line 106 explains the WHY (the cast was not load-bearing at a display-only site), which meets the "non-obvious constraint" bar from CLAUDE.md.


4. Production-site nolints correctly retained ✅

The PR description is correct that the three remaining sites must keep uint16:

Site Reason
local_epoch_rollback.go:98 RegistryKey(dekID, uint16) signature
node_id_collision.go:59 map[uint16]uint64 key type
applier.go:230 same registry-key derivation

For these sites, a narrowNodeID(uint64) uint16 helper would consolidate the nolint to one location — but that's a deferred follow-up in 6D-3+, as the prior round established.


5. Prior-round item status

Round Item Status
R1 strconv.ParseUint silent-prefix
R1 go doc attribution for CheckNodeIDCollision
R2 storageEnvelopeActive §5.2 missing-row split
R3 errors.go stale skip-condition doc
R3 probe-node-id -h exits 0
R4 registryRow.FullNodeID mismatch guard
R4 encryption.go:106 //nolint:gosec this round

All rounds resolved. The two remaining //nolint:gosec annotations in local_epoch_rollback.go:98 and node_id_collision.go:59 are correctly deferred to the narrowNodeID helper cleanup in 6D-3+.


Summary

Round-5 is correct. The //nolint:gosec is eliminated by keeping masked as uint64 — the only display-only site — with no change to output bytes. All seven prior-round findings are resolved. No new issues. This PR is ready to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant