perf(controlplane): single-pod CAS backend validation via advisory lock #3126

Open
migmartri wants to merge 1 commit into chainloop-dev:main from migmartri:worktree-luminous-skipping-dijkstra

@migmartri (Member) commented May 16, 2026

Summary

Background CAS-backend validation was running once per replica, hitting external providers N times per tick. The middleware revalidation window (5 min) was also shorter than the background cadence (30 min), so the bulk of validation work happened synchronously on the request path.

This PR makes the background loop run once per cluster per tick (instead of once per pod) and aligns the request-path window with that cadence.
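The window mismatch can be illustrated with a little arithmetic. This is a minimal sketch, assuming the middleware treats a backend as stale once its last validation is older than the offset; the function and constant names are hypothetical, and only the 5, 30 and 35 minute values come from this PR.

```go
package main

import (
	"fmt"
	"time"
)

// Values taken from the PR description; the names are hypothetical.
const (
	backgroundCadence = 30 * time.Minute
	oldOffset         = 5 * time.Minute
	newOffset         = 35 * time.Minute
)

// needsRevalidation reports whether the request path would revalidate a
// backend whose last validation happened `age` ago (assumed semantics).
func needsRevalidation(age, offset time.Duration) bool {
	return age > offset
}

// uncoveredWindow returns how much of each background cycle is left for the
// request path to treat as stale before the next background run.
func uncoveredWindow(cadence, offset time.Duration) time.Duration {
	if offset >= cadence {
		return 0
	}
	return cadence - offset
}

func main() {
	fmt.Println(uncoveredWindow(backgroundCadence, oldOffset)) // 25m0s
	fmt.Println(uncoveredWindow(backgroundCadence, newOffset)) // 0s
}
```

With the old offset, a backend looks stale to the middleware for most of each background cycle; with the new one, the background loop should always get there first as long as it runs on schedule.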

Changes

  • New biz.DistributedLock interface with a Postgres pg_try_advisory_lock implementation in pkg/data/lock.go.
  • CASBackendChecker.checkBackends acquires the lock per tick; replicas that lose the race skip the tick, so the validation runs exactly once across the cluster. Per-scope keys keep the defaults (30 min) and all-backends (24 h) checkers from blocking each other.
  • Data now exposes the underlying *sql.DB, so raw SQL features that ent doesn't surface (such as session-scoped locks) are reachable from the data layer.
  • validationTimeOffset raised from 5 min to 35 min so the middleware no longer revalidates ahead of the background loop.
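The skip-on-lost-race behaviour described in the list above might look roughly like the following. This is a sketch only: the PR names biz.DistributedLock but its method set is not shown here, so the TryLock signature, the memLock stand-in, runTick, simulateTick and the scope strings are all assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// DistributedLock is a hypothetical shape for the interface named in the PR;
// the real method set is not shown in the source.
type DistributedLock interface {
	// TryLock returns acquired=false immediately if the key is already held.
	TryLock(ctx context.Context, key string) (release func(), acquired bool, err error)
}

// memLock is an in-memory stand-in for the Postgres-backed implementation.
type memLock struct {
	mu   sync.Mutex
	held map[string]bool
}

func newMemLock() *memLock { return &memLock{held: map[string]bool{}} }

func (l *memLock) TryLock(_ context.Context, key string) (func(), bool, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.held[key] {
		return nil, false, nil
	}
	l.held[key] = true
	release := func() {
		l.mu.Lock()
		defer l.mu.Unlock()
		delete(l.held, key)
	}
	return release, true, nil
}

// runTick runs validate only if this replica wins the lock for scope.
// It reports whether validate was executed.
func runTick(ctx context.Context, lock DistributedLock, scope string, validate func(context.Context) error) (bool, error) {
	release, ok, err := lock.TryLock(ctx, scope)
	if err != nil {
		return false, fmt.Errorf("acquiring lock %q: %w", scope, err)
	}
	if !ok {
		return false, nil // another replica owns this tick; skip it
	}
	defer release()
	return true, validate(ctx)
}

// simulateTick models `replicas` pods attempting the same scope while the
// winner is still validating, and returns how many validations ran.
func simulateTick(replicas int) int {
	lock := newMemLock()
	ctx := context.Background()
	runs := 0
	validate := func(context.Context) error { runs++; return nil }
	_, _ = runTick(ctx, lock, "defaults", func(ctx context.Context) error {
		for i := 1; i < replicas; i++ {
			_, _ = runTick(ctx, lock, "defaults", validate) // loses the race
		}
		return validate(ctx)
	})
	return runs
}

func main() {
	fmt.Println("validations run by 3 replicas:", simulateTick(3))
}
```

Because the lock is keyed by scope, a tick for one scope does not block a tick for another. Note that a try-lock only deduplicates attempts that overlap in time, so the once-per-tick result depends on replicas ticking while the winner still holds the lock.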

fixes #3125.

This PR was assisted by Claude Code.

@chainloop-platform Bot (Contributor) commented May 16, 2026

AI Session Analysis

| Avg score | Sessions | Failing policies | Attribution | Files | Lines | Total Duration |
|---|---|---|---|---|---|---|
| 🟡 81% | 1 | ✅ 0 | 43% AI / 57% Human | 5 | +160 / -9 | 48m47s |

🟡 81% — 43% AI — ✅ All policies passing

May 16, 2026 20:15 UTC · 48m47s · $27.22 · 2.6k in / 128.6k out · claude-code 2.1.143 (claude-opus-4-7)

Change Summary

Adds PostgresLock advisory lock implementation (~101 lines) to prevent duplicate CAS backend checks across pods. Adds DistributedLock interface to biz layer. Updates org-requirements middleware validation window to 35 minutes. Adds lock-hold and unlock timeouts; relocates DistributedLock interface at user request.

AI Session Overall Score

🟡 81% — Solid implementation with good alignment; missing tests for new lock code.

AI Session Analysis Breakdown

🟢 90% · scope-discipline

🟡 License header year updated as unsolicited drive-by change in middleware.go. · Low Severity

🟢 88% · alignment

🟢 AI proposed options before implementing; user chose Postgres advisory lock over NATS. · High Impact

🟢 88% · solution-quality

🟢 Root cause identified (per-pod duplication); fixed with pg_try_advisory_lock, no new dependencies. · High Impact

🟡 78% · user-trust-signal

🟡 User interrupted tool execution mid-run then redirected to rebase upstream main. · Low Severity

🟡 72% · verification

🟢 go test ./pkg/biz/... run twice with visible output: 24 tests passed across 3 packages. · High Impact

🟠 New PostgresLock implementation (~101 lines) has no unit tests for acquisition, release, or timeout. · Medium Severity

💡 Add unit tests for advisory lock acquisition, release, and timeout using a mock DB.

🟡 62% · context-and-planning

🟠 Initial user prompt was vague with no stated constraints, scope, or success criteria. · Medium Severity

💡 State goal, constraints, and acceptance criteria up front even for investigative tasks.

🟡 No formal written plan document; planning was inline prose before implementation. · Low Severity


File Attribution

████████░░░░░░░░░░░░ 43% AI / 57% Human

| Status | Attribution | File | Lines |
|---|---|---|---|
| created | ai | app/controlplane/pkg/data/lock.go | +101 / -0 |
| modified | ai | app/controlplane/pkg/biz/casbackend_checker.go | +42 / -1 |
| modified | ai | app/controlplane/pkg/data/data.go | +11 / -6 |
| modified | ai | app/controlplane/internal/usercontext/orgrequirements_middleware.go | +4 / -1 |
| modified | human | app/controlplane/cmd/wire_gen.go | +2 / -1 |

Policies (4)

| Status | Policy | Material | Messages |
|---|---|---|---|
| ✅ Passed | ai-config-ai-agents-allowed | ai-coding-session-052e8b | - |
| ✅ Passed | ai-config-no-dangerous-commands | ai-coding-session-052e8b | - |
| ✅ Passed | ai-config-no-secrets | ai-coding-session-052e8b | - |
| ✅ Passed | ai-config-mcp-servers-allowed | ai-coding-session-052e8b | - |

Powered by Chainloop and Chainloop Trace

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 6 files

Background CAS-backend validation was running once per replica, hitting
external providers N times per tick. The middleware revalidation window
(5 min) was also shorter than the background cadence (30 min), so the
bulk of validation work happened synchronously on the request path.

Changes:

- Add DistributedLock interface (in casbackend_checker.go) with a Postgres
  pg_try_advisory_lock implementation in pkg/data/lock.go. Postgres is
  used because it's the only mandatory piece of infrastructure across
  deployments — NATS is optional.
- CASBackendChecker.checkBackends acquires the lock per tick; replicas
  that lose the race skip the tick, so the validation runs exactly once
  across the cluster. Per-scope keys keep the defaults (30 min) and
  all-backends (24 h) checkers from blocking each other.
- Cap the lock-hold time at 25 min so a hung validation doesn't pin the
  lock past one tick. Bound the pg_advisory_unlock call at 5 s so a
  stuck session can't hang the release path — Postgres releases the lock
  on session disconnect anyway.
- Data exposes the underlying *sql.DB so that raw SQL features ent doesn't
  surface (such as session-scoped locks) are reachable from the data layer.
- validationTimeOffset raised from 5 min to 35 min so the middleware no
  longer revalidates ahead of the background loop.
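The acquire, hold and release steps in the list above could be sketched with database/sql as follows. This is not the code in pkg/data/lock.go, which is not shown here: withAdvisoryLock, lockKey, the FNV-based key derivation and the scope strings are assumptions, while the 25 min hold cap, the 5 s unlock bound and the two Postgres functions come from the commit message.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"hash/fnv"
	"time"
)

// Limits taken from the commit message; the constant names are hypothetical.
const (
	maxLockHold   = 25 * time.Minute
	unlockTimeout = 5 * time.Second
)

// lockKey maps a scope name to the bigint key that advisory locks expect.
// Hashing with FNV-1a is an assumption; the PR does not say how keys are derived.
func lockKey(scope string) int64 {
	h := fnv.New64a()
	h.Write([]byte(scope))
	return int64(h.Sum64())
}

// withAdvisoryLock runs fn only if this session wins the lock for scope.
// It reports whether fn was executed.
func withAdvisoryLock(ctx context.Context, db *sql.DB, scope string, fn func(context.Context) error) (bool, error) {
	// Advisory locks are session-scoped, so pin one connection from the pool
	// and use it for both the lock and the unlock.
	conn, err := db.Conn(ctx)
	if err != nil {
		return false, fmt.Errorf("pinning connection: %w", err)
	}
	defer conn.Close()

	key := lockKey(scope)
	var acquired bool
	if err := conn.QueryRowContext(ctx, "SELECT pg_try_advisory_lock($1)", key).Scan(&acquired); err != nil {
		return false, fmt.Errorf("trying advisory lock: %w", err)
	}
	if !acquired {
		return false, nil // another session holds the lock; skip this tick
	}

	defer func() {
		// Use a fresh context: the caller's may already be cancelled.
		uctx, cancel := context.WithTimeout(context.Background(), unlockTimeout)
		defer cancel()
		_, _ = conn.ExecContext(uctx, "SELECT pg_advisory_unlock($1)", key)
	}()

	// Cap how long the work may hold the lock.
	holdCtx, cancel := context.WithTimeout(ctx, maxLockHold)
	defer cancel()
	return true, fn(holdCtx)
}

func main() {
	// No database is needed to inspect the derived keys.
	for _, scope := range []string{"defaults", "all-backends"} {
		fmt.Printf("%s -> %d\n", scope, lockKey(scope))
	}
}
```

One design point worth keeping in mind: sql.Conn.Close returns the connection to the pool rather than ending the session, so the "released on disconnect" fallback only applies if the connection is actually discarded. How the PR handles a failed unlock is not shown in this page.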

Refs chainloop-dev#3125

Assisted-by: Claude Code
Signed-off-by: Miguel Martinez Trivino <miguel@chainloop.dev>

Chainloop-Trace-Sessions: 052e8b56-72b5-4c6c-8d82-ab2d00728889

@migmartri force-pushed the worktree-luminous-skipping-dijkstra branch from cdc4e95 to 5170a00 on May 16, 2026 21:04
@migmartri requested a review from a team on May 16, 2026 21:06

Development

Successfully merging this pull request may close these issues.

Improve CAS backend validation: misaligned windows and per-pod duplication
