Background CAS-backend validation was running once per replica, hitting
external providers N times per tick. The middleware revalidation window
(5 min) was also shorter than the background cadence (30 min), so the
bulk of validation work happened synchronously on the request path.
Changes:
- Add DistributedLock interface (in casbackend_checker.go) with a Postgres
pg_try_advisory_lock implementation in pkg/data/lock.go. Postgres is
used because it's the only mandatory piece of infrastructure across
deployments — NATS is optional.
- CASBackendChecker.checkBackends acquires the lock per tick; replicas
that lose the race skip the tick, so the validation runs exactly once
across the cluster. Per-scope keys keep the defaults (30 min) and
all-backends (24 h) checkers from blocking each other.
- Cap the lock-hold time at 25 min so a hung validation doesn't pin the
lock past one tick. Bound the pg_advisory_unlock call at 5 s so a
stuck session can't hang the release path — Postgres releases the lock
on session disconnect anyway.
- Data exposes the underlying *sql.DB so raw SQL features ent doesn't
surface (session-scoped locks) are reachable from the data layer.
- validationTimeOffset raised from 5 min to 35 min so the middleware no
longer revalidates ahead of the background loop.
Refs chainloop-dev#3125
Assisted-by: Claude Code
Signed-off-by: Miguel Martinez Trivino <miguel@chainloop.dev>
Chainloop-Trace-Sessions: 052e8b56-72b5-4c6c-8d82-ab2d00728889
Summary
This PR makes the background loop run once per cluster per tick (instead of once per pod) and aligns the request-path window with that cadence.
Changes
- biz.DistributedLock interface with a Postgres pg_try_advisory_lock implementation in pkg/data/lock.go.
- CASBackendChecker.checkBackends acquires the lock per tick; replicas that lose the race skip the tick, so the validation runs exactly once across the cluster. Per-scope keys keep the defaults (30 min) and all-backends (24 h) checkers from blocking each other.
- Data exposes the underlying *sql.DB so raw SQL features ent doesn't surface (session-scoped locks) are reachable from the data layer.
- validationTimeOffset raised from 5 min to 35 min so the middleware no longer revalidates ahead of the background loop.

Fixes #3125.
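
For readers unfamiliar with session-scoped advisory locks, the sketch below shows one way to take such a lock through a raw *sql.DB. It is not the contents of pkg/data/lock.go: lockKey, the FNV hashing of a scope name, the scope strings, and tryAdvisoryLock are assumptions made for illustration. The point it shows is that the lock belongs to a Postgres session, so the acquire and the unlock have to run on the same pinned connection rather than on whatever connection the pool hands out.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"hash/fnv"
	"time"
)

// lockKey maps a scope name to the 64-bit key that advisory locks take.
// Hashing the name is an assumption; fixed constants would work equally well.
func lockKey(scope string) int64 {
	h := fnv.New64a()
	h.Write([]byte(scope)) // Write on a hash.Hash never returns an error
	return int64(h.Sum64())
}

// tryAdvisoryLock pins one connection, tries the lock without blocking, and
// returns a release func that unlocks on that same connection.
func tryAdvisoryLock(ctx context.Context, db *sql.DB, key int64) (release func() error, ok bool, err error) {
	conn, err := db.Conn(ctx) // pin a single session from the pool
	if err != nil {
		return nil, false, fmt.Errorf("pinning connection: %w", err)
	}
	if err = conn.QueryRowContext(ctx, "SELECT pg_try_advisory_lock($1)", key).Scan(&ok); err != nil {
		conn.Close()
		return nil, false, fmt.Errorf("pg_try_advisory_lock: %w", err)
	}
	if !ok {
		conn.Close()
		return nil, false, nil // another session holds the key
	}
	release = func() error {
		// Close only returns the connection to the pool; it does not end the
		// session, so the explicit unlock below is what frees the lock here.
		defer conn.Close()
		uctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		var unlocked bool
		return conn.QueryRowContext(uctx, "SELECT pg_advisory_unlock($1)", key).Scan(&unlocked)
	}
	return release, true, nil
}

func main() {
	// Distinct scopes yield distinct keys, so the two checkers never contend.
	for _, scope := range []string{"defaults", "all-backends"} {
		fmt.Printf("%s -> %d\n", scope, lockKey(scope))
	}
}
```

Running tryAdvisoryLock needs a registered Postgres driver and a live database, which the sketch deliberately leaves out; main only exercises the key derivation.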
This PR was assisted by Claude Code.