Engine: per-request read/work budget + backstop observability (#50)#62
Conversation
Bound the store work a single evaluation can do: evalState now carries a
read counter charged on every ListSubjects call, capped by a configurable
budget (GATEWAY_MAX_CHECK_READS, default 5000). Exhausting it returns
ErrEvalBudgetExceeded -> ErrResourceExhausted, not a silent deny, since a
runaway means a pathological/abusive query. ListObjects and CheckSet now
share ONE evalState (one budget) across all candidate/member Checks so a
wide namespace x deep graph cannot multiply into N x budget reads.
Surface which backstop fired via a context-attached Backstops collector
(keeps the engine prometheus-free); the connect layer emits
authz_eval_backstop_total{reason=depth|cycle|budget}, giving on-call an
alertable signal for an abusive tenant or a misconfigured deep/cyclic model.
Add GATEWAY_MAX_CHECK_READS (config -> service -> engine, default 5000)
and the authz_eval_backstop_total{reason} metric. Engine tests use a
counting reader to prove the read count is bounded: a long cyclic parent
chain and a wide/deep branching graph each trip ErrEvalBudgetExceeded near
the budget; a moderate chain stays well under it and succeeds; ListObjects
shares one budget across candidates (fails fast, not N x budget). A connect
test asserts the budget and depth/cycle backstops increment the metric.
Document the knob and metric in README and the per-request-cost/observable-
backstops section of the authorization model.
Closes #50
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8d3717832a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| ctx, backstops := authz.WithBackstops(ctx) | ||
| defer func() { h.metrics.recordBackstops(backstops) }() | ||
| results, err := h.svc.BatchCheck(ctx, p, svcItems) |
There was a problem hiding this comment.
Share the read budget across BatchCheck items
When a BatchCheck contains many individually near-limit checks, this still calls s.svc.BatchCheck, whose implementation loops over s.Check for each item (internal/service/batchcheck.go:42); each Check goes through Engine.CheckWithModel and creates a fresh newEvalState(e.maxReads) (pkg/authz/engine.go:134). That means a single BatchCheck RPC can consume len(items) * GATEWAY_MAX_CHECK_READS tuple reads, up to the batch item cap, instead of the new per-request budget, so the backstop does not protect this request shape.
Useful? React with 👍 / 👎.
| if n > 0 { | ||
| s.maxCheckReads = n | ||
| } |
There was a problem hiding this comment.
Allow zero to disable the read budget
With GATEWAY_MAX_CHECK_READS=0, config passes 0 into this option, but the option ignores non-positive values and leaves s.maxCheckReads at DefaultMaxCheckReads before New seeds the engine. That contradicts the new README/engine contract that 0 or negative is unbounded, so operators cannot disable the budget if it trips legitimate graphs.
Useful? React with 👍 / 👎.
A single item's ErrEvalBudgetExceeded is an item-specific condition (that item's model graph is too deep/branching/cyclic), not a systemic outage, so it must not discard every sibling result. Treat ErrResourceExhausted like a per-item validation error: capture it in the item's result slot and keep evaluating the rest of the batch. Systemic storage/engine failures still abort the whole call with a non-OK status.
ListObjects and the CheckSet member fan-out share ONE read budget across up to maxListObjects candidate Checks, but the flat single-Check budget (default 5000) is far below the legitimate worst case of a full-cap deep scan (candidates x maxDepth). A real ListObjects over a near-cap deep hierarchy was returning ResourceExhausted instead of the correct list. Decouple the fan-out budget from the single-Check budget: a fan-out operation gets max(maxReads, candidates x maxDepth x HEADROOM) with HEADROOM=2, computed at the ListObjects/CheckSet entry where the shared evalState is created. A legit full-cap deep scan (~32000) passes under ~64000 while all-cyclic abuse (~candidates x maxDepth^2) still trips. Single Check/Expand keep the flat maxReads. CheckSet has no member cap, so it borrows the same candidate ceiling a full-cap ListObjects gets. Tests: a production-defaults full-cap (1000) deep (12) ~900-candidate scan now SUCCEEDS and returns the full list; the abuse fixtures are resized to exceed the larger fan-out budget and still trip.
Explain that GATEWAY_MAX_CHECK_READS is the single-Check budget while a
fan-out (ListObjects/CheckSet) gets a budget scaled to
GATEWAY_MAX_LIST_OBJECTS x maxDepth x 2, so the two must be sized
together. Add a runbook note recommending an alert on
authz_eval_backstop_total{reason="budget"} to catch a legitimate tenant
tripping the budget before users notice, and note that a BatchCheck
budget trip is isolated to the offending item.
…udget Expand is a fan-out bounded by the node cap (maxExpandNodes), but it seeded its read budget with the flat single-Check maxReads. A legitimate acyclic Expand whose read count sat between maxReads and the node cap wrongly returned ResourceExhausted before ErrExpandTooLarge could fire. Add Engine.expandBudget(maxNodes) = max(maxReads, maxNodes*fanOutHeadroom), mirroring fanOutBudget but scaling by the node cap (Expand's read cost tracks reachable usersets, not candidates*depth), and seed ExpandWithModel with it. The node cap stays the primary bound; the read budget now only trips on genuinely pathological read-amplification.
Mirror the service-level isolation test at the connect boundary: a budget-
exhausting cyclic item returns HTTP 200 with the budget error in its slot, its
sibling resolves correctly, and authz_eval_backstop_total{reason=budget} is
counted once.
… sizing
Note that GATEWAY_MAX_CHECK_READS must be sized to the deepest/widest real
tenant model, that Expand's budget scales off GATEWAY_MAX_EXPAND_NODES (not the
flat Check budget), that a pathologically wide union/tupleToUserset model could
still need a higher knob, and reiterate the day-one alert on
authz_eval_backstop_total{reason=budget}.
A small-positive value (e.g. 5) silently failed authz closed fleet-wide: every non-trivial Check trips the read budget and returns ResourceExhausted. Reject a configured value below minMaxCheckReads (100) at startup, mirroring the minTenantRateLimitPerMinute guard; 0/negative still selects the service default. Document the floor in the README.
CheckSet sized its fan-out read budget off a hardcoded 1000 instead of the configured GATEWAY_MAX_LIST_OBJECTS, so raising the cap did not lift CheckSet's budget as the scaling contract documents. Add a maxListObjects field and WithMaxListObjects option to the engine, set it from service config alongside WithMaxReads, and size CheckSet's budget off it. Remove the hardcoded literal.
Product-surface RPCs (CreateWorkspace/AddMember/…) drive an engine check through Service.allowed but did not install the backstop collector, so a budget trip there was uncounted on authz_eval_backstop_total. Replace the scattered per-handler WithBackstops installs (Check/Expand/ListObjects/ BatchCheck) with a single Connect interceptor wired onto every service handler, so every engine evaluation — current and future — feeds the metric, counted once per reason per request. Add a product-surface regression test proving a budget trip returns ResourceExhausted and increments the metric.
Closes #50 — bounds per-request recursive cost and makes the engine's fail-closed backstops observable (the systemic hardening filed during the #21 nested-folders review).
evalStatecounts everyListSubjectsread viacharge(); exceedingmaxReadsreturnsErrEvalBudgetExceeded→ ResourceExhausted. So a tenant-planted cycle / deep-branching graph (O(maxDepth²) memo-poisoning) can't DoS a Check. ListObjects/CheckSet now thread ONE evalState across all candidates/members (resetting only the path-localvisited+memoper top-level Check) so the budget has teeth instead of N×budget.GATEWAY_MAX_CHECK_READS(default 5000 — generous; legitimate deep graphs read far fewer), wired like MaxListObjects/MaxExpandNodes; non-positive = default; ~zero overhead (one int increment) when far from the cap.authz_eval_backstop_total{reason=depth|cycle|budget}— the engine stays Prometheus-free (records into a request-scoped collector on the context); the connect handlers increment the metric. The previously-SILENT depth/cycle graceful-deny is now alertable; budget exhaustion errors (not silent denies).Engine decision-equivalence: all existing engine/exclusion/wildcard/condition/checkset tests pass unchanged. No proto change ·
go test -race ./...✓ (real Postgres) · lint 0 ✓.Closes #50