Skip to content

Engine: per-request read/work budget + backstop observability (#50)#62

Merged
iarunsaragadam merged 11 commits into
mainfrom
feat/read-budget
Jun 18, 2026
Merged

Engine: per-request read/work budget + backstop observability (#50)#62
iarunsaragadam merged 11 commits into
mainfrom
feat/read-budget

Conversation

@iarunsaragadam

Copy link
Copy Markdown
Contributor

Closes #50 — bounds per-request recursive cost and makes the engine's fail-closed backstops observable (the systemic hardening filed during the #21 nested-folders review).

  • Read budget: evalState counts every ListSubjects read via charge(); exceeding maxReads returns ErrEvalBudgetExceeded → ResourceExhausted. So a tenant-planted cycle / deep-branching graph (O(maxDepth²) memo-poisoning) can't DoS a Check. ListObjects/CheckSet now thread ONE evalState across all candidates/members (resetting only the path-local visited + memo per top-level Check) so the budget has teeth instead of N×budget.
  • Config: GATEWAY_MAX_CHECK_READS (default 5000 — generous; legitimate deep graphs read far fewer), wired like MaxListObjects/MaxExpandNodes; non-positive = default; ~zero overhead (one int increment) when far from the cap.
  • Observability: authz_eval_backstop_total{reason=depth|cycle|budget} — the engine stays Prometheus-free (records into a request-scoped collector on the context); the connect handlers increment the metric. The previously-SILENT depth/cycle graceful-deny is now alertable; budget exhaustion errors (not silent denies).
  • Tests: cyclic + wide-branching graphs trip the budget (reads bounded, via a counting reader); a moderate graph stays well under; ListObjects over 100 candidates shares one budget (reads ≤200); the backstop metric increments on depth/cycle/budget.

Engine decision-equivalence: all existing engine/exclusion/wildcard/condition/checkset tests pass unchanged. No proto change · go test -race ./... ✓ (real Postgres) · lint 0 ✓.

Closes #50

Bound the store work a single evaluation can do: evalState now carries a
read counter charged on every ListSubjects call, capped by a configurable
budget (GATEWAY_MAX_CHECK_READS, default 5000). Exhausting it returns
ErrEvalBudgetExceeded -> ErrResourceExhausted, not a silent deny, since a
runaway means a pathological/abusive query. ListObjects and CheckSet now
share ONE evalState (one budget) across all candidate/member Checks so a
wide namespace x deep graph cannot multiply into N x budget reads.

Surface which backstop fired via a context-attached Backstops collector
(keeps the engine prometheus-free); the connect layer emits
authz_eval_backstop_total{reason=depth|cycle|budget}, giving on-call an
alertable signal for an abusive tenant or a misconfigured deep/cyclic model.
Add GATEWAY_MAX_CHECK_READS (config -> service -> engine, default 5000)
and the authz_eval_backstop_total{reason} metric. Engine tests use a
counting reader to prove the read count is bounded: a long cyclic parent
chain and a wide/deep branching graph each trip ErrEvalBudgetExceeded near
the budget; a moderate chain stays well under it and succeeds; ListObjects
shares one budget across candidates (fails fast, not N x budget). A connect
test asserts the budget and depth/cycle backstops increment the metric.
Document the knob and metric in README and the per-request-cost/observable-
backstops section of the authorization model.

Closes #50

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8d3717832a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


ctx, backstops := authz.WithBackstops(ctx)
defer func() { h.metrics.recordBackstops(backstops) }()
results, err := h.svc.BatchCheck(ctx, p, svcItems)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Share the read budget across BatchCheck items

When a BatchCheck contains many individually near-limit checks, this still calls s.svc.BatchCheck, whose implementation loops over s.Check for each item (internal/service/batchcheck.go:42); each Check goes through Engine.CheckWithModel and creates a fresh newEvalState(e.maxReads) (pkg/authz/engine.go:134). That means a single BatchCheck RPC can consume len(items) * GATEWAY_MAX_CHECK_READS tuple reads, up to the batch item cap, instead of the new per-request budget, so the backstop does not protect this request shape.

Useful? React with 👍 / 👎.

Comment on lines +105 to +107
if n > 0 {
s.maxCheckReads = n
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Allow zero to disable the read budget

With GATEWAY_MAX_CHECK_READS=0, config passes 0 into this option, but the option ignores non-positive values and leaves s.maxCheckReads at DefaultMaxCheckReads before New seeds the engine. That contradicts the new README/engine contract that 0 or negative is unbounded, so operators cannot disable the budget if it trips legitimate graphs.

Useful? React with 👍 / 👎.

A single item's ErrEvalBudgetExceeded is an item-specific condition (that
item's model graph is too deep/branching/cyclic), not a systemic outage,
so it must not discard every sibling result. Treat ErrResourceExhausted
like a per-item validation error: capture it in the item's result slot
and keep evaluating the rest of the batch. Systemic storage/engine
failures still abort the whole call with a non-OK status.
ListObjects and the CheckSet member fan-out share ONE read budget across
up to maxListObjects candidate Checks, but the flat single-Check budget
(default 5000) is far below the legitimate worst case of a full-cap deep
scan (candidates x maxDepth). A real ListObjects over a near-cap deep
hierarchy was returning ResourceExhausted instead of the correct list.

Decouple the fan-out budget from the single-Check budget: a fan-out
operation gets max(maxReads, candidates x maxDepth x HEADROOM) with
HEADROOM=2, computed at the ListObjects/CheckSet entry where the shared
evalState is created. A legit full-cap deep scan (~32000) passes under
~64000 while all-cyclic abuse (~candidates x maxDepth^2) still trips.
Single Check/Expand keep the flat maxReads. CheckSet has no member cap,
so it borrows the same candidate ceiling a full-cap ListObjects gets.

Tests: a production-defaults full-cap (1000) deep (12) ~900-candidate
scan now SUCCEEDS and returns the full list; the abuse fixtures are
resized to exceed the larger fan-out budget and still trip.
Explain that GATEWAY_MAX_CHECK_READS is the single-Check budget while a
fan-out (ListObjects/CheckSet) gets a budget scaled to
GATEWAY_MAX_LIST_OBJECTS x maxDepth x 2, so the two must be sized
together. Add a runbook note recommending an alert on
authz_eval_backstop_total{reason="budget"} to catch a legitimate tenant
tripping the budget before users notice, and note that a BatchCheck
budget trip is isolated to the offending item.
…udget

Expand is a fan-out bounded by the node cap (maxExpandNodes), but it seeded
its read budget with the flat single-Check maxReads. A legitimate acyclic
Expand whose read count sat between maxReads and the node cap wrongly returned
ResourceExhausted before ErrExpandTooLarge could fire.

Add Engine.expandBudget(maxNodes) = max(maxReads, maxNodes*fanOutHeadroom),
mirroring fanOutBudget but scaling by the node cap (Expand's read cost tracks
reachable usersets, not candidates*depth), and seed ExpandWithModel with it.
The node cap stays the primary bound; the read budget now only trips on
genuinely pathological read-amplification.
Mirror the service-level isolation test at the connect boundary: a budget-
exhausting cyclic item returns HTTP 200 with the budget error in its slot, its
sibling resolves correctly, and authz_eval_backstop_total{reason=budget} is
counted once.
… sizing

Note that GATEWAY_MAX_CHECK_READS must be sized to the deepest/widest real
tenant model, that Expand's budget scales off GATEWAY_MAX_EXPAND_NODES (not the
flat Check budget), that a pathologically wide union/tupleToUserset model could
still need a higher knob, and reiterate the day-one alert on
authz_eval_backstop_total{reason=budget}.
A small-positive value (e.g. 5) silently failed authz closed fleet-wide:
every non-trivial Check trips the read budget and returns ResourceExhausted.
Reject a configured value below minMaxCheckReads (100) at startup, mirroring
the minTenantRateLimitPerMinute guard; 0/negative still selects the service
default. Document the floor in the README.
CheckSet sized its fan-out read budget off a hardcoded 1000 instead of the
configured GATEWAY_MAX_LIST_OBJECTS, so raising the cap did not lift CheckSet's
budget as the scaling contract documents. Add a maxListObjects field and
WithMaxListObjects option to the engine, set it from service config alongside
WithMaxReads, and size CheckSet's budget off it. Remove the hardcoded literal.
Product-surface RPCs (CreateWorkspace/AddMember/…) drive an engine check
through Service.allowed but did not install the backstop collector, so a
budget trip there was uncounted on authz_eval_backstop_total. Replace the
scattered per-handler WithBackstops installs (Check/Expand/ListObjects/
BatchCheck) with a single Connect interceptor wired onto every service
handler, so every engine evaluation — current and future — feeds the metric,
counted once per reason per request. Add a product-surface regression test
proving a budget trip returns ResourceExhausted and increments the metric.
@iarunsaragadam iarunsaragadam merged commit 7bd69bb into main Jun 18, 2026
9 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Engine: per-request read/work budget + observability for recursive cost

1 participant