Engine: per-request read/work budget + backstop observability (#50) by iarunsaragadam · Pull Request #62 · elloloop/workspace

iarunsaragadam · 2026-06-18T00:40:40Z

Closes #50 — bounds per-request recursive cost and makes the engine's fail-closed backstops observable (the systemic hardening filed during the #21 nested-folders review).

Read budget: evalState counts every ListSubjects read via charge(); exceeding maxReads returns ErrEvalBudgetExceeded → ResourceExhausted. So a tenant-planted cycle / deep-branching graph (O(maxDepth²) memo-poisoning) can't DoS a Check. ListObjects/CheckSet now thread ONE evalState across all candidates/members (resetting only the path-local visited + memo per top-level Check) so the budget has teeth instead of N×budget.
Config: GATEWAY_MAX_CHECK_READS (default 5000 — generous; legitimate deep graphs read far fewer), wired like MaxListObjects/MaxExpandNodes; non-positive = default; ~zero overhead (one int increment) when far from the cap.
Observability: authz_eval_backstop_total{reason=depth|cycle|budget} — the engine stays Prometheus-free (records into a request-scoped collector on the context); the connect handlers increment the metric. The previously-SILENT depth/cycle graceful-deny is now alertable; budget exhaustion errors (not silent denies).
Tests: cyclic + wide-branching graphs trip the budget (reads bounded, via a counting reader); a moderate graph stays well under; ListObjects over 100 candidates shares one budget (reads ≤200); the backstop metric increments on depth/cycle/budget.

Engine decision-equivalence: all existing engine/exclusion/wildcard/condition/checkset tests pass unchanged. No proto change · go test -race ./... ✓ (real Postgres) · lint 0 ✓.

Closes #50

Bound the store work a single evaluation can do: evalState now carries a read counter charged on every ListSubjects call, capped by a configurable budget (GATEWAY_MAX_CHECK_READS, default 5000). Exhausting it returns ErrEvalBudgetExceeded -> ErrResourceExhausted, not a silent deny, since a runaway means a pathological/abusive query. ListObjects and CheckSet now share ONE evalState (one budget) across all candidate/member Checks so a wide namespace x deep graph cannot multiply into N x budget reads. Surface which backstop fired via a context-attached Backstops collector (keeps the engine prometheus-free); the connect layer emits authz_eval_backstop_total{reason=depth|cycle|budget}, giving on-call an alertable signal for an abusive tenant or a misconfigured deep/cyclic model.

Add GATEWAY_MAX_CHECK_READS (config -> service -> engine, default 5000) and the authz_eval_backstop_total{reason} metric. Engine tests use a counting reader to prove the read count is bounded: a long cyclic parent chain and a wide/deep branching graph each trip ErrEvalBudgetExceeded near the budget; a moderate chain stays well under it and succeeds; ListObjects shares one budget across candidates (fails fast, not N x budget). A connect test asserts the budget and depth/cycle backstops increment the metric. Document the knob and metric in README and the per-request-cost/observable- backstops section of the authorization model. Closes #50

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8d3717832a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-18T00:43:41Z


+	ctx, backstops := authz.WithBackstops(ctx)
+	defer func() { h.metrics.recordBackstops(backstops) }()
 	results, err := h.svc.BatchCheck(ctx, p, svcItems)


Share the read budget across BatchCheck items

When a BatchCheck contains many individually near-limit checks, this still calls s.svc.BatchCheck, whose implementation loops over s.Check for each item (internal/service/batchcheck.go:42); each Check goes through Engine.CheckWithModel and creates a fresh newEvalState(e.maxReads) (pkg/authz/engine.go:134). That means a single BatchCheck RPC can consume len(items) * GATEWAY_MAX_CHECK_READS tuple reads, up to the batch item cap, instead of the new per-request budget, so the backstop does not protect this request shape.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-18T00:43:41Z

+		if n > 0 {
+			s.maxCheckReads = n
+		}


Allow zero to disable the read budget

With GATEWAY_MAX_CHECK_READS=0, config passes 0 into this option, but the option ignores non-positive values and leaves s.maxCheckReads at DefaultMaxCheckReads before New seeds the engine. That contradicts the new README/engine contract that 0 or negative is unbounded, so operators cannot disable the budget if it trips legitimate graphs.

Useful? React with 👍 / 👎.

A single item's ErrEvalBudgetExceeded is an item-specific condition (that item's model graph is too deep/branching/cyclic), not a systemic outage, so it must not discard every sibling result. Treat ErrResourceExhausted like a per-item validation error: capture it in the item's result slot and keep evaluating the rest of the batch. Systemic storage/engine failures still abort the whole call with a non-OK status.

ListObjects and the CheckSet member fan-out share ONE read budget across up to maxListObjects candidate Checks, but the flat single-Check budget (default 5000) is far below the legitimate worst case of a full-cap deep scan (candidates x maxDepth). A real ListObjects over a near-cap deep hierarchy was returning ResourceExhausted instead of the correct list. Decouple the fan-out budget from the single-Check budget: a fan-out operation gets max(maxReads, candidates x maxDepth x HEADROOM) with HEADROOM=2, computed at the ListObjects/CheckSet entry where the shared evalState is created. A legit full-cap deep scan (~32000) passes under ~64000 while all-cyclic abuse (~candidates x maxDepth^2) still trips. Single Check/Expand keep the flat maxReads. CheckSet has no member cap, so it borrows the same candidate ceiling a full-cap ListObjects gets. Tests: a production-defaults full-cap (1000) deep (12) ~900-candidate scan now SUCCEEDS and returns the full list; the abuse fixtures are resized to exceed the larger fan-out budget and still trip.

Explain that GATEWAY_MAX_CHECK_READS is the single-Check budget while a fan-out (ListObjects/CheckSet) gets a budget scaled to GATEWAY_MAX_LIST_OBJECTS x maxDepth x 2, so the two must be sized together. Add a runbook note recommending an alert on authz_eval_backstop_total{reason="budget"} to catch a legitimate tenant tripping the budget before users notice, and note that a BatchCheck budget trip is isolated to the offending item.

…udget Expand is a fan-out bounded by the node cap (maxExpandNodes), but it seeded its read budget with the flat single-Check maxReads. A legitimate acyclic Expand whose read count sat between maxReads and the node cap wrongly returned ResourceExhausted before ErrExpandTooLarge could fire. Add Engine.expandBudget(maxNodes) = max(maxReads, maxNodes*fanOutHeadroom), mirroring fanOutBudget but scaling by the node cap (Expand's read cost tracks reachable usersets, not candidates*depth), and seed ExpandWithModel with it. The node cap stays the primary bound; the read budget now only trips on genuinely pathological read-amplification.

Mirror the service-level isolation test at the connect boundary: a budget- exhausting cyclic item returns HTTP 200 with the budget error in its slot, its sibling resolves correctly, and authz_eval_backstop_total{reason=budget} is counted once.

… sizing Note that GATEWAY_MAX_CHECK_READS must be sized to the deepest/widest real tenant model, that Expand's budget scales off GATEWAY_MAX_EXPAND_NODES (not the flat Check budget), that a pathologically wide union/tupleToUserset model could still need a higher knob, and reiterate the day-one alert on authz_eval_backstop_total{reason=budget}.

A small-positive value (e.g. 5) silently failed authz closed fleet-wide: every non-trivial Check trips the read budget and returns ResourceExhausted. Reject a configured value below minMaxCheckReads (100) at startup, mirroring the minTenantRateLimitPerMinute guard; 0/negative still selects the service default. Document the floor in the README.

CheckSet sized its fan-out read budget off a hardcoded 1000 instead of the configured GATEWAY_MAX_LIST_OBJECTS, so raising the cap did not lift CheckSet's budget as the scaling contract documents. Add a maxListObjects field and WithMaxListObjects option to the engine, set it from service config alongside WithMaxReads, and size CheckSet's budget off it. Remove the hardcoded literal.

Product-surface RPCs (CreateWorkspace/AddMember/…) drive an engine check through Service.allowed but did not install the backstop collector, so a budget trip there was uncounted on authz_eval_backstop_total. Replace the scattered per-handler WithBackstops installs (Check/Expand/ListObjects/ BatchCheck) with a single Connect interceptor wired onto every service handler, so every engine evaluation — current and future — feeds the metric, counted once per reason per request. Add a product-surface regression test proving a budget trip returns ResourceExhausted and increments the metric.

iarunsaragadam added 2 commits June 18, 2026 01:34

chatgpt-codex-connector Bot reviewed Jun 18, 2026

View reviewed changes

iarunsaragadam added 6 commits June 18, 2026 01:56

iarunsaragadam mentioned this pull request Jun 18, 2026

Per-project/per-tenant MaxCheckReads override #63

Closed

iarunsaragadam added 3 commits June 18, 2026 02:43

iarunsaragadam merged commit 7bd69bb into main Jun 18, 2026
9 of 10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Engine: per-request read/work budget + backstop observability (#50)#62

Engine: per-request read/work budget + backstop observability (#50)#62
iarunsaragadam merged 11 commits into
mainfrom
feat/read-budget

iarunsaragadam commented Jun 18, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 18, 2026

Uh oh!

chatgpt-codex-connector Bot Jun 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

iarunsaragadam commented Jun 18, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant