
rearchitect runtime, workflows, security, and reactivity #24

Merged: isala404 merged 31 commits into main from rewrite on May 21, 2026

@isala404 isala404 commented May 21, 2026

Full rearchitecture of forge internals. 26 commits, 454 files, +42k/−21k LOC. Pre-1.0 — no migration path from intermediate states.

What changed

Runtime foundation

  • Single-pool doctrine: all queries, mutations, jobs, cron, daemons, workflows, observability, and signals share one primary pool (replicas round-robin, health-checked). Removed the per-subsystem pool wiring.
  • Function registry unification: queries, mutations, jobs, cron, workflows, daemons, webhooks, MCP tools, and signals all flow through one registry. Inventory-based auto-registration.
  • PG primitives centralized under crates/forge-runtime/src/pg/: migration runner, advisory locks, NOTIFY bus, listener reconnect, identifier validation.
  • Migration runner overhauled: forward-only, dollar-quote-aware SQL splitter, advisory-lock cluster safety, checksum drift detection, holder-pid diagnostics.
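
To illustrate what "dollar-quote-aware" means for the splitter, here is a minimal std-only sketch. `split_sql`, `dollar_tag_len`, and `push_trimmed` are hypothetical names, not the forge implementation. It assumes statements end at top-level semicolons and skips semicolons inside single-quoted strings, `--` line comments, and `$tag$ ... $tag$` bodies. Block comments and non-ASCII tags are ignored for brevity.

```rust
/// Hypothetical helper, not the forge splitter: splits a script on
/// top-level semicolons only.
pub fn split_sql(script: &str) -> Vec<String> {
    let bytes = script.as_bytes();
    let mut statements = Vec::new();
    let (mut start, mut i) = (0usize, 0usize);
    while i < bytes.len() {
        match bytes[i] {
            b'\'' => {
                // Skip a single-quoted string; '' is an escaped quote.
                i += 1;
                while i < bytes.len() {
                    if bytes[i] == b'\'' {
                        if bytes.get(i + 1) == Some(&b'\'') {
                            i += 2;
                            continue;
                        }
                        break;
                    }
                    i += 1;
                }
                i += 1;
            }
            b'-' if bytes.get(i + 1) == Some(&b'-') => {
                // Skip a line comment up to (not including) the newline.
                while i < bytes.len() && bytes[i] != b'\n' {
                    i += 1;
                }
            }
            b'$' => match dollar_tag_len(&bytes[i..]) {
                Some(tag_len) => {
                    // Jump past the matching closing tag, or to the end
                    // of the script if the body is unterminated.
                    let tag = &script[i..i + tag_len];
                    let body_start = i + tag_len;
                    i = match script[body_start..].find(tag) {
                        Some(off) => body_start + off + tag_len,
                        None => bytes.len(),
                    };
                }
                None => i += 1,
            },
            b';' => {
                push_trimmed(&mut statements, &script[start..i]);
                i += 1;
                start = i;
            }
            _ => i += 1,
        }
    }
    push_trimmed(&mut statements, &script[start..]);
    statements
}

/// Length of a `$$` or `$ident$` tag starting at `rest[0] == b'$'`.
/// Returns None for things like the `$1` parameter placeholder.
fn dollar_tag_len(rest: &[u8]) -> Option<usize> {
    for (offset, &b) in rest.iter().enumerate().skip(1) {
        if b == b'$' {
            return Some(offset + 1);
        }
        let ok = b == b'_' || b.is_ascii_alphabetic() || (offset > 1 && b.is_ascii_digit());
        if !ok {
            return None;
        }
    }
    None
}

fn push_trimmed(out: &mut Vec<String>, stmt: &str) {
    let s = stmt.trim();
    if !s.is_empty() {
        out.push(s.to_string());
    }
}
```

A naive `split(';')` would cut a PL/pgSQL function body at every inner statement, which is the failure mode this kind of splitter avoids.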

Reactivity

  • ChangeListener -> InvalidationEngine (50ms debounce / 200ms max) -> SubscriptionManager (DashMap, 64 shards, dedup by hash) -> Reactor (bounded concurrency 64, hash compare before push) -> SessionServer (SSE fan-out).
  • Cross-node invalidation via forge_changes NOTIFY.
  • PgNotifyBus: single connection, multi-channel, exponential-backoff reconnect (500ms -> 30s) with full re-LISTEN.
  • DashMap replaces Mutex/RwLock across realtime hot paths.
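
One way to read "50ms debounce / 200ms max" is as a trailing debounce with a cap: flush 50ms after the most recent change, but never later than 200ms after the first change in the batch. The sketch below assumes that reading. `Debouncer` and its methods are hypothetical names, not the InvalidationEngine API.

```rust
use std::time::{Duration, Instant};

pub const DEBOUNCE: Duration = Duration::from_millis(50);
pub const MAX_WAIT: Duration = Duration::from_millis(200);

/// Hypothetical tracker for one pending invalidation batch.
#[derive(Debug, Default)]
pub struct Debouncer {
    first_change: Option<Instant>,
    last_change: Option<Instant>,
}

impl Debouncer {
    /// Records a change notification observed at `now`.
    pub fn record_change(&mut self, now: Instant) {
        if self.first_change.is_none() {
            self.first_change = Some(now);
        }
        self.last_change = Some(now);
    }

    /// Deadline for the current batch: DEBOUNCE after the last change,
    /// capped at MAX_WAIT after the first one. None if nothing is pending.
    pub fn flush_at(&self) -> Option<Instant> {
        let first = self.first_change?;
        let last = self.last_change?;
        Some((last + DEBOUNCE).min(first + MAX_WAIT))
    }

    /// Returns true and resets the batch once the deadline has passed.
    pub fn try_flush(&mut self, now: Instant) -> bool {
        match self.flush_at() {
            Some(deadline) if now >= deadline => {
                *self = Self::default();
                true
            }
            _ => false,
        }
    }
}
```

The cap matters under sustained writes: without it, a table that changes every 40ms would never flush.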

Workflows

  • Versioned with blake3 signature (128-bit) over the persisted contract (name, version, step keys, wait keys, timeout, input/output types).
  • One active version per name; resume requires exact version+signature match. Mismatches mark runs as blocked. Operator can cancel/force-abort via admin endpoints.
  • Steps cached by name (completed steps are skipped on resume), compensation runs in reverse order, durable sleep survives restart, external-event waits support timeouts, parallel steps via ParallelBuilder.
  • Runs on the worker pool, no separate scheduler.
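
A rough in-memory model of step caching and compensation ordering, for readers unfamiliar with the pattern. `Run`, `step`, and `compensation_order` are hypothetical names. The real runtime persists step results in Postgres, and this sketch assumes "compensation reversed" means compensations run most-recent-first.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for a persisted workflow run.
#[derive(Default)]
pub struct Run {
    /// Step name -> cached output of the completed step.
    completed: HashMap<String, String>,
    /// Names of completed steps, in completion order.
    compensations: Vec<String>,
}

impl Run {
    /// Executes `body` only if no result is cached under `name`;
    /// otherwise returns the cached output without running it.
    pub fn step<F>(&mut self, name: &str, body: F) -> Result<String, String>
    where
        F: FnOnce() -> Result<String, String>,
    {
        if let Some(cached) = self.completed.get(name) {
            return Ok(cached.clone());
        }
        let out = body()?;
        self.completed.insert(name.to_string(), out.clone());
        self.compensations.push(name.to_string());
        Ok(out)
    }

    /// Order in which compensations would run: last completed step first.
    pub fn compensation_order(&self) -> Vec<String> {
        self.compensations.iter().rev().cloned().collect()
    }
}
```

Because the cache key is the step name, renaming a step changes what a resumed run will skip, which is presumably why step keys are part of the signed contract.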

Mutations and dispatch

  • Outbox buffer replaced by in-transaction dispatch. dispatch_job / start_workflow share the mutation's tx; the forge_notify_job_available trigger fires inside that tx, so workers only see jobs whose mutation actually committed. At-least-once preserved.
  • Compile-time check: dispatch_job / start_workflow outside a #[mutation(transactional)] errors at macro expansion.
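
The visibility guarantee of in-transaction dispatch can be modeled without a database. In this sketch `JobTable` and `Tx` are hypothetical stand-ins for the job table and the mutation's transaction. Jobs staged through `dispatch_job` become visible to workers only on `commit`.

```rust
/// Hypothetical model of the job rows that workers can see.
#[derive(Default)]
pub struct JobTable {
    visible: Vec<String>,
}

/// Hypothetical transaction: staged rows stay invisible until commit.
pub struct Tx<'a> {
    table: &'a mut JobTable,
    staged: Vec<String>,
}

impl JobTable {
    pub fn begin(&mut self) -> Tx<'_> {
        Tx { table: self, staged: Vec::new() }
    }

    /// What a polling worker would observe.
    pub fn visible_jobs(&self) -> &[String] {
        &self.visible
    }
}

impl Tx<'_> {
    /// Stages a job inside the mutation's transaction.
    pub fn dispatch_job(&mut self, name: &str) {
        self.staged.push(name.to_string());
    }

    /// Publishes every staged job atomically with the mutation.
    pub fn commit(self) {
        self.table.visible.extend(self.staged);
    }

    /// Discards staged jobs along with the rest of the mutation.
    pub fn rollback(self) {}
}
```

Compared with an outbox buffer, there is no second step that can fail between "mutation committed" and "job enqueued".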

Security

  • Compile-time scope checks: private queries must filter by user_id/owner_id/tenant_id in SQL. #[query(unscoped)] opts out explicitly.
  • Sessions bound to IP and UA (HMAC); rotation requires both.
  • SSRF: is_private_ip covers IPv4-mapped IPv6, link-local, ULA, broadcast, documentation. SsrfSafeResolver filters at DNS resolution to close rebinding. Literal-IP URLs caught pre-DNS.
  • JWT validation cache, token issuer typed end-to-end, OAuth code/client tables for upcoming OAuth flows.
  • SQLx checked against PgBouncer — startup refuses transaction-pooling mode.
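
For the SSRF bullet, a std-only sketch of the kind of classification `is_private_ip` performs. This is a hypothetical reimplementation and may cover fewer ranges than the real function. The key detail is unwrapping IPv4-mapped IPv6 addresses before applying the IPv4 rules.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Hypothetical reimplementation; the real check may cover more ranges.
pub fn is_private_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_private_v4(v4),
        // ::ffff:a.b.c.d must be judged by its embedded IPv4 address.
        IpAddr::V6(v6) => match v6.to_ipv4_mapped() {
            Some(v4) => is_private_v4(v4),
            None => is_private_v6(v6),
        },
    }
}

fn is_private_v4(ip: Ipv4Addr) -> bool {
    ip.is_private()
        || ip.is_loopback()
        || ip.is_link_local()
        || ip.is_broadcast()
        || ip.is_documentation()
        || ip.is_unspecified()
}

fn is_private_v6(ip: Ipv6Addr) -> bool {
    let seg = ip.segments();
    ip.is_loopback()
        || ip.is_unspecified()
        || (seg[0] & 0xfe00) == 0xfc00 // unique local, fc00::/7
        || (seg[0] & 0xffc0) == 0xfe80 // link-local, fe80::/10
        || (seg[0] == 0x2001 && seg[1] == 0x0db8) // documentation, 2001:db8::/32
}
```

Running the same predicate on every address returned by DNS, rather than only on the URL host, is what closes the rebinding gap described above.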

Clustering and operations

  • Leader election hardened (pg_try_advisory_lock, lock held by connection, holder-pid diagnostics).
  • Cron: leader-only execution, exactly-once via UNIQUE (cron_name, scheduled_time), catch-up for missed runs.
  • Daemons: leader-elected or replicated, auto-restart on failure, watch-channel shutdown.
  • PG NOTIFY multiplexer for cross-node coordination.
  • Admin API: workflow cancel/force-abort, paused queues.
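
The cron guarantee can be sketched with a set standing in for the UNIQUE (cron_name, scheduled_time) constraint. `CronRuns`, `try_claim`, and `missed_ticks` are hypothetical names. The sketch assumes a fixed-interval schedule with Unix-second timestamps and an insert that is a no-op on conflict.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a table with UNIQUE (cron_name, scheduled_time).
#[derive(Default)]
pub struct CronRuns {
    rows: HashSet<(String, u64)>,
}

impl CronRuns {
    /// Models an insert that does nothing on conflict: returns true only
    /// for the first claimant of a given (name, scheduled_time) slot.
    pub fn try_claim(&mut self, cron_name: &str, scheduled_time: u64) -> bool {
        self.rows.insert((cron_name.to_string(), scheduled_time))
    }
}

/// Scheduled times in (last_run, now] for a fixed interval, used to
/// catch up on runs missed while no leader was active.
pub fn missed_ticks(last_run: u64, now: u64, interval_secs: u64) -> Vec<u64> {
    assert!(interval_secs > 0, "interval must be positive");
    let mut ticks = Vec::new();
    let mut next = last_run + interval_secs;
    while next <= now {
        ticks.push(next);
        next += interval_secs;
    }
    ticks
}
```

Even if two nodes briefly both believe they are leader, only one claim per slot succeeds, so catch-up after failover cannot double-run a tick.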

Codegen

  • Single source of truth for type mapping in emit.rs (ts_type + dioxus_type).
  • Structural type walk: context detection no longer string-based.
  • TS / Dioxus emitters share the BindingSet IR.

Build and supply chain

  • System migrations v001..v011 collapsed into a single normalized v001_initial.sql.
  • Slim feature builds: gateway, jobs, cron, workflows, daemons, realtime, cluster gated independently. (See open items below.)
  • astral-tokio-tar bump clears RUSTSEC-2025-0146.
  • cargo-deny, audit, supply-chain guardrails tightened.

Docs

  • Full pass across docs/docs/ (Docusaurus) and docs/skills/forge-idiomatic-engineer/references/ (api.md, frontend.md, patterns.md, pitfalls.md).

Must fix before merge

  • Slim builds broken for workflows-only and cron-only configurations. cargo check -p forge-runtime --no-default-features --features workflows and --features cron both fail because workflow/scheduler.rs:442 and cron/scheduler.rs:449 unconditionally reference crate::jobs::JobRecord, which is #[cfg(feature = "jobs")]-gated. Fix: either add jobs to those features in Cargo.toml or cfg-gate the dispatch sites.
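
A sketch of the second option (cfg-gating the dispatch site). `JobRecord` here is a simplified stand-in and `enqueue_resume` is a hypothetical function name. The actual code at those scheduler lines is not shown in this PR description.

```rust
/// Simplified stand-in for the feature-gated type named in the failure.
#[cfg(feature = "jobs")]
pub struct JobRecord {
    pub name: String,
}

/// Hypothetical dispatch site, compiled only when `jobs` is enabled.
#[cfg(feature = "jobs")]
pub fn enqueue_resume(name: &str) -> Result<(), String> {
    let record = JobRecord { name: name.to_string() };
    // A real implementation would insert `record` into the job queue here.
    let _ = record.name;
    Ok(())
}

/// Fallback for slim builds: the symbol still exists, so callers compile,
/// but dispatch reports that the capability is missing.
#[cfg(not(feature = "jobs"))]
pub fn enqueue_resume(name: &str) -> Result<(), String> {
    Err(format!("cannot enqueue '{name}': built without the `jobs` feature"))
}
```

If workflows and cron cannot do useful work without the job queue, declaring `jobs` as a dependency of those features in Cargo.toml is the simpler fix; the gated fallback only makes sense if a jobs-free mode is a real target.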

Non-blocking notes

  • Schema dead code: forge_invalidations table and forge_purge_expired_invalidations() defined but unreferenced from Rust. Drop or wire up.
  • Missing FK indexes: forge_workflow_events.consumed_by and forge_oauth_codes.client_id — both will seq-scan children on parent DELETE CASCADE.
  • forge_signals_events_default partition silently swallows misrouted rows if forge_signals_ensure_partition fails. Retention drop logic explicitly skips it. Add a guardrail.
  • Scope-check is name-based (any column literally named user_id/owner_id/tenant_id passes). Doc comment is honest: not a security boundary until RLS lands.
  • hash_ua truncates SHA-256 to 8 hex chars (32 bits). HMAC carries forgery resistance, so it's fine, but a wider truncation (64+ bits) is cheap insurance against UA-rotation bypass.
  • webhook/handler.rs still TODO'd to migrate from WebhookContext to MutationContext — same atomicity story already fixed for mutations.
  • Cross-node invalidation: forge_change_log sequence table exists, but no reader replays it on listener reconnect. If the doc claims "replay missed rows," verify the path.
  • Migration runner has no upgrade path from intermediate v002..v011 (never published — only relevant if anyone dogfooded the branch mid-flight).

Not reviewed (size budget)

Detailed audits skipped for forge-codegen internals, the cluster leader-election rewrite, webhook replay command, and the full cargo test --workspace run. Reviewed via cargo check against feature combos and targeted reads of architecture-sensitive files.

Test plan

  • cargo fmt --all --check
  • SQLX_OFFLINE=true cargo clippy --all-targets --all-features --workspace -- -D warnings
  • SQLX_OFFLINE=true cargo test --workspace
  • Squashed schema applies cleanly on Postgres 18 (system + example migrations + bench counters)
  • .sqlx/ cache regenerated against the squashed schema and committed
  • CI: validate (fmt, clippy, unit tests)
  • CI: workspace-integration (testcontainer DB tests)
  • CI: pr-smoke (with-svelte/demo + with-dioxus/demo)
  • cargo check --no-default-features --features workflows (currently failing — see above)
  • cargo check --no-default-features --features cron (currently failing — see above)

isala404 added 25 commits May 3, 2026 22:35
- squash __forge_v001..v011 into a single normalized v001_initial.sql
  carrying the final shape of every table, function, and trigger
- gate cluster metrics on the gateway feature; gate job_queue and
  realtime modules on their own features so api/worker/minimal slim
  builds compile cleanly
- bump astral-tokio-tar to clear RUSTSEC-2025-0146
- inline example Cargo.toml deps so `forge new` templates build
  standalone outside the workspace; drop dead workspace=true
  replacements from demo .forge-template.toml files
- drop deny.toml `version` key (cargo-deny no longer accepts it)
- prepare sqlx with --all-targets so test-only queries are cached
- fix slow realtime drain test that deadlocked on empty channel and
  update tls cert-path test to match current read_pem_certs error
- retire issues/ tracking notes and .agents/rewrite-progress.md;
  strip the running commentary from v001_initial.sql, keeping only
  comments that explain non-obvious intent
@isala404 isala404 changed the title rewrite: consolidate runtime, codegen, and system schema rewrite: consolidate runtime, reactivity, security, and clustering May 21, 2026
isala404 added 3 commits May 21, 2026 14:04
- codegen: handle HashMap and Vec end-to-end in Dioxus emitter; surface parser failures so silently-skipped files don't drop handlers from bindings
- pg leader: parse role string strictly, emit NOTIFY forge_leader_released on voluntary release so standbys fail over without waiting for the next check tick
- daemon runner: collect and abort spawned validate/refresh tasks on clean iteration exit to stop them leaking past handler return
- workflow scheduler: document is_leader() as advisory-only (correctness comes from atomic UPDATE-with-status-check)
- cluster registry: drop dead mark_dead_nodes path
- webhook handler: make dispatch transactional with idempotency release on failure; refactor context replay primitives
- signals: widen hash_ua to 64 bits and guard against misrouted default-partition rows
- runtime wiring: build PgNotifyBus before LeaderElection so leader-released wakeups flow through the shared bus
- migrations: collapse system schema to v001, add FK indexes on workflow_events.consumed_by and oauth_codes.client_id
- docs: clarify parser context detection is an allowlist of 8 framework context types
- workflow: route suspend through ForgeError::WorkflowSuspended, propagate
  state persistence errors, claim-for-execution UPDATE drops the 'running'
  filter, dispatch_job/start_workflow go through trait builders.
- gateway: hash JWT cache key with SHA-256 (was raw token), stateless HMAC-
  signed OAuth CSRF tokens with 5-min TTL, JWKS singleflight + 30s negative
  cache for unknown kids, rate limiter clamps to -1.0 instead of underflow.
- realtime: ChangeListener snapshots max_seq before bus.subscribe and
  replays missed events on NOTIFY reconnect via watch::Sender generation
  counter; leader-lease refresh holds the mutex across pg_locks probe and
  UPDATE forge_leaders.
- worker: shutdown drain via JoinSet with grace period, release_claim
  helper for orphaned claims, job-status row mapper surfaces unknown
  variants instead of panicking.
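
The snapshot-then-replay idea in the realtime item can be sketched as a sequence cursor. `Change` and `ReplayCursor` are hypothetical types. The sketch assumes a change log with monotonically increasing `seq` values and leaves out the watch-channel generation counter that triggers the replay.

```rust
/// Hypothetical row of a sequence-ordered change log.
#[derive(Clone, Debug, PartialEq)]
pub struct Change {
    pub seq: u64,
    pub table: String,
}

/// Tracks the highest sequence number already delivered downstream.
pub struct ReplayCursor {
    last_seen: u64,
}

impl ReplayCursor {
    /// Start from a max_seq snapshot taken before subscribing, so rows
    /// written while the subscription is being set up are not lost.
    pub fn from_snapshot(max_seq: u64) -> Self {
        Self { last_seen: max_seq }
    }

    /// Live path: accept a notification only if it is newer than the
    /// cursor, which also dedups against rows already replayed.
    pub fn accept_live(&mut self, change: &Change) -> bool {
        if change.seq <= self.last_seen {
            return false;
        }
        self.last_seen = change.seq;
        true
    }

    /// Reconnect path: return log rows past the cursor in sequence
    /// order and advance the cursor to the last one.
    pub fn replay(&mut self, log: &[Change]) -> Vec<Change> {
        let mut missed: Vec<Change> = log
            .iter()
            .filter(|c| c.seq > self.last_seen)
            .cloned()
            .collect();
        missed.sort_by_key(|c| c.seq);
        if let Some(last) = missed.last() {
            self.last_seen = last.seq;
        }
        missed
    }
}
```

Taking the snapshot before subscribing means the worst case is a duplicate delivery, which the cursor filters out, rather than a gap.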
isala404 added 3 commits May 21, 2026 20:51
- macros: workflow and mutation visitors track the ctx ident from the first
  fn argument and only collect step/wait/dispatch keys when the method-call
  receiver chain bottoms out on that ident, eliminating false positives
  from any same-named method. Query unscoped-error message now spells out
  the structural lint, the #[query(unscoped)] opt-out, and points users at
  Postgres RLS for real isolation.
- testing: TestMutationContext::into_mutation_context(pool) bridges to the
  production MutationContext wired with the test mocks, so handlers written
  against &MutationContext can be exercised through assertion macros.
  MockWorkflowDispatch picks up the WorkflowDispatch trait impl that the
  bridge needs.
- dioxus signals: now_iso falls back to the portable formatter instead of
  silently returning ""; the three serde_json::to_value(..).unwrap_or_default
  sites now early-return with console.warn so dropped analytics aren't
  silent; rand_u32 non-wasm path uses subsec_nanos XOR counter instead of
  memory-address hashing; localStorage queue persistence ported from the
  Svelte client (key forge_signals_queue_v1, restored on init, written on
  every enqueue/flush, cleared on success/beacon).
The scheduler's consume-claim-and-resume / claim-and-resume transactions
flip the run to 'running' before enqueueing the resume job, so by the
time the worker reaches claim_for_execution the row is already there.
Excluding 'running' from the IN list left every event- or timer-driven
resume failing with InvalidState, which surfaced as the demo PR smoke
test (with-svelte/demo, with-dioxus/demo) stalling at 3/6 completed
steps once the user clicks Confirm Verification.

Dedup is already enforced upstream: the job queue holds resume jobs
under FOR UPDATE SKIP LOCKED, and the scheduler's claim UPDATE / event
consume ensures only one resume job per wake event. Re-admitting
'running' makes the worker idempotent without adding a real race.

The matching .sqlx cache file is renamed (new hash for the restored
IN list).
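
The failure described above reduces to a small state check. In this sketch the `RunStatus` variants other than `Running` and `Blocked` are hypothetical, as are `scheduler_wake` and `worker_claim`. The `admit_running` flag models whether 'running' is in the claim UPDATE's IN list.

```rust
/// Hypothetical run states; only 'running' and 'blocked' are named in the PR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RunStatus {
    Pending,
    Sleeping,
    Waiting,
    Running,
    Completed,
    Blocked,
}

/// Scheduler side: a timer or event wake flips the run to Running
/// before the resume job is enqueued.
pub fn scheduler_wake(status: RunStatus) -> RunStatus {
    match status {
        RunStatus::Sleeping | RunStatus::Waiting => RunStatus::Running,
        other => other,
    }
}

/// Worker side: models the status filter of the claim UPDATE.
/// `admit_running` says whether 'running' is part of the IN list.
pub fn worker_claim(status: RunStatus, admit_running: bool) -> Result<(), &'static str> {
    let claimable = match status {
        RunStatus::Pending | RunStatus::Sleeping | RunStatus::Waiting => true,
        RunStatus::Running => admit_running,
        RunStatus::Completed | RunStatus::Blocked => false,
    };
    if claimable {
        Ok(())
    } else {
        Err("InvalidState")
    }
}
```

With `admit_running` set to false, every run that went through `scheduler_wake` is rejected by the worker, which matches the stalled smoke test.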
@isala404 isala404 changed the title rewrite: consolidate runtime, reactivity, security, and clustering rearchitect runtime, workflows, security, and reactivity May 21, 2026
@isala404 isala404 merged commit f42914e into main May 21, 2026
15 checks passed