feat(events): user-lifecycle eventing (outbox + signed webhooks) — #261#273
feat(events): user-lifecycle eventing (outbox + signed webhooks) — #261#273iarunsaragadam wants to merge 1 commit into
Conversation
…very Add pkg/events, the pluggable user-lifecycle eventing primitive: a typed Event for user.created / user.updated / user.deactivated, a Publisher interface with a no-op default (Discard), and an outbox-backed Publisher whose background worker delivers a signed webhook to every matching subscription at-least-once with retry and exponential backoff. - Delivery is idempotent by event id: the outbox dedupes the (event, subscription) pair, and the worker re-sends identical signed bytes, so a redelivery is safe for the subscriber to drop. - Payloads are HMAC-SHA256 signed (X-Identity-Signature) and carry the event id header; VerifySignature is exported for subscribers. - Failures are surfaced, not swallowed: a delivery that exhausts its retry budget is marked failed, logged, and reported via a FailureHook. - Persistence is behind OutboxStore; an in-memory store backs the single-node tier and the tests (a durable SQL outbox is a follow-up). Wire it through the service layer: AuthService and AdminService gain an optional WithEventPublisher; signup and OAuth/passwordless auto-create emit user.created, admin deactivate/delete emit user.deactivated. Emission is best-effort and never fails the originating RPC. A nil publisher is treated as the no-op, so existing deployments are byte-for-byte unchanged. Gate the whole feature behind GATEWAY_WEBHOOKS_ENABLED (default off) with typed retry/backoff/worker knobs validated at boot; app.New builds the publisher + worker only when enabled and ties the worker into the existing Start/Stop background-worker lifecycle. Deferred follow-ups (noted in the PR): durable SQL outbox across the postgres/sqlite drivers + conformance, the CreateWebhookSubscription / ListWebhookSubscriptions RPCs, and outbound SCIM provisioning that subscribes to this substrate.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7c5117de1d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // single-node tier; a durable SQL outbox is a follow-up. | ||
| var eventPublisher events.Publisher | ||
| if deps.Config.WebhooksEnabled { | ||
| outbox := events.NewMemoryOutbox() |
There was a problem hiding this comment.
Wire webhooks to a registrable outbox
When GATEWAY_WEBHOOKS_ENABLED is true, app.New creates a fresh local MemoryOutbox and never exposes it; the only mutator is MemoryOutbox.AddSubscription, and no config, Deps, or RPC path receives this instance. That means ListActiveSubscriptions is always empty in production, so user lifecycle emits return successfully without enqueueing any delivery and enabling webhooks silently delivers nothing.
Useful? React with 👍 / 👎.
| // EventUserUpdated is emitted after a user's mutable profile fields | ||
| // (name, email, status transitions other than deactivation) change. | ||
| EventUserUpdated EventType = "user.updated" |
There was a problem hiding this comment.
Emit the advertised user.updated event
This adds user.updated as a filterable wire value, but the existing mutation paths that change these fields (ProfileService.UpdateProfile, AdminService.UpdateUser/ReactivateUser, OAuth profile refresh, email verification/change) are not updated to publish it. A subscriber configured for user.updated will never be called when those profile/status fields change, leaving downstream provisioning state stale; either emit it from those paths or avoid exposing the event type until it is supported.
Useful? React with 👍 / 👎.
| } | ||
| }() | ||
| } | ||
| stopEvents = cancel |
There was a problem hiding this comment.
Wait for the webhook worker on Stop
Built.Stop is documented to drain background workers, but the webhook hook only calls cancel and returns without waiting for the worker.Run goroutine to exit. If shutdown happens while a webhook HTTP request is in flight, the process can continue past Stop before the delivery is marked delivered/rescheduled/failed, so shutdown can drop or leave delivery state inconsistent; mirror the sweeper's done-channel wait here.
Useful? React with 👍 / 👎.
iarunsaragadam
left a comment
There was a problem hiding this comment.
PR review gate: ❌ BLOCKED
| Reviewer | Effective verdict | Surviving blockers |
|---|---|---|
| Product | REQUEST_CHANGES | 1 |
| Security | APPROVE | 0 |
| Performance | APPROVE | 0 |
| Maintainability | APPROVE | 0 |
| Correctness | APPROVE | 0 |
Conditional reviewers: Contract & Migration — skipped (no proto/API/schema/migration changes); Accessibility & UX — skipped (no UI changes).
Confirmed blockers
Product
user.updatedevent type and config docs advertise an event that is never emitted.EventUserUpdated(user.updated) is defined and documented inpkg/events/events.go:35-37as "emitted after a user's mutable profile fields change", and the new config block documents the feature as fanning out "user create/update/deactivate events". No code path emits it:AdminService.UpdateUser(internal/service/admin_ops.go:216-250) mutates name/role/avatar and returns without callingemitUserEvent, and there are zero emit sites forEventUserUpdatedininternal/. A subscriber filtering onuser.updated(which the matching logic inoutbox.gosupports) silently receives nothing — the documented contract does not match behaviour.- Location:
internal/service/admin_ops.go:216-250;pkg/events/events.go:35-37;internal/config/config.go(doc comment "user create/update/deactivate"). - Fix: Emit
EventUserUpdatedfromAdminService.UpdateUser(and any other profile-mutation path) using the secret-free mapping, plus a test mirroring the deactivate test. Or, if update is intentionally out of scope for this slice, remove/clearly markEventUserUpdatedas not-yet-emitted and correct the config doc so the advertised contract matches reality.
- Location:
Refuted (raised but verification cleared them)
- Product — "Enabling
GATEWAY_WEBHOOKS_ENABLEDdelivers nothing". Refuted: the feature is dark by default and the only subscription-registration path is an in-process test seam with no production caller; the subscription RPC is explicitly deferred, so this is a tracked latent gap rather than a shipped footgun, and does not block this slice.
Non-blocking findings
Product
- No docs/changelog for the six new operator-facing
GATEWAY_WEBHOOKS_*env vars or theX-Identity-SignatureHMAC-SHA256 webhook contract on an OSS-distributed image (minor). Add a docs-site env reference + signature-verification recipe and a changelog entry.
Security
- Latent SSRF (major, non-blocking):
pkg/events/worker.go:37-45POSTs to unvalidated subscriber-supplied URLs — no https/scheme allowlist, no private/loopback/link-local/metadata IP block, default client follows redirects. Not externally reachable this slice (onlyMemoryOutbox.AddSubscription, no prod caller). Must be fixed in the same change that lands the tenant-facing subscription RPC: validate resolved IP (defeat DNS rebinding), https-only, disallow redirects, with a security test. - No per-subscription/per-tenant rate limiting or payload-size bound on outbound deliveries (nit). Track alongside the subscription RPC.
Performance
- Worker delivers a batch strictly sequentially — one slow/timing-out subscriber (10s) blocks the whole tick, worst case ~Batch×10s (major, non-blocking; in-memory single-node only). Fan out the claimed batch across a bounded pool (e.g.
errgroupwithSetLimit) before real traffic. ClaimDueis an O(all-time-deliveries) full-map scan every tick; terminal rows are never reclaimed (minor). Maintain a pending index / prune; for the SQL store ensure an index on(status, next_attempt_at)+ retention sweep.- Synchronous fan-out inside
Emitis fine for the map store but becomes 1 SELECT + N INSERTs on the signup/login hot path once the SQL outbox lands (minor). Batch into a single multi-row insert /INSERT ... SELECT. - No load/bench coverage for the delivery engine (nit). Add benchmarks for
ClaimDuescan andProcessOncethroughput with slow senders.
Maintainability
- Duplicated event-id generation logic across
internal/app/app.go:714(randomEventID) andinternal/service/events_emit.go:25(newEventID) — same knowledge in two places (major, non-blocking). Add one exportedevents.NewID()and delete both private copies. randomEventIDis misnamed: it mints delivery-row ids (Delivery.ID), not event ids, yet stamps anevt_prefix (minor). Rename (e.g.dlv_).- Config doc comments at
internal/config/config.go:489,494describe the aspirational end-state ("transactional outbox", "surfaced via audit") rather than the shipped in-memory outbox + logger surfacing (minor). Reword to match current behaviour and note the deferred path. - Retry-policy defaults duplicated between
DefaultRetryPolicy()(publisher.go:106) and config defaults (nit). Cross-reference or single-source them.
Correctness
- Claimed in-memory rows are not released on a mid-batch worker crash — they stay claimed for the process lifetime (minor; bounded, acceptable for in-memory store). The SQL-outbox follow-up must implement a visibility-timeout / lease-expiry.
Recommendation
This is a clean, well-tested, dark-by-default substrate that Security, Performance, Maintainability, and Correctness all cleared, and the "enabling delivers nothing" concern was refuted as a tracked deferral. One confirmed blocker remains: the user.updated event type and config docs advertise an event that no code path emits. Resolve it by either emitting EventUserUpdated from AdminService.UpdateUser with a test, or removing/marking the type as not-yet-emitted and correcting the config doc so the contract matches reality. Please also capture the latent SSRF and sequential-delivery items as explicit follow-ups gating the subscription RPC. Once the blocker is closed, this is mergeable.
Generated by the PR review gate (AGENTS.md §11) — 5 always-on reviewers + conditional Contract/Accessibility, with adversarial verification.
Closes #261.
What this lands
The eventing substrate the issue asks for: user-lifecycle events flow out of the service as signed webhooks, at-least-once, idempotent by event id — additive, no-op by default.
pkg/events(new, self-contained — no internal/* deps)Event+EventType(user.created,user.updated,user.deactivated, append-only wire values) carrying a secret-freeUser(no password hash/tokens).Publisherinterface with a no-op default (Discard).OutboxPublisher:Emitfans an event out to every matching active subscription by writing one durable pending row per subscription, then returns — no network I/O on the RPC path. Idempotent by event id (the outbox dedupes the(event, subscription)pair).Worker: claims due deliveries, POSTs each with an HMAC-SHA256 signature (X-Identity-Signature) + event-id header, treats 2xx as ack, retries with exponential backoff, and on retry-budget exhaustion marks the row failed, logs it, and reports it via aFailureHook(failures surfaced, not swallowed).VerifySignatureis exported for subscribers.OutboxStoreboundary with an in-memory implementation (MemoryOutbox) backing tests + single-node runs.Service wiring (backward-compatible, additive)
AuthService.WithEventPublisher/AdminService.WithEventPublisher— optional set-once dependency, matching the existingWithTenantAutoFormer/WithLoginGovernancepattern.user.createdon password signup and OAuth/passwordless auto-create;user.deactivatedon admin deactivate and delete. Best-effort: a publish failure is logged, never fails the RPC. A nil publisher = the no-op, so existing behaviour is unchanged.Config gate
GATEWAY_WEBHOOKS_ENABLED(default false) plus typed retry/backoff/worker knobs, validated at boot (validateWebhooks).app.Newbuilds the publisher + worker only when enabled and ties the worker into the existing Start/Stop background-worker lifecycle.Tests
pkg/events: fan-out matching, idempotency by event id, signing/verification round-trip + tamper rejection, retry-with-backoff-then-deliver, fail-after-max-attempts + FailureHook, double-claim prevention, backoff monotonic/capped, and a liveRunloop over httptest.internal/service: signup emitsuser.created; nil publisher no panic; publisher error doesn't fail the RPC; admin deactivate emitsuser.deactivated; secret-free payload mapping.internal/config: disabled-skips-checks, enabled-rejects-bad-knobs, Load defaults.go build ./...,go test ./pkg/events ./internal/config ./internal/service ./internal/app,golangci-lint runon the touched packages, andgo mod tidyare all clean. No new dependencies.Deferred (largest complete tested slice; deferrals per AGENTS.md)
OutboxStoreboundary is in place for it).CreateWebhookSubscription/ListWebhookSubscriptionsper-tenant RPCs (subscriptions are wired via the store seam today).These are clean follow-ups along feature boundaries; the tree is green and the no-op default keeps the feature dark until explicitly enabled.
🤖 Generated with Claude Code