Skip to content

feat(events): user-lifecycle eventing (outbox + signed webhooks) — #261#273

Open
iarunsaragadam wants to merge 1 commit into
mainfrom
feat/issue-261-user-lifecycle-events
Open

feat(events): user-lifecycle eventing (outbox + signed webhooks) — #261#273
iarunsaragadam wants to merge 1 commit into
mainfrom
feat/issue-261-user-lifecycle-events

Conversation

@iarunsaragadam

Copy link
Copy Markdown
Contributor

Closes #261.

What this lands

The eventing substrate the issue asks for: user-lifecycle events flow out of the service as signed webhooks, at-least-once, idempotent by event id — additive, no-op by default.

pkg/events (new, self-contained — no internal/* deps)

  • Typed Event + EventType (user.created, user.updated, user.deactivated, append-only wire values) carrying a secret-free User (no password hash/tokens).
  • Publisher interface with a no-op default (Discard).
  • OutboxPublisher: Emit fans an event out to every matching active subscription by writing one durable pending row per subscription, then returns — no network I/O on the RPC path. Idempotent by event id (the outbox dedupes the (event, subscription) pair).
  • Worker: claims due deliveries, POSTs each with an HMAC-SHA256 signature (X-Identity-Signature) + event-id header, treats 2xx as ack, retries with exponential backoff, and on retry-budget exhaustion marks the row failed, logs it, and reports it via a FailureHook (failures surfaced, not swallowed). VerifySignature is exported for subscribers.
  • OutboxStore boundary with an in-memory implementation (MemoryOutbox) backing tests + single-node runs.

Service wiring (backward-compatible, additive)

  • AuthService.WithEventPublisher / AdminService.WithEventPublisher — optional set-once dependency, matching the existing WithTenantAutoFormer / WithLoginGovernance pattern.
  • Emits user.created on password signup and OAuth/passwordless auto-create; user.deactivated on admin deactivate and delete. Best-effort: a publish failure is logged, never fails the RPC. A nil publisher = the no-op, so existing behaviour is unchanged.

Config gate

  • GATEWAY_WEBHOOKS_ENABLED (default false) plus typed retry/backoff/worker knobs, validated at boot (validateWebhooks). app.New builds the publisher + worker only when enabled and ties the worker into the existing Start/Stop background-worker lifecycle.

Tests

  • pkg/events: fan-out matching, idempotency by event id, signing/verification round-trip + tamper rejection, retry-with-backoff-then-deliver, fail-after-max-attempts + FailureHook, double-claim prevention, backoff monotonic/capped, and a live Run loop over httptest.
  • internal/service: signup emits user.created; nil publisher no panic; publisher error doesn't fail the RPC; admin deactivate emits user.deactivated; secret-free payload mapping.
  • internal/config: disabled-skips-checks, enabled-rejects-bad-knobs, Load defaults.

go build ./..., go test ./pkg/events ./internal/config ./internal/service ./internal/app, golangci-lint run on the touched packages, and go mod tidy are all clean. No new dependencies.

Deferred (largest complete tested slice; deferrals per AGENTS.md)

  • Durable SQL outbox across the postgres/sqlite drivers + conformance (this slice uses the in-memory outbox; the OutboxStore boundary is in place for it).
  • CreateWebhookSubscription / ListWebhookSubscriptions per-tenant RPCs (subscriptions are wired via the store seam today).
  • Outbound SCIM provisioning, which subscribes to this same events substrate.

These are clean follow-ups along feature boundaries; the tree is green and the no-op default keeps the feature dark until explicitly enabled.

🤖 Generated with Claude Code

…very

Add pkg/events, the pluggable user-lifecycle eventing primitive: a typed
Event for user.created / user.updated / user.deactivated, a Publisher
interface with a no-op default (Discard), and an outbox-backed Publisher
whose background worker delivers a signed webhook to every matching
subscription at-least-once with retry and exponential backoff.

- Delivery is idempotent by event id: the outbox dedupes the
  (event, subscription) pair, and the worker re-sends identical signed
  bytes, so a redelivery is safe for the subscriber to drop.
- Payloads are HMAC-SHA256 signed (X-Identity-Signature) and carry the
  event id header; VerifySignature is exported for subscribers.
- Failures are surfaced, not swallowed: a delivery that exhausts its retry
  budget is marked failed, logged, and reported via a FailureHook.
- Persistence is behind OutboxStore; an in-memory store backs the
  single-node tier and the tests (a durable SQL outbox is a follow-up).

Wire it through the service layer: AuthService and AdminService gain an
optional WithEventPublisher; signup and OAuth/passwordless auto-create
emit user.created, admin deactivate/delete emit user.deactivated. Emission
is best-effort and never fails the originating RPC. A nil publisher is
treated as the no-op, so existing deployments are byte-for-byte unchanged.

Gate the whole feature behind GATEWAY_WEBHOOKS_ENABLED (default off) with
typed retry/backoff/worker knobs validated at boot; app.New builds the
publisher + worker only when enabled and ties the worker into the existing
Start/Stop background-worker lifecycle.

Deferred follow-ups (noted in the PR): durable SQL outbox across the
postgres/sqlite drivers + conformance, the CreateWebhookSubscription /
ListWebhookSubscriptions RPCs, and outbound SCIM provisioning that
subscribes to this substrate.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7c5117de1d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/app/app.go
// single-node tier; a durable SQL outbox is a follow-up.
var eventPublisher events.Publisher
if deps.Config.WebhooksEnabled {
outbox := events.NewMemoryOutbox()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Wire webhooks to a registrable outbox

When GATEWAY_WEBHOOKS_ENABLED is true, app.New creates a fresh local MemoryOutbox and never exposes it; the only mutator is MemoryOutbox.AddSubscription, and no config, Deps, or RPC path receives this instance. That means ListActiveSubscriptions is always empty in production, so user lifecycle emits return successfully without enqueueing any delivery and enabling webhooks silently delivers nothing.

Useful? React with 👍 / 👎.

Comment thread pkg/events/events.go
Comment on lines +35 to +37
// EventUserUpdated is emitted after a user's mutable profile fields
// (name, email, status transitions other than deactivation) change.
EventUserUpdated EventType = "user.updated"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Emit the advertised user.updated event

This adds user.updated as a filterable wire value, but the existing mutation paths that change these fields (ProfileService.UpdateProfile, AdminService.UpdateUser/ReactivateUser, OAuth profile refresh, email verification/change) are not updated to publish it. A subscriber configured for user.updated will never be called when those profile/status fields change, leaving downstream provisioning state stale; either emit it from those paths or avoid exposing the event type until it is supported.

Useful? React with 👍 / 👎.

Comment thread internal/app/app.go
}
}()
}
stopEvents = cancel

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Wait for the webhook worker on Stop

Built.Stop is documented to drain background workers, but the webhook hook only calls cancel and returns without waiting for the worker.Run goroutine to exit. If shutdown happens while a webhook HTTP request is in flight, the process can continue past Stop before the delivery is marked delivered/rescheduled/failed, so shutdown can drop or leave delivery state inconsistent; mirror the sweeper's done-channel wait here.

Useful? React with 👍 / 👎.

@iarunsaragadam iarunsaragadam left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR review gate: ❌ BLOCKED

Reviewer Effective verdict Surviving blockers
Product REQUEST_CHANGES 1
Security APPROVE 0
Performance APPROVE 0
Maintainability APPROVE 0
Correctness APPROVE 0

Conditional reviewers: Contract & Migration — skipped (no proto/API/schema/migration changes); Accessibility & UX — skipped (no UI changes).

Confirmed blockers

Product

  • user.updated event type and config docs advertise an event that is never emitted. EventUserUpdated (user.updated) is defined and documented in pkg/events/events.go:35-37 as "emitted after a user's mutable profile fields change", and the new config block documents the feature as fanning out "user create/update/deactivate events". No code path emits it: AdminService.UpdateUser (internal/service/admin_ops.go:216-250) mutates name/role/avatar and returns without calling emitUserEvent, and there are zero emit sites for EventUserUpdated in internal/. A subscriber filtering on user.updated (which the matching logic in outbox.go supports) silently receives nothing — the documented contract does not match behaviour.
    • Location: internal/service/admin_ops.go:216-250; pkg/events/events.go:35-37; internal/config/config.go (doc comment "user create/update/deactivate").
    • Fix: Emit EventUserUpdated from AdminService.UpdateUser (and any other profile-mutation path) using the secret-free mapping, plus a test mirroring the deactivate test. Or, if update is intentionally out of scope for this slice, remove/clearly mark EventUserUpdated as not-yet-emitted and correct the config doc so the advertised contract matches reality.

Refuted (raised but verification cleared them)

  • Product — "Enabling GATEWAY_WEBHOOKS_ENABLED delivers nothing". Refuted: the feature is dark by default and the only subscription-registration path is an in-process test seam with no production caller; the subscription RPC is explicitly deferred, so this is a tracked latent gap rather than a shipped footgun, and does not block this slice.

Non-blocking findings

Product

  • No docs/changelog for the six new operator-facing GATEWAY_WEBHOOKS_* env vars or the X-Identity-Signature HMAC-SHA256 webhook contract on an OSS-distributed image (minor). Add a docs-site env reference + signature-verification recipe and a changelog entry.

Security

  • Latent SSRF (major, non-blocking): pkg/events/worker.go:37-45 POSTs to unvalidated subscriber-supplied URLs — no https/scheme allowlist, no private/loopback/link-local/metadata IP block, default client follows redirects. Not externally reachable this slice (only MemoryOutbox.AddSubscription, no prod caller). Must be fixed in the same change that lands the tenant-facing subscription RPC: validate resolved IP (defeat DNS rebinding), https-only, disallow redirects, with a security test.
  • No per-subscription/per-tenant rate limiting or payload-size bound on outbound deliveries (nit). Track alongside the subscription RPC.

Performance

  • Worker delivers a batch strictly sequentially — one slow/timing-out subscriber (10s) blocks the whole tick, worst case ~Batch×10s (major, non-blocking; in-memory single-node only). Fan out the claimed batch across a bounded pool (e.g. errgroup with SetLimit) before real traffic.
  • ClaimDue is an O(all-time-deliveries) full-map scan every tick; terminal rows are never reclaimed (minor). Maintain a pending index / prune; for the SQL store ensure an index on (status, next_attempt_at) + retention sweep.
  • Synchronous fan-out inside Emit is fine for the map store but becomes 1 SELECT + N INSERTs on the signup/login hot path once the SQL outbox lands (minor). Batch into a single multi-row insert / INSERT ... SELECT.
  • No load/bench coverage for the delivery engine (nit). Add benchmarks for ClaimDue scan and ProcessOnce throughput with slow senders.

Maintainability

  • Duplicated event-id generation logic across internal/app/app.go:714 (randomEventID) and internal/service/events_emit.go:25 (newEventID) — same knowledge in two places (major, non-blocking). Add one exported events.NewID() and delete both private copies.
  • randomEventID is misnamed: it mints delivery-row ids (Delivery.ID), not event ids, yet stamps an evt_ prefix (minor). Rename (e.g. dlv_).
  • Config doc comments at internal/config/config.go:489,494 describe the aspirational end-state ("transactional outbox", "surfaced via audit") rather than the shipped in-memory outbox + logger surfacing (minor). Reword to match current behaviour and note the deferred path.
  • Retry-policy defaults duplicated between DefaultRetryPolicy() (publisher.go:106) and config defaults (nit). Cross-reference or single-source them.

Correctness

  • Claimed in-memory rows are not released on a mid-batch worker crash — they stay claimed for the process lifetime (minor; bounded, acceptable for in-memory store). The SQL-outbox follow-up must implement a visibility-timeout / lease-expiry.

Recommendation

This is a clean, well-tested, dark-by-default substrate that Security, Performance, Maintainability, and Correctness all cleared, and the "enabling delivers nothing" concern was refuted as a tracked deferral. One confirmed blocker remains: the user.updated event type and config docs advertise an event that no code path emits. Resolve it by either emitting EventUserUpdated from AdminService.UpdateUser with a test, or removing/marking the type as not-yet-emitted and correcting the config doc so the contract matches reality. Please also capture the latent SSRF and sequential-delivery items as explicit follow-ups gating the subscription RPC. Once the blocker is closed, this is mergeable.

Generated by the PR review gate (AGENTS.md §11) — 5 always-on reviewers + conditional Contract/Accessibility, with adversarial verification.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

User-lifecycle eventing (transactional outbox + webhooks) and outbound SCIM provisioning to downstream SaaS

1 participant