feat: cmd/notifyd standalone server + Docker + CI conformance matrix + release workflow#1
Conversation
…am, observability)
Adds the internal/server package: the testable wiring layer between the
Connect service contract and the notify orchestrator.
- config.go — env-driven Config with deterministic defaults and
fail-fast validation; never silently accepts a
misconfigured production deploy.
- auth.go — HS256 JWT validator + Connect interceptors; dev-mode
escape hatch for local-dev only.
- handlers.go — Connect handlers for NotificationInternalService and
all unary RPCs of NotificationClientService.
- stream.go — Server-streaming handler: handshake heartbeat carries
session_id, ticker fires per LiveConnections cadence,
ctx cancel unregisters cleanly.
- middleware.go — Logging + recovery + stream-auth shim. Recovery
covers both unary and streaming code paths.
- observability.go — /healthz + /metrics placeholder mux; JSON slog
handler for production-grade structured logs.
- inapp.go — notify.Provider implementation that bridges the
orchestrator with realtime.Registry[*StreamEvent].
- mapping.go — proto ↔ domain conversions in one place.
- server.go — Server struct: lifecycle, listener binding, signal
handling, graceful shutdown with deadline.
Tests cover handlers (validation, success, error mapping, store/notifier
errors propagated as the right Connect codes), auth (HS256 happy path +
8 failure modes + dev-mode bypass), config (every env var parsed, every
invalid value rejected at boot), stream (handshake, registry delivery,
heartbeat ticker, client cancel cleanup, backpressure no-deadlock),
middleware (logging captures latency + user_id, recovery turns panics
into Internal), and server boot/shutdown round-trip.
79 test functions in internal/server, 85.3% statement coverage, all
under -race.
Twenty-line main() that delegates to internal/server. The only thing that does NOT live in internal/server is the construction of the Postgres / EntDB store drivers — keeping those out of the server package means in-process library consumers can import server without pulling pgx + the EntDB SDK + their transitive closure. version + commit are stamped via -ldflags at link time so the running container reports the exact tag it was built from. .gitignore: tighten the binary glob from `notifyd` to `/notifyd` so the source directory `cmd/notifyd/` is not silently swallowed by the ignore.
Mirrors the identity Dockerfile shape: - golang:1.26.3-alpine3.23 builder with TARGETOS/TARGETARCH from buildx so the same Dockerfile produces linux/amd64 and linux/arm64. - ldflags strip + inject version/commit from build args. - FROM scratch final stage with ca-certs + USER 65532. - Exposes 8080 (client Connect), 8081 (internal gRPC), 9090 (metrics). .dockerignore drops .git, *.md, test files and testdata from the build context so multi-arch builds are not invalidated on every edit unrelated to the compiled binary.
conformance.yml mirrors identity's pattern: - Driver matrix (memory · postgres · entdb) with service containers attached at the job level; the 'Conformance / <driver>' check name is the branch-protection pin. - A unit job that runs go build / go vet / go test -race across everything except the driver packages, so PRs catch handler/server regressions even if a driver-specific job is flaky. release.yml mirrors identity's: - Re-runs test + docker smoke on tag push or workflow_dispatch. - Multi-arch buildx (linux/amd64,linux/arm64) → ghcr.io/elloloop/notify with semver + sha tags. - cosign keyless OIDC signing of every published tag. - Trivy SARIF scan with HIGH/CRITICAL gate, results uploaded to GitHub Security tab. Trivy action pinned by SHA per the March 2026 advisory. - Proto archive published to the GitHub Release so consumers can pin to a version without vendoring a git ref.
Replace the 'early scaffold / in progress' section with a Status table that reflects wave-2 reality: container, conformance matrix and the internal/server wiring are all in tree. Add a Deploy section with a docker pull / docker run example and a full env-var reference table (37 vars) covering ports, store driver, auth, live-connections and every channel block. Operators reading the README can now stand up a notify container without spelunking the internal/server source.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1e47c38637
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if id == "" { | ||
| return nil, connect.NewError(connect.CodeInvalidArgument, errors.New("id is required")) | ||
| } | ||
| if err := h.store.UpdateStatus(ctx, claims.TenantID, id, notify.StatusRead, h.now()); err != nil { |
There was a problem hiding this comment.
Scope acknowledgements to the authenticated user
When a client calls AckNotification with another user's notification id in the same tenant, this updates it because UpdateStatus only checks (tenantID, id) (see store.go) and the handler never uses claims.UserID. That lets any authenticated tenant member mark any known notification read; verify ownership first, e.g. via GetNotification(ctx, claims.TenantID, claims.UserID, id), before updating.
Useful? React with 👍 / 👎.
| // Push to every live connection for the user. Push returns the count of | ||
| // accepted connections — we don't act on it: offline users still have | ||
| // the row stored, so the orchestrator's StatusDelivered is correct. | ||
| p.registry.Push(msg.To, ev) |
There was a problem hiding this comment.
Prevent cross-tenant in-app fanout
For deployments where the same userID can exist in multiple tenants, this pushes tenant A's notification to every live connection with that user id, including tenant B, because realtime.Registry is keyed and pushed only by user id (Register stores byUser[c.UserID], and Push accepts just userID). The stream records TenantID on the connection, but this call discards it, so in-app notifications can leak across tenants unless the registry/fanout is scoped by (tenantID, userID) or filters connections by msg.Notification.TenantID.
Useful? React with 👍 / 👎.
| } | ||
| out := ¬ifyv1.GetNotificationsResponse{ | ||
| Notifications: make([]*notifyv1.Notification, 0, len(items)), | ||
| NextCursor: next, |
There was a problem hiding this comment.
Return encoded cursors that clients can replay
When a page has more results, each store implementation returns next as a raw millisecond string, but the next request path requires cursors to be base64-url encoded via decodeCursor. Returning the raw value here means clients that round-trip NextCursor get InvalidArgument instead of page two; encode the store cursor before putting it on the wire.
Useful? React with 👍 / 👎.
…, in-app pending
Three real bugs surfaced during principal review of the PR. All three are
in this commit so the fix and the regression test ride together.
1. **Cursor round-trip is broken end-to-end.** Store.QueryUserNotifications
returns a raw decimal-ms cursor; the handler set it on
GetNotificationsResponse.NextCursor verbatim, but decodeCursor expected
a base64-encoded payload. A client that paged with the value the server
just gave it got CodeInvalidArgument on every second page (base64 decode
fails on `1700000000000` at byte 12). The earlier `TestCursorRoundTrip`
round-tripped encode→decode in isolation, never through a real handler
call, which is why CI didn't catch it.
Fix: drop the misleading base64 layer entirely. The wire format is the
store's opaque string — clients never inspect the cursor, so there is no
value in obfuscating it, and matching the store's format makes paging
trivially debuggable in production logs. New
`TestClientHandler_GetNotifications_RealRoundTrip` feeds a response's
NextCursor back as the next request's Cursor and asserts the handler
accepts it — the regression test the original suite was missing.
2. **AckNotification could mark another user's row as Read.** The Store
contract for UpdateStatus takes (tenantID, id, status, atMS) — no userID.
The handler called UpdateStatus with the caller's tenantID and the
request's id, so any user in tenant T could ack any other user's
notification in T by guessing the id (a tenant-scoped privilege
escalation). Surfaces only when a tenant has more than one user, which
is why the single-user-per-tenant tests missed it.
Fix: ownership check before UpdateStatus — GetNotification(tenant, user,
id) must succeed first. A cross-user or cross-tenant lookup miss
surfaces as ErrNotFound → CodeNotFound on the wire, never
CodePermissionDenied (which would leak the existence of another user's
row). New CrossUser + CrossTenant + LookupInternalError regression tests
added; the original NotFound and InternalError tests updated to populate
the fake's getResult on the happy path.
3. **In-app provider lied about delivery when no connections were live.**
inAppProvider.Send always returned StatusDelivered regardless of how
many connections Registry.Push delivered to, so an offline user's row
was marked Delivered even though no live device received the push.
Downstream consumers reading the row (unread filter, ack flow,
analytics) would see "delivered" when really the user never saw it.
Fix: when Registry.Push returns 0, return Receipt{Status: StatusPending}
so the row stays unread and surfaces in the next GetNotifications page.
1+ connections → StatusDelivered as before. Two new tests cover the
zero-connections and N-connections fan-out cases.
go test ./... -race -count=1 stays green; internal/server coverage rises
from 85.3% to 85.5% with the new regression tests.
Three small follow-ups to the fix commit: - cmd/notifyd: drop the unused notify import + the `var _ = notify.ChannelInApp` placeholder that existed only to keep the import referenced. Nothing in main.go calls into the root notify package directly — every type goes through server.Config. - cmd/notifyd: stop wiring Dependencies.StoreCloser. The cleanup function from buildDependencies is the single source of truth for store close during graceful shutdown; passing StoreCloser on top would close pgx pools / entdb SDK clients twice. server.Shutdown still honours StoreCloser when tests pass one explicitly. - server/auth: replace the in-package constantTimeEqual helper with crypto/subtle.ConstantTimeCompare. The stdlib carries the byte-compare primitive and the TestInternalAuthInterceptor suite covers the end-to-end match / mismatch / missing-header paths, so the standalone unit test for the inlined helper is no longer needed. - server: tighten the Server.Addr() doc comment — it returns "" when Run has not bound the listener, never panics. Previous comment said "panicking" which was simply wrong. go test ./... -race -count=1 stays green; coverage on internal/server is 85.3% (steady).
Scope
Wave-2 of
elloloop/notify: the standalone container plus its packaging and CI.This brings the platform up to the same dual-mode shape as
elloloop/identity— a library you can embed AND aghcr.io/elloloop/notify:<version>image you can deploy.File map
Test counts
All
-race, all green:Container smoke
Boot logs (JSON, stamped via
slog.JSONHandler):{"level":"INFO","msg":"notifyd_starting","version":"dev","store_driver":"memory","live_connections_enabled":true,"client_port":8080,"internal_port":8081,"metrics_port":9090} {"level":"INFO","msg":"server_listen","name":"client","addr":"[::]:8080"} {"level":"INFO","msg":"server_listen","name":"internal","addr":"[::]:8081"} {"level":"INFO","msg":"server_listen","name":"metrics","addr":"[::]:9090"}Reviewer hot-spots
server.go::buildClientServer).The unary Connect interceptor handles every unary RPC, but Connect's
UnaryInterceptorFuncdoes NOT run against server-streaming RPCs.streamAuthShimre-applies the same validator at the streamhandshake — please double-check the contract is tight.
inapp.go). Notifier passes thestored row through
Message.Notification; the provider pushes aNotificationEventto every live connection viaRegistry.Push.Offline users still get the row stored —
Sendalways returnsStatusDelivered. If you'd rather mark offline users asPendingwhen zero connections accept, that's a one-line change.
cmd/notifyd, not ininternal/server. The reason is documented inserver.go::buildStoreand
cmd/notifyd/main.go: keepingpgx+ the EntDB SDK out ofinternal/servermeans a library consumer who embeds the serverpackage does not pay for two heavy transitive closures they don't
use. If you want to merge them, the symmetric move is to gate the
drivers behind build tags.
conformance.yml's matrix shape). I dropped the
lint/vuln/fuzz/integrationjobs because notify does not yet ship a.golangci.yml, fuzz targets, ortests/integration/. Easy toreintroduce once those land — the workflow leaves their slots open.
.gitignorenotifyd→/notifyd. The unanchored glob wouldhave swallowed the
cmd/notifyd/source directory. Anchoring tothe repo root keeps the binary-ignore behaviour while letting the
source live where Go expects it.
NOT done in this PR
/metricsreturns a placeholder exposition.Real Prometheus counters/histograms slot into
observability.gowithout changing the mux shape.
NOTIFY_ALLOWED_ORIGINSis parsed into
Config.LiveConnections.AllowedOriginsbut nomiddleware consumes it yet — slot for the next wave.