Skip to content

feat: cmd/notifyd standalone server + Docker + CI conformance matrix + release workflow#1

Merged
iarunsaragadam merged 7 commits into
mainfrom
feat/wave2-server-and-packaging
May 27, 2026
Merged

feat: cmd/notifyd standalone server + Docker + CI conformance matrix + release workflow#1
iarunsaragadam merged 7 commits into
mainfrom
feat/wave2-server-and-packaging

Conversation

@iarunsaragadam

Copy link
Copy Markdown
Contributor

Scope

Wave-2 of elloloop/notify: the standalone container plus its packaging and CI.
This brings the platform up to the same dual-mode shape as
elloloop/identity — a library you can embed AND a
ghcr.io/elloloop/notify:<version> image you can deploy.

File map

cmd/notifyd/main.go                — thin entry point (~80 lines)
internal/server/
  config.go         + config_test.go         — env → Config, validate
  auth.go           + auth_test.go           — HS256 JWT + dev validator + Connect interceptors
  handlers.go       + handlers_test.go       — Notify, Get/Ack/Register, mapping
  stream.go         + stream_test.go         — StreamEvents server-stream + heartbeat
  middleware.go     + middleware_test.go     — logging + recovery + stream auth shim
  observability.go                           — /healthz, /metrics, JSON slog
  inapp.go          + inapp_test.go          — notify.Provider over realtime.Registry
  mapping.go                                 — proto ↔ domain
  server.go         + server_test.go         — lifecycle, listeners, shutdown
.github/workflows/
  conformance.yml                            — driver matrix (memory · postgres · entdb) + unit job
  release.yml                                — tag push → buildx multi-arch + cosign + Trivy
Dockerfile                                   — multi-arch builder, FROM scratch final
.dockerignore                                — drops .git, *.md, test files
README.md                                    — Status table + Deploy + env-var reference
.gitignore                                   — tighten `notifyd` → `/notifyd`

Test counts

internal/server/auth_test.go                 11
internal/server/config_test.go               15
internal/server/handlers_test.go             27
internal/server/inapp_test.go                 4
internal/server/middleware_test.go            6
internal/server/server_test.go               10
internal/server/stream_test.go                6
                                            ───
                                             79 test functions
                          85.3% statement coverage on internal/server

All -race, all green:

$ go build ./...    # clean
$ go vet ./...      # clean
$ go test ./... -race -count=1 -timeout=180s
ok  github.com/elloloop/notify
ok  github.com/elloloop/notify/channels/email/emailservice
ok  github.com/elloloop/notify/channels/fcm
ok  github.com/elloloop/notify/channels/twilio
ok  github.com/elloloop/notify/channels/webpush
ok  github.com/elloloop/notify/gen/go/notify/v1
ok  github.com/elloloop/notify/internal/server     2.801s
ok  github.com/elloloop/notify/store/entdb
ok  github.com/elloloop/notify/store/memory
ok  github.com/elloloop/notify/store/postgres

Container smoke

$ docker build --platform linux/amd64 -t notify:local .
# ✓ builds

$ docker run -d --rm -p 19090:9090 \
    -e NOTIFY_AUTH_DEV_MODE=true \
    -e NOTIFY_STORE_DRIVER=memory \
    notify:local
$ curl http://127.0.0.1:19090/healthz
{"status":"ok"}
$ curl http://127.0.0.1:19090/metrics
# notify metrics — placeholder

Boot logs (JSON, stamped via slog.JSONHandler):

{"level":"INFO","msg":"notifyd_starting","version":"dev","store_driver":"memory","live_connections_enabled":true,"client_port":8080,"internal_port":8081,"metrics_port":9090}
{"level":"INFO","msg":"server_listen","name":"client","addr":"[::]:8080"}
{"level":"INFO","msg":"server_listen","name":"internal","addr":"[::]:8081"}
{"level":"INFO","msg":"server_listen","name":"metrics","addr":"[::]:9090"}

Reviewer hot-spots

  1. Auth interceptor wiring (server.go::buildClientServer).
    The unary Connect interceptor handles every unary RPC, but Connect's
    UnaryInterceptorFunc does NOT run against server-streaming RPCs.
    streamAuthShim re-applies the same validator at the stream
    handshake — please double-check the contract is tight.
  2. In-app provider semantics (inapp.go). Notifier passes the
    stored row through Message.Notification; the provider pushes a
    NotificationEvent to every live connection via Registry.Push.
    Offline users still get the row stored — Send always returns
    StatusDelivered. If you'd rather mark offline users as Pending
    when zero connections accept, that's a one-line change.
  3. Store driver wiring lives in cmd/notifyd, not in
    internal/server. The reason is documented in server.go::buildStore
    and cmd/notifyd/main.go: keeping pgx + the EntDB SDK out of
    internal/server means a library consumer who embeds the server
    package does not pay for two heavy transitive closures they don't
    use. If you want to merge them, the symmetric move is to gate the
    drivers behind build tags.
  4. CI workflows are mirrored from identity (release.yml step-for-step,
    conformance.yml's matrix shape). I dropped the lint / vuln /
    fuzz / integration jobs because notify does not yet ship a
    .golangci.yml, fuzz targets, or tests/integration/. Easy to
    reintroduce once those land — the workflow leaves their slots open.
  5. .gitignore notifyd/notifyd. The unanchored glob would
    have swallowed the cmd/notifyd/ source directory. Anchoring to
    the repo root keeps the binary-ignore behaviour while letting the
    source live where Go expects it.

NOT done in this PR

  • Wave-3 metrics: /metrics returns a placeholder exposition.
    Real Prometheus counters/histograms slot into observability.go
    without changing the mux shape.
  • Real CORS handling on the client surface. NOTIFY_ALLOWED_ORIGINS
    is parsed into Config.LiveConnections.AllowedOrigins but no
    middleware consumes it yet — slot for the next wave.

…am, observability)

Adds the internal/server package: the testable wiring layer between the
Connect service contract and the notify orchestrator.

- config.go        — env-driven Config with deterministic defaults and
                     fail-fast validation; never silently accepts a
                     misconfigured production deploy.
- auth.go          — HS256 JWT validator + Connect interceptors; dev-mode
                     escape hatch for local-dev only.
- handlers.go      — Connect handlers for NotificationInternalService and
                     all unary RPCs of NotificationClientService.
- stream.go        — Server-streaming handler: handshake heartbeat carries
                     session_id, ticker fires per LiveConnections cadence,
                     ctx cancel unregisters cleanly.
- middleware.go    — Logging + recovery + stream-auth shim. Recovery
                     covers both unary and streaming code paths.
- observability.go — /healthz + /metrics placeholder mux; JSON slog
                     handler for production-grade structured logs.
- inapp.go         — notify.Provider implementation that bridges the
                     orchestrator with realtime.Registry[*StreamEvent].
- mapping.go       — proto ↔ domain conversions in one place.
- server.go        — Server struct: lifecycle, listener binding, signal
                     handling, graceful shutdown with deadline.

Tests cover handlers (validation, success, error mapping, store/notifier
errors propagated as the right Connect codes), auth (HS256 happy path +
8 failure modes + dev-mode bypass), config (every env var parsed, every
invalid value rejected at boot), stream (handshake, registry delivery,
heartbeat ticker, client cancel cleanup, backpressure no-deadlock),
middleware (logging captures latency + user_id, recovery turns panics
into Internal), and server boot/shutdown round-trip.

79 test functions in internal/server, 85.3% statement coverage, all
under -race.
Twenty-line main() that delegates to internal/server. The only thing
that does NOT live in internal/server is the construction of the
Postgres / EntDB store drivers — keeping those out of the server
package means in-process library consumers can import server without
pulling pgx + the EntDB SDK + their transitive closure.

version + commit are stamped via -ldflags at link time so the running
container reports the exact tag it was built from.

.gitignore: tighten the binary glob from `notifyd` to `/notifyd` so
the source directory `cmd/notifyd/` is not silently swallowed by the
ignore.
Mirrors the identity Dockerfile shape:
- golang:1.26.3-alpine3.23 builder with TARGETOS/TARGETARCH from buildx
  so the same Dockerfile produces linux/amd64 and linux/arm64.
- ldflags strip + inject version/commit from build args.
- FROM scratch final stage with ca-certs + USER 65532.
- Exposes 8080 (client Connect), 8081 (internal gRPC), 9090 (metrics).

.dockerignore drops .git, *.md, test files and testdata from the build
context so multi-arch builds are not invalidated on every edit unrelated
to the compiled binary.
conformance.yml mirrors identity's pattern:
- Driver matrix (memory · postgres · entdb) with service containers
  attached at the job level; the 'Conformance / <driver>' check name
  is the branch-protection pin.
- A unit job that runs go build / go vet / go test -race across
  everything except the driver packages, so PRs catch handler/server
  regressions even if a driver-specific job is flaky.

release.yml mirrors identity's:
- Re-runs test + docker smoke on tag push or workflow_dispatch.
- Multi-arch buildx (linux/amd64,linux/arm64) → ghcr.io/elloloop/notify
  with semver + sha tags.
- cosign keyless OIDC signing of every published tag.
- Trivy SARIF scan with HIGH/CRITICAL gate, results uploaded to GitHub
  Security tab. Trivy action pinned by SHA per the March 2026 advisory.
- Proto archive published to the GitHub Release so consumers can pin
  to a version without vendoring a git ref.
Replace the 'early scaffold / in progress' section with a Status table
that reflects wave-2 reality: container, conformance matrix and the
internal/server wiring are all in tree.

Add a Deploy section with a docker pull / docker run example and a
full env-var reference table (37 vars) covering ports, store driver,
auth, live-connections and every channel block. Operators reading the
README can now stand up a notify container without spelunking the
internal/server source.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1e47c38637

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if id == "" {
return nil, connect.NewError(connect.CodeInvalidArgument, errors.New("id is required"))
}
if err := h.store.UpdateStatus(ctx, claims.TenantID, id, notify.StatusRead, h.now()); err != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Scope acknowledgements to the authenticated user

When a client calls AckNotification with another user's notification id in the same tenant, this updates it because UpdateStatus only checks (tenantID, id) (see store.go) and the handler never uses claims.UserID. That lets any authenticated tenant member mark any known notification read; verify ownership first, e.g. via GetNotification(ctx, claims.TenantID, claims.UserID, id), before updating.

Useful? React with 👍 / 👎.

Comment thread internal/server/inapp.go Outdated
// Push to every live connection for the user. Push returns the count of
// accepted connections — we don't act on it: offline users still have
// the row stored, so the orchestrator's StatusDelivered is correct.
p.registry.Push(msg.To, ev)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Prevent cross-tenant in-app fanout

For deployments where the same userID can exist in multiple tenants, this pushes tenant A's notification to every live connection with that user id, including tenant B, because realtime.Registry is keyed and pushed only by user id (Register stores byUser[c.UserID], and Push accepts just userID). The stream records TenantID on the connection, but this call discards it, so in-app notifications can leak across tenants unless the registry/fanout is scoped by (tenantID, userID) or filters connections by msg.Notification.TenantID.

Useful? React with 👍 / 👎.

}
out := &notifyv1.GetNotificationsResponse{
Notifications: make([]*notifyv1.Notification, 0, len(items)),
NextCursor: next,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Return encoded cursors that clients can replay

When a page has more results, each store implementation returns next as a raw millisecond string, but the next request path requires cursors to be base64-url encoded via decodeCursor. Returning the raw value here means clients that round-trip NextCursor get InvalidArgument instead of page two; encode the store cursor before putting it on the wire.

Useful? React with 👍 / 👎.

…, in-app pending

Three real bugs surfaced during principal review of the PR. All three are
in this commit so the fix and the regression test ride together.

1. **Cursor round-trip is broken end-to-end.** Store.QueryUserNotifications
   returns a raw decimal-ms cursor; the handler set it on
   GetNotificationsResponse.NextCursor verbatim, but decodeCursor expected
   a base64-encoded payload. A client that paged with the value the server
   just gave it got CodeInvalidArgument on every second page (base64 decode
   fails on `1700000000000` at byte 12). The earlier `TestCursorRoundTrip`
   round-tripped encode→decode in isolation, never through a real handler
   call, which is why CI didn't catch it.

   Fix: drop the misleading base64 layer entirely. The wire format is the
   store's opaque string — clients never inspect the cursor, so there is no
   value in obfuscating it, and matching the store's format makes paging
   trivially debuggable in production logs. New
   `TestClientHandler_GetNotifications_RealRoundTrip` feeds a response's
   NextCursor back as the next request's Cursor and asserts the handler
   accepts it — the regression test the original suite was missing.

2. **AckNotification could mark another user's row as Read.** The Store
   contract for UpdateStatus takes (tenantID, id, status, atMS) — no userID.
   The handler called UpdateStatus with the caller's tenantID and the
   request's id, so any user in tenant T could ack any other user's
   notification in T by guessing the id (a tenant-scoped privilege
   escalation). Surfaces only when a tenant has more than one user, which
   is why the single-user-per-tenant tests missed it.

   Fix: ownership check before UpdateStatus — GetNotification(tenant, user,
   id) must succeed first. A cross-user or cross-tenant lookup miss
   surfaces as ErrNotFound → CodeNotFound on the wire, never
   CodePermissionDenied (which would leak the existence of another user's
   row). New CrossUser + CrossTenant + LookupInternalError regression tests
   added; the original NotFound and InternalError tests updated to populate
   the fake's getResult on the happy path.

3. **In-app provider lied about delivery when no connections were live.**
   inAppProvider.Send always returned StatusDelivered regardless of how
   many connections Registry.Push delivered to, so an offline user's row
   was marked Delivered even though no live device received the push.
   Downstream consumers reading the row (unread filter, ack flow,
   analytics) would see "delivered" when really the user never saw it.

   Fix: when Registry.Push returns 0, return Receipt{Status: StatusPending}
   so the row stays unread and surfaces in the next GetNotifications page.
   1+ connections → StatusDelivered as before. Two new tests cover the
   zero-connections and N-connections fan-out cases.

go test ./... -race -count=1 stays green; internal/server coverage rises
from 85.3% to 85.5% with the new regression tests.
Three small follow-ups to the fix commit:

- cmd/notifyd: drop the unused notify import + the `var _ = notify.ChannelInApp`
  placeholder that existed only to keep the import referenced. Nothing in
  main.go calls into the root notify package directly — every type goes
  through server.Config.
- cmd/notifyd: stop wiring Dependencies.StoreCloser. The cleanup function
  from buildDependencies is the single source of truth for store close
  during graceful shutdown; passing StoreCloser on top would close pgx
  pools / entdb SDK clients twice. server.Shutdown still honours
  StoreCloser when tests pass one explicitly.
- server/auth: replace the in-package constantTimeEqual helper with
  crypto/subtle.ConstantTimeCompare. The stdlib carries the byte-compare
  primitive and the TestInternalAuthInterceptor suite covers the
  end-to-end match / mismatch / missing-header paths, so the standalone
  unit test for the inlined helper is no longer needed.
- server: tighten the Server.Addr() doc comment — it returns "" when Run
  has not bound the listener, never panics. Previous comment said
  "panicking" which was simply wrong.

go test ./... -race -count=1 stays green; coverage on internal/server is
85.3% (steady).
@iarunsaragadam iarunsaragadam merged commit a73e65e into main May 27, 2026
4 checks passed
@iarunsaragadam iarunsaragadam deleted the feat/wave2-server-and-packaging branch May 27, 2026 03:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant