fix: MutexOwner retries on transient Redis errors instead of crashing#2131

Open
isaacrowntree wants to merge 3 commits into
sequinstream:mainfrom
isaacrowntree:fix/redis-reconnect-resilience

isaacrowntree commented Mar 26, 2026

Summary

Fixes #2072. When Redis (or a Redis-compatible store like Dragonfly/KeyDB) restarts, Sequin enters an unrecoverable failure state and needs a full restart. This has bitten us (and others, per #2072) multiple times in production.

Root cause: MutexOwner in :has_mutex state calls acquire_mutex which returns :error when Redis is unreachable. The handler immediately returns {:shutdown, :err_keeping_mutex} — an invalid GenStateMachine return that crashes the process with {:bad_return_from_state_function, ...}. Due to MutexedSupervisor's :one_for_all strategy, this cascades and takes down the entire Runtime.Supervisor including all consumers.
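The difference between the two return shapes can be seen in a minimal sketch. It uses Erlang's `:gen_statem` directly (the behaviour that GenStateMachine wraps) so it runs without dependencies. The module name, the `:redis_error` event, and the `:valid`/`:invalid` data values are hypothetical and not taken from Sequin's MutexOwner.

```elixir
defmodule StopReturnSketch do
  # Hypothetical module for illustration only; not Sequin's MutexOwner.
  @behaviour :gen_statem

  # Started unlinked so the caller can observe the exit via a monitor.
  def start(mode), do: :gen_statem.start(__MODULE__, mode, [])

  @impl true
  def callback_mode, do: :handle_event_function

  @impl true
  def init(mode), do: {:ok, :has_mutex, mode}

  @impl true
  # Invalid: {:shutdown, reason} is not a recognised state-callback result,
  # so the process crashes with a bad_return_from_state_function error.
  def handle_event(:cast, :redis_error, :has_mutex, :invalid) do
    {:shutdown, :err_keeping_mutex}
  end

  # Valid: {:stop, reason} stops the process with exactly that exit reason.
  def handle_event(:cast, :redis_error, :has_mutex, :valid) do
    {:stop, {:shutdown, :err_keeping_mutex}}
  end
end
```

Monitoring a process started with `StopReturnSketch.start(:valid)` and sending it `:gen_statem.cast(pid, :redis_error)` should yield a `:DOWN` message whose reason is `{:shutdown, :err_keeping_mutex}`. The `:invalid` variant exits with an error reason instead.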

Changes:

  • MutexOwner: Instead of crashing on Redis errors, retry indefinitely with exponential backoff (capped at 1 hour). When Redis comes back, the error counter resets and normal operation resumes. Also fixes the GenStateMachine return value ({:stop, {:shutdown, reason}} instead of invalid {:shutdown, reason}).
  • SinkConsumersLive.Index: Handle {:error, _} from metrics calls instead of crashing with MatchError.
  • SinkConsumersLive.Show: Same defensive handling for all three metrics calls.
  • HttpEndpointsLive.Show: Same fix for get_http_endpoint_throughput/1.
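The LiveView changes follow a familiar pattern: replace a strict `{:ok, value} = ...` match with a `case` that also handles `{:error, _}`. The sketch below illustrates that pattern only. The module, function names, and the default value are hypothetical, and the real metrics calls and fallbacks in Sequin may differ.

```elixir
defmodule MetricsFallbackSketch do
  # Hypothetical helper for illustration only; not Sequin's code.

  # Strict match: raises MatchError when the call returns {:error, _}.
  def strict(fetch) do
    {:ok, value} = fetch.()
    value
  end

  # Defensive match: falls back to a default so the caller can keep going.
  def with_default(fetch, default) do
    case fetch.() do
      {:ok, value} -> value
      {:error, _reason} -> default
    end
  end
end
```

With a `fetch` function that returns `{:error, :closed}`, `strict/1` raises while `with_default/2` returns the supplied default.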

Context

We self-host Sequin on Railway with Dragonfly (Redis-compatible) as the backing store. Railway periodically auto-updates Dragonfly, which causes a brief restart. Every time this happens, Sequin fails to self-heal and requires a manual restart. We've hit this 3 times now. The :await_mutex state already retries correctly on Redis errors; this PR brings the same resilience to the :has_mutex state, but with exponential backoff so it doesn't spam during extended outages.
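For reference, capped exponential backoff can be sketched as below. Only the 1 hour cap comes from this PR. The 1 second base delay, the doubling factor, and the module and function names are assumptions made for illustration, not the values used in MutexOwner.

```elixir
defmodule BackoffSketch do
  # Assumed base delay; the PR text does not state the real value.
  @base_ms 1_000
  # The 1 hour cap described in the PR.
  @max_ms :timer.hours(1)

  # Delay before the next retry, given the number of consecutive errors so far.
  def delay_ms(consecutive_errors) when consecutive_errors >= 0 do
    # Clamp the exponent so the intermediate power stays small during
    # very long outages; the result is capped at @max_ms anyway.
    exponent = min(consecutive_errors, 30)
    min(@base_ms * Integer.pow(2, exponent), @max_ms)
  end
end
```

Under these assumptions the delay doubles from 1 second and reaches the 1 hour cap at the twelfth consecutive error. Resetting the counter to zero after a successful call brings the delay back to the base value.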

Test plan

  • Added MutexOwnerTest with unit tests for State struct and integration tests (tagged :integration) that use iptables REJECT to simulate Redis going down and coming back
  • Integration tests verify: process survives Redis outage, never crashes, recovers when Redis returns
  • All existing tests pass (mix test)
  • mix format --check-formatted passes
  • Verify in staging by restarting Redis while Sequin is running

Note: Integration tests require NET_ADMIN capability (for iptables) and are tagged :integration so they can be excluded from normal test runs: mix test --exclude integration
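One common ExUnit arrangement for this is to exclude the tag by default in `test/test_helper.exs` and opt in on the command line. The sketch below shows that arrangement as a single script. Whether this repository configures the exclusion in `test_helper.exs` or relies on the `--exclude` flag is not shown here, so treat the layout and the test module as hypothetical.

```elixir
# Normally in test/test_helper.exs: skip :integration tests unless requested.
ExUnit.start(exclude: [:integration])

defmodule TaggedSketchTest do
  # Hypothetical test module for illustration only.
  use ExUnit.Case, async: true

  @tag :integration
  test "needs NET_ADMIN and a real Redis (skipped by default)" do
    assert true
  end

  test "plain unit test (always runs)" do
    assert 1 + 1 == 2
  end
end
```

With this in place, `mix test` skips the tagged test and `mix test --include integration` runs it.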

🤖 Generated with Claude Code

When Redis (or a Redis-compatible store like Dragonfly/KeyDB) restarts,
MutexOwner would immediately crash with {:shutdown, :err_keeping_mutex}.
Since MutexedSupervisor uses :one_for_all strategy, this cascades and
takes down the entire Runtime.Supervisor including all consumers.

The fix:
- MutexOwner now retries up to 5 times with backoff on Redis errors
  while in :has_mutex state, giving Redis time to come back
- Resets the error counter on successful reconnection
- Also fixes the GenStateMachine return value (was {:shutdown, reason}
  which is invalid - now {:stop, {:shutdown, reason}})
- LiveView pages (index.ex, show.ex) now handle Redis errors gracefully
  in metrics loading instead of crashing with MatchError

Fixes sequinstream#2072

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working reliability labels Mar 26, 2026
isaacrowntree and others added 2 commits March 27, 2026 00:35
…s errors

Instead of giving up after N errors, MutexOwner now retries indefinitely
with exponential backoff capped at 1 hour. Redis going down should never
crash Sequin — it should degrade gracefully and self-heal when Redis returns.

Integration tests use iptables REJECT to simulate a real Redis/Dragonfly
redeploy and verify the process survives the outage and recovers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ation tests

Unit tests (6, <0.1s): verify backoff math, error counter behavior, and
state struct defaults without needing Redis.

Integration tests (2, ~35s each, tagged :integration): use iptables REJECT
to simulate Redis going down, verifying the process survives and recovers.
Run with: mix test --include integration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
isaacrowntree (Author) commented Jun 12, 2026

Since upstream appears unmaintained (no maintainer activity since Feb 2026), this fix is now merged and released in the maintained fork at https://github.com/triptechtravel/sequin — release v0.14.6-tt2, image ghcr.io/triptechtravel/sequin:v0.14.6-tt2. The fork continues from upstream's final v0.14.6 with production-driven maintenance (security/crash/bug fixes).

Development

Successfully merging this pull request may close these issues.

Redis lost connection does not seem to get re-created