docs: SEV1 post-mortem for model-engine 500s incident (MLI-6574) #817
Conversation
Documents the Apr 24-25 outage caused by deploying an ORM model referencing a missing DB column, including timeline, root cause, impact, and follow-up action items.
Force-pushed from 71a834c to 4332cad.
…h rollout strategy
- Add ORM column and Alembic migration definitions inline
- Explain why every endpoint query failed (SQLAlchemy SELECT includes all ORM columns)
- Remove incorrect downstream impact entry
- Clarify P0 migration CI check doesn't prevent kubectl bypass (scoped note added)
- Reframe kubectl ban as preventative (RBAC blocks at API server) not reactive (`--atomic`)
- Replace P1/P2 items with a Staging vs Production rollout strategy section

- Remove redundant CI migration check (Helm pre-upgrade hook already enforces this)
- Drop ArgoCD alternative; keep RBAC as the single preventative approach
- Expand RBAC section with specific files to change (Terracode-ML Role, scripts/deploy.sh, CircleCI config)
- Move image testing guidance into Rollout Strategy
- Add staging environment gap (no values_staging.yaml exists) with concrete plan
    # tag [required] is the LLM Engine docker image tag
    -tag: e360bfb1d21d9d4e7b7fcb6b29ca752095b4d0f4
    +tag: 2e9d00786419ef44ec5c9d3305d8d6451d6aabfb
**Undocumented production image tag change**
The PR description mentions only the post-mortem doc addition, but values_sample.yaml is quietly updated to a new image tag (2e9d00786419ef44ec5c9d3305d8d6451d6aabfb). The post-mortem itself notes that values_sample.yaml targets production (no separate staging config exists). Deploying a new image tag here without an explicit mention in the PR description or test plan risks repeating the exact pattern documented in the incident — an image promotion that bypasses scrutiny of whether the accompanying Alembic migration is in place. The test plan should confirm that the migration for this new image has been applied to production before this config lands.
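One way to satisfy that test-plan item (a hedged sketch; the image name, env var, and Alembic setup inside the image are assumptions) is to compare the revision applied to the production DB against the migration heads shipped in the new image:

```bash
#!/usr/bin/env bash
set -euo pipefail
# Hypothetical check: assumes the image's working directory contains the Alembic config.
IMAGE="model-engine:2e9d00786419ef44ec5c9d3305d8d6451d6aabfb"

# Revision currently applied to the production database
applied=$(docker run --rm -e DATABASE_URL="${PROD_DATABASE_URL}" "${IMAGE}" alembic current | awk '{print $1}')

# Latest revision baked into the image's migration scripts
head=$(docker run --rm "${IMAGE}" alembic heads | awk '{print $1}')

if [ "${applied}" != "${head}" ]; then
  echo "Migration mismatch: DB at ${applied:-<none>}, image expects ${head}" >&2
  exit 1
fi
echo "DB schema matches image ${IMAGE}."
```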
…rollout strategy
- RBAC: note no developer Role exists in Terracode-ML `scaleapi-ml-serving/`; add inspection steps to find the correct EKS access entry before making changes
- Deploy path: replace proposed `scripts/deploy.sh` with the existing `just deploy prod` (the model-engine-internal justfile already does `helm upgrade --atomic --timeout 120m0s`)
- Rollout strategy: add env table (training/launch/prod/circleci); note the launch cluster may already serve as staging with ~1h effort vs a new cluster; clarify effort range
…estimate
- RBAC: identify eks.tf (lines 545-561) and workloads.tf as the exact files to change; explain that the ml_infra_admin cluster-admin grant is why `kubectl set image` works today
- Staging: clarify all 3 clusters are prod; add component table showing RDS/Redis/VPC/DNS already exist in `scaleapi-ml-serving/global/`; new EKS cluster is the dominant cost; total estimate ~8-11 days

- eks.tf: show exact before/after — scope ml_infra_admin from `type=cluster` to `type=namespace` excluding scale-deploy (workload_namespaces_default minus scale-deploy)
- workloads.tf: rename model_engine_deployer -> scale_deploy_developer; full Role content with deployment read-only + pod/exec/log/job rules + RoleBinding
- Staging: clarify staging DB/Redis/DNS already exist (never wired up); add Option A (new namespace in ml-serving-new, ~3-4 days) vs Option B (new EKS cluster, ~8-11 days)

…m sections
- eks.tf: note that scoping away scale-deploy also blocks `just deploy prod` (uses dev context); production deploys must move to CI (ml_integration_test_lambda) first
- Rename scale_deploy_developer -> scale_developer / scale-developer
- Staging: drop Option B (new EKS cluster); only Option A (new namespace, ~3-4 days)
- Remove Lessons Learned section

…uard
- Control a: eks.tf scopes ml_infra_admin away from scale-deploy (blocks `kubectl set image`); `just deploy prod` compensates by minting a short-lived ml-k8s-admin SA token for helm; scale_developer role gains `serviceaccounts/token` create so devs can mint it
- Control b: justfile `_check_master_branch` guard — prod deploy aborts if the current commit is not merged to origin/master; unmerged local branches cannot reach production
- Remove CI deploy requirement; developers still run `just deploy prod` locally

…t alternative
- Define ServiceAccount (SA) token inline where first used
- Option A: mint a short-lived ml-k8s-admin token at deploy time (no new AWS infra)
- Option B: dedicated ml-serving-deployer IAM role — cleaner separation of concerns, no token minting, easy to audit; ~1 extra day of setup

…tion B only)
- ml_infra_admin: scoped away from scale-deploy; gets scale_developer role (read/exec/log only, cannot `kubectl set image` or `helm upgrade`)
- New ml-serving-deployer IAM role: scoped to scale-deploy with cluster-admin; only this role can run `just deploy prod`
- justfile: `_check_master_branch` guard + `_mount_aws` uses ml-serving-deployer for prod
- Files: eks.tf (scope + new access entry), workloads.tf (two roles), iam.tf (new SSO role data source), justfile
scale-deploy is shared; ml_infra_admin scoping affects all teams deploying there. Add pre-flight check: enumerate live deployments and confirm with other teams before applying the RBAC change.

…l inspection
- Ran `kubectl get deployments -n scale-deploy` on ml-serving-new
- auto-hillclimb-ui (genai team, manual deploy, ml_infra_admin) would break
- k8s-startup-watcher is part of the llm-engine helm chart, unaffected
- launch-endpoint-id-* are model-engine celery-managed via the ml-k8s-admin SA, unaffected
- Add pre-requisite: coordinate with the genai team before applying the RBAC change

…_admin untouched
- Drop all changes to ml_infra_admin; genai team and others unaffected
- New ml-serving-deployer IAM role (eks.tf + iam.tf): cluster-admin scoped to scale-deploy; only model-engine prod deploys use this role
- `kubectl set image` prevention is now process-enforced (`just deploy prod` + master branch guard), not RBAC-enforced; acknowledge this tradeoff explicitly
- justfile: `_check_master_branch` + `_mount_aws` uses ml-serving-deployer for prod

…tity approach
Option A (scope ml_infra_admin) vs Option B (new ml-serving-deployer role); table shows hard vs soft kubectl prevention, backward compat, and known breakage (auto-hillclimb-ui); Option B chosen with the explicit tradeoff acknowledged.
Two options were considered for preventing unauthorized production deploys:

| | Option A: Scope down `ml_infra_admin` | Option B: New `ml-serving-deployer` role ✓ |
|---|---|---|
| **How** | Remove `patch`/`update` on `apps/deployments` in `scale-deploy` from `ml_infra_admin` | Add a new AWS SSO role with cluster-admin scoped to `scale-deploy`; prod deploy uses this role |
| **`kubectl set image` prevention** | Hard — RBAC rejects it at the API server | Soft — process-enforced; `ml_infra_admin` users can still do it |
| **Backward compatible** | No — breaks any other team deploying to `scale-deploy` with `ml_infra_admin` | Yes — `ml_infra_admin` unchanged |
| **Known breakage** | `auto-hillclimb-ui` (genai team, `genai/justfile deploy-auto-hillclimb-ui`) uses `kubectl apply` in `scale-deploy` via `ml_infra_admin`; would immediately break | None |
| **Coordination required** | Yes — must migrate every team deploying to `scale-deploy` | No |

For option B, which service assumes this new role?

In what scenario does another team need to deploy to `scale-deploy` with `ml_infra_admin`?

**Open question for discussion:** Option A gives a hard guarantee (RBAC rejects `kubectl set image` at the API server) but requires coordinating a breaking change across every team using `ml_infra_admin` in `scale-deploy`. Option B is backward compatible and unblocks model-engine immediately, but relies on process discipline — a developer can still bypass it with a direct `kubectl set image`. Which guarantee level is acceptable?
In an ideal world, no one should have mutate permissions in the production environment unless there is an additional approval process. Regular or on-demand releases/deployments should be done by a CI/CD pipeline or automation.
| `prod` | `ml-serving-new` | production serving |
| `circleci` | minikube (ephemeral) | CI integration tests only |

There is no staging environment for model-engine. One needs to be created.

+1, we need a canary environment to test production changes before actually releasing to production.

I don't disagree, but it seems like once we're running kubectl commands directly, that's already incorrect, regardless of where the command is pointed.
- **ORM column** — a Python class attribute like `temporal_task_queue = Column(Text)` on the `Endpoint` model. SQLAlchemy (our ORM) translates these attributes into SQL; every declared column is included in the `SELECT` it generates when loading `Endpoint` objects.
- **Alembic migration** — a versioned SQL script (e.g. `ALTER TABLE endpoints ADD COLUMN temporal_task_queue TEXT`) that updates the live database schema to match what the application code expects. Without it, the DB is missing the column the ORM declares.

The `temporal_task_queue` column was added to the `Endpoint` ORM model (`hosted_model_inference.py`) as part of the temporal endpoint type feature (MLI-6425), but **no Alembic migration was written for it**. Because SQLAlchemy includes every ORM-declared column in its generated `SELECT` statement, all endpoint operations — list, get, create, update, delete — issued a query referencing `temporal_task_queue`. PostgreSQL returned `column "temporal_task_queue" does not exist` on every call, causing a 500 across the board.
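To make the mismatch concrete, here is a minimal sketch (illustrative names and layout; the real declaration lives in `hosted_model_inference.py`, and the migration shown is the one that was never written):

```python
# --- ORM side (shipped in the image). SQLAlchemy includes every declared
# column in the SELECT it generates when loading Endpoint objects.
from sqlalchemy import Column, Integer, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Endpoint(Base):
    __tablename__ = "endpoints"
    id = Column(Integer, primary_key=True)  # stand-in for the existing columns
    temporal_task_queue = Column(Text)      # new column, declared in code only


# --- Alembic side: the migration that should have shipped alongside it.
# In a real migration file, revision identifiers link it into the chain.
from alembic import op
import sqlalchemy as sa

def upgrade():
    op.add_column("endpoints", sa.Column("temporal_task_queue", sa.Text(), nullable=True))

def downgrade():
    op.drop_column("endpoints", "temporal_task_queue")
```

Without the `upgrade()` above applied, PostgreSQL rejects every generated `SELECT ... temporal_task_queue ... FROM endpoints` with `column "temporal_task_queue" does not exist`.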
Do we have an alert ready to catch it if similar things happen in the future?

Added a new section on alerting when failures exceed 50% in the data-plane Istio mesh metrics.
- Clarify Option B assumes ml-serving-deployer for model-engine only
- Correct Known breakage row: auto-hillclimb-ui deploys to ml-training-new, not ml-serving-new; no confirmed breakage on the serving cluster
- Add P1 action item: tighten Envoy 5xx PagerDuty alert threshold/window

Audit of all justfiles confirmed no other service deploys to scale-deploy on ml-serving-new via ml_infra_admin — Option A has no coordination burden and gives a hard RBAC guarantee vs Option B's soft process enforcement.

Remove comparison table and Option B content. Rewrite a. as a direct description of the chosen approach (scope down ml_infra_admin). Update files-to-change: eks.tf RBAC narrowing + justfile `_check_master_branch` only — no new IAM role, no iam.tf change, no profile switch.

…ting
- eks.tf: change ml_infra_admin access scope from cluster-wide to namespace-scoped with all namespaces except scale-deploy (enumerated from the kubectl namespace audit); note RBAC is additive, so scoping is the only way to block `kubectl set image`
- P1 alert: add a prod_5m entry to launch_v1_route_monitor_params in launch.tf for a 5-min evaluation window alongside the existing 15m

… required
ml-serving-admin resolves to AWSReservedSSO_MLInfraAdmin (same as ml_infra_admin), so scoping ml_infra_admin out of scale-deploy without a replacement breaks `just deploy prod`. Correct implementation requires:
1. New ml-serving-deployer role scoped to scale-deploy
2. ml_infra_admin scoped down to exclude scale-deploy
3. justfile prod profile switched to ml-serving-deployer

Swap the noisy prod_5m ratio tweak for a dedicated monitor: fires when >50% of all model-engine Istio requests are 5xx over 5 min. Uses the broad istio.mesh.request.count.total metric (not the narrow Envoy trace query) so a complete outage like this one pages within 5 min.
Thanks @billyang-scale for reviewing! Took another look: the ml-infra role should only be used by our team in ml-serving-new, so scoping down the role permission + a new deployer role is a better design. Also added alerting.
ml-admin is MLInfraAdmin on the ML R&D account (692474966980), not the Serving account — different IAM role, no access to ml-serving-new.
| Environment | Cluster | Purpose |
|---|---|---|
| `training` | `ml-training-new` | production training workloads |
| `launch` | `ml-launch-new` | production Launch API |
| `prod` | `ml-serving-new` | production serving |
| `circleci` | minikube (ephemeral) | CI integration tests only |

Is this related to the scope of this doc? `ml-training-new` and `circleci` seem irrelevant.
**What:** `kubectl set image` updates pods directly, bypassing all Helm pre-upgrade hooks including the migration job. The fix must be preventative: a `kubectl set image` attempt against the production deployment is rejected by the API server before it takes effect.
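For context, the migration job referenced here is typically wired through a Helm pre-upgrade hook annotation roughly like the following (an illustrative sketch, not the chart's actual manifest):

```yaml
# Hypothetical sketch of the pre-upgrade migration hook; names are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: model-engine-db-migration
  annotations:
    "helm.sh/hook": pre-upgrade              # run before the release is upgraded
    "helm.sh/hook-weight": "-1"              # order relative to other hooks
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: "model-engine:{{ .Values.tag }}"  # assumed values layout
          command: ["alembic", "upgrade", "head"]
```

Because `kubectl set image` patches the Deployment at the Kubernetes API without going through Helm, no hook of any kind fires.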
Two preventative controls:
A quick chat told me it's possible to have a guardrail which ensures Alembic migrations are in sync with the ORM models:

    #!/usr/bin/env bash
    set -euo pipefail
    # Use a clean test DB
    alembic upgrade head
    # Make sure alembic/env.py points at your ORM metadata:
    # fail if the models contain schema changes not represented by migrations
    alembic check
## Summary
At 23:50Z on Apr 24, a direct `kubectl set image` pushed a model-engine image (`04729cef…`) that declared a new ORM column `endpoints.temporal_task_queue` with no corresponding Alembic migration applied to production. Every `list_model_endpoints` and `get_model_endpoint` call immediately returned 500. Five patch images over ~34 minutes failed to resolve the issue before a rollback to `f395ffa6…` cleared errors at 00:48Z.
Could you add some context on who ran the set-image: a human user or some automated process?

yixu34 left a comment:

Going forward, let's please use the post-mortem template. This is AI-generated and only covers the technical aspects; it doesn't go into the 5 Whys.
Audit confirmed no other service deploys to `scale-deploy` on `ml-serving-new` via `ml_infra_admin`: all other services using `ml_infra_admin` on `ml-serving-new` (`dagster`, `scaletrain-ui`, `ml-orchestration-internal`, `research_evals`) deploy to the `dagster` namespace.
**b. justfile guard — block deploying unmerged code**
This can work, but if we're going to do this, then we'd also want to make sure that people can force-merge things in the models repo. Otherwise, there's no available way to do emergency deploys.

I think another issue is: while this is good in the sense that it would force a review of the PR, it assumes that everyone follows the same release checklist. Would something like Greptile have caught the fact that we were trying to push something without a corresponding migration?
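For reference, a minimal sketch of such a guard (the recipe body is an assumption; note the later review finding that the repo's default branch is `main`, not `master`):

```just
# Hypothetical sketch of the justfile guard; only the recipe name comes from the doc.
_check_master_branch:
    #!/usr/bin/env bash
    set -euo pipefail
    git fetch origin master
    # Refuse to deploy any commit that is not already merged into origin/master
    if ! git merge-base --is-ancestor HEAD origin/master; then
        echo "ERROR: HEAD is not merged into origin/master; aborting prod deploy." >&2
        exit 1
    fi
```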
Add the `ml-serving-deployer` access entry scoped to `scale-deploy`, and narrow `ml_infra_admin` to every namespace except `scale-deploy`:

```hcl
# New — dedicated deploy role for model-engine, scoped to scale-deploy only
```
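For illustration, a hedged sketch of how both access entries might be expressed with the AWS provider's EKS access-entry resources (cluster name, ARNs, and variable names are assumptions, not the actual Terracode-ML values):

```hcl
# Hypothetical sketch: principal ARNs and variables are placeholders.

# New: dedicated deploy role, admin scoped to the scale-deploy namespace only
resource "aws_eks_access_entry" "ml_serving_deployer" {
  cluster_name  = "ml-serving-new"
  principal_arn = "arn:aws:iam::111111111111:role/ml-serving-deployer" # placeholder
}

resource "aws_eks_access_policy_association" "ml_serving_deployer_admin" {
  cluster_name  = "ml-serving-new"
  principal_arn = aws_eks_access_entry.ml_serving_deployer.principal_arn
  policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSAdminPolicy"

  access_scope {
    type       = "namespace"
    namespaces = ["scale-deploy"]
  }
}

# Changed: ml_infra_admin narrowed from type = "cluster" to every namespace
# except scale-deploy (the list comes from the kubectl namespace audit)
resource "aws_eks_access_policy_association" "ml_infra_admin" {
  cluster_name  = "ml-serving-new"
  principal_arn = var.ml_infra_admin_role_arn # placeholder
  policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSAdminPolicy"

  access_scope {
    type       = "namespace"
    namespaces = var.namespaces_except_scale_deploy # placeholder
  }
}
```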
    ↓
    validate: migrations apply cleanly, smoke tests pass, no 500s
    ↓
    production: just deploy prod (helm upgrade, migration-first, RBAC-enforced)
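A hedged sketch of what the validate step could run against staging (the URL, paths, and check set are assumptions, not an existing script):

```bash
#!/usr/bin/env bash
set -euo pipefail
# Hypothetical staging smoke test; STAGING_URL and paths are placeholders.
STAGING_URL="${STAGING_URL:-https://model-engine.staging.internal}"

# 1. Schema: migrations apply cleanly and the ORM matches the migrated schema
alembic upgrade head
alembic check

# 2. Smoke: core endpoint operations must not return 5xx
for path in /v1/model-endpoints /healthz; do
  code=$(curl -s -o /dev/null -w "%{http_code}" "${STAGING_URL}${path}")
  if [ "${code}" -ge 500 ]; then
    echo "Smoke test failed: ${path} returned ${code}" >&2
    exit 1
  fi
done
echo "Validation passed."
```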
We should propose a release process to make sure a release is ready for prod, e.g.:
- no auto deployment in production
- make a cutline to a release candidate in the staging env and let it bake for a certain time, e.g. 24h; have monitors/tests running to make sure there is no regression
- release to prod explicitly with the release candidate version
---

### P1 — Add high-severity 5xx outage alert
Confirming: can we do this today without being spammy? E.g., does model-engine throw 5xx if user requests are overwhelming a model?
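As a sketch, the dedicated monitor described above could be expressed as a Terraform `datadog_monitor` along these lines (tag keys, the PagerDuty handle, and the resource name are assumptions; the real change lives in `launch.tf`):

```hcl
# Hypothetical sketch: Istio tag names and the PagerDuty handle are placeholders.
resource "datadog_monitor" "model_engine_5xx_outage" {
  name = "[SEV] model-engine: >50% of requests are 5xx (5m)"
  type = "query alert"

  # Ratio of 5xx to all model-engine requests over the Istio mesh, last 5 minutes.
  # Uses the broad istio.mesh.request.count.total metric so a complete outage pages quickly.
  query = "sum(last_5m):sum:istio.mesh.request.count.total{destination_service:model-engine*,response_code:5*}.as_count() / sum:istio.mesh.request.count.total{destination_service:model-engine*}.as_count() > 0.5"

  message = "More than half of model-engine traffic is failing with 5xx. @pagerduty-model-engine"

  monitor_thresholds {
    critical = 0.5
  }
}
```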
Summary

- `kubectl set image` deployed a model-engine image that referenced `endpoints.temporal_task_queue` before the DB migration was applied, causing ~47 min of 500s on all endpoint operations
- Doc added under `docs/internal/` alongside other internal ops docs

Test plan

Closes MLI-6574
Greptile Summary
This PR adds a post-mortem for the Apr 24–25 SEV1 where a `kubectl set image` bypassed Helm migration hooks and caused ~58 min of 500s on all model-engine endpoint operations. It also silently promotes a new production image tag in `values_sample.yaml`.

Previous review rounds have identified several significant issues in the action-item code samples that need resolution before the document is referenced by engineers: the `_check_master_branch` justfile guard targets `origin/master` (the repo default branch is `main`), the `ClusterRoleBinding` snippet grants cluster-wide admin rather than the intended namespace-scoped access, the `scale_developer` RBAC role over-provisions `create`/`delete` on `pods`/`configmaps`/`services`, and the "chosen" Option A is not reflected in any of the implementation snippets, which all implement Option B.

Confidence Score: 3/5
Not safe to merge until action-item code samples are corrected and the image tag bump is validated against applied migrations
Multiple P1 findings flagged in prior review rounds: the justfile guard references the wrong branch (origin/master vs origin/main), the ClusterRoleBinding grants cluster-wide admin defeating the stated security goal, the scale_developer RBAC role over-provisions destructive verbs, the chosen Option A is entirely absent from all implementation snippets, and the values_sample.yaml image tag change lacks confirmation that the DB migration is in place.
docs/internal/post-mortem-mli-6574-model-engine-500s.md requires corrections to the action-item code samples before it can serve as a reliable runbook; charts/model-engine/values_sample.yaml needs migration-safety confirmation before the image tag lands in production
Important Files Changed
Sequence Diagram
    sequenceDiagram
      participant Eng as Engineer
      participant K8s as kubectl / K8s API
      participant App as model-engine Pod
      participant DB as PostgreSQL
      participant PD as PagerDuty
      Note over Eng,DB: Apr 24, 23:50Z — Incident begins
      Eng->>K8s: kubectl set image (04729cef…)
      Note over K8s: Helm pre-upgrade hook SKIPPED
      K8s->>App: New pod starts (ORM declares temporal_task_queue)
      App->>DB: SELECT … temporal_task_queue … FROM endpoints
      DB-->>App: ERROR: column does not exist
      App-->>Eng: HTTP 500 on every list/get/create/update/delete
      loop 5 patch attempts over ~34 min
        Eng->>K8s: kubectl set image (patch N)
        K8s->>App: New pod (no migration applied)
        App->>DB: Same failing SELECT
        DB-->>App: ERROR: column does not exist
      end
      Note over PD: 00:37Z — PagerDuty fires (47 min late)
      Note over Eng,DB: 00:48Z — Rollback to f395ffa6…
      Eng->>K8s: kubectl rollout (f395ffa6…)
      K8s->>App: Old pod (no temporal_task_queue in ORM)
      App->>DB: SELECT without missing column
      DB-->>App: OK
      App-->>Eng: HTTP 200 — errors clear

Reviews (22): Last reviewed commit: "docs(post-mortem): drop genai from audit…"