docs: SEV1 post-mortem for model-engine 500s incident (MLI-6574) #817
Conversation
Documents the Apr 24-25 outage caused by deploying an ORM model referencing a missing DB column, including timeline, root cause, impact, and follow-up action items.
Force-pushed from 71a834c to 4332cad.
…h rollout strategy
- Add ORM column and Alembic migration definitions inline
- Explain why every endpoint query failed (SQLAlchemy SELECT includes all ORM columns)
- Remove incorrect downstream impact entry
- Clarify P0 migration CI check doesn't prevent kubectl bypass (scoped note added)
- Reframe kubectl ban as preventative (RBAC blocks at API server) not reactive (`--atomic`)
- Replace P1/P2 items with a Staging vs Production rollout strategy section

- Remove redundant CI migration check (Helm pre-upgrade hook already enforces this)
- Drop ArgoCD alternative; keep RBAC as the single preventative approach
- Expand RBAC section with specific files to change (Terracode-ML Role, scripts/deploy.sh, CircleCI config)
- Move image testing guidance into Rollout Strategy
- Add staging environment gap (no values_staging.yaml exists) with concrete plan
    # tag [required] is the LLM Engine docker image tag
    -tag: e360bfb1d21d9d4e7b7fcb6b29ca752095b4d0f4
    +tag: 2e9d00786419ef44ec5c9d3305d8d6451d6aabfb
**Undocumented production image tag change**
The PR description mentions only the post-mortem doc addition, but values_sample.yaml is quietly updated to a new image tag (2e9d00786419ef44ec5c9d3305d8d6451d6aabfb). The post-mortem itself notes that values_sample.yaml targets production (no separate staging config exists). Deploying a new image tag here without an explicit mention in the PR description or test plan risks repeating the exact pattern documented in the incident — an image promotion that bypasses scrutiny of whether the accompanying Alembic migration is in place. The test plan should confirm that the migration for this new image has been applied to production before this config lands.
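One way to satisfy that test-plan item (a hedged sketch; the image name, env var, and Alembic setup inside the image are assumptions) is to compare the revision applied to the production DB against the migration heads shipped in the new image:

```bash
#!/usr/bin/env bash
set -euo pipefail
# Hypothetical check: assumes the image's working directory contains the Alembic config.
IMAGE="model-engine:2e9d00786419ef44ec5c9d3305d8d6451d6aabfb"

# Revision currently applied to the production database
applied=$(docker run --rm -e DATABASE_URL="${PROD_DATABASE_URL}" "${IMAGE}" alembic current | awk '{print $1}')

# Latest revision baked into the image's migration scripts
head=$(docker run --rm "${IMAGE}" alembic heads | awk '{print $1}')

if [ "${applied}" != "${head}" ]; then
  echo "Migration mismatch: DB at ${applied:-<none>}, image expects ${head}" >&2
  exit 1
fi
echo "DB schema matches image ${IMAGE}."
```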
…rollout strategy
- RBAC: note no developer Role exists in Terracode-ML `scaleapi-ml-serving/`; add inspection steps to find the correct EKS access entry before making changes
- Deploy path: replace proposed `scripts/deploy.sh` with the existing `just deploy prod` (the model-engine-internal justfile already does `helm upgrade --atomic --timeout 120m0s`)
- Rollout strategy: add env table (training/launch/prod/circleci); note the launch cluster may already serve as staging with ~1h effort vs a new cluster; clarify effort range
…estimate
- RBAC: identify eks.tf (lines 545-561) and workloads.tf as the exact files to change; explain that the ml_infra_admin cluster-admin grant is why `kubectl set image` works today
- Staging: clarify all 3 clusters are prod; add component table showing RDS/Redis/VPC/DNS already exist in `scaleapi-ml-serving/global/`; new EKS cluster is the dominant cost; total estimate ~8-11 days

- eks.tf: show exact before/after — scope ml_infra_admin from `type=cluster` to `type=namespace` excluding scale-deploy (workload_namespaces_default minus scale-deploy)
- workloads.tf: rename model_engine_deployer -> scale_deploy_developer; full Role content with deployment read-only + pod/exec/log/job rules + RoleBinding
- Staging: clarify staging DB/Redis/DNS already exist (never wired up); add Option A (new namespace in ml-serving-new, ~3-4 days) vs Option B (new EKS cluster, ~8-11 days)

…m sections
- eks.tf: note that scoping away scale-deploy also blocks `just deploy prod` (uses dev context); production deploys must move to CI (ml_integration_test_lambda) first
- Rename scale_deploy_developer -> scale_developer / scale-developer
- Staging: drop Option B (new EKS cluster); only Option A (new namespace, ~3-4 days)
- Remove Lessons Learned section

…uard
- Control a: eks.tf scopes ml_infra_admin away from scale-deploy (blocks `kubectl set image`); `just deploy prod` compensates by minting a short-lived ml-k8s-admin SA token for helm; scale_developer role gains `serviceaccounts/token` create so devs can mint it
- Control b: justfile `_check_master_branch` guard — prod deploy aborts if the current commit is not merged to origin/master; unmerged local branches cannot reach production
- Remove CI deploy requirement; developers still run `just deploy prod` locally

…t alternative
- Define ServiceAccount (SA) token inline where first used
- Option A: mint a short-lived ml-k8s-admin token at deploy time (no new AWS infra)
- Option B: dedicated ml-serving-deployer IAM role — cleaner separation of concerns, no token minting, easy to audit; ~1 extra day of setup

…tion B only)
- ml_infra_admin: scoped away from scale-deploy; gets scale_developer role (read/exec/log only, cannot `kubectl set image` or `helm upgrade`)
- New ml-serving-deployer IAM role: scoped to scale-deploy with cluster-admin; only this role can run `just deploy prod`
- justfile: `_check_master_branch` guard + `_mount_aws` uses ml-serving-deployer for prod
- Files: eks.tf (scope + new access entry), workloads.tf (two roles), iam.tf (new SSO role data source), justfile
scale-deploy is shared; ml_infra_admin scoping affects all teams deploying there. Add pre-flight check: enumerate live deployments and confirm with other teams before applying the RBAC change.

…l inspection
- Ran `kubectl get deployments -n scale-deploy` on ml-serving-new
- auto-hillclimb-ui (genai team, manual deploy, ml_infra_admin) would break
- k8s-startup-watcher is part of the llm-engine helm chart, unaffected
- launch-endpoint-id-* are model-engine celery-managed via the ml-k8s-admin SA, unaffected
- Add pre-requisite: coordinate with the genai team before applying the RBAC change

…_admin untouched
- Drop all changes to ml_infra_admin; genai team and others unaffected
- New ml-serving-deployer IAM role (eks.tf + iam.tf): cluster-admin scoped to scale-deploy; only model-engine prod deploys use this role
- `kubectl set image` prevention is now process-enforced (`just deploy prod` + master branch guard), not RBAC-enforced; acknowledge this tradeoff explicitly
- justfile: `_check_master_branch` + `_mount_aws` uses ml-serving-deployer for prod

…tity approach
Option A (scope ml_infra_admin) vs Option B (new ml-serving-deployer role); table shows hard vs soft kubectl prevention, backward compat, and known breakage (auto-hillclimb-ui); Option B chosen with the explicit tradeoff acknowledged.
Two options were considered for preventing unauthorized production deploys:

| | Option A: Scope down `ml_infra_admin` | Option B: New `ml-serving-deployer` role ✓ |
|---|---|---|
| **How** | Remove `patch`/`update` on `apps/deployments` in `scale-deploy` from `ml_infra_admin` | Add a new AWS SSO role with cluster-admin scoped to `scale-deploy`; prod deploy uses this role |
| **`kubectl set image` prevention** | Hard — RBAC rejects it at the API server | Soft — process-enforced; `ml_infra_admin` users can still do it |
| **Backward compatible** | No — breaks any other team deploying to `scale-deploy` with `ml_infra_admin` | Yes — `ml_infra_admin` unchanged |
| **Known breakage** | `auto-hillclimb-ui` (genai team, `genai/justfile deploy-auto-hillclimb-ui`) uses `kubectl apply` in `scale-deploy` via `ml_infra_admin`; would immediately break | None |
| **Coordination required** | Yes — must migrate every team deploying to `scale-deploy` | No |

For option B, which service assumes this new role?

In what scenario does another team need to deploy to `scale-deploy` with `ml_infra_admin`?

**Open question for discussion:** Option A gives a hard guarantee (RBAC rejects `kubectl set image` at the API server) but requires coordinating a breaking change across every team using `ml_infra_admin` in `scale-deploy`. Option B is backward compatible and unblocks model-engine immediately, but relies on process discipline — a developer can still bypass it with a direct `kubectl set image`. Which guarantee level is acceptable?
In an ideal world, no one should have mutate permissions in the production environment unless there is an additional approval process. Regular or on-demand releases/deployments should be done by a CI/CD pipeline or automation.
| `prod` | `ml-serving-new` | production serving |
| `circleci` | minikube (ephemeral) | CI integration tests only |

There is no staging environment for model-engine. One needs to be created.

+1, we need a canary environment to test production changes before actually releasing to production.

I don't disagree, but it seems like once we're running kubectl commands directly, that's already incorrect, regardless of where the command is pointed.
- **ORM column** — a Python class attribute like `temporal_task_queue = Column(Text)` on the `Endpoint` model. SQLAlchemy (our ORM) translates these attributes into SQL; every declared column is included in the `SELECT` it generates when loading `Endpoint` objects.
- **Alembic migration** — a versioned SQL script (e.g. `ALTER TABLE endpoints ADD COLUMN temporal_task_queue TEXT`) that updates the live database schema to match what the application code expects. Without it, the DB is missing the column the ORM declares.

The `temporal_task_queue` column was added to the `Endpoint` ORM model (`hosted_model_inference.py`) as part of the temporal endpoint type feature (MLI-6425), but **no Alembic migration was written for it**. Because SQLAlchemy includes every ORM-declared column in its generated `SELECT` statement, all endpoint operations — list, get, create, update, delete — issued a query referencing `temporal_task_queue`. PostgreSQL returned `column "temporal_task_queue" does not exist` on every call, causing a 500 across the board.
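To make the mismatch concrete, here is a minimal sketch (illustrative names and layout; the real declaration lives in `hosted_model_inference.py`, and the migration shown is the one that was never written):

```python
# --- ORM side (shipped in the image). SQLAlchemy includes every declared
# column in the SELECT it generates when loading Endpoint objects.
from sqlalchemy import Column, Integer, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Endpoint(Base):
    __tablename__ = "endpoints"
    id = Column(Integer, primary_key=True)  # stand-in for the existing columns
    temporal_task_queue = Column(Text)      # new column, declared in code only


# --- Alembic side: the migration that should have shipped alongside it.
# In a real migration file, revision identifiers link it into the chain.
from alembic import op
import sqlalchemy as sa

def upgrade():
    op.add_column("endpoints", sa.Column("temporal_task_queue", sa.Text(), nullable=True))

def downgrade():
    op.drop_column("endpoints", "temporal_task_queue")
```

Without the `upgrade()` above applied, PostgreSQL rejects every generated `SELECT ... temporal_task_queue ... FROM endpoints` with `column "temporal_task_queue" does not exist`.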
Do we have an alert ready to catch it if similar things happen in the future?

Added a new section on alerting when failures exceed 50% in the data-plane Istio mesh metrics.
- Clarify Option B assumes ml-serving-deployer for model-engine only
- Correct Known breakage row: auto-hillclimb-ui deploys to ml-training-new, not ml-serving-new; no confirmed breakage on the serving cluster
- Add P1 action item: tighten Envoy 5xx PagerDuty alert threshold/window

Audit of all justfiles confirmed no other service deploys to scale-deploy on ml-serving-new via ml_infra_admin — Option A has no coordination burden and gives a hard RBAC guarantee vs Option B's soft process enforcement.

Remove comparison table and Option B content. Rewrite a. as a direct description of the chosen approach (scope down ml_infra_admin). Update files-to-change: eks.tf RBAC narrowing + justfile `_check_master_branch` only — no new IAM role, no iam.tf change, no profile switch.

…ting
- eks.tf: change ml_infra_admin access scope from cluster-wide to namespace-scoped with all namespaces except scale-deploy (enumerated from the kubectl namespace audit); note RBAC is additive, so scoping is the only way to block `kubectl set image`
- P1 alert: add a prod_5m entry to launch_v1_route_monitor_params in launch.tf for a 5-min evaluation window alongside the existing 15m

… required
ml-serving-admin resolves to AWSReservedSSO_MLInfraAdmin (same as ml_infra_admin), so scoping ml_infra_admin out of scale-deploy without a replacement breaks `just deploy prod`. Correct implementation requires:
1. New ml-serving-deployer role scoped to scale-deploy
2. ml_infra_admin scoped down to exclude scale-deploy
3. justfile prod profile switched to ml-serving-deployer

Swap the noisy prod_5m ratio tweak for a dedicated monitor: fires when >50% of all model-engine Istio requests are 5xx over 5 min. Uses the broad istio.mesh.request.count.total metric (not the narrow Envoy trace query) so a complete outage like this one pages within 5 min.
Thanks @billyang-scale for reviewing! Took another look: the ml-infra role should only be used by our team in ml-serving-new, so scoping down the role permission + a new deployer role is a better design. Also added alerting.
ml-admin is MLInfraAdmin on the ML R&D account (692474966980), not the Serving account — different IAM role, no access to ml-serving-new.
| Environment | Cluster | Purpose |
|---|---|---|
| `training` | `ml-training-new` | production training workloads |
| `launch` | `ml-launch-new` | production Launch API |
| `prod` | `ml-serving-new` | production serving |
| `circleci` | minikube (ephemeral) | CI integration tests only |

Is this related to the scope of this doc? `ml-training-new` and `circleci` seem irrelevant.
**What:** `kubectl set image` updates pods directly, bypassing all Helm pre-upgrade hooks including the migration job. The fix must be preventative: a `kubectl set image` attempt against the production deployment is rejected by the API server before it takes effect.
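For context, the migration job referenced here is typically wired through a Helm pre-upgrade hook annotation roughly like the following (an illustrative sketch, not the chart's actual manifest):

```yaml
# Hypothetical sketch of the pre-upgrade migration hook; names are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: model-engine-db-migration
  annotations:
    "helm.sh/hook": pre-upgrade              # run before the release is upgraded
    "helm.sh/hook-weight": "-1"              # order relative to other hooks
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: "model-engine:{{ .Values.tag }}"  # assumed values layout
          command: ["alembic", "upgrade", "head"]
```

Because `kubectl set image` patches the Deployment at the Kubernetes API without going through Helm, no hook of any kind fires.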
Two preventative controls:
A quick chat told me it's possible to have a guardrail which ensures Alembic migrations are in sync with the ORM models:

    #!/usr/bin/env bash
    set -euo pipefail
    # Use a clean test DB
    alembic upgrade head
    # Make sure alembic/env.py points at your ORM metadata:
    # fail if the models contain schema changes not represented by migrations
    alembic check
## Summary
At 23:50Z on Apr 24, a direct `kubectl set image` pushed a model-engine image (`04729cef…`) that declared a new ORM column `endpoints.temporal_task_queue` with no corresponding Alembic migration applied to production. Every `list_model_endpoints` and `get_model_endpoint` call immediately returned 500. Five patch images over ~34 minutes failed to resolve the issue before a rollback to `f395ffa6…` cleared errors at 00:48Z.
Could you add some context on who ran the set-image: a human user or some automated process?

yixu34 left a comment:

Going forward, let's please use the post-mortem template. This is AI-generated and only covers the technical aspects; it doesn't go into the 5 Whys.
Audit confirmed no other service deploys to `scale-deploy` on `ml-serving-new` via `ml_infra_admin`: all other services using `ml_infra_admin` on `ml-serving-new` (`dagster`, `scaletrain-ui`, `ml-orchestration-internal`, `research_evals`) deploy to the `dagster` namespace.
**b. justfile guard — block deploying unmerged code**
This can work, but if we're going to do this, then we'd also want to make sure that people can force-merge things in the models repo. Otherwise, there's no available way to do emergency deploys.

I think another issue is: while this is good in the sense that it would force a review of the PR, it assumes that everyone follows the same release checklist. Would something like Greptile have caught the fact that we were trying to push something without a corresponding migration?
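For reference, a minimal sketch of such a guard (the recipe body is an assumption; note the later review finding that the repo's default branch is `main`, not `master`):

```just
# Hypothetical sketch of the justfile guard; only the recipe name comes from the doc.
_check_master_branch:
    #!/usr/bin/env bash
    set -euo pipefail
    git fetch origin master
    # Refuse to deploy any commit that is not already merged into origin/master
    if ! git merge-base --is-ancestor HEAD origin/master; then
        echo "ERROR: HEAD is not merged into origin/master; aborting prod deploy." >&2
        exit 1
    fi
```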
Add the `ml-serving-deployer` access entry scoped to `scale-deploy`, and narrow `ml_infra_admin` to every namespace except `scale-deploy`:

```hcl
# New — dedicated deploy role for model-engine, scoped to scale-deploy only
```
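For illustration, a hedged sketch of how both access entries might be expressed with the AWS provider's EKS access-entry resources (cluster name, ARNs, and variable names are assumptions, not the actual Terracode-ML values):

```hcl
# Hypothetical sketch: principal ARNs and variables are placeholders.

# New: dedicated deploy role, admin scoped to the scale-deploy namespace only
resource "aws_eks_access_entry" "ml_serving_deployer" {
  cluster_name  = "ml-serving-new"
  principal_arn = "arn:aws:iam::111111111111:role/ml-serving-deployer" # placeholder
}

resource "aws_eks_access_policy_association" "ml_serving_deployer_admin" {
  cluster_name  = "ml-serving-new"
  principal_arn = aws_eks_access_entry.ml_serving_deployer.principal_arn
  policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSAdminPolicy"

  access_scope {
    type       = "namespace"
    namespaces = ["scale-deploy"]
  }
}

# Changed: ml_infra_admin narrowed from type = "cluster" to every namespace
# except scale-deploy (the list comes from the kubectl namespace audit)
resource "aws_eks_access_policy_association" "ml_infra_admin" {
  cluster_name  = "ml-serving-new"
  principal_arn = var.ml_infra_admin_role_arn # placeholder
  policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSAdminPolicy"

  access_scope {
    type       = "namespace"
    namespaces = var.namespaces_except_scale_deploy # placeholder
  }
}
```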
    ↓
    validate: migrations apply cleanly, smoke tests pass, no 500s
    ↓
    production: just deploy prod (helm upgrade, migration-first, RBAC-enforced)
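A hedged sketch of what the validate step could run against staging (the URL, paths, and check set are assumptions, not an existing script):

```bash
#!/usr/bin/env bash
set -euo pipefail
# Hypothetical staging smoke test; STAGING_URL and paths are placeholders.
STAGING_URL="${STAGING_URL:-https://model-engine.staging.internal}"

# 1. Schema: migrations apply cleanly and the ORM matches the migrated schema
alembic upgrade head
alembic check

# 2. Smoke: core endpoint operations must not return 5xx
for path in /v1/model-endpoints /healthz; do
  code=$(curl -s -o /dev/null -w "%{http_code}" "${STAGING_URL}${path}")
  if [ "${code}" -ge 500 ]; then
    echo "Smoke test failed: ${path} returned ${code}" >&2
    exit 1
  fi
done
echo "Validation passed."
```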
We should propose a release process to make sure a release is ready for prod, e.g.:
- no auto deployment in production
- make a cutline to a release candidate in the staging env and let it bake for a certain time, e.g. 24h; have monitors/tests running to make sure there is no regression
- release to prod explicitly with the release candidate version
---

### P1 — Add high-severity 5xx outage alert
Confirming: can we do this today without being spammy? E.g., does model-engine throw 5xx if user requests are overwhelming a model?
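As a sketch, the dedicated monitor described above could be expressed as a Terraform `datadog_monitor` along these lines (tag keys, the PagerDuty handle, and the resource name are assumptions; the real change lives in `launch.tf`):

```hcl
# Hypothetical sketch: Istio tag names and the PagerDuty handle are placeholders.
resource "datadog_monitor" "model_engine_5xx_outage" {
  name = "[SEV] model-engine: >50% of requests are 5xx (5m)"
  type = "query alert"

  # Ratio of 5xx to all model-engine requests over the Istio mesh, last 5 minutes.
  # Uses the broad istio.mesh.request.count.total metric so a complete outage pages quickly.
  query = "sum(last_5m):sum:istio.mesh.request.count.total{destination_service:model-engine*,response_code:5*}.as_count() / sum:istio.mesh.request.count.total{destination_service:model-engine*}.as_count() > 0.5"

  message = "More than half of model-engine traffic is failing with 5xx. @pagerduty-model-engine"

  monitor_thresholds {
    critical = 0.5
  }
}
```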
Summary

- `kubectl set image` deployed a model-engine image that referenced `endpoints.temporal_task_queue` before the DB migration was applied, causing ~47 min of 500s on all endpoint operations
- Doc added under `docs/internal/` alongside other internal ops docs

Test plan

Closes MLI-6574
Greptile Summary
This PR adds a post-mortem for the Apr 24–25 SEV1 where a `kubectl set image` bypassed Helm migration hooks and caused ~58 min of 500s on all model-engine endpoint operations. It also silently promotes a new production image tag in `values_sample.yaml`.

Previous review rounds have identified several significant issues in the action-item code samples that need resolution before the document is referenced by engineers: the `_check_master_branch` justfile guard targets `origin/master` (the repo default branch is `main`), the `ClusterRoleBinding` snippet grants cluster-wide admin rather than the intended namespace-scoped access, the `scale_developer` RBAC role over-provisions `create`/`delete` on `pods`/`configmaps`/`services`, and the "chosen" Option A is not reflected in any of the implementation snippets, which all implement Option B.

Confidence Score: 3/5
Not safe to merge until action-item code samples are corrected and the image tag bump is validated against applied migrations
Multiple P1 findings flagged in prior review rounds: the justfile guard references the wrong branch (origin/master vs origin/main), the ClusterRoleBinding grants cluster-wide admin defeating the stated security goal, the scale_developer RBAC role over-provisions destructive verbs, the chosen Option A is entirely absent from all implementation snippets, and the values_sample.yaml image tag change lacks confirmation that the DB migration is in place.
docs/internal/post-mortem-mli-6574-model-engine-500s.md requires corrections to the action-item code samples before it can serve as a reliable runbook; charts/model-engine/values_sample.yaml needs migration-safety confirmation before the image tag lands in production
Important Files Changed
Sequence Diagram
    sequenceDiagram
      participant Eng as Engineer
      participant K8s as kubectl / K8s API
      participant App as model-engine Pod
      participant DB as PostgreSQL
      participant PD as PagerDuty
      Note over Eng,DB: Apr 24, 23:50Z — Incident begins
      Eng->>K8s: kubectl set image (04729cef…)
      Note over K8s: Helm pre-upgrade hook SKIPPED
      K8s->>App: New pod starts (ORM declares temporal_task_queue)
      App->>DB: SELECT … temporal_task_queue … FROM endpoints
      DB-->>App: ERROR: column does not exist
      App-->>Eng: HTTP 500 on every list/get/create/update/delete
      loop 5 patch attempts over ~34 min
        Eng->>K8s: kubectl set image (patch N)
        K8s->>App: New pod (no migration applied)
        App->>DB: Same failing SELECT
        DB-->>App: ERROR: column does not exist
      end
      Note over PD: 00:37Z — PagerDuty fires (47 min late)
      Note over Eng,DB: 00:48Z — Rollback to f395ffa6…
      Eng->>K8s: kubectl rollout (f395ffa6…)
      K8s->>App: Old pod (no temporal_task_queue in ORM)
      App->>DB: SELECT without missing column
      DB-->>App: OK
      App-->>Eng: HTTP 200 — errors clear

Reviews (22): Last reviewed commit: "docs(post-mortem): drop genai from audit…"