
docs: SEV1 post-mortem for model-engine 500s incident (MLI-6574)#817

Open
lilyz-ai wants to merge 27 commits into main from lilyz-ai/mli-6574-sev1-post-mortem

Conversation

@lilyz-ai
Collaborator

@lilyz-ai lilyz-ai commented Apr 29, 2026

Summary

  • Adds post-mortem one-pager for the Apr 24–25 SEV1 where a kubectl set image deployed a model-engine image that referenced endpoints.temporal_task_queue before the DB migration was applied, causing ~58 min of 500s on all endpoint operations
  • Document lives in docs/internal/ alongside other internal ops docs
  • Covers: timeline, root cause + contributing factors, impact, 5 action items, lessons learned

Test plan

  • Review doc for accuracy against incident timeline
  • Confirm action item owners

Closes MLI-6574

🤖 Generated with Claude Code

Greptile Summary

This PR adds a post-mortem for the Apr 24–25 SEV1 where a kubectl set image bypassed Helm migration hooks and caused ~58 min of 500s on all model-engine endpoint operations. It also silently promotes a new production image tag in values_sample.yaml.

Previous review rounds have identified several significant issues in the action-item code samples that need resolution before the document is referenced by engineers: the _check_master_branch justfile guard targets origin/master (the repo default branch is main), the ClusterRoleBinding snippet grants cluster-wide admin rather than the intended namespace-scoped access, the scale_developer RBAC role over-provisions create/delete on pods/configmaps/services, and the "chosen" Option A is not reflected in any of the implementation snippets which all implement Option B.

Confidence Score: 3/5

Not safe to merge until action-item code samples are corrected and the image tag bump is validated against applied migrations

Multiple P1 findings flagged in prior review rounds: the justfile guard references the wrong branch (origin/master vs origin/main), the ClusterRoleBinding grants cluster-wide admin defeating the stated security goal, the scale_developer RBAC role over-provisions destructive verbs, the chosen Option A is entirely absent from all implementation snippets, and the values_sample.yaml image tag change lacks confirmation that the DB migration is in place.

docs/internal/post-mortem-mli-6574-model-engine-500s.md requires corrections to the action-item code samples before it can serve as a reliable runbook; charts/model-engine/values_sample.yaml needs migration-safety confirmation before the image tag lands in production

Important Files Changed

Filename Overview
docs/internal/post-mortem-mli-6574-model-engine-500s.md New post-mortem document; multiple factual inconsistencies (duration values) and implementation errors in action-item code samples (wrong git branch ref, over-privileged RBAC, Option A/B mismatch, ClusterRoleBinding scope) flagged in prior review rounds
charts/model-engine/values_sample.yaml Bumps the production image tag without explicit documentation that the accompanying Alembic migration has been applied, repeating the exact failure pattern documented in the post-mortem

Sequence Diagram

```mermaid
sequenceDiagram
    participant Eng as Engineer
    participant K8s as kubectl / K8s API
    participant App as model-engine Pod
    participant DB as PostgreSQL
    participant PD as PagerDuty

    Note over Eng,DB: Apr 24, 23:50Z — Incident begins
    Eng->>K8s: kubectl set image (04729cef…)
    Note over K8s: Helm pre-upgrade hook SKIPPED
    K8s->>App: New pod starts (ORM declares temporal_task_queue)
    App->>DB: SELECT … temporal_task_queue … FROM endpoints
    DB-->>App: ERROR: column does not exist
    App-->>Eng: HTTP 500 on every list/get/create/update/delete

    loop 5 patch attempts over ~34 min
        Eng->>K8s: kubectl set image (patch N)
        K8s->>App: New pod (no migration applied)
        App->>DB: Same failing SELECT
        DB-->>App: ERROR: column does not exist
    end

    Note over PD: 00:37Z — PagerDuty fires (47 min late)
    Note over Eng,DB: 00:48Z — Rollback to f395ffa6…
    Eng->>K8s: kubectl rollout (f395ffa6…)
    K8s->>App: Old pod (no temporal_task_queue in ORM)
    App->>DB: SELECT without missing column
    DB-->>App: OK
    App-->>Eng: HTTP 200 — errors clear
```

Reviews (22): Last reviewed commit: "docs(post-mortem): drop genai from audit..." | Re-trigger Greptile

lilyz-ai and others added 2 commits April 21, 2026 01:11
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents the Apr 24-25 outage caused by deploying an ORM model
referencing a missing DB column, including timeline, root cause,
impact, and follow-up action items.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lilyz-ai lilyz-ai force-pushed the lilyz-ai/mli-6574-sev1-post-mortem branch from 71a834c to 4332cad on April 29, 2026 at 00:10
Comment thread docs/internal/post-mortem-mli-6574-model-engine-500s.md Outdated
Comment thread docs/internal/post-mortem-mli-6574-model-engine-500s.md
Comment thread docs/internal/post-mortem-mli-6574-model-engine-500s.md
lilyz-ai and others added 4 commits April 29, 2026 00:27
…h rollout strategy

- Add ORM column and Alembic migration definitions inline
- Explain why every endpoint query failed (SQLAlchemy SELECT includes all ORM columns)
- Remove incorrect downstream impact entry
- Clarify P0 migration CI check doesn't prevent kubectl bypass (scoped note added)
- Reframe kubectl ban as preventative (RBAC blocks at API server) not reactive (--atomic)
- Replace P1/P2 items with a Staging vs Production rollout strategy section

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove redundant CI migration check (Helm pre-upgrade hook already enforces this)
- Drop ArgoCD alternative; keep RBAC as the single preventative approach
- Expand RBAC section with specific files to change (Terracode-ML Role, scripts/deploy.sh, CircleCI config)
- Move image testing guidance into Rollout Strategy
- Add staging environment gap (no values_staging.yaml exists) with concrete plan

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

```diff
 # tag [required] is the LLM Engine docker image tag
-tag: e360bfb1d21d9d4e7b7fcb6b29ca752095b4d0f4
+tag: 2e9d00786419ef44ec5c9d3305d8d6451d6aabfb
```

P1 Undocumented production image tag change

The PR description mentions only the post-mortem doc addition, but values_sample.yaml is quietly updated to a new image tag (2e9d00786419ef44ec5c9d3305d8d6451d6aabfb). The post-mortem itself notes that values_sample.yaml targets production (no separate staging config exists). Deploying a new image tag here without an explicit mention in the PR description or test plan risks repeating the exact pattern documented in the incident — an image promotion that bypasses scrutiny of whether the accompanying Alembic migration is in place. The test plan should confirm that the migration for this new image has been applied to production before this config lands.
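That confirmation could be automated as a pre-deploy check. A hedged sketch follows, using stdlib `sqlite3` as a stand-in for the production Postgres connection; `migration_applied` and the `expected_revision` value are hypothetical names, and in practice the expected revision would come from the new image's `alembic heads` output:

```python
import sqlite3  # stand-in for the production PostgreSQL connection

def migration_applied(conn, expected_revision: str) -> bool:
    """Check whether the DB's applied Alembic revision matches what the
    new image expects. Alembic records the current revision in the
    single-row alembic_version table."""
    row = conn.execute("SELECT version_num FROM alembic_version").fetchone()
    return row is not None and row[0] == expected_revision

# Usage sketch against an in-memory DB (hypothetical revision id).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alembic_version (version_num TEXT NOT NULL)")
conn.execute("INSERT INTO alembic_version VALUES ('abc123')")
print(migration_applied(conn, "abc123"))  # True
```

A check like this, run before the tag bump lands, would have surfaced the mismatch this comment is flagging.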


lilyz-ai and others added 3 commits May 4, 2026 00:29
…rollout strategy

- RBAC: note no developer Role exists in Terracode-ML scaleapi-ml-serving/; add
  inspection steps to find the correct EKS access entry before making changes
- Deploy path: replace proposed scripts/deploy.sh with existing just deploy prod
  (model-engine-internal justfile already does helm upgrade --atomic --timeout 120m0s)
- Rollout strategy: add env table (training/launch/prod/circleci), note launch cluster
  may already serve as staging with ~1h effort vs new cluster, clarify effort range

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…estimate

- RBAC: identify eks.tf (lines 545-561) and workloads.tf as exact files to change;
  explain ml_infra_admin cluster-admin grant is why kubectl set image works today
- Staging: clarify all 3 clusters are prod; add component table showing RDS/Redis/VPC/DNS
  already exist in scaleapi-ml-serving/global/; new EKS cluster is the dominant cost;
  total estimate ~8-11 days

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- eks.tf: show exact before/after — scope ml_infra_admin from type=cluster to
  type=namespace excluding scale-deploy (workload_namespaces_default minus scale-deploy)
- workloads.tf: rename model_engine_deployer -> scale_deploy_developer; full Role
  content with deployment read-only + pod/exec/log/job rules + RoleBinding
- Staging: clarify staging DB/Redis/DNS already exist (never wired up); add Option A
  (new namespace in ml-serving-new, ~3-4 days) vs Option B (new EKS cluster, ~8-11 days)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread docs/internal/post-mortem-mli-6574-model-engine-500s.md
lilyz-ai and others added 4 commits May 4, 2026 01:03
…m sections

- eks.tf: note that scoping away scale-deploy also blocks just deploy prod (uses dev
  context); production deploys must move to CI (ml_integration_test_lambda) first
- Rename scale_deploy_developer -> scale_developer / scale-developer
- Staging: drop Option B (new EKS cluster); only Option A (new namespace, ~3-4 days)
- Remove Lessons Learned section

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…uard

- Control a: eks.tf scopes ml_infra_admin away from scale-deploy (blocks kubectl set image);
  just deploy prod compensates by minting a short-lived ml-k8s-admin SA token for helm;
  scale_developer role gains serviceaccounts/token create so devs can mint it
- Control b: justfile _check_master_branch guard — prod deploy aborts if current commit
  is not merged to origin/master; unmerged local branches cannot reach production
- Remove CI deploy requirement; developers still run just deploy prod locally

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t alternative

- Define ServiceAccount (SA) token inline where first used
- Option A: mint short-lived ml-k8s-admin token at deploy time (no new AWS infra)
- Option B: dedicated ml-serving-deployer IAM role — cleaner separation of concerns,
  no token minting, easy to audit; ~1 extra day of setup

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion B only)

- ml_infra_admin: scoped away from scale-deploy; gets scale_developer role
  (read/exec/log only, cannot kubectl set image or helm upgrade)
- New ml-serving-deployer IAM role: scoped to scale-deploy with cluster-admin;
  only this role can run just deploy prod
- justfile: _check_master_branch guard + _mount_aws uses ml-serving-deployer for prod
- Files: eks.tf (scope + new access entry), workloads.tf (two roles), iam.tf
  (new SSO role data source), justfile

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread docs/internal/post-mortem-mli-6574-model-engine-500s.md Outdated
scale-deploy is shared; ml_infra_admin scoping affects all teams deploying there.
Add pre-flight check: enumerate live deployments and confirm with other teams before
applying the RBAC change.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread docs/internal/post-mortem-mli-6574-model-engine-500s.md Outdated
…l inspection

- Ran kubectl get deployments -n scale-deploy on ml-serving-new
- auto-hillclimb-ui (genai team, manual deploy, ml_infra_admin) would break
- k8s-startup-watcher is part of llm-engine helm chart, unaffected
- launch-endpoint-id-* are model-engine celery-managed via ml-k8s-admin SA, unaffected
- Add pre-requisite: coordinate with genai team before applying RBAC change

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread docs/internal/post-mortem-mli-6574-model-engine-500s.md
lilyz-ai and others added 4 commits May 4, 2026 02:08
…_admin untouched

- Drop all changes to ml_infra_admin; genai team and others unaffected
- New ml-serving-deployer IAM role (eks.tf + iam.tf): cluster-admin scoped to
  scale-deploy; only model-engine prod deploys use this role
- kubectl set image prevention is now process-enforced (just deploy prod + master
  branch guard), not RBAC-enforced; acknowledge this tradeoff explicitly
- justfile: _check_master_branch + _mount_aws uses ml-serving-deployer for prod

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tity approach

Option A (scope ml_infra_admin) vs Option B (new ml-serving-deployer role);
table shows hard vs soft kubectl prevention, backward compat, and known breakage
(auto-hillclimb-ui); Option B chosen with explicit tradeoff acknowledged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lilyz-ai lilyz-ai requested a review from a team May 4, 2026 02:27

Two options were considered for preventing unauthorized production deploys:

| | Option A: Scope down `ml_infra_admin` | Option B: New `ml-serving-deployer` role ✓ |
Collaborator

for option b, which service assumes this new role?

|---|---|---|
| **How** | Remove `patch`/`update` on `apps/deployments` in `scale-deploy` from `ml_infra_admin` | Add a new AWS SSO role with cluster-admin scoped to `scale-deploy`; prod deploy uses this role |
| **`kubectl set image` prevention** | Hard — RBAC rejects it at the API server | Soft — process-enforced; `ml_infra_admin` users can still do it |
| **Backward compatible** | No — breaks any other team deploying to `scale-deploy` with `ml_infra_admin` | Yes — `ml_infra_admin` unchanged |
Collaborator

in what scenario does another team need to deploy to `scale-deploy` with `ml_infra_admin`?

| **Known breakage** | `auto-hillclimb-ui` (genai team, `genai/justfile deploy-auto-hillclimb-ui`) uses `kubectl apply` in `scale-deploy` via `ml_infra_admin`; would immediately break | None |
| **Coordination required** | Yes — must migrate every team deploying to `scale-deploy` | No |

**Open question for discussion:** Option A gives a hard guarantee (RBAC rejects `kubectl set image` at the API server) but requires coordinating a breaking change across every team using `ml_infra_admin` in `scale-deploy`. Option B is backward compatible and unblocks model-engine immediately, but relies on process discipline — a developer can still bypass it with a direct `kubectl set image`. Which guarantee level is acceptable?
Collaborator

in an ideal world, no one should have mutate permissions in the production environment unless there is an additional approval process.

regular or on-demand releases/deployments should be done by a CI/CD pipeline / automation

| `prod` | `ml-serving-new` | production serving |
| `circleci` | minikube (ephemeral) | CI integration tests only |

There is no staging environment for model-engine. One needs to be created.
Collaborator

+1, need canary environment to test production changes before actually release to production

Member

I don't disagree but it seems like once we're running kubectl commands directly, that's already incorrect, regardless of where the command is pointed.

- **ORM column** — a Python class attribute like `temporal_task_queue = Column(Text)` on the `Endpoint` model. SQLAlchemy (our ORM) translates these attributes into SQL; every declared column is included in the `SELECT` it generates when loading `Endpoint` objects.
- **Alembic migration** — a versioned SQL script (e.g. `ALTER TABLE endpoints ADD COLUMN temporal_task_queue TEXT`) that updates the live database schema to match what the application code expects. Without it, the DB is missing the column the ORM declares.

The `temporal_task_queue` column was added to the `Endpoint` ORM model (`hosted_model_inference.py`) as part of the temporal endpoint type feature (MLI-6425), but **no Alembic migration was written for it**. Because SQLAlchemy includes every ORM-declared column in its generated `SELECT` statement, all endpoint operations — list, get, create, update, delete — issued a query referencing `temporal_task_queue`. PostgreSQL returned `column "temporal_task_queue" does not exist` on every call, causing a 500 across the board.
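The failure mode above is easy to reproduce in miniature. The sketch below uses stdlib `sqlite3` in place of PostgreSQL with a cut-down `endpoints` table; it illustrates the mechanism only, not the production schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Production state: the Alembic migration was never applied, so the
# endpoints table lacks the column the ORM declares.
conn.execute("CREATE TABLE endpoints (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO endpoints (name) VALUES ('demo')")

# SQLAlchemy includes every ORM-declared column in its generated SELECT,
# so every endpoint operation issues a query like this one:
try:
    conn.execute("SELECT id, name, temporal_task_queue FROM endpoints").fetchall()
except sqlite3.OperationalError as exc:
    print(exc)  # no such column: temporal_task_queue
```

Postgres reports the same class of error (`column "temporal_task_queue" does not exist`), which the application surfaced as a 500 on every call.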
Collaborator

do we have alert ready to capture if similar things happen in the future?

Collaborator Author

added a new section for alerting on failure over 50% in data plane Istio mesh metrics.

lilyz-ai and others added 2 commits May 5, 2026 01:15
- Clarify Option B assumes ml-serving-deployer for model-engine only
- Correct Known breakage row: auto-hillclimb-ui deploys to ml-training-new, not ml-serving-new; no confirmed breakage on serving cluster
- Add P1 action item: tighten Envoy 5xx PagerDuty alert threshold/window

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Audit of all justfiles confirmed no other service deploys to scale-deploy
on ml-serving-new via ml_infra_admin — Option A has no coordination burden
and gives a hard RBAC guarantee vs Option B's soft process enforcement.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread docs/internal/post-mortem-mli-6574-model-engine-500s.md
lilyz-ai and others added 2 commits May 5, 2026 01:22
Remove comparison table and Option B content. Rewrite a. as a direct
description of the chosen approach (scope down ml_infra_admin). Update
files-to-change: eks.tf RBAC narrowing + justfile _check_master_branch
only — no new IAM role, no iam.tf change, no profile switch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ting

- eks.tf: change ml_infra_admin access scope from cluster-wide to
  namespace-scoped with all namespaces except scale-deploy (enumerated
  from kubectl namespace audit); note RBAC is additive so scoping is
  the only way to block kubectl set image
- P1 alert: add prod_5m entry to launch_v1_route_monitor_params in
  launch.tf for a 5-min evaluation window alongside the existing 15m

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread docs/internal/post-mortem-mli-6574-model-engine-500s.md
lilyz-ai and others added 2 commits May 5, 2026 01:40
… required

ml-serving-admin resolves to AWSReservedSSO_MLInfraAdmin (same as
ml_infra_admin), so scoping ml_infra_admin out of scale-deploy without a
replacement breaks just deploy prod. Correct implementation requires:
1. New ml-serving-deployer role scoped to scale-deploy
2. ml_infra_admin scoped down to exclude scale-deploy
3. justfile prod profile switched to ml-serving-deployer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Swap the noisy prod_5m ratio tweak for a dedicated monitor: fires when
>50% of all model-engine Istio requests are 5xx over 5 min. Uses broad
istio.mesh.request.count.total (not the narrow Envoy trace query) so a
complete outage like this one pages within 5 min.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lilyz-ai
Collaborator Author

lilyz-ai commented May 5, 2026

thanks @billyang-scale for reviewing! took another look: the ml-infra role should only be used by our team in ml-serving-new, so scoping down the role permissions + a new deployer role is a better design. also added alerting.

lilyz-ai and others added 2 commits May 5, 2026 01:54
ml-admin is MLInfraAdmin on the ML R&D account (692474966980), not the
Serving account — different IAM role, no access to ml-serving-new.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment on lines +210 to +213
| `training` | `ml-training-new` | production training workloads |
| `launch` | `ml-launch-new` | production Launch API |
| `prod` | `ml-serving-new` | production serving |
| `circleci` | minikube (ephemeral) | CI integration tests only |
Collaborator

is this related to the scope of this doc? ml-training-new and circleci seem irrelevant


**What:** `kubectl set image` patches the Deployment's pod template directly, bypassing all Helm pre-upgrade hooks including the migration job. The fix must be preventative: a `kubectl set image` attempt against the production deployment is rejected by the API server before it takes effect.

Two preventative controls:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick chat told me it's possible to have a guardrail which ensures the Alembic migrations are in sync with the ORM models:

```shell
#!/usr/bin/env bash
set -euo pipefail

# run against a clean test DB
alembic upgrade head

# Make sure alembic/env.py points at your ORM metadata:
# `alembic check` fails if the models contain schema changes
# not represented by migrations
alembic check
```

## Summary

At 23:50Z on Apr 24, a direct `kubectl set image` pushed a model-engine image (`04729cef…`) that declared a new ORM column `endpoints.temporal_task_queue` with no corresponding Alembic migration applied to production. Every `list_model_endpoints` and `get_model_endpoint` call immediately returned 500. Five patch images over ~34 minutes failed to resolve the issue before a rollback to `f395ffa6…` cleared errors at 00:48Z.

could you add some context on which entity ran the set-image: a human user or some automated process?

Member

@yixu34 yixu34 left a comment


Going forward let's please use the postmortem template. This is AI generated and only covers the technical aspects but doesn't go into the 5 why's.


Audit confirmed no other service deploys to `scale-deploy` on `ml-serving-new` via `ml_infra_admin`: all other services using `ml_infra_admin` on `ml-serving-new` (`dagster`, `scaletrain-ui`, `ml-orchestration-internal`, `research_evals`) deploy to the `dagster` namespace.

**b. justfile guard — block deploying unmerged code**
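The guard reduces to a git ancestry check. A hedged Python sketch of the core logic (hypothetical function names; note that prior review rounds flagged the real guard for targeting `origin/master` when the repo default branch is `main`):

```python
import subprocess

def is_merged(branch: str, cwd: str = ".") -> bool:
    """Return True iff HEAD is an ancestor of `branch`, i.e. the commit
    being deployed is already merged. `git merge-base --is-ancestor`
    exits 0 when the ancestry holds."""
    result = subprocess.run(
        ["git", "merge-base", "--is-ancestor", "HEAD", branch],
        cwd=cwd, capture_output=True,
    )
    return result.returncode == 0

def check_deploy_guard(cwd: str = ".") -> None:
    # Abort the prod deploy before anything reaches the cluster if the
    # current commit has not been merged to the default branch.
    if not is_merged("origin/main", cwd):
        raise SystemExit("refusing to deploy: HEAD is not merged to origin/main")
```

An emergency-deploy escape hatch (e.g. force-merge, as raised below in review) would still be needed alongside a guard like this.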
Member

This can work, but if we're going to do this, then we'd also want to make sure that people can force merge things in the models repo. Otherwise, there's no available way to do emergency deploys.

Member

Think another issue is: while this is good in the sense that it would force a review of the PR, it assumes that everyone follows the same release checklist. Would something like greptile have caught the fact that we were trying to push something without a corresponding migration?

Add the `ml-serving-deployer` access entry scoped to `scale-deploy`, and narrow `ml_infra_admin` to every namespace except `scale-deploy`:

```hcl
# New — dedicated deploy role for model-engine, scoped to scale-deploy only
Member

Do other teams do this?


validate: migrations apply cleanly, smoke tests pass, no 500s
production: just deploy prod (helm upgrade, migration-first, RBAC-enforced)
Collaborator

@billyang-scale billyang-scale May 5, 2026

we should propose a release process to make sure a release is ready for prod, e.g.

  • no auto deployment in production
  • make a cutline to a release candidate in the staging env, and let it bake for a certain time, e.g. 24h
  • have monitors/tests running to make sure there is no regression
  • release to prod explicitly with the release candidate version


---

### P1 — Add high-severity 5xx outage alert
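The proposed monitor reduces to a ratio test over the evaluation window. A hedged sketch of the logic (hypothetical function; the real monitor is a Datadog query over `istio.mesh.request.count.total`, as described in the commit history above):

```python
def should_page(counts_by_code: dict, threshold: float = 0.5) -> bool:
    """Fire when the share of 5xx responses in the evaluation window
    (5 minutes in the proposed monitor) exceeds `threshold`. Returns
    False on zero traffic so an idle service never pages."""
    total = sum(counts_by_code.values())
    if total == 0:
        return False
    errors = sum(n for code, n in counts_by_code.items() if code.startswith("5"))
    return errors / total > threshold

print(should_page({"200": 10, "500": 90}))  # True — full outage pages within one window
print(should_page({"200": 95, "503": 5}))   # False — transient overload stays quiet
```

The zero-traffic guard and the >50% threshold are what keep a monitor like this from being spammy: sporadic 5xx from an overloaded model stays below the ratio, while a complete outage like this incident trips it within one 5-minute window.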
Collaborator

@sandeshghanta sandeshghanta May 5, 2026

Confirming - can we do this today without being spammy?
Ex: does model engine throw 5xx if user requests are overwhelming a model?

@lilyz-ai
Collaborator Author

lilyz-ai commented May 5, 2026

Action items:

  1. improve devx of model engine, to be on par with newer services, e.g. arp. @lilyz-ai to create a new design doc for model engine devx improvements.
  2. prod deploy process documentation.
  3. scope down ml-infra role permissions to avoid deploy.
  4. have a separate deployer role to do prod deploy. allow force merge to allow emergency hot fix.
  5. have separate staging env for model engine, where ml infra team can test things.

Discussions:

  1. staging needs to follow prod deployment process, before prod deployment.
  2. hot fixes, post release verification.
  3. model engine testing locally, https://github.com/scaleapi/models/tree/master/model-engine-internal#running-the-server-locally.
  4. sandbox local testing, @sandeshghanta will share a process doc.
  5. developer experience improvement should be put in place first, so the guardrail is less relevant or will be needed less.
