
Add persistent external model cache via PVC #822

Open
lukasewecker wants to merge 2 commits into main from lukas/model-cache

Conversation

@lukasewecker (Collaborator) commented May 6, 2026

Pull Request Summary

Currently, model checkpoints are downloaded to ephemeral node disk storage. This becomes a problem when the node's disk space is limited and model weights are large, and it is a real gap in current customer deployments.

This PR adds optional vLLM model weight caching backed by a Kubernetes PVC. One PVC is created per endpoint and shared across all of that endpoint's model workers (replicas).

This implementation puts the configuration in the Helm chart. A better design would probably be to make caching configurable at model deployment time via the API, since each endpoint could then be configured separately. However, that approach is more complex and would also require changes to the SGP API, so it was not chosen here.

Note that this caching feature is optional. It can be enabled and configured in the Helm chart / desired state in System Manager.

The new values in the Helm chart disable the model cache by default:

modelCache:
    enabled: false
    storage: 100Gi
    storageClassName: null
    accessMode: ReadWriteMany
    mountPath: /mnt/model-cache
    lockStaleSeconds: 600

You can override the values to enable the model cache (e.g. on Azure) like this:

modelCache:
    enabled: true
    storage: 100Gi
    storageClassName: azurefile-csi-premium
    accessMode: ReadWriteMany
    mountPath: /mnt/model-cache
    lockStaleSeconds: 600

Here is an overview of the main changes:

  • Add modelCache Helm values for enabling PVC-backed cache, storage size, storage class, access mode, and mount path.
  • Add a model-cache PVC template and Python-side PVC/mount injection for cached vLLM endpoints.
  • Keep cache mode sticky per bundle via LLM metadata, so Helm flag changes do not unexpectedly mutate existing bundles.
  • Add a cache download lock with heartbeat/stale-lock recovery to avoid parallel downloads by replicas (a lock sketch follows this list).
  • Add cache fingerprint validation so stale cached files are not reused after relevant bundle/download inputs change.
  • Preserve vLLM additional args across bundle recreation, including legacy bundle fallback parsing.
  • Fix vLLM CLI rendering for quoted values, dict/list args, and enum values.
  • Make Helm available in the CircleCI unit test job so the Helm-rendered service template tests run in CI.
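
As context for the download lock mentioned above, here is a minimal sketch of how download subcommands could be wrapped in an atomic mkdir lock with a heartbeat and stale-lock recovery. The helper name and shell details are illustrative assumptions, not the exact code added in this PR:

```python
# Hypothetical sketch: helper name and shell details are illustrative,
# not the exact code added in this PR.
from typing import List


def wrap_download_subcommands_with_lock(
    download_subcommands: List[str],
    cache_dir: str,
    stale_seconds: int = 600,
) -> str:
    lock_dir = f"{cache_dir}/.download.lock"
    marker = f"{cache_dir}/.complete"
    download = " && ".join(download_subcommands)
    stale_minutes = max(stale_seconds // 60, 1)
    return (
        f'while true; do '
        # a previous replica already finished downloading into the shared PVC
        f'if [ -f "{marker}" ]; then break; '
        # atomic mkdir: exactly one replica acquires the download lock
        f'elif mkdir "{lock_dir}" 2>/dev/null; then '
        # background heartbeat so waiters can detect an abandoned lock
        f'(while true; do touch "{lock_dir}/heartbeat"; sleep 30; done) & HB=$!; '
        f'{download}; touch "{marker}"; kill $HB; rm -rf "{lock_dir}"; break; '
        # stale-lock recovery: heartbeat not refreshed within the threshold
        f'elif [ -n "$(find "{lock_dir}/heartbeat" -mmin +{stale_minutes} 2>/dev/null)" ]; then '
        f'rm -rf "{lock_dir}"; '
        f'else sleep 10; fi; done'
    )
```

Per the sequence diagram further down, the actual command also records a fingerprint inside the .complete marker rather than just touching it; this sketch only shows the locking skeleton.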

Test Plan and Usage Guide

  • Added unit coverage for model-cache bundle behavior and Kubernetes resource generation.
  • Added Helm/chart coverage for the new PVC template and model-cache env vars.
  • Ran focused Python tests and Helm lint with the cache enabled and disabled.

Known Gaps:

  • Cache contents are endpoint-scoped, not bundle-version-scoped. Same-endpoint updates that change cache fingerprint inputs can replace the existing cache during rollout (a fingerprint sketch follows this list). The expected usage for model/checkpoint-style changes is to create a new endpoint.
  • modelCache.mountPath is effectively immutable for existing cached bundles because the path is baked into the model bundle command. Changing it later only affects newly created or explicitly recreated bundles.
  • Existing PVC specs are not reconciled after creation. If modelCache.storage, storageClassName, or accessMode changes, already-created endpoint PVCs are treated as existing and are not patched or recreated.
  • Stale lock recovery is best-effort. Normal locking uses an atomic mkdir, but stale-lock removal is not fully atomic if multiple waiters try to recover the same abandoned lock at once; in practice this is a rare edge case.
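
For the fingerprint mentioned above, the idea is to hash the inputs that determine which files land in the cache so that stale cached files are not reused. The exact inputs hashed by this PR are not spelled out here, so treat the fields below as an assumption:

```python
# Illustrative only: the exact inputs hashed by the real fingerprint may differ.
import hashlib
from typing import Iterable


def model_cache_fingerprint(checkpoint_path: str, include_patterns: Iterable[str]) -> str:
    """Hash the inputs that determine what ends up in the cache, so a changed
    checkpoint path or download filter invalidates the .complete marker."""
    payload = "|".join([checkpoint_path, *sorted(include_patterns)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```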

Greptile Summary

  • Adds optional PVC-backed model weight caching for vLLM endpoints: a per-endpoint PVC is created at deploy time, shared across all replicas, with an atomic mkdir-based download lock and heartbeat to prevent duplicate parallel downloads. Cache mode is stored in LLMMetadata so it remains sticky across resource-only updates.
  • Fixes a pre-existing bug where vLLM Quantization enum values were rendered as Quantization.AWQ in the CLI command string instead of awq; also refactors all vLLM CLI argument formatting through the new format_vllm_cli_value helper and preserves vllm_additional_args across bundle updates (a formatter sketch follows this list).
  • Adds *.jinja and *.model.v* to Azure/S3 download include patterns, removes runtime azcopy installation for vLLM images (assumed pre-installed), and updates the CircleCI service template ConfigMap with PVC, Istio sidecar, and forwarder route fixes.
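
The enum rendering fix above amounts to emitting an enum's value rather than its repr. Here is a rough sketch of what a format_vllm_cli_value-style helper could look like; the dict/list and quoting behaviour shown is an assumption, not the helper's confirmed implementation:

```python
# Sketch of the general idea behind a format_vllm_cli_value-style helper;
# the exact quoting and serialization rules in the PR may differ.
import json
import shlex
from enum import Enum


def format_vllm_cli_value(value: object) -> str:
    if isinstance(value, Enum):
        # Render Quantization.AWQ as "awq" instead of "Quantization.AWQ"
        value = value.value
    if isinstance(value, (dict, list)):
        # Structured args are passed as JSON strings on the CLI
        return shlex.quote(json.dumps(value))
    return shlex.quote(str(value))
```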

Confidence Score: 4/5

Safe to merge with minor P2 findings; existing P1s from previous review rounds should be addressed first

No new P0/P1 issues found in this review pass. The two previously-flagged P1s (duplicate LLM_METADATA_KEY and find() prefix-matching bug) remain open. New findings are P2 only. Core locking logic, fingerprinting, PVC lifecycle, and both delete paths look correct. Good test coverage for the new paths.

model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py — the legacy arg parser and model cache locking script are the most complex new logic and warrant careful review

Important Files Changed

| Filename | Overview |
| --- | --- |
| model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py | Core vLLM bundle creation rewritten: adds model cache path, fingerprinting, atomic download locking, legacy arg parsing, and fixed Enum CLI rendering (quantize.value) |
| model-engine/model_engine_server/infra/gateways/resources/k8s_endpoint_resource_delegate.py | Adds PVC creation/injection into deployment and LWS templates; PVC deletion added to both async and sync delete paths; model_endpoint_uses_model_cache drives opt-in behavior |
| model-engine/model_engine_server/infra/gateways/resources/k8s_resource_types.py | Adds PersistentVolumeClaimArguments TypedDict and MODEL_CACHE_PVC_NAME field to all deployment argument types; clean and consistent |
| model-engine/model_engine_server/inference/vllm/vllm_startup_wrapper.py | Extracts model_dir from server args for download size reporting; switches subprocess.run to explicit ["/bin/bash", "-c"] form; clean refactor |
| model-engine/model_engine_server/common/env_vars.py | Adds MODEL_CACHE_ENABLED, MODEL_CACHE_MOUNT_PATH, MODEL_CACHE_PVC_SUFFIX, MODEL_CACHE_LOCK_STALE_SECONDS env vars with eager validation for the stale seconds value |
| model-engine/model_engine_server/domain/entities/llm_entity.py | Adds model_cache_enabled: bool = False field to LLMMetadata, making cache mode sticky per bundle via metadata |
| model-engine/model_engine_server/infra/gateways/resources/templates/service_template_config_map_circleci.yaml | Adds PVC template for model-cache, removes taskExpiresSeconds annotation, adds Istio sidecar hold annotation, and various forwarder route fixes |
| model-engine/tests/unit/domain/test_llm_use_cases.py | Adds comprehensive tests for model cache bundle command generation, fingerprinting, multinode cache consistency, legacy arg parsing, and quantization enum rendering |
| model-engine/tests/unit/infra/gateways/resources/test_k8s_endpoint_resource_delegate.py | Adds Kubernetes resource delegate tests for PVC creation, volume/mount injection, cache-disabled path, and LWS multinode cache setup |

Sequence Diagram

sequenceDiagram
    participant UC as CreateLLMEndpointUseCase
    participant BU as CreateLLMModelBundleUseCase
    participant K8S as K8SEndpointResourceDelegate
    participant PVC as Kubernetes PVC
    participant Pod as vLLM Pod (replica N)

    UC->>BU: execute(model_name, checkpoint_path, ...)
    BU->>BU: get_vllm_model_weights_folder()
    Note over BU: /mnt/model-cache/model_files if cache enabled
    BU->>BU: load_model_weights_sub_commands()
    BU->>BU: get_model_cache_fingerprint(sha256)
    BU->>BU: wrap_download_subcommands_with_model_cache_lock()
    BU-->>UC: bundle with cache-aware command

    UC->>K8S: create_or_update_resources(request)
    K8S->>K8S: model_endpoint_uses_model_cache(record)
    alt "uses_model_cache=true"
        K8S->>PVC: create_namespaced_persistent_volume_claim (409 skip)
        K8S->>K8S: add_model_cache_to_deployment_template(pvc_name)
    else "uses_model_cache=false"
        K8S->>K8S: remove_model_cache_from_deployment_template()
    end

    Pod->>Pod: mkdir .download.lock (atomic)
    alt lock acquired
        Pod->>Pod: heartbeat loop (bg pid)
        Pod->>Pod: bash -c download_subcommands
        Pod->>Pod: model_cache_ready_without_marker()
        Pod->>Pod: write .complete (fingerprint)
        Pod->>Pod: cleanup_model_cache_lock
    else another replica holds lock
        Pod->>Pod: poll until .complete or lock stale
    end
    Pod->>Pod: exec vllm_server --model /mnt/model-cache/model_files

Comments Outside Diff (1)

  1. model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py, line 633-654 (link)

    P2 model_cache_ready_without_marker only checks for .safetensors/.bin

    The helper that gate-keeps fingerprint file writing only looks for *.safetensors or *.bin files, matching the current download filter. However, if a checkpoint is stored in a format not covered by this glob (e.g., quantized GGUF or multi-part shards with different extensions) the function always returns 1, the outer set -e kills the pod on every startup before the fingerprint is written, and every replica re-downloads on every restart — making the cache useless. Consider whether the file-presence check should match the exact same inclusion patterns used in the download subcommands, or at minimum verify config.json alone as a signal of a completed download.

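    A tiny sketch of the reviewer's minimal fallback, keying completion off config.json; illustrative only, and it does not address formats that lack config.json:

```python
# Illustrative only: treat config.json as the completed-download signal
# instead of relying solely on the *.safetensors/*.bin glob.
def model_cache_ready_check(model_files_dir: str) -> str:
    return f'[ -f "{model_files_dir}/config.json" ]'
```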

Additional Findings (2)

---

### Issue 1 of 2
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py:1062-1065
**azcopy assumed pre-installed in vLLM images for Azure checkpoints**

The `curl | tar` azcopy installation step is now skipped when `framework == LLMInferenceFramework.VLLM`. This means any vLLM bundle using an Azure checkpoint path will silently depend on `azcopy` being present in the container image. If the image doesn't include it, the `azcopy copy` command fails at pod startup — with `set -euo pipefail` active, this kills the pod with a confusing "command not found" error rather than a clear message. A guard or comment here would prevent future confusion.

```suggestion
        # vLLM images are expected to ship with azcopy pre-installed (e.g. *-with-azcopy tags).
        # Non-vLLM images still download azcopy at runtime.
        if framework != LLMInferenceFramework.VLLM:
            subcommands.append(
                "curl -L https://aka.ms/downloadazcopy-v10-linux | tar --strip-components=1 -C /usr/local/bin --no-same-owner --exclude=*.txt -xzvf - && chmod 755 /usr/local/bin/azcopy"
            )
```

### Issue 2 of 2
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py:368
**`get_legacy_vllm_additional_args_from_command` skips unrecognised flags even when they have values**

When `field_name` is not in `VLLMEndpointAdditionalArgs.model_fields` (e.g. a custom or framework-managed flag), `raw_value` is still extracted from the command string and `i` is advanced by 2. This correctly skips the value token in the token list. However, if an unrecognised flag is immediately followed by a value that looks like another flag (starts with `--`), `_extract_legacy_vllm_raw_arg` returns `None`, `i` advances by 1, and the next token is re-evaluated correctly. The logic holds, but a brief comment explaining why unknown flags advance `i` regardless of whether they end up in `parsed_args` would improve readability for future maintainers.
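
To make the skip logic above concrete, here is a minimal sketch of the pattern with the explanatory comment the reviewer asks for; names and structure are illustrative, not the PR's actual parser:

```python
# Illustrative sketch of the skip logic discussed above; not the PR's actual parser.
import shlex


def parse_legacy_vllm_args(command: str, known_fields: set) -> dict:
    tokens = shlex.split(command)
    parsed = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if not token.startswith("--"):
            i += 1
            continue
        field_name = token[2:].replace("-", "_")
        # Peek at the next token; if it looks like another flag, this flag has no value.
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
        if field_name in known_fields and has_value:
            parsed[field_name] = tokens[i + 1]
        # Advance past the value even for unknown flags: the value token belongs to
        # this flag either way, so it must not be re-evaluated as a flag of its own.
        i += 2 if has_value else 1
    return parsed
```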

Reviews (2): Last reviewed commit: "run black formatter"

@lukasewecker requested a review from lilyz-ai on May 6, 2026 at 21:21
# for the container names in the LWS template.
LWS_LEADER_CONTAINER_NAME = "lws-leader"
LWS_WORKER_CONTAINER_NAME = "lws-worker"
LLM_METADATA_KEY = "_llm"

P2 Duplicated LLM_METADATA_KEY constant

LLM_METADATA_KEY = "_llm" is now defined in both this file (line 93) and in llm_model_endpoint_use_cases.py (line 195). If they ever diverge, model_endpoint_uses_model_cache will silently fail to find the cache flag in metadata — every endpoint will be treated as cache-disabled even when the bundle was created with it enabled. This should import from a shared location or from llm_model_endpoint_use_cases.
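
A minimal sketch of the suggested fix, assuming a shared constants module (the module path below is hypothetical):

```python
# Hypothetical module layout -- the shared location shown here may not exist
# in the repo; the point is to define the key exactly once.
# model_engine_server/common/constants.py
LLM_METADATA_KEY = "_llm"

# k8s_endpoint_resource_delegate.py and llm_model_endpoint_use_cases.py would then
# both import it instead of redefining it:
# from model_engine_server.common.constants import LLM_METADATA_KEY
```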


Comment on lines +380 to +381
token_index = command_match.group(1).find(token)
raw_value = _extract_legacy_vllm_raw_arg(command_match.group(1), token_index)

P2 find(token) may match wrong position for prefix-sharing flags

command_match.group(1).find(token) always returns the index of the first occurrence of the token string in the command. If the flag being looked up (e.g., --max-model-len) is a prefix of an earlier flag in the command string (e.g., --max-model-len-tokens), find lands on the earlier flag's position, and _extract_legacy_vllm_raw_arg extracts the wrong raw value — poisoning the stored vllm_additional_args on bundle update. A safer approach would be to track the position of each token within the shlex-parsed token list rather than re-scanning the raw string.
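
A sketch of the safer approach described above, locating the flag by exact token match in the shlex-parsed list rather than re-scanning the raw string (names are illustrative):

```python
# Illustrative sketch: locate the flag by exact token match in the shlex-parsed
# token list instead of str.find on the raw command string.
import shlex
from typing import Optional


def extract_flag_value(command: str, flag: str) -> Optional[str]:
    tokens = shlex.split(command)
    for i, token in enumerate(tokens):
        # Exact match (or --flag=value) cannot land on a longer flag that merely
        # shares this prefix, e.g. --max-model-len vs --max-model-len-tokens.
        if token == flag:
            if i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
                return tokens[i + 1]
            return None
        if token.startswith(flag + "="):
            return token.split("=", 1)[1]
    return None
```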


@lilyz-ai (Collaborator) left a comment

@lukasewecker please fix greptile comments before merging.
