Add persistent external model cache via PVC #822
Conversation
```python
# for the container names in the LWS template.
LWS_LEADER_CONTAINER_NAME = "lws-leader"
LWS_WORKER_CONTAINER_NAME = "lws-worker"
LLM_METADATA_KEY = "_llm"
```
**Duplicated `LLM_METADATA_KEY` constant**

Path: `model-engine/model_engine_server/infra/gateways/resources/k8s_endpoint_resource_delegate.py`, line 93

`LLM_METADATA_KEY = "_llm"` is now defined in both this file (line 93) and in `llm_model_endpoint_use_cases.py` (line 195). If they ever diverge, `model_endpoint_uses_model_cache` will silently fail to find the cache flag in metadata, and every endpoint will be treated as cache-disabled even when the bundle was created with caching enabled. This should import from a shared location or from `llm_model_endpoint_use_cases`; a sketch follows.
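A minimal sketch of the suggested fix, assuming the constant stays defined in `llm_model_endpoint_use_cases` (the import path below is inferred from the repo layout):

```python
# k8s_endpoint_resource_delegate.py
# Import the single source of truth instead of redefining "_llm" locally,
# so model_endpoint_uses_model_cache can never diverge from bundle creation.
from model_engine_server.domain.use_cases.llm_model_endpoint_use_cases import (
    LLM_METADATA_KEY,
)
```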
```python
token_index = command_match.group(1).find(token)
raw_value = _extract_legacy_vllm_raw_arg(command_match.group(1), token_index)
```
**`find(token)` may match wrong position for prefix-sharing flags**

Path: `model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py`, lines 380-381

`command_match.group(1).find(token)` always returns the index of the *first* occurrence of the token string in the command. If the flag being looked up (e.g., `--max-model-len`) is a prefix of an earlier flag in the command string (e.g., `--max-model-len-tokens`), `find` lands on the earlier flag's position, and `_extract_legacy_vllm_raw_arg` extracts the *wrong* raw value, poisoning the stored `vllm_additional_args` on bundle update. A safer approach is to track the position of each token within the shlex-parsed token list rather than re-scanning the raw string; see the sketch below.
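A hedged sketch of that token-list approach; `find_flag_value` is an illustrative helper, not the repo's actual code:

```python
import shlex
from typing import Optional

def find_flag_value(command: str, flag: str) -> Optional[str]:
    """Return the value following `flag`, matching whole tokens only."""
    tokens = shlex.split(command)
    for i, tok in enumerate(tokens):
        # Exact-token (or --flag=value) comparison avoids landing on
        # --max-model-len-tokens when looking up --max-model-len, which a
        # raw str.find() over the command string would do.
        if tok == flag:
            return tokens[i + 1] if i + 1 < len(tokens) else None
        if tok.startswith(flag + "="):
            return tok.split("=", 1)[1]
    return None
```

For example, `find_flag_value("vllm serve --max-model-len-tokens 5 --max-model-len 4096", "--max-model-len")` returns `"4096"`, whereas a raw `find` would land on the longer flag first.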
lilyz-ai left a comment
@lukasewecker please fix greptile comments before merging.
Pull Request Summary
Currently, model checkpoints are downloaded to ephemeral disk storage. This becomes an issue when the node's disk space is limited and the model weights are large; it is a real gap in current customer deployments.
This PR adds optional vLLM model weight caching backed by a Kubernetes PVC. A PVC is created per endpoint and shared across all model workers (replicas) of that endpoint.
This implementation adds the configuration to the Helm chart. A better implementation would probably make it configurable per endpoint at model deployment time via the API, but that is more complex and would also require changes to the SGP API, which is why it was not chosen here.
Note that this caching feature is optional; it can be enabled and configured in the Helm chart / desired state in system manager. The new Helm chart values disable the model cache by default, and you can override them to enable it (e.g., on Azure).
Here is an overview of the main changes:
- `modelCache` Helm values for enabling the PVC-backed cache and setting storage size, storage class, access mode, and mount path.

Test Plan and Usage Guide
Known Gaps:
- `modelCache.mountPath` is effectively immutable for existing cached bundles because the path is baked into the model bundle command. Changing it later only affects newly created or explicitly recreated bundles.
- After `modelCache.storage`, `storageClassName`, or `accessMode` changes, already-created endpoint PVCs are treated as existing and are not patched or recreated.
- Lock acquisition is atomic via `mkdir`, but stale-lock removal is not fully atomic if multiple waiters try to recover the same abandoned lock at once; this is a rare edge case in practice (see the sketch after this list).
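For illustration only, the `mkdir`-lock pattern described above looks roughly like this; the real implementation is shell embedded in the bundle command, and every name here is illustrative:

```python
import os
import time

def acquire_download_lock(lock_dir: str, stale_after_s: float = 300.0) -> None:
    """Spin until this replica owns the lock; os.mkdir is atomic, so
    exactly one caller can create the directory at a time."""
    while True:
        try:
            os.mkdir(lock_dir)  # atomic create-or-fail
            return
        except FileExistsError:
            try:
                # Treat the holder as dead if its heartbeat (here, the
                # directory mtime) has not been refreshed recently.
                age = time.time() - os.stat(lock_dir).st_mtime
            except FileNotFoundError:
                continue  # holder just released the lock; retry now
            if age > stale_after_s:
                # Stale-lock recovery: two waiters can race through this
                # branch, which is the non-atomic edge case noted above.
                try:
                    os.rmdir(lock_dir)
                except OSError:
                    pass
                continue
            time.sleep(5)
```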
Greptile Summary

- Adds an `mkdir`-based download lock and heartbeat to prevent duplicate parallel downloads. Cache mode is stored in `LLMMetadata` so it remains sticky across resource-only updates.
- Fixes a bug where `Quantization` enum values were rendered as `Quantization.AWQ` in the CLI command string instead of `awq`; also refactors all vLLM CLI argument formatting through the new `format_vllm_cli_value` helper and preserves `vllm_additional_args` across bundle updates (see the sketch after this list).
- Adds `*.jinja` and `*.model.v*` to the Azure/S3 download include patterns, removes runtime `azcopy` installation for vLLM images (assumed pre-installed), and updates the CircleCI service template ConfigMap with PVC, Istio sidecar, and forwarder route fixes.
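The quantization fix amounts to serializing the enum's value rather than its repr; a minimal sketch (the `Quantization` member shown and the `format_vllm_cli_value` signature are assumptions, not the repo's exact code):

```python
from enum import Enum

class Quantization(Enum):
    AWQ = "awq"

def format_vllm_cli_value(value: object) -> str:
    # str(Quantization.AWQ) renders "Quantization.AWQ"; unwrap .value so
    # the CLI command string gets "awq" instead.
    if isinstance(value, Enum):
        return str(value.value)
    return str(value)
```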
Confidence Score: 4/5

Safe to merge with minor P2 findings; existing P1s from previous review rounds should be addressed first.
No new P0/P1 issues found in this review pass. The two previously-flagged P1s (duplicate LLM_METADATA_KEY and find() prefix-matching bug) remain open. New findings are P2 only. Core locking logic, fingerprinting, PVC lifecycle, and both delete paths look correct. Good test coverage for the new paths.
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py — the legacy arg parser and model cache locking script are the most complex new logic and warrant careful review
Important Files Changed
Sequence Diagram
```mermaid
sequenceDiagram
    participant UC as CreateLLMEndpointUseCase
    participant BU as CreateLLMModelBundleUseCase
    participant K8S as K8SEndpointResourceDelegate
    participant PVC as Kubernetes PVC
    participant Pod as vLLM Pod (replica N)
    UC->>BU: execute(model_name, checkpoint_path, ...)
    BU->>BU: get_vllm_model_weights_folder()
    Note over BU: /mnt/model-cache/model_files if cache enabled
    BU->>BU: load_model_weights_sub_commands()
    BU->>BU: get_model_cache_fingerprint(sha256)
    BU->>BU: wrap_download_subcommands_with_model_cache_lock()
    BU-->>UC: bundle with cache-aware command
    UC->>K8S: create_or_update_resources(request)
    K8S->>K8S: model_endpoint_uses_model_cache(record)
    alt uses_model_cache=true
        K8S->>PVC: create_namespaced_persistent_volume_claim (409 skip)
        K8S->>K8S: add_model_cache_to_deployment_template(pvc_name)
    else uses_model_cache=false
        K8S->>K8S: remove_model_cache_from_deployment_template()
    end
    Pod->>Pod: mkdir .download.lock (atomic)
    alt lock acquired
        Pod->>Pod: heartbeat loop (bg pid)
        Pod->>Pod: bash -c download_subcommands
        Pod->>Pod: model_cache_ready_without_marker()
        Pod->>Pod: write .complete (fingerprint)
        Pod->>Pod: cleanup_model_cache_lock
    else another replica holds lock
        Pod->>Pod: poll until .complete or lock stale
    end
    Pod->>Pod: exec vllm_server --model /mnt/model-cache/model_files
```
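The "409 skip" step in the diagram corresponds to a create-and-tolerate-conflict pattern; a sketch using the official `kubernetes` Python client (the wrapper name is illustrative):

```python
from kubernetes import client
from kubernetes.client.rest import ApiException

def ensure_endpoint_pvc(
    core_v1: client.CoreV1Api,
    namespace: str,
    pvc: client.V1PersistentVolumeClaim,
) -> None:
    """Create the endpoint's PVC, treating 409 Conflict as already-exists."""
    try:
        core_v1.create_namespaced_persistent_volume_claim(
            namespace=namespace, body=pvc
        )
    except ApiException as e:
        if e.status != 409:
            raise  # surface real failures; only AlreadyExists is expected
```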
Comments Outside Diff (1)

`model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py`, lines 633-654

**`model_cache_ready_without_marker` only checks for `.safetensors`/`.bin`**

The helper that gate-keeps fingerprint file writing only looks for `*.safetensors` or `*.bin` files, matching the current download filter. However, if a checkpoint is stored in a format not covered by this glob (e.g., quantized GGUF or multi-part shards with different extensions), the function always returns 1, the outer `set -e` kills the pod on every startup before the fingerprint is written, and every replica re-downloads on every restart, making the cache useless. Consider whether the file-presence check should match the exact same inclusion patterns used in the download subcommands, or at minimum verify `config.json` alone as a signal of a completed download.
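One way to broaden that check, sketched in Python for clarity; the real check is shell in the generated command, and `model_cache_ready` plus the glob list are assumptions:

```python
from pathlib import Path

# Weight globs mirror the current download filter; config.json serves as a
# fallback completed-download signal for formats outside these globs.
WEIGHT_GLOBS = ("*.safetensors", "*.bin")

def model_cache_ready(cache_dir: str) -> bool:
    root = Path(cache_dir)
    if any(next(root.rglob(pattern), None) is not None for pattern in WEIGHT_GLOBS):
        return True
    # e.g., GGUF checkpoints ship no .safetensors/.bin files; without this
    # fallback every replica would re-download on every restart.
    return (root / "config.json").exists()
```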
Reviews (2): Last reviewed commit: "run black formatter"