feat: Support vLLM data parallel calibration for hy_v3 #346
Open
jds250 wants to merge 1 commit into
Collaborator
@jds250
pip3 install pre-commit black isort flake8
cd AngelSlim
pre-commit install
WOODchen7
reviewed
Jun 16, 2026
tokenizer = llm.get_tokenizer()
slim_engine = Engine()
slim_engine.series = "LLM"
from types import SimpleNamespace
WOODchen7
reviewed
Jun 16, 2026
slim_engine.slim_model.tokenizer = tokenizer
slim_engine.slim_model.model = llm
slim_engine.slim_model.model.device = "cpu"
from types import SimpleNamespace
Overview
This PR adds Ray actor-managed data parallelism to the vLLM-based calibration workflow and integrates it into the Hy3 FP8 post-training quantization pipeline.
Motivation
Calibrating large Hy3 models with a single vLLM instance can be time-consuming and may not fully utilize available multi-node and multi-GPU resources.
This PR enables calibration samples to be sharded across multiple data-parallel replicas. Each DP replica can still use tensor parallelism internally through vLLM. After calibration, the driver merges the activation, MoE, MTP, and KV-cache statistics collected by all replicas into the same artifacts expected by the existing quantization stage.
In addition, the calibration and quantization stages can now share a single YAML configuration, reducing duplicated command-line arguments and making the overall PTQ workflow easier to configure, reproduce, and maintain.
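The sharding and merge steps described in this section can be illustrated with a minimal sketch. The helper names `shard_for_rank` and `merge_absmax` are hypothetical, and both the strided sharding scheme and the element-wise maximum merge rule are assumptions made for illustration; the PR itself may shard and merge differently.

```python
from typing import Dict, List, Sequence


def shard_for_rank(samples: Sequence[str], dp_rank: int, dp_size: int) -> List[str]:
    """Return the samples owned by one DP rank.

    Assumption: a strided split (rank r takes items r, r + dp_size, ...).
    """
    if dp_size < 1 or not 0 <= dp_rank < dp_size:
        raise ValueError(f"invalid dp_rank={dp_rank} for dp_size={dp_size}")
    return list(samples[dp_rank::dp_size])


def merge_absmax(per_rank: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Merge per-rank activation abs-max statistics on the driver.

    Assumption: each rank reports one abs-max value per layer name, so the
    global statistic is the maximum over ranks.
    """
    merged: Dict[str, float] = {}
    for stats in per_rank:
        for name, value in stats.items():
            merged[name] = max(merged.get(name, 0.0), value)
    return merged
```

With this scheme every sample is assigned to exactly one rank, and the merged statistics do not depend on how the samples were split, which is what lets the quantization stage consume the same artifacts as in the single-instance case.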
Key Changes
tools/run_vllm_calibrate.py
Added data-parallel calibration options, including:
dp_size
Refactored the calibration logic into run_one_calibration, allowing the same implementation to run in:
Sharded calibration prompts and KV scale search prompts according to dp_rank.
Added per-rank result payloads for cross-DP aggregation.
Added support for a second-stage KV-cache scale search after merged activation statistics become available.
Preserved support for:
angelslim/utils/vllm_calibration_dp.py
Added a Ray actor-managed launcher for data-parallel calibration.
Creates one placement group and one long-lived actor for each DP rank.
Validates DP/TP resource requirements and placement-group scheduling feasibility.
Keeps each actor's vLLM instance alive so that the follow-up KV scale search can reuse the loaded model.
Merges DP-local:
Aggregates KV quantization error profiles across DP replicas and writes the final merged outputs:
kv_scale_multipliers*.json
kv_cache_tuned_scales*.json
angelslim/utils/__init__.py
angelslim/compressor/quant/core/vllm_calibrate_utils/search.py
Added DP-friendly KV-cache MSE profile collection.
Supports both:
Preserves the existing local search logic while enabling cross-DP aggregation through additive MSE/SSE profiles.
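A minimal sketch of why additive profiles make cross-DP aggregation straightforward: if each rank reports, for every candidate scale multiplier, a sum of squared errors (SSE) and an element count, the driver can add them up and obtain the same MSE it would have computed on the full data. All names below (`local_profile`, `merge_profiles`, `best_multiplier`) are hypothetical, and the uniform rounding quantizer is a simplified stand-in for the real FP8 cast, not the PR's implementation.

```python
from typing import Dict, Sequence, Tuple

# multiplier -> (sum of squared quantization error, number of elements)
Profile = Dict[float, Tuple[float, int]]


def local_profile(values: Sequence[float], base_scale: float,
                  multipliers: Sequence[float], qmax: int = 7) -> Profile:
    """Collect one rank's SSE profile over candidate scale multipliers."""
    profile: Profile = {}
    for m in multipliers:
        scale = base_scale * m
        sse = 0.0
        for v in values:
            # Stand-in quantizer: round to a uniform grid and clip to [-qmax, qmax].
            q = max(-qmax, min(qmax, round(v / scale)))
            sse += (v - q * scale) ** 2
        profile[m] = (sse, len(values))
    return profile


def merge_profiles(profiles: Sequence[Profile]) -> Profile:
    """Sum SSE and element counts across DP ranks, per multiplier."""
    merged: Profile = {}
    for profile in profiles:
        for m, (sse, count) in profile.items():
            prev_sse, prev_count = merged.get(m, (0.0, 0))
            merged[m] = (prev_sse + sse, prev_count + count)
    return merged


def best_multiplier(profile: Profile) -> float:
    """Pick the multiplier with the lowest mean squared error."""
    return min(profile, key=lambda m: profile[m][0] / max(profile[m][1], 1))
```

Averaging per-rank MSE values directly would weight ranks incorrectly when shards differ in size; carrying the SSE and the count separately avoids that.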
angelslim/compressor/quant/core/vllm_calibrate_utils/__init__.py
configs/Hy3/ptq/fp8/Hy3_vllm_ptq_per_tensor.yaml
Output Artifacts
The updated pipeline preserves the existing stage-1 output names and generates DP-aggregated KV scale-search outputs when enabled:
Summary
This PR introduces the following improvements: