feat: Support vLLM data parallel calibration for hy_v3 by jds250 · Pull Request #346 · Tencent/AngelSlim

jds250 · 2026-06-16T03:28:30Z

Overview

This PR adds Ray actor-managed data parallelism to the vLLM-based calibration workflow and integrates it into the Hy3 FP8 post-training quantization pipeline.

Motivation

Calibrating large Hy3 models with a single vLLM instance can be time-consuming and may not fully utilize available multi-node and multi-GPU resources.

This PR enables calibration samples to be sharded across multiple data-parallel replicas. Each DP replica can still use tensor parallelism internally through vLLM. After calibration, the driver merges the activation, MoE, MTP, and KV-cache statistics collected by all replicas into the same artifacts expected by the existing quantization stage.

In addition, the calibration and quantization stages can now share a single YAML configuration, reducing duplicated command-line arguments and making the overall PTQ workflow easier to configure, reproduce, and maintain.

Key Changes

`tools/run_vllm_calibrate.py`

Added data-parallel calibration options, including:
- dp_size
Refactored the calibration logic into run_one_calibration, allowing the same implementation to run in:
- single-replica mode
- Ray actor-managed DP mode
Sharded calibration prompts and KV scale search prompts according to dp_rank.
Added per-rank result payloads for cross-DP aggregation.
Added support for a second-stage KV-cache scale search after merged activation statistics become available.
Preserved support for:
- per-tensor KV-cache calibration
- per-head KV-cache calibration
- MTP activation statistics
- MTP MoE statistics
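
As a rough illustration of the `dp_rank` sharding described above, here is a minimal sketch. It assumes a strided split; the PR description does not say whether the split is strided or contiguous, and `shard_for_rank` is a hypothetical helper name, not a function from this PR.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_for_rank(items: Sequence[T], dp_rank: int, dp_size: int) -> List[T]:
    """Return the strided slice of `items` owned by `dp_rank`.

    Hypothetical helper: the strided layout is an assumption, chosen because
    it keeps shard sizes within one item of each other.
    """
    if dp_size < 1:
        raise ValueError("dp_size must be >= 1")
    if not 0 <= dp_rank < dp_size:
        raise ValueError("dp_rank must be in [0, dp_size)")
    return list(items[dp_rank::dp_size])


# Seven prompts split across three replicas.
prompts = [f"prompt-{i}" for i in range(7)]
shards = [shard_for_rank(prompts, rank, 3) for rank in range(3)]
```

With `dp_size=1` the helper returns the full list, so single-replica mode and DP mode could share the same code path.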

`angelslim/utils/vllm_calibration_dp.py`

Added a Ray actor-managed launcher for data-parallel calibration.
Creates one placement group and one long-lived actor for each DP rank.
Validates DP/TP resource requirements and placement-group scheduling feasibility.
Keeps each actor's vLLM instance alive so that the follow-up KV scale search can reuse the loaded model.
Merges DP-local:
- activation statistics
- MoE expert statistics
- MTP activation statistics
- MTP MoE statistics
- KV-cache statistics
Aggregates KV quantization error profiles across DP replicas and writes the final merged outputs:
- kv_scale_multipliers*.json
- kv_cache_tuned_scales*.json
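
The merge step could look roughly like the sketch below. The on-disk layout of the statistics is not shown in this PR description, so the sketch assumes two simplified shapes: activation statistics as a per-layer absolute maximum (merged with `max`) and MoE expert statistics as per-expert token counts (merged with a sum). The function names and key names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def merge_absmax(per_rank: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Merge per-layer abs-max values: the global maximum is the max over ranks."""
    merged: Dict[str, float] = {}
    for stats in per_rank:
        for name, value in stats.items():
            # Abs-max values are non-negative, so 0.0 is a safe default.
            merged[name] = max(merged.get(name, 0.0), value)
    return merged


def merge_counts(per_rank: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Merge per-expert token counts: counts from each rank simply add up."""
    total: Counter = Counter()
    for counts in per_rank:
        total.update(counts)
    return dict(total)


# Two replicas that saw different samples (illustrative values only).
act = merge_absmax([{"layer0": 1.0, "layer1": 4.0}, {"layer0": 3.0}])
moe = merge_counts([{"expert0": 2, "expert1": 1}, {"expert0": 5}])
```

Both reductions are order-independent, so under these assumptions the merged result does not depend on which replica reports first.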

`angelslim/utils/__init__.py`

Exported the newly added vLLM data-parallel calibration helper APIs.

`angelslim/compressor/quant/core/vllm_calibrate_utils/search.py`

Added DP-friendly KV-cache MSE profile collection.
Supports both:
- per-tensor KV scale search
- per-head KV scale search
Preserves the existing local search logic while enabling cross-DP aggregation through additive MSE/SSE profiles.
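
To show why additive profiles make cross-DP aggregation straightforward, here is a minimal sketch. It assumes each replica reports, for every candidate scale multiplier, a sum of squared errors (SSE) and an element count; the `Profile` type and function names are hypothetical and do not reflect the actual data structures in `search.py`.

```python
from typing import Dict, Iterable, Tuple

# Candidate multiplier -> (sum of squared errors, number of elements).
Profile = Dict[float, Tuple[float, int]]


def merge_profiles(profiles: Iterable[Profile]) -> Profile:
    """Add SSE and element counts per candidate multiplier across replicas."""
    merged: Profile = {}
    for profile in profiles:
        for mult, (sse, count) in profile.items():
            prev_sse, prev_count = merged.get(mult, (0.0, 0))
            merged[mult] = (prev_sse + sse, prev_count + count)
    return merged


def best_multiplier(merged: Profile) -> float:
    """Pick the candidate with the lowest global MSE (assumes counts > 0)."""
    return min(merged, key=lambda m: merged[m][0] / merged[m][1])


# Illustrative values: rank 0 alone would prefer 1.0, rank 1 would prefer 0.5.
rank0 = {0.5: (8.0, 10), 1.0: (4.0, 10)}
rank1 = {0.5: (1.0, 30), 1.0: (20.0, 30)}
merged = merge_profiles([rank0, rank1])
best = best_multiplier(merged)
```

Because SSE and counts are plain sums, the merged profile equals what a single replica would have computed over all samples. Averaging each replica's locally best multiplier would not have that property.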

`angelslim/compressor/quant/core/vllm_calibrate_utils/__init__.py`

Exported the newly added KV profile collectors and result-processing helpers.

`configs/Hy3/ptq/fp8/Hy3_vllm_ptq_per_tensor.yaml`

Added a configuration for the Hy3 FP8 PTQ workflow.
- DP settings: dp_size

Output Artifacts

The updated pipeline preserves the existing stage-1 output names and generates DP-aggregated KV scale-search outputs when enabled:

activation_stats.json
moe_expert_stats.json
mtp_activation_stats.json
mtp_moe_expert_stats.json
kv_scale_multipliers.json
kv_cache_tuned_scales.json
kv_scale_multipliers_per_head.json
kv_cache_tuned_scales_per_head.json

Summary

This PR introduces the following improvements:

Scales Hy3 vLLM calibration across multiple data-parallel replicas.
Merges activation, MoE, MTP, and KV-cache statistics across DP replicas.
Aggregates KV-cache scale-search error profiles across DP replicas.
Preserves compatibility with the existing FP8 quantization workflow and artifact formats.

yghstill · 2026-06-16T03:35:28Z

@jds250
Please set up pre-commit code formatting:

pip3 install pre-commit black isort flake8
cd AngelSlim
pre-commit install

WOODchen7 · 2026-06-16T07:24:15Z

+        tokenizer = llm.get_tokenizer()
+        slim_engine = Engine()
+        slim_engine.series = "LLM"
+        from types import SimpleNamespace


Move this import to the top of the file.

WOODchen7 · 2026-06-16T07:25:27Z

-    slim_engine.slim_model.tokenizer = tokenizer
-    slim_engine.slim_model.model = llm
-    slim_engine.slim_model.model.device = "cpu"
+    from types import SimpleNamespace


This import as well.

feat: Support vLLM data parallel calibration for hy_v3

953512c

jds250 wants to merge 1 commit into Tencent:main from jds250:feature/dp
