feat: Support vLLM data parallel calibration for hy_v3 #346

Open: jds250 wants to merge 1 commit into Tencent:main from jds250:feature/dp


jds250 commented Jun 16, 2026


Overview

This PR adds Ray actor-managed data parallelism to the vLLM-based calibration workflow and integrates it into the Hy3 FP8 post-training quantization pipeline.

Motivation

Calibrating large Hy3 models with a single vLLM instance can be time-consuming and may not fully utilize available multi-node and multi-GPU resources.

This PR enables calibration samples to be sharded across multiple data-parallel replicas. Each DP replica can still use tensor parallelism internally through vLLM. After calibration, the driver merges the activation, MoE, MTP, and KV-cache statistics collected by all replicas into the same artifacts expected by the existing quantization stage.

In addition, the calibration and quantization stages can now share a single YAML configuration, reducing duplicated command-line arguments and making the overall PTQ workflow easier to configure, reproduce, and maintain.

Key Changes

tools/run_vllm_calibrate.py

  • Added data-parallel calibration options, including:

    • dp_size
  • Refactored the calibration logic into run_one_calibration, allowing the same implementation to run in:

    • single-replica mode
    • Ray actor-managed DP mode
  • Sharded calibration prompts and KV scale search prompts according to dp_rank.

  • Added per-rank result payloads for cross-DP aggregation.

  • Added support for a second-stage KV-cache scale search after merged activation statistics become available.

  • Preserved support for:

    • per-tensor KV-cache calibration
    • per-head KV-cache calibration
    • MTP activation statistics
    • MTP MoE statistics
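
The dp_rank-based sharding described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the helper name `shard_for_rank` and the strided assignment are assumptions, and the actual implementation may split prompts contiguously or differently.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_for_rank(items: Sequence[T], dp_rank: int, dp_size: int) -> List[T]:
    """Return the items owned by `dp_rank` (hypothetical helper, strided split)."""
    if dp_size < 1:
        raise ValueError("dp_size must be >= 1")
    if not 0 <= dp_rank < dp_size:
        raise ValueError("dp_rank must be in [0, dp_size)")
    # Rank r takes items r, r + dp_size, r + 2 * dp_size, ...
    return list(items[dp_rank::dp_size])


# Ten prompts split over four replicas: every prompt lands on exactly one rank.
prompts = [f"prompt-{i}" for i in range(10)]
shards = [shard_for_rank(prompts, rank, dp_size=4) for rank in range(4)]
```

A strided split keeps shard sizes within one item of each other. With `dp_size=1` it returns the full list, so the single-replica path needs no special case.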

angelslim/utils/vllm_calibration_dp.py

  • Added a Ray actor-managed launcher for data-parallel calibration.

  • Creates one placement group and one long-lived actor for each DP rank.

  • Validates DP/TP resource requirements and placement-group scheduling feasibility.

  • Keeps each actor's vLLM instance alive so that the follow-up KV scale search can reuse the loaded model.

  • Merges DP-local:

    • activation statistics
    • MoE expert statistics
    • MTP activation statistics
    • MTP MoE statistics
    • KV-cache statistics
  • Aggregates KV quantization error profiles across DP replicas and writes the final merged outputs:

    • kv_scale_multipliers*.json
    • kv_cache_tuned_scales*.json
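
One way to picture the driver-side merge of DP-local statistics is sketched below. The data layout is an assumption made for illustration: activation statistics are taken to be a per-layer absolute maximum (merged with `max`), and MoE expert statistics a per-layer list of token counts (merged by element-wise sum). The function names and layer names are hypothetical and may not match what this PR stores.

```python
from typing import Dict, Iterable, List


def merge_absmax(per_rank: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Merge per-layer abs-max values: the global maximum is the max over ranks."""
    merged: Dict[str, float] = {}
    for stats in per_rank:
        for name, value in stats.items():
            # Abs-max values are non-negative, so 0.0 is a safe starting point.
            merged[name] = max(merged.get(name, 0.0), value)
    return merged


def merge_expert_counts(
    per_rank: Iterable[Dict[str, List[int]]]
) -> Dict[str, List[int]]:
    """Merge per-expert token counts: counts from different ranks add up."""
    merged: Dict[str, List[int]] = {}
    for stats in per_rank:
        for name, counts in stats.items():
            if name not in merged:
                merged[name] = list(counts)
                continue
            if len(merged[name]) != len(counts):
                raise ValueError(f"expert count mismatch for {name}")
            merged[name] = [a + b for a, b in zip(merged[name], counts)]
    return merged


# Assumed per-rank payloads from two DP replicas.
merged_act = merge_absmax([
    {"layers.0.q_proj": 3.5, "layers.0.o_proj": 1.25},
    {"layers.0.q_proj": 2.0, "layers.0.o_proj": 4.0},
])
merged_moe = merge_expert_counts([
    {"layers.0.mlp": [10, 0, 5]},
    {"layers.0.mlp": [2, 7, 1]},
])
```

Both reductions are order-independent, so under these assumptions the merged result does not depend on which replica finishes first.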

angelslim/utils/__init__.py

  • Exported the newly added vLLM data-parallel calibration helper APIs.

angelslim/compressor/quant/core/vllm_calibrate_utils/search.py

  • Added DP-friendly KV-cache MSE profile collection.

  • Supports both:

    • per-tensor KV scale search
    • per-head KV scale search
  • Preserves the existing local search logic while enabling cross-DP aggregation through additive MSE/SSE profiles.
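
The reason additive profiles make cross-DP aggregation straightforward can be shown with a small sketch. It assumes each rank reports, for every candidate scale multiplier, a sum of squared errors (SSE) and an element count. The `Profile` type and both function names are hypothetical, not the collectors added in this PR.

```python
from typing import Dict, List, Tuple

# Candidate multiplier -> (sum of squared errors, number of elements).
Profile = Dict[float, Tuple[float, int]]


def merge_profiles(per_rank: List[Profile]) -> Profile:
    """Add SSE and element counts across ranks for each candidate multiplier."""
    merged: Profile = {}
    for profile in per_rank:
        for mult, (sse, count) in profile.items():
            prev_sse, prev_count = merged.get(mult, (0.0, 0))
            merged[mult] = (prev_sse + sse, prev_count + count)
    return merged


def best_multiplier(profile: Profile) -> float:
    """Pick the multiplier with the lowest MSE, where MSE = SSE / count."""
    candidates = {m: sse / n for m, (sse, n) in profile.items() if n > 0}
    if not candidates:
        raise ValueError("profile has no candidates with a non-zero count")
    return min(candidates, key=candidates.get)


# Two ranks that saw different amounts of data and disagree locally.
rank0: Profile = {0.5: (8.0, 100), 1.0: (4.0, 100)}
rank1: Profile = {0.5: (1.0, 300), 1.0: (9.0, 300)}

merged = merge_profiles([rank0, rank1])
choice = best_multiplier(merged)
```

In this example rank 0 alone prefers 1.0 and rank 1 alone prefers 0.5. Summing SSE and counts yields the MSE over all samples and selects 0.5, whereas an unweighted average of the two per-rank MSE values would select 1.0.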

angelslim/compressor/quant/core/vllm_calibrate_utils/__init__.py

  • Exported the newly added KV profile collectors and result-processing helpers.

configs/Hy3/ptq/fp8/Hy3_vllm_ptq_per_tensor.yaml

  • Added a configuration for the Hy3 FP8 PTQ workflow.
    • DP settings: dp_size

Output Artifacts

The updated pipeline preserves the existing stage-1 output names and generates DP-aggregated KV scale-search outputs when enabled:

activation_stats.json
moe_expert_stats.json
mtp_activation_stats.json
mtp_moe_expert_stats.json
kv_scale_multipliers.json
kv_cache_tuned_scales.json
kv_scale_multipliers_per_head.json
kv_cache_tuned_scales_per_head.json

Summary

This PR introduces the following improvements:

  1. Scales Hy3 vLLM calibration across multiple data-parallel replicas.
  2. Merges activation, MoE, MTP, and KV-cache statistics across DP replicas.
  3. Aggregates KV-cache scale-search error profiles across DP replicas.
  4. Preserves compatibility with the existing FP8 quantization workflow and artifact formats.

@yghstill

Collaborator

@jds250
Please set up pre-commit code formatting:

pip3 install pre-commit black isort flake8
cd AngelSlim
pre-commit install

tokenizer = llm.get_tokenizer()
slim_engine = Engine()
slim_engine.series = "LLM"
from types import SimpleNamespace

Collaborator


Move this import to the top of the file.

slim_engine.slim_model.tokenizer = tokenizer
slim_engine.slim_model.model = llm
slim_engine.slim_model.model.device = "cpu"
from types import SimpleNamespace

Collaborator


This import as well.


3 participants