Skip to content

[TransferEngine] Make InfiniBand Service Level configurable via MC_IB_SL#2525

Open
catyans wants to merge 1 commit into
kvcache-ai:mainfrom
catyans:feat/mc-ib-sl
Open

[TransferEngine] Make InfiniBand Service Level configurable via MC_IB_SL#2525
catyans wants to merge 1 commit into
kvcache-ai:mainfrom
catyans:feat/mc-ib-sl

Conversation

@catyans

@catyans catyans commented Jun 18, 2026

Copy link
Copy Markdown

Description

The RDMA endpoint hard-codes attr.ah_attr.sl = 0 in
rdma_endpoint.cpp, so Mooncake traffic cannot be steered into a
dedicated InfiniBand Virtual Lane for QoS isolation.

This matters on a shared NIC. In MoE inference (e.g. DeepSeek-V3,
Qwen3-235B) with PD disaggregation, Expert-Parallel all-to-all traffic
and KV-cache transfers contend for the same RDMA NIC. SL→VL mapping on
the switch is the standard way to separate them, but Mooncake currently
gives KV traffic no SL knob, so it cannot be moved off the default lane.

Peer libraries on this path already expose the knob:

Library SL knob
NCCL NCCL_IB_SL
UCX UCX_IB_SL
NVSHMEM NVSHMEM_IB_SL
DeepEP EP_OVERRIDE_RDMA_SL
Mooncake TE (none — this PR)

Note: the existing MC_IB_TC only sets ah_attr.grh.traffic_class,
which does not drive VL selection on native InfiniBand, so it does not
cover this case.

This change mirrors the existing MC_IB_TC implementation:

  • add ib_service_level = -1 to GlobalConfig (-1 = use default 0);
  • parse MC_IB_SL, validated to the InfiniBand SL range 0-15;
  • apply it to attr.ah_attr.sl only when configured;
  • log it in dumpGlobalConfig.

Fully backward compatible: when MC_IB_SL is unset, the SL stays
0, exactly as today.

Closes #2515

Module

  • Transfer Engine (mooncake-transfer-engine)

Type of Change

  • New feature

How Has This Been Tested?

Added unit tests in tests/config_test.cpp for the MC_IB_SL env-var
parsing, covering: unset default (-1), valid override, min/max
boundaries (0/15), out-of-range (16), negative, and non-numeric
input (all reject and preserve the prior value). Mirrors the existing
MC_PKEY_INDEX test pattern.

  • Unit tests pass
  • Integration tests pass (N/A)
  • Manual testing done

Note: I have not built this locally (no IB toolchain on my dev
machine); relying on CI for the build. The change mirrors the existing
MC_IB_TC code path one-for-one.

Checklist

  • I have performed a self-review of my own code
  • I have formatted my code using clang-format (v20.1.8, repo .clang-format)
  • I have run pre-commit run --all-files and all hooks pass
  • I have updated the documentation (docs/source/design/transfer-engine/index.md)
  • I have added tests to prove my changes are effective
  • For changes >500 LOC: I have filed an RFC issue — N/A (~25 LOC)

AI Assistance Disclosure

  • AI tools were used (Claude Code) for code navigation and drafting.

The RDMA endpoint hard-codes `attr.ah_attr.sl = 0`, so Mooncake traffic
cannot be steered into a dedicated Virtual Lane for QoS isolation. On a
shared NIC (e.g. MoE inference where Expert-Parallel all-to-all and
KV-cache transfers contend for the same RDMA NIC), there is no way to
separate the two flows at the SL/VL level.

Peer libraries already expose this knob: NCCL `NCCL_IB_SL`, UCX
`UCX_IB_SL`, NVSHMEM `NVSHMEM_IB_SL`, DeepEP `EP_OVERRIDE_RDMA_SL`.
Mooncake was the only one on the KV-transfer path without it. (Note
`MC_IB_TC` only sets `ah_attr.grh.traffic_class`, which does not drive
VL selection on native InfiniBand.)

This mirrors the existing `MC_IB_TC` implementation:
- add `ib_service_level = -1` to GlobalConfig (-1 = use default 0);
- parse `MC_IB_SL`, validated to the InfiniBand SL range 0-15;
- apply it to `attr.ah_attr.sl` only when configured;
- log it in dumpGlobalConfig.

Fully backward compatible: when MC_IB_SL is unset the SL stays 0,
exactly as before. Adds unit tests for the env-var parsing
(default/valid/boundary/out-of-range/negative/non-numeric) and
documents the knob.

Closes kvcache-ai#2515

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for configuring the InfiniBand Service Level (SL) via the MC_IB_SL environment variable (0-15) to enable QoS isolation on RDMA QPs. This change includes configuration updates, environment variable parsing, documentation, RDMA endpoint integration, and unit tests. The feedback suggests checking if the environment variable is non-empty before parsing to avoid throwing exceptions or logging spurious warnings when it is set to an empty string.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +393 to +394
const char* service_level_env = std::getenv("MC_IB_SL");
if (service_level_env) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To prevent throwing a std::invalid_argument exception and logging a spurious warning when MC_IB_SL is set to an empty string (which is common in some deployment environments), check if the environment variable is non-empty before parsing it. This also aligns with the pattern used elsewhere in this file (e.g., for MC_MLX5_QP_UDP_SPORTS and MC_MLX5_QP_LAG_PORT_BALANCE).

Suggested change
const char* service_level_env = std::getenv("MC_IB_SL");
if (service_level_env) {
const char* service_level_env = std::getenv("MC_IB_SL");
if (service_level_env && *service_level_env) {

@codecov-commenter

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 94.23077% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ine/src/transport/rdma_transport/rdma_endpoint.cpp 0.00% 2 Missing ⚠️
mooncake-transfer-engine/src/config.cpp 90.90% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@catyans

catyans commented Jun 18, 2026

Copy link
Copy Markdown
Author

The failing build (3.10/3.12) jobs actually passed all compile and test steps — Build project and Test (in build env) with coverage are both green, including the new MC_IB_SL unit tests added in this PR. The only non-success step is the post-build packaging stage (Build Python wheel was cancelled / Generate coverage report timed out, ~30m in), which looks unrelated to this change. CI Gate is just the aggregate gate reflecting those.

On the Codecov note: the 2 uncovered lines in rdma_endpoint.cpp are the if (ib_service_level >= 0) { attr.ah_attr.sl = ... } branch, which only executes on real IB hardware with MC_IB_SL set — CI has no IB device, so that path can't be exercised there. The env-var parsing logic itself is covered by the new unit tests.

Could a maintainer re-run these jobs with their permissions? Happy to rebase onto latest main if that helps. Thanks!

@stmatengss

Copy link
Copy Markdown
Collaborator

Good catch! It is a solution for congestion control.

@alogfans alogfans left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. You can apply to TENT as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation run-ci Transfer Engine

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature Request]: Make InfiniBand Service Level (SL) configurable via MC_IB_SL

4 participants