[TransferEngine] Add congestion control plugin interface and AIMD implementation by stmatengss · Pull Request #2489 · kvcache-ai/Mooncake

stmatengss · 2026-06-15T15:51:18Z

Summary

Add a pluggable congestion control framework to the TENT runtime, enabling per-device flow control and congestion awareness for multi-NIC deployments.

Changes

Plugin Interface (`tent/include/tent/runtime/congestion_control.h`)

CongestionControlPlugin abstract base class with 3 hook points:
- onAdmit() — admission control before scheduling transfers
- onCompletion() — feedback on successful transfer completion
- onCongestionSignal() — reaction to timeout/ECN signals
Context structs: AdmitContext, AdmitDecision, CompletionEvent, CongestionSignal
NoOp default implementation (zero-overhead when no policy is active)

AIMD Plugin (`tent/include/tent/runtime/aimd_congestion_control.h`)

Additive Increase / Multiplicative Decrease per-device window management
Lock-free atomic state for hot-path performance (<100ns)
Configurable parameters: initial/cap/floor window, AI increment, MD factor

Integration Points

TransferEngineImpl: admission hook in submitTransfer(), plugin lifecycle
DeviceSelector: completion callback in release()
Workers: congestion signal emission on CQ timeout

Tests

13 GTest unit tests for AIMD plugin covering window growth, shrink, bounds, multi-device isolation

Documentation

Detailed design doc (Chinese): docs/source/design/tent/congestion-control.md
Brief presentation doc: docs/source/design/tent/congestion-control-brief.md

Motivation

Addresses deployment risks for HPN 7.0UL network architecture with L20D nodes:

Risk 3.3: M2N ASW buffer exhaustion — solved by per-destination inflight tracking
Risk 3.4: HPN 7.0UL uplink congestion — solved by topology-aware rate limiting

Testing

All 13 AIMD unit tests pass
Full TENT build (373 targets) compiles cleanly with zero errors

…lementation Add a pluggable congestion control framework to the TENT runtime, enabling per-device flow control and congestion awareness for multi-NIC deployments. Plugin Interface (tent/include/tent/runtime/congestion_control.h): - CongestionControlPlugin abstract base class with 3 hook points: * onAdmit() — admission control before scheduling transfers * onCompletion() — feedback on successful transfer completion * onCongestionSignal() — reaction to timeout/ECN signals - Context structs: AdmitContext, AdmitDecision, CompletionEvent, CongestionSignal - NoOp default implementation (zero-overhead when no policy is active) AIMD Plugin (tent/include/tent/runtime/aimd_congestion_control.h): - Additive Increase / Multiplicative Decrease per-device window management - Lock-free atomic state for hot-path performance (<100ns) - Configurable parameters: initial/cap/floor window, AI increment, MD factor Integration Points: - TransferEngineImpl: admission hook in submitTransfer(), plugin lifecycle - DeviceSelector / quota: completion callback in release() - Workers: congestion signal emission on CQ timeout Tests: - 13 GTest unit tests for AIMD plugin covering window growth, shrink, bounds, multi-device isolation Documentation: - Detailed design doc (Chinese): docs/source/design/tent/congestion-control.md - Brief presentation doc: docs/source/design/tent/congestion-control-brief.md Motivation: Addresses deployment risks for HPN 7.0UL network architecture with L20D nodes: - Risk 3.3: M2N ASW buffer exhaustion — solved by per-destination inflight tracking - Risk 3.4: HPN 7.0UL uplink congestion — solved by topology-aware rate limiting

gemini-code-assist

Code Review

This pull request introduces a pluggable Congestion Control Plugin framework for the TENT Runtime, featuring an AIMD-based implementation and corresponding unit tests to mitigate network buffer exhaustion and uplink congestion. Feedback on the changes highlights critical issues: non-atomic read-modify-write operations on congestion windows that cause race conditions, a lack of propagation of the plugin to active transports and device selectors, an ineffective admission check that ignores deferral decisions and passes an invalid device ID, and hardcoded status and priority values in completion events.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-15T16:04:34Z

+    auto& w = window_bytes_[event.device_id];
+    uint64_t cur = w.load(std::memory_order_relaxed);
+    uint64_t next = std::min(cur + config_.ai_increment_bytes,
+                             config_.window_cap_bytes);
+    w.store(next, std::memory_order_relaxed);


The read-modify-write operation on window_bytes_ is not atomic. Since onCompletion can be called concurrently from multiple worker threads when transfers complete, this relaxed load followed by a relaxed store creates a race condition where window updates can be lost. Use compare_exchange_weak to perform the update atomically.

auto& w = window_bytes_[event.device_id]; uint64_t cur = w.load(std::memory_order_relaxed); while (!w.compare_exchange_weak( cur, std::min(cur + config_.ai_increment_bytes, config_.window_cap_bytes), std::memory_order_relaxed)) { }

gemini-code-assist · 2026-06-15T16:04:34Z

+    auto& w = window_bytes_[signal.device_id];
+    uint64_t cur = w.load(std::memory_order_relaxed);
+    double factor = 1.0 - config_.md_factor * signal.severity;
+    uint64_t reduced = static_cast<uint64_t>(cur * std::max(factor, 0.0));
+    uint64_t clamped = std::max(reduced, config_.window_floor_bytes);
+    w.store(clamped, std::memory_order_relaxed);


Similar to onCompletion, the read-modify-write operation on window_bytes_ in onCongestionSignal is not atomic and can suffer from lost updates under concurrent congestion signals. Use compare_exchange_weak to ensure the multiplicative decrease is applied atomically.

auto& w = window_bytes_[signal.device_id]; uint64_t cur = w.load(std::memory_order_relaxed); double factor = 1.0 - config_.md_factor * signal.severity; uint64_t next; do { uint64_t reduced = static_cast<uint64_t>(cur * std::max(factor, 0.0)); next = std::max(reduced, config_.window_floor_bytes); } while (!w.compare_exchange_weak(cur, next, std::memory_order_relaxed));

gemini-code-assist · 2026-06-15T16:04:34Z

+    cc_plugin_ = std::move(plugin);
+    LOG(INFO) << "Congestion control plugin set: " << cc_plugin_->getName();


The congestion control plugin is set on TransferEngineImpl but is never propagated to the loaded transports or their DeviceSelectors. Because of this, DeviceSelector::cc_plugin_ remains nullptr, and no completion events or congestion signals will ever be delivered to the plugin. You need to propagate the plugin to the active transports so they can register it on their selectors.

gemini-code-assist · 2026-06-15T16:04:34Z

+        // Congestion control admission check
+        {
+            AdmitContext admit_ctx{
+                merged_request,
+                0,  // TODO: track global_inflight_bytes
+                0,  // TODO: track per-device inflight from DeviceSelector
+                -1, // device_id not yet determined at this stage
+                merged_request.priority};
+            auto decision = cc_plugin_->onAdmit(admit_ctx);
+            if (decision.device_mask_override != 0) {
+                // Apply congestion control device mask constraint
+                task.device_mask &= decision.device_mask_override;
+            }
+        }


There are two major issues with the admission control check here:

decision.admit is completely ignored. If the plugin decides to defer or reject the request (admit = false), the engine still proceeds with the transfer anyway.

device_id is passed as -1 because the device is not yet determined at this stage. However, the AIMD plugin (AimdCongestionControlPlugin::onAdmit) always returns true (admit) when device_id < 0. This means the AIMD congestion control is completely bypassed and ineffective.

To fix this, you need to integrate a deferral queue or callback mechanism to handle admit = false and ensure the candidate device is resolved before calling onAdmit.

gemini-code-assist · 2026-06-15T16:04:34Z

+    if (cc_plugin_) {
+        CompletionEvent event;
+        event.device_id = dev_id;
+        event.bytes = length;
+        event.latency_us = latency;
+        event.status = TransferStatusEnum::COMPLETED;
+        event.priority = PRIO_HIGH;  // Default; actual priority not available here
+        cc_plugin_->onCompletion(event);
+    }


The CompletionEvent is constructed with hardcoded status = TransferStatusEnum::COMPLETED and priority = PRIO_HIGH. This prevents the congestion control plugin from distinguishing between different request priorities or reacting to failed transfers. Consider refactoring DeviceSelector::release to accept the actual status and priority from the transport layer.

stmatengss · 2026-06-15T16:34:58Z

@qoder review it

qoderai · 2026-06-15T16:35:54Z

Thanks for the congestion-control PR ping! I’ve gone through the design docs, AIMD plugin implementation, integration points, and tests. Here’s my review summary:

Overall

The plugin abstraction and AIMD design look solid and align with the documented goals (per-device control, zero-op noop plugin, hot-path friendliness).
Tests are comprehensive and cover window growth/shrink, bounds, multi-device isolation, and factory behavior.
The Gemini review already pointed out most of the critical issues; I’ll focus on those plus a couple of integration details.

AIMD implementation (aimd_congestion_control.*)

onCompletion and onCongestionSignal currently do non-atomic read–modify–write on window_bytes_. Under concurrent completions/signals you can lose updates.
- I recommend switching to a compare_exchange_weak loop for both paths, as the existing review suggests, so the AIMD window is updated atomically per device.
Bounds logic (cap/floor) and severity handling look correct; clamping to window_floor_bytes prevents collapse to zero.
Out-of-range device IDs correctly degrade to no-op, which is good for safety.

Plugin wiring / lifecycle

TransferEngineImpl owns a shared_ptr<CongestionControlPlugin> with a default noop instance, and setCongestionControlPlugin replaces it. That part is fine.
However, the plugin isn’t propagated into the RDMA stack:
- DeviceSelector has setCongestionControlPlugin / getCongestionControlPlugin, but nothing in TransferEngineImpl calls it when the plugin changes.
- As a result, DeviceSelector::cc_plugin_ stays nullptr and the callbacks in quota.cpp / workers.cpp never hit your plugin.
- You’ll want to hook setCongestionControlPlugin up to the existing transport/selector instances (e.g., via TransportSelector or wherever DeviceSelector is constructed) so that changing the plugin takes effect cluster-wide.

Admission path (TransferEngineImpl::submitTransfer)

Current admission hook calls onAdmit with:
- device_id = -1
- global_inflight_bytes = 0, device_inflight_bytes = 0
- The result’s admit flag is ignored; only device_mask_override is used.
With the current AIMD implementation, device_id < 0 means “always admit”, so AIMD-based throttling doesn’t actually engage here.
To make admission control effective you’ll likely need to:
- Integrate this at a point where a candidate device_id and inflight counters are known (or call it post-device selection with a specific dev).
- Respect admit == false by deferring/queueing the request instead of continuing immediately.
- Optionally plumb real global_inflight_bytes and per-device inflight into AdmitContext as the design doc suggests.

Completion and congestion signals

DeviceSelector::release currently synthesizes CompletionEvent with status = COMPLETED and priority = PRIO_HIGH.
- That prevents the plugin from differentiating priorities or reacting to failures/timeouts on the completion path.
- Refactoring release to accept real status/priority (or adding an overload that does) will make the feedback loop much more informative.
Timeout-based congestion signal in Workers::asyncPollCq looks reasonable; every timeout generates a TIMEOUT_SPIKE with severity 1.0, which is consistent with treating timeouts as maximal congestion.

Docs & tests

The design docs match the header-level API well, including the three hook points and context structs.
Unit tests for AIMD are well structured and give good confidence in the per-device window logic as a standalone component.

Suggested next steps

Fix the atomicity of window_bytes_ updates with CAS loops in onCompletion / onCongestionSignal.
Wire setCongestionControlPlugin through to the active DeviceSelector/transports so callbacks are actually delivered.
Revisit the submitTransfer hook to:
- Use a concrete device_id and inflight metrics where possible.
- Implement real handling of admit = false (deferral) per the design doc.
Extend DeviceSelector::release (or related paths) to pass real status/priority to CompletionEvent.

Once those integration and concurrency issues are addressed, the plugin framework should be in good shape to control congestion in the intended M2N/HPN scenarios. 🚀

🤖 Generated by Qoder • View workflow run

staryxchen · 2026-06-16T07:29:31Z

Do congestion control and QoS conflict with each other? Or can they complement each other? cc @alogfans
BTW, we currently maintain only English version of docs, so design documents should be submitted in English.

catyans · 2026-06-18T07:27:43Z

Do congestion control and QoS conflict with each other? Or can they complement each other?

They're complementary and operate at different layers — congestion control and QoS answer different questions:

QoS (ordering + admission) answers "given the transfers we have, which one goes next, and may it enter at all?" — the existing priority queues ([TENT] Enhanced QoS and Slice Spraying for TENT #2048), starvation promotion, and the local admission queue ([TENT] Add local admission queue prototype #2341 / RFC [RFC]: TENT Local Transfer Admission Queue #2132) live here. It's about scheduling and selection.
Congestion control (this PR) answers "once a transfer is dispatched onto a device, how fast may it send?" — AIMD adjusts the per-device window based on completion/loss feedback. It's about rate.

So the natural composition is: QoS decides what gets dispatched and in what order; CC decides how fast each dispatched flow drives a given device. They don't fight as long as the layering is kept clean:

CC should observe, not override, priority. The window is a per-device rate cap; QoS still owns dequeue order. A high-priority transfer shouldn't be silently throttled below a low-priority one on the same device — the AIMD window applies to aggregate device load, and the priority queue decides who consumes that budget first.
The admission hook is where the two meet: admission (QoS) can reject/defer; CC can shape the rate of what's admitted. Today the hook is wired with device_id=-1 / inflight=0 and ignores the admit flag, so the two layers aren't actually talking yet — wiring that through is what makes them compose rather than each acting blind.

A useful framing: CC handles the network layer (rate vs. fabric capacity), QoS handles the application/SLO layer (which flows matter, in what order). They're orthogonal axes, and you generally want both. This is also why a deadline/SLO-aware admission policy (RFC #2519) sits cleanly on top — it consumes a per-transfer deadline to decide admission, while AIMD independently governs send rate; neither needs to know the other's internal state, they just share the admission boundary.

Happy to help untangle the integration points if useful.

alogfans · 2026-06-18T11:12:47Z

Do congestion control and QoS conflict with each other? Or can they complement each other? cc @alogfans BTW, we currently maintain only English version of docs, so design documents should be submitted in English.

They are not conflicted but should be considered together, and uses the uniform mechanism.

stmatengss requested review from 00fish0, ShangmingCai, alogfans, chestnut-Q, doujiang24, dtcccc, staryxchen and ykwd as code owners June 15, 2026 15:51

github-actions Bot added documentation Improvements or additions to documentation run-ci Transfer Engine labels Jun 15, 2026

gemini-code-assist Bot reviewed Jun 15, 2026

View reviewed changes

This was referenced Jun 22, 2026

[RFC]: Deadline-aware admission & adaptive degradation for TENT transfers #2519

Open

[RFC]: Traffic class hint API for TransferEngine — bridging upper-layer semantics (KV / EP / CTRL) to lower-layer SL/TC #2568

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[TransferEngine] Add congestion control plugin interface and AIMD implementation#2489

[TransferEngine] Add congestion control plugin interface and AIMD implementation#2489
stmatengss wants to merge 1 commit into
kvcache-ai:mainfrom
stmatengss:feature/tent-congestion-control-plugin

stmatengss commented Jun 15, 2026 •

edited by github-actions Bot

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Uh oh!

stmatengss commented Jun 15, 2026

Uh oh!

qoderai Bot commented Jun 15, 2026 •

edited

Loading

Uh oh!

staryxchen commented Jun 16, 2026

Uh oh!

catyans commented Jun 18, 2026

Uh oh!

alogfans commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		cc_plugin_ = std::move(plugin);
		LOG(INFO) << "Congestion control plugin set: " << cc_plugin_->getName();

Conversation

stmatengss commented Jun 15, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Plugin Interface (tent/include/tent/runtime/congestion_control.h)

AIMD Plugin (tent/include/tent/runtime/aimd_congestion_control.h)

Integration Points

Tests

Documentation

Motivation

Testing

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

stmatengss commented Jun 15, 2026

Uh oh!

qoderai Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

staryxchen commented Jun 16, 2026

Uh oh!

catyans commented Jun 18, 2026

Uh oh!

alogfans commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

stmatengss commented Jun 15, 2026 •

edited by github-actions Bot

Loading

Plugin Interface (`tent/include/tent/runtime/congestion_control.h`)

AIMD Plugin (`tent/include/tent/runtime/aimd_congestion_control.h`)

qoderai Bot commented Jun 15, 2026 •

edited

Loading