Skip to content

[TransferEngine] Add congestion control plugin interface and AIMD implementation#2489

Open
stmatengss wants to merge 1 commit into
kvcache-ai:mainfrom
stmatengss:feature/tent-congestion-control-plugin
Open

[TransferEngine] Add congestion control plugin interface and AIMD implementation#2489
stmatengss wants to merge 1 commit into
kvcache-ai:mainfrom
stmatengss:feature/tent-congestion-control-plugin

Conversation

@stmatengss

@stmatengss stmatengss commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Summary

Add a pluggable congestion control framework to the TENT runtime, enabling per-device flow control and congestion awareness for multi-NIC deployments.

Changes

Plugin Interface (tent/include/tent/runtime/congestion_control.h)

  • CongestionControlPlugin abstract base class with 3 hook points:
    • onAdmit() — admission control before scheduling transfers
    • onCompletion() — feedback on successful transfer completion
    • onCongestionSignal() — reaction to timeout/ECN signals
  • Context structs: AdmitContext, AdmitDecision, CompletionEvent, CongestionSignal
  • NoOp default implementation (zero-overhead when no policy is active)

AIMD Plugin (tent/include/tent/runtime/aimd_congestion_control.h)

  • Additive Increase / Multiplicative Decrease per-device window management
  • Lock-free atomic state for hot-path performance (<100ns)
  • Configurable parameters: initial/cap/floor window, AI increment, MD factor

Integration Points

  • TransferEngineImpl: admission hook in submitTransfer(), plugin lifecycle
  • DeviceSelector: completion callback in release()
  • Workers: congestion signal emission on CQ timeout

Tests

  • 13 GTest unit tests for AIMD plugin covering window growth, shrink, bounds, multi-device isolation

Documentation

  • Detailed design doc (Chinese): docs/source/design/tent/congestion-control.md
  • Brief presentation doc: docs/source/design/tent/congestion-control-brief.md

Motivation

Addresses deployment risks for HPN 7.0UL network architecture with L20D nodes:

  • Risk 3.3: M2N ASW buffer exhaustion — solved by per-destination inflight tracking
  • Risk 3.4: HPN 7.0UL uplink congestion — solved by topology-aware rate limiting

Testing

  • All 13 AIMD unit tests pass
  • Full TENT build (373 targets) compiles cleanly with zero errors

…lementation

Add a pluggable congestion control framework to the TENT runtime, enabling
per-device flow control and congestion awareness for multi-NIC deployments.

Plugin Interface (tent/include/tent/runtime/congestion_control.h):
- CongestionControlPlugin abstract base class with 3 hook points:
  * onAdmit() — admission control before scheduling transfers
  * onCompletion() — feedback on successful transfer completion
  * onCongestionSignal() — reaction to timeout/ECN signals
- Context structs: AdmitContext, AdmitDecision, CompletionEvent,
  CongestionSignal
- NoOp default implementation (zero-overhead when no policy is active)

AIMD Plugin (tent/include/tent/runtime/aimd_congestion_control.h):
- Additive Increase / Multiplicative Decrease per-device window management
- Lock-free atomic state for hot-path performance (<100ns)
- Configurable parameters: initial/cap/floor window, AI increment, MD factor

Integration Points:
- TransferEngineImpl: admission hook in submitTransfer(), plugin lifecycle
- DeviceSelector / quota: completion callback in release()
- Workers: congestion signal emission on CQ timeout

Tests:
- 13 GTest unit tests for AIMD plugin covering window growth, shrink,
  bounds, multi-device isolation

Documentation:
- Detailed design doc (Chinese): docs/source/design/tent/congestion-control.md
- Brief presentation doc: docs/source/design/tent/congestion-control-brief.md

Motivation:
Addresses deployment risks for HPN 7.0UL network architecture with L20D nodes:
- Risk 3.3: M2N ASW buffer exhaustion — solved by per-destination inflight
  tracking
- Risk 3.4: HPN 7.0UL uplink congestion — solved by topology-aware rate
  limiting

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a pluggable Congestion Control Plugin framework for the TENT Runtime, featuring an AIMD-based implementation and corresponding unit tests to mitigate network buffer exhaustion and uplink congestion. Feedback on the changes highlights critical issues: non-atomic read-modify-write operations on congestion windows that cause race conditions, a lack of propagation of the plugin to active transports and device selectors, an ineffective admission check that ignores deferral decisions and passes an invalid device ID, and hardcoded status and priority values in completion events.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +55 to +59
auto& w = window_bytes_[event.device_id];
uint64_t cur = w.load(std::memory_order_relaxed);
uint64_t next = std::min(cur + config_.ai_increment_bytes,
config_.window_cap_bytes);
w.store(next, std::memory_order_relaxed);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The read-modify-write operation on window_bytes_ is not atomic. Since onCompletion can be called concurrently from multiple worker threads when transfers complete, this relaxed load followed by a relaxed store creates a race condition where window updates can be lost. Use compare_exchange_weak to perform the update atomically.

    auto& w = window_bytes_[event.device_id];
    uint64_t cur = w.load(std::memory_order_relaxed);
    while (!w.compare_exchange_weak(
        cur,
        std::min(cur + config_.ai_increment_bytes, config_.window_cap_bytes),
        std::memory_order_relaxed)) {
    }

Comment on lines +70 to +75
auto& w = window_bytes_[signal.device_id];
uint64_t cur = w.load(std::memory_order_relaxed);
double factor = 1.0 - config_.md_factor * signal.severity;
uint64_t reduced = static_cast<uint64_t>(cur * std::max(factor, 0.0));
uint64_t clamped = std::max(reduced, config_.window_floor_bytes);
w.store(clamped, std::memory_order_relaxed);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Similar to onCompletion, the read-modify-write operation on window_bytes_ in onCongestionSignal is not atomic and can suffer from lost updates under concurrent congestion signals. Use compare_exchange_weak to ensure the multiplicative decrease is applied atomically.

    auto& w = window_bytes_[signal.device_id];
    uint64_t cur = w.load(std::memory_order_relaxed);
    double factor = 1.0 - config_.md_factor * signal.severity;
    uint64_t next;
    do {
        uint64_t reduced = static_cast<uint64_t>(cur * std::max(factor, 0.0));
        next = std::max(reduced, config_.window_floor_bytes);
    } while (!w.compare_exchange_weak(cur, next, std::memory_order_relaxed));

Comment on lines +1630 to +1631
cc_plugin_ = std::move(plugin);
LOG(INFO) << "Congestion control plugin set: " << cc_plugin_->getName();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The congestion control plugin is set on TransferEngineImpl but is never propagated to the loaded transports or their DeviceSelectors. Because of this, DeviceSelector::cc_plugin_ remains nullptr, and no completion events or congestion signals will ever be delivered to the plugin. You need to propagate the plugin to the active transports so they can register it on their selectors.

Comment on lines +1255 to +1268
// Congestion control admission check
{
AdmitContext admit_ctx{
merged_request,
0, // TODO: track global_inflight_bytes
0, // TODO: track per-device inflight from DeviceSelector
-1, // device_id not yet determined at this stage
merged_request.priority};
auto decision = cc_plugin_->onAdmit(admit_ctx);
if (decision.device_mask_override != 0) {
// Apply congestion control device mask constraint
task.device_mask &= decision.device_mask_override;
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There are two major issues with the admission control check here:

  1. decision.admit is completely ignored. If the plugin decides to defer or reject the request (admit = false), the engine still proceeds with the transfer anyway.
  2. device_id is passed as -1 because the device is not yet determined at this stage. However, the AIMD plugin (AimdCongestionControlPlugin::onAdmit) always returns true (admit) when device_id < 0. This means the AIMD congestion control is completely bypassed and ineffective.

To fix this, you need to integrate a deferral queue or callback mechanism to handle admit = false and ensure the candidate device is resolved before calling onAdmit.

Comment on lines +318 to +326
if (cc_plugin_) {
CompletionEvent event;
event.device_id = dev_id;
event.bytes = length;
event.latency_us = latency;
event.status = TransferStatusEnum::COMPLETED;
event.priority = PRIO_HIGH; // Default; actual priority not available here
cc_plugin_->onCompletion(event);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The CompletionEvent is constructed with hardcoded status = TransferStatusEnum::COMPLETED and priority = PRIO_HIGH. This prevents the congestion control plugin from distinguishing between different request priorities or reacting to failed transfers. Consider refactoring DeviceSelector::release to accept the actual status and priority from the transport layer.

@stmatengss

Copy link
Copy Markdown
Collaborator Author

@qoder review it

@qoderai

qoderai Bot commented Jun 15, 2026

Copy link
Copy Markdown

Thanks for the congestion-control PR ping! I’ve gone through the design docs, AIMD plugin implementation, integration points, and tests. Here’s my review summary:

Overall

  • The plugin abstraction and AIMD design look solid and align with the documented goals (per-device control, zero-op noop plugin, hot-path friendliness).
  • Tests are comprehensive and cover window growth/shrink, bounds, multi-device isolation, and factory behavior.
  • The Gemini review already pointed out most of the critical issues; I’ll focus on those plus a couple of integration details.

AIMD implementation (aimd_congestion_control.*)

  • onCompletion and onCongestionSignal currently do non-atomic read–modify–write on window_bytes_. Under concurrent completions/signals you can lose updates.
    • I recommend switching to a compare_exchange_weak loop for both paths, as the existing review suggests, so the AIMD window is updated atomically per device.
  • Bounds logic (cap/floor) and severity handling look correct; clamping to window_floor_bytes prevents collapse to zero.
  • Out-of-range device IDs correctly degrade to no-op, which is good for safety.

Plugin wiring / lifecycle

  • TransferEngineImpl owns a shared_ptr<CongestionControlPlugin> with a default noop instance, and setCongestionControlPlugin replaces it. That part is fine.
  • However, the plugin isn’t propagated into the RDMA stack:
    • DeviceSelector has setCongestionControlPlugin / getCongestionControlPlugin, but nothing in TransferEngineImpl calls it when the plugin changes.
    • As a result, DeviceSelector::cc_plugin_ stays nullptr and the callbacks in quota.cpp / workers.cpp never hit your plugin.
    • You’ll want to hook setCongestionControlPlugin up to the existing transport/selector instances (e.g., via TransportSelector or wherever DeviceSelector is constructed) so that changing the plugin takes effect cluster-wide.

Admission path (TransferEngineImpl::submitTransfer)

  • Current admission hook calls onAdmit with:
    • device_id = -1
    • global_inflight_bytes = 0, device_inflight_bytes = 0
    • The result’s admit flag is ignored; only device_mask_override is used.
  • With the current AIMD implementation, device_id < 0 means “always admit”, so AIMD-based throttling doesn’t actually engage here.
  • To make admission control effective you’ll likely need to:
    • Integrate this at a point where a candidate device_id and inflight counters are known (or call it post-device selection with a specific dev).
    • Respect admit == false by deferring/queueing the request instead of continuing immediately.
    • Optionally plumb real global_inflight_bytes and per-device inflight into AdmitContext as the design doc suggests.

Completion and congestion signals

  • DeviceSelector::release currently synthesizes CompletionEvent with status = COMPLETED and priority = PRIO_HIGH.
    • That prevents the plugin from differentiating priorities or reacting to failures/timeouts on the completion path.
    • Refactoring release to accept real status/priority (or adding an overload that does) will make the feedback loop much more informative.
  • Timeout-based congestion signal in Workers::asyncPollCq looks reasonable; every timeout generates a TIMEOUT_SPIKE with severity 1.0, which is consistent with treating timeouts as maximal congestion.

Docs & tests

  • The design docs match the header-level API well, including the three hook points and context structs.
  • Unit tests for AIMD are well structured and give good confidence in the per-device window logic as a standalone component.

Suggested next steps

  • Fix the atomicity of window_bytes_ updates with CAS loops in onCompletion / onCongestionSignal.
  • Wire setCongestionControlPlugin through to the active DeviceSelector/transports so callbacks are actually delivered.
  • Revisit the submitTransfer hook to:
    • Use a concrete device_id and inflight metrics where possible.
    • Implement real handling of admit = false (deferral) per the design doc.
  • Extend DeviceSelector::release (or related paths) to pass real status/priority to CompletionEvent.

Once those integration and concurrency issues are addressed, the plugin framework should be in good shape to control congestion in the intended M2N/HPN scenarios. 🚀


🤖 Generated by QoderView workflow run

@staryxchen

Copy link
Copy Markdown
Collaborator

Do congestion control and QoS conflict with each other? Or can they complement each other? cc @alogfans
BTW, we currently maintain only English version of docs, so design documents should be submitted in English.

@catyans

catyans commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Do congestion control and QoS conflict with each other? Or can they complement each other?

They're complementary and operate at different layers — congestion control and QoS answer different questions:

So the natural composition is: QoS decides what gets dispatched and in what order; CC decides how fast each dispatched flow drives a given device. They don't fight as long as the layering is kept clean:

  • CC should observe, not override, priority. The window is a per-device rate cap; QoS still owns dequeue order. A high-priority transfer shouldn't be silently throttled below a low-priority one on the same device — the AIMD window applies to aggregate device load, and the priority queue decides who consumes that budget first.
  • The admission hook is where the two meet: admission (QoS) can reject/defer; CC can shape the rate of what's admitted. Today the hook is wired with device_id=-1 / inflight=0 and ignores the admit flag, so the two layers aren't actually talking yet — wiring that through is what makes them compose rather than each acting blind.

A useful framing: CC handles the network layer (rate vs. fabric capacity), QoS handles the application/SLO layer (which flows matter, in what order). They're orthogonal axes, and you generally want both. This is also why a deadline/SLO-aware admission policy (RFC #2519) sits cleanly on top — it consumes a per-transfer deadline to decide admission, while AIMD independently governs send rate; neither needs to know the other's internal state, they just share the admission boundary.

Happy to help untangle the integration points if useful.

@alogfans

Copy link
Copy Markdown
Collaborator

Do congestion control and QoS conflict with each other? Or can they complement each other? cc @alogfans BTW, we currently maintain only English version of docs, so design documents should be submitted in English.

They are not conflicted but should be considered together, and uses the uniform mechanism.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation run-ci Transfer Engine

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants