Conversation
Code Review
This pull request integrates ElasticBuffer from deep_ep to optimize MoE dispatching by separating prefill and decode logic. It introduces distinct buffers for low-latency and elastic operations, refines SM allocation for deep_gemm, and updates environment variable handling for token dispatch limits across various models. Feedback was provided regarding a potential NameError in type hinting for EventOverlap when the deep_ep library is missing.
```diff
 w1_scale: Optional[torch.Tensor] = None,
 w2_scale: Optional[torch.Tensor] = None,
-previous_event: Optional["EventOverlap"] = None,
+previous_event: Optional[EventOverlap] = None,
```
Using EventOverlap directly in the type hint will cause a NameError at import time if deep_ep is not installed, as the import is wrapped in a try...except block. Please use a string literal for the type hint to maintain compatibility with environments where deep_ep might be missing.
Suggested change:
```diff
-previous_event: Optional[EventOverlap] = None,
+previous_event: Optional["EventOverlap"] = None,
```
No description provided.