npu gemm patch #176

Merged
tastelikefeet merged 18 commits into modelscope:main from a550580874:main
Apr 25, 2026

Conversation

@a550580874
Contributor

Attached logs: no_gemm_change.log, gemm_change.log

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

This PR adds an NPU-only monkey patch for the MoE grouped GEMM path used by fsdp2_moe.py.

When running on Ascend/NPU, the patch replaces transformers.integrations.moe._grouped_mm with an NPU implementation backed by torch_npu grouped matmul. The goal is to improve MoE training performance on NPU while keeping numerical behavior aligned with the original implementation.
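For context, the grouped GEMM path being replaced computes one matmul per expert over that expert's contiguous slice of routed tokens. A minimal pure-Python sketch of those semantics (illustrative only, not the transformers implementation; `grouped_mm_reference` and its offsets layout are assumptions for this example):

```python
# Illustrative reference of grouped-GEMM semantics: each expert e multiplies
# its contiguous slice of tokens by its own weight matrix. `offsets` holds
# the cumulative end index of each expert's token group (an assumed layout).
def grouped_mm_reference(x, weights, offsets):
    """x: list of token row vectors; weights: one K x N matrix per expert;
    offsets: cumulative end index of each expert's token group."""
    out, start = [], 0
    for expert, end in enumerate(offsets):
        w = weights[expert]
        for row in x[start:end]:
            # row (1 x K) times w (K x N) -> 1 x N
            out.append([sum(r * w[k][n] for k, r in enumerate(row))
                        for n in range(len(w[0]))])
        start = end
    return out
```

An NPU grouped-matmul kernel fuses these per-expert matmuls into a single call instead of looping, which is where the speedup reported below comes from.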

What is changed

  • add an NPU-only monkey patch for the MoE grouped GEMM path
  • replace transformers.integrations.moe._grouped_mm with _grouped_mm_npu on NPU
  • keep the default behavior unchanged on non-NPU environments
  • add helper files for running fsdp2_moe.py with the NPU patch enabled
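The gating described above can be sketched as follows. This is a hedged approximation, not the PR's exact code: `npu_is_available` and `apply_npu_gemm_patch` are illustrative names, and the real patch dispatches to a torch_npu grouped-matmul kernel where the placeholder sits.

```python
# Hedged sketch of an NPU-gated monkey patch (illustrative names; the real
# implementation replaces the body with a torch_npu grouped-matmul call).
import importlib.util

def npu_is_available() -> bool:
    """True only when torch_npu is importable and an NPU device is present."""
    if importlib.util.find_spec("torch_npu") is None:
        return False
    import torch
    return hasattr(torch, "npu") and torch.npu.is_available()

def apply_npu_gemm_patch() -> bool:
    """Swap transformers.integrations.moe._grouped_mm in place; no-op off NPU."""
    if not npu_is_available():
        return False  # default behavior unchanged on non-NPU environments
    import transformers.integrations.moe as moe

    def _grouped_mm_npu(*args, **kwargs):
        # Placeholder: the actual patch calls the native NPU grouped GEMM here.
        raise NotImplementedError

    moe._grouped_mm = _grouped_mm_npu
    print("[PATCH] transformers.integrations.moe._grouped_mm -> _grouped_mm_npu")
    return True
```

Because the swap happens only after the NPU check, importing the module on CUDA or CPU leaves the original `_grouped_mm` untouched.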

Motivation

For MoE training on Ascend/NPU, the default grouped GEMM path is not optimal for NPU execution. This PR introduces a minimal NPU-specific patch so that the MoE grouped matmul path can use the native NPU grouped GEMM kernel, improving training performance while keeping numerical behavior aligned.

Scope

This PR is intended for the NPU/Ascend path only.

  • non-NPU behavior is unchanged
  • the patch is applied only when NPU is detected
  • the code change is localized and low-risk

Experiment results

Accuracy alignment

The patched run is numerically aligned with the original run in the compared logs.

Checked steps show identical values for both:

  • loss
  • grad_norm

Examples:

  • step 2: loss=11.7626, grad_norm=2.818920
  • step 4: loss=11.8967, grad_norm=3.034608
  • step 10: loss=11.2776, grad_norm=4.786496
  • step 20: loss=10.4782, grad_norm=4.824494

This indicates that the patched NPU grouped GEMM path is numerically aligned with the original implementation for this experiment.
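The alignment check above can be automated. A small sketch, assuming log lines of the form `step N: loss=X, grad_norm=Y` (the format is inferred from the examples above; `parse_metrics` and `aligned` are hypothetical helpers, not part of the PR):

```python
import re

def parse_metrics(log_text):
    """Map step -> (loss, grad_norm) from lines like
    'step 2: loss=11.7626, grad_norm=2.818920' (assumed format)."""
    pattern = re.compile(r"step (\d+): loss=([\d.]+), grad_norm=([\d.]+)")
    return {int(s): (float(l), float(g))
            for s, l, g in pattern.findall(log_text)}

def aligned(log_a, log_b):
    """True when every step present in both logs has identical metrics."""
    a, b = parse_metrics(log_a), parse_metrics(log_b)
    common = a.keys() & b.keys()
    return bool(common) and all(a[s] == b[s] for s in common)
```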

Performance improvement

The patched run logs:
[PATCH] transformers.integrations.moe._grouped_mm -> _grouped_mm_npu

After excluding the first warmup step:

  • baseline (without patch): average step time is about 1.74 s/step
  • patched version: average step time is about 0.46 s/step

This gives an approximate 3.77x speedup on the tested run.

A few concrete step time examples:

Baseline:

  • 1.5999
  • 2.2731
  • 1.6419
  • 1.6620

Patched:

  • 0.4250
  • 0.9911
  • 0.4147
  • 0.4503

The first step is slower in the patched run due to initialization and warmup overhead, but the steady-state step time is significantly lower.
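The warmup-excluded averaging can be expressed as a small helper. This is an illustration using the four sample timings listed above as input; the PR's exact averaging window may differ, so the resulting ratio here will not match the reported 3.77x, which was computed over the full logs.

```python
def steady_state_speedup(baseline, patched, warmup=1):
    """Drop the first `warmup` steps from each list, average the rest,
    and return (baseline_mean, patched_mean, speedup factor)."""
    def mean(xs):
        return sum(xs) / len(xs)
    b = mean(baseline[warmup:])
    p = mean(patched[warmup:])
    return b, p, b / p
```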

Validation

Validation for this PR is based on:

  1. successful patch activation on NPU
  2. identical loss and grad_norm on matched steps in the compared logs
  3. significantly reduced steady-state step_time

Notes

This PR focuses on the NPU MoE grouped GEMM path only. It does not change the default behavior for non-NPU backends.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces NPU support for FSDP2 MoE models by implementing a monkey patch for the _grouped_mm function using torch_npu. It also adds a utility to detect NPU availability and a shell script for running the model on Ascend hardware. Key feedback points include addressing a missing dist import and an unused rank variable in fsdp2_moe.py, removing a redundant int64 conversion, and simplifying the logic for calculating group counts in the NPU patch.
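On the group-count point the review raises: a grouped-matmul kernel needs the number of tokens routed to each expert. A minimal sketch of that step (illustrative; `group_counts` is a hypothetical helper, not the PR's code):

```python
# Derive per-expert token counts (the group sizes a grouped matmul kernel
# consumes) from the router's expert assignments.
def group_counts(expert_ids, num_experts):
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    return counts
```

In a tensor implementation this is typically a single bincount-style op rather than a Python loop.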

Comment threads:

  • cookbook/transformers/fsdp2_moe.py
  • src/twinkle/kernel/monkey_patch_npu.py
  • src/twinkle/kernel/monkey_patch_npu.py ("Removed Chinese comments and unnecessary code comments for clarity.")
  • src/twinkle/kernel/monkey_patch_npu.py
  • cookbook/transformers/fsdp2_moe_npu.sh (outdated)
  • cookbook/transformers/fsdp2_moe.py (outdated)
@tardis-key

lgtm

Comment thread src/twinkle/utils/utils.py (outdated)
@tastelikefeet merged commit ed00083 into modelscope:main Apr 25, 2026
1 of 3 checks passed

3 participants