npu gemm patch #176

Merged
tastelikefeet merged 18 commits into modelscope:main from a550580874:main
Apr 25, 2026

Conversation

@a550580874
Contributor

Attached logs: no_gemm_change.log, gemm_change.log

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

This PR adds an NPU-only monkey patch for the MoE grouped GEMM path used by fsdp2_moe.py.

When running on Ascend/NPU, the patch replaces transformers.integrations.moe._grouped_mm with an NPU implementation backed by torch_npu grouped matmul. The goal is to improve MoE training performance on NPU while keeping numerical behavior aligned with the original implementation.
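For context, the grouped GEMM path being replaced computes one matmul per expert over that expert's contiguous slice of routed tokens. A minimal pure-Python sketch of those semantics (illustrative only, not the transformers implementation; `grouped_mm_reference` and its offsets layout are assumptions for this example):

```python
# Illustrative reference of grouped-GEMM semantics: each expert e multiplies
# its contiguous slice of tokens by its own weight matrix. `offsets` holds
# the cumulative end index of each expert's token group (an assumed layout).
def grouped_mm_reference(x, weights, offsets):
    """x: list of token row vectors; weights: one K x N matrix per expert;
    offsets: cumulative end index of each expert's token group."""
    out, start = [], 0
    for expert, end in enumerate(offsets):
        w = weights[expert]
        for row in x[start:end]:
            # row (1 x K) times w (K x N) -> 1 x N
            out.append([sum(r * w[k][n] for k, r in enumerate(row))
                        for n in range(len(w[0]))])
        start = end
    return out
```

An NPU grouped-matmul kernel fuses these per-expert matmuls into a single call instead of looping, which is where the speedup reported below comes from.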

What is changed

  • add an NPU-only monkey patch for the MoE grouped GEMM path
  • replace transformers.integrations.moe._grouped_mm with _grouped_mm_npu on NPU
  • keep the default behavior unchanged on non-NPU environments
  • add helper files for running fsdp2_moe.py with the NPU patch enabled
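The gating described above can be sketched as follows. This is a hedged approximation, not the PR's exact code: `npu_is_available` and `apply_npu_gemm_patch` are illustrative names, and the real patch dispatches to a torch_npu grouped-matmul kernel where the placeholder sits.

```python
# Hedged sketch of an NPU-gated monkey patch (illustrative names; the real
# implementation replaces the body with a torch_npu grouped-matmul call).
import importlib.util

def npu_is_available() -> bool:
    """True only when torch_npu is importable and an NPU device is present."""
    if importlib.util.find_spec("torch_npu") is None:
        return False
    import torch
    return hasattr(torch, "npu") and torch.npu.is_available()

def apply_npu_gemm_patch() -> bool:
    """Swap transformers.integrations.moe._grouped_mm in place; no-op off NPU."""
    if not npu_is_available():
        return False  # default behavior unchanged on non-NPU environments
    import transformers.integrations.moe as moe

    def _grouped_mm_npu(*args, **kwargs):
        # Placeholder: the actual patch calls the native NPU grouped GEMM here.
        raise NotImplementedError

    moe._grouped_mm = _grouped_mm_npu
    print("[PATCH] transformers.integrations.moe._grouped_mm -> _grouped_mm_npu")
    return True
```

Because the swap happens only after the NPU check, importing the module on CUDA or CPU leaves the original `_grouped_mm` untouched.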

Motivation

For MoE training on Ascend/NPU, the default grouped GEMM path is not optimal for NPU execution. This PR introduces a minimal NPU-specific patch so that the MoE grouped matmul path can use the native NPU grouped GEMM kernel, improving training performance while keeping numerical behavior aligned.

Scope

This PR is intended for the NPU/Ascend path only.

  • non-NPU behavior is unchanged
  • the patch is applied only when NPU is detected
  • the code change is localized and low-risk

Experiment results

Accuracy alignment

The patched run is numerically aligned with the original run in the compared logs.

Checked steps show identical values for both:

  • loss
  • grad_norm

Examples:

  • step 2: loss=11.7626, grad_norm=2.818920
  • step 4: loss=11.8967, grad_norm=3.034608
  • step 10: loss=11.2776, grad_norm=4.786496
  • step 20: loss=10.4782, grad_norm=4.824494

This indicates that the patched NPU grouped GEMM path is numerically aligned with the original implementation for this experiment.
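The alignment check above can be automated. A small sketch, assuming log lines of the form `step N: loss=X, grad_norm=Y` (the format is inferred from the examples above; `parse_metrics` and `aligned` are hypothetical helpers, not part of the PR):

```python
import re

def parse_metrics(log_text):
    """Map step -> (loss, grad_norm) from lines like
    'step 2: loss=11.7626, grad_norm=2.818920' (assumed format)."""
    pattern = re.compile(r"step (\d+): loss=([\d.]+), grad_norm=([\d.]+)")
    return {int(s): (float(l), float(g))
            for s, l, g in pattern.findall(log_text)}

def aligned(log_a, log_b):
    """True when every step present in both logs has identical metrics."""
    a, b = parse_metrics(log_a), parse_metrics(log_b)
    common = a.keys() & b.keys()
    return bool(common) and all(a[s] == b[s] for s in common)
```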

Performance improvement

The patched run logs:
[PATCH] transformers.integrations.moe._grouped_mm -> _grouped_mm_npu

After excluding the first warmup step:

  • baseline (without patch): average step time is about 1.74 s/step
  • patched version: average step time is about 0.46 s/step

This gives an approximate 3.77x speedup on the tested run.

A few concrete step time examples:

Baseline:

  • 1.5999
  • 2.2731
  • 1.6419
  • 1.6620

Patched:

  • 0.4250
  • 0.9911
  • 0.4147
  • 0.4503

The first step is slower in the patched run due to initialization and warmup overhead, but the steady-state step time is significantly lower.
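The warmup-excluded averaging can be expressed as a small helper. This is an illustration using the four sample timings listed above as input; the PR's exact averaging window may differ, so the resulting ratio here will not match the reported 3.77x, which was computed over the full logs.

```python
def steady_state_speedup(baseline, patched, warmup=1):
    """Drop the first `warmup` steps from each list, average the rest,
    and return (baseline_mean, patched_mean, speedup factor)."""
    def mean(xs):
        return sum(xs) / len(xs)
    b = mean(baseline[warmup:])
    p = mean(patched[warmup:])
    return b, p, b / p
```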

Validation

Validation for this PR is based on:

  1. successful patch activation on NPU
  2. identical loss and grad_norm on matched steps in the compared logs
  3. significantly reduced steady-state step_time

Notes

This PR focuses on the NPU MoE grouped GEMM path only. It does not change the default behavior for non-NPU backends.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces NPU support for FSDP2 MoE models by implementing a monkey patch for the _grouped_mm function using torch_npu. It also adds a utility to detect NPU availability and a shell script for running the model on Ascend hardware. Key feedback points include addressing a missing dist import and an unused rank variable in fsdp2_moe.py, removing a redundant int64 conversion, and simplifying the logic for calculating group counts in the NPU patch.
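On the group-count point the review raises: a grouped-matmul kernel needs the number of tokens routed to each expert. A minimal sketch of that step (illustrative; `group_counts` is a hypothetical helper, not the PR's code):

```python
# Derive per-expert token counts (the group sizes a grouped matmul kernel
# consumes) from the router's expert assignments.
def group_counts(expert_ids, num_experts):
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    return counts
```

In a tensor implementation this is typically a single bincount-style op rather than a Python loop.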

Comment threads:

  • cookbook/transformers/fsdp2_moe.py
  • src/twinkle/kernel/monkey_patch_npu.py
  • src/twinkle/kernel/monkey_patch_npu.py ("Removed Chinese comments and unnecessary code comments for clarity.")
  • src/twinkle/kernel/monkey_patch_npu.py
  • cookbook/transformers/fsdp2_moe_npu.sh (outdated)
  • cookbook/transformers/fsdp2_moe.py (outdated)
@tardis-key

lgtm

Comment thread src/twinkle/utils/utils.py (outdated)
@tastelikefeet merged commit ed00083 into modelscope:main Apr 25, 2026
1 of 3 checks passed

3 participants