feat: add SpeechNet (SilentWear) training test #31

Draft
runwangdl wants to merge 8 commits into devel from feat/speechnet-training

@runwangdl
Owner

Summary

  • Add SpeechNet EMG silent speech recognition model (14ch × 700 samples, 9 classes, ~15K params) to training test suite
  • Fix ConvLayer.computeShapes bias shape bug: inputShapes[1][0] → (inputShapes[1][0],), which prevents a graphsurgeon export crash on Conv layers with bias
  • Register SpeechNet in L2 singlebuffer training config (l1=128000, l2=2000000)
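
As a minimal sketch of why the bias-shape fix matters: a shape is expected to be a sequence of dimensions, so a bare int fails as soon as a consumer takes its length or iterates over it. The function names and the example shapes below (16 output channels, kernel size 3) are hypothetical and chosen only for illustration; they are not taken from Deeploy or from the SpeechNet model.

```python
def bias_shape_buggy(input_shapes):
    # Stand-in for the old behaviour: returns the output-channel count as a bare int.
    return input_shapes[1][0]


def bias_shape_fixed(input_shapes):
    # Wrap the output-channel count in a 1-tuple so it is a valid 1-D shape.
    return (input_shapes[1][0],)


def rank(shape):
    # Shape consumers typically take the length of a shape or iterate over it.
    return len(shape)


# Illustrative [activation, weight] shapes; the values are assumptions.
input_shapes = [(1, 14, 700), (16, 14, 3)]

fixed = bias_shape_fixed(input_shapes)

try:
    rank(bias_shape_buggy(input_shapes))
    buggy_ok = True
except TypeError:
    # len() of an int raises TypeError, mirroring a crash in a shape consumer.
    buggy_ok = False
```

With these inputs the fixed variant yields a rank-1 shape, while the int variant raises as soon as its length is requested.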

Test results

Untiled (verified):

[loss 0] computed=2.267950  ref=2.267950  diff=0.000000
[loss 1] computed=2.498553  ref=2.498553  diff=0.000000
[loss 2] computed=2.083153  ref=2.083153  diff=0.000000
[loss 3] computed=1.905963  ref=1.905963  diff=0.000000
Errors: 0 out of 4
BENCH train_cycles=285250543 opt_cycles=429083 weight_sram=61956
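
The pass/fail count in the log above can be reproduced with a small check like the following. This is only a sketch: the helper name `count_loss_errors` is hypothetical, the 1e-3 tolerance is the one quoted in the commit messages of this PR, and the real test harness may compare values differently.

```python
import re

# Matches lines such as "[loss 0] computed=2.267950  ref=2.267950  diff=0.000000".
LOSS_LINE = re.compile(r"\[loss (\d+)\] computed=([-\d.eE+]+)\s+ref=([-\d.eE+]+)")


def count_loss_errors(log_text, tol=1e-3):
    """Return (errors, total) for the loss lines found in log_text."""
    errors = 0
    total = 0
    for match in LOSS_LINE.finditer(log_text):
        computed = float(match.group(2))
        ref = float(match.group(3))
        total += 1
        if abs(computed - ref) > tol:
            errors += 1
    return errors, total


log = """\
[loss 0] computed=2.267950  ref=2.267950  diff=0.000000
[loss 1] computed=2.498553  ref=2.498553  diff=0.000000
[loss 2] computed=2.083153  ref=2.083153  diff=0.000000
[loss 3] computed=1.905963  ref=1.905963  diff=0.000000
"""

errors, total = count_loss_errors(log)
print(f"Errors: {errors} out of {total}")
```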

Test plan

  • Untiled Siracusa training: PASS (4/4 loss exact match)
  • Tiled Siracusa training (L2 singlebuffer, l1=128000)
  • CI regression on existing models

🤖 Generated with Claude Code

@runwangdl runwangdl marked this pull request as draft May 18, 2026 08:55
@runwangdl runwangdl force-pushed the feat/speechnet-training branch 2 times, most recently from a74bfac to 95fef65 Compare May 18, 2026 13:30
runwangdl added 6 commits May 28, 2026 21:58
Add SpeechNet EMG silent speech recognition model (14 channels, 700
time samples, 9 classes, ~15K params) to the training test suite.

Changes:
- Add SpeechNet training ONNX artifacts (network.onnx, inputs.npz,
  outputs.npz, optimizer network) exported from Onnx4Deeploy with
  static reshape (no dynamic Shape/Flatten ops).
- Fix ConvLayer.computeShapes bias shape: wrap scalar int in tuple
  to prevent graphsurgeon export crash on Conv layers with bias.
- Register SpeechNet in L2 singlebuffer training test config
  (l1=128000, l2=2000000).

Untiled test verified: 4/4 loss diff=0.000000, 285M train cycles.

Freeze Block0 (first conv layer with large 14×701 activations) to
avoid tiling issues with its backward pass. Train Block1-4 + FC
(18 trainable params, 4 ConvGrad + 4 BatchNormGrad).

Tiled test verified: 4/4 loss diff < 0.001, 96M train cycles.

Block0 backward tiling hang is tracked separately — the 314 KB
activation tensor requires heavy L1 tiling that triggers a
simulation hang in the ConvGrad/AveragePoolGrad backward path.

Now that ConvGradX uses the naive kernel, full SpeechNet training
(all 5 blocks + FC, 22 trainable params) passes tiled simulation.

Step-by-step tutorial covering PyTorch model design, Onnx4Deeploy
export, untiled/tiled Deeploy deployment, tiling pipeline overview,
common pitfalls, and GVSoC trace debugging.
@runwangdl runwangdl force-pushed the feat/speechnet-training branch from 24bae6b to b185be6 Compare May 28, 2026 22:07
runwangdl added 2 commits May 29, 2026 00:42
…tile

ConvGradW accumulates the weight gradient across spatial (H/W) tiles via the
kernel's mm_add. The dW buffer must be zeroed exactly once per backward pass,
before the first spatial tile. The memset guard used `*tileIdxPtr == 0`, but
tileIdxPtr is a per-EXECUTION index (it selects the numTiles prefix-sum range
and is incremented after the whole tile loop), so it stays constant across an
execution's spatial tiles. The guard was therefore true for every tile of the
first execution, re-zeroing grad_weight on each tile and wiping the cross-tile
accumulation -- only the last tile's partial dW survived (~1/numTiles of the
true gradient).

Add a dedicated per-execution `dwZeroFlagPtr` (reset to 0 on every backward
pass, set to 1 after the first tile zeroes grad_weight) and guard the H/W-tiled
memset on it instead. grad_weight is now zeroed once and accumulated across all
spatial tiles.
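
The difference between the two guards can be illustrated with a small Python simulation of one execution's spatial-tile loop. This is a sketch under stated assumptions, not the generated kernel code: `accumulate_dw`, the scalar `grad_weight`, and the partial values are hypothetical, and only the names `tileIdxPtr` and `dwZeroFlagPtr` (modelled here as `tile_idx` and `dw_zero_flag`) come from the commit message.

```python
def accumulate_dw(tile_partials, guard):
    """Simulate one execution's spatial-tile loop for a single dW value."""
    tile_idx = 0         # per-execution index: constant during the tile loop
    dw_zero_flag = 0     # reset to 0 at the start of every backward pass
    grad_weight = 123.0  # stale value left over from an earlier pass

    for partial in tile_partials:
        if guard == "tile_idx":
            # Old guard: true for every tile of the first execution.
            zero_now = tile_idx == 0
        else:
            # New guard: true only until the first tile has zeroed the buffer.
            zero_now = dw_zero_flag == 0

        if zero_now:
            grad_weight = 0.0
            dw_zero_flag = 1

        grad_weight += partial  # stands in for the kernel's mm_add

    tile_idx += 1  # incremented only after the whole tile loop
    return grad_weight


partials = [1.0, 2.0, 3.0, 4.0]  # illustrative per-tile contributions
buggy = accumulate_dw(partials, "tile_idx")
fixed = accumulate_dw(partials, "zero_flag")
print(buggy, fixed)
```

With the old guard only the last tile's contribution survives, while the flag-based guard returns the sum over all tiles.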

Effect: SpeechNet tiled L2 training, which spatially tiles the wide block-0/1
convolutions, went from per-step loss drift up to 2.7e-3 on real EMG data
(exceeding the 1e-3 tolerance) to bit-exact (diff = 0.000000) against the
PyTorch/ORT reference. ResNet8 and CCT training regressions still pass.

Replace the random-normal input tensors in the SpeechNet training test
fixture with 4 real surface-EMG windows (14 channels x 700 samples, 1.4 s
@ 500 Hz) drawn from the public PulpBio/SilentWear dataset (subject S01,
vocalized, session 1), labels [7, 3, 2, 6]. Reference losses are
recomputed by ORT for these inputs.

With the ConvGradW spatial-tile dW-accumulation fix, the tiled L2
single-buffer training loss now matches the reference bit-exactly
(diff = 0.000000 across all 4 steps) on real EMG data, where it
previously drifted up to 2.7e-3 and exceeded the 1e-3 tolerance.