feat: add SpeechNet (SilentWear) training test by runwangdl · Pull Request #31 · runwangdl/TrainDeeploy

runwangdl · 2026-05-18T08:55:00Z

Summary

- Add SpeechNet EMG silent speech recognition model (14ch × 700 samples, 9 classes, ~15K params) to training test suite
- Fix ConvLayer.computeShapes bias shape bug: inputShapes[1][0] → (inputShapes[1][0],), which prevents a graphsurgeon export crash on Conv layers with bias
- Register SpeechNet in L2 singlebuffer training config (l1=128000, l2=2000000)
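
The bias-shape fix in the second item can be illustrated with a minimal sketch. It assumes, following the ONNX Conv convention, that `inputShapes[1]` is the weight shape with the output-channel count first. The helper names and the example shapes are hypothetical; this is not the Deeploy implementation.

```python
def bias_shape_buggy(input_shapes):
    # Returns a bare int (the output-channel count), which is not a shape.
    return input_shapes[1][0]


def bias_shape_fixed(input_shapes):
    # Wraps the output-channel count in a 1-tuple, giving a valid 1-D shape.
    return (input_shapes[1][0],)


# Hypothetical shapes: activation (N, C, H, W) and weight (C_out, C_in, kH, kW).
shapes = [(1, 1, 14, 700), (8, 1, 1, 4)]

buggy = bias_shape_buggy(shapes)
fixed = bias_shape_fixed(shapes)

# Any consumer that treats a shape as a sequence fails on the bare int.
try:
    len(buggy)
    buggy_is_sequence = True
except TypeError:
    buggy_is_sequence = False

print(buggy, buggy_is_sequence)
print(fixed, len(fixed))
```

Returning a tuple keeps the bias shape in the same form as every other tensor shape, so downstream export code does not need a special case for it.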

Test results

Untiled (verified):

[loss 0] computed=2.267950  ref=2.267950  diff=0.000000
[loss 1] computed=2.498553  ref=2.498553  diff=0.000000
[loss 2] computed=2.083153  ref=2.083153  diff=0.000000
[loss 3] computed=1.905963  ref=1.905963  diff=0.000000
Errors: 0 out of 4
BENCH train_cycles=285250543 opt_cycles=429083 weight_sram=61956
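
The per-step check behind the log above could look roughly like the following sketch. The function name is hypothetical, and the 1e-3 tolerance is taken from the commit messages further down; whether the real harness uses a strict or non-strict comparison is an assumption.

```python
TOLERANCE = 1e-3  # per-step tolerance mentioned in the commit messages


def count_loss_errors(computed, reference, tol=TOLERANCE):
    """Print one line per training step and return the number of mismatches."""
    errors = 0
    for step, (c, r) in enumerate(zip(computed, reference)):
        diff = abs(c - r)
        print(f"[loss {step}] computed={c:.6f}  ref={r:.6f}  diff={diff:.6f}")
        if diff > tol:  # assumed: strictly greater than the tolerance fails
            errors += 1
    print(f"Errors: {errors} out of {len(reference)}")
    return errors


# Loss values copied from the untiled log above.
computed = [2.267950, 2.498553, 2.083153, 1.905963]
reference = [2.267950, 2.498553, 2.083153, 1.905963]
errors = count_loss_errors(computed, reference)
```

A drift of 2.7e-3 on a single step, as reported later for the tiled run before the ConvGradW fix, would count as one error under this check.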

Test plan

- Untiled Siracusa training: PASS (4/4 loss exact match)
- Tiled Siracusa training (L2 singlebuffer, l1=128000)
- CI regression on existing models

🤖 Generated with Claude Code

Add SpeechNet EMG silent speech recognition model (14 channels, 700 time samples, 9 classes, ~15K params) to the training test suite.

Changes:
- Add SpeechNet training ONNX artifacts (network.onnx, inputs.npz, outputs.npz, optimizer network) exported from Onnx4Deeploy with static reshape (no dynamic Shape/Flatten ops).
- Fix ConvLayer.computeShapes bias shape: wrap scalar int in tuple to prevent graphsurgeon export crash on Conv layers with bias.
- Register SpeechNet in L2 singlebuffer training test config (l1=128000, l2=2000000).

Untiled test verified: 4/4 loss diff=0.000000, 285M train cycles.

Freeze Block0 (first conv layer with large 14×701 activations) to avoid tiling issues with its backward pass. Train Block1-4 + FC (18 trainable params, 4 ConvGrad + 4 BatchNormGrad).

Tiled test verified: 4/4 loss diff < 0.001, 96M train cycles.

Block0 backward tiling hang is tracked separately: the 314 KB activation tensor requires heavy L1 tiling that triggers a simulation hang in the ConvGrad/AveragePoolGrad backward path.
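
A back-of-the-envelope calculation shows why this tensor cannot sit in L1 in one piece. It rests on assumptions not stated in the commit: float32 activations, an 8 × 14 × 701 shape (the channel count of 8 is a guess, chosen because it reproduces the 314 KB figure), and `l1=128000` being a byte budget.

```python
import math

BYTES_PER_FLOAT32 = 4
L1_BUDGET_BYTES = 128000  # l1 value from the test config, assumed to be bytes

# Assumed activation shape: 8 channels x 14 x 701 (the channel count is a guess).
channels, height, width = 8, 14, 701

activation_bytes = channels * height * width * BYTES_PER_FLOAT32
min_tiles = math.ceil(activation_bytes / L1_BUDGET_BYTES)

print(activation_bytes)  # 314048
print(min_tiles)         # 3
```

This is only a lower bound. During the backward pass the input, the incoming gradient, and the outgoing gradient all compete for the same L1 budget, so the real tile count would be higher.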

Now that ConvGradX uses the naive kernel, full SpeechNet training (all 5 blocks + FC, 22 trainable params) passes tiled simulation.
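
The 22 versus 18 trainable-parameter counts are consistent with a simple tally, sketched below. The per-block split (4 trainable tensors per block, for example conv weight and bias plus BatchNorm scale and shift, and 2 for the FC layer) is an assumption inferred from the totals, not something the commits state.

```python
NUM_BLOCKS = 5
TENSORS_PER_BLOCK = 4  # assumed: conv weight/bias + BatchNorm scale/shift
FC_TENSORS = 2         # assumed: FC weight + bias


def trainable_tensors(frozen_blocks=()):
    """Count trainable tensors when the given block indices are frozen."""
    active = [b for b in range(NUM_BLOCKS) if b not in frozen_blocks]
    return len(active) * TENSORS_PER_BLOCK + FC_TENSORS


full = trainable_tensors()
block0_frozen = trainable_tensors(frozen_blocks=(0,))
print(full, block0_frozen)  # 22 18
```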

Step-by-step tutorial covering PyTorch model design, Onnx4Deeploy export, untiled/tiled Deeploy deployment, tiling pipeline overview, common pitfalls, and GVSoC trace debugging.

…tile ConvGradW accumulates the weight gradient across spatial (H/W) tiles via the kernel's mm_add. The dW buffer must be zeroed exactly once per backward pass, before the first spatial tile.

The memset guard used `*tileIdxPtr == 0`, but tileIdxPtr is a per-EXECUTION index (it selects the numTiles prefix-sum range and is incremented after the whole tile loop), so it stays constant across an execution's spatial tiles. The guard was therefore true for every tile of the first execution, re-zeroing grad_weight on each tile and wiping the cross-tile accumulation, so only the last tile's partial dW survived (~1/numTiles of the true gradient).

Add a dedicated per-execution `dwZeroFlagPtr` (reset to 0 on every backward pass, set to 1 after the first tile zeroes grad_weight) and guard the H/W-tiled memset on it instead. grad_weight is now zeroed once and accumulated across all spatial tiles.

Effect: SpeechNet tiled L2 training, which spatially tiles the wide block-0/1 convolutions, went from per-step loss drift up to 2.7e-3 on real EMG data (exceeding the 1e-3 tolerance) to bit-exact (diff = 0.000000) against the PyTorch/ORT reference. ResNet8 and CCT training regressions still pass.
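
The difference between the two guards can be reproduced with a toy model. The sketch below is plain Python, not the generated C kernel code: a scalar stands in for the grad_weight buffer, the function names are hypothetical, and `tile_idx` / `zero_flag` mirror `*tileIdxPtr` / `dwZeroFlagPtr` only in spirit.

```python
def accumulate_dw_buggy(partials):
    tile_idx = 0   # per-execution index: constant during the spatial tile loop
    dw = 123.0     # stale buffer contents from a previous pass
    for partial in partials:
        if tile_idx == 0:   # true for every tile of the first execution
            dw = 0.0        # re-zeroes the buffer and wipes earlier tiles
        dw += partial       # accumulate this tile's partial gradient
    tile_idx += 1           # only advanced after the whole tile loop
    return dw


def accumulate_dw_fixed(partials):
    zero_flag = 0  # reset to 0 at the start of every backward pass
    dw = 123.0     # stale buffer contents from a previous pass
    for partial in partials:
        if zero_flag == 0:  # true only for the first spatial tile
            dw = 0.0
            zero_flag = 1
        dw += partial       # accumulate this tile's partial gradient
    return dw


partials = [1.0, 2.0, 3.0, 4.0]        # partial dW from four spatial tiles
print(accumulate_dw_buggy(partials))   # 4.0: only the last tile survives
print(accumulate_dw_fixed(partials))   # 10.0: sum over all tiles
```

In the buggy variant the result depends only on the last tile, which matches the "~1/numTiles of the true gradient" description when the partial gradients are of similar magnitude.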

Replace the random-normal input tensors in the SpeechNet training test fixture with 4 real surface-EMG windows (14 channels x 700 samples, 1.4 s @ 500 Hz) drawn from the public PulpBio/SilentWear dataset (subject S01, vocalized, session 1), labels [7, 3, 2, 6]. Reference losses are recomputed by ORT for these inputs.

With the ConvGradW spatial-tile dW-accumulation fix, the tiled L2 single-buffer training loss now matches the reference bit-exactly (diff = 0.000000 across all 4 steps) on real EMG data, where it previously drifted up to 2.7e-3 and exceeded the 1e-3 tolerance.

runwangdl marked this pull request as draft May 18, 2026 08:55

runwangdl force-pushed the feat/speechnet-training branch 2 times, most recently from a74bfac to 95fef65 Compare May 18, 2026 13:30

runwangdl added 6 commits May 28, 2026 21:58

update(speechnet): full 22-trainable test data (all blocks)

a3cac0e

style: fix yapf/isort formatting

7321bcb

style: fix yapf formatting in codeGenerateTraining.py

4741b4c

docs: add SpeechNet on-device training tutorial notebook

b185be6

runwangdl force-pushed the feat/speechnet-training branch from 24bae6b to b185be6 Compare May 28, 2026 22:07

runwangdl added 2 commits May 29, 2026 00:42

runwangdl wants to merge 8 commits into devel from feat/speechnet-training
