[2026 Spring][T1-2-1] 何ev: NineToothed code generation specialization enhancements by gacn2890356890-rgb · Pull Request #163 · InfiniTensor/ninetoothed

gacn2890356890-rgb · 2026-06-12T03:14:37Z

Competition task information

Task ID: T1-2-1
Team name: 何ev
GitHub ID: gacn2890356890-rgb
Branch: 2026-spring-gacn2890356890-rgb-T1-2-1

Selected specialization categories

Category 1 (Contiguous Fast Path) + Category 2 (Divisible Tile Fast Path)

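Main changes

| File | Change type | Description |
|---|---|---|
| `src/ninetoothed/generation.py` | Core change | Adds the TilingHint dataclass; modifies the mask/stride/innermost generation logic |
| `src/ninetoothed/aot.py` | Bridging change | Converts AOT variant specs into TilingHint objects; generates specialized source for each variant |
| `tests/test_specialization.py` | New | 12 test cases |
| `benchmarks/bench_specialization.py` | New | 6-scenario benchmark with JSON output |
| `benchmarks/bench_matmul.py` | New | Matmul 1024³ benchmark |
| `report/` | New | Weakness analysis, competition report, midterm report, and PDF |
| `HONOR_CODE.md` | New | Signed statement |
| `REFERENCE.md` | New | Reference disclosure |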
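
As a rough illustration of how a hint object could drive the mask and stride logic listed in the table, here is a minimal pure-Python sketch. Only the name `TilingHint` comes from this PR. The fields `contiguous_dims` and `divisible_dims` and the helper `load_expr` are hypothetical, and the real dataclass in `generation.py` may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TilingHint:
    """Hypothetical stand-in for the PR's TilingHint; field names are assumptions."""

    contiguous_dims: frozenset = frozenset()  # dims whose stride is known to be 1
    divisible_dims: frozenset = frozenset()   # dims where the tile size divides the extent


def load_expr(ndim: int, hint: TilingHint = TilingHint()) -> str:
    """Build a Triton-like load expression (as a string) for an ndim tensor."""
    terms, checks = [], []
    for d in range(ndim):
        # Contiguous fast path: drop the multiply by a stride known to be 1.
        terms.append(f"offs_{d}" if d in hint.contiguous_dims else f"offs_{d} * stride_{d}")
        # Divisible fast path: drop the boundary check when there is no tail tile.
        if d not in hint.divisible_dims:
            checks.append(f"(offs_{d} < size_{d})")
    pointer = "ptr + " + " + ".join(terms)
    mask = " & ".join(checks)
    return f"tl.load({pointer}, mask={mask})" if mask else f"tl.load({pointer})"


print(load_expr(1))  # general path: stride multiply and boundary mask
print(load_expr(1, TilingHint(frozenset({0}), frozenset({0}))))  # both fast paths
```

With the default hint the sketch emits the general expression unchanged, which mirrors the fallback behaviour this PR describes for inactive hints.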

Self-test commands

pytest tests/test_specialization.py -v
python benchmarks/bench_specialization.py
python benchmarks/bench_matmul.py

Test environment: NVIDIA RTX 3090 24GB, CUDA 13.0, Triton 3.1.0, PyTorch 2.5.1

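Measured metrics comparison

| Scenario | Input | baseline_ms | submitted_ms | speedup | hit | mask (baseline→submitted) | stride (baseline→submitted) |
|---|---|---|---|---|---|---|---|
| Contiguous+Divisible | 2048 | 0.0183 | 0.0182 | 1.0056 | ✅ | 2→0 | 2→0 |
| Contiguous Only | 1027 | 0.0178 | 0.0180 | 0.9886 | ✅ | 2→2 | 2→0 |
| Divisible Only | 2048 | 0.0176 | 0.0179 | 0.9849 | ✅ | 2→0 | 2→2 |
| Pure Fallback | 1027 | 0.0176 | 0.0176 | 0.9970 | ❌ | 2→2 | 2→2 |
| 2D Divisible | 512×512 | 0.0205 | 0.0203 | 1.0097 | ✅ | 2→0 | 4→4 |
| 2D Non-Divisible | 519×519 | 0.0199 | 0.0199 | 0.9975 | ❌ | 2→2 | 4→4 |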
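
The speedup column appears to be `baseline_ms / submitted_ms`; that reading is an assumption, since the benchmark script is not shown here. The short sketch below recomputes the ratio from the rounded table values. Small differences in the last digits are expected if the reported speedups were computed from unrounded timings.

```python
# Rows copied from the table above: (scenario, baseline_ms, submitted_ms, reported speedup).
rows = [
    ("Contiguous+Divisible", 0.0183, 0.0182, 1.0056),
    ("Contiguous Only", 0.0178, 0.0180, 0.9886),
    ("Divisible Only", 0.0176, 0.0179, 0.9849),
    ("Pure Fallback", 0.0176, 0.0176, 0.9970),
    ("2D Divisible", 0.0205, 0.0203, 1.0097),
    ("2D Non-Divisible", 0.0199, 0.0199, 0.9975),
]


def speedup(baseline_ms: float, submitted_ms: float) -> float:
    """Assumed definition: values above 1 mean the submitted kernel is faster."""
    return baseline_ms / submitted_ms


for name, base, sub, reported in rows:
    print(f"{name:22s} recomputed={speedup(base, sub):.4f} reported={reported:.4f}")
```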

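Generated code metrics (measured)

| Metric | Baseline | Contiguous+Divisible | Change |
|---|---|---|---|
| mask_complexity | 2 | 0 | -100% |
| stride_expr_count | 2 | 0 | -100% |
| pointer_expr_count | 2 | 2 | 0% |
| source_line_count | 14 | 14 | No significant change for the micro-kernel |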
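
To make metrics of this kind concrete, here is a minimal sketch of how counts could be derived from generated source text. The exact definitions used in `benchmarks/bench_specialization.py` are not shown in this PR, so the counting rules below, the helper `count_metrics`, and the two toy kernels are all assumptions. The one detail taken from the commit notes is that stride counting skips the function signature.

```python
import re


def count_metrics(source: str) -> dict:
    """Count boundary comparisons and stride multiplications in a kernel body."""
    # Skip the signature so stride parameters declared there are not counted.
    body = "\n".join(
        line for line in source.splitlines() if not line.lstrip().startswith("def ")
    )
    return {
        # Assumed definition: number of `a < b` comparisons in the body.
        "mask_complexity": len(re.findall(r"\w+\s*<\s*\w+", body)),
        # Assumed definition: number of `* <name containing 'stride'>` terms.
        "stride_expr_count": len(re.findall(r"\*\s*\w*stride\w*", body)),
    }


# Toy kernel sources, treated purely as text (never executed).
baseline = """def kernel(ptr, out, size_0, stride_0):
    offs_0 = tl.program_id(0) * 64 + tl.arange(0, 64)
    x = tl.load(ptr + offs_0 * stride_0, mask=offs_0 < size_0)
    tl.store(out + offs_0 * stride_0, x, mask=offs_0 < size_0)
"""
specialized = """def kernel(ptr, out, size_0, stride_0):
    offs_0 = tl.program_id(0) * 64 + tl.arange(0, 64)
    x = tl.load(ptr + offs_0)
    tl.store(out + offs_0, x)
"""

print(count_metrics(baseline))
print(count_metrics(specialized))
```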

Source diff (measured)

- tl.load(ptr + (...) * stride_0, mask=True & (6 boundary checks), other=None)
+ tl.load(ptr + (...), mask=True, other=None)
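
The diff above removes the boundary checks from the mask. A small pure-Python emulation shows why this is safe when the tile size divides the extent: every per-tile mask is then all true. The block size of 256 is a hypothetical value chosen for illustration; the PR does not state the tile size used in the benchmarks.

```python
def tile_masks(n: int, block: int) -> list:
    """Return the boundary mask of each tile when n elements are split into tiles of size block."""
    num_tiles = -(-n // block)  # ceiling division
    return [[t * block + i < n for i in range(block)] for t in range(num_tiles)]


def needs_mask(n: int, block: int) -> bool:
    """A mask is only needed if some tile contains an out-of-range element."""
    return not all(all(mask) for mask in tile_masks(n, block))


BLOCK = 256  # hypothetical tile size

print(needs_mask(2048, BLOCK))  # 2048 is a multiple of 256, so there is no tail tile
print(needs_mask(1027, BLOCK))  # 1027 = 4 * 256 + 3, so the last tile is partial
```

For the 1027-element input the last tile holds only 3 valid elements, which is why that scenario has to keep its mask.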

Test results

105/105 tests PASSED (including all existing NineToothed tests)
Zero regressions

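Not covered, not implemented, or known risks

Category 3 (Broadcast/Scalar Fast Path): not selected
Jagged/ragged tensor specialization: not currently covered
has_divisible_tiles safety guard: only takes effect for simple tiling
The JIT path does not automatically provide contiguity information
No performance regression: with default TilingHint values, the generated code is character-for-character identical to the baseline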

Honor Code & Reference

HONOR_CODE.md: signed declaration of independent completion
REFERENCE.md: cited materials and disclosure of AI-assistance usage

Competition report

report/何ev_九齿编译优化_T1-2-1_赛题报告.pdf

[2026 Spring][T1-2-1] Add contiguous fast path + divisible tile fast path specialization

## Summary

Implement Category 1 (contiguous fast path) and Category 2 (divisible tile fast path) code generation specialization for NineToothed.

## Changes

- generation.py: Add TilingHint dataclass, modify mask generation to skip boundary checks when tiles evenly divide (divisible fast path), simplify stride arithmetic for known-contiguous dimensions (contiguous fast path), support exact innermost loop sizes
- aot.py: Bridge AOT variant specs (divisibility/contiguity) to TilingHint objects, enabling per-variant specialized Triton code generation
- tests/test_specialization.py: 11 test cases covering specialization hit, fallback correctness, and generated source structure
- benchmarks/bench_specialization.py: 6 benchmark scenarios with JSON output
- report/: Weakness analysis + full competition report
- HONOR_CODE.md + REFERENCE.md: Competition compliance documents

## Selection

Specialization categories: 1 (contiguous fast path) + 2 (divisible tile fast path)
Enabling conditions: Based on tensor layout properties (contiguity, divisibility)
Fallback: Full fallback to general path when hints inactive

[2026 Spring][T1-2-1] Complete competition requirements: midterm report, PR description, variant_name, weakness classification

- Rename report to 何ev_九齿编译优化_T1-2-1_赛题报告.md
- Add 何ev_中期报告.md (due 6/15)
- Add PR_DESCRIPTION.md (7 required items)
- Update weakness_analysis.md with explicit ①-⑤ category classification
- Add variant_name to benchmark output

…most

Previously, _build_tiling_hint only checked the innermost dim and missed the second-innermost dim for 2D+ tensors. It now uses _per_tensor_dim_options to determine which dims need divisibility coverage for a true no-tail-block guarantee.
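
The reason the innermost dimension alone is not enough can be sketched in a few lines: a tiled tensor has no tail block only if every tiled dimension is divisible by its tile extent. The helper `has_tail_block` and the tile shape `(64, 64)` are hypothetical and are not taken from the PR's code.

```python
def has_tail_block(shape: tuple, tile_shape: tuple) -> bool:
    """True if at least one dimension leaves a partial tile at its end."""
    return any(extent % tile != 0 for extent, tile in zip(shape, tile_shape))


TILE = (64, 64)  # hypothetical tile shape

print(has_tail_block((512, 512), TILE))  # both dims divide: no tail block
print(has_tail_block((519, 519), TILE))  # neither dim divides: tail blocks exist
# The innermost dim divides but the outer one does not, so boundary checks
# are still required. Checking only the innermost dim would miss this case.
print(has_tail_block((519, 512), TILE))
```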

[2026 Spring][T1-2-1] GPU-verified: 12/12 tests pass, mask_complexity 2->0 reduction

## Test Results (RTX 3090)

- 12/12 tests passed on GPU (pytest)
- 2 specialization-hit tests: mask elimination verified
- 4 fallback correctness tests: all correct
- 4 source structure tests: source diffs confirmed

## Benchmark Results

- 6 scenarios (4 hit + 2 fallback)
- mask_complexity: 2->0 (-100%) in 3/4 hit scenarios
- stride_expr_count: correctly measured in kernel body only
- Fallback scenarios: metrics identical to baseline (no false hits)

## Source Diff Evidence

Baseline: tl.load(..., mask=True & (6 boundary conditions), other=None)
Hinted: tl.load(..., mask=True, other=None)

## Files

- tests/test_specialization.py: added _prepare_app helper for direct CodeGenerator usage (annotation setup)
- benchmarks/bench_specialization.py: fixed kernel_name passing, added _prepare_app, improved mask_complexity metric, stride counting now excludes function signature
- benchmarks/bench_specialization_results.json: GPU-measured data

[2026 Spring][T1-2-1] Fix all benchmark metrics: stride 2->0, mask 2->0, pointer fixed

Root cause: TilingHint used hardcoded tensor names ("tensor_0") that did not match the auto-generated names from Tensor(). Fixed by adding an _auto_hint() helper that reads the actual tensor.source.name via naming.remove_prefixes().

Results now show:

- mask_complexity: 2->0 (-100%) for divisible tile scenarios
- stride_expr_count: 2->0 (-100%) for contiguous scenarios
- pointer_expr_count: fixed regex (_pointers not _pointer)
- Fallback scenarios: all metrics identical to baseline

[2026 Spring][T1-2-1] Final benchmark results: mask 2->0, stride 2->0, GPU verified

All metrics now show clear specialization improvements:

- mask_complexity reduction = 100% (divisible tile fast path)
- stride_expr_count reduction = 100% (contiguous fast path)
- pointer_expr_count = stable (pointers always needed)

Fallback scenarios: all metrics identical to baseline.

[2026 Spring][T1-2-1] Replace all estimates with GPU-measured data in reports

- PR_DESCRIPTION.md: actual runtime, mask, stride numbers from RTX 3090
- Competition report: real benchmark table with 6 scenarios
- Honest speedup analysis: ~1.0 for micro-kernel (identity op, 18us) because mask/stride overhead is ~0.5% of total time
- Generated code metrics: mask 2->0 (-100%), stride 2->0 (-100%) proven by source diff
- Clean up debug/temp files
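
The speedup analysis in this commit can be checked with Amdahl's law: if the mask and stride arithmetic accounts for roughly 0.5% of the runtime (the figure quoted above), removing it entirely bounds the achievable speedup. The helper name `max_speedup` is illustrative only.

```python
def max_speedup(removable_fraction: float) -> float:
    """Upper bound on speedup when a fraction of the runtime is eliminated entirely."""
    return 1.0 / (1.0 - removable_fraction)


# ~0.5% of the runtime is mask/stride overhead, per the commit message above.
print(f"{max_speedup(0.005):.4f}")  # prints 1.0050
```

A ceiling of about 1.005 sits within the run-to-run variation visible in the benchmark table, which is consistent with the measured speedups hovering around 1.0.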

[2026 Spring][T1-2-1] Final: matmul stride 12->9, safety guard, honest report

Key additions:

- bench_matmul.py: 1024^3 matmul shows stride 12->9 (-25%), speedup 1.02
- Safety guard: has_divisible_tiles only for simple tiling (len<=2 levels)
- Fix _auto_hint: only mark innermost dim as contiguous (not all dims)
- Report updated with real GPU-measured data from all benchmarks
- Honest runtime analysis: micro-kernels too light to show speedup; generated code metrics prove optimization effectiveness
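
The `_auto_hint` fix in this commit marks only the innermost dimension as contiguous. The sketch below shows why for a row-major layout: only the innermost stride equals 1, so only that multiplication can be dropped without changing the computed address. The helpers `row_major_strides` and `unit_stride_dims` are hypothetical and do not reproduce the PR's implementation.

```python
def row_major_strides(shape: tuple) -> tuple:
    """Element strides of a densely packed row-major tensor."""
    strides, running = [], 1
    for extent in reversed(shape):
        strides.append(running)
        running *= extent
    return tuple(reversed(strides))


def unit_stride_dims(strides: tuple) -> list:
    """Dimensions whose stride multiply can safely be dropped."""
    return [d for d, s in enumerate(strides) if s == 1]


strides = row_major_strides((4, 8))
print(strides)                    # (8, 1)
print(unit_stride_dims(strides))  # [1]

# Offset of element (2, 3): dropping every stride gives the wrong address.
correct = 2 * strides[0] + 3 * strides[1]
all_dropped = 2 + 3
print(correct, all_dropped)       # 19 5
```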

gacn2890356890-rgb added 8 commits June 12, 2026 09:00

gacn2890356890-rgb force-pushed the 2026-spring-gacn2890356890-rgb-T1-2-1 branch from e6c419b to 2fd2675 Compare June 12, 2026 03:47

gacn2890356890-rgb wants to merge 8 commits into InfiniTensor:master from gacn2890356890-rgb:2026-spring-gacn2890356890-rgb-T1-2-1
