[2026 Spring][T1-2-1] 何ev: NineToothed Code Generation Specialization Enhancements #163
Open
gacn2890356890-rgb wants to merge 8 commits into
[2026 Spring][T1-2-1] Add contiguous fast path + divisible tile fast path specialization

## Summary
Implement Category 1 (contiguous fast path) and Category 2 (divisible tile fast path) code generation specialization for NineToothed.

## Changes
- generation.py: Add TilingHint dataclass; modify mask generation to skip boundary checks when tiles evenly divide (divisible fast path); simplify stride arithmetic for known-contiguous dimensions (contiguous fast path); support exact innermost loop sizes
- aot.py: Bridge AOT variant specs (divisibility/contiguity) to TilingHint objects, enabling per-variant specialized Triton code generation
- tests/test_specialization.py: 11 test cases covering specialization hit, fallback correctness, and generated source structure
- benchmarks/bench_specialization.py: 6 benchmark scenarios with JSON output
- report/: Weakness analysis + full competition report
- HONOR_CODE.md + REFERENCE.md: Competition compliance documents

## Selection
- Specialization categories: 1 (contiguous fast path) + 2 (divisible tile fast path)
- Enabling conditions: Based on tensor layout properties (contiguity, divisibility)
- Fallback: Full fallback to general path when hints inactive
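The divisible tile fast path described in this commit can be illustrated with a small pure-Python sketch. It is not the PR's code: `tile_mask` is a hypothetical helper, and a one-dimensional layout is assumed for brevity. The idea is that a boundary mask is only needed when the tile size does not evenly divide the dimension, because only then does a partial tail block exist.

```python
from typing import List, Optional


def tile_mask(size: int, tile: int, block: int) -> Optional[List[bool]]:
    """Return the validity mask for one tile, or None if no mask is needed.

    Hypothetical helper for illustration only; names are not from the PR.
    """
    if size % tile == 0:
        # Divisible fast path: every tile is full, so no boundary check.
        return None
    start = block * tile
    # General path: an element is valid only if its index is in range.
    return [start + i < size for i in range(tile)]


# 1024 elements with tiles of 128: no tail block, so no mask at all.
no_mask = tile_mask(1024, 128, 7)

# 1000 elements with tiles of 128: the last tile is only partially valid.
tail_mask = tile_mask(1000, 128, 7)
valid_in_tail = sum(tail_mask)
```

In the second call the last tile starts at index 896, so only the first 104 of its 128 lanes are in range, which is why the general path has to keep the mask.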
…most

Previously, _build_tiling_hint checked only the innermost dimension and missed the second-innermost dimension for tensors with two or more dimensions. It now uses _per_tensor_dim_options to determine which dimensions need divisibility coverage, so that the no-tail-block guarantee actually holds.
[2026 Spring][T1-2-1] GPU-verified: 12/12 tests pass, mask_complexity 2->0 reduction

## Test Results (RTX 3090)
- 12/12 tests passed on GPU (pytest)
- 2 specialization-hit tests: mask elimination verified
- 4 fallback correctness tests: all correct
- 4 source structure tests: source diffs confirmed

## Benchmark Results
- 6 scenarios (4 hit + 2 fallback)
- mask_complexity: 2->0 (-100%) in 3/4 hit scenarios
- stride_expr_count: correctly measured in kernel body only
- Fallback scenarios: metrics identical to baseline (no false hits)

## Source Diff Evidence
- Baseline: tl.load(..., mask=True & (6 boundary conditions), other=None)
- Hinted: tl.load(..., mask=True, other=None)

## Files
- tests/test_specialization.py: added _prepare_app helper for direct CodeGenerator usage (annotation setup)
- benchmarks/bench_specialization.py: fixed kernel_name passing, added _prepare_app, improved mask_complexity metric, stride counting now excludes function signature
- benchmarks/bench_specialization_results.json: GPU-measured data
[2026 Spring][T1-2-1] Fix all benchmark metrics: stride 2->0, mask 2->0, pointer fixed
Root cause: TilingHint used hardcoded tensor names ("tensor_0") that
did not match auto-generated names from Tensor(). Fixed by adding
_auto_hint() helper that reads actual tensor.source.name via
naming.remove_prefixes().
Results now show:
- mask_complexity: 2->0 (-100%) for divisible tile scenarios
- stride_expr_count: 2->0 (-100%) for contiguous scenarios
- pointer_expr_count: fixed regex (_pointers not _pointer)
- Fallback scenarios: all metrics identical to baseline
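The root cause above, hints keyed by names that never match, can be sketched as follows. This is an illustrative model rather than the PR's implementation: `select_path` and the tensor names are hypothetical, and it assumes that a missing hint silently routes a tensor to the general path, which would explain why the metrics originally showed no change.

```python
def select_path(hints: dict, tensor_names: list) -> str:
    """Pick the code path; hypothetical helper, not part of NineToothed.

    The specialized path is taken only if every tensor has a matching hint.
    """
    if all(name in hints for name in tensor_names):
        return "specialized"
    return "general"


# Made-up auto-generated names; the real names depend on the library.
actual_names = ["tensor_3", "tensor_4"]

# Hardcoded keys that do not match the actual names: no hint is ever found.
hardcoded_hints = {"tensor_0": {"contiguous": True}, "tensor_1": {"contiguous": True}}

# Keys derived from the actual names: every lookup succeeds.
derived_hints = {name: {"contiguous": True} for name in actual_names}

before_fix = select_path(hardcoded_hints, actual_names)
after_fix = select_path(derived_hints, actual_names)
```

Under these assumptions `before_fix` is the general path and `after_fix` is the specialized one, which mirrors the benchmark metrics moving only after the names were read from the tensors themselves.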
[2026 Spring][T1-2-1] Final benchmark results: mask 2->0, stride 2->0, GPU verified

All metrics now show clear specialization improvements:
- mask_complexity reduction = 100% (divisible tile fast path)
- stride_expr_count reduction = 100% (contiguous fast path)
- pointer_expr_count = stable (pointers always needed)

Fallback scenarios: all metrics identical to baseline.
[2026 Spring][T1-2-1] Replace all estimates with GPU-measured data in reports

- PR_DESCRIPTION.md: actual runtime, mask, stride numbers from RTX 3090
- Competition report: real benchmark table with 6 scenarios
- Honest speedup analysis: ~1.0 for micro-kernel (identity op, 18us) because mask/stride overhead is ~0.5% of total time
- Generated code metrics: mask 2->0 (-100%), stride 2->0 (-100%) proven by source diff
- Clean up debug/temp files
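The speedup of about 1.0 follows from simple arithmetic. The sketch below assumes the roughly 0.5% figure is the fraction of total runtime spent on mask and stride work and that specialization removes all of it; `max_speedup` is a hypothetical helper applying the usual Amdahl-style bound.

```python
def max_speedup(fraction_removed: float) -> float:
    """Upper bound on speedup when a fraction of the runtime is eliminated."""
    if not 0.0 <= fraction_removed < 1.0:
        raise ValueError("fraction_removed must be in [0, 1)")
    return 1.0 / (1.0 - fraction_removed)


overhead_fraction = 0.005  # ~0.5% of total time, as reported above
kernel_time_us = 18.0      # identity micro-kernel runtime, as reported above

bound = max_speedup(overhead_fraction)
saved_us = kernel_time_us * overhead_fraction

print(f"{bound:.4f}")  # 1.0050
```

Under this reading the best case saves about 0.09 us out of 18 us, a difference that would be hard to tell apart from measurement noise.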
[2026 Spring][T1-2-1] Final: matmul stride 12->9, safety guard, honest report

Key additions:
- bench_matmul.py: 1024^3 matmul shows stride 12->9 (-25%), speedup 1.02
- Safety guard: has_divisible_tiles only for simple tiling (len<=2 levels)
- Fix _auto_hint: only mark innermost dim as contiguous (not all dims)
- Report updated with real GPU-measured data from all benchmarks
- Honest runtime analysis: micro-kernels too light to show speedup; generated code metrics prove optimization effectiveness
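The restriction to the innermost dimension can be motivated with a small sketch of stride arithmetic. None of these helpers come from the PR; they assume a row-major layout and show that dropping a stride multiplication is only safe for a dimension whose stride is known to be 1.

```python
def row_major_strides(shape):
    """Element strides of a densely packed row-major tensor."""
    strides, acc = [], 1
    for size in reversed(shape):
        strides.append(acc)
        acc *= size
    return tuple(reversed(strides))


def offset_general(index, strides):
    """General path: one multiplication per dimension."""
    return sum(i * s for i, s in zip(index, strides))


def offset_fast(index, strides):
    """Contiguous fast path: valid only when strides[-1] == 1."""
    *outer, last = index
    # zip stops at the shorter sequence, so the innermost stride is skipped.
    return sum(i * s for i, s in zip(outer, strides)) + last


strides = row_major_strides((4, 8))   # (8, 1): innermost stride is 1
same = offset_general((2, 3), strides) == offset_fast((2, 3), strides)

# A transposed view has innermost stride 8, so the fast path is wrong there.
transposed = (1, 8)
differs = offset_general((2, 3), transposed) != offset_fast((2, 3), transposed)
```

For the row-major case both paths give offset 19, while for the transposed strides the fast path gives 5 instead of 26, so a hint that marked every dimension as contiguous could produce incorrect addresses.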
e6c419b to 2fd2675
Competition info
2026-spring-gacn2890356890-rgb-T1-2-1
Selected specialization categories
Category 1 (Contiguous Fast Path) + Category 2 (Divisible Tile Fast Path)
Main changes
- src/ninetoothed/generation.py
- src/ninetoothed/aot.py
- tests/test_specialization.py
- benchmarks/bench_specialization.py
- benchmarks/bench_matmul.py
- report/
- HONOR_CODE.md
- REFERENCE.md
Self-test commands
Measured environment: NVIDIA RTX 3090 24 GB, CUDA 13.0, Triton 3.1.0, PyTorch 2.5.1
Measured metrics comparison
Generated code metrics (measured)
Source diff (measured)
Test results
Not covered, not implemented, or known risks
Honor Code & Reference
- HONOR_CODE.md: signed declaration of independent completion
- REFERENCE.md: cited materials and use of AI assistance
Competition report
report/何ev_九齿编译优化_T1-2-1_赛题报告.pdf