
[2026 Spring][T1-2-1] 何ev: NineToothed code generation specialization enhancement #163

Open
gacn2890356890-rgb wants to merge 8 commits into InfiniTensor:master from gacn2890356890-rgb:2026-spring-gacn2890356890-rgb-T1-2-1

@gacn2890356890-rgb gacn2890356890-rgb commented Jun 12, 2026

Competition information

  • Task ID: T1-2-1
  • Team name: 何ev
  • GitHub ID: gacn2890356890-rgb
  • Branch: 2026-spring-gacn2890356890-rgb-T1-2-1

Selected specialization categories

Category 1 (Contiguous Fast Path) + Category 2 (Divisible Tile Fast Path)

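Main changes

| File | Change type | Description |
| --- | --- | --- |
| src/ninetoothed/generation.py | Core change | Adds the TilingHint dataclass; modifies the mask/stride/innermost generation logic |
| src/ninetoothed/aot.py | Bridge change | Converts AOT variant specs into TilingHint objects; generates specialized source for each variant |
| tests/test_specialization.py | New | 12 test cases |
| benchmarks/bench_specialization.py | New | 6-scenario benchmark with JSON output |
| benchmarks/bench_matmul.py | New | Matmul 1024³ benchmark |
| report/ | New | Weakness analysis + competition report + midterm report + PDF |
| HONOR_CODE.md | New | Signed statement |
| REFERENCE.md | New | Reference disclosure |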

Self-test commands

pytest tests/test_specialization.py -v
python benchmarks/bench_specialization.py
python benchmarks/bench_matmul.py

Measurement environment: NVIDIA RTX 3090 24GB, CUDA 13.0, Triton 3.1.0, PyTorch 2.5.1

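Measured metrics comparison

| Scenario | Input | baseline_ms | submitted_ms | speedup | hit | mask B→S | stride B→S |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Contiguous+Divisible | 2048 | 0.0183 | 0.0182 | 1.0056 | yes | 2→0 | 2→0 |
| Contiguous Only | 1027 | 0.0178 | 0.0180 | 0.9886 | yes | 2→2 | 2→0 |
| Divisible Only | 2048 | 0.0176 | 0.0179 | 0.9849 | yes | 2→0 | 2→2 |
| Pure Fallback | 1027 | 0.0176 | 0.0176 | 0.9970 | no | 2→2 | 2→2 |
| 2D Divisible | 512×512 | 0.0205 | 0.0203 | 1.0097 | yes | 2→0 | 4→4 |
| 2D Non-Divisible | 519×519 | 0.0199 | 0.0199 | 0.9975 | no | 2→2 | 4→4 |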

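Generated code metrics (measured)

| Metric | Baseline | Contiguous+Divisible | Change |
| --- | --- | --- | --- |
| mask_complexity | 2 | 0 | -100% |
| stride_expr_count | 2 | 0 | -100% |
| pointer_expr_count | 2 | 2 | 0% |
| source_line_count | 14 | 14 | No significant change for the micro-kernel |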

Source diff (measured)

- tl.load(ptr + (...) * stride_0, mask=True & (6 boundary checks), other=None)
+ tl.load(ptr + (...), mask=True, other=None)
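The diff above can be reproduced in miniature. The following is a minimal sketch, not the PR's code: the `TilingHint` fields, the `emit_load` helper, and the names `offsets`, `size_0`, and `stride_0` are all assumptions made for illustration. It shows how a hint could switch a string-based generator between the general load expression and the specialized one, and that a default hint leaves the output unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TilingHint:
    """Hypothetical stand-in for the PR's TilingHint; field names are assumed."""

    contiguous: bool = False  # innermost stride is known to be 1
    divisible: bool = False  # every tile evenly divides its dimension


def emit_load(hint: TilingHint = TilingHint()) -> str:
    """Return a Triton-style load expression as a string."""
    # Contiguous fast path: drop the multiplication by the innermost stride.
    offset = "offsets" if hint.contiguous else "offsets * stride_0"
    # Divisible fast path: drop the boundary check from the mask.
    mask = "True" if hint.divisible else "True & (offsets < size_0)"
    return f"tl.load(ptr + {offset}, mask={mask}, other=None)"


baseline = emit_load()
specialized = emit_load(TilingHint(contiguous=True, divisible=True))
print(baseline)
print(specialized)
```

Keeping the default hint equal to "no information" is what lets the general path stay byte-for-byte identical when no specialization applies.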

Test results

  • 105/105 tests PASSED (including all existing NineToothed tests)
  • Zero regressions

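Not covered, not implemented, or known risks

  • Category 3 (Broadcast/Scalar Fast Path): not selected
  • Jagged/ragged tensor specialization: not currently covered
  • has_divisible_tiles safety guard: only takes effect for simple tiling
  • The JIT path does not automatically provide contiguity information
  • No performance regression: with default TilingHint values, the generated code is character-for-character identical to the baseline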

Honor Code & Reference

  • HONOR_CODE.md: signed statement of independent completion
  • REFERENCE.md: referenced materials and disclosure of AI-assisted usage

Competition report

  • report/何ev_九齿编译优化_T1-2-1_赛题报告.pdf

[2026 Spring][T1-2-1] Add contiguous fast path + divisible tile fast path specialization

## Summary
Implement Category 1 (contiguous fast path) and Category 2 (divisible tile
fast path) code generation specialization for NineToothed.

## Changes
- generation.py: Add TilingHint dataclass, modify mask generation to skip
  boundary checks when tiles evenly divide (divisible fast path), simplify
  stride arithmetic for known-contiguous dimensions (contiguous fast path),
  support exact innermost loop sizes
- aot.py: Bridge AOT variant specs (divisibility/contiguity) to TilingHint
  objects, enabling per-variant specialized Triton code generation
- tests/test_specialization.py: 11 test cases covering specialization hit,
  fallback correctness, and generated source structure
- benchmarks/bench_specialization.py: 6 benchmark scenarios with JSON output
- report/: Weakness analysis + full competition report
- HONOR_CODE.md + REFERENCE.md: Competition compliance documents

## Selection
Specialization categories: 1 (contiguous fast path) + 2 (divisible tile fast path)
Enabling conditions: Based on tensor layout properties (contiguity, divisibility)
Fallback: Full fallback to general path when hints inactive

[2026 Spring][T1-2-1] Complete competition requirements: midterm report, PR description, variant_name, weakness classification

- Rename report to 何ev_九齿编译优化_T1-2-1_赛题报告.md
- Add 何ev_中期报告.md (due 6/15)
- Add PR_DESCRIPTION.md (7 required items)
- Update weakness_analysis.md with explicit ①-⑤ category classification
- Add variant_name to benchmark output

…most

Previously, _build_tiling_hint checked only the innermost dim and missed
the second-innermost dim for 2D+ tensors. It now uses
_per_tensor_dim_options to determine which dims need divisibility
coverage for a true no-tail-block guarantee.

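The difference between the two checks can be sketched in a few lines. This is an illustrative reconstruction rather than the code in `_build_tiling_hint`: the function names and the block size of 64 are assumptions. A tensor that is divisible along its innermost dimension can still leave a tail block along an outer one, so the mask may only be dropped when every tiled dimension divides evenly.

```python
from typing import Sequence


def innermost_only_check(shape: Sequence[int], tile: Sequence[int]) -> bool:
    """The incomplete check: looks at the last dimension only."""
    return shape[-1] % tile[-1] == 0


def has_tail_block(shape: Sequence[int], tile: Sequence[int]) -> bool:
    """True if any tiled dimension leaves a partial (tail) tile."""
    return any(size % block != 0 for size, block in zip(shape, tile))


# Hypothetical 64x64 tile; 519 = 8 * 64 + 7, so rows leave a tail block.
shape, tile = (519, 512), (64, 64)
print(innermost_only_check(shape, tile))  # passes, although a tail exists
print(has_tail_block(shape, tile))
```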
[2026 Spring][T1-2-1] GPU-verified: 12/12 tests pass, mask_complexity 2->0 reduction

## Test Results (RTX 3090)
- 12/12 tests passed on GPU (pytest)
- 2 specialization-hit tests: mask elimination verified
- 4 fallback correctness tests: all correct
- 4 source structure tests: source diffs confirmed

## Benchmark Results
- 6 scenarios (4 hit + 2 fallback)
- mask_complexity: 2->0 (-100%) in 3/4 hit scenarios
- stride_expr_count: correctly measured in kernel body only
- Fallback scenarios: metrics identical to baseline (no false hits)

## Source Diff Evidence
Baseline: tl.load(..., mask=True & (6 boundary conditions), other=None)
Hinted:  tl.load(..., mask=True, other=None)

## Files
- tests/test_specialization.py: added _prepare_app helper for
  direct CodeGenerator usage (annotation setup)
- benchmarks/bench_specialization.py: fixed kernel_name passing,
  added _prepare_app, improved mask_complexity metric,
  stride counting now excludes function signature
- benchmarks/bench_specialization_results.json: GPU-measured data
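Counting stride expressions in the kernel body only matters because every stride also appears once as a parameter in the signature. A minimal sketch of such a metric, assuming the generated source is a single function and that strides are named `stride_<n>` (both assumptions; the benchmark's actual regex and the sample kernel text below are not taken from the PR):

```python
import ast
import re

SOURCE = '''\
def kernel(ptr, out_ptr, size_0, stride_0, BLOCK: tl.constexpr):
    offsets = tl.arange(0, BLOCK)
    x = tl.load(ptr + offsets * stride_0, mask=offsets < size_0)
    tl.store(out_ptr + offsets * stride_0, x, mask=offsets < size_0)
'''

STRIDE = re.compile(r"\bstride_\d+\b")


def kernel_body(source: str) -> str:
    """Return the source starting at the first statement of the function body."""
    func = ast.parse(source).body[0]  # syntax-only parse; nothing is executed
    first_stmt_line = func.body[0].lineno  # 1-indexed
    return "\n".join(source.splitlines()[first_stmt_line - 1:])


def count_stride_exprs(source: str) -> int:
    """Count stride references, ignoring the function signature."""
    return len(STRIDE.findall(kernel_body(source)))


naive_count = len(STRIDE.findall(SOURCE))
body_count = count_stride_exprs(SOURCE)
print(naive_count, body_count)
```

Using `ast` to find the first body statement keeps the split correct even when the signature spans several lines.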

[2026 Spring][T1-2-1] Fix all benchmark metrics: stride 2->0, mask 2->0, pointer fixed

Root cause: TilingHint used hardcoded tensor names ("tensor_0") that
did not match auto-generated names from Tensor(). Fixed by adding
_auto_hint() helper that reads actual tensor.source.name via
naming.remove_prefixes().

Results now show:
- mask_complexity: 2->0 (-100%) for divisible tile scenarios
- stride_expr_count: 2->0 (-100%) for contiguous scenarios
- pointer_expr_count: fixed regex (_pointers not _pointer)
- Fallback scenarios: all metrics identical to baseline

[2026 Spring][T1-2-1] Final benchmark results: mask 2->0, stride 2->0, GPU verified

All metrics now show clear specialization improvements:

mask_complexity reduction = 100% (divisible tile fast path)
stride_expr_count reduction = 100% (contiguous fast path)
pointer_expr_count = stable (pointers always needed)

Fallback scenarios: all metrics identical to baseline.

[2026 Spring][T1-2-1] Replace all estimates with GPU-measured data in reports

- PR_DESCRIPTION.md: actual runtime, mask, stride numbers from RTX 3090
- Competition report: real benchmark table with 6 scenarios
- Honest speedup analysis: ~1.0 for micro-kernel (identity op, 18us)
  because mask/stride overhead is ~0.5% of total time
- Generated code metrics: mask 2->0 (-100%), stride 2->0 (-100%)
  proven by source diff
- Clean up debug/temp files
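The speedup analysis in this commit follows from Amdahl's law: if a fraction f of the runtime is removed entirely, the speedup is at most 1 / (1 - f). A short worked calculation, taking the commit's approximate figures (18 us total, about 0.5% mask/stride overhead) as given:

```python
def max_speedup(removed_fraction: float) -> float:
    """Upper bound on speedup when a fraction of the runtime is eliminated."""
    if not 0.0 <= removed_fraction < 1.0:
        raise ValueError("removed_fraction must be in [0, 1)")
    return 1.0 / (1.0 - removed_fraction)


total_us = 18.0  # approximate kernel time quoted in the commit message
overhead_fraction = 0.005  # approximate mask/stride share quoted there

bound = max_speedup(overhead_fraction)
saved_us = total_us * overhead_fraction
print(f"upper bound on speedup: {bound:.4f}")
print(f"time saved at best: {saved_us:.2f} us")
```

With a bound of roughly 1.005 and at most about 0.09 us to save, ratios close to 1.0 are what one would expect for this identity kernel.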

[2026 Spring][T1-2-1] Final: matmul stride 12->9, safety guard, honest report

Key additions:
- bench_matmul.py: 1024^3 matmul shows stride 12->9 (-25%), speedup 1.02
- Safety guard: has_divisible_tiles only for simple tiling (len<=2 levels)
- Fix _auto_hint: only mark innermost dim as contiguous (not all dims)
- Report updated with real GPU-measured data from all benchmarks
- Honest runtime analysis: micro-kernels too light to show speedup;
  generated code metrics prove optimization effectiveness
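The safety guard mentioned above can be read as "claim divisibility only when the tiling is shallow enough to reason about". The sketch below is one possible interpretation, not the PR's implementation: it assumes that "levels" means nested tile shapes, each tiling the previous one, and the function signature is invented for illustration.

```python
from typing import Sequence


def has_divisible_tiles(
    shape: Sequence[int],
    tile_levels: Sequence[Sequence[int]],
    max_levels: int = 2,
) -> bool:
    """Conservative guard: report divisibility only for shallow tilings."""
    if len(tile_levels) > max_levels:
        return False  # too deep to verify here; keep the boundary masks
    extent = list(shape)
    for tile in tile_levels:
        if any(size % block != 0 for size, block in zip(extent, tile)):
            return False
        extent = list(tile)  # the next level tiles this tile
    return True


print(has_divisible_tiles((1024, 1024), [(64, 64), (16, 16)]))
print(has_divisible_tiles((1024, 1024), [(64, 64), (16, 16), (4, 4)]))
print(has_divisible_tiles((519, 519), [(64, 64)]))
```

Returning `False` when unsure is the safe direction, since it only costs a mask that was already present in the general path.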

gacn2890356890-rgb force-pushed the 2026-spring-gacn2890356890-rgb-T1-2-1 branch from e6c419b to 2fd2675 on June 12, 2026 03:47