kernel-bench: seed sm90/sm100 ground truth (bootstrap for sglang#28138) #21

Open
BBuf wants to merge 2 commits into sgl-project:main from BBuf:kernel-bench-gt-bootstrap

@BBuf BBuf commented Jun 15, 2026

Contributor

What

Seeds the kernel-benchmark regression ground truth under kernel-bench/ so the per-PR gate in sgl-project/sglang#28138 ([CI] Kernel benchmark regression gate) has a baseline to pull and compare against. This is a one-time bootstrap: the nightly-kernel-bench-gt.yml workflow will take over once #28138 lands on main (the workflow can't be triggered via workflow_dispatch until it's on the default branch).

  • kernel-bench/sm90.json — generated on NVIDIA H100 80GB HBM3 (sm90 / cc 9.0)
  • kernel-bench/sm100.json — generated on NVIDIA B200 (sm100 / cc 10.0)

Provenance

  • Generated from sglang branch feat/kernel-bench-regression-ci @ cf8dbf44d9 (PR #28138, freshly rebased on main).
  • Command: python3 -m kernel_bench_regression generate --repeat 5 --commit cf8dbf44d9.
  • Each SKU was measured on its native hardware (sm90 on a real H100, sm100 on a real B200), so the numbers are faithful baselines for the gate's H100/B200 runners, with no cross-class extrapolation. Both runs used torch 2.11.0+cu130. sm100 reused the container's prebuilt sgl_kernel; sm90 built sgl_kernel from the PR branch because the container's preinstalled 0.4.2.post1 predates the new dsv4_fused_q_norm_rope op.
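
Since each file is tied to one hardware class, a runner can pick its baseline from the device's CUDA compute capability. The following is only a sketch under stated assumptions: `gt_file_for_capability` is a hypothetical helper, not something defined in #28138, and the capability tuple is assumed to come from a call such as `torch.cuda.get_device_capability()`.

```python
# Hypothetical mapping from (major, minor) compute capability to the
# ground-truth files seeded in this PR. The helper name is illustrative.
GT_FILES = {
    (9, 0): "kernel-bench/sm90.json",    # H100
    (10, 0): "kernel-bench/sm100.json",  # B200
}


def gt_file_for_capability(capability):
    """Return the ground-truth path for a capability tuple, or None if unseeded."""
    return GT_FILES.get(tuple(capability))


print(gt_file_for_capability((9, 0)))   # kernel-bench/sm90.json
print(gt_file_for_capability((8, 0)))   # None
```

Returning None for an unseeded capability keeps the "no cross-class extrapolation" rule explicit: a runner without a native baseline gets no baseline at all.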

Coverage / known gaps

sm90: 13 cases, 34/37 real measurements. cutlass_mla_decode correctly auto-skipped (sm100-only).
sm100: 14 cases, 30/40 real measurements. cutlass_mla_decode ran.
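
The "real measurements" ratios above are counts of non-null entries over all entries. A minimal sketch of that tally, assuming a hypothetical file layout of `{"cases": {case_name: {shape_key: microseconds_or_null}}}` (the actual schema of sm90.json and sm100.json is not shown here, and the sample data is made up):

```python
import json


def coverage(ground_truth):
    """Return (real, total) measurement counts; null (None) entries are not real."""
    real = total = 0
    for measurements in ground_truth["cases"].values():
        for value in measurements.values():
            total += 1
            if value is not None:
                real += 1
    return real, total


# Made-up sample in the assumed layout, not taken from the seeded files.
sample = json.loads(
    '{"cases": {"case_a": {"m=1": 3.2, "m=64": 27.0}, "case_b": {"m=1": null}}}'
)
print(coverage(sample))  # (2, 3)
```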

Two cases recorded null (the harness records null and continues; the gate skips null baseline entries). These are pre-existing issues in #28138's bench wiring / kernels, not in the GT pipeline, and are called out on the PR for follow-up:

  • fp8_gemm — null on both SKUs. bench_fp8_gemm.py passes a per-token scale (M,)/(N,) into the static per-tensor-quant path, which asserts a scalar {1} scale (per_tensor_quant_fp8.cuh:113) → Tensor match failed. Independent of the kernel build.
  • per_token_group_quant_8bit — null on sm100 only (works on sm90, ~3–27us). On B200 the bench utility's profiler-table check trips on a large FillFunctor row and "Command Buffer Full" rows.
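
The "records null and continues" behavior described above can be sketched as follows. This is an illustration, not the harness from #28138: `run_cases` and the two sample cases are hypothetical names.

```python
def run_cases(cases):
    """Run each benchmark callable; record None for any case that raises."""
    results = {}
    for name, bench in cases.items():
        try:
            results[name] = bench()
        except Exception as exc:
            # Record null and move on so one broken case does not sink the run.
            print(f"{name}: recorded null ({type(exc).__name__}: {exc})")
            results[name] = None
    return results


def failing_case():
    # Stand-in for a bench script that fails, e.g. on a scale-shape assertion.
    raise RuntimeError("Tensor match failed")


results = run_cases({"ok_case": lambda: 12.5, "broken_case": failing_case})
print(results)  # {'ok_case': 12.5, 'broken_case': None}
```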

These null entries are harmless for the gate, which skips them, and can be backfilled by the nightly workflow once the two bench issues are fixed in #28138.
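
For illustration, a gate that skips null baselines might look like the sketch below. The function name `find_regressions`, the flat `{name: microseconds}` layout, and the 10% tolerance are all assumptions made for this example; the actual thresholds and comparison logic live in #28138.

```python
def find_regressions(baseline, current, tolerance=0.10):
    """Return names whose current time exceeds the baseline by more than tolerance.

    Entries whose baseline is None (null in the JSON file) are skipped.
    """
    regressions = []
    for name, base in baseline.items():
        if base is None:
            continue  # no ground truth yet, so nothing to compare against
        now = current.get(name)
        if now is not None and now > base * (1 + tolerance):
            regressions.append(name)
    return regressions


# Made-up timings in microseconds.
baseline = {"fp8_gemm": None, "case_a": 10.0, "case_b": 20.0}
current = {"fp8_gemm": 99.0, "case_a": 10.5, "case_b": 25.0}
print(find_regressions(baseline, current))  # ['case_b']
```

Because the null baseline is skipped before any comparison, a later backfill only adds coverage and cannot retroactively fail an earlier PR.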

Companion to sgl-project/sglang#28138.
