Skip to content

nvcc 13.1 ptxas miscompile on sm_90 drops createpolicy for cp.async cache_hint #44

@huangzhilin-hzl

Description

@huangzhilin-hzl

We found a CUDA 13.1 ptxas codegen issue while running Marlin on an sm_90 GPU. The symptom is a runtime illegal-instruction trap in the Marlin<256,1,8,8,4,8> specialization, hit by the small-batch + groupsize-aware path.

Symptom

cudaErrorIllegalInstruction (error 716)
========= Illegal instruction
=========     at void Marlin<256,1,8,8,4,8>(...)+0x1180

This is not an illegal address or misalignment issue. The failing instruction is an LDGSTS generated from inline cp.async.

Root cause

With CUDA Toolkit 13.1:

nvcc V13.1.80

ptxas appears to silently drop:

createpolicy.fractional.L2::evict_first.b64

while keeping dependent:

cp.async.cg.shared.global.L2::cache_hint

In SASS, the affected LDGSTS instructions read an uninitialized uniform register as the L2 cache descriptor:

30 x LDGSTS ... desc[UR1] ...
0 x definition/materialization of UR1

So the generated SASS contains desc[UR1] users but no corresponding ULDC / materialization of UR1. At runtime, the hardware traps while decoding that invalid descriptor.

The PTX still contains the expected createpolicy instructions, so the C++ -> PTX stage looks correct. The bug seems to happen in ptxas during PTX -> SASS lowering/optimization.

The sibling specialization:

Marlin<256,1,8,8,4,-1>

does not show the same issue; its descriptor registers are properly materialized in SASS.

Repro shape

A minimal Marlin-level repro is:

import torch, marlin

torch.cuda.set_device(0)
dev = "cuda:0"

m, k, n, gs = 1, 4096, 4096, 128

A = torch.randn((m, k), dtype=torch.half, device=dev)
B = torch.randint(
    low=-(2**31),
    high=2**31,
    size=(k * n // 8,),
    dtype=torch.int,
    device=dev,
)
C = torch.zeros((m, n), dtype=torch.half, device=dev)
s = torch.zeros((k // gs, n), dtype=torch.half, device=dev)
workspace = torch.zeros(n // 128 * 16, dtype=torch.int, device=dev)

marlin.mul(A, B, C, s, workspace, -1, -1, -1)
torch.cuda.synchronize()

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions