[GPU] fix limited GV un init problem #819

Open
BI71317 wants to merge 1 commit into exaloop:develop from BI71317:pr-gv-uninit

BI71317 commented May 24, 2026

This PR partially fixes #781 and fixes #818.

What does this PR do

This PR updates the constant propagation pass so that global constants initialized through simple scalar cast/constructor patterns can also be folded.

Previously, CIR only considered literal scalar constants as valid global constants:

// codon/cir/transform/folding/const_prop.cpp
bool okConst(const Value *v) {
  return v && (isA<IntConst>(v) || isA<FloatConst>(v) || isA<BoolConst>(v));
}

With this change, the pass also accepts a narrow CallInstr pattern where the global value is initialized through a single-argument type.__new__ call, such as float32(pi):

bool okGlobalConst(const Value *v) {
  if (okConst(v))
    return true;

  const auto *call = cast<CallInstr>(v);
  if (!call || call->numArgs() != 1)
    return false;

  auto *func = util::getFunc(call->getCallee());
  if (!func || func->getUnmangledName() != Module::NEW_MAGIC_NAME)
    return false;

  if (!func->getParentType())
    return false;

  return okConst(call->front());
}

Motivation

While investigating the generated LLVM IR and PTX, I noticed that literal global constants were already being inlined into GPU kernels, but casted global constants were not.

For example, in the math module, math.pi is a literal scalar constant, while math.pi32 is initialized as something equivalent to float32(math.pi). Even though both are initialized from constants, only math.pi was folded and inlined into the kernel.

MRE

import math
import gpu

@gpu.kernel
def kernel(out_64, out_32):
    out_64[0] = math.pi
    out_32[0] = math.pi32

def main():
    out_64: list = [0.0]
    out_32: list = [float32(0.0)]
    kernel(out_64, out_32, grid=1, block=1)
    print(out_64[0])
    print(out_32[0])

main()

Result

3.14159
0

In the generated PTX, only pi32 remained as an uninitialized global variable:

...
	// .globl	kernel_0_0_std_internal_types_array_List_0_float__std_internal_types_array_List_0_float32__
.visible .global .align 4 .f32 _pi32_0;
...

Looking at the LLVM IR, both pi and pi32 are emitted as globals initialized to zero, and their actual values are assigned inside the math import initialization function:

@.pi.0 = private unnamed_addr global double 0.000000e+00
@.pi32.0 = private unnamed_addr global float 0.000000e+00
; Both pi and pi32 are only zero-initialized in the IR; their real values are not set here.
....

; in main.0:
  %52 = call {} @"%_import_math_129_call.0:0.1152"(), !dbg !25831
...

; in _import_math_129_call():
  store double 0x400921FB54442D18, ptr @.pi.0, align 8, !dbg !8449 ; pi is stored from a literal
...
  %6 = load double, ptr @.pi.0, align 8, !dbg !8456
  %7 = fptrunc double %6 to float, !dbg !8457
  store float %7, ptr @.pi32.0, align 4, !dbg !8456 ; pi32 is derived by truncating pi
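
As a quick sanity check on the constants in this IR, the following Python sketch (standard library only, not part of Codon) decodes the stored bit pattern and reproduces the fptrunc step. It assumes IEEE-754 double and float representations and round-to-nearest conversion; the variable names are illustrative.

```python
import math
import struct

# Bit pattern stored to @.pi.0 in the IR excerpt.
PI_F64_BITS = 0x400921FB54442D18

# Reinterpret the 64-bit pattern as an IEEE-754 double.
pi_f64 = struct.unpack(">d", PI_F64_BITS.to_bytes(8, "big"))[0]

# Emulate `fptrunc double ... to float`: narrow to float32,
# then read back both the 32-bit pattern and the widened value.
pi_f32_bytes = struct.pack(">f", pi_f64)
pi_f32_bits = int.from_bytes(pi_f32_bytes, "big")
pi_f32 = struct.unpack(">f", pi_f32_bytes)[0]

print(pi_f64 == math.pi)
print(hex(pi_f32_bits))
print(pi_f32)
```

Both values are fully determined by the literal, which is why folding them at compile time is safe for this pattern.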
 

Folding Pass

However, the folding pass runs earlier, at the CIR level. When the folding pass group is turned off,

 codon build -release -llvm -ptx no_fold.ptx \
  -disable-opt=core-folding-pass-group simple_gv_un_init.codon

both pi and pi32 remain as uninitialized globals in PTX:

	// .globl	kernel_0_0_std_internal_types_array_List_0_float__std_internal_types_array_List_0_float32__
.visible .global .align 8 .f64 _pi_0;
.visible .global .align 4 .f32 _pi32_0;
...

This is why I changed the constant propagation logic in the CIR folding pass rather than trying to handle this later in LLVM IR or PTX generation.
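
The reasoning here can be pictured with a toy model in Python. This is not how Codon or the PTX backend is implemented; `Global`, `host_init`, and the two `kernel_reads_*` functions are hypothetical names used only for illustration. The model assumes that the device module receives each global's static initializer but never runs the host-side import initialization function.

```python
from dataclasses import dataclass


@dataclass
class Global:
    """A module-level variable: static initializer plus a host runtime value."""
    name: str
    initializer: float = 0.0  # what ends up in the emitted module
    value: float = 0.0        # what the host sees after init code runs


def host_init(g: Global, computed: float) -> None:
    # Models the store inside the import initialization function.
    # It runs only on the host, so the static initializer is untouched.
    g.value = computed


def kernel_reads_global(g: Global) -> float:
    # Device code sees only the static initializer of the global.
    return g.initializer


def kernel_reads_folded(constant: float) -> float:
    # After constant propagation the use is replaced by the constant,
    # so the kernel no longer references the global at all.
    return constant


pi32 = Global("pi32")
host_init(pi32, 3.1415927)

unfolded = kernel_reads_global(pi32)
folded = kernel_reads_folded(3.1415927)
print(unfolded, folded)
```

In this model the unfolded read returns the zero initializer, which corresponds to the 0 printed for math.pi32 in the MRE, while the folded read returns the intended value.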

Scope of the Change

The change is intentionally narrow.

It does not attempt to fold arbitrary function-call initializers.

Previously, only direct scalar constants were accepted.

This PR extends the accepted pattern to simple constructor/cast calls of the form:

T(literal)

where the call:

  • has exactly one argument,
  • resolves to Module::NEW_MAGIC_NAME,
  • has a parent type,
  • and the argument itself is already an accepted scalar constant.

This covers initializers such as:

float32(3.14159)
i32(123)
u32(456)
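
To make the accepted and rejected shapes concrete, here is a minimal Python model of the predicate. It is a sketch only: `Const` and `Call` are hypothetical stand-ins for the CIR node classes, and `NEW_MAGIC_NAME` stands in for Module::NEW_MAGIC_NAME; the real check lives in the C++ pass.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union

NEW_MAGIC_NAME = "__new__"  # stand-in for Module::NEW_MAGIC_NAME


@dataclass(frozen=True)
class Const:
    """Literal scalar: models IntConst, FloatConst and BoolConst."""
    value: Union[int, float, bool]


@dataclass(frozen=True)
class Call:
    """Call node: unmangled callee name, optional parent type, arguments."""
    func_name: str
    parent_type: Optional[str]
    args: Tuple[object, ...]


def ok_const(v: object) -> bool:
    return isinstance(v, Const)


def ok_global_const(v: object) -> bool:
    if ok_const(v):
        return True
    # Only a single-argument call is considered.
    if not isinstance(v, Call) or len(v.args) != 1:
        return False
    # The callee must be a type's __new__ ...
    if v.func_name != NEW_MAGIC_NAME or v.parent_type is None:
        return False
    # ... applied directly to a literal scalar.
    return ok_const(v.args[0])


literal = Const(3.14159)
cast_of_literal = Call(NEW_MAGIC_NAME, "float32", (literal,))
nested_cast = Call(NEW_MAGIC_NAME, "float32", (cast_of_literal,))
plain_call = Call("make_f32", None, ())

results = [ok_global_const(x)
           for x in (literal, cast_of_literal, nested_cast, plain_call)]
print(results)
```

Note that a cast of a cast is rejected in this model, because the argument must itself be a literal scalar rather than another accepted call.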

Additional tests

MRE 1: const in math module

import math
import gpu

# math.e/pi/tau are literal float GVs.
# math.e32/pi32/tau32 are float32(literal) after global const propagation.
# math.inf/nan and inf32/nan32 come from function-call initializers.

@gpu.kernel
def kernel(out_f64, out_f32, out_special64, out_special32):
    out_f64[0] = math.e
    out_f64[1] = math.pi
    out_f64[2] = math.tau

    out_f32[0] = math.e32
    out_f32[1] = math.pi32
    out_f32[2] = math.tau32

    out_special64[0] = math.inf
    out_special64[1] = math.nan

    out_special32[0] = math.inf32
    out_special32[1] = math.nan32

def main():
    out_f64: list = [0.0, 0.0, 0.0]
    out_f32: list = [float32(0.0), float32(0.0), float32(0.0)]
    out_special64: list = [0.0, 0.0]
    out_special32: list = [float32(0.0), float32(0.0)]
    kernel(out_f64, out_f32, out_special64, out_special32, grid=1, block=1)

    print(out_f64[0])
    print(out_f64[1])
    print(out_f64[2])
    print(out_f32[0])
    print(out_f32[1])
    print(out_f32[2])
    print(out_special64[0])
    print(out_special64[1])
    print(out_special32[0])
    print(out_special32[1])

main()

Result

$ codon run -release gv_math_consts.codon 
2.71828
3.14159
6.28319
2.71828
3.14159
6.28319
0                 // inf and nan are not printed
0
0
0

The e32, pi32, and tau32 cases are now handled because they match the scalar cast pattern. The inf, nan, inf32, and nan32 cases are still not handled, since they are initialized through function-call patterns outside the current whitelist. (I am also not sure what the proper way to express these constants would be.)
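
On the inf and nan cases, one observation, offered as a sketch rather than a proposal for the math module: both have fixed IEEE-754 bit patterns, so they are ordinary compile-time values in principle. The Python snippet below (standard library only; `f64_from_bits` is a hypothetical helper) decodes the usual double-precision patterns. The quiet-NaN pattern shown is one common choice among many valid NaN encodings.

```python
import math
import struct


def f64_from_bits(bits: int) -> float:
    """Reinterpret a 64-bit integer as an IEEE-754 double."""
    return struct.unpack(">d", bits.to_bytes(8, "big"))[0]


INF_BITS = 0x7FF0000000000000   # exponent all ones, mantissa zero
QNAN_BITS = 0x7FF8000000000000  # exponent all ones, top mantissa bit set

inf_value = f64_from_bits(INF_BITS)
nan_value = f64_from_bits(QNAN_BITS)

print(inf_value, nan_value)
```

Whether the folding pass should recognize such initializers, and in what form, is left open here.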

MRE 2: user-defined casted globals

import gpu

# GV RHS is a constructor/cast call. The whitelist accepts single-argument T(literal) calls such as these.
G_I32 = i32(123)
G_U32 = u32(456)
G_F32 = float32(1.25)

@gpu.kernel
def kernel(out_i32, out_u32, out_f32):
    out_i32[0] = G_I32
    out_u32[0] = G_U32
    out_f32[0] = G_F32

def main():
    out_i32: list = [i32(0)]
    out_u32: list = [u32(0)]
    out_f32: list = [float32(0.0)]
    kernel(out_i32, out_u32, out_f32, grid=1, block=1)
    print(out_i32[0])
    print(out_u32[0])
    print(out_f32[0])

main()

Result

$ codon run -release gv_user_casts.codon 
123
456
1.25

Outside these patterns, however, GVs are still left uninitialized, as explained above.

MRE 3: general function-call globals

import gpu

def make_int() -> int:
    return 123

def make_f32() -> float32:
    return float32(1.25)

# GV RHS is a general function call. These are not covered by the narrow whitelist.
G_INT_CALL = make_int()
G_F32_CALL = make_f32()

@gpu.kernel
def kernel(out_int, out_f32):
    out_int[0] = G_INT_CALL
    out_f32[0] = G_F32_CALL

def main():
    out_int: list = [0]
    out_f32: list = [float32(0.0)]
    kernel(out_int, out_f32, grid=1, block=1)
    print(out_int[0])
    print(out_f32[0])

main()

Result

$ codon run -release gv_user_calls.codon 
0
0

Questions / Limits

This is a deliberately limited fix for scalar cast/constructor patterns. It does not solve the broader global-variable initialization issue in general.

Also, because the change is made at the CIR constant propagation level, it is not GPU-specific and may affect CPU-side optimization behavior as well.

I would appreciate feedback on whether this is an acceptable direction, or whether this should instead be handled in a GPU-specific lowering path or through a more general global-initialization mechanism.

BI71317 requested a review from arshajii as a code owner May 24, 2026 11:36
cla-bot added the cla-signed label May 24, 2026

BI71317 commented May 24, 2026

Since I am not sure whether this PR is acceptable, I have not added kernel test cases yet. If the direction is acceptable, I will write test cases for the kernel GV examples and add them to this branch.

Development

Successfully merging this pull request may close these issues.

[GPU] complex sin op results nan
GPU kernel cannot use non-literal global variables
