Skip to content

Add a ROCm/HIP build of the GPU library (gtsam_points_cuda)#99

Open
jeffdaily wants to merge 2 commits into
koide3:masterfrom
jeffdaily:moat-port
Open

Add a ROCm/HIP build of the GPU library (gtsam_points_cuda)#99
jeffdaily wants to merge 2 commits into
koide3:masterfrom
jeffdaily:moat-port

Conversation

@jeffdaily

@jeffdaily jeffdaily commented Jun 11, 2026

Copy link
Copy Markdown

Adds a ROCm/HIP build of the gtsam_points_cuda GPU library (the IntegratedVGICPFactorGPU registration factor and the GaussianVoxelMapGPU) alongside the existing CUDA build, behind a new BUILD_WITH_HIP CMake option. We have made every effort to leave the NVIDIA build unchanged: every addition is gated on BUILD_WITH_HIP, and the compat header is never on the include path of a CUDA build, so a CUDA configure and compile proceed exactly as before.

The .cu sources keep their CUDA spelling and are marked LANGUAGE HIP; a single force-included compat header (include/gtsam_points/cuda/cuda_to_hip.h) aliases the cuda* runtime symbols the library uses to their hip* equivalents and pulls the HIP runtime, and a small set of CUDA-toolkit-named forwarding shims (hip_compat/) retarget the angle-bracket includes the toolchain has no shim for. No device kernel was rewritten: the GPU surface is entirely rocThrust/hipCUB primitives over __host__ __device__ Eigen functors with zero __global__ kernels and zero warp intrinsics, so there is no wave-size-dependent code.

Review order: start with cuda_to_hip.h and cuda_to_hip_types.h (the only files that know HIP), then the CMakeLists BUILD_WITH_HIP arms, then the hip_compat shims. The load-bearing details:

  • Opaque API-type bridge. The public headers use the CUDA driver opaque structs (CUstream_st*, CUgraph_st*, CUgraphNode_st*, CUgraphExec_st*, CUevent_st*) as their stream/graph/event types across ~27 files, including headers the host-only main library and the tests include. The lightweight cuda_to_hip_types.h aliases each to the matching HIP opaque struct (CUstream_st -> ihipStream_t, etc.) and is force-included into the main library and every test target as well as the GPU library, so a CUstream_st* signature is the same type (== hipStream_t) in every translation unit and the host callers link against the .cu definitions. Getting this consistent across all three consumers is what makes the link resolve.
  • thrust::cuda::par[_nosync] (~12 sites) is the CUDA backend and is not valid on the HIP device system; the compat header pulls the HIP execution-policy header and aliases namespace thrust::cuda = thrust::hip (guarded on HIPCC so only the -x hip .cu see it; the host .cpp in the GPU library never use it and cannot parse rocThrust).
  • CUDA_VERSION is undefined on HIP, which would trip cuda_malloc_async.hpp's #if (CUDA_VERSION < 11000) and silently downgrade every cudaMallocAsync to a synchronous hipMalloc. The compat header pins CUDA_VERSION into [11000, 13000) so the stream-ordered hipMallocAsync path is taken and the cuda_graph.cu arity branch selects the 4-arg hipGraphAddDependencies / 5-arg hipGraphInstantiate that ROCm uses.
  • The unsigned atomicMax in the voxel intensity accumulation (GaussianVoxelMapGPU::insert) is the documented "atomicMax dropped on coarse-grained memory" fault class on gfx90a, but the buffer is fine-grained hipMallocAsync device memory, where it is correct -- verified on GPU (test_voxelmap VoxelMapGPU_Intensity), so no atomicCAS-loop emulation is needed.
  • hip::host (runtime), not hip::device, is linked: hip::device carries -x hip/--offload-arch in INTERFACE_COMPILE_OPTIONS that would propagate to the host .cpp in the mixed-language target.
  • The HIP architecture is read from CMAKE_HIP_ARCHITECTURES (defaulted to gfx90a only when unset), never a literal, so other AMD GPUs build from the same source with only -DCMAKE_HIP_ARCHITECTURES.

A small number of IO and build edits make the GPU and CPU voxelmap round-trips correct on Windows (open the compact binary files with std::ios::binary so text-mode does not corrupt the data; match aux attribute files with a forward-slash path; static libraries and NOMINMAX/_USE_MATH_DEFINES under the Windows clang toolchain). These are no-ops on Linux (text mode == binary mode, forward-slash paths), and the two added stream synchronizations around the bulk save/load copies are correctness improvements on every platform.

The installed package config gains a HIP branch (find_dependency(hip) when GTSAM_POINTS_USE_HIP) mirroring the CUDA one so a downstream find_package(gtsam_points) resolves on a ROCm-only host. The README documents the BUILD_WITH_HIP option and AMD GPU arch selection alongside the existing CUDA build.

The GPU library requires GTSAM with GTSAM_UNSTABLE; the Ubuntu apt libgtsam-dev (4.2.0) ships no gtsam_unstable, so a source build of GTSAM 4.3a0 (the default GTSAM_BUILD_UNSTABLE=ON) is used.

Authored with the assistance of Claude (Anthropic).

Test Plan: Built and validated on real GPUs on four AMD architectures: AMD MI250X (gfx90a, CDNA2, ROCm 7.2.1), AMD Radeon Pro W7800 (gfx1100, RDNA3, ROCm 7.2.1), AMD Radeon PRO V710 (gfx1101, RDNA3, Windows, ROCm 7.14), and AMD Radeon RX 9070 XT (gfx1201, RDNA4, Windows, ROCm 7.14).

cmake -S . -B build_hip -G Ninja \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DBUILD_WITH_HIP=ON \
  -DCMAKE_HIP_ARCHITECTURES=gfx90a \
  -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ \
  -DCMAKE_PREFIX_PATH=<gtsam-4.3a0-install> \
  -DBUILD_WITH_OPENMP=ON -DBUILD_WITH_TBB=OFF -DBUILD_TESTS=ON \
  -DBUILD_DEMO=OFF -DBUILD_EXAMPLE=OFF -DBUILD_TOOLS=OFF
cmake --build build_hip -j 16

# full suite, serial, on one GPU
HIP_VISIBLE_DEVICES=0 ctest --test-dir build_hip --output-on-failure -j1

Result on every arch: 87/87 tests passed, 0 failed. The GPU gates pass: test_matching_cost_factors VGICP_CUDA_* (IntegratedVGICPFactorGPU converges to ground truth across FORWARD/BACKWARD/UNARY/MULTI_FRAME; rot error << 0.015 rad, trans error << 0.15 m); test_voxelmap VoxelMapGPU/_Intensity/_IO; test_types TestPointCloudGPU. Two runs agree to tolerance. The non-GPU suite (test_alignment, test_kdtree, test_loam_factors, test_bundle_adjustment, CPU ICP/GICP/VGICP, etc.) is unchanged. The CUDA build was separately compiled with nvcc 12.8 (BUILD_WITH_CUDA=ON, BUILD_WITH_HIP=OFF, CMAKE_CUDA_ARCHITECTURES=80) to confirm it is unaffected.

Adds a ROCm/HIP build of the gtsam_points_cuda GPU library (the
IntegratedVGICPFactorGPU registration factor and the GaussianVoxelMapGPU)
alongside the existing CUDA build, behind a new BUILD_WITH_HIP CMake
option. We have made every effort to leave the NVIDIA build unchanged:
every addition is gated on BUILD_WITH_HIP, and the compat header is never
on the include path of a CUDA build, so a CUDA configure and compile
proceed exactly as before.

The .cu sources keep their CUDA spelling and are marked LANGUAGE HIP; a
single force-included compat header
(include/gtsam_points/cuda/cuda_to_hip.h) aliases the cuda* runtime
symbols the library uses to their hip* equivalents and pulls the HIP
runtime, and a small set of CUDA-toolkit-named forwarding shims
(hip_compat/) retarget the angle-bracket includes the toolchain has no
shim for. No device kernel was rewritten: the GPU surface is entirely
rocThrust/hipCUB primitives over __host__ __device__ Eigen functors with
zero __global__ kernels and zero warp intrinsics, so there is no
wave-size-dependent code.

Review order: start with cuda_to_hip.h and cuda_to_hip_types.h (the only
files that know HIP), then the CMakeLists BUILD_WITH_HIP arms, then the
hip_compat shims. The load-bearing details:

- Opaque API-type bridge. The public headers use the CUDA driver opaque
  structs (CUstream_st*, CUgraph_st*, CUgraphNode_st*, CUgraphExec_st*,
  CUevent_st*) as their stream/graph/event types across ~27 files,
  including headers the host-only main library and the tests include. The
  lightweight cuda_to_hip_types.h aliases each to the matching HIP opaque
  struct (CUstream_st -> ihipStream_t, etc.) and is force-included into
  the main library and every test target as well as the GPU library, so a
  CUstream_st* signature is the same type (== hipStream_t) in every
  translation unit and the host callers link against the .cu definitions.
  Getting this consistent across all three consumers is what makes the
  link resolve.
- thrust::cuda::par[_nosync] (~12 sites) is the CUDA backend and is not
  valid on the HIP device system; the compat header pulls the HIP
  execution-policy header and aliases namespace thrust::cuda = thrust::hip
  (guarded on __HIPCC__ so only the -x hip .cu see it; the host .cpp in
  the GPU library never use it and cannot parse rocThrust).
- CUDA_VERSION is undefined on HIP, which would trip
  cuda_malloc_async.hpp's `#if (CUDA_VERSION < 11000)` and silently
  downgrade every cudaMallocAsync to a synchronous hipMalloc. The compat
  header pins CUDA_VERSION into [11000, 13000) so the stream-ordered
  hipMallocAsync path is taken and the cuda_graph.cu arity branch selects
  the 4-arg hipGraphAddDependencies / 5-arg hipGraphInstantiate that ROCm
  uses.
- The unsigned atomicMax in the voxel intensity accumulation
  (GaussianVoxelMapGPU::insert) is the documented "atomicMax dropped on
  coarse-grained memory" fault class on gfx90a, but the buffer is
  fine-grained hipMallocAsync device memory, where it is correct -- verified
  on GPU (test_voxelmap VoxelMapGPU_Intensity), so no atomicCAS-loop
  emulation is needed.
- hip::host (runtime), not hip::device, is linked: hip::device carries
  -x hip/--offload-arch in INTERFACE_COMPILE_OPTIONS that would propagate
  to the host .cpp in the mixed-language target.
- The HIP architecture is read from CMAKE_HIP_ARCHITECTURES (defaulted to
  gfx90a only when unset), never a literal, so other AMD GPUs build from
  the same source with only -DCMAKE_HIP_ARCHITECTURES.

A small number of IO and build edits make the GPU and CPU voxelmap
round-trips correct on Windows (open the compact binary files with
std::ios::binary so text-mode does not corrupt the data; match aux
attribute files with a forward-slash path; static libraries and
NOMINMAX/_USE_MATH_DEFINES under the Windows clang toolchain). These are
no-ops on Linux (text mode == binary mode, forward-slash paths), and the
two added stream synchronizations around the bulk save/load copies are
correctness improvements on every platform.

The installed package config gains a HIP branch (find_dependency(hip)
when GTSAM_POINTS_USE_HIP) mirroring the CUDA one so a downstream
find_package(gtsam_points) resolves on a ROCm-only host. The README
documents the BUILD_WITH_HIP option and AMD GPU arch selection alongside
the existing CUDA build.

The GPU library requires GTSAM with GTSAM_UNSTABLE; the Ubuntu apt
libgtsam-dev (4.2.0) ships no gtsam_unstable, so a source build of GTSAM
4.3a0 (the default GTSAM_BUILD_UNSTABLE=ON) is used.

Authored with the assistance of Claude (Anthropic).

Test Plan:
Built and validated on real GPUs on four AMD architectures: AMD MI250X
(gfx90a, CDNA2, ROCm 7.2.1), AMD Radeon Pro W7800 (gfx1100, RDNA3, ROCm
7.2.1), AMD Radeon PRO V710 (gfx1101, RDNA3, Windows, ROCm 7.14), and AMD
Radeon RX 9070 XT (gfx1201, RDNA4, Windows, ROCm 7.14).

```
cmake -S . -B build_hip -G Ninja \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DBUILD_WITH_HIP=ON \
  -DCMAKE_HIP_ARCHITECTURES=gfx90a \
  -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ \
  -DCMAKE_PREFIX_PATH=<gtsam-4.3a0-install> \
  -DBUILD_WITH_OPENMP=ON -DBUILD_WITH_TBB=OFF -DBUILD_TESTS=ON \
  -DBUILD_DEMO=OFF -DBUILD_EXAMPLE=OFF -DBUILD_TOOLS=OFF
cmake --build build_hip -j 16

# full suite, serial, on one GPU
HIP_VISIBLE_DEVICES=0 ctest --test-dir build_hip --output-on-failure -j1
```

Result on every arch: 87/87 tests passed, 0 failed. The GPU gates pass:
test_matching_cost_factors VGICP_CUDA_* (IntegratedVGICPFactorGPU
converges to ground truth across FORWARD/BACKWARD/UNARY/MULTI_FRAME; rot
error << 0.015 rad, trans error << 0.15 m); test_voxelmap
VoxelMapGPU/_Intensity/_IO; test_types TestPointCloudGPU. Two runs agree
to tolerance. The non-GPU suite (test_alignment, test_kdtree,
test_loam_factors, test_bundle_adjustment, CPU ICP/GICP/VGICP, etc.) is
unchanged. The CUDA build was separately compiled with nvcc 12.8
(BUILD_WITH_CUDA=ON, BUILD_WITH_HIP=OFF, CMAKE_CUDA_ARCHITECTURES=80) to
confirm it is unaffected.
jeffdaily added a commit to jeffdaily/moat that referenced this pull request Jun 11, 2026
@koide3

koide3 commented Jun 16, 2026

Copy link
Copy Markdown
Owner

Thank you for your contribution! This is an exciting feature. I'll review it.

The HIP arm pinned CMAKE_HIP_ARCHITECTURES to gfx90a when unset, but the
pin ran after enable_language(HIP), so the guard was already false and the
block was dead code that could only mislead (or become a live footgun if
file order ever changed).

enable_language(HIP) already honors an explicit -DCMAKE_HIP_ARCHITECTURES,
otherwise auto-detects the host GPU, and errors on a no-GPU build host.
Removing the pin lets that single mechanism decide the arch, so a user on a
non-gfx90a card no longer risks a silently mistargeted default.

This work was authored with the assistance of the Claude AI assistant.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants