Add a ROCm/HIP build of the GPU library (gtsam_points_cuda) by jeffdaily · Pull Request #99 · koide3/gtsam_points

jeffdaily · 2026-06-11T23:14:59Z

Adds a ROCm/HIP build of the gtsam_points_cuda GPU library (the IntegratedVGICPFactorGPU registration factor and the GaussianVoxelMapGPU) alongside the existing CUDA build, behind a new BUILD_WITH_HIP CMake option. We have made every effort to leave the NVIDIA build unchanged: every addition is gated on BUILD_WITH_HIP, and the compat header is never on the include path of a CUDA build, so a CUDA configure and compile proceed exactly as before.

The .cu sources keep their CUDA spelling and are marked LANGUAGE HIP; a single force-included compat header (include/gtsam_points/cuda/cuda_to_hip.h) aliases the cuda* runtime symbols the library uses to their hip* equivalents and pulls the HIP runtime, and a small set of CUDA-toolkit-named forwarding shims (hip_compat/) retarget the angle-bracket includes the toolchain has no shim for. No device kernel was rewritten: the GPU surface is entirely rocThrust/hipCUB primitives over __host__ __device__ Eigen functors with zero __global__ kernels and zero warp intrinsics, so there is no wave-size-dependent code.

Review order: start with cuda_to_hip.h and cuda_to_hip_types.h (the only files that know HIP), then the CMakeLists BUILD_WITH_HIP arms, then the hip_compat shims. The load-bearing details:

Opaque API-type bridge. The public headers use the CUDA driver opaque structs (CUstream_st*, CUgraph_st*, CUgraphNode_st*, CUgraphExec_st*, CUevent_st*) as their stream/graph/event types across ~27 files, including headers the host-only main library and the tests include. The lightweight cuda_to_hip_types.h aliases each to the matching HIP opaque struct (CUstream_st -> ihipStream_t, etc.) and is force-included into the main library and every test target as well as the GPU library, so a CUstream_st* signature is the same type (== hipStream_t) in every translation unit and the host callers link against the .cu definitions. Getting this consistent across all three consumers is what makes the link resolve.
thrust::cuda::par[_nosync] (~12 sites) is the CUDA backend and is not valid on the HIP device system; the compat header pulls the HIP execution-policy header and aliases namespace thrust::cuda = thrust::hip (guarded on HIPCC so only the -x hip .cu see it; the host .cpp in the GPU library never use it and cannot parse rocThrust).
CUDA_VERSION is undefined on HIP, which would trip cuda_malloc_async.hpp's #if (CUDA_VERSION < 11000) and silently downgrade every cudaMallocAsync to a synchronous hipMalloc. The compat header pins CUDA_VERSION into [11000, 13000) so the stream-ordered hipMallocAsync path is taken and the cuda_graph.cu arity branch selects the 4-arg hipGraphAddDependencies / 5-arg hipGraphInstantiate that ROCm uses.
The unsigned atomicMax in the voxel intensity accumulation (GaussianVoxelMapGPU::insert) is the documented "atomicMax dropped on coarse-grained memory" fault class on gfx90a, but the buffer is fine-grained hipMallocAsync device memory, where it is correct -- verified on GPU (test_voxelmap VoxelMapGPU_Intensity), so no atomicCAS-loop emulation is needed.
hip::host (runtime), not hip::device, is linked: hip::device carries -x hip/--offload-arch in INTERFACE_COMPILE_OPTIONS that would propagate to the host .cpp in the mixed-language target.
The HIP architecture is read from CMAKE_HIP_ARCHITECTURES (defaulted to gfx90a only when unset), never a literal, so other AMD GPUs build from the same source with only -DCMAKE_HIP_ARCHITECTURES.

A small number of IO and build edits make the GPU and CPU voxelmap round-trips correct on Windows (open the compact binary files with std::ios::binary so text-mode does not corrupt the data; match aux attribute files with a forward-slash path; static libraries and NOMINMAX/_USE_MATH_DEFINES under the Windows clang toolchain). These are no-ops on Linux (text mode == binary mode, forward-slash paths), and the two added stream synchronizations around the bulk save/load copies are correctness improvements on every platform.

The installed package config gains a HIP branch (find_dependency(hip) when GTSAM_POINTS_USE_HIP) mirroring the CUDA one so a downstream find_package(gtsam_points) resolves on a ROCm-only host. The README documents the BUILD_WITH_HIP option and AMD GPU arch selection alongside the existing CUDA build.

The GPU library requires GTSAM with GTSAM_UNSTABLE; the Ubuntu apt libgtsam-dev (4.2.0) ships no gtsam_unstable, so a source build of GTSAM 4.3a0 (the default GTSAM_BUILD_UNSTABLE=ON) is used.

Authored with the assistance of Claude (Anthropic).

Test Plan: Built and validated on real GPUs on four AMD architectures: AMD MI250X (gfx90a, CDNA2, ROCm 7.2.1), AMD Radeon Pro W7800 (gfx1100, RDNA3, ROCm 7.2.1), AMD Radeon PRO V710 (gfx1101, RDNA3, Windows, ROCm 7.14), and AMD Radeon RX 9070 XT (gfx1201, RDNA4, Windows, ROCm 7.14).

cmake -S . -B build_hip -G Ninja \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DBUILD_WITH_HIP=ON \
  -DCMAKE_HIP_ARCHITECTURES=gfx90a \
  -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ \
  -DCMAKE_PREFIX_PATH=<gtsam-4.3a0-install> \
  -DBUILD_WITH_OPENMP=ON -DBUILD_WITH_TBB=OFF -DBUILD_TESTS=ON \
  -DBUILD_DEMO=OFF -DBUILD_EXAMPLE=OFF -DBUILD_TOOLS=OFF
cmake --build build_hip -j 16

# full suite, serial, on one GPU
HIP_VISIBLE_DEVICES=0 ctest --test-dir build_hip --output-on-failure -j1

Result on every arch: 87/87 tests passed, 0 failed. The GPU gates pass: test_matching_cost_factors VGICP_CUDA_* (IntegratedVGICPFactorGPU converges to ground truth across FORWARD/BACKWARD/UNARY/MULTI_FRAME; rot error << 0.015 rad, trans error << 0.15 m); test_voxelmap VoxelMapGPU/_Intensity/_IO; test_types TestPointCloudGPU. Two runs agree to tolerance. The non-GPU suite (test_alignment, test_kdtree, test_loam_factors, test_bundle_adjustment, CPU ICP/GICP/VGICP, etc.) is unchanged. The CUDA build was separately compiled with nvcc 12.8 (BUILD_WITH_CUDA=ON, BUILD_WITH_HIP=OFF, CMAKE_CUDA_ARCHITECTURES=80) to confirm it is unaffected.

Adds a ROCm/HIP build of the gtsam_points_cuda GPU library (the IntegratedVGICPFactorGPU registration factor and the GaussianVoxelMapGPU) alongside the existing CUDA build, behind a new BUILD_WITH_HIP CMake option. We have made every effort to leave the NVIDIA build unchanged: every addition is gated on BUILD_WITH_HIP, and the compat header is never on the include path of a CUDA build, so a CUDA configure and compile proceed exactly as before. The .cu sources keep their CUDA spelling and are marked LANGUAGE HIP; a single force-included compat header (include/gtsam_points/cuda/cuda_to_hip.h) aliases the cuda* runtime symbols the library uses to their hip* equivalents and pulls the HIP runtime, and a small set of CUDA-toolkit-named forwarding shims (hip_compat/) retarget the angle-bracket includes the toolchain has no shim for. No device kernel was rewritten: the GPU surface is entirely rocThrust/hipCUB primitives over __host__ __device__ Eigen functors with zero __global__ kernels and zero warp intrinsics, so there is no wave-size-dependent code. Review order: start with cuda_to_hip.h and cuda_to_hip_types.h (the only files that know HIP), then the CMakeLists BUILD_WITH_HIP arms, then the hip_compat shims. The load-bearing details: - Opaque API-type bridge. The public headers use the CUDA driver opaque structs (CUstream_st*, CUgraph_st*, CUgraphNode_st*, CUgraphExec_st*, CUevent_st*) as their stream/graph/event types across ~27 files, including headers the host-only main library and the tests include. The lightweight cuda_to_hip_types.h aliases each to the matching HIP opaque struct (CUstream_st -> ihipStream_t, etc.) and is force-included into the main library and every test target as well as the GPU library, so a CUstream_st* signature is the same type (== hipStream_t) in every translation unit and the host callers link against the .cu definitions. Getting this consistent across all three consumers is what makes the link resolve. - thrust::cuda::par[_nosync] (~12 sites) is the CUDA backend and is not valid on the HIP device system; the compat header pulls the HIP execution-policy header and aliases namespace thrust::cuda = thrust::hip (guarded on __HIPCC__ so only the -x hip .cu see it; the host .cpp in the GPU library never use it and cannot parse rocThrust). - CUDA_VERSION is undefined on HIP, which would trip cuda_malloc_async.hpp's `#if (CUDA_VERSION < 11000)` and silently downgrade every cudaMallocAsync to a synchronous hipMalloc. The compat header pins CUDA_VERSION into [11000, 13000) so the stream-ordered hipMallocAsync path is taken and the cuda_graph.cu arity branch selects the 4-arg hipGraphAddDependencies / 5-arg hipGraphInstantiate that ROCm uses. - The unsigned atomicMax in the voxel intensity accumulation (GaussianVoxelMapGPU::insert) is the documented "atomicMax dropped on coarse-grained memory" fault class on gfx90a, but the buffer is fine-grained hipMallocAsync device memory, where it is correct -- verified on GPU (test_voxelmap VoxelMapGPU_Intensity), so no atomicCAS-loop emulation is needed. - hip::host (runtime), not hip::device, is linked: hip::device carries -x hip/--offload-arch in INTERFACE_COMPILE_OPTIONS that would propagate to the host .cpp in the mixed-language target. - The HIP architecture is read from CMAKE_HIP_ARCHITECTURES (defaulted to gfx90a only when unset), never a literal, so other AMD GPUs build from the same source with only -DCMAKE_HIP_ARCHITECTURES. A small number of IO and build edits make the GPU and CPU voxelmap round-trips correct on Windows (open the compact binary files with std::ios::binary so text-mode does not corrupt the data; match aux attribute files with a forward-slash path; static libraries and NOMINMAX/_USE_MATH_DEFINES under the Windows clang toolchain). These are no-ops on Linux (text mode == binary mode, forward-slash paths), and the two added stream synchronizations around the bulk save/load copies are correctness improvements on every platform. The installed package config gains a HIP branch (find_dependency(hip) when GTSAM_POINTS_USE_HIP) mirroring the CUDA one so a downstream find_package(gtsam_points) resolves on a ROCm-only host. The README documents the BUILD_WITH_HIP option and AMD GPU arch selection alongside the existing CUDA build. The GPU library requires GTSAM with GTSAM_UNSTABLE; the Ubuntu apt libgtsam-dev (4.2.0) ships no gtsam_unstable, so a source build of GTSAM 4.3a0 (the default GTSAM_BUILD_UNSTABLE=ON) is used. Authored with the assistance of Claude (Anthropic). Test Plan: Built and validated on real GPUs on four AMD architectures: AMD MI250X (gfx90a, CDNA2, ROCm 7.2.1), AMD Radeon Pro W7800 (gfx1100, RDNA3, ROCm 7.2.1), AMD Radeon PRO V710 (gfx1101, RDNA3, Windows, ROCm 7.14), and AMD Radeon RX 9070 XT (gfx1201, RDNA4, Windows, ROCm 7.14). ``` cmake -S . -B build_hip -G Ninja \ -DCMAKE_BUILD_TYPE=RelWithDebInfo \ -DBUILD_WITH_HIP=ON \ -DCMAKE_HIP_ARCHITECTURES=gfx90a \ -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ \ -DCMAKE_PREFIX_PATH=<gtsam-4.3a0-install> \ -DBUILD_WITH_OPENMP=ON -DBUILD_WITH_TBB=OFF -DBUILD_TESTS=ON \ -DBUILD_DEMO=OFF -DBUILD_EXAMPLE=OFF -DBUILD_TOOLS=OFF cmake --build build_hip -j 16 # full suite, serial, on one GPU HIP_VISIBLE_DEVICES=0 ctest --test-dir build_hip --output-on-failure -j1 ``` Result on every arch: 87/87 tests passed, 0 failed. The GPU gates pass: test_matching_cost_factors VGICP_CUDA_* (IntegratedVGICPFactorGPU converges to ground truth across FORWARD/BACKWARD/UNARY/MULTI_FRAME; rot error << 0.015 rad, trans error << 0.15 m); test_voxelmap VoxelMapGPU/_Intensity/_IO; test_types TestPointCloudGPU. Two runs agree to tolerance. The non-GPU suite (test_alignment, test_kdtree, test_loam_factors, test_bundle_adjustment, CPU ICP/GICP/VGICP, etc.) is unchanged. The CUDA build was separately compiled with nvcc 12.8 (BUILD_WITH_CUDA=ON, BUILD_WITH_HIP=OFF, CMAKE_CUDA_ARCHITECTURES=80) to confirm it is unaffected.

koide3 · 2026-06-16T04:47:17Z

Thank you for your contribution! This is an exciting feature. I'll review it.

The HIP arm pinned CMAKE_HIP_ARCHITECTURES to gfx90a when unset, but the pin ran after enable_language(HIP), so the guard was already false and the block was dead code that could only mislead (or become a live footgun if file order ever changed). enable_language(HIP) already honors an explicit -DCMAKE_HIP_ARCHITECTURES, otherwise auto-detects the host GPU, and errors on a no-GPU build host. Removing the pin lets that single mechanism decide the arch, so a user on a non-gfx90a card no longer risks a silently mistargeted default. This work was authored with the assistance of the Claude AI assistant.

jeffdaily added a commit to jeffdaily/moat that referenced this pull request Jun 11, 2026

gtsam_points: lead -> pr-open (koide3/gtsam_points#99)

9c1bce4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a ROCm/HIP build of the GPU library (gtsam_points_cuda)#99

Add a ROCm/HIP build of the GPU library (gtsam_points_cuda)#99
jeffdaily wants to merge 2 commits into
koide3:masterfrom
jeffdaily:moat-port

jeffdaily commented Jun 11, 2026 •

edited

Loading

Uh oh!

koide3 commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jeffdaily commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

koide3 commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jeffdaily commented Jun 11, 2026 •

edited

Loading