Add a ROCm/HIP build of the GPU library (gtsam_points_cuda)#99
Open
jeffdaily wants to merge 2 commits into
Open
Add a ROCm/HIP build of the GPU library (gtsam_points_cuda)#99jeffdaily wants to merge 2 commits into
jeffdaily wants to merge 2 commits into
Conversation
Adds a ROCm/HIP build of the gtsam_points_cuda GPU library (the IntegratedVGICPFactorGPU registration factor and the GaussianVoxelMapGPU) alongside the existing CUDA build, behind a new BUILD_WITH_HIP CMake option. We have made every effort to leave the NVIDIA build unchanged: every addition is gated on BUILD_WITH_HIP, and the compat header is never on the include path of a CUDA build, so a CUDA configure and compile proceed exactly as before. The .cu sources keep their CUDA spelling and are marked LANGUAGE HIP; a single force-included compat header (include/gtsam_points/cuda/cuda_to_hip.h) aliases the cuda* runtime symbols the library uses to their hip* equivalents and pulls the HIP runtime, and a small set of CUDA-toolkit-named forwarding shims (hip_compat/) retarget the angle-bracket includes the toolchain has no shim for. No device kernel was rewritten: the GPU surface is entirely rocThrust/hipCUB primitives over __host__ __device__ Eigen functors with zero __global__ kernels and zero warp intrinsics, so there is no wave-size-dependent code. Review order: start with cuda_to_hip.h and cuda_to_hip_types.h (the only files that know HIP), then the CMakeLists BUILD_WITH_HIP arms, then the hip_compat shims. The load-bearing details: - Opaque API-type bridge. The public headers use the CUDA driver opaque structs (CUstream_st*, CUgraph_st*, CUgraphNode_st*, CUgraphExec_st*, CUevent_st*) as their stream/graph/event types across ~27 files, including headers the host-only main library and the tests include. The lightweight cuda_to_hip_types.h aliases each to the matching HIP opaque struct (CUstream_st -> ihipStream_t, etc.) and is force-included into the main library and every test target as well as the GPU library, so a CUstream_st* signature is the same type (== hipStream_t) in every translation unit and the host callers link against the .cu definitions. Getting this consistent across all three consumers is what makes the link resolve. - thrust::cuda::par[_nosync] (~12 sites) is the CUDA backend and is not valid on the HIP device system; the compat header pulls the HIP execution-policy header and aliases namespace thrust::cuda = thrust::hip (guarded on __HIPCC__ so only the -x hip .cu see it; the host .cpp in the GPU library never use it and cannot parse rocThrust). - CUDA_VERSION is undefined on HIP, which would trip cuda_malloc_async.hpp's `#if (CUDA_VERSION < 11000)` and silently downgrade every cudaMallocAsync to a synchronous hipMalloc. The compat header pins CUDA_VERSION into [11000, 13000) so the stream-ordered hipMallocAsync path is taken and the cuda_graph.cu arity branch selects the 4-arg hipGraphAddDependencies / 5-arg hipGraphInstantiate that ROCm uses. - The unsigned atomicMax in the voxel intensity accumulation (GaussianVoxelMapGPU::insert) is the documented "atomicMax dropped on coarse-grained memory" fault class on gfx90a, but the buffer is fine-grained hipMallocAsync device memory, where it is correct -- verified on GPU (test_voxelmap VoxelMapGPU_Intensity), so no atomicCAS-loop emulation is needed. - hip::host (runtime), not hip::device, is linked: hip::device carries -x hip/--offload-arch in INTERFACE_COMPILE_OPTIONS that would propagate to the host .cpp in the mixed-language target. - The HIP architecture is read from CMAKE_HIP_ARCHITECTURES (defaulted to gfx90a only when unset), never a literal, so other AMD GPUs build from the same source with only -DCMAKE_HIP_ARCHITECTURES. A small number of IO and build edits make the GPU and CPU voxelmap round-trips correct on Windows (open the compact binary files with std::ios::binary so text-mode does not corrupt the data; match aux attribute files with a forward-slash path; static libraries and NOMINMAX/_USE_MATH_DEFINES under the Windows clang toolchain). These are no-ops on Linux (text mode == binary mode, forward-slash paths), and the two added stream synchronizations around the bulk save/load copies are correctness improvements on every platform. The installed package config gains a HIP branch (find_dependency(hip) when GTSAM_POINTS_USE_HIP) mirroring the CUDA one so a downstream find_package(gtsam_points) resolves on a ROCm-only host. The README documents the BUILD_WITH_HIP option and AMD GPU arch selection alongside the existing CUDA build. The GPU library requires GTSAM with GTSAM_UNSTABLE; the Ubuntu apt libgtsam-dev (4.2.0) ships no gtsam_unstable, so a source build of GTSAM 4.3a0 (the default GTSAM_BUILD_UNSTABLE=ON) is used. Authored with the assistance of Claude (Anthropic). Test Plan: Built and validated on real GPUs on four AMD architectures: AMD MI250X (gfx90a, CDNA2, ROCm 7.2.1), AMD Radeon Pro W7800 (gfx1100, RDNA3, ROCm 7.2.1), AMD Radeon PRO V710 (gfx1101, RDNA3, Windows, ROCm 7.14), and AMD Radeon RX 9070 XT (gfx1201, RDNA4, Windows, ROCm 7.14). ``` cmake -S . -B build_hip -G Ninja \ -DCMAKE_BUILD_TYPE=RelWithDebInfo \ -DBUILD_WITH_HIP=ON \ -DCMAKE_HIP_ARCHITECTURES=gfx90a \ -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ \ -DCMAKE_PREFIX_PATH=<gtsam-4.3a0-install> \ -DBUILD_WITH_OPENMP=ON -DBUILD_WITH_TBB=OFF -DBUILD_TESTS=ON \ -DBUILD_DEMO=OFF -DBUILD_EXAMPLE=OFF -DBUILD_TOOLS=OFF cmake --build build_hip -j 16 # full suite, serial, on one GPU HIP_VISIBLE_DEVICES=0 ctest --test-dir build_hip --output-on-failure -j1 ``` Result on every arch: 87/87 tests passed, 0 failed. The GPU gates pass: test_matching_cost_factors VGICP_CUDA_* (IntegratedVGICPFactorGPU converges to ground truth across FORWARD/BACKWARD/UNARY/MULTI_FRAME; rot error << 0.015 rad, trans error << 0.15 m); test_voxelmap VoxelMapGPU/_Intensity/_IO; test_types TestPointCloudGPU. Two runs agree to tolerance. The non-GPU suite (test_alignment, test_kdtree, test_loam_factors, test_bundle_adjustment, CPU ICP/GICP/VGICP, etc.) is unchanged. The CUDA build was separately compiled with nvcc 12.8 (BUILD_WITH_CUDA=ON, BUILD_WITH_HIP=OFF, CMAKE_CUDA_ARCHITECTURES=80) to confirm it is unaffected.
jeffdaily
added a commit
to jeffdaily/moat
that referenced
this pull request
Jun 11, 2026
Owner
|
Thank you for your contribution! This is an exciting feature. I'll review it. |
The HIP arm pinned CMAKE_HIP_ARCHITECTURES to gfx90a when unset, but the pin ran after enable_language(HIP), so the guard was already false and the block was dead code that could only mislead (or become a live footgun if file order ever changed). enable_language(HIP) already honors an explicit -DCMAKE_HIP_ARCHITECTURES, otherwise auto-detects the host GPU, and errors on a no-GPU build host. Removing the pin lets that single mechanism decide the arch, so a user on a non-gfx90a card no longer risks a silently mistargeted default. This work was authored with the assistance of the Claude AI assistant.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds a ROCm/HIP build of the gtsam_points_cuda GPU library (the IntegratedVGICPFactorGPU registration factor and the GaussianVoxelMapGPU) alongside the existing CUDA build, behind a new BUILD_WITH_HIP CMake option. We have made every effort to leave the NVIDIA build unchanged: every addition is gated on BUILD_WITH_HIP, and the compat header is never on the include path of a CUDA build, so a CUDA configure and compile proceed exactly as before.
The .cu sources keep their CUDA spelling and are marked LANGUAGE HIP; a single force-included compat header (include/gtsam_points/cuda/cuda_to_hip.h) aliases the cuda* runtime symbols the library uses to their hip* equivalents and pulls the HIP runtime, and a small set of CUDA-toolkit-named forwarding shims (hip_compat/) retarget the angle-bracket includes the toolchain has no shim for. No device kernel was rewritten: the GPU surface is entirely rocThrust/hipCUB primitives over
__host__ __device__Eigen functors with zero__global__kernels and zero warp intrinsics, so there is no wave-size-dependent code.Review order: start with cuda_to_hip.h and cuda_to_hip_types.h (the only files that know HIP), then the CMakeLists BUILD_WITH_HIP arms, then the hip_compat shims. The load-bearing details:
#if (CUDA_VERSION < 11000)and silently downgrade every cudaMallocAsync to a synchronous hipMalloc. The compat header pins CUDA_VERSION into [11000, 13000) so the stream-ordered hipMallocAsync path is taken and the cuda_graph.cu arity branch selects the 4-arg hipGraphAddDependencies / 5-arg hipGraphInstantiate that ROCm uses.A small number of IO and build edits make the GPU and CPU voxelmap round-trips correct on Windows (open the compact binary files with std::ios::binary so text-mode does not corrupt the data; match aux attribute files with a forward-slash path; static libraries and NOMINMAX/_USE_MATH_DEFINES under the Windows clang toolchain). These are no-ops on Linux (text mode == binary mode, forward-slash paths), and the two added stream synchronizations around the bulk save/load copies are correctness improvements on every platform.
The installed package config gains a HIP branch (find_dependency(hip) when GTSAM_POINTS_USE_HIP) mirroring the CUDA one so a downstream find_package(gtsam_points) resolves on a ROCm-only host. The README documents the BUILD_WITH_HIP option and AMD GPU arch selection alongside the existing CUDA build.
The GPU library requires GTSAM with GTSAM_UNSTABLE; the Ubuntu apt libgtsam-dev (4.2.0) ships no gtsam_unstable, so a source build of GTSAM 4.3a0 (the default GTSAM_BUILD_UNSTABLE=ON) is used.
Authored with the assistance of Claude (Anthropic).
Test Plan: Built and validated on real GPUs on four AMD architectures: AMD MI250X (gfx90a, CDNA2, ROCm 7.2.1), AMD Radeon Pro W7800 (gfx1100, RDNA3, ROCm 7.2.1), AMD Radeon PRO V710 (gfx1101, RDNA3, Windows, ROCm 7.14), and AMD Radeon RX 9070 XT (gfx1201, RDNA4, Windows, ROCm 7.14).
Result on every arch: 87/87 tests passed, 0 failed. The GPU gates pass: test_matching_cost_factors VGICP_CUDA_* (IntegratedVGICPFactorGPU converges to ground truth across FORWARD/BACKWARD/UNARY/MULTI_FRAME; rot error << 0.015 rad, trans error << 0.15 m); test_voxelmap VoxelMapGPU/_Intensity/_IO; test_types TestPointCloudGPU. Two runs agree to tolerance. The non-GPU suite (test_alignment, test_kdtree, test_loam_factors, test_bundle_adjustment, CPU ICP/GICP/VGICP, etc.) is unchanged. The CUDA build was separately compiled with nvcc 12.8 (BUILD_WITH_CUDA=ON, BUILD_WITH_HIP=OFF, CMAKE_CUDA_ARCHITECTURES=80) to confirm it is unaffected.