
[no-ci] coverage: fix Linux git ownership and add Windows stack wrapper#1829

Merged
mdboom merged 11 commits into NVIDIA:main from rluo8:main
Apr 16, 2026
Conversation

@rluo8
Contributor

@rluo8 rluo8 commented Mar 30, 2026

This is to fix the following error when running coverage for Linux:
fatal: detected dubious ownership in repository at '/__w/cuda-python/cuda-python'
To add an exception for this directory, call:

    git config --global --add safe.directory /__w/cuda-python/cuda-python

git introspection failed: fatal: detected dubious ownership in repository at '/__w/cuda-python/cuda-python'

This is caused by recent changes in setuptools_scm: setuptools_scm 10.0+ treats git ownership mismatches as hard errors.
The fix is to change the ownership of the workspace to the user that runs the tests inside the container.

@rluo8
Contributor Author

rluo8 commented Mar 30, 2026

Hi @mdboom , could you please help review it?
Thanks!

Contributor

@mdboom mdboom left a comment


IIUC, this happens when a git repository is queried by a different user than the one who owns the files in the git repo. Do we know why this is happening? Presumably the whole GHA workflow is run by the same user. Basically, I want to understand the root cause before approving (and also to understand if we should expect to see this issue with setuptools-scm in other workflows...)

@mdboom mdboom mentioned this pull request Mar 30, 2026
@rluo8
Contributor Author

rluo8 commented Mar 31, 2026

IIUC, this happens when a git repository is queried by a different user than the one who owns the files in the git repo. Do we know why this is happening? Presumably the whole GHA workflow is run by the same user. Basically, I want to understand the root cause before approving (and also to understand if we should expect to see this issue with setuptools-scm in other workflows...)

@mdboom This problem was exposed by the setuptools_scm upgrade. Previously, setuptools_scm 9.2.2 was used; now setuptools_scm 10.0.5 is pulled in when installing cuda_pathfinder.
The workspace directory was created by the runner user (which is not root) on the host and mounted into the Docker container. The pip install was executed by root inside the container, so root and the workspace directory owner differed, which caused git to reject the repository with "dubious ownership."

When running pip install -v . under cuda_pathfinder, it eventually calls _git_toplevel in vcs_versioning/_file_finders/_git.py.
In setuptools_scm version 10.0.5 (via vcs-versioning), this function raises SystemExit when detecting an ownership mismatch:
def _git_toplevel(path: str) -> str | None:
    ...
    res = _run(["git", "rev-parse", "HEAD"], cwd=cwd)
    if res.returncode:
        if "--add safe.directory" in res.stderr and not os.environ.get(
            "SETUPTOOLS_SCM_IGNORE_DUBIOUS_OWNER"
        ):
            raise SystemExit(
                "git introspection failed: {}".format(res.stderr.split("\n")[0])
            )

In version 9.2.2, the same function simply logged the error and returned None, allowing the build to continue:
def _git_toplevel(path: str) -> str | None:
    ...
    res = _run(["git", "rev-parse", "HEAD"], cwd=cwd)
    if res.returncode:
        log.error("listing git files failed - pretending there aren't any")
        return None

One workaround is to mark the workspace as a safe directory after checkout: git config --global --add safe.directory "$GITHUB_WORKSPACE".
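As a rough sketch, that workaround could be a workflow step right after checkout (the step names and placement here are illustrative, not the actual coverage.yml change):

```yaml
# Illustrative coverage.yml fragment: trust the mounted workspace right after checkout.
- name: Checkout
  uses: actions/checkout@v4

- name: Mark workspace as a safe git directory
  run: git config --global --add safe.directory "$GITHUB_WORKSPACE"
```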

@mdboom
Contributor

mdboom commented Mar 31, 2026

Thanks for the explanation, @rluo8. Can we try removing this line instead?

https://github.com/NVIDIA/cuda-python/blob/main/.github/workflows/coverage.yml#L49

That may fix the problem in a better way without ignoring a potentially important safety check. I experimented with removing it on our main CI, and it doesn't seem to be necessary, so maybe it will work for coverage as well.

@rluo8
Contributor Author

rluo8 commented Apr 1, 2026

Thanks @mdboom. I've removed the options: -u root --security-opt seccomp=unconfined --shm-size 16g line from the container config. Could you help trigger the coverage workflow to see if it works?

@rwgk rwgk added bug Something isn't working P1 Medium priority - Should do CI/CD CI/CD infrastructure labels Apr 1, 2026
@rwgk rwgk added this to the ci backlog milestone Apr 1, 2026
@rwgk rwgk added bug Something isn't working and removed bug Something isn't working labels Apr 1, 2026
@rparolin rparolin removed this from the ci backlog milestone Apr 2, 2026
@mdboom
Contributor

mdboom commented Apr 3, 2026

Thanks @mdboom. I've removed the options: -u root --security-opt seccomp=unconfined --shm-size 16g line from the container config. Could you help trigger the coverage workflow to see if it works?

Here's the coverage workflow: https://github.com/NVIDIA/cuda-python/actions/runs/23949027494

@rluo8
Contributor Author

rluo8 commented Apr 3, 2026

Thanks @mdboom. I've removed the options: -u root --security-opt seccomp=unconfined --shm-size 16g line from the container config. Could you help trigger the coverage workflow to see if it works?

Here's the coverage workflow: https://github.com/NVIDIA/cuda-python/actions/runs/23949027494

The coverage workflow failed. It seems that the Docker container might use root as the default user.
Could we change the owner of the workspace to the user who runs the tests inside the container, after checking out the workspace?
chown -R $(id -u):$(id -g) "$GITHUB_WORKSPACE"
After changing the owner, the git command should pass, as there is no longer dubious ownership.
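A minimal illustrative sketch of such a step (the step name is hypothetical; it would run right after checkout, inside the container):

```yaml
# Illustrative coverage.yml step: give the in-container user ownership of the
# workspace that was checked out by the host runner user.
- name: Fix workspace ownership
  run: chown -R "$(id -u):$(id -g)" "$GITHUB_WORKSPACE"
```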

@mdboom
Contributor

mdboom commented Apr 6, 2026

Thanks @mdboom. I've removed the options: -u root --security-opt seccomp=unconfined --shm-size 16g line from the container config. Could you help trigger the coverage workflow to see if it works?

Here's the coverage workflow: https://github.com/NVIDIA/cuda-python/actions/runs/23949027494

The coverage workflow failed. It seems that the docker container might use root as default user. Could we change the owner of the workspace to the user who ran tests inside the container after checking out the workspace? chown -R $(id -u):$(id -g) "$GITHUB_WORKSPACE" After changing the owner, git command should pass as there is no dubious ownership.

That seems like a good approach. Would you mind updating the PR to do that and then I'll kick off another coverage run.

@rluo8
Contributor Author

rluo8 commented Apr 7, 2026

Thanks @mdboom. I've removed the options: -u root --security-opt seccomp=unconfined --shm-size 16g line from the container config. Could you help trigger the coverage workflow to see if it works?

Here's the coverage workflow: https://github.com/NVIDIA/cuda-python/actions/runs/23949027494

The coverage workflow failed. It seems that the docker container might use root as default user. Could we change the owner of the workspace to the user who ran tests inside the container after checking out the workspace? chown -R $(id -u):$(id -g) "$GITHUB_WORKSPACE" After changing the owner, git command should pass as there is no dubious ownership.

That seems like a good approach. Would you mind updating the PR to do that and then I'll kick off another coverage run.

Thanks, Mike. I have updated the PR.

@rluo8 rluo8 requested a review from mdboom April 7, 2026 01:10
@rluo8
Contributor Author

rluo8 commented Apr 8, 2026

Also updated the code to run the cuda.core tests for coverage on a thread with a larger stack size. When running locally, some tests appeared to fail due to the default Python thread stack size limit on Windows.

@rluo8
Contributor Author

rluo8 commented Apr 14, 2026

Hi @mdboom , could you please help trigger the coverage workflow? Thanks!

@rwgk rwgk changed the title coverage: add git safe.directory for linux builds [no-ci] coverage: add git safe.directory for linux builds Apr 15, 2026
@rwgk
Collaborator

rwgk commented Apr 15, 2026

I added [no-ci] to the PR title because this PR only changes .github/workflows/coverage.yml

@rwgk rwgk added this to the cuda.core v1.0.0 milestone Apr 15, 2026
@rwgk rwgk added P1 Medium priority - Should do and removed P1 Medium priority - Should do labels Apr 15, 2026
@rwgk
Collaborator

rwgk commented Apr 15, 2026

I just triggered this coverage run: https://github.com/NVIDIA/cuda-python/actions/runs/24431536905

@rluo8
Contributor Author

rluo8 commented Apr 15, 2026

Thanks @rwgk. The cuda_pathfinder install on Linux passes now, and the workflow run passed.

Comment thread: .github/workflows/coverage.yml (Outdated)
run: |
cd "${{ steps.install-root.outputs.INSTALL_ROOT }}"
"$GITHUB_WORKSPACE/.venv/Scripts/pytest" -v --cov=./cuda --cov-append --cov-context=test --cov-config="$GITHUB_WORKSPACE/.coveragerc" "$GITHUB_WORKSPACE/cuda_core/tests"
"$GITHUB_WORKSPACE/.venv/Scripts/python" << PYTEST_EOF
Collaborator


Generated by Cursor GPT-5.4 Extra High Fast with a couple prompts (it's a suggestion to make this DRY):


One cleanup idea: the new cuda.core block looks like it is only changing the execution model from

"$GITHUB_WORKSPACE/.venv/Scripts/pytest" ...

to "run the same pytest args on a freshly-created Python thread after threading.stack_size(8 * 1024 * 1024)".

I do not think there is a native one-line pytest / Actions knob for "run this with an 8 MB stack", so the wrapper approach makes sense. That said, would you consider factoring the duplicated bindings / core heredoc into a small helper, or at least using a shorter inline wrapper? The current ~30-line block seems to be mostly plumbing rather than behavior change.

For example, the inline version could be reduced to something like:

import concurrent.futures
import os
import sys
import threading
import pytest

os.chdir(r'${{ steps.install-root.outputs.INSTALL_ROOT }}')
threading.stack_size(8 * 1024 * 1024)

args = [
    '-v',
    '--cov=./cuda',
    '--cov-append',
    '--cov-context=test',
    f'--cov-config={os.environ["GITHUB_WORKSPACE"]}/.coveragerc',
    f'{os.environ["GITHUB_WORKSPACE"]}/cuda_core/tests',
]

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as ex:
    sys.exit(ex.submit(pytest.main, args).result())

Even better might be a tiny checked-in helper that both steps call, so the workflow stays readable and the "8 MB stack thread" workaround only lives in one place.

Agent notes / clues

  • Scope: the only changed file in PR 1829 is .github/workflows/coverage.yml.
  • The key semantic change in the cuda.core step is the same workaround already used for cuda.bindings: run pytest.main(...) on a newly-created thread after threading.stack_size(8 * 1024 * 1024).
  • I did not see another meaningful behavior change in the expanded block:
    • same pytest flags
    • same coverage config
    • same target test directory
    • same effective working directory
    • exit code is still propagated
  • os.chdir(...) in the Python snippet is probably redundant because the shell step already does cd "${{ steps.install-root.outputs.INSTALL_ROOT }}", but keeping it is harmless.
  • There is no obvious native one-liner in GitHub Actions / pytest / CPython to say "run this test invocation with an 8 MB stack"; some Python wrapper is required because threading.stack_size(...) only affects threads created after the call.
  • If another agent wants to refactor instead of just comment, the likely clean option is a helper script, something like:
    • ci/tools/run_pytest_with_stack.py
    • inputs: cwd, test path, stack size, and passthrough pytest args
    • behavior: call threading.stack_size(...), spawn one thread, run pytest.main(args), exit with its code
  • Alternative smaller inline version: ThreadPoolExecutor(max_workers=1) is a compact replacement for the manual result dict plus Thread.start()/join().
  • Practical review angle: this is a readability / maintainability suggestion, not a correctness concern. The workflow was manually triggered on an upstream branch containing the PR commit, and it passed.
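A hypothetical sketch of such a checked-in helper, following the notes above (the flag names and exact interface are assumptions, not the merged implementation):

```python
# Hypothetical sketch of ci/tools/run_pytest_with_stack.py; the flag names
# and interface are assumptions, not the merged implementation.
import argparse
import os
import sys
import threading


def run_with_stack(func, stack_bytes=8 * 1024 * 1024):
    """Run func() on a fresh thread created after raising the stack size.

    threading.stack_size() only affects threads created after the call, so
    the work must happen on a newly spawned thread, not the main thread.
    """
    threading.stack_size(stack_bytes)
    result = {"code": 1}  # default to failure if the thread dies unexpectedly

    def target():
        result["code"] = func()

    worker = threading.Thread(target=target)
    worker.start()
    worker.join()
    return result["code"]


def main():
    parser = argparse.ArgumentParser(description="Run pytest on a big-stack thread")
    parser.add_argument("--cwd", help="directory to run the tests from")
    parser.add_argument("--stack-size", type=int, default=8 * 1024 * 1024)
    parser.add_argument("pytest_args", nargs=argparse.REMAINDER)
    ns = parser.parse_args()
    if ns.cwd:
        os.chdir(ns.cwd)
    import pytest  # imported lazily so run_with_stack is testable without pytest

    sys.exit(run_with_stack(lambda: int(pytest.main(ns.pytest_args)), ns.stack_size))


if __name__ == "__main__":
    main()
```

Both the cuda.bindings and cuda.core steps could then call this one script with their own pytest arguments, keeping the 8 MB stack workaround in a single place.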

Collaborator


I forgot to mention before: I think one of these is redundant:

         cd "${{ steps.install-root.outputs.INSTALL_ROOT }}"
...
          os.chdir(r'${{ steps.install-root.outputs.INSTALL_ROOT }}')

If you add the suggested helper, I'd drop the cd and pass steps.install-root.outputs.INSTALL_ROOT to the helper; I believe that'll be most readable.

Contributor Author


Thanks Ralf. It's reasonable to add a helper to make the workflow cleaner. I'll update that.

@rwgk
Collaborator

rwgk commented Apr 15, 2026

Another tiny thing: the PR title and description seem to be out of sync with the code changes. It'd be great to revise both.

@rluo8 rluo8 changed the title [no-ci] coverage: add git safe.directory for linux builds [no-ci] coverage: fix Linux git ownership and add Windows stack wrapper Apr 15, 2026
@rwgk
Collaborator

rwgk commented Apr 15, 2026

This PR looks awesome now. I'll re-trigger the coverage workflow to retest.

@rwgk
Collaborator

rwgk commented Apr 15, 2026

The workflow is green again, but on closer inspection I realized that although the workflow publishes coverage, there are remaining annotations we should definitely follow up on:

[Screenshot 2026-04-15 11:29: workflow run annotations]

Below is a Cursor-generated analysis.

My vote: Let's merge this PR as-is, then debug the failures.


From Cursor GPT-5.4 Extra High Fast:

I took a closer look at the latest successful dispatch on this branch and compared it with both the previous green PR run and the latest scheduled main run:

What I found:

  • The latest tip is easier to review than before: the 8 MB stack workaround has now been factored into ci/tools/run_pytest_with_stack.py instead of duplicating large inline heredocs in .github/workflows/coverage.yml.
  • The latest run really does publish coverage end-to-end, so from that perspective this branch is a net improvement over current main, where the recent scheduled run failed before coverage could be combined/deployed.
  • However, the annotations are not fake UI noise. They correspond to real non-zero exits inside steps that are marked continue-on-error: true, so the workflow stays green and still uploads/deploys coverage artifacts.
  • In the latest run, Windows Run cuda.core tests (with 8MB stack) still ends with exit code 139 / segfault.
  • In the latest run, Linux Run cuda.bindings tests returns exit code 1 (cuda_bindings/tests/nvml/test_pci.py::test_discover_gpus), and Linux Run cuda.core tests returns exit code 1 with a large batch of IPC/example failures, many with child exit -11.
  • Importantly, these hidden failures/crashes were already present in the previous green PR run as well. In particular, the earlier run also had:
    • a Windows cuda.core segfault with exit code 139
    • Linux cuda.bindings failures (including test_discover_gpus, and in that run also test_cufile.py::test_buf_register_host_memory)
    • Linux cuda.core IPC/example failures with child exit -11
  • So I do not think the new helper script introduced those failures; it mainly makes the workflow more readable.
  • The Linux pip warning about running as root looks like the normal container warning and does not seem worth addressing in this PR.
  • The Node 20 deprecation warning is coming from ilammy/msvc-dev-cmd@v1, not from this PR's logic. That seems like a separate maintenance task before GitHub enforces Node 24, not something blocking this coverage fix.

My recommendation:

  • Keep this PR focused on restoring/preserving coverage publication and readability of the coverage workflow.
  • Do not treat the green workflow as meaning "all covered suites passed"; right now it really means "coverage published despite tolerated test failures."
  • Track the remaining failing/crashing coverage steps separately, since they are real and pre-existing:
    • Windows cuda.core coverage segfault (exit 139)
    • Linux cuda.bindings coverage failures (nvml, and possibly cufile)
    • Linux cuda.core IPC/example failures with child exit -11
  • Ignore the pip as root warning for now unless someone specifically wants to spend effort removing cosmetic warnings from the workflow UI.
  • Treat the Node 20 deprecation warning as a separate follow-up issue/PR to update or replace ilammy/msvc-dev-cmd@v1 before the GitHub runtime deadline.

If helpful, I think a small follow-up issue would be justified just to document that the coverage workflow is intentionally tolerant of failing test steps today, so that the green check is not over-interpreted.

@rwgk
Collaborator

rwgk commented Apr 15, 2026

@mdboom this PR is currently blocked on your request for changes. Could you please re-review? I believe it's a good step forward to merge as-is.

@rluo8
Contributor Author

rluo8 commented Apr 16, 2026

Thanks @rwgk. I noticed the test failures and was investigating them. Previously, the workflow was designed to continue on error, to make sure all the tests could run and the coverage pipeline could complete; it isn't blocked by test failures. Maybe we could add warnings about the test failures. I'll follow up on your recommendations.

@rwgk
Collaborator

rwgk commented Apr 16, 2026

I'm not familiar with the workflow and can only offer high-level ideas:

AFAIK there is no way to generate a warning; GitHub Actions only supports binary success/failure outcomes.

From the Cursor output I saw, it seems the workflow is generating artifacts, and possibly the motivation for the continue-on-error design was to generate as many artifacts as possible. I can see how that's valuable, and nice to keep. I'd look for a way to track errors, and then fail the workflow at the very end if there were any.

I'm assuming we want to debug and fix all errors.
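One way to sketch that idea, tracking per-step outcomes and failing at the very end, could look like this (step ids and names are illustrative, not from this PR):

```yaml
# Illustrative coverage.yml fragment: tolerate failures per step so all
# artifacts are still produced, then fail the workflow if any step failed.
steps:
  - name: Run cuda.bindings tests
    id: bindings-tests
    continue-on-error: true
    run: pytest -v --cov=./cuda cuda_bindings/tests

  - name: Run cuda.core tests
    id: core-tests
    continue-on-error: true
    run: pytest -v --cov=./cuda --cov-append cuda_core/tests

  # ... combine and publish coverage artifacts here ...

  - name: Fail if any test step failed
    if: always()
    run: |
      if [ "${{ steps.bindings-tests.outcome }}" != "success" ] || \
         [ "${{ steps.core-tests.outcome }}" != "success" ]; then
        echo "One or more test steps failed; see annotations above."
        exit 1
      fi
```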

Contributor

@mdboom mdboom left a comment


Thanks. This looks great.

@mdboom mdboom merged commit fd5700d into NVIDIA:main Apr 16, 2026
20 checks passed
