
[no-ci] coverage: fix Linux git ownership and add Windows stack wrapper#1829

Merged
mdboom merged 11 commits into NVIDIA:main from rluo8:main
Apr 16, 2026
Conversation

@rluo8
Contributor

@rluo8 rluo8 commented Mar 30, 2026

This is to fix the following error when running coverage for Linux:
fatal: detected dubious ownership in repository at '/__w/cuda-python/cuda-python'
To add an exception for this directory, call:

    git config --global --add safe.directory /__w/cuda-python/cuda-python

git introspection failed: fatal: detected dubious ownership in repository at '/__w/cuda-python/cuda-python'

This is caused by recent changes in setuptools_scm: setuptools_scm 10.0+ treats git ownership mismatches as hard errors.
The fix is to change the ownership of the workspace to the user that runs the tests inside the container.

@rluo8
Contributor Author

rluo8 commented Mar 30, 2026

Hi @mdboom , could you please help review it?
Thanks!

Contributor

@mdboom mdboom left a comment


IIUC, this happens when a git repository is queried by a different user than the one who owns the files in the git repo. Do we know why this is happening? Presumably the whole GHA workflow is run by the same user. Basically, I want to understand the root cause before approving (and also to understand if we should expect to see this issue with setuptools-scm in other workflows...)

@mdboom mdboom mentioned this pull request Mar 30, 2026
@rluo8
Contributor Author

rluo8 commented Mar 31, 2026

IIUC, this happens when a git repository is queried by a different user than the one who owns the files in the git repo. Do we know why this is happening? Presumably the whole GHA workflow is run by the same user. Basically, I want to understand the root cause before approving (and also to understand if we should expect to see this issue with setuptools-scm in other workflows...)

@mdboom This problem was exposed by the setuptools_scm upgrade. Previously, setuptools_scm 9.2.2 was used; now setuptools_scm 10.0.5 is pulled in when installing cuda_pathfinder.
The workspace directory was created by the runner user (which is not root) on the host and mounted into the Docker container. The pip install was executed by root inside the container, so root and the workspace directory owner differed, which caused git to reject the repository with "dubious ownership."

When running pip install -v . under cuda_pathfinder, it eventually calls _git_toplevel in vcs_versioning/_file_finders/_git.py.
In setuptools_scm version 10.0.5 (via vcs-versioning), this function raises SystemExit when detecting an ownership mismatch:
def _git_toplevel(path: str) -> str | None:
    ...
    res = _run(["git", "rev-parse", "HEAD"], cwd=cwd)
    if res.returncode:
        if "--add safe.directory" in res.stderr and not os.environ.get(
            "SETUPTOOLS_SCM_IGNORE_DUBIOUS_OWNER"
        ):
            raise SystemExit(
                "git introspection failed: {}".format(res.stderr.split("\n")[0])
            )

In version 9.2.2, the same function simply logged the error and returned None, allowing the build to continue:
def _git_toplevel(path: str) -> str | None:
    ...
    res = _run(["git", "rev-parse", "HEAD"], cwd=cwd)
    if res.returncode:
        log.error("listing git files failed - pretending there aren't any")
        return None

One workaround is to mark the workspace as a safe directory after checkout: git config --global --add safe.directory "$GITHUB_WORKSPACE".
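As a rough sketch, that workaround could be a workflow step right after checkout (the step names and placement here are illustrative, not the actual coverage.yml change):

```yaml
# Illustrative coverage.yml fragment: trust the mounted workspace right after checkout.
- name: Checkout
  uses: actions/checkout@v4

- name: Mark workspace as a safe git directory
  run: git config --global --add safe.directory "$GITHUB_WORKSPACE"
```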

@mdboom
Contributor

mdboom commented Mar 31, 2026

Thanks for the explanation, @rluo8. Can we try removing this line instead?

https://github.com/NVIDIA/cuda-python/blob/main/.github/workflows/coverage.yml#L49

That may fix the problem in a better way without ignoring a potentially important safety check. I experimented with removing it on our main CI, and it doesn't seem to be necessary, so maybe it will work for coverage as well.

@rluo8
Contributor Author

rluo8 commented Apr 1, 2026

Thanks @mdboom. I've removed the options: -u root --security-opt seccomp=unconfined --shm-size 16g line from the container config. Could you help trigger the coverage workflow to see if it works?

@rwgk rwgk added bug Something isn't working P1 Medium priority - Should do CI/CD CI/CD infrastructure labels Apr 1, 2026
@rwgk rwgk added this to the ci backlog milestone Apr 1, 2026
@rwgk rwgk added bug Something isn't working and removed bug Something isn't working labels Apr 1, 2026
@rparolin rparolin removed this from the ci backlog milestone Apr 2, 2026
@mdboom
Contributor

mdboom commented Apr 3, 2026

Thanks @mdboom. I've removed the options: -u root --security-opt seccomp=unconfined --shm-size 16g line from the container config. Could you help trigger the coverage workflow to see if it works?

Here's the coverage workflow: https://github.com/NVIDIA/cuda-python/actions/runs/23949027494

@rluo8
Contributor Author

rluo8 commented Apr 3, 2026

Thanks @mdboom. I've removed the options: -u root --security-opt seccomp=unconfined --shm-size 16g line from the container config. Could you help trigger the coverage workflow to see if it works?

Here's the coverage workflow: https://github.com/NVIDIA/cuda-python/actions/runs/23949027494

The coverage workflow failed. It seems that the Docker container might use root as the default user.
Could we change the owner of the workspace to the user who runs the tests inside the container, after checking out the workspace?
chown -R $(id -u):$(id -g) "$GITHUB_WORKSPACE"
After changing the owner, the git command should pass, as there is no longer dubious ownership.
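A minimal illustrative sketch of such a step (the step name is hypothetical; it would run right after checkout, inside the container):

```yaml
# Illustrative coverage.yml step: give the in-container user ownership of the
# workspace that was checked out by the host runner user.
- name: Fix workspace ownership
  run: chown -R "$(id -u):$(id -g)" "$GITHUB_WORKSPACE"
```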

@mdboom
Contributor

mdboom commented Apr 6, 2026

Thanks @mdboom. I've removed the options: -u root --security-opt seccomp=unconfined --shm-size 16g line from the container config. Could you help trigger the coverage workflow to see if it works?

Here's the coverage workflow: https://github.com/NVIDIA/cuda-python/actions/runs/23949027494

The coverage workflow failed. It seems that the docker container might use root as default user. Could we change the owner of the workspace to the user who ran tests inside the container after checking out the workspace? chown -R $(id -u):$(id -g) "$GITHUB_WORKSPACE" After changing the owner, git command should pass as there is no dubious ownership.

That seems like a good approach. Would you mind updating the PR to do that and then I'll kick off another coverage run.

@rluo8
Contributor Author

rluo8 commented Apr 7, 2026

Thanks @mdboom. I've removed the options: -u root --security-opt seccomp=unconfined --shm-size 16g line from the container config. Could you help trigger the coverage workflow to see if it works?

Here's the coverage workflow: https://github.com/NVIDIA/cuda-python/actions/runs/23949027494

The coverage workflow failed. It seems that the docker container might use root as default user. Could we change the owner of the workspace to the user who ran tests inside the container after checking out the workspace? chown -R $(id -u):$(id -g) "$GITHUB_WORKSPACE" After changing the owner, git command should pass as there is no dubious ownership.

That seems like a good approach. Would you mind updating the PR to do that and then I'll kick off another coverage run.

Thanks, Mike. I have updated the PR.

@rluo8 rluo8 requested a review from mdboom April 7, 2026 01:10
@rluo8
Contributor Author

rluo8 commented Apr 8, 2026

Also updated the code to run the cuda.core tests for coverage on a thread with a larger stack size. When running locally, some tests appeared to fail due to the default Python thread stack size limit on Windows.

@rluo8
Contributor Author

rluo8 commented Apr 14, 2026

Hi @mdboom , could you please help trigger the coverage workflow? Thanks!

@rwgk rwgk changed the title coverage: add git safe.directory for linux builds [no-ci] coverage: add git safe.directory for linux builds Apr 15, 2026
@rwgk
Collaborator

rwgk commented Apr 15, 2026

I added [no-ci] to the PR title because this PR only changes .github/workflows/coverage.yml

@rwgk rwgk added this to the cuda.core v1.0.0 milestone Apr 15, 2026
@rwgk rwgk added P1 Medium priority - Should do and removed P1 Medium priority - Should do labels Apr 15, 2026
@rwgk
Collaborator

rwgk commented Apr 15, 2026

I just triggered this coverage run: https://github.com/NVIDIA/cuda-python/actions/runs/24431536905

@rluo8
Contributor Author

rluo8 commented Apr 15, 2026

Thanks @rwgk. The cuda_pathfinder install on Linux passes now, and the workflow run passed.

Comment thread: .github/workflows/coverage.yml (Outdated)
run: |
cd "${{ steps.install-root.outputs.INSTALL_ROOT }}"
"$GITHUB_WORKSPACE/.venv/Scripts/pytest" -v --cov=./cuda --cov-append --cov-context=test --cov-config="$GITHUB_WORKSPACE/.coveragerc" "$GITHUB_WORKSPACE/cuda_core/tests"
"$GITHUB_WORKSPACE/.venv/Scripts/python" << PYTEST_EOF
Collaborator


Generated by Cursor GPT-5.4 Extra High Fast with a couple prompts (it's a suggestion to make this DRY):


One cleanup idea: the new cuda.core block looks like it is only changing the execution model from

"$GITHUB_WORKSPACE/.venv/Scripts/pytest" ...

to "run the same pytest args on a freshly-created Python thread after threading.stack_size(8 * 1024 * 1024)".

I do not think there is a native one-line pytest / Actions knob for "run this with an 8 MB stack", so the wrapper approach makes sense. That said, would you consider factoring the duplicated bindings / core heredoc into a small helper, or at least using a shorter inline wrapper? The current ~30-line block seems to be mostly plumbing rather than behavior change.

For example, the inline version could be reduced to something like:

import concurrent.futures
import os
import sys
import threading
import pytest

os.chdir(r'${{ steps.install-root.outputs.INSTALL_ROOT }}')
threading.stack_size(8 * 1024 * 1024)

args = [
    '-v',
    '--cov=./cuda',
    '--cov-append',
    '--cov-context=test',
    f'--cov-config={os.environ["GITHUB_WORKSPACE"]}/.coveragerc',
    f'{os.environ["GITHUB_WORKSPACE"]}/cuda_core/tests',
]

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as ex:
    sys.exit(ex.submit(pytest.main, args).result())

Even better might be a tiny checked-in helper that both steps call, so the workflow stays readable and the "8 MB stack thread" workaround only lives in one place.

Agent notes / clues

  • Scope: the only changed file in PR 1829 is .github/workflows/coverage.yml.
  • The key semantic change in the cuda.core step is the same workaround already used for cuda.bindings: run pytest.main(...) on a newly-created thread after threading.stack_size(8 * 1024 * 1024).
  • I did not see another meaningful behavior change in the expanded block:
    • same pytest flags
    • same coverage config
    • same target test directory
    • same effective working directory
    • exit code is still propagated
  • os.chdir(...) in the Python snippet is probably redundant because the shell step already does cd "${{ steps.install-root.outputs.INSTALL_ROOT }}", but keeping it is harmless.
  • There is no obvious native one-liner in GitHub Actions / pytest / CPython to say "run this test invocation with an 8 MB stack"; some Python wrapper is required because threading.stack_size(...) only affects threads created after the call.
  • If another agent wants to refactor instead of just comment, the likely clean option is a helper script, something like:
    • ci/tools/run_pytest_with_stack.py
    • inputs: cwd, test path, stack size, and passthrough pytest args
    • behavior: call threading.stack_size(...), spawn one thread, run pytest.main(args), exit with its code
  • Alternative smaller inline version: ThreadPoolExecutor(max_workers=1) is a compact replacement for the manual result dict plus Thread.start()/join().
  • Practical review angle: this is a readability / maintainability suggestion, not a correctness concern. The workflow was manually triggered on an upstream branch containing the PR commit, and it passed.
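A hypothetical sketch of such a checked-in helper, following the notes above (the flag names and exact interface are assumptions, not the merged implementation):

```python
# Hypothetical sketch of ci/tools/run_pytest_with_stack.py; the flag names
# and interface are assumptions, not the merged implementation.
import argparse
import os
import sys
import threading


def run_with_stack(func, stack_bytes=8 * 1024 * 1024):
    """Run func() on a fresh thread created after raising the stack size.

    threading.stack_size() only affects threads created after the call, so
    the work must happen on a newly spawned thread, not the main thread.
    """
    threading.stack_size(stack_bytes)
    result = {"code": 1}  # default to failure if the thread dies unexpectedly

    def target():
        result["code"] = func()

    worker = threading.Thread(target=target)
    worker.start()
    worker.join()
    return result["code"]


def main():
    parser = argparse.ArgumentParser(description="Run pytest on a big-stack thread")
    parser.add_argument("--cwd", help="directory to run the tests from")
    parser.add_argument("--stack-size", type=int, default=8 * 1024 * 1024)
    parser.add_argument("pytest_args", nargs=argparse.REMAINDER)
    ns = parser.parse_args()
    if ns.cwd:
        os.chdir(ns.cwd)
    import pytest  # imported lazily so run_with_stack is testable without pytest

    sys.exit(run_with_stack(lambda: int(pytest.main(ns.pytest_args)), ns.stack_size))


if __name__ == "__main__":
    main()
```

Both the cuda.bindings and cuda.core steps could then call this one script with their own pytest arguments, keeping the 8 MB stack workaround in a single place.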

Collaborator


I forgot to mention before: I think one of these is redundant:

         cd "${{ steps.install-root.outputs.INSTALL_ROOT }}"
...
          os.chdir(r'${{ steps.install-root.outputs.INSTALL_ROOT }}')

If you add the suggested helper, I'd drop the cd and pass steps.install-root.outputs.INSTALL_ROOT to the helper; I believe that'll be most readable.

Contributor Author


Thanks Ralf. It's reasonable to add a helper to make the workflow cleaner. I'll update that.

@rwgk
Collaborator

rwgk commented Apr 15, 2026

Another tiny thing: the PR title and description seem to be out of sync with the code changes. It'd be great to revise both.

@rluo8 rluo8 changed the title [no-ci] coverage: add git safe.directory for linux builds [no-ci] coverage: fix Linux git ownership and add Windows stack wrapper Apr 15, 2026
@rwgk
Collaborator

rwgk commented Apr 15, 2026

This PR looks awesome now. I'll re-trigger the coverage workflow to retest.

@rwgk
Collaborator

rwgk commented Apr 15, 2026

The workflow is green again, but on closer inspection I realized that although the workflow publishes coverage, there are remaining annotations we should definitely follow up on:

[Screenshot 2026-04-15 11:29: workflow run annotations]

Below is a Cursor-generated analysis.

My vote: Let's merge this PR as-is, then debug the failures.


From Cursor GPT-5.4 Extra High Fast:

I took a closer look at the latest successful dispatch on this branch and compared it with both the previous green PR run and the latest scheduled main run:

What I found:

  • The latest tip is easier to review than before: the 8 MB stack workaround has now been factored into ci/tools/run_pytest_with_stack.py instead of duplicating large inline heredocs in .github/workflows/coverage.yml.
  • The latest run really does publish coverage end-to-end, so from that perspective this branch is a net improvement over current main, where the recent scheduled run failed before coverage could be combined/deployed.
  • However, the annotations are not fake UI noise. They correspond to real non-zero exits inside steps that are marked continue-on-error: true, so the workflow stays green and still uploads/deploys coverage artifacts.
  • In the latest run, Windows Run cuda.core tests (with 8MB stack) still ends with exit code 139 / segfault.
  • In the latest run, Linux Run cuda.bindings tests returns exit code 1 (cuda_bindings/tests/nvml/test_pci.py::test_discover_gpus), and Linux Run cuda.core tests returns exit code 1 with a large batch of IPC/example failures, many with child exit -11.
  • Importantly, these hidden failures/crashes were already present in the previous green PR run as well. In particular, the earlier run also had:
    • a Windows cuda.core segfault with exit code 139
    • Linux cuda.bindings failures (including test_discover_gpus, and in that run also test_cufile.py::test_buf_register_host_memory)
    • Linux cuda.core IPC/example failures with child exit -11
  • So I do not think the new helper script introduced those failures; it mainly makes the workflow more readable.
  • The Linux pip warning about running as root looks like the normal container warning and does not seem worth addressing in this PR.
  • The Node 20 deprecation warning is coming from ilammy/msvc-dev-cmd@v1, not from this PR's logic. That seems like a separate maintenance task before GitHub enforces Node 24, not something blocking this coverage fix.

My recommendation:

  • Keep this PR focused on restoring/preserving coverage publication and readability of the coverage workflow.
  • Do not treat the green workflow as meaning "all covered suites passed"; right now it really means "coverage published despite tolerated test failures."
  • Track the remaining failing/crashing coverage steps separately, since they are real and pre-existing:
    • Windows cuda.core coverage segfault (exit 139)
    • Linux cuda.bindings coverage failures (nvml, and possibly cufile)
    • Linux cuda.core IPC/example failures with child exit -11
  • Ignore the pip as root warning for now unless someone specifically wants to spend effort removing cosmetic warnings from the workflow UI.
  • Treat the Node 20 deprecation warning as a separate follow-up issue/PR to update or replace ilammy/msvc-dev-cmd@v1 before the GitHub runtime deadline.

If helpful, I think a small follow-up issue would be justified just to document that the coverage workflow is intentionally tolerant of failing test steps today, so that the green check is not over-interpreted.

@rwgk
Collaborator

rwgk commented Apr 15, 2026

@mdboom this PR is currently blocked on your request for changes. Could you please re-review? I believe it's a good step forward to merge as-is.

@rluo8
Contributor Author

rluo8 commented Apr 16, 2026

Thanks @rwgk. I noticed the test failures and was investigating them. Previously, the workflow was designed to continue on error, to make sure all the tests could run and the coverage pipeline could complete; it isn't blocked by test failures. Maybe we could add warnings about the test failures. I'll follow up on your recommendations.

@rwgk
Collaborator

rwgk commented Apr 16, 2026

I'm not familiar with the workflow and can only offer high-level ideas:

AFAIK there is no way to generate a warning; GitHub Actions only supports binary success/failure outcomes.

From the Cursor output I saw, it seems the workflow is generating artifacts, and possibly the motivation for the continue-on-error design was to generate as many artifacts as possible. I can see how that's valuable, and nice to keep. I'd look for a way to track errors, and then fail the workflow at the very end if there were any.

I'm assuming we want to debug and fix all errors.
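One way to sketch that idea, tracking per-step outcomes and failing at the very end, could look like this (step ids and names are illustrative, not from this PR):

```yaml
# Illustrative coverage.yml fragment: tolerate failures per step so all
# artifacts are still produced, then fail the workflow if any step failed.
steps:
  - name: Run cuda.bindings tests
    id: bindings-tests
    continue-on-error: true
    run: pytest -v --cov=./cuda cuda_bindings/tests

  - name: Run cuda.core tests
    id: core-tests
    continue-on-error: true
    run: pytest -v --cov=./cuda --cov-append cuda_core/tests

  # ... combine and publish coverage artifacts here ...

  - name: Fail if any test step failed
    if: always()
    run: |
      if [ "${{ steps.bindings-tests.outcome }}" != "success" ] || \
         [ "${{ steps.core-tests.outcome }}" != "success" ]; then
        echo "One or more test steps failed; see annotations above."
        exit 1
      fi
```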

Contributor

@mdboom mdboom left a comment


Thanks. This looks great.

@mdboom mdboom merged commit fd5700d into NVIDIA:main Apr 16, 2026
20 checks passed
