[no-ci] coverage: fix Linux git ownership and add Windows stack wrapper #1829
mdboom merged 11 commits into NVIDIA:main
Conversation
Hi @mdboom, could you please help review it?
mdboom
left a comment
IIUC, this happens when a git repository is queried by a different user than the one who owns the files in the git repo. Do we know why this is happening? Presumably the whole GHA workflow is run by the same user. Basically, I want to understand the root cause before approving (and also to understand if we should expect to see this issue with setuptools-scm in other workflows...)
@mdboom This problem was exposed by the setuptools_scm upgrade. Previously, setuptools_scm 9.2.2 was used; now setuptools_scm 10.0.5 is used when installing cuda_pathfinder. Running `pip install -v .` under cuda_pathfinder eventually calls `_git_toplevel` in `vcs_versioning/_file_finders/_git.py`. In version 9.2.2, the same function simply logged the error and returned None, allowing the build to continue. One of the workarounds is to mark the workspace as a safe directory after checkout: `git config --global --add safe.directory "$GITHUB_WORKSPACE"`.
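In workflow terms, that workaround amounts to one extra shell step right after checkout. A minimal, self-contained sketch (the isolated `HOME` and the hard-coded workspace path are only there to make the sketch runnable anywhere; the real step would just run the `git config` line):

```shell
# Isolate the global git config so this sketch doesn't touch the real one.
export HOME="$(mktemp -d)"
export GITHUB_WORKSPACE="/__w/cuda-python/cuda-python"

# Tell git (and therefore setuptools_scm 10+) to trust the checkout even
# though it is owned by a different user than the one running the build.
git config --global --add safe.directory "$GITHUB_WORKSPACE"

# Verify the entry was recorded in the global config.
git config --global --get-all safe.directory
```

Note this silences the ownership check for that one path rather than fixing the ownership mismatch itself.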
Thanks for the explanation, @rluo8. Can we try removing this line instead? https://github.com/NVIDIA/cuda-python/blob/main/.github/workflows/coverage.yml#L49 That may fix the problem in a better way without ignoring a potentially important safety check. I experimented with removing it on our main CI, and it doesn't seem to be necessary, so maybe it will work for coverage as well.
Thanks @mdboom. I've removed the `options: -u root --security-opt seccomp=unconfined --shm-size 16g` line from the container config. Could you help trigger the coverage workflow to see if it works?
Here's the coverage workflow: https://github.com/NVIDIA/cuda-python/actions/runs/23949027494
The coverage workflow failed. It seems that the Docker container might use root as the default user.
That seems like a good approach. Would you mind updating the PR to do that? Then I'll kick off another coverage run.
Thanks Mike. I have updated the PR.
Also updated the code to run the cuda.core tests for coverage on a thread with a large stack size. When running locally, some tests appeared to fail on Windows due to the Python thread stack size limitation.
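For context, `threading.stack_size()` only affects threads created after it is called, which is why the tests have to run on a fresh thread rather than the main one. A minimal, hypothetical sketch of the pattern (not the exact workflow code; the helper name and the recursion demo are made up for illustration):

```python
import threading


def run_with_large_stack(fn, stack_bytes=8 * 1024 * 1024):
    """Run fn() on a fresh thread created with an enlarged stack.

    threading.stack_size() only applies to threads created *after* the
    call, so the work cannot simply run on the main thread.
    """
    result = {}

    def target():
        result["value"] = fn()

    old = threading.stack_size(stack_bytes)  # returns the previous size
    try:
        t = threading.Thread(target=target)
        t.start()
        t.join()
    finally:
        threading.stack_size(old)  # restore for any later threads
    return result["value"]


def depth(n):
    # Deep recursion as a stand-in for stack-hungry test code.
    return 0 if n == 0 else 1 + depth(n - 1)


print(run_with_large_stack(lambda: depth(500)))  # prints 500
```

In the workflow, the wrapped callable would be `pytest.main(args)` instead of the toy recursion.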
Hi @mdboom, could you please help trigger the coverage workflow? Thanks!
I added [no-ci] to the PR title because this PR only changes .github/workflows/coverage.yml.
I just triggered this coverage run: https://github.com/NVIDIA/cuda-python/actions/runs/24431536905
Thanks @rwgk. The cuda_pathfinder install on Linux passes now, and the workflow run passed.
```
run: |
  cd "${{ steps.install-root.outputs.INSTALL_ROOT }}"
  "$GITHUB_WORKSPACE/.venv/Scripts/pytest" -v --cov=./cuda --cov-append --cov-context=test --cov-config="$GITHUB_WORKSPACE/.coveragerc" "$GITHUB_WORKSPACE/cuda_core/tests"
  "$GITHUB_WORKSPACE/.venv/Scripts/python" << PYTEST_EOF
```
Generated by Cursor GPT-5.4 Extra High Fast with a couple of prompts (it's a suggestion to make this DRY):
One cleanup idea: the new cuda.core block looks like it is only changing the execution model from

`"$GITHUB_WORKSPACE/.venv/Scripts/pytest" ...`

to "run the same pytest args on a freshly-created Python thread after `threading.stack_size(8 * 1024 * 1024)`".

I do not think there is a native one-line pytest / Actions knob for "run this with an 8 MB stack", so the wrapper approach makes sense. That said, would you consider factoring the duplicated bindings / core heredoc into a small helper, or at least using a shorter inline wrapper? The current ~30-line block seems to be mostly plumbing rather than behavior change.
For example, the inline version could be reduced to something like:

```python
import concurrent.futures
import os
import sys
import threading

import pytest

os.chdir(r'${{ steps.install-root.outputs.INSTALL_ROOT }}')
threading.stack_size(8 * 1024 * 1024)
args = [
    '-v',
    '--cov=./cuda',
    '--cov-append',
    '--cov-context=test',
    f'--cov-config={os.environ["GITHUB_WORKSPACE"]}/.coveragerc',
    f'{os.environ["GITHUB_WORKSPACE"]}/cuda_core/tests',
]
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as ex:
    sys.exit(ex.submit(pytest.main, args).result())
```

Even better might be a tiny checked-in helper that both steps call, so the workflow stays readable and the "8 MB stack thread" workaround only lives in one place.
Agent notes / clues

- Scope: the only changed file in PR 1829 is `.github/workflows/coverage.yml`.
- The key semantic change in the `cuda.core` step is the same workaround already used for `cuda.bindings`: run `pytest.main(...)` on a newly-created thread after `threading.stack_size(8 * 1024 * 1024)`.
- I did not see another meaningful behavior change in the expanded block:
  - same pytest flags
  - same coverage config
  - same target test directory
  - same effective working directory
  - exit code is still propagated
- `os.chdir(...)` in the Python snippet is probably redundant because the shell step already does `cd "${{ steps.install-root.outputs.INSTALL_ROOT }}"`, but keeping it is harmless.
- There is no obvious native one-liner in GitHub Actions / pytest / CPython to say "run this test invocation with an 8 MB stack"; some Python wrapper is required because `threading.stack_size(...)` only affects threads created after the call.
- If another agent wants to refactor instead of just comment, the likely clean option is a helper script, something like `ci/tools/run_pytest_with_stack.py`:
  - inputs: cwd, test path, stack size, and passthrough pytest args
  - behavior: call `threading.stack_size(...)`, spawn one thread, run `pytest.main(args)`, exit with its code
- Alternative smaller inline version: `ThreadPoolExecutor(max_workers=1)` is a compact replacement for the manual `result` dict plus `Thread.start()`/`join()`.
- Practical review angle: this is a readability / maintainability suggestion, not a correctness concern. The workflow was manually triggered on an upstream branch containing the PR commit, and it passed.
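To make the suggestion concrete, here is one hypothetical shape for such a helper (the path `ci/tools/run_pytest_with_stack.py`, the flag names, and the defaults are all made up, not the PR's actual code; pytest's own arguments go after `--`):

```python
#!/usr/bin/env python3
"""Hypothetical helper: run pytest on a thread with an enlarged stack."""
import argparse
import concurrent.futures
import os
import sys
import threading


def parse_args(argv):
    # Callers put pytest's own arguments after "--", e.g.:
    #   run_pytest_with_stack.py --cwd some/dir -- -v --cov=./cuda tests
    parser = argparse.ArgumentParser()
    parser.add_argument("--cwd", default=".")
    parser.add_argument("--stack-size", type=int, default=8 * 1024 * 1024)
    parser.add_argument("pytest_args", nargs="*")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(sys.argv[1:] if argv is None else argv)
    os.chdir(args.cwd)
    # Only threads created after this call get the larger stack.
    threading.stack_size(args.stack_size)
    import pytest  # imported late so arg parsing works without pytest

    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as ex:
        return ex.submit(pytest.main, args.pytest_args).result()


if __name__ == "__main__":
    sys.exit(main())
```

Both the bindings and the core steps could then shrink to a one-line invocation of the helper.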
I forgot to mention before: I think one of these is redundant:

```
cd "${{ steps.install-root.outputs.INSTALL_ROOT }}"
...
os.chdir(r'${{ steps.install-root.outputs.INSTALL_ROOT }}')
```
If you add the suggested helper, I'd drop the `cd` and pass `steps.install-root.outputs.INSTALL_ROOT` to the helper; I believe that'll be most readable.
Thanks Ralf. It's reasonable to add a helper to make the workflow cleaner. I'll update that.
Another tiny thing: the PR title and description seem to be out of sync with the code changes. It'd be great to revise both.
This PR looks awesome now. I'll re-trigger the coverage workflow to retest.
The workflow is green again, but on closer inspection, I realized that while the workflow publishes coverage, there are remaining annotations we should definitely follow up on:
Below is a Cursor-generated analysis. My vote: let's merge this PR as-is, then debug the failures.

From Cursor GPT-5.4 Extra High Fast: I took a closer look at the latest successful dispatch on this branch and compared it with both the previous green PR run and the latest scheduled run.
What I found:
My recommendation:
If helpful, I think a small follow-up issue would be justified just to document that the coverage workflow is intentionally tolerant of failing test steps today, so that the green check is not over-interpreted.
@mdboom this PR is currently blocked on your request for changes. Could you please re-review? I believe it's a good step forward to merge as-is.
Thanks @rwgk. I noticed the test failures and was investigating them. Previously, the workflow was designed to continue on error so that all the tests could run and the coverage pipeline could complete; it isn't blocked by test failures. Maybe we could add warnings about the test failures. I'll follow up on your recommendations.
I'm not familiar with the workflow and can only offer high-level ideas: AFAIK there is no way to generate a warning; GitHub Actions only supports binary success/failure outcomes. From the Cursor output I saw, it seems the workflow is generating artifacts, and possibly the motivation for the continue-on-error design was to generate as many artifacts as possible. I can see how that's valuable and worth keeping. I'd look for a way to track errors, and then fail the workflow at the very end if there were any. I'm assuming we want to debug and fix all errors.
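One possible shape for that "track errors, fail at the end" idea, sketched in GitHub Actions terms (step ids, names, and the placeholder commands are made up, not the actual workflow): give each test step an `id`, let it continue on error, and add a final gate step that checks the recorded `steps.<id>.outcome` values after the artifacts are uploaded:

```yaml
- name: Run cuda.bindings tests
  id: bindings-tests
  continue-on-error: true
  run: ./run_bindings_tests.sh   # placeholder command

- name: Run cuda.core tests
  id: core-tests
  continue-on-error: true
  run: ./run_core_tests.sh       # placeholder command

- name: Upload coverage artifacts
  uses: actions/upload-artifact@v4
  with:
    name: coverage
    path: .coverage*

# Fail the job at the very end if any test step actually failed,
# so the artifacts above are still produced first.
- name: Fail if any test step failed
  if: steps.bindings-tests.outcome == 'failure' || steps.core-tests.outcome == 'failure'
  run: exit 1
```

With `continue-on-error: true`, a step's `conclusion` is reported as success but its `outcome` still records the real failure, which is what the gate step inspects.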
mdboom
left a comment
Thanks. This looks great.

This is to fix the following error when running coverage for Linux:

```
fatal: detected dubious ownership in repository at '/__w/cuda-python/cuda-python'
To add an exception for this directory, call:
git introspection failed: fatal: detected dubious ownership in repository at '/__w/cuda-python/cuda-python'
```

This is caused by recent setuptools_scm changes: setuptools_scm 10.0+ treats git ownership mismatches as hard errors. The fix is to change the ownership to the user that runs the tests inside the container.
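A minimal sketch of that ownership fix (the exact command and where it runs in the PR may differ; the `GITHUB_WORKSPACE` fallback is only there to make the sketch runnable outside of Actions):

```shell
# Make the checkout owned by the user that runs the tests, so git no
# longer reports "dubious ownership" and setuptools_scm 10+ can query it.
WORKSPACE="${GITHUB_WORKSPACE:-.}"   # fall back to cwd outside of Actions
chown -R "$(id -u):$(id -g)" "$WORKSPACE"

# Sanity check: the workspace should now be owned by the current user.
[ "$(stat -c %u "$WORKSPACE")" = "$(id -u)" ] && echo "ownership OK"
```

Unlike the `safe.directory` workaround, this removes the mismatch itself rather than telling git to ignore it.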