Shift GPU task completion by devreal · Pull Request #762 · ICLDisco/parsec

devreal · 2026-03-28T00:08:16Z

Completion is potentially costly (it may discover new tasks) and might hit the network, so we had better move it off the device management thread.

bosilca · 2026-04-03T20:30:38Z

Do you have any numbers that validate this need? We tried this in the past, and even for short tasks with a lot of descendants the results didn't indicate a real need for shifting the completion to another thread (which comes with its own set of problems).

devreal · 2026-04-06T13:42:17Z

I have not measured the impact, no. In TTG we run multi-threaded MPI so the GPU thread ends up calling MPI. That's not great. Also, in practice the worker threads are idle most of the time, since they only select the device and then pass the task on to the device manager. Any task we push out of the manager will be picked up relatively quickly because there is not much else the worker threads are doing. Without it, in a single device run we're effectively serializing everything onto the device manager thread.

devreal · 2026-05-11T20:08:29Z

@bosilca Can you elaborate on the problems you see with thread-shifting the completion?

bosilca · 2026-05-11T21:19:29Z

None in the runtime; most are in the profiling, which assumes the completion thread is the same as the submission thread, or else it does some O(N^3) searches.

Last time we did something similar we had a cleaner approach: we reserved some hyper-threads as helpers for the task preparation and cleanup, and they were bound to the same resources as the threads they were helping. This allowed us to take advantage of the core-level cache coherence, so the thread and its helpers were able to collaborate without locks and atomics. Despite that, we were never able to show that the device in charge of the GPU was a starvation point, which is the only valid reason to try to delegate the completions.

devreal · 2026-05-11T21:28:27Z

In DPLASMA, the GPU thread does not hit the network because DPLASMA runs with serialized MPI. In TTG, every thread can call into MPI, so the GPU thread will eventually end up there. Then all bets are off. It's also safe to assume that the worker threads are mostly idle so there are plenty of cycles to spare and the completion will be picked up quickly. SMT is not enabled everywhere so it's not an option we can rely on.

FWIW, the approach here follows the suggestion you made in #687 (comment).

devreal · 2026-05-11T21:29:26Z

We can add YACK (yet another configuration knob) to disable the offloading if it becomes a problem with profiling.

bosilca · 2026-05-11T23:16:13Z

Because SMT was not enabled everywhere and because we could not figure out a valid case where it was needed, the changes I mentioned above are still pending in one of the stale PRs imported from Bitbucket.

I understand why you would need this in TTG, and I want to give you an escape route but not a global knob. Let's add a flag onto the task class incarnation to mark it for completion delegation. Then in the GPU code, if the completion is marked as delegate-ready, do what you do here; otherwise fall back on the traditional path. With this, all PTG and DTD would keep working as before, and all TTG will be able to take this different path. In fact, it will even allow you different behaviors for different incarnations.

devreal · 2026-05-14T19:30:34Z

@bosilca I like the idea of controlling the thread-shift per chore. Please check if this is what you had in mind.

The DSL can decide whether the completion of the task should be shifted to a worker thread by setting the PARSEC_CHORE_FLAG_SHIFT_COMPLETION flag on the chore. DTD and PTG will not currently thread-shift completions. Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
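The per-chore opt-in described above can be pictured with a small, single-threaded model. This is only a sketch under stated assumptions: every `toy_*` and `TOY_*` name is hypothetical and merely mirrors the idea behind PARSEC_CHORE_FLAG_SHIFT_COMPLETION; the flag values, the struct layouts, and the FIFO standing in for the scheduler are not taken from the PaRSEC sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins; the values and layouts are assumptions, not PaRSEC's. */
enum { TOY_CHORE_FLAG_NONE = 0, TOY_CHORE_FLAG_SHIFT_COMPLETION = 1 << 0 };
enum { TOY_NOBODY = 0, TOY_MANAGER = 1, TOY_WORKER = 2 };

typedef struct toy_chore { uint32_t flags; } toy_chore_t;

typedef struct toy_task {
    const toy_chore_t *chore;   /* incarnation selected for this task */
    int completed_by;           /* TOY_NOBODY, TOY_MANAGER or TOY_WORKER */
    struct toy_task *next;      /* link in the deferred-completion FIFO */
} toy_task_t;

/* Single-threaded FIFO that models handing work back to the scheduler. */
typedef struct toy_queue { toy_task_t *head, *tail; } toy_queue_t;

static void toy_queue_push(toy_queue_t *q, toy_task_t *t)
{
    t->next = NULL;
    if (q->tail) q->tail->next = t; else q->head = t;
    q->tail = t;
}

static toy_task_t *toy_queue_pop(toy_queue_t *q)
{
    toy_task_t *t = q->head;
    if (t) {
        q->head = t->next;
        if (!q->head) q->tail = NULL;
        t->next = NULL;
    }
    return t;
}

/* Called by the device manager once a task's GPU work is done. */
static void toy_manager_finish(toy_queue_t *deferred, toy_task_t *t)
{
    assert(t->chore != NULL);
    if (t->chore->flags & TOY_CHORE_FLAG_SHIFT_COMPLETION) {
        toy_queue_push(deferred, t);     /* delegate: a worker completes it later */
    } else {
        t->completed_by = TOY_MANAGER;   /* traditional path: complete inline */
    }
}

/* Called by a worker; returns how many deferred completions it ran. */
static int toy_worker_drain(toy_queue_t *deferred)
{
    int n = 0;
    toy_task_t *t;
    while ((t = toy_queue_pop(deferred)) != NULL) {
        t->completed_by = TOY_WORKER;
        n++;
    }
    return n;
}
```

In this model, a task whose chore carries no flag is completed by the manager exactly as before, while a flagged task stays untouched until a worker drains the queue. That matches the intent that PTG and DTD keep their current behavior and only a DSL that sets the flag takes the new path.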

bosilca

This is great. Fix the chore accessor and it will be ready to be merged.

bosilca

The current PR branch is based on a commit that predates the merged batching code, so the device path has changed around it.

What happens today with batching:

A GPU submit hook can call parsec_gpu_task_collect_batch() and create a ring of parsec_gpu_task_t objects.
The execution event tracks that ring as one submitted GPU operation.
When the event completes, any complete_stage is called first. For example, the batched GEMM test uses this only to release the batch slot/state, not to complete PaRSEC tasks.
The ring is then pushed into the next GPU stream FIFO with parsec_gpu_stream_push_pending().
The pop/stage-out stream drains the ring back into individual GPU tasks.
Only after each individual task has passed parsec_device_kernel_pop() and parsec_device_kernel_epilog() does it reach the final complete_task block.

We need to make sure we do not shift completion at the batch event-completion level. That would risk only completing the head of the ring, or bypassing the per-task pop/epilog work.
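The concern about completing only the head of the ring can be illustrated with a toy circular list. The sketch below assumes a doubly linked ring and uses hypothetical `toy_*` names throughout; it is not the layout of parsec_gpu_task_t nor the actual device code. It shows the ring being picked apart task by task, with an assertion that a task is a singleton and has gone through its per-task pop step before it is completed.

```c
#include <assert.h>
#include <stddef.h>

/* Toy GPU task in a circular, doubly linked ring (assumed layout). */
typedef struct toy_gpu_task {
    struct toy_gpu_task *next, *prev;
    int popped;      /* per-task pop/stage-out work has run */
    int completed;   /* task has been completed */
} toy_gpu_task_t;

static void toy_singleton(toy_gpu_task_t *t) { t->next = t->prev = t; }

static int toy_is_singleton(const toy_gpu_task_t *t)
{
    return t->next == t && t->prev == t;
}

/* Insert t just before head, i.e. at the tail of head's ring. */
static void toy_ring_append(toy_gpu_task_t *head, toy_gpu_task_t *t)
{
    t->prev = head->prev;
    t->next = head;
    head->prev->next = t;
    head->prev = t;
}

/* Unlink head from its ring; returns the new head, or NULL if none is left. */
static toy_gpu_task_t *toy_ring_detach_head(toy_gpu_task_t *head)
{
    toy_gpu_task_t *rest = (head->next == head) ? NULL : head->next;
    head->prev->next = head->next;
    head->next->prev = head->prev;
    toy_singleton(head);
    return rest;
}

/* Completion refuses anything that is still linked to a batch. */
static void toy_complete(toy_gpu_task_t *t)
{
    assert(toy_is_singleton(t));
    assert(t->popped);
    t->completed = 1;
}

/* Drain the ring into individual tasks; returns how many were completed. */
static int toy_drain_and_complete(toy_gpu_task_t *ring)
{
    int n = 0;
    while (ring != NULL) {
        toy_gpu_task_t *t = ring;
        ring = toy_ring_detach_head(ring);
        t->popped = 1;      /* stands in for the per-task pop/epilog work */
        toy_complete(t);
        n++;
    }
    return n;
}
```

Calling `toy_complete()` directly on the head of a multi-task ring would trip the first assertion, which is the kind of guard that keeps a shifted completion from silently dropping the rest of a batch.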

A few other small changes are needed to fix all the task generation points in the PTG compiler. There should also be a point in the insert_task where the task's .flags is set to PARSEC_CHORE_FLAG_NONE.
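Why the explicit reset matters can be seen in a small model: chores built at run time from heap memory have indeterminate contents, so a stale bit could look like a request to shift completion. The sketch below uses hypothetical `toy_*` names and assumed flag values; it is not the DTD insert_task code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed flag values, mirroring the names used in this PR. */
enum { TOY_CHORE_FLAG_NONE = 0, TOY_CHORE_FLAG_SHIFT_COMPLETION = 1 << 0 };

typedef struct toy_chore {
    int      type;    /* hypothetical device type id */
    uint32_t flags;   /* per-incarnation behavior flags */
} toy_chore_t;

/* Allocate nb chores and give every field a defined value. */
static toy_chore_t *toy_alloc_chores(size_t nb)
{
    toy_chore_t *c = malloc(nb * sizeof(*c));   /* contents are indeterminate */
    if (c == NULL) return NULL;
    for (size_t i = 0; i < nb; i++) {
        c[i].type  = 0;
        c[i].flags = TOY_CHORE_FLAG_NONE;       /* no implicit thread-shift */
    }
    return c;
}

/* Nonzero only when the chore explicitly opted in. */
static int toy_wants_shift(const toy_chore_t *c)
{
    return (c->flags & TOY_CHORE_FLAG_SHIFT_COMPLETION) != 0;
}
```

With the reset in place, only a front end that deliberately sets the bit on a given chore gets the delegated path; every other chore behaves as before.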

bosilca · 2026-06-09T00:30:54Z

I was reading your comment today, "In TTG, every thread can call into MPI, so the GPU thread will eventually end up there," and I wonder whether this really solves your problem or just alleviates it in some cases. I think the root issue is that the thread in charge of the GPU owns the lock when it goes off to make MPI calls. For as long as it remains in MPI, no progress is possible on the GPU: the context has been locked by a thread that is now away. One can assume that most MPI calls issued there are non-blocking sends, so the MPI call will usually be short, but that is not always the case (a burst of small messages toward the same destination might overrun the available credits). Maybe that is the point where the GPU manager needs help: with handling external data movements, not local completions.

GPU: Thread-shift completion of tasks to worker threads

ac0e0ed

Completion is potentially costly (discovering new tasks) and might hit the network so we better move that away from the device management thread. Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

devreal requested a review from a team as a code owner March 28, 2026 00:08

devreal force-pushed the shift-gpu-task-completion branch from 9fede2e to cd63733 Compare May 14, 2026 19:59

devreal force-pushed the shift-gpu-task-completion branch from cd63733 to da86325 Compare May 14, 2026 20:43

bosilca reviewed May 14, 2026

Comment thread parsec/mca/device/device_gpu.c Outdated

bosilca mentioned this pull request May 17, 2026

Delegate GPU task completion to a co-manager #509

Closed

bosilca reviewed May 18, 2026

Comment thread parsec/mca/device/device_gpu.c Outdated

Apply suggestion from @bosilca

c3d31c2

bosilca mentioned this pull request May 18, 2026

co_manager shortcuting the scheduler #566

Closed

bosilca reviewed May 19, 2026

Comment thread parsec/mca/device/device_gpu.c

bosilca reviewed May 19, 2026

Comment thread parsec/mca/device/device_gpu.c

bosilca reviewed May 19, 2026

Comment thread parsec/interfaces/ptg/ptg-compiler/jdf2c.c

bosilca reviewed May 19, 2026

Comment thread parsec/interfaces/ptg/ptg-compiler/jdf2c.c Outdated

bosilca requested changes May 19, 2026

devreal and others added 3 commits June 8, 2026 19:09

Add PARSEC_CHORE_FLAG_NONE to JDF

b43f4e1

Make sure we set the flag on the chore in JDF Co-authored-by: bosilca <bosilca@users.noreply.github.com>

Assert that the GPU task is a singleton before completing

c990f18

In case we stop picking apart a batch before completing tasks. Also, add debug output to signal that the task was sent for completion. Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

DTD: set flags of chore to PARSEC_CHORE_FLAG_NONE

ac0008c

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
