Shift GPU task completion #762

Open
devreal wants to merge 6 commits into ICLDisco:master from devreal:shift-gpu-task-completion

@devreal

devreal commented Mar 28, 2026

Contributor

Completion is potentially costly (it may discover new tasks) and might hit the network, so we should move it off the device management thread.

Completion is potentially costly (discovering new tasks) and might hit the
network, so we better move it away from the device management thread.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
@devreal devreal requested a review from a team as a code owner March 28, 2026 00:08
@bosilca

bosilca commented Apr 3, 2026

Contributor

Do you have any numbers that validate this need? We tried this in the past, and even short tasks with a lot of descendants didn't indicate a real need for shifting the completion to another thread (which comes with its own set of problems).

@devreal

devreal commented Apr 6, 2026

Contributor Author

I have not measured the impact, no. In TTG we run multi-threaded MPI so the GPU thread ends up calling MPI. That's not great. Also, in practice the worker threads are idle most of the time, since they only select the device and then pass the task on to the device manager. Any task we push out of the manager will be picked up relatively quickly because there is not much else the worker threads are doing. Without it, in a single device run we're effectively serializing everything onto the device manager thread.
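
To make the hand-off described above concrete, here is a minimal sketch of a completion queue: the device manager pushes a finished task and returns to driving the GPU, and any idle worker later pops it and runs the potentially costly completion. Everything here (the `shift_*` names, the mutex-protected FIFO) is a hypothetical illustration, not the PaRSEC implementation in this PR, which would go through the runtime's own scheduler. Thread creation is left out to keep the sketch short.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical task record; not a PaRSEC type. */
typedef struct shift_task {
    struct shift_task *next;
    int id;
    int completed;          /* set once a worker has run the completion */
} shift_task_t;

/* Mutex-protected FIFO shared by the device manager and the workers. */
typedef struct {
    pthread_mutex_t lock;
    shift_task_t *head;
    shift_task_t *tail;
} shift_queue_t;

void shift_queue_init(shift_queue_t *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->head = q->tail = NULL;
}

/* Device manager side: hand the task off and return immediately. */
void shift_queue_push(shift_queue_t *q, shift_task_t *t)
{
    t->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail != NULL) q->tail->next = t;
    else                 q->head = t;
    q->tail = t;
    pthread_mutex_unlock(&q->lock);
}

/* Worker side: complete one pending task.
 * Returns 1 if a task was completed, 0 if the queue was empty. */
int shift_queue_complete_one(shift_queue_t *q)
{
    pthread_mutex_lock(&q->lock);
    shift_task_t *t = q->head;
    if (t != NULL) {
        q->head = t->next;
        if (q->head == NULL) q->tail = NULL;
    }
    pthread_mutex_unlock(&q->lock);
    if (t == NULL) return 0;
    /* Stand-in for the real completion: releasing dependencies,
     * discovering successors, possibly calling into MPI. */
    t->completed = 1;
    return 1;
}
```

The point of the design is that the expensive step runs outside the lock and outside the manager thread; the manager only pays for one short critical section per task.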

@devreal

devreal commented May 11, 2026

Contributor Author

@bosilca Can you elaborate on the problems you see with thread-shifting the completion?

@bosilca

bosilca commented May 11, 2026

Contributor

None in the runtime, mostly in the profiling (which assumes the completion thread is the same as the submission thread, or else it does some O(N^3) searches).

Last time we did something similar we had a cleaner approach: we reserved some hyper-threads as helpers for task preparation and cleanup, and they were bound to the same resources as the threads they were helping. This allowed us to take advantage of core-level cache coherence, so the thread and its helpers were able to collaborate without locks and atomics. Despite that, we were never able to show that the device in charge of the GPU was a starvation point, which is the only valid reason to try to delegate the completions.

@devreal

devreal commented May 11, 2026

Contributor Author

In DPLASMA, the GPU thread does not hit the network because DPLASMA runs with serialized MPI. In TTG, every thread can call into MPI, so the GPU thread will eventually end up there. Then all bets are off. It's also safe to assume that the worker threads are mostly idle so there are plenty of cycles to spare and the completion will be picked up quickly. SMT is not enabled everywhere so it's not an option we can rely on.

FWIW, the approach here follows the suggestion you made in #687 (comment).

@devreal

devreal commented May 11, 2026

Contributor Author

We can add YACK (yet another configuration knob) to disable the offloading if it becomes a problem with profiling.

@bosilca

bosilca commented May 11, 2026

Contributor

Because SMT was not enabled everywhere and because we could not figure out a valid case where it was needed, the changes I mentioned above are still pending in one of the stale PRs imported from Bitbucket.

I understand why you would need this in TTG, and I want to give you an escape route, but not a global knob. Let's add a flag onto the task class incarnation to mark it for completion delegation. Then in the GPU code, if the completion is marked as delegate-ready, do what you do here; otherwise fall back on the traditional path. With this, all PTG and DTD would keep working as before, and all TTG will be able to take this different path. In fact, it will even allow you different behaviors for different incarnations.
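
A rough sketch of that per-incarnation check, for illustration only. The two flag names are the ones that appear later in this PR; their numeric values, the `sketch_*` types, and the function name are assumptions and do not reflect the actual PaRSEC structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag names as used in this PR; the values are assumed for the sketch. */
#define PARSEC_CHORE_FLAG_NONE             0x0u
#define PARSEC_CHORE_FLAG_SHIFT_COMPLETION 0x1u

/* Hypothetical stand-ins for a chore (incarnation) and a task. */
typedef struct {
    uint32_t flags;
} sketch_chore_t;

typedef struct {
    const sketch_chore_t *chore;
    int completed_inline;   /* completed on the device manager thread */
    int delegated;          /* handed to a worker thread for completion */
} sketch_task_t;

typedef enum { SKETCH_COMPLETED_INLINE, SKETCH_DELEGATED } sketch_path_t;

/* Pick the completion path from the flag on the task's chore. */
sketch_path_t sketch_gpu_complete(sketch_task_t *task)
{
    assert(task != NULL && task->chore != NULL);
    if (task->chore->flags & PARSEC_CHORE_FLAG_SHIFT_COMPLETION) {
        /* Delegate-ready: a real runtime would reschedule the task here. */
        task->delegated = 1;
        return SKETCH_DELEGATED;
    }
    /* Traditional path: complete right away on the manager thread. */
    task->completed_inline = 1;
    return SKETCH_COMPLETED_INLINE;
}
```

Because the flag lives on the chore rather than in a global setting, a DSL that never sets it (PTG and DTD in this discussion) keeps the existing behavior without any change.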

@devreal

devreal commented May 14, 2026

Contributor Author

@bosilca I like the idea of controlling the thread-shift per chore. Please check if this is what you had in mind.

@devreal devreal force-pushed the shift-gpu-task-completion branch from 9fede2e to cd63733 Compare May 14, 2026 19:59
The DSL can decide whether the completion of the task should be
shifted to a worker thread by setting the
PARSEC_CHORE_FLAG_SHIFT_COMPLETION flag on the chore.
DTD and PTG will not currently thread-shift completions.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
@devreal devreal force-pushed the shift-gpu-task-completion branch from cd63733 to da86325 Compare May 14, 2026 20:43

@bosilca bosilca left a comment

Contributor

This is great. Fix the chore accessor and it will be ready to be merged.

Comment thread parsec/mca/device/device_gpu.c Outdated
Comment thread parsec/mca/device/device_gpu.c Outdated
Comment thread parsec/mca/device/device_gpu.c
Comment thread parsec/mca/device/device_gpu.c
Comment thread parsec/interfaces/ptg/ptg-compiler/jdf2c.c
Comment thread parsec/interfaces/ptg/ptg-compiler/jdf2c.c Outdated

@bosilca bosilca left a comment

Contributor

The current PR branch is based on a commit that predates the merged batching code, so the device path has changed around it.

What happens today with batching:

  • A GPU submit hook can call parsec_gpu_task_collect_batch() and create a ring of parsec_gpu_task_t objects.
  • The execution event tracks that ring as one submitted GPU operation.
  • When the event completes, any complete_stage is called first. For example, the batched GEMM test uses this only to release the batch slot/state, not to complete PaRSEC tasks.
  • The ring is then pushed into the next GPU stream FIFO with parsec_gpu_stream_push_pending().
  • The pop/stage-out stream drains the ring back into individual GPU tasks.
  • Only after each individual task has passed parsec_device_kernel_pop() and parsec_device_kernel_epilog() does it reach the final complete_task block.

We need to make sure we do not shift completion at the batch event-completion level. That would risk only completing the head of the ring, or bypassing the per-task pop/epilog work.
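
The concern can be illustrated with a small sketch that drains a ring task by task and only marks a task complete after its own pop and epilog steps. The circular singly linked list, the `sketch_*` names, and the flag fields are assumptions made for this example; they stand in for the real `parsec_gpu_task_t` ring and for `parsec_device_kernel_pop()` / `parsec_device_kernel_epilog()` without reproducing them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical GPU task in a ring; the last task links back to the head. */
typedef struct sketch_gpu_task {
    struct sketch_gpu_task *next;
    int popped;        /* stand-in for the per-task pop/stage-out step */
    int epilog_done;   /* stand-in for the per-task epilog step */
    int completed;     /* final completion (inline or shifted) */
} sketch_gpu_task_t;

/* Link n tasks into a ring and return its head (NULL if n == 0). */
sketch_gpu_task_t *sketch_make_ring(sketch_gpu_task_t *tasks, size_t n)
{
    if (n == 0) return NULL;
    for (size_t i = 0; i < n; i++) {
        tasks[i].next = &tasks[(i + 1) % n];
        tasks[i].popped = tasks[i].epilog_done = tasks[i].completed = 0;
    }
    return &tasks[0];
}

/* Drain the ring into individual tasks. Completion happens once per task,
 * and only after that task's pop and epilog, never once per batch. */
size_t sketch_drain_ring(sketch_gpu_task_t *head)
{
    size_t count = 0;
    sketch_gpu_task_t *task = head;
    while (task != NULL) {
        /* Stop when the ring wraps around to the head. */
        sketch_gpu_task_t *next = (task->next == head) ? NULL : task->next;
        task->next = NULL;          /* detach from the ring */
        task->popped = 1;
        task->epilog_done = 1;
        assert(task->popped && task->epilog_done);
        task->completed = 1;        /* the point where a shift could occur */
        count++;
        task = next;
    }
    return count;
}
```

Completing at the batch event instead would touch only `head` and leave the remaining tasks in the ring without their pop, epilog, or completion.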

A few other small changes are needed to fix all the task generation points in the PTG compiler. There should also be a point in insert_task where the task's .flags is set to PARSEC_CHORE_FLAG_NONE.

devreal and others added 3 commits June 8, 2026 19:09
Make sure we set the flag on the chore in JDF

Co-authored-by: bosilca <bosilca@users.noreply.github.com>
In case we stop picking apart a batch before completing tasks.

Also, add debug output to signal that the task was sent for completion.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
@bosilca

bosilca commented Jun 9, 2026

Contributor

I was reading your comment today, "In TTG, every thread can call into MPI, so the GPU thread will eventually end up there.", and I wonder if this really solves your problem or just alleviates it in some cases. I think the root issue is that you have a thread that is in charge of the GPU and owns the lock when it goes off to make some MPI calls. For as long as it remains in MPI, there is no possible progress on the GPU, because the context has been locked by a thread that is now away. One can assume that most MPI calls that can be issued there are non-blocking sends, which means the duration of the MPI call will be small most of the time, but that's not always the case (a burst of small messages toward the same destination might overrun the available credits). Maybe that's the point where the GPU manager needs help: with handling external data movements, not local completions.
