Shift GPU task completion #762
Completion is potentially costly (discovering new tasks) and might hit the network, so we had better move it away from the device management thread. Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
|
Do you have any numbers that validate this need? We tried this in the past, and even for short tasks with a lot of descendants the results didn't indicate a real need for shifting the completion to another thread (which comes with its own set of problems). |
|
I have not measured the impact, no. In TTG we run multi-threaded MPI so the GPU thread ends up calling MPI. That's not great. Also, in practice the worker threads are idle most of the time, since they only select the device and then pass the task on to the device manager. Any task we push out of the manager will be picked up relatively quickly because there is not much else the worker threads are doing. Without it, in a single device run we're effectively serializing everything onto the device manager thread. |
|
@bosilca Can you elaborate on the problems you see with thread-shifting the completion? |
|
None in the runtime, most in the profiling (which assumes the completion thread is the same as the submission thread, or else it does some O(N^3) searches). Last time we did something similar we had a cleaner approach: we reserved some hyper-threads as helpers for task preparation and cleanup, and they were bound to the same resources as the threads they were helping. This allowed us to take advantage of the core-level cache coherence, so the thread and its helpers were able to collaborate without locks and atomics. Despite that, we were never able to show that the device in charge of the GPU was a starvation point, which is the only valid reason to try to delegate the completions. |
|
In DPLASMA, the GPU thread does not hit the network because DPLASMA runs with serialized MPI. In TTG, every thread can call into MPI, so the GPU thread will eventually end up there. Then all bets are off. It's also safe to assume that the worker threads are mostly idle so there are plenty of cycles to spare and the completion will be picked up quickly. SMT is not enabled everywhere so it's not an option we can rely on. FWIW, the approach here follows the suggestion you made in #687 (comment). |
|
We can add YACK (yet another configuration knob) to disable the offloading if it becomes a problem with profiling. |
|
Because SMT was not enabled everywhere and because we could not figure out a valid case where it was needed, the changes I mentioned above are still pending in one of the stale PRs imported from bitbucket. I understand why you would need this in TTG, and I want to give you an escape route but not a global knob. Let's add a flag onto the taskclass incarnation to mark it for completion delegation. Then in the GPU code, if the completion is marked as delegate-ready, do what you do here, otherwise fall back on the traditional path. With this, all PTG and DTD would keep working as before, and all TTG will be able to take this different path. In fact, it will even allow you different behaviors for different incarnations. |
|
@bosilca I like the idea of controlling the thread-shift per chore. Please check if this is what you had in mind. |
9fede2e to cd63733
The DSL can decide whether the completion of the task should be shifted to a worker thread by setting the PARSEC_CHORE_FLAG_SHIFT_COMPLETION flag on the chore. DTD and PTG will not currently thread-shift completions. Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
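A minimal sketch of the per-chore decision described in this commit message, not the actual patch. Only the flag names `PARSEC_CHORE_FLAG_SHIFT_COMPLETION` and `PARSEC_CHORE_FLAG_NONE` come from this conversation; their numeric values and every `sketch_*` type and function are hypothetical stand-ins for the runtime structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag names follow the PR text; the numeric values are assumptions. */
#define PARSEC_CHORE_FLAG_NONE             0x0u
#define PARSEC_CHORE_FLAG_SHIFT_COMPLETION 0x1u

#define SKETCH_QUEUE_CAPACITY 16
enum { SKETCH_NOBODY = 0, SKETCH_MANAGER = 1, SKETCH_WORKER = 2 };

/* Hypothetical, simplified stand-ins for a chore (incarnation) and a task. */
typedef struct sketch_chore_s {
    uint32_t flags;               /* set by the DSL, per incarnation */
} sketch_chore_t;

typedef struct sketch_task_s {
    const sketch_chore_t *chore;  /* incarnation selected for this task */
    int completed_by;             /* records which side ran the completion */
} sketch_task_t;

/* Stand-in for the scheduler queue that worker threads drain. */
typedef struct sketch_queue_s {
    sketch_task_t *items[SKETCH_QUEUE_CAPACITY];
    size_t count;
} sketch_queue_t;

/* Called on the device manager thread once a task is ready to complete. */
void sketch_gpu_task_done(sketch_task_t *task, sketch_queue_t *workers)
{
    assert(task != NULL && task->chore != NULL && workers != NULL);
    if ((task->chore->flags & PARSEC_CHORE_FLAG_SHIFT_COMPLETION) &&
        workers->count < SKETCH_QUEUE_CAPACITY) {
        /* Delegate: a worker thread will run the completion later. */
        workers->items[workers->count++] = task;
    } else {
        /* Traditional path: complete on the manager thread. */
        task->completed_by = SKETCH_MANAGER;
    }
}

/* Run by a worker thread; returns the number of completions it performed. */
size_t sketch_worker_drain(sketch_queue_t *workers)
{
    size_t n = workers->count;
    for (size_t i = 0; i < n; i++) {
        workers->items[i]->completed_by = SKETCH_WORKER;
    }
    workers->count = 0;
    return n;
}
```

With this shape, a chore that leaves its flags at `PARSEC_CHORE_FLAG_NONE` (as PTG and DTD would) never leaves the manager thread, while a TTG chore can opt in per incarnation.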
cd63733 to da86325
bosilca
left a comment
This is great. Fix the chore accessor and it will be ready to be merged.
bosilca
left a comment
The current PR branch predates the merged batching code, so the device path has changed around it.
What happens today with batching:
- A GPU submit hook can call parsec_gpu_task_collect_batch() and create a ring of parsec_gpu_task_t objects.
- The execution event tracks that ring as one submitted GPU operation.
- When the event completes, any complete_stage is called first. For example, the batched GEMM test uses this only to release the batch slot/state, not to complete PaRSEC tasks.
- The ring is then pushed into the next GPU stream FIFO with parsec_gpu_stream_push_pending().
- The pop/stage-out stream drains the ring back into individual GPU tasks.
- Only after each individual task has passed parsec_device_kernel_pop() and parsec_device_kernel_epilog() does it reach the final complete_task block.
We need to make sure we do not shift completion at the batch event-completion level. That would risk only completing the head of the ring, or bypassing the per-task pop/epilog work.
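The risk described above can be illustrated with a small sketch. All `sketch_*` names are hypothetical and the ring is reduced to a singly linked circular list; the real `parsec_gpu_task_t` ring and the `parsec_device_kernel_pop()` / `parsec_device_kernel_epilog()` calls are represented only by marker fields here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a GPU task that can be linked into a batch ring. */
typedef struct sketch_gpu_task_s {
    struct sketch_gpu_task_s *next;  /* circular link within the batch ring */
    int popped;                      /* stand-in for the pop step */
    int epilog_done;                 /* stand-in for the epilog step */
    int completion_sent;             /* task was handed over for completion */
} sketch_gpu_task_t;

/* Link n tasks into a ring and return its head. */
sketch_gpu_task_t *sketch_make_ring(sketch_gpu_task_t *tasks, size_t n)
{
    assert(tasks != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        tasks[i].next = &tasks[(i + 1) % n];
        tasks[i].popped = 0;
        tasks[i].epilog_done = 0;
        tasks[i].completion_sent = 0;
    }
    return &tasks[0];
}

/* Anti-pattern: treating the ring head as a single task at event completion.
 * Only the head is sent for completion, and pop/epilog are skipped. */
size_t sketch_shift_at_event(sketch_gpu_task_t *head)
{
    head->completion_sent = 1;
    return 1;
}

/* Intended order: detach every task from the ring, run the per-task pop and
 * epilog stand-ins, and only then send that task for completion. */
size_t sketch_drain_then_shift(sketch_gpu_task_t *head)
{
    size_t n = 0;
    sketch_gpu_task_t *task = head;
    do {
        sketch_gpu_task_t *next = task->next;  /* save before detaching */
        task->next = task;                     /* task is now a singleton */
        task->popped = 1;
        task->epilog_done = 1;
        task->completion_sent = 1;
        n++;
        task = next;
    } while (task != head);
    return n;
}
```

In this model the completion shift happens once per drained task, which is the point in the pipeline where the final `complete_task` block is reached.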
A few other small changes are needed to fix all the task generation points in the PTG compiler. There should also be a point in insert_task where the task's .flags is set to PARSEC_CHORE_FLAG_NONE.
Make sure we set the flag on the chore in JDF Co-authored-by: bosilca <bosilca@users.noreply.github.com>
In case we stop picking apart a batch before completing tasks. Also, add debug output to signal that the task was sent for completion. Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
|
I was reading your comment today, "In TTG, every thread can call into MPI, so the GPU thread will eventually end up there.", and I wonder if this really solves your problem or just alleviates it in some cases. I think the root issue is that you have a thread that is in charge of the GPU and owns the lock when it goes off to make MPI calls. For as long as it remains in MPI, no progress is possible on the GPU, because the context has been locked by a thread that is now away. One can assume that most MPI calls issued there are non-blocking sends, which means the MPI call will usually be short, but that's not always the case (a burst of small messages toward the same destination might overrun the available credits). Maybe that's the point where the GPU manager needs help: with handling external data movements, not local completions. |