Sort incoming device tasks using heap#784
Open
devreal wants to merge 7 commits into
Open
Conversation
Currently, sorting of device tasks is broken (they don't actually contain a priority field) and linear over a slice of the input list. Instead, we can use a max-heap to keep a sorted array of device tasks, push all new tasks into the heap, and take out the highest priority task. Complexity of inserting the ring into the heap is O(N+M) or O(N*logM), depending on the whether we just insert elements one by one or rebuild from scratch (beneficial if we insert more tasks than exist already). We can detach the full lifo at once to avoid repeated atomic operations by the management thread. Also, use nolock variants on stream-local lists and change to lifo for the shared queue to avoid the lock of the fifo. They are only modified by the device management thread. Inter-thread communication happens through the pending queue. Once tasks are trickling through the system they do not need to be sorted anymore. Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
There was a problem hiding this comment.
Pull request overview
This PR reworks GPU device task intake to restore proper priority-based selection and reduce overhead by switching from FIFO-based pending queues and ad-hoc sorting to a lock-free LIFO intake plus a manager-private max-heap for prioritized pop.
Changes:
- Replace shared
pendingqueues in CUDA and Level Zero devices with a lock-freeparsec_lifo_t, and introduce apending_heap(parsec_heap_*) used by the device management thread to pop the highest-priority task. - Remove the old “sort pending tasks” mechanism and MCA parameters (
sort_pending_tasks) along with the associated function pointer inparsec_device_module_t. - Add
parsec/class/parsec_heap.{h,c}and integrate it into the build, plus addparsec_lifo_detach_all()helpers to batch-move tasks with fewer atomic operations.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| parsec/mca/device/level_zero/device_level_zero_module.c | Switch Level Zero pending queue to LIFO and initialize/finalize the new pending heap. |
| parsec/mca/device/level_zero/device_level_zero_component.c | Remove the Level Zero MCA knob and wiring for pending-task sorting. |
| parsec/mca/device/device.h | Remove the sort_pending_list function pointer from the device module interface. |
| parsec/mca/device/device_gpu.h | Introduce pq_priority, swap pending type to LIFO, and add a manager-private pending heap field. |
| parsec/mca/device/device_gpu.c | Detach all pending tasks from LIFO, assign heap priorities, push into heap, and pop highest-priority task; switch stream-local queues to nolock list ops. |
| parsec/mca/device/cuda/device_cuda_module.c | Switch CUDA pending queue to LIFO and initialize/finalize the new pending heap. |
| parsec/mca/device/cuda/device_cuda_component.c | Remove the CUDA MCA knob and wiring for pending-task sorting. |
| parsec/CMakeLists.txt | Add class/parsec_heap.c to the build. |
| parsec/class/parsec_heap.h | Add new array-based max-heap API (pointer elements, int32_t priority key by offset). |
| parsec/class/parsec_heap.c | Implement heap push/pop and ring batch-push with adaptive heapify strategy. |
| parsec/class/lifo.h | Add helpers to detach the entire LIFO at once and convert the internal chain into a doubly-linked ring. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+25
to
+31
| typedef struct parsec_heap_s { | ||
| parsec_object_t super; | ||
| void **nodes; /**< pointer array; nodes[0] is the maximum */ | ||
| size_t size; /**< current number of elements */ | ||
| size_t capacity; /**< allocated slots in nodes[] */ | ||
| size_t comp_offset; /**< byte offset of int32_t priority key */ | ||
| } parsec_heap_t; |
Comment on lines
+158
to
+159
| for (ssize_t i = (ssize_t)(heap->size / 2) - 1; i >= 0; i--) | ||
| heap_sift_down(heap, (size_t)i); |
Contributor
Author
|
Looking into merging the two heap implementations. Didn't see the existing one because it wasn't in the class/ directory. |
Similar to the existing maxheap implementation. Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
The old implementation now uses the new class and was renamed to parsec_task_heap_t, since it is specialized for hierarchical heaps. Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Comment on lines
+22
to
+27
| static inline int heap_cmp(const parsec_heap_t *h, | ||
| const parsec_list_item_t *a, | ||
| const parsec_list_item_t *b) | ||
| { | ||
| return COMPARISON_VAL(a, h->comp_offset) - COMPARISON_VAL(b, h->comp_offset); | ||
| } |
Comment on lines
+92
to
+96
| char tmp[MAX_TASK_STRLEN]; | ||
| PARSEC_DEBUG_VERBOSE(20, parsec_debug_output, | ||
| "MH:\tStole exec C %s (%p) from heap %p", | ||
| parsec_task_snprintf(tmp, MAX_TASK_STRLEN, task), task, heap); | ||
| } |
Comment on lines
176
to
180
| char tmp[MAX_TASK_STRLEN]; | ||
| PARSEC_DEBUG_VERBOSE(20, parsec_debug_output, "MH:\tStole exec C %s (%p) from heap %p", | ||
| PARSEC_DEBUG_VERBOSE(20, parsec_debug_output, | ||
| "MH:\tStole exec C %s (%p) from heap %p", | ||
| parsec_task_snprintf(tmp, MAX_TASK_STRLEN, to_use), to_use, heap); | ||
| } |
| assert(NULL == task); | ||
| if( NULL == stream->tasks[stream->start] ) { /* there is room on the stream */ | ||
| task = (parsec_gpu_task_t*)parsec_list_pop_front(stream->fifo_pending); /* get the best task */ | ||
| task = (parsec_gpu_task_t*)parsec_list_nolock_pop_front(stream->fifo_pending); /* get the best task */ |
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Carefully handle list_prev/list_next since PARSEC_ITEM_DETACH may reset them. Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
524089e to
826f3cf
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Currently, sorting of device tasks is broken (they don't actually contain a priority field) and linear over a slice of the input list. Instead, we can use a max-heap to keep a sorted array of device tasks, push all new tasks into the heap, and take out the highest priority task.
Complexity of inserting the ring into the heap is O(N+M) or O(N*logM), depending on the whether we just insert elements one by one or rebuild from scratch (beneficial if we insert more tasks than exist already).
We can detach the full lifo at once to avoid repeated atomic operations by the management thread.
Also, use nolock variants on stream-local lists and change to lifo for the shared queue to avoid the lock of the fifo. They are only modified by the device management thread. Inter-thread communication happens through the pending queue.
Once tasks are trickling through the system they do not need to be sorted anymore.