Load balancing for tasks with long tailed distributions #1343
Comments
We would be happy to specify this as an optional distribution strategy, using our use-case assumption that tasks take minutes to hours. However, I would be a bit hesitant to simply drop tasks and reschedule them without an explicit option.
I feel compelled to point to https://doi.org/10.1109/MTAGS.2010.5699433 for some previous thinking on this problem.
Cool!
You solved your own problem without changing supercomputer schedulers for non-leadership platforms!
yadudoc commented Oct 9, 2019
Is your feature request related to a problem? Please describe.
The HTEX executor and our current strategy try to evenly distribute tasks to all online managers, which can result in underutilization at the tail end of a large run. Since the tasks are not packed to the least filled block and then to the least filled manager, we end up in a situation where we cannot relinquish blocks that are severely underfilled.
Related to #172
Describe the solution you'd like
Currently we use a randomized scheme that works great for short-duration tasks, but poorly for the situation described above. We'd need a spill-over algorithm which attempts to fill a block, and each manager within it, before moving to the next. Once this feature is added, we can add a new strategy that shuts down any empty blocks rather than waiting for the entire executor to be idle before starting scale-down events. We could leave it to the user to select the manager-task mapping algorithm via the Config.
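To make the spill-over idea concrete, here is a minimal sketch of one way it could work. It is not Parsl code: `Manager`, `assign_spillover`, and `empty_blocks` are hypothetical names invented for illustration, and the sketch assumes each manager advertises a fixed slot capacity and that blocks and managers have a stable ordering.

```python
from dataclasses import dataclass


@dataclass
class Manager:
    """Hypothetical stand-in for a manager; not a Parsl class."""
    block_id: str
    manager_id: str
    capacity: int      # assumed fixed number of worker slots
    running: int = 0   # tasks currently assigned

    @property
    def free(self) -> int:
        return self.capacity - self.running


def assign_spillover(managers, n_tasks):
    """Place tasks one at a time on the first manager (ordered by block,
    then by manager) that still has a free slot, so a block is filled
    before any task spills over to the next one."""
    ordered = sorted(managers, key=lambda m: (m.block_id, m.manager_id))
    placed = []
    for _ in range(n_tasks):
        target = next((m for m in ordered if m.free > 0), None)
        if target is None:
            break  # every slot is taken; remaining tasks stay queued
        target.running += 1
        placed.append((target.block_id, target.manager_id))
    return placed


def empty_blocks(managers):
    """Blocks with no running tasks: candidates for early scale-down."""
    busy = {m.block_id for m in managers if m.running > 0}
    return sorted({m.block_id for m in managers} - busy)


managers = [
    Manager("b1", "m0", 2), Manager("b0", "m1", 2),
    Manager("b0", "m0", 2), Manager("b1", "m1", 2),
]
placed = assign_spillover(managers, 3)
print(placed)
print(empty_blocks(managers))
```

With three tasks and two blocks of two managers each, everything lands in block `b0` and block `b1` stays empty, which is what would let a scale-down strategy release it without waiting for the whole executor to go idle.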
Describe alternatives you've considered
Once blocks drop below a utilization threshold, you could terminate blocks and reschedule tasks.
This crude method would probably give better utilization, at the cost of some wasted compute.
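A rough sketch of this threshold-based alternative, again with hypothetical names (`underfilled_blocks`, `tasks_to_reschedule`) and an assumed view of per-block load as `(running, capacity)` pairs. It only selects blocks and collects their tasks; actually cancelling and resubmitting work is left out.

```python
def underfilled_blocks(block_load, threshold):
    """Return ids of blocks whose utilization (running / capacity)
    is below the threshold. `block_load` maps block id to a
    (running, capacity) tuple; this shape is an assumption."""
    return sorted(
        block_id
        for block_id, (running, capacity) in block_load.items()
        if capacity > 0 and running / capacity < threshold
    )


def tasks_to_reschedule(block_tasks, doomed_blocks):
    """Collect the tasks still running on blocks chosen for termination.
    Work already done on these tasks is the 'wasted compute'."""
    return [t for b in doomed_blocks for t in block_tasks.get(b, [])]


block_load = {"b0": (7, 8), "b1": (1, 8), "b2": (0, 8)}
block_tasks = {"b0": ["t1", "t2"], "b1": ["t9"], "b2": []}

doomed = underfilled_blocks(block_load, threshold=0.25)
print(doomed)
print(tasks_to_reschedule(block_tasks, doomed))
```

Given the concern raised in the comments about dropping tasks implicitly, a selection step like this would presumably sit behind an explicit opt-in setting.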
Additional context
Requested by @dgasmith during Parslfest