The sequential task flow (STF) model is the mainstream approach for interacting with task-based runtime systems, with StarPU and Dynamic Task Discovery (DTD) in PaRSEC being two implementations of this model. Compared with other approaches to submitting tasks to a runtime system, STF offers advantages centered on an easy-to-use API that allows users to express algorithms as a sequence of tasks (much like in OpenMP), while allowing the runtime to automatically identify and analyze task dependencies and handle scheduling. In this paper, we focus on the DTD interface in PaRSEC, highlight some of its lesser-known limitations, and implement two optimization techniques for DTD: support for user-level graph trimming, and a new API for broadcasting read-only data to remote tasks. We then analyze the benefits and limitations of these optimizations with benchmarks as well as with two common matrix factorization kernels, Cholesky and QR, on two different systems: Shaheen II from KAUST and Fugaku from RIKEN. We point out some potential for further improvements and provide valuable insights into the strengths and weaknesses of the STF model, hoping to guide future developments of task-based runtime systems.
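To make the STF idea concrete, the following is a minimal sketch of how a runtime could infer dependencies from tasks submitted in program order. It is an illustration only: `ToySTF`, `insert_task`, and the tile names are hypothetical and are not the PaRSEC DTD or StarPU API, and the sketch ignores distribution, scheduling, graph trimming, and broadcasts.

```python
from collections import defaultdict

READ, WRITE = "R", "W"  # hypothetical access modes for this sketch


class ToySTF:
    """Toy sequential-task-flow front end (not a real runtime API)."""

    def __init__(self):
        self.tasks = []                    # task names, in submission order
        self.deps = defaultdict(set)       # task id -> ids it must wait for
        self.last_writer = {}              # datum -> id of the last writer
        self.readers = defaultdict(list)   # datum -> readers since last write

    def insert_task(self, name, *accesses):
        """Submit a task; each access is a (datum, mode) pair."""
        tid = len(self.tasks)
        self.tasks.append(name)
        self.deps[tid]  # create an empty entry even if no dependency is found
        for datum, mode in accesses:
            writer = self.last_writer.get(datum)
            if writer is not None:
                # read-after-write or write-after-write dependency
                self.deps[tid].add(writer)
            if mode == READ:
                self.readers[datum].append(tid)
            else:
                # write-after-read: wait for readers of the previous version
                self.deps[tid].update(self.readers[datum])
                self.last_writer[datum] = tid
                self.readers[datum] = []
        self.deps[tid].discard(tid)  # a task never depends on itself
        return tid

    def dependencies(self):
        """Return the inferred graph as {task name: [predecessor names]}."""
        return {
            self.tasks[t]: sorted(self.tasks[d] for d in self.deps[t])
            for t in range(len(self.tasks))
        }


# Tasks are submitted in sequential order; the graph is inferred from accesses.
rt = ToySTF()
rt.insert_task("potrf", ("A00", WRITE))
rt.insert_task("trsm", ("A00", READ), ("A10", WRITE))
rt.insert_task("syrk", ("A10", READ), ("A11", WRITE))
rt.insert_task("potrf_next", ("A11", WRITE))
graph = rt.dependencies()
```

In this toy model every process that submits the task sequence would discover the whole graph, which hints at why trimming the submitted graph can matter at scale.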
|Original language||English (US)|
|Title of host publication||2022 IEEE/ACM International Workshop on Runtime and Operating Systems for Supercomputers (ROSS)|
|State||Published - Jan 27 2023|
Bibliographical note
KAUST Repository Item: Exported on 2023-01-31
Acknowledgements: For computer time, this research used the resources of the Supercomputing Laboratory (KSL) Shaheen II at King Abdullah University of Science & Technology (KAUST) in Thuwal, Saudi Arabia, and the supercomputer Fugaku provided by RIKEN.
This publication acknowledges KAUST support, but has no KAUST affiliated authors.