TY - GEN
T1 - A comprehensive study of task coalescing for selecting parallelism granularity in a two-stage bidiagonal reduction
AU - Haidar, Azzam
AU - Ltaief, Hatem
AU - Luszczek, Piotr R.
AU - Dongarra, Jack
N1 - KAUST Repository Item: Exported on 2020-10-01
PY - 2012/5
Y1 - 2012/5
N2 - We present new high performance numerical kernels combined with advanced optimization techniques that significantly increase the performance of parallel bidiagonal reduction. Our approach is based on developing efficient fine-grained computational tasks as well as reducing overheads associated with their high-level scheduling during the so-called bulge chasing procedure that is an essential phase of a scalable bidiagonalization procedure. In essence, we coalesce multiple tasks in a way that reduces the time needed to switch execution context between the scheduler and useful computational tasks. At the same time, we maintain the crucial information about the tasks and their data dependencies between the coalescing groups. This is the necessary condition to preserve numerical correctness of the computation. We show our annihilation strategy based on multiple applications of single orthogonal reflectors. Despite non-trivial characteristics in computational complexity and memory access patterns, our optimization approach smoothly applies to the annihilation scenario. The coalescing positively influences another equally important aspect of the bulge chasing stage: the memory reuse. For the tasks within the coalescing groups, the data is retained in high levels of the cache hierarchy and, as a consequence, operations that are normally memory-bound increase their ratio of computation to off-chip communication and become compute-bound which renders them amenable to efficient execution on multicore architectures. The performance for the new two-stage bidiagonal reduction is staggering. Our implementation results in up to 50-fold and 12-fold improvement (∼130 Gflop/s) compared to the equivalent routines from LAPACK V3.2 and Intel MKL V10.3, respectively, on an eight socket hexa-core AMD Opteron multicore shared-memory system with a matrix size of 24000 x 24000. Last but not least, we provide a comprehensive study on the impact of the coalescing group size in terms of cache utilization and power consumption in the context of this new two-stage bidiagonal reduction. © 2012 IEEE.
AB - We present new high performance numerical kernels combined with advanced optimization techniques that significantly increase the performance of parallel bidiagonal reduction. Our approach is based on developing efficient fine-grained computational tasks as well as reducing overheads associated with their high-level scheduling during the so-called bulge chasing procedure that is an essential phase of a scalable bidiagonalization procedure. In essence, we coalesce multiple tasks in a way that reduces the time needed to switch execution context between the scheduler and useful computational tasks. At the same time, we maintain the crucial information about the tasks and their data dependencies between the coalescing groups. This is the necessary condition to preserve numerical correctness of the computation. We show our annihilation strategy based on multiple applications of single orthogonal reflectors. Despite non-trivial characteristics in computational complexity and memory access patterns, our optimization approach smoothly applies to the annihilation scenario. The coalescing positively influences another equally important aspect of the bulge chasing stage: the memory reuse. For the tasks within the coalescing groups, the data is retained in high levels of the cache hierarchy and, as a consequence, operations that are normally memory-bound increase their ratio of computation to off-chip communication and become compute-bound which renders them amenable to efficient execution on multicore architectures. The performance for the new two-stage bidiagonal reduction is staggering. Our implementation results in up to 50-fold and 12-fold improvement (∼130 Gflop/s) compared to the equivalent routines from LAPACK V3.2 and Intel MKL V10.3, respectively, on an eight socket hexa-core AMD Opteron multicore shared-memory system with a matrix size of 24000 x 24000. Last but not least, we provide a comprehensive study on the impact of the coalescing group size in terms of cache utilization and power consumption in the context of this new two-stage bidiagonal reduction. © 2012 IEEE.
UR - http://hdl.handle.net/10754/575805
UR - http://ieeexplore.ieee.org/document/6267821/
UR - http://www.scopus.com/inward/record.url?scp=84866876895&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2012.13
DO - 10.1109/IPDPS.2012.13
M3 - Conference contribution
SN - 9780769546759
SP - 25
EP - 35
BT - 2012 IEEE 26th International Parallel and Distributed Processing Symposium
PB - Institute of Electrical and Electronics Engineers (IEEE)
ER -