TY - GEN
T1 - Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels
AU - Haidar, Azzam
AU - Ltaief, Hatem
AU - Dongarra, Jack
PY - 2011
Y1 - 2011
AB - This paper introduces a novel implementation in reducing a symmetric dense matrix to tridiagonal form, which is the preprocessing step toward solving symmetric eigenvalue problems. Based on tile algorithms, the reduction follows a two-stage approach, where the tile matrix is first reduced to symmetric band form prior to the final condensed structure. The challenging trade-off between algorithmic performance and task granularity has been tackled through a grouping technique, which consists of aggregating fine-grained and memory-aware computational tasks during both stages, while sustaining the application's overall high performance. A dynamic runtime environment system then schedules the different tasks in an out-of-order fashion. The performance for the tridiagonal reduction reported in this paper is unprecedented. Our implementation results in up to 50-fold and 12-fold improvement (130 Gflop/s) compared to the equivalent routines from LAPACK V3.2 and Intel MKL V10.3, respectively, on an eight socket hexa-core AMD Opteron multicore shared-memory system with a matrix size of 24000×24000. Copyright 2011 ACM.
UR - http://hdl.handle.net/10754/575751
UR - http://dl.acm.org/citation.cfm?doid=2063384.2063394
UR - http://www.scopus.com/inward/record.url?scp=83155188961&partnerID=8YFLogxK
U2 - 10.1145/2063384.2063394
DO - 10.1145/2063384.2063394
M3 - Conference contribution
SN - 9781450307710
BT - Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11)
PB - Association for Computing Machinery (ACM)
ER -