TY - GEN
T1 - Exploring Shared-Memory Optimizations for an Unstructured Mesh CFD Application on Modern Parallel Systems
AU - Mudigere, Dheevatsa
AU - Sridharan, Srinivas
AU - Deshpande, Anand
AU - Park, Jongsoo
AU - Heinecke, Alexander
AU - Smelyanskiy, Mikhail
AU - Kaul, Bharat
AU - Dubey, Pradeep
AU - Kaushik, Dinesh
AU - Keyes, David E.
N1 - KAUST Repository Item: Exported on 2020-10-01
PY - 2015/5
Y1 - 2015/5
N2 - In this work, we revisit the 1999 Gordon Bell Prize-winning PETSc-FUN3D aerodynamics code, extending it with highly tuned shared-memory parallelization and detailed performance analysis on modern highly parallel architectures. An unstructured-grid implicit flow solver, which forms the backbone of computational aerodynamics, poses particular challenges due to its large irregular working sets, unstructured memory accesses, and variable/limited amount of parallelism. This code, based on a domain decomposition approach, exposes tradeoffs between the number of threads assigned to each MPI-rank subdomain and the total number of domains. By applying several algorithm- and architecture-aware optimization techniques for unstructured grids, we show a 6.9X speed-up in performance on a single-node Intel® Xeon™ E5-2690 v2 processor relative to the out-of-the-box compilation. Our scaling studies on the TACC Stampede supercomputer show that our optimizations continue to provide performance benefits over the baseline implementation as we scale up to 256 nodes.
AB - In this work, we revisit the 1999 Gordon Bell Prize-winning PETSc-FUN3D aerodynamics code, extending it with highly tuned shared-memory parallelization and detailed performance analysis on modern highly parallel architectures. An unstructured-grid implicit flow solver, which forms the backbone of computational aerodynamics, poses particular challenges due to its large irregular working sets, unstructured memory accesses, and variable/limited amount of parallelism. This code, based on a domain decomposition approach, exposes tradeoffs between the number of threads assigned to each MPI-rank subdomain and the total number of domains. By applying several algorithm- and architecture-aware optimization techniques for unstructured grids, we show a 6.9X speed-up in performance on a single-node Intel® Xeon™ E5-2690 v2 processor relative to the out-of-the-box compilation. Our scaling studies on the TACC Stampede supercomputer show that our optimizations continue to provide performance benefits over the baseline implementation as we scale up to 256 nodes.
UR - http://hdl.handle.net/10754/577110
UR - http://ieeexplore.ieee.org/document/7161559/
UR - http://www.scopus.com/inward/record.url?scp=84971375871&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2015.114
DO - 10.1109/IPDPS.2015.114
M3 - Conference contribution
SN - 9781479986491
SP - 723
EP - 732
BT - 2015 IEEE International Parallel and Distributed Processing Symposium
PB - Institute of Electrical and Electronics Engineers (IEEE)
ER -