Computational science has benefited in recent years from emerging accelerators that increase the performance of scientific simulations, but using these devices complicates programming. This paper presents AMA, a set of optimization techniques to efficiently manage multiaccelerator systems. AMA maximizes the overlap of computation and communication in a blocking-free way, so the time that would otherwise be spent waiting for device operations can be used to do other work. AMA is implemented on top of a task-based framework, and its experimental evaluation on a quad-GPU node shows that we reach the performance of a hand-tuned native CUDA code, with the advantage of fully hiding the device management. In addition, we obtain a speed-up of more than 2x in the best case with respect to the original framework implementation.
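The blocking-free idea in the abstract can be illustrated with a small host-only sketch. It is not AMA's implementation, which this record does not describe; it only shows the general pattern of polling pending operations without blocking and running other work in the meantime. The names `fake_device_op` and `run` are hypothetical, and the "device" is simulated with sleeping threads. In CUDA, the non-blocking check would typically be a call such as `cudaEventQuery` or `cudaStreamQuery`.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def fake_device_op(name, seconds):
    """Hypothetical stand-in for an asynchronous transfer or kernel launch."""
    time.sleep(seconds)
    return name


def run(device_ops, host_tasks):
    """Poll pending device operations without blocking; run host work meanwhile.

    device_ops: list of (name, duration_in_seconds) pairs (simulated).
    host_tasks: list of zero-argument callables to run on the host.
    Returns (names of completed device operations, host task results).
    """
    completed, host_results = [], []
    host_queue = deque(host_tasks)
    with ThreadPoolExecutor(max_workers=max(1, len(device_ops))) as pool:
        pending = [pool.submit(fake_device_op, n, s) for n, s in device_ops]
        while pending or host_queue:
            # Non-blocking check: done() returns immediately.
            still_pending = []
            for fut in pending:
                if fut.done():
                    completed.append(fut.result())
                else:
                    still_pending.append(fut)
            pending = still_pending
            if host_queue:
                # Use the waiting time to run one unit of host work.
                host_results.append(host_queue.popleft()())
            elif pending:
                # No host work left; yield briefly before polling again.
                time.sleep(0.001)
    return completed, host_results
```

For example, `run([("copy_in", 0.02), ("kernel", 0.01)], [lambda: 1 + 1])` returns once both simulated operations have finished and the host task has run. The order in which the device operations complete depends on timing.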
Bibliographical note: KAUST Repository Item: Exported on 2022-06-24
Acknowledgements: European Commission (HiPEAC-3 Network of Excellence, FP7-ICT 287759), Intel-BSC Exascale Lab and IBM/BSC Exascale Initiative collaboration, Spanish Ministry of Education (FPU), Computación de Altas Prestaciones VI (TIN2012-34557), Generalitat de Catalunya (2014-SGR-1051). We thank KAUST IT Research Computing for granting access to their machines.
This publication acknowledges KAUST support, but has no KAUST-affiliated authors.