Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression

Wagih Halim Boukaram, George Turkiyyah, Hatem Ltaief, David E. Keyes

Research output: Contribution to journal, Article, peer-review

32 Scopus citations


We present high-performance implementations of the QR decomposition and the singular value decomposition (SVD) of a batch of small matrices hosted on the GPU, with applications in the compression of hierarchical matrices. The one-sided Jacobi algorithm is used, for its simplicity and inherent parallelism, as a building block for the SVD of low-rank blocks using randomized methods. We implement multiple kernels based on the level of the GPU memory hierarchy in which the matrices can reside and show substantial speedups against streamed cuSOLVER SVDs. The resulting batched routine is a key component of hierarchical matrix compression, opening up opportunities to perform H-matrix arithmetic efficiently on GPUs.
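To illustrate the one-sided Jacobi building block named in the abstract, here is a minimal CPU sketch in plain Python. It is not the authors' GPU kernels: the function names `jacobi_svd` and `batched_jacobi_svd`, the tolerance, and the sweep limit are all assumptions made for illustration. The routine repeatedly rotates pairs of columns until all columns are mutually orthogonal; the column norms are then the singular values (returned unsorted).

```python
import math

def jacobi_svd(a, tol=1e-12, max_sweeps=30):
    """One-sided Jacobi SVD of a small dense m x n matrix given as a list of rows.

    Returns (u_cols, sigma, v_cols) where u_cols[j] and v_cols[j] are the j-th
    left and right singular vectors and sigma[j] the matching singular value.
    Illustrative sketch only; tol and max_sweeps are assumed defaults.
    """
    m, n = len(a), len(a[0])
    # cols[j] is column j of the working matrix W = A * V.
    cols = [[float(a[i][j]) for i in range(m)] for j in range(n)]
    # v[j] is column j of V, initialised to the identity.
    v = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(x * x for x in cols[p])
                beta = sum(x * x for x in cols[q])
                gamma = sum(x * y for x, y in zip(cols[p], cols[q]))
                # Skip pairs that are already numerically orthogonal.
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue
                rotated = True
                # Rotation angle that zeroes the inner product of the pair.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Apply the same plane rotation to the columns of W and of V.
                for vecs in (cols, v):
                    xp, xq = vecs[p], vecs[q]
                    vecs[p] = [c * x - s * y for x, y in zip(xp, xq)]
                    vecs[q] = [s * x + c * y for x, y in zip(xp, xq)]
        if not rotated:
            break
    sigma = [math.sqrt(sum(x * x for x in col)) for col in cols]
    u_cols = [[x / sv if sv > 0.0 else 0.0 for x in col] for col, sv in zip(cols, sigma)]
    return u_cols, sigma, v

def batched_jacobi_svd(batch):
    """Apply jacobi_svd to every matrix in a batch (sequentially in this sketch)."""
    return [jacobi_svd(a) for a in batch]
```

Each matrix in the batch is independent, and within a sweep the column pairs that share no column can be rotated concurrently, which is the kind of parallelism a GPU implementation can exploit. This sketch simply loops over the batch on the CPU.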
Original language: English (US)
Pages (from-to): 19-33
Number of pages: 15
Journal: Parallel Computing
State: Published - Sep 14 2017

Bibliographical note

KAUST Repository Item: Exported on 2020-10-01
Acknowledgements: The work of all four authors was supported by the Extreme Computing Research Center at the King Abdullah University of Science and Technology. We thank the NVIDIA Corporation for providing access to the P100 GPU used in this work.

