To leverage the extreme parallelism of emerging architectures, so that scientific applications can fulfill their high fidelity and multi-physics potential while sustaining high efficiency relative to the limiting resource, numerical algorithms must be redesigned. Algorithmic redesign is capable of shifting the limiting resource, for example from memory or communication to arithmetic capacity. The benefit of algorithmic redesign expands greatly when introducing a tunable tradeoff between accuracy and resources. Scientific applications from diverse sources rely on dense matrix operations. These operations arise in: Schur complements, integral equations, covariances in spatial statistics, ridge regression, radial basis functions from unstructured meshes, and kernel matrices from machine learning, among others. This thesis demonstrates how to extend the problem sizes that may be treated and to reduce their execution time. Two “universes” of algorithmic innovations have emerged to improve computations by orders of magnitude in capacity and runtime. Each introduces a hierarchy, of rank or precision. Tile Low-Rank approximation replaces blocks of dense operator with those of low rank. Mixed precision approximation, increasingly well supported by contemporary hardware, replaces blocks of high with low precision. Herein, we design new high-performance direct solvers based on the synergism of TLR and mixed precision. Since adapting to data sparsity leads to heterogeneous workloads, we rely on task-based runtime systems to orchestrate the scheduling of fine-grained kernels onto computational resources. We first demonstrate how TLR permits to accelerate acoustic scattering and mesh deformation simulations. Our solvers outperform the state-of-art libraries by up to an order of magnitude. Then, we demonstrate the impact of enabling mixed precision in bioinformatics context. Mixed precision enhances the performance up to three-fold speedup. To facilitate the adoption of task-based runtime systems, we introduce the AL4SAN library to provide a common API for the expression and queueing of tasks across multiple dynamic runtime systems. This library handles a variety of workloads at a low overhead, while increasing user productivity. AL4SAN enables interoperability by switching runtimes at runtime, which permits to achieve a twofold speedup on a task-based generalized symmetric eigenvalue solver.
|Date made available
|KAUST Research Repository