Abstract
Extracting maximum performance of multi-core architectures is a difficult task primarily due to bandwidth limitations of the memory subsystem and its complex hierarchy. In this work, we study the implications of fork-join and data-driven execution models on this type of architecture at the level of task parallelism. For this purpose, we use a highly optimized fork-join based implementation of the FMM and extend it to a data-driven implementation using a distributed task scheduling approach. This study exposes some limitations of the conventional fork-join implementation in terms of synchronization overheads. We find that these are not negligible and their elimination by the data-driven method, with a careful data locality strategy, was beneficial. Experimental evaluation of both methods on state-of-the-art multi-socket multi-core architectures showed up to 22% speed-ups of the data-driven approach compared to the original method. We demonstrate that a data-driven execution of FMM not only improves performance by avoiding global synchronization overheads but also reduces the memory-bandwidth pressure caused by memory-intensive computations. © 2013 Springer-Verlag.
Original language | English (US) |
---|---|
Title of host publication | Lecture Notes in Computer Science |
Publisher | Springer Nature |
Pages | 255-266 |
Number of pages | 12 |
ISBN (Print) | 9783642387494 |
DOIs | |
State | Published - 2013 |
Bibliographical note
KAUST Repository Item: Exported on 2020-10-01ASJC Scopus subject areas
- Theoretical Computer Science
- General Computer Science