Abstract
We present a resilient task-based domain-decomposition preconditioner for partial differential equations (PDEs), built on top of User Level Fault Mitigation Message Passing Interface (ULFM-MPI). The algorithm reformulates the PDE as a sampling problem, followed by a robust regression-based solution update that is resilient to silent data corruptions (SDCs). We adopt a server-client model in which the servers hold all state information, while the clients serve only as computational units. The task-based nature of the algorithm and the capabilities of ULFM complement each other to support missing tasks, making the application resilient to client failures. We present weak and strong scaling results on Edison at the National Energy Research Scientific Computing Center (NERSC) for a nominal and a fault-injected case, showing that even in the presence of faults, scalability tested up to 50k cores is within 90%. We then quantify the variability of weak and strong scaling due to the presence of faults. Finally, we discuss the performance of our application with respect to subdomain size, server/client configuration, and the interplay between energy and resilience.
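The idea of a regression update that tolerates both missing tasks and corrupted samples can be illustrated with a small, generic sketch. The abstract does not give the paper's actual regression formulation, so everything below is an assumption made for illustration: the function name `robust_linear_fit`, the linear model, the Huber-weighted iteratively reweighted least squares (IRLS) scheme, and the synthetic data are all hypothetical stand-ins. Missing client results are represented as NaN, and one sample is overwritten with a large value to mimic an SDC.

```python
import numpy as np

def robust_linear_fit(x, y, n_iter=50, delta=1.0):
    """Fit y ~ a + b*x with Huber-weighted IRLS (illustrative only).

    Samples equal to NaN stand for tasks that never returned and are skipped.
    """
    mask = np.isfinite(y)              # drop samples from missing tasks
    x, y = x[mask], y[mask]
    A = np.column_stack([np.ones_like(x), x])
    w = np.ones_like(y)                # start from ordinary least squares
    coef = np.zeros(2)
    for _ in range(n_iter):
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        r = y - A @ coef
        # Robust residual scale from the median absolute deviation (MAD).
        scale = 1.4826 * np.median(np.abs(r - np.median(r))) + 1e-12
        u = np.abs(r) / (delta * scale)
        w = 1.0 / np.maximum(u, 1.0)   # Huber weights: 1 inside, 1/u outside
    return coef

# Synthetic samples of a hypothetical local solution u(x) = 2 + 3x.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 40)
y = 2.0 + 3.0 * x + 0.01 * rng.standard_normal(x.size)
y[5] = np.nan     # task lost to a client failure
y[17] = np.nan    # another missing task
y[30] = 1.0e6     # silently corrupted sample

coef = robust_linear_fit(x, y)
print("intercept, slope:", coef)
```

In this sketch the corrupted sample is progressively down-weighted rather than detected and removed explicitly, and the missing samples simply reduce the number of rows in the regression. This mirrors, in a simplified way, why a sampling-plus-regression formulation can absorb lost tasks without restarting the computation.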
Original language | English (US) |
---|---|
Title of host publication | Proceedings of ScalA 2016 |
Subtitle of host publication | 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 41-48 |
Number of pages | 8 |
ISBN (Electronic) | 9781509052226 |
State | Published - Jan 30 2017 |
Event | 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2016 - Salt Lake City, United States; Duration: Nov 13 2016 → Nov 18 2016 |
Bibliographical note
Publisher Copyright: © 2016 IEEE.
Keywords
- Client-server systems
- Dynamic voltage scaling
- Fault tolerance
- Partial differential equations
- Resilience
ASJC Scopus subject areas
- Computer Science Applications
- Numerical Analysis
- Software
- Computational Mathematics