A framework for efficient in-network data transfer between a parallel application and an independent storage server is proposed. The case of an unexpected and unrecoverable interruption of the application is considered, in which the server takes the role of an emergency backup service that prevents the unnecessary loss of valuable information. Workload managers such as SLURM provide a time buffer between the interruption and the termination of an application, which the framework can exploit by using RDMA transport and redistributing data by means of the Maestro middleware. Alternatives include state-of-the-art checkpoint-restart mechanisms that rely on a possibly shared storage hierarchy, which suffers from variability and does not scale in general, and in-memory checkpointing, which increases memory consumption considerably. Experiments are performed on an HPE/Cray EX system to construct a heuristic for the amount of data that can realistically be backed up within a given time buffer. With a single server node and up to a hundred user ranks, the method already proves faster than VELOC and plain MPI-IO, and the in-network approach promises better scalability in the long run than filesystem transport.
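The time-buffer idea can be illustrated with a minimal sketch. It is not the paper's framework or the Maestro API. It assumes the workload manager announces the interruption with SIGTERM before forcibly terminating the job, and that the sustained transfer bandwidth is known. The names `GRACE_SECONDS`, `install_handler`, and `backup_budget_bytes` are hypothetical, and the linear bandwidth model is a simplification standing in for the measured heuristic.

```python
import signal
import time

# Assumed length of the time buffer in seconds; the real value is site-specific.
GRACE_SECONDS = 30.0

# Monotonic timestamp by which the backup must finish; None until interrupted.
deadline = None


def on_sigterm(signum, frame):
    """Record the backup deadline when the interruption signal arrives."""
    global deadline
    deadline = time.monotonic() + GRACE_SECONDS


def install_handler():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, on_sigterm)


def backup_budget_bytes(bandwidth_bytes_per_s, grace_s, setup_s=0.0):
    """Estimate how many bytes fit into the time buffer.

    Simple linear model: usable time (buffer minus setup overhead)
    multiplied by the sustained bandwidth. Returns 0 if the setup
    overhead consumes the whole buffer.
    """
    usable_s = max(0.0, grace_s - setup_s)
    return int(bandwidth_bytes_per_s * usable_s)


if __name__ == "__main__":
    install_handler()
    # Hypothetical numbers: 10 GB/s sustained bandwidth, 2 s setup overhead.
    print(backup_budget_bytes(10e9, GRACE_SECONDS, setup_s=2.0))
```

An application would compare the size of its critical data against this budget to decide what to send to the backup server once the signal has been received.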
|Original language||English (US)|
|Title of host publication||2022 IEEE/ACM Third International Symposium on Checkpointing for Supercomputing (SuperCheck)|
|State||Published - Jan 27 2023|
Bibliographical note: KAUST Repository Item: Exported on 2023-01-31
Acknowledgements: This work is supported by the HPE/Cray/KAUST center of excellence collaboration. We want to thank Timothy Dykes and Utz-Uwe Haus from the EMEA research lab at Hewlett Packard Enterprise for insightful discussions and suggestions.
This publication acknowledges KAUST support, but has no KAUST affiliated authors.