Abstract
A computer usable program product for accelerating recovery in an MPI environment is provided in the illustrative embodiments. A first portion of a distributed application executes using a first processor and a second portion using a second processor in a distributed computing environment. After a failure of operation of the first portion, the first portion is restored to a checkpoint. A first part of the first portion is distributed to a third processor and a second part to a fourth processor. A computation of the first portion is performed using the first and the second parts in parallel. A first message is computed in the first portion and sent to the second portion, the message having been initially computed after a time of the checkpoint. A second message is replayed from the second portion without computing the second message in the second portion.
Original language | English (US) |
---|---|
Patent number | US2012226939 |
IPC | G06F 11/ 14 A I |
Priority date | 04/16/12 |
State | Published - Sep 6 2012 |
Externally published | Yes |