LAM/MPI General User's Mailing List Archives

From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2009-03-26 11:44:00


On Sun, 8 Mar 2009, Yenke Blaise Omer wrote:

> I have conducted a lot of experiments that have led me to conclude that the
> checkpoint time of a parallel application of n processes with aggregated
> memory size S is almost the double of the checkpoint time of a sequential
> application with memory size S.
> Is there any explanation to this?

Sorry about the slow reply, but LAM is in maintenance mode and the paying
job keeps me busy.

The answer is "lots of stuff". In a serial application, there's no
synchronization necessary, so as soon as the checkpoint request is issued,
streaming data to disk can begin. In a parallel application, there's
coordination to make sure there are no messages in flight and that the
application is in a consistent state. Only then can the processes begin
streaming data to disk.
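
To make the ordering concrete, here is a toy sketch in Python. It is not
LAM's checkpoint code: threads stand in for MPI processes, a barrier stands
in for the coordination protocol, and the function name and the "quiesced" /
"writing" labels are made up for illustration. The only point is that no
process starts writing until every process has quiesced.

```python
import threading

def coordinated_checkpoint(n_procs):
    """Toy model: every worker must reach the barrier before any writes."""
    events = []                      # ordered log of (phase, rank)
    log_lock = threading.Lock()
    barrier = threading.Barrier(n_procs)

    def worker(rank):
        # Phase 1: quiesce (stand-in for draining in-flight messages).
        with log_lock:
            events.append(("quiesced", rank))
        # Phase 2: block until every other worker has quiesced too.
        barrier.wait()
        # Phase 3: only now may this worker "stream its image to disk".
        with log_lock:
            events.append(("writing", rank))

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(n_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events

if __name__ == "__main__":
    for phase, rank in coordinated_checkpoint(4):
        print(phase, rank)
```

A serial application has no equivalent of phase 2, so its write can start
as soon as the request arrives.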

Since there are multiple processes, there are multiple checkpoint streams,
which means a different I/O pattern to the disk system. Interleaved
strides are not always handled well by disk subsystems, so that might slow
things down.

Doubling the time would surprise me unless the application was run on a
large number of nodes and had a large number of messages in flight, or the
serial memory image was small. But there definitely will be overhead -
it's the nature of parallel programming.
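
A back-of-the-envelope model shows why image size matters. This is a
sketch under assumptions, not a measurement: it treats the coordination
cost as a fixed time and folds the I/O pattern penalty into a single
efficiency factor. The function names, parameters, and the numbers in the
example are all hypothetical.

```python
def serial_checkpoint_time(image_bytes, disk_bw):
    """Serial case: streaming starts immediately at full bandwidth."""
    return image_bytes / disk_bw

def parallel_checkpoint_time(image_bytes, disk_bw, coord_time,
                             io_efficiency=1.0):
    """Parallel case: fixed coordination cost, then a possibly slower stream.

    io_efficiency (0 < value <= 1) is an assumed factor for how well the
    disk system copes with several interleaved checkpoint streams.
    """
    return coord_time + image_bytes / (disk_bw * io_efficiency)

def slowdown(image_bytes, disk_bw, coord_time, io_efficiency=1.0):
    """Ratio of parallel to serial checkpoint time for the same total size."""
    return (parallel_checkpoint_time(image_bytes, disk_bw, coord_time,
                                     io_efficiency)
            / serial_checkpoint_time(image_bytes, disk_bw))

if __name__ == "__main__":
    bw = 100e6  # assumed disk bandwidth in bytes per second
    # Large image: the fixed coordination cost is amortized.
    print(slowdown(1e9, bw, coord_time=0.5, io_efficiency=0.8))
    # Small image: the same coordination cost dominates.
    print(slowdown(10e6, bw, coord_time=0.5, io_efficiency=0.8))
```

With these made-up numbers the large image comes out well under a factor of
two, while the small image is several times slower than the serial case,
because the coordination time no longer disappears into the write time.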

Brian

-- 
   Brian Barrett
   LAM/MPI Developer
   Make today a LAM/MPI day!