
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-11-26 14:39:53


Sorry for taking so long to answer -- my INBOX gets swamped with mails
and I sometimes miss stuff. :-\

On Nov 13, 2004, at 9:37 AM, Neville Lee wrote:

> Hi there,
> I'm doing a project on process migration for MPI programs. I've
> managed to migrate a process by modifying the checkpoint files
> produced by BLCR and restart the MPI application. I'm not sure if it's
> the best way, so there are some questions here.
> Suppose there are 4 processes running on host001-004.
> The checkpoint files are:
> context.1000
> context.1000-n0-1001
> context.1000-n1-1002
> context.1000-n2-1003
> context.1000-n3-1004
> And I want to migrate process rank 3 from n3 to n2
> Here are the steps:
> 1. Rename context.1000-n3-1004 to context.1000-n2-1004. This is not
> necessary but it's for better consistency in filenames.
> 2. In all checkpoint files, look for a target struct _gps where
> _gps.gps_node=3, _gps.gps_pid=1004, _gps.gps_idx=11, _gps.gps_rank=3.
> _gps.gps_idx is the index of the process in the local LAM daemon. It
> can be obtained by "mpitask -gps". In this scenario, all four
> processes have the same index 11. Change the first element gps_node
> from 3 to 2. This step can be done by scanning the files byte by
> byte, constructing a struct from the bytes read, and memcmp'ing it
> against the target struct mentioned above.
> 3. In file context.1000, find the string "context.1000-n3-1004" and
> modify it to "context.1000-n2-1004". This reflects the rename in step
> 1. Find the string "n3 /usr/local/blcr/bin/cr_restart
> /home/neville/context.1000-n3-1004", replace "n3" with "n2" and
> "context.1000-n3-1004" with "context.1000-n2-1004". I assume this is
> the actual command used to restart each process.
> 4. In file context.1000-n2-1004, find the string "<n3/1004/3>",
> replace it with "<n2/1004/3>". I assume this string is actually
> "<$NODE/$PID/$RANK>". Also find "/tmp/lam-neville_at_host004" and replace
> it with "/tmp/lam-neville_at_host003". This seems to be LAM's
> temporary directory.
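For reference, the byte-scan in step 2 could be sketched roughly as follows. This is only a sketch under assumptions: it assumes _gps is four consecutive native-endian 32-bit integers in the order (gps_node, gps_pid, gps_idx, gps_rank); check LAM's headers for the real layout and field names before relying on it.

```python
import struct

# Assumed layout of LAM's struct _gps: four 32-bit native-endian ints in
# the order (gps_node, gps_pid, gps_idx, gps_rank).  This is a guess from
# the description above -- verify against the LAM source for your build.
GPS = struct.Struct("=iiii")

def patch_gps(path, old, new):
    """Replace every occurrence of the byte pattern for `old` (a
    (node, pid, idx, rank) tuple) with `new` in the file at `path`.
    Returns the number of occurrences replaced."""
    target = GPS.pack(*old)
    replacement = GPS.pack(*new)
    with open(path, "rb") as f:
        data = f.read()
    count = data.count(target)
    if count:
        with open(path, "wb") as f:
            f.write(data.replace(target, replacement))
    return count

# Example: move rank 3 (pid 1004, idx 11) from node 3 to node 2:
# patch_gps("context.1000-n2-1004", (3, 1004, 11, 3), (2, 1004, 11, 3))
```

Note that a whole-file read-and-replace like this is simpler than a byte-by-byte loop, but it still touches every byte of a large checkpoint, so it does not address the efficiency concern raised below.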

We had thought about doing process migration, but really didn't add any
of the extra bookkeeping to do it (yet). Since we're now spending the
majority of our time on Open MPI, it's unlikely that we'll add the
necessary bookkeeping to LAM, but we will eventually add it to Open MPI.

That being said, it'll likely be a while before it's ready in Open MPI.
If you'd like to poke around and add this kind of functionality to the
LAM/MPI source code yourself, you could potentially save yourself a lot
of error-prone post-mortem scanning of the checkpoint files, and we'd
be happy to tell you where to look in the LAM source code (a
conversation probably better suited for the lam-devel list).

> My question is, there must be many defects in this solution. Here
> are some I know:
> After migrating process rank 3 from n3 to n2, there are two
> processes on that node, pid 1004 and 1003. Since I don't modify the
> value of gps_idx, they have the same index. However, the program
> restarts and completes successfully. So, what is the use of gps_idx?
> Do their values affect the running of MPI programs? And, is it safe
> not to modify gps_idx or have duplicate gps_idx on one host?

You're probably ok because the idx is mainly used for out-of-band
communication, which, in an MPI application, usually only happens in 3
places:

- MPI_INIT
- MPI-2 dynamic process functionality
- MPI_FINALIZE

MPI_INIT has clearly already completed, and MPI-2 dynamic functionality is
disallowed (i.e., our checkpoint / restart stuff doesn't support it).
MPI_FINALIZE out-of-band issues are taken care of by the re-launched
mpirun. So it isn't really *right*, but you're ok in this situation.

> Also, I noticed that if I restart an MPI program from checkpoint files,
> "mpitask -gps" can no longer show the process information. I figure
> that because the program is not started by mpirun, the process table
> is not registered in the lamd or the like. Is that correct?

Yes.

> For all those find-and-replace operations, since I'm not familiar
> with the checkpoint files or memory layout, I have to scan through a
> file looking for a struct or string. This is ok with small programs,
> but extremely inefficient for programs with huge data (e.g.
> multiplying huge matrices). Is there a way to analyze the checkpoint
> files and find the offsets of these variables quickly?

I don't know. We simply invoke BLCR to dump the process state; LAM has
no knowledge of the internal structure of that file.

> My testing cluster has only four nodes, so there is only one digit
> in the node id, such as "n0", "n1". What if the user wants to migrate
> a process from "n0" to "n10", thus replacing "n0 /usr/local..." with
> "n10 /usr/local..."? These strings, are they fixed length or
> dynamically allocated? Is it possible to shift the characters after
> "n0" one byte to the right to make space for "n10", overwriting one
> zero byte following the string?

I doubt it.

There are some more things that would really need to be done to make true
process migration possible. It's always possible to do something like:

- lamboot some-hostfile
- mpirun ...
- checkpoint
- lamclean
- lamhalt
- lamboot some-hostfile-with-same-topology-as-first-one
- mpirun ...
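Made concrete, that sequence might look roughly like the following. This is a sketch under assumptions, not tested commands: the -ssi flags are the LAM 7-style way of selecting the checkpoint-capable RPI and BLCR, the mpirun PID of 1000 matches the context filenames above, and my_mpi_app and the hostfile names are placeholders.

```shell
lamboot some-hostfile

# Run the job with the checkpoint/restart-capable transport and BLCR
# selected (flags assumed from LAM 7's SSI conventions)
mpirun C -ssi rpi crtcp -ssi cr blcr ./my_mpi_app &

# Checkpoint the whole job by checkpointing mpirun (PID 1000 here);
# this produces context.1000 plus one context file per rank
cr_checkpoint 1000

lamclean
lamhalt

# Boot a universe with the same topology, then restart from the
# mpirun context file
lamboot some-hostfile-with-same-topology-as-first-one
cr_restart context.1000
```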

So as long as the second universe has the same topology as the first,
the files will restart cleanly. Extra work will need to be done, as
you noticed, to have a different topology upon restart. Given that I
took so long to reply, if you're still working on this and want to hear
more, ping me on the lam-devel list and I'll point you in the right
direction about what would need to be done.

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/