
LAM/MPI General User's Mailing List Archives


From: Neville Lee (neville.lee_at_[hidden])
Date: 2004-11-13 09:37:20


Hi there,
    I'm doing a project on process migration for MPI programs. I've
managed to migrate a process by modifying the checkpoint files produced
by BLCR and restarting the MPI application. I'm not sure if this is the
best approach, so I have some questions.
    Suppose there are 4 processes running on host001-004.
    The checkpoint files are:
    context.1000
    context.1000-n0-1001
    context.1000-n1-1002
    context.1000-n2-1003
    context.1000-n3-1004
    And I want to migrate process rank 3 from n3 to n2
    Here are the steps:
    1. Rename context.1000-n3-1004 to context.1000-n2-1004. This is not
strictly necessary, but it keeps the filenames consistent.
    2. In all checkpoint files, look for a target struct _gps where
_gps.gps_node=3, _gps.gps_pid=1004, _gps.gps_idx=11, _gps.gps_rank=3.
_gps.gps_idx is the index of the process in the local LAM daemon; it can
be obtained with "mpitask -gps". In this scenario, all four processes
have the same index, 11. Change the first element, gps_node, from 3 to
2. This step can be done by scanning the files byte by byte,
constructing a struct from the bytes read, and memcmp'ing it against the
target struct mentioned above.
    3. In file context.1000, find the string "context.1000-n3-1004" and
change it to "context.1000-n2-1004"; this reflects the rename in step 1.
Then find the string "n3 /usr/local/blcr/bin/cr_restart
/home/neville/context.1000-n3-1004" and replace "n3" with "n2" and
"context.1000-n3-1004" with "context.1000-n2-1004". I assume this is the
actual command used to restart each process.
    4. In file context.1000-n2-1004, find the string "<n3/1004/3>" and
replace it with "<n2/1004/3>". I assume this string is actually
"<$NODE/$PID/$RANK>". Also find "/tmp/lam-neville_at_host004" and replace
it with "/tmp/lam-neville_at_host003". This seems to be LAM's temporary
directory.

    My concern is that there must be many defects in this solution. Here
are some I know of:
    After migrating process rank 3 from n3 to n2, there are two
processes on that node, with pids 1004 and 1003. Since I don't modify
the value of gps_idx, they have the same index. However, the program
restarts and completes successfully. So, what is gps_idx used for? Does
its value affect the running of MPI programs? And is it safe not to
modify gps_idx, or to have duplicate gps_idx values on one host?
    I also noticed that if I restart an MPI program from checkpoint
files, "mpitask -gps" can no longer show the process information. I
figure that is because the program is not started by mpirun, so the
process table is not registered in lamd or the like. Is that correct?
    For all these find-and-replace operations, since I'm not familiar
with the checkpoint files or the memory layout, I have to scan through
the whole file looking for a struct or string. This is fine for small
programs, but extremely inefficient for programs with huge data (e.g.
multiplying huge matrices). Is there a way to analyze the checkpoint
files and find the offsets of these variables quickly?
    My testing cluster has only four nodes, so node ids are single
digits, such as "n0" and "n1". What if the user wants to migrate a
process from "n0" to "n10", thus replacing "n0 /usr/local..." with "n10
/usr/local..."? Are these strings fixed-length or dynamically allocated?
Is it possible to shift the characters after "n0" one byte to the right
to make room for "n10", overwriting one zero byte following the string?

    Apart from the above points, are there any other possible problems?
I would really appreciate your comments.

Regards,
Neville