LAM/MPI General User's Mailing List Archives

From: Douglas A. Vechinski (douglas.vechinski_at_[hidden])
Date: 2003-11-13 15:24:03


I had a similar situation with an MPI application I was running on a
local LAN where users may reboot their machines to Windows, and I posted
a similar question here but never received any suggestions or replies.
The LAM daemon can supposedly run in a fault-tolerant mode, but I
haven't been able to find information on what that actually does. So in
the interim I set out to see if there was anything I could do myself. My
current solution is as follows:

First, I modified my MPI code to catch SIGTERM: when the signal handler
is invoked, the process sends a special message to the master process
saying it is "leaving the scene," followed by an MPI_Finalize.

Next, I wrote a shell script that looks for /tmp/lam-* directories.
Under each of these it looks for a lam file, which appears to hold the
PIDs of the various jobs the LAM daemon is running. The script reads
these in, skips the first one (which appears to be the LAM daemon
itself; we don't want to kill it yet), and issues a "kill -TERM pid#"
for each of the rest. Hence the reason for the first step. You may wish
to place a pause (sleep) at the end of this script to give any clean-up
steps time to execute.
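A sketch of that kill_lam_jobs script, written as a function so the
base directory can be overridden. The PID-file name pattern ("lam*")
is inferred from the description above; check what your LAM version
actually writes under /tmp/lam-* before relying on it.

```shell
#!/bin/sh
# Sketch of the kill_lam_jobs script described in step 2.
kill_lam_jobs() {
    base="${1:-/tmp}"
    for dir in "$base"/lam-*; do
        [ -d "$dir" ] || continue
        for pidfile in "$dir"/lam*; do
            [ -f "$pidfile" ] || continue
            first=1
            while read -r pid _; do
                if [ "$first" = 1 ]; then
                    first=0            # first PID is the daemon; keep it
                    continue
                fi
                kill -TERM "$pid" 2>/dev/null
            done < "$pidfile"
        done
    done
    sleep 2    # let the SIGTERM handlers notify the master and finalize
}
```

Calling `kill_lam_jobs` with no argument scans /tmp, matching the
behavior described above.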

Finally, all that is necessary is to get this kill_lam_jobs script to
execute during shutdown. I had a heck of a time finding the proper
time/place to do this: we want it to execute before the network
interface is disabled. On a Linux system I created a
"/etc/rc0.d/K02local" link pointing to the "kill_lam_jobs" script
described in step 2, and another in /etc/rc6.d. These links are
executed during a shutdown or halt procedure, i.e. runlevels 0 and 6.
This was on Red Hat systems; other Linux distributions will vary, as
will other Unixes. I had to use "local" in the name, otherwise it
wouldn't work without altering the rc script that runs during a
runlevel change.
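Concretely, the link setup described above might look like the
following on a Red Hat-style SysV init layout. The /etc/init.d location
for the script itself is my assumption; the K02local link names come
from the description.

```shell
# Run as root. Install the script, then link it into the halt (0) and
# reboot (6) runlevels so it fires before networking goes down.
install -m 755 kill_lam_jobs /etc/init.d/kill_lam_jobs
ln -s /etc/init.d/kill_lam_jobs /etc/rc0.d/K02local   # runlevel 0: halt
ln -s /etc/init.d/kill_lam_jobs /etc/rc6.d/K02local   # runlevel 6: reboot
```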

It's not as elegant as I would like, but it solved my immediate problem
at the time. Until I can find out more about LAM's fault tolerance,
it'll do for me.

Hopefully this may aid in your situation.

YoungHui Amend wrote:

> I’m getting hung jobs in my parallel run when a child machine is
> powered down. Here is the scenario:
>
> My main parent process was sending data to child processes via
> mpi_send when one of the child machines was powered down. The main
> process is sitting in the mpi_send since it's a blocking send. All
> other children continued running their current "batch" of work and
> then went to sleep. The main parent process also went to sleep. I
> left it in this state for about 10 minutes, with no changes to any of
> the processes or machines.
>
> Then, I powered up the previously killed machine, and as soon as it
> got its IP address, the parent process came back to life. It started
> consuming 100% of the CPU. All of the remaining children stayed
> asleep. The parent process kept consuming CPU time without printing
> any output to the log. I left it this way until it had consumed 105
> minutes of CPU time.
>
> Here is what my debugger trace is showing me:
>
> #0 0x13b15c2e in _tcp_adv1 () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
> #1 0x13b147d4 in _rpi_c2c_advance () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
> #2 0x13b35275 in _mpi_req_advance () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
> #3 0x13b35cf2 in lam_send () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
> #4 0x13b4062c in MPI_Send () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
>
> Why is the main process hung in _tcp_adv1 after the child machine was
> restarted? What is the main process doing when it's consuming 100% of
> the CPU after the child machine was restarted?
>
> Is there some sort of timeout for mpi_send so that if the target
> (child) machine does not respond, it can return some error code so the
> application can end gracefully?
>
> Believe it or not, we have a customer that uses our code with LAM/MPI
> implementation where their machines go down all the time.
>
> Thanks in advance for your help.
>
> YoungHui Amend
>
>------------------------------------------------------------------------
>
>_______________________________________________
>This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
>