Hi,
I suspect that your parent process is consuming 100% of the CPU because
of a bug in the LAM daemon. We have fixed this bug in lam-7.0.3, which
is scheduled to be released in a couple of days. I will be more
confident of this if you can send me your LAM configuration options
and the architecture you are running these processes on.
I see two aspects to your situation. The first is that after the child
comes back up, the parent should be able to deliver the pending
messages. With the bug fix, this should happen. The second is the
ability of the parent to give up its retransmission attempts if the
child doesn't come up for a long time. The problem with handling this
at the LAM level is that, according to the MPI standard, MPI_Send
should never fail or give up its attempt to deliver the message.
So a send would fail only when the child stays down for a long time and
the keepalive timer of the parent's TCP connection expires. The socket
write() error will then be propagated up the MPI stack. AFAIR, the
default value of the keepalive timer is 2 hours, so it might help to
turn down the default value of this timer in your kernel.
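
To make the timer arithmetic concrete, here is a minimal C sketch,
assuming a Linux host. The helper names keepalive_detect_seconds and
set_keepalive are hypothetical, and this is not how LAM configures its
own sockets. LAM opens its TCP connections internally, so an
application cannot apply the per-socket call to them; it only
illustrates which knobs the keepalive timer consists of.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Worst-case time (in seconds) before an idle connection to a dead peer
 * is declared broken: the idle period, followed by `probes` unanswered
 * probes spaced `intvl` seconds apart.
 */
static long keepalive_detect_seconds(long idle, long intvl, long probes)
{
    return idle + intvl * probes;
}

/*
 * Hypothetical helper: enable keepalive on one socket and override the
 * kernel defaults for that socket only. TCP_KEEPIDLE, TCP_KEEPINTVL and
 * TCP_KEEPCNT are Linux-specific option names.
 * Returns 0 on success, -1 if any setsockopt() call fails.
 */
static int set_keepalive(int fd, int idle, int intvl, int probes)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof probes) < 0)
        return -1;
    return 0;
}
```

With the usual Linux defaults (7200 seconds idle, 75 seconds between
probes, 9 probes), keepalive_detect_seconds(7200, 75, 9) gives 7875
seconds, a little over 2 hours. On Linux the system-wide defaults live
in /proc/sys/net/ipv4/tcp_keepalive_time, tcp_keepalive_intvl and
tcp_keepalive_probes, and lowering those is the kernel-level change
meant above.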
The other option is to use the lamd fault-tolerant mode (-x). I am not
sure how this works, but it may notice that the node is down and kill
the MPI job.
Thanks.
--
Shashwat Srivastav
LAM / MPI Developer (http://www.lam-mpi.org)
Indiana University
http://www.cs.indiana.edu/~ssrivast
On Thursday, Nov 13, 2003, at 12:08 America/Chicago, YoungHui Amend
wrote:
> I'm getting hung jobs in my parallel run when a child machine is
> powered down. Here is the scenario:
>
> My main parent process was sending data to child processes via
> mpi_send when one of the child machines was powered down. The main
> process is sitting in the mpi_send since it's a blocking send. All
> other children continued running their current "batch" of work and
> then went to sleep. The main parent process also went to sleep. I left
> it in this state for about 10 minutes, with no changes to any of the
> processes or machines.
>
> Then, I powered up the previously killed machine, and as soon as it
> got its IP address, the parent process came back to life. It started
> consuming 100% of the CPU. All of the remaining children stayed
> asleep. The parent process kept consuming CPU time without printing
> any output to the log. I left it this way until it consumed 105
> minutes.
>
> Here is what my debugger trace is showing me:
>
> #0 0x13b15c2e in _tcp_adv1 () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
> #1 0x13b147d4 in _rpi_c2c_advance () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
> #2 0x13b35275 in _mpi_req_advance () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
> #3 0x13b35cf2 in lam_send () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
> #4 0x13b4062c in MPI_Send () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
>
> Why is the main process hung up in _tcp_adv1 when the child machine
> was restarted? What is the main process doing while it's consuming
> 100% of the CPU after the child machine was restarted?
>
> Is there some sort of timeout for mpi_send so that if the target
> (child) machine does not respond, it can return some error code so the
> application can end gracefully?
>
> Believe it or not, we have a customer that uses our code with the
> LAM/MPI implementation, where their machines go down all the time.
>
> Thanks in advance for your help.
>
> YoungHui Amend
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>