
LAM/MPI General User's Mailing List Archives


From: YoungHui Amend (yamend_at_[hidden])
Date: 2003-11-13 13:08:41


 

 

I'm getting hung jobs in my parallel run when a child machine is powered
down. Here is the scenario:

 

My main parent process was sending data to child processes via mpi_send
when one of the child machines was powered down. The main process is
sitting in the mpi_send since it's a blocking send. All other children
continued running their current "batch" of work and then went to sleep.
The main parent process also went to sleep. I left it in this state for
about 10 minutes, with no changes to any of the processes or machines.

 

Then I powered up the previously killed machine, and as soon as it got
its IP address, the parent process came back to life. It started
consuming 100% of the CPU. All of the remaining children stayed asleep.
The parent process kept consuming CPU time without printing any output
to the log. I left it this way until it had consumed 105 minutes of CPU
time.

 

Here is what my debugger trace is showing me:

#0 0x13b15c2e in _tcp_adv1 () from
   /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
#1 0x13b147d4 in _rpi_c2c_advance () from
   /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
#2 0x13b35275 in _mpi_req_advance () from
   /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
#3 0x13b35cf2 in lam_send () from
   /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
#4 0x13b4062c in MPI_Send () from
   /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so

 

 

Why is the main process stuck in _tcp_adv1 after the child machine was
restarted? What is the main process doing while it consumes 100% of the
CPU after the restart?

 

Is there some sort of timeout for mpi_send so that if the target (child)
machine does not respond, it can return some error code so the
application can end gracefully?
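
To make the kind of timeout I have in mind concrete: the standard MPI_Send
call takes no timeout argument, so one pattern I could imagine is to start
the send with MPI_Isend and then poll the request with MPI_Test until my own
deadline passes. The sketch below shows only the deadline loop, in plain C.
The names wait_with_timeout, poll_fn, never_done and done_after_three are
hypothetical, and the two callbacks are stand-ins for a function that would
call MPI_Test on the request. Whether LAM/MPI lets the application shut down
cleanly after such a timeout is exactly what I don't know.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns nonzero once the operation is done.
 * In an MPI program it would wrap MPI_Test() on the request that
 * MPI_Isend() returned (assumption; no MPI calls are made in this sketch). */
typedef int (*poll_fn)(void *ctx);

/* Seconds on a monotonic clock, so wall-clock adjustments do not matter. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Polls until poll() reports completion or timeout_sec elapses.
 * Returns 0 on completion, -1 on timeout. */
int wait_with_timeout(poll_fn poll, void *ctx, double timeout_sec)
{
    assert(poll != NULL);
    const double deadline = now_seconds() + timeout_sec;
    const struct timespec pause = { 0, 10 * 1000 * 1000 }; /* 10 ms */

    for (;;) {
        if (poll(ctx))
            return 0;              /* operation completed in time */
        if (now_seconds() >= deadline)
            return -1;             /* give up; caller decides what to do */
        nanosleep(&pause, NULL);   /* sleep briefly instead of spinning */
    }
}

/* Stand-in for a send to a dead machine: never completes. */
int never_done(void *ctx)
{
    (void)ctx;
    return 0;
}

/* Stand-in for a healthy send: completes on the third poll.
 * ctx points to an int call counter owned by the caller. */
int done_after_three(void *ctx)
{
    int *calls = ctx;
    return ++*calls >= 3;
}
```

With these stand-ins, wait_with_timeout(never_done, NULL, 0.05) gives up and
returns -1, while wait_with_timeout(done_after_three, &calls, 5.0) returns 0
after three polls. The short sleep between polls is there so that the waiting
process does not burn CPU while it waits.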

 

Believe it or not, we have a customer who uses our code with the LAM/MPI
implementation, and their machines go down all the time.

 

 

Thanks in advance for your help.

 

YoungHui Amend