LAM/MPI General User's Mailing List Archives

From: John Korah (j_korah_at_[hidden])
Date: 2003-11-18 11:58:36


Hi,
I am trying to install LAM 7.0.3 on an Ultra 5
running Solaris 5.9.
When I ran
./configure --prefix=/guest --with-threads=posix

I got the following errors:

configure: WARNING: *** Problem running configure test!
configure: WARNING: *** See config.log for details.
configure: error: *** Cannot continue.

Help !!
John Korah
CS Virginia Tech
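
The configure output above points to config.log for the real cause. Below is a minimal sketch, in Python, of one way to pull out the relevant part. It assumes the usual autoconf behaviour of writing a line reading "configure: failed program was:" before dumping each test program that failed, so the lines just before the last such marker normally hold the command that was run and its error output. The function name `last_failed_test` and the `context` size are illustrative and not part of LAM or autoconf.

```python
from pathlib import Path

# Marker that autoconf-generated configure scripts normally write to
# config.log before dumping a test program that failed.
FAILED_MARKER = "configure: failed program was:"


def last_failed_test(log_text, context=15):
    """Return the lines leading up to the last failed-test marker.

    `context` is how many preceding lines to keep; those lines normally
    contain the command that was run and its error output.
    Returns an empty list if the log has no failure marker.
    """
    lines = log_text.splitlines()
    hits = [i for i, line in enumerate(lines) if FAILED_MARKER in line]
    if not hits:
        return []
    last = hits[-1]
    return lines[max(0, last - context):last + 1]


if __name__ == "__main__":
    log = Path("config.log")
    if log.exists():
        # errors="replace" keeps odd bytes in compiler output from
        # aborting the scan
        text = log.read_text(errors="replace")
        print("\n".join(last_failed_test(text)))
```

Run from the directory where ./configure was invoked, this prints the tail of the last failing test, which is usually the part worth posting to the list along with the configure command line.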

--- Shashwat Srivastav <ssrivast_at_[hidden]> wrote:
> Hi,
>
> I suspect that the reason your parent process is consuming 100% of
> the CPU is a bug in the LAM daemon. We have fixed this bug in
> lam-7.0.3, which is scheduled to be released in a couple of days. I
> will be more confident of this if you can send me your LAM
> configuration options and the architecture you are running these
> processes on.
>
> There are two aspects of your situation that I see. The first is that
> after the child comes up, the parent should be able to deliver the
> pending messages. With the bug fix, this should happen. The second is
> the ability of the parent to give up its retransmission attempts if
> the child doesn't come up for a long time. The problem with handling
> this at the LAM level is that, according to the MPI standard, MPI_Send
> should never fail or give up its attempt to deliver the message.
>
> So sending would fail only when the child doesn't come up for a long
> time and the keep-alive timer of the parent's TCP connection expires.
> The socket write() error will then be propagated up the MPI stack.
> AFAIR, the default value of the keep-alive timer is 2 hours. So it
> might help if you turn down the default value of this timer in your
> kernel.
>
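
To illustrate what the keep-alive timer mentioned above controls, here is a minimal Python sketch that enables TCP keep-alive on a single socket and, where the platform exposes the options, shortens its timers. This is not a LAM API: LAM opens its own sockets, which is why the reply suggests lowering the kernel-wide default instead. The function name `enable_keepalive` and the values 60/10/5 are assumptions chosen for illustration, and the per-socket options `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` are platform-specific.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Turn on TCP keep-alive for one socket and, where supported,
    shorten its timers.

    idle/interval/probes are illustrative values, not LAM defaults.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket timer options below are platform-specific (Linux
    # has all three); skip any that this platform does not expose.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    on = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
    print("keep-alive on:", bool(on))
    s.close()
```

With settings like these, a peer that has silently disappeared would be detected after roughly idle + interval * probes seconds of silence, rather than after the much longer system default.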
> The other option is to use the lamd fault-tolerant mode (-x). I am not
> sure how this works, but it may notice that the node is down and kill
> the MPI job.
>
> Thanks.
> --
> Shashwat Srivastav
> LAM / MPI Developer (http://www.lam-mpi.org)
> Indiana University
> http://www.cs.indiana.edu/~ssrivast
>
> On Thursday, Nov 13, 2003, at 12:08 America/Chicago, YoungHui Amend wrote:
>
> > I’m getting hung jobs in my parallel run when a child machine is
> > powered down. Here is the scenario:
> >
> > My main parent process was sending data to child processes via
> > mpi_send when one of the child machines was powered down. The main
> > process is sitting in the mpi_send since it’s a blocking send. All
> > other children continued running their current "batch" of work and
> > then went to sleep. The main parent process also went to sleep. I
> > left it in this state for about 10 minutes, with no changes to any
> > of the processes or machines.
> >
> > Then, I powered up the previously killed machine, and as soon as it
> > got its IP address, the parent process came back to life. It started
> > consuming 100% of the CPU. All of the remaining children stayed
> > asleep. The parent process kept consuming CPU time without printing
> > any output to the log. I left it this way until it consumed 105
> > minutes.
> >
> > Here is what my debugger trace is showing me:
> >
> > #0 0x13b15c2e in _tcp_adv1 () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
> > #1 0x13b147d4 in _rpi_c2c_advance () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
> > #2 0x13b35275 in _mpi_req_advance () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
> > #3 0x13b35cf2 in lam_send () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
> > #4 0x13b4062c in MPI_Send () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
> >
> > Why is the main process hung in _tcp_adv1 when the child machine
> > was restarted? What is the main process doing while it’s consuming
> > 100% of CPU time after the child machine was restarted?
> >
> > Is there some sort of timeout for mpi_send so that if the target
> > (child) machine does not respond, it can return some error code so
> > the application can end gracefully?
> >
> > Believe it or not, we have a customer that uses our code with the
> > LAM/MPI implementation whose machines go down all the time.
> >
> > Thanks in advance for your help.
> >
> > YoungHui Amend
> >
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
