Thanks for replying.
I have attached my config.log file.
John
--- Shashwat Srivastav <ssrivast_at_[hidden]> wrote:
> Hi,
>
> It's difficult to tell what the problem is without
> additional information. Can you send us the full output
> of the configure command and the config.log file from
> the LAM source directory, after you run the configure
> command?
>
> Thanks.
> --
> Shashwat Srivastav
> LAM / MPI Developer (http://www.lam-mpi.org)
> Indiana University
> http://www.cs.indiana.edu/~ssrivast
>
> On Tuesday, Nov 18, 2003, at 10:58 America/Chicago,
> John Korah wrote:
>
> > Hi,
> > I am trying to install LAM 7.0.3 on an Ultra 5
> > running Solaris 5.9. When I run
> >
> > ./configure --prefix=/guest --with-threads=posix
> >
> > I get the following errors:
> >
> > configure: WARNING: *** Problem running configure test!
> > configure: WARNING: *** See config.log for details.
> > configure: error: *** Cannot continue.
> >
> > Help !!
> > John Korah
> > CS Virginia Tech
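A configure failure like this usually means that one of configure's small
test programs failed to compile, link, or run; config.log records which test
it was and the compiler output. As a rough illustration only (this is not
the actual test program configure generates), a POSIX-threads probe for
--with-threads=posix boils down to a C program like the following, which
must build and run with the platform's thread flags (e.g. -lpthread) for
the check to pass:

    #include <pthread.h>
    #include <stdio.h>

    /* Hypothetical stand-in for configure's pthread check:
     * create a thread, join it, and report success. */
    static void *worker(void *arg)
    {
        (void) arg;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        if (pthread_create(&tid, NULL, worker, NULL) != 0) {
            fprintf(stderr, "pthread_create failed\n");
            return 1;
        }
        pthread_join(tid, NULL);
        printf("pthread check ok\n");
        return 0;
    }

If a program like this cannot be built by hand with the same compiler and
flags, the problem is in the toolchain or thread library rather than in LAM
itself; config.log will show the exact failure.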
> >
> > --- Shashwat Srivastav <ssrivast_at_[hidden]> wrote:
> >> Hi,
> >>
> >> I suspect that the reason your parent process is
> >> consuming 100% of the CPU is a bug in the LAM daemon.
> >> We have fixed this bug in lam-7.0.3, which is
> >> scheduled to be released in a couple of days. I will
> >> be more confident of this if you can send me your LAM
> >> configuration options and the architecture you are
> >> running these processes on.
> >>
> >> There are two aspects of your situation that I see.
> >> The first is that after the child comes up, the parent
> >> should be able to deliver the pending messages; with
> >> the bug fix, this should happen. The second is the
> >> ability of the parent to give up its retransmission
> >> attempts if the child doesn't come up for a long time.
> >> The problem with handling this at the LAM level is
> >> that, according to the MPI standard, MPI_Send should
> >> never fail or give up its attempt to deliver the
> >> message.
> >>
> >> So sending would fail only when the child doesn't come
> >> up for a long time and the keep-alive timer of the
> >> parent's TCP connection expires. The socket write()
> >> error will then be propagated up the MPI stack. AFAIR,
> >> the default value of the keep-alive timer is 2 hours,
> >> so it might help if you can lower the default value of
> >> this timer in your kernel.
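To make that eventual transport error visible in the application instead of
having the job simply abort, one option is to switch the communicator's
error handler to MPI_ERRORS_RETURN and check the return code of the blocking
send. The following is a minimal sketch, not anything from the original
program (the payload and ranks are placeholders):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, rc, payload = 42;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* By default MPI aborts the job on error; MPI_ERRORS_RETURN
         * makes errors (for example a TCP write() failure after the
         * keep-alive timer expires) come back as a return code. */
        MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        if (rank == 0) {
            /* Blocks until the message is delivered or the
             * underlying transport reports a hard error. */
            rc = MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            if (rc != MPI_SUCCESS) {
                char msg[MPI_MAX_ERROR_STRING];
                int len;
                MPI_Error_string(rc, msg, &len);
                fprintf(stderr, "MPI_Send failed: %s\n", msg);
            }
        } else if (rank == 1) {
            MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     &status);
        }

        MPI_Finalize();
        return 0;
    }

Until that hard error actually occurs, MPI_Send simply keeps trying to
deliver the message, which is consistent with the behaviour described above.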
> >>
> >> The other option is to use the lamd fault-tolerant
> >> mode (-x). I am not sure how this works, but it may
> >> notice that the node is down and kill the MPI job.
> >>
> >> Thanks.
> >> --
> >> Shashwat Srivastav
> >> LAM / MPI Developer (http://www.lam-mpi.org)
> >> Indiana University
> >> http://www.cs.indiana.edu/~ssrivast
> >>
> >> On Thursday, Nov 13, 2003, at 12:08 America/Chicago,
> >> YoungHui Amend wrote:
> >>
> >>>
> >>> I'm getting hung jobs in my parallel run when a
> >>> child machine is powered down. Here is the scenario:
> >>>
> >>> My main parent process was sending data to child
> >>> processes via mpi_send when one of the child machines
> >>> was powered down. The main process is sitting in the
> >>> mpi_send since it's a blocking send. All other
> >>> children continued running their current "batch" of
> >>> work and then went to sleep. The main parent process
> >>> also went to sleep. I left it in this state for about
> >>> 10 minutes, with no changes to any of the processes
> >>> or machines.
> >>>
> >>> Then, I powered up the previously killed machine,
> >>> and as soon as it got its IP address, the parent
> >>> process came back to life. It started consuming 100%
> >>> of the CPU. All of the remaining children stayed
> >>> asleep. The parent process kept consuming CPU time
> >>> without printing any output to the log. I left it
> >>> this way until it had consumed 105 minutes of CPU
> >>> time.
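The blocking pattern described above can be reproduced with a very small
test program. This is a sketch only; BATCH_SIZE, NBATCHES, WORK_TAG and the
loop structure are placeholders, not taken from the original application.
The parent hands batches to each child with a blocking MPI_Send, so as soon
as one child's machine goes away the parent blocks inside that send while
the remaining children sit in MPI_Recv waiting for their next batch:

    #include <mpi.h>
    #include <string.h>

    #define BATCH_SIZE 1024   /* placeholder batch size */
    #define NBATCHES   100    /* placeholder number of batches */
    #define WORK_TAG   1      /* placeholder message tag */

    int main(int argc, char **argv)
    {
        int rank, size, child, i;
        double batch[BATCH_SIZE];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        memset(batch, 0, sizeof(batch));

        if (rank == 0) {
            /* Parent: hand each batch to every child in turn.  If one
             * child's node is powered off, the blocking MPI_Send to it
             * never returns and the later children get no more work. */
            for (i = 0; i < NBATCHES; i++) {
                for (child = 1; child < size; child++) {
                    MPI_Send(batch, BATCH_SIZE, MPI_DOUBLE,
                             child, WORK_TAG, MPI_COMM_WORLD);
                }
            }
        } else {
            /* Child: receive a batch, process it, then sleep inside
             * MPI_Recv waiting for the next one. */
            for (i = 0; i < NBATCHES; i++) {
                MPI_Recv(batch, BATCH_SIZE, MPI_DOUBLE, 0, WORK_TAG,
                         MPI_COMM_WORLD, &status);
            }
        }

        MPI_Finalize();
        return 0;
    }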
> >>>
> >>> Here is what my debugger trace is showing me:
> >>>
> >>> #0  0x13b15c2e in _tcp_adv1 () from
> >>>     /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
> >>> #1  0x13b147d4 in _rpi_c2c_advance () from
> >>>     /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
> >>> #2  0x13b35275 in _mpi_req_advance () from
> >>>     /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
> >>> #3  0x13b35cf2 in lam_send () from
> >>>     /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
> >>> #4  0x13b4062c in MPI_Send () from
> >>>     /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
> >>>
> >>> Why is the main process hung up in _tcp_adv1 when the
> >>> child machine was
=== message truncated ===