Hi,
It's difficult to tell what the problem is without additional
information. Could you send us the full output of the configure
command and the config.log file from the LAM source directory
after running configure?
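
For example, rerunning your configure line and capturing everything it
prints (the configure.out name is just a suggestion):

  ./configure --prefix=/guest --with-threads=posix 2>&1 | tee configure.out

The output of the failing test is usually near the end of config.log,
which configure leaves in the top-level LAM source directory.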
Thanks.
--
Shashwat Srivastav
LAM / MPI Developer (http://www.lam-mpi.org)
Indiana University
http://www.cs.indiana.edu/~ssrivast
On Tuesday, Nov 18, 2003, at 10:58 America/Chicago, John Korah wrote:
> Hi,
> I am trying to install LAM 7.0.3 on an Ultra 5
> running Solaris 5.9.
> When I ran
> ./configure --prefix=/guest --with-threads=posix
>
> I got the following errors:
>
> configure: WARNING: *** Problem running configure
> test!
> configure: WARNING: *** See config.log for details.
> configure: error: *** Cannot continue.
>
> Help!
> John Korah
> CS Virginia Tech
>
> --- Shashwat Srivastav <ssrivast_at_[hidden]> wrote:
>> Hi,
>>
>> I suspect that the reason your parent process is
>> consuming 100% of the CPU is a bug in the LAM
>> daemon. We have fixed this bug in LAM 7.0.3, which
>> is scheduled for release in a couple of days. I
>> will be more confident of this if you can send me
>> your LAM configuration options and the architecture
>> you are running these processes on.
>>
>> There are two aspects to your situation. The first
>> is that after the child comes back up, the parent
>> should be able to deliver the pending messages;
>> with the bug fix, this should happen. The second is
>> the parent's ability to give up its retransmission
>> attempts if the child doesn't come up for a long
>> time. The problem with handling this at the LAM
>> level is that, according to the MPI standard,
>> MPI_Send must never fail or give up its attempt to
>> deliver the message.
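>>
>> (As a rough sketch only, not anything LAM provides: an application
>> that must survive dead peers can give itself a deadline by using
>> MPI_Isend and polling with MPI_Test, then aborting when the deadline
>> passes. The function name and the 60-second timeout below are mine.)
>>
>>   #include <mpi.h>
>>
>>   #define SEND_TIMEOUT_SECS 60.0
>>
>>   /* Returns 0 on success, -1 if the send did not complete in time. */
>>   int send_with_deadline(void *buf, int count, MPI_Datatype type,
>>                          int dest, int tag, MPI_Comm comm)
>>   {
>>       MPI_Request req;
>>       MPI_Status  st;
>>       int done = 0;
>>       double start = MPI_Wtime();
>>
>>       MPI_Isend(buf, count, type, dest, tag, comm, &req);
>>       while (!done) {
>>           MPI_Test(&req, &done, &st);
>>           if (!done && MPI_Wtime() - start > SEND_TIMEOUT_SECS) {
>>               /* Peer is presumed dead. Cancelling a send is only
>>                * best-effort in MPI, so the caller should treat -1
>>                * as fatal. */
>>               MPI_Cancel(&req);
>>               MPI_Request_free(&req);
>>               return -1;
>>           }
>>       }
>>       return 0;
>>   }
>>
>> The busy-wait in MPI_Test is the price of the deadline; a caller would
>> typically respond to -1 by calling MPI_Abort(comm, 1).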
>>
>> So a send would fail only when the child doesn't
>> come up for a long time and the keep-alive timer of
>> the parent's TCP connection expires. The socket
>> write() error will then be propagated up the MPI
>> stack. AFAIR, the default value of the keep-alive
>> timer is 2 hours, so it might help you to turn down
>> the default value of this timer in your kernel.
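>>
>> (For illustration, assuming your nodes run Solaris: the keep-alive
>> interval is a system-wide ndd parameter in milliseconds, defaulting
>> to 7200000, i.e. 2 hours. To lower it to one minute:
>>
>>   # ndd -set /dev/tcp tcp_keepalive_interval 60000
>>
>> On Linux the equivalent knob is net.ipv4.tcp_keepalive_time, in
>> seconds, settable with sysctl. Keep-alive probes are only sent on
>> sockets that have SO_KEEPALIVE enabled.)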
>>
>> The other option is to use lamd's fault-tolerant
>> mode (-x). I am not sure exactly how this works,
>> but it may notice that the node is down and kill
>> the MPI job.
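>>
>> (If I remember the option correctly, fault tolerance is requested
>> when booting the LAM universe, e.g.:
>>
>>   lamboot -x hostfile
>>
>> Check lamboot(1) for your version before relying on this.)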
>>
>> Thanks.
>> --
>> Shashwat Srivastav
>> LAM / MPI Developer (http://www.lam-mpi.org)
>> Indiana University
>> http://www.cs.indiana.edu/~ssrivast
>>
>> On Thursday, Nov 13, 2003, at 12:08 America/Chicago,
>> YoungHui Amend
>> wrote:
>>
>>> I'm getting hung jobs in my parallel run when a
>>> child machine is powered down. Here is the scenario:
>>>
>>> My main parent process was sending data to child
>>> processes via MPI_Send when one of the child
>>> machines was powered down. The main process sat in
>>> MPI_Send, since it's a blocking send. All the other
>>> children finished their current "batch" of work and
>>> then went to sleep, as did the main parent process.
>>> I left everything in this state for about 10
>>> minutes, with no changes to any of the processes or
>>> machines.
>>>
>>> Then I powered the previously killed machine back
>>> up, and as soon as it got its IP address, the
>>> parent process came back to life and started
>>> consuming 100% of the CPU. All of the remaining
>>> children stayed asleep. The parent process kept
>>> consuming CPU time without printing any output to
>>> the log. I left it this way until it had consumed
>>> 105 minutes of CPU time.
>>>
>>> Here is what my debugger trace is showing me:
>>>
>>> #0 0x13b15c2e in _tcp_adv1 () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
>>> #1 0x13b147d4 in _rpi_c2c_advance () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
>>> #2 0x13b35275 in _mpi_req_advance () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
>>> #3 0x13b35cf2 in lam_send () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
>>> #4 0x13b4062c in MPI_Send () from /tool/cbar/apps_lipc24/dft/testbench-2003.2/tools/tb/2003/lam//lib/libmpi.so
>>>
>>> Why is the main process hung in _tcp_adv1 after the
>>> child machine was restarted? What is the main
>>> process doing when it's consuming 100% of the CPU
>>> time after the restart?
>>>
>>> Is there some sort of timeout for MPI_Send so that,
>>> if the target (child) machine does not respond, it
>>> can return an error code and the application can
>>> end gracefully?
>>>
>>> Believe it or not, we have a customer that uses our
>>> code with the LAM/MPI implementation whose machines
>>> go down all the time.
>>>
>>> Thanks in advance for your help.
>>>
>>> YoungHui Amend