LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-11-10 08:08:10


This is quite odd -- I have not heard of such a problem before. It
sounds like a race condition (lamboot fails without -d but works with
it), but the boot algorithm is linear, so I'm not quite sure how that
could happen.

1. In the linux->solaris case, what is the rest of the error message
that is shown? It says that LAM tried to execute something, but...?

2. In the solaris->linux case, can you run ps on both machines and see
what is executing? Is ssh still running? Did hboot / lamd actually get
launched on the remote node? If you wait long enough, does lamboot
time out? Does "-d" make it work in this case, too?
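
For item 2, something along these lines might be a starting point. It is only a rough sketch: it assumes `ps -ef` is available on both machines (it normally is on Linux and Solaris), and `lam_boot_procs` is a helper name made up for this example, not part of LAM.

```shell
# Filter a process listing (read from stdin) down to the processes that
# matter while lamboot is running: ssh, hboot, and lamd.
# "lam_boot_procs" is a hypothetical helper name, not a LAM command.
lam_boot_procs() {
    # Match the command name as a whole word, with or without a path,
    # so that "sshd" is not mistaken for "ssh".
    grep -E '(^|[ /])(ssh|hboot|lamd)( |$)' || true
}

# On the machine where lamboot was started:
ps -ef | lam_boot_procs

# On the other machine (hostname taken from the original post), run the
# same check remotely while lamboot is still hanging:
# ssh pg1cluster01 ps -ef | lam_boot_procs
```

If ssh shows up on the local side but neither hboot nor lamd shows up on the remote side, that would suggest the remote launch never got started; if lamd is running remotely, the hang is more likely in the connection back to the origin node.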

On Nov 8, 2004, at 8:37 AM, manumachu reddy wrote:

> Hi,
>
> I have LAM-7.1.1 installed on a Solaris and a Linux machine. The
> installation details are shown below:
>
> Linux machine
> ---------------------
> $uname -a
> Linux pg1cluster01 2.6.8-1.521smp #1 SMP Mon Aug 16 09:25:06 EDT 2004 i686 i686 i386 GNU/Linux
> $laminfo
> LAM/MPI: 7.1.1
> Prefix: /home/cs/manredd/lam-7.1.1/lam-7.1.1/LAM-Linux-2.6.8-1.521smp
> Architecture: i686-pc-linux-gnu
> Configured by: manredd
> Configured on: Mon Nov 1 10:50:20 GMT 2004
> Configure host: pg1cluster01
> Memory manager: ptmalloc2
> C bindings: yes
> C++ bindings: yes
> Fortran bindings: yes
> C compiler: gcc
> C++ compiler: g++
> Fortran compiler: g77
> Fortran symbols: double_underscore
> C profiling: yes
> C++ profiling: yes
> Fortran profiling: yes
> C++ exceptions: no
> Thread support: yes
> ROMIO support: yes
> IMPI support: no
> Debug support: no
> Purify clean: no
> SSI boot: globus (API v1.1, Module v0.6)
> SSI boot: rsh (API v1.1, Module v1.1)
> SSI boot: slurm (API v1.1, Module v1.0)
> SSI coll: lam_basic (API v1.1, Module v7.1)
> SSI coll: shmem (API v1.1, Module v1.0)
> SSI coll: smp (API v1.1, Module v1.2)
> SSI rpi: crtcp (API v1.1, Module v1.1)
> SSI rpi: lamd (API v1.0, Module v7.1)
> SSI rpi: sysv (API v1.0, Module v7.1)
> SSI rpi: tcp (API v1.0, Module v7.1)
> SSI rpi: usysv (API v1.0, Module v7.1)
> SSI cr: self (API v1.0, Module v1.0)
>
> Solaris machine
> ---------------------
> $uname -a
> SunOS csultra01 5.9 Generic_112233-10 sun4u sparc SUNW,Ultra-5_10
> $laminfo
> LAM/MPI: 7.1.1
> Prefix: /home/cs/manredd/lam-7.1.1/lam-7.1.1/LAM-SunOS-5.9
> Architecture: sparc-sun-solaris2.9
> Configured by:
> Configured on: Tue Nov 2 14:19:31 GMT 2004
> Configure host: csultra01
> Memory manager: none
> C bindings: yes
> C++ bindings: yes
> Fortran bindings: yes
> C compiler: gcc
> C++ compiler: g++
> Fortran compiler: g77
> Fortran symbols: double_underscore
> C profiling: yes
> C++ profiling: yes
> Fortran profiling: yes
> C++ exceptions: no
> Thread support: yes
> ROMIO support: yes
> IMPI support: no
> Debug support: no
> Purify clean: no
> SSI boot: globus (API v1.1, Module v0.6)
> SSI boot: rsh (API v1.1, Module v1.1)
> SSI boot: slurm (API v1.1, Module v1.0)
> SSI coll: lam_basic (API v1.1, Module v7.1)
> SSI coll: shmem (API v1.1, Module v1.0)
> SSI coll: smp (API v1.1, Module v1.2)
> SSI rpi: crtcp (API v1.1, Module v1.1)
> SSI rpi: lamd (API v1.0, Module v7.1)
> SSI rpi: sysv (API v1.0, Module v7.1)
> SSI rpi: tcp (API v1.0, Module v7.1)
> SSI rpi: usysv (API v1.0, Module v7.1)
> SSI cr: self (API v1.0, Module v1.0)
>
> I have a lamboot hostfile that lists both machines.
>
> $cat $HOME/lamtopo/Linux_Solaris
> pg1cluster01
> csultra01
>
> 'ssh' works fine between the two machines and is set up to not
> prompt for the password.
>
> When I try to lamboot from the Linux machine, I get the error:
>
> lamboot on Linux machine
> --------------------------------------
> $lamboot -v $HOME/lamtopo/Linux_Solaris
>
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
> n-1<18055> ssi:boot:base:linear: booting n0 (pg1cluster01)
> n-1<18055> ssi:boot:base:linear: booting n1 (csultra01)
> -----------------------------------------------------------------------------
> LAM failed to execute a process on the remote node "csultra01".
> ...
>
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
> n-1<18055> ssi:boot:base:linear: booting n0 (pg1cluster01)
> n-1<18055> ssi:boot:base:linear: booting n1 (csultra01)
> -----------------------------------------------------------------------------
> LAM failed to execute a process on the remote node "csultra01".
> ...
> -----------------------------------------------------------------------------
> n-1<18060> ssi:boot:base:linear: Failed to boot n1 (csultra01)
> n-1<18060> ssi:boot:base:linear: aborted!
> lamboot did NOT complete successfully
>
> When I try to lamboot from the Solaris machine, it hangs.
>
> lamboot on Solaris machine
> --------------------------------------
> $lamboot -v $HOME/lamtopo/Solaris_Linux
>
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
> n-1<1704> ssi:boot:base:linear: booting n0 (csultra01)
> n-1<1704> ssi:boot:base:linear: booting n1 (pg1cluster01)
> HANGS
>
> But using the '-d' switch, lamboot works fine. MPI applications
> also run successfully.
>
> $ lamboot -d -v ~/lamtopo/csultra01_pg1
> < ...lots of diagnostics... >
> $ lamnodes
> n0 csultra01.ucd.ie:1:origin,this_node
> n1 pg1cluster01.ucd.ie:1:
>
> Could you please let me know if you have experienced this problem
> before? Is there any solution?
>
> Thanks and Regards,
> Ravi.
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/