LAM/MPI General User's Mailing List Archives

From: Keith Stevens (kstevens_at_[hidden])
Date: 2005-01-20 12:51:07


Hello,

We have been trying to get LAM 7.0.6, compiled with the Intel C Compiler
v8.0, running on Red Hat Linux 9 with recent updates. We are also using
OSCAR 3.0 for cluster management.

Basically, the problem is that after we compile LAM 7.0.6 with ICC and try
to lamboot a couple of nodes along with the head node, lamboot fails with
shared library errors from the LAM binaries. We have compiled this version
of LAM in the same manner with GCC as well, and it works fine.

Please help!

Below are the outputs of the environment and other settings. I have also
attached the config.log from when we compiled and installed LAM 7.0.6 with
ICC.

Thanks,

Keith

 

ldd output:
---------------

ldd /scratch/iccLAM/bin/lamboot
        libaio.so.1 => /usr/lib/libaio.so.1 (0x40025000)
        libutil.so.1 => /lib/libutil.so.1 (0x40027000)
        libm.so.6 => /lib/tls/libm.so.6 (0x4002a000)
        libcprts.so.5 => /opt/intel_cc_80/lib/libcprts.so.5 (0x4004c000)
        libcxa.so.5 => /opt/intel_cc_80/lib/libcxa.so.5 (0x400ca000)
        libunwind.so.5 => /opt/intel_cc_80/lib/libunwind.so.5 (0x400f0000)
        libpthread.so.0 => /lib/tls/libpthread.so.0 (0x400f7000)
        libc.so.6 => /lib/tls/libc.so.6 (0x42000000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
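
The ldd output above was taken on the head node, where every library resolves. A minimal sketch of how the same check might be repeated on the compute node follows. It assumes the LAM install lives at the same /scratch/iccLAM path on ccbm-cn01 and that ssh works without a password; `missing_libs` is a hypothetical helper name, not part of LAM.

```shell
# Hypothetical helper: read ldd output on stdin and print the name of
# every library that ldd reports as "not found".
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Assumed usage (same install path on the remote node is an assumption):
#   ssh -x -n ccbm-cn01 ldd /scratch/iccLAM/bin/hboot | missing_libs
```

If the helper prints nothing, ldd resolved every library in that non-interactive ssh session.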

 

laminfo output:
---------------------
 laminfo
           LAM/MPI: 7.0.6
            Prefix: /scratch/iccLAM
      Architecture: i686-pc-linux-gnu
     Configured by: talmas
     Configured on: Tue Dec 28 11:08:20 EST 2004
    Configure host: ccbm-hn01.ccbm.jhu.edu
        C bindings: yes
      C++ bindings: yes
  Fortran bindings: yes
       C profiling: yes
     C++ profiling: yes
 Fortran profiling: yes
     ROMIO support: yes
      IMPI support: no
     Debug support: no
      Purify clean: no
          SSI boot: globus (Module v0.5)
          SSI boot: rsh (Module v1.0)
          SSI coll: lam_basic (Module v7.0)
          SSI coll: smp (Module v1.0)
           SSI rpi: crtcp (Module v1.0.1)
           SSI rpi: lamd (Module v7.0)
           SSI rpi: sysv (Module v7.0)
           SSI rpi: tcp (Module v7.0)
           SSI rpi: usysv (Module v7.0)

 

recon output:
-------------------
recon
-----------------------------------------------------------------------------
Woo hoo!

recon has completed successfully. This means that you will most likely
be able to boot LAM successfully with the "lamboot" command (but this
is not a guarantee). See the lamboot(1) manual page for more
information on the lamboot command.

If you have problems booting LAM (with lamboot) even though recon
worked successfully, enable the "-d" option to lamboot to examine each
step of lamboot and see what fails. Most situations where recon
succeeds and lamboot fails have to do with the hboot(1) command (that
lamboot invokes on each host in the hostfile).
-----------------------------------------------------------------------------

lamboot output:
-----------------------
lamboot -v hostfile

LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University

n-1<14947> ssi:boot:base:linear: booting n0 (ccbm-hn01)
n-1<14947> ssi:boot:base:linear: booting n1 (ccbm-cn01)
ERROR: LAM/MPI unexpectedly received the following on stderr:
hboot: error while loading shared libraries: libcprts.so.5: cannot open shared object file: No such file or directory
-----------------------------------------------------------------------------
LAM failed to execute a LAM binary on the remote node "ccbm-cn01".
Since LAM was already able to determine your remote shell as "hboot",
it is probable that this is not an authentication problem.

LAM tried to use the remote agent command "/usr/bin/ssh"
to invoke the following command:

        /usr/bin/ssh -x ccbm-cn01 -n hboot -t -c lam-conf.lamd -v -s -I "-H 192.168.137.63 -P 45518 -n 1 -o 0"

This can indicate several things. You should check the following:

        - The LAM binaries are in your $PATH
        - You can run the LAM binaries
        - The $PATH variable is set properly before your
          .cshrc/.profile exits

Try to invoke the command listed above manually at a Unix prompt.

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<14947> ssi:boot:base:linear: Failed to boot n1 (ccbm-cn01)
n-1<14947> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
n-1<14953> ssi:boot:base:linear: booting n0 (ccbm-hn01)
n-1<14953> ssi:boot:base:linear: booting n1 (ccbm-cn01)
ERROR: LAM/MPI unexpectedly received the following on stderr:
tkill: error while loading shared libraries: libcprts.so.5: cannot open shared object file: No such file or directory
-----------------------------------------------------------------------------
LAM failed to execute a LAM binary on the remote node "ccbm-cn01".
Since LAM was already able to determine your remote shell as "tkill",
it is probable that this is not an authentication problem.

LAM tried to use the remote agent command "/usr/bin/ssh"
to invoke the following command:

        /usr/bin/ssh -x ccbm-cn01 -n tkill -v

This can indicate several things. You should check the following:

        - The LAM binaries are in your $PATH
        - You can run the LAM binaries
        - The $PATH variable is set properly before your
          .cshrc/.profile exits

Try to invoke the command listed above manually at a Unix prompt.

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<14953> ssi:boot:base:linear: Failed to boot n1 (ccbm-cn01)
n-1<14953> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully

hostfile:
----------
ccbm-hn01
ccbm-cn01
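
The lamboot output above names libcprts.so.5, which ldd on the head node resolves under /opt/intel_cc_80/lib. One check that might follow from lamboot's own checklist is whether that directory is on the library search path seen by a non-interactive ssh session on ccbm-cn01. This is only a sketch: it assumes the Intel libraries are installed at the same path on the compute node, and `path_contains` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if directory $1 is an entry in the
# colon-separated search path $2 (for example $LD_LIBRARY_PATH).
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Assumed usage (the remote library location is an assumption):
#   remote_ld=$(ssh -x -n ccbm-cn01 'echo "$LD_LIBRARY_PATH"')
#   path_contains /opt/intel_cc_80/lib "$remote_ld" \
#       || echo "Intel lib directory not in remote LD_LIBRARY_PATH"
```

An empty or different value on the remote side would not prove the cause, but it would show that the environment hboot and tkill run in differs from the head node's interactive shell.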

 

<<...>>