Somehow it didn't work with the attachments, so here it is again without them,
but harder to read.
I am trying to run lam-7.1.3 with InfiniBand. My configure line looks like this:
---------------------
./configure
--prefix=/home/pub/OpenFOAM/OpenFOAM-1.4/src/lam-7.1.3/platforms/linux64Gcc4DPOpt
--with-rpi-ib=/usr/ibgd/driver/infinihost --with-rpi=ib --enable-shared
--disable-static --without-romio --without-mpi2cpp --without-profiling
--without-fc --without-tm --with-boot=rsh --with-rsh=ssh -x
---------------------
This compiles without a problem, but unfortunately I can't switch off the "SSI
boot: tm" module, can I?
This can be seen in the output of laminfo, which gives the following:
---------------------
LAM/MPI: 7.1.3
Prefix:
/home/pub/OpenFOAM/OpenFOAM-1.4/src/lam-7.1.3/platforms/linux64Gcc4DPOpt
Architecture: x86_64-unknown-linux-gnu
Configured by: klosterm
Configured on: Tue Jun 26 22:32:28 CEST 2007
Configure host: stokes
Memory manager: ptmalloc2
C bindings: yes
C++ bindings: no
Fortran bindings: no
C compiler: gcc
C++ compiler: g++
Fortran compiler: false
Fortran symbols: none
C profiling: no
C++ profiling: no
Fortran profiling: no
C++ exceptions: no
Thread support: yes
ROMIO support: no
IMPI support: no
Debug support: no
Purify clean: no
SSI boot: globus (API v1.1, Module v0.6)
SSI boot: rsh (API v1.1, Module v1.1)
SSI boot: slurm (API v1.1, Module v1.0)
SSI boot: tm (API v1.1, Module v1.1)
SSI coll: lam_basic (API v1.1, Module v7.1)
SSI coll: shmem (API v1.1, Module v1.0)
SSI coll: smp (API v1.1, Module v1.2)
SSI rpi: crtcp (API v1.1, Module v1.1)
SSI rpi: ib (API v1.1, Module v1.0)
SSI rpi: lamd (API v1.0, Module v7.1)
SSI rpi: sysv (API v1.0, Module v7.1)
SSI rpi: tcp (API v1.0, Module v7.1)
SSI rpi: usysv (API v1.0, Module v7.1)
SSI cr: self (API v1.0, Module v1.0)
---------------------
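To look only at the boot modules instead of reading the whole listing, the laminfo output can be filtered. This is just a minimal sketch: the helper name list_boot_modules is hypothetical, and it assumes the "SSI boot: name (...)" line format shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the names of the SSI boot modules found in
# laminfo-style output read from stdin.
# Assumes lines of the form "SSI boot: <name> (API ..., Module ...)".
list_boot_modules() {
    awk '$1 == "SSI" && $2 == "boot:" { print $3 }'
}

# Usage (on a host where laminfo is in $PATH):
#   laminfo | list_boot_modules
```

If tm still shows up in that list after a rebuild, that would suggest the --without-tm option was not honored for this build.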
And here is my problem: lamboot is asking for libtorque.so.0, which seems to
be related to the Torque batch system. Since our cluster doesn't use any batch
system, I would like to switch off the tm module (this is the reason I used
--without-tm as a configure option, which obviously did not work):
---------------------
lamboot -v -ssi boot rsh ./knotenliste_lam
LAM 7.1.3 - Indiana University
n-1<6016> ssi:boot:base:linear: booting n0 (stokes)
n-1<6016> ssi:boot:base:linear: booting n1 (node13)
ERROR: LAM/MPI unexpectedly received the following on stderr:
hboot: error while loading shared libraries: libtorque.so.0: cannot open shared
object file: No such file or directory
-----------------------------------------------------------------------------
LAM failed to execute a LAM binary on the remote node "node13".
Since LAM was already able to determine your remote shell as "hboot",
it is probable that this is not an authentication problem.
*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.
LAM tried to use the remote agent command "ssh"
to invoke the following command:
ssh -x node13 -n hboot -t -c lam-conf.lamd -v -s -I '"-H 139.20.53.201
-P 29989 -n 1 -o 0"'
This can indicate several things. You should check the following:
- The LAM binaries are in your $PATH
- You can run the LAM binaries
- The $PATH variable is set properly before your
.cshrc/.profile exits
Try to invoke the command listed above manually at a Unix prompt.
You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.
When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<6016> ssi:boot:base:linear: Failed to boot n1 (node13)
n-1<6016> ssi:boot:base:linear: aborted!
n-1<6022> ssi:boot:base:linear: booting n0 (stokes)
n-1<6022> ssi:boot:base:linear: booting n1 (node13)
ERROR: LAM/MPI unexpectedly received the following on stderr:
tkill: error while loading shared libraries: libtorque.so.0: cannot open shared
object file: No such file or directory
-----------------------------------------------------------------------------
LAM failed to execute a LAM binary on the remote node "node13".
Since LAM was already able to determine your remote shell as "tkill",
it is probable that this is not an authentication problem.
*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.
LAM tried to use the remote agent command "ssh"
to invoke the following command:
ssh -x node13 -n tkill -v
This can indicate several things. You should check the following:
- The LAM binaries are in your $PATH
- You can run the LAM binaries
- The $PATH variable is set properly before your
.cshrc/.profile exits
Try to invoke the command listed above manually at a Unix prompt.
You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.
When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<6022> ssi:boot:base:linear: Failed to boot n1 (node13)
n-1<6022> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully
klosterm_at_stokes:/home/pub/infiniband/tests> ssh -x node13 -n tkill
tkill: error while loading shared libraries: libtorque.so.0: cannot open shared
object file: No such file or directory
---------------------
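To see exactly which shared libraries a LAM binary fails to resolve, its ldd output can be filtered for unresolved entries. This is a rough sketch only: the helper name missing_libs is hypothetical, and it assumes the usual glibc ldd format ("libfoo.so.0 => not found") and a POSIX shell on node13.

```shell
#!/bin/sh
# Hypothetical helper: read ldd output on stdin and print the names of
# shared libraries that the dynamic linker could not resolve.
# Assumes the usual glibc ldd format: "libfoo.so.0 => not found".
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage, run by hand (binary names taken from the errors above):
#   ldd "$(command -v hboot)" | missing_libs
#   ssh -x node13 -n 'ldd "$(command -v tkill)"' | missing_libs
```

An empty result on the frontend and libtorque.so.0 on node13 would point to a library that is installed on one host but not on the other.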
The funny thing is that lamboot with just localhost works on the frontend:
--------------------
lamboot -v -ssi boot rsh
LAM 7.1.3 - Indiana University
n-1<7868> ssi:boot:base:linear: booting n0 (localhost)
n-1<7868> ssi:boot:base:linear: finished
--------------------
but not on node13:
--------------------
lamboot -v -ssi boot rsh
lamboot: error while loading shared libraries: libtorque.so.0: cannot open
shared object file: No such file or directory
klosterm_at_node13:~>
klosterm_at_node13:~> LAM 7.1.3 - Indiana University
-bash: LAM: command not found
klosterm_at_node13:~>
klosterm_at_node13:~> n-1<7868> ssi:boot:base:linear: booting n0 (localhost)
-bash: syntax error near unexpected token `7868'
klosterm_at_node13:~> n-1<7868> ssi:boot:base:linear: finished
-bash: syntax error near unexpected token `7868'
--------------------
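Since the same command behaves differently on the two hosts, one thing worth comparing is whether libtorque.so.0 is known to the dynamic linker on each of them. The sketch below is only an illustration: the helper name lib_in_cache is hypothetical, and it assumes ldconfig lives in /sbin on both machines. A library found only through LD_LIBRARY_PATH would not appear in this cache listing.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the library name given as $1 appears in
# `ldconfig -p`-style output read from stdin.
lib_in_cache() {
    grep -F -q -- "$1"
}

# Usage, run by hand from the frontend:
#   /sbin/ldconfig -p | lib_in_cache libtorque.so.0 && echo "found on frontend"
#   ssh -x node13 -n /sbin/ldconfig -p | lib_in_cache libtorque.so.0 || echo "missing on node13"
```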
Any help is appreciated.
With regards,
Jens