Yes, you are correct that libtorque is related to the tm boot SSI and
the Torque queueing system. I think you want to use --without-boot-tm;
that should deactivate the boot tm module. Failing that, you should be
able to rm the $prefix/lib/lam/*boot_tm* files (that's from memory --
double-check before rm'ing!). It should be fairly obvious which files
to remove -- there should be a .lo and a .la that have "boot" and "tm"
in their names. These are the TM plugins; if you remove them, LAM won't
have any knowledge of the tm system and you should be fine.
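To illustrate the manual-removal route, here's a sketch -- the prefix and filenames below are illustrative stand-ins (the exact plugin names on your install may differ), so always list the matches before deleting:

```shell
# Assume $prefix is the --prefix you gave to configure.
# The plugin filenames here are illustrative, not the real ones;
# this sketch creates stand-ins so the glob has something to match.
prefix=/tmp/lam-demo
mkdir -p "$prefix/lib/lam"
touch "$prefix/lib/lam/lam_ssi_boot_tm.la" \
      "$prefix/lib/lam/lam_ssi_boot_tm.lo"

ls "$prefix"/lib/lam/*boot*tm*   # double-check which files match first
rm "$prefix"/lib/lam/*boot*tm*   # then remove the tm boot plugin files
```

On a real install you'd point $prefix at the directory you configured with, and only the ls/rm pair applies.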
On Jun 26, 2007, at 11:21 PM, Jens.Klostermann_at_[hidden]
wrote:
> Somehow it didn't work with the attachments, so here it is again
> without them, but harder to read.
>
> I am trying to run lam-7.1.3 with InfiniBand. My configuration looks like:
> ---------------------
> ./configure
> --prefix=/home/pub/OpenFOAM/OpenFOAM-1.4/src/lam-7.1.3/platforms/linux64Gcc4DPOpt
> --with-rpi-ib=/usr/ibgd/driver/infinihost --with-rpi=ib --enable-shared
> --disable-static --without-romio --without-mpi2cpp --without-profiling
> --without-fc --without-tm --with-boot=rsh --with-rsh=ssh -x
> ---------------------
>
> This compiles without a problem, but unfortunately I can't switch off
> the "SSI boot: tm" module, can I?
>
> This can be seen by laminfo, which gives the following:
> ---------------------
> LAM/MPI: 7.1.3
> Prefix: /home/pub/OpenFOAM/OpenFOAM-1.4/src/lam-7.1.3/platforms/linux64Gcc4DPOpt
> Architecture: x86_64-unknown-linux-gnu
> Configured by: klosterm
> Configured on: Tue Jun 26 22:32:28 CEST 2007
> Configure host: stokes
> Memory manager: ptmalloc2
> C bindings: yes
> C++ bindings: no
> Fortran bindings: no
> C compiler: gcc
> C++ compiler: g++
> Fortran compiler: false
> Fortran symbols: none
> C profiling: no
> C++ profiling: no
> Fortran profiling: no
> C++ exceptions: no
> Thread support: yes
> ROMIO support: no
> IMPI support: no
> Debug support: no
> Purify clean: no
> SSI boot: globus (API v1.1, Module v0.6)
> SSI boot: rsh (API v1.1, Module v1.1)
> SSI boot: slurm (API v1.1, Module v1.0)
> SSI boot: tm (API v1.1, Module v1.1)
> SSI coll: lam_basic (API v1.1, Module v7.1)
> SSI coll: shmem (API v1.1, Module v1.0)
> SSI coll: smp (API v1.1, Module v1.2)
> SSI rpi: crtcp (API v1.1, Module v1.1)
> SSI rpi: ib (API v1.1, Module v1.0)
> SSI rpi: lamd (API v1.0, Module v7.1)
> SSI rpi: sysv (API v1.0, Module v7.1)
> SSI rpi: tcp (API v1.0, Module v7.1)
> SSI rpi: usysv (API v1.0, Module v7.1)
> SSI cr: self (API v1.0, Module v1.0)
> ---------------------
>
>
> So here is my problem: lamboot is asking for libtorque.so.0, which
> seems to be related to the Torque batch system. Since our cluster
> doesn't use any batch system, I would like to switch off the tm module
> (this is the reason I used --without-tm as a configure option, which
> obviously did not work):
> ---------------------
> lamboot -v -ssi boot rsh ./knotenliste_lam
>
> LAM 7.1.3 - Indiana University
>
> n-1<6016> ssi:boot:base:linear: booting n0 (stokes)
> n-1<6016> ssi:boot:base:linear: booting n1 (node13)
> ERROR: LAM/MPI unexpectedly received the following on stderr:
> hboot: error while loading shared libraries: libtorque.so.0: cannot open shared object file: No such file or directory
> -----------------------------------------------------------------------------
> LAM failed to execute a LAM binary on the remote node "node13".
> Since LAM was already able to determine your remote shell as "hboot",
> it is probable that this is not an authentication problem.
>
> *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> *** MAILING LIST.
>
> LAM tried to use the remote agent command "ssh"
> to invoke the following command:
>
> ssh -x node13 -n hboot -t -c lam-conf.lamd -v -s -I '"-H 139.20.53.201 -P 29989 -n 1 -o 0"'
>
> This can indicate several things. You should check the following:
>
> - The LAM binaries are in your $PATH
> - You can run the LAM binaries
> - The $PATH variable is set properly before your
> .cshrc/.profile exits
>
> Try to invoke the command listed above manually at a Unix prompt.
>
> You will need to configure your local setup such that you will *not*
> be prompted for a password to invoke this command on the remote node.
> No output should be printed from the remote node before the output of
> the command is displayed.
>
> When you can get this command to execute successfully by hand, LAM
> will probably be able to function properly.
> -----------------------------------------------------------------------------
> n-1<6016> ssi:boot:base:linear: Failed to boot n1 (node13)
> n-1<6016> ssi:boot:base:linear: aborted!
> n-1<6022> ssi:boot:base:linear: booting n0 (stokes)
> n-1<6022> ssi:boot:base:linear: booting n1 (node13)
> ERROR: LAM/MPI unexpectedly received the following on stderr:
> tkill: error while loading shared libraries: libtorque.so.0: cannot open shared object file: No such file or directory
> -----------------------------------------------------------------------------
> LAM failed to execute a LAM binary on the remote node "node13".
> Since LAM was already able to determine your remote shell as "tkill",
> it is probable that this is not an authentication problem.
>
> *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> *** MAILING LIST.
>
> LAM tried to use the remote agent command "ssh"
> to invoke the following command:
>
> ssh -x node13 -n tkill -v
>
> This can indicate several things. You should check the following:
>
> - The LAM binaries are in your $PATH
> - You can run the LAM binaries
> - The $PATH variable is set properly before your
> .cshrc/.profile exits
>
> Try to invoke the command listed above manually at a Unix prompt.
>
> You will need to configure your local setup such that you will *not*
> be prompted for a password to invoke this command on the remote node.
> No output should be printed from the remote node before the output of
> the command is displayed.
>
> When you can get this command to execute successfully by hand, LAM
> will probably be able to function properly.
> -----------------------------------------------------------------------------
> n-1<6022> ssi:boot:base:linear: Failed to boot n1 (node13)
> n-1<6022> ssi:boot:base:linear: aborted!
> lamboot did NOT complete successfully
> klosterm_at_stokes:/home/pub/infiniband/tests> ssh -x node13 -n tkill
> tkill: error while loading shared libraries: libtorque.so.0: cannot open shared object file: No such file or directory
> ---------------------
>
>
> The funny thing is lamboot with just localhost works on the frontend:
> --------------------
> lamboot -v -ssi boot rsh
>
> LAM 7.1.3 - Indiana University
>
> n-1<7868> ssi:boot:base:linear: booting n0 (localhost)
> n-1<7868> ssi:boot:base:linear: finished
> --------------------
>
> but not on node 13:
> --------------------
> lamboot -v -ssi boot rsh
> lamboot: error while loading shared libraries: libtorque.so.0: cannot open shared object file: No such file or directory
> klosterm_at_node13:~>
> klosterm_at_node13:~> LAM 7.1.3 - Indiana University
> -bash: LAM: command not found
> klosterm_at_node13:~>
> klosterm_at_node13:~> n-1<7868> ssi:boot:base:linear: booting n0 (localhost)
> -bash: syntax error near unexpected token `7868'
> klosterm_at_node13:~> n-1<7868> ssi:boot:base:linear: finished
> -bash: syntax error near unexpected token `7868'
> --------------------
>
> Any help is appreciated.
>
> With regards Jens
>
>
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
Jeff Squyres
Cisco Systems