Hi!
I'm trying to set up Torque + LAM on a small 8-node (2x Opteron)
cluster, and I'm having trouble getting LAM to run under Torque.
Torque itself seems to be running fine, and LAM works across a few
nodes when it is started outside of Torque.
Specifically, lamboot fails when I run it inside a Torque job
(whether interactively or from a PBS script). It succeeds when the
job has a single node, but fails whenever the job requests more than
one node.
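For the batch case, the job script is roughly of the following shape. This is only a minimal sketch for reproducing the 1-node and 2-node cases from one template: the helper name make_lam_job and the job name lamtest are placeholders I made up for illustration, and the real script may carry more options.

```shell
#!/bin/sh
# Print a minimal Torque job script that boots LAM on the requested
# number of nodes. "make_lam_job" and the job name "lamtest" are
# hypothetical placeholders, not part of Torque or LAM.
make_lam_job() {
    nodes="$1"
    cat <<EOF
#!/bin/sh
#PBS -N lamtest
#PBS -l nodes=${nodes}
#PBS -j oe
# Boot LAM on the nodes Torque allocated, with debug output.
lamboot -d || exit 1
# List the booted nodes, then shut the LAM universe down.
lamnodes
lamhalt
EOF
}
```

Writing the output to a file, for example make_lam_job 2 > lam2.pbs, and submitting it with qsub lam2.pbs gives the failing case; the same with 1 gives the case that works.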
The following is a log of one such attempt, run interactively with two nodes:
[ben@creambo ~]$ qsub -l nodes=2 -I
qsub: waiting for job 17.creambo to start
qsub: job 17.creambo ready
[ben@wild2 ~]$ laminfo
LAM/MPI: 7.1.1
Prefix: /usr
Architecture: x86_64-redhat-linux-gnu
Configured by: root
Configured on: Tue Jun 21 13:32:27 IDT 2005
Configure host: creambo
Memory manager: ptmalloc2
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C compiler: gcc
C++ compiler: g++
Fortran compiler: g77
Fortran symbols: double_underscore
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
C++ exceptions: no
Thread support: yes
ROMIO support: yes
IMPI support: no
Debug support: no
Purify clean: no
SSI boot: globus (API v1.1, Module v0.6)
SSI boot: rsh (API v1.1, Module v1.1)
SSI boot: slurm (API v1.1, Module v1.0)
SSI boot: tm (API v1.1, Module v1.1)
SSI coll: lam_basic (API v1.1, Module v7.1)
SSI coll: shmem (API v1.1, Module v1.0)
SSI coll: smp (API v1.1, Module v1.2)
SSI rpi: crtcp (API v1.1, Module v1.1)
SSI rpi: lamd (API v1.0, Module v7.1)
SSI rpi: sysv (API v1.0, Module v7.1)
SSI rpi: tcp (API v1.0, Module v7.1)
SSI rpi: usysv (API v1.0, Module v7.1)
SSI cr: self (API v1.0, Module v1.0)
[ben@wild2 ~]$ lamboot -d
n-1<4121> ssi:boot:open: opening
n-1<4121> ssi:boot:open: opening boot module globus
n-1<4121> ssi:boot:open: opened boot module globus
n-1<4121> ssi:boot:open: opening boot module rsh
n-1<4121> ssi:boot:open: opened boot module rsh
n-1<4121> ssi:boot:open: opening boot module slurm
n-1<4121> ssi:boot:open: opened boot module slurm
n-1<4121> ssi:boot:open: opening boot module tm
n-1<4121> ssi:boot:open: opened boot module tm
n-1<4121> ssi:boot:select: initializing boot module slurm
n-1<4121> ssi:boot:slurm: not running under SLURM
n-1<4121> ssi:boot:select: boot module not available: slurm
n-1<4121> ssi:boot:select: initializing boot module tm
n-1<4121> ssi:boot:tm: module initializing
n-1<4121> ssi:boot:tm:verbose: 1000
n-1<4121> ssi:boot:tm:priority: 50
n-1<4121> ssi:boot:select: boot module available: tm, priority: 50
n-1<4121> ssi:boot:select: initializing boot module globus
n-1<4121> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<4121> ssi:boot:select: boot module not available: globus
n-1<4121> ssi:boot:select: initializing boot module rsh
n-1<4121> ssi:boot:rsh: module initializing
n-1<4121> ssi:boot:rsh:agent: ssh -x
n-1<4121> ssi:boot:rsh:username: <same>
n-1<4121> ssi:boot:rsh:verbose: 1000
n-1<4121> ssi:boot:rsh:algorithm: linear
n-1<4121> ssi:boot:rsh:no_n: 0
n-1<4121> ssi:boot:rsh:no_profile: 0
n-1<4121> ssi:boot:rsh:fast: 0
n-1<4121> ssi:boot:rsh:ignore_stderr: 0
n-1<4121> ssi:boot:rsh:priority: 10
n-1<4121> ssi:boot:select: boot module available: rsh, priority: 10
n-1<4121> ssi:boot:select: finalizing boot module slurm
n-1<4121> ssi:boot:slurm: finalizing
n-1<4121> ssi:boot:select: closing boot module slurm
n-1<4121> ssi:boot:select: finalizing boot module globus
n-1<4121> ssi:boot:globus: finalizing
n-1<4121> ssi:boot:select: closing boot module globus
n-1<4121> ssi:boot:select: finalizing boot module rsh
n-1<4121> ssi:boot:rsh: finalizing
n-1<4121> ssi:boot:select: closing boot module rsh
n-1<4121> ssi:boot:select: selected boot module tm
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
n-1<4121> ssi:boot:tm: found the following 2 hosts:
n-1<4121> ssi:boot:tm: n0 wild2.camero-tech.com (cpu=1)
n-1<4121> ssi:boot:tm: n1 wild1.camero-tech.com (cpu=1)
n-1<4121> ssi:boot:tm: starting RTE procs
n-1<4121> ssi:boot:base:linear_windowed: starting
n-1<4121> ssi:boot:base:linear_windowed: window size: 5
n-1<4121> ssi:boot:base:server: opening server TCP socket
n-1<4121> ssi:boot:base:server: opened port 32839
n-1<4121> ssi:boot:base:linear_windowed: booting n0 (wild2.camero-tech.com)
n-1<4121> ssi:boot:tm: starting wipe on (wild2.camero-tech.com)
n-1<4121> ssi:boot:tm: starting on n0 (wild2.camero-tech.com): /usr/bin/tkill -setsid -d
n-1<4121> ssi:boot:tm: successfully launched on n0 (wild2.camero-tech.com)
n-1<4121> ssi:boot:tm: waiting for completion on n0 (wild2.camero-tech.com)
n-1<4121> ssi:boot:base:linear_windowed: Failed to boot n0 (wild2.camero-tech.com)
n-1<4121> ssi:boot:base:linear_windowed: finished launching
n-1<4121> ssi:boot:base:server: closing server socket
n-1<4121> ssi:boot:base:linear_windowed: aborted!
lamboot did NOT complete successfully
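Since the debug output is long, the lines that seem to matter (the selected boot module, the hosts the tm module found, and the node that failed) can be pulled out with a small filter. This is just a sketch: the function name summarize_lamboot_log is my own placeholder, and it assumes the lamboot -d output was saved to a file first.

```shell
#!/bin/sh
# Print only the key lines of a saved `lamboot -d` log:
#   - which boot module was selected
#   - the hosts reported by the tm boot module (lines like "n0 <host> (cpu=1)")
#   - any "Failed to boot" messages
# The leading "n-1<pid> " prefix is stripped for readability.
# "summarize_lamboot_log" is a hypothetical helper, not part of LAM.
summarize_lamboot_log() {
    log_file="$1"
    grep -E 'selected boot module|ssi:boot:tm: +n[0-9]+ |Failed to boot' "$log_file" |
        sed -E 's/^n-1<[0-9]+> //'
}
```

For example, after saving the output with lamboot -d > lamboot.log 2>&1 (the file name is arbitrary), summarize_lamboot_log lamboot.log prints the short version.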
Any ideas?
Many thanks,
_______________________________________________
Ben Boxman,
Algorithm Engineer.
Camero-Tech
Office: (972) 9 8659088, ext. 211
Fax: (972) 9 8659388
Mobile: (972) 50 6519091
9 HaOmanut St., POB 8580
Poleg Park, B2 Bldg.
Poleg Industrial area, Zip 42160
Netanya, Israel