
LAM/MPI General User's Mailing List Archives


From: Ben Boxman (ben_at_[hidden])
Date: 2005-06-21 11:25:01


Hi!

 

   I'm trying to set up Torque+LAM on a small 8-node (dual-Opteron)
cluster, and I'm having difficulty getting LAM to run under Torque.
Torque itself seems to be running fine, and LAM works across a few
nodes without Torque.

 

  Specifically, lamboot fails when I run it inside a Torque job
(whether interactively or in a PBS script). It succeeds when the job
has a single node but fails whenever more than one node is requested.
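
Before looking at lamboot itself, it can help to confirm which hosts Torque actually handed to the job. The following is only a minimal sketch: count_pbs_hosts is a hypothetical helper (not part of Torque or LAM), and it assumes the job environment provides the usual PBS_NODEFILE variable pointing at the list of allocated nodes.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque or LAM): print the number of
# distinct hosts listed in a Torque node file.
count_pbs_hosts() {
    # sort -u collapses repeated hostnames (a host may appear once per CPU slot)
    sort -u "$1" | wc -l | tr -d ' '
}

# Inside a job, Torque normally sets PBS_NODEFILE to the allocated-node list.
# Outside a job the variable is unset and this block does nothing.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "Torque allocated $(count_pbs_hosts "$PBS_NODEFILE") distinct host(s):"
    sort -u "$PBS_NODEFILE"
fi
```

The host count printed here can then be compared with the "found the following N hosts" line that the tm boot module prints under lamboot -d.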

 

Following is a log of such an attempt:

 

[ben@creambo ~]$ qsub -l nodes=2 -I
qsub: waiting for job 17.creambo to start
qsub: job 17.creambo ready

 

[ben@wild2 ~]$ laminfo
             LAM/MPI: 7.1.1
              Prefix: /usr
        Architecture: x86_64-redhat-linux-gnu
       Configured by: root
       Configured on: Tue Jun 21 13:32:27 IDT 2005
      Configure host: creambo
      Memory manager: ptmalloc2
          C bindings: yes
        C++ bindings: yes
    Fortran bindings: yes
          C compiler: gcc
        C++ compiler: g++
    Fortran compiler: g77
     Fortran symbols: double_underscore
         C profiling: yes
       C++ profiling: yes
   Fortran profiling: yes
      C++ exceptions: no
      Thread support: yes
       ROMIO support: yes
        IMPI support: no
       Debug support: no
        Purify clean: no
            SSI boot: globus (API v1.1, Module v0.6)
            SSI boot: rsh (API v1.1, Module v1.1)
            SSI boot: slurm (API v1.1, Module v1.0)
            SSI boot: tm (API v1.1, Module v1.1)
            SSI coll: lam_basic (API v1.1, Module v7.1)
            SSI coll: shmem (API v1.1, Module v1.0)
            SSI coll: smp (API v1.1, Module v1.2)
             SSI rpi: crtcp (API v1.1, Module v1.1)
             SSI rpi: lamd (API v1.0, Module v7.1)
             SSI rpi: sysv (API v1.0, Module v7.1)
             SSI rpi: tcp (API v1.0, Module v7.1)
             SSI rpi: usysv (API v1.0, Module v7.1)
              SSI cr: self (API v1.0, Module v1.0)

[ben@wild2 ~]$ lamboot -d
n-1<4121> ssi:boot:open: opening
n-1<4121> ssi:boot:open: opening boot module globus
n-1<4121> ssi:boot:open: opened boot module globus
n-1<4121> ssi:boot:open: opening boot module rsh
n-1<4121> ssi:boot:open: opened boot module rsh
n-1<4121> ssi:boot:open: opening boot module slurm
n-1<4121> ssi:boot:open: opened boot module slurm
n-1<4121> ssi:boot:open: opening boot module tm
n-1<4121> ssi:boot:open: opened boot module tm
n-1<4121> ssi:boot:select: initializing boot module slurm
n-1<4121> ssi:boot:slurm: not running under SLURM
n-1<4121> ssi:boot:select: boot module not available: slurm
n-1<4121> ssi:boot:select: initializing boot module tm
n-1<4121> ssi:boot:tm: module initializing
n-1<4121> ssi:boot:tm:verbose: 1000
n-1<4121> ssi:boot:tm:priority: 50
n-1<4121> ssi:boot:select: boot module available: tm, priority: 50
n-1<4121> ssi:boot:select: initializing boot module globus
n-1<4121> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<4121> ssi:boot:select: boot module not available: globus
n-1<4121> ssi:boot:select: initializing boot module rsh
n-1<4121> ssi:boot:rsh: module initializing
n-1<4121> ssi:boot:rsh:agent: ssh -x
n-1<4121> ssi:boot:rsh:username: <same>
n-1<4121> ssi:boot:rsh:verbose: 1000
n-1<4121> ssi:boot:rsh:algorithm: linear
n-1<4121> ssi:boot:rsh:no_n: 0
n-1<4121> ssi:boot:rsh:no_profile: 0
n-1<4121> ssi:boot:rsh:fast: 0
n-1<4121> ssi:boot:rsh:ignore_stderr: 0
n-1<4121> ssi:boot:rsh:priority: 10
n-1<4121> ssi:boot:select: boot module available: rsh, priority: 10
n-1<4121> ssi:boot:select: finalizing boot module slurm
n-1<4121> ssi:boot:slurm: finalizing
n-1<4121> ssi:boot:select: closing boot module slurm
n-1<4121> ssi:boot:select: finalizing boot module globus
n-1<4121> ssi:boot:globus: finalizing
n-1<4121> ssi:boot:select: closing boot module globus
n-1<4121> ssi:boot:select: finalizing boot module rsh
n-1<4121> ssi:boot:rsh: finalizing
n-1<4121> ssi:boot:select: closing boot module rsh
n-1<4121> ssi:boot:select: selected boot module tm

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

n-1<4121> ssi:boot:tm: found the following 2 hosts:
n-1<4121> ssi:boot:tm: n0 wild2.camero-tech.com (cpu=1)
n-1<4121> ssi:boot:tm: n1 wild1.camero-tech.com (cpu=1)
n-1<4121> ssi:boot:tm: starting RTE procs
n-1<4121> ssi:boot:base:linear_windowed: starting
n-1<4121> ssi:boot:base:linear_windowed: window size: 5
n-1<4121> ssi:boot:base:server: opening server TCP socket
n-1<4121> ssi:boot:base:server: opened port 32839
n-1<4121> ssi:boot:base:linear_windowed: booting n0 (wild2.camero-tech.com)
n-1<4121> ssi:boot:tm: starting wipe on (wild2.camero-tech.com)
n-1<4121> ssi:boot:tm: starting on n0 (wild2.camero-tech.com): /usr/bin/tkill -setsid -d
n-1<4121> ssi:boot:tm: successfully launched on n0 (wild2.camero-tech.com)
n-1<4121> ssi:boot:tm: waiting for completion on n0 (wild2.camero-tech.com)
n-1<4121> ssi:boot:base:linear_windowed: Failed to boot n0 (wild2.camero-tech.com)
n-1<4121> ssi:boot:base:linear_windowed: finished launching
n-1<4121> ssi:boot:base:server: closing server socket
n-1<4121> ssi:boot:base:linear_windowed: aborted!
lamboot did NOT complete successfully
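
Because the debug output is long, the two facts that matter here (which boot module won the selection, and which node failed) can be pulled out of a saved copy of the log. This is just a sketch under assumptions: selected_boot_module and failed_nodes are hypothetical helper names, the log is assumed to have been saved with something like "lamboot -d > lamboot.log 2>&1", and the patterns rely only on the message wording visible in the log above.

```shell
#!/bin/sh
# Hypothetical helpers for reading a saved "lamboot -d" log.
# They match only the message wording seen in the log above.

# Print the boot module that lamboot finally selected (e.g. "tm" or "rsh").
selected_boot_module() {
    sed -n 's/.*ssi:boot:select: selected boot module \([A-Za-z0-9_]*\).*/\1/p' "$1"
}

# Print the LAM node names (n0, n1, ...) that failed to boot, one per line.
failed_nodes() {
    sed -n 's/.*Failed to boot \(n[0-9][0-9]*\).*/\1/p' "$1"
}
```

Run against the log above, the first helper would report tm (priority 50 beats rsh at priority 10) and the second would report n0, i.e. the failure is on the local node wild2 rather than on the remote one.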

 

 

Any ideas?

 

Many thanks,

_______________________________________________

Ben Boxman,
Algorithm Engineer.
Camero-Tech

 

Office: (972) 9 8659088, ext. 211
Fax: (972) 9 8659388
Mobile: (972) 50 6519091

9 HaOmanut St., POB 8580
Poleg Park, B2 Bldg.
Poleg Industrial area, Zip 42160
Netanya, Israel