LAM/MPI General User's Mailing List Archives

From: Michael Wheatley (michaelw_at_[hidden])
Date: 2005-02-23 08:24:41


Hello all,
I have just rebuilt my LAM 7.1.1 with tm support so I can use it in a PBS
environment. When I run my script, recon drops out and states that it
"can't find executable for tkill", but when I run 'which tkill' it points to
/usr/local/bin/tkill on the nodes. /usr/local is shared from an NFS server
which also houses the PBS server and on which PBS was built. Presumably
I've done something foolish (again). Any help in solving this problem
would be appreciated.

Mike
#############################################################################
details
Mandrake 10
Lam7.1.1
PBS: torque 1.1.0p6
##############################################################################
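(Editor's note: the symptom above — 'which tkill' succeeding in an interactive shell while the TM-spawned process reports "can't find executable" — often comes down to a PATH mismatch: the pbs_mom daemon spawns processes with its own, usually minimal, environment rather than the login shell's PATH. A minimal sketch of the effect; the restricted PATH value here is illustrative, not from the post:)

```shell
# Sketch: a binary "disappears" when PATH lacks its directory, which is
# what can happen if pbs_mom spawns helpers with a minimal PATH that
# omits /usr/local/bin.
normal=$(command -v ls)                                    # found via the normal PATH
restricted=$(PATH=/nonexistent command -v ls || echo "not found")
echo "normal: $normal"
echo "restricted: $restricted"
```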
Lam info
              LAM/MPI: 7.1.1
               Prefix: /usr/local
         Architecture: i686-pc-linux-gnu
        Configured by: michaelw
        Configured on: Thu Feb 24 10:42:38 EST 2005
       Configure host: panda.ceic.local
       Memory manager: ptmalloc2
           C bindings: yes
         C++ bindings: yes
     Fortran bindings: yes
           C compiler: icc
         C++ compiler: icpc
     Fortran compiler: g77
      Fortran symbols: double_underscore
          C profiling: yes
        C++ profiling: yes
    Fortran profiling: yes
       C++ exceptions: no
       Thread support: yes
        ROMIO support: yes
         IMPI support: no
        Debug support: no
         Purify clean: no
             SSI boot: globus (API v1.1, Module v0.6)
             SSI boot: rsh (API v1.1, Module v1.1)
             SSI boot: slurm (API v1.1, Module v1.0)
             SSI boot: tm (API v1.1, Module v1.1)
             SSI coll: lam_basic (API v1.1, Module v7.1)
             SSI coll: shmem (API v1.1, Module v1.0)
             SSI coll: smp (API v1.1, Module v1.2)
              SSI rpi: crtcp (API v1.1, Module v1.1)
              SSI rpi: lamd (API v1.0, Module v7.1)
              SSI rpi: sysv (API v1.0, Module v7.1)
              SSI rpi: tcp (API v1.0, Module v7.1)
              SSI rpi: usysv (API v1.0, Module v7.1)
                SSI cr: self (API v1.0, Module v1.0)
****************************************************************
my script
#! /bin/bash
/usr/local/bin/recon -d -ssi boot tm
/usr/local/bin/lamboot -ssi boot tm
/usr/local/bin/mpirun -np $NCPU /home/code/demMP
/usr/local/bin/lamhalt
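
(Editor's note: one common workaround — a hedged sketch, not from the original post — is to export PATH and LAMHOME in the job script itself, so anything spawned through TM can locate the LAM executables even when the spawning daemon's default environment is minimal:)

```shell
#! /bin/bash
# Hypothetical variant of the script above: make the LAM binaries
# findable regardless of the spawning daemon's default PATH.
export PATH=/usr/local/bin:$PATH
export LAMHOME=/usr/local   # LAM's documented fallback for locating executables
/usr/local/bin/recon -d -ssi boot tm
/usr/local/bin/lamboot -ssi boot tm
/usr/local/bin/mpirun -np $NCPU /home/code/demMP
/usr/local/bin/lamhalt
```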
###################################################
output of stderr from failed qsub -l nodes=10 script

n-1<3308> ssi:boot:open: opening
n-1<3308> ssi:boot:open: opening boot module globus
n-1<3308> ssi:boot:open: opened boot module globus
n-1<3308> ssi:boot:open: opening boot module rsh
n-1<3308> ssi:boot:open: opened boot module rsh
n-1<3308> ssi:boot:open: opening boot module slurm
n-1<3308> ssi:boot:open: opened boot module slurm
n-1<3308> ssi:boot:open: opening boot module tm
n-1<3308> ssi:boot:open: opened boot module tm
n-1<3308> ssi:boot:select: initializing boot module tm
n-1<3308> ssi:boot:tm: module initializing
n-1<3308> ssi:boot:tm:verbose: 1000
n-1<3308> ssi:boot:tm:priority: 75
n-1<3308> ssi:boot:select: boot module available: tm, priority: 75
n-1<3308> ssi:boot:select: initializing boot module slurm
n-1<3308> ssi:boot:slurm: not running under SLURM
n-1<3308> ssi:boot:select: boot module not available: slurm
n-1<3308> ssi:boot:select: initializing boot module rsh
n-1<3308> ssi:boot:rsh: module initializing
n-1<3308> ssi:boot:rsh:agent: ssh
n-1<3308> ssi:boot:rsh:username: <same>
n-1<3308> ssi:boot:rsh:verbose: 1000
n-1<3308> ssi:boot:rsh:algorithm: linear
n-1<3308> ssi:boot:rsh:no_n: 0
n-1<3308> ssi:boot:rsh:no_profile: 0
n-1<3308> ssi:boot:rsh:fast: 0
n-1<3308> ssi:boot:rsh:ignore_stderr: 0
n-1<3308> ssi:boot:rsh:priority: 10
n-1<3308> ssi:boot:select: boot module available: rsh, priority: 10
n-1<3308> ssi:boot:select: initializing boot module globus
n-1<3308> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<3308> ssi:boot:select: boot module not available: globus
n-1<3308> ssi:boot:select: finalizing boot module slurm
n-1<3308> ssi:boot:slurm: finalizing
n-1<3308> ssi:boot:select: closing boot module slurm
n-1<3308> ssi:boot:select: finalizing boot module rsh
n-1<3308> ssi:boot:rsh: finalizing
n-1<3308> ssi:boot:select: closing boot module rsh
n-1<3308> ssi:boot:select: finalizing boot module globus
n-1<3308> ssi:boot:globus: finalizing
n-1<3308> ssi:boot:select: closing boot module globus
n-1<3308> ssi:boot:select: selected boot module tm
n-1<3308> ssi:boot:tm: found the following 10 hosts:
n-1<3308> ssi:boot:tm: n0 ws11.ceic.local (cpu=1)
n-1<3308> ssi:boot:tm: n1 ws10.ceic.local (cpu=1)
n-1<3308> ssi:boot:tm: n2 ws09.ceic.local (cpu=1)
n-1<3308> ssi:boot:tm: n3 ws08.ceic.local (cpu=1)
n-1<3308> ssi:boot:tm: n4 ws07.ceic.local (cpu=1)
n-1<3308> ssi:boot:tm: n5 ws06.ceic.local (cpu=1)
n-1<3308> ssi:boot:tm: n6 ws05.ceic.local (cpu=1)
n-1<3308> ssi:boot:tm: n7 ws04.ceic.local (cpu=1)
n-1<3308> ssi:boot:tm: n8 ws03.ceic.local (cpu=1)
n-1<3308> ssi:boot:tm: n9 WS02.ceic.local (cpu=1)
n-1<3308> ssi:boot:tm: starting RTE procs
n-1<3308> ssi:boot:base:linear_windowed: starting
n-1<3308> ssi:boot:base:linear_windowed: no startup protocol
n-1<3308> ssi:boot:base:linear_windowed: invoking linear
n-1<3308> ssi:boot:base:linear: starting
n-1<3308> ssi:boot:base:linear: booting n0 (ws11.ceic.local)
n-1<3308> ssi:boot:tm: starting recon on (ws11.ceic.local)
Can't find executable for tkill
n-1<3308> ssi:boot:base:linear: Failed to boot n0 (ws11.ceic.local)
n-1<3308> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
recon was not able to complete successfully. There can be any number
of problems that did not allow recon to work properly. You should use
the "-d" option to recon to get more information about each step that
recon attempts.

Any error message above may present a more detailed description of the
actual problem.

Here is a general list of prerequisites that *must* be fulfilled
before recon can work:

         - Each machine in the hostfile must be reachable and operational.
         - You must have an account on each machine.
         - You must be able to rsh(1) to the machine (permissions
           are typically set in the user's $HOME/.rhosts file).

         *** Sidenote: If you compiled LAM to use a remote shell program
             other than rsh (with the --with-rsh option to ./configure;
             e.g., ssh), or if you set the LAMRSH environment variable
             to an alternate remote shell program, you need to ensure
             that you can execute programs on remote nodes with no
             password. For example:

         unix% ssh -x pinky uptime
         3:09am up 211 day(s), 23:49, 2 users, load average: 0.01, 0.08, 0.10

         - The LAM executables must be locatable on each machine, using
           the shell's search path and possibly the LAMHOME environment
           variable.
         - The shell's start-up script must not print anything on standard
           error. You can take advantage of the fact that rsh(1) will
           start the shell non-interactively. The start-up script (such
           as .profile or .cshrc) can exit early in this case, before
           executing many commands relevant only to interactive sessions
           and likely to generate output.
-----------------------------------------------------------------------------
n-1<3308> ssi:boot:tm: finalizing
n-1<3308> ssi:boot: Closing
Can't find executable for tkill
-----------------------------------------------------------------------------
Synopsis: mpirun [options] <app>
                 mpirun [options] <where> <program> [<prog args>]

Description: Start an MPI application in LAM/MPI.

Notes:
                 [options] Zero or more of the options listed below
                 <app> LAM/MPI appschema
                 <where> List of LAM nodes and/or CPUs (examples
                                 below)
                 <program> Must be a LAM/MPI program that either
                                 invokes MPI_INIT or has exactly one of
                                 its children invoke MPI_INIT
                 <prog args> Optional list of command line arguments
                                 to <program>

Options:
                 -c <num> Run <num> copies of <program> (same as -np)
                 -c2c Use fast library (C2C) mode
                 -client <rank> <host>:<port>
                                Run IMPI job; connect to the IMPI server <host>
                                 at port <port> as IMPI client number <rank>
                 -D Change current working directory of new
                                 processes to the directory where the
                                 executable resides
                 -f Do not open stdio descriptors
                 -ger Turn on GER mode
                 -h Print this help message
                 -l Force line-buffered output
                 -lamd Use LAM daemon (LAMD) mode (opposite of -c2c)
                 -nger Turn off GER mode
                 -np <num> Run <num> copies of <program> (same as -c)
                 -nx Don't export LAM_MPI_* environment variables
                 -O Universe is homogeneous
                 -pty / -npty Use/don't use pseudo terminals when stdout is
                                 a tty
                 -s <nodeid> Load <program> from node <nodeid>
                 -sigs / -nsigs Catch/don't catch signals in MPI application
                 -ssi <n> <arg> Set environment variable LAM_MPI_SSI_<n>=<arg>
                 -toff Enable tracing with generation initially off
                 -ton, -t Enable tracing with generation initially on
                 -tv Launch processes under TotalView Debugger
                 -v Be verbose
                 -w / -nw Wait/don't wait for application to complete
                 -wd <dir> Change current working directory of new
                                 processes to <dir>
                 -x <envlist> Export environment vars in <envlist>

Nodes: n<list>, e.g., n0-3,5
CPUS: c<list>, e.g., c0-3,5
Extras: h (local node), o (origin node), N (all nodes), C (all CPUs)

Examples: mpirun n0-7 prog1
                 Executes "prog1" on nodes 0 through 7.

                 mpirun -lamd -x FOO=bar,DISPLAY N prog2
                 Executes "prog2" on all nodes using the LAMD RPI.
                 In the environment of each process, set FOO to the value
                 "bar", and set DISPLAY to the current value.

                 mpirun n0 N prog3
                 Run "prog3" on node 0, *and* all nodes. This executes *2*
                 copies on n0.

                 mpirun C prog4 arg1 arg2
                 Run "prog4" on each available CPU with command line
                 arguments of "arg1" and "arg2". If each node has a
                 CPU count of 1, the "C" is equivalent to "N". If at
                 least one node has a CPU count greater than 1, LAM
                 will run neighboring ranks of MPI_COMM_WORLD on that
                 node. For example, if node 0 has a CPU count of 4 and
                 node 1 has a CPU count of 2, "prog4" will have
                 MPI_COMM_WORLD ranks 0 through 3 on n0, and ranks 4
                 and 5 on n1.

                 mpirun c0 C prog5
                 Similar to the "prog3" example above, this runs "prog5"
                 on CPU 0 *and* on each available CPU. This executes
                 *2* copies on the node where CPU 0 is (i.e., n0).
                 This is probably not a useful use of the "C" notation;
                 it is only shown here for an example.

Defaults: -c2c -w -pty -nger -nsigs
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
It seems that there is no lamd running on the host ws11.ceic.local.

This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for the "lamhalt" command.

Please run the "lamboot" command to start the LAM/MPI runtime
environment. See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------
#######################################################################################
