Hi Team,
I am in the process of integrating LAM with Sun N1 Grid Engine 6 (N1GE6,
SGE 6.0). My grid machine is a Sun Fire V20z (AMD64 architecture) running
Red Hat Enterprise Linux WS 3.0. I have already integrated my customer's
EDA tools (Cadence) with the grid, but for parallel environment support I
need to integrate LAM as well.
Before integrating LAM with the grid, I want to check how LAM works on its
own and whether my jobs run in parallel under LAM without the grid
environment.
Currently I have configured only three nodes in the grid: (1) SGMASTER (the
master node), (2) SG4, and (3) SG7.
All three nodes have 2 CPUs each. I log on to the nodes as the NIS user
sgeadmin.
I downloaded the LAM 7.1.1 tarball (tar.gz) from www.lam-mpi.org/download
and put it on SGMASTER (the master node). On all three nodes I mounted the
untarred directory at the same path through NFS.
Then I ran the following (I had already created a lam directory under
/usr/local):
./configure --prefix=/usr/local/lam
make
make install
I checked that all the nodes can rsh to each other without being prompted
for a password.
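A minimal sketch of how this rsh check could be scripted, assuming a POSIX sh on the node where it runs. The function name check_remote_shell and the AGENT variable are illustrative and not part of LAM; the probe itself (rsh <host> -n 'echo $SHELL') is the one that lamboot -d reports running.

```shell
# Sketch only: probe each node the way lamboot's rsh boot module does.
# AGENT and check_remote_shell are illustrative names, not LAM settings.
AGENT="${AGENT:-rsh}"

check_remote_shell() {
    rc=0
    nl='
'
    for host in "$@"; do
        # Same probe that lamboot -d reports: <agent> <host> -n 'echo $SHELL'
        out=$($AGENT "$host" -n 'echo $SHELL' 2>&1)
        case "$out" in
            *"$nl"*)
                # More than one line came back (prompt, rc-file noise, ...)
                echo "$host: extra output around the shell path"
                rc=1 ;;
            /*)
                echo "$host: ok ($out)" ;;
            *)
                echo "$host: unexpected output: $out"
                rc=1 ;;
        esac
    done
    return $rc
}
```

It could be called as: check_remote_shell SGMASTER SG4 SG7. A host that returns anything other than a single shell path is probably worth a closer look, since the LAM boot message asks that the remote side print nothing else.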
In the home directory of the logged-in user on all three nodes, I set the
following environment variables in .cshrc:
setenv LAMHOME /usr/local/lam
setenv TROLLIUSHOME /usr/local/lam
Then I edited the following files in /usr/local/lam/etc:
vi lam-bhost.def
#localhost
SGMASTER
SGMASTER
SG4
SG4
SG7
SG7
vi lam-conf.lamd
/usr/local/lam/bin/lamd $inet_topo $debug $session_prefix $session_suffix
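Since hboot is started on every node and has to locate lam-conf.lamd there, a quick check that each node sees the same installation might look like the sketch below. It assumes a POSIX sh locally and that the files are expected under /usr/local/lam on every node. AGENT, LAM_PREFIX and check_lam_install are illustrative names, not LAM settings.

```shell
# Sketch only: check that each node has hboot and the process schema
# under the install prefix. AGENT, LAM_PREFIX and check_lam_install are
# illustrative names, not LAM settings.
AGENT="${AGENT:-rsh}"
LAM_PREFIX="${LAM_PREFIX:-/usr/local/lam}"

check_lam_install() {
    rc=0
    for host in "$@"; do
        # rsh does not hand back the remote exit status, so the remote
        # command prints a marker only when both files are present.
        remote="test -x $LAM_PREFIX/bin/hboot && test -r $LAM_PREFIX/etc/lam-conf.lamd && echo LAM_OK"
        out=$($AGENT "$host" -n "$remote" 2>&1)
        if [ "$out" = "LAM_OK" ]; then
            echo "$host: hboot and lam-conf.lamd found under $LAM_PREFIX"
        else
            echo "$host: hboot or lam-conf.lamd missing under $LAM_PREFIX"
            rc=1
        fi
    done
    return $rc
}
```

It could be called as: check_lam_install SGMASTER SG4 SG7. If /usr/local/lam turns out to be a local directory on each node rather than a shared mount, this would show which nodes lack the installed files.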
Then, from /usr/local/lam/bin, I ran:
source /home/sgeadmin/.cshrc
./lamboot -d
SGMASTER:/usr/local/lam/bin 2% ./lamboot -d
n-1<544> ssi:boot:open: opening
n-1<544> ssi:boot:open: opening boot module globus
n-1<544> ssi:boot:open: opened boot module globus
n-1<544> ssi:boot:open: opening boot module rsh
n-1<544> ssi:boot:open: opened boot module rsh
n-1<544> ssi:boot:open: opening boot module slurm
n-1<544> ssi:boot:open: opened boot module slurm
n-1<544> ssi:boot:select: initializing boot module slurm
n-1<544> ssi:boot:slurm: not running under SLURM
n-1<544> ssi:boot:select: boot module not available: slurm
n-1<544> ssi:boot:select: initializing boot module globus
n-1<544> ssi:boot:globus: globus-job-run not found, globus boot will not
run
n-1<544> ssi:boot:select: boot module not available: globus
n-1<544> ssi:boot:select: initializing boot module rsh
n-1<544> ssi:boot:rsh: module initializing
n-1<544> ssi:boot:rsh:agent: rsh
n-1<544> ssi:boot:rsh:username: <same>
n-1<544> ssi:boot:rsh:verbose: 1000
n-1<544> ssi:boot:rsh:algorithm: linear
n-1<544> ssi:boot:rsh:no_n: 0
n-1<544> ssi:boot:rsh:no_profile: 0
n-1<544> ssi:boot:rsh:fast: 0
n-1<544> ssi:boot:rsh:ignore_stderr: 0
n-1<544> ssi:boot:rsh:priority: 10
n-1<544> ssi:boot:select: boot module available: rsh, priority: 10
n-1<544> ssi:boot:select: finalizing boot module slurm
n-1<544> ssi:boot:slurm: finalizing
n-1<544> ssi:boot:select: closing boot module slurm
n-1<544> ssi:boot:select: finalizing boot module globus
n-1<544> ssi:boot:globus: finalizing
n-1<544> ssi:boot:select: closing boot module globus
n-1<544> ssi:boot:select: selected boot module rsh
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
n-1<544> ssi:boot:base: looking for boot schema in following
directories:
n-1<544> ssi:boot:base: <current directory>
n-1<544> ssi:boot:base: $TROLLIUSHOME/etc
n-1<544> ssi:boot:base: $LAMHOME/etc
n-1<544> ssi:boot:base: /usr/local/lam/etc
n-1<544> ssi:boot:base: looking for boot schema file:
n-1<544> ssi:boot:base: lam-bhost.def
n-1<544> ssi:boot:base: found boot schema:
/usr/local/lam/etc/lam-bhost.def
n-1<544> ssi:boot:rsh: found the following hosts:
n-1<544> ssi:boot:rsh: n0 SGMASTER (cpu=2)
n-1<544> ssi:boot:rsh: n1 SG4 (cpu=2)
n-1<544> ssi:boot:rsh: n2 SG7 (cpu=2)
n-1<544> ssi:boot:rsh: resolved hosts:
n-1<544> ssi:boot:rsh: n0 SGMASTER --> 192.168.4.20 (origin)
n-1<544> ssi:boot:rsh: n1 SG4 --> 192.168.4.24
n-1<544> ssi:boot:rsh: n2 SG7 --> 192.168.4.27
n-1<544> ssi:boot:rsh: starting RTE procs
n-1<544> ssi:boot:base:linear: starting
n-1<544> ssi:boot:base:server: opening server TCP socket
n-1<544> ssi:boot:base:server: opened port 39583
n-1<544> ssi:boot:base:linear: booting n0 (SGMASTER)
n-1<544> ssi:boot:rsh: starting lamd on (SGMASTER)
n-1<544> ssi:boot:rsh: starting on n0 (SGMASTER): hboot -t -c
lam-conf.lamd -d -I -H 192.168.4.20 -P 39583 -n 0 -o 0
n-1<544> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-sgeadmin_at_SGMASTER/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-sgeadmin_at_SGMASTER/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-sgeadmin_at_SGMASTER/lam-io-socket
tkill: f_kill = "/tmp/lam-sgeadmin_at_SGMASTER/lam-killfile"
tkill: nothing to kill: "/tmp/lam-sgeadmin_at_SGMASTER/lam-killfile"
hboot: booting...
hboot: fork /usr/local/lam/bin/lamd
hboot: attempting to execute
n-1<547> ssi:boot:open: opening
n-1<547> ssi:boot:open: opening boot module globus
n-1<547> ssi:boot:open: opened boot module globus
n-1<547> ssi:boot:open: opening boot module rsh
n-1<547> ssi:boot:open: opened boot module rsh
n-1<547> ssi:boot:open: opening boot module slurm
n-1<547> ssi:boot:open: opened boot module slurm
n-1<547> ssi:boot:select: initializing boot module slurm
n-1<547> ssi:boot:slurm: not running under SLURM
n-1<547> ssi:boot:select: boot module not available: slurm
n-1<547> ssi:boot:select: initializing boot module globus
[1] 547 lamd -H 192.168.4.20 -P 39583 -n 0 -o 0 -d
n-1<547> ssi:boot:globus: globus-job-run not found, globus boot will not
run
n-1<547> ssi:boot:select: boot module not available: globus
n-1<547> ssi:boot:select: initializing boot module rsh
n-1<547> ssi:boot:rsh: module initializing
n-1<547> ssi:boot:rsh:agent: rsh
n-1<547> ssi:boot:rsh:username: <same>
n-1<547> ssi:boot:rsh:verbose: 1000
n-1<544> ssi:boot:rsh: successfully launched on n0 (SGMASTER)
n-1<544> ssi:boot:base:server: expecting connection from finite list
n-1<547> ssi:boot:rsh:algorithm: linear
n-1<547> ssi:boot:rsh:no_n: 0
n-1<547> ssi:boot:rsh:no_profile: 0
n-1<547> ssi:boot:rsh:fast: 0
n-1<547> ssi:boot:rsh:ignore_stderr: 0
n-1<547> ssi:boot:rsh:priority: 10
n-1<547> ssi:boot:select: boot module available: rsh, priority: 10
n-1<547> ssi:boot:select: finalizing boot module slurm
n-1<547> ssi:boot:slurm: finalizing
n-1<547> ssi:boot:select: closing boot module slurm
n-1<547> ssi:boot:select: finalizing boot module globus
n-1<547> ssi:boot:globus: finalizing
n-1<547> ssi:boot:select: closing boot module globus
n-1<547> ssi:boot:select: selected boot module rsh
n-1<547> ssi:boot:send_lamd: getting node ID from command line
n-1<547> ssi:boot:send_lamd: getting agent haddr from command line
n-1<547> ssi:boot:send_lamd: getting agent port from command line
n-1<547> ssi:boot:send_lamd: getting node ID from command line
n-1<547> ssi:boot:send_lamd: connecting to 192.168.4.20:39583, node id 0
n-1<544> ssi:boot:base:server: got connection from 192.168.4.20
n-1<544> ssi:boot:base:server: this connection is expected (n0)
n-1<547> ssi:boot:send_lamd: sending dli_port 32858
n-1<544> ssi:boot:base:server: remote lamd is at 192.168.4.20:32858
n-1<544> ssi:boot:base:linear: booting n1 (SG4)
n-1<544> ssi:boot:rsh: starting lamd on (SG4)
n-1<544> ssi:boot:rsh: starting on n1 (SG4): hboot -t -c lam-conf.lamd
-d -s -I "-H 192.168.4.20 -P 39583 -n 1 -o 0"
n-1<544> ssi:boot:rsh: launching remotely
n-1<544> ssi:boot:rsh: attempting to execute: rsh SG4 -n 'echo $SHELL'
n-1<544> ssi:boot:rsh: remote shell /bin/bash
n-1<544> ssi:boot:rsh: attempting to execute: rsh SG4 -n hboot -t -c
lam-conf.lamd -d -s -I '"-H 192.168.4.20 -P 39583 -n 1 -o 0"'
ERROR: LAM/MPI unexpectedly received the following on stderr:
hboot: cannot find process schema lam-conf.lamd: No such file or
directory
-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node "SG4",
but received some output on the standard error. This heuristic
assumes that any output on the standard error indicates a fatal error,
and therefore aborts. You can disable this behavior (i.e., have LAM
ignore output on standard error) in the rsh boot module by setting the
SSI parameter boot_rsh_ignore_stderr to 1.
LAM tried to use the remote agent command "rsh"
to invoke "hboot" on the remote node.
*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.
This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a (non-inclusive) list of items
that you should check on the remote node:
- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should
probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are
using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you
are using rsh) for the machine and username that you are
running from
- Your .cshrc/.profile must not print anything out to the
standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment
variable to your default shell
Try invoking the following command at the unix command line:
rsh SG4 -n hboot -t -c lam-conf.lamd -d -s -I '"-H 192.168.4.20
-P 39583 -n 1 -o 0"'
You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.
When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<544> ssi:boot:base:linear: Failed to boot n1 (SG4)
n-1<544> ssi:boot:base:server: closing server socket
n-1<544> ssi:boot:base:linear: aborted!
n-1<550> ssi:boot:open: opening
n-1<550> ssi:boot:open: opening boot module globus
n-1<550> ssi:boot:open: opened boot module globus
n-1<550> ssi:boot:open: opening boot module rsh
n-1<550> ssi:boot:open: opened boot module rsh
n-1<550> ssi:boot:open: opening boot module slurm
n-1<550> ssi:boot:open: opened boot module slurm
n-1<550> ssi:boot:select: initializing boot module slurm
n-1<550> ssi:boot:slurm: not running under SLURM
n-1<550> ssi:boot:select: boot module not available: slurm
n-1<550> ssi:boot:select: initializing boot module globus
n-1<550> ssi:boot:globus: globus-job-run not found, globus boot will not
run
n-1<550> ssi:boot:select: boot module not available: globus
n-1<550> ssi:boot:select: initializing boot module rsh
n-1<550> ssi:boot:rsh: module initializing
n-1<550> ssi:boot:rsh:agent: rsh
n-1<550> ssi:boot:rsh:username: <same>
n-1<550> ssi:boot:rsh:verbose: 1000
n-1<550> ssi:boot:rsh:algorithm: linear
n-1<550> ssi:boot:rsh:no_n: 0
n-1<550> ssi:boot:rsh:no_profile: 0
n-1<550> ssi:boot:rsh:fast: 0
n-1<550> ssi:boot:rsh:ignore_stderr: 0
n-1<550> ssi:boot:rsh:priority: 10
n-1<550> ssi:boot:select: boot module available: rsh, priority: 10
n-1<550> ssi:boot:select: finalizing boot module slurm
n-1<550> ssi:boot:slurm: finalizing
n-1<550> ssi:boot:select: closing boot module slurm
n-1<550> ssi:boot:select: finalizing boot module globus
n-1<550> ssi:boot:globus: finalizing
n-1<550> ssi:boot:select: closing boot module globus
n-1<550> ssi:boot:select: selected boot module rsh
n-1<550> ssi:boot:base: looking for boot schema in following
directories:
n-1<550> ssi:boot:base: <current directory>
n-1<550> ssi:boot:base: $TROLLIUSHOME/etc
n-1<550> ssi:boot:base: $LAMHOME/etc
n-1<550> ssi:boot:base: /usr/local/lam/etc
n-1<550> ssi:boot:base: looking for boot schema file:
n-1<550> ssi:boot:base: lam-bhost.def
n-1<550> ssi:boot:base: found boot schema:
/usr/local/lam/etc/lam-bhost.def
n-1<550> ssi:boot:rsh: found the following hosts:
n-1<550> ssi:boot:rsh: n0 SGMASTER (cpu=2)
n-1<550> ssi:boot:rsh: n1 SG4 (cpu=2)
n-1<550> ssi:boot:rsh: n2 SG7 (cpu=2)
n-1<550> ssi:boot:rsh: resolved hosts:
n-1<550> ssi:boot:rsh: n0 SGMASTER --> 192.168.4.20 (origin)
n-1<550> ssi:boot:rsh: n1 SG4 --> 192.168.4.24
n-1<550> ssi:boot:rsh: n2 SG7 --> 192.168.4.27
n-1<550> ssi:boot:rsh: starting RTE procs
n-1<550> ssi:boot:base:linear: starting
n-1<550> ssi:boot:base:linear: booting n0 (SGMASTER)
n-1<550> ssi:boot:rsh: starting wipe on (SGMASTER)
n-1<550> ssi:boot:rsh: starting on n0 (SGMASTER): tkill -d
n-1<550> ssi:boot:rsh: launching locally
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-sgeadmin_at_SGMASTER/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-sgeadmin_at_SGMASTER/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-sgeadmin_at_SGMASTER/lam-io-socket
tkill: f_kill = "/tmp/lam-sgeadmin_at_SGMASTER/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 547 ...
tkill: killed
tkill: all finished
n-1<550> ssi:boot:rsh: successfully launched on n0 (SGMASTER)
n-1<550> ssi:boot:base:linear: booting n1 (SG4)
n-1<550> ssi:boot:rsh: starting wipe on (SG4)
n-1<550> ssi:boot:rsh: starting on n1 (SG4): tkill -d
n-1<550> ssi:boot:rsh: launching remotely
n-1<550> ssi:boot:rsh: attempting to execute: rsh SG4 -n 'echo $SHELL'
n-1<550> ssi:boot:rsh: remote shell /bin/bash
n-1<550> ssi:boot:rsh: attempting to execute: rsh SG4 -n tkill -d
tkill: removing socket file ...
tkill: socket file: /tmp/lam-sgeadmin_at_SG4/lam-sd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-sgeadmin_at_SG4/lam-sio
tkill: f_kill = "/tmp/lam-sgeadmin_at_SG4/lam"
tkill: nothing to kill: "/tmp/lam-sgeadmin_at_SG4/lam"
n-1<550> ssi:boot:rsh: successfully launched on n1 (SG4)
n-1<550> ssi:boot:base:linear: booting n2 (SG7)
n-1<550> ssi:boot:rsh: starting wipe on (SG7)
n-1<550> ssi:boot:rsh: starting on n2 (SG7): tkill -d
# Copyright (c) 1998-2001 University of Notre Dame.
#
# Copyright (c) 2001-2003 The Trustees of Indiana University.
# All rights reserved.
# Copyright (c) 1998-2001 University of Notre Dame.
# All rights reserved.
# Copyright (c) 1994-1998 The Ohio State University.
# All rights reserved.
#
# This file is part of the LAM/MPI software package. For license
# information, see the LICENSE file in the top level directory of the
# LAM/MPI source distribution.
#
# $HEADER$
#
# Function: - LAM process schema
# - single daemon version
#
/usr/local/lam/bin/lamd $inet_topo $debug $session_prefix
$session_suffix
SGMASTER:/usr/local/lam/etc 7% cd /tmp
jd_sockV4= orbit-root/ orbit-sgeadmin/
SGMASTER:/tmp 8% cd
SGMASTER:/home/sgeadmin 9% cd /usr/local/lam/bin
hboot* lamcheckpoint* lamhalt* lamtrace* mpiexec* recon*
hcc@ lamclean* laminfo* lamwipe* mpif77* tkill*
hcp@ lamd* lamnodes* mpic++* mpimsg* tping*
hf77@ lamexec* lamrestart* mpicc* mpirun* wipe@
lamboot* lamgrow* lamshrink* mpiCC@ mpitask*
SGMASTER:/usr/local/lam/bin 10% csh
xhost: Command not found.
SGMASTER:/usr/local/lam/bin 1% source /home/sgeadmin/.cshrc
xhost: Command not found.
SGMASTER:/usr/local/lam/bin 2% ./lamboot -d
n-1<651> ssi:boot:open: opening
n-1<651> ssi:boot:open: opening boot module globus
n-1<651> ssi:boot:open: opened boot module globus
n-1<651> ssi:boot:open: opening boot module rsh
n-1<651> ssi:boot:open: opened boot module rsh
n-1<651> ssi:boot:open: opening boot module slurm
n-1<651> ssi:boot:open: opened boot module slurm
n-1<651> ssi:boot:select: initializing boot module slurm
n-1<651> ssi:boot:slurm: not running under SLURM
n-1<651> ssi:boot:select: boot module not available: slurm
n-1<651> ssi:boot:select: initializing boot module globus
n-1<651> ssi:boot:globus: globus-job-run not found, globus boot will not
run
n-1<651> ssi:boot:select: boot module not available: globus
n-1<651> ssi:boot:select: initializing boot module rsh
n-1<651> ssi:boot:rsh: module initializing
n-1<651> ssi:boot:rsh:agent: rsh
n-1<651> ssi:boot:rsh:username: <same>
n-1<651> ssi:boot:rsh:verbose: 1000
n-1<651> ssi:boot:rsh:algorithm: linear
n-1<651> ssi:boot:rsh:no_n: 0
n-1<651> ssi:boot:rsh:no_profile: 0
n-1<651> ssi:boot:rsh:fast: 0
n-1<651> ssi:boot:rsh:ignore_stderr: 0
n-1<651> ssi:boot:rsh:priority: 10
n-1<651> ssi:boot:select: boot module available: rsh, priority: 10
n-1<651> ssi:boot:select: finalizing boot module slurm
n-1<651> ssi:boot:slurm: finalizing
n-1<651> ssi:boot:select: closing boot module slurm
n-1<651> ssi:boot:select: finalizing boot module globus
n-1<651> ssi:boot:globus: finalizing
n-1<651> ssi:boot:select: closing boot module globus
n-1<651> ssi:boot:select: selected boot module rsh
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
n-1<651> ssi:boot:base: looking for boot schema in following
directories:
n-1<651> ssi:boot:base: <current directory>
n-1<651> ssi:boot:base: $TROLLIUSHOME/etc
n-1<651> ssi:boot:base: $LAMHOME/etc
n-1<651> ssi:boot:base: /usr/local/lam/etc
n-1<651> ssi:boot:base: looking for boot schema file:
n-1<651> ssi:boot:base: lam-bhost.def
n-1<651> ssi:boot:base: found boot schema:
/usr/local/lam/etc/lam-bhost.def
n-1<651> ssi:boot:rsh: found the following hosts:
n-1<651> ssi:boot:rsh: n0 SGMASTER (cpu=2)
n-1<651> ssi:boot:rsh: n1 SG4 (cpu=2)
n-1<651> ssi:boot:rsh: n2 SG7 (cpu=2)
n-1<651> ssi:boot:rsh: resolved hosts:
n-1<651> ssi:boot:rsh: n0 SGMASTER --> 192.168.4.20 (origin)
n-1<651> ssi:boot:rsh: n1 SG4 --> 192.168.4.24
n-1<651> ssi:boot:rsh: n2 SG7 --> 192.168.4.27
n-1<651> ssi:boot:rsh: starting RTE procs
n-1<651> ssi:boot:base:linear: starting
n-1<651> ssi:boot:base:server: opening server TCP socket
n-1<651> ssi:boot:base:server: opened port 39586
n-1<651> ssi:boot:base:linear: booting n0 (SGMASTER)
n-1<651> ssi:boot:rsh: starting lamd on (SGMASTER)
n-1<651> ssi:boot:rsh: starting on n0 (SGMASTER): hboot -t -c
lam-conf.lamd -d -I -H 192.168.4.20 -P 39586 -n 0 -o 0
n-1<651> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-sgeadmin_at_SGMASTER/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-sgeadmin_at_SGMASTER/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-sgeadmin_at_SGMASTER/lam-io-socket
tkill: f_kill = "/tmp/lam-sgeadmin_at_SGMASTER/lam-killfile"
tkill: nothing to kill: "/tmp/lam-sgeadmin_at_SGMASTER/lam-killfile"
hboot: booting...
hboot: fork /usr/local/lam/bin/lamd
hboot: attempting to execute
n-1<654> ssi:boot:open: opening
n-1<654> ssi:boot:open: opening boot module globus
n-1<654> ssi:boot:open: opened boot module globus
n-1<654> ssi:boot:open: opening boot module rsh
n-1<654> ssi:boot:open: opened boot module rsh
n-1<654> ssi:boot:open: opening boot module slurm
n-1<654> ssi:boot:open: opened boot module slurm
n-1<654> ssi:boot:select: initializing boot module slurm
n-1<654> ssi:boot:slurm: not running under SLURM
n-1<654> ssi:boot:select: boot module not available: slurm
n-1<654> ssi:boot:select: initializing boot module globus
n-1<654> ssi:boot:globus: globus-job-run not found, globus boot will not
run
n-1<654> ssi:boot:select: boot module not available: globus
n-1<654> ssi:boot:select: initializing boot module rsh
n-1<654> ssi:boot:rsh: module initializing
n-1<654> ssi:boot:rsh:agent: rsh
n-1<654> ssi:boot:rsh:username: <same>
n-1<654> ssi:boot:rsh:verbose: 1000
n-1<654> ssi:boot:rsh:algorithm: linear
n-1<654> ssi:boot:rsh:no_n: 0
n-1<654> ssi:boot:rsh:no_profile: 0
n-1<654> ssi:boot:rsh:fast: 0
n-1<654> ssi:boot:rsh:ignore_stderr: 0
n-1<654> ssi:boot:rsh:priority: 10
n-1<654> ssi:boot:select: boot module available: rsh, priority: 10
n-1<654> ssi:boot:select: finalizing boot module slurm
n-1<654> ssi:boot:slurm: finalizing
n-1<654> ssi:boot:select: closing boot module slurm
n-1<654> ssi:boot:select: finalizing boot module globus
n-1<654> ssi:boot:globus: finalizing
n-1<654> ssi:boot:select: closing boot module globus
n-1<654> ssi:boot:select: selected boot module rsh
n-1<654> ssi:boot:send_lamd: getting node ID from command line
n-1<654> ssi:boot:send_lamd: getting agent haddr from command line
n-1<654> ssi:boot:send_lamd: getting agent port from command line
n-1<654> ssi:boot:send_lamd: getting node ID from command line
n-1<654> ssi:boot:send_lamd: connecting to 192.168.4.20:39586, node id 0
n-1<654> ssi:boot:send_lamd: sending dli_port 32859
[1] 654 lamd -H 192.168.4.20 -P 39586 -n 0 -o 0 -d
n-1<651> ssi:boot:rsh: successfully launched on n0 (SGMASTER)
n-1<651> ssi:boot:base:server: expecting connection from finite list
n-1<651> ssi:boot:base:server: got connection from 192.168.4.20
n-1<651> ssi:boot:base:server: this connection is expected (n0)
n-1<651> ssi:boot:base:server: remote lamd is at 192.168.4.20:32859
n-1<651> ssi:boot:base:linear: booting n1 (SG4)
n-1<651> ssi:boot:rsh: starting lamd on (SG4)
n-1<651> ssi:boot:rsh: starting on n1 (SG4): hboot -t -c lam-conf.lamd
-d -s -I "-H 192.168.4.20 -P 39586 -n 1 -o 0"
n-1<651> ssi:boot:rsh: launching remotely
n-1<651> ssi:boot:rsh: attempting to execute: rsh SG4 -n 'echo $SHELL'
n-1<651> ssi:boot:rsh: remote shell /bin/bash
n-1<651> ssi:boot:rsh: attempting to execute: rsh SG4 -n hboot -t -c
lam-conf.lamd -d -s -I '"-H 192.168.4.20 -P 39586 -n 1 -o 0"'
ERROR: LAM/MPI unexpectedly received the following on stderr:
hboot: cannot find process schema lam-conf.lamd: No such file or
directory
-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node "SG4",
but received some output on the standard error. This heuristic
assumes that any output on the standard error indicates a fatal error,
and therefore aborts. You can disable this behavior (i.e., have LAM
ignore output on standard error) in the rsh boot module by setting the
SSI parameter boot_rsh_ignore_stderr to 1.
LAM tried to use the remote agent command "rsh"
to invoke "hboot" on the remote node.
*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.
This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a (non-inclusive) list of items
that you should check on the remote node:
- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should
probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are
using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you
are using rsh) for the machine and username that you are
running from
- Your .cshrc/.profile must not print anything out to the
standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment
variable to your default shell
Try invoking the following command at the unix command line:
rsh SG4 -n hboot -t -c lam-conf.lamd -d -s -I '"-H 192.168.4.20
-P 39586 -n 1 -o 0"'
You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.
When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<651> ssi:boot:base:linear: Failed to boot n1 (SG4)
n-1<651> ssi:boot:base:server: closing server socket
n-1<651> ssi:boot:base:linear: aborted!
n-1<657> ssi:boot:open: opening
n-1<657> ssi:boot:open: opening boot module globus
n-1<657> ssi:boot:open: opened boot module globus
n-1<657> ssi:boot:open: opening boot module rsh
n-1<657> ssi:boot:open: opened boot module rsh
n-1<657> ssi:boot:open: opening boot module slurm
n-1<657> ssi:boot:open: opened boot module slurm
n-1<657> ssi:boot:select: initializing boot module slurm
n-1<657> ssi:boot:slurm: not running under SLURM
n-1<657> ssi:boot:select: boot module not available: slurm
n-1<657> ssi:boot:select: initializing boot module globus
n-1<657> ssi:boot:globus: globus-job-run not found, globus boot will not
run
n-1<657> ssi:boot:select: boot module not available: globus
n-1<657> ssi:boot:select: initializing boot module rsh
n-1<657> ssi:boot:rsh: module initializing
n-1<657> ssi:boot:rsh:agent: rsh
n-1<657> ssi:boot:rsh:username: <same>
n-1<657> ssi:boot:rsh:verbose: 1000
n-1<657> ssi:boot:rsh:algorithm: linear
n-1<657> ssi:boot:rsh:no_n: 0
n-1<657> ssi:boot:rsh:no_profile: 0
n-1<657> ssi:boot:rsh:fast: 0
n-1<657> ssi:boot:rsh:ignore_stderr: 0
n-1<657> ssi:boot:rsh:priority: 10
n-1<657> ssi:boot:select: boot module available: rsh, priority: 10
n-1<657> ssi:boot:select: finalizing boot module slurm
n-1<657> ssi:boot:slurm: finalizing
n-1<657> ssi:boot:select: closing boot module slurm
n-1<657> ssi:boot:select: finalizing boot module globus
n-1<657> ssi:boot:globus: finalizing
n-1<657> ssi:boot:select: closing boot module globus
n-1<657> ssi:boot:select: selected boot module rsh
n-1<657> ssi:boot:base: looking for boot schema in following
directories:
n-1<657> ssi:boot:base: <current directory>
n-1<657> ssi:boot:base: $TROLLIUSHOME/etc
n-1<657> ssi:boot:base: $LAMHOME/etc
n-1<657> ssi:boot:base: /usr/local/lam/etc
n-1<657> ssi:boot:base: looking for boot schema file:
n-1<657> ssi:boot:base: lam-bhost.def
n-1<657> ssi:boot:base: found boot schema:
/usr/local/lam/etc/lam-bhost.def
n-1<657> ssi:boot:rsh: found the following hosts:
n-1<657> ssi:boot:rsh: n0 SGMASTER (cpu=2)
n-1<657> ssi:boot:rsh: n1 SG4 (cpu=2)
n-1<657> ssi:boot:rsh: n2 SG7 (cpu=2)
n-1<657> ssi:boot:rsh: resolved hosts:
n-1<657> ssi:boot:rsh: n0 SGMASTER --> 192.168.4.20 (origin)
n-1<657> ssi:boot:rsh: n1 SG4 --> 192.168.4.24
n-1<657> ssi:boot:rsh: n2 SG7 --> 192.168.4.27
n-1<657> ssi:boot:rsh: starting RTE procs
n-1<657> ssi:boot:base:linear: starting
n-1<657> ssi:boot:base:linear: booting n0 (SGMASTER)
n-1<657> ssi:boot:rsh: starting wipe on (SGMASTER)
n-1<657> ssi:boot:rsh: starting on n0 (SGMASTER): tkill -d
n-1<657> ssi:boot:rsh: launching locally
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-sgeadmin_at_SGMASTER/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-sgeadmin_at_SGMASTER/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-sgeadmin_at_SGMASTER/lam-io-socket
tkill: f_kill = "/tmp/lam-sgeadmin_at_SGMASTER/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 654 ...
tkill: killed
tkill: all finished
n-1<657> ssi:boot:rsh: successfully launched on n0 (SGMASTER)
n-1<657> ssi:boot:base:linear: booting n1 (SG4)
n-1<657> ssi:boot:rsh: starting wipe on (SG4)
n-1<657> ssi:boot:rsh: starting on n1 (SG4): tkill -d
n-1<657> ssi:boot:rsh: launching remotely
n-1<657> ssi:boot:rsh: attempting to execute: rsh SG4 -n 'echo $SHELL'
n-1<657> ssi:boot:rsh: remote shell /bin/bash
n-1<657> ssi:boot:rsh: attempting to execute: rsh SG4 -n tkill -d
tkill: removing socket file ...
tkill: socket file: /tmp/lam-sgeadmin_at_SG4/lam-sd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-sgeadmin_at_SG4/lam-sio
tkill: f_kill = "/tmp/lam-sgeadmin_at_SG4/lam"
tkill: nothing to kill: "/tmp/lam-sgeadmin_at_SG4/lam"
n-1<657> ssi:boot:rsh: successfully launched on n1 (SG4)
n-1<657> ssi:boot:base:linear: booting n2 (SG7)
n-1<657> ssi:boot:rsh: starting wipe on (SG7)
n-1<657> ssi:boot:rsh: starting on n2 (SG7): tkill -d
n-1<657> ssi:boot:rsh: launching remotely
n-1<657> ssi:boot:rsh: attempting to execute: rsh SG7 -n 'echo $SHELL'
n-1<657> ssi:boot:rsh: remote shell /bin/bash
n-1<657> ssi:boot:rsh: attempting to execute: rsh SG7 -n tkill -d
tkill: removing socket file ...
tkill: socket file: /tmp/lam-sgeadmin_at_SG7/lam-sd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-sgeadmin_at_SG7/lam-sio
tkill: f_kill = "/tmp/lam-sgeadmin_at_SG7/lam"
tkill: nothing to kill: "/tmp/lam-sgeadmin_at_SG7/lam"
n-1<657> ssi:boot:rsh: successfully launched on n2 (SG7)
n-1<657> ssi:boot:base:linear: finished
n-1<657> ssi:boot:rsh: all RTE procs started
n-1<657> ssi:boot:rsh: finalizing
n-1<657> ssi:boot: Closing
lamboot did NOT complete successfully
Please help me out.
Debasis Satapathy
Jr. Support Engineer
Locuz Enterprise Solutions
Mobile:9440551394