LAM/MPI General User's Mailing List Archives

From: Reuti (reuti_at_[hidden])
Date: 2005-03-12 07:13:04


Hi,

I suggest leaving the original LAM/MPI installation untouched. It is easier to
put the boot schema (aka hostfile), which you have now edited in
/usr/local/lam/etc, into your home directory instead.

Then boot from your home directory, e.g.:

lamboot myschema

and check with:

lamnodes
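
As a rough sketch of the two steps above: the file name "myschema" and its
location in $HOME are only examples, while the host names and CPU counts are
the ones from the quoted post below. lamboot is only called if it is actually
found in the PATH.

```shell
#!/bin/sh
# Write a private boot schema instead of editing
# /usr/local/lam/etc/lam-bhost.def. "myschema" is just an example name.
schema="${SCHEMA:-$HOME/myschema}"

cat > "$schema" <<'EOF'
# one line per host; cpu=2 has the same effect as listing the host twice
SGMASTER cpu=2
SG4 cpu=2
SG7 cpu=2
EOF

# Boot and verify only when LAM is installed on this machine.
if command -v lamboot >/dev/null 2>&1; then
    lamboot "$schema" && lamnodes
else
    echo "lamboot not found in PATH; schema written to $schema"
fi
```

Keeping the schema in the home directory means the installed default in
/usr/local/lam/etc never has to be touched.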

Most likely the slave nodes also need the path to your LAM/MPI installation,
so put it in your .bashrc there (and similarly in .cshrc for csh):

export PATH=/usr/local/lam/bin:$PATH
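
If the rc file gets sourced more than once, the plain export keeps adding the
same entry. A small sketch that avoids this; the function name prepend_path is
made up for illustration, and /usr/local/lam/bin is the prefix from the quoted
configure call:

```shell
#!/bin/sh
# prepend_path DIR: put DIR in front of PATH unless it is already listed.
# The function name is hypothetical, not part of LAM/MPI.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, nothing to do
        *) PATH="$1:$PATH" ;;     # otherwise put it first
    esac
    export PATH
}

prepend_path /usr/local/lam/bin
```

For csh the counterpart in .cshrc would be
"setenv PATH /usr/local/lam/bin:${PATH}". Afterwards "rsh SG4 which hboot"
is a quick way to see whether the remote shell finds the LAM binaries.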

Another point: do you get any "welcome message" when using a plain "rsh"
to a slave node? Does "rsh SG4 date" return just the date and nothing else?
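
One possible way to test this by script rather than by eye is sketched below.
The helper name is_quiet is invented for this example; it only checks that a
command prints exactly one line on stdout and nothing on stderr, which is what
LAM's rsh boot module expects by default.

```shell
#!/bin/sh
# is_quiet CMD...: succeed only if CMD prints exactly one non-empty line
# on stdout and nothing at all on stderr.
# The function name is hypothetical, not part of LAM/MPI.
is_quiet() {
    err_file=$(mktemp) || return 2
    out=$("$@" 2>"$err_file")
    lines=$(printf '%s\n' "$out" | wc -l | tr -d ' ')
    status=0
    if [ -s "$err_file" ] || [ -z "$out" ] || [ "$lines" -ne 1 ]; then
        status=1
    fi
    rm -f "$err_file"
    return "$status"
}

# Example with the host from this thread (needs a working rsh):
# is_quiet rsh SG4 date && echo "SG4 is quiet" || echo "SG4 prints extra output"
```

If the check fails, the usual suspects are echo or stty calls in the rc files
on the remote node.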

Are you also on the SGE-users list? If not, here is the link to a Howto for
the integration into SGE:

http://gridengine.sunsource.net/project/gridengine/howto/lam-integration/lam-integration.html

Cheers - Reuti

Quoting Debasis <dsatapathy_at_[hidden]>:

> Hi TEAM,
> I am in the process of integrating LAM with Sun GRID (N1GE6) {sge 6.0}.
> My GRID machine is a Sun Fire v20z with AMD 64 architecture, and the
> operating system is RedHat Enterprise WS 3.0. I have already integrated my
> customer's EDA tools (Cadence) with GRID, but for Parallel Environment
> support I need to integrate LAM.
> Before integrating LAM with GRID, I just want to check how LAM works
> and how my job runs in a parallel environment with LAM alone, without the
> GRID environment.
>
> Currently I have configured only three nodes in GRID: (1) SGMASTER (master
> node) and two other nodes, (2) SG4 and (3) SG7.
> All three nodes have 2 CPUs each. I am logging on to the nodes as the NIS
> user sgeadmin.
> I downloaded LAM 7.1.1.tar.gz from www.lam-mpi.org/download and put
> it on SGMASTER (master node). On all three nodes I mounted the untarred
> directory at the same path through NFS.
> Then I did
> ./configure --prefix=/usr/local/lam (I have created a lam dir under
> /usr/local)
>
> make
> make install
>
> I checked that all the nodes can rsh to each other without being
> prompted for a password.
>
> In the home directory of the logged-in user on all three nodes I have set
> the following environment variables in .cshrc
> setenv LAMHOME /usr/local/lam
> setenv TROLLIUSHOME /usr/local/lam
>
> Then I edited the following files in /usr/local/lam/etc
>
> vi lam-bhost.def
>
> #localhost
> SGMASTER
> SGMASTER
> SG4
> SG4
> SG7
> SG7
>
> vi lam-conf.lamd
>
> /usr/local/lam/bin/lamd $inet_topo $debug $session_prefix $session_suffix
>
> Then in /usr/local/lam/bin path
> source /home/sgeadmin/.cshrc
>
> ./lamboot -d
>
> SGMASTER:/usr/local/lam/bin 2% ./lamboot -d
> n-1<544> ssi:boot:open: opening
> n-1<544> ssi:boot:open: opening boot module globus
> n-1<544> ssi:boot:open: opened boot module globus
> n-1<544> ssi:boot:open: opening boot module rsh
> n-1<544> ssi:boot:open: opened boot module rsh
> n-1<544> ssi:boot:open: opening boot module slurm
> n-1<544> ssi:boot:open: opened boot module slurm
> n-1<544> ssi:boot:select: initializing boot module slurm
> n-1<544> ssi:boot:slurm: not running under SLURM
> n-1<544> ssi:boot:select: boot module not available: slurm
> n-1<544> ssi:boot:select: initializing boot module globus
> n-1<544> ssi:boot:globus: globus-job-run not found, globus boot will not
> run
> n-1<544> ssi:boot:select: boot module not available: globus
> n-1<544> ssi:boot:select: initializing boot module rsh
> n-1<544> ssi:boot:rsh: module initializing
> n-1<544> ssi:boot:rsh:agent: rsh
> n-1<544> ssi:boot:rsh:username: <same>
> n-1<544> ssi:boot:rsh:verbose: 1000
> n-1<544> ssi:boot:rsh:algorithm: linear
> n-1<544> ssi:boot:rsh:no_n: 0
> n-1<544> ssi:boot:rsh:no_profile: 0
> n-1<544> ssi:boot:rsh:fast: 0
> n-1<544> ssi:boot:rsh:ignore_stderr: 0
> n-1<544> ssi:boot:rsh:priority: 10
> n-1<544> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<544> ssi:boot:select: finalizing boot module slurm
> n-1<544> ssi:boot:slurm: finalizing
> n-1<544> ssi:boot:select: closing boot module slurm
> n-1<544> ssi:boot:select: finalizing boot module globus
> n-1<544> ssi:boot:globus: finalizing
> n-1<544> ssi:boot:select: closing boot module globus
> n-1<544> ssi:boot:select: selected boot module rsh
>
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
> n-1<544> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<544> ssi:boot:base: <current directory>
> n-1<544> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<544> ssi:boot:base: $LAMHOME/etc
> n-1<544> ssi:boot:base: /usr/local/lam/etc
> n-1<544> ssi:boot:base: looking for boot schema file:
> n-1<544> ssi:boot:base: lam-bhost.def
> n-1<544> ssi:boot:base: found boot schema:
> /usr/local/lam/etc/lam-bhost.def
> n-1<544> ssi:boot:rsh: found the following hosts:
> n-1<544> ssi:boot:rsh: n0 SGMASTER (cpu=2)
> n-1<544> ssi:boot:rsh: n1 SG4 (cpu=2)
> n-1<544> ssi:boot:rsh: n2 SG7 (cpu=2)
> n-1<544> ssi:boot:rsh: resolved hosts:
> n-1<544> ssi:boot:rsh: n0 SGMASTER --> 192.168.4.20 (origin)
> n-1<544> ssi:boot:rsh: n1 SG4 --> 192.168.4.24
> n-1<544> ssi:boot:rsh: n2 SG7 --> 192.168.4.27
> n-1<544> ssi:boot:rsh: starting RTE procs
> n-1<544> ssi:boot:base:linear: starting
> n-1<544> ssi:boot:base:server: opening server TCP socket
> n-1<544> ssi:boot:base:server: opened port 39583
> n-1<544> ssi:boot:base:linear: booting n0 (SGMASTER)
> n-1<544> ssi:boot:rsh: starting lamd on (SGMASTER)
> n-1<544> ssi:boot:rsh: starting on n0 (SGMASTER): hboot -t -c
> lam-conf.lamd -d -I -H 192.168.4.20 -P 39583 -n 0 -o 0
> n-1<544> ssi:boot:rsh: launching locally
> hboot: performing tkill
> hboot: tkill -d
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-sgeadmin_at_SGMASTER/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-sgeadmin_at_SGMASTER/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-sgeadmin_at_SGMASTER/lam-io-socket
> tkill: f_kill = "/tmp/lam-sgeadmin_at_SGMASTER/lam-killfile"
> tkill: nothing to kill: "/tmp/lam-sgeadmin_at_SGMASTER/lam-killfile"
> hboot: booting...
> hboot: fork /usr/local/lam/bin/lamd
> hboot: attempting to execute
> n-1<547> ssi:boot:open: opening
> n-1<547> ssi:boot:open: opening boot module globus
> n-1<547> ssi:boot:open: opened boot module globus
> n-1<547> ssi:boot:open: opening boot module rsh
> n-1<547> ssi:boot:open: opened boot module rsh
> n-1<547> ssi:boot:open: opening boot module slurm
> n-1<547> ssi:boot:open: opened boot module slurm
> n-1<547> ssi:boot:select: initializing boot module slurm
> n-1<547> ssi:boot:slurm: not running under SLURM
> n-1<547> ssi:boot:select: boot module not available: slurm
> n-1<547> ssi:boot:select: initializing boot module globus
> [1] 547 lamd -H 192.168.4.20 -P 39583 -n 0 -o 0 -d
> n-1<547> ssi:boot:globus: globus-job-run not found, globus boot will not
> run
> n-1<547> ssi:boot:select: boot module not available: globus
> n-1<547> ssi:boot:select: initializing boot module rsh
> n-1<547> ssi:boot:rsh: module initializing
> n-1<547> ssi:boot:rsh:agent: rsh
> n-1<547> ssi:boot:rsh:username: <same>
> n-1<547> ssi:boot:rsh:verbose: 1000
> n-1<544> ssi:boot:rsh: successfully launched on n0 (SGMASTER)
> n-1<544> ssi:boot:base:server: expecting connection from finite list
> n-1<547> ssi:boot:rsh:algorithm: linear
> n-1<547> ssi:boot:rsh:no_n: 0
> n-1<547> ssi:boot:rsh:no_profile: 0
> n-1<547> ssi:boot:rsh:fast: 0
> n-1<547> ssi:boot:rsh:ignore_stderr: 0
> n-1<547> ssi:boot:rsh:priority: 10
> n-1<547> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<547> ssi:boot:select: finalizing boot module slurm
> n-1<547> ssi:boot:slurm: finalizing
> n-1<547> ssi:boot:select: closing boot module slurm
> n-1<547> ssi:boot:select: finalizing boot module globus
> n-1<547> ssi:boot:globus: finalizing
> n-1<547> ssi:boot:select: closing boot module globus
> n-1<547> ssi:boot:select: selected boot module rsh
> n-1<547> ssi:boot:send_lamd: getting node ID from command line
> n-1<547> ssi:boot:send_lamd: getting agent haddr from command line
> n-1<547> ssi:boot:send_lamd: getting agent port from command line
> n-1<547> ssi:boot:send_lamd: getting node ID from command line
> n-1<547> ssi:boot:send_lamd: connecting to 192.168.4.20:39583, node id 0
> n-1<544> ssi:boot:base:server: got connection from 192.168.4.20
> n-1<544> ssi:boot:base:server: this connection is expected (n0)
> n-1<547> ssi:boot:send_lamd: sending dli_port 32858
> n-1<544> ssi:boot:base:server: remote lamd is at 192.168.4.20:32858
> n-1<544> ssi:boot:base:linear: booting n1 (SG4)
> n-1<544> ssi:boot:rsh: starting lamd on (SG4)
> n-1<544> ssi:boot:rsh: starting on n1 (SG4): hboot -t -c lam-conf.lamd
> -d -s -I "-H 192.168.4.20 -P 39583 -n 1 -o 0"
> n-1<544> ssi:boot:rsh: launching remotely
> n-1<544> ssi:boot:rsh: attempting to execute: rsh SG4 -n 'echo $SHELL'
> n-1<544> ssi:boot:rsh: remote shell /bin/bash
> n-1<544> ssi:boot:rsh: attempting to execute: rsh SG4 -n hboot -t -c
> lam-conf.lamd -d -s -I '"-H 192.168.4.20 -P 39583 -n 1 -o 0"'
> ERROR: LAM/MPI unexpectedly received the following on stderr:
> hboot: cannot find process schema lam-conf.lamd: No such file or
> directory
> -----------------------------------------------------------------------------
> LAM attempted to execute a process on the remote node "SG4",
> but received some output on the standard error. This heuristic
> assumes that any output on the standard error indicates a fatal error,
> and therefore aborts. You can disable this behavior (i.e., have LAM
> ignore output on standard error) in the rsh boot module by setting the
> SSI parameter boot_rsh_ignore_stderr to 1.
>
> LAM tried to use the remote agent command "rsh"
> to invoke "hboot" on the remote node.
>
> *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> *** MAILING LIST.
>
> This can indicate an authentication error with the remote agent, or
> can indicate an error in your $HOME/.cshrc, $HOME/.login, or
> $HOME/.profile files. The following is a (non-inclusive) list of items
> that you should check on the remote node:
>
> - You have an account and can login to the remote machine
> - Incorrect permissions on your home directory (should
> probably be 0755)
> - Incorrect permissions on your $HOME/.rhosts file (if you are
> using rsh -- they should probably be 0644)
> - You have an entry in the remote $HOME/.rhosts file (if you
> are using rsh) for the machine and username that you are
> running from
> - Your .cshrc/.profile must not print anything out to the
> standard error
> - Your .cshrc/.profile should set a correct TERM type
> - Your .cshrc/.profile should set the SHELL environment
> variable to your default shell
>
> Try invoking the following command at the unix command line:
>
> rsh SG4 -n hboot -t -c lam-conf.lamd -d -s -I '"-H 192.168.4.20
> -P 39583 -n 1 -o 0"'
>
> You will need to configure your local setup such that you will *not*
> be prompted for a password to invoke this command on the remote node.
> No output should be printed from the remote node before the output of
> the command is displayed.
>
> When you can get this command to execute successfully by hand, LAM
> will probably be able to function properly.
> -----------------------------------------------------------------------------
> n-1<544> ssi:boot:base:linear: Failed to boot n1 (SG4)
> n-1<544> ssi:boot:base:server: closing server socket
> n-1<544> ssi:boot:base:linear: aborted!
> n-1<550> ssi:boot:open: opening
> n-1<550> ssi:boot:open: opening boot module globus
> n-1<550> ssi:boot:open: opened boot module globus
> n-1<550> ssi:boot:open: opening boot module rsh
> n-1<550> ssi:boot:open: opened boot module rsh
> n-1<550> ssi:boot:open: opening boot module slurm
> n-1<550> ssi:boot:open: opened boot module slurm
> n-1<550> ssi:boot:select: initializing boot module slurm
> n-1<550> ssi:boot:slurm: not running under SLURM
> n-1<550> ssi:boot:select: boot module not available: slurm
> n-1<550> ssi:boot:select: initializing boot module globus
> n-1<550> ssi:boot:globus: globus-job-run not found, globus boot will not
> run
> n-1<550> ssi:boot:select: boot module not available: globus
> n-1<550> ssi:boot:select: initializing boot module rsh
> n-1<550> ssi:boot:rsh: module initializing
> n-1<550> ssi:boot:rsh:agent: rsh
> n-1<550> ssi:boot:rsh:username: <same>
> n-1<550> ssi:boot:rsh:verbose: 1000
> n-1<550> ssi:boot:rsh:algorithm: linear
> n-1<550> ssi:boot:rsh:no_n: 0
> n-1<550> ssi:boot:rsh:no_profile: 0
> n-1<550> ssi:boot:rsh:fast: 0
> n-1<550> ssi:boot:rsh:ignore_stderr: 0
> n-1<550> ssi:boot:rsh:priority: 10
> n-1<550> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<550> ssi:boot:select: finalizing boot module slurm
> n-1<550> ssi:boot:slurm: finalizing
> n-1<550> ssi:boot:select: closing boot module slurm
> n-1<550> ssi:boot:select: finalizing boot module globus
> n-1<550> ssi:boot:globus: finalizing
> n-1<550> ssi:boot:select: closing boot module globus
> n-1<550> ssi:boot:select: selected boot module rsh
> n-1<550> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<550> ssi:boot:base: <current directory>
> n-1<550> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<550> ssi:boot:base: $LAMHOME/etc
> n-1<550> ssi:boot:base: /usr/local/lam/etc
> n-1<550> ssi:boot:base: looking for boot schema file:
> n-1<550> ssi:boot:base: lam-bhost.def
> n-1<550> ssi:boot:base: found boot schema:
> /usr/local/lam/etc/lam-bhost.def
> n-1<550> ssi:boot:rsh: found the following hosts:
> n-1<550> ssi:boot:rsh: n0 SGMASTER (cpu=2)
> n-1<550> ssi:boot:rsh: n1 SG4 (cpu=2)
> n-1<550> ssi:boot:rsh: n2 SG7 (cpu=2)
> n-1<550> ssi:boot:rsh: resolved hosts:
> n-1<550> ssi:boot:rsh: n0 SGMASTER --> 192.168.4.20 (origin)
> n-1<550> ssi:boot:rsh: n1 SG4 --> 192.168.4.24
> n-1<550> ssi:boot:rsh: n2 SG7 --> 192.168.4.27
> n-1<550> ssi:boot:rsh: starting RTE procs
> n-1<550> ssi:boot:base:linear: starting
> n-1<550> ssi:boot:base:linear: booting n0 (SGMASTER)
> n-1<550> ssi:boot:rsh: starting wipe on (SGMASTER)
> n-1<550> ssi:boot:rsh: starting on n0 (SGMASTER): tkill -d
> n-1<550> ssi:boot:rsh: launching locally
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-sgeadmin_at_SGMASTER/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-sgeadmin_at_SGMASTER/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-sgeadmin_at_SGMASTER/lam-io-socket
> tkill: f_kill = "/tmp/lam-sgeadmin_at_SGMASTER/lam-killfile"
> tkill: killing LAM...
> tkill: killing PID (SIGHUP) 547 ...
> tkill: killed
> tkill: all finished
> n-1<550> ssi:boot:rsh: successfully launched on n0 (SGMASTER)
> n-1<550> ssi:boot:base:linear: booting n1 (SG4)
> n-1<550> ssi:boot:rsh: starting wipe on (SG4)
> n-1<550> ssi:boot:rsh: starting on n1 (SG4): tkill -d
> n-1<550> ssi:boot:rsh: launching remotely
> n-1<550> ssi:boot:rsh: attempting to execute: rsh SG4 -n 'echo $SHELL'
> n-1<550> ssi:boot:rsh: remote shell /bin/bash
> n-1<550> ssi:boot:rsh: attempting to execute: rsh SG4 -n tkill -d
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-sgeadmin_at_SG4/lam-sd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-sgeadmin_at_SG4/lam-sio
> tkill: f_kill = "/tmp/lam-sgeadmin_at_SG4/lam"
> tkill: nothing to kill: "/tmp/lam-sgeadmin_at_SG4/lam"
> n-1<550> ssi:boot:rsh: successfully launched on n1 (SG4)
> n-1<550> ssi:boot:base:linear: booting n2 (SG7)
> n-1<550> ssi:boot:rsh: starting wipe on (SG7)
> n-1<550> ssi:boot:rsh: starting on n2 (SG7): tkill -d
> # Copyright (c) 1998-2001 University of Notre Dame.
> #
> # Copyright (c) 2001-2003 The Trustees of Indiana University.
> # All rights reserved.
> # Copyright (c) 1998-2001 University of Notre Dame.
> # All rights reserved.
> # Copyright (c) 1994-1998 The Ohio State University.
> # All rights reserved.
> #
> # This file is part of the LAM/MPI software package. For license
> # information, see the LICENSE file in the top level directory of the
> # LAM/MPI source distribution.
> #
> # $HEADER$
> #
> # Function: - LAM process schema
> # - single daemon version
> #
>
> /usr/local/lam/bin/lamd $inet_topo $debug $session_prefix $session_suffix
>
> SGMASTER:/usr/local/lam/etc 7% cd /tmp
> jd_sockV4= orbit-root/ orbit-sgeadmin/
> SGMASTER:/tmp 8% cd
> SGMASTER:/home/sgeadmin 9% cd /usr/local/lam/bin
> hboot* lamcheckpoint* lamhalt* lamtrace* mpiexec* recon*
> hcc@ lamclean* laminfo* lamwipe* mpif77* tkill*
> hcp@ lamd* lamnodes* mpic++* mpimsg* tping*
> hf77@ lamexec* lamrestart* mpicc* mpirun* wipe@
> lamboot* lamgrow* lamshrink* mpiCC@ mpitask*
> SGMASTER:/usr/local/lam/bin 10% csh
> xhost: Command not found.
> SGMASTER:/usr/local/lam/bin 1% source /home/sgeadmin/.cshrc
> xhost: Command not found.
> SGMASTER:/usr/local/lam/bin 2% ./lamboot -d
> n-1<651> ssi:boot:open: opening
> n-1<651> ssi:boot:open: opening boot module globus
> n-1<651> ssi:boot:open: opened boot module globus
> n-1<651> ssi:boot:open: opening boot module rsh
> n-1<651> ssi:boot:open: opened boot module rsh
> n-1<651> ssi:boot:open: opening boot module slurm
> n-1<651> ssi:boot:open: opened boot module slurm
> n-1<651> ssi:boot:select: initializing boot module slurm
> n-1<651> ssi:boot:slurm: not running under SLURM
> n-1<651> ssi:boot:select: boot module not available: slurm
> n-1<651> ssi:boot:select: initializing boot module globus
> n-1<651> ssi:boot:globus: globus-job-run not found, globus boot will not
> run
> n-1<651> ssi:boot:select: boot module not available: globus
> n-1<651> ssi:boot:select: initializing boot module rsh
> n-1<651> ssi:boot:rsh: module initializing
> n-1<651> ssi:boot:rsh:agent: rsh
> n-1<651> ssi:boot:rsh:username: <same>
> n-1<651> ssi:boot:rsh:verbose: 1000
> n-1<651> ssi:boot:rsh:algorithm: linear
> n-1<651> ssi:boot:rsh:no_n: 0
> n-1<651> ssi:boot:rsh:no_profile: 0
> n-1<651> ssi:boot:rsh:fast: 0
> n-1<651> ssi:boot:rsh:ignore_stderr: 0
> n-1<651> ssi:boot:rsh:priority: 10
> n-1<651> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<651> ssi:boot:select: finalizing boot module slurm
> n-1<651> ssi:boot:slurm: finalizing
> n-1<651> ssi:boot:select: closing boot module slurm
> n-1<651> ssi:boot:select: finalizing boot module globus
> n-1<651> ssi:boot:globus: finalizing
> n-1<651> ssi:boot:select: closing boot module globus
> n-1<651> ssi:boot:select: selected boot module rsh
>
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
> n-1<651> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<651> ssi:boot:base: <current directory>
> n-1<651> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<651> ssi:boot:base: $LAMHOME/etc
> n-1<651> ssi:boot:base: /usr/local/lam/etc
> n-1<651> ssi:boot:base: looking for boot schema file:
> n-1<651> ssi:boot:base: lam-bhost.def
> n-1<651> ssi:boot:base: found boot schema:
> /usr/local/lam/etc/lam-bhost.def
> n-1<651> ssi:boot:rsh: found the following hosts:
> n-1<651> ssi:boot:rsh: n0 SGMASTER (cpu=2)
> n-1<651> ssi:boot:rsh: n1 SG4 (cpu=2)
> n-1<651> ssi:boot:rsh: n2 SG7 (cpu=2)
> n-1<651> ssi:boot:rsh: resolved hosts:
> n-1<651> ssi:boot:rsh: n0 SGMASTER --> 192.168.4.20 (origin)
> n-1<651> ssi:boot:rsh: n1 SG4 --> 192.168.4.24
> n-1<651> ssi:boot:rsh: n2 SG7 --> 192.168.4.27
> n-1<651> ssi:boot:rsh: starting RTE procs
> n-1<651> ssi:boot:base:linear: starting
> n-1<651> ssi:boot:base:server: opening server TCP socket
> n-1<651> ssi:boot:base:server: opened port 39586
> n-1<651> ssi:boot:base:linear: booting n0 (SGMASTER)
> n-1<651> ssi:boot:rsh: starting lamd on (SGMASTER)
> n-1<651> ssi:boot:rsh: starting on n0 (SGMASTER): hboot -t -c
> lam-conf.lamd -d -I -H 192.168.4.20 -P 39586 -n 0 -o 0
> n-1<651> ssi:boot:rsh: launching locally
> hboot: performing tkill
> hboot: tkill -d
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-sgeadmin_at_SGMASTER/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-sgeadmin_at_SGMASTER/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-sgeadmin_at_SGMASTER/lam-io-socket
> tkill: f_kill = "/tmp/lam-sgeadmin_at_SGMASTER/lam-killfile"
> tkill: nothing to kill: "/tmp/lam-sgeadmin_at_SGMASTER/lam-killfile"
> hboot: booting...
> hboot: fork /usr/local/lam/bin/lamd
> hboot: attempting to execute
> n-1<654> ssi:boot:open: opening
> n-1<654> ssi:boot:open: opening boot module globus
> n-1<654> ssi:boot:open: opened boot module globus
> n-1<654> ssi:boot:open: opening boot module rsh
> n-1<654> ssi:boot:open: opened boot module rsh
> n-1<654> ssi:boot:open: opening boot module slurm
> n-1<654> ssi:boot:open: opened boot module slurm
> n-1<654> ssi:boot:select: initializing boot module slurm
> n-1<654> ssi:boot:slurm: not running under SLURM
> n-1<654> ssi:boot:select: boot module not available: slurm
> n-1<654> ssi:boot:select: initializing boot module globus
> n-1<654> ssi:boot:globus: globus-job-run not found, globus boot will not
> run
> n-1<654> ssi:boot:select: boot module not available: globus
> n-1<654> ssi:boot:select: initializing boot module rsh
> n-1<654> ssi:boot:rsh: module initializing
> n-1<654> ssi:boot:rsh:agent: rsh
> n-1<654> ssi:boot:rsh:username: <same>
> n-1<654> ssi:boot:rsh:verbose: 1000
> n-1<654> ssi:boot:rsh:algorithm: linear
> n-1<654> ssi:boot:rsh:no_n: 0
> n-1<654> ssi:boot:rsh:no_profile: 0
> n-1<654> ssi:boot:rsh:fast: 0
> n-1<654> ssi:boot:rsh:ignore_stderr: 0
> n-1<654> ssi:boot:rsh:priority: 10
> n-1<654> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<654> ssi:boot:select: finalizing boot module slurm
> n-1<654> ssi:boot:slurm: finalizing
> n-1<654> ssi:boot:select: closing boot module slurm
> n-1<654> ssi:boot:select: finalizing boot module globus
> n-1<654> ssi:boot:globus: finalizing
> n-1<654> ssi:boot:select: closing boot module globus
> n-1<654> ssi:boot:select: selected boot module rsh
> n-1<654> ssi:boot:send_lamd: getting node ID from command line
> n-1<654> ssi:boot:send_lamd: getting agent haddr from command line
> n-1<654> ssi:boot:send_lamd: getting agent port from command line
> n-1<654> ssi:boot:send_lamd: getting node ID from command line
> n-1<654> ssi:boot:send_lamd: connecting to 192.168.4.20:39586, node id 0
> n-1<654> ssi:boot:send_lamd: sending dli_port 32859
> [1] 654 lamd -H 192.168.4.20 -P 39586 -n 0 -o 0 -d
> n-1<651> ssi:boot:rsh: successfully launched on n0 (SGMASTER)
> n-1<651> ssi:boot:base:server: expecting connection from finite list
> n-1<651> ssi:boot:base:server: got connection from 192.168.4.20
> n-1<651> ssi:boot:base:server: this connection is expected (n0)
> n-1<651> ssi:boot:base:server: remote lamd is at 192.168.4.20:32859
> n-1<651> ssi:boot:base:linear: booting n1 (SG4)
> n-1<651> ssi:boot:rsh: starting lamd on (SG4)
> n-1<651> ssi:boot:rsh: starting on n1 (SG4): hboot -t -c lam-conf.lamd
> -d -s -I "-H 192.168.4.20 -P 39586 -n 1 -o 0"
> n-1<651> ssi:boot:rsh: launching remotely
> n-1<651> ssi:boot:rsh: attempting to execute: rsh SG4 -n 'echo $SHELL'
> n-1<651> ssi:boot:rsh: remote shell /bin/bash
> n-1<651> ssi:boot:rsh: attempting to execute: rsh SG4 -n hboot -t -c
> lam-conf.lamd -d -s -I '"-H 192.168.4.20 -P 39586 -n 1 -o 0"'
> ERROR: LAM/MPI unexpectedly received the following on stderr:
> hboot: cannot find process schema lam-conf.lamd: No such file or
> directory
> -----------------------------------------------------------------------------
> LAM attempted to execute a process on the remote node "SG4",
> but received some output on the standard error. This heuristic
> assumes that any output on the standard error indicates a fatal error,
> and therefore aborts. You can disable this behavior (i.e., have LAM
> ignore output on standard error) in the rsh boot module by setting the
> SSI parameter boot_rsh_ignore_stderr to 1.
>
> LAM tried to use the remote agent command "rsh"
> to invoke "hboot" on the remote node.
>
> *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> *** MAILING LIST.
>
> This can indicate an authentication error with the remote agent, or
> can indicate an error in your $HOME/.cshrc, $HOME/.login, or
> $HOME/.profile files. The following is a (non-inclusive) list of items
> that you should check on the remote node:
>
> - You have an account and can login to the remote machine
> - Incorrect permissions on your home directory (should
> probably be 0755)
> - Incorrect permissions on your $HOME/.rhosts file (if you are
> using rsh -- they should probably be 0644)
> - You have an entry in the remote $HOME/.rhosts file (if you
> are using rsh) for the machine and username that you are
> running from
> - Your .cshrc/.profile must not print anything out to the
> standard error
> - Your .cshrc/.profile should set a correct TERM type
> - Your .cshrc/.profile should set the SHELL environment
> variable to your default shell
>
> Try invoking the following command at the unix command line:
>
> rsh SG4 -n hboot -t -c lam-conf.lamd -d -s -I '"-H 192.168.4.20
> -P 39586 -n 1 -o 0"'
>
> You will need to configure your local setup such that you will *not*
> be prompted for a password to invoke this command on the remote node.
> No output should be printed from the remote node before the output of
> the command is displayed.
>
> When you can get this command to execute successfully by hand, LAM
> will probably be able to function properly.
> -----------------------------------------------------------------------------
> n-1<651> ssi:boot:base:linear: Failed to boot n1 (SG4)
> n-1<651> ssi:boot:base:server: closing server socket
> n-1<651> ssi:boot:base:linear: aborted!
> n-1<657> ssi:boot:open: opening
> n-1<657> ssi:boot:open: opening boot module globus
> n-1<657> ssi:boot:open: opened boot module globus
> n-1<657> ssi:boot:open: opening boot module rsh
> n-1<657> ssi:boot:open: opened boot module rsh
> n-1<657> ssi:boot:open: opening boot module slurm
> n-1<657> ssi:boot:open: opened boot module slurm
> n-1<657> ssi:boot:select: initializing boot module slurm
> n-1<657> ssi:boot:slurm: not running under SLURM
> n-1<657> ssi:boot:select: boot module not available: slurm
> n-1<657> ssi:boot:select: initializing boot module globus
> n-1<657> ssi:boot:globus: globus-job-run not found, globus boot will not
> run
> n-1<657> ssi:boot:select: boot module not available: globus
> n-1<657> ssi:boot:select: initializing boot module rsh
> n-1<657> ssi:boot:rsh: module initializing
> n-1<657> ssi:boot:rsh:agent: rsh
> n-1<657> ssi:boot:rsh:username: <same>
> n-1<657> ssi:boot:rsh:verbose: 1000
> n-1<657> ssi:boot:rsh:algorithm: linear
> n-1<657> ssi:boot:rsh:no_n: 0
> n-1<657> ssi:boot:rsh:no_profile: 0
> n-1<657> ssi:boot:rsh:fast: 0
> n-1<657> ssi:boot:rsh:ignore_stderr: 0
> n-1<657> ssi:boot:rsh:priority: 10
> n-1<657> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<657> ssi:boot:select: finalizing boot module slurm
> n-1<657> ssi:boot:slurm: finalizing
> n-1<657> ssi:boot:select: closing boot module slurm
> n-1<657> ssi:boot:select: finalizing boot module globus
> n-1<657> ssi:boot:globus: finalizing
> n-1<657> ssi:boot:select: closing boot module globus
> n-1<657> ssi:boot:select: selected boot module rsh
> n-1<657> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<657> ssi:boot:base: <current directory>
> n-1<657> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<657> ssi:boot:base: $LAMHOME/etc
> n-1<657> ssi:boot:base: /usr/local/lam/etc
> n-1<657> ssi:boot:base: looking for boot schema file:
> n-1<657> ssi:boot:base: lam-bhost.def
> n-1<657> ssi:boot:base: found boot schema:
> /usr/local/lam/etc/lam-bhost.def
> n-1<657> ssi:boot:rsh: found the following hosts:
> n-1<657> ssi:boot:rsh: n0 SGMASTER (cpu=2)
> n-1<657> ssi:boot:rsh: n1 SG4 (cpu=2)
> n-1<657> ssi:boot:rsh: n2 SG7 (cpu=2)
> n-1<657> ssi:boot:rsh: resolved hosts:
> n-1<657> ssi:boot:rsh: n0 SGMASTER --> 192.168.4.20 (origin)
> n-1<657> ssi:boot:rsh: n1 SG4 --> 192.168.4.24
> n-1<657> ssi:boot:rsh: n2 SG7 --> 192.168.4.27
> n-1<657> ssi:boot:rsh: starting RTE procs
> n-1<657> ssi:boot:base:linear: starting
> n-1<657> ssi:boot:base:linear: booting n0 (SGMASTER)
> n-1<657> ssi:boot:rsh: starting wipe on (SGMASTER)
> n-1<657> ssi:boot:rsh: starting on n0 (SGMASTER): tkill -d
> n-1<657> ssi:boot:rsh: launching locally
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-sgeadmin_at_SGMASTER/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-sgeadmin_at_SGMASTER/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-sgeadmin_at_SGMASTER/lam-io-socket
> tkill: f_kill = "/tmp/lam-sgeadmin_at_SGMASTER/lam-killfile"
> tkill: killing LAM...
> tkill: killing PID (SIGHUP) 654 ...
> tkill: killed
> tkill: all finished
> n-1<657> ssi:boot:rsh: successfully launched on n0 (SGMASTER)
> n-1<657> ssi:boot:base:linear: booting n1 (SG4)
> n-1<657> ssi:boot:rsh: starting wipe on (SG4)
> n-1<657> ssi:boot:rsh: starting on n1 (SG4): tkill -d
> n-1<657> ssi:boot:rsh: launching remotely
> n-1<657> ssi:boot:rsh: attempting to execute: rsh SG4 -n 'echo $SHELL'
> n-1<657> ssi:boot:rsh: remote shell /bin/bash
> n-1<657> ssi:boot:rsh: attempting to execute: rsh SG4 -n tkill -d
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-sgeadmin_at_SG4/lam-sd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-sgeadmin_at_SG4/lam-sio
> tkill: f_kill = "/tmp/lam-sgeadmin_at_SG4/lam"
> tkill: nothing to kill: "/tmp/lam-sgeadmin_at_SG4/lam"
> n-1<657> ssi:boot:rsh: successfully launched on n1 (SG4)
> n-1<657> ssi:boot:base:linear: booting n2 (SG7)
> n-1<657> ssi:boot:rsh: starting wipe on (SG7)
> n-1<657> ssi:boot:rsh: starting on n2 (SG7): tkill -d
> n-1<657> ssi:boot:rsh: launching remotely
> n-1<657> ssi:boot:rsh: attempting to execute: rsh SG7 -n 'echo $SHELL'
> n-1<657> ssi:boot:rsh: remote shell /bin/bash
> n-1<657> ssi:boot:rsh: attempting to execute: rsh SG7 -n tkill -d
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-sgeadmin_at_SG7/lam-sd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-sgeadmin_at_SG7/lam-sio
> tkill: f_kill = "/tmp/lam-sgeadmin_at_SG7/lam"
> tkill: nothing to kill: "/tmp/lam-sgeadmin_at_SG7/lam"
> n-1<657> ssi:boot:rsh: successfully launched on n2 (SG7)
> n-1<657> ssi:boot:base:linear: finished
> n-1<657> ssi:boot:rsh: all RTE procs started
> n-1<657> ssi:boot:rsh: finalizing
> n-1<657> ssi:boot: Closing
> lamboot did NOT complete successfully
>
> Please Help Me Out.
>
>
>
>
>
> Debasis Satapathy
> Jr. Support Engineer
> Locuz Enterprise Solutions
> Mobile:9440551394
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>