LAM/MPI General User's Mailing List Archives

From: 70uf33q Hu5541n (topa_007_at_[hidden])
Date: 2004-01-23 14:27:33


hi,

I need help urgently.
I have been trying to boot LAM for the past 3 days.

I can ssh and rsh into the other node without
problems.

I'm attaching a file that was generated with the
lamboot -d option.

Please help.

thanks,
Toufeeq

=====
"Love is control,I'll die if I let go
I will only let you breathe
My air that you receive
Then we'll see if I let you love me."
-James Hetfield
All Within My Hands,St.Anger
Metallica


pa_at_node1 topa]$ lamboot -d host
n0<3474> ssi:boot: Opening
n0<3474> ssi:boot: opening module globus
n0<3474> ssi:boot: initializing module globus
n0<3474> ssi:boot:globus: globus-job-run not found, globus boot will not run
n0<3474> ssi:boot: module not available: globus
n0<3474> ssi:boot: opening module rsh
n0<3474> ssi:boot: initializing module rsh
n0<3474> ssi:boot:rsh: module initializing
n0<3474> ssi:boot:rsh:agent: rsh
n0<3474> ssi:boot:rsh:username: <same>
n0<3474> ssi:boot:rsh:verbose: 1000
n0<3474> ssi:boot:rsh:algorithm: linear
n0<3474> ssi:boot:rsh:priority: 10
n0<3474> ssi:boot: module available: rsh, priority: 10
n0<3474> ssi:boot: finalizing module globus
n0<3474> ssi:boot:globus: finalizing
n0<3474> ssi:boot: closing module globus
n0<3474> ssi:boot: Selected boot module rsh
 
LAM 7.0.3/MPI 2 C++/ROMIO - Indiana University
 
n0<3474> ssi:boot:base: looking for boot schema in following directories:
n0<3474> ssi:boot:base: <current directory>
n0<3474> ssi:boot:base: $TROLLIUSHOME/etc
n0<3474> ssi:boot:base: $LAMHOME/etc
n0<3474> ssi:boot:base: /etc/lam
n0<3474> ssi:boot:base: looking for boot schema file:
n0<3474> ssi:boot:base: host
n0<3474> ssi:boot:base: found boot schema: host
n0<3474> ssi:boot:rsh: found the following hosts:
n0<3474> ssi:boot:rsh: n0 node1.topa.com (cpu=1)
n0<3474> ssi:boot:rsh: n1 node2 (cpu=1)
n0<3474> ssi:boot:rsh: resolved hosts:
n0<3474> ssi:boot:rsh: n0 node1.topa.com --> 192.168.0.1 (origin)
n0<3474> ssi:boot:rsh: n1 node2 --> 192.168.0.2
n0<3474> ssi:boot:rsh: starting RTE procs
n0<3474> ssi:boot:base:linear: starting
n0<3474> ssi:boot:base:server: opening server TCP socket
n0<3474> ssi:boot:base:server: opened port 32989
n0<3474> ssi:boot:base:linear: booting n0 (node1.topa.com)
n0<3474> ssi:boot:rsh: starting lamd on (node1.topa.com)
n0<3474> ssi:boot:rsh: starting on n0 (node1.topa.com): hboot -t -c lam-conf.lamd -d -I -H 192.168.0.1 -P 32989 -n 0 -o 0
n0<3474> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-topa_at_node1/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-topa_at_node1/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-topa_at_node1/lam-io-socket
tkill: f_kill = "/tmp/lam-topa_at_node1/lam-killfile"
tkill: nothing to kill: "/tmp/lam-topa_at_node1/lam-killfile"
hboot: booting...
hboot: fork /usr/bin/lamd
hboot: attempting to execute
n-1<3477> ssi:boot: Opening
n-1<3477> ssi:boot: opening module globus
n-1<3477> ssi:boot: initializing module globus
n-1<3477> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<3477> ssi:boot: module not available: globus
n-1<3477> ssi:boot: opening module rsh
n-1<3477> ssi:boot: initializing module rsh
n-1<3477> ssi:boot:rsh: module initializing
n-1<3477> ssi:boot:rsh:agent: rsh
n-1<3477> ssi:boot:rsh:username: <same>
n-1<3477> ssi:boot:rsh:verbose: 1000
n-1<3477> ssi:boot:rsh:algorithm: linear
n-1<3477> ssi:boot:rsh:priority: 10
n-1<3477> ssi:boot: module available: rsh, priority: 10
n-1<3477> ssi:boot: finalizing module globus
n-1<3477> ssi:boot:globus: finalizing
n-1<3477> ssi:boot: closing module globus
n-1<3477> ssi:boot: Selected boot module rsh
[1] 3477 lamd -H 192.168.0.1 -P 32989 -n 0 -o 0 -d
n0<3474> ssi:boot:rsh: successfully launched on n0 (node1.topa.com)
n0<3474> ssi:boot:base:server: expecting connection from finite list
n0<3474> ssi:boot:base:server: got connection from 192.168.0.1
n0<3474> ssi:boot:base:server: this connection is expected (n0)
n0<3474> ssi:boot:base:server: remote lamd is at 192.168.0.1:32780
n0<3474> ssi:boot:base:linear: booting n1 (node2)
n0<3474> ssi:boot:rsh: starting lamd on (node2)
n0<3474> ssi:boot:rsh: starting on n1 (node2): hboot -t -c lam-conf.lamd -d -s -I "-H 192.168.0.1 -P 32989 -n 1 -o 0"
n0<3474> ssi:boot:rsh: launching remotely
n0<3474> ssi:boot:rsh: attempting to execute "rsh node2 -n echo $SHELL"
n0<3474> ssi:boot:rsh: remote shell /bin/bash
n0<3474> ssi:boot:rsh: attempting to execute "rsh node2 -n hboot -t -c lam-conf.lamd -d -s -I "-H 192.168.0.1 -P 32989 -n 1 -o 0""
ERROR: LAM/MPI unexpectedly received the following on stderr:
hboot: cannot find process schema lam-conf.lamd: No such file or directory
-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node "node2",
but received some output on the standard error.
 
LAM tried to use the remote agent command "rsh"
to invoke "hboot" on the remote node.
 
This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a list of items that you may
wish to check on the remote node:
 
        - You have an account and can login to the remote machine
        - Incorrect permissions on your home directory (should
          probably be 0755)
        - Incorrect permissions on your $HOME/.rhosts file (if you are
          using rsh -- they should probably be 0644)
        - You have an entry in the remote $HOME/.rhosts file (if you
          are using rsh) for the machine and username that you are
          running from
        - Your .cshrc/.profile must not print anything out to the
          standard error
        - Your .cshrc/.profile should set a correct TERM type
        - Your .cshrc/.profile should set the SHELL environment
          variable to your default shell
 
Try invoking the following command at the unix command line:
 
        rsh node2 -n hboot -t -c lam-conf.lamd -d -s -I "-H 192.168.0.1 -P 32989 -n 1 -o 0"
 
You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.
 
When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n0<3474> ssi:boot:base:linear: Failed to boot n1 (node2)
n0<3474> ssi:boot:base:server: closing server socket
n0<3474> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).
 
Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
n0<3482> ssi:boot: Opening
n0<3482> ssi:boot: opening module globus
n0<3482> ssi:boot: initializing module globus
n0<3482> ssi:boot:globus: globus-job-run not found, globus boot will not run
n0<3482> ssi:boot: module not available: globus
n0<3482> ssi:boot: opening module rsh
n0<3482> ssi:boot: initializing module rsh
n0<3482> ssi:boot:rsh: module initializing
n0<3482> ssi:boot:rsh:agent: rsh
n0<3482> ssi:boot:rsh:username: <same>
n0<3482> ssi:boot:rsh:verbose: 1000
n0<3482> ssi:boot:rsh:algorithm: linear
n0<3482> ssi:boot:rsh:priority: 10
n0<3482> ssi:boot: module available: rsh, priority: 10
n0<3482> ssi:boot: finalizing module globus
n0<3482> ssi:boot:globus: finalizing
n0<3482> ssi:boot: closing module globus
n0<3482> ssi:boot: Selected boot module rsh
n0<3482> ssi:boot:base: looking for boot schema in following directories:
n0<3482> ssi:boot:base: <current directory>
n0<3482> ssi:boot:base: $TROLLIUSHOME/etc
n0<3482> ssi:boot:base: $LAMHOME/etc
n0<3482> ssi:boot:base: /etc/lam
n0<3482> ssi:boot:base: looking for boot schema file:
n0<3482> ssi:boot:base: host
n0<3482> ssi:boot:base: found boot schema: host
n0<3482> ssi:boot:rsh: found the following hosts:
n0<3482> ssi:boot:rsh: n0 node1.topa.com (cpu=1)
n0<3482> ssi:boot:rsh: n1 node2 (cpu=1)
n0<3482> ssi:boot:rsh: resolved hosts:
n0<3482> ssi:boot:rsh: n0 node1.topa.com --> 192.168.0.1 (origin)
n0<3482> ssi:boot:rsh: n1 node2 --> 192.168.0.2
n0<3482> ssi:boot:rsh: starting RTE procs
n0<3482> ssi:boot:base:linear: starting
n0<3482> ssi:boot:base:linear: booting n0 (node1.topa.com)
n0<3482> ssi:boot:rsh: starting wipe on (node1.topa.com)
n0<3482> ssi:boot:rsh: starting on n0 (node1.topa.com): tkill -d
n0<3482> ssi:boot:rsh: launching locally
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-topa_at_node1/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-topa_at_node1/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-topa_at_node1/lam-io-socket
tkill: f_kill = "/tmp/lam-topa_at_node1/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 3477 ...
tkill: killed
tkill: all finished
n0<3482> ssi:boot:rsh: successfully launched on n0 (node1.topa.com)
n0<3482> ssi:boot:base:linear: booting n1 (node2)
n0<3482> ssi:boot:rsh: starting wipe on (node2)
n0<3482> ssi:boot:rsh: starting on n1 (node2): tkill -d
n0<3482> ssi:boot:rsh: launching remotely
n0<3482> ssi:boot:rsh: attempting to execute "rsh node2 -n echo $SHELL"
n0<3482> ssi:boot:rsh: remote shell /bin/bash
n0<3482> ssi:boot:rsh: attempting to execute "rsh node2 -n tkill -d"
tkill: removing socket file ...
tkill: socket file: /tmp/lam-topa_at_node2/lam-sd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-topa_at_node2/lam-sio
tkill: f_kill = "/tmp/lam-topa_at_node2/lam"
tkill: nothing to kill: "/tmp/lam-topa_at_node2/lam"
n0<3482> ssi:boot:rsh: successfully launched on n1 (node2)
n0<3482> ssi:boot:base:linear: finished
n0<3482> ssi:boot:rsh: all RTE procs started
n0<3482> ssi:boot:rsh: finalizing
n0<3482> ssi:boot: Closing
lamboot did NOT complete successfully
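
The checklist in the lamboot error message above (permissions on the home directory and on $HOME/.rhosts, and nothing printed to standard error by the remote shell) can be checked by hand. The following is only a minimal sketch: the helper names check_mode and stderr_is_clean are hypothetical and not part of LAM, and "stat -c" assumes GNU coreutils as found on a typical Linux node.

```shell
#!/bin/sh
# Hypothetical helpers for the checklist printed by lamboot; not part of LAM.

# check_mode PATH EXPECTED_OCTAL
# Reports whether PATH has the expected permission bits.
# Assumption: GNU stat (the -c option is not available on all platforms).
check_mode() {
    actual=$(stat -c '%a' "$1" 2>/dev/null) || { echo "missing: $1"; return 1; }
    if [ "$actual" = "$2" ]; then
        echo "ok: $1 is $actual"
    else
        echo "check: $1 is $actual, expected $2"
        return 1
    fi
}

# stderr_is_clean COMMAND [ARGS...]
# Runs the command, discards its stdout, and reports anything it wrote
# to stderr. LAM treats any stderr output from the remote agent as an error.
stderr_is_clean() {
    err=$("$@" 2>&1 >/dev/null) || true
    if [ -z "$err" ]; then
        echo "clean"
    else
        echo "stderr: $err"
        return 1
    fi
}
```

Run on the remote node, check_mode "$HOME" 755 and check_mode "$HOME/.rhosts" 644 correspond to the permission items in the message. Run on the origin node, stderr_is_clean rsh node2 -n true tests whether the login files on node2 print anything to standard error. The values 755 and 644 are the ones the message itself calls "probably" correct, so a mismatch is a hint rather than a definite fault.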