LAM/MPI General User's Mailing List Archives

From: Prerna Dua (prerna.dua_at_[hidden])
Date: 2006-03-14 11:16:36


Dear Team members,

I saw that this topic has been posted before and also found it on the FAQ
page. I have tried all the suggested solutions, but the error persists.

I log in to my account with my user ID and password and try to invoke a
script. The script contains the following commands:

#!/bin/sh
echo "Running LAM/MPI Configuration script..."
# ssh has to be the default remote invocation in order for LAM/MPI to work properly
export LAMRSH="ssh -x"
# if the path is not changed MPICH might run instead of LAM/MPI with the current Mac Cluster's configuration
export PATH="/bin:/sbin:/usr/bin:/usr/sbin:/common/bin:/common/sbin:/usr/local/bin:/usr/local/sbin:/usr/X11R6/bin"

# start LAM/MPI with verbose, assumes that hosts file bhosts is in the same directory
echo "Starting LAM daemons..."
lamboot bhosts -v
echo " "
echo " Note: If lamboot started successfully, call lamhalt when done"
echo "Done."
echo " "
echo "If mpirun is not working properly call the following command: "
echo "export PATH=[quote]/bin:/sbin:/usr/bin:/usr/sbin:/common/bin:/common/sbin:/usr/local/bin:/usr/local/sbin:/usr/X11R6/bin[quote]"
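
The PATH override in the script exists so that LAM's tools are found ahead of MPICH's. One way to confirm which executables a given shell actually resolves is sketched below. It is only an illustration: the function name resolve_tools is made up here, and the tool names passed to it are whatever needs checking.

```shell
#!/bin/sh
# Report which executable the current PATH resolves for each named tool.
# "command -v" is the POSIX way to ask the shell where a command lives.
resolve_tools() {
    for tool in "$@"; do
        path=$(command -v "$tool" 2>/dev/null) || path="(not found)"
        echo "$tool -> $path"
    done
}

# Usage (run in the same shell that will call lamboot):
#   resolve_tools lamboot mpirun
```

If mpirun resolves to a directory under the MPICH installation, the PATH seen by that shell still puts MPICH first.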

Until the day before yesterday, this booted the LAM daemons and I could
work with my programs. Now, however, I am encountering the following problem.

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

n-1<1605> ssi:boot:base:linear: booting n0 (portal2net.cluster.private)
n-1<1605> ssi:boot:base:linear: booting n1 (node001.cluster.private)

ERROR: LAM/MPI unexpectedly received the following on stderr:
node001.cluster.private: Connection refused

-----------------------------------------------------------------------------

LAM failed to execute a process on the remote node "node001.cluster.private".
LAM was not trying to invoke any LAM-specific commands yet -- we were
simply trying to determine what shell was being used on the remote
host.

LAM tried to use the remote agent command "rsh"
to invoke "echo $SHELL" on the remote node.

[This is the problem: rsh is used by default instead of ssh. Since I am
using ssh, it has to be specified explicitly. The script mpi_script.sh
sets ssh instead of rsh. I also tried giving the command manually,
export LAMRSH="ssh -x", but after a long wait I got the same error.]

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** ( http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

This usually indicates an authentication problem with the remote
agent, some other configuration type of error in your .cshrc or
.profile file, or you were unable to executable a command on the
remote node for some other reason. The following is a list of items
that you should check on the remote node:

        - You have an account and can login to the remote machine
        - Incorrect permissions on your home directory (should
          probably be 0755)
         [I have checked my permissions; they are 0755]

        - Incorrect permissions on your $HOME/.rhosts file (if you are
          using rsh -- they should probably be 0644)
         [ I am not using rsh, but ssh]

        - You have an entry in the remote $HOME/.rhosts file (if you
          are using rsh) for the machine and username that you are
          running from
         [ I am not using rsh, but ssh]

        - Your .cshrc/.profile must not print anything out to the
          standard error

        - Your .cshrc/.profile should set a correct TERM type

        - Your .cshrc/.profile should set the SHELL environment
          variable to your default shell

Try invoking the following command at the unix command line:

         rsh node001.cluster.private -n 'echo $SHELL'

 You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.

-----------------------------------------------------------------------------

n-1<1605> ssi:boot:base:linear: Failed to boot n1 (node001.cluster.private)

n-1<1605> ssi:boot:base:linear: aborted!

n-1<1644> ssi:boot:base:linear: booting n0 (portal2net.cluster.private)

n-1<1644> ssi:boot:base:linear: booting n1 (node001.cluster.private)

ERROR: LAM/MPI unexpectedly received the following on stderr:

node001.cluster.private: Connection refused

lamboot did NOT complete successfully

dmrlbioinformaticsmaster:~/MPI_Itemsets_1_0_1_try genemine$ rsh node001.cluster.private -n 'echo $SHELL'

node001.cluster.private: Connection refused.
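
The manual test above still goes through rsh, which the node refuses. Since LAMRSH is set to "ssh -x", the comparable check is the same probe sent through ssh. A minimal sketch follows, assuming OpenSSH's BatchMode and ConnectTimeout options are available and that the boot schema lists one host per line with the hostname in the first column. The function names list_hosts and check_hosts are made up for illustration.

```shell
#!/bin/sh
# Print the hostname column of a boot schema file, skipping blank lines
# and anything after a '#' comment marker.
list_hosts() {
    sed -e 's/#.*//' "$1" | awk 'NF { print $1 }'
}

# Send the 'echo $SHELL' probe to every host through ssh.
# BatchMode=yes makes ssh fail instead of prompting for a password;
# ConnectTimeout keeps an unreachable node from hanging the loop.
check_hosts() {
    for host in $(list_hosts "$1"); do
        if ssh -x -n -o BatchMode=yes -o ConnectTimeout=10 \
               "$host" 'echo $SHELL' >/dev/null 2>&1; then
            echo "ok    $host"
        else
            echo "FAIL  $host"
        fi
    done
}

# Usage: check_hosts bhosts
```

Any host reported as FAIL would need to be fixed (or removed from bhosts) before lamboot can be expected to get past it.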

Next, I opened the bhosts file, which has the following information:

portal2net.cluster.private cpu=2
node002.cluster.private cpu=2
node003.cluster.private cpu=2
node004.cluster.private cpu=2
node005.cluster.private cpu=2
node006.cluster.private cpu=2
node007.cluster.private cpu=2

I removed the line node001.cluster.private cpu=2 from the list and tried:

$ export LAMRSH="ssh -x"

$ lamboot bhosts -v

It hung after node0 booted, while it was trying to boot node2. This was
happening before as well. I usually wait for half an hour; sometimes I get the
error message that I copied above, otherwise there is no activity once
node0 is booted.

I also tried to run it from the command line, and here is what I get:

$ LAMRSH="ssh -x"
$ export LAMRSH

$ export

declare -x BLASTDB="/common/data"
declare -x BLASTMAT="/common/data/blastmat"
declare -x DYLD_LIBRARY_PATH="/common/sge/lib/darwin:/common/lib"
declare -x EMBOSS_ACDROOT="/common/share/EMBOSS/acd"
declare -x HOME="/Users/genemine"
declare -x INQUIRYVER="1.4.1"
declare -x LAMRSH="ssh -x"
declare -x LOGNAME="genemine"
declare -x MAIL="/var/mail/genemine"
declare -x MANPATH="/common/sge/man:/usr/share/man:/common/mpich-1.2.7/ch_p4/man:/common/man:/usr/local/man:/usr/X11R6/man"
declare -x MPICH="/common/mpich-1.2.7/ch_p4"
declare -x OLDPWD="/Users/genemine"
declare -x PATH="/common/sge/bin/darwin:/bin:/sbin:/usr/bin:/usr/sbin:/common/mpich-1.2.7/ch_p4/bin:/common/bin:/common/sbin:/usr/local/bin:/usr/local/sbin:/usr/X11R6/bin"
declare -x PERL5LIB="/RemotePerl"
declare -x PLPLOT_LIB="/common/lib"
declare -x PWD="/Users/genemine/MPI_Itemsets_1_0_1_try"
declare -x RSHCOMMAND="/usr/bin/ssh"
declare -x SGE_CELL="default"
declare -x SGE_EXECD_PORT="702"
declare -x SGE_QMASTER_PORT="701"
declare -x SGE_ROOT="/common/sge"
declare -x SHELL="/bin/bash"
declare -x SHLVL="1"
declare -x SSH_CLIENT=" 138.47.152.10 2308 22"
declare -x SSH_CONNECTION="138.47.152.10 2308 138.47.102.17 22"
declare -x SSH_TTY="/dev/ttyp0"
declare -x TERM="vt100"
declare -x USER="genemine"
declare -x WISECONFIGDIR="/common/wisecfg"
$ lamboot bhosts -v

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

n-1<2759> ssi:boot:base:linear: booting n0 (portal2net.cluster.private)
mkdir: No space left on device
mkdir: No space left on device
chdir failed!: No such file or directory
-----------------------------------------------------------------------------

The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:

        - There are network filters between the lamboot agent host and
          the remote host such that communication on random TCP ports
          is blocked
        - Network routing from the remote host to the local host isn't
          properly configured (this is uncommon)

You can check these things by watching the output from "lamboot -d".

1. On the command line for hboot, there are two important parameters:
   one is the IP address of where the lamboot agent was invoked, the
   other is the port number that the lamboot agent is expecting the
   newly-booted process to call back on (this will be a random
   integer).

2. Manually login to the remote machine and try to telnet to the port
   indicated on the hboot command line. For example,
       telnet <ipnumber> <portnumber>
   If all goes well, you should get a "Connection refused" error. If
   you get any other kind of error, it could indicate either of the
   two conditions above. Consult with your system/network
   administrator.
-----------------------------------------------------------------------------

n-1<2759> ssi:boot:base:linear: aborted!

lamboot did NOT complete successfully

dmrlbioinformaticsmaster:~ genemine$ telnet 127.0.0.1 52423
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
telnet: Unable to connect to remote host

I have also removed a large number of files from my disk, thinking that
maybe there wasn't enough space on the hard disk, but the problem persists.
Please advise.
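
The "mkdir: No space left on device" lines in the last lamboot run point at a full filesystem, which is not necessarily the one holding the home directory. A minimal sketch for checking free space per directory is given below. It assumes a POSIX df that accepts -P and -k. The function names and the 10 MB threshold are arbitrary and for illustration only, and the assumption that LAM's temporary directory is under /tmp (or wherever TMPDIR points) should be verified against the LAM documentation.

```shell
#!/bin/sh
# Print the free space, in kilobytes, of the filesystem holding directory $1.
# With -P, df prints one header line and one data line; field 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Report free space for each directory given, flagging any below min_kb.
# min_kb is an arbitrary illustrative threshold, not a LAM requirement.
report_space() {
    min_kb=10240
    for dir in "$@"; do
        kb=$(free_kb "$dir")
        if [ "${kb:-0}" -lt "$min_kb" ]; then
            echo "LOW  $dir: ${kb:-0} KB free"
        else
            echo "ok   $dir: $kb KB free"
        fi
    done
}

# Usage: report_space /tmp "${TMPDIR:-/tmp}" "$HOME"
```

Running the same check on each node in bhosts would show whether the full filesystem is on the head node or on a compute node.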

Thanks.