Hi all,
I'm new to setting up LAM/MPI, although I have done MPI programming before. I
have a Beowulf cluster with shared-memory nodes. I created a lam_machines
file and invoked the command "lamboot -dv lam_machines". I get the
following output:
-----------------------------------------------------------------------------
[sriramr_at_p02 basic]$ lamboot -dv lam_machines
LAM 6.5.8/MPI 2 C++/ROMIO - Indiana University
lamboot: boot schema file: lam_machines
lamboot: opening hostfile lam_machines
lamboot: found the following hosts:
lamboot: n0 p02.asdl.ae.gatech.edu
lamboot: n1 p03.asdl.ae.gatech.edu
lamboot: n2 p04.asdl.ae.gatech.edu
lamboot: resolved hosts:
lamboot: n0 p02.asdl.ae.gatech.edu --> 172.16.3.102
lamboot: n1 p03.asdl.ae.gatech.edu --> 172.16.3.103
lamboot: n2 p04.asdl.ae.gatech.edu --> 172.16.3.104
lamboot: found 3 host node(s)
lamboot: origin node is 0 (p02.asdl.ae.gatech.edu)
Executing hboot on n0 (p02.asdl.ae.gatech.edu - 1 CPU)...
lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H
172.16.3.102 -P 32850 -n 0 -o 0 ""
hboot: process schema = "/etc/lam/lam-conf.lam"
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
hboot: attempting to execute
[1] 12357 lamd -H 172.16.3.102 -P 32850 -n 0 -o 0 -d
Executing hboot on n1 (p03.asdl.ae.gatech.edu - 1 CPU)...
lamboot: attempting to execute "/usr/bin/ssh -x -a p03.asdl.ae.gatech.edu
-n echo $SHELL"
lamboot: got remote shell /bin/ksh
lamboot: attempting to execute "/usr/bin/ssh -x -a p03.asdl.ae.gatech.edu
-n (. ./.profile; hboot -t -c lam-conf.lam -d -v -s -I "-H 172.16.3.102 -P
32850 -n 1 -o 0 " )"
stty:
-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node "p03.asdl.ae.gatech.edu",
but received some output on the standard error.
LAM tried to use the remote agent command "/usr/bin/ssh"
to invoke "hboot" on the remote node.
This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a list of items that you may
wish to check on the remote node:
- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should
probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are
using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you
are using rsh) for the machine and username that you are
running from
- Your .cshrc/.profile must not print anything out to the
standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment
variable to your default shell
Try invoking the following command at the unix command line:
/usr/bin/ssh -x -a p03.asdl.ae.gatech.edu -n (. ./.profile; hboot
-t -c lam-conf.lam -d -v -s -I "-H 172.16.3.102 -P 32850 -n 1 -o 0 " )
You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.
When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).
Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
wipe ...
LAM 6.5.8/MPI 2 C++/ROMIO - Indiana University
Executing tkill on n0 (p02.asdl.ae.gatech.edu)...
lamboot did NOT complete successfully
-----------------------------------------------------------------------------
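A note on the "stty:" line in the output above: it is written to standard error, which is exactly what lamboot complains about. The following is only a minimal sketch, assuming (unconfirmed) that the message comes from an stty call in $HOME/.profile on p03. The function name setup_terminal and the stty arguments are placeholders, not taken from the real file. The idea is to run stty only when standard input is a terminal, which is not the case for "ssh -n host command".

```shell
#!/bin/sh
# Hypothetical fragment for $HOME/.profile on the remote node.
# "stty erase '^H'" is a placeholder; the real stty call is not known.
setup_terminal() {
    # [ -t 0 ] succeeds only when standard input is a terminal.
    if [ -t 0 ]; then
        stty erase '^H'
    fi
}

# Simulate a non-interactive login: stdin comes from /dev/null, stdout is
# discarded, and anything written to standard error is captured in $err.
err=$(setup_terminal </dev/null 2>&1 >/dev/null)

if [ -z "$err" ]; then
    echo "stderr clean"
else
    echo "stderr not clean: $err"
fi
```

If the guess about .profile is right, a guard like this should leave standard error empty for non-interactive logins while keeping the terminal settings for normal ones.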
From the output above, the root node is attempting to invoke the following
command:
/usr/bin/ssh -x -a p03.asdl.ae.gatech.edu -n (. ./.profile; hboot -t -c
lam-conf.lam -d -v -s -I "-H 172.16.3.102 -P 32850 -n 1 -o 0 " )
I don't know why there are parentheses in the above command. With the
parentheses, I get a "badly placed ()'s" error. So I removed the parentheses
and invoked the command from the root node (p02):
/usr/bin/ssh -x -a p03.asdl.ae.gatech.edu -n . ./.profile; hboot -t -c
lam-conf.lam -d -v -s -I "-H 172.16.3.102 -P 32857 -n 1 -o 0 "
and got the following output, with no errors reported by hboot:
stty: standard input: Invalid argument
hboot: process schema = "/etc/lam/lam-conf.lam"
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 12495 lamd -H 172.16.3.102 -P 32857 -n 1 -o 0 -d
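A note on the quoting, as a sketch rather than a confirmed diagnosis. "Badly placed ()'s" is a csh/tcsh message, so it appears to come from the local shell parsing the unquoted parentheses, not from the remote node. Likewise, an unquoted ";" is a command separator for the local shell, so in the hand-typed command only ". ./.profile" would have been sent to p03, and hboot would have run locally on p02. The example below illustrates this with a stand-in for ssh: fake_ssh and the SIDE variable are hypothetical helpers for illustration only, and printenv is assumed to be available.

```shell
#!/bin/sh
# Stand-in for "ssh host ...": like ssh, it joins its arguments with
# spaces and hands the resulting string to a separate shell.
# SIDE=remote marks commands that run inside that separate shell.
fake_ssh() {
    SIDE=remote sh -c "$*"
}

# Commands run by the local shell see SIDE=local.
SIDE=local
export SIDE

# Unquoted: the local shell splits at ';', so only the first command
# goes through fake_ssh and the second one runs locally.
unquoted=$(fake_ssh printenv SIDE; printenv SIDE)

# Quoted: the whole string, parentheses included, reaches the other shell.
quoted=$(fake_ssh '(printenv SIDE; printenv SIDE)')

printf 'unquoted:\n%s\n' "$unquoted"
printf 'quoted:\n%s\n' "$quoted"
```

Applied to the command above, this would mean wrapping everything after "-n" in single quotes when typing it by hand. lamboot presumably passes the command to ssh directly, without going through a login shell, so the parentheses themselves may not be what makes lamboot fail.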
However, when I do this and then invoke mpirun, the other nodes are
not recognized and I get the following output:
-----------------------------------------------------------------------------
It seems that [at least] one of processes that was started with mpirun
did not invoke MPI_INIT before quitting (it is possible that more than
one process did not invoke MPI_INIT -- mpirun was only notified of the
first one, which was on node n0).
mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------
I have set up the nodes so that I can ssh into any of them without entering
a password. I know there have been many posts about lamboot problems in these
archives, but none of them resolved my problem.
Could someone help me set up LAM/MPI on my cluster?
Thanks
Sriram
-------------------------------------------------------------------------------
Sriram K. Rallabhandi
Graduate Research Assistant Work: 404 385 2789
Aerospace Engineering Res: 404 603 9160
Georgia Inst. of Technology
-------------------------------------------------------------------------------