LAM/MPI General User's Mailing List Archives

From: Abhirup Chakraborty (abhirupc91_at_[hidden])
Date: 2009-02-08 00:35:12


Hi All,
I wanted to run a test program across two machines using LAM/MPI. I got the
following error when I ran the 'lamboot' command (from machine
bluesky2.xxx.yy). It seems that 'lamboot' fails at the very end, while trying
to set the return IP address on the other machine (i.e., bluesky4.xxx.yy). I
am using LAM/MPI 6.5.9. The 'recon' command reported no problems. Note
that 'lamboot' runs properly on a single machine, but fails when run over
multiple machines (i.e., when the hostfile passed to the lamboot command
lists more than one machine).

Could anyone please suggest a solution?

Thanking you

-Abhirup

LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University

lamboot: boot schema file: machines
lamboot: opening hostfile machines
lamboot: found the following hosts:
lamboot: n0 bluesky2.xxx.yy
lamboot: n1 bluesky4.xxx.yy
lamboot: resolved hosts:
lamboot: n0 bluesky2.xxx.yy --> NNN.97.000.52
lamboot: n1 bluesky4.xxx.yy --> NNN.97.000.54
lamboot: found 2 host node(s)
lamboot: origin node is 0 (bluesky2.xxx.yy)
Executing hboot on n0 (bluesky2.xxx.yy - 1 CPU)...
lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H
NNN.97.000.52 -P 47227 -n 0 -o 0 ""
hboot: process schema = "/usr/local/etc/lam-conf.lam"
hboot: found /usr/local/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/local/bin/lamd
hboot: attempting to execute
[1] 12509 lamd -H NNN.97.000.52 -P 47227 -n 0 -o 0 -d
Executing hboot on n1 (bluesky4.xxx.yy - 1 CPU)...
lamboot: attempting to execute "ssh -x bluesky4.xxx.yy -n echo $SHELL"
lamboot: got remote shell /bin/bash
lamboot: attempting to execute "ssh -x bluesky4.xxx.yy -n hboot -t -c
lam-conf.lam -d -v -s -I "-H NNN.97.000.52 -P 47227 -n 1 -o 0 ""
hboot: process schema = "/usr/local/etc/lam-conf.lam"
hboot: found /usr/local/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/local/bin/lamd
[1] 9223 lamd -H NNN.97.000.52 -P 47227 -n 1 -o 0 -d
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
wipe ...

LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University

Executing tkill on n0 (bluesky2.xxx.yy)...
Executing tkill on n1 (bluesky4.xxx.yy)...
lamboot did NOT complete successfully
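
The debug output above shows both daemons being started with the same `-H` address and `-P` port. Reading these as the address and port each lamd uses to report back to lamboot on the origin node is an assumption here, not something the log states. A minimal Python sketch for pulling those values out of a saved `lamboot -d` log, so they can be compared across nodes, might look like this (the names `LAMD_LINE` and `lamd_invocations` are hypothetical):

```python
import re

# Matches the daemon command lines printed during boot, e.g.
#   [1] 9223 lamd -H NNN.97.000.52 -P 47227 -n 1 -o 0 -d
# The meaning of -H/-P (callback address and port) is assumed, see above.
LAMD_LINE = re.compile(
    r"lamd\s+-H\s+(?P<host>\S+)\s+-P\s+(?P<port>\d+)"
    r"\s+-n\s+(?P<node>\d+)\s+-o\s+(?P<origin>\d+)"
)

def lamd_invocations(log_text):
    """Return one dict per lamd command line found in the log text."""
    found = []
    for line in log_text.splitlines():
        m = LAMD_LINE.search(line)
        if m:
            found.append({
                "host": m.group("host"),
                "port": int(m.group("port")),
                "node": int(m.group("node")),
                "origin": int(m.group("origin")),
            })
    return found

# The two lamd lines copied from the log in this post.
sample = """\
[1] 12509 lamd -H NNN.97.000.52 -P 47227 -n 0 -o 0 -d
[1] 9223 lamd -H NNN.97.000.52 -P 47227 -n 1 -o 0 -d
"""

for entry in lamd_invocations(sample):
    print(entry)
```

If the extracted host were a loopback address or a name that the remote machine cannot route to, that would be worth checking first; the masked addresses in this log do not allow that to be verified.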

Messages from the recon command
===============================

recon: opening hostfile machines
recon: found the following hosts:
recon: n0 bluesky2.xxx.yy
recon: n1 bluesky4.xxx.yy
recon: found addresses for all hosts
recon: found 2 host node(s)
recon: origin node is n0 (bluesky2.xxx.yy)
recon: -- testing n0 (bluesky2.xxx.yy)
recon: attempting to launch "tkill -N" (local execution)
recon: launch successful
recon: -- testing n1 (bluesky4.xxx.yy)
recon: attempting to launch "tkill -N" (remote execution)
recon: -b used, assuming same shell on remote nodes
recon: got local shell /bin/bash
recon: attempting to execute "ssh -x bluesky4.xxx.yy -n tkill -N"
recon: launch successful
-----------------------------------------------------------------------------
Woo hoo!

recon has completed successfully. This means that you will most likely
be able to boot LAM successfully with the "lamboot" command (but this
is not a guarantee). See the lamboot(1) manual page for more
information on the lamboot command.

If you have problems booting LAM (with lamboot) even though recon
worked successfully, enable the "-d" option to lamboot to examine each
step of lamboot and see what fails. Most situations where recon
succeeds and lamboot fails have to do with the hboot(1) command (that
lamboot invokes on each host in the hostfile).
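
recon only checks that a command can be launched on each host over ssh, so it says nothing about whether the remote machine can open a TCP connection back to the origin node. One possible cause of the failure, which the output in this post does not confirm, is that such a connection from bluesky4 to bluesky2 is blocked (for example by a firewall). A minimal Python sketch for testing this by hand follows; the function names `open_listener` and `can_connect` are hypothetical, and the port is whatever free port you choose, not one taken from LAM:

```python
import socket

def open_listener(port, bind_addr="0.0.0.0"):
    """Open a TCP listening socket; run this on the origin machine."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((bind_addr, port))
    srv.listen(1)
    return srv

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) opens in time.

    Run this on the remote machine, pointing at the origin's address.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

To use it, call `open_listener(port)` in a Python session on bluesky2 with a port of your choice, then call `can_connect` with bluesky2's address and that port from a session on bluesky4. A `False` result would point at the network path rather than at LAM itself; a `True` result only rules out that one port.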