Check to ensure that TCP firewalls are either disabled or allow
arbitrary communication between all your compute nodes.
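
If you want a quick way to check that, here is a minimal sketch in Python (standard library only). The function names `listen_once` and `can_connect` are my own and are not part of LAM. The reasoning: the lamboot output quoted below shows lamd on n1 being handed the origin's address and a port (-H NNN.97.000.52 -P 47227), so it appears that n1 has to reach n0 on a non-fixed TCP port. A firewall that only lets ssh through would block that.

```python
import socket


def listen_once(bind_addr="0.0.0.0", port=0):
    """Open a TCP listening socket and return (socket, port).

    port=0 asks the OS to pick a free ephemeral port, which roughly
    mimics a daemon listening on a port that is not known in advance.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((bind_addr, port))
    srv.listen(1)
    return srv, srv.getsockname()[1]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False
```

For example, call `srv, port = listen_once()` on bluesky2 and print `port`, then call `can_connect("bluesky2.xxx.yy", port)` on bluesky4, and repeat in the other direction. A False result while ssh still works suggests a firewall is in the way.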
You're also using an ancient version of LAM/MPI -- if you're just
starting with MPI, you might want to give Open MPI a shot
(www.open-mpi.org). LAM/MPI is in maintenance mode, but 6.5.9 is so
ancient that I don't know if anyone could answer any reasonable
questions about it. All current and future work is occurring on Open MPI.
On Feb 7, 2009, at 9:35 PM, Abhirup Chakraborty wrote:
> Hi All,
> I wanted to run a test program over two machines using LAM/MPI. I
> got the following error when I ran the 'lamboot' command (from
> machine Bluesky2.xxx.yy). It seems that 'lamboot' fails, at the end,
> while trying to set the return IP address on the other machine
> (i.e., bluesky4.xxx.yy). I used LAM/MPI 6.5.9. The 'recon'
> command reported that the system was okay. Note that 'lamboot' runs
> properly on one machine, but fails when run over
> multiple ones (i.e., when the hostfile fed to the lamboot command
> contains multiple machines).
>
> Could anyone please suggest a solution?
>
> Thanking you
>
> -Abhirup
>
>
> LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
>
> lamboot: boot schema file: machines
> lamboot: opening hostfile machines
> lamboot: found the following hosts:
> lamboot: n0 bluesky2.xxx.yy
> lamboot: n1 bluesky4.xxx.yy
> lamboot: resolved hosts:
> lamboot: n0 bluesky2.xxx.yy --> NNN.97.000.52
> lamboot: n1 bluesky4.xxx.yy --> NNN.97.000.54
> lamboot: found 2 host node(s)
> lamboot: origin node is 0 (bluesky2.xxx.yy)
> Executing hboot on n0 (bluesky2.xxx.yy - 1 CPU)...
> lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H NNN.97.000.52 -P 47227 -n 0 -o 0 ""
> hboot: process schema = "/usr/local/etc/lam-conf.lam"
> hboot: found /usr/local/bin/lamd
> hboot: performing tkill
> hboot: tkill
> hboot: booting...
> hboot: fork /usr/local/bin/lamd
> hboot: attempting to execute
> [1] 12509 lamd -H NNN.97.000.52 -P 47227 -n 0 -o 0 -d
> Executing hboot on n1 (bluesky4.xxx.yy - 1 CPU)...
> lamboot: attempting to execute "ssh -x bluesky4.xxx.yy -n echo $SHELL"
> lamboot: got remote shell /bin/bash
> lamboot: attempting to execute "ssh -x bluesky4.xxx.yy -n hboot -t -c lam-conf.lam -d -v -s -I "-H NNN.97.000.52 -P 47227 -n 1 -o 0 ""
> hboot: process schema = "/usr/local/etc/lam-conf.lam"
> hboot: found /usr/local/bin/lamd
> hboot: performing tkill
> hboot: tkill
> hboot: booting...
> hboot: fork /usr/local/bin/lamd
> [1] 9223 lamd -H NNN.97.000.52 -P 47227 -n 1 -o 0 -d
> -----------------------------------------------------------------------------
> lamboot encountered some error (see above) during the boot process,
> and will now attempt to kill all nodes that it was previously able to
> boot (if any).
>
> Please wait for LAM to finish; if you interrupt this process, you may
> have LAM daemons still running on remote nodes.
> -----------------------------------------------------------------------------
> wipe ...
>
> LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
>
> Executing tkill on n0 (bluesky2.xxx.yy)...
> Executing tkill on n1 (bluesky4.xxx.yy)...
> lamboot did NOT complete successfully
>
> messages from the recon command
> ===============================
>
> recon: opening hostfile machines
> recon: found the following hosts:
> recon: n0 bluesky2.xxx.yy
> recon: n1 bluesky4.xxx.yy
> recon: found addresses for all hosts
> recon: found 2 host node(s)
> recon: origin node is n0 (bluesky2.xxx.yy)
> recon: -- testing n0 (bluesky2.xxx.yy)
> recon: attempting to launch "tkill -N" (local execution)
> recon: launch successful
> recon: -- testing n1 (bluesky4.xxx.yy)
> recon: attempting to launch "tkill -N" (remote execution)
> recon: -b used, assuming same shell on remote nodes
> recon: got local shell /bin/bash
> recon: attempting to execute "ssh -x bluesky4.xxx.yy -n tkill -N"
> recon: launch successful
> -----------------------------------------------------------------------------
> Woo hoo!
>
> recon has completed successfully. This means that you will most likely
> be able to boot LAM successfully with the "lamboot" command (but this
> is not a guarantee). See the lamboot(1) manual page for more
> information on the lamboot command.
>
> If you have problems booting LAM (with lamboot) even though recon
> worked successfully, enable the "-d" option to lamboot to examine each
> step of lamboot and see what fails. Most situations where recon
> succeeds and lamboot fails have to do with the hboot(1) command (that
> lamboot invokes on each host in the hostfile).
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
Jeff Squyres
Cisco Systems