It sounds like LAM is not installed properly on one or both machines.
Have you tried upgrading to Open MPI? LAM/MPI is barely maintained
anymore -- Open MPI is where all active development work is occurring.
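
If you want to stay on LAM for now, one rough way to compare the two
installations is to run the same kind of non-interactive ssh command that
lamboot uses (the -x -a -n flags are copied from your output) and ask each
node where hboot and lamd live and which RPM is installed. This is only a
sketch: the package name "lam" is an assumption based on the RPM file name,
and the helper name check_cmd is made up for illustration. By default it
only prints the commands; set RUN=1 to actually execute them.

```shell
#!/bin/sh
# Hosts taken from the boot schema shown in the quoted lamboot output.
HOSTS="192.168.10.130 192.168.10.129"

# Hypothetical helper: build a non-interactive ssh command in the same
# style lamboot uses, asking for the LAM binaries and the installed RPM.
# The package name "lam" is an assumption based on lam-6.5.9-1.i386.rpm.
check_cmd() {
    echo "/usr/bin/ssh -x -a $1 -n 'which hboot lamd; rpm -q lam'"
}

for host in $HOSTS; do
    cmd=$(check_cmd "$host")
    if [ "${RUN:-0}" = "1" ]; then
        echo "== $host =="
        eval "$cmd"          # run the check on that node
    else
        echo "$cmd"          # dry run: only show what would be executed
    fi
done
```

If the two nodes report different paths or versions, or one of them cannot
find hboot or lamd at all, that would be consistent with an installation
problem on that node.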
On Apr 30, 2009, at 6:16 AM, Mahesh Salunkhe wrote:
> Hello !
> I've installed lam-6.5.9-1.i386.rpm on my cluster of two machines:
> 192.168.10.130
> 192.168.10.129
> (Red Hat Enterprise Linux 3 is installed on one machine and
> Red Hat Enterprise Linux 4 on the other)
> recon runs successfully, but lamboot fails. Here is the output of
> the command lamboot -d:
>
> [mss_at_mss ~]$ lamboot -d
>
> LAM 6.5.9/MPI 2 C++ - Indiana University
>
> lamboot: boot schema file: /etc/lam/lam-bhost.def
> lamboot: opening hostfile /etc/lam/lam-bhost.def
> lamboot: found the following hosts:
> lamboot: n0 192.168.10.130
> lamboot: n1 192.168.10.129
> lamboot: resolved hosts:
> lamboot: n0 192.168.10.130 --> 192.168.10.130
> lamboot: n1 192.168.10.129 --> 192.168.10.129
> lamboot: found 2 host node(s)
> lamboot: origin node is 0 (192.168.10.130)
> lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -I " -H
> 192.168.10.130 -P 33130 -n 0 -o 0 ""
> hboot: process schema = "/etc/lam/lam-conf.lam"
> hboot: found /usr/bin/lamd
> hboot: performing tkill
> hboot: tkill
> hboot: booting...
> hboot: fork /usr/bin/lamd
> hboot: attempting to execute
> [1] 4832 lamd -H 192.168.10.130 -P 33130 -n 0 -o 0 -d
> lamboot: attempting to execute "/usr/bin/ssh -x -a 192.168.10.129 -n
> echo $SHELL"
> lamboot: got remote shell /bin/bash
> lamboot: attempting to execute "/usr/bin/ssh -x -a 192.168.10.129 -n
> hboot -t -c lam-conf.lam -d -s -I "-H 192.168.10.130 -P 33130 -n 1 -o 0 ""
> hboot: process schema = "/etc/lam/lam-conf.lam"
> hboot: found /usr/bin/lamd
> hboot: performing tkill
> hboot: tkill
> hboot: booting...
> hboot: fork /usr/bin/lamd
> [1] 4858 lamd -H 192.168.10.130 -P 33130 -n 1 -o 0 -d
> -----------------------------------------------------------------------------
> lamboot encountered some error (see above) during the boot process,
> and will now attempt to kill all nodes that it was previously able to
> boot (if any).
>
> Please wait for LAM to finish; if you interrupt this process, you may
> have LAM daemons still running on remote nodes.
> -----------------------------------------------------------------------------
> wipe ...
>
> LAM 6.5.9/MPI 2 C++ - Indiana University
>
> Executing tkill on n0 (192.168.10.130)...
> Executing tkill on n1 (192.168.10.129)...
> lamboot did NOT complete successfully
>
>
> Could you please tell me what the error is?
>
> The problem seems to arise when hboot is called on the remote
> machine.
> I tried running the hboot command directly on the remote machine. The
> error it gives is:
> kernel not found
>
> which is the first command in the /etc/lam/lam-conf.otb
>
>
> --
> Regards
> Mahesh
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
Jeff Squyres
Cisco Systems