Can you upgrade to a later version of LAM, such as 7.1.1? The 6.5.x
series is no longer supported. The 7.1.x series contains a *LOT* more
functionality and is source-compatible with the 6.5.x series. It also
produces much more debugging output for diagnosing lamboot problems
like this (i.e., the "lamboot -d" output is much more verbose).
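
If it helps, the more verbose debug output can be saved to a file so it
is easy to post back to the list. This is only a rough sketch: it
assumes a POSIX sh (the output below shows tcsh as the login shell, so
run it with "sh"), and the log file name lamboot-debug.log is made up
for illustration. The -d and -v flags are the same ones used in the
original report.

```shell
#!/bin/sh
# Sketch: re-run lamboot with debugging enabled and keep the output.
# SCHEMA is the boot schema file from the original report;
# LOG is a hypothetical file name chosen for this example.
SCHEMA=boot_schema_file
LOG=lamboot-debug.log

if command -v lamboot >/dev/null 2>&1; then
    # Capture both stdout and stderr so no diagnostics are lost.
    lamboot -d -v "$SCHEMA" > "$LOG" 2>&1
    STATUS=$?
    echo "lamboot exited with status $STATUS; output saved to $LOG"
else
    # lamboot is not in PATH, e.g. the new install's bin directory
    # has not been added yet.
    STATUS=127
    echo "lamboot not found in PATH; check the LAM installation"
fi
```

A nonzero status together with the saved log should make it clearer
which node the boot stops on.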

On Dec 1, 2004, at 6:47 AM, Susanne Hemker wrote:
> Hi everybody,
> I am trying to start LAM, but it won't finish the boot process. Since
> "recon -v boot_schema_file" worked fine, I changed the boot schema
> file to use IP addresses instead of node names, as recommended in the
> FAQ, but this did not resolve the problem. Can anybody tell me what I
> might have done wrong or need to change in order to get LAM to boot?
>
> Here's the output from the boot attempt:
>
> n65(15)% lamboot -d -v boot_schema_file
>
> LAM 6.5.7/MPI 2 C++/ROMIO - Indiana University
>
> lamboot: boot schema file: boot_schema_file
> lamboot: opening hostfile boot_schema_file
> lamboot: found the following hosts:
> lamboot: n0 192.168.0.65
> lamboot: n1 192.168.0.66
> lamboot: n2 192.168.0.67
> lamboot: n3 192.168.0.68
> lamboot: resolved hosts:
> lamboot: n0 192.168.0.65 --> 192.168.0.65
> lamboot: n1 192.168.0.66 --> 192.168.0.66
> lamboot: n2 192.168.0.67 --> 192.168.0.67
> lamboot: n3 192.168.0.68 --> 192.168.0.68
> lamboot: found 4 host node(s)
> lamboot: origin node is 0 (192.168.0.65)
> Executing hboot on n0 (192.168.0.65 - 1 CPU)...
> lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H
> 192.168.0.65 -P 32807 -n 0 -o 0 ""
> hboot: process schema = "/opt/lam-6.5.7/etc/lam-conf.lam"
> hboot: found /opt/lam-6.5.7/bin/lamd
> hboot: performing tkill
> hboot: tkill
> hboot: booting...
> hboot: fork /opt/lam-6.5.7/bin/lamd
> [1] 1628 lamd -H 192.168.0.65 -P 32807 -n 0 -o 0 -d
> hboot: attempting to execute
> Executing hboot on n1 (192.168.0.66 - 1 CPU)...
> lamboot: attempting to execute "/usr/bin/ssh 192.168.0.66 -n echo
> $SHELL"
> lamboot: got remote shell /bin/tcsh
> lamboot: attempting to execute "/usr/bin/ssh 192.168.0.66 -n hboot -t
> -c lam-conf.lam -d -v -s -I "-H 192.168.0.65 -P 32807 -n 1 -o 0 ""
> hboot: process schema = "/opt/lam-6.5.7/etc/lam-conf.lam"
> hboot: found /opt/lam-6.5.7/bin/lamd
> hboot: performing tkill
> hboot: tkill
> hboot: booting...
> hboot: fork /opt/lam-6.5.7/bin/lamd
> [1] 25996 lamd -H 192.168.0.65 -P 32807 -n 1 -o 0 -d
> -----------------------------------------------------------------------------
> lamboot encountered some error (see above) during the boot process,
> and will now attempt to kill all nodes that it was previously able to
> boot (if any).
>
> Please wait for LAM to finish; if you interrupt this process, you may
> have LAM daemons still running on remote nodes.
> -----------------------------------------------------------------------------
> wipe ...
>
> LAM 6.5.7/MPI 2 C++/ROMIO - Indiana University
>
> Executing tkill on n0 (192.168.0.65)...
> Executing tkill on n1 (192.168.0.66)...
> lamboot did NOT complete successfully
>
>
> Thanks for any suggestions,
>
> Susanne
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/