First, I have to ask: is there any way that you can upgrade? 6.3.x is
a VERY old version of LAM. We're up into the 7.0.x series these days.
:-)

It's been several years since I've worked on the 6.3 series, so I can
only hazard a guess here:
1. I'm not sure why you had to change the hostfile order -- I note that
emag6 wasn't listed in your /etc/hosts snippet. Also note that emag5
was listed twice -- once for 127.0.0.1 and once for a 192.168 address.
I'm not sure why it wouldn't find the 192.168 address when you used the
full *.com address; that lookup shouldn't fail. This sounds like a
network/OS error, but if it's actually a bug in LAM, you'll need to
upgrade to get it fixed. Honestly, I don't remember whether that is
something we fixed since 6.3.1. Sorry... :-\
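
To make the /etc/hosts observations concrete, here is a minimal Python sketch. It is not part of LAM, and the names parse_hosts and audit are invented for illustration. It takes the three lines quoted from your /etc/hosts and reports which hosts from the machines file are missing, and which names appear on both a loopback line and a real-address line.

```python
# Hypothetical helper, not part of LAM: audit an /etc/hosts-style text.
HOSTS_TEXT = """\
127.0.0.1 emag5 localhost.localdomain localhost
192.168.2.254 emag7.cluster.xxx.xxx.com emag7
192.168.2.12 emag5.cluster.xxx.xxx.com emag5
"""


def parse_hosts(text):
    """Map each hostname or alias to the list of addresses it appears under."""
    names = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        addr, *aliases = line.split()
        for alias in aliases:
            names.setdefault(alias, []).append(addr)
    return names


def audit(text, wanted):
    """Return (hosts missing from the file, names on loopback AND a real address)."""
    names = parse_hosts(text)
    missing = [h for h in wanted if h not in names]
    shadowed = sorted(
        n for n, addrs in names.items()
        if any(a.startswith("127.") for a in addrs)
        and any(not a.startswith("127.") for a in addrs)
    )
    return missing, shadowed


if __name__ == "__main__":
    missing, shadowed = audit(HOSTS_TEXT, ["emag6", "emag5", "emag7"])
    print("missing from hosts file:", missing)
    print("on loopback and a real address:", shadowed)
```

With the snippet as posted, this flags emag6 as missing and emag5 as the name listed under both 127.0.0.1 and 192.168.2.12. Whether either of those actually causes the lamboot failure is still a guess.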
2. I believe that you copied the wrong file into conf.lam -- it
shouldn't have "localhost" in it, IIRC. We've changed the name of that
file in the 7.x series, but I'm pretty sure that that file is supposed
to have a line with "lamd ..." in it. That is, it should be specifying
what processes to start on the host, not which hosts to start on.
Check your original LAM tarball for the right file.
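
A rough way to tell the two kinds of file apart, sketched in Python. It assumes only what is said above: a process schema should have a line starting with "lamd", while a host file is just a list of hostnames. The function name looks_like_process_schema is made up, and the "lamd ..." sample is a placeholder, not the real contents; the file in the original tarball is the authority.

```python
def looks_like_process_schema(text):
    """Return True if any non-comment line starts with the word 'lamd'."""
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if fields and fields[0] == "lamd":
            return True
    return False


# What the copied conf.lam reportedly contains: a host list, not a process schema.
copied_conf = "localhost\n"
# Placeholder only: the real arguments after "lamd" are not reproduced here.
placeholder_schema = "lamd ...\n"

print(looks_like_process_schema(copied_conf))         # False
print(looks_like_process_schema(placeholder_schema))  # True
```

A file that fails this check is probably a host list (like bhost.def) and not what hboot expects as its process schema.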
On Sep 14, 2004, at 4:55 PM, Kevin Kuo wrote:
> Hi All,
>
> I'm having some problems booting LAM on an older cluster. Here is what
> I did:
>
> $ lamboot -v ~/machines
>
> LAM 6.3.1/MPI 2 C++/ROMIO - University of Notre Dame
>
> Executing hboot on n0 (emag6)...
> -----------------------------------------------------------------------------
> LAM was trying to determine the your shell on the "emag6".
> However, LAM did not receive any valid output.
>
> LAM tried to use the remote agent command "/usr/bin/rsh"
> to invoke "echo $SHELL" on the remote node.
>
> This is an unusaual error -- it does not typically indicate a
> permissions problem. But it can sometimes indicate latent (or
> "silent") errors in your $HOME/.cshrc, $HOME/.login, or $HOME/.profile
> file.
>
> Try invoking the following command at the unix command line:
>
> /usr/bin/rsh emag6 -n echo $SHELL
>
> You will need to configure your local setup such that you will *not*
> be prompted for a password to invoke this command on the remote node.
> No output should be printed from the remote node before the output of
> the command is displayed.
>
> When you can get this command to execute successfully by hand, LAM
> will probably be able to function properly.
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> lamboot encountered some error (see above) during the boot process,
> and will now attempt to kill all nodes that it was previously able to
> boot (if any).
>
> Please wait for LAM to finish; if you interrupt this process, you may
> have LAM daemons still running on remote nodes.
> -----------------------------------------------------------------------------
>
>
> so I tried what the error message is saying:
>
> $ /usr/bin/rsh emag6 -n echo $SHELL
> /bin/bash
> $
>
> Then the machines file says:
>
> The hostname should
> # be the same as the result from the command "hostname"
> emag6
> emag5
> emag7
>
> so when I do `hostname` on emag6, I get:
> emag6.cluster.xxx.xxx.com
>
> emag6$ ping emag5.cluster.xxx.xxx.com
> times out....
>
> So I look at my /etc/hosts:
> 127.0.0.1 emag5 localhost.localdomain localhost
> 192.168.2.254 emag7.cluster.xxx.xxx.com emag7
> 192.168.2.12 emag5.cluster.xxx.xxx.com emag5
>
> So it looks like emag5 and emag5.cluster.xxx.xxx.com should be flipped,
> yes? But it's somewhat strange that whatever Unix system call lamboot
> makes couldn't resolve network aliases?
>
> So then I switched the order of the node listings by putting emag7 at
> the top:
>
> emag7
> emag6
> emag5
>
> and now this is what lamboot gives me:
>
> $ lamboot -d ~/machines
>
> LAM 6.3.1/MPI 2 C++/ROMIO - University of Notre Dame
>
> lamboot: boot schema file: /export/homes/donghoon/machines
> lamboot: opening hostfile /export/homes/donghoon/machines
> lamboot: found the following hosts:
> lamboot: n0 emag7
> lamboot: n1 emag6
> lamboot: n2 emag5
> lamboot: found 3 host node(s)
> lamboot: origin node is 0 (emag7)
> lamboot: attempting to execute "hboot -t -c conf.lam -d -I " -H 127.0.0.1 -P 1066 -n 0 -o 0 ""
> hboot: process schema = "conf.lam"
>
> and hangs...
>
> conf.lam was copied from /usr/boot/bhost.def and contains just
> "localhost"
>
>
> Sorry for the very long post, and thanks for your patience. I would
> appreciate any help you can provide.
>
> Truly,
>
> Kevin Kuo
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/