Sorry I didn't join the discussion until now -- had some pressing
deadlines and I hadn't gotten to read this thread.
Tim's description below is exactly correct, so I won't reiterate it.
That should clear up your /etc/hosts issues. :-)
Additionally, if you're having a hostname resolution issue, you might want
to check the contents of your /etc/nsswitch.conf file. The "hosts" line
should probably have the word "files" before "dns" (perhaps before any other
possibilities, but that depends on how your machines are set up). For
example, I have the following line in my /etc/nsswitch.conf file:
-----
hosts: files dns
-----
This means that name resolution will first look in /etc/hosts. If it
can't find the name in there, it will fall back to DNS. If it can't find
the name in DNS, it returns a failure. Again: this is how *my* cluster is
configured -- yours may be different.
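If you want to check the lookup order on several nodes without eyeballing
each file, here is a minimal Python sketch. It pulls the "hosts" line out of
nsswitch.conf-style text and reports whether "files" comes before "dns". The
function names (hosts_sources, files_before_dns) and the sample text are my
own illustrations, not part of LAM or any system tool.

```python
import re


def hosts_sources(nsswitch_text):
    """Return the sources on the "hosts" line, in lookup order."""
    for line in nsswitch_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if line.startswith("hosts:"):
            rest = line[len("hosts:"):]
            # Remove bracketed action items such as [NOTFOUND=return].
            rest = re.sub(r"\[[^\]]*\]", " ", rest)
            return rest.split()
    return []


def files_before_dns(sources):
    """True if "files" is listed and precedes "dns" (or "dns" is absent)."""
    if "files" not in sources:
        return False
    if "dns" not in sources:
        return True
    return sources.index("files") < sources.index("dns")


# Illustrative sample text; replace with the contents of the real file.
sample = "passwd: files\nhosts: files dns\n"
sources = hosts_sources(sample)
print(sources, files_before_dns(sources))
```

To run it against a real node, pass in the contents of the file, e.g.
hosts_sources(open("/etc/nsswitch.conf").read()).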
However -- all that being said -- you should also be able to have a boot
schema file with pure IP addresses and avoid this extra layer of
abstraction (I don't typically recommend this -- it's almost always easier
to use names instead of IP addresses, but if the name resolution is in
question, then manually [temporarily] switching back to IP addresses would
eliminate some variables).
So you should be able to have a boot schema file with the following:
----
192.168.1.1
192.168.1.2
192.168.1.3
192.168.1.4
----
If that works, then it's definitely something screwy with your name
resolution. If that doesn't work, then there is something else wrong and
we should investigate further.
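If you would rather scan the /etc/hosts files for the problem Tim describes
below (a real hostname sharing a line with a loopback address), something
like this Python sketch might help. The function name
misplaced_loopback_names is hypothetical, and treating any name that
contains "localhost" or "loopback" as acceptable is only a heuristic of
mine.

```python
import ipaddress


def misplaced_loopback_names(hosts_text):
    """Return names on loopback lines that are not localhost variants."""
    bad = []
    for line in hosts_text.splitlines():
        # Strip comments, then split into the address and its names.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first field is not a plain IP address; skip the line
        if addr.is_loopback:
            # Heuristic (an assumption): localhost/loopback names are fine.
            bad.extend(name for name in fields[1:]
                       if "localhost" not in name and "loopback" not in name)
    return bad


# Illustrative sample showing the problematic layout.
sample = """127.0.0.1 localhost lilian1
192.168.1.2 lilian2
"""
print(misplaced_loopback_names(sample))  # ['lilian1']
```

An empty list means no real hostname is attached to a loopback address in
that text. A related quick check on a node is
socket.gethostbyname(socket.gethostname()): if that returns a 127.x.x.x
address, the node's own name is resolving to the loopback interface.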
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/
On Thu, 18 Sep 2003, Timothy I Mattox wrote:
> Hello,
> It is unfortunate that recent Linux distributions have been putting your
> hostname on the same line as the "127.0.0.1 localhost" line in /etc/hosts.
> The 127.0.0.1 is a special value, and shouldn't (in most circumstances)
> have anything but variations of localhost and localhost.localdomain
> associated with it. A real hostname, to be useful in a network, i.e.
> with more than one machine, needs to have an externally useful IP
> address. 127.0.0.1 will never leave the box it starts from.
>
> The issue with LAM is that lamboot remotely executes a command (hboot) on
> each node in your cluster, and on the command line it sends the IP address
> (or hostname if you use the -l option) of the node you are starting LAM
> from. hboot uses that address to call home and join the LAM environment.
> If it resolves to 127.0.0.1 it tries to talk to itself, the localhost, and
> fails.
>
> So, in short, edit all your /etc/hosts files to look identical,
> something like this:
>
> 127.0.0.1 localhost localhost.localdomain
> 192.168.1.1 lilian1
> 192.168.1.2 lilian2
> 192.168.1.3 lilian3
> 192.168.1.4 lilian4