LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2003-09-18 13:49:37


Sorry I didn't join the discussion until now -- had some pressing
deadlines and I hadn't gotten to read this thread.

Tim's description below is exactly correct, so I won't reiterate it.
That should clear up your /etc/hosts issues. :-)

Additionally, if you're having a hostname resolution issue, you might want
to check the contents of your /etc/nsswitch.conf file. The "hosts" line
should probably have the word "files" before "dns" (perhaps before any other
possibilities, but that depends on how your machines are set up). For
example, I have the following line in my /etc/nsswitch.conf file:

-----
hosts: files dns
-----

This means that name resolutions will first look in /etc/hosts. If it
can't find the name in there, it will fall back to DNS. If it can't find
the name in DNS, it returns a failure. Again: this is how *my* cluster is
configured -- yours may be different.
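
If you'd rather check the lookup order with a script than by eye, here is
a minimal Python sketch. The function names (hosts_sources,
files_before_dns) are made up for illustration, and the parsing is
deliberately simple: it assumes one "hosts:" line and skips bracketed
action items such as [NOTFOUND=return] without interpreting them.

```python
def hosts_sources(nsswitch_text):
    """Return the lookup sources on the "hosts:" line, in order."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith("hosts:"):
            tokens = line[len("hosts:"):].split()
            # Skip simple bracketed action items like [NOTFOUND=return].
            return [tok for tok in tokens if not tok.startswith("[")]
    return []


def files_before_dns(sources):
    """True if "files" is listed and comes before "dns" (or "dns" is absent)."""
    if "files" not in sources:
        return False
    if "dns" not in sources:
        return True
    return sources.index("files") < sources.index("dns")


if __name__ == "__main__":
    # Inline sample; on a real node, read /etc/nsswitch.conf instead.
    sample = "passwd: files\nhosts: files dns\n"
    sources = hosts_sources(sample)
    print(sources, files_before_dns(sources))
```

On a real node you would pass in the text of /etc/nsswitch.conf. Whether
"files" needs to be first still depends on how your machines are
configured.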

However -- all that being said -- you should also be able to have a boot
schema file with pure IP addresses and avoid this extra layer of
abstraction (I don't typically recommend this -- it's almost always easier
to use names instead of IP addresses, but if the name resolution is in
question, then manually [temporarily] switching back to IP addresses would
eliminate some variables).

So you should be able to have a boot schema file with the following:

-----
192.168.1.1
192.168.1.2
192.168.1.3
192.168.1.4
-----

If that works, then it's definitely something screwy with your name
resolution.  If that doesn't work, then there is something else wrong and
we should investigate further.
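
To see what a name actually resolves to on a given node, something like
the following Python sketch may help. It uses only the standard socket
and ipaddress modules. The helper names (is_loopback, check_name) are
hypothetical, and it only looks at the first IPv4 address the system
resolver returns.

```python
import ipaddress
import socket


def is_loopback(address):
    """True if the IP address string is a loopback address (e.g. 127.0.0.1)."""
    return ipaddress.ip_address(address).is_loopback


def check_name(name):
    """Resolve name through the system resolver and describe the result."""
    try:
        address = socket.gethostbyname(name)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return f"{name}: does not resolve ({exc})"
    if is_loopback(address):
        return f"{name}: resolves to loopback address {address}"
    return f"{name}: resolves to {address}"
```

For example, calling check_name(socket.gethostname()) on the node where
you run lamboot shows whether that node's own name maps to a loopback
address, which other nodes cannot use to reach it.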

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/

On Thu, 18 Sep 2003, Timothy I Mattox wrote:
> Hello,
> It is unfortunate that recent Linux distributions have been putting your
> hostname on the same line as the "127.0.0.1 localhost" line in /etc/hosts.
> The 127.0.0.1 is a special value, and shouldn't (in most circumstances)
> have anything but variations of localhost and localhost.localdomain
> associated with it.  A real hostname, to be useful in a network, i.e.
> with more than one machine, needs to have an externally useful IP
> address.  127.0.0.1 will never leave the box it starts from.
>
> The issue with LAM is that lamboot remotely executes a command (hboot) on
> each node in your cluster, and on the command line it sends the IP address
> (or hostname if you use the -l option) of the node you are starting LAM
> from.  hboot uses that address to call home and join the LAM environment.
> If it resolves to 127.0.0.1 it tries to talk to itself, the localhost, and
> fails.
>
> So, in short, edit all your /etc/hosts files to look identical,
> something like this:
>
> 127.0.0.1           localhost  localhost.localdomain
> 192.168.1.1         lilian1
> 192.168.1.2         lilian2
> 192.168.1.3         lilian3
> 192.168.1.4         lilian4
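
A rough way to spot the problem Tim describes is to scan each node's
/etc/hosts for real hostnames sitting on a loopback line. The Python
sketch below is only an illustration: the function names are made up, it
looks at IPv4 entries only, and it assumes that any name whose first
label starts with "localhost" counts as a "variation of localhost".

```python
import ipaddress


def is_local_name(name):
    """Assumption: names like localhost or localhost.localdomain are fine."""
    return name.split(".")[0].startswith("localhost")


def loopback_conflicts(hosts_text):
    """Return (address, name) pairs where a real hostname is on an IPv4 loopback line."""
    conflicts = []
    for raw in hosts_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) < 2:
            continue
        address, names = fields[0], fields[1:]
        try:
            loopback = ipaddress.IPv4Address(address).is_loopback
        except ValueError:
            continue  # IPv6 or malformed entry; not checked in this sketch
        if loopback:
            conflicts.extend((address, n) for n in names if not is_local_name(n))
    return conflicts


if __name__ == "__main__":
    sample = "127.0.0.1  lilian1 localhost localhost.localdomain\n"
    print(loopback_conflicts(sample))
```

An empty result for the contents of /etc/hosts on every node suggests the
files are at least free of this particular problem. It does not check
that the files are identical across nodes.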