
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-01-19 13:20:55


On Jan 19, 2005, at 11:52 AM, redirecting decoy wrote:

> Can anyone tell me what would cause the "lamnodes"
> command to hang on one of the machines in my LAM
> universe?

Generally, this happens when lamnodes is unable to contact the local
lamd. Off the top of my head, this may be because the lamd
unexpectedly died (and left its now-stale unix socket around), or the
lamd is busy and simply became unresponsive (perhaps even during the
execution of the request from lamnodes).
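
If it's the stale-socket case, the leftover socket is usually visible
on disk. A quick check, assuming LAM 7.x's usual per-user session
directory under /tmp (the exact name may differ on your system):

  # a lamd that died uncleanly leaves its session directory
  # (including its unix socket) behind in /tmp
  ls -l /tmp/lam-$USER@`hostname`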

More below.

> I have a total of 19 machines. Three of these machines are servers
> with 2 nics, which I will call S1, S2 and S3. These servers have
> public addresses of 192.168.10.XXX on nic 1. Then, the remaining 16
> machines have 1 nic with private addresses: 10.1.2.xxx for eight,
> and 10.1.3.xxx for the remaining eight machines. The 3 servers are
> the first three machines that are booted into the LAM universe using
> globus: "lamboot -v -x -d -ssi boot globus machines.globus"

Just curious: any reason you're using globus instead of, say, ssh?

> On nic 2 in S1, S2 and S3, the addresses are set up to be 10.1.(1,2,3).1
> respectively. I am using S1 to initially create the lam universe,
> then I use lamgrow from S2 and S3 to add eight machines each into the
> universe. This all seems to work fine. However, I am having a problem
> with S1. I can run the lamnodes command from every machine in my
> universe, except it hangs when I try to run it on S1. I think
> whatever is causing this is causing my lam universe to not function
> properly, as it seems that lamnodes reports that S1 has become an
> invalid node on some of the machines in the lam universe. The
> programs I try to run just hang there without doing anything for a
> long time. Then I am forced to kill them after a while because they
> just don't do anything.

Hmm. This smells like an IP routing or firewall problem -- the other
nodes eventually rule that S1 is an invalid node because they can't
reach it, and with lamboot's "-x" option, they'll just ignore that node
from then on.
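
A quick sanity check from one of the nodes that declares S1 invalid
(the address below is just taken from your description of S1's nic 2;
substitute your real ones):

  # can this node reach S1's private-side address at all?
  ping -c 3 10.1.1.1

  # and which route / interface does the kernel use to get there?
  netstat -rn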

So let's look at what happens, step by step...

- you lamboot from S1 with a globus hostfile including S1, S2, S3
- a local lamd is fork/exec'd on S1
- the lamd opens a socket back to lamboot to report its final position
- S1 executes globusrun to launch "hboot" (and friends) on S2 and S3,
which eventually results in "lamd" being run on S2 and S3
- the lamds on S2 and S3 each open a TCP socket back to lamboot to
report their final positions
- lamboot then opens sockets back to the lamd on each of S1, S2, and S3
reporting the final location of all lamds (i.e., the universe
information)
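
If you want to watch those callback connections actually happen, you
could run something like this on S1 while lamboot executes (I'm
guessing from your iptables rules that S2 and S3 are .101 and .102,
and the interface name is just an example):

  # watch the TCP traffic between S1 and S2/S3 during lamboot
  tcpdump -n -i eth0 'host 192.168.10.101 or host 192.168.10.102'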

Question: from this point, can you run lamnodes successfully on all 3
nodes? I.e., *before* you run lamgrow?
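
For example, on each of S1, S2, and S3 in turn (before any lamgrow),
something like:

  # should print all 3 nodes promptly on every server
  lamnodes

  # tping echoes messages off the lamds; "N" means all nodes
  tping -c 3 N

If lamnodes or tping hangs on S1 even at this stage, the problem is
in the initial boot, not in lamgrow.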

> Note: S1, S2 and S3 all have identical OS configurations. Also, I am
> using LAM 7.0.6.
>
> I know that what I am trying to do works, because it has worked
> before. The only difference now is the addition of S1 to the lam
> universe.
>
> Is it possible that my firewall on the servers could be the cause of
> the problem? In order to get LAM to boot at all I needed to add the
> following to my iptables configuration:

> -A INPUT -m state --state NEW -p tcp -s 192.168.10.100 -j ACCEPT
> -A INPUT -m state --state NEW -p udp -s 192.168.10.100 -j ACCEPT
> -A INPUT -m state --state NEW -p tcp -s 192.168.10.101 -j ACCEPT
> -A INPUT -m state --state NEW -p udp -s 192.168.10.101 -j ACCEPT
> -A INPUT -m state --state NEW -p tcp -s 192.168.10.102 -j ACCEPT
> -A INPUT -m state --state NEW -p udp -s 192.168.10.102 -j ACCEPT

I'm not familiar with the details of iptables configuration -- but the
idea is that you need to allow any port, TCP and UDP, from any nodes
that want to be in a LAM universe together.
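
One thing I notice: your rules only accept traffic from the three
192.168.10.x addresses, but the 10.1.x.x machines also need to reach
the lamds on the servers. An untested sketch in the same style as
your existing rules (adjust the subnets to match your setup):

  -A INPUT -m state --state NEW -p tcp -s 10.1.1.0/24 -j ACCEPT
  -A INPUT -m state --state NEW -p udp -s 10.1.1.0/24 -j ACCEPT
  -A INPUT -m state --state NEW -p tcp -s 10.1.2.0/24 -j ACCEPT
  -A INPUT -m state --state NEW -p udp -s 10.1.2.0/24 -j ACCEPT
  -A INPUT -m state --state NEW -p tcp -s 10.1.3.0/24 -j ACCEPT
  -A INPUT -m state --state NEW -p udp -s 10.1.3.0/24 -j ACCEPT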

> Adding the above allows me to boot the universe with or without globus.
>
> When trying to do: strace lamnodes from S1, I get some output, then
> it hangs while trying to read something...

FYI: the lamnodes command actually just sends a request down to the
local lamd -- the lamd then returns its local routing table, which is
what the lamnodes command prints on stdout. From your strace output,
it *looks* like lamnodes is writing the request down to the lamd, but
then hanging while waiting for the lamd to reply.
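
In strace terms, that pattern typically shows up as a completed
write() followed by a read() that never returns (file descriptor and
sizes here are purely illustrative):

  write(4, "..."..., 72)  = 72   <- request sent to the lamd
  read(4,                        <- hangs here, waiting for the reply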

Can you verify that the lamd is still running? If not, did it dump a
corefile, or write anything into the syslog indicating why it died?
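
For example (syslog location varies by distro; /var/log/messages is
typical):

  # is a lamd process still alive on S1?
  ps -ef | grep lamd

  # did it leave a corefile in the directory it was running from?
  ls -l core*

  # did lamd or the kernel log anything about the death?
  grep -i lamd /var/log/messages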

(I'm wondering if you're seeing a different manifestation of the
lamgrow problem that you found in 7.1.1, or perhaps something
similar...?)

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/