LAM/MPI General User's Mailing List Archives

From: Andreas Wilde (Andreas.Wilde_at_[hidden])
Date: 2003-09-18 09:09:39


> Hi Andreas,
> I think you can solve the problem just writing the same line in /etc/hosts
> on every machine on your cluster:
>
> 127.0.0.1 localhost lilianX
>
> It seems that you can execute this from lilian4
> /usr/bin/rsh lilian3 -n hboot
>
> so if there's some confusion with the names you can use /etc/hosts
> to solve that.
> But that's just a suggestion...
>
> nelson

Hi,
maybe I'm too stupid to get it, but it doesn't work here. I changed
/etc/hosts on every machine (thank god it's only 4) to

127.0.0.1 localhost lilian1
...
on lilian1,
127.0.0.1 localhost lilian2
on lilian2, and so on.
This makes recon work, but lamboot still fails. There is interesting output
from lamboot:
#########################################################
...
Executing hboot on n1 (lilian2 - 1 CPU)...
lamboot: attempting to execute "/usr/bin/rsh lilian2 -n echo $SHELL"
lamboot: got remote shell /bin/tcsh
lamboot: attempting to execute "/usr/bin/rsh lilian2 -n hboot -t -c
lam-conf.lam -d -v -s -I "-H 127.0.0.1 -P 13813 -n 1 -o 0 ""
hboot: process schema =
"/usr/local/powerflow/3.4p2-SuSE-patch/dist/sw/x86_linux/lam/etc/lam-conf.lam"
hboot: found
/usr/local/powerflow/3.4p2-SuSE-patch/dist/sw/x86_linux/lam/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork
/usr/local/powerflow/3.4p2-SuSE-patch/dist/sw/x86_linux/lam/bin/lamd
[1] 21162 lamd -H 127.0.0.1 -P 13813 -n 1 -o 0 -d
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
##########################################################

Just speculating:
The line

[1] 21162 lamd -H 127.0.0.1 -P 13813 -n 1 -o 0 -d

says that lamd is executed on lilian2 with a given home node of 127.0.0.1,
right? If lamd on lilian2 tries to contact some process via 127.0.0.1, it
lands on lilian2 itself, not on the home node, which is lilian1 (I ran it
from lilian1 for a change). Hence lamboot fails.
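
To make the speculation concrete, here is a small Python sketch (not LAM
code; the function names are my own) that looks up a name in hosts-file text
and reports whether the result is a loopback address. It only assumes the
usual "address name [alias...]" layout of /etc/hosts:

```python
import ipaddress


def resolve_from_hosts(hosts_text, name):
    """Return the first address a hosts-style text maps `name` to, or None."""
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if name in names:
            return address
    return None


def is_loopback(address):
    """True if `address` only ever reaches the local machine."""
    return ipaddress.ip_address(address).is_loopback


# The two /etc/hosts variants tried on lilian1.
aliased = "127.0.0.1 localhost lilian1\n"
separate = "127.0.0.1 localhost\n192.168.1.1 lilian1\n"

for label, text in (("aliased", aliased), ("separate", separate)):
    addr = resolve_from_hosts(text, "lilian1")
    kind = "loopback" if is_loopback(addr) else "non-loopback"
    print(label, addr, kind)
```

With the alias on the 127.0.0.1 line, lilian1 resolves to a loopback address,
which any other node would interpret as itself.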

I tried another thing:
I hacked find_orig() in lamnet.c to return the IP address of eth0 on the home
node and removed the alias from /etc/hosts, so that it read
127.0.0.1 localhost
192.168.1.1 lilian1
Then I started lamboot and, voilà, lamboot succeeds:

#########################################################
....
 /usr/local/powerflow/3.4p2-SuSE-patch/dist/sw/x86_linux/lam/bin/lamd
[1] 20426 lamd -H 192.168.1.1 -P 13860 -n 3 -o 0 -d
topology done
lamboot completed successfully
##########################################################

Note that this time lamd is executed with the option -H 192.168.1.1, which
is lilian1. So I'm pretty sure that the solution to my problem is to make
getifaddr() in lamnet.c return 192.168.1.1 (or 192.168.1.X on the other
nodes).
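
What I have in mind is roughly the following, written as a Python sketch
rather than as a patch to lamnet.c. The function name and the idea of a
candidate list are my own assumptions, not how LAM actually works
internally:

```python
import ipaddress


def pick_origin_address(candidates):
    """Return the first candidate that is not a loopback address, or None.

    `candidates` is a hypothetical list of address strings that the home
    node knows for itself, in the order they were found.
    """
    for text in candidates:
        if not ipaddress.ip_address(text).is_loopback:
            return text
    return None


# On lilian1 the candidates might be the loopback address and eth0.
print(pick_origin_address(["127.0.0.1", "192.168.1.1"]))
```

The point is only that the address handed to the remote lamd via -H should be
one the other nodes can actually reach.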

Am I right? How can I achieve that? Are there other options?

regards,
andreas

-- 
________________________________________________
Andreas Wilde
Fraunhofer-Institut fuer Integrierte Schaltungen
Aussenstelle Entwurfsautomatisierung
Zeunerstr. 38
D-01069 Dresden
Tel.: 49 (0) 351 4640 852
Fax : 49 (0) 351 4640 703
E-Mail: Andreas.Wilde_at_[hidden]
________________________________________________