On Thursday 18 September 2003 11:38, Nelson Brito wrote:
>
> >
> > 127.0.0.1 localhost
>
> substitute the previous line for this one
> 127.0.0.1 localhost lilian4
>
> and remove the following:
> > 192.168.1.4 lilian4
>
> regards,
> nelson
Hi,
Thanks, but that doesn't do the trick. recon succeeds now, but lamboot still
fails. This is the output:
#####################################################
{wilde_at_lilian4}% lamboot -d -v lam_hosts
LAM 6.5.1/MPI 2 C++/ROMIO - University of Notre Dame
lamboot: boot schema file: lam_hosts
lamboot: opening hostfile lam_hosts
lamboot: found the following hosts:
lamboot: n0 lilian4
lamboot: n1 lilian3
lamboot: n2 lilian2
lamboot: n3 lilian1
lamboot: found 4 host node(s)
lamboot: origin node is 0 (lilian4)
Executing hboot on n0 (lilian4 - 1 CPU)...
lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H
127.0.0.1 -P 59101 -n 0 -o 0 ""
hboot: process schema =
"/usr/local/powerflow/3.4p2-SuSE-patch/dist/sw/x86_linux/lam/etc/lam-conf.lam"
hboot: found
/usr/local/powerflow/3.4p2-SuSE-patch/dist/sw/x86_linux/lam/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork
/usr/local/powerflow/3.4p2-SuSE-patch/dist/sw/x86_linux/lam/bin/lamd
hboot: attempting to execute
[1] 19100 lamd -H 127.0.0.1 -P 59101 -n 0 -o 0 -d
Executing hboot on n1 (lilian3 - 1 CPU)...
lamboot: attempting to execute "/usr/bin/rsh lilian3 -n echo $SHELL"
lamboot: got remote shell /bin/tcsh
lamboot: attempting to execute "/usr/bin/rsh lilian3 -n hboot -t -c
lam-conf.lam -d -v -s -I "-H 127.0.0.1 -P 59101 -n 1 -o 0 ""
hboot: process schema =
"/usr/local/powerflow/3.4p2-SuSE-patch/dist/sw/x86_linux/lam/etc/lam-conf.lam"
hboot: found
/usr/local/powerflow/3.4p2-SuSE-patch/dist/sw/x86_linux/lam/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork
/usr/local/powerflow/3.4p2-SuSE-patch/dist/sw/x86_linux/lam/bin/lamd
[1] 19746 lamd -H 127.0.0.1 -P 59101 -n 1 -o 0 -d
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).
Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
wipe ...
LAM 6.5.1/MPI 2 C++/ROMIO - University of Notre Dame
Executing tkill on n0 (lilian4)...
Executing tkill on n1 (lilian3)...
lamboot did NOT complete successfully
################################################################
I guess it's pretty much the same as giving 127.0.0.1 in the bhost file.
lamboot tries to start lamd on a remote machine with the home node set to
127.0.0.1, which is wrong there, because on the remote machine 127.0.0.1
refers to that machine itself.
Meanwhile I had a look at the LAM source code. The problem seems to be that the
hostnames given in the bhost file are converted to IP addresses using
/etc/hosts (or some other lookup mechanism). Then the IP address of the local
machine is determined by getifaddr(). The result of getifaddr() is compared to
the IP addresses found before, and one address has to match. Now, on my machine
getifaddr() returns only one address, which is 127.0.0.1. This doesn't match
any of the other addresses. The question is: how can I make getifaddr() return
the IP address of eth0, not lo?
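
To make the matching step concrete, here is a minimal Python sketch of the logic as I understand it. It is not LAM's actual code: the function name find_origin and all addresses except 192.168.1.4 (lilian4) and 127.0.0.1 are made up for illustration.

```python
def find_origin(host_ips, local_ips):
    """Return the index of the first boot-schema host whose resolved
    address is also one of this machine's addresses, or None."""
    local = set(local_ips)
    for index, ip in enumerate(host_ips):
        if ip in local:
            return index
    return None


# Resolved addresses for lilian4 .. lilian1. Only 192.168.1.4 appears in
# this thread; the other three are assumed values.
host_ips = ["192.168.1.4", "192.168.1.3", "192.168.1.2", "192.168.1.1"]

# What my machine seems to report: loopback only, so nothing matches.
print(find_origin(host_ips, ["127.0.0.1"]))

# What I would expect if eth0 were reported as well: n0 matches.
print(find_origin(host_ips, ["127.0.0.1", "192.168.1.4"]))
```

With only the loopback address in the local list, no host in the boot schema can be identified as the origin node, which would be consistent with the failure above.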
andreas
--
________________________________________________
Andreas Wilde
Fraunhofer-Institut fuer Integrierte Schaltungen
Aussenstelle Entwurfsautomatisierung
Zeunerstr. 38
D-01069 Dresden
Tel.: 49 (0) 351 4640 852
Fax : 49 (0) 351 4640 703
E-Mail: Andreas.Wilde_at_[hidden]
________________________________________________