> Let me make sure I understand -- are you lambooting with the names that
> correspond to the 10.x addresses, or the names that correspond to the
> 172.x addresses?
I'm trying to boot using the 10.x addresses. But the lamboot process gives the nodes my IP address, and not the one that communicates on the 10.x network.
> LAM should be using only the names (and corresponding resolved addresses)
> that you gave in the boot schema file, and no others. So if you gave
> hostname_for_10 in the boot schema file and LAM ended up using the 172.x,
> then either something is wrong in your setup or we have a bug in LAM.
I think you are resolving my hostname, but my hostname is
assigned to my 172.x network.
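A quick way to check this suspicion is to look at what the machine's own hostname resolves to and compare it with the 10.x network. The following is only a minimal Python sketch, assuming the first IPv4 address returned by the local resolver is the one of interest; the helpers `resolve_ipv4` and `on_network` are hypothetical names for illustration and have nothing to do with LAM itself.

```python
import socket


def resolve_ipv4(name):
    """Return the IPv4 address the local resolver gives for `name`."""
    return socket.gethostbyname(name)


def on_network(address, prefix):
    """True if a dotted-quad address starts with `prefix`, e.g. '10.'."""
    return address.startswith(prefix)


if __name__ == "__main__":
    own_name = socket.gethostname()
    try:
        own_addr = resolve_ipv4(own_name)
    except socket.gaierror as exc:
        print(f"{own_name} does not resolve: {exc}")
    else:
        # If this prints False, the hostname is bound to some other
        # network (for example 172.x) rather than the 10.x one.
        print(own_name, own_addr, on_network(own_addr, "10."))
```

Run on the origin node, this prints the hostname, the address it resolves to, and whether that address sits on the 10.x network.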
[..]
> Also FYI: in 7.1, there is a new SSI parameter called mpi_hostmap that
> allows you to boot on one set of IP addresses and then use an entirely
> different set of IP addresses for MPI communication (which is even
> better than what you're doing here, potentially -- you can lamboot on
> the admin network and automatically have all MPI traffic go on the high
> speed network).
That's exactly what I need. I will wait for this feature.
> Can you send the entire output of "lamboot -d"?
Here it comes....
Thanks for your support,
[erwan_at_server erwan]$ lamboot -b -d
n0<22000> ssi:boot: Opening
n0<22000> ssi:boot: opening module globus
n0<22000> ssi:boot: initializing module globus
n0<22000> ssi:boot:globus: globus-job-run not found, globus boot will not run
n0<22000> ssi:boot: module not available: globus
n0<22000> ssi:boot: opening module rsh
n0<22000> ssi:boot: initializing module rsh
n0<22000> ssi:boot:rsh: module initializing
n0<22000> ssi:boot:rsh:agent: /usr/bin/rsh
n0<22000> ssi:boot:rsh:username: <same>
n0<22000> ssi:boot:rsh:verbose: 1000
n0<22000> ssi:boot:rsh:algorithm: linear
n0<22000> ssi:boot:rsh:priority: 10
n0<22000> ssi:boot: module available: rsh, priority: 10
n0<22000> ssi:boot: finalizing module globus
n0<22000> ssi:boot:globus: finalizing
n0<22000> ssi:boot: closing module globus
n0<22000> ssi:boot: Selected boot module rsh
LAM 7.0.2/MPI 2 C++/ROMIO - Indiana University
n0<22000> ssi:boot:base: looking for boot schema in following directories:
n0<22000> ssi:boot:base: <current directory>
n0<22000> ssi:boot:base: $TROLLIUSHOME/etc
n0<22000> ssi:boot:base: $LAMHOME/etc
n0<22000> ssi:boot:base: /etc/lam
n0<22000> ssi:boot:base: looking for boot schema file:
n0<22000> ssi:boot:base: lam-bhost.def
n0<22000> ssi:boot:base: found boot schema: /etc/lam/lam-bhost.def
n0<22000> ssi:boot:rsh: found the following hosts:
n0<22000> ssi:boot:rsh: n0 compute1.domcomp.com (cpu=1)
n0<22000> ssi:boot:rsh: n1 compute2.domcomp.com (cpu=1)
n0<22000> ssi:boot:rsh: n2 compute3.domcomp.com (cpu=1)
n0<22000> ssi:boot:rsh: n3 compute4.domcomp.com (cpu=1)
n0<22000> ssi:boot:rsh: n4 compute5.domcomp.com (cpu=1)
n0<22000> ssi:boot:rsh: n5 compute7.domcomp.com (cpu=1)
n0<22000> ssi:boot:rsh: n6 compute8.domcomp.com (cpu=1)
n0<22000> ssi:boot:rsh: n7 compute9.domcomp.com (cpu=1)
n0<22000> ssi:boot:rsh: n8 compute6.domcomp.com (cpu=1)
n0<22000> ssi:boot:rsh: n9 server.clic2.mandrakesoft.com (cpu=1)
n0<22000> ssi:boot:rsh: resolved hosts:
n0<22000> ssi:boot:rsh: n0 compute1.domcomp.com --> 10.0.1.1
n0<22000> ssi:boot:rsh: n1 compute2.domcomp.com --> 10.0.1.2
n0<22000> ssi:boot:rsh: n2 compute3.domcomp.com --> 10.0.1.3
n0<22000> ssi:boot:rsh: n3 compute4.domcomp.com --> 10.0.1.4
n0<22000> ssi:boot:rsh: n4 compute5.domcomp.com --> 10.0.1.5
n0<22000> ssi:boot:rsh: n5 compute7.domcomp.com --> 10.0.1.7
n0<22000> ssi:boot:rsh: n6 compute8.domcomp.com --> 10.0.1.8
n0<22000> ssi:boot:rsh: n7 compute9.domcomp.com --> 10.0.1.9
n0<22000> ssi:boot:rsh: n8 compute6.domcomp.com --> 10.0.1.6
n0<22000> ssi:boot:rsh: n9 server.clic2.mandrakesoft.com --> 172.16.1.253 (origin)
n0<22000> ssi:boot:rsh: starting RTE procs
n0<22000> ssi:boot:base:linear: starting
n0<22000> ssi:boot:base:server: opening server TCP socket
n0<22000> ssi:boot:base:server: opened port 44039
n0<22000> ssi:boot:base:linear: booting n0 (compute1.domcomp.com)
n0<22000> ssi:boot:rsh: starting lamd on (compute1.domcomp.com)
n0<22000> ssi:boot:rsh: starting on n0 (compute1.domcomp.com): hboot -t -c lam-conf.lamd -d -s -I "-H 172.16.1.253 -P 44039 -n 0 -o 9"
n0<22000> ssi:boot:rsh: launching remotely
n0<22000> ssi:boot:rsh: -b used, assuming same shell on remote nodes
n0<22000> ssi:boot:rsh: got local shell /bin/bash
n0<22000> ssi:boot:rsh: attempting to execute "/usr/bin/rsh compute1.domcomp.com -n hboot -t -c lam-conf.lamd -d -s -I "-H 172.16.1.253 -P 44039 -n 0 -o 9""
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-erwan_at_[hidden]/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-erwan_at_[hidden]/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-erwan_at_[hidden]/lam-io-socket
tkill: f_kill = "/tmp/lam-erwan_at_[hidden]/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 20238 ...
tkill: killed
tkill: all finished
hboot: performing tkill
hboot: tkill -d
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 21051 lamd -H 172.16.1.253 -P 44039 -n 0 -o 9 -d
n0<22000> ssi:boot:rsh: successfully launched on n0 (compute1.domcomp.com)
n0<22000> ssi:boot:base:server: expecting connection from finite list
n0<22000> ssi:boot:base:server: got connection from 172.16.1.1
n0<22000> ssi:boot:base:server: unexpected connection; dropping
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).
Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
lamboot: wipe -- nothing to do
lamboot did NOT complete successfully
[erwan_at_server erwan]$
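
The last lines of the log before the error say that lamboot was "expecting connection from finite list", got a connection from 172.16.1.1, and dropped it as unexpected. The Python sketch below illustrates that kind of membership check, under the assumption that the expected list is exactly the set of resolved addresses printed in the "resolved hosts" part of the log. It is not LAM's actual code, and `is_expected` is a hypothetical helper.

```python
# Addresses from the "resolved hosts" section of the log above:
# compute1..compute9 on 10.0.1.1..10.0.1.9, plus the origin node.
EXPECTED = {f"10.0.1.{i}" for i in range(1, 10)} | {"172.16.1.253"}


def is_expected(peer, expected=EXPECTED):
    """True if a connecting peer address is in the expected set."""
    return peer in expected


# compute1 as resolved from the boot schema name
print(is_expected("10.0.1.1"))    # True
# the address the connection actually came from
print(is_expected("172.16.1.1"))  # False
```

Under that assumption, compute1 answering from its 172.x interface instead of 10.0.1.1 is enough to make the connection look unexpected.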
--
Erwan Velu
Linux Cluster Distribution Project Manager
MandrakeSoft
43 rue d'aboukir 75002 Paris
Phone Number : +33 (0) 1 40 41 17 94
Fax Number : +33 (0) 1 40 41 92 00
Web site : http://www.mandrakesoft.com
OpenPGP key : http://www.mandrakesecure.net/cks/