LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2003-10-15 05:18:48


On Wed, 15 Oct 2003, Erwan Velu wrote:

> > LAM should be using only the names (and corresponding resolved addresses)
> > that you gave in the boot schema file, and no others. So if you gave
> > hostname_for_10 in the boot schema file and LAM ended up using the 172.x,
> > then either something is wrong in your setup or we have a bug in LAM.
>
> I think you are resolving my hostname, but my hostname is
> assigned to my 172.x network.

From the lamboot -d output that you sent, I see:

-----
> n0<22000> ssi:boot:rsh: found the following hosts:
> n0<22000> ssi:boot:rsh: n0 compute1.domcomp.com (cpu=1)
> n0<22000> ssi:boot:rsh: n1 compute2.domcomp.com (cpu=1)
> n0<22000> ssi:boot:rsh: n2 compute3.domcomp.com (cpu=1)
> n0<22000> ssi:boot:rsh: n3 compute4.domcomp.com (cpu=1)
> n0<22000> ssi:boot:rsh: n4 compute5.domcomp.com (cpu=1)
> n0<22000> ssi:boot:rsh: n5 compute7.domcomp.com (cpu=1)
> n0<22000> ssi:boot:rsh: n6 compute8.domcomp.com (cpu=1)
> n0<22000> ssi:boot:rsh: n7 compute9.domcomp.com (cpu=1)
> n0<22000> ssi:boot:rsh: n8 compute6.domcomp.com (cpu=1)
> n0<22000> ssi:boot:rsh: n9 server.clic2.mandrakesoft.com (cpu=1)
> n0<22000> ssi:boot:rsh: resolved hosts:
> n0<22000> ssi:boot:rsh: n0 compute1.domcomp.com --> 10.0.1.1
> n0<22000> ssi:boot:rsh: n1 compute2.domcomp.com --> 10.0.1.2
> n0<22000> ssi:boot:rsh: n2 compute3.domcomp.com --> 10.0.1.3
> n0<22000> ssi:boot:rsh: n3 compute4.domcomp.com --> 10.0.1.4
> n0<22000> ssi:boot:rsh: n4 compute5.domcomp.com --> 10.0.1.5
> n0<22000> ssi:boot:rsh: n5 compute7.domcomp.com --> 10.0.1.7
> n0<22000> ssi:boot:rsh: n6 compute8.domcomp.com --> 10.0.1.8
> n0<22000> ssi:boot:rsh: n7 compute9.domcomp.com --> 10.0.1.9
> n0<22000> ssi:boot:rsh: n8 compute6.domcomp.com --> 10.0.1.6
> n0<22000> ssi:boot:rsh: n9 server.clic2.mandrakesoft.com --> 172.16.1.253 (origin)
-----

So server.clic2 is being resolved as 172.x. Is there a name that
corresponds to the NIC on the server that is on the 10.x network? What
happens if you use that in your boot schema file? Or is the server only
on the administrative network?
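
If it helps to check this kind of output mechanically, here is a minimal
sketch that scans the "resolved hosts" lines and reports any node whose
address falls outside a given network. The function name hosts_outside and
the 10.0.0.0/8 default are my own assumptions for illustration, not part
of LAM, and the sketch assumes each entry sits on a single line without
the "> " quoting prefix.

```python
import ipaddress
import re

# Matches lines such as:
#   n0<22000> ssi:boot:rsh: n1 compute2.domcomp.com --> 10.0.1.2
RESOLVED = re.compile(
    r"ssi:boot:rsh:\s+(n\d+)\s+(\S+)\s+-->\s+(\d+\.\d+\.\d+\.\d+)"
)


def hosts_outside(log_text, network="10.0.0.0/8"):
    """Return (node, hostname, address) tuples whose address is not in network."""
    net = ipaddress.ip_network(network)
    return [
        (node, host, addr)
        for node, host, addr in RESOLVED.findall(log_text)
        if ipaddress.ip_address(addr) not in net
    ]


# Two lines taken from the lamboot -d output, unwrapped and unquoted.
sample = """\
n0<22000> ssi:boot:rsh: n0 compute1.domcomp.com --> 10.0.1.1
n0<22000> ssi:boot:rsh: n9 server.clic2.mandrakesoft.com --> 172.16.1.253 (origin)
"""

for node, host, addr in hosts_outside(sample):
    print(node, host, addr)
```

Any node it prints is one that resolved to something other than the 10.x
network, which is the case for server.clic2 here.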

Also be aware of a new feature in 7.0.x, the "schedule=no" option in the
boot schema. If you use the following boot schema:

-----
compute1.domcomp.com
compute2.domcomp.com
server.clic2.mandrakesoft.com schedule=no
-----

and then run "mpirun C my_mpi_application", LAM will [by default] skip
scheduling MPI jobs on "server.clic2" when you use the "N" and "C"
nomenclature with mpirun. See section 7.1.2 in the 7.0.2 LAM User's Guide
("Avoiding Running on Specific Nodes") for more details.
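
To make the effect concrete, here is a toy sketch that reads a boot schema
like the one above and lists the hosts that are not marked "schedule=no".
It only models the behavior described in this mail; it is not LAM's
parser, the function name schedulable_hosts is hypothetical, and treating
"#" as a comment marker and the remaining tokens as key=value pairs are
assumptions on my part.

```python
def schedulable_hosts(schema_text):
    """Return hostnames from a boot schema that are not marked schedule=no."""
    hosts = []
    for line in schema_text.splitlines():
        # Assumption: "#" starts a comment; skip blank lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, *options = line.split()
        # Assumption: everything after the hostname is a key=value pair.
        opts = dict(opt.split("=", 1) for opt in options if "=" in opt)
        if opts.get("schedule", "yes").lower() != "no":
            hosts.append(name)
    return hosts


schema = """\
compute1.domcomp.com
compute2.domcomp.com
server.clic2.mandrakesoft.com schedule=no
"""

for host in schedulable_hosts(schema):
    print(host)
```

With the schema above, only the two compute nodes are listed, which
matches what "mpirun C" would skip by default.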

I mention this because I assume you are trying for a similar effect by
putting the server node last in the list, perhaps so that you can do
"mpirun -np 8 my_mpi_application" and the server node will not be used.

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/