LAM/MPI General User's Mailing List Archives

From: Neil Storer (Neil.Storer_at_[hidden])
Date: 2004-02-16 06:22:22


Hi,

I tried running a LAM-MPI program on a couple of our IBM Nighthawk2 nodes and
got the error message:

-----------------------------------------------------------------------------
The boot SSI rsh module found that your local host is not in the
hostfile "./hostlist".

The local host name *must* be in the list of hosts in the hostfile.
In other words, you must boot LAM from a node that will be part of the
universe.

         - If you simply forgot to put the local host in the boot
           schema file, add it and re-run The boot SSI rsh module
         - If you are trying to boot LAM from a node that will not be
           part of the universe, you must login to on of the nodes that
           will be part of the universe (i.e., one of the nodes in the
           hostfiles), and re-run The boot SSI rsh module

Although the local host name is usually the first in the list to avoid
I/O ambiguities, it can actually appear anywhere in the list.
-----------------------------------------------------------------------------

However, my local host (hpct0101) WAS in the hostfile.

I had run this test a few minutes earlier on a couple of our p690 nodes (with a
different hostfile) without a problem.

I looked at the LAM source code and realised fairly quickly that the error was
related to the difference in the number of network interfaces on the two types
of node.

A 'netstat -i' showed:

p690
----
Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll
en0   1500  link#2      0.2.55.6a.f8.bf    7446325     0  7583632     0     0
en0   1500  hcn0t       hpct0301-cntrl     7446325     0  7583632     0     0
css0  65504 link#3                        40717028     0 62811241     0     0
css0  65504 hsn0t-a     hpct0301-spsw-a   40717028     0 62811241     0     0
css1  65504 link#4                        40715865     0 62809165     0     0
css1  65504 hsn0t-b     hpct0301-spsw-b   40715865     0 62809165     0     0
ml0   65504 link#5                               0     0 115677967   156     0
ml0   65504 hsn0t       hpct0301                 0     0 115677967   156     0
lo0   16896 link#1                         6144799     0  6145794     0     0
lo0   16896 loopback    loopback           6144799     0  6145794     0     0
lo0   16896 ::1                            6144799     0  6145794     0     0
lo0   16896 136.156.213 hpct-batch         6144799     0  6145794     0     0

NH2
---
Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll
en0   1500  link#2      0.4.ac.ec.b.f      8304956     0  8639010     0     0
en0   1500  hcn0t       hpct0101-cntrl     8304956     0  8639010     0     0
en2   9000  link#3      0.2.55.9a.76.91   160532538     0 51261075     2     0
en2   9000  jumbo       hpct0101-gpn      160532538     0 51261075     2     0
en3   9000  link#4      0.6.29.6b.3f.5d    6473848     0   612278     2     0
en3   9000  hbn         hpct0101-hbn       6473848     0   612278     2     0
en6   9000  link#5      0.6.29.6b.3f.8a   10492358     0 94260540     3     0
en6   9000  hpn         hpct0101-hpn      10492358     0 94260540     3     0
css0  65504 link#6                        28821832     0 31468018     0     0
css0  65504 hsn0t-a     hpct0101-spsw-a   28821832     0 31468018     0     0
css1  65504 link#7                        28820537     0 31466905     0     0
css1  65504 hsn0t-b     hpct0101-spsw-b   28820537     0 31466905     0     0
ml0   65504 link#8                               0     0 55048486    55     0
ml0   65504 hsn0t       hpct0101                 0     0 55048486    55     0
lo0   16896 link#1                         6556592     0  6559174     0     0
lo0   16896 loopback    loopback           6556592     0  6559174     0     0
lo0   16896 ::1                            6556592     0  6559174     0     0
lo0   16896 136.156.229 hpct0101-virtual   6556592     0  6559174     0     0

I edited 'lam-7.0.3/share/boot/lamnet.c', increasing the size of "ifbuf":

from:
	static unsigned long    ifbuf[256];
to:
	static unsigned long    ifbuf[512];

I remade the libraries and it now works OK.
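
For illustration only, here is a minimal sketch of one common way to avoid depending on a fixed buffer size when listing interfaces. It is not the LAM code, and it assumes that the interface list is obtained with the SIOCGIFCONF ioctl, which is a guess based on the "ifbuf" name rather than something confirmed here. The helper name fetch_ifconf, the 1024-byte starting size and the 1 MiB cap are all made up for the example. The idea is to keep doubling the buffer until a larger buffer returns no more data than the previous one, so a node with many interfaces is not silently cut short.

```c
/* glibc hides struct ifconf under strict -std modes unless this is set. */
#define _DEFAULT_SOURCE

#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <net/if.h>

/*
 * Hypothetical helper, not part of LAM: ask the kernel for the interface
 * list via SIOCGIFCONF, doubling the buffer until two successive calls
 * report the same length, so a long list is not cut off by a fixed size.
 * Returns a malloc'd buffer (caller frees) and stores the number of bytes
 * used in *len_out; returns NULL on failure.
 */
static char *fetch_ifconf(int fd, int *len_out)
{
    int size = 1024;    /* starting size in bytes (arbitrary choice) */
    int last_len = -1;  /* length reported by the previous successful call */

    while (size <= (1 << 20)) {  /* arbitrary 1 MiB safety cap */
        struct ifconf ifc;
        char *buf = malloc((size_t)size);

        if (buf == NULL)
            return NULL;

        memset(&ifc, 0, sizeof(ifc));
        ifc.ifc_len = size;
        ifc.ifc_buf = buf;

        if (ioctl(fd, SIOCGIFCONF, &ifc) == 0) {
            if (ifc.ifc_len == last_len) {
                /* A bigger buffer returned nothing extra: list is complete. */
                *len_out = ifc.ifc_len;
                return buf;
            }
            last_len = ifc.ifc_len;
        } else if (errno != EINVAL) {
            /* Some systems use EINVAL to signal a too-small buffer;
               any other errno is treated as a real failure. */
            free(buf);
            return NULL;
        }

        free(buf);
        size *= 2;  /* try again with twice the space */
    }
    return NULL;
}
```

The helper would be called with a datagram socket, for example one from socket(AF_INET, SOCK_DGRAM, 0), and the returned bytes are then walked as struct ifreq records. The record layout differs between systems, and this sketch has not been tried on AIX.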

To save other sites from the same problem:

==================================================================
PERHAPS YOU SHOULD USE THIS VALUE (512) IN THE DEFAULT SOURCE CODE
==================================================================

Regards
	Neil
-- 
+-----------------+---------------------------------+------------------+
| Neil Storer     |    Head: Systems S/W Section    | Operations Dept. |
+-----------------+---------------------------------+------------------+
| ECMWF,          | email: neil.storer_at_[hidden]    |    //=\\  //=\\  |
| Shinfield Park, | Tel:   (+44 118) 9499353        |   //   \\//   \\ |
| Reading,        |        (+44 118) 9499000 x 2353 | ECMWF            |
| Berkshire,      | Fax:   (+44 118) 9869450        | ECMWF            |
| RG2 9AX,        |                                 |   \\   //\\   // |
| UK              | URL:   http://www.ecmwf.int/    |    \\=//  \\=//  |
+--+--------------+---------------------------------+----------------+-+
    | ECMWF is the European Centre for Medium-Range Weather Forecasts |
    +-----------------------------------------------------------------+