Hi,
I tried running a LAM-MPI program on a couple of our IBM Nighthawk2 nodes and
got the error message:
-----------------------------------------------------------------------------
The boot SSI rsh module found that your local host is not in the
hostfile "./hostlist".
The local host name *must* be in the list of hosts in the hostfile.
In other words, you must boot LAM from a node that will be part of the
universe.
- If you simply forgot to put the local host in the boot
schema file, add it and re-run The boot SSI rsh module
- If you are trying to boot LAM from a node that will not be
part of the universe, you must login to on of the nodes that
will be part of the universe (i.e., one of the nodes in the
hostfiles), and re-run The boot SSI rsh module
Although the local host name is usually the first in the list to avoid
I/O ambiguities, it can actually appear anywhere in the list.
-----------------------------------------------------------------------------
However, my local host (hpct0101) WAS in the hostfile.
I had run this test a few minutes earlier on a couple of our p690 nodes (with a
different hostfile) without a problem.
I looked at the LAM source code and realised fairly quickly that the error was
related to the difference in the number of network interfaces on the two types
of node.
A 'netstat -i' showed:
p690
----
Name  Mtu    Network      Address           Ipkts      Ierrs  Opkts      Oerrs  Coll
en0   1500   link#2       0.2.55.6a.f8.bf   7446325    0      7583632    0      0
en0   1500   hcn0t        hpct0301-cntrl    7446325    0      7583632    0      0
css0  65504  link#3                         40717028   0      62811241   0      0
css0  65504  hsn0t-a      hpct0301-spsw-a   40717028   0      62811241   0      0
css1  65504  link#4                         40715865   0      62809165   0      0
css1  65504  hsn0t-b      hpct0301-spsw-b   40715865   0      62809165   0      0
ml0   65504  link#5                         0          0      115677967  156    0
ml0   65504  hsn0t        hpct0301          0          0      115677967  156    0
lo0   16896  link#1                         6144799    0      6145794    0      0
lo0   16896  loopback     loopback          6144799    0      6145794    0      0
lo0   16896  ::1                            6144799    0      6145794    0      0
lo0   16896  136.156.213  hpct-batch        6144799    0      6145794    0      0
NH2
---
Name  Mtu    Network      Address           Ipkts      Ierrs  Opkts      Oerrs  Coll
en0   1500   link#2       0.4.ac.ec.b.f     8304956    0      8639010    0      0
en0   1500   hcn0t        hpct0101-cntrl    8304956    0      8639010    0      0
en2   9000   link#3       0.2.55.9a.76.91   160532538  0      51261075   2      0
en2   9000   jumbo        hpct0101-gpn      160532538  0      51261075   2      0
en3   9000   link#4       0.6.29.6b.3f.5d   6473848    0      612278     2      0
en3   9000   hbn          hpct0101-hbn      6473848    0      612278     2      0
en6   9000   link#5       0.6.29.6b.3f.8a   10492358   0      94260540   3      0
en6   9000   hpn          hpct0101-hpn      10492358   0      94260540   3      0
css0  65504  link#6                         28821832   0      31468018   0      0
css0  65504  hsn0t-a      hpct0101-spsw-a   28821832   0      31468018   0      0
css1  65504  link#7                         28820537   0      31466905   0      0
css1  65504  hsn0t-b      hpct0101-spsw-b   28820537   0      31466905   0      0
ml0   65504  link#8                         0          0      55048486   55     0
ml0   65504  hsn0t        hpct0101          0          0      55048486   55     0
lo0   16896  link#1                         6556592    0      6559174    0      0
lo0   16896  loopback     loopback          6556592    0      6559174    0      0
lo0   16896  ::1                            6556592    0      6559174    0      0
lo0   16896  136.156.229  hpct0101-virtual  6556592    0      6559174    0      0
I edited 'lam-7.0.3/share/boot/lamnet.c' increasing the size of "ifbuf":
from:
static unsigned long ifbuf[256];
to:
static unsigned long ifbuf[512];
I remade the libraries and it now works OK.
To save other sites from the same problem:
==================================================================
PERHAPS YOU SHOULD USE THIS VALUE (512) IN THE DEFAULT SOURCE CODE
==================================================================
Regards
Neil
--
+-----------------+---------------------------------+------------------+
| Neil Storer | Head: Systems S/W Section | Operations Dept. |
+-----------------+---------------------------------+------------------+
| ECMWF, | email: neil.storer_at_[hidden] | //=\\ //=\\ |
| Shinfield Park, | Tel: (+44 118) 9499353 | // \\// \\ |
| Reading, | (+44 118) 9499000 x 2353 | ECMWF |
| Berkshire, | Fax: (+44 118) 9869450 | ECMWF |
| RG2 9AX, | | \\ //\\ // |
| UK | URL: http://www.ecmwf.int/ | \\=// \\=// |
+--+--------------+---------------------------------+----------------+-+
| ECMWF is the European Centre for Medium-Range Weather Forecasts |
+-----------------------------------------------------------------+