LAM/MPI General User's Mailing List Archives

From: Masakazu HIGAKI (higamasa_at_[hidden])
Date: 2004-05-30 06:21:02


Dear LAM/MPI developers,

I recently tried to upgrade LAM from 7.0.5 to 7.0.6 on my FreeBSD
machines and ran into a problem: "lamboot" fails with
the following messages:

$ lamboot -v -d
n-1<21056> ssi:boot: Opening
n-1<21056> ssi:boot: opening module globus
n-1<21056> ssi:boot: initializing module globus
n-1<21056> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<21056> ssi:boot: module not available: globus
n-1<21056> ssi:boot: opening module rsh
n-1<21056> ssi:boot: initializing module rsh
n-1<21056> ssi:boot:rsh: module initializing
n-1<21056> ssi:boot:rsh:agent: ssh
n-1<21056> ssi:boot:rsh:username: <same>
n-1<21056> ssi:boot:rsh:verbose: 1000
n-1<21056> ssi:boot:rsh:algorithm: linear
n-1<21056> ssi:boot:rsh:priority: 10
n-1<21056> ssi:boot: module available: rsh, priority: 10
n-1<21056> ssi:boot: finalizing module globus
n-1<21056> ssi:boot:globus: finalizing
n-1<21056> ssi:boot: closing module globus
n-1<21056> ssi:boot: Selected boot module rsh
n-1<21056> ssi:boot:base: looking for boot schema in following directories:
n-1<21056> ssi:boot:base: <current directory>
n-1<21056> ssi:boot:base: $TROLLIUSHOME/etc
n-1<21056> ssi:boot:base: $LAMHOME/etc
n-1<21056> ssi:boot:base: /xxx/lam/etc
n-1<21056> ssi:boot:base: looking for boot schema file:
n-1<21056> ssi:boot:base: lam-bhost.def
n-1<21056> ssi:boot:base: found boot schema: /xxx/lam/etc/lam-bhost.def
n-1<21056> ssi:boot:rsh: found the following hosts:
n-1<21056> ssi:boot:rsh: n0 localhost (cpu=1)
-----------------------------------------------------------------------------
The boot SSI rsh module found that your local host is not in the
hostfile "/xxx/lam/etc/lam-bhost.def".

The local host name *must* be in the list of hosts in the hostfile.
In other words, you must boot LAM from a node that will be part of the
universe.

        - If you simply forgot to put the local host in the boot
          schema file, add it and re-run The boot SSI rsh module
        - If you are trying to boot LAM from a node that will not be
          part of the universe, you must login to on of the nodes that
          will be part of the universe (i.e., one of the nodes in the
          hostfiles), and re-run The boot SSI rsh module

Although the local host name is usually the first in the list to avoid
I/O ambiguities, it can actually appear anywhere in the list.
-----------------------------------------------------------------------------

LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University

This problem does not occur with LAM 7.0.5 and earlier.
I stepped through lamboot with a debugger
and found that the function getifaddr() in share/boot/lamnet.c
of version 7.0.6 does not detect any network interfaces, although they exist.
In that function, the return code of ioctl(SIOCGIFCONF) appears to
be used to check whether the memory allocated for config.ifc_req is large enough.
However, even if the allocated memory is smaller than required,
ioctl(SIOCGIFCONF) returns 0 and simply resets
config.ifc_len to 0.
With the provisional patch below applied, lamboot works.

--- share/boot/lamnet.c~ Sun May 2 22:11:03 2004
+++ share/boot/lamnet.c Sat May 29 14:22:18 2004
@@ -199,7 +199,8 @@
             close(sock);
             return LAMERROR;
           }
-          if (ioctl(sock, SIOCGIFCONF, &config) < 0) {
+          if (ioctl(sock, SIOCGIFCONF, &config) < 0
+              || config.ifc_len == 0) {
             if (errno != EINVAL && lastlen != 0) {
               close(sock);
               return LAMERROR;

I would appreciate it if you could fix this in the next release.
Thank you.

---
Masakazu Higaki