Dear LAM/MPI developers,
I recently tried to update LAM from 7.0.5 to 7.0.6 on my FreeBSD
machines and ran into a problem: "lamboot" fails with
the following messages:
$ lamboot -v -d
n-1<21056> ssi:boot: Opening
n-1<21056> ssi:boot: opening module globus
n-1<21056> ssi:boot: initializing module globus
n-1<21056> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<21056> ssi:boot: module not available: globus
n-1<21056> ssi:boot: opening module rsh
n-1<21056> ssi:boot: initializing module rsh
n-1<21056> ssi:boot:rsh: module initializing
n-1<21056> ssi:boot:rsh:agent: ssh
n-1<21056> ssi:boot:rsh:username: <same>
n-1<21056> ssi:boot:rsh:verbose: 1000
n-1<21056> ssi:boot:rsh:algorithm: linear
n-1<21056> ssi:boot:rsh:priority: 10
n-1<21056> ssi:boot: module available: rsh, priority: 10
n-1<21056> ssi:boot: finalizing module globus
n-1<21056> ssi:boot:globus: finalizing
n-1<21056> ssi:boot: closing module globus
n-1<21056> ssi:boot: Selected boot module rsh
n-1<21056> ssi:boot:base: looking for boot schema in following directories:
n-1<21056> ssi:boot:base: <current directory>
n-1<21056> ssi:boot:base: $TROLLIUSHOME/etc
n-1<21056> ssi:boot:base: $LAMHOME/etc
n-1<21056> ssi:boot:base: /xxx/lam/etc
n-1<21056> ssi:boot:base: looking for boot schema file:
n-1<21056> ssi:boot:base: lam-bhost.def
n-1<21056> ssi:boot:base: found boot schema: /xxx/lam/etc/lam-bhost.def
n-1<21056> ssi:boot:rsh: found the following hosts:
n-1<21056> ssi:boot:rsh: n0 localhost (cpu=1)
-----------------------------------------------------------------------------
The boot SSI rsh module found that your local host is not in the
hostfile "/xxx/lam/etc/lam-bhost.def".
The local host name *must* be in the list of hosts in the hostfile.
In other words, you must boot LAM from a node that will be part of the
universe.
- If you simply forgot to put the local host in the boot
schema file, add it and re-run The boot SSI rsh module
- If you are trying to boot LAM from a node that will not be
part of the universe, you must login to on of the nodes that
will be part of the universe (i.e., one of the nodes in the
hostfiles), and re-run The boot SSI rsh module
Although the local host name is usually the first in the list to avoid
I/O ambiguities, it can actually appear anywhere in the list.
-----------------------------------------------------------------------------
LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
This problem does not happen with LAM 7.0.5 and below.
I examined the behavior of lamboot with a debugger
and found that the function getifaddr() in share/boot/lamnet.c
of version 7.0.6 does not detect any network interfaces, although they exist.
In that function, the return value of ioctl(SIOCGIFCONF) appears to
be used to check whether the memory allocated for config.ifc_req is large enough.
However, on my machines, even when the allocated memory is smaller than required,
ioctl(SIOCGIFCONF) returns 0 and just resets
config.ifc_len to 0, so the too-small buffer is never noticed.
Applying the provisional patch below makes lamboot work successfully.
--- share/boot/lamnet.c~ Sun May 2 22:11:03 2004
+++ share/boot/lamnet.c Sat May 29 14:22:18 2004
@@ -199,7 +199,8 @@
close(sock);
return LAMERROR;
}
- if (ioctl(sock, SIOCGIFCONF, &config) < 0) {
+ if (ioctl(sock, SIOCGIFCONF, &config) < 0
+ || config.ifc_len == 0) {
if (errno != EINVAL && lastlen != 0) {
close(sock);
return LAMERROR;
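
For reference, here is a minimal standalone sketch of the same idea outside of LAM. It is not the LAM code: the function name fetch_ifconf, the starting buffer size, the growth step and the 1 MiB cap are all my own assumptions, chosen only for illustration. The point it shows is that a successful ioctl(SIOCGIFCONF) with ifc_len == 0 is treated like "buffer too small", so the buffer is grown and the call retried instead of accepting an empty interface list.

```c
#define _DEFAULT_SOURCE   /* glibc only: exposes struct ifconf under strict -std modes */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <net/if.h>

/* Hypothetical helper (not part of LAM): fetch the SIOCGIFCONF result
 * into a malloc'ed buffer. The buffer is grown until two consecutive
 * calls report the same non-zero length. On success the buffer is
 * returned (caller frees it) and *out_len holds the number of valid
 * bytes; on failure NULL is returned and *out_len is left untouched. */
char *fetch_ifconf(int sock, int *out_len)
{
    const int max_len = 1 << 20;                 /* assumed safety cap */
    int len = 4 * (int) sizeof(struct ifreq);    /* assumed small start */
    int lastlen = 0;

    assert(out_len != NULL);

    while (len <= max_len) {
        struct ifconf config;
        char *buf = malloc((size_t) len);

        if (buf == NULL)
            return NULL;
        config.ifc_len = len;
        config.ifc_buf = buf;

        if (ioctl(sock, SIOCGIFCONF, &config) < 0) {
            /* EINVAL may mean "buffer too small"; anything else is fatal */
            if (errno != EINVAL) {
                free(buf);
                return NULL;
            }
        } else if (config.ifc_len > 0) {
            /* Same non-zero length twice in a row: the list is complete */
            if (config.ifc_len == lastlen) {
                *out_len = config.ifc_len;
                return buf;
            }
            lastlen = config.ifc_len;
        }
        /* ifc_len == 0 falls through to here: grow the buffer and retry */
        free(buf);
        len += 10 * (int) sizeof(struct ifreq);
    }
    return NULL;    /* cap reached without a stable non-empty answer */
}
```

A caller would open a datagram socket with socket(AF_INET, SOCK_DGRAM, 0), pass it to fetch_ifconf(), walk the returned buffer as a sequence of struct ifreq entries, and free() it afterwards. A host that really has no configured interfaces makes this sketch run up to the cap and return NULL, which seemed acceptable to me for an illustration.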
I would appreciate it if you could fix this in the next release.
Thank you.
---
Masakazu Higaki