Hello Jeff Squyres,
Thank you for your quick response. That is really odd! I spent some
time looking into the problem.
(1) You are right about the configuration without "memory-manager";
(2) There is no firewall software running;
(3) Instead of using the multiprocessor machine, I installed LAM/MPI
on a single-processor machine with the same Itanium processor. I then
ran lamboot with only the Itanium station in the hostfile (with two
workstations the result is the same). It produces the same error
message as before, as you can see in the following output:
================================================================
Output on the screen:
biscaye 173 : lamboot -v -d -ssi boot rsh hostfile
n-1<28363> ssi:boot:open: opening
n-1<28363> ssi:boot:open: looking for boot module named rsh
n-1<28363> ssi:boot:open: opening boot module rsh
n-1<28363> ssi:boot:open: opened boot module rsh
n-1<28363> ssi:boot:select: initializing boot module rsh
n-1<28363> ssi:boot:rsh: module initializing
n-1<28363> ssi:boot:rsh:agent: /usr/bin/remsh
n-1<28363> ssi:boot:rsh:username: <same>
n-1<28363> ssi:boot:rsh:verbose: 1000
n-1<28363> ssi:boot:rsh:algorithm: linear
n-1<28363> ssi:boot:rsh:no_n: 0
n-1<28363> ssi:boot:rsh:no_profile: 0
n-1<28363> ssi:boot:rsh:fast: 0
n-1<28363> ssi:boot:rsh:ignore_stderr: 0
n-1<28363> ssi:boot:rsh:priority: 10
n-1<28363> ssi:boot:select: boot module available: rsh, priority: 10
n-1<28363> ssi:boot:select: selected boot module rsh
LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University
n-1<28363> ssi:boot:base: looking for boot schema in following directories:
n-1<28363> ssi:boot:base: <current directory>
n-1<28363> ssi:boot:base: $TROLLIUSHOME/etc
n-1<28363> ssi:boot:base: $LAMHOME/etc
n-1<28363> ssi:boot:base: /homi/truong/Local/lam-mpi_itanium-inst/etc
n-1<28363> ssi:boot:base: looking for boot schema file:
n-1<28363> ssi:boot:base: hostfile
n-1<28363> ssi:boot:base: found boot schema: hostfile
n-1<28363> ssi:boot:rsh: found the following hosts:
n-1<28363> ssi:boot:rsh: n0 biscaye (cpu=1)
n-1<28363> ssi:boot:rsh: resolved hosts:
n-1<28363> ssi:boot:rsh: n0 biscaye --> 125.1.2.17 (origin)
n-1<28363> ssi:boot:rsh: starting RTE procs
n-1<28363> ssi:boot:base:linear: starting
n-1<28363> ssi:boot:base:server: opening server TCP socket
n-1<28363> ssi:boot:base:server: opened port 62915
n-1<28363> ssi:boot:base:linear: booting n0 (biscaye)
n-1<28363> ssi:boot:rsh: starting lamd on (biscaye)
n-1<28363> ssi:boot:rsh: starting on n0 (biscaye): hboot -t -c
lam-conf.lamd -d -v -I -H 125.1.2.17 -P 62915 -n 0 -o 0
n-1<28363> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-truong_at_biscaye/lam-killfile
tkill: f_kill = "/tmp/lam-truong_at_biscaye/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 28275 ...
tkill: already dead
tkill: removing socket file ...
tkill: socket file: /tmp/lam-truong_at_biscaye/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-truong_at_biscaye/lam-io-socket
tkill: all finished
hboot: booting...
hboot: fork /homi/truong/Local/lam-mpi_itanium-inst/bin/lamd
[1] 28366 lamd -H 125.1.2.17 -P 62915 -n 0 -o 0 -d
n-1<28363> ssi:boot:rsh: successfully launched on n0 (biscaye)
n-1<28363> ssi:boot:base:server: expecting connection from finite list
hboot: attempting to execute
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.
*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.
As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:
- There are network filters between the lamboot agent host and
the remote host such that communication on random TCP ports
is blocked
- Network routing from the remote host to the local host isn't
properly configured (this is uncommon)
You can check these things by watching the output from "lamboot -d".
1. On the command line for hboot, there are two important parameters:
one is the IP address of where the lamboot agent was invoked, the
other is the port number that the lamboot agent is expecting the
newly-booted process to call back on (this will be a random
integer).
2. Manually login to the remote machine and try to telnet to the port
indicated on the hboot command line. For example,
telnet <ipnumber> <portnumber>
If all goes well, you should get a "Connection refused" error. If
you get any other kind of error, it could indicate either of the
two conditions above. Consult with your system/network
administrator.
-----------------------------------------------------------------------------
n-1<28363> ssi:boot:base:server: failed to connect to remote lamd!
n-1<28363> ssi:boot:base:server: closing server socket
n-1<28363> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully
==================================================================================
I used compilation flags on the Itanium station equivalent to those
on the PA-RISC station. It seems that the hboot command doesn't work.
Regarding the telnet suggestion: if I open a listening socket on
another station, I can telnet to that port from the Itanium station.
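The same kind of check can be repeated on the Itanium station alone,
since in this run the lamd that has to call back is the local one.
What follows is only a sketch, assuming Python 3 is available on that
machine; the function name callback_check is made up for illustration
and is not part of LAM/MPI. It listens on the given address on a port
chosen by the system, then connects to that port from the same
machine, roughly as lamd is asked to do with the -H and -P arguments
shown in the log.

```python
import socket
import threading


def callback_check(host_ip, timeout=5.0):
    """Listen on host_ip (system-chosen port) and connect to it locally.

    Returns True only if the connection was both made and accepted.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host_ip, 0))          # port 0: let the OS pick a free port
    server.listen(1)
    server.settimeout(timeout)
    port = server.getsockname()[1]
    accepted = []

    def accept_once():
        # Wait for a single incoming connection, or give up after timeout.
        try:
            conn, _ = server.accept()
            accepted.append(True)
            conn.close()
        except OSError:
            pass

    listener = threading.Thread(target=accept_once)
    listener.start()
    try:
        # Play the role of the process that has to call back.
        client = socket.create_connection((host_ip, port), timeout=timeout)
        client.close()
        connected = True
    except OSError:
        connected = False
    listener.join()
    server.close()
    return connected and bool(accepted)
```

Calling callback_check("125.1.2.17") on biscaye would exercise the
address that lamboot passed to hboot above. A False result would
suggest a local name resolution or routing problem rather than a
problem in hboot itself; a True result would not rule out other
causes.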
Do you have any suggestions for further testing?
Best regards,
V.Khiem Truong
Onera - France
>On Mar 29, 2007, at 4:01 AM, Van-Khiem Truong wrote:
>
>> I would like to get help for the "lamboot" procedure. I have
>> installed the code LAM-MPI on two machines HP-UX, the first one is a
>> PA-Risc 2.0, the second
>> one is a multiprocessor HP Itanium.
>>
>> The installation seems to be fine, except for the module ptmalloc2
>> (/share/memory/ptmalloc2) where I need to change the "Makefile " to
>> remove the file
>> malloc.c, otherwise the code tells me that variables are already
>> declared.
>
>You should configure with --without-memory-manager and then you won't
>have this problem.
>
>> So on the machine HP PA-Risc , I can start the procedure
>> "lamboot"
>> and connect to another PA-Risc HP. However for the machine HP
>> multiprocessor, it tells me that it boots but the call back doesn't
>> work. I attach hereby the file containing the error message.
>See below.
>
>> [snip]
>> n-1<1698> ssi:boot:base: looking for boot schema file:
>> n-1<1698> ssi:boot:base: hostfile
>> n-1<1698> ssi:boot:base: found boot schema: hostfile
>> n-1<1698> ssi:boot:rsh: found the following hosts:
>> n-1<1698> ssi:boot:rsh: n0 nanopus (cpu=1)
>> n-1<1698> ssi:boot:rsh: n1 hudson (cpu=1)
>> n-1<1698> ssi:boot:rsh: resolved hosts:
>> n-1<1698> ssi:boot:rsh: n0 nanopus --> 125.1.5.218 (origin)
>> n-1<1698> ssi:boot:rsh: n1 hudson --> 125.1.7.17
>> [snip]
>> n-1<1698> ssi:boot:rsh: starting on n0 (nanopus): hboot -t -c
>> lam-conf.lamd -d -v -I -H 125.1.5.218 -P 49939 -n 0 -o 0
>> n-1<1698> ssi:boot:rsh: launching locally
>> [snip]
>> hboot: attempting to execute
>> [1] 1701 lamd -H 125.1.5.218 -P 49939 -n 0 -o 0 -d
>> n-1<1698> ssi:boot:rsh: successfully launched on n0 (nanopus)
>> n-1<1698> ssi:boot:base:server: expecting connection from finite list
>> -----------------------------------------------------------------------------
>> The lamboot agent timed out while waiting for the newly-booted process
>> to call back and indicated that it had successfully booted.
>> [snip]
>
>What is truly odd here is that the lamd that lamboot is waiting for
>is the *local* lamd.
>
>Did you check that you have no TCP filtering / firewall software
>running?
>
>--
>Jeff Squyres
>Cisco Systems