On Apr 2, 2007, at 9:35 AM, Van-Khiem Truong wrote:
> Hello Jeff Squyres,
>
> Thank you for your quick response. That is really odd! I spent some
> time checking into the problem.
>
> (1) You are right about the configuration without "memory-manager";
>
> (2) There is no firewall software running;
>
> (3) Instead of using the multiprocessor machine, I installed LAM/MPI
> on a single-processor machine with the same Itanium processor. I then
> ran lamboot with only the Itanium station in the hostfile. It produces
> the same error message as before (as it also does with two
> workstations), as you can see in the following output:
It's actually not hboot that is failing, but the lamd (hboot is
mainly a wrapper around fork/exec'ing the lamd). The lamd is trying
to open a socket back to 125.1.2.17 port 62915 (which *should* be the
same as the local host).
Do you, perchance, have multiple IP addresses on this machine? I'm
wondering if LAM is using the "wrong" IP address such that it can't
open a socket back to 125.1.2.17 properly.
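
A rough way to check this, sketched in Python under some assumptions:
the helper names addresses_for and can_bind are made up for
illustration and have nothing to do with LAM itself. The idea is to
list the IPv4 addresses the local hostname resolves to, then see
whether 125.1.2.17 can be bound on this machine. An address that cannot
be bound is probably not configured on a local interface.

```python
import socket


def addresses_for(hostname):
    """Return the sorted IPv4 addresses that hostname resolves to."""
    try:
        infos = socket.getaddrinfo(
            hostname, None, socket.AF_INET, socket.SOCK_STREAM
        )
    except socket.gaierror:
        # Name could not be resolved at all.
        return []
    # Each entry is (family, type, proto, canonname, (ip, port)).
    return sorted({info[4][0] for info in infos})


def can_bind(address):
    """Return True if a TCP socket can bind to address on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((address, 0))  # port 0: let the OS pick a free port
            return True
        except OSError:
            return False


if __name__ == "__main__":
    name = socket.gethostname()
    print(name, "resolves to", addresses_for(name))
    # 125.1.2.17 is the address lamboot reported for biscaye.
    print("125.1.2.17 can be bound locally:", can_bind("125.1.2.17"))
```

If the hostname resolves to several addresses, or 125.1.2.17 cannot be
bound, that would point toward an address-selection problem rather than
a firewall.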
>
> ================================================================
> Output on the screen:
> biscaye 173 : lamboot -v -d -ssi boot rsh hostfile
> n-1<28363> ssi:boot:open: opening
> n-1<28363> ssi:boot:open: looking for boot module named rsh
> n-1<28363> ssi:boot:open: opening boot module rsh
> n-1<28363> ssi:boot:open: opened boot module rsh
> n-1<28363> ssi:boot:select: initializing boot module rsh
> n-1<28363> ssi:boot:rsh: module initializing
> n-1<28363> ssi:boot:rsh:agent: /usr/bin/remsh
> n-1<28363> ssi:boot:rsh:username: <same>
> n-1<28363> ssi:boot:rsh:verbose: 1000
> n-1<28363> ssi:boot:rsh:algorithm: linear
> n-1<28363> ssi:boot:rsh:no_n: 0
> n-1<28363> ssi:boot:rsh:no_profile: 0
> n-1<28363> ssi:boot:rsh:fast: 0
> n-1<28363> ssi:boot:rsh:ignore_stderr: 0
> n-1<28363> ssi:boot:rsh:priority: 10
> n-1<28363> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<28363> ssi:boot:select: selected boot module rsh
>
> LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University
>
> n-1<28363> ssi:boot:base: looking for boot schema in following directories:
> n-1<28363> ssi:boot:base: <current directory>
> n-1<28363> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<28363> ssi:boot:base: $LAMHOME/etc
> n-1<28363> ssi:boot:base: /homi/truong/Local/lam-mpi_itanium-inst/etc
> n-1<28363> ssi:boot:base: looking for boot schema file:
> n-1<28363> ssi:boot:base: hostfile
> n-1<28363> ssi:boot:base: found boot schema: hostfile
> n-1<28363> ssi:boot:rsh: found the following hosts:
> n-1<28363> ssi:boot:rsh: n0 biscaye (cpu=1)
> n-1<28363> ssi:boot:rsh: resolved hosts:
> n-1<28363> ssi:boot:rsh: n0 biscaye --> 125.1.2.17 (origin)
> n-1<28363> ssi:boot:rsh: starting RTE procs
> n-1<28363> ssi:boot:base:linear: starting
> n-1<28363> ssi:boot:base:server: opening server TCP socket
> n-1<28363> ssi:boot:base:server: opened port 62915
> n-1<28363> ssi:boot:base:linear: booting n0 (biscaye)
> n-1<28363> ssi:boot:rsh: starting lamd on (biscaye)
> n-1<28363> ssi:boot:rsh: starting on n0 (biscaye): hboot -t -c lam-conf.lamd -d -v -I -H 125.1.2.17 -P 62915 -n 0 -o 0
> n-1<28363> ssi:boot:rsh: launching locally
> hboot: performing tkill
> hboot: tkill -d
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-truong_at_biscaye/lam-killfile
> tkill: f_kill = "/tmp/lam-truong_at_biscaye/lam-killfile"
> tkill: killing LAM...
> tkill: killing PID (SIGHUP) 28275 ...
> tkill: already dead
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-truong_at_biscaye/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-truong_at_biscaye/lam-io-socket
> tkill: all finished
> hboot: booting...
> hboot: fork /homi/truong/Local/lam-mpi_itanium-inst/bin/lamd
> [1] 28366 lamd -H 125.1.2.17 -P 62915 -n 0 -o 0 -d
> n-1<28363> ssi:boot:rsh: successfully launched on n0 (biscaye)
> n-1<28363> ssi:boot:base:server: expecting connection from finite list
> hboot: attempting to execute
> -----------------------------------------------------------------------------
> The lamboot agent timed out while waiting for the newly-booted process
> to call back and indicated that it had successfully booted.
>
> *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> *** MAILING LIST.
>
> As far as LAM could tell, the remote process started properly, but
> then never called back. Possible reasons that this may happen:
>
> - There are network filters between the lamboot agent host and
> the remote host such that communication on random TCP ports
> is blocked
> - Network routing from the remote host to the local host isn't
> properly configured (this is uncommon)
>
> You can check these things by watching the output from "lamboot -d".
>
> 1. On the command line for hboot, there are two important parameters:
> one is the IP address of where the lamboot agent was invoked, the
> other is the port number that the lamboot agent is expecting the
> newly-booted process to call back on (this will be a random
> integer).
>
> 2. Manually login to the remote machine and try to telnet to the port
> indicated on the hboot command line. For example,
> telnet <ipnumber> <portnumber>
> If all goes well, you should get a "Connection refused" error. If
> you get any other kind of error, it could indicate either of the
> two conditions above. Consult with your system/network
> administrator.
> -----------------------------------------------------------------------------
> n-1<28363> ssi:boot:base:server: failed to connect to remote lamd!
> n-1<28363> ssi:boot:base:server: closing server socket
> n-1<28363> ssi:boot:base:linear: aborted!
> lamboot did NOT complete successfully
>
> ==================================================================================
>
> I used compilation flags on the Itanium station equivalent to those
> on the PA-RISC station. It seems that the hboot command doesn't work.
> Regarding the telnet suggestion: if I open a socket on another
> station, I can telnet to that port from the Itanium station.
>
> Would you have any suggestion for testing further?
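
One further test, given that the failing connection is the local lamd
calling back to 125.1.2.17 on the same machine rather than a connection
to another station: listen on that address and connect to it locally.
The Python sketch below does this. callback_test is a hypothetical
helper name, and binding to the specific address (instead of all
interfaces) is an assumption of the sketch, not a description of what
lamboot does internally.

```python
import socket


def callback_test(address, timeout=5.0):
    """Listen on address, then connect back to it from the same host.

    Returns (ok, detail), where detail describes what happened.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.bind((address, 0))  # port 0: the OS picks a free port
        server.listen(1)
        port = server.getsockname()[1]

        client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        client.settimeout(timeout)
        try:
            # The kernel completes the handshake for a listening socket,
            # so no accept() call is needed for connect() to succeed.
            client.connect((address, port))
        finally:
            client.close()
        return True, "connected back to %s port %d" % (address, port)
    except OSError as exc:
        return False, "failed on %s: %s" % (address, exc)
    finally:
        server.close()


if __name__ == "__main__":
    # 125.1.2.17 is the address passed to hboot with -H in the log above.
    print(callback_test("125.1.2.17"))
```

A failure here on biscaye would suggest the problem is with how that
address is configured or routed on the machine itself.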
>
> Best regards,
>
> V.Khiem Truong
> Onera - France
>
>
>
>
>
>> On Mar 29, 2007, at 4:01 AM, Van-Khiem Truong wrote:
>>
>>> I would like to get help with the "lamboot" procedure. I have
>>> installed LAM/MPI on two HP-UX machines: the first is a PA-RISC 2.0,
>>> the second is a multiprocessor HP Itanium.
>>>
>>> The installation seems to be fine, except for the ptmalloc2 module
>>> (/share/memory/ptmalloc2), where I needed to change the Makefile to
>>> remove the file malloc.c; otherwise I am told that variables are
>>> already declared.
>>
>> You should configure with --without-memory-manager and then you won't
>> have this problem.
>>
>>> So on the HP PA-RISC machine, I can run "lamboot" and connect to
>>> another HP PA-RISC machine. However, on the multiprocessor HP
>>> machine, it reports that it boots but the callback doesn't work. I
>>> have attached the file containing the error message.
>>
>> See below.
>>
>>> [snip]
>>> n-1<1698> ssi:boot:base: looking for boot schema file:
>>> n-1<1698> ssi:boot:base: hostfile
>>> n-1<1698> ssi:boot:base: found boot schema: hostfile
>>> n-1<1698> ssi:boot:rsh: found the following hosts:
>>> n-1<1698> ssi:boot:rsh: n0 nanopus (cpu=1)
>>> n-1<1698> ssi:boot:rsh: n1 hudson (cpu=1)
>>> n-1<1698> ssi:boot:rsh: resolved hosts:
>>> n-1<1698> ssi:boot:rsh: n0 nanopus --> 125.1.5.218 (origin)
>>> n-1<1698> ssi:boot:rsh: n1 hudson --> 125.1.7.17
>>> [snip]
>>> n-1<1698> ssi:boot:rsh: starting on n0 (nanopus): hboot -t -c lam-conf.lamd -d -v -I -H 125.1.5.218 -P 49939 -n 0 -o 0
>>> n-1<1698> ssi:boot:rsh: launching locally
>>> [snip]
>>> hboot: attempting to execute [1] 1701 lamd -H 125.1.5.218 -P 49939 -n 0 -o 0 -d
>>> n-1<1698> ssi:boot:rsh: successfully launched on n0 (nanopus)
>>> n-1<1698> ssi:boot:base:server: expecting connection from finite list
>>> -----------------------------------------------------------------------------
>>> The lamboot agent timed out while waiting for the newly-booted process
>>> to call back and indicated that it had successfully booted.
>>> [snip]
>>
>> What is truly odd here is that the lamd that lamboot is waiting for
>> is the *local* lamd.
>>
>> Did you check that you have no TCP filtering / firewall software
>> running?
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
Jeff Squyres
Cisco Systems