
LAM/MPI General User's Mailing List Archives


From: Brian Barrett (brbarret_at_[hidden])
Date: 2006-05-17 23:10:22


On May 17, 2006, at 9:21 AM, Valter Dal Bo wrote:

> Hi all !
>
> Maybe someone can help me with this issue.
>
> I have a problem with LAM/MPI v.7.0.6 on RHEL4 that did not occur on
> Red Hat 9 with LAM v.6.5.9.
> We use simulation packages (e.g. ls-dyna) on dual-processor machines.
> To take full advantage of the 64-bit architecture and OS with this
> software, we need to run it in parallel mode, which is done through
> LAM/MPI (the developers compiled the software specifically for LAM/MPI
> on the 64-bit EM64T architecture).
>
> To do that, I need to start LAM/MPI by issuing the "lamboot" command,
> which should start the MPI processes and enable the 2 CPUs.
>
> Well, issuing the "lamboot" I get the following message:
>
> $ lamboot -v
>
> LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
>
> n-1<4268> ssi:boot:base:linear: booting n0 (localhost)
> n-1<4268> ssi:boot:base:linear: finished
>
> The above means that the process exited without enabling node1 and
> therefore it fails the initialization.
> At first I thought this was because I was getting some messages back
> when rsh'ing:

Actually, since you didn't give any host file to lamboot, it did
exactly what it should. It defaulted to a hostfile of "localhost"
and started a universe there. So at this point, LAM/MPI looks ok.
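
To boot anything other than the default "localhost" universe, lamboot takes a boot schema (host file) as its argument. A minimal sketch, assuming a placeholder file path and reusing the host name redhat2 and CPU count of 2 from the debug output quoted further down; it only prints the lamboot command rather than running it:

```shell
#!/bin/sh
# Sketch only: the schema path is a placeholder, and "redhat2 cpu=2" is
# taken from the debug output quoted in this thread.
schema=$(mktemp "${TMPDIR:-/tmp}/lam-bhost.XXXXXX")

# One host per line; cpu=N tells LAM how many CPUs that host offers.
printf '%s\n' 'redhat2 cpu=2' > "$schema"

# Print the boot command instead of running it, so this is safe to try
# on a machine without LAM installed.
echo "lamboot -v $schema"
```

Run the printed command by hand once the file lists the hosts you actually want in the universe.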

> After fiddling enough ( ;-) ) and managing to get rid of the above,
> the problem still persists.
> When I try to run the job, it exits miserably...
> The lamboot -d command returns the following:
>
> $ lamboot $HOME/cluster -d
> n-1<8737> ssi:boot: Opening
> n-1<8737> ssi:boot: opening module globus
> n-1<8737> ssi:boot: initializing module globus
> n-1<8737> ssi:boot:globus: globus-job-run not found, globus boot will not run
> n-1<8737> ssi:boot: module not available: globus
> n-1<8737> ssi:boot: opening module rsh
> n-1<8737> ssi:boot: initializing module rsh
> n-1<8737> ssi:boot:rsh: module initializing
> n-1<8737> ssi:boot:rsh:agent: /usr/bin/ssh -x -a
> n-1<8737> ssi:boot:rsh:username: <same>
> n-1<8737> ssi:boot:rsh:verbose: 1000
> n-1<8737> ssi:boot:rsh:algorithm: linear
> n-1<8737> ssi:boot:rsh:priority: 10
> n-1<8737> ssi:boot: module available: rsh, priority: 10
> n-1<8737> ssi:boot: finalizing module globus
> n-1<8737> ssi:boot:globus: finalizing
> n-1<8737> ssi:boot: closing module globus
> n-1<8737> ssi:boot: Selected boot module rsh
>
> LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
>
> n-1<8737> ssi:boot:base: looking for boot schema in following directories:
> n-1<8737> ssi:boot:base: <current directory>
> n-1<8737> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<8737> ssi:boot:base: $LAMHOME/etc
> n-1<8737> ssi:boot:base: /etc/lam
> n-1<8737> ssi:boot:base: looking for boot schema file:
> n-1<8737> ssi:boot:base: /home/catusr/cluster
> n-1<8737> ssi:boot:base: found boot schema: /home/catusr/cluster
> n-1<8737> ssi:boot:rsh: found the following hosts:
> n-1<8737> ssi:boot:rsh: n0 redhat2 (cpu=2)
> n-1<8737> ssi:boot:rsh: resolved hosts:
> n-1<8737> ssi:boot:rsh: n0 redhat2 --> 192.168.1.11 (origin)
> n-1<8737> ssi:boot:rsh: starting RTE procs
> n-1<8737> ssi:boot:base:linear: starting
> n-1<8737> ssi:boot:base:server: opening server TCP socket
> n-1<8737> ssi:boot:base:server: opened port 33121
> n-1<8737> ssi:boot:base:linear: booting n0 (redhat2)
> n-1<8737> ssi:boot:rsh: starting lamd on (redhat2)
> n-1<8737> ssi:boot:rsh: starting on n0 (redhat2): hboot -t -c lam-conf.lamd -d -I -H 192.168.1.11 -P 33121 -n 0 -o 0
> n-1<8737> ssi:boot:rsh: launching locally
> hboot: performing tkill
> hboot: tkill -d
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-catusr_at_redhat2/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-catusr_at_redhat2/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-catusr_at_redhat2/lam-io-socket
> tkill: f_kill = "/tmp/lam-catusr_at_redhat2/lam-killfile"
> tkill: killing LAM...
> tkill: killing PID (SIGHUP) 8718 ...
> tkill: killed
> tkill: all finished
> hboot: booting...
> hboot: fork /usr/bin/lamd
> hboot: attempting to execute
> [1] 8740 lamd -H 192.168.1.11 -P 33121 -n 0 -o 0 -d
> n-1<8737> ssi:boot:rsh: successfully launched on n0 (redhat2)
> n-1<8737> ssi:boot:base:server: expecting connection from finite list
> n-1<8740> ssi:boot: Opening
> n-1<8740> ssi:boot: opening module globus
> n-1<8740> ssi:boot: initializing module globus
> n-1<8740> ssi:boot:globus: globus-job-run not found, globus boot will not run
> n-1<8740> ssi:boot: module not available: globus
> n-1<8740> ssi:boot: opening module rsh
> n-1<8740> ssi:boot: initializing module rsh
> n-1<8740> ssi:boot:rsh: module initializing
> n-1<8740> ssi:boot:rsh:agent: /usr/bin/ssh -x -a
> n-1<8740> ssi:boot:rsh:username: <same>
> n-1<8740> ssi:boot:rsh:verbose: 1000
> n-1<8740> ssi:boot:rsh:algorithm: linear
> n-1<8740> ssi:boot:rsh:priority: 10
> n-1<8740> ssi:boot: module available: rsh, priority: 10
> n-1<8740> ssi:boot: finalizing module globus
> n-1<8740> ssi:boot:globus: finalizing
> n-1<8740> ssi:boot: closing module globus
> n-1<8740> ssi:boot: Selected boot module rsh
> n-1<8737> ssi:boot:base:server: got connection from 192.168.1.11
> n-1<8737> ssi:boot:base:server: this connection is expected (n0)
> n-1<8737> ssi:boot:base:server: remote lamd is at 192.168.1.11:32772
> n-1<8737> ssi:boot:base:server: closing server socket
> n-1<8737> ssi:boot:base:server: connecting to lamd at 192.168.1.11:33122
> n-1<8737> ssi:boot:base:server: connected
> n-1<8737> ssi:boot:base:server: sending number of links (1)
> n-1<8737> ssi:boot:base:server: sending info: n0 (redhat2)
> n-1<8737> ssi:boot:base:server: finished sending
> n-1<8737> ssi:boot:base:server: disconnected from 192.168.1.11:33122
> n-1<8737> ssi:boot:base:linear: finished
> n-1<8737> ssi:boot:rsh: all RTE procs started
> n-1<8737> ssi:boot:rsh: finalizing
> n-1<8737> ssi:boot: Closing
> n-1<8740> ssi:boot:rsh: finalizing
> n-1<8740> ssi:boot: Closing
>
> And it looks to me as though the lam process dies for no evident
> reason.
>
> I had a look at /tmp/lam-debug-log.txt and I can see that the
> process exits, but without telling me what is wrong with it
> all............ :-( (The lam-debug-log.txt is inline at the bottom of
> the msg....)
>
> Does anybody have an idea on how to solve the problem ?
> Any help will be greatly appreciated !

I'm not sure what you mean - the above information looks perfect.
lamboot should exit when it's done, and it looks like it finished as
expected. It started a universe on the node "redhat2" with a "cpu
count" of 2. Once lamboot is finished, you can run lamnodes to see
what nodes are in the newly booted environment, mpirun to run
processes, and lamhalt to take down the environment. Is one of
these commands not working properly?
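
The sequence described above can be sketched as a small script. This is only an illustration: the program name ./my_mpi_program and the DRY_RUN switch are hypothetical, and -np 2 is assumed to match the cpu count of 2 shown in the lamboot output. By default it just prints the commands; set DRY_RUN=0 to actually run them inside a booted universe.

```shell
#!/bin/sh
# Sketch of a session after lamboot has finished.
# DRY_RUN and ./my_mpi_program are illustrative names, not part of LAM.

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1 = number of processes, $2 = MPI program to launch
lam_session() {
    run lamnodes &&             # list the nodes in the booted universe
    run mpirun -np "$1" "$2" && # start the MPI processes
    run lamhalt                 # take the universe down again
}

lam_session 2 ./my_mpi_program
```

If one of the three steps fails when run for real, the chain stops there, which shows which command is the one misbehaving.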

Brian

-- 
   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have a LAM/MPI day: http://www.lam-mpi.org/