Hi Brian,
Thank you for your kind answer!
How do you simply "reply" to posts on the list?
Anyway, let's use a workaround... :-(
> Well, issuing the "lamboot" I get the following message:
>
> $ lamboot -v
>
> LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
>
> n-1<4268> ssi:boot:base:linear: booting n0 (localhost)
> n-1<4268> ssi:boot:base:linear: finished
>
> The above means that the process exited without enabling node1 and
> therefore it fails the initialization.
> At first I thought it was due to the fact that rsh'ing I was getting
> some messages in return:
Brian wrote:
> Actually, since you didn't give any host file to lamboot, it did
> exactly what it should. It defaulted to a hostfile of "localhost"
> and started a universe there. So at this point, LAM/MPI looks ok.
Things do not change even when supplying a hostfile, e.g.:
file: hostfile
redhat2 cpu=2
Then, running "lamboot hostfile -d", I get:
n-1<32696> ssi:boot: Opening
n-1<32696> ssi:boot: opening module globus
n-1<32696> ssi:boot: initializing module globus
n-1<32696> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<32696> ssi:boot: module not available: globus
n-1<32696> ssi:boot: opening module rsh
n-1<32696> ssi:boot: initializing module rsh
n-1<32696> ssi:boot:rsh: module initializing
n-1<32696> ssi:boot:rsh:agent: /usr/bin/ssh -x -a
n-1<32696> ssi:boot:rsh:username: <same>
n-1<32696> ssi:boot:rsh:verbose: 1000
n-1<32696> ssi:boot:rsh:algorithm: linear
n-1<32696> ssi:boot:rsh:priority: 10
n-1<32696> ssi:boot: module available: rsh, priority: 10
n-1<32696> ssi:boot: finalizing module globus
n-1<32696> ssi:boot:globus: finalizing
n-1<32696> ssi:boot: closing module globus
n-1<32696> ssi:boot: Selected boot module rsh
LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
n-1<32696> ssi:boot:base: looking for boot schema in following directories:
n-1<32696> ssi:boot:base: <current directory>
n-1<32696> ssi:boot:base: $TROLLIUSHOME/etc
n-1<32696> ssi:boot:base: $LAMHOME/etc
n-1<32696> ssi:boot:base: /etc/lam
n-1<32696> ssi:boot:base: looking for boot schema file:
n-1<32696> ssi:boot:base: cluster
n-1<32696> ssi:boot:base: found boot schema: cluster
n-1<32696> ssi:boot:rsh: found the following hosts:
n-1<32696> ssi:boot:rsh: n0 redhat2 (cpu=2)
n-1<32696> ssi:boot:rsh: resolved hosts:
n-1<32696> ssi:boot:rsh: n0 redhat2 --> 192.168.1.11 (origin)
n-1<32696> ssi:boot:rsh: starting RTE procs
n-1<32696> ssi:boot:base:linear: starting
n-1<32696> ssi:boot:base:server: opening server TCP socket
n-1<32696> ssi:boot:base:server: opened port 33816
n-1<32696> ssi:boot:base:linear: booting n0 (redhat2)
n-1<32696> ssi:boot:rsh: starting lamd on (redhat2)
n-1<32696> ssi:boot:rsh: starting on n0 (redhat2): hboot -t -c lam-conf.lamd -d -I -H 192.168.1.11 -P 33816 -n 0 -o 0
n-1<32696> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-catusr_at_redhat2/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-catusr_at_redhat2/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-catusr_at_redhat2/lam-io-socket
tkill: f_kill = "/tmp/lam-catusr_at_redhat2/lam-killfile"
tkill: nothing to kill: "/tmp/lam-catusr_at_redhat2/lam-killfile"
hboot: booting...
hboot: fork /usr/bin/lamd
hboot: attempting to execute
[1] 32699 lamd -H 192.168.1.11 -P 33816 -n 0 -o 0 -d
n-1<32696> ssi:boot:rsh: successfully launched on n0 (redhat2)
n-1<32696> ssi:boot:base:server: expecting connection from finite list
n-1<32699> ssi:boot: Opening
n-1<32699> ssi:boot: opening module globus
n-1<32699> ssi:boot: initializing module globus
n-1<32699> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<32699> ssi:boot: module not available: globus
n-1<32699> ssi:boot: opening module rsh
n-1<32699> ssi:boot: initializing module rsh
n-1<32699> ssi:boot:rsh: module initializing
n-1<32699> ssi:boot:rsh:agent: /usr/bin/ssh -x -a
n-1<32699> ssi:boot:rsh:username: <same>
n-1<32699> ssi:boot:rsh:verbose: 1000
n-1<32699> ssi:boot:rsh:algorithm: linear
n-1<32699> ssi:boot:rsh:priority: 10
n-1<32699> ssi:boot: module available: rsh, priority: 10
n-1<32699> ssi:boot: finalizing module globus
n-1<32699> ssi:boot:globus: finalizing
n-1<32699> ssi:boot: closing module globus
n-1<32699> ssi:boot: Selected boot module rsh
n-1<32696> ssi:boot:base:server: got connection from .11
n-1<32696> ssi:boot:base:server: this connection is expected (n0)
n-1<32696> ssi:boot:base:server: remote lamd is at 192.168.1.11:32775
n-1<32696> ssi:boot:base:server: closing server socket
n-1<32696> ssi:boot:base:server: connecting to lamd at 192.168.1.11:33817
n-1<32696> ssi:boot:base:server: connected
n-1<32696> ssi:boot:base:server: sending number of links (1)
n-1<32696> ssi:boot:base:server: sending info: n0 (redhat2)
n-1<32696> ssi:boot:base:server: finished sending
n-1<32696> ssi:boot:base:server: disconnected from 192.168.1.11:33817
n-1<32696> ssi:boot:base:linear: finished
n-1<32696> ssi:boot:rsh: all RTE procs started
n-1<32696> ssi:boot:rsh: finalizing
n-1<32699> ssi:boot:rsh: finalizing
n-1<32699> ssi:boot: Closing
n-1<32696> ssi:boot: Closing
Issuing the same command on a Red Hat 9 + LAM 6.5.9 machine, I get the following:
lamboot hostfile -d
LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
lamboot: boot schema file: cluster
lamboot: opening hostfile cluster
lamboot: found the following hosts:
lamboot: n0 redhat9
lamboot: resolved hosts:
lamboot: n0 redhat9 --> 192.168.1.18
lamboot: found 1 host node(s)
lamboot: origin node is 0 (redhat9)
lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -I " -H 192.168.1.18 -P 32770 -n 0 -o 0 ""
hboot: process schema = "/etc/lam/lam-conf.lam"
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
hboot: attempting to execute
[1] 4144 lamd -H 192.168.1.18 -P 32770 -n 0 -o 0 -d
lamboot completed successfully
This looks quite a bit different, but I suppose the result is the
same: booting the 2 CPUs.
Brian wrote:
> I'm not sure what you mean - the above information looks perfect.
> Lamboot should exit when it's done, and it looks like it finished as
> expected. It started a universe on the node "redhat2" with a "cpu
> count" of 2. Once lamboot is finished, you can run lamnodes to see
> what nodes are in the newly booted environment, mpirun to run
> processes, and lamhalt to take down the environment. Is one of
> these commands not working properly?
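For reference, the boot / inspect / run / halt cycle described in the reply above can be sketched as a small shell function. This is only an illustration, not something taken from the LAM/MPI documentation: the function name lam_cycle and its argument order are made up here, and the hostfile, process count, and program are whatever gets passed in. It uses only the lamboot, lamnodes, mpirun -np, and lamhalt commands already mentioned in this thread.

```shell
# Hypothetical helper (name and argument order are assumptions, not part of LAM/MPI).
# Usage: lam_cycle <hostfile> <nprocs> <program> [program args...]
lam_cycle() {
    lam_hostfile="$1"
    lam_nprocs="$2"
    shift 2

    # Start the LAM universe described by the hostfile; give up if that fails.
    lamboot -v "$lam_hostfile" || return 1

    # List the nodes that were actually booted.
    lamnodes

    # Run the MPI program and remember its exit status.
    mpirun -np "$lam_nprocs" "$@"
    lam_status=$?

    # Always take the universe down again, then report the mpirun status.
    lamhalt
    return "$lam_status"
}
```

With the names used in this message, the call would be: lam_cycle hostfile 2 executable -i=inputfile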
lamnodes returns:
n0 redhat2:2:origin,this_node
so it looks like the 2-CPU environment started correctly.
lamhalt also seems to work fine. It is only when I try to run an
LS-DYNA job using "mpirun -np 2 executable -i=inputfile" that I
get the following error:
-----------------------------------------------------------------------------
It seems that there is no lamd running on the host .
This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for MPI programs to run
(the MPI program tired to invoke the "MPI_Init" function).
Please run the "lamboot" command the start the LAM/MPI runtime
environment. See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
It seems that there is no lamd running on the host .
This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for MPI programs to run
(the MPI program tired to invoke the "MPI_Init" function).
Please run the "lamboot" command the start the LAM/MPI runtime
environment. See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).
mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------
Again, the /tmp/lam-catusr_at_redhat2/lam-debug-log.txt file says what I
have already posted in my original message.
Any ideas?
I don't fancy compiling a LAM 6.5.9 environment just to run this
program, as I have several other applications which may use
different versions of LAM, and I would prefer not to mess with the
original RHEL installation if possible.
Thank you again for the help.
Cheers
Valter
P.S.
I've got to find out how to reply properly... using the "workaround" is
a bit messy.
--
Brian Barrett
LAM/MPI developer and all around nice guy
;-))))))))))))))))
Have a LAM/MPI day: http://www.lam-mpi.org/