Hi,
I have a problem starting MPI jobs. When I run
mpirun -np 2 hello
I get the following error:
-----------------------------------------------------------------------------
It seems that there is no lamd running on this host, which indicates that the
LAM/MPI runtime environment is not operating. The LAM/MPI runtime
environment is necessary for MPI programs to run (the MPI program tired to
invoke the "MPI_Init" function).
Please run the "lamboot" command the start the LAM/MPI runtime
environment. See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with mpirun
did not invoke MPI_INIT before quitting (it is possible that more than one
process did not invoke MPI_INIT -- mpirun was only notified of the first
one, which was on node n0).
mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------
But lamboot using the tm boot module worked correctly:
lamboot -v -d
n0<24861> ssi:boot:open: opening
n0<24861> ssi:boot:open: opening boot module globus
n0<24861> ssi:boot:open: opened boot module globus
n0<24861> ssi:boot:open: opening boot module rsh
n0<24861> ssi:boot:open: opened boot module rsh
n0<24861> ssi:boot:open: opening boot module tm
n0<24861> ssi:boot:open: opened boot module tm
n0<24861> ssi:boot:select: initializing boot module tm
n0<24861> ssi:boot:tm: module initializing
n0<24861> ssi:boot:tm:verbose: 1000
n0<24861> ssi:boot:tm:priority: 50
n0<24861> ssi:boot:select: boot module available: tm, priority: 50
n0<24861> ssi:boot:select: initializing boot module rsh
n0<24861> ssi:boot:rsh: module initializing
n0<24861> ssi:boot:rsh:agent: ssh
n0<24861> ssi:boot:rsh:username: <same>
n0<24861> ssi:boot:rsh:verbose: 1000
n0<24861> ssi:boot:rsh:algorithm: linear
n0<24861> ssi:boot:rsh:priority: 10
n0<24861> ssi:boot:select: boot module available: rsh, priority: 10
n0<24861> ssi:boot:select: initializing boot module globus
n0<24861> ssi:boot:globus: globus-job-run not found, globus boot will not run
n0<24861> ssi:boot:select: boot module not available: globus
n0<24861> ssi:boot:select: finalizing boot module rsh
n0<24861> ssi:boot:rsh: finalizing
n0<24861> ssi:boot:select: closing boot module rsh
n0<24861> ssi:boot:select: finalizing boot module globus
n0<24861> ssi:boot:globus: finalizing
n0<24861> ssi:boot:select: closing boot module globus
n0<24861> ssi:boot:select: selected boot module tm
LAM 7.1a1cvs/MPI 2 C++/ROMIO - Indiana University
n0<24861> ssi:boot:tm: found the following 2 hosts:
n0<24861> ssi:boot:tm: n0 clic2a31.hrz.tu-chemnitz.de (cpu=1)
n0<24861> ssi:boot:tm: n1 clic2a23.hrz.tu-chemnitz.de (cpu=1)
n0<24861> ssi:boot:tm: starting RTE procs
n0<24861> ssi:boot:base:linear_windowed: starting
n0<24861> ssi:boot:base:linear_windowed: window size: 5
n0<24861> ssi:boot:base:server: opening server TCP socket
n0<24861> ssi:boot:base:server: opened port 42066
n0<24861> ssi:boot:base:linear_windowed: booting n0 (clic2a31.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: starting wipe on (clic2a31.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: starting on n0 (clic2a31.hrz.tu-chemnitz.de): /afs/tucz/project/cluster/LAM/bin/tkill -setsid -d -v
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-wtob_at_[hidden]/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-wtob_at_[hidden]/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-wtob_at_[hidden]/lam-io-socket
tkill: f_kill = "/tmp/lam-wtob_at_[hidden]/lam-killfile"
tkill: nothing to kill: "/tmp/lam-wtob_at_[hidden]/lam-killfile"
n0<24861> ssi:boot:tm: successfully launched on n0 (clic2a31.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: waiting for completion on n0 (clic2a31.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: finished on n0 (clic2a31.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: starting lamd on (clic2a31.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: starting on n0 (clic2a31.hrz.tu-chemnitz.de): /afs/tucz/project/cluster/LAM/bin/lamd -H 134.109.249.7 -P 42066 -n 0 -o 0 -d
n0<24861> ssi:boot:tm: successfully launched on n0 (clic2a31.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:base:linear_windowed: booting n1 (clic2a23.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: starting wipe on (clic2a23.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: starting on n1 (clic2a23.hrz.tu-chemnitz.de): /afs/tucz/project/cluster/LAM/bin/tkill -setsid -d -v
n0<24861> ssi:boot:tm: successfully launched on n1 (clic2a23.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: waiting for completion on n1 (clic2a23.hrz.tu-chemnitz.de)
tkill: setting prefix to (null)
tkill: setting suffix to (null)
n-1<24863> ssi:boot:open: opening
n-1<24863> ssi:boot:open: opening boot module globus
n-1<24863> ssi:boot:open: opened boot module globus
n-1<24863> ssi:boot:open: opening boot module rsh
n-1<24863> ssi:boot:open: opened boot module rsh
n-1<24863> ssi:boot:open: opening boot module tm
n-1<24863> ssi:boot:open: opened boot module tm
n-1<24863> ssi:boot:select: initializing boot module tm
n-1<24863> ssi:boot:tm: module initializing
n-1<24863> ssi:boot:tm:verbose: 1000
n-1<24863> ssi:boot:tm:priority: 50
n-1<24863> ssi:boot:select: boot module available: tm, priority: 50
n-1<24863> ssi:boot:select: initializing boot module rsh
n-1<24863> ssi:boot:rsh: module initializing
n-1<24863> ssi:boot:rsh:agent: ssh
n-1<24863> ssi:boot:rsh:username: <same>
n-1<24863> ssi:boot:rsh:verbose: 1000
n-1<24863> ssi:boot:rsh:algorithm: linear
n-1<24863> ssi:boot:rsh:priority: 10
n-1<24863> ssi:boot:select: boot module available: rsh, priority: 10
n-1<24863> ssi:boot:select: initializing boot module globus
n-1<24863> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<24863> ssi:boot:select: boot module not available: globus
n-1<24863> ssi:boot:select: finalizing boot module rsh
tkill: got killname back: /tmp/lam-wtob_at_[hidden]/lam-killfile
n-1<24863> ssi:boot:rsh: finalizing
tkill: removing socket file ...
tkill: socket file: /tmp/lam-wtob_at_[hidden]/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-wtob_at_[hidden]/lam-io-socket
tkill: f_kill = "/tmp/lam-wtob_at_[hidden]/lam-killfile"
n-1<24863> ssi:boot:select: closing boot module rsh
tkill: nothing to kill: "/tmp/lam-wtob_at_[hidden]/lam-killfile"
n-1<24863> ssi:boot:select: finalizing boot module globus
n-1<24863> ssi:boot:globus: finalizing
n-1<24863> ssi:boot:select: closing boot module globus
n-1<24863> ssi:boot:select: selected boot module tm
n0<24861> ssi:boot:tm: finished on n1 (clic2a23.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: starting lamd on (clic2a23.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: starting on n1 (clic2a23.hrz.tu-chemnitz.de): /afs/tucz/project/cluster/LAM/bin/lamd -H 134.109.249.7 -P 42066 -n 1 -o 0 -d
n0<24861> ssi:boot:tm: successfully launched on n1 (clic2a23.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:base:linear_windowed: finished launching
n0<24861> ssi:boot:base:server: expecting connection from finite list
n0<24861> ssi:boot:base:server: got connection from 134.109.249.7
n0<24861> ssi:boot:base:server: this connection is expected (n0)
n0<24861> ssi:boot:base:server: remote lamd is at 134.109.249.7:32820
n0<24861> ssi:boot:base:server: expecting connection from finite list
n-1<4459> ssi:boot:open: opening
n-1<4459> ssi:boot:open: opening boot module globus
n-1<4459> ssi:boot:open: opened boot module globus
n-1<4459> ssi:boot:open: opening boot module rsh
n-1<4459> ssi:boot:open: opened boot module rsh
n-1<4459> ssi:boot:open: opening boot module tm
n-1<4459> ssi:boot:open: opened boot module tm
n-1<4459> ssi:boot:select: initializing boot module tm
n-1<4459> ssi:boot:tm: module initializing
n-1<4459> ssi:boot:tm:verbose: 1000
n-1<4459> ssi:boot:tm:priority: 50
n-1<4459> ssi:boot:select: boot module available: tm, priority: 50
n-1<4459> ssi:boot:select: initializing boot module rsh
n-1<4459> ssi:boot:rsh: module initializing
n-1<4459> ssi:boot:rsh:agent: ssh
n-1<4459> ssi:boot:rsh:username: <same>
n-1<4459> ssi:boot:rsh:verbose: 1000
n-1<4459> ssi:boot:rsh:algorithm: linear
n-1<4459> ssi:boot:rsh:priority: 10
n-1<4459> ssi:boot:select: boot module available: rsh, priority: 10
n-1<4459> ssi:boot:select: initializing boot module globus
n-1<4459> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<4459> ssi:boot:select: boot module not available: globus
n-1<4459> ssi:boot:select: finalizing boot module rsh
n-1<4459> ssi:boot:rsh: finalizing
n-1<4459> ssi:boot:select: closing boot module rsh
n-1<4459> ssi:boot:select: finalizing boot module globus
n-1<4459> ssi:boot:globus: finalizing
n-1<4459> ssi:boot:select: closing boot module globus
n-1<4459> ssi:boot:select: selected boot module tm
n0<24861> ssi:boot:base:server: got connection from 134.109.249.6
n0<24861> ssi:boot:base:server: this connection is expected (n1)
n0<24861> ssi:boot:base:server: remote lamd is at 134.109.249.6:32826
n0<24861> ssi:boot:base:server: closing server socket
n0<24861> ssi:boot:base:server: connecting to lamd at 134.109.249.7:42076
n0<24861> ssi:boot:base:server: connected
n0<24861> ssi:boot:base:server: sending number of links (2)
n0<24861> ssi:boot:base:server: sending info: n0 (clic2a31.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:base:server: sending info: n1 (clic2a23.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:base:server: finished sending
n0<24861> ssi:boot:base:server: disconnected from 134.109.249.7:42076
n0<24861> ssi:boot:base:server: connecting to lamd at 134.109.249.6:41719
n0<24861> ssi:boot:base:server: connected
n0<24861> ssi:boot:base:server: sending number of links (2)
n0<24861> ssi:boot:base:server: sending info: n0 (clic2a31.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:base:server: sending info: n1 (clic2a23.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:base:server: finished sending
n0<24861> ssi:boot:base:server: disconnected from 134.109.249.6:41719
n0<24861> ssi:boot:base:linear_windowed: finished
n0<24861> ssi:boot:tm: all RTE procs started
n0<24861> ssi:boot:tm: finalizing
n0<24861> ssi:boot: Closing
wtob_at_clic2a31 mytests-mpi $
n-1<24863> ssi:boot:tm: finalizing
n-1<4459> ssi:boot:tm: finalizing
n-1<24863> ssi:boot: Closing
n-1<4459> ssi:boot: Closing
lamnodes reports:
n0 clic2a31.hrz.tu-chemnitz.de:1:origin,this_node
n1 clic2a23.hrz.tu-chemnitz.de:1:
The program I tried to run was a simple hello world:
#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, world! I am %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
Non-MPI programs like
lamexec -np 2 ls
work, but just starting the MPI program locally
./hello
causes the same error:
-----------------------------------------------------------------------------
It seems that there is no lamd running on this host, which indicates
that the LAM/MPI runtime environment is not operating. The LAM/MPI
runtime environment is necessary for MPI programs to run (the MPI
program tired to invoke the "MPI_Init" function).
Please run the "lamboot" command the start the LAM/MPI runtime
environment. See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------
Any idea what's wrong?