Hi,
The output you posted shows that lamd is definitely running, but
mpirun and the MPI application are not able to find it. This might be
because:
* You are launching mpirun / the MPI application in a different shell
from the one where lamboot was launched. In that case the PBS environment
variables are not set (note that the tm module requires some $PBS_
environment variables to be set), and hence mpirun is not able to find
the lamd.
* The mpirun / MPI application you are using may not have been compiled
against the same version of LAM as the lamboot you are running.
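
Both possibilities can be checked from the shell in which mpirun is run.
The following is only a rough sketch. It assumes that PBS_ENVIRONMENT and
PBS_JOBID are the $PBS_ variables that matter here (please verify this
against the documentation for your LAM version), and the function names
check_pbs_env and lam_tool_dirs are made up for this example.

```shell
#!/bin/sh

# Report whether the PBS variables assumed to be needed by the tm module
# are set in the current shell. Returns non-zero if any is missing.
check_pbs_env() {
    missing=0
    for var in PBS_ENVIRONMENT PBS_JOBID; do
        # Indirect lookup of the variable named in $var (empty if unset).
        eval "val=\${$var:-}"
        if [ -n "$val" ]; then
            echo "$var=$val"
        else
            echo "$var is not set"
            missing=1
        fi
    done
    return $missing
}

# Print the directory that each LAM command resolves to on PATH, so that
# a lamboot and an mpirun from different installations stand out.
lam_tool_dirs() {
    for tool in lamboot mpirun lamnodes; do
        path=$(command -v "$tool" 2>/dev/null) || path=""
        if [ -n "$path" ]; then
            echo "$tool: $(dirname "$path")"
        else
            echo "$tool: not found on PATH"
        fi
    done
}

check_pbs_env || echo "PBS variables missing: is this the shell where lamboot ran?"
lam_tool_dirs
```

If check_pbs_env reports a missing variable, try running mpirun from the
same PBS job shell in which lamboot was started. If lam_tool_dirs shows
mpirun in a different directory from lamboot (your lamboot output shows
lamd being started from /afs/tucz/project/cluster/LAM/bin), the two
probably belong to different LAM installations, and hello may need to be
recompiled with the matching mpicc.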
Hope this helps.
Nihar
-
- Hi,
-
- I have a problem starting MPI jobs. When I run
-
- mpirun -np 2 hello
-
- I got the following error:
-
- -----------------------------------------------------------------------------
- It seems that there is no lamd running on this host, which indicates that the
- LAM/MPI runtime environment is not operating. The LAM/MPI runtime
- environment is necessary for MPI programs to run (the MPI program tired to
- invoke the "MPI_Init" function).
- Please run the "lamboot" command the start the LAM/MPI runtime
- environment. See the LAM/MPI documentation for how to invoke
- "lamboot" across multiple machines.
- -----------------------------------------------------------------------------
- -----------------------------------------------------------------------------
- It seems that [at least] one of the processes that was started with mpirun
- did not invoke MPI_INIT before quitting (it is possible that more than one
- process did not invoke MPI_INIT -- mpirun was only notified of the first
- one, which was on node n0).
- mpirun can *only* be used with MPI programs (i.e., programs that
- invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
- to run non-MPI programs over the lambooted nodes.
- -----------------------------------------------------------------------------
- But lamboot using tm worked correctly:
-
- lamboot -v -d
-
- n0<24861> ssi:boot:open: opening
- n0<24861> ssi:boot:open: opening boot module globus
- n0<24861> ssi:boot:open: opened boot module globus
- n0<24861> ssi:boot:open: opening boot module rsh
- n0<24861> ssi:boot:open: opened boot module rsh
- n0<24861> ssi:boot:open: opening boot module tm
- n0<24861> ssi:boot:open: opened boot module tm
- n0<24861> ssi:boot:select: initializing boot module tm
- n0<24861> ssi:boot:tm: module initializing
- n0<24861> ssi:boot:tm:verbose: 1000
- n0<24861> ssi:boot:tm:priority: 50
- n0<24861> ssi:boot:select: boot module available: tm, priority: 50
- n0<24861> ssi:boot:select: initializing boot module rsh
- n0<24861> ssi:boot:rsh: module initializing
- n0<24861> ssi:boot:rsh:agent: ssh
- n0<24861> ssi:boot:rsh:username: <same>
- n0<24861> ssi:boot:rsh:verbose: 1000
- n0<24861> ssi:boot:rsh:algorithm: linear
- n0<24861> ssi:boot:rsh:priority: 10
- n0<24861> ssi:boot:select: boot module available: rsh, priority: 10
- n0<24861> ssi:boot:select: initializing boot module globus
- n0<24861> ssi:boot:globus: globus-job-run not found, globus boot will not run
- n0<24861> ssi:boot:select: boot module not available: globus
- n0<24861> ssi:boot:select: finalizing boot module rsh
- n0<24861> ssi:boot:rsh: finalizing
- n0<24861> ssi:boot:select: closing boot module rsh
- n0<24861> ssi:boot:select: finalizing boot module globus
- n0<24861> ssi:boot:globus: finalizing
- n0<24861> ssi:boot:select: closing boot module globus
- n0<24861> ssi:boot:select: selected boot module tm
-
- LAM 7.1a1cvs/MPI 2 C++/ROMIO - Indiana University
-
- n0<24861> ssi:boot:tm: found the following 2 hosts:
- n0<24861> ssi:boot:tm: n0 clic2a31.hrz.tu-chemnitz.de (cpu=1)
- n0<24861> ssi:boot:tm: n1 clic2a23.hrz.tu-chemnitz.de (cpu=1)
- n0<24861> ssi:boot:tm: starting RTE procs
- n0<24861> ssi:boot:base:linear_windowed: starting
- n0<24861> ssi:boot:base:linear_windowed: window size: 5
- n0<24861> ssi:boot:base:server: opening server TCP socket
- n0<24861> ssi:boot:base:server: opened port 42066
- n0<24861> ssi:boot:base:linear_windowed: booting n0 (clic2a31.hrz.tu-chemnitz.de)
- n0<24861> ssi:boot:tm: starting wipe on (clic2a31.hrz.tu-chemnitz.de)
- n0<24861> ssi:boot:tm: starting on n0 (clic2a31.hrz.tu-chemnitz.de): /afs/tucz/project/cluster/LAM/bin/tkill -setsid -d -v
- tkill: setting prefix to (null)
- tkill: setting suffix to (null)
- tkill: got killname back: /tmp/lam-wtob_at_[hidden]/lam-killfile
- tkill: removing socket file ...
- tkill: socket file: /tmp/lam-wtob_at_[hidden]/lam-kernel-socketd
- tkill: removing IO daemon socket file ...
- tkill: IO daemon socket file: /tmp/lam-wtob_at_[hidden]/lam-io-socket
- tkill: f_kill = "/tmp/lam-wtob_at_[hidden]/lam-killfile"
- tkill: nothing to kill: "/tmp/lam-wtob_at_[hidden]/lam-killfile"
- n0<24861> ssi:boot:tm: successfully launched on n0 (clic2a31.hrz.tu-chemnitz.de)
- n0<24861> ssi:boot:tm: waiting for completion on n0 (clic2a31.hrz.tu-chemnitz.de)
- n0<24861> ssi:boot:tm: finished on n0 (clic2a31.hrz.tu-chemnitz.de)
- n0<24861> ssi:boot:tm: starting lamd on (clic2a31.hrz.tu-chemnitz.de)
- n0<24861> ssi:boot:tm: starting on n0 (clic2a31.hrz.tu-chemnitz.de): /afs/tucz/project/cluster/LAM/bin/lamd -H 134.109.249.7 -P 42066 -n 0 -o 0 -d
- n0<24861> ssi:boot:tm: successfully launched on n0 (clic2a31.hrz.tu-chemnitz.de)
- n0<24861> ssi:boot:base:linear_windowed: booting n1 (clic2a23.hrz.tu-chemnitz.de)
- n0<24861> ssi:boot:tm: starting wipe on (clic2a23.hrz.tu-chemnitz.de)
- n0<24861> ssi:boot:tm: starting on n1 (clic2a23.hrz.tu-chemnitz.de): /afs/tucz/project/cluster/LAM/bin/tkill -setsid -d -v
- n0<24861> ssi:boot:tm: successfully launched on n1 (clic2a23.hrz.tu-chemnitz.de)
- n0<24861> ssi:boot:tm: waiting for completion on n1 (clic2a23.hrz.tu-chemnitz.de)
- tkill: setting prefix to (null)
- tkill: setting suffix to (null)
- n-1<24863> ssi:boot:open: opening
- n-1<24863> ssi:boot:open: opening boot module globus
- n-1<24863> ssi:boot:open: opened boot module globus
- n-1<24863> ssi:boot:open: opening boot module rsh
- n-1<24863> ssi:boot:open: opened boot module rsh
- n-1<24863> ssi:boot:open: opening boot module tm
- n-1<24863> ssi:boot:open: opened boot module tm
- n-1<24863> ssi:boot:select: initializing boot module tm
- n-1<24863> ssi:boot:tm: module initializing
- n-1<24863> ssi:boot:tm:verbose: 1000
- n-1<24863> ssi:boot:tm:priority: 50
- n-1<24863> ssi:boot:select: boot module available: tm, priority: 50
- n-1<24863> ssi:boot:select: initializing boot module rsh
- n-1<24863> ssi:boot:rsh: module initializing
- n-1<24863> ssi:boot:rsh:agent: ssh
- n-1<24863> ssi:boot:rsh:username: <same>
- n-1<24863> ssi:boot:rsh:verbose: 1000
- n-1<24863> ssi:boot:rsh:algorithm: linear
- n-1<24863> ssi:boot:rsh:priority: 10
- n-1<24863> ssi:boot:select: boot module available: rsh, priority: 10
- n-1<24863> ssi:boot:select: initializing boot module globus
- n-1<24863> ssi:boot:globus: globus-job-run not found, globus boot will not run
- n-1<24863> ssi:boot:select: boot module not available: globus
- n-1<24863> ssi:boot:select: finalizing boot module rsh
- tkill: got killname back: /tmp/lam-wtob_at_[hidden]/lam-killfile
- n-1<24863> ssi:boot:rsh: finalizing
- tkill: removing socket file ...
- tkill: socket file: /tmp/lam-wtob_at_[hidden]/lam-kernel-socketd
- tkill: removing IO daemon socket file ...
- tkill: IO daemon socket file: /tmp/lam-wtob_at_[hidden]/lam-io-socket
- tkill: f_kill = "/tmp/lam-wtob_at_[hidden]/lam-killfile"
- n-1<24863> ssi:boot:select: closing boot module rsh
- tkill: nothing to kill: "/tmp/lam-wtob_at_[hidden]/lam-killfile"
- n-1<24863> ssi:boot:select: finalizing boot module globus
- n-1<24863> ssi:boot:globus: finalizing
- n-1<24863> ssi:boot:select: closing boot module globus
- n-1<24863> ssi:boot:select: selected boot module tm
- n0<24861> ssi:boot:tm: finished on n1 (clic2a23.hrz.tu-chemnitz.de)
- n0<24861> ssi:boot:tm: starting lamd on (clic2a23.hrz.tu-chemnitz.de)
- n0<24861> ssi:boot:tm: starting on n1 (clic2a23.hrz.tu-chemnitz.de): /afs/tucz/project/cluster/LAM/bin/lamd -H 134.109.249.7 -P 42066 -n 1 -o 0 -d
- n0<24861> ssi:boot:tm: successfully launched on n1 (clic2a23.hrz.tu-chemnitz.de)
- n0<24861> ssi:boot:base:linear_windowed: finished launching
- n0<24861> ssi:boot:base:server: expecting connection from finite list
- n0<24861> ssi:boot:base:server: got connection from 134.109.249.7
- n0<24861> ssi:boot:base:server: this connection is expected (n0)
- n0<24861> ssi:boot:base:server: remote lamd is at 134.109.249.7:32820
- n0<24861> ssi:boot:base:server: expecting connection from finite list
- n-1<4459> ssi:boot:open: opening
- n-1<4459> ssi:boot:open: opening boot module globus
- n-1<4459> ssi:boot:open: opened boot module globus
- n-1<4459> ssi:boot:open: opening boot module rsh
- n-1<4459> ssi:boot:open: opened boot module rsh
- n-1<4459> ssi:boot:open: opening boot module tm
- n-1<4459> ssi:boot:open: opened boot module tm
- n-1<4459> ssi:boot:select: initializing boot module tm
- n-1<4459> ssi:boot:tm: module initializing
- n-1<4459> ssi:boot:tm:verbose: 1000
- n-1<4459> ssi:boot:tm:priority: 50
- n-1<4459> ssi:boot:select: boot module available: tm, priority: 50
- n-1<4459> ssi:boot:select: initializing boot module rsh
- n-1<4459> ssi:boot:rsh: module initializing
- n-1<4459> ssi:boot:rsh:agent: ssh
- n-1<4459> ssi:boot:rsh:username: <same>
- n-1<4459> ssi:boot:rsh:verbose: 1000
- n-1<4459> ssi:boot:rsh:algorithm: linear
- n-1<4459> ssi:boot:rsh:priority: 10
- n-1<4459> ssi:boot:select: boot module available: rsh, priority: 10
- n-1<4459> ssi:boot:select: initializing boot module globus
- n-1<4459> ssi:boot:globus: globus-job-run not found, globus boot will not run
- n-1<4459> ssi:boot:select: boot module not available: globus
- n-1<4459> ssi:boot:select: finalizing boot module rsh
- n-1<4459> ssi:boot:rsh: finalizing
- n-1<4459> ssi:boot:select: closing boot module rsh
- n-1<4459> ssi:boot:select: finalizing boot module globus
- n-1<4459> ssi:boot:globus: finalizing
- n-1<4459> ssi:boot:select: closing boot module globus
- n-1<4459> ssi:boot:select: selected boot module tm
- n0<24861> ssi:boot:base:server: got connection from 134.109.249.6
- n0<24861> ssi:boot:base:server: this connection is expected (n1)
- n0<24861> ssi:boot:base:server: remote lamd is at 134.109.249.6:32826
- n0<24861> ssi:boot:base:server: closing server socket
- n0<24861> ssi:boot:base:server: connecting to lamd at 134.109.249.7:42076
- n0<24861> ssi:boot:base:server: connected
- n0<24861> ssi:boot:base:server: sending number of links (2)
- n0<24861> ssi:boot:base:server: sending info: n0 (clic2a31.hrz.tu-chemnitz.de)
- n0<24861> ssi:boot:base:server: sending info: n1 (clic2a23.hrz.tu-chemnitz.de)
- n0<24861> ssi:boot:base:server: finished sending
- n0<24861> ssi:boot:base:server: disconnected from 134.109.249.7:42076
- n0<24861> ssi:boot:base:server: connecting to lamd at 134.109.249.6:41719
- n0<24861> ssi:boot:base:server: connected
- n0<24861> ssi:boot:base:server: sending number of links (2)
- n0<24861> ssi:boot:base:server: sending info: n0 (clic2a31.hrz.tu-chemnitz.de)
- n0<24861> ssi:boot:base:server: sending info: n1 (clic2a23.hrz.tu-chemnitz.de)
- n0<24861> ssi:boot:base:server: finished sending
- n0<24861> ssi:boot:base:server: disconnected from 134.109.249.6:41719
- n0<24861> ssi:boot:base:linear_windowed: finished
- n0<24861> ssi:boot:tm: all RTE procs started
- n0<24861> ssi:boot:tm: finalizing
- n0<24861> ssi:boot: Closing
- wtob_at_clic2a31 mytests-mpi $ n-1<24863> ssi:boot:tm: finalizing
- n-1<4459> ssi:boot:tm: finalizing
- n-1<24863> ssi:boot: Closing
- n-1<4459> ssi:boot: Closing
-
- lamnodes says
-
- n0 clic2a31.hrz.tu-chemnitz.de:1:origin,this_node
- n1 clic2a23.hrz.tu-chemnitz.de:1:
-
- The program I tried to run was a simple hello world.
-
- #include <stdio.h>
- #include <mpi.h>
-
- int main(int argc, char *argv[])
- {
-     int rank, size;
-
-     MPI_Init(&argc, &argv);
-     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
-     MPI_Comm_size(MPI_COMM_WORLD, &size);
-
-     printf("Hello, world! I am %d of %d\n", rank, size);
-
-     MPI_Finalize();
-
-     return 0;
- }
-
- Non-MPI programs like
- lamexec -np 2 ls
- work, but just starting the MPI program locally
- ./hello
- causes the same error
-
- -----------------------------------------------------------------------------
- It seems that there is no lamd running on this host, which indicates
- that the LAM/MPI runtime environment is not operating. The LAM/MPI
- runtime environment is necessary for MPI programs to run (the MPI
- program tired to invoke the "MPI_Init" function).
-
- Please run the "lamboot" command the start the LAM/MPI runtime
- environment. See the LAM/MPI documentation for how to invoke
- "lamboot" across multiple machines.
- -----------------------------------------------------------------------------
-
- Any idea what's wrong?
-
- _______________________________________________
- This list is archived at http://www.lam-mpi.org/MailArchives/lam/
-
Powered by LAM/MPI...
---------------------------------------
Nihar Sanghvi
LAM/MPI Team
Graduate Student (Indiana University)
http://www.lam-mpi.org
--------------------------------------