LAM/MPI General User's Mailing List Archives

From: Tobias Wenzel (wtob_at_[hidden])
Date: 2003-11-12 11:36:04


Hi,

I have a problem starting MPI jobs. When I run

  mpirun -np 2 hello

I get the following error:

-----------------------------------------------------------------------------
It seems that there is no lamd running on this host, which indicates that the
LAM/MPI runtime environment is not operating. The LAM/MPI runtime
environment is necessary for MPI programs to run (the MPI program tired to
invoke the "MPI_Init" function).
Please run the "lamboot" command the start the LAM/MPI runtime
environment. See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with mpirun
did not invoke MPI_INIT before quitting (it is possible that more than one
process did not invoke MPI_INIT -- mpirun was only notified of the first
one, which was on node n0).
mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------
But lamboot using the tm boot module worked correctly:

lamboot -v -d

n0<24861> ssi:boot:open: opening
n0<24861> ssi:boot:open: opening boot module globus
n0<24861> ssi:boot:open: opened boot module globus
n0<24861> ssi:boot:open: opening boot module rsh
n0<24861> ssi:boot:open: opened boot module rsh
n0<24861> ssi:boot:open: opening boot module tm
n0<24861> ssi:boot:open: opened boot module tm
n0<24861> ssi:boot:select: initializing boot module tm
n0<24861> ssi:boot:tm: module initializing
n0<24861> ssi:boot:tm:verbose: 1000
n0<24861> ssi:boot:tm:priority: 50
n0<24861> ssi:boot:select: boot module available: tm, priority: 50
n0<24861> ssi:boot:select: initializing boot module rsh
n0<24861> ssi:boot:rsh: module initializing
n0<24861> ssi:boot:rsh:agent: ssh
n0<24861> ssi:boot:rsh:username: <same>
n0<24861> ssi:boot:rsh:verbose: 1000
n0<24861> ssi:boot:rsh:algorithm: linear
n0<24861> ssi:boot:rsh:priority: 10
n0<24861> ssi:boot:select: boot module available: rsh, priority: 10
n0<24861> ssi:boot:select: initializing boot module globus
n0<24861> ssi:boot:globus: globus-job-run not found, globus boot will not run
n0<24861> ssi:boot:select: boot module not available: globus
n0<24861> ssi:boot:select: finalizing boot module rsh
n0<24861> ssi:boot:rsh: finalizing
n0<24861> ssi:boot:select: closing boot module rsh
n0<24861> ssi:boot:select: finalizing boot module globus
n0<24861> ssi:boot:globus: finalizing
n0<24861> ssi:boot:select: closing boot module globus
n0<24861> ssi:boot:select: selected boot module tm

LAM 7.1a1cvs/MPI 2 C++/ROMIO - Indiana University

n0<24861> ssi:boot:tm: found the following 2 hosts:
n0<24861> ssi:boot:tm: n0 clic2a31.hrz.tu-chemnitz.de (cpu=1)
n0<24861> ssi:boot:tm: n1 clic2a23.hrz.tu-chemnitz.de (cpu=1)
n0<24861> ssi:boot:tm: starting RTE procs
n0<24861> ssi:boot:base:linear_windowed: starting
n0<24861> ssi:boot:base:linear_windowed: window size: 5
n0<24861> ssi:boot:base:server: opening server TCP socket
n0<24861> ssi:boot:base:server: opened port 42066
n0<24861> ssi:boot:base:linear_windowed: booting n0 (clic2a31.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: starting wipe on (clic2a31.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: starting on n0 (clic2a31.hrz.tu-chemnitz.de): /afs/tucz/project/cluster/LAM/bin/tkill -setsid -d -v
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-wtob_at_[hidden]/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-wtob_at_[hidden]/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-wtob_at_[hidden]/lam-io-socket
tkill: f_kill = "/tmp/lam-wtob_at_[hidden]/lam-killfile"
tkill: nothing to kill: "/tmp/lam-wtob_at_[hidden]/lam-killfile"
n0<24861> ssi:boot:tm: successfully launched on n0 (clic2a31.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: waiting for completion on n0 (clic2a31.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: finished on n0 (clic2a31.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: starting lamd on (clic2a31.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: starting on n0 (clic2a31.hrz.tu-chemnitz.de): /afs/tucz/project/cluster/LAM/bin/lamd -H 134.109.249.7 -P 42066 -n 0 -o 0 -d
n0<24861> ssi:boot:tm: successfully launched on n0 (clic2a31.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:base:linear_windowed: booting n1 (clic2a23.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: starting wipe on (clic2a23.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: starting on n1 (clic2a23.hrz.tu-chemnitz.de): /afs/tucz/project/cluster/LAM/bin/tkill -setsid -d -v
n0<24861> ssi:boot:tm: successfully launched on n1 (clic2a23.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: waiting for completion on n1 (clic2a23.hrz.tu-chemnitz.de)
tkill: setting prefix to (null)
tkill: setting suffix to (null)
n-1<24863> ssi:boot:open: opening
n-1<24863> ssi:boot:open: opening boot module globus
n-1<24863> ssi:boot:open: opened boot module globus
n-1<24863> ssi:boot:open: opening boot module rsh
n-1<24863> ssi:boot:open: opened boot module rsh
n-1<24863> ssi:boot:open: opening boot module tm
n-1<24863> ssi:boot:open: opened boot module tm
n-1<24863> ssi:boot:select: initializing boot module tm
n-1<24863> ssi:boot:tm: module initializing
n-1<24863> ssi:boot:tm:verbose: 1000
n-1<24863> ssi:boot:tm:priority: 50
n-1<24863> ssi:boot:select: boot module available: tm, priority: 50
n-1<24863> ssi:boot:select: initializing boot module rsh
n-1<24863> ssi:boot:rsh: module initializing
n-1<24863> ssi:boot:rsh:agent: ssh
n-1<24863> ssi:boot:rsh:username: <same>
n-1<24863> ssi:boot:rsh:verbose: 1000
n-1<24863> ssi:boot:rsh:algorithm: linear
n-1<24863> ssi:boot:rsh:priority: 10
n-1<24863> ssi:boot:select: boot module available: rsh, priority: 10
n-1<24863> ssi:boot:select: initializing boot module globus
n-1<24863> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<24863> ssi:boot:select: boot module not available: globus
n-1<24863> ssi:boot:select: finalizing boot module rsh
tkill: got killname back: /tmp/lam-wtob_at_[hidden]/lam-killfile
n-1<24863> ssi:boot:rsh: finalizing
tkill: removing socket file ...
tkill: socket file: /tmp/lam-wtob_at_[hidden]/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-wtob_at_[hidden]/lam-io-socket
tkill: f_kill = "/tmp/lam-wtob_at_[hidden]/lam-killfile"
n-1<24863> ssi:boot:select: closing boot module rsh
tkill: nothing to kill: "/tmp/lam-wtob_at_[hidden]/lam-killfile"
n-1<24863> ssi:boot:select: finalizing boot module globus
n-1<24863> ssi:boot:globus: finalizing
n-1<24863> ssi:boot:select: closing boot module globus
n-1<24863> ssi:boot:select: selected boot module tm
n0<24861> ssi:boot:tm: finished on n1 (clic2a23.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: starting lamd on (clic2a23.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:tm: starting on n1 (clic2a23.hrz.tu-chemnitz.de): /afs/tucz/project/cluster/LAM/bin/lamd -H 134.109.249.7 -P 42066 -n 1 -o 0 -d
n0<24861> ssi:boot:tm: successfully launched on n1 (clic2a23.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:base:linear_windowed: finished launching
n0<24861> ssi:boot:base:server: expecting connection from finite list
n0<24861> ssi:boot:base:server: got connection from 134.109.249.7
n0<24861> ssi:boot:base:server: this connection is expected (n0)
n0<24861> ssi:boot:base:server: remote lamd is at 134.109.249.7:32820
n0<24861> ssi:boot:base:server: expecting connection from finite list
n-1<4459> ssi:boot:open: opening
n-1<4459> ssi:boot:open: opening boot module globus
n-1<4459> ssi:boot:open: opened boot module globus
n-1<4459> ssi:boot:open: opening boot module rsh
n-1<4459> ssi:boot:open: opened boot module rsh
n-1<4459> ssi:boot:open: opening boot module tm
n-1<4459> ssi:boot:open: opened boot module tm
n-1<4459> ssi:boot:select: initializing boot module tm
n-1<4459> ssi:boot:tm: module initializing
n-1<4459> ssi:boot:tm:verbose: 1000
n-1<4459> ssi:boot:tm:priority: 50
n-1<4459> ssi:boot:select: boot module available: tm, priority: 50
n-1<4459> ssi:boot:select: initializing boot module rsh
n-1<4459> ssi:boot:rsh: module initializing
n-1<4459> ssi:boot:rsh:agent: ssh
n-1<4459> ssi:boot:rsh:username: <same>
n-1<4459> ssi:boot:rsh:verbose: 1000
n-1<4459> ssi:boot:rsh:algorithm: linear
n-1<4459> ssi:boot:rsh:priority: 10
n-1<4459> ssi:boot:select: boot module available: rsh, priority: 10
n-1<4459> ssi:boot:select: initializing boot module globus
n-1<4459> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<4459> ssi:boot:select: boot module not available: globus
n-1<4459> ssi:boot:select: finalizing boot module rsh
n-1<4459> ssi:boot:rsh: finalizing
n-1<4459> ssi:boot:select: closing boot module rsh
n-1<4459> ssi:boot:select: finalizing boot module globus
n-1<4459> ssi:boot:globus: finalizing
n-1<4459> ssi:boot:select: closing boot module globus
n-1<4459> ssi:boot:select: selected boot module tm
n0<24861> ssi:boot:base:server: got connection from 134.109.249.6
n0<24861> ssi:boot:base:server: this connection is expected (n1)
n0<24861> ssi:boot:base:server: remote lamd is at 134.109.249.6:32826
n0<24861> ssi:boot:base:server: closing server socket
n0<24861> ssi:boot:base:server: connecting to lamd at 134.109.249.7:42076
n0<24861> ssi:boot:base:server: connected
n0<24861> ssi:boot:base:server: sending number of links (2)
n0<24861> ssi:boot:base:server: sending info: n0 (clic2a31.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:base:server: sending info: n1 (clic2a23.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:base:server: finished sending
n0<24861> ssi:boot:base:server: disconnected from 134.109.249.7:42076
n0<24861> ssi:boot:base:server: connecting to lamd at 134.109.249.6:41719
n0<24861> ssi:boot:base:server: connected
n0<24861> ssi:boot:base:server: sending number of links (2)
n0<24861> ssi:boot:base:server: sending info: n0 (clic2a31.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:base:server: sending info: n1 (clic2a23.hrz.tu-chemnitz.de)
n0<24861> ssi:boot:base:server: finished sending
n0<24861> ssi:boot:base:server: disconnected from 134.109.249.6:41719
n0<24861> ssi:boot:base:linear_windowed: finished
n0<24861> ssi:boot:tm: all RTE procs started
n0<24861> ssi:boot:tm: finalizing
n0<24861> ssi:boot: Closing
wtob_at_clic2a31 mytests-mpi $
n-1<24863> ssi:boot:tm: finalizing
n-1<4459> ssi:boot:tm: finalizing
n-1<24863> ssi:boot: Closing
n-1<4459> ssi:boot: Closing
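
Since the debug output is long, here is a small sketch for pulling out just the lamd endpoints that reported back to lamboot. It assumes the output was saved to a file first; the file name `lamboot.log` and the function name `lamd_endpoints` are made up for illustration and are not part of LAM. It only matches the "remote lamd is at" lines seen in the log above and relies on `grep -o`, which GNU and BSD grep provide.

```shell
# Print the address:port of every lamd that reported back to lamboot.
# Reads a saved "lamboot -v -d" log on standard input.
# Assumption: the log contains "remote lamd is at ADDR:PORT" messages,
# as in the output quoted in this post.
lamd_endpoints() {
    # -o prints only the matching part, so the message is found even
    # when several log messages share one physical line.
    grep -o 'remote lamd is at [0-9.]*:[0-9]*' | sed 's/.* //'
}

# Example (hypothetical file name):
#   lamboot -v -d > lamboot.log 2>&1
#   lamd_endpoints < lamboot.log
```

For the log above this should list one endpoint per booted node, which makes it quick to confirm that both daemons connected back.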

lamnodes reports:

  n0 clic2a31.hrz.tu-chemnitz.de:1:origin,this_node
  n1 clic2a23.hrz.tu-chemnitz.de:1:

The program I tried to run was a simple hello world:

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello, world! I am %d of %d\n", rank, size);

    MPI_Finalize();

    return 0;
  }

Non-MPI programs such as
  lamexec -np 2 ls
work, but starting the MPI program locally with
  ./hello
causes the same error:

-----------------------------------------------------------------------------
It seems that there is no lamd running on this host, which indicates
that the LAM/MPI runtime environment is not operating. The LAM/MPI
runtime environment is necessary for MPI programs to run (the MPI
program tired to invoke the "MPI_Init" function).

Please run the "lamboot" command the start the LAM/MPI runtime
environment. See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------
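
For reference, the checks described in this post can be repeated in one go with a small wrapper like the following. This is only a sketch: `run_step` is a made-up helper name, not a LAM command, and the LAM commands in the usage comment are the ones already shown above.

```shell
# Run one command, echoing it first and reporting its exit status afterwards.
# run_step is a hypothetical helper, not part of LAM/MPI.
run_step() {
    printf '$ %s\n' "$*"
    # Capture the status without letting a failure abort the calling shell.
    if "$@"; then status=0; else status=$?; fi
    printf '[exit status: %d]\n' "$status"
    return "$status"
}

# Usage with the commands from this post (commented out so the block can
# be sourced on a machine without LAM installed):
#   run_step lamnodes
#   run_step lamexec -np 2 ls
#   run_step mpirun -np 2 hello
#   run_step ./hello
```

Printing the exit status next to each command makes it easier to see which step fails when the combined output is pasted into a mail.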

Any idea what's wrong?