Dear LAM folks,
We are fighting a strange problem on a 25-node, 100-processor
Solaris 10 cluster.
If we start an MPI application right after lamboot, it sometimes fails
with the following message:
valiron_at_n11 ~ > lamboot $OAR_NODEFILE ; mpirun C rotate
LAM 7.1.1/ROMIO - Indiana University
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).
mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------
If we insert a short sleep between lamboot and mpirun, it works fine:
valiron_at_n11 ~ > lamboot $OAR_NODEFILE ; sleep 10; mpirun C rotate
LAM 7.1.1/ROMIO - Indiana University
NPROCS 100
buf_size   sent/node   iter_time (s)   rate/node (MB/s)
       8       10000        0.000027              0.563
      16       10000        0.000028              1.107
      32       10000        0.000029              2.094
      64       10000        0.000027              4.445
     128       10000        0.000028              8.691
     256       10000        0.000030             16.494
     512       10000        0.000040             24.327
    1024        1000        0.000040             48.884
    2048        1000        0.000044             88.127
    4096        1000        0.000056            139.295
    8192        1000        0.000076            205.768
valiron_at_n11 ~ >
This behaviour is very annoying when scripting batch jobs. We are very
proud of our fast OAR batch system, which starts a 100-process job in a
second, and we don't want to introduce unneeded delays. A polling
workaround we are considering is sketched below.
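A minimal sketch of that workaround, under assumptions of ours rather
than LAM guarantees: that lamnodes prints one line per booted node, that
$OAR_NODEFILE lists one line per processor (hence the sort -u), and that
a 30-second cap is acceptable:

expected=`sort -u $OAR_NODEFILE | wc -l`
lamboot $OAR_NODEFILE
# Poll until lamnodes reports every node from the hostfile
# (the 30-second cap is our choice, not a LAM default).
i=0
while [ `lamnodes | wc -l` -lt $expected ]; do
    sleep 1
    i=`expr $i + 1`
    if [ $i -ge 30 ]; then
        echo "LAM universe still incomplete after 30s" 1>&2
        exit 1
    fi
done
mpirun C rotate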
I had already run into similar trouble with lamhalt when the LAM
universe was defined on a temporary filesystem:
mkdir -p $TMPDIR
lamboot bhost_file ; mpirun C application ; lamhalt
rm -rf $TMPDIR
In the latter case the LAM daemons remained hanging around, and Jeff
explained that lamhalt is asynchronous and returns *before* all daemons
are properly killed. A guarded cleanup is sketched below.
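A minimal sketch, assuming the daemon shows up as "lamd" in the process
table and that pgrep is available (it is on Solaris 10). Note it only
watches the local node; remote daemons would need the same check on
each host:

lamhalt
# Wait for the local lamd to exit before removing the session dir.
i=0
while pgrep -u $USER lamd > /dev/null 2>&1; do
    sleep 1
    i=`expr $i + 1`
    [ $i -ge 30 ] && break    # give up after 30s
done
rm -rf $TMPDIR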
Is lamboot similarly asynchronous?
Is there a *safe* way to be sure the MPI universe has properly started
before issuing the mpirun command?
Pierre.
P.S. We never ran into this lamboot problem on small clusters.
--
Support the SAUVONS LA RECHERCHE movement:
http://recherche-en-danger.apinc.org/
_/_/_/_/ _/ _/ Dr. Pierre VALIRON
_/ _/ _/ _/ Laboratoire d'Astrophysique
_/ _/ _/ _/ Observatoire de Grenoble / UJF
_/_/_/_/ _/ _/ BP 53 F-38041 Grenoble Cedex 9 (France)
_/ _/ _/ http://www-laog.obs.ujf-grenoble.fr/~valiron/
_/ _/ _/ Mail: Pierre.Valiron_at_[hidden]
_/ _/ _/ Phone: +33 4 7651 4787 Fax: +33 4 7644 8821
_/ _/_/