I seem to have found a bug in LAM. I don't say that lightly: I am not
assuming LAM must be at fault just because something isn't working,
but I have tried many different scenarios, all with the same result.
lamboot starts everything properly, and lamexec N uname -s gives the
output I would expect. However, when I try to run a simple hello world
MPI program, I get the following error:
$ mpirun C ./h
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n-1077941600).
mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------
It also leaves a core file, h.core, which gives the following backtrace:
(gdb) bt
#0 0x280e5d5d in pthread_key_create () from /usr/lib/libpthread.so.1
#1 0x0805e7ef in ptmalloc_init ()
#2 0x080604ef in malloc_hook_ini ()
#3 0x080603f5 in malloc ()
#4 0x280eb21a in pthread_mutex_init () from /usr/lib/libpthread.so.1
#5 0x280f4cf0 in pthread_setconcurrency () from /usr/lib/libpthread.so.1
#6 0x280f4761 in pthread_setconcurrency () from /usr/lib/libpthread.so.1
#7 0x280f7e76 in pthread_testcancel () from /usr/lib/libpthread.so.1
#8 0x280f8fee in __error () from /usr/lib/libpthread.so.1
#9 0x280e0792 in ?? () from /usr/lib/libpthread.so.1
#10 0x280aa6c5 in find_symdef () from /libexec/ld-elf.so.1
#11 0x280a951b in _rtld () from /libexec/ld-elf.so.1
#12 0x280a8966 in .rtld_start () from /libexec/ld-elf.so.1
bill_at_c1:~
$ lamboot -V
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
Arch: i386-unknown-freebsd5.3
Prefix: /usr/local
Configured by: root
Configured on: Tue Feb 22 16:42:25 HST 2005
Configure host: cluster.uhhcsdept.int
SSI rpi: crtcp lamd sysv tcp usysv
Here is my hello world code, taken from
http://www.eecis.udel.edu/~saunders/courses/372/01f/manual/manual.html:
#include <stdio.h>
#include <mpi.h>
/* NOTE: The MPI_Wtime calls can be placed anywhere between the MPI_Init
   and MPI_Finalize calls. */
int main(int argc, char **argv)
{
    int node;
    double mytime;  /* declare a variable to hold the time returned */

    MPI_Init(&argc, &argv);
    mytime = MPI_Wtime();  /* get the time just before work to be timed */
    MPI_Comm_rank(MPI_COMM_WORLD, &node);
    printf("Hello World from Node %d\n", node);
    mytime = MPI_Wtime() - mytime;  /* get the time just after work is done
                                       and take the difference */
    printf("Timing from node %d is %lf seconds.\n", node, mytime);
    MPI_Finalize();
    return 0;
}
I also tried some old code that ran perfectly under LAM 7.0.1 (iirc),
and it fails in the same way.
I am running FreeBSD 5.3-RELEASE and have 21 nodes on my cluster.
Any advice? I am more than willing to provide whatever additional
information may be required.
Thanks in advance!
Bill