Hello,
We have an MPI program that keeps crashing. At first it segfaulted
intermittently, so we recompiled it with debugging symbols, runtime
error checking, and no optimization. We then monitored it by starting
gdb for each rank. One of the ranks always dies, with this message:
Program received signal SIGUSR2, User defined signal 2.
[Switching to Thread 16384 (LWP 4518)]
0x4011a698 in read () from /lib/i686/libpthread.so.0
(gdb) where
#0 0x4011a698 in read () from /lib/i686/libpthread.so.0
#1 0x400b744c in __dtors_list_end() from /usr/lib/libmpi.so.0
Our setup is Red Hat 9 with LAM 7.0, built with the Intel compilers, on
SMP Xeon boxes.
Now, I have read that LAM uses SIGUSR2 internally, and that pthreads
does as well. Does anyone know what could be causing this problem, and
how we could fix it?
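One thing we have not ruled out yet: as far as I know, gdb stops on
SIGUSR2 by default, so the stop shown above may only be gdb intercepting
a signal that the library handles itself, rather than the actual crash.
A minimal sketch of what we could try in each rank's gdb session,
assuming SIGUSR2 really is used internally and is safe to pass through
(that is an assumption on our part, not something we have confirmed):

```
(gdb) info signals SIGUSR2
(gdb) handle SIGUSR2 nostop noprint pass
(gdb) continue
```

The first command shows how gdb currently treats the signal. The second
tells gdb to deliver SIGUSR2 to the program without stopping or printing,
so that any later stop should be the real fault.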
Thanks
-Luke