On Oct 2, 2003, at 9:20 AM, Luke Palmer wrote:
> We have an MPI program that keeps crashing. At first it segfaulted
> sometimes when it was run, so we compiled it with debugging symbols,
> runtime error checking, and no optimization. We monitored it by
> starting gdb for each rank. One of the ranks will always die, with
> this message:
>
> Program received signal SIGUSR2, User defined signal 2.
> [Switching to Thread 16384 (LWP 4518)]
> 0x4011a698 in read () from /lib/i686/libpthread.so.0
>
> (gdb) where
>
> #0 0x4011a698 in read () from /lib/i686/libpthread.so.0
> #1 0x400b744c in __dtors_list_end() from /usr/lib/libmpi.so.0
Note that gdb stops and prints a message like the one above any time
the process being debugged receives a signal. LAM uses SIGUSR2 for
synchronization, so it is expected that the process being debugged
would receive such a signal, and it does not mean the process has died.
You can simply tell gdb to continue and all will be well. So the above
output is most likely a red herring - just LAM doing its thing.
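If typing "continue" at every SIGUSR2 gets tedious, gdb's "handle"
command can be told to pass the signal to the program without stopping
or printing anything. The following is only a rough sketch of building
such a gdb command line from Python; the helper name gdb_command and
the program path ./my_mpi_app are made up for illustration and are not
part of LAM or gdb.

```python
import shlex


def gdb_command(program, signal_name="SIGUSR2"):
    """Build a gdb command line that passes `signal_name` through quietly.

    `gdb_command` is a hypothetical helper, not part of gdb or LAM.
    "handle SIG nostop noprint pass" is a standard gdb command: do not
    stop, do not print a message, and still deliver the signal.
    """
    handle = f"handle {signal_name} nostop noprint pass"
    # -ex runs a gdb command at startup; --args marks the program to debug.
    return ["gdb", "-ex", handle, "--args", program]


# "./my_mpi_app" is a placeholder path for the program being debugged.
print(shlex.join(gdb_command("./my_mpi_app")))
```

The same "handle" line can also be typed at the (gdb) prompt before
running the program, or placed in a .gdbinit file.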
Your best bet is probably to unlimit your core file size and let nature
take its course, if you will. Analyze the core file with gdb after the
crash and you should be able to see where the failures are occurring.
And this way, you don't have to deal with gdb prompting you at every
signal.
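In a shell, the limit is normally raised with "ulimit -c unlimited"
(sh/bash) or "unlimit coredumpsize" (csh/tcsh). For what it's worth, the
same thing can be sketched from Python with the standard resource
module. The function names and the example file names below are
hypothetical; the sketch assumes a Unix system and only raises the soft
limit as far as the hard limit allows.

```python
import resource


def unlimit_core():
    """Raise the soft core-size limit to the hard limit.

    Hypothetical helper. Raising the soft limit up to the hard limit is
    allowed for unprivileged processes; the new limit is inherited by
    any child processes started afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def postmortem_command(program, core_file):
    """Build a gdb command line that prints a backtrace from a core file.

    -batch makes gdb exit after running the -ex commands.
    """
    return ["gdb", "-batch", "-ex", "bt", program, core_file]


if __name__ == "__main__":
    print(unlimit_core())
    # "./my_mpi_app" and "core.4518" are placeholder names.
    print(postmortem_command("./my_mpi_app", "core.4518"))
```

Because the limit is inherited, it has to be in effect in whatever
environment actually starts the MPI processes, not just in the shell
where you later run gdb.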
> Our setup is Redhat 9 with lam 7.0 built with intel compilers on SMP
> Xeon boxes.
I would be very careful with the combination of RH 9 and the Intel
compilers. I am not sure if it is still the case, but for a long time
Intel stated that the combination would not work. At the very least,
the combination is not supported. There are rumors that you can hack
the Intel compilers to work, but compilers are just one of those things
you really shouldn't hack. Just because the compiler produces an
object file doesn't mean it works.
We do not support running LAM in this configuration (RH 9 and Intel
compilers). You might try compiling LAM without thread support.
Perhaps the bad mojo between the new thread library and Intel's
compilers is causing the problems. Perhaps not.
Hope this helps,
Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/