On Wed, 17 Dec 2003 USFResearch_at_[hidden] wrote:
> This program works fine on a Sunfire 880. It crashes *sometimes*
> though on a cluster of P4s. I do not have root access (the admin
> installed MPI per my request), and I'm not privy to all the details of
> installation. I do know that it is running OpenMosix and Redhat 8 with
> kernel 2.4.20-openmosix2.
Some questions:
1. Are you using the migration features of Mosix? If processes migrate to
another node, unless Mosix handles the communication properly (last time I
checked, it didn't, but admittedly that was a looong time ago), a crash is
definitely possible.
2. Have you run your MPI program through a memory checking debugger such
as valgrind (on Linux/x86) or bcheck (on Solaris)? It might be worthwhile
to do so (see the FAQ for how); we have found such tools to be *immensely*
useful in finding bugs that you didn't even know that you had. Your
backtrace below shows a problem in delete[]. Even though there is a
valid new int[] a few lines above it, the heap could already have been
corrupted by earlier memory Badness (e.g., a write past the end of some
other array), and that kind of damage often does not cause a crash until
a later new or delete. I'd strongly recommend checking it out with a
good memory-checking debugger.
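
To illustrate the kind of bug meant in point 2, here is a minimal sketch
(not taken from your program; the function name, the guard value, and the
sizes are all made up for illustration). It models an off-by-one write
that damages whatever sits right after a buffer. In a real program the
damaged neighbor may be the allocator's own bookkeeping, which is why the
crash can show up much later in an unrelated delete[]. To keep the example
well defined, the "neighbor" here is an extra guard element inside the
same allocation rather than real heap metadata.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sentinel value; any pattern unlikely to occur in real
// data would do.
constexpr int kGuard = 0x5AFEC0DE;

// Writes `count` values into a buffer meant to hold `n` ints and reports
// whether the guard slot placed right after the buffer survived.
// The storage really has n + 1 elements, so count == n + 1 is still
// valid C++; it only models an overrun of the n-element payload.
bool guard_intact_after_fill(std::size_t n, std::size_t count) {
    assert(count <= n + 1);            // keep the demonstration in bounds
    std::vector<int> storage(n + 1);   // n payload slots + 1 guard slot
    storage[n] = kGuard;
    for (std::size_t i = 0; i < count; ++i) {
        storage[i] = static_cast<int>(i);  // i == n overwrites the guard
    }
    return storage[n] == kGuard;       // checked "later", like delete[]
}
```

Calling guard_intact_after_fill(8, 8) leaves the guard alone, while
guard_intact_after_fill(8, 9) (the classic "<=" instead of "<" loop bound)
clobbers it. The write itself raises no error; the damage is only noticed
at the later check. A memory-checking debugger reports the bad write at
the point where it happens instead.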
> Output from lamboot says "LAM 7.0/MPI 2 C++/ROMIO - Indiana University"
> so I assume that is the version being run.
Probably so!
Hope that helps.
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/