LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2003-12-17 08:02:00


On Wed, 17 Dec 2003 USFResearch_at_[hidden] wrote:

> This program works fine on a Sunfire 880. It crashes *sometimes*
> though on a cluster of P4s. I do not have root access (the admin
> installed MPI per my request), and I'm not privy to all the details of
> installation. I do know that it is running OpenMosix and Redhat 8 with
> kernel 2.4.20-openmosix2.

Some questions:

1. Are you using the migration features of Mosix? If processes migrate to
another node, unless Mosix handles the communication properly (last time I
checked, it didn't, but admittedly that was a looong time ago), a crash is
definitely possible.

2. Have you run your MPI program through a memory checking debugger such
as valgrind (on Linux/x86) or bcheck (on Solaris)? It might be worthwhile
to do so (see the FAQ for how); we have found such tools to be *immensely*
useful in finding bugs that you didn't even know that you had. Your
backtrace below shows a problem in delete[], so even though there is a
valid new int[] a few lines above it, there could be heap corruption from
previous memory Badness which can lead to problems later. I'd strongly
recommend checking it out with a good memory-checking debugger.
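
To illustrate the kind of bug I mean, here is a minimal sketch. It is not
taken from your program; the function name and the off-by-one loop are
hypothetical. The comment shows how a write one element past a new int[]
buffer can go unnoticed until some later delete[], and the code shows the
same mistake being caught at the faulty write when a bounds-checked
container is used instead:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical example, not from the original program.
//
// The buggy pattern (shown only in this comment, since running it is
// undefined behavior) would look like:
//
//   int* buf = new int[n];
//   for (std::size_t i = 0; i <= n; ++i) buf[i] = 0;  // writes one past the end
//   ...
//   delete[] buf;  // a crash may show up here, or in an unrelated delete[]
//
// With std::vector::at(), the same off-by-one throws std::out_of_range
// at the faulty access instead of silently damaging the heap.
bool fill_detects_overrun(std::size_t n, bool off_by_one) {
    std::vector<int> buf(n);
    const std::size_t limit = off_by_one ? n + 1 : n;
    try {
        for (std::size_t i = 0; i < limit; ++i) {
            buf.at(i) = static_cast<int>(i);  // bounds-checked write
        }
    } catch (const std::out_of_range&) {
        return true;   // the out-of-bounds write was caught
    }
    return false;      // every write stayed inside the buffer
}
```

A memory-checking debugger reports the raw new int[] version of this
mistake at the offending write, which is usually far more informative than
a backtrace that ends in delete[].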

> Output from lamboot says "LAM 7.0/MPI 2 C++/ROMIO - Indiana University"
> so I assume that is the version being run.

Probably so!

Hope that helps.

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/