On Jun 12, 2005, at 2:37 PM, matsuit wrote:
> We are currently testing a large parallel software package (a
> hydrology model) with a combination of LAM-MPI, OS X, and the Absoft
> Fortran compiler.
>
> During a run, MPI (broadcast or scatter) works only periodically. For
> example, it succeeds at the first time step, fails at the second,
> fails at the third, succeeds at the fourth time step...
>
> As a result, a portion of the output domain contains strange values
> from time to time.
>
> I checked the code that calls "MPI_SCATTERV" on these data; it does
> scatter at every time step. So why does it fail sometimes?
>
> One thing to note: this software was developed in an MPICH, Linux,
> and Absoft compiler environment. But MPICH and LAM should basically
> be portable.
Since you are running on OS X, a quick note: LAM 7.1.1 (and previous
versions, actually) has a small bug in the usysv transport engine.
I'd recommend either upgrading to LAM 7.1.2beta or using the sysv/tcp
transport engines.
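If it helps, a sketch of selecting a different transport at launch
time (assuming LAM 7.x's SSI parameter syntax; check the lamssi_rpi
man page for the exact names, and substitute your own application):

```
# Run on all nodes with the tcp RPI instead of the default usysv
# ("my_app" is a placeholder for your hydrology model binary).
mpirun -ssi rpi tcp C ./my_app

# Or the sysv shared-memory transport:
mpirun -ssi rpi sysv C ./my_app
```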
I'm a bit confused by what you mean by "failed". Are you changing
the error handler on the communicator to MPI_ERRORS_RETURN and seeing
failure return codes? Or is the output data just incorrect? If you
are seeing errors returned (in ierr, that is), I'd be interested to
know what the error code being returned is. If you are running into
incorrect data, there are a number of possibilities. First, make
sure you aren't using the usysv transport engine, since the bug is a
race condition when nodes are doing heavy communication. Second, the
problem could be "memory badness" within your application. You might
want to try Guard Malloc (see "man libgmalloc" for details), a memory
checker for OS X.
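To see actual error codes rather than aborts, something like the
following sketch in Fortran may help (the buffer, count, and root
variables are placeholders for whatever your application passes to
MPI_SCATTERV; adjust the datatype to match your data):

```fortran
      include 'mpif.h'
      integer ierr, ierr2, reslen
      character*(MPI_MAX_ERROR_STRING) errstr

c     Switch MPI_COMM_WORLD from the default MPI_ERRORS_ARE_FATAL
c     to MPI_ERRORS_RETURN so errors come back in ierr.
      call MPI_ERRHANDLER_SET(MPI_COMM_WORLD, MPI_ERRORS_RETURN, ierr)

c     Placeholder arguments -- use your application's own variables.
      call MPI_SCATTERV(sendbuf, counts, displs,
     &                  MPI_DOUBLE_PRECISION,
     &                  recvbuf, recvcount, MPI_DOUBLE_PRECISION,
     &                  root, MPI_COMM_WORLD, ierr)

      if (ierr .ne. MPI_SUCCESS) then
c        Translate the code into a human-readable message.
         call MPI_ERROR_STRING(ierr, errstr, reslen, ierr2)
         print *, 'MPI_SCATTERV failed: ', errstr(1:reslen)
      end if
```

That error string (or the raw ierr value) would be the useful thing
to report back to the list.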
Hope this helps,
Brian