Jeff,
Thanks so much for your helpful response.
> Are you using the MPI-2 I/O stuff? That would be the only case where
> the ADIOI_Set_lock stuff would come into play.
This is good info that I missed. We are not using MPI-2 I/O.
> Although applications are generally source compatible between LAM/MPI
> and MPICH (they're both implementations of the same standard, after
> all), there are slight differences in the implementation. Most of the
> time, these things aren't noticable, but sometimes running an MPI
> application exclusively under one implementation and then bringing it
> over to another implementation can highlight application bugs.
My understanding is that MPI_ISEND blocks in the MPICH implementation
but does not block in the LAM implementation. Is this true, and if so,
could that have exposed a flaw in our logic?
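For reference, the assumption I am worried about looks something like
the following stripped-down C sketch (hypothetical names, not our
actual code):

#include <mpi.h>

/* Illustrative only: MPI_Isend merely starts the send; the standard
 * says the buffer must not be reused until the request completes.
 * An implementation that copies small messages eagerly can mask a
 * premature buffer reuse that another implementation would expose. */
void send_row(double *row, int ncols, int tag)
{
    MPI_Request req;
    MPI_Status st;

    MPI_Isend(row, ncols, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &req);
    /* Overwriting row[] here, before the send completes, is the kind
     * of bug that may go unnoticed under one MPI implementation and
     * corrupt data under another. */
    MPI_Wait(&req, &st);   /* only now is row[] safe to reuse */
}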
> LAM is pretty stable, and while I'm obviously not going to say that it
> is guaranteed to be 100% bug free, have you checked your app to ensure
> that it doesn't make some MPI assumptions that may be true in MPICH but
> aren't true in LAM?
> From the wording of your mail, I can't quite tell what the exact
> problem is -- are you just looking at the stdout from mpirun? Or are
> your numbers output into files? If you're just looking at stdout, if
> you have multiple MPI processes writing to stdout simultaneously, MPI
> makes no guarantee about the order in which it is displayed. Indeed,
> this is an inherent race condition -- you never know exactly which node
> is going to print when, etc. Is this what you're describing?
I am not looking at the stdout but at the results that are written to
a single file by a master node (node 0). Rows of an output array
are calculated on separate processors and then sent via MPI_ISEND
to node 0, which does the final assembly and writing of the output
array. Yes, this is the job of a parallel file system, but we had not
realized such calls were part of MPI-2. Here is a snippet of a 'diff'
between the single-processor results and the multiprocessor results:
diff singleProcessor.results LAM.results
993c993
< -3.47866893 -3.47866893 -3.47866893 -3.47866893 -3.47866893 -3.47866893
---
> -3.47866893 -3.47866893 5.52133131 -3.47866893 -3.47866893 -3.47866893
1030c1030
< 1.54010761 1.54010761 1.54010761 4.59215879 4.59215879 4.59215879
---
> 4.59215879 1.54010761 1.54010761 4.59215879 4.59215879 4.59215879
In the second difference (at line 1030 of the output) in particular,
the numerical values themselves appear to be correct; it is possibly
just their positions within the file that are wrong.
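If position is indeed the problem, one thing I plan to check is
whether node 0 assumes that rows arrive in order. Here is a
hypothetical C sketch of an arrival-order-safe receive loop (the
names and the tag-carries-row-index convention are illustrative,
and it assumes every row comes in as a message):

#include <string.h>
#include <mpi.h>

/* Hypothetical sketch: node 0 gathers one row per message.
 * Receiving with MPI_ANY_SOURCE and writing rows out in arrival
 * order would reproduce our symptom, since arrival order can
 * legally differ between MPI implementations.  Carrying the row
 * index in the message tag pins each row to its correct slot. */
void gather_rows(double *out, double *tmp, int nrows, int ncols)
{
    MPI_Status st;
    int i;

    for (i = 0; i < nrows; i++) {
        MPI_Recv(tmp, ncols, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);
        /* st.MPI_TAG is the sender's row index, so placement does
         * not depend on the order in which messages arrive. */
        memcpy(out + (size_t)st.MPI_TAG * ncols, tmp,
               ncols * sizeof(double));
    }
}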
Thanks again for your response,
Dave Kendig
NASA/GSFC