On Wed, 23 Jun 2004, Angel Tsankov wrote:
> I doubt that the ordering of floating-point computations is causing the
> problem - I run the same image (executable) on all the CPUs. In fact, I
> start the program by issuing "mpirun c0-3 a.out 128". 128 is an argument
> to the program and is interpreted as the number of blocks in a row/column
> of the matrix. Moreover, it is strange that the same image works fine
> with smaller sizes (e.g., 16, 32, or 64). Nevertheless, I will check my
> code for any sources of FP ordering problems - it really smells like that.
I assumed that your app has each process compute a portion of the
calculation and then send its partial results to the master for final
combination (i.e., a typical manager/worker kind of setup). Are you
saying that the exact same computation is performed in each process?
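For example (this is just a toy sketch, not your actual code; the variable
names and the MPI_ANY_SOURCE receive are assumptions on my part), a
manager/worker sum where the master adds the partial results in arrival
order can give a total whose low-order bits change from run to run, because
floating-point addition is not associative:

/* Hypothetical manager/worker sketch: each worker sends one partial
   sum to rank 0; rank 0 adds them in whatever order the messages
   arrive, so the rounding of the total depends on arrival order. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    double partial, total, recvd;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Stand-in for the partial result computed from this process's
       blocks of the matrix. */
    partial = 1.0 / (rank + 1);

    if (rank == 0) {
        total = partial;
        for (i = 1; i < size; ++i) {
            /* MPI_ANY_SOURCE: the summation order follows message
               arrival order, which is not deterministic. */
            MPI_Recv(&recvd, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &status);
            total += recvd;
        }
        printf("total = %.17g\n", total);
    } else {
        MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

If your app combines partial results in a pattern like this, a different
arrival order at larger problem sizes could explain why 128 misbehaves
while 16, 32, or 64 do not, even with identical executables and data.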
> Just to check this out - does MPI perform any conversions that might
> cause loss of precision in a HOMOGENEOUS cluster?
It *shouldn't* (but never say "never", right?).
Regardless, even if this were the case, if you run the program the same
way and have the same data distribution across your nodes (i.e., you
always send data D1 from node A to node B), you should get the same result
because LAM should translate it exactly the same way every time.
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/