LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-03-30 08:38:45


On Mar 27, 2007, at 8:06 PM, Rich Naff wrote:

>> 1. Fill the send vectors with dummy data -- that being the
>> MPI_COMM_WORLD rank of the process that is sending the data. Then
>> see if the data that you receive corresponds to data from an
>> unexpected sender.
>
> RLN: I can do this, but I do not believe it necessary. The entries
> in the X vector (and, for that matter, all the matrix coefficients)
> were created using a uniform random number generator contained in
> the driver. Thus, all entries are unique.

Ok.

> I can grep on my saved output and
> quickly determine what is going on. For instance, the saved output
> for the example from yesterday's run is held in files
> process_sr4_713.1, process_sr4_713.2, process_sr4_713.3, and
> process_sr4_713.4; one file for each process. Grepping on the first
> entry of the errant vector section, one obtains the following
> information:
>
> (bash) stoch.pts/8% grep 23.92983122355254 process_sr4_713.*
> process_sr4_713.1: Recv from sender: 3 p_adj(j)= 3 X=
> 23.92983122355254 24.90961832683145 11.47213807665227
> 34.14693740318928 18.33479365003256 3.294915109537070
> 26.07733221657863 29.58106493204133 1.731450359495497
> process_sr4_713.3: isend to process: 1 X= 23.92983122355254
> 24.90961832683145 11.47213807665227 34.14693740318928
> 18.33479365003256 3.294915109537070 26.07733221657863
> 29.58106493204133 1.731450359495497
> process_sr4_713.4: Recv from sender: 1 p_adj(j)= 1 X=
> 23.92983122355254 24.90961832683145 11.47213807665227
> 34.14693740318928 18.33479365003256 3.294915109537070
> 26.07733221657863 29.58106493204133 1.731450359495497
>
> The second entry indicates that the array section was originally
> sent by
> process 3 (process_sr4_713.3) to process 1. The first entry indicates
> that process 1 (process_sr4_713.1) did indeed receive the array
> section, as intended. However, the third entry indicates that process
> 4 (process_sr4_713.4) also received the section, even though it was
> expecting an array section sent from process 1.

Ahh... you are saying the magic words here: array section. I didn't
look closely enough at your code before -- are you passing Fortran
array sections directly into MPI functions? That can be problematic:
Fortran may create temporary copy-in/copy-out buffers behind the
scenes when you pass subsets of arrays into MPI calls. Check out the
Fortran 90 chapter (chapter 10) of MPI-2, where these issues are
discussed.

If you are using array sections and Fortran is creating temporary
buffers behind your back, this *could* be consistent with what you
are seeing: you're not actually passing MPI the buffers you think
you're passing. Fortran is doing some frobbing behind your back and
passing buffers that may already be in use by other non-blocking
communications, and therefore you get race conditions and unexpected
"data replication".

Can you confirm that "array section" means what I think it means
(that Fortran is using automatic temporary buffers to pass the data)?
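
To make that concrete, here is a minimal sketch of the pattern I'm
worried about (hypothetical names, not taken from your code; it
assumes at least two processes and the plain mpif.h bindings):

   program section_isend_sketch
     implicit none
     include 'mpif.h'
     double precision :: X(4, 9), recvbuf(9)
     integer :: req, ierr, rank, status(MPI_STATUS_SIZE)

     call MPI_INIT(ierr)
     call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
     X = 1.0d0

     if (rank == 0) then
        ! X(1,1:9) is a non-contiguous section (one row of a
        ! column-major array), so the compiler is allowed to copy it
        ! into a hidden temporary and pass that temporary to MPI_ISEND.
        ! The temporary may be released as soon as the call returns --
        ! before the non-blocking send completes -- so MPI can end up
        ! transmitting whatever happens to occupy that memory.
        call MPI_ISEND(X(1,1:9), 9, MPI_DOUBLE_PRECISION, 1, 0, &
                       MPI_COMM_WORLD, req, ierr)
        call MPI_WAIT(req, status, ierr)
     else if (rank == 1) then
        ! A contiguous buffer that stays alive for the whole operation
        ! is safe.
        call MPI_RECV(recvbuf, 9, MPI_DOUBLE_PRECISION, 0, 0, &
                      MPI_COMM_WORLD, status, ierr)
     end if

     call MPI_FINALIZE(ierr)
   end program section_isend_sketch

The usual fix is to copy the section into a contiguous buffer that you
own and keep alive until MPI_WAIT completes the request, or to pass
the whole array and describe the layout to MPI with a derived
datatype so that no hidden temporary is needed.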

>> 2. Run your application through a memory-checking debugger such as
>> valgrind and see if any other errors turn up.
>
> RLN: I have a dickens of a time interpreting these valgrind results,
> so I am attaching the output from a valgrind run to this message
> (valgrind_log.13641). You, having more experience than I do, can
> hopefully decide whether there is anything of significance there.
> The reason I say this is that even when I run valgrind on a code as
> simple as this

FWIW, you need to compile LAM with a special option to avoid a whole
bunch of false positives from within LAM itself. Check out these FAQ
entries:

http://www.lam-mpi.org/faq/category6.php3#question7
http://www.lam-mpi.org/faq/category6.php3#question8
http://www.lam-mpi.org/faq/category6.php3#question10

That should dramatically reduce the output from valgrind.
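
For example, something along these lines (a rough sketch -- the paths
are placeholders; the special option is the same --with-purify flag I
mention below):

   # Rebuild LAM so its internals don't show up as "uninitialised
   # byte(s)" false positives under valgrind.
   cd /path/to/lam-7.1.3
   ./configure --with-purify --prefix=$HOME/lam-purify
   make
   make install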

> I get a s**t pot of errors. For instance, here is the leak summary
> after having run valgrind, with LAM 7.1.3, on this smaller problem:
>
> ==14135== LEAK SUMMARY:
> ==14135== definitely lost: 4,159 bytes in 9 blocks.
> ==14135== indirectly lost: 56 bytes in 8 blocks.
> ==14135== possibly lost: 0 bytes in 0 blocks.
> ==14135== still reachable: 477 bytes in 20 blocks.
> ==14135== suppressed: 0 bytes in 0 blocks.
>
> Is this a problem?

Leaks are just leaks -- memory that was allocated and then abandoned
(never freed). So they're only waste; they don't cause data
corruption. Yes, they should be fixed, but they're probably not
critical (who cares about 4k? :-) ).

> The error summary for this smaller problem
> indicates 95 errors, mostly of the "uninitialised byte(s)" type.
> I will attach the valgrind results for this smaller problem as well,
> as it may give you a reference point (valgrind_log.14135). Both
> problems were run under LAM 7.1.3, but for reasons that aren't quite
> clear, I can only compile the small problem with gfortran, while I
> have been using the commercial compiler lf95 to compile my principal
> problem.

I think recompiling LAM with the --with-purify option should clear up
most/all of these and help determine if there are any real problems.

-- 
Jeff Squyres
Cisco Systems