On Dec 1, 2004, at 11:09 AM, Yaakoub Y El Khamra wrote:
> Does this application have a lot of pending non-blocking sends and
> receives ongoing when the process fails? LAM *should* try to release
> pinned memory in order to malloc/pin more, and therefore it should only
> fail in this situation if all the pinned GM memory is actively being
> used when you are trying to alloc more.
>
> Jeff,
>
> yes, exactly this is the case. The application tries to synchronise
> all inter-processor boundaries at once, which should result in a large
> burst of communication. The outstanding non-blocking sends and
> receives should all be "active" in the sense that they are expected to
> finish soon.
>
> If GM has a problem with that, then this is possibly not the most
> efficient way of handling this. Could the application be adapted so
> that it does not hit GM's internal limitations, so that the application
> will in the end run faster?

Before we conclude that this is actually the problem, let's double-check
and make sure that it's not LAM itself that is having the problem.

Specifically, we should be able to account for all GM-addressable
memory. It should come from two places:

- internal memory allocated by LAM for envelopes and short messages
- the sum of all user buffers involved in pending communications that
  are longer than ssi_rpi_gm_tinymsglen (defaults to 16384 bytes)

Hence, when this application runs out of memory, can you add up how
many bytes are in use by user buffers involved in pending
communications (i.e., both sends and receives) that are longer than
16384 bytes?
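
A rough sketch of that tally, in Python for brevity. It assumes the
application can list the byte length of every buffer attached to a
pending non-blocking send or receive at the moment of failure. The
function name pinned_user_bytes and the sample sizes are made up for
illustration, and counting only buffers strictly longer than the
threshold is an assumption about how "over" should be read.

```python
# Default tiny-message threshold quoted in this message (bytes).
TINY_MSG_LEN = 16384


def pinned_user_bytes(pending_sizes, threshold=TINY_MSG_LEN):
    """Sum the sizes (in bytes) of pending buffers longer than threshold.

    Assumption: only buffers strictly longer than the threshold count
    toward the user-buffer total; shorter messages are taken to be
    covered by LAM's internal memory.
    """
    return sum(size for size in pending_sizes if size > threshold)


if __name__ == "__main__":
    # Hypothetical buffer sizes for pending sends and receives.
    sizes = [4096, 16384, 65536, 1048576]
    print(pinned_user_bytes(sizes))
```

In a real run the list would come from whatever bookkeeping the
application keeps for its outstanding requests.
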
Once you get this number, I'll send a short program that will try to
determine the maximum amount of GM memory that can be allocated (I don't
have access to Myrinet resources at the moment). The sum of LAM's
internal GM memory and the total size of your pending user buffers
should be in the ballpark of that maximum (they may not match exactly
due to differences in allocation patterns). If they are close, then what
I described is likely happening to your application. If they're not,
then LAM may have a problem with not releasing GM-addressable memory
properly, and we should investigate further.
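
The comparison itself could look something like the sketch below. The
function name in_ballpark, the 20% tolerance, and the byte counts are
all placeholders chosen for illustration; what counts as "in the
ballpark" is a judgment call, not something LAM or GM defines.

```python
def in_ballpark(lam_internal_bytes, user_bytes, gm_max_bytes, tolerance=0.2):
    """Return True if accounted memory is close to the GM maximum.

    "Close" here means the gap between the accounted total and
    gm_max_bytes is at most tolerance * gm_max_bytes. The default
    tolerance is an arbitrary assumption.
    """
    accounted = lam_internal_bytes + user_bytes
    return abs(gm_max_bytes - accounted) <= tolerance * gm_max_bytes


if __name__ == "__main__":
    # Hypothetical numbers: close to the maximum, so exhaustion of
    # pinned memory by active requests is the likely explanation.
    print(in_ballpark(100_000_000, 850_000_000, 1_000_000_000))
    # Far below the maximum, which would point at a release problem.
    print(in_ballpark(100_000_000, 200_000_000, 1_000_000_000))
```
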
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/