On Nov 20, 2004, at 8:49 AM, Ahmad Faraj wrote:
> I have LAM 6.5.9 installed on a 32 node ethernet cluster. I am an MPI
Mandatory question here -- can you upgrade to the latest version of
LAM/MPI? 6.5.9 is several years old. Technically, we don't support it
anymore.
> code that is basically pure intensive communication. The sizes of the
> communications range from 1 byte to 256 bytes per node. In the MPI
> application, I use isend and irecv, and I am sure I have enough
> buffer space. So, after running the application for many iterations,
> the program hangs on large sizes and sometimes even for medium sizes.
> The feeling that I am getting is that somehow, after the network is
> saturated, LAM does not deliver packets and hangs. Does anyone have a
> clue? Is there a way around this? In the application, every X runs,
> I call lamclean to free some resources. That did not help!
When you say "large sizes" and "medium" sizes, are you considering
"large" to be 256 bytes? (and medium to therefore be something smaller
than that?)
Can you identify where your application is hanging? Is it stuck in an
MPI_WAIT, or some other blocking MPI call? Try attaching a debugger to
a "hung" process and see what it is doing.
If lamclean does not help, it could well be a problem with your
application. Here's what I would look for:
- You described that your application would hang after X runs, and I'm
assuming that X is variable. Some possible reasons for this include
race conditions and/or memory corruption.
- Run your application through a memory-checking debugger such as
valgrind (see the LAM FAQ for more information) and see what it turns
up.
- Double check that you aren't accidentally mis-matching messages
(i.e., receiving the wrong message in the wrong buffer, and
inadvertently causing your application to deadlock because of
now-failed assumptions).
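
To make the last point concrete, here is a small toy model of message matching. It is only a sketch, not MPI or LAM code: it assumes a single sender, a single receiver, and one communicator, and it assumes that each posted receive is paired with the earliest unmatched send carrying the same tag. The function name match_receives and the tuple layout are hypothetical and exist only for this illustration.

```python
def match_receives(sends, recvs):
    """Toy model of tag matching between one sender and one receiver.

    sends: list of (tag, nbytes) in the order the sender issued them.
    recvs: list of (tag, buffer_bytes) in the order the receiver posted them.
    Returns one (buffer_bytes, nbytes) pair per receive; nbytes is None
    when no send ever matches, i.e. a wait on that receive never returns.
    """
    pending = list(sends)  # sends not yet claimed by a receive
    result = []
    for tag, buffer_bytes in recvs:
        # Earliest unmatched send with the same tag wins.
        hit = next((s for s in pending if s[0] == tag), None)
        if hit is not None:
            pending.remove(hit)
        result.append((buffer_bytes, hit[1] if hit is not None else None))
    return result


# Same tag for everything: the 256-byte message lands in the 1-byte buffer.
shared_tag = match_receives([(0, 256), (0, 1)], [(0, 1), (0, 256)])

# One tag per message size: each message finds the buffer meant for it.
distinct_tags = match_receives([(2, 256), (1, 1)], [(1, 1), (2, 256)])

# A receive whose tag nobody sends: the matching wait would hang.
unmatched = match_receives([(0, 1)], [(3, 1)])

for name, outcome in [("shared tag", shared_tag),
                      ("distinct tags", distinct_tags),
                      ("unmatched tag", unmatched)]:
    print(name, outcome)
```

In the first case the receiver's assumptions about which buffer holds which message are already wrong after one exchange; in the third, the receive is never satisfied. Either situation in a real application can look like a hang that only shows up after many iterations.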
Hope this helps.
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/