LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-05-17 17:39:39


On Mon, 17 May 2004, Terry Frankcombe wrote:

> Yeah, the general situation is not a LAM problem as such. (But the
> freak out on packet drop is!) I don't know who is causing the NFS load

Indeed, it unfortunately is. The first revision of the gm module did not
take into account that gm may drop packets. So as you described in your
first message, if you don't periodically cause progress or match messages
in MPI, gm will drop the message, and your LAM process will likely hang.
:-(
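
As a rough illustration of "periodically cause progress", here is a minimal sketch of one common workaround: break a long non-MPI phase (heavy I/O, a long compute loop) into chunks and make a cheap non-blocking MPI call between them. It assumes that a call such as MPI_Iprobe is enough to let the library make progress in your build, which nothing above confirms. The names run_chunked, poke_progress, example_chunk and the USE_MPI macro are made up for this example and are not part of LAM.

```c
#include <assert.h>

#ifdef USE_MPI   /* hypothetical build switch, not a LAM option */
#include <mpi.h>

/* Non-blocking probe: we ignore the result, the point is only to
 * enter the MPI library so it gets a chance to make progress.
 * Assumes MPI_Init has already been called. */
static void poke_progress(void)
{
    int flag;
    MPI_Status status;
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
}
#else
/* Stand-in so the sketch builds without MPI: just count the calls. */
static int poke_count = 0;

static void poke_progress(void)
{
    poke_count++;
}
#endif

/* Stand-in for one slice of real work (e.g. one block of scratch I/O). */
static volatile long sink = 0;

static void example_chunk(int index)
{
    long k;
    for (k = 0; k < 1000; k++)
        sink += k ^ index;
}

/* Run total_chunks slices of work, entering MPI after every
 * poke_every slices.  Returns how many times MPI was poked. */
static int run_chunked(int total_chunks, int poke_every, void (*do_chunk)(int))
{
    int i;
    int pokes = 0;

    assert(poke_every > 0);
    for (i = 0; i < total_chunks; i++) {
        do_chunk(i);
        if ((i + 1) % poke_every == 0) {
            poke_progress();
            pokes++;
        }
    }
    return pokes;
}
```

How often to poke is a judgment call: the gap between pokes needs to stay well under the gm timeout discussed below, so pick poke_every from how long one chunk takes in the worst case (e.g. when NFS is slow).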

> on the server. It may well be me, as the code I'm running (a DFT code)
> uses both local and global scratch files, but most of it is on local
> scratch. The admin hasn't been able to tell me (but I do suspect some
> of his jobs!)
>
> Certainly wasting cycles waiting for I/O is just that - a waste. But
> it's better than crashing calcs. The timeout referred to in the error
> message... is that a LAM timeout or a Myrinet timeout? The admin
> assures me that the Myrinet timeout is set to something spastic like 15
> minutes.

The default Myrinet timeout is 30 seconds; I couldn't find a way to change
it from within LAM (there may be a way to do it in the general gm/Myrinet
setup -- I'm not sure). help_at_[hidden] assures me that there is no way for
LAM to change it during the run of an MPI application.

> Anyway, the question remains: is there a Subversion revision that will
> probably work?

Yes and no. :-\

So the retransmit fixes went in a long time ago (a few months). The
ptmalloc changes went in about 1-2 weeks ago. They definitely broke gm
for various reasons. But we're unfortunately pretty sure that we also
broke something else in gm that we haven't nailed down yet. I had really
intended to work on this last week, but got sidetracked into other more
urgent issues. :-(

Let me see if I can get to it this week (ptmalloc + the other problem).
Either way, I should be able to get ptmalloc going relatively easily and
give you a subversion tarball that contains at least the retransmit fixes
-- we can see about the other problem as well.

FWIW, gm is the one "major" bug left before 7.1 is released (there's a
bunch of other "minor" bugs and testing that we have to do, too).

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/