Terry,
We ran into a number of MPI communication hangs on a Myrinet/PC cluster
at Boeing when NFS mounts timed out: jobs would hang with every process
listening for an MPI message that never came. After we updated the
Myrinet drivers, the problem seems to have gone away, or at least
dropped to an undetectable level. We didn't test extensively with LAM,
so it may not be the same problem you're seeing. But if your Myrinet
drivers are not the most recent (say, from February or March of this
year), you might check with Myricom.
John Bussoletti
-----Original Message-----
From: Terry Frankcombe [mailto:T.Frankcombe_at_[hidden]]
Sent: Monday, May 17, 2004 8:56 AM
To: General LAM/MPI mailing list
Subject: Re: LAM: Fixed GM module
Yeah, the general situation is not a LAM problem as such. (But the
freak-out on packet drop is!) I don't know who is causing the NFS load
on the server. It may well be me, as the code I'm running (a DFT code)
uses both local and global scratch files, but most of it is on local
scratch. The admin hasn't been able to tell me (but I do suspect some
of his jobs!).
Certainly, wasting cycles waiting for I/O is just that: a waste. But
it's better than crashing calcs. The timeout referred to in the error
message... is that a LAM timeout or a Myrinet timeout? The admin
assures me that the Myrinet timeout is set to something spastic like
15 minutes.

Anyway, the question remains: is there a Subversion revision that will
probably work?
> On Mon, 17 May 2004, Terry Frankcombe wrote:
>
> > We think that it's because I'm accessing a heavily loaded NFS
> > server, causing one or the other of my MPI processes to block and
> > wait for the I/O to happen, which means that it doesn't participate
> > in the message passing like it should. Hence the timeout.
>
> I saw the same thing here some time ago. I can't really blame
> LAM-MPI; I see this mostly as a cluster setup problem - the I/O
> should not take that much time... But users specifying QM scratch
> files that reside on NFS-mounted directories have no idea about the
> consequences (from your sig I see that you're doing Theoretical
> Chemistry, so you're probably using QM programs). At first glance,
> being able to specify the GM timeout would help somewhat, in the
> sense that jobs would be more likely to continue, but do you really
> want to let those CPUs and Myrinet cards do nothing while the whole
> job is waiting on I/O?
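To make the failure mode in the quote above concrete, here is a minimal
sketch (not code from the thread; the sleep() is a hypothetical stand-in
for a scratch-file write blocked on a loaded NFS server, and the
1200-second value is arbitrary). Rank 0 stalls in I/O while rank 1 sits
in a blocking MPI_Recv, so whether the job finishes late or dies depends
entirely on the interconnect's retransmit timeout:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Hypothetical stand-in for a scratch-file write to a loaded
         * NFS server: the process sits in the kernel and makes no MPI
         * progress for longer than, say, a 15-minute GM timeout. */
        sleep(1200);
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocks here the whole time rank 0 is stuck on I/O; whether
         * this eventually returns or the job is killed depends on the
         * transport's timeout behaviour, not on the application. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Launched on two nodes (e.g. mpirun -np 2 a.out), rank 1's CPU and
Myrinet card sit idle for the full twenty minutes, which is exactly the
waste the quoted message is asking about.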
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/