LAM/MPI General User's Mailing List Archives

From: Terry Frankcombe (T.Frankcombe_at_[hidden])
Date: 2004-05-17 10:56:25


Yeah, the general situation is not a LAM problem as such. (But the freak out
on packet drop is!) I don't know who is causing the NFS load on the server.
It may well be me, as the code I'm running (a DFT code) uses both local and
global scratch files, but most of its I/O goes to local scratch. The admin
hasn't been able to tell me (but I do suspect some of his jobs!).

Certainly wasting cycles waiting for I/O is just that - a waste. But it's
better than crashing calcs. The timeout referred to in the error message...
is that a LAM timeout or a Myrinet timeout? The admin assures me that the
Myrinet timeout is set to something ridiculous like 15 minutes.

Anyway, the question remains: is there a Subversion revision that will
probably work?

> On Mon, 17 May 2004, Terry Frankcombe wrote:
>
> > We think that it's because I'm accessing a heavily loaded NFS server
> > causing one or the other of my MPI processes to block and wait for
> > the IO to happen, which means that it doesn't participate in the
> > message passing like it should. Hence the timeout.
>
> I have seen the same here some time ago. I can't really blame LAM-MPI,
> I see this mostly as a cluster setup problem - the I/O should not take
> that much time... But users specifying QM scratch files that reside on
> NFS mounted directories have no idea about the consequences (from your
> sig I see that you're doing Theoretical Chemistry, so probably using
> QM programs). At first glance, being able to specify the GM
> timeout would help somehow, in the sense that jobs will be more likely
> to continue, but do you really want to let those CPUs and Myrinet
> cards do nothing while the whole job is waiting on I/O?
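
For anyone following the thread, the mechanism described in the quoted text can be mimicked with a toy sketch. This is not LAM or GM code: it is plain Python threads and queues, and the names `worker`, `exchange`, `io_delay` and `timeout` are all made up for illustration. One "process" blocks on simulated I/O before it answers a message, and the sender gives up once its timeout expires.

```python
import queue
import threading
import time


def worker(inbox, outbox, io_delay):
    """Hypothetical peer that does blocking 'I/O' before answering."""
    msg = inbox.get()            # receive a request from the other side
    time.sleep(io_delay)         # stands in for a stalled NFS read or write
    outbox.put(("reply", msg))   # only now does it take part in messaging


def exchange(io_delay, timeout):
    """Send one message and wait at most `timeout` seconds for the reply."""
    to_worker, from_worker = queue.Queue(), queue.Queue()
    peer = threading.Thread(
        target=worker, args=(to_worker, from_worker, io_delay), daemon=True
    )
    peer.start()
    to_worker.put("ping")
    try:
        from_worker.get(timeout=timeout)
        return "ok"
    except queue.Empty:
        # The peer was still stuck in I/O when the sender's patience ran out.
        return "timeout"


if __name__ == "__main__":
    print(exchange(io_delay=0.01, timeout=1.0))  # I/O finishes well in time
    print(exchange(io_delay=0.5, timeout=0.05))  # I/O outlasts the timeout
```

Raising `timeout` in the sketch lets the exchange complete, but the sender still sits idle for the whole `io_delay`, which is the trade-off raised in the quoted reply.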