LAM/MPI General User's Mailing List Archives

From: Yaron Minsky (yminsky_at_[hidden])
Date: 2004-06-17 15:25:46


I've hit some really weird behavior in my LAM setup, and I'm
wondering if anyone else has seen this. Here's the scenario. It
starts when a LAM process fails in an ugly way, generally with some
kind of run-time exception that prevents the system from shutting
down properly.

The end result, fairly often, is a single process out of the 121 that
were started, left stranded on the final node of my cluster. The weird
thing that often happens next is that the node with the stranded
process seems to hold some kind of lock on the executable file, so
that when I copy a new build out to the bin directory, the file isn't
updated on that single node. That is, md5sum reports a different
checksum on the final node in the cluster than it does everywhere
else.
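
For anyone wanting to script the per-node md5sum comparison described
above, here is a minimal sketch in Python. It is illustrative only:
the function names file_md5 and outlier_nodes and the node names in
the usage note are hypothetical, and how the digests are collected
from each node (ssh, a job launcher, and so on) is left open as an
assumption about your site.

```python
import hashlib
from collections import Counter


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large executables are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outlier_nodes(digests_by_node):
    """Return the sorted node names whose digest differs from the most common one.

    `digests_by_node` maps a (hypothetical) node name to the digest that
    node reported for the same path.
    """
    if not digests_by_node:
        return []
    majority, _count = Counter(digests_by_node.values()).most_common(1)[0]
    return sorted(
        node for node, digest in digests_by_node.items() if digest != majority
    )
```

Running file_md5 on the executable's path on each node and passing the
collected results to outlier_nodes, for example
outlier_nodes({"n0": "aa", "n1": "aa", "n2": "bb"}), would single out
the node that still sees the old build (here "n2").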

What's particularly odd about this is that the filesystem is NFS, so
there really should be only one file. Has anyone seen this behavior
before? And is there a workaround?

Thanks,
Yaron Minsky