LAM/MPI General User's Mailing List Archives

From: Bogdan Costescu (bogdan.costescu_at_[hidden])
Date: 2004-10-15 07:27:40


On Wed, 13 Oct 2004, William Bierman wrote:

> Howdy. I am an undergrad CS student at the University of Hawaii
> charged with creating a cluster that is 'resistant' to the effects of
> Murphy's law, or basically, a robust and versatile system.

In general, high speed and high availability don't mix very well. If
you are willing to compromise (maybe on both sides), then you can
probably achieve what you want.

> Currently my setup is this: I have 23 machines (with 50 or so on the
> way) running FreeBSD and lam 6.5.9.

Could you update to a more recent LAM version, such as 7.0.x or even 7.1.x?

> There is one master, which stores all user data, and NIS
> information. This master also acts as a gateway to the internet by
> means of a second NIC.

This setup already raises some problems. Do the other nodes have a
second NIC? If not, how are you going to set up the computational
network and the external access using only one NIC per node?
Replication of data (user data and system data like NIS information)
might become difficult if this data changes very often; ideally you'd
need a parallel file system that is set up with redundancy, so
that taking down a node doesn't make the data (or part of it)
unreachable, let alone lost forever in case of a disk crash, for
example.

> The other 22 nodes are all NIS clients, which use NFS to mount home
> directories from the master.

Hmm, and when the master goes down, what do the nodes do with the NFS
mounts that they have? What about the data that might still be in
OS file caches on the nodes?

> Right now the master is static, since it is the only machine with two
> NICs and it is the only machine with NIS/user files information stored
> on it.

I think that it's much easier to have one (or more) extra nodes that act
as master, and only as master, in hot standby.

> so it'd be a simple cron job required to re-assign the master in
> terms of the internet.

I wish it were that simple :-)
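
As a rough illustration of why a cron job alone is fragile: a single
failed probe should not trigger a takeover, and once a standby has
taken over it should not hand control back automatically. The sketch
below is hypothetical (the class FailoverMonitor and its max_missed
parameter are invented for illustration and are not part of LAM or of
any cluster tool) and covers only the decision logic. Moving the
external address, the NIS information and the NFS exports to the new
master is the hard part and is left out.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Decides when a hot-standby node should take over as master.

    Hypothetical helper for illustration only; not part of LAM.
    """

    max_missed: int = 3     # consecutive missed heartbeats tolerated (assumed value)
    missed: int = 0         # current run of missed heartbeats
    promoted: bool = False  # True once the standby has taken over

    def observe(self, heartbeat_ok: bool) -> bool:
        """Record one probe result; return True if the standby should be master."""
        if self.promoted:
            # Never fall back automatically: the old master may hold stale data.
            return True
        if heartbeat_ok:
            # Any successful probe resets the failure count.
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.promoted = True
        return self.promoted
```

A cron job could call observe() once per run with the result of
whatever probe you trust (a ping, an NFS stat with a timeout, ...), but
it would have to keep the counter somewhere between runs, and both
nodes must never believe they are master at the same time.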

> Ultimately the problem is when the master goes down while there are
> MPI processes running.

Do the jobs that are started have a process on the master? If not, there is
no real need for the master node to stay up as far as LAM is concerned
(except, of course, for file access if the master also acts as the NFS
server). This is basic functionality used, for example, by batch
systems.

> Does LAM provide some functionality to adapt to this case? Would it
> be possible to somehow ensure that the process does not die?

LAM has some features for notification when the MPI universe has
changed. Keeping a process alive is far beyond the scope of LAM - and
I would hazard to say that it is impossible, given that the node might
suddenly die due to some hardware failure or somebody tripping over
the power cable or switch (we are talking about Murphy, right? :-))

-- 
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu_at_[hidden]