LAM/MPI General User's Mailing List Archives

From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2003-08-13 11:16:27


On Wednesday, August 13, 2003, at 09:02 AM, Saberi Bin Mohamad wrote:

> Until now, I had successfully set up parallel runs using LAM/MPI on 3 nodes.
> But after adding one more station to the parallel environment, I ran into
> a problem.
> lamboot succeeds, but while my MPI test program runs, the nodes become
> very slow and the program takes a long time to finish. Errors are displayed
> on the screens of all nodes, such as:
> nfs:server main_node(example name node) not responding, still trying
> nfs:server main_node(example name node) OK
>
> These errors repeat until the MPI program finishes.
>
> How can I solve this problem?

It sounds like either 1) your head node is having some server problems
or 2) there are some networking issues that are causing the clients to
be unable to contact the NFS server. It is really impossible for us to
know which one is causing the problems - you probably need to find a
local sysadmin for help there.
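
One rough way to look at possibility 2 is to measure packet loss from a
client node to the NFS server. The Python sketch below is only an
illustration and is not part of LAM/MPI: the helper names are hypothetical,
main_node is the example host name from the report above, and it assumes a
ping that accepts -c and prints a "% packet loss" summary line, as the
Linux and BSD versions do.

```python
import re
import subprocess


def parse_packet_loss(ping_output: str) -> float:
    """Return the packet-loss percentage from ping's summary line."""
    # Matches both "10% packet loss" (Linux) and "10.0% packet loss" (BSD).
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    if match is None:
        raise ValueError("no packet-loss summary found in ping output")
    return float(match.group(1))


def measure_loss(host: str, count: int = 20) -> float:
    """Ping `host` `count` times and return the reported loss percentage."""
    # Assumption: "-c" sets the number of echo requests on this system's ping.
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
        check=False,  # ping exits non-zero when packets are lost
    )
    return parse_packet_loss(result.stdout)


# Example use on a client node (host name is the placeholder from the report):
#     print(measure_loss("main_node"))
```

Steady loss while the MPI job is running, but not while the cluster is idle,
would point toward the network rather than the server itself. It is not a
substitute for a proper look by a sysadmin.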

In general, LAM should not be able to cause these kinds of failures in
NFS. However, it may be possible if you are on a busy network with
high packet loss and you are using the lamd RPI. In these situations,
the lamd RPI can occasionally overwhelm the network while it tries to
recover from the various networking issues. If you are using the lamd
RPI, try using one of the other RPIs - in general, your performance
will be better and it may eliminate the NFS problem you are seeing.
More information on choosing an RPI can be found on our web site.
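
As a sketch of what switching RPIs can look like: the Python helper below
only builds an mpirun command line. It assumes a LAM/MPI 7.x installation,
where the RPI is selected at run time with "-ssi rpi <name>" and "C" means
one process per CPU in the booted LAM. The function name and ./mpi_test are
hypothetical, and which RPI modules exist depends on how LAM was built
(laminfo should list them).

```python
import shlex
from typing import List


def build_mpirun_command(rpi: str, program: str, where: str = "C") -> List[str]:
    """Build an mpirun argument list that selects a specific RPI module.

    Assumes LAM/MPI 7.x, where "-ssi rpi <name>" picks the RPI at run time.
    """
    if not rpi or not program:
        raise ValueError("both an RPI name and a program are required")
    return ["mpirun", "-ssi", "rpi", rpi, where, program]


# Select the tcp RPI instead of lamd for a hypothetical test program.
command = build_mpirun_command("tcp", "./mpi_test")
print(shlex.join(command))
```

Passing the resulting list to subprocess.run on the node where lamboot was
run would launch the job. Comparing a run with "lamd" against one with "tcp"
is a simple way to see whether the NFS messages follow the RPI choice.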

Hope this helps,

Brian

-- 
   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have a LAM/MPI day: http://www.lam-mpi.org/