On Fri, 13 Jun 2003, Daniel Rohe wrote:
> I've just noticed that we have a seemingly identical problem on a
> 32-node Linux cluster running SuSE-7.3 with LAM version 6.5.4 (at least
> that's what it says in the man page).
>
> We've been using the cluster for quite a while, but the problems have
> arisen only lately (or maybe we hadn't noticed!?).
>
> Anyway, I've read the messages in this thread, but I'm not sure
> what we should do.

My guess would be that you have been having performance problems for some
time, but only recently noticed them. As far as I know, there have not
been any major performance bugs in the recent Linux kernel versions
(there were some in the early 2.2 series). You could always back out
previously applied patches to check that nothing changed there.

In most cases, performance problems are actually due to the application
more than anything else. TCP is pretty low bandwidth and high latency, so
applications that are very sensitive to either of these are going to have
lower than optimal CPU utilization. You might want to use a profiling tool
such as XMPI or MPE to see whether there are obvious places in your
application where communication cost can be reduced.
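
To make the latency/bandwidth point concrete, here is a rough sketch of the
usual "fixed latency plus size over bandwidth" cost model. The function names
and the numbers (100 microseconds of latency, 10 MB/s of bandwidth) are
illustrative assumptions, not measurements from any particular cluster; plug
in values from your own network to get a feel for it.

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time for one message: fixed latency plus size / bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def total_time(n_messages, bytes_per_message, latency_s, bandwidth_bytes_per_s):
    """Estimated time to send n_messages messages of equal size, one after another."""
    per_message = transfer_time(bytes_per_message, latency_s, bandwidth_bytes_per_s)
    return n_messages * per_message


# Assumed, illustrative network parameters (not measured values).
LATENCY_S = 100e-6   # 100 microseconds per message
BANDWIDTH = 10e6     # 10 MB/s

# The same 8000 bytes of payload, sent two different ways.
many_small = total_time(1000, 8, LATENCY_S, BANDWIDTH)   # 1000 messages of 8 bytes
one_large = total_time(1, 8000, LATENCY_S, BANDWIDTH)    # 1 message of 8000 bytes

print(f"1000 x 8 bytes:  {many_small * 1e3:.2f} ms")
print(f"1 x 8000 bytes:  {one_large * 1e3:.2f} ms")
```

Under this simple model the many-small-messages case is dominated by
per-message latency, which is why aggregating messages is often one of the
first things to look for in a profile.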

Hope this helps,
Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/