LAM/MPI General User's Mailing List Archives

From: Daniel Rohe (d.rohe_at_[hidden])
Date: 2003-06-13 06:06:29


Hi,

I've just noticed that we have a seemingly identical problem on a 32-node Linux cluster running SuSE 7.3 with LAM version 6.5.4 (at least that's what
it says in the man page).

We've been using the cluster for quite a while, but the problems have arisen only lately (or maybe we hadn't noticed them before!?).

Anyway, I've read the messages in this thread, but I'm not sure what we should do.

If you have any news or suggestions we'd be grateful.

Cheers, Daniel

-- 
Daniel Rohe
Max Planck Institute for Solid State Research
Heisenbergstr. 1, D-70569 Stuttgart
Phone: +49 711/689-1516
Fax:   +49 711/689-1702

Peter McLachlan wrote:
> 
> I'm trying to run HPL on a large cluster and I'm hitting performance 
> problems I'm trying to track down.  I'm looking for suggestions on 
> probable cause for these problems.  
> 
> Our setup:
> I'm currently trying to run it across 196 homogeneous dual-processor x86 
> (P4) nodes using LAM 6.5.9 compiled with gcc 3.2.2.  HPL and 
> ATLAS 3.4.1 were also compiled with gcc 3.2.2.  
> 
> The LAM, HPL, gcc & ATLAS libraries and executables are shared to the 
> nodes over NFS mounts.  All nodes are connected and communicate over 
> gigabit ethernet links using a jumbo MTU of 9000.  
> 
> lamboot initializes the nodes just fine, tping and the pi sample seem to 
> run perfectly.  
> 
> The problem:
> When I run the xhpl binary it is using only a small amount of the CPU 
> power on the nodes.  If I ssh to a given node it will be between 30-80% 
> of the available CPU time.  Often at the lower end of that range.  In 
> fact my GFLOPS seem to suffer the more I increase the number of nodes. 
> This to me suggests a communication problem of some sort.
> 
> When I run mpirun with the -lamd option performance improves somewhat 
> but still not to expected levels.  Interestingly, the lamd daemon in this 
> mode uses 30-40% CPU and the xhpl binaries each use 30-40% CPU.  
> 
> The symptoms seem to indicate a network performance problem; however, when 
> I test point-to-point performance between nodes using netperf I am 
> seeing TCP/UDP streams in the 990 Mbit/s range.  
> 
> Are there other possibilities I have yet to look into?  Does anyone have 
> any suggestions on a troubleshooting technique to pin this one down? 
> Any help much appreciated.
> 
> Regards,
> 
> Peter McLachlan
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
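
A note on the netperf figures quoted above: a stream test near 990 Mbit/s shows that bulk bandwidth is fine, but it says little about the round-trip time of small messages, which may also matter for a communication-heavy run like the one described. Below is a minimal sketch of a TCP ping-pong timer. It is written in Python and is not taken from the LAM, HPL, or netperf distributions; the function names (echo_server, recv_exact, ping_pong, run_local_demo), the message sizes, and the loopback address are all assumptions chosen for illustration. It is not a diagnosis of the problem in this thread, only one way to collect another data point.

```python
import socket
import threading
import time


def echo_server(listener):
    """Accept one connection and echo every byte back until the peer closes."""
    conn, _ = listener.accept()
    with conn:
        # Disable Nagle's algorithm so small replies are sent immediately.
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        while True:
            data = conn.recv(65536)
            if not data:
                break
            conn.sendall(data)


def recv_exact(sock, nbytes):
    """Read exactly nbytes from sock (TCP may deliver a message in pieces)."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def ping_pong(host, port, msg_size, rounds=200):
    """Return the mean round-trip time in seconds for msg_size-byte messages."""
    payload = b"x" * msg_size
    with socket.create_connection((host, port)) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter()
        for _ in range(rounds):
            sock.sendall(payload)
            recv_exact(sock, msg_size)
        elapsed = time.perf_counter() - start
    return elapsed / rounds


def run_local_demo(sizes=(64, 1024, 16384), rounds=200):
    """Time a ping-pong over loopback for each size (hypothetical demo setup)."""
    results = {}
    for size in sizes:
        listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        port = listener.getsockname()[1]
        server = threading.Thread(target=echo_server, args=(listener,), daemon=True)
        server.start()
        results[size] = ping_pong("127.0.0.1", port, size, rounds)
        server.join(timeout=5)
        listener.close()
    return results


if __name__ == "__main__":
    for size, rtt in run_local_demo().items():
        print(f"{size:6d} bytes: mean round trip {rtt * 1e6:.1f} us")
```

To use something like this between two cluster nodes, the echo side would have to run on one node and ping_pong on another with that node's address; the loopback demo only checks that the script itself works. If small-message round trips turn out to be much slower than expected while the stream bandwidth is healthy, that would point toward latency rather than throughput, though other causes are equally possible.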