On Tue, 29 Jul 2003, Stephen Cary wrote:
> I've been working with a HPL run on a LARGE Linux cluster running RH7.3.
> (Kernel 2.4.19-27.7.xsmp, if I remember correctly..) I installed the
> stock 6.5.9 rpm from the ftp site.. It seems to run extremely well
> until I approach 1024 nodes.. We thought it might be the number of file
> descriptors on node0, but a recompiled kernel with 4096 descriptors
> didn't seem to help.. It's running very well with 1008 processors
> currently.. Any pointers or suggestions where to
> look/tinker/recompile/debug would be appreciated!!
File descriptors will definitely be a problem; the LAM tcp RPI module
opens a socket to each of its peers (we don't yet have a lazy-open
approach). So as you guessed, allowing more fd's per process is going to
be necessary.
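To see what limit a process actually gets on a given node, you can ask for it directly. The following is only a rough sketch in Python, not anything that ships with LAM. It assumes one socket per peer, as described above, plus a handful of other descriptors (stdio, the connection to the local daemon, and so on). The `overhead=16` default and the function names are made-up placeholders, not LAM figures.

```python
import resource


def fds_needed(num_procs, overhead=16):
    """Estimate descriptors per process: one socket per peer plus a guessed overhead.

    `overhead` is a hypothetical allowance for stdio and other open files.
    """
    peers = max(num_procs - 1, 0)
    return peers + overhead


def enough_fds(soft_limit, num_procs, overhead=16):
    """Return True if `soft_limit` covers the estimated descriptor count."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= fds_needed(num_procs, overhead)


if __name__ == "__main__":
    # Soft limit is what the process is held to; hard limit is the ceiling
    # an unprivileged process may raise its soft limit to.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    for n in (1008, 1024, 2048):
        verdict = "ok" if enough_fds(soft, n) else "too low"
        print(f"{n} processes: need ~{fds_needed(n)} fds, soft limit {soft} ({verdict})")
    print(f"hard limit: {hard}")
```

Run it through the same launch path as your MPI processes rather than from an interactive login, since the limit a remotely started process inherits may differ from what your shell reports.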
What exactly is the error? That is, what happens when you try to run on
more nodes than that?
If you're just starting with LAM, you might want to bump up to LAM/MPI
7.0, since it has all the latest features, performance enhancements, etc.
It's also much easier for us to debug/maintain -- the 6.5.x series is
officially retired. The RPM on our web site was compiled against RH 8.0,
so you'll probably need to compile from source, but that shouldn't be too
hard -- see the "For the Impatient" 1-page chapter in the Installation
Guide.
> I will, of course, provide a synopsis for the list after we fix this..
> Oh yes, the HPL application was compiled with the standard RH7.3
> compiler suite.. I do have the Intel 7.0 compiler suite available, if
> that's any help..
Either one should be fine -- this is unlikely to be a compiler problem.
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/