LAM/MPI General User's Mailing List Archives

From: Peter Farrell (pfarrell_at_[hidden])
Date: 2004-04-24 05:17:43


I am using LAM to run hpl (the well-known top500 program) on a cluster. So
far I have tested this successfully with 482 CPUs, but a larger test with
596 CPUs fails: sfh_sock_open_clt_inet_stm returns errno 111 in
connect_all(). I hacked some debugging into this function and found that
it connects using a "reasonable" range of port numbers for most of the
clients but, for some reason I haven't yet worked out, it suddenly decides
to use a port number of 1, i.e. inmsg.nh_data[0]=1.
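
For readers following along: on Linux, errno 111 is ECONNREFUSED, which is
what connect() reports when nothing is listening on the target port. That
would fit a client being handed port 1 instead of the server's real port.
The sketch below is not LAM code; it is a minimal, hypothetical Python
illustration (the helper names are made up for this example) that decodes
an errno value and reproduces the refusal against a local port that is
bound but never put into the listening state.

```python
import errno
import os
import socket


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def try_connect(host, port, timeout=2.0):
    """Attempt a TCP connect; return 0 on success or the errno on failure."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
        client.settimeout(timeout)
        return client.connect_ex((host, port))


def connect_to_port_without_listener():
    """Connect to a local port that is bound but has no listener.

    The holder socket reserves an ephemeral port and never calls listen(),
    so the kernel rejects the incoming connection attempt.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as holder:
        holder.bind(("127.0.0.1", 0))
        port = holder.getsockname()[1]
        return try_connect("127.0.0.1", port)


if __name__ == "__main__":
    # errno numbers are platform specific; 111 is the Linux value.
    print("errno 111 here:", describe_errno(111))
    result = connect_to_port_without_listener()
    print("connect result:", result, describe_errno(result))
```

Running the same try_connect() against a node and port taken from the
debugging output (for example port 1) is a quick way to check whether the
failure is simply a missing listener rather than a per-process limit.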

Has anyone else seen this problem? Is there a solution?

Are there any built-in limitations I might be hitting with a large number
of CPUs? Are there any flags I should be using to handle a large number of
CPUs?

Thanks,

Peter Farrell