I am using LAM to run hpl (the well-known top500 program) on a cluster. I
have successfully tested this (so far) with 482 CPUs. A larger test with 596
CPUs fails. It returns errno 111 (ECONNREFUSED on Linux) from sfh_sock_open_clt_inet_stm in
connect_all(). I hacked some debugging into this function and found that
it connects using a "reasonable" range of port numbers for most of the
clients but, for some reason I haven't yet worked out, it suddenly decides
to use a port number of 1, i.e. inmsg.nh_data[0]=1.
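
For anyone who wants to reproduce the kind of check described above, here is a minimal sketch. It is not LAM code. The helper names describe_errno and is_suspicious_port are hypothetical, and the 1024 lower bound is only an assumption about where dynamically bound listener ports usually start. It decodes an errno value using the local system's tables (111 is ECONNREFUSED on Linux, but the number differs on other systems) and flags a port value such as 1 as implausible.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return the symbolic name and message for an errno value on this system."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({code}): {os.strerror(code)}"


def is_suspicious_port(port: int, low: int = 1024) -> bool:
    """Flag a port outside the range a dynamically bound listener normally gets.

    The default lower bound of 1024 is an assumption (ports below it are
    conventionally reserved), not something taken from LAM.
    """
    return not (low <= port <= 65535)


if __name__ == "__main__":
    # Decode the error code seen in connect_all() using local errno tables.
    print(describe_errno(111))
    # The port value observed in inmsg.nh_data[0] in the failing case.
    print(is_suspicious_port(1))
```

A check like the second one, placed where the port is read from the incoming message, would show which client sends the bad value before the connect attempt fails.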
Has anyone else seen this problem? Is there a solution?
Are there any built-in limitations I might be hitting with a large number
of CPUs? Are there any flags I should be using to handle a large number of
CPUs?
Thanks,
Peter Farrell