Jeff --
This is the one you fixed in 7.0.5 regarding the race condition that
happened with more than 255 nodes, right? I don't remember offhand what
exactly the problem and the fix were. Do I need to tell him about the
problem/fix, or just tell him that there was a race condition that was
fixed and that he can get the fix from 7.0.5/svn?
-Vishal
On Sat, 24 Apr 2004, Peter Farrell wrote:
# I am using LAM to run hpl (the well known top500 program) on a cluster.
# I have successfully tested this (so far) with 482 CPUs. A larger test with
# 596 CPUs fails. It returns errno 111 from sfh_sock_open_clt_inet_stm in
# connect_all(). I hacked some debugging into this function and found that
# it connects using a "reasonable" range of port numbers for most of the
# clients but, for some reason I haven't yet worked out, it suddenly decides
# to use a port number of 1, i.e. inmsg.nh_data[0]=1.
#
# Has anyone else seen this problem? Is there a solution?
#
# Are there any built in limitations I might be hitting with a large number
# of CPUs? Are there any flags I should be using to handle a large number of
# CPUs?
#
# Thanks,
#
# Peter Farrell
#
# _______________________________________________
# This list is archived at http://www.lam-mpi.org/MailArchives/lam/
#