As Vishal pointed out, yes, there was recently a fix for what you might be
seeing -- there was an obscure race condition that *could* happen when
running on more than 255 nodes. You're certainly seeing a symptom of it --
the inexplicable port number of "1": when the race condition occurred, the
endian-swapping information accidentally got mixed in with the socket
startup information.
LAM 7.0.5, which should contain the fix, is due out Real Soon Now (we got
delayed by about 1.5 weeks because everyone working on 7.0.5 was at a
conference all last week).
Alternatively, if you're brave, you can get 7.0.5 from Subversion (see
http://www.lam-mpi.org/svn/). Get the branches/branch-7-0 version.
On Sat, 24 Apr 2004, Vishal Sahay wrote:
> This is the one you fixed in 7.0.5 regarding the race condition that
> happened for more than 255 nodes, right? I don't remember off the top of
> my head exactly what the problem and the fix were. Do I need to tell him
> about the problem/fix, or just tell him that there was a race condition
> that was fixed and that he can get the fix from 7.0.5/SVN?
>
> # I am using LAM to run HPL (the well-known Top500 program) on a cluster.
> # I have successfully tested this (so far) with 482 CPUs. A larger test
> # with 596 CPUs fails. It returns errno 111 from
> # sfh_sock_open_clt_inet_stm in connect_all(). I hacked some debugging
> # into this function and found that it connects using a "reasonable"
> # range of port numbers for most of the clients but, for some reason I
> # haven't yet worked out, it suddenly decides to use a port number of 1,
> # i.e. inmsg.nh_data[0]=1.
> #
> # Has anyone else seen this problem? Is there a solution?
> #
> # Are there any built-in limitations I might be hitting with a large
> # number of CPUs? Are there any flags I should be using to handle a
> # large number of CPUs?
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/