
LAM/MPI General User's Mailing List Archives


From: Peter Kjellstroem (cap_at_[hidden])
Date: 2004-09-10 04:16:45


On Thu, 9 Sep 2004, Jeff Squyres wrote:

> Folks --
>
> It's been a long time coming, but I think we're getting near the end of
> the road. I uploaded 7.1b21 last night, and we're looking good so far.
> We've fixed a lot of bugs -- 7.1 appears more stable than ever.
>
> Given that we're so close, I'd like to ask people to help us have a
> bug-free release by downloading b21 and giving it a whirl in your
> environment. In particular, Myrinet and Infiniband testing would be
> most helpful.

We have done some IB testing with b21 and have run into the following
problems (generally we like what we see, though :-):

* high small-packet latency (as expected)

* While PMB runs OK, hirlam (http://hirlam.knmi.nl/), our operational
weather code, hangs with rpi=ib but _works ok_ with rpi=tcp :-(
Initial observations reveal only the following:

All nodes use 100% CPU (top) but take no interrupts (vmstat). ltrace on
the processes shows them calling:
 n0 VAPI_poll_cq(0, 0x0e8f1880, 0xbfb2c70c, 0, 4) = -213
    VAPI_poll_cq(0, 0x0ea04db8, 0xbfb2c70c, 0, 4) = -213
    VAPI_poll_cq(0, 0x0eb182f0, 0xbfb2c70c, 0, 4) = -213

 n1 VAPI_poll_cq(0, 0x0e8f1880, 0xbfb2c70c, 0, 0x0e8d9470) = -213
    VAPI_poll_cq(0, 0x0ea04db8, 0xbfb2c70c, 0, 0x0e8d9470) = -213
    VAPI_poll_cq(0, 0x0eb182f0, 0xbfb2c70c, 0, 0x0e8d9470) = -213

 n2 VAPI_poll_cq(0, 0x0e8f1880, 0xbfb2cd0c, 0, 4) = -213
    VAPI_poll_cq(0, 0x0ea04db8, 0xbfb2cd0c, 0, 4) = -213
    VAPI_poll_cq(0, 0x0eb182f0, 0xbfb2cd0c, 0, 4) = -213

 n3 VAPI_poll_cq(0, 0x0e8f1880, 0xbfb2b2cc, 0, 0xbfb2b3b8) = -213
    VAPI_poll_cq(0, 0x0ea04db8, 0xbfb2b2cc, 0, 0xbfb2b3b8) = -213
    VAPI_poll_cq(0, 0x0eb182f0, 0xbfb2b2cc, 0, 0xbfb2b3b8) = -213

Observations:
On each node the hex values are constant except for the first hex column
(the second argument), which cycles through three different values:
0x0e8f1880, 0x0ea04db8, 0x0eb182f0. Every call returns -213.
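
For anyone who wants to check the same pattern in a longer trace, here is a
minimal Python sketch that tallies the second argument and the return value
of each VAPI_poll_cq line. It assumes the ltrace output has been saved as
plain text in the format shown above; the function name
summarize_poll_calls and the sample lines are illustrative only and are not
part of LAM or of the Mellanox software.

```python
import re
from collections import Counter

# Matches lines such as: VAPI_poll_cq(0, 0x0e8f1880, 0xbfb2c70c, 0, 4) = -213
CALL_RE = re.compile(r"VAPI_poll_cq\(([^)]*)\)\s*=\s*(-?\d+)")


def summarize_poll_calls(lines):
    """Count (second argument, return value) pairs over VAPI_poll_cq lines."""
    counts = Counter()
    for line in lines:
        match = CALL_RE.search(line)
        if match is None:
            continue  # not a VAPI_poll_cq line
        args = [arg.strip() for arg in match.group(1).split(",")]
        if len(args) < 2:
            continue  # unexpected argument list, skip it
        counts[(args[1], int(match.group(2)))] += 1
    return counts


# Sample taken from the n0 trace above (hypothetical stand-in for a log file).
sample = [
    " n0 VAPI_poll_cq(0, 0x0e8f1880, 0xbfb2c70c, 0, 4) = -213",
    "    VAPI_poll_cq(0, 0x0ea04db8, 0xbfb2c70c, 0, 4) = -213",
    "    VAPI_poll_cq(0, 0x0eb182f0, 0xbfb2c70c, 0, 4) = -213",
]

for (second_arg, ret), n in sorted(summarize_poll_calls(sample).items()):
    print(second_arg, ret, n)
```

If a long trace still yields only the same three second-argument values, all
with the same return value, that would match the cycling described above.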

If we find the time today we will try to figure out in which MPI call it
hangs.

/Peter

kernel: 2.4.26 kernel.org
cpu: 3.2G prescott
HCA: mellanox 23108 PCI-X
chipset: E7210+6300ESB
IB software: mellanox IB_HPC 0.5.0 with matching firmware
lam: 7.1b21
compilers: intel 8.0 Build 20040716Z Package ID: l_cc_pc_8.0.066_pe070.1
configure line: ./configure --prefix=/usr/local/lam-7.1.b21-intel --with-rpi-ib=/usr/local/ib_hpc/ib/infinihost --with-rpi=ib

> We'd really like to release 7.1 in the Very Near Future
> -- a few positive reports "from the wild" would be extremely
> appreciated.
>
> http://www.lam-mpi.org/beta/
>
> Many thanks!
>
>

-- 
------------------------------------------------------------
  Peter Kjellstroem              | E-mail: cap_at_[hidden]
  National Supercomputer Centre  | Office: +46(0)13 281492
  Linkoeping University          | Fax   : +46(0)13 282535
  SE-581 83 Linkoeping           | 
  Sweden                         | http://www.nsc.liu.se