On Thu, 9 Sep 2004, Jeff Squyres wrote:
> Folks --
>
> It's been a long time coming, but I think we're getting near the end of
> the road. I uploaded 7.1b21 last night, and we're looking good so far.
> We've fixed a lot of bugs -- 7.1 appears more stable than ever.
>
> Given that we're so close, I'd like to ask people to help us have a
> bug-free release by downloading b21 and giving it a whirl in your
> environment. In particular, Myrinet and Infiniband testing would be
> most helpful.
We've done some IB testing with b21 and have run into the following
problems (generally we like what we see, though :-):
* high small packet latency (as expected)
* while PMB runs ok, hirlam (http://hirlam.knmi.nl/), our operational
weather code, hangs with rpi=ib but _works ok_ with rpi=tcp :-(
Initial observations reveal only the following:
All nodes use 100% CPU (top) but take no interrupts (vmstat). ltrace on
the processes shows them calling:
n0  VAPI_poll_cq(0, 0x0e8f1880, 0xbfb2c70c, 0, 4) = -213
    VAPI_poll_cq(0, 0x0ea04db8, 0xbfb2c70c, 0, 4) = -213
    VAPI_poll_cq(0, 0x0eb182f0, 0xbfb2c70c, 0, 4) = -213
n1  VAPI_poll_cq(0, 0x0e8f1880, 0xbfb2c70c, 0, 0x0e8d9470) = -213
    VAPI_poll_cq(0, 0x0ea04db8, 0xbfb2c70c, 0, 0x0e8d9470) = -213
    VAPI_poll_cq(0, 0x0eb182f0, 0xbfb2c70c, 0, 0x0e8d9470) = -213
n2  VAPI_poll_cq(0, 0x0e8f1880, 0xbfb2cd0c, 0, 4) = -213
    VAPI_poll_cq(0, 0x0ea04db8, 0xbfb2cd0c, 0, 4) = -213
    VAPI_poll_cq(0, 0x0eb182f0, 0xbfb2cd0c, 0, 4) = -213
n3  VAPI_poll_cq(0, 0x0e8f1880, 0xbfb2b2cc, 0, 0xbfb2b3b8) = -213
    VAPI_poll_cq(0, 0x0ea04db8, 0xbfb2b2cc, 0, 0xbfb2b3b8) = -213
    VAPI_poll_cq(0, 0x0eb182f0, 0xbfb2b2cc, 0, 0xbfb2b3b8) = -213
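Observations:
Within each node the hex values are constant except for the second
argument (the first hex column), which cycles through three values:
0x0e8f1880, 0x0ea04db8, 0x0eb182f0. The same three values appear on
every node, and every call returns -213.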
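
For anyone who wants to check a longer ltrace capture for the same
pattern, here is a rough sketch that tallies the distinct values seen in
each argument position of VAPI_poll_cq and the return codes. The function
name summarize_poll_calls is made up for illustration, the sample lines
are the n0 lines from the trace above, and the regex assumes ltrace
prints the calls in the same form as shown here.

```python
import re
from collections import Counter

# Matches ltrace lines such as:
#   VAPI_poll_cq(0, 0x0e8f1880, 0xbfb2c70c, 0, 4) = -213
# (assumes the output format shown in the trace above)
CALL_RE = re.compile(r"VAPI_poll_cq\(([^)]*)\)\s*=\s*(-?\d+)")


def summarize_poll_calls(lines):
    """Count distinct values per argument position and per return code.

    Hypothetical helper, not part of LAM or VAPI.
    """
    arg_values = {}           # argument position -> Counter of values seen
    return_codes = Counter()  # return code (as text) -> number of calls
    for line in lines:
        match = CALL_RE.search(line)
        if match is None:
            continue  # skip anything that is not a VAPI_poll_cq call
        args = [a.strip() for a in match.group(1).split(",")]
        for pos, value in enumerate(args):
            arg_values.setdefault(pos, Counter())[value] += 1
        return_codes[match.group(2)] += 1
    return arg_values, return_codes


# The three n0 lines from the trace above.
sample = [
    "VAPI_poll_cq(0, 0x0e8f1880, 0xbfb2c70c, 0, 4) = -213",
    "VAPI_poll_cq(0, 0x0ea04db8, 0xbfb2c70c, 0, 4) = -213",
    "VAPI_poll_cq(0, 0x0eb182f0, 0xbfb2c70c, 0, 4) = -213",
]

arg_values, return_codes = summarize_poll_calls(sample)
for pos in sorted(arg_values):
    print("arg", pos, sorted(arg_values[pos]))
print("return codes", dict(return_codes))
```

An argument position that reports more than one value is one that
changes between calls; in the sample only position 1 does. To run it on a
real capture, pass the lines of the saved ltrace output instead of the
sample list.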
If we find the time today we will try to figure out in which MPI call it
hangs.
/Peter
kernel: 2.4.26 kernel.org
cpu: 3.2G prescott
HCA: mellanox 23108 PCI-X
chipset: E7210+6300ESB
IB software: mellanox IB_HPC 0.5.0 with matching firmware
lam: 7.1b21
compilers: intel 8.0 Build 20040716Z Package ID: l_cc_pc_8.0.066_pe070.1
configure line: ./configure --prefix=/usr/local/lam-7.1.b21-intel --with-rpi-ib=/usr/local/ib_hpc/ib/infinihost --with-rpi=ib
> We'd really like to release 7.1 in the Very Near Future
> -- a few positive reports "from the wild" would be extremely
> appreciated.
>
> http://www.lam-mpi.org/beta/
>
> Many thanks!
>
>
--
------------------------------------------------------------
Peter Kjellstroem | E-mail: cap_at_[hidden]
National Supercomputer Centre | Office: +46(0)13 281492
Linkoeping University | Fax : +46(0)13 282535
SE-581 83 Linkoeping |
Sweden | http://www.nsc.liu.se
|