If there's any way that you can give us some code that replicates this
problem (the smaller the better), that would be great.
Thanks!
On Sep 10, 2004, at 5:16 AM, Peter Kjellstroem wrote:
> We have done some IB testing with b21 and have run into the following
> problems (generally we like what we see, though :-):
>
> * high small packet latency (as expected)
>
> * while PMB runs ok, hirlam (http://hirlam.knmi.nl/), our operational
> weather code, hangs with rpi=ib but _works ok_ with rpi=tcp :-(
> Initial observations reveal only the following:
>
> All nodes use 100% cpu (top) but take no interrupts (vmstat). ltrace on
> the processes shows them calling:
> n0  VAPI_poll_cq(0, 0x0e8f1880, 0xbfb2c70c, 0, 4) = -213
>     VAPI_poll_cq(0, 0x0ea04db8, 0xbfb2c70c, 0, 4) = -213
>     VAPI_poll_cq(0, 0x0eb182f0, 0xbfb2c70c, 0, 4) = -213
>
> n1  VAPI_poll_cq(0, 0x0e8f1880, 0xbfb2c70c, 0, 0x0e8d9470) = -213
>     VAPI_poll_cq(0, 0x0ea04db8, 0xbfb2c70c, 0, 0x0e8d9470) = -213
>     VAPI_poll_cq(0, 0x0eb182f0, 0xbfb2c70c, 0, 0x0e8d9470) = -213
>
> n2  VAPI_poll_cq(0, 0x0e8f1880, 0xbfb2cd0c, 0, 4) = -213
>     VAPI_poll_cq(0, 0x0ea04db8, 0xbfb2cd0c, 0, 4) = -213
>     VAPI_poll_cq(0, 0x0eb182f0, 0xbfb2cd0c, 0, 4) = -213
>
> n3  VAPI_poll_cq(0, 0x0e8f1880, 0xbfb2b2cc, 0, 0xbfb2b3b8) = -213
>     VAPI_poll_cq(0, 0x0ea04db8, 0xbfb2b2cc, 0, 0xbfb2b3b8) = -213
>     VAPI_poll_cq(0, 0x0eb182f0, 0xbfb2b2cc, 0, 0xbfb2b3b8) = -213
>
> Observations:
> On each node the hex values are constant, except for the first hex
> argument, which cycles through three different numbers:
> 0x0e8f1880, 0x0ea04db8, 0x0eb182f0
>
> If we find the time today we will try to figure out in which MPI call
> it hangs.
>
> /Peter
>
>
> kernel: 2.4.26 kernel.org
> cpu: 3.2G prescott
> HCA: mellanox 23108 PCI-X
> chipset: E7210+6300ESB
> IB software: mellanox IB_HPC 0.5.0 with matching firmware
> lam: 7.1b21
> compilers: intel 8.0 Build 20040716Z Package ID: l_cc_pc_8.0.066_pe070.1
> configure line: ./configure --prefix=/usr/local/lam-7.1.b21-intel
>   --with-rpi-ib=/usr/local/ib_hpc/ib/infinihost --with-rpi=ib
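
The ltrace observation quoted above (every call returning -213, with only
the first hex argument changing) can be checked mechanically on a longer
trace. What follows is only a rough sketch, not part of LAM or of the
original report: it assumes ltrace lines in the same "func(args) = ret"
form as the quoted output, and the function name summarize_calls is made
up for illustration.

```python
import re
from collections import Counter

# Matches completed calls in the form quoted above, for example:
#   VAPI_poll_cq(0, 0x0e8f1880, 0xbfb2c70c, 0, 4) = -213
# Assumption: one call per line, no nested parentheses in the arguments.
CALL_RE = re.compile(r"(\w+)\(([^)]*)\)\s*=\s*(-?\w+)")


def summarize_calls(lines):
    """Count (function, second argument, return value) triples.

    `lines` is any iterable of strings holding ltrace output. Lines that
    do not look like a completed call are skipped.
    """
    counts = Counter()
    for line in lines:
        match = CALL_RE.search(line)
        if match is None:
            continue
        func, args, ret = match.groups()
        arg_list = [a.strip() for a in args.split(",")]
        # The second argument is the value that cycles in the quoted trace.
        second = arg_list[1] if len(arg_list) > 1 else None
        counts[(func, second, ret)] += 1
    return counts
```

Feeding it a saved trace, e.g. summarize_calls(open("ltrace.out")) with a
hypothetical file name, gives one counter entry per distinct second
argument and return value, so a process that is only spinning on the same
few handles shows up as a very small number of entries with large counts.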
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/