I noticed that N was different between the two runs (30000 in the
failing run, 27500 in the passing one), and DEPTH differs as well
(0 versus 1). You might want to make sure that no other values differ
between the two runs and that a configuration error is not causing
your problems. If the problem still persists, I would try running the
test on a couple of different subsets of the failing cluster; perhaps
there really is some bad hardware causing problems for you. Generally,
one would expect other things to fail in addition to Linpack, but
stranger things have happened.
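
One quick way to compare the two runs is to pull the "parameter values"
block out of each HPL log and diff them. The Python below is only a
rough sketch: parse_params and diff_params are names made up for this
illustration (they are not part of HPL or LAM), and the two strings are
just the parameter blocks copied from the outputs quoted below.

```python
import re

# Parameter blocks copied from the two quoted HPL outputs.
FAILING = """
N : 30000
NB : 180
P : 4
Q : 4
PFACT : Crout
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : BlongM
DEPTH : 0
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
"""

PASSING = """
N : 27500
NB : 180
P : 4
Q : 4
PFACT : Crout
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : BlongM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
"""


def parse_params(text):
    """Collect 'NAME : value' pairs (hypothetical helper, not an HPL API)."""
    params = {}
    for line in text.splitlines():
        line = line.lstrip("> ").strip()  # tolerate mail-quote prefixes
        match = re.match(r"^([A-Z][A-Z0-9]*)\s*:\s*(.+)$", line)
        if match:
            params[match.group(1)] = match.group(2).strip()
    return params


def diff_params(first, second):
    """Return {name: (first_value, second_value)} for every mismatch."""
    names = sorted(set(first) | set(second))
    return {
        name: (first.get(name), second.get(name))
        for name in names
        if first.get(name) != second.get(name)
    }


differences = diff_params(parse_params(FAILING), parse_params(PASSING))
for name, (bad, good) in differences.items():
    print(f"{name}: failing run = {bad}, passing run = {good}")
```

On the two quoted logs this reports N and DEPTH as the only parameters
that differ, so rerunning both clusters with an identical HPL.dat would
be a cheap first step.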
Just to verify that there is not an error in LAM (HPL doesn't really
stress the MPI implementation from a correctness standpoint - it only
uses a couple of simple MPI functions), I would run the LAM test suite
(you can download it from our web page). I can't imagine that you are
running into MPI problems, but it is always better to err on the side
of caution.
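
As a sanity check on the failing output itself, the scaled residuals can
be recomputed from the norms HPL printed. The sketch below is plain
arithmetic on the numbers in the quoted log; scaled_residuals is a
made-up helper name, not something from HPL. One caveat, offered as an
observation from the arithmetic rather than a statement about the HPL
source: the first two checks reproduce the printed values using the
formulas as labelled, but the third printed value only seems to match
if the denominator also includes a factor of N.

```python
import math

# Values copied from the failing run quoted below.
EPS = 1.110223e-16      # relative machine precision reported by HPL
N = 30000               # order of the coefficient matrix
RES_INF = 1.539482      # ||Ax-b||_oo
A_INF = 7597.038667     # ||A||_oo
A_1 = 7601.349361       # ||A||_1
X_INF = 3.363084        # ||x||_oo
X_1 = 19859.432598      # ||x||_1
THRESHOLD = 16.0        # tests pass if scaled residuals are below this


def scaled_residuals(res_inf, a_1, a_inf, x_1, x_inf, n, eps):
    """Evaluate the three checks exactly as labelled in the HPL output."""
    r1 = res_inf / (eps * a_1 * n)
    r2 = res_inf / (eps * a_1 * x_1)
    r3 = res_inf / (eps * a_inf * x_inf)
    return r1, r2, r3


r1, r2, r3 = scaled_residuals(RES_INF, A_1, A_INF, X_1, X_INF, N, EPS)

# Assumption: dividing the third check by N is what matches the printed
# 18090934.5147445; this is inferred from the numbers, not from HPL's code.
r3_with_n = r3 / N

for label, value in [
    ("check 1", r1),
    ("check 2", r2),
    ("check 3 (as labelled)", r3),
    ("check 3 (with extra N)", r3_with_n),
]:
    verdict = "PASSED" if value < THRESHOLD else "FAILED"
    print(f"{label}: {value:.6e} {verdict}")
```

Whichever form of the third check is used, every value is many orders
of magnitude above 16.0, and ||Ax-b||_oo is of order 1, so this is not
a borderline rounding failure: the computed solution is simply wrong.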
Good luck,
Brian
On Dec 15, 2003, at 3:15 AM, J. Carlos Pérez Rodríguez wrote:
> I'm running the Linpack benchmark on 16 machines, and the test
> residuals are very large. When I run Linpack on a different set of
> 16 machines, it works. I don't know why it works on that group of
> machines but not on the first group.
> The test output is:
> ============================================================================
> HPLinpack 1.0 -- High-Performance Linpack benchmark -- September 27, 2000
> Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
> ============================================================================
>
> An explanation of the input/output parameters follows:
> T/V : Wall time / encoded variant.
> N : The order of the coefficient matrix A.
> NB : The partitioning blocking factor.
> P : The number of process rows.
> Q : The number of process columns.
> Time : Time in seconds to solve the linear system.
> Gflops : Rate of execution for solving the linear system.
>
> The following parameter values will be used:
>
> N : 30000
> NB : 180
> P : 4
> Q : 4
> PFACT : Crout
> NBMIN : 4
> NDIV : 2
> RFACT : Crout
> BCAST : BlongM
> DEPTH : 0
> SWAP : Mix (threshold = 64)
> L1 : transposed form
> U : transposed form
> EQUIL : yes
> ALIGN : 8 double precision words
>
> ----------------------------------------------------------------------------
>
> - The matrix A is randomly generated for each test.
> - The following scaled residual checks will be computed:
> 1) ||Ax-b||_oo / ( eps * ||A||_1 * N )
> 2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
> 3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
> - The relative machine precision (eps) is taken to be 1.110223e-16
> - Computational tests pass if scaled residuals are less than 16.0
>
> ============================================================================
> T/V                N    NB     P     Q               Time             Gflops
> ----------------------------------------------------------------------------
> W05C2C4        30000   180     4     4             926.10          1.944e+01
> ----------------------------------------------------------------------------
> ||Ax-b||_oo / ( eps * ||A||_1 * N ) = 60806833.2204566 ...... FAILED
> ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 91855846.7188683 ...... FAILED
> ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 18090934.5147445 ...... FAILED
> ||Ax-b||_oo . . . . . . . . . . . . . . . . . = 1.539482
> ||A||_oo . . . . . . . . . . . . . . . . . . . = 7597.038667
> ||A||_1 . . . . . . . . . . . . . . . . . . . = 7601.349361
> ||x||_oo . . . . . . . . . . . . . . . . . . . = 3.363084
> ||x||_1 . . . . . . . . . . . . . . . . . . . = 19859.432598
> ============================================================================
>
>
>
> When Linpack works, the output is:
>
> ============================================================================
> ============================================================================
> HPLinpack 1.0 -- High-Performance Linpack benchmark -- September 27, 2000
> Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
> ============================================================================
>
> An explanation of the input/output parameters follows:
> T/V : Wall time / encoded variant.
> N : The order of the coefficient matrix A.
> NB : The partitioning blocking factor.
> P : The number of process rows.
> Q : The number of process columns.
> Time : Time in seconds to solve the linear system.
> Gflops : Rate of execution for solving the linear system.
>
> The following parameter values will be used:
>
> N : 27500
> NB : 180
> P : 4
> Q : 4
> PFACT : Crout
> NBMIN : 4
> NDIV : 2
> RFACT : Crout
> BCAST : BlongM
> DEPTH : 1
> SWAP : Mix (threshold = 64)
> L1 : transposed form
> U : transposed form
> EQUIL : yes
> ALIGN : 8 double precision words
>
> ----------------------------------------------------------------------------
>
> - The matrix A is randomly generated for each test.
> - The following scaled residual checks will be computed:
> 1) ||Ax-b||_oo / ( eps * ||A||_1 * N )
> 2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
> 3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
> - The relative machine precision (eps) is taken to be 1.110223e-16
> - Computational tests pass if scaled residuals are less than 16.0
>
> ============================================================================
> T/V                N    NB     P     Q               Time             Gflops
> ----------------------------------------------------------------------------
> W15C2C4        27500   180     4     4             764.35          1.814e+01
> ----------------------------------------------------------------------------
> ||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.1817086 ...... PASSED
> ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0127607 ...... PASSED
> ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0025336 ...... PASSED
> ============================================================================
>
> Finished 1 tests with the following results:
> 1 tests completed and passed residual checks,
> 0 tests completed and failed residual checks,
> 0 tests skipped because of illegal input values.
> ----------------------------------------------------------------------------
>
> End of Tests.
> ============================================================================
>
> I run tests with mpirun:
> mpirun n1-16 -np 16 xhpl
>
> I tried looking for errors in the RAM modules with memtest and for
> network errors with netperf. I didn't see anything abnormal.
>
> Please, if somebody knows why Linpack doesn't work, send an email to:
> jcarlos_at_[hidden]
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/