
LAM/MPI General User's Mailing List Archives


From: Davide Cittaro (davide.cittaro_at_[hidden])
Date: 2006-06-08 10:02:34


Hi and thanks...
On Jun 8, 2006, at 3:52 PM, esaifu wrote:

> Make sure that you have listed all your nodes, including the master,
> along with their CPU counts, in the "lam-bhost.def" file

I have.
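
(For reference, my hosts file follows the usual LAM boot-schema format of
one host per line with a CPU count; a minimal sketch, with placeholder
hostnames, looks like:

   master.mycluster cpu=2
   n0.mycluster cpu=2
   n1.mycluster cpu=2
   ...

so that lamboot sees two CPUs per node and "mpirun -np 20" can place one
process per CPU across the ten nodes.)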

> (This file will be in <LAM installation path>/etc/lam-bhost.def). You
> can also try the HPL.dat file which I am attaching along with this
> mail.

This is the output using your HPL.dat:

$ mpirun -np 20 /usr/bin/xhpl
============================================================================
HPLinpack 1.0a  --  High-Performance Linpack benchmark  --  January 20, 2004
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs., UTK
============================================================================

An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N : 57965
NB : 200
PMAP : Row-major process mapping
P : 4
Q : 5
PFACT : Left Crout Right
NBMIN : 8
NDIV : 2
RFACT : Right
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 200)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words

----------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
    1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
    2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
    3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code.  This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 27606 failed on node n2 (85.239.175.38) due to signal 9.
----------------------------------------------------------------------------
I tried xhpl on another small heterogeneous LAM/MPI cluster (7.0.6) and it
works there. Is 7.1.1 that different? I've read elsewhere that one can have
problems allocating memory for MPI processes (although that was with MPICH
and its P4_GLOBMEMSIZE variable), but is anything like that configurable
under LAM/MPI?
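
(Signal 9 is SIGKILL, so I suppose the process is being killed from outside
rather than failing inside HPL itself. Just as a guess, running something
like

   $ ulimit -a
   $ dmesg | grep -i -e kill -e memory

on the failing node might show whether a per-process resource limit or the
kernel's out-of-memory killer is involved.)
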
> Please let me know if it works. If swap is being used while running
> xhpl, just reduce the matrix size in the HPL.dat file and try again.
> You can set the matrix size up to 57965; only then will the system
> use the whole memory.
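
(If I understand the sizing rule correctly, that figure comes from fitting
the N x N double-precision matrix into most of the aggregate RAM; a rough
back-of-the-envelope check for my 10 nodes x 4 GB:

   memory for A = N * N * 8 bytes
   N = 57965  ->  57965^2 * 8 bytes  ~=  27 GB  of the ~40 GB total

leaving the rest for the OS and HPL's own buffers.)
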
I would like to do this... but I have a 515 limit :-(
d
> Hence you
> will get better performance.
> ----- Original Message ----- From: "Davide Cittaro"  
> <davide.cittaro_at_[hidden]>
> To: <lam_at_[hidden]>
> Sent: Thursday, June 08, 2006 5:10 PM
> Subject: LAM: xhpl crashes
>
>
>> Hi there, I'm pretty new to LAM/MPI, so please be patient with me ;-)
>> I've installed a 10-node dual-Opteron cluster with Gentoo Linux and
>> LAM/MPI 7.1.1, connected with Gigabit Ethernet, and it works fine
>> (even coupled with SGE).
>> I would now like to test the cluster with Linpack, so I've downloaded
>> and installed xhpl. As I increase the N value (the problem size), it
>> crashes. In more detail:
>> 10 nodes, 2 CPUs/node, 4 GB RAM/node, running
>>
>> $ mpirun -np 20 /usr/bin/xhpl
>> --------------------------------------------------------------------------
>> One of the processes started by mpirun has exited with a nonzero exit
>> code.  This typically indicates that the process finished in error.
>> If your process did not finish in error, be sure to include a "return
>> 0" or "exit(0)" in your C code before exiting the application.
>>
>> PID 12824 failed on node n0 (85.239.175.36) due to signal 9.
>> --------------------------------------------------------------------------
>>
>> Looking at the HPL.out file, it crashes at N=520... I'm confused: from
>> what I read on their website, I should be able to use values up to
>> 40000, given my cluster configuration.
>>
>> $ head -n6 HPL.dat
>> HPLinpack benchmark input file
>> Innovative Computing Laboratory, University of Tennessee
>> HPL.out      output file name (if any)
>> 1            device out (6=stdout,7=stderr,file)
>> 4            # of problems sizes (N)
>> 511 515 520 525 Ns
>>
>> Does anybody here have the same problem?
>>
>> Thanks
>>
>> d
>>
>> /*
>> Davide Cittaro
>> Bioinformatics Systems @ Informatics Core
>>
>> IFOM - Istituto FIRC di Oncologia Molecolare
>> via adamello, 16
>> 20139 Milano
>> Italy
>>
>> tel.: +39(02)574303355
>> e-mail: davide.cittaro_at_[hidden]
>> */
>>
>>
>>
>>
>> <HPL.dat>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
/*
Davide Cittaro
Bioinformatics Systems @ Informatics Core
IFOM - Istituto FIRC di Oncologia Molecolare
via adamello, 16
20139 Milano
Italy
tel.: +39(02)574303355
e-mail: davide.cittaro_at_[hidden]
*/