I've also tried the "-ssi rpi tcp" option; the same problem exists
under LAM 7.1.
Normally, we run mpirun like the following:
mpirun -ssi rpi gm -np 8 /crunch/qkli/bin/test
Since our cluster consists of dual-CPU nodes, running the test with -np 2
(on a single node) shows no memory leak. Starting from -np 4
(using two nodes), the memory leak occurs.
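For reference, the test is essentially a loop that allocates a buffer,
passes it around the ranks, and frees it again. A simplified sketch of
that pattern (not the actual code behind /crunch/qkli/bin/test; the names
and sizes here are made up) looks like this:

    #include <stdlib.h>
    #include <string.h>
    #include <mpi.h>

    /* hypothetical stand-in for the real test program */
    #define NDOUBLES (1 << 20)   /* 1M doubles (8 MB) per message */
    #define NITERS   10000

    int main(int argc, char **argv)
    {
        int rank, size, i;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (i = 0; i < NITERS; i++) {
            /* fresh allocation every iteration, freed at the end of it */
            double *buf = malloc(NDOUBLES * sizeof(double));
            memset(buf, 0, NDOUBLES * sizeof(double));

            /* ring exchange so every rank both sends and receives */
            MPI_Sendrecv_replace(buf, NDOUBLES, MPI_DOUBLE,
                                 (rank + 1) % size, 0,
                                 (rank + size - 1) % size, 0,
                                 MPI_COMM_WORLD, &status);

            free(buf);
        }

        MPI_Finalize();
        return 0;
    }

With matched malloc/free like this, the resident size of each process
should not keep growing over the iterations.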
The following is a snapshot of laminfo:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
qkli_at_apollo01:~> laminfo
LAM/MPI: 7.1
Prefix: /usr/rels/lam
Architecture: x86_64-unknown-linux-gnu
Configured by: root
Configured on: Sun Sep 19 19:49:57 EDT 2004
Configure host: apollo01
Memory manager: ptmalloc2
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C compiler: gcc
C++ compiler: g++
Fortran compiler: pgf90
Fortran symbols: underscore
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
C++ exceptions: no
Thread support: yes
ROMIO support: yes
IMPI support: no
Debug support: no
Purify clean: no
SSI boot: globus (API v1.1, Module v0.6)
SSI boot: rsh (API v1.1, Module v1.1)
SSI boot: slurm (API v1.1, Module v1.0)
SSI boot: tm (API v1.1, Module v1.1)
SSI coll: lam_basic (API v1.1, Module v7.1)
SSI coll: shmem (API v1.1, Module v1.0)
SSI coll: smp (API v1.1, Module v1.2)
SSI rpi: crtcp (API v1.1, Module v1.1)
SSI rpi: gm (API v1.1, Module v1.2)
SSI rpi: lamd (API v1.0, Module v7.1)
SSI rpi: sysv (API v1.0, Module v7.1)
SSI rpi: tcp (API v1.0, Module v7.1)
SSI rpi: usysv (API v1.0, Module v7.1)
SSI cr: self (API v1.0, Module v1.0)
qkli_at_apollo01:~>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
I've also noticed strange behavior under SuSE Linux 9.1 (64-bit): for
example, if you use malloc and free in a subroutine and call the
subroutine repeatedly, the memory does not seem to be freed properly.
This only happens in the 64-bit case.
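To be concrete, the pattern I mean is just a subroutine that mallocs a
block, touches it, and frees it, called from a loop while watching the
process size (with top or similar). A stripped-down sketch (not our
actual code) is something like:

    #include <stdlib.h>
    #include <string.h>

    /* allocate a block, touch every page, then free it */
    static void work(size_t nbytes)
    {
        char *p = malloc(nbytes);
        if (p == NULL)
            return;
        memset(p, 0, nbytes);
        free(p);
    }

    int main(void)
    {
        int i;

        /* with matched malloc/free the resident size reported by
           top/ps should stay roughly constant across iterations */
        for (i = 0; i < 100000; i++)
            work(4 * 1024 * 1024);   /* 4 MB per call */

        return 0;
    }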
Best regards,
Qikai
On Thu, 2004-10-14 at 14:20, Brian Barrett wrote:
> On Oct 14, 2004, at 9:37 AM, Qikai Li wrote:
>
> > The same code runs perfectly under LAM 7.0.6 with stable memory
> > usage.
> >
> > Several people in our group have seen the same problem since I
> > switched LAM from 7.0.6 to 7.1.
> >
> > Maybe this is related to a possible bug in gcc, i.e., the memory is NOT
> > properly freed in a 64-bit environment even though you use matched
> > pairs of malloc (or calloc) and free.
> >
> > Also, the problem seems to be specific to 64-bit.
> >
> > Or maybe it's the problem of LAM 7.1.
>
> Thanks for the bug report. We only have access to one Opteron machine
> and it doesn't have Myrinet, so I was wondering if you could run a
> couple of tests for me to help localize the problem. First, could you
> send me the output from the "laminfo" command? A number of things
> changed between 7.0 and 7.1, so I'm hoping we can narrow the problem
> down to a particular component. Does the memory leak happen regardless
> of the number of processes running?
>
> Also, could you see if it happens with the following SSI options:
>
> -ssi rpi tcp (use tcp instead of gm)
> -ssi coll lam_basic (use the really simple collectives code)
>
> You specify the SSI params on the mpirun command line, so something
> like: "mpirun -np 4 -ssi rpi tcp ./a.out"
>
> I'm looking at the problem as well, but having some starting points
> would really help.
>
> Thanks!
>
> Brian
--
Qikai Li
School of Materials Science and Engineering
Georgia Institute of Technology
771 Ferst Drive N.W.
Atlanta, Georgia 30332
Email: qikai.li_at_[hidden]
Phone: 404-385-2852