LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-01-12 17:01:06


Can you try upgrading to LAM/MPI 7.1.2? There was at least one fix
that may help the situation.
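
For context, "Too many open files in system" is the message for ENFILE, which means the kernel-wide file table is full, not just a per-process limit. The following is only a minimal sketch, assuming a Linux node that exposes /proc/sys/fs/file-nr (three numbers: allocated handles, allocated-but-unused handles, and the system maximum). The function names and the 0.9 threshold are hypothetical, chosen for illustration; nothing here is part of LAM/MPI, and it only reports how close a node is to the limit rather than fixing any leak.

```python
from pathlib import Path

# Linux-specific path; assumed to exist on the affected node.
FILE_NR = Path("/proc/sys/fs/file-nr")


def parse_file_nr(text):
    """Return (allocated, free, maximum) parsed from the contents of file-nr."""
    allocated, free, maximum = (int(field) for field in text.split())
    return allocated, free, maximum


def handles_in_use(text):
    """Return (handles in use, system-wide maximum)."""
    allocated, free, maximum = parse_file_nr(text)
    return allocated - free, maximum


def near_limit(text, threshold=0.9):
    """True if usage is at or above `threshold` (a hypothetical cutoff) of the maximum."""
    in_use, maximum = handles_in_use(text)
    return in_use >= threshold * maximum


if __name__ == "__main__":
    if FILE_NR.exists():
        contents = FILE_NR.read_text()
        in_use, maximum = handles_in_use(contents)
        print(f"file handles in use: {in_use} of {maximum}")
        if near_limit(contents):
            print("warning: close to the system-wide open-file limit")
    else:
        print(f"{FILE_NR} not found; this check only applies to Linux")
```

Running this on a node before and after a batch of interrupted mpiruns would show whether the in-use count keeps climbing, which is one way to tell whether the upgrade helped.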

On Jan 12, 2007, at 9:48 AM, Jeho Park wrote:

> Hi
>
> http://lam-mpi.miscellaneousmirror.org/MailArchives/lam/2003/03/5608.php
> Has the _known_ issue been solved yet?
>
> I am in the same situation as Karl in
> http://lam-mpi.miscellaneousmirror.org/MailArchives/lam/2003/03/5606.php.
> The "Too many open files in system" error occurred when many (> 20)
> mpiruns were interrupted and killed, _and_ especially when many mpirun
> processes were running at the same time.
>
> Once the error occurs, the mpirun command no longer works on the node. It
> just complains about open files and exits. Sometimes the error causes the
> node to hang, so that no ssh session can be established to run the
> lamclean command.
>
> My config is
> Linux CentOS (kernel 2.4.21) and LAM/MPI 7.1.1.
>
> my laminfo gives
> =============/ snip /====================
> $ laminfo
> LAM/MPI: 7.1.1
> Prefix: /usr
> Architecture: i686-redhat-linux-gnu
> Configured by: cmc
> Configured on: Thu Apr 14 12:56:01 PDT 2005
> Configure host: vosill.math.hmc.edu
> Memory manager: ptmalloc2
> C bindings: yes
> C++ bindings: yes
> Fortran bindings: yes
> C compiler: gcc
> C++ compiler: g++
> Fortran compiler: g77
> Fortran symbols: double_underscore
> C profiling: yes
> C++ profiling: yes
> Fortran profiling: yes
> C++ exceptions: no
> Thread support: yes
> ROMIO support: yes
> IMPI support: no
> Debug support: no
> Purify clean: no
> SSI boot: globus (API v1.1, Module v0.6)
> SSI boot: rsh (API v1.1, Module v1.1)
> SSI boot: slurm (API v1.1, Module v1.0)
> SSI coll: lam_basic (API v1.1, Module v7.1)
> SSI coll: shmem (API v1.1, Module v1.0)
> SSI coll: smp (API v1.1, Module v1.2)
> SSI rpi: crtcp (API v1.1, Module v1.1)
> SSI rpi: lamd (API v1.0, Module v7.1)
> SSI rpi: sysv (API v1.0, Module v7.1)
> SSI rpi: tcp (API v1.0, Module v7.1)
> SSI rpi: usysv (API v1.0, Module v7.1)
> SSI cr: self (API v1.0, Module v1.0)
> ================/ snip /====================
>
> Would there be a workaround other than re-lamboot or lamclean?
>
> Thanks in advance.
>
> Jeho
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems