Hi,
Has the _known_ issue mentioned in
http://lam-mpi.miscellaneousmirror.org/MailArchives/lam/2003/03/5608.php
been solved yet?
I am in the same situation as Karl in
http://lam-mpi.miscellaneousmirror.org/MailArchives/lam/2003/03/5606.php.
The "Too many open files in system" error occurs when many ( > 20)
mpiruns are interrupted and killed, _and_ especially when many mpirun
processes are running at the same time.
Once the error occurs, the mpirun command no longer works on the node.
It just complains about open files and exits. Sometimes the error causes
the node to hang, so that no ssh session can be established to run the
lamclean command.
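In case it helps with diagnosis: as far as I understand, "Too many open files in system" is the message for ENFILE, i.e. the kernel-wide file table is full, not a per-process limit. Below is a minimal sketch for checking how close a node is to that limit, assuming a Linux node where /proc/sys/fs/file-nr is readable. The function names and the 0.9 threshold are my own choices for illustration and are not part of LAM/MPI.

```python
def parse_file_nr(text):
    """Parse the contents of /proc/sys/fs/file-nr.

    The file holds three integers: allocated file handles,
    allocated-but-unused handles, and the system-wide maximum.
    """
    allocated, unused, maximum = (int(x) for x in text.split()[:3])
    return {
        "allocated": allocated,
        "unused": unused,
        "max": maximum,
        "in_use": allocated - unused,
    }


def near_limit(stats, threshold=0.9):
    """Return True if handles in use reach `threshold` of the maximum.

    The 0.9 default is an arbitrary illustrative value.
    """
    return stats["in_use"] >= threshold * stats["max"]


def read_system_file_handles(path="/proc/sys/fs/file-nr"):
    """Read and parse the kernel's file handle counters (Linux only)."""
    with open(path) as f:
        return parse_file_nr(f.read())
```

On a node, calling read_system_file_handles() before and after a batch of killed mpiruns should show whether the handle count keeps growing, and near_limit() on the result flags a node that is about to hit the error.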
My config is
Linux CentOS (kernel 2.4.21) and LAM/MPI 7.1.1.
My laminfo gives
=============/ snip /====================
$ laminfo
LAM/MPI: 7.1.1
Prefix: /usr
Architecture: i686-redhat-linux-gnu
Configured by: cmc
Configured on: Thu Apr 14 12:56:01 PDT 2005
Configure host: vosill.math.hmc.edu
Memory manager: ptmalloc2
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C compiler: gcc
C++ compiler: g++
Fortran compiler: g77
Fortran symbols: double_underscore
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
C++ exceptions: no
Thread support: yes
ROMIO support: yes
IMPI support: no
Debug support: no
Purify clean: no
SSI boot: globus (API v1.1, Module v0.6)
SSI boot: rsh (API v1.1, Module v1.1)
SSI boot: slurm (API v1.1, Module v1.0)
SSI coll: lam_basic (API v1.1, Module v7.1)
SSI coll: shmem (API v1.1, Module v1.0)
SSI coll: smp (API v1.1, Module v1.2)
SSI rpi: crtcp (API v1.1, Module v1.1)
SSI rpi: lamd (API v1.0, Module v7.1)
SSI rpi: sysv (API v1.0, Module v7.1)
SSI rpi: tcp (API v1.0, Module v7.1)
SSI rpi: usysv (API v1.0, Module v7.1)
SSI cr: self (API v1.0, Module v1.0)
================/ snip /====================
Is there a workaround other than running lamboot again or lamclean?
Thanks in advance.
Jeho