LAM/MPI General User's Mailing List Archives

From: Jeho Park (tojeho_at_[hidden])
Date: 2007-01-12 09:48:57


Hi

http://lam-mpi.miscellaneousmirror.org/MailArchives/lam/2003/03/5608.php
Has the _known_ issue been solved yet?

I am in the same situation as Karl in
http://lam-mpi.miscellaneousmirror.org/MailArchives/lam/2003/03/5606.php.
The "Too many open files in system" error occurred when many ( > 20)
mpiruns were interrupted and killed, _and_ especially when many mpirun
processes were running at the same time.

Once the error occurs, the mpirun command no longer works on the node.
It just complains about open files and exits. Sometimes the error causes
the node to hang, so that no ssh session can be established to run the
lamclean command.
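
For anyone trying to catch this condition before the node locks up: the
message "Too many open files in system" is the text for the ENFILE error,
meaning the kernel's system-wide file table is full, and on Linux that
table can be watched through /proc/sys/fs/file-nr. Below is a minimal
Python sketch of such a check. It is only an illustration: the function
names are hypothetical, it has nothing to do with LAM itself, and it reads
only the first and last fields of file-nr (allocated handles and the
maximum), because the meaning of the middle field differs between kernel
versions.

```python
from pathlib import Path

# Linux-only: the kernel's system-wide file handle counters.
FILE_NR = Path("/proc/sys/fs/file-nr")


def parse_file_nr(text):
    """Return (allocated, maximum) from the contents of file-nr.

    The file holds three whitespace-separated integers. The middle one
    is ignored here because its meaning varies by kernel version.
    """
    fields = text.split()
    if len(fields) != 3:
        raise ValueError("expected 3 fields, got %d" % len(fields))
    allocated, _middle, maximum = (int(f) for f in fields)
    return allocated, maximum


def usage_fraction(allocated, maximum):
    """Fraction of the system-wide file table currently allocated."""
    return allocated / maximum if maximum > 0 else 0.0


if __name__ == "__main__" and FILE_NR.exists():
    allocated, maximum = parse_file_nr(FILE_NR.read_text())
    print("%d of %d file handles allocated (%.0f%%)"
          % (allocated, maximum, 100 * usage_fraction(allocated, maximum)))
```

Running this periodically while the mpirun jobs are being started and
killed would show whether the allocated count climbs toward the maximum
and stays there after the jobs are gone.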

My configuration is CentOS Linux (kernel 2.4.21) with LAM/MPI 7.1.1.

My laminfo output is:
=============/ snip /====================
$ laminfo
              LAM/MPI: 7.1.1
               Prefix: /usr
         Architecture: i686-redhat-linux-gnu
        Configured by: cmc
        Configured on: Thu Apr 14 12:56:01 PDT 2005
       Configure host: vosill.math.hmc.edu
       Memory manager: ptmalloc2
           C bindings: yes
         C++ bindings: yes
     Fortran bindings: yes
           C compiler: gcc
         C++ compiler: g++
     Fortran compiler: g77
      Fortran symbols: double_underscore
          C profiling: yes
        C++ profiling: yes
    Fortran profiling: yes
       C++ exceptions: no
       Thread support: yes
        ROMIO support: yes
         IMPI support: no
        Debug support: no
         Purify clean: no
             SSI boot: globus (API v1.1, Module v0.6)
             SSI boot: rsh (API v1.1, Module v1.1)
             SSI boot: slurm (API v1.1, Module v1.0)
             SSI coll: lam_basic (API v1.1, Module v7.1)
             SSI coll: shmem (API v1.1, Module v1.0)
             SSI coll: smp (API v1.1, Module v1.2)
              SSI rpi: crtcp (API v1.1, Module v1.1)
              SSI rpi: lamd (API v1.0, Module v7.1)
              SSI rpi: sysv (API v1.0, Module v7.1)
              SSI rpi: tcp (API v1.0, Module v7.1)
              SSI rpi: usysv (API v1.0, Module v7.1)
               SSI cr: self (API v1.0, Module v1.0)
================/ snip /====================

Is there a workaround other than running lamboot again or lamclean?

Thanks in advance.

Jeho