
LAM/MPI General User's Mailing List Archives


From: Xiangyu Dong (xydong11_at_[hidden])
Date: 2009-07-07 14:54:02


Hi all,

I'm trying to use the checkpoint/restart mechanism for NPB benchmarks on a
quad-core workstation.

I've installed LAM 7.1.4 and BLCR 0.8.2 successfully (at least I think so).

However, when I ran:

mpirun -np 4 bin/bt.C.4
and
lamcheckpoint -ssi cr blcr -pid <mpirun pid>

The checkpointing operation seems to finish instantly, and the resulting
"context" file is only 400KB, while the context file for the serial version
of bt.C is nearly 1.5GB. When I then tried to restart, I got the following
error message:

"mpirun (rpwait): Bad file descriptor"
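
For reference, the checkpoint sequence above can be sketched as a small
script. The helper names (checkpoint_cmd, context_size_kb) are hypothetical
and only there for illustration; the lamcheckpoint arguments are the ones I
used above, and the context file path is left as a placeholder because I am
not sure of the naming scheme.

```shell
#!/bin/sh
# Hypothetical helper: print the lamcheckpoint command for a given mpirun PID.
# The flags are exactly those quoted in this message; nothing else is assumed.
checkpoint_cmd() {
    echo "lamcheckpoint -ssi cr blcr -pid $1"
}

# Hypothetical helper: report a file's size in whole KB, so that a
# suspiciously small context file stands out.
context_size_kb() {
    bytes=$(wc -c < "$1")      # byte count of the file
    echo $((bytes / 1024))     # integer division down to KB
}

# Intended use (commented out so that sourcing this file runs nothing):
#   mpirun -np 4 bin/bt.C.4 &
#   pid=$!                         # PID of mpirun itself
#   sleep 60                       # let the benchmark get going first
#   $(checkpoint_cmd "$pid")       # runs lamcheckpoint with that PID
#   context_size_kb <context file> # placeholder path, not a real name
```

Comparing the reported size against the serial context file is a quick
sanity check of whether the application processes were captured at all.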

So I'm wondering whether the checkpoint was taken correctly, or whether it
only captured mpirun itself. I also read some previous threads and tried
running mpirun with options like:

mpirun -ssi rpi crtcp -ssi cr blcr -np 4 bin/bt.C.4

but got error messages like this:

-----------------------------------------------------------------------------
   The "blcr" module requested in the CR kind was not found.

   This typically means that you misspelled the desired module name, or used
   the wrong name entirely.

-----------------------------------------------------------------------------

-----------------------------------------------------------------------------
   It seems that [at least] one of the processes that was started with
   mpirun did not invoke MPI_INIT before quitting (it is possible that
   more than one process did not invoke MPI_INIT -- mpirun was only
   notified of the first one, which was on node n-114652544).

   mpirun can *only* be used with MPI programs (i.e., programs that
   invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
   to run non-MPI programs over the lambooted nodes.

My BLCR is compiled with the option --enable-static, and LAM is compiled
with the options --with-rpi=crtcp and --with-cr-blcr.

The BLCR library works fine for serial codes with cr_run, cr_checkpoint, and
cr_restart.

The laminfo output is listed below:

             LAM/MPI: 7.1.4
              Prefix: /usr
        Architecture: x86_64-unknown-linux-gnu
       Configured by: xiangyu
       Configured on: Tue Jul 7 04:19:53 PDT 2009
      Configure host: airsim-01
      Memory manager: ptmalloc2
          C bindings: yes
        C++ bindings: yes
    Fortran bindings: yes
          C compiler: gcc
        C++ compiler: g++-4.1
    Fortran compiler: gfortran
     Fortran symbols: underscore
         C profiling: yes
       C++ profiling: yes
   Fortran profiling: yes
      C++ exceptions: no
      Thread support: yes
       ROMIO support: yes
        IMPI support: no
       Debug support: no
        Purify clean: no
            SSI boot: globus (API v1.1, Module v0.6)
            SSI boot: rsh (API v1.1, Module v1.1)
            SSI boot: slurm (API v1.1, Module v1.0)
            SSI coll: lam_basic (API v1.1, Module v7.1)
            SSI coll: shmem (API v1.1, Module v1.0)
            SSI coll: smp (API v1.1, Module v1.2)
             SSI rpi: crtcp (API v1.1, Module v1.1)
             SSI rpi: lamd (API v1.0, Module v7.1)
             SSI rpi: sysv (API v1.0, Module v7.1)
             SSI rpi: tcp (API v1.0, Module v7.1)
             SSI rpi: usysv (API v1.0, Module v7.1)
              SSI cr: blcr (API v1.0, Module v1.1)
              SSI cr: self (API v1.0, Module v1.0)
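
Since the error says the "blcr" module was not found while laminfo lists it,
a quick scripted check of the laminfo output might help rule out a mismatch.
This is only a sketch: has_ssi_module is a hypothetical helper, and it
assumes the "SSI <kind>: <name> (API ...)" line format shown above.

```shell
#!/bin/sh
# Hypothetical helper: read laminfo output on stdin and succeed if a line
# such as "SSI cr: blcr (API ...)" is present for the given kind and name.
has_ssi_module() {
    grep -Eq "^[[:space:]]*SSI $1: $2 "
}

# Intended use (commented out so that sourcing this file runs nothing):
#   laminfo | has_ssi_module cr blcr && echo "cr/blcr is listed"
#   laminfo | has_ssi_module rpi crtcp && echo "rpi/crtcp is listed"
#   command -v mpirun      # check that mpirun and laminfo resolve to
#   command -v laminfo     # the same installation prefix
```

If mpirun and laminfo resolve to different prefixes, the laminfo listing
would not describe the mpirun that printed the error.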

Thanks,
-Rio