LAM/MPI General User's Mailing List Archives

From: 张侃 (zhangkan440_at_[hidden])
Date: 2008-11-24 15:04:42


Hi folks,

I am currently using LAM+BLCR on my Fedora 8 Linux cluster.
The problem is that I can checkpoint the mpirun correctly, but cannot restart it.

First I run an MPI app like this (I also tried the --ssi rpi crtcp -ssi cr blcr
options):
---------------------------------------------
mpirun C hello
---------------------------------------------
The output is like this:
---------------------------------------------
Hello, world! I am 0 of 2, iter 0
Hello, world! I am 1 of 2, iter 0
Hello, world! I am 0 of 2, iter 1
Hello, world! I am 1 of 2, iter 1
...
---------------------------------------------

Then I checkpoint this mpirun (assume the mpirun PID is 12345).
I tried the following commands:
---------------------------------------------
cr_checkpoint 12345
or
lamcheckpoint -ssi cr blcr -pid 12345
---------------------------------------------

After that, I found 3 files in my home dir (I only configured 2 nodes, n0 and
n1, so the checkpoint appears to have worked correctly):
---------------------------------------------
context.12345 context.12345-n0-12346 context.12345-n1-23455
---------------------------------------------
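
As a rough way to confirm that one context file per node was written, something like the sketch below could be used. It assumes the naming pattern seen in the listing above (context.<mpirun pid>-n<node>-<pid>); the function name count_node_contexts is hypothetical and is not part of LAM or BLCR.

```shell
# Hypothetical helper (not part of LAM or BLCR): count the per-node
# context files written for a given mpirun PID.
# Assumes the naming pattern context.<mpirun_pid>-n<node>-<pid>.
count_node_contexts() {
    dir=$1
    mpirun_pid=$2
    count=0
    for f in "$dir"/context."$mpirun_pid"-n*-*; do
        # Skip the unexpanded glob pattern when nothing matches
        [ -e "$f" ] && count=$((count + 1))
    done
    echo "$count"
}
```

With two nodes configured, count_node_contexts "$HOME" 12345 should print 2; the file context.12345 for the mpirun itself is not counted.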

BUT when I restart it using either of the following commands:
---------------------------------------------
cr_restart context.12345
or
lamrestart -ssi cr blcr -ssi cr_blcr_context_file context.12345
---------------------------------------------

THE RESTART PROCESS FROZE. If I check the process list, I can find the
mpirun process but cannot find any hello process.

It seems that the restart cannot notify all the nodes to restart the job:
it restarted only the mpirun process, but could not restart the processes
on each node.

I also tried to restart the hello process from another terminal:
---------------------------------------------
cr_restart context.12345-n0-12346
---------------------------------------------
The output is like this:
---------------------------------------------
Hello, world! I am 0 of 2, iter 2
Hello, world! I am 0 of 2, iter 3
...
---------------------------------------------
but the previous mpirun is still frozen and produces no output.

Here are my installation records:
--------------------------BLCR Installation--------------------------
../configure
-----------------------------------------------------------------------
All tests PASSED when I ran make check.

-------------------------LAM Installation-----------------------------
./configure --with-threads=posix --with-rpi=crtcp --with-cr-blcr=/usr/local/
------------------------------------------------------------------------

Here is my laminfo:
-----------------------------------------------------------------------
             LAM/MPI: 7.1.4
              Prefix: /usr
        Architecture: i686-pc-linux-gnu
       Configured by: root
       Configured on: Mon Nov 24 03:30:44 CST 2008
      Configure host: cluster.node1
      Memory manager: ptmalloc2
          C bindings: yes
        C++ bindings: yes
    Fortran bindings: yes
          C compiler: gcc
        C++ compiler: g++
    Fortran compiler: g77
     Fortran symbols: double_underscore
         C profiling: yes
       C++ profiling: yes
   Fortran profiling: yes
      C++ exceptions: no
      Thread support: yes
       ROMIO support: yes
        IMPI support: no
       Debug support: no
        Purify clean: no
            SSI boot: globus (API v1.1, Module v0.6)
            SSI boot: rsh (API v1.1, Module v1.1)
            SSI boot: slurm (API v1.1, Module v1.0)
            SSI coll: lam_basic (API v1.1, Module v7.1)
            SSI coll: shmem (API v1.1, Module v1.0)
            SSI coll: smp (API v1.1, Module v1.2)
             SSI rpi: crtcp (API v1.1, Module v1.1)
             SSI rpi: lamd (API v1.0, Module v7.1)
             SSI rpi: sysv (API v1.0, Module v7.1)
             SSI rpi: tcp (API v1.0, Module v7.1)
             SSI rpi: usysv (API v1.0, Module v7.1)
              SSI cr: blcr (API v1.0, Module v1.1)
              SSI cr: self (API v1.0, Module v1.0)
--------------------------------------------------------------------------
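
To double-check from a script that both checkpoint-related modules are present, a minimal sketch like the following could be run against the laminfo output. It only matches the "SSI rpi: crtcp" and "SSI cr: blcr" lines as they appear above; the function name has_cr_modules is hypothetical.

```shell
# Hypothetical helper: read laminfo output on stdin and succeed only if
# both the crtcp RPI module and the blcr CR module are listed.
has_cr_modules() {
    info=$(cat)
    printf '%s\n' "$info" | grep -q 'SSI rpi: crtcp' &&
    printf '%s\n' "$info" | grep -q 'SSI cr: blcr'
}

# Usage: laminfo | has_cr_modules && echo "crtcp and blcr available"
```

This only shows that the modules were built in; it does not show which modules a given mpirun actually selected at run time.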

I would appreciate your help. Thanks.

Best regards.