Hi folks,
I am currently using LAM+BLCR on my Fedora 8 Linux cluster.
The problem is that I can checkpoint mpirun correctly, but I cannot restart it.
First I run an MPI app like this (I also tried the --ssi rpi crtcp -ssi cr blcr
options):
---------------------------------------------
mpirun C hello
---------------------------------------------
The output looks like this:
---------------------------------------------
Hello, world! I am 0 of 2, iter 0
Hello, world! I am 1 of 2, iter 0
Hello, world! I am 0 of 2, iter 1
Hello, world! I am 1 of 2, iter 1
...
---------------------------------------------
Then I checkpoint this mpirun (assume the mpirun PID is 12345).
I tried both of the following commands:
---------------------------------------------
cr_checkpoint 12345
or
lamcheckpoint -ssi cr blcr -pid 12345
---------------------------------------------
After that, I found 3 files in my home directory (I only configured 2 nodes, n0
and n1, so the checkpoint appears to have worked correctly):
---------------------------------------------
context.12345 context.12345-n0-12346 context.12345-n1-23455
---------------------------------------------
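To double-check which context files belong to one checkpoint, a small shell
helper like the following could be used. This is only a sketch: list_contexts
is a made-up name, and the per-process naming pattern (context.PID-nNODE-PID)
is inferred from the three file names listed above, not taken from the BLCR or
LAM documentation.

```shell
# list_contexts PID [DIR]
# Print the names of the context files in DIR (default: $HOME) that appear
# to belong to the checkpoint of the mpirun process with the given PID.
# ASSUMPTION: the per-process files are named context.PID-n<node>-<pid>,
# as suggested by the file names shown above.
list_contexts() {
    pid=$1
    dir=${2:-$HOME}

    # Context file of mpirun itself.
    if [ -f "$dir/context.$pid" ]; then
        echo "context.$pid"
    fi

    # One context file per MPI process; an unmatched glob stays literal
    # and is skipped by the -f test.
    for f in "$dir"/context."$pid"-n*-*; do
        if [ -f "$f" ]; then
            basename "$f"
        fi
    done
    return 0
}
```

Calling list_contexts 12345 in my setup should print the three file names
above, one per line; with 2 nodes I expect exactly 3 lines.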
But when I restart it using either of the following commands:
---------------------------------------------
cr_restart context.12345
or
lamrestart -ssi cr blcr -ssi cr_blcr_context_file context.12345
---------------------------------------------
The restart process freezes. If I check the process list, I can find the
mpirun process but cannot find any hello process.
It seems that the restart cannot notify all the nodes to restart the job: it
restarts only the mpirun process, but not the processes on each node.
I also tried to restart the hello process from another terminal:
---------------------------------------------
cr_restart context.12345-n0-12346
---------------------------------------------
The output looks like this:
---------------------------------------------
Hello, world! I am 0 of 2, iter 2
Hello, world! I am 0 of 2, iter 3
...
---------------------------------------------
However, the previous mpirun remains frozen and produces no output.
Here are my installation records:
--------------------------BLCR Installation--------------------------
../configure
-----------------------------------------------------------------------
All tests passed when I ran make check.
-------------------------LAM Installation-----------------------------
./configure --with-threads=posix --with-rpi=crtcp --with-cr-blcr=/usr/local/
------------------------------------------------------------------------
Here is my laminfo:
-----------------------------------------------------------------------
LAM/MPI: 7.1.4
Prefix: /usr
Architecture: i686-pc-linux-gnu
Configured by: root
Configured on: Mon Nov 24 03:30:44 CST 2008
Configure host: cluster.node1
Memory manager: ptmalloc2
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C compiler: gcc
C++ compiler: g++
Fortran compiler: g77
Fortran symbols: double_underscore
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
C++ exceptions: no
Thread support: yes
ROMIO support: yes
IMPI support: no
Debug support: no
Purify clean: no
SSI boot: globus (API v1.1, Module v0.6)
SSI boot: rsh (API v1.1, Module v1.1)
SSI boot: slurm (API v1.1, Module v1.0)
SSI coll: lam_basic (API v1.1, Module v7.1)
SSI coll: shmem (API v1.1, Module v1.0)
SSI coll: smp (API v1.1, Module v1.2)
SSI rpi: crtcp (API v1.1, Module v1.1)
SSI rpi: lamd (API v1.0, Module v7.1)
SSI rpi: sysv (API v1.0, Module v7.1)
SSI rpi: tcp (API v1.0, Module v7.1)
SSI rpi: usysv (API v1.0, Module v7.1)
SSI cr: blcr (API v1.0, Module v1.1)
SSI cr: self (API v1.0, Module v1.0)
--------------------------------------------------------------------------
I would appreciate your help. Thanks.
Best regards.