Hi all,
I'm trying to use the checkpoint/restart mechanism for NPB benchmarks on a
quad-core workstation.
I've installed LAM 7.1.4 and BLCR 0.8.2 successfully (at least I think so).
However, when I ran:
mpirun -np 4 bin/bt.C.4
and
lamcheckpoint -ssi cr blcr -pid <mpirun pid>
The checkpointing operation seems to finish instantly, and the resulting
"context" file is only 400KB, while the context file for the serial version of
bt.C is nearly 1.5GB. When I tried to restart, I got the following error message:
"mpirun (rpwait): Bad file descriptor"
So I'm wondering whether the checkpoint was taken correctly, or whether it only
captured mpirun itself. I also read some previous threads and
tried running mpirun with options like:
mpirun -ssi rpi crtcp -ssi cr blcr -np 4 bin/bt.C.4
but got error messages like this:
-----------------------------------------------------------------------------
The "blcr" module requested in the CR kind was not found.
This typically means that you misspelled the desired module name, or used
the wrong name entirely.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n-114652544).
mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
My BLCR is compiled with the option --enable-static, and LAM is compiled
with the options --with-rpi=crtcp and --with-cr-blcr.
BLCR works fine for serial codes with cr_run, cr_checkpoint, and
cr_restart.
The laminfo output is listed below:
LAM/MPI: 7.1.4
Prefix: /usr
Architecture: x86_64-unknown-linux-gnu
Configured by: xiangyu
Configured on: Tue Jul 7 04:19:53 PDT 2009
Configure host: airsim-01
Memory manager: ptmalloc2
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C compiler: gcc
C++ compiler: g++-4.1
Fortran compiler: gfortran
Fortran symbols: underscore
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
C++ exceptions: no
Thread support: yes
ROMIO support: yes
IMPI support: no
Debug support: no
Purify clean: no
SSI boot: globus (API v1.1, Module v0.6)
SSI boot: rsh (API v1.1, Module v1.1)
SSI boot: slurm (API v1.1, Module v1.0)
SSI coll: lam_basic (API v1.1, Module v7.1)
SSI coll: shmem (API v1.1, Module v1.0)
SSI coll: smp (API v1.1, Module v1.2)
SSI rpi: crtcp (API v1.1, Module v1.1)
SSI rpi: lamd (API v1.0, Module v7.1)
SSI rpi: sysv (API v1.0, Module v7.1)
SSI rpi: tcp (API v1.0, Module v7.1)
SSI rpi: usysv (API v1.0, Module v7.1)
SSI cr: blcr (API v1.0, Module v1.1)
SSI cr: self (API v1.0, Module v1.0)
Thanks,
-Rio