When I run my MPI program on a single CPU, everything is OK, but when
I try to run it truly in parallel using
mpirun -ssi rpi crtcp -ssi cr blcr -np 2 ./hello
and then checkpoint it with
lamcheckpoint -ssi cr blcr -pid mpirun_pid or
cr_checkpoint -p mpirun_pid --term
I get the following error message:
-chkpt_watchdog: 'mpirun' (tgid/pid xxx/xxx) exited with signal 11
during checkpoint.
-----------------------------------------------------------------------
Encountered a failure in the SSI types while continuing from
checkpoint. Aborting in despair :-(
-----------------------------------------------------------------------
Segmentation Error
No checkpoint file is created, and the process is terminated.
At the moment I can't tell whether the problem is in my program,
in the LAM/MPI or BLCR settings, or in the virtual machine platform.
LAM-MPI version is 7.1.4
LAM-MPI is configured with --with-rpi=crtcp --with-cr-blcr
BLCR version is 0.8.0
BLCR is configured with --enable-static
OS - CentOS 5.
Platform - Sun xVM VirtualBox
Program text:
#include <stdio.h>
#include <mpi.h>
#include <math.h>

int main (int argc, char *argv[]){
    int rank, size, i;
    long j;
    double x;
    x=5;
    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    MPI_Comm_size(MPI_COMM_WORLD,&size);
    for (i=0;i<100;i++){
        printf("Hello, world! I am %d of %d, iteration %d\n",rank,size,i);
        /* busy loop: keeps each rank computing between prints */
        for(j=0;j<100000000;j++){
            x=sin(x);
        }
    }
    printf("I am %d of %d. Modus - %f\n",rank,size,x);
    MPI_Finalize();
    return 0;
}
--
With best regards,
Gleb "Crazy Sage" Igumnov mailto:crazy.sage_at_[hidden]