LAM/MPI General User's Mailing List Archives

From: Gleb (crazy.sage_at_[hidden])
Date: 2009-01-27 01:40:28


When I run my MPI program on a single CPU, everything works. But when
I try to run it in parallel with
mpirun -ssi rpi crtcp -ssi cr blcr -np 2 ./hello
and checkpoint it with
lamcheckpoint -ssi cr blcr -pid mpirun_pid
or
cr_checkpoint -p mpirun_pid --term
I get the following error message:
-chkpt_watchdog: 'mpirun' (tgid/pid xxx/xxx) exited with signal 11
during checkpoint.
-----------------------------------------------------------------------
Encountered a failure in the SSI types while continuing from
checkpoint. Aborting in despair :-(
-----------------------------------------------------------------------
Segmentation Error

No checkpoint file is created, and the process is terminated.
At the moment I can't tell whether the problem is in my program,
in the LAM/MPI or BLCR settings, or in the virtual machine platform.

LAM-MPI version is 7.1.4
LAM-MPI is configured with --with-rpi=crtcp --with-cr-blcr
BLCR version is 0.8.0
BLCR is configured with --enable-static
OS - CentOS 5
Platform - Sun xVM VirtualBox

Program text:

#include <stdio.h>
#include <mpi.h>
#include <math.h>

int main(int argc, char *argv[]) {
    int rank, size, i;
    long j;
    double x = 5;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < 100; i++) {
        printf("Hello, world! I am %d of %d, iteration %d\n", rank, size, i);
        /* CPU-bound work between progress messages. */
        for (j = 0; j < 100000000; j++) {
            x = sin(x);
        }
    }

    printf("I am %d of %d. Modus - %f\n", rank, size, x);
    MPI_Finalize();
    return 0;
}

-- 
With best regards,
 Gleb "Crazy Sage" Igumnov                          mailto:crazy.sage_at_[hidden]