LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-01-27 07:54:08


LAM is pretty much in support mode these days; no one is actively
working on it. We've all moved on and are now working on Open MPI
(www.open-mpi.org).

Open MPI v1.3 was just released with BLCR support -- would it be
possible for you to try that?

There are a few bug fixes coming in 1.3.1, but Josh (the main
checkpoint/restart author in Open MPI) tells me that it's still quite
usable in 1.3.0.

On Jan 27, 2009, at 1:40 AM, Gleb Crazy Sage Igumnov wrote:

> When I run my MPI program on a single CPU, everything is OK, but when
> I try to run it in parallel using
> mpirun -ssi rpi crtcp -ssi cr blcr -np 2 ./hello
> and checkpoint it with
> lamcheckpoint -ssi cr blcr -pid mpirun_pid or
> cr_checkpoint -p mpirun_pid --term
> I get the following error message:
> -chkpt_watchdog: 'mpirun' (tgid/pid xxx/xxx) exited with signal 11
> during checkpoint.
> -----------------------------------------------------------------------
> Encountered a failure in the SSI types while continuing from
> checkpoint. Aborting in despair :-(
> -----------------------------------------------------------------------
> Segmentation Error
>
> No checkpoint file is created, and the process is terminated.
> At the moment I can't tell whether the problem is in my program, in the
> LAM/MPI or BLCR settings, or in the virtual machine platform.
>
> LAM-MPI version is 7.1.4
> LAM-MPI is configured with --with-rpi=crtcp --with-cr-blcr
> BLCR version is 0.8.0
> BLCR is configured with --enable-static
> OS - CentOS 5
> Platform - Sun xVM VirtualBox
>
> Program text:
>
> #include <stdio.h>
> #include <mpi.h>
> #include <math.h>
>
> int main(int argc, char *argv[]) {
>     int rank, size, i;
>     long j;
>     double x = 5;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     for (i = 0; i < 100; i++) {
>         printf("Hello, world! I am %d of %d, iteration %d\n", rank, size, i);
>         /* Busy loop: keeps each rank running long enough to checkpoint. */
>         for (j = 0; j < 100000000; j++) {
>             x = sin(x);
>         }
>     }
>
>     printf("I am %d of %d. Modus - %f\n", rank, size, x);
>     MPI_Finalize();
>     return 0;
> }
>
> --
> With best regards,
> Gleb "Crazy Sage" Igumnov mailto:crazy.sage_at_[hidden]
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/

-- 
Jeff Squyres
Cisco Systems