
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-11-14 19:38:08


LAM's checkpoint/restart support does not extend to the MPI-2 spawn
functionality.
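
If you do not actually need checkpoint/restart for this particular run,
one possible workaround (an untested sketch, assuming the blcr selection
is coming from an SSI environment variable such as LAM_MPI_SSI_cr on
your nodes) is to clear that selection so the parent and the spawned
children both start without a CR module:

    unset LAM_MPI_SSI_cr
    mpirun N spawn

The laminfo command will show which cr modules your installation
actually has. Also make sure the last argument to MPI_Comm_spawn is an
array of maxprocs error codes (or MPI_ERRCODES_IGNORE) rather than a
pointer to a single int.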

On Nov 14, 2004, at 7:19 PM, mailtome_at_[hidden] wrote:

> I'm trying to run the following code to spawn the simple hello
> program, but there is a problem with the checkpoint/restart modules.
>
> // hello.c
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char *argv[])
> {
>     MPI_Init(&argc, &argv);
>     printf("Hello World\n");
>     MPI_Finalize();
>     return 0;
> }
>
> // spawnex.c
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char *argv[])
> {
>     int myrank;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
>     printf("\nMyrank is %d\n", myrank);
>     if (myrank == 0)
>     {
>         MPI_Comm childcommunicator;
>         MPI_Info infoobject;
>         int errcodes[3]; /* one error code per spawned process */
>         MPI_Info_create(&infoobject);
>         printf("Trying to spawn from rank %d\n", myrank);
>         MPI_Comm_spawn("hello", MPI_ARGV_NULL, 3, infoobject, 0,
>                        MPI_COMM_SELF, &childcommunicator, errcodes);
>         printf("\nSpawn successful\n");
>         MPI_Info_free(&infoobject);
>     }
>     MPI_Finalize();
>     return 0;
> }
>
> The error when trying to run is
>
> [mpigroup_at_xxx xxxxx]$ mpirun N spawn
>
> Myrank is 0
>
> Myrank is 1
>
> Myrank is 2
> -----------------------------------------------------------------------------
>
> It seems that [at least] one of the child processes that was started
> by MPI_Comm_spawn* chose a different CR module than the parent
> application. For example, one (of the) child process(es) that
> differed from the parent is shown below:
>
> Parent application: blcr (v1.1.0)
> Child MPI_COMM_WORLD rank 0: none (v-1.-1.-1)
>
> All MPI processes must choose the same CR module and version when
> they start. Check your SSI settings and/or the local environment
> variables on each node.
> -----------------------------------------------------------------------------
> Trying to spawnMPI_Comm_spawn: unclassified (rank 0, MPI_COMM_SELF)
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD): - MPI_Comm_spawn()
> Rank (0, MPI_COMM_WORLD): - main()
> MPI_Recv: process in local group is dead (rank 1, MPI_COMM_WORLD)
> Rank (1, MPI_COMM_WORLD): Call stack within LAM:
> MPI_Recv: process in local group is dead (rank 2, MPI_COMM_WORLD)
> Rank (2, MPI_COMM_WORLD): Call stack within LAM:
> Rank (1, MPI_COMM_WORLD): - MPI_Recv()
> Rank (1, MPI_COMM_WORLD): - MPI_Barrier()
> Rank (1, MPI_COMM_WORLD): - MPI_Finalize()
> Rank (1, MPI_COMM_WORLD): - main()
> Rank (2, MPI_COMM_WORLD): - MPI_Recv()
> Rank (2, MPI_COMM_WORLD): - MPI_Barrier()
> Rank (2, MPI_COMM_WORLD): - MPI_Finalize()
> Rank (2, MPI_COMM_WORLD): - main()
> -----------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 29554 failed on node n2 (172.30.0.143) with exit status 1.
> -----------------------------------------------------------------------------
>
> Thanks for any help.
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/