Hello,
In one of the archived mails, Jeff Squyres wrote: "LAM's
checkpoint/restart support does not support MPI-2 spawn
functionality."
We are trying to migrate a process using MPI_Comm_spawn, and we need
its memory state to be checkpointed, so we thought of using the
blcr module. When we do, we get the following error:
It seems that [at least] one of the child processes that was started
by MPI_Comm_spawn* chose a different CR module than the parent
application. For example, one (of the) child process(es) that
differed from the parent is shown below:
Parent application: blcr (v1.1.0)
Child MPI_COMM_WORLD rank 0: none (v-1.-1.-1)
All MPI processes must choose the same CR module and version when
they start. Check your SSI settings and/or the local environment
variables on each node.
-----------------------------------------------------------------------------
Trying to spawnMPI_Comm_spawn: unclassified (rank 0, MPI_COMM_SELF)
Rank (0, MPI_COMM_WORLD): Call stack within LAM:
Rank (0, MPI_COMM_WORLD): - MPI_Comm_spawn()
Rank (0, MPI_COMM_WORLD): - main()
MPI_Recv: process in local group is dead (rank 1, MPI_COMM_WORLD)
Rank (1, MPI_COMM_WORLD): Call stack within LAM:
MPI_Recv: process in local group is dead (rank 2, MPI_COMM_WORLD)
Rank (2, MPI_COMM_WORLD): Call stack within LAM:
Rank (1, MPI_COMM_WORLD): - MPI_Recv()
Rank (1, MPI_COMM_WORLD): - MPI_Barrier()
Rank (1, MPI_COMM_WORLD): - MPI_Finalize()
Rank (1, MPI_COMM_WORLD): - main()
Rank (2, MPI_COMM_WORLD): - MPI_Recv()
Rank (2, MPI_COMM_WORLD): - MPI_Barrier()
Rank (2, MPI_COMM_WORLD): - MPI_Finalize()
Rank (2, MPI_COMM_WORLD): - main()
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 29554 failed on node n2 (172.30.0.143) with exit status 1.
-----------------------------------------------------------------------------
We need to know whether a process can be checkpointed using blcr and
also use spawn at the same time. If not, is there any other way this
can be done?
Thanks in advance.
Kumar.