I configured BLCR with debugging enabled. When I restart an MPI program, the
messages log shows:
----------------------------------------------------------------------------------------------------------------------------------------------------------
Nov 25 12:54:03 cluster kernel: cr_rstrt_request_restart
<cr_rstrt_req.c:826>, pid 2835: cr_magic = 67 82, cr_version = 7, scope = 3,
arch = 1.
Nov 25 12:54:03 cluster kernel: cr_reserve_ids <cr_rstrt_req.c:501>, pid
2838: Now reserving required ids...
Nov 25 12:54:03 cluster kernel: cr_reserve_ids <cr_rstrt_req.c:501>, pid
2838: Now reserving required ids...
Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2591>, pid
2838: 2838: Restoring credentials
Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2677>, pid
2837: Formerly mpirun PID 2778
Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2677>, pid
2838: Formerly mpirun PID 2780
Nov 25 12:54:03 cluster kernel: cr_restore_pids <cr_rstrt_req.c:1373>, pid
2838: Now restoring the pids...
Nov 25 12:54:03 cluster kernel: cr_restore_pids <cr_rstrt_req.c:1380>, pid
2780: Linkage restore finished...
Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2738>, pid
2780: Reading POSIX interval timers...
Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2749>, pid
2780: Reading mmap()ed pages (if any)...
Nov 25 12:54:03 cluster kernel: cr_restore_all_files <cr_rstrt_req.c:2027>,
pid 2780: close-on-exec of callers files
Nov 25 12:54:03 cluster kernel: cr_restore_all_files <cr_rstrt_req.c:2041>,
pid 2780: recovering fs_struct...
Nov 25 12:54:03 cluster kernel: cr_restore_parents <cr_rstrt_req.c:676>, pid
2835: Now restoring the parent linkage...
----------------------------------------------------------------------------------------------------------------------------------------------------------
and it got stuck here, so the problem is that BLCR could not restore the parent
linkage.
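(In case it is useful, a minimal way to see where the restart stops is to follow
the kernel log while re-running cr_restart; this sketch assumes kernel messages
go to /var/log/messages, as in a default Fedora syslog setup:)
---------------------------------------------
# Terminal 1: follow the kernel log and keep only the BLCR trace lines
tail -f /var/log/messages | grep cr_

# Terminal 2: re-run the restart; the last cr_* line in terminal 1 shows
# where it stops (here: cr_restore_parents)
cr_restart context.12345
---------------------------------------------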
Any suggestions? Thanks for your help.
2008/11/25 <zhangkan440_at_[hidden]>
> Hi,
>
> Thanks. I will try debug mode then.
>
> 2008/11/25 Jerry Mersel <jerry.mersel_at_[hidden]>
>
>
>> Hi:
>>
>>
>> I'm no expert in this by any means but I would try rebuilding blcr
>> with debugging enabled and then look at the logs.
>>
>> That's what I intend to do with my blcr problem.
>>
>> Regards,
>> Jerry
>>
>> > Hi folks,
>> >
>> > I am currently using LAM+BLCR on my Fedora 8 Linux cluster.
>> > The problem is that I can checkpoint mpirun correctly, but I cannot restart
>> > it.
>> >
>> > First I run an MPI app like this (I also tried the -ssi rpi crtcp -ssi cr blcr
>> > options):
>> > ---------------------------------------------
>> > mpirun C hello
>> > ---------------------------------------------
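>> > (Spelled out, the variant with the checkpoint-aware transport selected
>> > explicitly would look like this:)
>> > ---------------------------------------------
>> > mpirun -ssi rpi crtcp -ssi cr blcr C hello
>> > ---------------------------------------------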
>> > The output is like this:
>> > ---------------------------------------------
>> > Hello, world! I am 0 of 2, iter 0
>> > Hello, world! I am 1 of 2, iter 0
>> > Hello, world! I am 0 of 2, iter 1
>> > Hello, world! I am 1 of 2, iter 1
>> > ...
>> > ---------------------------------------------
>> >
>> > Then I checkpoint this mpirun (assume the mpirun PID is 12345).
>> > I tried the following commands:
>> > ---------------------------------------------
>> > cr_checkpoint 12345
>> > or
>> > lamcheckpoint -ssi cr blcr -pid 12345
>> > ---------------------------------------------
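>> > (For reference, the PID passed to cr_checkpoint is that of the mpirun
>> > process itself; one quick way to look it up, assuming only one mpirun is
>> > running, is:)
>> > ---------------------------------------------
>> > # checkpoint the running mpirun without typing its PID by hand
>> > cr_checkpoint $(pidof mpirun)
>> > ---------------------------------------------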
>> >
>> > After that, I found 3 files in my home dir (I only configured 2 nodes, n0 and
>> > n1, so the checkpoint appears to have worked correctly):
>> > ---------------------------------------------
>> > context.12345 context.12345-n0-12346 context.12345-n1-23455
>> > ---------------------------------------------
>> >
>> > BUT when I restart it using one of the following commands:
>> > ---------------------------------------------
>> > cr_restart context.12345
>> > or
>> > lamrestart -ssi cr blcr -ssi cr_blcr_context_file context.12345
>> > ---------------------------------------------
>> >
>> > THE RESTART PROCESS FROZE. If I check the process list, I can find the
>> > mpirun process but cannot find any hello process.
>> >
>> > The problem seems to be that the restart process cannot notify all the nodes
>> > to restart the job: it just restarted the mpirun process, but could not
>> > restart the processes on each node.
>> >
>> > I also tried to restart the hello process from another terminal:
>> > ---------------------------------------------
>> > cr_restart context.12345-n0-12346
>> > ---------------------------------------------
>> > The output is like this:
>> > ---------------------------------------------
>> > Hello, world! I am 0 of 2, iter 2
>> > Hello, world! I am 0 of 2, iter 3
>> > ...
>> > ---------------------------------------------
>> > but the previous mpirun is still frozen and produces no output.
>> >
>> > Here are my installation records:
>> > --------------------------BLCR Installation--------------------------
>> > ../configure
>> > -----------------------------------------------------------------------
>> > and everything PASSED when I ran make check.
>> >
>> > -------------------------LAM Installation-----------------------------
>> > ./configure --with-threads=posix --with-rpi=crtcp
>> > --with-cr-blcr=/usr/local/
>> > ------------------------------------------------------------------------
>> >
>> > Here is my laminfo:
>> > -----------------------------------------------------------------------
>> > LAM/MPI: 7.1.4
>> > Prefix: /usr
>> > Architecture: i686-pc-linux-gnu
>> > Configured by: root
>> > Configured on: Mon Nov 24 03:30:44 CST 2008
>> > Configure host: cluster.node1
>> > Memory manager: ptmalloc2
>> > C bindings: yes
>> > C++ bindings: yes
>> > Fortran bindings: yes
>> > C compiler: gcc
>> > C++ compiler: g++
>> > Fortran compiler: g77
>> > Fortran symbols: double_underscore
>> > C profiling: yes
>> > C++ profiling: yes
>> > Fortran profiling: yes
>> > C++ exceptions: no
>> > Thread support: yes
>> > ROMIO support: yes
>> > IMPI support: no
>> > Debug support: no
>> > Purify clean: no
>> > SSI boot: globus (API v1.1, Module v0.6)
>> > SSI boot: rsh (API v1.1, Module v1.1)
>> > SSI boot: slurm (API v1.1, Module v1.0)
>> > SSI coll: lam_basic (API v1.1, Module v7.1)
>> > SSI coll: shmem (API v1.1, Module v1.0)
>> > SSI coll: smp (API v1.1, Module v1.2)
>> > SSI rpi: crtcp (API v1.1, Module v1.1)
>> > SSI rpi: lamd (API v1.0, Module v7.1)
>> > SSI rpi: sysv (API v1.0, Module v7.1)
>> > SSI rpi: tcp (API v1.0, Module v7.1)
>> > SSI rpi: usysv (API v1.0, Module v7.1)
>> > SSI cr: blcr (API v1.0, Module v1.1)
>> > SSI cr: self (API v1.0, Module v1.0)
>> >
>> > -----------------------------------------------------------------------
>> >
>> > I would appreciate your help. Thanks.
>> >
>> > Best regards.
>> >
>> > _______________________________________________
>> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>> >