Hi folks,
It works now. I did nothing except add a few debug messages of my own in
cr_rstrt_req.c and reinstall BLCR.
I don't know why, but anyway, it works now.
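For what it's worth, the debug messages were nothing special, just a couple of
plain printk() lines dropped into the restart functions in cr_rstrt_req.c so
the messages log shows how far each process gets. Roughly like this (the
wording and placement are only illustrative):
---------------------------------------------
/* Illustrative only: a plain debug line added inside one of the restart
 * functions in cr_rstrt_req.c (the file already pulls in the kernel
 * headers needed for printk and current). */
printk(KERN_DEBUG "my blcr debug: %s reached, pid %d\n",
       __func__, current->pid);
---------------------------------------------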
Jerry, maybe you could try reinstalling BLCR as well; perhaps the problem will
disappear. Good luck!
By the way, my BLCR version is 0.7.3, the LAM version is 7.1.4, and the kernel
version is 2.6.26 without any patches.
Thanks so much for your help!
2008/11/25 Zhang Kan <zhangkan440_at_[hidden]>
> I configured BLCR with debugging. When I restart an MPI program, the
> message log shows:
> ----------------------------------------------------------------------
> Nov 25 12:54:03 cluster kernel: cr_rstrt_request_restart
> <cr_rstrt_req.c:826>, pid 2835: cr_magic = 67 82, cr_version = 7, scope = 3,
> arch = 1.
> Nov 25 12:54:03 cluster kernel: cr_reserve_ids <cr_rstrt_req.c:501>, pid
> 2838: Now reserving required ids...
> Nov 25 12:54:03 cluster kernel: cr_reserve_ids <cr_rstrt_req.c:501>, pid
> 2838: Now reserving required ids...
> Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2591>, pid
> 2838: 2838: Restoring credentials
> Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2677>, pid
> 2837: Formerly mpirun PID 2778
> Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2677>, pid
> 2838: Formerly mpirun PID 2780
> Nov 25 12:54:03 cluster kernel: cr_restore_pids <cr_rstrt_req.c:1373>, pid
> 2838: Now restoring the pids...
> Nov 25 12:54:03 cluster kernel: cr_restore_pids <cr_rstrt_req.c:1380>, pid
> 2780: Linkage restore finished...
> Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2738>, pid
> 2780: Reading POSIX interval timers...
> Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2749>, pid
> 2780: Reading mmap()ed pages (if any)...
> Nov 25 12:54:03 cluster kernel: cr_restore_all_files <cr_rstrt_req.c:2027>,
> pid 2780: close-on-exec of callers files
> Nov 25 12:54:03 cluster kernel: cr_restore_all_files <cr_rstrt_req.c:2041>,
> pid 2780: recovering fs_struct...
> Nov 25 12:54:03 cluster kernel: cr_restore_parents <cr_rstrt_req.c:676>,
> pid 2835: Now restoring the parent linkage...
>
> ----------------------------------------------------------------------
> and it gets stuck here. So the problem is that BLCR could not restore the
> parent linkage.
> Any suggestions? Thanks for your help.
>
>
> 2008/11/25 Zhang Kan <zhangkan440_at_[hidden]>
>
>> Hi,
>>
>>
>> Thanks. I will try debug mode then.
>>
>> 2008/11/25 Jerry Mersel <jerry.mersel_at_[hidden]>
>>
>>
>>> Hi:
>>>
>>>
>>> I'm no expert in this by any means, but I would try rebuilding BLCR
>>> with debugging enabled and then looking at the logs.
>>>
>>> That's what I intend to do with my BLCR problem.
>>>
>>> Regards,
>>> Jerry
>>>
>>> > Hi folks,
>>> >
>>> > I am currently using LAM+BLCR on my Fedora 8 Linux cluster.
>>> > The problem is that I can checkpoint the mpirun correctly but cannot
>>> > restart it.
>>> >
>>> > First I run an MPI app like this (I also tried the --ssi rpi crtcp
>>> > -ssi cr blcr options):
>>> > ---------------------------------------------
>>> > mpirun C hello
>>> > ---------------------------------------------
>>> > The output is like this:
>>> > ---------------------------------------------
>>> > Hello, world! I am 0 of 2, iter 0
>>> > Hello, world! I am 1 of 2, iter 0
>>> > Hello, world! I am 0 of 2, iter 1
>>> > Hello, world! I am 1 of 2, iter 1
>>> > ...
>>> > ---------------------------------------------
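>>> > (For reference, hello here is just a trivial MPI loop that prints the
>>> > rank and an iteration counter forever; a rough sketch, with an arbitrary
>>> > one-second sleep between iterations, compiled with mpicc -o hello hello.c:)
>>> > ---------------------------------------------
>>> > #include <stdio.h>
>>> > #include <unistd.h>
>>> > #include <mpi.h>
>>> >
>>> > int main(int argc, char *argv[])
>>> > {
>>> >     int rank, size, iter;
>>> >
>>> >     MPI_Init(&argc, &argv);
>>> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>> >     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>> >
>>> >     /* loop forever so there is always output to watch across
>>> >      * the checkpoint and the restart */
>>> >     for (iter = 0; ; iter++) {
>>> >         printf("Hello, world! I am %d of %d, iter %d\n",
>>> >                rank, size, iter);
>>> >         fflush(stdout);
>>> >         sleep(1);
>>> >     }
>>> >
>>> >     MPI_Finalize();  /* never reached in this sketch */
>>> >     return 0;
>>> > }
>>> > ---------------------------------------------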
>>> >
>>> > Then I checkpoint this mpirun (assume the mpirun PID is 12345).
>>> > I tried the following commands:
>>> > ---------------------------------------------
>>> > cr_checkpoint 12345
>>> > or
>>> > lamcheckpoint -ssi cr blcr -pid 12345
>>> > ---------------------------------------------
>>> >
>>> > After that, I found 3 files in my home dir (I only configured 2 nodes,
>>> > n0 and n1, so the checkpoint worked correctly):
>>> > ---------------------------------------------
>>> > context.12345 context.12345-n0-12346 context.12345-n1-23455
>>> > ---------------------------------------------
>>> >
>>> > BUT when I restart it using one of the following commands:
>>> > ---------------------------------------------
>>> > cr_restart context.12345
>>> > or
>>> > lamrestart -ssi cr blcr -ssi cr_blcr_context_file context.12345
>>> > ---------------------------------------------
>>> >
>>> > THE RESTART PROCESS FROZE. And if I check the process list, I can find
>>> > the mpirun process but cannot find any hello process.
>>> >
>>> > The problem seems to be that the restart process cannot notify all the
>>> > nodes to restart the job; it just restarted the mpirun process but could
>>> > not restart the processes on each node.
>>> >
>>> > I also tried to restart the hello process using another terminal:
>>> > ---------------------------------------------
>>> > cr_restart context.12345-n0-12346
>>> > ---------------------------------------------
>>> > The output is like this:
>>> > ---------------------------------------------
>>> > Hello, world! I am 0 of 2, iter 2
>>> > Hello, world! I am 0 of 2, iter 3
>>> > ...
>>> > ---------------------------------------------
>>> > but the previous mpirun is still frozen and produces no output.
>>> >
>>> > Here are my installation records:
>>> > --------------------------BLCR Installation--------------------------
>>> > ../configure
>>> >
>>> > -----------------------------------------------------------------------
>>> > and it all PASSED when I ran make check.
>>> >
>>> > -------------------------LAM Installation-----------------------------
>>> > ./configure --with-threads=posix --with-rpi=crtcp
>>> > --with-cr-blcr=/usr/local/
>>> >
>>> > ------------------------------------------------------------------------
>>> >
>>> > Here is my laminfo:
>>> > -----------------------------------------------------------------------
>>> > LAM/MPI: 7.1.4
>>> > Prefix: /usr
>>> > Architecture: i686-pc-linux-gnu
>>> > Configured by: root
>>> > Configured on: Mon Nov 24 03:30:44 CST 2008
>>> > Configure host: cluster.node1
>>> > Memory manager: ptmalloc2
>>> > C bindings: yes
>>> > C++ bindings: yes
>>> > Fortran bindings: yes
>>> > C compiler: gcc
>>> > C++ compiler: g++
>>> > Fortran compiler: g77
>>> > Fortran symbols: double_underscore
>>> > C profiling: yes
>>> > C++ profiling: yes
>>> > Fortran profiling: yes
>>> > C++ exceptions: no
>>> > Thread support: yes
>>> > ROMIO support: yes
>>> > IMPI support: no
>>> > Debug support: no
>>> > Purify clean: no
>>> > SSI boot: globus (API v1.1, Module v0.6)
>>> > SSI boot: rsh (API v1.1, Module v1.1)
>>> > SSI boot: slurm (API v1.1, Module v1.0)
>>> > SSI coll: lam_basic (API v1.1, Module v7.1)
>>> > SSI coll: shmem (API v1.1, Module v1.0)
>>> > SSI coll: smp (API v1.1, Module v1.2)
>>> > SSI rpi: crtcp (API v1.1, Module v1.1)
>>> > SSI rpi: lamd (API v1.0, Module v7.1)
>>> > SSI rpi: sysv (API v1.0, Module v7.1)
>>> > SSI rpi: tcp (API v1.0, Module v7.1)
>>> > SSI rpi: usysv (API v1.0, Module v7.1)
>>> > SSI cr: blcr (API v1.0, Module v1.1)
>>> > SSI cr: self (API v1.0, Module v1.0)
>>> >
>>> > --------------------------------------------------------------------------
>>> >
>>> > I would appreciate your help. Thanks.
>>> >
>>> > Best regards.
>>> >
>>> > _______________________________________________
>>> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>> >
>>>
>>>
>>>
>>
>>
>