Thanks for your help. I will try "$ make insmod cr_ktrace_mask=0xffffffff"
in my future work on modifying BLCR.
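
For the record, here is roughly what I plan to do (the log path is the
Fedora default; it may differ on other systems):
---------------------------------------------
# in the BLCR build tree, after unloading any previously loaded BLCR modules
make insmod cr_ktrace_mask=0xffffffff
# then reproduce the restart hang and watch the kernel messages
tail -f /var/log/messages        # or: dmesg
---------------------------------------------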
2008/11/25 Paul H. Hargrove <PHHargrove_at_[hidden]>
> I believe that in a normal run "Now restoring the parent linkage" is the
> last step run. So, the kernel portion may have completed and the problem
> may be in the userspace code in LAM/MPI. To help determine where things
> are stuck, please load BLCR with
> $ make insmod cr_ktrace_mask=0xffffffff
> which will enable the maximum level of debugging output. I suspect there
> will be messages after the "Now restoring the parent linkage".
>
> Additionally, if there is more than one cluster node involved, it is
> possible that the MPI apps are restarting on other nodes and their
> debugging output might be on nodes other than the one running the
> mpirun. Please check that if you have not already done so.
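>
> For example, something like this on each of the other nodes (node name
> and log path are just illustrative):
>   ssh n1 'grep cr_rstrt /var/log/messages | tail -50'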
>
> -Paul
>
> Zhang Kan wrote:
> > I configured BLCR with debugging. When I restart an MPI program, the
> > messages log shows:
> > ----------------------------------------------------------------------------
> > Nov 25 12:54:03 cluster kernel: cr_rstrt_request_restart
> > <cr_rstrt_req.c:826>, pid 2835: cr_magic = 67 82, cr_version = 7,
> > scope = 3, arch = 1.
> > Nov 25 12:54:03 cluster kernel: cr_reserve_ids <cr_rstrt_req.c:501>,
> > pid 2838: Now reserving required ids...
> > Nov 25 12:54:03 cluster kernel: cr_reserve_ids <cr_rstrt_req.c:501>,
> > pid 2838: Now reserving required ids...
> > Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2591>,
> > pid 2838: 2838: Restoring credentials
> > Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2677>,
> > pid 2837: Formerly mpirun PID 2778
> > Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2677>,
> > pid 2838: Formerly mpirun PID 2780
> > Nov 25 12:54:03 cluster kernel: cr_restore_pids <cr_rstrt_req.c:1373>,
> > pid 2838: Now restoring the pids...
> > Nov 25 12:54:03 cluster kernel: cr_restore_pids <cr_rstrt_req.c:1380>,
> > pid 2780: Linkage restore finished...
> > Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2738>,
> > pid 2780: Reading POSIX interval timers...
> > Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2749>,
> > pid 2780: Reading mmap()ed pages (if any)...
> > Nov 25 12:54:03 cluster kernel: cr_restore_all_files
> > <cr_rstrt_req.c:2027>, pid 2780: close-on-exec of callers files
> > Nov 25 12:54:03 cluster kernel: cr_restore_all_files
> > <cr_rstrt_req.c:2041>, pid 2780: recovering fs_struct...
> > Nov 25 12:54:03 cluster kernel: cr_restore_parents
> > <cr_rstrt_req.c:676>, pid 2835: Now restoring the parent linkage...
> >
> > ----------------------------------------------------------------------------
> > and it gets stuck here. So the problem seems to be that BLCR could not
> > restore the parent linkage.
> > Any suggestions? Thanks for your help.
> >
> >
> > 2008/11/25 Zhang Kan <zhangkan440_at_[hidden]>
> >
> > Hi,
> >
> > Thanks. I will try debug mode then.
> >
> > 2008/11/25 Jerry Mersel <jerry.mersel_at_[hidden]>
> >
> >
> > Hi:
> >
> >
> > I'm no expert in this by any means, but I would try rebuilding BLCR
> > with debugging enabled and then look at the logs.
> >
> > That's what I intend to do with my BLCR problem.
> >
> > Regards,
> > Jerry
> >
> > > Hi folks,
> > >
> > > I am currently using LAM+BLCR on my Fedora 8 Linux cluster.
> > > The problem is that I can checkpoint the mpirun correctly, but
> > > cannot restart it.
> > >
> > > First I run an MPI app like this (I also tried the -ssi rpi crtcp
> > > -ssi cr blcr options):
> > > ---------------------------------------------
> > > mpirun C hello
> > > ---------------------------------------------
> > > The output is like this:
> > > ---------------------------------------------
> > > Hello, world! I am 0 of 2, iter 0
> > > Hello, world! I am 1 of 2, iter 0
> > > Hello, world! I am 0 of 2, iter 1
> > > Hello, world! I am 1 of 2, iter 1
> > > ...
> > > ---------------------------------------------
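> > >
> > > (hello is just a simple MPI print loop along these lines; this is a
> > > sketch rather than my exact source, built with mpicc hello.c -o hello:)
> > > ---------------------------------------------
> > > /* hello.c: each rank prints a message every second, forever */
> > > #include <mpi.h>
> > > #include <stdio.h>
> > > #include <unistd.h>
> > >
> > > int main(int argc, char **argv)
> > > {
> > >     int rank, size, iter;
> > >
> > >     MPI_Init(&argc, &argv);
> > >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > >     MPI_Comm_size(MPI_COMM_WORLD, &size);
> > >
> > >     for (iter = 0; ; iter++) {
> > >         printf("Hello, world! I am %d of %d, iter %d\n", rank, size, iter);
> > >         fflush(stdout);
> > >         sleep(1);          /* keep running so there is time to checkpoint */
> > >     }
> > >
> > >     MPI_Finalize();        /* never reached; the loop runs until killed */
> > >     return 0;
> > > }
> > > ---------------------------------------------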
> > >
> > > Then I checkpoint this mpirun (assume the mpirun pid is 12345):
> > > I tried the following commands:
> > > ---------------------------------------------
> > > cr_checkpoint 12345
> > > or
> > > lamcheckpoint -ssi cr blcr -pid 12345
> > > ---------------------------------------------
> > >
> > > After that, I found 3 files in my home dir (I only configured 2
> > > nodes, n0 and n1, so it checkpointed correctly):
> > > ---------------------------------------------
> > > context.12345 context.12345-n0-12346 context.12345-n1-23455
> > > ---------------------------------------------
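> > >
> > > (To double-check, the per-node context files can also be listed from
> > > each node, e.g.:)
> > > ---------------------------------------------
> > > ssh n0 'ls -l ~/context.*'
> > > ssh n1 'ls -l ~/context.*'
> > > ---------------------------------------------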
> > >
> > > BUT when I restart it using one of the following commands:
> > > ---------------------------------------------
> > > cr_restart context.12345
> > > or
> > > lamrestart -ssi cr blcr -ssi cr_blcr_context_file context.12345
> > > ---------------------------------------------
> > >
> > > THE RESTART PROCESS FROZE. And if I check the process list, I can
> > > find the mpirun process but cannot find any hello process.
> > >
> > > The problem seems to be that the restart process cannot notify all
> > > the nodes to restart the job. It just restarted the mpirun process,
> > > but could not restart the processes on each node.
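> > >
> > > For example, I check with something like this (commands illustrative):
> > > ---------------------------------------------
> > > lamnodes                        # list the nodes LAM booted
> > > ssh n0 'ps -ef | grep hello'    # look for restarted hello ranks on n0
> > > ssh n1 'ps -ef | grep hello'    # and on n1
> > > ---------------------------------------------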
> > >
> > > I also tried to restart the hello process using another
> > terminal:
> > > ---------------------------------------------
> > > cr_restart context.12345-n0-12346
> > > ---------------------------------------------
> > > The output is like this:
> > > ---------------------------------------------
> > > Hello, world! I am 0 of 2, iter 2
> > > Hello, world! I am 0 of 2, iter 3
> > > ...
> > > ---------------------------------------------
> > > but the previous mpirun is still frozen and has no output.
> > >
> > > Here are my installation records:
> > > --------------------------BLCR Installation--------------------------
> > > ../configure
> > > -----------------------------------------------------------------------
> > > and it all PASSED when I ran make check.
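> > >
> > > (The full sequence was essentially the usual out-of-tree build; this is
> > > a sketch, the build directory name is illustrative and configure
> > > options are omitted as above:)
> > > ---------------------------------------------
> > > mkdir builddir && cd builddir
> > > ../configure
> > > make
> > > make insmod          # load the freshly built BLCR kernel modules
> > > make install
> > > make check           # all tests PASSED
> > > ---------------------------------------------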
> > >
> > > -------------------------LAM Installation-----------------------------
> > > ./configure --with-threads=posix --with-rpi=crtcp --with-cr-blcr=/usr/local/
> > > ------------------------------------------------------------------------
> > >
> > > Here is my laminfo:
> > >
> > > -----------------------------------------------------------------------
> > > LAM/MPI: 7.1.4
> > > Prefix: /usr
> > > Architecture: i686-pc-linux-gnu
> > > Configured by: root
> > > Configured on: Mon Nov 24 03:30:44 CST 2008
> > > Configure host: cluster.node1
> > > Memory manager: ptmalloc2
> > > C bindings: yes
> > > C++ bindings: yes
> > > Fortran bindings: yes
> > > C compiler: gcc
> > > C++ compiler: g++
> > > Fortran compiler: g77
> > > Fortran symbols: double_underscore
> > > C profiling: yes
> > > C++ profiling: yes
> > > Fortran profiling: yes
> > > C++ exceptions: no
> > > Thread support: yes
> > > ROMIO support: yes
> > > IMPI support: no
> > > Debug support: no
> > > Purify clean: no
> > > SSI boot: globus (API v1.1, Module v0.6)
> > > SSI boot: rsh (API v1.1, Module v1.1)
> > > SSI boot: slurm (API v1.1, Module v1.0)
> > > SSI coll: lam_basic (API v1.1, Module v7.1)
> > > SSI coll: shmem (API v1.1, Module v1.0)
> > > SSI coll: smp (API v1.1, Module v1.2)
> > > SSI rpi: crtcp (API v1.1, Module v1.1)
> > > SSI rpi: lamd (API v1.0, Module v7.1)
> > > SSI rpi: sysv (API v1.0, Module v7.1)
> > > SSI rpi: tcp (API v1.0, Module v7.1)
> > > SSI rpi: usysv (API v1.0, Module v7.1)
> > > SSI cr: blcr (API v1.0, Module v1.1)
> > > SSI cr: self (API v1.0, Module v1.0)
> > >
> > > --------------------------------------------------------------------------
> > >
> > > I would appreciate your help. Thanks.
> > >
> > > Best regards.
> > >
>
>
> --
> Paul H. Hargrove PHHargrove_at_[hidden]
> Future Technologies Group Tel: +1-510-495-2352
> HPC Research Department Fax: +1-510-486-6900
> Lawrence Berkeley National Laboratory
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>