
LAM/MPI General User's Mailing List Archives


From: Paul H. Hargrove (PHHargrove_at_[hidden])
Date: 2008-11-25 15:45:43


I believe that in a normal run, "Now restoring the parent linkage" is the
last step run. So the kernel portion may have completed, and the problem
may be in the userspace code in LAM/MPI. To help determine where things
are stuck, please load BLCR with
$ make insmod cr_ktrace_mask=0xffffffff
which will enable the maximum level of debugging output. I suspect there
will be messages after "Now restoring the parent linkage".
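
If reloading by hand is easier than rerunning from the build tree, the
following is a rough equivalent (a sketch only: it assumes the default
/usr/local install prefix, and the exact module names and install path
may differ on your system):

$ rmmod blcr blcr_imports        # unload in dependency order, if already loaded
$ insmod /usr/local/lib/blcr/`uname -r`/blcr_imports.ko
$ insmod /usr/local/lib/blcr/`uname -r`/blcr.ko cr_ktrace_mask=0xffffffff
$ dmesg | tail -50               # or: tail -f /var/log/messages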

Additionally, if more than one cluster node is involved, it is
possible that the MPI apps are restarting on other nodes, and their
debugging output might be on nodes other than the one running
mpirun. Please check for that if you have not already done so.
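
For example (again just a sketch, assuming the second node is named n1,
is reachable by ssh, and that kernel messages are logged to
/var/log/messages):

$ ssh n1 'dmesg | tail -50'
$ ssh n1 'grep cr_ /var/log/messages | tail -20'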

-Paul

Zhang Kan wrote:
> I configured BLCR with debugging. When I restart an MPI program, the
> message log shows:
> ----------------------------------------------------------------------
> Nov 25 12:54:03 cluster kernel: cr_rstrt_request_restart
> <cr_rstrt_req.c:826>, pid 2835: cr_magic = 67 82, cr_version = 7,
> scope = 3, arch = 1.
> Nov 25 12:54:03 cluster kernel: cr_reserve_ids <cr_rstrt_req.c:501>,
> pid 2838: Now reserving required ids...
> Nov 25 12:54:03 cluster kernel: cr_reserve_ids <cr_rstrt_req.c:501>,
> pid 2838: Now reserving required ids...
> Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2591>,
> pid 2838: 2838: Restoring credentials
> Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2677>,
> pid 2837: Formerly mpirun PID 2778
> Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2677>,
> pid 2838: Formerly mpirun PID 2780
> Nov 25 12:54:03 cluster kernel: cr_restore_pids <cr_rstrt_req.c:1373>,
> pid 2838: Now restoring the pids...
> Nov 25 12:54:03 cluster kernel: cr_restore_pids <cr_rstrt_req.c:1380>,
> pid 2780: Linkage restore finished...
> Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2738>,
> pid 2780: Reading POSIX interval timers...
> Nov 25 12:54:03 cluster kernel: cr_rstrt_child <cr_rstrt_req.c:2749>,
> pid 2780: Reading mmap()ed pages (if any)...
> Nov 25 12:54:03 cluster kernel: cr_restore_all_files
> <cr_rstrt_req.c:2027>, pid 2780: close-on-exec of callers files
> Nov 25 12:54:03 cluster kernel: cr_restore_all_files
> <cr_rstrt_req.c:2041>, pid 2780: recovering fs_struct...
> Nov 25 12:54:03 cluster kernel: cr_restore_parents
> <cr_rstrt_req.c:676>, pid 2835: Now restoring the parent linkage...
> ----------------------------------------------------------------------
> and it is stuck here. So the problem seems to be that BLCR could not
> restore the parent linkage.
> Any suggestions? Thanks for your help.
>
>
> 2008/11/25 Zhang Kan <zhangkan440_at_[hidden]>
>
> Hi,
>
> Thanks. I will try debug mode then.
>
> 2008/11/25 Jerry Mersel <jerry.mersel_at_[hidden]>
>
>
> Hi:
>
>
> I'm no expert in this by any means, but I would try rebuilding BLCR
> with debugging enabled and then look at the logs.
>
> That's what I intend to do with my blcr problem.
>
> Regards,
> Jerry
>
> > Hi folks,
> >
> > I am currently using LAM+BLCR on my Fedora 8 Linux cluster.
> > The problem is that I can checkpoint the mpirun correctly, but
> > cannot restart it.
> >
> > First I run an MPI app like this (I also tried the -ssi rpi crtcp
> > -ssi cr blcr options):
> > ---------------------------------------------
> > mpirun C hello
> > ---------------------------------------------
> > The output is like this:
> > ---------------------------------------------
> > Hello, world! I am 0 of 2, iter 0
> > Hello, world! I am 1 of 2, iter 0
> > Hello, world! I am 0 of 2, iter 1
> > Hello, world! I am 1 of 2, iter 1
> > ...
> > ---------------------------------------------
> >
> > Then I checkpoint this mpirun (assuming the mpirun PID is 12345):
> > I tried the following commands:
> > ---------------------------------------------
> > cr_checkpoint 12345
> > or
> > lamcheckpoint -ssi cr blcr -pid 12345
> > ---------------------------------------------
> >
> > After that, I found 3 files in my home dir (I only configured
> > 2 nodes, n0 and n1, so it checkpointed correctly):
> > ---------------------------------------------
> > context.12345 context.12345-n0-12346 context.12345-n1-23455
> > ---------------------------------------------
> >
> > BUT when I restart it using either of the following commands:
> > ---------------------------------------------
> > cr_restart context.12345
> > or
> > lamrestart -ssi cr blcr -ssi cr_blcr_context_file context.12345
> > ---------------------------------------------
> >
> > THE RESTART PROCESS FROZE. And if I check the process list, I can
> > find the mpirun process but cannot find any hello process.
> >
> > The problem seems to be that the restart process cannot notify all
> > the nodes to restart the job. It just restarted the mpirun process,
> > but could not restart the processes on each node.
> >
> > I also tried to restart the hello process using another terminal:
> > ---------------------------------------------
> > cr_restart context.12345-n0-12346
> > ---------------------------------------------
> > the output is like this:
> > ---------------------------------------------
> > Hello, world! I am 0 of 2, iter 2
> > Hello, world! I am 0 of 2, iter 3
> > ...
> > ---------------------------------------------
> > but the previous mpirun is still frozen and has no output.
> >
> > Here are my installation records:
> > --------------------------BLCR Installation--------------------------
> > ../configure
> >
> > -----------------------------------------------------------------------
> > and it all PASSED when I ran make check.
> >
> > -------------------------LAM Installation-----------------------------
> > ./configure --with-threads=posix --with-rpi=crtcp
> > --with-cr-blcr=/usr/local/
> >
> > ------------------------------------------------------------------------
> >
> > Here is my laminfo:
> >
> > -----------------------------------------------------------------------
> > LAM/MPI: 7.1.4
> > Prefix: /usr
> > Architecture: i686-pc-linux-gnu
> > Configured by: root
> > Configured on: Mon Nov 24 03:30:44 CST 2008
> > Configure host: cluster.node1
> > Memory manager: ptmalloc2
> > C bindings: yes
> > C++ bindings: yes
> > Fortran bindings: yes
> > C compiler: gcc
> > C++ compiler: g++
> > Fortran compiler: g77
> > Fortran symbols: double_underscore
> > C profiling: yes
> > C++ profiling: yes
> > Fortran profiling: yes
> > C++ exceptions: no
> > Thread support: yes
> > ROMIO support: yes
> > IMPI support: no
> > Debug support: no
> > Purify clean: no
> > SSI boot: globus (API v1.1, Module v0.6)
> > SSI boot: rsh (API v1.1, Module v1.1)
> > SSI boot: slurm (API v1.1, Module v1.0)
> > SSI coll: lam_basic (API v1.1, Module v7.1)
> > SSI coll: shmem (API v1.1, Module v1.0)
> > SSI coll: smp (API v1.1, Module v1.2)
> > SSI rpi: crtcp (API v1.1, Module v1.1)
> > SSI rpi: lamd (API v1.0, Module v7.1)
> > SSI rpi: sysv (API v1.0, Module v7.1)
> > SSI rpi: tcp (API v1.0, Module v7.1)
> > SSI rpi: usysv (API v1.0, Module v7.1)
> > SSI cr: blcr (API v1.0, Module v1.1)
> > SSI cr: self (API v1.0, Module v1.0)
> >
> > --------------------------------------------------------------------------
> >
> > I would appreciate your help. Thanks.
> >
> > Best regards.
> >
>
>
>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/

-- 
Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory