张侠 and Jerry have both indicated the failures are with Fedora 8. Could
one or both of you provide the output from the following commands:
$ uname -a
$ gcc --version
$ rpm -q glibc --qf '%{version}-%{release}\n'
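
If it is easier, assuming both nodes are reachable by ssh as n0 and n1
(the names used in the thread below), an untested one-liner such as the
following would gather all three outputs from both nodes at once:

$ for h in n0 n1; do ssh $h uname -a; ssh $h gcc --version; \
    ssh $h "rpm -q glibc --qf '%{version}-%{release}\n'"; done
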
-Paul
张侠 wrote:
> Hi,
>
> Thanks. I will try debug mode then.
>
> 2008/11/25 Jerry Mersel <jerry.mersel_at_[hidden]>
>
>
> Hi:
>
>
> I'm no expert in this by any means but I would try rebuilding blcr
> with debugging enabled and then look at the logs.
>
> That's what I intend to do with my blcr problem.
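>
> (A sketch of such a rebuild; whether configure takes --enable-debug,
> and the exact make targets, are assumptions worth checking against
> ./configure --help and the blcr docs:)
>
>     ./configure --enable-debug
>     make
>     make insmod     # reload the freshly built kernel modules
>     make check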
>
> Regards,
> Jerry
>
> > Hi folks,
> >
> > I am currently using LAM+BLCR on my Fedora 8 Linux cluster.
> > The problem is that I can checkpoint the mpirun correctly but cannot
> > restart it.
> >
> > First I run an MPI app like this (I also tried the -ssi rpi crtcp
> > -ssi cr blcr options):
> > ---------------------------------------------
> > mpirun C hello
> > ---------------------------------------------
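> >
> > (For checkpoint support, LAM needs the crtcp RPI and the blcr cr
> > module selected, i.e. the variant mentioned above:)
> > ---------------------------------------------
> > mpirun -ssi rpi crtcp -ssi cr blcr C hello
> > ---------------------------------------------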
> > The output is like this:
> > ---------------------------------------------
> > Hello, world! I am 0 of 2, iter 0
> > Hello, world! I am 1 of 2, iter 0
> > Hello, world! I am 0 of 2, iter 1
> > Hello, world! I am 1 of 2, iter 1
> > ...
> > ---------------------------------------------
> >
> > Then I checkpoint this mpirun (assuming the mpirun pid is 12345).
> > I tried the following commands:
> > ---------------------------------------------
> > cr_checkpoint 12345
> > or
> > lamcheckpoint -ssi cr blcr -pid 12345
> > ---------------------------------------------
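> >
> > (The mpirun pid itself can be found with, for example:)
> > ---------------------------------------------
> > pgrep mpirun
> > ---------------------------------------------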
> >
> > After that, I found 3 files in my home dir (I only configured 2
> > nodes, n0 and n1, so it checkpointed correctly):
> > ---------------------------------------------
> > context.12345 context.12345-n0-12346 context.12345-n1-23455
> > ---------------------------------------------
> >
> > BUT when I restart it using one of the following commands:
> > ---------------------------------------------
> > cr_restart context.12345
> > or
> > lamrestart -ssi cr blcr -ssi cr_blcr_context_file context.12345
> > ---------------------------------------------
> >
> > THE RESTART PROCESS FROZE. And if I check the process list, I can
> > find the mpirun process but cannot find any hello process.
> >
> > It seems the restart process cannot notify all the nodes to restart
> > the job: it restarted only the mpirun process, but could not restart
> > the processes on each node.
> >
> > I also tried to restart the hello process from another terminal:
> > ---------------------------------------------
> > cr_restart context.12345-n0-12346
> > ---------------------------------------------
> > The output is like this:
> > ---------------------------------------------
> > Hello, world! I am 0 of 2, iter 2
> > Hello, world! I am 0 of 2, iter 3
> > ...
> > ---------------------------------------------
> > but the previous mpirun is still frozen and produces no output.
> >
> > Here are my installation records:
> > --------------------------BLCR Installation--------------------------
> > ../configure
> >
> > -----------------------------------------------------------------------
> > and all tests PASSED when I ran make check.
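> >
> > (The blcr and blcr_imports kernel modules must also be loaded on
> > every node at run time; this can be verified with:)
> > ---------------------------------------------
> > lsmod | grep blcr
> > ---------------------------------------------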
> >
> > -------------------------LAM Installation-----------------------------
> > ./configure --with-threads=posix --with-rpi=crtcp
> > --with-cr-blcr=/usr/local/
> >
> > ------------------------------------------------------------------------
> >
> > Here is my laminfo:
> >
> > -----------------------------------------------------------------------
> > LAM/MPI: 7.1.4
> > Prefix: /usr
> > Architecture: i686-pc-linux-gnu
> > Configured by: root
> > Configured on: Mon Nov 24 03:30:44 CST 2008
> > Configure host: cluster.node1
> > Memory manager: ptmalloc2
> > C bindings: yes
> > C++ bindings: yes
> > Fortran bindings: yes
> > C compiler: gcc
> > C++ compiler: g++
> > Fortran compiler: g77
> > Fortran symbols: double_underscore
> > C profiling: yes
> > C++ profiling: yes
> > Fortran profiling: yes
> > C++ exceptions: no
> > Thread support: yes
> > ROMIO support: yes
> > IMPI support: no
> > Debug support: no
> > Purify clean: no
> > SSI boot: globus (API v1.1, Module v0.6)
> > SSI boot: rsh (API v1.1, Module v1.1)
> > SSI boot: slurm (API v1.1, Module v1.0)
> > SSI coll: lam_basic (API v1.1, Module v7.1)
> > SSI coll: shmem (API v1.1, Module v1.0)
> > SSI coll: smp (API v1.1, Module v1.2)
> > SSI rpi: crtcp (API v1.1, Module v1.1)
> > SSI rpi: lamd (API v1.0, Module v7.1)
> > SSI rpi: sysv (API v1.0, Module v7.1)
> > SSI rpi: tcp (API v1.0, Module v7.1)
> > SSI rpi: usysv (API v1.0, Module v7.1)
> > SSI cr: blcr (API v1.0, Module v1.1)
> > SSI cr: self (API v1.0, Module v1.0)
> >
> > --------------------------------------------------------------------------
> >
> > I would appreciate your help. Thanks.
> >
> > Best regards.
> >
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
Paul H. Hargrove PHHargrove_at_[hidden]
Future Technologies Group Tel: +1-510-495-2352
HPC Research Department Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory