Thanks. Here are the outputs:
------------------------------------------
uname -a
Linux cluster.node1 2.6.26LAM+BLCR #1 SMP Mon Nov 24 01:12:27 CST 2008 i686 i686 i386 GNU/Linux
gcc --version
gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
rpm -q glibc --qf '%{version}-%{release}\n'
2.7-2
-----------------------------------------
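
For anyone else reporting a similar failure, the three commands above can be
gathered in one pass. This is only a minimal sketch: the names report and
collect_env_info are hypothetical helpers (not part of BLCR or LAM), and it
assumes an RPM-based system such as Fedora. Tools that are not installed are
reported as unavailable rather than causing an error.

```shell
#!/bin/sh
# Hypothetical helper (not part of BLCR or LAM): print the environment
# details requested in this thread, skipping tools that are not installed.
report() {
    # $1 is a label; the remaining arguments are the command to run.
    label=$1
    shift
    echo "== $label"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "(not available: $1)"
    fi
}

collect_env_info() {
    report "uname -a" uname -a
    report "gcc --version" gcc --version
    # Assumes an RPM-based distribution; prints a notice elsewhere.
    report "glibc package" rpm -q glibc --qf '%{version}-%{release}\n'
}

collect_env_info
```

Redirecting the output to a file makes it easy to attach to a reply on the
list.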
2008/11/25 Paul H. Hargrove <PHHargrove_at_[hidden]>
> 张侃 and Jerry have both indicated the failures are with Fedora 8. Could one
> or both of you provide the output from the following commands:
>
> $ uname -a
> $ gcc --version
> $ rpm -q glibc --qf '%{version}-%{release}\n'
>
> -Paul
>
> 张侃 wrote:
>
>> Hi,
>>
>> Thanks. I will try debug mode then.
>>
>> 2008/11/25 Jerry Mersel <jerry.mersel_at_[hidden]>
>>
>>
>>
>> Hi:
>>
>>
>> I'm no expert in this by any means but I would try rebuilding blcr
>> with debugging enabled and then look at the logs.
>>
>> That's what I intend to do with my blcr problem.
>>
>> Regards,
>> Jerry
>>
>> > Hi folks,
>> >
>> > I am currently using LAM+BLCR on my Fedora 8 Linux cluster.
>> > The problem is that I can checkpoint the mpirun correctly, but cannot
>> > restart it.
>> >
>> > First I run an MPI app like this (I also tried the --ssi rpi crtcp -ssi
>> > cr blcr options):
>> > ---------------------------------------------
>> > mpirun C hello
>> > ---------------------------------------------
>> > The output looks like this:
>> > ---------------------------------------------
>> > Hello, world! I am 0 of 2, iter 0
>> > Hello, world! I am 1 of 2, iter 0
>> > Hello, world! I am 0 of 2, iter 1
>> > Hello, world! I am 1 of 2, iter 1
>> > ...
>> > ---------------------------------------------
>> >
>> > Then I checkpoint this mpirun (assume the mpirun pid is 12345):
>> > I tried the following commands:
>> > ---------------------------------------------
>> > cr_checkpoint 12345
>> > or
>> > lamcheckpoint -ssi cr blcr -pid 12345
>> > ---------------------------------------------
>> >
>> > After that, I found 3 files in my home dir (I only configured 2 nodes, n0
>> > and n1, so it checkpointed correctly):
>> > ---------------------------------------------
>> > context.12345 context.12345-n0-12346 context.12345-n1-23455
>> > ---------------------------------------------
>> >
>> > BUT when I restart it using the following command:
>> > ---------------------------------------------
>> > cr_restart context.12345
>> > or
>> > lamrestart -ssi cr blcr -ssi cr_blcr_context_file context.12345
>> > ---------------------------------------------
>> >
>> > THE RESTART PROCESS FROZE. If I check the process list, I can find the
>> > mpirun process but cannot find any hello process.
>> >
>> > It seems the restart cannot notify all the nodes to restart the job: it
>> > restarted the mpirun process, but could not restart the processes on
>> > each node.
>> >
>> > I also tried to restart the hello process using another terminal:
>> > ---------------------------------------------
>> > cr_restart context.12345-n0-12346
>> > ---------------------------------------------
>> > the output is like this:
>> > ---------------------------------------------
>> > Hello, world! I am 0 of 2, iter 2
>> > Hello, world! I am 0 of 2, iter 3
>> > ...
>> > ---------------------------------------------
>> > but the previous mpirun is still frozen and produces no output.
>> >
>> > Here are my installation records:
>> > --------------------------BLCR Installation--------------------------
>> > ../configure
>> > -----------------------------------------------------------------------
>> > and it all PASSED when I ran make check.
>> >
>> > -------------------------LAM Installation-----------------------------
>> > ./configure --with-threads=posix --with-rpi=crtcp --with-cr-blcr=/usr/local/
>> > ------------------------------------------------------------------------
>> >
>> > Here is my laminfo:
>> > -----------------------------------------------------------------------
>> > LAM/MPI: 7.1.4
>> > Prefix: /usr
>> > Architecture: i686-pc-linux-gnu
>> > Configured by: root
>> > Configured on: Mon Nov 24 03:30:44 CST 2008
>> > Configure host: cluster.node1
>> > Memory manager: ptmalloc2
>> > C bindings: yes
>> > C++ bindings: yes
>> > Fortran bindings: yes
>> > C compiler: gcc
>> > C++ compiler: g++
>> > Fortran compiler: g77
>> > Fortran symbols: double_underscore
>> > C profiling: yes
>> > C++ profiling: yes
>> > Fortran profiling: yes
>> > C++ exceptions: no
>> > Thread support: yes
>> > ROMIO support: yes
>> > IMPI support: no
>> > Debug support: no
>> > Purify clean: no
>> > SSI boot: globus (API v1.1, Module v0.6)
>> > SSI boot: rsh (API v1.1, Module v1.1)
>> > SSI boot: slurm (API v1.1, Module v1.0)
>> > SSI coll: lam_basic (API v1.1, Module v7.1)
>> > SSI coll: shmem (API v1.1, Module v1.0)
>> > SSI coll: smp (API v1.1, Module v1.2)
>> > SSI rpi: crtcp (API v1.1, Module v1.1)
>> > SSI rpi: lamd (API v1.0, Module v7.1)
>> > SSI rpi: sysv (API v1.0, Module v7.1)
>> > SSI rpi: tcp (API v1.0, Module v7.1)
>> > SSI rpi: usysv (API v1.0, Module v7.1)
>> > SSI cr: blcr (API v1.0, Module v1.1)
>> > SSI cr: self (API v1.0, Module v1.0)
>> > --------------------------------------------------------------------------
>> >
>> > I would appreciate your help. Thanks.
>> >
>> > Best regards.
>> >
>> > _______________________________________________
>> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>> >
>
>
> --
> Paul H. Hargrove PHHargrove_at_[hidden]
> Future Technologies Group Tel: +1-510-495-2352
> HPC Research Department Fax: +1-510-486-6900
> Lawrence Berkeley National Laboratory
>