Looks like we have similar problems.
The only difference is that when I run lamrestart, it
runs mpirun but none of the child processes.
Regards,
Jerry
> Hi folks,
>
> I am currently using LAM+BLCR on my Fedora 8 Linux cluster.
> The problem is that I can checkpoint mpirun correctly, but I cannot restart
> it.
>
> First I run an MPI app like this (I also tried the -ssi rpi crtcp -ssi cr blcr
> options):
> ---------------------------------------------
> mpirun C hello
> ---------------------------------------------
> The output is like this:
> ---------------------------------------------
> Hello, world! I am 0 of 2, iter 0
> Hello, world! I am 1 of 2, iter 0
> Hello, world! I am 0 of 2, iter 1
> Hello, world! I am 1 of 2, iter 1
> ...
> ---------------------------------------------
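>
> (For reference, a minimal sketch of what such a test program might look
> like; the actual hello.c is not shown in this mail, so the endless loop,
> one-second sleep, and variable names below are assumptions based on the
> output above.)
> ---------------------------------------------
> #include <stdio.h>
> #include <unistd.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     int rank, size, iter;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     /* Print a message every second so progress is visible before the
>        checkpoint and after the restart. */
>     for (iter = 0; ; iter++) {
>         printf("Hello, world! I am %d of %d, iter %d\n", rank, size, iter);
>         fflush(stdout);
>         sleep(1);
>     }
>
>     MPI_Finalize();   /* never reached in this endless-loop sketch */
>     return 0;
> }
> ---------------------------------------------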
>
> Then I checkpoint this mpirun (assuming the mpirun PID is 12345).
> I tried the following commands:
> ---------------------------------------------
> cr_checkpoint 12345
> or
> lamcheckpoint -ssi cr blcr -pid 12345
> ---------------------------------------------
>
> After that, I found 3 files in my home dir (I only configured 2 nodes, n0
> and n1, so the checkpoint worked correctly):
> ---------------------------------------------
> context.12345 context.12345-n0-12346 context.12345-n1-23455
> ---------------------------------------------
>
> BUT when I restart it using one of the following commands:
> ---------------------------------------------
> cr_restart context.12345
> or
> lamrestart -ssi cr blcr -ssi cr_blcr_context_file context.12345
> ---------------------------------------------
>
> THE RESTART PROCESS FROZE. And if I check the process list, I can find the
> mpirun process but cannot find any hello process.
>
> It seems the restart process cannot notify all the nodes to restart the
> job: it just restarted the mpirun process, but could not restart the
> processes on each node.
>
> I also tried to restart the hello process from another terminal:
> ---------------------------------------------
> cr_restart context.12345-n0-12346
> ---------------------------------------------
> The output is like this:
> ---------------------------------------------
> Hello, world! I am 0 of 2, iter 2
> Hello, world! I am 0 of 2, iter 3
> ...
> ---------------------------------------------
> but the previous mpirun is still frozen and produces no output.
>
> Here is my installation records:
> --------------------------BLCR Installation--------------------------
> ../configure
> -----------------------------------------------------------------------
> and all tests PASSED when I ran make check.
>
> -------------------------LAM Installation-----------------------------
> ./configure --with-threads=posix --with-rpi=crtcp
> --with-cr-blcr=/usr/local/
> ------------------------------------------------------------------------
>
> Here is my laminfo:
> -----------------------------------------------------------------------
> LAM/MPI: 7.1.4
> Prefix: /usr
> Architecture: i686-pc-linux-gnu
> Configured by: root
> Configured on: Mon Nov 24 03:30:44 CST 2008
> Configure host: cluster.node1
> Memory manager: ptmalloc2
> C bindings: yes
> C++ bindings: yes
> Fortran bindings: yes
> C compiler: gcc
> C++ compiler: g++
> Fortran compiler: g77
> Fortran symbols: double_underscore
> C profiling: yes
> C++ profiling: yes
> Fortran profiling: yes
> C++ exceptions: no
> Thread support: yes
> ROMIO support: yes
> IMPI support: no
> Debug support: no
> Purify clean: no
> SSI boot: globus (API v1.1, Module v0.6)
> SSI boot: rsh (API v1.1, Module v1.1)
> SSI boot: slurm (API v1.1, Module v1.0)
> SSI coll: lam_basic (API v1.1, Module v7.1)
> SSI coll: shmem (API v1.1, Module v1.0)
> SSI coll: smp (API v1.1, Module v1.2)
> SSI rpi: crtcp (API v1.1, Module v1.1)
> SSI rpi: lamd (API v1.0, Module v7.1)
> SSI rpi: sysv (API v1.0, Module v7.1)
> SSI rpi: tcp (API v1.0, Module v7.1)
> SSI rpi: usysv (API v1.0, Module v7.1)
> SSI cr: blcr (API v1.0, Module v1.1)
> SSI cr: self (API v1.0, Module v1.0)
> --------------------------------------------------------------------------
>
> I would appreciate your help. Thanks.
>
> Best regards.
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>