I was not able to find any of the per-process context files.
I saw that we can set an environment variable called
LAM_MPI_SSI_cr_base_dir
I set it to a local directory in my .bashrc, but still no files
appear.
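
One way to rule out a simple environment problem is to confirm that the variable is really set in the shell that launches mpirun, and that the directory it names exists and is writable. The following is only a minimal sketch: the helper name check_cr_dir is hypothetical, and it says nothing about whether LAM actually honors this variable.

```shell
# Hypothetical helper: report the state of a directory-valued variable.
# Prints one of: unset, missing, not-writable, ok.
check_cr_dir() {
    var_name="$1"
    # Indirect lookup of the variable whose name is in $1 (POSIX sh).
    eval "dir=\${$var_name:-}"
    if [ -z "$dir" ]; then
        echo "unset"
    elif [ ! -d "$dir" ]; then
        echo "missing"
    elif [ ! -w "$dir" ]; then
        echo "not-writable"
    else
        echo "ok"
    fi
}

# Variable name taken from this message.
check_cr_dir LAM_MPI_SSI_cr_base_dir
```

Running the same check on each compute node would show whether the setting from .bashrc reaches the shells used there.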
The blcr module is found and loaded correctly during the
run. However, when I checkpoint the mpirun process, only one file is
created in my CWD.
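
To look for context files outside the CWD, a small search over a few candidate directories can help. This is a sketch under assumptions: the "context." file name prefix follows the names mentioned in this thread, the helper name find_context_files is hypothetical, and the directories searched are only guesses.

```shell
# Hypothetical helper: list regular files named context.* in the given
# directories, searching at most two levels deep.
find_context_files() {
    for d in "$@"; do
        [ -d "$d" ] && find "$d" -maxdepth 2 -type f -name 'context.*' 2>/dev/null
    done
    return 0
}

# Candidate locations are guesses: the CWD, the home directory, and the
# directory named by the variable above (falling back to /tmp).
find_context_files "$PWD" "$HOME" "${LAM_MPI_SSI_cr_base_dir:-/tmp}"
```

Since the processes run on several nodes, the search may need to be repeated on each node if the directories are not shared.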
Any help would be highly appreciated.
Thanks,
Hatem
Mars Lenjoy wrote:
>
>
> $ mpirun -np 9 -ssi cr blcr cr_verbose
>
> Assume the pid is 1234.
>
> $ lamcheckpoint -ssi cr blcr -pid 1234
>
> Perhaps you did not configure your cr-base-file during the configure
> step. I don't know its default directory, but it is not the
> working directory.
>
> So, try to find the file whose name begins with "context.mpirun." in $LAMHOME.
>
> Then try
>
> $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file <your
> context.mpirun.1234's fullpath>
>
>
>
> I hope this helps. If not, I'm afraid you need to reinstall LAM/MPI with
> different configure parameters.
>
>
>
>
> */Hatem Ltaief <ltaief_at_[hidden]>/* wrote:
>
> Thanks for your help.
>
> I guess here is my problem: I do not have any of these 9 context
> files in my CWD or home directory.
> Using the lamcheckpoint and lamrestart commands gives the same error,
> and still no context files are created for the running processes.
>
> Any ideas?
> Thanks,
> Hatem
>
>
> On Thu, 20 Apr 2006, Mars Lenjoy wrote:
>
> > You'd better use lamcheckpoint and lamrestart to do that.
> > If you want to use cr_checkpoint and the mpirun's pid is 1234,
> > for example, try
> > $ cr_checkpoint -f context.mpirun.1234 --run 1234
> > Make sure that 9 context files whose names contain "1234" and
> > "context" exist.
> > Then
> > $ cr_restart context.mpirun.1234
> > It should work.
> >
> > hope it helps
> >
> >
> > hatem ltaief wrote:
> > Hi,
> > I installed blcr and lam 7.1.2:
> > [ltaief_at_compute-0-16 lammpi-cg_3D]$ laminfo -all
> > LAM/MPI: 7.1.2
> > SSI boot: globus (SSI v1.0, API v1.1, Module v0.6)
> > SSI boot: rsh (SSI v1.0, API v1.1, Module v1.1)
> > SSI boot: slurm (SSI v1.0, API v1.1, Module v1.0)
> > SSI coll: lam_basic (SSI v1.0, API v1.1, Module v7.1)
> > SSI coll: shmem (SSI v1.0, API v1.1, Module v1.0)
> > SSI coll: smp (SSI v1.0, API v1.1, Module v1.2)
> > SSI rpi: crtcp (SSI v1.0, API v1.1, Module v1.1)
> > SSI rpi: lamd (SSI v1.0, API v1.0, Module v7.1)
> > SSI rpi: sysv (SSI v1.0, API v1.0, Module v7.1)
> > SSI rpi: tcp (SSI v1.0, API v1.0, Module v7.1)
> > SSI rpi: usysv (SSI v1.0, API v1.0, Module v7.1)
> > SSI cr: blcr (SSI v1.0, API v1.0, Module v1.1)
> > SSI cr: self (SSI v1.0, API v1.0, Module v1.0)
> > Prefix: /home/l/ltaief/source_lammpi/lam-src-f77/COMPILE_F77
> > Bindir:
> > /home/l/ltaief/source_lammpi/lam-src-f77/COMPILE_F77/bin
> > Libdir:
> > /home/l/ltaief/source_lammpi/lam-src-f77/COMPILE_F77/lib
> > Incdir:
> > /home/l/ltaief/source_lammpi/lam-src-f77/COMPILE_F77/include
> > Pkglibdir:
> > /home/l/ltaief/source_lammpi/lam-src-f77/COMPILE_F77/lib/lam
> > Sysconfdir:
> > /home/l/ltaief/source_lammpi/lam-src-f77/COMPILE_F77/etc
> > Architecture: x86_64-unknown-linux-gnu
> > Configured by: ltaief
> > Configured on: Thu Apr 20 19:37:33 CDT 2006
> > Configure host: medusa.tlc2.uh.edu
> > Memory manager: ptmalloc2
> > C bindings: yes
> > C++ bindings: yes
> > Fortran bindings: yes
> > C compiler: gcc
> > C char size: 1
> > C bool size: 1
> > C short size: 2
> > C int size: 4
> > C long size: 8
> > C float size: 4
> > C double size: 8
> > C pointer size: 8
> > C char align: 1
> > C bool align: 1
> > C int align: 4
> > C float align: 4
> > C double align: 8
> > C++ compiler: g++
> > Fortran compiler: g77
> > Fortran symbols: double_underscore
> > Fort integer size: 4
> > Fort real size: 4
> > Fort dbl prec size: 4
> > Fort cplx size: 4
> > Fort dbl cplx size: 4
> > Fort integer align: 4
> > Fort real align: 4
> > Fort dbl prec align: 4
> > Fort cplx align: 4
> > Fort dbl cplx align: 4
> > C profiling: yes
> > C++ profiling: yes
> > Fortran profiling: yes
> > C++ exceptions: no
> > Thread support: yes
> > ROMIO support: yes
> > IMPI support: no
> > Debug support: no
> > Purify clean: no
> > SSI base: parameter "verbose" (default value: )
> > SSI mpi: parameter "mpi_hostmap" (default value:
> > "/home/l/ltaief/source_lammpi/lam-src-f77/COMPILE_F77/etc/lam-hostmap.txt")
> > SSI base: parameter "base_module_path" (default value:
> > "/home/l/ltaief/source_lammpi/lam-src-f77/COMPILE_F77/lib/lam")
> > SSI boot: parameter "boot_verbose" (default value: )
> > SSI boot: parameter "boot" (default value: )
> > SSI boot: parameter "boot_base_promisc" (default value: "0")
> > SSI boot: parameter "boot_base_window_size" (default value: "5")
> > SSI boot: parameter "boot_globus_priority" (default value: "3")
> > SSI boot: parameter "boot_rsh_username" (default value: )
> > SSI boot: parameter "boot_rsh_agent" (default value:
> > "/bin/ssh ")
> > SSI boot: parameter "boot_rsh_no_n" (default value: "0")
> > SSI boot: parameter "boot_rsh_no_profile" (default value: "0")
> > SSI boot: parameter "boot_rsh_fast" (default value: "0")
> > SSI boot: parameter "boot_rsh_ignore_stderr" (default value:
> > "0")
> > SSI boot: parameter "boot_rsh_priority" (default value: "10")
> > SSI boot: parameter "boot_slurm_priority" (default value: "50")
> > SSI rpi: parameter "rpi_verbose" (default value: )
> > SSI rpi: parameter "rpi" (default value: )
> > SSI rpi: parameter "rpi_crtcp_priority" (default value: "25")
> > SSI rpi: parameter "rpi_crtcp_short" (default value: "65536")
> > SSI rpi: parameter "rpi_crtcp_sockbuf" (default value: "-1")
> > SSI rpi: parameter "rpi_lamd_priority" (default value: "20")
> > SSI rpi: parameter "rpi_sysv_pollyield" (default value: "1")
> > SSI rpi: parameter "rpi_sysv_poolsize" (default value:
> > "16777216")
> > SSI rpi: parameter "rpi_sysv_maxalloc" (default value:
> > "1048576")
> > SSI rpi: parameter "rpi_sysv_short" (default value: "8192")
> > SSI rpi: parameter "rpi_tcp_short" (default value: "65536")
> > SSI rpi: parameter "rpi_tcp_sockbuf" (default value: "-1")
> > SSI rpi: parameter "rpi_sysv_priority" (default value: "30")
> > SSI rpi: parameter "rpi_tcp_priority" (default value: "20")
> > SSI rpi: parameter "rpi_usysv_readlockpoll" (default value:
> > "10000")
> > SSI rpi: parameter "rpi_usysv_writelockpoll" (default value:
> > "10")
> > SSI rpi: parameter "rpi_usysv_pollyield" (default value: "1")
> > SSI rpi: parameter "rpi_usysv_poolsize" (default value:
> > "16777216")
> > SSI rpi: parameter "rpi_usysv_maxalloc" (default value:
> > "1048576")
> > SSI rpi: parameter "rpi_usysv_short" (default value: "8192")
> > SSI rpi: parameter "rpi_usysv_priority" (default value: "40")
> > SSI coll: parameter "coll_verbose" (default value: )
> > SSI coll: parameter "coll_shmem" (default value: "0")
> > SSI cr: parameter "cr_verbose" (default value: )
> > SSI cr: parameter "cr" (default value: )
> > SSI cr: parameter "cr_blcr_priority" (default value: "50")
> > SSI cr: parameter "cr_self_priority" (default value: "25")
> > SSI cr: parameter "cr_self_do_restart" (default value: "0")
> > SSI cr: parameter "cr_self_prefix" (default value:
> > "lam_cr_self")
> > SSI cr: parameter "cr_self_checkpoint" (default value: )
> > SSI cr: parameter "cr_self_continue" (default value: )
> > SSI cr: parameter "cr_self_restart" (default value: )
> >
> > Here is my output when running
> > [ltaief_at_compute-0-16 lammpi-cg_3D]$ mpirun -np 9 -ssi cr_verbose
> > level:1000,stderr -ssi rpi crtcp -ssi cr blcr -x LD_LIBRARY_PATH
> > ./main_heat &
> >
> > /home/l/ltaief/source_lammpi/lam-src-f77/COMPILE_F77/bin/mpif77 -O3 -w
> > -I/home/l/ltaief/source_lammpi/lam-src-f77/COMPILE_F77/include -o
> > objet/main_heat.o -c ./source/main_heat.f
> > /home/l/ltaief/source_lammpi/lam-src-f77/COMPILE_F77/bin/mpif77 -o
> > main_heat ./objet/main_heat.o ./objet/ddot.o ./objet/gather.o
> > ./objet/UpdateBoundary.o ./objet/cut_Domain_proc.o ./objet/initF.o
> > ./objet/checkpoint.o ./objet/matvec.o ./objet/compute_norm.o
> > ./objet/Solve_CG.o ./objet/tempscom.o
> > -L/home/l/ltaief/source_lammpi/lam-src-f77/COMPILE_F77/lib
> >
> > Compilation Successfully Terminated!
> >
> > n0<6983> ssi:crmpi:open: opening
> > n0<6983> ssi:crmpi:open: looking for cr module named blcr
> > n0<6983> ssi:crmpi:open: opening cr module blcr
> > n0<6983> ssi:crmpi:open: opened cr module blcr
> > n0<6983> ssi:crmpi:query: querying cr module blcr
> > n0<6983> ssi:crmpi:blcr: module initializing
> > n0<6983> ssi:crmpi:blcr:verbose: 1000
> > n0<6983> ssi:crmpi:blcr:priority: 50
> > n0<6983> ssi:crmpi:query: cr module available: blcr, priority: 50
> > n0<6984> ssi:crmpi:open: opening
> > n0<6984> ssi:crmpi:open: looking for cr module named blcr
> > n0<6984> ssi:crmpi:open: opening cr module blcr
> > n0<6984> ssi:crmpi:open: opened cr module blcr
> > n0<6984> ssi:crmpi:query: querying cr module blcr
> > n0<6984> ssi:crmpi:blcr: module initializing
> > n0<6984> ssi:crmpi:blcr:verbose: 1000
> > n0<6984> ssi:crmpi:blcr:priority: 50
> > n0<6984> ssi:crmpi:query: cr module available: blcr, priority: 50
> > n4<6385> ssi:crmpi:open: opening
> > n4<6385> ssi:crmpi:open: looking for cr module named blcr
> > n4<6385> ssi:crmpi:open: opening cr module blcr
> > n4<6385> ssi:crmpi:open: opened cr module blcr
> > n4<6385> ssi:crmpi:query: querying cr module blcr
> > n4<6385> ssi:crmpi:blcr: module initializing
> > n4<6385> ssi:crmpi:blcr:verbose: 1000
> > n4<6385> ssi:crmpi:blcr:priority: 50
> > n4<6385> ssi:crmpi:query: cr module available: blcr, priority: 50
> > n1<15131> ssi:crmpi:open: opening
> > n2<3830> ssi:crmpi:open: opening
> > n4<6386> ssi:crmpi:open: opening
> > n3<25936> ssi:crmpi:open: opening
> > n1<15131> ssi:crmpi:open: looking for cr module named blcr
> > n1<15131> ssi:crmpi:open: opening cr module blcr
> > n1<15131> ssi:crmpi:open: opened cr module blcr
> > n1<15131> ssi:crmpi:query: querying cr module blcr
> > n1<15131> ssi:crmpi:blcr: module initializing
> > n1<15131> ssi:crmpi:blcr:verbose: 1000
> > n1<15131> ssi:crmpi:blcr:priority: 50
> > n1<15131> ssi:crmpi:query: cr module available: blcr, priority: 50
> > n1<15132> ssi:crmpi:open: opening
> > n2<3830> ssi:crmpi:open: looking for cr module named blcr
> > n2<3830> ssi:crmpi:open: opening cr module blcr
> > n2<3830> ssi:crmpi:open: opened cr module blcr
> > n2<3830> ssi:crmpi:query: querying cr module blcr
> > n2<3830> ssi:crmpi:blcr: module initializing
> > n2<3830> ssi:crmpi:blcr:verbose: 1000
> > n2<3830> ssi:crmpi:blcr:priority: 50
> > n2<3830> ssi:crmpi:query: cr module available: blcr, priority: 50
> > n4<6386> ssi:crmpi:open: looking for cr module named blcr
> > n4<6386> ssi:crmpi:open: opening cr module blcr
> > n4<6386> ssi:crmpi:open: opened cr module blcr
> > n4<6386> ssi:crmpi:query: querying cr module blcr
> > n4<6386> ssi:crmpi:blcr: module initializing
> > n4<6386> ssi:crmpi:blcr:verbose: 1000
> > n4<6386> ssi:crmpi:blcr:priority: 50
> > n4<6386> ssi:crmpi:query: cr module available: blcr, priority: 50
> > n3<25936> ssi:crmpi:open: looking for cr module named blcr
> > n3<25936> ssi:crmpi:open: opening cr module blcr
> > n3<25936> ssi:crmpi:open: opened cr module blcr
> > n1<15132> ssi:crmpi:open: looking for cr module named blcr
> > n1<15132> ssi:crmpi:open: opening cr module blcr
> > n1<15132> ssi:crmpi:open: opened cr module blcr
> > n1<15132> ssi:crmpi:query: querying cr module blcr
> > n1<15132> ssi:crmpi:blcr: module initializing
> > n1<15132> ssi:crmpi:blcr:verbose: 1000
> > n1<15132> ssi:crmpi:blcr:priority: 50
> > n1<15132> ssi:crmpi:query: cr module available: blcr, priority: 50
> > n3<25936> ssi:crmpi:query: querying cr module blcr
> > n3<25936> ssi:crmpi:blcr: module initializing
> > n3<25936> ssi:crmpi:blcr:verbose: 1000
> > n3<25936> ssi:crmpi:blcr:priority: 50
> > n3<25936> ssi:crmpi:query: cr module available: blcr, priority: 50
> > n3<25937> ssi:crmpi:open: opening
> > n3<25937> ssi:crmpi:open: looking for cr module named blcr
> > n3<25937> ssi:crmpi:open: opening cr module blcr
> > n3<25937> ssi:crmpi:open: opened cr module blcr
> > n3<25937> ssi:crmpi:query: querying cr module blcr
> > n3<25937> ssi:crmpi:blcr: module initializing
> > n3<25937> ssi:crmpi:blcr:verbose: 1000
> > n3<25937> ssi:crmpi:blcr:priority: 50
> > n3<25937> ssi:crmpi:query: cr module available: blcr, priority: 50
> > n0<6977> ssi:crlam: Opening
> > n0<6977> ssi:crlam: looking for module named blcr
> > n0<6977> ssi:crlam: opening module blcr
> > n0<6977> ssi:crlam: query module blcr
> > n0<6977> ssi:crlam:blcr: module initializing
> > n0<6977> ssi:crlam:blcr:verbose: 1000
> > n0<6977> ssi:crlam:blcr:priority: 50
> > n0<6977> ssi:crlam: Selected crlam module "blcr"
> > n0<6977> ssi:crlam:Registered C/R handlers
> > n0<6983> ssi:crmpi: initializing
> > n0<6984> ssi:crmpi: initializing
> > n0<6984> ssi:crmpi: CR support enabled (blcr)
> > n1<15131> ssi:crmpi: initializing
> > n0<6983> ssi:crmpi: CR support enabled (blcr)
> > n1<15132> ssi:crmpi: initializing
> > n2<3830> ssi:crmpi: initializing
> > n3<25936> ssi:crmpi: initializing
> > n2<3830> ssi:crmpi: CR support enabled (blcr)
> > n1<15131> ssi:crmpi: CR support enabled (blcr)
> > n3<25937> ssi:crmpi: initializing
> > n4<6386> ssi:crmpi: initializing
> > n3<25936> ssi:crmpi: CR support enabled (blcr)
> > n4<6385> ssi:crmpi: initializing
> > n4<6386> ssi:crmpi: CR support enabled (blcr)
> > n1<15132> ssi:crmpi: CR support enabled (blcr)
> > n3<25937> ssi:crmpi: CR support enabled (blcr)
> > n4<6385> ssi:crmpi: CR support enabled (blcr)
> > me= 0Nx= 26Ny= 26Nz= 50maxT= 1000dt= 0.00159999993
> > me= 1Nx= 26Ny= 26Nz= 50maxT= 1000dt= 0.00159999993
> > me= 7Nx= 26Ny= 26Nz= 50maxT= 1000dt= 0.00159999993
> > me= 4Nx= 26Ny= 26Nz= 50maxT= 1000dt= 0.00159999993
> > me= 8Nx= 26Ny= 26Nz= 50maxT= 1000dt= 0.00159999993
> > me= 5Nx= 26Ny= 26Nz= 50maxT= 1000dt= 0.00159999993
> > me= 6Nx= 26Ny= 26Nz= 50maxT= 1000dt= 0.00159999993
> > me= 2Nx= 26Ny= 26Nz= 50maxT= 1000dt= 0.00159999993
> > me= 3Nx= 26Ny= 26Nz= 50maxT= 1000dt= 0.00159999993
> > Execution Successfully Terminated!
> > End of solving
> > Time of Solving 3.88168907
> > n0<6983> ssi:crmpi: Closing
> > n0<6984> ssi:crmpi: Closing
> > n2<3830> ssi:crmpi: Closing
> > n4<6386> ssi:crmpi: Closing
> > n3<25937> ssi:crmpi: Closing
> > n1<15131> ssi:crmpi: Closing
> > n1<15132> ssi:crmpi: Closing
> > n4<6385> ssi:crmpi: Closing
> > n3<25936> ssi:crmpi: Closing
> >
> >
> > When I use the cr_checkpoint command to checkpoint the mpirun process
> > during the execution, it creates a context.PID file in my CWD.
> > Then, after the program has finished, I want to restart it with
> > cr_restart context.PID.
> > And I get this error:
> > [ltaief_at_compute-0-16 lammpi-cg_3D]$ cr_restart context.6977
> > mpirun: cannot start ./main_heat on n2: Bad file descriptor
> >
> > Any Ideas?
> >
> > Thanks and best regards,
> > Hatem
> >
> >
> >
> >
> >
> >
> >
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >
> >
> >