LAM/MPI General User's Mailing List Archives

From: FARKAS Zoltan (zfarkas_at_[hidden])
Date: 2004-11-17 13:21:11


   Hello!

1. I configure LAM with the following command:
$ configure --prefix=/home/zfarkas/lam-7.1.1 --with-rsh=/usr/bin/ssh
--with-rpi=crtcp --without-fc --with-cr=self
I compile a simple MPI application with mpicc and run it with:
$ mpirun -v -ssi ssi_verbose 100 -ssi cr self -ssi cr_verbose 100 C
./hello 100
15856 ./hello running on n0 (o)
n0<15855> ssi:crlam:module "none" explicitly requested. Disabling cr support
n0<15856> ssi:crmpi: CR support enabled (self)
Hello, world! I am 0 of 1
0
1
2
3
...

The explicitly requested module "none" seems strange to me. The command
$ lamcheckpoint -ssi cr self -pid <pid_of_mpirun>
simply kills mpirun with SIGUSR1.
I've modified otb/mpirun/mpirun.c so that the second parameter of
lam_ssi_crlam_base_open is now "self" instead of cr_module. After
recompiling LAM and running the application again, I get the following
(during execution I issue 'lamcheckpoint -ssi cr self -pid <pid_of_mpirun>'):
$ mpirun -v -ssi ssi_verbose 100 -ssi cr self -ssi cr_verbose 100 C
./hello 100
23984 ./hello running on n0 (o)
n0<23984> ssi:crmpi: CR support enabled (self)
Hello, world! I am 0 of 1
0
1
2
3
Inside CRLAM's *signal* handler
Done making the appschema
# -v -ssi ssi_verbose 100 -ssi cr self -ssi cr_verbose 100 C ./hello 100
n0 cr_self_restart /home/zfarkas/context.0-n0-23984

4
5
6
...

So this seems to work. My question is the following: is it possible to
tell mpirun to use the "self" module without patching the source? The
code is a bit confusing to me.

2. Next, I've created a shared library that has a 'do_checkpoint'
function. This function simply prints something to stdout. I compile
the application this way:
$ mpicc hello.c -L. -lckpt_lib -o hello
I run the application with the following command (during execution I
issue 'lamcheckpoint -ssi cr self -pid <pid_of_mpirun>'):
$ mpirun -v -ssi ssi_verbose 100 -ssi cr self -ssi cr_verbose 100 -ssi
cr_self_prefix do -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH C ./hello 100
24029 ./hello running on n0 (o)
n0<24029> ssi:crmpi: CR support enabled (self)
Hello, world! I am 0 of 1
0
1
2
Inside CRLAM's *signal* handler
3
4
<output of do_checkpoint's printf>
Segmentation fault
5
...

The processes keep running, but mpirun is killed. This happens on Debian
Sarge (vanilla 2.6.9 kernel, glibc 2.3.2, GCC 3.3.4). On a Debian Woody
system (2.4.x kernel, glibc 2.2.5, GCC 2.95.4), however, mpirun survives
and everything keeps running:
$ mpirun -v -ssi ssi_verbose 100 -ssi cr self -ssi cr_verbose 100 -ssi
cr_self_prefix do -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH C ./hello 100
24029 ./hello running on n0 (o)
n0<24029> ssi:crmpi: CR support enabled (self)
Hello, world! I am 0 of 1
0
1
2
Inside CRLAM's *signal* handler
3
4
Done making the appschema
# -v -ssi ssi_verbose 100 -ssi cr self -ssi cr_verbose 100 -ssi
cr_self_prefix do -x
LD_LIBRARY_PATH=/home/zfarkas/Ckpt_Proba:/home/zfarkas/Ckpt_Proba: C
./hello 100
n0 cr_self_restart /home/zfarkas/context.0-n0-27671

<output of do_checkpoint's printf>
5
...

My question is the following: does anybody have an idea what the problem
could be in the first case (Debian Sarge)?
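
For reference, a minimal sketch of what the callback library from item 2
could look like. Only do_checkpoint is described above; do_continue and
do_restart, the int (void) signatures, the zero return value for success,
and the ckpt_checkpoint_calls helper are assumptions added for
illustration, based on the "do" prefix passed via cr_self_prefix. This is
not the actual library source.

```c
/* ckpt_lib.c: hypothetical sketch of the callback library from item 2.
 * Assumption: with "-ssi cr_self_prefix do", the callbacks are named
 * do_checkpoint, do_continue and do_restart, take no arguments, and
 * return 0 on success. Only do_checkpoint is mentioned in the post. */
#include <stdio.h>

/* Counts invocations so a caller can confirm the callback really ran. */
static int checkpoint_calls = 0;

/* Called when a checkpoint is requested; here it only prints a message. */
int do_checkpoint(void)
{
    checkpoint_calls++;
    printf("do_checkpoint: called %d time(s)\n", checkpoint_calls);
    fflush(stdout);   /* make the message visible immediately */
    return 0;
}

/* Assumed counterpart: called when the process continues after a checkpoint. */
int do_continue(void)
{
    printf("do_continue: resuming after checkpoint\n");
    fflush(stdout);
    return 0;
}

/* Assumed counterpart: called when the process is restarted from saved state. */
int do_restart(void)
{
    printf("do_restart: restarting from saved state\n");
    fflush(stdout);
    return 0;
}

/* Illustrative helper (not part of any LAM interface). */
int ckpt_checkpoint_calls(void)
{
    return checkpoint_calls;
}
```

Such a file could be built with something like
'gcc -shared -fPIC ckpt_lib.c -o libckpt_lib.so', which matches the
'-L. -lckpt_lib' link line used above.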

Note: after I got this working on Debian Woody, I tried installing a
signal handler for SIGTERM in the shared library's '_init' function. The
idea is the following: if somebody tries to kill the MPI process, the
handler connects to another process, which in turn notifies mpirun to do
a coordinated checkpoint. For now the signal handler only prints
something, and this seems to work under Debian Woody.
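
The SIGTERM hook from the note can be sketched as follows. It assumes
GCC's constructor attribute in place of '_init' (both run when the
library is loaded), and the names on_sigterm, install_sigterm_handler,
ckpt_lib_init and ckpt_sigterm_seen are hypothetical. The handler only
prints a message, as described; the notification to mpirun is not shown.

```c
/* Hypothetical sketch: install a SIGTERM handler when the library loads. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose sigaction() under strict ISO C */
#endif
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler; sig_atomic_t is safe to write from a signal handler. */
static volatile sig_atomic_t sigterm_seen = 0;

static void on_sigterm(int signo)
{
    static const char msg[] = "ckpt_lib: caught SIGTERM\n";
    ssize_t n;

    (void)signo;
    sigterm_seen = 1;
    /* write() is async-signal-safe; printf() is not. */
    n = write(STDOUT_FILENO, msg, sizeof msg - 1);
    (void)n;
}

/* Returns 0 on success, -1 on failure (same convention as sigaction). */
int install_sigterm_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls */
    return sigaction(SIGTERM, &sa, NULL);
}

/* Runs at load time; stands in for the '_init' function in the post. */
__attribute__((constructor))
static void ckpt_lib_init(void)
{
    (void)install_sigterm_handler();
}

/* Illustrative helper: reports whether SIGTERM has been delivered. */
int ckpt_sigterm_seen(void)
{
    return sigterm_seen;
}
```

Keeping the handler down to a flag and a write() call avoids calling
functions that are not async-signal-safe; any real work, such as
contacting another process, would be better done outside the handler.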

3. My last question: is it possible to use the "self" cr module when I
pass the '--with-threads=no' argument to configure? This is important
for me, because so far I couldn't find any userspace checkpointer that
can handle threaded applications.

Thanks for any comments!
Regards,

        Zoltan