On Thu, 2004-02-19 at 02:48, Pirabhu Raman wrote:
> Hi,
Hi!
> I am trying to use lam with BLCR. I first installed BLCR with the following
> commands
> configure
Was configure able to find the kernel headers and the System.map of
your running kernel?
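A quick way to check is sketched below. It assumes the common layout where the kernel build tree is linked from /lib/modules/<release>/build and the symbol map sits at /boot/System.map-<release>; your distribution may put them elsewhere. The function name check_kernel_files and its optional root argument are made up for this example.

```shell
# Report whether the usual kernel build inputs exist for a kernel release.
# $1 = kernel release (e.g. the output of `uname -r`)
# $2 = optional root prefix, only there so the check can be tried on a scratch tree
check_kernel_files() {
    release="$1"
    root="${2:-}"
    status=0
    for path in "$root/lib/modules/$release/build" "$root/boot/System.map-$release"; do
        if [ -e "$path" ]; then
            echo "found:   $path"
        else
            echo "missing: $path"
            status=1
        fi
    done
    return "$status"
}

# Check the running kernel (assumed paths, see above).
check_kernel_files "$(uname -r)" || echo "configure may need to be pointed at these files explicitly"
```

If either path shows up as missing, the output of the BLCR configure run is probably worth a second look.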
> make
> make install
> Since I did not specify prefix option BLCR was installed in default
> /usr/local folder. Then I installed lam with commands
> configure --with-blcr=/usr/local --with-rpi-crtcp
Hmm.. I normally use --with-rpi=crtcp, but I guess it doesn't matter.
> Now when I do check point of ordinary processes using blcr it works fine. I
> started lamboot and then I invoked a parallel process with command
> mpirun -ssi rpi crtcp -ssi cr blcr -np 4 ./ring
> This produces error stating blcr module in CR kind was not found. This
> typically means you have misspelled the module name.
Weird.. I've never seen this.
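One thing that may be worth checking is whether the LAM build you are running actually contains the blcr module. A minimal sketch, assuming the laminfo tool from the same LAM installation is on your PATH and that it lists the compiled-in SSI modules; has_blcr is just a helper name made up here:

```shell
# Succeeds if the text on stdin mentions a blcr module (case-insensitive).
has_blcr() {
    grep -qi 'blcr'
}

# Assumption: laminfo prints the SSI modules this LAM build was compiled with.
if command -v laminfo >/dev/null 2>&1; then
    if laminfo | has_blcr; then
        echo "blcr module is listed"
    else
        echo "no blcr module listed; LAM may have been built without BLCR support"
    fi
else
    echo "laminfo not found on PATH"
fi
```

If no blcr module is listed, the LAM configure run most likely did not pick up the BLCR installation, which would be consistent with the "module not found" message.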
> So I ran the program with command
> mpirun -ssi rpi crtcp -np 4 ./ring
That's the command line format I use to run my programs.
> and it works fine. Now I checkpoint with the command
> cr_checkpoint 23245 where 23245 is PID of mpirun. One file named
> context.23245 is created and no other files are created (Should other files
> be created). This file is created on node where I run command cr_checkpoint.
> (Note I don't have NFS on my test cluster)
Hmm.. as you don't have NFS on your cluster, I guess the other context
files were created in the user's home directory on each node. On my test
cluster the home directories are shared among the nodes over NFS, and the
context files of the MPI processes are created in my $HOME. :-) At
checkpoint time an execution schema is created to be used at restart,
but I don't know whether the CR framework can load context files that are
scattered across the cluster. Anyone?
> When I try to restart the original program from context file with command
> cr_restart 23245 I get the error
> mpirun (rpwait) : bad file descriptor. (Note: The original process has
> already completed execution)
I guess I've seen this error message when my context file was empty due
to some problem with the BLCR checkpointer.
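To rule that out, you could look at the context files on each node and see whether any of them is empty. A minimal sketch, assuming the files are named context.<PID> as in your message and live in the directory you pass in ($HOME by default); check_contexts is a name made up for this example:

```shell
# List context.* files in a directory and flag the empty ones.
# Returns 0 if all files are non-empty, 1 if any is empty, 2 if none exist.
check_contexts() {
    dir="${1:-$HOME}"
    found=0
    bad=0
    for f in "$dir"/context.*; do
        [ -e "$f" ] || continue    # the glob matched nothing
        found=1
        if [ -s "$f" ]; then
            echo "ok:    $f"
        else
            echo "empty: $f"
            bad=1
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "no context files in $dir"
        return 2
    fi
    return "$bad"
}

# Example, to be run on every node: check_contexts "$HOME"
```

Since you have no NFS, this would have to be run on every node separately.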
I hope it helps. Ah, and feel free (anyone) to point out any mistakes I
may have made. :-)
-- Ulisses