On Feb 7, 2005, at 12:34 PM, <Tong_Liu_at_[hidden]> wrote:
> I did the BLCR+LAM/MPI installation as follows:
> 1. BLCR-0.3.1 compiled on all cluster nodes.
> 2. Recompiled LAM/MPI with BLCR support on Master node only.
>
> Then laminfo shows that BLCR has been built into LAM. I also ran
> insmod for blcr.o and vmdump.o on all nodes.
>
> I can successfully run mpi application with command:
> # Mpirun -np 4 -ssi rpi rctcp -ssi blcr application
> # cr_checkpoint PID of mpirun
> Context file was generated, but when I restarted that application
> with cr_restart command, it shows file description error.
Apologies for taking so long to answer. :-\
Can you be more specific about the command line that you ran and the
error that occurred? I note that the command line you typed above
isn't quite right (I'm assuming that it's just e-mail typos, but it's
good to be sure).
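
For comparison, here is a rough sketch of how the three steps are
usually spelled. It assumes the module names from the LAM/MPI 7 User's
Guide ("crtcp" as the RPI, "blcr" selected through the "cr" SSI kind)
and BLCR's default context.<PID> file name; the PID 12345 and the
./application path are placeholders. The functions only print the
command lines and do not run anything, so please check the names
against your own laminfo output.

```shell
#!/bin/sh
# Dry-run sketch: these functions only print command lines, they run nothing.
# Assumption: module names follow the LAM/MPI 7 User's Guide ("crtcp" RPI,
# "blcr" selected through the "cr" SSI kind); verify against laminfo output.

# Print the launch command for NP processes of a checkpointable job.
launch_cmd() {
    printf 'mpirun -np %s -ssi rpi crtcp -ssi cr blcr %s\n' "$1" "$2"
}

# Print the checkpoint command; the argument is the PID of mpirun itself.
checkpoint_cmd() {
    printf 'cr_checkpoint %s\n' "$1"
}

# Print the restart command; assumes BLCR's default context.<PID> file name.
restart_cmd() {
    printf 'cr_restart context.%s\n' "$1"
}

launch_cmd 4 ./application   # 4 processes, placeholder binary
checkpoint_cmd 12345         # 12345 is a placeholder PID of mpirun
restart_cmd 12345
```

If what you actually ran differs from the printed lines (for example in
the RPI name or the "-ssi cr blcr" pair), that difference is worth
including in your reply.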
> Can anybody give me some suggestions? Should I recompile LAM/MPI on
> all nodes?
If your nodes are homogeneous (check out the LAM FAQ for our specific
definition of homogeneous), then you can have one install of LAM (say,
on NFS) and you should be fine.
> Will the cr_checkpoint command generate a context file on all nodes
> or only on the master node?
It generates a context file for each process, so the checkpoint
effectively runs on every node where the job has processes.
> Should the directory that stores the context files be a shared
> mounted directory?
It does not matter, but writing to a local disk will give you
noticeably better performance (as opposed to N nodes all writing to an
NFS server simultaneously).
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/