> If you're only seeing one context file, then lamrestart will definitely
> not work. Some followup questions:
>
> 1. When you run laminfo, do you see the blcr module listed? I.e., was
> the new beta built with BLCR support properly?
Yes
> 2. When you run MPI apps, do they have LD_LIBRARY_PATH set properly
> such that the BLCR library can be found? You might want to try
> something like:
>
> lamexec N env | grep LD_LIBRARY_PATH
>
> Remember that the MPI processes will inherit the environment of the
> lamd, which, in a PBS/Torque environment, *should* be the same as the
> environment of the process that did lamboot (which, therefore, should
> have the Right LD_LIBRARY_PATH -- but it's worth checking).
Yep, it's fine.
> 3. Is BLCR installed in the same directory on all machines in the
> cluster (such that PATH and LD_LIBRARY_PATH can be the same to find all
> the relevant BLCR parts on all nodes)?
Yes. I also uninstalled the previous lam before installing the new one.
> The real problem is that you're only getting one checkpoint file -- you
> should get N+1 (i.e., one for each MPI process and one for mpirun).
> Without that, you won't be able to restart properly.
You are right. I am looking into the source to understand why this is
happening. Are there any log files that are generated by lam?
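
For reference, the N+1 rule quoted above (one context file per MPI
process plus one for mpirun) can be checked with a rough sketch like the
one below. The glob pattern "context.*" and the helper names are
assumptions made for illustration only; the actual file names depend on
how LAM and BLCR were configured, so adjust the pattern to whatever
shows up in the checkpoint directory.

```python
from pathlib import Path


def expected_checkpoint_files(num_mpi_processes: int) -> int:
    """One context file per MPI process, plus one for mpirun (N+1)."""
    return num_mpi_processes + 1


def missing_checkpoint_files(directory, num_mpi_processes, pattern="context.*"):
    """Return how many context files are missing from `directory`.

    `pattern` is a hypothetical naming scheme, not a documented LAM default.
    A result of 0 means the N+1 count is satisfied; a positive result means
    a restart is unlikely to work.
    """
    found = sum(1 for p in Path(directory).glob(pattern) if p.is_file())
    return expected_checkpoint_files(num_mpi_processes) - found
```

With 4 MPI processes and a single context file in the directory,
missing_checkpoint_files(dir, 4) reports 4 missing files, which matches
the situation described in this thread.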
--
Pradeep Padala
http://ppadala.blogspot.com