LAM/MPI General User's Mailing List Archives

From: Pradeep Padala (ppadala_at_[hidden])
Date: 2005-07-23 19:47:59


> If you're only seeing one context file, then lamrestart will definitely
> not work. Some followup questions:
>
> 1. When you run laminfo, do you see the blcr module listed? I.e., was
> the new beta built with BLCR support properly?

Yes
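
For anyone repeating check 1, a minimal sketch of scripting it. It assumes only that the module shows up in the laminfo output under the name "blcr", as described above; the helper name has_blcr_module is made up for illustration and is not part of LAM/MPI.

```shell
# Hypothetical helper: read laminfo-style text on stdin and succeed
# if any line mentions "blcr" (case-insensitive).
has_blcr_module() {
    grep -qi 'blcr'
}

# Usage (needs a LAM/MPI installation, so it is left commented out):
#   laminfo | has_blcr_module && echo "BLCR module listed"
```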

> 2. When you run MPI apps, do they have LD_LIBRARY_PATH set properly
> such that the BLCR library can be found? You might want to try
> something like:
>
> lamexec N env | grep LD_LIBRARY_PATH
>
> Remember that the MPI processes will inherit the environment of the
> lamd, which, in a PBS/Torque environment, *should* be the same as the
> environment of the process that did lamboot (which, therefore, should
> have the Right LD_LIBRARY_PATH -- but it's worth checking).

Yep, it's fine.
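
A rough sketch of turning check 2 into a yes/no test: a small function that reports whether a given directory appears in a colon-separated path list such as LD_LIBRARY_PATH. The function name path_contains and the directory /usr/local/blcr/lib are assumptions for illustration only; substitute wherever BLCR's libraries actually live on your nodes.

```shell
# Hypothetical helper: succeed if directory $2 is one of the
# colon-separated entries in path list $1.
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage (the BLCR directory below is an assumed example):
#   path_contains "$LD_LIBRARY_PATH" /usr/local/blcr/lib \
#       && echo "BLCR lib dir is on LD_LIBRARY_PATH"
```

Wrapping the list in colons before matching avoids false positives on partial names, e.g. /usr/local/blcr/lib64 does not match /usr/local/blcr/lib.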

> 3. Is BLCR installed in the same directory on all machines in the
> cluster (such that PATH and LD_LIBRARY_PATH can be the same to find all
> the relevant BLCR parts on all nodes)?

Yes. I also uninstalled the previous LAM before installing the new one.

> The real problem is that you're only getting one checkpoint file -- you
> should get N+1 (i.e., one for each MPI process and one for mpirun).
> Without that, you won't be able to restart properly.
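
The N+1 rule above can be checked mechanically. The sketch below counts files in a checkpoint directory and compares the count with N+1. It assumes the context files have names beginning with "context." (verify this against what your installation actually writes); both function names are hypothetical.

```shell
# Hypothetical helper: print how many regular files in directory $1
# have names starting with "context." (assumed naming convention).
count_context_files() {
    count=0
    for f in "$1"/context.*; do
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Hypothetical helper: succeed only if directory $1 holds N+1 context
# files, where N ($2) is the number of MPI processes (the +1 is mpirun).
check_checkpoint() {
    expected=$(( $2 + 1 ))
    found=$(count_context_files "$1")
    if [ "$found" -eq "$expected" ]; then
        echo "ok: found $found context files"
    else
        echo "incomplete: found $found context files, expected $expected"
        return 1
    fi
}

# Usage: check_checkpoint /path/to/checkpoint/dir 4
```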

You are right. I am looking into the source to understand why this is
happening. Are there any log files that are generated by lam?

-- 
Pradeep Padala
http://ppadala.blogspot.com