Thanks for all the responses.
This is a Beowulf cluster with a shared file system.
From the help I got, I suspect it was a cache problem
with the cluster.
Right now I cannot recreate the problem.
On Sun, 28 Sep 2003, Brian Barrett wrote:
> LAM just calls fork()/exec() out on the remote nodes. We used to have
> the problems you describe when all the LAM development workstations
> used AFS, which did heavy client-side caching. Of course, by the time
> you logged into the node to figure out what was going wrong, the cache
> was invalidated and everything worked as expected.
>
> If you are having repeated problems and are on a shared filesystem, you
> might want to talk to your systems administrator. It sounds like you
> may be having some problems on your machine. If you aren't using a
> common filesystem, you might want to try using the -s option to mpirun.
> Having mpirun push the binary out may be slightly less error-prone
> than doing it by hand.
>
> Either way, this isn't a LAM problem, but just some of the pain of
> working on clusters...
>
> Brian
>
Andras