On 8/2/06 3:10 PM, "Zubair Anwar" <zubair.anwar_at_[hidden]> wrote:
> I do not see a.out in the process table on node22, but I do see lamd on
> node22.
>
> I am also unable to lamexec a simple non-MPI hello world program across
> multiple nodes if the application schema contains node22.
These two facts are quite weird. If the lamds are able to boot properly,
that means that they were using arbitrary TCP ports to talk to each other
(during the lamboot process). But if you can't even lamexec (which, in your
case, is a better test because it doesn't involve any of the MPI startup
protocols; it's literally one lamd sending a "please fork/exec this
process" message to a remote lamd), then the lamds are apparently unable to
talk to each other.
One difference between lamboot and the normal run-time operations of the
lamds is that lamboot uses TCP for communication, but the lamds use UDP.
Do you have any firewalling / port blocking software running on this node?
You need to allow completely arbitrary TCP and UDP connectivity between the
nodes.
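
If it helps to check UDP reachability independently of LAM, here is a minimal
sketch of a UDP echo test in Python. It is not part of LAM and does not use the
ports the lamds pick for themselves; the function names serve and probe, and
whatever port number you choose, are just assumptions for illustration. It only
shows whether a UDP datagram can make a round trip between two nodes on one
chosen port.

```python
import socket


def serve(port, host="0.0.0.0", count=None):
    """Echo UDP datagrams back to their sender.

    Runs forever when count is None, otherwise stops after `count` datagrams.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        handled = 0
        while count is None or handled < count:
            data, peer = sock.recvfrom(4096)
            sock.sendto(data, peer)
            handled += 1


def probe(host, port, timeout=2.0):
    """Return True if a UDP datagram sent to host:port is echoed back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(b"ping", (host, port))
            data, _ = sock.recvfrom(4096)
        except OSError:
            # Covers timeouts, name-resolution failures, and ICMP errors.
            return False
        return data == b"ping"
```

For example, run serve(5000) on node22 and probe("node22", 5000) from another
node (5000 is an arbitrary choice), then swap the roles to test the other
direction. A failure points at filtering of UDP somewhere between the nodes; a
success on one port does not prove that every port is open, which is what the
lamds need.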
--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems