Hi,
I think I should have been clearer.
Case 1: the parent spawns child1 and child2 with
separate spawn calls (not spawn_multiple). How does
child1 talk to child2?
Case 2: the parent spawns child1, child2 and child3
using spawn_multiple. I create groups and related
communicators with the children using comm_world.
Later child3 dies. The parent uses spawn or
spawn_multiple to start child4 to compensate. How does
child4 communicate with its peers child1 and child2?
Jeff, if I understand you right, "the output of SPAWN
and SPAWN_MULTIPLE are the same: an intercommunicator
spanning *all* the children and parents."
That doesn't seem to be the case. Successive spawn()
calls seem to create different intercommunicators
between parent and child, i.e. the parent has a
different intercommunicator with each child.
Both comm_get_parent(commParent) followed by
comm_size(commParent), and comm_size(comm_world),
return a size of 1 with a rank of 0 for every child in
every case that uses spawn (with spawn_multiple, the
same calls return a size equal to the number of
children, with appropriate ranks).
On a related note, my understanding is that node
failure can be detected by handling LAM_SIGSHRINK. How
can we detect process failure on a live node (say
someone inadvertently kills the process)? In the
process-failure scenario, too, LAM does not crash if we
have used only one mpirun (all processes except one
have been spawned by the one started by mpirun). My
understanding is that testing a send/receive request
can be a deterministic way of detecting process
failure. Is there any robust way provided by the LAM
system to detect process failure on a live node?
Thanks
Vinod
=====
Vinod Kannan
saranga2000_at_[hidden]
http://kannvin.freeshell.org