Hi,
I am trying to use LAM/MPI's fault-tolerance
capabilities with regard to node failure.
I was testing the delivery of SIGSHRINK to a
process. I set up 4 processes, 2 on each node (in one
case doing nothing but spinning in an infinite while
loop, and in another case sending and receiving
messages using Isend & Test). All the processes
register for SIGSHRINK via lam_ksignal(). The LAM
universe consisted of 2 nodes. I then ran tkill on one
node. I noticed a substantial delay before SIGSHRINK
was received. The delay varied from a few seconds to a
few minutes (I crudely timed one at around 5
minutes).
I checked the node's CPU load (no processes other
than mine are running) and memory (no shortage). I am
guessing network traffic should not be a factor,
since the local lamd is signalling processes on the
same node (there are only 2 nodes). In any case, the
nodes are on a LAN with little or no traffic. I ran
the tests at various times of the day (and night) with
similar delays. I flush the output after every print
statement.
Is this behaviour normal? What could be the reason?
Is there any way I can speed up the delivery (and
capture) of the signal?
Thanks for any help and advice.
Newbie LAM/MPI user
Vinod