
LAM/MPI General User's Mailing List Archives

From: Vinod Kannan (saranga2000_at_[hidden])
Date: 2004-10-12 23:07:45


Hi,
 I am trying to use LAM/MPI's fault-tolerance (ft)
capabilities with regard to node failure.
 I was testing the delivery of SIGSHRINK to a
process. I set up 4 processes, 2 on each node (in one
case doing nothing but spinning in an infinite while
loop, and in another case sending and receiving
messages using Isend & Test). All the processes
register for SIGSHRINK via lam_ksignal(). The LAM
universe consisted of 2 nodes. I went in and did a
tkill on one node. I noticed there was a substantial
delay in receiving SIGSHRINK. The delay varied from a
few seconds to a few minutes (I crudely timed one at
around 5 minutes).
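
 For reference, the crude timing could be tightened up
with something like the sketch below. It is only an
illustration under stated assumptions: it uses a plain
POSIX signal (SIGUSR1) and sigaction() as a stand-in,
because SIGSHRINK and lam_ksignal() are LAM-specific and
the real registration call is not reproduced here. The
names on_signal, now_seconds and measure_signal_latency
are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <time.h>

/* Set by the handler. volatile sig_atomic_t is the type that is
 * safe to write from inside a signal handler. */
static volatile sig_atomic_t got_signal = 0;

/* Hypothetical handler: in the real test this would be the function
 * registered for SIGSHRINK; here it only records that it ran. */
static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Monotonic wall-clock time in seconds (not affected by clock resets). */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Installs the handler for SIGUSR1 (stand-in for SIGSHRINK), sends the
 * signal to this process, spins until the handler has run (like the
 * "infinite while loop" test case), and returns the elapsed time in
 * seconds. Returns -1.0 if the handler or the signal could not be set up. */
double measure_signal_latency(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);

    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1.0;

    got_signal = 0;
    double start = now_seconds();

    if (raise(SIGUSR1) != 0)
        return -1.0;

    while (!got_signal) {
        /* busy-wait until the handler sets the flag */
    }

    return now_seconds() - start;
}
```

 In the actual test the start timestamp would be taken
when the tkill is issued on the other node and the end
timestamp inside the SIGSHRINK handler, so the two
clocks would need to be comparable across nodes.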
 I checked the node's CPU load (no processes other
than mine are running) and memory (no shortage). I am
guessing network traffic should not be a factor, since
the local lamd is signalling processes on the same
node (there are only 2 nodes). In any case, the nodes
are on a LAN with little or no traffic. I ran the
tests at various times of the day (and night) with
similar delays. I am flushing the output of all print
statements.
  Is this behaviour normal? What could be the reason?
Is there any way I can speed up the receipt (and
capture) of the signal?
 Thanks for any help and advice.
Newbie Lam-Mpi user
Vinod

                