On Aug 11, 2007, at 10:37 AM, Greg Blair wrote:
> We have had a four-machine MPI application running under LAM for about a
> year. Each machine acquires and processes real-time data. Information
> about the acquired data is exchanged with the other three machines. The
> system uses matched MPI Ssend/Recv calls over jumbo-frame (MTU=9000)
> 1 gigE Ethernet. Each machine is connected to a 1 gigE switch
> configured for jumbo frames.
>
> About 99.9999% of the time this works.
>
> However, the TCP transfers underneath the MPI software layer sometimes
> time out. The kernel generates retries and eventually the TCP packet is
> transferred, the Ssend/Recv calls complete, and we have our data. This
> creates an excessive delay for our application: the real-time
> acquisition falls apart and we have to restart the system.
Unfortunately, real-time message delivery is outside the scope of what
LAM/MPI was designed for, so I can't offer much advice. I don't have
enough in-depth knowledge of the TCP stack to have an opinion on how to
keep its retransmission time down to the point you need. Perhaps
someone on this list is more familiar with TCP than I am...
Good luck,
Brian
--
Brian Barrett
LAM/MPI Developer
Make today a LAM/MPI day!