Greetings all.
For the last several months, we have had some problems with the gm RPI,
both in the 7.0.x releases and on the LAM development trunk (i.e., what
will [soon] become 7.1).
The problem in 7.0.x is that LAM does not correctly re-transmit gm
packets when gm informs LAM that a packet has been dropped. In this case,
LAM prints a warning message and then deadlocks. This generally happens
only if you send a message from one MPI process to another and the
receiver does not enter the gm RPI progression engine for a long time
(e.g., 30 seconds) to receive it. This is unlikely to be fixed in the
7.0.x series.
On the development trunk, we have fixed that problem, but have had other
problems -- mysterious seg faults and occasional deadlocks. We have
finally tracked these down and are working through the issue with the
Myricom engineers. The cause appears to be an optimization on the LAM
trunk that is not in the 7.0.x series: long messages are sent with
gm_get() (a new function introduced in the GM 2.0.x series). Switching
LAM back to a long protocol based on gm_put() fixes the problem,
although the latency for long messages increases a little because it
requires one additional LAM-level gm message compared with the
gm_get()-based protocol (but these are long MPI messages, so it really
doesn't matter).
The SVN trunk now reflects this change, and should [finally] be stable.
The gm RPI configure script has grown a new option: --with-rpi-gm-get.
This option can be used to enable the gm_get()-based long protocol for
those willing to try it out.
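
For anyone who wants to try the gm_get()-based long protocol, a build
might look roughly like the following. This is only a sketch: the install
prefix is a placeholder, and any other options you normally pass to
configure (for example, to locate your GM installation) are left out.

```shell
# Hypothetical install prefix; substitute your own.
PREFIX="$HOME/lam-trunk-gm-get"

# --with-rpi-gm-get asks the gm RPI to use the gm_get()-based long
# message protocol instead of the gm_put()-based one.
./configure --prefix="$PREFIX" --with-rpi-gm-get
make
make install
```

Configuring without the option gives the gm_put()-based long protocol.
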
Note that it *looks* like the gm_get() issues are a problem with GM
itself, but one can never tell with such deep, mystical, kernel-based
systems -- it may still be a problem with LAM's gm_get()-based protocol.
So we're hedging our bets and putting in the --with-rpi-gm-get switch in
case it isn't LAM's fault. :-)
We'll continue iterating with the Myricom engineers and keep you all
posted.
Thanks for your patience.
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/