
LAM/MPI General User's Mailing List Archives


From: jess michelsen (jam_at_[hidden])
Date: 2003-11-10 07:27:22


Hi Robert and everyone!

I replaced the Intel gigabit driver (e1000) version 4.3.2-k1 NAPI with
the latest version (5.2.20). I also replaced the OS kernel
(linux-2.4.18-14) with linux-2.4.20-20.8. Since then, the MPI
communication between two nodes seems much more stable, i.e. I have not
seen any 'hangs' so far. However, both latency and bandwidth have
deteriorated: latency is now around 250 usec and bandwidth around
300 Mbit/sec (they were 120 usec and 600 Mbit/sec before).

1) Could the changed OS kernel be responsible, at least in part, for this?

2) Should LAM-MPI be re-installed, i.e. is the Intel driver linked into
the LAM or MPI programs?

3) Has anybody studied how the parameters for the e1000 driver (which
are set when the ethernet devices are activated - the e1000 driver is a
module, not compiled into the kernel) affect performance? Is there a
setting that is both optimal and safe? In our case, we will be
latency-bound part of the time, and the packet sizes are normally below
64 KByte, so both latency and bandwidth need to be as good as possible
without sacrificing stability.
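
For example, since the e1000 module is loaded when the interfaces come
up, its parameters can be set in /etc/modules.conf. Something along
these lines (parameter names as in Intel's e1000 README; the values are
only illustrative, not a tested recommendation):

   alias eth0 e1000
   # lower interrupt delays generally reduce latency but raise the
   # interrupt rate; larger descriptor rings help sustained bandwidth
   options e1000 RxIntDelay=0 TxIntDelay=0 RxDescriptors=256 TxDescriptors=256

Would a setting like this be considered safe, or does it risk the kind
of instability we saw before?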

Best regards, Jess Michelsen

>We had a similar problem. Our MPI jobs would crash a node after 2-3
>hours on Redhat 8.0 with an Intel gigabit network card. The solution
>was to replace the gigabit driver with a newer version from Intel.

>We are now using Intel(R) PRO/1000 Network Driver - version 5.2.16.
>The systems all came with Intel(R) PRO/1000 Network Driver - version
>4.3.2-k1 NAPI (020618).

>>
>> Hi LAM-community!
>>
>> I've successfully configured and installed LAM-MPI on a couple of PCs.
>> Hardware and software are as follows:
>>
>> Dell PE 650, different makes of switches have been tested.
>> Intel PRO/1000 MT Server NIC
>> Redhat Linux 8.0, glibc problems (_bswap32) rectified.
>> LAM-MPI 7.0.2
>> Intel 7.1 compilers
>>
>> LAM is booted as:
>> lamboot -v -ssi boot rsh -ssi rsh_agent "ssh" hostfile
>>
>> and the job submitted as:
>>
>> mpirun -np 2 -O MPItest
>>
>>
>> The Fortran code (MPItest1.f) below is intended to test that we have
>> the right latencies and bandwidth, and that nothing 'funny' is going
>> on while we are running. Running it has already led to the removal of
>> a couple of unnecessary cron jobs which had increased the mean latency
>> by 2-3x. The amount of computation can be adjusted in order to study
>> its impact on latency (computation tends to increase latency by 10-15%
>> relative to the values without computation).
>>
>> With the present settings, the code transfers 10000 consecutive
>> packets of 32 KByte each, bi-directionally. The round-trip time is
>> about 500 usec, i.e. about 500 Mbit/sec per direction (32 KByte per
>> 500 usec is roughly 524 Mbit/sec), which seems sensible.
>>
>> Once in a while, processor rank 1 fails to respond sometime during
>> the MPI job. ssh to the host results in 'no route to host', and even
>> the local keyboard cannot get a response. The hardware (PCs, switches,
>> cables) has all been swapped to rule out that a specific piece of
>> hardware was the problem.
>>
>> Admittedly, this intensity of communication is rather extreme. The
>> question is whether it is too extreme (ideally, a user should really
>> not be able to bring a node down) or whether there is some bug or
>> mis-configuration involved.
>>
>> Best regards, Jess Michelsen
>>
>>
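
P.S. Since MPItest1.f itself is not included in this archive, here is a
minimal sketch of the kind of ping-pong loop described above (an
illustration only, assuming the 32 KByte messages and 10000 round trips
stated in the text - not the original file):

      program pingpong
c     Illustrative sketch only - not the original MPItest1.f.
c     Rank 0 sends a 32 KByte message to rank 1 and waits for it to
c     come back; the mean round-trip time is reported at the end.
      implicit none
      include 'mpif.h'
      integer n, nreps
      parameter (n = 32*1024, nreps = 10000)
      integer ierr, rank, i, stat(MPI_STATUS_SIZE)
      character buf(n)
      double precision t0, t1
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      t0 = MPI_WTIME()
      do i = 1, nreps
         if (rank .eq. 0) then
            call MPI_SEND(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD, ierr)
            call MPI_RECV(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
     &                    stat, ierr)
         else if (rank .eq. 1) then
            call MPI_RECV(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
     &                    stat, ierr)
            call MPI_SEND(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD, ierr)
         end if
      end do
      t1 = MPI_WTIME()
      if (rank .eq. 0) then
         write(*,*) 'mean round-trip time (usec):',
     &              (t1 - t0) / nreps * 1.0d6
      end if
      call MPI_FINALIZE(ierr)
      end

Run it with two processes as in the original post (mpirun -np 2 -O
pingpong); at the ~500 usec round-trips quoted above, the whole loop
takes about 5 seconds.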

_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/