We had a similar problem: our MPI jobs would crash a node after 2-3 hours.
We were running Red Hat 8.0 with an Intel gigabit network card. The solution
was to replace the gigabit driver with a newer version from Intel.
We are now using:
Intel(R) PRO/1000 Network Driver - version 5.2.16
The systems all came with:
Intel(R) PRO/1000 Network Driver - version 4.3.2-k1 NAPI (020618)
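
To see which driver version a node is running, something like the sketch below may help. It assumes `ethtool -i <interface>` is available and prints `key: value` lines including `driver` and `version`. The sample text and the helper names are illustrative only, and 5.2.16 is used as the threshold simply because it is the version that worked for us.

```python
import re


def parse_version(text):
    """Return the leading dotted-number part of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError("no numeric version found in %r" % text)
    return tuple(int(part) for part in match.group(0).split("."))


def driver_info(ethtool_output):
    """Turn 'key: value' lines (as printed by `ethtool -i`) into a dict."""
    info = {}
    for line in ethtool_output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def is_older_than(ethtool_output, minimum="5.2.16"):
    """True if the reported driver version is below `minimum`."""
    info = driver_info(ethtool_output)
    return parse_version(info["version"]) < parse_version(minimum)


# Hypothetical sample output, modelled on the version string our systems shipped with.
SAMPLE = "driver: e1000\nversion: 4.3.2-k1 NAPI\n"

if __name__ == "__main__":
    print(is_older_than(SAMPLE))
```

The comparison uses only the numeric prefix, so suffixes such as `-k1 NAPI` are ignored.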
----- Original Message -----
From: "jess michelsen" <jam_at_[hidden]>
To: <lam_at_[hidden]>
Sent: Thursday, November 06, 2003 11:33 AM
Subject: LAM: Node fails during intensive bi-directional communication.
>
> Hi LAM-community!
>
> I've successfully configured and installed LAM-MPI on a couple of PCs.
> Hardware and software are as follows:
>
> Dell PE 650, different makes of switches have been tested.
> Intel 1000MT pro server NIC
> Redhat Linux 8.0, glibc problems (_bswap32) rectified.
> LAM-MPI 7.0.2
> Intel 7.1 compilers
>
> LAM is booted as:
> lamboot -v -ssi boot rsh -ssi rsh_agent "ssh" hostfile
>
> and the job submitted as:
>
> mpirun -np 2 -O MPItest
>
>
> The Fortran code (MPItest1.f) below is intended to test that we have the
> right latencies and bandwidth, and that nothing 'funny' is going on
> while we are running. Running this has already led to the removal of a
> couple of unnecessary cron jobs, which had increased the mean latency by
> a factor of 2-3. The amount of computation can be adjusted in order to
> study its impact on latency (computation tends to increase latency by
> 10-15% relative to the values without computation).
>
> With the present settings, the code transfers 10000 consecutive packets
> of 32 KB each, in both directions. The round-trip time is about 500
> microseconds, i.e. about 500 Mbit/sec, which seems sensible.
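
A quick check of that figure, assuming each packet is 32 kilobytes (32 * 1024 bytes) and that one packet travels in each direction per 500-microsecond round trip. The function name is made up for illustration.

```python
def throughput_mbit_per_s(message_bytes, round_trip_us):
    """Per-direction throughput in Mbit/s for one message per round trip."""
    bits = message_bytes * 8
    seconds = round_trip_us * 1e-6
    return bits / seconds / 1e6


rate = throughput_mbit_per_s(32 * 1024, 500)
print(round(rate))  # 524
```

Under those assumptions the result is roughly 524 Mbit/s per direction, in line with the "about 500 Mbit/sec" quoted.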
>
> Once in a while, the node running rank 1 stops responding at some point
> during the MPI job. ssh to the host results in 'no route to host', and
> even the console keyboard gets no response. The hardware (PCs, switches,
> cables) has all been swapped to rule out a specific piece of hardware as
> the problem.
>
> Admittedly, this intensity of communication is rather extreme. The
> question is whether it is too extreme (ideally, a user should not be able
> to bring a node down) or whether some bug or misconfiguration is
> involved.
>
> Best regards, Jess Michelsen
>
>
> ----------------------------------------------------------------
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>