On Wed, Nov 21, 2001 at 12:20:58AM -0500, Brian W. Barrett wrote:
> On Sat, 17 Nov 2001, Wan KUANG wrote:
>
> > I am using 10 Redhat Linux 7.2 machines over 100M
> > dedicated switch with LAM-6.5. The program is compiled
> > with KCC4.0e compiler. When I killed the job, the LAM
> > shows that some of the nodes is stalled in Waitall. I
> > could tell from the blinking light on the switch that
> > the program is not dead, just very slow, maybe one
> > flash every two seconds which gives terrible
> > performance.
> >
> > Does anyone have any idea what is going on?
>
> Unfortunately, I can't offer any solutions. I don't have any 7.2
> machines, but was unable to replicate your problems with KCC4.0e and RH
> 7.1 on our machines. Based on what you are describing, it would seem that
> there might be something wrong with the underlying network infrastructure.
> Are you running SMP machines? If so, you might want to try running the
> machines with only 1 CPU enabled (you will have to reboot to try this).
> There have been some issues with certain linux kernels, network drivers,
> and SMP machines in the past. Just some place to start looking.

These symptoms remind me of a problem that I described a while ago and
never found a solution for. Here is what I found (on a dual Athlon MP
cluster with three bonded 100baseT channels):
- the problem appears randomly when using LAM, i.e., sometimes the
program (2d parallel FFTs, using the fftw library) executes normally,
sometimes the program hangs.
- when it hangs, the program executes an r = read(sock, buf, nbytes)
statement over and over again. Typically r=56 or r=696 while
nbytes=116765796, i.e., the outstanding byte count only decreases in steps
of 56 or 696, so for practical purposes the program hangs. This behaviour
causes the "blinking light" on the switch and the NICs. When the program
executes normally I typically find r=8688 or larger, and the read
statements are executed at a much, much higher frequency.
- when using MPICH the program does not hang.
- when using the 2.2.19 smp kernel the program does not hang.
- using the append="noapic" setting in /etc/lilo.conf with a 2.4.x kernel
reduces the failure rate but still the program hangs with a probability
that is unacceptable for a production environment.
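
To make the read() observation above easier to reproduce, here is a minimal sketch of a loop that drains nbytes from a descriptor and records how large the individual read() returns are. It is not LAM's code: read_full, struct read_stats and reads_needed are hypothetical names made up for this illustration, and the sketch assumes a blocking descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bookkeeping (not part of LAM): sizes returned by read(). */
struct read_stats {
    unsigned long calls;  /* number of read() calls that returned data */
    size_t min_chunk;     /* smallest positive return value seen */
    size_t max_chunk;     /* largest return value seen */
};

/* Read up to nbytes from fd, looping over short reads.
 * Stops early on EOF. Returns the bytes read, or -1 on error. */
ssize_t read_full(int fd, char *buf, size_t nbytes, struct read_stats *st)
{
    size_t done = 0;

    assert(buf != NULL && st != NULL);
    st->calls = 0;
    st->min_chunk = 0;
    st->max_chunk = 0;

    while (done < nbytes) {
        ssize_t r = read(fd, buf + done, nbytes - done);
        if (r < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            return -1;             /* real error */
        }
        if (r == 0)
            break;                 /* EOF: peer closed the connection */

        size_t chunk = (size_t)r;
        if (st->calls == 0 || chunk < st->min_chunk)
            st->min_chunk = chunk;
        if (chunk > st->max_chunk)
            st->max_chunk = chunk;
        st->calls++;
        done += chunk;
    }
    return (ssize_t)done;
}

/* Number of read() calls needed to drain total bytes if every call
 * returns at most chunk bytes (ceiling division). */
unsigned long reads_needed(unsigned long total, unsigned long chunk)
{
    return (total + chunk - 1) / chunk;
}
```

With the numbers quoted above, reads_needed(116765796, 56) is 2085104 and reads_needed(116765796, 696) is 167767, compared with 13440 calls at r=8688. Combined with the much lower call frequency in the failing case, that is why the program hangs for practical purposes.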
From this I concluded that I cannot use a 2.4 kernel and LAM. I do not know
with certainty what is causing the failures:
- is it a LAM bug?
-> is there anything in LAM that could cause such a dramatic reduction in
the number of bytes that are read in a read statement?
- is it a 3c59x driver bug?
- is it a 2.4 kernel bug?
This problem does not seem to have anything to do with the version of
RedHat (it happens with 7.1 and 7.2).
For now the only solution (or workaround) for me is to use MPICH instead of
LAM. This may lead to a slight decrease in performance, although the
difference is no longer very large since the release of mpich-1.2.2.
I have already spent several weeks debugging this problem and have by now
given up. If somebody else is interested (and has the time), I can send
you the code that causes the problem on my system (which doesn't mean that
it'll cause a problem on your system).
Cheers,
Martin
========================================================================
Martin Siegert
Academic Computing Services phone: (604) 291-4691
Simon Fraser University fax: (604) 291-4242
Burnaby, British Columbia email: siegert_at_[hidden]
Canada V5A 1S6
========================================================================
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/