LAM/MPI General User's Mailing List Archives

From: Brian Barrett (brbarret_at_[hidden])
Date: 2005-01-07 12:53:31


It has been a couple of days, so I wanted to quickly follow up with a
status update on this issue. We now know what the problem is, and can say
that it is only in the usysv RPI. The sysv RPI should not exhibit this
problem. We have a solution mostly ready, but still have some build
system issues to clean up. If you want the details, read on.

The usysv RPI makes two assumptions about the underlying memory
subsystem: 1) that writes are always ordered and 2) that the memory
system is cache coherent. Unfortunately, this was lost from the
documentation somewhere along the line (the usysv RPI is many years
old - it predates my time working on LAM). The PPC 970 (aka G5) does
fairly aggressive instruction reordering that can result in unordered
writes. This is complicated by the fact that the memory controller on
a G5 machine can and does reorder memory reads and writes to make
better use of the memory bus. So one of the two basic assumptions of
the usysv RPI isn't met on an Apple G5 machine. I'm not sure about the
IBM blade servers based on the PPC 970, but would assume the same
problem exists there as well.
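
To make the failure mode concrete, here is a minimal sketch in plain C
of the kind of write-then-publish handoff that depends on ordered
writes. The struct and function names are made up for illustration;
this is not the real usysv source.

/* Illustrative sketch only -- NOT the real usysv RPI source. */
struct mailbox {
    volatile int payload;   /* the message data                 */
    volatile int ready;     /* 0 = empty, 1 = payload is valid  */
};

/* Sender: write the data, then publish it. */
void post(struct mailbox *m, int value)
{
    m->payload = value;     /* write #1 */
    m->ready = 1;           /* write #2 -- assumed to become
                               visible only after write #1      */
}

/* Receiver: wait for the flag, then read the data. */
int fetch(struct mailbox *m)
{
    while (m->ready == 0)
        ;                   /* spin until published             */
    return m->payload;      /* with reordered writes, this can
                               still be the previous message    */
}

If write #2 becomes visible before write #1, the receiver sees the
flag set but reads stale data - the same flavor of corruption reported
earlier in this thread.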

The PowerPC architecture does give us a solution to the problem, as
there is an instruction to force the dispatch unit to dispatch all
pending instructions before the current instruction completes. A
well-behaving memory controller (which Apple's is) will also commit all
pending writes before the next write is started. In short, we force
the machine to temporarily appear to have ordered writes. While there
appear to be a number of aggressive ways of doing this, the easiest
(and what we are doing to fix the usysv code for 7.1.2) is to use the
"sync" instruction between writes that must be ordered.

Brian

On Jan 4, 2005, at 2:03 PM, Jacob Vos wrote:

> Jeff,
>
> Thank you for confirming that bug and for the temporary workaround. I
> don't have any room for error, no matter how infrequent the failures
> are. All test runs have been failure-free so far.
>
> -Jake
>
>
> -----Original Message-----
> From: Jeff Squyres [mailto:jsquyres_at_[hidden]]
> Sent: Monday, January 03, 2005 10:37 PM
> To: General LAM/MPI mailing list
> Cc: michael_at_[hidden]
> Subject: Re: LAM: data corruption
>
> I can unfortunately confirm that this is a bug -- I was able to
> replicate the problem with the usysv RPI on a dual processor G5 using
> the SVN trunk (which is effectively the same as the latest 7.1.2 beta).
> I'm testing the other RPIs right now, but I'm *guessing* that it's
> only going to be usysv that is affected.
>
> If this is the case, a temporary workaround is simply to use another
> RPI, such as sysv:
>
> mpirun -ssi rpi sysv ....
>
> To clarify: although I haven't done any analysis yet, this looks and
> feels like a race condition in the usysv RPI. usysv is one of LAM's two
> shared memory transports; it uses spin locks for locking access to
> shared memory segments. sysv is the other shared memory transport; it
> uses SYSV semaphores for locking. So if I had to guess, I'd say that
> there's some kind of race condition in LAM's spin lock code that
> exhibits itself under high load and/or multi-CPU situations (again, I
> haven't done any analysis yet -- this is all gut feeling at this
> point).
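
For readers unfamiliar with the distinction: a spin lock busy-waits on
a shared flag in user space, while a SYSV semaphore blocks in the
kernel. A generic sketch of a spin lock, using C11 atomics purely for
illustration (this is not LAM's implementation), looks like this:

/* Generic spin lock sketch -- not LAM's implementation. */
#include <stdatomic.h>

typedef struct {
    atomic_flag held;        /* clear = unlocked, set = locked */
} spinlock;

void spin_init(spinlock *l)
{
    atomic_flag_clear(&l->held);               /* start unlocked */
}

void spin_acquire(spinlock *l)
{
    /* busy-wait until this caller is the one that sets the flag */
    while (atomic_flag_test_and_set_explicit(&l->held,
                                             memory_order_acquire))
        ;                                      /* spin */
}

void spin_release(spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

The acquire/release ordering shown here is the kind of guarantee that,
per Brian's follow-up at the top of this page, the usysv code was
missing on the PPC 970.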
>
>
>
> On Jan 3, 2005, at 6:57 PM, Jacob Vos wrote:
>
>> Here is an example of when there are a ton of errors. It is completely
>> random as to which "processor" fails. I've seen the same frequency for
>> all of them. Today, zero has definitely failed the most often. Also,
>> the
>> second receive fails, in general, much more often than the first.
>>
>> MachZero:~/Desktop/lamtest jake$ mpirun C a.out
>> rank 0 out of 8 instances.
>> rank 6 out of 8 instances.
>> rank 3 out of 8 instances.
>> rank 2 out of 8 instances.
>> rank 1 out of 8 instances.
>> rank 4 out of 8 instances.
>> rank 5 out of 8 instances.
>> rank 7 out of 8 instances.
>> CPU 0 recv2: 9078 != 9077
>> CPU 1 recv2: 12142 != 12141
>> CPU 1 recv2: 12533 != 12532
>> CPU 0 recv2: 17126 != 17125
>> CPU 0 recv2: 18869 != 18868
>> CPU 0 recv2: 34217 != 34216
>> CPU 0 recv2: 49035 != 49034
>> CPU 0 recv2: 60389 != 60388
>> CPU 0 recv2: 69068 != 69067
>> CPU 1 recv2: 71159 != 71158
>> CPU 0 recv2: 89829 != 89828
>> CPU 0 recv2: 91318 != 91317
>> CPU 0 recv2: 124826 != 124825
>> CPU 0 recv2: 130880 != 130879
>> CPU 0 recv2: 131121 != 131120
>> CPU 1 recv2: 137015 != 137014
>> CPU 0 recv2: 140855 != 140854
>> CPU 3 recv2: 147514 != 147513
>> CPU 0 recv2: 152036 != 152035
>> CPU 0 recv2: 187224 != 187223
>> CPU 0 recv2: 190873 != 190872
>> CPU 0 recv2: 191452 != 191451
>> CPU 3 recv2: 198360 != 198359
>> CPU 1 recv2: 200483 != 200482
>> CPU 1 recv2: 201806 != 201805
>> CPU 1 recv2: 224145 != 224144
>> CPU 1 recv2: 230078 != 230077
>> CPU 3 recv2: 232821 != 232820
>> CPU 1 recv2: 243280 != 243279
>> CPU 0 recv2: 246484 != 246483
>> CPU 0 recv2: 246749 != 246748
>> CPU 3 recv2: 248972 != 248971
>> CPU 1 recv2: 250754 != 250753
>> CPU 0 recv2: 268311 != 268310
>> CPU 0 recv2: 279668 != 279667
>> CPU 0 recv2: 285077 != 285076
>> CPU 3 recv2: 291615 != 291614
>> CPU 0 recv2: 292216 != 292215
>> CPU 0 recv2: 309108 != 309107
>> CPU 0 recv2: 311748 != 311747
>> CPU 0 recv2: 312668 != 312667
>> CPU 1 recv2: 312912 != 312911
>> CPU 0 recv2: 321491 != 321490
>> CPU 1 recv2: 328702 != 328701
>> CPU 1 recv2: 330495 != 330494
>> CPU 0 recv2: 332820 != 332819
>> CPU 0 recv2: 337565 != 337564
>> CPU 0 recv2: 342717 != 342716
>> CPU 0 recv2: 344977 != 344976
>> CPU 1 recv2: 352068 != 352067
>> CPU 0 recv2: 365009 != 365008
>> CPU 1 recv2: 376899 != 376898
>> CPU 1 recv2: 377322 != 377321
>> CPU 0 recv2: 380529 != 380528
>> CPU 0 recv2: 392425 != 392424
>> CPU 0 recv2: 418243 != 418242
>> CPU 0 recv2: 419348 != 419347
>> CPU 1 recv2: 421576 != 421575
>> CPU 3 recv2: 422983 != 422982
>> CPU 0 recv2: 431440 != 431439
>> CPU 0 recv2: 432557 != 432556
>> CPU 0 recv2: 448427 != 448426
>> CPU 0 recv2: 456880 != 456879
>> CPU 3 recv2: 466409 != 466408
>> CPU 1 recv2: 470845 != 470844
>> CPU 0 recv2: 485413 != 485412
>> CPU 1 recv2: 494849 != 494848
>>
>>
>> -----Original Message-----
>> From: Michael Gauckler [mailto:michael.gauckler_at_[hidden]]
>> Sent: Monday, January 03, 2005 5:48 PM
>> To: General LAM/MPI mailing list
>> Subject: Re: LAM: data corruption
>>
>> Hi Jake,
>>
>> I cannot confirm your findings: I tried your code with my setup and the
>> program terminated without any fault detected. Maybe you should supply
>> more details about your setup to locate the source of the problem.
>>
>> Below my configuration is given - I hope this helps.
>>
>> Cheers,
>> Michael
>> ___
>>
>> Hardware: IBM X31 laptop, 1.5 GHz Intel Pentium M
>> OS: Ubuntu Linux/Debian
>> LAM: Installed from Debian package, 7.0.6, i386-pc-linux-gnu
>>
>> I added this line to have at least some output:
>>
>> cout << "rank " << rank << " out of " << numtasks << " instances." <<
>> endl;
>>
>> mig_at_ubuntu:~/lamtest $ mpic++ test.cpp
>>
>> mig_at_ubuntu:~/lamtest $ cat hostfile
>> localhost cpu=8
>>
>> mig_at_ubuntu:~/lamtest $ mpirun -c 8 a.out
>> rank 0 out of 8 instances.
>> rank 2 out of 8 instances.
>> rank 1 out of 8 instances.
>> rank 4 out of 8 instances.
>> rank 3 out of 8 instances.
>> rank 7 out of 8 instances.
>> rank 5 out of 8 instances.
>> rank 6 out of 8 instances.
>> mig_at_ubuntu:~/lamtest $
>>
>> mig_at_ubuntu:~/lamtest $ uname -a
>> Linux ubuntu 2.6.8.1 #1 Mon Nov 29 16:56:41 CET 2004 i686 GNU/Linux
>>
>> On Monday, 03.01.2005, at 12:06 -0500, Jacob Vos wrote:
>>> Hi,
>>>
>>> Below is a small program demonstrating some spurious message data being
>>> received. For convenience of development before I moved to the cluster,
>>> I was using a Dual 2.5GHz G5 with a 'lamnode_file' set at 'cpu=8'. I
>>> noticed that randomly the second receive would contain the same data as
>>> the first receive. I can't reproduce the anomaly with 'cpu=2'.
>>>
>>> I thought my logic may have been flawed. I however could not find the
>>> source of my error. So, I wrote this small test program and I was able
>>> to reproduce the anomaly.
>>>
>>> Basically, about .005% to .01% of the send/receives are corrupt. It's
>>> not clear in this test code, because the value of the last send happens
>>> to be n-1. However, in my original code that I found the anomaly in, the
>>> previous set of sends were not related. The second receive call, when
>>> it was corrupt, always had the same value as the first.
>>>
>>> Please confirm that this is indeed a bug, inform me of my poor use of
>>> asynchronous communication, or indicate that using 'cpu=8' is flawed.
>>>
>>> A typical output would be:
>>>
>>> CPU 4 recv2: 2984 != 2983
>>> CPU 5 recv2: 71459 != 71458
>>> CPU 7 recv2: 122923 != 122922
>>> CPU 6 recv2: 156124 != 156123
>>> CPU 3 recv2: 185705 != 185704
>>> CPU 0 recv2: 350950 != 350949
>>> CPU 0 recv2: 356951 != 356950
>>> CPU 4 recv2: 449649 != 449648
>>>
>>> Thanks all,
>>>
>>> Jake
>>>
>>> ---------------------------------------------------------
>>>
>>> #include "mpi.h"
>>> #include <iostream>
>>> using namespace std;
>>>
>>> int main(int argc, char *argv[])
>>> {
>>>   int numtasks, rank;
>>>
>>>   MPI_Init(&argc,&argv);
>>>   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>
>>>   int prev = rank-1;
>>>   int next = rank+1;
>>>   if (rank == 0) prev = numtasks - 1;
>>>   if (rank == (numtasks - 1)) next = 0;
>>>
>>>   for(int n=0; n < 1000000; n++)
>>>   {
>>>     int send1 = n;
>>>     int send2 = n+1;
>>>
>>>     int recv1, recv2;
>>>     MPI_Request reqs[4];
>>>     MPI_Status stats[4];
>>>     MPI_Irecv(&recv1, 1, MPI_INT, prev, 1, MPI_COMM_WORLD, &reqs[0]);
>>>     MPI_Irecv(&recv2, 1, MPI_INT, prev, 2, MPI_COMM_WORLD, &reqs[1]);
>>>     MPI_Isend(&send1, 1, MPI_INT, next, 1, MPI_COMM_WORLD, &reqs[2]);
>>>     MPI_Isend(&send2, 1, MPI_INT, next, 2, MPI_COMM_WORLD, &reqs[3]);
>>>     MPI_Waitall(4, reqs, stats);
>>>
>>>     if(send1 != recv1)
>>>       cout << "CPU " << rank << " recv1: " << send1 << " != " << recv1 << endl;
>>>     if(send2 != recv2)
>>>       cout << "CPU " << rank << " recv2: " << send2 << " != " << recv2 << endl;
>>>   }
>>>
>>>   MPI_Finalize();
>>> }

-- 
   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have an LAM/MPI day: http://www.lam-mpi.org/