
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-01-04 07:45:24


Another note -- after running the test all night, the errors only showed
up in the usysv RPI. This is not 100% conclusive, of course -- the lack
of an error showing up doesn't mean that the error doesn't exist -- but
it does lend pretty good credence to my theory that this is somehow a
problem with the usysv RPI.
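
One more data point that could help separate the application's use of the
nonblocking requests from the RPI itself: run the same ring exchange with
the blocking MPI_Sendrecv call instead of the Irecv/Isend/Waitall
combination. Here is an untested sketch of the test program rewritten that
way (same ring, same tags). If the corruption still appears under usysv
but not under sysv, that points even more strongly at the transport rather
than at the request handling in the test code.

#include "mpi.h"
#include <iostream>
using namespace std;

int main(int argc, char *argv[])
{
  int numtasks, rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int prev = rank - 1;
  int next = rank + 1;
  if (rank == 0) prev = numtasks - 1;
  if (rank == (numtasks - 1)) next = 0;

  for (int n = 0; n < 1000000; n++)
  {
    int send1 = n;
    int send2 = n + 1;
    int recv1, recv2;
    MPI_Status st1, st2;

    // Same neighbor exchange as the original test, but blocking:
    // MPI_Sendrecv pairs each send with the matching receive, so no
    // request array or MPI_Waitall is involved.
    MPI_Sendrecv(&send1, 1, MPI_INT, next, 1,
                 &recv1, 1, MPI_INT, prev, 1, MPI_COMM_WORLD, &st1);
    MPI_Sendrecv(&send2, 1, MPI_INT, next, 2,
                 &recv2, 1, MPI_INT, prev, 2, MPI_COMM_WORLD, &st2);

    if (send1 != recv1) cout << "CPU " << rank << " recv1: " << send1
                             << " != " << recv1 << endl;
    if (send2 != recv2) cout << "CPU " << rank << " recv2: " << send2
                             << " != " << recv2 << endl;
  }

  MPI_Finalize();
}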

On Jan 3, 2005, at 10:36 PM, Jeff Squyres wrote:

> I can unfortunately confirm that this is a bug -- I was able to
> replicate the problem with the usysv RPI on a dual processor G5 using
> the SVN trunk (which is effectively the same as the latest 7.1.2
> beta). I'm testing the other RPIs right now, but I'm *guessing* that
> it's only going to be usysv that is affected.
>
> If this is the case, a temporary workaround is simply to use another
> RPI, such as sysv:
>
> mpirun -ssi rpi sysv ....
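>
> (In case it's useful: running laminfo will show which rpi modules your
> LAM installation was actually built with, so you can confirm that sysv
> is available before switching. For example, matching the invocation in
> your transcript:
>
>   laminfo
>   mpirun -ssi rpi sysv C a.out
>
> The exact laminfo output varies by installation.)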
>
> To clarify: although I haven't done any analysis yet, this looks and
> feels like a race condition in the usysv RPI. usysv is one of LAM's 2
> shared memory transports; it uses spin locks for locking access to
> shared memory segments. sysv is the other shared memory transport; it
> uses SYSV semaphores for locking. So if I had to guess, I'd say that
> there's some kind of race condition in LAM's spin locks code that
> exhibits itself under high load and/or multi-CPU situations (again, I
> haven't done any analysis yet -- this is all gut feeling at this
> point).
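>
> To illustrate the kind of failure mode I have in mind -- and this is
> just a generic sketch using C++11 threads/atomics for illustration,
> *not* LAM's actual locking code -- a spin lock whose "test" and "set"
> are not a single atomic operation can let two processes into the same
> shared-memory critical section at once on a multi-CPU machine, while
> appearing to work fine on one CPU:
>
>   // Illustrative only: an atomic test-and-set spin lock vs. a broken
>   // test-then-set one.  Two threads increment a shared counter under
>   // each lock; the broken lock typically loses updates.
>   #include <atomic>
>   #include <iostream>
>   #include <thread>
>
>   std::atomic_flag good_lock = ATOMIC_FLAG_INIT;  // atomic test-and-set
>   volatile int     bad_lock  = 0;                 // plain test, then set
>   long counter = 0;
>
>   void add_good(int n) {
>     for (int i = 0; i < n; ++i) {
>       while (good_lock.test_and_set(std::memory_order_acquire)) {}  // spin
>       ++counter;                                    // critical section
>       good_lock.clear(std::memory_order_release);
>     }
>   }
>
>   void add_bad(int n) {
>     for (int i = 0; i < n; ++i) {
>       while (bad_lock != 0) {}  // test ...
>       bad_lock = 1;             // ... then set: two CPUs can both get here
>       ++counter;
>       bad_lock = 0;
>     }
>   }
>
>   int main() {
>     const int n = 1000000;
>     std::thread a(add_good, n), b(add_good, n);
>     a.join(); b.join();
>     std::cout << "atomic spin lock:   " << counter
>               << " (expect " << 2 * n << ")" << std::endl;
>
>     counter = 0;
>     std::thread c(add_bad, n), d(add_bad, n);
>     c.join(); d.join();
>     std::cout << "test-then-set lock: " << counter
>               << " (often less than " << 2 * n << ")" << std::endl;
>     return 0;
>   }
>
> The point is just that this class of bug shows up exactly the way it's
> showing up here: rarely, randomly, and only under multi-CPU load.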
>
>
>
> On Jan 3, 2005, at 6:57 PM, Jacob Vos wrote:
>
>> Here is an example of a run with a ton of errors. Which "processor"
>> fails is completely random -- over many runs I've seen roughly the same
>> frequency for all of them, although today rank 0 has definitely failed
>> the most often. Also, the second receive fails much more often than the
>> first, in general.
>>
>> MachZero:~/Desktop/lamtest jake$ mpirun C a.out
>> rank 0 out of 8 instances.
>> rank 6 out of 8 instances.
>> rank 3 out of 8 instances.
>> rank 2 out of 8 instances.
>> rank 1 out of 8 instances.
>> rank 4 out of 8 instances.
>> rank 5 out of 8 instances.
>> rank 7 out of 8 instances.
>> CPU 0 recv2: 9078 != 9077
>> CPU 1 recv2: 12142 != 12141
>> CPU 1 recv2: 12533 != 12532
>> CPU 0 recv2: 17126 != 17125
>> CPU 0 recv2: 18869 != 18868
>> CPU 0 recv2: 34217 != 34216
>> CPU 0 recv2: 49035 != 49034
>> CPU 0 recv2: 60389 != 60388
>> CPU 0 recv2: 69068 != 69067
>> CPU 1 recv2: 71159 != 71158
>> CPU 0 recv2: 89829 != 89828
>> CPU 0 recv2: 91318 != 91317
>> CPU 0 recv2: 124826 != 124825
>> CPU 0 recv2: 130880 != 130879
>> CPU 0 recv2: 131121 != 131120
>> CPU 1 recv2: 137015 != 137014
>> CPU 0 recv2: 140855 != 140854
>> CPU 3 recv2: 147514 != 147513
>> CPU 0 recv2: 152036 != 152035
>> CPU 0 recv2: 187224 != 187223
>> CPU 0 recv2: 190873 != 190872
>> CPU 0 recv2: 191452 != 191451
>> CPU 3 recv2: 198360 != 198359
>> CPU 1 recv2: 200483 != 200482
>> CPU 1 recv2: 201806 != 201805
>> CPU 1 recv2: 224145 != 224144
>> CPU 1 recv2: 230078 != 230077
>> CPU 3 recv2: 232821 != 232820
>> CPU 1 recv2: 243280 != 243279
>> CPU 0 recv2: 246484 != 246483
>> CPU 0 recv2: 246749 != 246748
>> CPU 3 recv2: 248972 != 248971
>> CPU 1 recv2: 250754 != 250753
>> CPU 0 recv2: 268311 != 268310
>> CPU 0 recv2: 279668 != 279667
>> CPU 0 recv2: 285077 != 285076
>> CPU 3 recv2: 291615 != 291614
>> CPU 0 recv2: 292216 != 292215
>> CPU 0 recv2: 309108 != 309107
>> CPU 0 recv2: 311748 != 311747
>> CPU 0 recv2: 312668 != 312667
>> CPU 1 recv2: 312912 != 312911
>> CPU 0 recv2: 321491 != 321490
>> CPU 1 recv2: 328702 != 328701
>> CPU 1 recv2: 330495 != 330494
>> CPU 0 recv2: 332820 != 332819
>> CPU 0 recv2: 337565 != 337564
>> CPU 0 recv2: 342717 != 342716
>> CPU 0 recv2: 344977 != 344976
>> CPU 1 recv2: 352068 != 352067
>> CPU 0 recv2: 365009 != 365008
>> CPU 1 recv2: 376899 != 376898
>> CPU 1 recv2: 377322 != 377321
>> CPU 0 recv2: 380529 != 380528
>> CPU 0 recv2: 392425 != 392424
>> CPU 0 recv2: 418243 != 418242
>> CPU 0 recv2: 419348 != 419347
>> CPU 1 recv2: 421576 != 421575
>> CPU 3 recv2: 422983 != 422982
>> CPU 0 recv2: 431440 != 431439
>> CPU 0 recv2: 432557 != 432556
>> CPU 0 recv2: 448427 != 448426
>> CPU 0 recv2: 456880 != 456879
>> CPU 3 recv2: 466409 != 466408
>> CPU 1 recv2: 470845 != 470844
>> CPU 0 recv2: 485413 != 485412
>> CPU 1 recv2: 494849 != 494848
>>
>>
>> -----Original Message-----
>> From: Michael Gauckler [mailto:michael.gauckler_at_[hidden]]
>> Sent: Monday, January 03, 2005 5:48 PM
>> To: General LAM/MPI mailing list
>> Subject: Re: LAM: data corruption
>>
>> Hi Jake,
>>
>> I cannot confirm your findings: I tried your code with my setup and the
>> program terminated without detecting any faults. Maybe you should supply
>> more details about your setup to help locate the source of the problem.
>>
>> My configuration is given below -- I hope this helps.
>>
>> Cheers,
>> Michael
>> ___
>>
>> Hardware: IBM X31 laptop, 1.5 GHz Intel Pentium M
>> OS: Ubuntu Linux/Debian
>> LAM: Installed from Debian package, 7.0.6, i386-pc-linux-gnu
>>
>> I added this line to have at least some output:
>>
>> cout << "rank " << rank << " out of " << numtasks << " instances." <<
>> endl;
>>
>> mig_at_ubuntu:~/lamtest $ mpic++ test.cpp
>>
>> mig_at_ubuntu:~/lamtest $ cat hostfile
>> localhost cpu=8
>>
>> mig_at_ubuntu:~/lamtest $ mpirun -c 8 a.out
>> rank 0 out of 8 instances.
>> rank 2 out of 8 instances.
>> rank 1 out of 8 instances.
>> rank 4 out of 8 instances.
>> rank 3 out of 8 instances.
>> rank 7 out of 8 instances.
>> rank 5 out of 8 instances.
>> rank 6 out of 8 instances.
>> mig_at_ubuntu:~/lamtest $
>>
>> mig_at_ubuntu:~/lamtest $ uname -a
>> Linux ubuntu 2.6.8.1 #1 Mon Nov 29 16:56:41 CET 2004 i686 GNU/Linux
>>
>> On Monday, 03.01.2005, at 12:06 -0500, Jacob Vos wrote:
>>> Hi,
>>>
>>> Below is a small program demonstrating some spurious message data being
>>> received. For convenience of development before I moved to the cluster,
>>> I was using a dual 2.5 GHz G5 with a 'lamnode_file' set to 'cpu=8'. I
>>> noticed that, at random, the second receive would contain the same data
>>> as the first receive. I can't reproduce the anomaly with 'cpu=2'.
>>>
>>> I thought my logic might be flawed, but I could not find the source of
>>> any error on my part. So I wrote this small test program and was able
>>> to reproduce the anomaly.
>>>
>>> Basically, about 0.005% to 0.01% of the send/receives are corrupt. That
>>> isn't obvious in this test code, because the value of the last send
>>> happens to be n-1. However, in the original code in which I found the
>>> anomaly, the previous set of sends was unrelated, and the second
>>> receive, when it was corrupt, always had the same value as the first.
>>>
>>> Please confirm that this is indeed a bug, point out my poor use of
>>> asynchronous communication, or tell me that using 'cpu=8' is flawed.
>>>
>>> A typical output would be:
>>>
>>> CPU 4 recv2: 2984 != 2983
>>> CPU 5 recv2: 71459 != 71458
>>> CPU 7 recv2: 122923 != 122922
>>> CPU 6 recv2: 156124 != 156123
>>> CPU 3 recv2: 185705 != 185704
>>> CPU 0 recv2: 350950 != 350949
>>> CPU 0 recv2: 356951 != 356950
>>> CPU 4 recv2: 449649 != 449648
>>>
>>> Thanks all,
>>>
>>> Jake
>>>
>>> ---------------------------------------------------------
>>>
>>> #include "mpi.h"
>>> #include <iostream>
>>> using namespace std;
>>>
>>> int main(int argc, char *argv[])
>>> {
>>>   int numtasks, rank;
>>>
>>>   MPI_Init(&argc, &argv);
>>>   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>
>>>   int prev = rank - 1;
>>>   int next = rank + 1;
>>>   if (rank == 0) prev = numtasks - 1;
>>>   if (rank == (numtasks - 1)) next = 0;
>>>
>>>   for (int n = 0; n < 1000000; n++)
>>>   {
>>>     int send1 = n;
>>>     int send2 = n + 1;
>>>
>>>     int recv1, recv2;
>>>     MPI_Request reqs[4];
>>>     MPI_Status stats[4];
>>>     MPI_Irecv(&recv1, 1, MPI_INT, prev, 1, MPI_COMM_WORLD, &reqs[0]);
>>>     MPI_Irecv(&recv2, 1, MPI_INT, prev, 2, MPI_COMM_WORLD, &reqs[1]);
>>>     MPI_Isend(&send1, 1, MPI_INT, next, 1, MPI_COMM_WORLD, &reqs[2]);
>>>     MPI_Isend(&send2, 1, MPI_INT, next, 2, MPI_COMM_WORLD, &reqs[3]);
>>>     MPI_Waitall(4, reqs, stats);
>>>
>>>     if (send1 != recv1) cout << "CPU " << rank << " recv1: " << send1
>>>                              << " != " << recv1 << endl;
>>>     if (send2 != recv2) cout << "CPU " << rank << " recv2: " << send2
>>>                              << " != " << recv2 << endl;
>>>   }
>>>
>>>   MPI_Finalize();
>>> }
>>>
>>>
>>>
>>> _______________________________________________
>>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>>
>>
>> _______________________________________________
>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>
>
> --
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/