
LAM/MPI General User's Mailing List Archives


From: Jacob Vos (jacob.vos_at_[hidden])
Date: 2005-01-04 14:03:53


Jeff,

Thank you for confirming that bug and for the temporary workaround.
I can't afford any errors, no matter how infrequent they are. All of my
test runs so far have had no failures.

-Jake

-----Original Message-----
From: Jeff Squyres [mailto:jsquyres_at_[hidden]]
Sent: Monday, January 03, 2005 10:37 PM
To: General LAM/MPI mailing list
Cc: michael_at_[hidden]
Subject: Re: LAM: data corruption

I can unfortunately confirm that this is a bug -- I was able to
replicate the problem with the usysv RPI on a dual-processor G5 using
the SVN trunk (which is effectively the same as the latest 7.1.2 beta).
I'm testing the other RPIs right now, but I'm *guessing* that only
usysv is going to be affected.

If this is the case, a temporary workaround is simply to use another
RPI, such as sysv:

        mpirun -ssi rpi sysv ....
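
For example, with the test program quoted below, that would be something
like (assuming the same command line Jake used):

        mpirun -ssi rpi sysv C a.out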

To clarify: although I haven't done any analysis yet, this looks and
feels like a race condition in the usysv RPI. usysv is one of LAM's two
shared-memory transports; it uses spin locks to guard access to the
shared-memory segments. sysv is the other shared-memory transport; it
uses SYSV semaphores for locking. So if I had to guess, I'd say there's
some kind of race condition in LAM's spin-lock code that shows up under
high load and/or on multi-CPU machines (again, I haven't done any
analysis yet -- this is all gut feeling at this point).
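
To make that distinction concrete, here is a minimal sketch of the two
locking styles (this is *not* LAM's code -- all of the names and
structure below are made up for illustration, and in LAM the lock
actually lives inside the shared segment itself):

// Hypothetical sketch only: contrasts the two locking styles described
// above. None of this is LAM source code.
#include <atomic>
#include <cassert>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

// Linux leaves it to the caller to define this union for semctl().
union semun { int val; struct semid_ds *buf; unsigned short *array; };

// usysv style: a user-space spin lock. A waiter busy-loops on a flag
// until the holder clears it (in LAM the flag sits in the shared segment).
static std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;
static void spin_lock()   { while (lock_flag.test_and_set(std::memory_order_acquire)) { /* spin */ } }
static void spin_unlock() { lock_flag.clear(std::memory_order_release); }

// sysv style: a SYSV semaphore. A waiter blocks in the kernel instead of
// spinning, so there is no hand-rolled lock protocol to race on.
static void sem_lock(int semid)   { struct sembuf op = { 0, -1, 0 }; semop(semid, &op, 1); }
static void sem_unlock(int semid) { struct sembuf op = { 0, +1, 0 }; semop(semid, &op, 1); }

int main()
{
    // One private semaphore, initialized to 1 (i.e. "unlocked").
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    assert(semid != -1);
    union semun arg;
    arg.val = 1;
    semctl(semid, 0, SETVAL, arg);

    spin_lock();      /* ...touch shared data... */   spin_unlock();
    sem_lock(semid);  /* ...touch shared data... */   sem_unlock(semid);

    semctl(semid, 0, IPC_RMID);   // remove the semaphore set
    return 0;
}

The point is just that the sysv RPI leans on the kernel for mutual
exclusion, while usysv depends entirely on LAM's own spin-lock code --
which is where I suspect the race is.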

On Jan 3, 2005, at 6:57 PM, Jacob Vos wrote:

> Here is an example of a run with a ton of errors. It is completely
> random as to which "processor" fails; I've seen roughly the same
> frequency for all of them overall, though today rank 0 has definitely
> failed the most often. Also, the second receive fails, in general,
> much more often than the first.
>
> MachZero:~/Desktop/lamtest jake$ mpirun C a.out
> rank 0 out of 8 instances.
> rank 6 out of 8 instances.
> rank 3 out of 8 instances.
> rank 2 out of 8 instances.
> rank 1 out of 8 instances.
> rank 4 out of 8 instances.
> rank 5 out of 8 instances.
> rank 7 out of 8 instances.
> CPU 0 recv2: 9078 != 9077
> CPU 1 recv2: 12142 != 12141
> CPU 1 recv2: 12533 != 12532
> CPU 0 recv2: 17126 != 17125
> CPU 0 recv2: 18869 != 18868
> CPU 0 recv2: 34217 != 34216
> CPU 0 recv2: 49035 != 49034
> CPU 0 recv2: 60389 != 60388
> CPU 0 recv2: 69068 != 69067
> CPU 1 recv2: 71159 != 71158
> CPU 0 recv2: 89829 != 89828
> CPU 0 recv2: 91318 != 91317
> CPU 0 recv2: 124826 != 124825
> CPU 0 recv2: 130880 != 130879
> CPU 0 recv2: 131121 != 131120
> CPU 1 recv2: 137015 != 137014
> CPU 0 recv2: 140855 != 140854
> CPU 3 recv2: 147514 != 147513
> CPU 0 recv2: 152036 != 152035
> CPU 0 recv2: 187224 != 187223
> CPU 0 recv2: 190873 != 190872
> CPU 0 recv2: 191452 != 191451
> CPU 3 recv2: 198360 != 198359
> CPU 1 recv2: 200483 != 200482
> CPU 1 recv2: 201806 != 201805
> CPU 1 recv2: 224145 != 224144
> CPU 1 recv2: 230078 != 230077
> CPU 3 recv2: 232821 != 232820
> CPU 1 recv2: 243280 != 243279
> CPU 0 recv2: 246484 != 246483
> CPU 0 recv2: 246749 != 246748
> CPU 3 recv2: 248972 != 248971
> CPU 1 recv2: 250754 != 250753
> CPU 0 recv2: 268311 != 268310
> CPU 0 recv2: 279668 != 279667
> CPU 0 recv2: 285077 != 285076
> CPU 3 recv2: 291615 != 291614
> CPU 0 recv2: 292216 != 292215
> CPU 0 recv2: 309108 != 309107
> CPU 0 recv2: 311748 != 311747
> CPU 0 recv2: 312668 != 312667
> CPU 1 recv2: 312912 != 312911
> CPU 0 recv2: 321491 != 321490
> CPU 1 recv2: 328702 != 328701
> CPU 1 recv2: 330495 != 330494
> CPU 0 recv2: 332820 != 332819
> CPU 0 recv2: 337565 != 337564
> CPU 0 recv2: 342717 != 342716
> CPU 0 recv2: 344977 != 344976
> CPU 1 recv2: 352068 != 352067
> CPU 0 recv2: 365009 != 365008
> CPU 1 recv2: 376899 != 376898
> CPU 1 recv2: 377322 != 377321
> CPU 0 recv2: 380529 != 380528
> CPU 0 recv2: 392425 != 392424
> CPU 0 recv2: 418243 != 418242
> CPU 0 recv2: 419348 != 419347
> CPU 1 recv2: 421576 != 421575
> CPU 3 recv2: 422983 != 422982
> CPU 0 recv2: 431440 != 431439
> CPU 0 recv2: 432557 != 432556
> CPU 0 recv2: 448427 != 448426
> CPU 0 recv2: 456880 != 456879
> CPU 3 recv2: 466409 != 466408
> CPU 1 recv2: 470845 != 470844
> CPU 0 recv2: 485413 != 485412
> CPU 1 recv2: 494849 != 494848
>
>
> -----Original Message-----
> From: Michael Gauckler [mailto:michael.gauckler_at_[hidden]]
> Sent: Monday, January 03, 2005 5:48 PM
> To: General LAM/MPI mailing list
> Subject: Re: LAM: data corruption
>
> Hi Jake,
>
> I cannot confirm your findings: I tried your code with my setup and
> the program terminated without any fault detected. Maybe you should
> supply more details about your setup to locate the source of the
> problem.
>
> My configuration is given below -- I hope it helps.
>
> Cheers,
> Michael
> ___
>
> Hardware: IBM X31 laptop, 1.5 GHz Intel Pentium M
> OS: Ubuntulinux/Debian
> LAM: Installed from Debian package, 7.0.6, i386-pc-linux-gnu
>
> I added this line to have at least some output:
>
> cout << "rank " << rank << " out of " << numtasks << " instances." << endl;
>
> mig_at_ubuntu:~/lamtest $ mpic++ test.cpp
>
> mig_at_ubuntu:~/lamtest $ cat hostfile
> localhost cpu=8
>
> mig_at_ubuntu:~/lamtest $ mpirun -c 8 a.out
> rank 0 out of 8 instances.
> rank 2 out of 8 instances.
> rank 1 out of 8 instances.
> rank 4 out of 8 instances.
> rank 3 out of 8 instances.
> rank 7 out of 8 instances.
> rank 5 out of 8 instances.
> rank 6 out of 8 instances.
> mig_at_ubuntu:~/lamtest $
>
> mig_at_ubuntu:~/lamtest $ uname -a
> Linux ubuntu 2.6.8.1 #1 Mon Nov 29 16:56:41 CET 2004 i686 GNU/Linux
>
> On Monday, 2005-01-03, at 12:06 -0500, Jacob Vos wrote:
>> Hi,
>>
>> Below is a small program demonstrating some spurious message data
>> being received. For convenience of development before moving to the
>> cluster, I was using a dual 2.5 GHz G5 with a 'lamnode_file' set at
>> 'cpu=8'. I noticed that the second receive would randomly contain the
>> same data as the first receive. I can't reproduce the anomaly with
>> 'cpu=2'.
>>
>> I thought my logic might have been flawed, but I could not find the
>> source of my error. So I wrote this small test program and was able
>> to reproduce the anomaly.
>>
>> Basically, about 0.005% to 0.01% of the send/receives are corrupt.
>> It's not obvious in this test code, because the value of the last
>> send happens to be n-1. However, in my original code, where I first
>> found the anomaly, the previous set of sends was unrelated. The
>> second receive, when it was corrupt, always had the same value as
>> the first.
>>
>> Please confirm that this is indeed a bug, inform me of my poor use of
>> asynchronous communication, or indicate that using 'cpu=8' is flawed.
>>
>> A typical output would be:
>>
>> CPU 4 recv2: 2984 != 2983
>> CPU 5 recv2: 71459 != 71458
>> CPU 7 recv2: 122923 != 122922
>> CPU 6 recv2: 156124 != 156123
>> CPU 3 recv2: 185705 != 185704
>> CPU 0 recv2: 350950 != 350949
>> CPU 0 recv2: 356951 != 356950
>> CPU 4 recv2: 449649 != 449648
>>
>> Thanks all,
>>
>> Jake
>>
>> ---------------------------------------------------------
>>
>> #include "mpi.h"
>> #include <iostream>
>> using namespace std;
>>
>> int main(int argc, char *argv[])
>> {
>> int numtasks, rank;
>>
>> MPI_Init(&argc,&argv);
>> MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
>> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>> int prev = rank-1;
>> int next = rank+1;
>> if (rank == 0) prev = numtasks - 1;
>> if (rank == (numtasks - 1)) next = 0;
>>
>> for(int n=0; n < 1000000; n++)
>> {
>> int send1 = n;
>> int send2 = n+1;
>>
>> int recv1, recv2;
>> MPI_Request reqs[4];
>> MPI_Status stats[4];
>> MPI_Irecv(&recv1, 1, MPI_INT, prev, 1, MPI_COMM_WORLD, &reqs[0]);
>> MPI_Irecv(&recv2, 1, MPI_INT, prev, 2, MPI_COMM_WORLD, &reqs[1]);
>> MPI_Isend(&send1, 1, MPI_INT, next, 1, MPI_COMM_WORLD, &reqs[2]);
>> MPI_Isend(&send2, 1, MPI_INT, next, 2, MPI_COMM_WORLD, &reqs[3]);
>> MPI_Waitall(4, reqs, stats);
>>
>> if(send1 != recv1) cout << "CPU " << rank << " recv1: " << send1
> <<
>> " != " << recv1 << endl;
>> if(send2 != recv2) cout << "CPU " << rank << " recv2: " << send2
> <<
>> " != " << recv2 << endl;
>> }
>>
>> MPI_Finalize();
>> }
>>

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/