
LAM/MPI General User's Mailing List Archives


From: Jacob Vos (jacob.vos_at_[hidden])
Date: 2005-01-03 18:57:02


Here is an example of a run with a ton of errors. Which "processor" fails is
completely random; over many runs I've seen roughly the same failure frequency
for all of them, although today rank zero has definitely failed the most often.
Also, the second receive fails much more often, in general, than the first.

MachZero:~/Desktop/lamtest jake$ mpirun C a.out
rank 0 out of 8 instances.
rank 6 out of 8 instances.
rank 3 out of 8 instances.
rank 2 out of 8 instances.
rank 1 out of 8 instances.
rank 4 out of 8 instances.
rank 5 out of 8 instances.
rank 7 out of 8 instances.
CPU 0 recv2: 9078 != 9077
CPU 1 recv2: 12142 != 12141
CPU 1 recv2: 12533 != 12532
CPU 0 recv2: 17126 != 17125
CPU 0 recv2: 18869 != 18868
CPU 0 recv2: 34217 != 34216
CPU 0 recv2: 49035 != 49034
CPU 0 recv2: 60389 != 60388
CPU 0 recv2: 69068 != 69067
CPU 1 recv2: 71159 != 71158
CPU 0 recv2: 89829 != 89828
CPU 0 recv2: 91318 != 91317
CPU 0 recv2: 124826 != 124825
CPU 0 recv2: 130880 != 130879
CPU 0 recv2: 131121 != 131120
CPU 1 recv2: 137015 != 137014
CPU 0 recv2: 140855 != 140854
CPU 3 recv2: 147514 != 147513
CPU 0 recv2: 152036 != 152035
CPU 0 recv2: 187224 != 187223
CPU 0 recv2: 190873 != 190872
CPU 0 recv2: 191452 != 191451
CPU 3 recv2: 198360 != 198359
CPU 1 recv2: 200483 != 200482
CPU 1 recv2: 201806 != 201805
CPU 1 recv2: 224145 != 224144
CPU 1 recv2: 230078 != 230077
CPU 3 recv2: 232821 != 232820
CPU 1 recv2: 243280 != 243279
CPU 0 recv2: 246484 != 246483
CPU 0 recv2: 246749 != 246748
CPU 3 recv2: 248972 != 248971
CPU 1 recv2: 250754 != 250753
CPU 0 recv2: 268311 != 268310
CPU 0 recv2: 279668 != 279667
CPU 0 recv2: 285077 != 285076
CPU 3 recv2: 291615 != 291614
CPU 0 recv2: 292216 != 292215
CPU 0 recv2: 309108 != 309107
CPU 0 recv2: 311748 != 311747
CPU 0 recv2: 312668 != 312667
CPU 1 recv2: 312912 != 312911
CPU 0 recv2: 321491 != 321490
CPU 1 recv2: 328702 != 328701
CPU 1 recv2: 330495 != 330494
CPU 0 recv2: 332820 != 332819
CPU 0 recv2: 337565 != 337564
CPU 0 recv2: 342717 != 342716
CPU 0 recv2: 344977 != 344976
CPU 1 recv2: 352068 != 352067
CPU 0 recv2: 365009 != 365008
CPU 1 recv2: 376899 != 376898
CPU 1 recv2: 377322 != 377321
CPU 0 recv2: 380529 != 380528
CPU 0 recv2: 392425 != 392424
CPU 0 recv2: 418243 != 418242
CPU 0 recv2: 419348 != 419347
CPU 1 recv2: 421576 != 421575
CPU 3 recv2: 422983 != 422982
CPU 0 recv2: 431440 != 431439
CPU 0 recv2: 432557 != 432556
CPU 0 recv2: 448427 != 448426
CPU 0 recv2: 456880 != 456879
CPU 3 recv2: 466409 != 466408
CPU 1 recv2: 470845 != 470844
CPU 0 recv2: 485413 != 485412
CPU 1 recv2: 494849 != 494848

-----Original Message-----
From: Michael Gauckler [mailto:michael.gauckler_at_[hidden]]
Sent: Monday, January 03, 2005 5:48 PM
To: General LAM/MPI mailing list
Subject: Re: LAM: data corruption

Hi Jake,

I cannot reproduce your findings: I ran your code on my setup and the
program terminated without detecting any fault. Perhaps you could supply
more details about your setup to help locate the source of the problem.

My configuration is given below - I hope this helps.

Cheers,
 Michael
___

Hardware: IBM X31 Laptop, 1.5 GHz Intel Pentium M
OS: Ubuntu Linux (Debian-based)
LAM: Installed from Debian package, 7.0.6, i386-pc-linux-gnu

I added this line to have at least some output:

cout << "rank " << rank << " out of " << numtasks << " instances." << endl;

mig_at_ubuntu:~/lamtest $ mpic++ test.cpp

mig_at_ubuntu:~/lamtest $ cat hostfile
localhost cpu=8

mig_at_ubuntu:~/lamtest $ mpirun -c 8 a.out
rank 0 out of 8 instances.
rank 2 out of 8 instances.
rank 1 out of 8 instances.
rank 4 out of 8 instances.
rank 3 out of 8 instances.
rank 7 out of 8 instances.
rank 5 out of 8 instances.
rank 6 out of 8 instances.
mig_at_ubuntu:~/lamtest $

mig_at_ubuntu:~/lamtest $ uname -a
Linux ubuntu 2.6.8.1 #1 Mon Nov 29 16:56:41 CET 2004 i686 GNU/Linux

On Monday, 03.01.2005, at 12:06 -0500, Jacob Vos wrote:
> Hi,
>
> Below is a small program demonstrating some spurious message data being
> received. For convenience of development before I moved to the cluster,
> I was using a dual 2.5 GHz G5 with a 'lamnode_file' set at 'cpu=8'. I
> noticed that randomly the second receive would contain the same data as
> the first receive. I can't reproduce the anomaly with 'cpu=2'.
>
> I thought my logic may have been flawed. I however could not find the
> source of my error. So, I wrote this small test program and I was able
> to reproduce the anomaly.
>
> Basically, about .005% to .01% of the send/receives are corrupt. It's
> not clear in this test code, because the value of the last send happens
> to be n-1. However, in my original code that I found the anomaly in,
> the previous set of sends were not related. The second receive call,
> when it was corrupt, always had the same value as the first.
>
> Please confirm that this is indeed a bug, inform me of my poor use of
> asynchronous communication, or indicate that using 'cpu=8' is flawed.
>
> A typical output would be:
>
> CPU 4 recv2: 2984 != 2983
> CPU 5 recv2: 71459 != 71458
> CPU 7 recv2: 122923 != 122922
> CPU 6 recv2: 156124 != 156123
> CPU 3 recv2: 185705 != 185704
> CPU 0 recv2: 350950 != 350949
> CPU 0 recv2: 356951 != 356950
> CPU 4 recv2: 449649 != 449648
>
> Thanks all,
>
> Jake
>
> ---------------------------------------------------------
>
> #include "mpi.h"
> #include <iostream>
> using namespace std;
>
> int main(int argc, char *argv[])
> {
>     int numtasks, rank;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     // Ring neighbours, with wrap-around at the ends.
>     int prev = rank - 1;
>     int next = rank + 1;
>     if (rank == 0) prev = numtasks - 1;
>     if (rank == (numtasks - 1)) next = 0;
>
>     for (int n = 0; n < 1000000; n++)
>     {
>         int send1 = n;
>         int send2 = n + 1;
>
>         int recv1, recv2;
>         MPI_Request reqs[4];
>         MPI_Status stats[4];
>         MPI_Irecv(&recv1, 1, MPI_INT, prev, 1, MPI_COMM_WORLD, &reqs[0]);
>         MPI_Irecv(&recv2, 1, MPI_INT, prev, 2, MPI_COMM_WORLD, &reqs[1]);
>         MPI_Isend(&send1, 1, MPI_INT, next, 1, MPI_COMM_WORLD, &reqs[2]);
>         MPI_Isend(&send2, 1, MPI_INT, next, 2, MPI_COMM_WORLD, &reqs[3]);
>         MPI_Waitall(4, reqs, stats);
>
>         if (send1 != recv1) cout << "CPU " << rank << " recv1: " << send1
>             << " != " << recv1 << endl;
>         if (send2 != recv2) cout << "CPU " << rank << " recv2: " << send2
>             << " != " << recv2 << endl;
>     }
>
>     MPI_Finalize();
> }
>
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
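[Editor's note: one way to isolate whether the nonblocking calls themselves are at fault is to replace the Irecv/Isend/Waitall group with blocking MPI_Sendrecv calls, which pair each send with its matching receive in a single call. This is a sketch, not code from the original thread; the message flow, tags, and self-check are kept the same as in the test program above. If the corruption disappears under this variant, the problem more likely lies in LAM's handling of concurrent nonblocking requests than in the test program's logic.]

```cpp
#include "mpi.h"
#include <iostream>
using namespace std;

int main(int argc, char *argv[])
{
    int numtasks, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Same ring topology as the original test program.
    int prev = (rank + numtasks - 1) % numtasks;
    int next = (rank + 1) % numtasks;

    for (int n = 0; n < 1000000; n++)
    {
        int send1 = n, send2 = n + 1;
        int recv1, recv2;

        // MPI_Sendrecv blocks until both the send and the receive
        // complete, so no request objects or MPI_Waitall are needed.
        MPI_Sendrecv(&send1, 1, MPI_INT, next, 1,
                     &recv1, 1, MPI_INT, prev, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&send2, 1, MPI_INT, next, 2,
                     &recv2, 1, MPI_INT, prev, 2,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        if (send1 != recv1) cout << "CPU " << rank << " recv1: " << send1
            << " != " << recv1 << endl;
        if (send2 != recv2) cout << "CPU " << rank << " recv2: " << send2
            << " != " << recv2 << endl;
    }

    MPI_Finalize();
}
```

[Run the same way as the original, e.g. `mpic++ test2.cpp && mpirun C a.out`; no output means no mismatches were detected.]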
