Hi Michael,
Thanks for your help!
This is my current configuration:
Hardware: Dual 2.5 GHz PowerPC G5, 4 GB DDR SDRAM
OS: Mac OS X Version 10.3.6, using ssh from a Windows XP Box
LAM: 7.1.1
./configure --with-rsh=ssh --with-memory-manager=none --prefix=/user/local/lam
with FC=xlf
I have IBM's XLF compiler installed, so supposedly I need to disable the
memory manager until 7.1.2 comes out. The -u
_lam_darwin_malloc_linker_hack flag is broken and supposedly isn't
needed anyway unless you have Myrinet or InfiniBand.
The failed "processors" have almost all been 0, 4, and 7 today and I
also added your line of output for clarity.
Below there is a run and some configuration info...
Thanks,
Jake
__
MachZero:~/Desktop/lamtest jake$ lamboot ../../lamnode_file
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
MachZero:~/Desktop/lamtest jake$ more ../../lamnode_file
localhost cpu=8
MachZero:~/Desktop/lamtest jake$ mpic++ test.cpp
MachZero:~/Desktop/lamtest jake$ mpirun C a.out
rank 0 out of 8 instances.
rank 1 out of 8 instances.
rank 4 out of 8 instances.
rank 5 out of 8 instances.
rank 3 out of 8 instances.
rank 2 out of 8 instances.
rank 6 out of 8 instances.
rank 7 out of 8 instances.
CPU 0 recv2: 40232 != 40231
CPU 0 recv2: 42297 != 42296
CPU 0 recv2: 126062 != 126061
CPU 0 recv2: 140284 != 140283
CPU 0 recv2: 172515 != 172514
CPU 0 recv2: 180804 != 180803
CPU 0 recv2: 251432 != 251431
CPU 0 recv2: 257217 != 257216
CPU 0 recv2: 286518 != 286517
CPU 0 recv2: 325939 != 325938
CPU 0 recv2: 335786 != 335785
CPU 7 recv2: 336074 != 336075
CPU 7 recv2: 336075 != 336074
CPU 4 recv2: 449695 != 449694
MachZero:~/Desktop/lamtest jake$ uname -a
Darwin MachZero.local 7.6.0 Darwin Kernel Version 7.6.0: Sun Oct 10
12:05:27 PDT 2004; root:xnu/xnu-517.9.4.obj~1/RELEASE_PPC Power
Macintosh powerpc
MachZero:~/Desktop/lamtest jake$ g++ -v
Reading specs from /usr/libexec/gcc/darwin/ppc/3.3/specs
Thread model: posix
gcc version 3.3 20030304 (Apple Computer, Inc. build 1640)
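If it helps narrow things down, here is a sketch of how the receive check
could be extended (hypothetical code, not what produced the output above):
pre-fill the receive buffers with sentinel values and print the MPI_Status
entries filled in by MPI_Waitall, so a bad recv2 can be traced to the
source/tag it was actually matched against, and compared with recv1.

// Hypothetical diagnostic variant of the ring test (sketch only):
// sentinel-initialize the receive buffers and report the MPI_Status
// fields from MPI_Waitall whenever recv2 is wrong.
#include "mpi.h"
#include <iostream>
using namespace std;

int main(int argc, char *argv[])
{
    int numtasks, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int prev = (rank == 0) ? numtasks - 1 : rank - 1;
    int next = (rank == numtasks - 1) ? 0 : rank + 1;

    for (int n = 0; n < 1000000; n++)
    {
        int send1 = n, send2 = n + 1;
        int recv1 = -1, recv2 = -1;   // sentinels: -1 is never a sent value
        MPI_Request reqs[4];
        MPI_Status stats[4];

        MPI_Irecv(&recv1, 1, MPI_INT, prev, 1, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&recv2, 1, MPI_INT, prev, 2, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(&send1, 1, MPI_INT, next, 1, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(&send2, 1, MPI_INT, next, 2, MPI_COMM_WORLD, &reqs[3]);
        MPI_Waitall(4, reqs, stats);

        if (send2 != recv2)
        {
            // stats[1] belongs to the tag-2 receive; MPI_SOURCE and MPI_TAG
            // show which message it was actually matched with.
            cout << "CPU " << rank << " recv2: " << send2 << " != " << recv2
                 << " (recv1 was " << recv1
                 << ", matched source " << stats[1].MPI_SOURCE
                 << ", tag " << stats[1].MPI_TAG << ")" << endl;
        }
    }

    MPI_Finalize();
    return 0;
}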
-----Original Message-----
From: Michael Gauckler [mailto:michael.gauckler_at_[hidden]]
Sent: Monday, January 03, 2005 5:48 PM
To: General LAM/MPI mailing list
Subject: Re: LAM: data corruption
Hi Jake,
I cannot confirm your findings: I tried your code with my setup and the
program terminated without any fault detected. Maybe you should supply
more details about your setup to locate the source of the problem.
My configuration is given below - I hope it helps.
Cheers,
Michael
___
Hardware: IBM X31 Laptop, 1.5 GHz Intel Pentium M
OS: Ubuntu Linux/Debian
LAM: Installed from Debian package, 7.0.6, i386-pc-linux-gnu
I added this line to have at least some output:
cout << "rank " << rank << " out of " << numtasks << " instances." <<
endl;
mig_at_ubuntu:~/lamtest $ mpic++ test.cpp
mig_at_ubuntu:~/lamtest $ cat hostfile
localhost cpu=8
mig_at_ubuntu:~/lamtest $ mpirun -c 8 a.out
rank 0 out of 8 instances.
rank 2 out of 8 instances.
rank 1 out of 8 instances.
rank 4 out of 8 instances.
rank 3 out of 8 instances.
rank 7 out of 8 instances.
rank 5 out of 8 instances.
rank 6 out of 8 instances.
mig_at_ubuntu:~/lamtest $
mig_at_ubuntu:~/lamtest $ uname -a
Linux ubuntu 2.6.8.1 #1 Mon Nov 29 16:56:41 CET 2004 i686 GNU/Linux
On Monday, 03.01.2005, at 12:06 -0500, Jacob Vos wrote:
> Hi,
>
> Below is a small program demonstrating some spurious message data being
> received. For convenience of development before I moved to the cluster,
> I was using a Dual 2.5 GHz G5 with a 'lamnode_file' set at 'cpu=8'. I
> noticed that randomly the second receive would contain the same data as
> the first receive. I can't reproduce the anomaly with 'cpu=2'.
>
> I thought my logic may have been flawed; however, I could not find the
> source of my error. So I wrote this small test program and was able
> to reproduce the anomaly.
>
> Basically, about .005% to .01% of the sends/receives are corrupt. It's
> not clear in this test code, because the value of the last send happens
> to be n-1. However, in my original code, where I found the anomaly, the
> previous set of sends was unrelated. The second receive call, when
> it was corrupt, always had the same value as the first.
>
> Please confirm that this is indeed a bug, inform me of my poor use of
> asynchronous communication, or indicate that using 'cpu=8' is flawed.
>
> A typical output would be:
>
> CPU 4 recv2: 2984 != 2983
> CPU 5 recv2: 71459 != 71458
> CPU 7 recv2: 122923 != 122922
> CPU 6 recv2: 156124 != 156123
> CPU 3 recv2: 185705 != 185704
> CPU 0 recv2: 350950 != 350949
> CPU 0 recv2: 356951 != 356950
> CPU 4 recv2: 449649 != 449648
>
> Thanks all,
>
> Jake
>
> ---------------------------------------------------------
>
> #include "mpi.h"
> #include <iostream>
> using namespace std;
>
> int main(int argc, char *argv[])
> {
>     int numtasks, rank;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     int prev = rank - 1;
>     int next = rank + 1;
>     if (rank == 0) prev = numtasks - 1;
>     if (rank == (numtasks - 1)) next = 0;
>
>     for (int n = 0; n < 1000000; n++)
>     {
>         int send1 = n;
>         int send2 = n + 1;
>
>         int recv1, recv2;
>         MPI_Request reqs[4];
>         MPI_Status stats[4];
>         MPI_Irecv(&recv1, 1, MPI_INT, prev, 1, MPI_COMM_WORLD, &reqs[0]);
>         MPI_Irecv(&recv2, 1, MPI_INT, prev, 2, MPI_COMM_WORLD, &reqs[1]);
>         MPI_Isend(&send1, 1, MPI_INT, next, 1, MPI_COMM_WORLD, &reqs[2]);
>         MPI_Isend(&send2, 1, MPI_INT, next, 2, MPI_COMM_WORLD, &reqs[3]);
>         MPI_Waitall(4, reqs, stats);
>
>         if (send1 != recv1) cout << "CPU " << rank << " recv1: " << send1
>                                  << " != " << recv1 << endl;
>         if (send2 != recv2) cout << "CPU " << rank << " recv2: " << send2
>                                  << " != " << recv2 << endl;
>     }
>
>     MPI_Finalize();
> }
>
>
>
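For reference, a blocking cross-check of the same ring exchange (a sketch
under the same assumptions as the test program above, not something that
was actually run here): if the mismatch disappears with MPI_Sendrecv, the
problem is specific to the non-blocking request path rather than the
exchange pattern itself.

// Sketch: the same ring exchange done with blocking MPI_Sendrecv calls.
#include "mpi.h"
#include <iostream>
using namespace std;

int main(int argc, char *argv[])
{
    int numtasks, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int prev = (rank == 0) ? numtasks - 1 : rank - 1;
    int next = (rank == numtasks - 1) ? 0 : rank + 1;

    for (int n = 0; n < 1000000; n++)
    {
        int send1 = n, send2 = n + 1;
        int recv1, recv2;
        MPI_Status st;

        // MPI_Sendrecv pairs each send with its matching receive, which
        // avoids the deadlock that plain blocking sends around a ring
        // could cause.
        MPI_Sendrecv(&send1, 1, MPI_INT, next, 1,
                     &recv1, 1, MPI_INT, prev, 1, MPI_COMM_WORLD, &st);
        MPI_Sendrecv(&send2, 1, MPI_INT, next, 2,
                     &recv2, 1, MPI_INT, prev, 2, MPI_COMM_WORLD, &st);

        if (send1 != recv1) cout << "CPU " << rank << " recv1: " << send1
                                 << " != " << recv1 << endl;
        if (send2 != recv2) cout << "CPU " << rank << " recv2: " << send2
                                 << " != " << recv2 << endl;
    }

    MPI_Finalize();
    return 0;
}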
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/