
LAM/MPI General User's Mailing List Archives


From: Camm Maguire (camm_at_[hidden])
Date: 2007-12-05 12:08:12


Greetings! We've used lam 6.x for years successfully, but now have
problems running the same application recompiled against lam 7.1.4.

1) When using the lamd rpi, certain nodes report an internal MPI error
   ("Bad address") in MPI_Allgather:

MPI_Recv: internal MPI error: Bad address (rank 3, comm 3)
Rank (12, MPI_COMM_WORLD): Call stack within LAM:
Rank (12, MPI_COMM_WORLD): - MPI_Recv()
Rank (12, MPI_COMM_WORLD): - MPI_Allgather()
Rank (12, MPI_COMM_WORLD): - main()

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread -1222748480 (LWP 23432)]
0xb72b9dee in write () from /lib/tls/libc.so.6

The call in question is:

  MPI_Allgather(w,mnr,MPI_DOUBLE,q1,mnr,MPI_DOUBLE,ccomm);

ccomm is set up as follows (np=16):

  comm=MPI_COMM_WORLD;

  MPI_Comm_rank(comm,&id);

  MPI_Comm_size(comm,&np);
  if (np!=npn)
    error("np!=npn\n");

  idr=id/ncb;                             /* row index within the process grid    */
  idc=id%ncb;                             /* column index within the process grid */

  MPI_Comm_split(comm,idr,idc,&rcomm);    /* rcomm: ranks in the same row    */
  MPI_Comm_split(comm,idc,idr,&ccomm);    /* ccomm: ranks in the same column */

I can confirm both sets of code are executed by node 15.
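
For clarity, the whole pattern boils down to something like the following
standalone sketch (the ncb and mnr values here are illustrative only; the
real application computes them elsewhere):

#include <mpi.h>
#include <stdlib.h>

int
main(int argc,char **argv) {

  int id,np,idr,idc,i,ncb=4,mnr=8;   /* illustrative: 4x4 grid, 8 doubles per rank */
  double w[8],*q1;
  MPI_Comm rcomm,ccomm;

  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD,&id);
  MPI_Comm_size(MPI_COMM_WORLD,&np);

  idr=id/ncb;                        /* row index    */
  idc=id%ncb;                        /* column index */
  MPI_Comm_split(MPI_COMM_WORLD,idr,idc,&rcomm);
  MPI_Comm_split(MPI_COMM_WORLD,idc,idr,&ccomm);

  for (i=0;i<mnr;i++)
    w[i]=id+0.1*i;

  q1=malloc((np/ncb)*mnr*sizeof(*q1));   /* one block from every rank in my column */
  MPI_Allgather(w,mnr,MPI_DOUBLE,q1,mnr,MPI_DOUBLE,ccomm);

  free(q1);
  MPI_Comm_free(&rcomm);
  MPI_Comm_free(&ccomm);
  MPI_Finalize();
  return 0;

}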

2) I had written hand-rolled versions of allreduce and bcast, which no
   longer work (random message corruption, not yet diagnosed further):

/* Hand-rolled allreduce: combines nn elements of type d (size bytes each)
   held in a across comm c, accumulating into a via the callback f. */
static __inline__ int
qdp_allreduce(void *a,int nn,MPI_Comm c,MPI_Datatype d,int size,
              void (*f)(void *,void *,int)) {

  int i,j,k,r,s;
  static MPI_Comm sc;
  static int si,sj,ss,sr;
  MPI_Request req;
  static void *b1,*b,*be;

  if (be-b1<size*nn)                  /* grow the scratch buffer b1..be if it cannot hold the message */
    r_mem(b,size*nn);

  if (sc==c) {                        /* reuse cached rank/size/log2 data for this communicator */
    i=si;
    j=sj;
    r=sr;
    s=ss;
  } else {
    MPI_Comm_rank(c,&r);
    MPI_Comm_size(c,&s);
    for (i=0,j=1;j+j<=s;i++,j+=j);    /* j = largest power of two <= s, i = log2(j) */
    si=i;
    sj=j;
    sr=r;
    ss=s;
    sc=c;
  }

  if (r>=sj) {                        /* excess ranks outside the power-of-two block */
    
    MPI_Isend(a,nn,d,r-(s-sj),s,c,&req);               /* hand local data to a partner inside the block */
    MPI_Wait(&req,MPI_STATUS_IGNORE);
    MPI_Recv(a,nn,d,r-(s-sj),s,c,MPI_STATUS_IGNORE);   /* get the fully reduced result back */

  } else {

    if (r>=sj-(s-sj)) {               /* absorb the contribution from the matching excess rank */
      MPI_Recv(b1,nn,d,r+(s-sj),s,c,MPI_STATUS_IGNORE);
      (*f)(a,b1,nn);
    }

    for (i--,j=j/2;i>=0;i--,j=j/2) {  /* pairwise exchange across each bit of the power-of-two block */

      k=r/(j+j);
      k+=k;
      k=r/j-k;                        /* k = bit i of r */
      k=k ? r-j : r+j;                /* partner = r with bit i flipped */
      
      MPI_Isend(a,nn,d,k,i,c,&req);
      MPI_Recv(b1,nn,d,k,i,c,MPI_STATUS_IGNORE);
      MPI_Wait(&req,MPI_STATUS_IGNORE);
      (*f)(a,b1,nn);
      
    }

    if (r>=sj-(s-sj)) {               /* return the result to the matching excess rank */
      MPI_Isend(a,nn,d,r+(s-sj),s,c,&req);
      MPI_Wait(&req,MPI_STATUS_IGNORE);
    }

  }

  return 0;

}
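
For completeness, the combiner accumulates into its first argument (that is
the convention the code above relies on, since a keeps being forwarded after
each (*f)(a,b1,nn) call). A typical combiner and call look roughly like this
(illustrative only, not the actual application code):

static void
sum_doubles(void *a,void *b,int n) {

  double *x=a,*y=b;
  int i;

  for (i=0;i<n;i++)
    x[i]+=y[i];

}

  /* assumes ccomm from problem 1 above and the scratch buffer already
     set up via r_mem; v[] is an illustrative local contribution */
  double v[8];
  qdp_allreduce(v,8,ccomm,MPI_DOUBLE,sizeof(double),sum_doubles);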

/* Hand-rolled linear broadcast: rank w sends nn doubles in a to every other
   rank of comm c; r is the caller's rank and s the communicator size. */
static __inline__ int
qdp_bcast_lin(double *a,int nn,MPI_Comm c,int r,int s,int w) {

  int i;
  static MPI_Request *rq1,*rq,*rqe;

  if (s-1>rqe-rq1)                    /* make sure there is room for one request per peer */
    r_mem(rq,s-1);

  if (r!=w)
    MPI_Recv(a,nn,MPI_DOUBLE,w,0,c,MPI_STATUS_IGNORE);   /* non-root: receive from the root */
  else
    for (i=1,rq=rq1;i<s;i++)                             /* root: one persistent send per peer */
      MPI_Send_init(a,nn,MPI_DOUBLE,(w+i)%s,0,c,rq++);

  if (rq>rq1) {
    MPI_Startall(rq-rq1,rq1);         /* start all the sends at once */
    for (;--rq>=rq1;)
      MPI_Request_free(rq);           /* free the requests without waiting on them */
    rq++;
  }

  return 0;

}
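
The broadcast is called with the caller's cached rank and size, roughly like
this (again illustrative):

  MPI_Comm_rank(ccomm,&r);
  MPI_Comm_size(ccomm,&s);
  qdp_bcast_lin(a,nn,ccomm,r,s,0);   /* broadcast a[0..nn-1] from rank 0 */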

Has anything changed regarding the blocking/non-blocking status of any
of these calls?

Finally, my code is split across several libraries, two of which
independently set up static communicators for parallelization -- is there
now some internal interference with such a strategy within the lam library?
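
The pattern in question is roughly the following: each library derives its
own communicator once from MPI_COMM_WORLD (via MPI_Comm_split in my case;
MPI_Comm_dup is shown here just for brevity) and keeps it in a static.
The names below are illustrative only:

static MPI_Comm libA_comm=MPI_COMM_NULL;
static MPI_Comm libB_comm=MPI_COMM_NULL;

void
libA_init(void) {
  if (libA_comm==MPI_COMM_NULL)
    MPI_Comm_dup(MPI_COMM_WORLD,&libA_comm);   /* private communicator for library A */
}

void
libB_init(void) {
  if (libB_comm==MPI_COMM_NULL)
    MPI_Comm_dup(MPI_COMM_WORLD,&libB_comm);   /* private communicator for library B */
}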

Please let me know if any further details are needed. lamtests
appears to run fine on this installation.

Thanks!

-- 
Camm Maguire			     			camm_at_[hidden]
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah