Greetings! We've used lam 6.x for years successfully, but now have
problems running the same application recompiled against lam 7.1.4.
1) When using the lamd rpi, certain nodes report a bad rank in
MPI_Allgather:
MPI_Recv: internal MPI error: Bad address (rank 3, comm 3)
Rank (12, MPI_COMM_WORLD): Call stack within LAM:
Rank (12, MPI_COMM_WORLD): - MPI_Recv()
Rank (12, MPI_COMM_WORLD): - MPI_Allgather()
Rank (12, MPI_COMM_WORLD): - main()
Program received signal SIGPIPE, Broken pipe.
[Switching to Thread -1222748480 (LWP 23432)]
0xb72b9dee in write () from /lib/tls/libc.so.6
The call in question is:

  MPI_Allgather(w, mnr, MPI_DOUBLE, q1, mnr, MPI_DOUBLE, ccomm);

ccomm is set up as follows (np = 16):

  comm = MPI_COMM_WORLD;
  MPI_Comm_rank(comm, &id);
  MPI_Comm_size(comm, &np);
  if (np != npn)
    error("np!=npn\n");
  idr = id / ncb;
  idc = id % ncb;
  MPI_Comm_split(comm, idr, idc, &rcomm);   /* ranks sharing idr */
  MPI_Comm_split(comm, idc, idr, &ccomm);   /* ranks sharing idc */
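For reference, the pattern boils down to roughly the following standalone
test (a minimal sketch only -- ncb, mnr, and the buffer sizes here are
placeholders chosen for illustration, not the application's actual values):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  int id, np, idr, idc, i;
  const int ncb = 4;   /* assumed number of columns for np = 16 */
  const int mnr = 8;   /* assumed per-rank chunk size */
  MPI_Comm comm, rcomm, ccomm;
  double *w, *q1;

  MPI_Init(&argc, &argv);
  comm = MPI_COMM_WORLD;
  MPI_Comm_rank(comm, &id);
  MPI_Comm_size(comm, &np);

  idr = id / ncb;
  idc = id % ncb;
  MPI_Comm_split(comm, idr, idc, &rcomm);   /* ranks sharing idr */
  MPI_Comm_split(comm, idc, idr, &ccomm);   /* ranks sharing idc */

  w  = malloc(mnr * sizeof(double));
  q1 = malloc((np / ncb) * mnr * sizeof(double));  /* ccomm has np/ncb ranks */
  for (i = 0; i < mnr; i++)
    w[i] = id;

  MPI_Allgather(w, mnr, MPI_DOUBLE, q1, mnr, MPI_DOUBLE, ccomm);

  free(w);
  free(q1);
  MPI_Comm_free(&rcomm);
  MPI_Comm_free(&ccomm);
  MPI_Finalize();
  return 0;
}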
I can confirm that both pieces of code above are executed by node 15.
2) I had written hand-rolled versions of allreduce and bcast, which no
longer work (random message corruption, not yet diagnosed further):
static __inline__ int
qdp_allreduce(void *a, int nn, MPI_Comm c, MPI_Datatype d, int size,
              void (*f)(void *, void *, int)) {
  int i, j, k, r, s;
  static MPI_Comm sc;
  static int si, sj, ss, sr;
  MPI_Request req;
  static void *b1, *b, *be;

  /* (re)allocate scratch space if the buffer is too small */
  if (be - b1 < size * nn)
    r_mem(b, size * nn);

  /* rank, size, and the largest power of two <= size are cached
     for the most recently used communicator */
  if (sc == c) {
    i = si;
    j = sj;
    r = sr;
    s = ss;
  } else {
    MPI_Comm_rank(c, &r);
    MPI_Comm_size(c, &s);
    for (i = 0, j = 1; j + j <= s; i++, j += j);  /* j = largest power of two <= s, i = log2(j) */
    si = i;
    sj = j;
    sr = r;
    ss = s;
    sc = c;
  }

  if (r >= sj) {
    /* ranks above the power-of-two boundary: hand the data to a
       partner below the boundary, then receive the final result back */
    MPI_Isend(a, nn, d, r - (s - sj), s, c, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Recv(a, nn, d, r - (s - sj), s, c, MPI_STATUS_IGNORE);
  } else {
    if (r >= sj - (s - sj)) {
      /* combine in the contribution of the partner above the boundary */
      MPI_Recv(b1, nn, d, r + (s - sj), s, c, MPI_STATUS_IGNORE);
      (*f)(a, b1, nn);
    }
    /* butterfly exchange among the sj ranks below the boundary */
    for (i--, j = j / 2; i >= 0; i--, j = j / 2) {
      k = r / (j + j);
      k += k;
      k = r / j - k;
      k = k ? r - j : r + j;   /* partner at distance j */
      MPI_Isend(a, nn, d, k, i, c, &req);
      MPI_Recv(b1, nn, d, k, i, c, MPI_STATUS_IGNORE);
      MPI_Wait(&req, MPI_STATUS_IGNORE);
      (*f)(a, b1, nn);
    }
    if (r >= sj - (s - sj)) {
      /* return the result to the partner above the boundary */
      MPI_Isend(a, nn, d, r + (s - sj), s, c, &req);
      MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
  }
  return 0;
}
static __inline__ int
qdp_bcast_lin(double *a, int nn, MPI_Comm c, int r, int s, int w) {
  int i;
  static MPI_Request *rq1, *rq, *rqe;

  /* (re)allocate the persistent-request array if it is too small */
  if (s - 1 > rqe - rq1)
    r_mem(rq, s - 1);

  if (r != w) {
    /* non-root ranks receive directly from the root w */
    MPI_Recv(a, nn, MPI_DOUBLE, w, 0, c, MPI_STATUS_IGNORE);
  } else {
    /* the root sets up one persistent send per destination */
    for (i = 1, rq = rq1; i < s; i++)
      MPI_Send_init(a, nn, MPI_DOUBLE, (w + i) % s, 0, c, rq++);
  }

  if (rq > rq1) {
    /* start all sends, then free the requests (completion is left to the library) */
    MPI_Startall(rq - rq1, rq1);
    for (; --rq >= rq1;)
      MPI_Request_free(rq);
    rq++;
  }
  return 0;
}
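For context, the two routines above are called along the lines of the
following sketch (sum_doubles, buf, and n are illustrative stand-ins,
not the application's real names):

static void sum_doubles(void *acc, void *in, int n) {
  double *x = acc, *y = in;
  int i;
  for (i = 0; i < n; i++)
    x[i] += y[i];   /* combine the received buffer into the accumulator */
}

static void example_calls(double *buf, int n, MPI_Comm c, int rank, int size) {
  /* global sum of buf over c, combining with sum_doubles */
  qdp_allreduce(buf, n, c, MPI_DOUBLE, (int) sizeof(double), sum_doubles);
  /* broadcast buf from rank 0 of c */
  qdp_bcast_lin(buf, n, c, rank, size, 0);
}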
Has anything changed regarding the blocking/non-blocking status of any
of these calls?
Finally, my code is spread across several libraries, two of which
independently set up static communicators for parallelization -- is there
now some internal interference with such a strategy within the lam library?
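Concretely, each of the two libraries does something like the following
(names here are made up for illustration; the point is just that each one
keeps its own static communicator derived from MPI_COMM_WORLD):

static MPI_Comm libA_comm;   /* illustrative name */
static MPI_Comm libB_comm;   /* illustrative name */

static void libA_init(void) {
  int id;
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Comm_split(MPI_COMM_WORLD, 0, id, &libA_comm);
}

static void libB_init(void) {
  int id;
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Comm_split(MPI_COMM_WORLD, 0, id, &libB_comm);
}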
Please let me know if any further details are needed. lamtests
appears to run fine on this installation.
Thanks!
--
Camm Maguire camm_at_[hidden]
==========================================================================
"The earth is but one country, and mankind its citizens." -- Baha'u'llah