On Dec 5, 2007, at 12:08 PM, Camm Maguire wrote:
> Greetings! We've used lam 6.x for years successfully, but now have
> problems running the same application recompiled against lam 7.1.4.
>
> 1) When using the lamd rpi, certain nodes report a bad rank in
> MPI_Allgather:
>
> MPI_Recv: internal MPI error: Bad address (rank 3, comm 3)
> Rank (12, MPI_COMM_WORLD): Call stack within LAM:
> Rank (12, MPI_COMM_WORLD): - MPI_Recv()
> Rank (12, MPI_COMM_WORLD): - MPI_Allgather()
> Rank (12, MPI_COMM_WORLD): - main()
Does it work with the other RPIs? (unlikely, but I thought I'd ask)
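Also, a stripped-down allgather test might help isolate whether the
problem is in the collective itself or somewhere else in your
application. Something along these lines (untested sketch; adjust the
count/datatype to match what your code actually gathers):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size, i;
    int *recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* every rank contributes its own rank number */
    recvbuf = (int *) malloc(size * sizeof(int));
    MPI_Allgather(&rank, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    /* sanity check: recvbuf[i] should equal i on every rank */
    for (i = 0; i < size; ++i) {
        if (recvbuf[i] != i) {
            printf("rank %d: bad value %d at index %d\n", rank, recvbuf[i], i);
        }
    }

    free(recvbuf);
    MPI_Finalize();
    return 0;
}

If that runs cleanly under the lamd RPI with the same host file, the
problem is more likely in how the application sets up its buffers and
datatypes than in the allgather algorithm itself.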
>
> 2) I had written hand-rolled versions of allreduce and bcast, which no
> longer work (random message corruption, not yet diagnosed
> further):
>
> static __inline__ int
> qdp_allreduce(void *a,int nn,MPI_Comm c,MPI_Datatype d,int size,
> void (*f)(void *,void *,int)) {
>
> int i,j,k,r,s;
Woof; that's a little too much for me to analyze without a Cisco
support contract, and Open MPI. ;-)
> Has anything changed regarding the blocking/non-blocking status of any
> of these calls?
Not really. I think the core algorithms for allgather have not
changed in LAM for a long, long time. But I'm afraid that I don't
remember the specifics...
There was a big change in the 7 series when we moved to the component
architecture (SSI). That involved some refactoring of the collective
algorithm code, but the core algorithms should still be the same.
Have you tried configuring LAM for memory debugging and running your
code through a memory-checking debugger?
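For instance, running everything under valgrind often turns up heap
corruption pretty quickly (the exact mpirun arguments depend on your
setup; this is just the general shape, with "your_app" as a
placeholder):

mpirun -np 4 valgrind --leak-check=yes ./your_app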
Have you tried Open MPI?
> Finally, my code is in several libraries, two of which independently
> set up static communicators for parallelization -- is there now some
> internal interference with such a strategy within the lam library?
I'm not quite sure what you're asking -- MPI gives you MPI_COMM_WORLD
by default. If you need to subset beyond that, you can use calls
like MPI_COMM_SPLIT (like you showed above) and friends. Are you
asking something beyond that?
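E.g., each library can carve out (or just duplicate) its own
communicator at init time and use it for all of its internal traffic.
A minimal sketch -- the names here (libA_comm, libA_init) are just
placeholders, not anything from LAM:

#include <mpi.h>

/* hypothetical library-private communicator */
static MPI_Comm libA_comm = MPI_COMM_NULL;

/* call once, after MPI_Init() */
int libA_init(void) {
    /* a private copy of MPI_COMM_WORLD: traffic on libA_comm cannot
       match messages posted on any other communicator */
    return MPI_Comm_dup(MPI_COMM_WORLD, &libA_comm);
}

/* or, to use only a subset of ranks: */
int libA_init_subset(int color) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* ranks passing the same color land in the same new communicator */
    return MPI_Comm_split(MPI_COMM_WORLD, color, rank, &libA_comm);
}

Two libraries each doing their own dup/split like this should not
interfere with each other; message matching is scoped to the
communicator.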
--
Jeff Squyres
Cisco Systems