LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-12-06 09:40:09


On Dec 5, 2007, at 12:08 PM, Camm Maguire wrote:

> Greetings! We've used lam 6.x for years successfully, but now have
> problems running the same application recompiled against lam 7.1.4.
>
> 1) When using the lamd rpi, certain nodes report a bad rank in
> MPI_Allgather:
>
> MPI_Recv: internal MPI error: Bad address (rank 3, comm 3)
> Rank (12, MPI_COMM_WORLD): Call stack within LAM:
> Rank (12, MPI_COMM_WORLD): - MPI_Recv()
> Rank (12, MPI_COMM_WORLD): - MPI_Allgather()
> Rank (12, MPI_COMM_WORLD): - main()

Does it work with the other RPIs? (unlikely, but I thought I'd ask)
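
In LAM 7.x you can force a specific RPI at run time via the SSI
parameters on the mpirun command line -- something along these lines
(I'm going from memory on the exact syntax, so double check against
the LAM docs; "my_app" is just a placeholder):

  mpirun -ssi rpi tcp C ./my_app
  mpirun -ssi rpi usysv C ./my_app

If the failure only shows up with the lamd RPI, that narrows things
down quite a bit.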

>
> 2) I had written by hand versions of allreduce and bcast which no
> longer work (random message corruption as yet not diagnosed
> further)
>
> static __inline__ int
> qdp_allreduce(void *a,int nn,MPI_Comm c,MPI_Datatype d,int size,
> void (*f)(void *,void *,int)) {
>
> int i,j,k,r,s;

Woof; that's a little too much for me to analyze without a Cisco
support contract, and Open MPI. ;-)

> Has anything changed regarding the blocking/non-blocking status of any
> of these calls?

Not really. I think the core algorithms for allgather have not
changed in LAM for a long, long time. But I'm afraid that I don't
remember the specifics...

There was a big change in the 7 series when we moved to the SSI
component architecture. So there was a bit of refactoring of the
collective algorithm code, but the core algorithms should still be the
same.

Have you tried configuring LAM for memory debugging and running your
code through a memory-checking debugger?
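
(For reference, and going from memory: I believe configuring LAM with
--with-purify makes its internals friendlier to memory-checking tools
by initializing buffers that would otherwise be reported as
uninitialized -- check ./configure --help to confirm the exact option
name. Then something like

  mpirun C valgrind --tool=memcheck ./my_app

should run each rank under the checker; "my_app" is again just a
placeholder.)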

Have you tried Open MPI?

> Finally, my code is in several libraries, two of which independently
> setup static communicators for parallelization -- is there now some
> internal interference for such a strategy within the lam library?

I'm not quite sure what you're asking -- MPI gives you MPI_COMM_WORLD
by default. If you need to subset beyond that, you can use calls like
MPI_COMM_SPLIT (as you showed above) and friends. Are you asking about
something beyond that?
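
If the worry is the two libraries' traffic interfering with each
other, the usual recommendation is for each library to MPI_Comm_dup()
its own communicator at init time so that its messages live in a
separate communication context. A minimal sketch (the function and
variable names here are made up for illustration):

  #include <mpi.h>

  /* Each library keeps its own duplicated communicator: same group of
     processes as MPI_COMM_WORLD, but a distinct communication context,
     so its messages can never match the other library's. */
  static MPI_Comm libA_comm = MPI_COMM_NULL;
  static MPI_Comm libB_comm = MPI_COMM_NULL;

  void libA_init(void)    { MPI_Comm_dup(MPI_COMM_WORLD, &libA_comm); }
  void libB_init(void)    { MPI_Comm_dup(MPI_COMM_WORLD, &libB_comm); }

  void libA_cleanup(void) { MPI_Comm_free(&libA_comm); }
  void libB_cleanup(void) { MPI_Comm_free(&libB_comm); }

Collectives and point-to-point on a dup'ed communicator are guaranteed
by the MPI standard not to match traffic on MPI_COMM_WORLD or on the
other library's communicator, even though both span the same set of
processes.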

-- 
Jeff Squyres
Cisco Systems