
LAM/MPI General User's Mailing List Archives


From: Camm Maguire (camm_at_[hidden])
Date: 2008-01-02 11:49:23


Greetings -- so sorry for the delay here.

Jeff Squyres <jsquyres_at_[hidden]> writes:

> On Dec 5, 2007, at 12:08 PM, Camm Maguire wrote:
>
> > Greetings! We've used lam 6.x for years successfully, but now have
> > problems running the same application recompiled against lam 7.1.4.
> >
> > 1) When using the lamd rpi, certain nodes report a bad rank in
> > MPI_Allgather:
> >
> > MPI_Recv: internal MPI error: Bad address (rank 3, comm 3)
> > Rank (12, MPI_COMM_WORLD): Call stack within LAM:
> > Rank (12, MPI_COMM_WORLD): - MPI_Recv()
> > Rank (12, MPI_COMM_WORLD): - MPI_Allgather()
> > Rank (12, MPI_COMM_WORLD): - main()
>
> Does it work with the other RPI's? (unlikely, but I thought I'd ask)
>

No, but the point of failure is often different.
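
For reference, the failing call boils down to an ordinary
MPI_Allgather. A minimal reproducer in the same shape (purely
illustrative -- not my actual code) would be something like:

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
    int rank, nproc, *all;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* each rank contributes its own rank; everyone gathers all of them */
    all = malloc(nproc * sizeof(int));
    MPI_Allgather(&rank, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d: got %d..%d\n", rank, all[0], all[nproc - 1]);
    free(all);
    MPI_Finalize();
    return 0;
  }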

> >
> > 2) I had written by-hand versions of allreduce and bcast which no
> > longer work (random message corruption, as yet not diagnosed
> > further)
> >
> > static __inline__ int
> > qdp_allreduce(void *a,int nn,MPI_Comm c,MPI_Datatype d,int size,
> > void (*f)(void *,void *,int)) {
> >
> > int i,j,k,r,s;
>
> Woof; that's a little too much for me to analyze without a Cisco
> support contract, and Open MPI. ;-)
>

OK, no need, as vanilla allreduce triggers the problem.
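
In case it helps to picture what the hand-written routine looks like,
here is a naive sketch with the same signature (purely illustrative,
not the actual routine): gather everything to rank 0, combine with the
user-supplied f, then broadcast the result.

  #include <stdlib.h>
  #include <mpi.h>

  /* a: in/out buffer of nn elements of datatype d, each 'size' bytes;
     f(acc, incoming, nn) folds 'incoming' into 'acc' in place.
     Illustrative sketch only. */
  static int
  sketch_allreduce(void *a, int nn, MPI_Comm c, MPI_Datatype d, int size,
                   void (*f)(void *, void *, int))
  {
    int rank, nproc, i;
    void *tmp;
    MPI_Status st;

    MPI_Comm_rank(c, &rank);
    MPI_Comm_size(c, &nproc);

    if (rank == 0) {
      if ((tmp = malloc((size_t) nn * size)) == NULL)
        return MPI_ERR_OTHER;
      for (i = 1; i < nproc; i++) {
        MPI_Recv(tmp, nn, d, i, 0, c, &st);
        f(a, tmp, nn);          /* fold the incoming buffer into a */
      }
      free(tmp);
    } else {
      MPI_Send(a, nn, d, 0, 0, c);
    }
    /* everyone ends up with the combined result */
    return MPI_Bcast(a, nn, d, 0, c);
  }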

> > Has anything changed regarding the blocking/non-blocking status of any
> > of these calls?
>
> Not really. I think the core algorithms for allgather have not
> changed in LAM for a long, long time. But I'm afraid that I don't
> remember the specifics...
>
> There was a big change in the 7 series when we moved to the component
> architecture stuff. So there was a bit of refactoring of the
> collective algorithm code, but the core algorithms should still be the
> same.
>

OK, I confirm that the same code compiled against lam 6.5.9 runs
flawlessly on the same cluster. So either an error was introduced in a
later lam release, or 6.5.9 has a bug which masks a bug in my code,
which seems less likely. How can I chase this down?

> Have you tried configuring LAM for memory debugging and running your
> code through a memory-checking debugger?
>

Not yet, but this looks promising. I take it the method of choice is
to use mpirun to launch an xterm on each node, running gdb on the code
with LD_PRELOAD set to libefence.so.0.0? If not, could you give any
more details here please?
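
Concretely, I was picturing something along these lines (the program
name is a placeholder, the library path is a guess for our Debian
boxes, and this assumes DISPLAY is usable on each node; I would set
the preload from inside gdb rather than assuming it propagates):

  $ mpirun -np 4 xterm -e gdb ./myprog
  (gdb) set environment LD_PRELOAD /usr/lib/libefence.so.0.0
  (gdb) run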

> Have you tried Open MPI?
>

Alas, no. As you know, I maintain lam for Debian, and have not found
the time to package openmpi. Someone else now has, but I am unsure
whether the source-compatibility design between the lam and mpich
packages has been maintained, as I simply have not had time to check.

> > Finally, my code is in several libraries, two of which independently
> > setup static communicators for parallelization -- is there now some
> > internal interference for such a strategy within the lam library?
>
> I'm not quite sure what you're asking -- MPI gives you MPI_COMM_WORLD
> by default. If you need to subset beyond that, then you can use calls
> like MPI_COMM_SPLIT (like you showed, above) and friends. Are you
> asking something beyond that?
>

Not really, just confirming that one should be able to divide the
cluster in different ways in different subroutines. In any case, this
is not the issue, as removing that code does not remove the error.
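
For the record, each of the two libraries carves out its own static
communicator at startup along roughly these lines (identifiers here
are generic placeholders, not the real ones):

  #include <mpi.h>

  /* illustrative sketch, not the actual library code */
  static MPI_Comm lib_comm = MPI_COMM_NULL;

  /* ranks passing the same 'color' end up together in lib_comm,
     ordered by their rank in MPI_COMM_WORLD */
  int lib_setup(int color)
  {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    return MPI_Comm_split(MPI_COMM_WORLD, color, rank, &lib_comm);
  }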

Suggestions most appreciated.

> --
> Jeff Squyres
> Cisco Systems
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
>
>

-- 
Camm Maguire			     			camm_at_[hidden]
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah