LAM/MPI General User's Mailing List Archives

From: Martin Knoblauch (Martin.Knoblauch_at_[hidden])
Date: 2001-12-03 06:06:43


>
> On Thu, 29 Nov 2001, Martin Knoblauch wrote:
>
> > does LAM-MPI have the equivalent of the MPI_MSGS_PER_PROC environment
> > variable in IRIX or Unicos? I am trying to debug a problem where an
> > application fails in async send/recv operations. Increasing those variables
> > on IRIX did help the problem. My environment is
> > lam-6.5.4/usysv/linux-2.4.9-ac18.
>
> I'm afraid I really don't know what problem the Irix MPI_MSGS_PER_PROC is
> trying to solve, so let's try going at this from another angle. What
> exactly is happening to your program? Is LAM displaying any error
> messages? If so, what are they?
>

 The basic problem we (the software vendor and I) are trying to solve is
that a LAM/MPI application dies without useful error messages. It
usually just tells us that some ranks are dead, but nothing more
specific.

  *** ERROR *** INTERRUPT RECEIVED FROM 3. Error No -89 TID No 3
 ERRMSG:SNDC2V: error-message to node 5 Tag was 5000
[martink_at_sec001 ~/p_to_001]$ tail bla-6.log
 ERRMSG:SNDC2V: error-message to node 3 Tag was 5000
MPI_Recv: process in local group is dead (rank 1, MPI_COMM_WORLD)
Rank (1, MPI_COMM_WORLD): Call stack within LAM:
Rank (1, MPI_COMM_WORLD): - MPI_Recv()
Rank (1, MPI_COMM_WORLD): - main()
 ERRMSG:SNDC2V: error-message to node 5 Tag was 5000
MPI_Send: process in local group is dead (rank 3, MPI_COMM_WORLD)
Rank (3, MPI_COMM_WORLD): Call stack within LAM:
Rank (3, MPI_COMM_WORLD): - MPI_Send()
Rank (3, MPI_COMM_WORLD): - main()
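
 Since the output interleaves application messages with LAM's own
reports, a small filter can make the "dead rank" lines easier to compare
across runs. The following is only a minimal sketch, assuming the log
format is exactly as pasted above; the names DEAD_RE, SAMPLE and
summarize_dead_ranks are hypothetical helpers and not part of LAM/MPI or
PamCrash. It collects, for each rank named in a "process in local group
is dead" line, the MPI calls that reported the failure.

```python
import re
from collections import defaultdict

# Matches LAM report lines of the form seen in the log above, e.g.
#   MPI_Recv: process in local group is dead (rank 1, MPI_COMM_WORLD)
# (assumption: other LAM versions may word this differently)
DEAD_RE = re.compile(
    r"^(MPI_\w+): process in local group is dead \(rank (\d+), (\w+)\)"
)


def summarize_dead_ranks(lines):
    """Map each rank named in a 'dead' report to the MPI calls that failed."""
    report = defaultdict(set)
    for line in lines:
        match = DEAD_RE.match(line.strip())
        if match:
            call, rank, _comm = match.groups()
            report[int(rank)].add(call)
    return dict(report)


# Excerpt copied from the log output quoted in this message.
SAMPLE = """\
 ERRMSG:SNDC2V: error-message to node 3 Tag was 5000
MPI_Recv: process in local group is dead (rank 1, MPI_COMM_WORLD)
Rank (1, MPI_COMM_WORLD): Call stack within LAM:
Rank (1, MPI_COMM_WORLD): - MPI_Recv()
Rank (1, MPI_COMM_WORLD): - main()
 ERRMSG:SNDC2V: error-message to node 5 Tag was 5000
MPI_Send: process in local group is dead (rank 3, MPI_COMM_WORLD)
Rank (3, MPI_COMM_WORLD): Call stack within LAM:
Rank (3, MPI_COMM_WORLD): - MPI_Send()
Rank (3, MPI_COMM_WORLD): - main()
"""

if __name__ == "__main__":
    summary = summarize_dead_ranks(SAMPLE.splitlines())
    for rank, calls in sorted(summary.items()):
        print(f"rank {rank}: failed in {', '.join(sorted(calls))}")
```

 This does not add any information the library did not print; it only
condenses what is already in the log so that runs with different
process counts can be compared side by side.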

 We know that the problem is related to asynchronous send/receive
operations. Interestingly, the problem only shows up when using 4 or
more processes; 1, 2 and 3 are OK. This brought the guy from the vendor
to the MPI_MSGS_PER_PROC/MPI_MSGS_PER_HOST idea, because increasing
those helped with a similar problem on an Origin (which of course is
just one big "host"). Our assumption is that the MPI layer runs out of
resources without telling us what is missing. BIG Q: is there a way to
get better diagnostics from the library?

> Also, some info on your application might not be a bad thing. Such as how
> many processes you are running, what the configuration of the machines are
> (uni-processor, dual, quad, etc). You mentioned that you have some
> asynchronous send/recv operations. How many would you say are outstanding
> at any given time?
>

 The application is the MPP version of PamCrash 2000 under Linux.
Unfortunately, the logistics of this problem are difficult. Those who
could solve the problem (ESI) do not have access to the triggering
dataset, which belongs to a customer of my employer's customer. And I do
not have access to the source of PamCrash. So we are trying to do some
guesswork here ....

Thanks
Martin

-- 
------------------------------------------------------------------
Martin Knoblauch         |    email:  Martin.Knoblauch_at_[hidden]
TeraPort GmbH            |    Phone:  +49-89-510857-309
C+ITS                    |    Fax:    +49-89-510857-111
http://www.teraport.de   |    Mobile: +49-170-4904759
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/