On Mon, 3 Dec 2001, Martin Knoblauch wrote:
> The basic problem we (the software vendor and I) try to solve is that a
> LAM/MPI application basically dies without useful error messages. It
> usually just tells us that some ranks are dead, but nothing more
> specific.
Strange. Without source code, this is going to be difficult to debug.
But you probably knew that already ;-).
> We know that the problem is related to asynch send/receive operations.
> Interestingly, the problem only shows up when using 4 or more
> processes. 1, 2 and 3 are OK. This brought the guy from the vendor to the
> MPI_MSGS_PER_PROC/MPI_MSGS_PER_HOST idea, because increasing those
> helped in a similar problem on an Origin (which of course is just one
> big "host"). Our assumption is that the MPI layer runs out of resources
> without telling us what is missing. BIG Q: is there a way to get better
> diagnostics from the library?
Well, the best way to debug something like this is with a debugger,
stepping through the code or attaching when something goes wrong. But if
you don't have access to the source code, that could be difficult. Just
out of curiosity, what version of LAM/MPI was the product you are trying
to use compiled with? Not which version you are running - if you don't
have access to the source code, then some of it had to have been compiled
beforehand, right?
I had to talk to some of the other developers on the LAM project, and
there really isn't an equivalent to MPI_MSGS_PER_PROC in LAM. We will
continue to create space for as many envelopes as we can, stopping only
when malloc() starts failing. We do have a small data structure that has
to fit in shared memory, and once shared memory is full the application
will have to block/fail until some shared memory is freed up. Note that
this is ONLY applicable to communication over the shared memory bus. On a
dual-CPU machine, you would have to have a LOT of async sends / recvs
posted in order to trigger this problem. If you want to play, you can
increase the shared memory sizes by setting the LAM_MPI_SHMMAXALLOC and
LAM_MPI_SHMPOOLSIZE environment variables.
Like I said, it would have to be a ton of communication between processes
on the same machine to run into this problem.
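If you do want to experiment, here is a minimal sketch of building an environment with those two variables set before launching the job. The helper name lam_shm_env and the numbers are purely illustrative, and I am assuming the values are given in bytes; check the documentation for your LAM version before relying on either.

```python
import os


def lam_shm_env(pool_size, max_alloc, base=None):
    """Return a copy of the environment with LAM's two shared-memory
    variables set. The helper itself is hypothetical; only the two
    variable names come from LAM/MPI."""
    env = dict(os.environ if base is None else base)
    env["LAM_MPI_SHMPOOLSIZE"] = str(pool_size)
    env["LAM_MPI_SHMMAXALLOC"] = str(max_alloc)
    return env


# Placeholder sizes (assumed to be bytes), not recommendations.
env = lam_shm_env(pool_size=64 * 1024 * 1024, max_alloc=4 * 1024 * 1024)

for name in ("LAM_MPI_SHMPOOLSIZE", "LAM_MPI_SHMMAXALLOC"):
    print(name, "=", env[name])

# To launch with this environment you could then do something like:
#   subprocess.run(["mpirun", "-np", "4", "./your_app"], env=env)
# where ./your_app stands in for the real binary.
```

Raising these values only matters for ranks that talk to each other over shared memory on the same node, so it is a cheap experiment rather than a likely fix.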
> > Also, some info on your application might not be a bad thing. Such as how
> > many processes you are running, what the configuration of the machines are
> > (uni-processor, dual, quad, etc). You mentioned that you have some
> > asynchronous send/recv operations. How many would you say are outstanding
> > at any given time?
>
> The application is the MPP version of PamCrash 2000 under Linux.
> Unfortunately, the logistics of this problem are difficult. Those who
> could solve the problem (ESI) do not have access to the triggering
> dataset, which belongs to a customer of my employer's customer. And I do
> not have access to the source of PamCrash. So we are trying to do some
> guesswork here ....
Well, not having source does pose some problems in debugging ;-). At any
time, you can see the "state of the world" as seen by LAM/MPI by running
the 'mpitask' and 'mpimsg' programs that are part of the LAM/MPI
installation. But since you can't see the source, that information might
not be really useful to you.
Some more questions for you:
* How many machines are you running on?
* How many CPUs are in those machines?
* How many processes (of your application) are you running on each machine?
* How do you know (for certain) that the problem is related to
asynchronous sends and receives?
Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/