LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-01-26 08:55:30


On Jan 26, 2006, at 7:08 AM, Homoceanu Silviu wrote:

> On a cluster that runs more than one job, I run a job that splits
> the data to be computed among the slaves, and then I collect the
> resulting data sequentially. The problem is that after all the
> computing is done, while the master receives data from, say, slave
> 1, the rest of the slaves should be doing nothing, but with the top
> command I can see that my application uses 99% of the CPU on all
> nodes during this period as well.

It's not clear from your text, but I'm assuming that once a slave
sends back data to the master, it spins at 99% of the CPU (i.e., that
this is true for all slaves -- not just slave 1).

If so, I'm guessing that this is likely because the MPI processes are
spinning waiting for more data (e.g., in an MPI_RECV call).

Are you running more than one process per node? If so, LAM may spin
in an MPI_RECV call when you use MPI_ANY_SOURCE, or if you have
pending receives from multiple sources, because there is no way to
simultaneously block on a socket and check shared memory (hence, LAM
has to spin checking both).

Possible solutions include:

- not using MPI_ANY_SOURCE and not having any other pending receives
from a peer that is on the same node (i.e., not using both tcp and
shared memory); see the sketch after this list

- not using the sysv or usysv RPIs (usysv will be the default) so
that LAM will use TCP even for on-host communications, but then it
can block in MPI_RECV (even when using MPI_ANY_SOURCE)
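
For the first option, the change can be as small as naming explicit
source ranks in the receives.  This is only a sketch against a layout
I'm assuming (master is rank 0, slaves are ranks 1..nslaves, one
result of "count" doubles per slave), not your actual code:

#include <mpi.h>

/* Master-side collection loop with explicit source ranks.  If each
   peer is reached over only one transport (i.e., not tcp and shared
   memory at the same time), LAM can block in MPI_Recv instead of
   spinning. */
void collect_results(int nslaves, int count, double *results)
{
    MPI_Status status;
    int slave;

    for (slave = 1; slave <= nslaves; ++slave) {
        MPI_Recv(results + (slave - 1) * count, count, MPI_DOUBLE,
                 slave, 0, MPI_COMM_WORLD, &status);
    }
}

The same change applies to the slaves' own receives (e.g., name the
master's rank instead of MPI_ANY_SOURCE in the sketch above).  For the
second option, the RPI can be selected when you launch the job; with
LAM 7.x something like "mpirun -ssi rpi tcp ..." should pick the TCP
RPI (check mpirun(1) on your installation to be sure).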

Finally, is it really a problem that they're using 99% of the CPU?
LAM optimizes for message passing speed, so it aggressively spins in
tight loops to make this happen. If you have no other applications
running on your nodes, it might not matter that your "idling" slaves
are running at 99%.

> Another problem is that if I try to run an MPI application that
> also uses ncurses as an interface... it doesn't work!!

This is not really enough information to try to help you. :-) Can
you describe what is going wrong? Does it fail to compile or link
or ...?

Here's a guess... :-)

If you use mpirun to launch your application, keep in mind that
mpirun funnels all stdout/stderr from the processes, so you get a
"relaying" effect, and ncurses may not do exactly what you think it
should (although you might be able to pick an appropriate serial
terminal interface such that it might work...? I have never tried
this, so I don't know). Alternatively, you can skip mpirun entirely:
launch your ncurses program as a singleton (e.g., "./my_program") and
have that program MPI_COMM_SPAWN the rest of the processes (i.e., the
ones that won't use ncurses). Then you don't get the mpirun relaying
effect.
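
In case it helps, here's roughly what the singleton approach looks
like (untested; the "worker" executable name and the count of 4 are
made up for illustration):

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm workers;

    MPI_Init(&argc, &argv);

    /* Start 4 copies of the (hypothetical) "worker" executable; they
       form an intercommunicator with this ncurses front end, whose
       output is not relayed through mpirun. */
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &workers, MPI_ERRCODES_IGNORE);

    /* ... initialize ncurses here and exchange data with the workers
       over the "workers" intercommunicator ... */

    MPI_Comm_free(&workers);
    MPI_Finalize();
    return 0;
}

You'll still need the LAM run-time environment booted (lamboot)
before running the singleton, of course.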

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/