LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-12-09 17:48:06


Sorry for the long delay -- I switched mail clients and just found a
bunch of mail that my new client hid from me. :-(

So when you run the master on a different node from any of the
slaves, you still have this problem?

I could understand if the master was on a node that was overloaded
(e.g., 1 master + 2 slaves on a dual processor machine), but if the
master is on a node by itself, this behavior would surprise me.

Are you using any kind of scheduler? Do you know that the master is
running on an otherwise-idle node?
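
For example (a sketch only -- the node numbers and counts here are
guesses, so adjust them to your cluster), a LAM application schema can
pin the master to its own node:

  # appschema: master alone on node 0, one slave on each of nodes 1-7
  n0 master
  n1-7 slave

Booting with lamboot and then launching with "mpirun appschema" would
keep the master off the CPUs the slaves are using.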

On Dec 7, 2005, at 9:55 AM, Douglas Vechinski wrote:

>
> I'm having a problem with some apparent stalling on one platform and
> not on another. First let me give some background on the code. I
> have a code that uses a master/slave design: basically one master and
> several slaves. The slaves work independently of one another; there
> is no communication between the slaves. The slaves communicate only
> with the master, stating that they are ready for something to work
> on. They send a message to the master (who is waiting in an
> MPI_WAITANY). When the master receives a message, it determines
> which slave is requesting work, sends it the index of the next item
> to be worked on, and goes back to the MPI_WAITANY statement.
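>
> Roughly, the logic looks like this (a simplified sketch in C, not my
> actual code; names like WORK_TAG and do_work are illustrative):
>
> #include <mpi.h>
>
> #define WORK_TAG 1
>
> void do_work(int item);   /* the real computation, defined elsewhere */
>
> /* Master: one outstanding receive per slave; dispatch work as
>    requests arrive. */
> void master(int nslaves, int nitems)
> {
>     MPI_Request reqs[nslaves];
>     int ready[nslaves];
>     int next = 0, idx;
>     MPI_Status st;
>
>     /* Post one receive per slave (slave i has rank i + 1). */
>     for (int i = 0; i < nslaves; i++)
>         MPI_Irecv(&ready[i], 1, MPI_INT, i + 1, WORK_TAG,
>                   MPI_COMM_WORLD, &reqs[i]);
>
>     while (next < nitems) {
>         /* Block until *any* slave's request arrives. */
>         MPI_Waitany(nslaves, reqs, &idx, &st);
>         /* Send that slave the index of the next item... */
>         MPI_Send(&next, 1, MPI_INT, idx + 1, WORK_TAG, MPI_COMM_WORLD);
>         next++;
>         /* ...and re-arm the receive for its next request. */
>         MPI_Irecv(&ready[idx], 1, MPI_INT, idx + 1, WORK_TAG,
>                   MPI_COMM_WORLD, &reqs[idx]);
>     }
>     /* (Termination handling omitted.) */
> }
>
> /* Slave: ask for work, do it, repeat. */
> void slave(int myrank)
> {
>     int item;
>     MPI_Status st;
>
>     for (;;) {
>         /* Tell the master I'm ready for something to work on. */
>         MPI_Send(&myrank, 1, MPI_INT, 0, WORK_TAG, MPI_COMM_WORLD);
>         /* Wait for the index of the next item. */
>         MPI_Recv(&item, 1, MPI_INT, 0, WORK_TAG, MPI_COMM_WORLD, &st);
>         do_work(item);
>         /* (Termination handling omitted here as well.) */
>     }
> }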
>
> I developed and tested this application using several Linux machines
> on our local office LAN. Different flavors/versions of Linux were
> used, but all had LAM 7.1.1 on them. The code seems to be working
> fine there.
>
> However, the system where this code will be exercised a lot is a
> Beowulf cluster with 8 nodes and 2 processors per node. I would say
> this cluster is not really a Beowulf cluster anymore, since bproc is
> not running or installed; it is really 8 separate "PCs" mounted on a
> rack, connected by ethernet, with a single keyboard and monitor. LAM
> 7.0.4 is present on this machine. A common filesystem is mounted
> across the nodes. I now observe the following behavior for my
> parallel code.
>
> After the master and slaves go through their initial setup, the
> slaves send their initial requests to the master. The master handles
> one or two of the requests; those one or two slaves get their info
> and begin to work. All the other slaves are sitting idle, even
> though they have sent their requests and the master is sitting at the
> MPI_WAITANY call. Not until the first one or two slaves finish their
> current task, which can take from a few minutes to tens of minutes,
> and request the next item to work on, do the other slaves receive
> their direction from the master. Then, if the slaves that ran first
> finish sooner and get the next item, the other slaves stall after
> finishing their current task. Again, even though they send a request
> to the master, the master doesn't act on it until the first slaves
> finish their current task. So a lot of time is being wasted with no
> computation being performed. I've tried running the master and
> slaves on different CPUs and nodes and get similar results. (From
> one point of view, the code does run; it's just that a lot of slaves
> are sitting idle for long periods of time because the master is not
> receiving the requests promptly from the slaves, even though the
> slaves have sent them.)
>
> However, as stated earlier, on the office LAN setup it runs as
> expected: the master promptly acts upon a work request from a slave
> whenever a slave sends such a request.
>
> Any ideas or suggestions on the cause and/or a possible solution?
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/