LAM/MPI General User's Mailing List Archives


From: Douglas Vechinski (douglas.vechinski_at_[hidden])
Date: 2005-12-12 09:28:37


Yes, the problem is present even when the master is on a node all to itself.

No scheduler is present or being used.

Yes, I can verify that the master is on an essentially idle node (i.e.,
no active time-consuming processes are running).

> So when you run the master on a different node from any of the
> slaves, you still have this problem?
>
> I could understand if the master was on a node that was overloaded
> (e.g., 1 master + 2 slaves on a dual processor machine), but if the
> master is on a node by itself, this behavior would surprise me.
>
> Are you using any kind of scheduler? Do you know that the master is
> running on an otherwise-idle node?
>
> On Dec 7, 2005, at 9:55 AM, Douglas Vechinski wrote:
>
>>
>> I'm having a problem with some apparent stalling on one platform but
>> not another. First let me give some background on the code. I have a
>> code that uses a master/slave design: one master and several slaves.
>> The slaves work independently of one another; there is no
>> communication between the slaves. Slaves communicate only with the
>> master, telling it that they are ready for something to work on. They
>> send a message to the master (which is waiting in an MPI_WAITANY).
>> When the master receives a message, it determines which slave is
>> requesting work, sends it the index of the next item to be worked on,
>> and goes back to the MPI_WAITANY call.
>>
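>> In rough terms, the master side looks something like this (a minimal
>> C sketch of the pattern, not my exact code; MAX_SLAVES and the tag
>> values are made up for illustration, and termination handling is
>> elided):
>>
>>   #include <mpi.h>
>>
>>   #define MAX_SLAVES 64   /* illustrative bound, not the real limit */
>>   #define READY_TAG  1    /* slave -> master: "ready for work"      */
>>   #define WORK_TAG   2    /* master -> slave: next item index       */
>>
>>   /* Master: wait for any slave's "ready" message and answer it with
>>      the index of the next work item.  Slaves are ranks 1..nslaves. */
>>   void master(int nslaves, int nitems)
>>   {
>>       MPI_Request reqs[MAX_SLAVES];
>>       int ready[MAX_SLAVES];
>>       MPI_Status status;
>>       int next = 0, which, i;
>>
>>       /* Post one nonblocking receive per slave for its first request. */
>>       for (i = 0; i < nslaves; i++)
>>           MPI_Irecv(&ready[i], 1, MPI_INT, i + 1, READY_TAG,
>>                     MPI_COMM_WORLD, &reqs[i]);
>>
>>       while (next < nitems) {
>>           /* Block until some slave's request arrives. */
>>           MPI_Waitany(nslaves, reqs, &which, &status);
>>
>>           /* Answer that slave with the next item index... */
>>           MPI_Send(&next, 1, MPI_INT, which + 1, WORK_TAG,
>>                    MPI_COMM_WORLD);
>>           next++;
>>
>>           /* ...and re-arm the receive for that slave's next request. */
>>           MPI_Irecv(&ready[which], 1, MPI_INT, which + 1, READY_TAG,
>>                     MPI_COMM_WORLD, &reqs[which]);
>>       }
>>   }
>>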
>> I developed and tested this application using several Linux machines
>> on our local office LAN. Different flavors/versions of Linux were
>> used, but all had LAM 7.1.1 on them. The code seems to work fine
>> there.
>>
>> However, the system where this code will be exercised heavily is a
>> Beowulf cluster with 8 nodes and 2 processors per node. I would say
>> it is not really a Beowulf cluster anymore, since bproc is not
>> running or installed; it is really 8 separate "PCs" mounted on a
>> rack, connected by Ethernet, with a single keyboard and monitor. LAM
>> 7.0.4 is installed on this cluster, and a common filesystem is
>> mounted across the nodes. I now observe the following behavior from
>> my parallel code.
>>
>> After the master and slaves go through their initial setup, the
>> slaves send their initial requests to the master. The master handles
>> one or two of the requests; those one or two slaves get their info
>> and begin to work. All the other slaves are sitting idle, even though
>> they have sent their requests and the master is sitting at the
>> MPI_WAITANY call. Not until the first one or two slaves finish their
>> current task, which can take from a few minutes to tens of minutes,
>> and request the next item to work on do the other slaves receive
>> their direction from the master. Then, if the slaves that ran first
>> finish sooner and get the next item, the other slaves stall again
>> after finishing their current task: even though they send a request
>> to the master, the master doesn't act on it until the first slaves
>> finish their current task. So a lot of time is being wasted with no
>> computation being performed. I've tried running the master and slaves
>> on different CPUs and nodes and get similar results. (From one point
>> of view, the code does run; it's just that a lot of slaves sit idle
>> for long periods because the master is not acting promptly on the
>> requests, even though the slaves have sent them.)
>>
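>> For reference, the slave side is just a blocking send/receive pair,
>> roughly like this (again only a sketch, using the header and tag
>> definitions from the master sketch above; do_work() stands in for
>> the real computation, which runs for minutes at a time):
>>
>>   extern void do_work(int item);   /* the real task, defined elsewhere */
>>
>>   /* Slave: announce readiness to the master (rank 0), then block
>>      until the master answers with an item index. */
>>   void slave(void)
>>   {
>>       MPI_Status status;
>>       int dummy = 0, item;
>>
>>       for (;;) {
>>           MPI_Send(&dummy, 1, MPI_INT, 0, READY_TAG, MPI_COMM_WORLD);
>>           MPI_Recv(&item, 1, MPI_INT, 0, WORK_TAG,
>>                    MPI_COMM_WORLD, &status);
>>           do_work(item);
>>       }
>>   }
>>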
>> However, as stated earlier, on the office LAN setup it runs as
>> expected: the master promptly acts on a work request whenever a
>> slave sends one.
>>
>> Any ideas or suggestions on the cause and/or a possible solution?
>>
>> _______________________________________________
>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
>
>
> --
> {+} Jeff Squyres
> {+} The Open MPI Project
> {+} http://www.open-mpi.org/
>
>