LAM/MPI General User's Mailing List Archives

From: Douglas Vechinski (douglas.vechinski_at_[hidden])
Date: 2005-12-07 09:55:14


I'm having a problem with apparent stalling on one platform but not on
another. First, some background on the code. It uses a master/slave
design: one master and several slaves. The slaves work independently of
one another, and there is no communication between them. Each slave
communicates only with the master, stating that it is ready for
something to work on. It sends a message to the master (which is waiting
in an MPI_WAITANY). When the master receives a message, it determines
which slave is requesting work, sends that slave the index of the next
item to be worked on, and goes back to the MPI_WAITANY statement.
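
The request/dispatch protocol described above can be sketched roughly as
follows. This is not the actual application code and it does not use MPI:
it is a minimal Python simulation in which threads and queues stand in
for the MPI processes and messages, and the blocking get() on the shared
request queue stands in for the master's MPI_WAITANY. The names (master,
worker, run, STOP) and the "no more work" sentinel are assumptions made
for illustration only.

```python
import queue
import threading

STOP = -1  # hypothetical sentinel index meaning "no more work"


def master(requests, replies, num_items):
    """Hand out item indices 0..num_items-1 to whichever worker asks next."""
    next_item = 0
    active = len(replies)
    while active > 0:
        # Block until any worker reports ready (stand-in for MPI_WAITANY).
        worker_id = requests.get()
        if next_item < num_items:
            replies[worker_id].put(next_item)
            next_item += 1
        else:
            replies[worker_id].put(STOP)
            active -= 1


def worker(worker_id, requests, replies, done):
    """Ask the master for work until it answers with STOP."""
    while True:
        requests.put(worker_id)          # "I am ready for something to work on"
        item = replies[worker_id].get()  # wait for the master's answer
        if item == STOP:
            return
        done.append((worker_id, item))   # placeholder for the real computation


def run(num_workers=4, num_items=10):
    """Run one master and num_workers workers; return (worker, item) pairs."""
    requests = queue.Queue()
    replies = [queue.Queue() for _ in range(num_workers)]
    done = []
    threads = [
        threading.Thread(target=worker, args=(w, requests, replies, done))
        for w in range(num_workers)
    ]
    for t in threads:
        t.start()
    master(requests, replies, num_items)
    for t in threads:
        t.join()
    return done
```

Calling run(4, 10) returns one (worker, item) pair per work item. Which
worker gets which item varies from run to run, but every index is handed
out exactly once, and no worker waits on any other worker.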

I developed and tested this application using several Linux machines on
our local office LAN. Different flavors/versions of Linux were used but
all had LAM 7.1.1 on them. The code seems to be working fine here.

However, the system where this code will be exercised a lot is a Beowulf
cluster with 8 nodes and 2 processors per node. I would say this
cluster is not really a Beowulf cluster anymore, since bproc is neither
running nor installed. It is really 8 separate PCs mounted on a rack,
connected by Ethernet, with a single keyboard and monitor. LAM 7.0.4 is
installed on this machine. A common filesystem is mounted across the
nodes. I now observe the following behavior for my parallel code.

After the master and slaves go through their initial setup, the slaves
send their initial requests to the master. The master handles one or two
of the requests. Those one or two slaves get their info and begin to
work. All the other slaves sit idle, even though they have sent their
requests and the master is sitting at the MPI_WAITANY call. Not until
the first one or two slaves finish their current task, which can take
from a few minutes to tens of minutes, and request the next item to work
on, do the other slaves receive their direction from the master. Then,
if the slaves that ran first finish sooner and get the next item, the
other slaves stall after finishing their current task. Again, even
though they send a request to the master, the master doesn't act on it
until the first slaves finish their current task. So a lot of time is
being wasted with no computation being performed. I've tried running the
master and slaves on different CPUs and nodes and get similar results.
(From one point of view, the code does run. It's just that a lot of
slaves sit idle for long periods because the master is not promptly
receiving requests that the slaves have already sent.)

However, as stated earlier, on the office LAN setup, it runs as
expected. The master promptly acts upon a work request from a slave
whenever a slave sends such a request.

Any ideas or suggestions on the cause and/or a possible solution?

Any