
LAM/MPI General User's Mailing List Archives


From: Douglas Vechinski (douglas.vechinski_at_[hidden])
Date: 2005-09-22 16:56:04


I created an MPI application that uses two different executables, 1
master and N slaves. I start off the process by running "mpirun -np 1
master". The master reads in a config file which specifies the number
of slaves (N) and the name of the slave executable. The master does
some other initialization and then uses MPI_Comm_spawn to spawn off N
slaves using the name of the slave executable provided from the input
file. The slaves then ask the master for work, which the master hands
out based on the problem I am solving. Breaking it up
this way, the master is a really small code and doesn't have all the
processing code that the slaves do.

Now this all works fine and dandy when I run it in a parallel
environment that is basically one machine with many processors all the
same type and a common directory structure.

However, I am now presented with a different environment: several Linux
PCs on a common network, not necessarily with the same version/flavor
of Linux on each machine. Right now I am assuming that one machine will
have an exported filesystem that all other machines will be able to
mount, and that all input/output files will live on this filesystem.

I am now trying to figure out whether it is possible to run this app
using the LAM environment. The main problem I am having is with the name
of the slave that is passed to MPI_Comm_spawn. The slaves may now be
different (same source just compiled separately on the different
machines). These executables may be in different locations on the
different machines.

I first tried setting an environment variable (SLAVE) on each machine
specifying the directory of the slave on each machine. Then when I
spawn I used "$SLAVE/slave" as the first argument thinking that $SLAVE
would maybe get expanded on the remote hosts. This didn't seem to
work.
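
For illustration only, here is a minimal sketch of expanding the variable by hand before the spawn call. It assumes that MPI_Comm_spawn passes the command string along literally, with no shell expansion (that is an assumption, not something confirmed for LAM here). The helper name expand_leading_env is hypothetical and is not part of MPI or LAM.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Expand a leading "$NAME" in `in` using the calling process's environment.
 * Returns 0 on success, -1 if the variable is unset or `out` is too small.
 * Hypothetical helper, not part of MPI or LAM.
 */
int expand_leading_env(const char *in, char *out, size_t outlen)
{
    if (in[0] != '$') {
        /* Nothing to expand: copy the string through unchanged. */
        int n = snprintf(out, outlen, "%s", in);
        return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
    }

    /* The variable name is the run of letters, digits and '_' after '$'. */
    size_t len = 0;
    while (isalnum((unsigned char)in[1 + len]) || in[1 + len] == '_')
        len++;

    char name[128];
    if (len == 0 || len >= sizeof name)
        return -1;
    memcpy(name, in + 1, len);
    name[len] = '\0';

    const char *val = getenv(name);
    if (val == NULL)
        return -1;

    /* Variable value followed by the rest of the path, e.g. "/slave". */
    int n = snprintf(out, outlen, "%s%s", val, in + 1 + len);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

The expanded string would then be given to MPI_Comm_spawn in place of "$SLAVE/slave". Note that getenv reads the master's environment, so this yields the master machine's value of SLAVE for every slave. It does not by itself solve the case where the directory differs from host to host.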

Next I tried modifying my path on each node so that the slave
executable is visible in my path and then just supplied the slave name
"slave" to the spawn function. This ran but all the slave processes
started and were running on the machine where I started the mpirun
command. None ran on any of the remote machines.
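
One direction that might be worth trying is MPI_Comm_spawn_multiple, which takes one command string and one MPI_Info object per entry. The MPI-2 standard reserves info keys such as "host" and "path" for spawn calls, but implementations are not required to honor them, so whether LAM does is an assumption to check against its documentation. The sketch below only builds the per-machine command strings. The struct, function names, host names and directories are all hypothetical placeholders, and the MPI calls appear only in a comment.

```c
#include <stdio.h>
#include <string.h>

#define MAX_PATH_LEN 256

/* One entry per machine: its name and where its slave binary lives. */
struct slave_site {
    const char *host;      /* value intended for the "host" info key */
    const char *slave_dir; /* directory holding that machine's slave build */
};

/*
 * Fill `cmd` with "<slave_dir>/<exe>" for one site.
 * Returns 0 on success, -1 if the result would not fit in `cmd`.
 */
int build_slave_command(const struct slave_site *site, const char *exe,
                        char *cmd, size_t cmdlen)
{
    int n = snprintf(cmd, cmdlen, "%s/%s", site->slave_dir, exe);
    return (n < 0 || (size_t)n >= cmdlen) ? -1 : 0;
}

/*
 * Build one command string per site. An array of char* pointing at these
 * strings is what would be passed as array_of_commands, roughly:
 *
 *   MPI_Info_create(&infos[i]);
 *   MPI_Info_set(infos[i], "host", sites[i].host);
 *   ...
 *   MPI_Comm_spawn_multiple(nsites, cmd_ptrs, MPI_ARGVS_NULL, maxprocs,
 *                           infos, 0, MPI_COMM_SELF, &intercomm,
 *                           MPI_ERRCODES_IGNORE);
 */
int build_all_commands(const struct slave_site *sites, int nsites,
                       const char *exe, char cmds[][MAX_PATH_LEN])
{
    for (int i = 0; i < nsites; i++) {
        if (build_slave_command(&sites[i], exe, cmds[i], MAX_PATH_LEN) != 0)
            return -1;
    }
    return 0;
}
```

The site table could come from the same config file the master already reads, so the master stays small and each machine keeps its own separately compiled slave.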

I'm looking for suggestions on how I might get this setup to work with
LAM. Or do I need to consider combining the master and slaves into a
single code and letting rank 0 take a different branch and control the
slaves that way? This would mean that the master carries all the
processing code that the slaves have, which would go unused, and
requests a whole bunch of memory that would also go unused.