As an aside about your desire to have nice'd LAM jobs...
Although LAM/MPI 7.0 has more facilities for running multiple distinct
mpiruns at the same time, it is not wise to have them overlap on the same
processors of your cluster. The wise choice is to make sure the second
and third mpiruns select a disjoint set of processors to run on, i.e. to
do space-sharing. Time-sharing a Beowulf cluster is in most cases
counterproductive unless a gang scheduler is configured to handle global
context switches between the overlapping parallel jobs.
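For example, space-sharing can be done by giving each mpirun a disjoint node
range (node names follow LAM's nX convention; the eight-node layout and the
program names alice/bob here are hypothetical):

```shell
lamboot hostfile        # boot the whole cluster once
mpirun n0-3 ./alice &   # first job gets nodes 0-3
mpirun n4-7 ./bob &     # second job gets nodes 4-7 -- disjoint, no time-sharing
```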
Even for a nice 19 process, the Linux scheduler guarantees it will
get some CPU time. This may seem like a good thing, and in general it
is, but for overlapping parallel programs it can dramatically slow
down the execution of both jobs. Call the two concurrent, overlapping
parallel jobs Alice and Bob (A and B). At any point in time (without a
gang scheduler), the cluster will be running portions of Alice as well
as portions of Bob. More importantly, there will be portions of Alice
and portions of Bob that are NOT running.
For example, if Bob is communication intensive (even just moderately),
the portion of Bob that is not running is not available to send or
receive messages with the portion that is currently running. This
forces the running portion of Bob to block in MPI sends if there isn't
enough buffer space to hold the outgoing messages, and it will
obviously block if it has to wait to receive a message from the
non-running portion. The slowdown from this effect can totally
overwhelm the speedup achieved from parallel processing.
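To see the magnitude of the effect, here is a toy Python simulation (not LAM
code; the two-job setup, the all-or-nothing barrier per step, and the
coin-flip per-CPU scheduler are my simplifying assumptions). A job advances
only during a timeslice in which all of its processes happen to be scheduled
at once, which is exactly what uncoordinated time-sharing fails to provide:

```python
import random

def run(num_cpus, steps, gang):
    """Count the timeslices needed for two barrier-synchronized jobs,
    A and B, sharing num_cpus processors.  A job completes one step
    only during a slice in which ALL of its processes run at once."""
    random.seed(0)  # deterministic demo
    done = {"A": 0, "B": 0}
    slices = 0
    while done["A"] < steps or done["B"] < steps:
        if gang:
            # global context switch: the whole cluster runs one job
            job_now = "A" if slices % 2 == 0 else "B"
            running = [job_now] * num_cpus
        else:
            # each CPU's scheduler picks a job independently
            running = [random.choice("AB") for _ in range(num_cpus)]
        for job in ("A", "B"):
            if done[job] < steps and all(r == job for r in running):
                done[job] += 1
        slices += 1
    return slices

print(run(4, 10, gang=True))   # 20 slices: the jobs alternate perfectly
print(run(4, 10, gang=False))  # far more: all 4 CPUs agree only rarely
```

With gang scheduling the two jobs simply alternate and finish in 2x the
slices of one job; without it, a sync step succeeds only when every CPU's
independent choice lines up, which gets exponentially rarer as the node
count grows.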
The take-home message is that unless you are using a gang scheduler,
you do not want to try to time-share the nodes of a Beowulf cluster.
--
Tim Mattox - tmattox_at_[hidden] - http://home.earthlink.net/~timattox
http://aggregate.org/KAOS/ - http://advogato.org/person/tmattox/
On Tue, 20 May 2003, Jerome BENOIT wrote:
> Hello LAM,
>
> I noticed the following behaviour when a LAM session
> is initialized with a given priority
> [ nice -n 15 lamboot <lamboot_args> ]:
> the processes running on the original node
> inherit the priority,
> but the processes running on the other nodes
> get the normal priority.
> This behaviour seems odd:
> it would be nice to allow `lamboot' to propagate
> its priority through the cluster.
> That way the user could easily set a priority
> for each LAM_SESSION.
> This makes sense, since now (LAM 7.0) several LAM sessions can be
> initialized (provided different LAM_SESSION_SUFFIX values are set).
>
> I hope that helps,
> Jerome
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>