LAM/MPI General User's Mailing List Archives

From: Bob Felderman (feldy_at_[hidden])
Date: 2005-03-10 05:12:48


The race problem I reported earlier this week (using usysv) appears to be
related to the implementation of

share/ssi/coll/smp/src/ssi_coll_smp_allreduce.c

Between tests, the Pallas benchmarks execute Barrier() and then
set up a new communicator. This leads to all processes calling
MPI_Comm_split, which is implemented using MPI_Allreduce.
On a system using usysv, that call goes through lam_ssi_coll_smp_allreduce:

C lam_ssi_coll_lam_basic_allreduce, FP=bffff7b8
C lam_ssi_coll_smp_allreduce, FP=bffff7e8
C MPI_Allreduce, FP=bffff828
C lam_coll_alloc_intra_cid, FP=bffff868
C MPI_Comm_split, FP=bffff8b8
C Set_Communicator, FP=bffff8d8
C Init_Communicator, FP=bffff928
C main, FP=bffffa18
     __libc_start_main, FP=bffffa38
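
For reference, the pattern that gets us onto this path boils down to just
a few lines. This is my own minimal sketch of what Pallas does between
tests (the split color here is arbitrary), not Pallas code itself:

#include <mpi.h>

int main(int argc, char **argv)
{
  int rank;
  MPI_Comm newcomm;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Between tests: barrier, then build a new communicator.
     MPI_Comm_split ends up in MPI_Allreduce to agree on a CID,
     which is where the smp coll module gets involved. */
  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &newcomm);

  MPI_Comm_free(&newcomm);
  MPI_Finalize();
  return 0;
}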

lam_ssi_coll_smp_allreduce has a threshold test of 512 bytes. If
count*size is less than 512, you get nonassoc_allreduce, which is defined
in ssi_coll_smp_allreduce.c and is meant for short messages; otherwise
lam_ssi_coll_lam_basic_allreduce is called. The size of this communication
is 65536 bytes, since that is the default size of the max_CID.
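
To make the routing concrete, here is a little stand-alone sketch (mine,
not LAM source; pick_allreduce is just an illustrative helper) of that
decision with the sizes involved:

#include <stdio.h>

/* Restates the crossover test from lam_ssi_coll_smp_allreduce with the
   numbers from this report. */
static const char *pick_allreduce(long bytes, long crossover)
{
  return (bytes < crossover) ? "nonassoc_allreduce (short MagPIe)"
                             : "lam_basic allreduce";
}

int main(void)
{
  long cid_reduce_bytes = 65536;   /* the MPI_Comm_split CID reduction */

  printf("default crossover (512): %s\n",
         pick_allreduce(cid_reduce_bytes, 512));
  printf("crossover > 65536:       %s\n",
         pick_allreduce(cid_reduce_bytes, 131072));
  return 0;
}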

If I increase the threshold beyond 65536, I do not encounter the race.
I'm still trying to locate exactly what the problem is. I keep thinking
I can turn off the fl_block flag and avoid the problem, but that doesn't
seem to work.

I've attached the callstack tracebacks for the 6 processes in case someone
in-the-know can recognize what is happening. (Three dual-processor boxes
running cpu=2.) My gut feel is that the implementation of blocking is
somehow broken because I see that the processes are in _mpi_req_advance()
but they end up getting blocked down at the lowest layer of the different
communication devices (one is in tcp/read and the other is in the sysv
"spinlock"). I would think it is "dangerous" to block in an individual rpi
when things might be happening on the other rpi in a dual-rpi environment.
It is hard to make progress on the socket if you are stuck in the spinlock
and hard to make progress on the shmem if you are blocked in the kernel
waiting for a socket read to complete.
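
To illustrate what I mean, here is a toy sketch (my own -- it has nothing
to do with the actual rpi interface; poll_tcp and poll_usysv are made-up
stand-ins) of a progress loop that visits each device without blocking
inside any one of them:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef bool (*rpi_poll_fn)(void);   /* returns true once the request completes */

/* Made-up stand-ins for per-device polling entry points. */
static bool poll_tcp(void)   { return false; }  /* e.g. nonblocking read()/select() */
static bool poll_usysv(void) { return true;  }  /* e.g. test the shared-memory flag once */

int main(void)
{
  rpi_poll_fn devices[] = { poll_tcp, poll_usysv };
  bool done = false;

  /* Visit every device on each pass; never park inside a single
     device's blocking primitive (kernel read() or the sysv spinlock)
     while the other device might have work to do. */
  while (!done)
    for (size_t i = 0; i < sizeof(devices) / sizeof(devices[0]); ++i)
      if (devices[i]())
        done = true;

  printf("request completed\n");
  return 0;
}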

int
lam_ssi_coll_smp_allreduce(void *sbuf, void *rbuf, int count,
                           MPI_Datatype dtype, MPI_Op op, MPI_Comm comm)
{
  lam_ssi_coll_data_t *lcd = comm->c_ssi_coll_data;

  /* If this communicator was marked to be associative, use the wide
     area optimal associative algorithm. */

  if (lam_ssi_coll_base_get_param(comm, LAM_MPI_SSI_COLL_ASSOCIATIVE) == 1 &&
      op->op_commute == 1)
    return assoc_allreduce(sbuf, rbuf, count, dtype, op, comm);

  /* Otherwise, look at how many bytes will be transferred by each
     process to determine whether to use the lam_basic algorithm, or
     the short MagPIe algorithm */

  if ((count * dtype->dt_size) < lcd->lcd_reduce_crossover) /* <========= 512-byte threshold */
    return nonassoc_allreduce(sbuf, rbuf, count, dtype, op, comm);
  else
    return lb_functions.lsca_allreduce(sbuf, rbuf, count, dtype,
                                       op, comm);
}