LAM/MPI General User's Mailing List Archives

From: Ramon Diaz-Uriarte (rdiaz02_at_[hidden])
Date: 2007-02-20 05:41:13


Dear All,

We are using LAM/MPI, and most of the time everything runs just fine.
However, when there are many slaves (say, 20 or more slaves per
computing node, resulting from many simultaneous MPI runs), I start
receiving LAM/MPI errors; the log files show

Host xyz Rank (..)
Rank (1, MPI_COMM_WORLD): Call stack within LAM:
Rank (1, MPI_COMM_WORLD): - MPI_Recv()
Rank (1, MPI_COMM_WORLD): - MPI_Bcast()
Rank (1, MPI_COMM_WORLD): - MPI_Allgather()
Rank (1, MPI_COMM_WORLD): - main()

LAM/MPI is being used from R, and the R files show:

"MPI_Error_string: error spawning process"

I can easily reproduce this by launching a given job many times
simultaneously. If I launch only one or a few instances of the very
same job in this same LAM universe, things work OK. So it seems
related to having many slaves per node, or many slaves being handled
by a lamd daemon, or similar.

In the past, I ran many LAM universes in parallel (using the
LAM_MPI_SESSION_SUFFIX environment variable), so that each LAM daemon
only had to handle a few slaves. However, this did not avoid the
problem, and in fact it created others (such as being unable to boot
a LAM universe when many universes were lambooted at almost the same
time).
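
For reference, the multi-universe setup described above can be sketched
roughly as follows. This is only an illustration, not the scripts actually
used: the helper names (session_env, boot_universes), the hostfile argument
and the 5-second delay are all assumptions. It gives each universe its own
LAM_MPI_SESSION_SUFFIX and boots them one after another with a pause,
rather than almost at the same time. It does not address the spawn error
itself.

```python
import os
import subprocess
import time


def session_env(suffix, base_env=None):
    """Return a copy of the environment with LAM_MPI_SESSION_SUFFIX set."""
    env = dict(os.environ if base_env is None else base_env)
    env["LAM_MPI_SESSION_SUFFIX"] = suffix
    return env


def boot_universes(suffixes, hostfile, delay=5.0,
                   run=subprocess.call, sleep=time.sleep):
    """Boot one LAM universe per suffix, sequentially, pausing between boots.

    `run` and `sleep` are injectable so the logic can be checked without
    LAM installed. Returns a dict mapping suffix -> lamboot exit status.
    """
    results = {}
    for i, suffix in enumerate(suffixes):
        if i > 0:
            sleep(delay)  # stagger the boots instead of starting them together
        # Each lamboot sees a different session suffix (hypothetical values),
        # so each one starts a separate universe.
        results[suffix] = run(["lamboot", hostfile], env=session_env(suffix))
    return results
```

Something like boot_universes(["job1", "job2"], "lamhosts") would then boot
two universes five seconds apart; a job meant for one of them would need
the same suffix in its environment.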

Could this be a latency problem (this is a cluster running Linux with
a non-stellar gigabit ethernet)? Are there any ways to avoid the
problem?

Thanks,

R.

-- 
Ramon Diaz-Uriarte
Statistical Computing Team
Structural Biology and Biocomputing Programme
Spanish National Cancer Centre (CNIO)
http://ligarto.org/rdiaz