This is caused by internal resource limits in the LAM daemon. The
LAM daemons were not designed to manage that many processes on one
node. Unfortunately, this will not be fixed in LAM: it would require
a large number of changes, and we are currently focusing all our
development work on Open MPI. The best I can suggest is to spawn
fewer processes per node.
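
As a rough illustration of that workaround (this is not something LAM
itself provides), a launcher script could cap how many jobs run at
once. The sketch below is in Python; the names run_throttled and
MAX_CONCURRENT_JOBS, the limit of 4, and the job commands are all made
up for the example, so adjust them to whatever your nodes tolerate:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cap on simultaneous jobs; not a value LAM documents.
MAX_CONCURRENT_JOBS = 4


def run_job(cmd):
    """Run one job command to completion and return its exit code."""
    return subprocess.run(cmd).returncode


def run_throttled(commands, max_concurrent=MAX_CONCURRENT_JOBS, runner=run_job):
    """Run every command, with at most max_concurrent running at once.

    Exit codes are returned in the same order as the commands.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(runner, commands))


if __name__ == "__main__":
    # Placeholder jobs; substitute the real command that starts each run.
    jobs = [[sys.executable, "-c", "pass"] for _ in range(10)]
    print(run_throttled(jobs))
```

With something like this, the extra jobs wait in a queue until a slot
frees up, so the daemons never see more than the chosen number of
simultaneous runs from this launcher.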
Brian
On Feb 20, 2007, at 3:41 AM, Ramon Diaz-Uriarte wrote:
> Dear All,
>
> We are using LAM/MPI, and most of the time everything runs just fine.
> However, when there are a bunch of slaves (say, 20 or more slaves per
> computing node that result from many simultaneous MPI runs), I start
> receiving LAM MPI errors; the log files show
>
> Host xyz Rank (..)
> Rank (1, MPI_COMM_WORLD): Call stack within LAM:
> Rank (1, MPI_COMM_WORLD): - MPI_Recv()
> Rank (1, MPI_COMM_WORLD): - MPI_Bcast()
> Rank (1, MPI_COMM_WORLD): - MPI_Allgather()
> Rank (1, MPI_COMM_WORLD): - main()
>
>
> LAM/MPI is being used from R, and the R files show:
>
> "MPI_Error_string: error spawning process"
>
> I can easily reproduce this by launching a given job many times
> simultaneously. If I launch only one or a few instances of the very
> same job in this same LAM universe, things work OK. So it seems
> related to having many slaves per node, or many slaves being handled
> by a lamd daemon, or similar.
>
>
> In the past, I used to have many LAM universes in parallel (using the
> LAM_MPI_SESSION_SUFFIX env. variable). This way, each lam daemon would
> only have to handle a few slaves. However, this did not avoid the
> problems, and in fact created other problems (such as not being able
> to boot a LAM universe when many universes were being lambooted at
> almost the same time).
>
>
> Could this be a latency problem (this is a cluster running Linux with
> a non-stellar gigabit ethernet)? Are there any ways to avoid the
> problem?
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/