On 2/20/07, Brian Barrett <brbarret_at_[hidden]> wrote:
> This is a problem with the LAM daemon and internal resource
> limitations. The LAM daemons were not intended to be used to run
> that many processes on one node. Unfortunately, this will not be
> fixed in LAM as it would require a large number of changes and we are
> currently focusing all our development work on Open MPI. The best I
> can suggest is to not spawn as many processes per node.
>
> Brian
Dear Brian,
Thanks for the reply. At least this will put a stop to my endless
search for "what did I screw up".
Two questions, though:
1. Am I less likely to run into these problems if I switch to Open MPI?
2. Would running several lamds per node (and thus far fewer slaves per
daemon) help?
Thanks,
R.
>
> On Feb 20, 2007, at 3:41 AM, Ramon Diaz-Uriarte wrote:
>
> > Dear All,
> >
> > We are using LAM/MPI, and most of the time everything runs just fine.
> > However, when there are many slaves (say, 20 or more slaves per
> > computing node, resulting from many simultaneous MPI runs), I start
> > receiving LAM/MPI errors; the log files show
> >
> > Host xyz Rank (..)
> > Rank (1, MPI_COMM_WORLD): Call stack within LAM:
> > Rank (1, MPI_COMM_WORLD): - MPI_Recv()
> > Rank (1, MPI_COMM_WORLD): - MPI_Bcast()
> > Rank (1, MPI_COMM_WORLD): - MPI_Allgather()
> > Rank (1, MPI_COMM_WORLD): - main()
> >
> >
> > LAM/MPI is being used from R, and the R files show:
> >
> > "MPI_Error_string: error spawning process"
> >
> > I can easily reproduce this by launching a given job many times
> > simultaneously. If I launch only one or a few instances of the very
> > same job in this same LAM universe, things work fine. So the problem
> > seems related to having many slaves per node, or many slaves being
> > handled by a single lamd daemon, or something similar.
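> >
> > Concretely, something along these lines is enough to trigger it (the
> > script name and the count of 20 are placeholders for our actual jobs):
> >
> >   for i in $(seq 1 20); do
> >     R CMD BATCH my_mpi_job.R out_$i.Rout &   # 20 simultaneous copies
> >   done
> >   wait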
> >
> >
> > In the past, I used to run many LAM universes in parallel (using the
> > LAM_MPI_SESSION_SUFFIX env. variable), so that each lamd would only
> > have to handle a few slaves. However, this did not avoid the problems,
> > and in fact created new ones (such as failing to boot a LAM universe
> > when many universes were lambooted at almost the same time).
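> >
> > That setup looked roughly like this (the hostfile and job names are
> > placeholders; the suffix keeps each universe's lamds separate):
> >
> >   export LAM_MPI_SESSION_SUFFIX=job42   # unique suffix for this job
> >   lamboot my_hostfile                   # boots lamds private to this suffix
> >   mpirun -np 4 ./my_job                 # runs only inside this universe
> >   lamhalt                               # tears down only this universe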
> >
> >
> > Could this be a latency problem (this is a cluster running Linux with
> > non-stellar Gigabit Ethernet)? Are there any ways to avoid the
> > problem?
>
> --
> Brian Barrett
> LAM/MPI developer and all around nice guy
> Have a LAM/MPI day: http://www.lam-mpi.org/
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
--
Ramon Diaz-Uriarte
Statistical Computing Team
Structural Biology and Biocomputing Programme
Spanish National Cancer Centre (CNIO)
http://ligarto.org/rdiaz