LAM/MPI General User's Mailing List Archives

From: Andrew Friedley (afriedle_at_[hidden])
Date: 2006-04-11 15:54:14


Jeffrey B. Layton wrote:
> Good morning,
>
> I'm having a problem starting an MPI code that was
> built with PGI 6.1 and LAM-7.1.2. I get the following
> messages when I try to start the code:
>
>
> n-1<24201> ssi:boot:base:linear: booting n0 (n2004)
> n-1<24201> ssi:boot:base:linear: booting n1 (n2005)
> n-1<24201> ssi:boot:base:linear: booting n2 (n2006)
> n-1<24201> ssi:boot:base:linear: booting n3 (n2007)
> n-1<24201> ssi:boot:base:linear: booting n4 (n2008)
> n-1<24201> ssi:boot:base:linear: booting n5 (n2009)
> n-1<24201> ssi:boot:base:linear: booting n6 (n2010)
> n-1<24201> ssi:boot:base:linear: booting n7 (n2011)
> n-1<24201> ssi:boot:base:linear: finished
> -----------------------------------------------------------------------------
> It seems that [at least] one of the processes that was started with
> mpirun chose a different RPI than its peers. For example, at least
> the following two processes mismatched in their RPI selections:
>
> MPI_COMM_WORLD rank 0: tcp (v7.1.0)
> MPI_COMM_WORLD rank 3: usysv (v7.1.0)
>
> All MPI processes must choose the same RPI module and version when
> they start. Check your SSI settings and/or the local environment
> variables on each node.
>
>
> I'm using PBS to start the job and here are the relevant parts of
> the script:
>
> NET=tcp
>
> lamboot -b -v -ssi rpi $NET $PBS_NODEFILE
> mpirun -O -v C ./${EXE} >> ${OUTFILE}
> lamhalt -v
>
>
> where $EXE and $OUTFILE are defined in the script. Any ideas?

I'm not sure what is going on. One possibility is that different LAM
installations are getting mixed up across the nodes. Can you run 'laminfo'
on each node with the same environment settings (particularly PATH and
LD_LIBRARY_PATH) to confirm that every node is using the same LAM
installation? Also, make sure that each node has the same rpi components
available.
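
One way to run that check across all nodes is sketched below. This is only
an illustration under some assumptions: the nodes are reachable with ssh
(or whatever command you put in REMOTE_SH), and the names REMOTE_SH and
check_lam_nodes are made up for this example. It runs plain 'laminfo' on
each unique host in a node file and prints a checksum of the output, so a
node with a different installation or module list shows up as a different
checksum.

```shell
#!/bin/sh
# Hypothetical helper, not part of LAM. REMOTE_SH is an assumed knob for
# the remote-shell command; set it to rsh if that is what your cluster uses.
REMOTE_SH=${REMOTE_SH:-ssh}

check_lam_nodes() {
    # $1 is a file with one hostname per line (e.g. $PBS_NODEFILE).
    # A node file may list a host more than once, so de-duplicate first.
    sort -u "$1" | while read -r node; do
        # Record which laminfo is found first in PATH, plus its full output,
        # and reduce both to a checksum. stdin is redirected so the remote
        # shell does not swallow the rest of the node list.
        sum=$($REMOTE_SH "$node" 'command -v laminfo; laminfo 2>&1' </dev/null | cksum)
        echo "$sum $node"
    done
}
```

Calling check_lam_nodes "$PBS_NODEFILE" from inside the PBS script prints
one line per node; any node whose checksum differs from the others is worth
inspecting by hand with 'laminfo'. Note that a non-interactive remote shell
may not get the same PATH and LD_LIBRARY_PATH as your login shell, which is
itself one way installations can get mixed up.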

Andrew