Jeffrey B. Layton wrote:
> Good morning,
>
> I'm having a problem starting an MPI code that was
> built with PGI 6.1 and LAM-7.1.2. I get the following
> messages when I try to start the code:
>
>
> n-1<24201> ssi:boot:base:linear: booting n0 (n2004)
> n-1<24201> ssi:boot:base:linear: booting n1 (n2005)
> n-1<24201> ssi:boot:base:linear: booting n2 (n2006)
> n-1<24201> ssi:boot:base:linear: booting n3 (n2007)
> n-1<24201> ssi:boot:base:linear: booting n4 (n2008)
> n-1<24201> ssi:boot:base:linear: booting n5 (n2009)
> n-1<24201> ssi:boot:base:linear: booting n6 (n2010)
> n-1<24201> ssi:boot:base:linear: booting n7 (n2011)
> n-1<24201> ssi:boot:base:linear: finished
> -----------------------------------------------------------------------------
> It seems that [at least] one of the processes that was started with
> mpirun chose a different RPI than its peers. For example, at least
> the following two processes mismatched in their RPI selections:
>
> MPI_COMM_WORLD rank 0: tcp (v7.1.0)
> MPI_COMM_WORLD rank 3: usysv (v7.1.0)
>
> All MPI processes must choose the same RPI module and version when
> they start. Check your SSI settings and/or the local environment
> variables on each node.
>
>
> I'm using PBS to start the job and here are the relevant parts of
> the script:
>
> NET=tcp
>
> lamboot -b -v -ssi rpi $NET $PBS_NODEFILE
> mpirun -O -v C ./${EXE} >> ${OUTFILE}
> lamhalt -v

One more possible solution: '-ssi rpi $NET' should go on the mpirun
line, not the lamboot line. lamboot ignores these parameters, and since
they aren't specified on the mpirun line either, all RPI components are
up for selection at mpirun time, which isn't what you want.
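
As a rough sketch only, the relevant part of the script might look like this with the RPI selection moved to the mpirun line. The run_job wrapper is a name made up for illustration; NET, EXE, OUTFILE and PBS_NODEFILE are taken from your script and are assumed to be set as before.

```shell
NET=tcp

# Hypothetical wrapper (not part of the original script): boot LAM, run the
# job with the RPI pinned on the mpirun line, then shut LAM down again.
run_job() {
    # No -ssi arguments here; lamboot ignores them anyway.
    lamboot -b -v "$PBS_NODEFILE"
    # Every rank is now told to use the same RPI module.
    mpirun -ssi rpi "$NET" -O -v C "./${EXE}" >> "${OUTFILE}"
    lamhalt -v
}
```

Calling run_job at the end of the PBS script then runs the three steps in order.
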

But still, why are two different RPI components being selected? It's
possible that the machine rank 0 is running on does not have enough
shared memory available to run usysv, so it fell back to tcp. First try
running 'lamclean' to clean up any allocations LAM may have left around.
If that doesn't cut it, try a lamhalt/lamboot. Finally, you might use
the 'ipcs' and 'ipcrm' commands, assuming they are available on your
system, to list and remove leftover shared memory segments by hand.
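
If it comes to the manual route, something along these lines could help. It is a sketch, not a tested recipe: the function names are made up, and the awk filter assumes the Linux 'ipcs -m' column layout (key, shmid, owner, ...), which may differ on other systems.

```shell
# Print the IDs of System V shared memory segments owned by the given user.
# Assumes the Linux `ipcs -m` layout: shmid in column 2, owner in column 3.
stale_shm_ids() {
    ipcs -m | awk -v user="$1" '$3 == user { print $2 }'
}

# Remove every segment listed for that user. This also hits segments that
# belong to the user's other running programs, so check the list first.
remove_stale_shm() {
    for id in $(stale_shm_ids "$1"); do
        ipcrm -m "$id"
    done
}
```

Run 'stale_shm_ids yourlogin' on the node first and look at what comes back; only call remove_stale_shm once you're sure nothing else of yours on that node still needs those segments.
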
Andrew