Sims, James S wrote:
> Thanks Mac. I think this helps. I am running the 64 bit version,
> but here is a detailed comparison of what works and what doesn't.
> If I do a qsub -I -l nodes=1:ppn=2
> lamboot
> mpirun -np 2 MPI_li_64
> in the Torque/PBS environment, the code dies with
> PID 10261 failed on node n0 (10.2.1.54) due to signal 11.
>
> If on the other hand, I don't use torque but run the same
> example,
> mpirun -np 2 MPI_li_64, the job runs. So I think it is
> something about the PBS environment that is causing the
> problem.
To which Tim Prince replied:
You would normally set your PATH and LD_LIBRARY_PATH in your PBS script,
so that you get the one you need. Lately I've gotten into a situation
where each phase of my PBS job requires a different MPI, so it has
become normal to wipe and set a new path for each mpirun.
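(For reference, a minimal PBS script along those lines might look like
the sketch below. The install prefix is taken from the file output
quoted later in this message; adjust it to whichever LAM you actually
want for that phase of the job.)

  #!/bin/sh
  #PBS -l nodes=1:ppn=2
  # Select the LAM/MPI install for this phase of the job
  # (prefix per the file output below; adjust as needed).
  export PATH=/usr/local/intel/lam/64/bin:$PATH
  export LD_LIBRARY_PATH=/usr/local/intel/lam/64/lib:$LD_LIBRARY_PATH
  cd $PBS_O_WORKDIR
  lamboot $PBS_NODEFILE
  mpirun -np 2 ./MPI_li_64
  lamhalt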
This is not the problem. I have further isolated it to the following:
I start an interactive qsub environment with
qsub -I -l nodes=1:x4gb
and then, on the node that I am given, I do a
lamboot $PBS_NODEFILE
Now in the directory where I have my 64 bit code, I run
./MPI_li_64
and everything works fine.
But if instead I do
mpirun -np 1 ./MPI_li_64
the code eventually fails with a segmentation violation, which
I can trace in the idb debugger to a perfectly valid piece
of code. So what does running it under mpirun do to mess this
up? Note that in this example the environment is the same for
the run that works and the one that doesn't.
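(One way to verify that beyond doubt is to capture what the process
launched by mpirun actually sees and diff it against the direct run.
LAM will generally launch a plain shell script as the "program", so a
hypothetical wrapper, call it wrap.sh, can record things and then exec
the real binary:

  #!/bin/sh
  # wrap.sh -- record the launched process's view, then run the code
  env | sort > /tmp/env.mpirun
  ulimit -a  > /tmp/limits.mpirun
  exec ./MPI_li_64

and from the interactive qsub shell:

  env | sort > /tmp/env.direct
  ulimit -a  > /tmp/limits.direct
  mpirun -np 1 ./wrap.sh
  diff /tmp/env.direct /tmp/env.mpirun
  diff /tmp/limits.direct /tmp/limits.mpirun

The variables LAM itself adds for the launched process will show up in
the diff and are expected; anything else that differs, e.g. PATH,
LD_LIBRARY_PATH, or a stack/data limit, is a lead.)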
file mpirun gives
/usr/local/intel/lam/64/bin/mpirun: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.4.0, dynamically linked (uses shared libs), not stripped
and file MPI_li_64 gives
/home/sims/hagstrom/MPI_li_forJim.DEVEL/MPI_li_64: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.4.0, dynamically linked (uses shared libs), not stripped
So what can mpirun be doing to cause this code to fail?
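(One thing that can differ even when the file output matches is which
shared libraries each executable actually resolves at run time. A
quick check with the standard ldd tool, using the prefix from the file
output above:

  ldd /usr/local/intel/lam/64/bin/mpirun
  ldd ./MPI_li_64

Both should resolve their LAM libraries, e.g. libmpi and liblam, from
the same /usr/local/intel/lam/64 tree; if MPI_li_64 picks them up from
some other MPI install, a segfault that appears only under mpirun is a
plausible outcome.)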