Hi,
You might try adding -v to the mpirun command in each case to
get more info about what mpirun is doing and see what the differences
are.
Mac
Houston
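
One way to make that comparison concrete is to save the verbose output of each run to a file and diff the two. This is only a sketch: capture_run and the log file names are made up for illustration, and the mpirun lines are left as comments because they need a booted LAM.

```sh
#!/bin/sh
# capture_run LOGFILE COMMAND [ARGS...]
# Hypothetical helper: run COMMAND, save stdout and stderr together
# in LOGFILE, and append the exit status so a crash is visible in the log.
capture_run() {
    log=$1
    shift
    "$@" >"$log" 2>&1
    status=$?
    echo "exit status: $status" >>"$log"
    return "$status"
}

# Inside the Torque/PBS job, after lamboot:
#   capture_run pbs.log mpirun -v -np 2 MPI_li_64
# Outside Torque, after lamboot:
#   capture_run plain.log mpirun -v -np 2 MPI_li_64
# Then compare the two logs:
#   diff plain.log pbs.log
```

Process IDs will differ between the two logs in any case; the interesting lines are the ones that differ in some other way.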
-----Original Message-----
From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On Behalf
Of Sims, James S. Dr.
Sent: Tuesday, July 21, 2009 11:13 PM
To: lam_at_[hidden]
Subject: LAM: Problem with 64 bit lam and intel
Sims, James S wrote:
> Thanks Mac. I think this helps. I am running the 64 bit version, but
> here is a detailed comparison of what works and what doesn't.
> If I do a qsub -I -l nodes=1:ppn=2
> lamboot
> mpirun -np 2 MPI_li_64
> in the torque/pbs environment, the code dies with PID 10261 failed on
> node n0 (10.2.1.54) due to signal 11.
>
> If on the other hand, I don't use torque but run the same example,
> mpirun -np 2 MPI_li_64, the job runs. So I think it is something about
> the PBS environment that is causing the problem.
To which Tim Prince replied:
You would normally set your PATH and LD_LIBRARY_PATH in your PBS script,
so that you get the MPI you need. Lately, I've been in the situation
where each phase of my PBS job requires a different MPI, so it seems
normal to wipe and set a new path for each mpirun.
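
A PBS script along those lines might look like the sketch below. The use_mpi function is a hypothetical helper, and the lib directory under the LAM prefix is an assumption (only the bin directory appears in this thread). Any compiler runtime libraries the program needs would have to be added to LD_LIBRARY_PATH as well.

```sh
#!/bin/sh
#PBS -l nodes=1:ppn=2

# Remember the paths as they were before any MPI was added, so that
# each phase can start again from a clean base.
BASE_PATH=$PATH
BASE_LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-}

# use_mpi PREFIX
# Hypothetical helper: reset PATH and LD_LIBRARY_PATH to the base values
# and put exactly one MPI installation in front of them.
use_mpi() {
    prefix=$1
    PATH=$prefix/bin:$BASE_PATH
    # Assumes the libraries live in PREFIX/lib.
    LD_LIBRARY_PATH=$prefix/lib${BASE_LD_LIBRARY_PATH:+:$BASE_LD_LIBRARY_PATH}
    export PATH LD_LIBRARY_PATH
}

use_mpi /usr/local/intel/lam/64
# lamboot "$PBS_NODEFILE"
# mpirun -np 2 ./MPI_li_64
# lamhalt
```

Because use_mpi always starts from the saved base values, calling it again with a different prefix replaces the previous MPI rather than stacking a second one on top of it.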
This is not the problem. I have further isolated it to the following:
I start an interactive qsub environment with qsub -I -l nodes=1:x4gb
and then on the node that I am given, I do a lamboot $PBS_NODEFILE.
Now in the directory where I have my 64 bit code, I run
./MPI_li_64
and everything works fine.
But if instead I do
mpirun -np 1 ./MPI_li_64
the code eventually fails with a segmentation violation which I can
trace in the idb debugger, and it is a perfectly valid piece of code. So
what is running it under mpirun doing to mess this up? Note that in this
example, the environment is the same for the example that works and the
one that doesn't.
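
A quick way to check that the two cases really see the same thing is to record the environment and resource limits once from the shell and once from a process started through LAM, and then diff the results. The sketch below is only an illustration: snapshot and the file names are made up, and launching the script with lamexec is an assumption to be checked against its man page. Limits such as the stack size may differ even when the exported variables match, since processes started by mpirun are launched by the LAM daemon rather than by the interactive shell.

```sh
#!/bin/sh
# snapshot FILE
# Hypothetical helper: record the resource limits and the sorted
# environment seen by the current process in FILE.
snapshot() {
    {
        echo "== limits =="
        ulimit -a
        echo "== environment =="
        env | sort
    } >"$1"
}

# Saved as snapshot.sh with a final line such as:  snapshot "$1"
# From the interactive shell:
#   ./snapshot.sh direct.txt
# Through LAM (assumed invocation, see the lamexec man page):
#   lamexec n0 ./snapshot.sh lam.txt
# Then:
#   diff direct.txt lam.txt
```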
file mpirun gives
/usr/local/intel/lam/64/bin/mpirun: ELF 64-bit LSB executable, AMD
x86-64, version 1 (SYSV), for GNU/Linux 2.4.0, dynamically linked (uses
shared libs), not stripped
and file MPI_li_64 gives
/home/sims/hagstrom/MPI_li_forJim.DEVEL/MPI_li_64: ELF 64-bit LSB
executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.4.0,
dynamically linked (uses shared libs), not stripped
So what can mpirun be doing to cause this code to fail?
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/