
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-07-22 13:11:05


Is there any chance you can upgrade to Open MPI?

On Jul 22, 2009, at 12:20 PM, Sims, James S. Dr. wrote:

> Thanks.
>
> Focusing on the one-processor case below, which fails: after starting
> an interactive Torque session with
> qsub -I -l nodes=1:x4gb
> (so presumably the environment variables are the same in both cases), if I
> do
> ./MPI_li_64, the job runs to completion. If on the other hand I do
> mpirun -v -np 1 ./MPI_li_64
>
> the job eventually fails with a segV (traced in the debugger) in
> perfectly valid code. The difference between running this command
> with and without the -v option is one additional line of output:
> 14201 ./MPI_li_64 running on n0 (o)
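>
> For reference, the two runs being compared, pulled together from the
> steps described here and in my message further down, are essentially:
>
>   qsub -I -l nodes=1:x4gb
>   lamboot $PBS_NODEFILE
>   ./MPI_li_64                   # runs to completion
>   mpirun -v -np 1 ./MPI_li_64   # eventually dies with a segV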
>
> Jim
> ________________________________________
> From: lam-bounces_at_[hidden] [lam-bounces_at_[hidden]] On Behalf Of
> McCalla, Mac [macmccalla_at_[hidden]]
> Sent: Wednesday, July 22, 2009 8:19 AM
> To: General LAM/MPI mailing list
> Subject: Re: LAM: Problem with 64 bit lam and intel
>
> Hi,
> You might try adding -v to the mpirun command in each case to
> get more info about what mpirun is doing and see what the differences
> are.
>
> Mac
> Houston
>
> -----Original Message-----
> From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On
> Behalf Of Sims, James S. Dr.
> Sent: Tuesday, July 21, 2009 11:13 PM
> To: lam_at_[hidden]
> Subject: LAM: Problem with 64 bit lam and intel
>
> Sims, James S wrote:
>
> > Thanks Mac. I think this helps. I am running the 64 bit version, but
> > here is a detailed comparison of what works and what doesn't.
> > If I do a qsub -I -l nodes=1:ppn=2
> > lamboot
> > mpirun -np 2 MPI_li_64
> > in the torque/pbs environment, the code dies with PID 10261 failed
> > on node n0 (10.2.1.54) due to signal 11.
> >
> > If on the other hand, I don't use torque but run the same example,
> > mpirun -np 2 MPI_li_64, the job runs. So I think it is something
> > about the PBS environment that is causing the problem.
>
> To which Tim Prince replied:
> You would normally set your PATH and LD_LIBRARY_PATH in your PBS
> script,
> so that you get the one you need. Lately, I've gotten into the situation
> where each phase of my PBS job requires a different MPI, so it seems
> normal to wipe and set a new path for each mpirun.
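>
> As a rough sketch of what that looks like (the install prefix below is
> only a guess taken from the mpirun path shown further down in this
> thread; adjust it to the build you actually want):
>
>   #!/bin/sh
>   #PBS -l nodes=1:ppn=2
>   # Select the LAM/MPI installation before lamboot/mpirun.
>   export PATH=/usr/local/intel/lam/64/bin:$PATH
>   export LD_LIBRARY_PATH=/usr/local/intel/lam/64/lib:$LD_LIBRARY_PATH
>   cd $PBS_O_WORKDIR
>   lamboot $PBS_NODEFILE
>   mpirun -np 2 ./MPI_li_64
>   lamhalt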
>
> This is not the problem. I have further isolated it to the following:
>
> I start an interactive qsub environment with qsub -I -l nodes=1:x4gb
>
> and then on the node that I am given, I do a lamboot $PBS_NODEFILE.
>
> Now in the directory where I have my 64 bit code, I run
> ./MPI_li_64
> and everything works fine.
> But if instead I do
> mpirun -np 1 ./MPI_li_64
> the code eventually fails with a segmentation violation, which I can
> trace in the idb debugger, and it is a perfectly valid piece of code.
> So what is running it under mpirun doing to mess this up? Note that in
> this example, the environment is the same for the case that works and
> the one that doesn't.
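>
> (One way to double-check that, as a sketch: dump the environment both
> ways and diff the results.
>
>   env | sort > direct.env
>   mpirun -np 1 /usr/bin/env | sort > under_mpirun.env
>   diff direct.env under_mpirun.env
>
> mpirun may warn afterwards that env never invoked MPI_INIT, but its
> output should still come through, so any variables mpirun adds or
> drops will show up in the diff.)
>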
> file mpirun gives
> /usr/local/intel/lam/64/bin/mpirun: ELF 64-bit LSB executable, AMD
> x86-64, version 1 (SYSV), for GNU/Linux 2.4.0, dynamically linked
> (uses shared libs), not stripped
> and file MPI_li_64 gives
> /home/sims/hagstrom/MPI_li_forJim.DEVEL/MPI_li_64: ELF 64-bit LSB
> executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.4.0,
> dynamically linked (uses shared libs), not stripped
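>
> (Since both are dynamically linked, it may also be worth checking, in
> the plain shell and again inside the Torque session, which mpirun and
> which shared libraries actually get picked up:
>
>   which mpirun
>   ldd ./MPI_li_64
>
> Any difference between the two settings would point back at PATH or
> LD_LIBRARY_PATH.)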
>
> So what can mpirun be doing to cause this code to fail?
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
Jeff Squyres
jsquyres_at_[hidden]