Hi Ben,
As you have discovered, LAM/MPI is sensitive to build/run environments.
I have always found it easier to install/build LAM and compile my apps
in a consistent environment. That is, whatever compiler environment was
used to build LAM is what I use to build the application, and whatever
LAM/MPI was used to compile/link the application is what I run the
application with. Any deviation from this policy invites strange and
inconsistent behavior at runtime (from either LAM or the application).
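One rough way to check this kind of consistency is to confirm that the
wrapper compilers and the runtime tools all resolve to the same
installation. The sketch below is illustrative only: same_install is a
hypothetical helper, and the tool names passed to it are an assumption
about what your LAM installation puts on the PATH.

```shell
#!/bin/sh
# Report whether all the named tools resolve to the same directory on PATH.
# same_install is a hypothetical helper, not part of LAM/MPI.
same_install() {
    first=""
    for tool in "$@"; do
        # command -v prints the path the shell would execute for this name.
        path=$(command -v "$tool") || { echo "$tool: not found"; return 1; }
        dir=$(dirname "$path")
        if [ -z "$first" ]; then
            first=$dir
        fi
        if [ "$dir" != "$first" ]; then
            echo "mismatch: $tool is in $dir, expected $first"
            return 1
        fi
    done
    echo "all tools in $first"
}

# Assumed tool names; adjust to the wrappers and launchers you actually use.
same_install mpicc mpic++ mpirun lamboot || true
```

A mismatch here does not prove anything is broken, but it is a cheap
first hint that the build and run environments have drifted apart.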
Having said that, if you possibly can, you should try downloading and
installing Open MPI. It is the successor to LAM/MPI, and all the LAM
developers have long since moved on. The user forum is quite active, and
you will find lots of help there.
Cheers,
Mac
in Houston
-----Original Message-----
From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On Behalf Of Benjamin Kaduk
Sent: Tuesday, July 27, 2010 1:48 PM
To: lam_at_[hidden]
Subject: LAM: runtime error with LAM and intel compilers (and "unused" ubuntu packages)
Hi all,
I have an application that I am trying to compile against LAM using the
intel compilers as a backend. This is actually coming as part of a
migration of our research group from an old login node for our cluster
to a more powerful login node; we had a working setup on the old machine
("aslan", ubuntu gutsy with intel compilers 9.1.047) and things worked
acceptably. The new machine ("caspian") is running ubuntu lucid with
the intel compilers 11.1/072. I wasn't around when aslan was set up, so
I don't know exactly what went into the custom fftw and
"lam-7.1.4.intel9.nomalloc" installations in /opt.
With the goal of figuring out what changes were needed (instead of just
blindly copying old binaries and hoping that they continue to work), I
started with the ubuntu packaged version of lam, which is based off of
7.1.2. I was able to modify the packaging to backend to the intel
compilers, but the resulting binaries proved unusable for me, as the
static libraries failed to link (the PMPI_Send family of symbols were
not defined). This seems to have been fixed in the 7.1.4 release, so I
elected to compile that from source. My configure line was:
configure --prefix=/opt/lam-7.1.4 --disable-shared \
    --with-rsh=/usr/bin/rsh \
    --with-memory-manager=external CC=icc CXX=icpc F77=ifort
However, I still had my (modified) ubuntu package installed at the time,
so there were lam libraries in /usr/lib ....
Over the course of my testing, I got a few warnings about version skew,
so I've tried to be careful to be consistent, but I'm not entirely sure
that the default linker search path of /usr/lib has successfully been
overridden (the 7.1.2 libraries in /usr/lib are probably actively
harmful to me).
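To see whether stray copies are still sitting in the default search
path, one option is simply to list anything LAM-like in each candidate
directory. This is a minimal sketch: list_lam_libs is a hypothetical
helper, and the name patterns (liblam*, libmpi*) are an assumption that
should be adjusted to match whatever 'mpic++ -showme' reports.

```shell
#!/bin/sh
# List LAM-looking libraries in each directory given as an argument.
# list_lam_libs is a hypothetical helper; the liblam*/libmpi* patterns are
# an assumption about how the libraries are named on a given system.
list_lam_libs() {
    for dir in "$@"; do
        for lib in "$dir"/liblam* "$dir"/libmpi*; do
            # An unmatched glob stays literal, so keep only paths that exist.
            if [ -e "$lib" ]; then
                echo "$lib"
            fi
        done
    done
    return 0
}

# Compare the system directory with the hand-built prefix from the
# configure line above.
list_lam_libs /usr/lib /opt/lam-7.1.4/lib
```

If the same library name shows up under both directories, the order of
the -L options on the link line decides which copy is used.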
The current state of affairs is that my hand-compiled version of lam
runs correctly for some amount of time, and then dies with
bufferd (getroute): invalid node
I've copied over the "lam-7.1.4.intel9.nomalloc" from aslan to caspian
and compiled using that with an otherwise unchanged configuration, which
runs successfully.
What really confuses me is that when building with the old binaries from
aslan, my compilation fails at the linking stage, as it seems to be
checking that libraries exist in /opt/lam-7.1.4.intel9.nomalloc (as it
notices when I run that mpic++ from a different path), but actually
pulling in the libraries from /usr/lib (which fail to link as mentioned
above). I can then use 'mpic++ -showme' to get a link line, and put in
absolute paths to the five lam libraries we use instead of using -l
options. The binary that I claim above runs successfully was compiled
this way. Compiling using my hand-built 7.1.4 does *not* see this build
failure; it has a successful link step. However, both that version and
a version linked using an equivalent procedure to the one used for
"lam-7.1.4.intel9.nomalloc" experience the runtime error mentioned
above.
Would it be useful to attempt to reproduce the runtime error with a
smaller test program? (Where might I find such a program?)
Google doesn't know much about the particular error message, and the
line in the source that prints it doesn't give much indication as to
where the actual source of the error is. Is there something different in
my configure line that I should try?
Thanks,
Ben Kaduk
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/