Hi all,
I have an application that I am trying to compile against LAM using the
Intel compilers as a backend. This is part of migrating our research
group from an old login node for our cluster to a more powerful one. We
had a working setup on the old machine ("aslan", Ubuntu gutsy with Intel
compilers 9.1.047), and things worked acceptably. The new machine
("caspian") is running Ubuntu lucid with the Intel compilers 11.1/072. I
wasn't around when aslan was set up, so I don't know exactly what went
into the custom fftw and "lam-7.1.4.intel9.nomalloc" installations in
/opt.
With the goal of figuring out what changes were needed (instead of just
blindly copying old binaries and hoping that they continue to work), I
started with the Ubuntu packaged version of LAM, which is based on 7.1.2.
I was able to modify the packaging to use the Intel compilers as the
backend, but the resulting binaries proved unusable for me, as the static
libraries failed to link (the PMPI_Send family of symbols was not
defined). This seems to have been fixed in the 7.1.4 release, so I
elected to compile that from source. My configure line was:
    configure --prefix=/opt/lam-7.1.4 --disable-shared \
        --with-rsh=/usr/bin/rsh --with-memory-manager=external \
        CC=icc CXX=icpc F77=ifort
However, I still had my (modified) Ubuntu package installed at the time,
so there were LAM libraries in /usr/lib as well.
Over the course of my testing, I got a few warnings about version skew, so
I've tried to be careful to stay consistent, but I'm not entirely sure
that the default linker search path of /usr/lib has successfully been
overridden (the 7.1.2 libraries in /usr/lib are probably actively harmful
to me).
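One rough way to check which file a given -l option would resolve to is to
walk the search directories in the order the linker would. The sketch
below is only an approximation: it assumes GNU ld's default behaviour
(directories tried in order, libNAME.so preferred over libNAME.a within a
directory), it does not query the real linker, and the helper name
which_lib is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the first file that -lNAME would resolve to,
# assuming directories are searched in the order given and that, within a
# directory, libNAME.so is preferred over libNAME.a.
# Usage: which_lib NAME DIR...
which_lib() {
    name=$1
    shift
    for dir in "$@"; do
        for ext in so a; do
            if [ -f "$dir/lib$name.$ext" ]; then
                printf '%s\n' "$dir/lib$name.$ext"
                return 0
            fi
        done
    done
    # No directory contained a matching library.
    return 1
}
```

For example, "which_lib lam /opt/lam-7.1.4/lib /usr/lib" would show whether
the hand-built liblam or the packaged one comes first (the lib
subdirectory under the prefix is an assumption here).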
The current state of affairs is that my hand-compiled version of LAM runs
correctly for some amount of time, and then dies with
    bufferd (getroute): invalid node
I've copied the "lam-7.1.4.intel9.nomalloc" installation from aslan to
caspian and compiled against it with an otherwise unchanged configuration;
the resulting binary runs successfully.
What really confuses me is that when I build with the old binaries from
aslan, my compilation fails at the linking stage. It seems to check that
the libraries exist in /opt/lam-7.1.4.intel9.nomalloc (it notices when I
run that mpic++ from a different path), but it actually pulls in the
libraries from /usr/lib (which fail to link, as mentioned above). To work
around this, I use 'mpic++ -showme' to get a link line and put in absolute
paths to the five LAM libraries we use instead of the -l options. The
binary that I said runs successfully above was built this way. Compiling
with my hand-built 7.1.4 does *not* hit this build failure; its link step
succeeds. However, both that version and a version linked using a
procedure equivalent to the one used for "lam-7.1.4.intel9.nomalloc"
experience the runtime error mentioned above.
Would it be useful to attempt to reproduce the runtime error with a
smaller test program? (Where might I find such a program?)
Google doesn't know much about the particular error message, and the line
in the source that prints it doesn't give much indication as to where the
actual source of the error is. Is there something different in my
configure line that I should try?
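For reference, the manual link step described above (replacing the -l
options in the 'mpic++ -showme' output with absolute paths) amounts to
something like the following sketch. It is not the exact set of commands
used: the helper name absolutize_libs is hypothetical, and it assumes the
static archives are named lib<name>.a and live in a single directory.
Options with no matching archive there are left untouched.

```shell
#!/bin/sh
# Hypothetical helper: rewrite each -lNAME option to the absolute path of
# LIBDIR/libNAME.a when that archive exists; pass everything else through.
# Usage: absolutize_libs LIBDIR WORD...
absolutize_libs() {
    libdir=$1
    shift
    out=
    for word in "$@"; do
        case $word in
            -l*)
                name=${word#-l}
                if [ -f "$libdir/lib$name.a" ]; then
                    word=$libdir/lib$name.a
                fi
                ;;
        esac
        # Join words with single spaces, without a leading space.
        out="$out${out:+ }$word"
    done
    printf '%s\n' "$out"
}
```

Something like "absolutize_libs /opt/lam-7.1.4.intel9.nomalloc/lib
$(mpic++ -showme)" would then print the rewritten link line (the lib
subdirectory is an assumption).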
Thanks,
Ben Kaduk