LAM/MPI General User's Mailing List Archives

From: John Robinson (jr_at_[hidden])
Date: 2005-08-08 12:34:02


Hi lam users,

Quick description:

Static linking fails with multiply-defined symbols when the MPI memory
manager is enabled. With --with-memory-manager=none the test program
links statically, but then segfaults in exit().

Long-winded tale of woe:

I have been working on an MPI infrastructure, and ran into a couple
of problems. When trying to statically link (with mpiCC), I get an
error from ld about symbols in libc being redefined, and libmpi.a is
the culprit. So problem number 1 is that I cannot statically link MPI
apps in this environment:

   FC4 / i686 / g++ (GCC) 4.0.1 20050727 (Red Hat 4.0.1-5)

I figured that this must be due to the overloaded malloc package used
to protect users against hardware memory stomping when using
Infiniband or Myrinet, which I do not plan to use. So I took a deep
breath, uninstalled the redhat lam distribution, and proceeded to
download the sources and build lam/mpi myself with the following
config:

./configure --disable-tv-queue --with-memory-manager=none \
    --without-romio --with-trillium

[I don't need ROMIO and thought I might want to experiment with
building xmpi].

At any rate, I can now link my program okay, but when I execute it, I
get a SEGV out of exit:

  Program received signal SIGSEGV, Segmentation fault.
  0x00000000 in ?? ()
  (gdb) where
  #0 0x00000000 in ?? ()
  #1 0x080be6bd in __tcf_0 ()
  #2 0x0812eb02 in exit ()
  #3 0x080482a2 in main (argc=1, argv=0xbfe20064)

If I ask gdb to show me __tcf_0, however, it displays a different one.
So it looks like the exit_funcs are getting messed up. The
instruction that fails appears to be an incomplete link step [note the
"call 0x0"]:

0x080be6a0 <__tcf_0+0>: push %ebp
0x080be6a1 <__tcf_0+1>: mov %esp,%ebp
0x080be6a3 <__tcf_0+3>: sub $0x8,%esp
0x080be6a6 <__tcf_0+6>: mov 0x81d3784,%ecx
0x080be6ac <__tcf_0+12>: test %ecx,%ecx
0x080be6ae <__tcf_0+14>: je 0x80be6cb <__tcf_0+43>
0x080be6b0 <__tcf_0+16>: mov 0x81d378c,%eax
0x080be6b5 <__tcf_0+21>: mov %eax,(%esp)
0x080be6b8 <__tcf_0+24>: call 0x0
0x080be6bd <__tcf_0+29>: mov 0x81d3784,%eax
0x080be6c2 <__tcf_0+34>: mov %eax,0x8(%ebp)
0x080be6c5 <__tcf_0+37>: leave
0x080be6c6 <__tcf_0+38>: jmp 0x81141dc <_ZdlPv>
0x080be6cb <__tcf_0+43>: leave
0x080be6cc <__tcf_0+44>: ret
0x080be6cd <__tcf_0+45>: nop

All my test program does is try to instantiate a class that has some
Intracomm members. If I do not instantiate it, the problem stops (or
is masked). The same error happens whether I instantiate the class
with "new" or declare it in main().

I may be able to convince the rest of my project that dynamic linking
is okay, but maybe that is just deferring a problem that will still
crop up eventually. My test program did run its basic steps
successfully when linked dynamically, but maybe I was just lucky.

Has anyone got a fix for this? Or even seen it?

Many thanks,

John Robinson
Vertica Systems