LAM/MPI General User's Mailing List Archives

From: John Robinson (jr_at_[hidden])
Date: 2005-08-11 08:56:24


Hi Jeff,

Thanks for the reply. We are still struggling with the SEGV problem,
but as I said it is independent of LAM/MPI.
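
A note on the "call 0x0" in the disassembly quoted below: one way such a
call can survive linking is an undefined weak symbol, which the linker
resolves to address 0 instead of reporting an error. I have not confirmed
that this is what happens in my binary, so treat the following as a minimal
sketch of that mechanism under that assumption, not a diagnosis. The name
demo_missing_hook is hypothetical, and the weak attribute is a GCC/Clang
extension (the behaviour shown is for ELF targets such as Linux).

```cpp
#include <cstdio>

// Hypothetical hook for illustration only: declared weak and deliberately
// never defined in any object file or library. The linker does not complain
// about the missing definition; the symbol's address simply becomes 0.
extern "C" void demo_missing_hook() __attribute__((weak));

// Returns true if the hook was actually linked in, false if it resolved to 0.
bool hook_is_linked() {
    return &demo_missing_hook != nullptr;
}

// Calls the hook only when it exists. Returns true if the call was made.
// Without the check, the call would branch to address 0.
bool call_hook_if_present() {
    if (!hook_is_linked()) {
        std::printf("demo_missing_hook is not linked in; skipping call\n");
        return false;
    }
    demo_missing_hook();
    return true;
}
```

Code that calls such a symbol without the null check would be expected to
fault at address 0, which is consistent with the "0x00000000 in ?? ()" frame
in the gdb output, although other causes are possible.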

I have not pushed us to switch to dynamic linking, but we may wind up there.

However, note that even a trivial MPI program fails to link statically
on my setup with the Red Hat FC4 RPM for LAM. So there is a problem to
be addressed, IMHO.

I am going away for a week+ but will report back to the list when I return.

Thanks again!
/jr

---
Jeff Squyres wrote:
> Sorry for the delay in replying.
> 
> This probably makes sense -- if you compile the rest of your code 
> statically (and against libc.a), then malloc and friends are included 
> in your executable.  For lack of a longer explanation, I think it's 
> easy to construct scenarios where the two memory managers run afoul of 
> each other (or simply create linker clashes).
> 
> Disabling the LAM memory manager is certainly an option here, 
> especially if you never plan to use IB or GM.
> 
> Is there a reason you need static linking?
> 
> 
> On Aug 10, 2005, at 4:18 PM, John Robinson wrote:
> 
> 
>>Dear lam users,
>>
>>My bad.  I misinterpreted a change in my project which introduced the
>>SEGV, since that change happened at roughly the same time that I
>>switched to statically-linked MPI.  So only the first problem remains:
>>the -static flag breaks the link unless you add
>>--with-memory-manager=none to the configure options (and give up on the
>>ib or gm SSIs).
>>
>>/jr
>>---
>>John Robinson wrote:
>>
>>>Hi lam users,
>>>
>>>Quick description:
>>>
>>>Static linking fails with multiply-defined symbols (with the MPI
>>>memory manager).  A statically-linked test program segfaults in exit()
>>>when LAM is built with --with-memory-manager=none.
>>>
>>>Long-winded tale of woe:
>>>
>>>I have been working on an MPI infrastructure, and ran into a couple
>>>of problems.  When trying to statically link (with mpiCC), I get an
>>>error from ld about symbols in libc being redefined, and libmpi.a is
>>>the culprit.  So problem number 1 is that I cannot statically link MPI
>>>apps in this environment:
>>>
>>>   FC4 / i686 / g++ (GCC) 4.0.1 20050727 (Red Hat 4.0.1-5)
>>>
>>>I figured that this must be due to the overloaded malloc package used
>>>to protect users against hardware memory stomping when using
>>>Infiniband or Myrinet, which I do not plan to use.  So I took a deep
>>>breath, uninstalled the redhat lam distribution, and proceeded to
>>>download the sources and build lam/mpi myself with the following
>>>config:
>>>
>>>./configure --disable-tv-queue --with-memory-manager=none \
>>>            --without-romio --with-trillium
>>>
>>>[I don't need ROMIO and thought I might want to experiment with
>>>building xmpi].
>>>
>>>At any rate, I can now link my program okay, but when I execute it, I
>>>get a SEGV out of exit:
>>>
>>>  Program received signal SIGSEGV, Segmentation fault.
>>>  0x00000000 in ?? ()
>>>  (gdb) where
>>>  #0  0x00000000 in ?? ()
>>>  #1  0x080be6bd in __tcf_0 ()
>>>  #2  0x0812eb02 in exit ()
>>>  #3  0x080482a2 in main (argc=1, argv=0xbfe20064)
>>>
>>>If I ask gdb to show me __tcf_0, however, it displays a different one.
>>>So it looks like the exit_funcs are getting messed up.  The
>>>instruction that fails appears to be an incomplete link step [note the
>>>"call 0x0"]:
>>>
>>>0x080be6a0 <__tcf_0+0>:	push   %ebp
>>>0x080be6a1 <__tcf_0+1>:	mov    %esp,%ebp
>>>0x080be6a3 <__tcf_0+3>:	sub    $0x8,%esp
>>>0x080be6a6 <__tcf_0+6>:	mov    0x81d3784,%ecx
>>>0x080be6ac <__tcf_0+12>:	test   %ecx,%ecx
>>>0x080be6ae <__tcf_0+14>:	je     0x80be6cb <__tcf_0+43>
>>>0x080be6b0 <__tcf_0+16>:	mov    0x81d378c,%eax
>>>0x080be6b5 <__tcf_0+21>:	mov    %eax,(%esp)
>>>0x080be6b8 <__tcf_0+24>:	call   0x0
>>>0x080be6bd <__tcf_0+29>:	mov    0x81d3784,%eax
>>>0x080be6c2 <__tcf_0+34>:	mov    %eax,0x8(%ebp)
>>>0x080be6c5 <__tcf_0+37>:	leave
>>>0x080be6c6 <__tcf_0+38>:	jmp    0x81141dc <_ZdlPv>
>>>0x080be6cb <__tcf_0+43>:	leave
>>>0x080be6cc <__tcf_0+44>:	ret
>>>0x080be6cd <__tcf_0+45>:	nop
>>>
>>>All my test program does is try to instantiate a class that has some
>>>Intracomm members.  If I do not instantiate it, the problem stops (or
>>>is masked).  The same error happens whether I instantiate the class
>>>with "new" or declare it in main().
>>>
>>>I may be able to convince the rest of my project that dynamic linking
>>>is okay, but maybe that is just deferring a problem that will still
>>>crop up eventually.  My test program did run its basic steps
>>>successfully when linked dynamically, but maybe I was just lucky.
>>>
>>>Has anyone got a fix for this?  Or even seen it?
>>>
>>>Many thanks,
>>>
>>>John Robinson
>>>Vertica Systems
>>>_______________________________________________
>>>This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>>