LAM/MPI General User's Mailing List Archives

From: John Robinson (jr_at_[hidden])
Date: 2005-08-22 16:56:10


Hi all,

I'm back. Our team has pretty much decided to move to dynamic linking for
now, as it seems to work fine for us. The SEGV problem looks like
something introduced between gcc 4.0.0 and 4.0.1, and has something to do
with pthreads and/or atexit() calls. We are pursuing that on the GCC list
[gcc-help_at_[hidden]].

The static link problem with memory manager collisions remains for LAM;
disabling the private memory manager is a workaround if you are not a
Myrinet or InfiniBand shop.

I expect that is all that need be said on this list. I will look into
moving to Open MPI soon, and I expect to try out the new downloads.
I definitely want MPI_THREAD_MULTIPLE long-term.

/jr

---
John Robinson wrote:
> Hi Jeff,
> 
> Thanks for the reply.  We are still struggling with the SEGV problem, 
> but as I said it is independent of LAM/MPI.
> 
> I have not pushed us to change to dynamic linking but we may wind up there.
> 
> However, notice that even a trivial MPI program fails to link statically 
> on my setup with the Red Hat FC4 rpm for LAM.  So there is a problem to 
> be addressed, IMHO.
> 
> I am going away for a week+ but will report back to the list when I return.
> 
> Thanks again!
> /jr
> ---
> Jeff Squyres wrote:
> 
>>Sorry for the delay in replying.
>>
>>This probably makes sense -- if you compile the rest of your code 
>>statically (and against libc.a), then malloc and friends are included 
>>in your executable.  For lack of a longer explanation, I think it's 
>>easy to construct scenarios where the two memory managers run afoul of 
>>each other (or simply create linker clashes).
>>
>>Disabling the LAM memory manager is certainly an option here, 
>>especially if you never plan to use IB or GM.
>>
>>Is there a reason you need static linking?
>>
>>
>>On Aug 10, 2005, at 4:18 PM, John Robinson wrote:
>>
>>
>>
>>>Dear lam users,
>>>
>>>My bad.  I misinterpreted a change in my project which introduced the
>>>SEGV, since that change happened at roughly the same time that I
>>>switched to statically-linked MPI.  So only the first problem remains -
>>>the -static flag breaks unless you add --with-memory-manager=none to the
>>>config (and give up on the ib or gm SSIs).
>>>
>>>/jr
>>>---
>>>John Robinson wrote:
>>>
>>>
>>>>Hi lam users,
>>>>
>>>>Quick description:
>>>>
>>>>Static linking fails to link, with multiply-defined symbols (with MPI
>>>>memory manager).  Statically-linked test program segfaults in exit()
>>>>with memory-manager=none.
>>>>
>>>>Long-winded tale of woe:
>>>>
>>>>I have been working on an MPI infrastructure, and ran into a
>>>>couple of problems.  When trying to statically link (with mpiCC), I
>>>>get an error from ld about symbols in libc being redefined, and
>>>>libmpi.a is the culprit.  So problem number 1 is I cannot statically
>>>>link MPI apps in this environment:
>>>>
>>>>  FC4 / i686 / g++ (GCC) 4.0.1 20050727 (Red Hat 4.0.1-5)
>>>>
>>>>I figured that this must be due to the overloaded malloc package used
>>>>to protect users against hardware memory stomping when using
>>>>Infiniband or Myrinet, which I do not plan to use.  So I took a deep
>>>>breath, uninstalled the redhat lam distribution, and proceeded to
>>>>download the sources and build lam/mpi myself with the following
>>>>config:
>>>>
>>>>./configure --disable-tv-queue --with-memory-manager=none \
>>>>    --without-romio --with-trillium
>>>>
>>>>[I don't need ROMIO and thought I might want to experiment with
>>>>building xmpi].
>>>>
>>>>At any rate, I can now link my program okay, but when I execute it, I
>>>>get a SEGV out of exit:
>>>>
>>>> Program received signal SIGSEGV, Segmentation fault.
>>>> 0x00000000 in ?? ()
>>>> (gdb) where
>>>> #0  0x00000000 in ?? ()
>>>> #1  0x080be6bd in __tcf_0 ()
>>>> #2  0x0812eb02 in exit ()
>>>> #3  0x080482a2 in main (argc=1, argv=0xbfe20064)
>>>>
>>>>If I ask gdb to show me __tcf_0, however, it displays a different one.
>>>>So it looks like the exit_funcs are getting messed up.  The
>>>>instruction that fails appears to be an incomplete link step [note the
>>>>"call 0x0"]:
>>>>
>>>>0x080be6a0 <__tcf_0+0>:	push   %ebp
>>>>0x080be6a1 <__tcf_0+1>:	mov    %esp,%ebp
>>>>0x080be6a3 <__tcf_0+3>:	sub    $0x8,%esp
>>>>0x080be6a6 <__tcf_0+6>:	mov    0x81d3784,%ecx
>>>>0x080be6ac <__tcf_0+12>:	test   %ecx,%ecx
>>>>0x080be6ae <__tcf_0+14>:	je     0x80be6cb <__tcf_0+43>
>>>>0x080be6b0 <__tcf_0+16>:	mov    0x81d378c,%eax
>>>>0x080be6b5 <__tcf_0+21>:	mov    %eax,(%esp)
>>>>0x080be6b8 <__tcf_0+24>:	call   0x0
>>>>0x080be6bd <__tcf_0+29>:	mov    0x81d3784,%eax
>>>>0x080be6c2 <__tcf_0+34>:	mov    %eax,0x8(%ebp)
>>>>0x080be6c5 <__tcf_0+37>:	leave
>>>>0x080be6c6 <__tcf_0+38>:	jmp    0x81141dc <_ZdlPv>
>>>>0x080be6cb <__tcf_0+43>:	leave
>>>>0x080be6cc <__tcf_0+44>:	ret
>>>>0x080be6cd <__tcf_0+45>:	nop
>>>>
>>>>All my test program does is try to instantiate a class that has some
>>>>Intracomm members.  If I do not instantiate it, the problem stops (or
>>>>is masked).  The same error happens whether I instantiate the class
>>>>with "new" or declare it in main().
>>>>
>>>>I may be able to convince the rest of my project that dynamic linking
>>>>is okay, but maybe that is just deferring a problem that will still
>>>>crop up eventually.  My test program did run its basic steps
>>>>successfully when linked dynamically, but maybe I was just lucky.
>>>>
>>>>Has anyone got a fix for this?  Or even seen it?
>>>>
>>>>Many thanks,
>>>>
>>>>John Robinson
>>>>Vertica Systems
>>>>_______________________________________________
>>>>This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>>>
>>>
>>
>>