Since this now has me intrigued :-), I downloaded glibc and had a look at
dlerror(). I didn't follow it all the way through, but it looks like it
*could* be erroneous to call dlerror() without first calling dlopen().
Here's some code right at the top of dlerror():
-----
/* Get error string. */
result = (struct dl_action_result *) __libc_getspecific (key);
if (result == NULL)
  result = &last_result;
/* Test whether we already returned the string. */
if (result->returned != 0)
-----
I *think* __libc_getspecific() will return NULL here, so result will be
NULL and it'll use &last_result. And last_result does not appear to have
an explicit initializer, so "result->returned" could well generate
Badness.
I'm not 100% sure that's happening, but I think it is...
On Fri, 11 Jun 2004, Richard Hadsell wrote:
> Jeff Squyres wrote:
>
>> I notice that the man page for dlerror(3) on Linux says:
>>
>> -----
>> If dlopen fails for any reason, it returns NULL. A human readable string
>> describing the most recent error that occurred from any of the dl routines
>> (dlopen, dlsym or dlclose) can be extracted with dlerror(). dlerror
>> returns NULL if no errors have occurred since initialization or since it
>> was last called.
>> -----
>>
>> This tends to imply that it may not be proper to call dlerror() before
>> dlopen() -- the "...if no errors have occurred since initialization..."
>> part is what I'm keying from. "Initialization" is not defined, so it
>> could mean either process initialization or the dl library initialization
>> (effectively, dlopen).
>>
>> So if this is true, here are two guesses: a) you could be getting lucky
>> with prior versions of LAM, or b) some other part of the system is
>> calling dlopen() before you call dlerror() (and therefore initializing
>> the dl library for you).
>
> That's always possible, but I would expect the library to be initialized when
> it gets loaded itself.
>
> I suspect some form of interaction with LAM 7.0.6 to be the culprit because
> --
>
> - there has never been a problem with 6.6b2, even with the same dl library
> and the same application;
>
> - the problem is totally reproducible, even across machines randomly chosen
> from a farm of about 100, which would be in various states of the shared dl
> library and memory contents;
>
> - requiring a call to dlopen for initialization of the library seems unlikely
> -- there is no admonition in the man page, and I can't imagine intentionally
> coding it in a way that allows totally undefined behavior, including a seg
> fault, for calling dlerror before dlopen.
>
> I realize it may be difficult for you to reproduce this problem, given the
> vagaries of Linux configuration, but I was hoping that one of the LAM
> developers might be familiar with LAM uses of dlerror. I can only imagine
> that something is using the pointer returned by dlerror and doing something
> terrible with it.
>
> I have a workaround for the problem (which would be the only correct coding,
> if your interpretation of the man page is correct), so I'm not pressing you
> to work on it. If I have time, I may try to get the source for the dl
> library, build a debug version, and try to pinpoint the bug. If I do that,
> is there anything I need to do to configure LAM to get a debuggable version
> of the binaries?
>
>
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/