LAM/MPI General User's Mailing List Archives

From: Richard Hadsell (hadsell_at_[hidden])
Date: 2004-06-11 09:06:58


Jeff Squyres wrote:

> I notice that the man page for dlerror(3) on Linux says:
>
> -----
> If dlopen fails for any reason, it returns NULL. A human readable
> string describing the most recent error that occurred from any of the
> dl routines (dlopen, dlsym or dlclose) can be extracted with
> dlerror(). dlerror returns NULL if no errors have occurred since
> initialization or since it was last called.
> -----
>
> This tends to imply that it may not be proper to call dlerror() before
> dlopen() -- the "...if no errors have occurred since initialization..."
> part is what I'm keying from. "Initialization" is not defined, so it
> could mean either process initialization or the dl library
> initialization (effectively, dlopen).
>
> So if this is true, here are two guesses: you could be a) getting lucky
> with prior versions of LAM, or b) there's some other part of the
> system that is calling dlopen() before you call dlerror() (and
> therefore initializing the dl library for you).

That's always possible, but I would expect the library to be initialized
when it gets loaded itself.

I suspect some form of interaction with LAM 7.0.6 to be the culprit
because --

- there has never been a problem with 6.6b2, even with the same dl
library and the same application;

- the problem is totally reproducible, even across machines randomly
chosen from a farm of about 100, which would be in various states of the
shared dl library and memory contents;

- requiring a call to dlopen for initialization of the library seems
unlikely -- there is no admonition in the man page, and I can't imagine
intentionally coding it in a way that allows totally undefined behavior,
including a seg fault, for calling dlerror before dlopen.
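
For reference, the call pattern in question can be sketched as below. This is only an illustration of calling dlerror() before any dlopen(), not the actual application code; the helper name dlerror_is_clear is made up for the example. Under the reading that the library is initialized when it is loaded, dlerror() should simply return NULL here. On older glibc systems the program would need to be linked with -ldl.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: query dlerror() without any prior dlopen().
 * Returns 1 if no error is pending (dlerror() gave NULL), 0 otherwise.
 * Note that calling dlerror() also clears any pending error state. */
int dlerror_is_clear(void)
{
    const char *msg = dlerror();
    return msg == NULL;
}
```

Calling dlerror_is_clear() as the first dl-related call in a process is the case being discussed: per the man page text quoted above, NULL is the documented result when no errors have occurred.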

I realize it may be difficult for you to reproduce this problem, given the
vagaries of Linux configuration, but I was hoping that one of the LAM
developers might be familiar with LAM uses of dlerror. I can only
imagine that something is using the pointer returned by dlerror and
doing something terrible with it.

I have a workaround for the problem (which would be the only correct
coding, if your interpretation of the man page is correct), so I'm not
pressing you to work on it. If I have time, I may try to get the source
for the dl library, build a debug version, and try to pinpoint the bug.
If I do that, is there anything I need to do to configure LAM to get a
debuggable version of the binaries?
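
The workaround itself is not spelled out in this message. One pattern that would be consistent with the stricter reading of the man page is to consult dlerror() only after a dl routine has reported failure. The sketch below assumes that approach; the function name open_library and the error-buffer interface are hypothetical and chosen only for illustration.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: open a shared object and call dlerror() only
 * when dlopen() has actually failed. On failure, the message is copied
 * into errbuf, because the string returned by dlerror() may be
 * overwritten by later dl calls. */
void *open_library(const char *path, char *errbuf, size_t errlen)
{
    void *handle = dlopen(path, RTLD_NOW);

    if (handle == NULL && errbuf != NULL && errlen > 0) {
        const char *msg = dlerror();
        snprintf(errbuf, errlen, "%s",
                 msg != NULL ? msg : "unknown dlopen failure");
    }
    return handle;
}
```

With this shape, dlerror() is never reached unless dlopen() has already run, so the question of what "initialization" means in the man page does not arise.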

-- 
Dick Hadsell			914-259-6320  Fax: 914-259-6499
Reply-to:			hadsell_at_[hidden]
Blue Sky Studios                http://www.blueskystudios.com
44 South Broadway, White Plains, NY 10601