On Aug 16, 2005, at 5:39 PM, Lily Li wrote:
> Just a followup on this lamd case.
>
> We reconfigured our LAM 7.0 and installed it on our production systems.
> It helps a lot: the hang rate dropped dramatically. However, the lamd
> still gets lost or crashes sometimes (when my MPI tasks receive a signal
> and exit or are killed, the lamd sometimes crashes).
Erf. If you ever get more data on this, please let us know. It sounds
like something that should be fixed, but it is a difficult case to
reproduce. :-\
> We now have a new problem with LAM. Our production environment has
> decided to use CentOS 4 with kernel 2.6.9 instead of RedHat 9. Can
> LAM 7.0 compiled on RedHat 9 (kernel 2.4) run on CentOS 4 with kernel
> 2.6.9?
I think you might run into a problem here with threading issues.
Although LAM is single threaded, it is probably linked against the
pthread library, and the pthread implementation changed dramatically
between the 2.4 and 2.6 kernels (NPTL vs. the old LinuxThreads).
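One quick way to see which pthread flavor a given box provides is to ask the C library directly. The following is only a rough sketch, not anything that ships with LAM: the helper names `classify_threads` and `host_thread_library` are made up for illustration, and it assumes a glibc system where Python exposes the `CS_GNU_LIBPTHREAD_VERSION` configuration string (on other C libraries the query simply returns nothing).

```python
import os


def classify_threads(version):
    """Map a GNU_LIBPTHREAD_VERSION-style string to a short label.

    Hypothetical helper: glibc typically reports strings such as
    "NPTL 2.3.4" or "linuxthreads-0.10".
    """
    text = (version or "").lower()
    if text.startswith("nptl"):
        return "nptl"
    if text.startswith("linuxthreads"):
        return "linuxthreads"
    return "unknown"


def host_thread_library():
    """Return the pthread version string of this host, or None if unavailable."""
    try:
        return os.confstr("CS_GNU_LIBPTHREAD_VERSION")
    except (ValueError, OSError, AttributeError):
        # Name not known to this platform, or confstr not supported at all.
        return None


if __name__ == "__main__":
    version = host_thread_library()
    print(version, "->", classify_threads(version))
```

Running it on both the build machine and the target machine and comparing the two labels tells you whether the binary is crossing the LinuxThreads/NPTL boundary.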
The other concern that I'd have would be about file descriptor passing
(which has several different flavors, and different distros have
exhibited both the different flavors and different bugs in each of the
flavors :-) ). However, I'd guess that it would either work fine or
not work at all (not a work-for-several-days-and-then-fail kind of
scenario).
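If you want to sanity-check descriptor passing on the new distro, a small standalone test is enough, since it should fail immediately or not at all. The sketch below is an assumption-laden illustration, not LAM's own code: it exercises only the `SCM_RIGHTS` ancillary-message flavor over a Unix domain socket pair, which may or may not be the flavor your LAM build selected at configure time, and `fd_passing_works` is a hypothetical name.

```python
import array
import os
import socket


def fd_passing_works():
    """Pass a pipe's read end over a Unix socket pair, then read through the copy."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    received = []
    try:
        fds = array.array("i", [read_end])
        # One byte of ordinary data carries the descriptor as ancillary data.
        sender.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])
        _, ancdata, _, _ = receiver.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
        for level, ctype, data in ancdata:
            if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
                got = array.array("i")
                # Ignore any trailing partial integer in the payload.
                got.frombytes(data[: len(data) - (len(data) % got.itemsize)])
                received.extend(got)
        if not received:
            return False
        # The received descriptor must refer to the same pipe.
        os.write(write_end, b"ping")
        return os.read(received[0], 4) == b"ping"
    finally:
        for fd in [read_end, write_end, *received]:
            os.close(fd)
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print("fd passing ok" if fd_passing_works() else "fd passing failed")
```

A pass here only says the kernel and C library handle this one flavor correctly; it does not prove that a lamd built on a different distro picked a compatible one.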
The rest of the lamd is pretty standard POSIX/C stuff. Note that if
you use the "-d" switch to lamboot, the lamd will dump a bunch of
debugging information into the syslog. This might be a good place to
look for clues as to why lamds are dying after a few days...?
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/