
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-08-29 06:44:59


What version of LAM are you using?

IIRC, kqsync() is effectively the entry point for messages entering the
lamd and the exit point for messages leaving the lamd (i.e., it's
invoked when an incoming message has been received and when an
outgoing message needs to be sent). It shouldn't go bonkers like that
-- can you compile LAM with debugging and see *where* in kqsync() it's
operating? I.e., is it spinning somewhere inside kqsync(), or is
kqsync() simply being called over and over?

For example, during this time, can you run tping? (tping "pings" the
LAM daemons by sending them a message and waiting to receive a message
back)
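
Since a wedged lamd may never answer, it can help to run the check under
a timeout so the test itself doesn't hang. Here is a minimal Python
sketch of that idea. The helper name responds_within() is made up for
illustration, and the "tping N -c 1" invocation mentioned in the comment
is an assumption to check against the tping man page for your LAM
version.

```python
import subprocess


def responds_within(cmd, timeout_s):
    """Return True if cmd exits with status 0 within timeout_s seconds.

    Returns False if the command times out, cannot be found, or exits
    with a non-zero status.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if this expires
        )
    except subprocess.TimeoutExpired:
        return False
    except FileNotFoundError:
        return False
    return result.returncode == 0


# Hypothetical usage (assumed tping arguments; verify them locally):
#   responds_within(["tping", "N", "-c", "1"], 10.0)
```

If this returns False while top shows the lamd at 99% CPU, that suggests
the daemon is no longer servicing messages at all, rather than just
being slow.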

You may also want to "lamboot -d" and watch your syslog. kqsync()
outputs a message to the syslog each time it's called (be sure you
don't fill up your syslog, particularly if kqsync is repeatedly being
called!). I *think* that this feature is only in LAM 7.0.x and above
-- I doubt it's in 6.5.9.
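
To tell "called repeatedly" apart from "stuck inside one call", one
option is to count how many syslog lines mention kqsync in each second.
The sketch below assumes the classic syslog layout, where the first 15
characters of a line are the timestamp (e.g. "Aug 26 15:59:01"), and it
assumes the lamd's message contains the string "kqsync". The exact
message text is not shown in this thread, so treat the marker as a
guess and adjust it to whatever your syslog actually contains.

```python
from collections import Counter


def count_marker_per_second(lines, marker="kqsync", stamp_len=15):
    """Count lines containing marker, grouped by their timestamp prefix.

    stamp_len=15 matches the traditional "Mmm dd hh:mm:ss" syslog
    timestamp; change it if your syslog uses a different format.
    """
    counts = Counter()
    for line in lines:
        if marker in line:
            counts[line[:stamp_len]] += 1
    return counts


# Hypothetical usage (the log path varies by system):
#   with open("/var/log/messages") as f:
#       for stamp, n in sorted(count_marker_per_second(f).items()):
#           print(stamp, n)
```

A large count per second would point at kqsync() being invoked over and
over; no new lines at all while the CPU is pegged would point at a spin
inside a single call.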

On Aug 26, 2004, at 3:59 PM, Lily Li wrote:

>
> Hi, LAM team,
>
> Quite a few times now, we have seen the lamd hang in kqsync() while
> taking more than 99% of CPU time, as shown by "top" on Linux.
>
> We don't know exactly how this situation starts. We first notice that
> some jobs are running extremely slowly, and then we see that the lamd
> is taking more than 99% of the CPU time and is hanging in kqsync().
>
> Here is the stack trace for lamd from gdb.
>
> ----------------------------------------------------------
>
> 0x08052ce1 in kqsync ()
> (gdb) where
> #0  0x08052ce1 in kqsync ()
> #1  0x0805378c in main ()
> #2  0x40053657 in __libc_start_main (main=0x8053570 <main>, argc=10,
> ubp_av=0xbffff3c4, init=0x8049540 <_init>, fini=0x805c9d0 <_fini>,
> rtld_fini=0x4000dcd4 <_dl_fini>,
>     stack_end=0xbffff3bc) at ../sysdeps/generic/libc-start.c:129
> (gdb) quit
>
> -----------------------------------------------------------------
>
>
> Our system is running Red Hat Linux 7.2 with gcc 2.93.5.
>
> Please help.
>
> Thank you.
>
> Lily
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/