I've been having lamd sporadically hang on a node during a
communication-intensive part of a job. Once lamd is in that
state, tping on a node list that includes the node never
returns.
I attached gdb to lamd on that node and found that it was
blocked in the system routine readv(), called from mreadv().
The backtrace is:
#0 0x70126017 in __readv (fd=15, vector=0xbfffefec, count=1) at ../sysdeps/unix/sysv/linux/readv.c:51
#1 0x0805743d in mreadv (fd=15, iov=0xbfffefe4, iovcnt=2) at mrw.c:105
#2 0x08052031 in kio_recv (recvkmsg=0x8073fc0, minlen=8192, fd_client=15) at kernelio.c:505
#3 0x08053227 in transfer (pfrom=0x80745c8, pto=0x8073f7c) at kouter.c:524
#4 0x0805242e in kqsync (pclient=0x80745c8, pkq=0x8071140) at kinner.c:177
#5 0x08052d84 in main (argc=1, argv=0xbffff1e4) at kouter.c:251
#6 0x70062306 in __libc_start_main (main=0x8052b10 <main>, argc=11, ubp_av=0xbffff1e4, init=0x80494c0 <_init>, fini=0x805b34c <_fini>,
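One reading of frames #0 and #1, offered as a guess rather than a diagnosis: mreadv() was handed two iovec entries (iovcnt=2), but the readv() it is blocked in has only one left (count=1), and vector is 8 bytes past iov, which is the size of one struct iovec on a 32-bit system. That would fit a read-until-complete loop that has already filled the first entry and is waiting for data the sender never delivered. The sketch below is Python rather than LAM's C, and read_fully and demo are hypothetical names, not anything taken from mrw.c. It assumes such a loop and adds a timeout so the stall can be observed instead of hanging.

```python
import os
import select
import socket


def read_fully(fd, buffers, timeout=None):
    """Fill every buffer from fd, retrying readv after short reads.

    Returns True once all buffers are full, False on timeout or EOF.
    With timeout=None the select call blocks indefinitely, which is
    the situation the trace appears to show (an assumption).
    """
    views = [memoryview(b) for b in buffers if len(b)]
    while views:
        # Wait until the descriptor is readable or the timeout expires.
        ready, _, _ = select.select([fd], [], [], timeout)
        if not ready:
            return False  # the peer never sent the rest
        n = os.readv(fd, views)
        if n == 0:
            return False  # the peer closed the connection
        # Drop buffers that are now full; trim one that is partly filled.
        while n > 0:
            if n >= len(views[0]):
                n -= len(views[0])
                views.pop(0)
            else:
                views[0] = views[0][n:]
                n = 0
    return True


def demo():
    """Send a full 8-byte header but only 4 of the 16 body bytes."""
    sender, receiver = socket.socketpair()
    header, body = bytearray(8), bytearray(16)
    sender.sendall(b"H" * 8 + b"B" * 4)
    done = read_fully(receiver.fileno(), [header, body], timeout=0.2)
    sender.close()
    receiver.close()
    return done, bytes(header), bytes(body[:4])
```

Calling demo() fills the header and the first four body bytes, then gives up after the timeout. Without the timeout the same loop would sit in the read forever, so if this reading is right, the thing to look for is a client that sent less than the length it promised.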
Can anyone give me a hint on what might be the problem or how to
track it down?
I'm running lam-6.5.1 on dual P4 using Red Hat Linux 7.1.
The configuration command was:
./configure --prefix=/apps/lam-6.5.1 --without-mpi2cpp --with-cc=gcc '--with-cflags=-O2 -Wall -g' --with-fc=pgf77 '--with-fflags=-fast -g' --with-rpi=sysv --with-purify
The command used to run the jobs is "mpirun -O -lamd -w -pty schema".
I thought that all the changes between 6.5.1 and 6.5.6 were
minor, so I hadn't upgraded. Should I?
Any suggestions would be appreciated.
Scott Morton
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/