Dear LAM developers,
We are having a problem with LAM 7.0 on a Linux Red Hat 9 cluster using
Ethernet.
The cluster has 128 nodes. The lamboot was successful, but after running
for about a day, the lamnodes command starts to hang on the first node.
All the other nodes seem to work fine, but they report the first node as
invalid.
The lamd on the first node is still running; it just does not respond to
any LAM commands, such as lamnodes, mpitask, etc.
All nodes have 2 NICs, but the first one is not configured with an IP
address.
After LAM has been booted for about a day, we sometimes also see a
message like:
rcmd: socket: all port in use.
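One way to check whether the reserved port range is really exhausted when this message appears is to count the sockets on the node that hold a local port in that range. The following is only a minimal sketch, assuming (as in glibc) that rcmd obtains its local port through rresvport() from the range 512-1023. The function name reserved_port_usage is hypothetical and is not part of LAM.

```python
import os
from collections import Counter

# Assumption: rcmd()/rresvport() pick a local port from this reserved range.
RESERVED_LOW, RESERVED_HIGH = 512, 1023


def reserved_port_usage(proc_net_tcp_text):
    """Count sockets, grouped by TCP state code, whose local port is reserved."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # Field 1 is "HEXADDR:HEXPORT"; field 3 is the state code in hex
        # (for example "01" = ESTABLISHED, "06" = TIME_WAIT, "0A" = LISTEN).
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if RESERVED_LOW <= local_port <= RESERVED_HIGH:
            counts[fields[3]] += 1
    return counts


if __name__ == "__main__" and os.path.exists("/proc/net/tcp"):
    with open("/proc/net/tcp") as f:
        usage = reserved_port_usage(f.read())
    total = RESERVED_HIGH - RESERVED_LOW + 1
    print(sum(usage.values()), "of", total, "reserved ports in use:", dict(usage))
```

If the count approaches 512 and most entries are in one state (for example TIME_WAIT), that would point toward port exhaustion on the node rather than a fault inside lamnodes itself.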
Does this problem sound like a system/firewall configuration error or an
error in lamnodes/lamd?
Please help. I have included the backtrace from lamd on the first node
for your reference.
Best regards,
Lily Li
SPT
Petroleum Geo-Service.
--------------------------------------- output from gdb of lamd --------------------------------------
GNU gdb Red Hat Linux (5.3post-0.20021129.18rh)
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...
Attaching to program:/ap/local/lam-7.0-pgs/LINUXM/bin/lamd, process 9186
Reading symbols from /ap/local/lam-7.0-pgs/LINUXM/lib/liblam.so.0...done.
Loaded symbols for /ap/local/lam-7.0-pgs/LINUXM/lib/liblam.so.0
Reading symbols from /lib/libutil.so.1...done.
Loaded symbols for /lib/libutil.so.1
Reading symbols from /lib/libpthread.so.0...done.
[New Thread 16384 (LWP 9186)]
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
0x4019a122 in select () from /lib/libc.so.6
(gdb) where
#0 0x4019a122 in select () from /lib/libc.so.6
#1 0x0806e720 in exceptfds ()
#2 0x08055ba3 in run_kernel ()
#3 0x0804b875 in main ()
#4 0x400d3917 in __libc_start_main () from /lib/libc.so.6
(gdb) quit