LAM/MPI General User's Mailing List Archives

From: redirecting decoy (redirectingdecoy_at_[hidden])
Date: 2005-01-19 11:52:50


Can anyone tell me what would cause the "lamnodes" command to hang
on one of the machines in my LAM universe?

I have a total of 19 machines. Three of these machines are servers
with 2 NICs, which I will call S1, S2 and S3. These servers have
public addresses of 192.168.10.XXX on NIC 1. The remaining 16
machines have 1 NIC each, with private addresses: 10.1.2.xxx for
eight, and 10.1.3.xxx for the remaining eight. The 3 servers are the
first three machines booted into the LAM universe, using globus:
"lamboot -v -x -d -ssi boot globus machines.globus"

On NIC 2 in S1, S2 and S3, the addresses are set up to be
10.1.(1,2,3).1 respectively. I am using S1 to initially create the
LAM universe, then I use lamgrow from S2 and S3 to add eight machines
each to the universe. This all seems to work fine.

However, I am having a problem with S1. I can run the lamnodes
command from every machine in my universe, except that it hangs when
I try to run it on S1. I think whatever is causing this is also
keeping my LAM universe from functioning properly, since on some of
the machines lamnodes reports that S1 has become an invalid node. The
programs I try to run just hang without doing anything for a long
time, and eventually I am forced to kill them.

Note: S1, S2 and S3 all have identical OS configurations. Also, I am
using LAM 7.0.6.

I know that what I am trying to do works, because it has worked
before. The only difference now is the addition of S1 to the LAM
universe.

Is it possible that the firewall on the servers could be the cause of
the problem? In order to get LAM to boot at all, I needed to add the
following to my iptables configuration:

-A INPUT -m state --state NEW -p tcp -s 192.168.10.100 -j ACCEPT
-A INPUT -m state --state NEW -p udp -s 192.168.10.100 -j ACCEPT
-A INPUT -m state --state NEW -p tcp -s 192.168.10.101 -j ACCEPT
-A INPUT -m state --state NEW -p udp -s 192.168.10.101 -j ACCEPT
-A INPUT -m state --state NEW -p tcp -s 192.168.10.102 -j ACCEPT
-A INPUT -m state --state NEW -p udp -s 192.168.10.102 -j ACCEPT

Adding the above allows me to boot the universe with
or without globus.
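
The six rules above follow a single pattern: one ACCEPT rule per
server public address and per protocol. As a rough sketch only, the
same lines can be printed from a small shell loop; gen_rules is a
made-up helper name, not part of iptables or LAM, and the output is
meant to be pasted into the iptables configuration by hand.

```shell
#!/bin/sh
# gen_rules is a hypothetical helper (not an iptables or LAM command).
# It prints one ACCEPT rule per address and protocol, in the same form
# as the rules listed above. It only prints text; it changes nothing.
gen_rules() {
    for ip in "$@"; do
        for proto in tcp udp; do
            echo "-A INPUT -m state --state NEW -p $proto -s $ip -j ACCEPT"
        done
    done
}

# The three public server addresses used in the rules above.
gen_rules 192.168.10.100 192.168.10.101 192.168.10.102
```

Keeping the address list in one place makes it easier to see exactly
which source addresses the firewall accepts new connections from.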

When I run "strace lamnodes" on S1, I get some output, then it hangs
while trying to read something...

#################################################################################################################################
munmap(0x40016000, 73909) = 0
open("/etc/passwd", O_RDONLY) = 3
fcntl64(3, F_GETFD) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
fstat64(3, {st_mode=S_IFREG|0664, st_size=1559, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40016000
read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1559
close(3) = 0
munmap(0x40016000, 4096) = 0
uname({sys="Linux", node="host.net", ...}) = 0
stat64("/tmp/lam-bruno_at_[hidden]", {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0
getuid32() = 500
getcwd("/home/bruno", 2048) = 12
chdir("/tmp/lam-bruno_at_[hidden]") = 0
socket(PF_FILE, SOCK_STREAM, 0) = 3
connect(3, {sa_family=AF_FILE, path="lam-kernel-socket"}, 19) = 0
chdir("/home/bruno") = 0
getsockopt(3, SOL_SOCKET, SO_SNDBUF, [107520], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_RCVBUF, [107520], [4]) = 0
rt_sigaction(SIGUSR2, {0x40038e30, [], SA_RESTORER|SA_RESTART, 0x400a6dc8}, {SIG_DFL}, 8) = 0
rt_sigprocmask(SIG_BLOCK, [USR2], [RTMIN], 8) = 0
write(3, "\5\0\0\0\377\377\377\377\314_\0\0G\4\0\0\0\0\0\0\0\0\0"..., 96) = 96
read(3, "\0\0\0\0\0\0\0\0x\374\377\277\202\265\4\10\30 \10\10t$"..., 80) = 80
rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [USR2], NULL, 8) = 0
write(3, "\4\0\0\0\17\0\0\0\3\0\0\0 \364\377\277p\222\3@\0\0\0\0"..., 96) = 96
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0+\301\5\10\r\0\0\0\1\0\0\0`\0\0"..., 80) = 80
writev(3, [{"\3\0\0@\0\0\0\0\376\377\377\377\3\0\0@\2\0\0\0\0\0\0\0"..., 64}, {NULL, 0}], 2) = 64
read(3, "\0\0\0\0\0\0\0\0\4\0\0\0\2\0\0\0\0\0\0\0\24\0\0\0\200\254"..., 80) = 80
readv(3, [{"4\240\377\377\0\0\0\0\376\377\377\3774\240\377\377\0\0"..., 64}, {"\0\0\0\0", 4}], 2) = 68
rt_sigprocmask(SIG_UNBLOCK, [USR2], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [USR2], NULL, 8) = 0
write(3, "\0\0\0\0\17\0\0\0\0\0\0\0\0\0\0\0\3\0\0@\0\0\0\0\0\0\0"..., 96) = 96
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\24\0\0\0`\0\0"..., 80) = 80
writev(3, [{"\r\0\0@\377\377\377\377\0\0\0\0\30\0\0@\0\0\0\0\0\0\0\0"..., 64}, {NULL, 0}], 2) = 64
rt_sigprocmask(SIG_UNBLOCK, [USR2], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [USR2], NULL, 8) = 0
write(3, "\4\0\0\0\17\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 96) = 96
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\300\334\6\10P\374\377\277\320"..., 80) = 80
writev(3, [{"\r\0\0@\0\0\0\0\376\377\377\377\r\0\0@\10\0\0\0\0\0\0\0"..., 64}, {NULL, 0}], 2) = 64
read(3, "\0\0\0\0\0\0\0\0\6\0\0\0\30\255\4\10\0\0\0\0\320\373\377"..., 80) = 80
readv(3, [{"4\240\377\377\377\377\377\377\0\0\0\0004\240\377\377\0"..., 64}, {"7.0.6\0", 6}], 2) = 70
rt_sigprocmask(SIG_UNBLOCK, [USR2], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [USR2], NULL, 8) = 0
write(3, "\0\0\0\0\17\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 96) = 96
read(3, *<-------------------- HANGS HERE -------------------------->*
#################################################################################################################################
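
To make the blocked call easier to pick out of a longer trace, the
output can be saved to a file and searched for the last system call
that never returned. The following is only a sketch: last_unfinished
is a made-up helper name, the sample lines are abbreviated from the
trace above, and a real log is assumed to come from something like
"strace -o lamnodes.trace lamnodes" (the file name is arbitrary).

```shell
#!/bin/sh
# last_unfinished is a hypothetical helper, not part of strace or LAM.
# It prints the last line of a saved strace log that has no
# " = <result>" part, skipping strace's "---" signal and "+++" exit
# marker lines. That line is the call the process was blocked in.
last_unfinished() {
    awk '!/ = / && !/^[-+][-+][-+]/ { line = $0 }
         END { if (line != "") print line }' "$1"
}

# Tiny sample, abbreviated from the trace above, to show the idea.
sample=$(mktemp)
cat > "$sample" <<'EOF'
rt_sigprocmask(SIG_BLOCK, [USR2], NULL, 8) = 0
write(3, "\0\0\0\0\17\0\0\0"..., 96) = 96
read(3,
EOF
last_unfinished "$sample"
rm -f "$sample"
```

In the trace above, file descriptor 3 is the connection to
"lam-kernel-socket", so the call reported this way is the read on
that socket.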

Anyone have any ideas?

Thanks,

-R.D.
