That is just weird -- I don't think I've seen a case where tping worked
(implying that inter-lamd communication is working), but running
applications did not.
Yes, it's kinda weird. I just noticed something: after running mpirun,
tping doesn't work anymore. See below.
[ter@uftoscar test]$ lamboot -v host
LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University
n-1<12514> ssi:boot:base:linear: booting n0 (uftoscar)
...
n-1<12514> ssi:boot:base:linear: finished
[ter@uftoscar test]$ tping -c 3 n0-13
1 byte from 13 remote nodes and 1 local node: 0.007 secs
1 byte from 13 remote nodes and 1 local node: 0.005 secs
1 byte from 13 remote nodes and 1 local node: 0.006 secs
3 messages, 3 bytes (0.003K), 0.017 secs (0.340K/sec)
roundtrip min/avg/max: 0.005/0.006/0.007
[ter@uftoscar test]$ mpicc ring.c -o ring.out    <---LAM's mpicc
[ter@uftoscar test]$ mpirun -np 13 ring.out
<freeze> (so I pressed Ctrl-C to cancel)
********************* WARNING ***********************
This is a vulnerable region. Exiting the application now may lead to
improper cleanup of temporary objects
To exit the application, press Ctrl-C again
********************* WARNING ************************
[ter@uftoscar test]$ tping -c 3 n0-13
<freeze> :-(
The only thing that I can think of is that there is some firewalling
in place that only allows arbitrary UDP traffic through...? (Inter-lamd
traffic is UDP, not TCP.) That doesn't seem to make sense, though, if
MPICH works (cexec uses ssh, which is most certainly allowed). But can
you triple-check that there are no firewall rules in place that
restrict UDP/TCP traffic (e.g., iptables)?
I did. No firewall is running on any of the nodes.
[root@uftoscar ~]# service iptables status
Firewall is stopped.
[root@uftoscar ~]# service pfilter status
pfilter is stopped
[root@uftoscar ~]# cexec service iptables status
************************* oscar_cluster *************************
--------- oscarnode1---------
Firewall is stopped.
.....
--------- oscarnode13---------
Firewall is stopped.
[root@uftoscar ~]# cexec service pfilter status    <-- I already removed pfilter.
************************* oscar_cluster *************************
--------- oscarnode1---------
pfilter: unrecognized service
....
--------- oscarnode13---------
pfilter: unrecognized service
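
With 13 nodes it is easy to overlook a single line in the cexec output.
A minimal sketch of a filter that lists only the nodes whose iptables
status is not "Firewall is stopped." follows. It assumes the output
layout shown above (a dashed header line naming the node, followed by
that node's status text); nodes_with_firewall is a hypothetical helper
name, not part of C3, OSCAR, or LAM.

```shell
#!/bin/sh
# Hypothetical helper: reads "cexec service iptables status" output on
# stdin and prints each node whose status is not "Firewall is stopped."
# Assumption: every node section starts with a "--------- name---------"
# header, as in the transcript above.
nodes_with_firewall() {
    awk '
        # Print the node collected so far unless it reported "stopped".
        # A node with no status text at all is printed too (unknown state).
        function report() {
            if (node != "" && status !~ /Firewall is stopped\./) print node
        }
        /^-+ .*-+$/ {                 # node header line
            report()
            node = $0
            sub(/^-+ */, "", node)    # drop leading dashes and spaces
            sub(/ *-+$/, "", node)    # drop trailing dashes and spaces
            status = ""
            next
        }
        /^\*+/ { next }               # skip the cluster banner
        node != "" { status = status " " $0 }
        END { report() }
    '
}
```

It would be run as "cexec service iptables status | nodes_with_firewall";
empty output means every node reported the firewall as stopped.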
Also try running tping / mpirun / lamexec from a node other than the
origin (i.e., the node you lambooted from).
I did. Same problem.
On May 23, 2007, at 11:32 PM, K. Charoenpornwattana Ter wrote:
> Try some simple tests:
>
> - Does "tping -c 3" run successfully? (It should ping all the lamd's)
>
> [ter@uftoscar test]$ tping -c 3 n0-13
> 1 byte from 13 remote nodes and 1 local node: 0.006 secs
> 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> 1 byte from 13 remote nodes and 1 local node: 0.005 secs
>
> 3 messages, 3 bytes (0.003K), 0.016 secs (0.368K/sec)
> roundtrip min/avg/max: 0.005/0.005/0.006
>
>
> - Does "lamexec N hostname" run successfully? (It should run
> "hostname" on all the booted nodes)
>
> No, it doesn't work. It only shows the headnode's hostname. See below:
>
> [ter@uftoscar ~]$ lamexec N hostname
> uftoscar.latech
> <freeze>
>
> I, however, can execute "cexec hostname" with no problem.
>
> - When you "mpirun -np 15 ring.out", do you see ring.out executing on
> all the nodes? (i.e., if you ssh into each of the nodes and run ps,
> do you see it running?)
>
> I only see one ring.out running on the headnode, no ring.out running
> on other nodes.
>
>
> Thanks
>
> Kulathep
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
Jeff Squyres
Cisco Systems