That is just weird -- I don't think I've seen a case where tping worked
(implying that inter-lamd communication is working), but running
applications did not.
Yes, it's kinda weird. I just noticed something: after running mpirun,
tping doesn't work anymore. See below.
[ter@uftoscar test]$ lamboot -v host
LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University
n-1<12514> ssi:boot:base:linear: booting n0 (uftoscar)
...
n-1<12514> ssi:boot:base:linear: finished
[ter@uftoscar test]$ tping -c 3 n0-13
1 byte from 13 remote nodes and 1 local node: 0.007 secs
1 byte from 13 remote nodes and 1 local node: 0.005 secs
1 byte from 13 remote nodes and 1 local node: 0.006 secs
3 messages, 3 bytes (0.003K), 0.017 secs (0.340K/sec)
roundtrip min/avg/max: 0.005/0.006/0.007
[ter@uftoscar test]$ mpicc ring.c -o ring.out     <---LAM's mpicc
[ter@uftoscar test]$ mpirun -np 13 ring.out
<freeze> (so I pressed Ctrl-C to cancel)
********************* WARNING ***********************
This is a vulnerable region. Exiting the application now may lead to
improper cleanup of temporary objects
To exit the application, press Ctrl-C again
********************* WARNING ************************
[ter@uftoscar test]$ tping -c 3 n0-13
<freeze> :-(
The only thing that I can think of is that there is some firewalling in
place that only allows arbitrary UDP traffic through...? (inter-lamd
traffic is UDP, not TCP) That doesn't seem to make sense, though, if
MPICH works (cexec uses ssh, which is most certainly allowed). But can
you triple check that there are no firewall rules in place that
restrict UDP/TCP traffic? (e.g., iptables)
I did. No firewall is running on any node.
[root@uftoscar ~]# service iptables status
Firewall is stopped.
[root@uftoscar ~]# service pfilter status
pfilter is stopped
[root@uftoscar ~]# cexec service iptables status
************************* oscar_cluster *************************
--------- oscarnode1---------
Firewall is stopped.
.....
--------- oscarnode13---------
Firewall is stopped.
[root@uftoscar ~]# cexec service pfilter status     <-- I already removed pfilter.
************************* oscar_cluster *************************
--------- oscarnode1---------
pfilter: unrecognized service
....
--------- oscarnode13---------
pfilter: unrecognized service
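
A stopped iptables service shows that no local filter is active, but it
does not directly show that arbitrary UDP datagrams make it from one
node to another. A small probe can check that separately. The following
is only a minimal sketch in Python, not anything shipped with LAM: the
function names (udp_echo_server, udp_probe, self_test) and the choice of
port are hypothetical, and it does not exercise lamd itself.

```python
import socket
import threading


def udp_echo_server(sock, count=1):
    """Echo `count` datagrams back to their senders, then return."""
    for _ in range(count):
        data, addr = sock.recvfrom(1024)
        sock.sendto(data, addr)


def udp_probe(host, port, payload=b"ping", timeout=2.0):
    """Send one datagram and report whether the same bytes come back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(payload, (host, port))
        try:
            data, _ = s.recvfrom(1024)
        except OSError:
            # Covers timeouts (socket.timeout is an OSError subclass)
            # as well as errors such as "connection refused".
            return False
        return data == payload


def self_test():
    """Loopback check: echo server on an ephemeral port, probed locally."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    port = server.getsockname()[1]
    worker = threading.Thread(target=udp_echo_server, args=(server,), daemon=True)
    worker.start()
    ok = udp_probe("127.0.0.1", port)
    worker.join(timeout=2.0)
    server.close()
    return ok
```

To use it across the cluster, one would bind the echo side to "0.0.0.0"
and a chosen port on a compute node, then call udp_probe() from the
headnode with that node's name and port. A False result in one direction
only would point at filtering somewhere between the two hosts.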
Also try running tping / mpirun / lamexec from a node other than the
origin (i.e., the node you lambooted from).
I did. Same problem.
On May 23, 2007, at 11:32 PM, K. Charoenpornwattana Ter wrote:
> Try some simple tests:
>
> - Does "tping -c 3" run successfully? (It should ping all the lamd's)
>
> [ter@uftoscar test]$ tping -c 3 n0-13
> 1 byte from 13 remote nodes and 1 local node: 0.006 secs
> 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> 1 byte from 13 remote nodes and 1 local node: 0.005 secs
>
> 3 messages, 3 bytes (0.003K), 0.016 secs (0.368K/sec)
> roundtrip min/avg/max: 0.005/0.005/0.006
>
> - Does "lamexec N hostname" run successfully? (It should run
>   "hostname" on all the booted nodes)
>
> No, it doesn't work. It only shows the headnode's hostname. See below:
>
> [ter@uftoscar ~]$ lamexec N hostname
> uftoscar.latech
> <freeze>
>
> I, however, can execute "cexec hostname" with no problem.
>
> - When you "mpirun -np 15 ring.out", do you see ring.out executing on
>   all the nodes? (i.e., if you ssh into each of the nodes and run ps,
>   do you see it running?)
>
> I only see one ring.out running on the headnode; no ring.out is
> running on the other nodes.
>
> Thanks
> Kulathep
>
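
The per-node check suggested above (ssh into each node and look for
ring.out) can be scripted rather than done by hand. This is only a rough
sketch in Python under some assumptions: passwordless ssh to the nodes,
pgrep available there as a stand-in for reading ps output, and node names
following the oscarnode1..oscarnode13 pattern shown by cexec. The helper
names (node_names, build_ps_command, check_nodes) are made up for
illustration.

```python
import subprocess


def node_names(count, prefix="oscarnode"):
    """Return names like oscarnode1..oscarnodeN (naming pattern assumed)."""
    return [f"{prefix}{i}" for i in range(1, count + 1)]


def build_ps_command(node, program="ring.out"):
    """Build the ssh command line that looks for `program` on `node`.

    BatchMode=yes makes ssh fail instead of prompting for a password.
    pgrep -l prints matching PIDs and names, and exits non-zero when
    no process matches.
    """
    return ["ssh", "-o", "BatchMode=yes", node, "pgrep", "-l", program]


def check_nodes(nodes, program="ring.out", runner=subprocess.run):
    """Return {node: bool} saying whether `program` appears to be running.

    A node that cannot be reached also yields False, because ssh then
    exits with a non-zero status of its own.
    """
    results = {}
    for node in nodes:
        proc = runner(
            build_ps_command(node, program),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        results[node] = proc.returncode == 0
    return results
```

Calling check_nodes(node_names(13)) from the headnode while mpirun is
hanging would give one True/False entry per compute node. The runner
parameter exists only so the ssh call can be swapped out when trying the
logic without a cluster.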
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
Jeff Squyres
Cisco Systems