That is just weird -- I don't think I've seen a case where tping worked
(implying that inter-lamd communication is working), but running
applications did not.

Yes, it's kinda weird. I just noticed something: after running mpirun,
tping doesn't work anymore. See below.

[ter@uftoscar test]$ lamboot -v host
LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University
n-1<12514> ssi:boot:base:linear: booting n0 (uftoscar)
...
n-1<12514> ssi:boot:base:linear: finished
[ter@uftoscar test]$ tping -c 3 n0-13
1 byte from 13 remote nodes and 1 local node: 0.007 secs
1 byte from 13 remote nodes and 1 local node: 0.005 secs
1 byte from 13 remote nodes and 1 local node: 0.006 secs
3 messages, 3 bytes (0.003K), 0.017 secs (0.340K/sec)
roundtrip min/avg/max: 0.005/0.006/0.007
[ter@uftoscar test]$ mpicc ring.c -o ring.out      <--- LAM's mpicc
[ter@uftoscar test]$ mpirun -np 13 ring.out
<freeze> (so I pressed Ctrl-C to cancel)
********************* WARNING ***********************
This is a vulnerable region. Exiting the application now may lead to
improper cleanup of temporary objects
To exit the application, press Ctrl-C again
********************* WARNING ************************
[ter@uftoscar test]$ tping -c 3 n0-13
<freeze> :-(
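
A hang like the two in the session above is easier to reproduce and report if each command runs under a timeout instead of waiting for Ctrl-C. The following is only a minimal Python sketch and not part of LAM: the helper name `run_with_timeout` and the 30-second limit are assumptions made for illustration, and only the command lines in the trailing comment come from the session itself.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=30.0):
    """Run cmd and classify the outcome.

    Returns (status, stdout) where status is one of:
      "ok"      - exited with code 0
      "failed"  - exited with a non-zero code
      "hung"    - still running after timeout_s seconds (the child is killed)
      "missing" - the executable could not be started
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hung", ""
    except OSError:
        return "missing", ""
    return ("ok" if proc.returncode == 0 else "failed"), proc.stdout


# Hypothetical replay of the session above (needs a booted LAM universe):
#
#   for cmd in (["tping", "-c", "3", "n0-13"],
#               ["mpirun", "-np", "13", "ring.out"],
#               ["tping", "-c", "3", "n0-13"]):
#       status, _ = run_with_timeout(cmd)
#       print(" ".join(cmd), "->", status)
```

Note that a timeout kills mpirun abruptly, so the same caveat as the Ctrl-C warning above presumably applies: temporary objects may not be cleaned up.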

The only thing that I can think of is that there is some firewalling in
place that only allows arbitrary UDP traffic through...? (inter-lamd
traffic is UDP, not TCP.) That doesn't seem to make sense, though, if
MPICH works (cexec uses ssh, which is most certainly allowed). But can
you triple check that there are no firewall rules in place that
restrict UDP/TCP traffic? (e.g., iptables)

I did. No firewall is running on any node.

[root@uftoscar ~]# service iptables status
Firewall is stopped.
[root@uftoscar ~]# service pfilter status
pfilter is stopped
[root@uftoscar ~]# cexec service iptables status
************************* oscar_cluster *************************
--------- oscarnode1---------
Firewall is stopped.
.....
--------- oscarnode13---------
Firewall is stopped.
[root@uftoscar ~]# cexec service pfilter status      <-- I already removed pfilter.
************************* oscar_cluster *************************
--------- oscarnode1---------
pfilter: unrecognized service
....
--------- oscarnode13---------
pfilter: unrecognized service
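
A stopped firewall service does not directly show whether UDP datagrams actually arrive, so a direct probe between two nodes may be a useful extra data point. Below is a minimal Python sketch of a UDP echo test, offered as an assumption-laden illustration rather than a LAM tool: the function names and port 50000 are made up, and this does not test the ports the lamd's themselves use. A successful echo would only suggest that UDP is not blocked wholesale between the two hosts.

```python
import socket


def serve_udp_echo(port, host="0.0.0.0", max_packets=None):
    """Echo each received UDP datagram back to its sender.

    Run this on the remote node. With max_packets=None it loops forever.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        handled = 0
        while max_packets is None or handled < max_packets:
            data, addr = sock.recvfrom(1024)
            sock.sendto(data, addr)
            handled += 1


def probe_udp(host, port, timeout_s=2.0, payload=b"ping"):
    """Send one datagram; return True only if the same bytes are echoed back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        try:
            sock.sendto(payload, (host, port))
            data, _ = sock.recvfrom(1024)
        except OSError:
            # Covers timeouts as well as unreachable/refused errors.
            return False
        return data == payload


# Hypothetical usage (port 50000 is an arbitrary placeholder):
#   on oscarnode1:  serve_udp_echo(50000)
#   on uftoscar:    print(probe_udp("oscarnode1", 50000))
```

Because UDP is unreliable, a single False is weak evidence; repeating the probe a few times in both directions gives a fairer picture.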

Also try running tping / mpirun / lamexec from a node other than the
origin (i.e., the node you lambooted from).

I did. Same problem.

On May 23, 2007, at 11:32 PM, K. Charoenpornwattana Ter wrote:
> Try some simple tests:
>
> - Does "tping -c 3" run successfully? (It should ping all the lamd's)
>
> [ter@uftoscar test]$ tping -c 3 n0-13
> 1 byte from 13 remote nodes and 1 local node: 0.006 secs
> 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> 1 byte from 13 remote nodes and 1 local node: 0.005 secs
>
> 3 messages, 3 bytes (0.003K), 0.016 secs (0.368K/sec)
> roundtrip min/avg/max: 0.005/0.005/0.006
>
>
> - Does "lamexec N hostname" run successfully? (It should run
> "hostname" on all the booted nodes)
>
> No, it doesn't work. It only shows the headnode's hostname. See below:
>
> [ter@uftoscar ~]$ lamexec N hostname
> uftoscar.latech
> <freeze>
>
> I can, however, execute "cexec hostname" with no problem.
>
> - When you "mpirun -np 15 ring.out", do you see ring.out executing on
> all the nodes? (i.e., if you ssh into each of the nodes and run ps,
> do you see it running?)
>
> I only see one ring.out running on the headnode, and no ring.out
> running on the other nodes.
>
>
>
> Thanks
> Kulathep
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
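
The per-node check asked about in the quoted message (ssh into each node and look for ring.out) can be scripted so that it is repeatable while mpirun is hanging. This is a rough Python sketch under several assumptions: that the node names from the cexec output (oscarnode1 through oscarnode13) resolve for ssh, that passwordless ssh works, and that `pgrep` is installed on the nodes. The function name `nodes_running` and the injectable `run` parameter are illustrative only.

```python
import subprocess


def nodes_running(program, nodes, run=subprocess.run, timeout_s=10.0):
    """Return {node: True | False | None} for an exact process-name match.

    True  - pgrep found the program on that node (exit status 0)
    False - pgrep ran and found nothing (exit status 1)
    None  - the check itself failed (ssh error, timeout, other status)
    """
    result = {}
    for node in nodes:
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        cmd = ["ssh", "-o", "BatchMode=yes", node, "pgrep", "-x", program]
        try:
            proc = run(cmd, capture_output=True, text=True, timeout=timeout_s)
        except (subprocess.TimeoutExpired, OSError):
            result[node] = None
            continue
        if proc.returncode == 0:
            result[node] = True
        elif proc.returncode == 1:
            result[node] = False
        else:
            result[node] = None
    return result


# Hypothetical usage, run from the headnode while mpirun is hanging:
#   nodes = ["uftoscar"] + ["oscarnode%d" % i for i in range(1, 14)]
#   for node, state in nodes_running("ring.out", nodes).items():
#       print(node, state)
```

The `run` parameter exists only so the function can be exercised without a cluster; in normal use the default `subprocess.run` is what executes ssh.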
--
Jeff Squyres
Cisco Systems