On 5/24/07, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>
> That is just weird -- I don't think I've seen a case where tping
> worked (implying that inter-lamd communication is working), but
> running applications did not.
Yes, it's kinda weird. I just noticed something: after running mpirun, tping
doesn't work anymore. See below.
[ter_at_uftoscar test]$ lamboot -v host
LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University
n-1<12514> ssi:boot:base:linear: booting n0 (uftoscar)
...
n-1<12514> ssi:boot:base:linear: finished
[ter_at_uftoscar test]$ tping -c 3 n0-13
1 byte from 13 remote nodes and 1 local node: 0.007 secs
1 byte from 13 remote nodes and 1 local node: 0.005 secs
1 byte from 13 remote nodes and 1 local node: 0.006 secs
3 messages, 3 bytes (0.003K), 0.017 secs (0.340K/sec)
roundtrip min/avg/max: 0.005/0.006/0.007
[ter_at_uftoscar test]$ mpicc ring.c -o ring.out <---LAM's mpicc
[ter_at_uftoscar test]$ mpirun -np 13 ring.out
<freeze> (so I pressed Ctrl-C to cancel)
********************* WARNING ***********************
This is a vulnerable region. Exiting the application
now may lead to improper cleanup of temporary objects
To exit the application, press Ctrl-C again
********************* WARNING ************************
[ter_at_uftoscar test]$ tping -c 3 n0-13
<freeze> :-(
> The only thing that I can think of is that there is some firewalling
> in place that only allows arbitrary UDP traffic through...? (inter-
> lamd traffic is UDP, not TCP) That doesn't seem to make sense,
> though, if MPICH works (cexec uses ssh, which is most certainly
> allowed). But can you triple check that there are no firewall
> rules in place that restrict UDP/TCP traffic? (e.g., iptables)
I did. No firewall is running on any node.
[root_at_uftoscar ~]# service iptables status
Firewall is stopped.
[root_at_uftoscar ~]# service pfilter status
pfilter is stopped
[root_at_uftoscar ~]# cexec service iptables status
************************* oscar_cluster *************************
--------- oscarnode1---------
Firewall is stopped.
.....
--------- oscarnode13---------
Firewall is stopped.
[root_at_uftoscar ~]# cexec service pfilter status <-- I already removed
pfilter.
************************* oscar_cluster *************************
--------- oscarnode1---------
pfilter: unrecognized service
....
--------- oscarnode13---------
pfilter: unrecognized service
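
A stopped firewall service can also be cross-checked by sending a datagram directly, independent of LAM. The following is only a minimal sketch, assuming Python is available on the nodes; the function names, payload, and timeout are hypothetical and are not part of LAM or OSCAR. It runs a one-shot UDP echo server and checks that a datagram makes the round trip. The self-test uses the loopback address; to test between nodes, the server part would have to run on one node and udp_roundtrip() on another, with a host and port of your choosing.

```python
import socket
import threading


def udp_echo_server(sock):
    """Echo a single datagram back to its sender, then return."""
    data, addr = sock.recvfrom(1024)
    sock.sendto(data, addr)


def udp_roundtrip(host, port, payload=b"ping", timeout=2.0):
    """Send one datagram; return True if the same bytes come back in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.settimeout(timeout)
        client.sendto(payload, (host, port))
        try:
            reply, _ = client.recvfrom(1024)
        except OSError:
            # Covers both a timeout and an ICMP "port unreachable" error.
            return False
        return reply == payload


def local_self_test():
    """Start the echo server on loopback and check one round trip."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    port = server.getsockname()[1]
    worker = threading.Thread(target=udp_echo_server, args=(server,), daemon=True)
    worker.start()
    ok = udp_roundtrip("127.0.0.1", port)
    worker.join(timeout=2.0)
    server.close()
    return ok


if __name__ == "__main__":
    print("UDP round trip ok:", local_self_test())
```

This only exercises UDP, which is what the quoted message says inter-lamd traffic uses; a failed round trip between two nodes would point at the network path rather than at LAM itself.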
> Also try running tping / mpirun / lamexec from a node other than the
> origin (i.e., the node you lambooted from).
I did. Same problem.
> On May 23, 2007, at 11:32 PM, K. Charoenpornwattana Ter wrote:
>
> > Try some simple tests:
> >
> > - Does "tping -c 3" run successfully? (It should ping all the lamd's)
> >
> > [ter_at_uftoscar test]$ tping -c 3 n0-13
> > 1 byte from 13 remote nodes and 1 local node: 0.006 secs
> > 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> > 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> >
> > 3 messages, 3 bytes (0.003K), 0.016 secs (0.368K/sec)
> > roundtrip min/avg/max: 0.005/0.005/0.006
> >
> >
> > - Does "lamexec N hostname" run successfully? (It should run
> > "hostname" on all the booted nodes)
> >
> > No, it doesn't work. It only shows the headnode's hostname. See below:
> >
> > [ter_at_uftoscar ~]$ lamexec N hostname
> > uftoscar.latech
> > <freeze>
> >
> > I, however, can execute "cexec hostname" with no problem.
> >
> > - When you "mpirun -np 15 ring.out", do you see ring.out executing on
> > all the nodes? (i.e., if you ssh into each of the nodes and run ps,
> > do you see it running?)
> >
> > I only see one ring.out running on the headnode, and none running on
> > the other nodes.
> >
> >
> > Thanks
> > Kulathep
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
>
> --
> Jeff Squyres
> Cisco Systems