LAM/MPI General User's Mailing List Archives

From: K. Charoenpornwattana Ter (kcharoen_at_[hidden])
Date: 2007-05-24 17:32:46


Hi,

lamd is running on every node in the cluster (both before and after
running mpirun on the head node). See below:

[ter_at_uftoscar ~]$ cexec "ps -ef | grep lamd | grep -v grep"
************************* oscar_cluster *************************
--------- oscarnode1---------
ter 5292 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H
192.168.99.1 -P 57625 -n 1 -o 0
--------- oscarnode2---------
ter 5002 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H
192.168.99.1 -P 57625 -n 2 -o 0
--------- oscarnode3---------
ter 5002 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H
192.168.99.1 -P 57625 -n 3 -o 0
--------- oscarnode4---------
ter 5002 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H
192.168.99.1 -P 57625 -n 4 -o 0
--------- oscarnode5---------
ter 5002 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H
192.168.99.1 -P 57625 -n 5 -o 0
--------- oscarnode6---------
ter 5058 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H
192.168.99.1 -P 57625 -n 6 -o 0
--------- oscarnode7---------
ter 5016 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H
192.168.99.1 -P 57625 -n 7 -o 0
--------- oscarnode8---------
ter 4950 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H
192.168.99.1 -P 57625 -n 8 -o 0
--------- oscarnode9---------
ter 4950 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H
192.168.99.1 -P 57625 -n 9 -o 0
--------- oscarnode10---------
ter 4950 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H
192.168.99.1 -P 57625 -n 10 -o 0
--------- oscarnode11---------
ter 4950 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H
192.168.99.1 -P 57625 -n 11 -o 0
--------- oscarnode12---------
ter 4950 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H
192.168.99.1 -P 57625 -n 12 -o 0
--------- oscarnode13---------
ter 4950 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H
192.168.99.1 -P 57625 -n 13 -o 0

[ter_at_uftoscar ~]$ ps -ef | grep lamd | grep -v grep
ter 13808 1 0 16:23 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H
192.168.99.1 -P 57625 -n 0 -o 0
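
In case it matters, ring.c is just a simple token-ring test. I have not
pasted the exact file here, but it is roughly the sketch below (the token
value, message tag, and printf text are arbitrary):

/* ring.c (sketch) -- rank 0 starts an integer token, each rank forwards
 * it to the next rank, and rank 0 waits for it to come back around.
 * This is a simplified stand-in, not necessarily the exact file. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, token;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        /* nothing to pass around with a single process */
        printf("run with at least 2 processes\n");
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        /* start the token, then wait for the last rank to send it back */
        token = 42;
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD, &status);
        printf("rank 0 got the token back from rank %d\n", size - 1);
    } else {
        /* receive from the previous rank and forward to the next */
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

The sketch does nothing besides those sends and receives, so a program
like this should not block on anything except MPI communication itself.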

Thanks
Kulathep

On 5/24/07, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>
> Check to see if the lamd's are still running on all nodes when this
> problem occurs. If they are dying for some reason (or being killed),
> that could explain this behavior.
>
>
> On May 24, 2007, at 3:47 PM, K. Charoenpornwattana Ter wrote:
>
> > On 5/24/07, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> > That is just weird -- I don't think I've seen a case where tping
> > worked (implying that inter-lamd communication is working), but
> > running applications did not.
> >
> > Yes, it's kinda weird. I just noticed something: after running
> > mpirun, tping doesn't work anymore. See below.
> >
> > [ter_at_uftoscar test]$ lamboot -v host
> > LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University
> >
> > n-1<12514> ssi:boot:base:linear: booting n0 (uftoscar)
> > ...
> > n-1<12514> ssi:boot:base:linear: finished
> > [ter_at_uftoscar test]$ tping -c 3 n0-13
> > 1 byte from 13 remote nodes and 1 local node: 0.007 secs
> > 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> > 1 byte from 13 remote nodes and 1 local node: 0.006 secs
> >
> > 3 messages, 3 bytes (0.003K), 0.017 secs (0.340K/sec)
> > roundtrip min/avg/max: 0.005/0.006/0.007
> > [ter_at_uftoscar test]$ mpicc ring.c -o ring.out    <--- LAM's mpicc
> > [ter_at_uftoscar test]$ mpirun -np 13 ring.out
> > <freeze> (so I pressed Ctrl-C to cancel)
> >
> > ********************* WARNING ***********************
> > This is a vulnerable region. Exiting the application
> > now may lead to improper cleanup of temporary objects
> > To exit the application, press Ctrl-C again
> > ********************* WARNING ************************
> > [ter_at_uftoscar test]$ tping -c 3 n0-13
> > <freeze> :-(
> >
> > The only thing that I can think of is that there is some firewalling
> > in place that only allows arbitrary UDP traffic through...? (inter-
> > lamd traffic is UDP, not TCP) That doesn't seem to make sense,
> > though, if MPICH works (cexec uses ssh, which is most certainly
> > allowed). But can you triple check that there are no firewalls tcp
> > rules in place that restrict UDP/TCP traffic? (e.g., iptables)
> >
> > I did. No firewall is running on any of the nodes.
> >
> > [root_at_uftoscar ~]# service iptables status
> > Firewall is stopped.
> > [root_at_uftoscar ~]# service pfilter status
> > pfilter is stopped
> > [root_at_uftoscar ~]# cexec service iptables status
> > ************************* oscar_cluster *************************
> > --------- oscarnode1---------
> > Firewall is stopped.
> > .....
> > --------- oscarnode13---------
> > Firewall is stopped.
> >
> > [root_at_uftoscar ~]# cexec service pfilter status    <-- I already removed pfilter.
> > ************************* oscar_cluster *************************
> > --------- oscarnode1---------
> > pfilter: unrecognized service
> > ....
> > --------- oscarnode13---------
> > pfilter: unrecognized service
> >
> > Also try running tping / mpirun / lamexec from a node other than the
> > origin (i.e., the node you lambooted from).
> >
> > I did. Same problem.
> >
> > On May 23, 2007, at 11:32 PM, K. Charoenpornwattana Ter wrote:
> >
> > > Try some simple tests:
> > >
> > > - Does "tping -c 3" run successfully? (It should ping all the
> > lamd's)
> > >
> > > [ter_at_uftoscar test]$ tping -c 3 n0-13
> > > 1 byte from 13 remote nodes and 1 local node: 0.006 secs
> > > 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> > > 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> > >
> > > 3 messages, 3 bytes (0.003K), 0.016 secs (0.368K/sec)
> > > roundtrip min/avg/max: 0.005/0.005/0.006
> > >
> > >
> > > - Does "lamexec N hostname" run successfully? (It should run
> > > "hostname" on all the booted nodes)
> > >
> > > No, it doesn't work. It only shows the head node's hostname. See below:
> > >
> > > [ter_at_uftoscar ~]$ lamexec N hostname
> > > uftoscar.latech
> > > <freeze>
> > >
> > > I, however, can execute "cexec hostname" with no problem.
> > >
> > > - When you "mpirun -np 15 ring.out", do you see ring.out executing on
> > > all the nodes? (i.e., if you ssh into each of the nodes and run ps,
> > > do you see it running?)
> > >
> > > I only see one ring.out running on the head node, and no ring.out
> > > running on the other nodes.
> > >
> > >
> > > Thanks
> > > Kulathep
> >
> >
> > --
> > Jeff Squyres
> > Cisco Systems
> >
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>