LAM/MPI General User's Mailing List Archives

From: K. Charoenpornwattana Ter (kcharoen_at_[hidden])
Date: 2007-05-24 20:56:38


[ter_at_uftoscar ~]$ which mpirun
/opt/lam-7.1.3/bin/mpirun
[ter_at_uftoscar ~]$ cexec which mpirun
************************* oscar_cluster *************************
--------- oscarnode1---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode2---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode3---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode4---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode5---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode6---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode7---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode8---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode9---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode10---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode11---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode12---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode13---------
/opt/lam-7.1.3/bin/mpirun
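
With thirteen nodes, it is easy to miss one odd path in output like the above. As a rough sketch (not part of LAM or C3), the Python below parses cexec-style output and lists any node whose result differs from an expected value. The function names `paths_by_node` and `mismatched_nodes` are made up for illustration, and the parser assumes the banner and separator layout shown in this message.

```python
import re

# Matches cexec node separators such as "--------- oscarnode1---------".
NODE_HEADER = re.compile(r"^-+\s*(\S+?)\s*-+$")


def paths_by_node(cexec_output):
    """Map each node name to the list of output lines printed under it."""
    result = {}
    current = None
    for raw in cexec_output.splitlines():
        line = raw.strip()
        if not line or line.startswith("*"):
            continue  # skip blank lines and the "**** cluster ****" banner
        match = NODE_HEADER.match(line)
        if match:
            current = match.group(1)
            result[current] = []
        elif current is not None:
            result[current].append(line)
    return result


def mismatched_nodes(cexec_output, expected):
    """Return the sorted names of nodes whose output is not exactly [expected]."""
    return sorted(
        node
        for node, lines in paths_by_node(cexec_output).items()
        if lines != [expected]
    )
```

Passing the captured text of "cexec which mpirun" together with "/opt/lam-7.1.3/bin/mpirun" as the expected value should give an empty list when every node agrees with the head node.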

Thanks

On 5/24/07, McCalla, Mac <macmccalla_at_[hidden]> wrote:
>
> Hi,
> just for grins, what does "which mpirun" show? ......
>
> mac mccalla
>
> ------------------------------
> *From:* lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] *On
> Behalf Of *K. Charoenpornwattana Ter
> *Sent:* 24 May 2007 14:47
> *To:* General LAM/MPI mailing list
> *Subject:* Re: LAM: lamboot is ok, mpirun is not
>
> On 5/24/07, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> >
> > That is just weird -- I don't think I've seen a case where tping
> > worked (implying that inter-lamd communication is working), but
> > running applications did not.
>
>
> Yes, it's kinda weird. I just noticed something: after running mpirun,
> tping doesn't work anymore. See below.
>
> [ter_at_uftoscar test]$ lamboot -v host
> LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University
>
> n-1<12514> ssi:boot:base:linear: booting n0 (uftoscar)
> ...
> n-1<12514> ssi:boot:base:linear: finished
> [ter_at_uftoscar test]$ tping -c 3 n0-13
> 1 byte from 13 remote nodes and 1 local node: 0.007 secs
> 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> 1 byte from 13 remote nodes and 1 local node: 0.006 secs
>
> 3 messages, 3 bytes (0.003K), 0.017 secs (0.340K/sec)
> roundtrip min/avg/max: 0.005/0.006/0.007
> [ter_at_uftoscar test]$ mpicc ring.c -o ring.out <---LAM's mpicc
> [ter_at_uftoscar test]$ mpirun -np 13 ring.out
> <freeze> (so I pressed Ctrl-C to cancel)
>
> ********************* WARNING ***********************
> This is a vulnerable region. Exiting the application
> now may lead to improper cleanup of temporary objects
> To exit the application, press Ctrl-C again
> ********************* WARNING ************************
> [ter_at_uftoscar test]$ tping -c 3 n0-13
> <freeze> :-(
>
> > The only thing that I can think of is that there is some firewalling
> > in place that only allows arbitrary UDP traffic through...? (inter-lamd
> > traffic is UDP, not TCP) That doesn't seem to make sense,
> > though, if MPICH works (cexec uses ssh, which is most certainly
> > allowed). But can you triple check that there are no firewall
> > rules in place that restrict UDP/TCP traffic? (e.g., iptables)
>
>
> I did. No firewall is running on any node.
>
> [root_at_uftoscar ~]# service iptables status
> Firewall is stopped.
> [root_at_uftoscar ~]# service pfilter status
> pfilter is stopped
> [root_at_uftoscar ~]# cexec service iptables status
> ************************* oscar_cluster *************************
> --------- oscarnode1---------
> Firewall is stopped.
> .....
> --------- oscarnode13---------
> Firewall is stopped.
>
> [root_at_uftoscar ~]# cexec service pfilter status   <-- I already removed pfilter.
> ************************* oscar_cluster *************************
> --------- oscarnode1---------
> pfilter: unrecognized service
> ....
> --------- oscarnode13---------
> pfilter: unrecognized service
>
>
> > Also try running tping / mpirun / lamexec from a node other than the
> > origin (i.e., the node you lambooted from).
>
>
> I did. Same problem.
>
> > On May 23, 2007, at 11:32 PM, K. Charoenpornwattana Ter wrote:
> >
> > > Try some simple tests:
> > >
> > > - Does "tping -c 3" run successfully? (It should ping all the lamd's)
> > >
> > > [ter_at_uftoscar test]$ tping -c 3 n0-13
> > > 1 byte from 13 remote nodes and 1 local node: 0.006 secs
> > > 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> > > 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> > >
> > > 3 messages, 3 bytes (0.003K), 0.016 secs (0.368K/sec)
> > > roundtrip min/avg/max: 0.005/0.005/0.006
> > >
> > >
> > > - Does "lamexec N hostname" run successfully? (It should run
> > > "hostname" on all the booted nodes)
> > >
> > > No, it doesn't work. It only shows the head node's hostname. See below:
> > >
> > > [ter_at_uftoscar ~]$ lamexec N hostname
> > > uftoscar.latech
> > > <freeze>
> > >
> > > I, however, can execute "cexec hostname" with no problem.
> > >
> > > - When you "mpirun -np 15 ring.out", do you see ring.out executing on
> > > all the nodes? (i.e., if you ssh into each of the nodes and run ps,
> > > do you see it running?)
> > >
> > > I only see one ring.out running on the head node; no ring.out is
> > > running on the other nodes.
> > >
> > >
> > > Thanks
> > > Kulathep
> > > _______________________________________________
> > > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >
> >
> > --
> > Jeff Squyres
> > Cisco Systems
> >
>
>
>
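
The thread raises the possibility that UDP traffic between nodes is being restricted (inter-lamd traffic is UDP, not TCP). One way to check plain UDP reachability independently of LAM is a tiny echo test. The Python below is only a sketch under stated assumptions: it does not use LAM's own ports or protocol, the function names are hypothetical, and the self-test runs both ends over 127.0.0.1 just to show the mechanism.

```python
import socket
import threading


def udp_echo_once(sock):
    """Receive one datagram on sock and send it straight back to the sender."""
    data, addr = sock.recvfrom(1024)
    sock.sendto(data, addr)


def udp_roundtrip(host, port, payload=b"ping", timeout=2.0):
    """Send payload to (host, port); return True if the same bytes come back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.settimeout(timeout)
        try:
            client.sendto(payload, (host, port))
            reply, _ = client.recvfrom(1024)
        except OSError:
            # Covers timeouts as well as errors such as "connection refused".
            return False
        return reply == payload


def loopback_self_test():
    """Run the echo side and the client in one process over 127.0.0.1."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.settimeout(5.0)
        port = server.getsockname()[1]
        worker = threading.Thread(target=udp_echo_once, args=(server,))
        worker.start()
        ok = udp_roundtrip("127.0.0.1", port)
        worker.join()
        return ok
```

To try it between two cluster nodes, the echo side would be bound to a reachable address and an agreed port on one node, and `udp_roundtrip` called from another; a False result would suggest that UDP datagrams are being dropped somewhere along the path.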