LAM/MPI General User's Mailing List Archives

From: K. Charoenpornwattana Ter (kcharoen_at_[hidden])
Date: 2007-05-24 21:43:55


Yes,

[ter_at_uftoscar test]$ mpirun -np 14 -v ring.out
17119 ring.out running on n0 (o)
<freeze>

Hmm, I guess I will just remove everything and install it again.

Thanks anyway,
Kulathep
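
Since the failing commands in this thread hang rather than exit, one way to repeat the checks without pressing Ctrl-C each time is to wrap each one in a time limit. The following is only a sketch: `run_step` is a made-up helper name, the 30- and 60-second limits are arbitrary, and it assumes GNU coreutils `timeout` (which exits with status 124 when it has to kill the command). The LAM commands themselves are the ones already used in the thread.

```shell
#!/bin/sh
# run_step LABEL SECONDS COMMAND...
# Hypothetical helper: run COMMAND under a time limit and report whether
# it finished, failed, or had to be killed because it hung.
run_step() {
    label=$1; limit=$2; shift 2
    timeout "$limit" "$@" >/dev/null 2>&1
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "$label: ok"
    elif [ "$status" -eq 124 ]; then
        # GNU timeout uses 124 to signal that the time limit was reached.
        echo "$label: hung (killed after ${limit}s)"
    else
        echo "$label: failed (exit $status)"
    fi
    return "$status"
}

# Only attempt the LAM checks where the LAM tools are on the PATH.
if command -v tping >/dev/null 2>&1; then
    run_step "tping"   30 tping -c 3 n0-13
    run_step "lamexec" 30 lamexec N hostname
    run_step "mpirun"  60 mpirun -np 14 -v ring.out
fi
```

Killing a hung mpirun this way may leave the LAM daemons in the same state seen after Ctrl-C later in this thread, so a fresh lamboot may be needed before the next attempt.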

On 5/24/07, McCalla, Mac <macmccalla_at_[hidden]> wrote:
>
> Sorry, I see you did that earlier. Have you tried mpirun with the -v
> parameter as well?
>
> ------------------------------
> *From:* lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] *On
> Behalf Of *K. Charoenpornwattana Ter
> *Sent:* 24 May 2007 19:57
> *To:* General LAM/MPI mailing list
> *Subject:* Re: LAM: lamboot is ok, mpirun is not
>
> [ter_at_uftoscar ~]$ which mpirun
> /opt/lam-7.1.3/bin/mpirun
> [ter_at_uftoscar ~]$ cexec which mpirun
> ************************* oscar_cluster *************************
> --------- oscarnode1---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode2---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode3---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode4---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode5---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode6---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode7---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode8---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode9---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode10---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode11---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode12---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode13---------
> /opt/lam-7.1.3/bin/mpirun
>
> Thanks
>
> On 5/24/07, McCalla, Mac <macmccalla_at_[hidden]> wrote:
> >
> > Hi,
> > just for grins, what does "which mpirun" show? ......
> >
> > mac mccalla
> >
> > ------------------------------
> > *From:* lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] *On
> > Behalf Of *K. Charoenpornwattana Ter
> > *Sent:* 24 May 2007 14:47
> > *To:* General LAM/MPI mailing list
> > *Subject:* Re: LAM: lamboot is ok, mpirun is not
> >
> > On 5/24/07, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> > >
> > > That is just weird -- I don't think I've seen a case where tping
> > > worked (implying that inter-lamd communication is working), but
> > > running applications did not.
> >
> >
> > Yes, it's kinda weird. I just noticed something: after running mpirun,
> > tping doesn't work anymore. See below.
> >
> > [ter_at_uftoscar test]$ lamboot -v host
> > LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University
> >
> > n-1<12514> ssi:boot:base:linear: booting n0 (uftoscar)
> > ...
> > n-1<12514> ssi:boot:base:linear: finished
> > [ter_at_uftoscar test]$ tping -c 3 n0-13
> > 1 byte from 13 remote nodes and 1 local node: 0.007 secs
> > 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> > 1 byte from 13 remote nodes and 1 local node: 0.006 secs
> >
> > 3 messages, 3 bytes (0.003K), 0.017 secs (0.340K/sec)
> > roundtrip min/avg/max: 0.005/0.006/0.007
> > [ter_at_uftoscar test]$ mpicc ring.c -o ring.out    <--- LAM's mpicc
> > [ter_at_uftoscar test]$ mpirun -np 13 ring.out
> > <freeze> (so I pressed Ctrl-C to cancel)
> >
> > ********************* WARNING ***********************
> > This is a vulnerable region. Exiting the application
> > now may lead to improper cleanup of temporary objects
> > To exit the application, press Ctrl-C again
> > ********************* WARNING ************************
> > [ter_at_uftoscar test]$ tping -c 3 n0-13
> > <freeze> :-(
> >
> > > The only thing that I can think of is that there is some firewalling
> > > in place that only allows arbitrary UDP traffic through...?
> > > (inter-lamd traffic is UDP, not TCP) That doesn't seem to make sense,
> > > though, if MPICH works (cexec uses ssh, which is most certainly
> > > allowed). But can you triple check that there are no firewall
> > > rules in place that restrict UDP/TCP traffic? (e.g., iptables)
> >
> >
> > I did. No firewall is running on any node.
> >
> > [root_at_uftoscar ~]# service iptables status
> > Firewall is stopped.
> > [root_at_uftoscar ~]# service pfilter status
> > pfilter is stopped
> > [root_at_uftoscar ~]# cexec service iptables status
> > ************************* oscar_cluster *************************
> > --------- oscarnode1---------
> > Firewall is stopped.
> > .....
> > --------- oscarnode13---------
> > Firewall is stopped.
> >
> > [root_at_uftoscar ~]# cexec service pfilter status    <-- I already removed pfilter.
> > ************************* oscar_cluster *************************
> > --------- oscarnode1---------
> > pfilter: unrecognized service
> > ....
> > --------- oscarnode13---------
> > pfilter: unrecognized service
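
The `service iptables status` output above shows the firewall service is stopped, but it does not directly show that TCP connections between nodes succeed. As a rough complementary check (a sketch, not part of LAM), a connection attempt to a port where nothing is listening can help: a fast refusal suggests packets reach the node, while a timeout suggests something is silently dropping them. `probe_tcp` is a hypothetical helper, port 40000 is an arbitrary choice, and the sketch assumes a bash built with `/dev/tcp` support plus GNU `timeout`.

```shell
#!/bin/sh
# probe_tcp HOST PORT
# Hypothetical helper: classify a single TCP connection attempt.
probe_tcp() {
    host=$1; port=$2
    # bash's /dev/tcp pseudo-device opens a TCP connection on redirection;
    # timeout exits with 124 if no answer arrives within 3 seconds.
    timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
    case $? in
        0)   echo "$host:$port open" ;;
        124) echo "$host:$port filtered (no answer within 3s)" ;;
        *)   echo "$host:$port closed or unreachable" ;;
    esac
}

# Example use from the head node (node names as in the cexec output):
#   for n in oscarnode1 oscarnode2 oscarnode13; do probe_tcp "$n" 40000; done
```

A "closed or unreachable" result also covers name-resolution failures and missing routes, so it is only a hint, not proof that no filtering is in place.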
> >
> >
> > > Also try running tping / mpirun / lamexec from a node other than the
> > > origin (i.e., the node you lambooted from).
> >
> >
> > I did. Same problem.
> >
> > On May 23, 2007, at 11:32 PM, K. Charoenpornwattana Ter wrote:
> > >
> > > > Try some simple tests:
> > > >
> > > > - Does "tping -c 3" run successfully? (It should ping all the
> > > > lamd's)
> > > >
> > > > [ter_at_uftoscar test]$ tping -c 3 n0-13
> > > > 1 byte from 13 remote nodes and 1 local node: 0.006 secs
> > > > 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> > > > 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> > > >
> > > > 3 messages, 3 bytes (0.003K), 0.016 secs (0.368K/sec)
> > > > roundtrip min/avg/max: 0.005/0.005/0.006
> > > >
> > > >
> > > > - Does "lamexec N hostname" run successfully? (It should run
> > > > "hostname" on all the booted nodes)
> > > >
> > > > No, it doesn't work. It only shows the head node's hostname. See below:
> > > >
> > > > [ter_at_uftoscar ~]$ lamexec N hostname
> > > > uftoscar.latech
> > > > <freeze>
> > > >
> > > > I, however, can execute "cexec hostname" with no problem.
> > > >
> > > > - When you "mpirun -np 15 ring.out", do you see ring.out executing
> > > > on all the nodes? (i.e., if you ssh into each of the nodes and run
> > > > ps, do you see it running?)
> > > >
> > > > I only see one ring.out running on the head node; no ring.out is
> > > > running on the other nodes.
> > > >
> > > >
> > > > Thanks
> > > > Kulathep
> > > > _______________________________________________
> > > > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> > >
> > >
> > > --
> > > Jeff Squyres
> > > Cisco Systems
> > >
> > >
> >
> >
> >
>
>
>