LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-05-24 22:27:55


FWIW, I logged into this machine and gave it a whirl myself. I'm
frankly at a loss to explain why remote execution in LAM doesn't
work; it's almost like there is some kind of UDP failure between the
nodes. <shrug>
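
If you want to rule out raw UDP between the nodes independent of LAM,
a quick throwaway test like the one below might help (an off-the-cuff
sketch, not LAM code; the port number is arbitrary). Run "./udptest
server" on one node and "./udptest client <other-node>" on another;
if the client never prints a reply, arbitrary UDP really is being
dropped somewhere between the nodes.

/* udptest.c: trivial UDP echo check between two nodes (sketch).
   Build: gcc udptest.c -o udptest */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>

#define UDP_TEST_PORT 9999   /* arbitrary; pick any unfiltered port */

int main(int argc, char **argv)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    char buf[64];

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(UDP_TEST_PORT);

    if (argc == 2 && strcmp(argv[1], "server") == 0) {
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(s, (struct sockaddr *) &addr, sizeof(addr));
        while (1) {   /* echo every datagram back to its sender */
            struct sockaddr_in from;
            socklen_t len = sizeof(from);
            ssize_t n = recvfrom(s, buf, sizeof(buf), 0,
                                 (struct sockaddr *) &from, &len);
            if (n > 0)
                sendto(s, buf, n, 0, (struct sockaddr *) &from, len);
        }
    } else if (argc == 3 && strcmp(argv[1], "client") == 0) {
        struct hostent *h = gethostbyname(argv[2]);
        if (h == NULL) {
            fprintf(stderr, "unknown host: %s\n", argv[2]);
            return 1;
        }
        memcpy(&addr.sin_addr, h->h_addr_list[0], h->h_length);
        sendto(s, "ping", 4, 0, (struct sockaddr *) &addr, sizeof(addr));
        /* recv() blocks forever if the datagram (or its echo) is dropped */
        if (recv(s, buf, sizeof(buf), 0) > 0)
            printf("got UDP echo back; UDP between these nodes looks ok\n");
    } else {
        fprintf(stderr, "usage: %s server | client <host>\n", argv[0]);
        return 1;
    }
    return 0;
}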

I tested OMPI and it seems to work fine, except that stdout doesn't
appear from remote nodes, which is also really odd.

All this suggests that the nodes themselves may be set up oddly.

FWIW, Josh Hursey added BLCR support to the OMPI trunk not long ago.
He still owes me FAQ text about it... :-)

On May 24, 2007, at 10:23 PM, K. Charoenpornwattana Ter wrote:

> Not now, but I will definitely try OpenMPI.
>
> I need LAM/MPI to test our framework. We've developed a BLCR-based
> fault-tolerance framework which (I think) only works with LAM/MPI.
>
> Kulathep
>
> On 5/24/07, McCalla, Mac <macmccalla_at_[hidden]> wrote:
> Out of curiosity, since you're thinking about clearing the decks,
> have you considered trying OpenMPI instead of LAM?
>
> Regards,
>
> mac
>
> From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On
> Behalf Of K. Charoenpornwattana Ter
> Sent: 24 May 2007 20:44
>
> To: General LAM/MPI mailing list
> Subject: Re: LAM: lamboot is ok, mpirun is not
>
> Yes,
>
> [ter_at_uftoscar test]$ mpirun -np 14 -v ring.out
> 17119 ring.out running on n0 (o)
> <freeze>
>
> Ummm, I guess I will just remove everything and install it again.
>
> Thanks anyway,
> Kulathep
>
> On 5/24/07, McCalla, Mac <macmccalla_at_[hidden]> wrote:
> Sorry, I see you did that earlier. Have you tried mpirun with the
> -v parameter as well?
>
> From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On
> Behalf Of K. Charoenpornwattana Ter
> Sent: 24 May 2007 19:57
>
> To: General LAM/MPI mailing list
> Subject: Re: LAM: lamboot is ok, mpirun is not
>
> [ter_at_uftoscar ~]$ which mpirun
> /opt/lam-7.1.3/bin/mpirun
> [ter_at_uftoscar ~]$ cexec which mpirun
> ************************* oscar_cluster *************************
> --------- oscarnode1---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode2---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode3---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode4---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode5---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode6---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode7---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode8---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode9---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode10---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode11---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode12---------
> /opt/lam-7.1.3/bin/mpirun
> --------- oscarnode13---------
> /opt/lam-7.1.3/bin/mpirun
>
> Thanks
>
> On 5/24/07, McCalla, Mac <macmccalla_at_[hidden]> wrote:
> Hi,
> just for grins, what does "which mpirun" show? ......
>
> mac mccalla
>
> From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On
> Behalf Of K. Charoenpornwattana Ter
> Sent: 24 May 2007 14:47
> To: General LAM/MPI mailing list
> Subject: Re: LAM: lamboot is ok, mpirun is not
>
> On 5/24/07, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> That is just weird -- I don't think I've seen a case where tping
> worked (implying that inter-lamd communication is working), but
> running applications did not.
>
> Yes, it's kinda weird. I just noticed something: after running
> mpirun, tping doesn't work anymore. See below.
>
> [ter_at_uftoscar test]$ lamboot -v host
> LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University
>
> n-1<12514> ssi:boot:base:linear: booting n0 (uftoscar)
> ...
> n-1<12514> ssi:boot:base:linear: finished
> [ter_at_uftoscar test]$ tping -c 3 n0-13
> 1 byte from 13 remote nodes and 1 local node: 0.007 secs
> 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> 1 byte from 13 remote nodes and 1 local node: 0.006 secs
>
> 3 messages, 3 bytes (0.003K), 0.017 secs (0.340K/sec)
> roundtrip min/avg/max: 0.005/0.006/0.007
> [ter_at_uftoscar test]$ mpicc ring.c -o ring.out <--- LAM's mpicc
> [ter_at_uftoscar test]$ mpirun -np 13 ring.out
> <freeze> (so I pressed Ctrl-C to cancel)
>
> ********************* WARNING ***********************
> This is a vulnerable region. Exiting the application
> now may lead to improper cleanup of temporary objects
> To exit the application, press Ctrl-C again
> ********************* WARNING ************************
> [ter_at_uftoscar test]$ tping -c 3 n0-13
> <freeze> :-(
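>
> (For reference, ring.c is basically the standard token-ring test;
> from memory it looks roughly like the sketch below, though the
> exact source may differ.)
>
> /* ring.c: pass a token once around all the ranks (rough sketch) */
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     int rank, size, token;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     if (rank == 0) {
>         token = 42;   /* arbitrary payload */
>         MPI_Send(&token, 1, MPI_INT, 1 % size, 0, MPI_COMM_WORLD);
>         MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
>                  MPI_STATUS_IGNORE);
>         printf("token went around %d ranks ok\n", size);
>     } else {
>         /* wait for the token from the left neighbor, pass it right */
>         MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
>                  MPI_STATUS_IGNORE);
>         MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0,
>                  MPI_COMM_WORLD);
>     }
>     MPI_Finalize();
>     return 0;
> }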
>
> The only thing that I can think of is that there is some firewalling
> in place that doesn't allow arbitrary UDP traffic through...? (inter-
> lamd traffic is UDP, not TCP) That doesn't seem to make sense,
> though, if MPICH works (cexec uses ssh, which is most certainly
> allowed). But can you triple check that there are no firewall
> rules in place that restrict UDP/TCP traffic? (e.g., iptables)
>
> I did. No firewall is running on any node.
>
> [root_at_uftoscar ~]# service iptables status
> Firewall is stopped.
> [root_at_uftoscar ~]# service pfilter status
> pfilter is stopped
> [root_at_uftoscar ~]# cexec service iptables status
> ************************* oscar_cluster *************************
> --------- oscarnode1---------
> Firewall is stopped.
> .....
> --------- oscarnode13---------
> Firewall is stopped.
>
> [root_at_uftoscar ~]# cexec service pfilter status <-- I already
> removed pfilter.
> ************************* oscar_cluster *************************
> --------- oscarnode1---------
> pfilter: unrecognized service
> ....
> --------- oscarnode13---------
> pfilter: unrecognized service
>
> Also try running tping / mpirun / lamexec from a node other than the
> origin (i.e., the node you lambooted from).
>
> I did. Same problem.
>
> On May 23, 2007, at 11:32 PM, K. Charoenpornwattana Ter wrote:
>
> > Try some simple tests:
> >
> > - Does "tping -c 3" run successfully? (It should ping all the
> lamd's)
> >
> > [ter_at_uftoscar test]$ tping -c 3 n0-13
> > 1 byte from 13 remote nodes and 1 local node: 0.006 secs
> > 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> > 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> >
> > 3 messages, 3 bytes (0.003K), 0.016 secs (0.368K/sec)
> > roundtrip min/avg/max: 0.005/0.005/0.006
> >
> >
> > - Does "lamexec N hostname" run successfully? (It should run
> > "hostname" on all the booted nodes)
> >
> > No, it doesn't work. It only shows the headnode's hostname. See below:
> >
> > [ter_at_uftoscar ~]$ lamexec N hostname
> > uftoscar.latech
> > <freeze>
> >
> > I, however, can execute "cexec hostname" with no problem.
> >
> > - When you "mpirun -np 15 ring.out", do you see ring.out
> > executing on
> > all the nodes? (i.e., if you ssh into each of the nodes and run ps,
> > do you see it running?)
> >
> > I only see one ring.out running on the headnode, and none running
> > on the other nodes.
> >
> >
> > Thanks
> > Kulathep
>
> --
> Jeff Squyres
> Cisco Systems

-- 
Jeff Squyres
Cisco Systems