Not now, but I will definitely try OpenMPI.
I need LAM/MPI to test our framework. We've developed a BLCR-based
fault-tolerance framework which (I think) only works with LAM/MPI.
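
For reference, the checkpoint/restart flow we depend on looks roughly like
this (a sketch only: it assumes LAM 7.x was built with the blcr cr module
and that the BLCR kernel modules are loaded, and the SSI parameter names
are as described in the LAM 7.x user's guide, so double-check them on your
install):

[ter_at_uftoscar test]$ laminfo | grep -i cr      <-- confirm the blcr module is built in
[ter_at_uftoscar test]$ mpirun -np 14 -ssi rpi crtcp -ssi cr blcr ring.out &
[ter_at_uftoscar test]$ cr_checkpoint <mpirun-pid>     <-- <mpirun-pid> is a placeholder; BLCR writes context.<pid>
[ter_at_uftoscar test]$ lamrestart -ssi cr blcr -ssi cr_blcr_context_file context.<pid>
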
Kulathep
On 5/24/07, McCalla, Mac <macmccalla_at_[hidden]> wrote:
>
> Out of curiosity, since you're thinking about clearing the decks, have
> you considered trying OpenMPI instead of LAM?
>
> Regards,
>
> mac
>
> ------------------------------
> From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On
> Behalf Of K. Charoenpornwattana Ter
> Sent: 24 May 2007 20:44
> To: General LAM/MPI mailing list
> Subject: Re: LAM: lamboot is ok, mpirun is not
>
> Yes,
>
> [ter_at_uftoscar test]$ mpirun -np 14 -v ring.out
> 17119 ring.out running on n0 (o)
> <freeze>
>
> Umm, I guess I will just remove everything and install it again.
>
> Thanks anyway,
> Kulathep
>
> On 5/24/07, McCalla, Mac <macmccalla_at_[hidden]> wrote:
> >
> > Sorry, I see you did that earlier. Have you tried mpirun with the -v
> > parameter as well?
> >
> > ------------------------------
> > From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On
> > Behalf Of K. Charoenpornwattana Ter
> > Sent: 24 May 2007 19:57
> > To: General LAM/MPI mailing list
> > Subject: Re: LAM: lamboot is ok, mpirun is not
> >
> > [ter_at_uftoscar ~]$ which mpirun
> > /opt/lam-7.1.3/bin/mpirun
> > [ter_at_uftoscar ~]$ cexec which mpirun
> > ************************* oscar_cluster *************************
> > --------- oscarnode1---------
> > /opt/lam-7.1.3/bin/mpirun
> > --------- oscarnode2---------
> > /opt/lam-7.1.3/bin/mpirun
> > --------- oscarnode3---------
> > /opt/lam-7.1.3/bin/mpirun
> > --------- oscarnode4---------
> > /opt/lam-7.1.3/bin/mpirun
> > --------- oscarnode5---------
> > /opt/lam-7.1.3/bin/mpirun
> > --------- oscarnode6---------
> > /opt/lam-7.1.3/bin/mpirun
> > --------- oscarnode7---------
> > /opt/lam-7.1.3/bin/mpirun
> > --------- oscarnode8---------
> > /opt/lam-7.1.3/bin/mpirun
> > --------- oscarnode9---------
> > /opt/lam-7.1.3/bin/mpirun
> > --------- oscarnode10---------
> > /opt/lam-7.1.3/bin/mpirun
> > --------- oscarnode11---------
> > /opt/lam-7.1.3/bin/mpirun
> > --------- oscarnode12---------
> > /opt/lam-7.1.3/bin/mpirun
> > --------- oscarnode13---------
> > /opt/lam-7.1.3/bin/mpirun
> >
> > Thanks
> >
> > On 5/24/07, McCalla, Mac <macmccalla_at_[hidden]> wrote:
> > >
> > > Hi,
> > > Just for grins, what does "which mpirun" show?
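> > >
> > > If the paths all match, it might also be worth confirming that the
> > > builds match (a sketch; laminfo prints the LAM version at the top of
> > > its output):
> > >
> > > [ter_at_uftoscar ~]$ laminfo | head -3
> > > [ter_at_uftoscar ~]$ cexec /opt/lam-7.1.3/bin/laminfo | head -3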
> > >
> > > mac mccalla
> > >
> > > ------------------------------
> > > From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On
> > > Behalf Of K. Charoenpornwattana Ter
> > > Sent: 24 May 2007 14:47
> > > To: General LAM/MPI mailing list
> > > Subject: Re: LAM: lamboot is ok, mpirun is not
> > >
> > > On 5/24/07, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> > > >
> > > > That is just weird -- I don't think I've seen a case where tping
> > > > worked (implying that inter-lamd communication is working), but
> > > > running applications did not.
> > >
> > >
> > > Yes, it's kinda weird. I just noticed something: after running mpirun,
> > > tping doesn't work anymore. See below.
> > >
> > > [ter_at_uftoscar test]$ lamboot -v host
> > > LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University
> > >
> > > n-1<12514> ssi:boot:base:linear: booting n0 (uftoscar)
> > > ...
> > > n-1<12514> ssi:boot:base:linear: finished
> > > [ter_at_uftoscar test]$ tping -c 3 n0-13
> > > 1 byte from 13 remote nodes and 1 local node: 0.007 secs
> > > 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> > > 1 byte from 13 remote nodes and 1 local node: 0.006 secs
> > >
> > > 3 messages, 3 bytes (0.003K), 0.017 secs (0.340K/sec)
> > > roundtrip min/avg/max: 0.005/0.006/0.007
> > > [ter_at_uftoscar test]$ mpicc ring.c -o ring.out    <--- LAM's mpicc
> > > [ter_at_uftoscar test]$ mpirun -np 13 ring.out
> > > <freeze> (so I pressed Ctrl-C to cancel)
> > >
> > > ********************* WARNING ***********************
> > > This is a vulnerable region. Exiting the application
> > > now may lead to improper cleanup of temporary objects
> > > To exit the application, press Ctrl-C again
> > > ********************* WARNING ************************
> > > [ter_at_uftoscar test]$ tping -c 3 n0-13
> > > <freeze> :-(
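> > >
> > > When it gets wedged like that, a full reset of the LAM run-time usually
> > > clears it (a sketch; lamclean and lamwipe are the standard LAM cleanup
> > > tools, but double-check the exact usage on your install):
> > >
> > > [ter_at_uftoscar test]$ lamclean -v       <-- kill leftover app processes/messages
> > > [ter_at_uftoscar test]$ lamwipe -v host   <-- tear down all the lamds
> > > [ter_at_uftoscar test]$ lamboot -v host   <-- boot fresh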
> > >
> > > > The only thing that I can think of is that there is some firewalling
> > > > in place that only allows arbitrary UDP traffic through...? (inter-
> > > > lamd traffic is UDP, not TCP) That doesn't seem to make sense,
> > > > though, if MPICH works (cexec uses ssh, which is most certainly
> > > > allowed). But can you triple check that there are no firewall
> > > > rules in place that restrict UDP/TCP traffic? (e.g., iptables)
> > >
> > >
> > > I did. No firewall is running on any node.
> > >
> > > [root_at_uftoscar ~]# service iptables status
> > > Firewall is stopped.
> > > [root_at_uftoscar ~]# service pfilter status
> > > pfilter is stopped
> > > [root_at_uftoscar ~]# cexec service iptables status
> > > ************************* oscar_cluster *************************
> > > --------- oscarnode1---------
> > > Firewall is stopped.
> > > .....
> > > --------- oscarnode13---------
> > > Firewall is stopped.
> > >
> > > [root_at_uftoscar ~]# cexec service pfilter status    <-- I already removed pfilter.
> > > ************************* oscar_cluster *************************
> > > --------- oscarnode1---------
> > > pfilter: unrecognized service
> > > ....
> > > --------- oscarnode13---------
> > > pfilter: unrecognized service
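> > >
> > > One caveat: "service iptables status" only reports the init script's
> > > state; dumping the actual rule chains is a stronger check (a sketch):
> > >
> > > [root_at_uftoscar ~]# iptables -L -n        <-- chains should be empty, policy ACCEPT
> > > [root_at_uftoscar ~]# cexec iptables -L -n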
> > >
> > >
> > > > Also try running tping / mpirun / lamexec from a node other than the
> > > > origin (i.e., the node you lambooted from).
> > >
> > >
> > > I did. Same problem.
> > >
> > > On May 23, 2007, at 11:32 PM, K. Charoenpornwattana Ter wrote:
> > > >
> > > > > Try some simple tests:
> > > > >
> > > > > - Does "tping -c 3" run successfully? (It should ping all the
> > > > lamd's)
> > > > >
> > > > > [ter_at_uftoscar test]$ tping -c 3 n0-13
> > > > > 1 byte from 13 remote nodes and 1 local node: 0.006 secs
> > > > > 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> > > > > 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> > > > >
> > > > > 3 messages, 3 bytes (0.003K), 0.016 secs (0.368K/sec)
> > > > > roundtrip min/avg/max: 0.005/0.005/0.006
> > > > >
> > > > >
> > > > > - Does "lamexec N hostname" run successfully? (It should run
> > > > > "hostname" on all the booted nodes)
> > > > >
> > > > > No, it doesn't work. It only shows the head node's hostname. See below:
> > > > >
> > > > > [ter_at_uftoscar ~]$ lamexec N hostname
> > > > > uftoscar.latech
> > > > > <freeze>
> > > > >
> > > > > I, however, can execute "cexec hostname" with no problem.
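> > > > >
> > > > > If it helps narrow things down, lamexec takes node ranges like mpirun
> > > > > does, so the hang can be bisected to a particular node (a sketch; node
> > > > > syntax per the LAM man pages):
> > > > >
> > > > > [ter_at_uftoscar ~]$ lamexec n0 hostname     <-- local node only
> > > > > [ter_at_uftoscar ~]$ lamexec n1 hostname     <-- first remote node
> > > > > [ter_at_uftoscar ~]$ lamexec n0-6 hostname   <-- first half of the cluster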
> > > > >
> > > > > - When you "mpirun -np 15 ring.out", do you see ring.out executing on
> > > > > all the nodes? (i.e., if you ssh into each of the nodes and run ps,
> > > > > do you see it running?)
> > > > >
> > > > > I only see one ring.out running on the head node; there is no
> > > > > ring.out running on the other nodes.
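> > > > >
> > > > > To double-check from the shell side, something like this should show
> > > > > whether any remote ranks ever started (a sketch, reusing the cexec that
> > > > > already works here; the [r] keeps grep from matching itself):
> > > > >
> > > > > [ter_at_uftoscar ~]$ cexec 'ps -ef | grep [r]ing.out'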
> > > > >
> > > > >
> > > > > Thanks
> > > > > Kulathep
> > > >
> > > >
> > > > --
> > > > Jeff Squyres
> > > > Cisco Systems
> > > >
> > > >
> > >
> > >
> > >
> >
> >
> >
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>