
LAM/MPI General User's Mailing List Archives


From: McCalla, Mac (macmccalla_at_[hidden])
Date: 2007-05-24 22:09:15


Out of curiosity, since you're thinking about clearing the decks, have
you considered trying Open MPI instead of LAM?
 
Regards,
 
mac

________________________________

From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]]
On Behalf Of K. Charoenpornwattana Ter
Sent: 24 May 2007 20:44
To: General LAM/MPI mailing list
Subject: Re: LAM: lamboot is ok, mpirun is not

Yes,

[ter_at_uftoscar test]$ mpirun -np 14 -v ring.out
17119 ring.out running on n0 (o)
<freeze>
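(For context: a ring test like ring.out passes a token around all the
ranks, each process sending to rank (r+1) mod np and receiving from
(r-1) mod np, so a single blocked link hangs every process. A plain-Python
sketch of the token's route; this is only an illustration of the pattern,
not the actual ring.c, which isn't shown in the thread:)

```python
def ring_route(np):
    """Return the token's path around an np-process ring,
    starting and ending at rank 0, as an MPI ring test does."""
    path = [0]
    rank = 0
    for _ in range(np):
        rank = (rank + 1) % np   # each rank forwards to its right neighbor
        path.append(rank)
    return path

# With 4 processes the token visits every rank once and returns home:
print(ring_route(4))  # [0, 1, 2, 3, 0]
```

Because every rank blocks waiting on its left neighbor, one unreachable
node is enough to freeze the whole run, which matches the symptom above.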

Ummm, I guess I will just remove everything and install it again.

Thanks anyway,
Kulathep

On 5/24/07, McCalla, Mac <macmccalla_at_[hidden]> wrote:

        Sorry, I see you did that earlier. Have you tried mpirun
        with the -v parameter as well?

________________________________

        From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]]
        On Behalf Of K. Charoenpornwattana Ter
        Sent: 24 May 2007 19:57
        
        To: General LAM/MPI mailing list
        Subject: Re: LAM: lamboot is ok, mpirun is not
        

        
        [ter_at_uftoscar ~]$ which mpirun
        /opt/lam-7.1.3/bin/mpirun
        [ter_at_uftoscar ~]$ cexec which mpirun
        ************************* oscar_cluster *************************
        --------- oscarnode1---------
        /opt/lam-7.1.3/bin/mpirun
        --------- oscarnode2---------
        /opt/lam-7.1.3/bin/mpirun
        --------- oscarnode3---------
        /opt/lam-7.1.3/bin/mpirun
        --------- oscarnode4---------
        /opt/lam-7.1.3/bin/mpirun
        --------- oscarnode5---------
        /opt/lam-7.1.3/bin/mpirun
        --------- oscarnode6---------
        /opt/lam-7.1.3/bin/mpirun
        --------- oscarnode7---------
        /opt/lam-7.1.3/bin/mpirun
        --------- oscarnode8---------
        /opt/lam-7.1.3/bin/mpirun
        --------- oscarnode9---------
        /opt/lam-7.1.3/bin/mpirun
        --------- oscarnode10---------
        /opt/lam-7.1.3/bin/mpirun
        --------- oscarnode11---------
        /opt/lam-7.1.3/bin/mpirun
        --------- oscarnode12---------
        /opt/lam-7.1.3/bin/mpirun
        --------- oscarnode13---------
        /opt/lam-7.1.3/bin/mpirun
        
        Thanks
        
        
        On 5/24/07, McCalla, Mac <macmccalla_at_[hidden]> wrote:

                Hi,
                    just for grins, what does "which mpirun" show? ......
                 
                mac mccalla

________________________________

                From: lam-bounces_at_[hidden]
                [mailto:lam-bounces_at_[hidden]] On Behalf Of
                K. Charoenpornwattana Ter
                Sent: 24 May 2007 14:47
                To: General LAM/MPI mailing list
                Subject: Re: LAM: lamboot is ok, mpirun is not
                
                
                
                On 5/24/07, Jeff Squyres <jsquyres_at_[hidden]> wrote:
                

                        That is just weird -- I don't think I've seen a
                        case where tping worked (implying that
                        inter-lamd communication is working), but
                        running applications did not.

                Yes, it's kinda weird. I just noticed something: after
                running mpirun, tping doesn't work anymore. See below.
                
                [ter_at_uftoscar test]$ lamboot -v host
                LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University
                
                n-1<12514> ssi:boot:base:linear: booting n0 (uftoscar)
                ...
                n-1<12514> ssi:boot:base:linear: finished
                [ter_at_uftoscar test]$ tping -c 3 n0-13
                  1 byte from 13 remote nodes and 1 local node: 0.007 secs
                  1 byte from 13 remote nodes and 1 local node: 0.005 secs
                  1 byte from 13 remote nodes and 1 local node: 0.006 secs
                
                3 messages, 3 bytes (0.003K), 0.017 secs (0.340K/sec)
                roundtrip min/avg/max: 0.005/0.006/0.007
                [ter_at_uftoscar test]$ mpicc ring.c -o ring.out   <--- LAM's mpicc
                [ter_at_uftoscar test]$ mpirun -np 13 ring.out
                <freeze> (so I pressed Ctrl-C to cancel)
                
                ********************* WARNING ***********************
                This is a vulnerable region. Exiting the application
                now may lead to improper cleanup of temporary objects
                To exit the application, press Ctrl-C again
                ********************* WARNING ************************
                [ter_at_uftoscar test]$ tping -c 3 n0-13
                <freeze> :-(
                

                        The only thing that I can think of is that there
                        is some firewalling in place that only allows
                        arbitrary UDP traffic through...? (inter-lamd
                        traffic is UDP, not TCP) That doesn't seem to
                        make sense, though, if MPICH works (cexec uses
                        ssh, which is most certainly allowed). But can
                        you triple check that there are no firewall
                        rules in place that restrict UDP/TCP traffic?
                        (e.g., iptables)
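(Since inter-lamd traffic is UDP, a quick end-to-end datagram test can
rule out filtering independently of what iptables reports. A minimal
Python sketch, run here over loopback so it is self-contained; on a real
cluster the listener half would run on a remote node instead:)

```python
import socket

def udp_delivery_check(host="127.0.0.1", payload=b"ping"):
    """Send one UDP datagram and confirm it arrives intact."""
    # Listener side (plays the remote lamd's role)
    srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    srv.bind((host, 0))              # bind to any free port
    port = srv.getsockname()[1]
    srv.settimeout(2.0)

    # Sender side (plays the origin node's role)
    cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    cli.sendto(payload, (host, port))

    data, _ = srv.recvfrom(1024)
    srv.close()
    cli.close()
    return data == payload

print(udp_delivery_check())  # True when nothing filters UDP on the path
```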

                I did. No firewall is running on any of the nodes.
                
                [root_at_uftoscar ~]# service iptables status
                Firewall is stopped.
                [root_at_uftoscar ~]# service pfilter status
                pfilter is stopped
                [root_at_uftoscar ~]# cexec service iptables status
                ************************* oscar_cluster *************************
                --------- oscarnode1---------
                Firewall is stopped.
                .....
                --------- oscarnode13---------
                Firewall is stopped.
                
                [root_at_uftoscar ~]# cexec service pfilter status  <-- I already removed pfilter.
                ************************* oscar_cluster *************************
                --------- oscarnode1---------
                pfilter: unrecognized service
                ....
                --------- oscarnode13---------
                pfilter: unrecognized service
                 
                

                        Also try running tping / mpirun / lamexec from
                        a node other than the origin (i.e., the node
                        you lambooted from).

                I did. Same problem.
                

                        On May 23, 2007, at 11:32 PM,
                        K. Charoenpornwattana Ter wrote:
                        
> Try some simple tests:
>
> - Does "tping -c 3" run successfully? (It should
> ping all the lamd's)
>
> [ter_at_uftoscar test]$ tping -c 3 n0-13
> 1 byte from 13 remote nodes and 1 local node: 0.006 secs
> 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> 1 byte from 13 remote nodes and 1 local node: 0.005 secs
>
> 3 messages, 3 bytes (0.003K), 0.016 secs (0.368K/sec)
> roundtrip min/avg/max: 0.005/0.005/0.006
>
>
> - Does "lamexec N hostname" run successfully? (It
> should run "hostname" on all the booted nodes)
>
> No, it doesn't work. It only shows the headnode's
> hostname. See below:
>
> [ter_at_uftoscar ~]$ lamexec N hostname
> uftoscar.latech
> <freeze>
>
> I, however, can execute "cexec hostname" with no problem.
>
> - When you "mpirun -np 15 ring.out", do you see
> ring.out executing on all the nodes? (i.e., if you
> ssh into each of the nodes and run ps, do you see
> it running?)
>
> I only see one ring.out running on the headnode,
> no ring.out running on the other nodes.
>
>
> Thanks
> Kulathep
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
                        
                        
                        --
                        Jeff Squyres
                        Cisco Systems
                        
                        

                
