LAM/MPI General User's Mailing List Archives

From: McCalla, Mac (macmccalla_at_[hidden])
Date: 2007-05-24 20:54:03


Hi,
    Just for grins, what does "which mpirun" show?
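
The point of the question: if more than one MPI implementation is installed (MPICH is mentioned later in this thread), the mpirun found first on PATH may not belong to the same installation as lamboot and mpicc. A minimal Python sketch of that check follows. It assumes the LAM tools are expected to live in a single directory; the helper names resolve_tools and install_dirs are hypothetical and are not part of LAM.

```python
import os
import shutil


def resolve_tools(names):
    """Map each tool name to the path found first on PATH (None if absent)."""
    return {name: shutil.which(name) for name in names}


def install_dirs(resolved):
    """Return the set of directories that the resolved tools live in."""
    return {os.path.dirname(path) for path in resolved.values() if path}


if __name__ == "__main__":
    # Tool names taken from the thread; adjust the list as needed.
    found = resolve_tools(["lamboot", "mpicc", "mpirun", "tping"])
    for name, path in found.items():
        print(f"{name}: {path or 'not found on PATH'}")
    if len(install_dirs(found)) > 1:
        print("Tools resolve to different directories; PATH may mix MPI installs.")
```

More than one directory in the result does not prove a problem (symlinks and wrapper scripts are common), but it is a cheap first thing to rule out.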
 
mac mccalla

________________________________

From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On Behalf Of K. Charoenpornwattana Ter
Sent: 24 May 2007 14:47
To: General LAM/MPI mailing list
Subject: Re: LAM: lamboot is ok, mpirun is not

On 5/24/07, Jeff Squyres <jsquyres_at_[hidden]> wrote:

        That is just weird -- I don't think I've seen a case where tping
        worked (implying that inter-lamd communication is working), but
        running applications did not.

Yes, it's kinda weird. I just noticed something: after running mpirun,
tping doesn't work anymore. See below.

[ter_at_uftoscar test]$ lamboot -v host
LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University

n-1<12514> ssi:boot:base:linear: booting n0 (uftoscar)
...
n-1<12514> ssi:boot:base:linear: finished
[ter_at_uftoscar test]$ tping -c 3 n0-13
  1 byte from 13 remote nodes and 1 local node: 0.007 secs
  1 byte from 13 remote nodes and 1 local node: 0.005 secs
  1 byte from 13 remote nodes and 1 local node: 0.006 secs

3 messages, 3 bytes (0.003K), 0.017 secs (0.340K/sec)
roundtrip min/avg/max: 0.005/0.006/0.007
[ter_at_uftoscar test]$ mpicc ring.c -o ring.out   <--- LAM's mpicc
[ter_at_uftoscar test]$ mpirun -np 13 ring.out
<freeze> (so I pressed Ctrl-C to cancel)

********************* WARNING ***********************
This is a vulnerable region. Exiting the application
now may lead to improper cleanup of temporary objects
To exit the application, press Ctrl-C again
********************* WARNING ************************
[ter_at_uftoscar test]$ tping -c 3 n0-13
<freeze> :-(

        The only thing that I can think of is that there is some firewalling
        in place that only allows arbitrary UDP traffic through...? (inter-lamd
        traffic is UDP, not TCP) That doesn't seem to make sense,
        though, if MPICH works (cexec uses ssh, which is most certainly
        allowed). But can you triple check that there are no firewall or tcp
        rules in place that restrict UDP/TCP traffic? (e.g., iptables)

I did. No firewall is running on any node.

[root_at_uftoscar ~]# service iptables status
Firewall is stopped.
[root_at_uftoscar ~]# service pfilter status
pfilter is stopped
[root_at_uftoscar ~]# cexec service iptables status
************************* oscar_cluster *************************
--------- oscarnode1---------
Firewall is stopped.
.....
--------- oscarnode13---------
Firewall is stopped.

[root_at_uftoscar ~]# cexec service pfilter status   <-- I already removed pfilter.
************************* oscar_cluster *************************
--------- oscarnode1---------
pfilter: unrecognized service
....
--------- oscarnode13---------
pfilter: unrecognized service
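
Since inter-lamd traffic is UDP, one further check beyond "service iptables status" is to send a datagram between two nodes on an arbitrary port and see whether it comes back. Below is a minimal Python sketch, assuming Python is available on the nodes. The function names and the use of an OS-chosen port are hypothetical choices for illustration and have nothing to do with how lamd picks its ports. Note that a successful tping already suggests lamd-to-lamd UDP works, so this only tests whether other UDP ports pass.

```python
import socket
import threading


def udp_echo_once(sock):
    """Receive one datagram on a bound socket and send it back to the sender."""
    data, addr = sock.recvfrom(1024)
    sock.sendto(data, addr)


def udp_roundtrip(host, port, payload=b"ping", timeout=2.0):
    """Return True if a datagram sent to (host, port) is echoed back in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(payload, (host, port))
            data, _ = s.recvfrom(1024)
        except OSError:
            # Timeouts and name lookup failures are both OSError subclasses.
            return False
        return data == payload


if __name__ == "__main__":
    # Loopback demonstration: bind to an OS-chosen free port, echo one datagram.
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))
    port = server.getsockname()[1]
    worker = threading.Thread(target=udp_echo_once, args=(server,), daemon=True)
    worker.start()
    print("UDP round trip ok:", udp_roundtrip("127.0.0.1", port))
    worker.join(timeout=2.0)
    server.close()
```

On the cluster, the echo side would bind to the node's own address on one machine and udp_roundtrip would be called from another; a False result on a port that should be open would point at filtering somewhere between the two.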
 

        Also try running tping / mpirun / lamexec from a node other than
        the origin (i.e., the node you lambooted from).

I did. Same problem.

        On May 23, 2007, at 11:32 PM, K. Charoenpornwattana Ter wrote:
        
> Try some simple tests:
>
> - Does "tping -c 3" run successfully? (It should ping all the lamd's)
>
> [ter_at_uftoscar test]$ tping -c 3 n0-13
> 1 byte from 13 remote nodes and 1 local node: 0.006 secs
> 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> 1 byte from 13 remote nodes and 1 local node: 0.005 secs
>
> 3 messages, 3 bytes (0.003K), 0.016 secs (0.368K/sec)
> roundtrip min/avg/max: 0.005/0.005/0.006
>
>
> - Does "lamexec N hostname" run successfully? (It should run
> "hostname" on all the booted nodes)
>
> No, it doesn't work. It only shows the headnode's hostname. See below:
>
> [ter_at_uftoscar ~]$ lamexec N hostname
> uftoscar.latech
> <freeze>
>
> I, however, can execute "cexec hostname" with no problem.
>
> - When you "mpirun -np 15 ring.out", do you see ring.out executing on
> all the nodes? (i.e., if you ssh into each of the nodes and run ps,
> do you see it running?)
>
> I only see one ring.out running on the headnode; no ring.out is
> running on the other nodes.
>
>
> Thanks
> Kulathep
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
        
        
        --
        Jeff Squyres
        Cisco Systems
        