LAM/MPI General User's Mailing List Archives

From: Ethan Deneault (edeneault_at_[hidden])
Date: 2009-09-30 16:28:47


Hello,

I'm somewhat new here, and a cursory search of this mailing list's
archives has netted me little to no enlightenment. Apologies if this is
an obvious FAQ.

I'm running tests on a cluster of 10 machines, and I am greatly
concerned that the nodes are not communicating correctly. Right now, I
am testing two nodes which I will refer to as "master" and "slave". Both
are running Debian lenny and LAM 7.1.2.

I have set up NTP on both machines such that master gets its time from
the debian time server, and slave gets its time from master.

Master's home directory is exported, and is mounted on slave:
master:/home /home rw 0 0

I have my rsh hostkeys set up so that I don't have to login with a
password to any of the cluster machines.

Now, I'd like to run a test program that I downloaded to test the
cluster. This program is designed to have the following output:

$ mpirun N pi.o
Process 0 of 2 on master
pi is approximately 3.1415926535899814, Error is 0.0000000000001883
wall clock time = 1.634312
Process 1 of 2 on slave
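
For readers who have not seen this kind of test: the output format suggests
that pi.o resembles the classic MPI pi example, in which each process
integrates 4/(1+x^2) over its own share of the intervals and process 0 sums
the partial results. That is an assumption, since the source of pi.o is not
shown here. Below is a minimal sketch of that decomposition in plain Python,
with the ranks simulated in a loop instead of real MPI calls; the names
partial_sum and estimate_pi and the interval count are hypothetical.

```python
import math


def partial_sum(rank, size, n):
    """Midpoint-rule contribution of one simulated rank.

    The rank handles intervals rank, rank + size, rank + 2*size, ...
    (hypothetical helper, not taken from pi.o).
    """
    h = 1.0 / n
    total = 0.0
    for i in range(rank, n, size):
        x = h * (i + 0.5)          # midpoint of interval i
        total += 4.0 / (1.0 + x * x)
    return h * total


def estimate_pi(n, size):
    """Sum the partial results of all ranks.

    This stands in for the reduction step that a real MPI program
    would perform on process 0.
    """
    return sum(partial_sum(rank, size, n) for rank in range(size))


if __name__ == "__main__":
    n = 10000  # assumed interval count, chosen only for illustration
    for size in (1, 2):
        pi = estimate_pi(n, size)
        print(f"{size} process(es): pi is approximately {pi:.16f}, "
              f"Error is {abs(pi - math.pi):.16f}")
```

In a program of this shape the numerical result depends on the interval
count, not on which node prints its "Process i of N" line, so a correct
value of pi does not by itself show that output from every node is reaching
the terminal.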

So, I connect the servers using recon and lamboot:
$ recon -v twonodes
n-1<2779> ssi:boot:base:linear: booting n0 (master)
n-1<2779> ssi:boot:base:linear: booting n1 (slave)
n-1<2779> ssi:boot:base:linear: finished

$ lamboot -v twonodes

LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University

n-1<2783> ssi:boot:base:linear: booting n0 (master)
n-1<2783> ssi:boot:base:linear: booting n1 (slave)
n-1<2783> ssi:boot:base:linear: finished

And then, run the program:
$ mpirun N pi.o
Process 0 of 2 on master
pi is approximately 3.1415926535899814, Error is 0.0000000000001883
wall clock time = 0.391949

---
Slave does not check in. So I try to run the test program on master alone:
$ mpirun n0 pi.o
Process 0 of 1 on master
pi is approximately 3.1415926037850412, Error is 0.0000000498047519
wall clock time = 0.782638

And then try to run it on slave alone:
$ mpirun n1 pi.o
$

I have a term open to slave, watching top. When I run the process, I see
that the program -is- running on slave, but there is no output to stdout
on master coming from slave.

Unfortunately, I'm at a loss to say much more. Every test program I have
tried has very much the same problem: output from the slave does not
appear correctly. I don't have cause to doubt the code - I'm downloading
sample MPI programs from beowulf websites so I am assured of what the
output should look like. I don't know enough about clustering (yet!) to
know any better question to ask, so I hope that there is someone out
there that can see what the (probably simple) solution is.

Cheers,
Ethan
-- 
Dr. Ethan Deneault
Assistant Professor of Physics
SC-234
University of Tampa
Tampa, FL 33615
Office: (813) 257-3555