LAM/MPI General User's Mailing List Archives


From: Craig Lam (craig.mpi_at_[hidden])
Date: 2005-06-25 21:08:01


Jeff,

I just realized that I forgot to answer all of your questions.
Adding fflush(stdout); does not alleviate the problem.
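
For reference, this is roughly the variant I tried (a sketch, not my exact
test program):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int mpi_comm_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_comm_rank);

    printf("Comm rank %d reporting.\n", mpi_comm_rank);
    fflush(stdout);   /* explicit flush -- still only rank 0's line shows up */

    /* MPI_Barrier(MPI_COMM_WORLD);  <- the barrier before Finalize you suggested */
    MPI_Finalize();
    return 0;
}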

Thanks,
Craig

On 6/25/05, Craig Lam <craig.mpi_at_[hidden]> wrote:
> Jeff,
>
> Thank you again for your response. I'm a bit baffled by this myself.
> I've been poring through the source code for LAM in an attempt to
> understand the stdout redirection, but I'm afraid this will probably
> take quite some time. I've included the output of configure (both the
> log and the output to stdout/err piped together to
> configure.std_output.log.gz) as attachments to this email. You're most
> likely looking for the line "checking fd passing using RFC2292 API...
> passed" in the configure output to stdout, which I saw just fine. Is
> there any resource that describes how the standard out redirection
> occurs in natural language so that I could understand this quickly?
>
> Thanks again,
> Craig Casey
>
>
> On 6/25/05, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> > This is, indeed, quite odd.
> >
> > LAM's stdout forwarding is a function of the LAM daemons; if the LAM
> > daemons are not working, then your parallel job should not start up at
> > all (I assume you verified that, even though you're not getting stdout,
> > your application is actually running on the remote nodes, such as by
> > touching a file in /tmp, or somesuch?).
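> > (A minimal way to do that check might look like the following sketch;
> > the marker filename is just an example, nothing LAM-specific:)
> >
> > #include <stdio.h>
> > #include <mpi.h>
> >
> > int main(int argc, char* argv[])
> > {
> >     int rank;
> >     char path[64];
> >
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >
> >     /* each rank drops a marker file; check /tmp on every node afterwards */
> >     snprintf(path, sizeof(path), "/tmp/lam_rank_%d_was_here", rank);
> >     FILE *f = fopen(path, "w");
> >     if (f != NULL)
> >         fclose(f);
> >
> >     MPI_Finalize();
> >     return 0;
> > }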
> >
> > I missed this in your first mail, but I can confirm that having a /tmp on
> > all nodes to launch your process from is fine -- it should have nothing
> > to do with this problem. Regardless of whether your app is on local
> > disk or a networked filesystem, it should start fine (see the LAM FAQ
> > for more detail on this issue). Your hello world app also looks fine.
> > Did you try putting an explicit fflush() statement in there,
> > and/or a barrier before Finalize?
> >
> > There is one fairly non-portable aspect of LAM that provides the final
> > mile of the stdout forwarding -- Unix file descriptor passing. We
> > actually have several implementations of this in LAM, and the configure
> > script is supposed to figure out which one to use. Various OS's
> > (including specific Linux distros) have had bugs with respect to this
> > in the past -- can you send the output of configure and the resulting
> > config.log (please compress) so that we can see which was chosen for
> > your system?
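> > (For what it's worth, the "RFC2292 API" style of fd passing boils down
> > to sending an SCM_RIGHTS control message over a Unix domain socket. A
> > generic sketch -- not LAM's actual code -- looks something like this:)
> >
> > #include <string.h>
> > #include <sys/socket.h>
> > #include <sys/uio.h>
> >
> > /* Pass an open file descriptor to another process over a connected
> >    Unix domain socket using SCM_RIGHTS ancillary data. */
> > int send_fd(int sock, int fd)
> > {
> >     struct msghdr msg;
> >     struct iovec iov;
> >     char dummy = 'x';
> >     char cmsgbuf[CMSG_SPACE(sizeof(int))];
> >     struct cmsghdr *cmsg;
> >
> >     memset(&msg, 0, sizeof(msg));
> >     iov.iov_base = &dummy;            /* must carry at least one data byte */
> >     iov.iov_len = 1;
> >     msg.msg_iov = &iov;
> >     msg.msg_iovlen = 1;
> >     msg.msg_control = cmsgbuf;        /* ancillary data carries the fd */
> >     msg.msg_controllen = sizeof(cmsgbuf);
> >
> >     cmsg = CMSG_FIRSTHDR(&msg);
> >     cmsg->cmsg_level = SOL_SOCKET;
> >     cmsg->cmsg_type = SCM_RIGHTS;
> >     cmsg->cmsg_len = CMSG_LEN(sizeof(int));
> >     memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
> >
> >     return (sendmsg(sock, &msg, 0) == 1) ? 0 : -1;
> > }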
> >
> >
> > On Jun 25, 2005, at 9:15 AM, Craig Lam wrote:
> >
> > > Jeff,
> > >
> > > Thanks for the reply. It seems to happen for every case. I've got a
> > > simulator that prints out a bunch of stuff as an extreme case, and
> > > here is another example of a 'hello world' type application. Source
> > > code and output are shown below. (Summary: every rank should print
> > > "Comm rank %d reporting.", but only a single one does, unless I run
> > > more MPI processes than nodes, in which case only the ones on the
> > > local node run.)
> > >
> > > __________________________
> > > Source code:
> > > #include <stdio.h>
> > > #include <mpi.h>
> > >
> > > int main(int argc, char* argv[])
> > > {
> > >     int mpi_comm_rank;
> > >     int mpi_comm_size;
> > >
> > >     MPI_Init(&argc, &argv);
> > >
> > >     MPI_Comm_size(MPI_COMM_WORLD, &mpi_comm_size);
> > >     MPI_Comm_rank(MPI_COMM_WORLD, &mpi_comm_rank);
> > >
> > >     printf("Comm rank %d reporting.\n", mpi_comm_rank);
> > >
> > >     MPI_Finalize();
> > >     return 0;
> > > }
> > >
> > > _________________
> > > OUTPUT
> > > ---------------------
> > > [craig_at_c1 mpi_test]$ mpirun -np 6 mpi_test
> > > Comm rank 0 reporting.
> > > [craig_at_c1 mpi_test]$
> > >
> > >
> > > Any ideas at all are greatly appreciated.
> > >
> > > Thanks,
> > > Craig Casey,
> > > craig.mpi_at_[hidden]
> > >
> > > On 6/25/05, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> > >> Can you give a concrete example of this?
> > >>
> > >> Do you have a lot of stdout from the processes running on the nodes,
> > >> or
> > >> just a little output (and then program termination)?
> > >>
> > >> If it's just a little output, you might want to put explicit fflush()
> > >> statements in your application (I'm assuming that this is a C
> > >> application?).
> > >>
> > >> On Jun 23, 2005, at 11:17 PM, Craig Lam wrote:
> > >>
> > >>> Hello,
> > >>>
> > >>> I've set up a diskless cluster running Fedora Core 3 (modified to
> > >>> allow the diskless cluster nodes to start up). When I run an MPI job,
> > >>> it seems that stdout does not get redirected from the remote nodes
> > >>> correctly, although all local processes' output shows up correctly.
> > >>> Does anyone know why this might be?
> > >>>
> > >>> My system setup is an 8-node dual-Opteron cluster running in 32-bit
> > >>> mode on Linux. Each node has dual InfiniBand over PCI Express
> > >>> (although I am only using one interface currently). My configuration
> > >>> of MPI is done with "./configure --with-debug --prefix=/opt/lam-7.0.6
> > >>> --exec-prefix=/opt/lam-7.0.6 --with-rsh=ssh". The problem exhibits
> > >>> itself on both LAM-7.0.6 and LAM-7.1.1 (I have not tried other
> > >>> versions). My diskless cluster runs NFS version 4, and each cluster
> > >>> node bind-mounts /var/${HOSTNAME}/ to /var and /tmp/${HOSTNAME} to /tmp
> > >>> to give each node an individual copy of these directories (could this
> > >>> contribute to these problems?)
> > >>>
> > >>> I must admit that I am a bit stumped.
> > >>>
> > >>> Thanks for all your thoughts,
> > >>> Craig Casey
> > >>> craig.mpi_at_[hidden]
> > >>>
> > >>> _______________________________________________
> > >>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> > >>>
> > >>
> > >> --
> > >> {+} Jeff Squyres
> > >> {+} jsquyres_at_[hidden]
> > >> {+} http://www.lam-mpi.org/
> > >>
> > >> _______________________________________________
> > >> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> > >>
> > >
> > > _______________________________________________
> > > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> > >
> >
> > --
> > {+} Jeff Squyres
> > {+} jsquyres_at_[hidden]
> > {+} http://www.lam-mpi.org/
> >
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >
>