LAM/MPI General User's Mailing List Archives

From: Craig Lam (craig.mpi_at_[hidden])
Date: 2005-06-25 11:05:56


Jeff,

Thank you again for your response. I'm a bit baffled by this myself.
I've been poring through the LAM source code in an attempt to
understand the stdout redirection, but I'm afraid this will probably
take quite some time. I've attached the configure output to this email
(both the log and the stdout/stderr output, the latter piped together
into configure.std_output.log.gz). You're most likely looking for the
line "checking fd passing using RFC2292 API... passed" in the
configure output to stdout, which I did see. Is there any resource
that describes in plain language how the stdout redirection works, so
that I could come up to speed quickly?

Thanks again,
Craig Casey
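
For anyone else digging through this, the "fd passing" that the configure
check above refers to is the RFC 2292-style ancillary-data socket API: an
open descriptor is handed from one local process to another with sendmsg()
and an SCM_RIGHTS control message. A stripped-down sketch of the sending
side is below; it is illustrative only and is not LAM's actual code (the
function name send_fd() is made up here), but it shows the mechanism the
configure test exercises.

/* Hedged sketch: pass an open file descriptor to another local process
 * over a Unix-domain socket using SCM_RIGHTS ancillary data. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_fd(int sock, int fd_to_pass)
{
    char dummy = 'x';                  /* must send at least one data byte */
    struct iovec iov;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    char ctrl[CMSG_SPACE(sizeof(int))];

    memset(&msg, 0, sizeof(msg));
    memset(ctrl, 0, sizeof(ctrl));

    iov.iov_base = &dummy;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;      /* "these bytes are descriptors" */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    /* One data byte is sent; the kernel duplicates fd_to_pass into the
     * receiving process when it calls recvmsg() on the other end. */
    return (sendmsg(sock, &msg, 0) == 1) ? 0 : -1;
}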

On 6/25/05, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> This is, indeed, quite odd.
>
> LAM's stdout forwarding is a function of the LAM daemons; if the LAM
> daemons are not working, then your parallel job should not start up at
> all (I assume you verified that even though you're not getting stdout,
> that your application is actually running on the remote nodes, such as
> by touching a file in /tmp, or somesuch?).
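
One rough way to do the check Jeff describes (the path and file name below
are purely illustrative) is to have each rank drop a marker file on
whatever node it actually runs on, then look for the files on the remote
nodes afterwards:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char* argv[])
{
    char host[256], path[512];
    int rank;
    FILE *f;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Leave a marker file on whichever node this rank actually ran on. */
    gethostname(host, sizeof(host));
    snprintf(path, sizeof(path), "/tmp/mpi_ran_rank%d_%s", rank, host);
    f = fopen(path, "w");
    if (f != NULL)
        fclose(f);

    MPI_Finalize();
    return 0;
}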
>
> I missed this in your first mail, but I can confirm that having a /tmp on
> all nodes to launch your process from is fine -- it should have nothing
> to do with this problem. Regardless of whether your app is on local
> disk or a networked filesystem, it should start fine (see the LAM FAQ
> for more detail on this issue). Your hello world app also looks fine.
> Did you try putting an explicit fflush() statement in there,
> and/or a barrier before MPI_Finalize?
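
As a rough illustration of that suggestion (not LAM's code, just the hello
world from further down in the thread with the two additions applied), the
program would look something like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
    int mpi_comm_rank;
    int mpi_comm_size;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &mpi_comm_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_comm_rank);

    printf("Comm rank %d reporting.\n", mpi_comm_rank);
    fflush(stdout);               /* push the output out explicitly */
    MPI_Barrier(MPI_COMM_WORLD);  /* keep ranks alive until all have printed */

    MPI_Finalize();
    return 0;
}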
>
> There is one fairly non-portable aspect of LAM that provides the final
> mile of the stdout forwarding -- Unix file descriptor passing. We
> actually have several implementations of this in LAM, and the configure
> script is supposed to figure out which one to use. Various OS's
> (including specific Linux distros) have had bugs with respect to this
> in the past -- can you send the output of configure and the resulting
> config.log (please compress) so that we can see which was chosen for
> your system?
>
>
> On Jun 25, 2005, at 9:15 AM, Craig Lam wrote:
>
> > Jeff,
> >
> > Thanks for the reply. It seems to happen in every case. I've got a
> > simulator that prints out a lot of output as an extreme case, and
> > here is another example using a 'hello world' type application; source
> > code and output are shown below. In summary, every rank should print
> > "Comm rank %d reporting.", but only a single one does (unless I run
> > more MPI processes than nodes, in which case just the local node's
> > processes produce output).
> >
> > __________________________
> > Source code:
> > #include <mpi.h>
> > #include <stdio.h>
> >
> > int main(int argc, char* argv[])
> > {
> >     int mpi_comm_rank;
> >     int mpi_comm_size;
> >
> >     MPI_Init(&argc, &argv);
> >
> >     MPI_Comm_size(MPI_COMM_WORLD, &mpi_comm_size);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &mpi_comm_rank);
> >
> >     printf("Comm rank %d reporting.\n", mpi_comm_rank);
> >
> >     MPI_Finalize();
> >     return 0;
> > }
> >
> > _________________
> > OUTPUT
> > ---------------------
> > [craig_at_c1 mpi_test]$ mpirun -np 6 mpi_test
> > Comm rank 0 reporting.
> > [craig_at_c1 mpi_test]$
> >
> >
> > Any ideas at all are greatly appreciated.
> >
> > Thanks,
> > Craig Casey,
> > craig.mpi_at_[hidden]
> >
> > On 6/25/05, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> >> Can you give a concrete example of this?
> >>
> >> Do you have a lot of stdout from the processes running on the nodes,
> >> or
> >> just a little output (and then program termination)?
> >>
> >> If it's just a little output, you might want to put explicit fflush()
> >> statements in your application (I'm assuming that this is a C
> >> application?).
> >>
> >> On Jun 23, 2005, at 11:17 PM, Craig Lam wrote:
> >>
> >>> Hello,
> >>>
> >>> I've set up a diskless cluster running Fedora Core 3 (modified to
> >>> allow the diskless cluster nodes to start up). When I run an MPI
> >>> job, it seems that stdout does not get redirected from the remote
> >>> nodes correctly, although all of the local processes' output shows
> >>> up fine. Does anyone know why this might be?
> >>>
> >>> My system setup is an 8-node dual Opteron cluster running in 32-bit
> >>> mode on Linux. Each node has dual InfiniBand over PCI Express
> >>> (although I am only using one interface currently). My configuration
> >>> of LAM is done with "./configure --with-debug --prefix=/opt/lam-7.0.6
> >>> --exec-prefix=/opt/lam-7.0.6 --with-rsh=ssh". The problem shows up
> >>> with both LAM-7.0.6 and LAM-7.1.1 (I have not tried other
> >>> versions). My diskless cluster uses NFS version 4, and each cluster
> >>> node bind-mounts /var/${HOSTNAME}/ onto /var and /tmp/${HOSTNAME}
> >>> onto /tmp to give each node an individual copy of these directories
> >>> (could this contribute to the problem?).
> >>>
> >>> I must admit that I am a bit stumped.
> >>>
> >>> Thanks for all your thoughts,
> >>> Craig Casey
> >>> craig.mpi_at_[hidden]
> >>>
> >>> _______________________________________________
> >>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >>>
> >>
> >> --
> >> {+} Jeff Squyres
> >> {+} jsquyres_at_[hidden]
> >> {+} http://www.lam-mpi.org/
> >>
> >> _______________________________________________
> >> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >>
> >
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >
>
> --
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>