LAM/MPI General User's Mailing List Archives

From: Craig Lam (craig.mpi_at_[hidden])
Date: 2005-06-28 12:22:05


Hey Jeff,

Good thinking with your philosophy, but that is not the case here. I've
tried this with some more complicated programs (that are known to work
elsewhere) and it still misbehaved.

Craig

On 6/28/05, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> Here's a crazy idea -- throw in a "sleep(10);" before you call
> MPI_Finalize(). This may help on the off chance that the output *is*
> being sent properly, but is simply not being displayed because it
> arrives *after* all the processes terminate and mpirun terminates.
>
> This is quite unlikely ("impossible" is a Very Big Word for software
> developers), but certainly, in a Murphy's Law kind of way, possible.
>
>
> On Jun 28, 2005, at 11:14 AM, Craig Lam wrote:
>
> > On 6/28/05, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> >> On Jun 25, 2005, at 12:05 PM, Craig Lam wrote:
> >>
> >>> Thank you again for your response. I'm a bit baffled by this myself.
> >>> I've been poring through the source code for lam in an attempt to
> >>> understand the stdout redirection, but I'm afraid this will probably
> >>> take quite some time.
> >>
> >> Yes, unfortunately it's quite twisted and tangled code.
> >>
> >
> > Yeah, I discovered this too. :)
> >
> >
> >>> I've included the output of configure (both the
> >>> log and the output to stdout/err piped together to
> >>> configure.std_output.log.gz) as attachments to this email. You're most
> >>> likely looking for the line "checking fd passing using RFC2292 API...
> >>> passed" in the configure output to stdout, which I saw just fine.
> >>
> >> Ok. There are actually several relevant lines -- we test for all
> >> possible fd-passing systems:
> >>
> >> checking BSD 4.3 for msg_accrights in struct msghdr... no
> >> checking for BSD 4.3 fd passing support... no
> >> checking for POSIX.1g struct msghdr... yes
> >> checking fd passing using RFC2292 API... passed
> >> checking for BSD 4.4 fd passing support... yes (RFC2292 API)
> >> checking for System V Release 4 for struct strrecvfd... yes
> >> checking System V Release 4 fd passing example... failed
> >> checking for System V Release 4 fd passing support... no
> >>
> >> But the end result is the same -- it looks like you have BSD 4.4
> >> support (RFC2292). The configure test actually compiles and runs a
> >> short test that performs fd passing; if the test passes, your BSD 4.4
> >> fd passing *should* be working properly on your machine.
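
For readers following the thread: the "BSD 4.4 / RFC2292" style of fd passing that configure reports can be sketched roughly as below. This is only an illustration of the general `SCM_RIGHTS` technique over a Unix-domain socket, not LAM's actual configure test or lamd code; the function names `send_fd` and `recv_fd` are hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control-message buffer sized for exactly one descriptor. The union
 * forces alignment suitable for struct cmsghdr. */
union fd_ctrl {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Hypothetical helper: send descriptor fd_to_send over the Unix-domain
 * socket sock. Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd_to_send)
{
    assert(fd_to_send >= 0);

    char byte = 'F';  /* one byte of ordinary data travels with the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    /* The descriptor itself rides in an SCM_RIGHTS control message. */
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor from sock. Returns the new
 * descriptor (a duplicate in the receiving process), or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

A quick way to exercise the pair is to create a `socketpair(AF_UNIX, SOCK_STREAM, 0, sv)`, send the write end of a pipe across it, and check that bytes written to the received descriptor appear on the pipe's read end.
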
> >>
> >> Are you running the same version of the OS over your entire cluster?
> >>
> >
> > Yes, same statically compiled Linux 2.6.11 kernel image on each node.
> >
> >>> Is there any resource that describes how the standard out redirection
> >>> occurs in natural language so that I could understand this quickly?
> >>
> >> Unfortunately, no. But here's a quick breakdown (this is from memory;
> >> it's been quite a long time since I've looked at this code, so this
> >> may not be 100% accurate, but it's close enough to give you the
> >> spirit of what is happening):
> >>
> >> - lamboot is run and you get a set of LAM daemons (lamd's)
> >> - mpirun contacts the local lamd and passes its stdin/out/err file
> >>   descriptors
> >> - mpirun contacts each relevant lamd and tells it to launch your
> >>   process
> >> - for all nodes where mpirun is not run:
> >>   - before launching, the lamd chains the stdin/out/err to pipes that
> >>     go into the lamd (i.e., after the fork but before the exec)
> >>   - each lamd then exec's your process(es)
> >>   - when information is received on the stdout/err pipes, the lamd
> >>     forwards the data to the lamd where mpirun is running
> >> - for the node where mpirun is running:
> >>   - before launching, the lamd passes the file descriptors that it
> >>     received from mpirun to the newly-forked process and dup2's them
> >>     into stdin/out/err (hence, they write directly to mpirun's
> >>     stdout/err through normal unix mechanisms)
> >>   - when the lamd receives remote stdout/err data, it writes it to
> >>     the file descriptors that it received from mpirun
> >>
> >> It's quite complicated, actually. :-\
> >>
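
The remote-node half of the breakdown above (chain stdout/err to a pipe after the fork but before the exec, then read what the child writes) can be sketched as follows. This is a minimal illustration of the pipe/fork/dup2/exec pattern only, not lamd's real code: the function name `run_and_capture` is hypothetical, and where a real daemon would forward the bytes to the node running mpirun, this sketch simply collects them in a caller-supplied buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: launch cmd[0] with its stdout and stderr chained
 * to a pipe, and copy up to cap bytes of its output into out.
 * Returns the number of bytes captured, or -1 on error. */
ssize_t run_and_capture(char *const cmd[], char *out, size_t cap)
{
    assert(cmd != NULL && cmd[0] != NULL && out != NULL);

    int pfd[2];
    if (pipe(pfd) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: after the fork but before the exec, point stdout and
         * stderr at the write end of the pipe. */
        close(pfd[0]);
        dup2(pfd[1], STDOUT_FILENO);
        dup2(pfd[1], STDERR_FILENO);
        close(pfd[1]);
        execvp(cmd[0], cmd);
        _exit(127);  /* only reached if the exec failed */
    }

    /* Parent: close the write end so read() sees EOF once the child exits. */
    close(pfd[1]);

    size_t total = 0;
    ssize_t n;
    while (total < cap &&
           (n = read(pfd[0], out + total, cap - total)) > 0)
        total += (size_t)n;
    close(pfd[0]);

    int status;
    waitpid(pid, &status, 0);
    return (ssize_t)total;
}
```

Note that if the parent forgets to close its copy of the write end, the read loop never sees EOF. That kind of descriptor bookkeeping is one reason this sort of code gets tangled quickly.
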
> >
> > That is invaluable information for anyone trying to debug this type of
> > problem, or understand the LAM architecture in general! Thank you so
> > much!!
> >
> >
> > Strangely, and (seemingly) randomly, stdout seems to be working now.
> > As far as I can tell, I didn't do anything different. I'm even more
> > perplexed now than I was before. I was running lam jobs with no
> > forwarded input from remote nodes all morning, and then I ran lamexec,
> > and it seems to be working correctly now. I'd really like to discover
> > what the problem was, but, frankly, I'm a bit confused.
> >
> >
> >> So, a few followup questions:
> >>
> >> - What happens if you mpirun only on the local node? E.g., mpirun -np
> >>   1 foo
> >> - Does the same behavior happen if you lamexec? E.g., lamexec -np 1
> >>   uptime (local node only), or lamexec -np 4 uptime (spanning multiple
> >>   nodes)
> >> - Did you confirm that your processes are, indeed, running on your
> >>   remote nodes? Can you put a "system("date > /tmp/foo");", for
> >>   example, in your code to ensure that they are actually launched
> >>   properly on all nodes?
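
A slightly more informative variant of the `system("date > /tmp/foo")` check is to have each process record its own hostname and pid in a marker file, so that the file on each node shows which process wrote it. The sketch below is one possible way to do that with plain POSIX calls; the function name `write_launch_marker` and the file format are assumptions made for illustration, not anything LAM provides.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write "<hostname> <pid>" to the file at path.
 * Returns 0 on success, -1 on error. */
int write_launch_marker(const char *path)
{
    assert(path != NULL);

    char host[256];
    if (gethostname(host, sizeof(host)) != 0)
        return -1;
    host[sizeof(host) - 1] = '\0';  /* gethostname may not terminate on truncation */

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fprintf(f, "%s %ld\n", host, (long)getpid()) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling this early in the program with a node-local path (for example under /tmp) and then inspecting the file on each node gives evidence of a launch that does not depend on stdout forwarding.
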
> >
> > When the strange behavior was exhibiting itself, output from a
> > single-node job always showed up, since it ran on the node where I
> > executed mpirun. All output from the local node showed up, but no
> > output from other nodes did. The actual program was running; I
> > verified this several different ways (running top, passing messages,
> > checking that messages would be passed properly).
> >
> > As I said, I'd really like to track this problem down even though it's
> > no longer occurring. If anyone has any ideas, please let me know.
> >
> > Craig
> >
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >
>
> --
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/
>
>