This is, indeed, quite odd.
LAM's stdout forwarding is a function of the LAM daemons; if the LAM
daemons are not working, then your parallel job should not start up at
all. (I assume you have verified that even though you're not getting
stdout, your application is actually running on the remote nodes -- for
example, by touching a file in /tmp, or some such?)
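
For example, here is a quick, untested sketch (nothing from LAM itself;
the /tmp/mpi_started.* filename is just made up) in which each rank
"touches" a file named after its host and rank, so you can check each
node's /tmp to see whether the processes really launched even when no
stdout comes back:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char* argv[])
{
    int rank;
    char host[256];
    char path[512];
    FILE* f;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Create an empty marker file in this node's /tmp */
    gethostname(host, sizeof(host));
    snprintf(path, sizeof(path), "/tmp/mpi_started.%s.%d", host, rank);
    if ((f = fopen(path, "w")) != NULL)
        fclose(f);

    MPI_Finalize();
    return 0;
}
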
I missed this in your first mail, but I can confirm that having a
per-node /tmp to launch your processes from is fine -- it should have
nothing to do with this problem. Regardless of whether your app is on
local disk or a networked filesystem, it should start fine (see the LAM
FAQ for more detail on this issue). Your hello world app also looks
fine. Did you try putting an explicit fflush(stdout) call in there,
and/or a barrier before MPI_Finalize?
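
Something along these lines (just a sketch of the suggestion above, not
anything specific to LAM) forces the buffered output out and keeps every
rank alive until all of them have printed:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Comm rank %d reporting.\n", rank);
    fflush(stdout);                 /* push any buffered output now */
    MPI_Barrier(MPI_COMM_WORLD);    /* wait until every rank has printed */

    MPI_Finalize();
    return 0;
}
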
There is one fairly non-portable aspect of LAM that provides the final
mile of the stdout forwarding -- Unix file descriptor passing. We
actually have several implementations of this in LAM, and the configure
script is supposed to figure out which one to use. Various OSes
(including specific Linux distros) have had bugs with respect to this
in the past -- can you send the output of configure and the resulting
config.log (compressed, please) so that we can see which implementation
was chosen for your system?
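
For reference, the descriptor-passing piece is the usual sendmsg() /
SCM_RIGHTS dance over a Unix domain socket. A minimal sketch (this is
not LAM's actual code, just an illustration of the mechanism) looks
something like this:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send the open descriptor fd_to_pass across the connected Unix domain
 * socket 'sock'.  Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd_to_pass)
{
    struct msghdr msg;
    struct iovec iov;
    struct cmsghdr* cmsg;
    char dummy = 'x';                       /* must send at least 1 byte */
    union {
        char buf[CMSG_SPACE(sizeof(int))];  /* room for one descriptor */
        struct cmsghdr align;               /* forces proper alignment */
    } control;

    memset(&msg, 0, sizeof(msg));
    iov.iov_base = &dummy;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    /* The descriptor travels in an SCM_RIGHTS ancillary message */
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    return (sendmsg(sock, &msg, 0) == 1) ? 0 : -1;
}

If that code path is broken (or the wrong implementation was selected at
configure time), the daemons can still launch your processes but the
remote stdout never makes it back -- which would match what you are
seeing.
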
On Jun 25, 2005, at 9:15 AM, Craig Lam wrote:
> Jeff,
>
> Thanks for the reply. It seems to happen in every case. I've got a
> simulator that prints out a bunch of stuff as an extreme case, and
> here is another example of a 'hello world' type application; source
> code and output are shown below. (Summary: every process should print
> "Comm rank %d reporting.", with its MPI_COMM_WORLD rank substituted,
> but only a single one does -- unless I run more MPI processes than
> nodes, in which case just the local nodes run.)
>
> __________________________
> Source code:
> #include <mpi.h>
> #include <stdio.h>
>
> int main(int argc, char* argv[])
> {
>     int mpi_comm_rank;
>     int mpi_comm_size;
>
>     MPI_Init(&argc, &argv);
>
>     MPI_Comm_size(MPI_COMM_WORLD, &mpi_comm_size);
>     MPI_Comm_rank(MPI_COMM_WORLD, &mpi_comm_rank);
>
>     printf("Comm rank %d reporting.\n", mpi_comm_rank);
>
>     MPI_Finalize();
>
>     return 0;
> }
>
> _________________
> OUTPUT
> ---------------------
> [craig_at_c1 mpi_test]$ mpirun -np 6 mpi_test
> Comm rank 0 reporting.
> [craig_at_c1 mpi_test]$
>
>
> Any ideas at all are greatly appreciated.
>
> Thanks,
> Craig Casey,
> craig.mpi_at_[hidden]
>
> On 6/25/05, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>> Can you give a concrete example of this?
>>
>> Do you have a lot of stdout from the processes running on the nodes,
>> or
>> just a little output (and then program termination)?
>>
>> If it's just a little output, you might want to put explicit fflush()
>> statements in your application (I'm assuming that this is a C
>> application?).
>>
>> On Jun 23, 2005, at 11:17 PM, Craig Lam wrote:
>>
>>> Hello,
>>>
>>> I've set up a diskless cluster running Fedora Core 3 (modified to
>>> allow the diskless cluster nodes to start up). When I run an MPI
>>> job, it seems that stdout does not get forwarded from remote nodes
>>> correctly, although all local processes' output shows up fine. Does
>>> anyone know why this might be?
>>>
>>> My system setup is an 8-node dual Opteron cluster running in 32-bit
>>> mode on Linux. Each node has dual InfiniBand over PCI Express
>>> (although I am only using one interface currently). My configuration
>>> of MPI is done with "./configure --with-debug --prefix=/opt/lam-7.0.6
>>> --exec-prefix=/opt/lam-7.0.6 --with-rsh=ssh". The problem exhibits
>>> itself on both LAM 7.0.6 and LAM 7.1.1 (I have not tried other
>>> versions). My diskless clusters run NFS version 4, and each cluster
>>> node binds /var/${HOSTNAME}/ to /var and /tmp/${HOSTNAME} to /tmp to
>>> give each node an individual copy of these directories (would this
>>> contribute to the problem?).
>>>
>>> I must admit that I am a bit stumped.
>>>
>>> Thanks for all your thoughts,
>>> Craig Casey
>>> craig.mpi_at_[hidden]
>>>
>>
>> --
>> {+} Jeff Squyres
>> {+} jsquyres_at_[hidden]
>> {+} http://www.lam-mpi.org/
>>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/