It's "standard procedure" for us to reply off-list for long-standing
problems so that we don't continue to fill up people's INBOXes with
"try this" / "didn't work", "ok now try this" / "nope, still didn't
work" mails. When the problem is finally solved, we post the final
resolution back to the mailing list so that everyone can see what the
problem was.
I think Josh replied to you a few days ago (off list) asking a few
more clarification questions, etc. Can you continue to follow up
with him there? Josh or you can post the final resolution here on
the list once you guys figure it out.
Many thanks.
On Mar 6, 2006, at 2:03 AM, Soumya,Teena, Ranjana,Liss,Navya wrote:
>
>
>
> We did:
> mpicc hello.c -lmpi
> lamshrink n2
> ssh node-1
> lamshrink n0
>
>
> lamnodes gave:
> n0 invalid node
> n1 node-1.mec.ac.in:1:this_node
> n2 invalid node
> n3 node-3.mec.ac.in:1:
>
> We did mpirun -np 2 a.out.
> It gave no output. No error was displayed either.
>
> We did ./a.out and got output:
> Hello World 0 of 1 on debian-204
> After Final...0 of 1 running on debian-204
>
>
> It seems mpirun is not giving any output on node-1, but ./a.out is
> giving output.
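
A quick check that might be worth doing at this point: see whether mpirun itself thinks the run succeeded, e.g.

$ mpirun -np 2 a.out ; echo $?

If the exit status is non-zero with no output at all, the processes may never have launched, rather than their stdout being lost on the way back to mpirun. (This is plain shell, nothing LAM-specific.)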
>
>
>
> On 2/27/06, Josh Hursey < jjhursey_at_[hidden] > wrote: I'm
> wondering if there is a problem with the LAM/MPI installation or
> environment setup on one or more of the compute nodes.
>
> What version of LAM/MPI are you using on all of these machines? Is it
> the same across all of the nodes?
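
One low-effort way to compare versions across the nodes, assuming laminfo is on the default PATH for non-interactive ssh sessions (adjust the hostnames to match your cluster):

$ ssh node-1 laminfo
$ ssh node-2 laminfo
$ ssh node-3 laminfo

and compare the LAM/MPI version line each node reports against what laminfo prints on the master.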
>
> To help pinpoint the machines that are being troublesome, try the
> enclosed extension to your program:
>
> /* ========== */
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char *argv[]) {
>     int rank, size, len;
>     char hostname[256] = "";
>
>     MPI_Init(&argc, &argv);
>
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     MPI_Get_processor_name(hostname, &len);
>
>     printf("Hello World! I am %d of %d running on %s\n",
>            rank, size, hostname);
>
>     MPI_Finalize();
>     printf("After final... %d of %d running on %s\n",
>            rank, size, hostname);
>
>     return 0;
> }
> /* ========== */
>
> This will print out the rank and the returned hostname identifier
> (which should be the UNIX hostname) for each process running the
> application. This should tell you which nodes are working properly.
> To find the nodes that are not working properly, use the difference
> between:
> $ lamnodes
> and
> $ mpirun N a.out
>
> The 'N' will tell LAM/MPI to run on all available nodes, so this will
> cover all of the nodes from lamnodes (except those marked
> 'no_schedule').
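
If the node list gets long, one rough way to make that comparison is to collect the hostnames each side reports, e.g.

$ lamnodes | sort > expected.txt
$ mpirun N a.out | grep "running on" | sort > got.txt

and then look at the two files side by side. The file names and the grep pattern are just placeholders tied to the printf format in the program above, and note that lamnodes prints fully-qualified names while MPI_Get_processor_name() may return a short hostname, so the entries may not match character-for-character.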
>
> Once you know the subset of machines that are not returning stdout,
> try ssh'ing to those machines and running the MPI program as a
> 'singleton':
> $ ./a.out
> This will run the MPI program as if you had a lamnodes of just the
> localhost with a '-np 1' specified on the command line to mpirun.
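
To save a little typing, the same singleton test can be pushed out from the master with something like

$ for h in node-1 node-2 node-3 ; do ssh $h /path/to/a.out ; done

where /path/to/a.out is a placeholder for wherever the binary lives on each node (the loop assumes the same path everywhere; trim the list to just the suspect nodes).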
>
> If this doesn't give you anything on stdout, then I would try a non-
> MPI program to see if it behaves differently. If it does, then there may
> be other problems.
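
For that non-MPI test, something as small as the following is enough. It makes no MPI calls at all, so if its output also fails to appear on the suspect nodes, the problem is in the environment (shell startup, stdout handling, ssh setup) rather than in LAM/MPI itself. This is only a sketch; compile it with plain gcc, no -lmpi needed:

/* ========== */
#include <stdio.h>
#include <unistd.h>   /* for gethostname() */

int main(void)
{
    char hostname[256] = "";

    /* Plain POSIX call -- deliberately no MPI here */
    gethostname(hostname, sizeof(hostname));
    printf("plain hello from %s\n", hostname);
    fflush(stdout);

    return 0;
}
/* ========== */

Run it directly on each suspect node (via ssh) and see whether the output shows up there.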
>
> Let me know if this helps pinpoint your problem.
>
> Josh
>
> On Feb 27, 2006, at 5:30 AM, Soumya,Teena, Ranjana,Liss,Navya wrote:
>
> > Tried MPI_Barrier(MPI_COMM_WORLD) before MPI_Finalize(). It
> > didn't help. Same output as before.
> >
> > On 2/27/06, Elie Choueiri < elie.choueiri_at_[hidden]> wrote:
> > Would an MPI_Barrier before the MPI_Finalize make a difference?
> >
> >
> > On 2/27/06, Soumya,Teena, Ranjana,Liss,Navya <
> > bytecode.compression_at_[hidden] > wrote: Thank you for the prompt
> > suggestions, but they did not help. The outputs are just the same
> > as before, except that the messages come on separate lines.
> >
> > On 2/26/06, Jeff Squyres < jsquyres_at_[hidden] > wrote: I *think*
> > that you are seeing the canonical parallel output issue --
> > that output from different nodes shows up at seemingly random times
> > because, at least in part, all nodes are not exactly synchronized.
> >
> > However, you should probably still see the "Hello world" messages
> > from the MPI processes on node-3 (note that you killed node-1, so you
> > won't see anything from node-1). These are silly suggestions, but
> > try them anyway:
> >
> > - Output is not guaranteed to be flushed unless you trail the message
> > with a \n -- put a \n in your "After final" message.
> > - Put a "sleep(1);" before MPI_Finalize to give the output time to
> > flush over to mpirun before closing everything down.
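
Putting those two suggestions together, the whole test program would look roughly like the sketch below. The fflush() calls are an extra belt-and-braces step on top of the trailing \n, and sleep() needs <unistd.h>:

/* ========== */
#include <stdio.h>
#include <unistd.h>   /* for sleep() */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello World %d of %d\n", rank, size);
    fflush(stdout);    /* flush explicitly, in addition to the \n */

    sleep(1);          /* give the output time to drain back to mpirun */
    MPI_Finalize();

    printf("After final... %d of %d\n", rank, size);
    fflush(stdout);

    return 0;
}
/* ========== */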
> >
> > I'm sure your application is running properly and that for some
> > reason the stdout is not getting flushed over to mpirun.
> >
> >
> > On Feb 24, 2006, at 4:26 AM, Soumya,Teena, Ranjana,Liss,Navya wrote:
> >
> > > hi,
> > >
> > >
> > > hello.c program is as follows:
> > > #include <stdio.h>
> > > #include <mpi.h>
> > > int main(int argc, char *argv[]) {
> > >     int rank, size;
> > >     MPI_Init(&argc, &argv);
> > >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > >     MPI_Comm_size(MPI_COMM_WORLD, &size);
> > >
> > >     printf("Hello World %d of %d \n", rank, size);
> > >     MPI_Finalize();
> > >     printf("After final...");
> > >     return 0;
> > > }
> > >
> > >
> > > Lam-mpi is configured on 4 systems.
> > > The command lamnodes gave:
> > > n0 master.mec.ac.in:1:origin,this_node
> > > n1 node-1.mec.ac.in:1:
> > > n2 node-2.mec.ac.in:1:
> > > n3 node-3.mec.ac.in:1:
> > >
> > > When we run mpirun -np 4 a.out (after compiling successfully with
> > > mpicc test.c -lmpi), we get the following output:
> > > Hello World 0 of 4
> > > Hello World 2 of 4
> > > After final...After final...
> > >
> > >
> > > Then we did mpirun -np 8 a.out and got output as follows:
> > > Hello World 0 of 8
> > > Hello World 4 of 8
> > > After final...Hello World 2 of 8
> > > Hello World 6 of 8
> > > After final...After final...After final...
> > >
> > > Then we did lamshrink n1.
> > > Then mpirun -np 8 a.out
> > > Hello World 3 of 8
> > > Hello World 1 of 8
> > > Hello World 4 of 8
> > > Hello World 7 of 8
> > > After final...After final...After final...After final...Hello World
> > > 0 of 8
> > > After final...Hello World 6 of 8
> > >
> > > It seems that nodes node-1 and node-3 are not printing any
> > > messages, although communication between the nodes by means of
> > > message passing is taking place and LAM is properly installed on
> > > all 4 nodes.
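
As a side note on that last point: the hello program above makes no explicit send/receive calls, so if you want to confirm that point-to-point communication between the nodes really works, a minimal test is something like this sketch (every non-zero rank sends its rank number to rank 0, which prints what it received):

/* ========== */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int i, value;
        MPI_Status status;
        for (i = 1; i < size; i++) {
            MPI_Recv(&value, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
            printf("rank 0 received %d from rank %d\n", value, i);
        }
        fflush(stdout);
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
/* ========== */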
> > >
> > > What could be the reason? Please suggest a solution.
> > >
> > > Thank you.
> > > --
> > > CE 2002-06
> >
> >
> > --
> > {+} Jeff Squyres
> > {+} The Open MPI Project
> > {+} http://www.open-mpi.org/
> >
> >
> >
> >
> >
> > --
> > CE 2002-06
> >
> >
> >
> >
> >
> >
> >
> > --
> > CE 2002-06
>
> ----
> Josh Hursey
> jjhursey_at_[hidden]
> http://www.lam-mpi.org/
>
>
>
>
> --
> CE 2002-06
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/