On Feb 10, 2007, at 9:18 AM, Faisal Iqbal wrote:
> The output for
>
> mpiexec n0 /common/hello
> --------------------------------------------------------------------------
> It seems that [at least] one of the processes that was started with
> mpirun did not invoke MPI_INIT before quitting (it is possible that
> more than one process did not invoke MPI_INIT -- mpirun was only
> notified of the first one, which was on node n0).
>
> mpirun can *only* be used with MPI programs (i.e., programs that
> invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
> to run non-MPI programs over the lambooted nodes.
> --------------------------------------------------------------------------
> mpirun failed with exit status 252
Wait, I'm now confused. You said that n0 was your head node, but now
you're saying that it failed. Also, you previously said that you ran
lamexec, not mpiexec (the two do different things). Which is it?
Here's what I would suggest...
Remove your LAM installation. Remove your LAM source tree. Remove
the hello program. Basically: remove everything that you did
before. Let's start with a clean slate and take an extremely
systematic approach:
1. Ensure that your NFS client nodes are strictly time synchronized
with the NFS server (via NTP or whatever other time-synchronizing
protocol you prefer -- setting them manually to the "same" time is
*NOT* sufficient).
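   As a quick sanity check of the offset, you can query the server
   without setting the clock (assuming your NFS server is reachable
   as "nfsserver" -- substitute your actual hostname):

     ntpdate -q nfsserver

   The reported offset should be well under a second on every client.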
2. Download the latest LAM.
3. Configure and build LAM into an NFS-shared directory.
- for simplicity, ensure that the mount point is the same on all
nodes
- see http://www.lam-mpi.org/faq/category3.php3#question9 and
the LAM/MPI installation guide
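   A typical build into an NFS-shared prefix looks something like
   this (the "/common/lam" prefix is just an example -- use whatever
   shared mount point you actually have):

     ./configure --prefix=/common/lam
     make
     make install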
4. Confirm that the basic installation works properly with the
following:
- lamboot across both nodes
- run: lamexec N hostname
- you should see each hostname listed exactly once
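   For example, if your two nodes are named "node0" and "node1"
   (hypothetical names -- yours will differ), you would see, in no
   particular order:

     node0
     node1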
5. Once that is working, compile the hello world program from the LAM
   examples/hello directory (i.e., hello.c) with: "mpicc hello.c -o
   hello"
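   If you don't have the LAM source tree handy, a minimal MPI hello
   world along the same lines looks like this (a generic sketch -- not
   necessarily byte-for-byte identical to the examples/hello program):

     #include <stdio.h>
     #include <mpi.h>

     int main(int argc, char *argv[])
     {
         int rank, size;

         /* Every MPI program must call MPI_Init before any other MPI
            call -- this is exactly what the error message above is
            complaining about */
         MPI_Init(&argc, &argv);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &size);

         printf("Hello, world!  I am %d of %d\n", rank, size);

         /* ...and MPI_Finalize before exiting, or mpirun will
            complain the same way */
         MPI_Finalize();
         return 0;
     }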
6. Put the executable in an NFS-shared directory
7. Verify that it is correct with:
- lamboot across both nodes (or use the previously-run lamboot)
- cd to the NFS-shared directory where the "hello" program exists
- run "lamexec N ls -l hello"
- you should see it listed exactly once for each node, with the
same file size, dates, etc.
- run "lamexec N md5sum hello"
- you should see exactly the same md5 value, once for each node
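   For example, with two nodes you would see two identical lines,
   something like this (the hash here is a truncated placeholder, not
   a real checksum -- the point is that both lines must match):

     1f3a...  hello
     1f3a...  hello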
8. Once that is all correct, try to run it individually on each node
- lamboot across both nodes (or use the previously-run lamboot)
- login to the head node
- cd to the NFS-shared directory where the "hello" program exists
- run "./hello"
- see the expected output
- repeat the procedure on the second node
9. Once that is all correct, try to mpirun on each node individually
- lamboot across both nodes (or use the previously-run lamboot)
- login to the head node
- cd to the NFS-shared directory where the "hello" program exists
- run "mpirun n0 hello"
- see the expected output
- run "mpirun n1 hello"
- see the expected output
10. Once that is all correct, try to mpirun across both nodes
- lamboot across both nodes (or use the previously-run lamboot)
- login to the head node
- cd to the NFS-shared directory where the "hello" program exists
- run "mpirun N hello"
- see the expected output, exactly one line of output for each node
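    For a two-node run, the output should look something like this
    (the exact wording depends on the hello program -- what matters
    is that you get exactly one line per node):

      Hello, world!  I am 0 of 2
      Hello, world!  I am 1 of 2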
> The output is correct only for the head node; for all other nodes we
> get the aforementioned error.
>
> Faisal
>
> Jeff Squyres <jsquyres_at_[hidden]> wrote: On Feb 7, 2007, at 2:47
> PM, Faisal Iqbal wrote:
>
> > > Can you verify that /common/hello is exactly the same executable
> > > on both nodes?
> > [snipped]
>
> All sounds good.
>
> > > Can you run the /common/hello application just on n1? For example,
> > > do the following on each of your two nodes:
> > > - login
> > > - lamboot
> > > - /common/hello (i.e., run it without mpirun)
> > > - lamhalt
> > > I'm assuming it will work fine on n0 -- the question is whether it
> > > will for n1.
> > I tried "lamexec C /common/hello" and it worked, so this shows that
> > it is working on both PCs.
>
> Well that's just very peculiar. :-\
>
> If you can run them manually, the only reason I can think of that
> LAM's mpirun would think they failed is that they were compiled
> with some other MPI (e.g., MPICH or some prior version of LAM). But
> that's not consistent with what you said earlier -- that you can
> mpirun it properly on just one node.
>
> When the error occurs, do you get core dumps? If so, can you get a
> stack trace from them to see where exactly it is failing?
>
> What is the exact output of "lamexec C /common/hello"?
>
> --
> Jeff Squyres
> Server Virtualization Business Unit
> Cisco Systems
>
--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems