From this error message, it looks like LAM was not able to find the
"hello" executable on n1. What happens is that LAM tries to change to
the cwd from which you launched mpirun on every involved node. If it
can't (e.g., that directory doesn't exist there), LAM runs from your
$HOME. If you don't give an absolute path to your executable, LAM uses
your $PATH to find it (e.g., "hello"). If it can't find the executable
in your $PATH, that's when you get that error message.
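(For example, if the binary did exist at the same absolute path on
both machines, you could sidestep the $PATH lookup entirely -- the
path below is just the cwd from your transcript, so it only works if
"hello" is also at that location on n1:)

  mpirun -np 4 /Volumes/bigscratch/runs/test/hello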
This behavior is very much geared towards optimizing the common case
-- where the executable is available in the same directory on every
node (e.g., if it is NFS-exported to all nodes). For cases where this
is not true, you still have several options:
- Use an appschema to specify the location on each node (see
appschema(5))
- Add the directory where the executable lives on each node to the
local $PATH
- Create a directory of the same name on all nodes and ensure that the
executable can be found there
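To make the first two options concrete (the paths below are only
placeholders -- substitute wherever "hello" actually lives on each
machine): an appschema is just a text file with one line per
location, e.g.:

  n0 /Volumes/bigscratch/runs/test/hello
  n1 /path/where/hello/lives/on/xbot0

and then you launch with "mpirun my_schema" instead of "mpirun -np 4
hello". For the $PATH route, append the directory on each node in a
file that your shell reads for non-interactive logins (e.g., .bashrc
or .cshrc), something like:

  export PATH=$PATH:/path/where/hello/lives       (bash)
  set path = ($path /path/where/hello/lives)      (tcsh)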
Hope that helps.
On Nov 9, 2004, at 1:12 PM, Johannes Theron wrote:
> Neil
>
> The files exist and are readable. I stepped back a bit and tried
> running a very simple hello.f file as written in the LAM manual. Even
> that program has problems connecting to my cluster node.
>
> Here is the text of another message I recently posted based on the
> outcome of this simple test:
>
>
> ******************
> I wrote the little test program hello.f and ran it on my dual-G5
> (rotorx), dual Xserve (xbot0) mini-cluster.
>
> Lamboot works:
>
> rotorx:/Volumes/bigscratch/runs/test jnt7$ lamboot -v lamhosts
>
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
> n-1<9804> ssi:boot:base:linear: booting n0 (rotorx)
> n-1<9804> ssi:boot:base:linear: booting n1 (xbot0)
> n-1<9804> ssi:boot:base:linear: finished
>
> But when I try to launch a job I get the output:
>
>
> mpirun: cannot start hello on n1: No such file or directory
>
> Running 4 instances on my head node only gives:
>
>
> Hello, world! I am 0 of 4
> Hello, world! I am 3 of 4
> Hello, world! I am 1 of 4
> Hello, world! I am 2 of 4
>
> Passwordless rsh works both directions.
>
> ***************
>
> I looked at my /private/tmp and on both the head node and the cluster
> node, a directory is created according to the usual convention, i.e.
> lam-jnt7_at_xbot0 and lam-jnt7_at_rotorx.
>
> One issue might be that my Xserve is running OS X Server while my head
> node is running regular OS X. The user on the Xserve also has the same
> UID and GID as the head node.
>
> Johannes
>
> Johannes,
>
> I realise this may be stating the obvious, but it looks like there
> is a problem with the 2 "grid" files in "od_scratch". Can you first
> check that they exist and are readable by you, both on your Mac and
> on your Xserve system?
>
> Johannes Theron wrote:
>
> >
> > When the computational job actually starts (the Xserve (xbot0) needs
> > to read these files from the head node (rotorx)), I get the following
> > error:
> >
> > ***************
> > ** ERROR ** UNABLE TO OPEN GRID FILE od_scratch/grid.14
> >
> > STOP_ALL called from routine GRID_READ, group 3
> >
> >
> > ** ERROR ** UNABLE TO OPEN GRID FILE od_scratch/grid.15
> >
> > STOP_ALL called from routine GRID_READ, group 4
> > ****************
> >
>
> Regards
> Neil
> --
>
> Neil Storer | Head: Systems S/W Section | Operations Dept.
> ECMWF, Shinfield Park, Reading, Berkshire
> email: neil.storer_at_[hidden]
> Tel: (+44 118) 9499353 / (+44 118) 9499000 x 2353
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/