Hi all,
I am trying to set up a LAM/MPI cluster with three nodes on AWS EC2.
Three instances running Ubuntu 10.10 are up.
The /etc/hosts at the master-node includes these lines:
10.86.209.175 ip-10-86-209-175.ec2.internal master
10.122.171.209 ip-10-122-171-209.ec2.internal node1
10.252.86.100 domU-12-31-38-00-51-96.compute-1.internal node2
I also created a file /home/ubuntu/hosts.mpi containing these lines:
master
node1
node2
This part looks good:
$ lamboot -v hosts.mpi
LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University
n-1<6731> ssi:boot:base:linear: booting n0 (master)
n-1<6731> ssi:boot:base:linear: booting n1 (node1)
n-1<6731> ssi:boot:base:linear: booting n2 (node2)
n-1<6731> ssi:boot:base:linear: finished
$ lamnodes
n0 ip-10-86-209-175.ec2.internal:1:origin,this_node
n1 ip-10-122-171-209.ec2.internal:1:
n2 domU-12-31-38-00-51-96.compute-1.internal:1:
$ lamnodes -ic
n0 10.86.209.175
n1 10.122.171.209
n2 10.252.86.100
At node1: $ ps -x | grep lamd
5870 ? Ss 0:00 /usr/bin/lamd -H 10.86.209.175 -P 53768 -n 1 -o 0
At node2: $ ps -x | grep lamd
5287 ? Ss 0:00 /usr/bin/lamd -H 10.86.209.175 -P 53768 -n 2 -o 0
But when I try to run the popular "Hello world from process ..." example, like this one:
http://www.dartmouth.edu/~rc/classes/intro_mpi/hello_world_ex.html
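For reference, the program is essentially the standard MPI hello world. This is my own sketch along the lines of the linked example, not a verbatim copy of that file:

```c
/* hello.c -- minimal MPI hello world */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start up the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    printf("Hello world from process %d of %d\n", rank, size);
    MPI_Finalize();                         /* shut down MPI */
    return 0;
}
```

With three processes I would expect "of 3" in every line of output, but as shown below each process reports a world size of 1.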
$ mpicc -g -o hello hello.c
$ mpirun -np 3 hello
Hello world from process 0 of 1
Hello world from process 0 of 1
Hello world from process 0 of 1
When I try to force the distribution across the nodes, I get an error message:
$ mpirun -np 3 -nolocal hello
--------------------------------------------------------------------------
There are no available nodes allocated to this job. This could be because
no nodes were found or all the available nodes were already used.
Note that since the -nolocal option was given no processes can be
launched on the local node.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished
What did I do wrong?
Thanks for any help.
Best regards,
Christian