I have a 4-node Xserve cluster running LAM/MPI 7.1.2b21. Each node has dual
processors, and the nodes are connected via GigE. The output from laminfo
is pasted below. My machine file is as follows:
192.168.1.1 cpu=2 schedule=no
192.168.1.2 cpu=2
192.168.1.3 cpu=2
192.168.1.4 cpu=2
192.168.1.5 cpu=2
192.168.1.1 is the head node, which also serves as the file server for
the cluster. The nodes share files over AFP (Apple Filing Protocol), so
for obvious reasons I do not want to schedule the head node for work. I
am new to MPI, and this is the first cluster I have worked with, so my
knowledge in this area is pretty limited. That said, I am using the
following command to run my application:
mpirun C -s n0 ../../myprogram < myinfile
(The executable is not in the same directory as the input files.)
LAM boots on all of the nodes successfully, so that isn't the problem.
However, I can only get the command above to work if I remove
'schedule=no' from the machine file. The -s option is the only way I
have been able to get a job to run, and I'm relatively sure that if -s
is used, then that node MUST be schedulable. My guess is that there is
a path problem. I have read through the FAQs but haven't been able to
figure out the problem. The directory from which I execute mpirun
exists on all nodes, and the permissions on that directory and
everything inside it are rwxrwxrwx.
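(If it would help with debugging, I assume I could confirm that every
node really sees that directory with something along the lines of:
    lamexec N ls -ld /path/to/workdir
    lamexec N ls -l /path/to/myprogram
where the paths are just placeholders for the real ones on my cluster;
I'm guessing lamexec takes the same node syntax as mpirun, but I'm not
certain of that.)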
With the head node set to schedule=no:
When I run: "mpirun C ../../myprogram < myinfile"
I get: mpirun: cannot start ../../myprogram on n1: No such file or
directory
(Remember, the directory from which I ran mpirun exists on all nodes at
the same location.)
When I do: "mpirun -wd path/to/dir C ../../myprogram < myinfile"
The application starts and ends without actually doing anything. (Im not
sure how to describe this .. it prints the program header, then gives
the total time for the run, and exits).
However, when I use the -s n0 option with schedule=no removed, from
the same directory, the application appears to run fine, just with the
CPU usage not distributed across the cluster the way I would like.
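(As a possible alternative to schedule=no, I have been wondering
whether I could simply name the compute nodes on the mpirun command
line, something like:
    mpirun n1-4 -s n0 ../../myprogram < myinfile
assuming n1-4 is valid node syntax and n0 is the head node in LAM's
numbering; I haven't tried this yet.)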
I can give more information if needed.
Thanks,
Austin Leach
$ laminfo
LAM/MPI: 7.1.2b21
Prefix: /Applications/Absoft_hpcsdk/lam-absoft-32bit-7.1.2
Architecture: powerpc-apple-darwin8.0.0
Configured by: root
Configured on: Wed Jun 8 11:00:51 EDT 2005
Configure host: tabby.absoft.com
Memory manager: darwin7malloc
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C compiler: gcc -m32
C++ compiler: g++ -m32
Fortran compiler: /Applications/Absoft/bin/f90 -m32 -YEXT_SFX=_
-YEXT_NAMES=LCS -lU77
Fortran symbols: underscore
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
C++ exceptions: no
Thread support: yes
ROMIO support: yes
IMPI support: no
Debug support: no
Purify clean: no
SSI boot: globus (API v1.1, Module v0.6)
SSI boot: rsh (API v1.1, Module v1.1)
SSI boot: slurm (API v1.1, Module v1.0)
SSI coll: lam_basic (API v1.1, Module v7.1)
SSI coll: shmem (API v1.1, Module v1.0)
SSI coll: smp (API v1.1, Module v1.2)
SSI rpi: crtcp (API v1.1, Module v1.1)
SSI rpi: lamd (API v1.0, Module v7.1)
SSI rpi: sysv (API v1.0, Module v7.1)
SSI rpi: tcp (API v1.0, Module v7.1)
SSI rpi: usysv (API v1.0, Module v7.1)
SSI cr: self (API v1.0, Module v1.0)
--
Austin Leach
Graduate Research Assistant
Materials Science and Engineering
Georgia Institute of Technology
austin dot leach at gmail dot com