Machine:
Dual Athlon MP 2400 machine running Red Hat 9.0, connected in a cluster with 4 similar machines.

Problem:
Starting LAM with lamboot on a single machine and running mpirun works on one processor, but stalls on two:
mpirun -np 1 <program>    runs fine
mpirun -np 2 <program>    stalls at the first or second MPI_Send call
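
For anyone who wants to reproduce the single-machine case, it can be sketched roughly as follows. This is only an outline under assumptions: PROGRAM is a placeholder for the MPI executable (it is not named in this post), the temporary hostfile path is made up for illustration, and the LAM commands are only run if lamboot is actually installed.

```shell
#!/bin/sh
# Sketch: reproduce the single-machine case (1 process vs. 2 processes).
# PROGRAM is a placeholder (assumption); substitute the real MPI executable.
PROGRAM="${PROGRAM:-./program}"

# Hypothetical boot schema containing only the local machine.
HOSTFILE="${TMPDIR:-/tmp}/lam-single.$$"
uname -n > "$HOSTFILE"

if command -v lamboot >/dev/null 2>&1; then
    lamboot -v "$HOSTFILE"      # start the LAM run-time on this node only
    lamnodes                    # list the nodes and CPU counts LAM sees
    mpirun -np 1 "$PROGRAM"     # reported above to run fine
    mpirun -np 2 "$PROGRAM"     # reported above to stall; interrupt by hand
    lamhalt                     # shut the LAM run-time down again
else
    echo "lamboot not found; wrote boot schema to $HOSTFILE"
fi
```

The lamnodes output is worth keeping with any report, since it shows how many CPUs LAM believes each node has.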
The strange thing is what happens when I boot two machines with a hostfile like:
aqnode03
aqnode04
Running on 2 CPUs now works fine (one process on each machine). Running on 4 or 1 CPUs is also OK, but the program stalls if I try to run it on 3 CPUs.
The hostfile should normally be specified as:
aqnode03 cpu=2
aqnode04 cpu=2
since each node has two CPUs. Booting LAM with this hostfile results in a lot of stalls; the only way to run the program is on 1 CPU. The hostfile without the cpu specification works well: mpirun -np 4 runs the program efficiently on all 4 CPUs.
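
The comparison between the two hostfiles can be written down as a small script. Again this is only a sketch: PROGRAM and the working directory are placeholders I have made up for illustration, the hostfile contents are the ones quoted above, and a stalled mpirun has to be interrupted by hand before the loop continues.

```shell
#!/bin/sh
# Sketch: write both boot schemas from this post and try each process count.
# PROGRAM and WORKDIR are placeholders (assumptions), not names from the post.
PROGRAM="${PROGRAM:-./program}"
WORKDIR="${TMPDIR:-/tmp}/lam-hosts.$$"
mkdir -p "$WORKDIR"

# Hostfile without CPU counts, and the same hosts with cpu=2.
printf 'aqnode03\naqnode04\n' > "$WORKDIR/hosts.plain"
printf 'aqnode03 cpu=2\naqnode04 cpu=2\n' > "$WORKDIR/hosts.cpu"

for schema in hosts.plain hosts.cpu; do
    if command -v lamboot >/dev/null 2>&1; then
        lamboot -v "$WORKDIR/$schema"   # boot LAM on both nodes
        lamnodes                        # show the CPU count per node
        for np in 1 2 3 4; do
            echo "== $schema, -np $np =="
            mpirun -np "$np" "$PROGRAM"
        done
        lamhalt                         # stop LAM before the next schema
    else
        echo "lamboot not found; would test $WORKDIR/$schema"
    fi
done
```

Recording which combinations of hostfile and -np stall makes it easier to compare against the working Opteron machines.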

The problem is hardly program-specific, since we run the same program on two other machines (Opterons running Fedora Core 2). On those machines the cpu option in the hostfile also works well.

Hopefully there is someone out there who can answer these rather confusing questions.

Regards,
Atle Svandal
Institutt for Fysikk og Teknologi
Universitetet i
Allegaten 55 - 5007 Bergen
tlf: 55 58 32 58