Machine:

 

Dual Athlon MP 2400 machine running Red Hat 9, connected in a cluster with 4 similar machines.

 

Problem:

 

After running lamboot on a single machine, mpirun works fine with one process but stalls with two.

 

            mpirun -np 1 <program>            runs fine

 

            mpirun -np 2 <program>            stalls at the first or second MPI_Send call

 

The strange thing is what happens when I boot two machines with a hostfile like:

 

aqnode03

aqnode04

 

Running on 2 CPUs now works fine (one process on each machine). Running on 4 or 1 CPUs is also OK, but the program stalls if I try to run it on 3 CPUs.

 

The hostfile should normally be specified as:

 

aqnode03 cpu=2

aqnode04 cpu=2

 

since each node has two CPUs. Booting LAM with this hostfile results in a lot of stalls; the only way to run the program is on 1 CPU. The hostfile without the cpu specification works well: mpirun -np 4 runs the program efficiently on all 4 CPUs.
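To make the difference between the two hostfiles concrete, here is a rough sketch in Python (not part of LAM) that counts how many CPUs each hostfile declares. It assumes the format is one hostname per line with an optional cpu=N key, that a missing key means 1 CPU, and that # starts a comment. The function names parse_hostfile and total_cpus are made up for this illustration.

```python
def parse_hostfile(text):
    """Return a list of (hostname, cpu_count) pairs from hostfile text."""
    nodes = []
    for raw in text.splitlines():
        # Assumption: '#' starts a comment; drop it and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        cpus = 1  # assumed default when no cpu= key is given
        for field in fields[1:]:
            if field.startswith("cpu="):
                cpus = int(field[len("cpu="):])
        nodes.append((fields[0], cpus))
    return nodes


def total_cpus(nodes):
    """Sum the declared CPU counts over all nodes."""
    return sum(count for _, count in nodes)


# The two hostfiles from this message.
plain = "aqnode03\naqnode04\n"
with_cpu = "aqnode03 cpu=2\naqnode04 cpu=2\n"

print(total_cpus(parse_hostfile(plain)))     # hostfile without cpu keys
print(total_cpus(parse_hostfile(with_cpu)))  # hostfile with cpu=2 per node
```

Under those assumptions the first hostfile declares 2 CPUs and the second declares 4, so the only difference between the working and the stalling setup is the CPU count LAM is told about, not the set of machines.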

 

The problem is unlikely to be program specific, since we run the same program on two other machines (Opterons running Fedora Core 2). On those machines the cpu option in the hostfile also works well.

 

Hopefully someone out there can answer these rather confusing questions.

 

Regards,

 

Atle Svandal

 

Institutt for Fysikk og Teknologi

Universitetet i Bergen

Allegaten 55 - 5007 Bergen

tlf: 55 58 32 58