Hi,
I've got a little BProc 4.0.0pre8 cluster and I'm trying to get LAM going on
it (over gigE). I get blowups when I try to use the second CPU on my 2-way
nodes. No-frills Debian, gcc/g77 3.3.6. Here's lam-bhost.def:
---lam-bhost.def---
strongbad.strongbadia
0.strongbadia cpu=2
1.strongbadia cpu=2
---end lam-bhost.def---
$ lamboot
$ lamnodes
n0 strongbad.strongbadia:1:no_schedule,origin,this_node
n1 0.strongbadia:2:
n2 1.strongbadia:2:
When I run one process per node, it works fine:
$ mpirun N hello
Hello, world! I am 0 of 2
Hello, world! I am 1 of 2
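(For the record, hello.c is nothing exotic -- I'm not including the exact
source here, but it's essentially the standard MPI hello world, roughly:)
---hello.c---
/* Minimal MPI hello world -- a sketch of the test program, not
 * necessarily the exact source used above. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count */
    printf("Hello, world! I am %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
---end hello.c---
(Compiled with plain "mpicc hello.c -o hello", nothing fancy.)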
But when I try to run on all the CPUs (or, in fact, when I use _any_ mpirun
syntax that would start more than one process on any physical node), things
go awry. I get the same behavior with LAM 7.1.1 and today's (13 June)
snapshot. This is such a fundamental problem (and there are no useful hits
on Google) that I must be missing something important that everybody else in
the world thinks is obvious. Anybody care to clue me in on what I'm doing
wrong? Only using half my processors makes me a sad panda.
Here's the blowup:
$ mpirun C hello
-----------------------------------------------------------------------------
The selected RPI failed to initialize during MPI_INIT. This is a
fatal error; I must abort.
This occurred on host 1 (n2).
The PID of failed process was 31617 (MPI_COMM_WORLD rank: 2)
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 31615 failed on node n1 (192.168.1.100) with exit status 1.
-----------------------------------------------------------------------------
Does this ring any bells for anybody? Does this "just work fine" for anybody?
Thanks!
-mcq