LAM/MPI General User's Mailing List Archives

From: John McCorquodale (mcq_at_[hidden])
Date: 2005-06-13 23:45:24


Hi,

I've got a little bproc4.0.0pre8 cluster and I'm trying to get LAM going on
it (over gigE). I get blowups when I try to use the second CPU on my 2-way
nodes. No frills Debian gcc/g77 3.3.6. Here's lam-bhost.def:

---lam-bhost.def---
strongbad.strongbadia
0.strongbadia cpu=2
1.strongbadia cpu=2
---end lam-bhost.def---

$ lamboot
$ lamnodes
n0 strongbad.strongbadia:1:no_schedule,origin,this_node
n1 0.strongbadia:2:
n2 1.strongbadia:2:

And when I run one process per node it works fine:

$ mpirun N hello
Hello, world! I am 0 of 2
Hello, world! I am 1 of 2
 
But when I try to run on all the CPUs (or, in fact, when I try _any_ mpirun
syntax that would start more than one process on any physical node), things
go awry. I get the same behavior in LAM 7.1.1 and today's (13 June) snapshot.
This is such a fundamental problem (with no useful hits in Google) that I must
just be missing something important that everybody else in the world thinks is
obvious. Anybody care to clue me in on what I'm doing wrong? Only using half
my processors makes me a sad panda.
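
For reference, here is a rough sketch of how many processes each invocation
should start on this setup. It assumes that "mpirun N" starts one process per
schedulable node, that "mpirun C" starts one per schedulable CPU, and that
nodes flagged no_schedule are skipped by both. The helper name
count_processes is made up for illustration and is not part of LAM.

```python
# Hypothetical helper (not part of LAM): estimate the process counts for
# "mpirun N" and "mpirun C" from lamnodes output.
# Assumption: each line looks like "<node> <host>:<cpus>:<flag,flag,...>"
# and nodes flagged no_schedule are excluded from both N and C.

def count_processes(lamnodes_output):
    """Return (n_count, c_count) for the given lamnodes text."""
    n_count = 0  # one process per schedulable node   (mpirun N)
    c_count = 0  # one process per schedulable CPU    (mpirun C)
    for line in lamnodes_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        parts = fields[1].split(":")
        if len(parts) < 2:
            continue
        try:
            cpus = int(parts[1])
        except ValueError:
            continue  # CPU count was not a number; ignore the line
        flags = parts[2].split(",") if len(parts) > 2 and parts[2] else []
        if "no_schedule" in flags:
            continue  # head node: not used for N or C
        n_count += 1
        c_count += cpus
    return n_count, c_count


# The lamnodes output from this message.
LAMNODES = """\
n0 strongbad.strongbadia:1:no_schedule,origin,this_node
n1 0.strongbadia:2:
n2 1.strongbadia:2:
"""

if __name__ == "__main__":
    n, c = count_processes(LAMNODES)
    print(f"mpirun N -> {n} processes, mpirun C -> {c} processes")
```

Under those assumptions N gives 2 processes, which matches the working run
above, and C should give 4 (two on each of n1 and n2).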

Here's the blowup:

$ mpirun C hello

-----------------------------------------------------------------------------
The selected RPI failed to initialize during MPI_INIT. This is a
fatal error; I must abort.

This occurred on host 1 (n2).
The PID of failed process was 31617 (MPI_COMM_WORLD rank: 2)
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 31615 failed on node n1 (192.168.1.100) with exit status 1.
-----------------------------------------------------------------------------

Does this ring any bells for anybody? Does this "just work fine" for anybody?

Thanks!

-mcq