
LAM/MPI General User's Mailing List Archives


From: Arvind Gopu (agopu_at_[hidden])
Date: 2005-06-14 12:53:12


John-

I don't know if your problem is build/platform specific, but we run LAM
jobs on multiprocessor nodes (two CPUs each, to be precise) all the time
without any problems: Intel Prestonia, RHEL 3, LAM 7.1.1 and 7.1.2beta.
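
Since your error says the selected RPI failed to initialize during
MPI_INIT, one thing that might be worth trying (a guess on my part,
assuming the LAM 7.x SSI syntax; I haven't reproduced your setup) is
forcing the plain TCP RPI, so no shared-memory path is involved even
when two ranks land on the same node:

$ mpirun -ssi rpi tcp C hello

Running laminfo will show which RPI modules your build actually has.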

A separate issue, though, is that performance takes a big hit when I
use both processors on the same node. That behavior is common to LAM and
MPICH (on our system), and we've considered a whole bunch of possible
reasons.
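
For what it's worth, one variable worth isolating there is which RPI
handles the same-node traffic: with the tcp RPI, messages between two
ranks on the same node still go through the loopback stack, while the
sysv/usysv modules use shared memory instead. Roughly (again assuming
the LAM 7.x SSI syntax, with "app" standing in for whatever benchmark
you care about):

$ mpirun -ssi rpi tcp C app
$ mpirun -ssi rpi usysv C app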

cheers, Arvind

On 2005-06-13 21:45 (-0700), John McCorquodale had pondered:

> Date: Mon, 13 Jun 2005 21:45:24 -0700
> From: John McCorquodale <mcq_at_[hidden]>
> Reply-To: General LAM/MPI mailing list <lam_at_[hidden]>
> To: lam_at_[hidden]
> Subject: LAM: LAM + Bproc4 + SMP = Boom?
>
> Hi,
>
> I've got a little bproc4.0.0pre8 cluster and I'm trying to get LAM going on
> it (over gigE). I get blowups when I try to use the second CPU on my 2-way
> nodes. No-frills Debian with gcc/g77 3.3.6. Here's lam-bhost.def:
>
> ---lam-bhost.def---
> strongbad.strongbadia
> 0.strongbadia cpu=2
> 1.strongbadia cpu=2
> ---end lam-bhost.def---
>
> $ lamboot
> $ lamnodes
> n0 strongbad.strongbadia:1:no_schedule,origin,this_node
> n1 0.strongbadia:2:
> n2 1.strongbadia:2:
>
> And when I run one process per node, it works fine:
>
> $ mpirun N hello
> Hello, world! I am 0 of 2
> Hello, world! I am 1 of 2
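>
> (For reference, hello is nothing exotic, just the boilerplate MPI
> hello world, more or less this:)
>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     /* Start MPI, report this rank's position, and shut down cleanly. */
>     int rank, size;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     printf("Hello, world! I am %d of %d\n", rank, size);
>     MPI_Finalize();
>     return 0;
> }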
>
> But when I try to run on all the CPUs (or, in fact, when I try _any_ mpirun
> syntax that would start more than one process on any one or more physical
> nodes), things go awry. I get the same behavior in LAM 7.1.1 and today's
> (13 June) snapshot. This is such a fundamental problem (and no useful hits
> in Google) that I must just be missing something important that everybody
> else in the world thinks is obvious. Anybody care to clue me in on what I'm
> doing wrong? Only using half my processors makes me a sad panda.
>
> Here's the blowup:
>
> $ mpirun C hello
>
> -----------------------------------------------------------------------------
> The selected RPI failed to initialize during MPI_INIT. This is a
> fatal error; I must abort.
>
> This occurred on host 1 (n2).
> The PID of failed process was 31617 (MPI_COMM_WORLD rank: 2)
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 31615 failed on node n1 (192.168.1.100) with exit status 1.
> -----------------------------------------------------------------------------
>
> Does this ring any bells for anybody? Does this "just work fine" for anybody?
>
> Thanks!
>
> -mcq
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

_____________________________________________________________________
 Arvind Gopu | High Performance Computing Group| (UITS-RAC-HPC) @ IU
 HPC website: http://www.indiana.edu/~rac/hpc | Work: (812) 856-0187
 My website: http://cs.indiana.edu/~agopu | Cell: (812) 361-4054