LAM/MPI General User's Mailing List Archives

From: Howard Butler (hobu_at_[hidden])
Date: 2005-10-03 21:57:31


Dear List,

Please excuse my newbiness. I am attempting to use Rmpi
<http://cran.r-project.org/src/contrib/Descriptions/Rmpi.html>, which
is a library for the statistical processing software R
<http://www.r-project.org/>. In the past I had been using PVM with
much success, although it was slow and flaky. I am having trouble
spawning processes to the cluster using the prescribed methods, and I
think it has something to do with how I boot the cluster with lamboot.

I am working on a dual 2.5 GHz Apple G5 and a dual 2.3 GHz Xserve,
with the Xserve acting as the master node (both are running Tiger
10.4.2). I am using the binaries provided on the site, and ssh'ing
works without passwords. When I issue a lamboot, all appears well,
and recon completes successfully.
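
A boot schema for a setup like this one might look roughly like the
sketch below. The hostnames are hypothetical placeholders, and cpu=2
assumes that both processors on each dual-CPU machine should be used.
The master is listed first so that it becomes node n0.

```
# Hypothetical LAM boot schema (hostfile); hostnames are placeholders.
# One host per line; cpu=N tells LAM how many CPUs to schedule there.
xserve.example.com cpu=2
g5.example.com cpu=2
```

A file like this can be passed to recon and then to lamboot (for
example, lamboot -v hostfile), after which lamnodes lists the nodes
that were actually booted.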

When I attempt to invoke things from within R, this error message is returned:

>It seems that [at least] one of the child processes that was started
>by MPI_Comm_spawn* chose a different RPI than the parent MPI
>application. For example, one (of the) child process(es) that
>differed from the parent is shown below:
>
> Parent application: MPI_Comm_spawn
> Child MPI_COMM_WORLD rank crtcp (v1.1.0): 0

Taking the extra computer out of the cluster and running only on the
master allows the spawning to complete successfully. I have been
struggling to research this error (it appears that Google hasn't
caught up with a recent mailing-list archive move: links in Google's
results, such as <http://www.lam-mpi.org//MailArchives/lam/mail20.php>,
return 404).

It is clear to me that either I am not configuring the slave node
properly when I issue the lamboot, or the rmpi.c code is spawning
things improperly.

Looking at the rmpi.c code, it appears that it is invoking the spawn
command with MPI_Comm_spawn:
> mpi_errhandler(MPI_Comm_spawn(CHAR(STRING_ELT(sexp_slave, 0)), argv, nslave,
>                               info[infon], root, MPI_COMM_SELF,
>                               &comm[intercommn], slaverrcode));

I also compiled my own LAM/MPI with gcc4 and gfortran and got the
same result. The lamboot FAQ
<http://www.lam-mpi.org/faq/category4.php3> doesn't appear to have
any questions related to my problem. If it is just a case of my bad
Google-fu, please point me to any information that I should look at.
Any other ideas would be very welcome.

Thanks

Howard