
LAM/MPI General User's Mailing List Archives


From: dick_at_[hidden]
Date: 2005-09-06 12:31:36


greetz,

i've been playing around with the LAM/MPI 7.1.1 source and
have tried to get it to run on openbsd 3.5 without success. by
"get it to run" i mean that the basic tests to check that it
works correctly fail (i'll expand on this below). i find this
odd since openbsd 3.5 is listed as a tested platform for
LAM/MPI 7.1.1 (see
http://lam-mpi.lzu.edu.cn/about/overview/support.php ).

since there is a port for LAM-MPI 6.5.9 (the old unsupported
version) on openbsd 3.6 and later, i tested it on a 3.6
install. i made sure things were working by doing a "$ recon
-v bhost.def" and a "$ lamboot -v bhost.def" without getting
any errors, where bhost.def contains the two node hostnames in
question. i also successfully compiled and ran most of the
example programs in the examples directory of the 6.5.9 source
tree on the two test nodes.
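
for reference, that 6.5.9 sanity check amounts to roughly the sketch below. the second hostname (node2.plf) is a placeholder, since only craptiva.plf is named in this post; lam_smoke_test is just a name made up for the sketch, and lamhalt is assumed to be available in the installed LAM version.

```shell
# boot schema: one hostname per line
# (node2.plf is a hypothetical placeholder for the second node)
cat > bhost.def <<'EOF'
craptiva.plf
node2.plf
EOF

# hypothetical helper: stop at the first step that fails
lam_smoke_test() {
    recon -v bhost.def &&     # check that each node is reachable and LAM is runnable there
    lamboot -v bhost.def &&   # start the LAM daemons on the listed nodes
    lamhalt                   # shut the LAM universe down again (assumed available)
}
```

calling lam_smoke_test from the directory holding bhost.def returns non-zero as soon as one of the commands fails.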

i did get 7.1.1 to compile and install correctly, but basic
commands don't work and i get errors when i try to do anything
remotely. the only non-default thing i did was set the RSH
agent to ssh, i.e. '$ ./configure --with-rsh="ssh -x"'. here
are the problems:

1) recon and lamboot give me grief about the remote computer
(NOTE: i have ssh working with public key authentication just
fine)

$ recon -v bhost.def
n-1<6817> ssi:boot:base:linear: booting n0 (craptiva.plf)
ERROR: LAM/MPI unexpectedly received the following on stderr:
ksh: [: missing ]
-----------------------------------------------------------------------------
LAM failed to execute a LAM binary on the remote node "craptiva.plf".
Since LAM was already able to determine your remote shell as "tkill",
it is probable that this is not an authentication problem.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

LAM tried to use the remote agent command "ssh"
to invoke the following command:

        ssh -x craptiva.plf -n '( ! [ -e ./.profile] || . ./.profile;' tkill -N -v )

This can indicate several things. You should check the following:

        - The LAM binaries are in your $PATH
        - You can run the LAM binaries
        - The $PATH variable is set properly before your
          .cshrc/.profile exits

Try to invoke the command listed above manually at a Unix prompt.

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<6817> ssi:boot:base:linear: Failed to boot n0 (craptiva.plf)
n-1<6817> ssi:boot:base:linear: aborted!
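
the manual check that message asks for (no password prompt, no stray output from the remote side) could be scripted roughly as below. this is only a sketch: RSH_AGENT and remote_is_quiet are names invented here, not anything LAM provides, and BatchMode=yes is the OpenSSH option that makes ssh fail instead of prompting for a password.

```shell
# hypothetical variable: the remote agent command to test
RSH_AGENT=${RSH_AGENT:-ssh -x -o BatchMode=yes}

# hypothetical helper: run a no-op on the remote host and complain if
# the agent fails or if anything at all is printed
remote_is_quiet() {
    host=$1
    # $RSH_AGENT is left unquoted on purpose so it splits into command + options
    if ! out=$($RSH_AGENT "$host" -n true 2>&1); then
        echo "agent failed on $host: $out"
        return 1
    fi
    if [ -n "$out" ]; then
        echo "unexpected output from $host: $out"
        return 2
    fi
    echo "$host: ok"
}
```

e.g. "remote_is_quiet craptiva.plf" returns 0 only when the login succeeds without a prompt and the shell startup files print nothing.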

i suspect that this has to do with some mucked-up shell syntax
in the command LAM generates (note the "ksh: [: missing ]" it
got back on stderr), but i'm not sure
2) laminfo just hangs, irrespective of the arguments i pass it

3) mpirun hangs when i try to test the examples on a single node

$ lamboot -v

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

n-1<10867> ssi:boot:base:linear: booting n0 (localhost)
n-1<10867> ssi:boot:base:linear: finished
$ mpirun C ring
^C-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n-809558100).

mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------

so, given these issues with 7.1.1, i wonder if i should try to
work through the errors, provided a developer/more educated
user is willing to help, or whether i should just work with
the functioning 6.5.9 port. i would rather go forward than use
a dated version of LAM-MPI which i would likely have to
upgrade later.

any suggestions welcome (aside from "use another OS"), thx for
reading.

jake