LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-09-15 17:26:30


On Sep 15, 2005, at 10:19 AM, Ole Holm Nielsen wrote:

> Well, it seems to be OK. The /usr/local/lam-7.1.2-pgi directory
> tree is rsync'ed from the central server, and the files therein
> have identical timestamps and sizes on all nodes.

Ok.

> My $PATH appears to be OK. Also, when recon executes it picks up
> /usr/local/lam-7.1.2-pgi/bin/tkill (the path is correct) on the
> master node (see my previous mail). I don't know if the correct
> $PATH is set on the slave nodes when LAM boots with the "tm" schema
> - is there a way to check that ? In our setup the user's .cshrc
> file is responsible for setting LAMHOME and PATH to point to
> /usr/local/lam-7.1.2-pgi.

No, other than LAM being installed in the same place on all nodes,
little else matters on the client nodes -- LAM runs executables with an
absolute path name (see the -d output), so if that is correct on all
nodes, that should be good enough.

> LAM-MPI works with the "rsh" boot schema on the same test cluster,
> so the problem seems to be specific to the "tm" boot schema.
> The funny thing is that the problem cropped up after I updated
> the Torque version. With torque-1.2.0p4 things were just fine
> (but maybe the default schema was "rsh" back then, we don't know...).

Unfortunately we don't give a good enough error message to show exactly
what went wrong here -- it's either failing in the tm_obit() or
tm_poll() library calls (i.e., one of those two is returning something
other than TM_SUCCESS). The problem is occurring in
share/ssi/boot/tm/src/ssi_boot_tm.c; you can see the "waiting for
completion" message on line 449, and it then calls tm_obit() and
tm_poll(). If those two had succeeded, we would have seen a "finished"
message.

Can you put in some additional printf's in there to see which of those
two it's dying on, and what value it's returning?
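
Something along these lines might do as a starting point. It is only a
sketch: report_tm_rc() and TM_TRACE() are hypothetical names that do not
exist in LAM or Torque, and the TM_SUCCESS fallback assumes the value 0
from Torque's tm.h (inside ssi_boot_tm.c the real definition is used).

```c
#include <stdio.h>

/* Assumption: TM_SUCCESS is 0, as in Torque's tm.h. This fallback only
 * exists so the sketch compiles outside the LAM source tree. */
#ifndef TM_SUCCESS
#define TM_SUCCESS 0
#endif

/* Hypothetical helper (not part of LAM or Torque): print the name of a
 * TM call and its return code to stderr, then hand the code back
 * unchanged so the surrounding error handling is unaffected. */
static int report_tm_rc(const char *call, int rc, const char *file, int line)
{
    if (rc != TM_SUCCESS) {
        fprintf(stderr, "%s:%d: %s failed, returned %d\n",
                file, line, call, rc);
    } else {
        fprintf(stderr, "%s:%d: %s returned TM_SUCCESS\n",
                file, line, call);
    }
    fflush(stderr);  /* make sure the message survives an abort */
    return rc;
}

/* Wrap an existing call expression; the call is evaluated exactly once. */
#define TM_TRACE(call) report_tm_rc(#call, (call), __FILE__, __LINE__)
```

To use it, wrap the existing tm_obit() and tm_poll() calls after the
"waiting for completion" message in TM_TRACE(), leaving their arguments
as they are. The stderr output should then show which call failed and
with what value.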

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/