On Sep 15, 2005, at 10:19 AM, Ole Holm Nielsen wrote:
> Well, it seems to be OK. The /usr/local/lam-7.1.2-pgi directory
> tree is rsync'ed from the central server, and the files therein
> have identical timestamps and sizes on all nodes.
Ok.
> My $PATH appears to be OK. Also, when recon executes it picks up
> /usr/local/lam-7.1.2-pgi/bin/tkill (the path is correct) on the
> master node (see my previous mail). I don't know if the correct
> $PATH is set on the slave nodes when LAM boots with the "tm" schema
> - is there a way to check that? In our setup the user's .cshrc
> file is responsible for setting LAMHOME and PATH to point to
> /usr/local/lam-7.1.2-pgi.
No, other than LAM being installed in the same place on all nodes,
little else matters on the client nodes -- LAM runs executables with an
absolute path name (see the -d output), so if that is correct on all
nodes, that should be good enough.
> LAM-MPI works with the "rsh" boot schema on the same test cluster,
> so the problem seems to be specific to the "tm" boot schema.
> The funny thing is that the problem cropped up after I updated
> the Torque version. With torque-1.2.0p4 things were just fine
> (but maybe the default schema was "rsh" back then, we don't know...).
Unfortunately we don't give a good enough error message to show exactly
what went wrong here -- it's either failing in the tm_obit() or
tm_poll() library calls (i.e., one of those two is returning something
other than TM_SUCCESS). The problem is occurring in
share/ssi/boot/tm/src/ssi_boot_tm.c; you can see the "waiting for
completion" message on line 449, and it calls tm_obit() and tm_poll().
If those two succeed, we would have seen a "finished" message.
Can you put in some additional printf's in there to see which of those
two it's dying on, and what value it's returning?
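
Something along these lines might be enough; it is only a rough sketch, not
code from the LAM tree. The helper name tm_debug_check() is made up for
illustration, and the fallback TM_SUCCESS define assumes the value 0 from
Torque's tm.h (inside ssi_boot_tm.c the real header already provides it).

```c
#include <stdio.h>

/* Assumption: TM_SUCCESS is 0, as in Torque's tm.h. Inside ssi_boot_tm.c
 * the real header is already included, so this fallback is not used. */
#ifndef TM_SUCCESS
#define TM_SUCCESS 0
#endif

/* Hypothetical debugging helper (not part of LAM or TM): print the name
 * and return value of a TM call to stderr, then pass the value through
 * unchanged so the surrounding logic behaves exactly as before. */
static int tm_debug_check(const char *call, int ret)
{
    if (ret != TM_SUCCESS) {
        fprintf(stderr, "boot tm: %s failed, returned %d\n", call, ret);
    } else {
        fprintf(stderr, "boot tm: %s returned TM_SUCCESS\n", call);
    }
    fflush(stderr); /* make sure the message is not lost if the boot aborts */
    return ret;
}
```

Wrap the existing tm_obit() and tm_poll() calls with it, leaving their
arguments untouched, e.g. tm_debug_check("tm_obit", <the existing call>),
and the stderr output should show which call fails and with what value.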
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/