LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-07-14 13:46:02


On Jul 14, 2005, at 10:53 AM, Jejo Koola wrote:

> 1. Yes, I captured all of the output generated by the command: lamboot
> -d. I assume that it will output all stderr messages on the host and
> remote nodes by default.

Yes, it will.

Hmm. If you're sure you've captured any stderr, then somehow it's
failing without printing anything to stderr. Odd, but I guess that can
happen.

What you might want to try next is to wrap your call to ssh in a shell
script -- perhaps something like:

-----
:
echo about to invoke underlying ssh
shift
ssh $*
foo=$?
echo got return status from ssh: $foo
exit $foo
-----

Save that in a shell script somewhere and set the environment variable
LAMRSH to point to it. Then run lamboot -d again -- you should see the
output from this script, and see the return value from ssh.

Is it nonzero?
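
If you want to convince yourself that the capture-and-propagate pattern in
that wrapper behaves as expected before involving ssh or lamboot, here is a
minimal sketch of the same idea. The function name run_and_report is
hypothetical (it is not part of LAM), and the command to run is passed in as
arguments so that stand-in commands such as true and false can be used:

```shell
#!/bin/sh
# Hypothetical helper, not part of LAM: run a command, report its exit
# status, and hand that same status back to the caller.
run_and_report() {
    echo "about to invoke: $*"
    "$@"                      # run the command exactly as given
    status=$?                 # capture its exit status right away
    echo "got return status: $status"
    return $status            # propagate the status unchanged
}

# Stand-in commands instead of ssh, so this can be tried anywhere.
run_and_report true  && echo "wrapper reported success"
run_and_report false || echo "wrapper reported failure"
```

For the real test, assuming the wrapper script was saved somewhere like
$HOME/ssh-wrapper.sh (a made-up path), make it executable, export LAMRSH
with that path as its value, and rerun lamboot -d.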

> 2. In looking through the output, I do not find any error messages.
>
> 3. I did find it odd that it tried to execute ssh 'echo $SHELL' twice.
> You will see it if you look at the output of lamboot. And it also
> printed out lamboot's verbose error/help message twice. Is it
> supposed to try that twice, or is that indicative of something wrong?

This is normal. When lamboot detects a failed boot, it goes and tries
to kill everything. What's happening is that it's first trying to
launch on a node, it fails, and then it tries to kill any LAM processes
on that node. But it fails to launch the killer process (tkill) on
that node, so you get the same error messages twice. This is somewhat
non-intuitive, but normal.

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/