Hi Jeff,
When I follow what you suggest, it does not print any messages out at
all. In fact, if I point LAMRSH at just any random file, the output is
no different than when LAMRSH=ssh. So it seems that it is never
actually running the program at all. lamboot -d does print out that
it is "attempting to execute" ssh, so is there something that can go
wrong between printing "attempting to execute" and actually executing
the program?
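
One way to tell whether the wrapper is ever started, regardless of what lamboot does with its stdout and stderr, might be to have it append to a log file instead of echoing. The following is only a minimal sketch under assumptions: the file locations and the LAMRSH_WRAPPER_LOG and LAMRSH_WRAPPER_CMD variables are names made up for illustration, not LAM settings, and it passes all arguments through unchanged (no "shift"), which may or may not match how lamboot invokes the LAMRSH command.

```shell
# Write a wrapper script to a hypothetical location.
wrapper="${TMPDIR:-/tmp}/lamrsh-wrapper.sh"
cat > "$wrapper" <<'EOF'
#!/bin/sh
# Hypothetical knobs (not part of LAM): where to log and what to run.
LOG="${LAMRSH_WRAPPER_LOG:-${TMPDIR:-/tmp}/lamrsh-wrapper.log}"
REAL="${LAMRSH_WRAPPER_CMD:-ssh}"

# Record the invocation in a file, so it is visible even if the
# caller discards or parses stdout/stderr.
echo "$(date): invoked with $# args: $*" >> "$LOG"

# Run the underlying command with the arguments passed through as-is.
"$REAL" "$@"
status=$?

echo "$(date): $REAL exited with status $status" >> "$LOG"
exit $status
EOF
chmod +x "$wrapper"
```

To try it, LAMRSH would be set to the path in $wrapper before running lamboot -d again. If the log file is never created, the wrapper was presumably never executed at all.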
Thanks,
Jejo
On 7/14/05, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> On Jul 14, 2005, at 10:53 AM, Jejo Koola wrote:
>
> > 1. Yes, I captured all of the output generated by the command: lamboot
> > -d. I assume that it will output all stderr messages on the host and
> > remote nodes by default.
>
> Yes, it will.
>
> Hmm. If you're sure you've captured any stderr, then somehow it's
> failing without printing anything to stderr. Odd, but I guess that can
> happen.
>
> What you might want to try next is to wrap your call to ssh in a shell
> script -- perhaps something like:
>
> -----
> :
> echo about to invoke underlying ssh
> shift
> ssh $*
> foo=$?
> echo got return status from ssh: $foo
> exit $foo
> -----
>
> Save that in a shell script somewhere and set the environment variable
> LAMRSH to point to it. Then run lamboot -d again -- you should see the
> output from this script, and see the return value from ssh.
>
> Is it nonzero?
>
> > 2. In looking through the output, I do not find any error messages.
> >
> > 3. I did find it odd that it tried to execute ssh 'echo $SHELL' twice.
> > You will see it if you look at the output of lamboot. And it also
> > printed out lamboot's verbose error/help message twice. Is it
> > supposed to try that twice, or is that indicative of something wrong?
>
> This is normal. When lamboot detects a failed boot, it goes and tries
> to kill everything. What's happening is that it's first trying to
> launch on a node, it fails, and then it tries to kill any LAM processes
> on that node. But it fails to launch the killer process (tkill) on
> that node, so you get the same error messages twice. This is somewhat
> non-intuitive, but normal.
>
> --
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>