On Sun, 3 Aug 2003, Axel Scheepers, Operations Via NET.Works NL wrote:
> I am able to run recon without any problems;
> [snipped]
> -----------------------------------------------------------------------------
> LAM failed to execute a LAM binary on the remote node "zeus".
> Since LAM was already able to determine your remote shell as "hboot",
> it is probable that this is not an authentication problem.
This is the first odd thing -- the remote shell should not be "hboot".
Looks like a bug in our error message. :-(
> LAM tried to use the remote agent command "ssh"
> to invoke the following command:
>
> ssh -x zeus -n hboot -t -c lam-conf.lamd -s -I "-H 192.168.0.10 -P
> 41831 -n 2 -o 0"
> ....
>
> So, as mentioned, I tried running that by hand:
> pvm_at_darkstar:~/lam/etc$ ssh -x zeus -n hboot -t -c lam-conf.lamd -s -I "-H
> 192.168.0.10 -P 41831 -n 2 -o 0"
> pvm_at_darkstar:~/lam/etc$
This is odd -- I would not expect hboot to finish properly here. The -P
argument specifies a TCP port number that lamboot is listening on, waiting
for the lamd to call back on. Hence, when lamboot dies, that port closes,
and if you try to run it again, hboot/lamd should fail because it can't
connect to that port.
But you should also get an error message about this. Ugh! Looks like
another bug in our help message output! :-(
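If you want to confirm the closed-port theory yourself, a rough sketch like the one below checks whether anything is still accepting connections on the host and port given by -H and -P. Note that port_open is a made-up helper name, not part of LAM, and the sketch assumes that bash (with /dev/tcp support) and the coreutils timeout command are available on the node you run it from.

```shell
# Sketch only: port_open is a hypothetical helper, not a LAM command.
# It returns 0 if a TCP connection to host:port succeeds within 3 seconds,
# and non-zero if the connection is refused, unreachable, or times out.
port_open() {
    # bash's /dev/tcp/HOST/PORT pseudo-device opens a TCP connection.
    # Running it in a child bash means the descriptor is closed on exit,
    # and the function still works when the calling shell is not bash.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Example, using the address and port from the hboot command line above:
#   port_open 192.168.0.10 41831 && echo "still listening" || echo "closed"
```

Run from zeus after lamboot has exited, I would expect this to report the port as closed, which is why a hand-run hboot has nothing to call back to.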
> Hm? That seemed to be going ok doesn't it? Then I wouldn't be having a
> problem.. so, let's print the exit code of hboot:
> pvm_at_darkstar:~/lam/etc$ ssh -x zeus -n hboot -t -c lam-conf.lamd -s -I "-H
> 192.168.0.10 -P 41831 -n 2 -o 0"; echo $?
> 226
Yes, I think that's the first honest response you've gotten. :-)
> lamboot -d doesn't give me much help here either; it just prints out the
> above line.
So it shows only that line, followed by the help message saying that it
didn't work? Nothing else?
> Anyone any ideas about what's going wrong here?
> It shouldn't be a problem mixing different machines, should it?
No, as long as you have the executables set up properly in the $PATH for
each machine/architecture/OS/whatever, you should be ok. If you have LAM
7.0 on all your machines (regardless of arch/OS/etc.), they should
interoperate properly.
Common problems here include the following:
- firewalling/port blocking between the machines
- not able to find hboot in your path on the remote machine (I didn't see
  an explicit path entry for LAM in your .ssh/environment file; but I
  don't know if you installed it in one of the "common" directories...?)
- incorrect IP resolution (is 192.168.0.10 the right IP address for the
  host that you're lambooting from?)
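For the second item, a small sketch along these lines shows whether a program is visible on the $PATH of the shell that runs it. The in_path name is just an illustration, not something LAM provides. What matters is the $PATH of the non-interactive ssh session, which can differ from the one you get at a login prompt.

```shell
# Sketch only: in_path is a hypothetical helper, not a LAM command.
# Prints the resolved location of a program if it is on $PATH; otherwise
# prints a diagnostic (including the current $PATH) to stderr and returns 1.
in_path() {
    if command -v "$1" >/dev/null 2>&1; then
        command -v "$1"
    else
        echo "$1 not found in PATH=$PATH" >&2
        return 1
    fi
}

# The same check through a non-interactive ssh session, as lamboot uses:
#   ssh -x zeus -n 'command -v hboot || echo "hboot not found in $PATH"'
```

If the ssh variant prints nothing useful or reports that hboot is not found, the remote $PATH is the first thing I would fix.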
Can you send the full lamboot -d output?
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/