On Feb 8, 2006, at 2:46 PM, ew fgff wrote:
> Actually, the problem is on "node2.xxx.xxx" because it
> can run lamboot on "node3.xxx.xxx" if I do not put
> node2.xxx.xxx.
>
> It is not the problem with LAM in "node2.xxx.xxx"
> because lamboot can be run on "node2.xxx.xxx" from
> "node2.xxx.xxx".
>
> Also, it should not be the problem with my account
> because I can connect "node2.xxx.xxx" from
> "node1.xxx.xxx".
>
> Could you please help me to solve this problem.
> Thanks.
You've actually covered most of the bases with your initial writeup.
Usually, this kind of failure indicates some type of authentication
setup error on your cluster. If you run "recon -v -d hostfile" (or
"lamboot -v -d hostfile"), you can see what command LAM is trying to
invoke on the remote node. I would then try to run that command by
hand, from the same node, and see if you can figure out what is going
on. If you are using rsh, the best place to look is the system logs
on the remote side. If you are using ssh, a good place to look is ssh
itself: add a -v to the ssh command line to get more debugging
information.
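
As a rough sketch of the "try it by hand" step, something like the
following could be used to test a non-interactive login to every host
in the boot schema. It assumes ssh is your remote agent and that the
hostfile has one host per line, optionally followed by keys such as
cpu=2, with "#" starting a comment. The function names list_hosts and
check_hosts are made up for this example and are not part of LAM.

```shell
#!/bin/sh
# Sketch only: assumes ssh is the remote agent and the hostfile has one
# host per line (optionally followed by keys such as cpu=2), with '#'
# starting a comment. Helper names are hypothetical, not LAM commands.

# list_hosts FILE: print the host name from each non-empty, non-comment line.
list_hosts() {
    sed -e 's/#.*//' "$1" | awk 'NF { print $1 }'
}

# check_hosts FILE: try a non-interactive login to every host in FILE.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_hosts() {
    for host in $(list_hosts "$1"); do
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED (try: ssh -v $host true)"
        fi
    done
}
```

After sourcing these definitions on the node where you invoke lamboot,
run "check_hosts hostfile". Any host reported as FAILED is the one to
look at more closely with ssh -v.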
Hope this helps,
Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/