uname: Linux walrus.crl.dec.com 2.4.21-0.13mdksmp #1 SMP Fri Mar 14 13:41:18
EST 2003 i686 unknown unknown GNU/Linux
My LAM/MPI version is 6.5.9.
I'm having problems with lamgrow.
I set up an ssh-agent so that ssh does not ask for my password.
I know that ssh-agent sets environment variables that ssh then uses.
I set the LAMRSH environment variable to "ssh -x".
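
For reference, the setup described above amounts to roughly the following. This is a sketch, not a copy of the exact commands I ran: the helper name check_lam_ssh_env is made up for illustration, and the ssh-agent/ssh-add lines are commented out and assume a default key. The helper only reports whether the variables that ssh and LAM rely on are visible in the current shell.

```shell
#!/bin/bash
# Sketch of the environment described above; not the exact commands used.

# Typical setup (commented out so this block has no side effects):
#   eval "$(ssh-agent -s)"    # exports SSH_AUTH_SOCK and SSH_AGENT_PID
#   ssh-add                   # load the default key into the agent (assumed)
#   export LAMRSH="ssh -x"    # remote shell command used by lamboot/lamgrow

# Hypothetical helper: report whether the agent socket and LAMRSH are
# visible in this shell. Returns non-zero if either one is missing.
check_lam_ssh_env() {
    local rc=0
    if [ -z "${SSH_AUTH_SOCK:-}" ]; then
        echo "SSH_AUTH_SOCK is not set (no ssh-agent visible)"
        rc=1
    else
        echo "SSH_AUTH_SOCK=$SSH_AUTH_SOCK"
    fi
    if [ -z "${LAMRSH:-}" ]; then
        echo "LAMRSH is not set"
        rc=1
    else
        echo "LAMRSH=$LAMRSH"
    fi
    return $rc
}
```

Calling check_lam_ssh_env at the interactive prompt and again from inside a script would show whether both see the same agent socket and the same LAMRSH value.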
============================================================================
I can successfully boot my localhost (which is walrus) and sybil from bash:
$ cat ~/hostfiles/ws
walrus.crl.dec.com
sybil.crl.dec.com
$ lamboot ~/hostfiles/ws
LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
$ lamnodes
n0 walrus.crl.dec.com:1
n1 sybil.crl.dec.com:1
$ lamhalt
============================================================================
So far so good. Now I boot only walrus and use lamgrow to add sybil:
$ lamboot ~/hostfiles/walrus
LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
$ lamnodes
n0 walrus.crl.dec.com:1
$ lamgrow sybil
$ lamnodes
n0 walrus.crl.dec.com:1
n1 sybil.crl.dec.com:1
$ lamhalt
Note that although sybil was added successfully in this set of commands,
I do occasionally get timeouts when adding sybil this way.
============================================================================
The problem comes in when I try to put those commands in a shell script:
$ cat grower.x
#!/bin/bash
lamboot ~/hostfiles/localhost
ssh -x sybil echo $SHELL
lamgrow -v sybil
$ grower.x
LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
The ssh shell on sybil
/bin/csh
Executing hboot on n1 (sybil - 0 CPU)...
lamgrow (lambootagent): Connection timed out
tkill ...
I *always* get timeouts when the lamgrow command is executed from a
shell script or a C program.
============================================================================
Does anybody have any suggestions on how to get lamgrow to work from a
program?
Thanks.