Hello all,
I've set up a test environment for LAM-MPI containing 5 machines. In between
compiling I tried to run lamboot with a couple of the configured machines, and
got stuck on one of my nodes, which runs Solaris.
I am able to run recon without any problems:
pvm_at_darkstar:~/lam/etc$ cat lam-bhost.def
darkstar cpu=2
sauron
zeus
pvm_at_darkstar:~/lam/etc$ recon
-----------------------------------------------------------------------------
Woo hoo!
recon has completed successfully.
....
But when I try lamboot it fails:
pvm_at_darkstar:~/lam/etc$ lamboot
LAM 7.0/MPI 2 C++/ROMIO - Indiana University
-----------------------------------------------------------------------------
LAM failed to execute a LAM binary on the remote node "zeus".
Since LAM was already able to determine your remote shell as "hboot",
it is probable that this is not an authentication problem.
LAM tried to use the remote agent command "ssh"
to invoke the following command:
ssh -x zeus -n hboot -t -c lam-conf.lamd -s -I "-H 192.168.0.10 -P
41831 -n 2 -o 0"
....
So, as mentioned, I tried running that by hand:
pvm_at_darkstar:~/lam/etc$ ssh -x zeus -n hboot -t -c lam-conf.lamd -s -I "-H 192.168.0.10 -P 41831 -n 2 -o 0"
pvm_at_darkstar:~/lam/etc$
Hm? That seems to go OK, doesn't it? But then I wouldn't be having a
problem.. so, let's print the exit code of hboot:
pvm_at_darkstar:~/lam/etc$ ssh -x zeus -n hboot -t -c lam-conf.lamd -s -I "-H 192.168.0.10 -P 41831 -n 2 -o 0"; echo $?
226
pvm_at_darkstar:~/lam/etc$
Hmm, OK, I think that is the problem; afaik it's the return code you get
when you specify an unknown option.
lamboot -d doesn't give me much help here either; it just prints out the
above line.
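Since the exit status alone says little, the same command could be rerun through a small wrapper that also keeps anything hboot writes to stdout or stderr. This is only a sketch: `run_and_report` is a made-up helper name, not something LAM ships.

```shell
#!/bin/sh
# run_and_report: hypothetical helper, not part of LAM.
# Runs the command given as arguments, prints its combined stdout/stderr
# (each line prefixed with "output: "), then prints its exit status.
run_and_report() {
    out=$("$@" 2>&1)      # capture stdout and stderr together
    status=$?             # exit status of the command that was run
    if [ -n "$out" ]; then
        printf '%s\n' "$out" | sed 's/^/output: /'
    fi
    echo "exit status: $status"
    return "$status"
}
```

Usage would be `run_and_report ssh -x zeus -n hboot -t -c lam-conf.lamd -s -I "-H 192.168.0.10 -P 41831 -n 2 -o 0"`. ssh normally passes through the exit status of the remote command and returns 255 for errors of its own, so a status other than 255 should be coming from the remote side.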
zeus runs Solaris 9 and has a bash 2.05 shell. All needed variables are
set in .ssh/environment, and are loaded correctly.
Does anyone have any ideas about what's going wrong here?
It shouldn't be a problem mixing different machines, should it?
(I know it would be better to buy some new machines and set it up properly,
but I don't have the money for that :( )
When I run it directly on zeus everything seems to be ok:
pvm_at_darkstar:~/lam/etc$ ssh zeus
Last login: Sun Aug 3 15:26:34 2003 from darkstar.thuis
Sun Microsystems Inc. SunOS 5.9 Generic_112234-03 November 2002
bash-2.05$ hboot -t -c lam-conf.lamd -s -I "-H 192.168.0.10 -P 41831 -n 2 -o O"
bash-2.05$ echo $?
0
?!
Note that my .bash_profile is empty, and all variables are in
.ssh/environment:
bash-2.05$ cat .ssh/environment
PVM_ROOT=$HOME/pvm3
PVM_RSH=/usr/bin/ssh
PATH=$HOME/bin:$PATH:/usr/local/bin:/usr/ccs/bin:$HOME/pvm3/bin/X86SOL2:$HOME/pvm3/lib:/home/pvm/lam/bin
LD_LIBRARY_PATH=/usr/sfw/lib:/usr/local/lib
Now I am confused..
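Because the failure only shows up when hboot is started through a non-interactive ssh command, one further check would be to print the environment that such a command actually sees on zeus and compare it with the login shell. A minimal sketch, assuming hboot is meant to be found through PATH; `show_remote_env` is a hypothetical helper name, and the remote agent is passed as the first argument so it can be swapped out:

```shell
#!/bin/sh
# show_remote_env: hypothetical helper, not part of LAM.
# $1 = remote agent command (e.g. ssh), $2 = host, $3 = binary to look for.
# Prints PATH and LD_LIBRARY_PATH as seen by a non-interactive remote
# command, followed by the location of the binary (or a "not found" line).
show_remote_env() {
    agent=$1
    host=$2
    bin=$3
    # \$PATH and \$LD_LIBRARY_PATH are escaped so they expand on the
    # remote side, not in the local shell.
    remote_cmd="echo \"PATH=\$PATH\"; echo \"LD_LIBRARY_PATH=\$LD_LIBRARY_PATH\"; command -v $bin || echo \"$bin not found\""
    "$agent" "$host" "$remote_cmd"
}
```

Running `show_remote_env ssh zeus hboot` from darkstar, and then `echo $PATH; command -v hboot` inside an interactive login on zeus, would show whether the two sessions really end up with the same settings.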
Could you please Cc: me, as I am not on the list.
Thanx,
gr,
--
VIA NET.WORKS Nederland
Axel Scheepers
System Administrator UNIX
phone +31 40 239 33 93
fax +31 40 239 33 11
e-mail ascheepers_at_[hidden]
pgp id 21A33FE0
http://www.vianetworks.nl/