LAM/MPI General User's Mailing List Archives

From: Axel Scheepers, Operations Via NET.Works NL (ascheepers_at_[hidden])
Date: 2003-08-03 09:00:47


Hello all,

I've set up a test environment for LAM/MPI containing 5 machines. In between
compiles I tried to run lamboot with a couple of the configured machines, and
got stuck on one of my nodes, which runs Solaris.

I am able to run recon without any problems:
pvm_at_darkstar:~/lam/etc$ cat lam-bhost.def
darkstar cpu=2
sauron
zeus
pvm_at_darkstar:~/lam/etc$ recon
-----------------------------------------------------------------------------
Woo hoo!

recon has completed successfully.
....

But when I try lamboot it fails:
pvm_at_darkstar:~/lam/etc$ lamboot

LAM 7.0/MPI 2 C++/ROMIO - Indiana University

-----------------------------------------------------------------------------
LAM failed to execute a LAM binary on the remote node "zeus".
Since LAM was already able to determine your remote shell as "hboot",
it is probable that this is not an authentication problem.

LAM tried to use the remote agent command "ssh"
to invoke the following command:

        ssh -x zeus -n hboot -t -c lam-conf.lamd -s -I "-H 192.168.0.10 -P 41831 -n 2 -o 0"
....

So, as mentioned, I tried running that by hand:
pvm_at_darkstar:~/lam/etc$ ssh -x zeus -n hboot -t -c lam-conf.lamd -s -I "-H 192.168.0.10 -P 41831 -n 2 -o 0"
pvm_at_darkstar:~/lam/etc$

Hm? That seemed to go OK, didn't it? But then I wouldn't be having a
problem... so let's print the exit code of hboot:
pvm_at_darkstar:~/lam/etc$ ssh -x zeus -n hboot -t -c lam-conf.lamd -s -I "-H 192.168.0.10 -P 41831 -n 2 -o 0"; echo $?
226
pvm_at_darkstar:~/lam/etc$
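
To repeat this check for each node without retyping `; echo $?`, a small wrapper can help. This is only a sketch: the function name `report_status` is made up for this example and is not part of LAM.

```shell
# report_status is a hypothetical helper, not a LAM tool.
# It runs the command given as its arguments, prints the exit
# status that command returned, and passes that status on.
report_status() {
    "$@"
    status=$?
    echo "exit status: $status"
    return "$status"
}
```

It would be used by prefixing the command in question, for example `report_status ssh -x zeus -n hboot -t -c lam-conf.lamd -s -I "-H 192.168.0.10 -P 41831 -n 2 -o 0"`.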

Hmm OK, that seems to be the problem, I think; as far as I know, that's the
return code you get when you specify an unknown option.

lamboot -d doesn't give me much help here either; it just prints out the
above line.
zeus runs Solaris 9 and has a bash 2.05 shell. All needed variables are
being set in .ssh/environment, and are loaded correctly.
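
One way to double-check that claim for the non-interactive case (the way lamboot reaches the node) is to print what such a session actually sees. A rough sketch follows; `show_remote_env` is a made-up name, and it assumes the remote shell understands `type`.

```shell
# show_remote_env is a hypothetical helper, not part of LAM.
# $1: remote agent (for example ssh), $2: host name.
# It runs one short non-interactive command on the host and prints
# the PATH seen there, plus where (or whether) hboot is found.
show_remote_env() {
    agent=$1
    host=$2
    "$agent" -x -n "$host" 'echo "PATH=$PATH"; type hboot'
}
```

For the setup described here the call would be `show_remote_env ssh zeus`; comparing its output with `echo $PATH` in an interactive login on zeus should show whether the two sessions differ.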

Does anyone have any ideas about what's going wrong here?
It shouldn't be a problem to mix different machines, should it?
(I know it would be better to buy some new machines and set it up properly,
but I don't have the money for that :( )

When I run it directly on zeus, everything seems to be OK:
pvm_at_darkstar:~/lam/etc$ ssh zeus
Last login: Sun Aug 3 15:26:34 2003 from darkstar.thuis
Sun Microsystems Inc. SunOS 5.9 Generic_112234-03 November 2002
bash-2.05$ hboot -t -c lam-conf.lamd -s -I "-H 192.168.0.10 -P 41831 -n 2 -o O"
bash-2.05$ echo $?
0

?!
Note that my .bash_profile is empty, and all variables are in
.ssh/environment:
bash-2.05$ cat .ssh/environment
PVM_ROOT=$HOME/pvm3
PVM_RSH=/usr/bin/ssh
PATH=$HOME/bin:$PATH:/usr/local/bin:/usr/ccs/bin:$HOME/pvm3/bin/X86SOL2:$HOME/pvm3/lib:/home/pvm/lam/bin
LD_LIBRARY_PATH=/usr/sfw/lib:/usr/local/lib
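
One thing that may be worth ruling out: as far as I know, OpenSSH-style sshd reads ~/.ssh/environment as plain name=value lines and does not do shell expansion, so `$HOME` and `$PATH` could end up as literal text in the variables. Whether the ssh on Solaris 9 behaves the same way is an assumption here. A small sketch to list the lines that would be affected (`find_unexpanded` is a made-up name):

```shell
# find_unexpanded is a hypothetical helper.
# $1: path to an environment file (default: ~/.ssh/environment).
# Prints, with line numbers, every line containing a literal "$".
find_unexpanded() {
    file=${1:-$HOME/.ssh/environment}
    grep -n '\$' "$file"
}
```

Run on zeus as `find_unexpanded`, it would flag the PVM_ROOT and PATH lines of the file shown above.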

Now I am confused...

Could you please Cc: me, as I am not on the list.

Thanx,
gr,

-- 
VIA NET.WORKS Nederland
Axel Scheepers
System Administrator UNIX
phone 	+31 40 239 33 93
fax 	+31 40 239 33 11
e-mail 	ascheepers_at_[hidden]
pgp id  21A33FE0
http://www.vianetworks.nl/