Dear LAM users,
I have a problem booting one particular machine in a LAM cluster and would
greatly appreciate some help, since I have run out of ideas.
I have already successfully compiled and run MPI applications distributed over
several Linux machines. Then I tried to add some real processing power by
adding a 70-processor machine called "paris" running SunOS 5.9.
The following things already work fine:
1. Running "lamboot lamhosts" directly on paris with lamhosts containing only
the single line "paris cpu=70". Test applications compile and run fine.
2. Running "lamboot lamhosts" directly on paris with lamhosts also containing
my Linux machine:
"paris cpu=70
mpiat5100 cpu=2"
3. ssh'ing from my Linux machine (mpiat5100) to paris and back. No password is
required, no messages printed to stderr.
4. Running LAM programs from mpiat5100 remotely on paris, e.g. via
ssh paris -n laminfo
BUT: I really have to invoke "lamboot lamhosts" from my Linux machine, so my
"lamhosts" file looks exactly the other way round:
mpiat5100 cpu=2
...
mpiat5304
paris cpu=70
I have added a few other Linux nodes, just to see if they work. They do. When
I try to lamboot or recon, I get an error for paris:
...
tkill: got killname back: /tmp/lam-bg_at_mpiat5304/lam-killfile
tkill: removing socket file ...
tkill: removing IO daemon socket file ...
tkill: f_kill = "/tmp/lam-bg_at_mpiat5304/lam-killfile"
tkill: nothing to kill: "/tmp/lam-bg_at_mpiat5304/lam-killfile"
n0<32138> ssi:boot:rsh: successfully launched on n5 (mpiat5304)
n0<32138> ssi:boot:base:linear: booting n6 (paris)
n0<32138> ssi:boot:rsh: starting recon on (paris)
n0<32138> ssi:boot:rsh: starting on n6 (paris): tkill -N -d -v
n0<32138> ssi:boot:rsh: launching remotely
n0<32138> ssi:boot:rsh: attempting to execute "ssh paris -n echo $SHELL"
The last "ssh" command failed. However, when I enter
ssh paris -n 'echo $SHELL'
manually from a shell, it works fine and echoes only the single line
"/usr/local/bin/bash", with no error messages.
Does anybody know of another reason why the boot process could have
failed?
[By the way, the command
ssh paris -n echo $SHELL
suggested by lamboot is incorrect, since it echoes the current shell used on
the local machine, and *NOT* the shell used on the remote computer paris,
which it is supposed to do. I hope it is not really used to query the remote
shell?]
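
To illustrate the quoting point, here is a minimal local sketch. It makes two assumptions that are not part of my setup: "sh -c" stands in for "ssh paris -n", and DEMO_SHELL is a hypothetical variable standing in for $SHELL. It only shows what happens when the command is typed into a shell; it says nothing about how lamboot actually passes the string to ssh.

```shell
# DEMO_SHELL is a hypothetical stand-in for $SHELL;
# "env ... sh -c" stands in for "ssh paris -n" (a child shell with its own value).
DEMO_SHELL=local-value
export DEMO_SHELL

# Double quotes (or no quotes): the calling shell substitutes $DEMO_SHELL
# before the child shell is even started.
unquoted=$(env DEMO_SHELL=remote-value sh -c "echo $DEMO_SHELL")

# Single quotes: the literal text $DEMO_SHELL reaches the child shell,
# which then expands it with its own value.
quoted=$(env DEMO_SHELL=remote-value sh -c 'echo $DEMO_SHELL')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

The first line reports the caller's value and the second the child's value, which is the difference between the command as printed by lamboot and the single-quoted form I typed manually.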
Thanks,
Bastian Goldluecke.
MPI Informatik, Saarbruecken
bg_at_[hidden]