LAM/MPI General User's Mailing List Archives

From: Amey Dharurkar (adharurk_at_[hidden])
Date: 2003-10-20 16:54:58


Hi,
Can you provide some more details (specifically the error which you
get when the recon fails at 'ssh paris -n echo $SHELL')?
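
One way to collect those details is to rerun the failing command non-interactively and keep stdout, stderr, and the exit status apart. The sketch below is only an illustration: the helper name `capture` is hypothetical, and nothing here is part of LAM itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of LAM): run a command and report its
# exit status plus how much it wrote to stdout and to stderr.
capture() {
    out_file=$(mktemp) || return 1
    err_file=$(mktemp) || return 1

    # Run the command exactly as given, with both streams redirected to files.
    "$@" >"$out_file" 2>"$err_file"
    rc=$?

    echo "exit status: $rc"
    echo "stdout bytes: $(wc -c <"$out_file")"
    echo "stderr bytes: $(wc -c <"$err_file")"

    # Replay whatever the command printed on stderr so it can be inspected.
    cat "$err_file" >&2

    rm -f "$out_file" "$err_file"
    return "$rc"
}

# Intended use, with the host name from this thread (not executed here):
#   capture ssh paris -n 'echo $SHELL'
```

A non-zero exit status or a non-zero stderr byte count from the machine where lamboot is invoked would be the kind of detail worth posting.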

Amey S. Dharurkar
----------------------------------------------------------
LAM/MPI Developer
Graduate Student, Indiana University
Ph. O:(812)855-3609, H:(812)331-8203

On Mon, 20 Oct 2003, Bastian Goldluecke wrote:

> Dear LAM users,
>
> I have a problem booting a certain machine in a LAM cluster, and would greatly
> appreciate some help, since I have run out of ideas.
>
> I already successfully compiled and ran MPI applications distributed over
> several Linux machines. Then I tried to add some real processing power by
> adding a 70-processor machine called "paris" running SunOS 5.9.
>
> The following things already work fine:
> 1. Running "lamboot lamhosts" directly on paris with lamhosts containing only
> the single line "paris cpu=70". Test applications compile and run fine.
> 2. Running "lamboot lamhosts" directly on paris with lamhosts also containing
> my Linux machine:
> "paris cpu=70
> mpiat5100 cpu=2"
> 3. ssh'ing from my Linux machine (mpiat5100) to paris and back. No password is
> required, no messages printed to stderr.
> 4. Running LAM programs from mpiat5100 remotely on paris, e.g. via
> ssh paris -n laminfo
>
> BUT: I really have to invoke "lamboot lamhosts" from my Linux machine, so my
> "lamhosts" file looks exactly the other way round:
> mpiat5100 cpu=2
> ...
> mpiat5304
> paris cpu=70
> I have added a few other Linux nodes, just to see if they work. They do. When
> I try to lamboot or recon, I get an error for paris:
>
> ...
> tkill: got killname back: /tmp/lam-bg_at_mpiat5304/lam-killfile
> tkill: removing socket file ...
> tkill: removing IO daemon socket file ...
> tkill: f_kill = "/tmp/lam-bg_at_mpiat5304/lam-killfile"
> tkill: nothing to kill: "/tmp/lam-bg_at_mpiat5304/lam-killfile"
> n0<32138> ssi:boot:rsh: successfully launched on n5 (mpiat5304)
> n0<32138> ssi:boot:base:linear: booting n6 (paris)
> n0<32138> ssi:boot:rsh: starting recon on (paris)
> n0<32138> ssi:boot:rsh: starting on n6 (paris): tkill -N -d -v
> n0<32138> ssi:boot:rsh: launching remotely
> n0<32138> ssi:boot:rsh: attempting to execute "ssh paris -n echo $SHELL"
>
> The last "ssh" command failed. However, when I enter
> ssh paris -n 'echo $SHELL'
> manually from a shell, it works fine and echoes only the single line
> "/usr/local/bin/bash", with no error messages.
>
> Does anybody know of another reason why the booting process could have
> failed?
>
> [By the way, the command
> ssh paris -n echo $SHELL
> suggested by lamboot is incorrect when typed into a shell as shown: the
> unquoted $SHELL is expanded locally, so it echoes the shell used on the local
> machine and *NOT* the shell used on the remote computer paris, which it is
> supposed to do. I hope it is not really used to query the remote shell?]
>
> Thanks,
> Bastian Goldluecke.
>
> MPI Informatik, Saarbruecken
> bg_at_[hidden]
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
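
The quoting point in the bracketed remark above can be reproduced without a remote host. This is a minimal sketch under stated assumptions: `fake_ssh` is a hypothetical stand-in that, like ssh, joins its arguments with spaces and hands the resulting string to a second shell, and both SHELL values are made up for the demonstration. It shows only what happens when the command is typed into a local shell; it does not show how lamboot itself launches the command.

```shell
#!/bin/sh
# Hypothetical stand-in for "ssh paris -n ...": join the arguments with
# spaces and let a second shell interpret them. The "remote" side gets
# its own SHELL value so the two sides can be told apart.
fake_ssh() {
    env SHELL=/remote/bin/bash sh -c "$*"
}

# Pretend local login shell, for the demonstration only.
SHELL=/local/bin/sh

# Unquoted: the local shell expands $SHELL before fake_ssh ever runs.
unquoted=$(fake_ssh echo $SHELL)

# Single-quoted: the text $SHELL reaches the "remote" shell unexpanded.
quoted=$(fake_ssh 'echo $SHELL')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

With the made-up values above, the unquoted form reports /local/bin/sh and the quoted form reports /remote/bin/bash, which matches the difference between the two ssh invocations described in the message.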