Thanks for your help.
I configured LAM-MPI version 7.0 with "--with-rsh=ssh".
Here are excerpts from 'lamboot' output when executed with the '-d' option.
I show only the output for one node (n1) from my host file; it is
identical for all the others.
1) % lamboot -d hostfile
...
n-1<13309> ssi:boot: Opening
n-1<13309> ssi:boot: opening module globus
n-1<13309> ssi:boot: initializing module globus
n-1<13309> ssi:boot:globus: globus-job-run not found, globus boot will not
run
n-1<13309> ssi:boot: module not available: globus
n-1<13309> ssi:boot: opening module rsh
n-1<13309> ssi:boot: initializing module rsh
n-1<13309> ssi:boot:rsh: module initializing
n-1<13309> ssi:boot:rsh:agent: ssh
n-1<13309> ssi:boot:rsh:username: <same>
n-1<13309> ssi:boot:rsh:verbose: 1000
n-1<13309> ssi:boot:rsh:algorithm: linear
n-1<13309> ssi:boot:rsh:priority: 10
n-1<13309> ssi:boot: module available: rsh, priority: 10
n-1<13309> ssi:boot: finalizing module globus
n-1<13309> ssi:boot:globus: finalizing
n-1<13309> ssi:boot: closing module globus
n-1<13309> ssi:boot: Selected boot module rsh
n0<13306> ssi:boot:base:server: got connection from 142.130.48.15
n0<13306> ssi:boot:base:server: this connection is expected (n0)
n0<13306> ssi:boot:base:server: remote lamd is at 142.130.48.15:6273
n0<13306> ssi:boot:base:linear: booting n1 (nordet.qc.dfo.ca)
n0<13306> ssi:boot:rsh: starting lamd on (nordet.qc.dfo.ca)
n0<13306> ssi:boot:rsh: starting on n1 (nordet.qc.dfo.ca): hboot -t -c
lam-conf.lamd -d -s -I "-H 142.130.48.15 -P 43343 -n 1 -o 0"
n0<13306> ssi:boot:rsh: launching remotely
n0<13306> ssi:boot:rsh: -b used, assuming same shell on remote nodes
n0<13306> ssi:boot:rsh: got local shell /bin/bash
n0<13306> ssi:boot:rsh: attempting to execute "ssh nordet.qc.dfo.ca -n hboot
-t -c lam-conf.lamd -d -s -I "-H 142.130.48.15 -P 43343 -n 1 -o 0""
Enter passphrase for key '/home/gosselin_a/.ssh/id_dsa':
We have "rsh:agent: ssh" as expected, and an ssh connection is attempted on
node 1.
2) Now I set the rsh agent through "LAMRSH" and run lamboot again (following
a 'lamhalt' of course).
% export LAMRSH=rsh
% lamboot -d hostfile
...
n-1<13317> ssi:boot: Opening
n-1<13317> ssi:boot: opening module globus
n-1<13317> ssi:boot: initializing module globus
n-1<13317> ssi:boot:globus: globus-job-run not found, globus boot will not
run
n-1<13317> ssi:boot: module not available: globus
n-1<13317> ssi:boot: opening module rsh
n-1<13317> ssi:boot: initializing module rsh
n-1<13317> ssi:boot:rsh: module initializing
n-1<13317> ssi:boot:rsh:agent: rsh
n-1<13317> ssi:boot:rsh:username: <same>
n-1<13317> ssi:boot:rsh:verbose: 1000
n-1<13317> ssi:boot:rsh:algorithm: linear
n-1<13317> ssi:boot:rsh:priority: 10
n-1<13317> ssi:boot: module available: rsh, priority: 10
n-1<13317> ssi:boot: finalizing module globus
n-1<13317> ssi:boot:globus: finalizing
n-1<13317> ssi:boot: closing module globus
n-1<13317> ssi:boot: Selected boot module rsh
n0<13314> ssi:boot:base:server: got connection from 142.130.48.15
n0<13314> ssi:boot:base:server: this connection is expected (n0)
n0<13314> ssi:boot:base:server: remote lamd is at 142.130.48.15:6785
n0<13314> ssi:boot:base:linear: booting n1 (nordet.qc.dfo.ca)
n0<13314> ssi:boot:rsh: starting lamd on (nordet.qc.dfo.ca)
n0<13314> ssi:boot:rsh: starting on n1 (nordet.qc.dfo.ca): hboot -t -c
lam-conf.lamd -d -s -I "-H 142.130.48.15 -P 43348 -n 1 -o 0"
n0<13314> ssi:boot:rsh: launching remotely
n0<13314> ssi:boot:rsh: -b used, assuming same shell on remote nodes
n0<13314> ssi:boot:rsh: got local shell /bin/bash
n0<13314> ssi:boot:rsh: attempting to execute "rsh nordet.qc.dfo.ca -n hboot
-t -c lam-conf.lamd -d -s -I "-H 142.130.48.15 -P 43348 -n 1 -o 0""
...
We get "rsh:agent: rsh": rsh is the agent as expected, and an rsh
connection is attempted.
3) Now I try the (supposedly) equivalent "-ssi boot_rsh_agent rsh" command
line option, after unsetting the LAMRSH env var and a 'lamhalt'.
% lamboot -d -ssi boot_rsh_agent rsh hostfile
n-1<13340> ssi:boot: Opening
n-1<13340> ssi:boot: opening module globus
n-1<13340> ssi:boot: initializing module globus
n-1<13340> ssi:boot:globus: globus-job-run not found, globus boot will not
run
n-1<13340> ssi:boot: module not available: globus
n-1<13340> ssi:boot: opening module rsh
n-1<13340> ssi:boot: initializing module rsh
n-1<13340> ssi:boot:rsh: module initializing
n-1<13340> ssi:boot:rsh:agent: rsh
n-1<13340> ssi:boot:rsh:username: <same>
n-1<13340> ssi:boot:rsh:verbose: 1000
n-1<13340> ssi:boot:rsh:algorithm: linear
n-1<13340> ssi:boot:rsh:priority: 10
n-1<13340> ssi:boot: module available: rsh, priority: 10
n-1<13340> ssi:boot: finalizing module globus
n-1<13340> ssi:boot:globus: finalizing
n-1<13340> ssi:boot: closing module globus
n-1<13340> ssi:boot: Selected boot module rsh
n0<13337> ssi:boot:base:server: got connection from 142.130.48.15
n0<13337> ssi:boot:base:server: this connection is expected (n0)
n0<13337> ssi:boot:base:server: remote lamd is at 142.130.48.15:8321
n0<13337> ssi:boot:base:linear: booting n1 (nordet.qc.dfo.ca)
n0<13337> ssi:boot:rsh: starting lamd on (nordet.qc.dfo.ca)
n0<13337> ssi:boot:rsh: starting on n1 (nordet.qc.dfo.ca): hboot -t -c
lam-conf.lamd -d -s -I "-H 142.130.48.15 -P 43364 -n 1 -o 0"
n0<13337> ssi:boot:rsh: launching remotely
n0<13337> ssi:boot:rsh: -b used, assuming same shell on remote nodes
n0<13337> ssi:boot:rsh: got local shell /bin/bash
n0<13337> ssi:boot:rsh: attempting to execute "ssh nordet.qc.dfo.ca -n hboot
-t -c lam-conf.lamd -d -s -I "-H 142.130.48.15 -P 43364 -n 1 -o 0""
As expected, we get "rsh:agent: rsh", so rsh should be the agent. But
surprisingly, an ssh connection is attempted afterwards.
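The comparison made by eye above can be automated. This is a minimal sketch, assuming the 'lamboot -d' output has been saved to a file; the function name check_agent and the log file name are illustrative, not part of LAM. It extracts the agent reported by the rsh boot module and the first word of the command in the "attempting to execute" line, and reports whether they agree.

```shell
#!/bin/sh
# check_agent LOGFILE
# Compares the agent lamboot reports ("ssi:boot:rsh:agent: X") with the
# program it actually tries to run ('attempting to execute "Y ..."').
# LOGFILE is assumed to hold saved "lamboot -d" output.
check_agent() {
    log="$1"
    # Agent announced by the rsh boot module (first occurrence only).
    reported=$(sed -n 's/.*ssi:boot:rsh:agent: *//p' "$log" | head -n 1)
    # First word of the remote launch command (first occurrence only).
    executed=$(sed -n 's/.*attempting to execute "\([^ ]*\) .*/\1/p' "$log" | head -n 1)
    if [ "$reported" = "$executed" ]; then
        echo "ok: agent=$reported"
    else
        echo "mismatch: reported=$reported executed=$executed"
        return 1
    fi
}
```

For example, after "lamboot -d -ssi boot_rsh_agent rsh hostfile > lamboot.log 2>&1", running "check_agent lamboot.log" would flag case 3 as a mismatch, while cases 1 and 2 would pass.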
It seems that the command-line setting is not honoured at all. I spent a few
moments looking at the code. The output 'attempting to execute "ssh
nordet.qc.dfo.ca ...' comes from a function inside file
'ssi_boot_rsh_initexec.c', and the text of the command is initialized by a
call to function 'add_rsh()'. This function looks only at the LAMRSH env var
or the compiled default. Unless I am wrong, this seems to explain the
present behavior.
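To make the suspected lookup order concrete, here is a small shell model of it. It is only a sketch of the behavior described in this message, not the actual LAM source: the function name agent_as_observed and the variable COMPILED_DEFAULT are made up for illustration, and "ssh" stands in for the default chosen at configure time.

```shell
#!/bin/sh
# Model of the lookup order I believe add_rsh() follows (an assumption
# based on reading the code, not a copy of it).
# COMPILED_DEFAULT is a stand-in for the agent chosen at configure time.
COMPILED_DEFAULT="ssh"

# $1 represents a value passed with "-ssi boot_rsh_agent"; it is
# deliberately ignored here, which is exactly the suspected problem.
agent_as_observed() {
    # Use LAMRSH if it is set and non-empty, else the compiled default.
    echo "${LAMRSH:-$COMPILED_DEFAULT}"
}
```

With LAMRSH unset, "agent_as_observed rsh" prints ssh, which matches case 3. Until the option is honoured, a one-off "LAMRSH=rsh lamboot -d hostfile" sets the variable for that single invocation only.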
But you say you do not experience this problem. So what version are you
running exactly?
Again, here is what 'laminfo' gives for me:
LAM/MPI: 7.0
Prefix: /usr
Architecture: i586-mandrake-linux-gnu
Configured by: root
Configured on: Wed Sep 10 20:56:04 EDT 2003
Configure host: euclide.qc.dfo.ca
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
ROMIO support: yes
IMPI support: no
Debug support: no
Purify clean: no
SSI boot: globus (Module v0.5)
SSI boot: rsh (Module v1.0)
SSI coll: lam_basic (Module v7.0)
SSI coll: smp (Module v1.0)
SSI rpi: crtcp (Module v1.0)
SSI rpi: lamd (Module v7.0)
SSI rpi: sysv (Module v7.0)
SSI rpi: tcp (Module v7.0)
SSI rpi: usysv (Module v7.0)
Regards
-----Original Message-----
From: Jeff Squyres [mailto:jsquyres_at_[hidden]]
Sent: Thursday, September 11, 2003 18:17
To: General LAM/MPI mailing list
Subject: Re: LAM: LAMRSH vs boot_rsh_agent : boot_rsh_agent seems not
functionnal
On Thu, 11 Sep 2003 GosselinA_at_[hidden] wrote:
> I compiled lam-mpi 7.0 with "--with-rsh=ssh" to use ssh as the
> default. But sometimes I would like to revert to "rsh" to boot.
>
> If I do "export LAMRSH=rsh" , lamboot connects using rsh as expected,
> using ".rhosts" files on my nodes.
Good.
> But if I try this:
>
> % lamboot -v -ssi boot_rsh_agent rsh hostfile
>
> lamboot proceeds as if ssh was still my agent (asking for pass phrases
> for ex. if no ssh-agent has been loaded with my private key).
Can you look at the output of "lamboot -d -ssi boot_rsh_agent rsh hostfile"?
That should tell you what LAM is running as its underlying remote agent
command. I just tried "-ssi boot_rsh_agent rsh" myself and it overrode the
underlying command as it should.
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/