LAM/MPI General User's Mailing List Archives

From: damien_at_[hidden]
Date: 2005-04-06 09:45:30


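This looks like the old faithful ssh password problem: recon is probably
being asked for a password when it tries to run a command on localhost
through ssh. Did you try the command that the error message recommends?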

/bin/ssh -x localhost -n echo $SHELL
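
To run the same check on every host without risking a hang at a password
prompt, something like the following might help. It is only a sketch:
check_agent is a made-up helper name, not a LAM command, and the default
agent assumes an OpenSSH client, whose BatchMode=yes option makes ssh fail
instead of prompting.

```shell
#!/bin/sh
# check_agent HOST [AGENT...]
# Hypothetical helper (not part of LAM): sends the same probe that recon
# uses, "echo $SHELL", through the remote agent and fails if the agent
# returns an error or prints nothing.
check_agent() {
    host=$1
    shift
    if [ "$#" -eq 0 ]; then
        # Assumed default: OpenSSH. BatchMode=yes turns a password prompt
        # into an immediate failure.
        set -- /bin/ssh -x -o BatchMode=yes
    fi
    # Single quotes: $SHELL is expanded on the remote side, not locally.
    if out=$("$@" "$host" -n 'echo $SHELL' 2>/dev/null) && [ -n "$out" ]; then
        echo "ok: $host reports shell $out"
    else
        echo "FAILED: $host (password prompt or login problem?)"
        return 1
    fi
}
```

With the hosts from the boot schema, usage could look like this:

    for h in localhost master node01 node02 node03; do check_agent "$h"; done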

If you were asked for a password, it's straightforward to fix. It's
written up here:

http://www.lam-mpi.org/faq/category4.php3#question15
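
In case the FAQ is out of reach, the usual fix is key-based login. A rough
sketch follows, assuming OpenSSH on every node; authorize_key is a
hypothetical helper, not part of LAM or OpenSSH, and the key pair itself is
assumed to have been generated already with ssh-keygen.

```shell
#!/bin/sh
# authorize_key PUBKEY_FILE [SSH_DIR]
# Hypothetical helper: appends a public key to authorized_keys (only once)
# and tightens permissions, since sshd commonly refuses keys kept in
# group- or world-writable locations.
authorize_key() {
    pubkey=$1
    sshdir=${2:-$HOME/.ssh}
    [ -r "$pubkey" ] || { echo "cannot read $pubkey" >&2; return 1; }
    mkdir -p "$sshdir" && chmod 700 "$sshdir" || return 1
    touch "$sshdir/authorized_keys" || return 1
    # -x: whole-line match, -F: fixed string; skip keys already present.
    if ! grep -qxF -f "$pubkey" "$sshdir/authorized_keys"; then
        cat "$pubkey" >> "$sshdir/authorized_keys"
    fi
    chmod 600 "$sshdir/authorized_keys"
}
```

Typical usage, assuming the default key file name:

    ssh-keygen -t rsa
    authorize_key ~/.ssh/id_rsa.pub

If home directories are not shared across the cluster, the public key also
has to be added on each node. A key with a passphrase will still prompt
unless ssh-agent is holding it. Afterwards, rerun the ssh command above by
hand to confirm that no password is requested.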

Damien

> Hello. I have a problem with the LAM configuration on my cluster. The
> master node has two NICs and each slave has one. I have configured the
> connection to use ssh instead of rsh. My lam-bhost.def looks like this:
> master
> node01
> node02
> node03
> I have a problem with lamboot, and this is the output when I run recon -d:
>
> [aizura_at_master lam-7.0.6]$ recon -d
> n-1<26486> ssi:boot: Opening
> n-1<26486> ssi:boot: opening module globus
> n-1<26486> ssi:boot: initializing module globus
> n-1<26486> ssi:boot:globus: globus-job-run not found, globus boot will not
> run
> n-1<26486> ssi:boot: module not available: globus
> n-1<26486> ssi:boot: opening module rsh
> n-1<26486> ssi:boot: initializing module rsh
> n-1<26486> ssi:boot:rsh: module initializing
> n-1<26486> ssi:boot:rsh:agent: /bin/ssh -x
> n-1<26486> ssi:boot:rsh:username: <same>
> n-1<26486> ssi:boot:rsh:verbose: 1000
> n-1<26486> ssi:boot:rsh:algorithm: linear
> n-1<26486> ssi:boot:rsh:priority: 10
> n-1<26486> ssi:boot: module available: rsh, priority: 10
> n-1<26486> ssi:boot: finalizing module globus
> n-1<26486> ssi:boot:globus: finalizing
> n-1<26486> ssi:boot: closing module globus
> n-1<26486> ssi:boot: Selected boot module rsh
> n-1<26486> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<26486> ssi:boot:base: <current directory>
> n-1<26486> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<26486> ssi:boot:base: $LAMHOME/etc
> n-1<26486> ssi:boot:base: /usr/etc
> n-1<26486> ssi:boot:base: looking for boot schema file:
> n-1<26486> ssi:boot:base: lam-bhost.def
> n-1<26486> ssi:boot:base: found boot schema: /usr/etc/lam-bhost.def
> n-1<26486> ssi:boot:rsh: found the following hosts:
> n-1<26486> ssi:boot:rsh: n0 localhost (cpu=1)
> n-1<26486> ssi:boot:rsh: n1 master (cpu=1)
> n-1<26486> ssi:boot:rsh: n2 node01 (cpu=1)
> n-1<26486> ssi:boot:rsh: n3 node02 (cpu=1)
> n-1<26486> ssi:boot:rsh: n4 node03 (cpu=1)
> n-1<26486> ssi:boot:rsh: resolved hosts:
> n-1<26486> ssi:boot:rsh: n0 localhost --> 127.0.0.0
> n-1<26486> ssi:boot:rsh: n1 master --> 10.0.0.1 (origin)
> n-1<26486> ssi:boot:rsh: n2 node01 --> 10.0.0.2
> n-1<26486> ssi:boot:rsh: n3 node02 --> 10.0.0.3
> n-1<26486> ssi:boot:rsh: n4 node03 --> 10.0.0.4
> n-1<26486> ssi:boot:rsh: starting RTE procs
> n-1<26486> ssi:boot:base:linear: starting
> n-1<26486> ssi:boot:base:linear: booting n0 (localhost)
> n-1<26486> ssi:boot:rsh: starting recon on (localhost)
> n-1<26486> ssi:boot:rsh: starting on n0 (localhost): tkill -N -d
> n-1<26486> ssi:boot:rsh: launching remotely
> n-1<26486> ssi:boot:rsh: attempting to execute "/bin/ssh -x localhost -n
> echo $SHELL"
> -----------------------------------------------------------------------------
> LAM failed to execute a process on the remote node "localhost".
> LAM was not trying to invoke any LAM-specific commands yet -- we were
> simply trying to determine what shell was being used on the remote
> host.
>
> LAM tried to use the remote agent command "/bin/ssh"
> to invoke "echo $SHELL" on the remote node.
>
> This usually indicates an authentication problem with the remote
> agent, or some other configuration type of error in your .cshrc or
> .profile file. The following is a list of items that you may wish to
> check on the remote node:
>
> - You have an account and can login to the remote machine
> - Incorrect permissions on your home directory (should
> probably be 0755)
> - Incorrect permissions on your $HOME/.rhosts file (if you are
> using rsh -- they should probably be 0644)
> - You have an entry in the remote $HOME/.rhosts file (if you
> are using rsh) for the machine and username that you are
> running from
> - Your .cshrc/.profile must not print anything out to the
> standard error
> - Your .cshrc/.profile should set a correct TERM type
> - Your .cshrc/.profile should set the SHELL environment
> variable to your default shell
>
> Try invoking the following command at the unix command line:
>
> /bin/ssh -x localhost -n echo $SHELL
>
> You will need to configure your local setup such that you will *not*
> be prompted for a password to invoke this command on the remote node.
> No output should be printed from the remote node before the output of
> the command is displayed.
>
> When you can get this command to execute successfully by hand, LAM
> will probably be able to function properly.
> -----------------------------------------------------------------------------
> n-1<26486> ssi:boot:base:linear: Failed to boot n0 (localhost)
> n-1<26486> ssi:boot:base:linear: aborted!
> -----------------------------------------------------------------------------
> recon was not able to complete successfully. There can be any number
> of problems that did not allow recon to work properly. You should use
> the "-d" option to recon to get more information about each step that
> recon attempts.
>
> Any error message above may present a more detailed description of the
> actual problem.
>
> Here is general a list of prerequisites that *must* be fulfilled
> before recon can work:
>
> - Each machine in the hostfile must be reachable and operational.
> - You must have an account on each machine.
> - You must be able to rsh(1) to the machine (permissions
> are typically set in the user's $HOME/.rhosts file).
>
> *** Sidenote: If you compiled LAM to use a remote shell program
> other than rsh (with the --with-rsh option to ./configure;
> e.g., ssh), or if you set the LAMRSH environment variable
> to an alternate remote shell program, you need to ensure
> that you can execute programs on remote nodes with no
> password. For example:
>
> unix% ssh -x pinky uptime
> 3:09am up 211 day(s), 23:49, 2 users, load average: 0.01, 0.08,
> 0.10
>
> - The LAM executables must be locatable on each machine, using
> the shell's search path and possibly the LAMHOME environment
> variable.
> - The shell's start-up script must not print anything on standard
> error. You can take advantage of the fact that rsh(1) will
> start the shell non-interactively. The start-up script (such
> as .profile or .cshrc) can exit early in this case, before
> executing many commands relevant only to interactive sessions
> and likely to generate output.
> -----------------------------------------------------------------------------
> n-1<26486> ssi:boot:rsh: finalizing
> n-1<26486> ssi:boot: Closing
>
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/