LAM/MPI General User's Mailing List Archives

From: Aamir Shafi (aamir.shafi_at_[hidden])
Date: 2004-08-04 15:08:29


Hi,

The output below seems to say that 'tkill' cannot be found. So
basically, when LAM tries to ssh into the compute nodes, it cannot find
'tkill'. My question is: how do I make it find it? If it is a matter of
adding $LAM_HOME/bin to the PATH, that is already done. What am I missing?

Thanks for any help
--Aamir

shafia_at_holly:~/install/lam-7.0.6/examples/ring$ recon -v lamhosts
n-1<6434> ssi:boot:base:linear: booting n0 (holly.starbug.dsg.port.ac.uk)
n-1<6434> ssi:boot:base:linear: booting n1 (comp00.starbug.dsg.port.ac.uk)
ERROR: LAM/MPI unexpectedly received the following on stderr:
bash: line 1: tkill: command not found
-----------------------------------------------------------------------------
LAM failed to execute a LAM binary on the remote node
"comp00.starbug.dsg.port.ac.uk".
Since LAM was already able to determine your remote shell as "tkill",
it is probable that this is not an authentication problem.

LAM tried to use the remote agent command "ssh"
to invoke the following command:

        ssh comp00.starbug.dsg.port.ac.uk -n tkill -N -v

This can indicate several things. You should check the following:

        - The LAM binaries are in your $PATH
        - You can run the LAM binaries
        - The $PATH variable is set properly before your
          .cshrc/.profile exits

Try to invoke the command listed above manually at a Unix prompt.

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<6434> ssi:boot:base:linear: Failed to boot n1
(comp00.starbug.dsg.port.ac.uk)
n-1<6434> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
recon was not able to complete successfully. There can be any number
of problems that did not allow recon to work properly. You should use
the "-d" option to recon to get more information about each step that
recon attempts.

Any error message above may present a more detailed description of the
actual problem.

Here is general a list of prerequisites that *must* be fulfilled
before recon can work:

        - Each machine in the hostfile must be reachable and operational.
        - You must have an account on each machine.
        - You must be able to rsh(1) to the machine (permissions
          are typically set in the user's $HOME/.rhosts file).

        *** Sidenote: If you compiled LAM to use a remote shell program
            other than rsh (with the --with-rsh option to ./configure;
            e.g., ssh), or if you set the LAMRSH environment variable
            to an alternate remote shell program, you need to ensure
            that you can execute programs on remote nodes with no
            password. For example:

        unix% ssh -x pinky uptime
        3:09am up 211 day(s), 23:49, 2 users, load average: 0.01, 0.08, 0.10

        - The LAM executables must be locatable on each machine, using
          the shell's search path and possibly the LAMHOME environment
          variable.
        - The shell's start-up script must not print anything on standard
          error. You can take advantage of the fact that rsh(1) will
          start the shell non-interactively. The start-up script (such
          as .profile or .cshrc) can exit early in this case, before
          executing many commands relevant only to interactive sessions
          and likely to generate output.
-----------------------------------------------------------------------------
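
The error text above suggests running the failing command by hand and
checking that $PATH is set before the start-up script exits. A minimal
sketch of that check follows. It assumes a POSIX shell on both ends;
REMOTE_AGENT and check_remote_cmd are hypothetical names made up for this
example and are not part of LAM. The idea is to ask the remote shell,
started non-interactively just as LAM starts it, whether it can resolve
the command.

```shell
#!/bin/sh
# Sketch only: REMOTE_AGENT and check_remote_cmd are hypothetical names
# for this example, not LAM settings or LAM commands.

# Remote agent to use; defaults to ssh, the agent named in the error above.
REMOTE_AGENT="${REMOTE_AGENT:-ssh}"

# check_remote_cmd HOST CMD
# Report whether CMD is on the PATH seen by a non-interactive shell on HOST.
check_remote_cmd() {
    host=$1
    cmd=$2
    # -n keeps the agent from reading stdin, as in LAM's own invocation.
    # "command -v" prints the resolved path and fails if CMD is not found.
    if out=$($REMOTE_AGENT -n "$host" "command -v $cmd" 2>&1); then
        echo "ok: $cmd found on $host at $out"
    else
        echo "missing: $cmd is not on the non-interactive PATH of $host"
        # Single quotes so that $PATH is expanded on the remote side.
        $REMOTE_AGENT -n "$host" 'echo "remote PATH: $PATH"'
        return 1
    fi
}
```

With the function loaded, "check_remote_cmd comp00.starbug.dsg.port.ac.uk tkill"
would show the PATH that the remote non-interactive shell actually uses.
If the LAM bin directory is missing from it, one likely cause is that PATH
is set in a start-up file that only interactive or login shells read, or
after the point where the script exits early for non-interactive sessions.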