LAM/MPI General User's Mailing List Archives

From: Mahmoud Payami (mpayami_at_[hidden])
Date: 2006-04-12 12:17:58


Dear LAM users and developers,

I am new to LAM/MPI and have not yet managed to get it working. I have gone through the LAM User's Guide and the FAQ, and as far as I can tell every point mentioned in them is satisfied.
The steps I followed to configure and build are as follows:

1- FC=ifort F77=ifort
2- export FC F77
3- ./configure --with-rsh="/usr/bin/ssh -x"
4- make
5- make install (as root).
6- I created a file named "hostfile" containing these two lines:
   condmat1.ctpm.aeoi.org cpu=2
   condmat10.ctpm.aeoi.org cpu=2
7- The bin directory (/usr/local/bin) has been added to the environment settings in .bashrc
8- I can ssh to the remote node without a password
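
As a quick sanity check on the hostfile from step 6, the following is a minimal sketch that lists each host with its CPU count. The function name list_hosts is hypothetical, and the sketch assumes the usual boot schema conventions: one host per line, "#" starting a comment, and one CPU when no cpu=N is given.

```shell
#!/bin/sh
# list_hosts (hypothetical helper): print "host cpus" for each host line
# of a LAM boot schema file passed as the first argument.
list_hosts() {
    awk '
        /^[[:space:]]*(#|$)/ { next }      # skip blank lines and comment lines
        {
            cpus = 1                       # assumed default when cpu=N is absent
            for (i = 2; i <= NF; i++)
                if ($i ~ /^cpu=[0-9]+$/)
                    cpus = substr($i, 5)   # text after "cpu="
            print $1, cpus
        }
    ' "$1"
}
```

Running "list_hosts hostfile" on the file above should print one line per node, which makes a mistyped hostname or cpu entry easy to spot before lamboot is attempted.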

Now, when I try to boot LAM, I receive the following messages. I would appreciate any comments.

Best regards,
                   Mahmoud Payami
------------------------------------------------------------------------------------------------------------------
[mahmoud@condmat1 ~]$ lamboot -v -ssi boot rsh hostfile

LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University

n-1<27159> ssi:boot:base:linear: booting n0 (condmat1.ctpm.aeoi.org)
n-1<27159> ssi:boot:base:linear: booting n1 (condmat10.ctpm.aeoi.org)
-----------------------------------------------------------------------------
LAM failed to execute a process on the remote node "condmat10.ctpm.aeoi.org".
LAM was not trying to invoke any LAM-specific commands yet -- we were
simply trying to determine what shell was being used on the remote
host.

LAM tried to use the remote agent command "/home/mahmoud/lam-7.0.6/share/ssi/boot/rsh/ssh"
to invoke "echo $SHELL" on the remote node.

This usually indicates an authentication problem with the remote
agent, or some other configuration type of error in your .cshrc or
.profile file. The following is a list of items that you may wish to
check on the remote node:

        - You have an account and can login to the remote machine
        - Incorrect permissions on your home directory (should
          probably be 0755)
        - Incorrect permissions on your $HOME/.rhosts file (if you are
          using rsh -- they should probably be 0644)
        - You have an entry in the remote $HOME/.rhosts file (if you
          are using rsh) for the machine and username that you are
          running from
        - Your .cshrc/.profile must not print anything out to the
          standard error
        - Your .cshrc/.profile should set a correct TERM type
        - Your .cshrc/.profile should set the SHELL environment
          variable to your default shell

Try invoking the following command at the unix command line:

        /home/mahmoud/lam-7.0.6/share/ssi/boot/rsh/ssh -x condmat10.ctpm.aeoi.org -n echo $SHELL

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<27159> ssi:boot:base:linear: Failed to boot n1 (condmat10.ctpm.aeoi.org)
n-1<27159> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
n-1<27164> ssi:boot:base:linear: booting n0 (condmat1.ctpm.aeoi.org)
n-1<27164> ssi:boot:base:linear: booting n1 (condmat10.ctpm.aeoi.org)
-----------------------------------------------------------------------------
LAM failed to execute a process on the remote node "condmat10.ctpm.aeoi.org".
LAM was not trying to invoke any LAM-specific commands yet -- we were
simply trying to determine what shell was being used on the remote
host.

LAM tried to use the remote agent command "/home/mahmoud/lam-7.0.6/share/ssi/boot/rsh/ssh"
to invoke "echo $SHELL" on the remote node.

This usually indicates an authentication problem with the remote
agent, or some other configuration type of error in your .cshrc or
.profile file. The following is a list of items that you may wish to
check on the remote node:

        - You have an account and can login to the remote machine
        - Incorrect permissions on your home directory (should
          probably be 0755)
        - Incorrect permissions on your $HOME/.rhosts file (if you are
          using rsh -- they should probably be 0644)
        - You have an entry in the remote $HOME/.rhosts file (if you
          are using rsh) for the machine and username that you are
          running from
        - Your .cshrc/.profile must not print anything out to the
          standard error
        - Your .cshrc/.profile should set a correct TERM type
        - Your .cshrc/.profile should set the SHELL environment
          variable to your default shell

Try invoking the following command at the unix command line:

        /home/mahmoud/lam-7.0.6/share/ssi/boot/rsh/ssh -x condmat10.ctpm.aeoi.org -n echo $SHELL

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<27164> ssi:boot:base:linear: Failed to boot n1 (condmat10.ctpm.aeoi.org)
n-1<27164> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully
[mahmoud@condmat1 ~]$
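
The manual test that the message above asks for can be wrapped in a small helper. This is only a sketch: check_quiet is a hypothetical name, and it checks just two of the listed conditions, namely that the command exits with status 0 and that nothing is written to standard error.

```shell
#!/bin/sh
# check_quiet (hypothetical helper): run the given command with stdin
# closed, then report whether it exited cleanly and stayed silent on stderr.
check_quiet() {
    err_file=$(mktemp) || return 2
    status=0
    # Capture stdout in $out and send stderr to a temporary file.
    out=$("$@" 2>"$err_file" </dev/null) || status=$?
    if [ "$status" -ne 0 ]; then
        echo "FAIL: exit status $status"
    elif [ -s "$err_file" ]; then
        echo "FAIL: output on stderr"
    else
        echo "OK: $out"
        rm -f "$err_file"
        return 0
    fi
    rm -f "$err_file"
    return 1
}
```

For the case above it would be called with the agent exactly as lamboot printed it, for example: check_quiet /home/mahmoud/lam-7.0.6/share/ssi/boot/rsh/ssh -x condmat10.ctpm.aeoi.org -n 'echo $SHELL'. An "OK:" line followed by the remote shell path would suggest that this particular step is working.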

=++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

[mahmoud@condmat1 ~]$ recon
-----------------------------------------------------------------------------
Woo hoo!

recon has completed successfully. This means that you will most likely
be able to boot LAM successfully with the "lamboot" command (but this
is not a guarantee). See the lamboot(1) manual page for more
information on the lamboot command.

If you have problems booting LAM (with lamboot) even though recon
worked successfully, enable the "-d" option to lamboot to examine each
step of lamboot and see what fails. Most situations where recon
succeeds and lamboot fails have to do with the hboot(1) command (that
lamboot invokes on each host in the hostfile).
-----------------------------------------------------------------------------
[mahmoud@condmat1 ~]$
******************************************************************************
[mahmoud@condmat1 ~]$ lamboot -v hostfile

LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University

n-1<28223> ssi:boot:base:linear: booting n0 (condmat1.ctpm.aeoi.org)
n-1<28223> ssi:boot:base:linear: booting n1 (condmat10.ctpm.aeoi.org)
-----------------------------------------------------------------------------
LAM failed to execute a process on the remote node "condmat10.ctpm.aeoi.org".
LAM was not trying to invoke any LAM-specific commands yet -- we were
simply trying to determine what shell was being used on the remote
host.

LAM tried to use the remote agent command "/home/mahmoud/lam-7.0.6/share/ssi/boot/rsh/ssh"
to invoke "echo $SHELL" on the remote node.

This usually indicates an authentication problem with the remote
agent, or some other configuration type of error in your .cshrc or
.profile file. The following is a list of items that you may wish to
check on the remote node:

        - You have an account and can login to the remote machine
        - Incorrect permissions on your home directory (should
          probably be 0755)
        - Incorrect permissions on your $HOME/.rhosts file (if you are
          using rsh -- they should probably be 0644)
        - You have an entry in the remote $HOME/.rhosts file (if you
          are using rsh) for the machine and username that you are
          running from
        - Your .cshrc/.profile must not print anything out to the
          standard error
        - Your .cshrc/.profile should set a correct TERM type
        - Your .cshrc/.profile should set the SHELL environment
          variable to your default shell

Try invoking the following command at the unix command line:

        /home/mahmoud/lam-7.0.6/share/ssi/boot/rsh/ssh -x condmat10.ctpm.aeoi.org -n echo $SHELL

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<28223> ssi:boot:base:linear: Failed to boot n1 (condmat10.ctpm.aeoi.org)
n-1<28223> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
n-1<28228> ssi:boot:base:linear: booting n0 (condmat1.ctpm.aeoi.org)
n-1<28228> ssi:boot:base:linear: booting n1 (condmat10.ctpm.aeoi.org)
-----------------------------------------------------------------------------
LAM failed to execute a process on the remote node "condmat10.ctpm.aeoi.org".
LAM was not trying to invoke any LAM-specific commands yet -- we were
simply trying to determine what shell was being used on the remote
host.

LAM tried to use the remote agent command "/home/mahmoud/lam-7.0.6/share/ssi/boot/rsh/ssh"
to invoke "echo $SHELL" on the remote node.

This usually indicates an authentication problem with the remote
agent, or some other configuration type of error in your .cshrc or
.profile file. The following is a list of items that you may wish to
check on the remote node:

        - You have an account and can login to the remote machine
        - Incorrect permissions on your home directory (should
          probably be 0755)
        - Incorrect permissions on your $HOME/.rhosts file (if you are
          using rsh -- they should probably be 0644)
        - You have an entry in the remote $HOME/.rhosts file (if you
          are using rsh) for the machine and username that you are
          running from
        - Your .cshrc/.profile must not print anything out to the
          standard error
        - Your .cshrc/.profile should set a correct TERM type
        - Your .cshrc/.profile should set the SHELL environment
          variable to your default shell

Try invoking the following command at the unix command line:

        /home/mahmoud/lam-7.0.6/share/ssi/boot/rsh/ssh -x condmat10.ctpm.aeoi.org -n echo $SHELL

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<28228> ssi:boot:base:linear: Failed to boot n1 (condmat10.ctpm.aeoi.org)
n-1<28228> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully
[mahmoud@condmat1 ~]$ recon -d hostfile
n-1<28263> ssi:boot: Opening
n-1<28263> ssi:boot: opening module globus
n-1<28263> ssi:boot: initializing module globus
n-1<28263> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<28263> ssi:boot: module not available: globus
n-1<28263> ssi:boot: opening module rsh
n-1<28263> ssi:boot: initializing module rsh
n-1<28263> ssi:boot:rsh: module initializing
n-1<28263> ssi:boot:rsh:agent: /home/mahmoud/lam-7.0.6/share/ssi/boot/rsh/ssh -x
n-1<28263> ssi:boot:rsh:username: <same>
n-1<28263> ssi:boot:rsh:verbose: 1000
n-1<28263> ssi:boot:rsh:algorithm: linear
n-1<28263> ssi:boot:rsh:priority: 10
n-1<28263> ssi:boot: module available: rsh, priority: 10
n-1<28263> ssi:boot: finalizing module globus
n-1<28263> ssi:boot:globus: finalizing
n-1<28263> ssi:boot: closing module globus
n-1<28263> ssi:boot: Selected boot module rsh
n-1<28263> ssi:boot:base: looking for boot schema in following directories:
n-1<28263> ssi:boot:base: <current directory>
n-1<28263> ssi:boot:base: $TROLLIUSHOME/etc
n-1<28263> ssi:boot:base: $LAMHOME/etc
n-1<28263> ssi:boot:base: /usr/local/etc
n-1<28263> ssi:boot:base: looking for boot schema file:
n-1<28263> ssi:boot:base: hostfile
n-1<28263> ssi:boot:base: found boot schema: hostfile
n-1<28263> ssi:boot:rsh: found the following hosts:
n-1<28263> ssi:boot:rsh: n0 condmat1.ctpm.aeoi.org (cpu=2)
n-1<28263> ssi:boot:rsh: n1 condmat10.ctpm.aeoi.org (cpu=2)
n-1<28263> ssi:boot:rsh: resolved hosts:
n-1<28263> ssi:boot:rsh: n0 condmat1.ctpm.aeoi.org --> 192.168.10.1 (origin)
n-1<28263> ssi:boot:rsh: n1 condmat10.ctpm.aeoi.org --> 192.168.10.10
n-1<28263> ssi:boot:rsh: starting RTE procs
n-1<28263> ssi:boot:base:linear: starting
n-1<28263> ssi:boot:base:linear: booting n0 (condmat1.ctpm.aeoi.org)
n-1<28263> ssi:boot:rsh: starting recon on (condmat1.ctpm.aeoi.org)
n-1<28263> ssi:boot:rsh: starting on n0 (condmat1.ctpm.aeoi.org): tkill -N -d
n-1<28263> ssi:boot:rsh: launching locally
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-mahmoud_at_[hidden]/lam-killfile
tkill: removing socket file ...
tkill: removing IO daemon socket file ...
tkill: f_kill = "/tmp/lam-mahmoud_at_[hidden]/lam-killfile"
tkill: nothing to kill: "/tmp/lam-mahmoud_at_[hidden]/lam-killfile"
n-1<28263> ssi:boot:rsh: successfully launched on n0 (condmat1.ctpm.aeoi.org)
n-1<28263> ssi:boot:base:linear: booting n1 (condmat10.ctpm.aeoi.org)
n-1<28263> ssi:boot:rsh: starting recon on (condmat10.ctpm.aeoi.org)
n-1<28263> ssi:boot:rsh: starting on n1 (condmat10.ctpm.aeoi.org): tkill -N -d
n-1<28263> ssi:boot:rsh: launching remotely
n-1<28263> ssi:boot:rsh: attempting to execute "/home/mahmoud/lam-7.0.6/share/ssi/boot/rsh/ssh -x condmat10.ctpm.aeoi.org -n echo $SHELL"
-----------------------------------------------------------------------------
LAM failed to execute a process on the remote node "condmat10.ctpm.aeoi.org".
LAM was not trying to invoke any LAM-specific commands yet -- we were
simply trying to determine what shell was being used on the remote
host.

LAM tried to use the remote agent command "/home/mahmoud/lam-7.0.6/share/ssi/boot/rsh/ssh"
to invoke "echo $SHELL" on the remote node.

This usually indicates an authentication problem with the remote
agent, or some other configuration type of error in your .cshrc or
.profile file. The following is a list of items that you may wish to
check on the remote node:

        - You have an account and can login to the remote machine
        - Incorrect permissions on your home directory (should
          probably be 0755)
        - Incorrect permissions on your $HOME/.rhosts file (if you are
          using rsh -- they should probably be 0644)
        - You have an entry in the remote $HOME/.rhosts file (if you
          are using rsh) for the machine and username that you are
          running from
        - Your .cshrc/.profile must not print anything out to the
          standard error
        - Your .cshrc/.profile should set a correct TERM type
        - Your .cshrc/.profile should set the SHELL environment
          variable to your default shell

Try invoking the following command at the unix command line:

        /home/mahmoud/lam-7.0.6/share/ssi/boot/rsh/ssh -x condmat10.ctpm.aeoi.org -n echo $SHELL

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<28263> ssi:boot:base:linear: Failed to boot n1 (condmat10.ctpm.aeoi.org)
n-1<28263> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
recon was not able to complete successfully. There can be any number
of problems that did not allow recon to work properly. You should use
the "-d" option to recon to get more information about each step that
recon attempts.

Any error message above may present a more detailed description of the
actual problem.

Here is general a list of prerequisites that *must* be fulfilled
before recon can work:

        - Each machine in the hostfile must be reachable and operational.
        - You must have an account on each machine.
        - You must be able to rsh(1) to the machine (permissions
          are typically set in the user's $HOME/.rhosts file).

        *** Sidenote: If you compiled LAM to use a remote shell program
            other than rsh (with the --with-rsh option to ./configure;
            e.g., ssh), or if you set the LAMRSH environment variable
            to an alternate remote shell program, you need to ensure
            that you can execute programs on remote nodes with no
            password. For example:

        unix% ssh -x pinky uptime
        3:09am up 211 day(s), 23:49, 2 users, load average: 0.01, 0.08, 0.10

        - The LAM executables must be locatable on each machine, using
          the shell's search path and possibly the LAMHOME environment
          variable.
        - The shell's start-up script must not print anything on standard
          error. You can take advantage of the fact that rsh(1) will
          start the shell non-interactively. The start-up script (such
          as .profile or .cshrc) can exit early in this case, before
          executing many commands relevant only to interactive sessions
          and likely to generate output.
-----------------------------------------------------------------------------
n-1<28263> ssi:boot:rsh: finalizing
n-1<28263> ssi:boot: Closing
[mahmoud@condmat1 ~]$
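
The last prerequisite in the recon output, that the start-up script may exit early for non-interactive shells, might look like the following in a bash setup. This is a sketch only: is_interactive is a hypothetical helper, and it assumes the login shell on the remote node is bash, whose $- variable contains "i" only in interactive sessions.

```shell
#!/bin/sh
# is_interactive (hypothetical helper): succeed when the shell option
# string passed as $1 (normally "$-") contains the interactive flag "i".
is_interactive() {
    case "$1" in
        *i*) return 0 ;;   # interactive session
        *)   return 1 ;;   # non-interactive, e.g. a command run through ssh
    esac
}

# Near the top of ~/.bashrc, after any PATH settings that LAM needs, one
# could then write the following line (left commented out here so that the
# sketch also runs as a standalone script):
#   is_interactive "$-" || return
```

Placing the guard after the PATH settings keeps /usr/local/bin visible to commands started through the remote agent, while anything that prints output is skipped for non-interactive sessions.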