
LAM/MPI General User's Mailing List Archives


From: Keith Stevens (kstevens_at_[hidden])
Date: 2005-09-01 14:48:08


Hello,

I am running a precompiled LAM/MPI build (compiled with the Intel C
Compiler) on a Rocks 3.3 Linux cluster. Neither recon nor lamboot
completes successfully. The output from the LAM commands is below.

Output from recon:
-------------------------

/opt/lam/intel/bin/recon -d
n-1<11238> ssi:boot:open: opening
n-1<11238> ssi:boot:open: opening boot module globus
n-1<11238> ssi:boot:open: opened boot module globus
n-1<11238> ssi:boot:open: opening boot module rsh
n-1<11238> ssi:boot:open: opened boot module rsh
n-1<11238> ssi:boot:open: opening boot module slurm
n-1<11238> ssi:boot:open: opened boot module slurm
n-1<11238> ssi:boot:select: initializing boot module globus
n-1<11238> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<11238> ssi:boot:select: boot module not available: globus
n-1<11238> ssi:boot:select: initializing boot module rsh
n-1<11238> ssi:boot:rsh: module initializing
n-1<11238> ssi:boot:rsh:agent: ssh -x
n-1<11238> ssi:boot:rsh:username: <same>
n-1<11238> ssi:boot:rsh:verbose: 1000
n-1<11238> ssi:boot:rsh:algorithm: linear
n-1<11238> ssi:boot:rsh:no_n: 0
n-1<11238> ssi:boot:rsh:no_profile: 0
n-1<11238> ssi:boot:rsh:fast: 0
n-1<11238> ssi:boot:rsh:ignore_stderr: 0
n-1<11238> ssi:boot:rsh:priority: 75
n-1<11238> ssi:boot:select: boot module available: rsh, priority: 75
n-1<11238> ssi:boot:select: initializing boot module slurm
n-1<11238> ssi:boot:slurm: not running under SLURM
n-1<11238> ssi:boot:select: boot module not available: slurm
n-1<11238> ssi:boot:select: finalizing boot module globus
n-1<11238> ssi:boot:globus: finalizing
n-1<11238> ssi:boot:select: closing boot module globus
n-1<11238> ssi:boot:select: finalizing boot module slurm
n-1<11238> ssi:boot:slurm: finalizing
n-1<11238> ssi:boot:select: closing boot module slurm
n-1<11238> ssi:boot:select: selected boot module rsh
n-1<11238> ssi:boot:base: looking for boot schema in following directories:
n-1<11238> ssi:boot:base: <current directory>
n-1<11238> ssi:boot:base: $TROLLIUSHOME/etc
n-1<11238> ssi:boot:base: $LAMHOME/etc
n-1<11238> ssi:boot:base: /opt/lam/intel/etc
n-1<11238> ssi:boot:base: looking for boot schema file:
n-1<11238> ssi:boot:base: lam-bhost.def
n-1<11238> ssi:boot:base: found boot schema: /opt/lam/intel/etc/lam-bhost.def
n-1<11238> ssi:boot:rsh: found the following hosts:
n-1<11238> ssi:boot:rsh: n0 localhost (cpu=1)
n-1<11238> ssi:boot:rsh: resolved hosts:
n-1<11238> ssi:boot:rsh: n0 localhost --> 127.0.0.1 (origin)
n-1<11238> ssi:boot:rsh: starting RTE procs
n-1<11238> ssi:boot:base:linear: starting
n-1<11238> ssi:boot:base:linear: booting n0 (localhost)
n-1<11238> ssi:boot:rsh: starting recon on (localhost)
n-1<11238> ssi:boot:rsh: starting on n0 (localhost): tkill -N -d
n-1<11238> ssi:boot:rsh: launching locally
n-1<11238> ssi:boot:base:linear: Failed to boot n0 (localhost)
n-1<11238> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
recon was not able to complete successfully. There can be any number
of problems that did not allow recon to work properly. You should use
the "-d" option to recon to get more information about each step that
recon attempts.

Any error message above may present a more detailed description of the
actual problem.

Here is general a list of prerequisites that *must* be fulfilled
before recon can work:

        - Each machine in the hostfile must be reachable and operational.
        - You must have an account on each machine.
        - You must be able to rsh(1) to the machine (permissions
          are typically set in the user's $HOME/.rhosts file).

        *** Sidenote: If you compiled LAM to use a remote shell program
            other than rsh (with the --with-rsh option to ./configure;
            e.g., ssh), or if you set the LAMRSH environment variable
            to an alternate remote shell program, you need to ensure
            that you can execute programs on remote nodes with no
            password. For example:

        unix% ssh -x pinky uptime
        3:09am up 211 day(s), 23:49, 2 users, load average: 0.01, 0.08, 0.10

        - The LAM executables must be locatable on each machine, using
          the shell's search path and possibly the LAMHOME environment
          variable.
        - The shell's start-up script must not print anything on standard
          error. You can take advantage of the fact that rsh(1) will
          start the shell non-interactively. The start-up script (such
          as .profile or .cshrc) can exit early in this case, before
          executing many commands relevant only to interactive sessions
          and likely to generate output.
-----------------------------------------------------------------------------
n-1<11238> ssi:boot:rsh: finalizing
n-1<11238> ssi:boot: Closing
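
The last two prerequisites in the message above (running programs
remotely with no password, and a shell start-up script that prints
nothing on standard error) can be checked by hand. What follows is only
a sketch, assuming a POSIX shell; the helper name stderr_is_quiet is
made up for illustration and is not part of LAM.

```shell
# Hypothetical helper, not part of LAM: succeeds only if the given
# command writes nothing to standard error.
stderr_is_quiet() {
    # Capture stderr only; stdout is discarded.
    err=$("$@" 2>&1 >/dev/null)
    [ -z "$err" ]
}

# Example use (not run here; needs a reachable host):
#   stderr_is_quiet ssh -x localhost true && echo "start-up is quiet"
```

Running it once per host in the boot schema, with the same "ssh -x"
agent shown in the debug output, would roughly exercise the path that
recon uses for remote nodes.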

 

============================================================================

laminfo output:
--------------------
             LAM/MPI: 7.1.1
              Prefix: /opt/lam/intel
        Architecture: i686-pc-linux-gnu
       Configured by: root
       Configured on: Mon Feb 21 17:48:42 SGT 2005
      Configure host: compute-0-1.local
      Memory manager: ptmalloc2
          C bindings: yes
        C++ bindings: yes
    Fortran bindings: yes
          C compiler: icc
        C++ compiler: icpc
    Fortran compiler: ifort
     Fortran symbols: underscore
         C profiling: yes
       C++ profiling: yes
   Fortran profiling: yes
      C++ exceptions: no
      Thread support: yes
       ROMIO support: yes
        IMPI support: no
       Debug support: no
        Purify clean: no
            SSI boot: globus (API v1.1, Module v0.6)
            SSI boot: rsh (API v1.1, Module v1.1)
            SSI boot: slurm (API v1.1, Module v1.0)
            SSI coll: lam_basic (API v1.1, Module v7.1)
            SSI coll: shmem (API v1.1, Module v1.0)
            SSI coll: smp (API v1.1, Module v1.2)
             SSI rpi: crtcp (API v1.1, Module v1.1)
             SSI rpi: gm (API v1.1, Module v1.2)
             SSI rpi: lamd (API v1.0, Module v7.1)
             SSI rpi: tcp (API v1.0, Module v7.1)
             SSI rpi: sysv (API v1.0, Module v7.1)
             SSI rpi: usysv (API v1.0, Module v7.1)
              SSI cr: self (API v1.0, Module v1.0)

 

============================================================================

lamboot output:
----------------------

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

n-1<11251> ssi:boot:base:linear: booting n0 (compute-0-1)
n-1<11251> ssi:boot:base:linear: Failed to boot n0 (compute-0-1)
n-1<11251> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully

Additional recon output, this time run with the lamboot path as its
argument (I think this one has helpful information):

/opt/lam/intel/bin/recon -d /opt/lam/intel/bin/lamboot
n-1<12909> ssi:boot:open: opening
n-1<12909> ssi:boot:open: opening boot module globus
n-1<12909> ssi:boot:open: opened boot module globus
n-1<12909> ssi:boot:open: opening boot module rsh
n-1<12909> ssi:boot:open: opened boot module rsh
n-1<12909> ssi:boot:open: opening boot module slurm
n-1<12909> ssi:boot:open: opened boot module slurm
n-1<12909> ssi:boot:select: initializing boot module slurm
n-1<12909> ssi:boot:slurm: not running under SLURM
n-1<12909> ssi:boot:select: boot module not available: slurm
n-1<12909> ssi:boot:select: initializing boot module rsh
n-1<12909> ssi:boot:rsh: module initializing
n-1<12909> ssi:boot:rsh:agent: ssh -x
n-1<12909> ssi:boot:rsh:username: <same>
n-1<12909> ssi:boot:rsh:verbose: 1000
n-1<12909> ssi:boot:rsh:algorithm: linear
n-1<12909> ssi:boot:rsh:no_n: 0
n-1<12909> ssi:boot:rsh:no_profile: 0
n-1<12909> ssi:boot:rsh:fast: 0
n-1<12909> ssi:boot:rsh:ignore_stderr: 0
n-1<12909> ssi:boot:rsh:priority: 75
n-1<12909> ssi:boot:select: boot module available: rsh, priority: 75
n-1<12909> ssi:boot:select: initializing boot module globus
n-1<12909> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<12909> ssi:boot:select: boot module not available: globus
n-1<12909> ssi:boot:select: finalizing boot module slurm
n-1<12909> ssi:boot:slurm: finalizing
n-1<12909> ssi:boot:select: closing boot module slurm
n-1<12909> ssi:boot:select: finalizing boot module globus
n-1<12909> ssi:boot:globus: finalizing
n-1<12909> ssi:boot:select: closing boot module globus
n-1<12909> ssi:boot:select: selected boot module rsh
n-1<12909> ssi:boot:base: looking for boot schema in following directories:
n-1<12909> ssi:boot:base: <current directory>
n-1<12909> ssi:boot:base: $TROLLIUSHOME/etc
n-1<12909> ssi:boot:base: $LAMHOME/etc
n-1<12909> ssi:boot:base: /opt/lam/intel/etc
n-1<12909> ssi:boot:base: looking for boot schema file:
n-1<12909> ssi:boot:base: /opt/lam/intel/bin/lamboot
n-1<12909> ssi:boot:base: found boot schema: /opt/lam/intel/bin/lamboot
-----------------------------------------------------------------------------
A key-value pair was found where a hostname was expected on line 25 of
the hostfile.

The first word on each hostfile line must be a hostname. The line
will be ignored.

The offending line was:

        ]üÉÃUå=0¹
-----------------------------------------------------------------------------
n-1<12909> ssi:boot:rsh: found the following hosts:
n-1<12909> ssi:boot:rsh: n0 ELF (cpu=1)
n-1<12909> ssi:boot:rsh: n1 (cpu=4)
n-1<12909> ssi:boot:rsh: n2 nal (cpu=1)
n-1<12909> ssi:boot:rsh: n3 .L (cpu=1)
n-1<12909> ssi:boot:rsh: n4 ; (cpu=1)
n-1<12909> ssi:boot:rsh: n5h` (cpu=1)
n-1<12909> ssi:boot:rsh: n6 hà (cpu=1)
WPÿ5<è>üÿÿÄP1<12909> ssi:boot:rsh: n7
Phÿ5<ènüÿÿ3ÀPjPhÿ5<èXüÿÿÄl3ÀPjPhÿ5<è?üÿÿ3ÀPjPhÿ5<è)üÿÿ3ÀPPPhÿ5<èüÿÿÿ5<èÉüÿÿÿ
5<èÎýÿÿ

Àt$j (cpu=1)
n-1<12909> ssi:boot:rsh: n8 èÜûÿÿ3ÀÄ,[^_å]Ãhÿ5<èpüÿÿÄ
Àtjh´èýúÿÿhÿ5<èJüÿÿÄ
ÀtjhÄè×úÿÿhÿ5<è$üÿÿÄ
Àu3öë3ÀPPhÿ5<èÅùÿÿðÄhÿ5<èðûÿÿÄ
Àu3Àë3ÒRRhÿ5<èùÿÿÄj (cpu=1)
n-1<12909> ssi:boot:rsh: n9 è.úÿÿÄ¡8¹
ÀtPè\øÿÿYèùÿÿj (cpu=1)
n-1<12909> ssi:boot:rsh: n10 jè÷ÿÿøÿ
                                       ½ (cpu=1)
n-1<12909> ssi:boot:rsh: n11 Æ0Â;Ñ|ê;Ñ|j (cpu=1)
n-1<12909> ssi:boot:rsh: n12 èÿÿSèZ÷ÿÿáõÿÿè@ (cpu=1)
n-1<12909> ssi:boot:rsh: n13 * (cpu=1)
n-1<12909> ssi:boot:rsh: n14 IUüëDúu&UÜ÷ (cpu=1)
n-1<12909> ssi:boot:rsh: n15 ZÂp (cpu=1)
n-1<12909> ssi:boot:rsh: n16 ®\$D$È@D$®T$ (cpu=1)
n-1<12909> ssi: (cpu=1) n1. þ
n-1<12909> ssi:boot:rsh: n18 ym (cpu=1)
n-1<12909> ssi:boot:rsh: n19( (cpu=1)
n-1<12909> ssi:boot:rsh: n20 § (cpu=1)
n-1<12909> ssi:boot:rsh: n21 (cpu=1)
n-1<12909> ssi:boot:rsh: n22 _dtors_aux (cpu=1)
n-1<12909> ssi:boot:rsh: n23 _args (cpu=1)
n-1<12909> ssi:boot:rsh: n24 __intel_proc_init_B.A (cpu=1)
n-1<12909> ssi:boot:rsh: n25 ram_set_int (cpu=1)
n-1<12909> ssi:boot:rsh: n26 rt (cpu=1)
n-1<12909> ssi:boot:rsh: n27 _select (cpu=1)
n-1<12909> ssi:boot:rsh: n28 gv_dup (cpu=1)
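
The error message above says the first word on each hostfile line must
be a hostname, and the host list that follows (n0 ELF, then symbol
names such as _dtors_aux) suggests the file being parsed was not plain
text. A rough pre-check along those lines is sketched here. The name
check_hostfile is hypothetical, and the hostname pattern is a
simplifying assumption, not LAM's actual parser.

```shell
# Hypothetical pre-check, not part of LAM: flag hostfile lines whose
# first word does not look like a hostname.
check_hostfile() {
    awk '
        /^[ \t]*(#|$)/ { next }                  # skip comments and blank lines
        $1 !~ /^[A-Za-z0-9][A-Za-z0-9.-]*$/ {    # assumed hostname shape
            printf "line %d: bad hostname: %s\n", NR, $1
            bad = 1
        }
        END { exit bad + 0 }                     # non-zero if any line was flagged
    ' "$1"
}
```

A file containing only "localhost", as the first recon run reports for
lam-bhost.def, passes this check; a line that starts with a key-value
pair such as "cpu=4" is reported and the function returns non-zero.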

 

Thanks,

Keith