LAM/MPI General User's Mailing List Archives

From: Vishal Sahay (vsahay_at_[hidden])
Date: 2004-04-06 13:23:16


A few things:

- I don't see /home/ik20/lam/bin (I guess this is your LAM installation
  dir) in your PATH

- If you *append* your local LAM dir to the PATH in the current shell
  and do not export it (for sh/bash), then the subshell (after the fork
  for lamd) will pick up the old PATH (which does not contain your new LAM)

Make sure your path is getting propagated properly.
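
For sh/bash, the two points above could be handled roughly like this. It is only a sketch: it assumes the new LAM lives under $HOME/lam (which I am guessing is /home/ik20/lam), and LAM_PREFIX is just a name picked for the example. It prepends rather than appends so the new binaries are found before the ones in /usr/bin, and it uses $HOME rather than "~" so the full directory name ends up in PATH.

```shell
# Assumption: the new LAM was installed under $HOME/lam.
# LAM_PREFIX is a hypothetical variable name used only in this sketch.
LAM_PREFIX="$HOME/lam"

# Prepend (not append) so this directory is searched before /usr/bin,
# then export so that child processes inherit the new value.
PATH="$LAM_PREFIX/bin:$PATH"
export PATH

# Show the first few PATH entries as seen by a child shell.
sh -c 'echo "$PATH"' | tr ':' '\n' | head -n 3

# Show which lamd and hboot a child shell would pick up, if any.
sh -c 'command -v lamd' || echo "lamd not found in PATH"
sh -c 'command -v hboot' || echo "hboot not found in PATH"
```

If the child shell still reports /usr/bin/lamd, the PATH change is not reaching it; putting the same two PATH lines in your shell startup file is the usual way to make it stick.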

-Vishal

On Tue, 6 Apr 2004, I Kozin wrote:

# The first thing I tried was
#
# $ lamboot -v hostfile
# where hostfile is
# computer_name cpu=4
#
# It seems that omitting hostfile implies localhost, which is fine.
#
# $ echo $PATH
# .:~/bin:~/lam/bin:/opt/intel_cc_80/bin:/opt/intel_fc_80/bin:
# /usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin
#
# $ lamboot -d hostfile
# n0<25967> ssi:boot: Opening
# n0<25967> ssi:boot: opening module globus
# n0<25967> ssi:boot: initializing module globus
# n0<25967> ssi:boot:globus: globus-job-run not found, globus boot will not run
# n0<25967> ssi:boot: module not available: globus
# n0<25967> ssi:boot: opening module rsh
# n0<25967> ssi:boot: initializing module rsh
# n0<25967> ssi:boot:rsh: module initializing
# n0<25967> ssi:boot:rsh:agent: rsh
# n0<25967> ssi:boot:rsh:username: <same>
# n0<25967> ssi:boot:rsh:verbose: 1000
# n0<25967> ssi:boot:rsh:algorithm: linear
# n0<25967> ssi:boot:rsh:priority: 10
# n0<25967> ssi:boot: module available: rsh, priority: 10
# n0<25967> ssi:boot: finalizing module globus
# n0<25967> ssi:boot:globus: finalizing
# n0<25967> ssi:boot: closing module globus
# n0<25967> ssi:boot: Selected boot module rsh
#
# LAM 7.0.4/MPI 2 C++/ROMIO - Indiana University
#
# n0<25967> ssi:boot:base: looking for boot schema in following directories:
# n0<25967> ssi:boot:base: <current directory>
# n0<25967> ssi:boot:base: $TROLLIUSHOME/etc
# n0<25967> ssi:boot:base: $LAMHOME/etc
# n0<25967> ssi:boot:base: /home/ik20/lam/etc
# n0<25967> ssi:boot:base: looking for boot schema file:
# n0<25967> ssi:boot:base: hostfile
# n0<25967> ssi:boot:base: found boot schema: hostfile
# n0<25967> ssi:boot:rsh: found the following hosts:
# n0<25967> ssi:boot:rsh: n0 tca1 (cpu=4)
# n0<25967> ssi:boot:rsh: resolved hosts:
# n0<25967> ssi:boot:rsh: n0 tca1 --> 193.62.112.34 (origin)
# n0<25967> ssi:boot:rsh: starting RTE procs
# n0<25967> ssi:boot:base:linear: starting
# n0<25967> ssi:boot:base:server: opening server TCP socket
# n0<25967> ssi:boot:base:server: opened port 1330
# n0<25967> ssi:boot:base:linear: booting n0 (tca1)
# n0<25967> ssi:boot:rsh: starting lamd on (tca1)
# n0<25967> ssi:boot:rsh: starting on n0 (tca1): hboot -t -c lam-conf.lamd -d -I -H 193.62.112.34 -P 1330 -n 0 -o 0
# n0<25967> ssi:boot:rsh: launching locally
# hboot: process schema = "lam-conf.lamd"
# hboot: found /usr/bin/lamd
# hboot: performing tkill
# hboot: tkill
# hboot: booting...
# hboot: fork /usr/bin/lamd
# [1] 25970 lamd -H 193.62.112.34 -P 1330 -n 0 -o 0 -d
# hboot: attempting to execute
# n0<25967> ssi:boot:rsh: successfully launched on n0 (tca1)
# n0<25967> ssi:boot:base:server: expecting connection from finite list
# n0<25967> ssi:boot:base:server: got connection from 193.62.112.34
# n0<25967> ssi:boot:base:server: this connection is expected (n0)
# -----------------------------------------------------------------------------
# The lamboot agent failed to read a message over a socket from the
# newly-booted process. This should not happen (especially since TCP is
# a guaranteed protocol).
#
# Please check your network connectivity and ensure that messages can be
# passed reliably over TCP. Additionally, ensure that the host where
# the newly-booted process was launched is healthy and still available
# on the network.
# -----------------------------------------------------------------------------
# n0<25967> ssi:boot:base:server: failed to connect to remote lamd!
# n0<25967> ssi:boot:base:server: closing server socket
# n0<25967> ssi:boot:base:linear: aborted!
# -----------------------------------------------------------------------------
# lamboot encountered some error (see above) during the boot process,
# and will now attempt to kill all nodes that it was previously able to
# boot (if any).
#
# Please wait for LAM to finish; if you interrupt this process, you may
# have LAM daemons still running on remote nodes.
# -----------------------------------------------------------------------------
# lamboot: wipe -- nothing to do
# lamboot did NOT complete successfully
#
#
# >
# > Can you send across the following:
# >
# > - The command you invoke for lamboot - how many nodes you are booting on?
# > It seems you are just booting on the current node with "lamboot -d" w/o
# > any hostfile. Just wanted to confirm this.
# >
# > - The complete output of "lamboot -d"
# >
# > - The value of your path environment variable
# >
# > -Vishal
# >
# > On Tue, 6 Apr 2004, I Kozin wrote:
# >
# > #
# > # Hello,
# > #
# > # here is the problem:
# > # we've got a 4 processor Intel Itanium2 box and want to
# > # use LAM (shared memory environment only).
# > #
# > # There is already LAM 6.5 installed but it has been created
# > # using gcc (v2.95) and I can not link a code compiled using
# > # Intel Fortran 8.0 with the existing LAM (MPI function
# > # names are not resolved).
# > #
# > # This is a known problem according to LAM FAQ
# > # and the solution is to rebuild LAM. OK, I downloaded
# > # LAM 7.0.4 and compiled it. Now, I don't want to remove
# > # the old LAM because it might be useful if someone wants
# > # to use gcc. Instead I decided to install LAM locally
# > # in my home directory. I appended the PATH variable
# > # so that the new path to LAM overrides the old one.
# > # I also pointed LAMHOME to the local dir (just in case).
# > #
# > # While I could not see any problems during make and
# > # install, when I run lamboot it returns an error,
# > # although laminfo points to the local dir.
# > #
# > # "lamboot -d" shows
# > # ...
# > # hboot: found /usr/bin/lamd
# > #
# > # which it should not. ["which lamd" points to my local dir as well]
# > #
# > # and after that
# > #
# > # hboot: performing tkill
# > # hboot: tkill
# > # hboot: booting...
# > # hboot: fork /usr/bin/lamd
# > # [1] 25211 lamd -H 127.0.0.1 -P 1324 -n 0 -o 0 -d
# > # hboot: attempting to execute
# > # n0<25208> ssi:boot:rsh: successfully launched on n0 (localhost)
# > # n0<25208> ssi:boot:base:server: expecting connection from finite list
# > # n0<25208> ssi:boot:base:server: got connection from 127.0.0.1
# > # n0<25208> ssi:boot:base:server: this connection is expected (n0)
# > # -----------------------------------------------------------------------------
# > # The lamboot agent failed to read a message over a socket from the
# > # newly-booted process. This should not happen (especially since TCP is
# > # a guaranteed protocol).
# > #
# > # Please check your network connectivity and ensure that messages can be
# > # passed reliably over TCP. Additionally, ensure that the host where
# > # the newly-booted process was launched is healthy and still available
# > # on the network.
# > # -----------------------------------------------------------------------------
# > # n0<25208> ssi:boot:base:server: failed to connect to remote lamd!
# > # n0<25208> ssi:boot:base:server: closing server socket
# > # n0<25208> ssi:boot:base:linear: aborted!
# > #
# > # what is going on?
# > # Your help is greatly appreciated!
# > #
# > # Igor
# > #
# > # config.log, make.log and make-install.log can be sent on request.
# > # _______________________________________________
# > # This list is archived at http://www.lam-mpi.org/MailArchives/lam/
# > #