LAM/MPI General User's Mailing List Archives

From: redirecting decoy (redirectingdecoy_at_[hidden])
Date: 2005-01-19 10:02:23


Hi,

Here is the output of the following command:

"lamboot -v -d -x -ssi boot globus machines.globus"

#######################################################
n-1<21208> ssi:boot:open: opening
n-1<21208> ssi:boot:open: looking for boot module named globus
n-1<21208> ssi:boot:open: opening boot module globus
n-1<21208> ssi:boot:open: opened boot module globus
n-1<21208> ssi:boot:select: initializing boot module globus
n-1<21208> ssi:boot:globus: module initializing
n-1<21208> ssi:boot:globus:verbose: 1000
n-1<21208> ssi:boot:globus:priority: 3
n-1<21208> ssi:boot:globus:GLOBUS_LOCATION: /usr/local/globus/bin/globus-job-run
n-1<21208> ssi:boot:select: boot module available: globus, priority: 3
n-1<21208> ssi:boot:select: selected boot module globus
 
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
 
n-1<21208> ssi:boot:base: looking for boot schema in following directories:
n-1<21208> ssi:boot:base: <current directory>
n-1<21208> ssi:boot:base: $TROLLIUSHOME/etc
n-1<21208> ssi:boot:base: $LAMHOME/etc
n-1<21208> ssi:boot:base: /usr/local/lam/etc
n-1<21208> ssi:boot:base: looking for boot schema file:
n-1<21208> ssi:boot:base: machines.globus
n-1<21208> ssi:boot:base: found boot schema: machines.globus
n-1<21208> ssi:boot:globus: found the following hosts:
n-1<21208> ssi:boot:globus: n0 nest.public (cpu=4) (prefix=/usr/local/lam)
n-1<21208> ssi:boot:globus: n1 pa-wb-001.public (cpu=2) (prefix=/usr/local/lam)
n-1<21208> ssi:boot:globus: n2 pa-wb-002.public (cpu=2) (prefix=/usr/local/lam)
n-1<21208> ssi:boot:globus: resolved hosts:
n-1<21208> ssi:boot:globus: n0 nest.public --> 192.168.10.100 (origin)
n-1<21208> ssi:boot:globus: n1 pa-wb-001.public --> 192.168.10.101
n-1<21208> ssi:boot:globus: n2 pa-wb-002.public --> 192.168.10.102
n-1<21208> ssi:boot:globus: starting RTE procs
n-1<21208> ssi:boot:base:linear: starting
n-1<21208> ssi:boot:base:server: opening server TCP socket
n-1<21208> ssi:boot:base:server: opened port 33816
n-1<21208> ssi:boot:base:linear: booting n0 (nest.public)
n-1<21208> ssi:boot:globus: starting lamd on (nest.public)
n-1<21208> ssi:boot:globus: starting on n0 (nest.public): /usr/local/globus/bin/globus-job-run /usr/local/lam/bin/hboot -t -c /usr/local/lam/etc/lam-conf.lamd -d -v -I "-x -H 192.168.10.100 -P 33816 -n 0 -o 0" -prefix /usr/local/lam
n-1<21208> ssi:boot:globus: launching on n0 (nest.public)
n-1<21208> ssi:boot:globus: attempting to execute "/usr/local/globus/bin/globus-job-run nest.public /usr/local/lam/bin/hboot -t -c /usr/local/lam/etc/lam-conf.lamd -d -v -I "-x -H 192.168.10.100 -P 33816 -n 0 -o 0" -prefix /usr/local/lam"
ERROR: LAM/MPI unexpectedly received the following on stderr:
-----------------------------------------------------------------------------
LAM encountered an error when invoking the library call "getenv".

This is an unexpected error; we don't have much additional information
here. Perhaps this Unix error message will help:

        Unix errno: 1268
        Unknown error 1268

-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
LAM failed to execute a LAM binary on the remote node "nest.public".
LAM attempted to execute a process on the remote node "nest.public",
but received some output on the standard error.

LAM tried to use the command "/usr/local/globus/bin/globus-job-run" to invoke the
following command:

        /usr/local/globus/bin/globus-job-run nest.public /usr/local/lam/bin/hboot -t -c /usr/local/lam/etc/lam-conf.lamd -d -v -I "-x -H 192.168.10.100 -P 33816 -n 0 -o 0" -prefix /usr/local/lam

The problem may be because:

    - The Globus GRAM client returned some output on the stderr

    - You have not done 'grid-proxy-init'. You need to do that before
       LAM can boot as it uses globus-job-run to start the LAM daemons.

    - LAM is not able to find binaries in the 'prefix' path you
       specified in the boot hostfile. Check the path, it should point to
       the directory where LAM/MPI is installed on this host.

Try to invoke the command listed above manually at a Unix prompt.
When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<21208> ssi:boot:base:linear: Failed to boot n0 (nest.public)
n-1<21208> ssi:boot:base:server: closing server socket
n-1<21208> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully
#######################################################
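
One of the causes the error text lists is a wrong 'prefix' path in the boot hostfile. A minimal sketch for ruling that out is below. It assumes the hostfile uses the "host key=value ..." layout quoted later in this thread, that "#" starts a comment, and that hboot lives at <prefix>/bin/hboot as in the command above. The function names are made up for illustration and are not part of LAM.

```python
import os
from typing import Dict, List, Tuple

HostEntry = Tuple[str, Dict[str, str]]


def parse_boot_schema(text: str) -> List[HostEntry]:
    """Parse 'host key=value ...' lines, skipping blanks and '#' comments.

    Assumption: this mirrors the machines.globus layout shown in the thread.
    """
    hosts = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        host, *pairs = line.split()
        opts = dict(p.split("=", 1) for p in pairs if "=" in p)
        hosts.append((host, opts))
    return hosts


def missing_hboot(hosts: List[HostEntry],
                  default_prefix: str = "/usr/local/lam") -> List[Tuple[str, str]]:
    """Return (host, path) pairs whose <prefix>/bin/hboot is not an executable file.

    The check looks at the local filesystem only, so it is meaningful on the
    host itself or when the prefix is on a shared filesystem.
    """
    missing = []
    for host, opts in hosts:
        path = os.path.join(opts.get("prefix", default_prefix), "bin", "hboot")
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append((host, path))
    return missing
```

Running missing_hboot(parse_boot_schema(open("machines.globus").read())) on each node should return an empty list if the prefix is right. It says nothing about the Globus proxy or the getenv error itself.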

I compiled the version of lam using the following
config options:

"./configure --prefix=/usr/local/lam --with-rsh=ssh -x
--with-mpi-stubs --with-threads=posix
--with-lamb-hb=60 --enable-shared --with-modules"

Now, as far as lamgrow is concerned, I can boot lam
using the following command:
"lamboot -v -d -x machines.globus"

This boots fine. However, when I try using lamgrow,
it does one of the following:
1) Hangs for a really long time (I'm always forced to
kill it)
2) Works
3) Seg faults

So first I boot up LAM (without Globus), then, if that
succeeds, I try to add machines using lamgrow. That
is as far as I've gone with LAM 7.1.1.
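
To turn the three outcomes above into something countable, a small harness along these lines might help. It is only a sketch: it runs whatever command list it is given, treats a timeout as a hang, and treats death by SIGSEGV as a seg fault (POSIX only). The names classify_run and repeat are made up for illustration, and the exact lamgrow arguments are left to the caller.

```python
import signal
import subprocess
from collections import Counter
from typing import List


def classify_run(cmd: List[str], timeout: float) -> str:
    """Run cmd once and label the outcome: 'ok', 'hang', 'segfault' or 'exit N'."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "hang"
    if proc.returncode == 0:
        return "ok"
    # On POSIX a negative return code means the child was killed by that signal.
    if proc.returncode == -signal.SIGSEGV:
        return "segfault"
    return f"exit {proc.returncode}"


def repeat(cmd: List[str], runs: int, timeout: float) -> Counter:
    """Run cmd several times and count how often each outcome occurs."""
    return Counter(classify_run(cmd, timeout) for _ in range(runs))
```

For example, repeat(["lamgrow", "pa-wb-001.public"], 10, 120.0) would give a tally of outcomes, assuming that is the lamgrow invocation being tested and that the LAM universe is reset between runs as needed.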

Hope this helps...

-R.D.

--- Jeff Squyres <jsquyres_at_[hidden]> wrote:

> Apologies for taking so long to answer this. :-(
>
> Arf -- if you're right, we've got two problems in 7.1.1:
>
> - globus boot SSI not working. Can you post the full output of
> "lamboot -d" with the 7.1.1 installation?
>
> - lamgrow failing. Can you post a repeatable sequence of events that
> causes lamgrow to seg fault? That would be most helpful in helping us
> identify the exact bug.
>
> Thanks!
>
>
> On Jan 18, 2005, at 4:30 PM, redirecting decoy wrote:
>
> > Well, for testing purposes I've uninstalled LAM 7.1.1
> > and installed LAM 7.0.6. Now everything works fine.
> > So logic tells me that my Globus problem must be a bug
> > in LAM 7.1.1. It may be something with my system,
> > though. I don't know. Has anyone else made LAM 7.1.1
> > work with Globus? Also, with 7.1.1 I am getting
> > segmentation faults when attempting to use lamgrow.
> > It works the first few times, then it starts to seg
> > fault.
> >
> > Anyone else have similar experiences ?
> >
> > -R.D.
> >
> > --- redirecting decoy <redirectingdecoy_at_[hidden]> wrote:
> >
> >> Still having problems with this. The problem does not
> >> appear to be caused by Globus, I think. I have checked
> >> my Globus install for correctness and did not see any
> >> problems. Does anyone have any ideas on this one?
> >>
> >> -R.D.
> >>
> >>
> >> --- redirecting decoy <redirectingdecoy_at_[hidden]> wrote:
> >>
> >>> Hello all,
> >>>
> >>> I am having trouble booting LAM using Globus. I am
> >>> using LAM 7.1.1 and the latest Globus, installed on 3
> >>> different machines. I try to boot LAM using the
> >>> following command:
> >>>
> >>> lamboot -v -d -x -ssi boot globus machines.globus
> >>>
> >>> machines.globus looks like this:
> >>>
> >>> ###################################################
> >>> nest.public prefix=/usr/local/lam cpu=4 schedule=yes
> >>> pa-wb-001.public prefix=/usr/local/lam cpu=2 schedule=yes
> >>> pa-wb-002.public prefix=/usr/local/lam cpu=2 schedule=yes
> >>>
> >>> ################################################
> >>>
> >>> When I do that I get a weird error message. lamboot
> >>> tells me it was trying to run the command:
> >>>
> >>> /usr/local/globus/bin/globus-job-run nest.public
> >>> /usr/local/lam/bin/hboot -t -c
> >>> /usr/local/lam/etc/lam-conf.lamd -d -v -I "-x -H
> >>> 192.168.10.100 -P 44621 -n 0 -o 0" -prefix
> >>> /usr/local/lam
> >>>
> >>> which, when I try it, gives me the following error:
> >>>
> >>> -----------------------------------------------------------------------------
> >>> LAM encountered an error when invoking the library call "getenv".
> >>>
> >>> This is an unexpected error; we don't have much additional information
> >>> here. Perhaps this Unix error message will help:
> >>>
> >>> Unix errno: 1268
> >>> Unknown error 1268
> >>>
> >>> -----------------------------------------------------------------------------
> >>>
> >>> Does anyone know what is causing this, and how to make
> >>> the error go away?
> >>>
> >>> Any help would be much appreciated.
> >>> Thanks much,
> >>>
> >>> -RD
> >>>
> >>>
> >>> _______________________________________________
> >>> This list is archived at
> >>> http://www.lam-mpi.org/MailArchives/lam/
> >>>
>
> --
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/
>
