LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-01-19 13:00:53


I have a suspicion as to what's going on -- I *think* that hboot is
unable to find your PATH on the remote node and is therefore throwing
that error (not very descriptive, is it? :-( ).

Can you verify this?

Edit tools/hboot/hboot.c -- there's a section that looks like this
(just search for "getenv"):

-----
           if ((path_env = getenv("PATH")) == NULL) {
               show_help(NULL, "lib-call-fail", "getenv", NULL);
               exit(errno);
           }
-----

Add the statements

     fprintf(stderr, "yes, this is the one\n"); fflush(stderr);

immediately before the show_help() line, rebuild and reinstall hboot,
and re-run lamboot. If "yes, this is the one" appears in the error
output, that confirms what is going on, and we can figure out where to
go from there.
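
For reference, here is a minimal standalone sketch of that kind of check.
It is not the actual hboot.c code; the helper name check_env and its
return convention are made up for illustration. One thing worth keeping
in mind: getenv() is not documented to set errno when a variable is
missing, so an errno value such as the 1268 in the output quoted below
may simply be left over from some earlier call. The sketch therefore
returns a fixed failure code instead of errno.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper for illustration only; it is not part of LAM.
 * Looks up an environment variable and reports a missing one loudly.
 * Returns EXIT_SUCCESS and stores the value in *value_out when the
 * variable is set, EXIT_FAILURE otherwise. */
static int check_env(const char *name, const char **value_out)
{
    const char *value;

    assert(name != NULL && value_out != NULL);

    value = getenv(name);
    if (value == NULL) {
        /* Debugging marker, flushed so it is not lost if we exit soon. */
        fprintf(stderr, "yes, this is the one: %s is not set\n", name);
        fflush(stderr);
        /* getenv() is not documented to set errno, so report a fixed
         * failure code rather than whatever errno currently holds. */
        return EXIT_FAILURE;
    }
    *value_out = value;
    return EXIT_SUCCESS;
}
```

A caller would use it roughly as
"if (check_env("PATH", &path_env) != EXIT_SUCCESS) exit(EXIT_FAILURE);",
which makes a missing PATH show up as an explicit message rather than an
unrelated errno.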

On Jan 19, 2005, at 10:02 AM, redirecting decoy wrote:

> Hi,
>
> Here is the output of the following command:
>
> "lamboot -v -d -x -ssi boot globus machines.globus"
>
> #######################################################
> n-1<21208> ssi:boot:open: opening
> n-1<21208> ssi:boot:open: looking for boot module
> named globus
> n-1<21208> ssi:boot:open: opening boot module globus
> n-1<21208> ssi:boot:open: opened boot module globus
> n-1<21208> ssi:boot:select: initializing boot module
> globus
> n-1<21208> ssi:boot:globus: module initializing
> n-1<21208> ssi:boot:globus:verbose: 1000
> n-1<21208> ssi:boot:globus:priority: 3
> n-1<21208> ssi:boot:globus:GLOBUS_LOCATION:
> /usr/local/globus/bin/globus-job-run
> n-1<21208> ssi:boot:select: boot module available:
> globus, priority: 3
> n-1<21208> ssi:boot:select: selected boot module
> globus
>
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
> n-1<21208> ssi:boot:base: looking for boot schema in
> following directories:
> n-1<21208> ssi:boot:base: <current directory>
> n-1<21208> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<21208> ssi:boot:base: $LAMHOME/etc
> n-1<21208> ssi:boot:base: /usr/local/lam/etc
> n-1<21208> ssi:boot:base: looking for boot schema
> file:
> n-1<21208> ssi:boot:base: machines.globus
> n-1<21208> ssi:boot:base: found boot schema:
> machines.globus
> n-1<21208> ssi:boot:globus: found the following hosts:
> n-1<21208> ssi:boot:globus: n0 nest.public (cpu=4)
> (prefix=/usr/local/lam)
> n-1<21208> ssi:boot:globus: n1 pa-wb-001.public
> (cpu=2) (prefix=/usr/local/lam)
> n-1<21208> ssi:boot:globus: n2 pa-wb-002.public
> (cpu=2) (prefix=/usr/local/lam)
> n-1<21208> ssi:boot:globus: resolved hosts:
> n-1<21208> ssi:boot:globus: n0 nest.public -->
> 192.168.10.100 (origin)
> n-1<21208> ssi:boot:globus: n1 pa-wb-001.public -->
> 192.168.10.101
> n-1<21208> ssi:boot:globus: n2 pa-wb-002.public -->
> 192.168.10.102
> n-1<21208> ssi:boot:globus: starting RTE procs
> n-1<21208> ssi:boot:base:linear: starting
> n-1<21208> ssi:boot:base:server: opening server TCP
> socket
> n-1<21208> ssi:boot:base:server: opened port 33816
> n-1<21208> ssi:boot:base:linear: booting n0
> (nest.public)
> n-1<21208> ssi:boot:globus: starting lamd on
> (nest.public)
> n-1<21208> ssi:boot:globus: starting on n0
> (nest.public): /usr/local/globus/bin/globus-job-run
> /usr/local/lam/bin/hboot -t -c
> /usr/local/lam/etc/lam-conf.lamd -d -v -I "-x -H
> 192.168.10.100 -P 33816 -n 0 -o 0" -prefix
> /usr/local/lam
> n-1<21208> ssi:boot:globus: launching on n0
> (nest.public)
> n-1<21208> ssi:boot:globus: attempting to execute
> "/usr/local/globus/bin/globus-job-run nest.public
> /usr/local/lam/bin/hboot -t -c
> /usr/local/lam/etc/lam-conf.lamd -d -v -I "-x -H
> 192.168.10.100 -P 33816 -n 0 -o 0" -prefix
> /usr/local/lam"
> ERROR: LAM/MPI unexpectedly received the following on
> stderr:
> -----------------------------------------------------------------------
> ------
> LAM encountered an error when invoking the library
> call "getenv".
>
> This is an unexpected error; we don't have much
> additional information
> here. Perhaps this Unix error message will help:
>
> Unix errno: 1268
> Unknown error 1268
>
> -----------------------------------------------------------------------
> ------
> -----------------------------------------------------------------------
> ------
> LAM failed to execute a LAM binary on the remote node
> "nest.public".
> LAM attempted to execute a process on the remote node
> "nest.public",
> but received some output on the standard error.
>
> LAM tried to use the command
> "/usr/local/globus/bin/globus-job-run" to invoke the
> following command:
>
> /usr/local/globus/bin/globus-job-run
> nest.public /usr/local/lam/bin/hboot -t -c
> /usr/local/lam/etc/lam-conf.lamd -d -v -I "-x -H
> 192.168.10.100 -P 33816 -n 0 -o 0" -prefix
> /usr/local/lam
>
> The problem may be because:
>
> - The Globus GRAM client returned some output on
> the stderr
>
> - You have not done 'grid-proxy-init'. You need
> to do that before
> LAM can boot as it uses globus-job-run to start
> the LAM daemons.
>
> - LAM is not able to find binaries in the
> 'prefix' path you
> specified in the boot hostfile. Check the path,
> it should point to
> the directory where LAM/MPI is installed on
> this host.
>
> Try to invoke the command listed above manually at a
> Unix prompt.
> When you can get this command to execute successfully
> by hand, LAM
> will probably be able to function properly.
> -----------------------------------------------------------------------
> ------
> n-1<21208> ssi:boot:base:linear: Failed to boot n0
> (nest.public)
> n-1<21208> ssi:boot:base:server: closing server socket
> n-1<21208> ssi:boot:base:linear: aborted!
> lamboot did NOT complete successfully
> #######################################################
>
> I compiled the version of lam using the following
> config options:
>
> "./configure --prefix=/usr/local/lam --with-rsh=ssh -x
> --with-mpi-stubs --with-threads=posix
> --with-lamb-hb=60 --enable-shared --with-modules"
>
>
> Now, as far as lamgrow is concerned, I can boot lam
> using the following command:
> "lamboot -v -d -x machines.globus"
>
> This boots fine. However, when I try using lamgrow,
> it does one of the following:
> 1) Hangs for a really long time (I'm always forced to
> kill it)
> 2) Works
> 3) Seg Faults
>
>
> So first I boot up lam (without globus), then if
> successful, I try to add machines using lamgrow. That
> is as far as I've gone with lam 7.1.1
>
> Hope this helps...
>
> -R.D.
>
>
>
> --- Jeff Squyres <jsquyres_at_[hidden]> wrote:
>
>> Apologies for taking so long to answer this. :-(
>>
>> Arf -- if you're right, we've got two problems in
>> 7.1.1:
>>
>> - globus boot SSI not working. Can you post the
>> full output of
>> "lamboot -d" with the 7.1.1 installation?
>>
>> - lamgrow failing. Can you post a repeatable
>> sequence of events that
>> causes lamgrow to seg fault? That would be most
>> helpful in helping us
>> identify the exact bug.
>>
>> Thanks!
>>
>>
>> On Jan 18, 2005, at 4:30 PM, redirecting decoy
>> wrote:
>>
>>> Well, for testing purposes I've uninstalled lam 7.1.1
>>> and installed Lam 7.0.6. Now everything works fine.
>>> So logic tells me that my globus problem must be a bug
>>> with Lam 7.1.1. It may be something with my system
>>> though. I don't know. Has anyone else made lam 7.1.1
>>> work with Globus? Also, with 7.1.1 I am getting
>>> segmentation faults when attempting to use lamgrow.
>>> It works the first few times, then it starts to seg
>>> fault.
>>>
>>> Anyone else have similar experiences ?
>>>
>>> -R.D.
>>>
>>> --- redirecting decoy <redirectingdecoy_at_[hidden]>
>>> wrote:
>>>
>>>> Still having problems with this. I don't think the
>>>> problem is caused by globus. I have checked my globus
>>>> install for correctness and did not see any problems.
>>>> Anyone have any ideas on this one?
>>>>
>>>> -R.D.
>>>>
>>>>
>>>> --- redirecting decoy
>> <redirectingdecoy_at_[hidden]>
>>>> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I am having trouble booting lam using globus. I am
>>>>> using lam 7.1.1 and the latest globus installed on 3
>>>>> different machines. I try to boot lam using the
>>>>> following command:
>>>>>
>>>>> lamboot -v -d -x -ssi boot globus machines.globus
>>>>>
>>>>> machines.globus looks like this:
>>>>>
>>>>> ###################################################
>>>>> nest.public prefix=/usr/local/lam cpu=4 schedule=yes
>>>>> pa-wb-001.public prefix=/usr/local/lam cpu=2 schedule=yes
>>>>> pa-wb-002.public prefix=/usr/local/lam cpu=2 schedule=yes
>>>>> ###################################################
>>>>>
>>>>> When I do that I get a weird error message. Lamboot
>>>>> tells me it was trying to run the command:
>>>>>
>>>>> /usr/local/globus/bin/globus-job-run nest.public
>>>>> /usr/local/lam/bin/hboot -t -c
>>>>> /usr/local/lam/etc/lam-conf.lamd -d -v -I "-x -H
>>>>> 192.168.10.100 -P 44621 -n 0 -o 0" -prefix
>>>>> /usr/local/lam
>>>>>
>>>>> which, when I try it, gives me the following error:
>>>>>
>>>>> -----------------------------------------------------------------------------
>>>>> LAM encountered an error when invoking the library
>>>>> call "getenv".
>>>>>
>>>>> This is an unexpected error; we don't have much
>>>>> additional information
>>>>> here. Perhaps this Unix error message will help:
>>>>>
>>>>> Unix errno: 1268
>>>>> Unknown error 1268
>>>>>
>>>>> -----------------------------------------------------------------------------
>>>>>
>>>>> Does anyone know what is causing this, and how to
>>>>> make the error go away?
>>>>>
>>>>> Any help would be much appreciated.
>>>>> Thanks much,
>>>>>
>>>>> -RD
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> This list is archived at
>>>>> http://www.lam-mpi.org/MailArchives/lam/
>>>>>
>>>
>>
>> --
>> {+} Jeff Squyres
>> {+} jsquyres_at_[hidden]
>> {+} http://www.lam-mpi.org/
>>
>

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/