LAM/MPI General User's Mailing List Archives


From: Swan (swan2925_at_[hidden])
Date: 2005-06-11 12:51:40


Dear Jeff/All,

I didn't wait for your modified copy to fix the env path problem; instead I
modified the source directly and added the -env option when running
globus-job-run. I believe the env path problem mentioned previously has been
fixed. However, another problem has arisen. The following debug output should
show my situation.

[vasptest_at_orlon31 test2]$ cat hosts
orlon31 prefix=/usr/local/lam-7.1.1-org
orlon28 prefix=/usr/local/lam-7.1.1
[vasptest_at_orlon31 test2]$ /usr/local/lam-7.1.1-fai/bin/lamboot -v -d -ssi boot globus hosts
n-1<30205> ssi:boot:open: opening
n-1<30205> ssi:boot:open: looking for boot module named globus
n-1<30205> ssi:boot:open: opening boot module globus
n-1<30205> ssi:boot:open: opened boot module globus
n-1<30205> ssi:boot:select: initializing boot module globus
n-1<30205> ssi:boot:globus: module initializing
n-1<30205> ssi:boot:globus:verbose: 1000
n-1<30205> ssi:boot:globus:priority: 75
n-1<30205> ssi:boot:globus:GLOBUS_LOCATION:
/usr/local/gt321/bin/globus-job-run
n-1<30205> ssi:boot:select: boot module available: globus, priority: 75
n-1<30205> ssi:boot:select: selected boot module globus

LAM 7.1.1/MPI 2 C++ - Indiana University

n-1<30205> ssi:boot:base: looking for boot schema in following directories:
n-1<30205> ssi:boot:base: <current directory>
n-1<30205> ssi:boot:base: $TROLLIUSHOME/etc
n-1<30205> ssi:boot:base: $LAMHOME/etc
n-1<30205> ssi:boot:base: /usr/local/lam-7.1.1-fai/etc
n-1<30205> ssi:boot:base: looking for boot schema file:
n-1<30205> ssi:boot:base: hosts
n-1<30205> ssi:boot:base: found boot schema: hosts
n-1<30205> ssi:boot:globus: found the following hosts:
n-1<30205> ssi:boot:globus: n0 orlon31 (cpu=1)
(prefix=/usr/local/lam-7.1.1-org)
n-1<30205> ssi:boot:globus: n1 orlon28 (cpu=1)
(prefix=/usr/local/lam-7.1.1)
n-1<30205> ssi:boot:globus: resolved hosts:
n-1<30205> ssi:boot:globus: n0 orlon31 --> 137.189.27.88 (origin)
n-1<30205> ssi:boot:globus: n1 orlon28 --> 137.189.27.81
n-1<30205> ssi:boot:globus: starting RTE procs
n-1<30205> ssi:boot:base:linear: starting
n-1<30205> ssi:boot:base:server: opening server TCP socket
n-1<30205> ssi:boot:base:server: opened port 47576
n-1<30205> ssi:boot:base:linear: booting n0 (orlon31)
n-1<30205> ssi:boot:globus: starting lamd on (orlon31)
n-1<30205> ssi:boot:globus: starting on n0 (orlon31):
/usr/local/gt321/bin/globus-job-run -env PATH=`/bin/echo $PATH`
/usr/local/lam-7.1.1-org/bin/hboot -t -c
/usr/local/lam-7.1.1-org/etc/lam-conf.lamd -s -d -v -I "-H 137.189.27.88 -P
47576 -n 0 -o 0" -prefix /usr/local/lam-7.1.1-org
n-1<30205> ssi:boot:globus: launching on n0 (orlon31)
************ argv[0]: n-1<30205> ssi:boot:globus: attempting to execute
"/usr/local/gt321/bin/globus-job-run orlon31 -env PATH=`/bin/echo $PATH`
/usr/local/lam-7.1.1-org/bin/hboot -t -c
/usr/local/lam-7.1.1-org/etc/lam-conf.lamd -s -d -v -I "-H 137.189.27.88 -P
47576 -n 0 -o 0" -prefix /usr/local/lam-7.1.1-org"
n-1<30205> ssi:boot:globus: successfully launched on n0 (orlon31)
n-1<30205> ssi:boot:base:server: expecting connection from finite list
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:

        - There are network filters between the lamboot agent host and
          the remote host such that communication on random TCP ports
          is blocked
        - Network routing from the remote host to the local host isn't
          properly configured (this is uncommon)

You can check these things by watching the output from "lamboot -d".

1. On the command line for hboot, there are two important parameters:
   one is the IP address of where the lamboot agent was invoked, the
   other is the port number that the lamboot agent is expecting the
   newly-booted process to call back on (this will be a random
   integer).

2. Manually login to the remote machine and try to telnet to the port
   indicated on the hboot command line. For example,
       telnet <ipnumber> <portnumber>
   If all goes well, you should get a "Connection refused" error. If
   you get any other kind of error, it could indicate either of the
   two conditions above. Consult with your system/network
   administrator.
-----------------------------------------------------------------------------
n-1<30205> ssi:boot:base:server: failed to connect to remote lamd!
n-1<30205> ssi:boot:base:server: closing server socket
n-1<30205> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully
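
The telnet check suggested in the error text above can also be scripted. The following is a minimal sketch and not part of LAM/MPI; the function name `probe_callback_port` and the 5-second timeout are assumptions chosen for illustration. It would be run on the remote node, against the IP address and port shown on the hboot command line.

```python
import socket


def probe_callback_port(host, port, timeout=5.0):
    """Classify a TCP connection attempt to host:port.

    Returns "open" if something accepted the connection, "refused" if the
    host answered but nothing is listening, and "blocked: <reason>" for
    timeouts, routing failures, and other socket errors.
    """
    try:
        # create_connection performs name resolution and the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The packet reached the host and was actively rejected.
        return "refused"
    except OSError as exc:
        # Timeouts and unreachable-network errors are OSError subclasses.
        return f"blocked: {exc}"
```

For the run above, the call would be `probe_callback_port("137.189.27.88", 47576)`. As the error text says, "refused" is the healthy answer once lamboot has exited; "open" would presumably appear only while lamboot is still waiting for the callback. A "blocked" result points at the filtering or routing causes listed in the message.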

Does anyone have any idea what is going wrong?
I look forward to your reply!

Regards,
Swan
  ----- Original Message -----
  From: Jeff Squyres<mailto:jsquyres_at_[hidden]>
  To: General LAM/MPI mailing list<mailto:lam_at_[hidden]>
  Sent: June 8, 2005, 11:04 PM
  Subject: Re: LAM: lamboot on globus

  It seems that you have no path whatsoever. Right now, hboot will
  complain about this (i.e., exactly the error that you are seeing).
  I'll update hboot to not make this an error, but rather handle this
  situation properly. This will be available in tomorrow's nightly
  tarball (I'll put it both on the trunk and the upcoming 7.1.2 release,
  but won't be cutting a new 7.1.2 beta tarball).

  As an alternative workaround, you might want to see how to set up
  globus-job-run so that it sets a PATH for the launched job.
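
One detail worth checking when passing PATH this way: if the launcher is started with an exec-style call rather than through a shell, backticks and `$PATH` inside an argument are not expanded and reach the program as literal text. Whether LAM's globus boot module goes through a shell is not established in this thread, so the following is only a sketch under that assumption. The helper names `build_job_run_argv` and `echo_arg_without_shell` are hypothetical; `-env` is the globus-job-run option mentioned in this thread.

```python
import os
import subprocess
import sys


def build_job_run_argv(job_run, contact, remote_cmd, path=None):
    """Build an argv list that passes an already-expanded PATH via -env.

    job_run    -- path to globus-job-run (assumed, e.g. under GLOBUS_LOCATION)
    contact    -- the target host or contact string
    remote_cmd -- list of strings: the remote executable and its arguments
    path       -- PATH value to pass; defaults to the caller's own PATH
    """
    if path is None:
        path = os.environ.get("PATH", "")
    # The value is expanded here, so no shell is needed later on.
    return [job_run, contact, "-env", f"PATH={path}", *remote_cmd]


def echo_arg_without_shell(arg):
    """Show what a child process receives when no shell is involved."""
    child = "import sys; print(sys.argv[1])"
    result = subprocess.run(
        [sys.executable, "-c", child, arg],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.rstrip("\n")
```

Calling `echo_arg_without_shell("PATH=`/bin/echo $PATH`")` returns the argument unchanged, backticks included, which is what a remote job would see as its PATH in that situation. Building the argument with `build_job_run_argv` avoids the question by substituting the real value up front.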

  On Jun 7, 2005, at 6:29 PM, Swan wrote:

> Hi Jeff,
>
> I executed /bin/env using globus-job-run as you suggested;
> the job doesn't have any PATH environment variable.
>
> [vasptest_at_orlon31 testing]$ globus-job-run 127.0.0.1 env
> GRAM Job failed because the executable does not exist (error code 5)
> [vasptest_at_orlon31 testing]$ globus-job-run 127.0.0.1 /bin/env
> HOME=/home/vasptest
> LOGNAME=vasptest
> GLOBUS_GRAM_JOB_CONTACT=https://orlon31.itsc.cuhk.edu.hk:34241/6379/1118193672/
> GLOBUS_LOCATION=/usr/local/gt321
> X509_USER_PROXY=/home/vasptest/.globus/job/orlon31.itsc.cuhk.edu.hk/6379.1118193672/x509_up
> GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://orlon31.itsc.cuhk.edu.hk:34242/
> What should I do in order to make it work properly?
> I would be glad to hear from you and look forward to your replies.
>
> Regards,
> Swan, HPC team, Chinese University of Hong Kong
>> ----- Original Message -----
>> From: Jeff Squyres
>> To: General LAM/MPI mailing list
>> Sent: June 8, 2005, 4:55 AM
>> Subject: Re: LAM: lamboot on globus
>>
>> What it looks like is happening is that hboot (an internal LAM command)
>> is failing to find the $PATH environment variable -- which seems pretty
>> odd. When you globus-job-run a command, do you get no PATH at all?
>> E.g., what happens if you "globus-job-run 127.0.0.1 env"?
>>
>>
>> On Jun 6, 2005, at 10:46 PM, Lai Swan wrote:
>>
>> > Dear All,
>> >
>> > I am trying to run lamboot and encountered the following error:
>> >
>> > [vasptest_at_orlon31 testing]$ lamboot -v -ssi boot globus hosts
>> > LAM 7.1.1/MPI 2 C++ - Indiana University
>> > n-1<23931> ssi:boot:base:linear: booting n0 (127.0.0.1)
>> > ERROR: LAM/MPI unexpectedly received the following on stderr:
>> >
>> > -----------------------------------------------------------------------------
>> >
>> > LAM encountered an error when invoking the library call "getenv".
>> > This is an unexpected error; we don't have much additional information
>> > here. Perhaps this Unix error message will help:
>> > Unix errno: 1268
>> > Unknown error 1268
>> >
>> > -----------------------------------------------------------------------------
>> >
>> >
>> > -----------------------------------------------------------------------------
>> >
>> > LAM failed to execute a LAM binary on the remote node "127.0.0.1".
>> > LAM attempted to execute a process on the remote node "127.0.0.1",
>> > but received some output on the standard error.
>> > LAM tried to use the command "/usr/local/gt321/bin/globus-job-run" to
>> > invoke the following command:
>> > /usr/local/gt321/bin/globus-job-run 127.0.0.1
>> > /usr/local/lam-7.1.1/bin/hboot -t -c
>> > /usr/local/lam-7.1.1/etc/lam-conf.lamd -v -I "-H 127.0.0.1 -P 45587 -n
>> > 0 -o 0" -prefix /usr/local/lam-7.1.1
>> > The problem may be because:
>> > - The Globus GRAM client returned some output on the stderr
>> > - You have not done 'grid-proxy-init'. You need to do that before
>> >   LAM can boot as it uses globus-job-run to start the LAM daemons.
>> > - LAM is not able to find binaries in the 'prefix' path you
>> >   specified in the boot hostfile. Check the path, it should point to
>> >   the directory where LAM/MPI is installed on this host.
>> > Try to invoke the command listed above manually at a Unix prompt.
>> > When you can get this command to execute successfully by hand, LAM
>> > will probably be able to function properly.
>> >
>> > -----------------------------------------------------------------------------
>> >
>> > n-1<23931> ssi:boot:base:linear: Failed to boot n0 (127.0.0.1)
>> > n-1<23931> ssi:boot:base:linear: aborted!
>> > lamboot did NOT complete successfully
>> >
>> > What should I do to solve it?
>> > I would be very grateful if I could hear your reply!!
>> >
>> > Regards,
>> > Swan, HPC Team, Chinese University of Hong Kong
>> >
>> >
>> > _______________________________________________
>> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>> >
>>
>> --
>> {+} Jeff Squyres
>> {+} jsquyres_at_[hidden]<mailto:jsquyres_at_[hidden]>
>> {+} http://www.lam-mpi.org/
>>

  --
  {+} Jeff Squyres
  {+} jsquyres_at_[hidden]<mailto:jsquyres_at_[hidden]>
  {+} http://www.lam-mpi.org/
