>) BEFORE
POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.
As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:
- There are network filters between the lamboot agent host and
the remote host such that communication on random TCP ports
is blocked
- Network routing from the remote host to the local host isn't
properly configured (this is uncommon)
You can check these things by watching the output from "lamboot -d".
1. On the command line for hboot, there are two important parameters:
one is the IP address of where the lamboot agent was invoked, the
other is the port number that the lamboot agent is expecting the
newly-booted process to call back on (this will be a random
integer).
2. Manually login to the remote machine and try to telnet to the port
indicated on the hboot command line. For example,
telnet <ipnumber> <portnumber>
If all goes well, you should get a "Connection refused" error. If
you get any other kind of error, it could indicate either of the
two conditions above. Consult with your system/network
administrator.
-----------------------------------------------------------------------------
n-1<30205> ssi:boot:base:server: failed to connect to remote lamd!
n-1<30205> ssi:boot:base:server: closing server socket
n-1<30205> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully
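For reference, the check described in steps 1 and 2 of the message above can also be scripted. This is only a minimal sketch in Python, not anything that ships with LAM. The names parse_callback_target and probe_callback_port are made up for illustration. It assumes that the -H and -P values inside the quoted -I argument of the hboot command line are the lamboot agent's IP address and callback port (inferred from the values shown, not from LAM documentation). The sample command line is the one quoted elsewhere in this thread.

```python
import shlex
import socket


def parse_callback_target(hboot_cmdline):
    """Extract (host, port) from the quoted -I argument of an hboot command line.

    Assumption: -H carries the lamboot agent's IP address and -P the
    callback port; this is inferred from the values in the command line.
    """
    tokens = shlex.split(hboot_cmdline)
    inner = shlex.split(tokens[tokens.index("-I") + 1])
    host = inner[inner.index("-H") + 1]
    port = int(inner[inner.index("-P") + 1])
    return host, port


def probe_callback_port(host, port, timeout=5.0):
    """Try one TCP connection and classify the result like the telnet test."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"    # a process is listening on the port
    except ConnectionRefusedError:
        return "refused"          # host reachable, nothing listening
    except socket.timeout:
        return "timeout"          # no answer at all, e.g. packets dropped
    except OSError as exc:
        return "error: %s" % exc  # e.g. no route to host


# hboot command line as quoted in this thread
HBOOT_CMDLINE = (
    '/usr/local/lam-7.1.1/bin/hboot -t -c '
    '/usr/local/lam-7.1.1/etc/lam-conf.lamd -v '
    '-I "-H 127.0.0.1 -P 45587 -n 0 -o 0" -prefix /usr/local/lam-7.1.1'
)
```

Running probe_callback_port(*parse_callback_target(HBOOT_CMDLINE)) on the remote machine should return "refused" in the case the LAM message describes as the good one. A "timeout" or another error would point toward the filtering or routing problems listed above.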
Does anyone have any ideas about what is going wrong?
I am looking forward to your favorable reply!!
Regards,
Swan
----- Original Message -----
From: Jeff Squyres <jsquyres_at_[hidden]>
To: General LAM/MPI mailing list <lam_at_[hidden]>
Sent: June 8, 2005, 11:04 PM
Subject: Re: LAM: lamboot on globus
It seems that you have no PATH set at all. Right now, hboot will
complain about this (i.e., exactly the error that you are seeing).
I'll update hboot to not make this an error, but rather handle this
situation properly. This will be available in tomorrow's nightly
tarball (I'll put it both on the trunk and the upcoming 7.1.2 release,
but won't be cutting a new 7.1.2 beta tarball).
As an alternate workaround, you might want to see how to set up
globus-job-run so that it sets a PATH for the launched job.
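To illustrate the kind of change described above for hboot: the idea is to treat a missing PATH as a case to handle rather than as an error. The following is only a minimal Python sketch of that idea, not the LAM source. The function name search_dirs and the fallback value "/usr/bin:/bin" are assumptions made up for the example, and whether hboot falls back to a default at all is also an assumption. The sample environment is an abridged copy of what globus-job-run printed in the quoted message below.

```python
# Assumed fallback search path; what hboot actually does is not stated here.
FALLBACK_PATH = "/usr/bin:/bin"


def search_dirs(environ):
    """Return the directories to search, tolerating an unset PATH."""
    path = environ.get("PATH")   # None when the job was started without PATH
    if not path:
        path = FALLBACK_PATH
    return [d for d in path.split(":") if d]


# Environment reported by "globus-job-run 127.0.0.1 /bin/env" in this thread
# (abridged): note that there is no PATH entry.
globus_env = {
    "HOME": "/home/vasptest",
    "LOGNAME": "vasptest",
    "GLOBUS_LOCATION": "/usr/local/gt321",
}

print(search_dirs(globus_env))   # prints ['/usr/bin', '/bin']
```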
On Jun 7, 2005, at 6:29 PM, Swan wrote:
> Hi Jeff,
>
> I executed /bin/env using globus-job-run as you suggested, and the
> output does not contain a PATH environment variable.
>
> [vasptest_at_orlon31 testing]$ globus-job-run 127.0.0.1 env
> GRAM Job failed because the executable does not exist (error code 5)
> [vasptest_at_orlon31 testing]$ globus-job-run 127.0.0.1 /bin/env
> HOME=/home/vasptest
> LOGNAME=vasptest
> GLOBUS_GRAM_JOB_CONTACT=https://orlon31.itsc.cuhk.edu.hk:34241/6379/1118193672/
> GLOBUS_LOCATION=/usr/local/gt321
> X509_USER_PROXY=/home/vasptest/.globus/job/orlon31.itsc.cuhk.edu.hk/6379.1118193672/x509_up
> GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://orlon31.itsc.cuhk.edu.hk:34242/
> What should I do to make it work properly?
> I would be glad to hear your reply.
>
> Regards,
> Swan, HPC team, Chinese University of Hong Kong
>> ----- Original Message -----
>> From: Jeff Squyres
>> To: General LAM/MPI mailing list
>> Sent: June 8, 2005, 4:55 AM
>> Subject: Re: LAM: lamboot on globus
>>
>> What it looks like is happening is that hboot (an internal LAM command)
>> is failing to find the $PATH environment variable -- which seems pretty
>> odd. When you globus-job-run a command, do you get no PATH at all?
>> E.g., what happens if you "globus-job-run 127.0.0.1 env"?
>>
>>
>> On Jun 6, 2005, at 10:46 PM, Lai Swan wrote:
>>
>> > Dear All,
>> >
>> > I am trying to run lamboot and encountered the following error:
>> >
>> > [vasptest_at_orlon31 testing]$ lamboot -v -ssi boot globus hosts
>> > LAM 7.1.1/MPI 2 C++ - Indiana University
>> > n-1<23931> ssi:boot:base:linear: booting n0 (127.0.0.1)
>> > ERROR: LAM/MPI unexpectedly received the following on stderr:
>> >
>> > -----------------------------------------------------------------------------
>> >
>> > LAM encountered an error when invoking the library call "getenv".
>> > This is an unexpected error; we don't have much additional information
>> > here. Perhaps this Unix error message will help:
>> > Unix errno: 1268
>> > Unknown error 1268
>> >
>> > -----------------------------------------------------------------------------
>> >
>> >
>> > -----------------------------------------------------------------------------
>> >
>> > LAM failed to execute a LAM binary on the remote node "127.0.0.1".
>> > LAM attempted to execute a process on the remote node "127.0.0.1",
>> > but received some output on the standard error.
>> > LAM tried to use the command "/usr/local/gt321/bin/globus-job-run" to
>> > invoke the following command:
>> > /usr/local/gt321/bin/globus-job-run 127.0.0.1 /usr/local/lam-7.1.1/bin/hboot -t -c /usr/local/lam-7.1.1/etc/lam-conf.lamd -v -I "-H 127.0.0.1 -P 45587 -n 0 -o 0" -prefix /usr/local/lam-7.1.1
>> > The problem may be because:
>> > - The Globus GRAM client returned some output on the stderr
>> > - You have not done 'grid-proxy-init'. You need to do that before
>> > LAM can boot as it uses globus-job-run to start the LAM daemons.
>> > - LAM is not able to find binaries in the 'prefix' path you
>> > specified in the boot hostfile. Check the path, it should point to
>> > the directory where LAM/MPI is installed on this host.
>> > Try to invoke the command listed above manually at a Unix prompt.
>> > When you can get this command to execute successfully by hand, LAM
>> > will probably be able to function properly.
>> >
>> > -----------------------------------------------------------------------------
>> >
>> > n-1<23931> ssi:boot:base:linear: Failed to boot n0 (127.0.0.1)
>> > n-1<23931> ssi:boot:base:linear: aborted!
>> > lamboot did NOT complete successfully
>> >
>> > What should I do to solve it?
>> > I would be very grateful if I could hear your reply!!
>> >
>> > Regards,
>> > Swan, HPC Team, Chinese University of Hong Kong
>> >
>> >
>> > _______________________________________________
>> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>> >
>>
>> --
>> {+} Jeff Squyres
>> {+} jsquyres_at_[hidden]
>> {+} http://www.lam-mpi.org/
>>
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/