On Jun 11, 2005, at 1:51 PM, Swan wrote:
> I didn't wait for your modified copy to fix the env path problem, and
> I directly modified the source and added the -env option when running
> globus-job-run. I believe the env path problem previously mentioned has
> been fixed.
Ok. That's a good workaround for you. Unfortunately, it's not good
for the general case because you can't assume that the path is the same
on the remote node as it is on the local node.
> However, another problem did arise. The following debug message should
> show my situation.
>
> [vasptest_at_orlon31 test2]$ cat hosts
> orlon31 prefix=/usr/local/lam-7.1.1-org
> orlon28 prefix=/usr/local/lam-7.1.1
> [vasptest_at_orlon31 test2]$ /usr/local/lam-7.1.1-fai/bin/lamboot -v -d
> -ssi boot globus hosts
> n-1<30205> ssi:boot:open: opening
> [snipped]
> n-1<30205> ssi:boot:globus: starting on n0 (orlon31):
> /usr/local/gt321/bin/globus-job-run -env PATH=`/bin/echo $PATH`
> /usr/local/lam-7.1.1-org/bin/hboot -t -c
> /usr/local/lam-7.1.1-org/etc/lam-conf.lamd -s -d -v -I "-H
> 137.189.27.88 -P 47576 -n 0 -o 0" -prefix /usr/local/lam-7.1.1-org
> n-1<30205> ssi:boot:globus: launching on n0 (orlon31)
> ************ argv[0]: n-1<30205> ssi:boot:globus: attempting to
> execute "/usr/local/gt321/bin/globus-job-run orlon31 -env
> PATH=`/bin/echo $PATH` /usr/local/lam-7.1.1-org/bin/hboot -t -c
> /usr/local/lam-7.1.1-org/etc/lam-conf.lamd -s -d -v -I "-H
> 137.189.27.88 -P 47576 -n 0 -o 0" -prefix /usr/local/lam-7.1.1-org"
> n-1<30205> ssi:boot:globus: successfully launched on n0 (orlon31)
> n-1<30205> ssi:boot:base:server: expecting connection from finite list
> -----------------------------------------------------------------------
> ------
> The lamboot agent timed out while waiting for the newly-booted process
> to call back and indicated that it had successfully booted.
So what is happening here is exactly what is described -- lamboot
successfully launched its agent on the remote node (i.e., the hboot
command was launched via globus-job-run on orlon31). hboot is supposed
to open a socket back to lamboot -- but that never happened -- lamboot
gave up after a timeout expired and it had not yet received a socket
connection from hboot.
lamboot was waiting on the IP address and port listed on the hboot
command line: 137.189.27.88, port 47576. If hboot was unable to open a
connection to that port, lamboot would time out in exactly this way. Do
you have firewalls between these machines?
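
One rough way to check this is to try opening a TCP connection yourself
from the node where hboot runs. The following is only a minimal Python
sketch, not part of LAM; the helper name can_connect is made up for
illustration. Note that lamboot picks a new port on each run and stops
listening once it times out, so a refused connection only tells you
something while a listener is actually up on that port.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper for illustration only; it is not a LAM tool.
    """
    try:
        # create_connection resolves the host and attempts a TCP connect,
        # raising OSError (refused, unreachable, timed out) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, calling can_connect("137.189.27.88", 47576) while lamboot is
still waiting would try the address and port from the -H and -P arguments
shown above. If it fails while a plain connection to some known open port
on the same host works, a firewall blocking that port range is a likely
suspect.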
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/