LAM/MPI General User's Mailing List Archives

From: Swan (swan2925_at_[hidden])
Date: 2005-06-16 08:04:25


We don't have a firewall between these machines; I believe the problem may be due to the lamboot components.

I ran the same command that lamboot executes, and here is its output:

[vasptest_at_orlon31 vasptest]$ /usr/local/gt321/bin/globus-job-run orlon31 -env PATH=`/bin/echo $PATH` /usr/local/lam-7.1.1-org/bin/hboot -t -c /usr/local/lam-7.1.1-org/etc/lam-conf.lamd -d -v -I "-H 137.189.27.88 -P 47576 -n 0 -o 0" -prefix /usr/local/lam-7.1.1-org
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-vasptest_at_[hidden]/lam-killfile<>
tkill: removing socket file ...
tkill: socket file: /tmp/lam-vasptest_at_[hidden]/lam-kernel-socketd<>
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-vasptest_at_[hidden]/lam-io-socket<>
tkill: f_kill = "/tmp/lam-vasptest_at_[hidden]/lam-killfile<>"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 23484 ...
tkill: already dead
tkill: all finished
hboot: performing tkill
hboot: /usr/local/lam-7.1.1-org/bin/tkill -d
hboot: booting...
hboot: fork /usr/local/lam-7.1.1-org/bin/lamd
[1] 23702 lamd -H 137.189.27.88 -P 47576 -n 0 -o 0 -d
ssi_boot_send_lamd_info: sfh_sock_open_clt_inet_stm failed: Connection refused

Do you have any suggestions?

Thanks!!

Regards,
Swan
  ----- Original Message -----
  From: Jeff Squyres<mailto:jsquyres_at_[hidden]>
  To: General LAM/MPI mailing list<mailto:lam_at_[hidden]>
  Sent: June 16, 2005, 8:35 PM
  Subject: Re: LAM: lamboot on globus

  On Jun 11, 2005, at 1:51 PM, Swan wrote:

> I didn't wait for your modified copy to fix the env path problem;
> instead, I modified the source directly to add the -env option when
> running globus-job-run. I believe the previously mentioned env path
> problem has been fixed.

  Ok. That's a good workaround for you. Unfortunately, it's not good
  for the general case because you can't assume that the path is the same
  on the remote node as it is on the local node.

> However, another problem did arise. The following debug output should
> show my situation.
>
> [vasptest_at_orlon31 test2]$ cat hosts
> orlon31 prefix=/usr/local/lam-7.1.1-org
> orlon28 prefix=/usr/local/lam-7.1.1
> [vasptest_at_orlon31 test2]$ /usr/local/lam-7.1.1-fai/bin/lamboot -v -d
> -ssi boot globus hosts
> n-1<30205> ssi:boot:open: opening
> [snipped]
> n-1<30205> ssi:boot:globus: starting on n0 (orlon31):
> /usr/local/gt321/bin/globus-job-run -env PATH=`/bin/echo $PATH`
> /usr/local/lam-7.1.1-org/bin/hboot -t -c
> /usr/local/lam-7.1.1-org/etc/lam-conf.lamd -s -d -v -I "-H
> 137.189.27.88 -P 47576 -n 0 -o 0" -prefix /usr/local/lam-7.1.1-org
> n-1<30205> ssi:boot:globus: launching on n0 (orlon31)
> ************ argv[0]: n-1<30205> ssi:boot:globus: attempting to
> execute "/usr/local/gt321/bin/globus-job-run orlon31 -env
> PATH=`/bin/echo $PATH` /usr/local/lam-7.1.1-org/bin/hboot -t -c
> /usr/local/lam-7.1.1-org/etc/lam-conf.lamd -s -d -v -I "-H
> 137.189.27.88 -P 47576 -n 0 -o 0" -prefix /usr/local/lam-7.1.1-org"
> n-1<30205> ssi:boot:globus: successfully launched on n0 (orlon31)
> n-1<30205> ssi:boot:base:server: expecting connection from finite list
> -----------------------------------------------------------------------------
> The lamboot agent timed out while waiting for the newly-booted process
> to call back and indicated that it had successfully booted.

  So what is happening here is exactly what is described -- lamboot
  successfully launched its agent on the remote node (i.e., the hboot
  command was launched via globus-job-run on orlon31). hboot is supposed
  to open a socket back to lamboot -- but that never happened -- lamboot
  gave up after a timeout expired and it had not yet received a socket
  connection from hboot.
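
  The call-back handshake described above can be illustrated with a minimal
  Python sketch. This is not LAM's code: the function names
  (wait_for_callback, agent_call_back, demo) and the timeout values are
  hypothetical, and everything runs over the loopback interface. It only shows
  the general pattern of a launcher that listens on a port, starts an agent,
  and gives up if no connection arrives before a timeout.

```python
import socket
import threading


def wait_for_callback(listener, timeout_s):
    """Launcher side: wait up to timeout_s seconds for one inbound
    connection. Returns the peer address, or None if the wait timed out."""
    listener.settimeout(timeout_s)
    try:
        conn, peer = listener.accept()
    except socket.timeout:
        return None
    conn.close()
    return peer


def agent_call_back(host, port):
    """Agent side: open a TCP connection back to the launcher, then close it."""
    with socket.create_connection((host, port), timeout=5):
        pass


def demo():
    # The launcher picks a free port (port 0 lets the OS choose one) and
    # listens on it, much as lamboot advertises an address and port to hboot.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    host, port = listener.getsockname()

    # Case 1: the agent calls back, so the launcher sees a connection.
    agent = threading.Thread(target=agent_call_back, args=(host, port))
    agent.start()
    called_back = wait_for_callback(listener, 2.0) is not None
    agent.join()

    # Case 2: no agent ever connects, so the launcher times out.
    timed_out = wait_for_callback(listener, 0.2) is None

    listener.close()
    return called_back, timed_out
```

  In the second case the launcher learns only that nobody connected in time;
  it cannot tell why, which is why the agent's own output is needed to
  diagnose the failure.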

  Lamboot was waiting on the IP address/socket port listed on the hboot
  command line: 137.189.27.88 port 47576. If hboot was unable to open a
  connection to that port, this could be a cause of failure. Do you have
  firewalls between these machines?
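
  One way to check this independently of LAM is to probe the advertised
  address and port from the remote node while lamboot is still waiting. The
  sketch below is a generic TCP check, not part of LAM; the function name
  probe and the default timeout are assumptions made for illustration.

```python
import socket


def probe(host, port, timeout_s=3.0):
    """Try to open a TCP connection to host:port and classify the result."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepted the connection on that port.
        return "refused"
    except socket.timeout:
        # No answer at all within timeout_s, as when packets are silently dropped.
        return "timeout"
    except OSError as exc:
        # Any other failure, such as an unreachable network.
        return "error: " + str(exc)
```

  For example, probe("137.189.27.88", 47576) run on orlon31 would test the
  address and port shown on the hboot command line. A "refused" result
  generally means the host was reachable but nothing was listening on that
  port, which would be expected if the command is replayed by hand after
  lamboot has already exited. A "timeout" result is more typical of a
  firewall that silently drops packets.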

  --
  {+} Jeff Squyres
  {+} jsquyres_at_[hidden]<mailto:jsquyres_at_[hidden]>
  {+} http://www.lam-mpi.org/

  _______________________________________________
  This list is archived at
http://www.lam-mpi.org/MailArchives/lam/