Hi,
Sorry for the delay in replying to this mail.
I suspect one of the following is causing the problem:
- lamd unable to get the hostname
- lamd unable to resolve the hostname into an IP address
- lamd unable to open a TCP listening socket
I would put my money on the second one. Make sure that whatever is
returned by gethostname() (i.e., what the hostname(1) command prints)
is resolvable by getinetaddr().
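
If it helps, the three possibilities above can be checked by hand on the
slave with a small script along these lines. This is only a rough sketch
in Python, not what lamd does internally: the function name
check_lamd_prerequisites is made up for illustration,
socket.gethostbyname() stands in for LAM's own getinetaddr(), and binding
to an ephemeral port on all interfaces is an assumption about what "open
a TCP listening socket" needs.

```python
import socket


def check_lamd_prerequisites(get_hostname=socket.gethostname,
                             resolve=socket.gethostbyname):
    """Run the three checks in order and stop at the first failure.

    Returns a list of (step, ok, detail) tuples. The two callables are
    parameters only so the checks can be exercised with stand-ins.
    """
    results = []

    # 1. Can the hostname be obtained at all?
    try:
        name = get_hostname()
        results.append(("hostname", True, name))
    except OSError as exc:
        results.append(("hostname", False, str(exc)))
        return results

    # 2. Does that hostname resolve to an IP address?
    #    (stand-in for getinetaddr(); an assumption, not LAM's code)
    try:
        addr = resolve(name)
        results.append(("resolve", True, addr))
    except OSError as exc:  # socket.gaierror and socket.herror are OSErrors
        results.append(("resolve", False, str(exc)))
        return results

    # 3. Can a TCP listening socket be opened? Port 0 asks the kernel
    #    for any free port.
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("", 0))
            s.listen(1)
            results.append(("listen", True, "port %d" % s.getsockname()[1]))
    except OSError as exc:
        results.append(("listen", False, str(exc)))
    return results


if __name__ == "__main__":
    for step, ok, detail in check_lamd_prerequisites():
        print("%-8s %-6s %s" % (step, "ok" if ok else "FAILED", detail))
```

If the "resolve" step is the one that fails on the diskless slave, that
would point at the second item in the list.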
Hope this helps,
Manish Chablani
------------------------------------------------------
Graduate Student, CS Department, Indiana University.
http://www.cs.indiana.edu/~mchablan
LAM/MPI Developer
Make today a LAM/MPI day !!!
http://www.lam-mpi.org
------------------------------------------------------
On Tue, 6 May 2003, ragnar sjoberg wrote:
> Hi!
> Any help on this would be very appreciated,
> I'm really stuck :-)
>
> Before starting, here is a good primer I found, close to my problem:
> http://www.lam-mpi.org/MailArchives/lam/msg01399.php
>
> I cannot lamboot the diskless slaves; the master boots OK. Both use a host file
> containing only itself.
> I have checked the loopback interface is
> proper, nsswitch.conf/protocol/services is
> at /etc.
>
> The point of failure when running
> lamboot -d -v ./local at the slave:
> The call to readcltcoord() returns failure
> when checking the status variable, that is
> mread()'ed by readsockint4().
> The "status" variable contained the number
> 1238, not zero as expected by the function readcltcoord().
>
> doing:
> find . | grep '^.*\.h$' | xargs cat | grep 1238
>
> when standing at the lam-source root doesn't tell
> anything.
>
>
> I still think the error is something very basic, but the network seems ok. I can
> netstat/ping and run ssh in any direction slave/master.
>
> Here is my output from running lamboot:
>
> $ lamboot -d -v ./local2
>
> LAM 6.5.9 - Indiana University
>
> lamboot: boot schema file: ./local2
> lamboot: opening hostfile ./local2
> lamboot: found the following hosts:
> lamboot: n0 10.0.0.17
> lamboot: resolved hosts:
> lamboot: n0 10.0.0.17 --> 10.0.0.17
> lamboot: found 1 host node(s)
> lamboot: origin node is 0 (10.0.0.17)
> CALLING lambootagent()
> Executing hboot on n0 (10.0.0.17 - 1 CPU)...
> lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H 10.0.0.17 -P 1140 -n 0 -o 0 ""
> hboot: process schema = "/usr/local/mpi/etc/lam-conf.lam"
> hboot: found /usr/local/mpi/bin/lamd
> hboot: performing tkill
> hboot: tkill
> hboot: booting...
> hboot: fork /usr/local/mpi/bin/lamd
> [1] 1137 lamd -H 10.0.0.17 -P 1140 -n 0 -o 0 -d
> mark-a
> hboot: attempting to execute
>
> DOING CONNECT () at: 10 0 0 17
> mark-b
> RSa
> CONNECT_IS_OK
> RSb: 1238
> CALLING lambootagent() FAILED
>
> DOING CONNECT () at: 10 0 0 17
> CONNECT_IS_OK
> -----------------------------------------------------------------------------
> lamboot encountered some error (see above) during the boot process,
> and will now attempt to kill all nodes that it was previously able to
> boot (if any).
>
> Please wait for LAM to finish; if you interrupt this process, you may
> have LAM daemons still running on remote nodes.
> -----------------------------------------------------------------------------
> wipe ...
>
>
>
>
>
> Here is the same output with strace prepended:
> ...
> hboot: tkill
> hboot: booting...
> hboot: fork /usr/local/mpi/bin/lamd
> [1] 1144 lamd -H 10.0.0.17 -P 1144 -n 0 -o 0 -d
> [WIFEXITED(s) && WEXITSTATUS(s) == 0], 0, NULL) = 1142
> --- SIGCHLD (Child exited) ---
> write(2, "mark-a\n", 7mark-a
> ) = 7
> select(4, [3], NULL, NULL, {60, 0}hboot: attempting to execute
>
> DOING CONNECT () at: 10 0 0 17
> ) = 1 (in [3], left {59, 970000})
> accept(3, 0, NULL) = 4
> write(2, "mark-b\n", 7mark-b
> ) = 7
> write(2, "RSa\n", 4RSa
> ) = 4
> read(4, CONNECT_IS_OK
> "\0\0\4\326", 4) = 4
> write(2, "RSb: 1238\n", 10RSb: 1238
> ) = 10
> write(2, "CALLING lambootagent() FAILED\n", 30CALLING lambootagent() FAILED
> ) = 30
> open("/usr/home/shrek/lam-helpfile", O_RDONLY
> DOING CONNECT () at: 10 0 0 17
> CONNECT_IS_OK
> ) = -1 ENOENT (No such file or directory)
>
>
>
>
> thanks!
> Ragnar
>
>
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>