
LAM/MPI General User's Mailing List Archives


From: ragnar sjoberg (ragnarsjoberg_at_[hidden])
Date: 2003-05-06 09:34:24


Hi!
Any help on this would be very much appreciated,
I'm really stuck :-)

Before starting, here is a good primer I found, close to my problem:
http://www.lam-mpi.org/MailArchives/lam/msg01399.php

I cannot lamboot the diskless slaves. The master boots OK. Each uses a host file
containing only itself.
I have checked that the loopback interface is
set up properly and that nsswitch.conf, protocols and services are
in /etc.

The point of failure when running
lamboot -d -v ./local on the slave:
the call to readcltcoord() fails
when it checks the status variable, which is
mread()'ed by readsockint4().
The "status" variable contained the number
1238, not zero as readcltcoord() expects.

Running:
find . | grep '^.*\.h$' | xargs cat | grep 1238

from the root of the LAM source tree doesn't tell me
anything.
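
A status code need not appear in the headers as the decimal literal 1238. It could be spelled in hex (0x4d6) or octal (02326), or not be a named constant at all. Below is a minimal sketch of a broader search. The helper names `literal_pattern` and `find_constant` are hypothetical, and the search only catches literal spellings of the number, not values computed at run time.

```python
import os
import re


def literal_pattern(value):
    """Regex for decimal, hex, or C-style octal spellings of value."""
    forms = [str(value), "0x0*%x" % value, "0+%o" % value]
    # Lookarounds stop "1238" from matching inside "12380" or "0x1238";
    # an optional U/L suffix (as in 1238UL) is allowed.
    return re.compile(
        r"(?<![0-9A-Za-z_])(?:%s)[uUlL]*(?![0-9A-Za-z_])" % "|".join(forms),
        re.IGNORECASE,
    )


def find_constant(root, value, suffix=".h"):
    """Return (path, line number, line) for each header line mentioning value."""
    pattern = literal_pattern(value)
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as fh:
                for lineno, line in enumerate(fh, 1):
                    if pattern.search(line):
                        hits.append((path, lineno, line.rstrip()))
    return hits


if __name__ == "__main__":
    # Run from the root of the source tree being searched.
    for path, lineno, line in find_constant(".", 1238):
        print("%s:%d: %s" % (path, lineno, line))
```

Unlike piping through `xargs cat`, this keeps the file name and line number of each match, so a hit can be traced back to its header.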


I still think the error is something very basic, but the network seems OK. I can
netstat/ping and run ssh in either direction between slave and master.

Here is my output from running lamboot:

$ lamboot -d -v ./local2

LAM 6.5.9 - Indiana University

lamboot: boot schema file: ./local2
lamboot: opening hostfile ./local2
lamboot: found the following hosts:
lamboot: n0 10.0.0.17
lamboot: resolved hosts:
lamboot: n0 10.0.0.17 --> 10.0.0.17
lamboot: found 1 host node(s)
lamboot: origin node is 0 (10.0.0.17)
CALLING lambootagent()
Executing hboot on n0 (10.0.0.17 - 1 CPU)...
lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H 10.0.0.17 -P 1140 -n 0 -o 0 ""
hboot: process schema = "/usr/local/mpi/etc/lam-conf.lam"
hboot: found /usr/local/mpi/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/local/mpi/bin/lamd
[1] 1137 lamd -H 10.0.0.17 -P 1140 -n 0 -o 0 -d
mark-a
hboot: attempting to execute

DOING CONNECT () at: 10 0 0 17
mark-b
RSa
CONNECT_IS_OK
RSb: 1238
CALLING lambootagent() FAILED

DOING CONNECT () at: 10 0 0 17
CONNECT_IS_OK
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
wipe ...





Here is the same output with strace prepended:
...
hboot: tkill
hboot: booting...
hboot: fork /usr/local/mpi/bin/lamd
[1] 1144 lamd -H 10.0.0.17 -P 1144 -n 0 -o 0 -d
[WIFEXITED(s) && WEXITSTATUS(s) == 0], 0, NULL) = 1142
--- SIGCHLD (Child exited) ---
write(2, "mark-a\n", 7mark-a
) = 7
select(4, [3], NULL, NULL, {60, 0}hboot: attempting to execute

DOING CONNECT () at: 10 0 0 17
) = 1 (in [3], left {59, 970000})
accept(3, 0, NULL) = 4
write(2, "mark-b\n", 7mark-b
) = 7
write(2, "RSa\n", 4RSa
) = 4
read(4, CONNECT_IS_OK
"\0\0\4\326", 4) = 4
write(2, "RSb: 1238\n", 10RSb: 1238
) = 10
write(2, "CALLING lambootagent() FAILED\n", 30CALLING lambootagent() FAILED
) = 30
open("/usr/home/shrek/lam-helpfile", O_RDONLY
DOING CONNECT () at: 10 0 0 17
CONNECT_IS_OK
) = -1 ENOENT (No such file or directory)
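
The four bytes returned by read() in the trace above, "\0\0\4\326", can be decoded by hand to check whether 1238 is really what arrived on the socket or only an artifact of byte order. A minimal sketch, assuming the status is a 32-bit integer; that it travels in network (big-endian) byte order is an assumption here, not something confirmed from the LAM source. The function name `decode_status` is hypothetical.

```python
import struct


def decode_status(raw):
    """Decode 4 raw bytes as a signed 32-bit int in both byte orders."""
    big = struct.unpack(">i", raw)[0]     # network (big-endian) order
    little = struct.unpack("<i", raw)[0]  # little-endian, for comparison
    return big, little


if __name__ == "__main__":
    # Bytes exactly as strace printed them, using octal escapes.
    raw = b"\0\0\4\326"
    big, little = decode_status(raw)
    print("big-endian:   ", big)
    print("little-endian:", little)
```

Read as big-endian, the bytes give 1238, which matches the "RSb: 1238" debug line. Neither byte order yields zero, so the nonzero status appears to be what the peer actually sent rather than a byte-swapped success value.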




thanks!
Ragnar
