Hello all:
We are in the process of setting up a small 5-node cluster with
Clustermatic, and hence bproc. We just upgraded from LAM 7.0 to 7.1 but
lamboot is not working the way it should. This is probably because the
clients are set up as diskless. We get the output below when running
lamboot on just two child nodes (for test purposes).
As you can see, there is a problem with tkill in the boot sequence.
I think this problem has been reported before, but I have been unable to
find a concrete fix or workaround. Would installing the V9FS filesystem
be of any help? Is there an easier way to solve this issue?
Thanks,
Reza
> shahidi_at_controller:~> lamboot -v -d ~/bhost.def
> n-1<18699> ssi:boot:open: opening
> n-1<18699> ssi:boot:open: opening boot module bproc
> n-1<18699> ssi:boot:open: opened boot module bproc
> n-1<18699> ssi:boot:open: opening boot module globus
> n-1<18699> ssi:boot:open: opened boot module globus
> n-1<18699> ssi:boot:open: opening boot module rsh
> n-1<18699> ssi:boot:open: opened boot module rsh
> n-1<18699> ssi:boot:open: opening boot module slurm
> n-1<18699> ssi:boot:open: opened boot module slurm
> n-1<18699> ssi:boot:select: initializing boot module slurm
> n-1<18699> ssi:boot:slurm: not running under SLURM
> n-1<18699> ssi:boot:select: boot module not available: slurm
> n-1<18699> ssi:boot:select: initializing boot module rsh
> n-1<18699> ssi:boot:rsh: module initializing
> n-1<18699> ssi:boot:rsh:agent: rsh
> n-1<18699> ssi:boot:rsh:username: <same>
> n-1<18699> ssi:boot:rsh:verbose: 1000
> n-1<18699> ssi:boot:rsh:algorithm: linear
> n-1<18699> ssi:boot:rsh:no_n: 0
> n-1<18699> ssi:boot:rsh:no_profile: 0
> n-1<18699> ssi:boot:rsh:fast: 0
> n-1<18699> ssi:boot:rsh:ignore_stderr: 0
> n-1<18699> ssi:boot:rsh:priority: 10
> n-1<18699> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<18699> ssi:boot:select: initializing boot module bproc
> n-1<18699> ssi:boot:bproc: module initializing
> n-1<18699> ssi:boot:bproc:verbose: 1000
> n-1<18699> ssi:boot:bproc:priority: 50
> n-1<18699> ssi:boot:select: boot module available: bproc, priority: 50
> n-1<18699> ssi:boot:select: initializing boot module globus
> n-1<18699> ssi:boot:globus: globus-job-run not found, globus boot will not run
> n-1<18699> ssi:boot:select: boot module not available: globus
> n-1<18699> ssi:boot:select: finalizing boot module slurm
> n-1<18699> ssi:boot:slurm: finalizing
> n-1<18699> ssi:boot:select: closing boot module slurm
> n-1<18699> ssi:boot:select: finalizing boot module rsh
> n-1<18699> ssi:boot:rsh: finalizing
> n-1<18699> ssi:boot:select: closing boot module rsh
> n-1<18699> ssi:boot:select: finalizing boot module globus
> n-1<18699> ssi:boot:globus: finalizing
> n-1<18699> ssi:boot:select: closing boot module globus
> n-1<18699> ssi:boot:select: selected boot module bproc
>
> LAM 7.1.1/MPI 2 C++/ROMIO/bproc - Indiana University
>
> n-1<18699> ssi:boot:base: looking for boot schema in following directories:
> n-1<18699> ssi:boot:base: <current directory>
> n-1<18699> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<18699> ssi:boot:base: $LAMHOME/etc
> n-1<18699> ssi:boot:base: /usr/local/lam/etc
> n-1<18699> ssi:boot:base: looking for boot schema file:
> n-1<18699> ssi:boot:base: /home/shahidi/bhost.def
> n-1<18699> ssi:boot:base: found boot schema: /home/shahidi/bhost.def
> n-1<18699> ssi:boot:bproc: found the following hosts:
> n-1<18699> ssi:boot:bproc: n0 cluster002 (cpu=1)
> n-1<18699> ssi:boot:bproc: n1 cluster003 (cpu=1)
> n-1<18699> ssi:boot:bproc: n2 controller (cpu=1)
> n-1<18699> ssi:boot:bproc: resolved hosts:
> n-1<18699> ssi:boot:bproc: n0 cluster002 --> 192.168.0.12
> n-1<18699> ssi:boot:bproc: n1 cluster003 --> 192.168.0.13
> n-1<18699> ssi:boot:bproc: n2 controller --> 192.168.0.1 (origin)
> n-1<18699> ssi:boot:bproc: n2 node status: up
> n-1<18699> ssi:boot:bproc: n2 access rights not checked.
> n-1<18699> ssi:boot:bproc: n3 node status: up
> n-1<18699> ssi:boot:bproc: n3 access rights not checked.
> n-1<18699> ssi:boot:bproc: found master node (controller). Skipping checks.
> n-1<18699> ssi:boot:bproc: starting RTE procs
> n-1<18699> ssi:boot:bproc:vector: starting
> n-1<18699> ssi:boot:bproc:vector: launching on nodes 2,3,-1
> n-1<18699> ssi:boot:bproc:vector: starting wipe on 2,3,-1
> n-1<18699> ssi:boot:bproc: execmoving tkill -d -v to 2,3,-1
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-shahidi_at_controller/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-shahidi_at_controller/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-shahidi_at_controller/lam-io-socket
> tkill: f_kill = "/tmp/lam-shahidi_at_controller/lam-killfile"
> tkill: nothing to kill: "/tmp/lam-shahidi_at_controller/lam-killfile"