Actually, the part of the boot sequence output you included shows
absolutely nothing wrong. Perhaps there is something you forgot to
include?
Brian
On Dec 16, 2004, at 8:30 PM, Reza Shahidi wrote:
> Hello all:
>
> We are in the process of setting up a small 5-node cluster with
> Clustermatic, and hence bproc. We just upgraded from LAM 7.0 to 7.1
> but lamboot is not working the way it should. This is probably
> because the clients are set up as diskless. We get the output below
> when running lamboot on just 2 child nodes (for test purposes).
>
> As you can see, there is a problem with tkill in the boot sequence.
> I think this problem has been reported before, but I have been unable
> to find a concrete fix or workaround. Would installing the V9FS
> filesystem be of any help? Is there an easier way to solve this
> issue?
>
> Thanks,
>
> Reza
>
>> shahidi_at_controller:~> lamboot -v -d ~/bhost.def
>> n-1<18699> ssi:boot:open: opening
>> n-1<18699> ssi:boot:open: opening boot module bproc
>> n-1<18699> ssi:boot:open: opened boot module bproc
>> n-1<18699> ssi:boot:open: opening boot module globus
>> n-1<18699> ssi:boot:open: opened boot module globus
>> n-1<18699> ssi:boot:open: opening boot module rsh
>> n-1<18699> ssi:boot:open: opened boot module rsh
>> n-1<18699> ssi:boot:open: opening boot module slurm
>> n-1<18699> ssi:boot:open: opened boot module slurm
>> n-1<18699> ssi:boot:select: initializing boot module slurm
>> n-1<18699> ssi:boot:slurm: not running under SLURM
>> n-1<18699> ssi:boot:select: boot module not available: slurm
>> n-1<18699> ssi:boot:select: initializing boot module rsh
>> n-1<18699> ssi:boot:rsh: module initializing
>> n-1<18699> ssi:boot:rsh:agent: rsh
>> n-1<18699> ssi:boot:rsh:username: <same>
>> n-1<18699> ssi:boot:rsh:verbose: 1000
>> n-1<18699> ssi:boot:rsh:algorithm: linear
>> n-1<18699> ssi:boot:rsh:no_n: 0
>> n-1<18699> ssi:boot:rsh:no_profile: 0
>> n-1<18699> ssi:boot:rsh:fast: 0
>> n-1<18699> ssi:boot:rsh:ignore_stderr: 0
>> n-1<18699> ssi:boot:rsh:priority: 10
>> n-1<18699> ssi:boot:select: boot module available: rsh, priority: 10
>> n-1<18699> ssi:boot:select: initializing boot module bproc
>> n-1<18699> ssi:boot:bproc: module initializing
>> n-1<18699> ssi:boot:bproc:verbose: 1000
>> n-1<18699> ssi:boot:bproc:priority: 50
>> n-1<18699> ssi:boot:select: boot module available: bproc, priority: 50
>> n-1<18699> ssi:boot:select: initializing boot module globus
>> n-1<18699> ssi:boot:globus: globus-job-run not found, globus boot will not run
>> n-1<18699> ssi:boot:select: boot module not available: globus
>> n-1<18699> ssi:boot:select: finalizing boot module slurm
>> n-1<18699> ssi:boot:slurm: finalizing
>> n-1<18699> ssi:boot:select: closing boot module slurm
>> n-1<18699> ssi:boot:select: finalizing boot module rsh
>> n-1<18699> ssi:boot:rsh: finalizing
>> n-1<18699> ssi:boot:select: closing boot module rsh
>> n-1<18699> ssi:boot:select: finalizing boot module globus
>> n-1<18699> ssi:boot:globus: finalizing
>> n-1<18699> ssi:boot:select: closing boot module globus
>> n-1<18699> ssi:boot:select: selected boot module bproc
>>
>> LAM 7.1.1/MPI 2 C++/ROMIO/bproc - Indiana University
>>
>> n-1<18699> ssi:boot:base: looking for boot schema in following directories:
>> n-1<18699> ssi:boot:base: <current directory>
>> n-1<18699> ssi:boot:base: $TROLLIUSHOME/etc
>> n-1<18699> ssi:boot:base: $LAMHOME/etc
>> n-1<18699> ssi:boot:base: /usr/local/lam/etc
>> n-1<18699> ssi:boot:base: looking for boot schema file:
>> n-1<18699> ssi:boot:base: /home/shahidi/bhost.def
>> n-1<18699> ssi:boot:base: found boot schema: /home/shahidi/bhost.def
>> n-1<18699> ssi:boot:bproc: found the following hosts:
>> n-1<18699> ssi:boot:bproc: n0 cluster002 (cpu=1)
>> n-1<18699> ssi:boot:bproc: n1 cluster003 (cpu=1)
>> n-1<18699> ssi:boot:bproc: n2 controller (cpu=1)
>> n-1<18699> ssi:boot:bproc: resolved hosts:
>> n-1<18699> ssi:boot:bproc: n0 cluster002 --> 192.168.0.12
>> n-1<18699> ssi:boot:bproc: n1 cluster003 --> 192.168.0.13
>> n-1<18699> ssi:boot:bproc: n2 controller --> 192.168.0.1 (origin)
>> n-1<18699> ssi:boot:bproc: n2 node status: up
>> n-1<18699> ssi:boot:bproc: n2 access rights not checked.
>> n-1<18699> ssi:boot:bproc: n3 node status: up
>> n-1<18699> ssi:boot:bproc: n3 access rights not checked.
>> n-1<18699> ssi:boot:bproc: found master node (controller). Skipping checks.
>> n-1<18699> ssi:boot:bproc: starting RTE procs
>> n-1<18699> ssi:boot:bproc:vector: starting
>> n-1<18699> ssi:boot:bproc:vector: launching on nodes 2,3,-1
>> n-1<18699> ssi:boot:bproc:vector: starting wipe on 2,3,-1
>> n-1<18699> ssi:boot:bproc: execmoving tkill -d -v to 2,3,-1
>> tkill: setting prefix to (null)
>> tkill: setting suffix to (null)
>> tkill: got killname back: /tmp/lam-shahidi_at_controller/lam-killfile
>> tkill: removing socket file ...
>> tkill: socket file: /tmp/lam-shahidi_at_controller/lam-kernel-socketd
>> tkill: removing IO daemon socket file ...
>> tkill: IO daemon socket file: /tmp/lam-shahidi_at_controller/lam-io-socket
>> tkill: f_kill = "/tmp/lam-shahidi_at_controller/lam-killfile"
>> tkill: nothing to kill: "/tmp/lam-shahidi_at_controller/lam-killfile"
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/