LAM/MPI General User's Mailing List Archives

From: Reza Shahidi (shahidi_at_[hidden])
Date: 2004-12-17 11:13:53


Hello,

    Well, it seems to hang indefinitely after this output. If I Ctrl-C
after waiting for a few minutes, it says that lamboot did not complete
successfully. It seems rather odd...

- Reza

Brian Barrett wrote:

> Actually, your output shows absolutely nothing wrong with the part of
> the boot sequence output you included. Perhaps there is something you
> forgot to include?
>
> Brian
>
> On Dec 16, 2004, at 8:30 PM, Reza Shahidi wrote:
>
>> Hello all:
>>
>> We are in the process of setting up a small 5-node cluster with
>> Clustermatic, and hence bproc. We just upgraded from LAM 7.0 to 7.1
>> but lamboot is not working the way it should. This is probably
>> because the clients are set up as diskless. We get the output below
>> when running lamboot on just 2 child nodes (for test purposes).
>>
>> As you can see, there is a problem with tkill in the boot
>> sequence. I think this problem has been reported before, but I have
>> been unable to find a concrete fix or workaround. Would installing
>> the V9FS filesystem be of any help? Is there an easier way to solve
>> this issue?
>>
>> Thanks,
>>
>> Reza
>>
>>> shahidi_at_controller:~> lamboot -v -d ~/bhost.def
>>> n-1<18699> ssi:boot:open: opening
>>> n-1<18699> ssi:boot:open: opening boot module bproc
>>> n-1<18699> ssi:boot:open: opened boot module bproc
>>> n-1<18699> ssi:boot:open: opening boot module globus
>>> n-1<18699> ssi:boot:open: opened boot module globus
>>> n-1<18699> ssi:boot:open: opening boot module rsh
>>> n-1<18699> ssi:boot:open: opened boot module rsh
>>> n-1<18699> ssi:boot:open: opening boot module slurm
>>> n-1<18699> ssi:boot:open: opened boot module slurm
>>> n-1<18699> ssi:boot:select: initializing boot module slurm
>>> n-1<18699> ssi:boot:slurm: not running under SLURM
>>> n-1<18699> ssi:boot:select: boot module not available: slurm
>>> n-1<18699> ssi:boot:select: initializing boot module rsh
>>> n-1<18699> ssi:boot:rsh: module initializing
>>> n-1<18699> ssi:boot:rsh:agent: rsh
>>> n-1<18699> ssi:boot:rsh:username: <same>
>>> n-1<18699> ssi:boot:rsh:verbose: 1000
>>> n-1<18699> ssi:boot:rsh:algorithm: linear
>>> n-1<18699> ssi:boot:rsh:no_n: 0
>>> n-1<18699> ssi:boot:rsh:no_profile: 0
>>> n-1<18699> ssi:boot:rsh:fast: 0
>>> n-1<18699> ssi:boot:rsh:ignore_stderr: 0
>>> n-1<18699> ssi:boot:rsh:priority: 10
>>> n-1<18699> ssi:boot:select: boot module available: rsh, priority: 10
>>> n-1<18699> ssi:boot:select: initializing boot module bproc
>>> n-1<18699> ssi:boot:bproc: module initializing
>>> n-1<18699> ssi:boot:bproc:verbose: 1000
>>> n-1<18699> ssi:boot:bproc:priority: 50
>>> n-1<18699> ssi:boot:select: boot module available: bproc, priority: 50
>>> n-1<18699> ssi:boot:select: initializing boot module globus
>>> n-1<18699> ssi:boot:globus: globus-job-run not found, globus boot will not run
>>> n-1<18699> ssi:boot:select: boot module not available: globus
>>> n-1<18699> ssi:boot:select: finalizing boot module slurm
>>> n-1<18699> ssi:boot:slurm: finalizing
>>> n-1<18699> ssi:boot:select: closing boot module slurm
>>> n-1<18699> ssi:boot:select: finalizing boot module rsh
>>> n-1<18699> ssi:boot:rsh: finalizing
>>> n-1<18699> ssi:boot:select: closing boot module rsh
>>> n-1<18699> ssi:boot:select: finalizing boot module globus
>>> n-1<18699> ssi:boot:globus: finalizing
>>> n-1<18699> ssi:boot:select: closing boot module globus
>>> n-1<18699> ssi:boot:select: selected boot module bproc
>>>
>>> LAM 7.1.1/MPI 2 C++/ROMIO/bproc - Indiana University
>>>
>>> n-1<18699> ssi:boot:base: looking for boot schema in following directories:
>>> n-1<18699> ssi:boot:base: <current directory>
>>> n-1<18699> ssi:boot:base: $TROLLIUSHOME/etc
>>> n-1<18699> ssi:boot:base: $LAMHOME/etc
>>> n-1<18699> ssi:boot:base: /usr/local/lam/etc
>>> n-1<18699> ssi:boot:base: looking for boot schema file:
>>> n-1<18699> ssi:boot:base: /home/shahidi/bhost.def
>>> n-1<18699> ssi:boot:base: found boot schema: /home/shahidi/bhost.def
>>> n-1<18699> ssi:boot:bproc: found the following hosts:
>>> n-1<18699> ssi:boot:bproc: n0 cluster002 (cpu=1)
>>> n-1<18699> ssi:boot:bproc: n1 cluster003 (cpu=1)
>>> n-1<18699> ssi:boot:bproc: n2 controller (cpu=1)
>>> n-1<18699> ssi:boot:bproc: resolved hosts:
>>> n-1<18699> ssi:boot:bproc: n0 cluster002 --> 192.168.0.12
>>> n-1<18699> ssi:boot:bproc: n1 cluster003 --> 192.168.0.13
>>> n-1<18699> ssi:boot:bproc: n2 controller --> 192.168.0.1 (origin)
>>> n-1<18699> ssi:boot:bproc: n2 node status: up
>>> n-1<18699> ssi:boot:bproc: n2 access rights not checked.
>>> n-1<18699> ssi:boot:bproc: n3 node status: up
>>> n-1<18699> ssi:boot:bproc: n3 access rights not checked.
>>> n-1<18699> ssi:boot:bproc: found master node (controller). Skipping checks.
>>> n-1<18699> ssi:boot:bproc: starting RTE procs
>>> n-1<18699> ssi:boot:bproc:vector: starting
>>> n-1<18699> ssi:boot:bproc:vector: launching on nodes 2,3,-1
>>> n-1<18699> ssi:boot:bproc:vector: starting wipe on 2,3,-1
>>> n-1<18699> ssi:boot:bproc: execmoving tkill -d -v to 2,3,-1
>>> tkill: setting prefix to (null)
>>> tkill: setting suffix to (null)
>>> tkill: got killname back: /tmp/lam-shahidi_at_controller/lam-killfile
>>> tkill: removing socket file ...
>>> tkill: socket file: /tmp/lam-shahidi_at_controller/lam-kernel-socketd
>>> tkill: removing IO daemon socket file ...
>>> tkill: IO daemon socket file: /tmp/lam-shahidi_at_controller/lam-io-socket
>>> tkill: f_kill = "/tmp/lam-shahidi_at_controller/lam-killfile"
>>> tkill: nothing to kill: "/tmp/lam-shahidi_at_controller/lam-killfile"
>>
>>
>>
>> _______________________________________________
>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>
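
For readers following the quoted debug output, the boot module selection step can be sketched roughly as below. This is only an illustration, assuming that lamboot picks the available boot module with the highest priority, which is what the log suggests (bproc at priority 50 is selected over rsh at priority 10). The function name select_boot_module and the regular expression are hypothetical helpers written for this example; they are not part of LAM/MPI.

```python
import re

# Matches log lines such as:
#   n-1<18699> ssi:boot:select: boot module available: rsh, priority: 10
# Lines reading "boot module not available: ..." do not match.
AVAILABLE = re.compile(r"boot module available: (\w+), priority: (\d+)")


def select_boot_module(log_text):
    """Return (name, priority) of the highest-priority available module.

    Returns None when the log reports no available module.
    Assumption: the highest priority wins, as the quoted log implies.
    """
    found = [(name, int(prio)) for name, prio in AVAILABLE.findall(log_text)]
    if not found:
        return None
    return max(found, key=lambda item: item[1])


# Lines copied from the lamboot -d output quoted in this thread.
sample = """\
n-1<18699> ssi:boot:select: boot module not available: slurm
n-1<18699> ssi:boot:select: boot module available: rsh, priority: 10
n-1<18699> ssi:boot:select: boot module available: bproc, priority: 50
n-1<18699> ssi:boot:select: boot module not available: globus
"""

print(select_boot_module(sample))
```

Run against the lines above, the sketch reports bproc, matching the "selected boot module bproc" line in the log. It only reads the log text; it says nothing about why the boot hangs after the tkill step.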