LAM/MPI General User's Mailing List Archives

From: LUK ShunTim (shuntim.luk_at_[hidden])
Date: 2007-02-11 06:37:35


Hello,

I have LAM 7.1.2 installed via RPMs in a caos2 cum warewulf setup, and things
work well. A user requested LAM 6.5, so I installed 6.5.9 from source into
/home/lamo, with /home mounted via NFS. PATH is set so that /home/lamo/bin is
picked up first, on both the master and the nodes. recon returned successfully,
but lamboot fails. Here is the output of "lamboot -d".

<error>
hboot: process schema = "/home/lamo/etc/lam-conf.lam"
hboot: found /home/lamo/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /home/lamo/bin/lamd
[1] 6820 lamd -H 10.0.4.1 -P 49838 -n 0 -o 0 -d
lamboot: attempting to execute "/usr/bin/rsh node0000 -n echo $SHELL"
lamboot: got remote shell /bin/bash
lamboot: attempting to execute "/usr/bin/rsh node0000 -n hboot -t -c
lam-conf.lam -d -s -I "-H 10.0.4.1 -P 49838 -n 1 -o 0 ""
base: cannot find process schema (null):
-----------------------------------------------------------------------------
LAM failed to execute a LAM binary on the remote node "node0000".
Since LAM was already able to determine your remote shell as "hboot",
it is probable that this is not an authentication problem.

LAM tried to use the remote agent command "/usr/bin/rsh"
to invoke the following command:

        /usr/bin/rsh node0000 -n hboot -t -c lam-conf.lam -d -s -I "-H 10.0.4.1 -P
49838 -n 1 -o 0 "

This can indicate several things. You should check the following:

        - The LAM binaries are in your $PATH
        - You can run the LAM binaries
        - The $PATH variable is set properly before your
          .cshrc/.profile exits

Try to invoke the command listed above manually at a Unix prompt.

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------

LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University

Executing tkill on n0 (demo)...

LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University

lamboot: boot schema file: /home/stluk/lamboot-all-nodes.def
lamboot: opening hostfile /home/stluk/lamboot-all-nodes.def
lamboot: found the following hosts:
lamboot: n0 demo
lamboot: n1 node0000
lamboot: n2 node0001
lamboot: n3 node0002
lamboot: n4 node0003
lamboot: n5 node0004
lamboot: n6 node0005
lamboot: n7 node0006
lamboot: n8 node0007
lamboot: n9 node0008
lamboot: resolved hosts:
lamboot: n0 demo --> 10.0.4.1
lamboot: n1 node0000 --> 10.0.5.0
lamboot: n2 node0001 --> 10.0.5.1
lamboot: n3 node0002 --> 10.0.5.2
lamboot: n4 node0003 --> 10.0.5.3
lamboot: n5 node0004 --> 10.0.5.4
lamboot: n6 node0005 --> 10.0.5.5
lamboot: n7 node0006 --> 10.0.5.6
lamboot: n8 node0007 --> 10.0.5.7
lamboot: n9 node0008 --> 10.0.5.8
lamboot: found 10 host node(s)
lamboot: origin node is 0 (demo)
lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -I " -H 10.0.4.1 -P
49838 -n 0 -o 0 ""
wipe ...
lamboot did NOT complete successfully
</error>

This is the offending line:

<quote>
lamboot: attempting to execute "/usr/bin/rsh node0000 -n hboot -t -c
lam-conf.lam -d -s -I "-H 10.0.4.1 -P 49838 -n 1 -o 0 ""
base: cannot find process schema (null):
</quote>

What puzzles me is that when, as a comparison test, I installed LAM 7.1.2 in
exactly the same way into /home/lam, lamboot *worked*.
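
For reference, the manual checks that the error message suggests could be run
roughly as below. This is only a sketch: the function name check_node and the
RSH variable are made up for illustration and are not part of LAM. The agent
path, the node name, and the schema path are the ones that appear in the
output above.

```shell
# Remote agent as reported by lamboot above; set RSH to try another agent.
RSH="${RSH:-/usr/bin/rsh}"

# check_node NODE (hypothetical helper, not a LAM command):
# show what a non-interactive remote shell on NODE sees.
check_node() {
    node="$1"
    # PATH as seen by the non-interactive shell that lamboot uses
    "$RSH" "$node" -n 'echo $PATH'
    # which hboot that shell resolves (hoped for here: /home/lamo/bin/hboot)
    "$RSH" "$node" -n 'which hboot'
    # whether the process schema found on the master is visible on the node
    "$RSH" "$node" -n 'ls -l /home/lamo/etc/lam-conf.lam'
}

# Usage: check_node node0000
```

If the second command prints a path outside /home/lamo/bin, or prints anything
before the expected output, that would point at the PATH items in the
checklist of the error message.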

Thanks in advance for your help.
Regards,
ST