
LAM/MPI General User's Mailing List Archives


From: K. Choi (kchoi_at_[hidden])
Date: 2004-05-01 23:22:18


A few days ago I upgraded to LAM 7.0.4-2, and now there is a problem.

lamboot fails when I run it. The failing session is below:

[xenus_at_van:~]$ cat lamhosts
van
node2
[xenus_at_van:~]$ recon -v lamhosts
n0<5357> ssi:boot:base:linear: booting n0 (van)
n0<5357> ssi:boot:base:linear: booting n1 (node2)
n0<5357> ssi:boot:base:linear: finished
-----------------------------------------------------------------------------
Woo hoo!

recon has completed successfully. This means that you will most likely
be able to boot LAM successfully with the "lamboot" command (but this
is not a guarantee). See the lamboot(1) manual page for more
information on the lamboot command.

If you have problems booting LAM (with lamboot) even though recon
worked successfully, enable the "-d" option to lamboot to examine each
step of lamboot and see what fails. Most situations where recon
succeeds and lamboot fails have to do with the hboot(1) command (that
lamboot invokes on each host in the hostfile).
-----------------------------------------------------------------------------
[xenus_at_van:~]$ lamboot -v lamhosts

LAM 7.0.4/MPI 2 C++/ROMIO - Indiana University

n0<5361> ssi:boot:base:linear: booting n0 (van)
n0<5361> ssi:boot:base:linear: booting n1 (node2)
ERROR: LAM/MPI unexpectedly received the following on stderr:
base: cannot find process schema (null): No such file or directory
-----------------------------------------------------------------------------
*** Oops -- cannot find the help that you're supposed to get.
*** Using the following help file:
***
*** /usr/lib/lam/etc/lam-helpfile
***
*** You were supposed to get help on the program "hboot"
*** about the topic "cant-parse-config"
*** But it doesn't seem to be in that file.
***
*** Sorry!
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
LAM failed to execute a LAM binary on the remote node "node2".
Since LAM was already able to determine your remote shell as "hboot",
it is probable that this is not an authentication problem.

LAM tried to use the remote agent command "/usr/bin/rsh"
to invoke the following command:

        /usr/bin/rsh node2 -n hboot -t -c lam-conf.lamd -v -s -I "-H
192.168.42.250 -P 33150 -n 1 -o 0"

This can indicate several things. You should check the following:

        - The LAM binaries are in your $PATH
        - You can run the LAM binaries
        - The $PATH variable is set properly before your
          .cshrc/.profile exits

Try to invoke the command listed above manually at a Unix prompt.

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n0<5361> ssi:boot:base:linear: Failed to boot n1 (node2)
n0<5361> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
n0<5367> ssi:boot:base:linear: booting n0 (van)
n0<5367> ssi:boot:base:linear: booting n1 (node2)
n0<5367> ssi:boot:base:linear: finished
lamboot did NOT complete successfully

What is the problem? I don't know why it fails. When I used LAM 6.x, there were
no problems. :-(
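
The first item in the checklist printed by lamboot above ("The LAM binaries
are in your $PATH") can be checked with a small shell function. This is only
a sketch: the name check_lam_path is hypothetical and not part of LAM, and the
default command names are simply the LAM commands that appear in the session
above.

```shell
# check_lam_path: report which of the given commands are not found on $PATH.
# Hypothetical helper, not part of LAM. With no arguments it checks the
# LAM commands that appear in the session above (an assumed default list).
check_lam_path() {
    [ "$#" -gt 0 ] || set -- hboot lamboot recon
    missing=0
    for bin in "$@"; do
        if command -v "$bin" >/dev/null 2>&1; then
            echo "found: $bin"
        else
            echo "missing: $bin"
            missing=1
        fi
    done
    # Exit status 0 only if every command was found.
    return "$missing"
}
```

The result only matters for the non-interactive shell that rsh starts on
node2, so the check would have to run there rather than in a login shell on
van. After that, "lamboot -d lamhosts" (the -d option mentioned in the recon
output) should show which step fails.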

-- 
Sincerely, Kiyoung