LAM/MPI General User's Mailing List Archives

From: Kevin Kuo (kkuo_at_[hidden])
Date: 2004-09-14 15:55:18


Hi All,

I'm having some problems booting LAM on an older cluster. Here is what
I did:

$ lamboot -v ~/machines

LAM 6.3.1/MPI 2 C++/ROMIO - University of Notre Dame

Executing hboot on n0 (emag6)...
-----------------------------------------------------------------------------
LAM was trying to determine the your shell on the "emag6".
However, LAM did not receive any valid output.

LAM tried to use the remote agent command "/usr/bin/rsh"
to invoke "echo $SHELL" on the remote node.

This is an unusaual error -- it does not typically indicate a
permissions problem. But it can sometimes indicate latent (or
"silent") errors in your $HOME/.cshrc, $HOME/.login, or $HOME/.profile
file.

Try invoking the following command at the unix command line:

        /usr/bin/rsh emag6 -n echo $SHELL

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------

So I tried what the error message suggests:

$ /usr/bin/rsh emag6 -n echo $SHELL
/bin/bash
$
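The error text asks for two things: no password prompt, and no output before the shell path. A minimal Python sketch of that check follows, assuming Python 3.7+ is available on the node. The helper names `is_clean_shell_output` and `remote_shell` are hypothetical and not part of LAM. Passing the command as a list keeps the local shell from expanding $SHELL, so the remote side expands it instead.

```python
import subprocess


def is_clean_shell_output(output: str) -> bool:
    """Return True if the output is exactly one line holding an absolute path."""
    lines = output.splitlines()
    return len(lines) == 1 and lines[0].startswith("/") and " " not in lines[0]


def remote_shell(host: str, agent: str = "/usr/bin/rsh") -> str:
    """Run 'echo $SHELL' on host through the remote agent and return its stdout.

    The argument order mirrors the command printed by lamboot. Because no
    local shell is involved, "$SHELL" reaches the remote node unexpanded.
    """
    result = subprocess.run(
        [agent, host, "-n", "echo $SHELL"],
        capture_output=True,
        text=True,
        timeout=30,  # assumed limit, so a password prompt cannot block forever
    )
    return result.stdout


# Example use on the cluster (not run here):
#   out = remote_shell("emag6")
#   print(repr(out), is_clean_shell_output(out))
```

Any banner or stray line printed by a login script makes the check return False, which matches the "silent errors" case the message describes.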

Then, the machines file says:

The hostname should
# be the same as the result from the command "hostname"
emag6
emag5
emag7

so when I do `hostname` on emag6, I get:
emag6.cluster.xxx.xxx.com

emag6$ ping emag5.cluster.xxx.xxx.com
times out....

So I look at my /etc/hosts:
127.0.0.1 emag5 localhost.localdomain localhost
192.168.2.254 emag7.cluster.xxx.xxx.com emag7
192.168.2.12 emag5.cluster.xxx.xxx.com emag5

So it looks like emag5 and emag5.cluster.xxx.xxx.com should be flipped,
yes? But it seems somewhat strange that whatever Unix system call lamboot
makes couldn't resolve network aliases.
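To make the suspicion about /etc/hosts concrete, here is a small Python sketch that scans hosts-file text for non-localhost names bound to a loopback address and reports which address a name hits first. It assumes the resolver returns the first matching line, which is common but depends on the system's configuration. The function names are hypothetical, and the sample data is the file shown above.

```python
from typing import List, Optional

SAMPLE_HOSTS = """\
127.0.0.1 emag5 localhost.localdomain localhost
192.168.2.254 emag7.cluster.xxx.xxx.com emag7
192.168.2.12 emag5.cluster.xxx.xxx.com emag5
"""


def parse_hosts(text: str):
    """Yield (address, [names]) for each non-empty, non-comment line."""
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            yield fields[0], fields[1:]


def loopback_aliases(text: str) -> List[str]:
    """Names other than localhost* that are bound to a 127.x.x.x address."""
    found = []
    for address, names in parse_hosts(text):
        if address.startswith("127."):
            found.extend(n for n in names if not n.startswith("localhost"))
    return found


def first_address(text: str, name: str) -> Optional[str]:
    """Address on the first line that lists name (assumes first match wins)."""
    for address, names in parse_hosts(text):
        if name in names:
            return address
    return None


print(loopback_aliases(SAMPLE_HOSTS))        # ['emag5']
print(first_address(SAMPLE_HOSTS, "emag5"))  # 127.0.0.1
```

Under the first-match assumption, the short name emag5 resolves to 127.0.0.1 here, and the 192.168.2.12 entry is never reached for that name.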

So then I switched the order of the node listings by putting emag7 at
the top:

emag7
emag6
emag5

and now this is what lamboot gives me:

$ lamboot -d ~/machines

LAM 6.3.1/MPI 2 C++/ROMIO - University of Notre Dame

lamboot: boot schema file: /export/homes/donghoon/machines
lamboot: opening hostfile /export/homes/donghoon/machines
lamboot: found the following hosts:
lamboot: n0 emag7
lamboot: n1 emag6
lamboot: n2 emag5
lamboot: found 3 host node(s)
lamboot: origin node is 0 (emag7)
lamboot: attempting to execute "hboot -t -c conf.lam -d -I " -H 127.0.0.1 -P 1066 -n 0 -o 0 ""
hboot: process schema = "conf.lam"

and hangs...

The conf.lam file was copied from /usr/boot/bhost.def and contains just
"localhost".

Sorry for the very long post, and thanks for your patience. I would
appreciate any help you can provide.

Truly,

Kevin Kuo