I'm unable to boot my nodes with LAM 7.0.5 (recently installed). This is the
boot command in my PBS script:
lamboot -v -ssi boot rsh -ssi rsh_agent "rsh" $PBS_NODEFILE
The errors from the error file are below. The output file doesn't contain
the message "topology done", which I guess is printed when the boot succeeds.
n-1<1718> ssi:boot:base:linear: booting n0 (Empire-09-14)
n-1<1718> ssi:boot:base:linear: booting n1 (Empire-09-02)
ERROR: LAM/MPI unexpectedly received the following on stderr:
hboot: error while loading shared libraries: liblam.so.0: cannot open shared object file: No such file or directory
-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node "Empire-09-02",
but received some output on the standard error.
LAM tried to use the remote agent command "rsh"
to invoke "hboot" on the remote node.
This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a list of items that you may
wish to check on the remote node:
.......
.......
I tried pasting the rsh command and this is the result:
Redstone[1153] pushkar$ rsh Empire-09-02 -n hboot -t -c lam-conf.lamd -v -sessionsuffix pbs-59687.Empire -s -I "-H 172.16.9.14 -P 32837 -n 1 -o 0"
poll: protocol failure in circuit setup
I made sure all the libs and binaries are in my path.
Can anyone tell me what's wrong? Thanks,
Pushkar