Mars Lenjoy wrote:
> $ recon -v lamhosts
> n-1<21389> ssi:boot:base:linear: booting n0 (11.11.11.1)
> n-1<21389> ssi:boot:base:linear: booting n1 (11.11.11.2)
> ERROR: LAM/MPI unexpectedly received the following on stderr:
> bash: line 1: tkill: command not found
Have you set up your PATH and LD_LIBRARY_PATH correctly on all nodes?
They must be set in a shell startup script, and that script must also
take effect for non-interactive logins, since that is how rsh runs
tkill on the remote node.
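As a rough sketch only: assuming bash is the shell on the nodes (the
"bash: line 1" in the error suggests it is) and that LAM is installed
under the same prefix on every node as in your ps output
(/u2/test/Lenjoy/LAMHOME), something like the following in ~/.bashrc
would do it. The lib subdirectory is my assumption; adjust it to your
install.

```sh
# Near the top of ~/.bashrc on every node, before any "interactive only" early return.
# Assumption: LAM lives under the same prefix on all nodes (taken from the ps output).
LAMHOME=/u2/test/Lenjoy/LAMHOME

# Put the LAM binaries (tkill, lamd, ...) first on the search path.
export PATH="$LAMHOME/bin:$PATH"

# Assumption: shared libraries, if any, are under $LAMHOME/lib.
# The ${VAR:+...} form avoids a trailing ":" when LD_LIBRARY_PATH was unset.
export LD_LIBRARY_PATH="$LAMHOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

With that in place, the rsh command quoted in the error message should
find tkill instead of reporting "command not found".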
> Try invoking the following command at the unix command line:
>
> rsh 11.11.11.2 -n tkill -N -v
Have you done this? What happens?
> n-1<21389> ssi:boot:base:linear: Failed to boot n1 (11.11.11.2)
> n-1<21389> ssi:boot:base:linear: aborted!
>
> =============================== end ==============================================
>
> why are only n-0 and n-1 tested, but not n2-5?
Testing stops as soon as any error occurs, as happened for n1.
Once n1 succeeds, recon will continue on to the other nodes.
> if I use lamboot without any parameter
>
> ===================== begin =============
>
> $ lamboot
>
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
> $ ps aux | grep lam
> test 21695 0.0 0.0 2260 1048 ? S 23:16 0:00
> /u2/test/Lenjoy/LAMHOME/bin/lamd -H 127.0.0.1 -P 36823 -n 0 -o 0
> test 21713 0.0 0.0 3580 656 pts/4 S 23:16 0:00 grep lam
>
> ======================== end ======================
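This is because lamboot with no arguments boots only the local node,
not any remote nodes. To boot the remote nodes as well, give it the
same boot schema you gave recon (lamboot -v lamhosts).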
Andrew