
LAM/MPI General User's Mailing List Archives


From: ew fgff (sah_8_at_[hidden])
Date: 2005-04-30 09:57:50


Hi Jeff,

Thank you very much for your response.

1) The output from "recon -v lam-bhost.def" is:

======================================================
recon: -- testing n0 (wolf10.my.edu)
recon: -- testing n1 (wolf.my.edu)
recon: -- testing n2 (wolf4.my.edu)
recon: -- testing n3 (wolf9.my.edu)
----------------------------------------------
Woo hoo!

recon has completed successfully. This means that you will most likely
be able to boot LAM successfully with the "lamboot" command (but this
is not a guarantee). See the lamboot(1) manual page for more
information on the lamboot command.

If you have problems booting LAM (with lamboot) even though recon
worked successfully, enable the "-d" option to lamboot to examine each
step of lamboot and see what fails. Most situations where recon
succeeds and lamboot fails have to do with the hboot(1) command (that
lamboot invokes on each host in the hostfile).
======================================================

2) The output from "lamboot -d" is:

======================================================
LAM 6.5.9/MPI 2 C++ - Indiana University

lamboot: boot schema file: /etc/lam/lam-bhost.def
lamboot: opening hostfile /etc/lam/lam-bhost.def
lamboot: found the following hosts:
lamboot: n0 wolf10.my.edu
lamboot: n1 wolf.my.edu
lamboot: n2 wolf4.my.edu
lamboot: n3 wolf9.my.edu
lamboot: resolved hosts:
lamboot: n0 wolf10.my.edu --> 312.226.653.323
lamboot: n1 wolf.my.edu --> 312.226.653.48
lamboot: n2 wolf4.my.edu --> 312.226.653.98
lamboot: n3 wolf9.my.edu --> 312.226.653.202
lamboot: found 4 host node(s)
lamboot: origin node is 0 (wolf10.my.edu)
lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -I " -H 312.226.653.323 -P 40065 -n 0 -o 0 ""
hboot: process schema = "/etc/lam/lam-conf.lam"
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
hboot: attempting to execute
[1] 10440 lamd -H 312.226.653.323 -P 40065 -n 0 -o 0 -d
lamboot: attempting to execute "ssh -x wolf.my.edu -n echo $SHELL"
lamboot: got remote shell /bin/bash2
lamboot: attempting to execute "ssh -x wolf.my.edu -n hboot -t -c lam-conf.lam -d -s -I "-H 312.226.653.323 -P 40065 -n 1 -o 0 ""
hboot: process schema = "/etc/lam/lam-conf.lam"
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 26205 lamd -H 312.226.653.323 -P 40065 -n 1 -o 0 -d
lamboot: attempting to execute "ssh -x wolf4.my.edu -n echo $SHELL"
lamboot: got remote shell /bin/bash2
lamboot: attempting to execute "ssh -x wolf4.my.edu -n hboot -t -c lam-conf.lam -d -s -I "-H 312.226.653.323 -P 40065 -n 2 -o 0 ""
hboot: process schema = "/etc/lam/lam-conf.lam"
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 6506 lamd -H 312.226.653.323 -P 40065 -n 2 -o 0 -d
lamboot: attempting to execute "ssh -x wolf9.my.edu -n echo $SHELL"
lamboot: got remote shell /bin/bash2
lamboot: attempting to execute "ssh -x wolf9.my.edu -n hboot -t -c lam-conf.lam -d -s -I "-H 312.226.653.323 -P 40065 -n 3 -o 0 ""
hboot: process schema = "/etc/lam/lam-conf.lam"
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 24813 lamd -H 312.226.653.323 -P 40065 -n 3 -o 0 -d
------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
------------------------------------------------
wipe ...
 
LAM 6.5.9/MPI 2 C++ - Indiana University
 
Executing tkill on n0 (wolf10.my.edu)...
Executing tkill on n1 (wolf.my.edu)...
Executing tkill on n2 (wolf4.my.edu)...
Executing tkill on n3 (wolf9.my.edu)...
lamboot did NOT complete successfully

======================================================
3) lamboot failed on the wolf9.my.edu machine.
When I ran lamboot on wolf9.my.edu alone, there was
no problem. It boots fine with only wolf9.my.edu.
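
For reference, a single-host test like the one in 3) might look roughly like the sketch below. The file name wolf9-only.def is hypothetical (the run above does not say which boot schema was used), and the script assumes a boot schema that lists one hostname per line. It only calls recon and lamboot if they are found on the PATH.

```shell
#!/bin/sh
# Hypothetical file name; not taken from the output above.
SCHEMA=wolf9-only.def

# Write a one-line boot schema containing only the suspect host.
printf '%s\n' "wolf9.my.edu" > "$SCHEMA"

# Only invoke the LAM tools if they are installed on this machine.
if command -v lamboot >/dev/null 2>&1; then
    # Same options as used in 1) and 2): verbose recon, debug lamboot.
    recon -v "$SCHEMA" && lamboot -d "$SCHEMA"
else
    echo "lamboot not found; schema written to $SCHEMA"
fi
```

If this succeeds while the four-host boot still fails, the difference presumably lies in how wolf9.my.edu is reached from wolf10.my.edu rather than in the LAM installation on wolf9.my.edu itself.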

Thanks again
Manoj
