Actually -- I take that back. I initially thought we added all that
debugging output in the 7.x series, but thinking about it more, I'm
pretty sure we had some level of that back in the 6.5 series.
But either way, I would *strongly* recommend upgrading if possible. We
don't really support the 6.5 series anymore (it's only available for
download for some ISVs who have QA-checked their apps with 6.5.9).
The 7.x debugging output is much more verbose, and will give a better
indication of what is going wrong.
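
If it helps to check for the version mix described in the quoted message
below, here is a rough sketch (not part of LAM itself). It scans captured
tool output for the banner line seen in the "lamboot -d" log, e.g.
"LAM 6.5.9/MPI 2 C++ - Indiana University", and reports whether more than
one LAM version shows up. The function names, the dictionary labels, and
the 7.x banner text in the usage note are assumptions made for
illustration only.

```python
import re

# Banner line as it appears in the quoted lamboot output, optionally
# preceded by quote markers such as ">> ".
BANNER = re.compile(r"^\W*LAM (\d+(?:\.\d+)+)/MPI", re.MULTILINE)


def lam_versions(text):
    """Return the set of LAM version strings found in banner lines."""
    return set(BANNER.findall(text))


def is_mixed(outputs):
    """Check captured outputs for more than one LAM version.

    outputs maps a label (a host name or a command, chosen by the caller)
    to the text that was captured for it.
    Returns (mixed, versions_per_label).
    """
    seen = {label: lam_versions(text) for label, text in outputs.items()}
    all_versions = set().union(*seen.values())
    return len(all_versions) > 1, seen
```

For example, feeding it the 6.5.9 banner from the log together with a
made-up 7.x banner:

    mixed, seen = is_mixed({
        "lamboot -d": "LAM 6.5.9/MPI 2 C++ - Indiana University",
        "other tool": "LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University",
    })

would set mixed to True, which is the situation to avoid.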
On Apr 30, 2005, at 12:12 PM, Jeff Squyres wrote:
> It looks like you are mixing versions of LAM/MPI. The debug output
> you are showing is from the 7.x series, but your "lamboot -d" output
> definitely shows version 6.5.9.
>
> You cannot mix versions of LAM -- MPI applications are source
> compatible with different versions of LAM, but not binary compatible.
> Different versions of LAM are pretty much guaranteed not to be
> compatible with each other (i.e., we make no effort for binary
> compatibility between versions).
>
> You might want to check and see if your Linux distro installed an
> older version of LAM (e.g., 6.5.9) on your machines automatically, and
> that is causing your problems with your downloaded/compiled/installed
> version of LAM.
>
>
> On Apr 30, 2005, at 10:57 AM, ew fgff wrote:
>
>> Hi Jeff,
>>
>> Thank you very much for your response.
>>
>> 1) The output from "recon -v lam-bhost.def" is:
>>
>> ======================================================
>> recon: -- testing n0 (wolf10.my.edu)
>> recon: -- testing n1 (wolf.my.edu)
>> recon: -- testing n2 (wolf4.my.edu)
>> recon: -- testing n3 (wolf9.my.edu)
>> ----------------------------------------------
>> Woo hoo!
>>
>>
>>
>> recon has completed successfully. This means that you
>> will most likely
>> be able to boot LAM successfully with the "lamboot"
>> command (but this
>> is not a guarantee). See the lamboot(1) manual page
>> for more
>> information on the lamboot command.
>>
>>
>>
>> If you have problems booting LAM (with lamboot) even
>> though recon
>> worked successfully, enable the "-d" option to lamboot
>> to examine each
>> step of lamboot and see what fails. Most situations
>> where recon
>> succeeds and lamboot fails have to do with the
>> hboot(1) command (that
>> lamboot invokes on each host in the hostfile).
>> ======================================================
>>
>> 2) The output from "lamboot -d" is:
>>
>> ======================================================
>> LAM 6.5.9/MPI 2 C++ - Indiana University
>>
>>
>>
>> lamboot: boot schema file: /etc/lam/lam-bhost.def
>> lamboot: opening hostfile /etc/lam/lam-bhost.def
>> lamboot: found the following hosts:
>> lamboot: n0 wolf10.my.edu
>> lamboot: n1 wolf.my.edu
>> lamboot: n2 wolf4.my.edu
>> lamboot: n3 wolf9.my.edu
>> lamboot: resolved hosts:
>> lamboot: n0 wolf10.my.edu --> 312.226.653.323
>> lamboot: n1 wolf.my.edu --> 312.226.653.48
>> lamboot: n2 wolf4.my.edu --> 312.226.653.98
>> lamboot: n3 wolf9.my.edu --> 312.226.653.202
>> lamboot: found 4 host node(s)
>> lamboot: origin node is 0 (wolf10.my.edu)
>> lamboot: attempting to execute "hboot -t -c
>> lam-conf.lam -d -I " -H 312.226.653.323 -P 40065 -n 0
>> -o 0 ""
>> hboot: process schema = "/etc/lam/lam-conf.lam"
>> hboot: found /usr/bin/lamd
>> hboot: performing tkill
>> hboot: tkill
>> hboot: booting...
>> hboot: fork /usr/bin/lamd
>> hboot: attempting to execute
>> [1] 10440 lamd -H 312.226.653.323 -P 40065 -n 0 -o 0
>> -d
>> lamboot: attempting to execute "ssh -x wolf.my.edu -n
>> echo $SHELL"
>> lamboot: got remote shell /bin/bash2
>> lamboot: attempting to execute "ssh -x wolf.my.edu -n
>> hboot -t -c lam-conf.lam -d -s -I "-H 312.226.653.323
>> -P 40065 -n 1 -o 0 ""
>> hboot: process schema = "/etc/lam/lam-conf.lam"
>> hboot: found /usr/bin/lamd
>> hboot: performing tkill
>> hboot: tkill
>> hboot: booting...
>> hboot: fork /usr/bin/lamd
>> [1] 26205 lamd -H 312.226.653.323 -P 40065 -n 1 -o 0
>> -d
>> lamboot: attempting to execute "ssh -x wolf4.my.edu -n
>> echo $SHELL"
>> lamboot: got remote shell /bin/bash2
>> lamboot: attempting to execute "ssh -x wolf4.my.edu -n
>> hboot -t -c lam-conf.lam -d -s -I "-H 312.226.653.323
>> -P 40065 -n 2 -o 0 ""
>> hboot: process schema = "/etc/lam/lam-conf.lam"
>> hboot: found /usr/bin/lamd
>> hboot: performing tkill
>> hboot: tkill
>> hboot: booting...
>> hboot: fork /usr/bin/lamd
>> [1] 6506 lamd -H 312.226.653.323 -P 40065 -n 2 -o 0
>> -d
>> lamboot: attempting to execute "ssh -x wolf9.my.edu -n
>> echo $SHELL"
>> lamboot: got remote shell /bin/bash2
>> lamboot: attempting to execute "ssh -x wolf9.my.edu -n
>> hboot -t -c lam-conf.lam -d -s -I "-H 312.226.653.323
>> -P 40065 -n 3 -o 0 ""
>> hboot: process schema = "/etc/lam/lam-conf.lam"
>> hboot: found /usr/bin/lamd
>> hboot: performing tkill
>> hboot: tkill
>> hboot: booting...
>> hboot: fork /usr/bin/lamd
>> [1] 24813 lamd -H 312.226.653.323 -P 40065 -n 3 -o 0
>> -d
>> ------------------------------------------------------
>> lamboot encountered some error (see above) during the
>> boot process,
>> and will now attempt to kill all nodes that it was
>> previously able to
>> boot (if any).
>>
>> Please wait for LAM to finish; if you interrupt this
>> process, you may
>> have LAM daemons still running on remote nodes.
>> ------------------------------------------------
>> wipe ...
>>
>> LAM 6.5.9/MPI 2 C++ - Indiana University
>>
>> Executing tkill on n0 (wolf10.my.edu)...
>> Executing tkill on n1 (wolf.my.edu)...
>> Executing tkill on n2 (wolf4.my.edu)...
>> Executing tkill on n3 (wolf9.my.edu)...
>> lamboot did NOT complete successfully
>>
>> ======================================================
>> 3) The lamboot failed on the wolf9.my.edu machine.
>> When I ran lamboot on wolf9.my.edu alone, there was
>> no problem; it runs fine on wolf9.my.edu by itself.
>>
>> Thanks again
>> Manoj
>>
>>
>>
>>
>> _______________________________________________
>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>
>
> --
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/
>
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/