Excellent! Sorry I didn't reply earlier, but I'm not sure I would have
guessed this anyway.
This probably explains the 56.* IP addresses -- we don't make any
promises about binary compatibility between different versions of
LAM/MPI, so the IP address information that was being sent across the
wire probably didn't exactly match the format that was expected by the
other version of LAM, and things went downhill from there.
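
For anyone who hits the same symptom, one quick check for a mixed installation is to list every copy of lamboot reachable through the PATH on each node. Below is a minimal sketch in Python. The helper name find_executables is hypothetical (it is not part of LAM), and the sketch only looks at the PATH, so it will not notice a stale package whose binaries live elsewhere.

```python
import os


def find_executables(name, path=None):
    """Return every executable file called `name` on the search path, in PATH order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    seen = set()
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        # Resolve symlinks so the same binary is not reported twice.
        real = os.path.realpath(candidate)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK) and real not in seen:
            seen.add(real)
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    copies = find_executables("lamboot")
    for copy in copies:
        print(copy)
    if len(copies) > 1:
        print("warning: more than one lamboot on PATH; the first one listed wins")
```

Running it on every node and comparing the output should show whether an older lamboot is still being picked up somewhere.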
On Nov 16, 2005, at 12:28 AM, Guangyu Wu wrote:
> Hi, Jeff:
> Now the problem is solved!
> I cannot fully explain the strange behavior, but I believe it was caused
> by the co-existence of two LAM versions.
> I forgot to remove the lam6.5.9 package on 2 of the nodes before I
> installed lam7.0.3. Although I removed it afterward, it did not help.
> So the proper action is to remove lam6.5.9 cleanly before installing
> the new version.
> Now PBS is able to get the proper CPU time information of the MPP
> dyna job!
> Thanks, Jeff, for your help all the time!
> Guangyu Wu.
>
> -----Original Message-----
> From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On Behalf Of
> Guangyu Wu
> Sent: November 15, 2005 18:13
> To: 'General LAM/MPI mailing list'
> Subject: Re: LAM: Uable to boot lam within PBS job
>
> Hi, Jeff:
> Actually I did not disable tm within the job; instead I set the
> following variables:
> export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE INPUTRC
> export CFLAGS=-I/usr/pbs/include
> export CPPFLAGS=-I/usr/pbs/include
> export LDFLAGS=-L/usr/pbs/lib
> export LAM_MPI_SSI_BOOT=tm
> As you said, it is pretty strange that "ssi:boot:base:server: got
> connection from 56.145.206.0"; the strange IP changes each time I run
> a PBS job, but all are 56.*.*.0!
> I could not find any information about lamd on the 2nd and 3rd nodes.
> The nodes are linux1~3 and the IPs are 192.168.40.81~83, with a gateway
> and an ISP-provided DNS server set.
> Does it have anything to do with the DNS service?
> I notice that I get an error when issuing "host localhost" or
> "host linux1", as follows:
> [wgy_at_linux1 wgy]$ host linux2
> Host linux2 not found: 3(NXDOMAIN)
> I am going to unset the variables to see what happens, since last time I
> could boot lam without the variables set. (The previous OS of these
> nodes was RH9, and now it is RHEL3.0. But at that time I could not get
> any CPU time information of the MPP job; that is why I changed the OS
> and rebuilt the environment.)
> Please suggest.
> Thanks and best regards.
> If you meet any guys from Altair at the SC05 event, please say hello to
> them.
> You are going to the event, right?
>
> -----Original Message-----
> From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On Behalf Of Jeff
> Squyres
> Sent: November 11, 2005 19:36
> To: General LAM/MPI mailing list
> Subject: Re: LAM: Uable to boot lam within PBS job
>
> On Nov 12, 2005, at 12:55 AM, Guangyu Wu wrote:
>
>> I could boot lam universe using rsh by "lamboot -v nodes", but got the
>> same error while booting within a PBS job.
>
> If you're in a PBS job and you lamboot with a hostfile, LAM is still
> going to use tm and ignore the hostfile unless you specifically disable
> the TM boot module. Did you do that?
>
> But other than that, rsh and tm use the same mechanisms to launch
> (i.e., communication-wise), so something is odd with your setup if one
> works and the other does not, but both are able to actually launch the
> lamd's on remote nodes.
>
>> Thanks for your reply! Now it seems I have compiled lam with TM
>> enabled.
>> But I got a "The lamboot agent timed out while waiting for the
>> newly-booted process" error while booting lam within a PBS job.
>> The following message in the .e36 file indicates that lam was trying to
>> boot via tm:
>> n0<16809> ssi:boot:tm: successfully launched on n2 (linux3)
>> Attached please find the job script and error output file.
>> I didn't configure any rsh or ssh between the 3 nodes.
>> Could you please have a look inside the file and give me some
>> suggestions?
>
> I see from your output:
>
> n0<16809> ssi:boot:base:linear_windowed: finished launching
> n0<16809> ssi:boot:base:server: expecting connection from finite list
> n0<16809> ssi:boot:base:server: got connection from 192.168.40.81
> n0<16809> ssi:boot:base:server: this connection is expected (n0)
> n0<16809> ssi:boot:base:server: remote lamd is at 192.168.40.81:32782
> n0<16809> ssi:boot:base:server: expecting connection from finite list
> n0<16809> ssi:boot:base:server: got connection from 56.145.206.0
> 56.97.21.0
>
> So lamboot thinks it launched everything and then it got a callback
> from the local lamd and that went fine. But then it got a callback from
> 56.145.206.0 -- that seems like a pretty strange IP address.
> Since you're using 192.168 kinds of addresses, I'm surprised that a
> non-private address is calling back, and I'm also surprised that it's
> a .0 address. Are you sure that your network setup is correct?
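
The two oddities pointed out above (a non-private address, and a last octet of 0) can be checked mechanically against the callback addresses in the lamboot output. Below is a minimal sketch using Python's standard ipaddress module. The function name suspicious_reasons is hypothetical, and the 192.168.40.0/24 network is an assumption: the thread only gives the node addresses 192.168.40.81~83, not the netmask.

```python
import ipaddress


def suspicious_reasons(addr, expected_network="192.168.40.0/24"):
    """Return a list of reasons why a callback address looks wrong (empty if it looks fine)."""
    ip = ipaddress.ip_address(addr)
    net = ipaddress.ip_network(expected_network)  # assumed cluster network
    reasons = []
    if ip not in net:
        reasons.append("outside " + expected_network)
    if not ip.is_private:
        reasons.append("not a private address")
    if (int(ip) & 0xFF) == 0:
        reasons.append("last octet is 0")
    return reasons


if __name__ == "__main__":
    # Addresses taken from the lamboot output quoted in this thread.
    for addr in ("192.168.40.81", "56.145.206.0", "56.97.21.0"):
        print(addr, suspicious_reasons(addr) or "looks fine")
```

An address that trips all three checks is unlikely to be a real node on the cluster and is more plausibly misread data.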
>
> After all this, LAM decides that it hasn't heard from all the other
> lamd's in a timely fashion and gives up.
>
> You might want to look in the syslog on the nodes that failed to boot
> and see if there are any lamd messages in there (lamboot -d causes the
> lamd's to dump messages to the syslog).
>
> --
> {+} Jeff Squyres
> {+} The Open MPI Project
> {+} http://www.open-mpi.org/
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/