LAM/MPI General User's Mailing List Archives

From: Guangyu Wu (wgy_at_[hidden])
Date: 2005-11-16 00:28:00


Hi, Jeff:
The problem is now solved!
I cannot fully explain the strange behavior, but I believe it was caused by
two LAM versions co-existing.
I forgot to remove the lam6.5.9 package on 2 of the nodes before I installed
lam7.0.3. Removing it afterward did not help.
So the proper action is to remove lam6.5.9 completely before installing the
new version.
Now PBS is able to get the proper CPU time information for the MPP dyna job!
Thanks, Jeff, for all your help!
Guangyu Wu.
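
For anyone hitting the same symptom, a leftover install can be spotted by looking for more than one copy of the LAM tools on the search path. The following is only a minimal sketch: the helper name find_executables is hypothetical, and it assumes the old and new packages put their binaries in different PATH directories, which the thread does not confirm.

```python
import os


def find_executables(name, path=None):
    """Return every distinct executable called `name` on the search path, in order."""
    if path is None:
        path = os.environ.get("PATH", "")
    found = []
    seen = set()
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        # Resolve symlinks so the same file reached via two directories counts once.
        real = os.path.realpath(candidate)
        if real in seen:
            continue
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            seen.add(real)
            found.append(candidate)
    return found


if __name__ == "__main__":
    # More than one hit per tool suggests two LAM installs are mixed together.
    for tool in ("lamboot", "lamd", "mpirun"):
        hits = find_executables(tool)
        status = "MULTIPLE COPIES" if len(hits) > 1 else "ok"
        print(f"{tool}: {status} {hits}")
```

Running this on every node (not just the head node) matters here, since only 2 of the nodes still carried the old package.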

-----Original Message-----
From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On Behalf Of
Guangyu Wu
Sent: November 15, 2005 18:13
To: 'General LAM/MPI mailing list'
Subject: RE: LAM: Uable to boot lam within PBS job

Hi, Jeff:
Actually, I did not disable tm within the job; instead, I set the following
variables:
export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE INPUTRC
export CFLAGS=-I/usr/pbs/include
export CPPFLAGS=-I/usr/pbs/include
export LDFLAGS=-L/usr/pbs/lib
export LAM_MPI_SSI_BOOT=tm
As you said, it is pretty strange to see "ssi:boot:base:server: got connection
from 56.145.206.0". The strange IP changes each time I run a PBS job, but
it is always of the form 56.*.*.0!
I could not find any information about lamd on the 2nd and 3rd nodes.
The nodes are linux1~3 and the IPs are 192.168.40.81~83, with a gateway and
an ISP-provided DNS server configured.
Could the DNS service be the cause?
I notice that I get an error when issuing "host localhost" or "host linux1",
as follows:
[wgy_at_linux1 wgy]$ host linux2
Host linux2 not found: 3(NXDOMAIN)
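
Note that the host command queries DNS directly, so NXDOMAIN from an ISP's DNS server does not by itself show that the node names fail to resolve through the system resolver (for example via /etc/hosts). A small sketch of checking resolution the way an application would see it is given below; the function name check_hosts is hypothetical, and the node names are simply the ones mentioned in this message.

```python
import socket


def check_hosts(hostnames, resolve=socket.gethostbyname):
    """Map each hostname to its resolved IPv4 address, or None if lookup fails."""
    results = {}
    for name in hostnames:
        try:
            results[name] = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results


# Example call on a cluster node (performs real lookups):
#   check_hosts(["linux1", "linux2", "linux3"])
# Every node should map to its 192.168.40.x address on every machine.
```

The resolve parameter exists only so the lookup can be swapped out for testing; by default the system resolver is used.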
I am going to unset the variables to see what happens, since last time I
could boot LAM without the variables set. (The previous OS on these nodes was
RH9, and now it is RHEL3.0. At that time I could not get any CPU time
information for the MPP job, which is why I changed the OS and rebuilt the
environment.)
Please advise.
Thanks and best regards.
If you meet anyone from Altair at the SC05 event, please say hello to them.
You are going to the event, right?

-----Original Message-----
From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On Behalf Of Jeff
Squyres
Sent: November 11, 2005 19:36
To: General LAM/MPI mailing list
Subject: Re: LAM: Uable to boot lam within PBS job

On Nov 12, 2005, at 12:55 AM, Guangyu Wu wrote:

> I could boot lam universe using rsh by "lamboot -v nodes", but got the
> same error while booting within a PBS job.

If you're in a PBS job and you lamboot with a hostfile, LAM is still going
to use tm and ignore the hostfile unless you specifically disable the TM
boot module. Did you do that?

But other than that, rsh and tm use the same mechanisms to launch (i.e.,
communication-wise), so something is odd with your setup if one works and
the other does not, but both are able to actually launch the lamd's on
remote nodes.

> Thanks for your reply! Now it seems I have compiled lam with TM
> enabled.
> But I got a "The lamboot agent timed out while waiting for the
> newly-booted process" error while booting lam within a PBS job.
> The following message in the .e36 file indicates that lam was trying to
> boot via tm.
> n0<16809> ssi:boot:tm: successfully launched on n2 (linux3)
> Attached please find the job script and error output file.
> I didn't configure any rsh or ssh between the 3 nodes.
> Could you please have a look inside the file and give me some
> suggestions?

I see from your output:

n0<16809> ssi:boot:base:linear_windowed: finished launching
n0<16809> ssi:boot:base:server: expecting connection from finite list
n0<16809> ssi:boot:base:server: got connection from 192.168.40.81
n0<16809> ssi:boot:base:server: this connection is expected (n0)
n0<16809> ssi:boot:base:server: remote lamd is at 192.168.40.81:32782
n0<16809> ssi:boot:base:server: expecting connection from finite list
n0<16809> ssi:boot:base:server: got connection from 56.145.206.0
 56.97.21.0

So lamboot thinks it launched everything and then it got a callback from the
local lamd and that went fine. But then it got a callback from 56.145.206.0
-- that seems like a pretty strange IP address.
Since you're using 192.168 kinds of addresses, I'm surprised that a
non-private address is calling back, and I'm also surprised that it's a .0
address. Are you sure that your network setup is correct?
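
The check described here (a callback from an address that is neither one of the expected nodes nor in a private range) can be automated over the lamboot debug output. This is a rough sketch only: the function name suspicious_callbacks is hypothetical, and the regular expression assumes the "got connection from <IPv4>" wording seen in the output above.

```python
import ipaddress
import re

# Matches the callback lines quoted in this thread.
CALLBACK = re.compile(r"got connection from (\d{1,3}(?:\.\d{1,3}){3})")


def suspicious_callbacks(log_text, expected):
    """Return (address, is_private) for each callback not in `expected`."""
    expected_addrs = {ipaddress.ip_address(a) for a in expected}
    flagged = []
    for match in CALLBACK.finditer(log_text):
        try:
            addr = ipaddress.ip_address(match.group(1))
        except ValueError:
            continue  # not a valid IPv4 address, e.g. an octet above 255
        if addr not in expected_addrs:
            flagged.append((str(addr), addr.is_private))
    return flagged
```

With the three node addresses 192.168.40.81 through 192.168.40.83 as the expected list, the 56.145.206.0 callback is reported as unexpected and non-private, while the 192.168.40.81 callback is not flagged.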

After all this, LAM decides that it hasn't heard from all the other lamd's
in a timely fashion and gives up.

You might want to look in the syslog on the nodes that failed to boot and
see if there are any lamd messages in there (lamboot -d causes the lamd's to
dump messages to the syslog).

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/