
LAM/MPI General User's Mailing List Archives


From: Lily Li (lily.li_at_[hidden])
Date: 2005-05-11 11:51:27


Hi Jeff, thanks for the reply.

We do plan to use LAM 7.1.1 for our next release, but our current
production systems are running LAM 7.0, and it is not easy for us to
switch quickly across many processing centers.

For the last week or so, we've been watching the behavior of lamd. It
seems the fault tolerance mode of lamboot makes things worse. I'd really
appreciate it if you could give me more information about lamd so we can
find the root of the problem. We badly need a workaround for now.

(Our lamboot command is: lamboot -d -x host_list_file, and we are using
the TCP SSI RPI.)
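For reference, the boot sequence above can be sketched as a small script. The host names are illustrative placeholders; the real host_list_file lists the production nodes:

```shell
# Illustrative host file: one host per line (names are placeholders).
cat > host_list_file <<'EOF'
node001
node002
node003
EOF

# Boot LAM with debugging output (-d) and fault tolerance (-x), as above.
# Guarded so the sketch does nothing on machines without LAM installed.
if command -v lamboot >/dev/null 2>&1; then
    lamboot -d -x host_list_file
fi
```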

1. For every lamnodes command, does lamd do an up-to-date sync/check on
   all nodes in the LAM universe, or does it just report the current
   routing table in memory? That is, if a lamd died after the last
   heartbeat check from the head node, will lamnodes on the head node
   show that node as invalid?

2. While lamd is doing the heartbeat check on other nodes, can it still
   respond to client requests such as lamnodes, or is lamd in blocking
   mode during the check?

3. When a lamd does find an invalid node, does it share that information
   with the other live nodes, so that the result of lamnodes is the same
   on all live nodes?

4. We find that when mpirun starts a job and the tasks in that job crash
   due to an I/O problem before they call MPI_Init(), the lamd on the
   mpirun node marks many other nodes as invalid. Does this have anything
   to do with the fault tolerance mode? Our mpirun uses options like
   "-f -w -ssi rpi tcp"; sometimes we also need to export $LD_PRELOAD.
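For context, the launch described in question 4 looks roughly like this. The binary and preloaded library names are placeholders, not our actual production names:

```shell
# Hypothetical preloaded library; in production this is exported only
# for the jobs that need it.
LD_PRELOAD=/path/to/preload_lib.so

# Launch as in production, guarded so the sketch is harmless on
# machines without LAM installed. "C" asks for one task per CPU.
if command -v mpirun >/dev/null 2>&1; then
    export LD_PRELOAD
    mpirun -f -w -ssi rpi tcp C ./processing_task
fi
```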

5. According to the documentation, mpirun uses the "-c2c" option by
   default. After the job starts, are there still connections between
   each task and the lamd on the same node? If for some reason lamd
   crashes on some of a job's nodes, will the job continue running to
   completion?

6. The last thing we can do is use the LAM_MPI_SESSION_SUFFIX environment
   variable to have each job boot its own LAM universe. Are there any
   drawbacks to this option in terms of performance or stability?
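If it helps, the per-job workaround in question 6 can be sketched like this (the script structure and names are illustrative):

```shell
# Give this job its own LAM universe via a unique session suffix
# (here derived from the shell PID; a scheduler job ID would also work).
LAM_MPI_SESSION_SUFFIX="job_$$"
export LAM_MPI_SESSION_SUFFIX

# Boot, run, and halt a private universe for this job only; guarded so
# the sketch is harmless on machines without LAM installed.
if command -v lamboot >/dev/null 2>&1; then
    lamboot -x host_list_file
    mpirun -f -w -ssi rpi tcp C ./processing_task   # placeholder binary
    lamhalt
fi
```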

Hopefully these are not hard questions for you, and I'd like to say that
you guys are great at supporting LAM.
Will the new Open MPI have a similar level of support to LAM?

Thanks a lot.

Lily Li

On Tue, 2005-05-10 at 20:00, Jeff Squyres wrote:

> On May 5, 2005, at 12:59 PM, Lily Li wrote:
>
> > We are having a problem with LAM 7.0 on a linux RedHat 9 cluster
> > using ethernet.
> >
> > The cluster has 128 nodes. The lamboot was successful, but after
> > running for about a day or so, the lamnodes command starts to hang on
> > the first node. All the others seem to be working just fine, but they
> > report the first node as invalid.
> >
> > The lamd on the first node is still running; it just does not respond
> > to any LAM commands, such as lamnodes, mpitask, etc.
>
> Yikes. That clearly shouldn't happen. :-(
>
> > All nodes have 2 NICs, but the first one is not configured with an IP.
>
> This shouldn't be an issue.
>
> > After LAM has been booted for about a day or so, we also sometimes
> > see a message like:
> >
> > rcmd: socket: all port in use.
>
> Hum. That's an odd message. I'm not sure that it's from us -- rcmd is
> a system-level service, if I recall correctly, and not one that LAM
> uses.
>
> > Does this problem sound like a system/firewall configuration error or
> > an error in lamnodes/lamd?
>
> It *sounds* like a lamd error, but not behavior that we have seen
> before. It could also be an OS issue, where somehow inbound network
> connections are not actually getting to the lamd.
>
> Unfortunately (or fortunately?), the backtrace simply shows that the
> lamd is in its main processing loop -- nothing too strange showing up
> there.
>
> The one obvious question I have to ask -- is there any way that you can
> upgrade to LAM 7.1.1 and see if you see the same behavior? There have
> been a small number of lamd fixes since 7.0.