LAM/MPI General User's Mailing List Archives

From: W.PAKDEE (Watit.Pakdee_at_[hidden])
Date: 2003-06-10 14:02:34


Jeff,

Thanks for your messages.
In fact, I was not sure whether the nodes really crashed. What happened was
that sometimes a job just hung. Then neither lamclean nor lamhalt would
work, so I had to execute the wipe command and simply reboot.

Please let me know if you have any additional suggestions.
Thank you,

Watit Pakdee
Center of Combustion and Environmental Research
University of Colorado at Boulder

On Tue, 10 Jun 2003, Jeff Squyres wrote:

> Can you be specific about what you mean by "nodes crash"? Does the MPI
> job just fail (e.g., seg fault), or does the entire node reboot?

> This *should* not be a LAM/MPI problem. As long as you're using TCP/IP
> as the underlying transport, LAM doesn't really care what the actual
> device being used is -- it will use it in exactly the same way. Hence,
> whatever the device, LAM/MPI shouldn't care.

> You may be having network problems -- you might want to run some
> diagnostics to ensure that your TCP stack is functioning properly.

> On Mon, 9 Jun 2003, W.PAKDEE wrote:

> > I am using LAM/MPI parallel computing. LAM 6.5.6 was installed. Recently
> > I upgraded the network from a Megabit to Gigabit Ethernet. I changed the
> > switch and network card. So now I use a Gigabit switch and the Intel
> > PRO/1000 MT Desktop Adapter.
> >
> > As a result, it is processing a lot faster, but my jobs are no longer
> > stable. With the exact same job submitted at different times, different
> > results were obtained. Many times, the process finished with an error.
> > Sometimes nodes crash. (I always execute lamclean before each mpirun)
> >
> > What could cause the problem? Hardware? Do I have to re-install LAM? Any
> > suggestions are appreciated. Thank you, -Watit
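
As a rough illustration of the kind of TCP diagnostic suggested in the thread, here is a minimal Python sketch that pushes random payloads through a TCP connection and checks that they come back unchanged. It is not part of LAM/MPI and does not use any LAM tool; all function names (recv_exact, serve_echo, count_corrupt_rounds, loopback_self_test) and the length-prefixed framing are assumptions made up for this example. On a real cluster one would presumably run the echo side on one node and the client side on another; the self-test below only exercises the loopback interface.

```python
import hashlib
import os
import socket
import struct
import threading


def recv_exact(sock, n):
    """Read exactly n bytes from sock, or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(min(65536, n - len(buf)))
        if not chunk:
            raise ConnectionError("peer closed after %d of %d bytes" % (len(buf), n))
        buf.extend(chunk)
    return bytes(buf)


def serve_echo(listener):
    """Accept one connection and echo each length-prefixed message back."""
    conn, _ = listener.accept()
    with conn:
        while True:
            try:
                header = recv_exact(conn, 4)
            except ConnectionError:
                break  # client closed the connection
            (size,) = struct.unpack("!I", header)
            # Read the whole message before replying, so neither side
            # blocks on a full send buffer while the other is still sending.
            conn.sendall(recv_exact(conn, size))


def count_corrupt_rounds(host, port, size=1 << 20, rounds=20):
    """Send random payloads; count rounds whose echo has a different digest."""
    bad = 0
    # The timeout turns a hung transfer into an exception instead of a stall.
    with socket.create_connection((host, port), timeout=30) as sock:
        for _ in range(rounds):
            payload = os.urandom(size)
            sock.sendall(struct.pack("!I", size) + payload)
            echoed = recv_exact(sock, size)
            if hashlib.sha256(echoed).digest() != hashlib.sha256(payload).digest():
                bad += 1
    return bad


def loopback_self_test(size=1 << 16, rounds=5):
    """Run the echo server and the client in one process over 127.0.0.1."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_echo, args=(listener,), daemon=True)
    server.start()
    try:
        return count_corrupt_rounds("127.0.0.1", port, size, rounds)
    finally:
        server.join(timeout=5)
        listener.close()


if __name__ == "__main__":
    print("corrupt rounds:", loopback_self_test())
```

Since TCP carries its own checksums, a faulty link is more likely to show up here as a timeout or a reset connection than as a digest mismatch, so an exception from this script is as informative as a non-zero count.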