On Mon, 9 Jun 2003, W.PAKDEE wrote:
> I am using LAM/MPI parallel computing. LAM 6.5.6 was installed. Recently
> I upgraded the network from a Megabit to Gigabit Ethernet. I changed the
> switch and network card. So now I use a Gigabit switch and the Intel
> PRO/1000 MT Desktop Adapter.
>
> As a result, it is processing a lot faster, but my jobs are no longer
> stable. With the exact same job submitted at different times, different
> results were obtained. Many times, processes finished with errors.
> Sometimes nodes crash. (I always execute lamclean before each mpirun.)
Can you be specific about what you mean by "nodes crash"? Does the MPI
job just fail (e.g., seg fault), or does the entire node reboot?
> What could cause the problem? Hardware? Do I have to re-install LAM? Any
> suggestions are appreciated. Thank you, -Watit
This *should* not be a LAM/MPI problem. As long as you're using TCP/IP
as the underlying transport, LAM doesn't really care what the actual
device being used is -- it will use it in exactly the same way. Hence, if
you simply provide a different hostname (which maps to a different
device), LAM/MPI shouldn't care.
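Since LAM only ever sees hostnames, one quick sanity check is to confirm that the names in your boot schema actually resolve to the addresses of the new Gigabit interfaces and not the old ones. A minimal Python sketch, assuming you pass it the hostnames from your own boot schema (nothing here is part of LAM):

```python
import socket
import sys


def addresses_for(hostname):
    """Return the sorted IPv4 addresses that `hostname` resolves to."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Hostnames come from the command line; "localhost" is only a fallback.
    for name in sys.argv[1:] or ["localhost"]:
        print(name, "->", ", ".join(addresses_for(name)))
```

Compare the printed addresses with the one configured on each Intel card (e.g., as reported by `ifconfig`). If a name still maps to the old interface, LAM will happily keep using it.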
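You may be having network problems -- you might want to run some
diagnostics (e.g., moving large amounts of data between pairs of nodes
over the new Gigabit links and checking that it arrives intact) to
ensure that your TCP stack is functioning properly.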
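One such diagnostic, sketched in Python under the assumption that you can use a free TCP port on the nodes: stream a known pseudo-random pattern from one node to another and check that every byte arrives. The script name `tcpcheck.py` and its `server`/`client` arguments are made up for illustration; this is not a LAM tool.

```python
import hashlib
import socket
import sys
import threading


def make_payload(size, seed=b"lam-tcp-check"):
    # Deterministic pseudo-random bytes: both ends regenerate the same
    # pattern, so only the payload itself has to cross the network.
    return hashlib.shake_256(seed).digest(size)


def serve_once(listener, size):
    # Accept a single connection and stream the whole test pattern to it.
    conn, _ = listener.accept()
    with conn:
        conn.sendall(make_payload(size))


def receive_and_check(host, port, size):
    # Read until `size` bytes arrive or the peer closes, then compare.
    # Returns -1 if everything matched, else the offset of the first bad
    # byte (or the point where the data stopped arriving).
    expected = make_payload(size)
    received = bytearray()
    with socket.create_connection((host, port), timeout=30) as sock:
        while len(received) < size:
            chunk = sock.recv(min(1 << 16, size - len(received)))
            if not chunk:
                break
            received += chunk
    if received == expected:
        return -1
    for i, (got, want) in enumerate(zip(received, expected)):
        if got != want:
            return i
    return len(received)


def self_test(size=8 << 20):
    # Loopback run: exercises only the local TCP stack, not the NIC or switch.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.settimeout(30)  # don't block forever if the client never connects
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    t = threading.Thread(target=serve_once, args=(listener, size))
    t.start()
    try:
        return receive_and_check("127.0.0.1", port, size)
    finally:
        t.join()
        listener.close()


if __name__ == "__main__":
    # tcpcheck.py server PORT SIZE        (on the sending node)
    # tcpcheck.py client HOST PORT SIZE   (on the receiving node)
    # tcpcheck.py                         (loopback self-test)
    args = sys.argv[1:]
    if args and args[0] == "server":
        with socket.create_server(("", int(args[1]))) as listener:
            serve_once(listener, int(args[2]))
    elif args and args[0] == "client":
        bad = receive_and_check(args[1], int(args[2]), int(args[3]))
        print("OK" if bad == -1 else f"MISMATCH at byte {bad}")
    else:
        print("loopback:", "OK" if self_test() == -1 else "MISMATCH")
```

For example, run `python3 tcpcheck.py server 5000 50000000` on one node and `python3 tcpcheck.py client nodeA 5000 50000000` on another (`nodeA` being a hypothetical hostname for the first node). Repeat it several times and between different pairs of nodes, since the problem you describe is intermittent. A mismatch, a short read, or a timeout between real nodes would point at the network path rather than at LAM.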
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/