On Thu, 20 Dec 2001, Scott Morton wrote:
> >Can you provide some information about your setup? What operating system
> >are you using? How many nodes is your setup?
>
> These dual P4 nodes are running Red Hat 7.1 with 2.4.* kernels. The
> exact version varied from 2.4.9 to 2.4.13. I say varied because our
> sysadmin suspects that my problem could be due to a memory problem
> with the older (pre 2.4.10) kernels; so he upgraded all of the nodes
> to 2.4.13 and my code ran to completion. This is hardly conclusive,
> since my failure rate was only about 30%. But time will tell.
>
> If the trace back from my hung lamd doesn't give you any ideas, I
> suspect that the best way to proceed is for me to make a few more runs
> and see if the problem still occurs. If not, we can blame it on the
> kernel. If so, then we revisit it. Does that make sense to you?

It is possible that a kernel bug caused the problems you were seeing.
Running in lamd mode can be, well, abusive to the OS, and Linux isn't
known for its super-stable networking implementations.

Let us know if you are still having problems and I can investigate
further.

Thanks,
Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/