LAM/MPI General User's Mailing List Archives

From: Brian Barrett (brbarret_at_[hidden])
Date: 2004-05-18 11:57:49


On May 18, 2004, at 9:50 AM, Josh Stewart wrote:

> I am running a very large MPI computation program that runs for
> approximately an hour and then is killed via SIGKILL.  The program
> uses a large amount of memory and is creating several large files.
>
> Does anyone know under what circumstances this would be happening?

I would guess that you have stressed the kernel's virtual memory system
enough that it decided to take the easy way out and kill your process.
Unfortunately, there usually isn't a good log of these things on Linux,
so it can be hard to track down. You might want to check your
application for steadily increasing memory consumption: a program like
memprof can show memory usage patterns, and valgrind is fairly good at
finding real leaks. Of course, the quick and dirty way to check whether
the SIGKILL is coming from the kernel because of a lack of memory is to
run a program like "free" every couple of minutes and watch the numbers.
If available memory keeps shrinking, it's a good bet that your
application is being killed because the system is running out of memory.
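If you'd rather not sit there running "free" by hand, the same check can be automated. What follows is only a minimal sketch, assuming a Linux node where /proc/meminfo is readable. It approximates what "free" reports on its "-/+ buffers/cache" line (MemFree + Buffers + Cached) and uses MemAvailable instead when the kernel provides it. The function names (parse_meminfo, available_kb, is_shrinking, watch) and the 120-second interval are illustrative choices, not part of any existing tool.

```python
import sys
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value (kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        # Keep only lines whose first value is a number, e.g. "MemFree:  1234 kB".
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_kb(fields):
    """Approximate memory available to applications, in kB.

    Prefers MemAvailable if the kernel reports it; otherwise falls back to
    MemFree + Buffers + Cached, roughly what `free` shows on its
    "-/+ buffers/cache" line.
    """
    if "MemAvailable" in fields:
        return fields["MemAvailable"]
    return fields.get("MemFree", 0) + fields.get("Buffers", 0) + fields.get("Cached", 0)


def is_shrinking(samples):
    """True if there are at least two samples and each is lower than the previous."""
    return len(samples) >= 2 and all(b < a for a, b in zip(samples, samples[1:]))


def watch(interval_s=120, window=5, path="/proc/meminfo"):
    """Poll available memory forever, warning when it drops at every sample in the window."""
    history = []
    while True:
        with open(path) as f:
            kb = available_kb(parse_meminfo(f.read()))
        history = (history + [kb])[-window:]  # keep only the last `window` samples
        print(f"{time.strftime('%H:%M:%S')} available: {kb} kB", flush=True)
        if len(history) == window and is_shrinking(history):
            print("warning: available memory has dropped at every recent sample",
                  file=sys.stderr, flush=True)
        time.sleep(interval_s)
```

To use it, call watch() from an interactive Python session on the node running the job and leave it going alongside your program. If the warnings start shortly before the SIGKILL arrives, that points at memory exhaustion rather than something else sending the signal.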

Hope this helps,

Brian

-- 
   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have a LAM/MPI day: http://www.lam-mpi.org/