LAM/MPI General User's Mailing List Archives

From: Brian Barrett (brbarret_at_[hidden])
Date: 2005-04-11 09:02:58


On Apr 11, 2005, at 4:45 AM, Bartlomiej Balcerek wrote:

> bartol_at_n13:~/cpmd/r3$ /usr/local/lam-7.1.1-intel81/bin/mpirun -v -np 4 /usr/local/CPMD/bin/cpmd.x-lammpi ./cls_R2F.inp
> 15420 /usr/local/CPMD/bin/cpmd.x-lammpi running on n0 (o)
> 15070 /usr/local/CPMD/bin/cpmd.x-lammpi running on n1
> 15430 /usr/local/CPMD/bin/cpmd.x-lammpi running on n2
> 15421 /usr/local/CPMD/bin/cpmd.x-lammpi running on n0 (o)
> -----------------------------------------------------------------------------
> It seems that some error has occurred during MPI_INIT. This will
> cause your process to abort. These kinds of errors are usually
> system-related, such as running out of disk space, running out of
> memory, or something more serious such as data not being passed
> between processes properly. That is, you should not be seeing this
> error message; if you are, something is likely Very Wrong with your
> system. :-(
>
> Perhaps this Unix error message will help:
>
> Unix errno: 14
> Bad address
>
> -----------------------------------------------------------------------------

Wow - that's an unusual error. As the message says, it is only printed
when something really extraordinary happens (like malloc or free
failing). The only thing I can think of is to check for odd
interactions with threading (NPTL vs. LinuxThreads or something like
that). I would try rebuilding LAM on the machine with the new kernel
to see if that helps.
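
As a side note, errno 14 is EFAULT on Linux, which the kernel returns
when a system call is handed a pointer outside the process's accessible
address space. If you want to confirm how your own system names that
number, a minimal Python sketch along these lines should do it. The
helper name describe_errno is made up for this illustration and is not
part of LAM:

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    # errno.errorcode maps numeric values to names such as "EFAULT";
    # fall back to "UNKNOWN" if the platform does not define the value.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror gives the same text the C library's strerror() would.
    return name, os.strerror(code)


if __name__ == "__main__":
    name, message = describe_errno(14)  # 14 is the value from the LAM error banner
    print(f"errno 14 = {name}: {message}")
```

This only tells you which error the system reported, not which call in
LAM triggered it; for that you still need the debugger.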

If you are willing to do some code diving (you'll have to rebuild LAM
with CFLAGS='-g' so that the library contains debugging symbols),
figuring out exactly where the problem occurs shouldn't be too hard.
The error is printed from lam_mpi_init() in lammpiinit.c, but it is
caused by something going wrong in lam_linit() in laminit.c. If you
set a breakpoint on lam_linit and step through that function, you
should find a call that returns LAMERROR - that is where the problem
starts.

As for how to debug the MPI library, aside from needing to compile LAM
with -g, the hints in the LAM/MPI FAQ for debugging MPI applications
also apply to the MPI library itself:

   http://www.lam-mpi.org/faq/category6.php3

If you would be willing to help track this issue down, please let me
know and feel free to ask any questions you might have. I'm guessing
the issue has something to do with the new kernel causing a subtle
system misconfiguration, but without recompiling LAM with debugging
symbols and a bit of debugging, I can't really even make an educated
guess.

Hope this helps,

Brian

-- 
   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have a LAM/MPI day: http://www.lam-mpi.org/