What interface are you using to invoke MPI functions from MEX?
This sounds somewhat similar to the Matlab/MEX issues that have come up
on this list before. I have not personally worked with Matlab/MEX and
MPI, so I can't say for sure, but I think there are a few things you
need to be careful about when invoking MPI from Matlab.
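FWIW, one common pattern is to guard MPI_Init inside the MEX gateway so
that it runs exactly once per process, and to lock the module so that a
"clear" in Matlab can't unload it (and the MPI state) out from under
you. This is untested on my end, and every name and flag below is my
assumption, not something from your code:

```c
/* hello_mpi.c -- hypothetical MEX gateway sketch (not from this thread).
 * Possible build line (flags are assumptions): mex hello_mpi.c -lmpi
 */
#include "mex.h"
#include <mpi.h>

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    int initialized, rank;

    /* MPI_Init may be called only once per process; check first,
     * since Matlab can invoke this gateway many times. */
    MPI_Initialized(&initialized);
    if (!initialized) {
        /* A MEX gateway has no argc/argv; passing NULL is allowed by
         * MPI-2, but embedded environments like this are exactly where
         * things get fragile. */
        MPI_Init(NULL, NULL);
        /* Keep the MEX file resident so "clear" doesn't unload it
         * while MPI is still initialized. */
        mexLock();
    }

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    plhs[0] = mxCreateDoubleScalar((double)rank);
}
```

You'd still need to arrange for MPI_Finalize to be called exactly once
at shutdown (e.g., via mexAtExit), which is its own can of worms.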
For example, see:
http://www.lam-mpi.org/MailArchives/lam/2001/08/3070.php
http://www.lam-mpi.org/MailArchives/lam/2001/06/2737.php
http://www.lam-mpi.org/MailArchives/lam/2003/05/5931.php
Also be sure to see the following section of the LAM/MPI User Guide for
7.1.1:
-----
3.4.3 Dynamic/Embedded Environments
In LAM/MPI version 7.1.1, some RPI modules may utilize an additional
memory manager mechanism (see Section 3.3.1, page 15 for more details).
This can cause problems when running MPI processes as dynamically
loaded modules. For example, when running a LAM/MPI program as a MEX
function in a Matlab environment, normal Unix linker semantics create
situations where both the default Unix and the memory management
systems are used. This typically results in process failure.
Note that this only occurs when LAM/MPI processes are used in a dynamic
environment and an additional memory manager is included in LAM/MPI.
This appears to occur because of normal Unix semantics; the only way to
avoid it is to use the --with-memory-manager parameter to LAM's
configure script, specifying either "none" or "external" as its value.
See the LAM/MPI Installation Guide for more details.
-----
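Concretely, rebuilding LAM without the internal memory manager would
look something like this (the install prefix here is only an example):

```shell
# Reconfigure and rebuild LAM/MPI with no internal memory manager.
# The --prefix path is just an example; use your own install location.
./configure --prefix=/opt/lam-7.1.1 --with-memory-manager=none
make
make install
```

Then recompile/relink your MEX files against the rebuilt LAM.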
On Jun 1, 2005, at 8:32 PM, Paul Haney wrote:
> Hi,
> I have Matlab code that uses MEX files to call MPI routines. I can
> compile this code to a standalone executable and run parallel jobs,
> however I get system crashes maybe 30-40% of the time if I use 1
> processor (2 jobs per processor), and crashed 80-90% of the time if I
> use > 1 processor. Some details:
> I'm using LAM 7.0.4, gcc 3.2.3, Matlab 7.
> It seems as though the crash occurs right away in MPI_Init. I can see
> which process executes the first command by calling clock(), and it
> can run successfully if either node0 or node1 starts first. Anyone
> have any advice on how to proceed??
> Here's the LAM error message upon crash:
>
> ----------------------------------------------------------------------------
> It seems that [at least] one of the processes that was started with
> mpirun did not invoke MPI_INIT before quitting (it is possible that
> more than one process did not invoke MPI_INIT -- mpirun was only
> notified of the first one, which was on node n0).
>
> mpirun can *only* be used with MPI programs (i.e., programs that
> invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
> to run non-MPI programs over the lambooted nodes.
> ----------------------------------------------------------------------------
>
> Here's some info that Matlab gives me:
>
> ------------------------------------------------------------------------
> Segmentation violation detected at Wed Jun 1 18:45:50 2005
> ------------------------------------------------------------------------
>
> Configuration:
> MATLAB Version: 7.0.4.352 (R14) Service Pack 2
> MATLAB License: unknown
> Operating System: Linux 2.4.20-30.9.papismp #1 SMP Mon May 3 13:57:07
> CDT 2004 i686
> Window System: No active display
> Current Visual: None
> Processor ID: x86 Family 15 Model 2 Stepping 9, GenuineIntel
> Virtual Machine: Java 1.5.0 with Sun Microsystems Inc. Java
> HotSpot(TM) Client VM
> (mixed mode)
> Default Charset: ibm-923
>
> Register State:
> eax = 08200000 ebx = 40138b18
> ecx = 00000000 edx = 08218468
> esi = 084e7d48 edi = 00040b74
> ebp = bfff89f4 esp = bfff89dc
> eip = 4012c087 flg = 00010216
>
> Stack Trace:
> [0] libpthread.so.0:__pthread_mutex_lock~(265076, 0x085f7050
> "lsf-543646-0", 0x08218468 "/tmp", 0x085f7068
> "LAM_MPI_SESSION_SUFFIX=lsf-54364..") + 23 bytes
>
> Error in ==> MPI_Init at 3
>
> Error in ==> HelloWorld at 4
>
> ---------------------------------------------
>
> Even if the code doesn't crash and successfully says 'Hello world'
> from all of the nodes, I get the following error:
> --------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 8374 failed on node n0 (129.114.62.145) with exit status 22.
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/