Hello,
I attach a very simple MPI program, which posts a number of MPI_Isend()s and
then creates a new thread to receive all posted messages (consider the
2-processor case). Note that the thread is created _after_ all MPI_Isend()s,
and only a single thread is within MPI at a time. The program completes OK
when I just call thread_func(), but when I create a thread to execute the
very same function, MPI_Get_count() returns bogus tag and source values:
1: MPI_Isend OK
1: MPI_Isend OK
1: Receiving 2048 bytes, tag = 48, src = 0
1: MPI_Recv OK
1: Receiving 2048 bytes, tag = 4294934530, src = 4294934530
MPI_Recv: invalid tag argument (rank 1, MPI_COMM_WORLD)
Rank (1, MPI_COMM_WORLD): Call stack within LAM:
Rank (1, MPI_COMM_WORLD): - MPI_Recv()
Rank (1, MPI_COMM_WORLD): - main()
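A side note on the bogus value above: 4294934530 equals 2^32 - 32766, so it looks like a small negative int that was printed with an unsigned format, rather than random garbage. The sketch below only illustrates that arithmetic; as_signed32() is a hypothetical helper name and is not part of MPI or of the attached program, and the assumption that the tag and source fields are 32-bit ints is mine.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: recover the signed 32-bit value behind a number
   that was printed with an unsigned format such as %u. The arithmetic is
   done in 64 bits to avoid relying on implementation-defined conversion. */
static int32_t as_signed32(uint32_t printed)
{
    if (printed <= (uint32_t)INT32_MAX) {
        return (int32_t)printed;      /* non-negative: unchanged */
    }
    /* Two's-complement value: printed - 2^32 */
    return (int32_t)(-(int64_t)(UINT32_MAX - printed) - 1);
}

/* Print both readings of a value taken from the log output. */
static void show(uint32_t printed)
{
    printf("%lu as signed int: %ld\n",
           (unsigned long)printed, (long)as_signed32(printed));
}
```

Calling show(4294934530u) prints -32766 as the signed reading, while the good tag 48 from the first receive comes back unchanged.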
I checked the LAM/MPI documentation, which says:
"If user programs utilized multiple threads, they must ensure that only one
thread uses LAM at a time. Unpredictable results (read: crash and burn)
will occur if multiple threads access LAM simultaneously."
I believe that's what I have! Only one thread is in MPI!
I compiled my program with the -D_REENTRANT and -lpthread options. Some
messages from the archive suggested that MPI itself must be compiled with
-D_REENTRANT. I recompiled LAM 7.0 with this flag -- it didn't help...
I also read about MPI_Init_thread(), but a comment in the 7.0 source says:
"Using 'MPI_THREAD_SERIALIZED' will cause LAM to place locks around all
MPI calls such that only one thread will be able to enter the MPI
library at a time; beware of this fact for portability with other MPI
implementations. Even with multiple threads, deadlock is still
possible when using 'MPI_THREAD_SERIALIZED' -- applications still need
to be aware of this and code appropriately."
Well... I don't need locks in this program! There's nothing to
synchronize; it should just work! What is wrong -- the LAM manual, which I
cited here and followed, or this program?
Please, anybody, help me out here! I am totally puzzled... If MPI just
shouldn't be used from multiple threads, can you explain to me WHY? (Assume
everything is perfectly synchronized.) Thank you very much in advance!
P.S. One more detail: it crashes on Solaris, but runs fine on RH9
2.4.20-19.9...
--
Andriy Fedorov
Department of Computer Science,
College of William & Mary
P.O. Box 8795
Williamsburg, VA 23185-8795, USA
---
http://www.cs.wm.edu/~fedorov