
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-07-19 08:34:48


It looks like one of your processes is dying for some other reason,
possibly even before the barrier (it's hard to tell because your cout
statements put the \n at the beginning of the message instead of at the
end, and don't use endl).
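
As a side note, printing the rank up front and ending each line with
endl (which flushes the stream) makes the output much easier to
attribute and order. A minimal sketch, assuming mynum holds the rank as
in your code:

    #include <iostream>
    using std::cout;
    using std::endl;

    // After MPI_Comm_rank() has filled in mynum:
    cout << "GATHER, rank " << mynum << endl;  // endl flushes, so the
                                               // message appears right away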

You should probably run your application through a memory-checking
debugger such as valgrind to see if there are hidden errors that are
causing seg faults or other memory badness. See the LAM FAQ in the
debugging section for details on how to do this.
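
For example, something along these lines usually works with LAM (the
exact flags may differ depending on your valgrind version; adjust the
process count and program name to match your setup):

    mpirun -np 4 valgrind ./your_program

Valgrind will flag invalid reads/writes and use of uninitialized
memory, which are the usual suspects behind a signal 11.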

If that doesn't help, post back here again with what you found.

Good luck.

On Jul 17, 2005, at 8:24 AM, qcqc_at_[hidden] wrote:

> Greetings all.
> The program I am creating is my first ever use of MPI. It is supposed
> to be a small optimization program: a genetic algorithm that modifies
> data which is fed as input to Network Simulator 2 (ns2). The total
> population is always the same, but each process gets a part of it
> depending on the number of machines (it is run on a Debian cluster
> with NFS loading of the system). Later, bash scripts modify the data
> from ns accordingly and put it back into the program.
> Anyway, I have questions concerning sending and receiving data.
>
> This code runs in all processes:
>
> //<C++ CODE>
>
> cout<<"\nGATHER "<<mynum;
> MPI_Barrier(MPI_COMM_WORLD);
>
> MPI_Gather(punkty, pop, MPI_FLOAT, pktall, pop, MPI_FLOAT, 0, MPI_COMM_WORLD);
>
> MPI_Gather(delay, pop, tablicafloat, delayall, pop, tablicafloat, 0, MPI_COMM_WORLD);
>
> MPI_Gather(size, pop, tablicalong, sizeall, pop, tablicalong, 0, MPI_COMM_WORLD);
>
> cout<<"\n END OF GATHER";
>
> //... a part of the code executed only by the processor with mynum == 0
>
>
> cout<<"\n I am processor nr : "<<mynum;
> cout<<"\nSCATTER";
> MPI_Barrier(MPI_COMM_WORLD);
> MPI_Scatter( delayall , pop, tablicafloat, delay, pop,
> tablicafloat, 0, MPI_COMM_WORLD);
> MPI_Barrier(MPI_COMM_WORLD);
> MPI_Scatter( sizeall , pop, tablicalong, size, pop, tablicalong,
> 0, MPI_COMM_WORLD);
>
> //</C++ CODE>
>
> Output generated when 4 processes run it:
>
> //NORMAL OUTPUT
>
>
> GATHER : 1
> GATHER : 2
> END OF GATHER : 1
> I am processor nr : 1
> END OF GATHER : 2
> I am processor nr : 2
> GATHER : 3
> END OF GATHER : 3
> I am processor nr : 3
> SCATTERMPI_Recv: process in local group is dead (rank 1,
> MPI_COMM_WORLD)
> SCATTERMPI_Recv: process in local group is dead (rank 2,
> MPI_COMM_WORLD)
> Rank (1, MPI_COMM_WORLD): Call stack within LAM:
> Rank (1, MPI_COMM_WORLD): - MPI_Recv()
> Rank (1, MPI_COMM_WORLD): - MPI_Barrier()
> Rank (1, MPI_COMM_WORLD): - main()
> Rank (2, MPI_COMM_WORLD): Call stack within LAM:
> Rank (2, MPI_COMM_WORLD): - MPI_Recv()
> Rank (2, MPI_COMM_WORLD): - MPI_Barrier()
> Rank (2, MPI_COMM_WORLD): - main()
> SCATTERMPI_Recv: process in local group is dead (rank 3,
> MPI_COMM_WORLD)
> Rank (3, MPI_COMM_WORLD): Call stack within LAM:
> Rank (3, MPI_COMM_WORLD): - MPI_Recv()
> Rank (3, MPI_COMM_WORLD): - MPI_Barrier()
> Rank (3, MPI_COMM_WORLD): - main()
> -----------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
> PID 29261 failed on node n1 (192.168.0.3) due to signal 11.
> -----------------------------------------------------------------------------
> GATHER : 0
> END OF GATHER : 0
>
> //NORMAL OUTPUT
>
> I have been playing around with this for 2 days and I have no idea why
> this is happening.
>
> The program works on a single processor (including the gathering and
> scattering... but that is obvious, since the process sends and receives
> to/from itself...). What I must add is that I have gotten rid of
> MPI_Finalize... it causes the program to crash even on a single
> processor. I was trying to find the cause but could not. I made sure I
> clean up all the allocated memory, and I do. I had no other ideas for
> the cause of the crash. So maybe the lack of MPI_Finalize is the cause
> of MPI_Barrier not working... I am not sure whether you use system
> semaphores or something else... MPI_Barrier returns MPI_SUCCESS when called.
>
> Please help. Even a hint about where to look would be worth gold.
>
> Regards Krzysztof Korzunowicz
>
> PS. The output and code may differ slightly from the real thing because
> I translated both to English for the typical reader's sake.
>
> PS2. Definitions of tablicafloat and tablicalong:
>
> MPI_Datatype tablicafloat, tablicalong;
> MPI_Type_contiguous(2, MPI_FLOAT, &tablicafloat);
> MPI_Type_contiguous(2, MPI_LONG, &tablicalong);
>
> MPI_Type_commit(&tablicafloat);
> MPI_Type_commit(&tablicalong);
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/