LAM/MPI General User's Mailing List Archives

From: Amey Dharurkar (adharurk_at_[hidden])
Date: 2003-10-09 13:25:31


Hello,

> Hi,
>
> under which circumstances does LAM throw an error message like this:
> ---snip---
> Frequency Step Number 22
> Frequency Step Number 23
> Frequency Step Number 24
> MPI_Recv: unclassified: Cannot allocate memory (rank 3, comm 4)
> Rank (7, MPI_COMM_WORLD): Call stack within LAM:
> Rank (7, MPI_COMM_WORLD): - MPI_Recv()
> Rank (7, MPI_COMM_WORLD): - main()
> MPI_Recv: unclassified: Cannot allocate memory (rank 3, comm 4)
> Rank (11, MPI_COMM_WORLD): Call stack within LAM:
> Rank (11, MPI_COMM_WORLD): - MPI_Recv()
> Rank (11, MPI_COMM_WORLD): - main()
> MPI_Recv: unclassified: Cannot allocate memory (rank 2, comm 4)
> Rank (14, MPI_COMM_WORLD): Call stack within LAM:
> Rank (14, MPI_COMM_WORLD): - MPI_Recv()
> Rank (14, MPI_COMM_WORLD): - main()
> MPI_Recv: unclassified: Cannot allocate memory (rank 1, comm 4)
> Rank (13, MPI_COMM_WORLD): Call stack within LAM:
> Rank (13, MPI_COMM_WORLD): - MPI_Recv()
> Rank (13, MPI_COMM_WORLD): - main()
> -----------------------------------------------------------------------------
>
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 5096 failed on node n4 with exit status 1.
> -----------------------------------------------------------------------------
> ---snip---
>
> "Cannot allocate memory" is obvious. But on a node with 2GB RAM, no other
> procs, and a local matrix size of 142MB? LAM version is 6.5.9, taken from
> SuSE Linux 8.2. The failing program is used for parallel matrix setup and
> decomposition using BLACS and ScaLAPACK on a 16-node P4 cluster system.
>
> Any comments appreciated.

The error is being thrown because LAM is unable to complete a malloc()
internally. Since you are dealing with very large messages and data
structures, memory is being consumed rapidly, which is why you are
getting the error.

I suggest checking the program with a memory-checking debugger
(valgrind/bcheck) for obvious problems. Also note that, to do so, you
need to recompile LAM with --with-purify.
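
Before reaching for a full memory checker, it may also help to confirm
whether memory really grows from one frequency step to the next. The
following is only a minimal sketch under stated assumptions: the helper
names peak_rss_kb() and report_step_memory() are hypothetical (they are
not part of LAM or MPI), and the kilobyte unit of ru_maxrss is the Linux
behaviour; other systems may report it differently.

```c
/* Sketch: log the peak resident memory of a process once per step.
 * Both helpers are hypothetical names, not part of LAM or MPI. */
#include <stdio.h>
#include <sys/resource.h>

/* Returns the peak resident set size of the calling process,
 * or -1 if getrusage() fails. On Linux the value is in kilobytes. */
long peak_rss_kb(void)
{
    struct rusage usage;

    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1;
    return usage.ru_maxrss;
}

/* Prints one line to stderr; intended to be called from the
 * frequency loop with the caller's own rank and step counter. */
void report_step_memory(int rank, int step)
{
    fprintf(stderr, "rank %d, step %d: peak RSS %ld kB\n",
            rank, step, peak_rss_kb());
}
```

Calling report_step_memory(rank, step) right after each "Frequency Step
Number" line is printed would show whether the peak climbs steadily
toward the node's limit or stays near the expected local matrix size.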

Hope this helps.

---------------
Amey Dharurkar
Graduate Student, Indiana University.

>
> Regards,
> Michael
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>