LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-08-03 06:48:15


Signal 13 is a SIGPIPE, which usually means that you're writing to a
pipe with no readers. I can't say for sure, but this doesn't look like
a LAM error. Are you sure that your program is dying within MPI_SEND
(or some other MPI function)? And even if so, are you sure that your
program is correct?

For example, I notice that you appear to be using EOF as the "I'm
finished sending" marker. EOF is typically -1, but since you're
assigning it to an unsigned char, it's probably being sent as 255.
Hence, I don't know what your "server side" code is doing, but it could
well be misinterpreting valid characters in the file as EOF, and
therefore the server may be "hanging up" early.
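As a minimal illustration (a standalone sketch, not taken from your program), this is what happens when EOF is stored in an unsigned char:

    #include <stdio.h>

    int main(void)
    {
        unsigned char c = EOF;     /* EOF is the int -1; c becomes 255 (0xFF) */
        printf("c = %u\n", c);     /* prints 255 */
        printf("%d\n", c == EOF);  /* prints 0: c promotes to 255, EOF is -1 */
        return 0;
    }

A binary file like /bin/ls could well contain 0xFF bytes, while a text file
like /etc/hosts almost certainly does not -- which would match the symptom
you're seeing.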

I would suggest sending a message on a different tag to indicate that
you have completed sending (note that because of MPI ordering
guarantees, simply sending an "I'm done" message is probably not enough
-- you'll probably want to send a cheap checksum, or total number of
bytes sent, or something that the server can check in its receive loop
to know when it has finished).
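Something along these lines might work (a rough sketch only, with a
hypothetical DONE_TAG and a total byte count as the "I'm done" payload):

    #include <mpi.h>
    #include <stdio.h>

    #define DATA_TAG 1   /* same tag your read_file() already uses */
    #define DONE_TAG 2   /* hypothetical separate tag for completion */

    /* Sender: after the data loop, tell a node how many bytes it was sent. */
    void send_done(int node, long bytes_sent)
    {
        MPI_Send(&bytes_sent, 1, MPI_LONG, node, DONE_TAG, MPI_COMM_WORLD);
    }

    /* Receiver: probe the next message and branch on its tag; stop once the
       byte count has arrived and that many data bytes have been written. */
    void receive_loop(FILE *dst_file)
    {
        long expected = -1, received = 0;
        unsigned char c;
        MPI_Status status;

        while (expected < 0 || received < expected) {
            MPI_Probe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == DONE_TAG)
                MPI_Recv(&expected, 1, MPI_LONG, 0, DONE_TAG,
                         MPI_COMM_WORLD, &status);
            else {
                MPI_Recv(&c, 1, MPI_CHAR, 0, DATA_TAG,
                         MPI_COMM_WORLD, &status);
                fwrite(&c, 1, 1, dst_file);
                received++;
            }
        }
    }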

Also, I assume you're only fread()'ing and sending 1 character at a
time simply as a test, right? Operating one character at a time is
going to be quite inefficient.
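When you move past the test stage, something like this (again just a sketch,
with a hypothetical BUF_SIZE) reads and sends a block at a time instead:

    #include <mpi.h>
    #include <stdio.h>

    #define BUF_SIZE 4096   /* hypothetical block size */

    /* Read the file in chunks and send each chunk in a single MPI_Send;
       the receiver can call MPI_Get_count() on its MPI_Status to find out
       how many bytes actually arrived in each message. */
    void send_file_blocks(FILE *src_file, int dest)
    {
        unsigned char buf[BUF_SIZE];
        size_t n;

        while ((n = fread(buf, 1, BUF_SIZE, src_file)) > 0)
            MPI_Send(buf, (int) n, MPI_CHAR, dest, 1, MPI_COMM_WORLD);
    }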

On Aug 2, 2004, at 5:01 PM, bcruchet_at_[hidden] wrote:

> Hi ...
>
> I have a problem with a program written with LAM. The program copies a file
> from node 0 to 3 other nodes.
>
> I don't have a problem when I run the program like this:
>
> mpirun -np 4 mpi_copy /etc/hosts /home/mpi/hosts
>
> This copies the local file /etc/hosts to /home/mpi/hosts on the nodes. I
> tried it with large files and had no problem, but when I run the program
> like this:
>
>
> mpirun -np 4 mpi_copy /bin/ls /home/mpi/ls
>
> the program crashes and the error is this:
>
> MPI_Recv: process in local group is dead (rank 2, MPI_COMM_WORLD)
> Rank (2, MPI_COMM_WORLD): Call stack within LAM:
> Rank (2, MPI_COMM_WORLD): - MPI_Recv()
> MPI_Recv: process in local group is dead (rank 3, MPI_COMM_WORLD)
> Rank (3, MPI_COMM_WORLD): Call stack within LAM:
> Rank (3, MPI_COMM_WORLD): - MPI_Recv()
> Rank (2, MPI_COMM_WORLD): - main()
> -----------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 28682 failed on node n0 (192.168.1.200) due to signal 13.
> -----------------------------------------------------------------------------
> Rank (3, MPI_COMM_WORLD): - main()
>
>
> and ... the file read fails at random.
>
>
> The read function on the master node is this:
>
> ---------------------------------------------------------------------------
> void read_file(FILE *src_file, int cluster_size, int block_size)
> {
>     unsigned int current_nodo, i, temp;
>     unsigned char c;
>
>     current_nodo = 1;
>     i = 0;
>     temp = 0;
>
>     while (fread(&c, 1, 1, src_file) > 0)
>     {
>         // send the data
>         MPI_Send(&c, 1, MPI_CHAR, current_nodo, 1, MPI_COMM_WORLD);
>
>         if (i >= block_size)
>         {
>             current_nodo++;
>             i = 0;
>         }
>         else
>         {
>             i++;
>         }
>     }
>
>     current_nodo = 1;
>     while (current_nodo < cluster_size)
>     {
>         printf("Sending EOF to node: %i\n", current_nodo);
>         c = EOF;
>         MPI_Send(&c, 1, MPI_CHAR, current_nodo, 1, MPI_COMM_WORLD);
>         current_nodo++;
>     }
> }
>
>
>
> Greetings from Chile
>
> Boris
> -------
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/