LAM/MPI General User's Mailing List Archives

From: Alastuey, Lucas (Lucas.Alastuey_at_[hidden])
Date: 2006-11-22 12:12:06


How can we debug this kind of error?

This message is not very descriptive:
> MPI_Recv: process in local group is dead (rank 1, MPI_COMM_WORLD)
> Rank (1, MPI_COMM_WORLD): Call stack within LAM:
> Rank (1, MPI_COMM_WORLD): - MPI_Recv()
> Rank (1, MPI_COMM_WORLD): - main()

With GDB? Valgrind?
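
One common way to get a debugger onto a rank that dies is to have every process announce its host and PID and then pause until someone attaches. The following is a minimal sketch of that idea, not something LAM provides: the function name `wait_for_debugger`, the flag `debugger_attached`, and the environment variable `WAIT_FOR_DEBUGGER` are all hypothetical, and the code assumes a POSIX system.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical flag: set it to 1 from the debugger to let the process go on. */
static volatile int debugger_attached = 0;

/*
 * Pause until a debugger releases the process, but only when the
 * (hypothetical) environment variable WAIT_FOR_DEBUGGER is set.
 * Returns 1 if the process paused, 0 if it continued immediately.
 */
int wait_for_debugger(void)
{
    char host[256];

    if (getenv("WAIT_FOR_DEBUGGER") == NULL)
        return 0;

    if (gethostname(host, sizeof host) != 0)
        host[0] = '\0';
    host[sizeof host - 1] = '\0';   /* gethostname may not terminate on truncation */

    fprintf(stderr, "PID %ld on %s waiting for debugger\n",
            (long) getpid(), host);
    fflush(stderr);

    /* Spin until someone attaches and changes the flag. */
    while (!debugger_attached)
        sleep(1);

    return 1;
}
```

Assuming the call is placed near the top of main (in an MPI program, presumably right after MPI_Init), you could log in to the printed host, run `gdb -p <PID>`, move up to the wait_for_debugger frame, type `set var debugger_attached = 1`, and then `continue` until the crash. Running the program under Valgrind is another option for locating an invalid write.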

-----Original Message-----
From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On Behalf Of Nam Hoang
Sent: Tuesday, November 21, 2006 11:39 p.m.
To: lam_at_[hidden]
Subject: Re: LAM: lam Digest, Vol 813, Issue 1

Hi Hector,
I think your program is not written correctly.
First, you should allocate writable memory for message (for
example: char message[12], or use malloc). As posted, message
points to a string literal, and MPI_Recv cannot safely write into it.
Second, the initialization of message belongs in the code
that only node 0 (the sending node) runs:
if (rank == 0)
    {
      strcpy (message, "Hello world");
      for (i = 1; i < size; i++)
        {
          MPI_Send (message, 12, MPI_CHAR, i, tag, MPI_COMM_WORLD);
        }
    }

Hope this helps :)
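
To make the buffer-size point concrete, here is a small sketch that leaves MPI out entirely and only shows how the 12 in the calls above relates to the string: 11 characters plus the terminating '\0'. The names `MSG_TEXT`, `MSG_LEN`, and `fill_message` are hypothetical helpers for illustration and are not part of MPI or LAM.

```c
#include <stddef.h>
#include <string.h>

/* The text from the thread; sizeof counts the terminating '\0', so MSG_LEN is 12. */
#define MSG_TEXT "Hello world"
enum { MSG_LEN = sizeof(MSG_TEXT) };

/*
 * Copy the greeting into a caller-supplied writable buffer.
 * Returns 0 on success, -1 if the buffer is missing or too small.
 * (fill_message is a hypothetical helper, not part of MPI.)
 */
int fill_message(char *buf, size_t capacity)
{
    if (buf == NULL || capacity < MSG_LEN)
        return -1;
    memcpy(buf, MSG_TEXT, MSG_LEN);   /* the 11 characters plus '\0' */
    return 0;
}
```

In the program from the thread, every rank would then declare `char message[MSG_LEN];`, rank 0 would call `fill_message (message, sizeof message)` before its MPI_Send loop, and the other ranks would pass the same array to MPI_Recv with a count of MSG_LEN.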
--- lam-request_at_[hidden] wrote:

> Today's Topics:
>
> 1. Re: Unable to boot Lam in a remote machine (460853_at_[hidden])
>
> From: 460853_at_[hidden]
> To: lam_at_[hidden]
> Date: Mon, 20 Nov 2006 18:24:24 +0100
> Subject: Re: LAM: Unable to boot Lam in a remote machine
>
> Hello everyone
>
> Well, first of all, thank you for answering. I'd also like to apologize
> for not having been able to write earlier, but some family duties kept
> me away from all this for a while.
>
> Next, I'd like to say that the trouble I asked about in my previous mail
> was solved by disabling the firewall, so that certainly was the problem.
> The thing is that now I'm having another problem.
>
> After disabling the firewall and managing to set the environment up, I
> looked on the Internet for a very simple program (actually, a "Hello
> World") written with MPI:
>
> ---------------------prueba.c ------------------
> /* C Example */
> #include <stdio.h>
> #include <mpi.h>
> #include <math.h>
>
> void
> main (argc, argv)
>      int argc;
>      char *argv[];
> {
>   char *message = "Hello world";
>   int rank, size, i, tag, node;
>   MPI_Status status;
>
>   MPI_Init (&argc, &argv);                /* starts MPI */
>   MPI_Comm_rank (MPI_COMM_WORLD, &rank);  /* get current process id */
>   MPI_Comm_size (MPI_COMM_WORLD, &size);  /* get number of processes */
>   tag = 100;
>
>   if (rank == 0)
>     {
>       for (i = 1; i < size; i++)
>         {
>           MPI_Send (message, 12, MPI_CHAR, i, tag, MPI_COMM_WORLD);
>         }
>     }
>   else
>     {
>       MPI_Recv (message, 12, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
>     }
>
>   printf ("node:%d %s\n", rank, message);
>   MPI_Finalize ();
> }
> --------------------------------------------
>
> I compile it with: mpicc -o prueba.exe prueba.c
> (It's a Linux system, so I know the .exe suffix is unnecessary, but I
> added it to make it obvious which file is the executable.)
> Then I place a copy of that executable in a folder that is in the PATH
> on both computers (specifically, $HOME/bin/).
>
> Next, I start the environment properly (well... "properly", I guess):
> ---------------------------------------------
> hector_at_rdp13:~/Pa aprendé/Pruebas MPI> lamboot -v lamhosts
>
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
> n-1<26498> ssi:boot:base:linear: booting n0 (155.210.155.67)
> n-1<26498> ssi:boot:base:linear: booting n1 (155.210.155.70)
> n-1<26498> ssi:boot:base:linear: finished
> ----------------------------------------------
>
> But when I try to execute it with mpirun, I get the following output:
> ---------------------------------------------
> hector_at_rdp13:~/bin> mpirun -v -np 2 prueba.exe
> 26535 prueba.exe running on n0 (o)
> 4861 prueba.exe running on n1
> node:0 Hello world
> MPI_Recv: process in local group is dead (rank 1, MPI_COMM_WORLD)
> Rank (1, MPI_COMM_WORLD): Call stack within LAM:
> Rank (1, MPI_COMM_WORLD): - MPI_Recv()
> Rank (1, MPI_COMM_WORLD): - main()
> ---------------------------------------------
>
> It seems that node 1 (the remote node) is not working. It says it's
> "dead". I searched Google for this error message, and I understood that
> what is happening is that the process is not running on the remote
> machine. It was also said that this can happen because the
> MPI_Finalize (); instruction was executed too soon. I don't think that
> can be it in this case, because it is an absolutely simple program
> downloaded from an example web page, so I guess it should work.
>
> I would also like to say that on the remote machine, after setting up
> the environment with the lamboot command, "ps aux" shows (among many
> other things) a lamd daemon running:
>
> -----------------------------------
> hector_at_venus2:~/bin> ps aux
> USER       PID %CPU %MEM   VSZ  RSS TTY STAT START TIME COMMAND
> root         1  0.0  0.0   776  304 ?   S    17:24 0:00 init [5]
> root         2  0.0  0.0     0    0 ?   SN   17:24 0:00 [ksoftirqd/0]
> [. . .]
> hector    3743  0.0  0.0  6484 1148 ?   S    17:26 0:00 /usr/bin/lamd -
> -----------------------------------
>
> So the environment seems to be set up properly... The thing is that it
> doesn't execute the program properly.
>
> I imagine that the solution will be quite simple, but I can't see it :(
>
> Thank you very much in advance!!
> //Hector
>
> >> 460853_at_[hidden] wrote:
> >>> I know there's a firewall on each machine that only opens the SSH
> >>> (22) port, so I guess the problem comes from that. So, which ports
> >>> do I have to open in order to boot LAM?
> >>>
> >>> Executing lamboot with the -d option, I've read (among many other
> >>> things) this:
> >>>
> >>> lamd -H 155.210.155.67 -P 6459 -n 1 -o 0 -d
> >>>
> >>> So, I guess this means that the .155.70 machine should be able to
> >>> reach port 6459 on the .155.67 machine. Am I right? So is the
> >>> solution to open port 6459 on the .155.67 machine? Should I also
> >>> open this port on the .155.70 machine? Otherwise, which ports should
> >>> I open? I don't know whether opening only these ports will be enough.
> >>
> >> All non-system (> 1024) TCP ports are needed to boot and run LAM. In
> >> more detail - LAM does not use any specific port numbers, but instead
> >> requests any random open port from the OS. Check out FAQs 17 and 18
> >> here for some more info:
> >>
> >> http://www.lam-mpi.org/faq/category4.php3
> >>
> >> Hope this helps!
> >>
> >> Andrew
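
As an illustration of what "requests any random open port from the OS" means in general, the sketch below binds a TCP socket to port 0 and then asks which port the OS actually assigned. It shows the general POSIX sockets mechanism only and is not taken from LAM's source; the helper name `os_assigned_port` is hypothetical, and binding to the loopback address is an assumption made to keep the example self-contained.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Bind a TCP socket to port 0 on the loopback interface so the OS picks
 * a free port, then report which port was chosen.
 * Returns the port number, or -1 on error.
 * (os_assigned_port is a hypothetical helper, not LAM code.)
 */
int os_assigned_port(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    int port = -1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);   /* 0 means "any free port" */

    /* After a successful bind, getsockname reveals the assigned port. */
    if (bind(fd, (struct sockaddr *) &addr, sizeof addr) == 0 &&
        getsockname(fd, (struct sockaddr *) &addr, &len) == 0)
        port = ntohs(addr.sin_port);

    close(fd);
    return port;
}
```

Because the number is chosen at run time and can differ on every run, a firewall rule that opens a single fixed port cannot be expected to cover it.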
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/
