Indeed it does not solve the problem ... I had changed something in your
code that made it work regardless and had forgotten to change it back
:-( ... I am looking at the code and will let you know as soon as I find
out more ...
Sorry about that
Anju
On Thu, 7 Apr 2005, wrote:
> I should also note that making that programname change didn't "fix" the
> situation for me.
>
>
> It is a strange situation in that it depends on which nodes I choose to run it on.
>
> For example, if I run:
> ./mymanager 0 11
> telling the master to spawn the slaves on nodes 0 and 11, it works fine.
>
> But if I run:
> ./mymanager 0 3
> telling the master to spawn the slaves on nodes 0 and 3, it hangs at the barrier.
>
>
>
> Not sure if this helps clarify the problem...
> --dp
>
>
>
>
> Quoting Prabhanjan Kambadur <pkambadu_at_[hidden]>:
>
> >
> > This is a snippet from your copy_myworker program.
> >
> > ===============================================================
> >     /* command to spawn & merge */
> >     if (msg == 1) {
> >         char programname[100];
> >         MPI_Info info;
> >         int root = 0, maxprocs = 1;
> >
> >         MPI_Comm_spawn(programname, MPI_ARGV_NULL, maxprocs, info,
> >                        root, intracomm, &intercomm, MPI_ERRCODES_IGNORE);
> >         fprintf(fp, "spawned...\n"); fflush(fp);
> >
> >         MPI_Intercomm_merge(intercomm, 1, &intracomm);
> >         fprintf(fp, "merged...\n"); fflush(fp);
> >
> >         MPI_Comm_rank(intracomm, &mynewrank);
> >         MPI_Comm_size(intracomm, &mynewsize);
> >         fprintf(fp, "now im %d of %d\n", mynewrank, mynewsize); fflush(fp);
> >     }
> > =================================================================
> >
> > Notice that the variable "programname" is never initialized, so
> > MPI_Comm_spawn throws an exception and all processes abort. Once this
> > was corrected, it worked for me without any problems.
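> >
> > For reference, here is a sketch of how that block could look once the
> > arguments are initialized. The path "./copy_myworker" is only a
> > placeholder for whatever executable you actually intend to spawn, and
> > since the "info" variable in the original is never set either, the
> > sketch passes MPI_INFO_NULL instead:
> >
> > ===============================================================
> >     /* command to spawn & merge -- corrected sketch */
> >     if (msg == 1) {
> >         char programname[100];
> >         int root = 0, maxprocs = 1;
> >
> >         /* give MPI_Comm_spawn a real executable name */
> >         snprintf(programname, sizeof(programname), "./copy_myworker");
> >
> >         /* MPI_INFO_NULL rather than an uninitialized MPI_Info handle */
> >         MPI_Comm_spawn(programname, MPI_ARGV_NULL, maxprocs, MPI_INFO_NULL,
> >                        root, intracomm, &intercomm, MPI_ERRCODES_IGNORE);
> >         fprintf(fp, "spawned...\n"); fflush(fp);
> >
> >         MPI_Intercomm_merge(intercomm, 1, &intracomm);
> >         fprintf(fp, "merged...\n"); fflush(fp);
> >
> >         MPI_Comm_rank(intracomm, &mynewrank);
> >         MPI_Comm_size(intracomm, &mynewsize);
> >         fprintf(fp, "now im %d of %d\n", mynewrank, mynewsize); fflush(fp);
> >     }
> > =================================================================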
> >
> > Hope this helps,
> > Anju
> >
> >
> > On Tue, 5 Apr 2005, wrote:
> >
> > > Hello.
> > >
> > > I've attached the code for my manager and slave processes. I've also
> > > included the logging output from a run that should illustrate the
> > > problem.
> > >
> > >
> > > QUICK SUMMARY
> > > The master spawns a child and merges the resulting intercomm into an
> > > intracomm. In that intracomm there are two processes: the master
> > > (rank 0 of 2) and the slave (rank 1 of 2).
> > > Then the master signals the slave (sends a message containing the
> > > integer 1) to participate in a collective spawn/merge, so a second
> > > slave comes up. In the intracomm returned from that merge there are
> > > now three processes: the master (rank 0 of 3), slave1 (rank 2 of 3)
> > > and slave2 (rank 2 of 3)!! Both slaves are saying that they are 2 of 3.
> > >
> > > The problem is that the program then hangs at a barrier. I'm guessing
> > > this is because both slaves think they have the same rank.
> > >
> > >
> > > I can't seem to understand what is wrong here.
> > > Thanks again for any help.
> > >
> > > --dp
> > >
> > >
> > >
> > >
> > > Quoting Jeff Squyres <jsquyres_at_[hidden]>:
> > >
> > > > Can you send a small code example that shows this problem? That would
> > > > be most helpful.
> > > >
> > > > Thanks!
> > > >
> > > > On Apr 1, 2005, at 9:21 AM, <petrovic_at_[hidden]> wrote:
> > > >
> > > > > Hello all.
> > > > >
> > > > > I'm struggling with something that seems to be a familiar topic on
> > > > > this mailing
> > > > > list. Any help would be appreciated.
> > > > >
> > > > > I'm trying to have a 'master' program start up a number of 'slave'
> > > > > programs by a
> > > > > series of spawn calls. (I know I can spawn multiple programs with one
> > > > > call to
> > > > > spawn or spawn_multiple, but for other reasons, I must do it this
> > > > > way...).
> > > > >
> > > > > The general problem is trying to get an intracommunicator that
> > > > > includes the
> > > > > whole bunch. I understand that I can use spawn and intercomm_merge,
> > > > > and that
> > > > > these calls are collective. This seems to work fine except when I run
> > > > > on certain
> > > > > nodes on the cluster I am working on; from some logging, it seems that
> > > > > two processes end up thinking that they have the same rank in a given
> > > > > intracomm.
> > > > >
> > > > >
> > > > > here are the steps:
> > > > >
> > > > > **master**
> > > > > use MPI_COMM_SELF as starting intracomm
> > > > > loop begin
> > > > > (notify existing processes to collectively spawn/merge)
> > > > > spawns a process using intracomm
> > > > > merges the returned intercomm (from the spawn) into intracomm
> > > > > loop end
> > > > >
> > > > >
> > > > > **slave**
> > > > > merges parent intercomm into intracomm.
> > > > > loop begin
> > > > > if notified, spawn (using intracomm)
> > > > > merge (using intercomm returned from spawn) into intracomm
> > > > > loop end
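> > > > >
> > > > > In C, a minimal sketch of those two loops might look like the block
> > > > > below; the executable name "./myworker", the slave count of 3 and
> > > > > the simple send/recv notification are just placeholders, not the
> > > > > names used in the attached code:
> > > > >
> > > > > ===============================================================
> > > > > #include <mpi.h>
> > > > > #include <stdio.h>
> > > > >
> > > > > int main(int argc, char **argv)
> > > > > {
> > > > >     MPI_Comm parent, intracomm, intercomm;
> > > > >     int rank, size, msg, r, i;
> > > > >
> > > > >     MPI_Init(&argc, &argv);
> > > > >     MPI_Comm_get_parent(&parent);
> > > > >
> > > > >     if (parent == MPI_COMM_NULL) {
> > > > >         /* master: start from MPI_COMM_SELF, grow one slave at a time */
> > > > >         intracomm = MPI_COMM_SELF;
> > > > >         for (i = 0; i < 3; i++) {
> > > > >             msg = 1;
> > > > >             MPI_Comm_size(intracomm, &size);
> > > > >             for (r = 1; r < size; r++)   /* notify the existing slaves */
> > > > >                 MPI_Send(&msg, 1, MPI_INT, r, 0, intracomm);
> > > > >             /* both calls are collective over the current intracomm */
> > > > >             MPI_Comm_spawn("./myworker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
> > > > >                            0, intracomm, &intercomm, MPI_ERRCODES_IGNORE);
> > > > >             MPI_Intercomm_merge(intercomm, 0, &intracomm);
> > > > >         }
> > > > >         msg = 0;                         /* tell the slaves to stop */
> > > > >         MPI_Comm_size(intracomm, &size);
> > > > >         for (r = 1; r < size; r++)
> > > > >             MPI_Send(&msg, 1, MPI_INT, r, 0, intracomm);
> > > > >     } else {
> > > > >         /* slave: merge with the parent, then keep participating */
> > > > >         MPI_Intercomm_merge(parent, 1, &intracomm);
> > > > >         while (1) {
> > > > >             MPI_Recv(&msg, 1, MPI_INT, 0, 0, intracomm, MPI_STATUS_IGNORE);
> > > > >             if (msg != 1)
> > > > >                 break;
> > > > >             MPI_Comm_spawn("./myworker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
> > > > >                            0, intracomm, &intercomm, MPI_ERRCODES_IGNORE);
> > > > >             MPI_Intercomm_merge(intercomm, 0, &intracomm);
> > > > >         }
> > > > >     }
> > > > >
> > > > >     /* every process should now agree on its place in the final group */
> > > > >     MPI_Comm_rank(intracomm, &rank);
> > > > >     MPI_Comm_size(intracomm, &size);
> > > > >     printf("I am %d of %d\n", rank, size);
> > > > >     MPI_Barrier(intracomm);
> > > > >     MPI_Finalize();
> > > > >     return 0;
> > > > > }
> > > > > ===============================================================
> > > > >
> > > > > The point the sketch tries to make is that every process already in
> > > > > intracomm has to make the matching MPI_Comm_spawn and
> > > > > MPI_Intercomm_merge calls in each iteration, since both are collective.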
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > also, the master is changing the "lam_spawn_sched_round_robin" key
> > > > > before each spawn, in case that might be an issue...
> > > > >
> > > > > Any ideas?
> > > > > Thanks in advance!
> > > > > --dp
> > > > >
> > > >
> > > > --
> > > > {+} Jeff Squyres
> > > > {+} jsquyres_at_[hidden]
> > > > {+} http://www.lam-mpi.org/
> > > >
> > >
> > >
> > >
> > >
> >
>
>
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>