
LAM/MPI General User's Mailing List Archives


From: John Robinson (jr_at_[hidden])
Date: 2005-08-28 12:51:44


Hi Lei,

I am working on a client-server application and using MPI_Publish_name,
although I am writing in C++.

To avoid the crashes on name lookup or publish, I wrap the MPI calls in
try/catch blocks.
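
Since your client is plain C, the equivalent trick there is to switch the
error handler from the default (which aborts the process -- that is the
"crash" you are seeing) to MPI_ERRORS_RETURN and check the return codes
yourself.  This is a rough, untested sketch; the "MPI_SERVER1" service name
is just the one from your earlier mail, and it also covers your question (1)
below:

-------------------- guarded_client.c (sketch) --------------------
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    char port_name[MPI_MAX_PORT_NAME];
    MPI_Comm server;
    int rc;

    MPI_Init(&argc, &argv);

    /* Hand errors back as return codes instead of letting LAM kill us.
       Name-service errors are reported against MPI_COMM_WORLD; the
       connect uses MPI_COMM_SELF, so set both. */
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Errhandler_set(MPI_COMM_SELF, MPI_ERRORS_RETURN);

    rc = MPI_Lookup_name("MPI_SERVER1", MPI_INFO_NULL, port_name);
    if (rc != MPI_SUCCESS) {
        printf("server name not published yet -- try again later\n");
        MPI_Finalize();
        return 1;
    }

    rc = MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
    if (rc != MPI_SUCCESS) {
        printf("name is published but the server is not accepting\n");
        MPI_Finalize();
        return 1;
    }

    /* ... exchange messages with the server ... */

    MPI_Comm_disconnect(&server);
    MPI_Finalize();
    return 0;
}
--------------------------------------------------------------------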

I do not yet need (or have) a solution for the lingering published name,
but I can imagine a background process or cron job that wakes up from time
to time, checks whether the name is published and whether the server is
still accepting connections, and takes appropriate action. Again, I expect
you would need try/catch (or error-code checks), and probably timeouts, to
handle crashed servers or a hung LAM environment (i.e. one in need of
lamhalt/lamwipe).
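
For what it is worth, a bare-bones version of that watchdog might look like
the sketch below.  It is untested; MPI_Comm_connect has no timeout, so if
the server is hung rather than dead the connect will block (a real version
would want to bound it, e.g. by doing the connect in a child process it can
kill), and I do not know offhand whether LAM lets a process other than the
original publisher unpublish a stale name:

-------------------- watchdog.c (sketch) --------------------
#include <stdio.h>
#include <mpi.h>

/* Run periodically, e.g. from cron:  mpirun -np 1 ./watchdog */
int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm server;

    MPI_Init(&argc, &argv);
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Errhandler_set(MPI_COMM_SELF, MPI_ERRORS_RETURN);

    if (MPI_Lookup_name("MPI_SERVER1", MPI_INFO_NULL, port) != MPI_SUCCESS) {
        printf("watchdog: no name published, nothing to do\n");
    } else if (MPI_Comm_connect(port, MPI_INFO_NULL, 0,
                                MPI_COMM_SELF, &server) != MPI_SUCCESS) {
        /* Name exists but nobody answers: the server probably died
           without unpublishing.  Take "appropriate action" here. */
        printf("watchdog: stale name at %s, trying to clean up\n", port);
        MPI_Unpublish_name("MPI_SERVER1", MPI_INFO_NULL, port);
    } else {
        printf("watchdog: server is alive\n");
        MPI_Comm_disconnect(&server);
    }

    MPI_Finalize();
    return 0;
}
--------------------------------------------------------------
Note that the server's accept loop would also have to tolerate a client
that connects and then disconnects without sending anything.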

HTH,
/jr

---
Lei_at_ICS wrote:
> To understand the problem more, I did the following:
> I ran View3d_server as usual, i.e.,
>  > mpirun -np 1 View3d_server
> I then ran the client not from mpirun,
>  > client
> Both my client and server crashed with the same error message.
> 
> So using Matlab is not the cause; starting the client
> without mpirun is.
> 
> The following is from Jeff some time ago:
> 
>  > Hence, you must lamboot before you run any MPI application under LAM.
>  > You can do this before you run matlab,
> 
> Yes, I did lamboot before starting the server and before starting the C
> client or mex client.
> 
>  > So, once you have a LAM universe, you can launch MPI jobs in one of
>  > three ways:
> 
>  > 1. "Singleton", where you just "./a.out" (where a.out invokes
>  > MPI_Init). This will make a.out be an MPI process, and it will have an
>  > MPI_COMM_WORLD size of 1.
> 
> I ran my C client today without using mpirun.  I did check
> the size of MPI_COMM_WORLD and found it to be 1.
> This client process is of rank 0.
> 
>  > 2. With mpirun.
> 
>  > 3. With MPI_Comm_spawn[_multiple].
> 
>  > So what I was suggesting with that design is that you will probably
>  > lamboot before you run matlab (or you can make your mex script smart
>  > enough to run lamboot itself), and then have a mex interface that calls
>  > MPI_Init. This will give you a singleton MPI process, where you can
>  > look for published names, etc. Then you can spawn a master or connect
>  > to the existing master... etc.
> 
> My experiment today takes Matlab out of the equation.
> Now the question is why I am unable to connect to the server
> from an MPI singleton that was started without mpirun.
> 
> Two related questions:
> (1) When my client is run from mpirun but the server has not been
> started, and my client calls MPI_Lookup_name(), it crashes with
> the error:
> MPI_Lookup_name: publishing service: name is not published (rank 0, 
> MPI_COMM_WORLD)
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD):  - MPI_Lookup_name()
> Rank (0, MPI_COMM_WORLD):  - main()
> 
> (2) If I control-C the server, it obviously will not get
> a chance to call MPI_Unpublish_name().  The next time
> I start the server, it will crash with the error:
> MPI_Publish_name: publishing service: name is published (rank 0, 
> MPI_COMM_WORLD)
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD):  - MPI_Publish_name()
> Rank (0, MPI_COMM_WORLD):  - main()
> 
> I can lamboot again to solve this problem.  But is there a way to remove
> the left-over published name from within my server code?
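
For (2), one thing you could try from inside the server -- untested, and I
do not know whether LAM will let a fresh server instance unpublish an entry
left behind by a dead one -- is to treat "name is published" as a
recoverable error: switch to MPI_ERRORS_RETURN, and if the publish fails,
look up the stale entry, unpublish it, and publish again.  Roughly:

-------------------- publish_or_recover (sketch) --------------------
#include <stdio.h>
#include <mpi.h>

/* Publish `service` -> `port`; if a stale entry from a dead server is in
   the way, try to unpublish it and publish again.  Meant to be called in
   View3d_server right after MPI_Open_port(). */
static int publish_or_recover(const char *service, char *port)
{
    char stale[MPI_MAX_PORT_NAME];
    int rc;

    /* Get error codes back instead of having LAM kill the process. */
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Publish_name((char *) service, MPI_INFO_NULL, port);
    if (rc == MPI_SUCCESS)
        return rc;

    /* Look up the stale port so we can unpublish the exact (service, port)
       pair the dead server left behind, then retry with our new port. */
    if (MPI_Lookup_name((char *) service, MPI_INFO_NULL, stale) == MPI_SUCCESS)
        MPI_Unpublish_name((char *) service, MPI_INFO_NULL, stale);

    return MPI_Publish_name((char *) service, MPI_INFO_NULL, port);
}
----------------------------------------------------------------------

If that still fails, falling back to lamboot (or lamwipe + lamboot) is
probably unavoidable.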
> 
> Thanks,
> 
> -Lei
> 
> 
> 
> Lei_at_ICS wrote:
> 
> 
>>I tried something crazy -- I ran matlab from mpirun:
>>mpirun -np 1 matlab -nodesktop
>>Now my mex_client can connect to my MPI server
>>without crashing.  Wow! :)
>>
>>This isn't the way I wanted to run matlab, and it should not have to be.
>>Indeed, other weird things happened when I ran matlab from mpirun.
>>
>>But why did mpirun help in this case?  What is the right way
>>to start an MPI singleton from matlab via mex?
>>
>>-------------------- mex_client.c --------------------
>>   MPI_Init( NULL, NULL );
>>   strcpy(port_name, "n0:i11:323" );
>>
>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>>   /*
>>   //status = MPI_Lookup_name("MPI_SERVER1", MPI_INFO_NULL, port_name);
>>   //if (status != MPI_SUCCESS) {
>>     //printf("****** MPI Server not up yet. \n");
>>   //}
>>   */
>>
>>   MPI_Comm_connect( port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>                     &server );
>>
>>
>>------------------- View3d_server.c -------------------------
>>   MPI_Init(&argc, &argv);
>>   MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
>>   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
>>
>>   MPI_Open_port(MPI_INFO_NULL, sport_name);
>>
>>   while (1) {
>>         MPI_Comm_accept(sport_name, MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>                           &comm_client );
>>         ... ... ...
>>  }
>>
>>Lei_at_ICS wrote:
>>
>>>Hi,
>>>
>>>I have an MPI server which opens a port, prints out the port name,
>>>and waits for connection.  My client, in C, will use the printed
>>>port name as its argument to connect to the server and send
>>>a message to the server.  The reason I do not use published
>>>names is that my call to MPI_Lookup_name() would crash
>>>(see my earlier emails; that's a different problem). 
>>>
>>>Now, the above server and client worked fine until I made
>>>the client a MEX function called from Matlab.  Now
>>>the server crashes with the following error:
>>>losangeles[48]% mpirun -np 1 View3d_server
>>>MPI_SERVER available at n0:i11:323
>>>**** before MPI_Comm_accept ...
>>>MPI_Comm_accept: mismatched run-time flags: Bad address (rank 0, 
>>>MPI_COMM_WORLD)
>>>Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>>>Rank (0, MPI_COMM_WORLD):  - MPI_Comm_accept()
>>>Rank (0, MPI_COMM_WORLD):  - main()
>>>-----------------------------------------------------------------------------
>>>One of the processes started by mpirun has exited with a nonzero exit
>>>
>>>And the MEX client will crash with the following error:
>>>
>>>>> mex_client()
>>>
>>>*** port_name: n0:i11:323
>>>Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>>>Rank (0, MPI_COMM_WORLD):  - MPI_Comm_connect()
>>>Rank (0, MPI_COMM_WORLD):  - main()
>>>MPI_Comm_connect: unclassified: Too many open files (rank 0, MPI_COMM_WORLD)
>>>
>>>My client (C or MEX) is very simple and it does not open any files.
>>>My LAM (7.1.1) was built with the options:
>>>
>>>--without-threads --with-memory-manager=none
>>>
>>>Any suggestions on how to solve this problem?  Has anybody
>>>actually done this before?
>>>
>>>Thanks a lot for your help!
>>>
>>>-Lei
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/