Hi Lei,
I am working on a client-server application that uses MPI_Publish_name,
although I am writing in C++.
To avoid the crashes on name lookup or publish, I wrap the MPI calls in
try/catch blocks.
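In case it helps, here is a rough sketch of what I mean (untested as
pasted, and it assumes LAM's C++ bindings cover the MPI-2 name-service
calls; if they do not, the same pattern works with the C calls after
setting MPI_ERRORS_RETURN and checking return codes). "MPI_SERVER1" is
just a placeholder for your service name:

  #include <mpi.h>
  #include <iostream>

  int main(int argc, char** argv)
  {
      MPI::Init(argc, argv);

      // Make MPI errors throw MPI::Exception instead of aborting.
      MPI::COMM_WORLD.Set_errhandler(MPI::ERRORS_THROW_EXCEPTIONS);

      char port_name[MPI::MAX_PORT_NAME];
      try {
          // Throws if the name has not been published (server down).
          MPI::Lookup_name("MPI_SERVER1", MPI::INFO_NULL, port_name);
      }
      catch (MPI::Exception& e) {
          std::cerr << "lookup failed: " << e.Get_error_string()
                    << std::endl;
          // retry later, fall back to spawning the server, etc.
      }

      MPI::Finalize();
      return 0;
  }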
I do not yet need (or have) a solution for the lingering published name,
but I can imagine a background process/cron job that wakes up from time
to time, checks whether the name is still published and whether the
server is accepting connections, and takes appropriate action. Again, I
expect you would need try/catch, and probably timeouts, to handle
crashed servers or a hung LAM environment (i.e. one in need of
lamhalt/lamwipe).
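As for your question (2) below, about removing the left-over name from
within the server code: I have not tried this, but with the same
ERRORS_THROW_EXCEPTIONS handler installed you could catch the failed
publish, unpublish the stale entry, and publish again. I do not know
whether LAM insists on the original port string for the unpublish, so
treat it strictly as a sketch:

  // in the server, after MPI_Open_port() has filled in sport_name
  try {
      MPI::Publish_name("MPI_SERVER1", MPI::INFO_NULL, sport_name);
  }
  catch (MPI::Exception& e) {
      // A previous server probably died without unpublishing.
      // Try to clear the stale entry, then publish again.  LAM may
      // want the old port string here rather than the new one.
      MPI::Unpublish_name("MPI_SERVER1", MPI::INFO_NULL, sport_name);
      MPI::Publish_name("MPI_SERVER1", MPI::INFO_NULL, sport_name);
  }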
HTH,
/jr
---
Lei_at_ICS wrote:
> To understand the problem more, I did the following:
> I ran View3d_server as usual, i.e.,
> > mpirun -np 1 View3d_server
> I then ran the client directly, without mpirun:
> > client
> Both my client and server crashed with the same error message.
>
> So using Matlab is not the cause; starting the client without
> using mpirun is.
>
> The following is from Jeff some time ago:
>
> > Hence, you must lamboot before you run any MPI application under LAM.
> > You can do this before you run matlab,
>
> Yes, I did lamboot before starting the server and before starting the C
> client or mex client.
>
> > So, once you have a LAM universe, you can launch MPI jobs in one of
> > three ways:
>
> > 1. "Singleton", where you just "./a.out" (where a.out invokes
> > MPI_Init). This will make a.out be an MPI process, and it will have an
> > MPI_COMM_WORLD size of 1.
>
> I ran my C client today without using mpirun. I checked the
> size of MPI_COMM_WORLD and found it to be 1, and the client
> process has rank 0.
>
> > 2. With mpirun.
>
> > 3. With MPI_Comm_spawn[_multiple].
>
> > So what I was suggesting with that design is that you will probably
> > lamboot before you run matlab (or you can make your mex script smart
> > enough to run lamboot itself), and then have a mex interface that calls
> > MPI_Init. This will give you a singleton MPI process, where you can
> > look for published names, etc. Then you can spawn a master or connect
> > to the existing master... etc.
>
> My experiment today takes the Matlab out of the equation.
> Now the question is why I am unable to connect to the server
> from an MPI singleton that is run without using mpirun.
>
> Two related questions:
> (1) When my client is run from mpirun but the server has not
> been started, calling MPI_Lookup_name crashes the client with
> the error:
> MPI_Lookup_name: publishing service: name is not published (rank 0,
> MPI_COMM_WORLD)
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD): - MPI_Lookup_name()
> Rank (0, MPI_COMM_WORLD): - main()
>
> (2) If I control-C the server, it obviously will not have
> a chance to call MPI_Unpublish_name. The next time
> I start the server, it crashes with the error:
> MPI_Publish_name: publishing service: name is published (rank 0,
> MPI_COMM_WORLD)
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD): - MPI_Publish_name()
> Rank (0, MPI_COMM_WORLD): - main()
>
> I can lamboot again to solve this problem. But is there a way to remove
> the left-over published name from within my server code?
>
> Thanks,
>
> -Lei
>
>
>
> Lei_at_ICS wrote:
>
>
>>I tried something crazy -- I ran matlab from mpirun:
>>mpirun -np 1 matlab -nodesktop
>>Now my mex_client can connect to my MPI server
>>without crashing. Wow! :)
>>
>>This isn't the way I want to run Matlab, and it should not be
>>necessary. Indeed, other weird things happened when I ran Matlab
>>from mpirun.
>>
>>But why did mpirun help in this case? What is the right way
>>to start an MPI singleton from matlab via mex?
>>
>>-------------------- mex_client.c --------------------
>> MPI_Init( NULL, NULL );
>> strcpy(port_name, "n0:i11:323" );
>>
>> MPI_Comm_size(MPI_COMM_WORLD, &size);
>> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>> /*
>> //status = MPI_Lookup_name("MPI_SERVER1", MPI_INFO_NULL, port_name);
>> //if (status != MPI_SUCCESS) {
>> //printf("****** MPI Server not up yet. \n");
>> //}
>> */
>>
>> MPI_Comm_connect( port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF,
>> &server );
>>
>>
>>------------------- View3d_server.c -------------------------
>> MPI_Init(&argc, &argv);
>> MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
>> MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
>>
>> MPI_Open_port(MPI_INFO_NULL, sport_name);
>>
>> while (1) {
>> MPI_Comm_accept(sport_name, MPI_INFO_NULL, 0, MPI_COMM_SELF,
>> &comm_client );
>> ... ... ...
>> }
>>
>>Lei_at_ICS wrote:
>>
>>>Hi,
>>>
>>>I have an MPI server which opens a port, prints out the port name,
>>>and waits for a connection. My client, in C, uses the printed port
>>>name as its argument to connect to the server and send it a
>>>message. The reason I do not use published
>>>names is that my call to MPI_Lookup_name() would crash
>>>(see my earlier emails; that's a different problem).
>>>
>>>The above server and client worked fine until I made the
>>>client into a MEX function called from Matlab. Now the
>>>server crashes with the following error:
>>>losangeles[48]% mpirun -np 1 View3d_server
>>>MPI_SERVER available at n0:i11:323
>>>**** before MPI_Comm_accept ...
>>>MPI_Comm_accept: mismatched run-time flags: Bad address (rank 0,
>>>MPI_COMM_WORLD)
>>>Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>>>Rank (0, MPI_COMM_WORLD): - MPI_Comm_accept()
>>>Rank (0, MPI_COMM_WORLD): - main()
>>>-----------------------------------------------------------------------------
>>>One of the processes started by mpirun has exited with a nonzero exit
>>>
>>>And the MEX client will crash with the following error:
>>>
>>>>> mex_client()
>>>
>>>*** port_name: n0:i11:323
>>>Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>>>Rank (0, MPI_COMM_WORLD): - MPI_Comm_connect()
>>>Rank (0, MPI_COMM_WORLD): - main()
>>>MPI_Comm_connect: unclassified: Too many open files (rank 0, MPI_COMM_WORLD)
>>>
>>>My client (C or MEX) is very simple and it does not open any files.
>>>My LAM (7.1.1) was built with the options:
>>>
>>>--without-threads --with-memory-manager=none
>>>
>>>Any suggestions on how to solve this problem? Has anybody
>>>actually done this before?
>>>
>>>Thanks a lot for your help!
>>>
>>>-Lei
>>>
>>
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/