Sorry about the long delay in replying -- I've been out of town the
last couple of days.

It looks like you have the right idea, and I don't see anything
obviously wrong from your code snippet. You took the approach that I
would have taken in having your new "daemon" be a separate process.

When looking at all the "daemons" in otb/sys/, keep in mind that the
code is designed to be used in either one big process or a bunch of
little processes. The "one big process" model was added after the
project was well underway, and it plays some games to make that work.
One of those games is that nsend / nrecv don't actually block.
Instead, they schedule activity to be done when control is returned
to the kernel. There are some very tight restrictions placed on the
use of nsend / nrecv when running this way. In particular, only one
nsend and one nrecv call can be scheduled during a single function
call; control must then return to the kernel before the next send /
recv can be posted. Also, packet length must be strictly less than
MAXNMSGSIZE bytes. Since you are running in your own process, none of
these restrictions apply to you (thankfully, which makes life much
easier).

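To make the "schedule now, act when control returns to the kernel" restrictions a bit more concrete, here is a toy model in C. It is only a sketch and is not taken from the LAM source: toy_post(), toy_return_to_kernel(), and TOY_MAX_MSG are hypothetical names, and TOY_MAX_MSG is a made-up value that merely stands in for MAXNMSGSIZE.

```c
#include <stddef.h>

/* Toy model only: none of these names or values come from LAM. */
#define TOY_MAX_MSG 64              /* stand-in for MAXNMSGSIZE */

enum toy_kind { TOY_SEND = 0, TOY_RECV = 1 };

/* One slot per kind: at most one send and one recv may be pending. */
static int toy_posted[2];

/* Schedule a send or recv. Returns 0 on success, -1 if a rule is broken. */
int toy_post(enum toy_kind kind, size_t length)
{
    if (toy_posted[kind])
        return -1;                  /* second post of this kind in one call */
    if (length >= TOY_MAX_MSG)
        return -1;                  /* length must be strictly below the limit */
    toy_posted[kind] = 1;
    return 0;
}

/* Control returns to the "kernel": the scheduled work is carried out and
 * both slots are freed. Returns how many operations were pending. */
int toy_return_to_kernel(void)
{
    int done = toy_posted[TOY_SEND] + toy_posted[TOY_RECV];
    toy_posted[TOY_SEND] = 0;
    toy_posted[TOY_RECV] = 0;
    return done;
}
```

In this model, a second toy_post(TOY_SEND, ...) fails until toy_return_to_kernel() has run, while one send plus one recv in the same call is accepted. A separate process using the real, blocking nsend / nrecv has no such bookkeeping to worry about.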
By the way, there's no reason you can't modify the lamd-conf.lamd
file (the one for the one big-lamd model) to start your new_feature
daemon next to the lamd. Your file would then look like:

lamd $inet_topo $debug $session_prefix $session_suffix
lamd_newfeature $debug $session_prefix $session_suffix

I don't know if that's useful or not to you, but thought I would
point out that it should work.

Like I said, I don't see anything obvious in your code, so I'll point
out a couple of things. If these don't help, I might need to look at
some more of your code to be of any real help.

First, your process should call kinit() with a priority of PRDAEMON.
There's really no benefit to having your own priority. There are a
couple of services that require it, but I don't think that what you
are trying to do fits that category.

One very, very important thing is that your EVNEW constant be defined
to a value that isn't used elsewhere. Look at the long comments in
share/include/event.h to see how event numbers are chosen in LAM.
Event-number clashes are a frequent issue when using the LAM
communication layer.

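As a rough illustration of why a reused event number makes messages seem to vanish, here is a toy matcher in C. It is a deliberately simplified sketch, not LAM code: toy_match() and the TOY_EV_* values are hypothetical, the model ignores nh_type entirely, and "first match wins" is an assumption made for the demo, not a statement about how the LAM kernel picks a receiver.

```c
/* Toy model only: hypothetical event numbers, not values from event.h. */
#define TOY_EV_ECHO 100   /* pretend an existing daemon already uses this */
#define TOY_EV_NEW  200   /* pretend this value is unused elsewhere */

/* posted_events[i] is the event that receiver i is waiting on.
 * Returns the index of the first receiver whose event equals msg_event,
 * or -1 if no receiver is waiting on that event. */
int toy_match(const int *posted_events, int n, int msg_event)
{
    for (int i = 0; i < n; i++) {
        if (posted_events[i] == msg_event)
            return i;
    }
    return -1;
}
```

If receiver 0 is an existing daemon and receiver 1 is the new daemon, giving both the event TOY_EV_ECHO means a message meant for the new daemon matches receiver 0 first. With TOY_EV_NEW reserved for the new daemon, the same message can only match receiver 1.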
If none of that helps, the next thing to try is using some of the
LAM tools to look at where your messages are going. You can use the
state and bfstate commands (LAM must be configured with the
--with-trillium option for these to be built and installed) to see
what is running on a particular node and what is up with the
communication buffers on that node.

Hope this helps a bit. If not, let me know, and include as much
information about your daemon as possible. It's a bit hard to debug
code at that level and even harder when I don't know what the code is
doing ;).

Brian

On Jan 26, 2006, at 2:13 PM, wchao_at_[hidden] wrote:
>> Yes. I add the new feature as a pseudo-daemon of lamd,
>> just as echod, dli_inet, dlo_inet, etc.
>
> I should re-clarify here :)
> I didn't use lamd as a big process to boot lam,
> instead, I used the following config-file to boot lam as
> a cluster of processes, including the new process I added:
>
> lamd_kernel $debug $session_prefix $session_suffix
>
> lamd_router $debug $session_prefix $session_suffix
> lamd_kenyad $debug $session_prefix $session_suffix
> lamd_dli_inet $inet_topo $debug $session_prefix $session_suffix
> lamd_dlo_inet $debug $session_prefix $session_suffix
> lamd_bufferd $debug $session_prefix $session_suffix
> lamd_bforward $debug $session_prefix $session_suffix
> lamd_loadd $debug $session_prefix $session_suffix
> lamd_echod $debug $session_prefix $session_suffix
> lamd_flatd $debug $session_prefix $session_suffix
> lamd_filed $debug $session_prefix $session_suffix
> lamd_traced $debug $session_prefix $session_suffix
> lamd_iod $debug $session_prefix $session_suffix
> lamd_haltd $debug $session_prefix $session_suffix
> lamd_versiond $debug $session_prefix $session_suffix
> lamd_newfeature $debug $session_prefix $session_suffix
>
> Thanks!
>
>>
>> Also, I use nsend()/nrecv() the same way they are used in echod and
>> filed, but I hit the issue I mentioned.
>>
>> So what is the difference in using nsend/nrecv, and how should I
>> use them? It's really confusing for me.
>> It seems the message is received at the receiver node,
>> but it's lost among the daemon processes.
>>
>> Thank you very much!
>>
>> Chao
>>
>>> To clarify -- are you adding another pseudo-daemon inside the lamd
>>> itself? If so, the communication model is a little different using
>>> nsend/nrecv (vs. processes outside of the lamd).
>>>
>>> My answer to your question depends on the answer to the above
>>> question. :-)
>>>
>>>
>>> On Jan 25, 2006, at 11:34 PM, wchao_at_[hidden] wrote:
>>>
>>>> In addition to the things I mentioned in the previous mail,
>>>> I also found:
>>>> For some messages, node 1 sends to node 0 with nsend(), and node 0
>>>> waits for it with nrecv(). Node 1 does send it out.
>>>> And, from the printf statement I added in dsend(), the message does
>>>> appear on node 0, but it appears in dsend(), which is strange,
>>>> and it does not reach nrecv() on node 0. So the message is lost
>>>> as far as the nrecv() on node 0 is concerned.
>>>>
>>>> Any ideas on this issue? Thanks a lot!
>>>>
>>>> ---------------------------- Original Message ----------------------------
>>>> I am adding a feature to lam, and the new feature is running as a
>>>> single process of lamd. So, I defined a priority for it:
>>>> #define PRNEW PRDAEMON
>>>> and call kinit(PRNEW) in the new process.
>>>>
>>>> In the new code, I used nsend()/ntry_recv() to communicate among
>>>> them:
>>>>
>>>> LAM_ZERO_ME(outgoing);
>>>> outgoing.nh_node = destination; // 0, 1, 2, or 3 in the 4-node test environment
>>>> outgoing.nh_event = EVNEW;
>>>> outgoing.nh_type = 0;
>>>> outgoing.nh_flags = 0;
>>>> outgoing.nh_length = strlen(msg) + 1;
>>>> outgoing.nh_msg = msg;
>>>>
>>>> nsend(&outgoing);
>>>>
>>>> ...
>>>>
>>>> LAM_ZERO_ME(incoming);
>>>> memset((void*) msg, 0, 256);
>>>> incoming.nh_event = EVNEW;
>>>> incoming.nh_flags = 0;
>>>> incoming.nh_msg = msg;
>>>> incoming.nh_length = 256;
>>>> incoming.nh_type = 0;
>>>>
>>>> while(ntry_recv(&incoming) == 0){
>>>>
>>>> Then, sometimes nsend()/ntry_recv() works, and all messages between
>>>> the 4 nodes are sent and received.
>>>>
>>>> But most of the time, during the message exchange, some message is
>>>> sent but the receiver doesn't receive it, or some message hangs in
>>>> nsend() even though the receiver is reachable with tping.
>>>>
>>>> I tried adjusting the priority of the new process, changing nh_type
>>>> and nh_event, and using nrecv() instead of ntry_recv(), but none of
>>>> that fixed the issue. It seems something is wrong with the event
>>>> queue: the message sent from the new process is picked up by another
>>>> process of the lamd, but the nh_event should have avoided that.
>>>> I'm really confused here.
>>>>
>>>> So, what's wrong with it? Is my use of nsend()/nrecv() right? Or
>>>> is anything missing?
>>>>
>>>> Any comments and suggestions are welcome! Thanks!
>>>>
>>>> Chao
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> lam-devel mailing list
>>>> lam-devel_at_[hidden]
>>>> http://www.lam-mpi.org/mailman/listinfo.cgi/lam-devel
>>>
>>>
>>> --
>>> {+} Jeff Squyres
>>> {+} The Open MPI Project
>>> {+} http://www.open-mpi.org/