Message Passing Design Issue

Hi,

I’m in the middle of my design stage and I’ve reached a point with my message passing that I can’t seem to solve so I thought I’d ask the question here to see how others would handle something similar.

I have a communication process that has to send messages to remote hardware (non-QNX), sometimes over serial, TCP/IP, USB, etc. To handle all the hardware types we designed our own protocol mechanism. The basic premise is that messages flow across the hardware link, and once per second a special heartbeat message is sent to ensure the remote node is alive, is receiving data, and that no data has been lost.

So in my comms process I have a timer set to expire once per second. This works fine: when the timer fires I receive the pulse on a channel and send the heartbeat message.

Now I want to add the code to receive messages from other local processes (there will be more than one) to be sent to the remote hardware. This is where I am getting stuck.

I originally wanted to use mqueue for my comms process, have mq_notify() send a pulse to my channel, and do a MsgReceive() for either a timer event or a message in the mqueue queue.

However, the docs say that mq_notify() only works when the queue transitions from empty to non-empty, so you must drain the queue before mq_notify() works again. The problem is that my comms task may not empty the queue before a timer event fires, so the heartbeat message may get delayed (likely not for long, but still maybe a few seconds if there is a burst of messages). That seems unacceptable if I want the timer to be reasonably accurate at sending the heartbeat message in a busy system (or maybe not, since I haven't done any real testing of this because the system isn't complete yet).

I then looked at doing away with mqueue and just going with MsgReceive(). This, however, means that senders to the MsgReceive() channel have to block until their message is handled (i.e. I have to do send-receive-reply even if my reply is empty). I really don't want my local processes to block on the comms process if I can at all avoid it (even if it's only a very short time), as I'd prefer they just use a send-and-forget model.

The only thing I can see right now is to use mqueue and have the timer send a pulse to a dummy thread, which then generates a higher-priority mqueue message so that the comms process sends the heartbeat. Kind of kludgy.

Is there another way I am missing that would accomplish what I want: wait for either a message or a timer event, and treat the timer event with priority so it gets handled right away even if there are lots of messages queued/waiting, while at the same time message senders are not blocked waiting for their messages to be handled?

TIA,

Tim

Standard QNX messaging is "queued" in priority order. Thus, if you have a thread blocked in MsgReceive() and a high-priority pulse is sent, it will be placed at the front of the "queue".

Of course, I put "queue" in quotes, since there is no queue. That is the point of synchronous messaging: each client stores its own message in its own address space until the thread in MsgSend() is the highest-priority ready thread in the system.

If a low-priority client thread has sent (i.e. become SEND-blocked) on a server, but the server thread has not reached MsgReceive(), and then a high-priority pulse is enqueued on the server, the server will receive the pulse before the message, even though the client thread "sent" the message first. If the server is processing a message from a low-priority client when a high-priority pulse is enqueued, the server thread will rise to the priority of the pulse, finish processing the client message (without pre-emption), return to MsgReceive(), and receive the pulse.

A possible alternative would be to have two threads, a message listener thread, and a worker thread, with a condvar.

If the listener receives a pulse, it sets a variable to say whether it was a timer pulse or an mq pulse.
It then signals the condvar, and blocks again in MsgReceive() (or dispatch_block() if it's a resmgr).

The worker wakes up, and starts a processing loop.

At each iteration it first checks the timer flag. If set, it sends the heartbeat. Otherwise, if the mq flag is set, it
pulls a message out of the mq. If that fails with EAGAIN, it clears the mq flag and waits on the condvar.

Cheers,

Colin

rgallen,

Yes, I understand how QNX native message passing works, and that's why I want to avoid it: mostly because it forces the sending process to block waiting for the message to be processed, since the data remains in the sender's address space.

One of the benefits of mqueue is that you instantly avoid the deadlock issues I would otherwise have to deal with if I went with native QNX messaging. Also, I want my other processes to not have to block when sending a message to the remote hardware, because they may have other work to do.

Admittedly, there shouldn't be a lot of message traffic in our system (I estimate maybe 20-50 messages per second internally, on average, from a few hardware interrupts or user requests), so blocking would be on the order of microseconds and queue draining shouldn't be an issue. But I am just as sure that the second the code gets into the field, sometime/someplace when I least expect it, a system will get uber busy and miss some heartbeats, and that will cause a problem. Hence this post.

Colin,

An excellent suggestion! Instead of the kludgy thread on the timer expiry, I just use one to set a condvar for the worker thread. That should work perfectly.

Thanks,

Tim

I don't consider non-blocking to be a benefit. By not blocking, you bypass priority inheritance, which practically ensures that your system will experience priority inversion (at some completely unpredictable point).
The work necessary to design your system to avoid deadlocks is far (and I cannot emphasize this enough) far, far easier than the work necessary to ensure that an asynchronous system will not experience priority inversion.

If you design the system as a real time system (which is absolutely impossible with asynchronous messaging, since there is no priority inheritance), that is your best insulation against unexpected failures in the field.

Synchronous real-time systems are inherently predictable, whilst asynchronous systems are inherently unpredictable.

Colin just told you to make your system synchronous (by using a pthreads synchronization primitive).

Wait, one of us is confused here about my original question of what I want to accomplish. Either I didn’t explain very well or you misunderstood.

Priority inversion won’t solve deadlock as you well know.

In fact, my system won't suffer from priority issues because all my processes will likely run at the same priority (at least for now, until testing shows otherwise).

Let me explain again my system or what I want to accomplish.

Let's call my communication process with the remote hardware process A.
Let's call another process in my system that does some work process B.

Process A has 2 threads, one to receive messages from the hardware (serial/ethernet/USB etc) and one to send messages. There is a mutex to protect access to some internal data structures/hardware.

Process B has only 1 thread.

Let's assume I use QNX messaging as you described.

A message arrives at process A, which sends it to process B to be worked on. Process B finishes and wants to send data back to the hardware, so it sends a message back to process A. If in the meantime another message arrives from the hardware for process B, the receive thread in process A will try to send to B, who is blocked on the send thread of A, and the send thread can't complete until the receive thread releases the mutex. Deadlock. No amount of priority inheritance will fix this, as you well know (though adding more threads can mitigate the problem at the cost of making the system more complex).

I’d have to design a system where messages only flow in 1 direction to prevent deadlock.

If instead I use mqueue/mq (or write my own message-queue equivalent), then I can't get deadlock, because no one blocks when sending a message: mqueue buffers everything.

Further, mqueue supports message priorities, so higher-priority messages jump in front of lower-priority ones in the queue and are dequeued in priority order.

I won't suffer from priority inversion issues because two processes won't ever be blocked on each other.

Now, back to the timer part of the equation. In process A's send thread I want to block waiting for either a message or the timer to expire. I want the timer pulse to have a higher priority so that when it fires I send the heartbeat right away, rather than dequeuing more messages.

The problem I found when reading about mq_notify() is that it only sends a notification when the queue goes from empty to non-empty. What I wish it did was send the pulse whenever the queue is non-empty (so if messages were waiting, it would fire right away). That way my code would be trivial, as in:

while (1)
{
    MsgReceive_r()

    if (timerPulse)
        // send heartbeat
    else
        // process message from queue
}

and that would be it, since the priority of the pulses would ensure the timer got in immediately when it fired.

Sadly, mq_notify() does not work like that, so if you get a message you have to loop until the mqueue is drained (if it ever is), which means an indeterminate time until you get back to the MsgReceive() call. That isn't what I want.

He did? If you say so. I read it as follows:

Thread A:

while (1)
{
    MsgReceive_r()

    if (timerpulse)
        globalFlag = timer;
    else
        globalFlag = message;

    // set condvar
}

Thread B:

while (1)
{
    // wait on condvar

    if (globalFlag == timer)
        // send timer message
    else
    {
        // non-blocking mq_receive() returns -1 (errno EAGAIN) when empty
        while (mq_receive() != -1)
        {
            // process 1 message

            if (globalFlag == timer)
            {
                // send timer message
                globalFlag = message;
            }
        }
    }
}

Basically the first thread's job is just to set a global flag for the second thread to notice in its message-dequeuing loop, telling it that it's time to send a heartbeat message, and to set the condvar to wake up the second thread when the queue is empty and there is no timer event waiting. I mentioned using a second thread in my initial post, but I wanted to use it in a much kludgier way than Colin suggested (his can handle multiple timers expiring, whereas my initial kludgy method would need multiple extra threads).

Of course, there needs to be a mutex to protect globalFlag, but that's fairly obvious.

Tim

I think you mean that priority inheritance won’t solve deadlock; and while that is certainly true, I’m not sure I get your drift.

Well, if you use queues, your system darn well better always run at one priority; otherwise it is guaranteed to fail (most likely in the field).

OK.

Stop right there! This violates rule #1 of designing deadlock-free systems (there are a grand total of two rules you need to follow).

Rule #1: never have two threads send to each other.

Of course. You violated rule #1.

No. Nothing will prevent deadlock in a design that breaks either of the rules for designing deadlock-free systems (no amount of threads, queues, or other forms of obfuscation will solve the problem).

You are correct about how you must design a deadlock-free system. So if I understand you correctly, you’re basically saying:

“Since I don’t want to shoot myself in the head (by ignoring the rule that says you don’t point a loaded gun at yourself) I have chosen to stab myself in the foot, since it seems less painful.”

There is a third option:

Option 3: don’t shoot yourself in the head or stab yourself in the foot.

Actually, queues do nothing to prevent deadlock. They only delay it (by the size of the queue). What happens when the queue is full?

You need to prevent deadlock in either case (MsgSend/mq_send), but if you solve the problem (easily solved) for MsgSend, then you have also solved the problem of priority inversion. If, OTOH, you somehow solve the problem for mq_send (which in all likelihood would look a lot like the deadlock-prevention solution for MsgSend), you still haven't solved the problem of priority inversion (which, admittedly, isn't a problem if everything runs at the same priority).

Synchronous messaging does this also, and also conveys priority (thus preventing priority inversions).

Priority inversion has nothing to do with whether two threads are blocked on each other; that is, in fact, deadlock.

I agree that you won’t suffer from priority inversion issues since you say that everything runs at the same priority (this is a bit like saying “I’ll never have my car stolen 'cuz it’s a Yugo” - true, but then again, that does mean that you have to drive a Yugo).

I fail to see how this rules out synchronous messaging.

But if this worked the way you would like, you would still be subject to priority inversion and deadlock (when the queues filled up).

Maybe you should look at this as an opportunity to consider a different design that would be both deadlock-free and priority-inversion-free?

Do you see the “//” commented lines in the above code fragment?

See the part where (in one thread) it says “// wait on condvar”, and the part (in the other thread) where it says “// set condvar”. That is a synchronization point. Thread ‘B’ waits until Thread ‘A’ signals the condvar. This is semantically identical to the following code:

Thread A:

while (1)
{
  MsgReceive_r()

  if (timerpulse)
    globalFlag = timer;
  else
    globalFlag = message;

   MsgReply() // to thread 'B'
}

Thread B:

while (1)
{
   MsgSend() // to thread 'A'

   if (globalFlag == timer)
     // send timer message
   else {
     // non-blocking mq_receive() returns -1 (errno EAGAIN) when empty
     while (mq_receive() != -1) {
       // process 1 message

       if (globalFlag == timer) {
         // send timer message
         globalFlag = message;
       }
     }
   }
}

Now, I think both solutions are horrible kludges, but they are 100% semantically equivalent.

I have read this thread thoroughly, but have you considered the asynchronous messages that are part of 6.3.2?

Tim, I wasn't aware that process A would be sending to process B too.

Have you considered implementing process A as a resource manager, and using the ionotify mechanism as a means to avoid bidirectional messaging? Realtime systems are much easier to get right if you adopt a strict client-server approach rather than a peer-to-peer approach.

That way process B could do blocking writes to process A, because process A would only ever deliver a non-blocking sigevent to process B.
You would get deadlock avoidance AND priority inheritance.

Regards,

Colin

Mario,

I looked at the async message calls (the MsgDeliverEvent() family) and I'm not sure what they give me that mqueue doesn't. So maybe I don't understand it fully.

I understand that async messages prevent deadlock and priority inversion problems. That part is clear. Here’s what I don’t get with that method (again using my process A and B example)

Process A will deliver incoming data to B using async calls. B will use normal sync calls to send data back to A.

Now when A gets data it notifies B with a pulse. Then it has to buffer the data until B calls back for it. In the meantime more data may arrive from the hardware, so A may have to buffer more incoming data while waiting for B to call for it. So essentially A must buffer some unknown number of messages internally until B calls for them.

How is this different from just using mqueue to buffer those same messages? That's what I don't get about using async message passing. It seems to me the benefit I get is avoiding priority inversion on messages, at the cost of writing code to buffer messages (reasonably trivial since I am using C++ and can just use a list container) and managing the sending of pulses as long as data is still in the buffer.

Tim

rallen,

I won't deny it's a kludgy design. That was the reason I posted this topic in the first place: I didn't really like how it was going to work.

You also mentioned option #3, a redesign or rework of my idea. I'm more than open to suggestions on this. Here are the constraints on my system, which I can't avoid:

Process A has to talk to the remote hardware. This means sending and receiving messages from it, so all other processes in the system have to send data through process A.

Processes B-E (probably 3-5 more processes) process the data and send replies back to the hardware through process A (or to a GUI). Note: B-E will never send among themselves, only to A or F.

Process F is a GUI where the user can make changes that send messages to processes B-E, which will eventually cause messages to be sent to the remote hardware.

So looks more or less like this:

hardware <–> A <–> (B-E) <–> F

I wanted to avoid making B-E multi-threaded if I could (for simplicity's sake). I also obviously don't want the GUI to be stuck waiting for a user request to be handled by B-E.

Now maybe I am overcomplicating things, since in reality I will only have 30-50 messages a second and the processing time is going to be on the order of microseconds, so there may be no noticeable slowdown.

Anyway, I chose mqueue/mq because it nicely buffers messages, and with so few messages per second and everything running at the same priority, I don't expect to fall behind and fill the queue. Using the car analogy you mentioned: I don't need Space Shuttle features (backup systems, airtight seals, etc.) on my Yugo if I am only driving it two blocks a day to the store. In other words, I don't want to overcomplicate the design.

Maybe the way to go is to write my own message buffering system using async messages as Mario suggested. Or maybe make A a resource manager as Colin suggested (I am looking at Robert Krten's books, since I have never written one of those before; it looks easy, but I am not sure just how easy).

Tim

Queues DO NOT prevent deadlock. They only delay the problem until the queue fills up. I have seen more deadlocked queue-based designs than synchronous designs.

Your perception that queues prevent deadlock is simply incorrect. There is only one prevention strategy for deadlock that works, and that is a proper understanding of the requirements and a well-executed design. Throwing in some queues and hoping they don't fill up does not qualify as either.

My point exactly. It is no different at all. I think it illustrates clearly that a queue (buffer) does not solve the problem for an "unknown number of messages". The "unknown number of messages" is the problem. It should be obvious that no system can be designed to function reliably if the design constraints are unknown. Once the design constraints are known, unbounded queues are unnecessary (simply allocate a buffer of the correct size).

Here are the questions you should be asking:

  1. Where is 'A' getting its data from?

  2. What is the maximum sustained rate at which 'A's data source can deliver?

  3. What is the peak burst rate at which 'A's data source can deliver?

  4. Given my system's capacities, what is the maximum rate that can be sustained? (If the answer to this question is not greater than the answer to question 2, then pack up and go home, since your system will never work.)

  5. Given my system's capacities, what is the peak burst rate that can be sustained by buffering? (If the answer to this question is not greater than the answer to question 3, then pack up and go home, since your system will never work.)

If you can't answer these questions, then you should also pack up and go home, since (as an engineer) you cannot state authoritatively that your system will work (either you don't know what it is supposed to do, or you are unable to determine whether it is capable of doing it).

If you can answer these questions, then you don’t need queues.

The point is not to buffer (queue) non-deterministically (deterministic buffering is fine). If the data source for 'A' overruns A's buffers, then you essentially have to shut the system down (fail safe), since it has failed as a real-time system: it cannot meet the external deadlines for processing the incoming data (i.e. one of the five questions above was not answered correctly).

The best thread to "know" how much buffering is required for the external data source is 'A' (since it is the interface process), not some generic queue downstream of 'A'.

Using queues as a substitute for understanding your system's design requirements, capacity, and behavior is a recipe for disaster (one that, unfortunately, has played out many, many times).

I have no idea what your system is, but you did state that you don't want it to fail in the field. The only way to ensure that it will not fail in the field is to understand what it is supposed to do and to verify that it can do it. Queues will not aid you in this endeavor; in fact, since they preclude you from using scheduling priorities, they actually impede your ability to ensure that your system can meet its requirements.