Message queue problem

Hi,

I have come across a problem using message queues.

Requirement:
I need to reclaim a message queue descriptor that has become invalid while the program is running (e.g. the server
fails or restarts).

Problem:
As per my design, the server creates the message queue, a server thread remains receive-blocked waiting for queue
notification pulses, and the queue is unlinked during server termination. The client opens the queue in read/write
mode and saves the descriptor as part of its initialization. The client uses this descriptor throughout its life to send
data to the queue, and the descriptor is closed only at client termination. So I want to recover the message queue
descriptor when it becomes stale (i.e. when the server gets restarted). I used the HA client recovery APIs for this purpose. But as
per the mq_unlink() documentation,
‘If some process has the queue open when the call to mq_unlink() is made, then the actual deletion of the queue is
postponed until it has been closed.’ So I found that mq_send() returns success even after the server has unlinked the
queue, until the client closes the already-opened message queue descriptor. The recovery function is called only when mq_send()
returns a bad-descriptor error, but that error is generated only after the client explicitly closes all opened message
queue descriptors. As per my design, message queue descriptors are closed only at client termination, to avoid the overhead of open/close calls on every send.

My ultimate question:
How can I manage such situations? How can I recover the message queue descriptor when the server terminates, using the HA Client Recovery Library or any other method? Can you please provide a general template for using message queues for IPC, including a recovery mechanism? Please help.

:cry: I am stuck on this, so please help :frowning:

I think you will get an error return from your mq_send(), probably as return code and errno. You have to identify the problem, re-connect (open) the message queue and re-send. This should be pretty simple to code once you know what the rc and errno are.

No error code is obtained, since mq_send() does not return failure at all. You can see this clearly by creating a message queue, using the cat command at the terminal prompt, then unlinking the queue and writing to it to see whether the queue has been deleted. Even after that, if you do an mq_send() and then cat the queue, you can clearly see the data being read. In short, mq_send() does not return failure even after the queue has been unlinked, so no error codes are available. I have already checked that.

The problem with using mqueue is that you aren’t going to receive any notification about failure unless the mqueue server task fails.

In other words your message path is:

yourClient<->mqueue server<->yourServer

Failure to send will only happen if the mqueue server disappears.

To handle the kind of failure you want, you are going to have to register to track the death of yourServer PLUS the failure to send. Then, when either happens, yourClient will have to close the message queue. Note: if you get the failure on the mq_send() you’ll have to restart mqueue itself.

This is why using mqueue instead of native QNX message passing isn’t the greatest of designs in a HAM system.

Tim

Thanks Tim, I couldn’t figure out what he’s talking about. A modified mqueue might be able to handle this condition as follows. Have the queue manager return an error if the queue is not open both for reading and writing. When sent a message it could verify that the reader is still alive and return an error if not. Of course, this would not help if the reader went crazy and wasn’t reading anymore.

On the other hand, just using QNX messages would probably be a better idea. I’ve found that a lot of programmers think “queue” because they have a misunderstanding of what the CPU is typically doing (nothing).

Hi all,

Is the following workaround acceptable for message queue descriptor recovery?

Solution 1: :idea:

The kernel delivers a _PULSE_CODE_COIDDEATH pulse to the client’s channel when the channel that the connection is attached to is destroyed (i.e. the server dies). So the client can explicitly close all opened message queue descriptors and reopen the message queue to get a new descriptor.

Dependency 1.a):

All clients need a channel in the receive-blocked state to receive the kernel’s server-death pulse for message queue descriptor recovery :!:
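A rough, QNX-specific sketch of such a watcher thread (recover_queue() is a hypothetical helper that closes and reopens the queue; this is only an outline of the idea, not tested code, and will not build on non-QNX systems):

```c
#include <sys/neutrino.h>
#include <mqueue.h>

/* hypothetical helper: mq_close() the stale descriptor, mq_open() again */
extern mqd_t recover_queue(mqd_t old);

void pulse_watcher(int chid, mqd_t *qd)
{
    struct _pulse pulse;

    for (;;) {
        /* stay receive-blocked on the channel */
        if (MsgReceivePulse(chid, &pulse, sizeof pulse, NULL) == -1)
            continue;

        if (pulse.code == _PULSE_CODE_COIDDEATH) {
            /* the connection to the server is gone, so the
             * message queue descriptor is now stale */
            *qd = recover_queue(*qd);
        }
    }
}
```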

Solution 2: :bulb:

Let a third process, say a central controller, create the queue. Client and server open the queue in read/write mode, and the server does not unlink the queue when it terminates, so the descriptor never becomes invalid. The queue is unlinked only by the central controller as part of cleanup, which in turn results in a full system shutdown.

Issue 2.a):

The server never gets the queue notification pulse, because there may be unprocessed data left in the message queue (mq_notify() fires only when a message arrives on an empty queue).

Solution 2.a.1): If the server has a dedicated thread for queue processing, let it remain blocked in mq_receive(), so that it does not have to wait for any notification pulse.

Solution 2.a.2): Otherwise, let the server explicitly check the number of messages in the queue during initialization. If the count > 0, let it do dummy reads just to clear those unprocessed messages from the queue, so that it gets further notifications when new data is added to the queue.

Please let me know your comments.

Thank You,
Princi

Princi,

I would say that your biggest problem is the one you didn’t identify:

Issue 3)

The Server dies/exits. Clients have no idea the Server has died/exited.

So Clients will send messages that will never get processed and the Client will have no way to know that (assuming that’s important in your system).

This is the problem with using mqueue. You have no guarantee that messages sent by a Client are received by a Server or vice versa in cases where either the Client or Server exits because messages can be left in the queue.

This is why using native QNX message passing is preferable.

Tim

Thank you for the reply, :slight_smile:
So you mean there is no way to recover the invalid message queue descriptor using the client recovery libraries (HA) :question:

I have designed the client in such a way that all resources needed to communicate with a server are opened as part of initialization and closed only at client termination, just to avoid the performance overhead of opening/closing the message queue. Does continuous open/close really cause any overhead :question:
For a graceful exit, the server does all its cleanup work, i.e. it unlinks the queue. But on the client side, mq_send() returns success until the client closes the opened message queue descriptor. This queue is no longer valid, as the server will create a new queue on restart. So the only way to get a valid descriptor is to reopen the message queue, and I need to know when the client should do such recovery. Or should I modify my design to open/close the queue each time it communicates with the server :question:
Or shall I go ahead with solution 2, i.e. have a central controller that creates/unlinks the queue, such that a crash of the central controller means a full system shutdown :question: Please help.

Thank You,
Princi

Princi,

No, I didn’t say that. I merely said it was more difficult.

To recover the descriptors you need to have HA on two things:

  1. The file descriptors (which you have now) in case mqueue dies (a bad file descriptor only tells you that mqueue died).
  2. The server task, so that when the server exits your client can close and re-open the queue.

They cause some overhead, for sure, because the kernel has to tell the mqueue task that your client is connecting to the queue every time you open/close. So mqueue must get some CPU time. If there are a lot of messages being sent (i.e. hundreds a second) this could get expensive overhead-wise. If there are only 5-10 messages a second it probably isn’t an issue.

So how do you handle the case where the server exits while there are still messages from the client in the queue? What happens to those unhandled messages? Does it matter in your system, given that the Client has no idea there are messages that never got processed?

I don’t know what your system does, but if you were say doing database transactions then this entire design is flawed. Given that you are implementing HA, I assume that your system does care about unhandled messages.

This is probably the easiest solution to implement. Especially if you set it up so that if the central controller crashes the entire system shuts down.

Of course you still have to solve the problem of a Server restart. Your idea of throwing away all messages at start time means once again you have Client unhandled messages.

Tim

Thank you Tim for all your valuable suggestions :slight_smile:

Of course you still have to solve the problem of a Server restart. Your idea of throwing away all messages at start time means once again you have Client unhandled messages.

To handle those unhandled messages, the server can check the message count in the queue during initialization and, if count > 0, explicitly notify its queue-processing thread by sending an appropriate pulse.
Is this solution okay :question:
Or shall I go ahead with Solution 2.a.1) :question:
i.e.: if the server has a dedicated thread for queue processing, let it remain blocked in mq_receive(), so that it does not have to wait for any notification pulse. That way all unprocessed data in the queue gets handled and the queue notification overhead is eliminated.

Thank You,
Princi

Hi,

I used the HA framework to monitor the mqueue process for crashes, restart the mqueue process, and re-establish the message queue descriptor.

I tried this scenario with the HAM client recovery libraries. mq_send() returned a bad-descriptor error on mqueue server crash. Having detected the bad file descriptor, the recovery function attached to the message queue was invoked and a new descriptor was obtained as expected. But the value of the newly obtained descriptor is not the same as the old descriptor, which it needs to be for a successful recovery. How can I achieve this? Why am I not getting the old descriptor value :question: :frowning: Please help.

Thank You,
Princi

Princi,

The reason you are not getting the same file descriptor back is because mqueue crashed.

mqueue creates and maintains those file descriptors internally in memory. So when it crashes they all get reset. The only way you could get the same descriptor back would be if all processes created their message queues in the EXACT same order after the crash.

Tim

Hi,

Using ha_reopen() in the recovery function successfully returned a new message queue descriptor with the same value as the old descriptor. But I found that the recovery function is invoked only after adding the following code snippet at the end of the program.
...
void xx()
{
    MsgSend(0, 0, 0, 0, 0);
}
...
On debugging, I found that after control comes out of the recovery function, “Single stepping until exit from function MsgSend,”