Doubt in mq_send()

lullaby · April 20, 2013, 9:25am

Hi all,

We are trying to analyse a system hang issue on a multicore machine.

We have a QNX application that uses message passing, pulses, semaphores, mutexes, message queues, signals, etc. The application has two processes. One of them has almost 76 threads for a worst-case user input. The other is single-threaded and runs as a resource manager. All these threads form a complex communication path, but we ensure that message passing does not occur in both directions (upward and downward), to avoid a deadlock condition. We have used pulses in both directions (within the first process). We also have a Photon-based GUI that is updated as something like a logging mechanism. A message queue is also used between threads: 33 threads send messages to a single message queue, and one thread receives messages from it and does MsgSendv() to a resource manager.

Now the application hangs on a quad-core machine on continuous run.

As per the following QNX post,
openqnx.com/phpbbforum/viewtopic.php?t=7713
they say QNX message queues also can produce a deadlock scenario.

On analysing the application, we see that the message queue gets full many times. We haven’t used the O_NONBLOCK flag for the threads that send messages to the message queue, so if the queue gets full, these threads will block. So my doubt is: is it possible that the queue gets full while the receiver of the queue is blocked on the resource manager and cannot receive any message from the queue? This would cause the 33 threads to block, as they are waiting to send messages to the queue, and would cause some other related threads to block as well.
Can this queue full situation also cause a QNX system hang?

How can we avoid a queue full condition? In our application, even 1000 messages can be generated in 1ms. We have set the queue size to 1500, so we experience a high rate of queue full conditions. Can we set an “unlimited” value for the maximum number of messages in the queue?

Thanks,
Lullaby

maschoen · April 20, 2013, 6:16pm

If your processes are blocking and not running in a hard loop, you should be able to run pidin and see what everything is waiting on.
It does sound like you have a deadly embrace somewhere. The way such things are removed is by analysis of the data flow.
With a quad-core it is also possible you have a race condition. This can be checked by temporarily forcing all threads to run on one processor and seeing if the problem goes away.
[quote]
As per the following QNX post,
http://www.openqnx.com/phpbbforum/viewtopic.php?t=7713
they say QNX message queues also can produce a deadlock scenario.
[/quote]
I didn’t see that in the discussion. I saw a statement that queues do not by themselves remove a deadlock, just delay it. That is true.
But queues themselves do not cause a deadlock. Incorrect design does.

It’s not a QNX hang. It’s a hang designed into your application. A queue uses the producer-consumer model. If the producer produces faster than the consumer consumes, the producer will be temporarily blocked waiting for the consumer to clear room in the queue. If temporarily blocking the producer this way is unacceptable, then there was no point in using a queue (instead of straight message passing). In such a situation, a queue only makes sense if both of these conditions hold:

Production is bursty.
The queue can be made big enough that consumption can always catch up before the queue fills.

If you can’t achieve this, then either there is something wrong with your design, or your system doesn’t have enough resources.

Again, this has little to do with QNX. You mention being able to generate 1000 messages in 1ms. Is this constant or bursty, and if the latter, what is the profile? Getting 1000 messages in 1ms followed by 10 seconds of quiet is very different from 1000 messages in 1ms followed by just 1ms of quiet.
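To make the burst-profile point concrete, here is a rough back-of-the-envelope sketch in C. It assumes a deliberately simple model: identical bursts, a consumer that drains at a constant rate, and a queue that starts empty. The function name and the drain rate of 100 messages per ms used in the note below are hypothetical, not figures from this thread.

```c
#include <assert.h>

/*
 * Simplified model (an assumption, not a measurement): each cycle is a burst
 * of `burst_msgs` messages spread over `burst_ms`, followed by `quiet_ms` of
 * silence, while the consumer drains `drain_per_ms` messages every ms.
 *
 * Returns the 1-based cycle in which the backlog first exceeds `capacity`,
 * or 0 if that does not happen within `max_cycles` cycles.
 */
long cycles_until_full(long capacity, long burst_msgs, long burst_ms,
                       long quiet_ms, long drain_per_ms, long max_cycles)
{
    assert(capacity > 0 && burst_msgs >= 0 && burst_ms > 0);
    assert(quiet_ms >= 0 && drain_per_ms >= 0 && max_cycles > 0);

    long backlog = 0;
    for (long cycle = 1; cycle <= max_cycles; cycle++) {
        /* End of burst: what was produced minus what was drained meanwhile. */
        backlog += burst_msgs - drain_per_ms * burst_ms;
        if (backlog < 0)
            backlog = 0;
        if (backlog > capacity)
            return cycle;

        /* Quiet period: the consumer keeps draining. */
        backlog -= drain_per_ms * quiet_ms;
        if (backlog < 0)
            backlog = 0;
    }
    return 0;
}
```

With a queue of 1500 and the hypothetical drain rate of 100 messages per ms, `cycles_until_full(1500, 1000, 1, 10000, 100, 1000)` never fills the queue, while `cycles_until_full(1500, 1000, 1, 1, 100, 1000)` overflows in the second burst. The real numbers have to come from measuring the consumer.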

How long does it take to consume these messages? Is there any way to flow control the production?

mario · April 20, 2013, 10:42pm

I have said it a few times, here we go again. Learn to use the System Profiler. There is nothing this tool can’t answer, aside from the meaning of life, which by the way is NOT 42.

You keep mixing the expressions “system hang” and “application hang”. These are VERY different things and are addressed with distinct approaches. Please clarify.

You say there are 33 threads sending to a single message queue. I hope you realize that, in a multi-core scenario, there is no way a reader can keep up if these 33 threads send too much stuff too fast. If there were no limit, mq/mqueue would eventually run out of memory. And that’s not even talking about the resource manager the data is relayed to. Even if you put quotes around the word “unlimited”, such a concept still doesn’t make any sense. If the architecture generates too much data, then the architecture is broken and needs to be fixed.

I do not have any experience with mqueue/mq because, when there is a need to queue stuff, I let a thread in the receiving app handle that. Each app gets its own thread queue, with far fewer context switches, less cache thrashing, etc. It is also easier to implement data flushing in case of data overload.
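As a rough illustration of that idea (not the actual code used here), the following C sketch shows a small in-process queue that a receiving thread could fill and a worker thread could drain. On overflow it flushes the oldest entry instead of blocking the producer. The names, the `int` payload and the tiny capacity are assumptions made for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_CAPACITY 4  /* deliberately tiny, for illustration only */

struct ring {
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
    int             items[RING_CAPACITY];
    size_t          head;     /* index of the oldest item */
    size_t          count;    /* number of items currently stored */
    unsigned long   dropped;  /* how many items were flushed on overflow */
};

void ring_init(struct ring *r)
{
    assert(r != NULL);
    pthread_mutex_init(&r->lock, NULL);
    pthread_cond_init(&r->not_empty, NULL);
    r->head = 0;
    r->count = 0;
    r->dropped = 0;
}

/* Never blocks. When full, discards the oldest item to make room.
 * Returns true if an item had to be dropped. */
bool ring_push(struct ring *r, int item)
{
    bool dropped = false;
    pthread_mutex_lock(&r->lock);
    if (r->count == RING_CAPACITY) {
        r->head = (r->head + 1) % RING_CAPACITY;
        r->count--;
        r->dropped++;
        dropped = true;
    }
    r->items[(r->head + r->count) % RING_CAPACITY] = item;
    r->count++;
    pthread_cond_signal(&r->not_empty);
    pthread_mutex_unlock(&r->lock);
    return dropped;
}

/* Blocks the consuming thread until an item is available. */
int ring_pop(struct ring *r)
{
    pthread_mutex_lock(&r->lock);
    while (r->count == 0)
        pthread_cond_wait(&r->not_empty, &r->lock);
    int item = r->items[r->head];
    r->head = (r->head + 1) % RING_CAPACITY;
    r->count--;
    pthread_mutex_unlock(&r->lock);
    return item;
}
```

Dropping the oldest entry is only one possible overload policy. Dropping the newest, or coalescing repeated log lines, may suit a logging path better. Either way the `dropped` counter makes the overload visible.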

You might want to try mq instead of mqueue, which should improve performance quite a bit, but I am not sure it will be included in the next release of QNX, as the async stuff on which mq is based is supposed to be deprecated.

lullaby · April 22, 2013, 6:04am

Sorry, it was a typo. Actually the issue is a QNX system hang.
I have already tried locking all the threads to a single core. The QNX system still hung.
I have also tried running all the threads at the same priority. The QNX system still hung.
Now the code has been modified so that MsgSendv() and MsgReply() exist only for the resource manager that is used. Is there any possibility that a combination of a queue full condition and a resource manager that can’t reply to its client will lead to a deadlock and a QNX system hang? My observation is that even the system time does not change during the QNX system hang.

Thanks,
Lullaby.

maschoen · April 22, 2013, 8:04am

I think it’s very unlikely.

mario · April 22, 2013, 3:16pm

How do you come to that conclusion?

Tim · April 22, 2013, 4:55pm

This is just a guess based on the other question about sending 1000 log messages to the GUI console in 1 ms. But I’m guessing he means the Date/Time shown in Photon in the lower right corner. If that’s the case my next guess is that there are 1000 pending screen updates sent to Photon in 1 ms and that these are queued up taking a VERY long time to redraw if there isn’t a native driver for the video card. This can make it seem like QNX is hung when it’s just the Photon driver that is very busy trying to catch up to all those screen redraws.

KGB

maschoen · April 22, 2013, 6:38pm

A possibility. Turning off the spigot and waiting a minute should indicate one way or another.

lullaby · April 23, 2013, 4:20am

Yes. Tim’s interpretation is correct.
The system time in the lower right corner of the Photon desktop
was not changing. There were no mouse movements.
No keyboard input. No network connectivity… Nothing. Only a system
restart works. When I ran the application for continuous testing last Saturday
and checked it on Monday morning, the system time hadn’t changed since Saturday evening.
So if the system gets into that state, it can’t recover. So I guess it is a QNX system hang.

Thanks,
Lullaby

maschoen · April 23, 2013, 4:48am

I can see we might have some terminology issues so I’ll be precise.

An application hang is caused by an application. Stopping the application if possible will leave a functioning system.

A system hang is caused by a problem with the OS, meaning the kernel, a system process or a driver.

The fact that the system is frozen doesn’t mean you have a system hang. You can freeze everything up with your application. What you describe sounds like you are over-taxing some Photon component, which makes it difficult or impossible to do anything. There are ways to check this. For example, if you hook up a serial terminal and raise the priority of the shell you are running, you should be able to do a pidin even though the system seems frozen. The same might be true for a telnet session.

mario · April 23, 2013, 2:17pm

As Tim mentioned, terminology is important. The system time is the time as maintained by the kernel. If your Photon application stops showing/updating the time, it doesn’t mean that the system time is not being maintained by the kernel.

If you press num-lock on the keyboard, does the status LED, assuming there is one, change state?

lullaby · April 24, 2013, 11:10am

Hi all,

No Status Led reaction if I press NumLock/Capslock during the hang state.
I have rechecked the scenario.
Also by system time, I mean, the time which is getting updated at the lower right corner of the Photon desktop.
But I have a trivial query. My application threads are running at priorities 10, 11 and 12. My Photon thread is running at 10. Checking the io-graphics and io-display driver threads in the System Information perspective, I see that they are also running at 10, 11 and 12. So is there any chance that my Photon application running at the same priority causes a Photon crash/hang, due to the heavy updates in one of the views and due to the priority settings?
Please share your thoughts…

Thanks,
Lullaby

mario · April 24, 2013, 3:28pm

That bad ;-)

Thanks for the clarification

Aside from a kernel crash or code stuck in an interrupt, the machine should never hang, whatever you do. Of course there is the matter of process priority, but in those cases the machine isn’t hung, it is just unresponsive. Also, Photon is a server, hence it will process a request at the priority of the process that made the request. Its priority will float.

Does the machine have some sort of networking?

Tim · April 24, 2013, 4:30pm

Or do you have some custom hardware plugged into this machine (PCI card, USB device, etc.)? You mentioned getting 1000 log messages in 1 ms, so I’m curious to know where they are coming from, as I assume it’s some external source.

Tim

lullaby · April 25, 2013, 7:03am

Hi,
LAN networking is enabled, but no external devices are connected.
At the time of the system hang, the network is also gone.
The logs are only of the hard-drive sector health-checking type.

Thanks,
Lullaby.

mario · April 26, 2013, 3:53pm

Then I would use a process of elimination to figure out what operation is causing grief.

There might be a kernel dump, but because Photon is running you can’t see the error messages on console 1. I have never used this myself, but you might want to use a serial port to let the kernel output error messages there.

Thunderblade · May 7, 2013, 8:20am

What video driver are you using? I heard that vga/vesabios can cause problems on some systems where the GPU is using system RAM. So there might indeed be a system hang.