Troubleshoot QNX system hang

lullaby · February 26, 2013, 5:42am

Hi all,

I tested my QNX application overnight. It continuously writes raw data to disk, from a start sector to an end sector, and its GUI is Photon based. I enabled sloginfo so that I could read my error prints. When I checked in the morning, both the application and the system had hung; even the system time had stopped advancing after the application hung. I restarted the system and analysed the log file, but I couldn’t see any error prints.

The test PC is quad-core. The application runs multiple threads, and it also calculates the speed of each write call and prints it to a log file. ThreadCtl(), SYSPAGE_ENTRY() and ClockCycles() are called in every loop iteration. I tested the same application on a single-core PC and couldn’t reproduce the hang. Could someone give me some hints for troubleshooting this? I am stuck, as I have no clue how to proceed.

Thanks,
Lullaby

maschoen · February 26, 2013, 8:17pm

Well, the first thing to think about if an application works on a single CPU but not on a multi-CPU system is whether there is a race condition somewhere in the code.

It is possible in QNX for a hung application to hang the GUI, but for that to happen, you have to have pushed the application priorities up high.

Lastly, by sheer coincidence I ran into what might be a related problem yesterday. A dual core system was loaded with QNX 6.5 using Photon, an archive was loaded and make was run. And it was hanging randomly. The problem went away when the video driver was changed from VGA to VESA or the native hardware driver. You might want to check which video driver you are using.

lullaby · February 28, 2013, 2:48am

Hi maschoen,

Thank you for your reply. Is there any way to identify the race condition scenario, other than a source code walk-through and enabling debug prints? I find it difficult because the QNX PC itself hangs, so I can’t even analyse the state of the threads in my application.
Even the system time stops. The system hang does not show up if I test the same Photon based application overnight on a single-core or a dual-core PC.

You have mentioned something like this, right?

So do you mean that if the application works on a single CPU but the system hangs elsewhere, it can be due to a race condition? Is my understanding correct? What happens when the application runs on a quad-core (multi-core) machine? Are you saying that a race condition may not be the reason on a multi-core PC? Could you please share your thoughts on my queries?

Thanks,
Lullaby

maschoen · February 28, 2013, 6:13am

There’s no magic.

Here’s something extreme you can do.
In any permanent loops, put a mutex lock at the beginning and an unlock at the end.
Then for anything not in a loop, like a callback, put in a mutex lock when the routine is entered and unlock when it leaves.
This will force your code to be single threaded.
If the problem goes away, it was a race condition. If not, well something else.
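A minimal sketch of that "one big lock" experiment, using plain POSIX threads. The names big_lock, shared_counter, worker_loop and run_serialized_workers are made up for illustration; in the real application the lock/unlock pair would wrap each existing loop body and callback.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical global lock taken by every loop iteration and callback. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter = 0;   /* stands in for the app's shared state */

struct worker_args { long iterations; };

static void *worker_loop(void *arg)
{
    const struct worker_args *wa = arg;
    for (long i = 0; i < wa->iterations; i++) {
        pthread_mutex_lock(&big_lock);     /* top of the loop body */
        shared_counter++;                  /* ... the real work goes here ... */
        pthread_mutex_unlock(&big_lock);   /* bottom of the loop body */
    }
    return NULL;
}

/* Start nthreads workers, wait for them, and return the final counter.
   Returns -1 if the arguments are out of range or a thread could not start. */
long run_serialized_workers(int nthreads, long iterations)
{
    pthread_t tids[16];
    struct worker_args wa = { iterations };
    int started = 0;

    if (nthreads < 1 || nthreads > 16)
        return -1;
    shared_counter = 0;
    for (; started < nthreads; started++)
        if (pthread_create(&tids[started], NULL, worker_loop, &wa) != 0)
            break;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    return started == nthreads ? shared_counter : -1;
}
```

With the lock in place the counter always ends at nthreads * iterations. Removing the lock/unlock pair is an easy way to see how lost updates show up on a multi-core machine.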

You might want to look into the partition scheduler. It would allow you to reserve some time for a shell that should always work, unless the OS is hosed.

You can have a race condition whether single or multi-core. It becomes easier to trigger a race condition on a multi-core because you can have two threads in your process running at the same time.

This reminds me of one more thing you can try. If you ThreadCtl() your process onto one cpu and use the FIFO scheduling, you should eliminate any race conditions. I suggest this not as a good way to run, but rather as a way to find out if the problem is a race condition.
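Roughly, that experiment could look like the sketch below. It assumes the QNX Neutrino ThreadCtl() command _NTO_TCTL_RUNMASK, which sets the run mask of the calling thread only, so each thread would call the helper when it starts. The helper names runmask_for_cpu and pin_self_fifo are made up, and the QNX-only call is guarded so the file still compiles elsewhere.

```c
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <string.h>
#ifdef __QNXNTO__
#include <sys/neutrino.h>
#endif

/* Run mask with only bit `cpu` set (bit 0 = first core); 0 if out of range. */
unsigned runmask_for_cpu(unsigned cpu)
{
    return cpu < 32 ? 1u << cpu : 0u;
}

/* Pin the calling thread to `cpu` and switch it to SCHED_FIFO at `priority`.
   Returns 0 on success, -1 on failure. Diagnostic use only. */
int pin_self_fifo(unsigned cpu, int priority)
{
    struct sched_param sp;

#ifdef __QNXNTO__
    if (ThreadCtl(_NTO_TCTL_RUNMASK,
                  (void *)(uintptr_t)runmask_for_cpu(cpu)) == -1)
        return -1;
#else
    (void)cpu;   /* the affinity call is QNX-specific; skipped elsewhere */
#endif
    memset(&sp, 0, sizeof sp);
    sp.sched_priority = priority;
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp) == 0 ? 0 : -1;
}
```

If every thread calls pin_self_fifo(0, prio) with the same priority, they only switch at blocking points, which is what makes this useful as a race-condition test rather than as a way to run in production.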

lullaby · February 28, 2013, 6:26am

Thank you a lot for your suggestions, Maschoen.
I will try it out and let you know the results.

Thanks,
Lullaby

Tim · February 28, 2013, 5:40pm

There is something else you can do.

Before running your app you can spawn one terminal (since you are running in Photon) at a REALLY high priority (30+). That way if your app is merely consuming 100% of the CPU as opposed to hanging the O/S you’ll still be able to run commands like hogs/sin/pidin in this high priority terminal to see what’s going on.

The ‘renice’ command is handy for increasing the priority of an existing terminal spawned at normal priority.

Tim

mario · February 28, 2013, 8:41pm

Can you give more details about “even system time hangs?”

mario · February 28, 2013, 11:01pm

Your usage of ThreadCtl/ClockCycles is creating some level of “chaos” in the system. Your interpretation of the documentation isn’t bad but not quite right.

When you intend to use ClockCycles() in a thread, you should “lock” that thread to a core ahead of time, when the thread is started, and leave it there for the whole life of the thread. Doing it just before ClockCycles() creates all sorts of disruptive behavior. First, if the thread is running on core 3 but you assign it to core 0, you are forcing a thread migration; that isn’t good and it will affect performance. Whatever number you get will not reflect a “real case scenario”. If another thread is doing the same thing and is also being assigned to core 0, then you have contention: only one of the threads can run, whereas before each of them was happily running on its own core.
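As a sketch of "pin once, then measure": the thread below sets its run mask a single time at start-up and only calls ClockCycles() inside the loop. It assumes the QNX calls ThreadCtl(_NTO_TCTL_RUNMASK, ...), ClockCycles() and SYSPAGE_ENTRY(qtime)->cycles_per_sec; the names mb_per_second, timed_writer and the 1 MB chunk size are invented for illustration, and the write call itself is left as a placeholder.

```c
#include <stdint.h>
#include <stdio.h>
#ifdef __QNXNTO__
#include <sys/neutrino.h>
#include <sys/syspage.h>
#endif

/* Convert a byte count and a cycle delta into MB/s (0.0 if undefined). */
double mb_per_second(uint64_t bytes, uint64_t cycles, uint64_t cycles_per_sec)
{
    if (cycles == 0 || cycles_per_sec == 0)
        return 0.0;
    double seconds = (double)cycles / (double)cycles_per_sec;
    return ((double)bytes / (1024.0 * 1024.0)) / seconds;
}

#ifdef __QNXNTO__
/* Thread entry point. `arg` points to the core number for this thread;
   giving each timing thread its own core avoids the contention on core 0. */
void *timed_writer(void *arg)
{
    unsigned cpu = *(const unsigned *)arg;
    const uint64_t chunk = 1024 * 1024;                    /* hypothetical size */
    uint64_t cps = SYSPAGE_ENTRY(qtime)->cycles_per_sec;   /* read once */

    /* Set the affinity once, for the whole life of the thread. */
    if (ThreadCtl(_NTO_TCTL_RUNMASK, (void *)(uintptr_t)(1u << cpu)) == -1)
        return NULL;

    for (int i = 0; i < 1000; i++) {
        uint64_t t0 = ClockCycles();
        /* ... the application's write call for one chunk goes here ... */
        uint64_t t1 = ClockCycles();
        printf("write %d: %.2f MB/s\n", i, mb_per_second(chunk, t1 - t0, cps));
    }
    return NULL;
}
#endif
```

The conversion is kept in a separate function so that the loop contains nothing but the two ClockCycles() reads around the write.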

Check the model of CPU you are using; most modern x86 CPUs have their RDTSC counters synchronised and unaffected by things like SpeedStep. They even have a mode that ensures RDTSC can be used to measure time and not clock cycles (which is what it was designed for in the beginning). We have been using the Xeon family for quite a while and never had to worry about that.

Another solution if you don’t do lots of reading per second is to write a “ClockCycles Server”. You get that server’s affinity set to one core and have any process wanting to get the time to send a message to it.

Playing with thread affinity is in general a very bad idea unless one possesses great knowledge of CPU and OS architectures.

lullaby · March 2, 2013, 5:42am

Hi all,

A little update. I think I forgot to mention one thing. My multi-threaded application continuously writes to disk sectors; calculates speed; writes the log to a file; also updates the log to a PtMultiText continuously. One non-photon thread is doing log file write and multitext update. I am using PtMultiTextModifyText() function to add/delete entries from multi-text. That is, when the log entries reach some limit in the PtMultiText widget, I need to delete some old entries from the PtMultiText widget and this works periodically. I have done updates to PtMultiText widget within a PtEnter()…PtLeave() block.
This is my pseudo code:
PtEnter(0)
if deletion condition is satisfied
{
    for loop to get the length of 100 lines
        PtMultiTextInfo()

    PtMultiTextModifyText - to delete those 100 lines
}
PtMultiTextModifyText - to add the new line
PtSetResource(vertical scroll pos)
PtLeave()

Even now, if my application is left running continuously, it hangs after some time on an Intel Xeon E5540 machine. There are no issues when testing on a single-core or dual-core machine. I couldn’t find any possibility of a race condition in my application on code analysis. I now suspect there may be some issue with the continuous updates to PtMultiText. Reading through the QNX help pages, I saw functions like PtHold(), PtRelease(), PtContainerHold(), PtContainerRelease() etc. The help says these are used to force or delay display updates. I was not aware of any of these functions and I haven’t used any of them in my application.

My question is: could the system hang occur when my non-Photon thread is constantly updating the PtMultiText and the display is suddenly updated? What is the purpose of PtHold(), PtRelease(), PtContainerHold(), PtContainerRelease() etc.? I haven’t understood it clearly. Is the hang in my application caused by the absence of these functions?

Could you please share your thoughts on this issue?

Thanks,
Lullaby

maschoen · March 2, 2013, 6:59am

Just something to try.

I never understood exactly why, but I think you should code the PtEnter(), PtLeave() this way.

int rc;

rc = PtEnter(0);

…

PtLeave(rc);

It usually works without this, but try it and see if your problem changes any.

lullaby · March 4, 2013, 6:36am

Hi Maschoen,

This is an excellent clue and thank you a lot for pointing out my mistake. I have updated the source code and put it for continuous run in the quad-core machine. I will get back to you with its results tomorrow.

Thanks,
Lullaby

lullaby · March 5, 2013, 6:27am

Hi all,

Our QNX system hang issue is finally solved. The problem was with the argument used in the PtLeave() call: I had passed zero instead of the return value from the PtEnter(0) call. I modified PtLeave() as Maschoen suggested in the earlier post, and the application has now run continuously on the quad-core machine for 25 hours.
Thank you a lot !!!

Thanks,
Lullaby