Has Anyone Seen QNX Freeze?

Has anyone experienced, seen, or heard of a freeze of the QNX scheduler,
such that all application programs freeze?

We are using QNX 6.1 in an embedded application with a GUI using Photon. The
board we are using is a PC-on-a-board with a hardware watchdog to reset the
board and restart all apps if certain critical applications don’t run for a
period of time. We are taxing the realtime capabilities of the board we are
using, but the devices are just running steady-state (not at their peak
loading) when the reset condition occurs. It happens to a small number of our
installed base, perhaps 1%, and I’ve not been able to induce it to happen in
my lab.

With some diagnostic features we built in, we are seeing all of our apps
freeze for an unknown reason. From the traces left, they all seem to stop
updating their logs at the same time, and the hardware watchdog then resets
the board a couple of minutes later (as it is supposed to do).

Is there anything that can cause the QNX scheduler to freeze or die? Any
known failure mode where QNX just stops running? What happens to the QNX
kernel when the processor is using all of its realtime capacity - can it start eating
up resources in saved contexts or backlogged functions?

As you can imagine, this is a very frustrating problem!

Hi Jeff:

This very much sounds like a high priority process that is always ready
to run and is preventing your other application(s) from running. What are the
priorities of the processes on the system? If, for example, you
increase the priority of the watchdog kicker process, does the watchdog
still reset the system? What you typically have to do in these cases is
run a shell at a very high priority to see who’s consuming CPU and
starving the other processes.

Hope that helps,
Robert.


Jeff Maass <jmaass@layerzero.com> wrote:

The QNX scheduler doesn’t so much “run periodically”, or in a way that makes
it sensible to talk about it “getting to run”. It is more fundamental than
that: every time the kernel is about to return to an application (after a
kernel call or interrupt has happened), the kernel does a scheduling
evaluation to determine which thread it is returning to.

The symptoms you describe here are most often caused by one of two problems:

  1. a high priority thread runs unbounded, so that every time the scheduler
    goes to choose, this particular thread is chosen.

  2. a hardware interrupt is firing constantly, and 100% of the CPU time is
    spent in servicing this interrupt.

Problem 1 is generally caused by a system design or coding error.

Problem 2 is sometimes caused by a software design error, and sometimes by
broken hardware – either in an I/O card, or more rarely in a PIC.
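To make the first failure mode concrete, here is a minimal sketch in C of a toy priority scheduler. It is not QNX code: the sim_thread structure, pick_next() and simulate() are hypothetical names invented for this illustration, and the model assumes every thread is woken once per tick (for example by a periodic timer). It only shows why a thread that never blocks wins every scheduling decision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a thread; not a QNX data structure. */
struct sim_thread {
    int      priority;     /* higher number = higher priority */
    bool     never_blocks; /* true models a runaway thread */
    bool     ready;        /* true if the thread wants the CPU */
    unsigned runs;         /* how many times the scheduler chose it */
};

/* Return the index of the highest-priority ready thread, or -1 if none. */
int pick_next(const struct sim_thread *t, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (t[i].ready && (best < 0 || t[i].priority > t[best].priority))
            best = (int)i;
    }
    return best;
}

/* Run `ticks` periods. At the start of each tick every thread becomes
 * ready; the scheduler then makes up to `slices` decisions. A
 * well-behaved thread blocks after one slice, a runaway one does not. */
void simulate(struct sim_thread *t, size_t n, unsigned ticks, unsigned slices)
{
    assert(t != NULL && n > 0);
    for (unsigned tick = 0; tick < ticks; tick++) {
        for (size_t i = 0; i < n; i++)
            t[i].ready = true;
        for (unsigned s = 0; s < slices; s++) {
            int i = pick_next(t, n);
            if (i < 0)
                break;               /* nothing ready: CPU is idle */
            t[i].runs++;
            if (!t[i].never_blocks)
                t[i].ready = false;  /* blocks until the next tick */
        }
    }
}
```

With three threads at priorities 10, 12 and 20, where only the priority 20 thread never blocks, the two lower-priority threads are never chosen, which matches the symptom of every application stopping its logging at the same moment.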

If the first is the problem, having a maximum priority command line shell
to examine the system, and/or boosting the process that services the hardware
watchdog to the maximum priority so that it can watch/log/detect this, could
be helpful. However, the maximum priority shell is less useful if you can’t
reproduce the problem in your own testing, since you may not be able to get
to the site of the failure before the reboot to try and debug anything.
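For field units that cannot be watched interactively, the watch/log/detect idea could look something like the sketch below. It assumes each application thread records a heartbeat timestamp, and that a monitor running at a higher priority than everything it watches checks those timestamps and logs the stale ones before the hardware watchdog fires. The heartbeat structure and the hb_beat() and hb_find_stale() functions are hypothetical names for this illustration, not a QNX API, and timestamps are assumed to come from a monotonic millisecond clock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task heartbeat record. */
struct heartbeat {
    const char *name;         /* task name, for the log */
    uint64_t    last_beat_ms; /* time of the most recent heartbeat */
};

/* Called by each application thread on every pass through its main loop. */
void hb_beat(struct heartbeat *hb, uint64_t now_ms)
{
    hb->last_beat_ms = now_ms;
}

/* Called by the high-priority monitor. Stores the indices of tasks whose
 * last heartbeat is older than timeout_ms into stale[] (at most stale_cap
 * entries) and returns how many were stored. */
size_t hb_find_stale(const struct heartbeat *tasks, size_t n,
                     uint64_t now_ms, uint64_t timeout_ms,
                     size_t *stale, size_t stale_cap)
{
    assert(tasks != NULL && stale != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n && count < stale_cap; i++) {
        /* Guard against a beat recorded after now_ms was sampled. */
        if (now_ms > tasks[i].last_beat_ms &&
            now_ms - tasks[i].last_beat_ms > timeout_ms)
            stale[count++] = i;
    }
    return count;
}
```

If every task shows up as stale at once while the monitor itself still runs, that points toward a runaway thread at a priority between the applications and the monitor. If the monitor stops as well, an interrupt problem becomes more likely.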

-David

David Gibbs
QNX Training Services
dagibbs@qnx.com

It’s quite possible to get into situation 2) via interrupt configuration
problems. If the interrupt from device A is processed by a handler for B,
the handler for B won’t properly service A, and A will re-interrupt
immediately.

John Nagle