When does the QNX6 microkernel freeze?

rahil · January 22, 2011, 8:29am

Greetings,

I am running 50+ custom applications, including drivers, on a QNX6 box. The box sometimes hangs and sometimes restarts by itself.

We suspected the PCI drivers of causing the restarts, possibly because the PCI interrupts are running loose.

I also suspect that the system cannot sustain 100% CPU load. If CPU usage stays at 100% for an extended time, everything freezes.

Can anybody shed some light on why the QNX microkernel, which offers full memory protection and is one of the most reliable operating systems, could freeze and restart?

mario · January 24, 2011, 1:49pm

Most likely a hardware issue of some sort. Are you using a watchdog? Is one of your PCI drivers doing DMA? If so, it could overwrite critical memory (for example, where the kernel lives).

If one of your ISRs goes into an infinite loop, there is nothing the OS can do about that.

maschoen · January 24, 2011, 2:26pm

You didn’t say which version of QNX. It is unlikely that the microkernel is freezing.
If you don’t have a watchdog and the system is rebooting, you probably have a hardware problem.
If you do have a watchdog, you need to think this through a bit. How is that watchdog tickled?
There are three ways I can think of.

1. In an interrupt handler.
   This is a terrible place to tickle a watchdog, as the handler can continue to function even when the entire system is hosed. If this is the case and the watchdog is firing, you probably have a kernel fault.
2. In the highest priority process.
   Not the best place to tickle a watchdog, as it will not detect a high priority (but lower than this) process that has gone into a hard loop.
3. In the lowest priority (other than IDLE) process.
   The problem here is that the watchdog will fire if the system uses 100% of the CPU for a period greater than the watchdog period.

So what can you do if you have the almost impossible situation in which you need to detect a high priority process going into a hard loop in a system that uses 100% of the cpu?

Tickle the watchdog in an interrupt handler, but also have the lowest priority process, in the same data space, update a counter. The interrupt handler then checks this counter and implements a slower-period watchdog that can differentiate between low oxygen (the low priority process runs rarely) and no oxygen (it never runs).
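
A minimal sketch of that idea in portable C, in case it helps. Everything here is an assumption for illustration: the names `soft_watchdog_t`, `low_priority_heartbeat`, `watchdog_tick` and the limit `MAX_STALLED_TICKS` are hypothetical, none of this is a QNX API, and the actual hardware kick is left to the caller because it depends on the board.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed limit: how many consecutive timer ticks the lowest-priority
 * thread may go without running before we stop kicking the hardware
 * watchdog. Tune this to be longer than any legitimate 100% CPU burst. */
#define MAX_STALLED_TICKS 5u

typedef struct {
    volatile uint32_t heartbeat; /* bumped by the lowest-priority thread */
    uint32_t last_seen;          /* heartbeat value seen at the previous tick */
    uint32_t stalled_ticks;      /* consecutive ticks without progress */
} soft_watchdog_t;

/* Called by the lowest-priority (above idle) thread each time it runs. */
static void low_priority_heartbeat(soft_watchdog_t *wd)
{
    assert(wd != NULL);
    wd->heartbeat++;
}

/* Called from the periodic interrupt handler.
 * Returns true if the caller should kick the hardware watchdog on this
 * tick, false if the low-priority thread has been starved for too long
 * and the hardware watchdog should be allowed to expire. */
static bool watchdog_tick(soft_watchdog_t *wd)
{
    assert(wd != NULL);
    uint32_t now = wd->heartbeat;

    if (now != wd->last_seen) {
        /* Low oxygen at worst: the thread ran since the last tick. */
        wd->last_seen = now;
        wd->stalled_ticks = 0;
    } else if (wd->stalled_ticks < MAX_STALLED_TICKS) {
        /* No progress on this tick; count it, saturating at the limit. */
        wd->stalled_ticks++;
    }

    return wd->stalled_ticks < MAX_STALLED_TICKS;
}
```

The interrupt handler would call `watchdog_tick()` on every timer tick and kick the hardware only while it returns true. Short bursts of 100% CPU are tolerated, but a hard loop in a high priority process eventually stops the kicks.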

rahil · January 25, 2011, 7:19am

Yes. I am using a watchdog.

I am not sure whether the PCI driver is doing DMA. But here is what I know:

It uses pci_attach_device with PCI_SHARE, PCI_SEARCH_VENDEV and PCI_INIT_ALL flags.
It calls mmap_device_io() to map the memory.
It uses out8/in8 and other variants for 16 bits & 32 bits.

The PCI driver uses InterruptAttachEvent(), so I suppose it cannot get stuck in an infinite loop inside an ISR. There are some loops in the interrupt thread after the InterruptWait(), but the interrupt is masked at that time and is unmasked only after the loop ends.

rahil · January 25, 2011, 7:28am

I am using QNX 640.

I do have a watchdog. The process that tickles the watchdog is a higher priority one, so I will have to take care of this. To test whether the watchdog could be the reason for the restart, I disabled it, but the system still restarted.

I believe that the reason for this restart is the PCI driver. That said, the system freezes more often than it restarts.

maschoen · January 25, 2011, 3:28pm

If you disable the watchdog and your system restarts by itself, you either have a serious problem with your hardware, or as you suggest, there is a driver problem. I would create a test where you exercise that driver as much as possible and see if you can cause the failure.

mario · January 25, 2011, 3:49pm

If the interrupt thread runs at very high priority, then it’s almost the same as getting stuck in an interrupt loop.

rahil · January 27, 2011, 5:04am

Thanks, maschoen. I am waiting to run that exercise.

rahil · January 27, 2011, 5:13am

Thanks, mario. The priority of the interrupt thread is 30r. I do not think this priority could be the issue, but let me elaborate on the interrupts.
There are actually 4 interrupt threads in the same driver, one for each of the 4 ports it serves. The frequency of the interrupts is very high: they fire almost as soon as they are unmasked.

maschoen · January 27, 2011, 8:32am

In order to have 4 interrupt threads, you would need 4 ports connected to 4 different interrupts. Seems a little unusual, but could be true.

rahil · January 27, 2011, 9:55am

I believe there are two interrupt lines, one for every 2 ports. Anyway, whenever either interrupt line is triggered, both of the interrupt threads on it are woken.

maschoen · January 27, 2011, 1:32pm

How is that possible?

mario · January 27, 2011, 5:33pm

Priority 30 is enough to freeze a machine if the thread uses 100% of the CPU, especially if it is the highest priority in the system.

For maximum efficiency you should have one interrupt thread per hardware interrupt line/number.