Hardware watchdogs

Our new product failed ESD testing when it locked up after application of ESD to a metal connector housing on the rear of the unit. As a short term fix I’ve been asked to implement a watchdog. The pxa270 we’re using has a built-in watchdog timer which asserts RESET. This is known to work since we already use it to implement a programmed warm-boot after a software upgrade.

We don’t know exactly what causes the product to freeze up when ESD strikes, or if the failure mode is the same every time.

Watchdogs always seem to be a pain, but the customer wants one so a discussion of the best way to implement them on a QNX system might be useful.

My thinking is that the lowest priority user process in the system could loop and continuously reload the watchdog countdown timer. If a few seconds were to go by when it didn’t get scheduled the reboot would occur. It seems like this would work in most cases as long as the system has a few idle cycles to spare.
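To make the idea concrete, here is a rough sketch in C of a lowest-priority reloader being starved. Everything in it is an assumption for illustration: `sim_wdt` is a software model of a countdown watchdog, not the PXA270 register interface, and the `cpu_idle` array stands in for "did the lowest priority process get scheduled during this millisecond".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a countdown watchdog.
 * It does not touch the real PXA270 watchdog registers. */
typedef struct {
    uint32_t timeout_ms;    /* value loaded on every reload */
    uint32_t remaining_ms;  /* time left before reset is asserted */
    bool     fired;         /* true once the simulated reset has happened */
} sim_wdt;

void sim_wdt_init(sim_wdt *w, uint32_t timeout_ms)
{
    assert(w != NULL && timeout_ms > 0);
    w->timeout_ms = timeout_ms;
    w->remaining_ms = timeout_ms;
    w->fired = false;
}

/* Reload the countdown; has no effect once the reset has fired. */
void sim_wdt_reload(sim_wdt *w)
{
    if (!w->fired)
        w->remaining_ms = w->timeout_ms;
}

/* Advance simulated time; the watchdog fires when the countdown reaches 0. */
void sim_wdt_tick(sim_wdt *w, uint32_t elapsed_ms)
{
    if (w->fired)
        return;
    if (elapsed_ms >= w->remaining_ms) {
        w->remaining_ms = 0;
        w->fired = true;
    } else {
        w->remaining_ms -= elapsed_ms;
    }
}

/* Step through n_ms milliseconds. The lowest priority reloader only runs
 * in milliseconds where cpu_idle[t] is true. Returns the millisecond in
 * which the reset fires, or -1 if the watchdog never fires. */
int run_low_priority_kicker(sim_wdt *w, const bool *cpu_idle, int n_ms)
{
    for (int t = 0; t < n_ms; t++) {
        if (cpu_idle[t])
            sim_wdt_reload(w);
        sim_wdt_tick(w, 1);
        if (w->fired)
            return t;
    }
    return -1;
}
```

With a 4 ms timeout and idle time only in milliseconds 0 and 3, the model resets in millisecond 6; with idle time in every millisecond it never resets. The same thing happens at the scale of seconds on a real system that runs out of idle cycles.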

There is probably no foolproof way to implement a watchdog but I’d be interested in hearing of successful approaches others might have used to do this sort of thing.

I like to do it a little differently. The watchdog program runs at the highest priority, petting the watchdog, but also monitoring the idle status. If the idle thread doesn’t run for x number of seconds then we know something has gone wrong. You can also add extra monitoring, like low memory. I prefer doing it this way because if the watchdog program decides not to pet the watchdog (say, because of low memory) then it can log that information somewhere. If it’s very low priority and a reboot occurs because the program didn’t get CPU time, then it becomes hard to find out why the machine rebooted.
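A minimal sketch of that decision logic in C might look like the following. The names are made up for illustration, and it assumes the idle status is available as a counter that some lowest priority thread increments whenever it runs; how you actually obtain idle time and free memory on your system is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What the high priority monitor decides on each pass. */
typedef enum {
    WD_PET,                /* everything looks fine, pet the watchdog */
    WD_HOLD_IDLE_STARVED,  /* idle thread has not run for too long */
    WD_HOLD_LOW_MEMORY     /* free memory is below the configured floor */
} wd_decision;

/* Hypothetical monitor state; all field names are illustrative. */
typedef struct {
    uint64_t last_idle_count;     /* idle counter value last seen */
    uint32_t last_idle_change_s;  /* time the counter last changed */
    uint32_t max_idle_stall_s;    /* the "x number of seconds" limit */
    uint64_t min_free_bytes;      /* low memory threshold */
} wd_monitor;

/* Called periodically from the highest priority watchdog thread.
 * idle_count is assumed to be incremented by a lowest priority thread. */
wd_decision wd_monitor_check(wd_monitor *m, uint64_t idle_count,
                             uint64_t free_bytes, uint32_t now_s)
{
    assert(m != NULL);
    if (idle_count != m->last_idle_count) {
        m->last_idle_count = idle_count;
        m->last_idle_change_s = now_s;
    }
    if (now_s - m->last_idle_change_s >= m->max_idle_stall_s)
        return WD_HOLD_IDLE_STARVED;
    if (free_bytes < m->min_free_bytes)
        return WD_HOLD_LOW_MEMORY;
    return WD_PET;
}

/* Text to write to a log before the reset happens. */
const char *wd_decision_text(wd_decision d)
{
    switch (d) {
    case WD_PET:               return "ok";
    case WD_HOLD_IDLE_STARVED: return "idle thread starved";
    case WD_HOLD_LOW_MEMORY:   return "low memory";
    }
    return "unknown";
}
```

The caller pets the hardware watchdog only when the result is `WD_PET`; for anything else it logs `wd_decision_text()` and lets the countdown expire, so the reason for the reboot is recorded.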

Or you could use APS (if you can afford it) and run the watchdog program at a low priority but ensure that it gets a minimum of 1% of CPU time; then you can add as many monitoring tools as you want ;-)

Mario, that certainly seems like an improvement to my original idea. It wouldn’t be able to log the event where an interrupt goes awry but it would cover other cases. During development a few times we’ve seen the processor get into a wonky state where execution of any instruction causes a data abort interrupt, followed by a return, followed by another data abort, etc, etc, etc. As far as I know this has never happened in the field but one never can tell…

A good question might be, what are you watching for? A double fault is a very different animal from a process in a loop hogging resources. Both stop your system, but one can be dealt with by a software watchdog, making the system much more robust. Not as easy to implement as a hardware watchdog, of course. So my preference would be to update the watchdog in the timer interrupt, and to watch for out of control processes by another method. Here is how you can do this.

Each potentially dangerous process is started by a supervisor that runs at the highest application priority. Once started, each process has some period during which it must check in. This is like tickling the watchdog. If it doesn’t check in, it is killed and restarted.
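The bookkeeping side of such a supervisor could be sketched in C roughly as below. This only models the check-in table; the `restart` callback is a hypothetical hook where a real supervisor would kill and respawn the process, and how processes deliver their check-ins (message, pulse, shared memory) is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One supervised process. All names here are illustrative. */
typedef struct {
    const char *name;
    uint32_t    period_ms;        /* must check in at least this often */
    uint32_t    last_checkin_ms;  /* time of the most recent check-in */
    unsigned    restarts;         /* how often it has been restarted */
} supervised;

/* Called when a supervised process reports that it is alive. */
void sup_checkin(supervised *p, uint32_t now_ms)
{
    assert(p != NULL);
    p->last_checkin_ms = now_ms;
}

/* Scan the table; any process that missed its deadline is handed to the
 * restart callback (may be NULL) and given a fresh grace period.
 * Returns the number of processes restarted in this scan. */
unsigned sup_scan(supervised *tab, size_t n, uint32_t now_ms,
                  void (*restart)(const char *name))
{
    unsigned count = 0;
    assert(tab != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (now_ms - tab[i].last_checkin_ms > tab[i].period_ms) {
            if (restart != NULL)
                restart(tab[i].name);
            tab[i].restarts++;
            tab[i].last_checkin_ms = now_ms;
            count++;
        }
    }
    return count;
}
```

The supervisor would call `sup_scan()` from its own periodic loop, at a priority above everything it supervises, so that a runaway process cannot stop the scan from happening.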

We don’t know what we’re watching for, that’s the trouble.

When ESD hits, the power button won’t work, so the meter can’t be powered down. The recovery method is to unplug the meter and remove the battery. If the power button worked I think ESD testing would pass.

Since the power button runs at the highest priority (to allow power down in case of other software errors), and power down code is invoked by interrupt, it could be that the keypad controller is broken, or that interrupts aren’t working (the dreaded stream of Data Abort interrupts I described would explain this) or that memory addressing is hosed (causing data aborts) or some other hardware fault. Since our goal is to simply pass ESD testing a hardware watchdog would be adequate.

This is a lab instrument and not a mission critical medical device.

This is a perfect application for Adaptive Partitioning. Create a watchdog partition, give it 1% CPU, put your watchdog stroker in the partition (the priority doesn’t matter, as long as 1% CPU is enough to allow it to stroke at a sufficient rate).

I guess I don’t know what ESD is. It sounds like some kind of external mechanical or electrical stress on the system. If that is so, you are probably toast when it hits, so a hardware watchdog is the way to go. I wouldn’t be afraid that after this failure an interrupt handler is still working well enough to keep tickling the watchdog. The only other way I can see this playing out is if the ESD causes a driver to go into a loop. That’s pretty unlikely, and suggests an immediate fix. Protect against any potentially infinite loops.

ESD == Electro-Static Discharge

I don’t think Ken is worried about continued tickling of the watchdog when an ESD occurs, but that he is concerned with being able to stroke the watchdog deterministically during normal operation (no false resets). This is where AP provides the solution, as it allows you to create a “logical” separate CPU (still hard real-time) with enough power to ensure that the watchdog will be stroked deterministically. Then when ESD does strike, the processor stops and the system resets. Presumably, it is desirable to detect and reset as quickly as possible. Using AP a 100ms stroke period with no chance of false resets is easily doable…

Ok, well I second my comments then. If an ESD is stopping things, then the cpu is no doubt hosed. I’d put the tickler in an interrupt handler. The logical CPU will work fine too, but unless you have other needs for this feature, it is a bit of overkill.

Thanks for the useful comments.

As I understand it the ESD test applies a static discharge to the unit and the test only passes if the unit is not rendered inoperable. After ESD the unit does not have to continue to function normally but it must be recoverable without extreme measures - a watchdog reset or a user power down/up cycle is acceptable. Our test failed because the power button would no longer function and the battery had to be removed and reinserted before it would work again.

AP sounds like a good solution but it does seem like quite an extra expense just for this purpose. The product is also already shipping so AP would be a little late to the party. This particular variation of the product is not yet shipping but it’s the same unit with the addition of a GPIB interface, and ESD to the GPIB connector is what causes the trouble.

Simply running a watchdog process at the highest application priority and reloading the watchdog countdown register 2-3 times/second with a one second timeout value should work. The speed of reset is not critical. It seems if the memory or interrupt subsystem is broken the watchdog should fire, but if the cpu is hosed who knows? The pxa270 watchdog is integrated on chip.
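As a quick sanity check on those numbers, here is a small C helper that works out how many consecutive reloads can be skipped before the countdown expires. It is plain arithmetic under simplifying assumptions (reloads land exactly on the period, no jitter, and a reload that arrives exactly at the timeout counts as too late); the function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Largest number of consecutive reloads that can be skipped without the
 * watchdog expiring, assuming reloads are otherwise exactly periodic.
 * Returns -1 if even on-time reloads are too slow for the timeout. */
int wd_missed_reloads_tolerated(uint32_t timeout_ms, uint32_t period_ms)
{
    assert(timeout_ms > 0 && period_ms > 0);
    if (period_ms >= timeout_ms)
        return -1;
    /* After skipping m reloads the next one lands at (m + 1) * period,
     * which must be strictly less than the timeout. */
    return (int)((timeout_ms - 1) / period_ms) - 1;
}
```

By this arithmetic, with a one second timeout a 500 ms period (2 times/second) tolerates no missed reloads at all, 400 ms tolerates one, and 333 ms (3 times/second) tolerates two but with only 1 ms to spare. So the upper end of the 2-3 times/second range, or a slightly longer timeout, looks like the safer choice against false resets.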

One thing that really bugs me about this chip - any reset causes the RTC to lose the time and date. For warm boots we have a kludge of writing the time+date to flash, rebooting, then resetting the RTC from the flash data but for a real watchdog reset you can’t plan on flash working. Since the pxa was targeted at the cell phone market Intel probably figured to get the time info off the air.

We’ll have to ask our sales rep about what AP might cost for the next version of the product. We are facing some real performance challenges and AP could help us with those (at least with tuning and diagnostics of the product). And it would be fun to play with :-)

It’s free to play with (source is on Foundry27).

There is a royalty impact only. One thing is to make sure that your PXA variant has a cycle counter (AP is much more efficient with a cycle counter).

I don’t think it’s worth the money for this. Running the watchdog at the highest priority is much safer and simpler.

Which brings up a question about AP. Over what period of time is any particular percentage guaranteed? Clearly if it was over an hour, AP would be worthless, and over a nanosecond, it would be impossible.

I’ve read the pxa docs for months and never seen a reference to a cycle count register. What’s the overhead if that register isn’t available? Do you know if any ARM processor has a cycle counter? We’re looking at the Freescale i.MX31 for the next generation product but it seems the ARM architecture has design issues that hurt performance.

community.qnx.com/sf/wiki/do/vie … _Works_FAQ

In a nutshell, it’s a sliding window and you have control over the width of the window.
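To illustrate what a sliding window means for a CPU percentage, here is a toy model in C. It is not how the AP scheduler is implemented (the FAQ linked above is the reference for that); it only shows usage being averaged over the last `width` time slots, with old slots dropping out as new ones arrive. All names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WIN_MAX 256  /* largest window width this toy model supports */

/* Toy sliding window of CPU time used per slot; not the AP scheduler. */
typedef struct {
    uint32_t used_us[WIN_MAX];  /* CPU time consumed in each slot */
    size_t   width;             /* number of slots in the window */
    size_t   head;              /* slot that will be overwritten next */
    uint64_t sum_us;            /* total CPU time inside the window */
} slide_win;

void win_init(slide_win *w, size_t width)
{
    assert(w != NULL && width > 0 && width <= WIN_MAX);
    memset(w, 0, sizeof *w);
    w->width = width;
}

/* Record the CPU time used in the newest slot; the oldest slot falls
 * out of the window at the same time. */
void win_push(slide_win *w, uint32_t used_us)
{
    w->sum_us -= w->used_us[w->head];
    w->used_us[w->head] = used_us;
    w->sum_us += used_us;
    w->head = (w->head + 1) % w->width;
}

/* Percentage of the whole window consumed, given the slot length. */
unsigned win_percent(const slide_win *w, uint32_t slot_us)
{
    assert(slot_us > 0);
    return (unsigned)((w->sum_us * 100u) / ((uint64_t)slot_us * w->width));
}
```

With four 1000 us slots, one fully busy slot followed by three idle ones reads as 25%; one slot later the busy slot has slid out and the figure drops to 0%. A wider window smooths bursts, a narrower one reacts faster.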

Mario, I agree with you completely for this particular purpose and we will probably do just as you suggest.

Given some of the performance issues we’re having AP might prove useful on our next generation project but it’s too late for this one.

Talking of hardware watchdogs: has anyone had experience using the WTD501P PCI-WDG-CSM watchdog cards? We have a QNX4.25 app that uses one of these cards. Rev D of the card works fine; Rev F cards appear not to respond to the inp or outp instructions. The manufacturer claims no difference except that the Rev F version is RoHS compliant. The new card is also universal PCI; the old one wasn’t. Any experience or ideas guys??

Not knowing anything about your driver, could it be that it doesn’t respond to the inp/outp instructions because they’ve moved? If you read inp and get 0xff, this is a good possibility.
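A rough way to test for that in C is to read a few ports at the address the driver expects and see whether they all come back as 0xFF. The sketch below takes the port read as a function pointer so it can be tried without hardware; on the real system you would pass a small wrapper around `inp()`. The function names and the two fake readers are invented for the example, and an all-0xFF result is only a hint, since a real register can legitimately read 0xFF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Port read hook; on the target this would wrap inp(). */
typedef uint8_t (*port_read_fn)(uint16_t port);

/* Returns true if every port in [base, base + count) reads 0xFF, which
 * suggests (but does not prove) that nothing is decoded at that address. */
bool ports_look_unmapped(port_read_fn read_port, uint16_t base, size_t count)
{
    assert(read_port != NULL && count > 0);
    for (size_t i = 0; i < count; i++) {
        if (read_port((uint16_t)(base + i)) != 0xFF)
            return false;
    }
    return true;
}

/* Stand-ins for hardware, for trying the check on a desktop machine. */
uint8_t fake_empty_bus(uint16_t port)
{
    (void)port;
    return 0xFF;  /* nothing answers anywhere */
}

uint8_t fake_card_at_0x300(uint16_t port)
{
    /* Pretend a card decodes four ports starting at 0x300. */
    return (port >= 0x300 && port < 0x304) ? 0x5A : 0xFF;
}
```

If the check says the expected address looks unmapped, the next step would be to compare the I/O base the PCI BIOS assigned to the Rev F card with the one the driver is using.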

I disagree with Mario. Running a stroker at the highest priority doesn’t tell you that your application threads are actually running. OTOH, if you have an HA manager that registers to receive a pulse from the kernel when the budget is exceeded, plus a watchdog at priority 1 in a 1% partition, then your HA manager knows whether your application threads are getting CPU. If the watchdog thread itself has a bug and goes into a loop, the HA manager will also be informed, and can restart the watchdog without falsely resetting the system (i.e. the software is actually OK, with the exception of the watchdog).

Watchdogs that are actually useful are (as Ken initially pointed out) traditionally very tricky to do, but with AP they are nearly trivial…

I know that there are PXA variants with a cycle counter (it is one of the performance counter registers), but I don’t know enough to know exactly which ones…