ECC memory controller on AMD SC520

Tenzing · March 8, 2006, 7:19pm

Dear all,

I’m working with QNX 6.2.1 on a PC104/SC520 board. The board has an ECC memory controller (that’s the reason why we chose this one). Single event upsets are automatically detected and corrected. Multiple event upsets are detected but not corrected. In this case an NMI is generated.
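
To illustrate the "correct one flipped bit, detect two" behaviour described above, here is a minimal sketch using a toy extended Hamming(8,4) code. It is not the code the SC520 controller actually uses (that works on wider words in hardware); the function names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SEC-DED codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8                      # index 0 holds the overall parity bit
    c[3], c[5], c[6], c[7] = d       # data bits sit at positions 3, 5, 6, 7
    c[1] = c[3] ^ c[5] ^ c[7]        # parity over positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]        # parity over positions with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]        # parity over positions with bit 2 set
    c[0] = sum(c[1:]) % 2            # makes the parity of the whole word even
    return c


def decode(codeword):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    c = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos          # a single flip leaves its own position here
    overall = sum(c) % 2             # 1 means an odd number of bits flipped

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: treat as a single flip and repair it.
        c[syndrome] ^= 1             # syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "uncorrectable", None

    nibble = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return status, nibble
```

Flipping any one bit of `encode(n)` decodes back to `n` with status `corrected`; flipping any two bits gives `uncorrectable`, which corresponds to the case where the board raises the NMI.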

I believe it’s the responsibility of the kernel to handle this interrupt and to take the appropriate action, like killing the process. Please correct me if I’m wrong. Could anybody tell me if QNX is really handling this interrupt or if I have to do it by hand?

Cheers,

mario · March 8, 2006, 9:58pm

QNX does not handle NMI.

I don’t think killing the process in such a scenario is the best course of action. I’m not familiar with ECC memory but I don’t believe it would be possible to get the actual address at which the error occurred in any standard way.

Tenzing · March 9, 2006, 10:30am

In my case, killing the process whose memory got smashed makes quite some sense, as I’m using HAM to restart it on death. The current status is reloaded from flash when the process restarts so there’s virtually no interruption of service. What other action do you have in mind?

If the kernel itself (or just the memory/process manager) handled this interrupt, I guess it would be possible to generate a SIGBUS or SIGSEGV. Doing so from another process would require knowing which process is mapped at which physical memory address, and I doubt that this is possible.
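
As a rough sketch of what "knowing which process is mapped at which physical address" would involve, the lookup below searches a table of physical ranges for the owner of a faulting address. Everything here is hypothetical: the `MAPPINGS` table, the pids and the `owner_of` helper are invented for illustration, and the sketch assumes non-overlapping ranges (shared memory would break that). Whether QNX exposes such a table to another process is exactly the open question.

```python
import bisect

# Hypothetical snapshot of physical mappings: (start address, length, pid).
MAPPINGS = sorted([
    (0x00100000, 0x4000, 101),
    (0x00200000, 0x1000, 102),
    (0x00300000, 0x8000, 103),
])
STARTS = [start for start, _length, _pid in MAPPINGS]


def owner_of(paddr):
    """Return the pid whose range contains paddr, or None if nobody owns it."""
    # Find the last range that starts at or below paddr.
    i = bisect.bisect_right(STARTS, paddr) - 1
    if i < 0:
        return None
    start, length, pid = MAPPINGS[i]
    return pid if paddr < start + length else None
```

A result of `None` would cover addresses that belong to no user process, for example kernel memory, where killing a process does not help.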

Is there any way to modify the process manager to handle this interrupt?

mario · March 9, 2006, 3:00pm

Killing the process is only part of the solution. You would need to somehow identify the bad RAM, then reserve it so that when the process gets restarted by HAM you are not using the same memory area; otherwise you’d most likely be stuck in a loop ;-)

It would be rather difficult to pinpoint the exact memory location and instruction that caused the NMI because of caching issues, read-ahead, delayed writes and so forth.

At that point anyway, if you have an NMI your machine has gone bad. If you ask me, you need much more drastic measures than restarting the program to get around this; I believe hardware redundancy is the only real solution.

It’s like trying to restart a car when the engine is on fire ;-)

Tenzing · March 10, 2006, 10:02pm

Actually it still makes some sense.

The board will be used in space (on a satellite). Due to radiation effects, some bits can flip in memory. This flip is only temporary; the memory itself is still OK, it just needs to be written again. If I restart the process it will be working fine, until the same thing happens again.

So I really need to have this NMI caught to kill the dirty process. Maybe it’s possible to know, when the NMI handler is called, which context was running at the time. Caching is disabled as much as possible because of those radiation problems, but I agree that read-aheads and delayed writes make it more complex.

Is there any way to make some patches to handle this? I wonder if it’s possible to get the sources of the QNX kernel.

mario · March 11, 2006, 4:38pm

If you have billions of dollars it should be possible.

As for NMI, it was possible with QNX4, so maybe it’s possible with QNX6. I wouldn’t know how though.

What if the memory is the one where kernel data is stored? How would you deal with that?

What if the memory is the one where the NMI handler lives?

rgallen · March 13, 2006, 2:42pm

I agree with Mario.

I would suggest that a (hardware) redundant system might be appropriate. When the NMI hits the first controller, it halts and the second controller takes over. A third device (a very simple watchdog) could monitor a heartbeat from both devices and power cycle whichever one fails to meet a handshake deadline (this handles the case where there is a burst of radiation and both controllers are hit simultaneously).
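
The watchdog logic described above can be sketched as follows. This is only a model of the decision rule, with time passed in explicitly; the `HeartbeatWatchdog` class and its method names are invented for this example, and a real device would drive a power switch where `power_cycle` just resets a timestamp.

```python
class HeartbeatWatchdog:
    """Tracks the last heartbeat of each controller against one deadline."""

    def __init__(self, names, deadline, now=0.0):
        self.deadline = deadline
        self.last_beat = {name: now for name in names}

    def beat(self, name, now):
        """Record a heartbeat from a controller."""
        self.last_beat[name] = now

    def expired(self, now):
        """Return the controllers whose last heartbeat is older than the deadline."""
        return sorted(name for name, t in self.last_beat.items()
                      if now - t > self.deadline)

    def power_cycle(self, name, now):
        """Stand-in for the real power cycle: restart that controller's clock."""
        self.last_beat[name] = now
```

Because each controller is checked independently, `expired` can return both names at once, which is the "both hit simultaneously" case.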

Thunderblade · March 14, 2006, 8:45am

As said previously: What do you do if a bit flips inside the kernel? You can’t put it under HAM control! And what if your NMI handling code has some flipped bits?

AFAIK if an NMI occurs under QNX, the system reboots. I leave it up to someone else to confirm. I think no software in the world can help you if your hardware fails.

Tenzing · March 14, 2006, 9:51am

After a quick talk with our sponsors, I had the feeling that this solution will not be possible.

Ok, thanks for the info. I’ll look in this direction when I have more time.

There’s an external hardware watchdog on this board, so we hope that a bit flip in kernel memory would somehow crash the machine so that it can get restarted. For many reasons (out of my control) the board cannot be made redundant. I know I can’t achieve 100% reliability with the current design, but I joined too late to change it.

Tenzing · March 14, 2006, 10:01am

Ok. I thought that QNX was more flexible than that. At least it’s better than doing nothing; after a short interruption of service everything should be back to normal. Thanks for the info.

Well, more or less. I’m working on it.

Tenzing · March 14, 2006, 10:06am

Those bit flips are temporary, so there’s really no need for a second board (except for latch-ups, but that’s another story). Anyway, we’re back to the initial question: how do I attach something to this NMI under QNX?