I’m getting a kernel crash or CPU halt every few days on QNX 6.5, and I can’t diagnose it because no console works when the crash happens (tty on RS232, rlogin, ssh, …), even though their priority has been set to the highest. Everything is stuck. Photon isn’t running on my machine since there’s no need for it.
I’m using Momentics with the instrumented kernel to check whether CPU time or memory usage increases before a crash occurs. The system seems to be pretty stable (mostly idling, less than 25% memory usage).
System profiling shows normal behaviour: interrupts and multiprocessing are handled correctly.
I thought the best way to diagnose it was to capture a profiling .kev file right up to the very last instruction and find the culprit. Unfortunately, even when filtered, the trace grows at about 3 MB/s. The host only has an SD card, so streaming is the only option. There’s no rolling memory option, so this means a file of well over 200 GB per day (3 MB/s × 86,400 s ≈ 259 GB), and I’m not sure Momentics will be able to digest it, or even record it. I’ll try it this weekend.
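The rolling capture wished for here amounts to a ring buffer: keep only the last N records in RAM and write them out once, when something goes wrong. Below is a minimal sketch in C, assuming fixed-size records. The names `trace_ring`, `RING_SLOTS` and `RECORD_BYTES` are made up for illustration, and this is not how Momentics or the instrumented kernel actually store events.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_SLOTS   1024   /* hypothetical capacity, tune to the RAM you can spare */
#define RECORD_BYTES 64     /* hypothetical fixed record size */

typedef struct {
    unsigned char slots[RING_SLOTS][RECORD_BYTES];
    size_t next;    /* slot that the next push will overwrite */
    size_t count;   /* number of valid records, at most RING_SLOTS */
} trace_ring;

/* Reset the ring to empty. */
static void ring_init(trace_ring *r)
{
    assert(r != NULL);
    r->next = 0;
    r->count = 0;
}

/* Store one record; once the ring is full the oldest record is overwritten. */
static void ring_push(trace_ring *r, const void *rec, size_t len)
{
    assert(r != NULL && rec != NULL && len <= RECORD_BYTES);
    memset(r->slots[r->next], 0, RECORD_BYTES);
    memcpy(r->slots[r->next], rec, len);
    r->next = (r->next + 1) % RING_SLOTS;
    if (r->count < RING_SLOTS)
        r->count++;
}

/* Return the i-th oldest record (0 = oldest), or NULL if i is out of range. */
static const unsigned char *ring_get(const trace_ring *r, size_t i)
{
    size_t oldest;
    assert(r != NULL);
    if (i >= r->count)
        return NULL;
    oldest = (r->next + RING_SLOTS - r->count) % RING_SLOTS;
    return r->slots[(oldest + i) % RING_SLOTS];
}
```

With this layout the memory cost is fixed (here 1024 × 64 bytes) no matter how long the system runs, and walking `ring_get()` from 0 to `count - 1` replays the final records in order.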
Do you have any advice on how to diagnose this kind of crash?
Partitioning might work only if the OS is still “safe”. For example, if an interrupt triggers forever, you lose control of the machine even though everything else runs normally (the interrupt simply consumes all the CPU).
With an instrumented kernel, you can capture everything that happened in the processor for some time.
If there is a “real” crash, the true problem is to find what triggers it.
Bad software, such as misbehaving DMA
Bad power supply (old ones can behave badly because of worn capacitors)
Bad connectors (oxidized contacts)
If the problem is new but the software is old, the last points should be checked first.
Are you using interrupts that could cause this behavior (triggering endlessly)?
Are you sure you have your tty/rlogin at the highest priority? In other words, have you verified via pidin that it’s running at the priority you think it is, and what number did you use? It may be that you aren’t running as high as you think.
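Besides pidin, a process can report its own scheduling settings, which is a handy cross-check from inside the application. A minimal sketch using the POSIX calls `sched_getscheduler()` and `sched_getparam()`; `query_own_sched` and `report_own_sched` are made-up helper names, and how pid 0 maps onto individual threads is an assumption you should verify on your QNX version.

```c
#include <assert.h>
#include <sched.h>
#include <stdio.h>

/* Fetch the scheduling policy and priority of the caller.
 * Returns 0 on success, -1 if either query fails. */
static int query_own_sched(int *policy, int *priority)
{
    struct sched_param param;
    int pol;

    assert(policy != NULL && priority != NULL);
    pol = sched_getscheduler(0);              /* pid 0 means "the caller" */
    if (pol == -1 || sched_getparam(0, &param) == -1)
        return -1;
    *policy = pol;
    *priority = param.sched_priority;
    return 0;
}

/* Print the values to stderr so they can be compared with pidin's output. */
static void report_own_sched(void)
{
    int policy, priority;

    if (query_own_sched(&policy, &priority) == 0)
        fprintf(stderr, "policy=%d priority=%d\n", policy, priority);
    else
        perror("query_own_sched");
}
```

Calling `report_own_sched()` at startup of the console or monitoring process shows the priority it actually got, which may differ from the one you asked for.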
Many years ago in 6.3 we had a similar problem. It turned out that when the C++ compiler changed from 2.95 to 3.34, it implemented spinlocks as a way to ‘wait’ for resources to become available. On a high priority thread this spinlock was consuming 100% of the CPU, and the spinlock seemed immune to the QNX scheduling that’s supposed to raise lower priority threads so they can release resources. When it happened it looked like the whole OS had hung (like what you see), and only when we got a REALLY high priority console actually working were we able to figure out what was going on. That’s why I asked how you set yours up.
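To illustrate why that kind of spinlock hangs the box and what a safer wait looks like: a pure busy-wait at high priority never gives the lower priority owner a chance to run, whereas a wait that spins only briefly and then sleeps actually blocks the thread. This is a minimal sketch, assuming a GCC-compatible compiler (it uses the `__sync_lock_test_and_set` and `__sync_lock_release` builtins); `spin_flag`, `acquire_with_backoff` and `release_flag` are illustrative names, not what the compiler runtime in that story used.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() */
#endif
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

typedef volatile int spin_flag;   /* 0 = free, 1 = held */

/* Try to take the flag: up to max_spins busy attempts, then up to
 * max_sleeps attempts separated by 1 ms sleeps. Returns false on give-up. */
static bool acquire_with_backoff(spin_flag *flag, int max_spins, int max_sleeps)
{
    const struct timespec nap = { 0, 1000000L };   /* 1 ms */
    int i;

    assert(flag != NULL && max_spins >= 0 && max_sleeps >= 0);

    for (i = 0; i < max_spins; i++)
        if (__sync_lock_test_and_set(flag, 1) == 0)
            return true;

    for (i = 0; i < max_sleeps; i++) {
        /* Sleeping blocks this thread, so lower priority threads
         * (including the current holder) get CPU time. */
        nanosleep(&nap, NULL);
        if (__sync_lock_test_and_set(flag, 1) == 0)
            return true;
    }
    return false;
}

/* Release a flag previously taken with acquire_with_backoff(). */
static void release_flag(spin_flag *flag)
{
    assert(flag != NULL);
    __sync_lock_release(flag);
}
```

Note that `sched_yield()` would not be a substitute for the sleep here: under priority-based FIFO or round-robin scheduling it only yields to threads of the same priority. Returning `false` after a bounded wait also gives the caller a place to log that something is stuck.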
P.S. When your ‘hang’ happens, do the keyboard lights (scroll lock, caps lock) still work?
Thanks for your answers. The SBC doesn’t have any USB or PS/2 port, only Ethernet and RS232.
I’ll try to set the shell priority to 255 (or less if you think that would cause trouble). It was at 63 previously.
All my processes are running at 10.
A spinlock and 100% CPU should be visible in Momentics, shouldn’t they? I still haven’t had a capture window that includes a crash. Hopefully tonight?
As for interrupts, these are the main suspect. I’m currently reviewing the code to see if there’s anything wrong with the semaphores, even though errors from all calls are handled.
I’ll also try to use dumper, but I think that my SD card won’t like it so much.
@nico04 There’s no ESD, bad power supply or bad connector. I’ve also used an analyser and checked my SBC’s reboot causes, and no power supply error was detected. The cause was the watchdog (my process didn’t rearm it in time since it had crashed or hung).
I’ll also check all my shmem code.
Dumper just catches application crashes; it won’t capture anything if you get into an infinite loop (an interrupt, or even just a while(1) loop in code).
If you are using Momentics, you won’t see 100% CPU usage if the offending thread runs at a higher priority than Momentics (it will lock Momentics out). If you run Momentics on Linux/Windows and connect to your QNX machine, Momentics talks to a task on the QNX side called qconn. Make sure qconn runs at a high priority (higher than any process or thread in your application; 63 should be plenty if nothing of yours runs above 10), or else you can be locked out when CPU usage hits 100%.
You just mentioned a watchdog. Is this a software watchdog running in QNX or a hardware one running on the board itself? If it’s software in QNX, that means the QNX kernel is OK.
If you can catch the Watchdog event, then you can trigger a Momentics capture of all events in the (instrumented) kernel.
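As a sketch of what “catching the watchdog event” could look like in software: a high priority monitor checks a heartbeat that the application refreshes, and when the heartbeat goes stale it runs a hook once, before the hardware watchdog resets the board. Everything here is hypothetical (`heartbeat_monitor`, `hb_*`, and the `count_stall` hook, which merely stands in for “stop and save the kernel trace”); the timeout is assumed to be shorter than the hardware watchdog period.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*stall_hook)(void *ctx);

typedef struct {
    uint64_t   last_beat_ms;  /* time of the most recent heartbeat */
    uint64_t   timeout_ms;    /* assumed shorter than the hardware watchdog period */
    bool       fired;         /* hook already run for the current stall */
    stall_hook on_stall;      /* called once per stall, may be NULL */
    void      *ctx;           /* passed through to the hook */
} heartbeat_monitor;

static void hb_init(heartbeat_monitor *m, uint64_t now_ms, uint64_t timeout_ms,
                    stall_hook hook, void *ctx)
{
    assert(m != NULL && timeout_ms > 0);
    m->last_beat_ms = now_ms;
    m->timeout_ms = timeout_ms;
    m->fired = false;
    m->on_stall = hook;
    m->ctx = ctx;
}

/* Called by the application wherever it rearms the watchdog. */
static void hb_beat(heartbeat_monitor *m, uint64_t now_ms)
{
    assert(m != NULL);
    m->last_beat_ms = now_ms;
    m->fired = false;
}

/* Called periodically by the monitor thread. Returns true while the
 * heartbeat is stale and runs the hook the first time that is detected. */
static bool hb_check(heartbeat_monitor *m, uint64_t now_ms)
{
    assert(m != NULL);
    if (now_ms - m->last_beat_ms < m->timeout_ms)
        return false;
    if (!m->fired) {
        m->fired = true;
        if (m->on_stall != NULL)
            m->on_stall(m->ctx);
    }
    return true;
}

/* Example hook: counts stalls. A real one would stop and save the trace. */
static void count_stall(void *ctx)
{
    ++*(int *)ctx;
}
```

This only helps if the monitor thread can still run, so it would need a priority above everything in the application, and it cannot catch a hang caused by an interrupt that never stops firing.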
I once used kernel tracing to find an IRQ-related bug. After analysis I found that it was a chip bug. I applied the errata workaround given by the manufacturer, and all has been well since the fix.
More info about this here.