QNX 6.5 cpu halt ? Kernel freeze ? No terminal available

pierro · April 14, 2023, 2:36pm

Hello,

I’m getting a kernel crash or CPU halt every few days on QNX 6.5, and I can’t diagnose it because no console works when it happens (tty on RS232, rlogin, ssh …), even though their priority has been set to the highest. Everything is stuck. Photon isn’t enabled on my machine since there’s no need for it.

I’m using Momentics with the instrumented kernel to check whether CPU time or memory usage increases in the run-up to a crash. The system seems pretty stable (mostly idling, less than 25% memory usage).

System profiling shows a normal behaviour, correctly handling interrupts, and multi processing.

I thought the best way to diagnose it was to capture a profiling .kev file up to the very last instruction and find the culprit. Unfortunately, even when filtered, the file grows at about 3 MB/s. The host only has an SD card, so streaming is the only option. There’s no rolling memory option, so this means a file of about 200 GB per day, and I’m not sure Momentics will be able to digest it, or even record it. I’ll try it this weekend.
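For scale, the daily volume follows directly from the streaming rate. A quick sketch of the arithmetic, assuming the quoted rate is in megabytes per second and using decimal units (1 GB = 1000 MB); the function name is only for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def trace_volume_gb(rate_mb_per_s: float, seconds: float = SECONDS_PER_DAY) -> float:
    """Return the size in GB of a trace streamed at a constant rate."""
    return rate_mb_per_s * seconds / 1000.0


if __name__ == "__main__":
    for rate in (2.3, 3.0):
        print(f"{rate} MB/s -> {trace_volume_gb(rate):.0f} GB/day")
```

At a steady 3 MB/s this comes to roughly 259 GB per day; a 200 GB day corresponds to about 2.3 MB/s.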

Do you have any advice on how to diagnose this kind of crash?

Sincerely, Pierro

nico04 · April 17, 2023, 8:24am

Can you give more details about the hardware ?

Have you tried partitioning as suggested here by @maschoen ?

pierro · April 17, 2023, 9:59am

Hi,

It’s an old, reliable SBC on the PowerPC architecture. I have several absolutely identical SBCs and they all behave the same way, so it looks SW/OS related.

I’m going to try partitioning. Did it work in your case?

Should I be able to see anything with Momentics? Or is this a waste of time?

Is there any way to get a rolling memory capture ?
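The idea behind a rolling capture is a fixed-size buffer that keeps only the most recent events, so it can run for days and still hold the last moments before a hang. The following is a conceptual sketch only, not a QNX or Momentics API: the `RollingCapture` class, its event tuples, and the `buffer_size_mb` helper are all hypothetical names.

```python
from collections import deque


class RollingCapture:
    """Keep only the newest `capacity` events; older ones are overwritten."""

    def __init__(self, capacity: int) -> None:
        self._events = deque(maxlen=capacity)

    def record(self, timestamp: float, event: str) -> None:
        # When the deque is full, appending silently drops the oldest entry.
        self._events.append((timestamp, event))

    def snapshot(self) -> list:
        """Return the retained events, oldest first."""
        return list(self._events)


def buffer_size_mb(rate_mb_per_s: float, window_s: float) -> float:
    """RAM needed to retain the last `window_s` seconds at a constant rate."""
    return rate_mb_per_s * window_s


if __name__ == "__main__":
    capture = RollingCapture(capacity=3)
    for t in range(5):
        capture.record(float(t), f"event-{t}")
    print(capture.snapshot())       # only the three newest events remain
    print(buffer_size_mb(3.0, 10))  # MB needed for a 10 second window
```

Sized this way, a window of a few seconds needs tens of megabytes rather than hundreds of gigabytes. The open question on a hard hang is how to get the buffer contents off the machine afterwards, which depends on the board.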

nico04 · April 17, 2023, 11:44am

Partitioning might work only if the OS is still “safe”. For example, if an interrupt triggers forever, you lose control of the machine but everything else runs normally (except that the interrupt consumes all the CPU).

With an instrumented kernel, you can capture everything that happened in the processor for some time.

If there is a “real” crash, the true problem is to find what triggers it.

Bad software, for example misbehaving DMA
Electrostatic discharges
Bad power supply (old ones can misbehave because of ageing capacitors)
Bad connectors (oxidized contacts)

If the problem is new on old software, the last points in the list (the hardware causes) should be checked first.

Tim · April 17, 2023, 2:40pm

Are you using interrupts that could cause this behavior (triggering endlessly)?

Are you sure you have your tty/rlogin at the highest priority? In other words, have you verified via pidin that it’s running at the priority you think it is, and what number did you use? It may be that you aren’t running as high as you think.

Many years ago on 6.3 we had a similar problem. It turned out that the C++ compiler, when changing from 2.95 to 3.34, implemented spinlocks as a way to ‘wait’ for resources to become available. On a high-priority thread this spinlock consumed 100% of the CPU, and it seemed immune to the QNX scheduling that’s supposed to boost lower-priority threads so they can release resources. When it happened it looked like the whole O/S had hung (like what you see), and only when we got a REALLY high-priority console working were we able to figure out what was going on. Hence why I asked how you set up yours.
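A toy model may help show why this looks like a total hang. It is not QNX’s scheduler, just a strict-priority loop in which the highest-priority ready thread always runs; the thread names, the priorities (10 and 50), and the console waking every tenth tick are all made up for illustration. Because the spinner never blocks, the low-priority lock holder never gets CPU time to release the lock.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    priority: int
    ran: int = 0  # number of ticks this thread was given the CPU


def simulate(console_priority: int, ticks: int = 100) -> dict:
    """Run a strict-priority schedule and return ticks executed per thread."""
    holder = Thread("holder", 10)    # owns the lock, needs one tick to release it
    spinner = Thread("spinner", 50)  # busy-waits on the lock, never blocks
    console = Thread("console", console_priority)
    lock_held = True
    spinner_done = False

    for tick in range(ticks):
        ready = []
        if lock_held:
            ready.append(holder)
        if not spinner_done:
            ready.append(spinner)
        if tick % 10 == 0:           # console only wakes on simulated input
            ready.append(console)
        if not ready:
            continue
        current = max(ready, key=lambda t: t.priority)
        current.ran += 1
        if current is holder:
            lock_held = False        # holder finally releases the lock
        elif current is spinner and not lock_held:
            spinner_done = True      # spinner acquires the lock and stops

    return {t.name: t.ran for t in (holder, spinner, console)}


if __name__ == "__main__":
    print(simulate(console_priority=63))  # console above the spinner
    print(simulate(console_priority=20))  # console below the spinner
```

In this model the holder never runs in either case, but a console placed above the spinner still gets its ticks, while one placed below it is locked out completely.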

Tim

P.S. When your ‘hang’ happens, do the keyboard lights (scroll lock, caps lock) still work?

pierro · April 18, 2023, 5:01pm

Hi,

Thanks for your answers. The SBC doesn’t have any USB or PS/2 port, only Ethernet and RS232.

I’ll try to set the shell priority to 255 (or less if you think that would cause trouble). It was at 63 previously.
All my processes are running at 10.

A spinlock at 100% CPU should be visible in Momentics, shouldn’t it? I still haven’t had a capture window that includes a crash. Hopefully tonight?

As for interrupts, they are the main suspect. I’m currently reviewing the code to see if there’s any problem with semaphores, even though errors from all calls are handled.

I’ll also try to use dumper, but I don’t think my SD card will like it much.

@nico04 There’s no ESD, bad power supply or bad connectors. I’ve also used an analyser and checked my SBC’s reboot causes, and no power supply error was detected. The cause was the watchdog (my process didn’t rearm it in time since it had crashed or hung).
I’ll also check all my shmem code.

Thanks a lot,

Pierro

Tim · April 18, 2023, 5:09pm

Pierro,

Dumper just catches application crashes. It won’t capture anything if you get into an infinite loop (an interrupt, or even just a while(1) code loop).

If you are using Momentics you won’t see 100% CPU usage if the thread is running at a higher priority than Momentics (it will lock Momentics out). If you are running Momentics from Linux/Windows and connecting to your QNX machine, there is a task that runs on the QNX side that Momentics connects to called qconn. Make sure this runs at a high priority (higher than any process/thread your application has, 63 should be plenty if you have nothing running higher than 10) or else you can be locked out on 100% CPU usage.

You just mentioned a watchdog. Is this a software watchdog running in QNX or a hardware one running on the board itself? If it’s software in QNX, that means the QNX kernel is OK.

Tim

nico04 · April 19, 2023, 7:03am

If you can catch the Watchdog event, then you can trigger a Momentics capture of all events in the (instrumented) kernel.
I used kernel tracing once to find an IRQ-related bug. After analysis I found it was a chip bug. I applied the errata workaround given by the manufacturer and all has been well since the fix.
More info about this here.

pierro · April 19, 2023, 3:59pm

Tim, Nico,

Momentics is running on another computer, and qconn was already set to 63. But you’re right about the relevance of dumper and Momentics in this context.

It’s a HW watchdog, so I can’t use it for triggering anything.

Now, the only processes running above 63 are:
254: pipe, mmcsd, procnto (instr)
255: procnto (instr), pipe, serial driver (required for ksh), mmcsd, and ksh.
My own processes are started at 10.

I’ll run this test tonight since it takes about 2 or 3 days between each crash.

I may try to reduce mmcsd priority after startup for my next test.

BTW, sloginfo doesn’t show anything particularly suspicious.

nico04 · April 20, 2023, 6:38am

Is there no way to convert the watchdog into an IRQ by modifying the board (or by any other means)?