Kernel Dump msg S/C/F 11/2/11 - Page fault

sheran.vaz · April 23, 2016, 7:29pm

Hi,

How do I debug a page fault reported in a QNX kernel dump message?

Regards
Sheran

maschoen · April 24, 2016, 6:03pm

I’m not sure how to read a kernel dump; the problem doesn’t come up very often. Here are some possible causes to consider, from most to least likely.

Bug in a user written driver.
Failure of custom hardware.
Failure of standard hardware.
Bug in the QNX kernel.

sheran.vaz · April 25, 2016, 1:53pm

How do I map .sym files to the kernel dump? Will that help debug the issue?

Regards
Sheran

nico04 · April 25, 2016, 2:19pm

I’ll add QNX drivers to the list. Especially the ones manipulating memory like graphics drivers.

sheran.vaz · April 25, 2016, 6:24pm

Thanks for your inputs.

How do I map .sym files to the kernel dump? Will that help debug the issue?

sheran.vaz · April 25, 2016, 6:30pm

Every time, I get the same dump:

instruction[f007f0fe]
c3 e9 3c fc ff ff cd 28 c3 e9 34 fc ff ff b8 5c 00 00 00 f7 05 88 a7 09 f0 00

I am looking for a way to map the last executed instructions to a process.

sheran.vaz · April 25, 2016, 9:21pm

The registers edi, edx, eax, eip, efl, cs and ss are the same for every crash.

edi = 00000010 (SAME)
esi = efead7ac
ebp = effcef2c
exx = efef1988
ebx = efead708
edx = f007f0fe (SAME)
ecx = effcee80
eax = 00000000 (SAME)
eip = f007f0fe (SAME)
cs = 0000001d (SAME)
efl = 00001246 (SAME)
esp = effcee80
ss = 00000099 (SAME)

maschoen · April 25, 2016, 9:44pm

Based on this, there is a little information I can provide. Your eip is f007f0fe. The high bits of this address indicate you are running at the highest protection level, which means either procnto is running or you are in the kernel. The code segment 0000001d is the code segment of procnto.

Given that it repeats, it seems most likely to be caused by an interrupt handler bug. Have you written any drivers that use interrupts?
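
To read the cs, ss and efl values from the dump bit by bit, a small sketch like the one below can help. The field layout (selector index, table bit, privilege bits, EFLAGS bit positions) is the standard x86 protected-mode layout; the function names are made up for illustration, and the sketch makes no claim about how QNX assigns privilege rings to procnto or the kernel.

```python
# Decode x86 protected-mode segment selectors and a few EFLAGS bits.
# The register values are the ones reported in the dump in this thread.

def decode_selector(sel):
    """Split a 16-bit segment selector into its three fields."""
    return {
        "index": sel >> 3,                       # descriptor table index
        "table": "LDT" if sel & 0x4 else "GDT",  # table indicator bit
        "rpl": sel & 0x3,                        # privilege level bits
    }

def decode_eflags(efl):
    """Return the IOPL field and the names of some flag bits that are set."""
    names = {0: "CF", 2: "PF", 6: "ZF", 7: "SF", 9: "IF", 10: "DF", 11: "OF"}
    flags = [name for bit, name in sorted(names.items()) if efl & (1 << bit)]
    return {"iopl": (efl >> 12) & 0x3, "flags": flags}

print("cs ", decode_selector(0x1D))
print("ss ", decode_selector(0x99))
print("efl", decode_eflags(0x1246))
```

For these values the privilege bits of both cs and ss come out as 1, and efl has the interrupt flag (IF) set, which is worth keeping in mind when weighing the interrupt handler theory.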

sheran.vaz · April 25, 2016, 10:05pm

I am just using the binaries and drivers from QNX 6.5.0 SP1. I do not have any custom drivers that use interrupts.

Regards
Sheran

sheran.vaz · April 25, 2016, 10:25pm

When I run “pidin backtrace” I see:

   1-01 f007f0f9                                                    
   1-02 f007f0f9                                                    
   1-03 f007f0f9                                                    
   1-04 f007f0fc                                                    
   1-05 f007eeb4                                                    
   1-06 f007eeb4                                                    
   1-07 f007f0fc                                                    
   1-08 f007eeb4                                                    
   1-09 f007eeb4                                                    
   1-11 f007eeb4                                                    
   1-13 f007eeb4                                                    
   1-14 f007eeb4                                                    
   1-15 f007eeb4                                                    
   1-17 f007eeb4                                                    
   1-18 f007f0fc

Does this mean it belongs to thread 4 of procnto-smp? The target is dual core with hyperthreading enabled, so it has 4 CPUs.
Thread 4 is a special idle thread with priority 0 for the 4th CPU.

sheran.vaz · April 25, 2016, 10:33pm

Sorry, f007f0fc should be the address of some routine, and I can see it being called by multiple threads at different times. It is not tied to a single thread.

sheran.vaz · April 26, 2016, 1:52am

virtual Address of procnto-smp Code Segment ==> f0018000
Routine pointed by Instruction Pointer ==> f007f0fe
Offset of the Routine pointed by Instruction Pointer ==> 670fe
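
The subtraction above can be written down as a short sketch, which is handy when several addresses from the dump or from pidin need translating. The base and eip values are the ones quoted in this post; the function name `image_offset` is only illustrative.

```python
LOAD_BASE = 0xF0018000  # procnto-smp code segment virtual address (from the post)
EIP = 0xF007F0FE        # instruction pointer reported in the kernel dump

def image_offset(addr, base=LOAD_BASE):
    """Translate a runtime virtual address into an offset within the image."""
    if addr < base:
        raise ValueError("address lies below the load base")
    return addr - base

print(hex(image_offset(EIP)))         # 0x670fe
print(hex(image_offset(0xF007F0FC)))  # 0x670fc, an address seen in pidin backtrace
```

The resulting offsets are what to look for in the left-hand column of an objdump listing of the same binary.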

If I take an objdump of procnto-smp, I get:

[code]
…
…
000670e4 <__Ring0>:
   670e4:  b8 02 00 00 00           mov    $0x2,%eax
   670e9:  f7 05 00 00 00 00 00     testl  $0x400,0x0
   670f0:  04 00 00
   670f3:  ba fe 70 06 00           mov    $0x670fe,%edx
   670f8:  74 0a                    je     67104 <__Ring0+0x20>
   670fa:  89 e1                    mov    %esp,%ecx
   670fc:  0f 34                    sysenter
   670fe:  c3                       ret
   670ff:  e9 fc ff ff ff           jmp    67100 <__Ring0+0x1c>
   67104:  cd 28                    int    $0x28
   67106:  c3                       ret
   67107:  e9 fc ff ff ff           jmp    67108 <__Ring0+0x24>

0006710c <SchedCtl>:
   6710c:  b8 5c 00 00 00           mov    $0x5c,%eax
   67111:  f7 05 00 00 00 00 00     testl  $0x400,0x0
   67118:  04 00 00
   6711b:  ba 26 71 06 00           mov    $0x67126,%edx
   67120:  74 0a                    je     6712c <SchedCtl+0x20>
   67122:  89 e1                    mov    %esp,%ecx
   67124:  0f 34                    sysenter
   67126:  c3                       ret
   67127:  e9 fc ff ff ff           jmp    67128 <SchedCtl+0x1c>
   6712c:  cd 28                    int    $0x28
   6712e:  c3                       ret
   6712f:  e9 fc ff ff ff           jmp    67130 <SchedCtl+0x24>
…
…[/code]

The highlighted code seems similar to the instructions in the kernel dump, but it is not exactly the same.

 0: c3                ret
 1: e9 3c fc ff ff    jmp 0xfffffc42
 6: cd 28             int 0x28
 8: c3                ret
 9: e9 34 fc ff ff    jmp 0xfffffc42
 e: b8 5c 00 00 00    mov eax,0x5c
13: f7 05 88 a7 09 f0

Does this mean the kernel was executing ring0() and schedctl() routines?
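
One way to check the "similar but not exactly the same" observation is to compare the dump bytes with the objdump bytes while ignoring the fields that differ. The sketch below assumes those fields are addresses filled in at link or load time: the 32-bit displacement after each `e9` jmp opcode and the 32-bit address operand after `f7 05`. That assumption comes only from the instruction encodings, not from anything QNX documents, and the helper names are made up for illustration.

```python
# Bytes printed after "instruction[f007f0fe]" in the kernel dump.
DUMP = bytes.fromhex(
    "c3 e9 3c fc ff ff cd 28 c3 e9 34 fc ff ff"
    " b8 5c 00 00 00 f7 05 88 a7 09 f0 00"
)

# The same 26 bytes read from the objdump listing, starting at offset 0x670fe.
IMAGE = bytes.fromhex(
    "c3 e9 fc ff ff ff cd 28 c3 e9 fc ff ff ff"
    " b8 5c 00 00 00 f7 05 00 00 00 00 00"
)

# Assumed address fields as (start, length) within the window:
# the rel32 of each jmp and the address operand of the testl.
RELOC_FIELDS = [(2, 4), (10, 4), (21, 4)]

def mask(data, fields):
    """Zero out the given fields so they are ignored in a comparison."""
    out = bytearray(data)
    for start, length in fields:
        out[start:start + length] = b"\x00" * length
    return bytes(out)

def same_code(a, b, fields):
    """True when a and b match everywhere outside the masked fields."""
    return len(a) == len(b) and mask(a, fields) == mask(b, fields)

def jmp_target(window, offset, origin):
    """Target of the 'e9 rel32' jmp at window[offset], with window[0] at origin."""
    rel = int.from_bytes(window[offset + 1:offset + 5], "little", signed=True)
    return (origin + offset + 5 + rel) & 0xFFFFFFFF

print(same_code(DUMP, IMAGE, RELOC_FIELDS))
print(hex(jmp_target(DUMP, 1, 0xF007F0FE)))
print(hex(jmp_target(DUMP, 9, 0xF007F0FE)))
```

With the three fields masked, the two byte strings match, which suggests the dump window covers the tail of `__Ring0` followed by the start of `SchedCtl`. Since the window is simply the bytes that follow eip in memory, seeing `SchedCtl` bytes there does not by itself show that `SchedCtl` was executing; eip points at the `ret` immediately after the `sysenter` in `__Ring0`.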

sheran.vaz · May 10, 2016, 3:14am

I have a Qt GUI application in which a set of numbers is updated every 200 ms. hogs shows 70% utilisation for this process. I am not sure the CPU utilisation really needs to be that high. Could this GUI process be causing the page fault, since it should be accessing the video buffers at 0xa0000?

maschoen · May 10, 2016, 5:16am

I don’t know if you are using Qt under screen (usually 6.6 or later) or under gf (6.4-6.5). Either way, Qt is not writing directly to 0xa0000, but rather is using a screen or gf call.

The way Qt works is that it renders graphics to a RAM image and then blits it using the screen or gf interface. Depending on what you are updating and how smart Qt is, that could be a small or large amount of data.

One thing you might consider is why you need to update a number 5 times a second. True, the human eye can detect changes this rapidly, but it’s a number, not a bad guy you need to shoot. One easy enhancement would be to check whether the number has changed before updating.
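
The "check before updating" idea can be sketched as follows. This is Python rather than Qt code, and `GatedLabel` with its `repaint` method is a hypothetical stand-in for whatever widget the application really uses. The comparison is made on the formatted text, so changes in digits that are not displayed never trigger a repaint.

```python
class GatedLabel:
    """Hypothetical stand-in for a GUI label; repaint is the expensive step."""

    def __init__(self, fmt="{:.1f}"):
        self.fmt = fmt
        self.text = None
        self.repaints = 0

    def repaint(self, text):
        # A real application would call into its GUI toolkit here.
        self.text = text
        self.repaints += 1

    def set_value(self, value):
        """Repaint only when the displayed text would actually differ."""
        text = self.fmt.format(value)
        if text == self.text:
            return False
        self.repaint(text)
        return True

label = GatedLabel()
for sample in [12.31, 12.33, 12.34, 12.41, 12.41]:
    label.set_value(sample)
print(label.repaints)  # 2
```

Five samples arrive but only two repaints happen, because the text shown to one decimal place changes only once after the first draw.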

sheran.vaz · May 10, 2016, 6:15am

I’m not sure if my GUI is actually contributing to the page fault, but is it possible to reduce the GUI’s CPU utilisation by keeping my application as is? Also I don’t see any utilisation in io-display driver.

maschoen · May 10, 2016, 6:30am

I don’t think I understand your question. You seem to be asking: by changing nothing, can you (magically?) reduce the GUI’s CPU utilization? How on earth would you expect that to happen?

Tim · May 10, 2016, 2:58pm

Are you running in release mode or debug mode? Compiling in release mode will increase your speed.

Why does it matter if you’re utilizing 70% of your CPU? Do you still have a lot of other code / processes that are yet to be developed / added, and you’re worried about needing more than 100% CPU capacity?

Tim

sheran.vaz · May 10, 2016, 5:01pm

Yes, I have other processes running. The page fault seems to happen when there is more CPU utilisation by different processes. That is why I was wondering whether there are better ways to improve the GUI performance.

sheran.vaz · May 10, 2016, 5:32pm

I do agree the human eye cannot grasp what is displayed if it keeps changing every 200 ms (only the digits after the decimal point change frequently).
There are other places where the issue is observed. For example, I have a busy wheel that ticks every 100 ms. After I changed the timing to 500 ms, I don’t see the issue as often. The GUI also does other visualization in different places. That is why I was wondering whether changing something in the underlying layer would provide better performance.

maschoen · May 10, 2016, 5:50pm

There is only one thing I can think of in this category. If you are running with a non-native video driver, for example the VESA driver, then switching to a native driver, if one exists, might speed things up. Video cards can have acceleration, specifically for blitting data to the memory map. This has to be implemented in the driver.