Ready thread sometimes takes ~100ms before running

Hi,

I have a problem with QNX 6.5 Service Pack 1.

I have a thread running at high priority (15) that talks to a custom PCI device. This thread attaches an interrupt handler to the PCI device via InterruptAttach() and then enters an infinite loop that blocks on this interrupt via InterruptWait(). By profiling the application with the QNX System Profiler, I found that shortly after the ISR executes, the thread transitions from the “INTERRUPT” state to the “READY” state. It then takes approximately 25 us before the thread becomes “RUNNING”.
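For context, the handling loop described above looks roughly like this (a minimal sketch, QNX 6.5 API; the IRQ number and the trivial ISR are placeholders, error handling omitted):

```c
#include <stddef.h>
#include <sys/neutrino.h>

#define PCI_IRQ 10  /* hypothetical IRQ line of the custom PCI device */

/* ISR: runs at interrupt level; just wakes the thread blocked in InterruptWait().
 * A real ISR would also acknowledge/clear the device here. */
static const struct sigevent *isr(void *area, int id)
{
    static const struct sigevent ev = { .sigev_notify = SIGEV_INTR };
    return &ev;
}

static void *pci_thread(void *arg)
{
    int id;

    /* I/O privilege is required before attaching an interrupt */
    ThreadCtl(_NTO_TCTL_IO, 0);

    id = InterruptAttach(PCI_IRQ, isr, NULL, 0, 0);

    for (;;) {
        InterruptWait(0, NULL);  /* thread sits in INTERRUPT state here */
        /* ... service the device: thread is now READY, then RUNNING ... */
    }
    return NULL;
}
```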

However, on some very rare occasions, the thread stays in the “READY” state for a very long time (50 to 100 ms) even though CPU2 (the thread is pinned to CPU2) is running the idle task the whole time. The ISR still processes new incoming interrupts, but the worker thread remains stuck in the “READY” state. This kind of jitter is not acceptable for my application.

I have found a known issue in the QNX 6.5 release notes that corresponds to my problem:

            We've observed an issue in the x86 SMP kernel where a ready thread sometimes won't run on an idling CPU:
                           On interrupt entry, if the CPU is halted (because the idle thread executed a halt instruction), then the interrupted context could be mistakenly identified as “kernel” instead of “user.” This isn't a result of the halt itself, but rather of the fact that the idle thread makes a kernel call to execute the halt.
                           On interrupt exit, the behavior depends on the interrupted context, whether another CPU holds a kernel lock, etc. Under certain conditions, this issue could cause either the rescheduling not to be done and the idle thread to run until the next interrupt, or the idle thread to attempt to reacquire the kernel (with priority 0) instead of “force kernel” (with the highest priority).
            (Ref# 156062, J383827)
            
            Workaround: Specify the -h option for procnto-smp, to disable CPU halting in the idle thread.

I have tried the specified workaround without success.

My processor is an i7-2710QE with QM77 chipset.

I would like to know if anyone had similar issues with QNX and if so, what was your solution.

As a side note, I have also tried the following things without success:
-Deactivate Hyperthreading by BIOS configuration
-Disable power management features (p-states and c-states) by BIOS config

I’ve never had your issue but the doc’s say:

qnx.com/developers/docs/6.5. … ocnto.html

which doesn’t quite seem to match your problem. So I am not sure about that note you found (I couldn’t find it in a Google search), its suggested workaround, or how that might relate to your problem.

Disabling hyperthreading is normally a good idea on QNX when performance matters, because QNX can’t tell that a hyperthread isn’t a real processor.

If you still have technical support (or even if you don’t), you may have better luck registering and posting on Foundry27 in the Kernel section, where the QNX developers live and might be able to provide more insight.

At a minimum, we’d need to know how (i.e., which commands or calls) you’re binding processes/threads to CPUs, and how many of those processes/threads you have. Are you doing any adaptive partitioning scheduling? Is anything else running?

Tim

Hi Tim,

I found the release note on the QNX website: http://www.qnx.com/developers/articles/rel_5189_48.html#Issues_kernel in the procnto-smp section.

I might give a try on Foundry27, thank you for this suggestion.

Here is more information about my system:

I am using the qnxbasesmp.build script from the QNX Momentics SDK without modifications to generate the startup code.

My application is a single process that launches ~20 threads, all bound to CPU1. From the profiling traces I generated, I see that CPU1 activity is between 10 and 60%, with occasional peaks to 100%. This application also starts the thread that talks to the PCI device, which is bound to CPU2. CPU2 activity is about 10%, with peaks to 30%. I assign the CPU run mask with the ThreadCtl(_NTO_TCTL_RUNMASK, mask) function from <sys/neutrino.h>. I compile with qcc to get an ELF binary, which is then started from a QNX shell inside a GNU screen session (terminal multiplexer).
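For reference, pinning the calling thread with ThreadCtl() looks roughly like this (a sketch; on 6.5 the runmask is passed directly as the pointer argument, with bit N selecting processor N — note QNX numbers CPUs from 0, so which bit corresponds to your “CPU2” depends on how you count):

```c
#include <stdint.h>
#include <sys/neutrino.h>

/* Restrict the calling thread to a single processor.
 * cpu is the 0-based processor index; bit (1 << cpu) is set in the runmask. */
static int pin_to_cpu(unsigned cpu)
{
    uintptr_t runmask = (uintptr_t)1 << cpu;
    return ThreadCtl(_NTO_TCTL_RUNMASK, (void *)runmask);
}
```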

I am not using adaptive scheduling.

Any chance you could try 6.6 or 7.0?

Also, are you using the latest 6.5 SP1 kernel + libc?
qnx.com/download/feature.htm … amid=35228

This does seem peculiar. I assume you have not set the CPU mask. If you don’t have a support agreement with QNX, here’s what I would do: rewrite your code so that the interrupt delivers a pulse event. Instead of InterruptWait(), you can then block in MsgReceive(). See if you get the same results.
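That suggestion, sketched out: attach a pulse event to the interrupt with InterruptAttachEvent() and block in MsgReceivePulse() instead of InterruptWait() (a sketch only; the IRQ number is a placeholder and error handling is omitted):

```c
#include <stdint.h>
#include <sys/neutrino.h>

#define PCI_IRQ    10                    /* hypothetical IRQ of the PCI device */
#define PULSE_CODE (_PULSE_CODE_MINAVAIL)

static void *pci_thread(void *arg)
{
    struct sigevent ev;
    struct _pulse   pulse;
    int chid, coid, id;

    ThreadCtl(_NTO_TCTL_IO, 0);

    /* channel + connection the kernel will deliver the pulse on */
    chid = ChannelCreate(0);
    coid = ConnectAttach(ND_LOCAL_NODE, 0, chid, _NTO_SIDE_CHANNEL, 0);

    /* deliver a priority-15 pulse on each interrupt */
    SIGEV_PULSE_INIT(&ev, coid, 15, PULSE_CODE, 0);
    id = InterruptAttachEvent(PCI_IRQ, &ev, _NTO_INTR_FLAGS_TRK_MSK);

    for (;;) {
        MsgReceivePulse(chid, &pulse, sizeof(pulse), NULL);
        /* ... service the device ... */
        InterruptUnmask(PCI_IRQ, id);  /* InterruptAttachEvent masks the IRQ */
    }
    return NULL;
}
```

One thing to watch: InterruptAttachEvent() masks the interrupt each time it fires, so the thread must call InterruptUnmask() after servicing the device, as above.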