On QNX, is it possible (or advisable) to use WBINVD, even from an interrupt handler? I have seen the post "How to invalidate CPU cache" on the boards (openqnx.com/PNphpBB2-viewtopic-t9557-.html); however, in my application the data isn't a single block of memory. If it is possible, I would really appreciate some direction on the code needed to get this working. If it isn't possible, what alternatives do I have for invalidating the cache on x86?
Background
I'm looking to evaluate the worst-case execution time (WCET) of the component parts of an algorithm, based on research papers from the University of York on timing schemas. The aim of this work is to obtain a pessimistic estimate of the time a piece of code takes to execute. The target architecture for my application is x86.
Currently I have a framework in place for measuring the execution time of a block of code. This code executes at maximum priority under the FIFO scheduler to minimize the variance introduced by context switches, and if required I can also disable interrupts for the duration of a test. All non-essential processes on the machine have been killed.
for (int i = 0; i < numCalls; i++) {
    SERIALIZE_CPU();               // CPUID with EAX = 0
    t1 = ClockCycles();            // RDTSC
    MyFunction();                  // Function to time
    SERIALIZE_CPU();
    times[i] = ClockCycles() - t1;
}
When performing multiple tests of the same algorithm, the first iteration is significantly slower than subsequent iterations. I assume these results indicate that the data enters the cache during the first iteration, making subsequent memory accesses significantly faster. To measure the worst-case execution time accurately, I need to ensure that the cache is in the worst possible state before commencing any measurement: that is, none of the data related to the function under test is in the cache. The approach recommended in the literature [1][2] is to flush the cache using the WBINVD instruction. I know that WBINVD is a ring-0 instruction, so it cannot be executed from a user-space inline assembler block.
As an interim measure I've written code that reads and writes a block of memory large enough (I think) to fill the cache, and unrelated to the memory used by the function under test. This appears to produce more consistent results, with significantly worse performance (in this case a good thing). However, the time this takes is significant compared to the time taken by some of my operations, and I'm not certain that everything in the cache has actually been evicted.
[1] Making Worst Case Execution Time Analysis for Hard Real-Time Tasks on State of the Art Processors Feasible (Petters and Farber)
[2] Estimation of Worst-Case Execution Time Using Statistical Analysis (Edgar)