Multicore Thread Performance Measurement with Timestamps

strasserfj · August 11, 2014, 1:46pm

Hello guys,

I want to benchmark the performance of two threads communicating with each other, each pinned to its own core. Specifically I want to benchmark the work as well as interaction/communication delays of these threads and have decided to use the PMU (performance monitoring unit) values like the cycle counter register and some other event counters.

My problem is that the two PMUs aren’t in sync, so I can’t compare the values recorded for each thread after the benchmark is done.

Is there a way to synchronize the PMUs?

What I found is the ClockCycles() call. The QNX documentation states that this call might use a free-running timer located directly on a core. In my case I think it uses the global timer of the system, but I am not sure.

How can I find out whether the source of ClockCycles() is the global timer? (Where should I look in the BSP?)

Or can anyone recommend a better approach?

My environment:
Freescale Sabre Lite Board boundarydevices.com/products/sab … -imx6-sbc/
QNX 6.6.0
QNX 6.6.0 BSP for Sabre Lite community.qnx.com/sf/sfmain/do/v … ojects.bsp

Thanks in advance!

maschoen · August 11, 2014, 8:01pm

ClockCycles() returns the value of a 64-bit processor cycle counter. There is a global system variable that can be used to convert intervals measured with this routine into seconds. I’ve only used it on x86, so I don’t know if this is implemented properly on your processor, but a simple experiment should make this clear.
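A minimal sketch of that conversion follows. On QNX Neutrino the variable in question is normally read as SYSPAGE_ENTRY(qtime)->cycles_per_sec (from <sys/syspage.h>). The helper name cycles_to_ns is hypothetical, and it takes the frequency as a parameter so the arithmetic does not depend on QNX headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert the interval between two cycle-counter
 * readings into nanoseconds. cycles_per_sec is assumed to be the counter
 * frequency (on QNX, taken from the system page). */
static uint64_t cycles_to_ns(uint64_t start, uint64_t end, uint64_t cycles_per_sec)
{
    assert(cycles_per_sec != 0);

    /* Unsigned subtraction still gives the right delta across one wrap. */
    uint64_t delta = end - start;

    /* Split into whole seconds and remainder so the multiplication by 1e9
     * does not overflow for realistic counter frequencies. */
    uint64_t secs = delta / cycles_per_sec;
    uint64_t rem  = delta % cycles_per_sec;

    return secs * 1000000000ULL + (rem * 1000000000ULL) / cycles_per_sec;
}
```

On the target, a call such as cycles_to_ns(t0, t1, SYSPAGE_ENTRY(qtime)->cycles_per_sec), with t0 and t1 taken from ClockCycles() on the same core, should give the elapsed time.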

denkelly · August 18, 2014, 11:54am

ClockCycles … use(s) a free running timer located directly on a core
This is correct - you can’t use the ClockCycles() return value on different cores. Typically, to do these measurements you set “affinity” to a single core for all threads so they are using the same timer.

Another option would be to attempt to determine the approximate “delta” between timers on various cores. The delta would be constant until reboot. (I don’t have a scheme for doing that…)
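One possible scheme, offered only as a sketch and not something from this thread: borrow the NTP-style ping-pong estimate. Thread A (core 0) records t0, signals thread B (core 1), which records t1 on arrival and t2 just before replying, and A records t3 when the reply arrives. Assuming the signalling delay is about the same in both directions, the offset of B's counter relative to A's is ((t1 - t0) + (t2 - t3)) / 2. The function name and the way timestamps are exchanged are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: estimate (counter_B - counter_A) from one ping-pong.
 *   t0: A's counter when the request is sent
 *   t1: B's counter when the request is seen
 *   t2: B's counter when the reply is sent
 *   t3: A's counter when the reply is seen
 * Assumes the request and reply delays are roughly symmetric. */
static int64_t estimate_offset(uint64_t t0, uint64_t t1, uint64_t t2, uint64_t t3)
{
    /* Within one core's counter, time must not run backwards. */
    assert(t3 >= t0 && t2 >= t1);

    int64_t d1 = (int64_t)(t1 - t0);  /* offset + forward delay */
    int64_t d2 = (int64_t)(t2 - t3);  /* offset - return delay  */

    return (d1 + d2) / 2;
}
```

Repeating the exchange many times and keeping the sample with the smallest round trip (t3 - t0) would likely reduce the error caused by scheduling jitter.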

mario · August 19, 2014, 6:29pm

It depends on the CPU model, so check the documentation. On the latest x86 processors the TSC (read with RDTSC) is synchronised across all cores and even takes into account things such as cores powering down or slowing down to conserve power.

The TSC has two modes of operation: one counts REAL core clock cycles, and the other keeps time irrespective of core behavior or configuration. I believe QNX uses the latter.

strasserfj · August 20, 2014, 12:42pm

First of all, thank you for your replies and suggestions.

Following maschoen’s recommendation, I did some experiments to see whether QNX uses the global timer of my CPU. I memory-mapped the global timer and read its low word several times. As denkelly expected, the value stayed zero the whole time, which leads me to conclude that QNX does not use the global timer of my CPU at all.

After doing some more research I configured the global timer to be free-running and wrote some code to read the lower 32 bits of the timer. This is enough for my purposes, so I consider my question answered.

An additional benefit of using this timer is that it runs at a much higher frequency than ClockCycles(), which results in more precise measurements: the global timer runs at CLK/2, ClockCycles() only at CLK/12.
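Because only the lower 32 bits are read, the value wraps fairly quickly at CLK/2. The sketch below shows a wrap-tolerant way to take differences and a rough estimate of the wrap interval. Both function names are hypothetical, and the core clock is a parameter because the actual CLK value is not stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Ticks elapsed between two 32-bit counter reads. Modulo-2^32 arithmetic
 * gives the correct result as long as the counter wrapped at most once. */
static uint32_t elapsed_ticks32(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Seconds until a 32-bit counter clocked at clk_hz / divider wraps. */
static double wrap_seconds32(double clk_hz, unsigned divider)
{
    assert(clk_hz > 0.0 && divider != 0);
    return 4294967296.0 * divider / clk_hz;
}
```

For an assumed core clock of 792 MHz, wrap_seconds32(792e6, 2) comes out at roughly 10.8 seconds, so any single measured interval would need to stay well below that.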

Maybe it helps someone; here is my code:

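#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>      /* mmap_device_memory() */
#include <sys/neutrino.h>  /* ThreadCtl() */
#include <hw/inout.h>      /* out32() */

#define PERIPH_BASE_ADDR 0x00A00000
/* parenthesized so the macro expands safely inside larger expressions */
#define GLOBAL_TIMER_BASE_ADDR (PERIPH_BASE_ADDR + 0x0200)
#define GLOBAL_TIMER_SIZE 0x02FF

/* virtual address of the mapped global timer registers */
static uintptr_t addr;

/* read the lower 32 bits of the global timer counter */
static inline uint32_t getGlobalTime32(void){
	uint32_t value;
	__asm__ __volatile__("ldr %0, [%1]\n\t"
		: "=&r" (value)
		: "r" (addr)
		: "memory");
	return value;
}

/* write 0 to the low (+0) and high (+4) counter words */
static inline void resetGlobalTimer(void){
	out32(addr, 0);
	out32(addr + 4, 0);
}

void configureGlobalTimer(void){
	void *gtimer_addr = NULL;

	/* request I/O privileges for this thread */
	if (ThreadCtl(_NTO_TCTL_IO, 0) == -1) {
		perror("ThreadCtl(_NTO_TCTL_IO) failed");
		exit(EXIT_FAILURE);
	}
	gtimer_addr = mmap_device_memory(NULL, GLOBAL_TIMER_SIZE, PROT_READ | PROT_WRITE | PROT_NOCACHE, 0, GLOBAL_TIMER_BASE_ADDR);
	if (gtimer_addr == MAP_FAILED) {
		perror("mmap_device_memory for physical address failed");
		exit(EXIT_FAILURE);
	}
	addr = (uintptr_t)gtimer_addr;

	/* set timer value to 0 */
	resetGlobalTimer();

	/* enable global timer to be free running, no IRQ, no compare value */
	out32(addr + 8, 1);
}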
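If the full 64-bit value is ever needed, the two words cannot simply be read one after the other, because the low word may roll over in between. A common approach, sketched here as an assumption and not as part of the code above, is to read high, low, high and retry until both high reads match. The function name is hypothetical, and the layout (low word at offset 0, high word at offset 4) follows the writes in the code above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: consistent 64-bit read of a counter exposed as two
 * 32-bit registers, regs[0] = low word, regs[1] = high word. */
static uint64_t read_counter64(const volatile uint32_t *regs)
{
    assert(regs != NULL);

    uint32_t hi, lo, hi2;
    do {
        hi  = regs[1];
        lo  = regs[0];
        hi2 = regs[1];
    } while (hi != hi2);  /* retry if the low word rolled over mid-read */

    return ((uint64_t)hi << 32) | lo;
}
```

On the target, the pointer returned by mmap_device_memory() could be passed in after a cast to const volatile uint32_t *.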

denkelly · August 21, 2014, 12:13am

This post is about the i.MX6, so x86 is irrelevant.