Determining Idle cycles

kwschumm · May 27, 2008, 4:22pm

Is there a way to programmatically determine how much CPU time is being spent in procnto’s IDLE thread? We need to dynamically adjust DSP parameters based on system load, and monitoring the CPU consumption of the IDLE thread with an averaging algorithm seems like it would do the trick.

The hogs code seems to work only at the process level, not the thread level.

The alternative is to write our own Idle process that runs at priority 1, but we’d rather not clutter up the system if we don’t have to.

rgallen · May 27, 2008, 6:35pm

Ken, try AP (adaptive partitioning); it’s pure magic… really… Get everything working with AP (it shouldn’t take more than a day). Price it out with your sales team, and then decide if it is worth doing the work yourself.

It’s no skin off my nose if you don’t use it; I’m just trying to make your life easier (I am at the point where I wonder how I ever lived without it).

kwschumm · May 27, 2008, 6:41pm

Rennie, I believe you. Really.

But even with AP we would probably have to do something like this. If the user attempts to drive our meter at a frequency higher than it can handle, it needs to be throttled back by tweaking the DSP parameters (in this case the DSP would begin to skip samples and flash an icon to indicate the condition). Even with AP, the load could saturate the partition (and any unused cycles left over from other partitions), so we need to manage that problem.

rgallen · May 27, 2008, 6:58pm

Ken, this is what AP is perfect for. You’ll know when the system is overloaded (the kernel will trigger a pulse notifying you of this, see: SCHED_APS_ATTACH_EVENTS). You then throttle back the DSP parms.

Of course, AP doesn’t tweak the DSP parms for you, but it does everything else (e.g. throttles the offender back - so that your management code can run - then notifies you who it throttled). Your code (that receives the notification of the overload condition) then tweaks the DSP parms, to correct the misconfiguration “permanently” (or at least until the next time the user and/or software/hardware misbehaves). You can also collect .kev files, take core dumps of processes that are misbehaving, etc., etc…

mario · May 27, 2008, 7:57pm

Ken, check with sales first; don’t tempt yourself too much. Unless the price has changed, we could get an extra programmer for a year for the price of AP.

rgallen · May 27, 2008, 8:53pm

It only takes a day to try AP on one of your problems. If you can’t overcome the temptation to pay the extra runtime cost after trying it out (there is no upfront cost - the source is on F27), then that means it’s worth the price.

ps: If you are referring to the old price for the TDK, that has now been changed to zero.

mario · May 27, 2008, 9:58pm

Yes I was referring to the TDK. This is good news, thanks.

kwschumm · May 27, 2008, 10:10pm

Mario, we’ll go into this with eyes open, thanks. I’ll email our sales rep to get a firm quote.

I’m still not convinced we need it, but my mind is open to the possibility. If nothing else it would be fun to play with since the cost of entry is cheap.

The 4.0.1 Momentics docs say this under SchedCtl() in the SCHED_APS_ATTACH_EVENTS section: “Overload notification isn’t implemented in this release”.

That might be an oversight, and assuming it is, this would give us about half of what we’re after. We want to slide the DSP parameters around so that maximum speed is always available given current loading. This means throttling down when the cpu is overloaded and also throttling up when it’s not. There are several situations when we would throttle up, some are user-initiated events and others are automatic.

AP might send a pulse when the system is overloaded, but it doesn’t look like it can be configured to send a pulse when it transitions to a not-overloaded state. That would leave us with the need to poll for the current state, which is much like what I was proposing in the first place (tracking the number of cycles used by the Idle thread).

Plus, we might want to set the specific thresholds where those events occur (not sure yet whether that would be necessary).

rgallen · May 28, 2008, 12:22am

Ken, yes; I think for your app, settable thresholds would be nice to have. They are on the roadmap…

Now that you have provided this information, I agree that the bankruptcy event alone is not sufficient for a fully event-driven approach, and you will need to sample using the SCHED_APS_PARTITION_STATS command. I do think, though, that you may be underestimating the amount of work necessary to get an idle-thread CPU load implementation that will give you good behavior (i.e. merely recognizing how much idle time is available may not be a good indication of how much CPU your DSP thread should be able to use).

If you place your DSP thread into a partition with (say) a 99% budget and 400ms of critical budget, then have your manager sample the partition usage to perform the DSP “throttling”, you will have a very controlled and predictable result. And don’t forget that you can also put your watchdog thread in a 1% partition and be assured that when your DSP partition goes into bankruptcy, you will still be able to service the watchdog and not have a false reset; yet if your software really does go down the toilet, you will get a reset. That last point is one of the most difficult behaviors to actually achieve…

Tim · May 28, 2008, 1:45pm

Mario,

We were surprised to learn about this price change last month too. Our local QNX rep stopped in to pay us a courtesy visit, and while he was here we chatted about AP and he wondered why we didn’t at least try it. I joked that the 50K cost was WAY outside our price range, so there was no reason to even experiment.

He then replied that the cost had been changed to $0. Apparently the original price was set by marketing people (with no clue?) and at that price it sold about as well as you’d expect.

Bottom line is that the new price is a 30% increase on the runtime license. So if you are buying in volume and get a reasonable price (say $100 a licence), then each licence with AP would cost you $130. That is far more palatable, though once you pass about 1600 runtime licences (the old 50K divided by the $30 surcharge per licence) you are actually worse off under the new pricing model.

Tim

maschoen · May 28, 2008, 3:11pm

Ken,

I understand your comment about cluttering up the system, but I think you could implement a priority 1 process that would replace IDLE's function fairly easily. Here is how I would do it. Have two threads. One runs at priority 1 and spins in a loop, incrementing a 64-bit variable. The other thread runs at a very high priority, higher than any application process. It can have either a message passing or I/O interface, it can answer questions via a message like "how loaded am I", and if need be, it can issue pulses when thresholds are passed, up or down.
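A rough sketch of the bookkeeping the high-priority thread might do, assuming it samples the spinner's counter at a fixed interval and has a calibrated figure for how far the counter advances on an otherwise idle system. All names here (`load_monitor_t`, `load_monitor_sample`, and so on) are made up for illustration, and the thread creation, priorities, and pulse delivery are left out. Two separate thresholds give some hysteresis so the "up" and "down" events don't chatter.

```c
#include <assert.h>
#include <stdint.h>

/* All names below are hypothetical; none of this is a QNX API. */

typedef enum { LOAD_NO_EVENT, LOAD_WENT_HIGH, LOAD_WENT_LOW } load_event_t;

typedef struct {
    uint64_t last_count;  /* spinner counter at the previous sample */
    uint64_t idle_delta;  /* calibrated counter progress per interval when idle */
    unsigned low_pct;     /* report recovery at or below this load */
    unsigned high_pct;    /* report overload at or above this load */
    int      overloaded;  /* current state, kept for hysteresis */
} load_monitor_t;

void load_monitor_init(load_monitor_t *m, uint64_t start_count,
                       uint64_t idle_delta, unsigned low_pct, unsigned high_pct)
{
    assert(idle_delta > 0 && low_pct < high_pct && high_pct <= 100);
    m->last_count = start_count;
    m->idle_delta = idle_delta;
    m->low_pct = low_pct;
    m->high_pct = high_pct;
    m->overloaded = 0;
}

/* Load in percent (0..100) implied by how far the spinner advanced.
 * Assumes idle_delta is small enough that delta * 100 cannot overflow. */
unsigned load_percent(const load_monitor_t *m, uint64_t count)
{
    uint64_t delta = count - m->last_count;  /* unsigned math survives wrap */
    if (delta >= m->idle_delta)
        return 0;                            /* spinner ran unhindered */
    return (unsigned)(100 - (delta * 100) / m->idle_delta);
}

/* Take one sample; returns an event only when a threshold is crossed. */
load_event_t load_monitor_sample(load_monitor_t *m, uint64_t count)
{
    unsigned pct = load_percent(m, count);
    m->last_count = count;
    if (!m->overloaded && pct >= m->high_pct) {
        m->overloaded = 1;
        return LOAD_WENT_HIGH;
    }
    if (m->overloaded && pct <= m->low_pct) {
        m->overloaded = 0;
        return LOAD_WENT_LOW;
    }
    return LOAD_NO_EVENT;
}
```

On a 32-bit target, reading a 64-bit counter that another thread is incrementing is not necessarily atomic, so the real thing would probably need an atomic read or a lock around the sample.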

kwschumm · May 28, 2008, 4:13pm

Mitchell, that technique was my original thought, but when the source to hogs was published it seemed to be a natural way to derive the needed information. The procfs_info structure has user and system time accumulators in nanoseconds, so it would provide a direct measurement of time spent in Idle (except that Idle is a procnto thread and the hogs technique only provides process-level info, which prompted this original discussion).

So a priority 1 process A that does nothing but loop could be written. Then, a very high priority process B could simply open A’s proc/pid entry and periodically query its procfs_info struct to directly read cpu time consumed by process A (and handle wraps as necessary). A moving average algorithm could smooth things out, then it would issue pulses and make the loading info available as needed by the DSP thread.
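The smoothing step could look roughly like the following. This is only a sketch under assumptions: the caller is assumed to already have process A's cumulative CPU time and a wall-clock timestamp, both in nanoseconds (how B reads them from procfs is not shown), and the type and function names (`idle_avg_t`, `idle_avg_update`) and the window size are invented for the example. It returns the share of each interval that A consumed, averaged over the last few samples, in per-mille.

```c
#include <assert.h>
#include <stdint.h>

#define AVG_WINDOW 8  /* number of samples in the moving average (arbitrary) */

/* Hypothetical helper type; not part of any QNX header. */
typedef struct {
    uint64_t last_cpu_ns;          /* A's cumulative CPU time at last sample */
    uint64_t last_wall_ns;         /* wall-clock time at last sample */
    unsigned window[AVG_WINDOW];   /* recent shares, in per-mille */
    unsigned count;                /* valid entries in window */
    unsigned next;                 /* slot to overwrite next */
    unsigned sum;                  /* running sum of valid entries */
} idle_avg_t;

void idle_avg_init(idle_avg_t *a, uint64_t cpu_ns, uint64_t wall_ns)
{
    a->last_cpu_ns = cpu_ns;
    a->last_wall_ns = wall_ns;
    a->count = 0;
    a->next = 0;
    a->sum = 0;
}

/* Feed one sample; returns the moving-average share (0..1000 per-mille)
 * of wall time that process A consumed. */
unsigned idle_avg_update(idle_avg_t *a, uint64_t cpu_ns, uint64_t wall_ns)
{
    /* Unsigned subtraction is modulo 2^64, so a wrapped counter still
     * yields the right delta as long as it wrapped at most once. */
    uint64_t cpu_delta = cpu_ns - a->last_cpu_ns;
    uint64_t wall_delta = wall_ns - a->last_wall_ns;
    unsigned share;

    a->last_cpu_ns = cpu_ns;
    a->last_wall_ns = wall_ns;

    if (wall_delta == 0)           /* no time elapsed: keep the old average */
        return a->count ? a->sum / a->count : 0;

    /* Clamp to 1000; the multiply is safe for intervals far longer
     * than any realistic sampling period. */
    share = (cpu_delta >= wall_delta)
          ? 1000u
          : (unsigned)((cpu_delta * 1000u) / wall_delta);

    if (a->count == AVG_WINDOW)
        a->sum -= a->window[a->next];   /* drop the oldest sample */
    else
        a->count++;
    a->window[a->next] = share;
    a->sum += share;
    a->next = (a->next + 1) % AVG_WINDOW;

    return a->sum / a->count;
}
```

A simple windowed mean is used here because it is easy to reason about; an exponential average would need less state if memory matters.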

maschoen · May 28, 2008, 5:04pm

Sounds like a plan to me. ;-)

mario · May 28, 2008, 5:47pm

The low priority process shouldn’t do a busy loop. The idle thread puts the processor to “sleep” with the halt instruction. If the idle thread(s) never get a chance to run because your custom “idle” process uses 100% of the CPU, the processor(s) will consume extra power and generate extra heat for no good reason.

rgallen · May 28, 2008, 5:55pm

Uh-huh… micro-billing in the clock interrupt is the way to go

rgallen · May 28, 2008, 6:41pm

Because I don’t like claiming that something is “easy”, without some proof, I have attached a working DSPtuner program, as well as a DSPSim (simulator), that the DSP tuner “tunes”.

To use this, extract the two zip files into a Momentics workspace; then “Import” existing project into workspace (from inside the IDE).

On a VMware target with aps module installed:

    aps create -b 90 DSP
    on -X aps=DSP DSPSim &
    DSPtuner

Of course, the DSP tuner could use a PID algorithm to tune DSPSim optimally, but that was left out for simplicity. DSPSim is just that: a simulator of your DSP thread/process, on which DSPtuner can operate.
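For anyone curious what the omitted PID step might look like, here is a generic textbook-style sketch; it is not taken from the attached DSPtuner code. It assumes the setpoint is a target partition usage in percent, the measurement is the sampled usage, and the output is an adjustment to some DSP rate parameter. The names (`pid_ctl_t`, `pid_ctl_step`), gains, and limits are placeholders.

```c
#include <assert.h>

/* Hypothetical controller state; gains and limits are placeholders. */
typedef struct {
    double kp, ki, kd;        /* proportional, integral, derivative gains */
    double integral;          /* accumulated error * time */
    double prev_error;        /* error at the previous step */
    double out_min, out_max;  /* output clamp */
} pid_ctl_t;

void pid_ctl_init(pid_ctl_t *c, double kp, double ki, double kd,
                  double out_min, double out_max)
{
    assert(out_min < out_max);
    c->kp = kp;
    c->ki = ki;
    c->kd = kd;
    c->integral = 0.0;
    c->prev_error = 0.0;
    c->out_min = out_min;
    c->out_max = out_max;
}

/* One controller step; dt is the time since the previous step in seconds.
 * Returns the clamped correction to apply to the tuned parameter. */
double pid_ctl_step(pid_ctl_t *c, double setpoint, double measured, double dt)
{
    assert(dt > 0.0);
    double error = setpoint - measured;
    double integral = c->integral + error * dt;
    double derivative = (error - c->prev_error) / dt;
    double out = c->kp * error + c->ki * integral + c->kd * derivative;

    c->prev_error = error;

    /* Simple anti-windup: when the output saturates, do not commit the
     * new integral, so it cannot grow without bound. */
    if (out > c->out_max)
        return c->out_max;
    if (out < c->out_min)
        return c->out_min;

    c->integral = integral;
    return out;
}
```

With `kp = 0.5` and the other gains at zero, a measured usage 10 points below the setpoint yields a correction of +5; tuning real gains would depend on how the DSP parameter actually affects load.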

kwschumm · May 29, 2008, 1:16am

Mario, D’oh! thanks for the reminder, I forgot that Idle does a halt. We’re so far ahead of required battery life it may not matter from a product standpoint but there’s no point in wasting power (they’ve already discontinued the large battery option because nobody needed it).

Rennie, thanks for the code, it’s really appreciated and will help a lot. It does look pretty straightforward.

There is a concern about how much overhead AP will add, especially when it has to emulate a clock cycle counter. We are fighting for every cycle on this thing, and just by removing ClockTime() calls (also emulated on the pxa270) we gained a couple of percent in performance. We’re about 25% from our ultimate performance goals and we’re running out of things to optimize. In retrospect, some boneheaded hardware design decisions were made that really hurt. (Sure, blame it on the hardware.)

rgallen · May 29, 2008, 1:42am

If you bump up the tick size to 10ms, it shouldn’t be too bad. For reference here is a comparison of:

ClockCycles on x86:

    #define ClockCycles()	({ register _Uint64t __cycles; __asm__ __volatile__( \
    						"rdtsc" \
    						: "=A" (__cycles)); __cycles; })

with ClockCycles on ARM:

#
# $QNXLicenseA:
# Copyright 2007, QNX Software Systems. All Rights Reserved.
# 
# You must obtain a written license from and pay applicable license fees to QNX 
# Software Systems before you may reproduce, modify or distribute this software, 
# or any work that includes all or part of this software.   Free development 
# licenses are available for evaluation and non-commercial purposes.  For more 
# information visit http://licensing.qnx.com or email licensing@qnx.com.
#  
# This file may contain contributions from others.  Please review this entire 
# file for other proprietary rights or license notices, as well as the QNX 
# Development Suite License Guide at http://licensing.qnx.com/license-guide/ 
# for other information.
# $
#

#include <asmoff.def>

	.globl	ClockCycles

	.text

ClockCycles:
	stmdb	sp!, {r4,lr}

	ldr		r0, =_syspage_ptr
	ldr		r1, =qtimeptr
	ldr		ip, =callout_timer_value
	ldr		r0, [r0]
	ldr		r1, [r1]

	/*
	 * Disable interrupts
	 */
	mrs		r4, cpsr
	orr		r2, r4, #ARM_CPSR_I | ARM_CPSR_F
	msr		cpsr, r2

	mov		lr, pc
	ldr		pc, [ip]

	ldr		r2, =cycles
	ldr		lr, =last_cycles
	ldmia	r2, {r2,r3}

.ifdef VARIANT_le
	adds	r0, r0, r2
	adc		r1, r3, #0
.else
	adds	r1, r0, r3
	adc		r0, r2, #0
.endif

	/*
	 * Adjust by timer_load if timestamp < last_cycles
	 */
	ldmia	lr, {r2,r3}
.ifdef	VARIANT_le
	cmp		r3, r1
	bhi		0f
	bne		1f
	cmp		r2, r0
	bls		1f
0:	ldr		ip, =qtimeptr
	ldr		ip, [ip]
	ldr		ip, [ip, #TIMER_LOAD]
	adds	r0, r0, ip
	adc		r1, r1, #0
.else
	cmp		r2, r0
	bhi		0f
	bne		1f
	cmp		r3, r1
	bls		1f
0:	ldr		ip, =qtimeptr
	ldr		ip, [ip]
	ldr		ip, [ip, #TIMER_LOAD]
	adds	r1, r1, ip
	adc		r0, r0, #0
.endif

	/*
	 * Update last_cycles
	 */
1:	stmia	lr, {r0,r1}

	/*
	 * Restore interrupts and return
	 */
	msr		cpsr, r4
	ldmia	sp!, {r4,pc}

Yeah, that is a lot bigger, but if you only call it 100 times/sec, it isn’t outrageous.

btw: the DSPtuner code has a few bugs, but you get the idea. Of course, you’d be tuning DSP parameters, not adjusting how much spinning is done, since AP automatically controls the execution of the QNX code. Your app would essentially be tapping into the AP data stream in order to load-adjust an external processor to match the available capacity of the QNX system (which is actually a pretty interesting use case, since AP would be throttling a device external to QNX).

rgallen · May 29, 2008, 3:32am

Here is fixed code, and a utility to “bump” the system.

kwschumm · May 31, 2008, 10:10pm

I’m really bummed. Due to hardware issues this feature can’t be implemented on the current version of our product. Because of the simplistic way the DSP was interfaced to the pxa270, the only way the DSP can adjust its threshold dynamically is by downloading new code to it. There isn’t a single bit of i/o or shared resource anywhere that can be used to indicate when it should start pulse sampling. It takes about 200ms to reconfigure the DSP with new firmware, and we can’t afford that sort of interruption, so without another board spin it’s a no go.

However, we are reworking the product to increase performance. Part of that change is to dump the DSP in favor of an FPGA. The new design will provide for the feedback loop we want to implement. It’s probably six months away but the info in this thread will help a great deal.