Performance Hz (ouch!)

Bill_Caroselli1 · March 18, 2004, 2:59pm

I was very surprised lately. I had a performance problem. So I
naturally assumed that I was doing something inappropriate. So I coded
a test program. Here’s what I found.

Essentially, the main thread of a process reads it’s command line and
spawns off child threads. The argument on the command line is the
number of milliseconds between repetitive firings on a timer. When the
timer fires, a log record is generated and put into a buffer. So there are
two phases to the program, initialization and data collection. Here is
the program flow:

Initialization:

p1. parent parses the command line
p2. parent pthread_create()s a child thread and passes that arguemnt
p3. parent does a pthread_sleepon_lock()/sleepon_wait()/sleepon_unlock()
c1. child prepares a signal
c2. child prepares a timer event using that signal
c3. child initializes the timer with a repeat interval
c4. child does a pthread_sleepon_lock()/sleepon_signal()/sleepon_unlock()
c5. child does a barrier_wait()
p4. parent loop back and iterates through the next child thread
p5. parent does a parrier_wait()

So far, no problem.

Data Collection: (after all have reached the barrier_wait() )

p10. parent does a pthread_sleepon_lock()/sleepon_wait()
c10. child does a timer_settime()
c11. child does a sigwait()
c12. ( child timer signal fires )
c13. child generates a log record including a clock_gettime() of right now
c14. child does a pthread_mutex_lock()
c15. child does a pthread_sleepon_lock()/sleepon_signal()/sleepon_unlock()
c16. child does a pthread_cond_wait()
p11. parent copies this record into a buffer
p12. parent check if done, if so, writes out buffer and exits
p13. parent does a ptherad_cond_signal()
p14. parent does a pthread_sleepon_unlock()
p15. parent loops back to p10
c17. child does pthread_mutex_unlock()
c18. child loops back to c11

I tried to run this with two child thread with timers going off at
5 & 7 ms. The program was not able to keep up. And, Oh, by the way,
I’m running on a 2.4 GHz processor !

I wanted to see where the bottleneck was so I simplified the data
collection even more. Instead of letting the parent thread put the data
in the table the child thread puts the data in the table. I.E. the
parent never wakes up in phase 2. Now I was able to start threads
with timers set to 3, 5, 7, 11 & 13 ms without loosing a hit or even
being delayed. So, what were the steps that were eliminated?

c15. child does a pthread_sleepon_lock()/sleepon_signal()/sleepon_unlock()
c16. child does a pthread_cond_wait()
p13. parent does a ptherad_cond_signal()
p14. parent does a pthread_sleepon_unlock()
p15. parent loops back to p10

Why are these steps SO INEFFICIENT ? ! ? !

Note: ( p11 and p12 were moved to the child thread.)

BTW, even when I was missing samples the CPU usage indicator seemed to
show almost no CPU usage. I.E. lots of blocking

Chris_McKillop1 · March 18, 2004, 8:38pm

Bill Caroselli <qtps@earthlink.net> wrote:

I was very surprised lately. I had a performance problem. So I
naturally assumed that I was doing something inappropriate. So I coded
a test program. Here’s what I found.

You might want to try running your program with the instrumented kernel
to see where the contention is taking place using the System Profiler
part of the IDE.

chris

–
Chris McKillop <cdm@qnx.com> “The faster I go, the behinder I get.”
Software Engineer, QSSL – Lewis Carroll –
http://qnx.wox.org/

Colin_Burgess1 · March 18, 2004, 9:18pm

note that it would be work putting in a

TraceEvent( _NTO_TRACE_INSERTUSRSTREVENT, 1, “Performing Step One” );

through your code, so that you can easily match up with what the trace
is telling you…

Chris McKillop wrote:

Bill Caroselli <> qtps@earthlink.net> > wrote:

I was very surprised lately. I had a performance problem. So I
naturally assumed that I was doing something inappropriate. So I coded
a test program. Here’s what I found.

You might want to try running your program with the instrumented kernel
to see where the contention is taking place using the System Profiler
part of the IDE.

chris

–
cburgess@qnx.com

Bill_Caroselli1 · March 19, 2004, 4:59pm

Chris McKillop <cdm@qnx.com> wrote:
CM > Bill Caroselli <qtps@earthlink.net> wrote:

I was very surprised lately. I had a performance problem. So I
naturally assumed that I was doing something inappropriate. So I coded
a test program. Here’s what I found.

CM > You might want to try running your program with the instrumented kernel
CM > to see where the contention is taking place using the System Profiler
CM > part of the IDE.

CM > chris

My understaning is that the instrumente kernel was a pay extra option.
We do own 6.2.1B PE. Does that include the instrumented kernel?

If it does, how do I enable it?

Regardless of the instrumented kernel, I realize that by letting the
parent thread do the work I’m causing 5 extra context switches, but
I still consider this a HUGE performance hit for a 2.4 GHz system.

And, as I said, the little CPU meter is showing mostly idle.

Bill_Caroselli1 · March 19, 2004, 5:24pm

Colin Burgess <cburgess@qnx.com> wrote:
CB > If you have 621B PE, then you have it. Rebuild your image with
CB > procnto-instr, and check the docs for tracelogger and traceprinter.
CB > Then check the docs for the System Profiler perspective in the IDE,
CB > and import the logfile that tracelogger generated. Or alternatively
CB > connect via qconn and use that method to generate a logfile.

And so I do. OK, I rebuilt the OS with procnto_instr and rebooted.

I logged in a root and ran ‘tracelogger’. It printed something about 32
interrupts ??? and then ends. The whole thing ends in about 3 seconds.

Am I using this wrong?

TRACELOGGER version 1.2
interrupt#: 1, iteration#: 1, event#:716
interrupt#: 2, iteration#: 2, event#:715
. . .
interrupt#: 32, iteration#: 32, event#:715

Colin_Burgess1 · March 19, 2004, 5:28pm

If you have 621B PE, then you have it. Rebuild your image with
procnto-instr, and check the docs for tracelogger and traceprinter.
Then check the docs for the System Profiler perspective in the IDE,
and import the logfile that tracelogger generated. Or alternatively
connect via qconn and use that method to generate a logfile.

Bill Caroselli wrote:

Chris McKillop <> cdm@qnx.com> > wrote:
CM > Bill Caroselli <> qtps@earthlink.net> > wrote:

I was very surprised lately. I had a performance problem. So I
naturally assumed that I was doing something inappropriate. So I coded
a test program. Here’s what I found.

CM > You might want to try running your program with the instrumented kernel
CM > to see where the contention is taking place using the System Profiler
CM > part of the IDE.

CM > chris

My understaning is that the instrumente kernel was a pay extra option.
We do own 6.2.1B. Does that include the instrumented kernel?

If it does, how do I enable it?

Regardless of the instrumented kernel, I realize that by letting the
parent thread do the work I’m causing 5 extra context switches, but
I still consider this a HUGE performance hit for a 2.4 GHz system.

And, as I said, the little CPU meter is showing mostly idle.

–
cburgess@qnx.com

Colin_Burgess1 · March 19, 2004, 5:59pm

You should now have a /dev/shmem/tracebuffer file which is the logfile.
If you run traceprinter then it will print out that log.

For a graphical view, Import that file into your IDE (make sure it is
named something.kev) and then double click - voila!

Bill Caroselli wrote:

Colin Burgess <> cburgess@qnx.com> > wrote:
CB > If you have 621B PE, then you have it. Rebuild your image with
CB > procnto-instr, and check the docs for tracelogger and traceprinter.
CB > Then check the docs for the System Profiler perspective in the IDE,
CB > and import the logfile that tracelogger generated. Or alternatively
CB > connect via qconn and use that method to generate a logfile.

And so I do. OK, I rebuilt the OS with procnto_instr and rebooted.

I logged in a root and ran ‘tracelogger’. It printed something about 32
interrupts ??? and then ends. The whole thing ends in about 3 seconds.

Am I using this wrong?

TRACELOGGER version 1.2
interrupt#: 1, iteration#: 1, event#:716
interrupt#: 2, iteration#: 2, event#:715
. . .
interrupt#: 32, iteration#: 32, event#:715

–
cburgess@qnx.com

Bill_Caroselli1 · March 19, 2004, 7:01pm

Colin Burgess <cburgess@qnx.com> wrote:
CB > You should now have a /dev/shmem/tracebuffer file which is the logfile.
CB > If you run traceprinter then it will print out that log.

CB > For a graphical view, Import that file into your IDE (make sure it is
CB > named something.kev) and then double click - voila!

My point was that tracelogger finished so quickly I didn’thave a chance
to load my program. And the system was essentiallyidle at the time.

I tried to un tracelogger again with 256 buffers but it still ended
after 32 lines. Is there a quiet mode for tracelogger? -q

Colin_Burgess1 · March 19, 2004, 7:36pm

I don’t think there is. The 630 version of tracelogger has been
rewritten and is quiet.

You can start tracelogger in daemon mode, and then add a

TraceEvent( _NTO_TRACE_START );
…
TraceEvent( _NTO_TRACE_STOP );
TraceEvent( _NTO_TRACE_FLUSHBUFFERS );

to your application.

Or you just do it with a shellscript

#!/bin/sh

n0 for unlimited

tracelogger -n0 &
myapp
slay tracelogger

Bill Caroselli wrote:

Colin Burgess <> cburgess@qnx.com> > wrote:
CB > You should now have a /dev/shmem/tracebuffer file which is the logfile.
CB > If you run traceprinter then it will print out that log.

CB > For a graphical view, Import that file into your IDE (make sure it is
CB > named something.kev) and then double click - voila!

My point was that tracelogger finished so quickly I didn’thave a chance
to load my program. And the system was essentiallyidle at the time.

I tried to un tracelogger again with 256 buffers but it still ended
after 32 lines. Is there a quiet mode for tracelogger? -q

–
cburgess@qnx.com

Bill_Caroselli1 · March 19, 2004, 7:43pm

Colin Burgess <cburgess@qnx.com> wrote:

CB > Or you just do it with a shellscript

CB > #!/bin/sh

CB > # n0 for unlimited
CB > tracelogger -n0 &
CB > myapp
CB > slay tracelogger

I tried this. Tracelogger seems to consume 100% of the CPU.
My app doesn’t start until I kill tracelogger.

Colin_Burgess1 · March 19, 2004, 8:18pm

tracelogger doesn’t background itself, so make sure you do that.

It shouldn’t take much more than a few percent of cpu.

Bill Caroselli wrote:

Colin Burgess <> cburgess@qnx.com> > wrote:

CB > Or you just do it with a shellscript

CB > #!/bin/sh

CB > # n0 for unlimited
CB > tracelogger -n0 &
CB > myapp
CB > slay tracelogger

I tried this. Tracelogger seems to consume 100% of the CPU.
My app doesn’t start until I kill tracelogger.

–
cburgess@qnx.com

David_Gibbs1 · March 22, 2004, 3:09pm

Bill Caroselli <qtps@earthlink.net> wrote:

First thing I wondered, when looking at this, is why are you
mixing sleepon and condvars? The pthread_sleepon_xxx routines
are, basically, cover functions for mutexes (sleepon_lock) and
condvars (sleepon_wait, sleepon_signal), but unlike using your
own explicit mutexes & condvars, all sleepon_lock() calls share
one global mutex. (And, they allocate & destroy condvars as
needed.)

And, just at a guess – without looking over the logic – and the
fact that you aren’t burning much CPU, I’d bet you have a contention
or serialization problem that is slowing you down.

with timers set to 3, 5, 7, 11 & 13 ms without loosing a hit or even
being delayed. So, what were the steps that were eliminated?

c15. child does a pthread_sleepon_lock()/sleepon_signal()/sleepon_unlock()
c16. child does a pthread_cond_wait()
p13. parent does a ptherad_cond_signal()
p14. parent does a pthread_sleepon_unlock()
p15. parent loops back to p10

Why are these steps SO INEFFICIENT ? ! ? !

I would bet that you are creating a timing dependency, or serialization,
between threads that shouldn’t be serialized. For instance, the
pthread_sleepon_lock() will try to lock a single, global, mutex – if
it is already locked (e.g. by the parent that will unlock it in p14),
then that lock will block… probably greatly increasing the delay.

-David

David Gibbs
dagibbs@qnx.com

Bill_Caroselli1 · March 22, 2004, 3:35pm

David Gibbs <dagibbs@qnx.com> wrote:
DG > Bill Caroselli <qtps@earthlink.net> wrote:

DG > First thing I wondered, when looking at this, is why are you
DG > mixing sleepon and condvars? The pthread_sleepon_xxx routines
DG > are, basically, cover functions for mutexes (sleepon_lock) and
DG > condvars (sleepon_wait, sleepon_signal), but unlike using your
DG > own explicit mutexes & condvars, all sleepon_lock() calls share
DG > one global mutex. (And, they allocate & destroy condvars as
DG > needed.)

I needed two different sleepons. Since sleepon is global, I kept the first
one and added the condvar as the second one.

DG > And, just at a guess – without looking over the logic – and the
DG > fact that you aren’t burning much CPU, I’d bet you have a contention
DG > or serialization problem that is slowing you down.

with timers set to 3, 5, 7, 11 & 13 ms without loosing a hit or even
being delayed. So, what were the steps that were eliminated?

c15. child does a pthread_sleepon_lock()/sleepon_signal()/sleepon_unlock()
c16. child does a pthread_cond_wait()
p13. parent does a ptherad_cond_signal()
p14. parent does a pthread_sleepon_unlock()
p15. parent loops back to p10

Why are these steps SO INEFFICIENT ? ! ? !

DG > I would bet that you are creating a timing dependency, or serialization,
DG > between threads that shouldn’t be serialized. For instance, the
DG > pthread_sleepon_lock() will try to lock a single, global, mutex – if
DG > it is already locked (e.g. by the parent that will unlock it in p14),
DG > then that lock will block… probably greatly increasing the delay.

I am definitely creating a serialization bottleneck. I know that. I need
a sequence like:
child does a little
parent does a little
child does a little
The parent/child pair should not block (except waiting for the other)
once this process has begun.

The other children are expected to block while a parent and one child
are doing their thing.

I will recode this using only condvars to see if the efficiency improves.

What is the expected thread-to-thread context switch latency of a
2.4 GHz Pentium III processor?

Colin_Burgess1 · March 25, 2004, 5:01pm

Bill - did you get anywhere with this?

Colin Burgess wrote:

tracelogger doesn’t background itself, so make sure you do that.

It shouldn’t take much more than a few percent of cpu.

Bill Caroselli wrote:

Colin Burgess <> cburgess@qnx.com> > wrote:

CB > Or you just do it with a shellscript

CB > #!/bin/sh

CB > # n0 for unlimited
CB > tracelogger -n0 &
CB > myapp
CB > slay tracelogger

I tried this. Tracelogger seems to consume 100% of the CPU.
My app doesn’t start until I kill tracelogger.

–
cburgess@qnx.com

Bill_Caroselli1 · March 26, 2004, 6:33pm

Colin Burgess <cburgess@qnx.com> wrote:
CB > Bill - did you get anywhere with this?

No. There is temporarilly something more important.

I should get back to it next week.

Bill_Caroselli1 · April 28, 2004, 9:29pm

Colin Burgess <cburgess@qnx.com> wrote:
CB > Bill - did you get anywhere with this?

CB > Colin Burgess wrote:

tracelogger doesn’t background itself, so make sure you do that.

It shouldn’t take much more than a few percent of cpu.

Bill Caroselli wrote:

Colin Burgess <> cburgess@qnx.com> > wrote:

CB > Or you just do it with a shellscript

CB > #!/bin/sh

CB > # n0 for unlimited
CB > tracelogger -n0 &
CB > myapp
CB > slay tracelogger

I tried this. Tracelogger seems to consume 100% of the CPU.
My app doesn’t start until I kill tracelogger.

It’s been a while since I worked on this function. It was actually a test
function for a more important function that I have been working on.

It turns out that the bad boy isn’t the synchronization mechisms.
Instead it appears to be a call to clock_gettime().

Is this know to be a not very efficient function?
If so, is there a better replacement for it?

Colin_Burgess1 · April 29, 2004, 12:15pm

If you are just trying to get the difference between two times, then use
the SYSPAGE_ENTRY(qtime)->nsec field - this obviates the need for a
kernel call.

Bill Caroselli wrote:

Colin Burgess <> cburgess@qnx.com> > wrote:
CB > Bill - did you get anywhere with this?

CB > Colin Burgess wrote:

tracelogger doesn’t background itself, so make sure you do that.

It shouldn’t take much more than a few percent of cpu.

Bill Caroselli wrote:

Colin Burgess <> cburgess@qnx.com> > wrote:

CB > Or you just do it with a shellscript

CB > #!/bin/sh

CB > # n0 for unlimited
CB > tracelogger -n0 &
CB > myapp
CB > slay tracelogger

I tried this. Tracelogger seems to consume 100% of the CPU.
My app doesn’t start until I kill tracelogger.

It’s been a while since I worked on this function. It was actually a test
function for a more important function that I have been working on.

It turns out that the bad boy isn’t the synchronization mechisms.
Instead it appears to be a call to clock_gettime().

Is this know to be a not very efficient function?
If so, is there a better replacement for it?

–
cburgess@qnx.com

Bill_Caroselli1 · April 30, 2004, 12:57pm

Colin Burgess <cburgess@qnx.com> wrote:
CB > If you are just trying to get the difference between two times, then use
CB > the SYSPAGE_ENTRY(qtime)->nsec field - this obviates the need for a
CB > kernel call.

I’m not looking for the difference in two times.

I want to record a time stamp.

Is there a convienence function to convert from SYSPAGE_ENTRY(qtime) to
a struct timespec?

If not, can I assume that the conversion is simply
boot_time * 1000000000000 + nsec ?

Bill_Caroselli1 · April 30, 2004, 1:05pm

Bill Caroselli <qtps@earthlink.net> wrote:
BC > Colin Burgess <cburgess@qnx.com> wrote:
BC > CB > If you are just trying to get the difference between two times, then use
BC > CB > the SYSPAGE_ENTRY(qtime)->nsec field - this obviates the need for a
BC > CB > kernel call.

BC > I’m not looking for the difference in two times.

BC > I want to record a time stamp.

BC > Is there a convienence function to convert from SYSPAGE_ENTRY(qtime) to
BC > a struct timespec?

BC > If not, can I assume that the conversion is simply
BC > boot_time * 1000000000000 + nsec ?

OK, I see that it is NOT that simple.

Is there a function that will do this?

Sunil_Kittur1 · April 30, 2004, 2:19pm

Bill Caroselli wrote:

Bill Caroselli <> qtps@earthlink.net> > wrote:
BC > Colin Burgess <> cburgess@qnx.com> > wrote:
BC > CB > If you are just trying to get the difference between two times, then use
BC > CB > the SYSPAGE_ENTRY(qtime)->nsec field - this obviates the need for a
BC > CB > kernel call.

BC > I’m not looking for the difference in two times.

BC > I want to record a time stamp.

BC > Is there a convienence function to convert from SYSPAGE_ENTRY(qtime) to
BC > a struct timespec?

BC > If not, can I assume that the conversion is simply
BC > boot_time * 1000000000000 + nsec ?

OK, I see that it is NOT that simple.

Is there a function that will do this?

nsec2timespec() converts a 64-bit nsec value to a timespec.

Do you need the timespec to be time-of-day, or just a
monotonically increasing stamp since system boot?

Sunil.