I was very surprised lately. I had a performance problem. So I
naturally assumed that I was doing something inappropriate. So I coded
a test program. Here’s what I found.
Essentially, the main thread of a process reads it’s command line and
spawns off child threads. The argument on the command line is the
number of milliseconds between repetitive firings on a timer. When the
timer fires, a log record is generated and put into a buffer. So there are
two phases to the program, initialization and data collection. Here is
the program flow:
Initialization:
p1. parent parses the command line
p2. parent pthread_create()s a child thread and passes that arguemnt
p3. parent does a pthread_sleepon_lock()/sleepon_wait()/sleepon_unlock()
c1. child prepares a signal
c2. child prepares a timer event using that signal
c3. child initializes the timer with a repeat interval
c4. child does a pthread_sleepon_lock()/sleepon_signal()/sleepon_unlock()
c5. child does a barrier_wait()
p4. parent loop back and iterates through the next child thread
p5. parent does a parrier_wait()
So far, no problem.
Data Collection: (after all have reached the barrier_wait() )
p10. parent does a pthread_sleepon_lock()/sleepon_wait()
c10. child does a timer_settime()
c11. child does a sigwait()
c12. ( child timer signal fires )
c13. child generates a log record including a clock_gettime() of right now
c14. child does a pthread_mutex_lock()
c15. child does a pthread_sleepon_lock()/sleepon_signal()/sleepon_unlock()
c16. child does a pthread_cond_wait()
p11. parent copies this record into a buffer
p12. parent check if done, if so, writes out buffer and exits
p13. parent does a ptherad_cond_signal()
p14. parent does a pthread_sleepon_unlock()
p15. parent loops back to p10
c17. child does pthread_mutex_unlock()
c18. child loops back to c11
I tried to run this with two child thread with timers going off at
5 & 7 ms. The program was not able to keep up. And, Oh, by the way,
I’m running on a 2.4 GHz processor !
I wanted to see where the bottleneck was so I simplified the data
collection even more. Instead of letting the parent thread put the data
in the table the child thread puts the data in the table. I.E. the
parent never wakes up in phase 2. Now I was able to start threads
with timers set to 3, 5, 7, 11 & 13 ms without loosing a hit or even
being delayed. So, what were the steps that were eliminated?
c15. child does a pthread_sleepon_lock()/sleepon_signal()/sleepon_unlock()
c16. child does a pthread_cond_wait()
p13. parent does a ptherad_cond_signal()
p14. parent does a pthread_sleepon_unlock()
p15. parent loops back to p10
Why are these steps SO INEFFICIENT ? ! ? !
Note: ( p11 and p12 were moved to the child thread.)
BTW, even when I was missing samples the CPU usage indicator seemed to
show almost no CPU usage. I.E. lots of blocking