QNX response times redux

Sorry to keep bothering you guys regarding this project, but my boss is
adamant that we get consistent packet round-trip times of under 100
microseconds, and we seem to be inching closer…

For those coming in late, here’s what we’ve got:

We have two programs: a background program called DQE, and a simple
text-mode user interactive program called UA. The user can use the UA to
send a message to the DQE (using MsgSend()) to tell the DQE to send a
packet over the network. A DQE on the other machine gets the packet and
sends it back. When the first machine’s DQE sends the packet, it stores
a time value in the packet (from ClockCycles()), so that when it gets
the packet back it can calculate the round-trip time.
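
To be concrete, the trip-time math on the returned packet boils down to something like this sketch (the struct and field names are just for illustration, not our actual packet layout):

#include <stdint.h>
#include <sys/neutrino.h>   // ClockCycles()
#include <sys/syspage.h>    // SYSPAGE_ENTRY()

// Illustrative payload: the cycle count sampled just before transmit.
struct timing_payload {
    uint64_t sent_cycles;
};

// When the packet comes back, convert the elapsed cycle count to microseconds.
// cycles_per_sec is calibrated by the kernel at boot.
static uint64_t round_trip_usec(const struct timing_payload *p)
{
    uint64_t now = ClockCycles();
    uint64_t cps = SYSPAGE_ENTRY(qtime)->cycles_per_sec;

    return ((now - p->sent_cycles) * 1000000ULL) / cps;
}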

The network consists of just these 2 computers with Netgear 83815
ethernet cards connected directly by a crossover cable. Thus there is no
other network traffic to interfere. The networking code just sends
straight ethernet packets containing our desired data (e.g. clock time)
in the payload; no TCP/IP or UDP or anything.

As I said, the UA can send a “send a packet” command message to the
DQE. But the user can also tell the UA to send 10,000 “send a packet”
messages in a tight loop (with a nap(1) after each one). The UA can also
send a “print timing stats” message to the DQE, which will then print
out min, max, and average packet trip times, and a simple histogram.
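
To give a feel for it, the 10,000-message loop in the UA amounts to something like this sketch (the message layout and command code are simplified, and delay() stands in for our nap() call):

#include <unistd.h>         // delay()
#include <sys/neutrino.h>   // MsgSend()

#define CMD_SEND_PACKET 1   // illustrative command code

struct ua_msg   { int cmd; };
struct ua_reply { int status; };

// coid would come from ConnectAttach()/name_open() to the DQE's channel.
static void send_10000(int coid)
{
    struct ua_msg   msg = { CMD_SEND_PACKET };
    struct ua_reply reply;
    int i;

    for (i = 0; i < 10000; i++) {
        // Each MsgSend() blocks until the DQE replies.
        MsgSend(coid, &msg, sizeof(msg), &reply, sizeof(reply));
        delay(1);   // stand-in for nap(1)
    }
}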

Thanks to the wonderful people on this newsgroup, we have learned that
running the DQE at a higher priority than the UA (and pretty much
everything else) gives us good packet times. When not running the Photon
environment on either machine we consistently get packet times of less
than 100 microseconds. However, when sending only one packet, the packet
time often measures longer than the average time when sending 10,000
packets (~84 usec. vs. ~62 usec. The difference was much worse before we
raised the priority of the DQE; sending a single packet was often ~118
usec. or more). Remember, the loop is in the UA, not the DQE; the UA
sends 10,000 “send a packet” requests.
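
For anyone curious, the priority bump itself is nothing fancy; here is a sketch of one way to do it (the value 25 is just an example, anything above the competing drivers should behave the same):

#include <pthread.h>
#include <sched.h>

// Pin the calling thread at a fixed priority under FIFO scheduling.
static int raise_priority(int prio)
{
    struct sched_param param;

    param.sched_priority = prio;    // e.g. raise_priority(25)
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
}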

Now, my boss insists that since QNX supports “hard realtime” which is
our goal, and since QNX strictly honors priority (no thread will ever be
pre-empted by a lower priority thread), our results shouldn’t get worse
if we run these programs in a terminal window in the Photon environment.
But they do. So that’s what I’m banging my head on now; how can we get a
background task that can send and receive packets under 100 usec. even
when running Photon and working on other things? If the DQE is the
highest priority task, nothing else should be preempting it, right?
Could it just be a matter of context switch time? Is my boss’
understanding of “realtime operating systems” correct in this regard?

Perhaps some results would better illustrate:

Typical results for 10,000 packets when running with Photon are:
Min: 62 usec. Max: 189 usec. Avg: 62 usec.
Histogram:
40-79 usec: 9456 packets
80-99 usec: 469 packets
100-199 usec: 76 packets

Typical results when running from a text only environment on both
machines are:
Min: 62 usec. Max: 88 usec. Avg: 62 usec.
Histogram:
40-79 usec: 9999 packets
80-99 usec: 1 packets
100-199 usec: 0 packets

Taking a step back, you need to be careful how you architect your
system. If you have a hard real-time system (like the DQE) running at
high priority, you don’t want a “user” app like the UA to interfere with
normal (hard real-time) operations when it issues commands to, and
queries information from, the DQE.

For example, when the UA issues a command to the DQE, the DQE thread is
doing work “on behalf” of the UA, so it inherits the priority of the UA.
In the grand scheme of things, this is the right way to go, because you
are willing to sacrifice user response speed to let the DQE do its hard
real-time thing.

What you need is a second dedicated worker thread in the DQE that is
locked to a “high” priority (i.e. over just about everything, including
Photon!), and some UA communications threads that loop on a MsgReceive()
with floating priorities. The trick now is how to let the “low priority”
UA communications threads talk to the “high priority” DQE worker
thread(s) without dragging them down.
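
Something like the following sketch would start that locked worker (the priority number and the function names are only examples):

#include <pthread.h>
#include <sched.h>

// Start the DQE worker at an explicit fixed priority, rather than letting it
// inherit whatever the creating thread happens to be running at.
static int start_dqe_worker(void *(*worker)(void *))
{
    pthread_attr_t     attr;
    struct sched_param param;
    pthread_t          tid;

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    param.sched_priority = 25;    // example only: above Photon's drivers
    pthread_attr_setschedparam(&attr, &param);

    return pthread_create(&tid, &attr, worker, NULL);
}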

A common approach is via some sort of mutex-protected communications
area(s). The UA comm thread basically does:

// UA comm thread: floats at the priority of whichever client sent the message
while (1) {
    rcvid = MsgReceive(chid, &msg, sizeof(msg), NULL);

    // Do some prep work at inherited priority

    // Begin critical section
    pthread_mutex_lock(&dqe_mutex);

    // QUICKLY perform UA <-> DQE communications
    // (i.e. read a value, enqueue a command, etc…)

    pthread_mutex_unlock(&dqe_mutex);
    // Finished critical section

    // Do some cleanup

    MsgReply(rcvid, EOK, &reply, sizeof(reply));
}

Everything outside of the critical section runs whenever the DQE is
idle, but can be preempted by the DQE at will. Once the mutex is locked,
the UA thread is in a grey area, where if it is quick, it will get in
and out before the DQE needs to do anything. But if the DQE suddenly
wakes up and attempts to lock that mutex, it will boost the UA comm
thread to the same super-high priority. At that point, the UA comm
thread is holding up the DQE thread and had better get out fast (hence
the emphasis on quickly).
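
For completeness, the DQE worker’s side of that handshake might look something like this sketch (dqe_mutex is the same mutex as above; the comments stand in for whatever the shared area actually holds):

// DQE worker thread, locked at its fixed high priority
while (1) {
    pthread_mutex_lock(&dqe_mutex);

    // QUICKLY copy the next queued command (if any) out of the shared area

    pthread_mutex_unlock(&dqe_mutex);

    // Do the hard real-time DQE work for that command with the mutex
    // released, so the UA comm threads aren't held up any longer than needed
}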

At this point it is critical that you realize that message passing and
mutexes allow for priority inheritance, while condvars don’t because
there is no concept of a condvar “owner”, so we don’t know who to boost.
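
The mutex side of that inheritance can be spelled out explicitly when the mutex is initialized; a minimal sketch:

#include <pthread.h>

pthread_mutex_t dqe_mutex;

// Initialize the shared mutex with the priority-inheritance protocol made
// explicit, so a blocked high-priority locker boosts the current owner.
static void init_dqe_mutex(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&dqe_mutex, &attr);
}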

Now, the point of this long-winded explanation is that it was suggested
that you simply turn off priority inheritance by using the
_NTO_CHF_FIXED_PRIORITY channel flag and lock your DQE process at a
“high” priority. What this does is broaden the critical section, so that
your loop looks more like:

while (1) {
    // Begin critical section
    rcvid = MsgReceive(chid, &msg, sizeof(msg), NULL);

    // Do some prep work

    // Do some DQE processing
    // In your case, for (i = 0; i < 10000; i++)…

    // Do some cleanup

    MsgReply(rcvid, EOK, &reply, sizeof(reply));
    // Finished critical section
}

This will work, especially for a simple test, in that it makes the same
thread do UA comm work and hard real-time DQE work. But it is not
normally recommended because there is always some “prep work” and
“cleanup” that should be done at the UA’s priority, and hence should be
preemptable by hard real-time activities. This is especially true if the
prep and cleanup involve any kind of loop.
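
For reference, the fixed-priority variant comes down to how the DQE’s channel is created; a minimal sketch:

#include <sys/neutrino.h>

// With _NTO_CHF_FIXED_PRIORITY the receiving thread keeps its own (fixed)
// priority instead of floating up or down to each sender's priority.
static int create_dqe_channel(void)
{
    return ChannelCreate(_NTO_CHF_FIXED_PRIORITY);
}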

Hope this helps.

Daryl Low

Strictly speaking, ethernet is not realtime. You use a crossover cable of
course, but keep in mind that ethernet chips were never designed for
deterministic transmissions either. You may get different results by using a
different chipset - try an Intel 82559 for comparison.

As for Photon making things worse, that’s expected. You don’t get graphics
for free - doing those GUI operations steals both CPU and bus cycles, so
graphics may compete with ethernet transmissions, even at the bus level. While
PCI bandwidth may be ‘big enough’, there’s also latency, since the common
bus must be shared. They may also be sharing an interrupt, to add more
confusion. On the software level, depending on how the drivers are written,
the scheduling latency may depend on the activity of other hardware too.

Bottom line is, when you want hard realtime on PC hardware you have to
eliminate all the non-essential hardware components…

– igor

“Karl von Laudermann” <karl@nospam.ueidaq.com> wrote in message
news:atssvs$mg6$1@inn.qnx.com

[ clip … ]

The highest priority process in the GUI is the graphics driver, which runs at
priority 12. If your application runs at a priority higher than 12, you should
not see any appreciable impact due to running Photon. If your apps run at a
priority less than 12, you will be preempted by the graphics driver… which
will definitely affect your timings.

“Igor Kovalenko” <kovalenko@attbi.com> wrote in message
news:atsv1a$p35$1@inn.qnx.com

[ clip … ]

Darrin Fry <darrin@qnx.com> wrote in message
news:atsvhk$p74$1@inn.qnx.com

The highest priority process in the GUI is the graphics driver, which runs at
priority 12. If your application runs at a priority higher than 12, you should
not see any appreciable impact due to running Photon. If your apps run at a
priority less than 12, you will be preempted by the graphics driver… which
will definitely affect your timings.

Not necessarily - any PCI device (video, etc.) can cause undesirable effects
on the overall system outside the scope of the OS (we’ve seen it with video
cards which constantly grab the bus). While the PCI bus will eventually
force a device to relinquish control, constantly consuming/holding maximum
time on the bus will affect the throughput of other devices. So while,
strictly speaking, a lower-priority process will not preempt a higher-priority
one, there are side effects of interacting with hardware (video/ethernet)
which must be taken into account.

-Adam

Igor Kovalenko wrote:

[ clip … ]

Bottom line is, when you want hard realtime on PC hardware you have to
eliminate all the non-essential hardware components…

That makes sense. Since the real product will be running on a
microcontroller card rather than a desktop PC, this isn’t a big concern.
Thanks for the info!

Karl von Laudermann wrote:
[ clip … ]

Typical results when running from a text only environment on both
machines are:
Min: 62 usec. Max: 88 usec. Avg: 62 usec.
Histogram:
40-79 usec: 9999 packets
80-99 usec: 1 packets
100-199 usec: 0 packets

When we talk about time frames in the range above, you also have to
disable the power management of your PC in order to minimize the impact
of the SMI handler (SMI = System Management Interrupt).
Armin