io-net timings

I’ve measured some timings within io-net and found two particularly
troublesome choke-points in my code.

The time to tx_down() a packet from an en_en filter is 27.96 us.
The time from calling tx_down() in the down producer, to the time of
rx_down() being called in the converter is 22.80 us.

I’d appreciate any insight into why these timings are much higher than I’d
expect. I’d also be interested in hearing from anyone else who would test
the same timings on a comparable system and report their experience.

These timings were taken by recording ClockCycles() before and after the
event, and dividing the difference by cycles_per_sec. The statistic I
reported is the mean of 100,000 samples, so should be relatively resistant
to interrupts that happen to land in the middle of the call.
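
For what it's worth, each sample was taken with something like the sketch
below (a rough sketch only: the 100,000-sample averaging loop and the actual
call under test are left out, and time_call_us() is just an illustrative
wrapper name, not code from my module):

/* One timing sample: ClockCycles() before and after the call under test,
 * with the difference converted to microseconds via cycles_per_sec from
 * the system page.  The reported figure is the mean of 100,000 samples. */
#include <stdint.h>
#include <sys/neutrino.h>
#include <sys/syspage.h>

static double time_call_us(void (*call)(void *arg), void *arg)
{
    uint64_t cps = SYSPAGE_ENTRY(qtime)->cycles_per_sec;
    uint64_t t0, t1;

    t0 = ClockCycles();
    call(arg);                 /* e.g. a wrapper around the tx_down() call */
    t1 = ClockCycles();

    return (double)(t1 - t0) * 1e6 / (double)cps;
}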

CPU: PMMX-266
NIC: Intel 82559ER (100Mb/s)
QNX: 6.1.0a

For reference, these times on a PIII-750 are 4.97 us and 1.69 us
respectively.

Thanks,
Shaun

Since our guru co-op left (Hi Shaun), I have to finish up the work on our
io-net module. Because io-net is so CPU intensive, I had to change the
design quite a bit. I squeezed our three-layer, three-module io-net stack
into a single io-net module that does everything:
filter-converter-protocol, and registers with io-net as a filter above the
NIC. The next thing I did was preallocate packets for our real-time
traffic to “steal” some time there. Now CPU usage has gone from almost
100% to about 60% when running our “high speed” real-time loop (our PI@266
receives two and sends one minimum-sized Ethernet (100BaseT) packet per
millisecond).
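
The preallocation itself is nothing fancy; the idea is roughly a small free
list filled once at startup, along the lines of the sketch below (generic C,
not the real io-net calls: where alloc_down_pkt()/reg_tx_done() would go is
only indicated in comments, and the names and pool size are just for
illustration).

/* Sketch of the packet-preallocation idea (illustrative only, not the
 * exact io-net API): pay for alloc_down_pkt() and reg_tx_done() once per
 * slot at init time, then recycle slots from the tx_done callback instead
 * of allocating and freeing on every transmitted packet. */
#include <stddef.h>

#define NUM_TX_PKTS 8                    /* pool size chosen for illustration */

struct tx_slot {
    void           *npkt;                /* packet obtained once at init      */
    struct tx_slot *next;
};

static struct tx_slot  slots[NUM_TX_PKTS];
static struct tx_slot *free_list;

static void pool_init(void)
{
    int i;

    for (i = 0; i < NUM_TX_PKTS; i++) {
        /* slots[i].npkt = alloc_down_pkt(...); reg_tx_done(...); -- done
         * here once per slot rather than once per transmitted packet.      */
        slots[i].next = free_list;
        free_list     = &slots[i];
    }
}

static struct tx_slot *pool_get(void)    /* real-time tx path */
{
    struct tx_slot *s = free_list;

    if (s != NULL)
        free_list = s->next;
    return s;                            /* NULL means the pool is exhausted */
}

static void pool_put(struct tx_slot *s)  /* from the tx_done callback */
{
    s->next   = free_list;
    free_list = s;
}

Of course the real module still has to reinitialize each packet before it is
reused; the point is only that the expensive per-packet io-net calls move to
init time.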

The reason I am making this post is related to the QNX Neutrino RTOS
evaluation document (done by “Dedicated Systems”). In the section testing
network performance, “Network stack - TCP/IP”, it is telling that in the
tests they ran (4.9.1 Full TCP/IP Stack - Receive Capacity, 4.9.2 Tiny
TCP/IP Stack - Receive Capacity, 4.9.3 Full TCP/IP Stack - Send Capacity,
4.9.4 Tiny TCP/IP Stack - Send Capacity) CPU usage goes to 100% even for
the smallest packet sizes and low throughput. My guess is that this is
caused more by io-net than by the TCP/IP stack.

My understanding of io-net is that all it does when tx_down()/tx_up() is
called is pass the packet to the next layer. It seems that I am wrong. Can
anybody explain what is really going on, and whether there is a way or
workaround that will improve the performance of the communication stack?

Thanks Gogi.




ion->tx_up() / ion->tx_down() are pretty much function calls to redirect
you to the next layer’s rx_up() / rx_down() funcs. See the thread above this
one where Shaun was consistently getting around 5us per operation. It’s
once the next module gets the packet that the majority of the work is done.
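
Schematically, and leaving out all the real bookkeeping, the hand-off is
just the sketch below (the struct and field names are invented for the
example; this is not the actual io-net source):

/* Illustration only -- not io-net source.  "tx_down() is pretty much a
 * function call" means the packet is handed straight to the next module's
 * rx_down(); the real cost is what that module then does with it. */
struct layer {
    int          (*rx_down)(struct layer *self, void *npkt);
    struct layer  *below;              /* next registrant toward the NIC */
};

static int tx_down(struct layer *from, void *npkt)
{
    return from->below->rx_down(from->below, npkt);
}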

-seanb





Yes, 5 us on a PIII@750, but 22.8 us on a PI@266 (an embedded PC we use as
a temperature measurement device). In both cases it is several thousand
cycles. It seems that we will have to look for a good PIII or faster
PC/104(+) motherboard…

Gogi

That’s not my reading of the thread. In his last entry he said he made a
mistake in his timings and was including more than just the tx_down() to
rx_down(). His bottleneck was the time for tx_down() to return from the
driver, which, depending on what’s happening, is variable.

I still maintain what I said.

-seanb

Sorry, internal lack of communication. Hopefully these are the latest
numbers he came up with:

alloc_down_pkt: 24.00 us
reg_tx_done: 10.01 us
tx_down to rx_down (down-producer to converter): 5.05 us
tx_down to rx_down (converter to filter): 5.09 us
tx_down (filter to ethernet): 31.55 us

It seems that I saved much more by preallocating and reusing the packets
than by “merging” the io-net modules. I got rid of alloc_down_pkt (24 us)
+ reg_tx_done (10 us) + free (? us) per packet. It seems that merging the
io-net modules was pretty much a waste of time. Now we need Shaun back to
fix the damage :).

Thanks, Gogi.

