TCP/IP Performance

Folks!

Can anyone shed some light on QNX’s TCP/IP performance? I have an
ongoing issue (since 6.0) where the throughput of the IP stack is
poor on low-power processor platforms.

I’ve noticed this on various processor architectures and with different
versions of the IP stack.

An example is the chargen service on the stack: I have a test
application that requests data from this port and then calculates the
throughput.


The results are as follows (all measured from the same requesting machine;
when running QNX, all targets have the same network card and identical drivers):

Processor/Speed     OS           Max Data Rate

AMD 1500MHz         W2000        > 70 Mbit/sec
AMD 1500MHz         QNX 6.1      > 70 Mbit/sec
AMD 1500MHz         QNX 6.2.1    > 70 Mbit/sec
AMD 1500MHz         Linux        > 80 Mbit/sec

NEC 300MHz          QNX 6.1      ~ 3 Mbit/sec
NEC 300MHz          QNX 6.2      ~ 3 Mbit/sec
NEC 300MHz          Linux        > 50 Mbit/sec

PPC 750 200MHz      QNX 6.1      ~ 3 Mbit/sec
PPC 750 200MHz      QNX 6.2      ~ 3 Mbit/sec

I appreciate that I’m using much slower processors for the embedded
systems, so the final throughput should be lower; however, the results
don’t scale the way I would expect.

One issue off the top is that chargen does 74-byte writes.
With our message-passing model, you’ll get more bang for
the buck with a larger value. The sweet spot can vary
per machine, but I’d say something of at least 4K.

-seanb
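To make Sean’s point concrete, here is a rough sketch of a chargen-style sender that emits the rotating pattern in large writes rather than one 74-byte line per call. This is in no way the stack’s own chargen implementation; the listening port and the 8K chunk size are arbitrary choices for illustration.

/*
 * Sketch only: a TCP sender producing the classic chargen rotating
 * pattern, but pushing it out in one large write() per call instead of
 * one 74-byte line per call.  Port 19000 and CHUNK are arbitrary.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define CHUNK 8192                 /* "at least 4K" per Sean's suggestion */

static void fill_pattern(char *buf, size_t len)
{
    /* Rotating printable pattern, 72 chars + CRLF per line, like chargen. */
    static int start = 0;
    size_t i = 0;
    while (i + 74 <= len) {
        for (int j = 0; j < 72; j++)
            buf[i + j] = ' ' + (start + j) % 95;
        buf[i + 72] = '\r';
        buf[i + 73] = '\n';
        i += 74;
        start = (start + 1) % 95;
    }
    memset(buf + i, ' ', len - i);  /* pad the tail of the chunk */
}

int main(void)
{
    signal(SIGPIPE, SIG_IGN);      /* don't die when a client disconnects */

    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port = htons(19000);  /* arbitrary test port */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);

    if (s < 0 || bind(s, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(s, 1) < 0) {
        perror("setup");
        return EXIT_FAILURE;
    }

    for (;;) {
        int c = accept(s, NULL, NULL);
        if (c < 0)
            continue;
        char buf[CHUNK];
        for (;;) {
            fill_pattern(buf, sizeof buf);
            /* One big write per loop: far fewer trips through the stack
             * than 74-byte writes for the same number of bytes. */
            if (write(c, buf, sizeof buf) <= 0)
                break;
        }
        close(c);
    }
}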




Just for comparison, I’ve been running some very similar tests, and I seem
to get better results than yours from slow machines, and not quite as good
results from fast machines:

Processor speed     OS           Data rate from chargen
200 MHz             QNX 6.2.0    ~8 Mbps
500 MHz             QNX 6.2.0    ~20 Mbps
1.8 GHz             QNX 6.2.0    ~50 Mbps
2.4 GHz             QNX 6.2.1    ~65 Mbps

I’m just making a TCP connection to the chargen port and hanging in a loop
doing recv() into a large buffer until I’ve seen a total of 10MB.

Murf
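For illustration, a rough sketch of the kind of client loop Murf describes: connect to the chargen port, recv() into a large buffer until 10 MB have arrived, and report the rate. The target address is a placeholder and error handling is minimal.

/*
 * Sketch only: chargen throughput client.  "10.0.0.1" is a placeholder.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define TOTAL   (10 * 1024 * 1024)   /* stop after 10 MB */
#define BUFSZ   (64 * 1024)          /* "large buffer"   */

int main(int argc, char **argv)
{
    const char *host = (argc > 1) ? argv[1] : "10.0.0.1";  /* placeholder */
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port = htons(19);                 /* chargen */
    addr.sin_addr.s_addr = inet_addr(host);

    if (s < 0 || connect(s, (struct sockaddr *)&addr, sizeof addr) < 0) {
        perror("connect");
        return EXIT_FAILURE;
    }

    static char buf[BUFSZ];
    long long total = 0;
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);

    while (total < TOTAL) {
        ssize_t n = recv(s, buf, sizeof buf, 0);
        if (n <= 0)
            break;
        total += n;
    }

    gettimeofday(&t1, NULL);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    if (secs > 0)
        printf("%lld bytes in %.2f s = %.2f Mbit/s\n",
               total, secs, (total * 8.0) / (secs * 1e6));
    close(s);
    return EXIT_SUCCESS;
}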


Murf,

Your results are closer to what I expected. I’ve never been able to
achieve these rates on the slower machines.

Have you also noticed that the IP stack can sometimes fragment
unnecessarily?

I’m confident that my hardware is OK: I’ve written a bridge module
that sits underneath the IP stack and forwards the network traffic to
another NIC. In this mode I can talk to one of the faster machines, via
the slow NEC processor, and still achieve good throughput.


[PC]--->[NEC, EN0]--->[bridge]--->[NEC, EN1]--->[AMD 1500]
        ^------------300MHz NEC------------^

In this configuration I can saturate the link between EN1 and the AMD
1500 and get nearly full bandwidth (it’s not an Ethernet port, but it has a
capacity of around 18 Mbit/sec).

If I run the test against the 300MHz NEC machine itself, running the same
firmware, I get a data rate of ~3 Mbit/sec.

One other thing to mention is that this appears to be independent of the
link settings on the NEC machine (10/100Mbit, full/half duplex), and it’s
independent of my code!

One other question:

Has anyone done any optimisation on the Realtek 8139 code? I have a copy
of the source code from the UK office and notice that it uses a pool
of RX npkts for receive; however, the version I received uses the pool in a
strange way. Basically, it deletes any receive_complete npkts and then
does an ion-alloc to add a packet to the RX pool.

This happens for every frame.

Q1. Isn’t this inefficient?
Q2. Why does the driver not recycle the npkts once they have been released?


Dave




Dave Edwards wrote:

Has anyone done any optimisation on the realtek 8139 code?

As an aside, we did an evaluation of a dozen or so MACs about two years
ago, and the 8139 was architecturally flawed such that it could never
achieve the performance you get from some “better” designs. The Realtek
chip is more or less an NE-2000-like design, except that the circular
buffer is in main memory. The trouble is that it requires a copy
from the memory ring into a packet buffer, and thus won’t match the speed
of cards that can DMA directly into usable packet memory.

The best chips we found at that time were the AMD PC-Net and the
National 83815 (which is used on the Netgear FA-311). The National chip
was the best performer of the group tested on our platform, though only
slightly better than the AMD. That’s not the end of the story, because the
driver also makes a big difference: there is a huge variance across
OSes and OS versions, even with a known-good MAC.

So your question “isn’t this inefficient?” is rather astute, in
that the driver may also be limiting the performance. But my
point is that if you really want the fastest performance, the Realtek
8139 is probably not the best way to achieve it. If cost is the
goal, over performance, then that chip may be a good choice.

lew

As far as the 8139 goes, Lew summed it up very nicely: the 8139 follows in the grand
tradition of the NE2000, which served its purpose for many years - but they’re
both primitive antiques by today’s standards, especially if you have any interest
in performance.

As for reuse of buffers, I agree that it would be much more efficient to reuse
the buffers; the nic_allocator functions provide a handy (although undocumented)
way to do this. Before we discovered nic_allocator we implemented the same thing
on our own and witnessed a substantial performance improvement.

As for Sean’s comments about the use of small buffers, I threw together a chargen
that does big sends, and saw the 8Mbps output of a 200MHz machine jump to over
75Mbps. Another example of one of the basic truths of computers: if you move
data around in small chunks you’ll pay a high price in performance.

Murf
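As a sketch of the buffer-recycling idea Murf describes: this is not the nic_allocator API (whose real signatures live in <drvr/support.h> and are undocumented), but the sort of hand-rolled free list he mentions. It keeps released RX buffers on a list and hands them back to the ring instead of hitting the heap allocator for every received frame. No locking is shown, so it assumes a single receive thread.

/*
 * Generic RX buffer pool sketch (not the nic_allocator API).
 */
#include <stdlib.h>
#include <string.h>

#define RX_BUF_SIZE 1600            /* room for a full Ethernet frame */

struct rx_buf {
    struct rx_buf *next;
    unsigned char  data[RX_BUF_SIZE];
};

static struct rx_buf *free_list;

/* Pre-fill the pool once at driver init. */
static int pool_init(int count)
{
    while (count-- > 0) {
        struct rx_buf *b = malloc(sizeof *b);
        if (b == NULL)
            return -1;
        b->next = free_list;
        free_list = b;
    }
    return 0;
}

/* Grab a buffer for the RX ring: O(1), no heap traffic in the fast path. */
static struct rx_buf *pool_get(void)
{
    struct rx_buf *b = free_list;
    if (b != NULL)
        free_list = b->next;
    else
        b = malloc(sizeof *b);      /* fall back to the heap if the pool is dry */
    return b;
}

/* Called when the upper layer releases a packet: recycle, don't free. */
static void pool_put(struct rx_buf *b)
{
    b->next = free_list;
    free_list = b;
}

int main(void)
{
    if (pool_init(64) != 0)
        return 1;
    struct rx_buf *b = pool_get();
    memcpy(b->data, "frame", 6);    /* pretend the NIC filled it */
    pool_put(b);                    /* upper layer done: recycle */
    return 0;
}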


Gents,

I must have missed out on the NIC allocator routines. Is there any
documentation? Are they mentioned in the 6.2PE docs?


With respect to the Realtek chip, I’m currently stuck with it, as it’s
fitted to the COTS board that we are using. I would rather use an AMD-based
device (for the DMA reasons given), but I can’t.


As to packet sizing, I’ll rework my chargen test to see if that changes
things.

Dave



Assuming you have the Network DDK, take a look in usr/include/drvr/support.h for the
nic_allocator stuff; that’s the only documentation I’m aware of short of contacting
your sales rep.

On the packet sizing, note that it was the chargen daemon I changed (replaced,
actually), not the client. The standard QNX version, according to Sean, does 74-byte
writes; mine does 7030-byte writes.

Murf


Murf,

To save me time, would you be willing to let me have a copy?

Dave



OK,

I’ve just checked the support.h file.

There’s not a lot in it.

Can Sean or Chris please explain the functions of:

nic_allocator_create
nic_allocator_alloc
nic_allocator_free
nic_allocator_destroy

My guess is that these create a pool of npkts for use by a driver; if
that’s so, what are the args to be passed to nic_allocator_create?


Cheers

Dave



I am hesitant to document it here in the newsgroup since I am not
really sure whether it will continue to be used in the future. Right now
only the pcnet driver is taking advantage of it, so you can certainly
look at that driver for an example.

chris


Chris McKillop <cdm@qnx.com>    "The faster I go, the behinder I get."
Software Engineer, QSSL                           -- Lewis Carroll --
http://qnx.wox.org/

Thanks for the pointer. I’ll go have a look at the PCnet docs/code.

Dave



Dave Edwards <Dave.edwards@abicom-international.com> wrote:

Murf,

To save me time, would you be willing to let me have a copy?

Dave

You might want to look at the ‘netperf’ package. It lets you
vary this and a lot more via command line args. It also uses
the MSG_WAITALL flag on recv() which can also cut down on
message passes.

-seanb

Sounds interesting! Where would we find that? The package I just
downloaded from the netperf homepage doesn’t use MSG_WAITALL, and I
can’t find any reference to netperf on the QNX site.

Murf


The netperf homepage version should be fine. And you’re right, it looks like
it doesn’t use MSG_WAITALL; I must have dreamt it. But its use can
have the described effect.

-seanb
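For reference, an illustrative fragment of the MSG_WAITALL usage Sean describes. With the flag set, recv() doesn’t return until the whole buffer is filled (or the connection ends), so a client asking for 64K makes one call instead of waking up for every small segment. 'sock' is assumed to be an already-connected TCP socket.

#include <sys/socket.h>

ssize_t read_block(int sock, void *buf, size_t len)
{
    /* Fewer recv() calls means fewer message passes into the stack. */
    return recv(sock, buf, len, MSG_WAITALL);
}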


John A. Murphy wrote:

As for Sean’s comments about the use of small buffers, I threw together a chargen
that does big sends, and saw the 8Mbps output of a 200MHz machine jump to over
75Mbps. Another example of one of the basic truths of computers: if you move
data around in small chunks you’ll pay a high price in performance.

That’s only true in general for moving data through a network. Moving
it around in local memory, while still not as efficient, should not be THAT
horrible.

So the high price is not really for moving data - it is for using too
many context switches when moving data locally.

– igor


I have to agree. If you combine this with how QNX handles interrupts,
it quickly becomes obvious why it suffers from poor performance on
lower-end processors.

Software interrupts, which are used by the Realtek driver (and mine), have
a default resolution of around 1ms. By this, I mean that it can take
up to 1ms for the OS to start execution of the soft-interrupt routine.
This timing is derived from the main system scheduling clock, and
although it can be adjusted, setting it to lower values has its own
issues.

After doing the sums, it’s apparent that any kind of synchronous
networking will suffer from poor performance.

The question I now find myself asking is “why do I continue to use QNX
for high-speed networking, when there is a faster, ‘free’ alternative in
Linux?”

Dave

Dave Edwards <Dave.edwards@abicom-international.com> wrote:

Software interrupts, which are used by the Realtek driver (and mine), have
a default resolution of around 1ms. By this, I mean that it can take
up to 1ms for the OS to start execution of the soft-interrupt routine.
This timing is derived from the main system scheduling clock, and
although it can be adjusted, setting it to lower values has its own
issues.

This isn’t true. The timer interrupt has a default period of
1ms. Scheduling latency of interrupt handlers (whether attached to
the timer interrupt or otherwise) has nothing to do with this value.

-seanb


Igor Kovalenko wrote:

That’s only true in general for moving data through a network. Moving
it around in local memory, while still not as efficient, should not be THAT
horrible.

So the high price is not really for moving data - it is for using too
many context switches when moving data locally.

– igor

Over the last 25 years or so we’ve found it to be true in general. It may be worse
when there’s a network and/or context switches involved, but EVERY transfer consists of
a setup time, or time_per_transaction, and a transfer time, or time_per_byte. The more
of those time_per_byte chunks of time you can stuff under one time_per_transaction
chunk, the faster you go. When the time_per_transaction is made up of function calls
and register loads it may not make as big a difference as when it’s made up of context
switches or packet transmissions, but the principle still holds. The question is “How
high is high?”, or “How horrible is THAT horrible?”

Murf
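A back-of-the-envelope model of that principle, with made-up per-call and per-byte costs chosen only to show the shape of the curve:

/*
 * Hypothetical numbers only: each transfer costs a fixed per-call overhead
 * plus a per-byte cost.  Compare 74-byte, 4096-byte and 7030-byte writes.
 */
#include <stdio.h>

int main(void)
{
    const double t_call = 100e-6;   /* hypothetical per-write overhead: 100 us */
    const double t_byte = 0.1e-6;   /* hypothetical per-byte cost: 0.1 us      */
    const int sizes[] = { 74, 4096, 7030 };

    for (unsigned i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        double secs_per_write = t_call + sizes[i] * t_byte;
        double mbit_per_sec   = (sizes[i] * 8.0) / secs_per_write / 1e6;
        printf("%5d-byte writes: %7.1f Mbit/s\n", sizes[i], mbit_per_sec);
    }
    return 0;
}

With these hypothetical costs, 74-byte writes work out to roughly 5 Mbit/s and 7030-byte writes to roughly 70 Mbit/s, which is at least the right order of magnitude for the jump Murf measured.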


Sean,

I’m currently seeing my ISR routines executing with around 1ms
latency; these are triggered by an ISR-delivered event with a
high priority.

After a long period of experimentation I found that this delay was
reduced by using the ClockPeriod function.

I don’t claim to know or understand what is going on here, but it
appears to me that the servicing of the interrupt via the event
mechanism occurs at the rate of the microkernel tick period.

Since this discovery, I’ve tried operating within the HW ISR itself (by
writing my own handler function); this works better, but it can still take
one tick to pass control over to the software ISR.

Dave
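For reference, a rough sketch (not Dave’s driver) of the event-driven interrupt pattern under discussion on QNX Neutrino 6.x: InterruptAttachEvent() delivering a high-priority pulse to a servicing thread. The IRQ number and pulse priority are placeholders, and the hardware-specific service code is omitted. Note that, per Sean’s comment, delivery of this pulse is not tied to the timer tick.

/*
 * Sketch only: pulse-per-interrupt servicing thread.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/neutrino.h>
#include <sys/siginfo.h>

#define MY_IRQ      9      /* placeholder interrupt number */
#define PULSE_CODE  1
#define PULSE_PRIO  63     /* high priority so servicing isn't queued behind normal work */

int main(void)
{
    struct sigevent ev;
    struct _pulse   pulse;

    /* I/O privileges are required before attaching to an interrupt. */
    if (ThreadCtl(_NTO_TCTL_IO, 0) == -1) {
        perror("ThreadCtl");
        return EXIT_FAILURE;
    }

    int chid = ChannelCreate(0);
    int coid = ConnectAttach(0, 0, chid, _NTO_SIDE_CHANNEL, 0);

    /* Deliver a high-priority pulse when the interrupt fires.  With
     * InterruptAttachEvent() the kernel masks the IRQ until we unmask it,
     * so no user ISR code runs at interrupt time. */
    SIGEV_PULSE_INIT(&ev, coid, PULSE_PRIO, PULSE_CODE, 0);
    int id = InterruptAttachEvent(MY_IRQ, &ev, _NTO_INTR_FLAGS_TRK_MSK);
    if (id == -1) {
        perror("InterruptAttachEvent");
        return EXIT_FAILURE;
    }

    for (;;) {
        if (MsgReceivePulse(chid, &pulse, sizeof pulse, NULL) == -1)
            break;
        /* ... service the hardware here ... */
        InterruptUnmask(MY_IRQ, id);
    }
    return EXIT_SUCCESS;
}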




