Mysterious UDP multicast slowdowns.

Hello. My company has been struggling with a problem for months now,
and we have gotten nowhere. We are seeing data transfer times and
latencies much higher than we expected. Although the problem is fairly
simple in concept, it’s hard to describe. I’m including an extensive
description below of exactly what’s happening and what we’ve already
tried.

First, some background. We are developing a driving simulator
application. An STS (student training station) consists of three QNX
nodes and associated support computers (which vary in number, but
perform such operations as controlling the steering feedback motor and
generating the graphics). There is also an IOS (instructor/operator
station), which is a Windows NT (not my fault) machine. A group of 4
STS’s and one IOS are together known as a pod. A 100 mbit ethernet
network connects each of the top-level STS nodes and the IOS machine
with a hub; that is, there is one wire running to each STS and to the
IOS from a hub.

STS’s are capable of running network scenarios, in which the involved
STS’s can fully interact with one another. The STS’s share data using
UDP multicast.
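
The sharing mechanism itself is nothing exotic: each STS joins a
multicast group and sends its state out every frame. A minimal sketch of
that kind of setup follows; the group address, port, and payload size are
made up for illustration and are not our real values.

    /* mcast_sketch.c - join a multicast group and send one datagram.
     * Illustrative only; addresses, port, and sizes are placeholders.
     */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int s;
        struct ip_mreq mreq;
        struct sockaddr_in to;
        char state[1400];              /* one packet's worth of state */

        s = socket(AF_INET, SOCK_DGRAM, 0);

        /* Join the group so we also receive the other stations' data. */
        mreq.imr_multiaddr.s_addr = inet_addr("239.1.1.1");
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

        /* Send one frame's worth of state to the group. */
        memset(&to, 0, sizeof(to));
        to.sin_family = AF_INET;
        to.sin_port = htons(5000);
        to.sin_addr.s_addr = inet_addr("239.1.1.1");
        memset(state, 0, sizeof(state));
        sendto(s, state, sizeof(state), 0, (struct sockaddr *)&to, sizeof(to));

        close(s);
        return 0;
    }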

Each STS generates about 3.5 kbytes of data per 30 Hz frame; this is
roughly 100 kbytes per second per STS. Even with four STS’s involved,
this is still under half a megabyte per second; this is relatively
insignificant compared to the theoretical throughput of a 100 mbit
ethernet. There is a certain amount of other traffic: some data is
going to the IOS via UDP unicast to drive gauge displays (speedometer,
etc), and control data also uses this network. In total, however, this
additional data does not approach the UDP multicast in volume. Assuming
it is an additional quarter megabyte per second, this amounts to about
6% of the available bandwidth.
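
For reference, the arithmetic behind those percentages (rounded, and
using the quarter-megabyte-per-second assumption above for the other
traffic):

    3.5 kbytes/frame * 30 frames/sec        ~= 105 kbytes/sec per STS
    105 kbytes/sec * 4 STS's                ~= 420 kbytes/sec of multicast
    420 kbytes/sec + 250 kbytes/sec other   ~= 670 kbytes/sec total
    100 mbit/sec / 8 bits per byte          ~= 12.5 mbytes/sec of capacity
    670 kbytes/sec / 12.5 mbytes/sec        ~= 5-6% of the wire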

This data is bursty; it tends to all be sent at once, in 30 Hz bursts.
This will lead to collisions, but we’re of the understanding that
collisions alone could not explain the slowdowns.

The hardware we’re using, as well as what we’ve already tried and what
is readily available for additional tests, is listed at the end of this
message.

The actual time needed for transfers varies hugely from one run to
another. For a four-player network game, the network data transfer
takes anywhere from 6 milliseconds to 25 milliseconds; the average seems
to be about 8-12 ms. Our back-of-the-envelope calculations led us to
believe that it should only take roughly 2-4 ms. The time varies
immensely; it is huge after rebooting the computers, then improves after
each run until stabilizing at some value, which varies but is generally
in the 8-12 ms range mentioned above.
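
For what it's worth, the kind of wire-time arithmetic behind an estimate
in that range looks like this:

    4 STS's * 3.5 kbytes             = 14 kbytes per frame
                                    ~= 10 full-size (1400-byte) packets
    14 kbytes / 12.5 mbytes/sec     ~= 1.1 ms of raw wire time

Doubling or tripling the raw wire time to allow for inter-frame gaps,
protocol headers, driver/stack overhead, and a little contention lands
in the 2-4 ms neighborhood.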

This is where it gets weird. Despite the fact that everything is purely
100 mbit, the hub shows very high (~40%) 10 mbit usage. This correlates
to using multicast; if I disable the multicast, the 10 mbit activity
goes away. We have tried replacing the 10/100 hub with a purely 100
mbit hub, and the app continues to work as it does with the 10/100 hub;
the 10 mbit activity is presumably a firmware bug in the hub. 3Com
insists that it couldn’t be. Despite this evidence, it still seems
suspicious to me that the 10 mbit activity (~40%) corresponds so closely
to the predicted 100 mbit activity (~4%).

We have also tried replacing the Corman cards with 3Com cards, which use
a different chip and a different driver. The 3Com cards perform worse
than the Cormans (not surprising, based on previous experience with 3Com
products, but we thought it was worth testing).

Hopefully at least a few people understood that description. :-) In
short, my questions can be summed up as follows:

  1. Is anyone else using UDP multicast extensively, and if so, have you
    noticed any slowdowns?
  2. What else could we look at? We’ve tried everything we could think
    of.

Thanks. Any and all suggestions are appreciated; we’re stumped.

Josh Hamacher and Dean Douthat
FAAC Incorporated



Current Configuration:
Software:
QNX OS 4.25D
Tcpip 5.00X
Net.ct100tx 4.25C

Hardware:
Corman FE-120 Network Cards
All involved computers are roughly 800 MHz Pentium III’s with plenty
of RAM.

Also Tested:
Software:
Net.ct100tx 4.25E

Hardware:
3Com 3C905B-TX-NM

Available for testing:
Software:
QNX OS 4.25E

Hardware:
Corman FE-122 Network Cards

Josh,

A series of questions/checks:

  1. On the Corman cards
  • have you tried Net.tulip? Different revs between Corman/QNX have had
    better/worse results - especially at 100 Mbit
  • are you locking the card to 100 Mbit/full duplex or letting it
    negotiate? Locking is much better.
  • anything in traceinfo?
  • anything in netinfo?
  • what IRQ did they get assigned? High interrupt # or shared?
  • is anything running with long PCI latency timeouts?
  2. 3Com hub (switch???)
  • the references to 10 and 100 lead me to believe it is a switch, and
    that is the basis for the rest of these comments… ???
  • a pure hub should result in worse performance (collisions and random
    backoff)
  • assuming a switch - is it set for “fast forward”, “fragment free”, or
    “store and forward”? This could be where most of your latency is.
    Which is best will take testing, but it will be either “fragment free”
    or “store and forward” if the systems are indeed all bursting at the
    same time.
  • does it show any statistics for errors on any ports?
  • is “spanning tree” or “broadcast control” turned on? Should be no on
    both.
  3. Assorted
  • do you trust all your cables are rated for 100 Mbit - no home
    built/trampled/etc? They can generate errors you may or may not see.
  • have you tried 2 QNX boxes back-to-back (crossover cable) to see what
    you can sustain, then tried it with the hub/switch? (See the sketch
    after this list.)
  • priority issue on QNX?
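
For the back-to-back test, something as crude as the following is
plenty: a sender that blasts UDP as fast as it can and a receiver that
prints bytes per second. The port, packet size, and total lack of error
checking are all just for illustration.

    /* udpblast.c - crude one-way UDP throughput check.
     * Receiver:  udpblast -r              (prints bytes received per second)
     * Sender:    udpblast <receiver-ip>   (sends 1400-byte packets flat out)
     */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <time.h>

    #define PORT 5000
    #define PKT  1400

    int main(int argc, char **argv)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        char buf[PKT];                 /* contents don't matter */
        struct sockaddr_in addr;

        memset(&addr, 0, sizeof(addr));
        memset(buf, 0, sizeof(buf));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(PORT);

        if (argc > 1 && strcmp(argv[1], "-r") == 0) {
            long bytes = 0;
            time_t start;

            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            bind(s, (struct sockaddr *)&addr, sizeof(addr));
            start = time(NULL);
            for (;;) {                 /* count what arrives, report once a second */
                bytes += recv(s, buf, sizeof(buf), 0);
                if (time(NULL) != start) {
                    printf("%ld bytes/sec\n", bytes);
                    bytes = 0;
                    start = time(NULL);
                }
            }
        } else if (argc > 1) {
            addr.sin_addr.s_addr = inet_addr(argv[1]);
            for (;;)                   /* blast until killed */
                sendto(s, buf, sizeof(buf), 0,
                       (struct sockaddr *)&addr, sizeof(addr));
        }
        return 0;
    }

Run it through the crossover cable first, then through the hub/switch,
and compare the steady-state numbers.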

Maybe more thoughts later,
Jay


Darn it’s nice to see you back Jay ;-)


Previously, Josh Hamacher wrote in qdn.public.qnx4:

This data is bursty; it tends to all be sent at once, in 30 Hz bursts.
This will lead to collisions, but we’re of the understanding that
collisions alone could not explain the slowdowns.

Why not?

A few random thoughts. How big are your packets? The smaller
the packets, the worse the overhead and the more collisions you get.

You mention bursty behavior. You also indicate about 10% of the
bandwidth being used. I recall that plain vanilla 10BaseT would
max out at about 40% capacity, with throughput then decreasing as
collision rates grow exponentially. So if you are pushing, say, 2
megabytes per second during 1/10 of each transmission period, you
could see wild amounts of collisions.

In a mixed environment, that is, one where the hub thinks
that you might have some 10BaseT devices on the line, some of the
overhead might need to run at 10BaseT speed. This might be the
hub's or a NIC's fault.

An interesting test would be to hook up a 100Mbit switch and
see what happens to performance.

Mitchell Schoenbrun --------- maschoen@pobox.com

No one here is an expert on ethernet, but we were of the impression that
the backoff time for collisions was very short. Perhaps we’re mistaken,
but that’s why we’re assuming the collisions aren’t the problem.

Our packets are right at the UDP limit; I forget the exact number, but I
think it’s a little under 1500 bytes. I have a special function that
takes the roughly 3.5 kbyte buffer, breaks it down into proper-sized UDP
packets, and transmits them.
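
In rough outline that function looks something like the following (a
simplified sketch, not the production code - the names and the 1400-byte
cap are illustrative, and details like tagging each piece so the
receiver can reassemble the frame are omitted):

    #include <netinet/in.h>
    #include <stddef.h>
    #include <sys/socket.h>

    #define MAX_UDP_PAYLOAD 1400   /* stay comfortably under the ethernet MTU */

    /* Send 'len' bytes from 'buf' as a series of UDP datagrams, each no
     * larger than MAX_UDP_PAYLOAD.  Returns 0 on success, -1 on error.
     */
    int send_frame(int sock, const struct sockaddr_in *to,
                   const char *buf, size_t len)
    {
        size_t off = 0;

        while (off < len) {
            size_t chunk = len - off;

            if (chunk > MAX_UDP_PAYLOAD)
                chunk = MAX_UDP_PAYLOAD;
            if (sendto(sock, buf + off, chunk, 0,
                       (const struct sockaddr *)to, sizeof(*to)) < 0)
                return -1;
            off += chunk;
        }
        return 0;
    }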

We’ve been developing this product for nearly two years. At one point,
we were worried about the performance of the hub (for different reasons,
I don’t remember the details any more) and had purchased a switch. I’ll
poke around and see if we still have it anywhere.

Thanks for the suggestions.

Josh



Hi. Thanks for the checklist; we’re still going over it here, but we’re
also quite busy and so it might take a few days. I’ve filled in any
info I have below.

Josh


Jay Hogg wrote:

Josh,

A series of questions/checks:

  1. On the Corman cards
  • have you tried Net.tulip? Different revs between Corman/QNX have had
    better/worse results - especially at 100 Mbit

I thought I did try this, but I don’t remember for sure. We’ll try it
again.

  • are you locking the card to 100 Mbit/full duplex or letting it
    negotiate? Locking is much better.

The cards are being locked to 100 Mbit. I’m not sure about full duplex.

  • anything in traceinfo?

Nothing except for the tons of garbage that SMBfsys dumps there. As a
side note, is there any way of getting SMBfsys to shut up?

  • anything in netinfo?

I don’t think so. The actual production hardware is in Missouri, which
makes it a little difficult to access from the home office in Michigan.
We do have a test setup here in Michigan. I’m pretty certain netinfo
was clean in Missouri. Here we have some really nasty-looking stuff
(“bad CRC”, “alignment error”, “first descr bit not set”) but that’s
probably a different problem.

  • what IRQ did they get assigned? High interrupt # or shared?
  • is anything running with long PCI latency timeouts?

I don’t know about either of those. I’ll ask our sysadmin, who did the
installs.

  2. 3Com hub (switch???)
  • the references to 10 and 100 lead me to believe it is a switch, and
    that is the basis for the rest of these comments… ???

It’s a 3Com 8-port 10/100 autosensing hub. We have a 3Com 10/100
autosensing switch lying around, but we haven’t tested it. We will,
however.

  • a pure hub should result in worse performance (collisions and random
    backoff)

Doh!

  • assuming a switch - is it set for “fast forward”, “fragment free”, or
    “store and forward”? This could be where most of your latency is.
    Which is best will take testing, but it will be either “fragment free”
    or “store and forward” if the systems are indeed all bursting at the
    same time.
  • does it show any statistics for errors on any ports?
  • is “spanning tree” or “broadcast control” turned on? Should be no on
    both.
  3. Assorted
  • do you trust all your cables are rated for 100 Mbit - no home
    built/trampled/etc? They can generate errors you may or may not see.

No home-made cables - we’ve had too many bad experiences with them. As
for the general quality, I don’t know for sure but it seems relatively
high. Some of them may be trampled (due to construction that is still
going on around the installation), but the results we’ve seen are common
to 8 pods of 4 simulators each, so it’s unlikely to be a cable problem.

  • have you tried 2 QNX boxes back-to-back (crossover cable) to see what
    you can sustain, then tried it with the hub/switch?

No, another good idea.

  • priority issue on QNX?

We’ve done a lot of priority tuning. Our technique is more of the “make
guess, try, see if it improves things” genre, however, so it might not
be optimal. In a nutshell, here’s what we have:

27 - The “executive” component of our system (handles timing,
post-office style interprocess communication, etc).
26 - Proc32
26 - The network manager component of our system (handles all network
traffic).
23 - Net
20 - Net.ct100tx
14 - All other components of our system (vehicle model, sound subsystem,
traffic, etc).
10 - Tcpip

Maybe boosting Tcpip would help? You know, this might be the first time
in a year that we’ve taken a hard look at the priorities. The ones
listed above seemed optimal for running a single system, but were never
really tested in a network scenario.
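
If we do experiment with bumping Tcpip, a trivial helper like the one
below would let us try different values on a running system. This is a
generic POSIX sketch, not necessarily the QNX-preferred way to do it,
and the pid and priority are just command-line placeholders.

    /* bumpprio.c - set another process's scheduling priority.
     * Usage: bumpprio <pid> <priority>
     */
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
        struct sched_param param;
        pid_t pid;

        if (argc != 3) {
            fprintf(stderr, "usage: %s <pid> <priority>\n", argv[0]);
            return 1;
        }
        pid = (pid_t)atoi(argv[1]);
        param.sched_priority = atoi(argv[2]);

        /* Change only the priority; leave the scheduling policy alone. */
        if (sched_setparam(pid, &param) == -1) {
            perror("sched_setparam");
            return 1;
        }
        return 0;
    }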


Previously, Josh Hamacher wrote in qdn.public.qnx4:

No one here is an expert on ethernet, but we were of the impression that
the backoff time for collisions was very short. Perhaps we’re mistaken,
but that’s why we’re assuming the collisions aren’t the problem.

They are short, but with very high traffic the probabilities of a 2nd
and 3rd collision escalate. After a while you are spending more time
backing off than transmitting data.
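
To put rough numbers on it (standard ethernet binary exponential
backoff, where a slot time is 512 bit times, about 5.12 microseconds at
100 mbit): after the nth collision in a row a station waits a random
0..(2^n - 1) slot times, with n capped at 10, and after 16 straight
collisions the frame is dropped.

    1st collision:   avg   0.5 slots  ~=   2.6 us
    3rd collision:   avg   3.5 slots  ~=  17.9 us
    5th collision:   avg  15.5 slots  ~=  79.4 us
    10th collision:  avg 511.5 slots  ~=   2.6 ms

A station that keeps losing the lottery during a burst can burn
milliseconds, which is the same ballpark as the delays you describe.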

Our packets are right at the UDP limit; I forget the exact number, but I
think it’s a little under 1500 bytes. I have a special function that
takes the roughly 3.5 kbyte buffer, breaks it down into proper-sized UDP
packets, and transmits them.

Well that is at least optimal.

Mitchell Schoenbrun --------- maschoen@pobox.com