Qnet errors

Hi everybody.
I’m having some trouble with Qnet.

My system consists of two x86 CPUs (QNX 6.3 SP3). Some processes on CPU#1 periodically access resource managers running on CPU#2 over Qnet. Using “nicinfo” and “sloginfo”, I see on CPU#2 that a few error counters keep growing, while on CPU#1 everything seems to be OK.

nicinfo on CPU#1:

RealTek 8139 Ethernet Controller

Physical Node ID … 00304F 51A19F
Current Physical Node ID … 00304F 51A19F
Current Operation Rate … 100.00 Mb/s full-duplex
Active Interface Type … MII
Active PHY address … 0
Maximum Transmittable data Unit … 1514
Maximum Receivable data Unit … 1514
Hardware Interrupt … 0xb
I/O Aperture … 0xec00 - 0xecff
Memory Aperture … 0xdfffff00 - 0xdfffffff
Promiscuous Mode … Off
Multicast Support … Enabled

Packets Transmitted OK … 343422
Bytes Transmitted OK … 39190221
Memory Allocation Failures on Transmit … 0

Packets Received OK … 321107
Bytes Received OK … 72174514
Memory Allocation Failures on Receive … 0

Single Collisions on Transmit … 0
Transmits aborted (excessive collisions) … 0
Transmit Underruns … 0
No Carrier on Transmit … 0
Receive Alignment errors … 0
Received packets with CRC errors … 0
Packets Dropped on receive … 0

nicinfo on CPU#2:

ns83815 : DP83815 Ethernet Controller

Physical Node ID … 0006D5 1098F3
Current Physical Node ID … 0006D5 1098F3
Current Operation Rate … 100.00 Mb/s full-duplex
Active Interface Type … MII
Active PHY address … 0
Maximum Transmittable data Unit … 1514
Maximum Receivable data Unit … 1514
Hardware Interrupt … 0xb
I/O Aperture … 0x1000 - 0x10ff
Promiscuous Mode … Off
Multicast Support … Enabled

Packets Transmitted OK … 3017967
Bytes Transmitted OK … 727686890

Packets Received OK … 3035888
Bytes Received OK … 289970721

Single Collisions on Transmit … 99
Multiple Collisions on Transmit … 141
Deferred Transmits … 0
Late Collision on Transmit errors … 0
Transmits aborted (excessive collisions) … 2
Transmits aborted (excessive deferrals) … 0
Transmit Underruns … 2
No Carrier on Transmit … 0
Receive Alignment errors … 0
Received packets with CRC errors … 34
Packets Dropped on receive … 0
Ethernet Headers out of range … 0
Oversized Packets received … 0
Short packets … 0
Total Frames experiencing Collison(s) … 240

sloginfo on CPU#1 frequently shows messages like these (I think they are errors):


Feb 28 13:29:56 7 15 0 npm-qnet(L4): l4_rx_first_checks(): bad rxd pkt - hdr len 524 vs tot len 50

Feb 28 13:30:07 7 15 0 npm-qnet(L4): l4_rx_first_checks(): bad rxd pkt - hdr len 524 vs tot len 50

Feb 28 13:30:19 7 15 0 npm-qnet(L4): l4_rx_first_checks(): bad rxd pkt - hdr len 524 vs tot len 50

Feb 28 13:30:23 7 15 0 npm-qnet(L4): l4_rx_first_checks(): bad rxd pkt - hdr len 524 vs tot len 50

Feb 28 13:30:30 7 15 0 npm-qnet(L4): l4_rx_first_checks(): bad rxd pkt - hdr len 524 vs tot len 50

Feb 28 13:30:37 7 15 0 npm-qnet(L4): l4_rx_first_checks(): bad rxd pkt - hdr len 524 vs tot len 50

sloginfo on CPU#2 shows frequent timeouts for nd 12 (nd 12 is CPU#1), as follows:


Feb 28 13:19:40 7 15 0 npm-qnet(L4): l4_tx_timeout(): timeout: nd 12 sc 8 dc 1 ss 1457224 tk 2666529 ct 2666531

Feb 28 13:19:52 7 15 0 npm-qnet(L4): l4_tx_timeout(): timeout: nd 12 sc 8 dc 1 ss 1457269 tk 2666590 ct 2666592

Feb 28 13:19:55 7 15 0 npm-qnet(L4): l4_tx_timeout(): timeout: nd 12 sc 8 dc 1 ss 1457282 tk 2666608 ct 2666610

Feb 28 13:20:03 7 15 0 npm-qnet(L4): l4_tx_timeout(): timeout: nd 12 sc 8 dc 1 ss 1457308 tk 2666644 ct 2666646

Feb 28 13:20:09 7 15 0 npm-qnet(L4): l4_tx_timeout(): timeout: nd 12 sc 8 dc 1 ss 1457333 tk 2666678 ct 2666680

Feb 28 13:20:39 7 15 0 npm-qnet(L4): l4_tx_timeout(): timeout: nd 12 sc 8 dc 1 ss 1457446 tk 2666827 ct 2666829

Feb 28 13:21:00 7 15 0 npm-qnet(L4): l4_tx_timeout(): timeout: nd 12 sc 8 dc 1 ss 1457524 tk 2666930 ct 2666932
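
For anyone who wants to tally these timeout lines rather than read them by eye, here is a minimal Python sketch. It only extracts the labels exactly as they appear in the log (nd, sc, dc, ss, tk, ct); the thread does not say what each field means, so the script makes no claim about that. The function name parse_timeouts and the LOG constant are made up for this example.

```python
import re

# The seven l4_tx_timeout lines quoted above, copied verbatim.
LOG = """\
Feb 28 13:19:40 7 15 0 npm-qnet(L4): l4_tx_timeout(): timeout: nd 12 sc 8 dc 1 ss 1457224 tk 2666529 ct 2666531
Feb 28 13:19:52 7 15 0 npm-qnet(L4): l4_tx_timeout(): timeout: nd 12 sc 8 dc 1 ss 1457269 tk 2666590 ct 2666592
Feb 28 13:19:55 7 15 0 npm-qnet(L4): l4_tx_timeout(): timeout: nd 12 sc 8 dc 1 ss 1457282 tk 2666608 ct 2666610
Feb 28 13:20:03 7 15 0 npm-qnet(L4): l4_tx_timeout(): timeout: nd 12 sc 8 dc 1 ss 1457308 tk 2666644 ct 2666646
Feb 28 13:20:09 7 15 0 npm-qnet(L4): l4_tx_timeout(): timeout: nd 12 sc 8 dc 1 ss 1457333 tk 2666678 ct 2666680
Feb 28 13:20:39 7 15 0 npm-qnet(L4): l4_tx_timeout(): timeout: nd 12 sc 8 dc 1 ss 1457446 tk 2666827 ct 2666829
Feb 28 13:21:00 7 15 0 npm-qnet(L4): l4_tx_timeout(): timeout: nd 12 sc 8 dc 1 ss 1457524 tk 2666930 ct 2666932
"""

# Field names are just the labels printed in the log; their meaning is
# not documented in this thread.
PATTERN = re.compile(
    r"(\d\d):(\d\d):(\d\d) .*l4_tx_timeout\(\): timeout: "
    r"nd (\d+) sc (\d+) dc (\d+) ss (\d+) tk (\d+) ct (\d+)"
)


def parse_timeouts(text):
    """Return one dict per l4_tx_timeout line found in text."""
    events = []
    for line in text.splitlines():
        m = PATTERN.search(line)
        if m is None:
            continue  # not a timeout line
        hh, mm, sec, nd, sc, dc, ss, tk, ct = (int(g) for g in m.groups())
        events.append({
            "second_of_day": hh * 3600 + mm * 60 + sec,
            "nd": nd, "sc": sc, "dc": dc, "ss": ss, "tk": tk, "ct": ct,
        })
    return events


events = parse_timeouts(LOG)

# Seconds between consecutive timeouts (assumes all lines are from one day).
gaps = [b["second_of_day"] - a["second_of_day"]
        for a, b in zip(events, events[1:])]

# Difference between the ct and tk values on each line.
ct_minus_tk = [e["ct"] - e["tk"] for e in events]

# How fast the tk value advances over the whole sample.
elapsed = events[-1]["second_of_day"] - events[0]["second_of_day"]
tk_per_second = (events[-1]["tk"] - events[0]["tk"]) / elapsed

print("timeouts:", len(events))
print("gaps (s):", gaps)
print("ct - tk :", ct_minus_tk)
print("tk/s    :", round(tk_per_second, 2))
```

On the seven lines above, ct minus tk is 2 on every line, and tk advances by 401 over 80 seconds, roughly five per second. That would fit tk and ct being Qnet tick counters, but that is only a guess from the numbers.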

Sometimes the communication between the two CPUs becomes heavy, and CPU#1 is not even able to update the information collected from CPU#2, perhaps because of these errors.

What do the system log messages mean?
What is happening with Qnet?
Any advice on what else to check to track down the problem?

Thanks a lot.
Regards

ogr

What kind of network hardware is that (switch, crossover cable)? From the look of things you have bad hardware somewhere.

I’m using a crossover cable, but the problem remains when I use a LAN switch instead.

Have you tried changing the network card? Maybe computer #2 is bad.

I have not yet tested with another card, but with this card TCP/IP communication works correctly. No errors arise.

And what’s the meaning of the “sloginfo” messages?

Thanks Mario for your replies.

TCP/IP is very, very resilient and will not log problems. Qnet is less resilient, being designed (I think) to run on a reliable network.

The sloginfo seems to indicate a problem with the data. That is why so far everything points to a hardware problem.
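
To illustrate why “hdr len 524 vs tot len 50” looks like damaged data: one plausible reading is that the packet claims a 524-byte header while only 50 bytes arrived in total, which cannot both be true. The Python sketch below shows that kind of length sanity check. It is NOT the npm-qnet source; the function name first_checks and its logic are assumptions made up for this example.

```python
def first_checks(hdr_len, tot_len):
    """Hypothetical receive-side sanity check (not the real npm-qnet code).

    Returns a log-style message if the header length claimed by the packet
    is larger than the total number of bytes received, else None.
    """
    if hdr_len > tot_len:
        # A header cannot be longer than the packet that contains it, so
        # one of the two lengths (or the packet contents) must be wrong.
        return f"bad rxd pkt - hdr len {hdr_len} vs tot len {tot_len}"
    return None


# The values from the log above fail the check ...
print(first_checks(524, 50))
# ... while a consistent pair passes (returns None).
print(first_checks(20, 50))
```

A check like this only tells you the packet was inconsistent when it reached Qnet. It does not say whether the cable, the card, or the driver damaged it.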

That being said, 6.3 is VERY old and there could be bugs in there that have been fixed since then. Can you try 6.5 on the same hardware just to check?

I can’t test with a full 6.5 installation because this hardware is currently running in the plant.
What if I only copy the Qnet libraries from 6.5 onto 6.3?
Will it run?
Are there any other libraries I would need to update to make this test run?

Thanks

ogr

No, 6.5 isn’t compatible, but I think 6.3.2 is.

Hi ogr,

We’ve got the same problem with Qnet and 6.4.1 on PPC.
Despite the log message, this is probably not related to a
hardware issue. There are several points to know:

  • You’d better use a good Ethernet controller (e.g. Intel). It can
    really make a difference.

  • Even with DMA, on a heavily loaded system with high
    scheduling latency, the Ethernet driver may run short of buffer
    descriptors. Add the option “receive=2048,transmit=2048” to
    your driver.

  • You can also change some Qnet parameters (especially
    the number of qnet_ticks/second and the Qnet priority).

  • Remember that a high-priority process may hog the CPU for
    much too long (QNX is realtime, not “fair multitasking” like Linux).
    Use “tracelogger” to see what’s going on in your system.
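
To put the descriptor advice in numbers, here is a rough Python estimate of how long a receive ring can absorb back-to-back frames while the driver gets no CPU time. It assumes one descriptor per received frame and a 100 Mb/s link (as in the nicinfo output earlier in the thread), and uses the standard Ethernet per-frame overhead of 24 bytes on the wire (8 preamble/SFD + 4 FCS + 12 inter-frame gap) on top of the 1514-byte maximum that nicinfo reports. The ring size of 64 is a hypothetical comparison value, not a known driver default.

```python
LINK_BPS = 100_000_000     # 100 Mb/s, as reported by nicinfo in this thread
WIRE_OVERHEAD_BYTES = 24   # preamble/SFD (8) + FCS (4) + inter-frame gap (12)


def ring_cover_ms(descriptors, frame_bytes, link_bps=LINK_BPS):
    """Milliseconds of line-rate traffic that fit in `descriptors` buffers.

    frame_bytes is the frame size without FCS (1514 max, 60 min).
    Assumes each received frame uses exactly one descriptor.
    """
    wire_bits = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    frames_per_second = link_bps / wire_bits
    return 1000.0 * descriptors / frames_per_second


# 64 is a hypothetical small ring for comparison; 2048 is the suggested value.
for ring in (64, 2048):
    for size in (60, 1514):
        ms = ring_cover_ms(ring, size)
        print(f"{ring:5d} descriptors, {size:4d}-byte frames: {ms:8.2f} ms")
```

Under these assumptions, 2048 descriptors cover about 252 ms of full-size frames at line rate, or about 14 ms of minimum-size frames. A larger ring only buys time; if a high-priority process hogs the CPU for longer than that, frames will still be lost.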

Don’t expect QNX’s support desk to help you with this issue.
They are really clueless.

Regards
Emmanuel