Sporadic loss of Network Communication with QNX 4.25

mhborovay · October 25, 2018, 10:25pm

We are using QNX 4.25 with USB support.

The PC we are using is a single board computer, the Advantech PCA-6012 Atom.

We are using LAN 2 which uses the Intel 82583V controller.

The Ethernet driver is Net.e1000.

We have found that when communicating with our instrument, we sporadically lose communication.

Have others seen issues with the 82583V controller and the Net.e1000 driver?

Thanks for any help.

Tim · October 26, 2018, 4:36pm

How long do you lose communication for (less than a second, seconds, minutes)? During that time can you successfully ping the machine?
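To put a number on "how long", one option is to ping the QNX box once a second from another machine, log each result, and then work out the outage lengths from the log. Here is a rough Python sketch of that last step. The sample data and the function name `outage_durations` are made up for illustration, and collecting the ping results is left out.

```python
def outage_durations(samples):
    """Return the length of each outage, in seconds.

    samples: iterable of (timestamp_seconds, reachable) pairs in time order,
    for example one entry per ping attempt.
    """
    outages = []
    down_since = None
    for timestamp, reachable in samples:
        if not reachable and down_since is None:
            # First failed probe marks the start of an outage.
            down_since = timestamp
        elif reachable and down_since is not None:
            # First successful probe after a failure ends the outage.
            outages.append(timestamp - down_since)
            down_since = None
    return outages


# Hypothetical probe log: one short dropout and one lasting about a minute.
samples = [(0, True), (1, False), (2, False), (3, True),
           (4, True), (5, False), (65, True)]
durations = outage_durations(samples)
print(durations)
```

An outage that is still in progress at the end of the log is not counted, so let the log run until the link has come back.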

What does the output of ‘nicinfo’ say? You want to check that to see if there are any hardware errors.

Next thing you’ll want to confirm is that you have enough spare CPU to communicate. Are you certain that you aren’t using 100% of the CPU? Are you also certain your application doesn’t have an error in responding (presumably your app responding is what’s not happening as opposed to telnet/ping not responding)?

Tim

maschoen · October 26, 2018, 9:14pm

Everything Tim mentioned.

Also, take a close look at any cables, hubs, routers etc. Is there any other traffic on the network that could interfere with your communications?

Now just for good measure, a story about ARP. Hopefully this has nothing to do with your problem, but just in case.

I do work for the US Post Office. We have a QNX 6 system that has two Ethernet ports, one for a local network (192.168..) and one for an external intranet using real IP addresses.
My contact who works in Virginia was unexpectedly in town, San Francisco. After a couple of days he asked me to come down to the main package sorting facility to help out. Getting through security wasn’t too hard, but was done quite carefully. I’m not an employee and have no clearance of any kind.

The place was massive. It was probably a 10 minute walk through a huge building to where the problem was. It sounded a lot like yours. Occasionally the network connection between a scanner and our system on the local network would fail. After either a reboot or a few minutes it would come back. My contact had spent 2 days scratching his head on this. He had found out that when the network failed, there was some strange activity in the ARP table.

I digress. If you have two nodes on a TCP/IP Ethernet network, they don’t address each other by passing packets with the IP as the destination. They use the MAC addresses. In the normal (not promiscuous) mode the hardware ignores packets addressed to another MAC address. So how do the nodes know what MAC address to send to? That’s where the ARP protocol comes in. A new node on the system will broadcast itself and its MAC address using an ARP packet. If it wants to send a packet to a local IP, it broadcasts an ARP request to find out that MAC address. That’s about the extent of what I know.
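For anyone curious what one of those ARP requests looks like on the wire, here is a small Python sketch that builds the frame by hand, following the field layout in RFC 826. The MAC and IP addresses are invented for the example, and the helper names are mine. It only builds the bytes and does not send anything.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6   # Ethernet broadcast: every node on the segment receives it
ETHERTYPE_ARP = 0x0806
ETHERTYPE_IPV4 = 0x0800


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_arp_request(sender_mac, sender_ip, target_ip):
    """Build an Ethernet frame carrying an ARP 'who has target_ip?' request."""
    src = mac_to_bytes(sender_mac)
    eth_header = BROADCAST_MAC + src + struct.pack("!H", ETHERTYPE_ARP)
    arp_body = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                            # hardware type: Ethernet
        ETHERTYPE_IPV4,               # protocol type: IPv4
        6,                            # hardware address length
        4,                            # protocol address length
        1,                            # operation: 1 = request, 2 = reply
        src,                          # sender MAC
        socket.inet_aton(sender_ip),  # sender IP
        b"\x00" * 6,                  # target MAC: unknown, that is the question
        socket.inet_aton(target_ip),  # target IP being asked about
    )
    return eth_header + arp_body


# Hypothetical addresses, for illustration only.
frame = build_arp_request("02:00:00:00:00:01", "192.168.0.10", "192.168.0.20")
print(len(frame), frame.hex())
```

Note that the request carries the sender's own IP and MAC, so a node that hears the broadcast can learn or update that mapping without asking.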

So back to the post office. I observed the system working properly and then all of a sudden stopping. I spent about an hour understanding what the problem looked like and what happened to the ARP table. Then someone mentioned to me that the other system had the same problem. Other system? Yes there were two of these systems. I immediately figured out what was going on.

I had them lead me to the other system. I told them that there was going to be a cable connecting the internal local area network hub and the external intranet hub. And that’s what we found. After disconnecting the cable, the problem went away.

So here’s what was happening. Both systems’ local 192.168.. networks used the same IPs. A machine on the 2nd system did an ARP broadcast that went across this cable over the intranet to the first QNX machine, screwing up its ARP table. After that, attempts to communicate with the scanner went to the wrong MAC address.
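The failure can be shown with a toy model of an ARP table in Python. This is only a sketch of the idea, not how the QNX stack implements it, and the addresses and the `ArpCache` class are invented for the example.

```python
class ArpCache:
    """Toy ARP table: maps an IP address to the last MAC address announced for it."""

    def __init__(self):
        self.table = {}

    def learn(self, ip, mac):
        """Record an announcement. Returns True if it replaced a different MAC."""
        previous = self.table.get(ip)
        self.table[ip] = mac
        return previous is not None and previous != mac

    def resolve(self, ip):
        """Return the MAC address that frames for this IP would be sent to."""
        return self.table.get(ip)


# Hypothetical addresses: both systems reuse the same local IP for a scanner.
SCANNER_IP = "192.168.0.20"
LOCAL_SCANNER_MAC = "02:00:00:00:00:aa"   # the scanner on system 1
OTHER_SYSTEM_MAC = "02:00:00:00:00:bb"    # the machine with that IP on system 2

cache = ArpCache()
cache.learn(SCANNER_IP, LOCAL_SCANNER_MAC)

# The stray cable lets an ARP broadcast from system 2 reach system 1.
changed = cache.learn(SCANNER_IP, OTHER_SYSTEM_MAC)

print(changed, cache.resolve(SCANNER_IP))
```

After the second `learn`, traffic for the scanner's IP resolves to a MAC on the other system. If you suspect something similar, compare the MAC in your ARP table during a failure against the instrument's real MAC.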

I hope this is useful.