Problem with onboard Ethernet NIC, QNX 6.2.1

kbeaucha · June 6, 2006, 6:44pm

Hi all:

I’m evaluating some new hardware platforms to use for our QNX application (we’re an end user) and have an AMD Geode NX-based iBase MB740 motherboard that I’m looking at. The application we’re running is currently based on QNX v6.2.1, so that’s what I’ve loaded on the motherboard.

The system comes up and I have limited network connectivity, but simple things like actually transferring a file with ftp or opening a telnet session seem to take the system offline: I seem to lose network response.

The chipset on the MB740 is SiS’ 741CX + 964. The docs say that the Ethernet is SiS 900 compatible, so I’m assuming that it’s devn-sis9.so that is giving me what network access I have.

So far I haven’t been able to get any additional information from our developer, and we don’t have a developer’s seat of our own, so I haven’t contacted QNX.

Does anyone know if there is a copy of devn-sis9.so that has a file creation date newer than May 02/2002? I’m not sure if that is the actual date, but it’s what’s on my system. Is there a better way to find out what version of this device driver is running?

Thanks in advance.

kmb

Tim · June 7, 2006, 2:31pm

KMB,

On my QNX 6.3.0 system I get the following:

devn-sis9.so 41512 April 2004

So there is definitely a much newer version out there than the one you have.

I’m not sure whether this version shipped only with QNX 6.3.0 or also with 6.2.1, or whether you could even use the 6.3.0 version (assuming that’s what it is) under 6.2.1.

When you say you lose network response, do you mean it goes away and doesn’t come back without a reboot or a restart of networking, or does everything just slow down and then return to normal in a few seconds? When the problem occurs, can you ping a node on the network from a different QNX console?

Tim

kbeaucha · June 7, 2006, 8:09pm

Are there any problems running drivers released with 6.3.0 under 6.2.1?

I can always get it running again with a restart.

If I go to another machine on the same network and start a ping against the new MB740 in one window, the MB740 answers the pings fine. If I then open another window to start an ftp session, I can connect, login, change directory, etc., etc…

It is only when I try to actually transfer a file that the ftp session locks up: no errors, no response.

At the same time, the MB740 stops responding to the pings in the other window.

Eventually the session times out and I get:

“421 Service not available, remote server timed out. Connection closed.”

The system does not resume answering the pings, and won’t allow me to connect with ftp or telnet. I have to reboot to get it to come back (I assume that slaying and restarting io-net would do the same, but I’m more familiar with QNX 4 than 6, and so I don’t know how to restart io-net manually and have it enumerate the devices as it does at boot).

kmb

Tim · June 8, 2006, 2:37pm

KMB,

The real answer is “it depends”. Meaning it depends on which 6.3.0 drivers you want to run under 6.2.1.

Some 100% won’t work. Others work just fine. In this case I have no idea if the one I have will work under 6.2.1. You’re more than welcome to try it, as I am attaching it to this post. But unless someone from QNX officially says it works with your version of io-net, you’re completely on your own.

Interesting. It seems to fail only under heavy load (file transfer). I wonder if it would also fail if you edited a file in a telnet session and saved it (forcing it to re-send). My guess is yes.

I wonder if there are some shared resources (interrupts, that kind of thing) on the MB with the built-in Ethernet that are causing problems. You might take a look at the MB BIOS or manual and see if it mentions this.

You can get help on io-net by typing ‘use io-net’ at the command line. There is also much more documentation in the helpviewer (just enter io-net in the search area) including options for devn-sis9.so in addition to io-net.

You can see what options io-net started with on your system by typing in ‘pidin A’ from the command line and piping it to less or grepping the output for io-net. On my system for example it’s a very simple start:

io-net -ptcpip

which was the default that QNX installed. This starts a full TCP/IP stack. Then io-net figures out which type of card you have and starts the appropriate driver (in your case devn-sis9.so). You can change all this and add options to io-net and devn-sis9.so. Normally you shouldn’t have to, though, unless there is something unusual in the hardware.

My first guess would be that you need a new driver (the one I am attaching), and my second guess would be something unusual on the MB (check the manual/BIOS).

One more thing. Can you run ‘sloginfo’ from the command line BEFORE you lose the connection and then again afterwards, and see if any new entries have been added?
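
The before/after comparison can be scripted so that only the new log entries are shown. The following is a minimal sketch, not a tested QNX recipe: it assumes `diff` and `sed` are installed on the target, and the function name `new_log_entries` and the snapshot paths are hypothetical.

```shell
# Print the lines that appear in the second snapshot but not in the first.
# diff marks lines present only in the second file with "> ".
new_log_entries() {
    diff "$1" "$2" | sed -n 's/^> //p'
}

# Assumed usage on the QNX target (paths are placeholders):
#   sloginfo > /tmp/slog.before
#   ... reproduce the failure, e.g. start the ftp transfer ...
#   sloginfo > /tmp/slog.after
#   new_log_entries /tmp/slog.before /tmp/slog.after
```

Because the network is down after the failure, the snapshots have to be taken on the local console of the affected machine rather than over telnet.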

Tim

kbeaucha · June 13, 2006, 6:35pm

Hi Tim:

Thanks for all your help so far.

The output I got from “pidin A | grep io-net” was the same as yours.

I also tried the new driver. I dumped it into /x86/lib/dll (whereupon it magically appeared in /lib/dll. What’s up with that? ls -l doesn’t indicate these are links.)

After rebooting with the new driver I noticed that /dev/en0 did not get initialized, while it did with the old driver.

I manually slay’ed io-net and restarted it with:

io-net -dsis9 -ptcpip

which gave me an en0 with a MAC address. I used ifconfig to “up” the interface and try to give it an IP address and netmask, but I did not manage to get on the net.

Interestingly, the post-netchoke sloginfo included several messages of the format:

Date/timestamp 5 10 0 Devn-sis9: MORE bit not clear.

Does that ring a bell for anyone?

kmb

mario · June 13, 2006, 8:49pm

It’s /x86 that is a link to /
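
This can be checked from the shell. `ls -l` on the files inside a directory shows them as regular files even when the directory itself is reached through a link; `ls -ld` on the directory name inspects the entry itself. A minimal sketch, assuming /x86 is presented as an ordinary symbolic link on the target (the helper name `describe_path` is hypothetical):

```shell
# Report whether a path is itself a symbolic link
# (as opposed to the files found underneath it).
describe_path() {
    if [ -L "$1" ]; then
        echo "$1 is a symbolic link"
    else
        echo "$1 is not a symbolic link"
    fi
}

# Assumed usage on the target:
#   ls -ld /x86
#   describe_path /x86
```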

Tim · June 14, 2006, 5:40pm

KMB,

Unfortunately it looks like the devn-sis9.so driver that comes with QNX 6.3 is not compatible with 6.2.1.

Here you’re talking about seeing these messages with the devn-sis9.so driver that shipped with 6.2.1, correct? That is, not the one I uploaded. I just want to make sure.

From reading the message it certainly looks like there is some bit that’s not being cleared correctly. Probably a bit that tells the driver there is more information waiting to be grabbed from the hardware. So the driver is probably requesting info that’s not there then logging the error and never going back to the hardware to get new incoming packets.

Can you start the driver again in more verbose mode (4 is the highest):

io-net -dsis9 -verbose=4 -ptcpip

This should put more messages in the sloginfo log. Maybe one of those will be more helpful. There may be some other option that needs to get passed to the driver to make it work.

You might also want to specify the duplex and speed manually to the driver. I’ve found that many switches/routers don’t play very nicely in terms of identifying themselves to QNX, so QNX defaults to a 10 Mbit connection instead of 100 Mbit (especially when a switch/router reports that it is 1 Gbit, which QNX is not expecting). I can’t see this being the issue, though, since it would normally just cause very slow network access.
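
As a rough illustration of forcing the link settings, the sketch below only builds and prints a candidate io-net command line so it can be reviewed before use. It assumes the driver accepts `speed=` and `duplex=` options in a comma-separated list after the driver name; that is an assumption to confirm with ‘use devn-sis9.so’ on the target, and the function name `ionet_cmd` is hypothetical.

```shell
# Build (but do not run) an io-net command line with explicit link settings.
# $1 = speed in Mbit (e.g. 10 or 100), $2 = duplex (assumed: 0 = half, 1 = full)
ionet_cmd() {
    speed=$1
    duplex=$2
    echo "io-net -dsis9 speed=${speed},duplex=${duplex} -ptcpip"
}

# Print the command for 100 Mbit, full duplex.
ionet_cmd 100 1
```

io-net would need to be slain first before starting it again with the printed command line.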

One other thing, which I am sure you have done, but I want to be sure: you did disable ‘plug-n-play’ O/S in the BIOS on this MB, right?

Tim

P.S. Any chance of you downloading and testing QNX 6.3 on this MB? I know you’re using 6.2.1 for your application. But if you can prove 6.3 works (or doesn’t), it will go a long way toward understanding why 6.2.1 fails and whether you’re wasting your time trying to get 6.2.1 working.

kbeaucha · June 15, 2006, 1:33pm

I downloaded the 6.3.0 evaluation image and the SP2 patch and installed them on the MB740.

No problems: not with the installation (on a 1GHz machine), not with the Ethernet connectivity or stability, not (so far) with the main application we need to run.

Now I just have to make the case to management to spend the money having the application certified for 6.3.0 SP2 (or 6.3.2?) and upgrading our half-dozen 6.2.1 licenses.

I also have a second test PC: a rackmount Kontron chassis with an Intel MB that I couldn’t even begin to get on the network under 6.2.1: no driver for the integrated Pro/1000 chipset existed until 6.3.0, so loading 6.3.0 on this unit and running the same tests is next.

Thanks for all your help.

kmb

Tim · June 15, 2006, 3:20pm

KMB,

I’m glad it worked with 6.3 so that you at least know you can get a working system.

Your other option besides re-certification (depending on that cost) is to contact QNX (especially if you have tech support) and ask what it would take to get just the driver you need under 6.3 to run under 6.2.1. Of course that change alone may force a re-cert since you’ve changed the default 6.2.1 install.

Tim

kbeaucha · June 26, 2006, 2:47pm

Another followup.

We put the test chassis into a live site, where it ran for two days. Then on Saturday I got a call from our operations centre saying the node had gone offline. When I got to the site, the PC was still talking with its I/O, but Ethernet communication was not working.

sloginfo was spitting out triples that looked something like this (if I’m reading my scratchy notes correctly):

date/timestamp 7 15 0 npm-qnet(L4): en_ionet_tx_pkt(): lon->tx->tx_down() returned -1 errno 255
date/timestamp 7 15 0 l4_tx_pkt(): dp->tx_pkt() failed for L4 0
date/timestamp 7 15 0 l4_tx_timeout(): l4_tx_pkt() raw failed

When I tried to ping this node from another PC, it did not respond. When I tried to ping from the failed node, I got:

ping - “no buffer space available”

I’m not sure what to make of this. In an earlier post disabling PnP OS support was mentioned. I didn’t do it then, but went into the BIOS to check it today. The BIOS is a Phoenix/Award, and while on other motherboards I have seen one setting for “PnP Aware Operating System” that lets you enable or disable this feature, this BIOS does not.

There is a PnP/PCI menu though, and it has a few options:

Reset configuration data (disabled)
Configuration controlled by (Auto(ESCD))
PCI/VGA Palette snoop (disabled)

What happens if the PnP OS support is NOT disabled?

thanks
kmb

Tim · June 26, 2006, 7:13pm

KMB

Are you running Qnet? These sloginfo entries would seem to indicate you are (npm-qnet). Unless you need Qnet, you probably shouldn’t be running it.

You normally have to explicitly start Qnet for it to be running which is why I ask.

> When I tried to ping this node from another PC, it did not respond. When I tried to ping from the failed node, I got:

Interesting. I have never seen this. But my educated guess is that the internal TCP/IP stack (32K) is full and can’t seem to empty itself to either send or receive or both.

My next question (after the Qnet one) is whether you are running any apps that talk across the network to other machines (I assume this is a YES, otherwise you would not have noted that the machine dropped off the network). In that case I’d ask if those apps do any error checking on sending/receiving of data and if they reported anything (I assume you’re using sockets of some kind).

Again, my guess is that one of your apps is either sending or receiving (or both) data incorrectly from the TCP/IP layer. Maybe something as simple as a high-priority thread/task consuming all the CPU so the IP layer isn’t running, or not closing some socket connection properly, etc. But it sure seems to me that you’re gradually using up the TCP/IP stack buffer until there’s none left.

I’d suggest you’re going to have to find it by sending and receiving a large amount of test data to/from your app, unless you post the code and it’s an obvious problem.
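
Since the failure took two days to show up, it may help to capture the system log at the moment the network stops responding. The following is a minimal sketch under stated assumptions: the function name `log_on_failure`, the log path, the peer address and the polling interval are all hypothetical, and it assumes `ping -c` is available on the target.

```shell
# Run a probe command; if it fails, append a timestamped marker and,
# where available, the current sloginfo output to a log file.
log_on_failure() {
    logfile=$1
    shift
    if "$@" >/dev/null 2>&1; then
        return 0
    fi
    echo "$(date) probe failed: $*" >> "$logfile"
    # sloginfo exists only on the QNX target; skip it elsewhere.
    if command -v sloginfo >/dev/null 2>&1; then
        sloginfo >> "$logfile"
    fi
    return 1
}

# Assumed usage on the target (address, path and interval are placeholders):
#   while :; do
#       log_on_failure /tmp/netwatch.log ping -c 1 192.168.1.1
#       sleep 60
#   done
```

Writing to a local file rather than over the network matters here, because the network is exactly what is expected to fail.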

If PnP O/S is not disabled, QNX can have trouble detecting hardware correctly (Ethernet, hard drives, etc.). From memory (so someone may want to correct me), QNX relies on what the BIOS tells it in terms of hard drives and hardware (interrupts used, I/O addresses cards reside at, etc.). If PnP O/S is enabled, then sometimes the BIOS doesn’t report all this correctly and so QNX won’t auto-detect hardware (hence manually starting drivers, where you specify known locations for hardware). The fact that you were able to find your Ethernet card and start it suggests this isn’t anything you need to worry about.

Tim

kbeaucha · June 26, 2006, 10:23pm

I have enabled Qnet. We used FLEET for support and troubleshooting running the earlier version of this application on QNX 4, so I wanted to have that same ability on 6, even though the application is all IP and doesn’t use it directly.

Tim wrote:

> Interesting. I have never seen this. But my educated guess is that the internal TCP/IP stack (32K) is full and can’t seem to empty itself to either send or receive or both.
>
> My next question (after the Qnet one) is are you running any apps that talk across the network to other machines (I assume this is a YES otherwise you would not have noted the machine dropped off the network). In that case I’d ask if those apps do any error checking on sending/receiving of data and if they reported anything (I assume you’re using sockets of some kind).
>
> Again, my guess is one of your apps is either sending or receiving (or both) data incorrectly from the TCP/IP layer. Maybe something as simple as a high priority thread/task consuming all the CPU so the IP layer isn’t running or not closing some socket connection properly etc. But it sure seems to me that you’re gradually using up the TCP/IP stack buffer until there’s none left.
>
> I’d suggest you’re going to have to find it by sending and receiving a large amount of test data to/from your app unless you post the code and it’s an obvious problem.

The PCs running QNX talk to a front-end database server running Unix all over a private Ethernet.

I checked the app’s log files and there is an unusual log there repeated over and over, but it started a full day before the PC went offline. Could this have slowly strangled the system? I’ll run it past the developer and see what they think.

Probably true. I’ve sent a note to the supplier we bought the motherboard from to see what it takes to disable this “just in case”. I’ve also set “Configuration controlled by:” to “manual”.

Thanks Tim.

kmb