QNET/Driver Problems

Hi again,

There are special situations where something happens between two nodes communicating over QNET. When it happens (I don't know yet what it actually is), one node, let's say node A, can no longer talk to node B over qnet (e.g., pidin -n B or ls /net/B/bin hang).

From other nodes there's no problem reaching B, and there are no problems from node B to node A or to any other node on the net.

What I saw is that when I run, for example, pidin -n B, a procnto thread is created on B and it REPLY-blocks on io-net.

What's more, if I restart node A the problem persists (this could be natural); the only thing that resolves the situation is unmounting the network driver that qnet runs over (speedo) on node B and remounting it.
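For reference, the recovery sequence on node B is roughly this (just a sketch of what I do; the en0 instance name and the driver path are from my setup and may differ on yours):

    # unmount the driver instance that qnet runs over
    umount /dev/io-net/en0

    # remount the speedo driver under io-net
    mount -T io-net /lib/dll/devn-speedo.so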

If I run pidin on node B there is no problem, but if I run pidin on the file descriptors, it stops when it reaches io-net.
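That is, something like this on node B:

    # a plain process listing works fine
    pidin

    # but listing file descriptors stalls when it reaches io-net
    pidin fds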

It seems like io-net loses some fds, but I don't know why, or how to solve it.

It could be the driver version, I suppose, but as I remember it happened with both speedo and rtl.

Any suggestions?

Thanks

Juan

Are you using the latest 6.3?

Hi… I'm here again… I've been working hard these days…

Oh yes, I think it's the latest version.

qconfig

QNX Installations

Installation Name: QNX Momentics 6.3.0 Service Pack 3
Version: 6.3.0 SP3
Base Directory: /usr/qnx630/
QNX_HOST: /usr/qnx630/host/qnx6/x86/
QNX_TARGET: /usr/qnx630/target/qnx6/

Additional Packages

  Package Name: QNX Neutrino Core OS
       Version: 6.3.2
          Base: QNX Momentics 6.3.0 SP1, SP2 or SP3
  Install Path: /usr/qnx630

Any ideas? Thanks for your answer… and sorry for the idle time (blocked on WORK, actually :-) )

Juan

Two suggestions:

  1. Take a look at /proc/qnetstats (cat it) and the slog (sloginfo) to see if there are any errors reported.
  2. Turn on the CRC-checking option (do_crc=1); see the docs for npm-qnet-l4_lite.so, and the sketch below.

When qnet behaves like this it is usually bad networking hardware. If everything works OK with do_crc=1, then throw out your network cards and get good ones…
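If you load qnet manually with mount, that would look something like this (a sketch only; adjust the module path and options to however your startup scripts actually load npm-qnet):

    # 1. look for reported errors
    cat /proc/qnetstats
    sloginfo

    # 2. reload qnet with CRC checking enabled
    mount -T io-net -o do_crc=1 /lib/dll/npm-qnet-l4_lite.so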

rgallen: Thanks for your answer!

Look, I've already tried suggestion number 1. Attached are the reports from when it fails. Node B in my previous description is node 'ppc2' (the one that may have the problem); 'dev1' is another node in the network that blocks when I run 'pidin -n ppc2', for example.

There are a lot of errors logged, but I still can't draw a conclusion.

In addition, here is the pidin output from the target node (B/ppc2):

   1   4 procnto             10r REPLY       77837           
   1   5 procnto             10r REPLY       77837           
   1   6 procnto             10r REPLY       77837           
   1   7 procnto             10r REPLY       77837           
   1   8 procnto             10r REPLY       77837           
   1   9 procnto             10r REPLY       77837           
   1  10 procnto             10r REPLY       77837           
   1  11 procnto             10r REPLY       77837           
   1  12 procnto             10r RECEIVE     1               
   1  13 procnto             10r RECEIVE     1               
   1  14 procnto             10r RECEIVE     1               
   1  15 procnto             10r RECEIVE     1               
   1  16 procnto             10r RECEIVE     1               
   1  17 procnto             10r REPLY       77837           
   1  18 procnto             10r REPLY       77837           
   1  19 procnto             10r RUNNING                     
   1  39 procnto             10r REPLY       77837           
   2   1 sbin/tinit          10o REPLY       1               
4099   1 proc/boot/pci-bios  10o RECEIVE     1               
4100   1 proc/boot/slogger   21o RECEIVE     1               

20491   1 sbin/mqueue         10o RECEIVE     1
77837   1 sbin/io-net         10o SIGWAITINFO
77837   2 sbin/io-net         13o RECEIVE     5
77837   4 sbin/io-net         10o RECEIVE     1
77837   5 sbin/io-net         10o RECEIVE     20
77837   6 sbin/io-net         20o RECEIVE     23
77837   7 sbin/io-net         10o RECEIVE     1
77837   8 sbin/io-net         21o RECEIVE     27
77837   9 sbin/io-net         21o RECEIVE     28
77837  10 sbin/io-net         10o RECEIVE     1
77837  11 sbin/io-net         10o RECEIVE     1
As I mentioned, there is one procnto thread REPLY-blocked on an io-net thread for each application on the other node (dev1) that tries to talk to node ppc2 over qnet.

With respect to the CRC-checking option, I think it's very interesting and I'll test it. The question here is performance, because there is real-time data being broadcast over the network to the clients.

Another thing that makes this harder is that it's a very infrequent event. It happens, let's say, once a month on a live 12-node QNX system.

Thanks in advance!

Regards,
Juan

Juan, I looked at your qnet stats. It definitely looks like data corruption (you are getting invalid scoids).

It appears that occasionally the NIC reports a frame as good (there are no reported bad frames), yet the frame is actually either corrupt when it is received, or it is corrupted as it is transferred to memory.

Turning "do_crc=1" on should fix this, but it indicates that you have defective hardware (the NIC is supposed to CRC the frame itself, and with correctly functioning hardware there is no possibility of the frame being corrupted as it is transferred to memory).

That said, the above assumes the driver is error-free. There could be a bug in the driver that corrupts the packet after it is received. What driver are you using?

Hi rgallen!

Look, I'm using the 'speedo' driver on a traditional (at least for me) 0x8086-0x1229 Intel NIC (an onboard card). We've been using this card for years (even under QNX 4.25) with no problems at all. This card appears as supported hardware in the QNX 6 database.

I don't know how to see the specific version of the speedo driver. The only thing I can tell you is that it's the one installed from the QNX 6.3.0 SP3 + Core patch 6.3.2 CD (you can see my qconfig output).
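Maybe dumping the build info embedded in the binary would show it? Something like this, assuming the driver lives in /lib/dll:

    # print the usage/build info embedded in the driver binary,
    # which should include its version and build date
    use -i /lib/dll/devn-speedo.so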

There is a newer version of speedo (Nov 2007) that I haven't tested yet. Do you think this new driver could be a solution for this problem?

Thanks in advance

Regards,

Juan

So does “do_crc=1” solve the problem then?

Well, it's not so easy. As I told you, the problem is very, very sporadic (actually it hasn't happened again since a couple of weeks ago…) and I have to wait for a propitious moment to stop the whole system. But it's still on my mind!

The other possibility is the newer 'speedo'… Maybe in the next few weeks I can test all these things… and I'll let you know.

Thank you again!

Regards,
Juan