QNET/Driver Problems

Hi again,

There are special situations where something happens between two nodes running QNET. When this thing happens (I don't know yet what it actually is), one node, let's say node A, can no longer talk to node B over qnet (e.g.: pidin -n B, ls /net/B/bin).

From other nodes there's no problem reaching B, and there are no problems from node B to node A or to any other node on the net.

What I saw is that when I do, for example, a pidin -n B, a thread of procnto is created and it blocks against io-net.

Even worse, if I restart node A the problem persists (this could be natural); the only thing that resolves the situation is doing an umount on node B of the network driver working under qnet (speedo.so) and remounting it.
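For reference, this is roughly what I do on node B (a sketch; I'm assuming the interface is en0 and the driver is the shipped devn-speedo.so, so adjust the names for your own setup):

```shell
# On node B: unmount the network driver from io-net (en0 assumed)
umount /dev/io-net/en0

# Remount the speedo driver into io-net
mount -T io-net devn-speedo.so
```

After the remount, qnet traffic from node A to node B starts working again.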

If I run a pidin on node B there is no problem, but if I do a pidin of file descriptors, it stops on io-net.

It seems like io-net loses some fd, but I don't know why, or how to solve it.

It could be the version of the driver, I suppose, but as I remember it happened with both the speedo and rtl drivers.

Any suggestions?



Are you using the latest 6.3?

Hi… I’m here again… I’ve been working hard these days…

Oh, yes, I think it is the latest version.


QNX Installations

Installation Name: QNX Momentics 6.3.0 Service Pack 3
Version: 6.3.0 SP3
Base Directory: /usr/qnx630/
QNX_HOST: /usr/qnx630/host/qnx6/x86/
QNX_TARGET: /usr/qnx630/target/qnx6/

Additional Packages

  Package Name: QNX Neutrino Core OS
       Version: 6.3.2
          Base: QNX Momentics 6.3.0 SP1, SP2 or SP3
  Install Path: /usr/qnx630

Any idea? Thanks for your answer… and sorry for the idle time (I was blocked on WORK, actually :slight_smile:)


Two suggestions:

  1. take a look at /proc/qnetstats (cat it), and at the slog (sloginfo), to see if there are any errors reported.
  2. turn on the CRC-checking option (do_crc=1) — see the docs for npm-qnet-l4_lite.so

When qnet behaves like this, it is usually bad networking hardware. If everything works OK with do_crc=1, then throw out your network cards and get good ones…
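In case it helps, the commands would be along these lines (a sketch, assuming qnet is npm-qnet-l4_lite.so bound to interface en0; check the npm-qnet docs for the exact options on your setup):

```shell
# 1. Inspect qnet statistics and the system log for reported errors
cat /proc/qnetstats
sloginfo

# 2. Mount qnet with CRC checking enabled
#    (qnet options are passed at mount time; en0 is assumed)
mount -T io-net -o "bind=en0,do_crc=1" npm-qnet-l4_lite.so
```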

rgallen: Thanks for your answer!

Look, I’ve already tried suggestion number 1. Attached are the reports from when it fails. Node B in my previous description is node ‘ppc2’ (the one which maybe has the problem). ‘dev1’ is another node in the network, which blocks when I run ‘pidin -n ppc2’, for example.

There are a lot of errors logged, but I still can’t reach a conclusion.

In addition, here is the pidin of the target node (B/ppc2):

   1   4 procnto             10r REPLY       77837           
   1   5 procnto             10r REPLY       77837           
   1   6 procnto             10r REPLY       77837           
   1   7 procnto             10r REPLY       77837           
   1   8 procnto             10r REPLY       77837           
   1   9 procnto             10r REPLY       77837           
   1  10 procnto             10r REPLY       77837           
   1  11 procnto             10r REPLY       77837           
   1  12 procnto             10r RECEIVE     1               
   1  13 procnto             10r RECEIVE     1               
   1  14 procnto             10r RECEIVE     1               
   1  15 procnto             10r RECEIVE     1               
   1  16 procnto             10r RECEIVE     1               
   1  17 procnto             10r REPLY       77837           
   1  18 procnto             10r REPLY       77837           
   1  19 procnto             10r RUNNING                     
   1  39 procnto             10r REPLY       77837           
   2   1 sbin/tinit          10o REPLY       1               
4099   1 proc/boot/pci-bios  10o RECEIVE     1               
4100   1 proc/boot/slogger   21o RECEIVE     1               

20491   1 sbin/mqueue         10o RECEIVE     1               
77837   1 sbin/io-net         10o SIGWAITINFO                 
77837   2 sbin/io-net         13o RECEIVE     5               
77837   4 sbin/io-net         10o RECEIVE     1               
77837   5 sbin/io-net         10o RECEIVE     20              
77837   6 sbin/io-net         20o RECEIVE     23              
77837   7 sbin/io-net         10o RECEIVE     1               
77837   8 sbin/io-net         21o RECEIVE     27              
77837   9 sbin/io-net         21o RECEIVE     28              
77837  10 sbin/io-net         10o RECEIVE     1               
77837  11 sbin/io-net         10o RECEIVE     1               

As I mentioned, there is a procnto thread blocked on an io-net thread for each application on the other node (dev1) trying to talk to node ppc2 over qnet.

With respect to the CRC-checking option, I think it’s very interesting; I’ll test it. The question here is about performance, because there is real-time data being broadcast over the network to the clients.

Another thing that makes this harder is that it’s a very infrequent event. It happens, let’s say, once a month on a live 12-node QNX system.

Thanks in advance!


Juan, I looked at your qnet stats. It definitely looks like data corruption (you are getting invalid scoids).

It appears that occasionally the NIC reports a frame as good (there are no reported bad frames), yet the frame is actually either corrupt when it is received, or it is corrupted as it is transferred to memory.

Turning “do_crc=1” on should fix this, but it indicates that you have defective hardware (the NIC is supposed to CRC the frame itself, and with correctly functioning hardware there is no possibility of the frame being corrupted as it is transferred to memory).

That said, the above assumes the driver is error-free. There could be a bug in the driver that corrupts the packet after it is received. What driver are you using?

Hi rgallen!

Look, I’m using the ‘speedo’ driver on a traditional (at least for me) Intel NIC, vendor/device ID 0x8086/0x1229 (an onboard card). We’ve been using this card for years (even with QNX 4.25) with no problems at all. This card appears as supported hardware in the QNX 6 database.

I don’t know how I can see the specific version of the speedo driver. The only thing I can tell you is that it is the one installed from the QNX 6.3.0 SP3 + Core patch 6.3.2 CD (you can see my qconfig output above).
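Regarding the version question: on QNX 6 the `use` utility can usually dump a binary’s embedded usage/build info with the -i flag (a sketch; I’m assuming the driver lives in /lib/dll, which may differ on your system):

```shell
# Show the embedded version/build info of the speedo driver
use -i /lib/dll/devn-speedo.so
```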

There is a newer version of speedo (Nov 2007) that I haven’t tested yet. Do you think this newer driver could be a solution for this problem?

Thanks in advance



So does “do_crc=1” solve the problem then?

So does “do_crc=1” solve the problem then?

Well, it’s not so easy. As I told you, the problem is very, very sporadic (actually it hasn’t happened again since a couple of weeks ago…) and I have to wait for a propitious moment to stop the whole system. But it’s still on my mind!

The other possibility is the newer ‘speedo’… Maybe in the next few weeks I can test all these things… and I’ll tell you.

Thank you again!