How to make qnet more deterministic

I’m trying to explore some of the distributed capabilities of QNX, particularly in terms of having IO processes on one node and control processes on another node.

My test rig is as follows:

1 P3 1GHz (node name P3)
1 P4 2.4GHz (node name P4)
1 scope, dual channel; channel 1 connected to a data pin on P4’s parallel port, channel 2 connected to a data pin on P3’s parallel port.

I wrote a simple server that waits for a msg and either turns the parallel port data lines high or low depending on the msg contents.
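
In outline the server is something like this (a trimmed sketch, not the attached code verbatim; 0x378 as the LPT1 data register is an assumption):

    #include <errno.h>
    #include <stdio.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <hw/inout.h>
    #include <sys/mman.h>
    #include <sys/neutrino.h>

    #define LPT_DATA 0x378               /* usual LPT1 data register */

    int main(void)
    {
        unsigned char cmd;               /* 1 = lines high, 0 = lines low */
        int chid, rcvid;
        uintptr_t port;

        ThreadCtl(_NTO_TCTL_IO, 0);      /* I/O privileges (run as root) */
        port = mmap_device_io(1, LPT_DATA);

        chid = ChannelCreate(0);
        printf("server: pid %d chid %d\n", getpid(), chid);

        for (;;) {
            rcvid = MsgReceive(chid, &cmd, sizeof(cmd), NULL);
            out8(port, cmd ? 0xff : 0x00);
            MsgReply(rcvid, EOK, NULL, 0);
        }
    }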

Ran this program on both nodes at the same time.

Wrote a client program that receives timer pulses at, say, 128Hz. This program sends a msg to the parallel port servers on nodes p3 and p4 (so one is local and the other is over the fast ethernet via qnet).
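
The client is roughly this shape (again a sketch; the pid/chid arguments and the /net/p4 node name are placeholders - plug in whatever your servers report):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <sys/neutrino.h>
    #include <sys/netmgr.h>

    #define PULSE_CODE (_PULSE_CODE_MINAVAIL + 1)

    int main(int argc, char *argv[])
    {
        /* argv: local_pid local_chid remote_pid remote_chid */
        int chid = ChannelCreate(0);
        int self = ConnectAttach(ND_LOCAL_NODE, 0, chid,
                                 _NTO_SIDE_CHANNEL, 0);
        int loc  = ConnectAttach(ND_LOCAL_NODE, atoi(argv[1]),
                                 atoi(argv[2]), _NTO_SIDE_CHANNEL, 0);
        int rem  = ConnectAttach(netmgr_strtond("/net/p4", NULL),
                                 atoi(argv[3]), atoi(argv[4]),
                                 _NTO_SIDE_CHANNEL, 0);

        /* periodic timer delivered as a pulse on our own channel */
        struct sigevent ev;
        SIGEV_PULSE_INIT(&ev, self, SIGEV_PULSE_PRIO_INHERIT, PULSE_CODE, 0);

        timer_t t;
        struct itimerspec its;
        its.it_value.tv_sec  = 0;
        its.it_value.tv_nsec = 7812500;      /* ~128Hz */
        its.it_interval      = its.it_value;
        timer_create(CLOCK_REALTIME, &ev, &t);
        timer_settime(t, 0, &its, NULL);

        unsigned char state = 0;
        struct _pulse pulse;
        for (;;) {
            MsgReceivePulse(chid, &pulse, sizeof(pulse), NULL);
            state ^= 1;                      /* toggle both ports */
            MsgSend(loc, &state, sizeof(state), NULL, 0);
            MsgSend(rem, &state, sizeof(state), NULL, 0);
        }
    }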

All programs run at priority 30.

The local parallel port produces a nice steady fixed frequency pulse train.

The remote parallel port has a lot of non-deterministic behavior (i.e. pulse lengths vary quite a bit). I am guessing this is due to qnet. The fast ethernet connection is full duplex, and it is an isolated segment - point to point between the 2 PCs.

Is there any way to reduce this non-deterministic behavior of qnet?

Note that first I tried a resource manager and just opening the file locally and remotely (same deal, 2 resource managers) - I found I had to fflush to push the msgs out at the rate I was trying to run at (128Hz). When using a res mgr, it seemed like both parallel ports exhibited the non-deterministic behavior (varying length pulses). I figured it was due to qnet / the res mgr library.

Then I tried the simple 3-way handshake using MsgSend()/MsgReceive()/MsgReply() and achieved the results I mentioned first (deterministic local response, non-deterministic remote response).

The motivation for all this was to allow implementation of an intelligent IO controller as a QNX PC with a simple native QNX messaging interface (or preferably a POSIX file-based interface using a resource manager).

Thanks!

First, 128Hz (7.8ms) is not a multiple of 1ms, which is the default timer resolution, so you will get some discrepancies.
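
If you need a period that isn’t a multiple of the tick, you can change the tick with ClockPeriod(), something like this (the 100 us value is just an example; a finer tick means more timer-interrupt overhead):

    #include <stdio.h>
    #include <sys/neutrino.h>

    int main(void)
    {
        struct _clockperiod newp = { 100000, 0 };     /* 100 us tick */
        struct _clockperiod oldp;

        ClockPeriod(CLOCK_REALTIME, NULL, &oldp, 0);  /* query current */
        printf("tick was %lu ns\n", (unsigned long)oldp.nsec);
        ClockPeriod(CLOCK_REALTIME, &newp, NULL, 0);  /* set new tick */
        return 0;
    }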

Make sure you use 6.3 SP2 and npm-qnet-l4_lite.so (npm-qnet.so should be a link to it).

I’m not sure why you are getting jitter when you implement the program as a resource manager; I would need to see the code.

By how much does the pulse length vary?

How do you get the program to run at priority 30? Try lowering it to 18 (below that of the event thread in the network driver).

Post the output of nicinfo on each machine. Post the output of pidin.

Aside from that, some information on your network might be useful. Presumably you are using Ethernet? 100TX? Do you have a hub, a switch, or a crossover cable? And probably most important, is there other traffic on the network, from other nodes or from these nodes?

Mitch: “the fast ethernet connection is full duplex, and it is an isolated segment - point to point between the 2 PCs.” ;-)

I am using SP2 with the 6.3.2 core applied.

I did find npm-qnet-l4_lite.so in the documentation and am using it.

Ok, I lied, it’s not exactly 128Hz - it’s an 8,000,000 ns delay timer - each time the timer goes off I send a msg to toggle the state of the dio bit. So 8ms on, 8ms off - 50% duty cycle, 16ms period - works fine with no jitter when talking to the parallel port server on the same machine. Jitter only occurs when talking to the other node with MsgSend(), or on either node (local or remote) if I use the resource manager and open /net/<node>/dev/sample.

I use sched_setscheduler to set FIFO scheduling and a priority of 30, and then verify it afterwards with sched_getparam (also you can see in sin that the prios are 30).
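
i.e., something like this (trimmed):

    #include <stdio.h>
    #include <sched.h>

    static void set_prio(int prio)
    {
        struct sched_param p = { 0 };

        p.sched_priority = prio;
        if (sched_setscheduler(0, SCHED_FIFO, &p) == -1)  /* 0 = this process */
            perror("sched_setscheduler");

        sched_getparam(0, &p);                            /* verify */
        printf("running at priority %d\n", p.sched_priority);
    }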

The network topology is 2 nodes, both connected to the same switch. Nothing else connected to the switch. The switch is Full Duplex Fast Ethernet (100M).

I am not sure how much the pulse varies by; I’ll implement some instrumentation. It’s hard to tell on the scope - it’s an old analog scope with no capture - but it looks like a couple of periods (so 16-48 ms).

I’ve attached the projects.

parportserver is the msg passing server.
parporttimerclient is the msg passing client.
Run the server on the local and remote nodes; the client will send msgs to both servers based on its timer. The servers toggle the parallel port data bits.

parportres is the resource manager implementation.
timerclient is the timer client that talks to it. Run the res mgr on both nodes as well.
You can test the res mgr by hand by doing
echo "on" > /dev/sample
and
echo "off" > /dev/sample

Just connect any of the data pins of your parallel port to a scope.

Sorry they are sloppy; it’s just copy/paste from the docs plus some scrubby code of my own, just to learn how these things work.

I tried the non-resource-manager version and get behavior similar to yours. I tried changing a few things here and there but couldn’t get any significant improvement. I must say I’m a bit surprised.

I believe qnet has a bunch of undocumented options. I’ll see if I can dig them out (it was a post in a newsgroup a while ago); maybe there is an option in there that could help.

I did not try the resmgr flavor, but it’s normal that you have to use fflush because FILE * operations are buffered. Either set the stream to unbuffered via setvbuf() or use open() instead of fopen().
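
Something like this (untested sketch; /dev/sample being your resmgr’s path):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* Option 1: keep stdio but turn buffering off */
        FILE *f = fopen("/dev/sample", "w");
        setvbuf(f, NULL, _IONBF, 0);
        fputs("on", f);                 /* goes out immediately now */
        fclose(f);

        /* Option 2: raw descriptor, no stdio buffering at all */
        int fd = open("/dev/sample", O_WRONLY);
        write(fd, "off", 3);
        close(fd);
        return 0;
    }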

  • Mario

I agree Mario, 16 to 48 ms is troubling. I still didn’t hear whether the connector box was a hub or switch. I’m just trying to remove the possibility that the switch buffering could be at fault. This would of course be just as distressing. It would take me a few hours, but if there is no resolution, I’d be glad to try to reproduce this here, just ask.

Sorry, can’t find that document I was talking about.

Same behavior with a switch or a hub (I tried both). Only thing I haven’t tried is a crossover cable; I can try that if you like.

To be specific - one 100 meg fast ethernet hub, 2 nodes connected to it, nothing else.

Also one 100 meg fast ethernet switch, 2 nodes connected to it, nothing else.

So is the problem QNET or the ethernet chipset driver or the medium or something else?

Ncostes,

You could try eliminating QNET from the equation and see if that improves the non-deterministic behavior.

Just write a couple of VERY simple processes that connect across ethernet using sockets. On one machine you send to the other and on the remote you receive the packet and send a pulse. Then check to see how regularly that happens.

There are many articles on the internet about how non-deterministic TCP/IP is. So depending on what QNET uses, it could well be nothing more than a problem with the TCP/IP ethernet medium (which the test above should show, if it also exhibits the same behaviour).
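
Something along these lines would do (untested sketch; the port number is arbitrary - run it as "udptest recv" on one node and "udptest send <ip>" on the other, and toggle the parallel port or timestamp arrivals in the receive loop):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    #define PORT 5000

    int main(int argc, char *argv[])
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in a;

        memset(&a, 0, sizeof(a));
        a.sin_family = AF_INET;
        a.sin_port   = htons(PORT);

        if (argc > 2 && strcmp(argv[1], "send") == 0) {
            a.sin_addr.s_addr = inet_addr(argv[2]);   /* receiver's IP */
            char c = 0;
            for (;;) {                  /* one datagram every ~8 ms */
                c ^= 1;
                sendto(s, &c, 1, 0, (struct sockaddr *)&a, sizeof(a));
                usleep(8000);
            }
        } else {                        /* receiver */
            a.sin_addr.s_addr = htonl(INADDR_ANY);
            bind(s, (struct sockaddr *)&a, sizeof(a));
            char c;
            for (;;) {
                recvfrom(s, &c, 1, 0, NULL, NULL);
                /* toggle the parallel port / log the arrival time here */
            }
        }
        return 0;
    }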

Tim

The 16 ms (which is the delay) I’m seeing is a lot. And I know in my case the problem is not caused by the networking hardware. Hence this has to be an issue with QNET. I would contact QSS about this.

I could try that, Tim, but QNET doesn’t use TCP/IP as far as I know; it is the QNX native networking in QNX6 (like FLEET was in QNX4).

It would be interesting to see if datagrams or a connection-based socket have the same issues; I guess I’ll try it out. But I can’t imagine that it would be better than QNET, for the reasons you mentioned (there is no expectation of deterministic behavior from the TCP/IP stack).

I really was expecting this to work out on QNX, and it was one of the strong points in the case I am trying to make for adoption of QNX for use in our trainers. We currently use a mix of distributed IO (e.g. Phoenix IO, which is Modbus over TCP), IIOCs, and local IO boards, with the host being Linux or VxWorks. It would be quite convincing if drivers were servers that could be manually migrated to other PCs should the CPU load on the main host PC become too high when software is updated, etc.

Thanks for the help guys. Not sure what I’ll do next.

Let me reiterate what Mario said. You should check with QSSL about this. QNET most definitely should be more deterministic than what you are seeing. Short of something we haven’t discovered (other than QNET) that is causing the problem, there is something wrong. Whether QSSL will find the problem and fix it, or report that it is a “feature” is not known, but they put a lot of effort into revamping QNET for QNX 6, so if it is performing worse, there is something amiss.

I think we are misusing the word deterministic. To be deterministic you have to compare it against some specification ;-) But I would definitely say it’s underperforming.

non-stochastic?

Well, the thing is, I’m using an eval version of QNX to try to make a case for us switching from VxWorks and Linux to QNX for the host (the computer that runs the sim and IO) in our trainers (flight sims).

Since I am not a current customer, I am not sure how I’d contact QSSL about this, or if it’s even appropriate.

Thanks for your help guys.

Well, contact your rep and explain the situation. That should get their attention even more. You are saying that you are seeing behavior that they hopefully would agree should not be there, and that future sales are dependent on resolving it.

First thing that comes to mind: does lowering the priority from 30 to, say, 15 make things better?

Also, check the “nicinfo” and “cat /proc/qnetstats” output; does QNET report any retransmits?
What network driver is this? npm-qnet.so is not happy with a bad network driver.
Sometimes, running the network driver with a larger ring size (io-net -d <driver> receive=1024,transmit=1024) could help.

xtang.

I tried priority 10 and 15 and it didn’t change anything. I was using the speedo driver with no errors reported by nicinfo. I didn’t check /proc/qnetstats though.

The evaluation version is for you to check if you like the system. For questions, of course contact QSSL - if you are a potential customer, they should be interested.

Where are you located? QNX has offices in several countries, and distributors cover the whole world.