Problem getting networking working reliably

Tim · August 6, 2004, 6:57pm

Sigh, I’m back here again with issues moving to 6.3. It seems far more has changed than I thought and obviously not always for the better.

This time it has to do with getting TCPIP networking working reliably on our machine.

Here’s what I have hardware-wise: a card cage board with a SanDisk on the EIDE controller, 2 Intel ethernet cards, a USB port and a monitor port. All in all a very simple setup that works well under 6.1.

I’m at the stage in my porting from 6.1 where I have moved the card cage board from my boot rig (where I boot off a HD and mount the SanDisk) into the instrument card cage.

Now the problem I am seeing is that TCPIP networking isn’t always coming up correctly. Sometimes it comes up after 3-5 minutes and other times no matter what I do it just never comes up properly.

Here are the relevant lines from my sysinit file to start networking.

# Start networking
# Remove from here to…–>>
# -------------------------------------------------
# This is the setting for the ethernet chip on the MIC 3000 boards
# Changed to Large tcpip stack since OS had networking problems
io-net -ptcpip -ppppmgr -dspeedo pci=0 -dspeedo pci=1 &

# wait until io-net manifests itself and then start networking manager
# which uses /etc/net.cfg by default
waitfor /dev/io-net 60
netmanager
inetd &

# Qnet
# Enable qnet - can only be used on target systems
slay inetd
slay io-net
echo "Waiting for io-net to release sockets..."
sleep 5
io-net -ptcpip -ppppmgr -pqnet -dspeedo pci=0 -dspeedo pci=1 &
echo "Waiting for io-net to re-establish sockets..."
sleep 10
netmanager
inetd &

Here’s the net.cfg that I have on my machine:

# nto network config file v1.2

version v1.2

[global]
hostname X-ALPHA10

domain melb.vsl

domain cdg.us.bio-rad.com
nameserver 10.42.104.64
nameserver 10.42.64.64
nameserver 198.211.153.60
nameserver 198.211.153.60
route 10.42.104.1 0.0.0.0 0.0.0.0
lookup file bind

[en0]
type ethernet
mode dhcp

[en1]
type ethernet
mode manual
manual_ip 192.168.0.2
manual_netmask 255.255.255.0

Now here’s what happens.

Booting takes a VERY long time to get past the EIDE driver even though I turned off DMA access. This is the line to diskboot in my build file:

[pri=10o] PATH=/proc/boot diskboot -s -vvvv -b1 -D0 -odevc-con,-n1

This is strange because the only device on the EIDE is the SanDisk. Looking in sloginfo I see the SanDisk recognized, followed by a bunch of errors, even though there are no other devices. Not sure if this has any bearing, but it does make booting take longer, which is unacceptable and needs to be fixed at some point.

My sysinit executes and I see the info printed to the screen about starting io-net, netmanager etc. Even though further above I have ‘random’ started in my sysinit file, it always complains about having to use the pseudo random number generator. Strange. Anyone know why?
When I get to a login shell, I log in and type ifconfig -a. At that point I see the 192.168.0.2 address assigned to en1 (2nd card) and no IP address for en0 (1st card). Both cards have cables plugged in, card 1 is plugged into the corp net, and the routes are set right to get a DHCP address, since this works perfectly in my boot rig, just not in the real machine.
I try to ping out to the 192.168.0.2 address (my own card). Ping just sits there hung and I CTRL-C out after 5 seconds. I try to ping 192.168.0.1, which is the GUI Windows box, and again it’s hung. If I look at the lights on the card while attempting the ping, it certainly looks like packets are transferred because the lights blink. If I actually just sit there and wait, ping will eventually respond after about 2 minutes with ping information. What’s strange is that it will suddenly start showing all the packets and report 0% packet loss for the 2 minutes I’ve been waiting, so I don’t know if this means the responses have been sitting in a queue someplace for those 2 minutes. Of course the very next ping command then takes another 2 minutes to respond.
From the Windoze box I can ping 192.168.0.2, which means the address of the card got set by QNX correctly.
I have tried slaying io-net and inetd and re-typing the entire lines of code from the sysinit. Everything starts without complaint, but once up I have no DHCP address on card 1, nor can I ping out of card 2.

What’s really strange is that this does on occasion work. Once in a while (say every 10-15 boots) after about 3-4 minutes QNX will find itself and I’ll get a DHCP address on card 1 and be able to use both card 1 and 2 just fine and normally until the next re-boot.

Any ideas what could be going on here? I’m totally stumped, but I’m sure something is causing a driver someplace to be stuck, not passing packets back to the O/S but transmitting outward just fine.

Tim

xtang · August 8, 2004, 1:31am

About the “random” warning: the message actually asks you to “see ‘random’ option”. Did you try “use /lib/dll/npm-tcpip.so” to see what the “random” option is about?

As for the rest of your problem, it looks to me like your en0 can’t get an address.

By default, “ping” will try to resolve hostnames, and I guess that with no address on en0 there is no way to reach your DNS server.

Try “ping -n 192.168.0.2”, you will see the difference.

You really want to figure out why your en0 can’t be configured by dhcp.
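To put a number on the difference between the two ping invocations, a small timing helper along these lines could be used. This is only a sketch: `elapsed_seconds` is a made-up helper name, it relies on `date +%s` being available, and the `-c 1` count flag in the usage comment is an assumption about the local ping.

```shell
#!/bin/sh
# Print how many whole seconds a command takes to finish.
# usage: elapsed_seconds <command> [args...]
elapsed_seconds() {
    start=$(date +%s)
    # Discard the command's own output; only the duration matters here.
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# On the target one might compare the two forms (not run here):
#   elapsed_seconds ping -c 1 192.168.0.2      # with name resolution
#   elapsed_seconds ping -n -c 1 192.168.0.2   # numeric only
```

If the first form takes minutes and the second returns almost at once, the delay is in name resolution rather than in the interface itself.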

Tim · August 9, 2004, 9:33pm

xtang,

You’re right, ping -n works right away, so the problem is definitely with resolving host names.

Fooling around a bit more with things I see that the real problem is:

io-net -ptcpip -ppppmgr -dspeedo pci=0 -dspeedo pci=1 &

I pulled the card from the card cage and placed it back into my boot rig and powered it up. In the bootrig I boot from a connected HD that has a 6.3 stock installation from the CD. I checked and I have an identical net.cfg on the harddrive for the boot rig. So clearly it’s not a net.cfg problem.

But I have a completely different sysinit, rc.sysinit and rc.local on the stock 6.3 install. The biggest thing is that I can’t see how io-net is started (it’s not launched from any of these 3 files, so it must be in the boot image someplace?).

Doing a sin args shows:

io-net -ptcpip.

But when I slay io-net on my boot rig and launch it from the command line with ‘io-net -ptcpip’ it does not find anything in /dev/io-net other than the loopback address (ie no en0, en1). So those aren’t the only args to io-net despite what sin args reports. Plus if I launch it by hand using the:

io-net -ptcpip -ppppmgr -dspeedo pci=0 -dspeedo pci=1 &

command then netmanager fails to obtain a DHCP address or set my routes correctly.

So my question is, how can I determine what args are passed into io-net so I can copy that on my flash disk sysinit file? I saw there is something in /etc/system/enum/devices/net file but I don’t know what to make of the info there or if it even has any relevance.

Tim

Tim · August 10, 2004, 12:30am

Some more info:

I spent time commenting out things in rc.local, rc.sysinit and sysinit in an attempt to figure out exactly how QNX auto-starts my network from a stock install.

Eventually I traced everything to sysinit, where it makes a call to execute /etc/rc.d/rc.devices

Inside this file there is some processing that leads to a call to /etc/system/enum and so on and so forth. Bottom line is that QNX starts io-net and then does a mount of each ethernet card.

I traced this out to be the following calls:

io-net -ptcpip (the sin args I found earlier)
mount -T io-net -opci=0,did=0x1229(guess on this number) /lib/dll/devn-speedo.so -o/dev/io-net/en0
mount -T io-net -opci=1,did=0x1229(guess on this number) /lib/dll/devn-speedo.so -o/dev/io-net/en1

So, of course I slayed io-net and inetd and attempted to run the above 3 lines by hand to mount my connection.

After the ‘io-net -ptcpip’ I see /dev/io-net and /dev/socket created. Looking inside /dev/io-net I see the loopback connection and one other I’m not sure what it is.

Then I run the 2 mount commands. After each one I looked in /dev/io-net and sure enuff first en0 and then en1 was added.

Then I ran netmanager and boom, I’m back to where I was in the beginning. Netmanager doesn’t complain about anything but after it runs I have the 192.168.0.2 address hard coded to en1 and nothing on en0 (no DHCP address assigned). So I don’t get it. Clearly this is the same thing the O/S is doing at boot time (I checked how it starts netmanager and it just waits for /dev/socket and /dev/io-net before running netmanager with no parameters) yet it doesn’t work. So something else is completely hidden/going on behind the scenes that is not well documented.

What’s really strange is that after I finish the mount commands, if I do a ‘netstat -i’ I see that packets are arriving on en0 (connected to the corp network) and none on en1 (the hardcoded network that is not connected). So far that seems normal. But when I run netmanager and do another ‘netstat -i’, I see that 5 packets have been sent out on en0 and some more packets have been received (no idea if they are responses or just packets arriving at the card). However, I don’t get a DHCP address assigned to en0, and for all intents and purposes the interface is dead/useless.

Is it possible the incoming packets are not getting delivered to the TCPIP stack? Is there a way to check where the incoming packets go?

This is unbelievably frustrating not to be able to get such a simple thing working after spending 2 days on it.

Tim

cdm · August 14, 2004, 11:38pm

First off, you really don’t need to use diskboot so don’t. Unless you are trying to make a custom image that has to run on a variety of hardware you will be better off adding the binaries you need to boot to the image and starting things yourself.

starless · August 16, 2004, 7:27pm

pidin mem gives you good clues about this…

xtang · August 17, 2004, 1:15am

Do it by hand, and then try to figure it out.

1. Start by slaying io-net, so there is no network at all.
2. io-net -p tcpip -p pppmgr
3. pci -v to find out the “device id” for your 2 speedo cards.
4. mount -Tio-net -opci=0,did=<did_num> /lib/dll/devn-speedo.so
5. mount -Tio-net -opci=1,did=<did_num> /lib/dll/devn-speedo.so
6. nicinfo -r en0: is this really the speedo connected to the corporate network? (Do the media rate and MAC address look right?)
7. touch /var/log/syslog, and start syslogd

Then start dhcp.client as “dhcp.client -i en0 -m -dddd”.

Do you get an address? Does /var/log/syslog say anything?

Tim · August 17, 2004, 4:14pm

Xtang, Starless,

I followed steps 1-7 in your list over and over. I know I had the right driver already because that same driver worked just fine under 6.1.

Eventually, through sheer luck, I stumbled across dhcp.client late last week. The docs barely mention it and it was certainly never in any sysinit file I’ve ever seen (under 6.1 or 6.3). I truly have NO idea how it gets started under 6.1 (our current system) or by a stock installation of 6.3.

I basically found it by doing a pidin command on the stock 6.3 machine, writing down every thread name and comparing to the ones running on my board. As soon as I saw the missing dhcp.client it was clear what that was doing so I started it and immediately obtained a DHCP address. So I went into my sysinit file and added a line to start dhcp.client so that my sysinit looks like this now:

io-net -ptcpip -ppppmgr -dspeedo pci=0 -dspeedo pci=1 &

# wait until io-net manifests itself and then start networking manager
# which uses /etc/net.cfg by default
waitfor /dev/io-net 60
netmanager
dhcp.client
inetd &

But my obvious question is, how is dhcp.client started by a standard installation of 6.3 if it’s not in any of the sysinit files? My guess is that netmanager is supposed to start it if it sees ‘mode dhcp’ set for an ethernet card in net.cfg, but it’s apparently not doing that for me on my board.
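The compare-the-process-lists trick described above can be scripted instead of done on paper. A minimal sketch, assuming the process names from each machine have already been saved one per line (how they are pulled out of the pidin output is left open, and `missing_procs` is just an illustrative name):

```shell
#!/bin/sh
# Print names that appear in the reference list but not in the target list.
# Each input file holds one process name per line; duplicates are fine.
# usage: missing_procs <reference_file> <target_file>
missing_procs() {
    ref=$1
    target=$2
    ref_sorted=$(mktemp)
    target_sorted=$(mktemp)
    # comm needs sorted input; -u also collapses repeated names.
    sort -u "$ref" > "$ref_sorted"
    sort -u "$target" > "$target_sorted"
    # -23 suppresses lines unique to the target and lines common to both,
    # leaving only what the reference machine runs and the target does not.
    comm -23 "$ref_sorted" "$target_sorted"
    rm -f "$ref_sorted" "$target_sorted"
}
```

Run with the stock install’s list as the reference and the board’s list as the target, this would print dhcp.client in the situation described here.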

cdm,

How can I get away from having a diskboot image in my build file? All the examples of build files I’ve seen have had one there. I’d love to just add the minimum binaries to my image to start and then start everything else in the sysinit file, because it looks like turning DMA mode off is not working. It takes SO long to boot the board (at least 20-30 seconds just spinning and printing errors to the screen) that I get at least a screen and a half of errors before the boot actually occurs.

Tim

xtang · August 17, 2004, 5:18pm

So does this work? At which step did it fail?

Looking at your sysinit above, I suspect that when netmanager runs, your drivers haven’t started yet. (Does sloginfo say anything?)

Try it like this:

io-net .... -d speedo pci=1
waitfor /dev/io-net/en0 60
waitfor /dev/socket/2 60
netmanager
inetd

The key is 1) no “&” on the io-net line, and 2) waiting for “/dev/socket/2” so that the tcpip stack is ready.
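The wait-then-start ordering can be tried out on any machine with a portable stand-in for waitfor. This is a rough sketch only: `wait_for_path` is a made-up helper that polls once a second, not the QNX utility, and the QNX commands appear only in comments.

```shell
#!/bin/sh
# Poll until a path exists or a timeout (in seconds) expires.
# Returns 0 once the path is there, 1 on timeout.
# usage: wait_for_path <path> [timeout_seconds]
wait_for_path() {
    path=$1
    timeout=${2:-60}
    elapsed=0
    while [ ! -e "$path" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out waiting for $path" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Intended ordering on the target (QNX-only commands, so left as comments):
#   io-net ... (no trailing "&")
#   wait_for_path /dev/io-net/en0 60 && wait_for_path /dev/socket/2 60 && netmanager
#   inetd
```

The point of the ordering is that netmanager only runs once both the driver entry and the stack entry exist, instead of racing a backgrounded io-net.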

See if this works for you.

-xtang

Tim · August 19, 2004, 8:14pm

Xtang,

Yup removing the & made it work perfectly. Thanks for that tip!

I swear when I was manually starting it that I was delaying between the io-net call and the netmanager call but maybe the only delay was between netmanager and inetd.

Now if I can eventually figure out why the devb-eide driver is still trying to start in DMA mode despite the fact I told it not to in the boot image, I’ll be back to where I was in 6.1 in terms of booting (SO much work still to do to get the S/W to port because of tons of core dumps).

Tim

xtang · August 20, 2004, 12:19am

What’s the command line you used for devb-eide?

Tim · August 20, 2004, 5:22pm

Xtang,

It’s actually part of my .bld file. I use this line:

[pri=10o] PATH=/proc/boot diskboot -s -vvvv -b1 -D0 -odevc-con,-n1

to start the eide driver. The -D0 option is supposed to turn DMA access off.

Would I need to add in a -o “devb-eide ???” on the diskboot line to disable DMA directly? Doesn’t seem like I should need that.

Tim