Weird TCP/IP 5.0 ifconfig behavior...

Greetings,

Before I get into the nitty gritty of this, I’ve got a question about
ifconfig…

The documentation for the “interface” parameter (enx, pppx, etc.) states:

“where x is the number of logical QNX LANs running the selected hardware
protocol.”

That’s a bit ambiguous, but I’ve always assumed that what that really is
supposed to mean is:

“where x is the number of the logical QNX LAN to run the selected
hardware protocol.”

In other words, to run TCP/IP on logical LAN 3 (ethernet) you’d use en3.

Is that correct?


OK, now for the weird behavior…

We’ve got four Intel ethernet NICs running on a node. All are
controlled by separate invocations of the Net.ether82557 driver (4.25G
Feb 22 2001). The 2 built-in NICs (on the motherboard) are designated
as logical LANs 1 & 2, to be used as dual QNX LANs only (no TCP/IP).
The other 2 NICs on a dual ethernet PCI card are for logical LANs 3 & 4,
to be used on redundant TCP/IP networks. After fuddling around with the
BIOS, we’ve managed to get all 4 drivers to use unique IRQs, and when the
machine is cold booted, everything works as expected, i.e. LAN1 → en1,
LAN2 → en2, etc.

The weird behavior happens when the machine is warm booted. Everything
shows up bass-ackwards, i.e. LAN1 → en4, LAN2 → en3, LAN3 → en2, LAN4
→ en1. And, although TCP/IP runs and ifconfig gives no errors, we
cannot communicate over TCP/IP, even when using en2 or en1. QNX LANs 1 & 2
still work just fine though.

Here’s the pertinent info when things are weirded out:

ps …
56 7 0 20r RECV 0 364K Net.ether82557 -I0 -l3 -s100 -F
57 7 0 20r RECV 0 364K Net.ether82557 -I1 -l4 -s100 -F
58 7 0 20r RECV 0 364K Net.ether82557 -I2 -l1 -s100 -F
59 7 0 20r RECV 0 364K Net.ether82557 -I3 -l2 -s100 -F

ifconfig -a …
en4: flags=8822<BROADCAST,NOTRAILERS,SIMPLEX,MULTICAST> mtu 1500
address: 00:30:48:28:76:f7
en3: flags=8822<BROADCAST,NOTRAILERS,SIMPLEX,MULTICAST> mtu 1500
address: 00:30:48:28:76:f6
en1: flags=8822<BROADCAST,NOTRAILERS,SIMPLEX,MULTICAST> mtu 1500
address: 00:02:b3:ec:3b:1c
en2: flags=8863<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST> mtu 1500
address: 00:02:b3:ec:3b:1d
inet 192.168.1.1 netmask 0xffffff00 broadcast 192.168.1.255
lo0: flags=8009<UP,LOOPBACK,MULTICAST> mtu 32976
inet 127.0.0.1 netmask 0xff000000

netinfo -l …
Driver Slot 0: Driver Pid 57 Logical Net 4 Network Card: Ethernet/
Speedo : 82557/8/9 Ethernet Controller
Vendor ID … 0x8086
Device ID … 0x1229
Subsystem ID … 0x1015
Subsystem Vendor ID … 0x8086
Revision … 0xd
Physical Node ID … 0002B3 EC3B1C
Media Rate … 100Mb/s FDX
Mtu … 1514
Hardware Interrupt … 11

Driver Slot 1: Driver Pid 56 Logical Net 3 Network Card: Ethernet/
Speedo : 82557/8/9 Ethernet Controller
Vendor ID … 0x8086
Device ID … 0x1229
Subsystem ID … 0x1015
Subsystem Vendor ID … 0x8086
Revision … 0xd
Physical Node ID … 0002B3 EC3B1D
Media Rate … 100Mb/s FDX
Mtu … 1514
Hardware Interrupt … 7

Driver Slot 2: Driver Pid 58 Logical Net 1 Network Card: Ethernet/
Speedo : 82559 A/B-Step Ethernet Controller
Vendor ID … 0x8086
Device ID … 0x1229
Subsystem ID … 0x100c
Subsystem Vendor ID … 0x8086
Revision … 0x8
Physical Node ID … 003048 2876F7
Media Rate … 100Mb/s FDX
Mtu … 1514
Hardware Interrupt … 10

Driver Slot 3: Driver Pid 59 Logical Net 2 Network Card: Ethernet/
Speedo : 82559 A/B-Step Ethernet Controller
Vendor ID … 0x8086
Device ID … 0x1229
Subsystem ID … 0x100c
Subsystem Vendor ID … 0x8086
Revision … 0x8
Physical Node ID … 003048 2876F6
Media Rate … 100Mb/s FDX
Mtu … 1514
Hardware Interrupt … 5
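
One way to make the mismatch explicit is to pair the two listings by MAC address. The following is only a minimal sketch in Python, meant to be run off-box on saved copies of the output. The function names are made up for illustration, and the parsing assumes exactly the `ifconfig -a` and `netinfo -l` layouts shown above:

```python
import re


def parse_ifconfig(text):
    """Map interface name -> MAC as 12 lower-case hex digits."""
    macs = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"^(\w+):\s+flags=", line)
        if m:
            current = m.group(1)
            continue
        m = re.match(r"^address:\s*([0-9a-fA-F:]+)", line)
        if m and current:
            macs[current] = m.group(1).replace(":", "").lower()
    return macs


def parse_netinfo(text):
    """Map logical LAN number -> MAC as 12 lower-case hex digits."""
    macs = {}
    lan = None
    for line in text.splitlines():
        m = re.search(r"Logical Net\s+(\d+)", line)
        if m:
            lan = int(m.group(1))
            continue
        m = re.search(
            r"Physical Node ID\W+([0-9A-Fa-f]{6})\s+([0-9A-Fa-f]{6})", line)
        if m and lan is not None:
            macs[lan] = (m.group(1) + m.group(2)).lower()
    return macs


def mismatches(ifconfig_text, netinfo_text):
    """List (lan, mac, expected_if, actual_if) wherever enX != LAN X."""
    by_if = parse_ifconfig(ifconfig_text)
    result = []
    for lan, mac in sorted(parse_netinfo(netinfo_text).items()):
        expected = f"en{lan}"
        # Find which interface, if any, Tcpip attached this MAC to.
        actual = next((name for name, addr in by_if.items() if addr == mac),
                      None)
        if actual != expected:
            result.append((lan, mac, expected, actual))
    return result
```

Feeding it the two listings above should flag all four LANs, since every MAC shows up under a different enX than its logical LAN number.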


Anybody else seen this behavior? Is this a bug, or what?

BTW, the above info is from one of our production systems.
Unfortunately, we’ll need to wait for the wee hours of the morning to do
a cold reboot. In the meantime, we’re setting up a test system to see
if we can verify and/or make any sense of this.

TIA,

-Rob

On Mon, 01 Aug 2005 21:05:49 +0400, Rob Hem <rob@spamyourself.com> wrote:

controlled by separate invocations of the Net.ether82557 driver (4.25G
Feb 22 2001).
There is a newer version: Net.ether82557 v4.25H Apr 08 2005.

(I’m not sure if the changelog covers any multi-NIC problems, though.)

Tony.

This has gotten even weirder. We’ve set up a test machine that is for
all practical purposes identical to the production machine. Both
machines have the same hardware, the same BIOS, start the same drivers,
on the same PCI slots/indexes, and all drivers are assigned the same IRQs.

… Except, the test machine boots flawlessly for both cold and warm
(re)boots with no reordering of the enX assignments!

The only differences we’ve noticed (apart from the bass-ackwards ifconfig
-a output on the production machine) are:

  1. The output of netinfo -l. The production machine lists driver slots
    0-3 as LAN 4, 3, 1, 2 and the test machine as 1, 3, 4, 2. That should
    just be a function of who registers when with Net, correct? It
    shouldn’t have any effect on or be affected by logical LANs or PCI
    slots, etc.
  2. show_pci has the base addresses of the LAN 3 & 4 NICs flipped
    around. Big deal, eh?

Assuming that QNX LANX == enX is supposed to be how things are logically
configured, how can this be coming out backwards on the production
machine? Everything that is hardware dependent is configured correctly??

I’m no longer sure cold (re)booting the production node is going to fix
anything!

HELP!

-Rob

Double check those IRQs on reboot. And don’t try to reassign them or you’ll prolly be changing more than you or the PCI API is realising.


Evan

err … Don’t try to reassign the IRQs from within QNX …


Evan

I’m not manually reassigning any IRQs… or attempting to do so from
within QNX. They’re all assigned by the BIOS/PCI config. FYI… By
disabling the parallel port and the video IRQ in the BIOS we were able
to get the BIOS to assign separate IRQs to all four NICs (5, 7, 10 & 11
if you must know).

What I’ve been trying to point out is, all the hardware configurations
seem to be correct (on both the production and test machines), as far as
I can tell. It’s the logical assignment/pairing of LAN# to en# that is
somehow fubar on the production machine. And, I can’t figure out how or
why that can be. It’s got to be a software problem, doesn’t it?
Everything except ifconfig has the correct LAN# assigned to the correct
NIC/MAC address (netinfo, netmap, etc. all agree). And, as I said in
the original post, the QNX Fleet-only LANs 1 & 2 are working fine, even
though ifconfig has assigned them en4 & en3 respectively.

Could someone from QSSL please confirm whether en# == LAN# is a correct
assumption… before we go spinning further off into space :wink:

-Rob

Rob Hem <rob@spamyourself.com> wrote:

Could someone from QSSL please confirm whether en# == LAN# is a correct
assumption… before we go spinning further off into space :wink:

That has always been my understanding – the “logical network ID” as
assigned by the -l (el) option to the Net.driver is the same number
assigned as the TCP/IP interface for ifconfig (e.g. -l3 is en3.)

-David

David Gibbs
QNX Training Services
dagibbs@qnx.com
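
As an illustration of that rule (-lN on the driver command line should give enN), here is a small Python sketch that derives the expected interface name from `ps` lines like the ones in the original post. The function name is hypothetical, and the regular expression assumes the logical LAN is always given as `-l<number>` after a `Net.<driver>` name:

```python
import re


def expected_interfaces(ps_lines):
    """Return (driver, logical_lan, expected_interface) per Net.* driver line."""
    result = []
    for line in ps_lines:
        # Case-sensitive on purpose: -I<n> (capital i) is a different option
        # from -l<n> (lower-case L), which carries the logical LAN number.
        m = re.search(r"(Net\.\w+)\b.*?\s-l\s*(\d+)", line)
        if m:
            lan = int(m.group(2))
            result.append((m.group(1), lan, f"en{lan}"))
    return result
```

For the line `Net.ether82557 -I0 -l3 -s100 -F` this yields `en3`, which is what `ifconfig` would be expected to report for that NIC if the binding really follows the network ID.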

Rob Hem <rob@spamyourself.com> wrote:

Greetings,



“where x is the number of logical QNX LANs running the selected hardware
protocol.”

That’s a bit ambiguous, but I’ve always assumed that what that really is
supposed to mean is:

“where x is the number of the logical QNX LAN to run the selected
hardware protocol.”

I’ve always assumed the same, and think that is the correct wording.

In theory, I should probably issue a documentation PR for this.

-David

David Gibbs
QNX Training Services
dagibbs@qnx.com

Have you tried putting sleeps in between the loading of your Net.drivers?


David Gibbs wrote:

Rob Hem <rob@spamyourself.com> wrote:


Could someone from QSSL please confirm whether en# == LAN# is a correct
assumption… before we go spinning further off into space :wink:


That has always been my understanding – the “logical network ID” as
assigned by the -l (el) option to the Net.driver is the same number
assigned as the TCP/IP interface for ifconfig (e.g. -l3 is en3.)

-David

It is. The binding for enX is based on network ID (as far as I can see
from the code).


Cheers,
Adam

QNX Software Systems
[ amallory@qnx.com ]

With a PC, I always felt limited by the software available.
On Unix, I am limited only by my knowledge.
–Peter J. Schoenster <pschon@baste.magibox.net>

Thanks David. I’m feeling saner now :wink:

Any ideas on how this strange reordering of the enX interfaces could be
happening? This should all be handled by the Tcpip resource manager
during its initialization, correct? Is it doing something other than
querying Net for the logical LANs & corresponding MAC addresses? How
could that be out of order, when other utilities like netmap report the
same data correctly? Are there any config files or environment vars I
should be looking at/for?

We’re going to schedule a maintenance window for the production system
to try to fix this. But, if a cold reboot doesn’t work out, I want to
have a plan B, C & maybe even D. And, I need as much info as possible to
formulate those alternatives.

BTW, this was working correctly on the production system until last
weekend, when I installed a new boot image with the new Proc32 and did a
warm reboot. Unfortunately, I didn’t notice TCP/IP wasn’t working
correctly until after our systems maintenance window was closed. And
yes, I went through the same update procedure on the test system we’ve
set up, just to make sure it has nothing to do with the new Proc32. The
test system is still working fine.

What’s got me boggled & worried is I don’t know how this could be a
hardware problem that would manifest itself in this way with no other
apparent side effects. Apart from TCP/IP not working, everything else
seems to be just fine. The production machine is one of our primary
nodes. There are nearly 600 processes running on it and none of them are
exhibiting any strange or flaky behavior. And, as I mentioned
previously, the QNX Fleet LANs are working fine also.

Any help or suggestions would be appreciated.

TIA

-Rob

Rob Hem <rob@spamyourself.com> wrote:

Thanks David. I’m feeling saner now :wink:

Any ideas on how this strange reordering of the enX interfaces could be
happening?

No ideas.

This should all be handled by the Tcpip resource manager
during its initialization, correct? Is it doing something other than
querying Net for the logical LANs & corresponding MAC addresses? How
could that be out of order, when other utilities like netmap report the
same data correctly? Are there any config files or environment vars I
should be looking at/for?

None that I can think of. This looks very very strange to me.

-David


David Gibbs
QNX Training Services
dagibbs@qnx.com

Thanks for the suggestion Bill, I’ll certainly keep it as an option when
we get around to doing production maintenance… BUT, I don’t think
getting the drivers started is the problem. And, in theory, the order
in which they register with Net shouldn’t matter… although that is the
order in which netinfo, ifconfig, netstat, etc. report the interfaces.

All 4 of the Net.drivers are running. I’ve also repeatedly stopped and
started Tcpip to try to get this working, but with the same bogus
results. And, as I’ve mentioned before, the QNX Fleet LANs (1 & 2) are
working fine. The production node that’s having the TCP/IP problems is
actively talking to 20 other QNX nodes. If the MAC addresses for the
NICs on that node were wrong, none of the Fleet networking would work. And,
TCP/IP was working at one time on this node.

I guess what I’d really like to know is … How does Tcpip query
interface info from the rest of the system resources? Somewhere in how
it does that, there has got to be a reason, or at least a significant
clue, as to why all this is coming out backwards in this particular case.

-Rob

I believe the TCP/IP stack also looks at the netmap file to get some
info about the interface. Are your other two cards “registered” in the
netmap file?

Regards,

Joe

Indeed they are, and correctly! That’s what is so perplexing.
Everything except Tcpip has got it right.

Here’s the netmap listing (for just that node)…

Logical  Lan  Physical         TX Count  Last TX Fail Time

      1    1  003048 2876F7 ;         0
      1    2  003048 2876F6 ;         0
      1    4  0002B3 EC3B1C ;         0
      1    3  0002B3 EC3B1D ;         0

It matches the netinfo -l output from one of my previous posts. And,
the MAC addresses for LANs 3 & 4 match those written on the dual-port
PCI card.

And, here’s what Tcpip thinks they are (via ifconfig):

en4: flags=8822<BROADCAST,NOTRAILERS,SIMPLEX,MULTICAST> mtu 1500
address: 00:30:48:28:76:f7
en3: flags=8822<BROADCAST,NOTRAILERS,SIMPLEX,MULTICAST> mtu 1500
address: 00:30:48:28:76:f6
en1: flags=8822<BROADCAST,NOTRAILERS,SIMPLEX,MULTICAST> mtu 1500
address: 00:02:b3:ec:3b:1c
en2: flags=8863<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST> mtu 1500
address: 00:02:b3:ec:3b:1d
inet 192.168.1.1 netmask 0xffffff00 broadcast 192.168.1.255
lo0: flags=8009<UP,LOOPBACK,MULTICAST> mtu 32976
inet 127.0.0.1 netmask 0xff000000

Completely backwards! FYI… en2 has the TCP/IP interface designated as
UP because I tried to get that NIC up (it should be en3, if everything
was correct). It didn’t work though. I’m going to try hooking up the
test system to the production node via LANs 3 & 4 with crossover cables,
and then try to talk Fleet, just to see if the NICs & Net.drivers are
actually working.

This is a bit of a stretch, but could this possibly be some sort of a
fence post thingie because this is on node 1? Node 1 is “special” in
the netboot world??

Stay tuned.

-Rob

The saga continues to become stranger…

I just finished testing LANs 3 & 4 running Fleet on crossover cables to
the test node. Everything works fine.

What could possibly be causing Tcpip to get this all backwards and just
plain not work?

BTW, I’ve pieced together that a cold reboot is definitely what we had
to do to get this working the last time. It was on Apr 8th. I didn’t
do much investigation back then, but knowing what I do now, I have no
idea why that worked. Back then, everyone involved just assumed it
might be a hardware problem and we were all set to swap out any suspect
parts. But, now, I don’t know how it could be a hardware problem.

We’ve scheduled a systems maintenance window for this Sunday morning,
and we will be trying a cold reboot (and anything else we can think of)
to see if we can get Tcpip working on production node 1 again. I’ll let
you all know how it turns out this time.

-Rob

Well, we did our production maintenance thing, and all I can say is
Tcpip’s initialization behavior is bizarre!

We first tried a cold reboot. Tcpip remapped everything in yet a
different (incorrect) order… LANs 1 & 2 were mapped to en3 & en4, LANs
3 & 4 to en1 & en2.

It then did the same incorrect (but different) remapping after a warm
reboot.

So, we punted and replaced the dual Intel NICs card with an SMC single
NIC card (Net.epic driver). And, everything came back up correctly.

One curiosity, Tcpip apparently doesn’t even see the Intel
Net.ether82557 LANs (1 & 2) now. I suspect it’s a timing thing.
Net.epic initialized faster than either of the Net.ether82557 drivers
and was the only thing registered with Net when Tcpip did its
initialization. In other words, if I were to restart Tcpip now, it
would probably find the other two LANs. But, I’m not about to go there now.
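
If the timing suspicion above is right, one workaround would be to hold off starting Tcpip until every expected logical LAN shows up in `netinfo -l`. The sketch below is Python and purely illustrative: the function names are hypothetical, `read_netinfo` is any callable that returns the current `netinfo -l` text (how that is obtained on a given node is left open), and whether Tcpip really only picks up drivers registered at its startup is an assumption, not something confirmed in this thread.

```python
import re
import time


def registered_lans(netinfo_text):
    """Return the set of logical LAN numbers listed in netinfo -l output."""
    return {int(n) for n in re.findall(r"Logical Net\s+(\d+)", netinfo_text)}


def wait_for_lans(read_netinfo, expected, timeout=10.0, interval=0.5,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until all expected logical LANs are registered, or time out.

    Returns True when every LAN in `expected` is present, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        missing = set(expected) - registered_lans(read_netinfo())
        if not missing:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A startup script could call `wait_for_lans(..., {1, 2, 3})` and only launch Tcpip once it returns True. The same effect could presumably be had with fixed sleeps between driver starts, as suggested earlier in the thread, but polling avoids guessing at a delay.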

We are in fact going to be changing our (near) future systems
architecture so that our primary nodes are no longer dependent on
actually running Tcpip. We’ll offload that task to support nodes whose
sole purpose will be to run Tcpip, i.e. be Fleet to TCP/IP gateways.

One final note… In researching this problem, I noticed that ifconfig
for other OSs has the ability to update/correct MAC/ARP entries for
local drivers/interfaces via the lladdr option. That would have
certainly been helpful here. Or, alternatively, some way to manually
configure/initialize Tcpip correctly, since the “automated”
initialization certainly seems, uhm, flawed.

But, that’s water over the dam, I’m moving on…

-Rob


Rob Hem wrote:

The saga continues to become stranger…

I just finished testing LANs 3 & 4 running Fleet on crossover cables to
the test node. Everything works fine.

What could possibly be causing Tcpip to get this all backwards and just
plain not work?

BTW, I’ve pieced together that a cold reboot is definitely what we had
to do to get this working the last time. It was on Apr 8th. I didn’t
do much investigation back then, but knowing what I do now, I have no
idea why that worked. Back then, everyone involved just assumed it
might be a hardware problem and we were all set to swap out any suspect
parts. But, now, I don’t know how it could be a hardware problem.

We’ve scheduled a systems maintenance window for this Sunday morning,
and we will be trying a cold reboot (and anything else we can think of)
to see if we can get Tcpip working on production node 1 again. I’ll let
you all know how it turns out this time.

-Rob

Rob Hem wrote:

Indeed they are, and correctly! That’s what is so perplexing.
Everything except Tcpip has got it right.

Here’s the netmap listing (for just that node)…

Logical  Lan  Physical       TX Count  Last TX Fail Time

      1    1  003048 2876F7  ;      0
      1    2  003048 2876F6  ;      0
      1    4  0002B3 EC3B1C  ;      0
      1    3  0002B3 EC3B1D  ;      0

It matches the netinfo -l output from one of my previous posts. And,
the MAC addresses for LANs 3 & 4 match those written on the dual-port
PCI card.

And, here’s what Tcpip thinks they are (via ifconfig):

en4: flags=8822<BROADCAST,NOTRAILERS,SIMPLEX,MULTICAST> mtu 1500
address: 00:30:48:28:76:f7
en3: flags=8822<BROADCAST,NOTRAILERS,SIMPLEX,MULTICAST> mtu 1500
address: 00:30:48:28:76:f6
en1: flags=8822<BROADCAST,NOTRAILERS,SIMPLEX,MULTICAST> mtu 1500
address: 00:02:b3:ec:3b:1c
en2: flags=8863<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST> mtu 1500
address: 00:02:b3:ec:3b:1d
inet 192.168.1.1 netmask 0xffffff00 broadcast 192.168.1.255
lo0: flags=8009<UP,LOOPBACK,MULTICAST> mtu 32976
inet 127.0.0.1 netmask 0xff000000

Completely backwards! FYI… en2 has the TCP/IP interface designated
as UP because I tried to get that NIC up (it should be en3, if
everything was correct). It didn’t work though. I’m going to try
hooking up the test system to the production node via LANs 3 & 4 with
crossover cables, and then try to talk Fleet, just to see if the NICs
& Net.drivers are actually working.
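
To make the mismatch explicit, here is a minimal Python sketch that cross-checks the two listings above. It rests on the assumption made throughout this thread that enN should carry the MAC address of logical LAN N. The function names (`parse_netmap`, `parse_ifconfig`, `mismatches`) are made up for illustration and are not part of any QNX tool; the input text is simply pasted from the netmap and ifconfig output in this post.

```python
import re

NETMAP = """\
1 1 003048 2876F7 ; 0
1 2 003048 2876F6 ; 0
1 4 0002B3 EC3B1C ; 0
1 3 0002B3 EC3B1D ; 0
"""

IFCONFIG = """\
en4: flags=8822<BROADCAST,NOTRAILERS,SIMPLEX,MULTICAST> mtu 1500
address: 00:30:48:28:76:f7
en3: flags=8822<BROADCAST,NOTRAILERS,SIMPLEX,MULTICAST> mtu 1500
address: 00:30:48:28:76:f6
en1: flags=8822<BROADCAST,NOTRAILERS,SIMPLEX,MULTICAST> mtu 1500
address: 00:02:b3:ec:3b:1c
en2: flags=8863<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST> mtu 1500
address: 00:02:b3:ec:3b:1d
inet 192.168.1.1 netmask 0xffffff00 broadcast 192.168.1.255
lo0: flags=8009<UP,LOOPBACK,MULTICAST> mtu 32976
inet 127.0.0.1 netmask 0xff000000
"""


def parse_netmap(text):
    """Return {lan_number: mac} from rows of 'node lan mac_hi mac_lo ; count'."""
    lans = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0].isdigit() and fields[1].isdigit():
            lans[int(fields[1])] = (fields[2] + fields[3]).lower()
    return lans


def parse_ifconfig(text):
    """Return {interface_name: mac} for the enN interfaces only."""
    ifaces = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"\s*([a-z]+\d+):\s*flags=", line)
        if header:
            current = header.group(1)
            continue
        addr = re.match(r"\s*address:\s*([0-9a-f:]+)", line, re.IGNORECASE)
        if addr and current and current.startswith("en"):
            ifaces[current] = addr.group(1).replace(":", "").lower()
    return ifaces


def mismatches(lans, ifaces):
    """List (lan, expected_if, actual_if) wherever LAN N is not on enN."""
    bad = []
    for lan, mac in sorted(lans.items()):
        actual = next((name for name, m in ifaces.items() if m == mac), None)
        expected = f"en{lan}"
        if actual != expected:
            bad.append((lan, expected, actual))
    return bad


if __name__ == "__main__":
    for lan, expected, actual in mismatches(parse_netmap(NETMAP),
                                            parse_ifconfig(IFCONFIG)):
        print(f"LAN {lan}: expected {expected}, Tcpip reports {actual}")
```

Run against the listings above, this flags all four LANs, with each one landing on the interface number at the opposite end of the range.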

This is a bit of a stretch, but could this possibly be some sort of a
fence post thingie because this is on node 1? Node 1 is “special” in
the netboot world??

Stay tuned.

-Rob

Joe Mammone wrote:

I believe the TCP/IP stack also looks at the netmap file to get some
info about the interface. Are your other two cards “registered” in
the netmap file?

Regards,

Joe
