Occasional missed deadlines over qnet

Hello:

In our vehicle training simulators, each driving station has a three-node
QNX qnet network. From time to time, the processes on node 3 miss their
30 Hz deadline by a lot, about 500 msec in total. Here is a snippet from
node 3's netinfo for a typical incident:

10:14:06 2 (2767) tulip (irq) Tx Underflow, adapting…
10:14:06 2 (2768) tulip (irq) Fatal Bus Error, requesting RESET
10:14:06 2 (2778)
10:14:06 2 00A0CC 592B51 (2707) tulip ( tx) timeout (no nack)
10:14:06 2 Status 12396 (2707) tulip ( tx) timeout (no nack)
10:14:06 2 Status 7343 (2707) tulip ( tx) timeout (no nack)
10:14:06 2 Status 26547 (2707) tulip ( tx) timeout (no nack)
10:14:06 2 Status 4 (2707) tulip ( tx) timeout (no nack)

This appears to be associated with messages being sent to our executive
program, which is on node 1. Here is a typical corresponding snippet
from node 1:

10:56:05 2 00A0CC 58FFA2 (2707) tulip ( tx) timeout (no nack)
10:56:05 2 Status 0 (2707) tulip ( tx) timeout (no nack)
10:56:05 2 Status 2795 (2707) tulip ( tx) timeout (no nack)
10:56:05 2 Status 32768 (2707) tulip ( tx) timeout (no nack)
10:56:05 2 Status 1 (2707) tulip ( tx) timeout (no nack)
10:56:05 2 00A0CC 58FFA2 (2707) tulip ( tx) timeout (no nack)
10:56:05 2 Status 0 (2707) tulip ( tx) timeout (no nack)
10:56:05 2 Status 2795 (2707) tulip ( tx) timeout (no nack)
10:56:05 2 Status 32768 (2707) tulip ( tx) timeout (no nack)
10:56:05 2 Status 1 (2707) tulip ( tx) timeout (no nack)
10:56:05 2 00A0CC 58FFA2 (2700) tulip ( tx) retry/ack timeout
10:56:05 0 Status 2 ( 33) NET ( tx) no more alternate drivers
10:56:05 0 Status 2 ( 7) NET ( tx) failed (vc_attach ctrl pkt)

At first we thought this indicated a hardware problem, but then we noticed
the same sort of thing happening on driving station 1 (the snippets
above are from driving station 2). What are the possible interpretations
or causes of this behavior? BTW, what is net traceinfo 2778?

Here is sin ver for node 1:

PROGRAM NAME VERSION DATE
sys/Proc32 Proc 4.24K Mar 26 1998
sys/Proc32 Slib16 4.23G Oct 04 1996
sys/Slib32 Slib32 4.24B Aug 12 1997
/bin/Fsys Fsys32 4.24T Feb 26 1999
/bin/Fsys Floppy 4.24B Aug 19 1997
/bin/Fsys.eide eide 4.24N Nov 18 1998
//1/bin/Dev32 Dev32 4.23G Oct 04 1996
//1/bin/Dev32.ansi Dev32.ansi 4.23H Nov 21 1996
//1/bin/Dev32.ser Dev32.ser 4.23I Jun 27 1997
//1/bin/Dev32.par Dev32.par 4.23G Oct 04 1996
//1/bin/Dev32.pty Dev32.pty 4.23G Oct 04 1996
//1/bin/Dev32.pty Dev32.pty 4.23G Oct 04 1996
//1/bin/Dev32.pty Dev32.pty 4.23G Oct 04 1996
//1/bin/Dev32.pty Dev32.pty 4.23G Oct 04 1996
//1/bin/Mouse Mouse 4.24A Aug 22 1997
//1/bin/Iso9660fsys Iso9660fsys 4.23B Jun 10 1998
//1/bin/Net Net 4.25B Jul 27 1998
//1/bin/Net.tulip Net.tulip 4.25M Jan 25 1999
//1/bin/Net.tulip Net.tulip 4.25M Jan 25 1999
//1/*/usr/ucb/Tcpip Tcpip 5.00X Jul 30 1999
//1/bin/SMBfsys SMBfsys 1.30H Feb 09 1999
//1/bin/Pipe Pipe 4.23A Feb 26 1996
//1/bin/Audio Audio 4.23A Apr 17 1997
//1/bin/Audio Audio 4.23A Apr 17 1997

and for node 3

PROGRAM NAME VERSION DATE
sys/Proc32 Proc 4.25J Sep 09 1999
sys/Proc32 Slib16 4.23G Oct 04 1996
sys/Slib32 Slib32 4.24B Aug 12 1997
/bin/Fsys Fsys32 4.24T Feb 26 1999
/bin/Fsys Floppy 4.24B Aug 19 1997
/bin/Fsys.eide eide 4.24Q Jun 28 1999
//3/bin/Dev32 Dev32 4.23G Oct 04 1996
//3/bin/Dev32.ansi Dev32.ansi 4.23H Nov 21 1996
//3/bin/Dev32.ser Dev32.ser 4.23I Jun 27 1997
//3/bin/Dev32.par Dev32.par 4.23G Oct 04 1996
//3/bin/Dev32.pty Dev32.pty 4.23G Oct 04 1996
//3/bin/Mouse Mouse 4.24A Aug 22 1997
//3/bin/Iso9660fsys Iso9660fsys 4.23B Aug 30 1999
//3/bin/Pipe Pipe 4.23A Feb 26 1996
//3/bin/Net Net 4.25C Aug 30 1999
//3/bin/Net.tulip Net.tulip 4.25Q Aug 30 1999
//3/*/usr/ucb/Tcpip Tcpip 5.00X Jul 30 1999

ping

Previously, Dean Douthat wrote in qdn.public.qnx4:

10:14:06 2 (2767) tulip (irq) Tx Underflow, adapting…

The fact that you got a Tx Underflow says that there is something wrong
on the PCI bus. Do you have other devices running on the PCI bus? The
driver couldn’t recover, so it reset the adapter.

Hugh Brown (613) 591-0931 ext. 209 (voice)
QNX Software Systems Ltd. (613) 591-3579 (fax)
175 Terence Matthews Cres. email: hsbrown@qnx.com
Kanata, Ontario, Canada.
K2M 1W8

Hugh Brown wrote:

The fact that you got a Tx Underflow says that there is something wrong
on the PCI bus. Do you have other devices running on the PCI bus? The
driver couldn’t recover, so it reset the adapter.

This is very strange because there is really nothing else on the PCI
bus. This is a motherboard with built-in video, but there is no screen
attached and there shouldn’t be anything being written to the video.
Other than the video, there is nothing else on PCI. This node is pretty
strictly a number cruncher and has pretty much nothing in the way of
peripherals. It communicates by 100BaseT Ethernet using FLEET. The only
thing unusual is that node 3 is a fast machine (733 MHz). It does
have an EIDE disk, but this disk is only used for booting and program
load; it is idle during realtime operation.

I see from your sin output that you have different versions of Net.tulip
on your machines. Node 3 has a later version than node 1. Have you tried
the later version on node 1 (4.25Q)?

Hugh Brown (613) 591-0931 ext. 209 (voice)
QNX Software Systems Ltd. (613) 591-3579 (fax)
175 Terence Matthews Cres. email: hsbrown@qnx.com
Kanata, Ontario, Canada.
K2M 1W8

Hmm. OK, we’ll try changing the tulip drivers. More to come …

We moved all nodes to current versions and have the same results. We
also reverted to an older version of our application and saw the problem
go away. So, as suspected, the problem lies neither with hardware nor
with OS software but rather with our software. What advice can anybody
give as to where to look? In particular, where I’m stumped is: what
application software error could possibly cause a bus error?

Hi Dean,

The version of Net.tulip on my system has a “-s” option to force
“Store/Forward Mode”. Have you tried running Net.tulip with this option?

From your netinfo log, the trouble seems to start with a “Tx Underflow”,
which suggests that something else was hogging the PCI bus while the poor
100TX card was trying to read data from memory for a transmit packet. The
“Store/Forward Mode” should force the card to load the entire packet into
its local buffer memory before starting to transmit, thereby avoiding the
possibility of a “Tx Underflow”.
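
To put rough numbers on the mechanism described above, here is a minimal
C sketch of the timing involved. It only does arithmetic on the 100 Mbit/s
line rate mentioned in this thread; the 128-byte transmit threshold in the
usage note and the helper names wire_time_us and print_underflow_budget are
illustrative assumptions, not values or functions taken from Net.tulip or
from the card's data sheet.

```c
#include <stdio.h>

/* Hypothetical helper: microseconds the wire needs to send `bytes`
 * at `mbit_per_s`. 1 Mbit/s is 1 bit per microsecond, so the result
 * is simply bits / (bits per microsecond). */
double wire_time_us(unsigned bytes, double mbit_per_s)
{
    return (bytes * 8.0) / mbit_per_s;
}

/* Print the two figures that matter for a Tx underflow at 100 Mbit/s. */
void print_underflow_budget(unsigned threshold_bytes, unsigned frame_bytes)
{
    /* Cut-through: transmission starts once `threshold_bytes` are in
     * the FIFO; if DMA then stalls for longer than this, the FIFO can
     * run dry in the middle of the frame. */
    printf("stall budget after threshold: %.2f us\n",
           wire_time_us(threshold_bytes, 100.0));
    /* Store-and-forward: the whole frame is buffered first, so a DMA
     * stall delays the start of transmission instead of starving it. */
    printf("full frame on the wire:       %.2f us\n",
           wire_time_us(frame_bytes, 100.0));
}
```

For example, print_underflow_budget(128, 1514) reports roughly 10 us for
the assumed threshold against roughly 121 us for a full-size frame, which
is why a fairly short bus stall could be enough to trigger an underflow
when the card is not in store-and-forward mode.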

Maybe that’s not it, but it would be worth a try.

Regards,

Bert Menkveld
Engineer
Corman Technologies Inc.

Hi. I’ve been working with Dean on this one. Adding the “-s” option
did eliminate the hitches we had been seeing - thanks a lot. Of course,
we’re still no closer to knowing the underlying cause, but for now this
is a great work-around.

There really isn’t a whole lot on the PCI bus - just a video card that,
in theory, has nothing to display, and this one ethernet card (which
does see some pretty heavy activity). It is an EIDE system, but there
shouldn’t be much (if any) disk activity while we’re running. Any clues
on what to look for next?

Again, thanks for the tip.

Josh Hamacher
FAAC Incorporated



Hmmm, do you have any ISA bus cards? I don’t really know, but I suppose
it’s possible that activity on the ISA bus would also tie up the PCI bus.

What kind of motherboard is this? If it’s an older one, it’s just possible
that it has a poor PCI bus implementation (we’ve got an excellent example of
an early Pentium motherboard that demonstrates just about anything you’d
like to go wrong on a PCI bus… :-) ).

Oh, and have you checked the BIOS settings for the PCI bus clock?

If you can’t track down the problem, just running with the “-s” option may
not be such a bad solution. It theoretically degrades network performance
somewhat, but I don’t know if you would ever notice the difference.

Regards,

Bert Menkveld
Engineer
Corman Technologies Inc.

Hello. In order to provide some closure on this, I’ll bring everyone up
to date.

The hitches are gone (we think). It turns out that some file
descriptors were being inherited that I wasn’t aware of. While I’m
still not entirely sure why this caused the problems, once I cleaned all
of this up the hitches seemed to go away (although the system hasn’t
been strenuously exercised yet). A file descriptor was staying open
from the node 3 vehicle model to node 1’s Dev32, and this seemed to be
the root of the problem (the fd seemed to be stderr). The vehicle model
was also trying to dump out error messages at about 5000 Hz - a pretty
punishing task in and of itself. So we also fixed that.
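
The two kinds of cleanup described above can be sketched in C as follows.
This is a minimal sketch assuming a POSIX-style environment; set_cloexec,
log_limiter and limiter_should_emit are hypothetical names, and this thread
does not say how the real code was changed or which process-creation call
passed the descriptor along. The first helper marks a descriptor
close-on-exec so that a program started later with exec does not inherit it
silently; the second throttles a message source so that a burst at 5000 Hz
cannot flood stderr.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: mark `fd` close-on-exec so it is not passed on
 * to a program started later with exec. Returns 0 on success, -1 on error. */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1 ? -1 : 0;
}

/* Hypothetical rate limiter: at most one message per min_interval_us. */
typedef struct {
    long min_interval_us;     /* minimum spacing between messages */
    long last_emit_us;        /* time of the last message let through */
    unsigned long suppressed; /* messages dropped so far */
    int primed;               /* 0 until the first message is emitted */
} log_limiter;

/* Returns 1 if a message may be written at time `now_us`, else 0.
 * The caller supplies the time, so any clock source can be used. */
int limiter_should_emit(log_limiter *l, long now_us)
{
    if (!l->primed || now_us - l->last_emit_us >= l->min_interval_us) {
        l->primed = 1;
        l->last_emit_us = now_us;
        return 1;
    }
    l->suppressed++;
    return 0;
}
```

With min_interval_us set to 100000, a caller that tries to log every
200 us (5000 Hz) for one second gets 10 messages through and has 4990
suppressed; the suppressed count can be reported in the next message that
is let through.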

Thanks to everyone that responded.

Josh Hamacher
FAAC Incorporated

