QNX4 lockup with PLX based card under network load

[re-posting this qdn — on c.o.q i got only one reply …]

Hi QNXers,

we have a problem with a strange lockup of the following configuration:
it’s a Compact-PCI system with an AMD K6 CPU and a Tulip Ethernet chip;
the OS is QNX 4.25 with the TCP runtime and the newest patches (installed
from quics /updates last week). It also has a custom-made output board
(designed specially for this application by another third party) with
a PLX 9080 as its PCI interface. The machine feeds this board with
small chunks (256 kB) of data, which it reads from a large file (up to
several GB). This is the absolutely time-critical task. Our software
is designed as follows:

(1) ISR (IRQ is generated by PLX when its output buffer is less than
    half-full):
      clear IRQ
      return irqproxy

(2) user mode driver:
      Receive(irqproxy)
      transfer 256kB of data from memory buffer to PLX via DMA
      if memory buffer less than half-full
          Trigger(bufferproxy)

(3) another task, made to de-couple (2) from harddisk delays:
      setup shared memory with (2)
      Receive(bufferproxy)
      read next large (5MB) chunk from HD into shm
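
The half-full trigger between (2) and (3) can be sanity-checked with a small model. The following is only a rough sketch in Python, not the actual driver: time is counted in board IRQs, the shared-memory buffer size (BUF_SIZE) is an assumption because the post does not state it, and the function name and the refill latency parameter are made up for illustration. It only shows when the buffer would underrun if the disk task is delayed; it says nothing about the lockup itself.

```python
CHUNK = 256 * 1024           # bytes sent to the board per IRQ (from the post)
REFILL = 5 * 1024 * 1024     # bytes read from disk per refill (from the post)
BUF_SIZE = 4 * REFILL        # ASSUMPTION: shm buffer size, not given in the post


def simulate(irqs, refill_latency_irqs):
    """Model `irqs` board interrupts.

    A refill requested during IRQ n becomes available at IRQ
    n + refill_latency_irqs (hypothetical parameter standing in for
    harddisk and scheduling delays). Returns (underruns, refills).
    """
    fill = BUF_SIZE          # start with a full buffer
    pending = None           # IRQ number at which the outstanding refill lands
    underruns = 0
    refills = 0
    for n in range(irqs):
        # (3) disk task: an outstanding refill completes
        if pending is not None and n >= pending:
            fill = min(BUF_SIZE, fill + REFILL)
            pending = None
            refills += 1
        # (2) driver: one 256 kB DMA transfer per IRQ proxy
        if fill >= CHUNK:
            fill -= CHUNK
        else:
            underruns += 1   # nothing to send: the board would starve
        # (2) driver: Trigger(bufferproxy) when less than half-full
        if fill < BUF_SIZE // 2 and pending is None:
            pending = n + refill_latency_irqs
    return underruns, refills
```

With these assumed sizes, a refill that lands within 10 IRQs keeps the buffer from ever running empty, while a refill that takes 100 IRQs lets it drain completely between refills.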

This works fine until there is high network load on this machine.
Currently we can trigger it with a flood-ping (synthetic test) or by
ftp’ing a large file onto that machine (a task which the machine must
handle in reality: while one large file is sent out via PLX, the user
downloads the next file to the harddisk!). Sooner or later (between 1 s
and 2 minutes; unfortunately it’s only reproducible by trying long
enough…) the machine completely locks up (which means i can still enter
e.g. 'sin' at the console prompt, but it never returns). So i can’t
analyse the machine after the lockup, dumper doesn’t write any file,
and i suppose even a dejaview via network would hang.

We have tried nearly everything (well, at least what i can think of):
setting the priority of (2) and (3) high (even above Proc32) or low
(below 10, hoping to get a shell back), decoupling (2) completely from
the harddisk (by cyclic usage of data from memory), putting small
loops into (1) (from several tests i had the impression that the OS
jumps into the ISR faster than the PLX can clear the IRQ on its local
side), clearing the PLX’s complete IRQ Control & Status register
(INTCSR), ignoring IRQs which come ‘too soon’, etc.

So finally my question is whether anybody has had the same problems with
a K6-based board / Tulip Ethernet / QNX 4 / TCP/RT / PLX9080, or has any
idea what i can test to track down the problem?

Any hints welcome & TIA,
Edelhard

s o f t w a r e m a n u f a k t u r — Software, that fits!
OO-Realtime Automation from Embedded-PCs up to distributed SMP Systems
info@software-manufaktur.de URL: http://www.software-manufaktur.de/
Fon: ++49+7073/50061-6, Fax: -5, Gaertnerstrasse 6, D-72119 Entringen

“Edelhard Becker” <ebecker@software-manufaktur.de> wrote in message
news:slrn9ltfd9.3nm.ebecker@fram.software-manufaktur.de


> This works fine until there is high network load on this machine.
> Currently we can trigger it with a flood-ping (synthetic test) or by
> ftp’ing a large file onto that machine (a task which the machine must
> handle in reality: while one large file is sent out via PLX, the user
> downloads the next file to the harddisk!). Sooner or later (between 1 s
> and 2 minutes; unfortunately it’s only reproducible by trying long
> enough…) the machine completely locks up (which means i can still enter
> e.g. 'sin' at the console prompt, but it never returns).

If you can type sin (and see it echoed), that means the machine is
not stuck in an interrupt and the CPU is still running. It’s possible
the filesystem has somehow crashed and you can’t access any file;
that could explain why sin doesn’t return. Can you try doing
a sin over the network?

I have seen lots of nasty things with custom-made boards using DMA
(DMA isn’t very forgiving); once I saw network packets end up
in the blocks of a SCSI harddisk ;-(


Hi Mario,

On Wed, 25 Jul 2001 10:00:29 -0400, Mario Charest
<mcharest@zinformatic.com> wrote:

> “Edelhard Becker” <ebecker@software-manufaktur.de> wrote in message
> news:slrn9ltfd9.3nm.ebecker@fram.software-manufaktur.de …
>> […] the machine completely locks up (which means, i can still
>> enter e.g. 'sin' at the console prompt, but it never
>> returns).

> If you can type sin (and see it echoed), that means the machine is
> not stuck in an interrupt and the CPU is still running. It’s
> possible the filesystem has somehow crashed and you can’t access any
> file; that could explain why sin doesn’t return.

I can also press Shift-CTRL-ALT-DEL to shut down the machine. When i run
chkfsys after the reboot, usually the only error reported is that my
“large file” is still marked open. I have never had any loss or damage
of data.

> Can you try doing a sin over the network?

Maybe that’s one of the next tests i’ll do at our customer’s site.
Unfortunately they have only one QNX system, so i have to transport our
system. The lockup also occurred when running all our processes below
priority 10, so i don’t think i’ll get a reply via network.

> I have seen lots of nasty things with custom-made boards using DMA
> (DMA isn’t very forgiving); once I saw network packets end up in
> the blocks of a SCSI harddisk ;-(

Yes, the DMA stuff was somewhat tricky. But as long as there’s little
or no (TCP/IP) network traffic, everything runs fine for hours, and the
data of the large file is spooled correctly via the board out of the
machine.

It also can’t be a resource conflict (DMA, IRQ, …) between Net(.tulip)
and the custom board, otherwise the lockup should arise immediately?!
But sometimes our processes + ftp download ran fine together for half a
minute before locking…

[…]
Thanks & greetings,

Edelhard


“Edelhard Becker” <ebecker@software-manufaktur.de> wrote in message
news:slrn9lu3sp.5k9.ebecker@fram.software-manufaktur.de


> Maybe that’s one of the next tests i’ll do at our customer’s site.
> Unfortunately they have only one QNX system, so i have to transport our
> system. The lockup also occurred when running all our processes below
> priority 10, so i don’t think i’ll get a reply via network.

If the file system software is dead, I suspect you will get a reply. I
don’t believe this has to do with priority, but rather that the
filesystem isn’t responding.


> It also can’t be a resource problem (DMA, IRQ, …) between Net(.tulip)
> and the custom board, or the lockup should arise immediately?!

I’m not thinking about a conflict but rather illegal signal timing on
the bus that could corrupt disk info as it’s being moved from the disk
to memory, confusing the heck out of Fsys. Just a wild guess.

It could even be a weak power supply.
