System freeze

Our system is experiencing occasional complete failures.
I have a simple hardware test program which I left running
overnight. The program loops continuously writing 64k blocks
to PCI boards, reading back and checking for mismatches.
Any mismatch causes a printf to occur. The program is
single-threaded.

No mismatches are found, but after many hours (38 and 16
for the two tries so far) the system freezes. I’ve seen
this happen both when connected via a serial terminal to
/dev/ser1 and when using telnet (PuTTY). In the former
case (ser1), login to the other serial port (ser2) is not
possible and the host cannot be pinged. In the latter,
the PuTTY terminal does not lose its connection but again
the host cannot be pinged and I cannot login to the serial
port.

The setup is QNX 6.2.0 x86 installed from the distribution CD.
We start inetd, fs-nfs2 and syslogd from /etc/rc.d/rc.local,
plus a few NFS mounts (on which the executable is located).
There is a local hard disk. After it boots, I just run my
test program.

The hardware is a stack of PC104 boards.

Can anyone suggest something to help find this? Could a
failed disk or network access by some system task cause
the system to freeze? Are there PCI conditions which can
cause a freeze?

Thanks for any replies
William Morris

“William Morris” <william@bangel.demon.co.uk> wrote in message
news:3E3FE6B1.8287B80A@bangel.demon.co.uk

The hardware is a stack of PC104 boards.

Can anyone suggest something to help find this? Could a
disk or network access by some system task failing cause
the system to freeze? Are there PCI conditions which can
cause a freeze?

So it’s a PC104+ stack. Some PC104 setups won’t work reliably without
termination on the bus.


Mario Charest wrote:

The hardware is a stack of PC104 boards.

So it’s a PC104+,

You are right, the CPU is a PC104+ board. The stack
contains 3 PCI devices which I am testing, built around
the PCI2040 chip. It also contains a number of
PC104 (not +) boards.

… some PC104 setup won’t work reliable without a
termination on the bus

Could a lack of correct PCI termination cause the
system to freeze or just corrupt data transfers?

Thanks
William

William Morris <william@bangel.demon.co.uk> wrote:

Could a lack of correct PCI termination cause the
system to freeze or just corrupt data transfers?

If you get corrupted data across your bus… well, just about
anything could happen…

Maybe an address line got corrupted instead of a data line…

-David

QNX Training Services
http://www.qnx.com/support/training/
Please followup in this newsgroup if you have further questions.

David Gibbs wrote:

William Morris <william@bangel.demon.co.uk> wrote:
Could a lack of correct PCI termination cause the
system to freeze or just corrupt data transfers?

If you get corrupted data across your bus… well, just about
anything could happen…

Maybe an address line got corrupted instead of a data line…

I understand your point, I think; as the system
devices are on the same bus as the external PCI
bus connector with no bridge chip separating them,
they are exposed to external PCI bus faults. An
address fault could cause a write to any device
on the PCI bus, of which there are: PIIX4 (ISA/IDE/
USB/Power), Display and Ethernet. Such a write
fault could cause such a device to do something
that upsets QNX and locks the system (such as
interrupting continuously?).

It does seem strange that, if such faults are
occurring, they go straight for the most inconvenient
spots that hang the system, instead of showing up
as the read/write faults the test program is looking
for. Or am I missing something? Are system freezes
often attributable to faults of this kind?

It also seems a poor design to have all system
components exposed in this way. I dislike PC104
for physical and practical reasons, but, assuming
the design is common to such boards, it seems like
a good reason to avoid PC104+.

Is there a way to find such problems? Are there
PC104+ PCI bus analysers that will detect bus
problems of this sort?

Could someone tell me what the normal method of
terminating a PC104+ PCI bus is? I cannot find
any references specifically to PCI termination on
Google. Do termination modules exist, and should
we have one at the end of the PCI part of the
stack? Or at both ends?

Thanks for the help
William

Hi,

I have had a similar experience. When I run a test procedure in a pterm
window that sends out a PROFIBUS-FDL packet every millisecond (CPU load
100%), the system locks up if I create additional CPU load with a telnet
connection from a remote node. No other resource is at 100% usage.

Armin

