Kernel fault (fault in transfer?)

Hi all

My QNX4 system is based on a standard desktop PC (node 1) and one or
more cPCI PC boards (nodes 1…5). The cPCI nodes have no disk, and
boot from node 1 via bootp; after kernel startup only Qnet is used to
reach them to launch processes and perform IPC with servers running on
node 1.

In one particular test scenario, the cPCI node hangs after a few
hours, showing the following kernel fault message:

[bootp load message]
1741MHZ 686/687 PCI bus boot modules:
sys/Proc32
sys/Slib32
sys/Slib16
/bin/Net
/bin/Net.i82540
/bin/sinit
starting QNX…
fault in transfer

Version 425.O Aug 19 2002 Technical support…
kernel fault ldt 0 fault 6+0
cs:eip=f0:1a47 ss:esp=f8:1438 efl=12246 ds=0 es=0 fs=0 gs=0
eax/ef2 ebx/0 ecx/0 edx/1a470 esi/1a470 edi/1 ebp/1438
Stack (f8:13fc)
00000000 00000000 00000000 00000000 00000001 0001a470 00001438 000142c
00000000 0001a470 00000000 00000ef2 0001a470 000000f0 00012246 0000000
ffffffff 00000166 00000000 00000118 00000000 00000000 0000000d 000001d
00000001 00000001 000014bc 0000147c 00004c30 00000180 00000001 000e000
00003dba 00000014 00002202 0001001d 00000000 0002a560 000031e8 000001d
00000118 00000ea8 000000f0 00000000 0000000a 0000010a 0000000a 0000130
00000000 00000000 000000f8 000000f8 00000000 0002a560 00000000 00014ec
00000000 00000002 00000000 0002a618 0000337d 000000f0 00002246 0000000

The test is a simple loop that starts a process on node 2 (the
one that hangs), which does some IPC with a custom driver for an HDLC
controller (some traffic is received/transmitted on channels); every 2
seconds the process is slayed and restarted. The processor
load is not visible using ‘sac’ (the micro is a P4).
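
The start/slay cycle described above can be sketched roughly as follows. This is only a generic illustration in Python on a POSIX-like host, not the actual QNX4 test: the function name `restart_loop` and the placeholder child command are hypothetical, and `proc.kill()` merely stands in for slaying the process.

```python
import subprocess
import sys
import time


def restart_loop(cmd, interval_s=2.0, iterations=3):
    """Start cmd, let it run for interval_s seconds, kill it, and repeat.

    Returns the list of exit codes reported for the killed children.
    """
    exit_codes = []
    for _ in range(iterations):
        proc = subprocess.Popen(cmd)   # start the process under test
        time.sleep(interval_s)         # let it do its IPC for a while
        proc.kill()                    # rough stand-in for 'slay'
        exit_codes.append(proc.wait()) # reap it before the next round
    return exit_codes


if __name__ == "__main__":
    # Placeholder child that just sleeps; the real test talks to the HDLC driver.
    child = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(restart_loop(child, interval_s=0.2, iterations=3))
```

Each round creates and destroys a process, so whatever goes wrong is exercised once per cycle rather than by sustained CPU load.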

The cPCI board is an Advantech mic3369.

The same test (and other tests that are far more CPU-intensive) runs with no
problems on another cPCI PC board, a VMIC 7753 (P3). The only difference
between the software running on the two boards is the network card driver
(Net.i82540 on Advantech, Net.ether82557 on VMIC).

Any ideas?

Thanks in advance.
Davide

Davide Ancri wrote:

> Hi all
>
> My QNX4 system is based on a standard desktop PC (node 1) and one or
> more cPCI PC boards (nodes 1…5). The cPCI nodes have no disk, and
> boot from node 1 via bootp; after kernel startup only Qnet is used to
> reach them to launch processes and perform IPC with servers running on
> node 1.
>
> In one particular test scenario, the cPCI node hangs after a few
> hours, showing the following kernel fault message:
>
> [bootp load message]
> 1741MHZ 686/687 PCI bus boot modules:
> sys/Proc32
> sys/Slib32
> sys/Slib16
> /bin/Net
> /bin/Net.i82540
> /bin/sinit
> starting QNX…
> fault in transfer
>
> Version 425.O Aug 19 2002 Technical support…
> kernel fault ldt 0 fault 6+0
> cs:eip=f0:1a47 ss:esp=f8:1438 efl=12246 ds=0 es=0 fs=0 gs=0
> eax/ef2 ebx/0 ecx/0 edx/1a470 esi/1a470 edi/1 ebp/1438
> Stack (f8:13fc)
> 00000000 00000000 00000000 00000000 00000001 0001a470 00001438 000142c
> 00000000 0001a470 00000000 00000ef2 0001a470 000000f0 00012246 0000000
> ffffffff 00000166 00000000 00000118 00000000 00000000 0000000d 000001d
> 00000001 00000001 000014bc 0000147c 00004c30 00000180 00000001 000e000
> 00003dba 00000014 00002202 0001001d 00000000 0002a560 000031e8 000001d
> 00000118 00000ea8 000000f0 00000000 0000000a 0000010a 0000000a 0000130
> 00000000 00000000 000000f8 000000f8 00000000 0002a560 00000000 00014ec
> 00000000 00000002 00000000 0002a618 0000337d 000000f0 00002246 0000000

Something strange happened (to state the obvious). The OS took an
invalid opcode fault on what should have been an SBB instruction. Even
stranger, this is actually the second time the CPU executed that opcode
(i.e. the CPU successfully executed an earlier one two opcodes
away). From that I have to conclude that the opcode was no longer an
SBB but garbage.
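
As a rough illustration of how the fault line and flags in the dump can be read, here is a small Python sketch. It assumes the number after "fault" is the x86 exception vector (consistent with the invalid opcode reading above, since vector 6 is #UD) and that efl is the EFLAGS register printed in hex; the meaning of the "+0" suffix is not decoded. The helper names are hypothetical.

```python
import re

# A few x86 exception vectors (from the Intel architecture manuals).
X86_EXCEPTIONS = {
    0: "#DE divide error",
    6: "#UD invalid opcode",
    13: "#GP general protection",
    14: "#PF page fault",
}

# Named single-bit EFLAGS fields; IOPL (bits 12-13) is handled separately.
EFLAGS_BITS = {
    0: "CF", 2: "PF", 4: "AF", 6: "ZF", 7: "SF", 8: "TF",
    9: "IF", 10: "DF", 11: "OF", 14: "NT", 16: "RF", 17: "VM",
}


def parse_fault_vector(line):
    """Return the number following 'fault' (assumed to be the vector)."""
    match = re.search(r"fault (\d+)\+", line)
    return int(match.group(1)) if match else None


def decode_eflags(value):
    """Return (names of set flag bits in bit order, IOPL value)."""
    names = [name for bit, name in sorted(EFLAGS_BITS.items())
             if (value >> bit) & 1]
    iopl = (value >> 12) & 3
    return names, iopl


if __name__ == "__main__":
    vector = parse_fault_vector("kernel fault ldt 0 fault 6+0")
    print(vector, X86_EXCEPTIONS.get(vector, "unknown"))
    print(decode_eflags(0x12246))
```

For this dump the sketch reports vector 6 (#UD), with PF, ZF, IF and RF set and IOPL 2.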

The fault occurred while the kernel was trying to log a transient event
into the tracelog. That transient event was either a fault during a message
pass or Net itself faulting (perhaps because of the driver?).

> The test is a simple loop that starts a process on node 2 (the
> one that hangs), which does some IPC with a custom driver for an HDLC
> controller (some traffic is received/transmitted on channels); every 2
> seconds the process is slayed and restarted. The processor
> load is not visible using ‘sac’ (the micro is a P4).
>
> The cPCI board is an Advantech mic3369.
>
> The same test (and other tests that are far more CPU-intensive) runs with no
> problems on another cPCI PC board, a VMIC 7753 (P3). The only difference
> between the software running on the two boards is the network card driver
> (Net.i82540 on Advantech, Net.ether82557 on VMIC).
>
> Any ideas?

It seems quite likely that the network driver or Net has something to do
with it, since the invalid opcode fault suggests something is corrupting
memory somewhere.


Cheers,
Adam

QNX Software Systems Ltd.
[ amallory@qnx.com ]

With a PC, I always felt limited by the software available.
On Unix, I am limited only by my knowledge.
–Peter J. Schoenster <pschon@baste.magibox.net>