looking for field input

Randy_Martin · September 4, 2001, 3:27pm

hello qnx4 users.

there will be a more formal discussion coming out… but i am interested in
early feedback as to the scope of this problem:

we had a report of a kernel crash in the timer_table code of Proc.
problem manifests with a kernel dump, with cs:ip pointing into the
code that walks the timer_table (for all posix timers). on version
425K that is somewhere past 0x8996 of cs.

has anyone else seen this?

another manifestation of this error is the potential to loop… so the system
can appear frozen.

the conditions to cause the error are extremely rare:

extensive setting/destroying/firing of timers of very low resolution
(some of sub tick size)

at the same time, a heavy interrupt load from an external source (net card,
serial etc.)
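
for readers who want a picture of the first condition, here is a rough sketch of that kind of timer churn. it is not the test case mentioned in this post. it uses the standard POSIX.1b calls (timer_create, timer_settime, timer_delete) as found on current POSIX systems; the QNX 4 library's own timer calls may differ in signature, so check its library reference before adapting it. the function name churn_timers and its parameters are made up for illustration, and "very low resolution" is read here as "very short expiry". the external interrupt load has to come from real hardware and is not shown.

```c
/* Illustrative sketch only: standard POSIX.1b timers, hypothetical names. */
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <time.h>

/* Create, arm, wait for, and destroy a one-shot timer `iterations` times.
 * `nsec` is the requested expiry in nanoseconds (1 .. 999999999); the OS
 * may round a request shorter than its clock tick up to the next tick.
 * Returns the number of expirations seen, or -1 on error. */
long churn_timers(long iterations, long nsec)
{
    sigset_t set;
    long fired = 0;

    /* A zero expiry would disarm the timer and the wait would never return. */
    if (nsec <= 0 || nsec >= 1000000000L)
        return -1;

    /* Block SIGRTMIN so each expiry stays pending until sigwaitinfo()
     * collects it; no signal handler is needed. */
    sigemptyset(&set);
    sigaddset(&set, SIGRTMIN);
    if (sigprocmask(SIG_BLOCK, &set, NULL) == -1)
        return -1;

    for (long i = 0; i < iterations; i++) {
        struct sigevent ev;
        struct itimerspec its;
        timer_t id;

        memset(&ev, 0, sizeof ev);
        ev.sigev_notify = SIGEV_SIGNAL;
        ev.sigev_signo = SIGRTMIN;
        if (timer_create(CLOCK_REALTIME, &ev, &id) == -1)
            return -1;

        memset(&its, 0, sizeof its);      /* it_interval = 0: one-shot */
        its.it_value.tv_nsec = nsec;
        if (timer_settime(id, 0, &its, NULL) == -1) {
            timer_delete(id);
            return -1;
        }

        if (sigwaitinfo(&set, NULL) == SIGRTMIN)  /* wait for the expiry */
            fired++;

        timer_delete(id);                 /* destroy, then start over */
    }
    return fired;
}
```

calling churn_timers(1000000, 100000) from a main() would cycle a million 100-microsecond timers; on older C libraries the program may need to be linked with -lrt.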

we have a test case here where we can see the failure after 3-4 hours of
intensive test.

and we have a fix being tested now, called Proc32 version M.

if anyone has experienced similar patterns and wants to try an early version
of this before it goes to formal beta, please email me.

note: although the fault is definitely a critical error, the chance of
hitting this window is extremely small, though it is possible. it is still
being determined whether we want all users to migrate to version M once
the fix is proven.

--

Randy Martin randy@qnx.com
Manager of FAE Group, North America
QNX Software Systems www.qnx.com
175 Terence Matthews Crescent, Kanata, Ontario, Canada K2M 1W8
Tel: 613-591-0931 Fax: 613-591-3579

Jeffrey_O_L_Jordan · September 10, 2001, 1:52pm

Randy et al:

I tried the new version you sent, but the symptoms remained.

I have an Industrial Computer Source (ICS) Omnix processor with an
aha-2940, two LAN cards and three digital I/O cards, all interrupt driven.
There are also 2 Opto-2 PAMUX cards.

The same code is running on another computer with the same equipment
installed, but that one does not freeze. Both were running Proc 4.25 “J”
until I got the new version “M”.

Hope this helps,

~ Jeffrey Jordan

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
/ \ | __ ) | ~Jeffrey Jordan jordanj@abc-naco.com
/ | \ | _ ( Phone: (610)630-2330x216 jljordan@wans.net
/ ++ \ | ) ( Fax: (610)630-2323
|| |||/_| 2550 Blvd. o/t Generals, Norristown PA 19403
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Randy_Martin · September 17, 2001, 8:30pm

in order to know what is happening with that hardware we will have to know
what you mean by ‘freeze’.

i have seen older hardware with L2 cache enabled have problems with interrupts
and cache integrity… but this is from 6 or 7 years ago. the chips in place
now (PIIs and later) really shouldn’t have any problems like that.

if possible you will need to enable the serial port output in Proc so that
if it crashes or ‘freezes’ then you will see why.

in your /boot/build file…

sys/Proc32
$ Proc32 -o 3f8,57600

(include other options that you already have, like -l or -P)

then connect a null serial cable to ser1. make sure that Dev.ser does NOT
take over ser1 in your sysinit file.
e.g. start it like: Dev.ser 2f8,3 &

start a terminal session on another machine at 57600.

then if you get a crash you should see a debug screen on your serial terminal.
from that we can determine why you died.

when i was debugging cache problems way back when, the instruction that it
died on was different than what was in the binary itself, so we knew that
the cpu+cache was modifying memory.. we turned off cache and the problem
went away.


Jeffrey_O_L_Jordan · September 18, 2001, 7:07pm

Randy:

By “freeze” I mean the following:

No change of the NumLock light when the NumLock key is pressed.
No response from kbd, even + or <ctrl>+++.
The alive command from another computer shows this node as down.
netmap shows it as ‘down’.
Cannot ‘ping’ this node.
No disk activity appears to be occurring, though none is expected.

I followed the instructions you sent earlier, but without result. When the
computer boots, the serial terminal displays the output, but when the
computer locks up, nothing new is displayed.

Please send help!

Thanks in advance,

~ Jeff

Jeffrey_O_L_Jordan · September 19, 2001, 12:42pm

Sorry - NumLocks does seem to toggle the NumLock Light.

~ Jeff
