O/S hang?

JS12 · August 3, 2005, 2:11pm

Our application runs 24 hours a day, non-stop, on QNX 6.21 on an HP
ProLiant ML370 G3 with a single Xeon processor. The system hangs at least
once a month. At first we suspected that one of the processes had run away
and consumed all the CPU time, so we set one of our consoles to the highest
priority (63) so that we could use it to analyze the system. However, when
the hang happened again today, that console also froze, and the keyboard
hung as well: the Num Lock key did not respond. So we guess the O/S itself
had somehow frozen.

This problem is intermittent and not reproducible. It may take a day or a
week for it to occur again. There is no core dump at all. We would
really appreciate any suggestions on how to identify the cause
of the problem. Are there any tools that we can use to log the system
condition? Many thanks for your recommendations.

Mario_Charest1 · August 3, 2005, 3:27pm


When Num Lock isn’t responding, that usually means the CPU is stuck in an ISR
or an interrupt line is stuck active.

If you have your own software running on the machine, I would suggest you
look at the code of the ISR (if there is any) for a possible endless loop and
such. You might use the parallel port as a trace to follow what the ISRs are
doing. For example, bit 0 of the parallel port is the timer interrupt, bit 1
is the interrupt for the I/O card, etc.

Other causes are, of course, bad hardware or a bug in the QNX drivers. If you
are good with hardware, you might probe the PCI bus to figure out which IRQ
line is stuck high.
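
To make the parallel-port trace idea concrete, here is a minimal sketch of the bookkeeping in Python. It only models the trace byte: each ISR sets its bit on entry and clears it on exit, so the value left on the port after a hang shows which handler never returned. The bit assignments follow the example above; the names `TRACE_BITS`, `set_bit`, `clear_bit`, and `active_isrs` are hypothetical, and on the target the byte would be written to the port's data register from the ISR itself rather than from Python.

```python
# Hypothetical bit assignments, following the example in the post:
# bit 0 = timer interrupt, bit 1 = interrupt for the I/O card.
TRACE_BITS = {0: "timer interrupt", 1: "I/O card interrupt"}


def set_bit(value, bit):
    """Return the 8-bit trace value with `bit` set (ISR entry)."""
    return (value | (1 << bit)) & 0xFF


def clear_bit(value, bit):
    """Return the 8-bit trace value with `bit` cleared (ISR exit)."""
    return value & ~(1 << bit) & 0xFF


def active_isrs(value):
    """Decode a captured trace byte into the ISRs that were still running."""
    return [name for bit, name in sorted(TRACE_BITS.items())
            if value & (1 << bit)]


if __name__ == "__main__":
    trace = 0
    trace = set_bit(trace, 0)    # timer ISR entered
    trace = set_bit(trace, 1)    # I/O card ISR entered
    trace = clear_bit(trace, 0)  # timer ISR returned
    # If the machine hung now, the port would still show bit 1 set.
    print(f"trace byte = 0x{trace:02x}, still active: {active_isrs(trace)}")
```

Reading the port with a meter, LEDs, or a logic analyzer after the hang then only requires decoding one byte.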


David_Gibbs1 · August 3, 2005, 3:54pm

JS <jsukamtoh@infolink.co.id> wrote:

Our application runs 24 hours a day, non-stop, on QNX 6.21 on an HP
ProLiant ML370 G3 with a single Xeon processor. The system hangs at least
once a month. At first we suspected that one of the processes had run away
and consumed all the CPU time, so we set one of our consoles to the highest
priority (63) so that we could use it to analyze the system. However, when
the hang happened again today, that console also froze, and the keyboard
hung as well: the Num Lock key did not respond. So we guess the O/S itself
had somehow frozen.

While the shell on the console might be highest priority (63), if the
keyboard driver that handles your keyboard input is not at that priority,
it may still be pre-empted and your typing won’t get anywhere.

And, if by console you mean a Photon terminal (pterm) window, then there are
even more layers that can be pre-empted.

Assuming you’re in text mode, not graphics mode, the fact that there is
no kernel dump displayed does suggest the O.S. hasn’t crashed.

The most likely causes:
– a high-priority runaway thread (yes, you are looking for this, but as I
noted, I’m not sure you’ve got a valid “look-see” for it)
– a runaway ISR or IRQ (as someone else noted)
– a runaway pulse queue, where you have a pulse receiver that is never
dequeueing pulses generated by either a timer or an ISR

This problem is intermittent and not reproducible. It may take a day or a
week for it to occur again. There is no core dump at all. We would
really appreciate any suggestions on how to identify the cause
of the problem. Are there any tools that we can use to log the system
condition? Many thanks for your recommendations.

Unfortunately, there aren’t a lot of useful tools for logging system
condition that don’t require a moderately sane system. They need CPU
time and the ability to be scheduled to do their work. Depending on
how nasty it gets, using the instrumented kernel, hooking into the
tracing mechanism pseudo-ISR, and dumping stuff to a serial port that
goes elsewhere (and not using the serial port driver, but hitting the
hardware directly, probably through startup’s debug callouts) or directly
to VGA memory (if text mode) are possibilities. They are, though, pretty
nasty.

-David

David Gibbs
QNX Training Services
dagibbs@qnx.com
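
The VGA option above relies on the layout of text-mode video memory, where each character cell is two bytes (character code, then attribute) in an 80x25 grid. The sketch below models that layout with a `bytearray` standing in for the real memory; `put_text` and the constants are hypothetical names, and on the target the same arithmetic would be applied to the mapped video memory (conventionally at physical address 0xB8000 on PC hardware) from C, not from Python.

```python
COLS, ROWS = 80, 25           # standard VGA text mode dimensions
ATTR_GREY_ON_BLACK = 0x07     # light grey foreground, black background


def put_text(buf, row, col, text, attr=ATTR_GREY_ON_BLACK):
    """Write `text` into a text-mode buffer starting at (row, col).

    Each cell occupies two bytes: the character code followed by the
    attribute byte. Writing stops at the end of the buffer.
    """
    offset = (row * COLS + col) * 2
    for ch in text:
        if offset + 1 >= len(buf):
            break
        buf[offset] = ord(ch) & 0xFF   # characters above 0xFF are truncated
        buf[offset + 1] = attr
        offset += 2


if __name__ == "__main__":
    screen = bytearray(COLS * ROWS * 2)   # stand-in for video memory
    put_text(screen, 0, 0, "IRQ 5 entered")
    print(screen[:8].hex())
```

Because the write is plain stores to memory, it needs no driver and no scheduling, which is why it can still work from a trace hook when the rest of the system is wedged.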

JS12 · August 4, 2005, 12:58am

Thanks for the reply.

Our software doesn’t have any ISR and doesn’t use any interrupts at all,
unless the QNX library uses an ISR that we are not aware of or, as you
say, there is a bug in the QNX drivers. We have 200 remote devices connected
to our application via TCP/IP, with approximately 3 processes serving
each device, which means we use a lot of sockets, a lot of TCP/IP
function calls, a lot of timers, and a lot of disk I/O, because the
processes log their data to the hard disk. At first we thought it could be a
hardware problem, so we replaced the machine with the latest model, but the
same problem still occurs.

If, let’s say, an IRQ line is stuck, could it really freeze the O/S?

We can’t do anything with the machine because it is on the production line.
We do have the older machine, on which we tried to reproduce a similar
environment, but the problem has not arisen in our testing environment. Like
I said, it is intermittent and may take days or weeks to occur once.


JS12 · August 4, 2005, 1:24am

Thanks for the reply.

We are using text mode. So which processes’ priority should we set to the
highest so that our keyboard and monitor will not get pre-empted? Our
processes’ priorities are set between 10 and 15.
Our application has no ISR or interrupt calls, and we don’t use pulses
or QNX message passing. But we do use a lot of IPC via TCP/IP.

When we run pidin, we note that many of the O/S threads are running at very
high priority. We are not sure whether this is normal. Could it be because
the shell from which we run pidin has the highest priority? Below is the list
of processes with high priority.
    pid tid name               prio STATE      Blocked
  81933   5 sbin/io-net         20o RECEIVE    6
      6   2 roc/boot/devb-eide  21r RECEIVE    1
      7   2 roc/boot/devb-aha8  21r READY
  81933   7 sbin/io-net         21r RECEIVE    22
  81933   8 sbin/io-net         21r RECEIVE    26
 126996   2 sbin/devb-fdc       21r RECEIVE    1
  81933   3 sbin/io-net         62o RECEIVE    1
      7   3 roc/boot/devb-aha8  63o RECEIVE    7
      7   4 roc/boot/devb-aha8  63o RECEIVE    4
      7   7 roc/boot/devb-aha8  63o CONDVAR    b822c018
      9   3 4A/x86/sbin/fs-pkg  63o RECEIVE    1
      9   5 4A/x86/sbin/fs-pkg  63o RECEIVE    1
   4106   1 sbin/pipe           63o RECEIVE    1
   4106   2 sbin/pipe           63o RECEIVE    1
  53260   1 sbin/devc-pty       63o RECEIVE    1
 184344   1 bin/sh              63o SIGSUSPEND
 184345   1 bin/sh              63o REPLY      8
 421899   1 bin/sh              63o SIGSUSPEND
1344020   1 bin/sh              63o SIGSUSPEND
1585756   1 bin/pidin           63o REPLY      1
1585757   1 usr/bin/sort        63o REPLY      4106
      1   2 6/boot/sys/procnto  63r RECEIVE    1
      1   4 6/boot/sys/procnto  63r RUNNING
      1   5 6/boot/sys/procnto  63r RECEIVE    1
      1   8 6/boot/sys/procnto  63r RECEIVE    1
      1   9 6/boot/sys/procnto  63r RECEIVE    1
      1  17 6/boot/sys/procnto  63r REPLY      7


Igor_Kovalenko2 · August 4, 2005, 3:04am

A lot of timers will get you in trouble, especially if they are being
dynamically created/destroyed or modified. Friends at my previous workplace
are debugging something that looks similar right now. I’ve heard they
came up with a test case that brings the system to its knees just by messing
with timers.

– igor


Evan_Hillas1 · August 3, 2005, 7:43pm

JS wrote:

We do have the older machine, on which we tried to reproduce a similar
environment, but the problem has not arisen in our testing environment.

Not even once? You may have a good indicator right there. Put some effort into reproducing the crash on the older unit. Your problem could be environmental, i.e., outside interference on site.

Evan

JS12 · August 4, 2005, 9:05am

No kidding! An RTOS with a timer problem? If this is true, then QSSL must take
this case seriously.

Are they using 6.21? We do use a lot of timers: one timer for each process
that handles the connection with a remote device. The processes can go up
and down, and their timers with them, but not very frequently.


JS12 · August 4, 2005, 9:35am

I agree. That’s why we replaced the old unit with a brand new one, replaced
the power cord, and moved it to another outlet, but we can’t replace the hub,
router, UPS, etc.

We are trying our best to simulate the same conditions as in the
production environment, but so far we can only get to maybe 80% similarity.
We are still improving our simulation software to match the production
environment more closely.

Hopefully anyone from this group who has had a similar problem will share
it with us.


Bill_Caroselli1 · August 4, 2005, 9:02pm

“David Gibbs” <dagibbs@qnx.com> wrote in message
news:dcqpb1$r98$2@inn.qnx.com…

The most likely causes:
– high priority run away thread (yes, you are looking for this, but as I
noted, I’m not sure you’ve got a valid “look-see” for it)
– run away ISR or IRQ (as someone else noted)
– run-away pulse queue, where you have a pulse receiver that is never
dequeueing pulses generated by either a timer or ISR

I would also check if you are running out of system resources like memory,
stack space, etc.

ed1k2 · August 14, 2005, 11:05pm

In article <dcrpt9$kgk$1@inn.qnx.com>, jsukamtoh@infolink.co.id says…

Thanks for the reply.

We are using text mode. So which processes’ priority should we set to the
highest so that our keyboard and monitor will not get pre-empted? Our
processes’ priorities are set between 10 and 15.
Our application has no ISR or interrupt calls, and we don’t use pulses
or QNX message passing. But we do use a lot of IPC via TCP/IP.

When we run pidin, we note that many of the O/S threads are running at very
high priority. We are not sure whether this is normal. Could it be because
the shell from which we run pidin has the highest priority? Below is the list
of processes with high priority.

The list you posted doesn’t look normal to me. I don’t know what you did to
get a “high-priority” shell, but you seem to have got everything running at
the highest priority (except threads that set their priority themselves).
Also, I don’t think the package manager (fs-pkg) is a good thing for a
machine on a production line. I don’t think this is ever going to work
stably.

Eduard.

JS12 · August 16, 2005, 4:12am

Can you please tell us how to disable fs-pkg? After we slay the fs-pkg
process, we can no longer run the programs in /bin and /usr/bin. PATH is not
changed. Thanks.


Armin_Steinhoff1 · August 16, 2005, 9:17am

JS wrote:

Can you please tell us how to disable fs-pkg? After we slay the fs-pkg
process, we can no longer run the programs in /bin and /usr/bin. PATH is not
changed. Thanks.

The easiest way is to install QNX 6.3.

–Armin


ed1k2 · August 17, 2005, 2:43am

In article <dds9vb$5iv$1@inn.qnx.com>, a-steinhoff@web.de says…

JS wrote:
Can you please tell us how to disable fs-pkg? After we slay the fs-pkg
process, we can no longer run the programs in /bin and /usr/bin. PATH is not
changed. Thanks.

The easiest way is to install QNX 6.3.

A more difficult way is to read a bunch of docs: the “Building Embedded
Systems” book, the manual for mkifs in the Utilities Reference at
http://www.qnx.com/developers/docs/momentics62_docs/momentics/index.html
and the “Making Buildfiles for the QNX® Neutrino® RTOS” article at
http://www.qnx.com/developers/articles/index.html

You’re not limited to this list; feel free to read more.

Eduard.

ed1k2 · August 17, 2005, 3:14am

In article <MPG.1d6c6e5c5b456f349896de@inn.qnx.com>, ed1k@fake.address
says…

In article <dds9vb$5iv$1@inn.qnx.com>, a-steinhoff@web.de says…
JS wrote:
Can you please tell us how to disable fs-pkg? After we slay the fs-pkg
process, we can no longer run the programs in /bin and /usr/bin. PATH is not
changed. Thanks.

BTW, I don’t think fs-pkg is the cause of your problem. Any performance
hit caused by fs-pkg is easily compensated for by the G3 Xeon. What is
wrong is the priority of the system processes you listed before. That list
suggests you have set maximum priority not for one console but for all
consoles and many other system processes (probably launched later from
one of the high-priority consoles). The disk driver assumes it runs at 10o
and that a thread at 21r is a high-priority thread. You have exactly the
opposite situation, with a lot of interfering threads running at maximum
priority. I would not be surprised to hear that you get system crashes and
disk driver malfunctions. The same applies to the io-net manager.

Regards,
Eduard.
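
The kind of check described above can be run offline against saved pidin output. The following is a minimal sketch in Python, assuming the line format seen in the list posted earlier in the thread (pid, tid, name, priority plus policy letter, state, optional blocked-on value); `parse_pidin` and `threads_at_or_above` are hypothetical helper names, and the threshold is an arbitrary choice.

```python
import re

# One pidin thread line, as posted in this thread:
#   pid tid name prio+policy STATE [blocked-on]
LINE_RE = re.compile(
    r"^\s*(?P<pid>\d+)\s+(?P<tid>\d+)\s+(?P<name>\S+)\s+"
    r"(?P<prio>\d+)(?P<policy>[a-z])\s+(?P<state>\S+)"
    r"(?:\s+(?P<blocked>\S+))?\s*$"
)


def parse_pidin(text):
    """Parse pidin-style lines into dicts; lines that do not match are skipped."""
    threads = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # header, blank line, or unexpected format
        threads.append({
            "pid": int(m.group("pid")),
            "tid": int(m.group("tid")),
            "name": m.group("name"),
            "prio": int(m.group("prio")),
            "policy": m.group("policy"),
            "state": m.group("state"),
            "blocked": m.group("blocked"),  # None when the column is empty
        })
    return threads


def threads_at_or_above(threads, limit):
    """Return the threads whose priority is at least `limit`."""
    return [t for t in threads if t["prio"] >= limit]


# A few lines copied from the list posted earlier in the thread.
SAMPLE = """\
81933 5 sbin/io-net 20o RECEIVE 6
7 2 roc/boot/devb-aha8 21r READY
7 3 roc/boot/devb-aha8 63o RECEIVE 7
184344 1 bin/sh 63o SIGSUSPEND
1 4 6/boot/sys/procnto 63r RUNNING
"""

if __name__ == "__main__":
    for t in threads_at_or_above(parse_pidin(SAMPLE), 63):
        print(t["pid"], t["tid"], t["name"], f'{t["prio"]}{t["policy"]}')
```

Comparing such a report from the production machine against one from a freshly booted system would show which processes inherited the console's priority.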

JS12 · August 18, 2005, 4:14pm

Are all our programs upward compatible with 6.3?

“Armin Steinhoff” <a-steinhoff@web.de> wrote in message
news:dds9vb$5iv$1@inn.qnx.com…

JS wrote:
Can you please tell us how to disable fs-pkg? After we slay the fs-pkg
process, we can’t call all the programs in /bin /usr/bin. PATH is not
changed. Thanks.

The easiest way is to install QNX6.3 >

–Armin

“ed1k” <> ed1k@fake.address> > wrote in message
news:> MPG.1d69984721d484a99896dd@inn.qnx.com> …

In article <dcrpt9$kgk$> 1@inn.qnx.com> >, > jsukamtoh@infolink.co.id > says…

Thanks for the reply.

We are using text mode. So which processes’ priority should we set to the
highest so that our keyboard and monitor will get pre-empted? Our processes’
priority is set between 10 and 15.
Our application doesn’t have ISR nor interrupt call, and we don’t use pulse
or QNX message-passing. But we do use a lot of IPC via TCP/IP.

When we do pidin, we noted that many of the O/S threads are running on very
high priority. We are not sure if this is normal or not. Could it be due to
the shell we are running pidin has the highest priority? Below is the list
processes with high priority.

The list below doesn’t look normal to me. I don’t know what you did to
get a “high-priority” shell, but you seem to have everything running at
the highest priority (except threads that set their priority themselves).
Also, I don’t think the package manager (fs-pkg) is a good thing for a
machine on a production line. I don’t think this is ever going to work
stably.

Eduard.

    pid tid name               prio STATE      Blocked
  81933   5 sbin/io-net        20o  RECEIVE    6
      6   2 roc/boot/devb-eide 21r  RECEIVE    1
      7   2 roc/boot/devb-aha8 21r  READY
  81933   7 sbin/io-net        21r  RECEIVE    22
  81933   8 sbin/io-net        21r  RECEIVE    26
 126996   2 sbin/devb-fdc      21r  RECEIVE    1
  81933   3 sbin/io-net        62o  RECEIVE    1
      7   3 roc/boot/devb-aha8 63o  RECEIVE    7
      7   4 roc/boot/devb-aha8 63o  RECEIVE    4
      7   7 roc/boot/devb-aha8 63o  CONDVAR    b822c018
      9   3 4A/x86/sbin/fs-pkg 63o  RECEIVE    1
      9   5 4A/x86/sbin/fs-pkg 63o  RECEIVE    1
   4106   1 sbin/pipe          63o  RECEIVE    1
   4106   2 sbin/pipe          63o  RECEIVE    1
  53260   1 sbin/devc-pty      63o  RECEIVE    1
 184344   1 bin/sh             63o  SIGSUSPEND
 184345   1 bin/sh             63o  REPLY      8
 421899   1 bin/sh             63o  SIGSUSPEND
1344020   1 bin/sh             63o  SIGSUSPEND
1585756   1 bin/pidin          63o  REPLY      1
1585757   1 usr/bin/sort       63o  REPLY      4106
      1   2 6/boot/sys/procnto 63r  RECEIVE    1
      1   4 6/boot/sys/procnto 63r  RUNNING
      1   5 6/boot/sys/procnto 63r  RECEIVE    1
      1   8 6/boot/sys/procnto 63r  RECEIVE    1
      1   9 6/boot/sys/procnto 63r  RECEIVE    1
      1  17 6/boot/sys/procnto 63r  REPLY      7

“David Gibbs” <dagibbs@qnx.com> wrote in message
news:dcqpb1$r98$2@inn.qnx.com…

JS <jsukamtoh@infolink.co.id> wrote:

Our application is running 24-hour non-stop everyday on QNX 6.21 using HP
Proliant ML370 G3 Xeon single processor. We have a problem where the
system hang at least once a month. At first we suspected that one of the
process might have run away and consumed all the CPU time, so we set one
of our console to the highest priority (63) so that we can use that
console to analyze the system. However, when it happened again today, our
console also got frozen, and the keyboard also became hang because the Num
Lock key is not operational. So we guessed the O/S had somehow frozen.

Armin_Steinhoff1 · August 19, 2005, 9:01am

JS wrote:

Are all programs upward compatible with 6.3?

Yes … in 99.9% of all cases. Just recompile …

–Armin

Marty_Doane1 · August 19, 2005, 4:35pm

I happen to be in the 0.1% camp.

The behavior of condition variables at thread cancelation was changed
between 6.2.1B and 6.3.0. Previously, the mutex guarding the condition
variable was left unlocked if a thread waiting on the condition variable was
canceled. Now this mutex is locked. My 6.2.1 application runs OK in 6.3.0
but deadlocks on shutdown. The solution is a rigorous use of cleanup
handlers to unlock the appropriate mutex.

Marty Doane
Siemens Logistics and Assembly Systems

“Armin Steinhoff” <a-steinhoff@web.de> wrote in message
news:de4648$1cb$1@inn.qnx.com…

JS wrote:
Are all programs upward compatible with 6.3?

Yes … in 99.9% of all cases. Just recompile …

–Armin

Adam_Mallory1 · August 19, 2005, 5:18pm

Marty Doane wrote:

I happen to be in the 0.1% camp.

The behavior of condition variables at thread cancelation was changed
between 6.2.1B and 6.3.0. Previously, the mutex guarding the condition
variable was left unlocked if a thread waiting on the condition variable was
canceled. Now this mutex is locked. My 6.2.1 application runs OK in 6.3.0
but deadlocks on shutdown. The solution is a rigorous use of cleanup
handlers to unlock the appropriate mutex.

This is POSIX defined behaviour - it explicitly says that upon
cancellation, the mutex must be acquired before running the cleanup
handlers and canceling (the behaviour change is unfortunately not in the
migration notes - I’ll see if we can change that).

Additionally, the previous behaviour was to leave the mutex alone, not
ensure it was unlocked, so its state at the time of cancellation is
purely a function of the asynchronous nature of your application.

IMHO, the use of cleanup handlers to ensure shared resource sanity is
good programming practice regardless of the rules - especially when
doing thread cancellation.
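
For illustration, a minimal sketch of that pattern in plain POSIX C. It assumes deferred cancellation (the default) and a waiter blocked in pthread_cond_wait(). The names waiter, unlock_mutex and cancel_waiter_demo are invented for this example and are not taken from any real application. Because the mutex is re-acquired before the cleanup handlers run, the handler's only job here is to release it again.

```c
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int ready   = 0;  /* predicate guarded by lock; never set in this demo */
static int waiting = 0;  /* set by the waiter once it holds the lock */

/* Cleanup handler: when the thread is cancelled inside pthread_cond_wait(),
   the mutex is held again by the time this runs, so release it. */
static void unlock_mutex(void *arg)
{
    pthread_mutex_unlock((pthread_mutex_t *)arg);
}

static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    pthread_cleanup_push(unlock_mutex, &lock);
    waiting = 1;
    while (!ready)
        pthread_cond_wait(&cond, &lock);  /* cancellation point */
    pthread_cleanup_pop(1);               /* normal exit path unlocks as well */
    return NULL;
}

/* Cancels a blocked waiter, then checks that the mutex is free again.
   Returns 0 if the mutex could be taken, non-zero otherwise. */
int cancel_waiter_demo(void)
{
    pthread_t tid;
    int seen = 0;
    int rc;

    if (pthread_create(&tid, NULL, waiter, NULL) != 0)
        return -1;

    /* Once we observe waiting == 1 under the lock, the waiter must have
       released the mutex, i.e. it is blocked inside pthread_cond_wait(). */
    while (!seen) {
        pthread_mutex_lock(&lock);
        seen = waiting;
        pthread_mutex_unlock(&lock);
        if (!seen)
            sched_yield();
    }

    pthread_cancel(tid);
    pthread_join(tid, NULL);

    rc = pthread_mutex_trylock(&lock);  /* would report EBUSY without the handler */
    if (rc == 0)
        pthread_mutex_unlock(&lock);
    return rc;
}
```

Without the pthread_cleanup_push()/pthread_cleanup_pop() pair, the cancelled thread would exit still owning the mutex, and anything that locks it afterwards, for example during shutdown, would block forever.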

Cheers,
Adam

QNX Software Systems
[ amallory@qnx.com ]

With a PC, I always felt limited by the software available.
On Unix, I am limited only by my knowledge.
–Peter J. Schoenster

Armin_Steinhoff1 · August 19, 2005, 5:23pm

Thanks! Are there any other known cases?

–Armin

Marty Doane wrote:

I happen to be in the 0.1% camp.

The behavior of condition variables at thread cancelation was changed
between 6.2.1B and 6.3.0. Previously, the mutex guarding the condition
variable was left unlocked if a thread waiting on the condition variable was
canceled. Now this mutex is locked. My 6.2.1 application runs OK in 6.3.0
but deadlocks on shutdown. The solution is a rigorous use of cleanup
handlers to unlock the appropriate mutex.

Marty Doane
Siemens Logistics and Assembly Systems
