Fsys Cache

Hi All,

I kind of stumbled into discovering some rather strange behavior
regarding Fsys and its cache size. It seems there is an upper limit to
how big a cache Fsys can handle effectively. Once that threshold has
been crossed, Fsys throughput will drop drastically, apparently by a
factor of 20 or more.

Here’s the background…
I’ve been doing some development and testing to try to alleviate peak
load levels on our production systems when large database updates and/or
refreshes are being run. My cohorts and I have suspected for quite a
while that we were becoming disk bound, and from watching sac output
when DB updates were in progress, it appeared that we were also becoming
CPU bound. So the plan was to figure out a way to selectively offload
some of our larger database-intensive processing to an auxiliary node.
To that end, we built a test node and, thinking that it would need all
the resources it could possibly use, we installed 2G of RAM in it.

While testing database updates on this new test node, I observed that a
batched update would start running at a rate of ~2000 trans/sec, and
then at some point would slow to ~100 trans/sec with the CPU pegged at
the priority level of the DB engine process. Once that “slow” state
was reached, there seemed to be no predicting when “normal” throughput
would be restored. For smaller batches of updates (10,000 records or
so), an entire batch might run at normal levels, or there might be
several slowdowns lasting anywhere from 10 seconds to several minutes.
For large batch updates (500,000 records or more), at some point the
throughput slowdown would just lock in for hours at a time. Then, after
just sitting there idle (like overnight), throughput would return to
normal.

Suspecting that there might be something wrong with the hardware on the
new test node, I tried the same set of tests on several of our other
test nodes. Although I could get the slowed throughput to happen, it
would never last for more than 20 seconds. How strange, thought I…
The only real difference between the new test node and the older ones
was that the older nodes had only 1G of RAM… This got me thinking
about Fsys & its cache allocation, which was by default 8% of system
memory, i.e. 160M for the new test node and 80M for the older ones.

Also, studying sac output, I noticed that when things were running
normally, the CPU was only occasionally pegged, and there would be a
regular blip at priority 22 (Fsys doing its private business) and also
regular hardware interrupt activity. When things slowed down, the CPU
would be pegged at the priority of the DB engine (priority 9 in this
case), with nearly no activity at any other priority level.

So… to make a long story shorter, I started playing around with the
Fsys cache allocation, and 64M seems to be the magic number. Allocating
any more than that will cause the slowdown symptoms to start appearing:
the more you allocate, the sooner, the more frequent, and the longer the
slowdowns will be. Allocating less than 64M seems to work just fine,
BUT the lower the cache size, the more frequently Fsys will be running
at priority 22, and aggregate throughput will drop slightly (a few
seconds per 100,000 records).
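
For anyone who wants to try the same thing, the change is just the Fsys
-c option. A rough sketch (exactly where Fsys gets started varies per
setup, boot image versus startup scripts, so adapt accordingly):

# default: cache sized at ~8% of system RAM (160M on this 2G node)
Fsys

# explicit 64M cache; everything else left as before
Fsys -c 64M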

I think I may have found a significant and rather easy (for a change
:wink:) solution for helping to alleviate our production system woes…
They also have 2G of RAM in them, and therefore the default 160M Fsys
cache.

The same tests on the new test node, with the Fsys cache set at 64M, now
very consistently run at a rate of over 2000 trans/sec; the CPU is never
pegged, and I have not observed any slowdown symptoms.

==============

Obviously, somebody at QNX should look into this further, but for now,
my only remaining concern is…

Is there any foreseeable downside to setting the Fsys cache to 64M?

For example, that shouldn’t affect heap allocation (the -H diskN option)
for large file systems, should it?

TIA

-Rob

What type of HD are you using and what driver are you using?


Two of the test systems (including the new one) use external RAID with
36G mirrored arrays. The other test system has a single 36G SCSI drive
(an IBM model, I think). All exhibit the same symptoms when slowness is
forced upon them… so I don’t think it has anything to do with RAID
versus non-RAID.

All use driver: Fsys.aha8scsi

All production systems use the Fsys.aha8scsi driver, all use external
RAID, but some have up to 400G in RAID 5 arrays.

Mario Charest wrote:

What type of HD are you using and what driver are you using?



OK, I’ll just go ahead and answer the question I really wanted answered
myself:

Yes, there are indeed nasty consequences and repercussions when
explicitly setting the Fsys cache size.

Could someone from QSSL please shed some light on this? I’m pretty much
staring at a black box, taking WAGs at what might be rattling around
inside. All I can really conclude so far is that something is fubar.


Here’s what I’ve observed so far…

  1. The system that I originally did my testing on apparently didn’t
    have enough files and/or disk space allocated on it to show any ill
    effects. When I tried explicit Fsys cache allocation (at 64M) on
    one of our “regular”, far more active test systems, all sorts of
    file I/O errors started happening. It seemed to happen only on files
    of a relatively large size, say > 10M. Nothing was able to access
    data in these sorts of files beyond a certain offset into the file.
    Doing a ‘tail logfile’ on a text-based log file of ~15M just
    resulted in ‘read: I/O error’.

I subsequently reset everything to the default Fsys cache allocation, and
everything returned to normal. In fact, the files that were having I/O
errors were also fine. They had even grown while the 64M cache was in
play, even though they couldn’t be read back at the time, i.e.
write-through must have still been working even though read-back wasn’t.

BTW, I also ran ‘chkfsys -u’ on the file systems before and after
resetting to the default cache allocation. No errors were returned
either time.

  2. When doing an explicit Fsys cache allocation, the resulting memory
    allocation by Fsys doesn’t seem to be what one would expect. On a
    system with 1G of memory, ‘Fsys -Hdisk36’ results in Fsys allocating
    ~580M; the same thing with ‘-c 64M’ added resulted in ~285M (as I
    recall). If I’m not mistaken, there should only have been a ~16M
    difference (80M versus 64M).

It’s pretty hard to tell what’s going on here, as I don’t think Fsys
actually has 590M allocated. A ‘sin info’ on that system shows 804M free.

  3. I lowered the system memory on the original (auxiliary) test system
    to 1G (from 2G), invoked Fsys with the default cache allocation, and
    everything seems to be working fine. No prolonged ‘stall-outs’.

TIA

-Rob


Rob Hem wrote:

  1. The system that I originally did my testing on apparently didn’t
    have enough files and/or disk space allocated on it to show any ill
    effects. When I tried explicit Fsys cache allocation (at 64M) on
    one of our “regular”, far more active test systems, all sorts of
    file I/O errors started happening. It seemed to happen only on files
    of a relatively large size, say > 10M. Nothing was able to access
    data in these sorts of files beyond a certain offset into the file.
    Doing a ‘tail logfile’ on a text-based log file of ~15M just
    resulted in ‘read: I/O error’.

Was an IO error logged in the traceinfo? What about a dcheck on the
disk? In QNX4 (unlike QNX6) IO errors are ‘sticky’, so if you have a
block that is going bad and it reported an error, that error is stored
in the cache and no retry is made. Rebooting would let it try again
(and this time it might happen to work).

BTW, I also ran ‘chkfsys -u’ on the file systems before and after
resetting to the default cache allocation. No errors were returned
either time.

chkfsys does not examine data content, only the filesystem structure
(which is a subset of blocks). Try a dcheck; or look in traceinfo for
error messages from the driver.

  2. When doing an explicit Fsys cache allocation, the resulting memory
    allocation by Fsys doesn’t seem to be what one would expect. On a
    system with 1G of memory, ‘Fsys -Hdisk36’ results in Fsys allocating
    ~580M; the same thing with ‘-c 64M’ added resulted in ~285M (as I
    recall). If I’m not mistaken, there should only have been a ~16M
    difference (80M versus 64M).

-c and -H are not mutually exclusive. -H sets up the heap size for a
variety of internal data structures excluding any -c cache size. Each
disk block in cache also requires some maintenance overhead (hash lists,
free/LRU lists, state and flags, etc), so increasing -c will also add
additional “paperwork” memory (for each 512-byte unit of that -c).
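
To put very rough numbers on that (the per-block figure here is purely
illustrative, not a real Fsys value): a 64M cache is 64M/512 = 131072
blocks, so at a notional 32 bytes of bookkeeping per block that is only
~4M of extra “paperwork”; a 160M cache (327680 blocks) would be ~10M.
So -c costs somewhat more than just the raw buffer space.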

  3. I lowered the system memory on the original (auxiliary) test system
    to 1G (from 2G), invoked Fsys with the default cache allocation, and
    everything seems to be working fine. No prolonged ‘stall-outs’.

Cache performance is difficult to analyse in a general case. Increasing
the cache does not automatically make things faster; for example, you
now get more hash collisions as multiple blocks hash to the same header
(I benchmarked this on QNX6 as introducing a 2% overhead in some
scenarios, which is why in io-blk I scale a variety of other internal
data structures, like hash buckets, based on cache size); buffering more
dirty writes may result in situations where large runs are written at
driver priority to the exclusion of handling interactive requests; etc.
If you have found a configuration which works well, stick with it.
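
As a concrete (and again purely illustrative) picture of the collision
effect: with a fixed number of hash buckets, say a notional 4096, a 64M
cache spreads 131072 blocks at ~32 per chain, while a 160M cache spreads
327680 blocks at ~80 per chain, so on average every lookup walks a chain
roughly 2.5x longer.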

On Mon, 25 Sep 2006 23:39:28 +0400, John Garvey <jgarvey@qnx.com> wrote:

If you have found a configuration which works well, stick with it.

Does this mean “we will not release an updated|fixed Fsys (in near
future)”?

Tony.

John,

Thanks for the response …
See below.

John Garvey wrote:

Rob Hem wrote:

  1. The system that I originally did my testing on apparently didn’t
    have enough files and/or disk space allocated on it to show any ill
    effects. When I tried explicit Fsys cache allocation (at 64M) on
    one of our “regular”, far more active test systems, all sorts of
    file I/O errors started happening. It seemed to happen only on files
    of a relatively large size, say > 10M. Nothing was able to access
    data in these sorts of files beyond a certain offset into the file.
    Doing a ‘tail logfile’ on a text-based log file of ~15M just
    resulted in ‘read: I/O error’.


    Was an IO error logged in the traceinfo? What about a dcheck on the
    disk? In QNX4 (unlike QNX6) IO errors are ‘sticky’, so if you have a
    block that is going bad and it reported an error, that error is
    stored in the cache and no retry is made. Rebooting would let it try
    again (and this time it might happen to work).

I did not check the traceinfo. Like I said, this was one of our
“regular” test systems. My initial tests having been seemingly
successful, I was fairly confident that the cache tweaks would improve
performance on this system also.

Unfortunately, I had to get things put back into place ASAP, as there
was a horde of agitated programmers coming for me with pitchforks and
rakes.

Why are you equating “bad block” problems with I/O errors? Why would
“bad block” problems suddenly surface when I changed the cache size?
And then just disappear when I reset the cache to its default?

Since I reset things, we’ve had no more I/O errors. That was about a
week ago BTW.

I’m running a dcheck right now, and it did find a few bad blocks, but we
still haven’t had any I/O errors since the cache reset.

BTW, I also ran ‘chkfsys -u’ on the file systems before and after
resetting to the default cache allocation. No errors were returned
either time.


chkfsys does not examine data content, only the filesystem structure
(which is a subset of blocks). Try a dcheck; or look in traceinfo for
error messages from the driver.

  2. When doing an explicit Fsys cache allocation, the resulting memory
    allocation by Fsys doesn’t seem to be what one would expect. On a
    system with 1G of memory, ‘Fsys -Hdisk36’ results in Fsys allocating
    ~580M; the same thing with ‘-c 64M’ added resulted in ~285M (as I
    recall). If I’m not mistaken, there should only have been a ~16M
    difference (80M versus 64M).


    -c and -H are not mutually exclusive. -H sets up the heap size for a
    variety of internal data structures excluding any -c cache size. Each
    disk block in cache also requires some maintenance overhead (hash lists,
    free/LRU lists, state and flags, etc), so increasing -c will also add
    additional “paperwork” memory (for each 512-byte unit of that -c).

Just to be clear, I was not increasing the cache size, I was decreasing it.

  3. I lowered the system memory on the original (auxiliary) test system
    to 1G (from 2G), invoked Fsys with the default cache allocation, and
    everything seems to be working fine. No prolonged ‘stall-outs’.


    Cache performance is difficult to analyse in a general case.
    Increasing the cache does not automatically make things faster; for
    example, you now get more hash collisions as multiple blocks hash to
    the same header (I benchmarked this on QNX6 as introducing a 2%
    overhead in some scenarios, which is why in io-blk I scale a variety
    of other internal data structures, like hash buckets, based on cache
    size); buffering more dirty writes may result in situations where
    large runs are written at driver priority to the exclusion of
    handling interactive requests; etc. If you have found a
    configuration which works well, stick with it.

Did you read my original post? That’s what I was trying to communicate.
Only it certainly seemed that it was far more than a 2% overhead… try
2000% (20x slower). It looked as though I’d managed to run into some
sort of threshold where Fsys would just stall out for up to hours at a
time. And it wasn’t pegging the CPU at Fsys priority (22), but at user
process priority (in my case 9).


John, we’ve got critical production systems with 2G of RAM in them that
are (empirically) exhibiting the type of behavior I described in my
original post. I say empirically because I really don’t know any good
way to monitor Fsys performance on a live system. Do you?

Therefore, I’m trying to get some definitive answers, so we (my cohorts
and I) can make as informed a decision as possible as to what “a
configuration which works well” is. Is it limiting all our systems to
no more than 1G of RAM, or can Fsys performance be reliably improved by
limiting the cache size to 64M?

At this point, it looks like we’re going to be stuck with a 1G RAM limit.

TIA

-Rob

Rob Hem wrote:

Was an IO error logged in the traceinfo? What about a dcheck on the
disk? In QNX4 (unlike QNX6) IO errors are ‘sticky’, and so if you
have a block that is going bad and it reported an error, that error
is stored in the cache, and no retry made. Rebooting would let it try
again (and this time happen to work).
Why are you equating “bad block” problems with I/O errors? Why would
“bad block” problems suddenly surface when I changed the cache size? And
then just disappear when I reset the cache to its default?

A bad block is a physical IO error (EIO). Because the bad block in
QNX4 is “sticky”, a bigger cache would allow a bad block to remain
marked as such for a longer period, whereas a smaller cache would
have to go back to disk again and might perhaps work this time. Or
vice-versa (you manage to get a good read, which persists longer
in the cache, whereas with a smaller -c it would have been LRU
replaced, and the read fails this time). It totally depends on the
application access patterns. BTW, there is a special traceinfo
message for this sticky indication which you could look for, namely
“Bad block %04.4x%04.4x on /dev/%@3,s flagged in cache” (major 3,
minor 1-5 codes are all various bad blocks logs, based on what was
being attempted when the IO error was reported up by the driver).

I’m running a dcheck right now, and it did find a few bad blocks, but we

That is an item of much concern; you should either replace the disk
or mark the blocks as bad (dcheck -m) and run chkfsys to see what files
were affected. It does make it hard to draw any firm conclusions about
any causal relationship with cache size in the presence of physical
media errors.
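
Spelled out, that sequence looks something like this (a sketch only; the
device and mount point are placeholders, and the commands are just the
ones already mentioned in this thread):

dcheck /dev/hd0t78      # read-scan the partition and report bad blocks
dcheck -m /dev/hd0t78   # scan again, this time marking the bad blocks
chkfsys -u /u           # then check/repair the filesystem structure
traceinfo               # and look for "Bad block ... flagged in cache"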

John, we’ve got critical production systems with 2G of RAM in them that
are (empirically) exhibiting the type of behavior I described in my

At the time QNX4 Fsys was being developed (1990s), 1GB of RAM was an
unheard-of amount. I think it would be a fair statement to say that no
performance analysis of Fsys behaviour in such an environment has been
conducted, hence my comment that if you have found a particular
configuration which works better, then use it. You should not have to
restrict your entire system to 1GB, as you can set an explicit cache
size with -c rather than let it pick a percentage of RAM. If you
have found from trial-and-error that -c64M is a good number, then that
sounds like a good number to me too. Perhaps that extra RAM might be
best used as a ramdisk (Fsys -r), into which you could copy certain
common read-only files from disk at startup (which would also serve to
reduce the load on the buffer cache, leaving it to cache other/writing
files)? Try manipulating the Fsys -d in conjunction with -c (I don’t
know what mix of read/write you have in your system, but thrashing
could be caused by interaction between sync and async writes).
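
A minimal sketch of that ramdisk idea follows; only the -r and -c
options come from the paragraph above, while the size syntax, the
/dev/ram device name, and the dinit/mount/cp steps are assumptions to
verify against the Fsys and dinit docs before trying it on a real
system:

Fsys -c 64M -r 256M        # 64M buffer cache plus a 256M ramdisk (sizes assumed)
dinit /dev/ram             # assumed device name; put a QNX4 filesystem on it
mount /dev/ram /ramdisk
cp /u/refdata/* /ramdisk   # hypothetical read-only files copied at startup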

original post. I say empirically because I really don’t know any good
way to monitor Fsys performance on a live system. Do you?

Running “fsysinfo -bd1” on a console might let you visualise activity?
Running a disk benchmark, like ‘iozone’, might let you test certain
cache configurations, but it really sounds like you need the IO mix of
your production system to show the problem?
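
For example, left running on a spare console during one of the batch
updates, the command suggested above should give an ongoing picture of
filesystem and cache activity:

fsysinfo -bd1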

On Tue, 26 Sep 2006 03:47:14 +0400, John Garvey <jgarvey@qnx.com> wrote:

If you have found from trial-and-error that -c64M is a good number, then
that sounds like a good number to me too.
Hmmm…

The main idea of the original post was that limiting the cache with
“-c64M” has introduced new problems - I/O errors. Not limiting the cache -
letting it be the default - introduces huge performance losses.
Where is the optimum?

Tony.

Thanks John, for the robust explanation :wink:

I’m planning on getting the disk cleaned up and trying the tests again.
I’ll try to get a traceinfo dump also.

I’ve got about 6 to 8 different balls in the air right now, so I’m not
sure when I’ll be getting around to it though. I’ll let you know how it
turns out.

Thanks again//

-Rob


It was very interesting reading this thread in light of two issues:
user expectations, and the end of life of a product. While QSSL still
purports to support QNX 4, this is clearly limited to quick and
relatively easy fixes. I was told some time ago by Gorden Bell that
some issues, e.g. new video drivers, are relatively easy to provide
support for. What seems to be at issue here is a design limitation
related to the era in which QNX 4 Fsys was built. At that time, disks
were less than 1G and memory was generally less than 64M. It is not
all that surprising that Fsys has run into a cache choke point. Had
QSSL’s development direction been different, there might have been a
rewrite of Fsys to take this into account, but clearly this is not
now in the cards. Similar, and more severe, issues occurred with QNX
2, although there are surely more to come for QNX 4. All this
suggests some prudence and pragmatic long-term planning on the part
of QNX 4 users.

John (or someone),

Could you please help me make sense of this…

The disk on the “regular” test system is just a SCSI disk, no RAID.

For the record, here’s the Fsys configuration via ‘ps’ & ‘df’:
4 4 0 10r RECV 0 580792K Fsys -Hdisk36
5 5 0 22r RECV 0 291188K Fsys.aha8scsi -L aha8scsi -DCF -V9005 -p1

File system (kb)         Total      User     Used      Free  Used  Mounted
//17/dev/hd0t77         200781    200722    19818    180904    9%  /
//17/dev/hd0t78       35640202  35630829  9280032  26350797   26%  /u

Yesterday I ran six iterations of ‘dcheck’ on the ‘/u’ partition on that
disk (/dev/hd0t78). Every other time with the ‘-m’ option, starting w/o
-m. Every iteration showed 3 NEW bad blocks. After each ‘-m’
iteration, I ran ‘chkfsys -u /u’. Although there were repairs to the
file system made each time, there was only one trashed file that I had
to zap. None of the file repairs involved any of the files that were
having I/O errors in my previous tests.

So, this morning I ran another ‘dcheck’ (w/o the -m option), and this is
what I got:

root[17]/home/rob>dcheck /dev/hd0t78
Read check of /dev/hd0t78 without bad block marking,
starting at block 1 for the remainder of the disk.
Checking 71280405 blocks (35640202K)
NEW BAD: 1392843 (01540CB hex) - 1
NEW BAD: 18267271 (116BC87 hex) - 2
NEW BAD: 18267274 (116BC8A hex) - 3
DCHECK: 3 bad block(s) on disk. 3 new bad block(s) were found.

What! Yet more bad blocks?

I did some more investigation into what was supposed to be going on here
and managed to just confuse myself…

root[17]/home/rob>la /u
total 19010

-rw-rw---- 1 root root 23 Sep 26 16:25 .bad_blks
-r--r--r-- 1 root root 8910051 Sep 18 2003 .bitmap

root[17]/home/rob>hd /u/.bad_blks
0000000 31 35 34 30 43 42 0a 31 31 36 42 43 38 37 0a 31 1540CB.116BC87.1
0000010 31 36 42 43 38 41 0a 16BC8A.
0000016

Those values in the hex dump match exactly what dcheck was reporting as
“NEW” bad blocks in my run from this morning… only the bad block list
was last updated yesterday??
Furthermore, there are only 3 bad blocks total. Yesterday’s runs
should have put 9 of them in there, shouldn’t they?

And, to further confuse things, here’s the traceinfo dump from this
morning’s run:

Sep 27 09:18:40 5 00005109 Scsi sense (unit=0 scsi=2 err=F0h sense=3h asc=11h ascq=0h)
Sep 27 09:18:40 3 00003005 Bad block 001B61A4 on /dev/hd0 during preread
Sep 27 09:18:40 2 00003001 Bad block 001B61A4 on /dev/hd0 flagged in cache
Sep 27 09:18:40 2 00003001 Bad block 001B61A4 on /dev/hd0 flagged in cache

Sep 27 09:30:10 5 00005109 Scsi sense (unit=0 scsi=2 err=F0h sense=3h asc=11h ascq=0h)
Sep 27 09:30:10 3 00003005 Bad block 011CDD60 on /dev/hd0 during preread
Sep 27 09:30:10 2 00003001 Bad block 011CDD60 on /dev/hd0 flagged in cache
Sep 27 09:30:10 2 00003001 Bad block 011CDD60 on /dev/hd0 flagged in cache
Sep 27 09:30:10 5 00005109 Scsi sense (unit=0 scsi=2 err=F0h sense=3h asc=11h ascq=0h)
Sep 27 09:30:10 2 00003004 Bad block 011CDD63 on /dev/hd0 during read
Sep 27 09:30:10 2 00003001 Bad block 011CDD63 on /dev/hd0 flagged in cache
Sep 27 09:30:10 2 00003001 Bad block 011CDD63 on /dev/hd0 flagged in cache

Those hex values don’t match any of the above. What’s going on here?
Ah, I think I see, it’s the block number relative to the entire hard
drive, not just the partition.
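
The arithmetic backs that up (partition-relative block plus a constant
offset gives the whole-disk block in every case):

  1392843 (01540CB)  + 401625 =  1794468 (1B61A4)
 18267271 (116BC87)  + 401625 = 18668896 (11CDD60)
 18267274 (116BC8A)  + 401625 = 18668899 (11CDD63)

The common 401625-block offset would presumably be where the /u
partition (hd0t78) starts on /dev/hd0.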

Fortunately, I’ve got scroll-back buffers on the terminal windows I use
to log into our QNX4 systems, and after hunting around a bit, I found
the traceinfo dumps I did yesterday. It seems those same 3 bad blocks
are being reported every time.

Does that mean I have managed to repair the disk? And I’ve just been
interpreting dcheck’s output incorrectly? Why then does dcheck keep
calling these NEW bad blocks… and why does chkfsys always need to do
repairs after each ‘dcheck -m’ pass?

-Rob


I would look at getting a new drive. The drive is reporting medium
errors, as indicated by the

Sep 27 09:30:10 5 00005109 Scsi sense (unit=0 scsi=2 err=F0h sense=3h asc=11h ascq=0h)

trace entries.

The entry translates to:

SENSE KEY: 3
MEDIUM ERROR. Indicates that the command terminated with a
non-recovered error condition that was probably caused by a
flaw in the medium or an error in the recorded data. This
sense key may also be returned if the target is unable to
distinguish between a flaw in the medium and a specific hardware
failure (sense key 4h).

ASC: 11, ASCQ: 0
UNRECOVERED READ ERROR

Thought I’d report back to you-all, since it’s been well over a year…

My cohorts and I figured out soon after the last post that the QNX4 SCSI
driver just plain doesn’t like talking directly to the disk drives we
were/are using. It had/has no problem talking to the hardware RAID
controllers we use. So we have since eliminated all direct SCSI drive
access on all our QNX4 systems.

As for the -c 64M thingie…

As things were left, we pretty much knew empirically that the -c 64M
cache change altered Fsys performance (apparently for the better), but
lacking any hard evidence we didn’t know if it could really safely help
our production systems. So we configured a couple of auxiliary
production machines with the 64M cache to see how that would go, and
proceeded to start digging through our systems for ways to optimize
processing from the code level up. Since then we have made some very
significant gains to that end. Unfortunately, in the course of a year
or so, we kind of forgot about (re)evaluating the cache change and
expanding its use.

Then a few weeks ago we had a rather large new client come on board who
has, to put it as nicely as I can, POS equipment that exhibits way less
than optimal performance characteristics. That POS equipment has
needlessly tied up a lot of our core systems’ resources waiting on dead
air, which has subsequently led to our systems getting pushed to the
wall during peak processing periods. And I do mean pushed, as in
face pressed to the glass wishing you were somewhere else. We have
experienced 1-2 hour periods with nearly no idle time and the system
throughput slowing to a crawl.

We had to do something, and soon (or start losing all our customers), so
we decided to pull the -c 64M rabbit from our collective hat and install
it on our primary processing machines. I still can hardly believe the
impact this has had… it has stood the previously pathetic system
performance profile (as viewed via sac) on its head! We’re now running
80-100% idle where previously idle time was usually bouncing violently
between 10-60% and often hovering at 10% or less. Our main “host”
processes that run at priority 13 were running at 20-70% CPU usage.
They now bounce between 0-15%. Fsys at priority 22 now only shows up as
an occasional little blip. Previously, it would periodically go into
sustained 70% CPU usage and was nearly always there bouncing at 20%
usage.

So, I can now definitively say that there are extremely positive
benefits to limiting the Fsys cache to 64M. Nearly miraculous, in fact.
The processing profile we are seeing now via sac looks like what I would
expect from a 3.1 GHz machine.