file write anomaly

Hi

I have a QNX 4.25 / Photon 1.14 based system that controls a chocolate bar
wrapping machine. It comprises a number of processes running at a range of
priorities. The operator interface is a PhAB-generated graphics application
(process). Two of the processes (the graphics one and a simple C++ program)
store (write to and read from) configuration information in Windows-style
INI files using routines I have written. These routines are linked into the
processes as object files. The files are opened and written to using the C++
out.open, out << and out.close. The ASCII file for the graphics process is
393 bytes long; the file for the other process is about 1500 bytes long. The
graphics application runs at a lower priority (8) than the other process (12).

The system recently crashed (SIGSEGV) when the graphics process read its file,
because the file had been corrupted. On examination, the two INI files were
of the correct length and had identical time stamps. The 393-byte file
actually contained a section from the larger file instead of its own data.
It is almost as if the lower-priority graphics application had been
creating/writing its file when it was pre-empted by the higher-priority
process, which wrote its own file and in the process overwrote the data/buffer
for the graphics file, which then wrote a file of the correct length but with
the wrong data.

The data found in the graphics file could not have come from the graphics
program; it was a true and accurate copy of a section from the larger file,
apparently starting at a random point in that file. I assume that the
two processes are truly independent even though they both use the same
object file for the file-access routines, so there should not be a problem
with common buffers. Could the use of the C++ out mechanism be to blame?
Then again, surely the independent processes have their own I/O buffers and
file descriptors, so even if there are hidden static members in the C++ code
they cannot affect another process.

It is almost too horrible to think that this could be a file system problem,
or something happening to the disk drive's own buffer, so I would welcome any
suggestions as to what the cause may have been. It is unlikely that the
system was powered down during the file write.

Hi Stephen

In order to properly diagnose your problem you may need to send us all a box
of chocolate bars. That way we can examine them thoroughly and give you
more detailed information.

OK. I couldn’t resist saying that. But seriously, I have a large and
complex system that uses C++ iostreams for many things, and we have never
seen this kind of behavior. I have seen this kind of behavior when a program
had a pointer to data that got corrupted: it thought it was writing one thing
but was actually writing something else. You said that the file sizes were
what they were supposed to be.
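
To make that concrete, here is a contrived sketch (the function, names and
file are invented for illustration) of how a clobbered pointer produces a
file of the right length with the wrong contents:

//-----------------------------------------------
// Contrived example: save() is handed a pointer that is supposed to
// reference this process's config text. If an earlier bug has made
// pConfig point into some other buffer, the write still succeeds and
// the file length is still right; only the contents are wrong.
#include <fstream.h>

void save( const char *pConfig, int len )
{
    ofstream out( "machine.ini" );   // hypothetical file name
    out.write( pConfig, len );       // writes len bytes from wherever
    out.close();                     //   pConfig actually points
}
//-----------------------------------------------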

Do you have an older version of this program that works? If so, start
diffing.

If you correct the config files does the new version screw up consistently?
If so, maybe you need to play with a debugger.

If the problem turns out to be very intermittent then you have an uphill
battle.

Let me know if you need my address to send that chocolate.


“Stephen F Terrell” <stephen@trsystem.demon.co.uk> wrote in message
news:ac17ob$prr$1@inn.qnx.com


Thanks for the offer to help with the chocolate bars. I will see what I can
do; the wrappers run at about 500 bars/min, so there is a lot of evidence to
hide each time they stop.

I certainly wish it had not happened; I may be offered the chance to live on
site until it is fixed. I did see something similar several years ago on a
different system, but it never happened again.

I agree that the most common reason for writing the wrong data is a pointer
problem of some kind. Unfortunately, the section name strings, e.g.
[wrapping speed], do not exist in the graphics process, so it is difficult to
see how a pointer error could ever create such a sensible but incorrect
output. The shorter, incorrect file also started part way through a section
and finished similarly out of phase.

It certainly looked just as it would if the file buffer pointer had been
corrupted, except pointing into the data area of another process, which
should not be possible in QNX!

I am sure the graphics process, writing the short file that was corrupted,
was pre-empted by the process writing the longer file. Is there any
potential for Fsys to create this type of effect? It seems to be the only
common point between the processes (technically known as grasping at
straws).

It is of course not repeatable! And asking the operators to repeat what they
did is a bit like talking to Martians; they were all on a different planet
at the time!

Oh well, better start climbing the hill. Thanks for your comments and
suggestions. If I ever find out what it was I will share it, even if it was
my fault (egoless programming is so easy, not).

Steve

Here’s one more thought. Do you have any processes that make assumptions
about file descriptors? That is, a program knows that fds 0, 1 and 2 are in
use, so the next one to be allocated must be fd 3, and it therefore does I/O
to fd 3. Only this time, some routine had an extra file open somewhere. As
you said, grasping at straws.
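
For the record, a small sketch of what I mean (the file names are invented):

//-----------------------------------------------
// The fragile pattern: assume fds 0, 1 and 2 are the only ones open,
// so the next open() "must" return 3. If any library routine already
// has a file open, fd 3 belongs to something else entirely and the
// write lands in the wrong file.
#include <fcntl.h>
#include <unistd.h>

void fragile( void )
{
    open( "/home/wrapper/graphics.ini", O_WRONLY );
    write( 3, "speed=500\n", 10 );   // hopes fd 3 is graphics.ini
}

// The safe version: always use the descriptor open() returned.
void safe( void )
{
    int fd = open( "/home/wrapper/graphics.ini", O_WRONLY );
    if( fd != -1 )
    {
        write( fd, "speed=500\n", 10 );
        close( fd );
    }
}
//-----------------------------------------------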

Do let us know if you find out what it was.

“Stephen F Terrell” <stephen@trsystem.demon.co.uk> wrote in message
news:ac3v4o$r2h$1@inn.qnx.com


Hi Bill & Stephen,

This reminds me of a nasty bug I had to track down once… sort of the flip
side of what Bill’s suggesting. If your server process has stdin/stdout/stderr
closed, make sure there are no stray printf’s or the like lying around. In
the case I experienced, I had written a database server that had only two
open fds, one for each database file/table it managed (e.g. 0 and 1). A year
or two after the server was originally written, another programmer added an
archiving facility. The server would spawn an archiving process, and every
time it did so, the database would get trashed. However, that wasn’t
readily apparent at first, and it took quite a bit of time and effort to
correlate the cause and effect. As it turned out, the archiver process
was writing status/debug messages using printf and had inherited the fds from
its parent; the printf’s were overwriting data in the (fd = 1) database
file. As I recall, the light didn’t come on until I happened to do a
hexdump of the db data file and noticed the status messages embedded in
it… and then did a grep of the source code to find out what had generated
them.

An fcntl( fd, F_SETFD, FD_CLOEXEC ) in the database open routine fixed the
problem.
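
In case it helps anyone else, a minimal sketch of the open routine after the
fix (the function name and error handling are invented for illustration):

//-----------------------------------------------
#include <fcntl.h>

// Open a database file and mark the descriptor close-on-exec, so that
// any archiver (or other child) spawned later does not inherit it.
int db_open( const char *path )
{
    int fd = open( path, O_RDWR );
    if( fd == -1 )
        return -1;
    // Without this, a child inherits fd and its stray printf()s can
    // end up writing into the database file.
    fcntl( fd, F_SETFD, FD_CLOEXEC );
    return fd;
}
//-----------------------------------------------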

Another straw-grasping suggestion… but worth at least $.02 US, I think ;-)

Rob

“Bill Caroselli (Q-TPS)” <QTPS@EarthLink.net> wrote in message
news:acb924$37u$1@inn.qnx.com


Stephen F Terrell <stephen@trsystem.demon.co.uk> wrote:

It is almost too horrible to think that this could be a file system problem,
or something happening to the disk drive's own buffer, so I would welcome any
suggestions as to what the cause may have been. It is unlikely that the
system was powered down during the file write.

But you did run chkfsys to make sure that the filesystem wasn’t
corrupted, right?


Wojtek Lerch QNX Software Systems Ltd.

I haven’t yet (because it never occurred to me); the factory is in Scotland,
a couple of hundred miles away, but I will do so on the next visit.
Thanks for the reminder.
Steve


“Wojtek Lerch” <wojtek_l@yahoo.ca> wrote in message
news:acdofg$fk1$1@nntp.qnx.com


Ouch!

“Rob” <rob@spamyouself.com> wrote in message
news:acdkh2$q4a$1@inn.qnx.com


Hi Steve

It never hurts to run chkfsys. In this case I doubt it will turn up
anything that tells you what went wrong; it is more likely to turn up
problems that occurred as a result of this other problem.

I guess if you find any “cross-linked files”, that could be your original
problem. Have a look at which files are cross-linked.

“Stephen F Terrell” <stephen@trsystem.demon.co.uk> wrote in message
news:acduuv$49f$1@inn.qnx.com


“Bill Caroselli (Q-TPS)” <QTPS@EarthLink.net> wrote in message
news:acb924$37u$1@inn.qnx.com

Here’s one more thought. Do you have any processes that make assumptions
about file descriptors? …

I don’t knowingly do that; the actual code is:
//-----------------------------------------------
out.open( (const char *)tempFileName );
if( !out )
{
    Fault("Could not open file for output");
    exit(1);
}

out << "// written on " << sTime << "\n\n";
if( pSections )
{
    for( int i = 0; i < pSections->entries(); i++ )
    {
        ppSection = pSections->find(i);
        pSection = *ppSection;
        out << "[" << pSection->pSectionName << "]\n";
        for( int j = 0; j < pSection->Entries(); j++ )
        {
            pSection->GetKeyAndValue( j, pK, pV );
            out << pK << "=" << pV << "\n";
        }
        out << "\n";
    }
}
else
{
    out << "// empty\n";
}

out.close();
//---------------------------------------------

where pSections = new WCPtrSList<Section *>;

I suppose the other option is that something has happened to the WCPtrSList,
but in such a way that the list integrity is OK. The corrupted file started
with a line containing =20, which is part of a line generated by the
out << pK << "=" << pV << "\n"; code.

I really need to run chkfsys!
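
One thing I might add while I'm in there (a sketch only; realFileName is a
made-up name for wherever tempFileName ends up, since that step isn't shown
above): verify the stream before committing, and only rename the temp file
over the live one on success, so a bad write can never replace a good file.

//-----------------------------------------------
// Sketch, continuing from the code above: commit the temp file only
// if every write (and the close) succeeded. realFileName is
// hypothetical; the real rename step isn't shown in the post.
#include <stdio.h>      // rename(), remove()

out.close();
if( out.fail() )        // any << above or the close itself failed
{
    Fault( "Error writing config file" );
    remove( (const char *)tempFileName );   // drop the bad copy
    exit( 1 );
}
if( rename( (const char *)tempFileName, (const char *)realFileName ) != 0 )
{
    Fault( "Could not install new config file" );
    exit( 1 );
}
//-----------------------------------------------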

I think the clue is that both files have identical time stamps.
I can’t offer any definitive solutions, but here are some things to
consider:

  1. Fsys (or Fsys.eide) has a ‘high water’ level whereby hard disk writes
    are queued up for a while, to minimise hard disk operations. Once the
    level has been reached the writes are done. Perhaps there was a problem
    with the buffering? The file contents could have become mixed up in the
    way you describe if both files were in the buffer at the same time, hence
    the identical timestamps. The buffer threshold could be set to zero, i.e.
    no buffering, to prevent this happening again (an application-level
    alternative is sketched below this list).
  2. Anything in the tracelog around/before the time of the problem?
  3. Have you run chkfsys to verify the filesystem?
  4. It may have been a bizarre one-off hardware glitch in, say, the disk
    controller or RAM. Unlikely though, and very difficult to
    establish/prove.
  5. Are there any other machines running the same software?
  6. Is it an officially released OS version? There may have been a flaw in
    a beta module that was fixed in the release version.

Any gurus out there care to comment on 1)?
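
On 1), if write-behind is the worry, one belt-and-braces option at the
application level (just a sketch; the names are invented, and this says
nothing about what Fsys actually did) is to push the file to disk before
trusting the save:

//-----------------------------------------------
// Sketch: write a config file and force it out of the write-behind
// cache before treating the save as complete.
#include <fcntl.h>
#include <unistd.h>

int write_config( const char *path, const char *data, int len )
{
    int fd = open( path, O_WRONLY | O_CREAT | O_TRUNC, 0644 );
    if( fd == -1 )
        return -1;
    if( write( fd, data, len ) != len ||  // push the bytes to Fsys
        fsync( fd ) != 0 )                // wait for them to hit disk
    {
        close( fd );
        return -1;
    }
    return close( fd );
}
//-----------------------------------------------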

