CompactFlash File/Disk Corruption

Hello all,

I have some systems running QNX 4.25 on single-board computers that use CompactFlash cards as their disks. We are encountering a fairly difficult issue with corrupted files and drives on these systems.

Symptoms we’re seeing:

  1. Corrupted files: we have seen some files get corrupted, such as /bin/netinfo and /bin/awk (they give either an Input/output error or an exec format error when you try to run them by hand, and the checksum of the file is wrong as well). I give these two specific files as examples because we use them in our applications via popen(), so it’s more obvious when they’re not working (more on that later). We have also seen /bin/sh be corrupted, which causes the node to hang during the boot sequence when it tries to exec the sysinit file. Sometimes these files get corrupted and stay corrupted; other times they’re fine one minute and corrupt the next (and they go in and out like that).
  2. Corrupted drives: sometimes when the node is rebooted it is completely unbootable and gives a message of “Reboot and select proper boot device”.
  3. Application instability: some of our applications stop responding correctly; our users have described it as “going to sleep” (the nodes and applications do not sleep by design or configuration, so this isn’t expected).

System configuration:

We have various configurations in the field, and while the symptoms vary somewhat from configuration to configuration (as you can see above), I think they all tie back to a common issue. Here are some of the configurations we have:

  • Some nodes run QNX 4.25 2009 CD, some run the QNX 4.25 2011 CD.
  • Some nodes run on a single-core processor, some run on an Intel Core 2 Duo (dual-core).
  • The nodes use different parameters for Fsys: some just run /bin/Fsys -Hdisk## -A -r8000, others run /bin/Fsys -Hdisk## -A -r8000 -c0k -d0.
  • The size of the Compact Flash cards varies from as low as 256MB to up to 8GB. I believe that all are either 256MB, 512MB, 2GB, or 8GB.
  • Some of the boards use the Fsys.atapi driver, some use Fsys.eide. In general the single-core boards are using Fsys.eide and the dual-core are using Fsys.atapi (not sure why, to be honest), but I don’t know that this is 100% consistent.

Things we’ve considered:

  • Hardware issues. The obvious culprit, but it doesn’t seem to be the case here (or at least not always). Some of these drives are fairly old so it could be wear-and-tear, but some of them are new (not new enough to be infant mortality, but not old enough for it to be wearout), and verified good before being put into service. It is possible that some of these failures could be attributed to wearout/old cards, but not all of them (in fact I would say most of them are not due to wearout/old cards).
  • Hard reboots/power failures. We always use soft shutdown and aren’t having power failures, but still have these issues.
  • Dual core processor. I’m fairly sure that QNX 4.25 wasn’t designed with dual core processors in mind (not 100% sure if it straight-up doesn’t support them though), but we have issues on single-core processors also. The failure modes aren’t exactly the same (for example we’ve only seen drive corruption on the dual-core boards so far), but we’ve had problems with both configurations so we think that the dual-core processor isn’t the only factor.

Some observations:

  • These issues started to be reported (or at least have been happening with greater frequency) since the introduction of a software update. Reviewing the update, the main thing that sticks out to me is that we started using popen(): we open a read pipe that runs the netinfo command to gather network statistics, then pipe that output through grep and awk to pull out the specific statistics we care about (a minimal sketch of this pattern follows this list). I mention this in particular because we sometimes see “popen: Input/output error” in the terminal, which also seems to point to popen(). We’ve checked the code and we did code the use of popen() correctly (i.e. we’re ensuring that we close the pipe correctly with pclose(), etc.), so we don’t think it’s a bug in how we’re using that function call.
  • I would like to note that the issues may not be tied to the software update and may also not be tied to popen(), these are just possibilities that we’ve considered.
  • We’ve seen these issues happen while the nodes are running without any reboot, shutdown, or power loss.
  • /bin/Pipe is not running on these nodes. We’ve tried running it in our lab, but it doesn’t seem to solve the problem.
  • The systems having issues do write to the drive (mostly fopen() calls, though I suspect that popen() also writes to the drive or interacts with it somehow). We have some systems that do not do any writes to the drive (no fopen(), no popen()), and those systems do not have these same issues, which makes me think that writing to the drive is tied to this.
  • Doing some research I discovered that popen() (in QNX 4.25 at least) is not thread-safe. Our applications are single-threaded only, but I also discovered that Fsys by default runs with 4 threads, and I’ve started to wonder if maybe somehow this could be related. It doesn’t seem possible, but at this point I’m expanding my ideas of what is possible.
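
For reference, the popen() pattern in question looks roughly like this (a minimal sketch; the real command line and the parsing in our application are more involved, and STAT_OF_INTEREST is just a placeholder):

#include <stdio.h>

/* Rough sketch of the popen() usage described above; command line and parsing are simplified. */
void read_net_stats(void)
{
    char line[256];
    FILE *fp = popen("netinfo | grep STAT_OF_INTEREST | awk '{ print $1 }'", "r");

    if (fp == NULL) {
        perror("popen");    /* presumably where "popen: Input/output error" comes from */
        return;
    }
    while (fgets(line, sizeof(line), fp) != NULL) {
        /* convert/store the statistic of interest */
    }
    pclose(fp);             /* the pipe is always closed with pclose() */
}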

Questions/help needed:

These issues are fairly intermittent, but happening often enough to be a major problem. At this point we’re doing a root cause analysis and a major investigation. We’re doing a lot of testing in our laboratory and have seen these failures a couple of times there, but we haven’t been able to determine a root cause, and the reproductions have been infrequent enough that we haven’t found a pattern that would get us to one. I suspect this is tied to some kind of issue with Fsys, either parameters that we could tweak, or a bug in it that we could work around/design around. If I were able to reproduce the issue more frequently and repeatably, I think I could figure out how to do that, but I haven’t had success so far.

What I could use help with is this:

  1. Do you have any ideas about contributing factors beyond what I’ve listed here, something that I’m not considering?
  2. I would greatly appreciate any ideas for tests that I could run that could help me reproduce this issue repeatably. I’m open to writing any kind of shell script or C program (or anything else that would help), or if there’s some existing utility/tool that could help I’m open to that as well.

Any ideas, information, pointing me in the right direction, etc. would be greatly appreciated. Thank you!

Thanks for the very detailed synopsis of your issue.

I did a quick Google search on QNX, popen and bugs, and it appears other people have reported issues.

If I were you, I’d look at finding another way to accomplish what you need without using popen(), then test that in your lab and see if it fixes the issue, especially since you’ve said the problem started after a S/W update that began using popen().

I don’t know much about your application or how much it needs to write to the CF drive (you mentioned some systems don’t write at all), but another thing you could try is creating a RAM drive (assuming you have enough RAM for one) and doing all your reads/writes to the RAM drive. You can even copy the QNX executables up to the RAM drive and run commands directly from it if you put the RAM drive first in your path; I’ve done this many times before when looking for issues. It’s not a long-term solution, but it might help narrow down what’s going wrong, with the bonus that you aren’t corrupting anything important.

Tim

Thanks for your response, that’s very helpful. I have a few follow-ups if I may:

Designing popen() out of the system is certainly one possibility, but I need to be able to reproduce the issue more repeatably to confirm that this actually solves the problem. Do you have any ideas about tests I could try to run to reproduce my issue more repeatably?

Using a ramdisk is a good idea, but I’ve got some questions about that too. I assume files on the ramdisk like /bin/awk and /bin/netinfo could still be corrupted in RAM (assuming they’re loaded into RAM), but at least they would be restored on a node reboot, which would be an improvement. But is there any reason to think that using a ramdisk would lower the chances of corrupting the drive itself? I’m not entirely clear on what is causing the drive itself to be corrupted, which makes me unsure whether a ramdisk would alleviate that issue or not.

The most obvious would be to run the code that calls popen() a lot more frequently. For example, if you’re collecting network stats once an hour, change the code to call it, say, once a second (or every few seconds). That should make the problem occur faster if it’s popen-related. If your application does other disk stuff (esp. writes) I’d increase the frequency of those as well.

The other alternative, if modifying your code isn’t easily feasible, is writing a small test program that uses the same popen calls as your program does and runs them very fast (every second or so), along with doing some other disk-related stuff (copy a 1 meg file, delete it, and do this over and over as fast as you can so you are continually writing to the disk). This may spur the problem to occur faster.
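
Something along these lines, purely as a sketch (the command line, file path, and sizes are placeholders you’d swap for your real ones):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define TOTAL (1024L * 1024L)    /* ~1 MB per pass; adjust relative to the Fsys cache size */

int main(void)
{
    static char buf[8192];
    char line[256];
    FILE *pp, *fp;
    long written;

    memset(buf, 0xA5, sizeof(buf));

    for (;;) {
        /* 1) the same popen() call your application makes (placeholder command line) */
        pp = popen("netinfo | grep STAT_OF_INTEREST | awk '{ print $1 }'", "r");
        if (pp == NULL) {
            perror("popen");
        } else {
            while (fgets(line, sizeof(line), pp) != NULL)
                ;                              /* just drain the pipe */
            pclose(pp);
        }

        /* 2) hammer the CF card: write ~1 MB, then delete the file, over and over */
        fp = fopen("/stress.tmp", "w");        /* placeholder path on the CF filesystem */
        if (fp != NULL) {
            for (written = 0; written < TOTAL; written += sizeof(buf))
                fwrite(buf, 1, sizeof(buf), fp);
            fclose(fp);
            unlink("/stress.tmp");
        }

        sleep(1);                              /* roughly one pass per second */
    }
}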

Reading/writing to a RAM disk is orders of magnitude faster than writing even to CF (or an SSD for that matter), so that may narrow the window in which the corruption can occur. The other thing about using the RAM disk is that you aren’t wearing out your hardware, nor are you using the disk controller driver (just in case there is a bug in there someplace).

How do you get your systems back in order once this problem happens? Does it fix itself on a reboot or do you have to run chkfsys or do you have to do a full install including QNX? If you aren’t running chkfsys (with forced fixing of errors) as part of your boot process, you should consider doing it.

The other thing to look at with Fsys is cache size. Maybe some of your issues have to do with caching (prior to doing writes, it likely never mattered) and how soon Fsys writes the cache (write delay). It may be that the corruption is cache-related in combination with popen. Just grasping for ideas for you to look at. I think the test program that writes constantly along with doing popen might be your best bet to replicate the issue.

Tim

Thanks for the suggestions, I will definitely try that. I have been using a test program that runs our popen() call repeatedly, forever, as fast as possible (no delay in the while loop). It did reproduce the drive corruption once, but it took over a month to do it and only happened once, so I’ve been trying to find something that can be a little more effective. I can try combining that with a test program that also writes to the disk heavily. I’ve been running a separate test program (by itself) that constantly writes a big file, truncates it with ltrunc(), and then extends it again, over and over (I used ltrunc() because it’s also not thread-safe, going back to my thread-safety theory; a rough sketch of that loop is below), but that hasn’t been very effective. I’ve been thinking of just using a shell script that uses cp and rm to copy and remove a big file, but haven’t tried that yet. I’ll give that a try and combine it with the popen() test program to see if that can be more effective.
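
The write/truncate/extend loop is roughly the following (a sketch; the path and sizes are placeholders, and the exact ltrunc() arguments should be double-checked against the QNX 4 library docs):

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define BIG (4L * 1024L * 1024L)    /* placeholder size, intended to exceed the Fsys cache */

int main(void)
{
    static char buf[8192];
    long written;
    int fd = open("/bigfile.tst", O_RDWR | O_CREAT, 0644);    /* placeholder path on the CF disk */

    if (fd < 0) {
        perror("open");
        return 1;
    }
    memset(buf, 0x5A, sizeof(buf));

    for (;;) {
        /* extend the file by writing BIG bytes from the start */
        lseek(fd, 0L, SEEK_SET);
        for (written = 0; written < BIG; written += sizeof(buf))
            write(fd, buf, sizeof(buf));

        /* then truncate it back to zero length; ltrunc() is QNX-specific */
        ltrunc(fd, 0L, SEEK_SET);
    }
}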

How do you get your systems back in order once this problem happens? Does it fix itself on a reboot or do you have to run chkfsys or do you have to do a full install including QNX? If you aren’t running chkfsys (with forced fixing of errors) as part of your boot process, you should consider doing it.

For cases where there’s a corrupt file, copying a non-corrupt version of the file over the corrupt file will correct the issue. Rebooting (we do run chkfsys -u in our sysinits) doesn’t fix it.

For corrupt drives we have to replace the CompactFlash card to get back online. In the field, cards are cloned while in a known-good state (using a machine that does sector-by-sector cloning) before being placed in service, and a failed card is replaced with one of those clones. As far as we can tell the clones are good, and most of the failures that have been reported are on original/uncloned drives, so we don’t think cloning is a contributing factor.

The other thing about using the RAM disk is that you aren’t wearing out your hardware, nor are you using the disk controller driver (just in case there is a bug in there someplace).

When you say disk controller driver I assume that’s some kind of firmware/controller in the CompactFlash card itself and not something that’s part of QNX like Fsys, right?

The other thing to look at with Fsys is cache size. Maybe some of your issues are to do with caching (prior to doing writes, it likely never mattered) and how soon Fsys writes the cache (write delay). It may be the corruption is cache related with popen.

Would you expect that having a smaller cache or a bigger cache would cause a problem? i.e. if I set the cache to a small value or even 0 would you expect that to make it better, or would a big cache be better? Can you think of any tests I might be able to run that would specifically target/test for issues with caching?

A month to only get a single failure is a VERY long time :grimacing: Can I presume the popen call in your test program does the exact same thing as the one in your application (get network stats)? If it doesn’t I’d change it so it makes the exact same calls.

How often is this happening in the field (ie how many systems do you have and how many have reported the problem and have any reported it more than once after fixing the issue)?

No, I meant that the RAM drive doesn’t use Fsys.eide or Fsys.atapi, since those are what talk to the hardware controller. So if the problem is some interaction between Fsys and those drivers, writing to the RAM disk will bypass it. It’s a long shot that that’s where the problem is, of course, but if you can eventually reproduce the problem reliably, the RAM disk would let you test whether that’s where it is.

Here’s another question for you. Does your application run as root? If it does, can you run as non-root and see if that helps (there are usually ways to elevate things that need root permission on an ‘as needed’ basis)? Running as non-root should prevent overwriting actual commands like awk/netinfo since your app would lack permission to write to those directories/files.

A smaller cache means more disk reads (or writes), obviously. A large cache may allow commands from multiple processes/threads to be cached together, which could allow for some kind of in-memory corruption in Fsys; a smaller cache means fewer commands cached and less chance of commands from multiple threads/processes being in memory at the same time. Whether a smaller or larger cache would make a difference is impossible to say until we know the problem - LOL. I’d suggest trying both a smaller and a larger cache, and also experimenting with the write-back delay (smaller and larger), in conjunction with your test program, in hopes of getting more failures.

Since you are writing a ‘big’ file as part of your test program, I’d make sure the amount you write exceeds the cache size (both the large and the small cache size). If that doesn’t cause a problem, the next case is to write smaller than the cache size (so the write-back delay comes into play).

Tim

Can I presume the popen call in your test program does the exact same thing as the one in your application (get network stats)? If it doesn’t I’d change it so it makes the exact same calls.

How often is this happening in the field (ie how many systems do you have and how many have reported the problem and have any reported it more than once after fixing the issue)?

It does use the exact same call as what’s in my production application, yeah.

Without getting into too much detail, in the field we’ve had about 10ish failures across 4 systems in the course of about 12 months (that includes failures of both corrupt files and corrupt drives), so it’s fairly intermittent there as well. There have been repeated failures at all of the sites.

Here’s another question for you. Does your application run as root?

My application does run as root. I think this application could run as non-root, but I’d have to confirm that nothing it does needs elevated privileges. It’s a good idea for me to pursue though, thanks for bringing that up.

Since you are writing a ‘big’ file as part of your test program I’d make sure the amount you write exceeds the cache size (both large and small cache size).

If I don’t explicitly set a cache size using the -c parameter, the Fsys help pages say the cache size is 1/8 of total memory. So if my computer has 1GB of memory then my cache is ~134MB? Is that right? That value sounds high, that’s why I’m asking. Do I just need to make sure that the file I’m copying is bigger than that value, or do I need to ensure that I’m able to copy more than that amount in less time than the write-behind delay specified with the -d parameter to Fsys? I’m not sure if that question makes sense; to be honest, caching is not my strongest suit, so I apologize if this is a dumb question.

Multiple repeated failures at multiple sites within a year are enough to say it’s not random (like a single CF going bad) and that something is going on.

QNX 4 is very old, and 1/8 of memory was only a few hundred kilobytes in the early 90s. I imagine the 1/8 is still true and that it’s using ~134 megs of cache. Doing a ‘sin’ command and looking at memory for Fsys should confirm the cache size (I can’t recall now which option you might need with sin to see memory usage, might be ‘sin format m’).

Both, if you want to exceed the cache.

If your Fsys really is using 134 megs of cache, setting it to a smaller value (even something like 50 kilobytes) may well make the problem appear more/less often. At 134 megs in size it would be unlikely to ever exceed the cache under normal operations. It’s worth exploring changing the size to see what happens.

I always forget to ask when talking about QNX4, but I am assuming you are compiling 32 bit mode and running against 32 bit executables etc.

Tim

I always forget to ask when talking about QNX4, but I am assuming you are compiling 32 bit mode and running against 32 bit executables etc.

I apologize for my ignorance but I’m not entirely sure how to tell whether they are 16 or 32-bit. Is there an easy way to tell from looking at makefiles or something else?

The default is to build 32 bit executables. If you aren’t sure, then it’s 99.9% likely you are building 32 bit. The only realistic way you’d be building 16 bit is if the code base is very old (like 20+ years).

You can check what you are linking against (slib16 vs slib32) or check compile options but I wouldn’t bother if your code base is newer than 20+ years old.

Tim

It’s too long for me to read the whole story, but I did read that your problem occurred when you updated the software.
Considering the popen() function with “netinfo”: your application must run after the “Net” driver starts, and I think “netinfo” depends on it.
This function of mine works normally and is stable on QNX 4.25, just built with “cc”.

Thanks for your response! You’re correct that it does need Net to run, and Net is indeed running in our system. The program you put together would also work in my system.

You’re correct, I think we are using 32-bit.

I had another thought: in one of my prior lab failures I got an error message from popen: “popen: No space left on device”, but running df shows my disk still has plenty of space (only 6% used on an 8GB drive). Any idea why that might be?

I know I’m late to the game here but I’ll add what I can.

First, QNX 4 only supports 1 core. If there is another core in the machine, it will be ignored.
Near the end QNX 4 did support a type of threading, however it was rarely used and it was kind of funky. Instead of a process having multiple threads, you really had more than one process sharing code and data. You mentioned Fsys having multiple threads. Nope. The file system runs in multiple processes, in that the drivers are separate from Fsys.

I recall issues with QNX 4 if you shut things down shortly after a write. The write might not get to the media. There was an Fsys parameter to mitigate this at the expense of performance.

Maybe this has been suggested already, but you could write your own version of popen() that would work as follows:

#include <stdio.h>
#include <stdlib.h>

FILE *new_popen(const char *program)
{
    char cmd[512];
    sprintf(cmd, "%s > /tmp/some-file", program);   /* run the command, sending its output to a file */
    system(cmd);
    return fopen("/tmp/some-file", "r");            /* caller reads this just like popen()'s FILE * */
}

Make /tmp a ramdisk.
This only has a problem if the output is large.
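
Called from your code it would look something like this (a sketch; it assumes the new_popen() above and a plain “netinfo” command, so substitute whatever command line you actually use):

#include <stdio.h>
#include <unistd.h>

FILE *new_popen(const char *program);    /* the helper sketched above */

int main(void)
{
    char line[256];
    FILE *fp = new_popen("netinfo");     /* assumed command; use your real netinfo/grep/awk line */

    if (fp != NULL) {
        while (fgets(line, sizeof(line), fp) != NULL) {
            /* pick out the statistics of interest */
        }
        fclose(fp);                      /* fclose(), not pclose(): this is not a real pipe */
        unlink("/tmp/some-file");        /* tidy up the temp file on the RAM disk */
    }
    return 0;
}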

Hey, thanks for the feedback I appreciate it! A few follow-ups if I may:

QNX 4 did near the end support a type of threading however it was rarely used and it was kind of funky.

When you say “near the end” - would that be the 2011 CD? That’s what we’re running in many of our fielded systems.

Make /tmp a ramdisk.
This only has a problem if the output is large

How large is large?

“When you say “near the end” - would that be the 2011 CD? That’s what we’re running in many of our fielded systems.”

I honestly don’t know. Look for a routine called tfork().

“How large is large?”

The question is how much data the command you run through system() (redirected to /tmp/some-file) produces. The RAM disk has to be at least large enough to contain it, or the program will stop prematurely when the RAM disk runs out of space.


So I’ve been able to generate another failure in my lab based on the test suggested by @Tim. The symptoms are very similar to before, but not quite the same: this time my popen() calls generate the message: “popen: Read-only file system”. The read-only file system bit is different than what I’ve seen before, but the overall result seems to be similar; for example pipes aren’t working from the command line, so if I run sin | my_program it will tell me that it can’t create a pipe.

I’ve found some interesting messages in the tracelog (via tracelogger), but I could use some help interpreting them if anyone happens to have some ideas. This is what I found there; after these messages, I just constantly get the message “stat failure 75 on /”.

Tracelog snippet:

May 28 11:36:04 5 00005109 Scsi sense (unit=0 scsi=2 err=70h sense=5h asc=24h ascq=0h)
May 28 11:36:06 5 0000510e 00000000 00000000 0000000B
May 28 11:36:09 5 00005109 Scsi sense (unit=0 scsi=2 err=70h sense=5h asc=24h ascq=0h)
May 28 11:36:09 5 00005109 Scsi sense (unit=0 scsi=2 err=70h sense=5h asc=24h ascq=0h)
May 28 11:36:12 5 0000510e 00000000 00000000 0000000B
May 28 11:36:15 5 0000510e 00000000 00000000 0000000B
May 28 11:36:17 5 0000510e 00000000 00000000 0000000B
May 28 11:36:17 2 00003003 Bad block 00037440 on /dev/hd0 during asynchronous write
May 28 11:36:17 5 00005109 Scsi sense (unit=0 scsi=2 err=70h sense=7h asc=0h ascq=0h)
May 28 11:36:17 2 00003003 Bad block 000060D1 on /dev/hd0 during asynchronous write
May 28 11:36:17 2 00003003 Bad block 0000DCAE on /dev/hd0 during asynchronous write
May 28 11:36:19 0 0000301c stat failure 75 on /

I’m especially confused by the Scsi sense messages since we’re not using SCSI; we have IDE drives using the Fsys.atapi driver. I also could use some help understanding what stat failure 75 means.

I ran dcheck -e /dev/hd0 and chkfsys -uf / on the node. dcheck returned no errors at all, and chkfsys returned only 1 error on 1 file: “file busy (erroneously): (passed)”.

Any ideas you have would be greatly appreciated.

I’m going to guess that ‘stat failure 75 on /’ is telling you that the ‘stat’ command returned an error number of 75. You can see what the text equivalent of that is by looking in errno.h or by doing ‘errno 75’ from the command line:

https://www.qnx.com/developers/docs/qnx_4.25_docs/qnx4/utils/e/errno.html
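
If the errno utility isn’t handy on the node, a couple of lines of C give the same answer (a trivial sketch):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* print the C library's text for the error number seen in the tracelog */
    printf("errno 75: %s\n", strerror(75));
    return 0;
}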

It appears something flipped your filesystem to ‘read only’. That should only happen if there is a consistency issue (i.e. the filesystem wrote something, then tried to read it back and found it wasn’t correct). I suspect the write delay/caching is the issue, but I’ll wait to see what that stat error means.

After running the chkfsys command are you able to open files and continue on or did you reboot first before running that command?

Tim

75 isn’t in errno.h and the errno command says it’s an unknown error.

After running chkfsys I still can’t write/modify files and still have the same error (but I also ran with the -f flag that prevents fixing errors; I can try it without that flag and see what happens). I haven’t rebooted yet; I wanted to keep it in the failed state until I was sure I had gathered all the information I might possibly need.

Can you try the last suggestion (sin fd) and see if you ran out of file descriptors (fd’s)? This seems to be a likely culprit.

The other thing is to try a plain ‘mount’ command and see if somehow your filesystem is now mounted as read only (I think a plain mount will show whether it’s mounted as RW or R).

It’s good you haven’t rebooted yet so we can run through a lot of things.

Tim