CompactFlash File/Disk Corruption

How do I know if it’s out of file descriptors? I ran sin fd and its output is very long (I piped it to wc -l from another node and it had ~180 lines), but I’m not sure if that means it’s out of fds or not.

I tried mount by itself, but it just printed the usage message; as far as I can tell, mount on its own doesn’t have an option that does what you’re looking for.


I wonder if there’s a way to write a test program using some of the built-in functions like fsys_stat() or fsys_fstat that would tell us if it’s read-only?
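
Something like this, maybe? It doesn’t actually use fsys_stat()/fsys_fstat() (I’d have to look up their exact interface); it just tries to create a scratch file and reports errno, so the path below is only an example and would need to point at the suspect partition:

```c
/*
 * rotest.c -- crude check for a read-only filesystem (and fd exhaustion).
 *
 * Tries to create a throwaway file and looks at errno:
 *   EROFS          -> filesystem is mounted/forced read-only
 *   EMFILE/ENFILE  -> out of file descriptors (process/system wide)
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#define TESTFILE "/tmp/rotest.tmp"   /* example path only -- use the suspect partition */

int main(void)
{
    int fd = open(TESTFILE, O_WRONLY | O_CREAT | O_EXCL, 0644);

    if (fd == -1) {
        if (errno == EROFS)
            printf("filesystem is read-only (EROFS)\n");
        else if (errno == EMFILE || errno == ENFILE)
            printf("out of file descriptors (%s)\n", strerror(errno));
        else
            printf("create failed: %s\n", strerror(errno));
        return 1;
    }

    close(fd);
    unlink(TESTFILE);                /* clean up the scratch file */
    printf("filesystem is writable\n");
    return 0;
}
```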

180+ lines means there are a lot of files open. Is that expected?

The output from sin fd should tell you who has what open.
https://www.qnx.com/developers/docs/qnx_4.25_docs/qnx4/utils/s/sin.html

You’ll need to capture that on another node or somehow interpret what gets printed to the screen. If one process or command is leaking descriptors it should show up a lot in that output.

Let’s not worry about the mount command at the moment, because it may be that you are out of file descriptors.

Tim

180+ lines means there are a lot of files open. Is that expected?

I think so. Other nodes that are also running the tests but not having issues (yet) have a similar number, and a baseline from a node not under test is similar (but slightly lower because the test programs aren’t running).

Going through the output of sin fd, I think the overall number of fds is driven by the number of processes running in general. I don’t see any process with more than 10 fds, and I’ve been capturing the output of all the sin commands (twice per day, triggered by cron); the number fluctuates over time but stays roughly the same.

Would it be helpful if I posted the full output of sin fd? I would need to change some program names to protect our IP, but I don’t think there will be an issue sharing the rest.

If other systems that are running correctly have the same number (or very close to the same number) of fds open, then it’s not an issue with fds being leaked. So in my mind there is no reason for you to go through the effort of posting your fd list.

I didn’t ask, but I’m going to assume that you aren’t out of anything obvious like memory, and that Pipe isn’t using a ton of RAM, etc.

Let’s get back to the popen command failing with the read-only filesystem error.

  1. Have you tried to verify whether the file system is indeed in read-only mode right now? For example, can you create a new file (copy an existing file to a brand new name)?
  2. If doing a ‘sin | myprogram’ fails because it can’t create a pipe, have you verified that Pipe is still running on your system?

If #1 works and your file system is R/W, then it would appear to be a pipe problem (assuming Pipe is still running). You could try slaying Pipe and restarting it, then seeing if you can do the ‘sin | myprogram’ command. That would tell you if it’s a pipe issue.
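
If you want more than pass/fail, a quick sketch like this (purely illustrative; errno after a failed popen() isn’t guaranteed to be meaningful, so treat what it prints as a hint) would show whether the failure looks like a pipe problem or a read-only filesystem (EROFS):

```c
/*
 * popen_probe.c -- report why a popen() call fails.
 *
 * "sin" stands in for the real 'sin | myprogram' pipeline.  The idea
 * is just to print errno so a pipe problem can be told apart from a
 * read-only filesystem (EROFS).
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void)
{
    FILE *fp;
    char line[256];
    int  n = 0;

    errno = 0;
    fp = popen("sin", "r");
    if (fp == NULL) {
        printf("popen failed: %s (errno %d)\n", strerror(errno), errno);
        return 1;
    }

    /* Read and count the lines so we know the pipe really works end to end. */
    while (fgets(line, sizeof(line), fp) != NULL)
        n++;
    printf("popen OK, read %d lines from sin\n", n);

    pclose(fp);
    return 0;
}
```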

Tim

I have verified it’s read-only; I can’t touch or create a file on that node. /bin/Pipe actually isn’t running on these nodes, which matches our field configuration. Why it’s not configured to run, I don’t know; it might be intentional by the designers or it might just be an oversight. I’ve tried running it in the past (in the lab) as a test, but as far as I can tell it hasn’t made a difference (I’ve seen failures both with it running and without).

I also forgot to mention that the node isn’t out of memory. From sin in I can see I’ve got about 97% free memory on the node. Doing sin mem I don’t see any specific process that’s using up a crazy amount of memory compared to the others.

Now it definitely sounds like you are experiencing filesystem corruption that is forcing the filesystem into read-only mode. The most likely culprit is that your CF ran into a dead sector (one that wore out).

This article talks about things you can do (note there is no power safe option in QNX4)

I think your best option is the one that Maschoen and I suggested: create a RAM drive (say 50 megs in size) and do all your run-time writing to that RAM drive, since none of what you write needs to be permanently stored.
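
On the application side, a sketch like this (the /ramdisk mount point and the RUNTIME_DIR override are just placeholders, nothing QNX-specific) funnels all runtime writes through one path helper, so pointing them at the RAM drive later is a one-line change:

```c
/*
 * runtime_path.c -- build runtime-file paths under one configurable base.
 *
 * "/ramdisk" is only a placeholder for wherever the RAM drive ends up
 * being mounted.  Every scratch/log/temp write goes through one helper,
 * so moving that data off the CF is a single change.
 */
#include <stdio.h>
#include <stdlib.h>

#define DEFAULT_RUNTIME_DIR "/ramdisk"   /* placeholder RAM-drive mount point */

/* Build "<base>/<name>" into buf; caller's buffer must be large enough. */
static const char *runtime_path(char *buf, const char *name)
{
    const char *base = getenv("RUNTIME_DIR");   /* optional override */

    if (base == NULL)
        base = DEFAULT_RUNTIME_DIR;
    sprintf(buf, "%s/%s", base, name);
    return buf;
}

int main(void)
{
    char  path[256];
    FILE *fp = fopen(runtime_path(path, "status.log"), "w");

    if (fp == NULL) {
        perror("fopen");
        return 1;
    }
    fprintf(fp, "runtime data that doesn't need to survive a reboot\n");
    fclose(fp);
    return 0;
}
```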

Tim

The most likely culprit is that your CF ran into a dead sector (one that wore out)

I ran dcheck and it didn’t come back with any errors; is it still possible that there’s a dead sector dcheck didn’t catch? And if not, are there any other possible causes you can think of? I realize we’ve brainstormed pretty much everything, but I figured it couldn’t hurt to ask :slight_smile:

The article you linked to just took me to the QNX front page; could you re-post the link?

Weird that it copied a totally different link than where I was.

I think what happens with CF drives (all drives really) is that the filesystem does a write and then a read back to confirm it wrote properly. When a block goes bad, the read back fails. I believe the drive internally fixes this by marking the block bad and writing to a different block so your file really gets written. But from the OS point of view at the time of write, it was a failure. That’s why chkfsys later thinks everything is OK. You can probably google around about what happens on a bad block on a CF disk to know for sure.
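
Just to illustrate that write-then-read-back idea at the application level (this is not what Fsys itself does internally, and a drive that remaps blocks transparently may still pass it), a trivial sketch would be:

```c
/*
 * wrcheck.c -- write a test pattern to a file, flush it, and read it back.
 *
 * Illustrative only: writes one 512-byte block, forces it out with
 * fsync(), reads it back, and compares.
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define TESTFILE  "/tmp/wrcheck.tmp"   /* example path only -- use the CF partition */
#define BLOCKSIZE 512

int main(void)
{
    char wbuf[BLOCKSIZE], rbuf[BLOCKSIZE];
    int  fd, i;

    for (i = 0; i < BLOCKSIZE; i++)    /* fill with a simple repeating pattern */
        wbuf[i] = (char)(i & 0xff);

    fd = open(TESTFILE, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) { perror("open"); return 1; }

    if (write(fd, wbuf, BLOCKSIZE) != BLOCKSIZE) { perror("write"); return 1; }
    if (fsync(fd) == -1)                         { perror("fsync"); return 1; }

    /* Seek back to the start and re-read the block to compare. */
    if (lseek(fd, 0L, SEEK_SET) == -1)           { perror("lseek"); return 1; }
    if (read(fd, rbuf, BLOCKSIZE) != BLOCKSIZE)  { perror("read");  return 1; }

    printf(memcmp(wbuf, rbuf, BLOCKSIZE) == 0 ? "read-back OK\n"
                                              : "read-back MISMATCH\n");

    close(fd);
    unlink(TESTFILE);
    return 0;
}
```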

Going to a RAM disk will prevent you from wearing out your CF disks. This is what we have done for almost 2 decades now. We have a 50 Meg RAM disk for runtime data we don’t need to survive a reboot.

Tim