CompactFlash File/Disk Corruption

I always forget to ask when talking about QNX4, but I am assuming you are compiling in 32-bit mode and running 32-bit executables, etc.

I apologize for my ignorance but I’m not entirely sure how to tell whether they are 16 or 32-bit. Is there an easy way to tell from looking at makefiles or something else?

The default is to build 32 bit executables. If you aren’t sure, then it’s 99.9% likely you are building 32 bit. The only realistic way you’d be building 16 bit is if the code base is very old (like 20+ years).

You can check what you are linking against (slib16 vs slib32) or check the compile options, but I wouldn’t bother if your code base is less than 20 years old.

Tim

The whole story is too long to read in full, but I gather that your problem occurred when you updated the software.
Regarding the popen() call with “netinfo”: your application must run after the “Net” driver has started; I think “netinfo” depends on it.
A function of mine that does this works normally and is stable on QNX 4.25, built with just “cc”.
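For reference, a minimal version of that kind of function might look like this (the function name and buffer size are just illustrative):

#include <stdio.h>

int read_netinfo(void)
{
    char line[256];
    FILE *fp = popen("netinfo", "r");   /* requires the Net driver to be running */

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof(line), fp) != NULL)
        fputs(line, stdout);            /* or parse each line here */
    return pclose(fp);
}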

Thanks for your response! You’re correct that it does need Net to run, and Net is indeed running in our system. The program you put together would also work in my system.

You’re correct, I think we are using 32-bit.

I had another thought: in one of my prior lab failures I got an error message from popen, “popen: No space left on device”, but running df shows my disk still has plenty of space (only 6% used on an 8GB drive). Any idea why that might be?

I know I’m late to the game here but I’ll add what I can.

First, QNX 4 only supports 1 core. If there is another core in the machine, it will be ignored.
QNX 4 did, near the end, support a type of threading, but it was rarely used and kind of funky. Instead of a process having multiple threads, you really had more than one process sharing code and data. You mentioned Fsys having multiple threads. Nope. The file system runs as multiple processes, in that the drivers are separate from Fsys.

I recall issues with QNX 4 if you shut things down shortly after a write. The write might not get to the media. There was an Fsys parameter to mitigate this at the expense of performance.

Maybe this has been suggested already, but you could write your own version of popen that works as follows:

#include <stdio.h>
#include <stdlib.h>

FILE *new_popen(const char *program)
{
    char cmd[300];
    sprintf(cmd, "%s >/tmp/some-file", program);  /* let the shell do the redirect */
    system(cmd);
    return fopen("/tmp/some-file", "r");
}
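At the call site it then reads just like popen(), except that you close the stream with fclose() rather than pclose(), since there is no child process left to wait on:

#include <stdio.h>

/* assumes new_popen() from the sketch above; "sin" is just an example command */
int main(void)
{
    char line[256];
    FILE *fp = new_popen("sin");

    if (fp == NULL)
        return 1;
    while (fgets(line, sizeof(line), fp) != NULL)
        fputs(line, stdout);
    fclose(fp);   /* fclose, not pclose */
    return 0;
}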

Make /tmp a ramdisk.
This only has a problem if the output is large.

Hey, thanks for the feedback I appreciate it! A few follow-ups if I may:

QNX 4 did, near the end, support a type of threading, but it was rarely used and kind of funky.

When you say “near the end” - would that be the 2011 CD? That’s what we’re running in many of our fielded systems.

Make /tmp a ramdisk.
This only has a problem if the output is large.

How large is large?

“When you say ‘near the end’ - would that be the 2011 CD? That’s what we’re running in many of our fielded systems.”

I honestly don’t know. Look for a routine called tfork().

“How large is large?”

The question is how much data the command you run through new_popen() redirects into /tmp/some-file. The ram disk has to be at least large enough to contain that output, or the program will stop prematurely when the ram disk runs out of space.
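If you’re not sure how much that is, run the command once by hand and compare the capture size against the ram disk’s free space (same example names as above):

sin >/tmp/some-file      # run the command once by hand (sin is just an example)
ls -l /tmp/some-file     # worst-case size of the captured output
df /tmp                  # free space on the ram disk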


So I’ve been able to generate another failure in my lab based on the test suggested by @Tim. The symptoms are very similar to before, but not quite the same: this time my popen() calls generate the message “popen: Read-only file system”. The read-only file system part is different from what I’ve seen before, but the overall result seems similar; for example, pipes aren’t working from the command line, so if I run sin | my_program it tells me it can’t create a pipe.

I’ve found some interesting messages in the tracelog (via tracelogger), but I could use some help interpreting them if anyone has ideas. This is what I found there; after these messages, I just constantly get the message “stat failure 75 on /”.

Tracelog snippet:

May 28 11:36:04 5 00005109 Scsi sense (unit=0 scsi=2 err=70h sense=5h asc=24h ascq=0h)
May 28 11:36:06 5 0000510e 00000000 00000000 0000000B
May 28 11:36:09 5 00005109 Scsi sense (unit=0 scsi=2 err=70h sense=5h asc=24h ascq=0h)
May 28 11:36:09 5 00005109 Scsi sense (unit=0 scsi=2 err=70h sense=5h asc=24h ascq=0h)
May 28 11:36:12 5 0000510e 00000000 00000000 0000000B
May 28 11:36:15 5 0000510e 00000000 00000000 0000000B
May 28 11:36:17 5 0000510e 00000000 00000000 0000000B
May 28 11:36:17 2 00003003 Bad block 00037440 on /dev/hd0 during asynchronous write
May 28 11:36:17 5 00005109 Scsi sense (unit=0 scsi=2 err=70h sense=7h asc=0h ascq=0h)
May 28 11:36:17 2 00003003 Bad block 000060D1 on /dev/hd0 during asynchronous write
May 28 11:36:17 2 00003003 Bad block 0000DCAE on /dev/hd0 during asynchronous write
May 28 11:36:19 0 0000301c stat failure 75 on /

I’m especially confused by the Scsi sense message since we’re not using SCSI; we have IDE drives using the Fsys.atapi driver. I could also use some help understanding what “stat failure 75” means.

I ran dcheck -e /dev/hd0 and chkfsys -uf / on the node. dcheck returned no errors at all, and chkfsys returned only 1 error on 1 file: “file busy (erroneously): (passed)”.

Any ideas you have would be greatly appreciated.

I’m going to guess that ‘stat failure 75 on /’ is telling you that the ‘stat’ call returned an error number of 75. You can see what the text equivalent of that is by looking in errno.h or by doing ‘errno 75’ from the command line.

https://www.qnx.com/developers/docs/qnx_4.25_docs/qnx4/utils/e/errno.html
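Or let the C library tell you, if it knows; a two-line program prints whatever text it has for that number (75 being the value from your tracelog):

#include <stdio.h>
#include <string.h>

int main(void)
{
    printf("errno 75: %s\n", strerror(75));
    return 0;
}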

It appears something flipped your filesystem to ‘read only’. That should only happen if there is a consistency issue (i.e., the filesystem wrote something, then tried to read it back and found it wasn’t correct). I suspect the write delay/caching is the issue, but I’ll wait to see what that stat error means.

After running the chkfsys command, are you able to open files and continue on, or did you reboot before running that command?

Tim

75 isn’t in errno.h and the errno command says it’s an unknown error.

After running chkfsys I still can’t write/modify files and still have the same error (but I also ran with the -f flag, which prevents fixing errors; I can try it without that flag and see what happens). I haven’t rebooted yet; I wanted to keep it in the failed state until I was sure I had gathered all the information I might possibly need.

Can you try the last suggestion (sin fd) and see if you ran out of file descriptors (fd’s)? This seems to be a likely culprit.

The other thing is to try a plain ‘mount’ command and see if somehow your filesystem is now mounted as read only (I think a plain mount will show whether it’s mounted as RW or R).

It’s good you haven’t rebooted yet so we can run through a lot of things.

Tim

How do I know if it’s out of file descriptors? I ran sin fd and its output is very long (I piped it to wc -l from another node and it had ~180 lines), but I’m not sure if that means it’s out of fds or not.

I tried mount by itself, but it just printed the usage message; it doesn’t seem like mount by itself has an option that does what you’re looking for (that I can tell).


I wonder if there’s a way to write a test program using some of the built-in functions like fsys_stat() or fsys_fstat() that would tell us if it’s read-only?
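Something like this is what I have in mind; it just uses plain open() and checks errno for EROFS rather than the fsys_* calls (the path is arbitrary, any spot on the suspect filesystem):

#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/writetest", O_WRONLY | O_CREAT, 0644);

    if (fd == -1) {
        if (errno == EROFS)
            printf("filesystem is read-only (EROFS)\n");
        else
            perror("open");
        return 1;
    }
    close(fd);
    unlink("/writetest");
    printf("filesystem is writable\n");
    return 0;
}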

180+ lines means there are a lot of files open. Is that expected?

The output from sin fd should tell you who has what open.
https://www.qnx.com/developers/docs/qnx_4.25_docs/qnx4/utils/s/sin.html

You’ll need to capture that on another node or somehow interpret what gets printed to the screen. If one process or command is leaking descriptors, it should show up a lot in that output.
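For example, from a healthy node, something like this gives a rough per-program count (if I remember right, sin takes -n for a node number; the node and program name here are just placeholders):

sin -n2 fd >/tmp/fdlist          # capture the failed node's fd table from a good node
grep -c my_program /tmp/fdlist   # rough count of entries for one program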

Let’s not worry about the mount command at the moment because it may be that you are out of file descriptors.

Tim

180+ lines means there are a lot of files open. Is that expected?

I think so. Other nodes that are also running the tests but not having issues (yet) have a similar number, and a baseline from a node not under test is similar (but slightly lower because the test programs aren’t running).

Going through the output of sin fd, I think the overall number of fds is driven by the number of processes that are running in general. I don’t see any process with more than 10 fds, and I’ve been capturing the output of all the sin commands (twice per day, triggered by cron); the number fluctuates over time but stays roughly the same.

Would it be helpful if I posted the full output of sin fd? I would need to change some program names to protect our IP, but I don’t think there will be an issue sharing the rest.

If other systems that are correctly running have the same number (or very close to the same number) of fd’s open then it’s not an issue with fd’s being leaked. So in my mind there is no reason for you to go through the effort to post your fd list.

I didn’t ask, but I’m going to assume you aren’t out of anything obvious like memory, and that pipe isn’t using a ton of RAM, etc.

Let’s get back to the popen command failing with a read-only filesystem.

  1. Have you tried to verify whether the file system is indeed in read-only mode right now? For example, can you create a new file (copy an existing file to a brand-new name)?
  2. If doing a ‘sin | myprogram’ fails because it can’t create a pipe, have you verified that Pipe is still running in your system?

If #1 works and your file system is R/W, then it would appear to be a pipe problem (assuming Pipe is still running). You could try slaying Pipe, restarting it, and then seeing if the ‘sin | myprogram’ command works. That would tell you if it’s a pipe issue.
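From the shell, something like this would cover both checks (the file name is just an example; avoid ‘sin | grep’ for #2 since pipes are broken):

cp /etc/passwd /writetest && echo writable   # 1: create a brand-new file on the suspect fs
sin -P Pipe                                  # 2: see if Pipe shows up (-P filters by program name, if I recall)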

Tim

I have verified it’s read-only; I can’t touch or create a file on that node. /bin/Pipe actually isn’t running on these nodes, which matches our field configuration. Why it’s not configured to run, I don’t know; it might be intentional by the designers or it might just be an oversight. I’ve tried running it in the past (in the lab) as a test, but as far as I can tell it hasn’t made a difference (I’ve seen failures both with it running and without).

I also forgot to mention that the node isn’t out of memory. From sin in I can see I’ve got about 97% free memory on the node, and doing sin mem I don’t see any specific process using a crazy amount of memory compared to the others.

Now it definitely sounds like you are experiencing file system corruption that is forcing the filesystem into read-only mode. The most likely culprit is that your CF ran into a dead sector (one that wore out).

This article talks about things you can do (note there is no power safe option in QNX4)

I think your best option is the one that Maschoen and I suggested. Create a RAM drive (say 50 megs in size) and do all your runtime writing to that RAM drive, since none of what you write needs to be permanently stored.

Tim

The most likely culprit is that your CF ran into a dead sector (one that wore out).

I ran a dcheck that didn’t come back with any errors; is it still possible that there’s a dead sector dcheck didn’t catch? And if not, are there any other possible causes you can think of? I realize we’ve brainstormed pretty much everything, but I figured it couldn’t hurt to ask :)

The article you linked just took me to the QNX front page; could you re-post the link?

Weird that it copied a totally different link than the page I was on.

I think what happens with CF drives (all drives, really) is that the filesystem does a write and then a read-back to confirm it wrote properly. When a block goes bad, the read-back fails. I believe the drive internally fixes this by marking the block bad and writing to a different block, so your file really does get written. But from the OS’s point of view at the time of the write, it was a failure. That’s why chkfsys later thinks everything is OK. You can probably google around for what happens on a bad block on a CF disk to know for sure.

Going to a RAM disk will prevent you from wearing out your CF disks. This is what we have done for almost 2 decades now. We have a 50 Meg RAM disk for runtime data that doesn’t need to survive a reboot.

Tim