Logging to USB Flash Sticks mysteriously fails

We have a hard-to-reproduce problem and I’m wondering if anyone here has seen this behavior and can offer some insight.

Our pxa270 system logs data to a FAT formatted USB flash stick.

Sometimes, in the middle of logging, the fprintf() call that writes to the file fails with errno EIO. When this happens, the flash stick is no longer visible in the namespace and syslog shows errors like these, repeating every 4-5 seconds:

umass_bulk_reset: path 0, devno 1
CLASS_ResetDevicePort: dno 1, vid 3538
USB_SelectConfiguration: Set config devno 1, cfg 1

When the stick is removed and reinserted, it mounts correctly and the log file that was being written to is there, but it has been truncated and data has been lost.

With only QNX components running we wrote a small test that looped, continuously writing data records to a file on the stick, pausing 1ms between writes to simulate the real system, and the same problem happened after about 11M records (stick was only 45% full).
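A loop of the kind described above might look roughly like the following. This is only a sketch, not the actual test program: the function name `write_test_records` and the fixed-width record layout are made up for illustration, and the pause uses POSIX `nanosleep()`.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Writes `count` fixed-width records to `path`, pausing `delay_ms`
 * between writes. Returns the number of records accepted by fprintf(),
 * or -1 if the file cannot be opened.
 * The record layout (27 bytes per line) is hypothetical. */
long write_test_records(const char *path, long count, long delay_ms)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL) {
        fprintf(stderr, "fopen(%s): %s\n", path, strerror(errno));
        return -1;
    }

    struct timespec pause = {
        .tv_sec = delay_ms / 1000,
        .tv_nsec = (delay_ms % 1000) * 1000000L
    };
    long written = 0;

    for (long i = 0; i < count; i++) {
        /* With a buffered stream, a device error usually surfaces on
         * the call that happens to flush the buffer. */
        if (fprintf(fp, "%010ld,%015ld\n", i, i * 7L) < 0) {
            fprintf(stderr, "write failed at record %ld: %s\n",
                    i, strerror(errno));
            break;
        }
        written++;
        if (delay_ms > 0)
            nanosleep(&pause, NULL);
    }

    /* fclose() flushes the remaining data and can fail as well. */
    if (fclose(fp) != 0)
        fprintf(stderr, "fclose(%s): %s\n", path, strerror(errno));

    return written;
}
```

Calling it with the path of a file on the stick, a large record count, and a 1 ms delay approximates the test; logging the record index at failure makes runs on different sticks comparable.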

It may well be a bad stick, but we’ve seen it several times on different brands and sizes of sticks, so we are suspicious of the software.

Any suggestions? We are running QNX 6.3.2, XP hosted Momentics 4.0.1 and using the 4.2.1 compiler.

TIA

Ken,

Now I know why we use ethernet to remotely log data :)

My first suggestion would be to insert a USB drive that started at 45-50% full when you do your test.

I’m wondering if it has something to do with the USB drive filling up and responding progressively more slowly to the QNX driver, causing the driver to time out and conclude the device is gone. Many USB devices do all kinds of little tricks in firmware to evenly distribute the writes across the entire flash memory to prevent parts wearing out too early. So I wonder if a half full drive (i.e. can you start with a log file of 10 million records and append to it) causes slowdowns due to those firmware tricks, especially at a 1 ms write interval, which is pretty quick.

The other thing to do would be to slow down your write speed from 1 ms to 10 ms (keep 10 ms worth of data and write once every 10 ms instead of 10 times in 10 ms) and occasionally issue a ‘sync’ command to ensure the USB is properly writing the data from the cache.
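To make the batching idea concrete, here is a rough sketch that holds ten records in memory, issues one write() per batch, and calls fsync() every so often. Everything named here (`batch_log`, `batch_log_add`, the sync interval of 100 batches) is hypothetical and chosen only for illustration. It uses plain POSIX file descriptors rather than stdio so that each batch really is a single write.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BATCH_RECORDS      10   /* records held before one write() */
#define RECORD_MAX         64   /* hypothetical upper bound per record */
#define SYNC_EVERY_BATCHES 100  /* hypothetical fsync() interval */

typedef struct {
    int    fd;
    char   buf[BATCH_RECORDS * RECORD_MAX];
    size_t used;     /* bytes currently held in buf */
    int    held;     /* records currently held in buf */
    long   batches;  /* batches written so far */
} batch_log;

int batch_log_open(batch_log *log, const char *path)
{
    log->fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    log->used = 0;
    log->held = 0;
    log->batches = 0;
    return log->fd < 0 ? -1 : 0;
}

/* Write whatever is held; fsync() every SYNC_EVERY_BATCHES batches. */
int batch_log_flush(batch_log *log)
{
    size_t off = 0;
    while (off < log->used) {
        ssize_t n = write(log->fd, log->buf + off, log->used - off);
        if (n < 0)
            return -1;          /* errno is set by write(), e.g. EIO */
        off += (size_t)n;
    }
    log->used = 0;
    log->held = 0;
    if (++log->batches % SYNC_EVERY_BATCHES == 0 && fsync(log->fd) != 0)
        return -1;
    return 0;
}

/* Queue one record; flush automatically once a full batch is held. */
int batch_log_add(batch_log *log, const char *rec, size_t len)
{
    if (len > RECORD_MAX)
        return -1;
    memcpy(log->buf + log->used, rec, len);
    log->used += len;
    log->held++;
    return log->held == BATCH_RECORDS ? batch_log_flush(log) : 0;
}
```

The caller would add one record per incoming event and call `batch_log_flush()` once more before closing the descriptor, so a partial final batch is not lost.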

One more thing. You say you have 11 million records. In one file or several files? I can’t remember if there is a max filesize (2 gig?) in FAT format that you are possibly exceeding since 11 million records is quite a lot (at 100 bytes per record that’s 1 gig).

Tim

Thanks for the ideas. I wish we didn’t have to use flash sticks but it’s a marketing requirement for customer data acquisition. Our product also has some internal flash for data storage that never seems to have a problem, but it’s much smaller (only 8MB) and it bypasses the USB and uses the QNX4 filesystem.

To clarify, the way the code works is that laser pulses arrive at a 1ms rate. fprintf() is used to format and write each one, but setvbuf is used when the file is first opened to establish a 16k buffer that should only hit the flash when buffer fills. Individual records are 27 bytes so, if my math is right, one 16k block is written every ~600ms.
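For reference, the buffering setup and the arithmetic can be sketched as below. This is an illustration, not the product code: the helper names are invented, and the explicit static buffer is an assumption (the C standard lets an implementation ignore the size argument when setvbuf() is given a NULL buffer, so passing a real buffer is the safer way to get 16k).

```c
#include <stdio.h>

#define LOG_BUF_SIZE 16384  /* 16k stdio buffer, as described above */
#define RECORD_SIZE  27     /* bytes per formatted record */

/* The buffer must outlive the stream, so it is static here.
 * This limits the sketch to one open log at a time. */
static char log_buf[LOG_BUF_SIZE];

/* Open `path` for writing with a fully buffered 16k stream.
 * Returns NULL on failure. */
FILE *open_buffered_log(const char *path)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return NULL;
    /* setvbuf() must be called before any other I/O on the stream. */
    if (setvbuf(fp, log_buf, _IOFBF, sizeof log_buf) != 0) {
        fclose(fp);
        return NULL;
    }
    return fp;
}

/* Approximate time between buffer flushes, in milliseconds, when one
 * record of `record_size` bytes arrives every `period_ms`. */
long flush_interval_ms(long buf_size, long record_size, long period_ms)
{
    return (buf_size / record_size) * period_ms;
}
```

With these numbers, 16384 / 27 is 606 whole records, so `flush_interval_ms(16384, 27, 1)` gives 606, in line with the ~600ms estimate.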

The records are all in one file, but according to Wikipedia, FAT16 and FAT32 can hold files of up to 4 GB minus one byte (cluster size varies).

Try these parameters:
devb-umass cam timeout=0:0:0:4,retries=3 …

Thanks.

Here devb-umass is started by umass-enum so I hacked those options into the umass-enum source and a test is now running (takes about five hours).

But I’ve got to ask, where are those parameters documented?

An overnight test showed that the ‘devb-umass cam timeout=0:0:0:4,retries=3’ change didn’t help.
Same symptoms, same results.

Ken,

At 27 bytes * 11 million records that’s a file about 300 megs in size.

I’d still be tempted to write a test program that copies a 300 meg file onto the USB flash disk, then opens it in append mode and starts writing 27 byte records at your usual logging speed.

At this point it seems either it’s the sheer number of writes (11 million) that causes the problem or the file size (300 megs). At least by starting with a 300 meg file you can eliminate that part of the equation and then focus on the 11 million writes part.
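A test along these lines could be put together from two small helpers, sketched here under assumptions: `prefill_file` and `append_records` are hypothetical names, the filler content and record layout are arbitrary, and the pause between records is left out for brevity (a real run would pace the writes at the usual logging rate).

```c
#include <stdio.h>
#include <string.h>

/* Create `path` and fill it with `target_bytes` of filler data, so the
 * append test starts from an already large file. Returns 0 or -1. */
int prefill_file(const char *path, long target_bytes)
{
    static char block[4096];
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    memset(block, 'x', sizeof block);
    long done = 0;
    while (done < target_bytes) {
        long want = target_bytes - done;
        size_t chunk = want < (long)sizeof block ? (size_t)want
                                                 : sizeof block;
        if (fwrite(block, 1, chunk, fp) != chunk) {
            fclose(fp);
            return -1;
        }
        done += (long)chunk;
    }
    return fclose(fp) == 0 ? 0 : -1;
}

/* Open `path` in append mode and add `count` 27-byte records.
 * Returns the file size afterwards, or -1 on any error. */
long append_records(const char *path, long count)
{
    FILE *fp = fopen(path, "ab");
    if (fp == NULL)
        return -1;

    for (long i = 0; i < count; i++) {
        if (fprintf(fp, "%010ld,%015ld\n", i, i) < 0) {
            fclose(fp);
            return -1;
        }
    }
    if (fflush(fp) != 0 || fseek(fp, 0, SEEK_END) != 0) {
        fclose(fp);
        return -1;
    }
    long size = ftell(fp);
    return fclose(fp) == 0 ? size : -1;
}
```

If the failure shows up soon after appending starts, file size is the likelier trigger; if it still takes millions of records, the number of writes is.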

Tim

Hi Tim,

That would be an interesting test, thanks for the idea. It would save time during testing and help isolate if it’s the file size or the general level of activity. Our code never appends to log files but test code could. The failure is not always at 11 million writes, it’s anywhere from 5M up.

I’m tempted to blame the sticks because half of our sticks don’t exhibit the problem. The problem follows the stick - when a stick has problems it always has problems and the el-cheapo promo sticks are the worst. So far all of our Sony and Lexar sticks have worked flawlessly.

Of course they all work fine on Windows. Very frustrating.

Ken,

Something tells me that you are going to find that the name brand Sony and Lexar sticks are going to have the fastest seek/write times and that the el-cheapo stuff has the worst.

Here at my company we bought a bunch of 1 gig cheapo sticks to brand with our company logo to give out at trade shows. I’ve used them to transfer files between Windows and QNX and complained to the person who bought them that they shouldn’t have gotten USB 1.0 stuff. I was then shocked to be told (and verify) they were in fact USB 2.0. So while they may support USB 2.0 standards, they most definitely do not read/write at 2.0 speeds.

Tim

Tim,

We found the same thing here. The marketing guys bought these sleek looking stainless steel 128MB sticks with the company logo screened on them. They enumerate with the “CMB” brand as 2.0 devices but they are a lot slower than the name brand sticks. In testing only about half of them work properly with our system. I wish there was a good way to qualify these things (specs don’t tell the story if specs can even be found). The promo sticks probably contain 256MB or larger flash chips that failed testing due to excessive numbers of bad bits.

Ken

I just want to throw something out for thought, since it was mentioned that there is never a problem with Windows. It has happened more than once that a piece of hardware that worked fine with DOS or Windows broke down under QNX, not because of a driver problem, but because QNX was able to hit the hardware so much faster than Windows ever could.

Do you know what happens if you slow the whole process down?

I’ve fiddled with that a little by changing the delay between individual writes from .5 to 1 to 2 ms. It didn’t make any difference.

On Windows it takes minutes to copy a 300MB file to the questionable sticks and they always copy fine.

Our test takes five hours or so to run, the resulting files are smaller and the files are corrupted on the same sticks.

That’s not to say it couldn’t be a speed related issue. USB would force some level of pacing on stick bound packets but the interval between subsequent commands or the ordering of metadata updates could be different.

All of the sticks used are USB 2.0 but there seems to be a wide variation in sustained write throughput between them. I think all the sticks that fail are on the slower side performance-wise.

Ken,

Unfortunately you are probably going to have to tell your customers that they can only buy certain brands that are known to work. If they question why, you can always tell them that the no-name cheap stuff is too slow to keep up with an RTOS like QNX :)

The only other alternative is going to be to completely change how your app does logging. You’ll probably have to log to a RAM drive and then periodically copy the log file(s) to the flash disk at a speed slow enough to guarantee they don’t crap out.
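As a rough illustration of the periodic-copy idea, the sketch below copies a file in small chunks with a pause after each one. The function name `throttled_copy`, the 4096-byte cap on chunk size, and the use of POSIX `nanosleep()` are all assumptions made for the example; suitable chunk and pause values would have to be found by testing against the slow sticks.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Copy `src` to `dst` in chunks of at most `chunk_bytes` (capped at
 * 4096), flushing and sleeping `pause_ms` after each chunk.
 * Returns the number of bytes copied, or -1 on error. */
long throttled_copy(const char *src, const char *dst,
                    size_t chunk_bytes, long pause_ms)
{
    char buf[4096];
    if (chunk_bytes == 0 || chunk_bytes > sizeof buf)
        chunk_bytes = sizeof buf;

    FILE *in = fopen(src, "rb");
    if (in == NULL)
        return -1;
    FILE *out = fopen(dst, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    struct timespec pause = {
        .tv_sec = pause_ms / 1000,
        .tv_nsec = (pause_ms % 1000) * 1000000L
    };
    long total = 0;
    int failed = 0;
    size_t n;

    while ((n = fread(buf, 1, chunk_bytes, in)) > 0) {
        /* Flush each chunk so the device sees a steady, slow stream. */
        if (fwrite(buf, 1, n, out) != n || fflush(out) != 0) {
            failed = 1;
            break;
        }
        total += (long)n;
        if (pause_ms > 0)
            nanosleep(&pause, NULL);
    }
    if (ferror(in))
        failed = 1;

    fclose(in);
    if (fclose(out) != 0)
        failed = 1;
    return failed ? -1 : total;
}
```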

Actually, another option, though probably not one you can use, is to go into your BIOS and force the USB bus to operate in 1.0 mode (most BIOSes allow this). Then at least io-usb will know the connected flash drives are going to write VERY slowly. That may work if the drivers have a big enough buffer (not sure if you can specify one like you can for devb-eide) to accommodate the amount of data arriving from your app vs the speed of 1.0 USB.

Tim

Thanks for the suggestions Tim.

Changing the way logging is done wouldn’t really be possible because requirements are such that we be able to log files up to the maximum size of the flash stick and the board only has 32MB of SDRAM :)

As you suspected, there is no BIOS on this board and the pxa270 only uses USB 1.1 so that’s what the sticks enumerate as.

QNX support sent me a new version of devb-umass and usbd lib. With this update the meter logged 18M error-free events over the weekend on a stick that has never worked before.

Hopefully that will fix the problem for good but we’ll have to test for a few days to be sure.

Perhaps you are not interested in experimenting with undocumented options.
If you are, try
devb-umass cam bounce …

ysinitsky,

Thank you for mentioning the undocumented cam options. Can you tell us what they do?

QNX support is unaware of those options.

Specifically, the “cam timeout=0:0:0:4,retries=3” and the “cam bounce” options.

Thanks!