Problem with devb-eide detecting devices

Tim · July 13, 2004, 6:32pm

A couple of days ago, under the Help forum, I posted about a problem installing 6.3 on the hard disk in my Flash Boot rig:

openqnx.com/PNphpBB2-viewtopic-t2569-.html

After taking the hard disk out and putting it in another machine I was finally able to install the OS, so I put it back in my flash rig. At that point it booted up just fine and I thought everything was well. I should have known better…

Yesterday I tried to use the flash rig for the first time to install a 6.3 image onto a SanDisk 256 MB flash disk. First I removed the flash disk that was currently in there (one that already had 6.1 installed on it) and replaced it with a brand new flash disk. Then I booted up and attempted to go through the installation procedure to install QNX to the flash disk.

The first thing I noticed is that as 6.3 boots up from the hard drive, devb-eide doesn’t seem to find any devices at all. It just spins there for about 20 seconds and finally goes on to finish the boot process and give me a login prompt. At that point I noticed that none of my devices were found! Doing a ‘df’ revealed the hard drive as the only device found (the CD-ROM and compact flash drive are attached to the slave IDE controller and there is a floppy drive). Repeated ‘df’ commands over the next 4-5 minutes revealed that devb-eide gradually found the devices (starting with the floppy and moving on to the CD-ROM and compact flash drive).

Since this was so strange I decided to go to the effort of pulling out the hard drive with 6.3 on it and replacing it with another hard drive with 6.1 installed on it. Once I did that, devb-eide under 6.1 found all devices at boot time and I had complete access to everything by the time the login prompt was displayed.

So my first question is, why is devb-eide under 6.3 taking so long to find my devices, and is there anything I can do to speed up this process? Certainly the HD boot images aren’t any different (apart from the fact that one is 6.3 and the other is 6.1), because when I look at the args given to devb-eide they are identical.

Then I figured that once the flash disk was found after 4-5 minutes I would be able to go ahead and run my scripts to initialize the flash disk, copy on the QNX boot image and finally our S/W. So I began running the scripts, and of course they don’t work now either.

I was able to run our script that deletes the default DOS partition and replaces it with a QNX partition. I re-booted, waited 5 minutes and confirmed via fdisk that there was indeed a QNX partition.

The following is the command used to create the QNX partition in our scripts.

fdisk -p /dev/hd1 add -f 1 t77 all boot t77 loader

Then I attempted to run our script that copies over the boot image and the key lines are:

dinit -h -R -f flashBoot.ifs /dev/hd1t77

which works and:

cp flashNocacheBoot.ifs /fs/hd1-qnx4/.altboot

which fails telling me there is no /fs/hd1-qnx4 directory.

At this point I did a ‘df’ and noticed that /fs/hd1-qnx4/ had not been mounted at all. In fact nothing under /fs/ had been mounted for either the flash drive OR the CD-ROM drive (in essence I could not access either one).
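
Since the devices only show up minutes after boot, a script like the one above could guard the copy step by waiting for the mount point first. The following is only a minimal sketch in plain sh: `wait_for_path` is a hypothetical helper name (not a QNX utility), and the 300-second timeout in the usage comment is an arbitrary assumption based on the 4-5 minute delay described above.

```shell
#!/bin/sh
# Sketch only: poll until a path exists, giving up after a timeout (seconds).
# wait_for_path is a hypothetical helper, not part of QNX.
wait_for_path() {
    path=$1
    timeout=$2
    elapsed=0
    while [ ! -e "$path" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out waiting for $path" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Usage with the paths from the post (the timeout value is an assumption):
#   wait_for_path /fs/hd1-qnx4 300 && cp flashNocacheBoot.ifs /fs/hd1-qnx4/.altboot
```

This does not fix the slow detection itself; it only makes the script fail with a clear message instead of a confusing cp error when the filesystem never appears.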

I tried to mount the flash disk by using:

mount -t qnx4 /fs/hd1-qnx4 /dev/hd1

The command completes without complaint, and the mount does show up in ‘df’ afterwards, but it doesn’t actually mount correctly: when I re-run the script it tells me that /fs/hd1-qnx4 is a corrupted file system, so the cp fails. (Note that I may have the last two arguments backwards, because I am writing the mount command from memory.)

Once again I replaced the 6.3 HD with the 6.1 HD and booted up. I examined the flash disk and noted that indeed the dinit command worked because there was a .boot file on the flash disk (along with a couple of other QNX files) but no .altboot. The /fs directory contained entries for /cd0 and for /hd1-qnx4 as I would expect.

So I’m at a complete loss. I can’t figure out what’s going wrong in devb-eide that’s not letting it properly find and mount my compact flash (and I assume my CD-Rom and floppy since I never tested either other than the CD-Rom failure when trying to install 6.3 on the HD as I noted in my previous post).

Anyone have any clues on what I can do? I’d love to be able to just use the devb-eide from 6.1 but I’m sure it won’t work under 6.3 since it’s a completely different filesize.

Doing a ps -A -o pid,pgid,sid,args on devb-eide reveals:

devb-eide blk auto=partition dos exe=all cam quiet on my 6.3 machine.
devb-eide blk auto=partition dos exe=all on my 6.1 machine.

which is essentially the same minus the quiet option. However I am confused because none of the blk,auto,dos or exe options are even mentioned in the docs for devb-eide so I have no idea what they do.

I was not expecting this much trouble with a fresh 6.3 install on the flash rig. Is it possible I have not configured some file someplace? I haven’t done anything other than install 6.3, create a couple of user accounts and get network access up and running.

Tim

rick · July 14, 2004, 1:24am

Have you tried doing any of this with DMA disabled? One of the things that changed with 6.3 was support for higher UDMA modes - unfortunately that uncovers problems with some chipsets. If you can get it installed without dma (which is real slow), you can create your own image with whatever dma mode works best for you.

Also did you see any errors in sloginfo when you were having these problems?

Rick…

Tim · July 14, 2004, 6:20pm

Rick,

I didn’t know about disabling DMA and that it might help. I went back and re-booted and looked in my BIOS. All my UDMA access there is set to Disabled (is that causing issues and should I set it to Auto?). Then as QNX booted I selected the ‘d’ option to disable DMA.

Success! Everything was recognized right away and the QNX boot loader let me choose to boot from the HD or Flashdisk (as it does in 6.1). Once I logged in I had complete access to my flash drive and CD-Rom.

Out of curiosity I then went back and re-booted without selecting the ‘d’ option and checked sloginfo. Sure enough, there were several eide errors reported in there that I assume (I don’t understand them completely) are related to the DMA access.

So now my question is, how do I change the DMA parameters in my boot image and is it just a trial and error process to find the right values?

Thanks,

Tim

rick · July 14, 2004, 6:49pm

Somewhat - the sloginfo should show what it detects. You can then create a new boot image with the correct parameters (which I don’t recall off the top of my head). Look at the helpviewer for help on ‘diskboot’ and ‘devb-eide’ to find the right options.

In the end, it is a trial and error process.

Rick…

evanh · July 14, 2004, 9:23pm

I first recommend enabling udma in the bios. If that works, then you’ll be much better off all round.

The older PIO modes simply suck.

Tim · July 14, 2004, 10:04pm

Rick, Evanh,

Thanks. I’ll try enabling UDMA in the Bios and if that doesn’t help then I’ll go with the trial/error method. Since this is a card sliding into a card cage I need to make sure it works because there won’t be any monitor/keyboard for a user to make any kind of selection.

On another note, I’m now having another problem, this time with the sysinit on the flash disk. In 6.1 we were creating login consoles in our flashdisk bootbuild file and not from using tinit in the sysinit file (as a normal QNX install does). I found that (A) that doesn’t work in 6.3 and (B) I don’t really like that method even if it’s just for occasional debug methods because when you exit a shell it disappears forever.

So I removed the creation of the consoles in the flashdisk bootbuild file and added the following lines to the end of my sysinit (basically just copied from the default sysinit 6.3 install)

PATH=/bin:/usr/bin:/sbin exec which tinit

exec sh
exec fesh

However, now when I boot the flashdisk I error out on the line with the exec which tinit with the message:

/etc/system/sysinit[81]: can’t create pipe - try again

And it hangs there forcing me to re-boot using the HD to access the flash filesystem to try and figure out what went wrong.

From the thread:

openqnx.com/PNphpBB2-viewtopic-t2401-.html

here it appears I don’t have a path to pipe. But I did specify specifically /sbin in the PATH and I do have pipe located in that directory. So I’m completely stumped as to why I can’t get the consoles created and the sysinit to continue on. Any ideas?

Tim

Tim · July 15, 2004, 12:08am

Rick,

Here I noticed you talked about flash disk corruption

qnxzone.com/node/view/140

I came across this while doing general searches for help on my flash disk boot problem (before you wrote your solution above).

Anyway we have experienced what has been reported as flash disk corruption at a few of our beta sites (final product is in beta testing right now) and obviously Management is hot to track down any possible problem before it gets to the field.

Initially we thought that since some of these machines had been in the field for a year and had undergone several S/W upgrades that perhaps we had worn out the life of the disk. But I did some white paper reading on the San Disk site and realized it was very very unlikely that was the case.

It’s far more likely we experienced something that you described. So my question is, how can we determine if this is indeed our problem. In theory we never do any actual writes to the flash disk beyond S/W installation and calibration of hardware. At all other times the board simply boots up, runs and talks to a Windows box with a GUI on it. There aren’t even log files (those are also stored on the Windows side).

Of course I have observed the maintenance and service guys powering down the board when they weren’t supposed to. Since they could have been calibrating hardware or upgrading S/W right before that, they might have caused the problem. The thing is, as I am sure you know, it is virtually impossible to reproduce when you want to make it happen.

So my questions are:

If there are no writes (only reads) going on, can corruption occur via the controller (i.e., would the controller do any writes of any kind that might cause a problem if power was lost)?

By battery solution I assume you mean battery power to the board with flash disk that runs for a couple of seconds in order to allow writes to the flash disk to complete or are you just talking about battery power to the flash disk itself?

When corruption occurs, is it the whole flash disk that gets corrupted or just the file that was being written to? My worry is that the corruption occurs in the middle of updating the .bitmap file, which means the whole disk is then shot and needs a full S/W install.

Lastly, I assume the corruption is only S/W related and that a re-format of the flash disk solves the problem (i.e., no physical damage to the flash)?

Thanks,

Tim

rick · July 15, 2004, 3:02am

I don’t think so. However having said that, the corruption is actually caused (so I believe) when the sandisk decides to move things around, behind the scenes, and you happen to power off when this happens. I always assumed the rearranging was triggered by writing data, but I suppose it could also be triggered by other events.

I believe all that is required is power to the flash disk itself - however in the cases I have worked with, it was easier to power the whole board, so I can’t say I have tried it.

The corruption can occur anywhere and is not directly related to where you are writing the data. My best understanding is that at some point the controller decides to shuffle blocks around. If the blocks it is in the process of moving happen to be important, then you see it as corruption. Perhaps worse, it could appear to silently switch data around and you may not discover it right away.

In a case I worked with, we had ~100,000 devices in use. About 1000 (1%) were experiencing corruption problems (which we attributed to the kind of problems we are discussing). We managed to reformat and recover most of those. We did have some which we were never able to “fix”.

So unfortunately the answer is yes, this corruption appears to be able to permanently damage the flash disk. Personally, at this point, I could not recommend a customer use compact flash for anything which wasn’t running from a battery.

Rick…

evanh · July 15, 2004, 6:36am

There’s nothing quite like a defrag in the middle of a power-cut!

Actually, I suspect the real destruction is occurring because the memory cells being written to at power-down are getting spiked. Flash memory is easily destroyed by over-voltage, which is a bit of a downer considering that Solid State Discs are commonly employed in, and also touted as superior to Hard Discs in, unfriendly environments.

Fingers crossed that MRAM won’t have similar destructive behaviour.

Tim · July 16, 2004, 12:53am

Rick,

Thanks for the info. Obviously you’re talking about the legendary iOpener.

That was 1999 technology on a 10 MB flash disk. Of course I doubt it’s advanced much, but one can always hope.

Evanh,

You’re probably right. But I wonder why more people don’t encounter issues with all these USB drives being plugged in and pulled out of computers all the time.

Tim

P.S. I’m still trying to get past the annoying sysinit not starting any consoles. I got sidetracked with other issues today: we had a field problem where an executable file that rc.sysinit looks for was not there, so of course the machine would not boot. How wonderfully annoying that the sysinit doesn’t just pass by that line and keep going with the boot process. I ended up taking out the flash disk and using a hex editor in raw mode from a Windows laptop to comment out the line, so that it would boot and I could put back the missing file. I will return to the console problem tomorrow. No idea why I can’t get consoles created at the end of running my sysinit (I’m past the pipe problem, but now it just sits there frozen without producing a login prompt).
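
One way to keep a startup script from depending on a single file being present is to wrap each optional program in a guard. This is only a sketch of the idea in plain sh, not the actual rc.sysinit contents: `run_if_present` is a hypothetical helper name, and the path in the usage comment is a made-up placeholder. It only covers the missing-file case described above; whether the rest of the boot behaves sensibly without that program is a separate question.

```shell
#!/bin/sh
# Sketch only: run a program if it exists and is executable,
# otherwise log a warning and carry on with the script.
# run_if_present is a hypothetical helper name.
run_if_present() {
    prog=$1
    shift
    if [ -x "$prog" ]; then
        "$prog" "$@"
    else
        echo "skipping $prog: not found or not executable" >&2
        return 0
    fi
}

# Usage (placeholder path, not from the original script):
#   run_if_present /usr/local/bin/our_daemon -v
```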

peterbarrie · July 16, 2004, 1:51pm

Would anybody like to say which devices they have been using (the corrupted ones) and what vintage they are? I’m considering using (manufacturer undecided) industrial grade CF in some new designs and am considering whether some form of battery backup is required. The products will not be powered down in normal usage but may obviously lose power at some stage, including power cycling on watchdog trip.

Tim · July 16, 2004, 6:14pm

peterbarrie,

Sure, the 100,000 devices Rick was mentioning used a SanDisk 10 MB compact flash that came out circa 1999-2001.

We are currently using 256 MB SanDisks and are moving to the 256 MB Ultra SanDisk for production units that ship to customers. These are all disks made in the past year or year and a half.

SanDisk uses wear levelling (the moving around of blocks Rick was talking about) to prolong the life of the flash disk, which he believes also increases the likelihood of corruption. Other companies may do their own wear levelling or they may not (you’ll have to check on a case by case basis).

What seems funny to me is that all these new USB flash disks use essentially the same technology as compact flash disks and are much more widespread in the consumer market, yet I never hear anyone talk about file corruption on them, and lord knows consumers will be pulling those things out and shoving them in at the wrong times.

Tim

rick · July 16, 2004, 8:27pm

Yeah, I guess the real question is whether the ultra devices are really any better - faster perhaps, but more resilient? I don’t know.

And I think the Iopener used a 16 meg part, not 10. I am not sure anyone made a 10 meg device - they seem to stick with base 2 sizes.

We certainly experienced similar problems on a 256 meg SanDisk device also (on a different project).

Rick…

evanh · July 16, 2004, 11:15pm

If you don’t need easily removable media then DiscOnChip has never given me any hassles. There is also DiscOnModule if a DOC socket is not an option for you. But looking through the M-Systems web site, DOM seems to be non-existent, so they must be discontinuing it in favour of HDD form factors.

m-sys.com/

mritun · July 23, 2004, 9:49am

Tim,

Did you start the pipe server in the script before starting tinit?

Akhilesh

Tim · July 26, 2004, 6:10pm

Akhilesh,

Yes I did.

I finally figured it out.

We were using scripts to create our flash disk from the QNX 6.1 days and we had never used tinit before to create our consoles. So /etc/config was never created as a directory, and ttys was being placed in /etc/system (the last directory created by the script) instead of /etc/config. Once I finally noticed that, it was a trivial matter to get it working. So many wasted hours until that moment of clarity.
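
A check along these lines, run against the flash disk image before it ships, might have caught the misplaced file earlier. It is a minimal sketch in plain sh: `check_tinit_files` is a hypothetical helper name, and the only thing it knows about is the ttys location described above (expected in etc/config, wrongly placed in etc/system).

```shell
#!/bin/sh
# Sketch only: verify that ttys sits under etc/config below a given root,
# and point out the misplacement described above if it is found in etc/system.
# check_tinit_files is a hypothetical helper name.
check_tinit_files() {
    root=$1
    if [ -f "$root/etc/config/ttys" ]; then
        return 0
    fi
    echo "missing $root/etc/config/ttys" >&2
    if [ -f "$root/etc/system/ttys" ]; then
        echo "found ttys in $root/etc/system instead; it probably belongs in etc/config" >&2
    fi
    return 1
}

# Usage against the mounted flash disk (mount point taken from the thread):
#   check_tinit_files /fs/hd1-qnx4 || echo "fix the image before booting from it"
```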

Tim