We’ve had a strange problem plaguing three or four systems (out of about a dozen and a half in regular operation, the rest of which have no apparent issues).
I’ve posted about it in the past but now I need to make an active effort to try and figure out how to fix/avoid it.
When the problem occurs, the hard drive becomes inaccessible: the drive access light stays on effectively solid, and nothing on the drive can be reached. You cannot browse directories/files, cannot execute anything on the disk, etc. The resulting fallout often damages or corrupts the database and the files that contain our different test configurations.
General context:
- I don’t believe it to be hardware related
- the software we run is a piece of data acquisition software, which can have dozens of test configurations (read: “sets”) that outline instrumentation, DAS hardware, programming, etc.
- I previously thought it might be related to the database that runs behind the software we use
- WHEN it occurs, the user is never doing anything beyond the scope of the software’s capabilities
- IF it has occurred but the system and config are salvageable, it is almost guaranteed to happen again within the particular set that was being used at the time
- Note: I also recently noticed that a folder I created on July 25th had a creation date of July 21 according to the OS, despite the system clock being correct (and a y2k compliance update having been installed on all of these systems). Not sure what this may suggest, but it was certainly odd.
After much other trial and error, I decided to do a completely fresh install and rebuild the test configuration by hand, rather than import any potentially corrupted data from the current problematic system. This is time-consuming but was the only way I could think of avoiding any persistent issue.
Not far into the process of rebuilding the configuration, after having already saved my progress a few times, I saved again and then the system failed in the same way as the other problematic systems. This was on a fresh installation on vetted, known-good hardware.
In the past, @Maschoen had suggested mounting a ramdisk with some cmds to see whether the file system might have crashed.
I have done this on a couple of occasions and can confirm that the system can still navigate the directory structure on both the ramdisk and a floppy. I feel like this would suggest that something happens to the hard disk driver.
Maschoen also said that the .ata driver that these systems use for IDE hard disks may actually be a driver he had written back in the day and made available for general use.
Whatever the case, I’ve now begun to wonder if there is some issue with how I am partitioning/formatting/mounting the hard drives before establishing/restoring a system.
- The drives are new-old-stock 80gb Seagate Barracuda ide/ata drives.
- I mount with
mount disk 3 1:/drivers/disk.ata
- I format them on an Adek-sourced Pentium 4 motherboard. The BIOS is able to see the full size of the disk, and the fdisk command reports the same values that are visible in the BIOS: H=255 T=4865 N=63
- (note: this allows me to create multiple partitions, as the older Dell Optiplex GXa/GX1 machines we use can only see a total of 1024 tracks. If I install a drive that was formatted on the Pentium 4 machine into the older Dells, the Dell still only sees the 1024 tracks, but partitions beyond that point are accessible.)
- In the fdisk utility, I specify the main partition as type 7, start cylinder 0, end 999, and mark it as the boot partition. This results in a partition that is just shy of 8gb in size, and a bitmap file of around 2mb.
- I then remount with
mount disk 3 d=3 pa=7
and execute
dinit 3 +h
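As a sanity check on the sizes quoted above, here is a minimal Python sketch of the geometry arithmetic. It assumes 512-byte sectors and a bitmap that uses one bit per 512-byte block. Both are assumptions, not something confirmed in this post, and the function names are purely illustrative.

```python
# Rough CHS arithmetic for the partition described above.
# ASSUMPTIONS (not confirmed by the post): 512-byte sectors, and a
# bitmap that stores one bit per 512-byte block.

SECTOR_BYTES = 512

def chs_bytes(cylinders, heads, sectors_per_track, sector_bytes=SECTOR_BYTES):
    """Capacity in bytes of a region spanning `cylinders` cylinders."""
    return cylinders * heads * sectors_per_track * sector_bytes

def bitmap_bytes(total_bytes, block_bytes=SECTOR_BYTES):
    """Bytes needed for a one-bit-per-block allocation bitmap (rounded up)."""
    blocks = total_bytes // block_bytes
    return (blocks + 7) // 8

HEADS = 255    # H=255 as reported by fdisk
SECTORS = 63   # N=63 as reported by fdisk

# Cylinders 0..999 inclusive is 1000 cylinders.
partition = chs_bytes(1000, HEADS, SECTORS)
# What a machine limited to 1024 tracks could address with this geometry.
limit_1024 = chs_bytes(1024, HEADS, SECTORS)
bitmap = bitmap_bytes(partition)

print(f"partition (cyl 0-999): {partition:,} bytes = {partition / 2**30:.2f} GiB")
print(f"1024-cylinder limit:   {limit_1024:,} bytes = {limit_1024 / 2**30:.2f} GiB")
print(f"bitmap (1 bit/block):  {bitmap:,} bytes")
```

Under those assumptions the partition comes to 8,225,280,000 bytes and the bitmap to 2,008,125 bytes, which agrees with the "just shy of 8gb" and "around 2mb" figures, and the partition sits entirely below the 1024-cylinder boundary the older Dells can see.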
I have another drive that houses full system backups, created via
backup x:/ y:/location123/ +a s=c
I will swap the freshly formatted drive to the second position on the IDE cable and then boot from the other archive drive in the first IDE position, mount the fresh drive, then use the backup command again to copy either a previous system backup or a fresh installation to the new drive.
From there, I put the fresh drive in the first position again, start up from a floppy, then run the boot command to set up the fresh drive for booting:
boot 3:/netboot/os.2.21atpb c=3:/netboot/config_file d=3:/drivers/disk.ata +H +P +q
I have run dcheck on the drives that we’re using and found no bad blocks.
The sys.init mounts the drive with 32kb cache, 16k xtent cache, and it mounts a bitmap cache.
Is there anything that I am doing that could conceivably introduce some kind of problem?
- Do I not need to add the +q option when executing the boot command?
- Does the bitmap file get properly wiped when reformatting a drive/partition?
- Do file/directory pointers get properly wiped when formatting?
I’ve wondered if the ~8gb partition size might be a problem but other systems have been rolling with the same size boot partition for a while without issue.
I’m at a loss by now; I can’t conceive of any other possible catalysts for this weird behavior. I know there aren’t really any other people here who know QNX2 besides Tim and Maschoen, and even then you guys are software devs, not tech support for the OS itself.
Would be divine if we could migrate away from these systems but we’re entrenched, and the data software is actually really good. These issues are not widespread but I cannot seem to iron out what is causing them in the machines where they occur. Thanks in advance if anyone has some input or sees something I’m missing.