[QNX2] System crash saga persists, looking for insight

We’ve had a strange problem plaguing three or four systems (out of about a dozen and a half in regular operation, the rest of which have no apparent issues).

I’ve posted about it in the past, but now I need to make an active effort to figure out how to fix or avoid it.

When the problem occurs, the hard drive becomes inaccessible… the drive access light comes on effectively solid, and you aren’t able to access anything on the drive: you cannot browse directories/files, cannot execute anything on the disk, etc. The resulting fallout often damages or corrupts the database and files that contain our different test configurations.

General context:

  • I don’t believe it to be hardware related
  • the software we run is a piece of data acquisition software, which can have dozens of test configurations (read: “sets”) that outline instrumentation, DAS hardware, programming, etc.
  • I previously thought it might be related to the database that runs behind the software we use
  • WHEN it occurs, the user is never doing anything beyond the scope of the software’s capabilities
  • IF it has occurred but the system and config are salvageable, it is almost guaranteed to happen again within the particular set that was being used at the time
  • Note: I also recently noticed that a folder I created on July 25th had a creation date of July 21st according to the OS, despite the system clock being correct (and a Y2K compliance update having been installed on all of these systems)… Not sure what this may suggest, but it was certainly odd.

After much other trial and error, I decided to do a completely fresh install and rebuild the test configuration by hand, rather than import any potentially corrupted data from the current problematic system. This is time-consuming, but it was the only way I could think of to avoid carrying over any persistent issue.

Not far into the process of rebuilding the configuration, after having already saved my progress a few times, I saved again and then the system failed in the same way as the other problematic systems. This was on a fresh installation on vetted, known-good hardware.


In the past, @Maschoen had suggested mounting a ramdisk with some cmds to see whether the file system might have crashed.

I have done this on a couple of occasions and can confirm that the system can still navigate the directory structure on both the RAM disk and a floppy. I feel like this suggests that something happens (or happened) to the hard disk driver.

Maschoen also said that the .ata driver that these systems use for IDE hard disks may actually be a driver he had written back in the day and made available for general use.


Whatever the case, I’ve now begun to wonder if there is some issue with how I am partitioning/formatting/mounting the hard drives before establishing/restoring a system.

  • The drives are new-old-stock 80 GB Seagate Barracuda IDE/ATA drives.
  • I mount with mount disk 3 1:/drivers/disk.ata
  • I format them on an Adek-sourced Pentium 4 motherboard; the BIOS is able to see the full size of the disk, and the fdisk command reports the same values that are visible in the BIOS: H=255 T=4865 N=63
    • (note: this allows me to create multiple partitions, as the older Dell Optiplex GXa/GX1 machines we use can only see a total of 1024 tracks. If I install a drive that was formatted on the Pentium 4 machine into the older Dells, the Dell still only sees the 1024 tracks, but partitions beyond that point are accessible.)
  • in the fdisk utility, I specify the main partition as type 7, start cylinder 0, end cylinder 999, and mark it as the boot partition. This results in a partition that is just shy of 8 GB in size, with a bitmap file of around 2 MB
  • I then remount with mount disk 3 d=3 pa=7 and execute dinit 3 +h
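For clarity, here is the whole prep sequence collapsed into the commands quoted above, in order (drive/partition numbers are from my setup; the fdisk step is interactive, so its line is just a placeholder for the choices described above):

```
mount disk 3 1:/drivers/disk.ata
fdisk                (interactive: partition type 7, start cyl 0, end cyl 999, mark bootable)
mount disk 3 d=3 pa=7
dinit 3 +h
```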

I have another drive that houses full system backups (created via backup x:/ y:/location123/ +a s=c); I will swap the freshly formatted drive to the second position on the IDE cable and then boot from the other archive drive in the first IDE position, mount the fresh drive, then use the backup command again to copy either a previous system backup or a fresh installation to the new drive.

From there, I put the fresh drive in the first position again, startup from a floppy, then run the boot command to set up the fresh drive for booting;
boot 3:/netboot/os.2.21atpb c=3:/netboot/config_file d=3:/drivers/disk.ata +H +P +q

I have run dcheck on the drives that we’re using and found no bad blocks.

The sys.init mounts the drive with a 32 KB cache and a 16 KB xtent cache, and it also mounts a bitmap cache.


Is there anything that I am doing that could conceivably introduce some kind of problem?

  • Do I not need to add the +q option when executing the boot command?
  • Does the bitmap file get properly wiped when reformatting a drive/partition?
  • Do file/directory pointers get properly wiped when formatting?

I’ve wondered if the ~8gb partition size might be a problem but other systems have been rolling with the same size boot partition for a while without issue.

I’m at a loss by now, I can’t conceive of any other possible catalysts for this weird behavior. I know there aren’t really any other people here who know QNX2 besides Tim and Maschoen, and even then you guys are software devs, not tech support for the OS itself.

Would be divine if we could migrate away from these systems but we’re entrenched, and the data software is actually really good. These issues are not widespread but I cannot seem to iron out what is causing it in the machines where it occurs. Thanks in advance if anyone has some input or sees something I’m missing.

So when you say 3-4 systems out of a total of 18 have this issue do you mean it’s the same 3-4 that have it over and over again (with some indeterminate amount of time passing between the problem occurring) and the others never exhibit this issue?

If that’s the case it would seem FAR more likely it’s a hardware issue than a software one. Especially if the same software AND same test configurations (sets) are run on all 18 machines.

What can you tell us about the PCs in use on the 18 machines? Are they all the same type (i.e. same motherboard, hard drive, CPU, etc.)? If some are different, is there anything that distinguishes the systems you are having trouble with from the ones you aren’t?

One obvious thing I would do: if I had a system that NEVER exhibits the problem, I would swap the HD from a system that does have the problem into the one that never does, and see what happens (i.e. does it now exhibit the problem). At the same time, I’d put the HD from the problem-free system into the problem machine and see if the problem goes away there.

Tim

Yes.

But then why would I suddenly be getting the problem on my benchtop testbed machine which is a known-good system?

I will say that I am currently running my troubleshooting approach by copying a full fresh system & software installation from a complete backup (the software came from the manufacturer on approximately 20 floppies, and I do not currently have the time to do repeated trials while waiting for that many disks to install and decompress). There is certainly a non-zero chance that MY backup/restore process may be introducing whatever this weird issue is, hence the problem surfacing on my machine.

This is another variable I will have to test and rule out.

I don’t want to get hung up on the hardware too much; like I said, I really don’t think it’s hardware related since the problem and its presence spans any combination of the hardware in use.


On the particular system that is currently Hot (read: testing has been delayed for almost a month), the problem existed on both the older machine configuration and the new machine configuration.

The only thing that was migrated from the old system to the new one was literally just some data; a previous system backup was imported to a fresh system install on a new hard drive, and the problem reappeared…

This is why I thought it was perhaps some corruption in some of the older files.

But literally this morning I was testing another fresh install at my desk on one of the older-style machines (I am changing one variable of my process at a time; this time I omitted the +q option when running the boot command), and the system locked up when I saved just the initial set name and DAS input card list.

I feel like it potentially has something to do with how QNX manages file tracking or partition info, and something just gets confused when you ask it to write or access one of these files that the software manages.


As far as the hardware goes, I believe the facility originally had 486 systems (possibly Gateway machines) back in the day for this software. At some point they transitioned to Dell Optiplex GXa/GX1 models… these Dells are all:

  • Pentium II, 250–400 MHz
  • 64–128 MB RAM, though obviously QNX2 cannot see that much
  • all or mostly all the same motherboards, whatever Dell included
  • HP 82335A GPIB ISA card
  • Linksys Ether16 16-bit ISA LAN card
  • HDDs are primarily the Seagate Barracudas I mentioned before, but a few systems still have smaller/older drives


  • The Adek machines that I recently cobbled together use the MB-800V motherboard
  • Socket 478 2.8 GHz Pentium 4
  • 1 GB RAM
  • same GPIB card
  • same Ethernet card

On any given day, I am able to take a drive from one machine and stick it in another, and it loads and functions just fine; I would have to adjust the IRQ and memory values for the two ISA cards, but there’s no inherent incompatibility.

Unfortunately that’s tough for a few reasons: the problem doesn’t always occur in the same part of the software, and it’s not always 100% repeatable in the same way… The other issue is the pressing nature of the customers who are trying to run their tests. Obviously at this point I need to do anything that might yield forward progress, so I’ll certainly SEE if I can do this swap and instigate the problem (I have a good candidate right now with this fresh install that apparently developed the problem immediately).


but again, I’d like to also consider things that are not hardware related, or not in the sense of some kind of hardware failure. I feel like it has to be something with the options when I am using the backup command for copying full systems, or the options/method of formatting and partitioning the target drive, or maybe just some other esoteric thing that I might have overlooked…

If I am able to consistently trigger the problem, are there any built-in utilities I can use to see or interpret the current state of the system? For example, if I run the tsk command, I don’t know what “ready/recv/reply/wait” means in the “state” column. I don’t know what all the different flags mean (the QNX 2.21 manual outlines most of them in a couple of places, but not all), and I don’t know what “Dad/Bro/Son” means other than, perhaps, how one service or task relates to another. I understand the sac command shows current activity by task priority level, so I can see which priority has a consistent bar and then search tsk for whatever might be at that level.

“ready/recv/reply/wait”
Ready - The task is ready to run but some other task currently has the CPU (typically higher priority task but also possibly a same priority one).
Recv - The task is blocked waiting for a message from another task (think of this as a server waiting for a client to initiate an action).
Reply - The task is blocked waiting for a reply message (think of this as a client who sent a message to a server and is now waiting for the reply). QNX 2 message passing was send-receive-reply based so the receiver always has to send a reply to acknowledge the message even if it’s an ‘empty’ reply just to say it got the message.
Wait - The task is blocked waiting for something (a timer, a mutex etc) before it can continue.

I still find it hard to imagine the file system becoming corrupt enough to hang the driver. It would not be related to data inside a file itself. It would have to somehow corrupt the bitmap and/or the drive geometry itself. The idea that user created software would do this seems really unlikely, especially software that’s been in use for decades without displaying this issue.

On the other hand, you are forced to use hardware that, even though ancient by modern standards, was still FAR beyond what QNX 2 was originally written for. It just seems highly likely to me that a CPU or a hard drive or a motherboard or a BIOS, or some combo of those, is now ‘too fast’ for QNX 2 and causes the hang. I’d be especially suspicious of hard drive cache (something that did not exist when QNX 2 was written) and BIOS options. I know you say these 2 extra cards occupy ISA slots in your machines, but are you sure they don’t share an IRQ with the hard drive controller or anything else sensitive? (Old machines are notorious for IRQ issues with shared hardware.)

Tim

I’m going to read this more in depth when I have time, but something stood out immediately. Pentium 4. Wow!

A little history on QNX 2 running on newer processors.

There were two issues that started to crop up with late-version (fast) 486 CPUs and Pentiums.

  1. The floppy drive driver. It used a number calculated at startup to create a timing loop. This started failing when the 16-bit value rolled over. A fix was published that set the value to 0xffff, which worked for a while, but not forever. One way to deal with this was to use the BIOS to cripple the speed of the CPU by disabling the cache. I was able to run QNX 2 on a Pentium 2 laptop this way for a while.
  2. The QNET network. There was a race condition that caused virtual circuits not to be recovered. Eventually you could run out and need to reboot. There was a published cleanup fix, which I never got to work.

The point of me describing these is that I suspect you may be dealing with an unknown race condition in the disk driver that is caused by using such a fast processor. If it only happens on 3-4 of your machines, I would scrutinize their processor speeds. You could also see if there is some way to slow them down in the BIOS.

There is one other direction you could go in. I’ll be happy to send you the disk.ata driver source if you want to try your hand at seeing what is going wrong. The only way to get feedback from the driver is to put text directly to the video screen memory. This ability is provided in the driver for diagnostic purposes. This might let you track down where the driver freezes. This works best if you have a way to cause the problem within a short period of time.

LOL, you could try to hire me to work on this, but I don’t think that is in your best interest financially.

Here’s something I could do for you at a relatively low cost. I could scan the driver for any naked loops, that is, loops waiting for some hardware event that might never come (and so might never exit), and then put in some kind of limit. The problem with this is that it requires getting a very old development system up and running. I’ve done this a couple of times in the last 10 years, so I know it is possible.

This is of course a crapshoot. If you are not investigating updating to a newer system, you should have a mighty good reason. I once did a job for a chip manufacturer that was using QNX 2 to run their fab. It was not silicon. The cost to build a new system was in the $10M+ range, so at the time it made sense to have some work done. I don’t know what your situation is.

I’m re-reading and have a few comments.

After this happens, have you tried re-mounting the hard drive using the ramdisk files?

Mounting a bitmap cache for a drive that has a 2 MB bitmap, on a machine with 16 MB of memory, seems scary to me. That’s not to say there’s anything wrong with it.

I’m pretty sure the +q option to boot just means run quietly, no messages, so it doesn’t matter.

If by formatting you mean dinit, yes the bitmap gets wiped.

Nothing from the past matters when you run dinit. Old data may still be there and recoverable, but it should not influence going forward.

I see that Tim and I agree it might be the speed of the processor causing a problem.

I’ll circle back to you guys’ other comments as soon as I can; things have been super busy on my end and I’ve been pulled away from this again.

I just wanted to hit on this and mention that I got a fresh install to a place where this crash happens consistently.

With a ramdisk mounted, cmds directory copied, search command pointing to the ramdisk first, and logged into a second console before triggering the crash…

Attempting to cd literally anywhere on the hard drive returns
No Current Directory$
I can cd to the ramdisk and navigate just fine.

if I try to re-mount the hard drive with mount disk 3 d=3 pa=7 it returns
MOUNT: unable to read master partition record

I also tried putting the disk.ata driver on the ramdisk and pointing the mount command there for a driver, but received the same failure message.

Note: this “crash” occurs when asking the software to save its configuration to the c-tree database that runs in the background… navigating and doing anything else causes no issue whatsoever. I can apparently do anything else just fine, but something about saving/modifying entries in the database triggers this behavior.


(I had typed the above last week and never got a chance to post, I have had another subsequent odd behavior that might offer some insight)

I ran chkfsys just to scan the file system and it returned

ERROR
Physical read error attempting to read block.
FILE: [0]6:/das/device/NEFF.1.1320   XTNT:3   RBLK:4831 (12DFh)   ABLK:538976288
(20202020h)

This is NOT a fixable error. Type <CR> to continue

BUT, when running dcheck to check for bad blocks, it returns no issues.

Is there anything else that could cause a “physical read error” but not present itself as a bad block? That particular file was the set that I was currently working on to trigger the problem, but I have triggered it in other sets as well, so other files would of course be located in a different position on the disk.

The docs for dcheck say it may miss intermittent bad blocks, so that could be why. I think dcheck also just checks for physically bad blocks, while chkfsys checks for other things (bad data pointers), which could also be the reason.

From what you describe here it seems like the MBR + bitmap has been corrupted. That’s why the RAM drive is fine but nothing works on the hard drive (including rebooting).

In the Utils doc for QNX 2.2 under chkfsys there is this paragraph

I don’t seem to have that particular guide. But an online search found it here

Section 4.3.3 is ‘The root of the file system was corrupted’

You can review the steps there and in the section after it. Since you haven’t backed up those blocks (it tells you how, for future reference), you might not be able to recover. But on your next test system you might want to back up those blocks and, after a failure, see if that’s exactly what happened (by comparing the current state of the blocks vs. what you backed up).

All that said, if this is indeed the problem, what’s here doesn’t tell you how to stop it in the future. It would just provide a speedier way to recover without a full re-install. The source of the problem, in my opinion, remains likely hardware (something too fast for QNX2 that corrupts the file system when doing file updates).

Tim

A few random thoughts. The read error could be spurious, happening sometimes and not others; otherwise, there is no obvious reason. I would rerun chkfsys a few times and see if it repeats, and the same with dcheck. I don’t know why it would be related, but like disk.atc, disk.ata did multi-sector reads and writes. The original disk.at only read and wrote one sector at a time, which of course made a significant performance difference. One thing you could do with the source is decrease the size of the buffer and see if it makes any difference.

Your experiment trying to re-mount the disk suggests that the hardware is hosed in some way, which is a very unusual situation. I have one more thought on the matter: there has always been a debugging version of the driver that writes things directly to the monitor memory map. If you don’t have that version, I can find a way to get it to you. Let me know if you would like it.

Ummmmmm. C-Tree. It’s hard to remember back that far. In the early days of QNX I wrote a B-Tree package. For a while QNX sold it. Later, for reasons, they dumped it in favor of C-Tree. C-Tree came with source code, so it was possible to port it. I once had to help someone diagnose a problem with it. It had a feature that boggles my mind to this day: the function calls were not black boxes. There were circumstances where, if you modified a data buffer when you weren’t supposed to, you could corrupt the index. I don’t think this has anything to do with your problem, as that was not a hardware issue. Just saying, hearing about C-Tree still leaves a bad taste in my mouth.