QNX2: is there expected behavior in a system crash?

Maybe this is kind of an esoteric question…
Is there any kind of predictable or default behavior from a QNX 2.21atpb system in the event of a system crash or if a program has some kind of unrecoverable issue?

This explanation got longer than I intended, but I’m trying to offer as much context as possible.
I guess I’m not really expecting any kind of concrete answer, but thanks to anyone who takes the time to read.

I ask because in multiple, seemingly DIFFERENT, scenarios with the software we run, the system itself always seems to crash/fail in the same way; exhibiting the same pattern of behavior nearly every time. (also, when I refer to “set” I typically mean it as a noun, a set or sets, not verb as in ‘set something up’)

The data acquisition software we use has four layers of functions that are typically loaded during normal usage; “administrator,” “communications,” “scanning,” and “logging.” They are also tiered in that order, ie you cannot start scanning if communications haven’t been started, comms can’t be started if administrator isn’t running, etc.

basic program config and file navigation do not need administrator to be running; but if you want to modify set/device configurations or do anything else, administrator must be loaded… a help prompt states:
“Selecting this item will activate the AutoNet system administrator (the set record DBASE server), as well as any other ‘subtasks’ that need to be running in order for you to access items under the ‘Sets’ menu.”

Communications is just opening up comms with configured devices, scanning is telling the software to actually sieve/pull values at X rate from the live database as it receives data, and logging is simply capturing lines of data from selected channels.

The primary behavior that happens when a problem occurs is the hard drive appears completely full, 0.0mb free, and then the system just gives endless error beeps because its automated error-logging fails to write to disk, since the system thinks the disk is full. The drive access light will be constantly lit.
I wondered if the behavior was maybe a sign of any specific problem(s) that can occur, as it might point me in a better direction in the event that I have to troubleshoot or repair a system.

  • I have had this happen when attempting to use the program’s set-configuration import ability (literally just an automated process of copying a few files off of a floppy and assimilating them into its database).
  • it has occurred when trying to save code/programming for a set; the software uses the built-in QNX editor (ed or bed, not sure which)
  • it has occurred when trying to use its logged data conversion/export utility

Each time it’s the same resulting behavior; system perceives the disk as full, and you can’t select any menu option or enter any commands in another console because it’s not able to access the disk. Only recourse is a hard reset.

There’s something strange about a couple of those examples, though…

When it occurred while attempting to save the set program: I restarted the machine, then ran ed directly from the shell, loaded the file there and made a superfluous change, then saved and exited the editor… upon loading our software and trying to edit the file again like normal through the software, it worked fine, and I was also able to save/edit without issue, and also load the full set.
note: the “administrator” wasn’t loaded/running when I had gone into the shell to load the editor, but this configuration within the software requires that administrator be loaded to even load anything from the menu.

When it occurred while trying to export data: all I did was change a random setting and save that setting before running the export. It worked just fine, then I adjusted the setting back to what it was at the time of the failure, and it also worked fine.
note: administrator does not have to be running to push a data export.

There didn’t seem to be any rhyme or reason for why it would trigger the crash while doing a set import, it literally just seemed to pick and choose whether it wanted to cooperate in that moment. In some instances when the crash has happened during an import, it has rendered the entire database corrupt and I am forced to do a full system reinstall.

My next goal is to see if I can track down any of the original developers but I am not holding my breath about that. The problem is simply that there is no error/troubleshooting documentation available anywhere. I think I mentioned this in another thread somewhat recently, I think they just expected you to contact them directly and they either walked you through something or sent a tech out.

Any thoughts are appreciated… is this particular behavior familiar or known in any part of the QNX2 world, or is it maybe just a result of this individual software?

It’s late so I can’t follow exactly what you are saying, but for now this is what I can figure out.

Let’s start with the words “system crash”. It sounds like you are confused about what this means. When the QNX 2 system crashes, that means that a serious problem has occurred in proc or the kernel. The behavior is a dump of information, including registers, on the text screen. If you are in a graphics mode, you can’t see it; you just see the system stop.

Any task in QNX 2 can crash. You seem to be having a problem with the file manager and file system. This can be caused by a hardware problem or a bug in a driver… The latter is most unlikely. If fsys crashes, that doesn’t automatically kill the system, but it does make it hard to do anything. Seeing the drive light stay on seems like a hardware problem. It’s hard to believe that fsys thinks that the hard drive is full when it is not.

After one of your crashes, have you tried chkfsys to fix up the file system?

There are other types of failure mode besides a kernel crash or a task crash. A task can be in a loop that prevents anything else from happening. This could be an application that runs at a high priority. The other mode would be if some system task just stops.

Does it ever happen after just booting up the system or does it only happen after you’ve been running for a period of time?

I know you are working with very large files and disk sizes compared to what was the norm when QNX 2 was released in the early 90s. I wonder if you are instead running out of some resource in the file manager and it’s reporting an error as what appears to be ‘disk full’ to your software when in fact maybe it’s just out of some internal cache (or maybe you’ve got too many open files etc). That would explain why after a reboot everything is fine.

Tim

yes, I am using “crash” in a more general sense. I HAVE actually seen a couple instances where the screen dumps a bunch of text and is fully in panic mode (I think you can trigger it intentionally if you try to load a disk driver a second time) but this is not that.

like I said, it’s impossible to execute any command to even see the state of the system; whatever is occurring makes the system think the disk is full and any command returns “command not found” or something like that because it cannot even access the drive.

There’s not really anything to suggest a hardware issue… when the behavior surfaces, it’s always under a particular circumstance in the software, even if that circumstance isn’t always the same…
(what I mean is when the problem presents itself, whatever happened to trigger it will again trigger it after a restart until you’ve changed some variable… I haven’t yet found a rhyme or reason to why it happens, because multiple different scenarios have triggered it)

Also, I have run chkfsys myself on occasion and it has not shown any errors or difference in bitmap size.
The software has a script it loads at startup to check for any busy/questionable files on the off-chance the system wasn’t shut down properly… here’s the body of that small script (it is called from the larger sys.init during boot)

"Remove the /tmp/busy_files file if it exists 
if +f /tmp/busy_files then
	rm /tmp/busy_files >$null >*$null
endif 

"Zap it if it still exists 
if +f /tmp/busy_files then
	zap /tmp/busy_files >$null >*$null
endif 

"Remove the /tmp/busy_file_size file if it exists 
if +f /tmp/busy_file_size then
	rm /tmp/busy_file_size >$null >*$null
endif 

"Zap it if it still exists 
if +f /tmp/busy_file_size then
	zap /tmp/busy_file_size >$null >*$null
endif 

type "Mounting file system. . ."
files 3:/ +b -v > /tmp/busy_files

"Strip the /tmp/busy_files from the list
led /tmp/busy_files /busy_files/ d w q >$null >*$null

" If there were any busy files, fix the file system
if +f /tmp/busy_files then
	type "Busy files were found."
	/das/sys/build_stats x=/tmp/busy_files > /tmp/busy_file_size

	type "Recovering file system.  Please wait. . ."
	chkfsys 3 +r -p -v -s >*$null >$null
	eo /tmp/busy_file_size "/das/sys/trunc @ >*$null" >$null >*$null
	type "Recovery complete."

	" Remove the temporary files 
	rm /tmp/busy_files >$null >*$null
	rm /tmp/busy_file_size >$null >*$null
endif

" Look for and remove any LDF file with zero records.
if +f /das/sys/ldf_check then
	/das/sys/ldf_check
endif

" Sleep 2 seconds to let the user read the messages
sleep 2

Doesn’t occur after just booting up, but the “period of time” can simply be the fifteen or twenty seconds it takes to get into the program and get to whatever Thing may be triggering the behavior, depending on what exactly may be causing it in a given instance.

In some instances, there are no large files on the drive; a fresh install of the software on a formatted disk may collectively take up apx 25 - 30 MB total, and I’ve had the problem occur simply from trying to import a set from a floppy (ie less than a megabyte) into that fresh install.

If it were running out of any resource or cache, I would expect it to happen all the time and we would never even be able to use the machines :o

The large files are really just the logged data files which are typically 5-25mb depending on the complexity and duration of a test, some systems end up making logs in the order of nearly 50mb in size. These logs are not specifically loaded/viewed/manipulated during regular usage, and haven’t seemed to cause any issue.


I was talking with a coworker who is far removed from all of this, so they only have a very surface level understanding of it, but they suggested something that I had also wondered; if this behavior is possibly an intentional thing on the part of the software… as if it was by design in the event that there’s some kind of internal conflict or uncertainty.

It doesn’t seem likely, though, considering the industry and application that this software was designed for… it is not only data acquisition software but was also used for automation in production.

yes, I am using “crash” in a more general sense. I HAVE actually seen a couple instances where the screen dumps a bunch of text and is fully in panic mode (I think you can trigger it intentionally if you try to load a disk driver a second time) but this is not that.

Reloading a driver shouldn’t cause this to happen. What driver(s) are you mounting? If you incorrectly try to remount a driver, you could lose access to the disk.

like I said, it’s impossible to execute any command to even see the state of the system; whatever is occurring makes the system think the disk is full and any command returns “command not found” or something like that because it cannot even access the drive.

I still don’t know why you think this means that the disk is full. This behavior is common and well understood. The system cannot find any command, most likely because the driver is not working. This can also be made to occur by mis-using the ‘search’ command. You still might be able to execute commands if you put in a floppy diskette with a /cmds directory, and run a command, for example:

$ 1:/cmds/tsk

Another thing that could be done is to mount a RAM disk, put a /cmds directory on it along with some commands, and use the search command to point at the RAM disk first. If you need to know the sequence of commands to do this, let me know.

There’s not really anything to suggest a hardware issue…

What you are seeing seems exactly like a hardware issue to me. If the shell is working, which you would know because you are getting a response such as ‘command not found’, then the kernel and sh are working.

when the behavior surfaces, it’s always under a particular circumstance in the software, even if that circumstance isn’t always the same…

If software is triggering the problem, it is probably not broken hardware, but it still can be a hardware issue.

(what I mean is when the problem presents itself, whatever happened to trigger it will again trigger it after a restart until you’ve changed some variable… I haven’t yet found a rhyme or reason to why it happens, because multiple different scenarios have triggered it)

Whatever it is, I think you are hosing the disk driver.

Also, I have run chkfsys myself on occasion and it has not shown any errors or difference in bitmap size.

The startup script has already done this on startup, so it is too late for you to see anything from ‘chkfsys’.

The software has a script it loads at startup to check for any busy/questionable files on the off-chance the system wasn’t shut down properly… here’s the body of that small script (it is called from the larger sys.init during boot)

The script seems pretty thorough

"Remove the /tmp/busy_files file if it exists
if +f /tmp/busy_files then
rm /tmp/busy_files >$null >*$null
endif

This looks for a file named /tmp/busy_files and, if it exists, tries to delete it.

"Zap it if it still exists
if +f /tmp/busy_files then
zap /tmp/busy_files >$null >*$null
endif

If the file is still there, possibly because the busy bit was on, remove it from the directory. This can lose sectors on the disk which will be fixed when running chkfsys later.

"Remove the /tmp/busy_file_size file if it exists
if +f /tmp/busy_file_size then
rm /tmp/busy_file_size >$null >*$null
endif

"Zap it if it still exists
if +f /tmp/busy_file_size then
zap /tmp/busy_file_size >$null >*$null
endif

Same thing only for a file named /tmp/busy_file_size.

type "Mounting file system. . ."
files 3:/ +b -v > /tmp/busy_files

This does not mount the file system. It is already mounted during boot. I don’t see a ‘mount’ command below.
This is where the file /tmp/busy_files is created by searching for any busy files in the system.

"Strip the /tmp/busy_files from the list
led /tmp/busy_files /busy_files/ d w q >$null >*$null

Since /tmp/busy_files is an intentionally and properly busy file, remove it from the list using the line editor. Note that the assumption here is that it is listed first. If this is the only line in the file, then when it is deleted and the file is saved, the file itself is deleted. This is peculiar to QNX 2: most OSes allow zero-length files, but QNX 2 does not.
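For anyone who finds it easier to read as code, here is a rough model of that strip step in Python (not QNX 2 shell). It is only a sketch: the function name and the sample listing lines are made up for illustration, and "None" simply stands in for "the file no longer exists".

```python
def strip_own_entry(lines, marker="busy_files"):
    """Model of the led step: delete the first line that mentions the marker.

    Returns the remaining lines, or None to stand for "file no longer exists",
    since an empty file is described above as being removed by QNX 2.
    """
    remaining = list(lines)
    for i, line in enumerate(remaining):
        if marker in line:
            del remaining[i]  # only the first match is removed
            break
    return remaining or None


if __name__ == "__main__":
    # Hypothetical listing contents, not real output of the files command.
    print(strip_own_entry(["3:/tmp/busy_files"]))
    print(strip_own_entry(["3:/tmp/busy_files", "3:/das/data/log1"]))
```

In this model the later "if +f /tmp/busy_files" test corresponds to checking whether the result is None: the recovery branch only runs when some other busy file was left in the list.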

" If there were any busy files, fix the file system
if +f /tmp/busy_files then
type "Busy files were found."
/das/sys/build_stats x=/tmp/busy_files > /tmp/busy_file_size

If there are any busy files, a non-QNX program called ‘build_stats’ is run.

type "Recovering file system.  Please wait. . ."
chkfsys 3 +r -p -v -s >*$null >$null
eo /tmp/busy_file_size "/das/sys/trunc @ >*$null" >$null >*$null
type "Recovery complete."

" Remove the temporary files
rm /tmp/busy_files >$null >*$null
rm /tmp/busy_file_size >$null >*$null
endif

Run ‘chkfsys’. This is a good place to note that all these commands are run directing their output to $null which means you will see nothing, even if there is something to see.
The ‘eo’ command is Execute On, which runs the application’s ‘trunc’ program once for each line of /tmp/busy_file_size. A guess is that it is trying to restore the proper file size to a file that was open for write when the ‘crash’ occurred.

Also, note that the two /tmp/busy_file* files are deleted here. They are deleted at the beginning in case the crash occurs at startup when they exist.

" Look for and remove any LDF file with zero records.
if +f /das/sys/ldf_check then
/das/sys/ldf_check
endif

Obviously an application program that cleans up application files.

" Sleep 2 seconds to let the user read the messages
sleep 2

The only thing you get to read is the output of any programs that don’t have the >*$null redirection.

Just my opinion here. It seems unlikely that the system would be designed to do this.

My money is on this especially given all the file I/O that happens to floppy and other disks. I bet the search path is getting mangled.

Section ‘8.4 Pathnames’ in the Operating System guide talks about this. 8.4.1 talks about overriding the order.

Might want to try fully qualifying the path on the hard drive next time it happens to see if it’s just the search path that got mangled.
Tim

If you are suspicious as Tim suggests that the search path has been mangled, there is a simple fix:

$ 3:/cmds/search 3 1

This will look for the search command on disk 3, run it, and reset the search order to look at disk 3 (the hard drive) first and disk 1 (the floppy) next.

If the problem is the disk driver, this may hang or fail. In this case, if you have a floppy with a /cmds directory on it that has the search command, you can run this:

$ 1:/cmds/search 1

And the system will only look on the floppy. If the floppy has the mount command and the disk driver, you could try remounting it.

I really wish I could take pics/video within our facility so I could actually document and show this software, as well as try to capture this behavior. I’m going to reach out and see what I need to do to get a camera pass or approval to make any visual recording.

I was just going off of something you had said in a previous thread (specifying a driver location vs pointing the mount command to a driver that is already resident, like d=3 in the string). It had caused a screen of red text in the past like this:

01A0 0EA5
X {"X" just endlessly cycles through every character}
 SS   DS   ES   DI   SI   DX   CX   BX   BP   AX   IP   CS   FL
0098 01A8 0098 857B 879F 0000 0000 820B 8BBC 0400 0EA5 01A0 0282
Stack
{eight lines of a bunch of hex values here}

It was only something I noticed coincidentally when I was figuring out mounting/formatting multiple partitions a couple months ago, but it’s ultimately not relevant for the issue at hand, so disregard. :+1:

I do not personally think the disk is actually full, but that is what the background disk-monitoring program reports whenever the problem occurs, it says “0.0M Free” in its corner of the screen. This can be a matter of seconds or a minute or two after the mystery problem occurs, prior to which there may have been gigabytes free.

The search command is never called by the user or the software at any point during usage of the system; I do understand its purpose and utilization though.

If I encounter the issue again on a machine, I absolutely want to have a floppy or mount a ram disk and stuff it with at least a few commands and try that while triggering the condition again though. That doesn’t exactly solve the issue of what may be causing it, but it would be neat to see how the system reacts.

This same problem has occurred across at least 4 different hardware configurations, and all explicitly from some software related trigger, albeit whilst doing different tasks.

Could be, but I can’t understand why opening a file or importing a configuration to the database or simply trying to save a file from a particular instance of the text editor would cause it to freak out. In all occasions, the ‘solutions’ have been different as well (uninstalling and reinstalling a software package in one case, opening a file in the editor elsewhere, literally changing one insignificant thing and re-saving the options, etc)

I appreciate hearing those commands in more layman’s terms, thanks. It mostly confirms what I expected it may be doing… it’s not the primary startup script, though, it’s called from the sys.init script to check for anything left over in the event of an unexpected restart or crash (again used in the general sense).


I’m familiar with the relative and specific pathname behaviors in the shell, but in every instance of these problems, the user would be within the program…

the program is entirely menu and window/panel prompt operated (character mode, not graphics), there is never any interaction with the shell in 99.8% of use-cases aside from the normal login prompt, and normal users have a script associated in the password file which starts loading the program automatically.

To be honest, I don’t know if any of the test operators for the last 30 years even knew there was an operating system behind the program itself, and the only three or four current engineers who would be aware wouldn’t have been poking around in it anyway.


I’m not suspicious that it may have gotten messed up but like I said in response to your other comment, I’m definitely curious to try.


I’m a little more prepared to investigate things the next time I encounter or can cause this behavior… Unfortunately the two systems that recently suffered issues were mission-critical, so it was a bigger priority to just get them back up and running rather than trying to find some indication of why anything happened -_-;

it’s frustrating because I’m not the kind of person who likes to “fix” something without actually knowing what I did or without any confidence that I was actively working toward a solution… starting at zero and throwing stuff at the wall to see what sticks isn’t exactly deliberate troubleshooting lol

How can you know this when you don’t have the source code? From within a C program you can easily run O/S commands like ‘search’ and change the search path.

My guess continues to be that within the software the search path was modified. Likely on purpose, simply to make some file copying operations (to floppy or another backup medium like a 2nd partition on the HD?) easier, so they didn’t have to fully qualify paths. When running correctly it should put the search path back the way it was after finishing the operation, so everything would continue on just fine. But sometimes, for reasons unknown, something fails somewhere and the code doesn’t put it back properly.
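To make the guess concrete, here is a small Python sketch of the failure pattern I have in mind. It is purely hypothetical: I have no idea how the real application is written, and the class, function names and drive numbers below are invented for the illustration.

```python
class System:
    """Stand-in for the machine; search_order models the shell search order."""

    def __init__(self):
        self.search_order = ["3", "1"]  # hard drive first, then floppy


def run_unsafe(system, temp_order, operation):
    """Change the search order, do the work, then restore it."""
    saved = list(system.search_order)
    system.search_order = list(temp_order)
    operation()                       # if this raises, the restore is skipped
    system.search_order = saved


def run_safe(system, temp_order, operation):
    """Same idea, but the restore happens even when the work fails."""
    saved = list(system.search_order)
    system.search_order = list(temp_order)
    try:
        operation()
    finally:
        system.search_order = saved


def failing_copy():
    """Simulates a copy step that dies partway through."""
    raise OSError("simulated copy failure")


if __name__ == "__main__":
    for runner in (run_unsafe, run_safe):
        machine = System()
        try:
            runner(machine, ["1"], failing_copy)
        except OSError:
            pass
        print(runner.__name__, machine.search_order)
```

With the unsafe version, one failed operation leaves the search order pointing only at the floppy, and from then on every command lookup would fail until something resets it.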

I can’t remember if you’ve mentioned this before, but are there any other active consoles? If you aren’t aware, QNX allows virtual consoles. You can cycle between them with CTRL-ALT-+ (plus key or minus key or even enter a number like 2 to go to the 2nd virtual console). Hopefully there are virtual consoles available to you so that you can log in on a different console when the problem happens. Because if you can’t, I don’t see how you are going to be able to run commands from a floppy drive. If you do have virtual consoles then hopefully logging in on one of them will let you check the search path and see if you can fix it on a running system.

You also won’t be able to mount a RAM disk after the problem occurs because all the commands to create/mount the RAM disk and fill it with commands won’t be possible. You need to have the RAM disk already created and ready to go with all the commands you think you might need.

Tim

fair. I am assuming as such because I didn’t see it as plain-text when skimming through a hex dump of any of the relevant program files. Other command strings show up occasionally when some programs are designed to Do A Thing in the shell in the background…

Coincidentally, when you or maschoen mentioned the spatch utility recently, I was able to go in to a program that gets loaded from a menu in the main application, find and delete two plain-text strings I saw that were executing a command that I didn’t want running for what I was testing, and it successfully stopped the program from executing that particular command.

The software we use does not know to look for other QNX partitions, does not attempt to look for them, and does not attempt copying to a floppy/tape/partition in any of the occurrences where this strange hanging behavior has occurred.

By default, the company that made this software package has configured the QNX sys.init file to load three other consoles. The second isn’t interactive, it is simply a status display and rolling log of system messages. The third and fourth are there for instances when a user may need multiple screens for configuring a set or comparing/referencing the code for a program, etc.

When this behavior occurs, you cannot do anything in the other two consoles… cannot log into them because the system doesn’t see the password file, cannot execute any command if you’re already logged in because it gives “command not found” like I said earlier…

Yes, I am aware of this. I meant that I have to find a situation where this behavior occurs so that I can restart the machine, configure a RAM disk and floppy with commands, configure search to be aware of them, etc., and then trigger the behavior and see if the RAM disk or floppy are still accessible and able to execute whatever commands I have put on them.


Between Friday and yesterday, we again had this strange behavior appear, but unfortunately it was for a test that needed to be run this week before a High Profile customer came to witness progress starting next week. I was not afforded the chance to test things, the biggest priority was just Making Things Work…

the short version is I had to rebuild a device configuration from scratch (as opposed to the engineer having simply imported the configuration from another pre-existing set which may have had some corrupt data or a conflicting tag in the database). I want to go back and compare some things between the two set configurations that may highlight a prospective conflict, but they’re going to be running consistently for a few weeks or months now.

I do not personally think the disk is actually full, but that is what the background disk-monitoring program reports whenever the problem occurs, it says “0.0M Free” in its corner of the screen. This can be a matter of seconds or a minute or two after the mystery problem occurs, prior to which there may have been gigabytes free.

This sounds to me like the same thing. The program wants to figure out how much space is left on disk, but it can’t access the disk so it reports zero.
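As a sketch of what I mean (in Python, since I obviously don’t have the monitor’s source; the function names and the error handling are my assumption about how such a program might behave, not a description of the actual one):

```python
import shutil


def free_megabytes(path):
    """Return free space in MB, or 0.0 if the disk cannot be queried at all."""
    try:
        return shutil.disk_usage(path).free / (1024 * 1024)
    except OSError:
        # Assumed behavior: any access failure is flattened into "nothing free".
        return 0.0


def status_text(path):
    """Format the value the way the corner display is described in this thread."""
    return f"{free_megabytes(path):.1f}M Free"


if __name__ == "__main__":
    print(status_text("."))
    print(status_text("/no/such/mount/point"))
```

A monitor written this way cannot tell "disk full" apart from "disk unreachable", which would fit with the display reading 0.0M Free moments after there was plenty of space.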

This same problem has occurred across at least 4 different hardware configurations, and all explicitly from some software related trigger, albeit whilst doing different tasks.

This is clearly good evidence that it is not hardware.

Could be, but I can’t understand why opening a file or importing a configuration to the database or simply trying to save a file from a particular instance of the text editor would cause it to freak out.

Those operations shouldn’t. There is a simple test if you have a shell prompt. Enter the command:

$ 3:/cmds/task

If you see anything besides a list of running tasks, I think your driver is hosed.