QNX2 is there expected behavior in a system crash?

bananaman · November 18, 2024, 8:54pm

Maybe this is kind of an esoteric question…
Is there any kind of predictable or default behavior from a QNX 2.21atpb system in the event of a system crash or if a program has some kind of unrecoverable issue?

This explanation got longer than I intended, but I’m trying to offer as much context as possible.
I guess I’m not really expecting any kind of concrete answer, but thanks for anyone who takes the time to read.

I ask because in multiple, seemingly DIFFERENT, scenarios with the software we run, the system itself always seems to crash/fail in the same way; exhibiting the same pattern of behavior nearly every time. (also, when I refer to “set” I typically mean it as a noun, a set or sets, not verb as in ‘set something up’)

The data acquisition software we use has four layers of functions that are typically loaded during normal usage; “administrator,” “communications,” “scanning,” and “logging.” They are also tiered in that order, ie you cannot start scanning if communications haven’t been started, comms can’t be started if administrator isn’t running, etc.

basic program config and file navigation do not need administrator to be running; but if you want to modify set/device configurations or do anything else, administrator must be loaded… a help prompt states:
“Selecting this item will activate the AutoNet system administrator (the set record DBASE server), as well as any other ‘subtasks’ that need to be running in order for you to access items under the ‘Sets’ menu.”

Communications is just opening up comms with configured devices, scanning is telling the software to actually sieve/pull values at X rate from the live database as it receives data, and logging is simply capturing lines of data from selected channels.

The primary behavior that happens when a problem occurs is the hard drive appears completely full, 0.0mb free, and then the system just gives endless error beeps because its automated error-logging fails to write to disk, since the system thinks the disk is full. The drive access light will be constantly lit.
I wondered if the behavior was maybe a sign of any specific problem(s) that can occur, as it might point me in a better direction in the event that I have to troubleshoot or repair a system.

I have had this happen when attempting to use the program’s set-configuration import ability (literally just an automated process of copying a few files off of a floppy and assimilating them into its database).
it has occurred when trying to save code/programming for a set; the software uses the built-in QNX editor (ed or bed, not sure which)
it has occurred when trying to use its logged data conversion/export utility

Each time it’s the same resulting behavior; system perceives the disk as full, and you can’t select any menu option or enter any commands in another console because it’s not able to access the disk. Only recourse is a hard reset.

There’s something strange about a couple of those examples, though…

When it occurred while attempting to save the set program: I restarted the machine, then ran ed directly from the shell, loaded the file there and made a superfluous change, then saved and exited the editor… upon loading our software and trying to edit the file again like normal through the software, it worked fine, and I was also able to save/edit without issue, and also load the full set.
note: the “administrator” wasn’t loaded/running when I had gone into the shell to load the editor, but this configuration within the software requires that administrator be loaded to even load anything from the menu.

When it occurred while trying to export data: all I did was change a random setting and save that setting before running the export. It worked just fine, then I adjusted the setting back to what it was at the time of the failure, and it also worked fine.
note: administrator does not have to be running to push a data export.

There didn’t seem to be any rhyme or reason for why it would trigger the crash while doing a set import, it literally just seemed to pick and choose whether it wanted to cooperate in that moment. In some instances when the crash has happened during an import, it has rendered the entire database corrupt and I am forced to do a full system reinstall.

My next goal is to see if I can track down any of the original developers but I am not holding my breath about that. The problem is simply that there is no error/troubleshooting documentation available anywhere. I think I mentioned this in another thread somewhat recently, I think they just expected you to contact them directly and they either walked you through something or sent a tech out.

Any thoughts are appreciated… is this particular behavior familiar or known in any part of the QNX2 world, or is it maybe just a result of this individual software?

maschoen · November 19, 2024, 8:26am

It’s late so I can’t follow exactly what you are saying, but for now this is what I can figure out.

Let’s start with the words “system crash”. It sounds like you are confused about what this means. When the QNX 2 system crashes, that means that there is a serious problem that has occurred in proc or the kernel. The behavior is a dump of information including registers on the text screen. If you are in a graphics mode, you can’t see it, you just see the system stop.

Any task in QNX 2 can crash. You seem to be having a problem with the file manager and file system. This can be caused by a hardware problem or a bug in a driver… The latter is most unlikely. If fsys crashes, that doesn’t automatically kill the system, but it does make it hard to do anything. Seeing the drive light stay on seems like a hardware problem. It’s hard to believe that fsys thinks that the hard drive is full when it is not.

After one of your crashes, have you tried chkfsys to fix up the file system?

There are other types of failure mode besides a kernel crash or a task crash. A task can be in a loop that prevents anything else from happening. This could be an application that runs at a high priority. The other mode would be if some system task just stops.

Tim · November 19, 2024, 10:20am

Does it ever happen after just booting up the system or does it only happen after you’ve been running for a period of time?

I know you are working with very large files and disk sizes compared to what was the norm when QNX 2 was released in the early 90s. I wonder if you are instead running out of some resource in the file manager and it’s reporting an error as what appears to be ‘disk full’ to your software when in fact maybe it’s just out of some internal cache (or maybe you’ve got too many open files etc). That would explain why after a reboot everything is fine.

Tim

bananaman · November 20, 2024, 2:28pm

maschoen:

Let’s start with the words “system crash”. It sounds like you are confused about what this means. When the QNX 2 system crashes, that means that there is a serious problem that has occurred in proc or the kernel. The behavior is a dump of information including registers on the text screen. If you are in a graphics mode, you can’t see it, you just see the system stop.

Any task in QNX 2 can crash. You seem to be having a problem with the file manager and file system. This can be caused by a hardware problem or a bug in a driver… The latter is most unlikely. If fsys crashes, that doesn’t automatically kill the system, but it does make it hard to do anything. Seeing the drive light stay on seems like a hardware problem. It’s hard to believe that fsys thinks that the hard drive is full when it is not.

After one of your crashes, have you tried chkfsys to fix up the file system?

There are other types of failure mode besides a kernel crash or a task crash. A task can be in a loop that prevents anything else from happening. This could be an application that runs at a high priority. The other mode would be if some system task just stops.

yes, I am using “crash” in a more general sense. I HAVE actually seen a couple instances where the screen dumps a bunch of text and is fully in panic mode (I think you can trigger it intentionally if you try to load a disk driver a second time) but this is not that.

like I said, it’s impossible to execute any command to even see the state of the system; whatever is occurring makes the system think the disk is full and any command returns “command not found” or something like that because it cannot even access the drive.

There’s not really anything to suggest a hardware issue… when the behavior surfaces, it’s always under a particular circumstance in the software, even if that circumstance isn’t always the same…
(what I mean is when the problem presents itself, whatever happened to trigger it will again trigger it after a restart until you’ve changed some variable… I haven’t yet found a rhyme or reason to why it happens, because multiple different scenarios have triggered it)

Also, I have run chkfsys myself on occasion and it has not shown any errors or difference in bitmap size.
The software has a script it loads at startup to check for any busy/questionable files on the off-chance the system wasn’t shut down properly… here’s the body of that small script (it is called from the larger sys.init during boot)

"Remove the /tmp/busy_files file if it exists 
if +f /tmp/busy_files then
	rm /tmp/busy_files >$null >*$null
endif 

"Zap it if it still exists 
if +f /tmp/busy_files then
	zap /tmp/busy_files >$null >*$null
endif 

"Remove the /tmp/busy_file_size file if it exists 
if +f /tmp/busy_file_size then
	rm /tmp/busy_file_size >$null >*$null
endif 

"Zap it if it still exists 
if +f /tmp/busy_file_size then
	zap /tmp/busy_file_size >$null >*$null
endif 

type "Mounting file system. . ."
files 3:/ +b -v > /tmp/busy_files

"Strip the /tmp/busy_files from the list
led /tmp/busy_files /busy_files/ d w q >$null >*$null

" If there were any busy files, fix the file system
if +f /tmp/busy_files then
	type "Busy files were found."
	/das/sys/build_stats x=/tmp/busy_files > /tmp/busy_file_size

	type "Recovering file system.  Please wait. . ."
	chkfsys 3 +r -p -v -s >*$null >$null
	eo /tmp/busy_file_size "/das/sys/trunc @ >*$null" >$null >*$null
	type "Recovery complete."

	" Remove the temporary files 
	rm /tmp/busy_files >$null >*$null
	rm /tmp/busy_file_size >$null >*$null
endif

" Look for and remove any LDF file with zero records.
if +f /das/sys/ldf_check then
	/das/sys/ldf_check
endif

" Sleep 2 seconds to let the user read the messages
sleep 2

Doesn’t occur after just booting up, but the “period of time” can simply be the fifteen or twenty seconds it takes to get into the program and get to whatever Thing may be triggering the behavior, depending on what exactly may be causing it in a given instance.

In some instances, there are no large files on the drive; a fresh install of the software on a formatted disk may collectively take up apx 25 - 30 MB total, and I’ve had the problem occur simply from trying to import a set from a floppy (ie less than a megabyte) into that fresh install.

If it were running out of any resource or cache, I would expect it to happen all the time and we would never even be able to use the machines :o

The large files are really just the logged data files which are typically 5-25mb depending on the complexity and duration of a test, some systems end up making logs in the order of nearly 50mb in size. These logs are not specifically loaded/viewed/manipulated during regular usage, and haven’t seemed to cause any issue.

I was talking with a coworker who is far removed from all of this, so they only have a very surface level understanding of it, but they suggested something that I had also wondered; if this behavior is possibly an intentional thing on the part of the software… as if it was by design in the event that there’s some kind of internal conflict or uncertainty.

It doesn’t seem likely, though, considering the industry and application that this software was designed for… it is not only data acquisition software but was also used for automation in production.

maschoen · November 20, 2024, 5:05pm

yes, I am using “crash” in a more general sense. I HAVE actually seen a couple instances where the screen dumps a bunch of text and is fully in panic mode (I think you can trigger it intentionally if you try to load a disk driver a second time) but this is not that.

Reloading a driver shouldn’t cause this to happen. What driver(s) are you mounting? If you incorrectly try to remount a driver, you could lose access to the disk.

like I said, it’s impossible to execute any command to even see the state of the system; whatever is occurring makes the system think the disk is full and any command returns “command not found” or something like that because it cannot even access the drive.

I still don’t know why you think this means that the disk is full. This behavior is common and well understood. The system cannot find any command, most likely because the driver is not working. This can also be made to occur by mis-using the ‘search’ command. You still might be able to execute commands if you put in a floppy diskette with a /cmds directory, and run a command, for example:

$ 1:/cmds/tsk

Another thing that could be done is to mount a RAM disk, put a /cmds directory on it along with some commands, and use the search command to point at the RAM disk first. If you need to know the sequence of commands to do this, let me know.

There’s not really anything to suggest a hardware issue…

What you are seeing seems exactly like a hardware issue to me. If the shell is working, which you would know because you are getting any response, such as ‘command not found’, the kernel and sh are working.

when the behavior surfaces, it’s always under a particular circumstance in the software, even if that circumstance isn’t always the same…

If software is triggering the problem, it is probably not broken hardware, but it still can be a hardware issue.

(what I mean is when the problem presents itself, whatever happened to trigger it will again trigger it after a restart until you’ve changed some variable… I haven’t yet found a rhyme or reason to why it happens, because multiple different scenarios have triggered it)

Whatever it is, I think you are hosing the disk driver.

Also, I have run chkfsys myself on occasion and it has not shown any errors or difference in bitmap size.

The startup script has already done this on startup, so it is too late for you to see anything from ‘chkfsys’.

The software has a script it loads at startup to check for any busy/questionable files on the off-chance the system wasn’t shut down properly… here’s the body of that small script (it is called from the larger sys.init during boot)

The script seems pretty thorough

"Remove the /tmp/busy_files file if it exists
if +f /tmp/busy_files then
	rm /tmp/busy_files >$null >*$null
endif

This looks for a file named /tmp/busy_files, and if it exists try to delete it.

"Zap it if it still exists
if +f /tmp/busy_files then
	zap /tmp/busy_files >$null >*$null
endif

If the file is still there, possibly because the busy bit was on, remove it from the directory. This can lose sectors on the disk which will be fixed when running chkfsys later.

"Remove the /tmp/busy_file_size file if it exists
if +f /tmp/busy_file_size then
	rm /tmp/busy_file_size >$null >*$null
endif

"Zap it if it still exists
if +f /tmp/busy_file_size then
	zap /tmp/busy_file_size >$null >*$null
endif

Same thing only for a file named /tmp/busy_file_size.

type "Mounting file system. . ."
files 3:/ +b -v > /tmp/busy_files

This does not mount the file system. It is already mounted during boot. I don’t see a ‘mount’ command below.
This is where the file /tmp/busy_files is created by searching for any busy files in the system.

"Strip the /tmp/busy_files from the list
led /tmp/busy_files /busy_files/ d w q >$null >*$null

Since /tmp/busy_files is an intentionally and properly busy file, remove it from the list using the line editor. Note that the assumption here is that it is listed first. If this is the only line in the file, when it is deleted and the file is saved, the file is deleted. This is a particularly QNX 2 behavior. Most OS’s allow zero length files, but not QNX 2.

" If there were any busy files, fix the file system
if +f /tmp/busy_files then
	type "Busy files were found."
	/das/sys/build_stats x=/tmp/busy_files > /tmp/busy_file_size

If there are any busy files, a non-qnx program called ‘build_stats’ is run.

	type "Recovering file system.  Please wait. . ."
	chkfsys 3 +r -p -v -s >*$null >$null
	eo /tmp/busy_file_size "/das/sys/trunc @ >*$null" >$null >*$null
	type "Recovery complete."

	" Remove the temporary files
	rm /tmp/busy_files >$null >*$null
	rm /tmp/busy_file_size >$null >*$null
endif

Run ‘chkfsys’. This is a good place to note that all these commands are run directing their output to $null which means you will see nothing, even if there is something to see.
The ‘eo’ command is Execute On, which runs the application ‘trunc’ program using a line from /tmp/busy_file_size. A guess is that it is trying to restore the proper file size to a file that was open for write when the ‘crash’ occurred.

Also, note that the two /tmp/busy_file* files are deleted here. They are deleted at the beginning in case the crash occurs at startup when they exist.

" Look for and remove any LDF file with zero records.
if +f /das/sys/ldf_check then
	/das/sys/ldf_check
endif

Obviously an application program that cleans up application files.

" Sleep 2 seconds to let the user read the messages
sleep 2

The only thing you get to read is the output of any programs that don’t have the >$null and >*$null redirections.
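To make the control flow of that script easier to follow, here is a rough model of it in Python rather than QNX 2 shell. It is only a sketch: `startup_check`, `strip_own_entry`, and the `recover` callback are hypothetical names, the list of strings stands in for the output of `files 3:/ +b -v`, and the callback stands in for the `chkfsys` and `trunc` pass.

```python
def strip_own_entry(lines, own_name="busy_files"):
    """Mimic the 'led ... /busy_files/ d w q' step: delete the first line
    that mentions the list file itself, since it is legitimately busy."""
    for i, line in enumerate(lines):
        if own_name in line:
            return lines[:i] + lines[i + 1:]
    return list(lines)


def startup_check(busy_listing, recover):
    """Return True if recovery ran.

    busy_listing: stand-in for the lines written by 'files 3:/ +b -v'.
    recover:      stand-in for 'chkfsys' plus the per-file 'trunc' pass.
    """
    remaining = strip_own_entry(busy_listing)
    if not remaining:
        # An emptied list file does not survive on QNX 2 (no zero-length
        # files), so the later 'if +f /tmp/busy_files' test is false.
        return False
    recover(remaining)
    return True
```

With only the list file itself marked busy, nothing is recovered; any additional entry triggers the recovery branch.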

maschoen · November 20, 2024, 5:21pm

Just my opinion here. It seems unlikely that the system would be designed to do this.

Tim · November 20, 2024, 5:54pm

My money is on this especially given all the file I/O that happens to floppy and other disks. I bet the search path is getting mangled.

Section ‘8.4 Pathnames’ in the Operating System guide talks about this. 8.4.1 talks about overriding the order.

You can override QNX’s sequential search of your disk drives and lock onto a
particular drive by prefixing your filename with a drive number followed by a
colon (:). As an example
backup 2:/user 1:/user
would invoke the backup command with the directory “user” on drive 2 as the
source and the directory “user” on drive 1 as the destination. The command
backup 1:/ 2:/ +all
would back up all structure on the disk in drive 1 onto the disk in drive 2. Finally,
the command
dump 2:/bitmap
would only look on drive 2 for the file “bitmap” under the ROOT.

Might want to try fully qualifying the path on the hard drive next time it happens to see if it’s just the search path that got mangled.
Tim
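The difference between a searched pathname and a drive-prefixed one can be sketched as follows. This is a simplified Python model, not QNX code: `resolve`, `search_order`, and the `disks` dictionary are made-up names, and a drive mapped to `None` represents a drive that cannot be read.

```python
def resolve(path, search_order, disks):
    """Return 'drive:path' for the first drive that holds path, else None.

    path:         '/cmds/tsk' (searched) or '3:/cmds/tsk' (locked to drive 3)
    search_order: list of drive numbers tried in order for unprefixed paths
    disks:        dict of drive number -> set of paths, or None if unreadable
    """
    head, sep, rest = path.partition(":")
    if sep and head.isdigit():
        drives = [int(head)]      # explicit prefix: only this drive is tried
        path = rest
    else:
        drives = list(search_order)
    for drive in drives:
        contents = disks.get(drive)
        if contents is not None and path in contents:
            return f"{drive}:{path}"
    return None                   # what the shell shows as "command not found"
```

In this model, a search order that has lost drive 3 makes `/cmds/search` unresolvable while `3:/cmds/search` still works, whereas an unreadable drive 3 makes both fail.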

maschoen · November 21, 2024, 8:49am

If you are suspicious as Tim suggests that the search path has been mangled, there is a simple fix:

$ 3:/cmds/search 3 1

This will look for the search command on disk 3, run it, and reset the search order to look at disk 3 (the hard drive) first and disk 1 (the floppy) next.

If the problem is the disk driver, this may hang or fail. In this case, if you have a floppy with a /cmds directory on it that has the search command, you can run this:

$ 1:/cmds/search 1

And the system will only look on the floppy. If the floppy has the mount command and the disk driver, you could try remounting it.
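Those two checks amount to a small decision table, written out here in Python purely as a summary. The function name and the wording of the results are my own; the only inputs are whether a fully qualified command runs from the hard disk and whether one runs from the floppy.

```python
def triage(hard_disk_explicit_ok, floppy_explicit_ok):
    """Interpret the outcome of running '3:/cmds/...' and '1:/cmds/...'."""
    if hard_disk_explicit_ok:
        # The drive answers when addressed directly, so the driver is alive.
        return "search order likely mangled; disk driver responding"
    if floppy_explicit_ok:
        # Kernel and shell work, but drive 3 cannot be read.
        return "kernel and shell alive; hard disk driver or disk not responding"
    return "no command runs from any drive; cannot narrow it down from the shell"
```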

bananaman · November 21, 2024, 8:20pm

I really wish I could take pics/video within our facility so I could actually document and show this software, as well as try to capture this behavior. I’m going to reach out and see what I need to do to get a camera pass or approval to make any visual recording.

I was just going off of something you had said in a previous thread (specifying a driver location vs pointing the mount command to a driver that is already resident, like d=3 in the string). It had caused a screen of red text in the past like this:

01A0 0EA5
X {"X" just endlessly cycles through every character}
 SS   DS   ES   DI   SI   DX   CX   BX   BP   AX   IP   CS   FL
0098 01A8 0098 857B 879F 0000 0000 820B 8BBC 0400 0EA5 01A0 0282
Stack
{eight lines of a bunch of hex values here}

It was only something I noticed coincidentally when I was figuring out mounting/formatting multiple partitions a couple months ago, but it’s ultimately not relevant for the issue at hand, so disregard.

I do not personally think the disk is actually full, but that is what the background disk-monitoring program reports whenever the problem occurs, it says “0.0M Free” in its corner of the screen. This can be a matter of seconds or a minute or two after the mystery problem occurs, prior to which there may have been gigabytes free.

The search command is never called by the user or the software at any point during usage of the system; I do understand its purpose and utilization though.

If I encounter the issue again on a machine, I absolutely want to have a floppy or mount a ram disk and stuff it with at least a few commands and try that while triggering the condition again though. That doesn’t exactly solve the issue of what may be causing it, but it would be neat to see how the system reacts.

This same problem has occurred across at least 4 different hardware configurations, and all explicitly from some software related trigger, albeit whilst doing different tasks.

Could be, but I can’t understand why opening a file or importing a configuration to the database or simply trying to save a file from a particular instance of the text editor would cause it to freak out. In all occasions, the ‘solutions’ have been different as well (uninstalling and reinstalling a software package in one case, opening a file in the editor elsewhere, literally changing one insignificant thing and re-saving the options, etc)

I appreciate hearing those commands in more layman’s terms, thanks. It mostly confirms what I expected it may be doing… it’s not the primary startup script, though, it’s called from the sys.init script to check for anything left over in the event of an unexpected restart or crash (again used in the general sense).

I’m familiar with the relative and specific pathname behaviors in the shell, but in every instance of these problems, the user would be within the program…

the program is entirely menu and window/panel prompt operated (character mode, not graphics), there is never any interaction with the shell in 99.8% of use-cases aside from the normal login prompt, and normal users have a script associated in the password file which starts loading the program automatically.

To be honest, I don’t know if any of the test operators for the last 30 years even knew there was an operating system behind the program itself, and the only three or four current engineers who would be aware wouldn’t have been poking around in it anyway.

I’m not suspicious that it may have gotten messed up but like I said in response to your other comment, I’m definitely curious to try.

I’m a little more prepared to investigate things the next time I encounter or can cause this behavior… Unfortunately the two systems that recently suffered issues were mission-critical, so it was a bigger priority to just get them back up and running rather than trying to find some indication of why anything happened -_-;

it’s frustrating because I’m not the kind of person who likes to “fix” something without actually knowing what I did or without any confidence that I was actively working toward a solution… starting at zero and throwing stuff at the wall to see what sticks isn’t exactly deliberate troubleshooting lol

Tim · November 25, 2024, 4:22pm

How can you know this when you don’t have the source code? From within a C program you can easily run O/S commands like ‘search’ and change the search path.

My guess continues to be that within the software the search path was modified. Likely on purpose to simply make doing some file copying operations (to floppy or other backup medium like a 2nd partition on the HD?) easier so they didn’t have to fully qualify paths. When running correctly it should put it back the way it was after finishing the operation so everything would continue on just fine. But sometimes for reasons unknown, something fails somewhere and the code doesn’t put it back properly.
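To illustrate the failure mode being guessed at (and it is only a guess about software nobody here has the source for), this Python sketch contrasts a save/change/restore sequence that leaks the changed state when the operation fails with one that always restores it. The `state` dictionary and both function names are hypothetical.

```python
from contextlib import contextmanager


def change_without_guard(state, new_order, operation):
    """Fragile version: the restore is skipped if operation() raises."""
    saved = list(state["order"])
    state["order"] = list(new_order)
    operation()                    # an error here leaves new_order in place
    state["order"] = saved


@contextmanager
def temporary_search_order(state, new_order):
    """Robust version: the original order is restored even on failure."""
    saved = list(state["order"])
    state["order"] = list(new_order)
    try:
        yield
    finally:
        state["order"] = saved
```

If the operation fails partway through in the fragile version, every later unqualified pathname is resolved against the wrong drives until something resets the order.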

I can’t remember if you’ve mentioned this before, but are there any other active consoles? If you aren’t aware, QNX allows virtual consoles. You can cycle between them with CTRL-ALT-+ (plus key or minus key or even enter a number like 2 to go to the 2nd virtual console). Hopefully there are virtual consoles available to you so that you can log in on a different console when the problem happens. Because if you can’t, I don’t see how you are going to be able to run commands from a floppy drive. If you do have virtual consoles then hopefully logging in on one of them will let you check the search path and see if you can fix it on a running system.

You also won’t be able to mount a RAM disk after the problem occurs because all the commands to create/mount the RAM disk and fill it with commands won’t be possible. You need to have the RAM disk already created and ready to go with all the commands you think you might need.

Tim

bananaman · November 27, 2024, 3:14pm

fair. I am assuming as such because I didn’t see it as plain-text when skimming through a hex dump of any of the relevant program files. Other command strings show up occasionally when some programs are designed to Do A Thing in the shell in the background…

Coincidentally, when you or maschoen mentioned the spatch utility recently, I was able to go in to a program that gets loaded from a menu in the main application, find and delete two plain-text strings I saw that were executing a command that I didn’t want running for what I was testing, and it successfully stopped the program from executing that particular command.

The software we use does not know to look for other QNX partitions, does not attempt to look for them, and does not attempt copying to a floppy/tape/partition in any of the occurrences where this strange hanging behavior has occurred.

By default, the company that made this software package has configured the QNX sys.init file to load three other consoles. The second isn’t interactive, it is simply a status display and rolling log of system messages. The third and fourth are there for instances when a user may need multiple screens for configuring a set or comparing/referencing the code for a program, etc.

When this behavior occurs, you cannot do anything in the other two consoles… cannot log into them because the system doesn’t see the password file, cannot execute any command if you’re already logged in because it gives “command not found” like I said earlier…

Yes this I am aware, I meant that I have to find a situation where this behavior occurs so that I can restart the machine and configure a RAM disk and floppy with commands and configure search to be aware of them, etc, and then trigger the behavior and see if the ram disk or floppy are still accessible and able to execute whatever commands I have put on them.

Between Friday and yesterday, we again had this strange behavior appear, but unfortunately it was for a test that needed to be run this week before a High Profile customer came to witness progress starting next week. I was not afforded the chance to test things, the biggest priority was just Making Things Work…

the short version is I had to rebuild a device configuration from scratch (as opposed to the engineer having simply imported the configuration from another pre-existing set which may have had some corrupt data or a conflicting tag in the database). I want to go back and compare some things between the two set configurations that may highlight a prospective conflict, but they’re going to be running consistently for a few weeks or months now.

maschoen · November 28, 2024, 3:21am

I do not personally think the disk is actually full, but that is what the background disk-monitoring program reports whenever the problem occurs, it says “0.0M Free” in its corner of the screen. This can be a matter of seconds or a minute or two after the mystery problem occurs, prior to which there may have been gigabytes free.

This sounds to me like the same thing. The program wants to figure out how much space is left on disk, but it can’t access the disk so it reports zero.
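A monitor that behaves this way is easy to write by accident. The following Python sketch is an assumption about how such a display could work, not a description of the actual disk-monitoring program: any failure to query the disk is folded into a free-space value of zero.

```python
import shutil


def free_space_label(path):
    """Format free space for a status corner, treating any error as zero."""
    try:
        free_bytes = shutil.disk_usage(path).free
    except OSError:
        # The disk could not be queried at all; the display cannot tell
        # this apart from a disk that is genuinely full.
        free_bytes = 0
    return f"{free_bytes / (1024 * 1024):.1f}M Free"
```

A display built like this shows “0.0M Free” for an unreachable disk, which matches the symptom without the disk ever filling up.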

This same problem has occurred across at least 4 different hardware configurations, and all explicitly from some software related trigger, albeit whilst doing different tasks.

This is clearly good evidence that it is not hardware.

Could be, but I can’t understand why opening a file or importing a configuration to the database or simply trying to save a file from a particular instance of the text editor would cause it to freak out.

Those operations shouldn’t. There is a simple test if you have a shell prompt. Enter the command:

$ 3:/cmds/task

If you see anything besides a list of running tasks, I think your driver is hosed.

bananaman · December 6, 2024, 12:41pm

So there are a regular amount of running tasks at any given moment, especially if they’re in the process of scanning data from any hardware…
if I run tsk, this is the default batch of processes right after logging in directly to the shell after startup

task
fsys
dev
idle
/cmds/sh
/cmds/timer
*s/sys/an_timer
*/sys/pdb_admin
/cmds/queue
/cmds/clock
3:/cmds/hpgpib
/cmds/dyna
*cmds/emul.2000
/etc/cmds/netd
/etc/cmds/inetd
*c/cmds/telnetd
*ys/tcp_monitor
*sys/c86c_admin
*s/sys/autonetd
*sys/patch_func
*/sys/serno_mon
*ys/an_auto_del
/cmds/spooler
/cmds/spooldev
*as/sys/err_adm
/cmds/tsk
*s/sys/tcl_beep
*s/sys/test_adm
/cmds/login
/cmds/login

Hmm, what exactly do you mean by “anything besides a list of running tasks?” As in, try to see if there’s an oddball task, or would some kind of garbage show up on a line?

Coincidentally, I just noticed the “serno_mon” task, which I’ve just realized is “Serial Number Monitor…” Is that a normal QNX thing? Otherwise, that may be the thing that handles license validity for the software we use, and I’m wondering if restoring a functional system from a full backup trips some file/setting/attribute that the serial number task doesn’t jive with…

maschoen · December 6, 2024, 2:18pm

It’s been so long that I forgot that they changed the program task to tsk. For fun run $ tsk tsk
What I meant was, you should see a list of tasks, but if nothing shows up, the driver is hosed. I would not expect either an odd task or garbage. I suggest this because that would indicate that the OS can’t read the program 3:/cmds/tsk to run it.

serno_mon is not a QNX 2 distributed program. I can’t say anything about its purpose with your application.

I can tell you a few things that come to mind as I look at your task list.

/cmds/dyna - I think this was for running programs with the C86 compiler. I think it had to do with loading shared libraries. C86 was an early ANSI compiler that Computer Innovations ported to QNX.

You are running one of the TCP/IP variants that ran on QNX 2. None of them had a particularly good reputation. One required an expensive ethernet card that had an on board processor. Another required a very specific commercial card. I don’t know which variant you are running.

Aside from the OS tasks, timer, queue, clock, spooler, spooldev and login are the only QNX supplied programs that I see running.

I’m not sure but I think that c86c_admin is a program provided by Computer Innovations that let you run their compiler. It would be very strange, and probably a license violation, to provide this with a distributed system. That is, unless your system was a one-off that was developed in house.

I think you could try running $ c86 ? to see if the compiler is on your system.

bananaman · December 6, 2024, 5:41pm

ha :P

I’m definitely gonna try to tug on the string for whatever serno_mon is, I have a sneaking suspicion that it may be the root of these random issues…

as far as any c86 related stuff, I’m not surprised to see it since FairCom’s c-tree database runs in the background, so I think it must need some assorted C resources… if I skim through a hex dump of c86c_admin, I see strings that explicitly call out components and functions of the database. I have to imagine it handles some amount of i/o with c-tree.
(a files search of *c86* returns both “.dcfg” and “.dlib” files named c86base, c86cvt, c86fprnt, and c86fscan)

this software has a package that allows advanced users to compile their own programs in four different languages; its own control language, QNX C, C 86, and Fortran 77+… it even calls out Computer Innovations on C86 (and Southdale next to Fortran 77+).

The actual compiler and libraries would likely have been included with that application-builder package (they refer to it as “Integrated Software Development Environment”). None of our systems have that package, so all the meat and potatoes of the compiler stuff aren’t actually included.

There’s an EchoLink package for this software that includes the TCP/IP components and has drivers for three ethernet cards; NE2000 compatible, WD 8003EB, and 3Com 3c507. I forget the exact card we use but it’s an ISA card and takes the NE2000 driver.

Honestly we haven’t had any specific issues with the network capabilities, we routinely send data out to a server we have in our lab via the software’s built-in export functions, there’s also a rudimentary ftp program in qnx that lets us use basic ftp commands… I just wish there were some kind of background ftp HOST function so I could get into these machines remotely from my main desk, but maybe that’s asking too much lol

bananaman · December 10, 2024, 2:47pm

The problem occurred again yesterday and I had an opportunity to mount a ramdisk w/commands to try and poke around a bit…

when running tsk after drive 3 becomes inaccessible, these are the tasks that were running at the time;

task
fsys
dev
3:/cmds/hpgpib
/cmds/dyna
*cmds/emul.2000
/etc/cmds/netd
/etc/cmds/inetd
*c/cmds/telnetd
*ys/tcp_monitor
*sys/c86c_admin
/cmds/spooler
*sys/patch_func
*/sys/serno_mon
*ys/an_auto_del
/cmds/spooldev
*s/disk_monitor
*as/sys/err_adm
/cmds/run_menu
/cmds/sh
*sys/status_bar
*s/sys/test_adm
*s/sys/tcl_beep
/cmds/sh
/cmds/afm
/cmds/sh
/cmds/tsk

(note: i find it interesting that the clock and the two export program components don’t show up in the list at the time of the failure… this was the third time I triggered the failure and tested things with the ramdisk; they did appear in my first attempt. I’m not sure whether that’s really a sign of anything of consequence)

when trying to run a simple files command on disk 3, it just says “Unable to open.” I think it’s been established that anything relating to the boot drive is out the window.

running the sac command shows moderate activity on priority 3 (with 15 taking up the rest), and the only task that has a priority of 3 is fsys.

in the moment, I neglected to see if a second mounted partition of the same drive might have been readable, I figure that might have given more weight to the possibility of the disk driver having gotten messed-up since the ramdisk doesn’t use the .ata driver.

I’m beginning the hunt for anyone who might have been involved with developing the software. Unfortunately it was a dead-end via Comark, the company that acquired Nematron (which itself had acquired Imagination Systems), so my best two leads at the moment are someone Tim found on LinkedIn from another thread, and another gentleman I found on LinkedIn.

I’m really surprised there’s almost no record of this software or these people anywhere online, it really must have been a niche industry program.

Tim · December 10, 2024, 3:09pm

This sounds like there are too many open files in the filesystem rather than fsys having crashed, especially since fsys is still in the task list and consuming CPU.

Tim

bananaman · December 10, 2024, 6:31pm

so in this particular instance, this failure has occurred during a logged-data export…

During testing, the software logs data from however many points of information a test plan calls for. It does this at a specified rate, so the log files are just thousands of line-entries of the points and their status.

The files can be of varying size; you are able to specify the number of lines per log, and we typically limit them to 8000 lines, but each line could hold a couple dozen points or a couple hundred.

The export function converts the raw log file to a format we can import into Excel later.
It converts one file at a time, building it in the 3:/tmp/ directory along with temporary files that give the ftp command its instructions (login/pw/ip/file/etc).
Once the file is transferred, the export function moves onto the next log file and the process repeats.

It only processes one log file at a time, and this is normal operation that we have done thousands of times on multiple different systems. As I said before, this failure behavior has happened in a handful of different scenarios, and never under circumstances that are out of the norm for the decades that this software has been in use.
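To make the flow concrete, here is a rough sketch of that export loop as I understand it. It is written in Python purely for illustration (nothing like this runs on QNX 2), and every function name, file name, and file format in it is made up; I have not seen the real export code. The point is only that each log is converted, handed off, and cleaned up before the next one starts.

```python
import csv
import os


def convert_log(raw_path, tmp_dir):
    """Convert one raw log to CSV in tmp_dir (whitespace-separated input is an assumption)."""
    out_path = os.path.join(tmp_dir, os.path.basename(raw_path) + ".csv")
    with open(raw_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for line in src:
            writer.writerow(line.split())
    return out_path


def write_ftp_script(csv_path, tmp_dir):
    """Write the temporary instruction file for the transfer (contents are hypothetical)."""
    script_path = os.path.join(tmp_dir, "ftp_job.txt")
    with open(script_path, "w") as f:
        f.write("put " + os.path.basename(csv_path) + "\n")
    return script_path


def export_logs(raw_paths, tmp_dir, transfer):
    """Process logs strictly one at a time, removing temp files before the next log."""
    for raw_path in raw_paths:
        csv_path = convert_log(raw_path, tmp_dir)
        script_path = write_ftp_script(csv_path, tmp_dir)
        try:
            # 'transfer' stands in for whatever actually drives the ftp command.
            transfer(csv_path, script_path)
        finally:
            os.remove(csv_path)
            os.remove(script_path)
```

If the real thing behaves like this, the temp directory should never hold more than one converted file and one instruction file at any moment, and both files are closed before the transfer even starts.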

coincidentally, this is why I had made a different thread asking about the osconfig command; by default from the developer of the software, it has been configured to allow a maximum of 128 open files…

I can’t imagine it would even have that many open files, but I also don’t necessarily know what constitutes an “open file” beyond a running program/command/driver/library and files that are currently being read or written.
Would the fopen command be a good indicator of open files in a given moment? If I run the command while a system is just sitting idle after startup, it only returns one open file…
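Just so I’m sure I follow the “too many open files” theory, here is a toy model of it in Python. This is not how fsys actually tracks files; the `FileTable` class and `run_exports` function are invented for illustration, and the only number taken from our systems is the 128 limit from osconfig. It just shows why a program that forgets to close one file per export would work fine for a while and then hit a wall.

```python
class FileTable:
    """Toy system-wide open-file table with a fixed number of slots."""

    def __init__(self, max_open=128):
        self.max_open = max_open
        self.entries = {}      # handle -> file name
        self.next_handle = 0

    def open(self, name):
        # Once every slot is taken, any further open fails for everyone.
        if len(self.entries) >= self.max_open:
            raise OSError("Unable to open " + name)
        handle = self.next_handle
        self.next_handle += 1
        self.entries[handle] = name
        return handle

    def close(self, handle):
        del self.entries[handle]


def run_exports(table, count, leak):
    """Open one file per export; if 'leak' is True the file is never closed."""
    done = 0
    for i in range(count):
        try:
            handle = table.open("log%d" % i)
        except OSError:
            break
        if not leak:
            table.close(handle)
        done += 1
    return done


print(run_exports(FileTable(128), 1000, leak=False))
print(run_exports(FileTable(128), 1000, leak=True))
```

With clean closes, all 1000 exports go through; with a one-file leak per export, it stops after 128. If something like that were going on, I’d expect fopen to show a steadily growing list before the failure.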

Tim · December 10, 2024, 10:25pm

1 file being opened in the entire system seems suspiciously low to me. You’d think other programs (including QNX O/S programs) would have open files. To get a better idea of whether this number makes sense or not, you should try running the fopen command during normal program execution and see what number it returns.

I forgot to ask in my prior post but I am assuming that the ‘files’ command you tried to run is in your Ram disk. Because if it isn’t it would make sense why you’d get that error since the files executable isn’t available (it seems you at least have the fopen command on your Ram disk).

Something else you can experiment with is slaying the fsys driver and restarting it (presumably you know the options from the sysinit) when this happens (obviously fsys needs to be on your Ram disk too) to see if it allows normal operations to continue.

Tim

bananaman · December 11, 2024, 2:36pm

certainly, but again, I don’t know exactly what fopen expects to see or how it deems a file to be “open.” Personally, I would figure absolutely every little granular component of even an OS itself would be considered an “open file,” unless it’s all resident in memory, or only accessed on the drive when called for.

I just typed the command now after a restart and it returns zero open files (if I’m in the shell, wouldn’t “sh” at least be open?). If I open the file manager in another console, fopen in the first console still returns zero files open (not even the file manager shows), but if I open a file through the file manager for ascii viewing, fopen finally lists that file.

if I load the software and start device scanning, fopen returns the program itself and the two database files but nothing else. If I also start logging, it adds the newly created log file…

Yes.

ooh that’s a good idea. What options do you mean from the sys.init? because the default has like almost 300 lines; I know a lot of those are comments and routines and scripts and other little background things the developers of the software configured for the purposes of the program, but I don’t know what would be just the most integral stuff besides speed_correct.
back, dots on, base 10, slice 2 5 >$null >*$null… I assume I’d have to mount a console again, wouldn’t I? Would killing fsys cause the current console I’d be using to lock-up or die?