Open files means files that are actually open in the file system at that moment for read/write. Your experiments with running fopen under normal conditions look right. The shell (sh), for example, is not an open file because it’s a program that’s been loaded into memory. It might be open for a very brief moment while it’s loading from disk into memory, but that would be very hard to catch with the fopen command. Editing a file in the file manager should definitely show as an open file, and indeed it does.
So now the question is: what does fopen show when the problem occurs?
As for restarting fsys: look in sys.init and search for fsys (hopefully it’s in there and not in a boot image). Usually fsys is started with some arguments that tell the file system things it needs to know (drivers to launch, other configuration information). You’ll need to match those arguments if you want the restart of fsys to work.
Killing fsys should not cause the console to lock up or die (it doesn’t in QNX 4, 6, 7, or 8). It’s been a long time since I ran QNX 2, but I’d be surprised if it did there. What it will of course mean is that you can’t access the hard drive. But you should still be able to access your RAM drive, so as long as a copy of fsys is there you should be good to restart it. Obviously I’d practice killing it once on a running system to make sure you can start it again successfully; otherwise it’s a waste of time to wait until the problem occurs and hope you can do it.
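A rough drill might look like this; every path below is a placeholder, and the actual fsys arguments have to be copied from whatever your sys.init uses:

" stage fsys, mount, and friends on the RAM drive (RAM drive path is a guess)
backup /cmds /ramdisk/cmds +a
" kill the file system task
slay fsys
" restart it from the RAM drive with the same arguments sys.init used
/ramdisk/cmds/fsys <arguments copied from sys.init> &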
The string does not exist elsewhere in the dump, though I know that doesn’t necessarily rule anything out.
Also, if I do a files p=*fsys* search, all it returns are chkfsys and dosfsys, I don’t see an actual command/program/file/whatever simply named fsys…
Should I assume that if I didn’t find a standalone “fsys” file, chances are it’s not easily dropped onto the ramdisk?
EDIT: fwiw, here’s a default sys.init from one of these systems
" ********************************************************************
" * QNX2 uses a 16-bit uint to store the CPU speed index. On faster *
" * machines, the value can wrap around because it will exceed 66535 *
" * (max uint). So, force the CPU speed index to be 65535. *
" ********************************************************************
/cmds/speed_correct
" /**********************************************************************/
" /* this is the sys.init.anet file for at boot diskettes, to be copied */
" /* to sys.init on the target AutoNet machine. */
" /**********************************************************************/
back
dots on
base 10
stty inton=3
stty inton=4
timer &
" *****************************************************
" * set the tick size and the task switching interval *
" *****************************************************
slice 2 5 >$null >*$null
if +f /das/sys/an_timer then
/das/sys/an_timer &
endif
if eq "#p" "00044" then
clock_update ps2/80 &
else
clock_update at &
endif
if eq "#p" "00044" then
rtc ps2/80
else
rtc at
endif
/das/sys/set_an_time
mount cache d=3 s=16k
mount xcache s=10k
mount bmcache d=3
" ***************************************
" * Check for and recover corrupt files *
" ***************************************
/das/script/clean_up
" ******************************
" * Check for any update files *
" ******************************
if +d /updt then
backup /updt / +a -p -v +w > $null >* $null
ws /updt "chattr @ a=+w > $null >* $null" -v l=99
rm /updt +r > $null >* $null
if +d /updt then
files /updt -v > /tmp/zap.files
eo /tmp/zap.files "zap @" > $null >* $null
endif
endif
" ******************
" * Mount consoles *
" ******************
mount console 7 $con2
stty >$con2 -edit -echo +noboot up=0 down=0
mount console 8 $con3
stty >$con3 +echo +edit up=a1 down=a9
ontty $con3 login
mount console 9 $con4
stty >$con4 +echo +edit up=a1 down=a9
ontty $con4 login
" ******************************************************
" * Mount the shared library used by the 'sac' command *
" ******************************************************
mount lib /config/sac.slib
" *****************************
" * Show the ISI company logo *
" *****************************
if +f /expl/nemasoft.logo then
expl nemasoft.logo
endif
type
passon
stty +noboot up=0 down=0
if +f /drivers/glib.isi mount lib /drivers/glib.isi
mount float
" mount lib /config/pdb.slib
" ****************************
" * Start the system options *
" ****************************
ws /config "sh @" -verbose p=sys.opt.*
" *****************************************************
" * Set the CURRENT_NETWORK_NODE in the security file *
" *****************************************************
if +f /cmds/set_security then
zeros off
set_security 66 #n
" Determine if this installation of AutoNet will have access to the
" NETWORK menu option: access allowed if this node is not node 0
" (node 0 is a standalone node without archnet) or if the TCP/IP
" option has been purchased and installed.
if eq #n 0 then
get_security 273 -p
setvar = i0 #?
if eq #i0 0 then
set_security 67 0
else
set_security 67 1
endif
else
set_security 67 1
endif
zeros on
endif
" *************************************************
" * Bring up pdb and later load pdb security file *
" *************************************************
/das/sys/pdb_admin &
sleep 2
" *************************
" * Start queued messages *
" *************************
queue p=27 &
/das/sys/pdb_security
clock a=f104 &
if +f /das/data/tape_name then
cron -l &+
endif
if +f /das/sys/an_auto_del then
/das/sys/an_auto_del & >$null >*$null
endif
stty -lock >$lpt >*$null
" ******************************
" * Create isam parameter file *
" ******************************
chattr /das/data/AN.DVPRG.P -b
echo "3 1 8 1" >/das/data/AN.DVPRG.P
echo "0 [#n]3:/das/data/AN.DVPRG.DAT 30 16384 7 1" >>/das/data/AN.DVPRG.P
echo " 1 [#n]3:/das/data/AN.DVPRG.IDX 30 0 0 0 16384 3 0 0 1" >>/das/data/AN.DVPRG.P
echo " 0 30 0" >>/das/data/AN.DVPRG.P
base 10
if eq #n 00001 then
poll -v &
netboot &
endif
if ge #n 00001 then
poll_check &
def_server #n 00001
nacc 3 +read +write
alive +n n=1
nacc CPU 1 +w +r
endif
if ge #n 00001 then
stty n=$prt >$lpt >*$null
if eq #? 00000 then
zeros off
spooler d=$prt t=[#n]3:/spool/ >$null >*$null &
spooldev f=3:/config/spool.init &
zeros on
endif
else
stty n=$prt >$lpt >*$null
if eq #? 00000 then
spooler d=$prt t=3:/spool/ >$null >*$null &
spooldev f=3:/config/spool.init &
endif
endif
if +f /das/sys/err_log.0 then
chattr /das/sys/err_log.0 s=-b
endif
if +f /das/sys/err_log.1 then
chattr /das/sys/err_log.1 s=-b
endif
if +f /das/sys/err_log.2 then
chattr /das/sys/err_log.2 s=-b
endif
if +f /das/sys/disk_monitor then
/das/sys/disk_monitor &
endif
" *********************************
" * Start the error administrator *
" *********************************
icheck_reg err_adm
if ne #? 00000 then
/das/sys/err_adm 2 7 & >$con2 >*$con2
endif
" *****************************************
" * Start automatic system message delete *
" *****************************************
if +f /das/sys/del_msgs then
/das/sys/del_msgs &
endif
" *************************
" * Clear any spool files *
" *************************
if +d /spool then
ws /spool "rm @" -v
endif
" *************************************************
" * Clear any files files from the /tmp directory *
" *************************************************
if +d /tmp then
ws /tmp "rm @" -v
endif
if +f /das/sys/FLAG.PRINT then
sleep 1
/das/sys/an_err_out "$lpt"
endif
" *****************************************************
" * Call netx if node id is different since last boot *
" *****************************************************
if +f /cmds/convert_node then
convert_node
endif
if +f /das/sys/setup_rcomm then
/das/sys/setup_rcomm +sleep &
endif
" /*********************************************************************/
" /* if the system flag file is in /das/sys, the computer crashed with */
" /* AutoNet up (one time or another). We then will automatically */
" /* spawn the necessary tasks. */
" /*********************************************************************/
if +f /das/sys/FLAG.ADM then
kill_restart
if eq #? 00001 then
/das/sys/admin_up
if +f /das/sys/FLAG.COM then
/das/sys/comm_up
/das/sys/scan_manager
endif
if +f /das/sys/FLAG.LOG then
/das/sys/log restart -p
endif
/das/sys/send_err 0 "Automatic Restart"
else
/das/sys/send_err 0 "Automatic Restart Disabled"
if +f /das/sys/FLAG.LOG then
ws /log "rm @" p=LOG.* -v
endif
ws /das/sys "rm @" p=FLAG* -v
endif
endif
" ************************************************
" * Clear any active remote communication setups *
" ************************************************
if +f /das/data/active_setup.cfg then
rm /das/data/active_setup.cfg
endif
" **********************************************************
" * Start any existing default remote communication setups *
" **********************************************************
if +f /das/sys/setup_rcomm then
/das/sys/setup_rcomm
endif
iposcur 20 0
type "eJ"
ws /config "sh @ &" -verbose p=bsys.opt.*
It’s definitely inside the boot image. QNX2 is so old that I guess they didn’t let you modify anything to do with the filesystem. Of course, that same fsys must be working, because you can still use the RAM disk, so it can’t be totally hosed (though the disk driver itself could be).
A few posts above, you showed the results of running ‘tsk’ when the system got into that state. Are there any differences between that tsk output and the output of tsk when the system has just started up and is normal? I’m looking for something that’s disappeared (crashed).
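You could capture both snapshots to compare side by side (paths are placeholders, and I don’t recall whether QNX 2 shipped a diff, so comparing by eye may be it):

" right after boot, while everything is normal
tsk > /ramdisk/tsk.good
" later, when the system is in the bad state (the RAM drive should still work)
tsk > /ramdisk/tsk.bad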
I’d be more curious to see what was running during and prior to the crash on the same machine; I kinda wish I had thought to check that in the moment.
In this particular instance, it seems like the commands timer, queue, and clock are missing, along with other non-QNX system components: an_timer, pdb_admin, and autonetd…
an_timer is probably part of whatever routine runs in the background for the software to handle time-based processes, especially when it’s running code for a given test program.
pdb_admin is point database administrator, probably just a primary handler for reads and writes
and I’m not sure what autonetd is but a hex dump reveals lots of plaintext and assorted shell commands and file pointers… It may be like a primary background service for a lot of the different components of the main application.
It’s certainly possible that those tasks not running could be a problem, but there’s no reason they should have quit. I’ll have to do some random testing to see if using slay to stop them from another console does anything on a system that is functioning just fine…
Tim, I’m pretty sure that unlike later versions of QNX, fsys and dev cannot die. I’m not sure about having too many files open, but it is a possibility. The way to check would be to kill some tasks and see what happens.
bananaman, there is an obvious thing to try to see if the driver is hosed: re-mount it. To do this you would need the mount command and the driver on your ram disk. If that brings access to disk 3 back, it confirms that this is the problem. If it doesn’t, the result is not conclusive, as Tim might be right about the open-file problem.
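In sketch form it would be something like this; I’m going from memory on the ‘mount disk’ syntax, so treat it as an assumption and check the mount usage text, and the RAM-disk path is a placeholder:

" re-mount the hard disk driver for drive 3 from the RAM disk copy
mount disk 3 /ramdisk/drivers/disk.ata
" if this lists the root of drive 3 again, the driver was the problem
files 3:/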
I’ve noticed this behavior has resurfaced with a degree of consistency in a couple of the systems that I’ve restored, about every couple weeks or so, so I’ll cobble a little utility disk together with all the little things I’d need to test whenever it occurs again.
I still don’t know how much closer it gets me to an actual understanding or explanation as to why it might be happening, but anything that unshrouds the mystery just a little bit more each time is always good.
Hmm I just thought of something as I was typing this reply…
After getting the zoo command to work and trying to figure out some preliminary options for making a list of empty directories, I remembered a comment in the mkdir description:
Directories can contain any number of files, but may become fragmented if the initial size is not made large enough which often implies a speed penalty when opening files.
mkdir has an option to specify a [size] for a created directory…
When I have restored these systems, I just use backup sourcedir targetdir +a s=c
to copy an entire batch of files and directories back to the boot drive of a system…
Is it possible that the backup command creates directories at the default size limit (10 files) before copying files in, so any directory with more than 10 files immediately gets somewhat fragmented in the process? And then, perhaps, whatever utility in the software we’re running gets tripped up because it takes too long to skip across the fragmented data?
I know that’s kind of a stretch, because the file system should be able to navigate fragments by design, but in a couple of the instances where this weird behavior has occurred, the software was accessing files in a directory that often holds dozens, and occasionally hundreds, of files.
Is there a way to see folder attributes that might call out what the current file limit is? files +d +v at least lists blocks and xtents, so if I see, say, 20 blocks but only two or three xtents, I can assume that directory holds at least a couple hundred files split into just a couple of fragments; but if I see 44 blocks and 23 xtents, it likely has at least 400 files spread over twenty-odd fragments…
Is there a way to update the size or “file limit” of an established directory?
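I suppose one way to test the backup theory (paths here are just placeholders) is to let backup create a directory itself and then inspect it:

" let backup create the target directory with whatever default size it uses
backup /das/data /tmp/testdir +a s=c
" then look at the blocks and xtents of the new directory
files /tmp +d +v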
Lots of questions here, but first a quick question: are you running QNET? QNET was the native networking for QNX2 and required custom (Corman) Arcnet cards. If so, are the processors 486s? There was a bug that surfaced late in the life of QNX2 that caused a problem. If you need to know what that was, let me know.
So, in many ways, a directory is just a special file; the structure of that file is the same as for regular files. My recollection is that there was an algorithm that increased the number of sectors allocated after each allocation, so as your directory grew, the file system would try to use larger and larger segments of contiguous sectors.

Now, fragmentation on a QNX2 system could become an issue. One mitigation was to mount as much sector cache as possible, which was around 64K. There are two other types of cache that can be useful when mounted, and by the end, QNX2 systems had plenty of extra memory for those caches.

I do remember there being a serious performance problem with directories holding a very large number of files. When Bill Flowers rewrote the file system for QNX4, he was very proud of how he had optimized things so that very large directories would work reasonably well.

If you want to investigate the fragmentation of a directory, spatch is your friend. Each extent, an allocated block of contiguous sectors, has a 16-byte header, and if you can look at the include file fsys.h, the format is documented. With spatch and some paper you can scan through the extents, seeing where they are.
The command “$ tsk info” might have some limits information. The file system limits are likely documented somewhere. I don’t know what you mean by “update the size or ‘file limit’ of an established directory.” You can probably defrag a directory as follows: find the number of files, use the size parameter of mkdir to make a large directory, backup from the fragmented directory to the new one, then delete the old directory and rename the new one.
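In command form, that might look roughly like the following. Directory names are placeholders, I’m assuming mkdir takes the size as a trailing argument the way its usage text implies, and copying back into a freshly made directory avoids needing a rename:

" count the files in the fragmented directory first
files /das/data -v
" make a new directory pre-sized for them (size argument syntax assumed)
mkdir /das/newdata 1000
" copy everything across in one pass
backup /das/data /das/newdata +a s=c
" remove the fragmented original, recreate it pre-sized, and copy back
rm /das/data +r
mkdir /das/data 1000
backup /das/newdata /das/data +a s=c
rm /das/newdata +r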
No QNET; these are Linksys Ether16 10BaseT ISA cards… I couldn’t tell you who did the networking drivers/software, though; the driver is considered “NE2000 compatible” by the installer package. And these are mostly Pentium II Dell Optiplex GX1/GXa machines, except for two that are Pentium 4s I’ve built from an industrial board.
If I look at the file “snmpconf” in 3:/etc, it reads “SpiderTCP TCP/IP (release 5.1) for QNX version 2.15” and a string of other misc numbers… I also see a SpiderTCP text string in a hexdump of the driver (dated Dec-15-95), so there’s a good chance that’s who wrote it?
directories grow as more files are added, but a few key directories can potentially contain hundreds of configuration files, log files, etc…
I’ve noticed some of those key directories on these systems will take up many blocks, but only have 1 extent, which leads me to think that the developers of the software intentionally specified those directories to be readily capable of Many Files without having to make the filesystem skip around to find other pieces.
During a fresh system install, all of the magic happens without really reporting anything that it’s doing on-screen other than asking for subsequent disks and prompting the user for a few small settings… so you don’t see any of the QNX install process, any of the directory creation, etc etc.
Unfortunately there’s no “fsys.h” file, but I think digging that deep is beyond the scope of my needs (at least at this point).
tsk info offers information about the system, but that’s about it…
And yeah, I figured there wasn’t any way to really change the file limit attributes of a directory without doing basically what you said. At least it’s a relatively simple process to defragment a folder or an entire drive just by moving data elsewhere and copying it back.
And I presume that the system is not using QNET, just TCP/IP. There was talk of some large customer that paid QNX a lot of money to make QNET work over ethernet. I doubt you are using such machines, but if you were, the same bug probably applies. It was a race condition that first appeared on 486s due to their speed; it would occur a lot on a Pentium.
which leads me to think that the developers of the software intentionally specified those directories to be readily capable of Many Files
That sounds right. You could check by doing a files +v +d and looking at how many sectors a directory is using.
I will find and post fsys.h if you ask.
Since at times I saw painful delays due to directories being very large, a thought occurred to me.
The speed at which the CPU can read through sectors is no doubt not a problem with the CPUs you described. If a directory isn’t that large, the sector cache should take care of things. But what if a directory is so large that it doesn’t fit in the cache? That probably is an issue with a directory of more than 1000 files. I imagine any file open has to traverse the directory from the beginning until it finds the file in question. If you had a directory with 2000 files and opened a file near the end, fsys might need to read every sector of the directory to find the file entry. If you are opening and closing files a lot, that would create a lot of unexpected disk access.
You might think that there would be a better algorithm, but remember when QNX 2 was written: the first versions only had to deal with 360K floppy disks, and the first hard drives supported had 5 MB of disk space. There were enhancements added, such as the caches, but I don’t think much was changed over the years.
Another minor related point: the IDE driver that QNX first supplied, disk.at, only read one sector at a time. The disk.atc version was able to do a multi-sector read. The disk.ata driver that I wrote (LBA) was based on the atc driver. I think most later hard drives laid the sectors down sequentially (1, 2, 3…), not staggered around a track, so if your driver is reading 1 sector at a time, that’s 1 rotation of the disk for each read. That’s only 60 sectors/second.
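To put numbers on that (assuming a 3600 RPM drive, which is where the 60/second figure comes from):

3600 RPM / 60 = 60 rotations per second
1 sector per rotation x 512 bytes = about 30 KB/second, worst case

A 7200 RPM drive only doubles that to 120 sectors (roughly 60 KB) per second, still painfully slow next to a multi-sector read.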
Interesting, do you have any other info about it? It’s nigh impossible to find any real information about legacy QNX, let alone commercial and corporate dealings from 30 years ago. Not that it would necessarily help me, but I’m honestly curious.
Though, it would be neat if there were a way to actively trigger the condition to see if it might actually affect these systems.
Yeah, I think my approach is going to be to make a record of the directory structure and sizes, and then manually create the few directories that routinely handle many files. I made a small table to estimate a directory’s [size] limit from how many blocks it occupies: roughly 10 blocks = 100 files, 19 blocks = 200 files, 94 blocks = 1000 files, and so on. With the proper sequence of backup and zoo and shuffling files around, I can relatively easily restore a system without (hopefully) worrying about any prospective issues related to directory sizes.
I still don’t know if this is even the underlying cause of this weird failure behavior, but I’m starting to lean toward it the more I think about it… It has occurred most frequently when files are being written to or read from directories that would typically house many files, on systems that I restored, where I likely would not have made those directories with a file limit large enough to avoid fragmentation.
Again, I know the system is able to navigate fragmented files and folders, but maybe there’s just some stupid little bug or quirk that was never fully discovered in development.
Yes please, if nothing else it might be good reference in the future.
I made a directory with a 1000-file limit, and it occupied 94 blocks; if each block is 512 bytes, that’s just a hair over 48 KB.
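Doing the math on that, the cache-overflow idea seems to hold up:

94 blocks x 512 bytes = 48,128 bytes for ~1000 entries (about 48 bytes per entry)
2000 entries = ~188 blocks = ~96 KB, bigger than even a 64 KB sector cache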
as it stands, these systems have always been configured for a 16kb cache, 10k xcache, and bitmap cache enabled; that is just how the developer of the software set them up at the time.
This hasn’t been an assumed issue in the past, but any directories that might have a high number of files would have been configured for such in advance, rather than grown as necessary.
What is considered the sector cache, is that “cache” or “xcache” from the mount command?
All I know in this regard is that we use the disk.ata driver, and it seems fine? These late-era IDE drives may access sectors sequentially, but they’re 7200 RPM and likely have enough of their own controller cache to handle things as necessary (I hope).
If you are not using QNET, it is not an issue. Sending messages to a task on a remote node requires the creation of a ‘virtual circuit’, and there are a limited number of these available. A race condition could cause the removal of one of these to fail; eventually all virtual circuits are used up and no connections to remote nodes are possible. There was a utility that was supposed to clean up these dead circuits, but I could never get it to work. There is a tsk option that will show virtual circuits, maybe ‘tsk vc’.
Getting fsys.h to you is on my todo list.
The sector cache is the cache. The xcache caches segment headers, the 16 bytes that allow you to jump around a file with seek without doing a read. If your system has excess memory, I would bump the cache to 64K. I think there is a tsk option to show how much available memory there is, maybe ‘tsk freemem’. If you run this after your system is up, you can find out.
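In your sys.init that would just mean changing the cache line, assuming the machine has the memory to spare:

" was: mount cache d=3 s=16k
mount cache d=3 s=64k
mount xcache s=10k
mount bmcache d=3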