Help with debugging a system lockup

cscott · May 8, 2006, 12:10pm

I’m in need of some ideas on how to debug a QNX 6.3.0 system. The system will occasionally (it’s intermittent, and I can’t reproduce it on demand) appear to lock up. By that I mean it will not accept network connections: none of the normal services like ftp respond, and it no longer answers a ping. The Photon session doesn’t respond to any input either: the mouse pointer no longer moves and keyboard actions apparently have no effect. Does anyone have an idea how I can tell which processes are running, and possibly taking up all of the processor’s time, during the apparent lockup? I’m assuming there is a fairly high-priority process that is consuming all available resources and not releasing them.

The system we are running on is a fairly typical PC, nothing too special hardware-wise: SCSI controller, Intel P3…

Thanks for any help or ideas,
Scott

mario · May 8, 2006, 12:31pm

Does the Num Lock LED on the keyboard still toggle? If it doesn’t, the system is probably stuck in an ISR. As for using up all of a resource (memory), that should not prevent the mouse from responding.

You could try running qconn or a shell at the highest possible priority (where is QNX4 ditto when you need it?).
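A minimal sketch of the "run it at a higher priority" idea, using only POSIX calls. The helper names `clamp_priority` and `raise_own_priority` are made up for illustration, and round-robin scheduling is an assumption here, not something the thread prescribes. Raising priority normally needs privilege, so the call may fail with EPERM.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <string.h>

/* Clamp a requested priority into the range the OS allows for `policy`. */
int clamp_priority(int policy, int prio)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);

    if (prio < lo) return lo;
    if (prio > hi) return hi;
    return prio;
}

/* Switch the calling thread to round-robin scheduling at `prio`.
 * Returns 0 on success or an errno value (typically EPERM when the
 * caller lacks the privilege to raise its priority). */
int raise_own_priority(int prio)
{
    struct sched_param sp;

    memset(&sp, 0, sizeof sp);
    sp.sched_priority = clamp_priority(SCHED_RR, prio);
    return pthread_setschedparam(pthread_self(), SCHED_RR, &sp);
}
```

A debugging helper could call `raise_own_priority()` early in `main()` and then start a shell with `execvp()`; whether the shell keeps the raised priority depends on the OS's inheritance rules, so check that on the target before relying on it.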

cscott · May 8, 2006, 1:37pm

Cool, thanks for the info. I started qconn up at priority 34, and child processes will be started at priority 33. Do you know if I need to launch io-net at a higher priority, or will qconn still work even though most of the other network services don’t respond?

mario · May 8, 2006, 3:08pm

I’m not sure how it will behave.

Note that 6.3 supports priorities up to 255.

rgallen · May 8, 2006, 8:29pm

That depends on what priority the “offender” is spinning at. If its priority is below the default priorities of the NIC driver’s interrupt handling and the stack’s timer processing, then you’ll be fine (see the man pages for devn-* and npm-*).

Alternatively, if you have access to Adaptive Partitioning (you probably don’t, but for completeness), simply put qconn in a partition of its own and give it a guaranteed 1% of the CPU (no need to twiddle with priorities).

mario · May 9, 2006, 4:10pm

Another way around this, which works most of the time, is to have a dual-CPU or dual-core machine (not the Hyperthreading stuff). If one thread goes nuts, the machine will still be responsive.

Another plus with dual core/CPU is that make -j2 will, depending on the machine, almost halve the compile time!

rgallen · May 10, 2006, 2:43am

Yeah, trying to demo how Adaptive Partitioning prevents livelock on multi-core is very interesting (interesting, since creating the livelock condition is difficult). Instead of creating a single while(1) at high priority, you have to query the syspage for the number of CPUs and spawn that many hog threads. Whereas it appears simple to create livelock on a uni-core, it looks quite difficult on a multi-core.
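A minimal sketch of the "one hog per CPU" approach, assuming a C11 compiler with POSIX threads. The names `cpu_count`, `hog`, and `run_hogs` are made up for illustration. Unlike a real livelock demo, the hogs here run at default priority and stop after a time limit, so the machine stays usable. On QNX Neutrino the CPU count is read from the syspage; elsewhere `sysconf()` stands in for it.

```c
#ifndef __QNXNTO__
#define _POSIX_C_SOURCE 200809L
#endif
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#ifdef __QNXNTO__
#include <sys/syspage.h>
#endif

static atomic_bool stop_hogs;

/* Number of processors: the syspage on QNX Neutrino, sysconf() elsewhere. */
long cpu_count(void)
{
#ifdef __QNXNTO__
    return (long)_syspage_ptr->num_cpu;
#else
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
#endif
}

/* Each hog is the while(1) from the post, plus an exit condition. */
static void *hog(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop_hogs)) {
        /* spin */
    }
    return NULL;
}

/* Spawn one hog per CPU, let them spin for `ms` milliseconds, then stop
 * and join them. Returns the number of hog threads actually started. */
long run_hogs(long ms)
{
    long n = cpu_count();
    long started = 0;
    pthread_t *tids = malloc((size_t)n * sizeof *tids);
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };

    if (tids == NULL) return 0;
    atomic_store(&stop_hogs, false);
    for (long i = 0; i < n; i++) {
        if (pthread_create(&tids[started], NULL, hog, NULL) != 0) break;
        started++;
    }
    nanosleep(&ts, NULL);
    atomic_store(&stop_hogs, true);
    for (long i = 0; i < started; i++) pthread_join(tids[i], NULL);
    free(tids);
    return started;
}
```

Calling `run_hogs(5000)` keeps every core busy for about five seconds. To actually starve the rest of the system, the hogs would also need a priority above everything else, which is exactly what makes the real demo risky to run.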

maschoen · May 21, 2006, 1:00am

Except for the problem of running Photon, there is the possibility of using the kernel debugger. You could add a 2nd video card, and run Photon on the alternate display. Then you should be able to break into the debugger unless the hardware is hosed. Looking at where you are when you enter the debugger could help you deduce what the problem is.