System hangs and I have to restart

Hi folks,

I have four QNX6 nodes in a Qnet network. Nodes 1 and 2 run MySQL in a master-slave replication setup, and node 2 also runs the Apache web server. The other nodes run a few other apps, nothing special. All hardware is presumably supported and works correctly at first. All nodes are regular PCs with 2 to 5 Ethernet and serial PCI cards.

Suddenly, node 1 becomes unresponsive. The system monitor shows heavy CPU usage, the mouse moves very slowly, and the system is very laggy. After a while, it stops responding altogether.

If I connect to node 1 from the other nodes, they become unresponsive too! Just listing /net/node1 makes the system laggy.

Output of ‘hogs’: the system usually has very little workload, so procnto shows around 90%. Before node 1 hangs, hogs shows the SYSTEM time used by the kernel (procnto) at over 200%.

Node 1 is where the MySQL master server is. We are about to make a checklist to test everything we can, so any ideas will be helpful.

Thank you all

If possible, use the system profiler while this is happening. It should help you figure out exactly what is going on.

Use adaptive partitioning and segregate suspicious processes into separate partitions. The time in procnto could come from either a lot of message passing or a hardware problem where an interrupt never resets. The former is more in line with the problem spreading to the other nodes, although I can imagine a scenario where node 1 is spewing out Ethernet packets, which would slow everyone down.
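For the checklist mentioned in the original post, it may help to record when listing /net/node1 starts to slow down, so the timestamps can be lined up with the hogs output or a profiler trace. Below is a minimal sketch, assuming a Python 3 interpreter is available on one of the other nodes (the thread does not say so). The function names, the 0.5 s threshold and the sampling interval are illustrative choices, not anything QNX provides. Note that the listing itself blocks if node 1 has already hung, and the original post reports that listing the directory adds load, so keep the sample rate low.

```python
import os
import time


def time_listing(path):
    """Time one directory listing.

    Returns (elapsed_seconds, entry_count); entry_count is None
    if the listing failed with an OSError (e.g. node not reachable).
    """
    start = time.monotonic()
    try:
        count = len(os.listdir(path))
    except OSError:
        count = None
    return time.monotonic() - start, count


def probe(path, samples=5, interval=1.0, slow_threshold=0.5):
    """Take several timed listings and flag the slow ones.

    slow_threshold is an arbitrary cut-off in seconds (an assumption);
    tune it against what a healthy node normally takes.
    """
    results = []
    for i in range(samples):
        elapsed, count = time_listing(path)
        results.append((elapsed, count, elapsed > slow_threshold))
        if i < samples - 1:
            time.sleep(interval)
    return results


if __name__ == "__main__":
    # /net/node1 is the path named in the original post.
    for elapsed, count, slow in probe("/net/node1"):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        flag = "SLOW" if slow else "ok"
        print(f"{stamp} {elapsed:.3f}s entries={count} {flag}")
```

Running this from a node other than node 1 and redirecting the output to a file gives a rough timeline of when access to node 1 degrades. It does not identify the cause; that still needs the system profiler.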