System hangs and I have to restart

slipbull · April 23, 2010, 10:25am

Hi folks,

I have 4 QNX6 nodes on a Qnet network. Nodes 1 and 2 run MySQL in a master-slave replication setup, and node 2 also runs the Apache web server. The other nodes run a few other apps, nothing special. All hardware is presumably supported and works correctly, at least initially. All nodes are regular PCs with 2 to 5 Ethernet and serial PCI cards.

Suddenly, node 1 becomes unresponsive. The system monitor shows intensive CPU usage, the mouse moves very slowly, and the system is very laggy. After a while, it stops responding altogether.

If I connect to node 1 from the other nodes, they become unresponsive too! Just listing /net/node1 makes the system laggy.

Output of ‘hogs’: Usually the system doesn’t have much workload, so procnto uses around 90%. Before node 1 hangs, hogs shows that SYSTEM time used by the kernel procnto is over 200%.

Node 1 is where the MySQL master server is. We are about to make a checklist to test everything we can, so any ideas will be helpful.

Thank you all

mario · April 23, 2010, 11:38am

If possible, use the system profiler while this is happening. It should help you figure out exactly what is going on.

maschoen · April 25, 2010, 5:24pm

Use adaptive partitioning. Segregate suspicious processes into separate partitions. The time in procnto could be either a lot of message passing or a hardware problem where an interrupt is never reset. The former is more in line with node 1 dragging down the other nodes, although I can imagine a scenario where you are spewing out Ethernet packets, which slows down everyone.
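
To illustrate why segregating a suspect process helps, here is a toy Python model of budgeted CPU sharing. It is only a sketch of the general idea (each partition is guaranteed its budget, and unused time is handed to partitions that still want more); it is not the QNX scheduler's actual algorithm. The function name `allocate_cpu`, the partition names "mysql" and "suspect", and all the percentages are made up for the example, and sharing spare time in proportion to budget is a simplifying assumption.

```python
def allocate_cpu(budgets, demands):
    """Toy model of budgeted CPU sharing (NOT the real QNX scheduler).

    budgets: partition name -> guaranteed CPU percentage (must sum to 100)
    demands: partition name -> CPU percentage the partition would like to use
    Returns: partition name -> CPU percentage the partition actually gets
    """
    if abs(sum(budgets.values()) - 100.0) > 1e-9:
        raise ValueError("budgets must sum to 100 percent")

    # Step 1: every partition gets what it asks for, up to its guaranteed budget.
    alloc = {name: min(float(budgets[name]), float(demands[name]))
             for name in budgets}
    spare = 100.0 - sum(alloc.values())

    # Step 2: hand unused time to partitions that still want more.
    # Assumption for this sketch: spare time is split in proportion to budget.
    while spare > 1e-9:
        hungry = [n for n in budgets if demands[n] - alloc[n] > 1e-9]
        weight = sum(budgets[n] for n in hungry)
        if not hungry or weight <= 0:
            break
        handed_out = 0.0
        for n in hungry:
            extra = min(spare * budgets[n] / weight, demands[n] - alloc[n])
            alloc[n] += extra
            handed_out += extra
        spare -= handed_out
    return alloc


if __name__ == "__main__":
    # Hypothetical budgets: the suspect process is confined to 20 percent.
    budgets = {"System": 50, "mysql": 30, "suspect": 20}

    # Runaway suspect while the rest of the system is lightly loaded.
    print(allocate_cpu(budgets, {"System": 10, "mysql": 25, "suspect": 100}))

    # Everything busy at once: nobody can be pushed below its budget.
    print(allocate_cpu(budgets, {"System": 100, "mysql": 100, "suspect": 100}))
```

In this model a runaway process can soak up idle time, but it cannot push the other partitions below their budgets, so the node should stay responsive enough to investigate. It also narrows down the culprit: whichever partition is pinned at its limit is the one to look at.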