Crashing QNX 6.3, need help diagnosing the problem

OK, I’ve posted before about what generally I’m doing for my company. The goal was to port our code from QNX 6.1 to QNX 6.3 and at the same time move from GCC 2.95.3 to GCC 3.3.1.

So far I’ve ported the code from QNX 6.1 to QNX 6.3 using GCC 2.95.3 and everything is up and running smoothly.

Recently I finished the port to the GCC 3.3.1 compiler (more involved than expected).

However, when I now run the software I experience a complete lockup of the O/S. When I say lockup, I mean a totally unresponsive system, including Caps Lock. In addition, I spawned a high priority console running at prio 60 and it doesn't get any CPU to stay responsive. The only thing I can do is reboot the QNX board.

Hardware-wise, we have a card cage. Inside the card cage is a board running QNX, plus a Comms board that talks to a Windows GUI host and 4 stepper cards that talk to the hardware on our instrument.

When I launch the software it comes up and starts spawning off the threads (there are 40 in all, about 10 of which talk to hardware). Right around the time I hear the beep telling me we've established communication with the hardware, the lockup occurs.

In the code we are using InterruptAttachEvent to get interrupt events from the cards. The code uses shared memory to map the I/O space on the cards in order to read/write the various cards' registers, etc. Nothing unusual here.
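
For reference, the general shape of what that mapping code does is something like this (a minimal sketch with placeholder addresses, not our actual code):

#include <sys/neutrino.h>
#include <sys/mman.h>
#include <stdint.h>
#include <stdio.h>

#define CARD_PHYS_ADDR 0xD0000000u   /* placeholder physical base of the card */
#define CARD_REG_SIZE  0x1000u       /* placeholder size of its register window */

int map_card_example(void)
{
    ThreadCtl(_NTO_TCTL_IO, 0);      /* I/O privileges (we run as root) */

    void *mem = mmap_device_memory(NULL, CARD_REG_SIZE,
                                   PROT_READ | PROT_WRITE | PROT_NOCACHE,
                                   0, CARD_PHYS_ADDR);
    if (mem == MAP_FAILED)
    {
        perror("mmap_device_memory");
        return -1;
    }

    volatile uint32_t *regs = (volatile uint32_t *)mem;
    uint32_t status = regs[1];       /* read a (placeholder) status register */
    regs[2] = 0x1;                   /* write a (placeholder) control register */
    printf("status = 0x%08x\n", (unsigned)status);
    return 0;
}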

So the big question is: what could cause a total system lockup so bad that, after rebooting the board, I have to run chkfsys in order to get the filesystem back into a usable state because all our log files are corrupted (they were likely mid-write when death occurred)? In addition there are no core files generated by the O/S, or at least no time to write any before death occurs.

I think this kind of death can only be caused by 3 things:

  1. Total corruption of the microkernel (we are running as root to access hardware)
  2. Some interrupt on a card becoming so busy that the system is “hung” trying to service all the interrupts.
  3. Some interrupt not getting cleared properly so the O/S is “hung” waiting for the interrupt to be cleared.

Since the EXACT same code compiles and runs under 2.95.3 (minus some minor compiler differences) I assume the problem must be #1.

My question is, how can I track this down and figure out where I'm dying, because there is a LOT of code running in a lot of threads at the time of death? Are there any tools available to help with this? I'm trying things like writing to stdout, but of course the crash occurs after a variable amount of time (10 seconds to 1-2 minutes), so the stdout isn't always the same each time, and because of the untimely death not everything gets written to a file before it occurs.

Thoughts appreciated,

Tim

I really doubt it's #1. In order to corrupt microkernel code you would need to mmap that space and write to it, or the hardware would have to DMA into kernel memory.

More likely it's #2. It can't be #3, because if you are using InterruptAttachEvent the interrupt is masked until you unmask it.

I would proceed by removing the InterruptAttachEvent calls and then adding them back one at a time, to see where it crashes.

If you have a parallel port, write a value to it at different locations in your program. When the machine freezes, read the bits on the parallel port (with a meter); that will tell you where it crashed. I prefer to use a diagnostic card (port 80) instead of the parallel port.
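
A rough sketch of that checkpoint trick (the LPT1 base address 0x378 is an assumption for your board):

#include <sys/neutrino.h>
#include <sys/mman.h>
#include <hw/inout.h>
#include <stdint.h>

static uintptr_t lpt_base;

void checkpoint_init(void)
{
    ThreadCtl(_NTO_TCTL_IO, 0);              /* gain I/O privileges */
    lpt_base = mmap_device_io(4, 0x378);     /* map the LPT1 data/status/control ports */
}

void checkpoint(uint8_t marker)
{
    out8(lpt_base, marker);                  /* the value stays latched on the data pins */
}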

Mario,

I’m in the process of pulling out the InterruptAttachEvents.

In the meantime I tried one more thing that I found interesting.

I was running ‘pidin’ at priority 60 and monitoring the output as I mentioned in my 1st post, and when the lockup occurred I stopped getting any output from pidin.

Today as part of the debugging process I started ‘hogs’ at priority 60 on a 3rd console.

This time when the lockup occurred I was stunned to see Hogs continued to run. Going back to the pidin console I found it hung (no new output), and all other consoles and everything else on the system were dead. But Hogs continued to run every second and update me with output.

What I saw was our app using 97% of the CPU at this point and everything else running at 0%.

So I have some questions:

  1. Clearly the O/S is alive and well underneath everything. But it appears that even though our program runs at no higher than priority 40, there are no CPU cycles available for anything else to run. This clearly points to an unserviced interrupt or an interrupt running wild (even though this does not happen with the 2.95.3 code). Does this seem reasonable?

  2. Hogs keeps running. I can only assume this program does so because it’s using DMA access to the video card instead of interrupts to the vid card? Anyone know how Hogs really works?

  3. Our app is threaded, not process driven. Hogs only reports process CPU. Is there any way to make it report thread CPU?

Thanks,

Tim

AUGGHGHGHGHGHGHGGG!!! This stupid forum logged me out in the middle of typing a LONG LONG response on a crappy laptop on location in Australia late at night. Now I have to type it all again. :frowning: :frowning: :frowning:

I forget most of what I typed so this is going to be a MUCH abbreviated version.

After 2 days of banging my head against the wall on this with help from some Engineers here, this is what I learned.

A) The crash occurs somewhere between 10 seconds and several minutes into starting up. The 10 second cases are important because at that time only a few threads have started and most of the hardware ones have not.

B) After finding QNX was alive thanks to Hogs (and no thanks to pidin, which was unresponsive), I wrote a 10 line program whose only job was to sleep for 10 seconds at prio 60, wake up and drop a SIGKILL on my app (a sketch of this helper appears after this list). I was surprised to find this actually worked (since pidin was unresponsive at prio 60), and Hogs showed my app gone. Then I found io-net was running READY and consuming all the CPU. One of the hardware engineers confirmed that the Comms card my app talks to shares an interrupt with the ethernet card. So I modified my 10 line program to SIGKILL io-net too, and lo and behold the system returned to normal and was responsive.

C) I pored over the source code to find 2 calls to InterruptAttachEvent and 1 call to InterruptAttach. The InterruptAttachEvent calls are spun off into 8 threads (4 for each call) and the InterruptAttach is spun off into 2 threads.

D) While adding debugging code to help find the “hung” state, I ended up discovering that the likely problem appears to be heap corruption (not stack). A few times I got genuine SIGSEGV deaths with core files; gdb revealed corrupted heap pointers that, according to our debug log files, had been legitimate values only a few milliseconds and a few subroutines earlier. So it appears something someplace is trashing memory, and this sometimes leads to a “hung” state if it happens in a thread that monitors hardware (or if a hardware thread is the culprit, which I think is the case).

E) Most importantly, I found that my 10 line program can ONLY kill my app if it drops a SIGKILL on it. I assume that's because the kernel knows the app can't catch SIGKILL, so it cleans it up right away and doesn't even actually deliver the signal. If I try SIGABRT, my app keeps on running. I suspect the SIGABRT gets delivered to the app in the “hung” state, but because some interrupt is going crazy it can't even clean itself up or be properly torn down. It seems entirely possible that many of the “hung” deaths are generating SIGSEGVs that can't produce core files or let the app terminate, because of the interrupts.
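
Something along the lines of the 10 line helper from (B) is sketched here (the priority handling and argument parsing are assumptions, and the variant that also killed io-net isn't shown):

#include <sched.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    struct sched_param param;
    param.sched_priority = 60;                  /* above everything in the app */
    pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);

    sleep(10);                                  /* let the lockup develop */

    if (argc > 1)
        kill((pid_t)atoi(argv[1]), SIGKILL);    /* SIGKILL needs no delivery;
                                                   the kernel just reaps the app */
    return 0;
}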

So this leads to the following questions:

  1. Is there ANY way to tell QNX to clean up the interrupts for my task in cases of death (SIGSEGV and the like)? I know there are flags to set in InterruptAttach and InterruptAttachEvent, but which ones do I need to set, and do they work so that QNX will clean up the interrupts, letting a core file be generated, the app die, and control be returned to me? I really need this in order to make progress on the memory corruption.

  2. With InterruptAttachEvent the docs say the app must do an InterruptUnmask call. What happens if the app fails to make this call? Does it mean the kernel will never again deliver that interrupt to my app (I think this is what happens) since it's masked off, or does it mean everything is left in a bad state (an unserviced interrupt that keeps firing)? I ask because maybe corruption prevents an unmask from happening in my app. What happens if it's a shared interrupt (with, say, io-net or another thread in my app) and the unmask never gets done?

  3. With InterruptAttach we provide our own ISR that simply reads bytes from a Comms card into a circular buffer. It looks OK in a code inspection, but it's 3 subroutines long and not trivial to follow, and I can't debug it since it's an ISR. I have 2 questions on this. First, there was a coding error I corrected that was passing in the wrong size for the buffer area. Not really a problem since we also wrote the ISR and we didn't exceed the real size allocated (or I think we didn't), but would this affect the kernel in any way? My other question has to do with the ISR space. My understanding is that it executes in the kernel, not the task space. So what happens to pointers and addresses? Under QNX 4 this had to be compiled with stack checking off. Does this still need to be done (it's 3 subroutines long) or not? Would a memory write off the end of the buffer cause a SIGSEGV or just happily write to memory? I suspect this is the problem area since it starts early in the list of threads, but I can't easily prove it.

  4. Are there any memory fence tools available to help with heap corruption by checking heap integrity, either in the kernel or via a user call? Especially for writes off the end of allocated space etc. I've used vxWorks before and they have some nice tools for this kind of stuff. I'm happy to download and compile something in, or turn something on in the O/S, to get this if possible.

TIA,

Tim

Probably a good thing this is the abbreviated version :wink:

About pidin versus hogs. pidin sends messages to processes to gather information. If a process becomes blocked, so will pidin, since it will wait forever for the reply.
As for io-net using all the CPU, it sounds like your hardware is leaving its interrupt active.

What do you mean by spun off?

Yes that is the case.

That's possible, since for a signal to be processed (aside from SIGKILL) the process must get CPU time.

You could set up a SIGSEGV handler to clean up your interrupt.
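
A minimal sketch of that idea (assuming the interrupt id is reachable globally; whether every call here is strictly safe inside a signal handler is something to verify against the docs):

#include <sys/neutrino.h>
#include <signal.h>

extern int g_intr_id;            /* hypothetical: id returned by InterruptAttachEvent() */

static void segv_handler(int signo)
{
    InterruptDetach(g_intr_id);  /* release the interrupt level before dying */

    signal(signo, SIG_DFL);      /* restore default action and re-raise so a core
                                    file can still be generated */
    raise(signo);
}

void install_segv_cleanup(void)
{
    struct sigaction sa;
    sa.sa_handler = segv_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGSEGV, &sa, NULL);
}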

There will never be another interrupt from that level (so io-net never gets interrupts either).

With the help of a parallel port or a diagnostic card it's possible. Also, I think with the help of the instrumented kernel it's possible to log data in an ISR (not sure).

A sigsegv in an ISR will result in a kernel crash.

For InterruptAttach yes, but only the stack is in the kernel space.

To my knowledge gcc doesn’t do stack checking (not sure)

It will SIGSEGV if you are writing outside the memory of your process.

QNX6 comes with a debug memory allocation library, malloc_g; it's all documented. Note that it can't prevent overruns or underruns, it just tries to detect them after the fact.
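
This is not the malloc_g API, just a hand-rolled C++ illustration of the same after-the-fact idea (all names here are made up): pad each allocation with a known canary pattern and check it later.

#include <cstdlib>
#include <cstring>

static const size_t GUARD = 16;
static const unsigned char PATTERN = 0xA5;

void *guarded_alloc(size_t n)
{
    unsigned char *p = static_cast<unsigned char *>(std::malloc(n + GUARD));
    if (p)
        std::memset(p + n, PATTERN, GUARD);     // canary bytes after the user area
    return p;
}

bool guard_intact(const void *p, size_t n)
{
    const unsigned char *g = static_cast<const unsigned char *>(p) + n;
    for (size_t i = 0; i < GUARD; ++i)
        if (g[i] != PATTERN)
            return false;                       // something wrote past the end
    return true;
}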

You should be able to turn your ISR/InterruptAttach() into a Thread/InterruptAttachEvent(), so you can debug it. After you're sure everything is fine, you can switch back.
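
A sketch of that conversion (details assumed; former_isr_body() stands in for the existing ISR buffer-copy code):

#include <sys/neutrino.h>
#include <pthread.h>

extern void former_isr_body(void);   /* hypothetical: the old ISR logic as plain code */

void *comms_intr_thread(void *arg)
{
    int irq = *(int *)arg;
    struct sigevent ev;
    SIGEV_INTR_INIT(&ev);

    ThreadCtl(_NTO_TCTL_IO, 0);                               /* I/O privileges */
    int id = InterruptAttachEvent(irq, &ev, _NTO_INTR_FLAGS_TRK_MSK);

    for (;;) {
        InterruptWait(0, NULL);      /* the interrupt arrives masked */
        former_isr_body();           /* now runs in process context, so gdb can reach it */
        InterruptUnmask(irq, id);
    }
    return NULL;
}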

A few more days on the problem and I’m somewhat closer to a solution but I haven’t fully solved it yet.

What I have done is to comment out all the hardware attach stuff so that our program doesn’t talk to any hardware at all now. I also lowered the priorities of all the threads in the system to < 10 in an attempt to regain control of QNX once we spin out of control.

So far I have taken 38 of the 40 threads to less than priority 10 and I am still able to get the lockup problem. So that indicates that one of the remaining 2 threads is the problem.

The real question is that, when I look at the code, I can't see anything obvious that should cause the thread to spin and use CPU as opposed to simply blocking and never returning (a different problem, but at least then I'd have control). Here's the basic code running in the 2 threads (most of the original work is commented out to reduce the amount of code each thread runs).

This is the original hardware interrupt thread with the calls to the interrupts commented out. It now simply loops in a delay of 100 ms forever.

void XCommsUsbPort::RunInterrupt(void)
{
    pthread_cleanup_push(CleanupInterruptThreadFunction, this);

    struct sigevent event;

    cout << "TDS: XCommsUsbPort::RunInterrupt Attaching to interrupt " << m_InterruptNumber << endl;
    SIGEV_INTR_INIT(&event);

    /* m_InterruptID = InterruptAttachEvent(m_InterruptNumber, &event, _NTO_INTR_FLAGS_TRK_MSK); */

    m_InterruptThreadRunning = true;

    sleep(1);

    m_Mutex.lock();

    // circle in the loop until told not to
    while (m_ThreadsOkToRun)
    {
        /* TDS int maskCount = InterruptUnmask(m_InterruptNumber, m_InterruptID); */
        m_Mutex.unlock();

        // now wait for the interrupt
        int retVal;
        do
        {
            if (TimerTimeout(   CLOCK_REALTIME,
                                _NTO_TIMEOUT_INTR,
                                &m_wdogEvent,
                                Watchdog::registry->wdogTimeout(),
                                NULL) == -1)
            {
                DEBUG_TEXT("TimerTimeout failed");
            }

            // now wait for the interrupt
            // TDS retVal = InterruptWait_r(0, 0);
            delay(100);
            retVal = ETIMEDOUT;

            pthread_testcancel();

            // pat the watchdog
            Watchdog::registry->threadAlive(m_pPatThisUsb);
        } while (retVal == ETIMEDOUT);

        // SNIP: rest of loop body no longer reachable (presumably re-locks m_Mutex before looping)
    }

    m_Mutex.unlock();

    pthread_cleanup_pop(1);
}

Not much going on in the above code other than the loop and the delay(100). The external calls do the following:

inline void threadAlive(volatile unsigned *patThis) { atomic_set(patThis, 1); }
inline uint64_t *wdogTimeout(void) { return &m_patTime; }

The watchdog thread meanwhile does the following:

It waits in a receiveMessage call for a timer that it sets to go off every 1.5 seconds. This code normally works. Here is where the receive code lies. It's long and complex because it's used generally by all the threads to receive messages of various kinds. Eventually it calls the handleTimerExpiry code, which then calls directly into the watchdog code pasted below.

// TT6916
#if GCC_VERSION >= 30301
void
Thread::
receiveMessage(
Msg::Id& id,
void** msgPointer,
Uint32& byteSize,
bool& shutdown
)
#else
void
Thread::
receiveMessage(
Msg::Id& id,
void*& msgPointer,
Uint32& byteSize,
bool& shutdown
)
#endif
{
int rcvid;
struct _pulse msg;
bool msgReceived = false;

while (!msgReceived)
{
	if (m_timeOfLastMessageReceipt != 0)
	{
		uint64_t executionTime
			= g_HardwareTime.getRelTimeMilliseconds()
				- m_timeOfLastMessageReceipt;
		if
			(executionTime > 100)
		{
			DebugReporter::Instance()->Comment("SLOW_THREAD " + m_name
											   + " MsgId: "
											   + image(m_lastReceivedMsgId)
                                               + " "
											   + image(executionTime));
		}
	}
	
	m_lastReceivedMsgId = 0;

    // pat the watchdog
    Watchdog::registry->threadAlive(m_pPatThis);
    // Don't break out of this loop if we have forced a timeout of the blocking call.
    do
    {
        if (TimerTimeout(   CLOCK_REALTIME,
                            _NTO_TIMEOUT_RECEIVE,
                            &m_wdogEvent,
                            Watchdog::registry->wdogTimeout(),
                            NULL) == -1)
        {
            Error::report(Fatal, THREAD_ERROR_TIMER_TIMEOUT_FAILED, "Errno = " + Error::errnoImage());
        }
        rcvid = MsgReceive_r(m_channelId, &msg, sizeof(msg), NULL);

        // pat the watchdog
        Watchdog::registry->threadAlive(m_pPatThis);
    }while (rcvid == -ETIMEDOUT);

	m_timeOfLastMessageReceipt = g_HardwareTime.getRelTimeMilliseconds();

    // Check if we got a pulse
    if (rcvid == 0)
	{
		if (msg.code == PULSE_CODE_SHUTDOWN)
		{
			// Set the flag that says we got a shutdown pulse.
			//
			shutdown = true;
			msgReceived = true;
		}
		else if (msg.code == PULSE_CODE_MESSAGE)
		{
			// Pull the message out of the pulse
			Message* message = (Message* ) msg.value.sival_ptr;

			uint64_t deliveryTime = g_HardwareTime.getRelTimeMilliseconds() -
				message->m_timeOfSend;

			if (deliveryTime > 100)
			{
				DebugReporter::Instance()->Comment("SLOW_MESSAGE " + m_name + ": DeliveryTime: "
												   + image(deliveryTime) +
												   " MsgId: " + image(message->m_id));
			}

			id = message->m_id;

        // TT6916
        #if GCC_VERSION >= 30301
			*msgPointer = message->m_msgData;
        #else
			msgPointer = message->m_msgData;
        #endif
			byteSize = message->m_msgSize;
			shutdown = false;
			msgReceived = true;

			m_lastReceivedMsgId = id;
			
			// Delete the Message
			delete message;
		}
		else if (msg.code == PULSE_CODE_TIMER)
		{

			// Get the sequence number out of the pulse.
			//
			Uint32 sequenceNumber = (Uint32)msg.value.sival_ptr;

			// Lock the mutex that protects the timerQueue and the
			// expectedSequenceNumber.
			//
			m_timerMutex.lock();

			// Ensure the timerQueue is not empty, and that the sequence number
			// in the pulse matches that we are expecting.
			//
			if (!m_timerQueue.empty() &&
				(sequenceNumber == m_expectedSequenceNumber))
			{
				// Remove the timer from the timerQueue.
				//
				TimerHandler* handler
					= m_timerQueue.begin()->second;
				m_timerQueue.erase(m_timerQueue.begin());

				// This timer is no-longer pending.
				//
				handler->m_isPending = false;

				// Reset the single timer used by this thread to expire
				// when the next timer in the queue should expire.
				//
				if (!m_timerQueue.empty())
				{
					resetTimer();
				}

				timespec requestedExpiryTime = handler->m_absoluteExpiryTime;

				// Unlock the timerMutex before calling the expiry handler.
				//
				m_timerMutex.unlock();

				// Workout how late the timer has expired.
				//
				timespec deliveryTime;
				clock_gettime(CLOCK_REALTIME, &deliveryTime);
				deliveryTime = deliveryTime - requestedExpiryTime;

				if ((deliveryTime.tv_sec >= 1) ||
					((deliveryTime.tv_sec == 0) &&
					 (deliveryTime.tv_nsec > 100000000)))
				{
					DebugReporter::Instance()->Comment("SLOW_TIMER " + m_name + " " +
													   typeid(*handler).name() +
													   ": DeliveryTime: " +
													   image(deliveryTime));
				}

				// Then call the handler
				//
				handler->handleTimerExpiry();
			}
			else
			{
				// This pulse is out-of-date so just unlock the timerMutex.
				//
				m_timerMutex.unlock();
			}
		}
		else
		{
			Error::report(Fatal, THREAD_ERROR_RECEIVED_INVALID_PULSE,
						  "Pulse code = " + image(msg.code) +
						  "Errno = " + Error::errnoImage());
		}
	}
    else
	{
		Error::report(Fatal, THREAD_ERROR_RECEIVED_INVALID_MESSAGE,
					  "Receive ID = " + image(rcvid) +
					  "Errno = " + Error::errnoImage());
	}
}

}

void WatchdogTimer::handleTimerExpiry(void)
{
    static int count = m_frequency;

    // Set the timer again to make it periodic.
    thread->setTimer(this, m_period); 

	std::cout << "TDS: WatchdogTimer::handleTimerExpiry() timer expired on count " << count << " reset to time " << m_period.tv_sec << " " << m_period.tv_nsec << std::endl;

    if(--count == 0)
    {
        count = m_frequency;

// TDS Note this call is a simple loop with more printouts that the code is not hung up inside.

        if(!registry->checkAllRegisteredThreadsOk())
        {
            // A registered thread has failed - we want to reset!
            
			std::cout << "TDS: Watchdog timeout. We need to do a reset" << std::endl;
            if (m_callbackHardware)
            {
                // TT6916
				std::string message = "Watchdog detected inactive thread.";
                DebugReporter::Instance()->Comment(message);
                //copy to stderr in case it was the debug reporter thread
				// TT6916
				std::cerr << message << std::endl;
				std::cout << "TDS: Watchdog timeout. Going to shutdown spm" << std::endl;
				sleep(1);
                raise(SIGABRT);  // Raise a signal and let the handler flush the debug reporter
            }
        }
    }
	else
		std::cout << "TDS: Watchdog timeout. Not time to check threads yet" << std::endl;

}   // WatchdogTimer::handleTimerExpiry

In normal running mode the TDS lines print out regularly every 1.5 seconds.

When we lock up, I stop getting any printouts to the screen, despite this thread running at priority 60. At the same time, CPU usage goes to 97% as shown by hogs. Clearly the OS isn't dead, just endlessly busy.

So something in the above bit of code is spinning endlessly, using 97% of the CPU according to hogs.

None of the above code looks like it could possibly spin like that. I assume waiting for mutexes or other OS calls would never spin using 100% of the CPU, but would instead cause a thread to become blocked waiting for the resource. Can any of the OS calls made above actually cause a spin of the CPU?

So I’m very very stumped as to what happens to cause 100% of the CPU to be used.

Tim

Some late news on this front.

I dropped the last 2 threads to run under priority 10 (shell) and managed to get a “hung” event.

At this point I was able to run a successful pidin command to see which thread was using the CPU. What it revealed was that the watchdog thread (2nd code section above) was the one running READY. It was getting 100% of the CPU. No other thread was raised up to its priority, so it clearly wasn't waiting on a mutex of some kind.

I ended up dropping a SIGABRT on the entire process to have it shut down with a core file for debugging. What was more amazing was that at that point the hard looping transferred to the initial thread that starts everything up. It began to run READY at priority 10 (shell?) but never finished actually creating a dump file or tearing down any of the threads. We have a simple signal handler installed to catch SIGABRT, and I have cout code there to display when I reach that handler (which works to catch CTRL-C from the shell, for example, in normal running mode), yet it never prints. So somehow even the signal handler can't run properly at this point. All I can do is drop SIGKILL on the task to get rid of it.

Anyone know why this might occur? Could the atomic operations be failing or locking up at some point? I have not seen anything in the docs that mentions whether the atomic operations handle priority inversion issues like normal mutexes do, so maybe at some point the watchdog ends up running READY waiting for an atomic value that another thread is modifying but can never finish modifying, because that thread is no longer getting any CPU at its lower priority?

Tim

OK. I’ve isolated it down to 1 line:

rcvid = MsgReceive_r(m_channelId, &msg, sizeof(msg), NULL);

This call NEVER returns when I reach the “hung” state (I added cout calls before and after this line and I only get the before printout when it hangs). It spins the CPU to 100% and can’t recover.

I took a look at the docs for this call and they say that if the sending thread/process has a problem with its sending buffers, the kernel only detects this at wakeup time, and instead of waking up and unblocking, it sends an error message to the sending task/thread.

I'm pretty sure this is what happens. Except in this case the thread that's spinning is sending nothing more than an empty pulse to itself (so I'm not sure how the sending buffers can be corrupt) when a timer expires (the watchdog sends itself a timer pulse every 1.5 seconds). I've looked at the code over and over and can't see the problem. So if the kernel is trying to send an error to the sending task/thread it's screwed, since it's the same task that receives the pulse! Even dropping a signal on it won't tear down the thread.

One thing that does bother me, though: the docs don't mention whether the error case (the call not unblocking) works on a per-thread or a per-process basis. What I mean is, I have 30 threads running in 1 process. So let's say thread 2 sends to thread 3 and the corruption occurs in thread 2's sending buffer. Meanwhile thread 1 does a MsgReceive_r. Does the kernel notice at this point that there is a problem and refuse to unblock thread 1 (all part of the same process), or would it only affect thread 3? I ask because when I reach the hung state with the lowered thread priorities and do a pidin, I see MANY of the threads in the system running READY. This obviously isn't right, but it would make sense if every MsgReceive_r was trying to unblock and being prevented by the kernel because the thread that originally sent the bad message has not handled the error sent to it. Then the watchdog would eventually reach its own MsgReceive_r and run READY forever, because it's the highest priority thread in our app. Does this make sense, especially since we do lots of real message passing between threads where maybe there is a problem with sending buffers?

I’m thinking a call to QNX support for help is going to be on the horizon for this if no one knows the answer.

Tim

Gut feeling tells me you are on the wrong track. If the call doesn’t unblock then it should not use CPU time.

Now that the machine isn’t hanging you could use a debugger.

It would probably be best to post your code as an attachment; I find it too messy to read here.

When debugging threads, cout (or printf) can yield strange results because output may not print in sequence. On a side note, in the real-time world I rarely see usage of cout. If you look at the assembly code of calls to cout you'll understand why: it generates way more code than a printf call. Last time I checked, the single line cout << "abc " << 1 << "def " << variable << endl was generating 5 subroutine calls compared to just 1 for a printf.

I thought I read somewhere that printf is thread safe but usage of a file handle (FILE*) is not. Thus, because printf uses stdout, there could be issues with that. I could be confusing this with QNX 4 or an older version of QNX 6.
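
For what it's worth, a generic POSIX sketch of serializing a multi-part write so lines from different threads don't interleave mid-line (nothing QNX-specific assumed here):

#include <stdio.h>

void trace_value(const char *tag, int value)
{
    flockfile(stdout);               /* take the FILE lock for stdout */
    printf("%s: ", tag);
    printf("%d\n", value);
    funlockfile(stdout);             /* release it */
}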

Mario,

Your gut feeling is a good one :slight_smile:

I ran and got “hung” events: one on my development server (no hardware) and one on the actual instrument. Posted below is the debugging information I got in the development server case after attaching gdb to the task when it reached the “hung” state.

Here’s what I see using pidin at the moment I attach gdb. Thread 9 (our
watchdog and highest priority thread) has been consuming all the CPU
(3m55s worth) from the time it hung till I attached the debugger.

$ pidin -pspm -l tt
pid tid name STATE start time thread start sutime
376356923 1 ./spm CONDVAR Mar 31 01:37 Mar 31 01:37 0.180
376356923 2 ./spm STOPPED Mar 31 01:37 Mar 31 01:37 0.000
376356923 3 ./spm STOPPED Mar 31 01:37 Mar 31 01:37 0.000
376356923 4 ./spm STOPPED Mar 31 01:37 Mar 31 01:37 0.000
376356923 5 ./spm STOPPED Mar 31 01:37 Mar 31 01:37 0.000
376356923 6 ./spm STOPPED Mar 31 01:37 Mar 31 01:37 0.000
376356923 7 ./spm SEM Mar 31 01:37 Mar 31 01:37 0.000
376356923 8 ./spm STOPPED Mar 31 01:37 Mar 31 01:37 0.001
376356923 9 ./spm STOPPED Mar 31 01:37 Mar 31 01:37 3m55s
376356923 10 ./spm STOPPED Mar 31 01:37 Mar 31 01:37 1.403
376356923 11 ./spm STOPPED Mar 31 01:37 Mar 31 01:37 0.000
376356923 12 ./spm CONDVAR Mar 31 01:37 Mar 31 01:37 0.000
376356923 13 ./spm STOPPED Mar 31 01:37 Mar 31 01:37 0.000
376356923 14 ./spm STOPPED Mar 31 01:37 Mar 31 01:37 0.000
376356923 15 ./spm CONDVAR Mar 31 01:37 Mar 31 01:37 0.000
376356923 16 ./spm CONDVAR Mar 31 01:37 Mar 31 01:37 0.000
376356923 17 ./spm CONDVAR Mar 31 01:37 Mar 31 01:37 0.000
376356923 18 ./spm STOPPED Mar 31 01:37 Mar 31 01:37 0.000
376356923 19 ./spm STOPPED Mar 31 01:37 Mar 31 01:37 0.000
376356923 20 ./spm CONDVAR Mar 31 01:37 Mar 31 01:37 0.000
376356923 21 ./spm CONDVAR Mar 31 01:37 Mar 31 01:37 0.000

Here’s what I see in terms of linked libraries when I start gdb along with
my arguments to gdb.

$ gdb --pid=376356923 --se=/home/txs/projects/QNX-6.3-Dev0001/SPM/spm
GNU gdb 5.2.1qnx-nto
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type “show copying” to see the conditions.
There is absolutely no warranty for GDB. Type “show warranty” for details.
This GDB was configured as “ntox86”…
Attaching to program /home/txs/projects/QNX-6.3-Dev0001/SPM/spm’, process
376356923
Reading symbols from
/usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5…done.
Loaded symbols for /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
Reading symbols from /usr/qnx630/target/qnx6/x86/lib/libm.so.2…done.
Loaded symbols for /usr/qnx630/target/qnx6/x86/lib/libm.so.2
Reading symbols from /usr/qnx630/target/qnx6/x86/lib/libc.so.2…done.
Loaded symbols for /usr/qnx630/target/qnx6/x86/lib/libc.so.2
[Switching to process 376356923]
0xb032d7e9 in SyncCondvarWait_r () from
/usr/qnx630/target/qnx6/x86/lib/libc.so.2

Here’s what I see when I ask gdb about the state of the threads in the
system. Basically everything looks normal for every thread except 8, 9 and
possibly 10 (which may have legitimately been interrupted)…

(gdb) info threads
21 process 376356923 0xb032d7e9 in SyncCondvarWait_r ()
from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
20 process 376356923 0xb032d7e9 in SyncCondvarWait_r ()
from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
19 process 376356923 0xb032c986 in MsgReceive_r ()
from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
18 process 376356923 0xb032c986 in MsgReceive_r ()
from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
17 process 376356923 0xb032d7e9 in SyncCondvarWait_r ()
from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
16 process 376356923 0xb032d7e9 in SyncCondvarWait_r ()
from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
15 process 376356923 0xb032d7e9 in SyncCondvarWait_r ()
from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
14 process 376356923 0xb032c986 in MsgReceive_r ()
from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
13 process 376356923 0xb032c986 in MsgReceive_r ()
from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
12 process 376356923 0xb032d7e9 in SyncCondvarWait_r ()
from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
11 process 376356923 0xb032c986 in MsgReceive_r ()
from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
10 process 376356923 0xb8249aea in std::string::_Rep::_M_refcopy() ()
from /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5

* 9 process 376356923 0xb823150e in std::locale::locale(std::locale const&) ()
    from /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
    8 process 376356923 0xb823150e in std::locale::locale(std::locale const&) ()
    from /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
    7 process 376356923 0xb032db0d in SyncSemWait ()
    from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
    6 process 376356923 0xb032c986 in MsgReceive_r ()
    from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
    5 process 376356923 0xb032c986 in MsgReceive_r ()
    from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
    4 process 376356923 0xb032c986 in MsgReceive_r ()
    from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
    3 process 376356923 0xb032c986 in MsgReceive_r ()
    from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
    2 process 376356923 0xb032c986 in MsgReceive_r ()
    from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
    1 process 376356923 0xb032d7e9 in SyncCondvarWait_r ()
    from /usr/qnx630/target/qnx6/x86/lib/libc.so.2

So stepping to thread 9 and dumping a stack trace gives the following
output. What looks suspicious is the call stack into overflow(),
_M_really_overflow(), _M_convert_to_external etc. It appears something has
overflowed internally someplace at this point. Where/how I have no idea.

(gdb) thread 9
(gdb) where
#0 0xb823150e in std::locale::locale(std::locale const&) ()
from /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
#1 0xb822a920 in std::basic_filebuf<char, std::char_traits
>::_M_convert_to_external(char*, int, int&, int&) ()
from /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
#2 0xb822a871 in std::basic_filebuf<char, std::char_traits
>::_M_really_overflow(int) ()
from /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
#3 0xb822a7ac in std::basic_filebuf<char, std::char_traits
>::overflow(int) ()
from /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
#4 0xb8249456 in std::basic_streambuf<char, std::char_traits
>::xsputn(char const*, int) ()
from /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
#5 0xb822af58 in std::basic_filebuf<char, std::char_traits
>::xsputn(char const*, int) ()
from /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
#6 0xb824450b in std::ostream::write(char const*, int) ()
from /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
#7 0xb8244bac in std::basic_ostream<char, std::char_traits >&
std::operator<< <std::char_traits >(std::basic_ostream<char,
std::char_traits >&, char const*) ()
from /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
#8 0x084fa09a in Watchdog::WatchdogTimer::handleTimerExpiry()
(this=0x86b80c0) at WatchdogTimer.cpp:92
#9 0x084f1ba1 in Thread::receiveMessage(unsigned&, void**, unsigned&,
bool&) (this=0x889c1d8, id=@0x7e06f30, msgPointer=0x7e06f2c,
byteSize=@0x7e06f28, shutdown=@0x7e06f27)
at Thread.cpp:795
#10 0x084f0249 in Thread::defaultMainLoop(void*) (theThread=0x889c1d8)
at Thread.cpp:273
#11 0x084f0008 in Thread::threadMain(void*) (theThread=0x889c1d8) at
Thread.cpp:231

The actual last line of code in our application is at step 8. That line is:

std::cout << "TDS: WatchdogTimer::handleTimerExpiry() timer expired
on count " << count << " reset to time " << m_period.tv_sec << " " <<
m_period.tv_nsec << std::endl;

where count is defined to be: static int count = m_frequency;
where m_period is defined in the class definition to be: timespec m_period;

I tried to dump the values of m_period, but gdb complains about not being able to retrieve the variable name. Not sure why this is. But suffice it to say, a common timespec variable can't be so crazy that cout can't handle it.

(gdb) print count
$1 = 1

So moving on to thread 8 (which is stopped at the exact same point) and
dumping a stack trace gives the following output.

(gdb) thread 8
(gdb) where
#0 0xb823150e in std::locale::locale(std::locale const&) ()
from /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
#1 0xb822a920 in std::basic_filebuf<char, std::char_traits
>::_M_convert_to_external(char*, int, int&, int&) ()
from /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
#2 0xb822a8e3 in std::basic_filebuf<char, std::char_traits
>::_M_really_overflow(int) ()
from /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
#3 0xb822ae20 in std::basic_filebuf<char, std::char_traits
>::sync() ()
from /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
#4 0xb824459f in std::ostream::flush() () from
/usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
#5 0xb8244933 in std::basic_ostream<char, std::char_traits >&
std::flush<char, std::char_traits >
(std::basic_ostream<char, std::char_traits >&) ()
from /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
#6 0xb82448ec in std::basic_ostream<char, std::char_traits >&
std::endl<char, std::char_traits >
(std::basic_ostream<char, std::char_traits >&) ()
from /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
#7 0xb8243453 in std::ostream::operator<<(std::ostream&
()(std::ostream&)) ()
from /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
#8 0x084ac35e in
DebugReporter::DebugOutputStream::outWithEndl(std::string const&) (
this=0x86b4f20, s=@0x8a38ce8) at DebugReporter.cpp:287
#9 0x084aebd1 in DebugReporter::Loop() (this=0x86b4f20) at
DebugReporter.cpp:886
#10 0x084aec88 in DebugReporter::LoopStarter(void
) (pObject=0x86b4f20) at
DebugReporter.cpp:911

The actual last line of code in our application is at step 8. That line is:

m_outputFileStream << s << endl;

where m_outputFileStream is defined to be: std::ofstream m_outputFileStream;
and s is passed into the function as (const std::string& s)

(gdb) print s
$8 = (const string &) @0x8a38ce8: {static npos = 0,
_M_dataplus = {<allocator> = {},
_M_p = 0x8a34cb4 "\"9748\",\"Info\",\"HW_Config:: Short or no apparent
comment for ID 1074254011 in Device187\","},
static _S_empty_rep_storage = {0, 0, 0, 0}}

This value for ‘s’ looks legitimate as we have a statement that builds this
text string elsewhere in the code.

Lastly we come to thread 10, which was likely just interrupted in the middle of its normal operation, as there is nothing suspicious about this stack trace.

(gdb) thread
(gdb) where
#0 0xb8249aea in std::string::_Rep::_M_refcopy() ()
from /usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
#1 0xb8249b7b in std::string::string() () from
/usr/qnx630/target/qnx6/x86/lib/libstdc++.so.5.0.5
#2 0x08291aa9 in Hardware::DeviceBase::RestoreHwConfig() (this=0x89cff70)
at DeviceBase.cpp:1536
#3 0x0828cde0 in
Hardware::DeviceBase::InitialiseDeviceData(Hardware::SEARCH_METHOD) (
this=0x89cff70, method=SEARCH_RECURSIVE) at DeviceBase.cpp:119
#4 0x0829ee19 in
Hardware::DigitalInput::InitialiseDeviceData(Hardware::SEARCH_METHOD) (
this=0x89cff70, method=SEARCH_RECURSIVE) at DigitalInput.cpp:47
#5 0x083769a9 in
Hardware::StepperCardSPI_DigitalInput::InitialiseDeviceData(
Hardware::SEARCH_METHOD) (this=0x89cff70, method=SEARCH_RECURSIVE) at
StepperCardSPI_DigitalInput.cpp:55
#6 0x0828cdbb in
Hardware::DeviceBase::InitialiseDeviceData(Hardware::SEARCH_METHOD) (
this=0x89cfbf0, method=SEARCH_RECURSIVE) at DeviceBase.cpp:114
#7 0x0828cdbb in
Hardware::DeviceBase::InitialiseDeviceData(Hardware::SEARCH_METHOD) (
this=0x89cf880, method=SEARCH_RECURSIVE) at DeviceBase.cpp:114
#8 0x082abae9 in
Hardware::FluidSupplyBase::InitialiseDeviceData(Hardware::SEARCH_METHOD) (
this=0x89cf880, method=SEARCH_RECURSIVE) at FluidSupplyBase.cpp:113
#9 0x082b0db8 in
Hardware::FluidSupplySheath::InitialiseDeviceData(Hardware::SEARCH_METHOD) (
this=0x89cf880, method=SEARCH_RECURSIVE) at FluidSupplySheath.cpp:77
#10 0x0828cdbb in
Hardware::DeviceBase::InitialiseDeviceData(Hardware::SEARCH_METHOD) (
this=0x89ca110, method=SEARCH_RECURSIVE) at DeviceBase.cpp:114
#11 0x0828cdbb in
Hardware::DeviceBase::InitialiseDeviceData(Hardware::SEARCH_METHOD) (
this=0x89b1018, method=SEARCH_RECURSIVE) at DeviceBase.cpp:114
#12 0x08161513 in
Hardware::SPM_Hardware::initialiseSPM_Data(CallBackBase<HW_Message>) (
this=0x89b1018) at SPM_Hardware.cpp:1905
#13 0x08174006 in Hardware::ActionP0Hardware::SPM_Hardware::Run()
(this=0x89e1fbc) at Action.h:181
#14 0x082a1273 in Hardware::DispatchQueue::Loop() (this=0x87da7b0) at
DispatchQueue.cpp:332
#15 0x082a1362 in Hardware::DispatchQueue::LoopStarter(void
)
(pObject=0x87da7b0) at DispatchQueue.cpp:390

The actual last line of code in our application is at step 8. That line is:

string comment;

(gdb) print comment
$9 = {static npos = 0, _M_dataplus = {<allocator> = {},
_M_p = 0x874076c “/home/txs/projects/QNX-6.3-Dev0001//SPM/Hardware/Config/Dev_0223.cfg”},
static _S_empty_rep_storage = {0, 0, 0, 0}}

This appears to have some leftover garbage on the stack, since this line is really just creating storage for a string that's about to be used. Not sure why gdb reports values here…

So that's where I stand. Definitely not in MsgReceive_r(), but obviously hung up on some kind of spin lock issue in the libraries someplace. Especially given that 2 threads are in the EXACT same place at the same time, it looks really suspicious. It's almost as if the lower priority thread got interrupted and can't finish, so it never gives the higher priority thread the spin lock, and we sit here forever never making any progress…

So, looking at our compilation/link flags I have the following (the same ones used for the 2.95.3 compiler that we are moving from). Given that everything works with 2.95.3, I am posting these in case I am missing something needed to handle thread safety in the std libs (_PTHREADS=1 worked in 2.95.3).

CC = qcc -V3.3.1,gcc_ntox86_gpp
CXX = qcc -V3.3.1,gcc_ntox86_gpp
LD = qcc -V3.3.1,gcc_ntox86_gpp

CFLAGS = -lang-c++ -D_PTHREADS=1 -g -Werror -Wno-deprecated
LDFLAGS = -lang-c++ -g -Wl,-Map spm.map -D_PTHREADS=1

So now for my questions - you knew there had to be some ;-)

  1. Am I using the wrong flags/libraries to get threaded support for standard
    library calls in 3.3.1?

  2. If I am using the correct flags/libraries then I have a huge question on
    why 2 threads are spin locking in what looks to be a major error path in the
    handling of simple valid strings. I am wondering how I can get there or figure
    out what’s wrong once I leave my code because I can’t get ANY values in the
    debugger for anything in the libraries (no info locals, no info args etc).

TIA,

Tim

The overflow stuff looks very strange. I do not know what this is, but I would google it. Maybe it's a stack overflow; you might want to increase the stack space.

As for your question, I think you are compiling it right. I didn't know it was necessary to use _PTHREADS=1; it must be C++ specific stuff, and I don't do much C++ ;)

Maybe you could try a different library, just to see if you are getting a different behavior.

Mario,

I tried playing with the stack space and increased it to 128K per thread, 4 megs for the total application and 4K of guard space, using the malloc_g libraries (which I am not 100% sure work with the C++ ‘new’ operator). The same problem happens.

I took a look online at the libstdc++ source code on the gcc website. Sure enough, inside locale::locale() the first line of code attempts to grab a lock. The comments in the code say it will be a global mutex in single threaded mode and a local mutex in multi-threaded mode.

That's clearly where we are spinning. Why we spin instead of block I don't know, but clearly we are not using a thread safe version of this library. I was even able to attach gdb to our task, go to the lower priority thread (thread 8 in the example I posted) and step through the code (even though I lack source). Once I stepped out of the routine in the debugger, everything started functioning again, proving that we were hung waiting for a lock held by a lower priority thread that was never going to get CPU time again.

So I took your other suggestion. I bit the bullet and finished converting the code to compile with the Dinkum libraries (a pile of warnings needed to be taken care of, since Dinkum appears to be more stringently ANSI C++). After hunting around to get the proper libraries linked in (the math library isn't automatically linked in for some reason) I got everything to link.

I've been running now for 4 days. Not a single crash or problem. So it appears there is definitely a thread safety issue with the GCC 3.3.1 (gpp) C++ libraries. Just a word of warning to anyone else planning to make a similar conversion from 2.95.3.
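
The build change was roughly along these lines (a sketch only; the exact qcc variant name for the Dinkum library and the explicit -lm are assumptions to verify against the qcc docs):

CC = qcc -V3.3.1,gcc_ntox86
CXX = qcc -V3.3.1,gcc_ntox86
LD = qcc -V3.3.1,gcc_ntox86

CFLAGS = -lang-c++ -D_PTHREADS=1 -g -Werror -Wno-deprecated
LDFLAGS = -lang-c++ -g -Wl,-Map spm.map -D_PTHREADS=1 -lm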

Tim

Actually that problem also occurred with the 2.95.3 compiler under 6.2.1. I have no 6.3 yet, so I was unable to test my app there.

I’m actually writing in plain C.
The problem was as follows:

I used the thread_pool_* stuff to let the system create and manage the pool of threads for me. All the threads were blocking on pthread_cond_wait(…) and one thread signalled them when to start. The program works fine for some days (!!) but then I can see something consuming all the CPU time. I'm able to kill my app, and when it is killed the CPU is released.
When I use pidin, it shows my app has several threads READY and the rest in the CONDVAR state.

My app handles communication on the network.
There is only one thread that accepts connections, logs info about the client and then adds the new connection to a queue; only after that does it signal the threads (with pthread_cond_signal()) that there is a job to do. At the moment, some threads are active even though no connections are being handled, and even if there is no activity on the network for a long enough period of time (no new connections arrive and old ones are closed), the threads still stay in the READY state.

I'm trying to find the problem but have no clue. My problem seems to be of the same kind described above…
Any suggestions or ideas, or anything that could point me in the right direction towards solving the problem, would be of great help.

Mike,

The first thing I would do is determine which thread is the one consuming 100% of the CPU. You can do that by running pidin with the arguments I listed

pidin -p(your app name) -l tt

This will update over and over every second. Check to see which thread gets the CPU time when they all run ready (if they are all at the same priority there may be several that are endlessly running ready).

After that you can attach gdb as I did and follow the same steps to try and determine where in the code it’s spinning and then maybe determine if it’s a programming error vs a problem in the libraries.

Note that at one time I ported to 6.21 using 2.95.3. I was able to run successfully there using the make options I listed above. The key one for me was the -D_PTHREADS=1 on the compile and link lines to tell the libraries to use the thread safe versions. Make sure you are using that option.

Tim

Thanks Tim!
I would be very happy if you could help me once more. Looking at the pidin output I can see thread 5 is consuming the CPU. So I attached gdb to the running process:

I have no debug version handy at the moment - I'll send a more detailed trace when I get home…
But at first glance: the thread is in a recv() call, waiting for data from the client. I can not figure out why it is not blocked and instead steals all the CPU - there is no data from the client!

Mike,

Sorry for the late reply. I was out of the office last week for a conference.

OK, a quick first question after looking at your output: you are using blocking sockets, right? I took a quick look at recvfrom and recv, and the docs mention that they only block waiting for a message if the sockets are set up as blocking; otherwise those calls fall through, essentially looping and using CPU.

Assuming you're using blocking sockets, I'd try the following once the problem occurs:

  1. Start gdb and get to the point you reached here.
  2. Then step through the functions in the debugger. You need not have source code; just enter ‘s’ and the code will step along nicely from one function call to the next.

See where this takes you in terms of functions and the call tree. In my case, when I hit ‘s’ in the spinning thread the ‘s’ call never returned, because I was stuck waiting in one function. See if that's happening to you, or whether you make other function calls, or unwind the stack and then re-call all the same functions again.

What you are trying to discover here is how big the loop is and if it’s in one subroutine or a loop across several subroutines.

Post back what you find assuming you don’t solve it on your own based on stepping through the debugger.

Tim

Tim,
thanks for the answer - but the thing is that I can do nothing to solve the problem. Here is how it happens:

  1. Of course I use blocking stream (TCP) sockets in my program.
    To not let clients hang around too long, I use setsockopt(… SO_RCVTIMEO …) - that is the core of the problem, as shown later (a minimal sketch of this setup follows this list).
  2. Everything works fine up to a point. When there are enough threads and communication is rather intensive (new sockets are created and destroyed in a short time), the following happens: if one SLOW client shows up, something breaks inside libsocket (I suppose).
    I also have to note: the socket timeout is set to 5 seconds. A SLOW connection is one whose intervals between sending data are around the timeout interval.
    If a client has connected and timed out without sending any data, then recv() returns -1 with errno set to EWOULDBLOCK. Everything is fine.
    But suppose the client has sent something and then fails to send the next portion of data. In that case recv() waits blocked until the timeout expires, and then it hangs in MsgSendv() and never returns anything.
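
A minimal sketch of the socket setup described above (values and error handling assumed, not taken from the actual program):

#include <sys/socket.h>
#include <sys/time.h>
#include <errno.h>

/* Blocking TCP socket with a 5 second receive timeout; distinguishes a
 * timeout (no data yet) from a real error or a closed connection. */
int recv_with_timeout(int sock, char *buf, int len)
{
    struct timeval tv;
    tv.tv_sec = 5;
    tv.tv_usec = 0;
    setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    int n = recv(sock, buf, len, 0);
    if (n == -1 && errno == EWOULDBLOCK)
        return 0;        /* timed out with no data: not fatal */
    return n;            /* >0 data, 0 peer closed, -1 real error */
}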

The interesting thing is that it can happen to only one socket within one process. Others seem to work as expected. But nevertheless I do not like one thread hanging there consuming all the CPU.

I’ve downloaded 6.3 - I’ll give it a try. Hope they have corrected that issue.

I submitted a PR in our database for this, since obviously the 3.3.1 lib should be threadsafe.

If you ever need to contact QNX techsupport about it, it’s PR 24694

Cheers,

Colin