Chasing a program crash - Stack Corrupted or exhausted?

Hello,

I’m finishing up a port of our application software from QNX 6.1 to QNX 6.3. I’m in the process of running the application and am encountering crashes that I can’t seem to find the source of.

The application itself consists of 40 threads, most of which are hardware and idle 99% of the time. The other threads handle user requests, schedule hardware movements, hardware updates from sensors etc.

What I’ve found is that the application can run anywhere from 2 minutes to 30 minutes before a crash occurs. The crashes happen in different threads but never in hardware threads. 99% of the crashes appear to be related to freeing or allocating RAM whether from the heap or internally in the stdlib calls.

There is a LOT of code to visually inspect and since everything is happening in real time with that many threads it’s impossible to attach a debugger and get meanining results. On top of it all, everything runs fine in 6.1 with no crashes. I’ve obtained oodles of core files and our own log files in an attempt to figure out what’s wrong but I have had no luck so I am going to ask for some help here to see if anyone has any idea’s for me to try:

Here’s a GDB dump of a core file crash I just got. The crash occurred where the code was trying to send a simple debug text string to a log file. At step 9 when the crash is detected we call a signal handler to clean up other threads, esp hardware ones. So the trace from #9 to #1 is all related to code in the signal handler flushing output logs etc.

(gdb) where
#0 0x0804e8de in __default_alloc_template<false, 0>::allocate (__n=80)
at /usr/qnx630/target/qnx6/usr/include/g+±3/stl_alloc.h:422
#1 0x0804db12 in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::Rep::operator new (s=16,
extra=64)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.cc:33
#2 0x0804db4c in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::Rep::create (extra=34)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.cc:60
#3 0x0804de5f in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::replace (this=0x7d23a98, pos=0,
n1=0,
s=0x84aaf40 “DebugReporter flush successful!!!”, n2=34)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.cc:164
#4 0x0804e645 in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::assign (this=0x7d23a98,
s=0x84aaf40 “DebugReporter flush successful!!!”, n=34)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.h:218
#5 0x0804ea3f in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::assign (this=0x7d23a98,
s=0x84aaf40 “DebugReporter flush successful!!!”)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.h:220
#6 0x0804e682 in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::basic_string (this=0x7d23a98,
s=0x84aaf40 “DebugReporter flush successful!!!”)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.h:178
#7 0x083cd2b5 in DebugReporter::ForceFlush (this=0x87a0480)
at DebugReporter.cpp:422
#8 0x08140de7 in Hardware::handler (signo=11) at SignalHandler.cpp:52
#9 0xb031764e in __signalstub ()
from/usr/qnx630/target/qnx6/x86/lib/libc.so.2
#10 0x0804db12 in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::Rep::operator new (s=16,
extra=64)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.cc:33
#11 0x0804db4c in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::Rep::create (extra=38)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.cc:60
#12 0x0804de5f in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::replace (this=0x7d23f38,
pos=0, n1=0,
s=0x84625c0 “DelayQueue::OnTimer() - executing call”, n2=38)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.cc:164
#13 0x0804e645 in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::assign (this=0x7d23f38,
s=0x84625c0 “DelayQueue::OnTimer() - executing call”, n=38)
—Type to continue, or q to quit—
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.h:218
#14 0x0804ea3f in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::assign (this=0x7d23f38,
s=0x84625c0 “DelayQueue::OnTimer() - executing call”)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.h:220
#15 0x0804e682 in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::basic_string (this=0x7d23f38,
s=0x84625c0 “DelayQueue::OnTimer() - executing call”)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.h:178
#16 0x081e7099 in Hardware::DelayQueue::OnTimer (this=0x8adad50)
at DelayQueue.cpp:281
#17 0x081e7a25 in Hardware::DelayQueue::Loop (this=0x8adad50)
at DelayQueue.cpp:436
#18 0x081e7abc in Hardware::DelayQueue::LoopStarter (pObject=0x8adad50)
at DelayQueue.cpp:461

The actual crash occurs at step 10 in Rep::operator new in bastring.cc
line 33: Going to that file I see the following code:

template <class charT, class traits, class Allocator>
inline void * basic_string <charT, traits, Allocator>::Rep:
operator new (size_t s, size_t extra)
{
return Allocator::allocate(s + extra * sizeof (charT)); // This is
line 33
}

Doing an info locals to check the variable values I find:

(gdb) info locals
s = 80
extra = 131218164

So ‘extra=131218164’ looks VERY out of whack and explains the SIGSEGV
due to lack of Ram to supply that kind of allocate. But looking at the
stack trace of the gdb dump you see that:

__default_alloc_template<false, 0> >::Rep::operator new (s=16, extra=64)

which seems to indicate that extra should be 64 anot 131218164. So I
don’t get how it changed from 64 to 131218164???

Looking at the calling routine step 11 we see it died on line 60. The
code is:

template <class charT, class traits, class Allocator>
inline basic_string <charT, traits, Allocator>::Rep *
basic_string <charT, traits, Allocator>::Rep::
create (size_t extra)
{
extra = frob_size (extra + 1);
Rep *p = new (extra) Rep; // This is line 60
p->res = extra;
p->ref = 1;
p->selfish = false;
return p;
}

Doing an info locals and info args and print *p gives the following
values:

(gdb) info args
extra = 64
(gdb) info locals
extra = 64
p = (Rep *) 0x1
(gdb) print *p
$2 = {len = 0, res = 0, ref = 0, selfish = false}

Note that extra appears to be 64 here which matches what the gdb stack
trace seems to think the value of extra is for the Rep::operator new
routine.

So it appears somehow that the normal value for extra is being changed
into a hugely large value for some reason. I checked my Stroustrup book
on pg 255-256 where it talks about how code such as:

Rep *p = new (extra) Rep;

works and there doesn’t seem to be anything strangely amiss here other
than not allocating Ram from the free pool. I guess the Std lib must
create it’s own memory pool and allocates from there instead for internal temporary storage so as to guarantee not to use more Ram in a memory constrained system?

But overall I am completely mystified as to why extra gets this crazy
value shoved in it.

Doing a ‘pidin -f m -p spm’ when the system is running reveals each thread has a stack size of 256K. This seems to be more than enuff for each thread and the system has 189 megs free out of 256 Ram wise.

So I am trying to figure out if I am getting stack corruption from other threads in some strange manner or if the stack memory is exhausted in some manner (esp the Std lib free pool) that I don’t realize. Esp since GDB post mortem debugging has no method to allow you to see how much stack space was in use by each thread.

Anyone have any idea/suggestions?

Tim

check the instruction where it’s crashing.

(gdb)x/i $pc

If it’s pushing onto the stack, then you are most likely running out of stack room.

In the Rep::operator new routine I did:

(gdb) x
0x804db15
<__nw__Q2t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0_3RepUiUi+29>: mov %eax,%edx

(gdb) x/i $pc
0x804db12
<__nw__Q2t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0_3RepUiUi+26>: add $0x10,%esp

Neither of these look like a push onto the stack to me but maybe I am misinterpreting the assember code.

Also I have a question. If the value of ‘extra’ was corrupted from 64 to 131218164 then isn’t any attempt to push that size of memory onto the stack is going to cause a SIGSEGV because there isn’t that much RAM on my whole system?

Tim

From the backtrace in your original posting the crash occurred in

#0 0x0804e8de in __default_alloc_template<false, 0>::allocate (__n=80)
at /usr/qnx630/target/qnx6/usr/include/g+±3/stl_alloc.h:422

what is the instruction there?

I don’t see anyway that add $0x10, %esp could cause a SIGSEGV

OK, here is the info at:

#0 0x0804e8de in __default_alloc_template<false, 0>::allocate (__n=80)
at /usr/qnx630/target/qnx6/usr/include/g+±3/stl_alloc.h:422

(gdb) info args
__n = 80
(gdb) info local
__n = 80
__my_free_list = (_Obj **) 0x84b4324
__result = (_Obj *) 0x3d

(gdb) x/i $pc
0x804e8de <allocate__t24__default_alloc_template2b0i0Ui+126>:
mov (%edx),%ecx
(gdb) x
0x804e8e0 <allocate__t24__default_alloc_template2b0i0Ui+128>:
mov %ecx,(%eax)

It still doesn’t say a whole lot. But as I mentioned in my original post, the program did NOT crash there. It actually crashed in step 10.

We have a signal handler attached to the program to handle SIGSEGV’s. One occurred at step 10. All the steps from 9-0 are happening inside the signal handler which is doing nothing but cleaning up hardware cards and then flushing the debug log (where the step 9-0 occur).

Since the original SIGSEGV occured when trying to write to the log it only stands to reason that further attemps in a signal handler are also bound to fail

This is the cleanup signal handler code:

namespace Hardware
{
void handler(int signo)
{
// Close the stepper cards
//
Hardware::StepperCard::CloseStepperCards();

     // disconnect from the Host PC
     X_HAL::IUsbChip* pChip = X_HAL::GetUsbChip();
     if (pChip != 0)
     {
          pChip->EnableInterrupts(false);
          pChip->Disconnect();
    }

    X_HAL::CCommsCard* pCommsCard = X_HAL::GetCommsCard(X_HAL::USB_CARD);

    if (pCommsCard != 0)
    {
          pCommsCard->EnableCombinedInterruptsToPciBus(false);
    }


    // Flush stderr, stdout, and the debug reporter output.
    //
    cerr.flush();
    cout.flush();

    // This is where step 7 comes in.
    DebugReporter::Instance()->ForceFlush();
	

    // Kill the application.
    //
    signal( signo, SIG_DFL);

// kill(0,signo);
raise (signo);

}

// Sets handlers to terminating signals (except SIGKILL, where it is not possible)
// This is done in order to disable hardware interrupts before forced termination,
// what avoid lockups of the whole system.
void initSignalHandler()
{
    signal( SIGHUP, handler);
    signal( SIGQUIT, handler);
    signal( SIGABRT, handler);
    signal( SIGFPE, handler);
    signal( SIGBUS, handler);
    signal( SIGSYS, handler);
    signal( SIGSTOP, handler);
    signal( SIGTTOU, handler);
    signal( SIGINT, handler);
    signal( SIGILL, handler);
    signal( SIGIOT, handler);
    signal( SIGEMT, handler);
    signal( SIGSEGV, handler);
    signal( SIGPIPE, handler);
    signal( SIGTERM, handler);
    signal( SIGPWR, handler);
    signal( SIGTSTP, handler);
    signal( SIGTTIN, handler);
}

}

Here’s GDB trace of another crash in the same thread but a different place yet again related to string allocation in the stdlib:

#0 0x0804e8de in __default_alloc_template<false, 0>::allocate (__n=80)
at /usr/qnx630/target/qnx6/usr/include/g+±3/stl_alloc.h:422
#1 0x0804db12 in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::Rep::operator new (s=16, extra=64)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.cc:33
#2 0x0804db4c in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::Rep::create (extra=34)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.cc:60
#3 0x0804de5f in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::replace (this=0x7d638d8, pos=0, n1=0,
s=0x84aaf40 “DebugReporter flush successful!!!”, n2=34)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.cc:164
#4 0x0804e645 in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::assign (this=0x7d638d8,
s=0x84aaf40 “DebugReporter flush successful!!!”, n=34)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.h:218
#5 0x0804ea3f in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::assign (this=0x7d638d8,
s=0x84aaf40 “DebugReporter flush successful!!!”)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.h:220
#6 0x0804e682 in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::basic_string (this=0x7d638d8,
s=0x84aaf40 “DebugReporter flush successful!!!”)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.h:178
#7 0x083cd2b5 in DebugReporter::ForceFlush (this=0x87a0480)
at DebugReporter.cpp:422
#8 0x08140de7 in Hardware::handler (signo=11) at SignalHandler.cpp:52
#9 0xb031764e in __signalstub ()
from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
#10 0x0804db12 in basic_string<char, string_char_traits, _default_alloc
template<false, 0> >::Rep::operator new (s=16, extra=64)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.cc:33
#11 0x0804db4c in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::Rep::create (extra=35)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.cc:60
#12 0x0804de5f in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::replace (this=0x7d63d48, pos=33, n1=0,
s=0x8d480c8 “50fo11”, n2=2)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.cc:164
#13 0x0804e5c4 in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::replace (this=0x7d63d48, pos1=33, n1=0, str=@0x7d63dc0,
pos2=0, n2=2) at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.cc:131
#14 0x0804eb15 in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::append (this=0x7d63d48, str=@0x7d63dc0, pos=0,
n=4294967295) at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.h:196
#15 0x0804f764 in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> > operator+<char, string_char_traits,
__default_alloc_template<false, 0> > (lhs=@0x7d63dc4, rhs=@0x7d63dc0)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.h:473
#16 0x081e6933 in Hardware::DelayQueue::Insert (this=0x8adad50,
msg=0x9020628) at DelayQueue.cpp:120
#17 0x081e6232 in Hardware::DelayCrossThreadDecoupledCallBase::Start (
this=0x9020628) at DelayCrossThreadDecoupledCall.cpp:79
#18 0x081f4b48 in Hardware::DetectorHandler::WaitDI (this=0x8af2658,
checkParameters= {m_pDigitalInput = 0x8af4500, m_bValue = true,
m_pClientCallback = 0x8b1d570}) at DetectorHandler.cpp:1500
#19 0x081f8e13 in Hardware::DelayCrossThreadDecoupledCall<Hardware::DetectorHan
dler::DigitalInputCheckParameters, Hardware::DetectorHandler>::Run (
this=0x87ace48) at DelayCrossThreadDecoupledCall.h:157
#20 0x08214dd4 in Hardware::DispatchQueue::Loop (this=0x8add860)
at DispatchQueue.cpp:340
#21 0x08214f04 in Hardware::DispatchQueue::LoopStarter (pObject=0x8add860)
at DispatchQueue.cpp:398

This crash occurs at the same line number bastring.cc line 33 (step 10 in the trace) as the other crash yet gets there from a slighly different place in the code originally.

Taking a look here at the last place in our code that got executed (step 16), line 120 of DelayQueue.cpp we see a line of:

m_logger.add(string("inserted into queue with delay = ") + image((long
long int)msg->GetDelay()));

Image() is defined as:

template
std::string IntImage (Type value, int radix)
{
// Taken from QNX library reference for itoa()
//
static const int MAX_IMAGE_SIZE = (8 * sizeof(Type)) + 1;
char buf[MAX_IMAGE_SIZE];

return itoa(value, buf, radix);

}

template
std::string UintImage (Type value, int radix)
{
// Taken from QNX library reference for utoa()
//
static const int MAX_IMAGE_SIZE = (8 * sizeof(Type)) + 1;
char buf[MAX_IMAGE_SIZE];

return utoa(value, buf, radix);

}

template
std::string LongImage (Type value, int radix)
{
// Taken from QNX library reference for ltoa()
//
static const int MAX_IMAGE_SIZE = 33;
char buf[MAX_IMAGE_SIZE];

return lltoa(value, buf, radix);

}

template
std::string UlongImage (Type value, int radix)
{
// Taken from QNX library reference for ultoa()
//
static const int MAX_IMAGE_SIZE = 33;
char buf[MAX_IMAGE_SIZE];

return ulltoa(value, buf, radix);

}

std::string image (Int8 value, int radix = 10)
{
return IntImage(value, radix);
}

std::string image (Int16 value, int radix = 10)
{
return IntImage(value, radix);
}

std::string image (Int32 value, int radix = 10)
{
return IntImage(value, radix);
}

std::string image (Int64 value, int radix = 10)
{
return LongImage(value, radix);
}

std::string image (Uint8 value, int radix = 10)
{
return UintImage(value, radix);
}

std::string image (Uint16 value, int radix = 10)
{
return UintImage(value, radix);
}

std::string image (Uint32 value, int radix = 10)
{
return UintImage(value, radix);
}

std::string image (Uint64 value, int radix = 10)
{
return UlongImage(value, radix);
}

std::string image (Float32 value)
{
char buf[31];
sprintf(buf, “%30.10f”, value);
return buf;
}

std::string image (Float64 value)
{
char buf[31];
sprintf(buf, “%30.10f”, value);
return buf;
}

/*
std::string image (SymbologySettings value )
{
ostrstream str;
str << value;
string a_string(str.str());
str.freeze(0);
return a_string;
}
*/

#ifndef CYGWIN
std::string image (timespec value)
{
char buf[50];
sprintf(buf, “%d %ld”, value.tv_sec, value.tv_nsec);
return buf;
}
#endif

which essentially converts a number of a various type to a std:string.

However if we go down the stack trace from step 16 to step 15 inside the std library we see the following:

#15 0x0804f764 in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> > operator+<char, string_char_traits, __default_alloc_template<false, 0> > (lhs=@0x7d63dc4, rhs=@0x7d63dc0)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.h:473
473 str.append (rhs);

Looking at the arguments here to this function call we get:

(gdb) info args

lhs = ( basic_string<char,string_char_traits,__default_alloc_template<false,0 > > &) @0x7d63dc4: {static npos = 4294967295, static nilRep = {len = 0, res = 0, ref = 261, selfish = false},
dat = 0x8c30878 “inserted into queue with delay = deletedeeds to be
deleted”,","}

rhs = ( basic_string<char,string_char_traits,__default_alloc_template<false,0 > > &) @0x7d63dc0: {static npos = 4294967295, static nilRep = {len = 0, res = 0, ref = 261, selfish = false}, dat = 0x8d480c8 “50fo11”}

Now doesn’t the value of npos for both string of 4294967395 look strange? This is supposed to be the number of characters in the strings if I am not mistaken. These strange npos values keep getting passed down in successive calls into the stdlib until it finally causes a SIGSEGV.

So I wonder if somehow 64 bit numbers aren’t be handled correctly either in our code or the std library someplace (we use 64 bit numbers all over the place).

Anyone know if there are outstanding issues using 64 bit numbers in the std library with the GNU compiler/library (2.95.3) under QNX 6.3 and not QNX 6.1.

Tim

Dude, I think I might have it. Try adding -D_PTHREADS=1 to your compile. This will make the stl allocate threadsafe… :v)

Examine stl_config.h and stl_alloc.h for more delightful details! :vD

That seems to have done the trick!

I changed my CFLAGS from:

CFLAGS = -lang-c++ -g

to

CFLAGS = -lang-c++ -D_PTHREADS=1 -g

and my LDFLAGS from:

LDFLAGS = -lang-c++ -g -Wl,-Map spm.map, -D_PTHREADS

to

LDFLAGS = -lang-c++ -g -Wl,-Map spm.map, -D_PTHREADS=1

and re-compiled and linked everything. So far I’ve run for several hours with no crashes :slight_smile:

Funny because you’ll notice that I did have the -D_PTHREADS defined in my original link line but I guess it has to also be defined in the compile line as well?

I wouldn’t have thought that because I assumed the std lib was linked in at link time and not at compile time. But I guess is has to insert semaphore/mutex code at compile time before all std lib access to create mutual exclusion into the internal std lib free pool.

Thanks for the help on this. It’s been driving me batty for a couple of weeks now!

Tim

Lots of stuff are defined in the header files :wink:

The -D_PTHREADS on the link line would have been ignored - it’s only passed to the pre-processor, so it’s really only a compile time option.

Good to hear it helped…