Hello,
I’m finishing up a port of our application software from QNX 6.1 to QNX 6.3. I’m in the process of running the application and am encountering crashes that I can’t seem to find the source of.
The application itself consists of 40 threads, most of which are hardware and idle 99% of the time. The other threads handle user requests, schedule hardware movements, hardware updates from sensors etc.
What I’ve found is that the application can run anywhere from 2 minutes to 30 minutes before a crash occurs. The crashes happen in different threads but never in hardware threads. 99% of the crashes appear to be related to freeing or allocating RAM whether from the heap or internally in the stdlib calls.
There is a LOT of code to visually inspect and since everything is happening in real time with that many threads it’s impossible to attach a debugger and get meanining results. On top of it all, everything runs fine in 6.1 with no crashes. I’ve obtained oodles of core files and our own log files in an attempt to figure out what’s wrong but I have had no luck so I am going to ask for some help here to see if anyone has any idea’s for me to try:
Here’s a GDB dump of a core file crash I just got. The crash occurred where the code was trying to send a simple debug text string to a log file. At step 9 when the crash is detected we call a signal handler to clean up other threads, esp hardware ones. So the trace from #9 to #1 is all related to code in the signal handler flushing output logs etc.
(gdb) where
#0 0x0804e8de in __default_alloc_template<false, 0>::allocate (__n=80)
at /usr/qnx630/target/qnx6/usr/include/g+±3/stl_alloc.h:422
#1 0x0804db12 in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::Rep::operator new (s=16,
extra=64)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.cc:33
#2 0x0804db4c in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::Rep::create (extra=34)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.cc:60
#3 0x0804de5f in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::replace (this=0x7d23a98, pos=0,
n1=0,
s=0x84aaf40 “DebugReporter flush successful!!!”, n2=34)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.cc:164
#4 0x0804e645 in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::assign (this=0x7d23a98,
s=0x84aaf40 “DebugReporter flush successful!!!”, n=34)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.h:218
#5 0x0804ea3f in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::assign (this=0x7d23a98,
s=0x84aaf40 “DebugReporter flush successful!!!”)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.h:220
#6 0x0804e682 in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::basic_string (this=0x7d23a98,
s=0x84aaf40 “DebugReporter flush successful!!!”)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.h:178
#7 0x083cd2b5 in DebugReporter::ForceFlush (this=0x87a0480)
at DebugReporter.cpp:422
#8 0x08140de7 in Hardware::handler (signo=11) at SignalHandler.cpp:52
#9 0xb031764e in __signalstub ()
from/usr/qnx630/target/qnx6/x86/lib/libc.so.2
#10 0x0804db12 in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::Rep::operator new (s=16,
extra=64)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.cc:33
#11 0x0804db4c in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::Rep::create (extra=38)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.cc:60
#12 0x0804de5f in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::replace (this=0x7d23f38,
pos=0, n1=0,
s=0x84625c0 “DelayQueue::OnTimer() - executing call”, n2=38)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.cc:164
#13 0x0804e645 in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::assign (this=0x7d23f38,
s=0x84625c0 “DelayQueue::OnTimer() - executing call”, n=38)
—Type to continue, or q to quit—
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.h:218
#14 0x0804ea3f in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::assign (this=0x7d23f38,
s=0x84625c0 “DelayQueue::OnTimer() - executing call”)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.h:220
#15 0x0804e682 in basic_string<char, string_char_traits,
__default_alloc_template<false, 0> >::basic_string (this=0x7d23f38,
s=0x84625c0 “DelayQueue::OnTimer() - executing call”)
at /usr/qnx630/target/qnx6/usr/include/g+±3/std/bastring.h:178
#16 0x081e7099 in Hardware::DelayQueue::OnTimer (this=0x8adad50)
at DelayQueue.cpp:281
#17 0x081e7a25 in Hardware::DelayQueue::Loop (this=0x8adad50)
at DelayQueue.cpp:436
#18 0x081e7abc in Hardware::DelayQueue::LoopStarter (pObject=0x8adad50)
at DelayQueue.cpp:461
The actual crash occurs at step 10 in Rep::operator new in bastring.cc
line 33: Going to that file I see the following code:
template <class charT, class traits, class Allocator>
inline void * basic_string <charT, traits, Allocator>::Rep:
operator new (size_t s, size_t extra)
{
return Allocator::allocate(s + extra * sizeof (charT)); // This is
line 33
}
Doing an info locals to check the variable values I find:
(gdb) info locals
s = 80
extra = 131218164
So ‘extra=131218164’ looks VERY out of whack and explains the SIGSEGV
due to lack of Ram to supply that kind of allocate. But looking at the
stack trace of the gdb dump you see that:
__default_alloc_template<false, 0> >::Rep::operator new (s=16, extra=64)
which seems to indicate that extra should be 64 anot 131218164. So I
don’t get how it changed from 64 to 131218164???
Looking at the calling routine step 11 we see it died on line 60. The
code is:
template <class charT, class traits, class Allocator>
inline basic_string <charT, traits, Allocator>::Rep *
basic_string <charT, traits, Allocator>::Rep::
create (size_t extra)
{
extra = frob_size (extra + 1);
Rep *p = new (extra) Rep; // This is line 60
p->res = extra;
p->ref = 1;
p->selfish = false;
return p;
}
Doing an info locals and info args and print *p gives the following
values:
(gdb) info args
extra = 64
(gdb) info locals
extra = 64
p = (Rep *) 0x1
(gdb) print *p
$2 = {len = 0, res = 0, ref = 0, selfish = false}
Note that extra appears to be 64 here which matches what the gdb stack
trace seems to think the value of extra is for the Rep::operator new
routine.
So it appears somehow that the normal value for extra is being changed
into a hugely large value for some reason. I checked my Stroustrup book
on pg 255-256 where it talks about how code such as:
Rep *p = new (extra) Rep;
works and there doesn’t seem to be anything strangely amiss here other
than not allocating Ram from the free pool. I guess the Std lib must
create it’s own memory pool and allocates from there instead for internal temporary storage so as to guarantee not to use more Ram in a memory constrained system?
But overall I am completely mystified as to why extra gets this crazy
value shoved in it.
Doing a ‘pidin -f m -p spm’ when the system is running reveals each thread has a stack size of 256K. This seems to be more than enuff for each thread and the system has 189 megs free out of 256 Ram wise.
So I am trying to figure out if I am getting stack corruption from other threads in some strange manner or if the stack memory is exhausted in some manner (esp the Std lib free pool) that I don’t realize. Esp since GDB post mortem debugging has no method to allow you to see how much stack space was in use by each thread.
Anyone have any idea/suggestions?
Tim