Application hang without debug information

Dear all,

I recently had an application running on QNX 6.5.0 with an ARM A8 processor. I launched it in debug mode from the IDE and the application started running as expected. Then, after about 20 minutes or sometimes an hour, the application simply hung, and the IDE did not report any crash such as a segmentation fault. This means I could not even debug the application. I have checked the system memory with “pidin info” from within my application, and the memory usage looks correct. When the application hung, the wireless connection also went down and I could not even telnet into the system. Are there any methods I can use to debug my application?

My current guess is that, since the IDE cannot catch the crashing line in the code, the hang may not be caused by the code itself. Is that right? If the application had hung because it deadlocked itself, telnet should still work. I have also disabled all the CPU-intensive functions in the application. Is there a way to record the CPU usage, memory usage and events, either in the system services or in the application, to identify the problem? Thanks very much.

Best regards,
Eric

Assuming you have some writable disk space, there is the kernel logger.

It’s possible that a high priority task (driver?) is running ‘ready’ and consuming 100% of the CPU. That would lock out the IDE and your application.

To test this theory you could try setting the priority of your telnet session / qconn higher than that of anything else running on your system, including drivers.

Tim

Thanks maschoen and Tim.
Yes, I found my application has a thread priority of 10, which is the same as the kernel as shown by the “top” command. I reduced the priority of the highest-priority thread in my application and can now telnet in when the application hangs.
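For reference, lowering a thread's priority from inside the application can be sketched with the POSIX scheduling calls. This is only a minimal sketch: the helper names `clamp_priority` and `set_self_priority` are made up for illustration, and the valid priority range depends on the scheduling policy and on the privileges of the process.

```c
#include <pthread.h>
#include <sched.h>

/* Hypothetical helper: clamp a requested priority into the range
 * that the given scheduling policy allows on this system. */
int clamp_priority(int policy, int wanted)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);

    if (wanted < lo)
        return lo;
    if (wanted > hi)
        return hi;
    return wanted;
}

/* Hypothetical helper: change the calling thread's priority while
 * keeping its current scheduling policy.
 * Returns 0 on success or the error code from the pthread calls. */
int set_self_priority(int wanted)
{
    int policy = SCHED_OTHER;          /* overwritten by the query below */
    struct sched_param param = { 0 };
    int rc;

    rc = pthread_getschedparam(pthread_self(), &policy, &param);
    if (rc != 0)
        return rc;

    param.sched_priority = clamp_priority(policy, wanted);
    return pthread_setschedparam(pthread_self(), policy, &param);
}
```

A worker thread could call `set_self_priority(8)` at the top of its thread function so that it runs below the default priority of 10 mentioned above; the value 8 is just an example.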

After quite a few online debugging sessions, we found the place where the application hangs. However, when we checked the variable values in the hung thread, both the local variables and a global variable were corrupted with unreasonable values, and this caused the hang. We then tried adding a hw watchpoint on the global variable to see exactly where it gets changed during execution. However, the watchpoint was not hit when the thread hung.

We also tried the memory analysis toolkit with the “Memory problems” options and still cannot find the code that corrupts the memory. Can you please suggest a way to debug this kind of issue? Many thanks.

Best regards,
Eric

Eric,

I don’t think hw watchpoints work the way you are hoping. At least in my experience, I’ve never managed to use them to find the kind of problem you are describing. That’s because the corruption is occurring elsewhere (a pointer running wild) and not from code statements modifying the variable. The 2nd paragraph of this gdb reference (gdb is what the QNX debugger is based on) talks about this. But maybe someone else has used watchpoints successfully in a multi-threaded environment and can enlighten us both.

sourceware.org/gdb/onlinedocs/g … oints.html

Things I can suggest:

  1. Turn the warning level to maximum, re-compile everything and make sure you fix every warning (uninitialized variables are a big culprit). You might want to start with ‘-Wall -Wuninitialized -O’. You can find the gcc warning types for hunting array bounds problems and other things here:
    gcc.gnu.org/onlinedocs/gcc/Warning-Options.html
  2. If you are running in debug mode, switch to release mode. This normally causes a much faster ‘crash’ that will help locate where the real problem is (run the dumper process so you get a dump file, which will give you the crash address that you can then look up in a map file; turn on the map file option when compiling). On the other hand, if you are running in release mode, switch to debug mode.
  3. Visually inspect your code base. You’re looking for pointers or any C arrays/buffers (this is where C++ really helps over C if you are using C++ stdlib classes). One of those is definitely the problem (this can be especially tricky in multi-threaded code if pointers are shared between threads and the memory space or what’s pointed to can change and you missed adding a mutex someplace…).
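As an illustration of the last point, here is a minimal sketch of keeping a shared buffer and its mutex together so every access goes through the lock and a bounds check. The names `shared_buf`, `shared_write` and `shared_read` are hypothetical and not taken from the application discussed here.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define SHARED_CAP 64

/* Hypothetical shared state: the buffer and the mutex that protects it. */
struct shared_buf {
    pthread_mutex_t lock;
    size_t len;
    char data[SHARED_CAP];
};

struct shared_buf g_shared = { PTHREAD_MUTEX_INITIALIZER, 0, { 0 } };

/* Replace the buffer contents. Returns 0 on success, -1 if src does not fit. */
int shared_write(struct shared_buf *sb, const void *src, size_t n)
{
    if (n > sizeof sb->data)
        return -1;                     /* refuse instead of overflowing */

    pthread_mutex_lock(&sb->lock);
    memcpy(sb->data, src, n);
    sb->len = n;
    pthread_mutex_unlock(&sb->lock);
    return 0;
}

/* Copy out at most cap bytes. Returns the number of bytes copied. */
size_t shared_read(struct shared_buf *sb, void *dst, size_t cap)
{
    size_t n;

    pthread_mutex_lock(&sb->lock);
    n = sb->len < cap ? sb->len : cap;
    memcpy(dst, sb->data, n);
    pthread_mutex_unlock(&sb->lock);
    return n;
}
```

Handing out copies rather than raw pointers into the buffer means no thread keeps a pointer to memory that another thread may change underneath it.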

Tim

Dear Tim,

Many thanks for your detailed explanations. As you suggested, I fixed the uninitialised and unused variables reported by the compilation warnings. Then I ran the program in release mode for about 2 hours and the hang did not occur. Previously, we ran the tests in debug mode instead of release mode. Can I say that the memory layout somehow changed after fixing the warnings and launching in release mode? I am not sure whether fixing the warnings has already solved the memory corruption problem.

Now, to locate the source of the problem, we have put printf calls in each thread that report if the global variable value gets corrupted during execution, and we are still looking for the culprit. Hopefully we can find the exact problem.
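One way to structure such checks is to place guard words on both sides of the suspect global and test them at checkpoints in each thread. The sketch below assumes the corruption comes from a write that runs over neighbouring memory. The names `guarded_int`, `g_watched` and `GUARD_CHECK` are hypothetical, and the technique only reports damage at the next checkpoint after it happened, not at the faulty write itself.

```c
#include <stdint.h>
#include <stdio.h>

#define GUARD_PATTERN 0xDEADBEEFu

/* Hypothetical stand-in for the corrupted global, with a guard word
 * on each side so an overrun from either direction leaves a trace. */
struct guarded_int {
    uint32_t guard_before;
    int value;
    uint32_t guard_after;
};

struct guarded_int g_watched = { GUARD_PATTERN, 0, GUARD_PATTERN };

/* Returns 0 if both guards are intact. Otherwise returns a bit mask
 * (1 = guard before damaged, 2 = guard after damaged) and logs the
 * location of the checkpoint that noticed the damage. */
int guard_check(const struct guarded_int *g, const char *file, int line)
{
    int damaged = 0;

    if (g->guard_before != GUARD_PATTERN)
        damaged |= 1;
    if (g->guard_after != GUARD_PATTERN)
        damaged |= 2;

    if (damaged)
        fprintf(stderr, "%s:%d: guard damaged (mask %d), value=%d\n",
                file, line, damaged, g->value);
    return damaged;
}

/* Checkpoint macro: records the file and line of the caller. */
#define GUARD_CHECK(g) guard_check((g), __FILE__, __LINE__)
```

Placing `GUARD_CHECK(&g_watched);` before and after suspicious calls narrows the corruption down to the code between the last clean checkpoint and the first damaged one.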

Best regards,
Eric

Eric,

The memory layout definitely changes when you run in release mode. This mostly affects local (stack) variables as opposed to global variables. What happens is that the optimization in release mode turns many stack variables into register variables so they don’t occupy stack memory (this is why you can’t debug a release compilation because the variables don’t occupy any traditional memory space).

If you ran for 2 hrs with no problems in release mode after fixing the warnings there is a good chance you fixed the problem. You can determine if that’s the case by:

  1. Going back to the original code before you fixed the warnings (you are using some kind of source code control, right?), compiling it in release mode and seeing if the hangup occurs. If it does, then you know you fixed the problem.
  2. Run your current code with all the fixes in debug mode and see if the hangup occurs.

Your last paragraph indicates you are still looking for the culprit. Are you looking to determine exactly what you might have fixed because I thought it was now working (2+ hrs no hang)?

Tim

Hello Tim,

We have identified the problem and it was indeed an array memcpy mistake. It just took quite some time before we caught this bug. After fixing it, the application runs smoothly. Btw, is there any effective memory analysis tool for QNX, something like Valgrind? Thanks a lot for your kind help.
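The thread does not show the faulty memcpy call itself. As a general illustration of guarding against this class of bug, here is a minimal sketch of a size-checked copy. The names `copy_checked` and `COPY_INTO_ARRAY` are made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy n bytes only if they fit in the destination.
 * Returns 0 on success, -1 if the copy would overflow or a pointer is NULL. */
int copy_checked(void *dst, size_t dst_size, const void *src, size_t n)
{
    if (dst == NULL || src == NULL || n > dst_size)
        return -1;

    memcpy(dst, src, n);
    return 0;
}

/* Convenience wrapper that takes the destination size from the array type.
 * It is only valid for real arrays; with a pointer argument, sizeof
 * would give the pointer size instead of the buffer size. */
#define COPY_INTO_ARRAY(arr, src, n) \
    copy_checked((arr), sizeof(arr), (src), (n))
```

Returning an error for an oversized copy turns silent corruption into something that can be logged at the call site.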

Best regards,
Eric

Eric,

The IDE includes a memory analysis tool (note you need to link in a special library to override the normal one)

qnx.com/developers/docs/6.5. … dures.html

which can detect buffer overflows like the one you experienced

qnx.com/developers/docs/6.5. … flow_.html

However it only works on heap variables and not stack variables. So if you have a function like

#include <string.h>

void foo(void)
{
    char bar[5];
    memset(bar, 0, 6); /* overflow: writes 6 bytes into a 5-byte stack array */
}

it won’t help you.

Tim

I usually use Mudflap to catch this kind of problem.

Thanks, I have just started using the mudflap option.
I just came across a strange problem. My original application runs normally without the mudflap option. After compiling with the mudflap option and launching the run configuration with mudflap, the application stops with the message "Process 163857 (gumstix_app) terminated SIGSEGV code=1 fltno=11 ip=0105a138(libc.so.3@_Initlocks+0x80) mapaddr=0005a138. ref=00000004 " and cannot proceed with the mudflap analysis.
What might be the reason for this? Thanks.

Eric

Attached is a debug screenshot from when the SIGSEGV happened.

qnx.com/developers/docs/6.5. … nIDE_.html

Are you doing regular memory analysis too? The docs say it will cause the code to crash if you are using it in conjunction with Mudflap.

KGB

Thanks, Tim.
I am not using the memory analysis tool together with mudflap. I guess the reason is that the linker options (-fmudflapth -lmudflapth) are missing. Will try and see.

Regards,
Eric

As my application is multi-threaded, I tried the link option -lmudflapth and still got the same result. I am just wondering whether -fmudflapth should be a compiler option instead of a link option. My application is currently compiled with “-lmudflapth -fmudflapth”, but I still cannot get rid of the crash when launching the mudflap test. Please kindly suggest.

Regards,
Eric

I’ve never used Mudflap myself. Maybe Nico knows if he checks this thread again.

What I would suggest you do is to create a small test program that compiles in the same manner as your application. A ‘Hello World’ kind of simple program that does say 1 memory allocation and then exits. Then try launching that with Mudflap enabled to see if it crashes. If it does you know it’s nothing to do with your application and something to do with how you’ve configured the IDE. At that point you can play with the IDE options until you get it to work.

I’ll assume you have configured your IDE as per this link
qnx.com/developers/docs/6.5. … nIDE_.html

Tim

Thanks, Tim. I compiled a simple application with mudflap and it works.

Then my application unexpectedly generated a core dump file during execution. However, my application was compiled in release mode. When I launched the postmortem analysis, I could not get the whole picture of the code, but I could see the address where each thread got a segmentation fault. Can I locate the culprit code with this dump file? Thanks.

Regards,
Eric

Hello, I just used ntoarm-gdb.exe with my stripped executable and the core dump file. As shown in the attached screenshot, I can use x/20i 0x103a145 to view the 20 instructions starting at address 0x103a145. However, it is a bit hard to identify the crashing line from the assembly instructions. Is there a way to do this with only the stripped program and the core dump file? Thanks.

Regards,
Eric

Eric,

Many of your threads are on the exact same point. Those are probably ‘waiting’ for something (receiving a message, waiting on a mutex etc) and are unlikely to be the culprit. It’s most likely thread 1 that crashed.

There are 2 things you can do to help figure out what went wrong.

  1. Add the creation of a map file as part of your build process. Then without making any changes to code, just re-link (I think map files are created during linking) to get the map file. Look in the map file and you’ll find the addresses of every function. You can cross reference those with the ones reported in the core file to see what function each thread was in.

It’s possible (and likely) that all the crashes will end up being in functions in the standard library (strcpy for example if you walked off the end of a buffer) in which case you’ll learn nothing from figuring out the actual line/function. If that happens proceed to step 2.

  2. For each thread, do a stack trace. This shows all the functions that were called. That will let you work backwards up the function calls until you get to one that’s in your code. That will at least pinpoint where in your code you were.

It’s also possible one of the stack traces itself is corrupted. If that occurs you won’t be able to go any further up the chain but you will know it means you have a stack corruption problem (local variable in some function) as opposed to a heap one (malloc based).
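The cross-referencing in step 1 amounts to finding the function with the highest start address that is not above the crash address. Below is a minimal sketch of that lookup, assuming the function addresses have already been copied out of the map file into a table sorted by address. The `map_sym` type, the `map_lookup` function and any table contents are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One function entry taken from a linker map file. */
struct map_sym {
    uint32_t addr;       /* start address of the function */
    const char *name;    /* function name as listed in the map file */
};

/* Find the function containing addr.
 * tab must be sorted by ascending address; text_end is one past the
 * last code address of the executable.
 * Returns NULL when addr lies outside the executable's own code, for
 * example in a shared library or when the saved PC is garbage. */
const struct map_sym *map_lookup(const struct map_sym *tab, size_t count,
                                 uint32_t text_end, uint32_t addr)
{
    size_t lo = 0;
    size_t hi = count;

    if (count == 0 || addr < tab[0].addr || addr >= text_end)
        return NULL;

    /* Binary search. Invariant: tab[lo].addr <= addr, and either
     * hi == count or tab[hi].addr > addr. */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;

        if (tab[mid].addr <= addr)
            lo = mid;
        else
            hi = mid;
    }
    return &tab[lo];
}
```

A NULL result is itself a useful hint: the address then has to be checked against the load addresses of the shared libraries before concluding that the program counter was corrupted.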

Tim

P.S. Note: Pinpointing the exact line of code is almost impossible unless you want to start counting assembly instructions from the start of your function (once you have identified it from steps 1 and 2 above). Usually just getting the right function is enough because, unless you have really complex functions, there are normally only one or two variables that could be causing the crash.

Dear Tim, thanks for your kind explanation.

I have generated the map file and found that the memory range runs from around 0x00102850 (.init section) to 0x0013b494 (.text section). However, the crash location from the core dump, 0x103a145, is outside that range. During the core dump analysis, gdb said the registers were corrupted. So does the PC register point to the wrong location, 0x103a145? The backtrace also does not seem to work on a core dump from the release build.

Regards,
Eric