Pidin not showing all threads in a process

We’ve seen two customers hit an issue with two symptoms: 1) certain threads in our process don’t appear to be running/scheduled, and 2) when our support person ran a simple pidin without any arguments, it did not show 30+ threads that should be there. This application normally has around 94 threads, and pidin displayed only 54 of them. We found no other symptoms and, of course, we can’t replicate it in our labs.

All threads in the process are static, i.e., created once at startup and never exiting: while(1) { do stuff }

The system didn’t crash and otherwise appeared to still be working. This is 6.5.0 SP1 on a PowerPC T1024 running full SMP.

Has anyone seen threads magically “disappear” from pidin? From the kernel scheduler? Certain threads would be SEM blocked; others appeared fine.

One customer had been running for about 2 years before this happened this spring, and the other for around 2 months.

When did the solar flares hit?


What happens if one of your threads crashes inside the while(1) loop?

Since your app appears to be long lived (running 2 years continuously, I assume), you presumably have some kind of exception handling in your code. If, as part of that handling, it causes the thread to exit, then maybe that’s where your missing threads are. If those threads were spawned detached, there wouldn’t even be any zombies left around; they’d be gone entirely.
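One way to check this theory is to make an unexpected exit noisy. The sketch below is a minimal POSIX example, not the poster’s code: the names `worker_main`, `on_worker_exit`, `g_exit_count` and the stand-in `do_stuff()` (which “fails” after a few passes to simulate the exception-handling path) are all hypothetical. It wraps the `while(1)` body in `pthread_cleanup_push()`/`pthread_cleanup_pop()`; POSIX runs the pushed handler when the thread calls `pthread_exit()` or is cancelled, so a detached thread that leaves its loop that way logs its name instead of vanishing silently. It does not catch a plain `return` or `break` out of the loop (leaving a push/pop pair that way is undefined), so those paths would need their own log line.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical count of worker threads that left their loop. */
static atomic_int g_exit_count;

/* Cleanup handler: runs if the worker calls pthread_exit() or is cancelled. */
static void on_worker_exit(void *arg)
{
    const char *name = arg;
    atomic_fetch_add(&g_exit_count, 1);
    fprintf(stderr, "worker %s exited unexpectedly\n", name);
}

/* Hypothetical stand-in for the business logic: reports an error on the
 * fourth pass to simulate an error-handling path that ends the thread. */
static int do_stuff(int pass)
{
    return pass < 3 ? 0 : -1;
}

/* Thread body; arg is the thread's name for logging. */
static void *worker_main(void *arg)
{
    pthread_cleanup_push(on_worker_exit, arg);
    for (int pass = 0;; pass++) {          /* the "while(1)" loop */
        if (do_stuff(pass) != 0) {
            pthread_exit(NULL);            /* runs on_worker_exit first */
        }
    }
    pthread_cleanup_pop(0);                /* never reached; pairs the push */
    return NULL;
}
```

In the application the workers would still be created detached; the handler fires either way, and it could write to the same crash log the SIGSEGV handler uses.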

Tim

Tim, if a thread crashes (i.e., segfaults), a signal handler catches that and collects crash data, then dumper runs and the system goes through a reset. There is also a watchdog thread running at high priority, and it is still running.

Do all the other threads block signals, so that only one thread (your signal-handling thread) catches and handles them? Otherwise signals get delivered to whichever thread is current, which may not be the one you expect to handle them. If you’re not sure, you might want to test delivering signals to your app in your lab, see exactly what happens, and make sure it’s what you expect.

You mentioned some threads are SEM blocked. I assume that’s expected behavior for your system?

Tim

Yes, each of these threads is blocked waiting on its own signal, and the ones that are SEM blocked are expected to be.

Can you post some code snippets?

I’d like to see the code that spawns all the threads at startup (I’m interested in how you spawn them). I’m also interested in all the signal-related stuff (your handler, how you block signals in individual threads and the main application, etc.).

Tim

Uhmm, it’s not at all simple, but the bottom line is that we wait here: sem_wait(m_semaphore);

And I can’t post much code in a public forum…
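Since each thread parks in `sem_wait(m_semaphore)`, one detail worth checking given the signal discussion: POSIX `sem_wait()` returns -1 with `errno` set to `EINTR` when a signal handler interrupts the wait. A loop that treats any failure as fatal, or breaks out on it, can leave its `while(1)` because of a signal that has nothing to do with that thread. A minimal sketch of a wait that retries on `EINTR` and reports anything else; the name `wait_for_work` is hypothetical and it assumes the semaphore is a `sem_t *`.

```c
#include <errno.h>
#include <semaphore.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical wrapper: block on sem, retrying when a signal handler
 * interrupts the wait. Returns 0 once the semaphore is acquired, or the
 * errno value of any other failure so the caller can log it rather than
 * silently leaving its loop. */
static int wait_for_work(sem_t *sem)
{
    for (;;) {
        if (sem_wait(sem) == 0)
            return 0;
        if (errno != EINTR) {
            int err = errno;
            fprintf(stderr, "sem_wait failed: %s\n", strerror(err));
            return err;
        }
        /* EINTR: a signal handler ran; go back to waiting. */
    }
}
```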

Without seeing some code, it’s going to be impossible to help or even give you an idea of where to look. Clearly something is happening that’s causing threads to disappear. If you have a QNX support contract, you can contact QNX directly and they will help, but they will also ask for your code.

I mean, how much of a company secret is your thread-spawning call? Are you using spawn(), pthread_create(), or something else, and what options are you setting when you create those threads? We don’t need the business logic. Same with the signal stuff: you must make a call to mask signals, so what does that mask look like, and what does your signal handler setup call look like? I don’t really care what happens inside your threads or the signal handler itself (business logic).

For example, we spawn all of ours like this and then detach them.

pthread_attr_t threadAttributes;       // Thread attributes
struct sched_param threadSchedParam;   // Scheduling parameters

// Initialize the thread attributes
pthread_attr_init(&threadAttributes);

// Set the scheduling priority and the policy to round robin
pthread_attr_setinheritsched(&threadAttributes, PTHREAD_EXPLICIT_SCHED);
pthread_attr_setschedpolicy(&threadAttributes, SCHED_RR);
threadSchedParam.sched_priority = mPriority;
pthread_attr_setschedparam(&threadAttributes, &threadSchedParam);

// Set a stack size and a guard space for stack overruns. In theory we
// shouldn't need the guard space because we won't overflow our stack
// right? ;-)   Also force stack to be fully allocated at thread creation
// time.
pthread_attr_setguardsize(&threadAttributes, 4096);
pthread_attr_setstacksize(&threadAttributes, 32768);    // 32 KiB
pthread_attr_setstacklazy(&threadAttributes, PTHREAD_STACK_NOTLAZY); 

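// Create the thread. pthread_create() returns an error number directly;
// it does not set errno.
int rc = pthread_create(&mThreadId, &threadAttributes, mThreadMain, mArgs);

// The attributes object is no longer needed once pthread_create() returns
pthread_attr_destroy(&threadAttributes);

if (rc != EOK)
{
	mLogger.format(LOG_ERROR, "Thread::Thread() - pthread_create failed with error %s\n", strerror(rc));
}
else
{
	mLogger.format(LOG_INFO, "Thread::Thread() - Thread %s created with id %d", mThreadName.c_str(), mThreadId);
	pthread_detach(mThreadId);
	mAlive = true;
}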

Tim

Thanks, Tim. Yes, we are working with QNX directly and have an NDA with them so that we can share code with them. I’m just searching outside the box. Yes, we use pthread_create() with a wrapper around the OS. But what I’m really looking for is why over 30 threads that have been running fine for months or years would suddenly not show up in the most basic pidin listing.

I’m still a bit confused when you say the threads are not showing up in pidin.

Are they in fact there (running in your application) and not showing in Pidin? Or are they actually missing in your application too?

If it’s the former, I’d wonder whether the system ran out of some sort of resource (memory, file descriptors, etc.) that prevented pidin from reporting correctly. If it’s the latter, then somehow threads in your code are exiting in a way you don’t expect, and pidin is working correctly.
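One way to answer that question from inside the application, independent of pidin, is a heartbeat table: each worker bumps its own counter once per pass through its loop, and the watchdog thread (which is still running) compares two snapshots taken some interval apart. This is a sketch under assumptions; `MAX_WORKERS`, `heartbeat()`, `take_snapshot()` and `count_stalled()` are hypothetical names. Because these threads legitimately sit SEM blocked, an idle thread also stops beating; to tell idle from gone, the wait would need a timeout (for example `sem_timedwait()`) so that idle threads still beat on each timeout.

```c
#include <stdatomic.h>
#include <stdio.h>

#define MAX_WORKERS 128   /* hypothetical upper bound on worker threads */

/* One counter per worker slot; workers write, the watchdog reads. */
static atomic_ulong g_heartbeat[MAX_WORKERS];

/* Called by worker `slot` once per pass through its while(1) loop. */
static void heartbeat(int slot)
{
    atomic_fetch_add_explicit(&g_heartbeat[slot], 1, memory_order_relaxed);
}

/* Watchdog: copy the current counters of the first n slots. */
static void take_snapshot(unsigned long out[], int n)
{
    for (int i = 0; i < n; i++)
        out[i] = atomic_load_explicit(&g_heartbeat[i], memory_order_relaxed);
}

/* Watchdog: log and count slots whose counter did not move between two
 * snapshots, i.e. threads that completed no loop pass in that interval. */
static int count_stalled(const unsigned long before[],
                         const unsigned long after[], int n)
{
    int stalled = 0;
    for (int i = 0; i < n; i++) {
        if (after[i] == before[i]) {
            fprintf(stderr, "worker slot %d: no heartbeat\n", i);
            stalled++;
        }
    }
    return stalled;
}
```

In the watchdog loop: take a snapshot, sleep for a few multiples of the workers’ wait timeout, take a second snapshot, and call `count_stalled()`. If roughly 40 slots go quiet at the moment pidin stops listing 40 threads, the threads really are gone from the process rather than hidden from pidin.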

Tim

Tim,

I was discussing this with Geoff Roberts, and we came up with one more possibility: pidin could have a bug that shows up when a process has some large number of threads. The numbers involved suggest that the limit might be 64. It would be easy to test whether this is the case.

I made a quick test with SDP 6.6.0, and pidin shows all threads.
This is not SDP 6.5.0 (I don’t have a target on hand), so it’s not a really useful test. But at least pidin works correctly on SDP 6.6.0.
Also, my threads are not detached. Does it matter?

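#include <errno.h>      // EOK on QNX
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>     // delay() on QNX

#define NB_THREADS      270

void*               TestManyThreads_Thread(void* pParam);

void                TestManyThreads(void)
{
    uint32_t  Counter;
    uint32_t  Created = 0;
    pthread_t ThreadId[NB_THREADS];
    int       Rc;

    for(Counter = 0; Counter < NB_THREADS; Counter++)
    {
        // Pass the thread index through the void* argument
        Rc = pthread_create(&ThreadId[Counter], NULL, &TestManyThreads_Thread, (void*)(uintptr_t)Counter);
        if(Rc != EOK)
        {
            // pthread_create() returns the error number; it does not set errno
            printf("pthread_create failed for thread %u: %s\n", (unsigned)(Counter + 1), strerror(Rc));
            break;
        }
        Created++;
    }

    // Join only the threads that were actually created
    for(Counter = 0; Counter < Created; Counter++)
    {
        pthread_join(ThreadId[Counter], NULL);
        printf("Thread %u exited\n", (unsigned)(Counter + 1));
    }
}

void*               TestManyThreads_Thread(void* pParam)
{
    uint32_t Param;
    char     ThreadName[32];

    Param = (uint32_t)(uintptr_t)pParam;

    // Name the thread so it can be identified in pidin output
    snprintf(ThreadName, sizeof(ThreadName), "Thread_%02u", (unsigned)Param);
    pthread_setname_np(pthread_self(), ThreadName);

    printf("Thread %u created\n", (unsigned)(Param + 1));
    delay(20000);       // QNX: sleep 20 s (argument in ms) so pidin can be run meanwhile

    return NULL;
}

int                 main(void)
{
    TestManyThreads();
    return 0;
}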

I had to add includes and a main to compile and run.

I’m not in my office, but I have a QNX 6.3.2 system online, so I tried it. It works fine: pidin showed all threads.

So it’s very unlikely to be a pidin problem.