pidin/sin/slay hangs on certain processes

I have an application that is somehow causing pidin and sin to hang. I’m using Momentics 6.3.2 and running on the 6.3.2 OS (I mention this because I don’t see the problem when I run on 6.3.0 SP3, although that doesn’t mean it’s not there). My application has a daemon process that spawns all of the other processes in the application and also stops them. It stops a process by first sending it a pulse, and then a SIGKILL if the process has not stopped within 2 seconds of receiving the pulse. At some point, when I tell the daemon to stop, it takes an unusually long time and doesn’t stop the processes, but it returns as if everything is OK. If I then try to run sin or pidin, it just hangs. That shell becomes unresponsive, yet I can still open a new terminal and interact with a shell there. If I run slay, it also hangs. I recently wrote a stripped-down version of sin to find out where it was hanging, and found that it hangs on the call to open() on /proc/pid for certain processes. I assume it’s SEND or REPLY blocked, but I have no way of knowing since sin and pidin aren’t working. Any ideas on what might cause a call to open() for a pid in /proc/ to hang like that?
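For readers following along, the stop sequence described above (ask politely, wait up to 2 seconds, then SIGKILL) can be sketched roughly as below. This is only an illustration, not the poster's code: SIGTERM stands in for the pulse so the sketch stays portable POSIX, and the names stop_child() and wait_exit() are hypothetical. The one thing it does deliberately is report whether the child really exited, rather than returning as if everything were OK.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: poll for the child's exit for up to timeout_ms.
 * Returns 1 if the child was reaped, 0 on timeout, -1 on error. */
static int wait_exit(pid_t pid, int timeout_ms)
{
    const struct timespec step = { 0, 50 * 1000 * 1000 }; /* 50 ms */
    int status;

    for (int waited = 0; ; waited += 50) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return 1;
        if (r == -1 && errno != EINTR)
            return -1;
        if (waited >= timeout_ms)
            return 0;
        nanosleep(&step, NULL);
    }
}

/* Hypothetical helper: request a stop, then force it after grace_ms.
 * SIGTERM is an assumption standing in for the daemon's pulse.
 * Returns 0 if the child stopped on request, 1 if SIGKILL was needed,
 * 2 if it is still alive after SIGKILL, -1 on error. */
int stop_child(pid_t pid, int grace_ms)
{
    int r;

    if (kill(pid, SIGTERM) == -1)
        return -1;
    r = wait_exit(pid, grace_ms);
    if (r != 0)
        return r == 1 ? 0 : -1;

    if (kill(pid, SIGKILL) == -1)
        return -1;
    r = wait_exit(pid, grace_ms);
    if (r == 1)
        return 1;
    return r == 0 ? 2 : -1;
}
```

A daemon could call stop_child(pid, 2000) for each child and log any result other than 0, which would at least show which process refuses to go away.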

Did you try running pidin at the highest priority? It looks like a problem with io-net. Does your daemon process have anything to do with networking?

I ran “nice --30 sin” and it still just hangs. The CPU has almost no activity on it, so it doesn’t appear to be a matter of getting CPU time. The daemon process doesn’t do anything with io-net. It starts and stops processes. If a process dies, it restarts it, but that’s about all it does. I’ve actually run into this problem in the past, and got to the point where I found a change in the source code before which I did not have the problem and after which I did. However, the code change was purely cosmetic and should not have had any effect on the outcome.

The problem showed up when I changed this code:

    if (time(NULL) - atm->time_last_send >= atm->atm_wait_time)/* atm took too long*/

to this:
    double time_diff;
    time_diff = difftime(time(NULL), atm->time_last_send);
    if (time_diff >= atm->atm_wait_time)/* atm took too long*/

I was able to get it working again by replacing difftime() with some code that just compares the two times. However, I made another modification after that, where I just added another comparison, and it stopped working again. Could there be some size limitation I’m hitting? This is, after all, our largest program.
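To illustrate why that change looks cosmetic, here is a small sketch comparing the two forms of the check. Only the field names time_last_send and atm_wait_time come from the code above; the struct layout, the field types and the function names are assumptions made for the example, and the current time is passed in as a parameter so the two versions can be compared on the same input. It also assumes time_t counts seconds and that now is not earlier than time_last_send.

```c
#include <time.h>

/* Hypothetical stand-in for the poster's structure. Only the two field
 * names come from the thread; the types are assumptions. */
struct atm {
    time_t time_last_send;  /* when the last request was sent */
    double atm_wait_time;   /* allowed wait, in seconds */
};

/* Original form: plain subtraction of two time_t values. */
int timed_out_subtract(const struct atm *atm, time_t now)
{
    return now - atm->time_last_send >= atm->atm_wait_time;
}

/* Revised form: difftime() returns the difference in seconds as a double. */
int timed_out_difftime(const struct atm *atm, time_t now)
{
    return difftime(now, atm->time_last_send) >= atm->atm_wait_time;
}
```

Under those assumptions both functions return the same answer for the same inputs, which fits the view that the edit itself was not the real cause and that something timing- or layout-sensitive was being disturbed.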


You aren’t by chance running a 6.3.0 SP3 compiled version of your code on 6.3.2, are you?

While they look like similar versions of the O/S there are subtle differences in the kernel that could cause something like this.

I ran into a similar problem a couple of years ago on 6.2.1, where I couldn’t get pidin to run (I don’t remember if I tried sin) even though the CPU wasn’t busy. The problem turned out to be that when you run those commands, they cause each process to report info about itself. If a process is stuck (either in an infinite loop or waiting forever on a mutex), it can’t report, and so pidin hangs.


Nope, I’m compiling on 6.3.2 and running on 6.3.2. That does seem to fit the description, though. I think it’s the same program that’s not responding each time. I’ll check it and make sure that mutexes are getting cleaned up when it’s shutting down.

Is the program that is not responding yours? If it is, is it a resource manager?

Yes, it’s mine. No, it’s not a resource manager. I ended up doing a massive overhaul on the code (it needed it anyway) and the problem has not occurred since. Although I didn’t specifically identify it, I believe it was a problem caused by the process waiting on a mutex that was never unlocked. I don’t see anything else I changed that would have affected the problem, but then again it was a mysterious problem. I’ve got a complete rewrite planned in the near future, so as long as it doesn’t cause problems before then, I’ll just cut my losses and hope I don’t run into the same problem.

Thanks to everyone for your help.