pidin hangs at a specific process

Hi !

I have a problem with a process (my_process): after running it for a few hours, the node begins to get slow.
My process is not consuming CPU (according to hogs), and the only thing I can see is that when I run ‘pidin’ or ‘pidin mem’, the utility hangs exactly when it’s about to show information about my_process.

My process has 3 threads. One of them (main) is a resource manager with only io_read and io_write implemented.

The problem is progressive. After 4 hours pidin hangs for a moment but then frees up. After one or two days pidin hangs there forever and the node is almost inoperable.
When I kill this process (it takes almost a minute to kill), everything returns to its normal state.

Could it be a programming problem in my resource manager?
When pidin executes without arguments, does it send some message to my resource manager?

Thanks for your help !
Fabio.

[I’m using QNX 6.3.0 SP3 with Core Patch 6.3.2a.]

Fabio,

If you are not using CPU then it sounds like you are leaking something else.

Maybe file descriptors, timers, memory, threads etc.

Can you get the ‘sin’ command to execute? That provides a quick way to check memory use, open files etc. You can also get that from pidin but the command line args are a little more archaic.

Tim

This sounds very familiar. My guess is that you are not closing fd’s properly. Some system table fills up and gets larger and larger until accessing it takes a noticeably long time. You might want to take a close look at your close code and compare it to the documentation examples.

Thanks for the answer.

Tim: at this moment I have the node in that situation, and I executed a ‘sin’ command and a ‘sin fds’ command; neither hung at my process.

maschoen: Thanks for the advice. I’m going to check my code looking for that (fd closes).

I wrote a few test programs today and found that when I execute the following function, it hangs. It’s the same function that hangs the ‘pidin’ command.

devctl(fd, DCMD_PROC_TIDSTATUS, &status, sizeof status, 0)

with fd being

fd = open ("/proc/pid_my_process/as", O_RDONLY)
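
For reference, a self-contained version of this test looks roughly like the following (the pid value is a hypothetical placeholder and error handling is kept to a minimum):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <devctl.h>
#include <sys/types.h>
#include <sys/procfs.h>

int main(void)
{
    procfs_status status;
    char path[64];
    pid_t pid = 12345;          /* hypothetical: the pid of my_process */
    int fd, tid;

    snprintf(path, sizeof path, "/proc/%d/as", (int)pid);
    fd = open(path, O_RDONLY);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    /* Ask procnto for the status of each thread in turn; this is
       essentially what pidin does when it prints per-thread info. */
    for (tid = 1; tid <= 3; tid++) {
        memset(&status, 0, sizeof status);
        status.tid = tid;
        if (devctl(fd, DCMD_PROC_TIDSTATUS, &status, sizeof status, 0) != EOK) {
            printf("tid %d: no status\n", tid);
            continue;
        }
        printf("tid %d: state %d\n", tid, status.state);
    }

    close(fd);
    return 0;
}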

The node is very slow by now.

Can you suggest any other test for this problem ??

Thanks.
Fabio.

[quote=“FabioG”]
devctl(fd, DCMD_PROC_TIDSTATUS, &status, sizeof status, 0)

with fd being

fd = open ("/proc/pid_my_process/as", O_RDONLY)

The node is very slow by now.

Can you suggest any other test for this problem ??

Thanks.
Fabio.[/quote]

Sounds like you have a thread creating other threads, and it is creating too many of them…

As the problem is progressive, right now my node is slow but after a couple of minutes it does show the pidin information (it hangs for a while on my process and then goes on).

It has only 3 threads, and pidin only shows those threads.
It has only 6 file descriptors open (one of them is a TCP/IP connection to another QNX node).

Information on TID 1 and TID 2 is shown relatively quickly; then it hangs waiting for the procnto resmgr to reply with the status information for TID 3.

  • Is there any way to check that system table that might be large (mentioned by maschoen) ?

  • I have made other tests now and concluded that the node isn’t really slow all the time. For example, when I run several ‘ls /tmp’ commands quickly it works fine, but when I execute some other task that talks to procnto, like ‘pidin’ or my test process with the devctl function, it gets slow and the ‘ls /tmp’ command now returns its output after almost half a minute.

Do you think it might be a general scheduling problem generated by my_process?

You already mentioned that Hogs shows your process consuming no CPU (I assume that means <2%) when the node gets slow. If that’s the case it’s not a scheduling problem.

What is the 3rd thread, the one that’s causing the pidin command to hang, doing?

What I would suggest you do is open 1 terminal and run Hogs at a high priority (like 20 or anything higher than your process) with an update rate of every second.

In the other terminal, you can run the ls /tmp command a few times and see what the result is in hogs. Then run the pidin command and watch what hogs reports. It will be interesting to see if hogs reports a lot of CPU being used when the node is slow.

Also, I assume you have already checked that your process isn’t consuming large amounts of Ram or disk space (not open files, but instead 1 giant file) or creating lots and lots of temporary files.

One other thing to check in your code (I don’t think this info is available via pidin): make sure you are not leaking channels (created via ChannelCreate()).
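
For example (a simplified sketch, not your actual code): every ChannelCreate() needs a matching ChannelDestroy() once the channel is no longer needed, or the channels accumulate.

#include <stdio.h>
#include <sys/neutrino.h>

int do_some_work(void)
{
    int chid = ChannelCreate(0);
    if (chid == -1) {
        perror("ChannelCreate");
        return -1;
    }

    /* ... use the channel (MsgReceive() etc.) ... */

    /* Without this, every pass through this code leaks a channel. */
    ChannelDestroy(chid);
    return 0;
}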

Tim

Thanks Tim.

The 3rd thread periodically writes some information to a MySQL database via ODBC (TCP/IP). I’ve checked that code and it looks ok.

On the other hand, I ran hogs at a higher priority and nothing is consuming CPU on that node when I run pidin or ls.
I have checked for RAM, disk space, etc and everything is fine.

Recently, I ran the IDE System Analysis Tool (via qconn) on that node and all the information is fine, except for this:

My process has a signal pending (signal #57), only on my 3rd thread.
All other processes have no signals pending.
I’ve checked other working nodes and it looks like all processes that have some kind of TCP/IP connection have this signal pending.

Do you know what it means?
Might it be a clue for finding the solution to my problem?

Thanks.
Fabio.

Fabio,

Looking in signal.h, it says signals 57 and above belong to the kernel. So I suspect the signal you see is from the pidin command/Momentics IDE.

It would be interesting to comment out the actual ODBC code that goes over tcp (including opening/closing sockets) and see if that makes any difference in terms of getting rid of the slowness. I’m wondering if your 3rd thread is leaking sockets (which are file descriptors) on open/close if you open/close each time you update (vs open once and then write periodically).
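
Just to illustrate the kind of leak I mean (this is only a sketch, not your ODBC code; the function name and the connect-per-update pattern are made up for the example):

#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* If each periodic update opens its own connection, any code path that
   skips close() leaks one file descriptor per update. */
static int update_once(const char *ip, int port)
{
    struct sockaddr_in addr;
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s == -1)
        return -1;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = inet_addr(ip);

    if (connect(s, (struct sockaddr *)&addr, sizeof addr) == -1) {
        close(s);            /* easy to forget: without it, fds pile up */
        return -1;
    }

    /* ... write the update ... */

    close(s);                /* must be closed on every path */
    return 0;
}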

Tim

Recently, I tested my process with the database server running on my local node.
It was a way to check whether the problem is network related.
There is no difference: the node gets slow and pidin hangs at my process again and then, after a few seconds, goes on.

Thanks Tim. I’m going to comment out my actual ODBC code for testing.

The only thing I’m wondering is: if I’m leaking sockets (fds), shouldn’t that show up in pidin fds, the IDE analysis, sin fds or other related utilities?

Regards
Fabio.

I think I’ve figured out FabioG’s problem, because I faced a similar problem for months. I had a process running a TCP/IP application. It used the select() call to detect incoming messages.
If you read the “Signals” chapter in /product/neutrino/sys_arch/kernel.html carefully, you will find a very interesting thing. The OS uses a special signal, SIGSELECT (signal number 57), when returning from a select() call. This signal is always blocked and can only be handled by calling sigwaitinfo(). These signals are queued, so they accumulate when you don’t handle them.
After a while the signal queue uses up the resources of the system. You can observe this with the pidin utility: pidin sig will show you the pending signals, and 0100 0000 0000 0000 means that signal number 57 is pending.
Unfortunately this is not mentioned in the Library Reference under select().

Hi !

We found that the problem comes with programs using TCP/IP and MySQL connections via ODBC (which is TCP/IP too).

As you say, the problem arises with the select() function, in our case when it’s used with a zero value in the timeout parameter (the timeval struct).
In our TCP/IP code we had a select() used in that way (in a program ported from QNX4, which didn’t have this problem). We followed the MySQL ODBC driver source code and found similar code, so programs that use that ODBC library had the same problem.

In those cases, one pending signal was queued by the system every time the code executed that function, and after some days, the time the system took walking through that linked list of pending signals (internal to the kernel) was huge; that was why ‘pidin hangs at that specific process’. Something similar to what maschoen commented in this thread.
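
For illustration, the kind of code that triggers this is an ordinary polling loop along these lines (just a sketch; sock is assumed to be an already-connected socket, and the surrounding work is omitted):

#include <sys/select.h>
#include <unistd.h>

void poll_loop(int sock)
{
    fd_set readfds;
    struct timeval tv;

    for (;;) {
        FD_ZERO(&readfds);
        FD_SET(sock, &readfds);

        tv.tv_sec = 0;       /* zero-valued timeval: select() just polls */
        tv.tv_usec = 0;

        /* On the affected releases, each of these calls could leave a
           queued SIGSELECT behind, so the pending-signal list grew
           without bound over days of running. */
        if (select(sock + 1, &readfds, NULL, NULL, &tv) > 0) {
            /* ... read and handle the data ... */
        }

        /* ... do other periodic work, then poll again ... */
        sleep(1);
    }
}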

We reported this to QNX support at the time and they told us it would be fixed soon (in the 6.4 version, I guess). This happened a few months ago.

Regards
Fabio.

I have to say I am not sure what either of you are seeing but it’s not normal.

I use select all over the place in several processes to manage TCP/IP connections both with and without timeout values (I never have a timeout of 0, I pass NULL in that case). I never use a sigwaitinfo() call in my code. When I do a pidin sig command I never see signal 57 pending on any of my processes.

Tim

Hi!
The manual says: “This signal mechanism is used by Photon to wait for events and by the select() function to wait for I/O from multiple servers. Of the eight special signals, the first two have been given special names for this use.
#define SIGSELECT (SIGRTMAX + 1)
#define SIGPHOTON (SIGRTMAX + 2)”
Our application uses a timeout in select(); perhaps this is the key. Fabio, you have written that you had also seen signal #57 pending.
We use the QNX 6.3.2 version, but I’ll test it under 6.4.

Regards
jpal

Jpal,

Nope, I use timeouts in my select calls without any problems. I am using 6.3.0 SP3.

Now I will say I NEVER pass in a timeout of 0, I always pass NULL in cases where I don’t want a timeout. So maybe passing a timeout of 0 causes this issue.

I’ll be happy to post some code if you’d like.

Note: I also ignore SIGPIPE errors generated by attempting to write to a broken socket before it can be cleaned up via the select call. I don’t think this matters but I figured I’d mention it just in case.
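
(For what it’s worth, ignoring SIGPIPE is just a one-liner near the top of main(), something like:)

#include <signal.h>

int main(void)
{
    signal(SIGPIPE, SIG_IGN);   /* a write to a broken socket now fails with
                                   EPIPE instead of terminating the process */
    /* ... rest of the program ... */
    return 0;
}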

Tim

If you check the release notes of 6.4.0 you will see PR39687, “Continuously calling select() with a timeout of 0 no longer causes an internal signal queue to grow in an unbounded manner.” was fixed.

Specifically, you needed a 0 timeout, not a NULL timeout.
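
In code, the difference is only in the last argument passed to select(); roughly (a sketch, with fd assumed to be a valid descriptor):

#include <sys/select.h>

/* NULL timeout: block until fd is ready (the case that behaves fine). */
static int wait_blocking(int fd, fd_set *readfds)
{
    return select(fd + 1, readfds, NULL, NULL, NULL);
}

/* Zero-valued timeval: poll and return immediately (the case covered by PR39687). */
static int poll_nonblocking(int fd, fd_set *readfds)
{
    struct timeval tv = { 0, 0 };
    return select(fd + 1, readfds, NULL, NULL, &tv);
}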

Colin

Hi Tim,

The manual for select() says: “If timeout isn’t NULL, it specifies a maximum interval to wait for the selection to complete. If timeout is NULL, select() blocks until one of the selected conditions occurs. To effect a poll, the timeout argument should be a non-NULL pointer, pointing to a zero-valued timeval structure.” So NULL is not equal to 0!
I ignore SIGPIPE. Without that, the process can terminate.

jpal