system() is not really thread safe / waitpid() bug

It seems that system() is not really thread-safe, and/or there is a bug in
waitpid().

Consider the following scenario:

ThreadA:

child_a = spawn(…)
child_b = spawn(…)

wait( //for child_a or child_b to complete )

ThreadB (lower or same priority as ThreadA):

ret = system( “ls /tmp”)

It turns out that there is a fairly good chance that the system call will
never return - the reason being that the wait() in ThreadA will catch the
death of the child shell process spawned by system(), and as a result the
waitpid() inside system() will never return.

There are a few issues::

  • waitpid() should return an error rather than hanging forever if another
    thread in the same process successfully wait()s on the same child pid
  • system() should be documented as being “thread safe only if no other
    thread calls wait()”

It seems that there should only ever be one thread at a time wait()ing for
child processes.

Any other thoughts or ideas on how to get around this?

(And yes, I know system() is evil but there are times when it is
convenient).

Rob Rutherford

Do you actually have an example outside of your code-base that shows this
behaviour? I wrote up this example based on your description and it
never does what you say it will do (ie: it never locks up):

http://qnx.wox.org/lockup.c

Please correct anything I misread in your description and post back
a URL to the fixed source.

chris


Robert Rutherford <ruzz@nospamplease.ruzz.com> wrote:

It seems that system() is not really thread-safe, and/or there is a bug in
waitpid().

Consider the following scenario:

ThreadA:

child_a = spawn(…)
child_b = spawn(…)

wait( //for child_a or child_b to complete )

ThreadB (lower or same priority as ThreadA):

ret = system( “ls /tmp”)

It turns out that there is a fairly good chance that the system call will
never return - the reason being that the wait() in ThreadA will catch the
death of the child shell process spawned by system(), and as a result the
waitpid() inside system() will never return.

There are a few issues::

  • waitpid() should return an error rather than hanging forever if another
    thread in the same process successfully wait()s on the same child pid
  • system() should be documented as being “thread safe only if no other
    thread calls wait()”

It seems that there should only ever be one thread at a time wait()ing for
child processes.

Any other thoughts or ideas on how to get around this?

(And yes, I know system() is evil but there are times when it is
convenient).

Rob Rutherford


Chris McKillop <cdm@qnx.com> “The faster I go, the behinder I get.”
Software Engineer, QSSL – Lewis Carroll –
http://qnx.wox.org/

Thanks Chris.

Here is your example, modified to be closer to how our code works. Yes I
guess I did make some over-simplifications.

http://www.ruzz.com/lockup.c

This clearly shows the main() thread getting the child death from the
thread, and causing the system() call to lockup.

Rob Rutherford

“Chris McKillop” <cdm@qnx.com> wrote in message
news:ar27u1$4u6$1@nntp.qnx.com

Do you actually have an example outside of your code-base that shows this
behaviour? I wrote up this example based on your description and it
never does what you say it will do (ie: it never locks up):

http://qnx.wox.org/lockup.c

Please correct anything I misread in your description and post back
a URL to the fixed source.

chris


Robert Rutherford <> ruzz@nospamplease.ruzz.com> > wrote:
It seems that system() is not really thread-safe, and/or there is a bug
in
waitpid().

Consider the following scenario:

ThreadA:

child_a = spawn(…)
child_b = spawn(…)

wait( //for child_a or child_b to complete )

ThreadB (lower or same priority as ThreadA):

ret = system( “ls /tmp”)

It turns out that there is a fairly good chance that the system call
will
never return - the reason being that the wait() in ThreadA will catch
the
death of the child shell process spawned by system(), and as a result
the
waitpid() inside system() will never return.

There are a few issues::

  • waitpid() should return an error rather than hanging forever if
    another
    thread in the same process successfully wait()s on the same child pid
  • system() should be documented as being “thread safe only if no other
    thread calls wait()”

It seems that there should only ever be one thread at a time wait()ing
for
child processes.

Any other thoughts or ideas on how to get around this?

(And yes, I know system() is evil but there are times when it is
convenient).

Rob Rutherford


\

Chris McKillop <> cdm@qnx.com> > “The faster I go, the behinder I get.”
Software Engineer, QSSL – Lewis Carroll –
http://qnx.wox.org/

“Robert Rutherford” <ruzz@NoSpamPlease.ruzz.com> writes:

| Thanks Chris.
|
| Here is your example, modified to be closer to how our code works. Yes I
| guess I did make some over-simplifications.
|
| http://www.ruzz.com/lockup.c
|
| This clearly shows the main() thread getting the child death from the
| thread, and causing the system() call to lockup.

Yeah, there’s no way to ensure the “right” thread gets the notification.

The waitpid() blocking in the kernel though, is definitely a bug, and
I have an open bug with QSSL on it (on observe because I couldn’t
make the time to come up with a simple test case, so I’m glad that you did!)

It’s PR 12805, for what that’s worth.

So the conclusion is:

  • yes, system() is not really thread safe, and
  • yes, there is a bug in waitpid()

Or to put it another way, waitpid() is not thread safe and therefore it (and
anything that calls it, including system()) should only be used in one
thread per process.

Rob Rutherford

“Dave Olson” <olson@free.tahoenetworks.com> wrote in message
news:olson.1037650258@free.tahoenetworks.com

“Robert Rutherford” <> ruzz@NoSpamPlease.ruzz.com> > writes:

| Thanks Chris.
|
| Here is your example, modified to be closer to how our code works. Yes I
| guess I did make some over-simplifications.
|
| > http://www.ruzz.com/lockup.c
|
| This clearly shows the main() thread getting the child death from the
| thread, and causing the system() call to lockup.

Yeah, there’s no way to ensure the “right” thread gets the notification.

The waitpid() blocking in the kernel though, is definitely a bug, and
I have an open bug with QSSL on it (on observe because I couldn’t
make the time to come up with a simple test case, so I’m glad that you
did!)

It’s PR 12805, for what that’s worth.

Robert Rutherford <ruzz@nospamplease.ruzz.com> wrote:

So the conclusion is:

  • yes, system() is not really thread safe, and
  • yes, there is a bug in waitpid()

The fix for waitpid() will have system() return -1 and set ECHILD, which
I understand is what POSIX says it should do. One thing to note about
system() is that you can’t block a signal handler for SIGCHLD and you can
only mask it for the current thread. However, SIGCHLD is delivered to the
process so a signal handler in the first thread can get the signal. If
you don’t use a signal handler for SIGCHLD anywhere in your process then
you will be less likly to get the ECHILD.

This is what my original test showed and why I didn’t see the problem at first.

chris


Chris McKillop <cdm@qnx.com> “The faster I go, the behinder I get.”
Software Engineer, QSSL – Lewis Carroll –
http://qnx.wox.org/

Chris McKillop <cdm@qnx.com> wrote:

Robert Rutherford <> ruzz@nospamplease.ruzz.com> > wrote:
So the conclusion is:

  • yes, system() is not really thread safe, and

If anything, it’s wait() that is to blame. If you call wait() without
making sure that you know about all the children that your process has,
you risk stealing zombies from some other part of the program that forks
a child for some reason. This doesn’t really have that much to do with
threads.

  • yes, there is a bug in waitpid()


    The fix for waitpid() will have system() return -1 and set ECHILD, which
    I understand is what POSIX says it should do. One thing to note about
    system() is that you can’t block a signal handler for SIGCHLD and you can
    only mask it for the current thread. However, SIGCHLD is delivered to the
    process so a signal handler in the first thread can get the signal. If
    you don’t use a signal handler for SIGCHLD anywhere in your process then
    you will be less likly to get the ECHILD.

Or, even better, don’t call wait() or waitpid(-1,…). If you always
give a pid to waitpid(), and make sure that you don’t have two threads
waiting on the same pid at the same time, I don’t think you’ll ever
trigger the waitpid() bug. And, of course, you’ll never pull a zombie
from under system()'s feet.


Wojtek Lerch QNX Software Systems Ltd.

OK thanks everyone I certainly understand what’s going on now.

I think the only remaining item is that system(), wait(), waitpid() are all
listed in the docs as being Thread Safe. While this may be strictly true (by
the formal defiinition of Thread Safety) I think the docs would benefit from
a bit of a caveat regarding this issue.

Also, it seems that what is really needed is a version of waitpid() that
lets you pass in a list of pids that you are waiting for: the level of
control is too coarse to only be able to wait for either one specific child
or else all children. Actually, I suppose you could achieve this to some
extent by using process groups.

Rob Rutherford

“Wojtek Lerch” <wojtek_l@yahoo.ca> wrote in message
news:arg9at$pq0$1@nntp.qnx.com

Chris McKillop <> cdm@qnx.com> > wrote:
Robert Rutherford <> ruzz@nospamplease.ruzz.com> > wrote:
So the conclusion is:

  • yes, system() is not really thread safe, and

If anything, it’s wait() that is to blame. If you call wait() without
making sure that you know about all the children that your process has,
you risk stealing zombies from some other part of the program that forks
a child for some reason. This doesn’t really have that much to do with
threads.

\

  • yes, there is a bug in waitpid()


    The fix for waitpid() will have system() return -1 and set ECHILD, which
    I understand is what POSIX says it should do. One thing to note about
    system() is that you can’t block a signal handler for SIGCHLD and you
    can
    only mask it for the current thread. However, SIGCHLD is delivered to
    the
    process so a signal handler in the first thread can get the signal. If
    you don’t use a signal handler for SIGCHLD anywhere in your process then
    you will be less likly to get the ECHILD.

Or, even better, don’t call wait() or waitpid(-1,…). If you always
give a pid to waitpid(), and make sure that you don’t have two threads
waiting on the same pid at the same time, I don’t think you’ll ever
trigger the waitpid() bug. And, of course, you’ll never pull a zombie
from under system()'s feet.


Wojtek Lerch QNX Software Systems Ltd.

I think the only remaining item is that system(), wait(), waitpid() are all
listed in the docs as being Thread Safe. While this may be strictly true (by
the formal defiinition of Thread Safety) I think the docs would benefit from
a bit of a caveat regarding this issue.

And that issue has been rasied internally. So I suspect something will
change in the docs to note the behavior.

chris


Chris McKillop <cdm@qnx.com> “The faster I go, the behinder I get.”
Software Engineer, QSSL – Lewis Carroll –
http://qnx.wox.org/

“Dave Olson” <olson@free.tahoenetworks.com> wrote in message
news:olson.1037650258@free.tahoenetworks.com

“Robert Rutherford” <> ruzz@NoSpamPlease.ruzz.com> > writes:

| Thanks Chris.
|
| Here is your example, modified to be closer to how our code works. Yes I
| guess I did make some over-simplifications.
|
| > http://www.ruzz.com/lockup.c
|
| This clearly shows the main() thread getting the child death from the
| thread, and causing the system() call to lockup.

Yeah, there’s no way to ensure the “right” thread gets the notification.

The waitpid() blocking in the kernel though, is definitely a bug, and
I have an open bug with QSSL on it (on observe because I couldn’t
make the time to come up with a simple test case, so I’m glad that you
did!)

It’s PR 12805, for what that’s worth.

Just my two cents. The example program could be modified to check for two
waitpids to complete instead of looking for pid2 to complete. If pid2 ever
completed before pid1, this might cause a problem. Also in the signal
handler, In the past, I have put code in my signal handlers to send a
pthread_kill to the proper thread, if the signal is ever caught by any other
thread. This will ensure that the “right” thread gets the notification.

“Michael Wehrer” <michael.wehrer@alcoa.com> writes:
| > It’s PR 12805, for what that’s worth.
|
| Just my two cents. The example program could be modified to check for two
| waitpids to complete instead of looking for pid2 to complete. If pid2 ever

Sure, you could do lots of things, but fundamentally, the OS is broken,
if a valid waitpid() is active in the kernel (for a specific PID), and
another non-specific waitpid() reaps that PID, AND the first waitpid()
remains blocked.

There doesn’t need to be a guarantee on which one gets the status (although
I’d prefer the more specific one, on general principles). But it’s
definitely broken to leave things blocked in the kernel after the process
goes away.

Remember, this was just a test case. My own example was far more complex
(which is partly why I never took the time to boil it down to a test case),
and I suspect the other folks who have hit this have more complex cases as
well.

Also, just how would system(), for example, do what you are suggesting?
It has no info about what other threads in the program might be doing,
it’s just a library routine, after all…

It’s not necessary to have any signals involved at all to cause
the problem, by the way.