select + SIGCHLD problem

Greetings,

I have a program that selects for reading on sockets which had a listen
performed on them. When a connection is made a child is forked to handle the
connection. The child does some stuff and eventually ends while the parent
goes back to the select. The process works without any problems except …

The parent has SIGCHLD handler.

If the child dies while the parent is sitting in the select call the handler
is called but the select NEVER returns. (The handler does a waitpid). In
fact, any attempt to kill a process that has hung on the select will cause
the entire system to hang. (The system does not respond to a ping.)

I can only assume this should not be happening.

]{ristoph

Which stack are you using?
Can you post some code?

-seanb

Sean,

I am using the full stack. I’ve just figured out that any signal that is
caught will cause select to hang.

]{ristoph

Greetings,

Please ignore my last, rather alarmist, post. I dropped the towel as Igor
would say ;o)

Let me just step back and give you more of an idea of what I am doing. I
noticed that there were a number of problems with the version of samba that
is posted on qnxstart. The main symptoms were an accumulation of zombies and
basically no support for SIGHUP.

I had a closer look at the problem and I found that, in fact, signals were
not working at all in any of the samba binaries. The problem turned out to be
that RTP advertises SA_RESTART but it does not support it. In samba, if
SA_RESTART is defined it is used in calls to sigaction. sigaction then fails
but samba does not notice.

So, I removed SA_RESTART and sigaction was working fine. However, a new
problem appeared. I was running smbd as a daemon and as soon as a connection
was closed the daemon would stop working. I traced that to the fact that
sys_select (a samba function) stopped returning after a signal, such as
SIGCHLD, interrupted it. I only quickly looked at the sys_select and
assumed, incorrectly, that it simply mapped to select. Hence, I got the
impression that select was hanging …

In fact, sys_select looks something like this (removing all the ifdef’s) …

int sys_select(int maxfd, fd_set *fds, struct timeval *tval)
{
    struct timeval t2;
    int selrtn;

    do {
        if (tval) memcpy((void *)&t2, (void *)tval, sizeof(t2));
        errno = 0;
        selrtn = select(maxfd, SELECT_CAST fds, NULL, NULL, tval ? &t2 : NULL);
    } while (selrtn < 0 && errno == EINTR);

    return(selrtn);
}

So, now I’ll try again to describe the problem …

If the daemon is sitting in the select (typically without a timeout) and it
receives a SIGHUP (or SIGCHLD) select will return with a -1 and errno will
be EINTR, as expected. The do … while will cause select to be called again.
At that point the select will no longer respond to connections, which is not
expected. It will respond to signals though and return with EINTR again if a
signal is received.

Now, if SIGTERM is received (which is also caught by samba) while in this
“dead” select state the entire system will freeze at the point where the
SIGTERM handler hits “exit(0)”.

If you would like to experience this yourself …

(make sure you have the full stack running)

1) Get the latest samba source (2.0.7)
2) Get the latest config.* files
3) Run ./configure
4) Edit source/include/config.h, add ‘#define HAVE_FCNTL_LOCK 1’
5) Edit source/lib/signal.c
   Comment out the following two lines in the CatchSignal function:
     if ( signum != SIGALRM )
       act.sa_flags = SA_RESTART
6) make install
7) /usr/local/samba/bin/smbd -D
8) kill -SIGHUP smbdpid
9) kill SIGHUP smbdpid

Expect that your system will be dead after step 9. Actually, for step 8 you
can use fs-cifs to make a connection (you’ll need to set up smb.conf) and
then kill that connection. smbd will fork on the connection and as the child
dies it will send a SIGCHLD to the parent.

Clearly, a fairly serious bug in NTO.

I found some clean workarounds for all of the above issues so I will post
both the source and a new samba binary in the next day or so.

]{ristoph

]{ristoph <news2@kristoph.net> wrote:
: Greetings,

: Please ignore my last, rather alarmist, post. I dropped the towel as Igor
: would say ;o)

: Let me just step back and give you more of an idea of what I am doing. I
: noticed that there were a number of problems with the version of samba that
: is posted on qnxstart. The main symptoms were an accumulation of zombies and
: basically no support for SIGHUP.

: I had a closer look at the problem and I found that, in fact, signals were
: not working at all in any of the samba binaries. The problem turned out to be
: that RTP advertises SA_RESTART but it does not support it. In samba, if
: SA_RESTART is defined it is used in calls to sigaction. sigaction then fails
: but samba does not notice.

: So, I removed SA_RESTART and sigaction was working fine. However, a new
: problem appeared. I was running smbd as a daemon and as soon as a connection
: was closed the daemon would stop working. I traced that to the fact that
: sys_select (a samba function) stopped returning after a signal, such as
: SIGCHLD, interrupted it. I only quickly looked at the sys_select and
: assumed, incorrectly, that it simply mapped to select. Hence, I got the
: impression that select was hanging …

: In fact, sys_select looks something like this (removing all the ifdef’s) …

select() will clear your fdset while waiting for the event. You
need to reset it before the next call. You’re probably waiting
for a SIGSELECT but no manager has been armed to send you one.

: int sys_select(int maxfd, fd_set *fds,struct timeval *tval)
: {
: struct timeval t2;
fd_set fds2;
: int selrtn;

: do
: {
memcpy(&fds2, fds, sizeof *fds);
: if (tval) memcpy((void *)&t2,(void *)tval,sizeof(t2));
: errno = 0;
: selrtn = select(maxfd,SELECT_CAST &fds2,NULL,NULL,tval?&t2:NULL);
: }
: while (selrtn<0 && errno == EINTR);

: return(selrtn);
: }

: So, now I’ll try again to describe the problem …

: If the daemon is sitting in the select (typically without a timeout) and it
: receives a SIGHUP (or SIGCHLD) select will return with a -1 and errno will
: be EINTR, as expected. The do … while will cause select to be called again.
: At that point the select will no longer respond to connections, which is not
: expected. It will respond to signals though and return with EINTR again if a
: signal is received.

: Now, if SIGTERM is received (which is also caught by samba) while in this
: “dead” select state the entire system will freeze at the point where the
: SIGTERM handler hits “exit(0)”.

There was a bug where the stack would run READY if a listening socket
was closed that had queued connections on it (hadn’t called accept()
yet). This sounds like what may be happening. Can you verify this?
This is fixed for the next patch.

: If you would like to experience this yourself …

: (make sure you have the full stack running)

: 1) Get the latest samba source (2.0.7)
: 2) Get the latest config.* files
: 3) Run ./configure
: 4) Edit source/include/config.h, add ‘#define HAVE_FCNTL_LOCK 1’
: 5) Edit source/lib/signal.c
: * Comment out the following two lines in the CatchSignal function.
: * if ( signum != SIGALRM )
: * act.sa_flags = SA_RESTART
: *
: 6) make install
: 7) /usr/local/samba/bin/smbd -D
: 8) kill -SIGHUP smbdpid
: 9) kill SIGHUP smbdpid

: Expect that your system will be dead after step 9. Actually, for step 8 you
: can use fs-cifs to make a connection (you’ll need to set up smb.conf) and
: then kill that connection. smbd will fork on the connection and as the child
: dies it will send a SIGCHLD to the parent.

: Clearly, a fairly serious bug in NTO.

: I found some clean workarounds for all of the above issues so I will post
: both the source and a new samba binary in the next day or so.

: ]{ristoph

I’ve just checked and it looks like this is incorrect behaviour :(

-seanb

Sean Boudreau <seanb@qnx.com> wrote:

: select() will clear your fdset while waiting for the event. You
: need to reset it before the next call. You’re probably waiting
: for a SIGSELECT but no manager has been armed to send you one.

: : int sys_select(int maxfd, fd_set *fds,struct timeval *tval)
: : {
: : struct timeval t2;
: fd_set fds2;
: : int selrtn;

: : do
: : {
: memcpy(&fds2, fds, sizeof *fds);
: : if (tval) memcpy((void *)&t2,(void *)tval,sizeof(t2));
: : errno = 0;
: : selrtn = select(maxfd,SELECT_CAST &fds2,NULL,NULL,tval?&t2:NULL);
: : }
: : while (selrtn<0 && errno == EINTR);

: : return(selrtn);
: : }

]{ristoph <news2@kristoph.net> wrote:

: int sys_select(int maxfd, fd_set *fds,struct timeval *tval)
: {
: struct timeval t2;
: int selrtn;

: do
: {
: if (tval) memcpy((void *)&t2,(void *)tval,sizeof(t2));
: errno = 0;
: selrtn = select(maxfd,SELECT_CAST fds,NULL,NULL,tval?&t2:NULL);
: }
: while (selrtn<0 && errno == EINTR);

: return(selrtn);
: }

I’ve seen this kind of code in PDS several times. I believe
POSIX (somebody correct me if I’m wrong) said that the fds will
be invalid after a select call, so the loop really should be:

fd_set setcopy;

do {
    memcpy(&setcopy, fds, sizeof(fd_set));
    if (tval) memcpy((void *)&t2, (void *)tval, sizeof(t2));
    errno = 0;
    selrtn = select(maxfd, SELECT_CAST &setcopy, NULL, NULL, tval ? &t2 : NULL);
} while (selrtn < 0 && errno == EINTR);
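
Put together as a complete function, the idea might look like the sketch below. select_read_retry is a made-up name, and Samba's SELECT_CAST macro is left out on the assumption that select() takes fd_set pointers directly. One step the fragment above does not show: after a successful call the working copy has to be handed back to the caller, otherwise the caller never learns which descriptors are ready.

```c
#include <errno.h>
#include <string.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical variant of sys_select(): retries on EINTR and always gives
 * select() a fresh copy of the caller's read set and timeout. */
static int select_read_retry(int maxfd, fd_set *fds, struct timeval *tval)
{
    fd_set setcopy;
    struct timeval t2;
    int selrtn;

    do {
        memcpy(&setcopy, fds, sizeof(fd_set));      /* re-arm the set */
        if (tval)
            memcpy(&t2, tval, sizeof(t2));          /* re-arm the timeout */
        errno = 0;
        selrtn = select(maxfd, &setcopy, NULL, NULL, tval ? &t2 : NULL);
    } while (selrtn < 0 && errno == EINTR);

    /* On success or timeout, report the resulting set to the caller. */
    if (selrtn >= 0)
        memcpy(fds, &setcopy, sizeof(fd_set));

    return selrtn;
}
```

On failure the caller's set is left exactly as it was passed in, whatever the underlying select() did to the copy.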

-xtang

Xiaodan Tang wrote:

: do
: {
: if (tval) memcpy((void *)&t2,(void *)tval,sizeof(t2));
: errno = 0;
: selrtn = select(maxfd,SELECT_CAST fds,NULL,NULL,tval?&t2:NULL);
: }
: while (selrtn<0 && errno == EINTR);

: return(selrtn);
: }

: I’ve seen this kind of code in PDS several times. I believe
: POSIX (somebody correct me if I’m wrong) said that the fds will
: be invalid after a select call, so the loop really should be:

Sure, I will :-)

POSIX.1 has nothing to say about select() other than to mention
such a thing exists in certain BSD versions (Annex B.6.4).

But UNIX98 (SOLO2) has some specifications, including:

    On successful completion, the objects pointed to by the readfds,
    writefds, and errorfds arguments are modified to indicate which
    file descriptors are ready for reading, ready for writing, or
    have an error condition pending, respectively. For each file
    descriptor less than nfds, the corresponding bit will be set on
    successful completion if it was set on input and the associated
    condition is true for that file descriptor.

    On failure, the objects pointed to by the readfds, writefds, and
    errorfds arguments are not modified. If the timeout interval
    expires without the specified condition being true for any of the
    specified file descriptors, the objects pointed to by the readfds,
    writefds, and errorfds arguments have all bits set to 0.

I’d say it is definitely incorrect behaviour, since this does not cause a
problem on any other platform that SAMBA runs on, only on NTO. Anyway, it goes
without saying that having the system hang is incorrect behaviour ;o)

]{ristoph

]{ristoph <news2@kristoph.net> wrote:
: I’d say it is definitely incorrect behaviour, since this does not cause a
: problem on any other platform that SAMBA runs on, only on NTO. Anyway, it goes
: without saying that having the system hang is incorrect behaviour ;o)

That’s a separate bug that I’m pretty sure was fixed if you
could verify it’s in fact what you are hitting. This one was
checked against the specs.

-seanb

Sean,

Please could you post more detailed instructions on how I can identify if it
is, in fact, what I am “hitting”. Unfortunately, your previous instructions
were a little on the terse side ;o)
Alternatively, I could drop a tarball of the binaries into my quics
directory (or post it on my web site) and you could try it out yourself.

]{ristoph

Can you email me again? (lost your addr)

-seanb
