Unexpected SIGHUP for tasks run by sysinit

We have a set of tasks that are run out of sysinit at boot time. They are
run from individual command lines like this:

prog >/dev/conX 2>/dev/conX &

so that stdout and stderr are redirected to consoles 1 - 10, according to
task. They end up running with device 0 being /dev/null, as expected. When
examining these tasks using sin, they are in a session with Proc as the
session leader, and no controlling device. Yet, very infrequently (every few
weeks of continuous operation), they all receive SIGHUP. I would think that
this is impossible, since Proc never dies, and there is no controlling
terminal to generate a hangup. Furthermore, I have used this technique
before, and have never seen this behavior.

The only thing I can think of is that the console device is somehow
generating the signal. We have two systems hooked up to a KVM switch, with a
single mouse, keyboard, and monitor.

We can deal with the problem here, but it would be nice to know what is
generating the SIGHUP.

One other issue that arose was that, when the tasks died from the SIGHUP,
they did not leave a trace on their stdout/stderr consoles. I did some
experimentation, and it appears that abort messages are sent to the
session’s controlling device, regardless of what stderr and stdout are.
Since there was no controlling device, no messages showed up. It would be
nice to know if this is indeed the case.

Any help would be appreciated.

Thanks,

Kevin

Kevin Miller <kevin.miller@transcore.com> wrote:


Do you start & terminate a lot of processes over the life of the
system?

Possible work-around:

nohup prog >/dev/conX 2>/dev/conX &

Do you actually want the controlling terminal changed?

on -t /dev/conx prog

Might be worth considering, too.

-David

Please follow-up to newsgroup, rather than personal email.
David Gibbs
QNX Training Services
dagibbs@qnx.com

There is a cron task that does an rtc -s hw once per minute. Another cron
task runs once at midnight each day to create a new syslog file. tftpd is
invoked a few times per day. The systems are expected to run unattended for
weeks or months at a time, so this adds up, I suppose.

I had considered nohup, and we may end up using it, but it would be nice if
we knew what was causing the signal. Is there any way to determine the issuer
of a signal?

Thanks for the help

“David Gibbs” <dagibbs@qnx.com> wrote in message
news:ch6255$o9$3@inn.qnx.com


Kevin Miller <kevin.miller@transcore.com> wrote:

There is a cron task that does an rtc -s hw once per minute. Another cron
task runs once at midnight each day to create a new syslog file. tftpd is
invoked a few times per day. The systems are expected to run unattended for
weeks or months at a time, so this adds up, I suppose.

It’s the rtc every minute that adds up – the others are pretty small
load compared to that – which is 1440 process creations/day.

There is a bug in the process creation algorithm, where it can (rarely)
create a pid that is the process group leader for an already existing
process group. When this process then exits, all the processes in that
process group will get a SIGHUP. I don’t know if this can affect
session 1 processes or not – but maybe.

I’d go about addressing this in two stages:

  1. reduce the number of process creations. Running rtc every minute is
    very heavy handed – I’d recommend instead getting the source to rtc from
    ftp.qnx.com:/usr/free/qnx4/os/samples/misc/rtc_src.tgz and modifying this
    to use a timer and wakeup & do the rtc work every minute that way – rather
    than running rtc every minute. This will greatly extend the time before
    the pid creation algorithm will recreate the dangerous pid. (And, it
    will also be good for your system, greatly reducing total system load
    by not creating & destroying a process every minute. Far better.) In
    doing the recoding, you might also consider qnx_adj_time() for resynching
    the clock, rather than clock_settime(). (Though, rtc might do that
    already, I’m not sure.)

  2. Inquire through your sales or support rep about getting a fixed Proc
    that does not have this bug. (Have the sales or support rep talk to
    Adam Mallory about such a fix.)

I had considered nohup, and we may end up using it, but it would be nice if
we knew what was causing the signal. Is there any way to determine the issuer
of a signal?

Not sure. Setting trace verbosity very high, and getting the traceinfo
right after it happens might give enough information to make a guess,
but I’m not sure if that info is recorded there.

Or, there is a signal context that comes with each signal: with a signal
handler installed for SIGHUP, you could get this context and dump it out.
Of course, that only works if it is your programs going boom, not ones
you don't have code for.

[sigcontext stuff, from an old post]

From steve@qnx.com Tue Sep 16 15:58:01 EDT 1997
Article: 6374 of comp.os.qnx
Path: gateway.qnx.com!not-for-mail
From: steve@qnx.com (Steve McPolin)
Newsgroups: comp.os.qnx
Subject: Re: Getting signal context
Date: 6 Aug 1997 15:01:08 GMT
Organization: QNX Software Systems
Lines: 107
Message-ID: <5sa3jk$oc9@qnx.com>
References: <u4zpqvn1jj.fsf@stlind.lint.lyngso-industri.dk.lyngso-industri.dk>
NNTP-Posting-Host: gateway.qnx.com
X-Newsreader: trn 4.0-test58 (13 May 97)

In article <u4zpqvn1jj.fsf@stlind.lint.lyngso-industri.dk.lyngso-industri.dk>,
Jeppe Sommer <jso@stlind.lint.lyngso-industri.dk.lyngso-industri.dk> wrote:

Is there a way of getting the signal context from within a signal
handler in QNX (i.e., the state of the CPU just before the signal is
delivered)?

On other (Unix like) systems this is typically available as an extra
argument to the signal handler.

Looking at the signal handlers stack from within the Watcom debugger,
it seems that most of the CPU registers are in fact placed further
down the stack (I am compiling with stack calling conventions).
Unfortunately these do not seem to be at a fixed distance from the
stack top.

Does anyone have a clue about how this could be done? I am primarily
interested in the instruction pointer register.

The structure below defines it; it is available as the second argument,
but the compiler is AR about calling signal with a function which
doesn't match 'void (*)(int)' – you can cast it away to void *.

example:
#include <signal.h>
#include <sys/sigcontext.h> /* follows if you don't have it in your system */

void catch(int signo, SIGCONTEXT scp)
{
    printf("death by signal %u at 0x%lx\n", signo, scp->sc_ip);
    exit(1);
}

The fields:
    ulong_t   sc_info;    /* fault specific info */
    ushort_t  sc_errc;    /* error code pushed by processor */
    uchar_t   sc_fault;   /* actual fault # */
    uchar_t   sc_flags;   /* signal handler flags: */
are only available in 424 and above, as is the sigaltstack() et al.


#ifndef sigcont_h
#define sigcont_h 1

#ifndef __TYPES_H_INCLUDED
#include <sys/types.h>
#endif

typedef struct _sigcontext SIGCONTEXT;
struct _sigcontext {
    ulong_t   sc_mask;
    ulong_t   sc_gs:16,:16;   /* register set at fault time */
    ulong_t   sc_fs:16,:16;
    ulong_t   sc_es:16,:16;
    ulong_t   sc_ds:16,:16;
    ulong_t   sc_di;
    ulong_t   sc_si;
    ulong_t   sc_bp;
    ulong_t   :32;            /* hole from pushad */
    ulong_t   sc_bx;
    ulong_t   sc_dx;
    ulong_t   sc_cx;
    ulong_t   sc_ax;
    ulong_t   sc_ip;
    ulong_t   sc_cs:16, :16;
    ulong_t   sc_fl;
    ulong_t   sc_sp;
    ulong_t   sc_ss:16, :16;
    ulong_t   sc_info;        /* fault specific info */
    ushort_t  sc_errc;        /* error code pushed by processor */
    uchar_t   sc_fault;       /* actual fault # */
    uchar_t   sc_flags;       /* signal handler flags: */
#define SC_ONSTACK 1
};

enum {
    TRAP_ZDIV       = 0,   /* SIGFPE: divide by zero */
    TRAP_DEBUG      = 1,   /* SIGTRAP: debug fault */
    TRAP_NMI        = 2,   /* SIGBUS: nmi fault */
    TRAP_BRKPT      = 3,   /* SIGTRAP: cpu breakpoint */
    TRAP_OFLOW      = 4,   /* SIGFPE: integer overflow */
    TRAP_BOUNDS     = 5,   /* SIGFPE: bound instn failed */
    TRAP_BADOP      = 6,   /* SIGILL: invalid opcode */
    TRAP_NONDP      = 7,   /* SIGFPE: NDP not present or available */
    TRAP_DFAULT     = 8,   /* never: double fault (system error) */
    TRAP_NDPSEGV    = 9,   /* SIGSEGV: NDP invalid address */
    TRAP_BADTSS     = 10,  /* never: invalid tss (system error) */
    TRAP_NOTPRESENT = 11,  /* SIGSEGV: referenced segment not present */
    TRAP_NOSTACK    = 12,  /* SIGSEGV: esp|ebp bad address */
    TRAP_GPF        = 13,  /* SIGSEGV: other */
    TRAP_PAGE       = 14,  /* SIGSEGV: page fault */
    TRAP_FPERROR    = 16,  /* SIGFPE: floating point error */
};

#define __ERRC_VALID (1<<TRAP_DFAULT | 1<<TRAP_BADTSS | \
                      1<<TRAP_NOTPRESENT | 1<<TRAP_NOSTACK | 1<<TRAP_GPF | 1<<TRAP_PAGE)

#define __INFO_VALID (1<<TRAP_PAGE)

#endif

Steve McPolin, QNX Software Systems, Ltd.
point+click: steve@qnx.com
lick+stick: 175 Terence Matthews; Kanata, Ontario, Canada; K2M 1W8


[end sigcontext stuff, from an old post]

Thanks for the help

Hope some of this helps,

-David

Please follow-up to newsgroup, rather than personal email.
David Gibbs
QNX Training Services
dagibbs@qnx.com

This is great news. It means that the immediate problem can be solved using
nohup. I will also change the method and period of doing rtc -s hw, and we
are contacting QNX about Proc.

On all of my past QNX projects I had done rtc -s hw once per hour, and had
never seen this. Unfortunately, this project was a port of another company’s
code from an obscure OS (C-Executive), and I just slavishly followed what
they had been doing. Live and learn, I guess. I have to say that I had never
worried much about process groups and sessions before this; now I hope I
never hear those terms again…

Thank you very much for your help.

Kevin

“David Gibbs” <dagibbs@qnx.com> wrote in message
news:cha05n$29e$1@inn.qnx.com


On 3 Sep 2004 14:47:19 GMT, David Gibbs <dagibbs@qnx.com> wrote:

  2. Inquire through your sales or support rep about getting a fixed Proc
    that does not have this bug. (Have the sales or support rep talk to
    Adam Mallory about such a fix.)

Does this relate to Proc32 v4.25O as well?

If yes, will it be fixed in the next patch, should you decide to issue one?

Tony.

David Gibbs wrote:

There is a bug in the process creation algorithm, where it can (rarely)
create a pid that is the process group leader for an already existing
process group. When this process then exits, all the processes in that
process group will get a SIGHUP. I don’t know if this can affect
session 1 processes or not – but maybe.

That’s quite good news for us as well.

We had really severe problems last year with SIGHUP being sent to several
processes that never should have received any.

We found out that it was related to the PIDs restarting with low numbers
again, but there was never a commitment of QNX that there might
be a problem in the kernel.

Unfortunately our application was a webserver, with lots of
processes being launched in a day, so we couldn't
reduce the number of tasks.

Is there a fix available for this problem?


Karsten P. Hoffmann <karsten.p.hoffmann@web.de>
“I love deadlines. I especially like the whooshing sound
they make as they go flying by.”
[In memoriam Douglas Adams, 1952-2001]

Karsten P. Hoffmann wrote:

That’s quite good news for us as well.

We had really severe problems last year with SIGHUP being sent to several
processes that never should have received any.

We found out that it was related to the PIDs restarting with low numbers
again, but there was never a commitment of QNX that there might
be a problem in the kernel.

It isn’t a problem in the kernel, it’s in Proc - the Process Manager.

Unfortunately our application was a webserver, with lots of
processes being launched in a day, so we couldn't
reduce the number of tasks.

Is there a fix available for this problem?

An official patch hasn't been released. Speak to your sales rep to get
further information.


Cheers,
Adam

QNX Software Systems Ltd.
[ amallory@qnx.com ]

With a PC, I always felt limited by the software available.
On Unix, I am limited only by my knowledge.
–Peter J. Schoenster <pschon@baste.magibox.net>

Kevin Miller <kevin.miller@transcore.com> wrote:

This is great news. It means that the immediate problem can be solved using
nohup. I will also change the method and period of doing rtc -s hw, and we
are contacting QNX about Proc.

Ok. The modified (stay-resident) rtc mod I suggested is still a good
one.

On all of my past QNX projects I had done rtc -s hw once per hour, and had
never seen this.

Well, if this project (running rtc once a minute) would die every two weeks,
then the other projects would take about 60x as long – about every 120
weeks, or once every 2 years. So, if those projects ran without a reboot
(which resets the counter, effectively) for 2 years, they might have seen
it, too.

Thank you very much for your help.

Glad to help.

-David

Please follow-up to newsgroup, rather than personal email.
David Gibbs
QNX Training Services
dagibbs@qnx.com

On 7 Sep 2004 21:17:56 GMT, David Gibbs <dagibbs@qnx.com> wrote:

So, if those projects ran without a reboot (which resets the counter,
effectively) for 2 years, they might have seen it, too.

I'm going to call our sales rep.

What is the ID number of the bug?
How should I ask for what I need? Is it enough to request an update to Proc v4.25O?

Tony.

Tony wrote:

I’m going to call our sales rep.
What is the ID number of the bug?

You can refer to this as #5251



Cheers,
Adam

QNX Software Systems Ltd.
[ amallory@qnx.com ]

With a PC, I always felt limited by the software available.
On Unix, I am limited only by my knowledge.
–Peter J. Schoenster <pschon@baste.magibox.net>