PID (Process ID) overflow

We have an embedded system which runs a crontable. After about 2 months in
the field, the PID overflows (>32767). In one unit, it wrapped around once,
but on the second time, 2 months later, there was a failure on the third new
PID assignment after the roll-over. On another unit, it failed on the first
wrap-around, and on exactly the third PID also. (Coincidence?)

The “failure” is several programs (both QNX and our own) are terminated (no
longer running when you do a sin). Those programs include emu387, emu87_32
(our SBC does not have a hardware floating point processor), and
tracelogger (which is how we know that the PID’s wrapped around), plus 3 of
our programs.

We are speculating that possibly a new PID assignment conflicts with an
existing one (a low number one that was assigned upon the initial boot up),
causing the program terminations.

Has anyone seen this or can shed some light on the problem?

Ed Schwartz
edward.schwartz@l-3com.com

“Edward Schwartz” <edward.schwartz@l-3com.com> wrote in message
news:8ptuar$j00$1@inn.qnx.com

We have an embedded system which runs a crontable. After about 2 months in
the field, the PID overflows (>32767).

PIDs can’t overflow; they are “reused”, so a PID wrap-around failure
should not apply here.

In one unit, it wrapped around once,
but on the second time, 2 months later, there was a failure on the third new
PID assignment after the roll-over. On another unit, it failed on the first
wrap-around, and on exactly the third PID also. (Coincidence?)

The “failure” is several programs (both QNX and our own) are terminated (no
longer running when you do a sin). Those programs include emu387, emu87_32
(our SBC does not have a hardware floating point processor), and
tracelogger (which is how we know that the PID’s wrapped around), plus 3 of
our programs.

Can you describe in more detail how these programs terminated? Do you
have SIGSEGV addresses, or did they just die?



We are speculating that possibly a new PID assignment conflicts with an
existing one (a low number one that was assigned upon the initial boot up),
causing the program terminations.

I think this is impossible. There are servers that have been running for
very extended periods of time; this would have shown up by now.

Hopefully QSSL staff have some ideas.

Has anyone seen this or can shed some light on the problem?

Ed Schwartz
edward.schwartz@l-3com.com

Edward Schwartz <edward.schwartz@l-3com.com> wrote:

We have an embedded system which runs a crontable.

Do you mean a crontab by that? (As in, cron runs and does
something with regularity?)

After about 2 months in
the field, the PID overflows (>32767). In one unit, it wrapped around once,
but on the second time, 2 months later, there was a failure on the third new
PID assignment after the roll-over. On another unit, it failed on the first
wrap-around, and on exactly the third PID also. (Coincidence?)

pids don’t “overflow”, but unused pids may get re-used.

(Just for fun, I ran the following program on my system for a couple
of hours – this had no ill effects. I’ll leave it running overnight;
that is sure to reuse a bunch more pids.)

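#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>

int main(void)
{
    pid_t pid;

    /* Ignore SIGCHLD so the parent does not have to wait() for each child. */
    signal(SIGCHLD, SIG_IGN);

    /* Fork forever; each short-lived child uses up one pid. */
    while (1)
    {
        pid = fork();
        if (pid == 0)
        {
            /* child: report the pid it was given, then exit */
            printf("my pid is %d\n", (int)getpid());
            exit(0);
        }
    }
    return 0;
}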

The “failure” is several programs (both QNX and our own) are terminated (no
longer running when you do a sin). Those programs include emu387, emu87_32
(our SBC does not have a hardware floating point processor), and
tracelogger (which is how we know that the PID’s wrapped around), plus 3 of
our programs.

How did they die? Was there anything in the traceinfo still in
Proc’s memory about how they died? Do you do anything that sends
signals to specific processes which might have sent a signal to
the wrong process or process group?

We are speculating that possibly a new PID assignment conflicts with an
existing one (a low number one that was assigned upon the initial boot up),
causing the program terminations.

Has anyone seen this or can shed some light on the problem?

I’ve never heard of anything like a new PID assignment getting made
that conflicts with any existing one. We’ve run systems internally
that have stayed running for months or longer, with large numbers
of vcs (which use process table entries) and new processes and
never seen this, and many customers have had systems up for months
or years without seeing anything like this. (That is, without
seeing any case where a new process tries to use the pid of an
existing one.) I don’t think it can happen.

-David

Dear David,

Yes, I mean crontab.

We do not use the kill() function or signals at all, therefore it is
unlikely that we are sending signals to the wrong process.

Please note that this is an embedded application so I have to depend on
tracelogger. Unfortunately, as I said, tracelogger stopped along with emu387
and emu87_32. Here is more info:

We’re using QNX 4.25 running on an embedded 386 SBC (Arcom Controls) with
a 2 GB hard disk. I am attaching an output of sin on a system (“System 1”)
when the system is running properly, i.e., has been rebooted recently.
‘crontable’ is the name of our file run by the QNX cron utility. It is a
list of programs of ours to be run on a schedule. Our ‘sweep_rx’, ‘scan_rx’,
‘sweep_pwr_cdma’, etc., are our programs that are being run by cron. We are
running 7 programs an hour, 5 minutes apart, plus 4 once per day. It looks
like an average of 3 PIDs are assigned per run, and it took about 8
weeks to overflow PIDs in both systems. (cron ran our programs successfully
about 10,000 times, so the problem should not be with our programs.)

I noticed a similarity with the 2 systems: both assigned PID 3, 7, and 14
upon wrap-around, and failed on 14. Also, System 1 was trying to spawn
sweep_pwr_cdma, while System 2 was trying to spawn sweep_pwr_amps. Both are
programs called out in our crontable, and I don’t think it is relevant
because the overflow happens when 32756 is reached in both cases,
independent of which program is trying to be run. See below:

******************************* System 1


Aug 21 05:35:01 6 00001020 Spawn pid 32750 ruid 0 euid 0 file (//1/bin/ksh)
Aug 21 05:35:01 6 00001020 Spawn pid 32751 ruid 0 euid 0 file (//1/usr/narda/bin/scan_rx)
Aug 21 05:40:01 6 00001020 Spawn pid 32753 ruid 0 euid 0 file (//1/bin/ksh)
Aug 21 05:40:02 6 00001020 Spawn pid 32754 ruid 0 euid 0 file (//1/usr/narda/bin/scan_rx)
Aug 21 05:45:01 6 00001020 Spawn pid 32756 ruid 0 euid 0 file (//1/bin/ksh)
Aug 21 05:45:02 6 00001020 Spawn pid 3 ruid 0 euid 0 file (//1/usr/narda/bin/scan_rx)
Aug 21 06:05:01 6 00001020 Spawn pid 7 ruid 0 euid 0 file (//1/bin/ksh)
Aug 21 06:05:02 6 00001020 Spawn pid 14 ruid 0 euid 0 file (//1/usr/narda/bin/sweep_pwr_cdma)
Dec 31 19:00:04 5 00005109 Scsi sense (unit=0 scsi=2 err=70h sense=5h asc=24h ascq=0h)

(The last line is the first line of the reboot today.)



******************************* System 2


Sep 12 02:00:01 6 00001020 Spawn pid 32750 ruid 0 euid 0 file (//1/bin/ksh)
Sep 12 02:00:01 6 00001020 Spawn pid 32751 ruid 0 euid 0 file (//1/usr/narda/bin/sweep_tx_rl)
Sep 12 02:05:01 6 00001020 Spawn pid 32753 ruid 0 euid 0 file (//1/bin/ksh)
Sep 12 02:05:02 6 00001020 Spawn pid 32242 ruid 0 euid 0 file (//1/usr/narda/bin/sweep_rx_rl)
Sep 12 02:10:01 6 00001020 Spawn pid 32756 ruid 0 euid 0 file (//1/bin/ksh)
Sep 12 02:10:02 6 00001020 Spawn pid 3 ruid 0 euid 0 file (//1/usr/narda/bin/sweep_pwr_amps)
Sep 12 02:12:01 6 00001020 Spawn pid 7 ruid 0 euid 0 file (//1/bin/ksh)
Sep 12 02:12:02 6 00001020 Spawn pid 14 ruid 0 euid 0 file (//1/usr/narda/bin/sweep_pwr_amps)

(The above was the last line, so this is when tracelogger terminated!)



Remember, System 2 had rolled around successfully 2 months previously, and
had not been rebooted since April 28, 2000. If you examine sin.txt attached,
you will see that PIDs 3, 7, and 14, are not assigned at boot up, at least
on this system. Since they are virtually identical, I’d assume both should
assign the same PIDs, or at least the same PIDs each boot up.

After tracelogger and emu87 stopped running (cron was still running), our
programs all terminated almost immediately because emu87 was not running.
This explains why there were no log files created by our programs afterwards
(our programs create log files daily).

This is about all the info I can give you now. Thanks very much for your
help.

Ed Schwartz (edward.schwartz@l-3com.com)


[Attachment: Sin.txt (uuencoded output of sin), not reproduced here]

The failure is probably caused by a different problem; pids do not overflow.
But I agree it sounds like a system resource collapse.
Have you tried basic checks like “sin freemem” and “sin files” to see
your amount of free memory, memory fragmentation, how many open files
you have, etc.?
By the way, how many running processes do you have when the “failure” occurs?

Edward Schwartz <edward.schwartz@l-3com.com> wrote:

Dear David,

Yes, I mean crontab.

We do not use the kill() function or signals at all, therefore it is
unlikely that we are sending signals to the wrong process.

Ok. And I ran my little process generation program overnight – from
the rate you say you create/destroy processes, it sounds like I probably
generated a far greater number of process creation/death events than
you have gone through. I don’t think it is a problem based on the
process id – it may be based on process generation/destruction.

The first thing I would look at is other resources – in particular,
free memory. If the system is running low on memory, applications
may call malloc(), assume it succeeded, and crash when it actually failed.

-David