This problem was posted some months ago by my client – and he thought it
was possibly a bug in Proc32 when the PIDs reached 32767. The immediate
symptom is that something is “killing” emu387. This occurs
One of the technicians at QSSL wrote a program that continuously forked
itself, generating new PIDs in the process. It sequenced past “the magic”
number without any difficulty.
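
For anyone who wants to repeat that kind of test, here is a minimal sketch in
C. It is not the QSSL program (I do not have its source). The function name
count_pid_wraps() is made up for illustration, and it assumes the kernel hands
out PIDs in increasing order until it wraps, so a new PID lower than the
previous one is taken as a wrap.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork `count` short-lived children one after another
 * and count how many times a new PID is numerically lower than the previous
 * one, which is what a wrap of the PID counter looks like from user space.
 * Returns the number of wraps seen, or -1 if fork() or waitpid() fails. */
int count_pid_wraps(long count)
{
    pid_t prev = getpid();
    int wraps = 0;
    long i;

    for (i = 0; i < count; i++) {
        pid_t pid = fork();
        if (pid == -1) {
            perror("fork");
            return -1;
        }
        if (pid == 0)
            _exit(0);                   /* child: exit immediately */
        if (waitpid(pid, NULL, 0) == -1) {  /* parent: reap the child */
            perror("waitpid");
            return -1;
        }
        if (pid < prev) {
            printf("PID wrapped: %ld -> %ld\n", (long)prev, (long)pid);
            wraps++;
        }
        prev = pid;
    }
    return wraps;
}
```

Calling count_pid_wraps() with a count larger than the PID range (for example
40000 where the limit is 32767) should report at least one wrap. Note that
this only exercises the wrap itself, not the combination of cron, shell and
child process described below.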
In the meantime, I have garnered some additional information. First, some
background. This is an embedded system that is set up to run a number of
processes at times determined by the end user. Everything is controlled by
cron and a crontab. The end user can call in and download the data that has
been collected, revise the crontab, and run the processes from the command
line. In practice, the end user calls in at most once a week, and then
only to download the data files.
I have peppered the source code with trace() statements. There is one for
every library function that returns an error code or sets errno. These will
be “traced” only if an error occurs.
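
To make the tracing scheme concrete, here is a rough sketch in C of the
pattern. The names trace_error(), traced_result() and TRACED() are
hypothetical stand-ins, not the actual trace() used in the product, and the
sketch writes to stderr rather than to the system trace buffer. It only
covers calls that return -1 and set errno.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Number of errors traced so far (kept only so the effect is observable). */
static int trace_count = 0;

/* Hypothetical stand-in for trace(): record where a call failed and why. */
static void trace_error(const char *file, int line, const char *call, int err)
{
    trace_count++;
    fprintf(stderr, "%s:%d: %s failed: errno=%d (%s)\n",
            file, line, call, err, strerror(err));
}

/* Pass the result through unchanged; emit a trace only if it is -1.
 * errno is saved and restored so the caller still sees the original value. */
static long traced_result(long result, const char *file, int line,
                          const char *call)
{
    if (result == -1) {
        int saved = errno;
        trace_error(file, line, call, saved);
        errno = saved;
    }
    return result;
}

/* Wrap a library call so a failure is traced with its source location. */
#define TRACED(call) traced_result((long)(call), __FILE__, __LINE__, #call)
```

A call site then looks like `fd = (int)TRACED(open(path, O_RDONLY));`. A
successful call produces no output at all, which is why the trace stays quiet
until something actually goes wrong.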
In the most recent experiment (finally got the client to set up a system in
their laboratory), the crontab was loaded to run one of the processes every
minute (and nothing else). After approximately a week, we have the problem.
The next-to-last time, the shell started by cron has a PID of 32756. The
process started by the shell has a PID of 3. This one runs and completes.
The next time, the shell started by cron has a PID of 7. The process started
by the shell has a PID of 14. This one actually runs to completion, but
tracing stops 27.976 seconds later, presumably because emu387 has been
murdered and tracelogger faults.
In the latest run, the client was able to run traceinfo after the failure,
but before the internal trace buffer overflowed, and capture the end of the
run. This confirmed that the process did terminate normally.
If, because of the sequence of “outside events”, the PID of the shell
started by cron and the process are both on the same side of the maximum
allowable value of a PID, the system runs perfectly (i.e., emu387 is still
alive and kicking).
Does anyone have any idea at all what might be going on here? I am under
pressure to put in a kludge, a real kludge: add to the crontab a periodic
“shutdown”! Engineering won’t allow any more shipments until there is a “fix”
and sales is complaining about order cancellations.