resource temporarily unavailable

We had a strange problem with QNX 4.25C. Last night we got a call from
our customer reporting that remote processors, which boot from a host
process controller, were failing. The remotes use qnx_net_alive() to
check whether they can still communicate with the host processor, and
declare themselves dead when they get an error return. The error
returned from qnx_net_alive() was 1004, “no proc entry avail for
virtual process”. Seconds later, the remote processor would declare
itself alive again, apparently having received a successful return
from qnx_net_alive().

When we tried to log in via modem, several of the commands executed in
our /etc/profile returned the error message “resource temporarily
unavailable”. Commands like date and test returned these errors.

When we looked at the processes, there was a total of 371 processes,
of which about 261 were zombies at priority 30f, the same as Proc32.
The highest pid was 32748. So apparently these processes were getting
created, dying, and eventually going away, but not until they hit the
maximum process ceiling (default 500). We found that the blk column
for all the zombies pointed back to the pid of our largest process.

So what does “resource temporarily unavailable” mean, and where does
it come from? What resources is it talking about? And what could
possibly make so many zombies?

Scott

“J. Scott Franko” <jsfranko@switch.com> wrote in message
news:39EDC014.AA6099F0@switch.com

So what does “resource temporarily unavailable” mean, and where does
it come from? What resources is it talking about?

It can be any resource; it depends on the requested operation. In your
case it looks like the process ceiling.

And what could possibly make so many zombies?

Is your process doing a spawn(), fork(), _beginthread(), etc.?
If so, that means a child died and the parent didn’t do a
wait() on it to collect the exit code.


Mario Charest wrote:


And what could possibly make so many zombies?

Is your process doing a spawn(), fork(), _beginthread(), etc.?
If so, that means a child died and the parent didn’t do a
wait() on it to collect the exit code.

I read this in the docs, but we’ve had this code running at least since
July without experiencing this problem. We don’t do much spawning,
forking, or threading: just a fixed number of forks at startup time,
and a spawn to execute an rtc hw every time we finish processing a
train (it keeps our software clock from drifting), which occurs every
45 minutes to a couple of hours. I looked, and the child processes we
created at startup were still there, in addition to the zombies.

The only other strange occurrence was that we dialed in to the modem
connected to the serial port on our QNX host earlier in the day to
troubleshoot a problem on the same network with some Solaris hosts.
But when we telnetted in, the telnet locked up. We escaped out of
telnet, and that dropped our modem connection (because we telnet to a
modem pool before dialing). We did this a couple of times before
finally giving up and calling the operators directly, who also
couldn’t access the console (it ended up being full disks). I figured
that the telnets from the QNX hosts were killed when we dropped the
modem connection, but could this have started some chain reaction to
cause it? Of course, the operators went home before the situation was
discovered and got called back in at 2am. Isn’t that how it always
works?! ;o)

Where does the “resource temporarily unavailable” message come from?
Is that the kernel’s error message after you’ve used up the process
limit?

Scott


A resource could be memory, free entries in the proc table, LDT
entries, file descriptors, etc. Sometimes you will get a message that
is more specific than “resource unavailable”, but not always.

There are two utilities that may help: osinfo and fsysinfo. They show
most of the limits and how close your system is to them.

As Mario says, it sure looks like you had run out of free Proc entries.
You need to find out what’s causing the zombies to hang around.

Richard


I think we discovered our problem. A long while ago we added a call to
qnx_spawn() at the end of each train we processed. Our software at this
yard handles about 1500 cars a day, with somewhere between 30 and 150
cars per train, so qnx_spawn() could get called several times a day.

We were having a problem with jumps in time using NTP. We found that
our software clock was being adjusted, frequently by large +/- amounts
of seconds, and this caused problems in our application. We also found
that our hardware clock was barely drifting at all. So we used
qnx_spawn() to spawn a script that ran rtc hw. Since we did it often
during the day, it kept our software clock on track (no pun intended
;o) ).

Apparently, this script has been leaving zombies around. Yesterday we
discovered that the number of trains processed (hence the number of
qnx_spawn() calls) matched the number of zombies created that day. We
are guessing that after the system had been up and running for an
especially long time, these zombies built up and produced our crisis
earlier in the week. It’s amazing it took this long; the change to use
qnx_spawn() has been in for months.

We are going to add the flag _SPAWN_NO_ZOMBIE to the call. It’s an amazingly simple
fix. I feel really stupid! ;oP

Scott


“J. Scott Franko” <jsfranko@switch.com> wrote in message
news:39F04411.EE8281E0@switch.com

I think we discovered our problem. A long while ago we added a call to
qnx_spawn at the end of each train we processed. Our software at this
yard handles about 1500 cars a day, with somewhere between 30 and 150
cars per train. So qnx_spawn could get called several times a day.

So you were spawning something ;-) Glad you fixed it!!!
