SLOW proc creation thread (QNX4)

I've been having an odd problem, and I found this old thread (from
1994) that describes it almost verbatim. The only difference between
my problem and this one is that I don't necessarily need to call
spawn with an invalid filename.

Randomly, Proc will take forever (30 seconds) to spawn a process.
The spawn call returns quickly (P_NOWAIT), but the system is frozen
(a prio 29 shell cannot run "sin"). I have a resource manager which
registers a prefix and spawn()s programs which do an open on the
prefix. When a program dies, I receive a close and do a targeted
waitpid to obtain the status. It works just spiffy most of the
time, but every once in a while there is this long process load.

The resource manager has _PPF_SERVER set, but does not have
_PPF_INFORM set. I have tried removing _PPF_SERVER but it makes
no difference.

Anyone with a good memory recall anything like this? I have included the
original posting by Malcom Watts from 1994 (which was never answered,
according to Google).



---- SNIP --------------------------------------------------------

  1. Interesting problem; whilst testing a code module recently,
    I filled a .ini file with invalid file names, incorrectly
    formatted sections, and so on, to push the error handling.

The module I'm testing reads a list of process (file)
names to be spawned from the ini file, then spawns them.
The module runs at prio 20 and reads the prio for its progeny from
the ini file. All children are currently spawned at prio 11.
Strange results when spawning using filenames which don't
exist. My PC slows almost to lockup; the proc manager seems to
spawn a process creation thread, which runs at some obscene
priority (sin shows a "zombie" process 'running' at 24) and all of
my lower-prio processes are locked out (even ksh running at 29
responds to the keyboard but 'sin' doesn't return for ages). (??)

This happens for 30 seconds, then the spawn fails with a
file-not-found type of error message (as expected), and all is well.
(The zombie process has disappeared.)

Problem: I've got time-dependent processes running here, and I can't
wait 30 seconds for proc to decide that a file doesn't exist,
whilst running at prio 24. Can anyone suggest what I might be
doing wrong here? Can anyone explain exactly what proc is
doing, wandering off for 30 seconds and effectively starving
other processes? I thought it might be net-related, given the
30-sec timeout. I could test for existence of the file before
spawning (and I will), but I want to know what the heck proc is
doing, so it doesn't bite me later.

“Rennie” <rallen@csical.com> wrote in message
news:bcqiuc$ro7$1@tiger.openqnx.com


Hi Rennie.

I will take a stab in the dark here. I have seen behavior close to that
when there are issues with nameloc/licenses.

For example, when I tried to compile, Watcom would block trying to create a VC
to a node that was down. During that time no other process could be launched
(but I could do a sin from another node).

The only solution to this problem is to make sure all licenses are the same
everywhere and that a minimum number of nodes run nameloc.


Mario Charest wrote:

Hi Rennie.

I will take a stab in the dark here. I have seen behavior close to that
when there are issue with nameloc/licenses.

For example if I try to compile watcom would be block trying to create a VC
to a node that is down. During that time no other process can be launched
(but I can do a sin from another node).

Only solution to this problem is to make sure all licenses are the same
everywhere and there is a minimum amount of node that run nameloc.

Mario, thanks for the idea, I'll investigate this possibility (I hadn't
considered this).

I have been operating under the assumption that this is somehow related to the
launching process (the spawner) being a resource manager (I have never actually
had a resource manager launch apps before, and this is where the problems
started). I also noticed that in Malcom's post from 1994, he too was spawning
processes from a resource manager.

Rennie


I saw a similar and much weirder problem when I forgot to implement file
descriptor duplication with dup()/dup2() in my resource manager. As a
result, dup() was called whenever any process that had my RM open did a
fork()... It was fun, including kernel dumps and processes disappearing from
the system process list while still running.

The moral is: when coding an IO manager for QNX4, do take care of the _IO_DUP
message. Otherwise you risk running into very odd problems in the future.

// wbr

Ian Zagorskih wrote:


Thanks Ian, good advice.

Unfortunately, my resource manager does handle dup()s (I also have logging
that indicates whether a dup fails or succeeds, and dup isn't being called in
this case). I do think it might be related to resource manager type
operations though, since (until now) my launcher process has not been a
resource manager, and I have never seen this behavior.

The interesting thing is that everything continues to work fine in the
system, except that process loads take forever (i.e. the reason I can’t do
a “sin” from a prio 29 shell is that “sin” is taking forever to load -
anything that is already loaded is running fine and getting CPU even at
low priority).

The other interesting thing is that the spawn returns immediately and
successfully, it’s just that the loader thread for the spawned process
doesn’t jump to main for an obscenely long time. Except for this slow
load, the system runs entirely as expected with no errors or other
surprising latencies.

Thanks again for your time.

Rennie

Rennie <rallen@csical.com> wrote in message
news:bcskh1$vg0$1@tiger.openqnx.com

The interesting thing is that everything continues to work fine in the
system, except that process loads take forever (i.e. the reason I can’t do
a “sin” from a prio 29 shell is that “sin” is taking forever to load -
anything that is already loaded is running fine and getting CPU even at
low priority).

This isn't because sin is taking forever to load; it's rather that someone
else is still stuck in the loader (one process can be loading/terming at any
given time) and the rest queue, namely your spawned process.

The other interesting thing is that the spawn returns immediately and
successfully, it’s just that the loader thread for the spawned process
doesn’t jump to main for an obscenely long time. Except for this slow
load, the system runs entirely as expected with no errors or other
surprising latencies.

Well, the jump isn't really the problem; rather, the loading of the image,
which requires a message to Fsys, might be causing some issue. This is done at
priority 21.

I've seen issues where Net/Socket under heavy network load would starve the
loader and the termer out, and no one could get a chance to terminate (which
also means no one gets to load since a termer is currently in progress).

I would double check that you aren’t talking to yourself, if you’re taking
over a prefix of some type (some deadlock), and also check your tracelogs,
as they could also reveal some information.

-Adam

Adam Mallory wrote:

Thanks for responding Adam.


This isn't because sin is taking forever to load; it's rather that someone
else is still stuck in the loader (one process can be loading/terming at any
given time) and the rest queue, namely your spawned process.

The point I was trying to make is that there is nothing spinning or
consuming CPU at a high priority. Processes which are memory resident
continue to execute without being significantly pre-empted regardless of
their priority. So the loader could be “stuck” (not sure what that
means), but it is not spinning (or otherwise consuming cpu time).

Well, the jump isn't really the problem; rather, the loading of the image,
which requires a message to Fsys, might be causing some issue. This is done at
priority 21.

Yes, this seems like a reasonably likely source. If it is, it is some
sort of deadlock that eventually resolves itself based on time.

I've seen issues where Net/Socket under heavy network load would starve the
loader and the termer out, and no one could get a chance to terminate (which
also means no one gets to load since a termer is currently in progress).

There is 0 network load when the problem occurs.

I would double check that you aren’t talking to yourself, if you’re taking
over a prefix of some type (some deadlock), and also check your tracelogs,
as they could also reveal some information.

Are you suggesting the possibility that the resource manager is talking to
itself (opening its own prefix)? I don't see any way for this to happen
since I have only one send in the resource manager and it is not being
called in this circumstance. Certainly the spawned process is trying to
open the resource manager prefix, but I fail to see why that should be a
problem. The process(es) is(are) being spawned NOWAIT.

If deadlock were occurring between my resource manager and itself (or one
of its spawnees), why would it resolve itself?

Thanks again for your interest.

Rennie <rallen@csical.com> wrote in message
news:bcsvkm$br$1@tiger.openqnx.com

The point I was trying to make is that there is nothing spinning or
consuming CPU at a high priority. Processes which are memory resident
continue to execute without being significantly pre-empted regardless of
their priority. So the loader could be “stuck” (not sure what that
means), but it is not spinning (or otherwise consuming cpu time).

What I meant was that someone was in the process of loading/terming and they
were not completing (for whatever reason), thus 'stuck'. The only reason
they can really become stuck is that the termination thread can't complete
(due to starvation), or the loader is waiting on something (generally the
process image).

Yes, this seems like a reasonably likely source. If it is, it is some
sort of deadlock that eventually resolves itself based on time.

This is why I suggested looking at your tracelogs (set the severity around
4-5), if it’s a timeout it probably gets tracelogged.

I've seen issues where Net/Socket under heavy network load would starve the
loader and the termer out, and no one could get a chance to terminate (which
also means no one gets to load since a termer is currently in progress).

There is 0 network load when the problem occurs.

Sure, the idea was that starving out the termination thread, will cause the
same symptoms (whatever the cause of starvation is).

Are you suggesting the possibility that the resource manager is talking to
itself (opening its own prefix)? I don't see any way for this to happen
since I have only one send in the resource manager and it is not being
called in this circumstance. Certainly the spawned process is trying to
open the resource manager prefix, but I fail to see why that should be a
problem. The process(es) is(are) being spawned NOWAIT.

When the spawned process attempts to terminate, it sends a message to
Proc, saying terminate, which starts the termination thread, which then has
the process clean up its resources, send 'close' messages for all open fds,
close all VCs, clean up proxies, and then send a message to every process
interested in process death. Finally, it sends a message to Proc saying
it's done, which never gets a reply, since we destroy whatever is left.

If deadlock were occurring between my resource manager and itself (or one
of its spawnees), why would it resolve itself?

Hard to say, what if you spawn a process which doesn’t talk to anyone else -
does the behaviour still exist?

If you spawn NOWAIT, and then do a directed waitpid, how are you responding
to the close message the spawned app will be sending you (since it opens
your prefix)?

-Adam

Adam Mallory wrote:

If you spawn NOWAIT, and then do a directed waitpid, how are you responding
to the close message the spawned app will be sending you (since it opens
your prefix)?

The directed waitpid is done in the io_close handler of the resource
manager.

Here is the actual code.

int close_msg(pid_t pid)
{
    struct ocb *ocb;
    int child_stat;

    if((int)(ocb = get_ocb(pid, io_msg.close.fd)) == -1) {
        io_msg.status = EBADF;
        return(REPLY);
    }

    if(ocb->task && ocb->task->pid == pid) {
        ocb->task->flags |= TASK_FLAG_DEFUNCT;
        ocb->task->pid = -1;
    }

    // the launch_timer is a timer that triggers a proxy that causes the
    // resource manager to launch the next (non-running) process
    if(ocb->task && reload_cnt && --reload_cnt == 0) {
        pending = -1;
        program_launch_timer(2500);
    } else if(ocb->task && !reload_cnt) {
        program_launch_timer(100);
    }

    ocb->count--;
    Log(LOG_DEBUG, true, "Close pid=%d ocb=%p count=%d\n", pid, ocb,
        ocb->count);

    if(!(ocb->count)) {
        free(ocb);
    }

    map_ocb(pid, io_msg.close.fd, NULL);

    io_msg.close_reply.status = EOK;

    Reply(pid, &io_msg, sizeof(io_msg));

    waitpid(pid, &child_stat, 0);
    Log(LOG_DEBUG, true, "pid %d exited with status %d\n", pid,
        child_stat);

    return(NO_REPLY);
}

This assumes that once the close() comes in, the exit status is available
now or will be shortly (the Reply does happen before the waitpid
for cases where the spawnee is actually performing the close itself).

When the problem occurs, the resource manager continues to run fine (in
fact it is continuously spewing log messages regarding how long it is
taking the spawnee to "register" with it - i.e. do an open()), so
deadlock of the resource manager is out of the question in this particular
case.

Basically all that happens is:

  1. resource manager reads config file into internal struct
  2. resource manager traverses internal struct spawning processes on each
    iter.
  3. resource manager Receive() blocks
  4. resource manager processes IO msgs and proxies.
    4a. if message is open() resource manager marks spawnee up
    4b. if message is close() resource manager collects exit status and
    marks spawnee down
    4c. if launch_proxy, search internal struct for non-running process, and
    re-spawn.
  5. goto 3.

Rennie

ok, thanks for the info. Have you got your tracelog yet, or tried spawning
a different (trivial) application to see if the behaviour reoccurs?

-Adam


Adam Mallory wrote:


ok, thanks for the info. Have you got your tracelog yet, or tried spawning
a different (trivial) application to see if the behaviour reoccurs?

Adam,

Sorry, I didn’t work Friday, so it took a while to get back to you.

Here is a tracelog from when the problem occurred. The situation "clears
up" when the "Kick" happens.

The "pause" started at this point (i.e. animate is the spawnee of the
resource manager which took a long time to load):

Jun 23 04:02:50 6 00001020 Spawn pid 5959 ruid 0 euid 0 file
(//1/fcu/bin/animate)

and resolved itself at this point:

Jun 23 04:03:36 2 0000100c Kick 5963 152 2 //1/bin/cp

The cp is being spawned by a spawnee of the resource manager. I replaced
the spawn of cp with code to do the copy, and the problem does not occur,
so it seems related to the spawnee of the resource manager doing a spawn.
I don't understand why this should be a problem - any ideas?


Jun 23 04:02:39 6 00006008 message from pid=5921 type 105 status 89
Jun 23 04:02:39 6 00001020 Spawn pid 5941 ruid 0 euid 0 file (//1/bin/cp)
Jun 23 04:02:39 6 00006008 message from pid=5941 type 310 status 89
Jun 23 04:02:39 6 00006008 message from pid=5941 type 310 status 89
Jun 23 04:02:39 6 00001020 Spawn pid 5942 ruid 0 euid 0 file
(//1/fcu/bin/animate)
Jun 23 04:02:40 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:40 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:40 6 00006008 message from pid=5941 type 105 status 89
Jun 23 04:02:41 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:41 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:42 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:42 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:43 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:43 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:44 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:44 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:44 6 00001020 Spawn pid 5953 ruid 0 euid 0 file (//1/bin/sin)
Jun 23 04:02:45 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:45 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:46 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:46 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:47 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:47 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:48 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:48 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:49 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:49 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:49 6 00006008 message from pid=5942 type 105 status 89
Jun 23 04:02:50 6 00001020 Spawn pid 5959 ruid 0 euid 0 file
(//1/fcu/bin/animate)
Jun 23 04:02:50 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:50 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:51 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:51 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:52 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:52 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:53 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:53 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:54 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:54 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:55 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:55 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:56 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:56 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:57 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:57 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)

… more of the same …

Jun 23 04:03:33 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:03:33 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:03:34 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:03:34 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:03:35 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:03:35 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:03:36 2 0000100c Kick 5963 152 2 //1/bin/cp
Jun 23 04:03:36 6 00001020 Spawn pid 6014 ruid 0 euid 0 file (//1/bin/sin)
Jun 23 04:03:36 6 00001020 Spawn pid 6019 ruid 0 euid 0 file (//1/bin/sh)
Jun 23 04:03:36 6 00001020 Spawn pid 6020 ruid 0 euid 0 file (//1/bin/rtc)
Jun 23 04:03:36 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:03:36 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:03:37 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:03:37 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:03:38 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:03:38 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:03:39 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:03:39 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:03:40 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:03:40 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:03:41 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.

Rennie <rallen@csical.com> wrote in message
news:bd7g8c$qib$1@tiger.openqnx.com

Here is a tracelog from when the problem occurred. The situation "clears
up" when the "Kick" happens.

The "pause" started at this point (i.e. animate is the spawnee of the
resource manager which took a long time to load):

Jun 23 04:02:50 6 00001020 Spawn pid 5959 ruid 0 euid 0 file
(//1/fcu/bin/animate)

and resolved itself at this point:

Jun 23 04:03:36 2 0000100c Kick 5963 152 2 //1/bin/cp

The cp is being spawned by a spawnee of the resource manager. I replaced
the spawn of cp with code to do the copy, and the problem does not occur,
so it seems related to the spawnee of the resource manager doing a spawn.
I don't understand why this should be a problem - any ideas?

Well, the kick message is saying that the process we're kicking (i.e. the
one that won't get out of the loader/termer) is SEND blocked. The code
from the kick is (pid, blocked-on process entry, state) - note that the
process entry is just the address of the process in the process table and
not the pid (use Debugger32 to map the process entry back to a pid, if
you're keen). The reason it does 'unstick' is because we'll force it
ready with EINTR being the return value. So it would look like cp is SEND
blocked on someone - what are you copying?

Also, how are you spawning ‘cp’ (using system() or calling spawn*()
yourself)?

-Adam






Jun 23 04:02:39 6 00006008 message from pid=5921 type 105 status 89
Jun 23 04:02:39 6 00001020 Spawn pid 5941 ruid 0 euid 0 file (//1/bin/cp)
Jun 23 04:02:39 6 00006008 message from pid=5941 type 310 status 89
Jun 23 04:02:39 6 00006008 message from pid=5941 type 310 status 89
Jun 23 04:02:39 6 00001020 Spawn pid 5942 ruid 0 euid 0 file
(//1/fcu/bin/animate)
Jun 23 04:02:40 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:40 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:40 6 00006008 message from pid=5941 type 105 status 89
Jun 23 04:02:41 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:41 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:42 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:42 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:43 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:43 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:44 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:44 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:44 6 00001020 Spawn pid 5953 ruid 0 euid 0 file (//1/bin/sin)
Jun 23 04:02:45 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:45 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:46 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:46 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:47 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:47 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:48 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:48 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:49 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:49 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:49 6 00006008 message from pid=5942 type 105 status 89
Jun 23 04:02:50 6 00001020 Spawn pid 5959 ruid 0 euid 0 file
(//1/fcu/bin/animate)
Jun 23 04:02:50 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:50 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:51 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:51 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:52 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:52 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:53 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:53 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:54 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:54 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:55 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:55 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:56 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:56 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:02:57 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:02:57 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)

… more of the same …

Jun 23 04:03:33 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:03:33 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:03:34 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:03:34 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:03:35 6 00004023 0 (035) NET Logical=2 - Node not in
nodemap.
Jun 23 04:03:35 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach
ctrl pkt)
Jun 23 04:03:36 2 0000100c Kick 5963 152 2 //1/bin/cp
Jun 23 04:03:36 6 00001020 Spawn pid 6014 ruid 0 euid 0 file (//1/bin/sin)
Jun 23 04:03:36 6 00001020 Spawn pid 6019 ruid 0 euid 0 file (//1/bin/sh)
Jun 23 04:03:36 6 00001020 Spawn pid 6020 ruid 0 euid 0 file (//1/bin/rtc)
Jun 23 04:03:36 6 00004023 0 (035) NET Logical=2 - Node not in nodemap.
Jun 23 04:03:36 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach ctrl pkt)
Jun 23 04:03:37 6 00004023 0 (035) NET Logical=2 - Node not in nodemap.
Jun 23 04:03:37 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach ctrl pkt)
Jun 23 04:03:38 6 00004023 0 (035) NET Logical=2 - Node not in nodemap.
Jun 23 04:03:38 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach ctrl pkt)
Jun 23 04:03:39 6 00004023 0 (035) NET Logical=2 - Node not in nodemap.
Jun 23 04:03:39 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach ctrl pkt)
Jun 23 04:03:40 6 00004023 0 (035) NET Logical=2 - Node not in nodemap.
Jun 23 04:03:40 6 00004007 0 (007) NET ( tx) Logical=2 failed (vc_attach ctrl pkt)
Jun 23 04:03:41 6 00004023 0 (035) NET Logical=2 - Node not in nodemap.


Adam Mallory wrote:

Rennie <rallen@csical.com> wrote in message
news:bd7g8c$qib$1@tiger.openqnx.com…


Well the kick message is saying that the person we’re kicking (i.e. the
process that won’t get out of the loader/termer) is SEND blocked. The code
from the kick is (pid, blocked-on process entry, state) - note that process
entry is just the address of the process in the process table and not the
pid (use the Debugger32 to map the process entry back to a pid, if
you’re keen). The reason it does ‘unstick’ is because we’ll force them
ready with EINTR being the return value. So it would look like cp is SEND
blocked on someone - what are you copying?
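The “forced ready with EINTR” behaviour Adam describes matches the contract POSIX callers see when a blocked call is interrupted: the call fails with EINTR and the caller decides whether to retry. QNX4’s Send() isn’t available outside QNX, so here is a minimal portable sketch of the same retry idiom using read(); the function name is my own:

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Retry a blocking call that may be "kicked" out with EINTR, the way
   Proc forces a stuck SEND-blocked process READY. Portable sketch;
   QNX4's Send() returns EINTR to its caller under the same conditions. */
ssize_t read_retry(int fd, void *buf, size_t n)
{
    ssize_t r;
    do {
        r = read(fd, buf, n);           /* may fail with errno == EINTR */
    } while (r == -1 && errno == EINTR);
    return r;                           /* real result or real error */
}
```

The loop only swallows EINTR; any other error is passed through to the caller.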

I am copying a file from /tmp to another directory on the same filesystem
(i.e. only Fsys involved - none of this goes anywhere near the resource
manager prefix, if that’s what you’re wondering).

If I replace the call to spawn() with code that simply does the copy
internally to the spawnee the problem does not occur. If the problem were
external to spawn, why wouldn’t the (former) spawner become send blocked
when the copy is implemented internally ?

cp should talk to no processes other than Fsys.

The issue appears to be strictly related to spawning from a process that
is spawned from my resource manager. If I hand issue the cp from a shell
(i.e. a completely different process tree), or (as I mentioned) perform
the copy without a spawn, then everything works as expected; in fact,
since having the spawner perform the copy internally is actually
preferable, everything works fine now, and is bullet proof (as verified
from reasonably extensive testing).

However, I do not understand why a process spawned by my resource manager
cannot itself (consistently) spawn successfully, and I find this
disturbing.
I am sure there is something I am missing, but I can’t figure out what it is…

Also, how are you spawning ‘cp’ (using system() or calling spawn*()
yourself)?

As implied above the spawn was a call to spawnlp.

Rennie

Rennie <rallen@csical.com> wrote in message
news:bd7ur6$rhi$1@tiger.openqnx.com

I am copying a file from /tmp to another directory on the same filesystem
(i.e. only Fsys involved - none of this goes anywhere near the resource
manager prefix, if that’s what you’re wondering).

Nope, just telling you what the trace message is saying.

If I replace the call to spawn() with code that simply does the copy
internally to the spawnee the problem does not occur. If the problem were
external to spawn, why wouldn’t the (former) spawner become send blocked
when the copy is implemented internally ?

cp should talk to no processes other than Fsys.

Sure is strange.

The issue appears to be strictly related to spawning from a process that
is spawned from my resource manager. If I hand issue the cp from a shell
(i.e. a completely different process tree), or (as I mentioned) perform
the copy without a spawn, then everything works as expected; in fact,
since having the spawner perform the copy internally is actually
preferable, everything works fine now, and is bullet proof (as verified
from reasonably extensive testing).

However, I do not understand why a process spawned by my resource manager
cannot itself (consistently) spawn successfully, and I find this
disturbing.
I am sure there is something I am missing, but I can’t figure out what it
is…

I would double check spawn flags, and spawn parameters; since your resmgr is
spawning apps which open on the resmgr prefix, and then spawn cp, there
could be inheritance issues we’ve overlooked here.

Also, how are you spawning ‘cp’ (using system() or calling spawn*()
yourself)?

As implied above the spawn was a call to spawnlp.

An app which spawns a program which spawns shouldn’t be a problem - if it
was a simple test case would show that.

I took a look through the history, but couldn’t find the info on sin ver -
can you post that, sorry if you’ve done it already.

-Adam

Rennie wrote:

Adam Mallory wrote:

Rennie <rallen@csical.com> wrote in message
news:bd7g8c$qib$1@tiger.openqnx.com…


Well the kick message is saying that the person we’re kicking (i.e. the
process that won’t get out of the loader/termer) is SEND blocked. The
code from the kick is (pid, blocked-on process entry, state) - note that
process entry is just the address of the process in the process table and
not the pid (use the Debugger32 to map the process entry back to a pid, if
you’re keen). The reason it does ‘unstick’ is because we’ll force them
ready with EINTR being the return value. So it would look like cp is
SEND blocked on someone - what are you copying?

I am copying a file from /tmp to another directory on the same filesystem
(i.e. only Fsys involved - none of this goes anywhere near the resource
manager prefix, if that’s what you’re wondering).

If I replace the call to spawn() with code that simply does the copy
internally to the spawnee the problem does not occur. If the problem were
external to spawn, why wouldn’t the (former) spawner become send blocked
when the copy is implemented internally ?

cp should talk to no processes other than Fsys.

The issue appears to be strictly related to spawning from a process that
is spawned from my resource manager. If I hand issue the cp from a shell
(i.e. a completely different process tree), or (as I mentioned) perform
the copy without a spawn, then everything works as expected; in fact,
since having the spawner perform the copy internally is actually
preferable, everything works fine now, and is bullet proof (as verified
from reasonably extensive testing).

From my experience, process flags are inherited across fork() calls and
maybe across exec() too. Yeah, that nice _PPF_SERVER | _PPF_INFORM and so
on.

So, if you modify process flags in your resource manager (likely adding
_PPF_SERVER | _PPF_DEATH for example) your spawned cp is running with
them. Are you sure that ordinary cp is _PPF_XXX aware… ? :-) I have seen
this behaviour too when spawning 3rd-party applications from a “server”
process. Usually it’s a bad idea.

// wrb
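Ian’s warning is QNX4-specific (the _PPF_* flags don’t exist elsewhere), but the underlying hazard - per-process attributes silently riding across fork()/spawn into third-party children - is easy to demonstrate portably. A minimal sketch using umask as a stand-in for a process flag; the function name is mine, not from the thread:

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set attribute `m` in the "server", fork a child (stand-in for the
   spawned cp), and return the attribute the child observes. The child
   inherits the umask without ever asking for it - the same silent
   inheritance Ian describes for _PPF_* flags. */
int child_sees_umask(mode_t m)
{
    mode_t saved = umask(m);            /* attribute set by the "server" */
    pid_t pid = fork();
    if (pid < 0) {
        umask(saved);
        return -1;
    }
    if (pid == 0)
        _exit((int)umask(0));           /* child reports what it inherited */

    int status;
    waitpid(pid, &status, 0);           /* targeted wait, as in the resmgr */
    umask(saved);                       /* restore the server's attribute */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child never calls umask() to set anything, yet it runs with the parent’s value - exactly why resetting server-only flags before spawning third-party code is the safer habit.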

Ian Zagorskih wrote:

From my experience, process flags are inherited across fork() calls and
maybe across exec() too. Yeah, that nice _PPF_SERVER | _PPF_INFORM and so
on.

So, if you modify process flags in your resource manager (likely adding
_PPF_SERVER | _PPF_DEATH for example) your spawned cp is running with
them. Are you sure that ordinary cp is _PPF_XXX aware… ? :-) I have seen
this behaviour too when spawning 3rd-party applications from a “server”
process. Usually it’s a bad idea.

Yeah, that’s what I was thinking earlier also. I have never had
_PPF_INFORM, but I did have _PPF_SERVER; so I took _PPF_SERVER off (in the
resmgr), but it did not make a difference; the problem still occurred with
the same frequency.

Rennie

Adam Mallory wrote:

Sure is strange.

Yup.

I would double check spawn flags, and spawn parameters; since your resmgr is
spawning apps which open on the resmgr prefix, and then spawn cp, there
could be inheritance issues we’ve overlooked here.

That’s where I think the problem is also, but I can’t figure where
exactly…

An app which spawns a program which spawns shouldn’t be a problem - if it
was a simple test case would show that.

This is a reasonably simple test case (this program isn’t very big). In
fact, this is basically the iomanager.tgz from qnx.com, with about 200
lines of code added to it. It is almost as simple as you could get a
resource manager spawning processes example. I could supply the code if
you would like to try it ?

I took a look through the history, but couldn’t find the info on sin ver -
can you post that, sorry if you’ve done it already.

/hd/boot/sys/Proc32 Proc 4.24H Nov 10 1997
/hd/boot/sys/Proc32 Slib16 4.23G Oct 04 1996
/hd/boot/sys/Slib32 Slib32 4.24A Feb 04 1997
/hd/bin/Fsys Fsys32 4.24K Aug 20 1997
/hd/bin/Fsys Floppy 4.24B Aug 19 1997
/hd/bin/Fsys IDE 4.24B Jun 09 1998
//1/bin/Fsys Fsys32 4.24K Aug 20 1997
//1/bin/Fsys.eide eide 4.25A Feb 09 2000
//1/bin/Dev Dev32 4.23G Oct 04 1996
//1/bin/Dev.con Dev32.ansi 4.23H Nov 21 1996
//1/bin/Dev.pty Dev32.pty 4.23G Oct 04 1996
//1/bin/Pipe Pipe 4.23A Feb 26 1996
//1/bin/Net Net 4.25C Aug 30 1999
//1/bin/Net.ether905 Net.ether905 4.24J Feb 17 2000
//1/usr/ucb/Tcpip Tcpip 5.00A Jan 26 2001
//1/bin/cron cron 4.23B Oct 30 1997
//1/fcu/bin/fcud FCU 33.22A Jun 23 2003
^^^^^^^^^^^^^^^^

This is my resource manager (no jokes about how transposing the letters c
and u in the name explains why it doesn’t work, please - I’ve already heard
those from my coworkers here ;-)

Rennie

Rennie <rallen@csical.com> wrote in message
news:bd9nf8$v1a$1@tiger.openqnx.com

That’s where I think the problem is also, but I can’t figure where
exactly…

Can you post the spawn args that you use to spawn the spawnee, as well as
the args you use in the spawnee to spawn ‘cp’.

This is a reasonably simple test case (this program isn’t very big). In
fact, this is basically the iomanager.tgz from qnx.com, with about 200
lines of code added to it. It is almost as simple as you could get a
resource manager spawning processes example. I could supply the code if
you would like to try it ?

Once you post the spawn arguments, if nothing pops out, I’ll make a
testcase. If that testcase doesn’t recreate the issue for me I’ll ask for
yours.

/hd/boot/sys/Proc32 Proc 4.24H Nov 10 1997
/hd/boot/sys/Proc32 Slib16 4.23G Oct 04 1996
/hd/boot/sys/Slib32 Slib32 4.24A Feb 04 1997
/hd/bin/Fsys Fsys32 4.24K Aug 20 1997
/hd/bin/Fsys Floppy 4.24B Aug 19 1997
/hd/bin/Fsys IDE 4.24B Jun 09 1998
//1/bin/Fsys Fsys32 4.24K Aug 20 1997
//1/bin/Fsys.eide eide 4.25A Feb 09 2000
//1/bin/Dev Dev32 4.23G Oct 04 1996
//1/bin/Dev.con Dev32.ansi 4.23H Nov 21 1996
//1/bin/Dev.pty Dev32.pty 4.23G Oct 04 1996
//1/bin/Pipe Pipe 4.23A Feb 26 1996
//1/bin/Net Net 4.25C Aug 30 1999
//1/bin/Net.ether905 Net.ether905 4.24J Feb 17 2000
//1/usr/ucb/Tcpip Tcpip 5.00A Jan 26 2001
//1/bin/cron cron 4.23B Oct 30 1997
//1/fcu/bin/fcud FCU 33.22A Jun 23 2003
^^^^^^^^^^^^^^^^

This is my resource manager (no jokes about how transposing the letters c
and u in the name explains why it doesn’t work, please - I’ve already heard
those from my coworkers here ;-)

No jokes ;-) but you are running a rather old version of QNX4 Proc32. I’m
not pinning it on that, obviously; it’s just an observation.

-Adam

Adam Mallory wrote:

Rennie <rallen@csical.com> wrote in message
news:bd9nf8$v1a$1@tiger.openqnx.com…

That’s where I think the problem is also, but I can’t figure where
exactly…

Can you post the spawn args that you use to spawn the spawnee, as well as
the args you use in the spawnee to spawn ‘cp’.

In the resource manager (spawner):

spawnv(P_NOWAIT, (const char *)argv[0], (const char **)argv);

where:
argv[0] is the pathname of the executable
argv[1] contains NULL (in this particular case)

in the spawnee:

spawnlp(P_WAIT, “cp”, “cp”,
“/tmp/download.tmp”, “/fcu/data/download.img”, NULL);

Just to re-hash: the spawn of cp needs to be “kicked” approximately one
time in ten. With the exception of the pause and the subsequent kick,
everything is fine.

No jokes > :wink: > but you are running a rather old version of QNX4 Proc32. I’m
not pinning it on that obviously, it’s just an observation.

Yes. We are using 4.24 because of a bug (documented by QSSL) in Proc32 for
4.25.

The bug is the one regarding referencing shared memory after it is
unlinked, allowing writes to other processes’ data or code (if, for
instance, that memory is recycled as a code segment for another process).
I personally know of no case where our code is doing such a thing;
however, we have been running 4.24 for a long time with no (known) problems,
so we are reluctant to change (unless there is a good reason to do so).
Even if this is a known bug with 4.24, I doubt we would change because of
this (the software is already ported to QNX6); I simply would like to know
that it is not I who is doing something wrong (since there might be
other, more evil behavior lurking in that mistake).
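For reference, the pattern that bug concerned: under correct POSIX semantics a mapping stays valid after its name is unlinked, and the documented 4.25 problem was that such memory could be recycled while still referenced. A portable sketch of the unlink-while-mapped pattern, using a plain file rather than QNX shared memory (the path and function name are just examples):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one byte of a file, unlink the name, then read through the
   mapping. POSIX guarantees the mapping survives the unlink; the
   QNX4 4.25 Proc32 bug discussed above broke this guarantee for
   shared memory, which is why the poster stayed on 4.24. */
int mapped_after_unlink(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, "Z", 1) != 1) {
        close(fd);
        return -1;
    }
    char *p = mmap(NULL, 1, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    unlink(path);                       /* name gone, mapping remains */
    if (p == MAP_FAILED)
        return -1;
    int c = p[0];                       /* still readable after unlink */
    munmap(p, 1);
    return c;
}
```

On a correct implementation the byte written before the unlink is still readable through the mapping afterwards.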

Rennie

Adam Mallory wrote:

pid (use the Debugger32 to map the process entry back to a pid, if
you’re keen). The reason it does ‘unstick’ is because we’ll force them

I’m keen. I have the Debugger in the image, and I know the keychord to
get in there; how do I walk the process entries (I haven’t done this for
about six years)?

I think I can just use /P with each of my pids until I get a process entry
with the same address as that in the tracelog, is there a better way ?

Rennie