Resource Manager thread problems

I have a bug in my resource manager which I am unable to fix.
It is repeatable and looks like interaction between threads.

The RM uses 4 threads to handle user requests and has an extra
thread running in the background checking internal structures for
consistency. Ironically, this thread seems to be the problem.
(If I disable it the problem goes.)

The RM crashes with SIGSEGV accessing an invalid pointer. The
stack trace shows that a function is executing which is only ever
called by the checking thread. The backtrace in gdb shows an
impossible path starting at the resource manager dispatcher,
through io_read or io_write directly to the checking function, which
in reality it does not call. The stack is evidently corrupted.

The checking thread runs a function like the following and
fails in check():

int check( object, variable )
{
    dereference object and variable
}

int object_scan( scan_callback_t callback )
{
    for( all objects and no error )
        for( all variables and no error )
            error = callback( object, variable );
    return error;
}

void * verify( void * arg )
{
    while( object_scan( check ) != error ) /* always */ ;
}
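
For readers who want something compilable, the scan loop above might look roughly like the sketch below. The post does not show the real structures, so object_t, variable_t and their fields are hypothetical, and object_scan() here takes the object array as explicit parameters instead of using whatever global state the real RM has.

```c
#include <stddef.h>

/* Hypothetical types: the real object/variable structures are not shown. */
typedef struct { const char *name; const int *addr; } variable_t;
typedef struct { const char *name; const variable_t *vars; size_t nvars; } object_t;

typedef int (*scan_callback_t)(const object_t *obj, const variable_t *var);

/* Returns 0 if the variable looks sane, non-zero otherwise. */
int check(const object_t *obj, const variable_t *var)
{
    if (obj == NULL || var == NULL || var->name == NULL || var->addr == NULL)
        return -1;
    return 0;
}

/* Visit every variable of every object; stop at the first error. */
int object_scan(const object_t *objs, size_t nobjs, scan_callback_t callback)
{
    int error = 0;
    for (size_t i = 0; i < nobjs && error == 0; i++)
        for (size_t j = 0; j < objs[i].nvars && error == 0; j++)
            error = callback(&objs[i], &objs[i].vars[j]);
    return error;
}
```

The checking thread would then call object_scan() in a loop, as verify() does in the pseudocode.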

The check function appears in the stack trace after
the io_read call. I can understand how check() could end
up on the stack, as it is passed as a parameter. But if I
add another stage, check_1() which just calls check() and
pass check_1 in verify, check() still appears on the stack
trace but check_1() doesn’t. Yet check_1() is what goes onto
the stack.

This has me confused.

I don’t imagine anyone can solve this for me, but any ideas
of how to approach the problem would be appreciated.
Thanks in advance
Regards
William Morris

William Morris <william@bangel.demon.co.uk> wrote:

I have a bug in my resource manager which I am unable to fix.
It is repeatable and looks like interaction between threads.

The RM uses 4 threads to handle user requests and has an extra
thread running in the background checking internal structures for
consistency. Ironically, this thread seems to be the problem.
(If I disable it the problem goes.)

Problem solved! :-)
See below for actual helpful hints, though :-)

The RM crashes with SIGSEGV accessing an invalid pointer. The
stack trace shows that a function is executing which is only ever
called by the checking thread. The backtrace in gdb shows an
impossible path starting at the resource manager dispatcher,
through io_read or io_write directly to the checking function, which
in reality it does not call. The stack is evidently corrupted.

The checking thread runs a function like the following and
fails in check():

int check( object, variable )
{
    dereference object and variable
}

int object_scan( scan_callback_t callback )
{
    for( all objects and no error )
        for( all variables and no error )
            error = callback( object, variable );
    return error;
}

void * verify( void * arg )
{
    while( object_scan( check ) != error ) /* always */ ;
}

The check function appears in the stack trace after
the io_read call. I can understand how check() could end
up on the stack, as it is passed as a parameter. But if I
add another stage, check_1() which just calls check() and
pass check_1 in verify, check() still appears on the stack
trace but check_1() doesn’t. Yet check_1() is what goes onto
the stack.

This has me confused.

I don’t imagine anyone can solve this for me, but any ideas
of how to approach the problem would be appreciated.

I think this is one place where the printf() debugger really shines
through. Have it generate a log of which objects and which variables
are being scanned. Have it generate another log of when a thread
processes a message, and when it’s done. Print out key variables.
Redirect the whole mess to a logfile if it gets to be too big (the
problem there, though, is that stdout doesn’t get flushed nicely
on a SIGSEGV).

I’ve used a similar technique on a few resmgrs lately, and in all
cases it’s just been a lot easier than slogging through a GDB session.
YMMV.
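
One way around the flushing problem mentioned above is to bypass stdio buffering altogether. The sketch below is only an illustration: log_line() is a hypothetical helper, not anything from the RM in question. It formats each line into a local buffer and passes it to write() in a single call, so no log text is left sitting in a stdio buffer if the process dies on a SIGSEGV.

```c
#include <stdarg.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: format into a local buffer, then hand the bytes
 * straight to write() so nothing sits in a stdio buffer when a crash hits.
 * Returns the number of bytes written, or -1 on error. */
int log_line(int fd, const char *fmt, ...)
{
    char buf[256];
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(buf, sizeof buf, fmt, ap);
    va_end(ap);

    if (n < 0)
        return -1;
    if ((size_t)n >= sizeof buf)      /* output was truncated to fit buf */
        n = (int)sizeof buf - 1;

    return (int)write(fd, buf, (size_t)n);
}
```

A call might look like log_line(fd, "scan obj=%d var=%d\n", i, j), with fd obtained from open() on the logfile. Calling setvbuf(stdout, NULL, _IONBF, 0) at startup, or logging to stderr, are simpler alternatives if the extra system calls are acceptable.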

BTW, are all your objects protected from each other properly with
mutexes etc.? Just checking…

Cheers,
-RK


Robert Krten, PARSE Software Devices +1 613 599 8316.
Realtime Systems Architecture, Books, Video-based and Instructor-led
Training and Consulting at www.parse.com.
Email my initials at parse dot com.

Robert Krten wrote:

I think this is one place where the printf() debugger really shines
through. Have it generate a log of which objects and which variables
are being scanned. Have it generate another log of when a thread
processes a message, and when it’s done. Print out key variables.
Redirect the whole mess to a logfile if it gets to be too big (the
problem there, though, is that stdout doesn’t get flushed nicely
on a SIGSEGV).

I’ve used a similar technique on a few resmgrs lately, and in all
cases it’s just been a lot easier than slogging through a GDB session.
YMMV.

BTW, are all your objects protected from each other properly with
mutexes etc.? Just checking…

Robert

Thanks for replying. I will try the printf method, although I am not
sure whether I will see the error - I provoke it by bombarding the RM
with requests and running the check thread at a high rate. I guess this
increases the probability of whatever conflict happening. printfs tend
to slow things down to a degree that may stop the error. Anyway I will
try.

Regarding mutexes, my object structures (objects are attached as
directories) contain an array of variable structures and each variable
structure contains the details (name, address, DSP number, QNX RM
attribute struct pointer) of the resource it relates to. The structures
contain context information which is read from (and only available from)
the hardware at boot time. The RM makes the context info available
through devctl on each resource. The main read/write handlers of the
resource access the actual hardware address. At this code level there is
a mutex per DSP. It was whilst running a conflict-test program to test
whether I got this right that the problems surfaced. However, the test
fails also when I run with the hardware replaced by a simulation of the
real objects/variables. The simulation also uses a mutex per ‘DSP’. Note
that the RM has an additional single write protection mutex: only one
resource may be written at a time but many may be read; the check
function does no writes.

Because all access to the object/variable structures after
initialisation is read-only, except for hardware access through the
contained info and updates on the attribute structures, I am inclined to
think that the problem is elsewhere.
However, I think you have much more experience of this than I, so please
correct me if I missed something.

Thanks again.
Regards
William Morris

William Morris <william@bangel.demon.co.uk> wrote:


Robert Krten wrote:
I think this is one place where the printf() debugger really shines
through. Have it generate a log of which objects and which variables
are being scanned. Have it generate another log of when a thread
processes a message, and when it’s done. Print out key variables.
Redirect the whole mess to a logfile if it gets to be too big (the
problem there, though, is that stdout doesn’t get flushed nicely
on a SIGSEGV).

I’ve used a similar technique on a few resmgrs lately, and in all
cases it’s just been a lot easier than slogging through a GDB session.
YMMV.

BTW, are all your objects protected from each other properly with
mutexes etc.? Just checking…

Robert

Thanks for replying. I will try the printf method, although I am not
sure whether I will see the error - I provoke it by bombarding the RM
with requests and running the check thread at a high rate. I guess this
increases the probability of whatever conflict happening. printfs tend
to slow things down to a degree that may stop the error. Anyway I will
try.

In that case you’ll want to redirect to a file, perhaps even a ramdisk :-)

Regarding mutexes, my object structures (objects are attached as
directories) contain an array of variable structures and each variable
structure contains the details (name, address, DSP number, QNX RM
attribute struct pointer) of the resource it relates to. The structures
contain context information which is read from (and only available from)
the hardware at boot time. The RM makes the context info available
through devctl on each resource. The main read/write handlers of the
resource access the actual hardware address. At this code level there is
a mutex per DSP. It was whilst running a conflict-test program to test
whether I got this right that the problems surfaced. However, the test
fails also when I run with the hardware replaced by a simulation of the
real objects/variables. The simulation also uses a mutex per ‘DSP’. Note
that the RM has an additional single write protection mutex: only one
resource may be written at a time but many may be read; the check
function does no writes.

Because all access to the object/variable structures after
initialisation is read-only, except for hardware access through the
contained info and updates on the attribute structures, I am inclined to
think that the problem is elsewhere.
However, I think you have much more experience of this than I, so please
correct me if I missed something.

It sounds like everything is in order. Do you ever delete any objects?
Or is this more of a “create everything once and then proceed” kind of
operation? The reason I’m asking is 'cuz I’ve had problems where I
deleted objects in one resmgr outcall only to have the resmgr framework
decide that it wanted to do a final (I forget, I think it was) io_close_dup()
on it :-)

One other thing that might be a problem – did you take care in your devctl
handler to make sure you have sufficient buffer space for the reply message
(assuming it’s bigger than a few dozen bytes)?

Apart from that, I’d prolly need to see the source…

Cheers,
-RK


Robert Krten, PARSE Software Devices +1 613 599 8316.
Realtime Systems Architecture, Books, Video-based and Instructor-led
Training and Consulting at www.parse.com.
Email my initials at parse dot com.

William Morris <william@bangel.demon.co.uk> wrote:

Robert Krten wrote:
It sounds like everything is in order. Do you ever delete any objects?
Or is this more of a “create everything once and then proceed” kind of
operation? The reason I’m asking is 'cuz I’ve had problems where I
deleted objects in one resmgr outcall only to have the resmgr framework
decide that it wanted to do a final (I forget, I think it was) io_close_dup()
on it :-)

No, nothing is ever deleted. It is just set up and must then
run for months or, better, years.

One other thing that might be a problem – did you take care in your devctl
handler to make sure you have sufficient buffer space for the reply message
(assuming it’s bigger than a few dozen bytes)?

I have my own buffer of 100 bytes and all handlers use the size
of the buffer and snprintf or strncpy. I don’t have any sprintf
calls anywhere and only a few strcat/strcpy - I guess I should
remove these too.

I did wonder whether the problem was stack overflow, but adding
a 64k stackguard to each thread has no effect.

I decided that, since the check function is checking structures
that don’t change, I will place the structures and related info
in their own section, align to __PAGE_SIZE and mprotect them
using PROT_READ. That should give me confidence that they are
never modified and I don’t need to run the checker thread.

As a solution, I think that is solid; better than the original code.
But I have a nasty feeling that the original bug may still be there

I agree with you there – if a thread can’t read it to verify
integrity, then you will most likely have another bug somewhere…
If this has to be a highly-available application I’d try and
find the root cause :-)

Cheers,
-RK


Robert Krten, PARSE Software Devices +1 613 599 8316.
Realtime Systems Architecture, Books, Video-based and Instructor-led
Training and Consulting at www.parse.com.
Email my initials at parse dot com.

Robert Krten wrote:

It sounds like everything is in order. Do you ever delete any objects?
Or is this more of a “create everything once and then proceed” kind of
operation? The reason I’m asking is 'cuz I’ve had problems where I
deleted objects in one resmgr outcall only to have the resmgr framework
decide that it wanted to do a final (I forget, I think it was) io_close_dup()
on it :-)

No, nothing is ever deleted. It is just set up and must then
run for months or, better, years.

One other thing that might be a problem – did you take care in your devctl
handler to make sure you have sufficient buffer space for the reply message
(assuming it’s bigger than a few dozen bytes)?

I have my own buffer of 100 bytes and all handlers use the size
of the buffer and snprintf or strncpy. I don’t have any sprintf
calls anywhere and only a few strcat/strcpy - I guess I should
remove these too.

I did wonder whether the problem was stack overflow, but adding
a 64k stackguard to each thread has no effect.
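
For reference, a guard area of this kind is normally requested through the thread attributes before the thread is created. The sketch below assumes plain POSIX threads; start_guarded_thread() and worker() are hypothetical names, not taken from the RM. Bear in mind that a guard area only traps an overflow that runs linearly off the end of the stack; a stray write elsewhere in a stack frame goes unnoticed.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical thread body used only to demonstrate the helper. */
void *worker(void *arg)
{
    return arg;
}

/* Hypothetical helper: start a thread whose stack has a guard area of
 * guard_bytes. Returns 0 on success or an errno-style code. */
int start_guarded_thread(pthread_t *tid, size_t guard_bytes,
                         void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_attr_setguardsize(&attr, guard_bytes);
    if (rc == 0)
        rc = pthread_create(tid, &attr, fn, arg);

    pthread_attr_destroy(&attr);
    return rc;
}
```

A 64k guard as described above would be start_guarded_thread(&tid, 64 * 1024, worker, arg).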

I decided that, since the check function is checking structures
that don’t change, I will place the structures and related info
in their own section, align to __PAGE_SIZE and mprotect them
using PROT_READ. That should give me confidence that they are
never modified and I don’t need to run the checker thread.

As a solution, I think that is solid; better than the original code.
But I have a nasty feeling that the original bug may still be there
somewhere but not show itself in the same way.
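
The read-only protection idea can be tried out in a few lines. Note that the sketch below is not the linker-section approach described above: as a simpler stand-in it copies the table into its own anonymous mapping and write-protects that. make_readonly_copy() is a hypothetical name, and MAP_ANON is assumed to be available (it is on QNX and on Linux).

```c
#define _DEFAULT_SOURCE   /* glibc: expose MAP_ANON under strict -std= modes */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: copy a table into its own page-aligned mapping and
 * make that mapping read-only. Returns the protected copy, or NULL. */
const void *make_readonly_copy(const void *src, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || len == 0)
        return NULL;

    /* Round the length up to a whole number of pages. */
    size_t maplen = (len + (size_t)page - 1) / (size_t)page * (size_t)page;

    void *p = mmap(NULL, maplen, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    memcpy(p, src, len);

    /* From here on, any write to the copy raises SIGSEGV at the culprit. */
    if (mprotect(p, maplen, PROT_READ) != 0) {
        munmap(p, maplen);
        return NULL;
    }
    return p;
}
```

The appeal of this design is that a stray write faults immediately in the offending thread, rather than being noticed later by a checker.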

regards
William

Robert Krten wrote:

I agree with you there – if a thread can’t read it to verify
integrity, then you will most likely have another bug somewhere…
If this has to be a highly-available application I’d try and
find the root cause :-)

Found it! I use siglongjmp() to escape from unusual hardware
errors and (embarrassingly) I used a single jump buffer shared between
threads. Even though no siglongjmp() calls were occurring (as I had
the hardware layer excluded), the sigsetjmp() calls evidently conflict.

I have added per-thread jump buffers in thread-specific storage (using
pthread_setspecific()) and hope this will resolve the problem.

William

William Morris wrote:

Found it! I use siglongjmp() to escape from unusual hardware
errors and (embarrassingly) I used a single jump buffer shared between
threads. Even though no siglongjmp() calls were occurring (as I had
the hardware layer excluded), the sigsetjmp() calls evidently conflict.

By the way, does anyone know why this would cause a crash?
It is obviously wrong, but I thought setjmp just saved some
info in the jump buffer; I don’t know anything about the
implementation. It would be comforting to know what causes
the crash when two threads save to the same jump buffer at the
same time. Note that this happens without any long jump
occurring.

William