Tracing an error

I recently observed a fault which I could not trace. The application in
question failed but remained visible. As it covers the screen, there was
no way to get at other user interfaces.

We did manage to connect via telnet and use that session to start ‘wd’.
The application is stripped and so all we got was an assembly listing.
Trying to step etc was unsuccessful; in fact the only useful thing I got out
of the process was that it was at address 10a3:57ae when interrupted by the
debugger. This address means nothing to me. It doesn’t appear to be in the
normal code segment for the process; it is not in the linkage map file.

Can anyone suggest how to locate the function to which this address belongs?

Thanks in advance

William Morris
wrm@innovation-tk.com

“William Morris” <wrm@innovation-tk.com> wrote in message
news:9n4nuc$mib$1@inn.qnx.com

> I recently observed a fault which I could not trace. The application in
> question failed but remained visible. As it covers the screen, there was
> no way to get at other user interfaces.
>
> We did manage to connect via telnet and use that session to start ‘wd’.
> The application is stripped and so all we got was an assembly listing.
> Trying to step etc was unsuccessful; in fact the only useful thing I got
> out of the process was that it was at address 10a3:57ae when interrupted
> by the debugger. This address means nothing to me. It doesn’t appear to
> be in the normal code segment for the process; it is not in the linkage
> map file.
>
> Can anyone suggest how to locate the function to which this address
> belongs?

That could be inside Slib32 code. This is definitely not your code.
Unless you can figure out from the stack from what function in
your code it got called I doubt wd will be of much help.

- Mario

> Thanks in advance
>
> William Morris
> wrm@innovation-tk.com

Mario Charest <mcharest@zinformatic.com> wrote:

> That could be inside Slib32 code. This is definitely not your code.
> Unless you can figure out from the stack from what function in
> your code it got called I doubt wd will be of much help.

Thanks for helping. I don’t understand how my process could have its PC at an
address belonging to Slib32. I thought the processes were protected from
each other, with any I/O call I might make being turned into an IPC call to
the relevant process. Writing that, it makes me think it was interrupted in a
Receive or Send call underlying some system call I might make. Even so, I
don’t see how the PC could end up in another process’ address space.

Any ideas?

Thanks again

William Morris
wrm@innovation-tk.com

In article <9n591f$31k$1@inn.qnx.com>, wrm@innovation-tk.com says…

> Mario Charest <mcharest@zinformatic.com> wrote:
>> That could be inside Slib32 code. This is definitely not your code.
>> Unless you can figure out from the stack from what function in
>> your code it got called I doubt wd will be of much help.
>
> Thanks for helping. I don’t understand how my process could have its PC
> at an address belonging to Slib32. I thought the processes were protected
> from each other, with any I/O call I might make being turned into an IPC
> call to the relevant process. Writing that, it makes me think it was
> interrupted in a Receive or Send call underlying some system call I might
> make. Even so, I don’t see how the PC could end up in another process’
> address space.

The PC is NOT in another process’ address space.

Slib32 stands for Shared Library - 32 bit.
As such, the code of Slib32 is mapped into every process’ address space
(from common physical memory), somewhat like a DLL, though not exactly.
When Mario said “it is not your code”, he meant code that you did not
write. He did not mean code that was not in your address space. Among
other things, this is the code that maps your read()s, open()s and
similar calls to the underlying IPC calls.
When you do an open( …, …, … ), the compiler does not generate in-line
code to do the IPC; instead it generates a call to code in Slib32. That
code then takes the parameters and eventually does the MsgSend(), etc.
What Mario meant by “figure out from the stack what function” was:

  1. When you execute a function call to code in Slib32 (open() for
    example), the return address is stored on the stack. It points to a
    location in your code. If you examine the stack and can see the address
    of some of your code near the top of the stack, it should be the return
    address (that is, the address of the next instruction to be executed
    after returning from the Slib32 call).
    If you look at the C code immediately before this point, that should
    be the function call that is messing up.


> Any ideas?
>
> Thanks again


Stephen Munnings
Software Developer
Corman Technologies Inc.

William Morris <wrm@innovation-tk.com> wrote:

> I recently observed a fault which I could not trace. The application in
> question failed but remained visible. As it covers the screen, there was
> no way to get at other user interfaces.
>
> We did manage to connect via telnet and use that session to start ‘wd’.
> The application is stripped and so all we got was an assembly listing.
> Trying to step etc was unsuccessful; in fact the only useful thing I got
> out of the process was that it was at address 10a3:57ae when interrupted
> by the debugger. This address means nothing to me. It doesn’t appear to
> be in the normal code segment for the process; it is not in the linkage
> map file.

Before you started wd, did you think to grab a “sin” snapshot to see
what state your process was in?

If, in fact, the process is stripped, “sin reg” may be more useful than
wd – if the process is looping or something, you can often see what
addresses it loops through.

-David

QNX Training Services
dagibbs@qnx.com

William Morris <wrm@innovation-tk.com> wrote:

> Mario Charest <mcharest@zinformatic.com> wrote:
>> That could be inside Slib32 code. This is definitely not your code.
>> Unless you can figure out from the stack from what function in
>> your code it got called I doubt wd will be of much help.
>
> Thanks for helping. I don’t understand how my process could have its PC
> at an address belonging to Slib32. I thought the processes were protected
> from each other, with any I/O call I might make being turned into an IPC
> call to the relevant process. Writing that, it makes me think it was
> interrupted in a Receive or Send call underlying some system call I might
> make. Even so, I don’t see how the PC could end up in another process’
> address space.

Slib32 is the 32-bit shared library. When processes are started, it is
put in their address space, so they can share use of the code.

In general, code addresses in a shared library won’t show up in the
map of your process. Slib32 is a bit odd in that they actually appear
in a different segment, as access to it is gained using segment
manipulations rather than using mmap() as is done for other shared
libraries (e.g. TCP/IP or Photon).

-David

QNX Training Services
dagibbs@qnx.com

David Gibbs <dagibbs@qnx.com> wrote:

> Before you started wd, did you think to grab a “sin” snapshot to see
> what state your process was in?
>
> If, in fact, the process is stripped, “sin reg” may be more useful than
> wd – if the process is looping or something, you can often see what
> addresses it loops through.

Er… No
Actually I did a dump using a program I wrote but not using sin.
Unfortunately the machine in question is now in transit to a show so I cannot
get at the dump anyway.

I seem to remember the process was Reply blocked on the X server.

Thanks to all who have replied

William Morris
wrm@innovation-tk.com

William Morris <wrm@innovation-tk.com> wrote:

> David Gibbs <dagibbs@qnx.com> wrote:
>
>> Before you started wd, did you think to grab a “sin” snapshot to see
>> what state your process was in?
>>
>> If, in fact, the process is stripped, “sin reg” may be more useful than
>> wd – if the process is looping or something, you can often see what
>> addresses it loops through.
>
> Er… No
> Actually I did a dump using a program I wrote but not using sin.
> Unfortunately the machine in question is now in transit to a show so I
> cannot get at the dump anyway.
>
> I seem to remember the process was Reply blocked on the X server.

Reply blocked on the Xserver is a bit different from “locked”. Still,
something did go wrong. But, being REPLY-blocked on the X server would
explain both the odd address (it would be in the shared library) and why
you couldn’t single-step it with the debugger. The program wasn’t in a
READY state, so it can’t be stepped through code, it was waiting for
the X Server to reply to it. You would likely have been able to single
step it after the X server replied.

Now, as to why it was REPLY blocked on the X server – that is far
harder to figure out. It could be a mistake in the X server, it could
be a mistake in your code. If you sent the server a request that it
hadn’t finished, or that wasn’t finishable in any short time, then you
might end up REPLY blocked indefinitely.

-David

QNX Training Services
dagibbs@qnx.com

David Gibbs <dagibbs@qnx.com> wrote:

> Reply blocked on the Xserver is a bit different from “locked”. Still,
> something did go wrong. But, being REPLY-blocked on the X server would
> explain both the odd address (it would be in the shared library) and why
> you couldn’t single-step it with the debugger. The program wasn’t in a
> READY state, so it can’t be stepped through code, it was waiting for
> the X Server to reply to it. You would likely have been able to single
> step it after the X server replied.
>
> Now, as to why it was REPLY blocked on the X server – that is far
> harder to figure out. It could be a mistake in the X server, it could
> be a mistake in your code. If you sent the server a request that it
> hadn’t finished, or that wasn’t finishable in any short time, then you
> might end up REPLY blocked indefinitely.

It was a bit dumb to expect a Reply-blocked process to do anything.
Sorry to waste people’s time. Still at least I know what the funny
address was now.

[off topic]
I had hoped this might be a clue to a long-occurring real “lockup” that we
observe from time to time. In this, the whole system freezes and does not
even respond to network pings. I think this was a different problem.
Yesterday someone brought me a board which locks up 5 times a day (normally
it is a rare occurrence). Unfortunately they took it away again almost at
once (to use!) so I didn’t get to play with it.

Thanks for helping.

William Morris
wrm@innovation-tk.com

William Morris <wrm@innovation-tk.com> wrote:

> It was a bit dumb to expect a Reply-blocked process to do anything.
> Sorry to waste people’s time. Still at least I know what the funny
> address was now.

Actually, I was a bit quick to admit dumbness. Just because it is Reply
blocked at the point I run ps (or whatever), doesn’t mean it isn’t looping.
The system appears “failed” because the failed app has a keyboard lock and as
it covers the screen (no title bar) it was not possible to switch to another
window. Only by going in over the net could I start wd.

Cheers

William Morris
wrm@innovation-tk.com

William Morris <wrm@innovation-tk.com> wrote:

> David Gibbs <dagibbs@qnx.com> wrote:
>
> [off topic]
> I had hoped this might be a clue to a long-occurring real “lockup” that we
> observe from time to time. In this, the whole system freezes and does not
> even respond to network pings. I think this was a different problem.
> Yesterday someone brought me a board which locks up 5 times a day
> (normally it is a rare occurrence). Unfortunately they took it away again
> almost at once (to use!) so I didn’t get to play with it.

What version of Proc32 are you running? We’ve recently found a problem
with timer handling that could cause a Proc lockup on many versions of
4.25 – I think Randy Martin posted a thread about it recently.

If you were in a GUI (like X) you wouldn’t see the message it prints
out telling you it is a kernel crash, but if you can hook a serial port
up to another machine, you can catch it – you just have to use the
-o port[,baud] option to Proc32, AND make sure Dev.ser doesn’t try to
handle the same serial port. If anyone brings you back that 5-a-day
board, this is probably the first step in tracking down what is going wrong.

We also think we have a fix for this crash, and are testing it internally
right now.

(For which I’m glad – this crash wasn’t frequent, didn’t happen to most
people, but my desktop is one of the machines that just happened to tickle
things right.)

-David

QNX Training Services
dagibbs@qnx.com

David Gibbs <dagibbs@qnx.com> wrote:

> What version of Proc32 are you running? We’ve recently found a problem
> with timer handling that could cause a Proc lockup on many versions of
> 4.25 – I think Randy Martin posted a thread about it recently.

The sin ver listing is as follows. We have 4.25e (recent update CD).

sin ver

PROGRAM       NAME      VERSION   DATE
sys/Proc32    Proc      4.25J     Sep 09 1999
sys/Proc32    Slib16    4.23G     Oct 04 1996
etc

> If you were in a GUI (like X) you wouldn’t see the message it prints
> out telling you it is a kernel crash, but if you can hook a serial port
> up to another machine, you can catch it – you just have to use the
> -o port[,baud] option to Proc32, AND make sure Dev.ser doesn’t try to
> handle the same serial port. If anyone brings you back that 5-a-day
> board, this is probably the first step in tracking down what is going
> wrong.

I have wanted to do this but we have no free ports. Reallocating a port would
make the system unusable, since essential facilities hang on ser1 (touch
screen) and ser2 (strange DOS-based servo controller), and would modify its
behaviour enough to invalidate the test (bearing in mind that the crash is
infrequent enough that it cannot be guaranteed to happen). As you say, maybe
with the 5-a-day board it will be easier to track.

> We also think we have a fix for this crash, and are testing it internally
> right now.

Great stuff.

Thanks

William Morris
wrm@innovation-tk.com