Running Ready

I have an agent process (forever send loop, report for duty, tasked by
Reply) which is handling a serial port with a modem attached. It
handles dialing and delivering data. Most of the time it works fine.
Once in a while, the process goes into a tight loop and runs Ready.

By doing a number of sin -Pprogram_name reg, I was able to determine
that the IP is looping between ~6973 and ~69DE, which is before
_main. Can anybody give me a clue as to what is down there and why it
might get caught in a loop?

It is running in Slib but since nobody has a map it is hard to say. Some
things to check:

  1. You say you are handling a serial port - using dev_arm() or something
    else that would cause the Send() to unblock?

  2. Deja-view/monitor does wonders for “why did you wake up?”

  3. Did the pid you are sending to die or did it get corrupted in memory
    so you are getting an ESRCH?

  4. Are you trapping the Send() for -1 errors?

Just some thoughts…

Jay


Jay Hogg wrote:

It is running in Slib but since nobody has a map it is hard to say. Some
things to check:

I have a map but it shows nothing below a000.


  1. You say you are handling a serial port - using dev_arm() or something
    else that would cause the Send() to unblock?

I am doing a simple write (or writev) to the fd when it goes into the loop.

I just put in a breakout timer around the write to see if I can recover from
the hang.


  2. Deja-view/monitor does wonders for “why did you wake up?”

I may have to resort to that.


  3. Did the pid you are sending to die or did it get corrupted in memory
    so you are getting an ESRCH?

The hang does not seem to be associated with my Send. The server is still
running along happily.

  4. Are you trapping the Send() for -1 errors?

Yes, but that doesn’t seem to be the problem.



I have a new field report on this problem and it is still dropping into this
loop occasionally (once a week or so). It is difficult to get information about
what combination of events might be causing this. I have more info being logged
and will get access to the logs tomorrow. Meanwhile, it might be helpful if I
could find out what function in slib is looping. I’m not sure I understand the
comment that “nobody has a map”. Can’t you just recompile and make a new map?


Anybody from QSSL care to comment on this?


Ping!

Any QSSL people? Would somebody look at this problem, please? It seems
to be in a shared library routine.

TIA


tell me the version of slib32 and i can point to a function in slib.
but if you attach to it with wd or start it from wd then you can step thru
the assembly (+function name) that is being called.

my thoughts are that you are returning off the stack/registers and popping an
incorrect return value. rather than jumping to where you’d get a segv, you are
jumping into some valid code.

if the app isn’t too big i would recommend checking all the function arguments
and returns first…


Randy Martin randy@qnx.com
Manager of FAE Group, North America
QNX Software Systems www.qnx.com
175 Terence Matthews Crescent, Kanata, Ontario, Canada K2M 1W8
Tel: 613-591-0931 Fax: 613-591-3579

The version on the development system is 4.24B dated 1997Aug12 and the fielded
systems are at least that high or higher.

I can’t throw it in the debugger to trace asm because it doesn’t happen all the
time. It works correctly dozens of times, then goes into the loop.

I’ll check automatic variables, etc. to see if I’m corrupting the stack. Knowing
what shared library function is looping might help (or not). :(

TIA

Dean
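
One inexpensive way to check automatic variables for overruns is to bracket a suspect buffer with guard words and test them before the function returns. The sketch below is an illustration only: the struct, the guard value, and the function names are all made up, and it catches only writes that run off either end of the guarded buffer.

```c
#include <stddef.h>
#include <string.h>

#define GUARD_VALUE 0xDEADBEEFUL   /* arbitrary, easy-to-spot pattern */

/* Hypothetical wrapper: a buffer bracketed by guard words. */
struct guarded_buf {
    unsigned long head;            /* guard before the data */
    char data[64];
    unsigned long tail;            /* guard after the data */
};

/* Set both guards and clear the buffer. Call on entry to the function. */
void guarded_init(struct guarded_buf *g)
{
    g->head = GUARD_VALUE;
    memset(g->data, 0, sizeof g->data);
    g->tail = GUARD_VALUE;
}

/* Returns 1 if both guards are intact, 0 if something overwrote one. */
int guarded_ok(const struct guarded_buf *g)
{
    return g->head == GUARD_VALUE && g->tail == GUARD_VALUE;
}
```

Declared as a local variable, initialised on entry and checked (and logged) just before each return, this can narrow down which function is trampling its own stack frame on the rare run that goes wrong.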


according to my map that is looping in function _mouse_open … part of the
services to allow for mouse cursor movement.

i would think that somewhere you are jumping into that code from some other
routine.
without a debugger i don’t know how you could trap on this and see the
backtrace to where it came from.
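
Looking up a sampled IP in a link map comes down to finding the last symbol whose start address is at or below the IP. The sketch below shows that lookup as a binary search. The table is made up for illustration and is not the real Slib32 map; struct sym, sym_lookup, and the func_* names are hypothetical.

```c
#include <stddef.h>

struct sym {
    unsigned long addr;    /* start address of the function */
    const char *name;
};

/* Made-up example map, sorted by ascending address. */
static const struct sym example_map[] = {
    { 0x1000UL, "func_a" },
    { 0x2000UL, "func_b" },
    { 0x3000UL, "func_c" },
};

/*
 * Returns the name of the last symbol whose address is <= ip, or NULL
 * if ip lies below the first symbol. tab must be sorted by addr.
 */
const char *sym_lookup(const struct sym *tab, size_t n, unsigned long ip)
{
    size_t lo = 0, hi = n;             /* search window is [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (tab[mid].addr <= ip)
            lo = mid + 1;              /* symbol starts at or before ip */
        else
            hi = mid;                  /* symbol starts after ip */
    }
    /* lo is now the count of symbols starting at or before ip. */
    return lo == 0 ? NULL : tab[lo - 1].name;
}
```

Because a map of this shape carries no function sizes, an address past the end of the last function is still reported as belonging to it, so a hit is a hint rather than proof.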



Ugh, bad news indeed.

Thanks Randy, at least now I know how much trouble I’m in. :(

Dean


Randy Martin wrote:

according to my map that is looping in function _mouse_open … part of the
services to allow for mouse cursor movement.

While you’ve got the shared lib map out :)

I have executables (several) that become reply blocked on Proc. At
this point there is no way to set any signal on them, so the process
level debugger is useless. Now I’m not expecting any revelation here,
but “sin regs” shows the CS:IP as 10A3:17E3 (this is with Proc 4.25I,
Nov 25th 1998, and Slib32 4.24B Aug 12th 1997). Any idea where that is?

btw: I concur with your conclusion that this probably means there is
some sort of code corruption occurring, either via program-induced stack
corruption leading to a random return address (less likely since
multiple executables exhibit the same problem, and this only happens on
425), or actual code corruption, via shared memory pointer dereferences
through unlinked shared memory (more likely since it occurs in multiple
executables, and the problem only occurs on 425, not on 424).

Thanks

Rennie

Rennie Allen <RAllen@csical.com> wrote:

I have executables (several) that become reply blocked on Proc. At
this point there is no way to set any signal on them, so the process
level debugger is useless. Now I’m not expecting any revelation here,
but “sin regs” shows the CS:IP as 10A3:17E3 (this is with Proc 4.25I,
Nov 25th 1998, and Slib32 4.24B Aug 12th 1997). Any idea where that is?

also in _send





Thanks, Randy. This helps, in that it confirms what I suspected.

