Photon crash

“Bill Caroselli (Q-TPS)” <qtps@earthlink.net> wrote:

OK. Either I misunderstood the original statement or I’m not understanding
your reply to me.

If I do stuff to map some memory and I don’t munmap() it, then the
reference count should NOT go to 0. The OS should NOT be free to give that
memory to another process.

I agree that it would be nice, and I realize that most other OSes work
that way. Unfortunately, QNX4 doesn’t, and that’s what the text I
quoted explains. QNX4 only counts fds as “references”, and mmapped pages
do not count. If you want to safely access the mmapped memory after the
shared object has been unlinked, you must keep an fd to it. That’s
documented behaviour – the fact that it’s non-standard doesn’t
necessarily make it a bug. :-(
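A minimal sketch of the "keep an fd" pattern described above, using only the standard POSIX calls discussed in this thread (shm_open(), mmap(), shm_unlink(), munmap(), close()). It assumes a POSIX system and is not taken from the QNX4 documentation; the object name and size are hypothetical. The name is unlinked early, but the fd is held until after munmap(), so the object never drops to zero references while it is still mapped.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define DEMO_SHM_NAME "/photon_demo_shm"  /* hypothetical object name */
#define DEMO_SHM_SIZE 4096

/* Stores msg in a shared memory object and copies it back into out.
   The name is unlinked right after mapping, but the fd stays open until
   after munmap(), so the object keeps a reference while it is mapped.
   Returns 0 on success, -1 on failure. */
int shm_keep_fd_demo(const char *msg, char *out, size_t outlen)
{
    int fd = shm_open(DEMO_SHM_NAME, O_RDWR | O_CREAT, 0600);
    if (fd == -1) {
        perror("shm_open");
        return -1;
    }
    if (ftruncate(fd, DEMO_SHM_SIZE) == -1) {
        perror("ftruncate");
        close(fd);
        shm_unlink(DEMO_SHM_NAME);
        return -1;
    }
    char *p = mmap(NULL, DEMO_SHM_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        close(fd);
        shm_unlink(DEMO_SHM_NAME);
        return -1;
    }

    /* Remove the name now; from here on the open fd is the reference
       that keeps the memory from being returned to the system. */
    shm_unlink(DEMO_SHM_NAME);

    snprintf(p, DEMO_SHM_SIZE, "%s", msg);  /* safe: fd is still open */
    snprintf(out, outlen, "%s", p);

    /* Teardown: unmap first, close the fd last. */
    munmap(p, DEMO_SHM_SIZE);
    close(fd);
    return 0;
}
```

The point of the ordering is that the last close() comes after the last munmap(), which is what the QNX4 rule requires.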

If the text that you are quoting is pointing out some possible benefit to
this behavior, I don’t understand what it could be.

The text doesn’t say that, but my guess would be that the main benefit
is simplicity of implementation. If I remember correctly, the original
QNX 4.0 did not have support for mmap() – it was added later, and it
probably was too difficult at that point to add code to Proc to keep
track of which shared object each page of memory belongs to.


Regarding your kill() code, if the executing code is valid then the OS must
do what it is told to do. But the OS is smart enough to at least check that
you are sufficiently privileged before raising the signal.

Yes, but my point was that the OS may not be able to know whether your
code is valid or not. If a bug causes your code to ask the OS to do
something that looks valid (but is not what you meant), the OS will do
it, and that may “adversely affect another program”. I was just trying
to give an example of why your suggestion, taken literally, is way too
general to be realistic:

IMHO, ANY program bug should not adversely affect another program.




“Wojtek Lerch” <wojtek@qnx.com> wrote in message
news:9pd9m7$2br$1@nntp.qnx.com…
“Bill Caroselli (Q-TPS)” <qtps@earthlink.net> wrote:

“Gui Group” <gui@qnx.com> wrote in message
news:9pcig2$hup$1@nntp.qnx.com…
If you do shm_open(), mmap(), close() and then shm_unlink() without
munmap(), then the memory you have mmapped stays mmapped but is
considered free memory and may get allocated to another (or the same)
process. Modifying it may cause that process to misbehave or crash.

Please tell me that this is being dealt with as a kernel bug.

I think it’s considered a feature.

From /etc/readme/technotes/shmem.txt:

2.2. close, unlink

The close system call releases the file descriptor for
the shared memory object. The unlink system call causes the
shared memory object to be deleted. Unlink has the same
portability concerns as open and creat, thus there is a
similar function, shm_unlink, which hides the path /dev/shmem
from the application. Shared memory objects have a reference
count attached to them. When the object is created,
the reference count is set to 1. Every time a file descriptor
is opened to the object, the reference count is incremented.
Every time a file descriptor to the object is closed, the
reference count is decremented. When the reference count
reaches 0 (only possible if the object has been unlinked and
there are NO open file descriptors to it), the memory
associated with the object is returned to the system.
Any mappings which exist when the memory object is deleted
become undefined. Accessing such memory may cause
unpredictable effects in your process.
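To make the rule concrete, here is a toy model of the reference count as the technote describes it. It reflects one reading of the text (the name holds the initial reference, each open fd adds one, unlinking drops the name's reference, and mappings add nothing). All names are hypothetical; this is an illustration, not Proc's actual implementation.

```c
#include <stdbool.h>

/* Toy model of the shmem.txt reference-counting rule. */
struct shm_model {
    int  refcount;  /* 1 for the name, +1 per open fd */
    bool linked;    /* name still present under /dev/shmem */
    int  mappings;  /* tracked only to detect dangling maps; NOT counted */
    bool freed;     /* memory returned to the system */
};

static void check_free(struct shm_model *o)
{
    if (o->refcount == 0)
        o->freed = true;
}

void model_create(struct shm_model *o)
{
    o->refcount = 1;  /* "set to 1" when the object is created */
    o->linked = true;
    o->mappings = 0;
    o->freed = false;
}

void model_open(struct shm_model *o)  { o->refcount++; }
void model_close(struct shm_model *o) { o->refcount--; check_free(o); }

void model_unlink(struct shm_model *o)
{
    if (o->linked) {
        o->linked = false;
        o->refcount--;  /* drop the reference held by the name */
        check_free(o);
    }
}

/* Mappings do not touch refcount: that is the QNX4 behaviour at issue. */
void model_mmap(struct shm_model *o)   { o->mappings++; }
void model_munmap(struct shm_model *o) { o->mappings--; }

/* True if the memory was freed while a mapping still refers to it. */
bool model_dangling(const struct shm_model *o)
{
    return o->freed && o->mappings > 0;
}
```

Feeding this model the sequence from the thread (shm_open, mmap, close, shm_unlink, with no munmap) frees the memory while a mapping is still live, which is the case the technote calls undefined. Calling model_unlink() before model_close(), with the munmap in between, never leaves a dangling mapping.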


IMHO, ANY program bug should not adversely affect another program.

Good one. ;-)

How do you suppose we should prevent the following buggy code from
adversely affecting other programs:

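#include <signal.h>
#include <sys/types.h>

void foo( void ) {
    pid_t pid;             /* never initialized: this is the bug */
    kill( pid, SIGTERM );  /* signals whatever the garbage value names */
}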


Wojtek Lerch QNX Software Systems Ltd.
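A caller can at least reject the pid values that kill() treats specially: 0 signals the caller's process group, -1 every process the caller may signal, and other negative values a whole process group. The helper below is a hypothetical sketch, not part of any QNX API. An uninitialized pid that happens to hold a valid positive value still gets through, which is the point being made above: the OS cannot tell a wrong but valid request from a right one.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical helper: refuses the pid values that kill() gives special
   meaning (0: the caller's process group; -1: every process the caller
   may signal; below -1: the process group -pid). A garbage positive pid
   still gets through; only the caller can know it is the wrong one. */
int send_term_checked(pid_t pid)
{
    if (pid <= 0) {
        errno = EINVAL;
        return -1;
    }
    return kill(pid, SIGTERM);
}
```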



“Wojtek Lerch” <wojtek_l@yahoo.ca> wrote in message
news:9pfaqq$ad2$1@nntp.qnx.com

I agree that it would be nice, and I realize that most other OSes work
that way. Unfortunately, QNX4 doesn’t, and that’s what the text I
quoted explains. QNX4 only counts fds as “references”, and mmapped pages
do not count. If you want to safely access the mmapped memory after the
shared object has been unlinked, you must keep an fd to it. That’s
documented behaviour – the fact that it’s non-standard doesn’t
necessarily make it a bug. :-(

Well, if not a bug, it is certainly LESS THAN a feature, as stated in your
original reply.

Yes, but my point was that the OS may not be able to know whether your
code is valid or not. If a bug causes your code to ask the OS to do
something that looks valid (but is not what you meant), the OS will do
it, and that may “adversely affect another program”. I was just trying
to give an example of why your suggestion, taken literally, is way too
general to be realistic:

OK. By valid I simply meant that it won’t SIGSEGV, as this does:
    printf( "My string is %s\n", NULL );

I wouldn’t expect even QNX to have fully implemented the DWIM instruction.

“Bill Caroselli (Q-TPS)” wrote:

I wouldn’t expect even QNX to have fully implemented the DWIM instruction.

That’s because it is so poorly documented… :-)

Hi Brenda,

First of all, did you get my test application and source that I sent last
Thursday?

We are running our software on two different platforms. One is an Octagon
PC510 and the other is an Octagon PC680. On the 680 we get the slowdown
problem; however, on the 510 we have been getting a segmentation fault. The
address reported is always the same, regardless of the compile, and we
cannot find this address anywhere in our code. The address is
0007:0007F948. Does this mean anything? It occurs when we press a button.
It doesn’t happen all the time, and not on the same button or even the same
type of button.

Thanks for your help
Rodney


“Gui Group” <gui@qnx.com> wrote in message news:9pcig2$hup$1@nntp.qnx.com

Hi Rodney,

I have done some more checking with the developers.

Hope this helps
Regards
Brenda

Rodney Gullickson <rodneyg@tritro.com.au> wrote:
Thanks Brenda

Is the shared memory problem only a problem when you destroy shared
memory? We allocate our memory at startup and then only ever release it
on shutdown. After a shutdown (for whatever reason, sometimes power loss)
the system always gets rebooted. Is this shared memory problem documented
somewhere? I couldn’t seem to find it in the knowledge base or
newsgroups.

If you do shm_open(), mmap(), close() and then shm_unlink() without
munmap(), then the memory you have mmapped stays mmapped but is
considered free memory and may get allocated to another (or the same)
process. Modifying it may cause that process to misbehave or crash.

But from what you say, it doesn’t appear that this is what is happening.

How do I determine where pfattach() is being called? I can’t find it or
PhAttach() in any of our code or in the debugger. Also, could this have
anything to do with moving text? For instance, the font fd gets lost, or the
text is in the wrong place, so the font driver needs to be reattached? I
guess if I can break on pfattach() this will tell me…

Did you check for pfattach() or PfAttach() ??

Since this is QNX4, the right way to debug it would be by
linking the app statically and then running it under the debugger
(which we already discussed). Putting breakpoints inside a QNX4 shared
library is not a good idea.

There’s code in the library that could cause multiple calls to
PfAttach(), but that code should only be called after a font server has
died (and, perhaps, a new one has started), which does not explain
multiple connections to the same font server.

I have added some code using qnx_fd_query to count the number of fds for
all of my tasks. In the total running system I have about 260, with one
process having about 30. qnx_osinfo reports a total of 8000 fds, min = 16,
max = 512, so we shouldn’t be running out.
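qnx_fd_query is QNX4-specific. As a hedged, portable stand-in on a POSIX system (not the QNX4 call, and it only sees the calling process, whereas qnx_fd_query was used here across all tasks), a process can count its own open descriptors by probing each fd number with fcntl(F_GETFD), which fails for numbers that are not open. The scan cap is an assumption added to keep the loop fast.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

#define FD_SCAN_CAP 65536L  /* assumed cap so the scan stays fast */

/* Portable stand-in, NOT qnx_fd_query: counts the calling process's
   open descriptors by probing every possible fd number up to the limit.
   fcntl(F_GETFD) returns -1 (EBADF) for numbers that are not open. */
int count_open_fds(void)
{
    long max = sysconf(_SC_OPEN_MAX);
    if (max < 0)
        max = 1024;          /* assumed fallback: limit indeterminate */
    if (max > FD_SCAN_CAP)
        max = FD_SCAN_CAP;   /* assumed cap: fds above it are not counted */

    int count = 0;
    for (long fd = 0; fd < max; fd++) {
        if (fcntl((int)fd, F_GETFD) != -1)
            count++;
    }
    return count;
}
```

Calling it before and after opening a file should show the count go up by one; a steady climb over time in a long-running task would point at an fd leak.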

Another possible item to check for Proc32 is the number of processes.
The -p num_procs[,code] option lets you set the number of processes:

The total number of real processes, virtual circuits, and proxies that can
exist at any time (default is 500; maximum is 2000). The optional “,code” is
for command-line compatibility with earlier Procs and is ignored.

You could check whether you are coming close to this limit.

Each task has an fd open to a number of message queues. We have it set up so
each task has its own message queue that it reads, and many tasks write to
it.

Thanks again for the info. I am currently reviewing the rest of our shared
memory objects and classes to see if there is a problem.

Rodney

snip

Hi Brenda,

Have you found out any more about my problem from the sample application I
sent?

We are still experiencing lockups in our systems. Any extra leads would be
much appreciated.

Regards,
Rodney


Hi Rodney,

I just want to check whether you got the email I sent you a while back
with the suggestions about the signals and using proxies instead.

Regards
Brenda


Hi Rodney,

Sorry for the late response…
I did get your application and the developer looked at it. I sent you
an email with some suggestions a while back. If you haven’t gotten it
please let me know and I will send it to you again.

Just to check again: you tried linking to the libraries statically (and
with debugging info), right? If you did, the debugger won’t show you
the code if the crash happens in the library, but you may see the call
stack. Even if not, you’ll at least be able to look up the address you
mentioned (0007:0007F948) in the map file and reliably tell which
function you’re crashing in.

This could help narrow the search.
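Looking up an address in the map file amounts to finding the symbol with the greatest start offset that is still less than or equal to the crash offset. A sketch, assuming the map entries for the relevant segment have already been extracted and sorted by offset; the sample symbols and offsets are made up for illustration, and only the offset 0x7F948 comes from the crash report.

```c
#include <stddef.h>

/* One linker-map entry reduced to an offset and a name. */
struct map_sym {
    unsigned long offset;  /* start offset within the segment */
    const char   *name;
};

/* Made-up sample entries, sorted by offset; real ones come from the .map file. */
const struct map_sym sample_map[] = {
    { 0x7F800UL, "example_func_a" },
    { 0x7F900UL, "example_widget_links" },
    { 0x7FA00UL, "example_func_b" },
};

/* Returns the name of the last symbol whose start offset is <= addr,
   i.e. the symbol addr falls inside, or NULL if addr precedes them all.
   Assumes syms is sorted by offset. */
const char *lookup_symbol(const struct map_sym *syms, size_t n,
                          unsigned long addr)
{
    const char *best = NULL;
    for (size_t i = 0; i < n && syms[i].offset <= addr; i++)
        best = syms[i].name;
    return best;
}
```

With the sample table, lookup_symbol(sample_map, 3, 0x7F948UL) resolves to example_widget_links. Note that this simple scan attributes any address past the last entry to that last symbol, so a result near the end of the map deserves a second look.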

Let me know if you have already tried this…

Good Luck
Brenda


Hello Brenda,

Yes, I did get your email eventually :-) The original must have got lost
somewhere along the way.

I have converted our Photon app to use Photon Pulses (i.e. proxies) for
notification on the message queues, which is where most of our signals are
coming from. This does not appear to solve the problem. We are still
getting a crash at a particular address (0007:0007F948) when we press a
PtButton. It can be any button on any of our screens. Sometimes it runs
for quite a while; sometimes it is the first button we press. When it
crashes, it produces a dump file, and the address at which it crashes is
always the same, regardless of the type of compile (debugging info on or
off) or what is in the build (we have been adding a fair bit of extra
code). The dump file shows no source and no calls (the library is
statically linked).

I have just been looking through the makefile again, and the privity level
is 3. Could this be the problem? It still doesn’t make sense, as I would
have thought it would crash at the same point each time if this was the
problem. Anyway, I will make it privity 1 and try that.

Regards
Rodney


Hi Rodney,

Changing the privity from 3 to 1 is how you tell QNX4 that the program
is allowed to access hardware I/O ports (and must be run as root). The
only time you would try that is if you see under the debugger that the
program crashes on an I/O opcode.

  1. When your application hangs, do you get any error messages displayed?
  2. Are you still able to move the mouse pointer?
  3. Is the keyboard dead, or will the Num Lock and Caps Lock lights still come on?
  4. When the system hangs, do you see the Display application REPLY-blocked
    on another application?

Is it possible for me to get a copy of the application that is using the
proxies to see if I can reproduce the problem here?

Regards
Brenda



I may have a further lead on my Photon crash, but it has not helped me yet!

I have looked at the dump file after a number of crashes, and the address on
the stack when the system got the SIGSEGV comes from a function (I looked
this up in the map file) called widgetname_links, where widgetname is the
name of one of my widgets. In the few that I have looked at, the widget
name is one from the screen that was displayed when the system crashed.

What does this function do? Could I be calling something wrongly?

Any help greatly appreciated.

Thanks
Rodney Gullickson


Hi Rodney,

I have checked with the developers, and here is the info they passed on:

Rodney Gullickson wrote:

I may have a further lead on my Photon crash, but it has not helped me yet!

I have looked at the dump file after a number of crashes, and the address on
the stack when the system got the SIGSEGV comes from a function (I looked
this up in the map file) called widgetname_links, where widgetname is the
name of one of my widgets. In the few that I have looked at, the widget
name is one from the screen that was displayed when the system crashed.

What does this function do? Could I be calling something wrongly?

The widgetname_links are not functions. They are arrays. It’s certainly
not wise to try to call them as functions…

The question would be what is trying to make those calls. From
experience, if the code ends up trying to jump into data, the cause is
most often a buffer overflow (for example, one that overwrites a return
address on the stack); either that, or a corrupted function pointer.
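One common way code ends up jumping into data is an overflow of a fixed-size buffer that sits next to a function pointer. The sketch below uses a hypothetical record type (not a Photon structure): an unbounded strcpy() of a long string into label could overwrite on_press, while the bounded copy shown cannot.

```c
#include <stdio.h>

typedef void (*press_fn)(void);

/* Hypothetical record: a fixed-size buffer followed by a function
   pointer. Writing past the end of label (e.g. with an unbounded
   strcpy) could overwrite on_press, and a later call through on_press
   would then jump to whatever bytes were written there. */
struct button_rec {
    char     label[16];
    press_fn on_press;
};

/* Bounded copy: snprintf writes at most sizeof rec->label bytes,
   including the terminating NUL, so on_press is never touched. */
void set_label(struct button_rec *rec, const char *text)
{
    snprintf(rec->label, sizeof rec->label, "%s", text);
}

/* Hypothetical handler used to show that the pointer survives. */
void demo_press(void)
{
}
```

Auditing every fixed-size buffer copy (strcpy, sprintf, memcpy with a computed length) near callback pointers or on the stack is one way to narrow down this kind of crash.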

Hope this helps.
Regards
Brenda
