Diagnosing Server Problems?

Chris McKillop wrote:

I don’t fully understand what you are saying, but this seems unlikely, as
all executables are mmap’d out of the disk/network filesystem by procnto,
so you’d not be able to run anything at all if mmap() was broken …

Correct if the mapping has been done by mmap() …



And what is exactly broken Armin? The only thing that broke was the
mapping of /dev/mem to bring in a device memory space.

The only ‘thing’ ?? Half of the functionality of mmap() is broken!

Do not make general case statements when the bug is very specific.

It would be nice to see a public notice here about that ‘very specific’
bug … so I don’t have to explain to new customers that our resource
managers aren’t the problem!!

Armin

-I’m-a-little-bit-angry-

Like right now… The server is running at a crawl. The spin I had
running at TOP priority has stopped updating. Attempts to run “hogs” end
in a core dump. Nothing appears amiss in any incantation of pidin. There
is no indication of what is consuming CPU time, only that all the CPU
time has been consumed.

I had to reboot the server. There’s nothing in sloginfo, nothing in
/var/dumps, nothing in syslog, no diagnostics to indicate what’s wrong
anywhere.

How am I supposed to fix something when I can’t tell what’s wrong? A few
more days of this, and I’m going to be out on the street.

Mathew Kirsch <mkirsch@ocdus.jnj.com> wrote:

Like right now… The server is running at a crawl. The spin I had
running at TOP priority has stopped updating. Attempts to run “hogs” end
in a core dump. Nothing appears amiss in any incantation of pidin. There
is no indication of what is consuming CPU time, only that all the CPU
time has been consumed.

How are you running hogs? I suspect with the -n option? Can you also
post the output from “uname -a” and “pidin in” when the machine has been
rebooted?

I had to reboot the server. There’s nothing in sloginfo, nothing in
/var/dumps, nothing in syslog, no diagnostics to indicate what’s wrong
anywhere.

Things will only appear in /var/dumps if something crashes. Slogger is an
in-memory debugging tool; did you run it before you rebooted? Once you
reboot, any information that might have been stored is lost. Nothing we
write uses syslog, so if it is a problem with a system-level process you
won’t get anything printed out.

chris


Chris McKillop <cdm@qnx.com> “The faster I go, the behinder I get.”
Software Engineer, QSSL – Lewis Carroll –
http://qnx.wox.org/

“Chris McKillop” <cdm@qnx.com> wrote in message
news:b5nhv2$63i$1@nntp.qnx.com

I don’t fully understand what you are saying, but this seems unlikely, as
all executables are mmap’d out of the disk/network filesystem by procnto,
so you’d not be able to run anything at all if mmap() was broken …

Correct if the mapping has been done by mmap() …


And what is exactly broken Armin? The only thing that broke was the
mapping of /dev/mem to bring in a device memory space. Do not make
general case statements when the bug is very specific.

That was not the ONLY thing. Believe it or not, it broke one of my previous
Apache ports. Earlier versions of QNX used to allow mmap(MAP_SHARED,
open("/dev/zero")) with the effect equivalent to mmap(MAP_ANON|MAP_SHARED,
NO_FD). That (coincidentally or not) aligned with the way Solaris handles
it, so the Solaris mmap branch also worked on QNX. Then apparently it was
decided that ‘we should not have allowed it in the first place’ and voila, I
had to go and figure out why working code does not work anymore. Must have
been poor old me and my lack of understanding of the API :-P

– igor
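For readers porting code like igor's, here is a minimal sketch of the anonymous shared mapping that the /dev/zero trick was standing in for, assuming a POSIX system. It uses the Linux/BSD spelling MAP_ANONYMOUS with a descriptor of -1; on QNX the thread spells these MAP_ANON and NOFD. The helper names `shared_anon` and `child_write_visible` are hypothetical, made up for illustration.

```c
#define _DEFAULT_SOURCE  /* exposes MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: allocate memory shared with forked children
   without opening /dev/zero. */
static void *shared_anon(size_t len) {
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical demo: the child writes a value into the shared page and
   the parent reads it back after the child exits. Returns -1 on error. */
static int child_write_visible(void) {
    int *cell = shared_anon(sizeof *cell);
    if (cell == NULL)
        return -1;
    *cell = 0;
    pid_t pid = fork();
    if (pid < 0) {
        munmap(cell, sizeof *cell);
        return -1;
    }
    if (pid == 0) {          /* child */
        *cell = 42;
        _exit(0);
    }
    waitpid(pid, NULL, 0);   /* parent waits, then reads the shared cell */
    int seen = *cell;
    munmap(cell, sizeof *cell);
    return seen;
}
```

Because the mapping is MAP_SHARED, the parent sees the child's write; with MAP_PRIVATE it would still read 0.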

Chris McKillop wrote:

And what is exactly broken Armin? The only thing that broke was the
mapping of /dev/mem to bring in a device memory space.

The only ‘thing’ ?? Half of the functionality of mmap() is broken!



Care to give another example of what is busted Armin? As John said, all
binaries and shared libs are loaded from various resource managers using
mmap() (io-blk, fs-nfs[23], fs-cifs, …).
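To make Chris's point about binaries and shared libraries concrete, here is a minimal POSIX sketch of mapping a whole file read-only out of a filesystem, roughly what a loader does with an executable image. The function name `map_file_ro` is hypothetical, and a real loader maps individual segments with differing protections rather than the whole file at once.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map an entire file read-only. Returns NULL on any
   failure and stores the mapped length in *len_out on success. */
static const char *map_file_ro(const char *path, size_t *len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}
```

The request goes to whichever filesystem owns the path, which is why a broken mmap() path would stop every program from loading.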

Nonsense … the POSIX mmap() is broken. Only mmap_device_memory() is
working!!

Armin



chris

Nonsense … the POSIX mmap() is broken. Only mmap_device_memory() is
working!!

Armin, if you don’t want to be helpful why do you even post? Have you
ever looked at the code to mmap() and mmap_device_memory()? They are
all built with messages to proc (the SAME message in fact). I have no doubt
that something changed with respect to mmap() on /dev/mem, but if that is a
bug then it is a bug in proc’s handling of the mmaping, not in the mmap
interface itself.

Again, take a look at the code in lib/c/1b/mmap.c and lib/c/ldd.c, you will
see that shared libs are loaded using mmap() and last time I checked I
was still able to load shared libs on 6.2.1.

chris



That was not the ONLY thing. Believe it or not, it broke one of my previous
Apache ports. Earlier versions of QNX used to allow mmap(MAP_SHARED,
open("/dev/zero")) with the effect equivalent to mmap(MAP_ANON|MAP_SHARED,
NO_FD). That (coincidentally or not) aligned with the way Solaris handles
it, so the Solaris mmap branch also worked on QNX. Then apparently it was
decided that ‘we should not have allowed it in the first place’ and voila, I
had to go and figure out why working code does not work anymore. Must have
been poor old me and my lack of understanding of the API :-P

The only thing between 6.2.0 and 6.2.1, at least. ;-) I am sure that you
probably did get burned by that, igor, but were you not relying on
undocumented behavior in the first place, in a situation where there was a
perfectly valid (and documented) method?

chris



The sources available in CVS seem to be outdated and probably
incomplete … so they are useless to me.

They are older, but totally relevant. Nothing has changed in the code to
mmap() in libc since 1997.

IMHO … that’s off topic. The problem appears if physical memory must
be mapped.

Wrong again. How do you think malloc() works? It uses mmap(), in fact
(thanks Brian) every single virtual address in every single process is
obtained by using mmap(). So all memory, binaries, shared libs, etc are
using mmap() to function.
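As an illustration of how an allocator obtains memory through mmap(), here is a minimal sketch of a toy page allocator, assuming a POSIX system with anonymous mappings. It is not QNX's malloc(); the names `page_alloc` and `page_free` are hypothetical. QNX code in this thread spells the flag MAP_ANON and the descriptor NOFD, while this sketch uses MAP_ANONYMOUS and -1.

```c
#define _DEFAULT_SOURCE  /* exposes MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round a byte count up to a whole number of pages (page size is a power of two). */
static size_t round_to_pages(size_t len) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return (len + page - 1) & ~(page - 1);
}

/* Hypothetical toy allocator: zero-filled private pages straight from mmap(). */
static void *page_alloc(size_t len) {
    void *p = mmap(NULL, round_to_pages(len), PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Release memory obtained from page_alloc() with the same length. */
static void page_free(void *p, size_t len) {
    munmap(p, round_to_pages(len));
}
```

A real malloc() layers small-block bookkeeping on top, but the underlying virtual addresses still come from mapping calls like this one.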

Fact is that all of our resource managers which are using the POSIX mmap
for mapping physical memory are broken and must be modified in order to
use mmap_device_memory().

No, the fact is that your resource managers were doing…

int fd = open( "/dev/mem", … );
mmap( fd, … );

Show me in POSIX OR our docs where this method of accessing devices is
defined, given as example, or recommended? This isn’t about POSIX
mmap() being broken, it is about mmap()ing of a particular device entry
changing in behavior. Is it a bug? I dunno, probably. Is the bug in
mmap()? NO!

Why didn’t you use the documented interfaces from the start? Not like
mmap_device_memory() and mmap_device_io() are new to 6.2.1. Every single
driver example we have ever published uses those functions. And you say
your resmgr’s will crash? Are you not checking the return value from all
the functions in your code, reporting errors and exiting?

Tsk Tsk.

chris



Chris McKillop wrote:

Nonsens … the POSIX mmap() is broken. Only mmap_device_memory() is
working!!



Armin, if you don’t want to be helpful why do you even post?

Ditto … I don’t see any helpful approach in your answers.

My posting is absolutely helpful because it points out a workaround!

Have you ever looked at the code to mmap() and mmap_device_memory()?

It’s not my concern … I only need a working mmap() and
mmap_device_memory().

They are
all built with messages to proc (the SAME message in fact). I have no doubt
that something changed with respect to mmap() on /dev/mem, but if that is a
bug then it is a bug in proc’s handling of the mmaping, not in the mmap
interface itself.

The sources available in CVS seem to be outdated and probably
incomplete … so they are useless to me.

Again, take a look at the code in lib/c/1b/mmap.c and lib/c/ldd.c, you will
see that shared libs are loaded using mmap() and last time I checked I
was still able to load shared libs on 6.2.1.

IMHO … that’s off topic. The problem appears if physical memory must
be mapped.

Fact is that all of our resource managers which are using the POSIX mmap
for mapping physical memory are broken and must be modified in order to
use mmap_device_memory().


Armin

Armin Steinhoff <a-steinhoff@web.de> wrote in message
news:b5qfl3$8g7$1@inn.qnx.com

Chris McKillop wrote:
This isn’t about POSIX mmap() being broken, it is about mmap()ing of a
particular device entry
changing in behavior. Is it a bug? I dunno, probably. Is the bug in
mmap()? NO!

If mmap() returns a value of addr > NULL and this value creates a memory
violation if you use it … you are trying to tell all of us here that
this isn’t a bug??

My documentation says mmap() returns MAP_FAILED if it fails.
Nobody says “MAP_FAILED” is NULL.

Does your program actually check for “MAP_FAILED”?

-xtang
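The distinction xtang raises can be shown with a minimal POSIX sketch: mmap() reports failure by returning MAP_FAILED, which is `(void *)-1`, so a NULL check never fires. The wrapper name `map_page` is hypothetical; it converts MAP_FAILED to NULL in exactly one place and keeps errno for the caller. The one-page length of 4096 bytes is an assumption for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical wrapper: map 4096 bytes from fd, shared and read-only.
   On failure returns NULL and stores errno in *err; on success *err is 0. */
static void *map_page(int fd, int *err) {
    void *p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {   /* MAP_FAILED is ((void *)-1), not NULL */
        *err = errno;
        return NULL;
    }
    *err = 0;
    return p;
}
```

Passing an invalid descriptor such as -1 makes mmap() fail with EBADF; code that only tested `p == NULL` would go on to dereference `(void *)-1`.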

Armin Steinhoff <a-steinhoff@web.de> wrote:
: Fact is that the identical correct mmap() statement isn’t working with
: 6.2.1 … and this statement is simply true!

Is it possible that you’ve tripped over this change in 6.2.1?

mmap()
mmap64()
mmap_device_io()
mmap_device_memory()
If you specify a length less than or equal to 0, these functions return
MAP_FAILED and set errno to EINVAL. (Ref# 11176)

This wasn’t in the original release notes, but it is now.
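The length rule Steve quotes (Ref# 11176) can be checked on any system that follows the same convention. Here is a minimal sketch, assuming a POSIX system with anonymous mappings, that asks for a zero-length mapping and reports the resulting errno. The function name `zero_length_errno` is hypothetical, and the expectation of EINVAL holds for systems that reject zero lengths, such as Linux and QNX 6.2.1 per the note above.

```c
#define _DEFAULT_SOURCE  /* exposes MAP_ANONYMOUS on glibc */
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical probe: request a zero-length mapping and return the errno
   it produces, or 0 if the system unexpectedly accepts the request. */
static int zero_length_errno(void) {
    void *p = mmap(NULL, 0, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p != MAP_FAILED)
        return 0;   /* not expected where zero lengths are rejected */
    return errno;
}
```

Code that computed a length which could reach zero and happened to work on an older release will now get MAP_FAILED, so checking the return value is what surfaces the change.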


Steve Reid stever@qnx.com
TechPubs (Technical Publications)
QNX Software Systems

Chris McKillop wrote:

The available sources of the CVS seems to be outdated and probably
incomplete … so it is useless for me.

They are older, but totally relevant. Nothing has changed in the code to
mmap() in libc since 1997.

Chris, that’s a real JOKE!
Tell me why identical mmap() calls are working with 6.2.0 and not with
6.2.1!!

IMHO … that’s off topic. The problem appears if physical memory must
be mapped.
Wrong again.

Fact is that the identical correct mmap() statement isn’t working with
6.2.1 … and this statement is simply true!

How do you think malloc() works? It uses mmap(), in fact
(thanks Brian) every single virtual address in every single process is
obtained by using mmap(). So all memory, binaries, shared libs, etc are
using mmap() to function.

But I’m sure … these calls are NOT using MAP_PHYS!

Fact is that all of our resource managers which are using the POSIX mmap
for mapping physical memory are broken and must be modified in order to
use mmap_device_memory().

No, the fact is that your resource managers were doing…

They are doing the same with 6.2.0 !

int fd = open( "/dev/mem", … );
mmap( fd, … );

Show me in POSIX OR our docs where this method of accessing devices is
defined, given as example, or recommended?

Just copied from the mmap() doc:

Or share memory with hardware such as video memory on an x86 platform:

/* Map in VGA display memory */
addr = mmap( 0,
65536,
PROT_READ|PROT_WRITE,
MAP_PHYS|MAP_SHARED,
NOFD,
0xa0000 );

Notice the parameter NOFD …

This isn’t about POSIX mmap() being broken, it is about mmap()ing of a particular device entry
changing in behavior. Is it a bug? I dunno, probably. Is the bug in
mmap()? NO!

If mmap() returns a value of addr > NULL and this value creates a memory
violation if you use it … you are trying to tell all of us here that
this isn’t a bug??

Why didn’t you use the documented interfaces from the start?

Sorry … this question is too crazy. Are you saying that mmap() is not
a documented interface??

Not like
mmap_device_memory() and mmap_device_io() are new to 6.2.1. Every single
driver example we have ever published uses those functions. And you say
your resmgr’s will crash?

What a question … I have always said that mmap() is broken and that
mmap_device_memory() is working.

Are you not checking the return value from all
the functions in your code, reporting errors and exiting?

It would be better if some QSSL developers would do this in order to
avoid such annoying mmap() problems.

And BTW … I have been doing high-level system development for more than
25 years. Compared with my experience you are probably a beginner :-)

End-Of-Discussion

Armin

Chris, that’s a real JOKE!
Tell me why identical mmap() calls are working with 6.2.0 and not with
6.2.1!!

Because mmap() is a message Armin. Do you even understand how QNX works?
It is the handler for /dev/mem inside of procnto that has changed, not
mmap() itself. This is like saying that because fs-foo fails to allow
files to be open()'d in 6.2.1 that open() is broken in 6.2.1 (fictional
example, there is no fs-foo). It isn’t open() that is broken, it is
fs-foo. Just like with your mmap() example.
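Chris's argument is architectural: the library call only packs arguments into a message, and the outcome is decided by the server that handles it. The toy sketch below models that split in plain C, with a function pointer standing in for message passing. Everything here is hypothetical (the `map_msg` layout, `handler_old`, `handler_new`, `client_map`) and is not QNX's real message format; the newer handler's rule mirrors the 6.2.1 length change quoted elsewhere in this thread (Ref# 11176).

```c
#include <stdint.h>

/* Hypothetical request message; not QNX's real memory-map message. */
struct map_msg {
    int64_t len;
    int64_t offset;
    int     flags;
};

/* A "server-side" handler; two versions stand in for two OS releases. */
typedef int (*map_handler)(const struct map_msg *msg);

/* Older handler: accepts any request. */
static int handler_old(const struct map_msg *msg) {
    (void)msg;
    return 0;
}

/* Newer handler: rejects non-positive lengths. */
static int handler_new(const struct map_msg *msg) {
    return msg->len <= 0 ? -1 : 0;
}

/* Client-side stub: identical for both releases. It only packs the
   arguments and "sends" them; the verdict comes from the handler. */
static int client_map(map_handler server, int64_t len, int64_t offset, int flags) {
    struct map_msg msg = { len, offset, flags };
    return server(&msg);
}
```

The same `client_map()` call succeeds against `handler_old` and fails against `handler_new` for a zero length, which is the sense in which the behavior change lives in the handler rather than in the stub.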

How do you think malloc() works? It uses mmap(), in fact
(thanks Brian) every single virtual address in every single process is
obtained by using mmap(). So all memory, binaries, shared libs, etc are
using mmap() to function.

But I’m sure … these calls are NOT using MAP_PHYS!

No, but have you looked at what mmap_device_memory() calls?
(can be found in lib/c/qnx/mmap_device_memory.c)

void *mmap_device_memory( void *addr, size_t len, int prot,
int flags, uint64_t physical)
{
return mmap64( addr, len, prot,
(flags & ~MAP_TYPE) | MAP_PHYS|MAP_SHARED,
NOFD, physical);
}

So tell me, if this works how can mmap() be broken? This code has not
changed since 1998 when it was changed to use 64bit values.
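The flag expression in mmap_device_memory() can be checked in isolation. The sketch below uses made-up `DEMO_` constants whose values are assumed for illustration only; the real values live in the target's `<sys/mman.h>`. The function `device_memory_flags` is a hypothetical helper that mirrors the quoted expression: it clears the caller's mapping-type bits, then forces a shared physical mapping.

```c
/* Illustrative flag values, assumed for this sketch only. */
#define DEMO_MAP_SHARED  0x00000001
#define DEMO_MAP_PRIVATE 0x00000002
#define DEMO_MAP_TYPE    0x0000000f  /* mask covering the mapping-type bits */
#define DEMO_MAP_PHYS    0x00010000

/* Mirrors (flags & ~MAP_TYPE) | MAP_PHYS|MAP_SHARED from the quoted code. */
static int device_memory_flags(int caller_flags) {
    return (caller_flags & ~DEMO_MAP_TYPE) | DEMO_MAP_PHYS | DEMO_MAP_SHARED;
}
```

A caller passing MAP_PRIVATE still ends up with a shared physical mapping, while flag bits outside the type mask pass through unchanged. Apart from its 64-bit `physical` argument, the resulting call has the same shape as the documented VGA example with MAP_PHYS|MAP_SHARED and NOFD.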

Just copied from the mmap() doc:

Or share memory with hardware such as video memory on an x86 platform:

/* Map in VGA display memory */
addr = mmap( 0,
65536,
PROT_READ|PROT_WRITE,
MAP_PHYS|MAP_SHARED,
NOFD,
0xa0000 );

Notice the parameter NOFD …

Yes, I notice it. It isn’t opening /dev/mem though is it?

If mmap() returns a value of addr > NULL and this value creates a memory
violation if you use it … you are trying to tell all of us here that
this isn’t a bug??

No, not when it returns MAP_FAILED which is -1. Like I said, don’t you
check your return values? Nowhere in any of the docs does it say that
it will return NULL/0 on failure.

Sorry … this question is too crazy. Are you saying that mmap() is not
a documented interface??

No, I am saying that opening /dev/mem and mmap()ing it is not a documented
interface. Just as igor’s example of opening /dev/zero was not documented
either.

chris


Steve Reid wrote:

Armin Steinhoff <a-steinhoff@web.de> wrote:
: Fact is that the identical correct mmap() statement isn’t working with
: 6.2.1 … and this statement is simply true!

Is it possible that you’ve tripped over this change in 6.2.1?

mmap()
mmap64()
mmap_device_io()
mmap_device_memory()
If you specify a length less than or equal to 0, these functions return
MAP_FAILED and set errno to EINVAL. (Ref# 11176)

So far the theory … here is my mmap() code:

BAR[2] = (char *) mmap(0, 65536, PROT_NOCACHE | PROT_READ |
PROT_WRITE, MAP_PHYS | MAP_SHARED, NOFD, PCI_MEM_ADDR( address ) &
~0xfff);

address is a valid PCI_MEM_ADDR.

This statement works with 6.2.0 and not with 6.2.1 !!

Armin

Chris McKillop wrote:

Chris, that’s a real JOKE!
Tell me why identical mmap() calls are working with 6.2.0 and not with
6.2.1!!

Because mmap() is a message Armin. Do you even understand how QNX works?

Chris, I’ve been working with QNX for more than 10 years … by now I’m
convinced that I know how QNX works.

Do you know what the value of a documented library interface call is?

It should work as documented … even when it is a very low level
function of the OS library.

It is the handler for /dev/mem inside of procnto that has changed, not
mmap() itself. This is like saying that because fs-foo fails to allow
files to be open()'d in 6.2.1 that open() is broken in 6.2.1 (fictional
example, there is no fs-foo). It isn’t open() that is broken, it is
fs-foo. Just like with your mmap() example.

Nice examples … but they are off topic. From the user’s point of view
it is completely unimportant what is failing behind the documented
interface!

Do you know what compatibility between different OS versions means??

How do you think malloc() works? It uses mmap(), in fact
(thanks Brian) every single virtual address in every single process is
obtained by using mmap(). So all memory, binaries, shared libs, etc are
using mmap() to function.

But I’m sure … these calls are NOT using MAP_PHYS!

No, but have you looked at what mmap_device_memory() calls?
(can be found in lib/c/qnx/mmap_device_memory.c)

This has nothing to do with your previous statement.

void *mmap_device_memory( void *addr, size_t len, int prot,
int flags, uint64_t physical)
{
return mmap64( addr, len, prot,
(flags & ~MAP_TYPE) | MAP_PHYS|MAP_SHARED,
NOFD, physical);
}

So tell me, if this works how can mmap() be broken?

This question should be answered by QSSL!

This code has not
changed since 1998 when it was changed to use 64bit values.

I cross my fingers that this interface will not be changed in the
future at least.

Just copied from the mmap() doc:

Or share memory with hardware such as video memory on an x86 platform:

/* Map in VGA display memory */
addr = mmap( 0,
65536,
PROT_READ|PROT_WRITE,
MAP_PHYS|MAP_SHARED,
NOFD,
0xa0000 );

Notice the parameter NOFD …

Yes, I notice it. It isn’t opening /dev/mem though is it?

This isn’t the issue here. Your idea was to provide a real file
descriptor instead of NOFD … but you have deleted this statement from
your response.

If mmap() returns a value of addr > NULL and this value creates a memory
violation if you use it … you are trying to tell all of us here that
this isn’t a bug??

No,
No ?
not when it returns MAP_FAILED which is -1.

I have learned that NULL is > -1 … isn’t it??

Like I said, don’t you
check your return values? Nowhere in any of the docs does it say that
it will return NULL/0 on failure.

I have to correct myself … I have been checking for -1 ever since I
started using mmap().

Armin

Armin Steinhoff wrote:

Steve Reid wrote:

Armin Steinhoff <a-steinhoff@web.de> wrote:
: Fact is that the identical correct mmap() statement isn’t working with
: 6.2.1 … and this statement is simply true!

Is it possible that you’ve tripped over this change in 6.2.1?

mmap()
mmap64()
mmap_device_io()
mmap_device_memory()
If you specify a length less than or equal to 0, these functions return
MAP_FAILED and set errno to EINVAL. (Ref# 11176)


So far the theory … here is my mmap() code:

BAR[2] = (char *) mmap(0, 65536, PROT_NOCACHE | PROT_READ |
PROT_WRITE, MAP_PHYS | MAP_SHARED, NOFD, PCI_MEM_ADDR( address ) &
~0xfff);

address is a valid PCI_MEM_ADDR.

This statement works with 6.2.0 and not with 6.2.1 !!

What is ‘address’?
For mmap(), the offset is a signed 32-bit quantity, so if it’s
above 0x80000000 it will be treated as a negative number and fail.

mmap_device_memory() takes a 64-bit offset so those values work.
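Sunil's point can be reproduced in a few lines of portable C. The sketch below compares a 32-bit physical address passed through a signed 32-bit offset and then widened (sign extension) against the same value widened directly as unsigned (zero extension). The example address 0xE0000000 is an assumption chosen to sit above 0x80000000, not Armin's actual BAR value. Converting an out-of-range value to `int32_t` is implementation-defined in C; GCC and Clang wrap it modulo 2^32, which this sketch assumes.

```c
#include <stdint.h>

/* Path through a signed 32-bit offset, then widened to 64 bits. */
static uint64_t via_signed_off_t(uint32_t addr) {
    int32_t off = (int32_t)addr;      /* 0xE0000000 becomes negative */
    return (uint64_t)(int64_t)off;    /* sign-extends to 0xFFFFFFFFE0000000 */
}

/* Path through an unsigned value widened directly to 64 bits. */
static uint64_t via_unsigned(uint32_t addr) {
    return (uint64_t)addr;            /* zero-extends to 0x00000000E0000000 */
}
```

Below 0x80000000 both paths agree; above it, the sign-extended value names a physical address far beyond 4GB, which matters once the kernel honors 64-bit physical addresses.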

Sunil.

Sunil Kittur wrote:

Armin Steinhoff wrote:

Steve Reid wrote:

Armin Steinhoff <a-steinhoff@web.de> wrote:
: Fact is that the identical correct mmap() statement isn’t working with
: 6.2.1 … and this statement is simply true!

Is it possible that you’ve tripped over this change in 6.2.1?

mmap()
mmap64()
mmap_device_io()
mmap_device_memory()
If you specify a length less than or equal to 0, these functions return
MAP_FAILED and set errno to EINVAL. (Ref# 11176)



So far the theory … here is my mmap() code:

BAR[2] = (char *) mmap(0, 65536, PROT_NOCACHE | PROT_READ |
PROT_WRITE, MAP_PHYS | MAP_SHARED, NOFD, PCI_MEM_ADDR( address ) &
~0xfff);

address is a valid PCI_MEM_ADDR.

This statement works with 6.2.0 and not with 6.2.1 !!


What is ‘address’?
For mmap(), the offset is a signed 32-bit quantity, so if it’s
above 0x80000000 it will be treated as a negative number and fail.

address is provided by the PCI header … and it is defined as a ULONG.
You should know this …

mmap_device_memory() takes a 64-bit offset so those values work.

Again:

the mmap() statement above works with 6.2.0 and not with 6.2.1 !!

Armin

Armin Steinhoff wrote:

Sunil Kittur wrote:

So far the theory … here is my mmap() code:

BAR[2] = (char *) mmap(0, 65536, PROT_NOCACHE | PROT_READ |
PROT_WRITE, MAP_PHYS | MAP_SHARED, NOFD, PCI_MEM_ADDR( address ) &
~0xfff);

address is a valid PCI_MEM_ADDR.

This statement works with 6.2.0 and not with 6.2.1 !!

What is ‘address’?
For mmap(), the offset is a signed 32-bit quantity, so if it’s
above 0x80000000 it will be treated as a negative number and fail.

address is provided by the PCI header … and it is defined as a ULONG.
You should know this …

Yes, but not all valid ULONG values are positive 32-bit off_t values.
What I was asking was whether your ‘address’ value was > 0x80000000.

Both mmap() and mmap_device_memory() call the same message to procnto.
The only difference is that mmap_device_memory() takes a 64-bit signed
value, whereas mmap() converts the 32-bit signed value to 64-bits.

If your ‘address’ is >= 0x80000000, the mmap() conversion will give
a different 64-bit value than the one supplied by mmap_device_memory().

Again:

the mmap() statement above works with 6.2.0 and not with 6.2.1 !!

6.2.1 supports > 4GB physical addresses, so if your ‘address’ value
is >= 0x80000000, it is a different physical address than the one
you get via the 64-bit parameter to mmap_device_memory().

Sunil.

Sunil Kittur <skittur@qnx.com> wrote:

Armin Steinhoff wrote:
Steve Reid wrote:

Armin Steinhoff <a-steinhoff@web.de> wrote:
: Fact is that the identical correct mmap() statement isn’t working with
: 6.2.1 … and this statement is simply true!

Is it possible that you’ve tripped over this change in 6.2.1?

mmap()
mmap64()
mmap_device_io()
mmap_device_memory()
If you specify a length less than or equal to 0, these functions return
MAP_FAILED and set errno to EINVAL. (Ref# 11176)


So far the theory … here is my mmap() code:

BAR[2] = (char *) mmap(0, 65536, PROT_NOCACHE | PROT_READ |
PROT_WRITE, MAP_PHYS | MAP_SHARED, NOFD, PCI_MEM_ADDR( address ) &
~0xfff);

address is a valid PCI_MEM_ADDR.

This statement works with 6.2.0 and not with 6.2.1 !!

What is ‘address’?
For mmap(), the offset is a signed 32-bit quantity, so if it’s
above 0x80000000 it will be treated as a negative number and fail.

mmap_device_memory() takes a 64-bit offset so those values work.

I think Sunil’s hit the nail on the head. The above code is wrong.
It relies on “address” being less than 0x80000000, and if “address” is
a physical address, that is an invalid assumption.

That would explain why you’re seeing mmap fail half the time: half of
physical 32-bit address space is below 0x80000000, the other
half is not.

So are you going to attack me like you’ve attacked everyone else
who’s attempted to help you in this thread, or are you going to thank
Sunil for identifying a problem in your code, like most of the
reasonable, courteous folk we deal with from day to day would do?

Sunil.

the mmap() statement above works with 6.2.0 and not with 6.2.1 !!

6.2.1 supports > 4GB physical addresses, so if your ‘address’ value
is >= 0x80000000, it is a different physical address than the one
you get via the 64-bit parameter to mmap_device_memory().

Sunil.

So how did this work in 6.2.0 as Armin is claiming - was the conversion from
the offset_t to offset64_t in mmap() done by typecasting it to a ULONG?

Jens