Performance of GCC

David_Vainapel · November 8, 2000, 8:55am

I have compiled Python 2.0 on QNX4, Linux, and QNX RTP, all on the
same PC.
The Linux and Neutrino versions were compiled with GCC using -O3; the QNX4
version was compiled with Watcom. Then I ran a simple script to compare the
execution time of two Python constructs.

The results for the three cases are execution times in seconds.

Linux (SuSE 6.4):
1.86 - map with lambda
3.22 - new syntax

QNX 4.25:
1.87 - map with lambda
2.52 - new syntax

Neutrino (QNX RTP):
2.43 - map with lambda
200.05 - new syntax

Watcom being better than GCC is no surprise. But Neutrino is
really too slow.

This is the Python script:

# Compare run times of map with lambda and list comprehensions
# (Python 2 syntax, as used with Python 2.0)
from time import time

n = 1000000

t0 = time()
x = map(lambda x: x + x, range(n))
print '%6.2f - map with lambda' % (time() - t0)

t0 = time()
x = [x + x for x in xrange(n)]
print '%6.2f - new syntax' % (time() - t0)
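On a current Python 3 interpreter the script above will not run unchanged: print is a function, xrange() no longer exists, and map() returns a lazy iterator, so timing it would measure almost nothing. A minimal sketch of the same comparison, assuming Python 3 and the standard timeit module, wraps map() in list() to force the work and reports the best of several runs to reduce timer noise. The function name bench is only for illustration, and its numbers are not comparable with the results posted above.

```python
import timeit


def bench(n=1000000, repeat=3):
    """Best-of-`repeat` wall-clock time (seconds) for each variant."""
    variants = {
        # list() forces the lazy Python 3 map() to build the whole list
        "map with lambda": lambda: list(map(lambda x: x + x, range(n))),
        "new syntax": lambda: [x + x for x in range(n)],
    }
    results = {}
    for name, func in variants.items():
        # number=1: each sample runs the callable once; keep the fastest sample
        results[name] = min(timeit.repeat(func, number=1, repeat=repeat))
    return results


if __name__ == "__main__":
    for name, seconds in sorted(bench().items()):
        print("%6.2f - %s" % (seconds, name))
```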

David
davidv@elisra.com

Warren_Peece1 · November 8, 2000, 1:51pm

Out of curiosity, has anyone tried cross-compiling a QNX6 program under
Linux and comparing its execution time against the same program compiled
natively? I’m wondering whether we can determine if the problem lies in the
version of gcc being used, the QNX6 libraries, or the OS itself. I
wouldn’t expect differences between the compiler under Linux and the
compiler under QNX6 to matter, but until that’s ruled out it’s still an unknown. I
might expect a small variation in gcc-generated code from system to system,
but this is ridiculous.

-Warren


Colin_Burgess1 · November 8, 2000, 6:29pm

Igor Kovalenko <Igor.Kovalenko@motorola.com> wrote:

> Don’t bother blaming gcc. I can’t say for sure what the problem is now, but
> two years back, when we started using Neutrino, I did some simple performance
> tests. One particular problem was the performance of malloc(): it used to be
> about two orders of magnitude slower than on QNX4. This correlates quite well
> with the Python script’s performance, and indeed, being an interpreted language,
> Python must be using malloc() heavily, at least under some circumstances.
>
> I haven’t checked the status of the problem for a while, since we decided at that
> point to avoid using malloc() in CPU-bound code. It would be interesting to see
> QNX’s comments on this.

I’m looking into it, I’m looking into it.

I agree that it’s not the compiler.


–
cburgess@qnx.com

Warren_Peece1 · November 8, 2000, 8:09pm

“Colin Burgess” <cburgess@qnx.com> wrote in message
news:8uc622$dnb$1@nntp.qnx.com…
|
| I’m looking into it, I’m looking into it.
|
| I agree that it’s not the compiler.
|

Maybe someone left some debug stuff active in a core routine… Go Colin!

-Warren

Igor_Kovalenko2 · November 8, 2000, 10:57pm

Don’t bother blaming gcc. I can’t say for sure what the problem is now, but
two years back, when we started using Neutrino, I did some simple performance
tests. One particular problem was the performance of malloc(): it used to be
about two orders of magnitude slower than on QNX4. This correlates quite well
with the Python script’s performance, and indeed, being an interpreted language,
Python must be using malloc() heavily, at least under some circumstances.

I haven’t checked the status of the problem for a while, since we decided at that
point to avoid using malloc() in CPU-bound code. It would be interesting to see
QNX’s comments on this.

Igor


Colin_Burgess1 · November 9, 2000, 10:05pm

Igor Kovalenko <Igor.Kovalenko@motorola.com> wrote:

> Don’t bother blaming gcc. I can’t say for sure what the problem is now, but
> two years back, when we started using Neutrino, I did some simple performance
> tests. One particular problem was the performance of malloc(): it used to be
> about two orders of magnitude slower than on QNX4. This correlates quite well
> with the Python script’s performance, and indeed, being an interpreted language,
> Python must be using malloc() heavily, at least under some circumstances.
>
> I haven’t checked the status of the problem for a while, since we decided at that
> point to avoid using malloc() in CPU-bound code. It would be interesting to see
> QNX’s comments on this.

Yup, it was malloc. Or more specifically, realloc.

The Python code is using realloc to grow a list implemented as an
array. It’s doing it a HUGE number of times, and growing it by 1 element
each time.

The Linux realloc presumably notices that you are growing, and gives you
a whole swag of memory to reduce the need to malloc and memcpy.

The QNX realloc is a lot more conservative wrt memory allocation, and so
it ends up having to malloc and memcpy chunks that are about 4 MB in size
almost every time the list is grown.

So some would say that our realloc sucks, but most would agree that the
Python list implementation sucks even worse, and the Linux realloc is
just covering up for bad code.
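Colin’s explanation can be made concrete with a toy model. This is an assumption-laden sketch, not Python’s actual list code or either C library’s realloc: it assumes that whenever a growing array runs out of capacity, the whole block is moved and its contents copied. Asking for exactly one more element each time copies 0 + 1 + ... + (n-1) = n(n-1)/2 elements in total, while multiplying the capacity by a constant factor (doubling here, an arbitrary choice) keeps the total below 2n. Real allocators can sometimes extend a block in place, which this model ignores.

```python
def copies_exact_fit(n):
    """Element copies when capacity always equals the requested size (grow by 1)."""
    capacity = 0
    copied = 0
    for size in range(1, n + 1):
        if size > capacity:
            copied += capacity   # assumed: the old contents are moved to a new block
            capacity = size      # ask for exactly what is needed, nothing more
    return copied


def copies_geometric(n, factor=2):
    """Element copies when capacity is multiplied by `factor` once it runs out."""
    capacity = 0
    copied = 0
    for size in range(1, n + 1):
        if size > capacity:
            copied += capacity
            capacity = max(size, int(capacity * factor))
    return copied


if __name__ == "__main__":
    for n in (1000, 10000, 100000):
        print("%7d  exact-fit: %12d copies   doubling: %8d copies"
              % (n, copies_exact_fit(n), copies_geometric(n)))
```

The quadratic column is the pattern Colin describes; the linear one is roughly what an allocator (or the list code itself) buys by over-allocating.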

–
cburgess@qnx.com

Warren_Peece1 · November 9, 2000, 11:12pm

So I guess that means if you’re memory constrained, you use the malloc() family
as-is. If you’re looking for speed, then you’re better off doing something
custom. As a suggestion, how about a couple of different malloc() libraries,
one for embedded systems (slow but efficient), and one for desktop systems
(fast and less efficient)?

-Warren


Igor_Kovalenko2 · November 10, 2000, 2:20am

Colin Burgess wrote:

> So some would say that our realloc sucks, but most would agree that the
> Python list implementation sucks even worse, and the Linux realloc is
> just covering up for bad code.

That does not explain why QNX4 is reasonably fast. It also does not
explain my own experiments: I was merely doing

for(i=0; i<HUGE_NUMBER;i++) malloc(SOME_AMOUNT);

and yet it was still 200 times slower on NTO than on QNX4. I guess I
should run those tests again.

Igor

Colin_Burgess1 · November 10, 2000, 3:38am

Igor Kovalenko <kovalenko@home.com> wrote:

> Colin Burgess wrote:
>
>> So some would say that our realloc sucks, but most would agree that the
>> Python list implementation sucks even worse, and the Linux realloc is
>> just covering up for bad code.
>
> That does not explain why QNX4 is reasonably fast. It also does not
> explain my own experiments: I was merely doing

Well, for one, the QNX4 malloc/free never gives any memory back to the
system, whereas the Neutrino one does.
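Why giving memory back can cost speed is easy to illustrate with another toy model. It is hypothetical and describes neither the QNX4 nor the Neutrino allocator: one policy keeps freed blocks on a private list for reuse, the other hands every freed block straight back to the system, so the next allocation has to ask the system again. The counter os_requests stands in for whatever system call a real allocator would make (an mmap() request, for example).

```python
class ToyAllocator(object):
    """Hypothetical allocator model; blocks are represented only by their size."""

    def __init__(self, release_on_free):
        self.release_on_free = release_on_free  # True: give freed memory back at once
        self.cached = []       # sizes of freed blocks kept for later reuse
        self.os_requests = 0   # how many times memory had to come from the system

    def malloc(self, size):
        for i, cached_size in enumerate(self.cached):
            if cached_size >= size:
                return self.cached.pop(i)  # reuse a retained block: no system request
        self.os_requests += 1              # stand-in for a real system request
        return size

    def free(self, block):
        if not self.release_on_free:
            self.cached.append(block)      # keep it around for the next malloc
        # otherwise the block is handed straight back to the system


def os_requests_for_loop(release_on_free, iterations, size):
    """Count system requests for a loop of malloc(size) followed by free()."""
    heap = ToyAllocator(release_on_free)
    for _ in range(iterations):
        block = heap.malloc(size)
        heap.free(block)
    return heap.os_requests
```

With 1000 iterations the retaining policy asks the system once and the releasing policy asks 1000 times; the trade-off is that the retaining allocator never shrinks its footprint.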

> for(i=0; i<HUGE_NUMBER;i++) malloc(SOME_AMOUNT);
>
> and yet it was still 200 times slower on NTO than on QNX4. I guess I
> should run those tests again.

Probably. I think the malloc implementation changed at some point from
the original version.

–
cburgess@qnx.com

Colin_Burgess1 · November 10, 2000, 3:41am

Warren Peece <warren@nospam.com> wrote:

> So I guess that means if you’re memory constrained, you use the malloc() family
> as-is. If you’re looking for speed, then you’re better off doing something
> custom. As a suggestion, how about a couple of different malloc() libraries,
> one for embedded systems (slow but efficient), and one for desktop systems
> (fast and less efficient)?

I agree. Things that are never meant to run on embedded systems wouldn’t
worry so much about memory usage over speed.

Of course, Linux has a swapping VM, so they don’t worry at all about
memory usage.


–
cburgess@qnx.com

Colin_Burgess1 · November 10, 2000, 3:55pm

Armin Steinhoff wrote:

>> So some would say that our realloc sucks, but most would agree that the
>> Python list implementation sucks even worse, and the Linux realloc is
>> just covering up for bad code.
>
> That’s not correct … you are barking up the
> wrong tree.
>
> The handling of the memory alloc/realloc calls in
> the QNX library isn’t optimal for list processing
> of big lists with many small list elements. The
> QNX library is optimized for embedded systems …

This is exactly my point. The Python code is assuming that the
local system’s realloc implementation is fast at growing an
object by small amounts.

–
cburgess@qnx.com

Armin_Steinhoff1 · November 10, 2000, 4:03pm

Colin Burgess wrote:

> Igor Kovalenko <Igor.Kovalenko@motorola.com> wrote:
>> Don’t bother blaming gcc. I can’t say for sure what the problem is now, but
>> two years back, when we started using Neutrino, I did some simple performance
>> tests. One particular problem was the performance of malloc(): it used to be
>> about two orders of magnitude slower than on QNX4. This correlates quite well
>> with the Python script’s performance, and indeed, being an interpreted language,
>> Python must be using malloc() heavily, at least under some circumstances.
>>
>> I haven’t checked the status of the problem for a while, since we decided at that
>> point to avoid using malloc() in CPU-bound code. It would be interesting to see
>> QNX’s comments on this.
>
> Yup, it was malloc. Or more specifically, realloc.

It is the way the QNX library allocates or
reallocates chunks of memory.

> The Python code is using realloc to grow a list implemented as an
> array. It’s doing it a HUGE number of times, and growing it by 1 element
> each time.

Yes … and the QNX realloc adds exactly the
requested amount of memory to the list of
allocated memory chunks. That’s OK for an embedded
system where memory is an issue.

The Linux realloc call adds a much bigger chunk of
memory at once, and this chunk can be several
times bigger than the requested amount of memory.

That means that with Linux the list of memory chunks
can be ~10 times shorter than the list built with
QNX for the same amount of allocated (or
reallocated) memory … and this leads to
dramatic performance issues, e.g. when pieces
of memory are given back to the system in an
arbitrary order. (I did extensive measurements.)

A Linux app may, e.g., have to search
a list of 10,000 chunks, while a
QNX app has to walk a list of 100,000 chunks,
and that gets badly slooow.
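Why the length of the chunk list matters can be sketched with a first-fit search over a singly linked free list. This allocator design is an assumption chosen for illustration, not a description of the QNX or GNU malloc internals: the search visits every chunk that is too small before it reaches one that fits, so ten times as many small chunks means roughly ten times as many steps.

```python
class Chunk(object):
    """One free chunk in a hypothetical singly linked free list."""

    def __init__(self, size, next_chunk=None):
        self.size = size
        self.next = next_chunk


def build_free_list(sizes):
    """Link the chunks in the given order and return the head."""
    head = None
    for size in reversed(sizes):
        head = Chunk(size, head)
    return head


def first_fit(head, request):
    """Return (first chunk with size >= request or None, nodes visited)."""
    visited = 0
    node = head
    while node is not None:
        visited += 1
        if node.size >= request:
            return node, visited
        node = node.next
    return None, visited
```

For example, with 10,000 chunks of 16 bytes ahead of a single 4096-byte chunk, a 1024-byte request visits 10,001 nodes; with 100,000 small chunks it visits 100,001.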

> The Linux realloc presumably notices that you are growing, and gives you
> a whole swag of memory to reduce the need to malloc and memcpy.
>
> The QNX realloc is a lot more conservative wrt memory allocation, and so
> it ends up having to malloc and memcpy chunks that are about 4 MB in size
> almost every time the list is grown.

The 4 MB relates to the heap size; the
BLOCKSIZE is 24 KB, IMHO.

> So some would say that our realloc sucks, but most would agree that the
> Python list implementation sucks even worse, and the Linux realloc is
> just covering up for bad code.

That’s not correct … you are barking up the
wrong tree.

The handling of the memory alloc/realloc calls in
the QNX library isn’t optimal for list processing
of big lists with many small list elements. The
QNX library is optimized for embedded systems …

If processing of such lists is an issue,
replace the alloc/realloc calls
of QNX with the calls of the GNU malloc module.

Armin

Steve_Furr · November 10, 2000, 8:20pm

In article <8ufaro$68e$1@inn.qnx.com>, Warren Peece <warren@nospam.com> wrote:

> So I guess that means if you’re memory constrained, you use the malloc() family
> as-is. If you’re looking for speed, then you’re better off doing something
> custom. As a suggestion, how about a couple of different malloc() libraries,
> one for embedded systems (slow but efficient), and one for desktop systems
> (fast and less efficient)?

No. Don’t read too much into this. What it says is that doing a
realloc(ptr,size+1) is a bad idea. It doesn’t suggest that malloc
performance is bad, Igor’s comments aside. Igor’s comments were related
to the same malloc algorithm on QNX4 and Neutrino some time ago. The
malloc algorithm has since been improved in any case, and, to the
best of my knowledge, procnto was also enhanced.

The Neutrino version performed poorly at that time because of the
manner in which procnto responded to a lot of small mmap() requests.
On the whole, the malloc performance under QNX4 was actually better
than the previous Watcom allocator.

Having different malloc implementations for different purposes is a good
idea, but it takes careful consideration. Performance is highly
dependent on application behaviour, and optimizations that you
assume will be better may hinder throughput in practice and, in
the worst case, negatively impact priority behaviour for other threads
(I hesitate to say realtime, because realtime threads shouldn’t
be using malloc()).
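One common way to keep a time-critical thread away from malloc() entirely, in the spirit of Igor’s decision to avoid malloc() in CPU-bound code, is to carve fixed-size blocks out of memory obtained once, up front. The class below is a hypothetical sketch of such a pool: allocation and release are constant-time pops and pushes on a free list of block indices, and an exhausted pool returns None instead of asking for more memory. In Python this only models the idea; a C version would typically thread the free list through the blocks themselves.

```python
class FixedPool(object):
    """Hypothetical fixed-size block pool; all block memory is allocated once."""

    def __init__(self, block_size, count):
        self.block_size = block_size
        self.storage = bytearray(block_size * count)     # every block lives here
        self.free_list = list(range(count - 1, -1, -1))  # pop() hands out 0, 1, 2, ...

    def alloc(self):
        """Return a block handle (an index), or None if the pool is exhausted."""
        if not self.free_list:
            return None
        return self.free_list.pop()

    def free(self, handle):
        """Return a block to the pool with a constant-time push."""
        self.free_list.append(handle)

    def view(self, handle):
        """Writable view of the block's bytes inside the shared storage."""
        start = handle * self.block_size
        return memoryview(self.storage)[start:start + self.block_size]
```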

As an example, U. Texas has an allocator for concurrent programs
that minimizes blocking factors. I mentioned it to Peter V., and he
tried it out. Reportedly, it compiled out of the box and had
substantial benefit on programs with many threads. I’ll look
for a URL.


–

Steve Furr email: furr@qnx.com
QNX Software Systems, Ltd.

Steve_Furr · November 10, 2000, 8:38pm

In article <8ufqp4$k24$2@inn.qnx.com>, Colin Burgess <cburgess@qnx.com> wrote:

> Warren Peece <warren@nospam.com> wrote:
>> So I guess that means if you’re memory constrained, you use the malloc() family
>> as-is. If you’re looking for speed, then you’re better off doing something
>> custom. As a suggestion, how about a couple of different malloc() libraries,
>> one for embedded systems (slow but efficient), and one for desktop systems
>> (fast and less efficient)?
>
> I agree. Things that are never meant to run on embedded systems wouldn’t
> worry so much about memory usage over speed.
>
> Of course, Linux has a swapping VM, so they don’t worry at all about
> memory usage.

Oh, I almost forgot. Also at the University of Texas, a couple of
years back, there was a thesis from Mark Johnstone that could turn
a lot of generally accepted principles of memory management on their
heads. Segmented allocators, such as power-of-two allocators, have
very good performance, but are considered wasteful because of internal
fragmentation.

Johnstone factored out many of the implementation considerations
(overhead from the implementation, rather than from the allocation policy)
and found that, for a large body of applications, the
fragmentation from such a policy was no worse than that of
other policies.
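The internal fragmentation that makes power-of-two allocators look wasteful is easy to compute for a given list of request sizes. In this sketch every request is rounded up to the next power of two, with a minimum class of 16 bytes (an arbitrary assumption); the waste is the rounded-up size minus what was asked for. A request just over a power of two wastes almost half its block. Johnstone’s result concerns measured application behaviour, which a synthetic calculation like this cannot reproduce.

```python
def size_class(request, min_class=16):
    """Smallest power of two that is >= request (and >= min_class)."""
    size = min_class
    while size < request:
        size *= 2
    return size


def internal_fragmentation(requests, min_class=16):
    """Fraction of allocated bytes that were never requested."""
    allocated = sum(size_class(r, min_class) for r in requests)
    requested = sum(requests)
    return (allocated - requested) / float(allocated)
```

For example, a 65-byte request lands in the 128-byte class, wasting 63 of 128 bytes, while requests that are already powers of two waste nothing.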


–

Steve Furr email: furr@qnx.com
QNX Software Systems, Ltd.

Warren_Peece1 · November 10, 2000, 9:15pm

What I’m looking for are some choices. If QNX put out an “embedded lib” and a
“desktop lib”, each tuned appropriately for the target environment, I could
choose which way I wanted to go at compile time, and if necessary easily
produce two different executables for two different purposes. As I understand
it from other newsgroups, QNX6 is targeted as an embedded operating system, so
it’s totally understandable that they would make the trade-off choices in favor
of constrained memory systems. I however have an embedded project underway, a
more traditional desktop requirement, and finally a requirement for dual or
quad CPUs and as much memory as I can stuff in a box (12 to 16 gigabytes). I
want to use QNX6 for all of it, and it just plain doesn’t make sense to trade
speed for low memory use when you’ve got 2GB or more to play with. There’s
absolutely no reason why Linux should kick QNX6’s butt in memory allocation
speed when all that’s required is a little library management geared towards
two different end environments. That’s not to say that I’d use the “T” word to
describe the process of supplying two different libraries (perhaps with a
third, common one), but it doesn’t sound insurmountable at the outset now, does
it?

We can certainly supply our own memory allocators, but I think it would be a
lot cleaner and less hassle for those not inclined to undertake that project if
QNX supplied some official libraries. I realize that realloc( ptr, size + 1 )
is a bad idea, but my main point is that it doesn’t have to be as slow as it is
today. IMHO, the desktop world, where QNX6 is playing whether it’s being
targeted or not, will see this as a fairly significant problem that’s going to
draw an awful lot of flak if it’s not addressed in some manner.

Just toss that two-cents worth in the pile by the door…

-Warren


Armin_Steinhoff1 · November 11, 2000, 11:33am

Warren Peece wrote:

> What I’m looking for are some choices. If QNX put out an “embedded lib” and a
> “desktop lib”, each tuned appropriately for the target environment, I could
> choose which way I wanted to go at compile time, and if necessary easily
> produce two different executables for two different purposes.

There is a choice … use e.g. the GNU malloc
module.

Armin

As I understand
it from other newsgroups, QNX6 is targeted as an embedded operating system, so
it’s totally understandable that they would make the trade-off choices in favor
of constrained memory systems. I however have an embedded project underway, a
more traditional desktop requirement, and finally a requirement for dual or
quad CPUs and as much memory as I can stuff in a box (12 to 16 gigabytes). I
want to use QNX6 for all of it, and it just plain doesn’t make sense to trade
speed for low memory use when you’ve got 2GB or more to play with. There’s
absolutely no reason why Linux should kick QNX6’s butt in memory allocation
speed when all that’s required is a little library management geared towards
two different end environments. That’s not to say that I’d use the “T” word to
describe the process of supplying two different libraries (perhaps with a
third, common one), but it doesn’t sound insurmountable at the outset now, does
it?

We can certainly supply our own memory allocators, but I think it would be a
lot cleaner and less hassle for those not inclined to undertake that project if
QNX supplied some official libraries. I realize that realloc( ptr, size + 1 )
is a bad idea, but my main point is that it doesn’t have to be as slow as it is
today. IMHO, the desktop world, where QNX6 is playing whether it’s being
targeted or not, will see this as a fairly significant problem that’s going to
draw an awful lot of flak if it’s not addressed in some manner.

Just toss that two-cents worth in the pile by the door… >

-Warren

“Steve Furr” <> furr@qnx.com> > wrote in message news:8uhlae$d7v$> 1@nntp.qnx.com> …
| In article <8ufaro$68e$1@inn.qnx.com>, Warren Peece <warren@nospam.com>
| wrote:
| >So I guess that means if you’re memory constrained, you use the malloc()
| >family as-is. If you’re looking for speed, then you’re better off doing
| >something custom. As a suggestion, how about a couple of different malloc()
| >libraries, one for embedded systems (slow but efficient), and one for
| >desktop systems (fast and less efficient)?
|
|
| No. Don’t read too much into this. What it says is that doing a
| realloc(ptr,size+1) is a bad idea. It doesn’t suggest that malloc
| performance is bad, Igor’s comments aside. Igor’s comments were related
| to the same malloc algorithm on QNX4 and Neutrino some time ago. The
| malloc algorithm has since been improved in any case, and, to the
| best of my knowledge, procnto was also enhanced.
|
| The Neutrino version performed poorly at that time because of the
| manner in which procnto responded to a lot of small mmap() requests.
| On the whole, the malloc performance under QNX4 was actually better
| than the previous Watcom allocator.
|
| Different malloc implementations for different purposes is a good
| idea, but it takes careful consideration. Performance is highly
| dependent on application behaviour, and optimizations that you
| assume will be better may hinder throughput in practice and, in
| the worst case, negatively impact priority behaviour for other threads
| (I hesitate to say realtime, because realtime threads shouldn’t
| be using malloc()).
|
| As an example, U. Texas has an allocator for concurrent programs
| that minimizes blocking factors. I mentioned it to Peter V., and he
| tried it out. Reportedly, it compiled out of the box and had a
| substantial benefit on programs with many threads. I’ll look
| for a URL.
|
| >-Warren
|
|
| >“Colin Burgess” <cburgess@qnx.com> wrote in message
| >news:8uf742$s62$1@nntp.qnx.com…
| >| Igor Kovalenko <Igor.Kovalenko@motorola.com> wrote:
| >| > Don’t bother blaming gcc. I can’t say for sure what the problem is now,
| >| > but 2 years back, when we started using Neutrino, I did some simple
| >| > performance tests. One particular problem was the performance of
| >| > malloc(): it used to be about 2 orders of magnitude slower than on QNX4.
| >| > This correlates quite well with the Python scripts’ performance, and,
| >| > being an interpreted language, Python must be using malloc() heavily,
| >| > at least under some circumstances.
| >|
| >| > I did not check the status of the problem for a while, since we decided
| >| > at that point to avoid using malloc() in CPU-bound code. It would be
| >| > interesting to see QNX’s comments on this.
| >|
| >| Yup, it was malloc. Or more specifically, realloc.
| >|
| >| The python code is using realloc to grow a list implemented as an
| >| array. It’s doing it a HUGE amount of times, and growing it by 1 element
| >| each time.
| >|
| >| The Linux realloc presumably notices that you are growing, and gives you
| >| a whole swag of memory to reduce the need to malloc and memcpy.
| >|
| >| The QNX realloc is a lot more conservative wrt memory allocation, and so
| >| it ends up having to malloc and memcpy chunks that are about 4Mb in size
| >| almost every time the list is grown.
| >|
| >| So some would say that our realloc sucks, but most would agree that the
| >| python list implementation sucks even worse, and the Linux realloc is
| >| just covering up for bad code.
| >|
| >| –
| >| cburgess@qnx.com
|
|
|
|

–

Steve Furr email: furr@qnx.com

QNX Software Systems, Ltd.
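Steve’s point is that the real problem is growing a block by one element at a time. At the script level, one way to sidestep repeated growth when the final length is known in advance is to allocate the list once and fill it by index. This is a minimal sketch assuming a modern Python 3 interpreter (the thread’s script is Python 2.0), and the function names are hypothetical. Current CPython lists already over-allocate when appended to, so the timing gap there may be small; treat preallocation as a way to stop depending on realloc behaviour rather than as a guaranteed speedup.

```python
import time


def build_by_append(n):
    # Grows the list one element at a time; in CPython a list comprehension
    # also appends element by element, relying on the list's resize policy.
    out = []
    for i in range(n):
        out.append(i + i)
    return out


def build_preallocated(n):
    # Allocates the full list once, then fills it by index, so the list is
    # never resized while it is being filled.
    out = [None] * n
    for i in range(n):
        out[i] = i + i
    return out


if __name__ == "__main__":
    n = 1_000_000
    for fn in (build_by_append, build_preallocated):
        t0 = time.perf_counter()
        fn(n)
        print("%6.2f - %s" % (time.perf_counter() - t0, fn.__name__))
```

Both functions build the same list; only the allocation pattern differs.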

Warren_Peece1 · November 11, 2000, 4:29pm

“Armin Steinhoff” <A-Steinhoff@web_.de> wrote in message
news:3A0D2E8F.E907B574@web_.de…

Warren Peece wrote:

What I’m looking for are some choices. If QNX put out an “embedded lib” and a
“desktop lib”, each tuned appropriately for the target environment, I could
choose which way I wanted to go at compile time, and if necessary easily
produce two different executables for two different purposes.

There is a choice … use e.g. the GNU malloc
module.

Armin

[snip]

Apparently you missed this paragraph:

We can certainly supply our own memory allocators, but I think it would be a
lot cleaner and less hassle for those not inclined to undertake that project if
QNX supplied some official libraries. I realize that realloc( ptr, size + 1 )
is a bad idea, but my main point is that it doesn’t have to be as slow as it is
today. IMHO, the desktop world, where QNX6 is playing whether it’s being
targeted or not, will see this as a fairly significant problem that’s going to
draw an awful lot of flak if it’s not addressed in some manner.

I was implying writing your own allocator, but linking the GNU allocator
would also qualify as a project (especially if you’re a newbie).

What makes you think that the malloc module is the only place where a
tradeoff was made in favor of embedded systems that has an effect on
performance? I think it would be nice if QNX would acknowledge that there
are a few areas that could benefit from some alternate algorithms geared
more towards the desktop environment, and provide us with some official
compile or link time options so we can tailor our programs appropriately.
If there are business reasons why it’s not going to happen, then we can
certainly substitute our own stuff at will. It’s just a suggestion.

-Warren

Armin_Steinhoff1 · November 12, 2000, 8:19pm

Warren Peece wrote:

“Armin Steinhoff” <A-Steinhoff@web_.de> wrote in message
news:3A0D2E8F.E907B574@web_.de…

Warren Peece wrote:

What I’m looking for are some choices. If QNX put out an “embedded lib”
and a “desktop lib”, each tuned appropriately for the target environment, I
could choose which way I wanted to go at compile time, and if necessary
easily produce two different executables for two different purposes.

There is a choice … use e.g. the GNU malloc
module.

Armin

[snip]

Apparently you missed this paragraph:

We can certainly supply our own memory allocators, but I think it would
be a lot cleaner and less hassle for those not inclined to undertake that
project if QNX supplied some official libraries. I realize that realloc( ptr,
size + 1 ) is a bad idea, but my main point is that it doesn’t have to be as
slow as it is today. IMHO, the desktop world, where QNX6 is playing whether
it’s being targeted or not, will see this as a fairly significant problem
that’s going to draw an awful lot of flak if it’s not addressed in some manner.

I was implying writing your own allocator, but linking the GNU allocator
would also qualify as a project (especially if you’re a newbie).

It’s just a re-compile ‘project’ … so it’s not
a big issue.

What makes you think that the malloc module is the only place where a
tradeoff was made in favor of embedded systems that has an effect on
performance?

… measurements and traces with DejaView.

I think it would be nice if QNX would acknowledge that there
are a few areas that could benefit from some alternate algorithms geared
more towards the desktop environment, and provide us with some official
compile or link time options so we can tailor our programs appropriately.

I support that. Two different libs could do the
job.

Armin

Colin_Burgess1 · November 13, 2000, 3:36am

Armin Steinhoff <A-Steinhoff@web_.de> wrote:

I was implying writing your own allocator, but linking the GNU allocator
would also qualify as a project (especially if you’re a newbie).

It’s just a re-compile ‘project’ … so it’s not
a big issue.

Great - can I have a copy? ;v)

What makes you think that the malloc module is the only place where a
tradeoff was made in favor of embedded systems that has an effect on
performance?

… measurements and traces with DejaView.

Er, are we talking about the same operating system here? Deja view is
only available for QNX4.

–
cburgess@qnx.com

Igor_Kovalenko2 · November 13, 2000, 6:47am

Last time I checked it wasn’t available for QNX4 either ;v)
Anyway, the Neutrino version is in development and might finally become
something other than just a deja view. What a brilliant name you’ve
chosen for this thingie …

igor

Colin Burgess wrote:

Armin Steinhoff <A-Steinhoff@web_.de> wrote:

I was implying writing your own allocator, but linking the GNU allocator
would also qualify as a project (especially if you’re a newbie).

It’s just a re-compile ‘project’ … so it’s not
a big issue.

Great - can I have a copy? ;v)

What makes you think that the malloc module is the only place where a
tradeoff was made in favor of embedded systems that has an effect on
performance?

… measurements and traces with DejaView.

Er, are we talking about the same operating system here? Deja view is
only available for QNX4.

–
cburgess@qnx.com
