Tricks of the trade - Optimisation

Hi,

Can anyone help me with the following problem:

We have a process that has to run at 125Hz (input from an external device),
and currently it takes about 11ms to compute a single iteration. There will
be other processes running on the machine at higher priority as well.

So, I need to get the loop time down to 5ms. The process is heavy on maths
(matrix algebra), implemented using MTL.

I've tried various optimisation options in gcc, but they don't seem to make
any difference. Can anyone tell me which options are best for speed
optimisation of C++ with templates?


Dan

“Dan” <none@no.spam> wrote in message news:8tngm0$4fi$2@inn.qnx.com


I would be EXTREMELY surprised if an optimisation flag with gcc gives
you that kind of improvement. What you're asking for is HUGE.

  • Rework your algorithm.
  • Get a faster CPU.
  • Use an Athlon processor instead of Intel; they are much faster at
    handling floating point.
  • If you are using QRTP, get a two-processor machine.




I dunno as I haven't seen any code, but as part of reworking your
algorithm you can save quite a bit of time by inelegant coding, i.e. not
using function calls and class methods (your own ones, I'm talking about)
in favour of having all your code inline. This will cut out a lot of stack
usage, and if you have a lot of iterations it all adds up. As an example:
I have a bit of code that's used to program a gate array and has a loop
which bit-bashes the data in over 77000 cycles. Time to program using
class methods: 26 secs, as opposed to 3 secs for inline code.

You can use 'register' for your loop variables as well - that can
sometimes shave a bit off…
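
A minimal sketch of the kind of rewrite described above (the names here -
GateArray, write_bit, program_* - are invented for illustration, not taken
from the code in question):

    // Hypothetical illustration of inlining a per-iteration method call.
    #include <cstdint>

    struct GateArray {
        volatile uint8_t* port;    // memory-mapped output port (assumed)
        void write_bit(bool b);    // imagine this defined in another .cpp,
    };                             // so the compiler cannot inline it

    void GateArray::write_bit(bool b) { *port = b ? 1 : 0; }

    // Slow version: one method call per bit, 77000+ times over.
    void program_slow(GateArray& ga, const uint8_t* data, int nbits) {
        for (int i = 0; i < nbits; ++i)
            ga.write_bit((data[i / 8] >> (i % 8)) & 1);
    }

    // Fast version: the same logic written inline, no per-bit call overhead.
    void program_fast(volatile uint8_t* port, const uint8_t* data, int nbits) {
        for (int i = 0; i < nbits; ++i)
            *port = (data[i / 8] >> (i % 8)) & 1;
    }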

Also perhaps look at using the MMX instructions; I believe they were
intended specifically for matrix operations at very high speeds. I'm
fairly certain that to get what you want, you're not going to be able to
use high-level C++ constructs and third-party libraries, unless someone
really went out of their way to optimize for the x86 environment.



Thanks for your input. The problem is most likely that the 9x9 matrix
inversion takes too long. We are looking at that. I do realise that to
double the speed I have to look at the implementation of the algorithm and
perhaps the algorithm itself; however, I also want to find out if compiler
options can help me as well.

I tried to compile using various options (all options listed below), but
they don't appear to have much effect at all. I recall working on an HP-UX
machine ages ago which produced significantly different execution times
depending on optimisation options (different code, in C, hence not a valid
comparison).

Can anyone suggest which combination of options would be most suitable for
C++ template libraries?

thanks

Dan

Optimization options

-fbranch-probabilities
-fcaller-saves -fcse-follow-jumps -fcse-skip-blocks
-fdelayed-branch -fexpensive-optimizations
-ffast-math -ffloat-store -fforce-addr -fforce-mem
-ffunction-sections -finline-functions
-fkeep-inline-functions -fno-default-inline
-fno-defer-pop -fno-function-cse
-fno-inline -fno-peephole -fomit-frame-pointer
-frerun-cse-after-loop -fschedule-insns
-fschedule-insns2 -fstrength-reduce -fthread-jumps
-funroll-all-loops -funroll-loops
-O -O0 -O1 -O2 -O3
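
To check whether a given flag combination actually helps, it is worth
timing the loop directly rather than judging by feel. A minimal harness (a
sketch only; compute_iteration() stands in for the real loop body and is
assumed, not part of the code above):

    // Times N iterations of the work under test and reports the mean.
    #include <time.h>
    #include <stdio.h>

    extern void compute_iteration();   // the loop body under test (assumed)

    int main() {
        const int N = 1000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; ++i)
            compute_iteration();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ms = (t1.tv_sec - t0.tv_sec) * 1e3
                  + (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("%.3f ms per iteration\n", ms / N);
        return 0;
    }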



“Dan” <none@no.spam> wrote in message news:8tqgoh$krp$1@inn.qnx.com


Compilers are much better these days. If you already use -O2 with gcc,
there is little more you can do. There are some options that help math
functions like sin(), cos(), etc., but a matrix inversion doesn't use
them.



Hi…

Does anyone know if there is such a thing as a two-processor PC104?

Just asking…

Bests…

Miguel




You might also try the following flags:
-malign-loops=2 -malign-jumps=2 -malign-functions=2

I saw an improvement of about 10% in moderately complex C code. YMMV




In article <8tqgoh$krp$1@inn.qnx.com>, Dan <none@no.spam> wrote:


I’m not very familiar with all of the GCC optimization options, but
a lot of them, such as inlining, would relate to the nature of the
templates. For example, there is no way GCC is going to inline
a template member, but an inline member function in the template
class is a possibility. I don’t think any other optimizations are
going to have a significant impact on template usage.
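
A sketch of the distinction above (Vec is a made-up class for
illustration, not an MTL type): a member function defined inside the
template class body is implicitly inline, and so is a candidate for
expansion at the call site.

    // Members defined in-class are implicitly inline.
    template <typename T, int N>
    class Vec {
        T v[N];
    public:
        T& operator[](int i) { return v[i]; }   // implicitly inline
        T dot(const Vec& o) const {             // in-class definition:
            T s = T();                          // candidate for inlining
            for (int i = 0; i < N; ++i)         // at each call site
                s += v[i] * o.v[i];
            return s;
        }
    };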

In general, for matrix operations, loop unrolling and common
sub-expression elimination may help, if the library hasn't already
been coded that way.
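
For instance, with a 9x9 matrix the trip counts are compile-time
constants, so the inner loops can be unrolled by hand (or by
-funroll-loops). A hand-unrolled sketch, assuming plain C arrays rather
than the MTL types:

    // 9x9 matrix times 9-vector, inner loop unrolled by a factor of 3.
    void mat9_vec9(const double A[9][9], const double x[9], double y[9]) {
        for (int i = 0; i < 9; ++i) {
            double s = 0.0;
            for (int j = 0; j < 9; j += 3)
                s += A[i][j]   * x[j]
                   + A[i][j+1] * x[j+1]
                   + A[i][j+2] * x[j+2];
            y[i] = s;
        }
    }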

One point that was only raised peripherally is that some floating
point functions (most notably the transcendental functions) can take
much longer in the library than on the co-processor. This is because
libmath is almost entirely software floating point. This won't make
a difference for operations that are simple bitfield manipulation, but
will for some others. I have a libfp that issues the FPU opcodes
and provides C9X floating point compatibility, but I don't think it
has ever been shipped.

You may want to check any math optimizations to see if gcc can
provide any of the standard math routines as intrinsics if
you compile for Pentium or better.


Steve Furr email: furr@qnx.com
QNX Software Systems, Ltd.

For x86 I've never seen any. It would be quite a challenge: SMP requires
a specific chipset and only works with Pentium-class processors, so I
think it's close to impossible to fit on a PC104 card.


Isn't there a PCI version of PC104?

Is it PC106?



qcc rejects the -malign options: "cc: unknown option…"

It's called PC104-Plus, or PC104+.

Actually, a PC104+ has even less room for components, since there is an
extra connector.


9x9 matrix inversion - which algorithm are you using?

There are algorithms which give an exact answer in a fixed number of
steps, and, if I remember my math courses, there are other algorithms
which just start from one "guess", normally [ Identity 9x9 ], and iterate
until the error is less than 0.00x%. Depending on what kind of values you
have, one might be faster than the other. If you are willing to accept
less exact values for better speed, you might consider the iterative kind;
see the sketch below. Some numerical analysis book might help you.
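
One concrete scheme of the start-from-a-guess kind is the Newton-Schulz
iteration X <- X(2I - AX). This is a textbook sketch, not necessarily what
MTL uses; the initial guess A^T / (||A||_1 * ||A||_inf) guarantees
convergence:

    #include <cmath>

    const int N = 9;

    static void matmul(const double a[N][N], const double b[N][N],
                       double c[N][N]) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) {
                double s = 0.0;
                for (int k = 0; k < N; ++k) s += a[i][k] * b[k][j];
                c[i][j] = s;
            }
    }

    // Iterate X <- X(2I - AX) until ||I - AX|| (max-abs) drops below tol.
    void invert_iterative(const double A[N][N], double X[N][N],
                          double tol = 1e-10, int maxit = 50) {
        // Initial guess X0 = A^T / (||A||_1 * ||A||_inf).
        double r = 0, c = 0;
        for (int i = 0; i < N; ++i) {
            double rs = 0, cs = 0;
            for (int j = 0; j < N; ++j) {
                rs += std::fabs(A[i][j]);
                cs += std::fabs(A[j][i]);
            }
            if (rs > r) r = rs;
            if (cs > c) c = cs;
        }
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) X[i][j] = A[j][i] / (r * c);

        double AX[N][N], T[N][N], Xn[N][N];
        for (int it = 0; it < maxit; ++it) {
            matmul(A, X, AX);
            double err = 0;
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j) {
                    double d = (i == j ? 1.0 : 0.0) - AX[i][j];
                    if (std::fabs(d) > err) err = std::fabs(d);
                }
            if (err < tol) return;          // converged to tolerance
            for (int i = 0; i < N; ++i)     // T = 2I - AX
                for (int j = 0; j < N; ++j)
                    T[i][j] = (i == j ? 2.0 : 0.0) - AX[i][j];
            matmul(X, T, Xn);               // X = X * T
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j) X[i][j] = Xn[i][j];
        }
    }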

Another thing you can do is to profile your code, find the "loop" which
takes most of the CPU, and maybe put that part inside a function and
implement that function in heavily hand-optimized assembly code, possibly
using MMX or other techniques. There are some nice books about Pentium ASM
optimization.

Another way to optimize might be to multiply everything by 1,000,000, say,
and treat everything as long or int64_t (long long), which could be
faster. The only problem is if you have extremely different magnitudes in
your data:

[ 10^-12  10^5   ]
[ 10^5    10^-12 ]

might not give the results you want, but

[ 1000  12   ]
[ 500   4320 ]

will give good results.
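
A sketch of that fixed-point idea. Here the scale is 2^16 (Q16.16) rather
than 1,000,000, since a power-of-two scale makes the rescale after each
multiplication a cheap shift:

    #include <cstdint>

    const int32_t SCALE = 1 << 16;                  // Q16.16 fixed point

    inline int32_t to_fix(double x)  { return (int32_t)(x * SCALE); }
    inline double  to_dbl(int32_t x) { return (double)x / SCALE; }

    inline int32_t fix_mul(int32_t a, int32_t b) {
        return (int32_t)(((int64_t)a * b) >> 16);   // widen, then rescale
    }
    // Works well when values are of similar magnitude, as noted above;
    // entries like 1e-12 next to 1e5 fall outside the fixed range.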

Another way would be to replace recursion with an explicit stack: I
remember, a few years ago, implementing QuickSort with a simple array
stack and gotos, which was faster than the normal recursive version.
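
A sketch of that explicit-stack QuickSort (without the gotos):

    // QuickSort with the recursion replaced by an array of pending ranges.
    void quicksort_iter(int* a, int n) {
        if (n < 2) return;
        int stack[128];                // holds (lo, hi) pairs
        int top = 0;
        stack[top++] = 0; stack[top++] = n - 1;
        while (top > 0) {
            int hi = stack[--top], lo = stack[--top];
            if (lo >= hi) continue;
            int p = a[(lo + hi) / 2];  // Hoare-style partition around p
            int i = lo, j = hi;
            while (i <= j) {
                while (a[i] < p) ++i;
                while (a[j] > p) --j;
                if (i <= j) { int t = a[i]; a[i] = a[j]; a[j] = t; ++i; --j; }
            }
            // Push the larger subrange first so the stack stays shallow.
            if (j - lo > hi - i) {
                stack[top++] = lo; stack[top++] = j;
                stack[top++] = i;  stack[top++] = hi;
            } else {
                stack[top++] = i;  stack[top++] = hi;
                stack[top++] = lo; stack[top++] = j;
            }
        }
    }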

The first thing to do is to understand what algorithm is used; take a math
book and figure out how you could implement a better one.

Good Luck,

Fred.

J2K Library
http://j2k.sourceforge.net/



For those not familiar with my problem: I have a process that uses a lot
of matrix algebra, for which we used the MTL implementation. The code was
slow, about 18ms per iteration. I needed to get the cycle time down to
2-3ms (max about 5, I guessed at the time), and thanks to all those that
replied in the previous thread suggesting a rewrite of the code, because
that's what was done, to spectacular results.

The code was recompiled with a different matrix algebra class and a lot of
overhead ripped out, so now I'm happy because we got the cycle time down
to 1.5ms! That is better than 10 times faster than the original MTL code!

We also compiled the code using Microsoft C++ v6 and got very interesting
results: the MS code runs twice (!) as fast on NT as the qcc code on
QNX RTP!!! Both are on the same machine (just swap the HD), a P-MMX
200MHz. Even when I pump up the priority it runs in the same time, since
there is not much else running.

This begs a few questions:

  1. Why is MTL so slow on gcc 2.95.2? Is it a problem of handling
     templates?

  2. Why does Microsoft produce faster code? Or better: why does gcc
     produce slow code?

  3. Will additional optimiser options speed it up? (I'm using -O2 and
     inline.)

Also, I wonder if anyone has similar experience, or is this specific to my
code? Of course two examples don't prove anything, but is there a case for
checking the compiler and benchmarking it properly against others? It may
be a significant issue if you need to go from a 200MHz to a 450MHz
processor to achieve the same thing in real time.

Have you tried producing the ASM code (e.g. with gcc's -S option), just to
take a look under the hood?



BR, Andrej

“Dan” <none@no.spam> wrote in message news:8ub3jg$jlr$1@inn.qnx.com


Visual C++ is a very good compiler. Obviously GCC has a different agenda
than Visual C++. Personally it saddens me, because Neutrino is IMHO a
synonym for performance, but it's all undermined by GCC. But that's a
subject for qdn.public.qnxrtp.advocacy.


I had the same experience as you; the difference between Visual C++ and
GCC was on the order of 30% (that's still a lot).


I know there used to be talk of MetroWerks CodeWarrior being available for
the PC platform to cross-develop (I think it even went beta, as I seem to
recall people complaining about it :-)), and I thought the intent was to
make a "native" port as well for what was then called Neutrino (but it was
hosted under QNX4 at the time, so I'm not sure where that leaves us).
Perhaps someone cares to speculate wildly about the future possibilities
of this actually happening for RtP, and perhaps someone can comment on the
quality of the code generated by the Metrowerks compiler…

-Warren



I was using the Metrowerks CodeWarrior beta release (with Neutrino 2.0 -
I'm a paying customer!), which actually used the gcc toolchain. The IDE
was actually fairly nice, but a bit buggy. I was eagerly waiting for the
release, which the documentation stated would be supplied with the
Metrowerks compiler. Apparently, since Metrowerks were bought by Motorola,
things have slowed down dramatically. So much, in fact, that I've now
started using the RTP for development.

I don't know if anyone else has any insight, but the last thing I heard
was: don't hold your breath for it…

Jim



Damn! I was hoping for something better than gcc. Someone port Watcom
already, then make it support ELF…

-Warren

