OUCH - A LOT ! !

I’ve begun testing an application that I’ve been working on. It kind of
sorts 100,000 elements in an array. There is, however, a lot of CPU work to
compare whether one element is greater than another. I’ve benchmarked this test
in QNX 4 and Nto and I am extremely disappointed in the results.

Under QNX 4 it took 102 seconds. Under QNX 6 it took 305 seconds. Both
benchmarks were run 5 times and the results were identical. All of the data
was pre-read into the array before the timing started, so there is no I/O
involved. Also, the only function called that wasn’t my own was memmove().
This is the exact same code running on Q4 and Q6.

Why is the QNX 6 code 3 times slower?

What compile time options can I use to speed up the execution times under
QNX 6?


Bill Caroselli – 1(530) 510-7292
Q-TPS Consulting
QTPS@EarthLink.net

Also, just for kicks, I tried using an “unsigned long” instead of a “long”
as the index into my array. My Q6 execution time went from 305 seconds to
306 seconds.

I would have thought that 1) there should be no difference, and 2) if there
was a difference, the unsigned version would be faster.


Bill Caroselli – 1(530) 510-7292
Q-TPS Consulting
QTPS@EarthLink.net


“Bill Caroselli (Q-TPS)” <qtps@earthlink.net> wrote in message
news:9r6qrq$62e$1@inn.qnx.com

Have you tried -O (or -O3)? I think that memmove is inlined under QNX4.
Under QNX6 I couldn’t get it to be inlined.


This has a REALLY easy answer.

http://cvs.qnx.com/cgi-bin/cvsweb.cgi/lib/c/string/memmove.c?rev=1.1.1.1&content-type=text/x-cvsweb-markup

And if you dig deeper you will see that there are currently no CPU-optimized
versions of memmove() in the C library. So this means that under Watcom you
have super-optimized copies, and under gcc you are stuck with slow,
byte-by-byte copies. If you can work it so you use memcpy() instead of
memmove() you will probably find the numbers are the same.

chris
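
To make the "byte-by-byte" point concrete, here is a minimal sketch of what a portable, unoptimized memmove() typically looks like. It is an illustration only and is not the QNX library source; the name naive_memmove is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration, not the QNX libc code: move n bytes one at a
 * time, choosing the copy direction so overlapping regions are handled. */
void *naive_memmove(void *to, const void *from, size_t n)
{
    unsigned char *d = to;
    const unsigned char *s = from;

    if ((uintptr_t)d < (uintptr_t)s) {
        /* Destination is below the source: copy front to back. */
        while (n-- > 0)
            *d++ = *s++;
    } else if ((uintptr_t)d > (uintptr_t)s) {
        /* Destination is above the source: copy back to front. */
        d += n;
        s += n;
        while (n-- > 0)
            *--d = *--s;
    }
    return to;
}
```

Every byte costs one load, one store and a loop test, which is why a version that moves 32 bits at a time can be several times faster on large blocks.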




Chris McKillop <cdm@qnx.com>
Software Engineer, QSSL
“The faster I go, the behinder I get.” -- Lewis Carroll

Hi Chris

I looked at that. I also looked at the asm code generated by Watcom’s
memmove(). It moves double words until it is down to the last 3 bytes.

I can write my own memmove that does what Watcom’s memmove did. But I was
wondering, since I’m going for efficiency, would a Pentium run even faster
if I made sure that those double word copies were on 4-byte boundaries?


Bill Caroselli – 1(530) 510-7292
Q-TPS Consulting
QTPS@EarthLink.net


“Bill Caroselli (Q-TPS)” <qtps@earthlink.net> wrote:

Hi Chris

I looked at that. I also looked at the asm code generated by Watcom’s
memmove(). It moves double words until it is down to the last 3 bytes.

I can write my own memmove that does what Watcom’s memmove did. But I was
wondering, since I’m going for efficiency, would a Pentium run even faster
if I made sure that those double word copies were on 4-byte boundaries?

I think you had better; otherwise you will probably force alignment
handling either in hardware or in the kernel.

chris

OK. I whipped this out yesterday. I offer it up for anyone that wants some
improved memmove performance. Everyone is free to grab it and use it.

I am curious. I have seen performance improvements but they have not been
consistent. Let me know how this works for you.


Bill Caroselli – 1(530) 510-7292
Q-TPS Consulting
QTPS@EarthLink.net




[Attachment: MEMMOVE.C (uuencoded C source; encoded data omitted)]
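
For readers who want to see the idea in code, here is a minimal sketch of the approach discussed in this post: copy single bytes until the destination is on a 4-byte boundary, then move 32 bits at a time, then finish the trailing bytes. It is not a reproduction of the attached file; the name memmove_words and the use of fixed-size memcpy() calls for the word moves are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch, not the posted routine. */
void *memmove_words(void *to, const void *from, size_t n)
{
    unsigned char *d = to;
    const unsigned char *s = from;
    uint32_t w;

    if (n == 0 || d == s)
        return to;

    /* Unsigned wrap-around makes this true when d is below s or when the
     * regions do not overlap; in both cases a forward copy is safe. */
    if ((uintptr_t)d - (uintptr_t)s >= n) {
        /* Leading bytes, until the destination is 4-byte aligned. */
        while (n > 0 && ((uintptr_t)d & 3u) != 0) {
            *d++ = *s++;
            n--;
        }
        /* Whole words. The source may still be misaligned, so the word is
         * loaded and stored through 4-byte memcpy() calls, which compilers
         * commonly reduce to single move instructions. */
        while (n >= 4) {
            memcpy(&w, s, 4);
            memcpy(d, &w, 4);
            d += 4;
            s += 4;
            n -= 4;
        }
        /* Trailing bytes. */
        while (n > 0) {
            *d++ = *s++;
            n--;
        }
    } else {
        /* Overlap with the destination above the source: copy backward. */
        d += n;
        s += n;
        while (n > 0 && ((uintptr_t)d & 3u) != 0) {
            *--d = *--s;
            n--;
        }
        while (n >= 4) {
            d -= 4;
            s -= 4;
            memcpy(&w, s, 4);
            memcpy(d, &w, 4);
            n -= 4;
        }
        while (n > 0) {
            *--d = *--s;
            n--;
        }
    }
    return to;
}
```

Each word is read completely before it is written, so the word loop stays correct even when source and destination overlap by fewer than 4 bytes.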

For the record, it wasn’t hard to improve on the QNX V6 memmove(). But I
also benchmarked my routine against QNX V4. If the copy was to an odd
boundary (i.e., not a multiple of 4) then my routine was still faster. But
when the destination buffer was on a 4-byte boundary, the library
version was still blowing my doors off.

I will be working on improving this routine further. My application
requires an efficient memmove.


Bill Caroselli – 1(530) 510-7292
Q-TPS Consulting
QTPS@EarthLink.net
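
A comparison like the one above (aligned versus odd destination) can be measured with a small harness. The sketch below is only one way to do it; the helper name time_memmove, the 4-byte shift between source and destination, and the buffer padding are all assumptions for illustration, and clock() resolution may be too coarse for short runs.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Result sink, so the copies have an observable effect. */
static volatile unsigned char sink;

/* Hypothetical helper: time `reps` overlapping moves of `size` bytes to a
 * destination `dst_off` bytes past the start of a malloc()ed buffer.
 * Returns CPU seconds, or -1.0 if the buffer cannot be allocated. */
double time_memmove(size_t dst_off, size_t size, long reps)
{
    size_t total = size + dst_off + 64;   /* padding covers the shifted source */
    unsigned char *buf = malloc(total);
    clock_t start, stop;
    long i;

    if (buf == NULL)
        return -1.0;
    memset(buf, 0xA5, total);

    start = clock();
    for (i = 0; i < reps; i++)
        memmove(buf + dst_off, buf + dst_off + 4, size);
    stop = clock();

    sink = buf[dst_off];                  /* read back one byte of the result */
    free(buf);
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling it with dst_off set to 0 and then to 1, with the same size and repetition count, gives the two numbers to compare. Swapping memmove() for another routine with the same signature lets the same harness time that routine.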


What is the syntax for writing an ASM routine that can be called from C for
V6?


Bill Caroselli – 1(530) 510-7292
Q-TPS Consulting
QTPS@EarthLink.net


“Bill Caroselli (Q-TPS)” <qtps@earthlink.net> wrote:

What is the syntax of writing an ASM routine that can be called from C for
V6?

Take a look at the .s/S files in the clib in cvs.

chris

Thank you Chris

I took the function that I wrote and generated asm code with ‘qcc -S
memmove.c’. I then determined that I could assemble that code with ‘gcc -c
memmove.s’ and it generated the same object code as the original C code. So
now I’m optimizing the asm code. I should be able to post a much more
optimized memmove function in a day or so.

For those who care, I’m sweating the details of this because I’m writing an
application that needs to move millions of bytes of data several tens of
thousands of times a day. The performance improvements are very necessary for
me. And data will not always be on quad-byte boundaries. That was a
shortcoming of the QNX V4 memmove(). The V4 memmove would take slightly more
than twice as long to move data that wasn’t on quad-byte boundaries.


Bill Caroselli – 1(530) 510-7292
Q-TPS Consulting
QTPS@EarthLink.net


On Wed, 24 Oct 2001 09:40:50 -0700, “Bill Caroselli (Q-TPS)”
<qtps@earthlink.net> wrote:


Is improving memmove (or whatever) the only possibility to
speed up the process (besides moving pointers instead of the
elements)? In some sorting algorithms it is likely
that one of the elements involved in a (lengthy!) comparison
will also be involved in the next, so caching current keys is worth
taking into consideration, though it has nothing to do with the
differences between 4 and 6.

ako
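
The "moving pointers instead of the elements" suggestion can be sketched as follows. The record layout and the string-key comparison are assumptions, since the thread does not show the real element type or comparison; the point is only that the sort then shuffles pointer-sized values rather than whole records.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type; the real element layout is not given. */
struct record {
    char key[32];
    char payload[224];
};

/* qsort() hands the comparator pointers to the array slots, and each slot
 * is itself a pointer to a record. */
static int compare_record_ptrs(const void *a, const void *b)
{
    const struct record *ra = *(const struct record *const *)a;
    const struct record *rb = *(const struct record *const *)b;
    return strcmp(ra->key, rb->key);
}

/* Fill index[0..n-1] with pointers to the records, then sort the pointers.
 * The records themselves never move. */
void sort_by_pointer(struct record *recs, struct record **index, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        index[i] = &recs[i];
    qsort(index, n, sizeof index[0], compare_record_ptrs);
}
```

After the call, walking index[0] through index[n-1] visits the records in key order.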

“Bill Caroselli (Q-TPS)” <qtps@earthlink.net> wrote in message
news:9rekks$9a5$1@inn.qnx.com

Why do you need memmove rather than memcpy? Can your source overlap your
destination?

Tom

Yes.

I have rewritten memmove in asm. The results are so good I can’t believe
them. My newest memmove is clocking in at 66 times faster than the Neutrino
library version. When I am done testing it I’ll post the code.


Bill Caroselli – 1(530) 510-7292
Q-TPS Consulting
QTPS@EarthLink.net
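
The application code is not shown in this thread, but a typical case where source and destination overlap is opening a gap in a sorted array. The sketch below assumes an array of long values and a hypothetical helper named sorted_insert; memcpy() is not guaranteed to work here because the two regions share memory.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: insert `value` into the sorted array a[0..*count-1].
 * The caller must guarantee room for one more element. */
void sorted_insert(long *a, size_t *count, long value)
{
    size_t pos = 0;

    /* Find the first element greater than the new value. */
    while (pos < *count && a[pos] <= value)
        pos++;

    /* Shift the tail up by one slot. Source a[pos..] and destination
     * a[pos+1..] overlap, so memmove() is required. */
    memmove(&a[pos + 1], &a[pos], (*count - pos) * sizeof a[0]);

    a[pos] = value;
    (*count)++;
}
```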


Have you ever used inline assembly for gcc? There’s a nice little tutorial
on the GAS syntax at:

http://www.castle.net/~avly/djasm.html

cheers,

Kris
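
As a small taste of the gcc inline assembly mentioned above, here is a hedged sketch of a forward byte copy built on the x86 REP MOVSB instruction using GCC extended asm. The function name copy_forward is made up for this example, it handles the forward direction only (so it is not a memmove replacement), and it falls back to a plain C loop on other compilers or CPUs.

```c
#include <stddef.h>

/* Hypothetical example of GCC extended asm (AT&T syntax). */
void *copy_forward(void *to, const void *from, size_t n)
{
    void *ret = to;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* "D" = (E/R)DI destination, "S" = (E/R)SI source, "c" = (E/R)CX count.
     * All three are read and modified by the instruction, hence "+".
     * Assumes the direction flag is clear, as the System V ABI specifies
     * on function entry. */
    __asm__ volatile ("rep movsb"
                      : "+D" (to), "+S" (from), "+c" (n)
                      :
                      : "memory");
#else
    unsigned char *d = to;
    const unsigned char *s = from;

    while (n-- > 0)
        *d++ = *s++;
#endif
    return ret;
}
```

The constraint letters tie C variables to specific registers, so no hand-written prologue or argument fetching is needed; the compiler loads the registers before the instruction runs.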

Kris Warkentin
kewarken@qnx.com
(613)591-0836 x9368
“Computer science is no more about computers than astronomy is about telescopes”
–E.W.Dijkstra

New memmove()

OK. So there was one bug remaining and it wasn’t really 66 times as fast.
But here are the new versions. I did one for QNX4 and one for QNX6. The
QNX4 version clocks in about twice as fast if and only if the moves are on
non-quad-byte boundaries. The QNX6 version clocks in anywhere from 3 to 10
times as fast as the library version, depending on direction and whether or
not it is on quad-byte boundaries.

Enjoy. Report any bugs or improvements you may find back to me and I’ll
report it.


Bill Caroselli – 1(530) 510-7292
Q-TPS Consulting
QTPS@EarthLink.net

[Attachment: memmoveQ4.S (uuencoded assembly source, QNX4 version; encoded data omitted)]

[Attachment: memmoveQ6.s (uuencoded assembly source, QNX6 version; encoded data omitted)]