Assembly code speed compared to C

Simon_Wakley · July 10, 2001, 6:04pm

I talk to some hardware via the parallel port. I write 1 byte out and
read back 2 sets of 4 bits to get a byte back. This is to some custom
outdated hardware. I wrote a device driver to do this, and it was
horribly slow. I then wrote the code into my function (root enabled
etc) and now it is almost tolerably fast, but it is slower than I need.
It takes 2-4ms to write and read about 40 bytes.

If I wrote it in assembler would it be a lot faster? If so, is there
anywhere I can read up on doing so with QNX RTP on an Intel 350 MHz Pentium?

Thanks

Simon Wakley
Visit our web site at www.cameracontrol.com

Mario_Charest1 · July 10, 2001, 7:08pm

“Simon Wakley” <Simonospam@cameracontrol.com> wrote in message
news:BkZqJBAqO0S7EwGt@cameracontrol.com…

I talk to some hardware via the parallel port. I write 1 byte out and
read back 2 sets of 4 bits to get a byte back. This is to some custom
outdated hardware. I wrote a device driver to do this, and it was
horribly slow. I then wrote the code into my function (root enabled
etc) and now it is almost tolerably fast, but it is slower than I need.
It takes 2-4ms to write and read about 40 bytes.

OK, to read 40 bytes means you write 40 and read 80, right?
So that’s 120 bytes over the port. Parallel ports typically manage about
1 Mbyte/s, so it should take about 120 us, way less than 2-4 ms, unless a
process of higher priority is constantly stealing CPU time.

If I wrote it in assembler would it be a lot faster??

First make sure you compile with -O2 (turn on optimisation). Could make a
big difference.

The following C code:

out8(0,0);
a = in8(0);
a |= (in8(0)<<4);

When compiled with -O2, this gives:

movb $0,%al
outb %al,$0
inb $0,%al
movb %al,%dl
inb $0,%al
salb $4,%al
orb %al,%dl

There doesn’t seem to be much to save there.

What happens is that -O2 allows gcc to replace in*() and out*() with inline
instructions. Mucho faster!
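The two-nibble read above can be sketched in a form that runs without the hardware. This is only an illustration: `port_read_fn`, `read_byte_as_nibbles` and the fake port are made-up names, and it assumes each read delivers its nibble in the low four bits. On the real system the callback would wrap in8().

```c
#include <stdint.h>

/* Hypothetical callback standing in for in8(): takes a port number
 * and returns the byte read from it. */
typedef uint8_t (*port_read_fn)(uint16_t port);

/* Read one byte as two 4-bit halves, low nibble first, mirroring
 *     a = in8(0); a |= (in8(0) << 4);
 * Each read is masked (an addition to the original snippet) so that
 * stray upper bits cannot corrupt the result. */
static uint8_t read_byte_as_nibbles(port_read_fn rd, uint16_t port)
{
    uint8_t lo = rd(port) & 0x0F;
    uint8_t hi = rd(port) & 0x0F;
    return (uint8_t)(lo | (hi << 4));
}

/* Fake port for trying the function on a desktop:
 * alternates between 0x0C and 0x0A on successive reads. */
static const uint8_t fake_nibbles[] = { 0x0C, 0x0A };
static unsigned fake_index = 0;

static uint8_t fake_port_read(uint16_t port)
{
    (void)port;                       /* the fake ignores the port number */
    return fake_nibbles[fake_index++ % 2];
}
```

Calling `read_byte_as_nibbles(fake_port_read, 0)` with the fake reset to its start combines 0x0C and 0x0A into 0xAC.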

Rennie_Allen2 · July 10, 2001, 9:52pm

40 bytes every 4 ms only works out to be 10 KB/sec. With a driver (i.e.
without building the driver into your app’s function), you should easily
be able to transfer 40 bytes in less than 400 us (i.e. 10 us/byte, which
is at least 10x faster than what you are seeing) on the hardware you
have.
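The arithmetic behind those figures, written out as a small sketch. The helper names are invented for illustration, and integer microseconds keep the results exact.

```c
#include <stdint.h>

/* Throughput in bytes per second for `bytes` moved in `usec` microseconds. */
static uint32_t bytes_per_second(uint32_t bytes, uint32_t usec)
{
    return (uint32_t)(((uint64_t)bytes * 1000000u) / usec);
}

/* Average cost of one byte in microseconds (rounded down). */
static uint32_t usec_per_byte(uint32_t bytes, uint32_t usec)
{
    return usec / bytes;
}
```

With the numbers in this thread, 40 bytes in 4000 us is 10000 bytes/s (100 us per byte), while 40 bytes in 400 us is 10 us per byte.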

Clearly there is something very wrong here, regardless of any efficiency
differences between assembler and C. Are you using the parallel port
interrupt? Could you post your code?

Simon_Wakley · July 11, 2001, 2:24am

Thanks,

I have done further tests, and I am getting better performance than I
thought, but still not what I had hoped for. It takes on average 700 us
to send 82 bytes and 1700 us to receive 102. The slower receive is no
doubt due to the hardware having to read 2 nybbles. I am having new
hardware done to make it a single read, and then I should get the same
performance from the read as from the write. Of course, the other end
was taking 1000 us to respond!

If this is the best performance I can expect, I guess I will have to try
something else!

BTW: -O2 did not seem to affect the speed at all. I am using qcc.

Simon

code follows:
////

SETTLE is NOT defined!

int ppi_write(int nbytes, char *buf)
{
    register unsigned char *p = (unsigned char *)buf;
    register int i;
    unsigned char data, test;
    unsigned long j = loclock(1) + rb_timeout;   /* timeout deadline */
    float v = 3.4;
    long rv;

    i = nbytes;

    while (i--)
    {
#ifdef SETTLE
        /* Check IACK is stable */
        do
        {
            test = (in8(STATUS) & IACK);
            if (loclock(0) >= j)
            {
                printf("RB: Write timeout on testing IACK\n");
                goto timeout;
            }
            //printf("Test is %2x %2x\n", test, IACK & IACKTRUE);
        } while (test == (IACK & IACKTRUE));
#endif

        /* Present data on output latch */
        out8(DATA, *p++);

        /* Assert IVALID */
        out8(CONTROL, SETIVAL);

        /* Await IACK */
        while ((in8(STATUS) & IACK) != (IACK & IACKTRUE))
        {
            // Timeout code here
            if (loclock(0) >= j)
            {
                printf("RB: Write timeout at %d bytes of %d\n",
                       nbytes - i - 1, nbytes);
                goto timeout;
            }
            //printf("Waiting IACK\n");
        }

        /* De-assert IVALID */
        out8(CONTROL, IDLE);
#ifdef SETTLE
        do
        {
            data = (in8(CONTROL) & 0x1f);
            if (loclock(0) >= j)
            {
                printf("RB: Write timeout on de-asserting IVALID\n");
                goto timeout;
            }
        } while (data != IDLE);
#endif
    }

    return nbytes;

timeout:

    return RB_TIMEOUT;
}

///
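In the loop above, loclock() runs on every poll of STATUS, so if that call is expensive its cost is paid on every byte. Below is a minimal sketch of one way to reduce that, reading the clock only every Nth poll. All names here (`wait_for_bits`, `status_fn`, `clock_fn`, `CLOCK_CHECK_INTERVAL`) are hypothetical, the callbacks merely stand in for in8(STATUS) and loclock(0), and whether it helps depends on what loclock() actually costs.

```c
#include <stdint.h>

typedef uint8_t (*status_fn)(void);        /* stands in for in8(STATUS) */
typedef unsigned long (*clock_fn)(void);   /* stands in for loclock(0)  */

#define CLOCK_CHECK_INTERVAL 64u  /* polls between clock reads; arbitrary */

/* Spin until (status & mask) == want.
 * Returns 0 on success, -1 if `deadline` is reached first.
 * The clock is consulted only once per CLOCK_CHECK_INTERVAL polls,
 * so a handshake that completes quickly never pays for a clock call. */
static int wait_for_bits(status_fn rd, clock_fn now,
                         uint8_t mask, uint8_t want,
                         unsigned long deadline)
{
    unsigned polls = 0;

    while ((rd() & mask) != want) {
        if (++polls % CLOCK_CHECK_INTERVAL == 0 && now() >= deadline)
            return -1;
    }
    return 0;
}

/* Desktop stand-ins for trying the loop without hardware. */
static unsigned fake_polls_until_ack;   /* polls before the bit appears   */
static unsigned long fake_time;         /* advances by 1 per clock read   */
static unsigned fake_clock_reads;       /* how often the clock was called */

static uint8_t fake_status(void)
{
    if (fake_polls_until_ack == 0)
        return 0x40;                    /* arbitrary "acknowledge" bit */
    fake_polls_until_ack--;
    return 0x00;
}

static unsigned long fake_clock(void)
{
    fake_clock_reads++;
    return fake_time++;
}
```

The trade-off is that a timeout can be detected up to CLOCK_CHECK_INTERVAL polls late, which is usually acceptable when the timeout only guards against dead hardware.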

Mitchell_Schoenbrun · July 11, 2001, 4:17am

Any Pentium processor is much faster than the I/O speed of
a parallel port. Unless the subroutine overhead is massive,
or the compiler turns out pure crap, porting your code to
assembler won’t matter a lot. Neither of the two
conditions mentioned is true, to my knowledge.

Previously, Simon Wakley wrote in qdn.public.qnxrtp.os:

I talk to some hardware via the parallel port. I write 1 byte out and
read back 2 sets of 4 bits to get a byte back. This is to some custom
outdated hardware. I wrote a device driver to do this, and it was
horribly slow. I then wrote the code into my function (root enabled
etc) and now it is almost tolerably fast, but it is slower than I need.
It takes 2-4ms to write and read about 40 bytes.

If I wrote it in assembler would it be a lot faster?? If so, is there
anywhere I can read up on doing so with QNX RTP on Intel 350Mhz Pentium.

--
Mitchell Schoenbrun --------- maschoen@pobox.com

Mario_Charest1 · July 11, 2001, 12:28pm

“Simon Wakley” <Simon@cameracontrol.com> wrote in message
news:s7OR+AA3j7S7EwSJ@cameracontrol.com…

Thanks,

I have done further tests, and I am getting better performance than I
thought, but still not what I had hoped for. It takes average 700us to
send 82 bytes

It takes a minimum of 3 out() and 2 in() to write a byte, so that’s 5
instructions, each running at about 1 MHz, roughly 1 us apiece (the
parallel port is usually implemented over the ISA bus internally).
That’s 5 us per byte to write; times 82, that gives around 400 us.
There is 300 us still not accounted for. It could well be the delay
getting the ACK signal back and the overhead in the rest of the code.
What does loclock() do? Try making it an inline function.
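That estimate can be written out as a small calculation. A sketch only: the function names are invented, and the 1 us per port access is the rough ISA-speed figure assumed above, not a measurement.

```c
#include <stdint.h>

/* Estimated time spent purely on port I/O, in microseconds. */
static uint32_t io_time_usec(uint32_t bytes, uint32_t accesses_per_byte,
                             uint32_t usec_per_access)
{
    return bytes * accesses_per_byte * usec_per_access;
}

/* Whatever the measurement shows beyond the pure I/O estimate
 * (handshake waits, clock calls, loop overhead). Zero if none. */
static uint32_t unexplained_usec(uint32_t measured_usec, uint32_t io_usec)
{
    return measured_usec > io_usec ? measured_usec - io_usec : 0;
}
```

For 82 bytes at 5 accesses each this gives 410 us of I/O, leaving about 290 us of the measured 700 us unexplained, in line with the rounded 400 us and 300 us quoted above.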

[cut]

If this is the best performance I can expect, I guess I will have to try
something else!

Without changing hardware I don’t think you’ll do a lot better than this.
You may want to look at more sophisticated parallel port modes, such
as EPP or ECP. You could even use DMA.

Phil_Olynyk1 · July 11, 2001, 2:30pm

Mario Charest wrote:

Without changing hardware I don’t think you’ll do a lot better than this.
You may want to look at more sophisticated parallel port modes, such
as EPP or ECP. You could even use DMA.

In order to use EPP, ECP, or DMA, you will have to change your external
hardware to use the 8 data lines instead of the 4 control bits for data
input. Both EPP and ECP use the control bits for handshaking (in logic
instead of by program), so that’s a complication. Also, AFAIK, DMA
requires EPP or ECP, plus a lot of (well, some, anyway) software setup,
so there’s a bit more overhead, especially on short transfers.

However… most parallel ports support PS/2 byte mode, so that may be the
easiest way to go. You still have to change your external hardware, and
you probably do want to use the control bits for handshaking (of course,
if you didn’t handshake the nibbles, you don’t have to handshake the
bytes… )

There are several sites with parallel port programming info, but none
fall readily to keyboard just now… For those of you with money, the
IEEE 1284 spec is a very good reference (it’s the official spec, after
all).

BTW, for most SuperI/O chips, you are still stuck with ISA bus speeds
(6-8 MHz), no matter how fast your Pentium. I don’t know if the PCI
parallel ports are faster or not because I never tried any of them. In
theory…
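A rough sketch of what a byte-mode read looks like in code, assuming the conventional PC parallel port layout (data register at base+0, control register at base+2, bit 5 of the control register switching the data lines to input). The `port_io` hooks and all function names are hypothetical, handshaking is left out, and the port chip’s data sheet is the authority on whether bit 5 behaves this way on a given board.

```c
#include <stdint.h>

#define LPT_DATA(base)    ((uint16_t)((base) + 0))
#define LPT_CONTROL(base) ((uint16_t)((base) + 2))
#define LPT_CTRL_DIR_IN   0x20u  /* bit 5: data lines become inputs */

/* Hypothetical I/O hooks; on QNX these would wrap in8()/out8(). */
struct port_io {
    uint8_t (*in)(uint16_t port);
    void    (*out)(uint16_t port, uint8_t value);
};

/* Read one whole byte from the data lines in byte mode:
 * turn the data port around, then a single read replaces
 * the two nibble reads. */
static uint8_t lpt_read_byte_mode(const struct port_io *io, uint16_t base)
{
    uint8_t ctrl = io->in(LPT_CONTROL(base));
    io->out(LPT_CONTROL(base), (uint8_t)(ctrl | LPT_CTRL_DIR_IN));
    return io->in(LPT_DATA(base));
}

/* Fake three-register port at 0x378 for trying this without hardware. */
static uint8_t fake_regs[3] = { 0xA5, 0x00, 0x0C };

static uint8_t fake_in(uint16_t port)
{
    return fake_regs[port - 0x378];
}

static void fake_out(uint16_t port, uint8_t value)
{
    fake_regs[port - 0x378] = value;
}

static const struct port_io fake_io = { fake_in, fake_out };
```

In a real driver the direction bit would normally be set once before a block of reads rather than on every byte, so the per-byte cost drops to the handshake plus a single data read.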

Phil Olynyk
…snip…

Rennie_Allen2 · July 11, 2001, 6:04pm

It takes a minimum of 3 out() and 2 in() to write a byte, so that’s 5
instructions, each running at about 1 MHz, roughly 1 us apiece (the
parallel port is usually implemented over the ISA bus internally).
That’s 5 us per byte to write; times 82, that gives around 400 us.
There is 300 us still not accounted for. It could well be the delay
getting the ACK signal back and the overhead in the rest of the code.
What does loclock() do? Try making it an inline function.

I agree. The delay getting the IACK back is most likely what is using
the majority of the time.