mmap'ed memory access very slow

Standard support is taking a bit long to answer this one,
so I thought I’d try here in the meantime:

The application allocates memory for DMA access as follows:

/* Allocate a physically contiguous buffer */
addr = mmap( 0, 262144,
             PROT_READ | PROT_WRITE | PROT_NOCACHE,
             MAP_PHYS | MAP_ANON,
             NOFD, 0 );

This memory is filled via DMA from a video grabber,
and the customer uses the memory directly for calculations.
However, accessing this memory is about 10 times slower than
accessing malloc’ed memory.

Copying the DMA’ed data to a malloc’ed block
and then using that for the calculations reduces the
calculation time by 90%!

Why is it so much slower?
Is MAP_PHYS needed for DMA?
Thanks

PROT_NOCACHE turns off caching.

-seanb


“Sean Boudreau” <seanb@node25.ott.qnx.com> wrote in message
news:bdccv8$1l2$1@nntp.qnx.com

PROT_NOCACHE turns off caching.

-seanb

Turning off caching is required to be able to “see” the DMA results - right?
So, it seems the solution is to have a second cached buffer used
to do the calculations.

/Kirk


Kirk Bailey <kirk.a.bailey@delphi.com> wrote:

PROT_NOCACHE turns off caching.
Turning off caching is required to be able to “see” the DMA
results - right?

Depends on the underlying hardware/processor/MMU/etc. If you have a
“smart cache” or a “bus snooping” cache then you don’t necessarily need
to. You can examine the SYSPAGE(cacheattr) for the CACHE_FLAG_SNOOPED
attribute to determine this …

So, it seems the solution is to have a second cached buffer used
to do the calculations.

This is certainly the safest thing to do (although not always necessary).

Kirk Bailey wrote:

“Sean Boudreau” <seanb@node25.ott.qnx.com> wrote in message
news:bdccv8$1l2$1@nntp.qnx.com…

PROT_NOCACHE turns off caching.

-seanb


Turning off caching is required to be able to “see” the DMA results - right?
So, it seems the solution is to have a second cached buffer used
to do the calculations.

Thanks guys!
Sean, can you please confirm?

Also, is it necessary to have MAP_PHYS?
Will it have any effect on the access times?

acellarius@yahoo.com wrote:


Thanks guys!
Sean, can you please confirm?

As jgarvey said, platforms with a “snooped” cache don’t
necessarily need PROT_NOCACHE. Even on platforms without
snooping, there may be instructions to invalidate and/or
prefetch the cache after the DMA operation.

Also, is it necessary to have MAP_PHYS?

Yes.

Will it have any effect on the access times?

Not that I’m aware of.

John Garvey wrote:

Kirk Bailey <kirk.a.bailey@delphi.com> wrote:
PROT_NOCACHE turns off caching.
Turning off caching is required to be able to “see” the DMA
results - right?

Depends on the underlying hardware/processor/MMU/etc. If you have a
“smart cache” or a “bus snooping” cache then you don’t necessarily need
to. You can examine the SYSPAGE(cacheattr) for the CACHE_FLAG_SNOOPED
attribute to determine this …

It’s the x86 PC architecture.

Turning off caching is required to be able to “see” the DMA results -
right?
So, it seems the solution is to have a second cached buffer used
to do the calculations.

Can’t you just change the attributes of the memory block after the DMA
transfer and before the calculations, so that the memory can be cached,
using mprotect()?

Paolo


Paolo Messina <ppescher@uiuc.edu> wrote:

Can’t you just change the attributes of the memory block after the DMA
transfer and before the calculations, so that the memory can be cached,
using mprotect()?

The call would be msync(), but on a microkernel the overheads would
probably be too high to flush this continuously at each operation.
Depending on the population-vs-calculation ratio, either use
PROT_NOCACHE or the bounce-buffer approach (perhaps in conjunction
with the snooping-cache detection) … FYI, this is what the filesystem
does: the buffer cache is cacheable, on the grounds that you’ll likely
refer to a cached disk block multiple times in a well-tuned system,
and the emphasis is on the disk driver to bounce via a temporary
NOCACHE buffer if it detects that the MMU is not snooping/invalidating
cache lines after DMA transfers …