mmap64() and PCI "cache line size" = 0

Hi all!

I have to mmap64() a physical address range that covers some registers
and internal RAM of a PCI device.

From “pci -v” I can see that my device has 1 Mbyte of space, with
cache-line-size=0 (= uncacheable).

Which are the risks of mapping the region without the flag
PROT_NOCACHE ?

Does the hardware (chipset and host/PCI bridge) already take care of
caching for me, ensuring that every read/write access from my process
becomes a PCI target access to the device? Or is there a risk that
several write accesses get posted or reordered before reaching the device?
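
For reference, the mapping I do today looks roughly like this (the base
address and register offset below are just placeholders; in reality they
come from the BAR reported by “pci -v”):

#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>

/* Placeholder values: in the real driver the base address comes from
 * the BAR reported by "pci -v". */
#define DEV_BASE_PADDR  0xF0000000ULL   /* hypothetical BAR address */
#define DEV_SIZE        (1024 * 1024)   /* 1 Mbyte of registers/RAM */

int main(void)
{
    volatile uint32_t *regs;

    /* Map the device's physical window into the process.
     * The question: what do I risk if I drop PROT_NOCACHE here? */
    regs = mmap64(NULL, DEV_SIZE,
                  PROT_READ | PROT_WRITE | PROT_NOCACHE,
                  MAP_PHYS | MAP_SHARED,
                  NOFD, DEV_BASE_PADDR);
    if (regs == MAP_FAILED) {
        perror("mmap64");
        return 1;
    }

    regs[0x10 / 4] = 0x1;   /* hypothetical control register at offset 0x10 */

    munmap((void *)regs, DEV_SIZE);
    return 0;
}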

Thanks!
Davide


/* Ancri Davide - */

Ping!


/* Ancri Davide - */

The implications involve synchronizing the cache manually for all DMA
transactions.
You have to flush the cache after each outbound DMA transfer and invalidate
before taking inbound data. You also need to align the DMA memory to your
cache line size.
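
Roughly, with QNX's libcache the pattern is something like the sketch
below (the buffer structure and function names are just placeholders):

#include <stdint.h>
#include <stddef.h>
#include <sys/cache.h>      /* QNX libcache: link with -lcache */

/* Hypothetical descriptor for a DMA buffer; vaddr/paddr/len come from
 * however the driver allocates its DMA memory. */
struct dma_buf {
    void     *vaddr;
    uint64_t  paddr;
    size_t    len;
};

static struct cache_ctrl cinfo;

/* One-time initialization of the cache-control library. */
int dma_cache_setup(void)
{
    return cache_init(0, &cinfo, NULL);
}

/* Before the device reads the buffer (outbound data):
 * push dirty CPU cache lines out to RAM. */
void dma_flush_outbound(struct dma_buf *b)
{
    CACHE_FLUSH(&cinfo, b->vaddr, b->paddr, b->len);
}

/* After the device has written the buffer (inbound data), before the
 * CPU touches it: discard any stale cache lines. */
void dma_inval_inbound(struct dma_buf *b)
{
    CACHE_INVAL(&cinfo, b->vaddr, b->paddr, b->len);
}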

“Davide Ancri” <falsemail@nospam.xx> wrote in message
news:ep9qht$72q$1@inn.qnx.com

Ping!


/* Ancri Davide - */

Igor Kovalenko wrote:

The implications involve synchronizing the cache manually for all DMA
transactions. You have to flush the cache after each outbound DMA
transfer and invalidate before taking inbound data. You also need to
align the DMA memory to your cache line size.

Until now, for DMA memory (I mean host RAM accessed by PCI devices
acting as DMA masters) I have never used the PROT_NOCACHE flag, because
AFAIK the x86 chipset ensures cache coherency by snooping the accesses
initiated by the DMA masters. Flushing/invalidating the caches can
improve performance, but even without it, data coherency is guaranteed.
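
To be concrete, this is more or less how I allocate a DMA buffer today,
without PROT_NOCACHE (the function name and variables are just for the
example):

#include <stdint.h>
#include <sys/mman.h>

/* Sketch of a DMA buffer allocation that relies on the chipset
 * snooping bus-master accesses, so no PROT_NOCACHE is used. */
void *alloc_dma_buffer(size_t size, off64_t *paddr)
{
    void   *buf;
    size_t  contig;

    /* MAP_PHYS | MAP_ANON gives physically contiguous RAM */
    buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_PHYS | MAP_ANON, NOFD, 0);
    if (buf == MAP_FAILED)
        return NULL;

    /* Physical address to program into the device's DMA engine */
    if (mem_offset64(buf, NOFD, size, paddr, &contig) == -1) {
        munmap(buf, size);
        return NULL;
    }
    return buf;
}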

My question was about PCI device registers, mapped by the driver as
“simple” virtual memory: can leaving this memory “cached” in the
mmap64() call make the accesses incoherent? Things like several writes
combined into a single write access (not so nice for registers!), or
reads that take data from the cache and not from the real device…

Thanks!
Davide


/* Ancri Davide - */

Davide Ancri <falsemail@nospam.xx> wrote:

My question was about PCI device registers, mapped by the driver as
“simple” virtual memory: can leaving this memory “cached” in the
mmap64() call make the accesses incoherent? Things like several writes
combined into a single write access (not so nice for registers!), or
reads that take data from the cache and not from the real device…

Yes, all of these bad things can happen. PROT_NOCACHE helps to protect
against them. I’m not sure if it protects against instruction re-ordering
in CPU pipelines, though. Many processors (e.g. PPC, MIPS) provide an
I/O synch primitive. I don’t know if x86 does so, since most “control
register” access for devices on x86 is done with I/O ports, and the
in/out instructions.
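
On x86 under Neutrino a port-based access looks roughly like this (the
port base and register offsets are made up for the example):

#include <stdint.h>
#include <sys/mman.h>       /* mmap_device_io() */
#include <sys/neutrino.h>   /* ThreadCtl()      */
#include <hw/inout.h>       /* in8(), out8()    */

#define DEV_IO_BASE  0x300   /* hypothetical I/O port base */
#define DEV_IO_SIZE  0x20

int talk_to_device(void)
{
    uintptr_t port;

    /* Gain I/O privileges for this thread */
    if (ThreadCtl(_NTO_TCTL_IO, 0) == -1)
        return -1;

    /* Get a handle for the port range (a 1:1 mapping on x86) */
    port = mmap_device_io(DEV_IO_SIZE, DEV_IO_BASE);
    if (port == MAP_DEVICE_FAILED)
        return -1;

    /* in/out accesses are neither cached nor combined, so each one
     * really reaches the device. */
    out8(port + 0x04, 0x01);   /* hypothetical control register */
    return in8(port + 0x00);   /* hypothetical status register  */
}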

-David

David Gibbs
QNX Training Services
dagibbs@qnx.com

“Davide Ancri” <falsemail@nospam.xx> wrote in message
news:epk9lh$7dg$1@inn.qnx.com

Igor Kovalenko wrote:
The implications involve synchronizing the cache manually for all DMA
transactions. You have to flush the cache after each outbound DMA
transfer and invalidate before taking inbound data. You also need to
align the DMA memory to your cache line size.

Until now, for DMA memory (I mean host RAM accessed by PCI devices
acting as DMA masters) I have never used the PROT_NOCACHE flag, because
AFAIK the x86 chipset ensures cache coherency by snooping the accesses
initiated by the DMA masters. Flushing/invalidating the caches can
improve performance, but even without it, data coherency is guaranteed.

Define “x86 chipset”?
AFAIK the cache snooping is only supported by Xeons. But I have been wrong
before ;-)

– igor

Igor Kovalenko wrote:

Define “x86 chipset”? AFAIK the cache snooping is only supported by
Xeons. But I have been wrong before ;-)

Igor, we have been using cPCI Pentium 3/4 and Pentium M CPU cards for
years (honestly I don’t know the exact chipset model, but I think it is
the same one used in most desktop PCs based on the same uPs), and I have
never seen lost or incoherent data caused by DMA accesses, even without
any cache invalidate/flush instructions in the driver code.

What we have seen is that cache instructions on DMA-able buffers can
dramatically improve the free real-time headroom of the uP: I think (but
I am not a great expert on this) the reason is the uP time thrown away
waiting for the chipset to invalidate/reload cache lines that are not
coherent with the DMA buffers (written by the PCI device) when the uP
tries to access them for the first time after the DMA itself…

Maybe the reverse is also true? The PCI device initiates a DMA
transaction for a read burst, but receives a RETRY from the host/PCI
bridge because some time is needed to actually write the DMA buffer back
to RAM, all of which consumes PCI bandwidth.
This is only a hypothesis of mine, however ;-)

Thanks for your answers!
Davide


/* Ancri Davide - */