unacceptable memory performance

Hello,

I am programming a time critical application using a PCI IO card with a
Bus-mastering DMA controller. DMA memory is allocated for use by the DMA
controller using the mmap() function. The controller simply couldn’t do fast
enough memory transfers because the PCI card kept giving FIFO overflow
interrupts. So I decided to clock the amount of time taken to do memory
transfers. The test program is given below. I compared time taken to
transfer memory out of a mmapped region and the same for a memory region
allocated using the C++ ‘new’ operator. Here are the results of running the
test program:

Time for reading data from mmapped region (in ms/kB) : 0.266836
Time for reading data from region allocated using 'new' (in ms/kB) : 0.0111376

Any reason why the mmapped region should be 30 times slower than memory
allocated using 'new'? This is on QNX Momentics 6.2.1.

//==============================================
// This program compares data transfer times for a memory region
// obtained using mmap() and a memory region obtained using 'new'
// in QNX 6.2.1

#include <sys/mman.h>
#include <inttypes.h>
#include <sys/syspage.h>
#include <sys/neutrino.h>
#include <iostream.h>


int main()
{
    char *mmapAllocatedVirtualAddress;
    char *newAllocatedVirtualAddress;
    size_t bytes;
    int numTestTimes;
    volatile char dataCopy; // volatile so the read loops aren't optimized away
    uint64_t startTime, endTime, ticksPerSecond;

    ticksPerSecond = SYSPAGE_ENTRY(qtime)->cycles_per_sec;


    // Number of bytes to allocate
    bytes = 1024; // 1 KB

    // Allocate using mmap()
    mmapAllocatedVirtualAddress = (char *)mmap(0, bytes,
        PROT_READ | PROT_WRITE | PROT_NOCACHE,
        MAP_ANON | MAP_PHYS | MAP_NOX64K | MAP_BELOW16M,
        NOFD, 0);

    if(mmapAllocatedVirtualAddress == MAP_FAILED)
    {
        cout << "mmap() failed." << endl;
        return -1;
    }


    // Allocate using new
    newAllocatedVirtualAddress = new char [bytes];


    // Number of test iterations
    numTestTimes = 100;

    // Time memory transfer from the mmapped region
    startTime = ClockCycles();
    for(int i = 0; i < numTestTimes; i++)
    {
        for(size_t j = 0; j < bytes; j++)
        {
            dataCopy = mmapAllocatedVirtualAddress[j];
        }
    }
    endTime = ClockCycles();

    cout << "Time for reading data from mmapped region (in ms/kB) : "
         << 1000 * (1024 * (double)(endTime - startTime) /
                    (ticksPerSecond * numTestTimes * bytes))
         << endl;


    // Time memory transfer from the region allocated using new
    startTime = ClockCycles();
    for(int i = 0; i < numTestTimes; i++)
    {
        for(size_t j = 0; j < bytes; j++)
        {
            dataCopy = newAllocatedVirtualAddress[j];
        }
    }
    endTime = ClockCycles();

    cout << "Time for reading data from region allocated using 'new' (in ms/kB) : "
         << 1000 * (1024 * (double)(endTime - startTime) /
                    (ticksPerSecond * numTestTimes * bytes))
         << endl;


    munmap(mmapAllocatedVirtualAddress, bytes);
    delete [] newAllocatedVirtualAddress;

    return 0;
}


Vilas Kumar Chitrakaran <cvilas@ces.clemson.edu>
Robotics and Mechatronics Laboratory
Clemson University

http://ece.clemson.edu/crb/students/vilas/index.htm

Vilas Kumar Chitrakaran <cvilas@ces.clemson.edu> wrote in message
news:bf68e7$kri$2@inn.qnx.com

[snip]

Time for reading data from mmapped region (in ms/kB) : 0.266836
Time for reading data from region allocated using 'new' (in ms/kB) : 0.0111376

Any reason why the mmapped region should be 30 times slower than memory
allocated using 'new'? This is on QNX Momentics 6.2.1.

Because you mapped it "PROT_NOCACHE"? Do you really need that?

-xtang


Xiaodan Tang <xtang@qnx.com> wrote in message
news:bf6b81$cnm$1@nntp.qnx.com

Because you mapped it “PROT_NOCACHE” ? Do you really need that?

I'm not sure if bus snooping is occurring on this DMA transaction or not. If
it's not, then NOCACHE is going to be needed; otherwise you could DMA into
the memory and then have the cache push stale data out on top of it.

-Adam
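In other words, if the platform does not snoop, the driver must manage coherency by hand around every transfer. In pseudocode (QNX ships a cache-control library for this; the names below are illustrative only, not the actual API):

```
# Device-to-memory (DMA read into the buffer):
invalidate_cache(buffer, len)    # drop any stale cached copies first
start_dma_from_device(buffer, len)
wait_for_dma_complete()
# CPU reads of buffer now come from RAM, not stale cache lines

# Memory-to-device (DMA write out of the buffer):
flush_cache(buffer, len)         # push dirty cache lines out to RAM
start_dma_to_device(buffer, len)
```

PROT_NOCACHE sidesteps this bookkeeping entirely, at the cost of making every CPU access to the buffer an uncached (slow) one, which is the trade-off the timings above are measuring.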

Adam Mallory <amallory@qnx.com> wrote:

Xiaodan Tang <xtang@qnx.com> wrote in message
news:bf6b81$cnm$1@nntp.qnx.com…

Because you mapped it “PROT_NOCACHE” ? Do you really need that?

I'm not sure if bus snooping is occurring on this DMA transaction or not. If
it's not, then NOCACHE is going to be needed; otherwise you could DMA into
the memory and then have the cache push stale data out on top of it.

I think most x86 do snoop, and that there is a flag in the syspage that
can be checked to determine if they do or don’t. (I don’t know, off-hand,
though, what the flag is/where to find it.)

-David

QNX Training Services
http://www.qnx.com/support/training/
Please followup in this newsgroup if you have further questions.

On a related note, is this why the QNX VESA driver goes
compute-bound when scrolling a terminal window?

John Nagle


John Nagle <nagle@downside.com> wrote:

On a related note, is this why the QNX VESA driver goes
compute-bound when scrolling a terminal window?

No, it is because it is doing reads/writes from the video card. Have
you tried the community-made improvements to the VESA driver? It does
double buffering and uses MMX for blitting. http://www.qnxzone.com/ has
the article (it is old; look for it).

chris


Chris McKillop <cdm@qnx.com>    "The faster I go, the behinder I get."
Software Engineer, QSSL           -- Lewis Carroll
http://qnx.wox.org/