shmem performance

Hello, All

Has anyone measured shared memory performance in the QNX4 operating system?


kris

“Alexander Krisak” <chris@imp.lg.ua> wrote in message
news:ba07t8$c3f$1@inn.qnx.com

Hello, All

Has anyone measured shared memory performance in the QNX4 operating system?

I don't understand the question. Do you mean creating files in shared memory, or accessing shared memory via a pointer in C, for example?

File access isn't very fast, as this is not a full-featured file system. As for accessing memory directly from a C program, there is no difference in performance compared to accessing local memory.



kris

Hello, Mario Charest

Has anyone measured shared memory performance in the QNX4 operating system?

I don't understand the question. Do you mean creating files in shared memory, or accessing shared memory via a pointer in C, for example?

File access isn't very fast, as this is not a full-featured file system. As for accessing memory directly from a C program, there is no difference in performance compared to accessing local memory.

Thank you for the answer.

I mean purely accessing memory via pointers in C. I am asking because I get strange results from my performance-measuring program: counters that reside in shared memory grow somewhat slower than counters that reside in local process memory. The difference is about 2 times on a 586@133 and about 1.5 times on a K6-3@400.


kris

“Alexander Krisak” <chris@imp.lg.ua> wrote in message
news:ba1t3e$97l$1@inn.qnx.com

Hello, Mario Charest

Has anyone measured shared memory performance in the QNX4 operating system?

I don't understand the question. Do you mean creating files in shared memory, or accessing shared memory via a pointer in C, for example?

File access isn't very fast, as this is not a full-featured file system. As for accessing memory directly from a C program, there is no difference in performance compared to accessing local memory.

Thank you for the answer.

I mean purely accessing memory via pointers in C. I am asking because I get strange results from my performance-measuring program: counters that reside in shared memory grow somewhat slower than counters that reside in local process memory. The difference is about 2 times on a 586@133 and about 1.5 times on a K6-3@400.

Post the code you use to test this; it doesn't make sense to me. :slight_smile:


kris


Hello, Mario Charest

Has anyone measured shared memory performance in the QNX4 operating system?

I don't understand the question. Do you mean creating files in shared memory, or accessing shared memory via a pointer in C, for example?

File access isn't very fast, as this is not a full-featured file system. As for accessing memory directly from a C program, there is no difference in performance compared to accessing local memory.

Thank you for the answer.

I mean purely accessing memory via pointers in C. I am asking because I get strange results from my performance-measuring program: counters that reside in shared memory grow somewhat slower than counters that reside in local process memory. The difference is about 2 times on a 586@133 and about 1.5 times on a K6-3@400.

Post the code you use to test this; it doesn't make sense to me. > :slight_smile:

/-------------------------------X8--/
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/sched.h>
#include "../../include/odb.h"

#define PROC_PID 1

extern void f_cpu(unsigned int* p1, unsigned int* p2);
#pragma aux f_cpu = \
"m2: mov dword [eax],32000" \
"m1: dec dword [eax]" \
"jnz m1" \
"inc dword [edx]" \
"jmp short m2" \
parm [eax][edx]

sigjmp_buf sb;
int cnt = 1;
int verbose = 0;
int odb = 0;
int local = 0;
unsigned int tg1, tg2;

void sig(int signo)
{ siglongjmp(sb, 1);
}

int main(int argc, char* argv[])
{ int prio = 1;
  struct itimerspec it;
  int t;
  timer_t tid;
  struct sigevent sev;
  unsigned int t1, t2;
  ODB* o;
  unsigned int *pt1, *pt2;

  memset(&it, 0, sizeof(it));

  it.it_value.tv_sec = 60;

  while ((t = getopt(argc, argv, "p:t:c:volg")) != -1)
  { switch (t)
    { case 'p': prio = atoi(optarg); break;
      case 't': it.it_value.tv_sec = atoi(optarg); break;
      case 'c': cnt = atoi(optarg); break;
      case 'v': verbose = 1; break;
      case 'o': odb = 1; break;
      case 'l': local = 1; break;
      case 'g': local = 2; break;
    }
  }

  if (verbose == 1)
    printf("%s: p %d, P %d%s%s\n", argv[0], getpid(), prio,
           (odb == 1) ? ", odb" : "",
           (local == 1) ? ", local" : ((local == 2) ? ", global" : ""));

  sev.sigev_signo = SIGALRM;
  if ((tid = timer_create(CLOCK_REALTIME, &sev)) == -1)
  { perror("timer_create");
    return 1;
  }

  if (qnx_scheduler(PROC_PID, 0, SCHED_FIFO, prio, 0) == -1)
  { perror("qnx_scheduler");
    return 1;
  }

  signal(SIGALRM, sig);

  if (odb == 1)
  { if ((o = k20_odb_open_rw(ODB_TYPE_SYS)) == NULL)
    { perror("k20_odb_open_rw");
      return 1;
    }
    for (t = 10; t < sizeof(o->trash->t)/sizeof(o->trash->t[0]) - 10; t++)
      o->trash->t[t] = 0;
  }

  t = 10;

  while (cnt--)
  { if (sigsetjmp(sb, SIGALRM) == 0)
    { t1 = t2 = 0;
      if (local != 0)
      { switch (local)
        { case 1 : pt1 = &t1; pt2 = &t2; break;
          case 2 : pt1 = &tg1; pt2 = &tg2; break;
        }
      }
      else
      { if (odb == 1)
        { pt1 = (unsigned int*)&o->trash->t[t++];
          pt2 = (unsigned int*)&o->trash->t[t++];
        }
        else
        { fprintf(stderr, "%s: no odb\n", argv[0]);
          exit(1);
        }
      }
      (*pt1) = 0;
      (*pt2) = 0;
      timer_settime(tid, TIMER_ADDREL, &it, NULL);
      f_cpu(pt1, pt2);
    }
    else
    { if (verbose == 1)
        printf("t1 = %u, t2 = %u\n", *pt1, *pt2);
    }
  }

  return 0;
}
/-------------------------------X8--/

ODB is a collection of shared memory areas of different sizes, created by another process.
o->trash is a piece of one of those shared memory areas:
struct odb_trash
{ unsigned int t[100];
};


kris

The difference is about 2 times on a 586@133 and about 1.5 times on a K6-3@400.
Post the code you use to test this; it doesn't make sense to me. > :slight_smile:

/-------------------------------X8--/
#include <time.h>
#include <stdio.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/sched.h>
#include "../../include/odb.h"

#define PROC_PID 1

extern void f_cpu(unsigned int* p1, unsigned int* p2);
#pragma aux f_cpu = \
"m2: mov dword [eax],32000" \
"m1: dec dword [eax]" \
"jnz m1" \
"inc dword [edx]" \
"jmp short m2" \
parm [eax][edx]

OK, so you are counting how many times the machine can decrement the value 32000 down to 0.

Using a signal isn't very precise, but given that your test runs for 60 seconds, that should be OK. :wink: I believe 60 seconds isn't long enough for the counter to wrap around. I also assume you have been running this program at high priority.

The only explanation I can come up with is that the CPU cache is involved. Is the shared memory pointing to a device or just normal RAM? Maybe the shared memory is set as non-cacheable (QSS, any pointers)? But then, if that memory were not cacheable, the performance difference would be MUCH bigger.

I would try to allocate some memory with malloc(), then obtain the physical memory address and mmap() it. That way you would be performing the test on the same physical memory, ruling out differences caused by hardware issues (although I'd expect f_cpu() to run in the CPU cache and not be affected by RAM speed that much).

Hello, Mario Charest

The only explanation I can come up with is that the CPU cache is involved. Is the shared memory pointing to a device or just normal RAM? Maybe the shared memory is set as non-cacheable (QSS, any pointers)? But then, if that memory were not cacheable, the performance difference would be MUCH bigger.

It is normal memory, created with shm_open() and mmap()-ed into the process. I also thought it might be a cache problem, but I don't understand why this shared memory would become non-cacheable.

Please tell me, what is QSS?

I would try to allocate some memory with malloc(), then obtain the physical memory address and mmap() it. That way you would be performing the test on the same physical memory, ruling out differences caused by hardware issues (although I'd expect f_cpu() to run in the CPU cache and not be affected by RAM speed that much).

I tried a simpler example: one shared memory area (created the same way as in k20_odb_open_rw()) with two counters inside it. It works fine; the counters have values almost equal to the local counters (i.e. there is no problem). But access to data in the shared memory created in k20_odb_open_rw() is still slow.


kris

“Alexander Krisak” <chris@imp.lg.ua> wrote in message
news:ba5431$171$1@inn.qnx.com

Hello, Mario Charest

The only explanation I can come up with is that the CPU cache is involved. Is the shared memory pointing to a device or just normal RAM? Maybe the shared memory is set as non-cacheable (QSS, any pointers)? But then, if that memory were not cacheable, the performance difference would be MUCH bigger.

It is normal memory, created with shm_open() and mmap()-ed into the process. I also thought it might be a cache problem, but I don't understand why this shared memory would become non-cacheable.

I'm not convinced it's set as non-cacheable; if it were, it would be MUCH slower.

Please tell me, what is QSS?

It's the name of the company that makes QNX. I'm hoping someone from QSS will offer some advice.


I would try to allocate some memory with malloc(), then obtain the physical memory address and mmap() it. That way you would be performing the test on the same physical memory, ruling out differences caused by hardware issues (although I'd expect f_cpu() to run in the CPU cache and not be affected by RAM speed that much).

I tried a simpler example: one shared memory area (created the same way as in k20_odb_open_rw()) with two counters inside it. It works fine; the counters have values almost equal to the local counters (i.e. there is no problem). But access to data in the shared memory created in k20_odb_open_rw() is still slow.

Well, it's possible that memory performs at different speeds depending on where it is obtained from. Use mem_offset() to obtain the physical address of the shared memory; maybe that will tell us something.

Can you post the code of k20_odb_open_rw() and of your simplified example?


kris

Hello, Mario Charest

The only explanation I can come up with is that the CPU cache is involved. Is the shared memory pointing to a device or just normal RAM? Maybe the shared memory is set as non-cacheable (QSS, any pointers)? But then, if that memory were not cacheable, the performance difference would be MUCH bigger.

It is normal memory, created with shm_open() and mmap()-ed into the process. I also thought it might be a cache problem, but I don't understand why this shared memory would become non-cacheable.

I'm not convinced it's set as non-cacheable; if it were, it would be MUCH slower.

Please tell me, what is QSS?

It's the name of the company that makes QNX. I'm hoping someone from QSS will offer some advice.

OK, I know it as QSSL.

I would try to allocate some memory with malloc(), then obtain the physical memory address and mmap() it. That way you would be performing the test on the same physical memory, ruling out differences caused by hardware issues (although I'd expect f_cpu() to run in the CPU cache and not be affected by RAM speed that much).

I tried a simpler example: one shared memory area (created the same way as in k20_odb_open_rw()) with two counters inside it. It works fine; the counters have values almost equal to the local counters (i.e. there is no problem). But access to data in the shared memory created in k20_odb_open_rw() is still slow.

Well, it's possible that memory performs at different speeds depending on where it is obtained from. Use mem_offset() to obtain the physical address of the shared memory; maybe that will tell us something.

I didn't find that function. Do you mean the offset in process memory where the shared memory is mapped?

Can you post the code of k20_odb_open_rw() and of your simplified example?
/-----------------------X8----/

#define k20_odb_open_rw(t) k20_odb_open_x(_odb_type_e((t), ODB_TYPE_RW))

ODB* k20_odb_open_x(u_int32 type)
{ ODB* o = NULL;
char path[80];
struct stat st;
int fd;
char* shm;
struct odb_sys_serv* s;
int pid;
struct odb_msg_h h;
union odb_open_s m;
int prot, err = 0;
char name[14];

do
{ if ((o = (ODB*)malloc(sizeof(ODB))) == NULL)
{ err = 1;
errno = ENOMEM;
break;
}

memset(o, 0, sizeof(ODB));

o->pid = pid;
o->type = type;

if ((type >> _ODB_TYPE_SYS) & ODB_TYPE_RW)
{ sprintf(name, "%s_%s", prefix, ODB_SH_SRV);
if ((o->fd_srv = shm_open(name, O_RDONLY, 0622)) == -1)
{ err = 1;
break;
}

if (fstat(o->fd_srv, &st) != 0)
{ err = 1;
break;
}

o->sz_srv = st.st_size;

if ((o->sh_srv = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED,
o->fd_srv, 0)) == (void*) -1)
{ err = 1;
break;
}

o->srv = (struct odb_srv*) o->sh_srv;

sprintf(name, "%s_%s", prefix, ODB_SH_SYS);
if ((o->fd_sys = shm_open(name, O_RDONLY, 0622)) == -1)
{ err = 1;
break;
}

if (fstat(o->fd_sys, &st) != 0)
{ err = 1;
break;
}

o->sz_sys = st.st_size;

prot = 0;

switch ((type >> _ODB_TYPE_SYS) & ODB_TYPE_RW)
{ case ODB_TYPE_NONE : prot = 0;
case ODB_TYPE_R : prot |= PROT_READ ; break;
case ODB_TYPE_W : prot |= PROT_WRITE; break;
case ODB_TYPE_RW : prot |= (PROT_READ | PROT_WRITE); break;
}

if ((o->sh_sys = mmap(NULL, st.st_size, prot, MAP_SHARED, o->fd_sys,
0)) == (void*) -1)
{ err = 1;
break;
}

s = &o->srv->sys;

o->sys = (struct odb_sys*) ((char*)o->sh_sys + s->offt_sys);
o->sys_task = (struct odb_sys_task*) ((char*)o->sh_sys +
s->offt_sys_task);
o->upr = (struct odb_upr*) ((char*)o->sh_sys + s->offt_upr);
o->vd = (struct odb_vd*) ((char*)o->sh_sys + s->offt_vd);
o->sd = (struct odb_sd*) ((char*)o->sh_sys + s->offt_sd);
o->svd = (struct odb_svd*) ((char*)o->sh_sys + s->offt_svd);
o->start = (struct odb_start*) ((char*)o->sh_sys + s->offt_start);
o->abst = (struct odb_abst*) ((char*)o->sh_sys + s->offt_abst);
o->stats = (struct odb_stats*) ((char*)o->sh_sys + s->offt_stats);
o->ar_link_req = (struct odb_ar_link_req*) ((char*)o->sh_sys +
s->offt_ar_link_req);
o->net.nsd = (struct odb_nsd*) ((char*)o->sh_sys + s->offt_nsd);
o->trash = (struct odb_trash*) ((char*)o->sh_sys + s->offt_trash);
o->v_mko_delta = (struct odb_v_mko_delta*) ((char*)o->sh_sys +
s->offt_v_mko_delta);
o->st_time = (struct odb_st_time*) ((char*)o->sh_sys + s->offt_st_time);
o->st_mask = (u_int32*) ((char*)o->sh_sys + s->offt_st_mask);
}

o->id0 = ODB_ID;
o->id1 = ODB_ID;
if (o->fd_srv != 0)
{ if (close(o->fd_srv) != -1)
o->fd_srv = 0;
}
if (o->sh_srv != NULL)
{ if (munmap(o->sh_srv, o->sz_srv) == 0)
{ o->sh_srv = NULL;
o->srv = NULL;
}
}
} while(0);

if (err == 1)
{ if (o != NULL)
{ _odbl_cleanup(o);
free(o);
o = NULL;
}
}

return o;
}
/-----------------------X8----/
{ unsigned int *sh0, *sh1;
  if ((fd0 = shm_open("test", O_RDONLY, 0622)) == -1)
  { perror("shm_open");
    return 1;
  }

  if ((sh0 = mmap(0, PAGESIZE, PROT_READ, MAP_SHARED, fd0, 0)) == (void*)-1)
  { perror("mmap");
    shm_unlink("test");
    return 1;
  }

  if ((fd1 = shm_open("test1", O_RDONLY, 0622)) == -1)
  { perror("shm_open");
    return 1;
  }

  if ((sh1 = mmap(0, PAGESIZE * 4, PROT_READ | PROT_WRITE, MAP_SHARED,
                  fd1, 0)) == (void*)-1)
  { perror("mmap");
    shm_unlink("test");
    return 1;
  }
  close(fd0);
  munmap(sh0, PAGESIZE);
}
/-----------------------X8----/
Then I just use sh1[0] and sh1[1] in place of o->trash->t[].

kris

Hello, All

I found the source of the problem: it was unaligned memory access.
Thanks, Mario.


kris

“Alexander Krisak” <chris@imp.lg.ua> wrote in message
news:baadhb$rq7$1@inn.qnx.com

Hello, All

I found the source of the problem: it was unaligned memory access.
Thanks, Mario.

Good catch!


kris

Alexander Krisak <chris@imp.lg.ua> wrote:
AK > Hello, All

AK > I found the source of the problem: it was unaligned memory access.
AK > Thanks, Mario.

Was this on x86?

I thought the latest x86 hardware didn't care about alignment.
I guess I was wrong.

“Bill Caroselli” <qtps@earthlink.net> wrote in message
news:badp0e$nvi$3@inn.qnx.com

Alexander Krisak <chris@imp.lg.ua> wrote:
AK > Hello, All

AK > I found the source of the problem: it was unaligned memory access.
AK > Thanks, Mario.

Was this on x86?

I thought the latest x86 hardware didn't care about alignment.

It doesn't care (it will work), but it does affect performance. Some non-x86 processors will crash. :slight_smile:

Consider that if you are reading a long at address 2, the processor actually has to fetch (either from cache or from memory) two 32-bit words.

Now, with wider buses and bigger caches, I would guess the overhead is reduced, but it's still there.

I guess I was wrong.