Resource manager eats too much processor

Good day everyone. I’m writing a resource manager, and I’ve run into a problem: if I pass a large buffer (roughly 768 KB) to MsgReply from the io_read handler, my resource manager starts consuming an enormous amount of processor time. For example:

int io_read (resmgr_context_t *ctp, io_read_t *msg, RESMGR_OCB_T *ocb)
{
    int status;

    /* verify the client's read request */
    if ((status = iofunc_read_verify (ctp, msg, &ocb->ocb, NULL)) != EOK)
        return (status);

    /* only plain reads are supported */
    if ((msg->i.xtype & _IO_XTYPE_MASK) != _IO_XTYPE_NONE)
        return (ENOSYS);

    /* reply with the whole buffer */
    MsgReply (ctp->rcvid, msg->i.nbytes, my_buffer, msg->i.nbytes);
    return (EOK);
}

That’s the io_read handling code from my test resource manager; my_buffer is 786432 bytes big. For testing purposes I wrote a program which issues a read to my resource manager once every 300 ms and reads 786432 bytes each time. In that case my resource manager eats ~20% of the processor; if I issue a read once every 100 ms, the load is ~40%. Just terrible. The profiler shows that all of that 20% (40%) is spent in a single function, MsgReply, and there is no other code executing in the resource manager (well, there is still one io_open and one io_close_ocb). So the question is: why so much?
I also used my test program against the filesystem manager: I read a large file by issuing 786432-byte reads once every 300 ms, and the CPU load was ~1%. So why do I get ~20%?

Better to use shared memory + synchronisation (semaphore / mutex / pulse notification).

Nah, no need for shared memory. You are looking at roughly 2.6 MB/s; this should use very little CPU.

Try returning _RESMGR_NOREPLY instead of EOK. By default the framework does the reply for you (you might want to consider doing that instead), but because you didn’t tell it that you had already replied, it is most probably doing a second MsgReply, which causes havoc. I’m pretty sure the CPU usage you saw in the profiler is not your MsgReply but the MsgReply performed by the resmgr framework.
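
A minimal sketch of the two options, reusing the handler and my_buffer from the first post (just an illustration, error handling omitted):

    /* option 1: reply yourself, then tell the framework not to reply again */
    MsgReply (ctp->rcvid, msg->i.nbytes, my_buffer, msg->i.nbytes);
    return (_RESMGR_NOREPLY);

    /* option 2: let the resmgr framework do the reply for you */
    _IO_SET_READ_NBYTES (ctp, msg->i.nbytes);
    return (_RESMGR_PTR (ctp, my_buffer, msg->i.nbytes));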

It is true that calling MsgReply() and then returning EOK will cause MsgReply to be issued twice, the second time by the RM framework. But do not forget that MsgReply is not a blocking kernel call, and the kernel will return immediately with ESRCH on the second one, since the sender is no longer REPLY-blocked!
Anyway, I still believe that for data chunks as large as 786432 bytes shared memory would provide better bandwidth! Message passing is excellent for tiny messages. You can actually find this recommendation in the QNX documentation as well (Nto_sys_arch.pdf)!

I tried using _RESMGR_NOREPLY instead of EOK, no use.
koko, OK, I can use shared memory or something like that. But I just don’t understand why dev-ide handles such messages with a CPU load of ~1% (and that includes lots of operations on the hardware), while my resource manager, which only calls MsgReply, uses 20%.
By the way, I’ve measured the time MsgReply takes to execute: it’s 30 ms!!! You said it’s a non-blocking call, so how come it executes for 30 ms?
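
For what it’s worth, the timing can be done with something like this inside the io_read handler above (a sketch only; my_buffer is the same buffer as before):

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/neutrino.h>
    #include <sys/syspage.h>

    /* time a single MsgReply() with the free-running cycle counter */
    uint64_t t0 = ClockCycles();
    MsgReply (ctp->rcvid, msg->i.nbytes, my_buffer, msg->i.nbytes);
    uint64_t t1 = ClockCycles();

    double ms = (t1 - t0) * 1000.0 / SYSPAGE_ENTRY(qtime)->cycles_per_sec;
    printf ("MsgReply took %.3f ms\n", ms);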

mario, I measured my CPU usage with ‘hogs’ and also visually, by looking at the CPU meter on the shelf at the right side of the screen :)

Just tried sending a 750K message 20 times a second and it takes less than 1% of CPU (in fact hogs shows 0%).

I find that it’s more about the number of messages per second than the size of the messages, because the cost of messaging is more about the context switches than the actual moving of data. In fact, if you need to implement some sort of go/status mechanism with pulses, you’ll find it’s pretty close to send/receive/reply in terms of overhead.

Yes, you get less CPU usage or more bandwidth with shared memory, but you get increased complexity and you lose the ability to work across a network.

Sheff, the blocking operations are MsgSend (you try to send but the receiver is busy with something else) and MsgReceive (you are ready to receive but nobody sends to you). With MsgReply the case is different: the other thread/process is already waiting on your reply, i.e. it is REPLY-blocked. Therefore your MsgReply does not block, and the 30 ms you measured is the time needed to transfer the data into the sender’s buffer plus the context switching. Try a smaller amount, say 700 bytes, and check what happens.

I did the same experiment: an RM handling io_read with a transfer of 786432 bytes, no tasks other than the RM and the reading client, both running at the same priority.
Here are the results:

  • read_every_50ms - CPU load=8%
  • read_every_100ms - CPU load=4%
  • read_every_300ms - CPU load=1%

Then I changed the amount to 700 bytes (a tiny message); here are the results:

  • read_every_50ms - CPU load=0.03%
  • read_every_100ms - CPU load=0.03%
  • read_every_300ms - CPU load=0.03%

I think what was written in the QNX books is true.

Do the same test with shared memory and a pulse…
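
Something along these lines, for example (a rough sketch only: the shared-memory name “/frame_shm”, FRAME_SIZE, and the connection/channel setup, i.e. client_coid on the producer side and chid on the consumer side, are assumed to exist already, and error checking is omitted):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/neutrino.h>
    #include <unistd.h>

    #define FRAME_SIZE        786432
    #define PULSE_FRAME_READY (_PULSE_CODE_MINAVAIL + 0)

    /* producer (the resource manager) */
    int fd = shm_open ("/frame_shm", O_RDWR | O_CREAT, 0666);
    ftruncate (fd, FRAME_SIZE);
    char *buf = mmap (NULL, FRAME_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    /* ... fill buf with a new frame, then notify the consumer ... */
    MsgSendPulse (client_coid, -1, PULSE_FRAME_READY, 0);

    /* consumer */
    int shm = shm_open ("/frame_shm", O_RDONLY, 0);
    const char *frame = mmap (NULL, FRAME_SIZE, PROT_READ, MAP_SHARED, shm, 0);
    struct _pulse pulse;
    for (;;) {
        MsgReceivePulse (chid, &pulse, sizeof(pulse), NULL);  /* block until notified */
        if (pulse.code == PULSE_FRAME_READY) {
            /* frame[] already holds the data; no large message copy needed */
        }
    }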

Read the documentation, MsgReply may block.

I read it, and here is what it states for MsgReply:
Blocking states:
None. In the network case, lower priority threads may run.

And here is for MsgSend:
STATE_SEND
The message has been sent but not yet received…
STATE_REPLY
The message has been received but not yet replied to…

“In the network case, lower priority threads may run”: for a lower-priority thread to run, doesn’t that imply the process is blocked?

It also says that MsgReply has increased latency when it’s used to communicate across a network, since it may need to talk to the client’s npm-qnet to actually transfer the data.

Could you please post the full source code of your test resmgr here?

koko, in your test you replaced the single 700 KB read with a single 700-byte read, which isn’t quite equivalent; better to replace it with ~1000 reads of 700 bytes each, then pause, and so on.
I tried doing that, and it’s even worse.
Also I’ve written another test program:

open(something);
for (10 seconds)
{
    read(1 KB or 1 MB per call);
}

Note that there’s no sleep here. The results were the following:

  • from /dev/zero - ~2000 MB/s with a 1 MB read buffer
  • from /dev/hd0 - ~56 MB/s with a 1 MB read buffer
  • from /dev/my - ~12 MB/s with a 1 KB read buffer, ~24 MB/s with a 1 MB read buffer
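
For reference, the test loop written out a bit more fully (a sketch; /dev/my and the 1 KB / 1 MB buffer sizes are from my test, the rest is illustrative and assumes every read succeeds):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define BUF_SIZE (1024 * 1024)   /* change to 1024 for the 1 KB case */

    static char buf[BUF_SIZE];

    int main(void)
    {
        int      fd    = open ("/dev/my", O_RDONLY);
        uint64_t total = 0;
        time_t   end   = time (NULL) + 10;   /* run for ~10 seconds, no sleep */

        while (time (NULL) < end)
            total += read (fd, buf, sizeof(buf));

        printf ("%.1f MB/s\n", total / 10.0 / (1024.0 * 1024.0));
        close (fd);
        return 0;
    }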

:) The problem is solved.
The high CPU load was there because my_buffer was allocated with mmap using the NO_CACHE flag. I did that because my_buffer is actually a video frame buffer, and I had read in the QNX docs that for video frame buffers it’s better to set NO_CACHE. But I simply misunderstood it when I read it: you should only set NO_CACHE when you need to access dual-ported memory.
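
For the record, roughly what the allocation looked like versus what it should have been (a sketch: FRAME_SIZE stands in for the 786432 bytes, the exact flags in my real code may differ, and cache coherency with the DMA engine still has to be handled separately):

    #include <sys/mman.h>

    /* what I had: uncached, so every CPU access goes all the way to memory */
    my_buffer = mmap (NULL, FRAME_SIZE,
                      PROT_READ | PROT_WRITE | PROT_NOCACHE,
                      MAP_PHYS | MAP_ANON, NOFD, 0);

    /* what it should be for a buffer the CPU copies out of: cached */
    my_buffer = mmap (NULL, FRAME_SIZE,
                      PROT_READ | PROT_WRITE,
                      MAP_PHYS | MAP_ANON, NOFD, 0);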

Good catch.

What do you mean by a “video frame buffer”? Is it just a buffer in memory that you transfer video frames in and out of, or is the memory connected to a piece of hardware? If it is connected to hardware, how can it not be dual ported?

maschoen, it’s a DMA buffer: the video capture card writes data into it and I read it out. You’re right, it is dual-ported, but anyway, since I have to copy that memory, I must cache it in order to maximize performance…

I think we have some terminology issues here. If it’s a DMA buffer, that usually means it’s memory you allocated and pointed a DMA channel at.
That is not dual-ported and, like all memory, it can be cached.

On first thought, this doesn’t make sense either: caching should only help performance if you have to read the memory a second time, as happens with code and local variables. However, I suspect that marking the data as uncacheable means the caches (level 2 and 3 included) are bypassed for that memory, so data is fetched from main memory one access at a time, only when requested. With the cache turned on, data is moved into the cache in 64-byte or 128-byte lines, so a sequential copy of 786432 bytes needs only on the order of 12288 line fills (at 64 bytes each) instead of roughly 100000 separate word-sized reads.
I’m just guessing here, but I think that must be what’s happening.

maschoen, I think you’re right here. BTW, could you explain what dual-ported memory actually means?