Streaming Frame Grabber data to disk

ddowens · April 26, 2006, 2:44pm

We are trying to stream some 100 frames/sec ‘video’ data to disk. The data is raw 256 x 256 by 16 bit pixels from an IR focal plane array. The frame rate is 100 Hz and the frame grabber is a Matrox Meteor II digital using the Genesis library for QNX. We have tested the speed of our IDE drive and it is capable of at least 25 MB/sec for large files. We are able to capture frames with no problems as long as the disks are not being accessed. When we try to overlap these operations, we get dropped frames and the disk cannot keep up with the data coming in (13 MB/sec). I have tried adjusting priorities but any change I have made seems to result in poorer performance. Has anybody else had these sort of problems with QNX (I’m starting to wonder how real time it really is.)

Thanks,
Doug Owens
owens2@llnl.gov

mario · April 26, 2006, 7:38pm

In the end the HD is the bootle neck, priority can’t do much good. That being said hd performance isn’t QNX strength. You could try using async write and increase cache size.

Not sure what you mean by overlap?

ddowens · April 26, 2006, 8:27pm

Overlap means I have 2 threads, one that places the data frames in a circular buffer and one that writes the data in the circular buffer to disk. The collection thread uses a semaphore to notify the writing thread that a sufficient number of frames are available. New frames continue to enter the buffer at a point just past the ones to be written.

Doug Owens
owens2@llnl.gov

mario · April 27, 2006, 1:04pm

Ah, i though you mean multiple access (read/write) to the disk.

You could have a look at aio_* function. They work in SP2 + 6.3.2 update ( I think I saw that in the release note)

My guess is your implementation isn`t as efficient as it needs to be. Can you post some code.

That being said, are you sure the HD can sustain 13Mb write?

rgallen · May 1, 2006, 12:20am

Bingo! There’s your problem.

Semaphores do not convey priority. Use a mutex/condvar pair to wake up the consumer (writer) thread. Semaphores are provided to facilitate porting of non real-time Unix applications to QNX, and are not intended to be used in a real-time application.

Also, you don’t explicitly state what you are writing to; so I assume that you are by-passing the filesystem and writing direct to a raw disk partition.

mario · May 1, 2006, 2:39am

Well if the sharing is done between two processes only, even though semaphores themselves don’t have priority each processes priority is still honored, no?

I beleive usage of semaphores is not well implement in this perticular case.

rgallen · May 1, 2006, 3:36pm

The hard priority of each thread is honoured, but there is no inheritance, therefore if the consumer thread is a higher priority and it “waits” on the semaphore, then there is a priority inversion.

Given the stated design, the priority should driven from the consumer thread (i.e. the thread that writes to disk). It will try to get some data from the producer, and block on the mutex (which implements priority inheritance) and the producer thread will be elevated to the consumers priority to fill the buffer, it will then signal the condvar (which is not a blocking operation) which will move the consumer thread to the ready queue.

The other priority driver for the producer, should be the interrupt pulse from the frame grabber. It should (in all likelyhood) be of equal or higher priority than the consumer thread. When a frame is ready it will deliver the pulse and the producer will raise to that priority and attempt to lock the “overlap pointer mutex” in order to add the frame to the overlap area, and adjust the overlap pointers. Since this is an overlap, the lock time is very small since only the pointers (or potentially even simple flags if the frames are all of equal size) need to be adjusted with exclusion.

On the other side when the consumer awakes from the condvar wait, it will adjust the overlap pointers to reflect that it has removed the entry, and then unlock the mutex (if you simply have a flag to indicate that the overlap area has been processed, then an atomic_set could be used, and the mutex could be immediately unlocked as soon as the consumer is made ready).

As far as semaphores being a bad design, I agree; however, the implementation that QNX must provide is determined by the design which has no interface for selecting priority inheritance (and exactly what would the semantics of priority inheritance be, for a counting semaphore?). BTW: since it wasDijkstra who designed semaphores, I feel obliged to add that they are a bad design in the context of priority driven pre-emptive schedulers that utilize priority inheritance protocols (which was clearly not the environment in which Dijkstra designed semaphores).

mario · May 1, 2006, 5:15pm

I meant bad design in the way the original poster probably uses them, not the way semaphore were coded in the kernel/clib ;-)

I get the feeling the original poster would use mutex instead of semaphore that it would solve his problem.

The way I see it, the original poster wants to acheive some level of real-time but is using a non-realtime device (the HD) (by the way a real-time os cannot turn a non-realtime device into a realtime device…). Proper care should be taken to deal with this. For example I would setup the frame grabber to grab in some sort of ring buffer, providing buffering that doesn’t cost CPU time (no need to move data around).

The HD thread could be setup as read only, never preventing the producer from writing (unless buffers are full). I would use a simple queue to store pointers to images in the ring buffer.

ddowens · May 1, 2006, 5:28pm

Thanks to all for the replies. I will look into the semaphore question but I still find it hard to believe that could be the cause of these gaps that are tens of milliseconds long. It is difficult to control or even determine the priority of the frame grabber threads because one of them runs in the Genesis driver and the other is created by the library software in my application. The event capture from the profiler shows no activity on any of the relevant threads during these gaps. I have improved the situation by using a single file and pregrowing it. Previously new files were created periodically to keep them a reasonable size. This is all based on code that we had been using on a Linux system. The latest results on a 4000 frame file are 16 dropped frames, all between frames 900 and 940, with the rest being perfect. I haven’t seen any information in the QNX docs about raw writes to disk. Is there some info out there somewhere?

Thanks,
Doug Owens
owens2@llnl.gov

rgallen · May 1, 2006, 5:35pm

mario:

I meant bad design in the way the original poster probably uses them, not the way semaphore were coded in the kernel/clib

I get the feeling the original poster would use mutex instead of semaphore that it would solve his problem.

The way I see it, the original poster wants to acheive some level of real-time but is using a non-realtime device (the HD) (by the way a real-time os cannot turn a non-realtime device into a realtime device…). Proper care should be taken to deal with this. For example I would setup the frame grabber to grab in some sort of ring buffer, providing buffering that doesn’t cost CPU time (no need to move data around).

The HD thread could be setup as read only, never preventing the producer from writing (unless buffers are full). I would use a simple queue to store pointers to images in the ring buffer.

Almost all hard drives today are real-time (i.e. they do incremental thermal recalibration), so I don’t think there is any fundamental problem with the OPs design. Actually double buffered overlapped I/O is the ideal implementation. Only two buffers are required, so that there can be simultaneous outstanding acquisition and disk write operations (i.e. the frame grabber can DMA to one place, while the disk driver DMAs from the last complete frame grab). Given the stated acquisition rates, and disk write throughput (and assuming no seek latency due to exclusive use of the hard-drive) no additional buffering is necessary (or desirable).

I believe that the OPs problem is either:

a) writing through the filesystem rather than directly to devb-eide through a raw device.

or

b) access to disk is not exclusive, and thus the heads might be repositioned during operation (implicit in a, still possible even if writing to raw partition). If the heads are repositioned, then the maximum seek latency (and max number of seeks/second) must be figured into the throughput numbers, and additional buffering implemented to store the number of incoming frames that can arrive throughout the duration of these seeks.

or

c) priority inversion caused by the semaphore.

ddowens · May 1, 2006, 5:56pm

Another update: (i saw a post come up just after I posted). My buffer is 4000 frames, or 40 seconds worth at 100 Hz. I was guessing that would cover any head positioning latencies in the disk. I have to process the frames individually in order to get a time stamp for every frame (we need to determine exact times for the frames in order to determine the exact position of the aircraft at each frame). Although this is probably not the most efficient way to use the framegrabber, it does provide the diagonostic that is showing the exact frames being dropped. The basic procedure is to grab one frame to a host buffer (system memory) and then copy that data to the shared circular buffer with header information added. I would love to have more control over this process but most of the operations are buried in Genesis library calls.

rgallen · May 1, 2006, 6:20pm

ddowens:

Another update: (i saw a post come up just after I posted). My buffer is 4000 frames, or 40 seconds worth at 100 Hz. I was guessing that would cover any head positioning latencies in the disk. I have to process the frames individually in order to get a time stamp for every frame (we need to determine exact times for the frames in order to determine the exact position of the aircraft at each frame). Although this is probably not the most efficient way to use the framegrabber, it does provide the diagonostic that is showing the exact frames being dropped. The basic procedure is to grab one frame to a host buffer (system memory) and then copy that data to the shared circular buffer with header information added. I would love to have more control over this process but most of the operations are buried in Genesis library calls.

Ahhh so the system isn’t a real-time app, but actually a “create a huge buffer and hope that covers the worst case” app. Dang; I was hoping that one day, I would actually encounter a hard real-time application on openqnx…

OK, so you have already found out that pre-growing will help; in addition, you can play with the io-blk.so “commit” and “delwri” options in order to fudge things so that they behave like a non-rtos, and allow your application to experience the same degree of stochastic behavior that you are familiar with on Linux.

mario · May 1, 2006, 10:29pm

Ok I’ll rephrase HD are very difficult device to get consistant result from. Depending on what track it’s writting you get different write speed, then fragmentation can get in the way. Seek time are close to impossible to predict if there are multiple access. Granted most of these you can control if you bypass the filesystem.

Well that depends on what else is happening on the system. If you work with two frames that means you basicly have 1 ms to copy and save the frame. 1ms should be simple to acheive, but personnaly I like to get myself a little more head room for stuff like debugging, more important field debugging, while the system is running.

I would add:

d) Disk and filesystem can’t really sustain (13Mbytes), when data flow is not constant

e) The buffer is not done as well as it should and valuable CPU time is waisted.

f) CPU is needed to handle the HD and there isn’t enough left for the buffering ( on HD I have here, doing cp /dev/hd0 /dev/null takes close to 100% of the CPU for a transfer rate of 25Megs …). Not a big deal but you are moving around 25Meg of data per seconds minimum. plus the DMA from the frame buffer 13M and dma to the HD 13Mg, that close to 50Meg. For todays machine that’s almost trivial, but you didn’t say what your hardware wes, so it might be something to watch for.

What is the CPU usage when the system is running?

mario · May 1, 2006, 10:34pm

Guessing? hum here is a nasty word LOL!

ddowens · May 1, 2006, 11:19pm

Actually the large buffer is there to allow capture followed by store, so we can get a reasonable amount of good data. I just use the same area for streaming. Our drive is a WD caviar - we (and probably a lot of other people) are waiting for some sort of SATA support from QSS.

mario · May 2, 2006, 12:29am

There is some SATA support with some Intel chip set (ICH5). However unless you use Raptor which are available only in SATA, I don’t beleive there is that big of a difference between SATA and PATA (in your case you say you can write at 25Mbytes/sec, that a fifth of PATA…)

At least nothing that would make it or break it, in your case.

You could use SCSI. (On QNX4 I’ve seen 65Msec read speed with 15k RMP drive, not sure about QNX6 though)

rgallen · May 2, 2006, 4:16am

Did you try setting commit=none?

mario · May 2, 2006, 11:14am

Also if you don’t need updated access time using noatime can reduce head movement.

ddowens · May 2, 2006, 4:21pm

I found a problem with my code in that it was posting the semaphores before the frames arrived. After fixing this, I was able to collect a 4000 frame file with no drops. This was after pregrowing and fsync() on the destination file. It still appears that file system activity clobbers me, probably from any process, but I need to test that. The system does appear to be keeping up with the data at this point.

Doug Owens

mario · May 2, 2006, 5:53pm

I’m curious, how many frames is the writer ahead of the reader (the reader being to one that writes to disk)