Zero-Copy Message Passing & Block Device I/O

bumpn86civic · January 26, 2007, 3:52am

Can the sample read/write resource manager (from the qnx website) be modified in such a way that when someone does a read()/write(), the data will not be copied to the resource manager?

I’m trying to write a resource manager which exposes some fibrechannel ports and would like to allow developers to just read/write to the devices without having to send custom messages.

Better yet, I’d like to write a block I/O device driver, but to be honest I’m having trouble figuring out what that means in QNX. I see some CAM stuff and io-dev or whatever, but how do I hook into that?

I’ve searched the web for ‘qnx block io’ in every form I can think of and just end up frustrated.

any help appreciated!

mario · January 26, 2007, 3:46pm

Unless the data rate is VERY high, don’t concern yourself too much with the message passing overhead. Actually, it’s usually not the amount of data that is of concern but the number of I/Os per second.

The resource manager can set up shared memory. Or better, the resource manager can support direct mapping of shared memory via _IO_MMAP.

The problem with using memory mapping is that you end up having to use some sort of synchronisation method.

bumpn86civic · January 26, 2007, 8:44pm

First off, thanks for the reply!!

The data rate is relatively high since I’ve got 2gig (soon to be 4gig) Qlogic FC hba’s connected up to essentially ramdisk targets. I can write an application that hooks directly into the i/o library I’ve written and can get ~48 commands outstanding each with data transfer lengths of 640K. That will saturate the link at 200Megs/Sec no problem and the CPU is just loping along.

But using a resource manager to expose the ports (so the test authors can just use open(), pread(), pwrite() ), even multi-threading the resource manager with a tpool (just the basic sample out of the documentation…) and using multiple threads in the test client, I can only get about 8-12 cmds outstanding and the CPU is pinned and the link is at ~70Megs (a single thread can maintain 25Megs/Sec with only a single cmd outstanding back to back with about 50% cpu utilization…)

I could design the test apps such that the test authors link into my i/o libraries but this is difficult to maintain and means that if they crash their process, they’re fundamentally crashing my driver which is not good.
(I can’t put signal handlers into the library since it’s part of a big i/o stack wound up in a great big process, and the use of signals has been strictly forbidden in any corporate builds. It’s a long story, but that’s the way it is.)

Another alternative I was thinking of is for the test client to just send in a small message describing the i/o (direction, target, physical address and length to be transferred, etc…). The memory is physically contiguous, so this would be pretty straightforward. But then I have to have their test code break away from the normal use of read()/write(), which they’re more familiar with.

I did some research into the direct mapping idea you suggested. Considering that this is for a very specific test-only application, I think the synchronization requirements would be acceptable.

I’m going to try experimenting with this today so maybe I’ll figure it out, but just after reading through the _IO_MMAP default handler iofunc_mmap_default() source code I’m not clear what the client should do.

Is it something like this?

fd = open( "/dev/fc/wwn/rwbuf", O_RDWR );  /* O_RDWR: a PROT_WRITE, MAP_SHARED mapping needs a writable fd */
addr = mmap( 0, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0 );

Or more like the sample from the mmap() manpage which uses shm_open()?

/* Map in a shared memory region */
fd = shm_open( "/datapoints", O_RDWR, 0777 );
addr = mmap( 0, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0 );

In any case, let’s say that the resource manager supports direct mapping via _IO_MMAP, what does the read() and write() code look like in the client app? The same thing as usual except now the kernel knows the address is mapped and therefore doesn’t perform a copy? Or does the client app have to do something special? Or does the resource manager have extra work to do for each i/o?

My apologies for the lame questions, I’ve never had the need to get around the message copy before and it’s revealed some shortcomings in my understanding of the way QNX works!

thanks again!

mario · January 26, 2007, 11:07pm

Memory mapping (the first or second method, it doesn’t make any difference) means that the data is accessed via pointers, not via read() and write(). If you absolutely want to use read() and write(), this will involve message passing; there is no way around that (that I know of).
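
To illustrate the point about pointers, here is a minimal sketch in portable POSIX C. It assumes an ordinary temporary file as a stand-in for the resource manager's path, and the helper names map_shared() and demo_shared_visibility() are made up for this example. Two mappings of the same object see each other's stores without any read() or write() call.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map `len` bytes of an already-open descriptor as shared, read/write memory.
 * Returns NULL on failure. */
static void *map_shared(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical demo: returns 1 if a store made through one mapping is visible
 * through a second mapping of the same object, 0 if not, -1 on error. */
static int demo_shared_visibility(void)
{
    const size_t len = 4096;
    int result = -1;
    FILE *f = tmpfile();        /* stand-in for open("/dev/...", O_RDWR) */

    if (f == NULL)
        return -1;
    int fd = fileno(f);
    if (ftruncate(fd, (off_t)len) == 0) {
        char *a = map_shared(fd, len);
        char *b = map_shared(fd, len);
        if (a != NULL && b != NULL) {
            strcpy(a, "payload");                   /* plain store, no write() */
            result = (strcmp(b, "payload") == 0);   /* plain load, no read() */
        }
        if (a != NULL)
            munmap(a, len);
        if (b != NULL)
            munmap(b, len);
    }
    fclose(f);
    return result;
}
```

Nothing in this sketch tells the other side when the data is ready, which is the synchronisation problem mentioned earlier in the thread.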

What kind of device do you have that can push 200 Mbytes of data per second into memory (PCI-E, PCI-X?)

That being said, I don’t know how you did your test. But I pulled out some test programs I have to benchmark message passing, using 640K messages. I’m getting a transfer rate between the two programs of 600 Mbytes per second (it’s the same data being sent each time, though, so it probably all fits in the cache).
This is on a Xeon 3.8gig (2Meg of cache). The transfer rate will decrease as the block size decreases as well.

Threads will not help you get better throughput; the bottleneck will always be the device you write into. Having threads in the resmgr waiting in line to write to the device is not much different than having applications waiting in line for a single-threaded resmgr (I’m leaving some details aside…)

I don’t have all the details of your design, but let me take a shot at guessing how you did this (or are thinking about implementing it). The application performs a write, the data is transferred into the resmgr, and when the data is received it is transferred into the device. So there are two memory transfers: application->resmgr->hardware. To perform a real zero-copy operation the hardware must support DMA. The application would send a small message to the resmgr telling it where in memory the data is, and then the resmgr would set up the hardware to fetch the data, via DMA, directly from the application’s memory. I will leave out synchronisation details for the sake of simplicity. This would be the ideal scenario. I’m not familiar with the _IO_MMAP 6.3 feature, but I doubt it’s possible to implement this scenario with read() and write(). It’s possible with a custom API or maybe async messages. That being said, I get the feeling your device doesn’t support DMA, since you mentioned “and the CPU is just loping along”.

As I said earlier, my guess is your current design is application->resmgr->device. You would like the resmgr to be able to do the equivalent of a memcpy(), with the application’s memory as the source and the device as the destination, right? Basically resulting in what you call zero-copy messages. But think about it: there is still a copy involved. What if you could have the resmgr put the data it receives via QNX messaging directly into the device’s memory? Well, it’s possible. The resmgr framework will read the first X bytes of data, as set up in the attributes. Then you can MANUALLY get the rest of the data via resmgr_msgread() and specify the memory in the device as the receive buffer. Basically it’s the kernel that would be performing the memcpy() for you. This would be very close to the performance you get from doing the memcpy() yourself, from the application to the hardware.

Obviously this assumes the device is memory mapped and that the data sent by the application can be put in the device pretty much as is.

This would work for read operations as well; the resmgr would use resmgr_msgwrite() to copy from the device memory into the application memory.
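
The difference between the two paths can be shown with a small portable simulation. Nothing in it is QNX API: client_msg_t, msg_read_at() (a stand-in for the role resmgr_msgread() plays in a real resource manager) and the byte arrays playing the part of device memory are all assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for a client's message as the kernel sees it. */
typedef struct {
    const unsigned char *data;
    size_t len;
} client_msg_t;

static unsigned long copies_done;   /* counts bulk copies, for comparison only */

/* Hypothetical analogue of resmgr_msgread(): copy up to `n` bytes of the
 * client's message, starting at `offset`, into `dst`. Returns bytes copied. */
static size_t msg_read_at(const client_msg_t *m, void *dst, size_t n, size_t offset)
{
    if (offset >= m->len)
        return 0;
    if (n > m->len - offset)
        n = m->len - offset;
    memcpy(dst, m->data + offset, n);
    copies_done++;
    return n;
}

/* Path 1: receive into a private buffer, then copy to the device window. */
static size_t write_via_bounce(const client_msg_t *m, unsigned char *bounce,
                               unsigned char *dev, size_t cap)
{
    size_t n = msg_read_at(m, bounce, cap, 0);
    memcpy(dev, bounce, n);
    copies_done++;
    return n;
}

/* Path 2: name the device window itself as the receive buffer. */
static size_t write_direct(const client_msg_t *m, unsigned char *dev, size_t cap)
{
    return msg_read_at(m, dev, cap, 0);
}
```

The second path ends with the same bytes in the device window after one bulk copy instead of two, which is the saving described above.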

Am I making any kind of sense?

Tim · January 29, 2007, 3:24pm

Mario,

Since the bottleneck seems to be the double copy of the data, perhaps he could just set up his resource manager to have the read/write just take in pointers into the shared memory.

In other words, he sets up the shared memory between the resource manager and the application processes. Then the apps do a read() and his resource manager simply returns 2 values: a pointer to the place in the shared memory region where the data begins, and the length of the data.

The write() would be similar in that the app would send the resource manager 2 values: a pointer to the place in shared memory where the data begins, and the length of the data.

Then his apps can use the read/write functionality and with shared memory the data won’t be copied twice. He also won’t have to have a device that can do DMA since the resource manager can pull the data from shared memory and pass it to the device doing any needed processing.

Of course he still needs to implement mutexes to protect the shared memory regions.

Tim

P.S. This sort of idea was presented by Robert Krten in his ‘The QNX Cookbook - Recipes for Programmers’ in chapter 7.

mario · January 29, 2007, 7:26pm

That doesn’t really work transparently because you can’t send pointers; the pointers are virtual, so an address that is valid in one process is not guaranteed to be valid in another. You must send offsets (relative to the start of the shared memory). That kind of defeats the purpose of using read() and write() and would actually, IMO, confuse a typical human brain ;-) The shared memory would also have to be set up in conjunction with the resmgr, which can be done through POSIX functions such as mmap(), but that is not your typical open/read/write/close operation.

Either you use open/read/write/close as it was intended, or you build a custom API, which is actually not a bad solution after all.
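
A minimal sketch of the offset idea, in portable C. The names shm_view_t, shm_ptr_to_off() and shm_off_to_ptr() are invented for this example; the assumption is only that each process knows where it mapped the region and how long it is.

```c
#include <stddef.h>
#include <stdint.h>

/* One mapped view of the shared region in the current process. */
typedef struct {
    unsigned char *base;   /* where this process mapped the region */
    size_t         size;   /* region length in bytes */
} shm_view_t;

/* Convert a pointer inside the view to an offset that is safe to send.
 * Returns 0 on success, -1 if `p` is outside the region. */
static int shm_ptr_to_off(const shm_view_t *v, const void *p, uint64_t *off)
{
    uintptr_t b = (uintptr_t)v->base;
    uintptr_t a = (uintptr_t)p;

    if (a < b || a - b >= v->size)
        return -1;
    *off = (uint64_t)(a - b);
    return 0;
}

/* Resolve a received (offset, length) pair against the local mapping.
 * Returns NULL if the range does not fit inside the region. */
static void *shm_off_to_ptr(const shm_view_t *v, uint64_t off, uint64_t len)
{
    if (off > v->size || len > v->size - off)
        return NULL;
    return v->base + off;
}
```

The sender converts its pointer to an offset before putting it in a message, and the receiver resolves that offset against its own base address. The bounds check on the receiving side matters because the offset arrives from another process.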

bumpn86civic · January 29, 2007, 11:21pm

Thanks a lot.

Here’s a description of what I’m doing in a bit more detail.
The platform is a basic off-the-shelf 1ghz intel machine with pci 32/66MHz.
The Host Bus Adapter I’m using is a Qlogic ISP2300 series fibrechannel pcix card with 2GBit optical.

Communication with the device is like so:

1.) A SCSI command is submitted into the top-half of the i/o stack and translated into a request structure on a ring queue. (The ring queue physical address has been provided to the Qlogic HBA during initialization time and each q entry is 64-bytes…)
1a.) Part of the ‘translation’ process is to format the scatter-gather list data maintained on the ‘private’ i/o request structure onto the shared ring-q. In theory there may be a lot of these but in practice it’s never more than 10 and in my unit test case it’s only 1 since the buffer is physically contiguous.

2.) Update the in-pointer index on the Qlogic chip by writing to a register.

3.) The Qlogic firmware detects the change, reads in the queue entries from the shared ring-q.

4.) From here it’s automated.
The Qlogic HBA is a bus-master and sets up a DMA data transfer using the scatter-gather entries provided in the ring-q record and handles all the fibrechannel protocol details to carry out the i/o request with the scsi target.

5.) An interrupt pulse is received from the Qlogic HBA when the command is completed which executes a callback specified in the ‘private’ i/o structure to notify the client that the i/o is completed.
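
Steps 1 to 3 can be sketched as a toy ring queue in plain C. This is a simulation under stated assumptions, not the Qlogic interface: only the 64-byte entry size comes from the description above, while the field layout, the queue depth RQ_ENTRIES and the function names are invented for illustration.

```c
#include <stdint.h>
#include <string.h>

#define RQ_ENTRIES 8u   /* hypothetical depth; must be a power of two here */

/* One 64-byte request slot; the fields inside it are made up for this sketch. */
typedef struct {
    uint64_t sg_paddr;      /* single scatter-gather element: physical address */
    uint32_t sg_len;        /* ... and its length in bytes */
    uint32_t handle;        /* lets the completion find the private request */
    uint8_t  pad[48];
} rq_entry_t;

_Static_assert(sizeof(rq_entry_t) == 64, "ring entries are 64 bytes");

typedef struct {
    rq_entry_t slot[RQ_ENTRIES];
    uint32_t in;            /* producer index, written by the driver */
    uint32_t out;           /* consumer index, advanced by the "firmware" */
} ring_t;

/* Steps 1-2: format a request into the next slot, then publish the new
 * in-index. Returns 0 on success, -1 if the ring is full. */
static int rq_submit(ring_t *r, uint64_t paddr, uint32_t len, uint32_t handle)
{
    if (r->in - r->out == RQ_ENTRIES)
        return -1;
    rq_entry_t *e = &r->slot[r->in & (RQ_ENTRIES - 1)];
    memset(e, 0, sizeof *e);
    e->sg_paddr = paddr;
    e->sg_len = len;
    e->handle = handle;
    r->in++;                /* on real hardware this would be a register write */
    return 0;
}

/* Step 3: the consumer notices in != out and takes the oldest entry.
 * Returns 1 if an entry was copied to *e, 0 if the ring is empty. */
static int rq_consume(ring_t *r, rq_entry_t *e)
{
    if (r->in == r->out)
        return 0;
    *e = r->slot[r->out & (RQ_ENTRIES - 1)];
    r->out++;
    return 1;
}
```

The host only fills one small slot and bumps an index per command, which is why the CPU cost per request stays small regardless of the transfer length.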

So as you can see, if you specify a data tx length of say 1meg (or even 640K) with a single s/g descriptor, the amount of cpu time required to get the Qlogic chip to start working on the command is very small compared to the amount of time required for the actual data transfer to complete on the 2Gig wire. (That’s what I meant when I said that the cpu was loping along; it was just idle waiting for the big, relatively slow data transfers to complete between the FC-SCSI initiator and target).
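
A rough back-of-the-envelope check of those numbers, assuming "200Megs/Sec" means 200,000,000 bytes per second and "640K" means 640 × 1024 bytes (both readings are assumptions, as are the helper names):

```c
#include <stdint.h>

/* Microseconds needed to move `bytes` at `bytes_per_sec` (rounded down). */
static uint64_t wire_time_us(uint64_t bytes, uint64_t bytes_per_sec)
{
    return bytes * UINT64_C(1000000) / bytes_per_sec;
}

/* Whole commands per second the link can complete at a given transfer size. */
static uint64_t max_cmds_per_sec(uint64_t bytes_per_cmd, uint64_t bytes_per_sec)
{
    return bytes_per_sec / bytes_per_cmd;
}
```

With those assumptions, wire_time_us(640 * 1024, 200000000) gives 3276, so each 640K command occupies the wire for roughly 3.3 ms, and max_cmds_per_sec(640 * 1024, 200000000) gives 305, so only about 305 commands per second are needed to saturate the link.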

Shared Memory Approach

Ok, I see how this works now… Thanks for the explanation; it doesn’t look like the way to go for what I’m trying to do. (Though I’ve learned quite a bit more about what memory mapping means to QNX, which is always a good thing! 8^) )

Custom API Approach

Wow! I have to admit I was really surprised to see how fast QNX can send and receive messages between processes, even with multiple threads in each one! So following Tim’s suggestion, I modified the current resource manager with message_attach() and then used these structures to carry the information describing the i/o to be executed:

typedef struct fcdev_io_rqst_s {
    //
    // The routing header.
    //
    io_msg_t hdr;

    //
    // The target the i/o is headed for.
    //
    uint32_t target_handle;

    //
    // The description of the i/o operation.
    // Bitwise OR them together to change the behavior of the i/o
    // message processing.
    //
    uint32_t flags;
    #define PI_FCDEV_IO_FLAG_RD         (UINT32_C(1) << 0)
    #define PI_FCDEV_IO_FLAG_WR         (UINT32_C(1) << 1)
    #define PI_FCDEV_IO_FLAG_ASYNC      (UINT32_C(1) << 2)
    #define PI_FCDEV_IO_FLAG_10CDB      (UINT32_C(1) << 3)

    //
    // The physical address in memory where the i/o should
    // be conducted to/from.
    //
    uint64_t paddr;

    //
    // The i/o request lba address.
    //
    uint64_t lba;

    //
    // The number of 512-byte blocks you want to transfer.
    //
    uint32_t blk_cnt;

    //
    // Caller's private data pointer.
    // (This will be copied to the reply message...)
    //
    uint32_t caller_key;

} fcdev_io_rqst_st;

//*******************************************
// Reply message to the above i/o request.
//*******************************************
typedef struct fcdev_io_reply_s {
    //
    // The routing header.
    //
    io_msg_t hdr;

    //
    // Response codes.
    //
    uint32_t io_submit;     // Should be EOK if the i/o stack accepted it.

    //
    // SCSI Response Codes. (See pds_scsi.h for the details...)
    //
    uint8_t scsi_status;    // Should be SCSI_STATUS_GOOD
    uint8_t skey;           // Sense key. Only valid if 'scsi_status' is not GOOD.
    uint8_t asc;            // Additional sense code.
    uint8_t ascq;           // Additional sense code qualifier.

    //
    // Caller's private data pointer.
    // (Copied from the above io_rqst message...)
    //
    uint32_t caller_key;

} fcdev_io_reply_st;

Using these messages and just programming values onto them in the client, then ‘faking’ completion in the resource manager just for the sake of seeing how many commands/replies the cpu could execute, I measured a transaction rate just over 300,000 transactions per second.
Considering the Qlogic h/w can only handle an absolute maximum of 40,000/Sec, that’s more than enough message bandwidth!

Current Experiment and a Question:

1.) Create an api that ‘feels’ like read()/write().
My plan is to use these message structs above and just create a library that will allow synchronous i/o (where I leave the client blocked until the i/o completes…) and async i/o (where I perform the reply from the callback and just return _RESMGR_NOREPLY from the message handler I attached). That’s the correct way to respond if you want to perform the reply outside the message handler, right?
I was just going to wrap the MsgSend() calls within some xxx_read(), xxx_write() functions so the test engineers can write their test code in a way that makes them cozy.
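
One possible shape for such a wrapper, as a portable sketch. Everything here is hypothetical: the structures are trimmed stand-ins for the ones above (the io_msg_t header is left out), the xxx_send_fn hook stands in for a MsgSend() call, and xxx_fake_send() simply completes every request, in the spirit of the ‘faked’ completion experiment.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define XXX_FLAG_RD (UINT32_C(1) << 0)
#define XXX_FLAG_WR (UINT32_C(1) << 1)

/* Trimmed stand-ins for the request/reply structures in the post. */
typedef struct {
    uint32_t target_handle, flags;
    uint64_t paddr, lba;
    uint32_t blk_cnt, caller_key;
} xxx_rqst_t;

typedef struct {
    uint32_t io_submit;     /* 0 stands in for EOK */
    uint8_t  scsi_status;   /* 0 stands in for SCSI_STATUS_GOOD */
    uint32_t caller_key;
} xxx_reply_t;

/* Hypothetical transport hook. On QNX it would wrap MsgSend() on the fd;
 * it returns 0 if a reply was received, -1 otherwise. */
typedef int (*xxx_send_fn)(int fd, const xxx_rqst_t *rq, xxx_reply_t *rp);

static xxx_rqst_t xxx_last_rqst;    /* what the fake transport last saw */

/* Fake transport that completes every request successfully. */
static int xxx_fake_send(int fd, const xxx_rqst_t *rq, xxx_reply_t *rp)
{
    (void)fd;
    xxx_last_rqst = *rq;
    rp->io_submit = 0;
    rp->scsi_status = 0;
    rp->caller_key = rq->caller_key;
    return 0;
}

/* Shared body: build the request, send it, and map the reply onto a
 * read()/write()-style result (blocks transferred, or -1 with errno set). */
static int64_t xxx_io(xxx_send_fn send, int fd, uint32_t flags, uint32_t target,
                      uint64_t paddr, uint64_t lba, uint32_t blk_cnt)
{
    xxx_rqst_t rq;
    xxx_reply_t rp;

    memset(&rq, 0, sizeof rq);
    memset(&rp, 0, sizeof rp);
    rq.target_handle = target;
    rq.flags = flags;
    rq.paddr = paddr;
    rq.lba = lba;
    rq.blk_cnt = blk_cnt;

    if (send(fd, &rq, &rp) != 0 || rp.io_submit != 0 || rp.scsi_status != 0) {
        errno = EIO;
        return -1;
    }
    return (int64_t)blk_cnt;
}

static int64_t xxx_read(xxx_send_fn send, int fd, uint32_t target,
                        uint64_t paddr, uint64_t lba, uint32_t blk_cnt)
{
    return xxx_io(send, fd, XXX_FLAG_RD, target, paddr, lba, blk_cnt);
}

static int64_t xxx_write(xxx_send_fn send, int fd, uint32_t target,
                         uint64_t paddr, uint64_t lba, uint32_t blk_cnt)
{
    return xxx_io(send, fd, XXX_FLAG_WR, target, paddr, lba, blk_cnt);
}
```

A test author would call something like xxx_read(xxx_fake_send, fd, target, paddr, lba, 1280) and check for a return of 1280 blocks. Passing the transport as a parameter is just a choice that lets the wrapper be exercised without the resource manager running.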

2.) Question: Can I get ‘path-specific’ data delivered to the message_attach() function?
What I really liked about the traditional read()/write() approach was that I could associate target-specific information in an extended ocb. This way I get the pre-computed target-handle the Qlogic layer can use directly (just a 32-bit key…) in each read()/write() iofunc entrypoint execution which saves a lot of time.
But the message_attach() just appears to let me associate a void* that will accompany the handler function. What I really need is a way to associate path-specific information with every message that gets sent to that fd after it’s been open()'d. Is that possible? (If I could get at the ocb I’d be set!)

The other possibility would be to use the ctp->rcvid as a key to lookup the target information.
There are never more than 127 targets max and the number of open()'s would be <16 (or could be driven lower if necessary…) Does that sound like a reasonable approach if the answer to the above question is ‘No’?
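
For a table that small, a fixed array with a linear scan is probably enough. The sketch below is generic C with invented names (open_table_t, ot_insert() and so on). The key is left as a plain integer because the thread does not confirm whether ctp->rcvid stays the same across messages from one open(), so which value to use as the key is an open assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_OPENS 16    /* the post expects fewer than 16 concurrent open()s */

typedef struct {
    int      in_use;
    int32_t  key;             /* whatever identifies the opener */
    uint32_t target_handle;   /* pre-computed handle for the lower layer */
} open_slot_t;

typedef struct {
    open_slot_t slot[MAX_OPENS];
} open_table_t;

static void ot_init(open_table_t *t)
{
    memset(t, 0, sizeof *t);
}

/* Linear scan; with at most 16 slots this touches only a few cache lines. */
static open_slot_t *ot_find(open_table_t *t, int32_t key)
{
    for (size_t i = 0; i < MAX_OPENS; i++)
        if (t->slot[i].in_use && t->slot[i].key == key)
            return &t->slot[i];
    return NULL;
}

/* Returns 0 on success, -1 if the key is already present or the table is full. */
static int ot_insert(open_table_t *t, int32_t key, uint32_t handle)
{
    if (ot_find(t, key) != NULL)
        return -1;
    for (size_t i = 0; i < MAX_OPENS; i++) {
        if (!t->slot[i].in_use) {
            t->slot[i].in_use = 1;
            t->slot[i].key = key;
            t->slot[i].target_handle = handle;
            return 0;
        }
    }
    return -1;
}

/* Returns 0 and stores the handle in *handle, or -1 if the key is unknown. */
static int ot_lookup(open_table_t *t, int32_t key, uint32_t *handle)
{
    open_slot_t *s = ot_find(t, key);
    if (s == NULL)
        return -1;
    *handle = s->target_handle;
    return 0;
}

/* Returns 0 if the key was removed, -1 if it was not present. */
static int ot_remove(open_table_t *t, int32_t key)
{
    open_slot_t *s = ot_find(t, key);
    if (s == NULL)
        return -1;
    s->in_use = 0;
    return 0;
}
```

The entry would be inserted when the path is opened and removed on close. If several resource manager threads handle messages at once, the table also needs a lock around insert and remove.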

Thanks again for your patience and taking the time to answer my questions, I certainly appreciate it!
Also, Tim thanks for the book title; going to pick that up asap!
regards,
-barry