Message passing advice wanted

John Nagle wrote:

Kevin Stallard wrote:

Hi Jorri,


I’ve noticed that the Send/Receive/Reply mechanism is somewhat feared
by folks. Synchronous communication can be misconstrued as being slow
and troublesome. They fear something holding up the sender so it
can’t continue processing when it needs to so they develop some kind
of asynchronous communication scheme or use a queue. Some folks
consider it easier (I would disagree). Sometimes it may actually be
necessary, although I haven’t (yet) come across a reason why it has to
be. I think many folks’ point of reference for QNX IPC is that of a
TCP/IP socket connection. I wouldn’t want to learn the subtleties of
a connection-oriented socket in a critical system either, so without QNX
IPC, async would be preferable.


Me too. The MsgSend/MsgReceive/MsgReply mechanism’s performance is
what makes QNX work.

You mean “what makes the resource manager concept and the client/server
concept of QNX work” …

But it’s not well understood by many QNX users.

Yes … it seems so. Have a look at the very fast message passing
implementations of PVM and MPI … they are not bound to a specific OS!

–Armin

PS: what about the TIPC ??

In almost every other mainstream operating system, message
passing performance is poor. It’s slow, badly integrated with
the scheduler, or doesn’t have proper memory protection.
UNIX(BSD) got it wrong. Linux got it wrong. Mach got it wrong.
Windows got it wrong.

QNX got it right. It’s a very elegant design, and the design
issues behind it need to be thoroughly understood by QNX programmers.
The interaction with scheduling, the non-blocking in MsgReply, and
the ability of TimerTimeout to cleanly break out of a blocked
operation all play together to make it work in hard real time
applications.

I wrote a bit in Wikipedia about this, in the QNX article.

John Nagle

Igor Kovalenko wrote:

You have to be careful with the word “performance”. Sync message passing
will inherently have lower cost than async, especially when tightly
integrated with the scheduler. But comparing it against async schemes is
unfair and pointless; you’re not comparing equivalent feature sets.

There’s a reason QNX does not use SRR mechanism for communication between
drivers and their respective I/O managers. There’s also a reason why they
bothered to come up with async kernel messages…

Also be careful with trusting the “kernel timeouts”. There is a window for
a race condition between when you call TimerTimeout() and when your blocking
call starts. You can get preempted and not get the CPU back before the
timeout expires. This may not be very likely, but it can happen.

True … but it depends on the priority of your application.
Give your process (or thread) a higher (or the highest) priority before
it goes into the blocked state.

Regards

–Armin

For the purposes of sustained data transfer SRR will be faster on average
than shared memory block controlled by a pair of semaphores (in a classic
producer-consumer arrangement). However if your requirements really do not
permit blocking then shared memory with semaphores will be your best option.

– igor

“John Nagle” <nagle@downside.com> wrote in message
news:e8i0me$876$1@inn.qnx.com…

[…]

Igor Kovalenko wrote:

You have to be careful with the word “performance”. Sync message passing
will inherently have lower cost than async, especially when tightly
integrated with the scheduler. But comparing it against async schemes is
unfair and pointless; you’re not comparing equivalent feature sets.

There’s a reason QNX does not use SRR mechanism for communication between
drivers and their respective I/O managers. There’s also a reason why they
bothered to come up with async kernel messages…

BTW … you can study the performance impacts of a clean message-passing
based OS with the MINIX 3.0 implementation :-) ( http://www.minix3.org )

–Armin

Also be careful with trusting the “kernel timeouts”. There is a window for
a race condition between when you call TimerTimeout() and when your blocking
call starts. You can get preempted and not get the CPU back before the
timeout expires. This may not be very likely, but it can happen.

What happens in this scenario?

I looked at the doc and it’s not clear to me if the countdown starts when
TimerTimeout is invoked or when the kernel call is performed.



[…]

Mario Charest wrote:

Also be careful with trusting the “kernel timeouts”. There is a window for
a race condition between when you call TimerTimeout() and when your blocking
call starts. You can get preempted and not get the CPU back before the
timeout expires. This may not be very likely, but it can happen.


What happens in this scenario?

I looked at the doc and it’s not clear to me if the countdown starts when
TimerTimeout is invoked or when the kernel call is performed.

The timeout is activated only if the kernel call causes the thread to block.
It is cancelled as soon as the thread is unblocked, so preemption should
not have any effect on the timeout (unless TIMER_ABSTIME was specified).

Sunil.

Sunil Kittur wrote:

Mario Charest wrote:
[…]
The timeout is activated only if the kernel call causes the thread to
block. It is cancelled as soon as the thread is unblocked, so preemption
should not have any effect on the timeout (unless TIMER_ABSTIME was
specified).

This implies that for deadline scheduling you would want to use
absolute timeouts. If your thread is preempted after calling
TimerTimeout() to install a relative timeout but before making the
kernel call, then that time period is not accounted for and your
deadline will slip. Suppose you had a 20ms timeout, but other
threads ran for 10ms before you made the blocking call; then 30ms
of wall time would elapse, whereas an absolute time would unblock you
after just 10ms of waiting in this case. The window described above is
not that the timer expires, but that it doesn’t; the semantics
of the control loop determine which is best (must re-control
every 20ms, or be prepared to accept a command within 20ms).

John Garvey wrote:

[…]


This implies that for deadline scheduling you would want to use
absolute timeouts. If your thread is preempted after calling
TimerTimeout()

As Sunil mentioned … it is not TimerTimeout() that preempts the thread;
that happens with the ‘blocking’ kernel call, which is also what starts
the timer!

–Armin

Armin Steinhoff wrote:

John Garvey wrote:
As Sunil mentioned … it is not TimerTimeout() that preempts the thread;
that happens with the ‘blocking’ kernel call, which is also what starts
the timer!
–Armin

Try reading it again more carefully … A relative TimerTimeout() will
let you block in the subsequent kernel call for that length of time,
regardless of when you actually get to make that kernel call (which
could be some period of time after calculating the timeout value to
use, if other threads are scheduled to run in between, thus a slippage
in elapsed/wall time before the call unblocks).

Armin Steinhoff wrote:

Igor Kovalenko wrote:
BTW … you can study the performance impacts of a clean message-passing
based OS with the MINIX 3.0 implementation :-) ( http://www.minix3.org )

No, Minix 3 doesn’t get it right. I just read the book.
The Minix message passing system isn’t exported to the application level;
applications just use pipes. Minix doesn’t have QNX’s careful
interaction of scheduling and message passing, which is essential
to get good performance. And Minix 3 still has most of its drivers inside
the kernel.

There’s an “extension” to Minix that moves more drivers outside
the kernel, but it’s an awful hack with a special API for drivers.

John Nagle

“Igor Kovalenko” <kovalenko@comcast.net> wrote in message
news:e8i7j7$c5m$1@inn.qnx.com

There’s a reason QNX does not use SRR mechanism for communication between
drivers and their respective I/O managers.
There’s also a reason why they bothered to come up with async kernel
messages…

I was told that a customer really wanted async messages, so they
implemented it.

“Kevin Stallard” <kevin@a.com> wrote in message
news:e8k7r1$np5$1@inn.qnx.com

“Igor Kovalenko” <kovalenko@comcast.net> wrote in message
news:e8i7j7$c5m$1@inn.qnx.com…
There’s a reason QNX does not use SRR mechanism for communication between
drivers and their respective I/O managers.
There’s also a reason why they bothered to come up with async kernel
messages…

I was told that a customer really wanted async messages, so they
implemented it.

They are needed to implement things like POSIX message queues in a way that
does not suck. That was always horrible in QNX.

John Garvey wrote:

Armin Steinhoff wrote:

John Garvey wrote:
As Sunil mentioned … it is not TimerTimeout() that preempts the thread;
that happens with the ‘blocking’ kernel call, which is also what starts
the timer!
–Armin


Try reading it again more carefully … A relative TimerTimeout() will
let you block in the subsequent kernel call for that length of time,
regardless of when you actually get to make that kernel call

That sentence really makes no sense to me.

I don’t believe your understanding is correct.
Try reading the manual again more carefully:


Description:

The TimerTimeout() and TimerTimeout_r() kernel calls set a timeout on
any kernel blocking state.

These functions are identical except in the way they indicate errors.
See the Returns section for details.

These blocking states are entered as a result of the following kernel
calls:

and so on.


I hope that the docs and the implementation are in line.

–Armin



(which could be some period of time after calculating the timeout value
to use, if other threads are scheduled to run in between, thus a slippage
in elapsed/wall time before the call unblocks).

John Nagle wrote:

Armin Steinhoff wrote:

Igor Kovalenko wrote:
BTW … you can study the performance impacts of a clean message-passing
based OS with the MINIX 3.0 implementation :-) ( http://www.minix3.org )


No, Minix 3 doesn’t get it right. I just read the book.
The Minix message passing system isn’t exported to the application level;
applications just use pipes. Minix doesn’t have QNX’s careful
interaction of scheduling and message passing, which is essential
to get good performance.

Here are some answers from the Minix NG:

The Minix3 operating system (and applications running on it) is a set of
loosely coupled processes. There are 5 different system calls available:

  • SENDREC
  • SEND
  • RECEIVE
  • NOTIFY
  • ECHO

This is the only way processes can interact. This is how particular
operating system components interact, as well as how (after all) “user
processes” invoke operating system services. If a user-space process
performs (calls) `read’, this is actually a procedure which sends an
appropriate message to the appropriate OS component (the filesystem).

Those communication primitives are used to create a UNIX-like environment
and are somehow buried in the libraries the programs are using. It is true
that Minix 3 processes interact with each other in the UNIX way (pipes,
signals, etc.), but these abstractions are implemented underneath by the
above communication primitives.

Minix doesn’t have QNX’s careful
interaction of scheduling and message passing, which is essential
to get good performance.

Minix 3 scheduling ensures that processes with higher priority
cannot be preempted by processes with lower priorities. That has
advantages as well as disadvantages. The advantages are that this policy
is simple to implement and comprehend, and that more important events
(e.g. some hardware events) will certainly be handled prior to less
important ones (user process requests). The disadvantage is that
high-priority processes (if written in a wrong way) could seize the
CPU forever.

( my comment: that’s not very special for Minix :-) )


And Minix 3 still has most of its drivers
inside the kernel.

That is not true. All device drivers (but the clock) exist as user-space
processes. It is true they have somewhat more privileges than normal
applications. Device drivers can ask the system task to do I/O
operations (and other things) on their behalf. Other (ordinary)
user-space processes can try to do that too, but at run time the
system task will refuse to do so.

There’s an “extension” to Minix that moves more drivers outside
the kernel, but it’s an awful hack with a special API for drivers.

What book did you really read? :-)

–Armin


John Nagle

Igor Kovalenko wrote:

[…]

They are needed to implement things like POSIX message queues in a way that
does not suck. That was always horrible in QNX.

Oh, that’s why.

A big advantage of synchronous message passing is that MsgReply
is non-blocking. So when your low-priority non-realtime process makes
some request of a higher-priority process, the higher priority process
can’t get stuck at MsgReply waiting for the lower priority caller
to get some CPU time.

I really came to appreciate all this when doing the Overbot
software. We had a lot of stuff going on in one CPU: mid-level
servoloop control, LIDAR data processing, video processing,
map building, and planning. QNX could meet the real time constraints
consistently. And we were checking; if updates didn’t get done in
time, emergency hardware timers tripped and the brakes slammed on.
Even at 80% CPU utilization, it all worked.

John Nagle

“John Nagle” <nagle@downside.com> wrote in message
news:e8ma7s$7q1$1@inn.qnx.com

A big advantage of synchronous message passing is that MsgReply
is non-blocking. So when your low-priority non-realtime process makes
some request of a higher-priority process, the higher priority process
can’t get stuck at MsgReply waiting for the lower priority caller
to get some CPU time.

That is not an “advantage” of sync message passing. Async would not block
either. If you think about it, the sync message passing is nothing more than
a special case of async one, where the queue is limited to 1 item of size
min(send_buffer, recv_buffer) and is provided by either sender or receiver.
Which means either sender or receiver has to block.

The trick is integrating this with the scheduler so you can avoid priority
inversion and take advantage of the fact that you block. Your blocking on
send/recv provides implicit synchronisation. That is why QNX message passing
is faster than shared memory queue with flow control via semaphores (you
have 3 kernel calls per transaction with SRR vs 4 with semaphores). Of
course the gain is at the expense of flexibility - there’s still no free
lunch, QNX or not.

Other systems have used this idea too. It is incorrect to say ‘they got it
wrong’. They had different design goals. Mach, for example, was an academic
exercise and very advanced at that. Their design goals included the ability
to run unmodified BSD binaries, VM external to the kernel, sharing memory
across a network, etc. QNX on the other hand had limited goals and targeted
mostly embedded systems where all that sophistication is not needed, nor
are legacy 3rd-party apps much of a concern. So they are fast, but it’s not
the message passing that they got right. It is the balance of complexity and
features. So I will point out that QNX can’t run its own binaries from older
releases, let alone BSD binaries. No free lunch.

What QNX realized is that it’s not copying that kills you (as many naive
opponents of message passing tend to assume). It is the context switches. So
they made it (the kernel) simple, which made it much easier to make context
switches cheap. I am sure Mach people knew that too, but they could not make
it that simple given their design goals. So they tried to optimize message
passing using copy-on-write, but that did not help all that much since
copying does not hurt you that much in the first place.

What QNX failed to see (or chose to ignore) is that in a system where
high-bandwidth data has to travel through multiple memory-isolated
subsystems, let alone where some of them use async abstractions built on top
of sync ones (which really should be the other way around) performance will
be miserable. Disk I/O and TCP/IP performance … cough, cough

I really came to appreciate all this when doing the Overbot
software. We had a lot of stuff going on in one CPU: mid-level
servoloop control, LIDAR data processing, video processing,
map building, and planning. QNX could meet the real time constraints
consistently. And we were checking; if updates didn’t get done in
time, emergency hardware timers tripped and the brakes slammed on.
Even at 80% CPU utilization, it all worked.

Yes, the ability of a system to avoid priority inversion is one of the keys
to that. If you have ever seen tools that do RMS analysis/simulation, it is
amazing to see how the picture changes when you click on the ‘use priority
inheritance’ button. All of a sudden you need a much less powerful CPU to
meet your deadlines…

Linux is getting those capabilities, albeit slowly, because Linus does not
feel like encumbering the kernel with stuff that will benefit 10% of users
at the expense of the other 90%. But Montavista has got enough money and
momentum to keep pushing so far (of course we’re providing a good chunk of
that money, lol).

– igor

Igor Kovalenko wrote:

“John Nagle” <nagle@downside.com> wrote in message
news:e8ma7s$7q1$1@inn.qnx.com…

A big advantage of synchronous message passing is that MsgReply
is non-blocking. So when your low-priority non-realtime process makes
some request of a higher-priority process, the higher priority process
can’t get stuck at MsgReply waiting for the lower priority caller
to get some CPU time.


That is not an “advantage” of sync message passing. Async would not block
either. If you think about it, the sync message passing is nothing more than
a special case of async one, where the queue is limited to 1 item of size
min(send_buffer, recv_buffer) and is provided by either sender or receiver.
Which means either sender or receiver has to block.

The problem is where to put the message on reply. In an async
message pass, if there’s no one ready to receive, you either have to
block the sender, or find space somewhere to queue the message for
later delivery. With MsgSend/MsgReply, when the MsgReply takes place,
you’re guaranteed that the receiver is ready to take the message.
(If the MsgSend was cancelled, the reply is lost, not queued.)
So MsgReply never needs to block, and you don’t have to worry
about buffer space exhaustion issues. This makes the sending
of large messages (I’ve sent video frames usefully) work
effectively.

The other issue with async messaging is that processes
which contain a send followed by a receive create scheduling
issues. If you send to a higher priority process, then
the sending process will lose the CPU, and won’t make it
to its receive before the higher priority process gets
done and sends a message back. Then the higher priority
process may have to stall, or the message gets queued
for later delivery. A few trips through the scheduler
later, everything unwinds, but there’s a performance
penalty.

This is why emulating a subroutine call with async
messaging tends to be sluggish.

John Nagle

What QNX realized is that it’s not copying that kills you (as many naive
opponents of message passing tend to assume). It is the context switches.
So they made it (the kernel) simple, which made it much easier to make
context switches cheap. I am sure Mach people knew that too, but they
could not make it that simple given their design goals. So they tried to
optimize message passing using copy-on-write, but that did not help all
that much since copying does not hurt you that much in the first place.

What QNX failed to see (or chose to ignore) is that in a system where
high-bandwidth data has to travel through multiple memory-isolated
subsystems, let alone where some of them use async abstractions built on
top of sync ones (which really should be the other way around) performance
will be miserable. Disk I/O and TCP/IP performance … cough, cough

On the one hand your point in the first paragraph is that context switches
are the bottleneck, yet it appears in the second paragraph that you’re also
hitting QNX with a data throughput penalty because of copying.

Well…I guess it depends on how all that data is moved through the system
(piece-meal or in big chunks…)… piece-meal would cause more context
switches. This is why they have the ethernet drivers and io-manager in the
same process. Interesting…

Kevin

“Kevin Stallard” <kevin@a.com> wrote in message
news:e8p6ea$6hl$1@inn.qnx.com

[…]

On the one hand your point in the first paragraph is that context switches
are the bottleneck, yet it appears in the second paragraph that you’re
also hitting QNX with a data throughput penalty because of copying.

Don’t put words in my mouth. I did not say ‘because of copying’.

Well…I guess it depends on how all that data is moved through the system
(piece-meal or in big chunks…)… piece-meal would cause more context
switches. This is why they have the ethernet drivers and io-manager in
the same process. Interesting…

Exactly. Traveling though multiple subsystems means context switches, not
just copying.

– igor

“John Nagle” <nagle@downside.com> wrote in message
news:e8p4om$5b3$1@inn.qnx.com

[…]

The problem is where to put the message on reply. In an async
message pass, if there’s no one ready to receive, you either have to
block the sender, or find space somewhere to queue the message for
later delivery. With MsgSend/MsgReply, when the MsgReply takes place,
you’re guaranteed that the receiver is ready to take the message.
(If the MsgSend was cancelled, the reply is lost, not queued.)
So MsgReply never needs to block, and you don’t have to worry
about buffer space exhaustion issues. This makes the sending
of large messages (I’ve sent video frames usefully) work
effectively.

But it also makes sending small messages work ineffectively, since each one
has to make full round trip before next one can go. No free lunch.

The other issue with async messaging is that processes
which contain a send followed by a receive create scheduling
issues. If you send to a higher priority process, then
the sending process will lose the CPU, and won’t make it
to its receive before the higher priority process gets
done and sends a message back. Then the higher priority
process may have to stall, or the message gets queued
for later delivery. A few trips through the scheduler
later, everything unwinds, but there’s a performance
penalty.

If someone really wanted to, async messages could be designed with priority
inheritance too.

This is why emulating a subroutine call with async
messaging tends to be sluggish.

True. But emulating a subroutine call with sync messaging is still slower
than a subroutine call.

– igor