SMP on MIPS CPU

There are two chips I am interested in:
(1) PMC-Sierra RM9000x2 http://www.pmc-sierra.com/products/details/rm9000x2/
(2) Broadcom BCM1250
http://www.broadcom.com/products/1250.html

Both are multicore chips (two CPUs on a single die). Is there any chance
of making them work under QNX, preferably in SMP mode?

Actually, I need a “number cruncher” with more clock cycles per watt per
square inch. Any suggestion is very welcome.

-dip

Dmitri Poustovalov <pdmitri@bigfoot.com> wrote:

Both are multicore chips (two CPUs on a single die). Is there any chance
of making them work under QNX, preferably in SMP mode?

We don’t have SMP support for MIPS at the current time.

I’ve looked at the Broadcom in the past and we won’t currently run on
it in uniprocessor mode due to the fully virtual instruction cache.

I haven’t heard of the PMC-Sierra chip before and the PMC-Sierra web site
and I are having a small disagreement right now :-) - I can’t get at the
documentation to see if they’ve done something that we don’t like, so I
can’t say whether or not we’ll run in uniprocessor mode.

As always, contact your sales guy if you want to suggest supporting
the above.


Brian Stecher (bstecher@qnx.com) QNX Software Systems, Ltd.
phone: +1 (613) 591-0931 (voice) 175 Terence Matthews Cr.
+1 (613) 591-3579 (fax) Kanata, Ontario, Canada K2M 1W8

“Brian Stecher” <bstecher@qnx.com> wrote in message
news:a33oud$m6l$1@nntp.qnx.com

We don’t have SMP support for MIPS at the current time.

I’ve looked at the Broadcom in the past and we won’t currently run on
it in uniprocessor mode due to the fully virtual instruction cache.

Could you please give a little more detail on what’s so bad about the
virtual instruction cache? I don’t dare ask why QNX doesn’t like it ;-)
but why are Linux and NetBSD fine with it?

Thanks,
-dip

Dmitri Poustovalov <pdmitri@bigfoot.com> wrote:

Could you please give a little more detail on what’s so bad about the
virtual instruction cache? I don’t dare ask why QNX doesn’t like it ;-)
but why are Linux and NetBSD fine with it?

MIPS processor caches are historically virtually indexed and physically
tagged. That means that the initial cache lookup is done with a virtual
address (which is not unique across processes), but before the cache line
is used, a check is made to make sure that the physical address provided
by the TLB matches what the cache also thinks should be there. If they
don’t match, a cache miss is declared. So, for the most part, you can
pretend that the cache is a physical one (there are some strangenesses
that I won’t go into). The Broadcom chip is virtually indexed and
virtually tagged, which means they don’t do the second check. The upshot
is that you have to flush the instruction cache when you do a context
switch. Now, there’s some other stuff that I won’t go into that lets you
avoid flushing every time, but we don’t currently have any code in place
to do flushing at all, which is why we won’t run on the chip right now.

If you were to try it with the current procnto, you would occasionally
find the CPU seemingly executing instructions other than what’s in memory
for the process.

The Linux & NetBSD ports will have added the extra instruction cache
flushes. If/when we port to Broadcom, we’ll add them as well. The bad part
of doing them is that they slow down the context switches.
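
To make that concrete, here is a minimal sketch of the kind of hook an OS
port might add for a virtually indexed, virtually tagged (VIVT) I-cache.
The routine names are hypothetical stand-ins, not QNX’s actual procnto
code or the Broadcom API:

```c
/* Hypothetical context-switch hook for a CPU with a VIVT instruction
 * cache.  Because lines are looked up and matched purely by virtual
 * address, process B can hit lines that process A filled at the same
 * virtual address but from different physical memory -- i.e. stale
 * instructions.
 *
 * icache_invalidate_all() stands in for whatever the hardware provides
 * (e.g. a loop of MIPS CACHE instructions over every index); it is not
 * a real QNX or Broadcom routine.
 */
extern void icache_invalidate_all(void);

struct address_space;               /* opaque per-process MMU state */

void
context_switch_mmu(struct address_space *prev, struct address_space *next)
{
    if (prev != next) {
        /* On a physically tagged I-cache this flush is unnecessary:
         * the tag check already rejects lines belonging to another
         * process.  On a VIVT I-cache everything must be discarded so
         * the incoming process can't execute the outgoing one's code. */
        icache_invalidate_all();
    }
    /* ...switch TLB/ASID state over to 'next' here... */
}
```

This is exactly the extra per-switch cost mentioned above; the “other
stuff” that avoids flushing on every switch comes up later in the thread.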


Brian Stecher (bstecher@qnx.com) QNX Software Systems, Ltd.
phone: +1 (613) 591-0931 (voice) 175 Terence Matthews Cr.
+1 (613) 591-3579 (fax) Kanata, Ontario, Canada K2M 1W8

Thanks for “MIPS cache for dummies” :-)

Brian Stecher wrote:

The Linux & NetBSD ports will have added the extra instruction cache
flushes. If/when we port to Broadcom, we’ll add them as well. The bad part
of doing them is that they slow down the context switches.

So (simple curiosity at work), since you have now implemented this, what
is the damage to the context switches (vis-a-vis a uniprocessor with a
physically tagged cache)?

Rennie

Rennie Allen <rallen@csical.com> wrote:

So (simple curiosity at work), since you have now implemented this, what
is the damage to the context switches (vis-a-vis a uniprocessor with a
physically tagged cache)?

I haven’t done any benchmarking to see the actual numbers but, empirically,
it doesn’t seem to hurt too much. It’s a fast chip.


Brian Stecher (bstecher@qnx.com) QNX Software Systems, Ltd.
phone: +1 (613) 591-0931 (voice) 175 Terence Matthews Cr.
+1 (613) 591-3579 (fax) Kanata, Ontario, Canada K2M 1W8

Got me curious. Don’t you also have to flush cached pages before and
after DMA transfers, plus when several processes have a page mapped at
addresses that don’t fall into the same cache line? And what about things
like mprotect()?

I might be confused here, since I only know this stuff from books, not
from experience…

“Igor Kovalenko” <kovalenko@attbi.com> wrote in message
news:3DA39888.8000807@attbi.com

Got me curious. Don’t you also have to flush cached pages before and
after DMA transfers,

RAM used for DMA transfers and dual-ported RAM are usually not cached;
mmap’s PROT_NOCACHE option tells the kernel to avoid caching such memory
regions. Also, the kernel is not aware of when a DMA transfer occurs;
such a transfer is a device-memory interaction that happens without the
CPU.
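
For concreteness, this is roughly the usual QNX Neutrino idiom for such a
buffer (a sketch only; the buffer size and error handling are
illustrative, and the mmap()/mem_offset() docs for your release spell out
the exact guarantees):

```c
/* Sketch: allocate an anonymous, physically contiguous buffer and map
 * it uncached so the CPU and the DMA device always see the same bytes
 * without explicit flushing.  Size and error handling are illustrative.
 */
#include <sys/mman.h>
#include <fcntl.h>

#ifndef NOFD
#define NOFD (-1)            /* "no file descriptor" for anonymous maps */
#endif

#define DMA_BUF_SIZE (64 * 1024)

void *alloc_dma_buffer(off_t *phys)
{
    size_t contig;

    void *buf = mmap(0, DMA_BUF_SIZE,
                     PROT_READ | PROT_WRITE | PROT_NOCACHE,
                     MAP_PHYS | MAP_ANON, NOFD, 0);
    if (buf == MAP_FAILED)
        return 0;

    /* The DMA engine gets programmed with the physical address. */
    if (mem_offset(buf, NOFD, DMA_BUF_SIZE, phys, &contig) == -1) {
        munmap(buf, DMA_BUF_SIZE);
        return 0;
    }
    return buf;
}
```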

By “usually not cached” I mean an architecture where a device initiates a
DMA transfer on its own and notifies the CPU with an interrupt on
completion. An Ethernet chip is the best example. Sure, there is an
exception that proves the rule :-) Let’s imagine a “best-effort” image
recognition system that we are building with a PCI board that can
DMA-transfer a huge chunk of data on request. Our CPU, the Broadcom
BCM1250 for the sake of conversation ;-), is fast enough to crunch this
data in acceptable time only if the data is cached. In this case we would
(1) keep the DMA region cached, (2) explicitly initiate DMA transfers by
requesting them from the PCI chip, and (3) invalidate/flush the cache
when needed. Hmm, what’s my point? My point is that the kernel has no
awareness of our devices and our cache strategies.
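
A rough sketch of that cached-DMA pattern, with hypothetical
cache-maintenance helpers standing in for whatever the platform or BSP
actually provides (none of these names are real QNX calls):

```c
/* Cached DMA buffer with explicit maintenance.  dcache_flush_range()
 * and dcache_invalidate_range() are hypothetical stand-ins for the
 * platform's cache-control primitives; start_dma()/wait_dma_done()
 * stand in for programming the (imaginary) PCI board.
 */
#include <stddef.h>
#include <stdint.h>

extern void dcache_flush_range(void *vaddr, size_t len);      /* write back */
extern void dcache_invalidate_range(void *vaddr, size_t len); /* discard    */
extern void start_dma(uint64_t phys, size_t len, int to_device);
extern void wait_dma_done(void);

/* Device -> memory: make sure the CPU sees the fresh data afterwards. */
void dma_read_from_device(void *buf, uint64_t phys, size_t len)
{
    dcache_invalidate_range(buf, len);  /* drop stale/dirty cached lines  */
    start_dma(phys, len, 0);            /* device writes into main memory */
    wait_dma_done();
    /* The CPU can now crunch 'buf' through the cache at full speed. */
}

/* Memory -> device: push dirty lines out before the device reads them. */
void dma_write_to_device(void *buf, uint64_t phys, size_t len)
{
    dcache_flush_range(buf, len);       /* write back dirty cache lines   */
    start_dma(phys, len, 1);            /* device reads from main memory  */
    wait_dma_done();
}
```

As Brian notes below, a fully coherent (snooping) data cache makes this
kind of manual maintenance unnecessary.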


plus when several processes have a page mapped at addresses that don’t
fall into the same cache line? And what about things like mprotect()?

Very interesting question. On x86 and PPC it might not be a problem, but
for Broadcom’s BCM1250 it could be a “problem” due to the virtual
indexes/tags that Brian told us about. It is crystal clear that the
instruction cache has to be flushed on every context switch. It is clear
as mud to me whether the data cache has to be flushed too. Brian, could
you please make the BCM1250 data cache story ‘crystal clear’? :-)

Dmitri Poustovalov <pdmitri@bigfoot.com> wrote:

Very interesting question. On x86 and PPC it might not be a problem, but
for Broadcom’s BCM1250 it could be a “problem” due to the virtual
indexes/tags that Brian told us about. It is crystal clear that the
instruction cache has to be flushed on every context switch.

Actually not every switch - there’s some extra gear that avoids that.

It is clear as mud to me whether the data cache has to be flushed too.

It doesn’t.

Brian, could you please make the BCM1250 data cache story ‘crystal
clear’? :-)

It’s hard, since a lot of the Broadcom documentation is still under NDA
so I’m restricted in what I can say, but what’s public is that the data
cache implements the MESI protocol (Modified, Exclusive, Shared, Invalid
cache states), which means it’s a snooping cache (it watches what all the
bus masters do and maintains coherency between the cache and main memory).
That means that as long as all the bus masters follow the coherency protocol,
you don’t even have to have PROT_NOCACHE on memory involved in a DMA. The
cache will notice that a bus master wants the data and will perform a
“cache push” - writing the data back to main memory so that the bus
master sees the most up-to-date information.


Brian Stecher (bstecher@qnx.com) QNX Software Systems, Ltd.
phone: +1 (613) 591-0931 (voice) 175 Terence Matthews Cr.
+1 (613) 591-3579 (fax) Kanata, Ontario, Canada K2M 1W8

Brian Stecher wrote:

Actually not every switch - there’s some extra gear that avoids that.

You mean they have some sort of ‘tag’ that allows them to identify
whether a context switch changes the cache state from ‘invalid’ to
‘valid’?

How cute. Sounds like this chip might even show the benefits of the
virtual cache if supported properly. Thanks Brian!

– igor

Igor Kovalenko <kovalenko@attbi.com> wrote:

You mean they have some sort of ‘tag’ that allows them to identify
whether a context switch changes the cache state from ‘invalid’ to
‘valid’?

Something like that.
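
For readers curious what that kind of gear generally looks like, here is
the generic ASID (address-space ID) trick commonly used by MIPS-family
kernels. This is only an illustration of the technique, not a description
of what the BCM1250 or procnto actually does; the names and the 8-bit
width are made up:

```c
/* Generic sketch of ASID-based lazy flushing: each address space gets a
 * small hardware ID that tags its cache/TLB entries, so a full flush is
 * only needed when the ID pool wraps and IDs get reused -- not on every
 * context switch.  Widths and names are illustrative only.
 */
#include <stdint.h>

#define ASID_BITS     8
#define ASID_MASK     ((1u << ASID_BITS) - 1)
#define ASID_VERSION  (1u << ASID_BITS)        /* counts pool wrap-arounds */

extern void icache_invalidate_all(void);       /* hypothetical, as before */

static uint32_t asid_cache = ASID_VERSION;     /* last ID handed out */

struct address_space {
    uint32_t context;                          /* version bits | ASID */
};

/* Give 'as' a usable ASID, flushing only when the pool wraps. */
void get_mmu_context(struct address_space *as)
{
    uint32_t asid;

    /* Same version as the current pool: its ASID is still unique. */
    if ((as->context & ~ASID_MASK) == (asid_cache & ~ASID_MASK))
        return;

    asid = asid_cache + 1;
    if ((asid & ASID_MASK) == 0) {
        /* Pool exhausted: a recycled ID could now hit another owner's
         * virtually tagged lines, so flush once here instead of on
         * every switch. */
        icache_invalidate_all();
        asid += 1;                             /* keep ASID 0 out of use */
    }
    asid_cache = asid;
    as->context = asid;
}
```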

How cute. Sounds like this chip might even show the benefits of the
virtual cache if supported properly. Thanks Brian!

I think pretty well all SMP systems have snooping data caches. Certainly
ones that have L1 caches that are local to each CPU.

Brian Stecher (bstecher@qnx.com) QNX Software Systems, Ltd.
phone: +1 (613) 591-0931 (voice) 175 Terence Matthews Cr.
+1 (613) 591-3579 (fax) Kanata, Ontario, Canada K2M 1W8