To VMware users

I have been testing VMware 5.5 beta on an AMD X2-3800 (dual core). I’m
using QNX4 for the moment, and compile times are the same as on a P4 3 GHz
equipped with a 15k RPM SCSI drive… Quite impressive, if you ask me.

I even get a better experience by using Phindows to connect to the virtual
machine. I tried this setup on a single core a while ago, but
performance was really bad. Now, with dual core, Phindows feels faster than
when working directly in Photon in the virtual machine.

  • Mario

Mario Charest wrote:

machine. I tried this setup on a single core a while ago, but
performance was really bad. Now, with dual core, Phindows feels faster than
when working directly in Photon in the virtual machine.

I’ve come to the conclusion that, with multi-core setups or equivalent, the primary reason for massive performance boosts will be when multiple threads are RUNNING/READY and would thrash the caches with millisecond slicing.
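
To illustrate the sort of load I mean, here’s a rough sketch (my own toy code, nothing to do with QNX internals, and the buffer size is a guess you’d tune to the real cache): two always-READY threads, each streaming over its own buffer. If the two working sets together exceed the cache and the threads share one core, every timeslice starts mostly cold; give each thread its own core and each buffer stays warm.

/* Toy demo of the thrashing scenario: two always-READY threads, each
 * touching one byte per cache line of its own buffer, over and over.
 * On one core the two working sets evict each other every timeslice;
 * on two cores each buffer stays resident.  BUF_BYTES is an assumption. */
#include <pthread.h>
#include <stdlib.h>

#define BUF_BYTES (512 * 1024)   /* roughly one 512K L2's worth */
#define PASSES    2000

static volatile unsigned long sink;   /* keeps the loops from being optimized away */

static void *walker(void *arg)
{
    volatile unsigned char *buf = arg;
    unsigned long sum = 0;
    int pass;
    size_t i;

    for (pass = 0; pass < PASSES; pass++)
        for (i = 0; i < BUF_BYTES; i += 64)   /* one touch per cache line */
            sum += buf[i];

    sink = sum;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    unsigned char *a = calloc(1, BUF_BYTES);
    unsigned char *b = calloc(1, BUF_BYTES);

    pthread_create(&t1, NULL, walker, a);
    pthread_create(&t2, NULL, walker, b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    free(a);
    free(b);
    return 0;
}

Timed with a plain “time ./a.out” on a single-core box versus a dual core, the difference should show up directly in the wall-clock time, though how big it is obviously depends on the actual cache and memory speeds.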

One fix would be to have the kernel divide the caches up according to the number of RUNNING/READY contexts that are using the cache.


Evan

I’ve been following the developments of VM technology with much interest.
Clearly this is THE solution for the lack of support for new HW in OSes like
QNX4. And of course there are other obvious great advantages as well (as
you VMware users already know :wink:). The article about VMWorld on
www.tomshardware.com is very interesting as well. And last year at my
former employer, ANP, we even deployed it in production systems. Since
then the QNX4 VMs have been running without problems, as if they were on
real machines.

Obviously VMware has a clear advantage in experience compared to other
suppliers. But I’m banking on the open source alternatives, such as
XenSource, to get a foothold. Especially with the upcoming native support
of VM technology in the Intel and AMD chips. Not only as a (hopefully)
viable alternative, but also to force VMware to lower their prices :slight_smile:. I’m
hoping that with the native VT support in the processors, it’ll be easier
to implement true VMs without modified guest OSes, thereby opening up a
wide variety of virtualization solutions. I for one am convinced that
virtualization will play an important role in the future with the growing
demands for standardization and consolidation.

regards,
rick
(not an expert on the subject; just a curious techie :wink:)

Mario Charest postmaster@127.0.0.1 wrote:

I have been testing VMware 5.5 beta on an AMD X2-3800 (dual core). I’m
using QNX4 for the moment, and compile times are the same as on a P4 3 GHz
equipped with a 15k RPM SCSI drive… Quite impressive, if you ask me.

I even get a better experience by using Phindows to connect to the virtual
machine. I tried this setup on a single core a while ago, but
performance was really bad. Now, with dual core, Phindows feels faster than
when working directly in Photon in the virtual machine.

  • Mario

“Evan Hillas” <evanh@clear.net.nz> wrote in message
news:djt073$6ki$1@inn.qnx.com

Mario Charest wrote:
machine. I tried this setup on a single core a while ago, but
performance was really bad. Now, with dual core, Phindows feels faster
than when working directly in Photon in the virtual machine.


I’ve come to the conclusion that, with multi-core setups or equivalent,
the primary reason for massive performance boosts will be when multiple
threads are RUNNING/READY and would thrash the caches with millisecond
slicing.

One fix would be to have the kernel divide the caches up according to the
number of RUNNING/READY contexts that are using the cache.

The situation as you describe it (and as I understand it) is the same as
with single core. Switching threads/processes will result in cache thrashing
if the data being handled doesn’t fit in the cache. Of course you can get more
performance if you optimize your applications to make the best use of the
cache (usually by working on small chunks of data); the principles are the
same whether you are single- or multi-core/CPU.

I’m pretty sure having the kernel divide the cache up is not a very good
idea (I’m not even sure it’s possible), simply because the kernel doesn’t
know how much data is being handled by each thread; that would create a LOT
of unused cache space. Imagine the nightmare of changing the cache
allocation when threads are created/destroyed, oh boy.
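
Just to make the “small chunks” point concrete, here’s a rough sketch (my own toy example, not from any real code, and the chunk size is a guess you’d tune to the cache): do both processing steps on one cache-sized chunk before moving to the next, instead of streaming the whole array through the cache twice.

/* Sketch of "working on small chunks of data": apply both steps to one
 * cache-sized chunk while it is still resident, instead of making two
 * full passes that each re-stream the whole array.  CHUNK is an
 * assumed size, not a measured one. */
#include <stddef.h>
#include <stdlib.h>

#define CHUNK (64 * 1024)   /* elements per chunk, about 256K of floats */

static void scale(float *p, size_t n)  { size_t i; for (i = 0; i < n; i++) p[i] *= 2.0f; }
static void offset(float *p, size_t n) { size_t i; for (i = 0; i < n; i++) p[i] += 1.0f; }

/* Cache-unfriendly: two separate passes over the whole array. */
void process_two_passes(float *data, size_t n)
{
    scale(data, n);
    offset(data, n);
}

/* Cache-friendly: both steps per chunk while the chunk is still hot. */
void process_chunked(float *data, size_t n)
{
    size_t off;

    for (off = 0; off < n; off += CHUNK) {
        size_t len = (n - off < CHUNK) ? n - off : CHUNK;
        scale(data + off, len);
        offset(data + off, len);
    }
}

int main(void)
{
    size_t n = 8 * 1024 * 1024;             /* 32 MB worth of floats */
    float *data = calloc(n, sizeof *data);

    process_chunked(data, n);
    free(data);
    return 0;
}

Same total work either way; the chunked version just touches main memory roughly half as often.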


Mario Charest wrote:

“Evan Hillas” <evanh@clear.net.nz> wrote in message
One fix would be to have the kernel divide the caches up according to the
number of RUNNING/READY contexts that are using the cache.


The situation as you describe it (and as I understand it) is the same as
with single core. Switching threads/processes will result in cache thrashing
if the data being handled doesn’t fit in the cache. Of course you can get more

Yep, I was generally talking about single vs multi, where the single core is the one thrashing with two RUNNING threads while the dual core doesn’t thrash, because there are two memory blocks being chewed on at the same time and the load is therefore balanced across the caches. This is true no matter how big the caches are, or even if there is just a single cache shared between multiple cores.

Obviously, if you run four busy threads on a dual core then the same thing happens there too; it’s a simple but serious downside to slicing at the millisecond interval.


Evan

“Evan Hillas” <evanh@clear.net.nz> wrote in message
news:dk1ke2$hbv$1@inn.qnx.com

Mario Charest wrote:
“Evan Hillas” <evanh@clear.net.nz> wrote in message
One fix would be to have the kernel divide the caches up according to the
number of RUNNING/READY contexts that are using the cache.


The situation as you describe it (and as I understand it) is the same as
with single core. Switching threads/processes will result in cache
thrashing if the data being handled doesn’t fit in the cache. Of course you
can get more


Yep, I was generally talking about single vs multi, where the single core is the
one thrashing with two RUNNING threads while the dual core doesn’t
thrash, because there are two memory blocks being chewed on at the same time
and the load is therefore balanced across the caches. This is true no matter how
big the caches are, or even if there is just a single cache shared between multiple cores.

Obviously, if you run four busy threads on a dual core then the same thing
happens there too, it’s a simple but serious downside to slicing at the
millisecond interval.

QNX round robin is 50 ms; at today’s CPU speeds, a complete thrashing of the
cache 20 times per second is, I believe, rather insignificant.

However, if the cache thrashing occurs because of multiple context switches
caused by message passing, pulses, etc., that’s a different issue and doesn’t
have anything to do with slicing.


Mario Charest wrote:

“Evan Hillas” <evanh@clear.net.nz> wrote in message
Yep, I was generally talking about single vs multi, where the single core is the
one thrashing with two RUNNING threads while the dual core doesn’t
thrash, because there are two memory blocks being chewed on at the same time
and the load is therefore balanced across the caches. This is true no matter how
big the caches are, or even if there is just a single cache shared between multiple cores.

Obviously, if you run four busy threads on a dual core then the same thing
happens there too, it’s a simple but serious downside to slicing at the
millisecond interval.


QNX round robin is 50 ms; at today’s CPU speeds, a complete thrashing of the
cache 20 times per second is, I believe, rather insignificant.

I’ve learnt something :slight_smile:; I’ll admit I’d assumed the slicing would happen every time the scheduler was run…

—(Snipped from the QNX docs)—
SCHED_FIFO – a fixed-priority scheduler in which the highest priority,
ready thread runs until it blocks or is preempted by a higher priority
thread.
SCHED_RR – the same as SCHED_FIFO, except threads at the same priority
level timeslice (round robin) every 4 * the clock period
—(/Snip)—


So we have 4 ms slicing happening when two busy threads are evenly sharing the CPU, agreed? OK, given that, I figure, like you, that there is plenty of time to completely use all the caches on each timeslice.
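
As a sanity check, the actual quantum can be read back at run time rather than inferred from the docs; a minimal sketch (sched_rr_get_interval() is POSIX and present on QNX Neutrino at least; I’m not sure it exists under QNX4, so treat that as an assumption):

/* Ask the OS for the actual round-robin quantum instead of guessing.
 * sched_rr_get_interval() is POSIX; available on QNX Neutrino, but I
 * have not verified it under QNX4. */
#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec ts;

    if (sched_rr_get_interval(0, &ts) == -1) {   /* 0 = the calling process */
        perror("sched_rr_get_interval");
        return 1;
    }
    printf("RR timeslice: %ld.%09ld s\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}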

Next, how long does it take to fill all the caches? 1 ms, 100 us, 10 us? I really don’t know the answer but I’m guessing that it’s pretty much all stalling time from the processor’s pov.


Evan

“Evan Hillas” <evanh@clear.net.nz> wrote in message
news:dk4s3n$ru1$1@inn.qnx.com

Mario Charest wrote:
“Evan Hillas” <evanh@clear.net.nz> wrote in message
Yep, I was generally talking about single vs multi, where the single core is
the one thrashing with two RUNNING threads while the dual core doesn’t
thrash, because there are two memory blocks being chewed on at the same
time and the load is therefore balanced across the caches. This is true no matter
how big the caches are, or even if there is just a single cache shared between multiple cores.

Obviously, if you run four busy threads on a dual core then the same
thing happens there too, it’s a simple but serious downside to slicing at
the millisecond interval.


QNX round robin is 50 ms; at today’s CPU speeds, a complete thrashing of the
cache 20 times per second is, I believe, rather insignificant.


I’ve learnt something :slight_smile:; I’ll admit I’d assumed the slicing would happen
every time the scheduler was run…

—(Snipped from the QNX docs)—
SCHED_FIFO – a fixed-priority scheduler in which the highest priority,
ready thread runs until it blocks or is preempted by a higher priority
thread. SCHED_RR – the same as SCHED_FIFO, except threads at the same
priority level timeslice (round robin) every 4 * the clock period
—(/Snip)—


So we have 4 ms slicing happening when two busy threads are evenly sharing
the CPU, agreed? OK, given that, I figure, like you, that there is plenty
of time to completely use all the caches on each timeslice.

Next, how long does it take to fill all the caches? 1 ms, 100 us, 10 us?
I really don’t know the answer but I’m guessing that it’s pretty much all
stalling time from the processor’s pov.

AMD X64 has a max memory bandwidth of 6 GB per second (as measured with
SiSandra). My CPU has 512 KB of cache, but let’s assume a model with 1 MB; that
means about 160 us. But the cache doesn’t have to be fully refilled for the CPU
to start working; it just becomes starved for data, which I guess with a
6 GB/s burst memory transfer rate isn’t that bad :wink:
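
(A quick back-of-the-envelope check of that 160 us figure, taking the 6 GB/s as roughly 6 GiB/s sustained, which is an assumption on my part:)

/* Rough check of the ~160 us cache-refill estimate, assuming a
 * sustained ~6 GiB/s (an assumption, not a measured number). */
#include <stdio.h>

int main(void)
{
    double cache_bytes = 1024.0 * 1024.0;            /* 1 MB cache */
    double bytes_per_s = 6.0 * 1024 * 1024 * 1024;   /* ~6 GiB/s   */

    printf("full refill ~ %.0f us\n", cache_bytes / bytes_per_s * 1e6);
    return 0;   /* prints roughly 163 us */
}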

Most benchmarks of typical applications show very little gain going from 512 KB
to 1 MB of cache, at least in the desktop world.





Mario Charest wrote:

“Evan Hillas” <evanh@clear.net.nz> wrote in message
Next, how long does it take to fill all the caches? 1 ms, 100 us, 10 us?
I really don’t know the answer but I’m guessing that it’s pretty much all
stalling time from the processor’s pov.


AMD X64 has a max memory bandwidth of 6 GB per second (as measured with

That’s optimal.


SiSandra). My CPU has 512 KB of cache, but let’s assume a model with 1 MB; that
means about 160 us. But the cache doesn’t have to be fully refilled for the CPU
to start working; it just becomes starved for data, which I guess with a
6 GB/s burst memory transfer rate isn’t that bad :wink:

Again, I can only assume that there will be a reasonably consistent amount of stalling for a full refill. And when you get to GUI code, and the C++ STL in particular, they have a tendency to be memory and execution hogs.


Most benchmarks of typical applications show very little gain going from 512 KB
to 1 MB of cache, at least in the desktop world.

Most benchmarks are not looking for this behaviour. The ones that do run more than one task per core (and which, by the way, always note a decent hit in speed) tend to explain the problem away with other reasons.


Evan