KimmoK 
Understanding CPU and integer performance
Posted on 18-Feb-2014 11:24:44
#1

Understanding CPU and integer performance
(a memory refresher, written for myself and other dummies)

Integer performance depends on a lot of things…
Pipeline length/depth and how many cycles the core needs to perform an instruction.
(the 68000 needed several cycles per instruction, the 68060 managed one per cycle)

When CPUs became superscalar (starting with the 68060 and the Pentium), more than one instruction can enter processing and more than one can finish at a time. Since then it has been more meaningful to measure the "average" execution per cycle: a modern CPU can finish more than one instruction per clock cycle, even though a particular instruction may take 20 (or more) clock cycles to travel through the CPU.

Multiple integer execution units allow two or more instructions to enter processing (pipelines) in one cycle.
(with a wide data path, a CPU can fetch multiple instructions at once; so far up to 4? or even 8?)

Modern CPUs (since the 68060 and Pentium) decode each fetched instruction into micro-operations that can be handled in smaller parts and by more units.

Modern CPUs have multiple integer units that execute micro-operations (the Athlon has 3, the i7 has up to 8).

…as a result…
An i7 processes up to 8 integer instructions per clock cycle (as an average) per core.
A PA6T processes up to 2.2 integer instructions per clock cycle (as an average) per core.
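A minimal C sketch (my own hypothetical loops, not from any benchmark) of what "more than one instruction per cycle" means in practice: the four adds in the first function are independent, so a superscalar core can issue several of them in the same cycle, while the single accumulator in the second function forms a dependency chain that executes one add per cycle at best.

#include <stdint.h>

/* Four independent accumulators: a superscalar core can issue
   several of these adds in the same clock cycle.
   (Assumes n is a multiple of 4; tail handling omitted for brevity.) */
uint64_t sum_independent(const uint32_t *a, int n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}

/* One accumulator: every add depends on the previous one, so this
   chain limits throughput to one add per cycle no matter how many
   integer units the core has. */
uint64_t sum_chained(const uint32_t *a, int n)
{
    uint64_t s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}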

Often the result of an instruction depends on the results of previous instructions; in that situation processing has to stop and wait for the earlier result to come out of execution. (execution units stand idle unless given other work in the meanwhile)

Many instructions also need data to work on. If the data is not available in the CPU registers, it needs to be read from L1 cache, from L2, or from system RAM, and processing of the instruction stalls until the data becomes available. (execution units stand idle unless given other work in the meanwhile)

Branch prediction is like a weather forecast: the CPU tries to figure out which instructions depend on what, and schedules them accordingly into the different pipelines, trying to avoid pipeline stops/stalls. (Intel is very advanced at this)
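A small C illustration I have added (hypothetical function, not from the thread) of why the prediction matters: the branch below is the same code either way, but with sorted data it is taken in one long run and then never again, which a history-based predictor learns almost perfectly, while with random data it mispredicts roughly half the time and each miss costs a pipeline flush.

#include <stddef.h>

/* Counts elements below a threshold.
   - data sorted ascending: the predictor sees taken, taken, ...,
     not-taken, not-taken and is nearly always right.
   - data random: ~50% mispredictions, each costing a pipeline
     flush of roughly the pipeline depth in cycles. */
size_t count_below(const int *data, size_t n, int threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] < threshold)
            count++;
    }
    return count;
}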

Multithreading.
With a multithreaded CPU core, the units inside the core can be kept busy even while some instruction is waiting for something in a pipeline. In some cases this can look like a doubling of CPU performance.
A further improvement in modern cores is to add extra execution units where there is most often work that can proceed; this makes multithreading perform almost like true multiple cores. (AMD Bulldozer design, e6500, etc…)

Multicore.
Multicore CPUs are like separate CPUs glued together so that they can share data more easily and make more efficient use of CPU resources (like caches), and therefore perform better than separate CPUs would.

Multitasking...

Threaded apps...

Bandwidth...

Sw Code Optimization...

A little about FPU and SIMD importance vs performance...

HW integration in SoC...

Any major errors so far?
(I'm trying not to be too technical, but feel free to add any/all further info)
And I continue later...

WARNING: Some things I may have oversimplified to the point of becoming (dangerously) inaccurate.
(but I do not have the competence to add much more detail; it would generate more errors and most likely make the topic harder to understand for the average reader. But let's see...)
TODO: spellcheck

tlosm 
Re: Understanding CPU and integer performance
Posted on 18-Feb-2014 11:47:59
#2

@KimmoK

great explanation :)

fishy_fis 
Re: Understanding CPU and integer performance
Posted on 18-Feb-2014 16:59:59
#3

Mostly accurate, if oversimplified. One shouldn't underestimate things like the type of bus, interconnection methods, leakage, etc.
While it's clear this is supposed to be quite basic, try not to fall into the trap of simplifying things to the point of inaccuracy (as often happens when people try to explain technical things in a simple way).

KimmoK 
Re: Understanding CPU and integer performance
Posted on 19-Feb-2014 15:56:06
#4

@fishy_fis

Thanks for the input

@thread

A long time ago I learned that when a task switch occurs, the previous task is removed from execution: its data is taken from the CPU registers and put on the stack. Then the other task can run.

Question:
-With a modern CPU, is the whole pipeline emptied/reset when a task switch happens??

Initial answer:
(the pipeline is cleaned up)
(to me this would indicate that shorter-pipeline processors are more multitasking friendly etc., and that multicore/multithreading helps in this matter!)

matthey 
Re: Understanding CPU and integer performance
Posted on 20-Feb-2014 2:01:37
#5

Quote:

KimmoK wrote:
Understanding CPU and integer performance
(a memory refresher, written for myself and other dummies)

Integer performance depends on a lot of things…
Pipeline length/depth and how many cycles the core needs to perform an instruction.
(the 68000 needed several cycles per instruction, the 68060 managed one per cycle)

When CPUs became superscalar (starting with the 68060 and the Pentium), more than one instruction can enter processing and more than one can finish at a time. Since then it has been more meaningful to measure the "average" execution per cycle: a modern CPU can finish more than one instruction per clock cycle, even though a particular instruction may take 20 (or more) clock cycles to travel through the CPU.


Average instructions per cycle (IPC) retired is a common way to measure the parallel performance of a CPU, but it depends on the code being measured and usually doesn't isolate integer performance.

Quote:

Multiple integer execution units allow two or more instructions to enter processing (pipelines) in one cycle.
(with a wide data path, a CPU can fetch multiple instructions at once; so far up to 4? or even 8?)


The average instruction length in proportion to the instruction fetch is what is important here. This is where CISC processors with good code density have an advantage. The 68060 was bottlenecked by its small 32-bit instruction fetch (a decoupled instruction buffer helped), but it wasn't that bad, because the average 68k instruction length is ~3 bytes. Most RISC processors use 32-bit fixed-length encodings, so they couldn't even be superscalar fetching 32 bits per cycle like the 68060. The current 68k Apollo core fetches 10-12 bytes per cycle, which is up to 6 instructions but 3-4 on average. It fetches enough to make 3 parallel integer units worthwhile, and early instruction retirement with forwarding, as on the 68060, removes many dependencies, but the number of register read ports and especially the expensive write ports necessary would leave the 2nd and 3rd integer units weak (very limited in capabilities). I don't know of any pure superscalar designs that can execute 3 integer instructions per cycle.

Most modern processors reschedule instructions to avoid dependencies and reorder memory/cache reads and writes, in what is called Out-of-Order (OoO) execution. Strong OoO takes a lot of logic and is very complex, but the results can be amazing. Many superscalar designs use some OoO techniques at a much lower cost, though. The 68060 very likely has register port sharing between integer units (judging by timings), and the Apollo may get a similar integer unit swapping. This makes superscalar compiler scheduling of instructions much easier.

Note that OoO execution benefits most RISC processors more than CISC processors. This is because of the weak cache/memory performance of load/store architectures, which is still only partially fixed by OoO (and by loop unrolling, which wastes caches). This effect could be seen in our SortBench results:

http://www.apollo-core.com/sortbench/index.htm?page=benchmarks

Superscalar-only RISC processors did poorly, while the superscalar-only 68060, ColdFire and Atom did very well. Note that the results should scale up with processor speed, and that compilers and options don't make much difference to the results. This benchmark is just a simple bubble sort that can't easily be unrolled by compilers. Note that the CISC x86/x86_64 domination is helped somewhat by their adaptive branch prediction. CISC is very strong in single-core performance and in memory/caches, while needing less cache for good performance. Intel has made a lot of money off their little secret, as many RISC designs have failed to compete.
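For reference, a plain bubble sort in C; a sketch of the kind of loop SortBench measures, not the actual benchmark source. Each iteration loads, compares and conditionally swaps adjacent elements, so performance is dominated by load-use latency and branch behaviour rather than by anything a compiler can unroll away.

/* Plain bubble sort (illustration only, not the SortBench source).
   The compare depends on two loads and the swap depends on the
   compare, so the loop is limited by cache latency and branch
   prediction rather than by raw integer throughput. */
void bubble_sort(int *a, int n)
{
    for (int i = 0; i < n - 1; i++) {
        for (int j = 0; j < n - 1 - i; j++) {
            if (a[j] > a[j + 1]) {
                int tmp = a[j];
                a[j] = a[j + 1];
                a[j + 1] = tmp;
            }
        }
    }
}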

Quote:

Modern CPUs (since the 68060 and Pentium) decode each fetched instruction into micro-operations that can be handled in smaller parts and by more units.

Modern CPUs have multiple integer units that execute micro-operations (the Athlon has 3, the i7 has up to 8).


I believe this is mostly true, although RISC designs are already simplified into small operations. I read about a superscalar Atom where they decided to do less micro-oping. I can only guess that the process requires a lot of logic, as the Atom is designed to be energy efficient compared to the full x86. I believe it's possible to have good CISC designs without micro-ops. The 68060 and Apollo do not use micro-ops, but maybe the 68k doesn't need them as much. Perhaps without micro-ops, OoO, and a difficult-to-decode instruction encoding like the x86/x86_64, it would be possible to get decent energy efficiency and still have good performance.

Quote:

…as a result…
An i7 processes up to 8 integer instructions per clock cycle (as an average) per core.
A PA6T processes up to 2.2 integer instructions per clock cycle (as an average) per core.


I don't believe the i7 number. I doubt it would be possible to retire that many integer instructions per cycle per core even at peak. I don't think it's possible to average more than 4 integer instructions retired per cycle per core in the integer units. Maybe they are counting SIMD integer performance, and this is a peak number?

Edit: Found some numbers for the i7. Each core can fetch, dispatch, execute and retire up to 4 instructions per cycle. The ancient 68060 in comparison can execute and retire up to 3 instructions per cycle :P.

The PA6T number looks more realistic, but I would think it is the IPC for all instructions and not just integer instructions per cycle per core. I would be very happy with retiring an average of 2.2 integer instructions per cycle per core.

It's very easy to end up comparing apples to oranges. I see the 8 for an i7 listed here:

http://en.wikipedia.org/wiki/Instructions_per_second

I wonder if the numbers are simply mixed up.

Quote:

Often the result of an instruction depends on the results of previous instructions; in that situation processing has to stop and wait for the earlier result to come out of execution. (execution units stand idle unless given other work in the meanwhile)

Many instructions also need data to work on. If the data is not available in the CPU registers, it needs to be read from L1 cache, from L2, or from system RAM, and processing of the instruction stalls until the data becomes available. (execution units stand idle unless given other work in the meanwhile)

The CPU units don't always stop. Multi-threading, hardware scout, or speculative execution can sometimes continue past dependencies. Sometimes bubbles are inserted and the unit does nothing, but usually not for very long.

Quote:

Branch prediction is like a weather forecast: the CPU tries to figure out which instructions depend on what, and schedules them accordingly into the different pipelines, trying to avoid pipeline stops/stalls. (Intel is very advanced at this)


Honestly, this isn't very well explained. Most branches have 2 paths that can be taken based on a condition, like an if-then statement. The result of the condition is not known early in the pipeline, so the processor can either wait doing nothing or choose a path to execute speculatively. The best path is predicted and that path is taken by the CPU.

Static prediction is the most basic kind; the most common scheme is to predict backward branches as taken (commonly loops) and forward branches as not taken. Most modern processors instead have a branch prediction unit that stores a history of prediction accuracy to improve predictions. The most common types are 2-bit saturating, which has 4 states for how likely the branch is to be taken, and adaptive, which recognizes patterns and is more accurate but requires more resources. If the incorrect path of a branch is taken, the processor must flush the pipeline and reload from the branch position. This creates bubbles for as many cycles as the pipeline is long. Longer pipelines need better branch prediction but generally perform better, at the cost of more logic. Processor pipelines and latencies should probably be another topic.
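A 2-bit saturating counter is small enough to sketch directly. This is a toy C model I've added for illustration, not any particular CPU's implementation: the counter moves one step toward "taken" or "not taken" on each outcome, so a single unusual branch outcome doesn't flip the prediction.

/* Toy model of one 2-bit saturating branch predictor entry
   (illustration only, not any specific CPU's implementation).
   States: 0 = strongly not-taken, 1 = weakly not-taken,
           2 = weakly taken,       3 = strongly taken. */
typedef struct { unsigned state; } bp_entry;

int predict_taken(const bp_entry *e)
{
    return e->state >= 2;               /* predict taken in states 2 and 3 */
}

void bp_update(bp_entry *e, int was_taken)
{
    if (was_taken) {
        if (e->state < 3) e->state++;   /* saturate at strongly taken */
    } else {
        if (e->state > 0) e->state--;   /* saturate at strongly not-taken */
    }
}

Because the counter saturates, a loop branch that has been taken many times and then falls through once only drops from "strongly" to "weakly taken", so the next run of the loop is still predicted correctly.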

Quote:

Multithreading.
With a multithreaded CPU core, the units inside the core can be kept busy even while some instruction is waiting for something in a pipeline. In some cases this can look like a doubling of CPU performance.
A further improvement in modern cores is to add extra execution units where there is most often work that can proceed; this makes multithreading perform almost like true multiple cores. (AMD Bulldozer design, e6500, etc…)


I haven't ever seen double the performance from multi-threading. I thought +50% was very good, but it does depend on the design, as you have pointed out. I would take out the "small micro" and "micro" above. Most of the time the units are waiting on memory reads.

Quote:

Multicore.
Multicore CPUs are like separate CPUs glued together so that they can share data more easily and make more efficient use of CPU resources (like caches), and therefore perform better than separate CPUs would.


I believe multiple cores generally have separate L1 caches but commonly share the L2 cache. The resource cost of another core is higher than that of multi-threading, but the performance is better. Each core can typically do only one cache/memory read and one write per cycle. It's very unlikely to see CISC code with 8 integer instructions in a row that don't access memory; that's why I doubt the 8 IPC number for the i7. I did hear that Intel was working on dual-ported memory that would allow 2 reads and writes, but I still doubt 8 integer instructions could be retired per cycle per core.

Quote:

Any major errors so far?
(I'm trying not to be too technical, but feel free to add any/all further info)
And I continue later...


I'm no expert either, but I'll give you a C+ based on what I know. This is a good learning exercise for you. Now maybe a real expert will show up and make both of us look like preschoolers ;).

Quote:

Question:
-With a modern CPU, is the whole pipeline emptied/reset when a task switch happens??


The interrupt for the task switch usually does. The switch isn't predicted, so the pipeline has to be flushed when it changes to a different path of code execution unexpectedly. There is also the overhead of pushing all the user-space registers and state information onto the stack. Using virtual (memory) addresses can cause some caches to be flushed as well. Now you are starting to talk about some real overhead.

Quote:

Initial answer:
(the pipeline is cleaned up)
(to me this would indicate that shorter-pipeline processors are more multitasking friendly etc., and that multicore/multithreading helps in this matter!)


Shorter-pipeline processors are more responsive, but there is a lot more overhead involved than just the pipeline flush. The 68k Amiga chose to have system calls and libraries in user space to avoid overhead. 68k MacOS and, I believe, the Atari ST used A-line traps (interrupts), which had a lot more overhead, but there were advantages to having their own protected supervisor space: hardware registers and accesses could be hidden from user space that way. Fast and free, or secure and slow; choose your poison.

KimmoK 
Re: Understanding CPU and integer performance
Posted on 20-Feb-2014 8:03:31
#6

@matthey

WOW! Thanks for the input!
(and I think I understood almost every word in one read, really)

It clearly shows how hugely complex modern CPUs are.
(and they are a major compromise in everything, and clearly the "CPU" is not "done" yet)

No wonder some modern newbies might have a hard time understanding this without even knowing the basics from the 80's...

olegil 
Re: Understanding CPU and integer performance
Posted on 20-Feb-2014 12:01:41
#7

@matthey

e6500 single thread: 3.3 instructions per clock. e6500 SMT: 6.0. But that's a WILDLY different SMT approach from the original P4 hyperthreading. I expect the i7 to be in the same ballpark.

I don't get how 4 fetches per clock can give a DMIPS rating of 9.43 (i7-2600K), though. There IS some trickery there. With macrofusion (two assembly instructions end up as a single micro-operation) and microfusion (two micro-ops end up as a single micro-op), you can obviously get further.

But in that scenario, you'll have to be VERY careful to generate exactly the sequences the core can optimize; otherwise practical throughput drops off a lot.
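To make the fusion point concrete: on recent Intel cores a compare that directly feeds a conditional branch can be fused into a single micro-op. A hypothetical C fragment I've added, with the typical (simplified) generated pair in the comment:

/* A compare feeding a conditional branch. This typically compiles
   to a cmp+jcc pair, e.g. (simplified x86):
       cmp  edi, esi
       jge  .Lskip
   On recent Intel cores that adjacent pair can be macrofused into
   one micro-op; if the compiler separates the compare from the
   branch, the pair can no longer fuse and throughput drops. */
int clamp_low(int x, int lo)
{
    if (x < lo)         /* cmp + jcc: a fusable pair */
        return lo;
    return x;
}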

POWER arches don't do micro-ops; instead, this optimization would be done at compile time. So for practical work I would venture a guess that the gap is not as big as it seems.

matthey 
Re: Understanding CPU and integer performance
Posted on 20-Feb-2014 15:17:02
#8

Quote:

KimmoK wrote:
It clearly shows how hugely complex modern CPUs are.
(and they are a major compromise in everything, and clearly the "CPU" is not "done" yet)


The CPU will continue to be the brain of computers for many years. More internal cores and external parallel processors will be added, but there isn't much more that can be done per core. x86_64 will be the performance leader for many years; ARM64 won't be able to touch it in performance any more than PowerPC could. Nobody wants to make a cleaner (in design and ISA), easier-to-decode CISC with better code density that would have an advantage against x86_64. Even with a superior design, it would be very difficult to compete with the head start, refined technology, constant die shrinks and cash generated from economies of scale that the Intel/M$ duopoly enjoys.

Quote:

No wonder some modern newbies might have a hard time understanding this without even knowing the basics from the 80's...


Even the geek kids who have all the latest CPU benchmarks memorized generally don't have a clue how CPUs work or how to program them. For all the technology, we are in a computing dark age. It's really sad.

matthey 
Re: Understanding CPU and integer performance
Posted on 20-Feb-2014 16:36:03
#9

Quote:

olegil wrote:
@matthey

e6500 single thread: 3.3 instructions per clock. e6500 SMT: 6.0. But that's a WILDLY different SMT approach from the original P4 hyperthreading. I expect the i7 to be in the same ballpark.


Averaging 3.3 IPC per core is very strong. If that were true, I would expect it to be competitive with the i7. The i7 does have 2 threads per core, which should help its average IPC. I couldn't find the manufacturer's average IPC claim for the i7, though.

Quote:

I don't get how 4 fetches per clock can give a DMIPS rating of 9.43 (i7-2600K), though. There IS some trickery there. With macrofusion (two assembly instructions end up as a single micro-operation) and microfusion (two micro-ops end up as a single micro-op), you can obviously get further.


It doesn't add up. Micro-op and/or instruction folding/fusing isn't going to give that kind of gain. Counting SIMD instructions in a creative way might. Does each SIMD instruction count as x2 or x4 when doing parallel operations? I don't think that's the way average IPC is calculated.

Quote:

But in that scenario, you'll have to be VERY careful to generate exactly the sequences the core can optimize; otherwise practical throughput drops off a lot.

POWER arches don't do micro-ops; instead, this optimization would be done at compile time. So for practical work I would venture a guess that the gap is not as big as it seems.


Herein lie the flaws of RISC thinking.

1) High-level programmers don't know what the hardware needs to do, and compilers can't figure out what high-level programmers want. Most modern high-level languages are inadequate for producing quality code from the compiler. Hardware hints are possible in some high-level languages with advanced compilers like GCC, but they are rarely used; instead the compiler makes assumptions, or produces a bunch of code to avoid making assumptions. PowerPC was counting on compilers to generate optimal code so the complexity could be taken out of the CPU, but we can see how that has worked out for PowerPC.

2) Low-level programmers are no longer needed, so there is no need to make the assembler code readable anymore. Low-level learning becomes tedious and less interesting, so people don't learn it. Most programmers end up using high-level languages without an understanding of the low level that needs to be manually tweaked correctly. They generally can't even read the disassembled code in their debugger.

The work that PowerPC can do is not that far behind, because of diminishing returns in the parallel execution of a sequential stream of instructions. PowerPC can be competitive with optimal "perfect" code, but that is rare instead of common. The fact that most PowerPC processors have moved to OoO execution is proof that compilers failed to schedule instructions adequately for superscalar. I believe some modern PowerPC processors have stream detection, despite there being no need for it if compilers generated the correct manual prefetch instructions. ARM64 is one of the few RISC processors to decide that unaligned memory access support was a good idea, but aren't compilers supposed to avoid that too? All of the RISC compiler-avoidance code ends up eating up the caches that are already large due to poor code density.
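For what it's worth, GCC does expose this kind of manual hint. A hedged sketch using two real GCC builtins, __builtin_prefetch and __builtin_expect; the prefetch distance of 16 elements is a made-up tuning value, not a recommendation:

/* Manual hints of the kind compilers rarely emit on their own.
   __builtin_prefetch(addr, rw, locality): rw 0 = read,
   locality 1 = low temporal locality. Prefetching past the end of
   the array is harmless: a prefetch is only a hint and never faults. */
long sum_with_hints(const long *a, long n)
{
    long s = 0;
    for (long i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 16], 0, 1);   /* fetch ahead of use */
        if (__builtin_expect(a[i] < 0, 0))      /* mark this as the cold path */
            continue;                           /* skip negative values */
        s += a[i];
    }
    return s;
}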

olegil 
Re: Understanding CPU and integer performance
Posted on 21-Feb-2014 7:51:13
#10

@matthey

SIMD does not count in CoreMark or Dhrystone benchmarks.

And Dhrystone MIPS is a VERY theoretical number; with a 14+ stage pipeline I would be very surprised if an i7 could actually sustain throughput anywhere CLOSE to that.

My gut feeling:
x86 does a little more with each instruction, as it doesn't need separate load and store instructions, but PPC has more registers (by far).

Example follows.
PPC:
load
load
do stuff
do stuff
do stuff
do stuff
store
store

x86:
load/do stuff
push
load/do stuff
push
pop
do stuff/store
pop
do stuff/store

Conclusion: The size of the register file on PPC is what it is because it mostly negates the downside of having to do the loads and stores, avoiding the use of push and pop (which do have a small chance of a cache miss, with completely disastrous results).
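A small C illustration I've added (hypothetical function) of the register-pressure point: with eight accumulators plus loop bookkeeping live at once, a 32-register PPC keeps everything in registers, while classic 8-register 32-bit x86 has to spill some of them to the stack, which is exactly where the push/pop pairs above come from.

/* Eight accumulators live across the whole loop, plus the pointer
   and counters. With 32 GPRs (PPC) everything stays in registers;
   with the ~7 usable registers of 32-bit x86 the compiler must
   spill some values to the stack.
   (Assumes n is a multiple of 8; tail handling omitted.) */
void partial_sums(const int *a, int n, int out[8])
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
    for (int i = 0; i < n; i += 8) {
        s0 += a[i];     s1 += a[i + 1];
        s2 += a[i + 2]; s3 += a[i + 3];
        s4 += a[i + 4]; s5 += a[i + 5];
        s6 += a[i + 6]; s7 += a[i + 7];
    }
    out[0] = s0; out[1] = s1; out[2] = s2; out[3] = s3;
    out[4] = s4; out[5] = s5; out[6] = s6; out[7] = s7;
}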

Also:
PPC has a shorter pipeline.
PPC has a MUCH better SIMD.
x86 has higher clock rates.

The number one priority for me would be to get Altivec back in use. The e6500 is the way forward here.

I'm actually VERY surprised that Freescale doesn't publish any benchmarks showing just how good Altivec really is.

KimmoK 
Re: Understanding CPU and integer performance
Posted on 21-Feb-2014 8:16:33
#11

@thread

Made me think that it's important that our community gets the latest GCC ported ...
(silly that we are so far behind in utilizing the PPC ISA fully)

olegil 
Re: Understanding CPU and integer performance
Posted on 21-Feb-2014 8:46:41
#12

@KimmoK

What sort of performance gain are we missing out on?

However, I feel that things like gcc/binutils should have their own team of dedicated porters, and if no one wants the job then maybe we just need to accept that we are doomed.

Porting GCC will ALWAYS be much more about AmigaOS specifics than about CPU specifics, as the CPU manufacturer will already have taken care of the Linux version for any new CPU. As long as the "CPU manufacturer" is one of APM, IBM and Freescale, at least.

olegil 
Re: Understanding CPU and integer performance
Posted on 21-Feb-2014 9:03:45
#13

@olegil

I just did a disassembly of something I use now and then. I'm used to seeing the same sort of code snippets in my 8-bit AVR code, but after seeing the x86 version I have only three words to conclude the discussion:

TOO MANY MOV'S!

Random code snippet:
x86


8048584: 83 7c 24 1c 00 cmpl $0x0,0x1c(%esp)
8048589: 74 7c je 8048607
804858b: 8b 44 24 2c mov 0x2c(%esp),%eax
804858f: 8d 48 0a lea 0xa(%eax),%ecx
8048592: ba 81 80 80 80 mov $0x80808081,%edx
8048597: 89 c8 mov %ecx,%eax
8048599: f7 ea imul %edx
804859b: 8d 04 0a lea (%edx,%ecx,1),%eax
804859e: 89 c2 mov %eax,%edx
80485a0: c1 fa 07 sar $0x7,%edx
80485a3: 89 c8 mov %ecx,%eax
80485a5: c1 f8 1f sar $0x1f,%eax
80485a8: 89 d3 mov %edx,%ebx
80485aa: 29 c3 sub %eax,%ebx
80485ac: 89 d8 mov %ebx,%eax
80485ae: 89 44 24 2c mov %eax,0x2c(%esp)
80485b2: 8b 54 24 2c mov 0x2c(%esp),%edx
80485b6: 89 d0 mov %edx,%eax
80485b8: c1 e0 08 shl $0x8,%eax
80485bb: 29 d0 sub %edx,%eax
80485bd: 89 ca mov %ecx,%edx
80485bf: 29 c2 sub %eax,%edx
80485c1: 89 d0 mov %edx,%eax
80485c3: 89 44 24 2c mov %eax,0x2c(%esp)
80485c7: 8b 44 24 2c mov 0x2c(%esp),%eax
80485cb: 8b 54 24 28 mov 0x28(%esp),%edx
80485cf: 8d 0c 02 lea (%edx,%eax,1),%ecx
80485d2: ba 81 80 80 80 mov $0x80808081,%edx
80485d7: 89 c8 mov %ecx,%eax
80485d9: f7 ea imul %edx
80485db: 8d 04 0a lea (%edx,%ecx,1),%eax
80485de: 89 c2 mov %eax,%edx
80485e0: c1 fa 07 sar $0x7,%edx
80485e3: 89 c8 mov %ecx,%eax
80485e5: c1 f8 1f sar $0x1f,%eax
80485e8: 89 d3 mov %edx,%ebx
80485ea: 29 c3 sub %eax,%ebx
80485ec: 89 d8 mov %ebx,%eax
80485ee: 89 44 24 28 mov %eax,0x28(%esp)
80485f2: 8b 54 24 28 mov 0x28(%esp),%edx

e300c4

10000620: 2f 80 00 00 cmpwi cr7,r0,0
10000624: 41 9e 00 78 beq- cr7,1000069c
10000628: 81 3f 00 18 lwz r9,24(r31)
1000062c: 39 69 00 0a addi r11,r9,10
10000630: 3c 00 80 80 lis r0,-32640
10000634: 60 00 80 81 ori r0,r0,32897
10000638: 7c 0b 00 96 mulhw r0,r11,r0
1000063c: 7c 00 5a 14 add r0,r0,r11
10000640: 7c 09 3e 70 srawi r9,r0,7
10000644: 7d 60 fe 70 srawi r0,r11,31
10000648: 7d 20 48 50 subf r9,r0,r9
1000064c: 7d 20 4b 78 mr r0,r9
10000650: 54 00 40 2e rlwinm r0,r0,8,0,23
10000654: 7c 09 00 50 subf r0,r9,r0
10000658: 7c 00 58 50 subf r0,r0,r11
1000065c: 90 1f 00 18 stw r0,24(r31)
10000660: 81 3f 00 14 lwz r9,20(r31)
10000664: 80 1f 00 18 lwz r0,24(r31)
10000668: 7d 69 02 14 add r11,r9,r0
1000066c: 3c 00 80 80 lis r0,-32640
10000670: 60 00 80 81 ori r0,r0,32897
10000674: 7c 0b 00 96 mulhw r0,r11,r0
10000678: 7c 00 5a 14 add r0,r0,r11
1000067c: 7c 09 3e 70 srawi r9,r0,7
10000680: 7d 60 fe 70 srawi r0,r11,31
10000684: 7d 20 48 50 subf r9,r0,r9
10000688: 7d 20 4b 78 mr r0,r9
1000068c: 54 00 40 2e rlwinm r0,r0,8,0,23
10000690: 7c 09 00 50 subf r0,r9,r0
10000694: 7c 00 58 50 subf r0,r0,r11
10000698: 90 1f 00 14 stw r0,20(r31)
1000069c: 81 3f 00 18 lwz r9,24(r31)
100006a0: 39 69 00 2a addi r11,r9,42
100006a4: 3c 00 80 80 lis r0,-32640
100006a8: 60 00 80 81 ori r0,r0,32897
100006ac: 7c 0b 00 96 mulhw r0,r11,r0
100006b0: 7c 00 5a 14 add r0,r0,r11
100006b4: 7c 09 3e 70 srawi r9,r0,7
100006b8: 7d 60 fe 70 srawi r0,r11,31
100006bc: 7d 20 48 50 subf r9,r0,r9
100006c0: 7d 20 4b 78 mr r0,r9

Now, there's an x86 advantage here: 8- and 16-bit instructions mean less code. But mov mov mov mov mov mov is not doing math.

The whole main section is about 10% bigger on PPC than on x86, while the full file is 35% larger. There's something very stupid about the ABI here; gcc is generating a bunch of call_ functions that I would think were not needed.

But of course, it's also a 4.1.78 vs a 4.6.3, so it's not ENTIRELY fair.

KimmoK 
Re: Understanding CPU and integer performance
Posted on 21-Feb-2014 9:17:03
#14

@olegil

>What sort of performance gain are we missing out on?

Not sure.
But to me it seems that the PA6T does not perform to its maximum with AOS4 SW.

(a long time ago I saw a 2x speed increase in some Barefeats benchmark for the G5 when the code was optimized for it, vs a standard G4 compile)


Another weekend todo for me ... try to get an external community guru to look at gcc for PPC...
(a computer geek who has code optimization as a hobby, on x86, PPC, MIPS and ARM)

@someone with an X1000

A simple benchmark please:
-compile benchmark code with the older gcc (the version that exists for AOS4) and with the latest PPC Linux gcc
-run both on Linux and see if there is a difference

(the bytemark code might be available for that test???)


@olegil
Lovely code snippets.

olegil 
Re: Understanding CPU and integer performance
Posted on 21-Feb-2014 9:32:24
#15

@olegil

Mull it over a bit:
each instruction on x86 takes on average 20-30% less space, yet main() is only about 10% smaller. This means that the RISC vs CISC arguments really need to stop, at least in the x86 vs PPC context.

Also, I had forgotten to use -O2, which had a HUGE effect on the PPC binary and little to no effect on the x86 binary. -Os reduced the size advantage further; I'm now at 1821 vs 2430. That's about the difference in instruction size, which means that the code I'm running seems to map to an identical number of instructions on x86 and PPC.

A nice little study; now back to work, doing more or less exactly the same thing, just on a different arch.

matthey 
Re: Understanding CPU and integer performance
Posted on 21-Feb-2014 19:31:30
#16

Quote:

olegil wrote:
SIMD does not count in CoreMark or Dhrystone benchmarks.


I didn't think the SIMD was used for benchmarks very much, but note that x86/x86_64 compilers don't use the FPU, only the SIMD unit, for floating point.

Quote:

My gut feeling:
x86 does a little more with each instruction, as it doesn't need separate load and store instructions, but PPC has more registers (by far).

Example follows.
PPC:
load
load
do stuff
do stuff
do stuff
do stuff
store
store

x86:
load/do stuff
push
load/do stuff
push
pop
do stuff/store
pop
do stuff/store

Conclusion: The size of the register file on PPC is what it is because it mostly negates the downside of having to do the loads and stores, avoiding the use of push and pop (which do have a small chance of a cache miss, with completely disastrous results).


Modern x86/x86_64 now has 8 or 16 mostly general-purpose registers. It also has a modern ABI that passes most function arguments in registers, so the stack isn't used as much as it used to be. There are quirks specific to certain registers or series of registers which still require extra MOV instructions sometimes. x86/x86_64 does work in cache/memory more than RISC: one x86 study showed 57% of instructions accessing memory (my 68k stats showed about 45%). The big problem with this is that only one cache/memory read and one write are possible per cycle per core. That means the compiler and/or CPU has to schedule the memory-accessing instructions for superscalar (parallel) operation, and the more memory accesses there are, the more difficult this is (instruction dependencies and other things need to be scheduled for too).

Initially this looks like a RISC advantage, but many CISC instructions are read+write instructions like ADDQ.L #1,(An). RISC needs 2 memory accesses and at least 3 instructions (12 bytes of code vs 2 for the 68k) to do the same, and the 68k is faster (it starts sooner). PowerPC compilers may need to add code to check that an access is aligned if they can't determine it 100%. For one access the compiler probably wouldn't add any prefetch instructions, but what if we want to add 1 to a series of addresses in memory? A PowerPC compiler would start by aligning the memory access if necessary and generating a prefetch instruction. Next it would unroll the loop, much as your example shows above, except the "do stuff" would just be adding 1 a bunch of times in different registers. The problem is that the LOAD, the ADD and the STORE are dependent, so the RISC CPU needs at least 3 cycles for them. Additionally, a read from L1 cache has a load-use latency, meaning a bubble of 1 or 2 cycles between the LOAD and the ADD (the 68060 and Apollo completely hide the L1 cache latency, so the compiler doesn't have to worry about it and no ICache is wasted unrolling the loop).

The 68060/Apollo optimal loop is 2-4 instructions, using about 6 bytes of ICache and accessing cache/memory once per loop iteration. The PowerPC optimal loop, assuming no alignment checks or prefetching and an x4 unroll, would need 14 instructions and use about 56 bytes of ICache while accessing cache/memory twice per loop. RISC needs more registers, and needs to load data into registers where it can do many calculations, to be efficient. A good CISC design is very strong and fast in caches/memory: not only can it do read+write cache/memory accesses, it can do an operation to or from cache/memory, which is fast and further conserves ICache.
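The read-modify-write loop from the paragraph above, written out in C. The mnemonics in the comment are real 68k and PowerPC instructions, but the sequences are simplified illustrations I've added, not compiler output:

/* Add 1 to every element: the read+write loop discussed above.
   On the 68k the loop body can be a single read-modify-write
   instruction,
       addq.l  #1,(a0)+
   while a load/store RISC needs three dependent instructions,
       lwz   r4,0(r3)
       addi  r4,r4,1
       stw   r4,0(r3)
   plus the load-use latency between the load and the add.
   (Sequences simplified for illustration.) */
void inc_all(long *p, long n)
{
    for (long i = 0; i < n; i++)
        p[i] += 1;
}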

Quote:

Also:
PPC has a shorter pipeline.
PPC has a MUCH better SIMD.
x86 has higher clock rates.


Too short a pipeline can be bad: it needs to be long enough to give free decrement-and-branch on loops and to hide cache accesses. Too long a pipeline gives higher clock speeds but takes extra logic (less power efficient) and has large branch penalties. There is a sweet spot in the middle, and it varies by CPU design.

Altivec is a nice SIMD.

Quote:

The number one priority for me would be to get Altivec back in use. The e6500 is the way forward here.

I'm actually VERY surprised that Freescale doesn't publish any benchmarks showing just how good Altivec really is.


Apple (before abandoning PowerPC) and Motorola used to do benchmarks and try to get Quake running faster than on x86. I recall promises of how the next PowerPC would beat x86 and how superior it was in technology (like Altivec), but the PowerPC processors always seemed to be late and behind in performance. Then x86 won, Apple switched, and PowerPC left the desktop/laptop market.

itix 
Re: Understanding CPU and integer performance
Posted on 21-Feb-2014 20:32:35
#17

@matthey

Quote:

Modern x86/x86_64 now has 8 or 16 mostly general-purpose registers. It also has a modern ABI that passes most function arguments in registers, so the stack isn't used as much as it used to be.


In my experience 32 registers are not as useful as one might think. You can't just use a register straight away: you must save the register to the stack, load the new value into it, and restore the register at the end of the function. If you need that register only once, this is inefficient compared to a read+write instruction.

The 15 registers on the 68k (A7 was the SP) were often enough, except that they were not general-purpose registers and you could run out of address registers.

Quote:

PowerPC compilers may need to add code to check that an access is aligned if they can't determine it 100%. For one access the compiler probably wouldn't add any prefetch instructions, but what if we want to add 1 to a series of addresses in memory? A PowerPC compiler would start by aligning the memory access if necessary and generating a prefetch instruction. Next it would unroll the loop, much as your example shows above, except the "do stuff" would just be adding 1 a bunch of times in different registers.


Compilers don't do that. They assume the developer has properly aligned memory accesses. Nor have I seen compilers generate prefetch instructions...

WolfToTheMoon 
Re: Understanding CPU and integer performance
Posted on 21-Feb-2014 22:04:32
#18

@olegil

Quote:
PPC has a MUCH better SIMD


Compared to what, MMX? Sure.
But x64 now uses SSE and AVX.

minator 
Re: Understanding CPU and integer performance
Posted on 21-Feb-2014 22:08:42
#19

CISC was designed when memory was very expensive, so instructions were kept small to save money, but that made the processors complex.

RISC processors were thought up when memory was cheaper and being simple meant you could go faster. Being simpler also meant they were cheaper to make: compare the die of a G5 to a contemporary x86 and you'll find it's substantially smaller.

The advantage of CISC is smaller instructions, but... he who giveth also taketh away.
CISC instructions have to be taken apart, and that takes a load of logic. In x86 they are converted into micro-ops, which are basically RISC instructions (x86 hasn't been proper CISC for a long time).
All of this also has to be tracked through the pipeline, so while CISC saves room in the cache, it has a cost in area and power. Area = $$$, and these days chip design is all about power.

At the high end the only company doing CISC now is IBM, and that's only in their mainframes. x86 is more of a hybrid and has used RISC internally since the 90s. Other than those and some microcontrollers, RISC has pretty much won.

BTW, you can also do multi-length instructions in RISC: e.g. 32-bit ARMs can execute Thumb instructions, which are 16 bits long. Even code size is not a CISC advantage.

WolfToTheMoon 
Re: Understanding CPU and integer performance
Posted on 21-Feb-2014 22:19:03
#20

@minator

Quote:

Compare the die of a G5 to a contemporary x86 and you'll find it's substantially smaller.


It's also substantially slower and hotter.

Compared to the P4, the die size is about the same. Same goes for the G4 vs the PIII.
