Forum Index / General Technology (No Console Threads) / Understanding CPU and integer performance
cdimauro 
Re: Understanding CPU and integer performance
Posted on 2-Mar-2014 8:07:47
#41
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

Quote:

itix wrote:
@cdimauro

Quote:

I'm not a PowerPC expert, but I suppose that they have a specific load instruction which is slower than the usual (aligned) one, right?


I recall reading something like that long ago, but I can't find anything like it in the PowerPC instruction set.

Certain special load/store instructions work only with word alignment, for example lwarx/stwcx., which are used to implement atomic operations. When atomic 8-bit/16-bit operations are needed, code must test the alignment and implement paths for all possible alignment types.

I had some time, so I did a check. PowerPCs handle misaligned loads and stores automatically.

It's strange for a RISC, but we know that PowerPC is one of the most complex RISC families.
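The workaround itix describes for sub-word atomics — doing the update through the aligned word that lwarx/stwcx. can reserve — can be sketched in portable C. This is my own illustration, not code from the thread: the GCC/Clang __atomic builtins stand in for the lwarx/stwcx. retry loop, and the byte-lane arithmetic assumes a little-endian host.

```c
#include <stdint.h>
#include <assert.h>

/* Sketch: emulate an atomic 8-bit store on a CPU whose atomic
 * primitive only works on aligned 32-bit words (as with PowerPC
 * lwarx/stwcx.). We CAS the whole containing word, replacing only
 * the target byte. Little-endian byte-lane math is assumed. */
static void atomic_store_u8(uint8_t *p, uint8_t val)
{
    uintptr_t addr  = (uintptr_t)p;
    uint32_t *word  = (uint32_t *)(addr & ~(uintptr_t)3); /* containing aligned word */
    unsigned  shift = (unsigned)(addr & 3) * 8;           /* little-endian byte lane */
    uint32_t  mask  = (uint32_t)0xFF << shift;

    uint32_t old = __atomic_load_n(word, __ATOMIC_RELAXED);
    uint32_t new_word;
    do {
        /* Replace only our byte; the CAS fails and retries if any
         * other byte of the word changed concurrently. */
        new_word = (old & ~mask) | ((uint32_t)val << shift);
    } while (!__atomic_compare_exchange_n(word, &old, new_word, 1,
                                          __ATOMIC_SEQ_CST, __ATOMIC_RELAXED));
}
```

The extra mask/merge work is exactly the overhead the quoted post alludes to: on a big-endian PowerPC the lane calculation would also have to flip, which is one more alignment path to get right.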

Quote:

itix wrote:
@cdimauro

Quote:

Even on x86 (much less on x64) there are more pushes than pops. It usually happens because parameters are passed on the stack and never popped out: the SP is simply reset to "clean up" the stack and remove them.

For me it's strange, because I suppose that a 68K ABI would prefer registers over the stack for passing parameters, since there are so many of them.


On 68K passing parameters on the stack is the norm, but compilers may support passing parameters in registers.

If the disassembled code is from an AmigaOS executable it can be using SetAttrs() or DoMethod() calls, which are (sort of) variadic functions used a lot in modern UI programming. Parameters to those calls are almost always constructed on the stack.

I never had the chance to do it on my Amiga. The most I did was use gadtools (working entirely in assembly).

What I recall is that when invoking the Amiga OS API I used registers to pass the required parameters.

cdimauro 
Re: Understanding CPU and integer performance
Posted on 2-Mar-2014 8:12:50
#42
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

Quote:

KimmoK wrote:
@SIMDs

The AMD E-350 does have:
MMX instructions
SSE / Streaming SIMD Extensions
SSE2 / Streaming SIMD Extensions 2
SSE3 / Streaming SIMD Extensions 3
SSSE3 / Supplemental Streaming SIMD Extensions 3
SSE4a

but not those new Intel SIMD instructions...

I have an AMD C-50 on my current sub-notebook, so the same applies to me.

AVX is supported by Sandy Bridge and successors from Intel, and by Bulldozer and successors from AMD.

AVX2 was introduced by Intel with its latest processor, Haswell.
Quote:
I imagine it's not simple to use SIMD on x86, if you want the SW to run on older CPUs as well etc...?

You simply use code-paths, as usual.
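A minimal sketch of what "code-paths" means in practice, assuming GCC or Clang on x86 (where the __builtin_cpu_supports() builtin exists); the function names are made up for the example:

```c
#include <assert.h>

/* Scalar baseline that runs on any CPU. */
static void add_scalar(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

typedef void (*add_fn)(const float *, const float *, float *, int);

/* Runtime dispatch: pick a SIMD-optimized routine only when the CPU
 * reports support. The AVX2 path is only hinted at here, not
 * implemented, so this sketch always falls back to the scalar one. */
static add_fn select_add(void)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) {
        /* return add_avx2;  -- the vectorized path would go here */
    }
#endif
    return add_scalar; /* safe fallback for older CPUs */
}
```

The dispatcher runs once (at startup or lazily) and the rest of the program calls through the function pointer, so older CPUs never execute the newer SIMD encodings.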

cdimauro 
Re: Understanding CPU and integer performance
Posted on 2-Mar-2014 8:17:27
#43
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@olegil

Quote:

olegil wrote:
@matthey

e6500 single thread: 3.3 instructions per clock. e6500 SMT: 6.0. But that's a WILDLY different SMT approach than the original P4 hyperthreading. I expect i7 to be in the same ballpark.

I don't get how 4 fetches per clock can give a dmips rating of 9.43 (i7-2600k), though. There IS some trickery there. With macrofusion (two assembly instructions end up as a single micro-operation) and microfusion (two micro-ops end up as a single micro-op), you can obviously get further.

But in that scenario, you'll have to be VERY careful to generate exactly the sequences the core can optimize, otherwise practical throughput drops off a lot.

Those are already known and established instruction patterns. Something like compare & conditionally branch, for example.
Quote:
POWER arches don't do micro-ops, instead this optimization would be done at compile-time. So for practical work I would venture a guess that the gap is not as big as it seems.

Using micro-ops is a property of the micro-architecture, not of the ISA. To be clear, you can have a PowerPC implementation which is entirely based on micro-code.

POWER/PowerPCs have very complex instructions, and some of them are executed with micro-code or cracked into multiple micro-ops. POWER4 / G5 do it for sure.

cdimauro 
Re: Understanding CPU and integer performance
Posted on 2-Mar-2014 8:21:54
#44
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

Quote:

matthey wrote:

Nobody wants to make a cleaner (design and ISA) and easier to decode CISC with better code density that would have an advantage against x86_64.

Nobody but some geek, perhaps...

itix 
Re: Understanding CPU and integer performance
Posted on 2-Mar-2014 15:43:25
#45
Elite Member
Joined: 22-Dec-2004
Posts: 3398
From: Freedom world

@cdimauro

Quote:

I had some time, so I did a check. PowerPCs handle automatically misaligned loads and stores.


IIRC it is not always the case in little-endian operation mode. If a non-aligned access crosses a page boundary it throws an exception. AFAIK Windows NT software suffered greatly and it was advised never to use non-aligned accesses for integers in little-endian mode.

But who in the world could waste his precious time removing all non-aligned accesses from source code that can have millions of lines?

Quote:

It's strange for a RISC, but we know that PowerPCs are one of the most complex RISC family.


And G5 is not really RISC

Quote:

I never had the chance to do it on my Amiga. The most that I made was using gadtools (and working entirely in assembly).

What I recall is that when invoking the Amiga o.s. API I used registers to pass the required parameters.


In AmigaOS parameters are passed in registers, but when using variable-length taglists and coding in C they are constructed on the stack (e.g. OpenScreenTags()).

When working in assembly you were probably using static taglist arrays.
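The difference itix points out can be sketched in plain C. The struct TagItem shape mirrors the real AmigaOS <utility/tagitem.h> definition, but the tag values and the find_tag() helper are invented for illustration:

```c
#include <stdarg.h>
#include <assert.h>

/* Illustrative AmigaOS-style tag list machinery (not the real headers). */
typedef unsigned long ULONG;
struct TagItem { ULONG ti_Tag; ULONG ti_Data; };

#define TAG_DONE  0UL
#define SA_Width  0x80000021UL   /* hypothetical tag IDs */
#define SA_Height 0x80000022UL

/* Static taglist, the typical shape from assembly: assembled once at
 * build time, so the call generates no per-call stack traffic. */
static struct TagItem screen_tags[] = {
    { SA_Width,  640 },
    { SA_Height, 480 },
    { TAG_DONE,  0   },
};

/* A toy variadic consumer in the style of OpenScreenTags(): the C
 * compiler materializes each tag/data pair at the call site (on the
 * stack in the 32-bit ABIs discussed here), which is exactly the
 * stack construction itix describes. */
static ULONG find_tag(ULONG wanted, ...)
{
    va_list ap;
    va_start(ap, wanted);
    ULONG tag, data = 0;
    while ((tag = va_arg(ap, ULONG)) != TAG_DONE) {
        ULONG d = va_arg(ap, ULONG);
        if (tag == wanted)
            data = d;
    }
    va_end(ap);
    return data;
}
```

A call like find_tag(SA_Height, SA_Width, 640UL, SA_Height, 480UL, TAG_DONE) pushes every pair per call, while the static screen_tags array costs nothing at run time.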

_________________
Amiga Developer
Amiga 500, Efika, Mac Mini and PowerBook

megol 
Re: Understanding CPU and integer performance
Posted on 2-Mar-2014 21:12:57
#46
Regular Member
Joined: 17-Mar-2008
Posts: 355
From: Unknown

@olegil

Quote:

olegil wrote:
@matthey

e6500 single thread: 3.3 instructions per clock. e6500 SMT: 6.0. But that's a WILDLY different SMT approach than the original P4 hyperthreading. I expect i7 to be in the same ballpark.

I don't get how 4 fetches per clock can give a dmips rating of 9.43 (i7-2600k), though. There IS some trickery there. With macrofusion (two assembly instructions end up as a single micro-operation) and microfusion (two micro-ops end up as a single micro-op), you can obviously get further.

But in that scenario, you'll have to be VERY careful to generate exactly the sequences the core can optimize, otherwise practical throughput drops off a lot.


Not really - macro-op fusion is done for a very common pattern, that of a compare/test/add/sub instruction and a conditional branch. Most loops and even most conditional code use such patterns.
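Seen from C, the pattern megol names needs no special effort from the programmer; a minimal sketch (the fusion itself happens invisibly in the decoder, which is an assumption about the compiled output, not something this code can observe):

```c
#include <assert.h>

/* Virtually every counted loop ends in a compare (or sub) immediately
 * followed by a conditional branch. On x86 the compiler emits cmp+jcc
 * pairs for both conditions below, which Intel cores since Core 2 can
 * fuse into single macro-ops. */
static int count_below(const int *v, int n, int limit)
{
    int hits = 0;
    for (int i = 0; i < n; i++) {  /* i < n   -> cmp/jl, a fusible pair */
        if (v[i] < limit)          /* v < lim -> cmp/jge, another pair  */
            hits++;
    }
    return hits;
}
```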

Quote:

POWER arches don't do micro-ops, instead this optimization would be done at compile-time. So for practical work I would venture a guess that the gap is not as big as it seems.


I can't agree. Look at e.g. the POWER6 architecture and the differences between the Power instructions and what the processor actually executes.

But perhaps you mean that Power doesn't use microcode?

Quote:

I just did a disassembly on something I use now and then. I'm used to seeing the same sort of code snippets in my 8 bit AVR code, but after seeing the x86 version I only have three words to conclude the discussion:

TOO MANY MOV'S!

Random code snippet:
x86

8048584: 83 7c 24 1c 00 cmpl $0x0,0x1c(%esp)
8048589: 74 7c je 8048607
804858b: 8b 44 24 2c mov 0x2c(%esp),%eax
804858f: 8d 48 0a lea 0xa(%eax),%ecx
8048592: ba 81 80 80 80 mov $0x80808081,%edx
8048597: 89 c8 mov %ecx,%eax
8048599: f7 ea imul %edx
804859b: 8d 04 0a lea (%edx,%ecx,1),%eax
804859e: 89 c2 mov %eax,%edx
80485a0: c1 fa 07 sar $0x7,%edx
80485a3: 89 c8 mov %ecx,%eax
80485a5: c1 f8 1f sar $0x1f,%eax
80485a8: 89 d3 mov %edx,%ebx
80485aa: 29 c3 sub %eax,%ebx
80485ac: 89 d8 mov %ebx,%eax
80485ae: 89 44 24 2c mov %eax,0x2c(%esp)
80485b2: 8b 54 24 2c mov 0x2c(%esp),%edx
80485b6: 89 d0 mov %edx,%eax
80485b8: c1 e0 08 shl $0x8,%eax
80485bb: 29 d0 sub %edx,%eax
80485bd: 89 ca mov %ecx,%edx
80485bf: 29 c2 sub %eax,%edx
80485c1: 89 d0 mov %edx,%eax
80485c3: 89 44 24 2c mov %eax,0x2c(%esp)
80485c7: 8b 44 24 2c mov 0x2c(%esp),%eax
80485cb: 8b 54 24 28 mov 0x28(%esp),%edx
80485cf: 8d 0c 02 lea (%edx,%eax,1),%ecx
80485d2: ba 81 80 80 80 mov $0x80808081,%edx
80485d7: 89 c8 mov %ecx,%eax
80485d9: f7 ea imul %edx
80485db: 8d 04 0a lea (%edx,%ecx,1),%eax
80485de: 89 c2 mov %eax,%edx
80485e0: c1 fa 07 sar $0x7,%edx
80485e3: 89 c8 mov %ecx,%eax
80485e5: c1 f8 1f sar $0x1f,%eax
80485e8: 89 d3 mov %edx,%ebx
80485ea: 29 c3 sub %eax,%ebx
80485ec: 89 d8 mov %ebx,%eax
80485ee: 89 44 24 28 mov %eax,0x28(%esp)
80485f2: 8b 54 24 28 mov 0x28(%esp),%edx



Now that looks like _BAD_ code! Even when size optimizing one shouldn't get such a result IMHO.

One example using the Intel standard layout (operation destination, source):

mov edx, [esp+0x2c]
mov eax, edx
shl eax, 8 ; will have a dependency on the previous instruction
sub eax, edx

Better:
mov eax, [esp+0x2c]
mov edx, eax
shl eax, 8 ; can be done in parallel with the mov instruction above
sub eax, edx

olegil 
Re: Understanding CPU and integer performance
Posted on 2-Mar-2014 23:33:02
#47
Elite Member
Joined: 22-Aug-2003
Posts: 5895
From: Work

@megol

Don't blame me, it was GCC who dunnit

And about micro-ops, ok. I learned something new.

But you people need to stop discussing RISC vs CISC as instruction set differences.

It's load/store vs direct memory access that is the difference between ARM/POWER on the one hand and m68k/x86 on the other.

And everyone except x86 showed up to the party with registers. x86 just used stack. Which sucked for a while but they somehow managed it in the end.

_________________
This week's pet peeve:
Using "voltage" instead of "potential", which leads to inventing new words like "amperage" instead of "current" (I, measured in A) or possibly "charge" (ampere-hours, Ah, or coulombs, C). Sometimes I don't even know what people mean.

matthey 
Re: Understanding CPU and integer performance
Posted on 3-Mar-2014 0:41:24
#48
Elite Member
Joined: 14-Mar-2007
Posts: 2010
From: Kansas

Quote:

olegil wrote:
But you people need to stop discussing RISC vs CISC as instruction set differences.

It's load/store vs direct memory access that is the difference between ARM/POWER on the one hand and m68k/x86 on the other.


The instruction sets are based on the architecture types, though. CISC has instructions that perform operations (calculations) while moving to and from memory, and read/write (read+calc+write) operations directly in memory, while RISC has load and store (in modern times sometimes with operations too). CISC also keeps immediates in the code mostly as-is (although sign extension is popular), while RISC usually needs multiple instructions doing extend, shift and insert to load an immediate.

This leads to the other big difference: most CISC uses variable-length instructions while RISC mostly uses fixed-length instructions. The lines are blurred today between CISC and RISC, though. PowerPC has complex instructions that commonly do more than one operation, which is more like CISC. Some ARM has variable-length instructions and powerful addressing modes for a RISC. Modern CISC processors are internally RISC. The best architecture for performance is probably some hybrid between CISC and RISC.

Quote:

And everyone except x86 showed up to the party with registers. x86 just used stack. Which sucked for a while but they somehow managed it in the end.


Only 8 (mostly) general-purpose registers is a bottleneck for modern CISC. CISC can do free or almost-free memory accesses, but it is generally limited to 1 read and 1 write access per cycle. I mentioned earlier my stats showing that 68020 compiled code was doing about 15% fewer cache/memory accesses per cycle than x86 compiled code. I found a research study that shows about the same percentage from having 8 general-purpose registers instead of 16:

http://researchbank.rmit.edu.au/eserv/rmit:2517/n2001000381.pdf


Register Count   Program Size   Load/Store Frequency
      27             100.00            27.90%
      24             100.35            28.21%
      22             100.51            28.34%
      20             100.56            28.38%
      18             100.97            28.85%
      16             101.62            30.22%
      14             103.49            31.84%
      12             104.45            34.34%
      10             109.41            41.02%
       8             114.76            44.45%

This was with a MIPS RISC CPU, which is baselined from its number of general-purpose registers. It's obvious to me that RISC needs more registers because the cost of memory accesses is higher (working directly in memory causes bubbles). More registers help RISC avoid the bubbles (although it needs a lot of extra code to do so).

CISC can work directly in memory as long as the cache accesses can be scheduled for superscalar execution. I expect this is why x86 went to OoO so quickly (it helps with poor code quality too). The 68k line probably could have avoided OoO for longer, and in markets where power efficiency matters more than processing power. I still think x86 won the war with just 8 registers and one of the worst ISAs because of economies of scale, market timing, and the rest of the world falling for the hype of RISC.

Last edited by matthey on 03-Mar-2014 at 12:53 AM.

olegil 
Re: Understanding CPU and integer performance
Posted on 3-Mar-2014 12:04:53
#49
Elite Member
Joined: 22-Aug-2003
Posts: 5895
From: Work

@matthey

I know and I agree with you on what RISC and CISC means, but a lot of people seem to think that RISC should have FEWER INSTRUCTIONS by definition. No, it should have dedicated load/store instructions, that's what it should have.

It's a little bit funny how I don't know if you're thinking you're in an argument with me or not. Things like "68020 compiled code was doing about 15% less cache/memory accesses per cycle than x86 compiled code" comes more or less directly from my statement of "everyone except x86 showed up to the party with registers. x86 just used stack". Since stack is memory (or at best cache).

But it's still mighty interesting that there is no real life benchmark being used where x86, ARM and POWER arches are being put head to head and actually COMPARED USING REAL LIFE CODE.

On all three arches, you would be MAD not to take advantage of SIMD if it existed. Yet no benchmark I've seen uses it. What gives?

So, is anyone actually running anything multiplatform and can they shed some light on the practical benefits of SSE/Altivec/NEON?


matthey 
Re: Understanding CPU and integer performance
Posted on 3-Mar-2014 20:57:26
#50
Elite Member
Joined: 14-Mar-2007
Posts: 2010
From: Kansas

Quote:

olegil wrote:
I know and I agree with you on what RISC and CISC means, but a lot of people seem to think that RISC should have FEWER INSTRUCTIONS by definition. No, it should have dedicated load/store instructions, that's what it should have.


The old "pure" RISC processors like SPARC and m88k did have fewer instructions (and moved anything else they could from hardware to the compiler) and thought that was an advantage. They generally didn't have load/store multiple, load/store with sign/zero extension, or load/store with endian swap. It was a violation of the purity of RISC to do any work other than what was supposed to be done in that pipeline stage. Dedicated load/store instructions (and fixed-length instruction encodings) define RISC, but even that is blurred today. It's kind of like a sports car. Rear-wheel drive, 2 seats and an open top used to define a sports car. Nowadays rear-wheel drive with traction control is enough, and people look back and say the old sports cars didn't have enough power to be sports cars.

Quote:

It's a little bit funny how I don't know if you're thinking you're in an argument with me or not. Things like "68020 compiled code was doing about 15% less cache/memory accesses per cycle than x86 compiled code" comes more or less directly from my statement of "everyone except x86 showed up to the party with registers. x86 just used stack". Since stack is memory (or at best cache).


I thought you were implying that 8 registers was a handicap for x86 performance which I presented more evidence for. If I was arguing, it would be for your hypothesis.

Quote:

But it's still mighty interesting that there is no real life benchmark being used where x86, ARM and POWER arches are being put head to head and actually COMPARED USING REAL LIFE CODE.

On all three arches, you would be MAD not to take advantage of SIMD if it existed. Yet no benchmark I've seen uses it. What gives?

So, is anyone actually running anything multiplatform and can they shed some light on the practical benefits of SSE/Altivec/NEON?


Most comparisons I've seen are old. Nobody wants to be compared to the Intel juggernaut.

cdimauro 
Re: Understanding CPU and integer performance
Posted on 4-Mar-2014 22:23:05
#51
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

Quote:

itix wrote:
@cdimauro

Quote:

I had some time, so I did a check. PowerPCs handle automatically misaligned loads and stores.


IIRC it is not always the case in little-endian operation mode. If a non-aligned access crosses a page boundary it throws an exception.

I was not aware of this. Another strange thing about PowerPCs...
Quote:
AFAIK Windows NT software suffered greatly and it was advised never to use non-aligned accesses for integers in little-endian mode.

But who in the world could waste his precious time removing all non-aligned accesses from source code that can have millions of lines?



Quote:

Quote:

It's strange for a RISC, but we know that PowerPCs are one of the most complex RISC family.


And G5 is not really RISC

We could debate this never-ending argument forever, so I'll just give my personal opinion.

To classify a processor as RISC or CISC we should only look at the ISA, not at the micro-architecture (since the latter is just ONE implementation, and can change).

The G5 has almost the same ISA as the G4, even with some instructions missing if I remember correctly, so it has a Reduced Instruction Set.
Quote:
Quote:

I never had the chance to do it on my Amiga. The most that I made was using gadtools (and working entirely in assembly).

What I recall is that when invoking the Amiga o.s. API I used registers to pass the required parameters.


In AmigaOS parameters are passed in registers but when using variable length taglists and coding in C they are constructed in the stack (f.e. OpenScreenTags()).

When working in assembly you probably were using static taglist arrays

Exactly.

cdimauro 
Re: Understanding CPU and integer performance
Posted on 4-Mar-2014 22:42:42
#52
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

Quote:

olegil wrote:

But you people need to stop discussing RISC vs CISC as instruction set differences.

It's load/store vs direct memory access that is the difference between ARM/POWER on the one hand and m68k/x86 on the other.

I'd also add fixed-length vs variable-length instructions, which was and is another distinctive characteristic of the two macro-families.
Quote:
And everyone except x86 showed up to the party with registers. x86 just used stack.

Only if it ran out of registers.
Quote:
Which sucked for a while but they somehow managed it in the end.

Caches do miracles. And cache accesses can be easily masked, thanks to pipelining and/or out-of-order execution.

Quote:

olegil wrote:
@matthey

I know and I agree with you on what RISC and CISC means, but a lot of people seem to think that RISC should have FEWER INSTRUCTIONS by definition. No, it should have dedicated load/store instructions, that's what it should have.

And instruction width; see above.

However I agree with you: RISCs have A LOT of instructions nowadays, and I don't blame them for it, since adding more specialized instructions gains both execution speed and code density. Nor do I blame them for integrating more and more complex instructions, which require more than one clock cycle (and are sometimes cracked into multiple micro-ops, or execute micro-code), for the same reasons.

I don't blame them because history has shown that the RISC model failed: RISCs have been forced to become much more similar to CISCs to remain competitive...
Quote:
It's a little bit funny how I don't know if you're thinking you're in an argument with me or not. Things like "68020 compiled code was doing about 15% less cache/memory accesses per cycle than x86 compiled code" comes more or less directly from my statement of "everyone except x86 showed up to the party with registers. x86 just used stack". Since stack is memory (or at best cache).

And what about x64? It has 16 general purpose registers.
Quote:
But it's still mighty interesting that there is no real life benchmark being used where x86, ARM and POWER arches are being put head to head and actually COMPARED USING REAL LIFE CODE.

On all three arches, you would be MAD not to take advantage of SIMD if it existed. Yet no benchmark I've seen uses it. What gives?

So, is anyone actually running anything multiplatform and can they shed some light on the practical benefits of SSE/Altivec/NEON?

I already answered your questions before: http://mil-embedded.com/articles/avx-leap-forward-dsp-performance/

Is that real-life code enough? It's focused on SIMDs too. And pay attention: it was tested on Sandy Bridge; Haswell has roughly doubled SIMD performance.

Last edited by cdimauro on 04-Mar-2014 at 10:50 PM.

cdimauro 
Re: Understanding CPU and integer performance
Posted on 4-Mar-2014 22:49:27
#53
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

Quote:

matthey wrote:

Some ARM has variable length instructions

That's why I consider Thumb a CISC ISA.
Quote:
and powerful addressing modes for RISC.

Yes, ARMs offer some of the most complex addressing modes; even more complex than many CISCs.
Quote:
Modern CISC processors are internally RISC. The best architecture for performance is probably some hybrid between CISC and RISC.

The best is: CISC ISA + RISC internal execution unit. IMHO.

Quote:
Quote:

And everyone except x86 showed up to the party with registers. x86 just used stack. Which sucked for a while but they somehow managed it in the end.


Only 8 (mostly) general-purpose registers is a bottleneck for modern CISC. CISC can do free or almost-free memory accesses, but it is generally limited to 1 read and 1 write access per cycle. I mentioned earlier my stats showing that 68020 compiled code was doing about 15% fewer cache/memory accesses per cycle than x86 compiled code. I found a research study that shows about the same percentage from having 8 general-purpose registers instead of 16:

http://researchbank.rmit.edu.au/eserv/rmit:2517/n2001000381.pdf


Register Count   Program Size   Load/Store Frequency
      27             100.00            27.90%
      24             100.35            28.21%
      22             100.51            28.34%
      20             100.56            28.38%
      18             100.97            28.85%
      16             101.62            30.22%
      14             103.49            31.84%
      12             104.45            34.34%
      10             109.41            41.02%
       8             114.76            44.45%

Very interesting! Thanks.

matthey 
Re: Understanding CPU and integer performance
Posted on 5-Mar-2014 4:53:34
#54
Elite Member
Joined: 14-Mar-2007
Posts: 2010
From: Kansas

Quote:

cdimauro wrote:
Quote:

matthey wrote:
Some ARM has variable length instructions

That's why I consider Thumb a CISC ISA.
Quote:
and powerful addressing modes for RISC.

Yes, ARMs offer some of the most complex addressing modes; even more complex than many CISCs.


ARM has a RISC-style pipeline that is too short to hide the cache/memory access cost, though. That's why it's so weak in cache/memory. The powerful addressing modes are nice but not free like in most CISC designs (68k and x86). Every access generates bubbles, like on most RISC.

Quote:

Quote:
Modern CISC processors are internally RISC. The best architecture for performance is probably some hybrid between CISC and RISC.

The best is: CISC ISA + RISC internal execution unit. IMHO.


I agree. Clean, simple encodings based on 16-bit words could reduce much of the overhead of CISC decoding. The big question is how to handle the integer registers and how many to have. Adding 32 registers with CISC makes it difficult to keep most instructions as 16-bit encodings. An An/Dn split like the 68k uses makes this almost possible, and having the EA calculation and early instruction retirement before the ALU is more powerful than x86 IMO (without micro-oping and OoO), but the An registers become less general-purpose. With a whole new CISC design, would you try to:

A) stay with 16 registers all general purpose except SP
B) stay with 16 registers all general purpose + separate SP
C) split An and Dn with 16 data and 8 address registers
D) split An and Dn with 16 data and 8 address registers + separate SP
E) split An and Dn with 16 data and 16 address registers
F) move to 32 general purpose registers with registers greater than 16 now 32 bit encodings

In any case, it should be possible to maintain better code density than x86 and x86_64. Adding 64-bit support would be bad for code density, but it should still be possible to have an average instruction length of less than 4 bytes.

Quote:
Quote:

Only 8 (mostly) general-purpose registers is a bottleneck for modern CISC. CISC can do free or almost-free memory accesses, but it is generally limited to 1 read and 1 write access per cycle. I mentioned earlier my stats showing that 68020 compiled code was doing about 15% fewer cache/memory accesses per cycle than x86 compiled code. I found a research study that shows about the same percentage from having 8 general-purpose registers instead of 16:

http://researchbank.rmit.edu.au/eserv/rmit:2517/n2001000381.pdf


Register Count   Program Size   Load/Store Frequency
      27             100.00            27.90%
      24             100.35            28.21%
      22             100.51            28.34%
      20             100.56            28.38%
      18             100.97            28.85%
      16             101.62            30.22%
      14             103.49            31.84%
      12             104.45            34.34%
      10             109.41            41.02%
       8             114.76            44.45%

Very interesting! Thanks.


I read another study that showed about a 10% simulated performance improvement from 16 to 32 registers with ARM32. I would think a CISC CPU that could do bubble free cache/memory accesses would probably be 1/2 that or less? With 50% of variables in the cache, it's impossible to schedule all the cache/memory accesses but a good CISC design that gets closer to 25% of the variables in cache is pretty much always going to be able to schedule the accesses, maybe even without OoO.

Gunnar wants to add more registers to the Apollo. Although adding 8 more data registers is almost there for the 68k, I've been opposed to it because it can't be done cleanly enough and there isn't enough advantage IMO. I showed him links to the old natami.net discussions on the subject and studies like the one above but he hasn't given up yet. He also can't make up his mind on ISA proposals. I don't know if he'll ever get anything finished with his perfectionist attitude.

By the way, do you happen to have the frequency of MOVSX and MOVZX instructions as a percentage in x86 and x86_64?

Last edited by matthey on 05-Mar-2014 at 05:00 AM.

cdimauro 
Re: Understanding CPU and integer performance
Posted on 5-Mar-2014 6:47:35
#55
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

Quote:

matthey wrote:
Quote:

cdimauro wrote:
Yes, ARMs offer some of the most complex addressing modes; even more complex than many CISCs.


ARM has a RISC style pipeline that is too short to hide the cache/memory access cost though. That's why it's so weak in cache/memory.

It depends on the micro-architecture. ARM Cortex A15 (and A12) have very long pipelines, comparable to modern x86 micro-architectures.
Quote:
The powerful addressing modes are nice but not free like most CISC designs (68k and x86).

Yes, we can use them on almost every instruction.
Quote:
Every access generates bubbles like most RISC.

Even with longer pipelines?
Quote:
Quote:

The best is: CISC ISA + RISC internal execution unit. IMHO.


I agree. Clean, simple encodings based on 16 bit words could reduce much of the overhead of CISC decoding. The big question is how to handle the integer registers and how many to have. Adding 32 registers with CISC makes it difficult to keep most instructions as 16 bit encodings. An An/Dn split like the 68k uses makes this almost possible and having the EA and early instruction retirement before the ALU is more powerful than x86 IMO (without micro-oping and OoO) but the An registers become less general purpose. With a whole new CISC design, would you try to:

A) stay with 16 registers all general purpose except SP
B) stay with 16 registers all general purpose + separate SP
C) split An and Dn with 16 data and 8 address registers
D) split An and Dn with 16 data and 8 address registers + separate SP
E) split An and Dn with 16 data and 16 address registers
F) move to 32 general purpose registers with registers greater than 16 now 32 bit encodings

In any case, it should be possible to maintain a better code density than x86 and x86_64. Adding 64 bit support would be bad for code density but it should still be possible to have an average instruction length of less than 4 bytes.

I did all this with my new CISC design (which I called NEx64T): 32 general purpose registers, up to 128 SIMD registers (from 128 to 1024 bits wide), up to 16 mask registers (for SIMD), and full (native) 64-bit support.

I gave a talk at the last EuroPython: https://ep2013.europython.eu/conference/talks/x86x64-assembly-python-new-cpu-architecture-rule-world

The code density is also very good. Consider that in my analysis I just took the disassembled x86 or x64 instructions and converted them to their NEx64T equivalents. So I'm not using any of the new features of my ISA, which could improve code density (and execution speed) further.

My ISA is an x86/x64 superset: it has all 68K addressing modes except the bad ones (double indirect) and PC + scaled index, it adds many new instructions, and its SIMD unit is a superset of Larrabee / Xeon Phi. I designed it so that it can easily scale from embedded to HPC: you can "cut" features cleanly, depending on the specific needs.
Quote:
Quote:
Very interesting! Thanks.


I read another study that showed about a 10% simulated performance improvement from 16 to 32 registers with ARM32. I would think a CISC CPU that could do bubble free cache/memory accesses would probably be 1/2 that or less? With 50% of variables in the cache, it's impossible to schedule all the cache/memory accesses but a good CISC design that gets closer to 25% of the variables in cache is pretty much always going to be able to schedule the accesses, maybe even without OoO.

I think that my ISA can do better, since it has 32 registers and all the good things of a CISC design.
Quote:
Gunnar wants to add more registers to the Apollo. Although adding 8 more data registers is almost there for the 68k, I've been opposed to it because it can't be done cleanly enough and there isn't enough advantage IMO. I showed him links to the old natami.net discussions on the subject and studies like the one above but he hasn't given up yet. He also can't make up his mind on ISA proposals. I don't know if he'll ever get anything finished with his perfectionist attitude.

I know him. But he should be realistic: it's better to release a basic implementation as soon as possible and THEN start improving it, adding new features and/or optimizing things. You'll never finish if you try to perfect everything.

For my new ISA, I developed a first one and completed it. Then I had a new idea on how to improve it, and I wrote a new one and completed it. Then I had other ideas, and I started a third one. But I always completed the previous work!

Anyway, it's a pity that the Apollo forum cannot be read. Requiring registration just to take a look is overkill...
Quote:
By the way, do you happen to have the frequency of MOVSX and MOVZX instructions as a percentage in x86 and x86_64?

Yes: http://www.appuntidigitali.it/18362/statistiche-su-x86-x64-parte-8-istruzioni-mnemonici/

The statistics are relative to a single executable, Adobe Photoshop CS 6 public beta, but I found a similar trend with other applications (MySQL, Firebird, MAME, etc. etc.).

Of course they are limited to the instructions that I can automatically disassemble in a safe way (e.g.: without data interpreted as instructions).

So, they are useful, but not that frequent. To answer a possible question: no, it isn't worth allocating opcode bits to introduce automatic zero/sign extension of data for all instructions. Normal MOVSX or MOVZX instructions are enough to cover the majority of extension cases.

To be clearer: MOST of the code is regular, in the sense that it uses a single/common integer size for its calculations, and only in rare cases does it use different, smaller sizes that require extension (before use).

 Status: Offline
Profile     Report this post  
matthey 
Re: Understanding CPU and integer performance
Posted on 5-Mar-2014 20:49:11
#56 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2010
From: Kansas

Quote:

cdimauro wrote:
Quote:

matthey wrote:
ARM has a RISC style pipeline that is too short to hide the cache/memory access cost though. That's why it's so weak in cache/memory.

It depends on the micro-architecture. ARM Cortex A15 (and A12) have very long pipelines, comparable to modern x86 micro-architectures.


Yes, I see 15 integer pipeline stages for the A15. That's a lot for a power efficient chip, as the long pipeline and sophisticated branch prediction draw extra power. Either the longer pipeline is stronger in memory and/or OoO is helping a lot, but they are still weak in cache/memory compared to old 68k and x86 CISC processors that did not have OoO. The 68060 pipeline is only 8 stages (decoupled 4+4) and it is able to hide the cost of memory/cache accesses and most or all of the short loop overhead.

Quote:

Quote:
Every access generates bubbles like most RISC.

Even with longer pipelines?


They should be able to hide the bubbles with a long enough pipeline but I'm not sure. Even if there is a cost to memory accesses, OoO will help the performance by avoiding many of them. OoO is not perfect either and having a cost (bubbles) when accessing cache/memory adds to the complexity of scheduling instructions. The higher complexity for OoO equates to more power draw. It's nice if the pipeline can hide the cache/memory access bubbles without getting too long or moving to OoO. The 68060 does a better job of it than most modern processors. The An/Dn split actually helps keep the pipeline relatively short while hiding the cost of cache/memory accesses. There is a disadvantage that a bubble is generated when updating the An in the ALU though (also there is no forwarding for the An registers like the 68060). This means that I will probably change the ISA to only allow a source An in an EA, provided I can talk Gunnar out of expanding the number of registers. A source An in EAs is possible with a monolithic (combined An and Dn) register file, which all further 68k designs are likely to have (it was possible in the 68060 and would have reduced the instructions and improved the code density). Allowing a source An in EAs is almost as good as a destination An with 2 op style like the 68k.

Quote:

I gave a talk at the last EuroPython: https://ep2013.europython.eu/conference/talks/x86x64-assembly-python-new-cpu-architecture-rule-world

The code density is also very good. Consider that in my analysis I just took the disassembled x86 or x64 instructions and converted them to their NEx64T equivalents. So I'm not using any of the new features of my ISA, which could improve code density (and execution speed) further.

My ISA is an x86/x64 superset: it has all 68K addressing modes except the bad ones (double indirect) and PC + scaled index, it adds many new instructions, and its SIMD unit is a superset of Larrabee / Xeon Phi. I designed it so that it can easily scale from embedded to HPC: you can "cut" features cleanly, depending on the specific needs.


Well, you won't ever be able to sell your ISA to Intel but maybe they will pay you to keep it quiet. Oh wait, don't you work for them now :D. You would think Freescale might be interested in a powerful CISC ISA to compete with Intel considering their past history with CISC processors. Then again, they had a better CISC ISA than x86, and Joe Circello, but cut down the ISA and assigned Joe to work on microcontrollers. The next time the auto industry falters they can lay off a few thousand more workers and buy back more of their stock at pennies on the dollar to keep them out of bankruptcy. They really have done a good job of managing their debt but they are getting owned by Intel in the CPU market. IBM has no experience with CISC. Maybe AMD would be interested but that wouldn't fly too well with you working for Intel ;).

There is this flawed idea that addressing modes are slow CISC ideas. Complex addressing modes like double indirect are bad ideas, but simple addressing modes can be very powerful as they apply to many instructions, and the cost is minimal even for less common types. Even if they limit clock speed a little, they boost performance per MHz. There are disadvantages to high clock speeds.

Quote:

Anyway, it's a pity that the Apollo forum cannot be read. Requiring registration just to take a look is overkill...


Yes. It would be much better if there were more knowledgeable people involved in the discussions like at natami.net. I suggested long ago inviting you, Megol and Marcel but Gunnar seems to stick to people who agree with him. I would also like to see ThoR there but I don't know if he would have the time.

Quote:
Quote:
By the way, do you happen to have the frequency of MOVSX and MOVZX instructions as a percentage in x86 and x86_64?

Yes: http://www.appuntidigitali.it/18362/statistiche-su-x86-x64-parte-8-istruzioni-mnemonici/

The statistics are relative to a single executable, Adobe Photoshop CS 6 public beta, but I found a similar trend with other applications (MySQL, Firebird, MAME, etc. etc.).

Of course they are limited to the instructions that I can automatically disassemble in a safe way (e.g.: without data interpreted as instructions).

So, they are useful, but not that frequent. To answer a possible question: no, it isn't worth allocating opcode bits to introduce automatic zero/sign extension of data for all instructions. Normal MOVSX or MOVZX instructions are enough to cover the majority of extension cases.

To be clearer: MOST of the code is regular, in the sense that it uses a single/common integer size for its calculations, and only in rare cases does it use different, smaller sizes that require extension (before use).


Thanks, it looks like they are fairly common. I see that LEA is used even more on x86_64 than on x86. That should help my argument for LEA EA,Dn or, even better, for the addressing mode that drops the EA calculation, without the memory access, into the ALU. As usual, he changes his mind on his own suggestions, so our conversations commonly go in circles.

 Status: Offline
Profile     Report this post  
cdimauro 
Re: Understanding CPU and integer performance
Posted on 6-Mar-2014 6:47:44
#57 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

Quote:

matthey wrote:
Quote:

cdimauro wrote:
Even with longer pipelines?


They should be able to hide the bubbles with a long enough pipeline but I'm not sure. Even if there is a cost to memory accesses, OoO will help the performance by avoiding many of them. OoO is not perfect either and having a cost (bubbles) when accessing cache/memory adds to the complexity of scheduling instructions. The higher complexity for OoO equates to more power draw. It's nice if the pipeline can hide the cache/memory access bubbles without getting too long or moving to OoO. The 68060 does a better job of it than most modern processors.

But if you want to get the most performance, you need to go for an OoO micro-architecture. Take a look at the boost the Atom got when it moved from in-order to OoO.

Quote:
The An/Dn split actually helps keep the pipeline relatively short while hiding the cost of cache/memory accesses. There is a disadvantage that a bubble is generated when updating the An in the ALU though (also there is no forwarding for the An registers like the 68060). This means that I will probably change the ISA to only allow a source An in an EA, provided I can talk Gunnar out of expanding the number of registers. A source An in EAs is possible with a monolithic (combined An and Dn) register file which all further 68k designs are likely to be (it was possible in the 68060 and would have reduced the instructions and improved the code density). Allowing a source An in EAs is almost as good as a destination An with 2 op style like the 68k.

Allowing An to be used as a source requires adding a costly 16-bit prefix. It would add complexity to the decoder, decrease code density, and it doesn't solve the problem of having only 15 registers (I don't count the SP register).

Usually a good coder is able to organize the Dn and An registers so as not to run out of them, but if you still need more registers I don't think that enabling An as a source (so, having a general purpose register file) will improve the situation.

Quote:
Quote:

I gave a talk at the last EuroPython: https://ep2013.europython.eu/conference/talks/x86x64-assembly-python-new-cpu-architecture-rule-world

The code density is also very good. Consider that in my analysis I just took the disassembled x86 or x64 instructions and converted them to their NEx64T equivalents. So I'm not using any of the new features of my ISA, which could improve code density (and execution speed) further.

My ISA is an x86/x64 superset: it has all 68K addressing modes except the bad ones (double indirect) and PC + scaled index, it adds many new instructions, and its SIMD unit is a superset of Larrabee / Xeon Phi. I designed it so that it can easily scale from embedded to HPC: you can "cut" features cleanly, depending on the specific needs.


Well, you won't ever be able to sell your ISA to Intel but maybe they will pay you to keep it quiet. Oh wait, don't you work for them now :D. You would think Freescale might be interested in a powerful CISC ISA to compete with Intel considering their past history with CISC processors. Then again, they had a better CISC ISA than x86, and Joe Circello, but cut down the ISA and assigned Joe to work on microcontrollers. The next time the auto industry falters they can lay off a few thousand more workers and buy back more of their stock at pennies on the dollar to keep them out of bankruptcy. They really have done a good job of managing their debt but they are getting owned by Intel in the CPU market. IBM has no experience with CISC. Maybe AMD would be interested but that wouldn't fly too well with you working for Intel ;).

Yeah. Intel is a big company, so right now I'm stuck waiting for a good opportunity to present my work to some ISA architect. It may take years...

Quote:
There is this flawed idea that addressing modes are slow CISC ideas. Complex addressing modes are bad ideas like double indirect but simple addressing modes can be very powerful as they apply to many instructions and the cost minimal even for less common types. Even if they limit clock speed a little, they boost performance per MHz. There are disadvantages to high clock speeds.

I agree. In my ISA I've added more useful addressing modes than x86 and the 68K have; they are simple to implement (no double indirection, as I said, nor sign-extension of the index register) and can improve both code density and execution speed. I still have some (little) room for adding others, and ideas on how to better use the available slots.

Quote:
Quote:

Anyway, it's a pity that the Apollo forum cannot be read. Requiring registration just to take a look is overkill...


Yes. It would be much better if there were more knowledgeable people involved in the discussions like at natami.net. I suggested long ago inviting you, Megol and Marcel but Gunnar seems to stick to people who agree with him. I would also like to see ThoR there but I don't know if he would have the time.

Unfortunately I cannot contribute to improving other ISAs anymore, due to my new position. CPU architectures are the core business of my company, so I can only discuss what I have already written in the past about these topics.

But I do appreciate reading technical discussions, to see what's brewing...

Quote:
Quote:

Yes: http://www.appuntidigitali.it/18362/statistiche-su-x86-x64-parte-8-istruzioni-mnemonici/

The statistics are relative to a single executable, Adobe Photoshop CS 6 public beta, but I found a similar trend with other applications (MySQL, Firebird, MAME, etc. etc.).

Of course they are limited to the instructions that I can automatically disassemble in a safe way (e.g.: without data interpreted as instructions).

So, they are useful, but not that frequent. To answer a possible question: no, it isn't worth allocating opcode bits to introduce automatic zero/sign extension of data for all instructions. Normal MOVSX or MOVZX instructions are enough to cover the majority of extension cases.

To be clearer: MOST of the code is regular, in the sense that it uses a single/common integer size for its calculations, and only in rare cases does it use different, smaller sizes that require extension (before use).


Thanks, it looks like they are fairly common.

Common enough to justify ad hoc instructions, but not enough to spend bits in the extension word or in a prefix, as we already discussed on the amigacoding.de forum, if I recall correctly.
Quote:
I see that LEA is used even more on the x86_64 than x86.

Yes, but it's due to the new ABI: x86_64 prefers passing parameters in registers instead of pushing them on the stack.
Quote:
That should help my argument for LEA EA,Dn or, even better, for the addressing mode that drops the EA calculation, without the memory access, into the ALU.

I already put it on my wish list when I wrote an article about Natami: http://www.appuntidigitali.it/9907/native-amiga-natami-il-vero-erede-dellamiga/

Quote:
As usual, he changes his mind on his own suggestions, so our conversations commonly go in circles.

Yes, I know. Being a very skilled engineer doesn't automatically mean that you have a good "vision" of what to do with your abilities, and of how it can impact the real world, or future implementations.

My experience with this was pretty bad. What I see is that skilled hardware guys have their vision, and it's difficult to change their minds. And it's also a limited vision: mostly thinking "I can use this free hole here to put this new thing". They are too tied to legacy: they think too much about reusing what exists, and about patching it to accomplish their ideas.

What I see is that they lack a ground-breaking vision: thinking about brand new stuff to trace a new path, while limiting the legacy of the old architecture.

Amiga was disruptive. Instead, what I see now is that the "successors" are just cosmetic enhancements with new features added following the same, very old and limited, vision.

 Status: Offline
Profile     Report this post  
matthey 
Re: Understanding CPU and integer performance
Posted on 6-Mar-2014 9:46:34
#58 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2010
From: Kansas

Quote:

cdimauro wrote:
But if you want to get the most performance, you need to go for an OoO micro-architecture. Take a look at the boost the Atom got when it moved from in-order to OoO.


I think Atom was initially trying to improve power efficiency enough to compete in the cell phone market with ARM. There may have been concern that increasing performance would cause them to compete against Intel's other low end processors too. Rather than cut their power usage in half again (maybe they couldn't), they decided it was easier to increase their performance and dominate the growing pad and netbook market. The Atom is definitely more powerful with OoO but it also is considerably less power efficient if you take away the die shrinks. Intel already had the expertise to do CISC OoO.

I think a 68060-like CPU can do better than most processors without the need to go full OoO. It's one of the easier processors to schedule for because the pipeline hides many of the bubbles and the early instruction retirement reduces dependencies. We would be stronger with OoO, but without OoO, micro-oping and x86 decoding, and with smaller caches, we may be more power efficient than Atom. We would probably have to move out of fpga to realize the full potential but this may give a bigger performance boost than OoO. We just need to find some investors :P.

Quote:

Allowing An to be used as a source requires adding a costly 16-bit prefix. It would add complexity to the decoder, decrease code density, and it doesn't solve the problem of having only 15 registers (I don't count the SP register).

Usually a good coder is able to organize the Dn and An registers so as not to run out of them, but if you still need more registers I don't think that enabling An as a source (so, having a general purpose register file) will improve the situation.


No prefix is needed, and opening up An sources will increase code density significantly as fewer instructions are needed. All instructions of the type OP EA,Dn simply don't allow An in the EA. This dates back to the original 68000, which had separate register files where it wasn't possible without a bubble. The 68060 and almost any modern 68k would have a monolithic register file, so there is no disadvantage to enabling these. Example:

if ((address1 | address2) & 3) // checks both addresses for 32-bit alignment

68020:

move.l a0,d0
move.l a1,d1
or.l d0,d1
and.w #3,d1
beq aligned

vs Apollo:

move.l a0,d0
or.l a1,d0
and.l #3,d0
beq aligned

We save 1 instruction, 2 bytes and a scratch register. Also the compiler doesn't have to worry about converting the AND.L #3,Dn to AND.W #3,Dn to improve code density.

Quote:

Yeah. Intel is a big company, so right now I'm stuck waiting for a good opportunity to present my work to some ISA architect. It may take years...


I wonder why Intel is not in a hurry.

Quote:
Quote:
As usual, he changes his mind on his own suggestions, so our conversations commonly go in circles.

Yes, I know. Being a very skilled engineer doesn't automatically mean that you have a good "vision" of what to do with your abilities, and of how it can impact the real world, or future implementations.

My experience with this was pretty bad. What I see is that skilled hardware guys have their vision, and it's difficult to change their minds. And it's also a limited vision: mostly thinking "I can use this free hole here to put this new thing". They are too tied to legacy: they think too much about reusing what exists, and about patching it to accomplish their ideas.

What I see is that they lack a ground-breaking vision: thinking about brand new stuff to trace a new path, while limiting the legacy of the old architecture.

Amiga was disruptive. Instead, what I see now is that the "successors" are just cosmetic enhancements with new features added following the same, very old and limited, vision.


More often it's the clueless pencil pushers that make the big decisions and the engineers have little say. The engineers just about have to be into their complex work so much as to be perfectionists and perfectionists are going to be opinionated. Gunnar has vision but it's so lofty as to be impractical at times. At least he is logical. Sometimes I can sway him with enough logic and statistics ;).

 Status: Offline
Profile     Report this post  
megol 
Re: Understanding CPU and integer performance
Posted on 6-Mar-2014 18:25:05
#59 ]
Regular Member
Joined: 17-Mar-2008
Posts: 355
From: Unknown

@matthey
Quote:

matthey wrote:
Quote:

cdimauro wrote:
But if you want to get the most performance, you need to go for an OoO micro-architecture. Take a look at the boost the Atom got when it moved from in-order to OoO.


I think Atom was initially trying to improve power efficiency enough to compete in the cell phone market with ARM. There may have been concern that increasing performance would cause them to compete against Intel's other low end processors too. Rather than cut their power usage in half again (maybe they couldn't), they decided it was easier to increase their performance and dominate the growing pad and netbook market. The Atom is definitely more powerful with OoO but it also is considerably less power efficient if you take away the die shrinks. Intel already had the expertise to do CISC OoO.


Not really. The OoO Atom isn't really similar to other Intel OoO processors and is also partially in-order. But at the same time, OoO isn't rocket science unless you're trying to reach peak performance.

Quote:

I think a 68060-like CPU can do better than most processors without the need to go full OoO. It's one of the easier processors to schedule for because the pipeline hides many of the bubbles and the early instruction retirement reduces dependencies. We would be stronger with OoO, but without OoO, micro-oping and x86 decoding, and with smaller caches, we may be more power efficient than Atom. We would probably have to move out of fpga to realize the full potential but this may give a bigger performance boost than OoO. We just need to find some investors :P.


I have seen this stated repeatedly but IMHO it simply isn't true.
This idea that the 68k is better fitted than e.g. x86 (post-386) for anything isn't based on any real advantage. The 68060-type design with address generation/cache access has been used in many x86 processors, some of which could also do some other operations (MOVs of different types). In fact the 68k designs make this _harder_, as MOVE instructions change flags while x86 MOV doesn't. One example is the Cyrix 6x86, which supported register renaming of the same type the 68060 did.

But this type of execution is only an advantage for a small subset of designs: address generation stalls can be hard to avoid, especially for stores (without extra hardware), and increasing the cache size will lengthen the whole execution pipeline due to extra cache stage(s). Also, there are many designs where there simply isn't any advantage at all, as the processor can already execute as if the pipeline were hardcoded, by forwarding the cache result to the execution unit without penalty.

Quote:

Quote:

Allowing An to be used as a source requires adding a costly 16-bit prefix. It would add complexity to the decoder, decrease code density, and it doesn't solve the problem of having only 15 registers (I don't count the SP register).

Usually a good coder is able to organize the Dn and An registers so as not to run out of them, but if you still need more registers I don't think that enabling An as a source (so, having a general purpose register file) will improve the situation.


No prefix is needed, and opening up An sources will increase code density significantly as fewer instructions are needed. All instructions of the type OP EA,Dn simply don't allow An in the EA. This dates back to the original 68000, which had separate register files where it wasn't possible without a bubble. The 68060 and almost any modern 68k would have a monolithic register file, so there is no disadvantage to enabling these. Example:

if ((address1 | address2) & 3) // checks both addresses for 32-bit alignment

68020:

move.l a0,d0
move.l a1,d1
or.l d0,d1
and.w #3,d1
beq aligned

vs Apollo:

move.l a0,d0
or.l a1,d0
and.l #3,d0
beq aligned

We save 1 instruction, 2 bytes and a scratch register. Also the compiler doesn't have to worry about converting the AND.L #3,Dn to AND.W #3,Dn to improve code density.


It also forces the compiler to do more work, as this kind of extension is just a special case - it isn't a generic register extension allowing address and data registers to be equivalent. Something like this can be hard to support. Unless the use of address registers has been expanded to cover all ALU operations?

This design also complicates hardware but I have listed my objections a number of times already.

Reusing the "free" address target encoding for this also takes away the possibility of using it for something else, like:
Shifts with immediate larger than 7.
Extending ADDX/SUBX to be more generic.
And other things.

With (my version of) a prefix, one only has to use it when needed, and it also gives other advantages that in many cases can decrease code size.

It is hard to optimize your example as it is fundamentally an address operation and my design doesn't change the address/data register divide; however, if you have any other example where doing ALU operations on address registers is an advantage, please post it.

(BTW this forum has an incredible ability to crap up the formatting all the time - I can't be bothered to edit this post any more to see if the quotes are correct. I had a longer/more detailed post for this thread yesterday, but spending >4x the writing time trying to make it readable isn't worth it, not when there are better forums where things just work. :/)

 Status: Offline
Profile     Report this post  
cdimauro 
Re: Understanding CPU and integer performance
Posted on 6-Mar-2014 21:59:59
#60 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

Quote:

matthey wrote:
Quote:

cdimauro wrote:
But if you want to get the most performance, you need to go for an OoO micro-architecture. Take a look at the boost the Atom got when it moved from in-order to OoO.


I think Atom was initially trying to improve power efficiency enough to compete in the cell phone market with ARM. There may have been concern that increasing performance would cause them to compete against Intel's other low end processors too. Rather than cut their power usage in half again (maybe they couldn't), they decided it was easier to increase their performance and dominate the growing pad and netbook market.

Honestly, I have no idea why Intel designed the Atom, but at that time ARMs were used a lot in PDAs, and Intel had its ARM-based XScale family. Maybe the Atom was an in-house attempt to enter the PDA market. Who knows...

Quote:
The Atom is definitely more powerful with OoO but it also is considerably less power efficient if you take away the die shrinks. Intel already had the expertise to do CISC OoO.

Yes, but Atom (Bay-Trail) is different, as megol stated.

Quote:
I think a 68060 like CPU can do better than most processors without the need to go full OoO. It's one of the easier processors to schedule for because the pipeline hides many of the bubbles and the early instruction retirement reduces dependencies. We would be stronger with OoO but then we may be more power efficient than Atom without OoO, micro-oping, x86 decoding and with smaller caches. We would probably have to move out of fpga to realize the full potential but this may give a bigger performance boost than OoO. We just need to find some investors :P.

Investors and Amiga are an oxymoron.

Anyway, I agree with megol here: I don't see that much advantage for a 68060.

Quote:
Quote:

Allowing An to be used as a source requires adding a costly 16-bit prefix. It would add complexity to the decoder, decrease code density, and it doesn't solve the problem of having only 15 registers (I don't count the SP register).

Usually a good coder is able to organize the Dn and An registers so as not to run out of them, but if you still need more registers I don't think that enabling An as a source (so, having a general purpose register file) will improve the situation.


No prefix is needed, and opening up An sources will increase code density significantly as fewer instructions are needed. All instructions of the type OP EA,Dn simply don't allow An in the EA. This dates back to the original 68000, which had separate register files where it wasn't possible without a bubble. The 68060 and almost any modern 68k would have a monolithic register file, so there is no disadvantage to enabling these.

OK, understood: it requires only small changes to the ISA, enabling An where possible. But do you really see any concrete advantage in real-world code? I see very little.

Quote:
Example:

if ((address1 | address2) & 3) // checks both addresses for 32-bit alignment

68020:

move.l a0,d0
move.l a1,d1
or.l d0,d1
and.w #3,d1
beq aligned

vs Apollo:

move.l a0,d0
or.l a1,d0
and.l #3,d0
beq aligned

We save 1 instruction, 2 bytes and a scratch register. Also the compiler doesn't have to worry about converting the AND.L #3,Dn to AND.W #3,Dn to improve code density.

OK, I see. But how frequent is code like this?

Quote:

Quote:

Yeah. Intel is big company, so actually I'm blocked waiting for a good opportunity to present my work to some ISA architect. It may require years...


I wonder why Intel is not in a hurry.

Because it already has very good products. Also, only a handful of people know my work and my new ISA.

So I have to better "advertise" NEx64T.

Quote:
Quote:
Yes, I know. Being a very skilled engineer doesn't automatically mean that you have a good "vision" of what to do with your abilities, and of how they can impact the real world, or future implementations.

My experience with this was pretty bad. What I see is that skilled hardware guys have their vision, and it's difficult to change their minds. It's a limited vision, too: mostly thinking "I can use this free hole here to put this new thing in". They are too tied to legacy: they think too much about reusing what already exists and patching it to accomplish their ideas.

What I see is that they lack a ground-breaking vision: thinking about brand-new stuff to trace a new path, while limiting the legacy of the old architecture.

The Amiga was disruptive. Instead, what I see now is that the "successors" are just cosmetic enhancements, with new features added following the same very old and limited vision.


More often it's the clueless pencil pushers who make the big decisions, and the engineers have little say. The engineers just about have to be so into their complex work as to be perfectionists, and perfectionists are going to be opinionated. Gunnar has vision, but it's so lofty as to be impractical at times. At least he is logical. Sometimes I can sway him with enough logic and statistics ;).

Numbers speak. I did that with him in the past too, and it worked, but those are rare cases.

Having a "pusher" with a good vision and a team of skilled people to coordinate is, IMO, much more productive. Too much time is spent talking without getting results. A "pusher" has an interest in marketing a product as soon as possible, and is able to manage strong personalities. Sorry to say it, but I think that's what the Amiga community needs.


Copyright (C) 2000 - 2019 Amigaworld.net.
Amigaworld.net was originally founded by David Doyle