cdimauro
The Case for the Complex Instruction Set Computer
Posted on 9-Nov-2023 22:12:21 [ #1 ]
Elite Member | Joined: 29-Oct-2012 | Posts: 4127 | From: Germany
Status: Offline
Karlos
Re: The Case for the Complex Instruction Set Computer
Posted on 10-Nov-2023 14:07:05 [ #2 ]
Elite Member | Joined: 24-Aug-2003 | Posts: 4678 | From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!
@cdimauro

I don't really believe CISC exists. It's just a term we use to describe CPUs that typically aren't load/store, full-register-width operation-only/preferred, not-extensively-microcoded machines.

There's nothing especially simple about the instructions that common RISC machines can execute. You've got complex masked rotates and the like on PPC, and conditionally executable instructions and similar on ARM. They both have many of the same addressing modes as your typical "CISC" machine, but the key difference is that those modes are only for load and store rather than forming an effective address for arbitrary instructions.

_________________
Doing stupid things for fun...
Status: Offline

Karlos
Re: The Case for the Complex Instruction Set Computer
Posted on 10-Nov-2023 14:08:32 [ #3 ]
Elite Member | Joined: 24-Aug-2003 | Posts: 4678 | From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!
In summary, CISC is just a term to describe something which is not RISC. That's a massive set of possible architecture types.

_________________
Doing stupid things for fun...
Status: Offline

BigD
Re: The Case for the Complex Instruction Set Computer
Posted on 10-Nov-2023 14:23:33 [ #4 ]
Elite Member | Joined: 11-Aug-2005 | Posts: 7466 | From: UK
@cdimauro

I thought CISC design morphed into a hybrid RISC/CISC eventually. It is all invisible to the users though.

_________________
"Art challenges technology. Technology inspires the art." John Lasseter, Co-Founder of Pixar Animation Studios
Status: Offline

kolla
Re: The Case for the Complex Instruction Set Computer
Posted on 10-Nov-2023 16:29:41 [ #5 ]
Elite Member | Joined: 20-Aug-2003 | Posts: 3270 | From: Trondheim, Norway
_________________
B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC
Status: Offline

OneTimer1
Re: The Case for the Complex Instruction Set Computer
Posted on 10-Nov-2023 19:26:51 [ #6 ]
Super Member | Joined: 3-Aug-2015 | Posts: 1112 | From: Germany
@BigD

Quote:

I thought CISC design morphed into a hybrid RISC/CISC eventually. It is all invisible to the users though.

Well, i86 CISC commands were so poor it was nearly a RISC CPU with CISC execution times. ;)
---
CISC:
I once programmed a DEC CPU in assembler; that's what I would call a true CISC CPU.

It was a 3-address machine, meaning an add command looked something like add a,b,c // meaning c = a+b

The 68k is considered to be a 2-address machine only.

This DEC CPU had a command for converting a binary number into BCD and another command for converting the BCD number into ASCII.

And it had a command to copy a full memory block that was (unlike on the i86) not restricted to 64k block sizes.
This was a true CISC machine; compared to it and its addressing modes, the i86 was a poor design. But maybe these simpler designs are better for optimizations: if you don't have complex addressing modes, you have commands that can be divided into simple sequential load/store steps, which are easier to pipeline. I believe it was easier to do those optimizations on the i386 than on a 68k, and that might be one of the reasons why Motorola failed to speed up their CPUs after the 68020.
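Roughly, as a sketch (the decomposition is conceptual, not what any real core actually emits):

    add.l ([8,a0]),d0     ; one complex 68020+ memory indirect instruction
    ; viewed as the simple sequential load/store steps a pipeline prefers:
    movea.l (8,a0),a1     ; load the pointer
    add.l (a1),d0         ; then add from the pointed-to memory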
Last edited by OneTimer1 on 10-Nov-2023 at 11:38 PM.
Last edited by OneTimer1 on 10-Nov-2023 at 07:27 PM.
Status: Offline

matthey
Re: The Case for the Complex Instruction Set Computer
Posted on 11-Nov-2023 1:13:24 [ #7 ]
Elite Member | Joined: 14-Mar-2007 | Posts: 2388 | From: Kansas
@cdimauro

Your RISC pillar points ordered 1, 3, 4, 2 look kind of strange, but I see why you left 2 for last as it is the last to fall. I agree with your main point that L/S should be reexamined rather than taken for granted. I have a few comments to make though.
cdimauro Quote:
In fact, being forced to use only load instructions to load values for use entails four not insignificant things:

o the addition of instructions to be executed;
o the consequent worsening of code density (more instructions occupy more memory space);
o the use of a register in which to load the value before it can be used;
o the stall (of several clock cycles) in the pipeline caused by the second instruction waiting to read the value from the register where it will be loaded (technically this is called a load-to-use penalty).
I agree with all these points and the description of the load-to-use penalty is good. It can be called a load-use or load-to-use penalty, delay, stall or hazard. Both names are used and load-use is probably more common, but load-to-use is more descriptive and seems to be gaining popularity.
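To make the penalty concrete, here is a minimal sketch; the load/store expansion in the comments is generic pseudo-RISC for illustration, not any specific ISA.

    add.l (a0),d0         ; CISC: the memory operand is folded into the ALU op
    ; load/store equivalent (pseudo-RISC):
    ;   load t0,(a0)      ; extra instruction and an extra register t0
    ;   ...               ; load-to-use stall unless an independent instruction fits here
    ;   add  d0,d0,t0     ; can only start once t0 is ready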
It is possible to design a RISC load/store pipeline without an L1 data cache load-to-use penalty, but such designs are uncommon. SuperSPARC and R8000 are listed with a 0 cycle load-use delay along with the 486 and Pentium at the following link.
https://www.tutorialspoint.com/what-is-the-performance-of-load-use-delay-in-computer-architecture
Newer RISC cores tend to have deeper pipelines with increased load-to-use penalties but some designs can avoid them. The SiFive RISC-V U74-MC is an 8 stage superscalar in-order core with no load-to-use penalty.
https://sifive.cdn.prismic.io/sifive/1a82e600-1f93-4f41-b2d8-86ed8b16acba_fu740-c000-manual-v1p6.pdf Quote:
The S7 execution unit is a dual-issue, in-order pipeline. The pipeline comprises eight stages: two stages of instruction fetch (F1 and F2), two stages of instruction decode (D1 and D2), address generation (AG), two stages of data memory access (M1 and M2), and register writeback (WB). The pipeline has a peak execution rate of two instructions per clock cycle, and is fully bypassed so that most instructions have a one-cycle result latency:
o Integer arithmetic and branch instructions can execute in either the AG or M2 pipeline stage. If such an instruction’s operands are available when the instruction enters the AG stage, then it executes in AG; otherwise, it executes in M2.
o Loads produce their result in the M2 stage. There is no load-use delay for most integer instructions. However, effective addresses for memory accesses are always computed in the AG stage. Hence, loads, stores, and indirect jumps require their address operands to be ready when the instruction enters AG. If an address-generation operation depends upon a load from memory, then the load-use delay is two cycles.
o Integer multiplication instructions consume their operands in the AG stage and produce their results in the M2 stage. The integer multiplier is fully pipelined.
o Integer division instructions consume their operands in the AG stage. These instructions have between a 3-cycle and 64-cycle result latency, depending on the operand values.
o CSR accesses execute in the M2 stage. CSR read data can be bypassed to most integer instructions with no delay. Most CSR writes flush the pipeline (a seven-cycle penalty).
The pipeline only interlocks on read-after-write and write-after-write hazards, so instructions may be scheduled to avoid stalls.
The pipeline implements a flexible dual-instruction-issue scheme. Provided there are no data hazards between a pair of instructions, the two instructions may issue in the same cycle, provided the following constraints are met:
o At most one instruction accesses data memory;
o At most one instruction is a branch or jump;
o At most one instruction is an integer multiplication or division operation;
o Neither instruction explicitly accesses a CSR.
The U74-MC core design is similar to the 68060, with the load-to-use delay being eliminated by executing loads in the Address Generation (AG) stage before a possible later instruction execution in the M2 stage. The pointer register has to be ready earlier or there is a change/use penalty instead of a load-to-use penalty, but this is easier to avoid. The worst case change/use penalty is 3 cycles for the 68060 when using a data register as a scaled index register that was just updated, but there are many optimizations, and most address register changes and the most common addressing modes have no penalty (see page 10.10 of the M68060UM where it talks about a "change/use" register stall). Both the U74-MC pipeline and the 68060 pipeline execute instructions early in the AG stage if possible, which makes the result available earlier.

The 68060 can load an immediate value and use it in the late instruction execution of the other pipe, for example (MOVEQ #imm,D0 + ADD.L D0,D1 can execute together as a superscalar pair). The U74-MC core also makes good use of early execution in the AG stage. Superscalar instruction scheduling becomes much easier with this design for both RISC and CISC. So why don't more RISC cores use this design? It's more efficient when both the early instruction execution ALU/AG and the late execution ALU hardware can be used with the same instruction. CISC instructions combine an address and an ALU operation together: ADD.L (4,A0),D0 uses the early AG ALU to calculate (4,A0) and the late ALU performs the ADD. RISC ISAs chose to simplify these instructions by breaking them in two, so few if any RISC instructions will execute both early and late; yet getting rid of the load-to-use penalty and easier instruction scheduling was considered worthwhile for the U74-MC core. This design is much more powerful for a CISC ISA though.
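A sketch of the idea (the pairing and stage usage follow my reading of the M68060UM; treat the exact stage assignments as informed speculation):

    moveq #8,d0           ; executes early (AG stage) in one pipe
    add.l d0,d1           ; issues in the other pipe, executes late, d0 forwarded in time
    add.l (4,a0),d0       ; one CISC op using both ALUs of the same pipe:
                          ; the AG ALU computes 4+A0 early, the late ALU does the ADD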
Easier instruction scheduling and removing the load-to-use penalty make a difference for performance. Going back to my first link.
https://www.tutorialspoint.com/what-is-the-performance-of-load-use-delay-in-computer-architecture Quote:
For traditional scalar processors, load-use delays of one cycle are quite acceptable, since a parallel optimizing ILP-compiler will frequently find an independent instruction to fill the slot following a load.

However, for a superscalar processor with instruction issue rates of 2 and higher, it is much less probable that the compiler can find, for each load instruction, two, three, four, or more independent instructions. Thus, with increasing instruction issue rates in superscalar processors, load-use delays become a bottleneck.

According to these results, an increase of the load-use delay from one to two or three cycles will reduce speed-up considerably. For instance, at an issue rate of 4, a load-use delay of 2 will impede performance by about 30% when compared with a load-use delay of 1. Although these figures are valid only for a certain set of parameters, a general tendency such as this can be expected.
Large load-to-use delays are absolutely a performance killer, as superscalar instruction scheduling around them becomes impossible. The ARM Cortex-A53 with a 3 cycle load-to-use penalty is bad enough with scheduling, but what looks like a good code translation to AArch64 from emu68 turns into a complete performance disaster without scheduling. The later generation ARM Cortex-A55 reduces the load-to-use penalty to 2 cycles, which is huge, but the Cortex-A53 remains popular because it is the smallest 64-bit AArch64 core, giving a competitive cost advantage (this core is several times the size of the also 8 stage superscalar in-order 68060 core though). Even a 2 cycle load-to-use penalty makes instruction scheduling difficult and a requirement for performance. OoO execution can reduce load-to-use penalties, but code should still be scheduled to avoid these stalls, since OoO can't always eliminate them, especially with more energy efficient limited OoO.
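A small scheduling example, assuming a core with a 1 cycle load-to-use penalty:

    ; unscheduled: the add stalls waiting on d0
    move.l (a0),d0
    add.l d0,d2           ; load-to-use stall here
    move.l (a1),d1

    ; scheduled: the independent load fills the slot
    move.l (a0),d0
    move.l (a1),d1        ; fills the load-to-use slot
    add.l d0,d2           ; d0 is ready, no stall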
It is difficult to explain to people, even programmers and developers, why reducing pipeline stalls and making instruction scheduling easier is important, because it is technical. Many of these RISC cores have high peak and theoretical performance but it doesn't matter as it is rarely reached. It's like Hammer's argument that the Pentium FPU has much better performance than the 68060 FPU, which he brings up often, and it may be true, but theoretical peak performance doesn't translate to normal or average performance, as benchmarks show. A good test of a superscalar in-order CPU is how well it performs without instruction scheduling, and the 68060 passes the test, aided by a relatively orthogonal CISC ISA.
Last edited by matthey on 11-Nov-2023 at 01:22 AM.
Status: Offline

matthey
Re: The Case for the Complex Instruction Set Computer
Posted on 11-Nov-2023 4:23:32 [ #8 ]
Elite Member | Joined: 14-Mar-2007 | Posts: 2388 | From: Kansas
BigD Quote:

I thought CISC design morphed into a hybrid RISC/CISC eventually. It is all invisible to the users though.
Most of what RISC introduced was not new but rather a basket of ideas combined into a philosophy. CISC core designs resemble RISC designs in places but there are major differences too. For example, the internal RISC-like fixed size instruction encoding used by the 68060 and some x86 CPU cores uses instructions that are much larger than any RISC encoding, do more work, and more of them can access memory. The execution pipelines also resemble RISC pipelines but there may be differences. It is possible for a core to break powerful CISC instructions down into simple weak RISC instructions to be executed by a traditional RISC pipeline, but that is more work to decrease performance.
Some people would say CISC has remained CISC while RISC has morphed into a RISC/CISC hybrid. This is because CISC is free to borrow from RISC as CISC has no rules. RISC has philosophical ideals which are not well defined and are a moving target drifting toward CISC; the only one remaining that hasn't been violated is that RISC can only access memory with load/store instructions. I don't have a problem with calling modern high performance CPU cores RISC/CISC hybrids though. Legacy CISC and traditional RISC core designs are both dead for high performance CPU cores and what we have is in between, a hybrid of the two. CISC cores moved toward being more RISC-like but RISC ISAs have moved, and are still moving, toward being more CISC-like.
OneTimer1 Quote:
CISC:
I once programmed a DEC CPU in assembler; that's what I would call a true CISC CPU.
It was a 3-address machine, meaning an add command looked something like add a,b,c // meaning c = a+b
VAX?
OneTimer1 Quote:
The 68k is considered to be a 2-address machine only.
This DEC CPU had a command for converting a binary number into BCD and another command for converting the BCD number into ASCII.
And it had a command to copy a full memory block that was (unlike on the i86) not restricted to 64k block sizes.
This was a true CISC machine; compared to it and its addressing modes, the i86 was a poor design. But maybe these simpler designs are better for optimizations: if you don't have complex addressing modes, you have commands that can be divided into simple sequential load/store steps, which are easier to pipeline. I believe it was easier to do those optimizations on the i386 than on a 68k, and that might be one of the reasons why Motorola failed to speed up their CPUs after the 68020.
Check out the POWER ISA which has crazy obscure specialized instructions and datatype support too.

Motorola was significantly increasing the performance with each new 68k CPU. The 68040 design was a disappointment as it ran too hot to be able to clock it up, but the performance/MHz was still good. Most x86 CPUs ran hot too, causing problems, but Intel leveraged the more profitable PC clone market to provide updated and newer chips more often. The 68060 was a better design than the Pentium, but Apple had already left the 68k market and C= was bust, so it became a high end embedded CPU. Motorola didn't clock it up, but good performance/MHz is more desirable than high clock speeds in the embedded market, where it was successful.
The 68060 supported in hardware all the 68020 addressing modes, often with no EA calc penalty and cheaper than on any other 68k CPU.

Effective Address Calculation Times

  Dn               0(0/0)
  An               0(0/0)
  (An)             0(0/0)
  (An)+            0(0/0)
  -(An)            0(0/0)
  (d16,An)         0(0/0)
  (d8,An,Xi*SF)    0(0/0)
  (bd,An,Xi*SF)    1(0/0)
  ([bd,An,Xn],od)  3(1/0)
  ([bd,An],Xn,od)  3(1/0)
  (xxx).W          0(0/0)
  (xxx).L          0(0/0)
  (d16,PC)         0(0/0)
  (d8,PC,Xi*SF)    0(0/0)
  (bd,PC,Xi*SF)    1(0/0)
  #data            0(0/0)
  ([bd,PC,Xn],od)  3(1/0)
  ([bd,PC],Xn,od)  3(1/0)
Sure, the Motorola 68020 ISA designers got carried away with the double memory indirect modes and they are challenging to make fast, at least without OoO execution. They were likely added for OOP code, which is challenging to make fast too despite becoming prevalent. Their timing is reasonable even though they should generally be avoided to allow better instruction scheduling. The (bd,An,Xi*SF) and (bd,PC,Xi*SF) addressing modes unfortunately require an extra EA calc cycle but they are generally uncommon enough not to make much difference in overall performance. Was timing too tight for them? Probably not.

I believe the problem is that the RISC-like internal fixed length instruction format that 68k instructions are converted into is not large enough to hold the bd data. Floating point immediates suffer the same fate of extra cycles to execute. In fact, any 68k instruction over 6 bytes is likely not to execute in a single cycle. The ColdFire v5 core practically uses the 68060 core and just disallowed any instruction over 6 bytes that would have taken more than a single cycle. This is likely an arbitrary decision to save power, as most instructions are less than 6 bytes, but it reduces the performance advantage of CISC. Gunnar von Boehn of the Apollo Core team recommended for ColdFire to support larger instructions.
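Reading the table concretely (the cycle notes just restate the 68060UM timings quoted above):

    move.l (8,a0,d1.l*4),d0         ; (d8,An,Xi*SF): no extra EA calc cycle
    move.l ($1234,a0,d1.l*4),d0     ; (bd,An,Xi*SF): 1 extra EA calc cycle
    move.l ([$1234,a0],d1.l*4),d0   ; ([bd,An],Xn,od): 3 extra cycles plus a memory read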
https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Coldfire-compatible-FPGA-core-with-ISA-enhancement-Brainstorming/td-p/238714 Quote:
Hello,
I work as a chip developer.
While creating a superscalar ColdFire ISA-C compatible FPGA core implementation,
I've noticed some possible "enhancements" of the ISA.
I would like to hear your feedback about their usefulness in your opinion.
Many thanks in advance.

1) Support for BYTE and WORD instructions.

I've noticed that re-adding support for the Byte and Word modes to the ColdFire comes relatively cheap.
The cost in the FPGA of having "Byte, Word, Longword" for arithmetic and logic instructions like
ADD, SUB, CMP, OR, AND, EOR, ADDI, SUBI, CMPI, ORI, ANDI, EORI showed up to be negligible.
Both the FPGA size increase and the impact on the clockrate were insignificant.

2) Support for more/all EA-modes in all instructions

In the current ColdFire ISA the instruction length is limited to 6 bytes, therefore some instructions have EA mode limitations.
E.g. the EA-modes available in the immediate instructions are limited.
That instructions can currently only be 2, 4, or 6 bytes in length reduces the complexity of the Instruction Fetch Buffer logic.
The complexity of this unit increases with the number of options the CPU supports - therefore not supporting a range from 2 to over 20 bytes like the 68K does reduce chip complexity.
Nevertheless, in my tests it showed that adding support for 8 byte encodings came at relatively low cost.
With support for 8 byte instruction lengths, the FPU instructions can now use all normal EA-modes - which makes them a lot more versatile.
MOVE instructions become a lot more versatile, and the immediate instructions can now also operate on memory in much more flexible ways.
With 10 byte instruction length support there are no EA mode limitations from the user's perspective - but in our core it showed that 10 byte support starts to impact the clockrate - with 10 byte support enabled we no longer reached the 200 MHz clockrate in the Cyclone FPGA that the core reached before.

I'm interested in your opinion on the usefulness of having Byte/Word support for arithmetic and logic operations on the ColdFire.
Do you think that re-adding them would improve code density or the ability to operate on byte data?
I would also like to know if you think that re-enabling the EA-modes in all instructions would improve the versatility of the core.
Many thanks in advance.
Gunnar
Gunnar found supporting up to an 8 byte instruction length to be free. Reducing the size of instructions that can be executed in one cycle is ignorant. Intel moved in the other direction, increasing the size of the RISC-like internal instructions to be able to execute more powerful instructions in a single cycle. ColdFire moved the 68k in the weak RISC direction and is now dead, while Intel x86(-64) is the leading high performance CISC architecture in the world. Gunnar's AC68080 FPGA toy requires no additional EA calc time for the (bd,An,Xi*SF) and (bd,PC,Xi*SF) addressing modes. He supported the double memory indirect modes too without trapping, after much complaining about them, although I'm not sure of the timing on them.
Last edited by matthey on 11-Nov-2023 at 04:25 AM.
Status: Offline

cdimauro
Re: The Case for the Complex Instruction Set Computer
Posted on 11-Nov-2023 10:29:48 [ #9 ]
Elite Member | Joined: 29-Oct-2012 | Posts: 4127 | From: Germany
@Karlos

Quote:

Karlos wrote: @cdimauro

I don't really believe CISC exists.

Maybe you've used one to write this comment.

Quote:

It's just a term we use to describe CPUs that typically aren't load/store, full-register-width operation-only/preferred, not-extensively-microcoded machines.
Technically, CISCs are NOT RISCs.

It was the introduction of RISCs which defined what RISCs were and, conversely, what non-RISCs, AKA CISCs, were.

Quote:
There's nothing especially simple about the instructions that common RISC machines can execute. You've got complex masked rotates and the like on PPC

Well, PowerPCs also had load/store multiple registers, which are super complicated...

Quote:

and conditionally executable instructions and similar on ARM.

Not only that. ARM's first processors were entirely microcoded. And they supported utterly complicated instructions (again, load/store multiple registers).

Quote:

They both have many of the same addressing modes as your typical "CISC" machine

Exactly! So, even more complicated...

Quote:

but the key difference is that they are only for load and store rather than an effective address for arbitrary instructions.

No, RISCs, by definition, had 3 other pillars at their foundations. Using L/S instructions was only one of the four.

From this PoV, RISCs were a strict subset of the L/S architectures macrofamily.

Quote:

Karlos wrote: In summary, CISC is just a term to describe something which is not RISC. That's a massive set of possible architecture types.

Besides the above clarification, yes: this is exactly how RISCs were defined, and CISCs as well (being "the opposite").
Quote:
BigD wrote: @cdimauro
I thought CISC design morphed into a hybrid RISC/CISC eventually.

No, CISCs remained exactly the same. By definition.

RISCs... haven't been for a very long time now.

What are nowadays called RISCs are, by definition, a strict subset of CISC architectures: specifically, the subset where only load/store instructions are used for accessing memory.

In fact, it would be much better for those processors to be called L/S instead of RISCs, which is obviously the wrong term.

Quote:

It is all invisible to the users though.

It's exactly the opposite: the ISA, which is the primary thing to look at when talking about computer architectures, is the only thing which is visible to the users (specifically, to the developers).

The ISA is, by itself, good enough (let's say this to simplify the discussion: more details in my previous series of articles) to distinguish a CISC from a RISC.
@kolla
Quote:
kolla wrote:
Well, for sysadmins and the average user, yes: it doesn't matter.

For computer scientists it matters, if they are interested in architectures, compilers, emulators, etc.
@OneTimer1
Quote:
OneTimer1 wrote: @BigD
Quote:
I thought CISC design morphed into a hybrid RISC/CISC eventually. It is all invisible to the users though.

Well, i86 CISC commands were so poor it was nearly a RISC CPU with CISC execution times. ;)

Ehm... x86 was far away from RISCs, even looking at the first of the family.

In fact, the 8086 had close to 100 instructions (I haven't counted them one by one, but we should be around that number), some very complicated (e.g. the REP / "STRING" ones), it required even hundreds of cycles for the execution of some of them, and it had mem-to-mem instructions too.

How close to RISCs all of this could be considered, I don't know...

Quote:
---
CISC:
I once programmed a DEC CPU in assembler; that's what I would call a true CISC CPU.
It was a 3-address machine, meaning an add command looked something like add a,b,c // meaning c = a+b

It was a VAX: one of the most complex architectures ever...

Quote:

The 68k is considered to be a 2-address machine only.

Not really. Only a few instructions had this possibility (with MOVE being the most important and most used). Most instructions had only reg-mem or mem-reg as a possibility.

Quote:
This DEC CPU had a command for converting a binary number into BCD and another command for converting the BCD number into ASCII.
And it had a command to copy a full memory block that was (unlike on the i86) not restricted to 64k block sizes.

The VAX was a 32-bit architecture AFAIR, but the 8086 was only 16-bit, so it was naturally limited to 64kB memory transfers (128kB, in reality, using two different 64kB segments).
However, the 80386 was able to transfer up to 4GB, and x64... "a little bit" more.

Quote:
This was a true CISC machine,

It was a CISC. There's no "true CISC" definition, whatever you want to attribute to it.

Quote:

compared to it and its addressing modes, the i86 was a poor design, but maybe these simpler designs are better for optimizations.

The 8086 was a small enough CPU: you can't compare it to the VAX, which was a monster... In fact, the 8086 was a microprocessor (single-chip), whereas the VAX wasn't...

Quote:

If you don't have complex addressing modes, you have commands,
Commands = instructions, I assume. Quote:
that can be divided into simple sequential load/store steps, which are easier to pipeline. I believe it was easier to do those optimizations on the i386 than on a 68k, and that might be one of the reasons why Motorola failed to speed up their CPUs after the 68020.

No, the problem with Motorola was mostly due to the double-indirection addressing modes of the 68020+.
Status: Offline

cdimauro
Re: The Case for the Complex Instruction Set Computer
Posted on 11-Nov-2023 10:48:47 [ #10 ]
Elite Member | Joined: 29-Oct-2012 | Posts: 4127 | From: Germany
@matthey
Quote:
matthey wrote: @cdimauro Your RISC pillar points ordered 1, 3, 4, 2 look kind of strange but I see why you left 2 for last as it is the last to fall.
Exactly. It was the last pillar and the only one which nowadays matters for the RISCs vs CISCs dispute. That's the reason for the ordering of the points, which looks strange at first.

Quote:
I agree with your main point that L/S should be reexamined rather than taken for granted. I have a few comments to make though.
cdimauro Quote:
In fact, being forced to use only load instructions to load values for use entails four not insignificant things:

o the addition of instructions to be executed;
o the consequent worsening of code density (more instructions occupy more memory space);
o the use of a register in which to load the value before it can be used;
o the stall (of several clock cycles) in the pipeline caused by the second instruction waiting to read the value from the register where it will be loaded (technically this is called a load-to-use penalty).
I agree with all these points and the description of the load-to-use penalty is good. It can be called a load-use or load-to-use penalty, delay, stall or hazard. Both names are used and load-use is probably more common, but load-to-use is more descriptive and seems to be gaining popularity.
I've used load-to-use (like you usually do as well) because it's easier to understand and the article was written to be simple / didactic.

Quote:

It is possible to design a RISC load/store pipeline without an L1 data cache load-to-use penalty, but such designs are uncommon. SuperSPARC and R8000 are listed with a 0 cycle load-use delay along with the 486 and Pentium at the following link.

Interesting. Then I think that SuperSPARC and R8000 should have some special load/result forwarding logic implemented, to catch those cases and avoid the stall.

Quote:
Newer RISC cores tend to have deeper pipelines with increased load-to-use penalties but some designs can avoid them. The SiFive RISC-V U74-MC is an 8 stage superscalar in-order core with no load-to-use penalty.
https://sifive.cdn.prismic.io/sifive/1a82e600-1f93-4f41-b2d8-86ed8b16acba_fu740-c000-manual-v1p6.pdf Quote:
The S7 execution unit is a dual-issue, in-order pipeline. The pipeline comprises eight stages: two stages of instruction fetch (F1 and F2), two stages of instruction decode (D1 and D2), address generation (AG), two stages of data memory access (M1 and M2), and register writeback (WB). The pipeline has a peak execution rate of two instructions per clock cycle, and is fully bypassed so that most instructions have a one-cycle result latency:
o Integer arithmetic and branch instructions can execute in either the AG or M2 pipeline stage. If such an instruction’s operands are available when the instruction enters the AG stage, then it executes in AG; otherwise, it executes in M2.
o Loads produce their result in the M2 stage. There is no load-use delay for most integer instructions. However, effective addresses for memory accesses are always computed in the AG stage. Hence, loads, stores, and indirect jumps require their address operands to be ready when the instruction enters AG. If an address-generation operation depends upon a load from memory, then the load-use delay is two cycles.
o Integer multiplication instructions consume their operands in the AG stage and produce their results in the M2 stage. The integer multiplier is fully pipelined.
o Integer division instructions consume their operands in the AG stage. These instructions have between a 3-cycle and 64-cycle result latency, depending on the operand values.
o CSR accesses execute in the M2 stage. CSR read data can be bypassed to most integer instructions with no delay. Most CSR writes flush the pipeline (a seven-cycle penalty).
The pipeline only interlocks on read-after-write and write-after-write hazards, so instructions may be scheduled to avoid stalls.
The pipeline implements a flexible dual-instruction-issue scheme. Provided there are no data hazards between a pair of instructions, the two instructions may issue in the same cycle, provided the following constraints are met:
o At most one instruction accesses data memory;
o At most one instruction is a branch or jump;
o At most one instruction is an integer multiplication or division operation;
o Neither instruction explicitly accesses a CSR.
The U74-MC core design is similar to the 68060, with the load-to-use delay being eliminated by executing loads in the Address Generation (AG) stage before a possible later instruction execution in the M2 stage. The pointer register has to be ready earlier or there is a change/use penalty instead of a load-to-use penalty, but this is easier to avoid. The worst case change/use penalty is 3 cycles for the 68060 when using a data register as a scaled index register that was just updated, but there are many optimizations, and most address register changes and the most common addressing modes have no penalty (see page 10.10 of the M68060UM where it talks about a "change/use" register stall). Both the U74-MC pipeline and the 68060 pipeline execute instructions early in the AG stage if possible, which makes the result available earlier.
Nice and wise design decision by SiFive for this RISC-V implementation.

However, this should imply some data/result forwarding logic being implemented as well.

The 68060 didn't need it, because it's "natural" (the instruction itself is implemented by waiting for the result from memory, so it doesn't have to "look" at whether the next instruction is using its result).

Quote:
The 68060 can load an immediate value and use it in the late instruction execution of the other pipe, for example (MOVEQ #imm,D0 + ADD.L D0,D1 can execute together as a superscalar pair). The U74-MC core also makes good use of early execution in the AG stage. Superscalar instruction scheduling becomes much easier with this design for both RISC and CISC. So why don't more RISC cores use this design? It's more efficient when both the early instruction execution ALU/AG and the late execution ALU hardware can be used with the same instruction. CISC instructions combine an address and an ALU operation together: ADD.L (4,A0),D0 uses the early AG ALU to calculate (4,A0) and the late ALU performs the ADD. RISC ISAs chose to simplify these instructions by breaking them in two, so few if any RISC instructions will execute both early and late; yet getting rid of the load-to-use penalty and easier instruction scheduling was considered worthwhile for the U74-MC core. This design is much more powerful for a CISC ISA though.
I agree. But still, it's strange that other RISCs don't use a similar trick to SiFive's.

Quote:
Easier instruction scheduling and removing the load-to-use penalty make a difference for performance. Going back to my first link.
https://www.tutorialspoint.com/what-is-the-performance-of-load-use-delay-in-computer-architecture Quote:
For traditional scalar processors, load-use delays of one cycle are quite acceptable, since a parallel optimizing ILP-compiler will frequently find an independent instruction to fill the slot following a load.

However, for a superscalar processor with instruction issue rates of 2 and higher, it is much less probable that the compiler can find, for each load instruction, two, three, four, or more independent instructions. Thus, with increasing instruction issue rates in superscalar processors, load-use delays become a bottleneck.

According to these results, an increase of the load-use delay from one to two or three cycles will reduce speed-up considerably. For instance, at an issue rate of 4, a load-use delay of 2 will impede performance by about 30% when compared with a load-use delay of 1. Although these figures are valid only for a certain set of parameters, a general tendency such as this can be expected.
Large load-to-use delays are absolutely a performance killer, as superscalar instruction scheduling around them becomes impossible. The ARM Cortex-A53 with a 3 cycle load-to-use penalty is bad enough with scheduling, but what looks like a good code translation to AArch64 from emu68 turns into a complete performance disaster without scheduling. The later generation ARM Cortex-A55 reduces the load-to-use penalty to 2 cycles, which is huge, but the Cortex-A53 remains popular because it is the smallest 64-bit AArch64 core, giving a competitive cost advantage (this core is several times the size of the also 8 stage superscalar in-order 68060 core though). Even a 2 cycle load-to-use penalty makes instruction scheduling difficult and a requirement for performance. OoO execution can reduce load-to-use penalties, but code should still be scheduled to avoid these stalls, since OoO can't always eliminate them, especially with more energy efficient limited OoO.

It is difficult to explain to people, even programmers and developers, why reducing pipeline stalls and making instruction scheduling easier is important, because it is technical. Many of these RISC cores have high peak and theoretical performance but it doesn't matter as it is rarely reached. It's like Hammer's argument that the Pentium FPU has much better performance than the 68060 FPU, which he brings up often, and it may be true, but theoretical peak performance doesn't translate to normal or average performance, as benchmarks show. A good test of a superscalar in-order CPU is how well it performs without instruction scheduling, and the 68060 passes the test, aided by a relatively orthogonal CISC ISA.
That's a great explanation, many thanks! What I was looking for was numbers about the load-to-use penalties of microarchitectures, and this is more than I was expecting.

It also clearly shows why CISC architectures still matter and should be used instead of RISCs (except in specific areas, like very small cores in embedded systems, GPUs, etc.).
Status: Offline

cdimauro
Re: The Case for the Complex Instruction Set Computer
Posted on 11-Nov-2023 11:08:51 [ #11 ]
Elite Member | Joined: 29-Oct-2012 | Posts: 4127 | From: Germany
@matthey
Quote:
matthey wrote: BigD Quote:
I thought CISC design morphed into a hybrid RISC/CISC eventually. It is all invisible to the users though.
Most of what RISC introduced was not new but rather a basket of ideas combined into a philosophy. CISC core designs resemble RISC designs in places but there are major differences too. For example, the internal RISC-like fixed size instruction encoding used by the 68060 and some x86 CPU cores uses instructions that are much larger than any RISC encoding, do more work, and more of them can access memory. The execution pipelines also resemble RISC pipelines but there may be differences. It is possible for a core to break powerful CISC instructions down into simple weak RISC instructions to be executed by a traditional RISC pipeline, but that is more work to decrease performance.

Some people would say CISC has remained CISC while RISC has morphed into a RISC/CISC hybrid. This is because CISC is free to borrow from RISC as CISC has no rules. RISC has philosophical ideals which are not well defined and are a moving target drifting toward CISC; the only one remaining that hasn't been violated is that RISC can only access memory with load/store instructions. I don't have a problem with calling modern high performance CPU cores RISC/CISC hybrids though. Legacy CISC and traditional RISC core designs are both dead for high performance CPU cores and what we have is in between, a hybrid of the two. CISC cores moved toward being more RISC-like but RISC ISAs have moved, and are still moving, toward being more CISC-like.
I beg to differ, as I've explained in other comments: RISCs and CISCs are very well defined.

Quote:
OneTimer1 Quote:
CISC:
I once programmed a DEC CPU in assembler; that's what I would call a true CISC CPU.
It was a 3-address machine, meaning an add command looked something like add a,b,c // meaning c = a+b

VAX?
99.9999% sure. Quote:
The 68060 supported in hardware all the 68020 addressing modes, often with no EA calc penalty and cheaper than on any other 68k CPU.

Effective Address Calculation Times

  Dn               0(0/0)
  An               0(0/0)
  (An)             0(0/0)
  (An)+            0(0/0)
  -(An)            0(0/0)
  (d16,An)         0(0/0)
  (d8,An,Xi*SF)    0(0/0)
  (bd,An,Xi*SF)    1(0/0)
  ([bd,An,Xn],od)  3(1/0)
  ([bd,An],Xn,od)  3(1/0)
  (xxx).W          0(0/0)
  (xxx).L          0(0/0)
  (d16,PC)         0(0/0)
  (d8,PC,Xi*SF)    0(0/0)
  (bd,PC,Xi*SF)    1(0/0)
  #data            0(0/0)
  ([bd,PC,Xn],od)  3(1/0)
  ([bd,PC],Xn,od)  3(1/0)
Sure, the Motorola 68020 ISA designers got carried away with the double memory indirect modes and they are challenging to make fast, at least without OoO execution. They were likely added for OOP code, which is challenging to make fast too despite becoming prevalent. Their timing is reasonable even though they should generally be avoided to allow better instruction scheduling.
IMO double indirection made/makes sense only for the 68020's jump instructions.

What I don't like about the 68k's JMP/JSR instructions is that they work like the LEA and PEA instructions: they calculate the EA and then just use it as it is.

Instead, many other CISC architectures implemented the equivalent instructions like any other instruction which accesses memory: load the address from the calculated EA. This still allows supporting OOP programming (and, before that, traditional function pointers) without complicating the architecture much.
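For example (a sketch; the double indirect form is the one case where the extra indirection pays off):

    jsr ([a0,d0.l*4])      ; 68020+: one memory read fetches the function pointer, then jump
    ; pre-68020 equivalent, with the load made explicit:
    movea.l (a0,d0.l*4),a1 ; load the function pointer
    jsr (a1)               ; plain JSR: the calculated EA is the target itself

Quote: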
The (bd,An,Xi*SF) and (bd,PC,Xi*SF) addressing modes unfortunately require an extra EA calc cycle but they are generally uncommon enough not to make much difference in overall performance. Was timing too tight for them? Probably not.
Unfortunately, they are quite common. Maybe not with the index register, but large offsets on the much more common (bd,An) addressing mode can often be found.

Take a look at the numbers reported in this article which I wrote years ago: https://www.appuntidigitali.it/18192/statistiche-su-x86-x64-parte-5-indirizzamento-verso-la-memoria/ (in Italian).

8-bit offsets are much more frequent, but 16 and 32-bit offsets cannot be ignored.

Quote:
I believe the problem is that the RISC-like internal fixed length instruction format that 68k instructions are converted into is not large enough to hold the bd data. Floating point immediates suffer the same fate of extra cycles to execute. In fact, any 68k instruction over 6 bytes is likely not to execute in a single cycle. The ColdFire v5 core practically uses the 68060 core and just disallowed any instruction over 6 bytes that would have taken more than a single cycle. This is likely an arbitrary decision to save power, as most instructions are less than 6 bytes, but it reduces the performance advantage of CISC. Gunnar von Boehn of the Apollo Core team recommended for ColdFire to support larger instructions.
https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Coldfire-compatible-FPGA-core-with-ISA-enhancement-Brainstorming/td-p/238714 Quote:
Hello,
I work as a chip developer.
While creating a superscalar ColdFire ISA-C compatible FPGA core implementation,
I've noticed some possible "enhancements" of the ISA.
I would like to hear your feedback about their usefulness in your opinion.
Many thanks in advance.

1) Support for BYTE and WORD instructions.

I've noticed that re-adding support for the Byte and Word modes to the ColdFire comes relatively cheap.
The cost in the FPGA of having "Byte, Word, Longword" for arithmetic and logic instructions like
ADD, SUB, CMP, OR, AND, EOR, ADDI, SUBI, CMPI, ORI, ANDI, EORI showed up to be negligible.
Both the FPGA size increase and the impact on the clockrate were insignificant.

2) Support for more/all EA-modes in all instructions

In the current ColdFire ISA the instruction length is limited to 6 bytes, therefore some instructions have EA mode limitations.
E.g. the EA-modes available in the immediate instructions are limited.
That instructions can currently only be 2, 4, or 6 bytes in length reduces the complexity of the Instruction Fetch Buffer logic.
The complexity of this unit increases with the number of options the CPU supports - therefore not supporting a range from 2 to over 20 bytes like the 68K does reduce chip complexity.
Nevertheless, in my tests it showed that adding support for 8 byte encodings came at relatively low cost.
With support for 8 byte instruction lengths, the FPU instructions can now use all normal EA-modes - which makes them a lot more versatile.
MOVE instructions become a lot more versatile, and the immediate instructions can now also operate on memory in much more flexible ways.
With 10 byte instruction length support there are no EA mode limitations from the user's perspective - but in our core it showed that 10 byte support starts to impact the clockrate - with 10 byte support enabled we no longer reached the 200 MHz clockrate in the Cyclone FPGA that the core reached before.

I'm interested in your opinion on the usefulness of having Byte/Word support for arithmetic and logic operations on the ColdFire.
Do you think that re-adding them would improve code density or the ability to operate on byte data?
I would also like to know if you think that re-enabling the EA-modes in all instructions would improve the versatility of the core.
Many thanks in advance.
Gunnar
Gunnar found supporting up to an 8 byte instruction length to be free.
It would be interesting to know why going over 8 bytes isn't "free" anymore. I mean: which technical constraints cause this.

This because the 68020+ supported instructions up to 22 bytes in size, so a processor which implements this architecture has to take them into account anyway.
Status: Offline

AmigaNoob
Re: The Case for the Complex Instruction Set Computer
Posted on 11-Nov-2023 19:49:39 [ #12 ]
Member | Joined: 14-Oct-2021 | Posts: 15 | From: Unknown
ARM is not even a pure load-store architecture.
Status: Offline

kolla
Re: The Case for the Complex Instruction Set Computer
Posted on 11-Nov-2023 20:54:03 [ #13 ]
Elite Member | Joined: 20-Aug-2003 | Posts: 3270 | From: Trondheim, Norway
Back in those days, VAX assembly was considered a high level language.

"There be VAXen" - Dave Haynie in Deathbed Vigil.

(My sysadmin career started with Ultrix on VAX, and a wee bit of VMS.)

Last edited by kolla on 11-Nov-2023 at 08:54 PM.
_________________
B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC
Status: Offline

matthey
Re: The Case for the Complex Instruction Set Computer
Posted on 12-Nov-2023 1:27:40 [ #14 ]
Elite Member | Joined: 14-Mar-2007 | Posts: 2388 | From: Kansas
cdimauro Quote:
Interesting. Then I think that SuperSPARC and R8000 should have some special load/result forwarding logic implemented, to catch those cases and avoid the stall.
Forwarding logic has a limitation that results can't be forwarded back in time. Traditional RISC pipelines will have at least a 1 cycle load-to-use penalty.
https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_B._Pipeline_interlock https://www.tutorialspoint.com/how-to-remove-load-use-delay-in-computer-architecture
Removing the load-to-use delay requires a more complex pipeline design than a traditional RISC pipeline design, even with more stages added, which can increase the load-to-use penalty as well as the branch mispredict penalty depending on where stages are added. There is a nice diagram of the delays for the Berkeley RISC-V SonicBoom OoO CPU cores in the following paper.
SonicBOOM: The 3rd Generation Berkeley Out-of-Order Machine https://carrv.github.io/2020/papers/CARRV2020_paper_15_Zhao.pdf
RISC-V is the 5th generation RISC ISA developed at Berkeley. Is the SonicBoom CPU core the new RISC reference design? Can we see why OoO is so important to high performance RISC cores, or at least this design? Why does a RISC design waste transistors on the fetch instruction queue like CISC designs? Is there an advantage to decoupling the fetch pipeline from the execution pipeline? Can RISC-V fill the instruction queue with instructions as powerful as CISC's, or is it already at a disadvantage due to the "RISC" ISA?
cdimauro Quote:
Nice and wise design decision by SiFive for this RISC-V implementation.
However, this should imply some data/result forwarding logic being implemented as well.
Correct. Result forwarding is standard on all but the most primitive pipelines. There were some early RISC pipelines which avoided it for simplicity, but that philosophy didn't last any longer than the MIPS idea of avoiding pipeline interlock logic by requiring the compiler to insert NOPs to avoid stalls at the cost of code density. MIPS stands for Microprocessor without Interlocked Pipelined Stages, even though this simplification was removed due to poor instruction cache performance. This is described in my first link of this post. What happened to simple RISC?
cdimauro Quote:
The 68060 didn't need it, because it's "natural" (the instruction itself is implemented by waiting for the result from memory, so it doesn't have to "look" at whether the next instruction is using its result).
The 68060 pipeline uses result forwarding too. You are correct that this pipeline design is more "natural" for CISC, as forwarding is not needed within a CISC instruction which has an EA calc followed by an ALU calc. Providing results early at the EA calc stage and forwarding them avoids and/or reduces stalls for other instructions, including instructions which are superscalar issued at the same time.
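For instance (a generic forwarding sketch, not specific to any one core):

    add.l d1,d0           ; the ALU produces d0 at the end of its execute stage
    sub.l d0,d2           ; consumes d0 in the very next cycle: the forwarding network
                          ; feeds the ALU output straight back in, no wait for writeback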
cdimauro Quote:
I agree. But still, it's strange that other RISCs don't use a similar trick to SiFive's.
More complexity and hardware is required as there are 2 ALUs in each execution pipeline. A RISC instruction will usually use either the early EA ALU calc or the late ALU calc, but there are fewer opportunities to use both without an addressing mode and ALU operation encoded in a single CISC instruction. Instruction fusion could join RISC instructions into CISC-like instructions internally to avoid load-to-use stalls, but this requires more logic and at least CISC-like addressing modes, which RISC-V avoided to remain simple. AArch64 would be a better candidate for this. Still, SiFive architects chose this pipeline design likely to reduce stalls and improve average performance at the expense of theoretical performance.
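A fusion sketch (hypothetical; the pseudo-RISC pair in the comments is illustrative and no shipping core is claimed to do exactly this):

    ; a load+use pair (pseudo-RISC):
    ;   load t0,(a0)
    ;   add  d0,d0,t0
    ; fused internally, the pair behaves like the single CISC instruction:
    add.l (a0),d0         ; EA calc early, ALU op late, no load-to-use stall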
cdimauro Quote:
That's a great explanation, many thanks! What I was looking for was numbers about the load-to-use penalties of microarchitectures, and this is more than I was expecting.
Stall penalties are often under-documented. Compiler developers can even have trouble finding the info needed for writing an instruction scheduler. Some info can be gained from examining benchmarks. You are probably familiar with the 7-Zip benchmark.
https://www.7-cpu.com/
The benchmark results for CPU cores often contain the info.
https://www.7-cpu.com/cpu/Cortex-A53.html Quote:
ARM Cortex-A53
L1 Data Cache Latency = 3 cycles for simple access via pointer
L1 Data Cache Latency = 3 cycles for access with complex address calculation
L2 Cache Latency = 15 cycles
RAM Latency = 15 cycles + 128 ns
L1 data cache load-to-use penalty = 3 cycles
L2 cache load-to-use penalty = 15 cycles
RAM load-to-use penalty = 15 cycles + 128 ns
This simple mapping doesn't work for some core designs, like CISC or the SiFive U74 core designs.
https://www.7-cpu.com/cpu/SiFive_U74.html Quote:
SiFive U74
L1 Data Cache Latency = 3 cycles for simple access via pointer
L1 Data Cache Latency = 5 cycles for access with complex address calculation
L2 Cache Latency = 26 cycles
RAM Latency = 26 cycles + 145 ns
OoO cores may be able to remove some of the load-to-use penalty, although how effective they are varies depending on the OoO design and how well scheduled the code already is. There is instruction scheduling documentation for OoO cores, and I recall benchmarks showing performance improvements from better scheduled code on OoO cores. Certainly limited OoO cores, like most PPC cores, need instruction scheduling, and load-to-use penalties are still very important.
cdimauro Quote:
It also clearly shows why CISC architectures still matter and should be used instead of RISCs (except in specific areas, like very small cores in embedded systems, GPUs, etc.).
CISC instructions give an opportunity to avoid load-to-use penalties by executing the EA ALU calc and operation ALU calc in the same execution pipeline.
mem-reg:
    add.l (a0),d0     ; replaces: load var, load-to-use stall, add

reg-mem:
    add.l d0,(a0)     ; replaces: load var, load-to-use stall, add, store
    addq.l #1,(a0)    ; replaces: load var, load-to-use stall, add, store

mem-mem:
    move.l (a0),(a1)  ; replaces: load var, load-to-use stall, store

These are very common CISC instruction types on the 68k. How can load/store developers ignore this advantage?
cdimauro Quote:
IMO double indirection made/makes sense only for the 68020's jump instructions.

What I don't like about the 68k's JMP/JSR instructions is that they work like the LEA and PEA instructions: they calculate the EA and then just use it as it is.

Instead, many other CISC architectures implemented the equivalent instructions like any other instruction which accesses memory: load the address from the calculated EA. This still allows supporting OOP programming (and, before that, traditional function pointers) without complicating the architecture much.
Good point. JMP/JSR using double memory indirect addressing modes only do one memory access and can be fast. The original JMP/JSR behavior could have been better, alright.
cdimauro Quote:
Unfortunately, they are quite common. Maybe not with the index register, but large offsets on the much more common (bd,An) addressing mode can often be found.

I agree that the (bd,An,Xi*SF) addressing mode is important, including variations like you describe. I argued with Gunnar that it should be supported with no EA calc overhead (he considered trapping it along with the double memory indirect modes at first). It is not commonly used in most 68k code, but I did find some code where it is more common (the GCC executable, as I recall). Why is it less common on the 68k than on x86(-64)? A large program on the 68k may be 1MiB while a large program on x86(-64) can be 1GiB. Likewise, x86(-64) programs access larger data as well. The 68k never advanced enough to support new high performance CPUs with more memory like x86(-64). It's important to plan for the future and not just look at current 68k code. I expect even the 68060 designers looked too much at then-current 68k code and optimized too much for the 85% case. This resulted in useful ISA performance advantages becoming slower and programmers using them less. The (bd,An,Xi*SF) addressing mode makes compiler development much easier because it covers the whole addressing range, and the alternatives used for performance reasons are likely to be worse. Look no further than the trampoline tables from not having Bcc.L on the 68000 for the kind of mess that can result from limited range. Even ColdFire finally added Bcc.L back with ISA_C, and they tolerated a lot of this type of mess by clipping the wings of the ISA so it couldn't fly.
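The trampoline mess looks like this (a sketch; Bcc on the 68000 only reaches a 16-bit displacement, so the condition gets inverted around an absolute jump):

    ; 68000, no Bcc.L: a conditional branch to a far target needs a trampoline
    bcc.s skip            ; inverted condition hops over...
    jmp far_target        ; ...an absolute jump that can reach anywhere
skip:
    ; 68020+ (and ColdFire ISA_C) with Bcc.L: one instruction
    bcs.l far_target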
https://www.nxp.com/docs/en/white-paper/V1CFWP.pdf Quote:
The original ISA definition minimized the support for instructions referencing byte- and word-sized operands compared to the M68K family. Full support for the move byte (MOVE.B) and move word (MOVE.W) instructions was provided, but the only other opcodes supporting these data types were clear (CLR) and test (TST). Based on the input from compiler writers and users, the ColdFire ISA has been improved with the ISA_B and ISA_C extensions. These expanded ISA definitions have improved performance and code density in three areas:
1. Enhanced support for byte- and word-sized operands through added move and compare operations
2. Enhanced support for position-independent code
3. Improved support for certain types of bit manipulation operators
Accordingly, the ISA specification for the V1 ColdFire core is defined as ISA_C since code size and performance associated with byte- and word-sized references to the S08 peripheral set is such an important factor. For the specifics of the ISA_C definition, see Chapter 3 of the ColdFire Programmer’s Reference Manual, especially the instruction set cross-reference presented in Table 3-16.
The damage from excessively stripping the 68k was already done though. They lost performance, code density, 68k customers and developers. ColdFire isn't even that bad, but it is frustrating for 68k users who are forced to use it. Continuing the CPU32 ISA instead of ColdFire would have retained more of the embedded market while offering some reasonable simplification over the 68020 ISA. It's about on par with NXP stripping the standard PPC FPU out of the e500v1 and e500v2 cores as used in the P1022 SoC, only to add it back with the e500mc core after many complaints and lost customers.
cdimauro Quote:
It would be interesting to know why going over 8 bytes isn't "free" anymore. I mean: which technical constraints cause this.

This because the 68020+ supported instructions up to 22 bytes in size, so a processor which implements this architecture has to take them into account anyway.
The 68060 supports the full 22 bytes of instructions, but perhaps handles the larger sizes with uCode, while ColdFire just lopped off that support completely and required all instructions to be 6 bytes or less. The internal RISC-like fixed length instruction size impacts what can be executed in a single cycle, as it has to contain all the information for the complex 68k instruction to execute in a single cycle. I believe this internal fixed length instruction format is 96 bits; most of it is the 68k instruction and extension words, but it also has early decode data.
instruction word     16 bits (2 bytes)
extension words      32 bits (4 bytes)
early decode data    48 bits (6 bytes)
---
total                96 bits (12 bytes)
The variable length 68k instruction is fetched 4 bytes/cycle and early decoded into a 96 bit fixed length entry of the instruction buffer, over several cycles if necessary. Multiple cycles are fine as the fetch and early decode are decoupled from the execution pipelines by the instruction buffer. Each instruction buffer entry can only hold a 6 byte 68k instruction, while longer instructions need multiple of the 16 available entries. The 96 bit instruction buffer entry could be increased by 16 bits to 112 bits to allow 8 byte instructions to execute in one cycle. This would allow more immediates and addressing modes like ADD.L #d32,(d16,An) and FADD.S #d32,FPn to possibly execute in a single cycle, and perhaps (bd,An,Xi*SF) with bd=16 bits would start to be used. With the instruction buffer entry increased by another 16 bits to 128 bits, 10 byte 68k instructions could fit in a single entry, opening up (bd,An,Xi*SF) with bd=32 bits. With another 16 bits, to 144 bits, FADD.D #d64,FPn would fit in a single entry. The instructions in an instruction buffer entry become more and more powerful, with the possibility to not only execute in a single cycle but possibly dual issue in instruction pairs.

The max 68k instruction of 22 bytes would require 224 bits per entry and doesn't gain much in performance anymore. With 192 bits, all 68k instructions except the double memory indirect ones would already fit in a single entry, but even this may be impractical or start to slow down the accesses to this likely SRAM instruction buffer. The 68060 limitation of 6 bytes per entry may have been low simply because 85% of instructions are 6 bytes in size or less, while much more powerful instructions could have been placed in an instruction buffer entry.
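To put sizes on that (byte counts per the 68k encodings; worth double-checking against the PRM):

    moveq #1,d0               ; 2 bytes: fits easily
    add.l (4,a0),d0           ; 4 bytes: fits
    add.l #$12345678,d0       ; 6 bytes: at the current single-entry limit
    add.l #$12345678,(4,a0)   ; 8 bytes: would need the 112 bit entry
    fadd.s #1.0,fp0           ; 8 bytes: ditto for single precision FPU immediates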
If there is a critical timing limitation, it is more likely to be on the execution pipeline side of the instruction buffer. This is where 2 instruction buffer entries may undergo additional decoding and are examined for dependencies for superscalar execution. More data takes longer to examine, and a larger instruction buffer is slower. More performance is likely possible with the faster silicon of today. The AC68080 can execute much larger instructions in a single cycle, although I suspect the instruction fetch per cycle is very large and no decoupling of the instruction fetch pipeline and execution pipeline is used. The superscalar decoding and dispatch/issue needs to look at a lot of data, and this requires LUTs, which are slow in an FPGA, especially when there are many choices in what is a large table of possible selections. I hope this gives you an idea with the limited knowledge I have. Even some of this info is educated speculation from multiple sources.
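To be a bit more concrete about that dependency check despite the speculation, here is a toy model in C of what the dispatch has to decide for each candidate pair (register sets as bitmasks; the real pOEP/sOEP pairing rules also cover resources and condition codes, so this is deliberately simplified and the encoding is my own):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Toy macro-op: registers read/written as bitmasks
   (bits 0-7 = D0-D7, bits 8-15 = A0-A7). Illustrative only. */
typedef struct {
    uint16_t reads;
    uint16_t writes;
} MacroOp;

/* Can op2 issue in the same cycle as op1 on an in-order dual-issue core?
   Not if op2 needs op1's result (RAW) or both write the same register (WAW). */
static bool can_dual_issue(MacroOp op1, MacroOp op2)
{
    return (op1.writes & (op2.reads | op2.writes)) == 0;
}

int main(void)
{
    MacroOp op1   = { .reads = 0,                   .writes = 1 << 0 };  /* moveq #1,d0  */
    MacroOp dep   = { .reads = (1 << 0) | (1 << 1), .writes = 1 << 1 };  /* add.l d0,d1  */
    MacroOp indep = { .reads = 1 << 8,              .writes = 1 << 2 };  /* move.l a0,d2 */

    printf("pair with dependent op:   %s\n", can_dual_issue(op1, dep)   ? "yes" : "no");
    printf("pair with independent op: %s\n", can_dual_issue(op1, indep) ? "yes" : "no");
    return 0;
}

Wider entries mean more source fields like these to compare every cycle, which is where the timing pressure comes from.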
Last edited by matthey on 12-Nov-2023 at 01:48 AM. Last edited by matthey on 12-Nov-2023 at 01:42 AM.
|
| Status: Offline |
| | cdimauro
| |
Re: The Case for the Complex Instruction Set Computer Posted on 12-Nov-2023 18:06:04
| | [ #15 ] |
| |
|
Elite Member |
Joined: 29-Oct-2012 Posts: 4127
From: Germany | | |
|
| @matthey
Quote:
matthey wrote: cdimauro Quote:
Interesting. Then I think that SuperSPARC and R8000 should have some special load/result forwarding logic implemented, to catch those cases and avoid the stall.
|
Forwarding logic has a limitation that results can't be forwarded back in time. Traditional RISC pipelines will have at least a 1 cycle load-to-use penalty.
https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_B._Pipeline_interlock https://www.tutorialspoint.com/how-to-remove-load-use-delay-in-computer-architecture
Removing the load-to-use delay requires a more complex pipeline design than a traditional RISC pipeline design, even with more stages added, which can increase the load-to-use penalty as well as the branch mispredict penalty depending on where stages are added. The following pic has a nice diagram of the delays for the Berkeley RISC-V SonicBoom OoO CPU cores.
The diagram is from the following paper.
SonicBOOM: The 3rd Generation Berkeley Out-of-Order Machine https://carrv.github.io/2020/papers/CARRV2020_paper_15_Zhao.pdf
RISC-V is the 5th generation RISC ISA developed at Berkeley. Is the SonicBoom CPU core the new RISC reference design? Can we see why OoO is so important to high performance RISC cores or at least this design? Why does a RISC design waste transistors on the fetch instruction queue like CISC designs? Is there an advantage to decoupling the fetch pipeline from the execution pipeline? Can RISC-V fill the instructions queue with as powerful of instructions as CISC or is it already at a disadvantage due to the "RISC" ISA? |
Those are rhetorical questions to me (in a positive sense, since I fully agree with you).
Thanks a lot for this further data! I might reuse it if I write an article about this topic.
BTW, it's not reported on the chart, but I assume that on BOOMv1 there's still a 4-cycle load-to-use penalty. Or is there simply no forwarding of the loaded data? Quote:
cdimauro Quote:
Nice and wise design decision by SiFive for this RISC-V implementation.
However, this should imply some data/result forwarding logic being implemented as well.
|
Correct. Result forwarding is standard on all but the most primitive pipelines. There were some early RISC pipelines which avoided it for simplicity, but that philosophy didn't last any longer than the MIPS idea of avoiding pipeline interlock logic by requiring the compiler to insert NOPs, at the cost of code density. MIPS stands for Microprocessor without Interlocked Pipelined Stages, even though this simplification was removed due to poor instruction cache performance. This is described in my first link of this post. What happened to simple RISC? |
Answer: they don't exist.
MIPS is a clear example of the RISC failure: they aimed for extreme simplicity, which might have paid off at the beginning, in the first years, but it crippled performance as technology progressed, and the architects had to walk back their original choices. A recurring pattern when talking about RISCs...
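To make that old MIPS rule concrete, here is a toy C pass doing what those early compilers had to do: insert a NOP into the load delay slot whenever the next instruction consumes the just-loaded register (my illustration only, not real compiler code):

#include <stdio.h>

/* Toy instruction: is_load flags a load; dest/src are register numbers. */
typedef struct { const char *text; int is_load; int dest, src1, src2; } Insn;

static void schedule(const Insn *code, int n)
{
    for (int i = 0; i + 1 < n; i++) {
        printf("%s\n", code[i].text);
        /* Early MIPS: a load's result is invisible to the next instruction,
           so the compiler must fill the delay slot (here naively, with a NOP). */
        if (code[i].is_load &&
            (code[i + 1].src1 == code[i].dest || code[i + 1].src2 == code[i].dest))
            printf("nop                  ; load delay slot, wasted code space\n");
    }
    printf("%s\n", code[n - 1].text);
}

int main(void)
{
    const Insn seq[] = {
        { "lw   $t0, 0($a0)",   1, 8, 4, -1 },  /* load into $t0 ($8)      */
        { "addu $t1, $t0, $t2", 0, 9, 8, 10 },  /* consumes $t0 right away */
    };
    schedule(seq, 2);
    return 0;
}

Quote: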
cdimauro Quote:
That's a great explanation, many thanks! What I was looking for was numbers about the load-to-use penalties on microarchitectures and this is more than what I was expecting.
|
Stall penalties are often under-documented. Compiler developers can even have trouble finding the info needed for writing an instruction scheduler. Some info can be gained from examining benchmarks. You are probably familiar with the 7-Zip benchmark.
https://www.7-cpu.com/
The benchmark results for CPU cores often contain the info.
https://www.7-cpu.com/cpu/Cortex-A53.html Quote:
ARM Cortex-A53
L1 Data Cache Latency = 3 cycles for simple access via pointer
L1 Data Cache Latency = 3 cycles for access with complex address calculation
L2 Cache Latency = 15 cycles
RAM Latency = 15 cycles + 128 ns
|
L1 data cache load-to-use penalty = 3 cycles
L2 cache load-to-use penalty = 15 cycles
RAM load-to-use penalty = 15 cycles + 128 ns
This doesn't work for some core designs, like CISC cores or the SiFive U74.
https://www.7-cpu.com/cpu/SiFive_U74.html Quote:
SiFive U74
L1 Data Cache Latency = 3 cycles for simple access via pointer
L1 Data Cache Latency = 5 cycles for access with complex address calculation
L2 Cache Latency = 26 cycles
RAM Latency = 26 cycles + 145 ns
|
|
Yes, I know the 7-zip benchmark very well. However, as you've reported, it's difficult to extract the effective load-to-use penalty from it. Then maybe it's better to take a look at some vendor's manual or slides to get this data.
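When the documentation is silent, the usual trick is to measure it directly: a dependent pointer chase in a buffer that fits in the L1 data cache, so every load's address comes from the previous load and the time per iteration converges on the L1 load-to-use latency. A rough, self-contained sketch (buffer size and iteration count are assumptions to tune per core; divide the nanoseconds by the clock period to get cycles):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     2048        /* 2048 * 4 bytes = 8 KB, fits typical L1 D-caches */
#define ITERS 100000000L

int main(void)
{
    unsigned *buf = malloc(N * sizeof *buf);
    if (!buf) return 1;
    for (unsigned i = 0; i < N; i++)   /* one big cycle through the buffer */
        buf[i] = (i + 1) % N;

    unsigned p = 0;
    clock_t t0 = clock();
    for (long i = 0; i < ITERS; i++)
        p = buf[p];                    /* each address depends on the last load */
    clock_t t1 = clock();

    printf("%.2f ns per dependent load (p=%u)\n",
           (double)(t1 - t0) * 1e9 / CLOCKS_PER_SEC / ITERS, p);
    free(buf);
    return 0;
}

Printing p keeps the compiler from optimizing the dependent chain away.

Quote: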
cdimauro Quote:
It also clearly shows why CISC architectures still matter and should be used instead of RISCs (except in specific areas, like very small cores in embedded systems, or GPUs, etc.).
|
CISC instructions give an opportunity to avoid load-to-use penalties by executing the EA ALU calc and operation ALU calc in the same execution pipeline.
mem-reg:
  add.l (a0),d0     ; load var, load-to-use stall, add

reg-mem:
  add.l d0,(a0)     ; load var, load-to-use stall, add, store
  addq.l #1,(a0)    ; load var, load-to-use stall, add, store

mem-mem:
  move.l (a0),(a1)  ; load var, load-to-use stall, store
These are very common CISC instruction types on the 68k. How can load/store developers ignore this advantage? |
Simple: it's the RISC propaganda that demolished CISCs and contributed to hiding their benefits... Quote:
cdimauro Quote:
It would be interesting to know why going over 8 bytes isn't "free" anymore. I mean: which technical constraints come into play here.
This is because the 68020+ supported instructions up to 22 bytes in size, so a processor which implements this architecture has to take them into account anyway.
|
The 68060 supports the full 22 byte instruction size, but perhaps handles the larger sizes with uCode, while ColdFire just lopped off that support completely and required all instructions to be 6 bytes or less. The internal RISC-like fixed-length instruction size limits what can be executed in a single cycle, as it has to contain all the information for the complex 68k instruction to execute in a single cycle. I believe this internal fixed-length instruction format is 96 bits; most of it is the 68k instruction and extension words, but it also carries early decode data.
instruction word    16 bits (2 bytes)
extension words     32 bits (4 bytes)
early decode data   48 bits (6 bytes)
---
total               96 bits (12 bytes)
The variable-length 68k instruction is fetched 4 bytes/cycle and early decoded into a 96-bit fixed-length entry of the instruction buffer, over several cycles if necessary. |
Strange. 6 bytes of early decode data alone looks like a lot, especially in this very critical part. Quote:
Multiple cycles are fine, as the fetch and early decode are decoupled from the execution pipelines by the instruction buffer. Each instruction buffer entry can only hold a 6-byte 68k instruction, while longer instructions need multiple of the 16 available entries. The 96-bit instruction buffer entry could be increased by 16 bits to 112 bits to allow 8-byte instructions to execute in one cycle. This would allow more immediates and addressing modes like ADD.L #d32,(d16,An) and FADD.S #d32,FPn to possibly execute in a single cycle, and perhaps (bd,An,Xi*SF) with bd=16 bits to start to be used. With the instruction buffer entry increased by another 16 bits to 128 bits, 10-byte 68k instructions could fit in a single entry, opening up (bd,An,Xi*SF) with bd=32 bits. With another 16 bits to 144 bits, FADD.D #d64,FPn would fit in a single entry. The instructions in an instruction buffer entry become more and more powerful, with the possibility not only of executing in a single cycle but of dual issuing in instruction pairs. The max 68k instruction of 22 bytes would require 224 bits per entry and doesn't gain much in performance anymore. With 192 bits, all 68k instructions except double memory indirect instructions would already fit in a single entry, but even this may be impractical or start to slow down accesses to this likely SRAM instruction buffer. The 68060 limitation of 6 bytes per entry may have been chosen simply because 85% of instructions are 6 bytes in size or less, while much more powerful instructions could have been placed in an instruction buffer entry. |
68060 was a very simple design, as we know.
However, for a modernized version, a different format for the internal instruction (micro-op) could be used. As an idea, at least 128 bits in size to support:
- FPU instructions with extended immediates (32 + 96 = 128 bits);
- MOVE (bd,An,Xn),(bd,An,Xn) (16 + 2 * 16 + 2 * 32 = 112 bits);
- LEA/PEA/JSR/JMP with any addressing mode (16 + 16 + 2 * 32 = 96 bits).
Instructions with a double-indirect mode might be split into 2 micro-ops for each EA which uses a double-indirect mode. A toy C layout of such a format is sketched below.
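Something like this (field names and widths are purely my sketch; note that the FPU extended-immediate case above would leave no room for early decode bits within 128, so it would need a paired entry or a more compressed format):

#include <stdio.h>
#include <stdint.h>

/* Toy 128-bit internal macro-op for a modernized 68060-style core. */
typedef struct {
    uint32_t imm_or_bd1;  /* 32-bit immediate or base displacement #1 */
    uint32_t imm_or_bd2;  /* 32-bit immediate or base displacement #2 */
    uint16_t opword;      /* original 16-bit 68k instruction word     */
    uint16_t ext1, ext2;  /* brief extension / index words            */
    uint16_t early;       /* early decode: length, pipe, EA class     */
} MacroOp128;             /* 2*32 + 4*16 = 128 bits, no padding       */

int main(void)
{
    printf("entry size: %zu bits\n", sizeof(MacroOp128) * 8);  /* 128 */
    return 0;
}

Quote: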
If there is a critical timing limitation, it is more likely to be on the execution pipeline side of the instruction buffer. This is where 2 instruction buffer entries may undergo additional decoding and are examined for dependencies for superscalar execution. More data takes longer to examine, and a larger instruction buffer is slower. More performance is likely possible with the faster silicon of today. The AC68080 can execute much larger instructions in a single cycle, although I suspect the instruction fetch per cycle is very large and no decoupling of the instruction fetch pipeline and execution pipeline is used. The superscalar decoding and dispatch/issue needs to look at a lot of data, and this requires LUTs, which are slow in an FPGA, especially when there are many choices in what is a large table of possible selections. I hope this gives you an idea with the limited knowledge I have. Even some of this info is educated speculation from multiple sources. |
OK, thanks. At least it gives me some ideas about the possible challenges. |
| Status: Offline |
| | matthey
| |
Re: The Case for the Complex Instruction Set Computer Posted on 13-Nov-2023 0:40:56
| | [ #16 ] |
| |
|
Elite Member |
Joined: 14-Mar-2007 Posts: 2388
From: Kansas | | |
|
| cdimauro Quote:
Those are rhetorical questions to me (in a positive sense, since I fully agree with you).
Thanks a lot for this further data! I might reuse it if I write an article about this topic.
BTW, it's not reported on the chart, but I assume that on BOOMv1 there's still a 4-cycle load-to-use penalty. Or is there simply no forwarding of the loaded data?
|
I don't know why the pipeline diagram didn't show the load-to-use penalty for BOOMv1 but it looks like it would be 4 cycles as well.
cdimauro Quote:
Yes, I know the 7-zip benchmark very well. However, as you've reported, it's difficult to extract the effective load-to-use penalty from it. Then maybe it's better to take a look at some vendor's manual or slides to get this data.
|
The original CPU core documentation is the most reliable source of info, but it is not unusual for info to be missing or difficult to understand. If using 2nd-hand sources, it is better to look at multiple sources as there can be mistakes. The 7-zip entry for the U74 core has "Branch misprediction penalty = 4 cycles" which is highly suspect for an 8-stage pipeline. It does have a link to wikichip.org below though.
https://en.wikichip.org/wiki/sifive/microarchitectures/7_series Quote:
The largest change in the 7 Series is the overhaul of the memory subsystem. The data cache and the optional tightly integrated memory (TIM) can now span two cycles, enabling large SRAM/TIM to be included with the core. Additionally, two sets of ALUs have been incorporated into the pipeline in order to allow a zero cycle load-to-use latency where the first stage is used for the address generation and the last stage can be used to operate on the data.
|
There is no mention of the branch misprediction penalty, but it gives 2.5 DMIPS/MHz for the U74 core. This is competitive with the likewise in-order, 8-stage superscalar Cortex-A53, and the 7-zip results back it up. The SiFive U74 core outperforms the Cortex-A53 and competes with the later generation Cortex-A55.
single core  | compression/MHz | decompression/MHz
SiFive_U74          0.70               0.92
Cortex-A53          0.56               0.92
Cortex-A55          0.63               1.03
IBM_Cell_PPE        0.23               0.33
IBM_PPC_G5          0.49               0.82
POWER9              1.08               0.83
The first 4 cores above are in-order cores while the last 2 are OoO PPC/Power cores, since how poorly the PPC G5 and Cell performed came up in another thread and some Amiga users want AmigaOS 4 for the POWER9 Blackbird. The in-order SiFive U74 destroys the in-order Cell PPE and even outperforms the OoO PPC G5 in this benchmark, while its decompression is better than the OoO POWER9's. The SiFive U74 uses a 28nm process, which is better than the older PPC cores except the POWER9, but all of these PPC cores have a huge advantage in peak processing power and transistor counts (Cell has 256 GFlops of single precision fp performance and uses 234 million transistors, for example). The PPC cores need perfectly scheduled code or they stall a lot, especially the Cell. The SiFive U74 in-order core is a miser that uses a 68060-like pipeline to reduce stalls, even though the weak RISC-V ISA is rarely if ever able to use both ALUs "incorporated into the pipeline in order to allow a zero cycle load-to-use latency where the first stage is used for the address generation and the last stage can be used to operate on the data". A CISC ISA is required to take full advantage of both ALUs, as the 68060 pipeline could, and that provides incredible performance potential, even for an in-order core.
cdimauro Quote:
Strange. 6 bytes of early decode data alone looks like a lot, especially in this very critical part.
|
The early decode data in the instruction buffer needs to be as simple as possible and appear in a set location or more logic will be needed to "decode" it for the execution pipelines. Adding more bits to the fixed length instruction buffer entries isn't nearly as expensive as adding more bits into the ISA instruction encoding. The instruction buffer SRAM cost in transistors is relatively cheap as long as the timing doesn't become a problem.
cdimauro Quote:
68060 was a very simple design, as we know.
However, for a modernized version, a different format for the internal instruction (micro-op) could be used. As an idea, at least 128 bits in size to support:
- FPU instructions with extended immediates (32 + 96 = 128 bits);
- MOVE (bd,An,Xn),(bd,An,Xn) (16 + 2 * 16 + 2 * 32 = 112 bits);
- LEA/PEA/JSR/JMP with any addressing mode (16 + 16 + 2 * 32 = 96 bits).
Instructions with a double-indirect mode might be split into 2 micro-ops for each EA which uses a double-indirect mode.
|
The 68060 design was very complex and advanced for the mid-90s but it would be relatively simple by today's standards and comparable to cores like the ARM Cortex-A53 and SiFive U74.
I wouldn't call the fixed length internal instruction format a micro-op. The original 68k instruction is not broken down but expanded. It would be closer to an AMD macro-op.
https://en.wikichip.org/wiki/macro-operation Quote:
AMD refers to a more simplified fixed-length operation as macro-ops (sometimes also Complex-Op or COPs). In their context, macro-operations are a fixed-length operation that may be composed of a memory operation and an arithmetic operation. For example, a single MOP can perform a read, modify, and write operation. Another way of describing MOPs is x86 instructions that have undergone a number of transformations to make them fit into a more strict, but still complex, format. In Intel's context, no such concept exists.
|
The 68060 does less decoding as the instructions are a fixed length translation of the variable length 68k encoding that can often be executed without further decomposition. ColdFire simply reduced the size of 68k instructions to 6 bytes allowing an internal fixed length encoding for all of them which they then called a variable length RISC architecture. Does this mean expanding the 68060 internal fixed length instruction size to hold all 68k instructions would also make it a variable length RISC architecture? A larger internal fixed length instruction allows more 68k instructions to be superscalar executed in a single cycle which is more RISC like too. I don't know how large the internal instruction format can practically grow but we both see the performance potential of increasing it. The current 6 byte instruction size limitation looks like a performance bottleneck even though it was good enough for the 68060 to have integer performance/MHz that was better than some OoO cores like the PPC 601 and 603.
Last edited by matthey on 13-Nov-2023 at 01:00 AM. Last edited by matthey on 13-Nov-2023 at 12:57 AM. Last edited by matthey on 13-Nov-2023 at 12:50 AM.
|
| Status: Offline |
| | AmigaMac
| |
Re: The Case for the Complex Instruction Set Computer Posted on 13-Nov-2023 2:15:17
| | [ #17 ] |
| |
|
Super Member |
Joined: 26-Oct-2002 Posts: 1108
From: 3rd Rock from the Sun! | | |
|
| @cdimauro
🙄 _________________
|
| Status: Offline |
| | cdimauro
| |
Re: The Case for the Complex Instruction Set Computer Posted on 13-Nov-2023 5:49:19
| | [ #18 ] |
| |
|
Elite Member |
Joined: 29-Oct-2012 Posts: 4127
From: Germany | | |
|
| @AmigaMac
Quote:
AmigaMac wrote: @cdimauro
🙄 |
What's the (your) problem?
@matthey
Quote:
matthey wrote: cdimauro Quote:
Yes, I know very well the 7-zip benchmark. However and as you've reported, it's difficult to extract the effective load-to-use penalty. Then maybe it's better to take a look at some vendor's manual or slide to take this data.
|
The original CPU core documentation is the most reliable source of info but it is not unusual for info to be missing or difficult to understand. If using 2nd hand sources, it is better to look at multiple sources as there can be mistakes. The 7-zip entry for the U74 core has "Branch misprediction penalty = 4 cycles" which is highly suspect for an 8 stage pipeline. It does have a link to wikichips.com below though. |
Indeed. But this is just one part of the picture, because the branch penalty can be 6 cycles as well:
Mispredicted branches usually incur a four-cycle penalty, but sometimes the branch resolves later in the execution pipeline and incurs a six-cycle penalty instead. Mispredicted indirect jumps incur a six-cycle penalty.
It depends on where the misprediction happened. Quote:
There is no mention of the branch misprediction penalty, but it gives 2.5 DMIPS/MHz for the U74 core. This is competitive with the likewise in-order, 8-stage superscalar Cortex-A53, and the 7-zip results back it up. The SiFive U74 core outperforms the Cortex-A53 and competes with the later generation Cortex-A55.
single core  | compression/MHz | decompression/MHz
SiFive_U74          0.70               0.92
Cortex-A53          0.56               0.92
Cortex-A55          0.63               1.03
IBM_Cell_PPE        0.23               0.33
IBM_PPC_G5          0.49               0.82
POWER9              1.08               0.83
The first 4 cores above are in-order cores while the last 2 are OoO PPC/Power cores, since how poorly the PPC G5 and Cell performed came up in another thread and some Amiga users want AmigaOS 4 for the POWER9 Blackbird. The in-order SiFive U74 destroys the in-order Cell PPE and even outperforms the OoO PPC G5 in this benchmark, while its decompression is better than the OoO POWER9's. The SiFive U74 uses a 28nm process, which is better than the older PPC cores except the POWER9, but all of these PPC cores have a huge advantage in peak processing power and transistor counts (Cell has 256 GFlops of single precision fp performance and uses 234 million transistors, for example). The PPC cores need perfectly scheduled code or they stall a lot, especially the Cell. |
Indeed. SiFive did a great job with the U74.
The Cell PPE is very poor, but we have known that for a very long time. Only die-hard PowerPC fanatics could have asked for a port of their favorite OS to it. Quote:
cdimauro Quote:
68060 was a very simple design, as we know.
However, for a modernized version, a different format for the internal instruction (micro-op) could be used. As an idea, at least 128 bits in size to support:
- FPU instructions with extended immediates (32 + 96 = 128 bits);
- MOVE (bd,An,Xn),(bd,An,Xn) (16 + 2 * 16 + 2 * 32 = 112 bits);
- LEA/PEA/JSR/JMP with any addressing mode (16 + 16 + 2 * 32 = 96 bits).
Instructions with a double-indirect mode might be split into 2 micro-ops for each EA which uses a double-indirect mode.
|
The 68060 design was very complex and advanced for the mid-90s but it would be relatively simple by today's standards and comparable to cores like the ARM Cortex-A53 and SiFive U74.
I wouldn't call the fixed length internal instruction format a micro-op. The original 68k instruction is not broken down but expanded. It would be closer to an AMD macro-op.
https://en.wikichip.org/wiki/macro-operation Quote:
AMD refers to a more simplified fixed-length operation as macro-ops (sometimes also Complex-Op or COPs). In their context, macro-operations are a fixed-length operation that may be composed of a memory operation and an arithmetic operation. For example, a single MOP can perform a read, modify, and write operation. Another way of describing MOPs is x86 instructions that have undergone a number of transformations to make them fit into a more strict, but still complex, format. In Intel's context, no such concept exists.
|
|
Yes, macro-op is more adequate for the 68060's format. Intel used micro-ops. Quote:
The 68060 does less decoding as the instructions are a fixed length translation of the variable length 68k encoding that can often be executed without further decomposition. ColdFire simply reduced the size of 68k instructions to 6 bytes allowing an internal fixed length encoding for all of them which they then called a variable length RISC architecture. Does this mean expanding the 68060 internal fixed length instruction size to hold all 68k instructions would also make it a variable length RISC architecture? A larger internal fixed length instruction allows more 68k instructions to be superscalar executed in a single cycle which is more RISC like too. I don't know how large the internal instruction format can practically grow but we both see the performance potential of increasing it. The current 6 byte instruction size limitation looks like a performance bottleneck even though it was good enough for the 68060 to have integer performance/MHz that was better than some OoO cores like the PPC 601 and 603.
|
There could be problems with extending the MOP length too much. Clock skew might happen or, in general, timing could be affected.
To save a full 68020+ instruction as a single MOP you need 22 x 8 + 48 (assuming 6 bytes of early decode data) = 224 bits.
Another thing to consider is that you'd be wasting space most of the time, which means that the instruction queue would be limited to far fewer entries.
As usual, a solution is always a trade-off. However, I think that it's better to spend the transistor budget on increasing the number of entries and treat the special / much more complex instructions with separate solutions (e.g.: using two or more entries in the queue; AFAIR the Pentium 4 / 64-bit did something similar to handle 64-bit immediates).
Last edited by cdimauro on 18-Nov-2023 at 04:56 AM.
|
| Status: Offline |
| | matthey
| |
Re: The Case for the Complex Instruction Set Computer Posted on 13-Nov-2023 21:28:51
| | [ #19 ] |
| |
|
Elite Member |
Joined: 14-Mar-2007 Posts: 2388
From: Kansas | | |
|
| cdimauro Quote:
Indeed. But this is just one part of the picture, because the branch penalty can be 6 cycles as well:
Mispredicted branches usually incur a four-cycle penalty, but sometimes the branch resolves later in the execution pipeline and incurs a six-cycle penalty instead. Mispredicted indirect jumps incur a six-cycle penalty.
It depends on where the misprediction happened.
|
Even a 6-cycle branch mispredict penalty for an 8-stage pipeline is good, while getting 4 cycles for this penalty is amazing (the Cortex-A53 is 7 cycles, the 68060 usually 7 and rarely 8 cycles). This is great for an embedded core as it reduces jitter. I like the SiFive U74 core design philosophy of minimizing stalls, improving the average case performance and reducing instruction scheduling requirements. This is an awesome and practical little in-order core. I wouldn't mind programming this core at the assembly level, except that simple RISC-V instructions make programming tedious. With designs like this, RISC-V may be able to compete with ARM though.
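The 4 vs 6 cycle numbers fall straight out of where the branch resolves: the penalty is roughly the number of stages of fetched work that gets flushed. A back-of-the-envelope sketch (the stage numbers are my assumptions for an 8-stage pipe, not SiFive's documented ones):

#include <stdio.h>

/* Rough model: penalty = stages between where fetch is redirected
   and where the branch actually resolves. */
static int mispredict_penalty(int resolve_stage, int redirect_stage)
{
    return resolve_stage - redirect_stage;
}

int main(void)
{
    printf("early resolve:  %d cycles\n", mispredict_penalty(5, 1)); /* 4 */
    printf("late resolve:   %d cycles\n", mispredict_penalty(7, 1)); /* 6 */
    printf("indirect jump:  %d cycles\n", mispredict_penalty(7, 1)); /* 6 */
    return 0;
}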
cdimauro Quote:
Indeed. SiFive did a great job with the U74.
The Cell PPE is very poor, but we have known that for a very long time. Only die-hard PowerPC fanatics could have asked for a port of their favorite OS to it.
|
The SiFive U74 and Cortex-A53/A55 cores are simple, small and cheap 8 stage superscalar dual-issue in-order cores. We are likely talking about 10-50 million transistor SoCs that cost as little as $1-$2 with multiple cores. The PPC/POWER cores were or are very expensive.
The 64-bit IBM PPC 970 (G5) used 58 million transistors and the OoO design could dispatch 6 instructions/cycle and had 10 execution units (3+ times the compute power of the in-order cores above and closer to modern x86-64 cores). This was an expensive high-performance design that had difficulty keeping all the execution units busy. The G5's performance can be good but it requires good code which compilers seem to have trouble generating. This is likely partially due to stalls that make instruction scheduling difficult and even affect OoO cores.
The 64 bit IBM Cell processor used 234 million transistors, had up to 4GHz operation and up to 256 GFlops of fp computing performance. The eye popping specs had console business execs salivating despite the high cost. What went wrong? The PPC core was a stripped down in-order core that was more traditional RISC like but with the opposite philosophy from the SiFive U74 core. It wasn't intended to run games on the CPU core but on SIMD units that were separated from the CPU to provide more of them like GPU shader units. This requires special programming and is time consuming compared to using CPU cores with good single core performance. The Cell processor is not a general purpose processor but a specialized SIMD/media processor. The PPC CPU core did not need to be stripped down so far but this saved transistors for more parallel SIMD units and the simplifications with a deep pipeline (24 stages) and a high frequency increased SIMD performance. The PPC CPU core is a horrible general purpose core and very difficult to program with frequent stalls but the CPU was intended to be more like the management CPU of a GPU. GPU parallelism stayed more popular as it allowed CPU cores to stay more general purpose and push single core performance.
The IBM POWER9 CPUs are large to support server/workstation needs, retain POWER compatibility and provide a more versatile range of execution units. A 4-core POWER9 CPU cost $375 for the Blackbird, which is hundreds of times more expensive than the in-order SoC CPU cores above. Despite using a 14nm process, which is usually better than that of the in-order cores, the single core performance is not impressive, especially for the price. It also runs hot with a 90W TDP, requiring a big power supply, heat sink and fan, which increases the already high price. The POWER9 has many execution units but likely requires good quality code to extract performance, which compilers seem to have trouble providing. The expensive and hot running POWER9 CPU resembles the earlier PPC G5 CPU with which it shares heritage. If existing customers didn't need POWER features and compatibility, it would likely already be as dead as PPC.
cdimauro Quote:
Yes, macro-op is more adequate for the 68060's format. Intel used micro-ops.
|
Intel may have used micro-ops and macro-ops. Check out figure 1 of the following article.
I See Dead μops: Leaking Secrets via Intel/AMD Micro-Op Caches https://www.cs.virginia.edu/venkat/papers/isca2021a.pdf
The following are the actual 68060 stages of the instruction fetch pipeline.
68060
1. instruction address generation
2. instruction fetch
3. instruction early decode (to macro-ops with length decode)
4. instruction buffer (macro-op)
The following are possible x86(-64) stages based on side channel reverse engineering from the article above.
Intel x86(-64)
2. instruction fetch
3. instruction early decode (to macro-ops with length decode)
4. macro-op instruction buffer
5. decoder (macro-ops -> micro-ops)
6. instruction decode queue & micro-op cache
The decoupled execution pipelines would then execute the code placed in the buffer/queue at the end of these steps or stages. The 68060 is using fewer decoding stages and executing macro-ops instead of micro-ops like x86(-64). The 68060 only uses one (macro-op) instruction buffer while x86(-64) uses a macro-op buffer, micro-op buffer/queue and a micro-op cache which uses significant resources (the micro-op cache was the source of the vulnerability for this article too). Do you still think 68k instruction decoding is as difficult as x86(-64) decoding?
cdimauro Quote:
There could be problems with extending the MOP length too much. Clock skew might happen or, in general, timing could be affected.
To save a full 68020+ instruction as a single MOP you need 22 x 8 + 48 (assuming 6 bytes of early decode data) = 224 bits.
Another thing to consider is that you'd be wasting space most of the time, which means that the instruction queue would be limited to far fewer entries.
As usual, a solution is always a trade-off. However, I think that it's better to spend the transistor budget on increasing the number of entries and treat the special / much more complex instructions with separate solutions (e.g.: using two or more entries in the queue; AFAIR the Pentium 4 / 64-bit did something similar to handle 64-bit immediates). |
I knew I had seen info on a x86(-64) very large fixed length instruction encoding and I finally found the following.
https://hardwaresecrets.com/inside-pentium-m-architecture/ Quote:
On P6 architecture, each microinstruction is 118-bit long. Pentium M instead of working with 118-bit micro-ops works with 236-bit long micro-ops that are in fact two 118-bit micro-ops.
|
This article only speaks of micro-ops so it is possible that instructions were not broken down as far in the older x86(-64) cores and these micro-ops more closely resemble the macro-ops mentioned above. The P6 came out in 1995 using a similar 500nm chip process to the 1994 68060.
68060 macro-op (96 bits)
P6 micro-op (118 bits)
It likely would have been possible for the 68060 or ColdFire to support an 8 byte instruction size based on the P6. The Pentium M hints that wider yet may have been possible.
A 68060 macro-op likely would not need to support a 22 byte 68k instruction with the current design. This is based on what the execution pipeline hardware can support and the current behavior. The largest 68k instruction is a MOVE using double memory indirect to double memory indirect with longword displacements. The 68060 already creates 2 macro-ops for a MOVE mem,mem and interlocks the execution pipelines to execute the macro-ops together.
pipe1: move mem1,pipe2
pipe2: move pipe1,mem2
Each pipe only has one AG (EA calc) stage and MOVE EA1,EA2 needs 2. I believe breaking a MOVE mem,mem into 2 macro-ops leaves a 16 byte Fop.X #imm,FPn as the longest instruction needing a 176 bit macro-op. The 68060 currently traps extended precision immediates so a 12 byte Fop.D #imm,FPn would only need a 144 bit macro-op. A MOVE mem,mem split in half also has a max instruction size of about 12 bytes although double memory indirect addressing modes usually can't be executed in a single cycle either so splitting into multiple macro-ops is ok. The way the 68060 is currently designed, all single cycle superscalar capable instructions need to be a single macro-op which increases performance. Is a 144 bit macro-op too radical?
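A sketch of that split in C, treating each pipe's macro-op as one EA calculation plus one data movement (my own modeling of the behavior described above, not Motorola's actual format):

#include <stdio.h>

/* Toy macro-op: one EA calc plus a data movement. The 68060 splits
   MOVE mem,mem so each pipe gets exactly one AG (EA calc) stage. */
typedef struct {
    const char *ea;   /* effective address this pipe calculates */
    const char *src;  /* data source */
    const char *dst;  /* data sink   */
} MacroOp;

int main(void)
{
    /* move.l (a0),(a1) becomes two interlocked macro-ops: */
    MacroOp pipe1 = { "(a0)", "mem1",  "pipe2" };  /* load, hand to pipe2     */
    MacroOp pipe2 = { "(a1)", "pipe1", "mem2"  };  /* store what pipe1 loaded */

    printf("pipe1: ea=%s  %s -> %s\n", pipe1.ea, pipe1.src, pipe1.dst);
    printf("pipe2: ea=%s  %s -> %s\n", pipe2.ea, pipe2.src, pipe2.dst);
    return 0;
}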
Last edited by matthey on 14-Nov-2023 at 04:16 PM. Last edited by matthey on 13-Nov-2023 at 10:09 PM. Last edited by matthey on 13-Nov-2023 at 09:34 PM.
|
| Status: Offline |
| | cdimauro
| |
Re: The Case for the Complex Instruction Set Computer Posted on 18-Nov-2023 5:33:36
| | [ #20 ] |
| |
|
Elite Member |
Joined: 29-Oct-2012 Posts: 4127
From: Germany | | |
|
| @matthey
Quote:
matthey wrote: The 64-bit IBM PPC 970 (G5) used 58 million transistors and the OoO design could dispatch 6 instructions/cycle and had 10 execution units (3+ times the compute power of the in-order cores above and closer to modern x86-64 cores). This was an expensive high-performance design that had difficulty keeping all the execution units busy. The G5's performance can be good but it requires good code which compilers seem to have trouble generating. This is likely partially due to stalls that make instruction scheduling difficult and even affect OoO cores. |
That's why an SMT solution would have been a perfect fit for the G5. Quote:
The 64 bit IBM Cell processor used 234 million transistors, had up to 4GHz operation and up to 256 GFlops of fp computing performance. The eye popping specs had console business execs salivating despite the high cost. What went wrong? The PPC core was a stripped down in-order core that was more traditional RISC like but with the opposite philosophy from the SiFive U74 core. It wasn't intended to run games on the CPU core but on SIMD units that were separated from the CPU to provide more of them like GPU shader units. This requires special programming and is time consuming compared to using CPU cores with good single core performance. The Cell processor is not a general purpose processor but a specialized SIMD/media processor. The PPC CPU core did not need to be stripped down so far but this saved transistors for more parallel SIMD units and the simplifications with a deep pipeline (24 stages) and a high frequency increased SIMD performance. The PPC CPU core is a horrible general purpose core and very difficult to program with frequent stalls but the CPU was intended to be more like the management CPU of a GPU. GPU parallelism stayed more popular as it allowed CPU cores to stay more general purpose and push single core performance. |
In fact, this is the reason why the Cell's SPU cores were used to offload some work from the weak GPU that the PS3 had.
However, the same PPE/PowerPC cores were used on the Xbox 360, which had... no SPU cores! So the very weak PowerPC cores had to do all the calculations on their own and send the results to the GPU (which was the strong point of this console); fortunately there were some nice changes to those PPC cores which improved their performance for this task (bigger and more powerful SIMD units and the ability to directly control the data caches). Anyway, this doesn't change the fact that those PowerPC cores were really weak. Quote:
cdimauro Quote:
Yes, macro-op is more adequate for the 68060's format. Intel used micro-ops.
|
Intel may have used micro-ops and macro-ops. Check out figure 1 of the following article.
I See Dead μops: Leaking Secrets via Intel/AMD Micro-Op Caches https://www.cs.virginia.edu/venkat/papers/isca2021a.pdf |
Nice read thanks.
Intel uses macro-ops as a synonym for instructions in this context. They are kept in this bigger format until some transformations are done, and then they are converted to one or more micro-ops, which are the ones sent to the execution queue and finally to the backend. Quote:
The following are the actual 68060 stages of the instruction fetch pipeline.
68060
1. instruction address generation
2. instruction fetch
3. instruction early decode (to macro-ops with length decode)
4. instruction buffer (macro-op) |
How is it possible to generate the address of instructions that aren't decoded and not even fetched yet? Quote:
The following are possible x86(-64) stages based on side channel reverse engineering from the article above.
Intel x86(-64)
2. instruction fetch
3. instruction early decode (to macro-ops with length decode)
4. macro-op instruction buffer
5. decoder (macro-ops -> micro-ops)
6. instruction decode queue & micro-op cache
The decoupled execution pipelines would then execute the code placed in the buffer/queue at the end of these steps or stages. The 68060 is using fewer decoding stages and executing macro-ops instead of micro-ops like x86(-64). |
See above: x64 only executes micro-ops. Quote:
The 68060 only uses one (macro-op) instruction buffer while x86(-64) uses a macro-op buffer, micro-op buffer/queue and a micro-op cache which uses significant resources (the micro-op cache was the source of the vulnerability for this article too). Do you still think 68k instruction decoding is as difficult as x86(-64) decoding? |
You're comparing a 2-way in-order core to a very aggressive 4-way OoO core (Intel's Skylake is a beast; AMD's first Zen is more between Intel's Ivy Bridge and Haswell).
You have to compare the 68060 core to the Pentium core, which is more similar to it. Quote:
cdimauro Quote:
There could be problems with extending the MOP length too much. Clock skew might happen or, in general, timing could be affected.
To save a full 68020+ instruction as a single MOP you need 22 x 8 + 48 (assuming 6 bytes of early decode data) = 224 bits.
Another thing to consider is that you'd be wasting space most of the time, which means that the instruction queue would be limited to far fewer entries.
As usual, a solution is always a trade-off. However, I think that it's better to spend the transistor budget on increasing the number of entries and treat the special / much more complex instructions with separate solutions (e.g.: using two or more entries in the queue; AFAIR the Pentium 4 / 64-bit did something similar to handle 64-bit immediates). |
I knew I had seen info on a x86(-64) very large fixed length instruction encoding and I finally found the following.
https://hardwaresecrets.com/inside-pentium-m-architecture/ Quote:
On P6 architecture, each microinstruction is 118-bit long. Pentium M instead of working with 118-bit micro-ops works with 236-bit long micro-ops that are in fact two 118-bit micro-ops.
|
This article only speaks of micro-ops so it is possible that instructions were not broken down as far in the older x86(-64) cores and these micro-ops more closely resemble the macro-ops mentioned above. |
No, the Pentium M still used 118-bit micro-ops. There's another thing in this part of the article which clarifies how those 236 bits are used:
On P6 architecture, each microinstruction is 118-bit long. Pentium M instead of working with 118-bit micro-ops works with 236-bit long micro-ops that are in fact two 118-bit micro-ops.
Keep in mind that the micro-ops continue to be 118-bit long; what changed is that they are transported in groups of two.
So, they are moved around two at a time to make it faster. But the micro-ops stay 118 bits long and are executed as such by the execution units. Quote:
The P6 came out in 1995 using a similar 500nm chip process to the 1994 68060.
68060 macro-op (96 bits)
P6 micro-op (118 bits)
It likely would have been possible for the 68060 or ColdFire to support an 8 byte instruction size based on the P6. The Pentium M hints that wider yet may have been possible.
A 68060 macro-op likely would not need to support a 22 byte 68k instruction with the current design. This is based on what the execution pipeline hardware can support and the current behavior. The largest 68k instruction is a MOVE using double memory indirect to double memory indirect with longword displacements. The 68060 already creates 2 macro-ops for a MOVE mem,mem and interlocks the execution pipelines to execute the macro-ops together.
pipe1: move mem1,pipe2
pipe2: move pipe1,mem2
Each pipe only has one AG (EA calc) stage and MOVE EA1,EA2 needs 2. I believe breaking a MOVE mem,mem into 2 macro-ops leaves a 16 byte Fop.X #imm,FPn as the longest instruction needing a 176 bit macro-op. The 68060 currently traps extended precision immediates so a 12 byte Fop.D #imm,FPn would only need a 144 bit macro-op. A MOVE mem,mem split in half also has a max instruction size of about 12 bytes although double memory indirect addressing modes usually can't be executed in a single cycle either so splitting into multiple macro-ops is ok. The way the 68060 is currently designed, all single cycle superscalar capable instructions need to be a single macro-op which increases performance. Is a 144 bit macro-op too radical?
|
If you compare it to the original 68060 design and even to the x86 designs, yes: it's very radical.
I'll show you what I've found in the above article (I See Dead µops: Leaking Secrets via Intel/AMD Micro-Op Caches) that you posted before:
64-bit immediate values consume two micro-op slots within a given cache line.
which matches what I was recalling: instructions handling big values internally take two micro-op slots. I assume that this was decided to save space in the micro-op cache, by keeping the micro-ops at a reduced length.
Which makes sense: take into account the more common cases and use more micro-ops for the less common ones.
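A sketch of the idea in C, with a 64-bit immediate borrowing a second fixed-size slot (the slot layout is my assumption, not Intel's documented format):

#include <stdio.h>
#include <stdint.h>

/* Toy micro-op cache slot: room for only a 32-bit immediate, so a
   64-bit immediate has to spill into a continuation slot. */
typedef struct {
    uint8_t  opcode;
    uint32_t imm32;
} Slot;

static int emit_mov_imm64(Slot *line, uint64_t imm)
{
    line[0].opcode = 0x01;                    /* toy "mov, low half"    */
    line[0].imm32  = (uint32_t)imm;
    line[1].opcode = 0x02;                    /* toy "imm continuation" */
    line[1].imm32  = (uint32_t)(imm >> 32);
    return 2;                                 /* two slots consumed */
}

int main(void)
{
    Slot line[2];
    int used = emit_mov_imm64(line, 0x1122334455667788ULL);
    printf("64-bit immediate used %d slots: %08x %08x\n",
           used, (unsigned)line[1].imm32, (unsigned)line[0].imm32);
    return 0;
}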
That simplicity should be used for a modernized 68060 core.
IMO those 48 bits just for early decode information are way too much. I was already wondering about that when you first reported it, and now I'm even more convinced.
If an x86 core can use 118 bits with a maximum instruction length of 15 bytes (which is very artificial: you can reach it only by adding redundant, unused extra prefixes) and perform so well, a modernized 68060 core should move to a similar solution. IMO. |
| Status: Offline |
| |
|
|
|
|