Poster | Thread |
matthey
| |
Re: One major reason why Motorola and 68k failed... Posted on 28-May-2024 22:55:48
| | [ #141 ] |
|
|
|
Elite Member |
Joined: 14-Mar-2007 Posts: 2270
From: Kansas | | |
|
| Gunnar Quote:
Reading from memory/cache/stack is a very limited resource that is very difficult to scale up. On the other hand, reading several registers per cycle is easy and, in comparison, very cost effective to scale up.
Therefore a good design will always favor putting values in registers instead of cache/stack.
|
Sure, cache accesses are more expensive than register accesses. Cache accesses are often unavoidable though and very high performance CPU/FPU cores that would have large register files also usually allow more than one cache access per cycle. RISC cores need both for performance as all data requires a load to become a usable variable for other instructions. There is no advantage for CISC cores to load data that is used once into a register. In fact, there is a disadvantage of a wasted move instruction if a mem-reg instruction can be used. These mem-reg accesses can take advantage of pipelining as even the P5 Pentium with crap stack based FPU is still worthwhile.
https://www.jagregory.com/abrash-black-book/#pentium-floating-point-optimization
Despite the ugly FXCH instructions and handicapped FPU ISA with no orthogonal FPU registers, the pipelined P5 Pentium FPU did outperform most CISC competitors. With pipelining and register renaming, it is obvious a 68k FPU with just 8 FPU registers would have a significant advantage over the P5 Pentium FPU. It looks like 8 FPU registers would be adequate for the Quake Dot Product, Cross Product, Transformation and Projection inlines. Modern games likely use larger matrices but do they do the floating point math in the FPU or SIMD units?
I was reading a paper on cache hints which benchmarked global full cache hints vs one time cache hints. The Quake benchmark was the only benchmark code which did better with one time cache hints and significantly so. A large percentage of the Quake data is used one time and it is unavoidable that the data is loaded once no matter how many registers exist. Well, another possibility is that the data cache was too small resulting in cache thrashing but more registers would not help much.
Gunnar Quote:
Yes, the old FPU code could work OK with 8 regs. But we all know that the old 68K FPU is not pipelined. A pipelined FPU needs unrolled code to make full use of it. To unroll code you need several times the number of registers. This is very simple logic.
Most of today's pipelined FPUs (POWER/Intel/etc.) often have a latency of 6 clocks. This means you generally want to unroll your work loops 4-6 times to be able to eat the latency. For doing this you need 4 to 6 times the number of registers.
|
See above. I still think 16 FPU registers is a good practical number with additional registers likely better used for a SIMD unit supporting floating point.
Gunnar Quote:
We all know that IBM increased the FPU registers to 64 a few years ago for POWER? Why did IBM do this? Because it's very useful for increasing performance. And yes, IBM also has register renaming in addition to this!
Having 32 FPU registers plus being CISC gives the 68080 FPU a huge advantage over the "old 8 registers".
|
It wasn't POWER9 with 64 FPU registers that was chosen for the newest AmigaOS 4 hardware but the PPC QorIQ P1022 which removed the more reasonable but still large 32 PPC FPU registers. Perhaps 32 FPU registers would be the right number for a CISC CPU competing in the workstation and server markets. It is obviously too many standard FPU registers for the desktop market as AmigaOS 4 hardware targets the desktop market and the a1222 doesn't need any standard FPU registers.
Gunnar Quote:
Matt
If you want to learn more then I highly suggest you talk to the people who actually code. Talk to coders writing real FPU code. And talk to coders who wrote performance code for the 68080. You can learn a lot.
If you never code real software, and your "knowledge" is based on Wikipedia ... yes, you can then also contribute to brainstorming and post here in this forum - "as a Wikipedia Quarterback"
For real, serious development more knowledge will be useful.
|
I'm no expert like Gunnar. I'll sit back with my popcorn in my armchair to watch and learn how his FPGA CPU core takes market share away from POWER systems in the high end server/workstation markets. Maybe it will be a race between the A1222 on the desktop to see which of these markets brings back the Amiga first.
Gunnar Quote:
You have a good imagination - we can see this.
Super-AGA is completely different in design from Thomas' NATAMI AGA - and is a completely unrelated development.
Super-AGA is based on a concept of internal DMA buffers and a decoupled prefetcher with the ability to run exactly to pixel timing and also unrelated to pixel timing. This is somewhat an extension of what Haynie already planned in AAA. The reason for this internal design is to fully optimize Super-AGA to be able to make perfect use of modern memory technology.
|
I can believe it was rewritten enough that it is just based on SAGA. So Super-AGA is not the same as SAGA? Couldn't think of a less confusing name for the new version?
Gunnar Quote:
Besides this fundamental design difference, Thomas Hirsch uses AHDL as a coding language, which we don't use. We simulate all our code in ModelSim, and ModelSim is not compatible with AHDL. All our code is 100% written in VHDL, which is a completely different language.
|
Yes, SAGA was supposedly originally written in AHDL. The rumor is that it was converted to VHDL, perhaps by Gunnar, and that Thomas was originally annoyed by his meddling.
Last edited by matthey on 28-May-2024 at 11:10 PM.
|
|
Status: Offline |
|
|
Lou
| |
Re: One major reason why Motorola and 68k failed... Posted on 29-May-2024 0:40:09
| | [ #142 ] |
|
|
|
Elite Member |
Joined: 2-Nov-2004 Posts: 4227
From: Rhode Island | | |
|
| @matthey
Quote:
matthey wrote: Lou Quote:
Since we like benchmarks, here's one I found doing an admittedly simple benchmark, but it exposes why most of the time efficiency is key...especially when it costs less.
https://www.youtube.com/watch?v=2k_jP73Ly7A
Interesting that the SNES CPU benchmark is almost exactly 1/2 of the PC Engine's...while clocked at 1/2 the speed. The Megadrive CPU at 7.6 MHz lost to NEC's version of the 6502 despite being clocked slightly higher and, you know, having them extra 'bits' and registers...and both the SNES and PC Engine were operating on RAM not an internal register.
68K's code was smaller...but who cares? |
Did you read the assembly code? The 68000 did not use moveq or addq which is important as the 68000 uses fewer cycles to fetch smaller code. It is likely the 68000 wins the benchmark contest with these minor changes. Did you read the comments which suggest this?
|
the [q] ops are for small numbers. Writing to Zero Page on a 6502 is faster than an absolute address. Once again advantage 6502. Many revisions of the 6502 supported a relocatable Zero Page pointer.
In the 6502 code, the value is stored to an absolute address (sta RAM_x2000) every loop and to x2001 every 255 cycles. This isn't even happening in the 68000 code. The 68000 code cheats by never writing to memory and is still slower.
Last edited by Lou on 29-May-2024 at 01:03 AM. Last edited by Lou on 29-May-2024 at 01:00 AM. Last edited by Lou on 29-May-2024 at 12:59 AM. Last edited by Lou on 29-May-2024 at 12:50 AM.
|
|
Status: Offline |
|
|
Hammer
| |
Re: One major reason why Motorola and 68k failed... Posted on 29-May-2024 3:57:13
| | [ #143 ] |
|
|
|
Elite Member |
Joined: 9-Mar-2003 Posts: 5859
From: Australia | | |
|
| @matthey
Quote:
It is not that x86 CPUs were so well optimized for integer to floating point conversions but rather that PPC CPUs had a memory bottleneck. Transfers through cached memory are potentially not much worse than passing registers directly between units but most RISC CPUs have a bottleneck because of load-to-use stalls and up to 2 more instructions per transfer to execute compared to CISC with mem-reg and reg-mem operations. Worst case through memory is also much worse. PPC developers likely expected OoO to solve these problems but PPC limited OoO has limited ability to solve these problems. Even more aggressive OoO may have problems due to synchronization of shared resources between units. It may be possible to rewrite the algorithm to use more fp and less mixed fp and integer code if the FPU is high enough performance but that is not an easy fix. The 68060 FPU requires mixed fp and integer code for best performance so many early 3D algorithms for x86 shouldn't perform too bad with the minimalist 68060 FPU. The 68k FPU supports all register conversions and transfers too.
|
Within a smaller transistor budget and at a given clock speed, the RISC CPU has an advantage in arithmetic intensity against the 68030 with its mostly ROM'ed microcode, e.g. ARM60 @ 12.5 MHz vs 68030 @ 50 MHz playing Doom.
At the same clock speed, the 68030's mem-reg and reg-mem advantages weren't enough to close the arithmetic intensity gap against the ARM60.
Since the 68030 has a hardware barrel shifter, a few basic ADD and MUL instructions should also have been implemented in hardware to cover some of the arithmetic intensity gap, i.e. a "68035".
For the Saturn project, Sega rejected the 68030 in favor of the SuperH-2. Motorola was not selling the 68LC040-25 at 68030 @ 50 MHz prices. The near-$100 68EC040 is useless with the DMA'ed devices that are prevalent in game consoles, not just the Amiga. An MMU is a premium feature according to Motorola. It's too bad the 68EC040 didn't have the 68030's cache behavior.
On wholesale prices, Motorola's 68040 prices didn't keep pace with Intel's 486 prices.
These are the early rounds in which Motorola was losing its customers, and this Motorola stupidity was repeated with smart handheld devices.
Motorola lost to ARM9xx-T (with MMU, ARMv4T instruction set) during the rise of smart handhelds. Did Motorola think the 68000-based DragonBall was good enough to battle ARM9?
Last edited by Hammer on 29-May-2024 at 05:53 AM. Last edited by Hammer on 29-May-2024 at 05:51 AM. Last edited by Hammer on 29-May-2024 at 05:50 AM.
_________________ Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68) Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68) Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB |
|
Status: Offline |
|
|
matthey
| |
Re: One major reason why Motorola and 68k failed... Posted on 29-May-2024 4:01:26
| | [ #144 ] |
|
|
|
Elite Member |
Joined: 14-Mar-2007 Posts: 2270
From: Kansas | | |
|
| Lou Quote:
the [q] ops are for small numbers. Writing to Zero Page on a 6502 is faster than an absolute address. Once again advantage 6502. Many revisions of the 6502 supported a relocatable Zero Page pointer.
|
MOVEQ allows 8 bits of immediate data to be moved into a register, which is as large a datatype as many 8 bit CPUs can use. This is possible due to the 16 bit variable length encoding. ADDQ and SUBQ only use 3 bits of encoding for the immediate 1-8 but this covers the most common cases and is a big improvement over the INC and DEC instructions more common on 8 bit CPUs.
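To make the encoding point concrete, here is a minimal Python sketch of how the two "quick" forms pack their immediates into a single 16 bit opcode word. The helper names (`encode_moveq`, `encode_addq_dn`) are hypothetical and only the data register destination form of ADDQ is covered. For comparison, MOVE.L #imm,Dn and ADDI.L #imm,Dn each need the opcode word plus a 32 bit immediate, so 6 bytes instead of 2.

```python
# Hypothetical helpers illustrating the 68000 "quick" immediate encodings.

SIZE_BITS = {"b": 0b00, "w": 0b01, "l": 0b10}  # ADDQ size field


def encode_moveq(value, dreg):
    """MOVEQ #value,Dn -> 0111 rrr 0 dddddddd (signed 8 bit immediate)."""
    if not -128 <= value <= 127:
        raise ValueError("MOVEQ immediate must fit in a signed byte")
    if not 0 <= dreg <= 7:
        raise ValueError("data register must be D0-D7")
    return 0x7000 | (dreg << 9) | (value & 0xFF)


def encode_addq_dn(value, size, dreg):
    """ADDQ.size #value,Dn -> 0101 ddd 0 ss 000 rrr (immediate 1-8)."""
    if not 1 <= value <= 8:
        raise ValueError("ADDQ immediate must be 1-8")
    if not 0 <= dreg <= 7:
        raise ValueError("data register must be D0-D7")
    data = value & 7  # the value 8 is stored as 000 in the 3 bit field
    return 0x5000 | (data << 9) | (SIZE_BITS[size] << 6) | dreg


if __name__ == "__main__":
    print(f"moveq  #1,d0  -> ${encode_moveq(1, 0):04X}")
    print(f"addq.l #1,d0  -> ${encode_addq_dn(1, 'l', 0):04X}")
```

Both instructions fit in one word, which is why swapping them into a tight loop reduces instruction fetches on a 68000 with a 16 bit bus.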
Lou Quote:
In the 6502 code, the value is stored to an absolute address (sta RAM_x2000) every loop and to x2001 every 255 cycles. This isn't even happening in the 68000 code. The 68000 code cheats by never writing to memory and is still slower. |
It's not the fault of the 68000 that the benchmark is unrealistic. The 68000 is very flexible and has many options. The 6502 memory stores are a disadvantage of an accumulator architecture CPU that a CISC CPU doesn't have. It is also possible to combine all 3 loops into one loop using the larger datatypes of the 68000. Using the full capabilities of the CPU isn't cheating. The 68000 is clearly a more capable CPU. Any slow memory access disadvantage the 68000 has is more than compensated by more GP registers reducing memory traffic, memory accesses using larger datatype sizes and powerful addressing modes compared to the 6502.
|
|
Status: Offline |
|
|
Hammer
| |
Re: One major reason why Motorola and 68k failed... Posted on 29-May-2024 6:12:28
| | [ #145 ] |
|
|
|
Elite Member |
Joined: 9-Mar-2003 Posts: 5859
From: Australia | | |
|
| @matthey
Quote:
Despite the ugly FXCH instructions and handicapped FPU ISA with no orthogonal FPU registers, the pipelined P5 Pentium FPU did outperform most CISC competitors. With pipelining and register renaming, it is obvious a 68k FPU with just 8 FPU registers would have a significant advantage over the P5 Pentium FPU. It looks like 8 FPU registers would be adequate for the Quake Dot Product, Cross Product, Transformation and Projection inlines. Modern games likely use larger matrices but do they do the floating point math in the FPU or SIMD units?
|
X86-64v1's standard IEEE FP32 and FP64 are on the SSE2 path. SSE2 supports both scalar and vector (SIMD) use cases.
_________________ Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68) Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68) Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB |
|
Status: Offline |
|
|
Gunnar
| |
Re: One major reason why Motorola and 68k failed... Posted on 29-May-2024 6:23:59
| | [ #146 ] |
|
|
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @matthey
Quote:
Cache accesses are often unavoidable though and very high performance CPU/FPU cores that would have large register files also usually allow more than one cache access per cycle. |
Let's make a real world example to give the discussion some real content. Let's say you want to 3D rotate points for a 3D game.
Here is an example of this math. Arne coded this on the Amiga when he was 14 years old: https://www.youtube.com/watch?v=FgrPX_NGwYM
Let's say you have a game that wants to rotate 10,000 points. For this operation you have a rotation matrix, which is 6 values, and each point that you rotate has 3 values: X, Y, Z. For the matrix calculation you need 2 more temp registers.
This makes 11 values. A good routine can use 11 registers for this. The 68K FPU can do this well as it can use BOTH the 8 Dn registers and the 8 FPn registers as inputs.
Arne's example code is a nice example of coding this simply for an unpipelined FPU.
If you have a pipelined FPU then you can run this much faster with unrolled code. You will make the code 4 times faster if you unroll it 4 times. If you unroll 4 times, then the code will use 5*4 = 20 registers for the vectors + 6 registers for the matrix. You want to reserve all the cache/memory loads for loading each vector just one time from memory. Good code will never want to waste memory/cache accesses on reloading values inside the work loop.
This example makes it clear you want to have 26 registers for this algorithm.
As we know, the 3-matrix code is the "small" version that you can use for vector operations. Sometimes you want to use the 4-matrix code in your program. The 4-matrix code is similar but of course needs more values. It needs 8 for the matrix and 6 for each vector (4*6 for the unroll). This means a good routine will use 32 registers total for the unrolled loop.
Mind that an unrolled loop will run this operation about 4 times faster than the non-unrolled version.
In my experience, talking about real world code examples always makes things much clearer.
Looking at this example, everyone can clearly see how much benefit and value more registers have.
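The register arithmetic in this post can be written out as a small Python sketch. It only reproduces the counting rule used above (matrix constants stay resident, per-vector values are multiplied by the unroll factor). The function name `registers_needed` is hypothetical, and the model assumes nothing is spilled to or reloaded from cache.

```python
def registers_needed(matrix_values, per_vector_values, unroll):
    """Live values for an unrolled transform loop under the counting rule above.

    matrix_values     -- constants kept in registers for the whole loop
    per_vector_values -- coordinates plus temporaries for one vector
    unroll            -- number of vectors processed per loop iteration
    """
    return matrix_values + per_vector_values * unroll


if __name__ == "__main__":
    # "3-matrix" case from the post: 6 constants, 3 coordinates + 2 temps.
    print("3-matrix, 4x unroll:", registers_needed(6, 5, 4))
    # "4-matrix" case from the post: 8 constants, 6 values per vector.
    for unroll in (1, 4, 6):
        print(f"4-matrix, {unroll}x unroll:", registers_needed(8, 6, unroll))
```

Against the 8 FPn registers of the existing 68k FPU, anything beyond the non-unrolled 3-matrix case has to borrow Dn registers or memory operands under this counting rule.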
|
|
Status: Offline |
|
|
matthey
| |
Re: One major reason why Motorola and 68k failed... Posted on 29-May-2024 22:58:05
| | [ #147 ] |
|
|
|
Elite Member |
Joined: 14-Mar-2007 Posts: 2270
From: Kansas | | |
|
| Hammer Quote:
X86-64v1's standard IEEE FP32 and FP64 are on the SSE2 path. SSE2 supports both scalar and vector (SIMD) use cases.
|
The handicapped x86 FPU was replaced for a reason.
Gunnar Quote:
Let's make a real world example to give the discussion some real content. Let's say you want to 3D rotate points for a 3D game.
Here is an example of this math. Arne coded this on the Amiga when he was 14 years old: https://www.youtube.com/watch?v=FgrPX_NGwYM
Let's say you have a game that wants to rotate 10,000 points. For this operation you have a rotation matrix, which is 6 values, and each point that you rotate has 3 values: X, Y, Z. For the matrix calculation you need 2 more temp registers.
This makes 11 values. A good routine can use 11 registers for this. The 68K FPU can do this well as it can use BOTH the 8 Dn registers and the 8 FPn registers as inputs.
Arne's example code is a nice example of coding this simply for an unpipelined FPU.
|
Arne's code for drawing a 3D rotating cube only uses 5 of 8 FPU registers in the existing 68k FPU ISA. There are 6 integer data registers used which hold single precision fp constants but this is not optimum. Each FPU instruction using an integer register has a penalty on 68k FPUs (68060 requires 2 more cycles and does not allow superscalar issue). Using the cache provides better performance than using a data register on the 68060 but this is not optimal for multiple use fp variables/constants and with more parallel execution. A few more FPU registers would be valuable even in this simple 3D code but 16 FPU registers are already a big improvement.
Gunnar Quote:
If you have a pipelined FPU then you can run this much faster with unrolled code. You will make the code 4 times faster if you unroll it 4 times. If you unroll 4 times, then the code will use 5*4 = 20 registers for the vectors + 6 registers for the matrix. You want to reserve all the cache/memory loads for loading each vector just one time from memory. Good code will never want to waste memory/cache accesses on reloading values inside the work loop.
This example makes it clear you want to have 26 registers for this algorithm.
As we know, the 3-matrix code is the "small" version that you can use for vector operations. Sometimes you want to use the 4-matrix code in your program. The 4-matrix code is similar but of course needs more values. It needs 8 for the matrix and 6 for each vector (4*6 for the unroll). This means a good routine will use 32 registers total for the unrolled loop.
Mind that an unrolled loop will run this operation about 4 times faster than the non-unrolled version.
In my experience, talking about real world code examples always makes things much clearer.
Looking at this example, everyone can clearly see how much benefit and value more registers have.
|
There are always going to be algorithms that would benefit from more registers. Adding more registers is far from free and provides diminishing returns. I still think 16 GP FPU registers is a good number for a CISC FPU while 32 is a good idea for a RISC FPU. CISC FPUs have options when short a few registers like loads from cache and Dn registers which have a minimal performance loss with limited use. Register renaming reduces register needs. Reducing pipelined FPU instruction latencies reduces the number of instructions needed for unrolling. The P5 Pentium had 3 cycle pipelined FADD and FMUL reducing the need to unroll code. Multiple parallel FPU units improves parallelism without unrolling code. I don't think there is enough code that would benefit from 32 FPU registers. Even if pipelined performance is 25% better with 32 FPU registers when a FPU pipeline can be kept full, it won't make much difference to overall FPU performance if this only occurs 0.25% of the time. I believe a 32 FPU register standard is too many registers for the embedded market where some implementations will want to reduce the number or remove the FPU registers completely. Code size will likely be increased to encode so many registers which is a turnoff for embedded use. Perhaps 32 FPU registers will allow Gunnar's FPGA FPU to better compete with the POWER FPU though.
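The interaction between FPU latency and unrolling can be sketched with a toy model. This is a simplification under stated assumptions, not a model of any real core: one in-order issue slot per cycle, a fully pipelined FPU, every vector being a chain of dependent operations, and unrolling meaning that `unroll` independent chains are interleaved. Loads, stores, register pressure and superscalar issue are ignored, and the name `fpu_cycles` is hypothetical.

```python
def fpu_cycles(n_chains, ops_per_chain, latency, unroll):
    """Cycle at which the last result is ready in a toy in-order FPU model."""
    clock = 0   # next free issue slot
    finish = 0  # cycle when the latest result becomes available
    for start in range(0, n_chains, unroll):
        group = min(unroll, n_chains - start)
        ready = [0] * group  # when each chain's previous result is available
        for _ in range(ops_per_chain):
            for c in range(group):
                # Wait for the issue slot and for the chain's previous result.
                issue = max(clock, ready[c])
                ready[c] = issue + latency
                finish = max(finish, ready[c])
                clock = issue + 1
    return finish


if __name__ == "__main__":
    for latency in (3, 6):
        base = fpu_cycles(1200, 3, latency, 1)
        for unroll in (1, 2, 3, 4, 6, 12):
            cycles = fpu_cycles(1200, 3, latency, unroll)
            print(f"latency {latency}, unroll {unroll:2d}: "
                  f"{cycles:6d} cycles, speedup {base / cycles:.2f}x")
```

In this model the gain from unrolling flattens once the unroll factor reaches the latency, so a 3 cycle FPU stops benefiting after 3 interleaved chains while a 6 cycle FPU keeps gaining up to 6. That is consistent with both positions in this exchange: shorter latencies need less unrolling, and longer latencies need more independent work in flight.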
|
|
Status: Offline |
|
|
Lou
| |
Re: One major reason why Motorola and 68k failed... Posted on 30-May-2024 0:00:50
| | [ #148 ] |
|
|
|
Elite Member |
Joined: 2-Nov-2004 Posts: 4227
From: Rhode Island | | |
|
| @matthey
Quote:
matthey wrote:
Lou Quote:
In the 6502 code, the value is stored to an absolute address (sta RAM_x2000) every loop and to x2001 every 255 cycles. This isn't even happening in the 68000 code. The 68000 code cheats by never writing to memory and is still slower. |
It's not the fault of the 68000 that the benchmark is unrealistic. The 68000 is very flexible and has many options. The 6502 memory stores are a disadvantage of an accumulator architecture CPU that a CISC CPU doesn't have. It is also possible to combine all 3 loops into one loop using the larger datatypes of the 68000. Using the full capabilities of the CPU isn't cheating. The 68000 is clearly a more capable CPU. Any slow memory access disadvantage the 68000 has is more than compensated by more GP registers reducing memory traffic, memory accesses using larger datatype sizes and powerful addressing modes compared to the 6502.
|
That's a heck of a coping mechanism...
Fact is the 6502 can access zero page almost like extra registers. The 65CE02 has a Z register that can be used as an extra register or as a relocatable Zero Page address register.
This code didn't need to write to memory at all. In the 68k code it moves the #1 back into D0 to add 1 to it again...so the total is always 2. The point was not to count to 2^23, the point was to do a simple rudimentary addition that many times to test CPU efficiency, not memory read/write speed. So adding an extra step to the 6502 code was a handicap and the 68k still lost.
This is why 68k failed. The instructions per Clock wasn't competitive until the 68040. Too little too late. Too expensive.
Last edited by Lou on 30-May-2024 at 12:01 AM.
|
|
Status: Offline |
|
|
matthey
| |
Re: One major reason why Motorola and 68k failed... Posted on 30-May-2024 2:33:30
| | [ #149 ] |
|
|
|
Elite Member |
Joined: 14-Mar-2007 Posts: 2270
From: Kansas | | |
|
| Lou Quote:
That's a heck of a coping mechanism...
...
This is why 68k failed. The instructions per Clock wasn't competitive until the 68040. Too little too late. Too expensive.
|
I'm really not concerned about the 68000 losing a benchmark with unoptimized and unrealistic code. "Instructions per Clock" are important but your 68000 benchmark code isn't optimized to reduce them? How experienced is a 68000 assembly programmer that doesn't know about MOVEQ and ADDQ?
@hagopds Quote:
Very cool! I wasn't aware of the 68000's moveq instruction, and it appears to support signed 8-bit fields from -128 to 127. This would be equivalent to the other consoles' loop of 1 to 255, and should run faster as moveq requires fewer clock cycles than move.b.
|
|
|
Status: Offline |
|
|
Gunnar
| |
Re: One major reason why Motorola and 68k failed... Posted on 30-May-2024 8:32:27
| | [ #150 ] |
|
|
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @matthey
Quote:
Adding more registers is far from free and provides diminishing returns. ... Register renaming reduces register needs. |
You contradict yourself in your post. Let me help you:
Fact: the cost of adding more registers is low. Fact: the HW cost of register renaming is high. And no 68K CPU, nor any ColdFire CPU, does register renaming.
To do register renaming in the sense you mean, the CPU needs to have more "hidden" registers internally. This means you have to pay both: the cost of more registers and the much higher cost of register renaming.
Matthey, this is the problem with talking with you. You talk about stuff that you googled without understanding what it means.
|
|
Status: Offline |
|
|
Gunnar
| |
Re: One major reason why Motorola and 68k failed... Posted on 30-May-2024 8:41:47
| | [ #151 ] |
|
|
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @matthey
Quote:
The P5 Pentium had 3 cycle pipelined FADD and FMUL reducing the need to unroll code. |
The P5 was designed for a 90MHz clock rate. We spoke about modern FPUs (designs that can reach gigahertz clocks) and those have around 6 cycles of latency.
With a modern CPU, unrolling is very important for performance. Every coder who has programmed FPU code knows this.
I would suggest you look at real world FPU code done by IBM, Motorola, ARM, you name them. State of the art code unrolls the work loop 4, 5, or 6 times.
|
|
Status: Offline |
|
|
Gunnar
| |
Re: One major reason why Motorola and 68k failed... Posted on 30-May-2024 8:49:44
| | [ #152 ] |
|
|
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @matthey
Quote:
Multiple parallel FPU units improves parallelism without unrolling code.
|
Was this a joke? You say you find more registers too expensive and then you propose to have multiple FPUs instead. What you propose is 10,000 times more costly. And the multiple FPU units again each need more registers. Don't you know this?
This is like saying "You want to save the money for the subway, so you propose to buy a new helicopter instead"?
Matt, you talk about solutions without having any knowledge of the hardware costs of any of the options that you propose. |
|
Status: Offline |
|
|
Gunnar
| |
Re: One major reason why Motorola and 68k failed... Posted on 30-May-2024 8:55:06
| | [ #153 ] |
|
|
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @matthey
Quote:
Arne's code for drawing a 3D rotating cube only uses 5 of 8 FPU registers in the existing 68k FPU ISA. |
So you complain about Arne's Amiga demo? Arne was 13 or 14 years old when he wrote this demo effect and shared the code with the Amiga community to help others learn asm. Arne's code gives us something to look at when we talk about FPU code.
The code is good for clearly showing us that for a 4-matrix loop you need 8 constants and 4+2 = 6 registers per vector.
This means you have 14 variables to handle for the single (slow) non-unrolled case. For a 4-way unroll this makes 32 variables. And for a 6-way unroll this makes 44 variables.
Everyone with coding experience understands that coding a loop with more variables gets much easier with more registers available.
Quote:
I don't think there is enough code that would benefit from 32 FPU registers |
Could the reason be that you never coded anything like this? Maybe the problem here is talking about things you have never done and do not understand from your own experience.
Matthey, how about you write 3D matrix rotation code for us as an example? And then you unroll it 4 or 6 times for speed.
Please do this and then we can talk about the topic again.
Last edited by Gunnar on 30-May-2024 at 09:15 AM.
|
|
Status: Offline |
|
|
kolla
| |
Re: One major reason why Motorola and 68k failed... Posted on 30-May-2024 18:39:28
| | [ #154 ] |
|
|
|
Elite Member |
Joined: 21-Aug-2003 Posts: 3187
From: Trondheim, Norway | | |
|
| @Gunnar
Maybe the world isn’t filled with amigans obsessing about rotating 3D objects?
_________________ B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC |
|
Status: Offline |
|
|
matthey
| |
Re: One major reason why Motorola and 68k failed... Posted on 30-May-2024 23:51:59
| | [ #155 ] |
|
|
|
Elite Member |
Joined: 14-Mar-2007 Posts: 2270
From: Kansas | | |
|
| Gunnar Quote:
You contradict yourself in your post. Let me help you:
Fact: the cost of adding more registers is low. Fact: the HW cost of register renaming is high. And no 68K CPU, nor any ColdFire CPU, does register renaming.
To do register renaming in the sense you mean, the CPU needs to have more "hidden" registers internally. This means you have to pay both: the cost of more registers and the much higher cost of register renaming.
|
Register renaming has a higher hardware cost than adding architectural registers but register renaming is an optional core design decision where architectural registers defined in the ISA are a requirement. Register renaming has several advantages over an equivalent number of architectural registers.
1. Code is smaller with renamed registers instead of architectural registers due to fewer bits for registers in the instruction encoding. This makes a larger number of rename registers practical. For example, 128 registers uses 3x7=21 bits of encoding space for a 3 op instruction which is 66% of a 32 bit instruction and not practical for architectural registers but possible for renamed registers.
2. Register renaming makes instruction scheduling and programming in general easier resulting in more optimized code without depending on compiler support. Perfect instruction scheduling often is not possible and core designs which minimize stalls usually have better real world performance. For example, the simple and small in-order SiFive U74 CPU core designed to reduce stalls often outperforms a much more complex OoO PPC G5 CPU with more processing hardware but requiring perfect code.
3. Existing code benefits from register renaming where adding architectural registers requires recompiling with updated compiler support to have any performance advantage. A good instruction scheduler is required to maximize performance without register renaming.
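The encoding arithmetic in point 1 can be checked with a few lines of Python. This is only a sketch of the bit counting: it assumes a fixed 32 bit instruction word, a register field wide enough to name every architectural register, and three register operands. The function name `register_field_bits` is hypothetical.

```python
def register_field_bits(num_registers, operands=3):
    """Bits spent naming registers in one instruction."""
    bits_per_field = (num_registers - 1).bit_length()  # ceil(log2(n)) for n >= 2
    return bits_per_field * operands


if __name__ == "__main__":
    for regs in (8, 16, 32, 64, 128):
        bits = register_field_bits(regs)
        share = 100 * bits / 32
        print(f"{regs:3d} registers: {bits:2d} bits, {share:.0f}% of a 32 bit word")
```

For 128 registers this gives 21 bits, about 66% of a 32 bit instruction, matching the figure above. Renamed registers avoid this cost because they never appear in the instruction encoding.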
It's true that the 68040 and 68060 FPUs were minimalist. Integer performance was the priority but the FPU performance was not bad for minimalist FPUs. Minimalist FPUs are still valuable for lower end embedded use while FPU performance has become more important for high end embedded use than it was for the desktop when the 68040 and 68060 were designed. The 68060 did receive integer register renaming for an in-order core which is not necessary but makes the 68060 more forgiving to program and improves performance. It's natural to assume that a higher performance 68k FPU would have received FPU pipelining and register renaming to allow existing code to perform well.
Gunnar Quote:
The P5 was designed for a 90MHz clock rate. We spoke about modern FPUs (designs that can reach gigahertz clocks) and those have around 6 cycles of latency.
|
A similar 5/6 stage in-order P5 Pentium design could already reach 300MHz in the late 1990s.
P5@66MHz 800nm P54C@100MHz 600nm P54CS@200MHz 350nm P55C@233MHz 280nm Tillamook@300MHz 250nm
Targeting a higher clock speed with a deeper pipeline is likely to increase the latency with more FPU pipeline stages. We can see this with the in-order Intel Atom Bonnell microarchitecture based on the P5 Pentium. The Bonnell pipeline was aggressively increased to 16-19 stages to achieve over 2GHz using a 45nm process. The latency of the x87 FADD and FMUL instructions increased from 3 cycles to 5 cycles. A more practical design with 7-11 stages likely could achieve 1-2GHz with 3-4 cycle FPU latencies. A deeply pipelined core designed for 3-5GHz to compete with POWER likely would have 6+ cycle FPU instruction latencies. A core hyper optimized for a FPGA may also use many stages to increase the clock speed resulting in longer FPU instruction latencies. Perhaps we can see Gunnar's design priorities.
Gunnar Quote:
With a modern CPU, unrolling is very important for performance. Every coder who has programmed FPU code knows this.
I would suggest you look at real world FPU code done by IBM, Motorola, ARM, you name them. State of the art code unrolls the work loop 4, 5, or 6 times.
|
RISC FPUs need more registers and more loop unrolling to not only avoid the long FPU instruction latencies but also load-to-use stalls which can be longer for deep pipelines and FPU loads. A CISC FPU can avoid some of the unrolling and code enlargement especially if FPU instruction latencies are practical.
Gunnar Quote:
Was this a joke? You say you find more registers too expensive and then you propose to have multiple FPUs instead. What you propose is 10,000 times more costly. And the multiple FPU units again each need more registers. Don't you know this?
|
There are often separate sub units in the FPU like FADD, FMUL, FDIV, FMISC, etc. The units can potentially execute FPU instructions in parallel although some logic overhead is necessary to make this possible. The P5 Pentium can execute FPU instructions in different FPU sub units at the same time including multiple simultaneous pipelined FADD and FMUL instructions while the 68060 can not despite having similar FPU sub units. The logic to allow parallel instructions could be as simple as scoreboarding which allows instructions using different resources to execute in parallel or more OoO like complexity with in-order completion and potentially using register renaming.
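A minimal sketch of the idea, in the spirit of scoreboarding rather than a faithful model of the P5 or the 68060: each instruction issues in order once its source and destination registers have no pending write, and an `overlap` flag chooses whether a new FPU instruction may start while an earlier one is still in flight. The 3 cycle latencies follow the P5 FADD/FMUL figure mentioned earlier in the thread. The function name, the tuple format and the register names are hypothetical.

```python
def run(program, latency, overlap):
    """Cycle at which the last result is ready.

    program -- list of (unit, dest, sources) tuples, issued in order
    latency -- dict mapping unit name to result latency in cycles
    overlap -- True: sub units may work in parallel; False: one FPU op at a time
    """
    ready = {}    # register -> cycle its pending result becomes available
    clock = 0     # next free issue slot
    fpu_free = 0  # cycle when the most recently issued FPU op completes
    finish = 0
    for unit, dest, sources in program:
        issue = clock
        for reg in (*sources, dest):  # wait out RAW and WAW hazards
            issue = max(issue, ready.get(reg, 0))
        if not overlap:               # serialize on the whole FPU
            issue = max(issue, fpu_free)
        done = issue + latency[unit]
        ready[dest] = done
        fpu_free = done
        finish = max(finish, done)
        clock = issue + 1
    return finish


if __name__ == "__main__":
    latency = {"fmul": 3, "fadd": 3}
    program = [
        ("fmul", "fp0", ("fp1", "fp2")),  # independent of the next op
        ("fadd", "fp3", ("fp4", "fp5")),
        ("fadd", "fp6", ("fp0", "fp3")),  # needs both earlier results
    ]
    print("overlapped:", run(program, latency, overlap=True), "cycles")
    print("serialized:", run(program, latency, overlap=False), "cycles")
```

The two independent instructions overlap in the first mode and run back to back in the second, which is the difference being described between a FPU that can use its sub units in parallel and one that cannot.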
|
|
Status: Offline |
|
|
Lou
| |
Re: One major reason why Motorola and 68k failed... Posted on 31-May-2024 0:02:39
| | [ #156 ] |
|
|
|
Elite Member |
Joined: 2-Nov-2004 Posts: 4227
From: Rhode Island | | |
|
| @matthey
Quote:
matthey wrote: Lou Quote:
That's a heck of a coping mechanism...
...
This is why 68k failed. The instructions per Clock wasn't competitive until the 68040. Too little too late. Too expensive.
|
I'm really not concerned about the 68000 losing a benchmark with unoptimized and unrealistic code. "Instructions per Clock" are important but your 68000 benchmark code isn't optimized to reduce them? How experienced is a 68000 assembly programmer that doesn't know about MOVEQ and ADDQ?
|
The 6502 code had a worse disadvantage. You're quite the deflector.
Let's add your 'q' instructions along with the writing to RAM that the 6502 was uselessly doing and rerun then... |
|
Status: Offline |
|
|
matthey
| |
Re: One major reason why Motorola and 68k failed... Posted on 31-May-2024 1:22:48
| | [ #157 ] |
|
|
|
Elite Member |
Joined: 14-Mar-2007 Posts: 2270
From: Kansas | | |
|
| Lou Quote:
The 6502 code had a worse disadvantage. You're quite the deflector.
Let's add your 'q' instructions along with the writing to RAM that the 6502 was uselessly doing and rerun then...
|
Rather than arbitrarily decide which CPUs can or should do what, how about using a simple benchmark with code that performs something useful like the Byte Sieve benchmark. Dhrystone or BYTEmark/NBench benchmarks would be better but the 6502 is primitive and has trouble supporting compilers.
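For reference, the inner algorithm of the Byte Sieve looks roughly like the Python sketch below. It follows the commonly cited form of the benchmark (8190 flags covering the odd numbers from 3 upward, one pass shown here, where the benchmark traditionally repeats the pass several times). The function name `byte_sieve` is hypothetical, and a 6502 or 68000 comparison would of course use hand written assembly or a compiler, not Python.

```python
def byte_sieve(size=8190):
    """One pass of the classic sieve; returns the number of primes found.

    flags[i] stands for the odd number 2*i + 3, so even numbers are never stored.
    """
    flags = [True] * (size + 1)
    count = 0
    for i in range(size + 1):
        if flags[i]:
            prime = i + i + 3
            k = i + prime          # index of the next odd multiple of prime
            while k <= size:
                flags[k] = False
                k += prime
            count += 1
    return count


if __name__ == "__main__":
    print(byte_sieve(), "primes")
```

Unlike the counting loop, this mixes array indexing, byte stores, additions and nested loops, so it exercises addressing modes and memory access as well as raw add throughput.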
|
|
Status: Offline |
|
|
Kronos
| |
Re: One major reason why Motorola and 68k failed... Posted on 31-May-2024 7:46:24
| | [ #158 ] |
|
|
|
Elite Member |
Joined: 8-Mar-2003 Posts: 2657
From: Unknown | | |
|
| @kolla Quote:
kolla wrote: @Gunnar
Maybe the world isn’t filled with amigans obsessing about rotating 3D objects? |
Or maybe the world is filled with people who haven't heard about "GPU"s over the past 30+ years?
_________________ - We don't need good ideas, we haven't run out on bad ones yet - blame Canada |
|
Status: Offline |
|
|
Gunnar
| |
Re: One major reason why Motorola and 68k failed... Posted on 31-May-2024 9:47:50
| | [ #159 ] |
|
|
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @matthey
Quote:
The 68060 did receive integer register renaming for an in-order core which is not necessary but makes the 68060 more forgiving to program and improves performance. |
This is not true.
The 68060 does not have register renaming. |
|
Status: Offline |
|
|
Gunnar
| |
Re: One major reason why Motorola and 68k failed... Posted on 31-May-2024 9:50:33
| | [ #160 ] |
|
|
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @matthey
Quote:
There are often separate sub units in the FPU like FADD, FMUL, FDIV, FMISC, etc. |
The APOLLO 68080 FPU is fully parallel and can do 22 FPU instructions in parallel at the same time.
But this does NOT solve the limitation of the registers. To calculate and store the results of 22 FPU instructions you need a lot more than 8 Registers.
It's very simple to understand this.
|
|
Status: Offline |
|
|