Poster | Thread | Hammer
| |
Re: One major reason why Motorola and 68k failed... Posted on 31-May-2024 14:03:01
| | [ #161 ] |
| |
|
Elite Member |
Joined: 9-Mar-2003 Posts: 5859
From: Australia | | |
|
| @matthey
Quote:
There are always going to be algorithms that would benefit from more registers. Adding more registers is far from free and provides diminishing returns. I still think 16 GP FPU registers is a good number for a CISC FPU while 32 is a good idea for a RISC FPU. CISC FPUs have options when short a few registers like loads from cache and Dn registers which have a minimal performance loss with limited use. Register renaming reduces register needs. Reducing pipelined FPU instruction latencies reduces the number of instructions needed for unrolling. The P5 Pentium had 3 cycle pipelined FADD and FMUL reducing the need to unroll code. Multiple parallel FPU units improves parallelism without unrolling code. I don't think there is enough code that would benefit from 32 FPU registers. Even if pipelined performance is 25% better with 32 FPU registers when a FPU pipeline can be kept full, it won't make much difference to overall FPU performance if this only occurs 0.25% of the time. I believe a 32 FPU register standard is too many registers for the embedded market where some implementations will want to reduce the number or remove the FPU registers completely. Code size will likely be increased to encode so many registers which is a turnoff for embedded use. Perhaps 32 FPU registers will allow Gunnar's FPGA FPU to better compete with the POWER FPU though. |
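The "25% better, but only 0.25% of the time" point in the quote is essentially Amdahl's law. Below is a minimal sketch, assuming both figures are read literally as "fraction of run time affected" and "speedup of that fraction"; the function name is made up for illustration.

```python
def overall_speedup(fraction_affected: float, local_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only part of the run time gets faster."""
    return 1.0 / ((1.0 - fraction_affected) + fraction_affected / local_speedup)

# Figures from the quote, read literally (an assumption):
# code that is 25% faster but accounts for only 0.25% of the run time.
gain = overall_speedup(fraction_affected=0.0025, local_speedup=1.25)
print(f"overall speedup: {gain:.4f}x")
```

Under that reading the whole-program gain is about 0.05%, which is the sense in which the quote says it "won't make much difference".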
32 registers = more SRAM storage on the chip.
_________________ Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68) Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68) Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB |
| Status: Offline |
| | Lou
| |
Re: One major reason why Motorola and 68k failed... Posted on 31-May-2024 22:10:52
| | [ #162 ] |
| |
|
Elite Member |
Joined: 2-Nov-2004 Posts: 4227
From: Rhode Island | | |
|
| @matthey
Quote:
matthey wrote: Lou Quote:
The 6502 code had a worse disadvantage. You're quite the deflector.
Let's add your 'q' instructions along with the writes to RAM that the 6502 was uselessly doing and rerun then...
|
Rather than arbitrarily decide which CPUs can or should do what, how about using a simple benchmark with code that performs something useful like the Byte Sieve benchmark. Dhrystone or BYTEmark/NBench benchmarks would be better but the 6502 is primitive and has trouble supporting compilers.
|
So rather than admit defeat, you move the goal posts.
https://en.wikipedia.org/wiki/Instructions_per_second
We have known the answers for a long time. Instructions per clock:
6502, before the improvements of the 65C02 and 65CE02 (which improves by another 25%): 0.43
68000: 0.175
68020: 0.303
68030: 0.360
Finally some performance, but too little, too late, too expensive:
68040: 1.1
68060: 1.33
The 65C02 is still in production from WDC and has the potential to reach 200 MHz in a hypothetical 0.25-micron implementation... You'd need a 491 MHz 68000 to keep up.
https://www.ardent-tool.com/CPU/docs/MPR/19980511/1206en.pdf
This is why Motorola 68k failed. Too little, too late and too expensive. |
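The 491 MHz figure follows from the ratio of the quoted instructions-per-clock values. A small sketch of that arithmetic, using only the numbers from the post (the helper name is hypothetical, and the 6502 figure of 0.43 is used for the 200 MHz part as the post does):

```python
# Instructions-per-clock figures exactly as quoted in the post.
IPC = {"6502": 0.43, "68000": 0.175, "68020": 0.303,
       "68030": 0.360, "68040": 1.1, "68060": 1.33}

def equivalent_mhz(ref_cpu: str, ref_mhz: float, other_cpu: str) -> float:
    """Clock other_cpu needs to match ref_cpu's instruction rate at ref_mhz."""
    return ref_mhz * IPC[ref_cpu] / IPC[other_cpu]

for cpu in ("68000", "68020", "68030", "68040", "68060"):
    print(f"{cpu}: {equivalent_mhz('6502', 200, cpu):.0f} MHz")
```

This only compares instruction rates; it says nothing about how much work each instruction does on either architecture.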
| Status: Offline |
| | Hammer
| |
Re: One major reason why Motorola and 68k failed... Posted on 1-Jun-2024 3:51:57
| | [ #163 ] |
| |
|
Elite Member |
Joined: 9-Mar-2003 Posts: 5859
From: Australia | | |
|
| @Lou
Quote:
Lou wrote:
So rather than admit defeat, you move the goal posts.
https://en.wikipedia.org/wiki/Instructions_per_second
We have known the answers for a long time. Instructions per clock:
6502, before the improvements of the 65C02 and 65CE02 (which improves by another 25%): 0.43
68000: 0.175
68020: 0.303
68030: 0.360
Finally some performance, but too little, too late, too expensive:
68040: 1.1
68060: 1.33
The 65C02 is still in production from WDC and has the potential to reach 200 MHz in a hypothetical 0.25-micron implementation... You'd need a 491 MHz 68000 to keep up.
https://www.ardent-tool.com/CPU/docs/MPR/19980511/1206en.pdf
This is why Motorola 68k failed. Too little, too late and too expensive.
|
Motorola's 683XX family includes the 68000 and a bastardized 68020 known as CPU32.
http://archive.computerhistory.org/resources/access/text/2013/04/102723315-05-01-acc.pdf
Supply Base for 32-Bit Microprocessors, 1994: each product's share of the total 32-bit-and-up MPU market in 1994 (page 89 of 417):
68000: 17%
80386SX/SL: 3%
80386DX: 3%
80486SX: 16%
80486DX: 21%
683XX: 9%
68040: 3%
68030: 1%
68020: 3%
80960: 4%
AM29000: 1%
32X32: 3%
R3000/R4000: 1%
Sparc: 1%
Pentium: 4%
Others: 10%
The instruction set of the 683XX's CPU32 core is similar to the 68020's, without the bitfield instructions and with a few instructions added.
"The CPU32 is a 68000-based microprocessor that can execute most 32-bit operations in two clock periods" https://www.nxp.com/docs/en/product-brief/MC68341.pdf
Mostly 0.5 IPC for the bastard 68020 (CPU32).
Last edited by Hammer on 01-Jun-2024 at 04:01 AM.
| Status: Offline |
| | matthey
| |
Re: One major reason why Motorola and 68k failed... Posted on 2-Jun-2024 22:23:58
| | [ #164 ] |
| |
|
Elite Member |
Joined: 14-Mar-2007 Posts: 2270
From: Kansas | | |
|
| Gunnar Quote:
This is not true.
The 68060 does not have register renaming.
|
I used to think the 68060 lacked register renaming too. Joe Circello says otherwise.
https://www.academia.edu/64300961/The_superscalar_architecture_of_the_MC68060 Quote:
Register scoreboarding, register renaming and a robust resource crossbar minimize pipeline breaks.
|
The real Gunnar should know this as I have pointed out this article to him before. Perhaps Gunnar believes he is more of an expert than the chief architect of the 68060 though?
Gunnar Quote:
The APOLLO 68080 FPU is fully parallel and can do 22 FPU instructions in parallel at the same time.
But this does NOT solve the limitation of the registers. To calculate and store the results of 22 FPU instructions you need a lot more than 8 Registers.
It's very simple to understand this.
|
The P5 Pentium could only execute ~6 instructions in parallel with fewer orthogonal FPU registers than the 68k FPU and no register renaming. Register renaming would have been valuable as the same registers are hammered but the P5 Pentium made reasonably good use of the pipelining considering the ISA handicap and few stack registers. It makes me think that 16 orthogonal FPU registers with register renaming would be adequate but Gunnar is the expert at everything so maybe prefix using FPU instructions to access many non-orthogonal registers without register renaming and an instruction scheduler is better.
Hammer Quote:
Motorola's 683XX family includes the 68000 and a bastardized 68020 known as CPU32.
|
I wouldn't call CPU32 a bastard ISA. The CPU32 ISA is a compatible subset of the 68k ISA unlike ColdFire. The more complex missing functionality can be trapped and emulated. If CPU32 had survived instead of ColdFire, it would have likely found use in Amigas unlike ColdFire. CPU32 was still too much 68k and not low end enough so it threatened PPC and had to go.
Last edited by matthey on 02-Jun-2024 at 10:28 PM.
|
| Status: Offline |
| | pixie
| |
Re: One major reason why Motorola and 68k failed... Posted on 3-Jun-2024 10:37:01
| | [ #165 ] |
| |
|
Elite Member |
Joined: 10-Mar-2003 Posts: 3287
From: Figueira da Foz - Portugal | | |
|
| @matthey
Quote:
The real Gunnar should know this as I have pointed out this article to him before. Perhaps Gunnar believes he is more of an expert than the chief architect of the 68060 though? |
May I have your attention, please? May I have your attention, please? Will the real Gunnar please stand up? I repeat Will the real Gunnar please stand up? We're gonna have a problem here...
_________________ Indigo 3D Lounge, my second home. The Illusion of Choice | Am*ga |
| Status: Offline |
| | Hammer
| |
Re: One major reason why Motorola and 68k failed... Posted on 4-Jun-2024 5:44:42
| | [ #166 ] |
| |
|
Elite Member |
Joined: 9-Mar-2003 Posts: 5859
From: Australia | | |
|
| @matthey
Quote:
The P5 Pentium could only execute ~6 instructions in parallel with fewer orthogonal FPU registers than the 68k FPU and no register renaming.
|
That's not realistic when the P5 Pentium's front end is a dual-issue design.
"Instructions in flight" doesn't reflect instruction completion (retirement) rates.
Quote:
Register renaming would have been valuable as the same registers are hammered but the P5 Pentium made reasonably good use of the pipelining considering the ISA handicap and few stack registers. It makes me think that 16 orthogonal FPU registers with register renaming would be adequate but Gunnar is the expert at everything so maybe prefix using FPU instructions to access many non-orthogonal registers without register renaming and an instruction scheduler is better. |
From your link https://www.academia.edu/64300961/The_superscalar_architecture_of_the_MC68060
Quote:
The optimized computing engines in the instruction execution pipeline stages perform most integer operations in a single cycle and the most frequently used floating-point operations in three cycles or less |
68060's FPU hardware implementation is weaker.
Quote:
I wouldn't call CPU32 a bastard ISA. The CPU32 ISA is a compatible subset of the 68k ISA unlike ColdFire. The more complex missing functionality can be trapped and emulated. If CPU32 had survived instead of ColdFire, it would have likely found use in Amigas unlike ColdFire. CPU32 was still too much 68k and not low end enough so it threatened PPC and had to go. |
It's still an instruction set kitbashed from the 68020, e.g. the missing bitfield instructions.
https://www.nxp.com.cn/docs/en/product-brief/MC68341.pdf Despite having the CPU32, the MC68341 has a 16-bit data bus and a CPU clock speed of up to 25 MHz.
https://www.nxp.jp/docs/en/product-brief/MC68349.pdf The 68349 has a 32-bit data bus with up to 25 MHz for the CPU.
Against the RISC competition, these are not competitive.
Last edited by Hammer on 04-Jun-2024 at 05:57 AM.
| Status: Offline |
| | Gunnar
| |
Re: One major reason why Motorola and 68k failed... Posted on 4-Jun-2024 6:23:12
| | [ #167 ] |
| |
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @matthey
I see what you are misunderstanding here in this PDF.
You spoke about register renaming in the context of increasing the performance of the FPU. For this you need to have more internal FPU registers. And you need to rename destination and input registers to remove dependencies over several cycles. The 68060 does not do this and it can not do this.
What both the Motorola 68060 and the Apollo 68080 CPU can actually do is remove false dependencies while executing 2 integer instructions in the same cycle.
Both the 060 CPU and the 080 CPU execute this pair together: move.l D0,D1; add.l D2,D1. What is done here is precisely called false hazard removal. The 060 and 080 can find and remove false input dependencies by "renaming" an input.
This PDF also calls this register renaming, but it's limited to the same single cycle and only to inputs.
So what you read in this PDF is NOT what you think it is.
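The input "renaming" described for the pair move.l D0,D1; add.l D2,D1 can be modelled in a few lines. This is only a behavioural sketch under simple assumptions (a register move followed by an add, both reading the register state from before the pair); the names are made up and it is not a description of the actual 68060 or 68080 logic.

```python
from typing import NamedTuple

class Insn(NamedTuple):
    op: str    # "move" or "add"
    src: str   # source register
    dst: str   # destination register (also an input for "add")

def run_sequential(regs: dict, first: Insn, second: Insn) -> dict:
    """Reference behaviour: the add waits for the move to finish."""
    out = dict(regs)
    out[first.dst] = out[first.src]
    out[second.dst] = out[second.dst] + out[second.src]
    return out

def run_pair(regs: dict, first: Insn, second: Insn) -> dict:
    """Execute a (move, add) pair using only pre-pair register values."""
    assert first.op == "move" and second.op == "add"

    def read(reg: str) -> int:
        # "Rename" the input: a read of the move's destination is served
        # from the move's source, so the add need not wait for the move.
        return regs[first.src] if reg == first.dst else regs[reg]

    out = dict(regs)
    out[first.dst] = regs[first.src]
    out[second.dst] = read(second.dst) + read(second.src)
    return out

regs = {"D0": 5, "D1": 100, "D2": 7}
pair = (Insn("move", "D0", "D1"), Insn("add", "D2", "D1"))
print(run_pair(regs, *pair))
```

Both functions give the same register state; the paired version just never reads a value produced inside the pair.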
|
| Status: Offline |
| | matthey
| |
Re: One major reason why Motorola and 68k failed... Posted on 4-Jun-2024 21:26:03
| | [ #168 ] |
| |
|
Elite Member |
Joined: 14-Mar-2007 Posts: 2270
From: Kansas | | |
|
| Hammer Quote:
That's not realistic when the P5 Pentium's front end is a dual-issue design.
"Instructions in flight" doesn't reflect instruction completion (retirement) rates.
|
I meant to say that the P5 Pentium FPU could only execute ~6 instructions in parallel.
FADD unit: 3 pipelined instructions
FMUL unit: 2-3 pipelined instructions
FDIV unit: 1 non-pipelined instruction
FMISC unit: 1 non-pipelined instruction
---
7-8 FPU instructions in parallel (rough estimate)
The CPU+FPU could be executing 5-10 (2x5 stage) more FPU instructions in parallel so ~12-18 total FPU "instructions in flight". As I recall, the 2nd CPU integer execute pipeline is limited to FXCH instructions which are unnecessary with a better ISA so ~13 FPU "instructions in flight" is better for comparisons. While it is difficult to keep FPU units busy with the ISA handicap, the FPU units are usually available to begin execution without stalls. Instruction scheduling mostly involves avoiding dependencies which is often possible but difficult using FXCH and CISC cache/mem load+OP instructions despite few orthogonal FPU registers. The FXCH instructions can be thought of as providing primitive "manual" FPU register renaming capabilities but the number of registers is still limited and FXCH instructions clog and enlarge the code.
Hammer Quote:
68060's FPU hardware implementation is weaker.
|
Sure. The 68060 FPU is minimalist. There is no FPU pipelining. It has FPU (sub) units like the P5 Pentium but lacks even the simplest FPU register scoreboard hardware to execute FPU instructions in parallel when there are no dependencies. This is in contrast to the integer hardware which is fully pipelined, uses register renaming, uses register scoreboarding, performs early execution in AGU stages where possible and uses extensive forwarding/bypassing of results. The 68060 design prioritizes integer performance over FPU performance like the Cyrix 5x86 and results in similarly much superior integer performance compared to the P5 Pentium. Both the 68060 and 5x86 lack FPU pipelining and perform better with the more common mixed integer and FPU code but the 68060 FPU generally has shorter instruction latencies and the FPU ISA is vastly better. The 5x86 has a small FPU instruction queue to avoid stalling due to new FPU instructions in the integer pipeline before an FPU instruction is finished executing. This makes instruction scheduling easier with a minimum of hardware resources but doesn't appear to overcome the FPU performance deficit compared to the 68060 FPU as seen in the ByteMark floating point benchmarks compiled with VBCC. The 68060 FPU performance is even surprisingly close to the P5 Pentium performance at the same clock speed with the most likely reason being the handicapped x86 FPU ISA.
Hammer Quote:
ARM kept adding more standard 68k features like hardware multiply/divide, caches, misaligned memory address accesses, Thumb code density derived from the 68k, AArch64 powerful 68k like addressing modes, etc. in attempts to compete with the 68k and x86(-64). The 68k/CPU32/ColdFire ISA after the 68020 ISA kept reducing features and performance to become more RISC like. Granted, the 68k was replaced by PPC for political reasons and the industry leading code density 68k ISAs only used where fat PPC could not scale. We would have liked to have seen higher performance 68k CPUs but the 68060 was the end. Many Motorola/Freescale embedded customers would have liked to see more and better 68k embedded CPUs too but they had PPC jammed down their throat instead and are now ARM customers using 68k like features.
Gunnar Quote:
I see what you are misunderstanding here in this PDF.
You spoke about register renaming in the context of increasing the performance of the FPU. For this you need to have more internal FPU registers. And you need to rename destination and input registers to remove dependencies over several cycles. The 68060 does not do this and it can not do this.
|
Is this my misunderstanding or your misunderstanding? I was pointing out that register renaming is a benefit for in-order CPU cores too as demonstrated by 68060 integer register renaming with benefits like reducing the number of architectural registers needed, improving code density with fewer register bits in encodings and improving the performance of existing code which hammers a small number of registers. FPU register renaming would have more benefit for a pipelined FPU with the longer execution latencies. This makes FPU register renaming as natural of fit for an upgraded pipelined 68k FPU as it was for the pipelined 68060 integer CPU. Only 8 architectural FPU registers is limiting for heavy FPU workloads even with FPU register renaming but 16 architectural FPU registers can be added in an orthogonal way with usually no increase in code size and no instruction prefix required. I believe a CISC FPU ISA is more practical with 16 architectural FPU registers which is better for lower end embedded FPU designs while a high performance pipelined FPU design with register renaming can provide most if not all of the performance of a RISC FPU with 32 FPU registers and without register renaming. CISC FPU mem-reg accesses using powerful addressing modes while avoiding load-to-use stalls are a huge advantage which may be better for performance than RISC FPUs doubling the FPU registers from 16 to 32.
Gunnar Quote:
What both the Motorola 68060 and the Apollo 68080 CPU can actually do is remove false dependencies while executing 2 integer instructions in the same cycle.
Both the 060 CPU and the 080 CPU execute this pair together: move.l D0,D1; add.l D2,D1. What is done here is precisely called false hazard removal. The 060 and 080 can find and remove false input dependencies by "renaming" an input.
This PDF also calls this register renaming, but it's limited to the same single cycle and only to inputs.
So what you read in this PDF is NOT what you think it is.
|
There was no 68060 register renaming according to you and now you understand it perfectly enough to say it is not what I think it is? There are certainly different ways to implement register renaming.
THE DESIGN SPACE OF REGISTER RENAMING TECHNIQUES https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=2bef45642abd1c42d92b40806139ae0e265cdf2c
Details of the register renaming are lacking in 68060 documentation I am aware of. The article above mentions the 68060 does register renaming but does not try to classify the 68060 register renaming. We know the register renaming scope would be classified as partial due to the lack of FPU register renaming but the article does not even make that assumption. I'm less confident of details than you.
Last edited by matthey on 04-Jun-2024 at 09:33 PM.
|
| Status: Offline |
| | Hammer
| |
Re: One major reason why Motorola and 68k failed... Posted on 5-Jun-2024 6:26:20
| | [ #169 ] |
| |
|
Elite Member |
Joined: 9-Mar-2003 Posts: 5859
From: Australia | | |
|
| @matthey
Quote:
Sure. The 68060 FPU is minimalist. There is no FPU pipelining. It has FPU (sub) units like the P5 Pentium but lacks even the simplest FPU register scoreboard hardware to execute FPU instructions in parallel when there are no dependencies. This is in contrast to the integer hardware which is fully pipelined, uses register renaming, uses register scoreboarding, performs early execution in AGU stages where possible and uses extensive forwarding/bypassing of results. The 68060 design prioritizes integer performance over FPU performance like the Cyrix 5x86 and results in similarly much superior integer performance compared to the P5 Pentium. Both the 68060 and 5x86 lack FPU pipelining and perform better with the more common mixed integer and FPU code but the 68060 FPU generally has shorter instruction latencies and the FPU ISA is vastly better. The 5x86 has a small FPU instruction queue to avoid stalling due to new FPU instructions in the integer pipeline before an FPU instruction is finished executing. This makes instruction scheduling easier with a minimum of hardware resources but doesn't appear to overcome the FPU performance deficit compared to the 68060 FPU as seen in the ByteMark floating point benchmarks compiled with VBCC. The 68060 FPU performance is even surprisingly close to the P5 Pentium performance at the same clock speed with the most likely reason being the handicapped x86 FPU ISA.
|
Quake beats fake ByteMark. ByteMark is less useful (meaningless) when there are real application benchmarks.
For software rendering and FP processing, the solution needs high memory bandwidth.
The 68060 (3.3V)'s 32-bit 040 bus was aging in 1994.
The Pentium 83 MHz OverDrive on a 32-bit 486 bus is like a Pentium 75 (P54C, 3.3V, Socket 5). The P54C reached 120 MHz.
The Pentium 75 is just a Pentium 90 with a 50 MHz FSB, hence the jump to a 60 MHz FSB is a popular overclock, i.e. Pentium 75 to Pentium 90.
Pentium 75: 1.5X x 50 MHz = 75 MHz.
Pentium 90: 1.5X x 60 MHz = 90 MHz.
Pentium 100: 1.5X x 66 MHz = 100 MHz.
Pentium 120: 2.0X x 60 MHz = 120 MHz.
https://www.tomshardware.com/reviews/overclocking-guide,15-10.html
"P75 most of them run at least flawlessly at 90 @ 1.5 x 60 MHz, many of them at 100 @ 1.5 x 66 MHz."
I pushed my 1996-era Pentium 150 to 187.5 MHz with a 75 MHz FSB.
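The clock figures above are all just multiplier times bus clock. A quick sketch of that arithmetic; the 2.5x multiplier for the Pentium 150 is inferred from 187.5 / 75 rather than stated in the post, and the helper name is made up.

```python
def core_clock_mhz(multiplier: float, fsb_mhz: float) -> float:
    """Core clock is the bus clock times the core multiplier."""
    return multiplier * fsb_mhz

# (label, multiplier, FSB in MHz) as listed in the post.
# The "66 MHz" bus is nominally 66.67 MHz, which is how 1.5x gives 100 MHz.
# The 2.5x entry is an inference from 187.5 / 75, not a figure from the post.
settings = [
    ("Pentium 75", 1.5, 50.0),
    ("Pentium 90", 1.5, 60.0),
    ("Pentium 100", 1.5, 200.0 / 3.0),
    ("Pentium 120", 2.0, 60.0),
    ("Pentium 150 on a 75 MHz FSB", 2.5, 75.0),
]
for label, mult, fsb in settings:
    print(f"{label}: {mult} x {fsb:.1f} MHz = {core_clock_mhz(mult, fsb):.1f} MHz")
```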
I have a 68060 Rev1 and it doesn't overclock like the P54C.
------------
For game consoles, they wanted a 68EC040-level CPU without paying $100.
Quote:
ARM kept adding more standard 68k features like hardware multiply/divide, caches, misaligned memory address accesses, Thumb code density derived from the 68k, AArch64 powerful 68k like addressing modes, etc. in attempts to compete with the 68k and x86(-64). The 68k/CPU32/ColdFire ISA after the 68020 ISA kept reducing features and performance to become more RISC like.
|
There's nothing wrong with adding features, just don't unnecessarily delete them. CPU32 could have kept the missing instructions on a slower microcode path. Control over the C++ compiler is important.
Quote:
Granted, the 68k was replaced by PPC for political reasons and the industry leading code density 68k ISAs only used where fat PPC could not scale.
|
Apple was approached by IBM on the PowerPC project and Apple brought in Motorola due to their long-term partnership.
Apple was a bigger factor than Commodore as Motorola's desktop computer partner.
Quote:
We would have liked to have seen higher performance 68k CPUs but the 68060 was the end. Many Motorola/Freescale embedded customers would have liked to see more and better 68k embedded CPUs too but they had PPC jammed down their throat instead and are now ARM customers using 68k like features.
|
Motorola wasn't able to carry the 68000's success over to the 68020, 68030 and 68040. Motorola was focused on Intel 386DX price vs performance guides instead of X86 cloners and RISC competitors.
Motorola thinks it's like Intel when it's not.
The ARM team ramped up clock speed faster than the 68K.
Against the ARM925T @ 144 MHz for handheld devices, what's Motorola's 68K solution? DragonBall is not good enough.
At 33 MHz, 5.4 MIPS for the DragonBall VZ (MC68VZ328). At 66 MHz, 10.8 MIPS for the DragonBall Super VZ (MC68SZ328).
Handheld devices separated from the embedded market, hence Motorola lost another market.
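Dividing the quoted DragonBall MIPS ratings by the clock gives a rough instructions-per-clock figure. A small sketch using only the numbers in the post (the helper name is made up):

```python
def mips_per_mhz(mips: float, mhz: float) -> float:
    """Rough instructions per clock: MIPS rating divided by clock in MHz."""
    return mips / mhz

# MIPS ratings and clocks as quoted in the post.
for name, mips, mhz in [("DragonBall VZ", 5.4, 33), ("DragonBall Super VZ", 10.8, 66)]:
    print(f"{name}: {mips_per_mhz(mips, mhz):.3f} MIPS/MHz")
```

Both parts come out at the same ratio, roughly 0.16, so the Super VZ's gain in these figures is purely from the doubled clock.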
Last edited by Hammer on 05-Jun-2024 at 07:34 AM.
| Status: Offline |
| | Gunnar
| |
Re: One major reason why Motorola and 68k failed... Posted on 5-Jun-2024 11:37:34
| | [ #170 ] |
| |
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @matthey
Quote:
Is this my misunderstanding or your misunderstanding? I was pointing out that register renaming is a benefit for in-order CPU cores too as demonstrated by 68060 integer register renaming with benefits like reducing the number of architectural registers needed, |
Easy answer: This is your misunderstanding.
Let me explain. The point of "real register renaming" is to remove dependencies between several instructions over several cycles.
Let me show you the concept.
Let's make an oversimplified and wrong example, just to drive the point home.
[code]
move.l (A0)+,D0   ; load a longword into D0
move.l D0,(A1)+   ; store it, then reuse D0
move.l (A0)+,D0
move.l D0,(A1)+
move.l (A0)+,D0
move.l D0,(A1)+
move.l (A0)+,D0
move.l D0,(A1)+
[/code]
This code uses D0 4 times ... as a TMP register...
[code]
movem.l (A0)+,D0/D1/D2/D3   ; load four longwords into four different registers
move.l D0,(A1)+
move.l D1,(A1)+
move.l D2,(A1)+
move.l D3,(A1)+
[/code]
With real register renaming the CPU can use more registers than were used in the code. This means you can really use more physical registers than the CPU has architecturally.
This allows for doing more in parallel in hardware than the "bad" code was written for.
The 68060 cannot do this. It does not have more physical registers, it does not have the "scope" to watch for renaming, and it does not rename destinations. What it does is "mini renaming", or fusing 2 instructions into 1.
Yes, they call this register renaming in this paper, but it clearly does not have the features that register renaming is supposed to have.
Clear? |
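The concept of giving every write of D0 its own physical register can be sketched as a rename table. This is a generic textbook-style model, assuming an unbounded pool of physical registers (real hardware has a finite pool and recycles registers); it is not the mechanism of the 68060 or the 68080, and all names are invented for illustration.

```python
from itertools import count

def rename(program, arch_regs):
    """Map each architectural register write to a fresh physical register.

    program: list of (dest, sources) tuples; dest is None for a store.
    Returns the program rewritten in terms of physical register names.
    """
    fresh = count()
    mapping = {r: f"P{next(fresh)}" for r in arch_regs}  # initial arch -> phys map
    renamed = []
    for dest, sources in program:
        srcs = tuple(mapping[s] for s in sources)        # read current mappings first
        new_dest = None
        if dest is not None:
            new_dest = f"P{next(fresh)}"                 # every write gets a new register
            mapping[dest] = new_dest
        renamed.append((new_dest, srcs))
    return renamed

# The copy sequence from the post, address registers omitted for brevity:
# load into D0, store from D0, four times.
copy_loop = [("D0", ()), (None, ("D0",))] * 4
renamed = rename(copy_loop, ["D0"])
for insn in renamed:
    print(insn)
```

After renaming, the four loads target four different physical registers, so the load/store pairs no longer depend on each other through D0, much like the hand-written version that uses D0 to D3.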
| Status: Offline |
| | Gunnar
| |
Re: One major reason why Motorola and 68k failed... Posted on 5-Jun-2024 11:50:36
| | [ #171 ] |
| |
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @matthey
Quote:
I think the article lists Super-Scalar CPUs in the overview.
Those CPUs are grouped into 3 colors: (empty) = like the M68060, (light renaming), (full renaming).
Why did they not color the M68060? Maybe they do not classify it as having the features they classify as light or full?
What is clear to us is that its renaming is very limited in power, and not enough to make the FPU stronger.
The paper has some nice info that shows you what real register renaming includes. It has both a rename "work window" and a bigger register file. The 68060 has neither of these.
Last edited by Gunnar on 05-Jun-2024 at 12:03 PM.
|
| Status: Offline |
| | Gunnar
| |
Re: One major reason why Motorola and 68k failed... Posted on 5-Jun-2024 16:06:14
| | [ #172 ] |
| |
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @Hammer
Quote:
32 registers = more SRAM storage on the chip. |
Of course having more registers will cost more transistors. This is logical. But this is a very small cost.
The extra registers cost less than 1% of what the FPU costs.
If you spend 1% more on the transistors for the extra registers and this allows you to double the speed of the 100 times bigger FPU, then this is a very good deal, isn't it?
For the registers we talk about we only need to spend a few thousand transistors and this will allow us to maybe double the performance of the FPU. Double the speed of the FPU that costs many hundreds of thousands of transistors.
Think of it like you spend an extra $30 on good tires which will allow you to double the speed of your $60,000 car.
Who would not do this?
Last edited by Gunnar on 05-Jun-2024 at 05:44 PM.
|
| Status: Offline |
| | matthey
| |
Re: One major reason why Motorola and 68k failed... Posted on 5-Jun-2024 20:23:54
| | [ #173 ] |
| |
|
Elite Member |
Joined: 14-Mar-2007 Posts: 2270
From: Kansas | | |
|
| Hammer Quote:
Quake beats fake ByteMark. ByteMark is less useful (meaningless) when there are real application benchmarks.
|
The ByteMark benchmarks use real world algorithms typical of more common code while Quake was the anomaly. There was similar 3D code after Quake but this changed with SIMD units and T&L graphics cards. Quake remains a good benchmark that emphasizes floating point performance but it is not typical of floating point workloads before Quake or even modern 3D workloads.
Hammer Quote:
The Pentium 83 MHz OverDrive on a 32-bit 486 bus is like a Pentium 75 (P54C, 3.3V, Socket 5). The P54C reached 120 MHz.
...
I have 68060 Rev1 and it doesn't overclock like P54C.
|
The P54C Pentium had a die shrink to 350nm already in 1994. The 68060 plans to increase the max clock speed rating were cancelled despite being designed for high clock speeds.
https://www.academia.edu/64300961/The_superscalar_architecture_of_the_MC68060 Quote:
Its scalable design can move to higher frequencies to keep pace with process improvements without requiring major architectural modifications.
|
You are comparing a popular desktop Pentium CPU to a high end embedded 68060 CPU that was being replaced by PPC.
Hammer Quote:
There's nothing wrong with adding features, just don't unnecessarily delete them. CPU32 could have kept the missing instructions on a slower microcode path. Control over the C++ compiler is important.
|
Smaller and lower power cores are more competitive for embedded use. Transistor budgets for embedded use were much smaller back then and often needed for specialized embedded features. Today, it is possible to retain more standardization and provide more general purpose embedded features allowing SoCs to be produced in higher volumes. I don't see any problem with using a subset of a more advanced ISA for embedded use. ARM uses subsets of Thumb ISAs for their low end embedded Cortex-M cores and they are inconsistent in what is supported.
https://en.wikipedia.org/wiki/ARM_Cortex-M#Instruction_sets Quote:
All Cortex-M cores implement a common subset of instructions that consists of most Thumb-1, some Thumb-2, including a 32-bit result multiply. The Cortex-M0 / Cortex-M0+ / Cortex-M1 / Cortex-M23 were designed to create the smallest silicon die, thus having the fewest instructions of the Cortex-M family.
|
See the chart "ARM Cortex-M instruction variations" at the link above to see that ARM Cortex-M ISAs are less standardized than Motorola/Freescale using 68000, 68020, CPU32 and ColdFire cores for embedded use. More 68k standardization was possible though. The ColdFire ISA was not worth scaling down so far considering how much 68k compatibility was lost but ColdFire extensions could have benefited the other ISAs. The 68000 and CPU32 ISAs are subsets of the 68020 ISA but ColdFire is not.
Hammer Quote:
Apple was approached by IBM on the PowerPC project and Apple brought in Motorola due to their long-term partnership.
Apple was a bigger factor than Commodore as Motorola's desktop computer partner.
|
CBM was a non-factor for desktop 68k CPUs as they primarily bought embedded 68k CPUs to go with their low end Amiga chipsets.
Hammer Quote:
The ARM team ramped up clock speed faster than the 68K.
|
ARM cores had to "ramp up" clock speeds for performance which is undesirable for embedded use. RISC cores, including ARM cores, are easier to pipeline and clock up though.
|
| Status: Offline |
| | matthey
| |
Re: One major reason why Motorola and 68k failed... Posted on 5-Jun-2024 21:49:21
| | [ #174 ] |
| |
|
Elite Member |
Joined: 14-Mar-2007 Posts: 2270
From: Kansas | | |
|
| Gunnar Quote:
Easy answer: This is your misunderstanding.
Let me explain. The point of "real register renaming" is to remove dependencies between several instructions over several cycles.
Let me show you the concept.
Let's make an oversimplified and wrong example, just to drive the point home.
[code]
move.l (A0)+,D0   ; load a longword into D0
move.l D0,(A1)+   ; store it, then reuse D0
move.l (A0)+,D0
move.l D0,(A1)+
move.l (A0)+,D0
move.l D0,(A1)+
move.l (A0)+,D0
move.l D0,(A1)+
[/code]
This code uses D0 4 times ... as a TMP register...
|
This is not what I understand register renaming to be; it is result forwarding/bypassing from the early address generation stages of one OEP to the other OEP. It sounds like this may be considered a form of register renaming in the article, but it may not be the only register renaming technique.
Gunnar Quote:
[code]
movem.l (A0)+,D0/D1/D2/D3
move.l D0,(A1)+
move.l D1,(A1)+
move.l D2,(A1)+
move.l D3,(A1)+
[/code]
With real register renaming, the CPU can use more registers than were used in the code. This means you can really use more physical registers than the CPU has architecturally.
This allows for doing more in parallel in hardware than the "bad" code was written for.
The 68060 cannot do this. It does not have more physical registers, it does not have the "scope" to watch for renaming, and it does not rename destinations. What it does is "mini renaming", or fusing of 2 instructions into 1.
Yes, they call this register renaming in this paper, but it clearly does not have the features that register renaming is supposed to have.
Clear ? |
No. The 68060 performs very limited code folding for branches which I doubt would be mentioned as register renaming. Code folding/fusing techniques can remove some false dependencies so could be considered a limited register renaming technique if used enough.
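To make the disputed term concrete, here is a minimal Python sketch of textbook register renaming applied to the D0-as-TMP copy loop quoted above. It is an illustration only, not a model of the 68060 or the 68080: the `(dests, srcs)` instruction format, the `p0`, `p1`, ... physical register names, and the choice to ignore the A0/A1 post-increment updates are all assumptions made for this example.

```python
from itertools import count


def rename(program):
    """Give every destination write a fresh (hypothetical) physical register."""
    mapping = {}      # architectural register -> current physical register
    fresh = count()   # unlimited supply of physical register numbers
    renamed = []
    for dests, srcs in program:
        new_srcs = []
        for s in srcs:
            # A source reads whichever physical register holds the latest value.
            if s not in mapping:
                mapping[s] = f"p{next(fresh)}"
            new_srcs.append(mapping[s])
        new_dests = []
        for d in dests:
            # A destination always gets a new physical register.
            mapping[d] = f"p{next(fresh)}"
            new_dests.append(mapping[d])
        renamed.append((new_dests, new_srcs))
    return renamed


def false_dependencies(program):
    """Count WAW and WAR hazards between every earlier/later instruction pair."""
    hazards = 0
    for i, (dests_i, srcs_i) in enumerate(program):
        for dests_j, _ in program[i + 1:]:
            later_writes = set(dests_j)
            if later_writes & set(dests_i):
                hazards += 1  # write-after-write on the same register
            if later_writes & set(srcs_i):
                hazards += 1  # write-after-read on the same register
    return hazards


# Four load/store pairs that all reuse D0 as the temporary.
load = (["D0"], [])    # move.l (A0)+,D0  (address register update ignored)
store = ([], ["D0"])   # move.l D0,(A1)+  (address register update ignored)
program = [load, store] * 4

renamed = rename(program)
physical_used = {r for dests, srcs in renamed for r in dests + srcs}

print("false dependencies before renaming:", false_dependencies(program))
print("false dependencies after renaming: ", false_dependencies(renamed))
print("physical registers used:", sorted(physical_used))
```

In this toy model the single architectural D0 ends up spread over four physical registers, so only the true load-to-store dependencies remain. Result forwarding or instruction fusing, by contrast, does not allocate extra physical registers, which seems to be the distinction being argued over here.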
Gunnar Quote:
I think the article lists Super-Scalar CPUs in the overview.
Those CPUs are grouped into 3 colors: (empty), like the M68060; (light renaming); and (full renaming).
Why did they not color the M68060? Maybe they do not classify it as having the features they classify as light or full?
|
I presume 68060 information about register renaming was limited and they didn't want to make assumptions like you are so quick to do.
Gunnar Quote:
What is clear to us is that its renaming is very limited in power, and not enough to make the FPU stronger.
The paper has some nice info that shows you what real register renaming includes. It has both a rename "work window" and a bigger register file. The 68060 has neither of these.
|
The assumption that 68060 register renaming techniques are limited to integer registers is reasonable but the article was cautious due to the lack of details. It is certainly possible and advantageous to add separate FPU register renaming with a pipelined FPU. It is unlikely that register renaming techniques for 68k CPU and FPU registers would be similar as the registers are different sizes even with 64 bit integer registers.
|
| Status: Offline |
| | Hammer
| |
Re: One major reason why Motorola and 68k failed... Posted on 6-Jun-2024 3:01:43
| | [ #175 ] |
| |
|
Elite Member |
Joined: 9-Mar-2003 Posts: 5859
From: Australia | | |
|
| @Gunnar
Quote:
Gunnar wrote: @Hammer
Quote:
32 registers = more SRAM storage on the chip. |
Of course having more registers will cost more transistors. This is logical. But this is a very small cost.
The cost of more registers is less than 1% of what the FPU costs.
If you spend 1% more on the transistors for the extra registers and this allows you to double the speed of the 100 times bigger FPU, then this is a very good deal, isn't it?
For the registers we talk about we only need to spend a few thousand transistors and this will allow us to maybe double the performance of the FPU. Double the speed of the FPU that costs many hundreds of thousands of transistors.
Think of it like you spend an extra $30 on good tires which will allow you to double the speed of your $60,000 car.
Who would not do this?
|
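The cost argument in the quote can be checked with a few lines of arithmetic. The transistor counts below are assumed readings of "a few thousand" and "many hundreds of thousands", not measured figures for any real FPU; only the $30 and $60,000 values come from the quote itself.

```python
def cost_share(extra_cost, base_cost):
    """Extra cost expressed as a fraction of the base cost."""
    return extra_cost / base_cost


# Assumed readings of the quoted ranges (placeholders, not real die data).
extra_register_transistors = 3_000   # "a few thousand transistors"
fpu_transistors = 500_000            # "many hundreds of thousands"

fpu_share = cost_share(extra_register_transistors, fpu_transistors)
tire_share = cost_share(30, 60_000)  # the tire/car analogy from the quote

print(f"extra registers vs FPU: {fpu_share:.2%}")
print(f"tires vs car:           {tire_share:.2%}")
```

With these placeholder numbers the register cost stays under the quoted 1%, though the share moves with whatever transistor counts are plugged in, and the claimed doubling of FPU speed is a separate assumption this arithmetic does not test.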
I prioritize data locality when there's a latency and bandwidth gap between external memory and the CPU core's potential.
16 registers are not future-proof enough when the competition from AArch64 and X86-64v4** has up to 32 registers.
**AVX-512 supports scalar, vectors, integer, and floating point.
https://x.com/InstLatX64/status/1692989174909997350 System V Application Binary Interface AMD64 Architecture Processor Supplement as of June 2024;
APX (32 GPR for AMD64), and AMX road map are included. Intel and SUSE used "AMD64" due to historical reasons.
Both Intel and AMD are supporting Intel's APX and AMX for AMD64 ABI and it's on the roadmap.
A major CISC road map with 32 GPR. The debate on 16 vs 32 GPR with CISC should end.
Last edited by Hammer on 06-Jun-2024 at 03:03 AM.
| Status: Offline |
| | Hammer
| |
Re: One major reason why Motorola and 68k failed... Posted on 6-Jun-2024 3:47:23
| | [ #176 ] |
| |
|
Elite Member |
Joined: 9-Mar-2003 Posts: 5859
From: Australia | | |
|
| @matthey Quote:
The ByteMark benchmarks use real world algorithms typical of more common code while Quake was the anomaly. There was similar 3D code after Quake but this changed with SIMD units and T&L graphics cards. Quake remains a good benchmark that emphasizes floating point performance but it is not typical of floating point workloads before Quake or even modern 3D workloads.
|
Quake single-handedly wrecked the fortunes of 586/686 clones. The Quake engine powered several other games.
In terms of market impact, Quake beats ByteMark.
For the PC, GeForce 256 and 2's fixed-function T&L only lasted a single DirectX7 generation before switching to DirectX8's programmable vertex shader.
DirectX8's programmable vertex shader emulates DirectX7's T&L.
GeForce 256/GeForce 2 Ti's fixed-function T&L hype train was with Quake 3's NV15 map and 3DMark 2000. Fixed-function T&L wasn't long-lasting on the PC, but it was enough to wreck 3DFX's business.
NVIDIA has a pattern: bring a certain feature to market, partner with highly visible game developers, and hype it. This has been repeated multiple times, e.g. saturated tessellation with Crysis 2. https://hothardware.com/news/indepth-analysis-of-dx11-crysis-shows-highly-questionable-tessellation-usage This is yet another bullshit geometry from NVIDIA's PR.
Raytracing's BVH has a geometry load, hence it's another saturated geometry load from NVIDIA.
NVIDIA will pull saturated geometry PR tactics. Intel and AMD should learn from NVIDIA's historical behavior.
Remember, the co-founder of NVIDIA has a Sun GX TEC (geometry engine) background.
Quote:
The P54C Pentium had a die shrink to 350nm already in 1994 |
Read https://en.wikipedia.org/wiki/List_of_Intel_Pentium_processors including part numbers and sSpec numbers
For 1994: P54C = 600 nm process node
Pentium 75 with 50 MT/s FSB
Pentium 90 with 60 MT/s FSB
Pentium 100 with 50 MT/s FSB
Pentium 100 with 66 MT/s FSB
For 1995: P54CQS = 350 nm process node
Pentium 120 with 60 MT/s FSB
P54CS = 350 nm process node
Pentium 133 with 66 MT/s FSB
For 1996: P54CS = 350 nm
Pentium 150 with 60 MT/s FSB
Pentium 166 with 66 MT/s FSB
Pentium 200 with 66 MT/s FSB
Quote:
You are comparing a popular desktop Pentium CPU to a high end embedded 68060 CPU that was being replaced by PPC.
|
The embedded Pentium 100 with the SL2TU (cC0) number has a 600 nm process node and runs at 100 MHz.
Quote:
Its scalable design can move to higher frequencies to keep pace with process improvements without requiring major architectural modifications.
|
That's meaningless without an actual demonstration. The 68060 author's authority for the high clock speed narrative is in question.
68060's FPU is not pipelined.
Reminder, Motorola/Freescale lost the GHz race.
Quote:
See the chart "ARM Cortex-M instruction variations" at the link above to see that ARM Cortex-M ISAs are less standardized than Motorola/Freescale using 68000, 68020, CPU32 and ColdFire cores for embedded use. More 68k standardization was possible though. The ColdFire ISA was not worth scaling down so far considering how much 68k compatibility was lost but ColdFire extensions could have benefited the other ISAs. The 68000 and CPU32 ISAs are subsets of the 68020 ISA but ColdFire is not.
|
Cortex-M arrived later than ARMv4T, which pushed Freescale DragonBall out of the handheld market.
ARM also has the "Cortex-A" family with a larger feature set and higher clock speeds.
For Amiga and ARM CPU context, there are Cortex-A9 ARMv7-A (e.g. Z3660), Cortex-A53 ARMv8-A (e.g. PiStorm/PiStorm32, theA500mini) and Cortex-A72 ARMv8-A (e.g. PiStorm32). Cortex-A is for the application processor use case, hence the A.
Cortex-M mostly targets microcontroller use cases, hence the "M" and started with ARMv6-M.
Last edited by Hammer on 06-Jun-2024 at 04:11 AM. Last edited by Hammer on 06-Jun-2024 at 03:59 AM. Last edited by Hammer on 06-Jun-2024 at 03:56 AM. Last edited by Hammer on 06-Jun-2024 at 03:53 AM.
| Status: Offline |
| | Gunnar
| |
Re: One major reason why Motorola and 68k failed... Posted on 6-Jun-2024 6:19:07
| | [ #177 ] |
| |
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @Hammer
Quote:
16 registers are not future-proof enough when the competition from AArch64 and X86-64v4** has up to 32 registers. |
Yes this is clear.
And logically the APOLLO 68080 has enhanced the 68K ISA to support 16 address registers plus 32 data registers; it also enables the FPU to use 32 registers for FPU results and enables 3-operand instructions.
This allows the Apollo 68080 to reach by far the highest FLOPS of any 68k model ever. |
| Status: Offline |
| | Hammer
| |
Re: One major reason why Motorola and 68k failed... Posted on 6-Jun-2024 8:11:01
| | [ #178 ] |
| |
|
Elite Member |
Joined: 9-Mar-2003 Posts: 5859
From: Australia | | |
|
| @Gunnar
Any plans for NPU AI extensions for 68K? AI IoT embedded market? For example, FP8 packed math?
| Status: Offline |
| | Gunnar
| |
Re: One major reason why Motorola and 68k failed... Posted on 6-Jun-2024 8:20:10
| | [ #179 ] |
| |
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @Hammer
Quote:
Any plans for NPU AI extensions for 68K? AI IoT embedded market? For example, FP8 packed math? |
Actually, I took part in designing an AI chip for a major global company. I prototyped the AI functions by first including them in the Apollo 68080 CPU as part of AMMX.
So yes, I have in fact added AI to 68k already. It worked, and I used the Caffe AI software on 68k.
But I see no real sense for this on Amiga today. |
| Status: Offline |
| | Lou
| |
Re: One major reason why Motorola and 68k failed... Posted on 6-Jun-2024 14:17:01
| | [ #180 ] |
| |
|
Elite Member |
Joined: 2-Nov-2004 Posts: 4227
From: Rhode Island | | |
|
| Too much whining about ISA here.
If the cpu is efficient and compilers are good, the average developer doesn't care. Apps make the world go round.
Microsoft created Visual Studio and that sealed AAA's (Amiga/Apple/Atari) fate.
It now became EASY to develop for Windows.
For years I've been asking for one of you clever low-level (bare metal) developers to port the Roslyn compiler to the Amiga platform, and Mono, so that the Amiga platform can benefit from C# and Visual Basic developers.
Some years ago, I did get someone to create an ODBC api for Amiga and that did get done ... which is a step in the right direction for creating business apps...
All this talk of new accelerators yet no new apps to run on them.
What a damn shame... |
| Status: Offline |