/  Forum Index
   /  Classic Amiga Hardware
      /  One major reason why Motorola and 68k failed...
matthey 
Re: One major reason why Motorola and 68k failed...
Posted on 7-Jun-2024 3:28:53
#181 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2150
From: Kansas

Hammer Quote:

I prioritize data locality when there's a latency and bandwidth gap between external memory and the CPU core's potential throughput.

16 registers are not future-proof when the competition from AArch64 and x86-64v4** has up to 32 registers.

**AVX-512 supports scalar and vector operations, both integer and floating-point.

https://x.com/InstLatX64/status/1692989174909997350
From the System V Application Binary Interface AMD64 Architecture Processor Supplement as of June 2024:

APX (32 GPRs for AMD64) and the AMX roadmap are included. Intel and SUSE use the name "AMD64" for historical reasons.

Both Intel and AMD support Intel's APX and AMX for the AMD64 ABI, and both are on the roadmap.

That is a major CISC roadmap with 32 GPRs. The debate on 16 vs 32 GPRs for CISC should end.


No! More registers are not a free lunch but a trade-off. The x86-64 ISA is fat, with thousands of instructions and an average instruction length of over 4 bytes. The ISA is limited to high-performance use, has trouble scaling lower, and is threatened by AArch64. Exponential resources are wasted to gain less than 1% performance to stay ahead of AArch64. The CISC advantage comes from fewer but more powerful instructions.
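As a trivial illustration of "fewer but more powerful" (a sketch, not code from any of the cited sources):

        add.l   (a0)+,d0        ; one 68k instruction: load from memory, add, and
                                ; post-increment the pointer - a load/store RISC needs
                                ; separate load, add and pointer-increment instructions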

https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=2bef45642abd1c42d92b40806139ae0e265cdf2c Quote:

Note that for the issue rate of CISC processors, one x86 instruction is considered to be the equivalent of 1.3 to 1.9 RISC instructions.


The x86-64 ISA gave up the CISC code density advantage: the large instructions increase the decoding tax, and the resources required for the standard ISA are huge. The x86 ISA with 8 GP integer registers had a shortage of integer registers which elevated memory traffic, making it worthwhile to increase the GP registers to 16, but the result was already a low single-digit overall percentage gain, partially offset by larger code from using more encoding bits. Going from 16 to 32 GP registers for a CISC CPU is likely to give less than a 1% overall advantage. The 68k already has 16 GP integer registers with memory traffic comparable to a RISC ISA with 32 GP registers. In addition, the 68k starts with an average instruction length under 3 bytes/instruction, code density comparable to Thumb2 but without elevated instruction counts, and powerful instructions on par with x86(-64) instructions.

The 68k can scale lower than AArch64 into the embedded market, unless the standard is bloated to compete with x86-64 and POWER in the desktop and server markets, which I see as a huge mistake similar to frontal assaults on defensive fortresses. It would be better to attack ARM where it is weak: code density has been neglected while ARM beefs up for its assaults on x86-64. The 68k has the code density of Thumb2 without the performance loss from too many weak instructions. Bloating up hardware to chase 1% performance gains is folly when the market needs more powerful small-footprint hardware. The goal should be to do more with less, and this requires being smart rather than using exponential hardware resources for minor performance gains. Another round of AmigaOS on desktop hardware will guarantee the Amiga remains irrelevant and forgotten, but 68k Amiga hardware could be a nice alternative to RPi hardware, with a footprint the ARM competition can't reach.

Hammer Quote:

Quake single-handedly wrecked the fortunes of 586/686 clones. The Quake engine powered several other games.

In terms of market impact, Quake beats ByteMark.


Quake had a market-changing impact, but general-purpose performance, which is mostly integer performance, is more important. The 3D calculations were specialized and moved to the GPU. Modern 3D hardware has become more flexible but it is still far from general purpose.

Hammer Quote:

That's meaningless without an actual demonstration. The 68060 author's authority for the high-clock-speed narrative is in question.

68060's FPU is not pipelined.

Reminder, Motorola/Freescale lost the GHz race.


Joe Circello went on to architect some of the ColdFire CPUs, which reached hundreds of MHz with fully synthesizable cores (no custom IP blocks to optimize timing-critical logic). The ColdFire V5 design was based on the 68060 design, was architected by Joe, and could reach over 500 MHz using a fully synthesizable core. It's not GHz speed, but it is a lower-end embedded core using automatic layout/routing. A ColdFire V5 core on a more modern process could likely reach at least 1 GHz.

Hammer Quote:

Cortex-M arrived later than ARMv4T, which drove the Freescale DragonBall out of the handheld market.

ARM also has the higher-feature-set, higher-clock-speed "Cortex-A" family.

For Amiga and ARM CPU context: Cortex-A9 ARMv7-A (e.g. Z3660), A53 ARMv8-A (e.g. PiStorm/PiStorm32, theA500mini) and A72 ARMv8-A (e.g. PiStorm32). Cortex-A is for the application processor use case, hence the A.

Cortex-M mostly targets microcontroller use cases, hence the "M", and started with ARMv6-M.


There were no MCUs or SoCs when the 68000 was introduced, as there was not enough space on chip. Minimalist 68000 and CPU32 SoCs were introduced later, and MCUs with ColdFire. The 68k and ColdFire cores are more comparable to ARM Cortex-M cores today. The high-end embedded 68060 may have been closer in performance to an ARM Cortex-A core, but it lacked SoC features common today; there wasn't even a memory controller on chip. The 68060 may be more comparable to a high-end Cortex-M core today. It's just that most MPU chips have become either MCUs or SoCs because reducing the number of external support chips saves cost for the price-sensitive embedded market.

Hammer 
Re: One major reason why Motorola and 68k failed...
Posted on 7-Jun-2024 6:00:13
#182 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5616
From: Australia

@matthey

Quote:

The x86-64 ISA gave up the CISC code density advantage: the large instructions increase the decoding tax, and the resources required for the standard ISA are huge. The x86 ISA with 8 GP integer registers had a shortage of integer registers which elevated memory traffic, making it worthwhile to increase the GP registers to 16, but the result was already a low single-digit overall percentage gain, partially offset by larger code from using more encoding bits. Going from 16 to 32 GP registers for a CISC CPU is likely to give less than a 1% overall advantage. The 68k already has 16 GP integer registers with memory traffic comparable to a RISC ISA with 32 GP registers. In addition, the 68k starts with an average instruction length under 3 bytes/instruction, code density comparable to Thumb2 but without elevated instruction counts, and powerful instructions on par with x86(-64) instructions.

The 68k can scale lower than AArch64 into the embedded market, unless the standard is bloated to compete with x86-64 and POWER in the desktop and server markets, which I see as a huge mistake similar to frontal assaults on defensive fortresses. It would be better to attack ARM where it is weak: code density has been neglected while ARM beefs up for its assaults on x86-64. The 68k has the code density of Thumb2 without the performance loss from too many weak instructions. Bloating up hardware to chase 1% performance gains is folly when the market needs more powerful small-footprint hardware. The goal should be to do more with less, and this requires being smart rather than using exponential hardware resources for minor performance gains. Another round of AmigaOS on desktop hardware will guarantee the Amiga remains irrelevant and forgotten, but 68k Amiga hardware could be a nice alternative to RPi hardware, with a footprint the ARM competition can't reach.


https://ieeexplore.ieee.org/document/5413117
Size of LZSS decompression code:
X86-64 beats m68K and ARM's thumb.

Size of string concatenation code:
X86-64 beats m68K and ARM's thumb.

Size of string searching code:
X86-64 beats m68K and ARM's thumb.

Size of integer printing code:
X86-64 beats m68K and ARM's thumb.


Quote:

Quake had a market changing impact but general purpose performance which is mostly integer performance is more important. The 3D calculations were specialized and moved to the GPU. Modern 3D hardware has become more flexible but it is still far from general purpose.

Cyrix says Hi from the grave.

Quote:

Joe Circello went on to architect some of the ColdFire CPUs which reached hundreds of MHz with fully synthesizable cores (no custom IP blocks to optimize timing critical logic).

ColdFire v1 reached 240 MHz @ 65 nm

ColdFire v2 reached 240 MHz @ 65 nm

ColdFire v4 reached 345 MHz @ 65 nm

Reference
https://www.nxp.com/products/nxp-product-information/ip-block-licensing/coldfire-32-bit-processors:COLDFIRE-32-BIT-PROCESSORS

Quote:

The ColdFire v5 design was based on the 68060 design, architected by Joe and could reach over 500MHz using a fully synthesizable core. It's not GHz speeds but it is a lower end embedded core using auto layout/routing. A ColdFire v5 core using a more modern process likely could reach at least 1 GHz.

For desktops:
Intel Core 2 "Conroe" @ 65 nm node reached 3 GHz via E6850.

Intel Core 2 "Conroe XE" @ 65 nm node reached 3.2 GHz via X6900.
----

For laptops:
Intel Core 2 "Merom-L" @ 65 nm node reached 1.2 GHz with 5.5 watts TDP via Core 2 Solo ULV U2200.

Intel Core 2 "Merom" (low-voltage) @ 65 nm node reached 1.8 Ghz with 17 watts TDP via Core 2 Duo L7700

"Merom-2M" (ultra-low-voltage) @ 65 nm node reached 1.33 Ghz with 10 watts TDP via Core 2 Duo U7700

-----------------------

AMD's 65 nm results with Phenom series:
"Agena" Phenom X4 9950 Black Edition reached 2.6 Ghz.

Bulldozer was designed for very high clock speeds with long pipelines, i.e. AMD's own Pentium 4.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

Hammer 
Re: One major reason why Motorola and 68k failed...
Posted on 7-Jun-2024 6:05:28
#183 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5616
From: Australia

@Gunnar

Quote:

Gunnar wrote:
@Hammer

Quote:
Any plans for NPU AI extensions for 68K? AI IoT embedded market? For example, FP8 packed math?


Actually, I took part in designing an AI chip for a major global company.
I prototyped the AI functions by including them first in the Apollo 68080 CPU as part of AMMX.

So yes - I have in fact added AI to the 68k already.
And it worked and I used Caffe AI software on 68k.

But I see no real sense for this on Amiga today.

Caffe AI is from NVIDIA's Pascal (CUDA 8.0) and CUDA 7.5 era. Caffe 8 years ago was compiled to support FP32 or FP64.

"AI" is nothing without implementation details.

New instructions require new software or recompile.

Last edited by Hammer on 07-Jun-2024 at 06:11 AM.
Last edited by Hammer on 07-Jun-2024 at 06:09 AM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

matthey 
Re: One major reason why Motorola and 68k failed...
Posted on 7-Jun-2024 21:13:57
#184 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2150
From: Kansas

Hammer Quote:

https://ieeexplore.ieee.org/document/5413117
Size of LZSS decompression code:
X86-64 beats m68K and ARM's thumb.

Size of string concatenation code:
X86-64 beats m68K and ARM's thumb.

Size of string searching code:
X86-64 beats m68K and ARM's thumb.

Size of integer printing code:
X86-64 beats m68K and ARM's thumb.


No! That is an old code density paper with wrong data. I discovered the J-Core project using an old version of this paper to justify their choice of SuperH and was immediately suspicious of the results. I found Dr. Vince Weaver's site and looked at the 68k code, which was a joke and far worse than what a compiler would generate. The paper originated from an 808x/x86 code density contest between Vince and friends, which gives a hint as to which code was the most optimized. There have been updates, including some of the 68k code updates I sent to him, which can be found at the following site.

http://www.deater.net/weave/vmwprod/asm/ll/ll.html

You can see that I am mentioned in the 68k source code.

http://www.deater.net/weave/vmwprod/asm/ll/ll.m68k.s Quote:

| Notes somewhat grumpily contributed by Matthew Hey:
| * Did you not see that there is a TST instruction for CMP #0
| * and ADDQ/SUBQ instead of LEA to add/sub small immediates including to
| address registers?
| * You can do MOVE EA,EA where EA is almost any addressing mode and MOVE
| sets the condition codes for a Bcc without a CMP or TST
| * You use LEA where you shouldn't but not where you should which is
| instead of MOVE.L #address,An.

...

| lots of time passes, start optimizing based on e-mail from Matthew Hay


The 68k source went from 1014 bytes down to 870 bytes and is still missing some of the submitted 68k optimizations. The changes mentioned above are far from insane hand optimizations; they are basic optimizations performed by compilers. I made a spreadsheet of the results with all the changes, which Vince agreed to incorporate but never did.

https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vTyfDPXIN6i4thorNXak5hlP0FQpqpZFk2sgauXgYZPdtJX7FvVgabfbpCtHTkp5Yo9ai6MhiQqhgyG/pubhtml?gid=909588979&single=true&pli=1
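To make those concrete, here is a minimal before/after sketch of the substitutions listed in the notes (illustrative code, not lines taken from ll.m68k.s; byte counts are for the 68000 encodings):

    ; unoptimized
        cmp.l   #0,d0           ; 6 bytes - compare a register against zero
        lea     8(a0),a0        ; 4 bytes - bump an address register
        move.l  #buf,a1         ; 6 bytes - load an address as a 32-bit immediate
        move.l  (a0)+,d1
        tst.l   d1              ; 2 bytes - separate test before the branch
        beq     done

    ; optimized with the basic substitutions a compiler would make
        tst.l   d0              ; 2 bytes - TST replaces CMP #0
        addq.l  #8,a0           ; 2 bytes - ADDQ/SUBQ work on address registers
        lea     buf(pc),a1      ; 4 bytes - PC-relative LEA replaces MOVE.L #address,An
        move.l  (a0)+,d1        ; MOVE already sets the N/Z condition codes...
        beq     done            ; ...so the Bcc needs no separate TST or CMP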

Note that the code is optimized for size, which is usually only done for embedded targets. This does not change the 68k code much, but the x86-64 code changes a lot, resulting in many small instructions (~2.29 bytes/instruction) that use the original 8 GP registers and produce through-the-roof memory traffic. Instructions in performance-optimized x86-64 code are much larger due to the prefixes often needed to access more than 8 GP registers and other new functionality.


https://project-archive.inf.ed.ac.uk/ug4/20191424/ug4_proj.pdf

Code density for performance-optimized x86-64 code also deteriorates with the large instructions, but other performance metrics improve, like instruction counts and memory traffic. The 68k doesn't have to make these compromises: it has more GP registers without prefixes, and its performance metrics are excellent whether code is optimized for performance or size.

The 68k's lack of availability has eliminated it from many modern papers, and modern compiler support has declined. Usually old papers and compilers are necessary to show the code density advantage.


https://www.researchgate.net/publication/221306454_SPARC16_A_new_compression_approach_for_the_SPARC_architecture

The 68k and Thumb2 are very competitive in code density but Thumb2 has worse performance metrics (~21% more instructions and ~23% more data memory traffic in the Vince Weaver benchmark). Enhancements to the 68k including ColdFire enhancements can likely improve code density to win most code density contests but it won't happen with a virtual/emulated 68k target.

Hammer Quote:

For desktops:
Intel Core 2 "Conroe" @ 65 nm node reached 3 GHz via E6850.

Intel Core 2 "Conroe XE" @ 65 nm node reached 3.2 GHz via X6900.
----

For laptops:
Intel Core 2 "Merom-L" @ 65 nm node reached 1.2 GHz with 5.5 watts TDP via Core 2 Solo ULV U2200.

Intel Core 2 "Merom" (low-voltage) @ 65 nm node reached 1.8 Ghz with 17 watts TDP via Core 2 Duo L7700

"Merom-2M" (ultra-low-voltage) @ 65 nm node reached 1.33 Ghz with 10 watts TDP via Core 2 Duo U7700

-----------------------

AMD's 65 nm results with Phenom series:
"Agena" Phenom X4 9950 Black Edition reached 2.6 Ghz.

Bulldozer was designed for very high clock speeds with long pipelines, i.e. AMD's own Pentium 4.


Desktop CPUs are designed with deeper pipelines and commonly use custom IP blocks extensively. Embedded CPUs are designed with practical pipelines and may be fully synthesizable without any custom blocks. More performance/MHz reduces the need to clock up the CPU to gain performance for embedded use. Even though ColdFire lost performance/MHz from the 68k castration, it still had more performance/MHz than ARM CPUs at that time. The 68060's performance/MHz wasn't surpassed by ARM for about a decade. The 68060@50MHz was a success in the embedded market because it didn't need to be clocked up for many embedded uses. It would have been more competitive for some embedded uses, consoles and the desktop if higher-clock-rated parts had been available, but withholding them was a political decision to push PPC.

Last edited by matthey on 07-Jun-2024 at 10:26 PM.

Gunnar 
Re: One major reason why Motorola and 68k failed...
Posted on 8-Jun-2024 7:51:52
#185 ]
Cult Member
Joined: 25-Sep-2022
Posts: 512
From: Unknown

The ColdFire CPUs are targeted at embedded usage.
Motorola always used a production process that was optimized for power saving and never a process that was optimized for speed, so just looking at the nanometers will not tell you this.

If you look at the internal design of the ColdFire you can see that:
- The ColdFire V1/V2/V3 are pretty weak in design.
They are tuned absolutely for minimal cost and minimal size and not for being a powerful CPU.
They are designed like the 68000 and have to do EA and ALU calculation sharing the same HW unit,
which of course makes them slow.

A lot better in internal design is the ColdFire V4.
The ColdFire V4 is much more powerful - it's designed more like the 68040 in its pipeline.
But the ColdFire V4 is still a lot weaker per clock than the Motorola 68060 and Apollo 68080 CPU.

Again much improved is the ColdFire V5.
The ColdFire V5 is actually a lot closer to the 68060 in its internal design,
which is no wonder as the ColdFire V5 and 68060 were designed by the same person.
There is V5 silicon on the market running over 500 MHz that was produced in an old/slow 130 nm process.

The ColdFire V6 pipeline was lengthened a little bit and therefore tuned to reach over 800 MHz (also in an old and slow process), but it never came to market.
I think with a modern process the ColdFire V6 pipeline design could reach around 1500 MHz.


The ColdFire V5 pipeline design is similar to the 68060/68080 internal design.
A 68060 design could be lengthened to reach ColdFire V6 clock rates.
The Apollo 68080 CPU architecture is already designed to reach a higher clock than the 68060 in an ASIC.

If you compare the pipeline designs of the latest ColdFire, the Apollo 68080 CPU, the Intel Atom or some VIA x86 CPUs, then you will see that they have a lot in common.

If you compare these designs you can see that they are internally all very close to each other,
and from a 100-foot view they look like basically the same designs.

Motorola showed that even using an old and slow process they came to nearly a gigahertz clock rate,
and Intel and VIA produced these designs at over 1500-2000 MHz, also using processes that are old by today's standards.


I would say making a 68k ASIC with a clock rate of 1000-2000 MHz is absolutely possible.
Considering that AmigaOS runs very fast on a 68060 at 100 MHz, and that AmigaOS absolutely flies on the Apollo 68080 CPU today, a machine with a gigahertz 68080 and AmigaOS will feel insanely fast.




Last edited by Gunnar on 08-Jun-2024 at 07:58 AM.

Kronos 
Re: One major reason why Motorola and 68k failed...
Posted on 8-Jun-2024 8:50:13
#186 ]
Elite Member
Joined: 8-Mar-2003
Posts: 2615
From: Unknown

@Gunnar

Quote:

Gunnar wrote: - a machine with a gigahertz 68080 and AmigaOS will feel insanely fast.



Sure, but that would be mostly due to the lack of any real apps/games that would prove that "feeling" wrong within 0.01s.

_________________
- We don't need good ideas, we haven't run out on bad ones yet
- blame Canada

Hammer 
Re: One major reason why Motorola and 68k failed...
Posted on 10-Jun-2024 13:56:45
#187 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5616
From: Australia

@matthey

You can't remove the C++ compiler's quantity with the hand assembly version. This is why Intel invested in their C++ compiler.

The bulk of modern Linux and GNU middleware is in platform-portable source code.

Your example is not realistic for portable GNU middleware and Linux source code.

Against https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vTyfDPXIN6i4thorNXak5hlP0FQpqpZFk2sgauXgYZPdtJX7FvVgabfbpCtHTkp5Yo9ai6MhiQqhgyG/pubhtml?gid=909588979&single=true&pli=1


https://github.com/deater/ll_asm/tree/master/binaries/native/i386
i386 LZSS reached 57 bytes.

Last edited by Hammer on 10-Jun-2024 at 03:39 PM.
Last edited by Hammer on 10-Jun-2024 at 02:01 PM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

Hammer 
Re: One major reason why Motorola and 68k failed...
Posted on 10-Jun-2024 15:14:30
#188 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5616
From: Australia

@matthey

Quote:

Code density for performance optimized x86-64 code also deteriorates with the large instructions but other performance metrics improve like instruction counts and memory traffic.

Reminder, x86-64 still runs IA-32 as is.

Long mode REX prefix must be encoded when:
1. using 64-bit operand size and the instruction does not default to 64-bit operand size; or
2. using one of the extended registers (R8 to R15, XMM8 to XMM15, YMM8 to YMM15, CR8 to CR15 and DR8 to DR15); or
3. using one of the uniform byte registers SPL, BPL, SIL or DIL.

Not everything needs 64-bit integers. On Windows, long is 32-bit in the standard ABI. A 64-bit type is available as int64_t or long long. Reference https://learn.microsoft.com/en-au/windows/win32/winprog/windows-data-types?redirectedfrom=MSDN

Linux x64's "long" datatype is 64 bits.

Windows x64's "long" is 32-bit by ABI default.

Modern X86 CPUs have competent register renaming, zero-level cache, and decoded instruction cache (after X86 decoders).



Quote:

The 68k doesn't have to make these compromises due to more GP registers without prefixes and performance metrics are excellent whether optimizing code for performance or size.

68060 has no 64-bit GPRs.
68060 has no FMA3.
68060 has no popcount.
68060 has no vector FP math.
68060 has no vector integer math.

At its maximum extent, IA-32 has 8 32-bit GPRs, 8 80-bit FPRs (overlaid by 8 64-bit integer MMX vectors) and 8 128-bit registers (scalar, vector, integer and FP).

68060 missing a few instructions https://www.youtube.com/watch?v=ofXWIVMdsmc
68060 doesn't work on 68040-aware Macintosh. This YouTuber installed 68060 on Performa 575 via a 68060 to 68040 adapter and it failed to boot.

Emu68 emulates every 68K instruction minus MMU.

The 68K ISA can't make use of a 4 GB to 8 GB RAM equipped RPi4 under PiStorm-Emu68. I have a 4 GB equipped RPi 4B; due to AmigaOS, only 2 GB of RAM is usable.

68060 has no concept of 36-bit Physical Address Extension (PAE).

What's 68K's instruction sequence for the equivalent AVX-512 read-and-write instructions?

Examples
VPGATHERDD, Gathers 32-bit integers from memory using 32-bit indices
VPGATHERDQ, Gathers 64-bit integers from memory using 32-bit indices.
VPGATHERQD, Gathers 32-bit integers from memory using 64-bit indices.
VPGATHERQQ, Gathers 64-bit integers from memory using 64-bit indices.

VPSCATTERDD, Scatters 32-bit integers into memory using 32-bit indices.
VPSCATTERDQ, Scatters 64-bit integers into memory using 32-bit indices.
VPSCATTERQD, Scatters 32-bit integers into memory using 64-bit indices.
VPSCATTERQQ, Scatters 64-bit integers into memory using 64-bit indices.

The 512-bit register has 16 32-bit or 8 64-bit data elements.

AVX-512 has floating-point instructions. AVX-512 is very useful for software rendering.


Quote:

. Even though ColdFire lost performance/MHz from the 68k castration, it still had more performance/MHz than ARM CPUs at that time. The 68060 performance/MHz wasn't surpassed by ARM for about a decade.

The 68060 wasn't cheap compared to ARM or other RISC competitors. No 68060 ever won a game console design. Amiga Hombre was a $40 solution. Sega selected the SuperH2 ahead of the 68030.

The main reason for PiStorm with an RPi 3A+ or 4B is its lower cost against the rip-off prices of the 68060.

For the handheld market, DragonBall VZ lost to ARM925T. 68060's size wasn't optimized for small handheld devices like smartphones.

Quote:

. The 68060@50MHz was a success in the embedded market because it didn't need to be clocked up for many embedded uses

Amiga 68060 accelerators are not bound by "embedded" clocks, hence overclocking can be attempted, and the results are not good. My 68060 Rev 1 reached 62.5 MHz on a TF1260 and is unstable at 74 MHz.

https://archive.computerhistory.org/resources/access/text/2013/04/102723352-05-01-acc.pdf
Page 26 of 583

32-bit CISC Computational Microprocessor Vendors by Revenue (Millions of Dollars) for 1995
1st Intel, $10,365
2nd AMD, $743
3rd Cyrix, $205
4th IBM, $67
...
5th Motorola, $45


32/64-Bit RISC Computational Microprocessor Vendors by Revenue (Millions of Dollars) for 1995
IBM, $525
Motorola, $176
Texas Instruments, $180
NEC, $39
Toshiba, $32
Fujitsu, $51




Last edited by Hammer on 11-Jun-2024 at 12:47 AM.
Last edited by Hammer on 11-Jun-2024 at 12:46 AM.
Last edited by Hammer on 11-Jun-2024 at 12:15 AM.
Last edited by Hammer on 10-Jun-2024 at 03:32 PM.
Last edited by Hammer on 10-Jun-2024 at 03:18 PM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

Hammer 
Re: One major reason why Motorola and 68k failed...
Posted on 11-Jun-2024 1:03:39
#189 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5616
From: Australia

@matthey

Quote:

Quake had a market changing impact but general purpose performance which is mostly integer performance is more important.

The 3D calculations were specialized and moved to the GPU.

One of Amiga's professional niches was Lightwave 3D.

Despite 3D acceleration, the CPU still ran the geometry pipeline, as per DirectX6's SSE/3DNow! optimizations.

Fixed-function T&L only lasted the DirectX7 generation before it was replaced.

The CPU maintains the 3D game simulation logic while the 3D accelerator renders the game's viewport.

https://www.vogonswiki.com/index.php/CPU_Benchmarks
3dmark 99 3D with Voodoo3 2000 PCI

Cyrix MII 2x100 = 21
P200MMX 2x100 = 1001
P200MMX 2.5x100 = 1335
K6-2 2.5x100 = 1628

Breakneck with Voodoo3 2000 PCI
Cyrix MII 2x100 = 13.6fps
P200MMX 2.5x100 = 23.4fps
K6-2 2.5x100 = 20.5fps

Voodoo3 2000 PCI didn't rescue the Cyrix MII's weak FPU.

FPU SinJulia
Cyrix MII 2x100 = 23
P200MMX 2.0x100 = 43
P200MMX 2.5x100 = 53
K6-2 2.5x100 = 40

Quote:

Modern 3D hardware has become more flexible but it is still far from general purpose.

Reminder: Amiga's main attraction is games.

Last edited by Hammer on 11-Jun-2024 at 01:06 AM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

bhabbott 
Re: One major reason why Motorola and 68k failed...
Posted on 11-Jun-2024 6:00:15
#190 ]
Regular Member
Joined: 6-Jun-2018
Posts: 387
From: Aotearoa

@Hammer

Quote:

Hammer wrote:

68060 doesn't work on 68040-aware Macintosh. This YouTuber installed 68060 on Performa 575 via a 68060 to 68040 adapter and it failed to boot.

So? Amiga BASIC doesn't work on 68020 either, largely due to being a brain-dead Mac port. That doesn't mean 68020 is no good.

Quote:
Emu68 emulates every 68K instruction minus MMU.

Call me when they fix that.

Quote:
68K ISA useless on 4GB to 8 GB RAM equipped RPi4 for PiStorm-Emu68. I have 4 MB equipped RPi 4B. Due to AmigaOS, only 2 GB RAM is useable.

'Only' 2 GB RAM, the horror! I wonder how I am able to type this on my Linux PC which has 'only' 2 GB...

You know the emulator needs RAM too, right? So a 4GB RPi4 wouldn't give you 4GB anyway.

Quote:
68060 has no concept of 36-bit Physical Address Extension (PAE).
[sniff]

Quote:
The main reason for PiStorm with RPi 3A+ or 4B is lower cost against rip-off cost from 68060.

68060 isn't a 'rip-off', it's retro. My Cyberstorm MKII 68060 rev 5 with no RAM cost more in 1996 than a BFG9060 with 128MB does today. If we take into account the 128MB RAM and 90% inflation, it's a bargain!

Quote:
Amiga 68060 accelerators are not bound by "embedded", hence overclock can be attempted and the results are not good. My 68060 Rev 1 reached 62.5 Mhz on TF1260 and it's unstable at 74 Mhz.

I only managed 60MHz with full stability on my Cyberstorm. You need a rev 6 CPU for the highest overclocks.

matthey 
Re: One major reason why Motorola and 68k failed...
Posted on 11-Jun-2024 23:39:30
#191 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2150
From: Kansas

Gunnar Quote:

The ColdFire CPUs are targeted at embedded usage.
Motorola always used a production process that was optimized for power saving and never a process that was optimized for speed, so just looking at the nanometers will not tell you this.

If you look at the internal design of the ColdFire you can see that:
- The ColdFire V1/V2/V3 are pretty weak in design.
They are tuned absolutely for minimal cost and minimal size and not for being a powerful CPU.
They are designed like the 68000 and have to do EA and ALU calculation sharing the same HW unit,
which of course makes them slow.


Power, performance and area (PPA) are the 3 chip design tradeoffs. Early ColdFire designs were certainly trying to minimize power and area at the expense of performance to scale ColdFire down as far as possible.

Gunnar Quote:

A lot better in internal design is the ColdFire V4.
The ColdFire V4 is much more powerful - it's designed more like the 68040 in its pipeline.
But the ColdFire V4 is still a lot weaker per clock than the Motorola 68060 and Apollo 68080 CPU.


The 68040 pipeline design was retired and the ColdFire V4 pipeline design is close to a scalar version of the superscalar ColdFire V5 and 68060 pipeline design.

68040
1. Instruction Fetch
2. Decode
3. EA Calc
4. EA Fetch
5. Execute
6. WB

68060
1. Instruction Address Generation
2. Instruction Fetch Cycle
3. Instruction Early Decode
4. Instruction Buffer
*** decoupling instruction buffer ***
5. Decode Instruction and Select
6. Operand Address Generation (EA Calc)
7. Operand Fetch (EA Fetch)
8. Instruction Execution
9. Data Available (optional)
10. WB (optional)

XCF5102 (68040/ColdFire hybrid)
1. Instruction Fetch
2. Decode & Instruction Address Calculate
*** decoupling instruction buffer ***
3. EA Calc
4. EA Fetch
5. Execute
6. WB

ColdFire V2
1. Instruction Address Generation
2. Instruction Fetch
*** decoupling instruction buffer ***
3. Decode & Select Operand
4. Address Generation & Execute

ColdFire V3
1. Instruction Address Generation
2. Instruction Fetch Cycle 1
3. Instruction Fetch Cycle 2
4. Instruction Early Decode
*** decoupling instruction buffer ***
5. Decode & Select Operand
6. Address Generation & Execute

ColdFire V4 and V5
1. Instruction Address Generation
2. Instruction Fetch Cycle 1
3. Instruction Fetch Cycle 2
4. Instruction Early Decode
*** decoupling instruction buffer ***
5. Decode/Select, evaluation of dispatch algorithm
6. Operand Address Generation
7. Operand Fetch Cycle 1
8. Operand Fetch Cycle 2
9. Execute
10. Optional Data Writeback for stores to memory

The XCF5102 pipeline looks like it was derived from decoupling the 68040 pipeline but for later ColdFire pipeline designs the instruction address generation was moved earlier and the Instruction Fetch Pipeline (IFP) lengthened to look more like the 68060 IFP. The ColdFire V4 and V5 pipelines are nearly identical but the V4 pipeline has a single Operand Execution Pipeline (OEP) where the V5 pipeline has dual Operand Execution Pipelines (OEPs) like the 68060.

The ColdFire V2 and V3 OEP designs are more like the 68000-68030 OEP designs that loop through shared address generation and execute stages making them much lower performance. Performance came with separate EA Calc, EA Fetch and Execute stages starting with the 68040 and for ColdFire, V4, even though the V4 pipeline as a whole resembles the 68060 pipeline more. Gunnar likes to ignore the importance of the decoupled instruction buffer. Perhaps he prefers brute force instead of finesse. The 68040 design likely targeted performance more than the 68060 design but it produced more heat than performance as it lacked finesse.

Gunnar Quote:

Again much improved is the ColdFire V5.
The ColdFire V5 is actually a lot closer to the 68060 in its internal design,
which is no wonder as the ColdFire V5 and 68060 were designed by the same person.
There is V5 silicon on the market running over 500 MHz that was produced in an old/slow 130 nm process.

The ColdFire V6 pipeline was lengthened a little bit and therefore tuned to reach over 800 MHz (also in an old and slow process), but it never came to market.
I think with a modern process the ColdFire V6 pipeline design could reach around 1500 MHz.


Motorola documentation shows plans for ColdFire V5 at 615MHz-800MHz using a 100nm process, with an arrow pointing up from there for newer processes. This makes me think that 1.5GHz and maybe even 2GHz would be possible using a 10nm to 40nm process common for embedded use today. This is good considering the ease with which a fully synthesizable core moves to newer processes. ColdFire was the only Motorola/Freescale architecture that was fully synthesizable circa 2000. This was possible despite the performance handicaps of a fully synthesizable core and a weakened 68k ISA because the ColdFire V4 and V5 performance/MHz was still innately generally better than the RISC competition.

CPU        | caches          | DMIPS/MHz
68040      | 4kiB-I/4kiB-D   | 1.1
68060      | 8kiB-I/8kiB-D   | 1.8
ColdFireV3 | 8kiB-U          | 0.8
ColdFireV4 | 16kiB-I/8kiB    | 1.6
ColdFireV5 | 32kiB-I/32kiB-D | 1.8

The ColdFire V5 already adds an extra pipeline stage compared to the 68060, going from 8 to 9 stages. I'm not a fan of moving to a deeper pipeline just to increase clock speed like was planned for the superpipelined ColdFire V6. Superpipelining was a fad that was pushed to the limits (Pentium 4, PPC 970/G5) but generally backed away from to more practical pipeline depths, especially for mostly in-order embedded use (e.g. the ARM Cortex-A8 14-stage pipe was replaced with cores with 8-stage pipes). Deeper pipelines make programming more difficult, which takes away one of the major advantages of the 68k.

I prefer designs which minimize stalls, make programming easier and improve performance without a long pipeline. The RISC-V SiFive U74 core is a good example of what minimizing stalls can do. It uses a 68060-like 8-stage pipeline that removes most load-to-use stalls, and mispredicted branches are only 4-6 cycles. The small 8-stage SiFive U74 core has better performance/MHz than the much wider and deeper 16-stage PPC G5 core.

Competing with x86-64 and POWER for the desktop and server markets is a waste of time. Opportunities to change the world exist in embedded cores with more performance using a small footprint and easier-to-use performance with more consistency and less jitter. The SiFive U74 core shows the way, even though it didn't push performance enough and was handicapped by the weak RISC-V ISA.

Gunnar Quote:

The ColdFire V5 pipeline design is similar to the 68060/68080 internal design.
A 68060 design could be lengthened to reach ColdFire V6 clock rates.
The Apollo 68080 CPU architecture is already designed to reach a higher clock than the 68060 in an ASIC.

If you compare the pipeline designs of the latest ColdFire, the Apollo 68080 CPU, the Intel Atom or some VIA x86 CPUs, then you will see that they have a lot in common.

If you compare these designs you can see that they are internally all very close to each other,
and from a 100-foot view they look like basically the same designs.

Motorola showed that even using an old and slow process they came to nearly a gigahertz clock rate,
and Intel and VIA produced these designs at over 1500-2000 MHz, also using processes that are old by today's standards.


The in-order 7+ stage CISC pipelines may be similar between 68060, ColdFire V5, Cyrix/Via x86 and Intel "Bonnell" Atom cores but there are significant differences in the core designs based on core design decisions and the ISA.

https://en.wikichip.org/wiki/bonnell#Branch_predictor Quote:

The branch-misprediction penalty is 11 to 13 cycles. Some of the rare or complex x86 instructions will detour into a microcode sequencer for decoding, necessitating two additional clock cycles. Additionally there is a roughly 7 cycle penalty for correctly predicted branches but no target can be predicted because of a missing branch target buffer (BTB) entry. Bonnell return stack buffer is 8-entry deep.


CPU core   | pipeline    | branch mispredict cycles
6x86       | 7-stage     | 4-5 (Cyrix)
U74        | 8-stage     | 4-6 (SiFive)
68060      | 8-stage     | 7-8
ColdFireV5 | 9-stage     | 8-9
Bonnell    | 16-19-stage | 11-22 (Intel Atom)

Bonnell is based on the Pentium P5, which had only 5 stages. I'd rather avoid the superpipelined Bonnell, ColdFire V6 and possibly AC68080 extremes.

https://en.wikichip.org/wiki/intel/microarchitectures/bonnell#Back_End Quote:

Each cycle two instructions are dispatched in-order. The scheduler can take a pair of instructions from a single thread or across threads. Bonnell in-order back-end resembles a traditional early 90s design featuring a dual ALU, a dual FPU and a dual AGU. Similarly to the front-end, in order to accommodate simultaneous multithreading, the Bonnell design team chose to duplicate both the floating-point and integer register files. The duplication of the register files allows Bonnell to perform context switching on each stage by maintaining duplicate states for each thread. The decision to duplicate this logic directly results in more transistors and larger area of the silicon. Overall implementing SMT still required less power and less die area than the other heavyweight alternatives (i.e., out-of-order and larger superscaler). Nonetheless the total register file area accounts for 50% of the entire core's die area which was single-handedly an important contributor to the overall chip power consumption.


The total register file area accounts for 50% of the entire Bonnell core's die area, and early Bonnell chips used ~47 million transistors compared to the 68060's ~2.5 million. Some of those transistors are due to using 8-transistor instead of 6-transistor per-bit SRAM. SRAM is far from free, as it is what is not scaling well with newer chip processes.

https://fuse.wikichip.org/news/7343/iedm-2022-did-we-just-witness-the-death-of-sram/

If talking about older embedded CPUs, the QorIQ P1022 CPU used in the A1222 is a good example of trying to scale down a fat ISA with many architectural registers.

Gunnar Quote:

I would say making a 68k ASIC with a clock rate of 1000-2000 MHz is absolutely possible.
Considering that AmigaOS runs very fast on a 68060 at 100 MHz, and that AmigaOS absolutely flies on the Apollo 68080 CPU today, a machine with a gigahertz 68080 and AmigaOS will feel insanely fast.


Yes, a superpipelined ASIC desktop 68k CPU with lots of non-orthogonal registers and without MMU and SMP should absolutely fly. I'm afraid it is not what the 68k Amiga needs to compete with RPi hardware though.

Last edited by matthey on 11-Jun-2024 at 11:52 PM.

Hammer 
Re: One major reason why Motorola and 68k failed...
Posted on 12-Jun-2024 0:45:38
#192 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5616
From: Australia

@bhabbott

Quote:
So? Amiga BASIC doesn't work on 68020 either, largely due to being a brain-dead Mac port. That doesn't mean 68020 is no good.


Amiga Basic will run on 020+ machines as long as you disable Fast RAM. Stock A1200 doesn't have Fast RAM.

Quote:

Call me when they fix that.

For being 68K MMU-less, Emu68 is in a similar boat to the AC68080 V4.

Quote:

Only' 2 GB RAM, the horror! I wonder how I am able to type this on my Linux PC which has 'only' 2 GB...

You know the emulator needs RAM too, right? So a 4GB RPi4 wouldn't give you 4GB anyway.

1. Emu68 uses a tiny amount of RAM.

2. Matt's argument is pro-68060 against ARM in a modern context.

Quote:

68060 isn't 'rip-off', it's retro. My Cyberstorm MKII 68060 rev 5 with no RAM cost more in 1996 than a BFG9060 with 128MB does today. If we take into account the 128MB RAM and 90% inflation its a bargain!

So is TheA500mini and this product reached mainstream stores.

Matt's argument is pro-68060 against ARM in a modern context. 68060 is not cost competitive.



Last edited by Hammer on 12-Jun-2024 at 12:58 AM.
Last edited by Hammer on 12-Jun-2024 at 12:54 AM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

Hammer 
Re: One major reason why Motorola and 68k failed...
Posted on 12-Jun-2024 1:06:35
#193 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5616
From: Australia

Quote:

@matthey
The ColdFire V5 already adds an extra pipeline stage compared to the 68060, going from 8 to 9 stages.

https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Coldfire-v5e/m-p/765957

Quote:
Actually ColdFire V5 product does not in chip market. Only HP company is using special V5 chips in the laser printer.
- miduo, NXP Employee, 2017.

https://www.eetgroup.com/en-gb/rp000320870-hp-laserjet-m1522nf-mfp-wid-w125072004
HP LaserJet's ColdFire V5e with a 450 MHz clock speed.


https://www.rentex.com/wp-content/uploads/2016/07/PRN3305.pdf
HP LaserJet P4014n's ColdFire V5e with a 550 MHz clock speed.


https://www.nxp.com/docs/en/supporting-information/RPT68KFORUM.pdf
From NXP's Motorola ColdFire Core Roadmap, Page 5 of 15: at the 0.10 micron process node, ColdFire V6 has 615 to 800 MHz.

ColdFire V6 was canceled.

Quote:

@matthey

Superpipelining was a fad that was pushed to the limits (Pentium 4, PPC 970/G5) but generally backed away from to more practical pipeline depths

POWER9 has a 12-stage pipeline, 5 stages shorter than POWER8's.

Last edited by Hammer on 12-Jun-2024 at 01:22 AM.
Last edited by Hammer on 12-Jun-2024 at 01:21 AM.
Last edited by Hammer on 12-Jun-2024 at 01:19 AM.
Last edited by Hammer on 12-Jun-2024 at 01:17 AM.
Last edited by Hammer on 12-Jun-2024 at 01:13 AM.
Last edited by Hammer on 12-Jun-2024 at 01:10 AM.
Last edited by Hammer on 12-Jun-2024 at 01:09 AM.
Last edited by Hammer on 12-Jun-2024 at 01:07 AM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

matthey 
Re: One major reason why Motorola and 68k failed...
Posted on 12-Jun-2024 4:12:02
#194 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2150
From: Kansas

Hammer Quote:

You can't remove the C++ Compiler's quantity with the hand assembly version. This is why Intel invested in their C++ compiler.


Compilers are very important. The 68k compiler support has deteriorated and compiled code quality with it. Any comparison of the 68k using a modern compiler will result in a major handicap for the 68k. Even at its peak support, the 68k was just starting to get compiler support close to that of x86. Sometimes hand-optimization "contests" are useful to ignore compiler support and see what the ISA is capable of. Vince Weaver's mistake was not starting with compiled code and hand optimizing from there. This would have prevented the 68k's poor results in papers that still rank higher in searches than his website with its more accurate data.

Hammer Quote:

The bulk of modern Linux and GNU middleware are in platform portable source code.

Your example is not realistic for portable GNU middleware and Linux source code.


You selected Vince Weaver's benchmark. I pointed you to more accurate data.

Hammer Quote:

https://github.com/deater/ll_asm/tree/master/binaries/native/i386
i386 LZSS reached 57 bytes.


Indeed. He updated the i386 results from 63 to 57 bytes but did not change the m68k results from 58 to 54 bytes. I lost interest when he stopped updating the m68k results.

Code density declined from 808x to x86 to x86-64, but not too badly if optimizing for size. Most of the earlier instructions can still be encoded in later ISAs, but the shortest encodings are for the original 8 integer registers and the stack, resulting in lots of memory traffic. The INC and DEC instructions grew from 1 byte to 2 bytes, which hurt x86-64 for LZSS, as the 1-byte encodings were repurposed.

Hammer Quote:

Reminder, x86-64 still runs IA-32 as is.

Long mode REX prefix must be encoded when:
1. using 64-bit operand size and the instruction does not default to 64-bit operand size; or
2. using one of the extended registers (R8 to R15, XMM8 to XMM15, YMM8 to YMM15, CR8 to CR15 and DR8 to DR15); or
3. using one of the uniform byte registers SPL, BPL, SIL or DIL.


So x86-64 instructions can use 32-bit operations and the original 8 GP registers without a prefix. Enter the code bloat. Even after this mistake, some amateur architects still think it is worthwhile to add prefixes for 64-bit operations in a 64-bit CPU.

Hammer Quote:

Not everything needs 64-bit integers. On Windows, the standard ABI is 32-bit long. A 64-bit type is available as int64_t or long long. Reference https://learn.microsoft.com/en-au/windows/win32/winprog/windows-data-types?redirectedfrom=MSDN

Linux X64's "long" datatype is 64-bits.

Windows X64's "long" is 32-bit by ABI default.


The prefixes needed for 64-bit integers often bloat the code enough that 32-bit integers are faster. An early benchmark of SPEC CPU2000 integer code using 64-bit x86-64 code vs 32-bit x86 code in compatibility mode showed less than a 1% performance gain, even though the 64-bit mode had 16 instead of 8 integer registers. A later SPEC CPU2006 integer benchmark showed a 7% performance advantage for x86-64 code vs x86 code after the code was changed. It's nice to have compiler and benchmark support.

Performance Characterization of SPEC CPU2006 Integer Benchmarks on x86-64 Architecture
https://ece.northeastern.edu/groups/nucar/publications/SWC06.pdf

Hammer Quote:

68060 has no 64-bit GPRs.
68060 has no FMA3.
68060 has no popcount.
68060 has no vector FP math.
68060 has no vector integer math.

The maximum extent for IA-32 has 8 32-bit GPRs, 8 80-bit FPR/8 integer 64-bit vectors and 8 128-bit registers (scalar, vector, integer and FP).


The 68k can avoid wasting registers on small 64 bit SIMD registers like MMX and save encoding space for larger vectors. Only amateur architects would make the same mistakes as x86(-64).

Hammer Quote:

68060 missing a few instructions https://www.youtube.com/watch?v=ofXWIVMdsmc
68060 doesn't work on 68040-aware Macintosh. This YouTuber installed 68060 on Performa 575 via a 68060 to 68040 adapter and it failed to boot.


Earlier versions of the 68k MacOS were more compatible with the 68060. Apple may have deliberately made changes to later versions of MacOS to reduce compatibility with the 68060.

Hammer Quote:

68060 has no concept of 36-bit Physical Address Extension (PAE).


The 68k has SFC and DFC registers which act like address space extensions or memory banks. The MOVES instruction accesses the different spaces or banks. If the external address lines associated with SFC and DFC existed and were connected, this may be adequate for allowing 4GiB of memory per task/process. I'm not sure how shared memory between tasks/processes would work though.
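A rough sketch of how MOVES is used (supervisor mode only; the function code value and registers are illustrative):

        moveq   #1,d0           ; function code 1 = user data space
        movec   d0,sfc          ; select the source address space
        moves.l (a0),d1         ; read a long word from the SFC-selected space
        movec   d0,dfc          ; select the destination address space
        moves.l d1,(a1)         ; write the long word to the DFC-selected space

Whether this could actually reach extra banks depends on the external address lines mentioned above being present and connected.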

Hammer Quote:

What's 68K's instruction sequence for the equivalent AVX-512 read-and-write instructions?

Examples
VPGATHERDD, Gathers 32-bit integers from memory using 32-bit indices
VPGATHERDQ, Gathers 64-bit integers from memory using 32-bit indices.
VPGATHERQD, Gathers 32-bit integers from memory using 64-bit indices.
VPGATHERQQ, Gathers 64-bit integers from memory using 64-bit indices.

VPSCATTERDD, Scatters 32-bit integers into memory using 32-bit indices.
VPSCATTERDQ, Scatters 64-bit integers into memory using 32-bit indices.
VPSCATTERQD, Scatters 32-bit integers into memory using 64-bit indices.
VPSCATTERQQ, Scatters 64-bit integers into memory using 64-bit indices.

The 512-bit register has 16 32-bit or 8 64-bit data elements.

AVX-512 has floating-point instructions. AVX-512 is very useful for software rendering.


Without a SIMD unit, integer operations would be done by the CPU and fp operations by the FPU. The CPU and FPU are more flexible, while a SIMD unit is much higher performance for the tiny percentage of code where large amounts of data are formatted correctly in memory. An FPU is adequate for playing old games like Quake.
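For the gather question above, a hypothetical scalar 68020+ sequence (illustrative register assignments) gathering 16 32-bit elements looks something like this, i.e. 16 iterations for what one VPGATHERDD does:

        ; a0 = base table, a1 = 32-bit indices, a2 = destination
        moveq   #15,d1            ; 16 elements
gather: move.l  (a1)+,d0          ; fetch the next 32-bit index
        move.l  (a0,d0.l*4),d2    ; load base[index] via 68020+ scaled indexing
        move.l  d2,(a2)+          ; store to the destination buffer
        dbra    d1,gather         ; loop until the counter underflows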

VisionFive 2 TyrQuake
https://youtu.be/QUgpWp0OEg0?t=120

The video is 640x480 using software rendering on the sub $100 VisionFive 2 SBC which does not have a SIMD unit. It uses SiFive U74 cores and the FPU is not particularly high performance either. It still easily outperforms a 68060@100MHz because it is clocked higher and more modern.

Hammer Quote:

68060 wasn't cheap when compared to ARM or other RISC competitors. No 68060 won a game console design win. Amiga Hombre is a $40 solution. Sega selected SuperH2 ahead of 68030.


Most of the consoles at that time did not use as powerful a CPU or as large caches, which made the 68060 more expensive. Motorola promoted their PPC CPUs over the EOL 68060 for consoles too. The 3DO successor Panasonic M2 was going to use PPC, as did the Apple Pippin. The original Pippin idea was to use a 68030 and be MacOS compatible. Later, the Nintendo GameCube, the first MS XBox and the PS3 used PPC, but the 68060@50MHz was low clocked, EOL and likely forgotten by then.

Hammer Quote:

https://www.nxp.com/docs/en/supporting-information/RPT68KFORUM.pdf
From NXP's Motorola ColdFire Core Roadmap, Page 5 of 15, 0.10 micron process node Cold Fire V6 has 615 to 800 Mhz..

Cold Fire V6 was canceled.


That was the document and roadmap I was looking at. I thought the 615-800MHz referred to the ColdFire V5 as the clock speed is above the oval like at the previous node. The clock speed is between the ColdFire V5 and V6 but there is a good chance it refers to the ColdFire V6 considering the ColdFire V5 at 130nm is only 300-366MHz. Still, HP was reaching over 500MHz with the ColdFire V5 using an old process. At least 1GHz using a fully synthesizable core is still likely with a 10-40nm process.

Hammer Quote:

POWER9 has a 12 stage pipeline, 5 stage shorter than POWER8.


Even IBM learned that reducing the pipeline depth results in a more practical CPU with fewer stalls. This was a different philosophy from when superpipelining was popular and pipeline lengths were increased to clock up the core and provide more ILP. The desktop PPC G5 increased the POWER4 integer pipeline length from 12 stages to 16 stages.

https://www.cs.tufts.edu/comp/150PAT/arch/power/PPC970_MPF_Review.pdf Quote:

The deeper pipeline is especially welcome because it could allow the 970 to reduce the growing clock-frequency gap between today’s PowerPC chips and the speedy x86 competition. G4 processors and their G4+ derivatives are stunted by short pipelines of only five or seven stages, compared with at least twenty stages in the superpipelined Pentium 4. That some G4+ chips manage to run at 1.25GHz is a testament to careful circuit design and good fabrication process. Imagine what might be possible with the 970, whose pipelines are 16 stages deep for integer operations and up to 25 stages deep for SIMD operations. Table 1 compares some vital statistics for these processors.


PPC's shallow pipelines were a problem for clock speed, but shallow-pipeline, limited-OoO PPC cores did a good job of minimizing load-to-use stalls and branch misprediction stalls, making them more practical than the PPC G5. The Motorola G4+ with a 7-stage pipeline, shown in Table 1 of the link above, was a good practical compromise that still allowed decent clock speeds.

Last edited by matthey on 12-Jun-2024 at 02:41 PM.
Last edited by matthey on 12-Jun-2024 at 04:21 AM.

Hammer 
Re: One major reason why Motorola and 68k failed...
Posted on 12-Jun-2024 7:54:56
#195 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5616
From: Australia

@matthey

Quote:

Compilers are very important. The 68k compiler support has deteriorated and compiled code quality with it. Any comparison of the 68k using a modern compiler will result in a major handicap for the 68k. Even at its peak support, the 68k was just starting to get some compiler support close to that of x86. Sometimes hand optimization "contests" are useful to ignore compiler support and see what the ISA is capable of. Vince Weaver's mistake was not starting with compiled code and hand optimizing from there. This would have prevented the 68k poor results in papers that are still highlighted in searches over his website with more accurate data.

The platform's potential is influenced by the CPU and toolchain's quality.

Learn from PS1's good quality Psy-Q SDK against the competition.

Quote:

Indeed. He updated the i386 results from 63 to 57 bytes but did not change the m68k results from 58 to 54 bytes. I lost interest when he stopped updating the m68k results.

Code density declined from 808x to x86 to x86-64 but not too bad if optimizing for size.

It depends on the workload, e.g. x86 with SSE has the advantage of doing FADD or FMUL on four IEEE FP32 values from a single instruction, while the 68060 is frozen in time with scalar FADD or FMUL, i.e. four math instructions.
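For example, where SSE adds four packed FP32 values with one ADDPS, the 68060 FPU needs a scalar sequence per element (a sketch; register use is illustrative):

        fmove.s (a0)+,fp0       ; load one FP32 element
        fadd.s  (a1)+,fp0       ; add the matching element
        fmove.s fp0,(a2)+       ; store one FP32 result
        ; ...repeated (or looped) three more times for the remaining elements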

128-bit SSE has 8-bit packed math optimized for 8-bit color spaces, i.e. 16 8-bit data elements per 128-bit SSE register.

IA-32 gains VNNI for 8-bit/16-bit packed math with a 32-bit accumulator on AVX-256 and AVX-512 registers.

i386 works on X86-64 as is.

Linux defaults the C++ long datatype to 64 bits, while Windows x64 defaults it to 32 bits.

Seeking code density while using 64-bit large memory is contradictory.

You're not going to win with modern gaming and AI workloads.

For AMD Jaguar, with FADD/FMUL 128-bit SIMD hardware and two x86 decoders, AVX-256 is used to conserve instruction issue slots.


Quote:

Most of the earlier instructions can still be encoded in later ISAs but the shortest encodings are for the original 8 integer registers and the stack resulting in lots of memory traffic.

CPU's cache was designed to reduce external memory traffic. Mitigation against excessive external memory traffic is a microarchitecture implementation issue.

IA-32 includes 8 128-bit SSE registers. X86 is optimized for register spillover on FPR/MMX and SSE registers.

IA-32 has access to another 8 256-bit AVX registers.

IA-32
8 32-bit GPR,
8 80-bit FPR, 8 64-bit MMX,
8 128-bit SSE to SSE4.2, 8 256-bit AVX to AVX2, 8 512-bit AVX-512.
8 K0-K7 (opmask registers for AVX-512)

The 68K doesn't have enough registers to rival a GPGPU's multiple thousands of registers.

Quote:

The INC and DEC instruction grew from 1 byte to 2 bytes which hurt x86-64 for LZSS as the encodings were repurposed.

Reminder, X86-64 can still run IA-32 code. 68060 has a 32-bit memory address range limit.
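On the encoding point in the quote above: the short INC/DEC opcodes (0x40-0x4F) were repurposed as REX prefixes in 64-bit mode, so INC falls back to the two-byte FF /0 form. A small illustration (byte values shown as C arrays, my own example):

/* IA-32: inc eax is a single byte (opcode 0x40) */
unsigned char inc_eax_ia32[]   = { 0x40 };
/* x86-64: 0x40-0x4F are now REX prefixes, so inc eax needs FF /0 (2 bytes) */
unsigned char inc_eax_x86_64[] = { 0xFF, 0xC0 };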

Quote:

The 68k can avoid wasting registers on small 64 bit SIMD registers like MMX and save encoding space for larger vectors. Only amateur architects would make the same mistakes as x86(-64).

With SIMD extensions, 68K is mostly missing in action.

PC's Tomb Raider III has MMX 3D software rendering with Bilinear Filter and still works with good performance on Zen 4 (e.g. Ryzen 5 7600X).

AMD's K6-2 MMX was modified for 3DNow. Both 3DNow and SSE SIMD extensions are abstracted by DirectX6's geometry pipeline.

Only fools exit the CPU market the way Motorola did. Motorola's 68K was kicked out of the games console market for a reason.

Quote:

VisionFive 2 TyrQuake
https://youtu.be/QUgpWp0OEg0?t=120

The video is 640x480 using software rendering on the sub $100 VisionFive 2 SBC which does not have a SIMD unit. It uses SiFive U74 cores and the FPU is not particularly high performance either. It still easily outperforms a 68060@100MHz because it is clocked higher and more modern.

1. Quake 2 has a 3DNow SIMD patch. https://www.moddb.com/games/quake-2/downloads/quake-2-v320-patch-3dnow-3dfx-minigl-updates

The problem with Quake's OpenGL is that it lacks DirectX6's geometry pipeline abstraction for CPU SIMD extensions.

In 1998, Microsoft’s DirectX6 runtime executed the geometry pipeline on the three variants of the x86 instruction set i.e. x87, 3DNow!, and SSE.
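The general runtime-dispatch idea is simple enough to sketch in C (this is only an illustration of the technique, not Microsoft's actual code; the function names are made up):

typedef void (*xform_fn)(float *verts, int count);

static void xform_x87(float *v, int n)   { (void)v; (void)n; /* plain FPU path */ }
static void xform_3dnow(float *v, int n) { (void)v; (void)n; /* 3DNow! path    */ }
static void xform_sse(float *v, int n)   { (void)v; (void)n; /* SSE path       */ }

/* Pick the geometry-transform routine once at startup. In a real runtime the
   flags come from CPUID (SSE: leaf 1, EDX bit 25; 3DNow!: leaf 0x80000001,
   EDX bit 31). */
xform_fn pick_xform(int has_sse, int has_3dnow)
{
    if (has_sse)   return xform_sse;
    if (has_3dnow) return xform_3dnow;
    return xform_x87;
}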

2. Where's Quake demo1 benchmark?

3. SiFive U74 is not 68060.

4. StarFive is partly CPC state-owned.

5. Rendering needs memory bandwidth and the 68060 has an obsolete CPU external bus.

Quote:

Most of the consoles at that time did not use as powerful of CPU or as large of caches which made the 68060 more expensive.

68060's price policy is a Motorola issue.

MMU-less 68EC040 reaches about $108, but is useless with DMA'ed devices.

https://archive.computerhistory.org/resources/access/text/2013/04/102723262-05-01-acc.pdf
Page 119 of 981

For 1992
68000-12 = $5.5
68EC020-16 PQFP = $16.06
68EC020-25 PQFP = $19.99

68EC030-25 PQFP = $35.94
68030-25 CQFP = $108.75

68040-25 = $418.52
68EC040-25 = $112.50, useless with DMA'ed devices. Useless for Apple's, Amiga's, Atari's, and Sharp's use cases.

Competition

AM386-40 = $102.50
386DX-25 PQFP = $103.00

486SX-20 PQFP = $157.75, functional for a PC with DMA'ed devices! Is somebody alive at Motorola?
486DX-33 = $376.75
486DX2-50 = $502.75


The BOM cost for the 68EC040-25, 68LC040-25, and 68040-25 is the same since they use the same silicon die, hence it's a Motorola profit expectation issue.


Quote:

Motorola promoted their PPC CPUs over the EOL 68060 for consoles too. The 3DO successor Panasonic M2 was going to use PPC as did the Apple Pippin. The original Pippin idea was to use a 68030 and be MacOS compatible. Later, the Nintendo GameCube, first MS XBox and PS3 used PPC but the 68060@50MHz was low clocked, EOL and likely forgotten by then.

PowerPC 602 was IBM's toy.

https://3dodev.com/documentation/hardware/m2/ppc602
The 3DO M2 has two IBM PowerPC 602 CPUs @ 66 MHz in a master/slave configuration.

Bandai Pippin's hardware, middleware and OS were designed by Apple, hence the move towards PowerPC.

The original Pippin, with its 16 MHz Motorola 68030 Macintosh Classic II base, was just an Atari Falcon 030 rehash. The 16 MHz Motorola 68030 was replaced by Motorola's PowerPC 603 @ 66 MHz.

Bandai is not large enough to directly compete against Sony's PS1.

3D on a 68030 @ 16 MHz with a 16-bit bus for a 1995 release is as comical as the 1996 Amiga Walker's stupidity.

Last edited by Hammer on 12-Jun-2024 at 08:47 AM.
Last edited by Hammer on 12-Jun-2024 at 08:42 AM.
Last edited by Hammer on 12-Jun-2024 at 08:34 AM.
Last edited by Hammer on 12-Jun-2024 at 08:30 AM.
Last edited by Hammer on 12-Jun-2024 at 08:12 AM.
Last edited by Hammer on 12-Jun-2024 at 08:06 AM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

 Status: Offline
Profile     Report this post  
Lou 
Re: One major reason why Motorola and 68k failed...
Posted on 12-Jun-2024 15:26:44
#196 ]
Elite Member
Joined: 2-Nov-2004
Posts: 4199
From: Rhode Island

@all
Quote:

68060's price policy is a Motorola issue.

MMU-less 68EC040 reaches about $108, but is useless with DMA'ed devices.

https://archive.computerhistory.org/resources/access/text/2013/04/102723262-05-01-acc.pdf
Page 119 of 981

For 1992
68000-12 = $5.5
68EC020-16 PQFP = $16.06
68EC020-25 PQFP = $19.99

68EC030-25 PQFP = $35.94
68030-25 CQFP = $108.75

68040-25 = $418.52
68EC040-25 = $112.50, useless with DMA'ed devices. Useless for Apple's, Amiga's, Atari's, and Sharp's use cases.


As I keep repeating:

Too little (performance)
Too late (delivery of product improvements compared to competitors)
Too expensive (self-explanatory)

 Status: Offline
Profile     Report this post  
matthey 
Re: One major reason why Motorola and 68k failed...
Posted on 13-Jun-2024 5:59:43
#197 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2150
From: Kansas

Hammer Quote:

The platform's potential is influenced by the CPU and toolchain's quality.

Learn from PS1's good quality Psy-Q SDK against the competition.


I can see that the absence of mass produced 68k hardware, and the reliance on emulated 68k hardware, have done nothing for 68k compiler support and platforms. I have seen the 68k AmigaOS stop moving forward with CPU technology support and instead move backward to a 16 bit 68000 target. I can see what mass produced RPi hardware did for ARM and RISC OS. I tried to solve the 68k compiler problem by developing for VBCC and pushing for a 68k ASIC. I understand the problem and the solution but some things are too hard.

Hammer Quote:

It depends on workload e.g. X86 with SSE has the advantage with FADD or FMUL four IEEE FP32 math from a single instruction while the 68060 has frozen in-time scalar FADD or FMAL i.e. four math instructions.

128-bit SSE has an 8-bit pack math optimized for 8-bit color space i.e. 16 8-bit data elements per 128-bit SSE register.

IA-32 gains VNNI for 8-bit/16-bit pack math with 32-bit accumulator on AVX-256 and AVX-512 registers.


Do SIMD instructions actually improve code density though?

Photoshop x86
Code size: 5,634,556 bytes
Instruction count: 1,746,569

Class   | Count   | %     | Avg size
INTEGER | 1631136 | 93.39 | 3.2
FPU     |  114521 |  6.56 | 3.2
SSE     |     912 |  0.05 | 4.0
AVXFAKE |     912 |  0.05 | 5.0

Instruction Size | Count
2 440045
3 419064
1 298078
5 190101
6 157035
4 136394
7 97835
10 4660
8 2411
11 829
9 117
Average length: 3.2

---

Photoshop x86-64
Code size: 7,556,180 bytes
Instruction count: 1,737,331

Class   | Count   | %     | Avg size
INTEGER | 1638505 | 94.31 | 4.3
SSE     |   93942 |  5.41 | 5.2
AVXFAKE |   93942 |  5.41 | 5.3
FPU     |    4884 |  0.28 | 3.1

Instruction Size | Count
3 362650
5 353288
4 283352
2 240530
8 172284
7 131164
1 91535
6 80322
9 12945
10 3725
11 3257
12 1997
13 126
14 89
15 67
Average length: 4.3

https://www-appuntidigitali-it.translate.goog/18054/statistiche-su-x86-x64-parte-1-macrofamiglie-di-istruzioni/?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en-US&_x_tr_pto=wapp
https://www-appuntidigitali-it.translate.goog/18095/statistiche-su-x86-x64-parte-2-distribuzione-per-dimensione/?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en-US&_x_tr_pto=wapp

Photoshop x86 mostly uses FPU instructions for floating point while Photoshop x86-64 mostly replaces FPU instructions with SIMD unit instructions. The number of instructions decreased with SIMD instruction use but not as much as maybe expected. Many of the SIMD instructions are likely scalar used as FPU instruction replacements but the SIMD unit instructions are larger than the FPU instructions in order to encode vector SIMD operations and because of the lack of encoding space (encoding space taken by the deprecated FPU does not help). Code density of the Photoshop x86 code is 25% better than for the x86-64 code. RISC-V research showed that every 25%-30% code density improvement is like doubling the instruction cache size.
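The 25% figure falls straight out of the quoted code sizes; a quick sanity check in C:

#include <stdio.h>

int main(void)
{
    const double x86    = 5634556.0;  /* Photoshop x86 code size (bytes)    */
    const double x86_64 = 7556180.0;  /* Photoshop x86-64 code size (bytes) */
    printf("x86-64 is %.1f%% larger than x86\n", (x86_64 / x86 - 1.0) * 100.0);   /* ~34.1% */
    printf("x86 is %.1f%% smaller than x86-64\n", (1.0 - x86 / x86_64) * 100.0);  /* ~25.4% */
    return 0;
}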

The large x86-64 instructions are powerful but require more hardware and limit how much x86-64 can scale down. Hardware challenges from large instructions can be seen with the in-order Intel Atom which was an attempt to scale x86(-64) lower but was abandoned.

https://en.wikichip.org/wiki/intel/microarchitectures/bonnell#Front_End Quote:

Front End

Bonnell's front end is very simple when compared to Intel's high-performance architectures. Out-of-order execution (OoOE) that is found ubiquitously in all HPC architectures was rejected. Bonnell's power and area constraints simply couldn't allow for the complex logic needed to support that capability. The Instruction Fetch consists of 3 stages, capable of going through up to 16 bytes per cycle. Like fetch, the Instruction Decode is also 3 stages, capable of decoding instructions with up to 3 prefixes each cycle (considerably longer for more complex instructions).

Bonnell is a departure from all modern x86 architectures with respect to decoding (including those developed by AMD and VIA and every Intel architecture since P6). Whereas modern architectures transform complex x86 instructions into a more easily digestible µop form, Bonnell does almost no such transformations. The pipeline is tailored to execute regular x86 instructions as single atomic operations consisting of a single destination register and up to three source-registers (typical load-operate-store format). Most instructions actually correspond very closely to the original x86 instructions. This design choice results in lower complexity but at the cost of performance reduction. Bonnell has two identical decoders capable of decoding complex x86 instructions. Being a variable length instruction architecture introduces an additional layer of complexity. To assist the decoders, Bonnell implements predecoders that determine instruction boundaries and mark them using a single-bit marker. Two cycles are allocated for predecoding as well as L1 storage. Boundary marks are also stored in the L1 eliminating the need to perform needlessly redundant predecoding. Repeated operations are retrieved pre-marked eliminating two cycles. Bonnell has a 36 KiB L1 instruction cache consisting of 32 KiB instruction cache and 4 KiB instruction boundary mark cache. All instructions (coming from either cache or predecode) must undergo full decode. It's worthwhile noting that Intel states Bonnell is a 16-stage pipeline because for the most part, after a cache hit you'll have 16 stages. This is also true in some cases where the processor can simultaneously decode the next instruction. However, in the cases where you get a miss, it will cost 3 additional stages to catch up and locate the boundary for that instruction for a total of 19 stages.

Some x86 instructions are simply too complex to handle directly. Those selected few get diverted into the micro-code sequencer ROM (MSROM) for decoding producing much more sane RISCish instructions at the cost of 2 additional cycles. Intel estimates that only 5% of common software require instructions to be split up. Only decoder0 can request transfer to use the MSROM. All instructions longer than 8 bytes or instructions having more than three prefixes will result in a MSROM transfer unconditionally. Those instructions will experience two cycles of delay. The inability to execute things out-of-order eliminates lots of optimization opportunities at this stage. One thing Bonnell can do is lockstep instructions that can be executed simultaneously such as in the case of instructions that perform a memory access along with an arithmetic operation. In those instances Bonnell will issue the instruction as if it were two separate instructions executing simultaneously. In addition, only one x87 instruction can be decoded per cycle.


The 68060 only fetches 4 bytes/cycle and can only superscalar issue and single cycle execute instructions of 6 bytes or less due to arbitrary design choices, but it is less handicapped than Bonnell thanks to tiny 68k code and small instructions. It does not need microcode decoding or cache boundary markers. It is easy to see that large x86-64 code and large instructions were a problem for Bonnell, which was not true for x86 code, but x86 code has other issues like memory traffic due to 7 GP registers, an awkward stack based FPU, and existing x86-64 software, which made returning to a lower hardware spec difficult.

Hammer Quote:

i386 works on X86-64 as is.

Linux defaults C++ long datatype to 64 bits while WIndows X64's version defaults to 32 bits.

Seeking code density while using 64-bit large memory is contradictory.


I strongly disagree. There is a competition for 64 bit ISAs and the best code density ISAs have a big advantage.

Itanium
Alpha
MIPS
SPARC
PPC
AArch64
RISCV64IMC
x86-64
68k64?

There was a similar competition for 32 bit ISAs.

PA-RISC
MIPS
SPARC
m88k
PPC
ARM (original, replaced by AArch64 but still in use)
x86 (replaced by x86-64 but still in use)
SuperH (revived by the J-Core project based on Vince Weaver's bad code density paper)
ARM Thumb1&2 (used in ARM Cortex-M cores)
68k (why is the best code density ISA ignored except for small retro FPGA cores?)

Code density appears to be important for ISA survival. With Moore's Law ending, ISA is likely to play a more important role in core competitiveness.

Hammer Quote:

CPU's cache was designed to reduce external memory traffic. Mitigation against excessive external memory traffic is a microarchitecture implementation issue.


Cache accesses are more expensive than register accesses. Simple CPU cores can only perform one data mem/cache access per cycle. High performance CPU cores may be able to perform two data mem/cache accesses per cycle with dual ported and/or banked data caches, but increases above this become very expensive in hardware and access times may require more than a single cycle. This means too much memory traffic can create a bottleneck to the caches. Most x86-64 cores have dual ported data caches which handle traffic very well until the bottleneck is reached, which it appears to be with only 8 GP registers.

SPEC2000 integer benchmarks
GP regs | mem refs | performance loss
8       | +42%     | 4.4%
12      | +14%     | 0.9%
16      |   0%     | 0%

SPEC2000 floating point benchmarks
GP regs | mem refs | performance loss
8       | +78%     | 5.6%
12      | +29%     | 0.3%
16      |   0%     | 0%

https://link.springer.com/content/pdf/10.1007/11688839_14.pdf Quote:

Overall, for the SPEC2000 benchmark suite, we can see the trend that when the number of available registers is increased from 8 to 12, the memory accesses are reduced dramatically, and moderate performance improvement is achieved. However, with more than 12 registers, although the memory accesses are further reduced, the performance is barely improved.

Once the number of allocatable registers reaches a certain threshold, results show that the performance cannot be improved further even given more registers. We suspected that this phenomenon is due to the powerful out-of-order execution engine of the x86 core. In addition, comprehensive compiler optimization techniques can reduce the register pressure by folding some of the memory load instructions into other instructions as operands.


The 68k starts with 16 GP integer registers which appears to be adequate for a CISC CPU to avoid bottlenecking data cache accesses with too much data memory traffic. The floating point results above include increasing the number of XMM registers from 8 to 16 which also appears to be adequate for a CISC CPU. The 68k may need to increase the FPU registers from 8 to 16 to avoid the data cache bottleneck. More than 16 GP registers for a CISC core like the one used in the paper is unlikely to provide more than 1% overall performance gain. ISA was important with the 68k ISA likely to provide a 4.4% integer performance advantage over the x86 ISA as benchmarked in the paper which the "microarchitecture implementation" could not solve. The x86-64 ISA solves the memory traffic performance issue from lack of GP registers but the loss of code density may be like reducing the instruction cache to half the size. The 68k does not need to make such tradeoffs. Upgrading 68k FPU registers from 8 to 16 is unlikely to have much effect on code density and may even improve it. Using prefixes to increase the number of registers decreases code density.

https://link.springer.com/content/pdf/10.1007/11688839_14.pdf Quote:

In addition, Luna et al. show that the code size could be increased by 17% due to REX prefixes.


The x86-64 code density problem is that it needs an instruction prefix to reach more than 8 registers and to perform 64 bit operations on a 64 bit CPU, both of which are too common. There may also be x86-64 decoder overhead and large instruction size issues from prefixes, judging from Bonnell. Part of the large instruction size issue is the lack of available encoding space for x86 with its variable length 8 bit granularity encoding, which also complicates decoding compared to a variable length 16 bit granularity encoding like the 68k uses.
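A concrete example of the prefix cost (byte encodings shown as C arrays; my own illustration): the same register-to-register ADD grows by one byte as soon as a 64-bit operation or one of the new registers r8-r15 is involved.

unsigned char add_eax_ebx[] = { 0x01, 0xD8 };        /* add eax, ebx  - 2 bytes         */
unsigned char add_rax_rbx[] = { 0x48, 0x01, 0xD8 };  /* add rax, rbx  - REX.W, 3 bytes  */
unsigned char add_r8d_r9d[] = { 0x45, 0x01, 0xC8 };  /* add r8d, r9d  - REX.RB, 3 bytes */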

Last edited by matthey on 13-Jun-2024 at 06:09 AM.

 Status: Offline
Profile     Report this post  
matthey 
Re: One major reason why Motorola and 68k failed...
Posted on 13-Jun-2024 6:59:43
#198 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2150
From: Kansas

Lou Quote:

As I keep repeating:

Too little (performance)


No. Most 68k CPU performance was good or at least competitive when released. Performance/MHz was generally better than x86. RISC CPUs did not arrive until about 1985 and the 68k was still competitive in performance while having nicer features. The 68040 and 68060 had better performance/MHz than most RISC CPUs. The only arguments for lack of performance are due to not getting upgraded CPUs out sooner and not being more aggressive about clocking up the chips.

Lou Quote:

Too late (delivery of product improvements compared to competitors)


Yes. The 68020 was late and had early availability issues due to fab problems. The 68040 was so late that it set back the 68k development roadmap and the newer 68040 CPUs were more expensive than older 486 CPUs even though the 68040 had better performance/MHz.

Lou Quote:

Too expensive (self-explanatory)


No. Motorola/Freescale CPUs were generally competitively priced. It was not unusual for 68k CPUs to be cheaper than equivalent x86 CPUs. There were often cheaper alternatives but Motorola/Freescale was known for high quality which brought a premium price. Intel often had higher margins on their CPUs and less quality.

 Status: Offline
Profile     Report this post  
Hammer 
Re: One major reason why Motorola and 68k failed...
Posted on 13-Jun-2024 8:04:05
#199 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5616
From: Australia

@matthey

Quote:

I can see what no mass produced 68k hardware and emulated 68k hardware have not done for 68k compiler support and platforms.

For Emu68, 68K performance is dictated by the quality of the JIT 68k-to-ARM translator and the host CPU core. The compiler's optimization occurs at JIT time.
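Conceptually such a JIT maps short 68k sequences onto AArch64 instructions and emits them into a code buffer. A toy sketch of the idea in C (an illustration only, not Emu68's actual code; the d0->w0, d1->w1 register mapping is assumed):

#include <stdint.h>

/* Encode AArch64 "ADD Wd, Wn, Wm" (32-bit ADD, shifted register form). */
static uint32_t emit_add_w(unsigned rd, unsigned rn, unsigned rm)
{
    return 0x0B000000u | (rm << 16) | (rn << 5) | rd;
}

/* 68k "add.l d1,d0" could translate to "add w0, w0, w1",
   i.e. emit_add_w(0, 0, 1) == 0x0B010000, written into the code buffer. */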

Quote:

I have seen the 68k AmigaOS stop moving forward with CPU technology support and instead move backward to a 16 bit 68000 target. I can see what mass produced RPi hardware did for ARM and RISC OS. I tried to solve the 68k compiler problem by developing for VBCC and pushing for a 68k ASIC. I understand the problem and the solution but some things are too hard.

Pushing without the means (money) is futile. Mainstream venture capitalists are addicted to AI hype. It's either a "national security" or an AI issue.

68K advocates need to attach themselves to the AI hype to attract venture capitalists.

Quote:

Do SIMD instructions actually improve code density though?

Vector instructions improve data processing throughput, which is important for workloads biased toward arithmetic intensity.


Quote:

Photoshop x86
Code size: 5,634,556 bytes
Instruction count: 1,746,569

Flawed argument. You're using an old Photoshop CS6.

Adobe Photoshop CS6 was released in May 2012. Intel Sandy Bridge with AVX1 was released on January 9, 2011.

AI hype will push Adobe to optimize towards AI-related hardware.

The critical performance sections need SIMD, e.g. plugin modules.

https://www.phoronix.com/review/epyc-bergamo-avx512/3
TensorFlow 2.12's
AVX-512=ON with 119.32
AVX-512=OFF with 19.89


https://www.phoronix.com/review/epyc-bergamo-avx512/2
EPYC 9754 Bergamo, 128 Zen 4C cores (AMD's little Zen 4C core), running the Intel Embree 4.1 path tracer with the Asian Dragon object:
AVX512=OFF, 115.25 fps (AVX2 path)
AVX512=ON, 134.84 fps
The difference is the instruction set running on the same 256-bit SIMD units.

Zen 5 has the AVX512 hardware implementation i.e. real 512-bit wide SIMD hardware.

https://www.phoronix.com/review/embree4-sapphire-rapids/3
Embree 4.0 on Intel's Xeon "Sapphire Rapids" Platinum 8490H (60 P-Cores each) 2P (total:120 P-Cores)
SSE4.2 = 110.93
AVX = 125.39
AVX2 = 136.93
AVX512 = 155.77

Each microarchitecture will attempt to maximize each SIMD instruction set.


Blender 3D example.
https://devtalk.blender.org/t/proposal-bump-minimum-cpu-requirements-for-blender/26855
December 2022,
Quote:

Subject: Bump minimum CPU requirements for Blender

the minimum CPU instruction set for x86-64 that is required to launch Blender and stated on the requirements is SSE2 at the moment.


World of Tanks Encore with software RT and Intel Embree
https://www.youtube.com/watch?v=-w7wUs30OXk

The PlayStation 4 was released in November 2013 and influenced AVX becoming a mandatory requirement on PC. The PS5 similarly influenced the PC's AVX2 mandatory requirement.

Game consoles influence the gaming PC's minimum CPU requirements.

https://whatcookie.github.io/posts/why-is-avx-512-useful-for-rpcs3/
RPCS3 PS3 emulator's LLVM recompiler is known to target the latest SIMD instruction set.
An Intel Core i9-12900K CPU was used for testing at 5.2 GHz with AVX-512 enabled.
SSE2 = 5 FPS
SSE 4.1 = 160 FPS
AVX2/FMA = 190 FPS
AVX-512 = 235 FPS

SSSE3 is needed.

Quote:

The performance when targeting SSE2 is absolutely terrible, likely due to the lack of the pshufb instruction from SSSE3. pshufb is invaluable for emulating the shufb instruction, and it’s also essential for byteswapping vectors, something that’s necessary since the PS3 is a big endian system, while x86 is little endian. - Malcolm Jestadt (RPCS3 developer)


Shufb is a Cell SPE vector instruction (1). Missing vector instructions can cause a major emulation bottleneck.

The pshufb vector instruction was included in Intel's Core 2 "Conroe" SSSE3 in time for the PS3's release.

There are 16 bytes in a 128-bit register, so the scalar 1-byte-at-a-time version is roughly 16 times slower.
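For reference, the byteswapping use is easy to show with SSSE3 intrinsics (a minimal sketch, not RPCS3 code): one pshufb reverses the byte order of four 32-bit words at once.

#include <tmmintrin.h>  /* SSSE3 */

/* Byteswap four big-endian 32-bit words in one pshufb. */
static __m128i bswap32x4(__m128i v)
{
    const __m128i mask = _mm_setr_epi8( 3,  2,  1,  0,   7,  6,  5,  4,
                                       11, 10,  9,  8,  15, 14, 13, 12);
    return _mm_shuffle_epi8(v, mask);  /* pshufb */
}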

Reference
1. http://ilab.usc.edu/packages/cell-processor/docs/SPU_Assembly_Language_Spec_1.5.pdf


Last edited by Hammer on 13-Jun-2024 at 09:00 AM.
Last edited by Hammer on 13-Jun-2024 at 08:54 AM.
Last edited by Hammer on 13-Jun-2024 at 08:48 AM.
Last edited by Hammer on 13-Jun-2024 at 08:44 AM.
Last edited by Hammer on 13-Jun-2024 at 08:36 AM.
Last edited by Hammer on 13-Jun-2024 at 08:28 AM.
Last edited by Hammer on 13-Jun-2024 at 08:11 AM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

 Status: Offline
Profile     Report this post  
Hammer 
Re: One major reason why Motorola and 68k failed...
Posted on 13-Jun-2024 9:16:09
#200 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5616
From: Australia

@matthey

Quote:

No. Most 68k CPU performance was good or at least competitive when released.

Not good enough.

Quote:

Performance/MHz was generally better than x86.

Performance = IPC x clock speed.

The 68000 has its four-cycle memory access. The 68000 is good for hosting a 32-bit OS.

68030-25 being a match against Am386DX-40 is absurd.

Quote:

RISC CPUs did not arrive until about 1985 and the 68k was still competitive in performance while having nicer features.

The Intel i860 beats the 68040, e.g. the 68040 was not used for the SGI RealityEngine's geometry engine.

The 3DO's ARM60 @ 12.5 MHz Doom performance shows the A1200's 68EC020 @ 14 MHz with Fast RAM doesn't have the ARM60's arithmetic intensity.

Quote:

The 68040 and 68060 had better performance/MHz than most RISC CPUs.

Prove 68LC040 was offered as a $20 part like DSP3210 in 1992-1993.

The $100 68EC040 was useless for DMA-equipped desktop computers.

The entire Amiga Hombre chipset was a $40 part. Motorola's 68K with 1 IPC is priced out of the "cheap RISC" range.

Commodore's "cheap RISC" for the Amiga Hombre chipset has much in common with ARM's "cheap RISC" mindset.


http://wla.berkeley.edu/~cs266/sp10/readings/price89.pdf
Dhrystone high
Motorola SYS1147's 68030 @ 20 MHz = 6,334
Motorola SYS3600's 68030 @ 25 MHz = 8,826

Compaq 386/20's 80386 @ 20 MHz = 9,436
Compaq 386/25's 80386 @ 25 MHz = 10,617

MIPS (RISC) R2000 @ 15 MHz = 25,000

RISC threat is real.
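Normalizing those Dhrystone figures per MHz makes the per-clock gap obvious; a quick calculation from the numbers above:

#include <stdio.h>

int main(void)
{
    struct { const char *cpu; double dhrystones, mhz; } r[] = {
        { "68030 @ 20 MHz",       6334.0, 20.0 },
        { "68030 @ 25 MHz",       8826.0, 25.0 },
        { "80386 @ 20 MHz",       9436.0, 20.0 },
        { "80386 @ 25 MHz",      10617.0, 25.0 },
        { "MIPS R2000 @ 15 MHz", 25000.0, 15.0 },
    };
    for (int i = 0; i < 5; i++)
        printf("%-22s %6.0f Dhrystones/MHz\n", r[i].cpu, r[i].dhrystones / r[i].mhz);
    return 0;
}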

Last edited by Hammer on 14-Jun-2024 at 03:19 AM.
Last edited by Hammer on 13-Jun-2024 at 09:35 AM.
Last edited by Hammer on 13-Jun-2024 at 09:20 AM.
Last edited by Hammer on 13-Jun-2024 at 09:18 AM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

 Status: Offline
Profile     Report this post  
Goto page ( Previous Page 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 Next Page )

[ home ][ about us ][ privacy ] [ forums ][ classifieds ] [ links ][ news archive ] [ link to us ][ user account ]
Copyright (C) 2000 - 2019 Amigaworld.net.
Amigaworld.net was originally founded by David Doyle