minator
Re: The (Microprocessors) Code Density Hangout Posted on 16-Jul-2025 15:21:29
[ #341 ]
Super Member
Joined: 23-Mar-2004 Posts: 1036
From: Cambridge

@cdimauro
Quote:
If code density wasn't important anymore, then why did almost all processor vendors care about it and introduce proper extensions or even new architectures with the sole purpose of supporting it? |
It wasn't important at the high end. It did matter, and still does matter in embedded.
Quote:
Well, NO! 486s only had L1 caches. Exactly like Pentiums. |
Well, YES. Both 486s and Pentiums could have L2 on the motherboard. In the SPEC results all of Intel's Pentium results have L2.
Quote:
But Motorola decided to focus on PowerPCs... |
PowerPC made more money.
Quote:
A better metric to check how more efficient are (micro)architectures would have been the SPEC/Mhz. |
Why? It's of no use other than academic interest. The 68060 is in the middle in the 1994 results but lower in the 1996 results. The "crappy" PA-7200 beats it in both cases.
Last edited by minator on 16-Jul-2025 at 03:22 PM.
_________________ Whyzzat? |
matthey
Re: The (Microprocessors) Code Density Hangout Posted on 16-Jul-2025 19:06:27
[ #342 ]
Elite Member
Joined: 14-Mar-2007 Posts: 2764
From: Kansas

cdimauro Quote:
matthey Quote:
- does not work for immediates above a certain threshold like 8-bit or 16-bit |
Could you please elaborate on the last point?
|
Take for example a 32-bit fixed length RISC instruction encoding that has a 16-bit unsigned immediate field. A 17-bit unsigned immediate is above the threshold and unsigned/zero extension does not work. With simple RISC ISAs, once the immediate threshold has been exceeded, a 2nd immediate is usually loaded, shifted and or'd with the 1st. Sign and unsigned extension work on integers of any number of bits, but immediate encoding fields create arbitrary thresholds. Even with a variable length encoding like the 68k and my #d16.w EA encoding for immediates, values outside the range of a signed 16-bit two's complement integer are above the threshold and sign extension can not be performed. The variable length encoding at least allows a 32-bit encoding of the immediate used by a single instruction instead of the 3 dependent instructions needed with the 32-bit RISC encoding.
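As a sketch of that splitting (MIPS-style syntax, registers chosen for illustration only), even the 17-bit unsigned value $10001 no longer fits and has to be built with dependent instructions before it can be used:
lui r8, 0x0001 ; 32-bit encoding, load upper bits of the constant
ori r8, r8, 0x0001 ; 32-bit encoding, or in the lower 16 bits (depends on the lui)
add r7, r7, r8 ; 32-bit encoding, finally use the constant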
Going back to the BA2 ISA examples.
https://www.chipestimate.com/Extreme-Code-Density-Energy-Savings-and-Methods/CAST/Technical-Article/2013/04/02 Quote:
addq.l #$3,d7 ; 16-bit encoding adding 3-bit unsigned immediate (add of 8-bit immediate)
add.l #$6000,d7 ; 48-bit encoding with current 68k (add of 16-bit immediate), *or*
add.l #$6000.w,d7 ; 32-bit encoding with my #d16.w EA sign extension (add of 16-bit immediate)
add.l #$7fffffff,d7 ; 48-bit encoding with current 68k (add of 32-bit immediate)
The example add of a 32-bit immediate in the pic has to be wrong for Thumb-2 and MIPS, as a 32-bit immediate can not fit in a 32-bit encoding. The immediate field thresholds of the Thumb-2 and MIPS ISAs were exceeded and there is no encoding larger than 32-bit, so a 2nd immediate has to be introduced and combined with the 1st immediate using dependent instructions. This requires at least one more 32-bit instruction and may require two more. Once a RISC threshold is surpassed, the code bloat starts, which is not only larger code size but also multiple dependent instructions to execute. The scalability of BA2 immediates shows this does not have to be the case for a load/store architecture. However, what if the variable is in memory?
addq.l #$3,(a0) ; 16-bit encoding adding 3-bit unsigned immediate (add of 8-bit immediate)
add.l #$6000,(a0) ; 48-bit addi.l encoding with current 68k (add of 16-bit immediate)
add.l #$7fffffff,(a0) ; 48-bit addi.l encoding with current 68k (add of 32-bit immediate)
My EA mode compression idea unfortunately does not work with addi as the immediate is not an EA (it does work with move.l EA,EA though). The 68k still has a large advantage compared to load/store architectures.
BA2:
load r7,(r6) ; at least 16-bit encoding
b.addi r7, r7, 0x03 ; 16-bit encoding
store (r6),r7 ; at least 16-bit encoding

load r7,(r6) ; at least 16-bit encoding
b.addi r7, r7, 0x6000 ; 24-bit encoding
store (r6),r7 ; at least 16-bit encoding

load r7,(r6) ; at least 16-bit encoding
b.addi r7, r7, 0x7fffffff ; 48-bit encoding
store (r6),r7 ; at least 16-bit encoding
None of the BA2 examples at the link access memory, where BA2 would not want to be compared to the 68k, but its code density is likely still better than Thumb-2's and is no doubt better than MIPS's, especially where MIPS stands for Microprocessor without Interlocked Pipelined Stages as in the R2000 and R3000 cores.
MIPS:
lw r7,(r6) ; 32-bit encoding
nop ; 32-bit encoding, load-to-use delay slot without Interlocked Pipelined Stages
addi r7, r7, 0x03 ; 32-bit encoding
sw r7,(r6) ; 32-bit encoding

lw r7,(r6) ; 32-bit encoding
nop ; 32-bit encoding, load-to-use delay slot without Interlocked Pipelined Stages
addi r7, r7, 0x6000 ; 32-bit encoding
sw r7,(r6) ; 32-bit encoding

lw r7,(r6) ; 32-bit encoding
lui r8, 0x7fff ; 32-bit encoding, load upper 16-bit immediate instruction
ori r8, r8, 0xffff ; 32-bit encoding
add r7, r7, r8 ; 32-bit encoding
sw r7,(r6) ; 32-bit encoding
The last MIPS example fills the load delay slot with an independent instruction, which saves an instruction at the cost of using another register. The 68k code is one instruction using 6 bytes and one register, where the MIPS code is 5 instructions using 20 bytes and 3 registers. The R2000/R3000 have single cycle throughput pipelined instructions though. The R4000 would have needed two NOP instructions in the first two examples, but interlocked pipeline stages were added to stall the pipeline instead of bloating the code with NOP instructions. The last example is R4000 core ready and will not stall, while the first two examples would stall for a cycle even with the NOP instructions. RISC simplification is a pain in the programming ass, but to add insult to injury, the performance potential is lower. The 68060 can execute the single instruction 68k examples accessing memory with single cycle throughput when the data is in the L1 cache. The 68060 requires more hardware than the likewise 8-stage MIPS R4000, but it is significantly smaller than the 8-stage in-order superscalar ARM Cortex-A53 that remains popular because of its small area. The Cortex-A53 has a performance killing 3 cycle load-to-use latency requiring 3 non-dependent instructions between a load and the instruction that uses the load result, or the pipeline stalls.
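A sketch of what that load-to-use latency means for scheduling (AArch64-style pseudo-assembly, registers and filler instructions purely illustrative):
ldr w1, [x0] ; load
add w4, w4, #1 ; three independent instructions are needed here...
add w5, w5, #1 ; ...to cover the 3 cycle load-to-use latency...
add w6, w6, #1 ; ...or the in-order pipeline stalls
add w1, w1, #3 ; first instruction that uses the loaded value
If the compiler cannot find independent work to schedule into those slots, the stall cycles are paid on every such load.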
cdimauro Quote:
Indeed. This helps a lot the code density, because of what I've said before (not so many RISC architectures support small immediate FP values or, in general, FP immediates).
|
Yes. Most RISC FPUs do not support fp immediates at all, and neither do some poorly designed CISC FPUs. It is much more common for fp immediates to end up clogging up data caches rather than being compressed into the more predictable code stream. Load/store architectures commonly have load-to-use stalls when loading fp immediates (and other fp data), reducing fp performance similar to what I just showed for integer memory accesses.
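A rough sketch of the contrast (the load/store side is pseudo-assembly with made-up mnemonics):
load.d f0,fp_const ; load/store FPU: constant fetched from a literal pool in memory, occupies a data cache line, possible load-to-use stall
fadd.d f1,f1,f0 ; use of the loaded constant
fadd.s #0.5,fp1 ; 68k FPU: the constant travels in the instruction stream, no data cache involvement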
cdimauro Quote:
Do you mean because the 68k FPU can load a FP32 immediate data, which is (automatically) expanded to extended precision? Or... can load integer immediate data. Or both?
|
The vasm peephole optimization can reduce the precision of fp immediates from extended precision down to double precision and from double precision down to single precision. If half precision were supported, it could further compress single precision immediates down to half precision where possible. It is true that the 68k FPU supports integer immediates as well, which could give the same compression as half precision, but the integer-to-fp conversion performance loss would make this a -Os compiler option.
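For example (assuming the optimizer catches this exact case, which is my assumption for illustration):
fadd.x #1.5,fp0 ; as written: extended precision immediate, 12 bytes of constant data
fadd.s #1.5,fp0 ; after the peephole optimization, since 1.5 is exactly representable in single precision, 4 bytes of constant data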
cdimauro Quote:
But it supports loading some important constants (pi, log2 e, etc.) from its ROM, with the best precision possible (extended).
|
The FMOVECR 68k FPU instruction is not supported in hardware on the 68040 or 68060. It likely does improve code density and reduces extended precision immediates in data caches, but I have not examined it closely. It was not a rare instruction; it was used by the SAS/C compiler, for example for Lightwave.
cdimauro Quote:
Yes, but it has 16-bit immediates only on some basic/common instructions (some GP/scalar ones. Which includes loads/stores, of course).
Besides that, all instructions support 32 or 64 bit immediates.
The primary problem, however, is that the instructions are multiple of 32 bits. Which I think that it's not so code density-friendly with general purpose code. But I've no statistics about that (Mitch only shared the number of executed instructions against RISC-V, which looks around 70%).
|
Not to criticize Mitch, but a 16-bit base variable length encoding is more practical than a 32-bit base variable length encoding. Mitch is targeting a very high performance ISA with many registers using many encoding bits. It remains to be seen if code density sabotages his efforts. BA2 uses an 8-bit base variable length encoding, which is very good for code density but which I expect has some alignment and decoding disadvantages that may affect performance. CAST offers RISC-V cores in addition to BA2 cores, but I would not expect performance to be any better with RISC-V. It may just be that BA2 is more proprietary, and protectionism reduces proliferation compared to open RISC-V hardware.
cdimauro Quote:
Another very important thing that he shared, is that directly using immediates on instructions makes an architecture immune to (some, I think) side-channel attacks. That's because the processor's speculative backend can't be maliciously trained, since no load/store instructions are used in this case.
|
It makes sense that immediates and displacements in the code are better protected and more secure.
cdimauro Quote:
They are already CISCs because they lost all RISC "principles/pillars".
However, I'm preparing the popcorns for when RISC-V will introduce 48-bit (or even more) instructions... 
|
RISC-V 48-bit Load Long Immediate Extension https://github.com/riscvarchive/riscv-code-size-reduction/blob/main/existing_extensions/Huawei%20Custom%20Extension/riscv_LLI_extension.rst
https://www.reddit.com/r/RISCV/comments/zrpi3m/why_48bit_instructions/ brucehoult Quote:
Encoding for 48 bit, 64 bit, and longer instructions in RISC-V has not been ratified. The stuff in the ISA manual is just a sketch of how things might work eventually, so all suggestions are welcome.
I've made some myself, and Claire Wolf riffed off my suggestions a little:
https://github.com/riscv/riscv-isa-manual/issues/280
To date there are no 48 bit instructions (and no ratified way to encode them) and multiple companies have strongly resisted introducing the first 48 bit instruction in e.g. the Vector extension, with the unfortunate result that the FMA instructions had to be made destructive (the only such instructions in the 32-bit encoding) and come in two versions depending on which operand is destructed.
Personally I think this is a pity as 48 bit instructions do provide a meaningful increase in code density in ISAs such as S/360 and nanoMIPS (which seems to be dead, but it looks to be a very nice post-RISC-V ISA).
Having 48 bit instructions would also allow for including the vtype in every V instruction instead of the hack of inserting special vsetvli instructions between pairs of vector instructions, and thus using 64 bits per actual work-doing instruction. Going straight to 64 bit would give no program size advantage.
|
The last post was 3 years ago but talks about a code density advantage of larger encoding sizes. It does not even mention the reduced instruction count advantage. Of course any code taking advantage of larger instruction sizes with RISC-V extensions would require a recompile to gain the benefits that the 68000 had in 1979 and that all 68k Amigas already have. RISC-V extensions would have duplicate encodings wasting encoding space compared to an ISA which planned for scaling immediates/displacements with variable length encodings from inception. The 68k is ancient technology long forgotten, like Roman cement and Baalbek quarrying. Maybe some day the technology will be rediscovered, if it is not forgotten and lost first.
Last edited by matthey on 16-Jul-2025 at 07:56 PM.
matthey
Re: The (Microprocessors) Code Density Hangout Posted on 16-Jul-2025 20:54:54
[ #343 ]
Elite Member
Joined: 14-Mar-2007 Posts: 2764
From: Kansas

minator Quote:
cdimauro Quote:
If code density wasn't important anymore, then why did almost all processor vendors care about it and introduce proper extensions or even new architectures with the sole purpose of supporting it?
|
It wasn't important at the high end. It did matter, and still does matter in embedded.
|
Code density is not as important for high end CPUs but it is still important. The same effects as on low end CPUs occur, just to a lesser extent.
https://www.chipestimate.com/Extreme-Code-Density-Energy-Savings-and-Methods/CAST/Technical-Article/2013/04/02 Quote:
Figure 1. Energy consumption for Processor B is significantly less than for A due to B's 50% smaller code size.
|
The processor with good code density uses less energy because it stalls less waiting for code. Instruction pipelines stall unless a new instruction is fed into the pipeline every cycle. The SRAM, which could be MCU memory or caches, is smaller so uses less energy. Accessing DDR memory to load more code is very expensive. Saying code density is not important for high end CPUs is like saying you will just buy a more expensive CPU with larger caches. You will probably select the CPU with the largest caches available for the price but ignore the CPU with code compression. It is kind of like ignoring a GPU with 2GiB of memory and texture compression, which would give adequate texture memory, and choosing to buy a GPU with 4GiB of memory and without texture compression instead. You may pay twice as much to solve a texture traffic performance problem. You can always say to yourself you will only use old programs and buy the 2GiB memory GPU too.
minator Quote:
Well, YES. Both 486s and Pentiums could have L2 on the motherboard. In the PPEC results all of Intel's Pentium results have L2.
|
Some 1990s high end CPUs had on-chip L2 cache tags, which helped performance, but it made the CPUs more expensive and required expensive high performance memory for the L2 caches. The cost of the more expensive x86 CPU and its cache memory was several times what the 68060 cost, but did not deliver several times the performance.
minator Quote:
PowerPC made more money. |
I question that PPC was highly profitable for Motorola. They replaced the PPC601, PPC603 and PPC604 after relatively short times on the market with the PPC601+, PPC603e and PPC604e, using expensive chip fab process improvements and double the caches for the cache hungry PPC. They also shared the PPC market with IBM CPUs and later other competition. This was mostly for the Apple desktop market, which was less than 10% of the desktop market. They tried to force PPC on the embedded market to improve economies of scale, but that backfired and they lost most of the 68k embedded market.
minator Quote:
cdimauro Quote:
A better metric to check how more efficient are (micro)architectures would have been the SPEC/Mhz.
|
Why? It's no use other than academic interest. 68060 is in the middle in the 1994 results but lower in the 1996 results. The "crappy" PA-7200 beats it in both cases.
|
In 2 years, a chip fab process improvement and doubling of the caches would be expected. Intel kept and upgraded their in-order 5-stage P54C, their equivalent of the in-order 8-stage 68060, to the in-order 6-stage P55C with double the caches.
Intel’s Long-Awaited P55C Disclosed https://websrv.cecs.uci.edu/~papers/mpr/MPR/ARTICLES/101404.PDF Quote:
Optimizing for Faster Core Clock
Intel doubled the cache size to reduce the performance lost from cache misses at high core clock speeds. In addition, the caches are four-way (instead of two-way) set-associative. Preliminary data from Intel shows the doubled caches cut the data-cache miss rate (on SPECint95) by 20–30% and the instruction-cache miss rate by 35–40%. The net effect, combined with the pipeline and branch-prediction enhancements, is a 10–20% increase on standard benchmarks, according to Intel. The benefit of the large cache will be greatest at the 200-MHz clock rate, due to the higher cost of cache misses at that speed.
|
The 68060 was still superior in some ways, and the die was small enough in 1994 that the 68060 could have launched with 16kiB I+D caches. Instead, the 68060+ was planned for later and the safer, and for embedded use more practical, 8kiB I+D 68060 was launched. Doubling the caches reduced cache misses on the SPECint95 benchmarks, improving performance by 10-20%. Moore's Law made pigs fly, but it ended and efficiency matters again, even for high end CPUs.
Last edited by matthey on 16-Jul-2025 at 09:03 PM. Last edited by matthey on 16-Jul-2025 at 09:01 PM. Last edited by matthey on 16-Jul-2025 at 08:56 PM.
Hammer
Re: The (Microprocessors) Code Density Hangout Posted on 16-Jul-2025 23:24:30
[ #344 ]
Elite Member
Joined: 9-Mar-2003 Posts: 6515
From: Australia

@ppcamiga1
Quote:
Amiga NG is a better Amiga than those made by Commodore after the Amiga 500. It is your problem that you do not accept reality. Amiga NG is the Amiga that Commodore would have made if it had survived a few years more. It has everything that should be in an Amiga: RISC, 3D, FPU, MMU, etc. |
|
1. Amiga is not a Mac.
2. Petro Tyschtschenko's Amiga Technologies GmbH / Phase 5 PowerPC camp failed to understand Amiga's majority audience. Neo-Amiga PowerCrap had their chance, and they blew it.
_________________ Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68) Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68) Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB |
Hammer
Re: The (Microprocessors) Code Density Hangout Posted on 17-Jul-2025 0:03:49
[ #345 ]
Elite Member
Joined: 9-Mar-2003 Posts: 6515
From: Australia

@cdimauro
Quote:
What's not clear to you that this thread is about CODE DENSITY? Memory footprint is also good to discuss, because code density is a very important factor which heavily influences it.
|
For 3D, arithmetic intensity has higher importance. Any processor design is a compromise. CODE DENSITY can't exist without economic factors.
What's the point of CODE DENSITY focus when Motorola's 68K lost on performance vs price?
https://www.electronicproducts.com/mips-processors-to-push-performance-and-price/
From 1992, the IDT MIPS R3041 @ 20 MHz had a $15 price: 68LC040-class, near-1 IPC performance in a 68EC020-16 price range.
Sony's selection of the LSI Logic MIPS R3050 for the PlayStation 1 was a no-brainer.
The R3041 was a $15 R3000 variant with an embedded MMU, while Motorola continued to price the MMU-equipped 68030 uncompetitively.
"It's the economy, stupid!"
For the microcontroller embedded markets, MIPS gained 16-bit instructions. US government-funded academia made another attempt with RISC-V. The US government's RISC-V and MIPS R&D programs are no different from the Chinese Communist Academy of Science's. Unlike the Soviet Communist model, the US allows privately and publicly funded processor products to co-exist, i.e. a "two systems, one country" model. The US cloaks the MIPS and RISC-V R&D programs under a "national security" blanket.
ARM is no different: it was defended by the UK government against NVIDIA's takeover on national security grounds, while Japan Inc's SoftBank takeover was allowed. The UK government, via the "education" route, helps fund the Raspberry Pi, which is reminiscent of the BBC Micro initiative. That's state power in play.
For Sega's Saturn, this was repeated with the selection of the SuperH-2 ahead of the 68030.
Motorola exited the 68K's development roadmap. The 68K exited mainstream 32-bit game consoles. The 68K exited the workstation market. The 68K (DragonBall VZ) exited the smart handheld market. Motorola exited the semiconductor industry.
x86 had the benefit of being entrenched in the business PC market with its large economies of scale.
-------------------
In modern times, NVIDIA's current RTX GPUs have custom RISC-V command processors. https://www.tomshardware.com/pc-components/gpus/nvidia-to-ship-a-billion-of-risc-v-cores-in-2024
These things are now managed by 10 to 40 custom RISC-V cores developed by Nvidia, depending on chip complexity. Nvidia started to replace its proprietary microcontrollers with RISC-V-based microcontroller cores in 2015, and by now, virtually all of its MCU cores are RISC-V-based, according to an Nvidia slide demonstrated at the RISC-V Summit.
By now, Nvidia has developed at least three RISC-V microcontroller cores: NV-RISCV32 (RV32I-MU, in-order single-issue core), NV-RISCV64 (RV64I-MSU, out-of-order dual-issue core), and NV-RVV (RV32I-MU, NVRISCV32 + 1024-bit vector extension). These cores (and perhaps others) replaced the proprietary Falcon microcontroller unit based on a different instruction set architecture. In addition, Nvidia has developed 20+ custom RISC-V extensions for extra performance, functionality, and security.
Perhaps the most important RISC-V-based part of Nvidia GPUs is its embedded GPU System Processor (GSP). According to Nvidia's website, the first GPUs to use RISC-V-based GSP were based on the Turing architecture. This GSP offloads Kernel Driver functions, reduces GPU MIMO exposure to the CPU, and manages how the GPU is used.
Since MCU cores are universal, they can be used across Nvidia's products. As a result, in 2024, Nvidia is expected to ship around a billion RISC-V cores built into its GPUs, CPUs, SoCs, and other products, according to one of the demonstrated slides, which highlights the ubiquity of custom RISC-V cores in Nvidia's hardware.
Intel's light-x86 Larrabee cGPU with 512-bit SIMD was a flop.
Qualcomm shifted towards RISC-V MCU cores.
Western Digital's hard drives have a RISC-V MCU (RV32I-MU), e.g. SweRV, a 32-bit, 2-way superscalar, 9-stage pipeline core.
The RISC-V horse has bolted, even though the x86-64 v1 ISA exited US patent protection in 2023.
Last edited by Hammer on 17-Jul-2025 at 12:47 AM. Last edited by Hammer on 17-Jul-2025 at 12:41 AM. Last edited by Hammer on 17-Jul-2025 at 12:20 AM.
_________________ Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68) Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68) Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB |
kolla
Re: The (Microprocessors) Code Density Hangout Posted on 17-Jul-2025 0:16:28
[ #346 ]
Elite Member
Joined: 20-Aug-2003 Posts: 3479
From: Trondheim, Norway

@Hammer
Who blew what? Something about PowerUP vs WarpUP, CyberGraphX vs Picasso96, MUI vs ReAction, Poseidon USB vs N/A, MorphOS vs OS4, blue vs red... some would argue that Phase5's PowerPC effort pretty much was sabotaged, and so they decided to leave the "Amiga" playground and build their own.
_________________ B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC |
Hammer
Re: The (Microprocessors) Code Density Hangout Posted on 17-Jul-2025 0:34:49
[ #347 ]
Elite Member
Joined: 9-Mar-2003 Posts: 6515
From: Australia

@kolla
Quote:
Who blew what? Something about PowerUP vs WarpUP, CyberGraphX vs Picasso96, MUI vs ReAction, Poseidon USB vs N/A, MorphOS vs OS4, blue vs red... some would argue that Phase5's PowerPC effort pretty much was sabotaged, and so they decided to leave the "Amiga" playground and build their own. |
Uncompetitive on price vs performance.
FYI, Phase 5 was also selling PowerPC accelerators for the PowerMac market, not just for the minority Amiga market: https://everymac.com/upgrade_cards/phase5/
Phase 5's potential Amiga customer base was capped by the smaller install base of the A1200, A2000, A3000, and A4000. The combined A1200/A2000/A3000/A4000 install base is tiny compared to the multi-million A500 install base. Neo-Amiga PowerPC advocates assume that the Amiga is like a Mac.
Phase 5 PowerPC advocates largely ignored the wedge A500 market majority. In modern times, we have PowerPC-backing Hyperion Entertainment entering the 68K market via AmigaOS 3.1.4, which includes the A500 majority, since the neo-Amiga NG PowerPC adventure sucked. Again, the Amiga is not a Mac.
Last edited by Hammer on 17-Jul-2025 at 12:44 AM. Last edited by Hammer on 17-Jul-2025 at 12:43 AM.
_________________ Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68) Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68) Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB |
Hammer
Re: The (Microprocessors) Code Density Hangout Posted on 17-Jul-2025 1:11:00
[ #348 ]
Elite Member
Joined: 9-Mar-2003 Posts: 6515
From: Australia

@matthey
Quote:
In 2 years, a chip fab process improvement and doubling of the caches would be expected. Intel kept and upgraded their in-order 5-stage P54C, their equivalent of the in-order 8-stage 68060, to the in-order 6-stage P55C with double the caches.
|
FYI, the P5 had a doubled L1 cache size with the Pentium Overdrive before the Pentium MMX (P55).
The P5 processor's floating-point unit (FPU) has an 8-stage pipeline, while the integer pipelines are 5 stages long. The P55 has an extra pipeline stage.
The Pentium Overdrive was designed with a doubled L1 cache size to mitigate the 486 platform's 32-bit bus.
The Pentium Overdrive PODP5V63 was introduced on February 3, 1995. The Pentium Pro (P6) was also introduced in 1995, with slower 16-bit x86 support that was fixed in the next Pentium II.
AMD's K5 processor entered the sampling phase in 1995.
Cyrix 6x86 was announced in October 1995.
NexGen Nx586 CPU was introduced in 1994. NexGen was purchased by AMD on January 16, 1996 and NexGen's Nx686 turned into K6.
This was before the mixed integer/floating point code of Quake's 1996 release.
Last edited by Hammer on 17-Jul-2025 at 01:12 AM.
_________________ Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68) Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68) Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB |
Hammer
Re: The (Microprocessors) Code Density Hangout Posted on 17-Jul-2025 2:34:00
[ #349 ]
Elite Member
Joined: 9-Mar-2003 Posts: 6515
From: Australia

@matthey
Quote:
I agree that Motorola had chip fab problems but they were unrelated to the 68k and the politically motivated instead of technically motivated decision to switch to PPC. Motorola also had management problems including all the way to the top where there was a lack of technical understanding. The 68k remained vastly superior to PPC for the embedded market and Motorola did not lose this market until they sabotaged products like the 68060 by not clocking it up, used 68k embedded profits on PPC products and shoved PPC down customer's throats.
|
For the embedded MCU market, ST-Micro and NXP created PowerPC 16-bit VLE. For the automotive embedded MPU sector, Europe has effectively adopted PowerPC 16-bit VLE as its MCU.
ST-Micro / NXP PowerPC 200 series is not competitive in the "application processors" and games console market.
Quote:
Intel had the marketing advantage as cdimauro said with 486 frequency doubling. The 68040 had severe production problems and some questionable design decisions. However, the 68040 still had more performance/MHz than the 486 and it used less power when the problems were solved. The 3.3V 68040V@33MHz dissipates only 1.5W and LPSTOP sleep of the full static design was 0.00066W.
|
The problem was the premium asking price of the MMU-equipped 68K parts. The 68000's large embedded market didn't benefit the fat 68K SKUs.
Meanwhile, DEC's StrongARM raced ahead to +100 MHz, establishing the smart handheld template for the +100 MHz ARM9T that knocked Motorola's 68000-based DragonBall VZ out of the smart handheld device market.
Meanwhile, the disposable MCU market raced to the bottom in terms of BOM price.
Motorola demanded that Japan's 1 IPC 68000 reimplementation be fabbed in Motorola's fabs.
If you track 68040 vs 486 wholesale prices, Intel was dropping 486 prices faster than Motorola was dropping the 68040's.
The higher clocked Intel 486 and Pentium SKUs were constantly pushing lower clocked SKUs into lower price segments, until Intel / AMD could offer a competitive CPU solution for the MS Xbox project against the embedded MIPS competition.
Quote:
The 68040 had 3 MFLOPS compared to the 486 1.0 MFLOPS in a double precision FP Linpack benchmark from the same link. It was really bad to be late with tech when Moore's Law kicked in hard as Commodore found out too.
|
68040 was missing in action during 1989.
For the 1990 release window, Commodore management rejected the completed L2 cache-equipped A3640 with the A3000.
Apple released the Macintosh Quadra 700 in October 1991 for Christmas Q4 1991.
Christmas Q4 1991 would be Commodore's last Amiga sales boom.
From the large 68K platform vendors (i.e. Apple and Commodore), the 68040 was largely missing in action in 1989, 1990, and Q1-to-Q3 1991. 68040 accelerators for the A2000 were manufactured in small numbers, i.e. like German Tiger tanks in small numbers against "zergling rushed" US Sherman M4 tanks in mass numbers, while the US had M26 Pershing tanks in small numbers.
System integration's time to market is a major factor against fat 68K CPUs.
In July 1993, Apple released the Quadra 840AV with a 40 MHz 68040. Intel released Pentium 60/66 MHz on March 22, 1993 with its close PC partners following.
The Quadra 840AV included an AT&T DSP3210 @ 66 MHz to boost multimedia FP32 processing, while the PC competition had the Pentium 60/66.
http://kpolsson.com/micropro/proc1993.htm
In May 1993, Motorola announced the availability of the 40 MHz 68040 processor. The price was US$393 in 1000 unit quantities.
In June 1993, Intel added more 3.3-volt 486 processors to its line: i486SX-33 (for US $171), i486DX-33 (for US$324), and i486DX2-40 (for US $406). Prices are in quantities of 1000.
July 1993, AMD priced Am486SX-33 at US$185 in 1000 unit quantities.
October 1993, TI486SXLC-33, TI486SXL-40, TI486SXLC2-50, TI486SXL2-50, with prices in 1000 unit quantities are, respectively, US$79, US$89, US$110, US$149.
AMD 486DXL-40 processor for US$283 and Am486DX2-66 for US$463 in 1000 unit quantities.
Thanks to IBM's second-source insurance for x86, that's the 486 clone war in 1993.
Last edited by Hammer on 17-Jul-2025 at 02:42 AM. Last edited by Hammer on 17-Jul-2025 at 02:36 AM.
_________________ Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68) Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68) Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB |
matthey
Re: The (Microprocessors) Code Density Hangout Posted on 17-Jul-2025 22:01:46
[ #350 ]
Elite Member
Joined: 14-Mar-2007 Posts: 2764
From: Kansas

Hammer Quote:
Pentium Overdrive was designed with a doubled L1 cache size to mitigate against the 486's 32-bit bus.
|
A larger L1 cache gives higher hit rates which results in less DRAM memory traffic through the data bus. The 68060 has less memory traffic than the P5 Pentium for the following reasons.
1. the 68k has 16 GP integer registers vs x86's 6 GP integer registers
2. the 68k has better code density
3. the 68060 has 4-way set associative caches vs the P5's 2-way set associative caches
There will always be some software like Quake that streams most data and so does not benefit much from data caches, but for most software, and considering the more expensive and larger chip package a 64-bit bus needs, double the caches with a 32-bit data bus is more practical than a 64-bit bus. Even though the 68060 was not introduced with the doubled caches planned for the 68060+, I expect it had significantly higher cache hit rates than the P5 Pentium, which used its 64-bit data bus more than the 68060 used its 32-bit data bus. The low memory traffic with a 32-bit data bus allowed the 68060 to be successful in the embedded market where the 64-bit data bus P5 Pentium was not, yet the 68060 could still compete in performance against the desktop P5 Pentium.
Hammer Quote:
For the embedded MCU market, ST-Micro and NXP created PowerPC 16-bit VLE. For the automotive embedded MPU sector, Europe has effectively adopted PowerPC 16-bit VLE as its MCU.
ST-Micro / NXP PowerPC 200 series is not competitive in the "application processors" and games console market.
|
Timelines are important for the embedded market.
1979 68000 - pioneered the 32-bit embedded market
1984 68020 - improved code density and performance vs 68000
1992 SuperH - handicapped by 16-bit fixed length encoding
1994 ColdFire - initially very low end and inferior code density to 68000
1994 Thumb - handicapped by 16-bit fixed length encoding and not enough GP registers
1996 MIPS16 - handicapped by 16-bit fixed length encoding and not enough GP registers
1999 PPC CodePack - odd library based compression from IBM
2003 Thumb-2 - ARM becomes competitive in the embedded market
2006 PPC VLE - only used by Motorola and competed against earlier IBM PPC CodePack
2009 MicroMIPS - MIPS16 was the equivalent of Thumb and MicroMIPS the late equivalent of Thumb-2
The 68k was the best compressed ISA until 2003 when ARM started to get more competitive with Thumb-2, SoCs and licensing their IP while old 68k embedded designs were being replaced with less desirable PPC and ColdFire designs. This was the turning point. The 68k was likely #1 in 32-bit embedded chip sales into the early 2000s.
RISC Volume Gains But 68K Still Reigns 1998 article with 1997 results https://websrv.cecs.uci.edu/~papers/mpr/MPR/19980126/120102.pdf Quote:
Motorola’s 79.3 million units put it on top, as usual. Its 68K line has been the embedded 32-bit volume leader since it created the category. As the figure shows, sales of 68K chips were about equal to worldwide sales of PCs. Taken together, that’s one new 32-bit microprocessor for every man, woman, and child living in the United States.
|
Embedded Processors by the Numbers 1999 article https://www.eetimes.com/embedded-processors-by-the-numbers/ Quote:
Last year, microprocessor makers built and sold almost 250 million 32-bit embedded microprocessors. (Source: MicroDesign Resources, January 1999) That's one new 32-bit embedded CPU for every man, woman, and child living in the United States. That's also more than double the number of PCs sold around the world in the same year. Seen another way, Motorola sold almost as many 68k chips to embedded customers as Intel sold Pentium II processors to PC makers. Hitachi's sales of 32-bit chips outstripped AMD's PC sales by a two-to-one margin. Heck, even AMD's 29K processors (remember those?) were more successful, on a per-unit basis, than IDT's WinChip used in PCs.
Add to that 250 million 32-bit chips the much greater number of 16-bit processors, estimated at over one billion per year. Then add another billion eight-bit processors, and another billion four-bitters. Suddenly, the 100 million PCs, Macs, workstations, and supercomputers don't seem like such a big deal.
|
Both articles are by the famous Jim Turley. The 68k was at or near the top of the 32-bit embedded market when the first AmigaNOne was released in 2002. They may have thrown away the 68k while it was still the embedded 32-bit volume leader, for a distant 2nd place, low volume desktop ISA. PPC AmigaNOne was expensive hardware for the classes where the 68k Amiga hardware had been inexpensive hardware for the masses. Maybe 5k units were sold in 23 years of failure, and they are still sabotaging the 68k Amiga market with PPC, which is far more dead than the 68k was when PPC AmigaNOne hardware appeared in 2002.
cdimauro
Re: The (Microprocessors) Code Density Hangout Posted on 18-Jul-2025 4:29:52
[ #351 ]
Elite Member
Joined: 29-Oct-2012 Posts: 4449
From: Germany
cdimauro
Re: The (Microprocessors) Code Density Hangout Posted on 18-Jul-2025 4:56:20
[ #352 ]
Elite Member
Joined: 29-Oct-2012 Posts: 4449
From: Germany

@minator
Quote:
minator wrote: @cdimauro
Quote:
If code density wasn't important anymore, then why did almost all processor vendors care about it and introduce proper extensions or even new architectures with the sole purpose of supporting it? |
It wasn't important at the high end. It did matter, and still does matter in embedded. |
It's the key factor in the embedded market, for sure, but it remains very important for any architecture & microarchitecture due to its intrinsic benefits.
Introducing Intel® Advanced Performance Extensions (Intel® APX): "While the new prefixes increase average instruction length, there are 10% fewer instructions in code compiled with Intel APX, resulting in similar code density as before." APX was purely designed for high-performance computing. Then why is it so important for Intel to mention code density here?
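As a rough sanity check on that claim (numbers purely illustrative, not from Intel): if the instruction count drops to 0.90x while the average instruction length grows by about 11%, the total code size comes out at roughly 0.90 x 1.11 ≈ 1.00x, i.e. the code density stays about the same even though individual instructions got longer.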
Arm Compiler 6.11 – What's New? "With regards to code density, Arm Compiler 6.11 delivers an improvement of ~3% for AArch64 targets." AArch64 was purely designed for high-performance computing. Then why is it so important for ARM to mention code density here? Quote:
Quote:
Well, NO! 486s only had L1 caches. Exactly like Pentiums. |
Well, YES. Both 486s and Pentiums could have L2 on the motherboard. |
That's all about motherboards, and not about the chip design.
Even for the 68000 there were some systems (or accelerators) with external cache, but the processor itself had zero support for this memory. Quote:
In the SPEC results all of Intel's Pentium results have L2. |
Because it helps performance. Which, as we know, was the key factor for processor vendors at the time. Quote:
Quote:
But Motorola decided to focus on PowerPCs... |
PowerPC made more money. |
That's another story. The point is that the 68060 had fewer resources to be worked on, because some were drained for working on PowerPCs. Quote:
Quote:
A better metric to check how more efficient are (micro)architectures would have been the SPEC/Mhz. |
Why? It's no use other than academic interest. |
It's also used by processor architects to understand the efficiency of an architecture & microarchitecture, and to take decisions on how to improve it. Quote:
68060 is in the middle in the 1994 results but lower in the 1996 results. The "crappy" PA-7200 beats it in both cases. |
You like to win easy, don't you?
The 68060 has a 32-bit bus, 8 + 8 kB L1 caches and 64 entries for the ATC (TLB), whereas the PA-7200 has a 64-bit bus, 1MB L1 code and 2MB L1 data caches, and 120 TLB entries + 16 BTLB entries.
The difference is embarrassing to say the least, and certainly NOT in favour of the PA-7200...
In fact, and guess what (from the link): PA-7200s were expensive to fabricate and were used in only a few 32-bit HP 9000 workstations in the mid-1990s.
But, as I've already said, the only factor/metric that processor vendors were interested in at the time was performance. |
cdimauro
Re: The (Microprocessors) Code Density Hangout Posted on 18-Jul-2025 5:02:37
[ #353 ]
Elite Member
Joined: 29-Oct-2012 Posts: 4449
From: Germany

@Hammer
Quote:
Hammer wrote: @cdimauro
Quote:
What's not clear to you that this thread is about CODE DENSITY? Memory footprint is also good to discuss, because code density is a very important factor which heavily influences it.
|
For 3D, arithmetic intensity has higher importance. |
Again, this has NOTHING to do with CODE DENSITY.
You keep trying to change the topic and scope of the thread, which should be very clear... Quote:
Any processor design is a compromise. |
Self-evident. Quote:
CODE DENSITY can't exist without economic factors. |
Code density exists BECAUSE of economic factors. Quote:
[...WALL-OF-NON-SENSE / AKA Hammer's padding...] |
Let me repeat it again, since you don't understand or don't want to understand.
What's not clear to you that this thread is about CODE DENSITY? |
Hammer
Re: The (Microprocessors) Code Density Hangout Posted on 18-Jul-2025 7:01:46
[ #354 ]
Elite Member
Joined: 9-Mar-2003 Posts: 6515
From: Australia

@cdimauro
Quote:
Again, this has NOTHING to do with the CODE DENSITY. |
You're wrong. It's CODE DENSITY for the 3D use case.
_________________ Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68) Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68) Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB |
Hammer
Re: The (Microprocessors) Code Density Hangout Posted on 18-Jul-2025 7:14:30
[ #355 ]
Elite Member
Joined: 9-Mar-2003 Posts: 6515
From: Australia

@matthey
Quote:
A larger L1 cache gives higher hit rates which results in less DRAM memory traffic through the data bus. The 68060 has less memory traffic than the P5 Pentium for the following reasons.
1. the 68k has 16 GP integer registers vs x86's 6 GP integer registers
2. the 68k has better code density
3. the 68060 has 4-way set associative caches vs the P5's 2-way set associative caches
|
1. I am still waiting for Warp1260 to deliver superior Quake scores against Pentium 100. Is it another "two more weeks"?
Warp1260 has a 64KB L2 cache.
2. SysInfo benchmark easily exceeds 68060's optimal L1 cache fetch rate.
Quote:
There will always be some software like Quake that streams most data and so does not benefit much from data caches, but for most software, and considering the more expensive and larger chip package a 64-bit bus needs, double the caches with a 32-bit data bus is more practical than a 64-bit bus. Even though the 68060 was not introduced with the doubled caches planned for the 68060+, I expect it had significantly higher cache hit rates than the P5 Pentium, which used its 64-bit data bus more than the 68060 used its 32-bit data bus. The low memory traffic with a 32-bit data bus allowed the 68060 to be successful in the embedded market where the 64-bit data bus P5 Pentium was not, yet the 68060 could still compete in performance against the desktop P5 Pentium.
|
I'm still waiting for your annual sales statistics with the 68K model breakdown. Is it another "two more weeks"?
Don't hide mass 68000 sales with 68060.
Motorola’s 79.3 million units put it on top, as usual. Its 68K line has been the embedded 32-bit volume leader since it created the category. As the figure shows, sales of 68K chips were about equal to worldwide sales of PCs. Taken together, that’s one new 32-bit microprocessor for every man, woman, and child living in the United States.
Where's the 68K model breakdown?
Fact: MIPS displaced 68K in mainstream 32-bit/64-bit embedded game consoles.
For the system integration phase in 1995, where's your game console business plan with valid BOM costings and 68EC060 / 68LC060 / 68060? Is it another "two more weeks"?
Note that the 1998-1999 time scale was the original Xbox's system integration design phase.
Last edited by Hammer on 18-Jul-2025 at 07:23 AM.
_________________ Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68) Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68) Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB |
bhabbott
Re: The (Microprocessors) Code Density Hangout Posted on 18-Jul-2025 18:48:23
[ #356 ]
Cult Member
Joined: 6-Jun-2018 Posts: 556
From: Aotearoa

@Hammer
Quote:
Hammer wrote:
Fact: MIPS displaced 68K in mainstream 32-bit/64-bit embedded game consoles. |
Fact: current game consoles use AMD x86-64, not MIPS.
Quote:
For the system integration phase in 1995, where's your game console business plan with valid BOM costings and 68EC060 / 68LC060 / 68060? Is it another "two more weeks"? |
This thread is about code density, not the plans of businesses in 1995.
Quote:
Note that the 1998-1999 time scale was the original Xbox's system integration design phase. |
Which is why it used a MIPS CPU. Oh wait...