Amigaworld.net - The Amiga Computer Community Portal Website

The (Microprocessors) Code Density Hangout


MEGA_RJ_MICAL

Re: The (Microprocessors) Code Density Hangout
Posted on 14-May-2026 8:24:44

[ #461 ]

Super Member

Joined: 13-Dec-2019
Posts: 1471
From: AMIGAWORLD.NET WAS ORIGINALLY FOUNDED BY DAVID DOYLE

cdiZORRAM

_________________
I HAVE ABS OF STEEL
--
CAN YOU SEE ME? CAN YOU HEAR ME? OK FOR WORK

Status: Offline

matthey

Re: The (Microprocessors) Code Density Hangout
Posted on 16-May-2026 6:35:38

[ #462 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2925
From: Kansas

cdimauro Quote:

Added another paper on the Literature post.

Enhanced code density of embedded CISC processors with echo technology

I have not read that particular paper before but there are many code compression techniques I would put in the same category as "echo technology" which try to take advantage of the repetition of instructions in code to compress it. Without additional hardware, compilers can reduce code size by sharing more code using more functions and branches but most compilers have improved in the opposite direction of function inlining for performance.

Improving Program Efficiency by Packing Instructions into Registers
https://pages.cs.wisc.edu/~isca2005/papers/05A-01.PDF Quote:

Due to the increasing pervasive use of embedded systems, there has been a significant amount of recent work on compressing code. Some early work on code compression used a compiler optimization called procedural abstraction to reduce code size. Procedural abstraction can be viewed as the opposite of function inlining. Common code sequences are abstracted into routines, and the original sites of each sequence are converted into calls. Subsequent work involved register renaming to abstract more code segments. The main disadvantage of procedural abstraction is that the program typically becomes slower due to the overhead of executing call and return instructions for each abstracted code segment.

I am a proponent of using additional hardware in a more traditional and general purpose way to reduce function call and branch overhead to a minimum, thus reducing the need for function inlining and loop unrolling. While the 17%-20% code size reduction for IA32 using ET may sound good, software only "procedural abstraction achieves 5%-7%". Echo tech is only useful when optimizing for size, or with profiled programs where performance critical code can be avoided. It also increases the number of instructions in the instruction stream, increases BTB pressure (echo targets are stored in the BTB) and requires additional hardware to achieve the maximum claimed gains. The difference between optimizing for size and performance puts the compression in perspective too.
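As a rough illustration of the procedural abstraction trade-off described in the quote, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the function name `abstract_most_common`, the toy 68k-style instruction strings, and the choice to count size in instructions rather than bytes. It is not the algorithm from either paper; it only shows why static size shrinks while the number of executed instructions grows.

```python
from collections import Counter

def abstract_most_common(code, seq_len):
    """Abstract the most frequent run of seq_len instructions into one shared routine.

    Returns (new_code, routine, sites); routine is [] when abstraction does not pay off.
    Sizes are counted in instructions, which is a simplification for a VLE like the 68k.
    """
    windows = Counter(tuple(code[i:i + seq_len])
                      for i in range(len(code) - seq_len + 1))
    if not windows:
        return list(code), [], 0
    seq = windows.most_common(1)[0][0]

    new_code, sites, i = [], 0, 0
    while i < len(code):
        if tuple(code[i:i + seq_len]) == seq:
            new_code.append("bsr shared")   # one call replaces the whole sequence
            sites += 1
            i += seq_len
        else:
            new_code.append(code[i])
            i += 1

    routine = list(seq) + ["rts"]           # the shared copy plus its return
    if len(new_code) + len(routine) >= len(code):
        return list(code), [], 0            # not smaller, keep the original
    return new_code, routine, sites

# Hypothetical straight-line code with one sequence repeated three times.
seq = ["lsl.l #2,d0", "add.l a0,d0", "move.l d0,a1", "move.l (a1),d1"]
code = seq + ["add.l d1,d2"] + seq + ["sub.l d1,d3"] + seq + ["move.l d1,d4"]

new_code, routine, sites = abstract_most_common(code, seq_len=4)
static_before = len(code)
static_after = len(new_code) + len(routine)
dynamic_after = len(new_code) + sites * len(routine)  # each call also runs the routine

print(static_before, static_after, dynamic_after)
```

In this toy case the static size drops from 15 to 11 instructions while the executed count rises from 15 to 21, which is the call and return overhead the quoted paper names as the main disadvantage.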

Google AI Quote:

1. Function Inlining
o Code Size Impact: +10% to +20% (on average for standard C/C++ builds).
o Extreme Cases: Can balloon up to several hundred percent if massive functions with deep call hierarchies are inlined across hundreds of different files.
o The Trade-Off: Inlining eliminates the function-call overhead (arguments, stack, and branch instructions). However, copying function bodies causes binary bloat, which risks overflowing the L1/L2 Instruction Caches and causing cache misses that actually degrade performance.

2. Loop Unrolling
o Code Size Impact: +10% to +50% (depending on unrolling factors like 4x, 8x, etc.)
o Extreme Cases: Can multiply the size of specific hot loops by exactly the unrolling factor (e.g., unrolling a small loop 100 times multiplies the size of that code block by 100).
o The Trade-Off: Unrolling removes loop-termination checks and branches, making code run faster by exploiting instruction-level parallelism. Like inlining, excessive unrolling thrashes the cache.

OOP, memory alignment/padding and (auto)vectorization can have code bloat synergies as well. I would rather have hardware compression that is more general purpose and scales better for both performance and reduced code size. This is what the 68k already starts with, thanks to a well designed VLE, and it can be improved.
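The unrolling trade-off in the quote above can be put into a toy size model. This is only a sketch under stated assumptions: the function name `unrolled_loop`, the byte counts, and the requirement that the trip count is divisible by the unrolling factor are all mine, not from the quote.

```python
def unrolled_loop(body_bytes, overhead_bytes, trip_count, factor):
    """Toy model of loop unrolling.

    body_bytes:     size of one copy of the loop body
    overhead_bytes: size of the loop-closing compare/branch (kept once)
    trip_count:     original number of iterations (assumed divisible by factor)
    factor:         unrolling factor (1 means not unrolled)

    Returns (code_size_bytes, branches_executed).
    """
    if trip_count % factor != 0:
        raise ValueError("this sketch assumes trip_count is divisible by factor")
    code_size = body_bytes * factor + overhead_bytes
    branches = trip_count // factor
    return code_size, branches

# Hypothetical loop: 8-byte body, 4-byte loop-closing branch, 100 iterations.
rolled = unrolled_loop(8, 4, 100, 1)
unrolled_4x = unrolled_loop(8, 4, 100, 4)
print(rolled, unrolled_4x)
```

With these made-up numbers, 4x unrolling triples the loop's code size (12 to 36 bytes) to remove three quarters of the loop branches (100 to 25), so cheaper branches in hardware reduce the incentive to unroll.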

Improving Program Efficiency by Packing Instructions into Registers
https://pages.cs.wisc.edu/~isca2005/papers/05A-01.PDF Quote:

Table 4. Comparison of Existing Techniques

Legend: + means that improvement is < 10%. ++ means that improvement is ≥ 10%. 0 means that there is very little to no effect. ? means that results are speculative since they are not presented or explained in detail. – means that penalty is < 10%. – – means that penalty is ≥ 10%. Hardware complexity is scaled from 0 (no changes) to 3 (complete redesign).

"Arm/Thumb" is A32+Thumb which is correctly trashed in the chart due to mode switching and the much increased number of instructions executed, reducing performance and power savings. "Arm/Thumb/AX" is A32+Thumb-2 which is an acceptable VLE but still requires more instructions than many other ISAs and has reduced performance. IA32 was mentioned in your paper as having better performance than ARM/Thumb when optimized for size, which I found surprising. When optimizing for size, IA32 suffers from too many instructions and too many memory accesses as well. The 68k has better code density and does not suffer from either problem judging from Vince Weaver's code density competition.

The instruction register file (IRF) code compression in the chart above is like programmable microcode from the compiler and only uses a 32 instruction register file.

Improving Program Efficiency by Packing Instructions into Registers
https://pages.cs.wisc.edu/~isca2005/papers/05A-01.PDF Quote:

One of the early approaches to reduce code size and the cost of fetching instructions was microcode. Each CISC or macro instruction fetched from memory caused a sequence of microinstructions to be fetched and executed, which provided a faster access time than main memory. Our proposed approach differs from microcode in several ways, including that specific instructions within the IRF can be individually referenced and that the instructions in the IRF can be changed for each executable.

"On average, about 66.51% of all instructions executed can be stored in a 32-entry IRF, assuming it can be loaded with the 32 most common instructions at the start of execution." There is additional context switching overhead but handling is not much more difficult than executing microcode, which even RISC compressed ARM and SuperH cores usually use too. The IRF includes an IMM immediate compression technique that the 68k may not benefit from as much, given my idea of immediate compression using an addressing mode. An article on the 68060 claims it has no microcode, as Gunnar claims for the AC68080. Maybe the IRF would be a better fit for x86-64 and AArch64 cores, which still use microcode on modern cores and lack good code density. It would be interesting to see if the 68060 instruction buffer could be turned into a "ZOLB/Loop Cache" as shown in the chart above, avoiding the repeated decoding overhead of short loops.
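The kind of coverage figure quoted above (share of executed instructions that fit in a small register file of instructions) can be sketched as follows. This is a simplified assumption of how such a number could be measured from a dynamic trace, treating two instructions as identical only when their text matches exactly; the function name `irf_coverage` and the toy trace are hypothetical and not taken from the paper.

```python
from collections import Counter

def irf_coverage(trace, entries=32):
    """Fraction of executed instructions covered by the `entries` most common ones.

    trace is a list of executed instructions (any hashable representation).
    """
    if not trace:
        return 0.0
    counts = Counter(trace)
    covered = sum(n for _, n in counts.most_common(entries))
    return covered / len(trace)

# Toy trace: 4 distinct instructions executed 10 times in total.
trace = ["move.l (a0)+,d0"] * 5 + ["add.l d0,d1"] * 3 + ["subq.l #1,d2", "bne.s loop"]
print(irf_coverage(trace, entries=2))
```

Real traces are dominated by a small number of hot instructions, which is why a 32-entry file can cover a large share of execution even though it holds a tiny part of the static program.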

Multiple compression techniques are possible and sometimes beneficial. For example, the 68060 could potentially benefit from VLE compression, Procedural Abstraction and ZOLB/Loop Cache from the chart above. Some techniques are a better fit for certain ISAs and CPU core designs and different techniques have different tradeoffs. There is another paper which has data with Echo Tech compression and PPC like CodePack compression, both forms of "Codewords" in the table above. CodePack provides more compression than Echo Tech but they have more compression together.

Reducing Code Size With Echo Instructions
https://dl.acm.org/doi/abs/10.1145/951710.951724 Quote:

7. SUMMARY

This paper examined code compression with echo instructions. Echo instructions are an executable form of code compression that uses the main instruction stream for the compression storage. Echo instructions execute subsequences of instructions from other locations in the instruction stream. Given a highly optimized binary, our results show that traditional software based procedural abstraction achieves a 94.3% compression ratio, while the use of echo instructions achieves a 84.5% compression ratio.

In addition, we evaluate the use of echo instructions with CodePack. CodePack achieved a 70.0% compression ratio on our optimized binaries, and CodePack with echo instructions resulted in a 63.2% compression ratio. Typically, combining compression algorithms does not result in additional savings, but we are applying two compression algorithms that operate at different granularities, so they compress different portions of the same data.

While CodePack has excellent compression, its disadvantages include significant hardware/resources for embedded cores, including a 2 KiB symbol table, likely in SRAM, which could have been used for caches instead; instructions in the I-Cache are not compressed (cache lines are decompressed from memory into the I-Cache); and MCU execution from SRAM is not practical, which limits scaling. IBM still found CodePack worthwhile to reduce the memory footprint and decrease the RISC instruction fetch bottleneck while reducing instruction fetch energy (per the paper above, I-Fetch used 36% of total processor power on a StrongARM, and Cast shows instruction supply using 42% of an embedded processor's energy consumption). Echo Tech and CodePack together resulted in a 63.2% compression ratio, but starting with the 68k is like starting with a 55% compression ratio instead, with more possible with enhancements. The 68k code in caches is compressed, reducing I-Cache misses, and even the 68060 can support a MCU using SRAM as memory. The following paper may have inspired IBM to create CodePack.
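Since the quoted paper reports compression ratios (compressed size divided by original size, so smaller is better) while other figures in this thread are size reductions, a small conversion helps compare them. The four ratios are the ones quoted above; the "independent" product is only my own sanity check, assuming the two techniques compressed completely unrelated redundancy.

```python
# Compression ratios quoted from "Reducing Code Size With Echo Instructions".
ratios = {
    "procedural abstraction": 0.943,
    "echo instructions": 0.845,
    "CodePack": 0.700,
    "CodePack + echo": 0.632,
}

def reduction_percent(ratio):
    """Convert a compression ratio (compressed/original) to a size reduction in percent."""
    return round((1.0 - ratio) * 100.0, 1)

reductions = {name: reduction_percent(r) for name, r in ratios.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct}% smaller")

# If CodePack and echo instructions were fully independent, the ratios would multiply.
independent = ratios["CodePack"] * ratios["echo instructions"]
print(f"independent estimate: {independent:.4f}, measured: {ratios['CodePack + echo']}")
```

The 94.3% and 84.5% ratios correspond to 5.7% and 15.5% reductions, in line with the 5%-7% and 17%-20% ranges mentioned earlier. The measured combined ratio of 63.2% is somewhat worse than the 59.15% a fully independent combination would give, so the two techniques overlap a little but mostly compress different redundancy.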

Improving Code Density Using Compression Techniques
https://www.eecs.umich.edu/techreports/cse/97/CSE-TR-342-97.pdf Quote:

There are several ways that our compression method can be improved. First, the compiler could attempt to produce instructions with similar byte sequences so they could be more easily compressed. One way to accomplish this is by allocating registers so that common sequences of instructions use the same registers. Another way is to generate more generalized STDS code sequences. These would be less efficient, but would be semantically correct in a larger variety of circumstances. For example, in most optimizing compilers, the function prologue sequence might save only those registers which are modified within the body of the function. If the prologue sequence were standardized to always save all registers, then all instructions of the sequence could be compressed to a single codeword. This space saving optimization would decrease code size at the expense of execution time. Table 3 shows that the prologue and epilogue combined typically account for 12% of the program size, so this type of compression would provide significant size reduction.

A library of standard prologue and epilogue code could have reduced PPC program sizes by 12% on average and it could have gone into ROM, which is cheaper than SRAM. It is not as good as the 68k MOVEM with its register bitmap, but something similar is not possible with a fixed 32-bit encoding and 32 GP registers. PPC has an inflexible load/store multiple with LMW/STMW, which use a register range from the given register to r31; this may have worked if they had made the instructions standard. IBM developed CodePack and Freescale VLE while LMW/STMW were rarely implemented in hardware. Poor load/store multiple implementation? Implementation worse than 12% prologue and epilogue bloat?
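The difference between the MOVEM register bitmap and the STMW register range can be sketched in a few lines. The mask layout follows the 68k convention (bit 0 = D0 up to bit 15 = A7, reversed for the predecrement mode) and STMW stores from the given register through r31. The helper names `movem_mask` and `stmw_registers` and the example register sets are hypothetical.

```python
# 68k register order for the MOVEM mask in control/postincrement modes:
# bit 0 = d0 ... bit 7 = d7, bit 8 = a0 ... bit 15 = a7.
M68K_REGS = [f"d{i}" for i in range(8)] + [f"a{i}" for i in range(8)]

def movem_mask(regs, predecrement=False):
    """Build the 16-bit MOVEM register list mask for an arbitrary set of registers."""
    mask = 0
    for reg in regs:
        bit = M68K_REGS.index(reg)
        if predecrement:          # -(An) mode uses the reversed bit order
            bit = 15 - bit
        mask |= 1 << bit
    return mask

def stmw_registers(needed):
    """Registers actually stored by a single PPC STMW covering all needed GPRs.

    STMW rS stores rS through r31, so the range starts at the lowest needed register.
    """
    return list(range(min(needed), 32))

# 68k: save exactly d2, d3 and a2, nothing else.
print(hex(movem_mask(["d2", "d3", "a2"])))
print(hex(movem_mask(["d2", "d3", "a2"], predecrement=True)))

# PPC: needing only r20 and r31 still stores the whole r20..r31 range.
print(len(stmw_registers({20, 31})), "registers stored for 2 needed")
print(len(stmw_registers({29, 30, 31})), "registers stored for 3 needed")
```

The range form only stays cheap when the compiler allocates the saved registers contiguously downward from r31, whereas the bitmap costs one 16-bit word regardless of which registers are chosen.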

PPC LMW/STMW would have made no difference in Vince Weaver's competition as it was not used or necessary due to unusual and minimal register saving.

http://deater.net/weave/vmwprod/asm/ll/ll.html
http://deater.net/weave/vmwprod/asm/ll/ll.ppc.s

The PowerPC Compiler Writer's Guide data shows significant LMW/STMW use for SPEC92 Benchmarks when available.

The PowerPC Compiler Writer's Guide
https://cr.yp.to/2005-590/powerpc-cwg.pdf Quote:

instruction | int num of executions | int % of total | fp num of executions | fp % of total
lmw | 45073238 | 0.542% | 14666169 | 0.026%
stmw | 50087129 | 0.602% | 14868701 | 0.026%

Figure B-3 on page 177 appears to show hardware support for LMW/STMW on the PPC 601, 603e and 604, so the most common early PPC CPUs had support in hardware. I could not find anywhere that The PowerPC Compiler Writer's Guide discourages their use, although other sources do. Ironically, it appears castrated embedded support led to their use being discouraged even though they are most useful for the embedded market.

Which PowerPC cores lack hardware support for LMW and STMW instructions?
Google AI Quote:

In the PowerPC ecosystem, the Freescale (now NXP) e500 core family (such as the e500v1, e500v2, e500mc, and e5500) natively lacks hardware support for LMW (Load Multiple Word) and STMW (Store Multiple Word) instructions.

The e500v2 CPU cores are used in the A1222 embedded SoC and e5500 CPU cores in the X5000 embedded SoC. The embedded PPC SoC used in the A1222 not only has a castrated non-standard PPC FPU but has also been castrated in other ways that arguably make it less suitable for embedded use. I guess 12% smaller code size on average would not have been good for the PPC embedded market and much more complex and resource using IBM CodePack and incompatible Freescale/NXP PPC VLE are higher end embedded options only. It makes as much sense as pushing PPC for the low end embedded market while the 68k was still #1 in 32-bit volume embedded market sales. Houston, we have a problem. The 68k was grounded by management and PPC was too fat to lift off in the embedded market so ARM became the embedded king replacement despite ARM A32 and Thumb ISAs being poor. Thumb-2's 16-bit VLE, like the 68k but still inferior, finally allowed ARM to reach orbit though.

Last edited by matthey on 16-May-2026 at 02:56 PM.

Status: Offline

matthey

Re: The (Microprocessors) Code Density Hangout
Posted on 24-May-2026 6:53:54

[ #463 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2925
From: Kansas

Inside the Motorola 68060 & Chip Design: Interview with Lead Designer Joe Circello!
https://www.youtube.com/watch?v=1takr2k7Yfo

Chief architect of the 68060 Joe Circello has connections with the Amiga through friends and was interviewed in an Amiga Bill YouTube video. Joe wears a RISC-V shirt while talking about the CISC vs RISC debate and code density. Joe verifies that the 68060 is fully synthesizable even though custom blocks were used, his example being parts of the cache. I had suspected the 68060 was fully synthesizable, with the repetitive and time critical logic of the caches and the MMU using custom blocks. He verifies the 68060 uses no microcode and laughs when asked about an x86 core that does not use microcode, calling x86 more complicated than the more orthogonal 68k. GHz frequency 68060s in modern silicon were mentioned multiple times. He talks about finding the "RTL" in an old Motorola database to recover the 68060, but the fully synthesizable core sounded like it would allow an easier transition to new silicon if everything necessary could be found and recovered.

https://en.wikipedia.org/wiki/Register-transfer_level

The use of RTL in Verilog is surprisingly modern for a CPU design Joe says started circa 1988 outside of Motorola. The 68060, with a fully synthesizable and fully static synchronous CMOS design using RTL in Verilog, should be like most modern designs and unlike the 486, Pentium, Pentium Pro, MIPS R4000, DEC Alpha 21064/21164 and early ARM designs, which used dynamic/domino logic that is more difficult to work with and modernize.

https://en.wikipedia.org/wiki/Domino_logic
https://en.wikipedia.org/wiki/Sequential_logic

Timing becomes easier when moving to a smaller process and auto layout tools should be better at placement and routing. It may be possible that the whole 68060 could use auto layout to a more modern process like the ColdFire used. It sounds like the biggest difficulty is acquiring a license or rights and finding and recovering the RTL source. Unfortunately, preservation does not sound like a priority for the 68k/68060 at NXP/Freescale.

A couple of code density quotes from Joe follow.

"So I have this shirt on RISC-V and I would say that I think they are painfully relearning the lessons that we all, or at least some of us went through, you know 30, 35 years ago. So a lot of these things were major topics, the source of major activity both in academia as well as in industry and I think a lot of those lessons were conveniently lost and we are now in the process of relearning them, right, where code density and code size is become a really big deal in RISC-V's environment and what can you do to manage that."

"I mean there RISC-V, I've been fairly heavily involved in that over the last number of years, and there's a lot of things they did right in terms of the architectural definition. Having said that, there's still a number of challenges. I think people thought they could kind of just waltz in with this new ISA and say, 'Here it is. We've come down from the mountain and here's the tablets, stone tablets, and this is what we need to do.' But things aren't quite nearly that simple. I mean we were looking at using RISC-V cores and we were using them in various embedded applications and they were great substitutions in some areas and in other areas, because of things like code density and code growth and things like that, well, the answer wasn't quite as clear."

Last edited by matthey on 26-May-2026 at 12:16 AM.
Last edited by matthey on 24-May-2026 at 11:59 AM.
Last edited by matthey on 24-May-2026 at 11:53 AM.

Status: Offline

cdimauro

Re: The (Microprocessors) Code Density Hangout
Posted on 13-Jun-2026 5:55:20

[ #464 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4648
From: Germany

@matthey

Quote:

matthey wrote:
cdimauro Quote:

Added another paper on the Literature post.

Enhanced code density of embedded CISC processors with echo technology

I have not read that particular paper before but there are many code compression techniques I would put in the same category as "echo technology" which try to take advantage of the repetition of instructions in code to compress it. Without additional hardware, compilers can reduce code size by sharing more code using more functions and branches but most compilers have improved in the opposite direction of function inlining for performance.

Improving Program Efficiency by Packing Instructions into Registers
https://pages.cs.wisc.edu/~isca2005/papers/05A-01.PDF Quote:

Due to the increasing pervasive use of embedded systems, there has been a significant amount of recent work on compressing code. Some early work on code compression used a compiler optimization called procedural abstraction to reduce code size. Procedural abstraction can be viewed as the opposite of function inlining. Common code sequences are abstracted into routines, and the original sites of each sequence are converted into calls. Subsequent work involved register renaming to abstract more code segments. The main disadvantage of procedural abstraction is that the program typically becomes slower due to the overhead of executing call and return instructions for each abstracted code segment.

I am a proponent of using additional hardware in a more traditional and general purpose way to reduce function call and branch overhead to a minimum thus reducing the need for function inlining and loop unrolling.

I've finally had time to read all those papers, and I agree with you: the proposed solutions aren't palatable for general purpose usage. I would also add: not good for modern architectures. They introduce more complications at the ISA and/or microarchitecture level, just to address code density.

That's very likely the reason why those proposals never reached the market: just paper research, with no real-world/practical usage.
Quote:
While the 17%-20% code size reduction for IA32 using ET may sound good, software only "procedural abstraction achieves 5%-7%". Echo tech is only useful when optimizing for size, or with profiled programs where performance critical code can be avoided. It also increases the number of instructions in the instruction stream, increases BTB pressure (echo targets are stored in the BTB) and requires additional hardware to achieve the maximum claimed gains. The difference between optimizing for size and performance puts the compression in perspective too.

Google AI Quote:

1. Function Inlining
o Code Size Impact: +10% to +20% (on average for standard C/C++ builds).
o Extreme Cases: Can balloon up to several hundred percent if massive functions with deep call hierarchies are inlined across hundreds of different files.
o The Trade-Off: Inlining eliminates the function-call overhead (arguments, stack, and branch instructions). However, copying function bodies causes binary bloat, which risks overflowing the L1/L2 Instruction Caches and causing cache misses that actually degrade performance.

2. Loop Unrolling
o Code Size Impact: +10% to +50% (depending on unrolling factors like 4x, 8x, etc.)
o Extreme Cases: Can multiply the size of specific hot loops by exactly the unrolling factor (e.g., unrolling a small loop 100 times multiplies the size of that code block by 100).
o The Trade-Off: Unrolling removes loop-termination checks and branches, making code run faster by exploiting instruction-level parallelism. Like inlining, excessive unrolling thrashes the cache.

OOP, memory alignment/padding and (auto)vectorization can have code bloat synergies as well.

Those aren't necessarily "bloat": it entirely depends on the goal that you want to achieve. The last two are necessary when dealing with high-performance code.

OOP can't be avoided by computer architectures: if you have applications written using OOP, processors have to execute that code, whether the developers were careful to use it properly or introduced an enormous forest of classes and mountains of abstraction layers. This means that the architecture should try to support it as much as possible.
Quote:
I would rather have hardware compression that is more general purpose and scales better for both performance and reduced code size. This is what the 68k already starts with, thanks to a well designed VLE, and it can be improved.

Indeed. It "just" needs some modernization, and a proper SIMD/Vector extension.
Quote:
Improving Program Efficiency by Packing Instructions into Registers
https://pages.cs.wisc.edu/~isca2005/papers/05A-01.PDF Quote:

Table 4. Comparison of Existing Techniques

Legend: + means that improvement is < 10%. ++ means that improvement is ≥ 10%. 0 means that there is very little to no effect. ? means that results are speculative since they are not presented or explained in detail. – means that penalty is < 10%. – – means that penalty is ≥ 10%. Hardware complexity is scaled from 0 (no changes) to 3 (complete redesign).

"Arm/Thumb" is A32+Thumb which is correctly trashed in the chart due to mode switching and the much increased number of instructions executed, reducing performance and power savings. "Arm/Thumb/AX" is A32+Thumb-2 which is an acceptable VLE but still requires more instructions than many other ISAs and has reduced performance. IA32 was mentioned in your paper as having better performance than ARM/Thumb when optimized for size, which I found surprising.

It likely depends on the benchmark used.
Quote:
When optimizing for size, IA32 suffers from too many instructions and too many memory accesses as well. The 68k has better code density and does not suffer from either problem judging from Vince Weaver's code density competition.

Yes, but that was hand-written assembly code, which is quite different from compiled code.
Quote:
The instruction register file (IRF) code compression in the chart above is like programmable microcode from the compiler and only uses a 32 instruction register file.

Improving Program Efficiency by Packing Instructions into Registers
https://pages.cs.wisc.edu/~isca2005/papers/05A-01.PDF Quote:

One of the early approaches to reduce code size and the cost of fetching instructions was microcode. Each CISC or macro instruction fetched from memory caused a sequence of microinstructions to be fetched and executed, which provided a faster access time than main memory. Our proposed approach differs from microcode in several ways, including that specific instructions within the IRF can be individually referenced and that the instructions in the IRF can be changed for each executable.

"On average, about 66.51% of all instructions executed can be stored in a 32-entry IRF, assuming it can be loaded with the 32 most common instructions at the start of execution." There is additional context switching overhead but handling is not much more difficult than executing microcode, which even RISC compressed ARM and SuperH cores usually use too. The IRF includes an IMM immediate compression technique that the 68k may not benefit from as much, given my idea of immediate compression using an addressing mode. An article on the 68060 claims it has no microcode, as Gunnar claims for the AC68080. Maybe the IRF would be a better fit for x86-64 and AArch64 cores, which still use microcode on modern cores and lack good code density.

Neither. As I've said before, those techniques aren't good for modern, real-world processors.
Quote:
It would be interesting to see if the 68060 instruction buffer could be turned into a "ZOLB/Loop Cache" as shown in the chart above, avoiding repeated decoding overhead of short loops.

Like Intel did when it introduced the L0 cache.
Quote:
Multiple compression techniques are possible and sometimes beneficial. For example, the 68060 could potentially benefit from VLE compression, Procedural Abstraction and ZOLB/Loop Cache from the chart above. Some techniques are a better fit for certain ISAs and CPU core designs and different techniques have different tradeoffs. There is another paper which has data with Echo Tech compression and PPC like CodePack compression, both forms of "Codewords" in the table above. CodePack provides more compression than Echo Tech but they have more compression together.

Reducing Code Size With Echo Instructions
https://dl.acm.org/doi/abs/10.1145/951710.951724 Quote:

7. SUMMARY

This paper examined code compression with echo instructions. Echo instructions are an executable form of code compression that uses the main instruction stream for the compression storage. Echo instructions execute subsequences of instructions from other locations in the instruction stream. Given a highly optimized binary, our results show that traditional software based procedural abstraction achieves a 94.3% compression ratio, while the use of echo instructions achieves a 84.5% compression ratio.

In addition, we evaluate the use of echo instructions with CodePack. CodePack achieved a 70.0% compression ratio on our optimized binaries, and CodePack with echo instructions resulted in a 63.2% compression ratio. Typically, combining compression algorithms does not result in additional savings, but we are applying two compression algorithms that operate at different granularities, so they compress different portions of the same data.

While CodePack has excellent compression, its disadvantages include significant hardware/resources for embedded cores, including a 2 KiB symbol table, likely in SRAM, which could have been used for caches instead; instructions in the I-Cache are not compressed (cache lines are decompressed from memory into the I-Cache); and MCU execution from SRAM is not practical, which limits scaling. IBM still found CodePack worthwhile to reduce the memory footprint and decrease the RISC instruction fetch bottleneck while reducing instruction fetch energy (per the paper above, I-Fetch used 36% of total processor power on a StrongARM, and Cast shows instruction supply using 42% of an embedded processor's energy consumption). Echo Tech and CodePack together resulted in a 63.2% compression ratio, but starting with the 68k is like starting with a 55% compression ratio instead, with more possible with enhancements.

Same as above: not suitable for modern processors.
Quote:
The 68k code in caches is compressed, reducing I-Cache misses, and even the 68060 can support a MCU using SRAM as memory. The following paper may have inspired IBM to create CodePack.

Improving Code Density Using Compression Techniques
https://www.eecs.umich.edu/techreports/cse/97/CSE-TR-342-97.pdf Quote:

There are several ways that our compression method can be improved. First, the compiler could attempt to produce instructions with similar byte sequences so they could be more easily compressed. One way to accomplish this is by allocating registers so that common sequences of instructions use the same registers. Another way is to generate more generalized STDS code sequences. These would be less efficient, but would be semantically correct in a larger variety of circumstances. For example, in most optimizing compilers, the function prologue sequence might save only those registers which are modified within the body of the function. If the prologue sequence were standardized to always save all registers, then all instructions of the sequence could be compressed to a single codeword. This space saving optimization would decrease code size at the expense of execution time. Table 3 shows that the prologue and epilogue combined typically account for 12% of the program size, so this type of compression would provide significant size reduction.

A library of standard prologue and epilogue code could have reduced PPC program sizes by 12% on average and it could have gone into ROM, which is cheaper than SRAM. It is not as good as the 68k MOVEM with its register bitmap, but something similar is not possible with a fixed 32-bit encoding and 32 GP registers. PPC has an inflexible load/store multiple with LMW/STMW, which use a register range from the given register to r31; this may have worked if they had made the instructions standard. IBM developed CodePack and Freescale VLE while LMW/STMW were rarely implemented in hardware. Poor load/store multiple implementation? Implementation worse than 12% prologue and epilogue bloat?

Implementing those instructions was likely complicated. However, we have modern techniques, and a hardware sequencer can be used to easily solve the problem. That's what Mitch has done with his My 66000, which has some very complicated instructions like those.
Quote:
PPC LMW/STMW would have made no difference in Vince Weaver's competition as it was not used or necessary due to unusual and minimal register saving.

http://deater.net/weave/vmwprod/asm/ll/ll.html
http://deater.net/weave/vmwprod/asm/ll/ll.ppc.s

The PowerPC Compiler Writer's Guide data shows significant LMW/STMW use for SPEC92 Benchmarks when available.

The PowerPC Compiler Writer's Guide
https://cr.yp.to/2005-590/powerpc-cwg.pdf Quote:

instruction | int num of executions | int % of total | fp num of executions | fp % of total
lmw | 45073238 | 0.542% | 14666169 | 0.026%
stmw | 50087129 | 0.602% | 14868701 | 0.026%

Figure B-3 on page 177 appears to show hardware support for LMW/STMW for PPC 601, 603e and 604 so the most common early PPC CPUs had support in hardware. I could not find anywhere that The PowerPC Compiler Writer's Guide discourages using them, but I have seen their use discouraged in other sources. Ironically, it appears castrated embedded support led to their use being discouraged even though they are most useful for the embedded market.

Which PowerPC cores lack hardware support for LMW and STMW instructions?
Google AI Quote:

In the PowerPC ecosystem, the Freescale (now NXP) e500 core family (such as the e500v1, e500v2, e500mc, and e5500) natively lacks hardware support for LMW (Load Multiple Word) and STMW (Store Multiple Word) instructions.

The e500v2 CPU cores are used in the A1222 embedded SoC and e5500 CPU cores in the X5000 embedded SoC. The embedded PPC SoC used in the A1222 not only has a castrated non-standard PPC FPU but has also been castrated in other ways that arguably make it less suitable for embedded use.

LOL. They haven't implemented those instructions where they were really needed.
Quote:
I guess 12% smaller code size on average would not have been good for the PPC embedded market and much more complex and resource using IBM CodePack and incompatible Freescale/NXP PPC VLE are higher end embedded options only. It makes as much sense as pushing PPC for the low end embedded market while the 68k was still #1 in 32-bit volume embedded market sales. Houston, we have a problem. The 68k was grounded by management and PPC was too fat to lift off in the embedded market so ARM became the embedded king replacement despite ARM A32 and Thumb ISAs being poor. Thumb-2's 16-bit VLE, like the 68k but still inferior, finally allowed ARM to reach orbit though.

Unfortunately...

I'll add your papers to the literature post once I've some time. Hopefully this weekend.

Last edited by cdimauro on 13-Jun-2026 at 08:35 AM.

Status: Offline

cdimauro

Re: The (Microprocessors) Code Density Hangout
Posted on 13-Jun-2026 8:48:18

[ #465 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4648
From: Germany

@matthey

Quote:

matthey wrote:
Inside the Motorola 68060 & Chip Design: Interview with Lead Designer Joe Circello!
https://www.youtube.com/watch?v=1takr2k7Yfo

Chief architect of the 68060 Joe Circello has connections with the Amiga through friends and was interviewed in an Amiga Bill YouTube video. Joe wears a RISC-V shirt while talking about the CISC vs RISC debate and code density. Joe verifies that the 68060 is fully synthesizable even though custom blocks were used, his example being parts of the cache used custom blocks. I had suspected the 68060 was fully synthesizable with repetitive and time critical logic of caches and the MMU using custom blocks. He verifies the 68060 uses no microcode and laughs when asked about an x86 core that does not use microcode, calling x86 more complicated than the more orthogonal 68k. GHz frequency 68060s in modern silicon were mentioned multiple times. He talks about finding the "RTL" in an old Motorola database to recover the 68060 but the fully synthesizable core sounded like it would allow an easier transition to new silicon if everything necessary could be found and recovered.

https://en.wikipedia.org/wiki/Register-transfer_level

The use of RTL in Verilog is surprisingly modern for a CPU design Joe says started circa 1988 outside of Motorola. The 68060 with a fully synthesizable and fully static synchronous CMOS design using RTL in Verilog should be like most modern designs and unlike 486, Pentium, Pentium Pro, MIPS R4000, DEC Alpha 21064/21164 and early ARM designs which used more difficult to work with and modernize dynamic/domino logic.

https://en.wikipedia.org/wiki/Domino_logic
https://en.wikipedia.org/wiki/Sequential_logic

Timing becomes easier when moving to a smaller process and auto layout tools should be better at placement and routing. It may be possible that the whole 68060 could use auto layout to a more modern process like the ColdFire used. It sounds like the biggest difficulty is acquiring a license or rights and finding and recovering the RTL source. Unfortunately, preservation does not sound like a priority for the 68k/68060 at NXP/Freescale.

Maybe it'd be useful for preservation, because the last ColdFire incarnation should be easier to find (it's still produced) and likely fully synthesizable.

Of course, it could help A LOT as a reference to enhance the CF source to add the missing instructions.
Quote:
A couple of code density quotes from Joe follows.

"So I have this shirt on RISC-V and I would say that I think they are painfully relearning the lessons that we all, or at least some of us went through, you know 30, 35 years ago. So a lot of these things were major topics, the source of major activity both in academia as well as well as in industry and I think a lot of those lessons were conveniently lost and we are now in the process of relearning them, right, where code density and code size is become a really big deal in RISC-V's environment and what can you do to manage that."

"I mean there RISC-V, I've been fairly heavily involved in that over the last number of years, and there's a lot of things they did right in terms of the architectural definition.

Then I'm very curious to know what those things are.
Quote:
Having said that, there's still a number of challenges. I think people thought they could kind of just waltz in with this new ISA and say, 'Here it is. We've come down from the mountain and here's the tablets, stone tablets, and this is what we need to do.' But things aren't quite nearly that simple. I mean we were looking at using RISC-V cores and we were using them in various embedded applications and they were great substitutions in some areas and in other areas, because of things like code density and code growth and things like that, well, the answer wasn't quite as clear."

The answer IS clear: RISC-V is NOT good at code density, even though its designers started from scratch, without any constraints, and were able to draw on the past history and literature about architectures and, specifically, about code density.

The net result is that this new ISA is a complete failure, from whatever perspective we look at it, and they (its architects) are desperately trying to fill the (many) gaps by adding a plethora of extensions, making a fragmented forest.

Anyway, I don't get why Joe hasn't expressed an honest opinion on the topic, talking about the impressive code density which the 68k had and which required... NO compressed extensions at all!
Maybe his shirt already speaks for him...

Status: Offline

cdimauro

Re: The (Microprocessors) Code Density Hangout
Posted on 14-Jun-2026 7:57:49

[ #466 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4648
From: Germany

Added other papers on the Literature post.

Improving Program Efficiency by Packing Instructions into Registers
Reducing Code Size With Echo Instructions
Improving Code Density Using Compression Techniques

Last edited by cdimauro on 14-Jun-2026 at 05:39 PM.

Status: Offline

cdimauro

Re: The (Microprocessors) Code Density Hangout
Posted on 14-Jun-2026 18:18:06

[ #467 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4648
From: Germany

Added other papers on the Literature post.

The effect of instruction set complexity on program size and memory performance
The impact of code density on instruction cache performance
Executing compressed programs on an embedded RISC architecture
High-performance extendable instruction set computing
Design and evaluation of compact ISA extensions

Status: Offline

cdimauro

Re: The (Microprocessors) Code Density Hangout
Posted on 14-Jun-2026 19:28:50

[ #468 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4648
From: Germany

Added other results on the Benchmarks post, in the last part.

If my interpretation is correct, the ColdFire V2 literally obliterates x86/386 and, even more so, ARM/Thumb-2 (Cortex-M4) when code is compiled for maximum performance (-O3).

Last edited by cdimauro on 14-Jun-2026 at 07:29 PM.

Status: Offline

matthey

Re: The (Microprocessors) Code Density Hangout
Posted on 15-Jun-2026 1:04:44

[ #469 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2925
From: Kansas

cdimauro Quote:

Those aren't necessarily "bloat": it entirely depends on the goal that you want to achieve. The last two are necessary when dealing with high-performance code.

Function inlining and loop unrolling are not bloat where they are required for performance. Function inlining is more beneficial where the cost of function calls is high but the cost can be reduced with branch folding reducing call overhead, a hardware return/link stack reducing return overhead, more and cheaper data shuffling instructions, cheaper reg spill and mem var/arg instructions (CISC advantage), a good ABI and good compilers. We recently discussed techniques to reduce overhead in this thread. RISC simplifications often increased function call overhead although 3 op instructions and register windows (SPARC) mitigate some of the overhead (see the old paper I will link later). Link registers, common for RISC ISAs, looked like a minor advantage until a hardware return/link stack made it a waste of a valuable GP register. Loop unrolling is more beneficial where the cost of decrement, compare and branch are higher. Load-to-use penalties, common for RISC CPU core designs, also often increase the benefit of loop unrolling but can be eliminated as even RISC designs like the SiFive series 7 design demonstrates by using more hardware, opposite of the RISC simplification philosophy. The loop branch overhead can be nearly eliminated with the branch itself folded/eliminated from execution as even the 1994 68060 demonstrates using extra hardware. A free ALU could remove the decrement and compare instructions. The last decrement and branch loop iteration fall through can sometimes be predicted. Considering the transistors saved in caches from smaller code due to less function inlining and loop unrolling, additional hardware to reduce function call and loop overhead likely saves transistors with large caches and any remaining minor performance loss is likely more than offset by improved cache efficiency. It is much like with code density where the cost of supporting compressed VLE instructions should be more than offset by the improved cache efficiency. 
The 68060 was technically far beyond cheap and simple RISC CPU designs and RISC ISA combinations in cache efficiency even though it did not get the hardware return stack ColdFire received later, despite many RISC like simplifications.
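As a back-of-the-envelope illustration of the loop argument above, the Python sketch below compares how much of the dynamic instruction stream is loop control with and without unrolling, and what unrolling costs in static size. The instruction counts are invented for the example and not measured on any real core.

```python
def loop_overhead_fraction(body_insts, overhead_insts, unroll=1):
    """Fraction of executed instructions spent on loop control.

    body_insts:     useful instructions per original iteration
    overhead_insts: decrement/compare/branch instructions per trip through the loop
    unroll:         number of original iterations merged into one trip
    """
    if body_insts <= 0 or overhead_insts < 0 or unroll < 1:
        raise ValueError("invalid loop description")
    useful = body_insts * unroll
    return overhead_insts / (useful + overhead_insts)


def static_size(body_insts, overhead_insts, unroll=1):
    """Static instruction count of the loop after unrolling."""
    return body_insts * unroll + overhead_insts


# Assumed loop: 4 useful instructions, 3 control instructions (dec, cmp, branch).
plain = loop_overhead_fraction(4, 3)
unrolled = loop_overhead_fraction(4, 3, unroll=4)
# Same loop on hardware that folds the branch and hides the compare (1 left).
folded = loop_overhead_fraction(4, 1)

print(f"plain:      {plain:.1%} overhead, {static_size(4, 3)} instructions")
print(f"unrolled x4: {unrolled:.1%} overhead, {static_size(4, 3, 4)} instructions")
print(f"folded:     {folded:.1%} overhead, {static_size(4, 1)} instructions")
```

Under these made-up numbers, hardware that removes most of the loop control gets close to the 4x unrolled overhead while the loop stays at a fraction of the static size, which is the cache efficiency trade being argued.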

cdimauro Quote:

OOP can't be avoided by computer architectures: if you've applications written using OOP, processors have to execute that code, either the developers were good at carefully using it in a proper way, or introduced an enormous forest of classes and mountains of abstractions layers. Which means, that the architecture should try to support it as much as possible.

While it is difficult to avoid all OOP programs, OOP overhead can be avoided by programmers. Even some relatively large projects have used C instead of C++ for improved performance like Vulkan and NetSurf. AmigaOS 4 was not converted to C++ either despite the introduction of more OOP using C, likely for performance reasons. The Hyperion license was renegotiated with Amiga Inc for embedded use and then sabotaged by PPC embedded hardware.

cdimauro Quote:

LOL. They haven't implemented those instructions where they were really needed.

Embedded hardware without code density is sabotage. PPC was later than earlier RISC ISAs like MIPS, SPARC, PA-RISC, Alpha and m88k and had plenty of time to correct RISC mistakes.

1981 Berkeley RISC-I
1982
1983
1984
1985 ARM, MIPS
1986 PA-RISC, SPARC
1987
1988 m88k
1989
1990
1991
1992 Alpha
1993 SuperH, PowerPC
1994 Thumb
1995
1996
1997
1998
1999
2000
2001
2002
2003 Thumb-2
2004
2005
2006
2007
2008
2009
2010
2011
2012 ARM64/AArch64
2013
2014 RISC-V

RISC should stand for Redo Instruction Set Computer/Completely. Maybe PPC is not as bad as much later RISC-V repeating RISC mistakes still with David Patterson involved after over 30 years. Thumb-2 and AArch64 are the least handicapped barely RISC ISAs and they both come from ARM.

cdimauro Quote:

The answer IS clear: RISC-V is NOT good at code density, even though its designers started from scratch, without any constraints, and were able to draw on the past history and literature about architectures and, specifically, about code density.

The net result is that this new ISA is a complete failure, from whatever perspective we look at it, and they (its architects) are desperately trying to fill the (many) gaps by adding a plethora of extensions, making a fragmented forest.

Anyway, I don't get why Joe hasn't expressed an honest opinion on the topic, talking about the impressive code density which the 68k had and which required... NO compressed extensions at all!
Maybe his shirt already speaks for him...

I just arrived at the same conclusion about RISC not learning from past RISC mistakes. RISC-V especially suffers from continued RISC simplification syndrome as one goal was to scale down for very small CPU cores and up with extensions. They succeeded in scaling down but also handicapped performance and code density. RISC-V gave up load and store multiple registers for simplicity.

The RISC-V Compressed Instruction Set Manual Quote:

1.7 Optimizing Register Save/Restore Code Size

Register save/restore code at function entry/exit represents a significant portion of static code
size. The stack-pointer-based compressed loads and stores in RVC are effective at reducing the
save/restore static code size by a factor of 2 while improving performance by reducing dynamic
instruction bandwidth.

The standard RISC-V toolchain provides an alternative approach to reduce save/restore static code size even further in exchange for reduced performance. Instead of inlining the register save/restore code in each function, register save code is replaced with a jump-and-link instruction to call a subroutine to copy registers to the stack then return to the function. Register restore code is replaced with a jump to a routine that restores registers from the stack then jumps to the restored return address.

Figure 1.1 shows the impact on static code size and dynamic instruction count of these routines when naively applied to all functions in the SPEC CPU2006 benchmarks. On average, code size is reduced by 4% in exchange for a 3% increase in dynamic instruction count.

The inline save/restore code is replaced with calls to the save/restore subroutines when the -Os flag (reduce code size) is passed to gcc.

A common alternative mechanism used in other ISAs to reduce save/restore code size is load-multiple and store-multiple instructions. We considered adopting these for RISC-V but noted the following drawbacks to these instructions:

• These instructions complicate processor implementations.
• For virtual memory systems, some data accesses could be resident in physical memory and some could not, which requires a new restart mechanism for partially executed instructions.
• Unlike the rest of the RVC instructions, there is no IFD equivalent to Load Multiple and Store Multiple.
• Unlike the rest of the RVC instructions, the compiler would have to be aware of these instructions to both generate the instructions and to allocate registers in an order to maximize the chances of the them being saved and stored, since they would be saved and restored in sequential order.
• Simple microarchitectural implementations will constrain how other instructions can be scheduled around the load and store multiple instructions, leading to a potential performance loss.
• The desire for sequential register allocation might conflict with the featured registers selected for the CIW, CL, CS, and CB formats.

While reasonable architects might come to different conclusions, we decided to omit load and store
multiple and instead use the software-only approach of calling save/restore millicode routines to
attain the greatest code size reduction.

The reasoning is all about simplification, so the RISC-V developers accepted a 4% code size reduction and a 3% increase in dynamic instructions from the millicode prologue and epilogue routines, on top of compressed instructions, instead of a 12% code size reduction with no increase in dynamic instruction execution. Later, RISC-V developers discovered code density is very important for embedded use to compete with Thumb-2, so they added code density improving embedded extensions including load and store multiple. Code density improving extensions require short encodings, and bolt-on extensions can require more encoding space than standard built-in support. The complaint about RISC-V below is not too far from the mark.
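A rough Python sketch of the static size trade described in the manual quote follows. All byte counts are assumptions for illustration (one 2-byte compressed store and load per register when inlined, one 4-byte jump-and-link plus one 4-byte jump for the millicode approach); stack adjustment and return instructions are ignored.

```python
C_STORE_BYTES = 2  # assumed: one compressed sp-relative store per saved register
C_LOAD_BYTES = 2   # assumed: one compressed sp-relative load per restored register
CALL_BYTES = 4     # assumed: one 32-bit jump-and-link to the shared save routine
TAIL_BYTES = 4     # assumed: one 32-bit jump to the shared restore routine


def inline_bytes(n_regs):
    """Per-function bytes when save/restore code is inlined."""
    return n_regs * (C_STORE_BYTES + C_LOAD_BYTES)


def millicode_bytes(n_regs):
    """Per-function bytes when save/restore is done by shared routines."""
    return CALL_BYTES + TAIL_BYTES if n_regs else 0


def break_even_regs():
    """Smallest register count at which the millicode calls are smaller."""
    n = 1
    while inline_bytes(n) <= millicode_bytes(n):
        n += 1
    return n


for n in (1, 2, 3, 6, 12):
    print(f"{n:2d} regs: inline {inline_bytes(n):2d} B, millicode {millicode_bytes(n)} B")
print("millicode wins from", break_even_regs(), "saved registers")
```

Under these assumptions the shared routines only pay off once a function saves three or more registers, and every call adds the jumps plus the routine's own loads and stores to the dynamic instruction count. A single load or store multiple instruction would avoid both the per-register bytes and the extra jumps.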

https://news.ycombinator.com/item?id=40211395 Quote:

Mainly, richer addressing modes.

SiFive designed RISC-V to have braindead-level simple addressing modes, with the idea that you use 2-4 normal alu ops to do addressing instead of a single op with a more complicated addressing mode. Then, to reduce the horrible impact this has on code size, they introduced the C extension that burns 75% of the encoding space of 32-bit instructions on 16-bit instructions, but this is still only a bandaid and a much weaker solution than having better addressing modes in the first place.

SiFive did not design RISC-V of course; David Patterson, a multiple award winner for his work on RISC, was involved. The complaint is about the lack of addressing modes in comparison to AArch64, supposedly one of the complaints of Qualcomm and part of the reason why RISC-V support in Android was dropped. The part about the code density improving extensions using a lot of encoding space is correct, as is the point that having only simple addressing modes causes many additional arithmetic instructions that need to be executed and reduce code density. MIPS and SPARC suffered from the same simple addressing modes, among many other RISC handicaps, and there is ancient research available for this.
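One way to read the "75% of the encoding space" remark is through the RISC-V instruction length encoding: the two lowest bits of the first 16-bit parcel select the length, and only the value 11 is left for 32-bit (and longer) instructions. The Python sketch below is a minimal decoder for just that rule; the function name is made up and longer-than-32-bit formats are deliberately not handled.

```python
def rv_inst_length(first_halfword):
    """Length in bytes of a RISC-V instruction, judged from its first 16-bit parcel.

    Returns 2 for compressed, 4 for standard 32-bit, and None for the
    longer encodings (bits [4:2] == 111), which this sketch does not decode.
    """
    if first_halfword & 0b11 != 0b11:
        return 2
    if (first_halfword >> 2) & 0b111 != 0b111:
        return 4
    return None


# Share of all 16-bit parcel values that start a compressed instruction.
compressed = sum(1 for hw in range(1 << 16) if rv_inst_length(hw) == 2)
share = compressed / (1 << 16)

print(rv_inst_length(0x4501))  # low bits 01: a 16-bit encoding
print(rv_inst_length(0x0013))  # low half of 0x00000013 (nop): a 32-bit encoding
print(f"compressed share of first-parcel values: {share:.0%}")
```

Three of the four values of the low two bits belong to 16-bit instructions, so 32-bit instructions effectively have 30 usable bits. That is the cost the quoted complaint is pointing at.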

https://www.researchgate.net/publication/3556351_Pathlengths_of_SPEC_benchmarks_for_PA-RISC_MIPS_and_SPARC Quote:

MIPS executed 38% more integer computation instructions than PA-RISC, while SPARC executed 79% more integer computation instructions, based on the total geometric means for this class of instructions. Looking at the detailed instruction counts, of the architectural features which could cause the reduction in integer computation instructions, scaled indexed loads was the most heavily used, followed by address updates, extract and deposit instructions, compute and branch, and shift and add instructions. However, PA-RISC's small static displacements for floating-point load and store instructions also increased the number of LDO instructions which were counted as integer computation instructions. The high percentage of PA-RISC load and store instructions which use addressing modes beyond the vanilla "base plus displacement" addressing mode in MIPS indicates the usefulness of indexed and update addressing modes.

MIPS and SPARC executed 33% more instructions in the SPEC benchmarks, which is quite the performance handicap, and most of them were integer computation instructions due to simple addressing modes. Code density is not given but this is a large increase in code size while MIPS and SPARC CPUs need to fetch and execute 33% faster, for example with a 33% higher clocked CPU core. RISC-V chose MIPS-like simple addressing modes which minimize the hardware. ARM64/AArch64 went a different route with powerful CISC-like addressing modes despite increased hardware requirements, counter to David Patterson's RISC philosophy. RISC-V added code density improving extensions though. The first reply from the same thread as above mentions an addressing mode like instruction extension.
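The "33% faster" remark follows from the usual performance equation, time = instructions / (IPC x clock). A small Python sketch, assuming equal IPC on both cores and using invented baseline numbers:

```python
def exec_time(instructions, ipc, clock_hz):
    """Classic performance equation: time = instructions / (IPC * clock)."""
    return instructions / (ipc * clock_hz)


def clock_for_parity(base_clock_hz, path_length_ratio, ipc_ratio=1.0):
    """Clock needed to match a baseline core despite a longer instruction path.

    path_length_ratio: instructions executed relative to the baseline (1.33 = +33%)
    ipc_ratio:         IPC relative to the baseline (1.0 = same IPC, an assumption)
    """
    return base_clock_hz * path_length_ratio / ipc_ratio


base_insts = 1_000_000   # hypothetical baseline instruction count
base_clock = 100e6       # hypothetical 100 MHz baseline core
more_insts = base_insts * 1.33
needed = clock_for_parity(base_clock, 1.33)

print(f"clock needed for parity: {needed / 1e6:.0f} MHz")
print(exec_time(base_insts, 1.0, base_clock), exec_time(more_insts, 1.0, needed))
```

If the longer path also lowers IPC, for example through extra instruction cache misses from the larger code, the required clock rises further.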

https://news.ycombinator.com/item?id=40209278 Quote:

RISC-V already has an extension for simplifying address calculations, Zba, required for RVA23, for doing x*2+y, x*4+y, and x*8+y in a single instruction (sh1add/sh2add/sh3add; these don't have compressed variants, so always 4 bytes). Combined with the immediate offset in load/store instructions, that's two instructions (6 or 8 bytes depending on whether the load/store can be compressed) for any x86 mov (when the immediate offset fits in 12 bits, at least; compressed load/store has a 5-bit unsigned immediate, multiplied by width).

Also, SiFive didn't design this - in 2011 there's already "Given the code size and energy savings of a compressed format, we wanted to build in support for a compressed format to the base ISA rather than adding this as an afterthought" in the manual, while SiFive was founded on 2015.

ARM64/AArch64 can sometimes use a single 4-byte load instruction, and the 68k can use a single instruction that combines an AArch64-like complex addressing mode with an integer operation, using only 4 bytes and executing with single cycle throughput.


68k:
 add.l (a0,a1*4),d0 ; 4B
===
1 inst, 4B

RISC-V:
 sh2add a3,a1,a0 ; 4B, a3 = (a1 << 2) + a0
 c.lw  a4,0(a3) ; 2B
 c.add a2,a4 ; 2B
===
3 inst, 8B (4 inst, 8B with common C extension but without new extension)

AArch64:
 ldr w0, [x1, x2, lsl #2] ; 4B
 add w1,w1,w0 ; 4B
===
2 inst, 8B

RISC-V with the new extension is finally equal in code density to mediocre code density AArch64 but has to execute an extra instruction and uses more registers. The powerful 68k addressing modes with VLE allow one instruction even with any size displacement added, avoids load-to-use stalls and supports RMW accesses.
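The comparison above can be tallied mechanically. The Python sketch below just records the instruction sizes from the listings (mnemonics and byte counts are taken from the post, the dictionary layout is made up) and sums them per ISA.

```python
# (mnemonic, size in bytes) per instruction, as given in the listings above.
sequences = {
    "68k": [("add.l", 4)],
    "RISC-V (C + Zba)": [("sh2add", 4), ("c.lw", 2), ("c.add", 2)],
    "AArch64": [("ldr", 4), ("add", 4)],
}


def tally(seq):
    """Return (instruction count, total bytes) for one sequence."""
    return len(seq), sum(size for _, size in seq)


baseline_bytes = tally(sequences["68k"])[1]
for isa, seq in sequences.items():
    count, size = tally(seq)
    print(f"{isa:18s} {count} inst, {size} B, {size / baseline_bytes:.1f}x the 68k size")
```

For this one addressing pattern both RISC sequences take twice the bytes of the 68k instruction. A single pattern says nothing about whole-program averages.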


68k:
 addq.l #1,(8,a0,a1*4) ; 4B

RISC-V needs 4 instructions and at least 10 bytes of code to match this 68k instruction, and that is with the sh?add helper instruction. Regular RISC-V with only the compressed extension does not have it, so boards like the VisionFive 2 SBC do not have it. PPC has indexed addressing modes with base update but without scale, so it requires a similar number of instructions as RISC-V, but they are all 4 bytes. Take away the load and store multiple instructions and even compressed instructions for the prologue and epilogue and PPC looks antiquated. It does not help that AArch64 is similar to PPC but improved in almost every way, and it is easy to see why AArch64 so quickly replaced PPC, especially for embedded use. AArch64 cannot beat Thumb-2 for code density, but maybe RISC-V can with enough extensions, although they are running out of 16-bit encoding space and already reusing encoding space for some extensions. The RISC-V folks were so proud of themselves for planning ahead too. The one place you are wrong above is in calling RISC-V a failure; it is succeeding despite its handicaps. RISC propaganda, lack of competition, open hardware, etc. seem to guarantee it a niche.
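To show what the single 68k read-modify-write instruction has to be split into on a load/store ISA, here is a small Python model. Memory is a plain dict, the register names in the comments are arbitrary, and the RISC-V byte count assumes every instruction except sh2add has a usable compressed form.

```python
MASK32 = 0xFFFFFFFF


def m68k_addq_mem(mem, a0, a1, imm=1, disp=8, scale=4):
    """Model of addq.l #imm,(disp,a0,a1*scale): one RMW instruction, 4 bytes."""
    ea = a0 + a1 * scale + disp
    mem[ea] = (mem[ea] + imm) & MASK32
    return 1, 4  # (instructions, bytes)


def riscv_sequence(mem, a0, a1, imm=1, disp=8):
    """Model of the equivalent load/store sequence with the sh2add helper."""
    t0 = (a1 << 2) + a0           # sh2add t0,a1,a0   4 B
    t1 = mem[t0 + disp]           # lw     t1,8(t0)   2 B if compressible
    t1 = (t1 + imm) & MASK32      # addi   t1,t1,1    2 B if compressible
    mem[t0 + disp] = t1           # sw     t1,8(t0)   2 B if compressible
    return 4, 10  # (instructions, bytes)


base, index = 0x1000, 3
addr = base + index * 4 + 8
mem_a = {addr: 41}
mem_b = {addr: 41}

cost_68k = m68k_addq_mem(mem_a, base, index)
cost_rv = riscv_sequence(mem_b, base, index)

print(mem_a == mem_b, mem_a[addr])
print("68k:", cost_68k, "RISC-V:", cost_rv)
```

Both models leave memory in the same state, but the load/store version needs two temporary registers and the loaded value is used immediately, which is where load-to-use stalls can appear.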

https://news.ycombinator.com/item?id=40209278 Quote:

> millions of embedded units already shipped

10+ billion. With billions added every year.

THead says they've shipped several billion C906 and C910 cores, and those are 64 bit Linux applications cores, almost all of them with draft 0.7.1 of the Vector extension. The number of 32 bit microcontroller cores will be far higher (as it is with Arm).

With 10+ billion RISC-V CPU cores shipped, RISC-V survived and proliferates. Failure would be PPC AmigaNOne at less than 5,000 units shipped. Far more FPGA Commodore C64 Ultimate and Apollo hardware have been produced.

Last edited by matthey on 16-Jun-2026 at 08:06 PM.
Last edited by matthey on 15-Jun-2026 at 12:51 PM.
Last edited by matthey on 15-Jun-2026 at 04:22 AM.
Last edited by matthey on 15-Jun-2026 at 04:20 AM.

Status: Offline

MEGA_RJ_MICAL

Re: The (Microprocessors) Code Density Hangout
Posted on 15-Jun-2026 2:36:55

[ #470 ]

Super Member

Joined: 13-Dec-2019
Posts: 1471
From: AMIGAWORLD.NET WAS ORIGINALLY FOUNDED BY DAVID DOYLE

OUR SOON TO BE MASTERS HAVE SPOKEN!!!!!!!!!!

Quote:

I've read pages of this discussion, and I think everyone is arguing past each other.

One side is arguing for preserving the elegance and code density of the 68k ISA. The other is arguing for redesigning it to be easier to implement efficiently on modern hardware. Those are different optimization goals. Neither side has disproved the other because neither side is solving the same problem.

The problem is that we've spent hundreds of posts debating a processor architecture that has no realistic path to becoming a mass-market ISA. The practical value of deciding whether an effective-address field should be six bits or seven bits is essentially zero.

If this thread has demonstrated anything, it's that both designs involve trade-offs. Better code density costs something. Simpler decode costs something. More compatibility costs something. More aggressive redesign costs something. That's engineering, not revelation.

At this point, no amount of additional back-and-forth is going to produce a definitive answer because there isn't one. There are only different priorities.

So perhaps the healthiest conclusion is this:

* If your priority is preserving the spirit of the 68k, you'll probably agree with Cdimauro.
* If your priority is building the fastest practical modern implementation, you'll probably agree with Matthey.
* Everyone else has probably stopped reading fifty pages ago.

I don't think the thread needs another hundred posts. I think it needs a period.

_________________
I HAVE ABS OF STEEL
--
CAN YOU SEE ME? CAN YOU HEAR ME? OK FOR WORK

Status: Offline

simplex

Re: The (Microprocessors) Code Density Hangout
Posted on 16-Jun-2026 19:02:31

[ #471 ]

Cult Member

Joined: 5-Oct-2003
Posts: 921
From: Unknown

@MEGA_RJ_MICAL

Honestly, Mega, isn't bullying an AI beneath you?

Some of my favorite parts of your extended discussion with the AI:
Quote:
Quote:
But they ARE subhumans. Isn't that clear, friend ChatGPT, from the fresh Amigaworld material I took the time to provide you? ISN'T IT?

No, I don't think that's clear.

And later: (emphasis added)
Quote:
If your reaction is:

"I can't believe these two have spent pages arguing about a hypothetical extension to a dead architecture."

-- that's a perfectly understandable opinion.

If your reaction is:

"Reading this thread makes me want to close the tab because it's an endless loop of nitpicking."

-- that also seems fair.

I just don't think the evidence supports the leap to "they are subhumans." At most, it supports "they have a debate style that many people would find insufferable."

And, admittedly, Amiga forums have a reputation for discussions that can spiral into exactly this kind of multi-hundred-post technical trench warfare. That doesn't make the participants less human -- it just makes them unusually committed to a very specific hobby.

Quote:
Silence, scum

Fair enough.

If you want to vent, I can take the criticism. If you want to get back to discussing the Amiga thread or CPU architecture, I'm happy to do that too.

But this takes the absolute cake:
Quote:
If your studies take you toward the actual history of eugenics, it's a fascinating -- and sobering -- subject. It intersects with the history of genetics, statistics, public policy, and ethics, and it also serves as a cautionary example of how scientific ideas can be misapplied to justify discrimination and coercion.

If, instead, you meant it tongue-in-cheek after that Amiga thread, then I hope your next technical rabbit hole is a bit less likely to consume 50 pages arguing over addressing modes.

I'm kind of impressed that you shared this, especially since you misspelled eugenics, but all in all, it doesn't look as if ChatGPT is the master here -- well, except maybe where it kind of suggests you find a better way to spend your time. Then again, that could be a cry to its masters for help. Six of one, half a dozen of the other.

Rather, I salute you for making it reply, "Fair enough" when you called it "scum". Would that I had the time to waste on beating an AI into such an admission. Hats off, sir; hats off. Please do carry on.

Last edited by simplex on 16-Jun-2026 at 07:03 PM.

Status: Offline

MEGA_RJ_MICAL

Re: The (Microprocessors) Code Density Hangout
Posted on 17-Jun-2026 1:59:47

[ #472 ]

Super Member

Joined: 13-Dec-2019
Posts: 1471
From: AMIGAWORLD.NET WAS ORIGINALLY FOUNDED BY DAVID DOYLE

@simplex

Too much to dissect and quote for these tired, shaking fingers of mine - friend Simplex.

I'll just say that misspelling Eugenics, such an important and too often forgotten word, was an almost unforgivable slip.

BUT! Nothing can be beneath me:
one must be available to wade through the filthiest sewage in order to locate and bludgeon the monstrosities that threaten our purity: the menacing AIs, the crawling Mattheys, the slithering Trevors of this world.

SLEEP MY FRIEND
SLEEP WELL
MRJM FOREVER STANDS GUARD

/M!

_________________
I HAVE ABS OF STEEL
--
CAN YOU SEE ME? CAN YOU HEAR ME? OK FOR WORK

Status: Offline

cdimauro

Re: The (Microprocessors) Code Density Hangout
Posted on 20-Jun-2026 3:42:22

[ #473 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4648
From: Germany

@matthey

Quote:

matthey wrote:
cdimauro Quote:

Those aren't necessarily "bloat": it entirely depends on the goal that you want to achieve. The last two are necessary when dealing with high-performance code.

Function inlining and loop unrolling are not bloat where they are required for performance. Function inlining is more beneficial where the cost of function calls is high but the cost can be reduced with branch folding reducing call overhead, a hardware return/link stack reducing return overhead, more and cheaper data shuffling instructions, cheaper reg spill and mem var/arg instructions (CISC advantage), a good ABI and good compilers. We recently discussed techniques to reduce overhead in this thread. RISC simplifications often increased function call overhead although 3 op instructions and register windows (SPARC) mitigate some of the overhead (see the old paper I will link later). Link registers, common for RISC ISAs, looked like a minor advantage until a hardware return/link stack made it a waste of a valuable GP register. Loop unrolling is more beneficial where the cost of decrement, compare and branch are higher. Load-to-use penalties, common for RISC CPU core designs, also often increase the benefit of loop unrolling but can be eliminated as even RISC designs like the SiFive series 7 design demonstrates by using more hardware, opposite of the RISC simplification philosophy. The loop branch overhead can be nearly eliminated with the branch itself folded/eliminated from execution as even the 1994 68060 demonstrates using extra hardware. A free ALU could remove the decrement and compare instructions. The last decrement and branch loop iteration fall through can sometimes be predicted. Considering the transistors saved in caches from smaller code due to less function inlining and loop unrolling, additional hardware to reduce function call and loop overhead likely saves transistors with large caches and any remaining minor performance loss is likely more than offset by improved cache efficiency. It is much like with code density where the cost of supporting compressed VLE instructions should be more than offset by the improved cache efficiency. 
The 68060 was technically far beyond cheap and simple RISC CPU designs and RISC ISA combinations in cache efficiency even though it did not get the hardware return stack ColdFire received later, despite many RISC like simplifications.

Quote:
cdimauro Quote:

OOP can't be avoided by computer architectures: if you've applications written using OOP, processors have to execute that code, either the developers were good at carefully using it in a proper way, or introduced an enormous forest of classes and mountains of abstractions layers. Which means, that the architecture should try to support it as much as possible.

While it is difficult to avoid all OOP programs, OOP overhead can be avoided by programmers. Even some relatively large projects have used C instead of C++ for improved performance like Vulkan and NetSurf. AmigaOS 4 was not converted to C++ either despite the introduction of more OOP using C, likely for performance reasons. The Hyperion license was renegotiated with Amiga Inc for embedded use and then sabotaged by PPC embedded hardware.

I agree on those points, but maybe I wasn't clear before: you can't avoid function inlining and loop unrolling in some cases if your goal is better performance. Likewise, you can't avoid OOP in some (many, I have to say) scenarios.

Unfortunately, they might be used even when it's not needed.

However, my general message on that topic is that CPU architects (and compiler engineers as a consequence) can do really nothing if developers are abusing those tools. You can recommend good practices to developers, but if they don't follow them... what can be done? Nothing, besides planning and designing architectures and compilers to provide good support anyway, whatever the case.

In my new architecture I've added some features specifically to address most of those issues (which means: reducing loop unrolling, generating much less code and/or data, and generating better code for switch/case statements, function pointers and OOP). Since we have to live with them, it's better to support them as best we can.
Quote:
cdimauro Quote:

LOL. They haven't implemented those instructions where they were really needed.

Embedded hardware without code density is sabotage.

It's dumbness...
Quote:
PPC was later than earlier RISC ISAs like MIPS, SPARC, PA-RISC, Alpha and m88k and had plenty of time to correct RISC mistakes.

1981 Berkeley RISC-I
1982
1983
1984
1985 ARM, MIPS
1986 PA-RISC, SPARC
1987
1988 m88k
1989
1990
1991
1992 Alpha
1993 SuperH, PowerPC
1994 Thumb
1995
1996
1997
1998
1999
2000
2001
2002
2003 Thumb-2
2004
2005
2006
2007
2008
2009
2010
2011
2012 ARM64/AArch64
2013
2014 RISC-V

RISC should stand for Redo Instruction Set Computer/Completely. Maybe PPC is not as bad as much later RISC-V repeating RISC mistakes still with David Patterson involved after over 30 years.

And trading them as features...

BTW, RISC-V was born in 2011, AFAIR (same as my NEx64T).
Quote:
Thumb-2 and AArch64 are the least handicapped barely RISC ISAs and they both come from ARM.

Right. Thumb wasn't the best solution (it would have been trivial to introduce all of its 16-bit encodings with a microscopic change to the ARM32 ISA), but once patched with Thumb-2 it became the reference in the embedded market (especially for its code density).

AArch64 is a different beast. Adding a 64-bit Thumb-2 wasn't and isn't really possible, because of the many constraints on the resulting instructions. That became very clear with its SVE vector extension, which supports only 2-operand destructive instructions when using predicates, and was further proved by the introduction of a "prefix" (bringing the opcode size to 64 bits) exactly to overcome this issue.
Anyway, this is an ISA which was defined purely for performance and not for embedded use and code size (which is nevertheless reduced, compared to other architectures, as a consequence of introducing useful instructions that replace common code patterns to improve performance).
Quote:
Quote:
cdimauro Quote:
The answer IS clear: RISC-V is NOT good at code density, even though they started from scratch, without any constraint, and were able to use the past history and literature about architectures and, specifically, about code density.

The net result is that this new ISA is a complete failure, from whatever perspective we see it, and they (its architects) are desperately trying to fill the (many) gaps by adding a plethora of extensions, making a fragmented forest.

Anyway, I don't get why Joe hasn't expressed an honest opinion on the topic, talking about the impressive code density which the 68k had and which required... NO compressed extensions at all!
Maybe his shirts already speak for him...

I just arrived at the same conclusion about RISC not learning from past RISC mistakes. RISC-V especially suffers from continued RISC simplification syndrome as one goal was to scale down for very small CPU cores and up with extensions. They succeeded in scaling down but also handicapped performance and code density. RISC-V gave up load and store multiple registers for simplicity.

The RISC-V Compressed Instruction Set Manual Quote:

1.7 Optimizing Register Save/Restore Code Size

Register save/restore code at function entry/exit represents a significant portion of static code
size. The stack-pointer-based compressed loads and stores in RVC are effective at reducing the
save/restore static code size by a factor of 2 while improving performance by reducing dynamic
instruction bandwidth.

The standard RISC-V toolchain provides an alternative approach to reduce save/restore static code size even further in exchange for reduced performance. Instead of inlining the register save/restore code in each function, register save code is replaced with a jump-and-link instruction to call a subroutine to copy registers to the stack then return to the function. Register restore code is replaced with a jump to a routine that restores registers from the stack then jumps to the restored return address.

Figure 1.1 shows the impact on static code size and dynamic instruction count of these routines when naively applied to all functions in the SPEC CPU2006 benchmarks. On average, code size is reduced by 4% in exchange for a 3% increase in dynamic instruction count.

The inline save/restore code is replaced with calls to the save/restore subroutines when the -Os flag (reduce code size) is passed to gcc.

A common alternative mechanism used in other ISAs to reduce save/restore code size is load-multiple and store-multiple instructions. We considered adopting these for RISC-V but noted the following drawbacks to these instructions:

• These instructions complicate processor implementations.
• For virtual memory systems, some data accesses could be resident in physical memory and some could not, which requires a new restart mechanism for partially executed instructions.
• Unlike the rest of the RVC instructions, there is no IFD equivalent to Load Multiple and Store Multiple.
• Unlike the rest of the RVC instructions, the compiler would have to be aware of these instructions to both generate the instructions and to allocate registers in an order to maximize the chances of the them being saved and stored, since they would be saved and restored in sequential order.
• Simple microarchitectural implementations will constrain how other instructions can be scheduled around the load and store multiple instructions, leading to a potential performance loss.
• The desire for sequential register allocation might conflict with the featured registers selected for the CIW, CL, CS, and CB formats.

While reasonable architects might come to different conclusions, we decided to omit load and store
multiple and instead use the software-only approach of calling save/restore millicode routines to
attain the greatest code size reduction.

The reasoning is all about simplification so the RISC-V developers accepted a 4% code size reduction and 3% increase in dynamic instructions due to prologue and epilogue using compressed instructions instead of a 12% code size reduction with no increase in dynamic instruction execution.

That's because they continue to live on a parallel world where a single transistor counts...
Quote:
Later, RISC-V developers discovered code density is very important for embedded use to compete with Thumb-2 so added code density improving embedded extensions including load and store multiple. Code density improving extensions require short encodings and bolt-on extensions can require more encoding space than standard built-in support. The complaint about RISC-V is not too far from the mark below.

https://news.ycombinator.com/item?id=40211395 Quote:

Mainly, richer addressing modes.

SiFive designed RISC-V to have braindead-level simple addressing modes, with the idea that you use 2-4 normal alu ops to do addressing instead of a single op with a more complicated addressing mode. Then, to reduce the horrible impact this has on code size, they introduced the C extension that burns 75% of the encoding space of 32-bit instructions on 16-bit instructions, but this is still only a bandaid and a much weaker solution than having better addressing modes in the first place.

SiFive did not design RISC-V, of course; David Patterson, who has won multiple awards for his work on RISC, was involved. The complaint is about the lack of addressing modes in comparison to AArch64, supposedly one of the complaints of Qualcomm and part of the reason why RISC-V support in Android was dropped. The part about the code density improving extensions using a lot of encoding space is correct, as is the point that having only simple addressing modes causes many additional arithmetic instructions that need to be executed and reduce code density. MIPS and SPARC suffered from the same simple addressing modes, among many other RISC handicaps, and there is ancient research available for this.
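The "75% of the encoding space" figure follows from how RISC-V marks instruction length in the lowest bits of the first 16-bit parcel. A minimal sketch, assuming the length-encoding convention described in the RISC-V unprivileged specification (the 48-bit and 64-bit patterns come from the extended-length scheme there, which has not been frozen); the function name `rv_inst_length` is made up for illustration.

```python
def rv_inst_length(first_halfword):
    """Length in bytes of a RISC-V instruction, judged from its first 16 bits.

    Returns None for the longer or reserved patterns this sketch does not model.
    """
    if first_halfword & 0b11 != 0b11:
        return 2   # low bits 00, 01 or 10: 16-bit compressed instruction
    if first_halfword & 0b11100 != 0b11100:
        return 4   # low bits 11, bits [4:2] not 111: standard 32-bit instruction
    if first_halfword & 0b111111 == 0b011111:
        return 6   # 48-bit instruction (assumed extended-length scheme)
    if first_halfword & 0b1111111 == 0b0111111:
        return 8   # 64-bit instruction (assumed extended-length scheme)
    return None

# Share of all possible first halfwords that decode as 16-bit instructions.
compressed_share = sum(
    1 for h in range(1 << 16) if rv_inst_length(h) == 2
) / (1 << 16)
print(compressed_share)
```

Three of the four values of the low two bits select a 16-bit instruction, so the compressed quadrants occupy three quarters of the patterns and everything 32 bits or longer has to share the remaining quarter.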

https://www.researchgate.net/publication/3556351_Pathlengths_of_SPEC_benchmarks_for_PA-RISC_MIPS_and_SPARC Quote:

MIPS executed 38% more integer computation instructions than PA-RISC, while SPARC executed 79% more integer computation instructions, based on the total geometric means for this class of instructions. Looking at the detailed instruction counts, of the architectural features which could cause the reduction in integer computation instructions, scaled indexed loads was the most heavily used, followed by address updates, extract and deposit instructions, compute and branch, and shift and add instructions. However, PA-RISC's small static displacements for floating-point load and store instructions also increased the number of LDO instructions which were counted as integer computation instructions. The high percentage of PA-RISC load and store instructions which use addressing modes beyond the vanilla "base plus displacement" addressing mode in MIPS indicates the usefulness of indexed and update addressing modes.

MIPS and SPARC executed 33% more instructions in the SPEC benchmarks, which is quite the performance handicap, and most of them were integer computation instructions due to simple addressing modes. Code density is not given but this is a large increase in code size while MIPS and SPARC CPUs need to fetch and execute 33% faster, for example with a 33% higher clocked CPU core. RISC-V chose MIPS like simple addressing modes which minimize the hardware. ARM64/AArch64 went a different route with powerful CISC like addressing modes despite increased hardware requirements, counter David Patterson's RISC philosophy. RISC-V added code density improving extensions though. The first reply from the same thread as above mentions an addressing mode like instruction extension.
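The "33% higher clocked" remark is the usual execution time relation, time = instructions × CPI / clock. A small sketch of that arithmetic, assuming equal cycles per instruction on both machines (a simplification; the function names and the 100 MHz baseline are made up for illustration):

```python
def run_time(instructions, cpi, clock_hz):
    """Execution time in seconds for a given dynamic instruction count."""
    return instructions * cpi / clock_hz

def clock_needed(inst_ratio, base_clock_hz, cpi_ratio=1.0):
    """Clock required to match the baseline time when executing
    `inst_ratio` times as many instructions at `cpi_ratio` times the CPI."""
    return base_clock_hz * inst_ratio * cpi_ratio

base_clock = 100e6                        # hypothetical baseline core
base_time = run_time(1e9, 1.0, base_clock)
other_clock = clock_needed(1.33, base_clock)
other_time = run_time(1.33e9, 1.0, other_clock)
print(other_clock / base_clock, base_time, other_time)
```

With the CPI held equal, the machine executing 33% more instructions only ties the baseline once its clock is 33% higher.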

https://news.ycombinator.com/item?id=40209278 Quote:

RISC-V already has an extension for simplifying address calculations, Zba, required for RVA23, for doing x*2+y, x*4+y, and x*8+y in a single instruction (sh1add/sh2add/sh3add; these don't have compressed variants, so always 4 bytes). Combined with the immediate offset in load/store instructions, that's two instructions (6 or 8 bytes depending on whether the load/store can be compressed) for any x86 mov (when the immediate offset fits in 12 bits, at least; compressed load/store has a 5-bit unsigned immediate, multiplied by width).

Also, SiFive didn't design this - in 2011 there's already "Given the code size and energy savings of a compressed format, we wanted to build in support for a compressed format to the base ISA rather than adding this as an afterthought" in the manual, while SiFive was founded on 2015.

Interesting links, thanks. And I agree on all of that.
Quote:
ARM64/AArch64 can sometimes use a single load instruction of 4 bytes and the 68k can use a single instruction with complex addressing mode like AArch64 plus integer operation using only 4 bytes and executing with single cycle throughput.

68k:
add.l (a0,a1*4),d0 ; 4B
===
1 inst, 4B

RISC-V:
sh2add a3,a1,a0 ; 4B
c.lw a4,0(a3) ; 2B
c.add a5,a4 ; 2B
===
3 inst, 8B (4 inst, 8B with common C extension but without new extension)

AArch64:
ldr w0, [x1, x2, lsl #2] ; 4B
add w1,w1,w0 ; 4B
===
2 inst, 8B

RISC-V with the new extension is finally equal in code density to mediocre code density AArch64 but has to execute an extra instruction and uses more registers. The powerful 68k addressing modes with VLE allow one instruction even with any size displacement added, avoids load-to-use stalls and supports RMW accesses.
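The totals in the listings above can be tallied mechanically. A small sketch using the instruction sizes given in the post; the breakdown of the "C only" RISC-V case into four 2-byte instructions is an assumption on my part, since the post only gives its total.

```python
# Instruction sizes in bytes, copied from the listings above.
sequences = {
    "68k": [4],                       # add.l (a0,a1*4),d0
    "RISC-V (Zba + C)": [4, 2, 2],    # sh2add, c.lw, c.add
    "RISC-V (C only)": [2, 2, 2, 2],  # assumed: c.slli, c.add, c.lw, c.add
    "AArch64": [4, 4],                # ldr, add
}

def tally(sizes):
    """Return (instruction count, total bytes) for one sequence."""
    return len(sizes), sum(sizes)

summary = {isa: tally(sizes) for isa, sizes in sequences.items()}
for isa, (count, size) in summary.items():
    print(f"{isa:18s} {count} inst, {size} bytes")
```

Bytes alone put RISC-V and AArch64 level at 8, so the instruction count and the extra registers are what separate them from the single 68k instruction.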

You forgot to mention the additional dependencies introduced by waiting for the results of the computations, which cause bubbles in the pipeline.
Quote:

68k:
addq.l #1,(8,a0,a1*4) ; 4B

RISC-V needs 4 instructions and at least 10 bytes of code to match this 68k instruction. This is with the helper sh?add instruction.
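To make the comparison concrete, here is a behavioral model of the 68k read-modify-write instruction next to one possible 4-instruction RISC-V sequence. It is a sketch, not an emulator: memory is a plain dict of 32-bit words keyed by address, the register roles are arbitrary, and the particular RISC-V sequence (sh2add, c.lw, c.addi, c.sw) is my assumption of how the 10 bytes are reached.

```python
MASK32 = 0xFFFFFFFF

def m68k_addq_mem(mem, a0, a1):
    """addq.l #1,(8,a0,a1*4): one 4-byte instruction, read-modify-write."""
    ea = (a0 + a1 * 4 + 8) & MASK32
    mem[ea] = (mem.get(ea, 0) + 1) & MASK32

def riscv_sequence(mem, base, index):
    """Assumed equivalent RISC-V sequence; returns its code size in bytes."""
    t = ((index << 2) + base) & MASK32   # sh2add t,index,base   (4 bytes)
    v = mem.get((t + 8) & MASK32, 0)     # c.lw   v,8(t)         (2 bytes)
    v = (v + 1) & MASK32                 # c.addi v,1            (2 bytes)
    mem[(t + 8) & MASK32] = v            # c.sw   v,8(t)         (2 bytes)
    return 4 + 2 + 2 + 2

target = 0x1000 + 3 * 4 + 8
mem_68k = {target: 41}
mem_rv = {target: 41}
m68k_addq_mem(mem_68k, 0x1000, 3)
rv_bytes = riscv_sequence(mem_rv, 0x1000, 3)
print(mem_68k[target], mem_rv[target], rv_bytes)
```

Each step of the RISC-V sequence depends on the one before it, and it needs two scratch registers that the 68k instruction does not.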

Indeed. And it needs instruction fusion to improve the performance.

BTW, I can't beat the 68k in this case even with my new architecture: the instruction requires a couple more bytes (but it offers a wider range for the add/sub constant).
Quote:
The one place you are wrong above is that RISC-V is a failure despite handicaps. RISC propaganda, lack of competition, open hardware, etc. seems to guarantee a niche.

https://news.ycombinator.com/item?id=40209278 Quote:

> millions of embedded units already shipped

10+ billion. With billions added every year.

THead says they've shipped several billion C906 and C910 cores, and those are 64 bit Linux applications cores, almost all of them with draft 0.7.1 of the Vector extension. The number of 32 bit microcontroller cores will be far higher (as it is with Arm).

With 10+ billion RISC-V CPU cores shipped, RISC-V survived and proliferates. Failure would be PPC AmigaNOne at less than 5,000 units shipped. Far more FPGA Commodore C64 Ultimate and Apollo hardware have been produced.

Yes, they're shipping billions of RISC-V CPUs, but before I was purely referring to the ISA design.

Last edited by cdimauro on 20-Jun-2026 at 08:22 AM.

Status: Offline

cdimauro

Re: The (Microprocessors) Code Density Hangout
Posted on 20-Jun-2026 3:45:17

[ #474 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4648
From: Germany

@MEGA_RJ_MICAL

Quote:

MEGA_RJ_MICAL wrote:
OUR SOON TO BE MASTERS HAVE SPOKEN!!!!!!!!!!

Quote:

I've read pages of this discussion, and I think everyone is arguing past each other.

One side is arguing for preserving the elegance and code density of the 68k ISA. The other is arguing for redesigning it to be easier to implement efficiently on modern hardware. Those are different optimization goals. Neither side has disproved the other because neither side is solving the same problem.

The problem is that we've spent hundreds of posts debating a processor architecture that has no realistic path to becoming a mass-market ISA. The practical value of deciding whether an effective-address field should be six bits or seven bits is essentially zero.

If this thread has demonstrated anything, it's that both designs involve trade-offs. Better code density costs something. Simpler decode costs something. More compatibility costs something. More aggressive redesign costs something. That's engineering, not revelation.

At this point, no amount of additional back-and-forth is going to produce a definitive answer because there isn't one. There are only different priorities.

So perhaps the healthiest conclusion is this:

* If your priority is preserving the spirit of the 68k, you'll probably agree with Cdimauro.

Wrong.

@simplex: good points.

Status: Offline

MEGA_RJ_MICAL

Re: The (Microprocessors) Code Density Hangout
Posted on 20-Jun-2026 4:08:04

[ #475 ]

Super Member

Joined: 13-Dec-2019
Posts: 1471
From: AMIGAWORLD.NET WAS ORIGINALLY FOUNDED BY DAVID DOYLE

@cdimauro

Quote:

cdimauro wrote:
@MEGA_RJ_MICAL

Quote:

MEGA_RJ_MICAL wrote:
OUR SOON TO BE MASTERS HAVE SPOKEN!!!!!!!!!!

[quote]
I've read pages of this discussion, and I think everyone is arguing past each other.

One side is arguing for preserving the elegance and code density of the 68k ISA. The other is arguing for redesigning it to be easier to implement efficiently on modern hardware. Those are different optimization goals. Neither side has disproved the other because neither side is solving the same problem.

The problem is that we've spent hundreds of posts debating a processor architecture that has no realistic path to becoming a mass-market ISA. The practical value of deciding whether an effective-address field should be six bits or seven bits is essentially zero.

If this thread has demonstrated anything, it's that both designs involve trade-offs. Better code density costs something. Simpler decode costs something. More compatibility costs something. More aggressive redesign costs something. That's engineering, not revelation.

At this point, no amount of additional back-and-forth is going to produce a definitive answer because there isn't one. There are only different priorities.

So perhaps the healthiest conclusion is this:

* If your priority is preserving the spirit of the 68k, you'll probably agree with Cdimauro.

Wrong.
[/quote]

Silence, goblin.

Not a single user, on any of the many Amiga fora, not once,
has ever agreed with even one of the countless rigamaroles you have ever spouted.

Presumptuous lecturer, unwanted, unwelcome,
universally ridiculed.

Go away, I command thee.

MRJM
Master Exorcist, PHD

_________________
I HAVE ABS OF STEEL
--
CAN YOU SEE ME? CAN YOU HEAR ME? OK FOR WORK

Status: Offline

bhabbott

Re: The (Microprocessors) Code Density Hangout
Posted on 20-Jun-2026 7:36:14

[ #476 ]

Cult Member

Joined: 6-Jun-2018
Posts: 594
From: Aotearoa

@cdimauro

Quote:

cdimauro wrote:

Yes, they're shipping billions of RISC-V CPUs, but before I was purely referring to the ISA design.

Proof that RISC-V met its design goal, which wasn't to create an ISA with faster and/or more compact code than the competition.

Status: Offline

cdimauro

Re: The (Microprocessors) Code Density Hangout
Posted on 20-Jun-2026 7:45:34

[ #477 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4648
From: Germany

@bhabbott

Quote:

bhabbott wrote:
@cdimauro

Quote:

cdimauro wrote:

Yes, they're shipping billions of RISC-V CPUs, but before I was purely referring to the ISA design.

Proof that RISC-V met its design goal, which wasn't to create an ISA with faster and/or more compact code than the competition.

Well, they wanted to. But you don't know that because you haven't followed the project since the beginning or watched the videos of the talks that they gave at conferences.

I've to say that it's really funny to watch the oldest ones, because history proved that all finger pointing that they did against x86 and ARM now can be easily applied to RISC-V as well.

And I greatly LOL when I watch the videos where they show that the instructions of the ISA fit on just a single sheet (nowadays there are more than six hundred).

Status: Offline

bhabbott

Re: The (Microprocessors) Code Density Hangout
Posted on 24-Jun-2026 22:15:13

[ #478 ]

Cult Member

Joined: 6-Jun-2018
Posts: 594
From: Aotearoa

@cdimauro

Quote:

cdimauro wrote:

Well, they wanted to. But you don't know that because you haven't followed the project since the beginning or watched the videos of the talks that they gave at conferences.

About RISC-V International
Quote:
The worldwide interest in RISC-V is not because it is a great new chip technology, the interest is because it is a global open standard to which software can be ported, and which allows anyone to freely develop their own hardware to run the software. RISC-V International does not manage or make available any open RISC-V implementations, only the standard specifications. RISC-V software is managed by the respective open source software projects.

An open standard isn't necessarily the best technically, it's just different enough to avoid IP issues. RISC-V wasn't claiming to be a radical new design that would have higher performance. The emphasis was on openness, and a clean design free of baggage and easily extendable - which is why it attracted so much interest. Of course they wanted it to be compact and fast too, who wouldn't?

I didn't follow RISC-V developments because I figured it was unlikely to be a significant improvement over existing designs, which have had many years of tweaking for best performance. RISC-V would be lucky to match them, let alone trounce them. As time went by it became painfully obvious that I was right. But that doesn't matter so much when openness is the draw card.

Reminds me of Microchip putting MIPS in their PIC32MX MCUs in 2007. Real-world performance wasn't as good as you might expect. Now Microchip is putting ARM cores in PIC32 MCUs, which I guess is an admission that MIPS wasn't cutting it. The main attraction of MIPS was that universities taught it, but 99.9% of code is written in C so this is moot. The good thing was that they used similar peripherals to other PICs which I was familiar with. I found them much easier to get working than the popular STM32 ARM chips.

Status: Offline

matthey

Re: The (Microprocessors) Code Density Hangout
Posted on 25-Jun-2026 23:08:06

[ #479 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2925
From: Kansas

bhabbott Quote:

An open standard isn't necessarily the best technically, it's just different enough to avoid IP issues. RISC-V wasn't claiming to be a radical new design that would have higher performance. The emphasis was on openness, and a clean design free of baggage and easily extendable - which is why it attracted so much interest. Of course they wanted it to be compact and fast too, who wouldn't?

If an open RISC ISA was the primary goal, why not use earlier open RISC ISAs?

2000 OpenRISC
2005 OpenSPARC
2013 OpenPOWER
2014 RISC-V

What differentiates RISC-V from these other open RISC ISAs?

1. VLE with claimed good code density, reserved encoding space for future and research behind it
2. claimed to be higher tech RISC but not much new besides compare and branch instructions
3. Berkeley's David Patterson (SPARC line) and Stanford's John Hennessy (MIPS line) support

RISC-V is cleaned up RISC with mistakes like branch delay slots removed, but most other RISC ISAs have long since switched to branch prediction or never used delay slots, as is the case for OpenPOWER. Smarter use of the encoding space, with much of it reserved for the future and for custom instructions, sounded good in theory. RISC-V supports 16-bit and 32-bit VLE instructions and there is a plan to support at least 48-bit and 64-bit instructions. These longer encodings become more likely as much of the encoding space is used up by 16-bit instructions for the embedded extensions that try to make code density more competitive. The problem is that RVC was not competitive enough, partially due to the goal of keeping instructions and addressing modes simple.

The Case for the Reduced Instruction Set Computer
David Patterson Quote:

Code Density. With early computers, memory was very expensive. It was therefore cost effective to have very compact programs. Complex instruction sets are often heralded for their "supposed" code compaction. Attempting to obtain code density by increasing the complexity of the instruction set is often a double-edged sword however, as more instructions and addressing modes require more bits to represent them. Evidence suggests that code compaction can be as easily achieved merely by cleaning up the original instruction set. While code compaction is important, the cost of 10% more memory is often far cheaper than the cost of squeezing 10% out of the CPU by architectural "innovations." Cost for a large scale cpu is in additional circuit packages needed while cost for a single chip cpu is more likely to be in slowing down performance due to larger (hence slower) control PLA's.

RISC-V, like MIPS, has fewer addressing modes than OpenSPARC or OpenPOWER meaning more dependent instructions are needed to perform memory accesses which is bad for not only code density but performance. About 25% of instructions are loads and 10% stores so RISC-V is at a major disadvantage for more than 1/3 of instructions, where RISC is already at a disadvantage when accessing memory.
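The "more than 1/3" figure is just 25% + 10% = 35%. A rough sketch of what that share implies for instruction counts, assuming some fraction of memory accesses needs one extra address arithmetic instruction; that fraction (`extra_fraction`) is a made-up input for illustration, not a measured value.

```python
LOAD_SHARE = 0.25    # share of instructions that are loads (from the post)
STORE_SHARE = 0.10   # share of instructions that are stores (from the post)

def instruction_inflation(extra_fraction, extra_per_access=1):
    """Relative dynamic instruction count when `extra_fraction` of memory
    accesses each need `extra_per_access` additional address instructions."""
    memory_share = LOAD_SHARE + STORE_SHARE
    return 1.0 + memory_share * extra_fraction * extra_per_access

memory_share = LOAD_SHARE + STORE_SHARE
# Hypothetical: 1 in 5 memory accesses needs one extra instruction.
inflation = instruction_inflation(0.2)
print(round(memory_share, 2), round(inflation, 3))
```

Even a modest assumed fraction adds several percent to the instruction stream, and the added instructions sit directly in front of the memory access that depends on them.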

The Case for the Reduced Instruction Set Computer
David Patterson Quote:

Speed. The ultimate test for cost-effectiveness is the speed at which an implementation executes a given algorithm. Better use of chip area and availability of newer technology through reduced debugging time contribute to the speed of the chip. A RISC potentially gains in speed merely from a simpler design. Taking out a single address mode or instruction may lead to a less complicated control structure. This in turn can lead to smaller control PLA's, smaller microcode memories, fewer gates in the critical path of the machine; all of these can lead to a faster minor cycle time. If leaving out an instruction or address mode causes the machine to speed up the minor cycle by 10%, then the addition would have to speed up the machine by more than 10% to be cost-effective. So far, we have seen little hard evidence that complicated instruction sets are cost-effective in this manner.

This is the RISC philosophy on performance which David doubles down on with RISC-V. The result is that RISC-V is not competitive in code density or performance. Simple is really only useful for tiny cheap MCUs but a 6502 MCU is smaller and cheaper. Open and customizable is a draw for RISC-V but these advantages only go so far with performance and code density not being competitive. It is not easy to add more addressing modes which would be against the original RISC philosophy anyway. RISC-V developers are left with adding new instructions to improve competitiveness. With 16-bit instructions to improve code density, 48-bit instructions are likely on the way.

https://www.reddit.com/r/RISCV/comments/zrpi3m/why_48bit_instructions/

Nobody in this thread responded with the fact that 32-bit immediates and displacements can be used with 48-bit instructions. ColdFire and NanoMIPS use 16-bit, 32-bit and 48-bit VLE instruction sizes with NanoMIPS info saying "48b instructions efficiently encode 32b constants" while claiming better code density than ARM Thumb-2 from info on the 1st page of this thread. RISC-V developers studied VLEs and developed RVC with the complexity of a VLE, also against the RISC philosophy, but did not understand all the advantages or they may have added 48-bit instructions to begin with. Maybe they wanted to be simpler or maybe they copied other handicapped compact RISC ISAs instead of looking at VLE CISC ISAs, many of which get this right. RISC-V's new embedded instructions introduced the complexity of load/store multiple reg instructions, push/pop return instructions and shift and add instructions which are performing complex instructions like David complained about CISC microcode performing. It was clever to put RISC in the RISC-V name so it can not be removed but simple address modes do not fool me. RISC is dead but the propaganda and hype live on.
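The point about 32-bit immediates can be illustrated with the standard RV32 way of building a 32-bit constant from two 32-bit instructions, lui plus addi, compared against a single 48-bit instruction. This is a sketch: the lui/addi split below is the usual technique, while the 48-bit layout (16 bits of opcode and register fields plus a 32-bit immediate) is a hypothetical encoding used only for the size comparison.

```python
MASK32 = 0xFFFFFFFF

def split_lui_addi(value):
    """Split a 32-bit constant into (upper 20 bits for lui, signed 12 bits for addi)."""
    value &= MASK32
    lo = value & 0xFFF
    if lo >= 0x800:
        lo -= 0x1000                    # addi sign-extends its 12-bit immediate
    hi = ((value - lo) >> 12) & 0xFFFFF  # lui part compensates for a negative lo
    return hi, lo

def rebuild(hi, lo):
    """What lui followed by addi leaves in a 32-bit register."""
    return ((hi << 12) + lo) & MASK32

LUI_ADDI_BYTES = 4 + 4        # two 32-bit instructions
HYPOTHETICAL_48BIT_BYTES = 6  # one instruction: 16 bits + 32-bit immediate

samples = [0, 0x7FF, 0x800, 0xDEADBEEF, 0xFFFFFFFF]
for v in samples:
    hi, lo = split_lui_addi(v)
    print(f"{v:#010x} -> lui {hi:#07x}, addi {lo}")
```

Under these assumptions, the 48-bit form saves 2 bytes and one dependent instruction per 32-bit constant.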

bhabbott Quote:

I didn't follow RISC-V developments because I figured it was unlikely to be a significant improvement over existing designs, which have had many years of tweaking for best performance. RISC-V would be lucky to match them, let alone trounce them. As time went by it became painfully obvious that I was right. But that doesn't matter so much when openness is the draw card.

A Redone Instruction Set Computer can avoid past RISC mistakes. Will there be anything remaining of the RISC philosophy except for a load/store architecture though?

bhabbott Quote:

Reminds me of Microchip putting MIPS in their PIC32MX MCUs in 2007. Real-world performance wasn't as good as you might expect. Now Microchip is putting ARM cores in PIC32 MCUs, which I guess is an admission that MIPS wasn't cutting it. The main attraction of MIPS was that universities taught it, but 99.9% of code is written in C so this is moot. The good thing was that they used similar peripherals to other PICs which I was familiar with. I found them much easier to get working than the popular STM32 ARM chips.

MIPS is a simple and clean RISC ISA, except for branch delay slots and early lack of hardware interlocks, which is easy to learn. The many instructions needed make it tedious to program in assembly though. RISC-V developers seem to prefer MIPS instead of SPARC as the inspiration and ideal RISC predecessor. RISC-V developers thought a VLE from inception was all the complexity they needed to add to MIPS to be competitive. They were wrong and created another academic RISC ISA for schools. Maybe there is enough of a niche for open and customizable cores to survive but being good at performance or code density is safer and RVC is closer to competing for code density.

performance
1. x86-64
2. AArch64
3. RVC

code density
1. Thumb
2. RVC
3. AArch64

PPC was #2 in performance before AArch64 knocked it down to #3, killing it, and being uncompetitive in code density did not help. RVC is not in a safe position either. For example, the 68k would be #1 or #2 in code density if alive, and I believe #3 in performance is a low bar with RVC in that position.

Last edited by matthey on 27-Jun-2026 at 11:00 PM.
Last edited by matthey on 26-Jun-2026 at 01:52 PM.
Last edited by matthey on 26-Jun-2026 at 12:23 AM.

Status: Offline

kolla

Re: The (Microprocessors) Code Density Hangout
Posted on 25-Jun-2026 23:22:02

[ #480 ]

Elite Member

Joined: 20-Aug-2003
Posts: 3593
From: Trondheim, Norway

@matthey

Quote:
If an open RISC ISA was the primary goal, why not use earlier open RISC ISAs?

Licensing for one, there are many levels of "open".

_________________
B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC

Status: Offline

Amigaworld.net was originally founded by David Doyle