PosterThread
MEGA_RJ_MICAL 
Re: The (Microprocessors) Code Density Hangout
Posted on 14-May-2026 8:24:44
#461 ]
Super Member
Joined: 13-Dec-2019
Posts: 1428
From: AMIGAWORLD.NET WAS ORIGINALLY FOUNDED BY DAVID DOYLE

cdiZORRAM

_________________
I HAVE ABS OF STEEL
--
CAN YOU SEE ME? CAN YOU HEAR ME? OK FOR WORK

matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 16-May-2026 6:35:38
#462 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2868
From: Kansas

cdimauro Quote:

Added another paper on the Literature post.

Enhanced code density of embedded CISC processors with echo technology


I have not read that particular paper before, but there are many code compression techniques I would put in the same category as "echo technology": they try to exploit the repetition of instructions in code to compress it. Without additional hardware, compilers can reduce code size by sharing more code through more functions and branches, but most compilers have moved in the opposite direction, toward function inlining for performance.

Improving Program Efficiency by Packing Instructions into Registers
https://pages.cs.wisc.edu/~isca2005/papers/05A-01.PDF Quote:

Due to the increasing pervasive use of embedded systems, there has been a significant amount of recent work on compressing code. Some early work on code compression used a compiler optimization called procedural abstraction to reduce code size. Procedural abstraction can be viewed as the opposite of function inlining. Common code sequences are abstracted into routines, and the original sites of each sequence are converted into calls. Subsequent work involved register renaming to abstract more code segments. The main disadvantage of procedural abstraction is that the program typically becomes slower due to the overhead of executing call and return instructions for each abstracted code segment.


I am a proponent of using additional hardware in a more traditional and general purpose way to reduce function call and branch overhead to a minimum, thus reducing the need for function inlining and loop unrolling. While the 17%-20% code size reduction for IA32 using ET may sound good, software only "procedural abstraction achieves 5%-7%". Echo tech has several drawbacks: it is only useful when optimizing for size, or with profiled programs so that performance critical code can be avoided; the number of instructions in the instruction stream is increased; BTB pressure is increased (echo targets are stored in the BTB); and additional hardware is required to achieve the maximum claimed gains. The difference between optimizing for size and optimizing for performance puts the compression in perspective too.
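To make the size versus speed tradeoff of procedural abstraction concrete, here is a minimal toy sketch. The instruction strings, the `call shared` marker and the assumption that a call and a return each cost one instruction are all made up for illustration; sizes are counted in instructions, not bytes, and no real compiler or ISA is modeled.

```python
def abstract_sequence(program, seq, call="call shared"):
    """Replace each non-overlapping occurrence of seq with one call instruction."""
    out, i, n = [], 0, len(seq)
    while i < len(program):
        if program[i:i + n] == seq:
            out.append(call)          # the repeated run collapses to a single call
            i += n
        else:
            out.append(program[i])
            i += 1
    routine = list(seq) + ["ret"]     # shared copy of the run plus a return
    return out, routine

def static_size(main, routine):
    """Instructions stored in memory: main code plus the shared routine."""
    return len(main) + len(routine)

def dynamic_count(main, routine, call="call shared"):
    """Instructions executed: each call runs the call itself plus the whole routine."""
    return sum(1 + len(routine) if ins == call else 1 for ins in main)

# Hypothetical straight-line program with one 3-instruction run repeated 3 times.
shared = ["move a", "add b", "store c"]
program = shared + ["mul d"] + shared + ["sub e"] + shared

main, routine = abstract_sequence(program, shared)
print(len(program), static_size(main, routine), dynamic_count(main, routine))
# prints: 11 9 17
```

In this toy case the stored code shrinks from 11 to 9 instructions while the executed count grows from 11 to 17, which is the call and return overhead the quoted paper describes.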

Google AI Quote:

1. Function Inlining
o Code Size Impact: +10% to +20% (on average for standard C/C++ builds).
o Extreme Cases: Can balloon up to several hundred percent if massive functions with deep call hierarchies are inlined across hundreds of different files.
o The Trade-Off: Inlining eliminates the function-call overhead (arguments, stack, and branch instructions). However, copying function bodies causes binary bloat, which risks overflowing the L1/L2 Instruction Caches and causing cache misses that actually degrade performance.

2. Loop Unrolling
o Code Size Impact: +10% to +50% (depending on unrolling factors like 4x, 8x, etc.)
o Extreme Cases: Can multiply the size of specific hot loops by exactly the unrolling factor (e.g., unrolling a small loop 100 times multiplies the size of that code block by 100).
o The Trade-Off: Unrolling removes loop-termination checks and branches, making code run faster by exploiting instruction-level parallelism. Like inlining, excessive unrolling thrashes the cache.


OOP, memory alignment/padding and (auto)vectorization can have code bloat synergies as well. I would rather have hardware compression that is more general purpose and scales better for both performance and reduced code size. This is what the 68k already starts with, a well designed VLE, and it can be improved.

Improving Program Efficiency by Packing Instructions into Registers
https://pages.cs.wisc.edu/~isca2005/papers/05A-01.PDF Quote:

Table 4. Comparison of Existing Techniques

Legend: + means that improvement is < 10%. ++ means that improvement is ≥ 10%. 0 means that there is very little to no effect. ? means that results are speculative since they are not presented or explained in detail. – means that penalty is < 10%. – – means that penalty is ≥ 10%. Hardware complexity is scaled from 0 (no changes) to 3 (complete redesign).


"Arm/Thumb" is A32+Thumb, which is correctly trashed in the chart due to mode switching and the much increased number of instructions executed, both of which reduce performance and power savings. "Arm/Thumb/AX" is A32+Thumb-2, which is an acceptable VLE but still requires more instructions than many other ISAs and has reduced performance. IA32 was mentioned in your paper as having better performance than ARM/Thumb when optimized for size, which I found surprising. When optimizing for size, IA32 suffers from too many instructions and too many memory accesses as well. The 68k has better code density and does not suffer from either problem, judging from Vince Weaver's code density competition.

The instruction register file (IRF) code compression in the chart above is like programmable microcode from the compiler and only uses a 32 instruction register file.

Improving Program Efficiency by Packing Instructions into Registers
https://pages.cs.wisc.edu/~isca2005/papers/05A-01.PDF Quote:

One of the early approaches to reduce code size and the cost of fetching instructions was microcode. Each CISC or macro instruction fetched from memory caused a sequence of microinstructions to be fetched and executed, which provided a faster access time than main memory. Our proposed approach differs from microcode in several ways, including that specific instructions within the IRF can be individually referenced and that the instructions in the IRF can be changed for each executable.


"On average, about 66.51% of all instructions executed can be stored in a 32-entry IRF, assuming it can be loaded with the 32 most common instructions at the start of execution." There is additional context switching overhead, but handling it is not much more difficult than executing microcode, which even compressed RISC ARM and SuperH cores usually use too. The IRF includes an IMM immediate compression technique that the 68k may not benefit from as much if my idea of immediate compression using an addressing mode were used. Also, an article on the 68060 claims there is no microcode, as Gunnar claims for the AC68080. Maybe the IRF would be a better fit for x86-64 and AArch64 cores, which still use microcode on modern cores and lack good code density. It would be interesting to see if the 68060 instruction buffer could be turned into a "ZOLB/Loop Cache" as shown in the chart above, avoiding the repeated decoding overhead of short loops.

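The coverage statistic quoted from the IRF paper (the share of executed instructions that fall among the 32 most common ones) can be sketched as below. This is only an illustration of how such a figure could be measured, not the paper's method: the function name `irf_coverage` and the tiny trace are hypothetical, and real IRF packing also has to deal with encoding and immediates.

```python
from collections import Counter

def irf_coverage(trace, entries=32):
    """Fraction of executed instructions that match one of the `entries`
    most frequently executed instructions in the trace."""
    if not trace:
        return 0.0
    counts = Counter(trace)
    covered = sum(n for _, n in counts.most_common(entries))
    return covered / len(trace)

# Hypothetical dynamic trace: 4 distinct instructions, 100 executions in total.
trace = (["add r1,r2"] * 50 + ["lw r3,0(r4)"] * 30 +
         ["bne r1,r0"] * 15 + ["mul r5,r6"] * 5)

print(irf_coverage(trace, entries=2))   # the two hottest instructions cover 80/100
```

With a real trace the same function could be called with `entries=32` to compare against the 66.51% average the paper reports.
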
Multiple compression techniques are possible and sometimes beneficial. For example, the 68060 could potentially benefit from VLE compression, Procedural Abstraction and ZOLB/Loop Cache from the chart above. Some techniques are a better fit for certain ISAs and CPU core designs, and different techniques have different tradeoffs. There is another paper which has data with Echo Tech compression and PPC like CodePack compression, both forms of "Codewords" in the table above. CodePack provides more compression than Echo Tech, but the two combined compress more than either alone.

Reducing Code Size With Echo Instructions
https://dl.acm.org/doi/abs/10.1145/951710.951724 Quote:

7. SUMMARY

This paper examined code compression with echo instructions. Echo instructions are an executable form of code compression that uses the main instruction stream for the compression storage. Echo instructions execute subsequences of instructions from other locations in the instruction stream. Given a highly optimized binary, our results show that traditional software based procedural abstraction achieves a 94.3% compression ratio, while the use of echo instructions achieves a 84.5% compression ratio.

In addition, we evaluate the use of echo instructions with CodePack. CodePack achieved a 70.0% compression ratio on our optimized binaries, and CodePack with echo instructions resulted in a 63.2% compression ratio. Typically, combining compression algorithms does not result in additional savings, but we are applying two compression algorithms that operate at different granularities, so they compress different portions of the same data.


While CodePack has excellent compression, its disadvantages include significant hardware resources for embedded cores, including a 2kiB symbol table, likely in SRAM, which could have been used for caches instead; instructions in the I-Cache are not compressed (cache lines are decompressed from memory into the I-Cache); MCU execution from SRAM memory is not practical, which limits scaling; etc. IBM still found CodePack worthwhile to reduce the memory footprint and relieve the RISC instruction fetch bottleneck while reducing the instruction fetch energy used (I-Fetch used 36% of total processor power on a StrongARM according to the paper above, and Cast shows instruction supply using 42% of an embedded processor's energy consumption). Echo Tech and CodePack together resulted in a 63.2% compression ratio, but starting with the 68k is like starting with a 55% compression ratio instead, with more possible through enhancements. The 68k code in caches is compressed, reducing I-Cache misses, and even the 68060 can support a MCU using SRAM as memory. The following paper may have inspired IBM to create CodePack.
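A quick worked calculation helps when reading the ratios quoted from the echo instructions paper. It assumes "compression ratio" means compressed size divided by original size (lower is better), which is how the quoted summary appears to use the term; the "independent" product is only a reference point of my own, not a number from the paper.

```python
# Compression ratios quoted in the summary (compressed size / original size).
ratios = {
    "procedural abstraction": 0.943,
    "echo instructions": 0.845,
    "CodePack": 0.700,
    "CodePack + echo": 0.632,
}

def saving(ratio):
    """Code size reduction implied by a compression ratio."""
    return 1.0 - ratio

for name, ratio in ratios.items():
    print(f"{name}: {saving(ratio):.1%} smaller")

# If the two techniques removed completely independent redundancy, the
# combined ratio would be the product of the individual ratios.
independent = ratios["echo instructions"] * ratios["CodePack"]
print(f"independent estimate: {independent:.4f}, measured: {ratios['CodePack + echo']}")
```

The measured 63.2% is not as good as the 59.15% product, so under this assumption the two techniques partly compress the same redundancy, while still beating CodePack alone.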

Improving Code Density Using Compression Techniques
https://www.eecs.umich.edu/techreports/cse/97/CSE-TR-342-97.pdf Quote:

There are several ways that our compression method can be improved. First, the compiler could attempt to produce instructions with similar byte sequences so they could be more easily compressed. One way to accomplish this is by allocating registers so that common sequences of instructions use the same registers. Another way is to generate more generalized STDS code sequences. These would be less efficient, but would be semantically correct in a larger variety of circumstances. For example, in most optimizing compilers, the function prologue sequence might save only those registers which are modified within the body of the function. If the prologue sequence were standardized to always save all registers, then all instructions of the sequence could be compressed to a single codeword. This space saving optimization would decrease code size at the expense of execution time. Table 3 shows that the prologue and epilogue combined typically account for 12% of the program size, so this type of compression would provide significant size reduction.


A library of standard prologue and epilogue code could have reduced PPC program sizes by up to roughly 12% on average, and it could have gone into ROM, which is cheaper than SRAM. It is not as good as the 68k MOVEM with its register bitmap, but something similar is not possible with a fixed 32-bit encoding and 32 GP registers. PPC has inflexible load/store multiple instructions, with LMW/STMW using a register range from the given register to r31, which may have worked if they had made the instructions standard. IBM developed CodePack and Freescale developed VLE while LMW/STMW were rarely implemented in hardware. Poor load/store multiple implementation? Implementation worse than 12% prologue and epilogue bloat?
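The difference between the two register selection schemes can be sketched as follows. The helper names are hypothetical; the MOVEM mask layout shown is the one used for control and postincrement addressing modes (bit 0 = D0 up to bit 15 = A7; the predecrement mode reverses the bit order), and LMW/STMW always transfer from the named register through r31.

```python
M68K_REGS = [f"d{i}" for i in range(8)] + [f"a{i}" for i in range(8)]

def movem_mask(regs):
    """16-bit MOVEM register list mask: any subset of D0-D7/A0-A7 in one word.
    Bit 0 = D0 ... bit 15 = A7 (control/postincrement order)."""
    mask = 0
    for reg in regs:
        mask |= 1 << M68K_REGS.index(reg)
    return mask

def lmw_cover(needed):
    """Smallest PPC lmw/stmw range covering every needed GPR number.
    Returns (first register, extra registers transferred but not needed)."""
    first = min(needed)
    extra = [r for r in range(first, 32) if r not in needed]
    return first, extra

# MOVEM can save exactly D2, D3 and A2 with one mask word.
print(hex(movem_mask(["d2", "d3", "a2"])))   # 0x40c

# lmw/stmw needing r29 and r31 must start at r29 and also moves r30.
print(lmw_cover({29, 31}))                   # (29, [30])
```

This is why a compiler targeting LMW/STMW tends to allocate callee saved registers downward from r31, so that the range it needs is contiguous.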

PPC LMW/STMW would have made no difference in Vince Weaver's competition as it was not used or necessary due to unusual and minimal register saving.

http://deater.net/weave/vmwprod/asm/ll/ll.html
http://deater.net/weave/vmwprod/asm/ll/ll.ppc.s

The PowerPC Compiler Writer's Guide data shows significant LMW/STMW use for SPEC92 Benchmarks when available.

The PowerPC Compiler Writer's Guide
https://cr.yp.to/2005-590/powerpc-cwg.pdf Quote:

instruction | int num of executions | int % of total | fp num of executions | fp % of total
lmw | 45073238 | 0.542% | 14666169 | 0.026%
stmw | 50087129 | 0.602% | 14868701 | 0.026%


Figure B-3 on page 177 appears to show hardware support for LMW/STMW on the PPC 601, 603e and 604, so the most common early PPC CPUs had support in hardware. I could not find anywhere in The PowerPC Compiler Writer's Guide that discourages using them, but I have seen their use discouraged in other sources. Ironically, it appears that castrated embedded support led to their use being discouraged even though they are most useful for the embedded market.

Which PowerPC cores lack hardware support for LMW and STMW instructions?
Google AI Quote:

In the PowerPC ecosystem, the Freescale (now NXP) e500 core family (such as the e500v1, e500v2, e500mc, and e5500) natively lacks hardware support for LMW (Load Multiple Word) and STMW (Store Multiple Word) instructions.


The e500v2 CPU cores are used in the A1222 embedded SoC and e5500 CPU cores in the X5000 embedded SoC. The embedded PPC SoC used in the A1222 not only has a castrated non-standard PPC FPU but has also been castrated in other ways that arguably make it less suitable for embedded use. I guess 12% smaller code size on average would not have been good for the PPC embedded market and much more complex and resource using IBM CodePack and incompatible Freescale/NXP PPC VLE are higher end embedded options only. It makes as much sense as pushing PPC for the low end embedded market while the 68k was still #1 in 32-bit volume embedded market sales. Houston, we have a problem. The 68k was grounded by management and PPC was too fat to lift off in the embedded market so ARM became the embedded king replacement despite ARM A32 and Thumb ISAs being poor. Thumb-2's 16-bit VLE, like the 68k but still inferior, finally allowed ARM to reach orbit though.

Last edited by matthey on 16-May-2026 at 02:56 PM.

matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 24-May-2026 6:53:54
#463 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2868
From: Kansas

Inside the Motorola 68060 & Chip Design: Interview with Lead Designer Joe Circello!
https://www.youtube.com/watch?v=1takr2k7Yfo

Chief architect of the 68060 Joe Circello has connections with the Amiga through friends and was interviewed in an Amiga Bill YouTube video. Joe wears a RISC-V shirt while talking about the CISC vs RISC debate and code density. Joe verifies that the 68060 is fully synthesizable even though custom blocks were used, his example being parts of the cache. I had suspected the 68060 was fully synthesizable, with the repetitive and time critical logic of the caches and the MMU using custom blocks. He verifies the 68060 uses no microcode and laughs when asked about an x86 core that does not use microcode, calling x86 more complicated than the more orthogonal 68k. GHz frequency 68060s in modern silicon were mentioned multiple times. He talks about finding the "RTL" in an old Motorola database to recover the 68060, and the fully synthesizable core sounded like it would allow an easier transition to new silicon if everything necessary could be found and recovered.

https://en.wikipedia.org/wiki/Register-transfer_level

The use of RTL in Verilog is surprisingly modern for a CPU design Joe says started circa 1988 outside of Motorola. The 68060 with a fully synthesizable and fully static synchronous CMOS design using RTL in Verilog should be like most modern designs and unlike 486, Pentium, Pentium Pro, MIPS R4000, DEC Alpha 21064/21164 and early ARM designs which used more difficult to work with and modernize dynamic/domino logic.

https://en.wikipedia.org/wiki/Domino_logic
https://en.wikipedia.org/wiki/Sequential_logic

Timing becomes easier when moving to a smaller process and auto layout tools should be better at placement and routing. It may be possible to move the whole 68060 to a more modern process using auto layout, as the ColdFire did. It sounds like the biggest difficulty is acquiring a license or rights and finding and recovering the RTL source. Unfortunately, preservation does not sound like a priority for the 68k/68060 at NXP/Freescale.

A couple of code density quotes from Joe follow.

"So I have this shirt on RISC-V and I would say that I think they are painfully relearning the lessons that we all, or at least some of us went through, you know 30, 35 years ago. So a lot of these things were major topics, the source of major activity both in academia as well as well as in industry and I think a lot of those lessons were conveniently lost and we are now in the process of relearning them, right, where code density and code size is become a really big deal in RISC-V's environment and what can you do to manage that."

"I mean there RISC-V, I've been fairly heavily involved in that over the last number of years, and there's a lot of things they did right in terms of the architectural definition. Having said that, there's still a number of challenges. I think people thought they could kind of just waltz in with this new ISA and say, 'Here it is. We've come down from the mountain and here's the tablets, stone tablets, and this is what we need to do.' But things aren't quite nearly that simple. I mean we were looking at using RISC-V cores and we were using them in various embedded applications and they were great substitutions in some areas and in other areas, because of things like code density and code growth and things like that, well, the answer wasn't quite as clear."

Last edited by matthey on 26-May-2026 at 12:16 AM.
Last edited by matthey on 24-May-2026 at 11:59 AM.
Last edited by matthey on 24-May-2026 at 11:53 AM.

Copyright (C) 2000 - 2019 Amigaworld.net.
Amigaworld.net was originally founded by David Doyle