minator
Re: The (Microprocessors) Code Density Hangout
Posted on 27-Jun-2025 0:32:01 [ #301 ]
Super Member
Joined: 23-Mar-2004 | Posts: 1046 | From: Cambridge

@matthey
Quote:
| The 2-way superscalar 32-bit 68060 CPU uses ~2.5 million transistors while the lowest end 64-bit 2-way superscalar Cortex-A53 core uses ~12.5 million transistors. A 32-bit in-order 2-way Cortex-A7 core predecessor uses ~10 million transistors so a 64-bit equivalent Cortex-A53 core uses ~25% more transistors. The 64-bit tax applies to more than just memory. |
A lot can change over 17 years. Also, the A53 effectively implements 2 different instruction sets, and that impacts the entire processor.
Wouldn't it be better to compare similar processors from the same time? All of these are 2-way superscalar:
32-bit:
- 1993 Pentium P5, 66 MHz (2× 8K caches): 3.1 million transistors
- 1994 68060, 50 MHz (2× 8K caches): 2.5 million transistors
- 1994 PA-7200, 120 MHz (1× 2K assist cache): 1.3 million transistors
- 1994 Pentium P54, 100 MHz (2× 8K caches): 3.2 million transistors

64-bit:
- 1992 Alpha 21064 (EV4S), 200 MHz (2× 8K caches): 1.68 million transistors
- 1991 MIPS R4000, 100 MHz (2× 8K caches): 1.35 million transistors
- 1992 MIPS R4400, 250 MHz (2× 16K caches): 2.2 million transistors
The 64-bit tax doesn't seem too high. Caches can add a huge number of transistors, as the R4400 number shows.
There is a CISC tax, though. CISC cores have more logic transistors, are far more complex to design, and are slower. There's a reason the industry gave up on CISC.
Last edited by minator on 27-Jun-2025 at 12:36 AM.
_________________
Whyzzat?
matthey
Re: The (Microprocessors) Code Density Hangout
Posted on 28-Jun-2025 1:39:54 [ #302 ]
Elite Member
Joined: 14-Mar-2007 | Posts: 2828 | From: Kansas

minator Quote:
A lot can change over 17 years. Also, the A53 effectively implements 2 different instruction sets, and that impacts the entire processor.
|
The Cortex-A53 supports at least 4 ISAs:

1. ARM (original)
2. Thumb
3. Thumb-2
4. AArch64

New Cortex-A cores support only #4, and Cortex-M cores support #2 and #3.
minator Quote:
Wouldn't it be better to compare similar processors from the same time? All of these are 2-way superscalar:

32-bit:
- 1993 Pentium P5, 66 MHz (2× 8K caches): 3.1 million transistors
- 1994 68060, 50 MHz (2× 8K caches): 2.5 million transistors
- 1994 PA-7200, 120 MHz (1× 2K assist cache): 1.3 million transistors
|
The PA-7200 is superscalar, but its 2 kiB on-chip assist cache and off-chip L1, combined with the 2nd-worst RISC code density after Alpha, are grossly inadequate for instruction supply. The PA-7200 is a good example of ignoring the RISC instruction-supply bottleneck. The design would have been better left scalar, with the transistors wasted on superscalar hardware reallocated to at least an 8 kiB on-chip L1 instruction cache. The PA-7200 uses a 5-stage pipeline and lacks dynamic branch prediction, reducing its transistor count compared to the 8-stage 68060. The shorter pipeline also should have limited its maximum clock speed compared to the 68060, which should eventually have been clocked at around 150 MHz.
minator Quote:
- 1994 Pentium P54, 100 MHz (2× 8K caches): 3.2 million transistors

64-bit:
- 1992 Alpha 21064 (EV4S), 200 MHz (2× 8K caches): 1.68 million transistors
|
The Alpha 21064 is a professional-quality 7-stage superscalar design, handicapped only by the Alpha code density and extreme simplicity. Its 8 kiB instruction cache has the performance of a ~2 kiB 68060 instruction cache and would need to grow to ~32 kiB to match the performance of the 68060's 8 kiB instruction cache, according to RISC-V research.
The RISC-V Compressed Instruction Set Manual, Version 1.7 https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-157.pdf Quote:
The philosophy of RVC is to reduce code size for embedded applications and to improve performance and energy-efficiency for all applications due to fewer misses in the instruction cache. Waterman shows that RVC fetches 25%-30% fewer instruction bits, which reduces instruction cache misses by 20%-25%, or roughly the same performance impact as doubling the instruction cache size.
|
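The quoted relationship (fetching fewer instruction bits has roughly the effect of doubling the instruction cache) can be illustrated with a rough, commonly used power-law model of cache miss rates. The exponent is an assumption for illustration, not a number from the paper:

```python
# Rough rule of thumb (an assumption, not a figure from the RVC paper):
# cache miss rate often scales as size**-0.5 (a power law), so doubling
# the instruction cache cuts misses by about 29% -- the same ballpark
# as the 20%-25% reduction quoted for fetching ~30% fewer bits.
def relative_misses(cache_scale, exponent=0.5):
    """Miss rate relative to a baseline cache scaled by cache_scale."""
    return cache_scale ** -exponent

print(f"{1 - relative_misses(2):.0%} fewer misses after doubling")  # 29%
```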
The small logic advantage that simple RISC cores gained by eliminating standard CISC features is gone once the instruction caches are enlarged to compensate for their poor instruction-cache performance. A good example was the PPC603.
core | pipeline | L1 caches | transistors
PPC603 | 4-stage | 8 kiB/8 kiB | 1.6 million
PPC603e | 4-stage | 16 kiB/16 kiB | 2.6 million
The PPC strategy was to use limited OoO execution to increase the performance of shallow pipelines, which do not need dynamic branch prediction, thus saving transistors. The poor code density and much-increased memory traffic resulted in the caches of the PPC603 and PPC604 being doubled, sabotaging the cost advantage of PPC. Even worse, the shallow pipelines could not be clocked up without expensive die shrinks, which also came with doubled caches for the PPC603e and PPC604e. The large 32 kiB I+D caches likely reduced the max clock speed of the PPC604e, as the PPC604 had a higher clock speed. PPC stagnated until Moore's Law allowed L2 caches on chip, giving PPC a second wind, but the cache-hog problem remained, and it grows as the amount of cache holding instructions increases.
minator Quote:
- 1991 MIPS R4000, 100 MHz (2× 8K caches): 1.35 million transistors
- 1992 MIPS R4400, 250 MHz (2× 16K caches): 2.2 million transistors
|
These MIPS cores are scalar 64-bit cores with an 8-stage pipeline. Transistor-count differences between scalar and superscalar cores are often larger than those between 32-bit and 64-bit cores. The scalar version of the 68060 was never released, but the similar Cyrix design was.

core | pipeline | L1 cache | transistors
Cyrix 5x86 | 6-stage | 16 kiB | 2.0 million (scalar)
Cyrix 6x86 | 7-stage | 16 kiB | 3.0 million (superscalar)

While I cannot confirm the above Cyrix transistor counts, Cyrix literature mentions the following, confirming the cost of superscalar designs compared to scalar designs.
Cyrix 5x86: Fifth-Generation Design Emphasizes Maximum Performance While Minimizing Transistor Count https://dosdays.co.uk/media/cyrix/5x86/5X-DPAPR.PDF Quote:
5x86 Architecture
The increased complexity, transistor count, and power consumption of superscalar designs led Cyrix engineers to re-examine the benefits of the superscalar approach. Clearly the power dissipated in a second execution pipeline plus the added power dissipated in the control logic to oversee two execution pipelines should be minimal to achieve performance that will justify the transistors added. Analysis has shown that the increased complexity of two execution pipelines can cost 40% in transistor count while providing an increase of less than 20% in instructions-per-clock performance.
|
A scalar version of the 68060 would use 1.5 million transistors if it used 40% fewer transistors than the 68060 (the scalar 68040 used 1.17 million transistors). The MIPS R4000 is more primitive than the Cyrix 5x86 design (and a hypothetical scalar 68060 design). The R4000 pipeline was stretched from the R3000's 5 stages to 8 stages based on the common naive RISC assumptions about performance. The 64-bit core and extreme clock speeds were better for marketing than for performance. I have commented before about deeper RISC pipelines increasing stalls, and the same is true here, with the architects practically ignoring the problems. Both load-to-use and branch misprediction stalls were increased.
https://blog.jyotiprakash.org/delving-deeper-into-the-mips-pipeline Quote:
Load Delays
In the R4000 pipeline, load delays are increased to 2 cycles because the data value becomes available at the end of the DS stage. The following figures show the pipeline schedule when a use immediately follows a load, indicating that forwarding is required to access the result of a load instruction in subsequent cycles.
|
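The quoted 2-cycle load delay can be sketched with a toy in-order issue model. This is a sketch under assumed single-issue, one-instruction-per-cycle behavior, not actual R4000 microarchitecture:

```python
# Toy in-order pipeline model: a load's result becomes available
# LOAD_DELAY cycles after the load issues (the R4000's 2-cycle load
# delay), so a dependent instruction issued sooner must stall.
LOAD_DELAY = 2  # cycles between a load and the first safe use

def stall_cycles(program):
    """program: list of (name, reads, writes), issued one per cycle."""
    ready = {}   # register -> cycle its value becomes available
    cycle = 0
    stalls = 0
    for name, reads, writes in program:
        # wait until every source register is available
        earliest = max([ready.get(r, cycle) for r in reads], default=cycle)
        stalls += max(0, earliest - cycle)
        cycle = max(cycle, earliest) + 1
        for r in writes:
            # only loads deliver their result late
            delay = LOAD_DELAY if name.startswith("lw") else 0
            ready[r] = cycle + delay
    return stalls

# lw followed immediately by a use: 2 stall cycles
back_to_back = [("lw", [], ["t0"]), ("add", ["t0"], ["t1"])]
# two independent instructions fill the delay: no stalls
scheduled = [("lw", [], ["t0"]),
             ("or", [], ["t2"]), ("xor", [], ["t3"]),
             ("add", ["t0"], ["t1"])]
print(stall_cycles(back_to_back))  # 2
print(stall_cycles(scheduled))     # 0
```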
After a load instruction, 2 independent instructions must be placed between the load and the first instruction that uses the loaded value to avoid a load-to-use stall. There is a nice picture showing the load-use delay, but it is a little large for this forum. The 68060 and most CISC designs have no load-to-use stalls, so they benefit more from a deeper pipeline. As bad as the increased load-to-use delay is for the R4000, branching is much worse. The MIPS ISA was designed around a branch delay slot and has static not-taken branch prediction.
https://blog.jyotiprakash.org/delving-deeper-into-the-mips-pipeline Quote:
Branch Delays
The basic branch delay in the R4000 pipeline is 3 cycles since the branch condition is computed during the EX stage. The MIPS architecture includes a single-cycle delayed branch. The R4000 employs a predicted-not-taken strategy for the remaining 2 cycles of the branch delay. The following figures demonstrates that untaken branches behave as 1-cycle delayed branches, while taken branches include a 1-cycle delay slot followed by 2 idle cycles. The instruction set includes a branch-likely instruction to help fill the branch delay slot. Pipeline interlocks enforce the 2-cycle branch stall penalty on taken branches and any data hazard stalls resulting from load uses.
|
The R4000 has no dynamic branch prediction for an 8-stage pipeline! Each iteration of a loop stalls for 2 cycles, and a 3rd cycle is wasted if the branch delay slot is not useful. Compare this to the 68060, which starts with BTFN static prediction, so it will predict the loop branch as taken the first time, and which has 2-bit saturating dynamic branch prediction with a BTB, which not only eliminates stalls for loops but allows the branch itself to be folded away. The 68060 has a 3-4 cycle advantage on every iteration of a loop! Other branches benefit too. The 68060's 2-bit saturating prediction is better than the 1-bit prediction of the Alpha 21064, which was upgraded to 2-bit saturating in the Alpha 21064A.
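The difference between 1-bit and 2-bit saturating prediction on a loop branch can be sketched as follows. This is an illustrative model; the loop shape and iteration counts are assumptions:

```python
# Compare 1-bit and 2-bit saturating branch predictors on a loop
# branch that is taken 7 times and then falls through, with the loop
# re-entered 10 times (e.g. an inner loop of a nested loop).

def mispredicts_1bit(pattern):
    state = 1  # start predicting taken (BTFN for a backward branch)
    misses = 0
    for taken in pattern:
        if (state == 1) != taken:
            misses += 1
        state = 1 if taken else 0  # always flips to the last outcome
    return misses

def mispredicts_2bit(pattern):
    state = 3  # strongly taken
    misses = 0
    for taken in pattern:
        if (state >= 2) != taken:
            misses += 1
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses

pattern = ([True] * 7 + [False]) * 10
# 1-bit mispredicts the exit AND the next re-entry of every loop
print(mispredicts_1bit(pattern))  # 19
# 2-bit only mispredicts the loop exit
print(mispredicts_2bit(pattern))  # 10
```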
The R4000's average CPI on the SPEC92 integer benchmarks was 1.54, from the same paper above, while Motorola claimed "1.2 CPI measured on a range of desktop and embedded applications" for the 68060.
The Superscalar hardware architecture of the M68060 https://old.hotchips.org/wp-content/uploads/hc_archives/hc06/3_Tue/HC6.S8/HC6.8.3.pdf
The 68060 was not a barely superscalar CPU. It was a finely tuned, high-tech Pentium killer, but that also made it a RISC killer, including of the shallow-pipeline, limited-OoO PPC. It threatened the AIM alliance and thus could not be clocked up.
minator Quote:
The 64-bit tax doesn't seem too high. Caches can add a huge number of transistors, as the R4400 number shows.
|
The minimum 64-bit tax is not too high, but paying only the minimum yields more of a hybrid 32-bit/64-bit CPU core than a true 64-bit design.
The MIPS R4000 Processor https://people.eecs.berkeley.edu/~kubitron/courses/cs252-S07/handouts/papers/R4000.pdf Quote:
The hardware cost of extending the architecture to 64 bits was about 7% of the die area. A longer 64-bit ALU stage represents the cycle time speed penalty.
|
The R4000 only has a 32-bit barrel shifter, which is half the size of a 64-bit barrel shifter. Modern 64-bit ISAs are more likely to have 64-bit integer multiply and divide instructions, which are also very expensive. The MIPS ISA is simple, making it cheaper to implement in 64 bits; for example, there is only one addressing mode. Compare that to AArch64, which rivals the 68k in addressing modes and has thousands of instructions instead of, at most, hundreds like the 68k and MIPS. There was also a "cycle time speed penalty" from the "64-bit ALU stage" slowing down the whole pipeline. This is less of a problem with modern silicon, but 64-bit ALU operations are still sometimes slower, 64-bit pointers are sometimes much slower than 32-bit pointers, and 64-bit code tends to be larger than 32-bit code, decreasing cache efficiency. I am not completely opposed to 64 bits, but the cost is higher than the benefit on low-end, inexpensive hardware.
Nintendo bought into the MIPS 64-bit propaganda for the Nintendo 64.
https://en.wikipedia.org/wiki/Nintendo_64#Hardware Quote:
Technical specifications
The Nintendo 64's architecture is built around the Reality Coprocessor (RCP), which serves as the system’s central hub for processing graphics, audio, and memory management. It works in tandem with the VR4300, a 93.75 MHz 64-bit CPU fabricated by NEC with a performance of 125 million instructions per second. Popular Electronics compared its processing power to that of contemporary Pentium desktop processors. Though constrained by a narrower 32-bit system bus, the VR4300 retained the computational capabilities of the more powerful 64-bit MIPS R4300i on which it was based. However, software rarely utilized 64-bit precision, as Nintendo 64 games primarily relied on faster and more compact 32-bit operations.
|
The Nintendo 64's successor, the GameCube, went back to a more practical 32-bit PPC CPU.
minator Quote:
There is a CISC tax though. They have more logic transistors, they are far more complex to design, and slower. There's reason the industry gave up on CISC.
|
I see an x86 tax, but if there is a 68k tax at all, it is small and well worth the code density advantage allowing the 68k to save on caches, which, as you admit, "can add a huge number of transistors". The transistors for caches dwarf the pipeline transistors on modern cores. Modern load/store architectures that pretend to be RISC care about code density now and have abandoned the RISC simplicity that was bad for performance. A minimal 68060 core may actually be smaller than a minimal AArch64 core today.
Last edited by matthey on 29-Jun-2025 at 04:31 PM.
cdimauro
Re: The (Microprocessors) Code Density Hangout
Posted on 29-Jun-2025 6:02:30 [ #303 ]
Elite Member
Joined: 29-Oct-2012 | Posts: 4593 | From: Germany

@matthey
Quote:
matthey wrote: cdimauro Quote:
I agree. Those results reflect similar trends on other code density benchmarks. But, as you've said, it's old (and I don't even recall if "Thumb" is just Thumb or it's Thumb-2).
|
Thumb-2 was released early enough for the SPARC16 paper, and "arm(thumb)" likely represents the Thumb modes, with the compiler choosing between Thumb and Thumb-2.
SPARC16: A new compression approach for the SPARC architecture https://www.researchgate.net/publication/221306454_SPARC16_A_new_compression_approach_for_the_SPARC_architecture Quote:
First presented by ARM on its ARM7 model, the next 16 bits processor extension in the market was Thumb. Thumb enabled ARM processors are capable of running code in both 32 and 16 bits modes and allow subroutines of both types to share the same address space, while the mode exchange is achieved during runtime through BX and BLX instructions, which are branch and call instructions that flip the current mode bit in a special processor register. To fit functionality in 16 bits, a group of only 8 registers together with a stack pointer and link registers are visible, the remaining registers can only be accessed implicitly or through special instructions. Results presented by ARM show a compression ratio ranging from 55% to 70%, with an overall performance gain of 30% for 16 bit buses and 10% loss for 32 bit ones. Thumb2 is the recent version of the original Thumb incremented with new features like the addition of specific instructions for operating system usage.
|
Thumb-2 would be chosen most of the time for performance, but the original Thumb ISA often has similar and sometimes better code density. The paper mentions Thumb's 8 GP registers causing increased memory traffic, but not the 30%+ increase in instructions executed compared to the original ARM ISA. Nor does it mention the Thumb-2 advantages of avoiding performance-killing, pipeline-flushing mode switches, accessing all 16 GP registers with 32-bit instructions, and encoding larger immediates and displacements in 32-bit instructions, with the resulting decrease in instructions executed and memory traffic. Thumb-2 was a big improvement over Thumb, which a code density study alone does not show. Code density should be studied together with performance metrics/traits, but that is more difficult. |
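The trade-off can be sketched with back-of-the-envelope arithmetic. All numbers here are illustrative assumptions (the ~30% instruction-count inflation for Thumb matches the figure above; the Thumb-2 mix is a guess), not measurements:

```python
# Rough code-size model: Thumb executes ~30% more instructions than
# ARM at 2 bytes each; Thumb-2 is assumed to execute ~5% more than
# ARM with an average encoding width between 2 and 4 bytes.
def code_size(instructions, avg_bytes):
    return instructions * avg_bytes

arm_insns = 1000
arm    = code_size(arm_insns,        4.0)  # 4000 bytes
thumb  = code_size(arm_insns * 1.30, 2.0)  # 2600 bytes
thumb2 = code_size(arm_insns * 1.05, 2.6)  # 2730 bytes (assumed mix)

print(f"Thumb  : {thumb / arm:.0%} of ARM size")   # 65%
print(f"Thumb-2: {thumb2 / arm:.0%} of ARM size")  # 68%
```

Both land inside the 55%-70% compression-ratio range quoted from the SPARC16 paper, but Thumb-2 gets there while executing far fewer instructions.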
Thumb-2 shows better code density on average compared to Thumb, because it executes fewer instructions. Thumb has a very good 16-bit encoding, but as you've said, it uses only 8 GP registers and has limited constant/offset support. Thumb-2 solves all those problems by allowing the execution of the original ISA, with the exception of conditional execution. Of course, it's limited by the ARM ISA itself. Quote:
cdimauro Quote:
I don't expect that such good ARC results were coming from the RISC-V version, because we've seen that this ISA isn't so good about code density.
The only way to clarify it is checking the situation trying to reproduce the buildroot, at least for ARC, and check the produced object code.
|
It is possible that ARC-V standardizes more of the optional RISC-V ISA extensions, which provide better code density than the RV32GC compiler target, although the ~13% code density improvement for ARC is more than I would expect. |
It's unlikely, because those new extensions are quite recent, whereas the benchmark is already some years old. I'm more inclined to think that the latest ARC ISA was used. Quote:
| If ARC was using an improved ISA that competed with ARM Thumb ISAs, then it is interesting that they would give it up for RISC-V. RISC-V compressed code density competitiveness was better in this code density comparison than most others I have seen outside of RISC-V promoters. |
Indeed, but you've seen what happened to Motorola with its 68k: they threw it away, despite it being top class in this area... Quote:
| One thing RISC-V does not suffer from is lack of GP registers and the code density is good for the number of GP registers. It is the reduced instruction set "RISC" instructions and addressing modes which made the ISA weak. |
The instruction set is anything but reduced: with more than 500 instructions (the last time I looked; by now it might be many more), it's on the right track to compete with Intel and ARM. Quote:
| There are new ISA extensions to try to fix the known problems but will they become the Cortex-A like pseudo standard as the compressed extension became and will some of the handicap remain? |
The base ISA is very weak, and the academic decisions, especially about the load/store instructions, heavily crippled it. These are things that can't be solved with extensions.
Which is good, because there's still open space for competitors. Quote:
cdimauro Quote:
Nowadays it doesn't matter: all Amiga OS could stay on persistent storage (hard disk, SSD, Flash) and loaded in memory at the boot time. So, I don't bother about NOR systems. For embedded systems, we've plenty of space on Flash memory, which should perfectly fit the scope.
But I don't agree about putting the LVOs in flash: the Amiga OS engineers already did dirty things by internally calling library functions WITHOUT using the LVOs, to squeeze the most space possible. And this is dirty because there's a "contract" to be respected, which is SetFunction, as I've told them the last year on a FB discussion (on the last article which I've published).
|
Loading/decompressing libraries from NAND flash storage into memory has several advantages as a trade-off against the increased memory footprint.

1. NAND flash is cheaper, higher performance, and scales down better than NOR flash
2. good compatibility and upgradability can be maintained
3. PC-relative libraries that do not need the library base in the a6 register could be used
Compatible PC-relative libraries are possible right now for storage-loaded Amiga libraries, by merging the whole library together in memory, including the code, data, and LVOs. A 68k CPU cannot do a PC-relative write/store, so an LEA would be necessary first, but this is low overhead and, I believe, better than maintaining the library base in the a6 register when many functions do not use it and it creates problems for 68k compiler support (most other 68k compiler targets default to a6 as the frame pointer). The (d16,PC) addressing mode would work for small libraries, but a shorter encoding for (d32,PC) would improve efficiency for larger libraries. PC-relative writes/stores should be considered for a 68k64 ISA, which could further improve efficiency.
With the introduction of PC-relative libraries, new Amiga code which reallocates a7 for other purposes would not work with old libraries, but old code would be compatible with the new libraries. PC-relative libraries would reduce how low a 68k system can scale in footprint, as a NOR flash Kickstart would use less memory, but it would not give up much, considering NOR flash is not scaling below about a 40 nm chip fab process and requires two dies, as the RP2354 stacked-die packages demonstrate.
https://en.wikipedia.org/wiki/RP2350
The 68k Amiga's tiny footprint is a significant advantage, but there are other costs that are higher than the memory cost. |
I see the problem here, which is more general, because these problems belong to embedded systems without an MMU. An MMU makes life much easier, because it can map memory anywhere, hence "solving" the problem of local library data being written (reading isn't the problem), since you can't write to NOR/NAND space. The only problem with MMUs is that this comes at the expense of wasting memory due to page alignment.
Yes, enabling the PC-relative modes to write to memory solves the 68k's problem (AFAIR the 68080 has already lifted this archaic limit), and a 32-bit displacement version would be more general and useful in a 64-bit extension.
The big problem with the AmigaOS libraries is this a6 register (you've written a7 in the last part, but I assume that's just a typo), which forces you to reload it with the library base whenever you call another library (or when you've simply used the register for something else). The good thing is that it solves the above problem without requiring an MMU or PC-relative modes, but at the expense of reserving a register (plus the mentioned reloading every time it's needed). The Amiga libraries were a nice concept which perfectly matched the model of having a lot of their code running from ROM, but on the other side they aren't very efficient, considering also the evolution of technologies & needs. Quote:
cdimauro Quote:
PowerPCs have load/store multiple registers instructions, so that's not the case.
|
The PPC ISA has load/store multiple register instructions, but the standard does not require them to be implemented in hardware, which is why prologue/epilogue helper functions are the standard way to save and restore GP registers. Also, the PPC STMW/LMW instructions only store and load consecutive registers, whereas the 68k MOVEM loads and stores nonconsecutive registers from a list. Many unnecessary registers are stored to the stack compared to the 68k, whether the STMW/LMW instructions or the prologue/epilogue method is used, because both only access consecutive GP registers.
Power Architecture 32-bit Application Binary Interface Supplement 1.0 - Linux & Embedded https://example61560.wordpress.com/wp-content/uploads/2016/11/powerpc_abi.pdf Quote:
_save32gpr_14: stw r14,-72(r11)
_save32gpr_15: stw r15,-68(r11)
...
_save32gpr_30: stw r30,-8(r11)
_save32gpr_31: stw r31,-4(r11)
blr
|
There is an example with vector registers on page 66 which requires an ADDI instruction for each GP register stored. This is why PPC stack sizes and memory traffic are so bad, and it does not even count the extra loop unrolling and function inlining required by many RISC CPU core implementations for good performance. |
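The register-list point can be sketched as follows. The 16-bit mask layout shown is the 68k MOVEM order for the control/postincrement modes (D0 in bit 0, A7 in bit 15); the consecutive-range comparison scheme is illustrative, not an exact model of STMW:

```python
# Decode a 68k-style MOVEM 16-bit register mask: one bit per register,
# D0-D7 in bits 0-7, A0-A7 in bits 8-15. Any subset can be saved with
# one instruction; a consecutive-range scheme like PPC's STMW must
# save every register from the lowest live one upward.
REGS = [f"d{i}" for i in range(8)] + [f"a{i}" for i in range(8)]

def movem_list(mask):
    return [REGS[i] for i in range(16) if mask & (1 << i)]

# Save only d2, d5, and a2: three registers, one MOVEM.
mask = (1 << 2) | (1 << 5) | (1 << 10)
print(movem_list(mask))  # ['d2', 'd5', 'a2']

# A consecutive scheme spanning the same live registers also saves
# everything in between: 9 registers instead of 3.
lo = min(i for i in range(16) if mask & (1 << i))
hi = max(i for i in range(16) if mask & (1 << i))
print(len(REGS[lo:hi + 1]), "registers saved")  # 9 registers saved
```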
That's really crazy: they have the technology/instructions, but they prefer not to use them. Even a millicoded version of the LMW/STMW instructions would have been better than bloating the code with so many inserted instructions, wasting memory and cache efficiency.
BTW, saving/loading consecutive registers isn't really a problem, because that is what happens in regular code. It ultimately depends on the ABI, but ABIs usually mandate consecutive registers for passing parameters and for callee-saved registers. So, even if the 68k is much more flexible from this PoV, the flexibility isn't worth much in the real world (at least in function prologues/epilogues). Quote:
cdimauro Quote:
There should be other reasons, because passing the args to the stack due to the 68k ABI should require more or less the same space as passing them in regs (PowerPC ABI) but saving the old values in the stack and restoring them back.
What might influence here is the "RISC-factor" using many more registers inside a function to accomplish the same task, whereas on the 68k you can use immediates and instructions can directly access memory, which requires many less registers to be used.
|
No doubt the "RISC-factor" has synergies that bloat PPC programs. Once a design goes down the fat-everything path, the snowball grows. The PPC philosophy is practically the opposite of the 68k philosophy of minimizing code size, minimizing memory traffic, and supporting code sharing well with PC-relative addressing. The 68k AmigaOS extends that philosophy and elegance, while the PPC AmigaOS CPU ISA does not, at least if one wants to retain the small footprint advantage of the 68k AmigaOS. Hyperion had plans to enter the embedded market with PPC AmigaOS 4, but when the EfikaPPC with 128 MiB of memory was not enough, it should have been clear that the replacement CPU ISA was a poor choice for the AmigaOS and the embedded market. |
That was already very well known long before Hyperion considered entering the embedded market.
I can understand using some (restricted, limited) PowerPC processors because they might be more suitable in some areas (safety, RAS, protection from radiation) or might be smaller (at least the logic implementing the ISA), but besides that I really don't see any value in them (rather, the opposite). Quote:
cdimauro Quote:
A guard page to detect stack underflow is perfectly fine. But it's always ONE page, regardless of the allocated stack. 
|
The stack guard page could be eliminated with stack limit checking as newer ARM Cortex-M cores use.
TRUSTZONE TECHNOLOGY 04_LPC5500_TrustZone_v1.4.pdf Quote:
Stack limit checking
• As part of ARM TrustZone technology for ARMv8-M, there is also a stack limit checking feature. For ARMv8-M Mainline, all stack pointers have corresponding stack limit registers.
|
Some Cortex-M features are static limits rather than dynamic like their Cortex-A counterparts, though. |
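The stack-limit check described above can be sketched as a toy model. The addresses and the 4-byte push size are assumptions for illustration; real ARMv8-M performs the check in hardware against the PSPLIM/MSPLIM registers:

```python
# Toy model of ARMv8-M-style stack limit checking: every push checks
# the new stack pointer against a limit register and faults BEFORE
# the store, instead of relying on a guard page to catch the write.
class StackOverflow(Exception):
    pass

class Stack:
    def __init__(self, base, limit):
        self.sp = base      # stacks grow down from base
        self.limit = limit  # like ARMv8-M's PSPLIM/MSPLIM
        self.mem = {}

    def push(self, value):
        if self.sp - 4 < self.limit:   # checked before the store
            raise StackOverflow(hex(self.sp - 4))
        self.sp -= 4
        self.mem[self.sp] = value

s = Stack(base=0x2000_1000, limit=0x2000_0FF8)
s.push(1)
s.push(2)          # fills the 8 bytes between limit and base
try:
    s.push(3)      # would cross the limit
except StackOverflow:
    print("stack limit fault")  # prints this, no guard page needed
```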
That's like the "good, old" (!) x86 segmentation (!!) coming back (selectors, in reality: the very granular access check is provided by the protected-mode segmentation memory model).
Which, despite what was said and is still being said (some months ago Linus Torvalds expressed the same criticism on RealWorldTech), is a good solution to this (as well as other) very common problems. However, a questionable (but understandable) design & implementation does NOT imply that the technology itself is bad (and here is his mistake: generalizing from a single use-case and pretending it's not OK in any case). Quote:
cdimauro Quote:
No, large pages aren't normally used on a modern OS, unless the application "requires it".
And it would have been a very dumb decision for an OS which is running on a system with little available memory.
Probably the memory is just reserved on the address space, but not effectively allocated (so, it will not be paged out in case of low memory available), and will be (only) allocated when needed. Just a guess...
|
I think you are on the right track but I would not rule out large MMU page sizes for the stack. A large MMU page size for the stack reduces TLB misses which improves performance. |
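The TLB argument comes down to TLB reach: the amount of address space the TLB can cover without a miss. The entry counts below are illustrative assumptions for a generic 64-bit core, not any specific CPU:

```python
# TLB reach = entries * page size: how much address space is covered
# before a TLB miss can occur (entry counts are assumed for illustration).
def tlb_reach(entries, page_size):
    return entries * page_size

KiB, MiB, GiB = 2 ** 10, 2 ** 20, 2 ** 30

print(tlb_reach(1536, 4 * KiB) // MiB, "MiB")  # 6 MiB with 4 KiB pages
print(tlb_reach(8, 1 * GiB) // GiB, "GiB")     # 8 GiB with eight 1 GiB entries
```

Even a handful of 1 GiB entries covers orders of magnitude more address space than a large 4 KiB-page TLB, which is why huge pages can effectively nullify TLB misses.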
I'm quite in favour of (very) large pages, because they have great value for performance, at least for code (and partially for the stack, since it grows down and its usage is not predictable).
Data (not dataro, of course) and BSS are a different story, since you need to protect the data of each process/thread and completely isolate those segments from each other.
That isn't the case for AmigaOS, and that's exactly why I'm advocating the use of very large pages (1 GiB) on AROS x64, since this would considerably improve performance (basically nullifying the impact of TLB misses). Quote:
The large Linux stack size may be a virtual memory size that is only partially allocated, but allocating more memory on demand is also bad for performance. Compilers bloat up code for minimal performance gains, and Linux OSs can be expected to do the same. The problem is that when everything is bloated up for minimal performance gains, low-memory paging may cause a major performance loss. It is kind of like the quick-start launchers that many Windows programs add to startup: they slow down startup and overall performance when too many are installed. |
That's mainly related to the increased complexity and the applications installed on the system.
So, not a fault of the system per se. Quote:
cdimauro Quote:
Likely. The developers of OS & libraries preferred to drop supporting x32 "simply" (!) because they don't want to pay for this "burden", and AArch64 ILP32 is probably facing the same.
I can understand that support is a cost, but that it's a well spent cost, especially on systems with limited available resources.
|
Yes. The following is a thread suggesting x32 ABI support should be removed in Linux.
https://lwn.net/ml/linux-kernel/CALCETrXoRAibsbWa9nfbDrt0iEuebMnCMhSFg-d9W-J2g8mDjw@mail.gmail.com/
Linus Torvalds responds.
Linus Torvalds Quote:
Andy Lutomirski wrote:
> I'm seriously considering sending a patch to remove x32 support from upstream Linux. Here are some problems with it:
|
I talked to Arnd (I think - we were talking about all the crazy ABI's, but maybe it was with somebody else) about exactly this in Edinburgh.
Apparently the main real use case is for extreme benchmarking. It's the only use-case where the complexity of maintaining a whole development environment and distro is worth it, it seems. Apparently a number of Spec submissions have been done with the x32 model.
I'm not opposed to trying to sunset the support, but let's see who complains..
Linus
|
Q.E.D. People don't really get the value of such an ABI and think that it's only for benchmarking purposes. No comment... Quote:
I think it would have been easier to maintain x32 if the LLP64 data model had been chosen for Linux, like Windows, instead of the LP64 data model, which would also have reduced the memory footprint a little. Changing existing datatype sizes (32-bit long to 64-bit long) is a pain, and it is better to define and use new ones, like the 64-bit long long.
https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models
If new drivers are not being supported for the x32 ABI, it is difficult to maintain support but support often depends on the attitude of the developers providing the support. |
Exactly, but the main problem here is represented by the Linux (and perhaps Unix) model, with the very bad decision to extend longs to 64-bit, instead of keeping them 32-bit (as it was made on Windows for x64), bloating a lot the data usage. It was a really dumb decision, because keeping longs to 32-bit would have made no difference to the existing applications (rather the opposite having them 64-bit in size: could have caused compatibility issues).
This wouldn't have solved the pointer size issue (for that, x32 is required), but the "64-bit tax" would have been mitigated.
Anyway, x32 is a great, missed opportunity... |
| | Status: Offline |
| | cdimauro
|  |
Re: The (Microprocessors) Code Density Hangout Posted on 29-Jun-2025 6:30:59
| | [ #304 ] |
| |
 |
Elite Member  |
Joined: 29-Oct-2012 Posts: 4593
From: Germany | | |
|
| @matthey
Quote:
matthey wrote:
It may be easier to compare non GUI based memory footprints as the GUI desktop varies with screen/window memory used, backdrop memory used, taskbar memory used, icons used and other GUI related programs that may be launched at startup. A 1080P screen (1920x1080) in true color 32-bit would use nearly 8MiB of memory but this still does not explain large Linux footprints. The 32-bit RPi still uses ~76MiB to boot to a CLI. We did not consider the screen when looking at the 68k Amiga footprint.
68k AmigaOS 3 used 54kiB of 2MiB of memory after boot, or 55,296B
floppy drive defaults to 5x512B buffers, using 2,560B
Amiga defaults to a 640x200 screen with 4 colors, using 32,000B
---
~20,736B used by the AmigaOS excluding floppy drive buffers and screen/window bitmap
The 68k Amiga memory footprint was not even small compared to 8-bit computers with character/tile based graphics, but the 68k Amiga used a large flat address space, preemptive multitasking and bitmapped graphics, and was much more dynamic than 8-bit systems. How good a memory footprint seems is relative to what you are used to. |
Comparisons like those can't be made, because there are three main factors which prevent drawing conclusions:
- the Amiga OS doesn't use any MMU;
- the Amiga OS lacks the modern features implemented on any modern OS/platform;
- systems/platforms/environments have evolved considerably over time, adding more tools, libraries, and applications.
The first has implications on how memory is allocated and used. Page sizes on systems using an MMU impose a much greater memory consumption, whereas the Amiga OS has just an 8-byte allocation granularity.
The second is more subtle, and difficult to understand, because it requires bigger and additional data structures on modern OSes, whereas the Amiga OS has ZERO of both.
The third is self-explanatory: if you want to provide an environment with many more features, you need to add more services, libraries, and tools, and users need more applications, which the Amiga OS lacks. And this is where the majority of the memory consumption defining the footprint comes from.
The Amiga OS can't be compared to embedded Linux distros which sport GUIs, simply because we're talking about completely different platforms which fit different markets.
A better comparison could be done by using embedded versions of Linux, not using MMUs at all, and "crippled" to a similar "least common multiple". Yocto is a very common and heavily used "meta (Linux) distro" which can be used for that, because it lets you define carefully every single aspect of a distro. However, I don't know if it's worth the time spent only for a benchmark (Yocto requires additional skills compared to system-level Linux skills). |
| | Status: Offline |
| | cdimauro
|  |
Re: The (Microprocessors) Code Density Hangout Posted on 29-Jun-2025 6:37:18
| | [ #305 ] |
| |
 |
Elite Member  |
Joined: 29-Oct-2012 Posts: 4593
From: Germany | | |
|
| @Hammer
Quote:
Hammer wrote: @matthey
Quote:
Thumb-2 is actually less handicapped by the lack of GP registers than x86 which really only has 6 usable orthogonal GP registers for programs to use (EBX and ESP are not general purpose and lack orthogonality). ARM 32-bit loses the PC and LR registers so really only has 14 GP registers compared to the 68k 16 GP registers which moves the PC out of orthogonal encodings and does not have a LR register. The 68k SP is mostly orthogonal except for 8-bit stores which are padded to 16-bit. The 68k has better PC relative addressing support than x86 too. The 68k ISA is not as handicapped as x86 and Thumb-2 ISAs.
|
1. For the NXP / ST-Micro camp, PPC 16-bit VLE is the main competition against 68K. |
For which you still have provided no benchmarks to show how it compares to other architectures regarding the code density (which is the argument of the thread)... Quote:
2. IA-32's X87 supports integer formats due to 8087's support for INT32 and INT64, in addition to FP32, FP64, and FP80 data formats.
For 8086 and 8088, Intel addressed the 68000's INT32 advantage via 8087's INT32 support.
IA-32 includes 8 X87/MMX and 8 XMM (SSE2) registers. SSE2 supports scalar/vector integers.
Unlike 68060, Pentium guarantees X87's existence. |
Irrelevant? Quote:
Hammer wrote: @cdimauro
Quote:
cdimauro wrote: I've already and IMMEDIATELY reported this news once it was published AND added my comment on that (NOT in favour of Intel), idiot!
|
You're the real idiot. |
 |
| | Status: Offline |
| | cdimauro
|  |
Re: The (Microprocessors) Code Density Hangout Posted on 29-Jun-2025 7:12:09
| | [ #306 ] |
| |
 |
Elite Member  |
Joined: 29-Oct-2012 Posts: 4593
From: Germany | | |
|
| @minator
Quote:
minator wrote: @matthey
Quote:
| The 2-way superscalar 32-bit 68060 CPU uses ~2.5 million transistors while the lowest end 64-bit 2-way superscalar Cortex-A53 core uses ~12.5 million transistors. A 32-bit in-order 2-way Cortex-A7 core predecessor uses ~10 million transistors so a 64-bit equivalent Cortex-A53 core uses ~25% more transistors. The 64-bit tax applies to more than just memory. |
A lot can change over 17 years, also, the A53 effectively implements 2 different instruction sets and that impacts the entire processor. |
There are three: ARM, Thumb/Thumb-2 (Thumb-2 is a transparent extension of Thumb) and AArch64.
However, I don't know how much it impacts the processor, since it mostly affects the frontend. Quote:
Wouldn't it be better to compare similar processors from the same time: All of these are 2 way superscalar:
32 bit
1993 Pentium P5 66MHz (2x 8K caches) 3.1 million transistors
1994 68060 50MHz (2x 8K caches) 2.5 million transistors
1994 PA-7200 120MHz (1x 2K assist cache) 1.3 million transistors
1994 Pentium P54 100MHz (2x 8K caches) 3.2 million transistors

64 bit
1992 Alpha 21064 (EV4S) 200MHz (2x 8K caches) 1.68 million transistors
1991 MIPS R4000 100MHz (2x 8K caches) 1.35 million transistors
1992 MIPS R4400 250MHz (2x 16K caches) 2.2 million transistors |
Yes, if we just stop at the number of used transistors.
However, those architectures also offer/expose different features, and you cannot narrow them down to a "least common multiple" (again!) which would allow a better comparison of the impact of each architecture.
But it's important to underline that the above RISCs* were novel architectures with no legacy to carry, which allowed their architects to focus and put resources only where needed. That's the great advantage of "greenfield" projects... Quote:
| The 64 bit tax doesn't seem too high. |
Indeed. When AMD introduced the first x86-64 processors, it claimed just a ~5% increase in core transistor count, which includes doubling the number of both GP and SIMD registers. Quote:
| Caches can add huge number of transistors as the R4400 number shows. |
It always depends on the ISA.
ISAs with poor code density require much larger caches to compensate for the drop in instruction cache efficiency. A clear example is HP's PA-RISC, which required a huge amount of external cache first, and then very big internal caches to address this.
64-bit architectures also require more cache, because the long data type and/or pointer sizes double (see above, where we discussed the x32 ABI). They usually require more cache because more instructions are needed to build larger immediates. And x64 requires more cache because of its larger code size (due to the instruction prefixes used to select 64-bit data types where the default is 32-bit, or to access the new registers). Quote:
| There is a CISC tax though. They have more logic transistors, they are far more complex to design, and slower. There's reason the industry gave up on CISC. |
You can talk of x86/x64 and 68k taxes, but certainly not about a general "CISC tax".
Both x86/x64 and 68k have their legacies, which of course require transistors. But this is a "fixed tax", whose impact gets lower and lower as more transistors are used for caches, for example (which take the largest share of the die), or vector units. But it's there, for sure, and it had greater relevance in the '80s and until the mid-'90s.
However, you can't take those examples and generalize the concept to ALL CISCs. In fact, it's possible to have CISCs which carry no such "taxes" and use a comparable number of transistors (while keeping the benefits of CISC designs). I have not one, but two proofs of that: NEx64T and my new architecture (which is further "super-simplified and super-complicated": easier to implement, but "richer" in terms of features & flexibility. In fact, all x86/x64 legacy is removed, I've made completely different design decisions about some common features, and I've pushed the pedal harder on the "CISC philosophy").
* "RISC" is just a label, good for academics and marketing, because there was/is no real RISC, as I've already proved: all processors sold as "RISCs" are CISCs in reality, since almost none of the "RISC philosophy" applies. |
| | Status: Offline |
| | ppcamiga1
|  |
Re: The (Microprocessors) Code Density Hangout Posted on 29-Jun-2025 7:51:52
| | [ #307 ] |
| |
 |
Super Member  |
Joined: 23-Aug-2015 Posts: 1145
From: Unknown | | |
|
| this thread is pure waste of time. motorola switch from 68k 30 years ago simply because they are unable to compete alone with intell and it is history nobody care about 68k anymore and mythical better code density will not help
mattay di mauro hammer and other trolls should just hard work on mui if want ot switch to pc
|
| | Status: Offline |
| | cdimauro
|  |
Re: The (Microprocessors) Code Density Hangout Posted on 29-Jun-2025 11:40:07
| | [ #308 ] |
| |
 |
Elite Member  |
Joined: 29-Oct-2012 Posts: 4593
From: Germany | | |
|
| @ppcamiga1
Quote:
ppcamiga1 wrote: this thread is pure waste of time. |
Then why are you still here? Go away and do me a favour, since you're only adding garbage here. Quote:
motorola switch from 68k 30 years ago simply because they are unable to compete alone with intell and it is history |
Right. Quote:
| nobody care about 68k anymore |
Wrong. It's clearly evident that it still matters to a lot of people.
Specifically, Amiga and Atari ST fans. And on the post-Amiga land is clearly evident that the Amiga (hence, purely 68k) has way much interested compared to PC-x86+PowerPC (which SOMETIMES it's called "AmigaOne". Which is NOT an Amiga). Quote:
| and mythical better code density will not help |
Again and plainly wrong: code density matters. A LOT.
And that's exactly the reason why practically ALL CPU vendors have invested, or will invest, MOUNTAINS OF MONEY just to improve it on their processors.
You clearly have no clue, at all, of what you're talking about, but that's something very well known. Quote:
| mattay di mauro hammer and other trolls should just hard work on mui if want ot switch to pc |
Let me reveal two secrets: we already switched to PCs long ago. AND... drum roll... MUI is a CLOSED and PRIVATE project which only its author can work on.
As usual, you miss no opportunity to show your complete nonsense AND ignorance. Last edited by cdimauro on 29-Jun-2025 at 11:41 AM.
|
| | Status: Offline |
| | matthey
|  |
Re: The (Microprocessors) Code Density Hangout Posted on 30-Jun-2025 0:03:50
| | [ #309 ] |
| |
 |
Elite Member  |
Joined: 14-Mar-2007 Posts: 2828
From: Kansas | | |
|
| cdimauro Quote:
Thumb-2 shows better code density on average compared to Thumb, because it executes less instructions. Thumb has a very good 16-bit encoding, but as you've said it uses only 8 GP registers and it has limited constants/offsets support. Thumb-2 solves all those problems by allowing the execution of the original ISA, with the exception of conditional execution. Of course, it's limited by the ARM ISA itself.
|
Thumb-2 has the same problem as x86-64 when optimizing for performance. The x86-64 ISA can approach x86 code density when optimizing for size, as most of the same encodings exist in x86-64. Using the new features, including the new registers and 64-bit support, improves performance by decreasing the number of instructions executed, but the code size increases. Thumb-2 uses 32-bit encodings to access more than 8 GP registers, which decreases the number of instructions but can also increase code size. Thumb code can have better code density than Thumb-2 code when optimizing for performance. See slide 14 of the following link by Philippe Robin of ARM, where Thumb code is smaller than Thumb-2 code.
Experiment with Linux and ARM Thumb-2 ISA https://linuxdevices.org/ldfiles/article078/Experiment_with_Linux_and_ARM_Thumb-2_ISA.pdf
I expect there are cases where only 8 GP registers are adequate and Thumb can compete with Thumb-2 in performance as these places are also where Thumb code would be smaller and benefit in performance from better code density. Thumb was purpose built for code density at the expense of all other CPU performance metrics/traits which is not true for Thumb-2. It is not surprising that Thumb has better code density sometimes.
The 68k 16-bit encodings can access all 16 GP registers. This is possible by limiting the address register operations with a reduction of orthogonality but many people considered the 68k to have good orthogonality despite the Dn/An register split and the limited "address calculating" operations for address registers are mostly what is needed. The 68k performance not only does not suffer from excessive instructions executed or memory traffic but it appears to be among the leading ISAs in these performance metrics to go with Thumb like code density.
cdimauro Quote:
Indeed, but you've seen what happened to Motorola with its 68k: they've thrown it, despite being top class in this area...
|
Motorola throwing out their beautiful 68k baby with the bathwater was a politically motivated decision after being pushed into joining the AIM Alliance. When a business sabotages its own products in favor of technically inferior products, there is a higher likelihood that it is ignoring business technical factors and analysis too. We know the destiny of Motorola. The Hyperion A-EonKit syndicate is doing similar shenanigans in Amiga Neverland: producing noncompetitive hardware, continuing forever with broken business models, pushing big lies to cover their ever growing IP encroachments, coercing the competition to protect their products, etc. They are at least sabotaging the competition and appear to be colluding to keep them out of their manipulated market, but the result is likely to be the same. Most intellectuals see the incompetence and corruption, choosing to stay away from such businesses and markets.
cdimauro Quote:
The instruction set is all but reduced: with more than 500 instructions (the last time that I've seen it: now it might be much more), it's in the right track with competing with Intel and ARM.
|
The RISC-V base ISA size is not all about bragging rights or pretending to maintain the RISC philosophy. It does allow hardware to scale lower with tiny cores and large memories.
ISA     | instructions | source
RV32I   | 40           | https://en.wikipedia.org/wiki/RISC-V
ARMv1   | 45           | https://arstechnica.com/gadgets/2022/09/a-history-of-arm-part-1-building-the-first-chip/
6502    | 56           | http://www.6502.org/users/obelisk/6502/instructions.html
68000   | 56           | https://en.wikipedia.org/wiki/Motorola_68000
RV32IMC | 88           | https://en.wikipedia.org/wiki/RISC-V
80286   | 357          | https://arstechnica.com/gadgets/2022/09/a-history-of-arm-part-1-building-the-first-chip/
ARMv1 with only 45 instructions, a 32-bit fixed length encoding and fat code is no longer used by ARM, but the Thumb(-2) compressed ISAs live on in Cortex-M for small cores. Adding the M (multiply and divide instructions) and C (compressed instructions) extensions to the RISC-V 32-bit base raises the RISC-V instruction count above that of the 68000. The 68000 has instructions that would come from the RISC-V A, Zicsr, B and S extensions, which have another 50 instructions combined. Defining a small base ISA allows keeping a reduced instruction set computer (RISC) that is still somewhat competitive once extensions are added.
cdimauro Quote:
I see the problem here, which is more general because those problematics belong to embedded systems without an MMU. MMUs make the life much easier, because they can map memory everywhere, hence "solving" the problem of having local library data being written (reading isn't the problem), since you can't write on NOR/NAND space. The only problem with MMUs is that this is made at the expense of wasting memory due to the page alignments.
|
NOR flash memory is random access so it is possible to write it like other memory, but it has a finite number of write cycles, which has increased with some modern NOR memory. It is possible to have an MMU with NOR memory which could protect the NOR memory from random writing, but other more primitive means are used on MCUs too. NOR writing is very slow and NOR reading is slower than accessing SRAM, which is very fast for both reads and writes. NOR flash reading may be faster than ROM reading in retro computers though. Ideally, for more general purpose use than a small MCU, the NOR flash would go away, replaced by a tiny ROM that would read and decompress NAND flash modules into memory. This would require more memory than is practical for SRAM, with the preference for on-chip eDRAM so the pin count is not increased to access off-chip SDRAM. This would allow keeping a smaller and cheaper MCU-like package while removing the memory limitation and the NOR flash scaling problem.
cdimauro Quote:
Yes, enabling the PC relative modes to write on memory solves the 68ks problem (AFAIR the 68080 has already lifted this archaic limit), and 32-bit version will be more general and useful on a 64-bit extension.
|
As I recall, there is one place where PC relative writes can not be enabled due to an encoding conflict. Scc?
cdimauro Quote:
The big problem with the Amiga OS libraries is this a6 register (you've written a7 in the last part, but I assume that it's just a typo), which forces to always load it with the library base once you need to call another library (or simply you've use the register for something else, before that). The good thing is that it solves the above problem without requiring an MMU neither PC-relative modes, but at the expense of reserving a register for this (plus the mentioned loading of it every time that it's needed). The Amiga libraries were a nice concept which perfectly matched the model of having a lot of their code running on ROMs, but on the other side they aren't much efficient, considered also the evolution of technologies & needs.
|
Yes, I meant a6, which I wrote correctly the first time and edited in my original post. The original 68k Amiga plan was for 64kiB of memory, which required putting as much as possible in ROM. NOR flash was invented in 1980 at Toshiba but it required time to develop and become affordable (Toshiba introduced NOR flash chips in 1987 and Intel NAND flash chips in 1988). Dave Haynie experimented with NOR flash at Commodore but it was likely just thought of as a more flexible option for ROM that would be used in the same way. DRAM memory remained a relatively high percentage of a system cost and a limited resource, much like on an MCU today. The difference from the 68k Amiga to modern MCUs is that DRAM is replaced by SRAM and ROM is replaced by NOR flash. Both external SDRAM and on-chip eDRAM increase the complexity of a chip but can be affordable with economies of scale. The memory becomes cheap and avoiding NOR flash access during normal system operation becomes advantageous. If the flash is only accessed at startup then it should be considered how to get rid of it, considering it is not scaling below ~40nm anyway. Move libraries into main memory and take advantage of PC relative addressing if possible.
cdimauro Quote:
That's really crazy: they have the technology/instructions, but they prefer to don't use it. Even a millicoded version of the LMW/STM instructions would have been better than bloating the code of so many instructions inserted into the code and wasting memory and cache efficiency.
|
That is the cost of not requiring LMW/STM to be standard in the PPC ISA. RISC-V would have the same problem using an extension for this support. It is ok for embedded use where the software is compiled for that particular hardware. The A1222 SoC supports LMW/STM in hardware and there are specific AmigaOS 4 software compiles for the A1222 non-standard FPU but do you think they use LMW/STM for that specific target?
cdimauro Quote:
BTW, saving/load consecutive registers aren't really a problem, because this is what happens with the regular code. This ultimately depends on ABIs, but they usually impose the usage of consecutive registers for passing parameters & callee-saved registers. So, even if 68ks are much more flexible from this PoV, this flexibility isn't much of a value in the real world (at least on functions' prologues/epilogues).
|
Multiple consecutive registers often need to be accessed consecutively which is fine for a short list but a long list is more likely to need multiple instructions. AArch64 allows two registers for load/store with one instruction which is a big improvement over one at a time for most RISC ISAs. A 32-bit fixed length encoding can not store a 32-bit bitmapped mask for 32 GP registers. The 68k only needs a 32-bit instruction to load/store 16 GP registers though. It is still a source of stack and memory traffic savings as well as convenience.
cdimauro Quote:
That's like the "good, old" (!) x86 segmentation (!!) which is back (selectors, in reality: the very granular access check is provided by the protected mode segmentation memory model). 
Which, despite what was said and is still being said even now (some months ago Linus Torvalds expressed the same criticism on RealWorldTech), is a good solution to this (as well as other) very common problems. However, a questionable (but understandable) design & implementation does NOT imply that the technology is bad by itself (and here is his mistake: generalizing from a single use case, and pretending that it's not OK in any case).
|
At least Intel used separate x86 segment registers. ARM uses an address bit to designate which modules are secure/protected. I guess if more addressing space is needed or a more flexible design then upgrade to the Cortex-A. The Cortex-M used to be ARM's embedded bread and butter but now it is kind of like Motorola's lack of support for ColdFire before it disappeared.
cdimauro Quote:
I'm quite in favour of (very) large pages, because they have a great value for performances, at least for the code (and partially for the stack, since it's growing down and its usage is not predictable).
Data (not dataro, of course) and BSS are a different story, since you need to protect the data of each process/thread and completely isolate those segments from each other.
Which isn't the case for the Amiga OS, and that's exactly the reason why I'm advocating the usage of very large pages (1GB) on AROS x64, since this will considerably improve performance (basically nullifying the impact of TLB misses).
|
It is not uncommon for stack data on the Amiga to be passed to functions meaning the stack memory needs to be shared. If stack memory can not be swapped out for process isolation, then it does need to be MMU mapped memory except for maybe a stack overflow detection page. Did any of the Amiga NG OSs forbid passing stack data to functions?
cdimauro Quote:
Exactly, but the main problem here is represented by the Linux (and perhaps Unix) model, with the very bad decision to extend longs to 64-bit, instead of keeping them 32-bit (as it was made on Windows for x64), bloating a lot the data usage. It was a really dumb decision, because keeping longs to 32-bit would have made no difference to the existing applications (rather the opposite having them 64-bit in size: could have caused compatibility issues).
This wouldn't have solved the pointers size issue (for this, x32 is required), but the "64-bit tax" would have been mitigated.
Anyway, x32 is a great, missed opportunity...
|
The decision to promote long to 64-bit was likely to avoid pointer to long and long to pointer datatype conversion problems. They probably should have broken that software and fixed it as the real problem is the assumption of pointer datatype sizes which is more dangerous than an assumption of the long datatype size. Either way there will be bugs and a 32-bit long would have offered better compatibility with x86, better compatibility with x86-64 Windows, a smaller memory footprint for x86-64 Linux and an easier to support x32 ABI.
Last edited by matthey on 30-Jun-2025 at 12:07 AM.
|
| | Status: Offline |
| | matthey
|  |
Re: The (Microprocessors) Code Density Hangout Posted on 30-Jun-2025 1:06:03
| | [ #310 ] |
| |
 |
Elite Member  |
Joined: 14-Mar-2007 Posts: 2828
From: Kansas | | |
|
| ppcamiga1 Quote:
this thread is pure waste of time. motorola switch from 68k 30 years ago simply because they are unable to compete alone with intell and it is history nobody care about 68k anymore and mythical better code density will not help
|
Compared to the RISC competition minator mentioned, the 68060 has a better cycles per instruction (CPI) when RISC is supposed to have the advantage with single cycle throughput instructions, has no load-to-use stalls unlike most RISC pipelines, is tied for the longest pipeline depth which provides more instruction level parallelism (ILP) and should provide a higher clock speed, has better branch prediction, has no 64-bit pipeline slowdowns, supports unaligned memory accesses and easily has the best code density. The 68060 dominates the RISC competition in integer performance/MHz with a full static CMOS design compared to most of the RISC competition using dynamic logic designs.
full static CMOS design
+ clock speed from max to zero, including low power sleep modes
+ lower power
+ easier to design and maintain
- does not allow as high a clock speed as a dynamic logic design
- uses more transistors than a dynamic logic design
The 68060 is an amazing high performance and efficient in-order CPU design. The code density is just the icing on the cake and a good recipe is timeless.
|
| | Status: Offline |
| | cdimauro
|  |
Re: The (Microprocessors) Code Density Hangout Posted on 30-Jun-2025 5:05:01
| | [ #311 ] |
| |
 |
Elite Member  |
Joined: 29-Oct-2012 Posts: 4593
From: Germany | | |
|
| @matthey
Quote:
matthey wrote: cdimauro Quote:
Thumb-2 shows better code density on average compared to Thumb, because it executes less instructions. Thumb has a very good 16-bit encoding, but as you've said it uses only 8 GP registers and it has limited constants/offsets support. Thumb-2 solves all those problems by allowing the execution of the original ISA, with the exception of conditional execution. Of course, it's limited by the ARM ISA itself.
|
Thumb-2 has the same problem as x86-64 when optimizing for performance. The x86-64 ISA can reach code density approaching x86 code density when optimizing for size as most of the same encodings exist in x86-64. Using new features including the new registers and 64-bit support, improves performance by decreasing the number of instructions executed but the code size is increased. Thumb-2 uses 32-bit encodings to access more than 8 GP registers which decreases the number of instructions but this can also increases code size. Thumb code can have better code density than Thumb-2 code when optimizing for performance. See slide 14 of the following link by Philippe Robin of ARM where Thumb code is smaller than Thumb-2 code.
Experiment with Linux and ARM Thumb-2 ISA https://linuxdevices.org/ldfiles/article078/Experiment_with_Linux_and_ARM_Thumb-2_ISA.pdf |
Slide 14 shows that Thumb-2 is better than Thumb when compiling for code size (-Os), which is the case when talking about code density.
It's only slightly worse than Thumb when compiling for performance (-O2), which is largely acceptable (there's a small difference in size, but a HUGE gain in performance). Quote:
| I expect there are cases where only 8 GP registers are adequate and Thumb can compete with Thumb-2 in performance as these places are also where Thumb code would be smaller and benefit in performance from better code density. |
They are rare birds, IF they exist. When compiling for size there's no way that Thumb can do better than Thumb-2 (for obvious reasons). When compiling for performance, Thumb-2 can access all registers and all ARM instructions. So, again, there's practically no chance that Thumb can do better.
There MIGHT be single routines where Thumb can do better (when compiling code for performance), but on regular applications (e.g.: not trivial, simple code) I don't expect such cases. Quote:
| Thumb was purpose built for code density at the expense of all other CPU performance metrics/traits which is not true for Thumb-2. It is not surprising that Thumb has better code density sometimes. |
Only when compiling for performance. But which can be acceptable, because there's a different goal for the code to be executed, and Thumb-2 shows only a slightly worse code size. Quote:
| The 68k 16-bit encodings can access all 16 GP registers. This is possible by limiting the address register operations with a reduction of orthogonality but many people considered the 68k to have good orthogonality despite the Dn/An register split and the limited "address calculating" operations for address registers are mostly what is needed. |
The big mistake of Motorola was not forcing a perfect distinction between address and data registers. The ISA could have been much better had some different decisions been taken.
Anyway, it's too late now. Quote:
| The 68k performance not only does not suffer from excessive instructions executed or memory traffic but it appears to be among the leading ISAs in these performance metrics to go with Thumb like code density. |
Indeed. There's substantially no difference when compiling for size or performance, albeit old compilers don't take advantage of modern optimization techniques which might expand the code. Quote:
The small instruction set of RISC-V is just a joke: in the real world you need many more instructions to get better code density and/or better performance, which are the two most important metrics that processor vendors always look at.
So, having only 40 instructions in the base ISA is just pure marketing and something that only academics could be proud of ("the instruction set is reduced!").
Anyway, I don't pay so much attention to the number of instructions, because logic has become cheaper in recent years, and using more of it to improve code density and/or performance is The Right Thing To Do (e.g.: reading that a few gates are saved by some ISA decisions is simply ridiculous). Quote:
cdimauro Quote:
Yes, enabling the PC relative modes to write on memory solves the 68ks problem (AFAIR the 68080 has already lifted this archaic limit), and 32-bit version will be more general and useful on a 64-bit extension.
|
As I recall, there is one place where PC relative writes can not be enabled do to an encoding conflict. Scc? |
I don't recall now, but it might be the case (that's why I said some days ago that there are exceptions in the ISA, with instructions using "illegal" encodings).
However, I don't see the problem: new encodings can be found to "restore" the missing functionality of those few instructions. Quote:
cdimauro Quote:
That's really crazy: they have the technology/instructions, but they prefer not to use them. Even a millicoded version of the LMW/STM instructions would have been better than bloating the code with so many inserted instructions and wasting memory and cache efficiency.
|
That is the cost of not requiring LMW/STM to be standard in the PPC ISA. RISC-V would have the same problem using an extension for this support. It is ok for embedded use where the software is compiled for that particular hardware. |
RISC-V was forced to introduce the new PUSHM/POPM instructions because their absence had a SEVERE impact on code density, which is fundamental in the embedded market. However, this extension is only available for the embedded version of the ISA, which means that the regular ISA is still handicapped.
Another example of how short-sighted the academics that designed this weak ISA were/are: they are living in an ideal world, and don't know how things work in the real one... Quote:
| The A1222 SoC supports LMW/STM in hardware and there are specific AmigaOS 4 software compiles for the A1222 non-standard FPU but do you think they use LMW/STM for that specific target? |
I don't know, but it depends on which processor AmigaOS 4 was first released for. Since they use the same binaries for all supported platforms (with the exception of the crappy A1222), it might be the case that they are using the LMW/STM instructions. Quote:
cdimauro Quote:
BTW, saving/loading consecutive registers isn't really a problem, because this is what happens with regular code. This ultimately depends on ABIs, but they usually impose the usage of consecutive registers for passing parameters & callee-saved registers. So, even if the 68k is much more flexible from this PoV, this flexibility isn't of much value in the real world (at least in functions' prologues/epilogues).
|
Multiple consecutive registers often need to be accessed consecutively, which is fine for a short list, but a long list is more likely to need multiple instructions. AArch64 allows two registers for load/store with one instruction which is a big improvement over one at a time for most RISC ISAs. A 32-bit fixed length encoding cannot store a 32-bit bitmapped mask for 32 GP registers. |
Yes, but usually prologues/epilogues (which are the parts of the code where it's generally needed to save & load registers) only use consecutive registers, so instructions like LMW/STM are enough.
AArch64 can do it only with two registers at a time, which is a (tiny?) compromise, but it's clearly handicapped from this PoV. Quote:
| The 68k only needs a 32-bit instruction to load/store 16 GP registers though. It is still a source of stack and memory traffic savings as well as convenience. |
Absolutely: MOVEM is perfect here, having only 16 registers. Quote:
cdimauro Quote:
That's like the "good, old" (!) x86 segmentation (!!) which is back (selectors, in reality: the very granular access check is provided by the protected mode segmentation memory model). 
Which, despite what was said and is still being said until now (some months ago Linus Torvalds expressed the same criticism on RealWorldTech), is a good solution to this (as well as other) very common problems. However, a questionable (but understandable) design & implementation does NOT imply that this is a bad technology by itself (and here is his mistake: generalizing a single use-case, and pretending that it's not OK in any case).
|
At least Intel used separate x86 segment registers. |
Indeed. That was a good design. By chance (since it's the legacy of the 8086), but a good design nonetheless. Quote:
| ARM uses an address bit to designate which modules are secure/protected. |
Which is ok for a 64-bit ISA with a 64-bit address space. Quote:
| I guess if more addressing space is needed or a more flexible design then upgrade to the Cortex-A. |
Cortex-A is 64-bit, so it's not a problem. But it's clearly a big problem with a 32-bit one (which is not developed anymore -> not an issue anymore for ARM). Quote:
| The Cortex-M used to be ARM's embedded bread and butter but now it is kind of like Motorola's lack of support for ColdFire before it disappeared. |
Exactly. Hence, there's a chance for competitors.  Quote:
cdimauro Quote:
I'm quite in favour of (very) large pages, because they have great value for performance, at least for the code (and partially for the stack, since it's growing down and its usage is not predictable).
Data (not dataro, of course) and BSS are a different story, since you need to protect the data of each process/thread and completely isolate those segments from each other.
Which isn't the case for the Amiga OS, and that's exactly the reason why I'm advocating the usage of very large pages (1GB) on AROS x64, since this will considerably improve performance (basically nullifying the impact of TLB misses).
|
It is not uncommon for stack data on the Amiga to be passed to functions meaning the stack memory needs to be shared. If stack memory can not be swapped out for process isolation, then it does need to be MMU mapped memory except for maybe a stack overflow detection page. |
It depends on how you compile your code. You can have a stack which is completely isolated, with the compiler avoiding sharing its memory with other processes (ad hoc memory will be allocated for sharing such data).
Normally, yes: the stack can be shared across all threads of a process. Quote:
| Did any of the Amiga NG OSs forbid passing stack data to functions? |
No. From what I know, the MMU is only used for mapping non-allocated memory and catching those accesses. Regular data (including the stack) is always shared. I don't know if the code segments/pages are protected in some way (e.g.: not writable).
P.S. As usual, no time to read again and fix errors. |
| | Status: Offline |
| | matthey
|  |
Re: The (Microprocessors) Code Density Hangout Posted on 1-Jul-2025 4:53:18
| | [ #312 ] |
| |
 |
Elite Member  |
Joined: 14-Mar-2007 Posts: 2828
From: Kansas | | |
|
| cdimauro Quote:
Slide 14 shows that Thumb-2 is better than Thumb when compiling for code size (-Os), which is the case when talking about code density.
It's only slightly worse than Thumb when compiling for performance (-O2), which is largely acceptable (there's a small difference in size, but a HUGE gain in performance).
|
Not all code density comparisons use -Os. It would be good if both -Os and -O2 are compared as in the ARM slide show but it would be good to give the hardware configurations. The hardware configuration is very important, especially in regard to memory bandwidth.
Design and evaluation of compact ISA extensions https://ic.unicamp.br/~eduardo/publications/2016MICRO.pdf Quote:
ARM introduced Thumb as the first 16-bit extension in ARM7. Later on, Thumb2 was released and superseded initial Thumb, introducing additional features. Thumb2 enabled ARM processors are capable of running code in both 32 and 16 bits modes and allow subroutines of both types to share the same address space. Mode exchange is achieved during runtime through BX and BLX instructions: branch and call instructions that flip the current mode bit in a special processor register. A group of only 8 registers including the stack pointer and link registers are visible, but the remaining registers can also be accessed implicitly or through other special instructions. Results presented by ARM for Thumb, show a compression ratio ranging from 55% to 70%, with an overall performance gain of 30% for 16 bit buses and 10% loss for 32 bit ones.
|
The original 32-bit ARM ISA was not competitive in the embedded market with the common and cheaper 16-bit memory and data bus. The Thumb ISA was a game changer for ARM to be able to compete in the embedded market. The 68k likely had at least a 30% performance advantage over the original ARM ISA as it can efficiently support a 16-bit bus and has super code density like Thumb. The 68k likely had a larger advantage as it does not have the increase in instructions and memory traffic like Thumb, and the 68k did not multiplex the address and data busses like ARM. The 68k could have easily had a 50% performance advantage over the original ARM ISA using a 16-bit multiplexed bus. Thumb-2 makes it worthwhile to use Thumb-2 most of the time over the original ARM ISA for a 32-bit bus too, which is why the famous and hyped original ARM ISA is dead.
If compilers were smart enough to generate the smallest code for an embedded target using a 16-bit bus and -O2 instead of unrolling loops and inlining functions, then Thumb-2 may be better in every case. If Thumb has smaller code than Thumb-2 in some cases with -O2 and a 16-bit bus, then it may be higher performance in some cases. There are other factors as well like how much caches are used, the performance of the memory and whether the busses are multiplexed so it may not be enough to just know there is a 16-bit bus. Now the question is whether the ARM slideshow used 16-bit or 32-bit bus/memory. Using a 32-bit bus/memory would likely show a larger performance gain for Thumb-2 over Thumb and a 32-bit bus/memory has become more common for embedded use.
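The bus-width tradeoff above can be made concrete with a toy model. A hedged Python sketch (the sizes and the 65% Thumb ratio are illustrative picks from the 55-70% range quoted earlier; caches, prefetch and bus multiplexing are ignored, so this only counts raw fetch traffic):

```python
def fetch_transfers(code_bytes, bus_width_bytes):
    """Bus transfers needed just to fetch the code once (crude model:
    no caches, no prefetch buffer, no multiplexed-bus penalty)."""
    return -(-code_bytes // bus_width_bytes)  # ceiling division

arm32_size = 100  # baseline ARM code size (arbitrary units)
thumb_size = 65   # ~65% of ARM, inside the quoted 55-70% range

# On a 16-bit bus, Thumb's smaller code means far fewer fetch transfers:
print(fetch_transfers(arm32_size, 2), fetch_transfers(thumb_size, 2))  # 50 33
# On a 32-bit bus the gap narrows, consistent with the quoted trend:
print(fetch_transfers(arm32_size, 4), fetch_transfers(thumb_size, 4))  # 25 17
```

A real comparison would also weigh the extra instructions Thumb executes, which is exactly why the 32-bit-bus result flips in ARM's numbers.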
cdimauro Quote:
Indeed. There's substantially no difference when compiling for size or performance, though old compilers don't take advantage of modern optimization techniques which might expand the code.
|
The main differences on the 68k when optimizing for performance vs size are loop unrolling, function inlining and peephole optimizations. The basic structure and instructions are similar which is easy for compilers. Anytime code grows at higher optimization levels, the larger code size offsets other performance gains.
cdimauro Quote:
The small instruction set of RISC-V is just a joke: in the real world you need many more instructions to get better code density and/or better performance, which are the two most important metrics that processor vendors are always looking at.
So, having only 40 instructions in the base ISA is just pure marketing and something that only academics could be proud of ("the instruction set is reduced!").
Anyway, I don't pay so much attention to the number of instructions, because logic has become cheaper in recent years, and using more of it to improve code density and/or performance is The Right Thing To Do (e.g.: reading that a few gates are saved by some ISA decisions is simply ridiculous).
|
Yea. A really simple and tiny core is going to be 8-bit or 16-bit so a 32-bit or 64-bit ISA should have a more robust standard ISA taking advantage of standard features.
cdimauro Quote:
Yes, but usually prologues/epilogues (which are the parts of the code where it's generally needed to save & load registers) only use consecutive registers, so instructions like LMW/STM are enough.
AArch64 can do it only with two registers at a time, which is a (tiny?) compromise, but it's clearly handicapped from this PoV.
|
Standard LMW/STM instructions in hardware which handle consecutive registers would eliminate the prologues/epilogues but there would still be more stack data and memory traffic in some cases than a 68k style MOVEM with a bitmapped field for the registers.
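To illustrate the bitmapped field, here is a minimal Python sketch of how a MOVEM-style register mask packs any subset of the 16 GP registers into one 16-bit word (the function name is mine, and the reversed bit order 68k MOVEM uses in predecrement mode is deliberately ignored):

```python
# D0-D7 occupy bits 0-7, A0-A7 bits 8-15 (postincrement/control-mode order).
REGS = [f"D{i}" for i in range(8)] + [f"A{i}" for i in range(8)]

def movem_mask(regs):
    """Return a 16-bit register mask for a list of 68k register names."""
    mask = 0
    for r in regs:
        mask |= 1 << REGS.index(r)
    return mask

# A typical callee-saved set, D2-D7/A2-A5, still fits one instruction:
print(hex(movem_mask(["D2", "D3", "D4", "D5", "D6", "D7",
                      "A2", "A3", "A4", "A5"])))  # 0x3cfc
```

The mask fits because there are only 16 GP registers; the same trick cannot fit 32 registers into a 32-bit fixed-length RISC encoding, as noted above.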
I think the AArch64 load and store register pairs is a good compromise. It is simple enough that it can have good performance and load and store of register pairs are common for more than function entry and exit. It does not completely eliminate the calls to prologues/epilogues but I expect it significantly reduces them.
cdimauro Quote:
Indeed. That was a good design. By chance (since it's the legacy of the 8086), but a good design nonetheless.
|
Intel did a nice job of turning a bank switching handicap into a security feature. They often had better marketing than Motorola even though they were dishonest at times.
cdimauro Quote:
Which is ok for a 64-bit ISA with a 64-bit address space.
|
Except that the address bit using TrustZone technology for the ARMv8-M security feature is for the Cortex-M with a 32-bit address space.
cdimauro Quote:
It depends on how you compile your code. You can have a stack which is completely isolated, with the compiler avoiding sharing its memory with other processes (ad hoc memory will be allocated for sharing such data).
Normally, yes: the stack can be shared across all threads of a process.
|
The AmigaOS usually allocates the stack so some mechanism of telling the OS to allocate a private stack would be necessary when no stack data is shared. It would not be rocket science to add but would change Amiga programming norms.
cdimauro Quote:
No. From what I know, the MMU is only used for mapping non-allocated memory and catching those accesses. Regular data (including the stack) is always shared. I don't know if the code segments/pages are protected in some way (e.g.: not writable).
|
I believe AmigaOS 4 protects code and the zero page with the MMU as well. Just this amount of MMU protection and bug detection is important and effective. I expect the order of importance to be the following.
1. zero page (detects small offset access from a null pointer)
2. unused address space (large empty space on most 68k Amigas)
3. unallocated memory (detect access of unallocated and freed memory but performance issues?)
4. code (relatively small on the 68k Amiga)
5. kickstart/ROM (write protection/detection)
As I recall for some PPC AmigaOS 4 hardware, all code is grouped together which may reduce MMU pages used but also may eliminate PC relative data access. Fat PPC code protection/detection would be more important and unused address space detection less important.
Last edited by matthey on 01-Jul-2025 at 12:07 PM. Last edited by matthey on 01-Jul-2025 at 04:54 AM.
|
| | Status: Offline |
| | cdimauro
|  |
Re: The (Microprocessors) Code Density Hangout Posted on 2-Jul-2025 4:54:05
| | [ #313 ] |
| |
 |
Elite Member  |
Joined: 29-Oct-2012 Posts: 4593
From: Germany | | |
|
| @matthey
Quote:
matthey wrote: cdimauro Quote:
Slide 14 shows that Thumb-2 is better than Thumb when compiling for code size (-Os), which is the case when talking about code density.
It's only slightly worse than Thumb when compiling for performance (-O2), which is largely acceptable (there's a small difference in size, but a HUGE gain in performance).
|
Not all code density comparisons use -Os. It would be good if both -Os and -O2 are compared as in the ARM slide show but it would be good to give the hardware configurations. The hardware configuration is very important, especially in regard to memory bandwidth.
Design and evaluation of compact ISA extensions https://ic.unicamp.br/~eduardo/publications/2016MICRO.pdf Quote:
ARM introduced Thumb as the first 16-bit extension in ARM7. Later on, Thumb2 was released and superseded initial Thumb, introducing additional features. Thumb2 enabled ARM processors are capable of running code in both 32 and 16 bits modes and allow subroutines of both types to share the same address space. Mode exchange is achieved during runtime through BX and BLX instructions: branch and call instructions that flip the current mode bit in a special processor register. A group of only 8 registers including the stack pointer and link registers are visible, but the remaining registers can also be accessed implicitly or through other special instructions. Results presented by ARM for Thumb, show a compression ratio ranging from 55% to 70%, with an overall performance gain of 30% for 16 bit buses and 10% loss for 32 bit ones.
|
The original 32-bit ARM ISA was not competitive in the embedded market with the common and cheaper 16-bit memory and data bus. The Thumb ISA was a game changer for ARM to be able to compete in the embedded market. The 68k likely had at least a 30% performance advantage over the original ARM ISA as it can efficiently support a 16-bit bus and has super code density like Thumb. The 68k likely had a larger advantage as it does not have the increase in instructions and memory traffic like Thumb, and the 68k did not multiplex the address and data busses like ARM. The 68k could have easily had a 50% performance advantage over the original ARM ISA using a 16-bit multiplexed bus. Thumb-2 makes it worthwhile to use Thumb-2 most of the time over the original ARM ISA for a 32-bit bus too, which is why the famous and hyped original ARM ISA is dead.
If compilers were smart enough to generate the smallest code for an embedded target using a 16-bit bus and -O2 instead of unrolling loops and inlining functions, then Thumb-2 may be better in every case. If Thumb has smaller code than Thumb-2 in some cases with -O2 and a 16-bit bus, then it may be higher performance in some cases. There are other factors as well like how much caches are used, the performance of the memory and whether the busses are multiplexed so it may not be enough to just know there is a 16-bit bus. Now the question is whether the ARM slideshow used 16-bit or 32-bit bus/memory. Using a 32-bit bus/memory would likely show a larger performance gain for Thumb-2 over Thumb and a 32-bit bus/memory has become more common for embedded use. |
Correct. Having a 16 or 32-bit data bus greatly influences the results.
Nevertheless, my guesstimate is that Thumb-2 performs better than Thumb on a 16-bit bus, when compiled for performance. The code size is only a little bit worse, which means that even using 32-bit instructions there's not so much difference from this PoV. However, Thumb requires at least two 16-bit instructions to emulate such 32-bit ones, which means that it suffers a considerable drop in performance. Overall, the usage of 32-bit instructions should greatly compensate for the slight increase of code size in Thumb-2 (a similar thing happened with x86 -> x86-64 and ARM/Thumb-2 -> AArch64: fatter code, but overall better performance due to fewer executed instructions). Quote:
cdimauro Quote:
Yes, but usually prologues/epilogues (which are the parts of the code where it's generally needed to save & load registers) only use consecutive registers, so instructions like LMW/STM are enough.
AArch64 can do it only with two registers at a time, which is a (tiny?) compromise, but it's clearly handicapped from this PoV.
|
Standard LMW/STM instructions in hardware which handle consecutive registers would eliminate the prologues/epilogues but there would still be more stack data and memory traffic in some cases than a 68k style MOVEM with a bitmapped field for the registers. |
I expect those cases to be rare in real code (generated by compilers). Even before calling a function which uses some registers which we have to save, a good compiler arranges the (used) registers that the caller has to save, so that they are in sequence and can be pushed to the stack with a single STM instruction. Quote:
| I think the AArch64 load and store register pairs is a good compromise. It is simple enough that it can have good performance and load and store of register pairs are common for more than function entry and exit. It does not completely eliminate the calls to prologues/epilogues but I expect it significantly reduces them. |
I don't think so. This works, for example, when calling functions which have up to two registers to be saved, because the callee has up to two arguments and doesn't trash other registers (which the caller uses). But in all other cases (more than two arguments and/or more used registers to be saved) you'll see that more LD/ST instructions are needed, which penalizes AArch64.
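The instruction-count arithmetic behind this can be sketched with a rough Python model (the function and style names are my labels; stack-pointer adjustment instructions and alignment padding are ignored):

```python
import math

def spill_instructions(n_regs, style):
    """Instructions needed to save n consecutive registers in a prologue."""
    if style == "movem":   # 68k MOVEM / PPC STMW: one instruction, any count
        return 1
    if style == "pair":    # AArch64 STP / APX double-register push: 2 at a time
        return math.ceil(n_regs / 2)
    if style == "single":  # classic RISC: one store per register
        return n_regs
    raise ValueError(style)

# Saving 2, 6 and 10 registers under each scheme:
for n in (2, 6, 10):
    print(n, spill_instructions(n, "movem"),
          spill_instructions(n, "pair"), spill_instructions(n, "single"))
```

Register pairs halve the count relative to single stores, but the gap to a single load/store-multiple instruction still grows linearly with the register count.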
I've disassembled enough x86 and x86-64 binaries and seen so many sequences of PUSH and POP instructions (far fewer on the latter, yet still many), and that's the reason why I've introduced some instructions to deal with that.
Taking a look at the slides which I've shared here showing some statistics on my NEx64T architecture, you can see the one about the code size: the great reduction which is visible with the last bar ("Quickcall") is due to the special instruction which I've introduced and that takes care of PUSHing and POPping up to a certain number of registers. Which means that I can cover only some specific patterns of those sequences, and not all of them. Yet, it produces a significant reduction of the code. A general PUSHM/POPM would have allowed squeezing out much more.
Now think about AArch64 (or the upcoming Intel APX, which has double-register PUSH and POP instructions, so similar in concept to AArch64 in those cases), and you can imagine what happens with this ISA. ARM wanted to greatly simplify its architecture by avoiding the ARM/Thumb/Thumb-2 load/store/push/pop multiple registers, but it has to pay the price for that in terms of worse code density... Quote:
cdimauro Quote:
Which is ok for a 64-bit ISA with a 64-bit address space.
|
Except the address bit using TrustZone technology for ARMv8-M security feature is for the Cortex-M with a 32-bit address space. |
Which is a general problem, since it limits it to a 31-bit effective address space (Amiga OS-style). Quote:
cdimauro Quote:
It depends on how you compile your code. You can have a stack which is completely isolated, with the compiler avoiding sharing its memory with other processes (ad hoc memory will be allocated for sharing such data).
Normally, yes: the stack can be shared across all threads of a process.
|
The AmigaOS usually allocates the stack so some mechanism of telling the OS to allocate a private stack would be necessary when no stack data is shared. It would not be rocket science to add but would change Amiga programming norms. |
Yes, it's not possible if the goal is to keep backward-compatibility.
It's like OS4: only new applications could take advantage of the MEMF_SHARED & co., but the vast majority of the software is the Amiga one, for which only the binaries are available. Quote:
cdimauro Quote:
No. From what I know, the MMU is only used for mapping non-allocated memory and catching those accesses. Regular data (including the stack) is always shared. I don't know if the code segments/pages are protected in some way (e.g.: not writable).
|
I believe AmigaOS 4 protects code and the zero page with the MMU as well. Just this amount of MMU protection and bug detection is important and effective. I expect the order of importance to be the following.
1. zero page (detects small offset access from a null pointer)
2. unused address space (large empty space on most 68k Amigas)
3. unallocated memory (detect access of unallocated and freed memory but performance issues?)
4. code (relatively small on the 68k Amiga)
5. kickstart/ROM (write protection/detection) |
Makes sense, but point 3 would be feasible only for freed memory which fully covers a single MMU page. As we know, the Amiga OS has an 8 byte granularity for memory allocations (and that's one of the reasons for its very small footprint), but this greatly fragments memory, and freed memory might be scattered across several "memory page pools".
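The page-granularity constraint can be checked numerically. A sketch assuming 4KiB MMU pages (the function name and addresses are illustrative):

```python
PAGE = 4096  # assumed MMU page size

def protectable_pages(addr, size):
    """Number of whole pages fully covered by a freed block [addr, addr+size);
    only these could be unmapped to trap accesses to freed memory."""
    first = -(-addr // PAGE) * PAGE        # round block start up to a page
    last = (addr + size) // PAGE * PAGE    # round block end down to a page
    return max(0, (last - first) // PAGE)

# With 8-byte allocation granularity, small freed blocks cover no page at all:
print(protectable_pages(0x1008, 24))         # 0
# Only large frees expose whole pages that the MMU could protect:
print(protectable_pages(0x1008, 3 * PAGE))   # 2
```

So the typical small, unaligned Amiga allocation yields nothing the MMU can trap, which is why point 3 only pays off for large frees.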
Regarding the code, and as I've said, I don't know if there is a protection like that. Quote:
| As I recall for some PPC AmigaOS 4 hardware, all code is grouped together which may reduce MMU pages used but also may eliminate PC relative data access. Fat PPC code protection/detection would be more important and unused address space detection less important. |
I expect as well that OS4 groups at least the used code, otherwise there'd be too much wasted memory. PC relative data accesses might still stay close, because we're talking about 4KB pages, so the code using some data stays at most 4kB - 4 bytes away -> it shouldn't be a general problem. |
| | Status: Offline |
| | Hammer
 |  |
Re: The (Microprocessors) Code Density Hangout Posted on 4-Jul-2025 4:28:48
| | [ #314 ] |
| |
 |
Elite Member  |
Joined: 9-Mar-2003 Posts: 6704
From: Australia | | |
|
| @matthey
Quote:
The PA-7200 is superscalar but that 2kiB on-chip assist cache and off chip L1, with the 2nd worst RISC code density after Alpha, is grossly inadequate for instruction supply. The PA-7200 is a good example of ignorance of the RISC instruction bottleneck. The design would have been better left as scalar and the transistors wasted on superscalar hardware reallocated to at least an 8kiB on-chip L1 instruction cache. The PA-7200 design uses a 5-stage pipeline and lacks dynamic branch prediction reducing the number of transistors compared to the 8-stage 68060. It also should have reduced the max clock speed compared to the 68060 which should have eventually been clocked around 150MHz.
|
The PA-7200 (PA-RISC 1.1) implements two floating-point multiple-operation instructions: FLOATING-POINT MULTIPLY/ADD (FMPYADD) and FLOATING-POINT MULTIPLY/SUBTRACT (FMPYSUB).
The PA-7200 (PA-RISC 1.1) increased the FPU register count, i.e. 32 FP64 or 64 FP32 registers. HP's PA-7200 wasn't the only PA-RISC 1.1 implementation, e.g. the Hitachi PA50.
Unlike the 68060 FPU, Hitachi PA50's and PA-7200's FPUs are fully pipelined.
The focus for PA-RISC is a floating-point workstation target. HP's PC range handles integer-heavy workloads.
The claim to fame for PA-RISC camp is MAX 32-bit SIMD (from PA-7100LC) and MAX2 64-bit SIMD (PA-RISC 2.0) multimedia extensions. MAX2 64-bit SIMD targeted MPEG1 and MPEG2.
PA-RISC 2.0 (January 1996, PA-8000) includes fused multiply–add (FMA) instructions.
In October 1996, the Pentium MMX (64-bit SIMD) was introduced as x86 PC's counter move.
MPEG-1 on a Pentium ~133 is doable in VideoCD resolution.
_________________
|
| | Status: Offline |
| | cdimauro
|  |
Re: The (Microprocessors) Code Density Hangout Posted on 4-Jul-2025 4:59:40
| | [ #315 ] |
| |
 |
Elite Member  |
Joined: 29-Oct-2012 Posts: 4593
From: Germany | | |
|
| Dammit. Another wall-of-non-sense on a thread which talks about CODE DENSITY... |
| | Status: Offline |
| | matthey
|  |
Re: The (Microprocessors) Code Density Hangout Posted on 9-Jul-2025 14:22:07
| | [ #316 ] |
| |
 |
Elite Member  |
Joined: 14-Mar-2007 Posts: 2828
From: Kansas | | |
|
| The new Embench benchmark is kind of like a mini SPEC benchmark for embedded cores and includes code sizes (TEXT segment only).
https://www.embench.org/
RISC pioneer David Patterson helped develop the new benchmark and wrote articles about it.
Embench: Recruiting for the Long Overdue and Deserved Demise of Dhrystone as a Benchmark for Embedded Computing https://www.sigarch.org/embench-recruiting-for-the-long-overdue-and-deserved-demise-of-dhrystone-as-a-benchmark-for-embedded-computing/
Embench: A Modern Benchmark For Embedded Computing https://googlegroups.com/a/lists.librecores.org/group/embench/attach/e22e54324e58/Embench%20Journal%20Paper.pdf
There are Embench results for Cortex-M4, RI5CY RV32IMC and SweRV-EH2 RV32IMACZb cores using both GCC -O2 and -Os optimizations.
https://github.com/embench/embench-iot-results#user-content-Results_sorted_by_Embench_size_score
The RI5CY RV32IMC benchmark results are only 5% larger than for the Cortex-M4 with -Os while the advantage for the Cortex-M4 (full Thumb+Thumb-2 ISA) grows for -O2. For some reason, the SweRV-EH2 RV32IMACZb size results using a newer version of GCC and -Os are worse than for the RI5CY RV32IMC core but the newer extensions improve the -O2 results.
The Embench benchmark was mentioned in an interesting paper about developing a core that could use multiple ISAs.
Dual Instruction Set on a Single Microarchitecure https://repository.tudelft.nl/file/File_3c111f43-5d2a-47d2-bf1b-7398f4f03d37
Other than microarchitecture being spelled incorrectly in the title, it is an easy to understand paper about a core that has a decoder for both 32-bit ARMv4 and RV32IM instructions to execute from the same pipeline.
ISA | Unique instructions
80286 | 357 (https://arstechnica.com/gadgets/2022/09/a-history-of-arm-part-1-building-the-first-chip/)
ARMv7A | 189 (Thumb & Thumb-2 standard except for tiny Cortex-M cores, source paper above)
ARMv6 | 116 (Thumb-2 introduced but optional, source paper above)
RV32IMC | 88 (https://en.wikipedia.org/wiki/RISC-V)
ARMv5 | 72 (source paper above)
ARMv4 | 63 (ARMv4T introduces Thumb, source paper above)
68000 | 56 (tiny compressed complex instruction set, https://en.wikipedia.org/wiki/Motorola_68000)
ARMv1 | 45 (https://arstechnica.com/gadgets/2022/09/a-history-of-arm-part-1-building-the-first-chip/)
RV32IM | 45 (48 in RV32I v2.1 + M v2.0 from https://en.wikipedia.org/wiki/RISC-V)
RV32I | 37 (40 in v2.1 from https://en.wikipedia.org/wiki/RISC-V)
The ISAs are compared as well as the logic differences to implement them in a simple 5-stage pipeline. RISC-V with fewer instructions used less area but not by much. RISC-V's elimination of the CC flags register surprisingly reduced the clock speed of the combined core in FPGA compared to ARM with a traditional CC flag register. ARM's more advanced addressing modes, conditional shifts, load/store multiple and 32*32=64+64 were not a problem for a simple RISC core, other than the register read and write ports required for max performance being 4 read and 2 write ports, more than the 2 read and 1 write ports of the combination core and classic RISC pipeline.

The combination core required an extra ALU which was also placed in the execute stage, perhaps not realizing the pipeline could be extended into a 7-stage CISC-like pipeline avoiding load-to-use stalls with the extra ALU two stages earlier. I proposed that the SiFive 7-series design could allow different ISAs and the CISC-like design would be more flexible for executing multiple ISAs. There was the 68000 microcode change for a similar CISC IBM ISA.

The RP2350 MCU has 2xCortex-M33 and 2xHazard3 RV32IMAC+ cores but the ARM and RISC-V cores can not be used at the same time. I believe the whole pipeline is switched rather than shared between ISAs like in the paper above. Sharing a pipeline would likely only be worthwhile with similar ISAs but developing a core that can be easily configured for different ISAs would have advantages. Testing and verification would increase though.
|
| | Status: Offline |
| | cdimauro
|  |
Re: The (Microprocessors) Code Density Hangout Posted on 10-Jul-2025 5:03:35
| | [ #317 ] |
| |
 |
Elite Member  |
Joined: 29-Oct-2012 Posts: 4593
From: Germany | | |
|
| @matthey
Quote:
I've known Embench for a long time and it's certainly a valuable benchmark, but it's very limited to the embedded market, and it currently only supports integers (no FP data). That's because the target is the very low-end embedded market (64kB Flash, 16kB RAM).
But for measuring code density it's good, although they decided to only measure the code size, ignoring the dataro segment, which is relevant because it takes space in Flash and it's also influenced by the architecture (e.g.: constants which are in the dataro segment on some architectures are embedded in the code segment on others, and vice versa). Quote:
There are Embench results for Cortex-M4, RI5CY RV32IMC and SweRV-EH2 RV32IMACZb cores using both GCC -O2 and -Os optimizations.
https://github.com/embench/embench-iot-results#user-content-Results_sorted_by_Embench_size_score
The RI5CY RV32IMC benchmark results are only 5% larger than for the Cortex-M4 with -Os while the advantage for the Cortex-M4 (full Thumb+Thumb-2 ISA) grows for -O2. For some reason, the SweRV-EH2 RV32IMACZb size results using a newer version of GCC and -Os are worse than for the RI5CY RV32IMC core but the newer extensions improve the -O2 results. |
The reason is very simple: the SweRV-EH2 RV32IMACZb is a RISC-V with the regular extensions used by almost everybody (with the addition of the atomic instructions, which usually aren't required on the low-end embedded market), plus a subset of the bit manipulation instructions, whereas the RI5CY RV32IMC is the PULP extension of RISC-V developed at ETH Zurich, which sports several very useful extensions to the ISA.
https://pulp-platform.org/implementation.html Quote:
| implements the RV32-IMC, has an optional 32-bit FPU supporting the F extension and instruction set extensions for DSP operations, including hardware loops, SIMD extensions, bit manipulation and post-increment instructions. |
which speaks for itself. That's why it gets much better results than regular RISC-V and comes so close to ARM/Thumb-2. Quote:
The Embench benchmark was mentioned in an interesting paper about developing a core that could use multiple ISAs.
Dual Instruction Set on a Single Microarchitecure https://repository.tudelft.nl/file/File_3c111f43-5d2a-47d2-bf1b-7398f4f03d37
Other than microarchitecture being misspelled in the title, it is an easy-to-understand paper about a core with decoders for both 32-bit ARMv4 and RV32IM instructions that execute from the same pipeline.
ISA | Unique instructions
80286 | 357 (https://arstechnica.com/gadgets/2022/09/a-history-of-arm-part-1-building-the-first-chip/)
ARMv7A | 189 (Thumb & Thumb-2 standard except for tiny Cortex-M cores; source: paper above)
ARMv6 | 116 (Thumb-2 introduced but optional; source: paper above)
RV32IMC | 88 (https://en.wikipedia.org/wiki/RISC-V)
ARMv5 | 72 (source: paper above)
ARMv4 | 63 (ARMv4T introduces Thumb; source: paper above)
68000 | 56 (tiny compressed complex instruction set; https://en.wikipedia.org/wiki/Motorola_68000)
ARMv1 | 45 (https://arstechnica.com/gadgets/2022/09/a-history-of-arm-part-1-building-the-first-chip/)
RV32IM | 45 (48 in RV32I v2.1 + M v2.0; https://en.wikipedia.org/wiki/RISC-V)
RV32I | 37 (40 in v2.1; https://en.wikipedia.org/wiki/RISC-V)
The ISAs are compared, as well as the logic needed to implement them in a simple 5-stage pipeline. RISC-V, with fewer instructions, used less area, but not by much. Surprisingly, RISC-V's elimination of the CC flags register reduced the clock speed of the combined core in FPGA compared to ARM's traditional CC flags register. ARM's more advanced addressing modes, conditional shifts, load/store multiple and 32*32=64+64 multiply-accumulate were not a problem for a simple RISC core, other than requiring 4 read and 2 write register ports for maximum performance, more than the 2 read and 1 write ports of the combination core and the classic RISC pipeline. The combination core required an extra ALU, which was also placed in the execute stage, perhaps without realizing that the pipeline could be extended into a 7-stage CISC-like pipeline with the extra ALU two stages earlier, avoiding load-to-use stalls. I proposed that the SiFive 7-series design could allow different ISAs, and the CISC-like design would be more flexible for executing multiple ISAs. There was also the 68000 microcode change for a similar CISC IBM ISA. The RP2350 MCU has 2x Cortex-M33 and 2x Hazard3 RV32IMAC+ cores, but the ARM and RISC-V cores cannot be used at the same time; I believe the whole pipeline is switched rather than shared between ISAs as in the paper above. Sharing a pipeline would likely only be worthwhile with similar ISAs, but developing a core that can easily be configured for different ISAs would have advantages, even though testing and verification would increase. |
The paper is quite interesting, but it also shows the limits of such dual-ISA support, because the two ISAs need to be largely similar. If there are too many differences, then it is no longer worthwhile.
It might make sense in the low-end embedded market, because the architectures used there are simple and usually run only at the "machine" level (e.g.: only the user-space execution mode is available). That's because supervisor/kernel mode and, in general, MMUs and other system-level elements can diverge a lot, up to the point that the common backend (and part of the frontend) becomes too fat and/or inefficient.
In fact, the paper also mentions Transmeta's Crusoe processors as a software-based translation platform, omitting key information about how they really worked (e.g.: the internal architecture). That is basically what people usually find when searching around: "software translation!". The reality is quite different, since those processors implemented parts of the x86 system level (the PMMU and some other components) in hardware, which are fundamental things that HEAVILY influence execution performance. Without that (e.g.: a completely different architecture relying purely on software translation of x86 instructions), the performance would have been miserable. |
| | Status: Offline |
| | Hammer
 |  |
Re: The (Microprocessors) Code Density Hangout Posted on 10-Jul-2025 13:29:26
| | [ #318 ] |
| |
 |
Elite Member  |
Joined: 9-Mar-2003 Posts: 6704
From: Australia | | |
|
| @cdimauro
With SIMD, operations are applied to multiple data elements per instruction, which is important for 3D games, while the 68060 version is scalar.
The focus for Amiga Hombre was texture-mapped 3D within a $50 BOM cost range. Amiga Hombre's primary use cases were the CD3D game console and a replacement for the A1200 (a desktop computer with a game-console bias). Unlike Apple's Mac, the Amiga didn't establish a large enough business customer base.
The code written for 32-bit MIPS-III is typically reduced by 40% in size when compiled for MIPS16 (Kissell 97).
Yet another US government-funded academic effort created another density-oriented RISC-V instruction set. For commercialized RISC-V, SiFive's 8-core P550 is available in laptops at the current time, while the 68060 is missing in action. That's extra aid for RISC-V, not counting other governments' support for it.
MIPS was US government-funded via US academia and DARPA before shifting to commercialization. China's state-owned Loongson shifted its MIPS architecture to the Loongson 3 5000 ISA, a MIPS and RISC-V blend.
There's always a cost-versus-performance trade-off; for this topic it's cost vs performance vs code density, and it can't be avoided. Without additional support, the 68K is dead.
Last edited by Hammer on 10-Jul-2025 at 01:47 PM. Last edited by Hammer on 10-Jul-2025 at 01:43 PM. Last edited by Hammer on 10-Jul-2025 at 01:39 PM.
_________________
|
| | Status: Offline |
| | matthey
|  |
Re: The (Microprocessors) Code Density Hangout Posted on 10-Jul-2025 22:23:04
| | [ #319 ] |
| |
 |
Elite Member  |
Joined: 14-Mar-2007 Posts: 2828
From: Kansas | | |
|
| cdimauro Quote:
I have known Embench for a long time and it's certainly a valuable benchmark, but it's limited to the embedded market and actually only supports integer workloads (no FP data). That's because the target is the very low-end embedded market (64kB Flash, 16kB RAM).
|
I had seen Embench mentioned before but never looked into it. A floating-point version of the benchmark was supposedly being worked on, but it should have been out by now if there were no problems. Unfortunately, it looks like support for Embench has stalled. Maybe they target too small a memory footprint to be popular, even though the suite is useful for larger footprints too? The "64kB Flash" is NOR flash or ROM, which could greatly supplement the 16kiB minimum RAM. I doubt anyone would write the benchmark to NOR flash or ROM just to run it, but maybe the idea is that support code resides there?
cdimauro Quote:
But for measuring code density it is good, although they decided to measure only the code size, ignoring the read-only data segment, which is relevant because it takes space in the Flash and is also influenced by the architecture (e.g.: constants which live in the read-only data segment on some architectures are embedded in the code segment on others, and vice versa).
|
I agree about the data, which is an important metric too, but it is nice to see the man who coined the term RISC and made it popular recognize that code density is important after all.
cdimauro Quote:
The paper is quite interesting, but it also shows the limits of such dual-ISA support, because the two ISAs need to be largely similar. If there are too many differences, then it is no longer worthwhile.
It might make sense in the low-end embedded market, because the architectures used there are simple and usually run only at the "machine" level (e.g.: only the user-space execution mode is available). That's because supervisor/kernel mode and, in general, MMUs and other system-level elements can diverge a lot, up to the point that the common backend (and part of the frontend) becomes too fat and/or inefficient.
|
Modern CPU cores often support multiple ISAs already. ARM had the original 32-bit ARM, Thumb, Thumb-2 and AArch64, for 4 ISAs to support. Some older ARM cores also supported the Jazelle ISA, which allowed them to execute Java bytecode. ARM had a large variety of ISAs even though most had similarities, perhaps no more different from each other than 32-bit ARM is from RISC-V. The x86 ISA with segments and 16-bit modes is still supported on 64-bit x86-64 cores, even though the additional logic paths may reduce the maximum clock speed. Area certainly increases too with the larger ISAs, though most of it goes to the newer ones. ARM is jettisoning most of its baggage ISAs, while Intel has been unsuccessful at jettisoning its ISA baggage. Multiple 32-bit ARM ISAs were useful, but ARM went all in on one large, standard, all-purpose 64-bit ISA instead.
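The shared-pipeline idea from the dual-ISA paper can be sketched in miniature: two front-end decoders translating different encodings into one common micro-op format consumed by a single back end. The bit layouts below are invented stand-ins for illustration, not real ARM or RISC-V encodings.

```python
# Toy sketch of the dual-ISA paper's idea: two front-end decoders feeding
# one shared back end via a common micro-op format. The bit layouts are
# invented stand-ins for illustration, NOT real ARM or RISC-V encodings.
from dataclasses import dataclass

@dataclass
class MicroOp:
    op: str           # "add" or "sub"
    dst: int          # destination register number
    src1: int
    src2: int
    sets_flags: bool  # ARM-style ops update CC flags; RISC-V-style ops do not

def _fields(word):
    # shared field extraction: [op:8][dst:8][src1:8][src2:8]
    op = "add" if (word >> 24) & 0xFF == 0 else "sub"
    return op, (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF

def decode_isa_a(word):
    """Decoder for the made-up flag-setting 'ISA A' (ARM-flavoured)."""
    op, dst, s1, s2 = _fields(word)
    return MicroOp(op, dst, s1, s2, sets_flags=True)

def decode_isa_b(word):
    """Decoder for the made-up flagless 'ISA B' (RISC-V-flavoured)."""
    op, dst, s1, s2 = _fields(word)
    return MicroOp(op, dst, s1, s2, sets_flags=False)

def execute(uop, regs):
    """One shared back end executes micro-ops from either decoder."""
    a, b = regs[uop.src1], regs[uop.src2]
    regs[uop.dst] = a + b if uop.op == "add" else a - b

regs = [0] * 256
regs[1], regs[2] = 7, 5
execute(decode_isa_a(0x00030102), regs)  # ISA A: add r3, r1, r2
print(regs[3])                           # -> 12
```

The `sets_flags` bit marks where the two fronts diverge: a flag-setting ISA needs CC state carried through the shared back end while a flagless one does not, which mirrors the CC-flag timing issue the paper reports.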
cdimauro Quote:
In fact, the paper also mentions Transmeta's Crusoe processors as a software-based translation platform, omitting key information about how they really worked (e.g.: the internal architecture). That is basically what people usually find when searching around: "software translation!". The reality is quite different, since those processors implemented parts of the x86 system level (the PMMU and some other components) in hardware, which are fundamental things that HEAVILY influence execution performance. Without that (e.g.: a completely different architecture relying purely on software translation of x86 instructions), the performance would have been miserable. |
Code morphing sounds like software, even though it can be assisted by hardware, as the paper demonstrates. Also, the VLIW hardware is much different from traditional instruction pipelines. Yes, extra hardware support was utilized by matching common processing and control resources. The ~$1 billion Transmeta experiment did not provide enough info to stop the $10+ billion Itanic mistake. At least the combined-pipeline experiment was cheap, and the solution may be more practical than VLIW "general purpose" cores.
|
| | Status: Offline |
| | matthey
|  |
Re: The (Microprocessors) Code Density Hangout Posted on 11-Jul-2025 0:14:53
| | [ #320 ] |
| |
 |
Elite Member  |
Joined: 14-Mar-2007 Posts: 2828
From: Kansas | | |
|
| Hammer Quote:
With SIMD, operations are applied to multiple data elements per instruction, which is important for 3D games, while the 68060 version is scalar.
|
Each 68060 instruction is scalar when considered individually, but the 68060 as a whole performs multiple instruction operations superscalar. At least for integer code, the superscalar 68060 could sometimes compete with PA-RISC's SIMD, which only supported 16-bit integer datatypes. Superscalar CISC is powerful for integer workloads, is more general purpose than SIMD instructions, and improves code density more, since most code is integer code.
Hammer Quote:
The focus for Amiga Hombre was texture-mapped 3D within a $50 BOM cost range. Amiga Hombre's primary use cases were the CD3D game console and a replacement for the A1200 (a desktop computer with a game-console bias). Unlike Apple's Mac, the Amiga didn't establish a large enough business customer base.
|
It does not matter how low the cost is if the performance is crap because of waiting on memory. PA-RISC's poor code density and tiny instruction cache limit performance. Every cache miss means many cycles of pipeline bubbles, as a new instruction must be pushed into the pipeline every cycle to maintain full pipeline performance. It is possible to pipeline memory accesses, but that requires more pipeline stages, lengthening the pipeline and increasing the branch penalty for multi-cycle instruction fetches, and both the branch penalty and the load-to-use penalty for multi-cycle pipelined data accesses with most RISC core designs. Also, all memory would need to have the same access time, which decreases the flexibility of memory unless it is SRAM in an MCU; it was/is not practical. Pipelining cache accesses is practical, but still requires hits in the instruction cache so new instructions can be fed into the instruction pipeline. It is possible to speculate past data accesses and do extra work with an OoO core, but every cycle still requires a new instruction to be fed into the instruction pipeline. It may be possible to speculate down the other branch direction if those instructions are cached, but with modern branch prediction that will be wrong most of the time. L1 instruction cache hit rates are very important for performance, and PA-RISC would need a 32kiB instruction cache to match the performance of the 68060's 8kiB instruction cache due to code density and cache misses.
The RISC-V Compressed Instruction Set Manual, Version 1.7 https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-157.pdf Quote:
The philosophy of RVC is to reduce code size for embedded applications and to improve performance and energy-efficiency for all applications due to fewer misses in the instruction cache. Waterman shows that RVC fetches 25%-30% fewer instruction bits, which reduces instruction cache misses by 20%-25%, or roughly the same performance impact as doubling the instruction cache size.
|
Without pipelining the instruction fetch and lengthening the pipeline, the increased access time of a large 32kiB instruction cache may also have limited the clock speed of the PA-RISC core.
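The effect of instruction cache misses on a pipeline that needs one new instruction per cycle can be put in rough numbers with the standard effective-CPI formula; the miss rates and penalty below are illustrative assumptions, not measurements of any real core.

```python
# Back-of-the-envelope model of how I-cache misses stall a pipeline that
# must be fed one new instruction per cycle. All numbers are illustrative
# assumptions, not measurements of any real core.

def effective_cpi(base_cpi, icache_miss_rate, miss_penalty_cycles):
    """Average cycles per instruction including I-cache miss stalls."""
    return base_cpi + icache_miss_rate * miss_penalty_cycles

# A denser ISA packs more instructions per cache line, so for the same
# cache size it misses less often (the RVC paper claims 20%-25% fewer misses).
dense  = effective_cpi(1.0, 0.02, 20)  # assumed dense-ISA miss rate: 2%
sparse = effective_cpi(1.0, 0.04, 20)  # assumed sparse-ISA miss rate: 4%
print(f"dense:  {dense:.2f} CPI")   # dense:  1.40 CPI
print(f"sparse: {sparse:.2f} CPI")  # sparse: 1.80 CPI
```

Under these assumptions, halving the miss rate, which the RVC quote likens to doubling the cache size, cuts the stall component of the CPI in half.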
Hammer Quote:
The code written for 32-bit MIPS-III is typically reduced by 40% in size when compiled for MIPS16 (Kissell 97).
|
MIPS16 was replaced because it suffers the same problem as Thumb and SuperH: 16-bit-only instructions increase the number of instructions executed, and there was mode-switching overhead like Thumb's.
Profile Guided Selection of ARM and Thumb Instructions https://www2.cs.arizona.edu/~arvind/papers/lctes02.pdf Quote:
While the use of Thumb instructions generally gives smaller code size and lower instruction cache energy, there are certain problems with using the Thumb mode. In many cases the reductions in code size are obtained at the expense of a significant increase in the number of instructions executed by the program. In our experiments this increase ranged from 9% to 41%. In fact in case of one of the benchmarks, the increase in dynamic instruction count was so high that instead of obtaining reductions in cache energy used, we observed an increase in the total amount of energy expended by the instruction cache.
|
They were still worth it for 16-bit memory, as instructions could be supplied faster; it was better than feeding a bubble into the instruction pipeline every other cycle without an instruction cache. MIPS16 was replaced like Thumb, as they were specialized ISAs for 16-bit memory in the embedded market, trying to compete with the 68k, where most instructions are 16 bits. Thumb-2 and microMIPS were successors with 16-bit and 32-bit instruction encodings that had tolerable general-purpose performance, even though the 68k still executes fewer instructions in most cases. microMIPS looks inferior to Thumb-2.
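The trade-off in the quote above, fewer bits per instruction versus more instructions executed, is easy to model. The 25% dynamic instruction count increase used here is an assumed value picked from the paper's reported 9%-41% range.

```python
# Model of the Thumb trade-off quoted above: halving instruction width while
# executing more instructions. The 25% dynamic instruction count increase is
# an assumed value picked from the paper's reported 9%-41% range.

def fetch_bits(dynamic_insn_count, avg_insn_width_bits):
    """Total instruction bits fetched over a run of the program."""
    return dynamic_insn_count * avg_insn_width_bits

arm_bits   = fetch_bits(1.00, 32)  # baseline ARM: 32-bit instructions
thumb_bits = fetch_bits(1.25, 16)  # Thumb: 16-bit, but +25% instructions
print(thumb_bits / arm_bits)       # -> 0.625, still 37.5% fewer bits fetched
```

Even at the top of the reported range (1.41 x 16 = 22.56 bits), Thumb fetches fewer bits than ARM, which is why it still pays off when the memory bus is 16 bits wide; the cost shows up instead in cycles and cache energy spent executing the extra instructions.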
Code Size – a comprehensive comparison of microMIPS32 and Thumb code size using many Megabytes of customer code https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/code-size-a-comprehensive-comparison-of-micromips32-and-thumb-code-size-using-many-megabytes-of-customer-code Quote:
Over the last 20 years or so, ARM has accumulated many megabytes of test source code, with a substantial amount of that code coming from ARM customers who have worked with us to optimize the delicate trade-off between code size and performance. In fact in the early days of portable consumer devices a reduction in code size of just 1 or 2% could be the difference between whether or not your product included that new killer feature. As a result our CPU product development and compiler teams have lived and breathed in this environment for many years. During the development of new CPU products and new versions of the compiler, the substantial database of test code is regularly used to check that ARM has the correct trade-off between performance and code size. Below I present ARM’s findings from that source code database.
...
Summary of results
ARM’s results show that on this large sample of source code, the microMIPS32 object code was on average (taking the geometric mean) 23.5% larger than the same code compiled for Cortex-M3.
The results ranged from 1.8% larger (for “Customer 1”) to 57% larger (for “Customer 28”). I should also note that the code size for Cortex-M3 was significantly smaller for large object code samples (such as armcc at 405kB, showing 30% smaller code than microMIPS32) right down to small object code (such as gzip at 26kB, showing 32% smaller code than microMIPS32).
|
Performance is important too, but "a reduction in code size of just 1 or 2%" mattered to ARM before they removed Thumb-2 from their Cortex-A cores. AArch64 code is ~47% larger than Thumb-2 code in Vince Weaver's code density contest.
https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vTyfDPXIN6i4thorNXak5hlP0FQpqpZFk2sgauXgYZPdtJX7FvVgabfbpCtHTkp5Yo9ai6MhiQqhgyG/pubhtml?gid=909588979&single=true&pli=1
Some businesses lose focus of what is important to their embedded customers.
Vince Weaver's code density contest results support ARM's claim that Thumb-2 has better code density than microMIPS.
http://deater.net/weave/vmwprod/asm/ll/ll.html
It looks like MIPS did not have the code density to compete.
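The ARM blog's "taking the geometric mean" is the right way to average size ratios across many programs, since it treats doubling and halving symmetrically where an arithmetic mean would not. A minimal sketch, with per-program ratios made up purely for illustration:

```python
import math

# Geometric mean of per-program code-size ratios: one program at 2x and
# another at 0.5x average out to exactly 1.0, which an arithmetic mean
# would not. The ratios below are invented purely for illustration.

def geometric_mean(ratios):
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# hypothetical per-program ratios (other ISA size / Thumb-2 size)
size_ratios = [1.018, 1.30, 1.57, 1.10]
print(f"geomean: {geometric_mean(size_ratios):.3f}")
```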
Hammer Quote:
Yet another US government-funded academic effort created another density-oriented RISC-V instruction set. For commercialized RISC-V, SiFive's 8-core P550 is available in laptops at the current time, while the 68060 is missing in action. That's extra aid for RISC-V, not counting other governments' support for it.
MIPS was US government-funded via US academia and DARPA before shifting to commercialization. China's state-owned Loongson shifted its MIPS architecture to the Loongson 3 5000 ISA, a MIPS and RISC-V blend.
|
MIPS was not competitive at developing CPU ISAs, and now they are owned by GlobalFoundries.
GlobalFoundries to Acquire MIPS to Accelerate AI and Compute Capabilities https://investors.gf.com/news-releases/news-release-details/globalfoundries-acquire-mips-accelerate-ai-and-compute
The acquisition was just announced this week. I hope they are better at AI development than at CPU ISA development.
Hammer Quote:
There's always a cost-versus-performance trade-off; for this topic it's cost vs performance vs code density, and it can't be avoided. Without additional support, the 68K is dead.
|
Engineering is the art of compromise. Motorola thought the way to compete was to castrate the 68k into RISC, when they should have been enhancing the 68k ISA and improving its code density to stay ahead of Thumb-2. Now ARM thinks it has won and no longer needs Thumb-2 code density, except for its small Cortex-M cores. The 68k has a great combination of performance, code density and ease of use that none of the RISC ISAs could match, and Motorola threw the baby out with the bathwater, for ugly fat PPC no less. The 68k should be turned into a Cinderella story, as the Amiga depends on it.
Last edited by matthey on 11-Jul-2025 at 12:23 AM. Last edited by matthey on 11-Jul-2025 at 12:18 AM.
|
| | Status: Offline |