kolla
Re: The (Microprocessors) Code Density Hangout | Posted on 24-Aug-2025 21:44:24 | [ #401 ]
Elite Member | Joined: 20-Aug-2003 | Posts: 3507 | From: Trondheim, Norway

Suppose this thread is as good as any to make people aware...
The fundraiser to bring the AC68080 out of its "EC" state and deliver a "full" AC68080 with an MMU was met within just a few days. Interesting times ahead, though I suspect it is much too late to really draw the attention of "the right stuff", and Gunnar has of course done himself a disservice by running around insisting that "nothing and no-one needs an MMU", just as he did earlier with "nothing and no-one needs an FPU".
https://www.gofundme.com/f/memory-management-unit-mmu-for-apollo-v4-series
Hammer
Re: The (Microprocessors) Code Density Hangout | Posted on 25-Aug-2025 1:42:29 | [ #402 ]
Elite Member | Joined: 9-Mar-2003 | Posts: 6582 | From: Australia

@kolla
Quote:
kolla wrote: Suppose this thread is as good as any to make people aware...
The fundraiser to bring the AC68080 out of its "EC" state and deliver a "full" AC68080 with an MMU was met within just a few days. Interesting times ahead, though I suspect it is much too late to really draw the attention of "the right stuff", and Gunnar has of course done himself a disservice by running around insisting that "nothing and no-one needs an MMU", just as he did earlier with "nothing and no-one needs an FPU".
https://www.gofundme.com/f/memory-management-unit-mmu-for-apollo-v4-series
|
AC68080 has increased general-purpose performance, and SAGA comes with RTG's chunky graphics. Why not use Linux 68K on it?
There's Vamos, which allows AmigaOS command-line software to run on Linux. There's potential in this idea, i.e. an NT'ed AmigaOS.
bhabbott
Re: The (Microprocessors) Code Density Hangout | Posted on 25-Aug-2025 2:25:39 | [ #403 ]
Cult Member | Joined: 6-Jun-2018 | Posts: 567 | From: Aotearoa

@kolla
Gunnar won't even put Apollo Shield into V2, so Vampire is now dead to me.
Hammer
Re: The (Microprocessors) Code Density Hangout | Posted on 25-Aug-2025 2:26:29 | [ #404 ]
Elite Member | Joined: 9-Mar-2003 | Posts: 6582 | From: Australia

@matthey
Quote:
For a 16-bit base VLE, the other common option is tiered registers with the lower 3b reg encoding of 16-bit instructions accessing the first 8 registers and 32-bit encodings using 4b reg encodings. This initially appears cleaner as the visible Dn/An register split can disappear but commonly used special registers like the SP need to be mapped to the low 8 registers which does not look as clean, leaves fewer registers available and I do not believe the code density is as good. Some of these ISAs create more instructions like PUSH and POP in the case of the SP but this reduces orthogonality. The x86-64 ISA is similar but poorly implemented as a couple of the lower 8 registers are not GP despite having PUSH and POP instructions leaving only 6 GP registers before needing a prefix to access 8 more GP registers. Fortunately for x86-64, CISC cores with mem-reg and reg-mem memory accesses usually do not need as many registers and load-to-use stalls are avoided, one of the keys why x86 with 6 GP registers stayed ahead in performance of fat RISC with 32 GP registers like Alpha, MIPS, PPC, etc. Fat RISC code density was another reason. RISC fanatics are slow learners but the failures eventually disappeared leaving more competitive RISC architectures. |
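To put rough numbers on the encoding-space trade-off described in that quote, here is a back-of-the-envelope sketch (the helper function is purely illustrative and not tied to any specific ISA): a 16-bit format with two 3-bit register fields keeps 10 bits, i.e. 1024 encodings, for the opcode and everything else, while letting both fields reach all 16 registers drops that to 256.

# Opcode space left in a 16-bit instruction word after two register fields.
# Illustrative only; real encodings also spend bits on size, mode, etc.
def opcode_space(word_bits, reg_fields, reg_field_bits):
    remaining = word_bits - reg_fields * reg_field_bits
    return remaining, 2 ** remaining

print(opcode_space(16, 2, 3))  # (10, 1024) - 3-bit fields, only 8 registers reachable
print(opcode_space(16, 2, 4))  # (8, 256)   - 4-bit fields reach all 16 registers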
Unlike the 68060 vs 68LC060 situation, the Pentium guaranteed x87 registers for baseline PCs, a baseline that major AAA PC games such as Tomb Raider and Quake enforced in practice.
PC DOS Tomb Raider runs poorly on the x87-less AMD 486SX2-66 (overclocked to 80 MHz) with a 3dfx Voodoo: https://www.youtube.com/watch?v=XHqLYzqZciM Various high-clock-speed 486 models are compared with PC DOS Tomb Raider.
IA-32's 8-GPR limitation pushed x86 implementations toward fast data transfers with the x87 and XMM registers.
ALU operations that work directly on memory operands lessened the demand on the small GPR count.
The actual microarchitecture implementation is equally important, e.g. the AMD K5 and Cyrix 6x86 have stronger integer pipelines, while the Intel Pentium has a stronger x87 pipeline.
https://barefeats.com/doom3.html
MAC GAME PERFORMANCE BRIEFING FROM THE DOOM 3 DEVELOPERS
Glenda Adams, Director of Development at Aspyr Media, has been involved in Mac game development for over 20 years. I asked her to share a few thoughts on what attempts they had made to optimize Doom 3 on the Mac and what barriers prevented them from getting it to run as fast on the Mac as in comparable Windows PCs. Here's what she wrote:
"Just like the PC version, timedemos should be run twice to get accurate results. The first run the game is caching textures and other data into RAM, so the timedemo will stutter more. Running it immediately a second time and recording that result will give more accurate results.
The performance differences you see between Doom 3 Mac and Windows, especially on high end cards, is due to a lot of factors (in general order from smallest impact to largest):
1. PowerPC architectural differences, including a much higher penalty for float to int conversion on the PPC. This is a penalty on all games ported to the Mac, and can't be easily fixed. It requires re-engineering much of the game's math code to keep data in native formats more often. This isn't 'bad' coding on the PC -- they don't have the performance penalty, and converting results to ints saves memory and can be faster in many algorithms on that platform. It would only be a few percentage points that could be gained on the Mac, so its one of those optimizations that just isn't feasible to do for the speed increase.
2. Compiler differences. gcc, the compiler used on the Mac, currently can't do some of the more complex optimizations that Visual Studio can on the PC. Especially when inlining small functions, the PC has an advantage. Add to this that the PowerPC has a higher overhead for functional calls, and not having as much inlining drops frame rates another few percentage points.
Microsoft's Visual Studio is first-party software for the Wintel platform.
cdimauro
Re: The (Microprocessors) Code Density Hangout | Posted on 25-Aug-2025 4:27:03 | [ #405 ]
Elite Member | Joined: 29-Oct-2012 | Posts: 4506 | From: Germany

@bhabbott
Quote:
bhabbott wrote: @cdimauro
Quote:
cdimauro wrote:
An 11%-13% code density improvement over Thumb-2 is a large difference. |
No, it's trifling. |
No, you simply (!) don't know what you're talking about.
It was already reported several times. There are studies which show that a 25-30% code density improvement is roughly equivalent to a system with HALF the code cache size. I repeat again for YOUR benefit: HALF the code cache.
11-13% is around HALF of that code density improvement. You should now figure out for yourself whether that is "trifling" (SIC!)...
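A rough sketch of the arithmetic behind these percentages (it assumes the common reading that an X% code density improvement means X% smaller code, and it restates the RISC-V RVC rule of thumb quoted later in this thread; the helper is purely illustrative):

# What an X% code density improvement means for binary size.
def relative_size(density_improvement):
    return 1.0 - density_improvement  # "X% better density" read as "X% smaller code"

for x in (0.11, 0.13, 0.25, 0.30):
    print(f"{x:.0%} denser -> binary is {relative_size(x):.0%} of the original size")

# Per the RVC manual quoted later in the thread: ~25-30% fewer fetched
# instruction bits cut I-cache misses by ~20-25%, "roughly the same performance
# impact as doubling the instruction cache size". An 11-13% reduction is about
# half as large a saving in fetched bits.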
Quote:
Actually, x86-64 was sporting the best results:
 |
Actually there's nothing in it (even if 'instruction count' has any relevance). |
It's the other very important metric when talking about computer architectures...
And again, it was already reported several times. Quote:
In the real world bloat swamps these trifling differences. |
In a sensible world, people should only talk about things they actually know. Quote:
I must correct you: it is only the ignorant who aren't interested. Quote:
I'll reveal a secret to you: 64-bit has limits. 2^64, precisely (but that's high-order math). Quote:
and no reason to rein in the bloat. |
And here comes again the magic word which Bruce repeats like a parrot whenever he's not able to accept reality (which is very different from the cave where he's living): bloat. |
cdimauro
Re: The (Microprocessors) Code Density Hangout | Posted on 25-Aug-2025 4:28:37 | [ #406 ]
Elite Member | Joined: 29-Oct-2012 | Posts: 4506 | From: Germany

@kolla
Quote:
kolla wrote: Suppose this thread is as good as any to make people aware...
The fundraiser to bring the AC68080 out of its "EC" state and deliver a "full" AC68080 with an MMU was met within just a few days. Interesting times ahead, though I suspect it is much too late to really draw the attention of "the right stuff", and Gunnar has of course done himself a disservice by running around insisting that "nothing and no-one needs an MMU", just as he did earlier with "nothing and no-one needs an FPU".
https://www.gofundme.com/f/memory-management-unit-mmu-for-apollo-v4-series |
I know, because I follow his forum. Anyway, it's off-topic: here we discuss code density (and memory footprint).
There's already a thread which was specifically created to talk about computer microarchitectures and similar things. |
cdimauro
Re: The (Microprocessors) Code Density Hangout | Posted on 25-Aug-2025 4:33:02 | [ #407 ]
Elite Member | Joined: 29-Oct-2012 | Posts: 4506 | From: Germany

@Hammer
Quote:
Hammer wrote: @matthey
Quote:
For a 16-bit base VLE, the other common option is tiered registers with the lower 3b reg encoding of 16-bit instructions accessing the first 8 registers and 32-bit encodings using 4b reg encodings. This initially appears cleaner as the visible Dn/An register split can disappear but commonly used special registers like the SP need to be mapped to the low 8 registers which does not look as clean, leaves fewer registers available and I do not believe the code density is as good. Some of these ISAs create more instructions like PUSH and POP in the case of the SP but this reduces orthogonality. The x86-64 ISA is similar but poorly implemented as a couple of the lower 8 registers are not GP despite having PUSH and POP instructions leaving only 6 GP registers before needing a prefix to access 8 more GP registers. Fortunately for x86-64, CISC cores with mem-reg and reg-mem memory accesses usually do not need as many registers and load-to-use stalls are avoided, one of the keys why x86 with 6 GP registers stayed ahead in performance of fat RISC with 32 GP registers like Alpha, MIPS, PPC, etc. Fat RISC code density was another reason. RISC fanatics are slow learners but the failures eventually disappeared leaving more competitive RISC architectures. |
Unlike the 68060 vs 68LC060 situation, the Pentium guaranteed x87 registers for baseline PCs, a baseline that major AAA PC games such as Tomb Raider and Quake enforced in practice.
PC DOS Tomb Raider runs poorly on the x87-less AMD 486SX2-66 (overclocked to 80 MHz) with a 3dfx Voodoo: https://www.youtube.com/watch?v=XHqLYzqZciM Various high-clock-speed 486 models are compared with PC DOS Tomb Raider.
IA-32's 8-GPR limitation pushed x86 implementations toward fast data transfers with the x87 and XMM registers.
ALU operations that work directly on memory operands lessened the demand on the small GPR count.
The actual microarchitecture implementation is equally important, e.g. the AMD K5 and Cyrix 6x86 have stronger integer pipelines, while the Intel Pentium has a stronger x87 pipeline.
https://barefeats.com/doom3.html
MAC GAME PERFORMANCE BRIEFING FROM THE DOOM 3 DEVELOPERS
Glenda Adams, Director of Development at Aspyr Media, has been involved in Mac game development for over 20 years. I asked her to share a few thoughts on what attempts they had made to optimize Doom 3 on the Mac and what barriers prevented them from getting it to run as fast on the Mac as in comparable Windows PCs. Here's what she wrote:
"Just like the PC version, timedemos should be run twice to get accurate results. The first run the game is caching textures and other data into RAM, so the timedemo will stutter more. Running it immediately a second time and recording that result will give more accurate results.
The performance differences you see between Doom 3 Mac and Windows, especially on high end cards, is due to a lot of factors (in general order from smallest impact to largest):
1. PowerPC architectural differences, including a much higher penalty for float to int conversion on the PPC. This is a penalty on all games ported to the Mac, and can't be easily fixed. It requires re-engineering much of the game's math code to keep data in native formats more often. This isn't 'bad' coding on the PC -- they don't have the performance penalty, and converting results to ints saves memory and can be faster in many algorithms on that platform. It would only be a few percentage points that could be gained on the Mac, so its one of those optimizations that just isn't feasible to do for the speed increase.
2. Compiler differences. gcc, the compiler used on the Mac, currently can't do some of the more complex optimizations that Visual Studio can on the PC. Especially when inlining small functions, the PC has an advantage. Add to this that the PowerPC has a higher overhead for functional calls, and not having as much inlining drops frame rates another few percentage points.
Microsoft's Visual Studio is first-party software for the Wintel platform. |
Same as above to kolla, plus some gentle reminders:
Quote:
Hammer wrote: @matthey
Quote:
The 32-bit 68060 has 16 GP integer registers, good orthogonality, a good FPU ISA with 8 GP FPU registers and it was obviously better than the in-order P5 Pentium equivalent. Motorola pulled the plug on the 68k for a RISC ISA more like Alpha though. I guess they could not read the writing on the wall.
|
You ignored that x86 integer register use cases involve both the GPRs and the x87 registers. |
Out of curiosity, what do you mean by that?
Quote:
Hammer wrote: @cdimauro
Quote:
cdimauro wrote: @Hammer
You continue to report it, but there's not a single benchmark using this VLE for embedded (and solely there, it looks like).
That's despite the fact that I've already asked you several times.
Since there's nothing yet, I wonder who is using it. If anyone ever did it...
|
1. I don't care about NXP/STM's PowerPC VLE vs 68K. PPC fanboys can cover their CPU horse. |
I haven't talked about PowerPC vs 68k: I've ONLY talked about VLE.
You've already reported it several times, in the context of code density, yet there's not a single number backing the credibility of this PowerPC extension on this key metric (which is THE key metric when talking about embedded; in fact, it's an extension for the embedded market).
BTW, I've just started reading this manual, and I've immediately found something which made me laugh. That's another set of engineers who were living in a parallel world. I leave it to you as an exercise to figure out what I'm talking about. Hint: it's at the very beginning of the documentation. Quote:
3. For VLE PPC, NXP/STM claims 30 percent code density improvement. |
Source? |
matthey
Re: The (Microprocessors) Code Density Hangout | Posted on 26-Aug-2025 1:15:11 | [ #408 ]
Elite Member | Joined: 14-Mar-2007 | Posts: 2821 | From: Kansas

cdimauro Quote:
No, you simply (!) don't know what you're talking about.
It was already reported several times. There are studies which show that a 25-30% code density improvement is roughly equivalent to a system with HALF the code cache size. I repeat again for YOUR benefit: HALF the code cache.
11-13% is around HALF of that code density improvement. You should now figure out for yourself whether that is "trifling" (SIC!)...
|
The RISC-V compressed instruction set manual and related research talk about code density and the number of instructions executed, along with some code density history.
The RISC-V Compressed Instruction Set Manual https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-209.pdf Quote:
Variable-length instruction sets have long been used to improve code density. For example, the IBM Stretch, developed in the late 1950s, had an ISA with 32-bit and 64-bit instructions, where some of the 32-bit instructions were compressed versions of the full 64-bit instructions. Stretch also employed the concept of limiting the set of registers that were addressable in some of the shorter instruction format. The later IBM 360 architecture supported a simple variable-length instruction encoding with 16-bit, 32-bit, or 48-bit instruction formats.
In 1963, CDC introduced the Cray-designed CDC 6600, a precursor to RISC architectures that introduced a register-rich load-store architecture with instructions of two lengths, 15-bits and 30-bits. The later Cray-1 design used a very similar instruction format, with 16-bit and 32-bit instruction lengths.
|
Some RISC fans like to claim the CDC 6600 as being RISC-like because the ISA is simplified. It does not have load/store instructions; instead, memory accesses are performed by writing to address registers. It also has three different types of registers, X0-X7, A0-A7 and B0-B7, and a VLE.
The RISC-V Compressed Instruction Set Manual https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-209.pdf Quote:
The initial RISC ISAs from the 1980s all picked performance over code size, which was reasonable for a workstation environment, but not for embedded systems. Hence, both ARM and MIPS subsequently made versions of the ISAs that offered smaller code size by offering an alternative 16-bit wide instruction set instead of the standard 32-bit wide instructions. The compressed RISC ISAs reduced code size relative to their starting points by about 25–30%, yielding code that was significantly smaller than 80x86. This result surprised some, as their intuition was that the variable-length CISC ISA should be smaller than RISC ISAs that offered only 16-bit and 32-bit formats.
|
It should read: fat RISC ISAs with a large code size picked simplicity over performance. Compressed RISC ISAs were handicapped by fat RISC ISAs, much like the 80x86 ISA was handicapped by starting as a 16-bit ISA while maintaining compatibility with the 8-bit 808x ISA. The smart thing to do would have been to start with a compressed 32-bit ISA like the 68k introduced in 1979. RISC-V finally got it right on the 5th attempt in 2010, discarding earlier RISC mistakes but falling short of 68k code density.
The RISC-V Compressed Instruction Set Manual https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-209.pdf Quote:
Since the original RISC ISAs did not leave sufficient opcode space free to include these unplanned compressed instructions, they were instead developed as complete new ISAs. This meant compilers needed different code generators for the separate compressed ISAs. The first compressed RISC ISA extensions (e.g., ARM Thumb and MIPS16) used only a fixed 16-bit instruction size, which gave good reductions in static code size but caused an increase in dynamic instruction count, which led to lower performance compared to the original fixed-width 32-bit instruction size. This led to the development of a second generation of compressed RISC ISA designs with mixed 16-bit and 32-bit instruction lengths (e.g., ARM Thumb2, microMIPS, PowerPC VLE), so that performance was similar to pure 32-bit instructions but with significant code size savings. Unfortunately, these different generations of compressed ISAs are incompatible with each other and with the original uncompressed ISA, leading to significant complexity in documentation, implementations, and software tools support.
|
An increase in "dynamic instruction count" led to "lower performance" for 16-bit fixed-length RISC encodings, but it also applies to 16-bit and 32-bit VLEs, which RISC-V developers still seem not to recognize. The BA2 and NanoMIPS designers realized that 48-bit encodings are necessary for 32-bit immediates/displacements, to keep from breaking instructions apart and thus increasing the instruction count and reducing performance. They still do not match CISC performance with its mem-reg and reg-mem single-cycle memory/cache accesses and multiple scaled immediates/displacements.
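To make the immediate/displacement point concrete, a small sketch of the bytes and dynamic instruction count needed to materialize one 32-bit constant (the three cases are generic shapes of the encodings discussed above; the 6-byte 68k MOVE.L #imm32,Dn figure is the standard one, the others are the generic 2x32-bit split and the 48-bit long encoding):

# Cost of materializing a single 32-bit immediate, in instructions and bytes.
cases = {
    "two 32-bit instructions (upper half, then lower half)": (2, 2 * 4),
    "one 48-bit instruction (BA2/NanoMIPS-style long encoding)": (1, 6),
    "one 68k MOVE.L #imm32,Dn (16-bit opcode + 32-bit extension)": (1, 2 + 4),
}
for desc, (ops, size) in cases.items():
    print(f"{desc}: {ops} instruction(s), {size} bytes")

The split case costs both an extra dynamic instruction and two extra bytes, which is the combination of effects the paragraph above is pointing at.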
The RISC-V Compressed Instruction Set Manual https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-209.pdf Quote:
Of the commonly used 64-bit ISAs, only PowerPC and microMIPS currently supports a compressed instruction format. It is surprising that the most popular 64-bit ISA for mobile platforms (ARM v8) does not include a compressed instruction format given that static code size and dynamic instruction fetch bandwidth are important metrics. Although static code size is not a major concern in larger systems, instruction fetch bandwidth can be a major bottleneck in servers running commercial workloads, which often have a large instruction working set.
|
If the caches are much larger than necessary with expensive high end hardware, then performance is not reduced as much as for more affordable hardware.
The RISC-V Compressed Instruction Set Manual https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-209.pdf Quote:
Benefiting from 25 years of hindsight, RISC-V was designed to support compressed instructions from the outset, leaving enough opcode space for RVC to be added as a simple extension on top of the base ISA (along with many other extensions). The philosophy of RVC is to reduce code size for embedded applications and to improve performance and energy-efficiency for all applications due to fewer misses in the instruction cache. Waterman shows that RVC fetches 25%-30% fewer instruction bits, which reduces instruction cache misses by 20%-25%, or roughly the same performance impact as doubling the instruction cache size.
|
This is the RISC-V code density research we often refer to, which bhabbott somehow missed. The 68k has about 50% better code density than fat "original RISC ISAs" like MIPS, SPARC, Alpha, ARM, PA-RISC and PPC, so their instruction caches needed to be roughly quadruple the size to match the instruction cache performance of the 68k. The 68060's 8kiB instruction cache had roughly the instruction cache performance of a PPC604e with a 32kiB instruction cache, wasting 24kiB of instruction cache compared to the 68060. At 6 transistors per bit, that 24kiB of instruction cache uses 1,179,648 transistors.
CPU     | pipeline | caches      | transistors | cost
68060   | 8-stage  | 8kiB/8kiB   | 2,530,000   | ?
PPC603e | 4-stage  | 16kiB/16kiB | 2,600,000   | $30
PPC604e | 6-stage  | 32kiB/32kiB | 5,100,000   | $60
1,179,648/5,100,000 = 23% of the PPC604e's transistors were wasted on the extra I-cache vs the 68060
1,179,648/2,530,000 = 47%: the PPC604e's extra I-cache equals 47% of the 68060's entire transistor budget
The PPC604e with 5.1 million transistors was estimated to have twice the manufacturing cost of the PPC603e with 2.6 million transistors. The PPC604e could likely have had a ~23% lower manufacturing cost with an 8kiB I-cache, lowering the cost from $60 to $46.12. The price is usually around three times the cost, so the CPU price might have dropped by $41.63 if PPC had 68k code density. The PPC data is taken from the following Microprocessor Report. The problem with PPC was not cost but performance, which did not compete against x86.
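Spelled out as a quick sketch that just re-runs the arithmetic above (the 6-transistors-per-bit SRAM cell and the price-is-roughly-3x-cost ratio are the assumptions already stated in this post):

# Cache transistor and cost arithmetic from the figures above.
KIB = 1024
extra_icache_bytes = 24 * KIB                 # 32kiB (PPC604e) - 8kiB (68060)
transistors_per_bit = 6                       # 6T SRAM cell
extra_transistors = extra_icache_bytes * 8 * transistors_per_bit
print(extra_transistors)                      # 1,179,648

print(extra_transistors / 5_100_000)          # ~0.23 of the PPC604e's budget
print(extra_transistors / 2_530_000)          # ~0.47 of a whole 68060's budget

ppc604e_cost = 60.0                           # estimated manufacturing cost, $
small_icache_cost = ppc604e_cost * (1 - extra_transistors / 5_100_000)
print(small_icache_cost)                      # ~$46.1
print((ppc604e_cost - small_icache_cost) * 3) # ~$41.6 price drop at price ~ 3x cost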
Arthur Revitalizes PowerPC Line https://websrv.cecs.uci.edu/~papers/mpr/MPR/ARTICLES/110203.PDF Quote:
Arthur’s (4-stage PPC G3) low manufacturing cost, however, lets IBM and Motorola continue to undercut Intel’s prices. We expect Klamath to initially debut at a list price of $700–$800. In contrast, Arthur is likely to appear at $400–$500. Apple will pay far less, of course, while even Intel’s best customers don’t get much of a discount off list. Intel will bring down the price of Klamath over several quarters, but Arthur’s cost structure will easily support a price well below Klamath’s.
This positioning appears advantageous but is unlikely to boost PowerPC’s prospects versus Intel. Without even getting into Apple’s problems (see 1101ED.PDF), a major failure of PowerPC is that it has never been able to deliver a large performance advantage over Intel. Although lower CPU prices are impressive from a technical standpoint, the cost savings are usually eaten up by higher system margins and component costs. Arthur keeps pace with Intel but doesn’t appear to change this basic equation. As noted, Apple’s one opportunity for performance leadership will come in the notebook market, but the company must respond quickly when this opportunity knocks.
|
PPC's poor cache efficiency and shallow pipelines had synergies that helped sink PPC. The PPC604(e) had a 6-stage pipeline which could be clocked up more than the PPC603(e)/G3, but the PPC604e did not clock as high as it otherwise could have after doubling the caches from 16kiB to 32kiB. Larger caches have a slower access time, which also reduces performance and/or increases pipeline latency with a deeper pipeline.
Motorola likely could have pipelined the instruction and data cache accesses by adding pipeline stages to access the cache over 2 or more stages, though this increases the load-to-use penalty which most load/store RISC designs suffer from, as well as the branch misprediction penalty. Motorola stayed with the shallow pipeline design for the PPC G3 (Arthur), based on the PPC603, for this reason. Most CISC designs do not suffer from load-to-use stalls, and Motorola added 2-stage instruction cache and data cache accesses with 32kiB instruction and data caches for the ColdFire V5, based on the 68060 design. The instruction pipeline only increased by 1 stage, from the 8-stage 68060 to the 9-stage ColdFire V5, by combining stages. Newer chip fab processes decrease distances, allowing more logic, so 32kiB cache accesses in a single stage became possible later. Cache access times are still very important and larger caches have longer access times at all levels.
Good code density reduces system costs due to less memory being needed. Fewer memory accesses reduce power, allowing cheaper power supplies and reducing the cost of cooling. To summarize, code density advantages include the following.
1. fewer transistors for caches allow cheaper chip costs and/or better cache performance
2. smaller caches allow better performance and/or less latency
3. memory costs are reduced with smaller memory footprint systems
4. power supply and cooling costs are reduced by fewer memory accesses reducing power
These are just the code density advantages; CISC has other advantages as well. It is amazing that all "original RISC ISA" developers did not understand what RISC-V research discovered much later about just the cache performance, enough by itself to sink fat RISC. I guess it is not surprising that bhabbott did not understand when DEC Alpha, HP PA-RISC and Motorola PPC developers did not see it. Credit to Intel for abandoning the i960 and StrongARM to return to x86 and to AMD for staying with x86(-64) instead of sailing with the Itanic. Motorola sure never looked back at the 68k after castrating their baby into ColdFire and throwing it out with the bathwater.
Hammer
Re: The (Microprocessors) Code Density Hangout | Posted on 26-Aug-2025 4:41:05 | [ #409 ]
Elite Member | Joined: 9-Mar-2003 | Posts: 6582 | From: Australia

Hammer
Re: The (Microprocessors) Code Density Hangout | Posted on 26-Aug-2025 5:20:29 | [ #410 ]
Elite Member | Joined: 9-Mar-2003 | Posts: 6582 | From: Australia

@matthey
https://youtu.be/s1G2caOSi9M?t=780 Tomb Raider (1996) 320x200p benchmarks for various 486DX and Pentium Overdrive CPUs. Frame rates are capped at around 30 fps.
Pentium Overdrive 83 MHz = 29.1 fps
AMD 5x86 160 MHz = 29.1 fps
AMD 5x86 133 MHz = 27.5 fps
AMD 486DX4 100 MHz = 22.9 fps
Cyrix 5x86 100 MHz (50 MHz bus x2) = 27.6 fps
Cyrix 5x86 Enhanced 100 MHz = 27.5 fps
Cyrix 5x86 100 MHz = 23.9 fps
Cyrix 486 100 MHz = 20.4 fps
(The Cyrix 5x86 primarily uses Socket 3, a 168-pin PGA.)
Intel 486DX4 100 MHz = 23.7 fps
Intel 486DX2 66 MHz (write-back cache) = 15.8 fps
These are 32-bit FSB platforms.
----------
https://www.youtube.com/watch?v=KiNTp1jlrR4 OpenLara running on an A1200 with a 68060 @ 50 MHz.
From https://eab.abime.net/showthread.php?t=120230
The Vampire AC68080 V2 easily reaches the 30 fps cap.
The old Apollo 1240 @ 40 MHz is around 12 fps. The 68060 @ 50 MHz is not delivering 2X over the 68040 @ 40 MHz. A Trinity1240 (68040 @ 33 MHz) with semi-modern SDR memory is around 12 fps.
The Trinity1240/1260 project can also support a 68060 via a jumper.
http://www.b737.org.uk/fmc.htm The recent FMC Model 2907C1 has an MC68040 processor running at 60 MHz (30 MHz bus clock speed).
kolla
Re: The (Microprocessors) Code Density Hangout | Posted on 29-Aug-2025 1:20:29 | [ #411 ]
Elite Member | Joined: 20-Aug-2003 | Posts: 3507 | From: Trondheim, Norway

@Hammer
Quote:
Hammer wrote:
AC68080 has increased general-purpose performance, and SAGA comes with RTG's chunky graphics. Why not use Linux 68K on it?
|
Why are you asking me? I've been running Linux on 68k for more than 3 decades, on both real and emulated hardware. The ability to run Linux or NetBSD is the only thing that would make me perhaps buy a V4 eventually. But how likely is that, really? Who would maintain Linux and NetBSD support for the 68080, and how? Is the "Apollo team" capable and interested? Are the Linux/68k and NetBSD/68k teams even interested at this point?
(What do you mean by "RTG's chunky graphics"? RTG is software, an API for AmigaOS, and SAGA is not the 68080. Granted, a V4 has both the 68080 and SAGA, and SAGA does chunky graphics, but RTG is irrelevant for Linux.)
cdimauro
Re: The (Microprocessors) Code Density Hangout | Posted on 2-Sep-2025 5:35:41 | [ #412 ]
Elite Member | Joined: 29-Oct-2012 | Posts: 4506 | From: Germany

@Hammer, @kolla: what's not clear to you about this thread being about code density?
As I've already reported, there's another thread which is better suited for discussions about microarchitectures and the like: https://amigaworld.net/modules/newbb/viewtopic.php?topic_id=45544&forum=17
Or, you can certainly open new threads talking about those topics.
@matthey
Quote:
matthey wrote: Newer chip fab processes decrease distances allowing more logic so 32kiB cache accesses in a single stage became possible later. Cache access times are still very important and larger caches have longer access times at all levels. |
Only one point here: this depends on the cache line granularity (size).
In fact, you can double the cache size, but if you also double the cache line size, then the access time is the same (keeping all other factors equal, of course). The price to pay is more traffic -> more transistors for the buffers & more power drawn, but those are other factors. Quote:
Good code density reduces system costs due to less memory being needed. Fewer memory accesses reduce power, allowing cheaper power supplies and reducing the cost of cooling. To summarize, code density advantages include the following.
1. fewer transistors for caches allow cheaper chip costs and/or better cache performance
2. smaller caches allow better performance and/or less latency
3. memory costs are reduced with smaller memory footprint systems
4. power supply and cooling costs are reduced by fewer memory accesses reducing power
These are just the code density advantages; CISC has other advantages as well. |
It's a good summary, thanks. Quote:
It is amazing that all "original RISC ISA" developers did not understand what RISC-V research discovered much later about just the cache performance, enough by itself to sink fat RISC. I guess it is not surprising that bhabbott did not understand when DEC Alpha, HP PA-RISC and Motorola PPC developers did not see it. Credit to Intel for abandoning the i960 and StrongARM to return to x86 and to AMD for staying with x86(-64) instead of sailing with the Itanic. Motorola sure never looked back at the 68k after castrating their baby into ColdFire and throwing it out with the bathwater. |
As I've already said in one of my last replies to minator, at the time the only relevant metric for chip vendors was performance, and nothing else.
Code density wasn't relevant, because they wanted to win the speed race, and because of that not even the price was important (they were packing tons of transistors onto their chips only to get more performance), nor was power consumption relevant.
Things worked differently in the embedded and handheld console market, and the history of the Nintendo GameBoy Advance and ARM's Thumb was a clear indication of how important code density was.
It became important for general-purpose processors & architectures once they hit the wall of scaling with newer process nodes and frequencies, and once mobile devices made their impact on our lives, which led them to rediscover how important this key factor is. |
cdimauro
Re: The (Microprocessors) Code Density Hangout | Posted on 2-Sep-2025 5:42:14 | [ #413 ]
Elite Member | Joined: 29-Oct-2012 | Posts: 4506 | From: Germany

@Hammer
Quote:
It's just a number backed by not a single piece of supporting data -> irrelevant.
Pay attention: if we consider that number relevant, then it would mean that VLE was/is performing WAY BETTER than Thumb-2, BA2 and NanoMIPS, getting very, very close to the 8086 results. Just take a look at the benchmarks reported on the first page of this thread, apply this 30% to all the PowerPC data, and you can figure out for yourself how completely unrealistic it would be. In fact, VLE doesn't even support 32-bit immediates, and despite that it would show far better results than NanoMIPS and even BA2 (which is the best architecture in terms of pure code density; the best 32-bit architecture, to be more precise).
Freescale has to show where this number comes from. |
matthey
Re: The (Microprocessors) Code Density Hangout | Posted on 2-Sep-2025 18:51:01 | [ #414 ]
Elite Member | Joined: 14-Mar-2007 | Posts: 2821 | From: Kansas

cdimauro Quote:
Only one point here: this depends on the cache line granularity (size).
In fact, you can double the cache size, but if you also double the cache line size, then the access time is the same (keeping all other factors equal, of course). The price to pay is more traffic -> more transistors for the buffers & more power drawn, but those are other factors.
|
As a rule of thumb, the smaller the cache, the lower the cache access latency. There are other cache characteristics which affect access time, including cache associativity and the cache line size. I expect the cache line size to have less of a direct effect on cache hit access time than cache size and associativity. Larger cache line sizes matter more for cache misses, where the cache access time is less important than higher-level cache and memory access times. Larger cache lines increase conflict misses, which is usually compensated for with more cache associativity, increasing cache access latency despite the performance advantages.
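As a small illustration of that organization argument (a generic set-associative cache model with example sizes only, not any particular CPU): doubling capacity alone doubles the number of sets the index must select, while doubling capacity and line size together keeps the set count, and hence much of the lookup structure, unchanged, which is the point cdimauro raised.

# Sets and address-bit split for a set-associative cache.
def cache_geometry(size_bytes, line_bytes, ways, addr_bits=32):
    sets = size_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1   # log2(line size)
    index_bits = sets.bit_length() - 1          # log2(number of sets)
    tag_bits = addr_bits - index_bits - offset_bits
    return sets, offset_bits, index_bits, tag_bits

print(cache_geometry(16 * 1024, 32, 4))  # 16kiB, 32B lines: 128 sets
print(cache_geometry(32 * 1024, 32, 4))  # double the size only: 256 sets
print(cache_geometry(32 * 1024, 64, 4))  # double size and line size: 128 sets again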
cdimauro Quote:
As I've already said in one of my last replies to minator, at the time the only relevant metric for chip vendors was performance, and nothing else.
Code density wasn't relevant, because they wanted to win the speed race, and because of that not even the price was important (they were packing tons of transistors onto their chips only to get more performance), nor was power consumption relevant.
Things worked differently in the embedded and handheld console market, and the history of the Nintendo GameBoy Advance and ARM's Thumb was a clear indication of how important code density was.
It became important for general-purpose processors & architectures once they hit the wall of scaling with newer process nodes and frequencies, and once mobile devices made their impact on our lives, which led them to rediscover how important this key factor is.
|
Right. Early RISC was a clock speed race with, in race car terms, low energy density fuel.
Methanol is great for a drag race. A fire-breathing 1.3L engine in an RX-7 can run a 7.9s@177mph 1/4 mile (400m), and this is not the most powerful or quickest 13B car or even the quickest 13B RX-7 in the world, but I chose it because the car info is given.
Matt Esplan Runs a 7!! | ESPYFAB RACING 13B TURBO FD RX7 Drag Car | MAZDA | FullBOOST | Drag Racing https://youtu.be/fVVTPf-kESo?t=272
It has 12x1600cc injectors to supply the methanol, where stock for gas is 2x550cc and 2x850cc (my mildly modified 1993 RX-7 uses 4x850cc and ran 12.8s@110mph in the 1/4 mile). It uses a production motor with production parts, although some parts are from the S5 RX-7 introduced in 1989. Even 1980s tech can be good. The only problem is that CPUs are not like drag race cars but like endurance race cars. Consistent performance over a long time is desired for a CPU, as there are no speed limits. It is just a matter of supplying enough instructions, like fuel for a car, where code density, like energy-dense gasoline, is an advantage.
The Wankel rotary engine and RX-7 are actually better known for endurance racing and handling using gasoline. A lightweight engine with few moving parts, allowing a low center of gravity, with no valves to blow out or get in the way of turbos, offers certain advantages, and the engine could be much lighter if it were all aluminum; I can already pick it up by myself, it being the size of a beer keg without intake or exhaust manifolds. Performance was never a problem for the rotary engine either, which is kind of like CISC CPUs, which have more performance potential than RISC CPUs, yet piston engines replaced rotary engines like RISC CPUs replaced CISC CPUs. Well, RISC CPUs became more CISC-like while retaining the RISC propaganda, whereas it is not possible for piston engines to become more like a rotary engine. The rotary engine still has potential too, especially with multi-fuels and as a small lightweight engine for recharging batteries. The M-1 tank uses a rotary engine for auxiliary power and it is much more practical than the thirsty turbine engine. Diesel engines are hard to beat for tanks and better for tanks than gasoline engines, as the Germans discovered during WWII, where fire-breathing tanks are not as good as fire-breathing race cars, but diesel is more difficult to make from coal. Fuel supply is just as important for internal combustion engines as instruction supply is for CPUs, as both become useless without them.
cdimauro Quote:
It's just a number backed by not a single piece of supporting data -> irrelevant.
Pay attention: if we consider that number relevant, then it would mean that VLE was/is performing WAY BETTER than Thumb-2, BA2 and NanoMIPS, getting very, very close to the 8086 results. Just take a look at the benchmarks reported on the first page of this thread, apply this 30% to all the PowerPC data, and you can figure out for yourself how completely unrealistic it would be. In fact, VLE doesn't even support 32-bit immediates, and despite that it would show far better results than NanoMIPS and even BA2 (which is the best architecture in terms of pure code density; the best 32-bit architecture, to be more precise).
Freescale has to show where this number comes from.
|
A 30% claim for PPC VLE is typical of compressed RISC ISAs. Just above in post #408 I quoted RISC-V documentation which gave 25-30% reduced code size relative to their starting points for compressed RISC ISAs.
The RISC-V Compressed Instruction Set Manual https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-209.pdf Quote:
The initial RISC ISAs from the 1980s all picked performance over code size, which was reasonable for a workstation environment, but not for embedded systems. Hence, both ARM and MIPS subsequently made versions of the ISAs that offered smaller code size by offering an alternative 16-bit wide instruction set instead of the standard 32-bit wide instructions. The compressed RISC ISAs reduced code size relative to their starting points by about 25–30%, yielding code that was significantly smaller than 80x86. This result surprised some, as their intuition was that the variable-length CISC ISA should be smaller than RISC ISAs that offered only 16-bit and 32-bit formats.
|
It is interesting that you mention the best compressed RISC ISAs getting close to 8086 code density when the RISC-V documentation says the 25-30% reduced code size is significantly smaller than 80x86 code size. That is quite the decline in code density from the 8086 to 80x86, even though 8086 code can execute on 80x86 CPUs. In reality, the 808x and x86 ISAs have significantly better code density if code is size-optimized for 8-bit datatypes, stack accesses and 6 GP registers. The 68k has a 32-bit ISA with more efficient use of larger datatypes and 16 GP registers, so the code density remains good when optimizing for performance too. The Vince Weaver contest has the 68k with ~45% better code density than PPC, and x86 at ~30% better code density. Where 25-30% better code density may be typical for compressed RISC ISAs, I expect the best code density ISAs to be 40%-50% better than classic RISC ISAs like MIPS, SPARC and PPC (and excluding Alpha and PA-RISC). It is easy to show any ISA not reaching its code density potential, as RISC-V studies have demonstrated. Compiler options, compiler selection and benchmark selection play large roles in code density.
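Because "X% better code density" gets used loosely in these comparisons, a tiny sketch of the two possible readings, using only the percentages quoted above (the helper names are illustrative):

# Two readings of "X% better code density" as a relative code size.
def size_if_denser(pct):
    return 1.0 / (1.0 + pct)   # density is (1+X) times higher

def size_if_smaller(pct):
    return 1.0 - pct           # code is simply X% smaller

for label, pct in (("68k vs PPC, ~45% better", 0.45), ("x86 vs PPC, ~30% better", 0.30)):
    print(f"{label}: {size_if_denser(pct):.2f}x or {size_if_smaller(pct):.2f}x the PPC code size")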
bhabbott
Re: The (Microprocessors) Code Density Hangout | Posted on 3-Sep-2025 2:34:57 | [ #415 ]
Cult Member | Joined: 6-Jun-2018 | Posts: 567 | From: Aotearoa

@cdimauro
Quote:
cdimauro wrote: @bhabbott
you simply (!) don't know what you're talking about.
It was already reported several times. There are studies which show that a 25-30% code density improvement is roughly equivalent to a system with HALF the code cache size. I repeat again for YOUR benefit: HALF the code cache. |
11-13% is half as much as 25-30%, so equivalent to a system with 2/3 to 3/4 the cache size. But what does this mean? Code has to fit in the cache to benefit, so more compact code can do more at cache speed. Great. Then bloat wipes it all out. In this case the code only has to bloat by 11-13% to wipe out the gains. In the real world that's nothing.
Quote:
I'll reveal a secret to you: 64-bit has limits. 2^64, precisely (but that's high-order math). |
In practice it's no limit. 2^64 bytes is 16 exbibytes (about 18.4 exabytes). A top-end desktop computer today might have 64 gigabytes, or ~0.0000004% of the theoretical maximum.
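For reference, the arithmetic behind that ceiling (a trivial sketch using only the numbers in this post):

# 2^64 bytes in decimal and binary units, and what 64 GiB is as a fraction of it.
limit = 2 ** 64
print(limit / 1e18)           # ~18.45 exabytes (decimal EB)
print(limit / 2 ** 60)        # 16.0 exbibytes (EiB)
ram = 64 * 2 ** 30            # a 64 GiB desktop
print(f"{ram / limit:.10%}")  # ~0.0000004% of the theoretical maximum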
Quote:
And here comes again the magic word which Bruce repeats like a parrot whenever he's not able to accept reality (which is very different from the cave where he's living): bloat. |
The cave I am living in is an Amiga 1200 with a 50MHz 68030 and 32MB RAM. The CPU has a 256-byte instruction cache. The simplicity of that cave appeals to me, and I like living within its confines. However, many others don't like being so restricted. Today you can throw a PiStorm into your Amiga and have mind-blowing performance - yet that still isn't enough for some.
The irony is that you guys are also living in a cave. While you pontificate about which ISA has the best code density, Amiga coders write apps in Hollywood or port over games designed for much more powerful PCs. The 68k might have slightly better code density than ARM, but that's irrelevant when the Pi's CPU is running at several GHz and emulates a 68k much faster than any real one.