Amigaworld.net - The Amiga Computer Community Portal Website

The (Microprocessors) Code Density Hangout

minator

Re: The (Microprocessors) Code Density Hangout
Posted on 27-Jun-2025 0:32:01

[ #301 ]

Super Member

Joined: 23-Mar-2004
Posts: 1034
From: Cambridge

@matthey

Quote:
The 2-way superscalar 32-bit 68060 CPU uses ~2.5 million transistors while the lowest end 64-bit 2-way superscalar Cortex-A53 core uses ~12.5 million transistors. A 32-bit in-order 2-way Cortex-A7 core predecessor uses ~10 million transistors so a 64-bit equivalent Cortex-A53 core uses ~25% more transistors. The 64-bit tax applies to more than just memory.

A lot can change over 17 years, also, the A53 effectively implements 2 different instruction sets and that impacts the entire processor.

Wouldn't it be better to compare similar processors from the same time:
All of these are 2 way superscalar:

32 bit
1993 Pentium P5 66MHz (2x 8K caches) 3.1 million transistors
1994 68060 50MHz (2 x 8K caches) 2.5 million transistors
1994 PA-7200 120MHz (1 x 2K assist cache) 1.3 million transistors
1994 Pentium P54 100MHz (2x 8K caches) 3.2 million transistors

64 bit
1992 Alpha 21064 (EV4S) 200MHz (2 x 8K caches) 1.68 million transistors
1991 MIPS R4000 100MHz (2 x 8K caches) 1.35 million transistors
1992 MIPS R4400 250MHz (2 x 16K caches) 2.2 million transistors

The 64 bit tax doesn't seem too high. Caches can add huge number of transistors as the R4400 number shows.
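As a rough sanity check on the list above, the transistor counts can be averaged per group. This is only a sketch: the numbers are copied from the list, the grouping is the one used there, and a plain mean ignores differences in cache size, pipeline depth and process, so it is an illustration rather than a measurement of any "64 bit tax".

```python
# Transistor counts in millions, copied from the list above.
chips_32bit = {
    "Pentium P5": 3.1,
    "68060": 2.5,
    "PA-7200": 1.3,
    "Pentium P54": 3.2,
}
chips_64bit = {
    "Alpha 21064": 1.68,
    "MIPS R4000": 1.35,
    "MIPS R4400": 2.2,
}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


mean_32 = mean(chips_32bit.values())
mean_64 = mean(chips_64bit.values())

print(f"32-bit mean: {mean_32:.3f} million transistors")
print(f"64-bit mean: {mean_64:.3f} million transistors")
```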

There is a CISC tax though. They have more logic transistors, they are far more complex to design, and slower. There's reason the industry gave up on CISC.

Last edited by minator on 27-Jun-2025 at 12:36 AM.

_________________
Whyzzat?


matthey

Re: The (Microprocessors) Code Density Hangout
Posted on 28-Jun-2025 1:39:54

[ #302 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2728
From: Kansas

minator Quote:

A lot can change over 17 years, also, the A53 effectively implements 2 different instruction sets and that impacts the entire processor.

The Cortex-A53 supports at least 4 ISAs.

1. ARM (original)
2. Thumb
3. Thumb-2
4. AArch64

New Cortex-A cores support only #4 and Cortex-M cores support #2-3.

minator Quote:

Wouldn't it be better to compare similar processors from the same time:
All of these are 2 way superscalar:

32 bit
1993 Pentium P5 66MHz (2x 8K caches) 3.1 million transistors
1994 68060 50MHz (2 x 8K caches) 2.5 million transistors
1994 PA-7200 120MHz (1 x 2K assist cache) 1.3 million transistors

The PA-7200 is superscalar, but its 2kiB on-chip assist cache and off-chip L1, combined with the 2nd worst RISC code density after Alpha, are grossly inadequate for instruction supply. The PA-7200 is a good example of ignorance of the RISC instruction bottleneck. The design would have been better left as scalar, with the transistors wasted on superscalar hardware reallocated to at least an 8kiB on-chip L1 instruction cache. The PA-7200 design uses a 5-stage pipeline and lacks dynamic branch prediction, which reduces the number of transistors compared to the 8-stage 68060. The shallow pipeline should also have limited the max clock speed compared to the 68060, which should eventually have been clocked around 150MHz.

minator Quote:

1994 Pentium P54 100MHz (2x 8K caches) 3.2 million transistors

64 bit
1992 Alpha 21064 (EV4S) 200MHz (2 x 8K caches) 1.68 million transistors

The Alpha 21064 is a professional quality 7-stage superscalar design, other than being handicapped by the Alpha code density and extreme simplicity. The 8kiB instruction cache has the performance of a 68060 ~2kiB instruction cache and would need to be increased to a ~32kiB instruction cache to match the 68060 8kiB instruction cache performance, according to RISC-V research.

The RISC-V Compressed Instruction Set Manual, Version 1.7
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-157.pdf Quote:

The philosophy of RVC is to reduce code size for embedded applications and to improve performance and energy-efficiency for all applications due to fewer misses in the instruction cache.
Waterman shows that RVC fetches 25%-30% fewer instruction bits, which reduces instruction cache misses by 20%-25%, or roughly the same performance impact as doubling the instruction cache size.

The small logic advantage that simple RISC cores gained by eliminating features that are standard on CISC cores is gone once the instruction caches are enlarged to compensate for the poor instruction cache performance. A good example was the PPC603.

core | pipeline | L1 caches | transistors
PPC603 | 4-stage | 8kiB/8kiB | 1.6 million
PPC603e | 4-stage | 16kiB/16kiB | 2.6 million
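To get a feel for how much of the PPC603 to PPC603e increase could be cache alone, here is a back-of-envelope sketch. It assumes a classic 6-transistor SRAM cell and counts only the data arrays (no tags, no decoders), which is my assumption and not something stated in the PPC documentation.

```python
BITS_PER_BYTE = 8
TRANSISTORS_PER_SRAM_BIT = 6  # assumption: classic 6T SRAM cell


def sram_data_transistors(kib):
    """Transistors in the data array alone for `kib` KiB of SRAM."""
    return kib * 1024 * BITS_PER_BYTE * TRANSISTORS_PER_SRAM_BIT


# Extra L1 capacity of the PPC603e over the PPC603 (I+D combined).
extra_cache_kib = (16 + 16) - (8 + 8)

# Transistor difference taken from the table above.
extra_transistors = 2_600_000 - 1_600_000

estimated = sram_data_transistors(extra_cache_kib)
share = estimated / extra_transistors

print(f"data arrays for {extra_cache_kib} KiB: {estimated} transistors")
print(f"share of the 603 -> 603e increase: {share:.0%}")
```

Under these assumptions the extra data arrays alone account for most of the added transistors, which is consistent with the point that the doubled caches drove the increase.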

The PPC strategy was to use limited OoO to increase the performance of shallow pipelines, which do not need dynamic branch prediction, thus saving transistors. The poor code density and much increased memory traffic resulted in the caches of the PPC603 and PPC604 being doubled, sabotaging the cost advantage of PPC. Even worse, the shallow pipelines could not be clocked up without expensive die shrinks, which also came with doubling the caches for the PPC603e and PPC604e. The large 32kiB I+D caches likely reduced the max clock speed of the PPC604e, as the PPC604 had a higher clock speed. PPC stagnated until Moore's Law allowed L2 caches on chip, giving PPC a 2nd wind, but the cache hog problem remained and grows as the number of caches holding instructions increases.

minator Quote:

1991 MIPS R4000 100MHz (2 x 8K caches) 1.35 million transistors
1992 MIPS R4400 250MHz (2 x 16K caches) 2.2 million transistors

These MIPS cores are scalar 64-bit cores with an 8-stage pipeline. The difference in transistor count between scalar and superscalar cores will often be larger than the difference between 32-bit and 64-bit cores. The scalar version of the 68060 was never released, but the similar Cyrix design was.

core | pipeline | L1 cache | transistors
Cyrix 5x86 | 6-stage | 16kiB | 2.0 million (scalar)
Cyrix 6x86 | 7-stage | 16kiB | 3.0 million (superscalar)

While I cannot confirm the above Cyrix transistor counts, Cyrix literature says the following about the cost of superscalar designs compared to scalar designs.

Cyrix 5x86: Fifth-Generation Design Emphasizes Maximum Performance While Minimizing Transistor Count
https://dosdays.co.uk/media/cyrix/5x86/5X-DPAPR.PDF Quote:

5x86 Architecture

The increased complexity, transistor count, and power consumption of superscalar designs led Cyrix engineers to re-examine the benefits of the superscalar approach. Clearly the power dissipated in a second execution pipeline plus the added power dissipated in the control logic to oversee two execution pipelines should be minimal to achieve performance that will justify the transistors added. Analysis has shown that the increased complexity of two execution pipelines can cost 40% in transistor count while providing an increase of less than 20% in instructions-per-clock performance.

A scalar version of the 68060 would use 1.5 million transistors if it used 40% fewer transistors than the 68060. The MIPS R4000 is more primitive than the Cyrix 5x86 design (and the hypothetical scalar 68060 design). The R4000 pipeline was stretched from the 5 stages of the R3000 to 8 stages with the common naive RISC assumptions about performance. The 64-bit core and high clock speed were better for marketing than performance. I have commented before about deeper RISC pipelines increasing stalls, and the same is true here, with the architects practically ignoring the problems. Both load-to-use and branch misprediction stalls were increased.
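The scalar 68060 estimate at the start of the paragraph above is simple arithmetic, but the Cyrix quote can be read two ways. A small sketch showing both readings; the 2.5 million figure and the 40% come from this thread, and the second reading is my own alternative interpretation, not something the post claims.

```python
M68060_TRANSISTORS = 2.5  # millions, figure used in this thread
SUPERSCALAR_COST = 0.40   # the 40% from the Cyrix quote

# Reading 1 (the one used in the post): the scalar core has 40% fewer
# transistors than the superscalar core.
scalar_if_40pct_fewer = M68060_TRANSISTORS * (1 - SUPERSCALAR_COST)

# Reading 2 (alternative): the superscalar core has 40% more transistors
# than the scalar core, so divide instead of multiply.
scalar_if_40pct_more = M68060_TRANSISTORS / (1 + SUPERSCALAR_COST)

print(f"reading 1: {scalar_if_40pct_fewer:.2f} million")
print(f"reading 2: {scalar_if_40pct_more:.2f} million")
```

Either way the hypothetical scalar core lands well below the 2.5 million of the superscalar 68060.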

https://blog.jyotiprakash.org/delving-deeper-into-the-mips-pipeline Quote:

Load Delays

In the R4000 pipeline, load delays are increased to 2 cycles because the data value becomes available at the end of the DS stage. The following figures show the pipeline schedule when a use immediately follows a load, indicating that forwarding is required to access the result of a load instruction in subsequent cycles.

After a load instruction, 2 independent instructions must be placed between the load and the first instruction that uses the loaded value to avoid a load-to-use stall. There is a nice picture to show the load-use delay but it is a little large for this forum. The 68060 and most CISC designs have no load-to-use stalls, so they benefit more from the deeper pipeline. As bad as the increased load-to-use delay is for the R4000, branching is much worse. The MIPS ISA was designed for a branch delay slot and has static not-taken branch prediction.
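The load-to-use rule described at the start of the paragraph above can be sketched as a tiny model. It assumes an interlocked pipeline where every missing independent instruction costs exactly one stall cycle; the function name and the simplification are mine, only the 2-cycle delay comes from the quoted text.

```python
def load_use_stalls(independent_between, load_delay=2):
    """Stall cycles when the loaded value is used too early.

    independent_between: instructions scheduled between the load and
    the first instruction that uses the loaded value.
    load_delay: cycles before the loaded value can be forwarded
    (2 for the R4000 according to the quote above).
    """
    return max(0, load_delay - independent_between)


for gap in range(4):
    print(f"{gap} independent instruction(s): "
          f"{load_use_stalls(gap)} stall cycle(s)")
```

A compiler that cannot find 2 independent instructions pays the difference in stall cycles on every such load.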

https://blog.jyotiprakash.org/delving-deeper-into-the-mips-pipeline Quote:

Branch Delays

The basic branch delay in the R4000 pipeline is 3 cycles since the branch condition is computed during the EX stage. The MIPS architecture includes a single-cycle delayed branch. The R4000 employs a predicted-not-taken strategy for the remaining 2 cycles of the branch delay. The following figures demonstrates that untaken branches behave as 1-cycle delayed branches, while taken branches include a 1-cycle delay slot followed by 2 idle cycles. The instruction set includes a branch-likely instruction to help fill the branch delay slot. Pipeline interlocks enforce the 2-cycle branch stall penalty on taken branches and any data hazard stalls resulting from load uses.

The R4000 has no dynamic branch prediction for an 8-stage pipeline! Each iteration of a loop stalls for 2 cycles, and a 3rd cycle is wasted if the branch delay slot cannot be filled with useful work. Compare this to the 68060, which starts with BTFN (backward taken, forward not taken) static prediction, so it predicts the loop branch as taken the first time. The 68060 also has 2-bit saturating dynamic branch prediction with a BTB, which not only avoids stalls for loops but allows the branch itself to be folded away. The 68060 has a 3-4 cycle advantage on every iteration of a loop! Other branches benefit too. The 68060 2-bit saturating prediction is better than the 1-bit prediction of the Alpha 21064, which was upgraded to 2-bit saturating in the Alpha 21064A.
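The difference between 1-bit and 2-bit saturating prediction is easy to see on a loop branch that is executed repeatedly. The following is a generic textbook sketch, not a model of the 68060 or Alpha hardware: it ignores the BTB, BTFN and branch folding, and the initial predictor states are assumptions.

```python
def mispredictions_1bit(outcomes, predict_taken=True):
    """1-bit predictor: predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if predict_taken != taken:
            misses += 1
        predict_taken = taken
    return misses


def mispredictions_2bit(outcomes, counter=3):
    """2-bit saturating counter (0..3); predict taken when counter >= 2."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


def loop_branch(iterations, runs):
    """Outcomes of a backward loop branch: taken until the loop exits,
    with the whole loop executed `runs` times."""
    return ([True] * (iterations - 1) + [False]) * runs


outcomes = loop_branch(iterations=10, runs=5)
one_bit = mispredictions_1bit(outcomes)
two_bit = mispredictions_2bit(outcomes)
print(f"1-bit mispredictions: {one_bit}")
print(f"2-bit mispredictions: {two_bit}")
```

The 1-bit predictor misses both the loop exit and the first iteration of the next run, while the 2-bit counter only misses the exit.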

The R4000 average for the SPEC92 integer benchmarks was 1.54 CPI according to the same source quoted above, while Motorola claimed "1.2 CPI measured on range of desktop and embedded applications".

The Superscalar hardware architecture of the M68060
https://old.hotchips.org/wp-content/uploads/hc_archives/hc06/3_Tue/HC6.S8/HC6.8.3.pdf
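To put the two CPI figures side by side, here is a small sketch. It assumes the same instruction count on both cores, which is not really true across different ISAs and different benchmark sets, so treat the ratio as an illustration only.

```python
R4000_CPI = 1.54   # SPEC92 integer average quoted above
M68060_CPI = 1.2   # Motorola's claim quoted above


def cycles(instructions, cpi):
    """Total clock cycles for a given instruction count and CPI."""
    return instructions * cpi


# Assumption: identical instruction count on both cores.
instructions = 1_000_000
r4000_cycles = cycles(instructions, R4000_CPI)
m68060_cycles = cycles(instructions, M68060_CPI)
ratio = r4000_cycles / m68060_cycles

print(f"R4000: {r4000_cycles:.0f} cycles")
print(f"68060: {m68060_cycles:.0f} cycles")
print(f"R4000 needs {ratio:.2f}x the cycles")
```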

The 68060 was not a barely superscalar CPU. It was a finely tuned high tech Pentium killer, but that also made it more of a RISC killer, including a killer of the shallow-pipeline, limited-OoO PPC. It threatened the AIM Alliance and thus could not be clocked up.

minator Quote:

The 64 bit tax doesn't seem too high. Caches can add huge number of transistors as the R4400 number shows.

The minimum 64-bit tax is not too high, but paying only the minimum gives more of a hybrid 32-bit/64-bit CPU core.

The MIPS R4000 Processor
https://people.eecs.berkeley.edu/~kubitron/courses/cs252-S07/handouts/papers/R4000.pdf Quote:

The hardware cost of extending the architecture to 64 bits was about 7% of the die area. A longer 64-bit ALU stage represents the cycle time speed penalty.

The R4000 only has a 32-bit barrel shifter, which is half the size of a 64-bit barrel shifter. Modern 64-bit ISAs are more likely to have more 64-bit integer multiply and divide instructions, which are also very expensive. The MIPS ISA is simple, making it cheaper to implement in 64-bit. For example, there is only one addressing mode. Compare that to AArch64, which rivals the 68k in addressing modes and has thousands of instructions instead of hundreds at most like the 68k and MIPS. The "cycle time speed penalty" of the "64-bit ALU stage" slowed down the whole pipeline. This is less of a problem with modern silicon, but 64-bit ALU operations are still sometimes slower, 64-bit pointers are sometimes much slower than 32-bit pointers, and 64-bit code tends to be larger than 32-bit code, decreasing cache efficiency. I am not completely opposed to 64-bit, but the cost is higher than the benefit on low end inexpensive hardware.
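The pointer size effect on data structures can be shown with a small layout calculation. This sketch assumes C-like natural alignment (each field aligned to its own size) and a hypothetical doubly linked list node; real ABIs can differ, so the numbers are illustrative.

```python
def struct_size(field_sizes):
    """Size in bytes of a C-like struct with naturally aligned fields."""
    offset = 0
    max_align = 1
    for size in field_sizes:
        align = size
        # Round the offset up to the field's alignment, then place it.
        offset = (offset + align - 1) // align * align
        offset += size
        max_align = max(max_align, align)
    # Pad the tail so arrays of the struct stay aligned.
    return (offset + max_align - 1) // max_align * max_align


def list_node_size(pointer_bytes):
    """Hypothetical node: int32 value, next pointer, prev pointer."""
    return struct_size([4, pointer_bytes, pointer_bytes])


size_32 = list_node_size(4)
size_64 = list_node_size(8)
print(f"node with 32-bit pointers: {size_32} bytes")
print(f"node with 64-bit pointers: {size_64} bytes")
```

With these assumptions the node doubles in size when the pointers go from 32-bit to 64-bit, so the same data cache holds half as many nodes.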

Nintendo bought into the MIPS 64-bit propaganda for the Nintendo 64.

https://en.wikipedia.org/wiki/Nintendo_64#Hardware Quote:

Technical specifications

The Nintendo 64's architecture is built around the Reality Coprocessor (RCP), which serves as the system's central hub for processing graphics, audio, and memory management. It works in tandem with the VR4300, a 93.75 MHz 64-bit CPU fabricated by NEC with a performance of 125 million instructions per second. Popular Electronics compared its processing power to that of contemporary Pentium desktop processors. Though constrained by a narrower 32-bit system bus, the VR4300 retained the computational capabilities of the more powerful 64-bit MIPS R4300i on which it was based. However, software rarely utilized 64-bit precision, as Nintendo 64 games primarily relied on faster and more compact 32-bit operations.

The Nintendo GameCube successor went back to a more practical 32-bit PPC CPU.

minator Quote:

There is a CISC tax though. They have more logic transistors, they are far more complex to design, and slower. There's reason the industry gave up on CISC.

I see an x86 tax, but if there is a 68k tax at all, it is small and well worth the code density advantage, which allows the 68k to save on caches that, as you admit, "can add huge number of transistors". The transistors for caches dwarf the pipeline transistors on modern cores. Modern load/store architectures that pretend to be RISC care about code density now and have abandoned the RISC simplicity, which was bad for performance. A minimal 68060 core may actually be smaller than the minimal AArch64 core today.


Amigaworld.net was originally founded by David Doyle