Forum Index / General Technology (No Console Threads) / Apple moving to arm, the end of x86

kolla 
Re: Apple moving to arm, the end of x86
Posted on 12-Jul-2020 17:05:40
#101
Elite Member
Joined: 21-Aug-2003
Posts: 2885
From: Trondheim, Norway

@Samurai_Crow

What fantasy concept is an 800MHz Vampire?

Or are you perhaps talking about a potential 800MHz AC68080 in ASIC, something that is probably never going to happen?

Why this CONSTANT confusion between a CPU and a family of _FPGA_ cards for Amiga?

Just to stay a little on topic - Quake runs great on Raspberry Pi.

Last edited by kolla on 12-Jul-2020 at 05:07 PM.

_________________
B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC

OneTimer1 
Re: Apple moving to arm, the end of x86
Posted on 12-Jul-2020 20:30:58
#102
Cult Member
Joined: 3-Aug-2015
Posts: 973
From: Unknown

@Samurai_Crow

Quote:

Samurai_Crow wrote:

Once Gunnar gets over his fetish for reconfigurable silicon and comes out with an ASIC ....


I'm sure Gunnar doesn't have a fetish for reconfigurable silicon. If he had enough paying users he would already be selling a Vampire with a 68080 in ASIC.
5000 paying customers would be enough; 5000 people who expressed their interest on a web page are not.

Edit: totally wrong grammar.

Last edited by OneTimer1 on 13-Jul-2020 at 08:59 PM.

megol 
Re: Apple moving to arm, the end of x86
Posted on 12-Jul-2020 20:38:28
#103
Regular Member
Joined: 17-Mar-2008
Posts: 355
From: Unknown

@matthey

Quote:

matthey wrote:
Quote:

NutsAboutAmiga wrote:
https://www.youtube.com/watch?v=AADZo73yrq4

Well, technically RISC can be clocked at higher frequencies because they create less heat. Intel has a problem; I'm sure AMD might have a few more years, but it's the end of Moore's law. The CISC architecture only really exists as a backwards-compatibility layer on top of Intel micro-opcodes, and the micro-opcodes are RISC.


Don't know how many times I've had to correct this, but the internal instruction sets of modern x86 aren't RISC. They are simplified but still complex instructions designed for efficient execution of x86 code. People go on about this, but variable-length instructions with microcode support and condition codes don't sound like a real RISC design.

Quote:

The video compares desktop x86_64 CPUs with high performance cores and crazy amounts of high performance caches to ARM SoCs designed for low power smart phones. The x86_64 CPUs do waste energy breaking down instructions to micro ops but so do high performance ARM cores since the Cortex-A76 including the Cortex-A78 shown in one of the last frames under "Future".

The DEC Alpha tried to take advantage of a simpler RISC ISA and pipeline to clock the CPU much faster than the competition. This was not a good design and was a big reason why DEC disappeared. Higher clock speeds produced elevated heat and the slimmed down cores were weakened. Processor speeds outpaced memory speeds, so the core would only have an advantage when instructions and data were in the caches. Caches need to be small to be fast. RISC usually needs to execute many more simple instructions than CISC, and Alpha code has horrible code density, creating a bottleneck in the ICache. Alpha eventually went to multi-level caches, which helped, but secondary caches are bigger (slower), further from the core (slower) and use more power.

That wasn't the reason behind the Alpha dying; saving money on development and manufacturing was! Intel said they would give them a processor in the Itanium that was not only more efficient but also less expensive - and we all know how that ended.

Up until the day they stopped development of the line, they were superior. Seems to me that their pure RISC focus provided exactly what they wanted?

Itanium had to have multi-level instruction caches, but no Alpha delivered or, IIRC, planned had one.

Multi-level instruction caches can be more efficient than a single level even for compact CISC code; just look at Intel and AMD, which have both reached the same solution with a decoded L0 cache to reduce power (decoders take a lot of power) and also increase throughput in inner loops. A 68k processor would likely have needed that long before x86, given the problem of instruction-length parsing.

Quote:

Quote:

For the simpler processor cores in mobile devices in particular, delivering the instruction
stream is often the single largest source of energy consumption. In the DEC StrongARM-110,
for example, instruction address translation and cache access account for 36% of the chip’s power
dissipation [10]. In a more recent study [5], instruction cache access alone dissipated 40% of the
energy in a five-stage RISC pipeline. Main memory accesses and processor stalls incurred upon
instruction cache misses consume more energy still.


https://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-63.pdf

This is the same paper I linked for Samurai_Crow. It is the thesis of one of the RISC-V guys and the incentive to create the compressed RISC-V encodings. The 68020 ISA has something like 50% better code density than the Alpha, meaning the 68060's 8kiB ICache had the performance of a 32kiB ICache for the Alpha, or the Alpha's 8kiB ICache had the performance of a 2kiB ICache on the 68060. As bad as this sounds, the 68060 has an even bigger ICache advantage as it is 4-way while the Alpha used a direct mapped ICache, which is faster to keep up with the core but suffers from conflict misses. Doubling the associativity, from direct mapped to two-way, or from two-way to four-way, has about the same effect on raising the hit rate as doubling the cache size. Now that first Alpha core's ICache has the performance of a 0.5kiB ICache on a 68060. This doesn't consider the energy use, which is significant, as the Alpha is going to memory, which uses more energy than a cache access. While the Alpha is waiting on memory it can execute all those extra instructions needed by RISC too. It isn't just the 68060 which is more practical. The PPC 604e was a much better design than the Alpha. The large caches limited the clock speed, but high clock speeds are useless without adequate caches. The 68060 had a deeper pipeline than the 604e and more efficiently used smaller caches, likely making it a better candidate to clock up.
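The effective-cache-size arithmetic in the paragraph above can be sketched directly. The rules used here are the post's rules of thumb, not measurements: an n-way cache behaves roughly like a direct mapped cache n times the size, and denser code scales effective capacity (the 4x density factor mirrors the post's "8kiB ~ 32kiB" claim):

```python
def effective_icache_kib(size_kib: float, ways: int, density_factor: float = 1.0) -> float:
    """Rough effective ICache size using the post's rules of thumb:
    each doubling of associativity counts like a doubling of size, and
    denser code multiplies effective capacity by density_factor."""
    return size_kib * ways * density_factor

# 68060: 8 KiB, 4-way; per the post, its denser code is worth another 4x
# versus Alpha code (the 8 KiB ~ 32 KiB claim above).
m68060 = effective_icache_kib(8, 4, density_factor=4)  # 128.0
# Alpha 21064: 8 KiB, direct mapped.
alpha = effective_icache_kib(8, 1)                     # 8.0
print(m68060 / alpha)  # 16.0 -> the Alpha cache acts like 8/16 = 0.5 KiB on a 68060
```

This is only bookkeeping for the post's argument; real hit rates depend on the workload.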

Comparing processors designed for completely different tasks with different target clock frequencies and power usage is dishonest. It's possible to design a bit serial 68k but it shouldn't then be compared with the (unreleased) 21464 and claimed to show the superior transistor efficiency of the CISC design.
The 21264 was superior to any PPC at the time, for the task it was designed for. It wasn't designed to be cheap or low power. It even used more silicon to make its transistors perform better, to be the best performing chip at the time, and it succeeded.

Quote:

Quote:

The PowerPC problem is mismanagement and too high a price; it's just as good as ARM, but the world was stuck on Intel / little-endian instructions, and does not really care about the needs of Amiga users or classic software.


I think the ARM AArch64 ISA is better than the PPC ISA. Better code density, fewer instructions needed, better branch handling and conditional instructions, more powerful addressing modes including better PC-relative support and longer branch displacements important for 64-bit addressing, more standard and more readable. It's not perfect by any means, but it looks to me like a move in the right direction for RISC, toward a higher performance RISC/CISC hybrid.

ARM has never been a pure RISC.

There are better ISAs possible and it wouldn't look anything close to the 68k.

NutsAboutAmiga 
Re: Apple moving to arm, the end of x86
Posted on 12-Jul-2020 20:49:32
#104
Elite Member
Joined: 9-Jun-2004
Posts: 12817
From: Norway

@megol

How many extra transistors do you think a 680x0 CPU would need to support 64-bit and 32-bit instructions, multi-core, MMU, FPU, and all the other stuff?

_________________
http://lifeofliveforit.blogspot.no/
Facebook::LiveForIt Software for AmigaOS

megol 
Re: Apple moving to arm, the end of x86
Posted on 12-Jul-2020 20:51:28
#105
Regular Member
Joined: 17-Mar-2008
Posts: 355
From: Unknown

@Samurai_Crow
Current Vampires run at about 100MHz, and while FPGA-targeted 68k cores could clock higher, that's still fantasy. It's like comparing the 68000 with the DEC 21064 and saying that at 10GHz it would win, when the 68000 core can never be clocked at 10GHz.

matthey 
Re: Apple moving to arm, the end of x86
Posted on 13-Jul-2020 2:35:27
#106
Elite Member
Joined: 14-Mar-2007
Posts: 2000
From: Kansas

Quote:

megol wrote:
That wasn't the reason behind the Alpha dying; saving money on development and manufacturing was! Intel said they would give them a processor in the Itanium that was not only more efficient but also less expensive - and we all know how that ended.

Up until the day they stopped development of the line, they were superior. Seems to me that their pure RISC focus provided exactly what they wanted?


The DEC Alpha was in trouble before the Itanium. Initially, the Alpha 21064 was a leap in performance, mostly because of its high clock speeds. DEC gained a large share of the high performance CPU market and thought they were untouchable, so they bet the farm on Alpha. Competitors soon caught up in performance with more practical but lower clocked designs. The PPC 604 exceeded the performance of the Alpha 21064 despite running at half the clock speed. The 200 MHz Pentium Pro had similar performance to the Alpha 21164 at 300 MHz. The later MIPS R10000 and HP PA-RISC 8000 had competitive if not better performance than the Alpha 21164 as well. Alpha cores usually were the highest clocked cores but were not always the best performing cores. These other processors were more practical than the extreme Alphas. High clock speeds were good for marketing, but this was not enough for the Alpha. Compatibility with x86 played a big role in the demise of the Alpha too, as the PC clone market was already unleashed and customers could see the performance of games. The advantage of compatibility was often underestimated in those days.

Quote:

Itanium had to have multi-level instruction caches but no Alpha delivered or IIRC planned had it.

Multi-level instruction caches can be more efficient than single level even for compact CISC code, just look at Intel and AMD that both have reached the same solution with a decoded L0 cache to reduce power (decoders take a lot of power) and also increase the throughput in inner loops. A 68k processor would likely have needed that long before x86 given the problem of instruction length parsing.


The Alpha 21164 was the first CPU to have a secondary cache on die (called the S-cache instead of L2) and had support for an external (L3) B-cache as well.

L1: 8kiB ICache direct mapped, 8kiB DCache direct mapped and dual ported
L2: 96kiB 3 way
L3: 1MB-64MB optional external direct mapped

This is a huge cache improvement over the Alpha 21064, although the L1 ICache is going to miss often, as I already pointed out. I like the small dual ported L1 DCache, but it would probably be better to slow down the clock speed some and make these L1 caches at least 2-way.

I agree that multi-level caches are important for CISC too. I like the idea of keeping cache access cycles low with smaller lowest level caches. Much of the performance of x86 has probably come from small loop and/or stack caches acting as a kind of L0 cache. I have read some about more conventional L0 caches although the Qualcomm Krait is the only core that I have seen use one and it was only for energy savings. It is likely possible to improve performance as well as save energy with an L0.
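The appeal of a small L0 can be put in numbers with the standard average-memory-access-time model. The hit rates and latencies below are made-up illustrative values, not figures for any real core:

```python
def amat(levels, mem_latency):
    """Average memory access time for a serial cache lookup.
    levels: (local_hit_rate, latency_cycles) pairs, fastest first."""
    total, p_reach = 0.0, 1.0
    for hit_rate, latency in levels:
        total += p_reach * latency        # every access reaching this level pays its latency
        p_reach *= (1.0 - hit_rate)       # the misses fall through to the next level
    return total + p_reach * mem_latency  # remaining misses go all the way to memory

# Hypothetical numbers: L1 3 cycles / 95% hits, L2 12 cycles / 90% hits, memory 200 cycles.
without_l0 = amat([(0.95, 3), (0.90, 12)], 200)          # 4.6 cycles average
# Add a tiny 1-cycle L0 that catches 60% of fetches (e.g. hot loops).
with_l0 = amat([(0.60, 1), (0.95, 3), (0.90, 12)], 200)  # 2.84 cycles average
```

The same structure explains the energy argument: accesses that hit the L0 never pay the larger L1 array (or, for a decoded µop cache, the decoders) at all.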

Quote:

Comparing processors designed for completely different tasks with different target clock frequencies and power usage is dishonest. It's possible to design a bit serial 68k but it shouldn't then be compared with the (unreleased) 21464 and claimed to show the superior transistor efficiency of the CISC design.
The 21264 was superior to any PPC in the time and for the task it was designed. It wasn't designed to be cheap or low power. It even used more silicone to make its transistors perform better to be the best performing chip at the time and it succeeded.


All cores compared were general purpose cores. There is nothing dishonest about comparing cores, although it is important to understand the target purpose and limitations. Every choice is a tradeoff. The extreme clock speeds of Alpha cores were not the best for performance and were poor for power and area. The PPC 604 was the last PPC core to be competitive in performance, but that was partially because of choice, probably a good choice. The PPC G3/G4 design was overall better than most Alpha designs. The multi-level caches DEC pioneered made the PPC G3/G4 design more efficient than the 604(e).

Quote:

ARM have never been a pure RISC.

There are better ISAs possible and it wouldn't look anything close to the 68k.


True. ARM never was pure RISC. ARM ISA designers have done a good job of thinking outside the box.

The 68k ISA may not be the best for performance but I believe it can have one of the best performances for a highly compressed ISA. Code density is among the best, instruction counts are low and memory traffic is low. There is not enough research going into ISAs as ARM and x86_64 have stagnated development. There is RISC-V which is interesting but boring.

Last edited by matthey on 13-Jul-2020 at 02:54 AM.

bison 
Re: Apple moving to arm, the end of x86
Posted on 13-Jul-2020 5:31:53
#107
Elite Member
Joined: 18-Dec-2007
Posts: 2112
From: N-Space

@matthey

Quote:
There is RISC-V which is interesting but boring.

How can this be?

_________________
"Unix is supposed to fix that." -- Jay Miner

tlosm 
Re: Apple moving to arm, the end of x86
Posted on 13-Jul-2020 8:04:48
#108
Elite Member
Joined: 28-Jul-2012
Posts: 2746
From: Amiga land

Today's news:
Apple's ARM dev kit (Xcode?) gives the opportunity to make universal binaries, and PPC is included too.

_________________
I love Amiga and new hope by AmigaNG
A 500 + ; CDTV; CD32;
PowerMac G5 Quad 8GB,SSD,SSHD,7800gtx,Radeon R5 230 2GB;
MacBook Pro Retina I7 2.3ghz;
#nomorea-eoninmyhome

megol 
Re: Apple moving to arm, the end of x86
Posted on 13-Jul-2020 13:15:47
#109
Regular Member
Joined: 17-Mar-2008
Posts: 355
From: Unknown

@NutsAboutAmiga
Depends on how it's done. Using a prefix design will not increase the size of the decoders much (more down below); however, adding a more efficient instruction encoding could potentially double the transistors. Simple execution units would double in size in the worst case, while the multiplication and division unit(s?) could either stay the same size or increase greatly, depending on whether 64*64->128 bit and 128/64->64:64 bit variants should be added. Caches wouldn't change much, but the wider data paths add a little.
Decoding with an optional prefix would add a number of simple prefix decoders, plus widening of the logic to extract 64-bit data and some additional bits here and there.
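To make the prefix idea concrete, here is a toy decode loop in Python. The prefix word value and the whole encoding are hypothetical, purely to illustrate how little an optional width prefix adds to the decode path:

```python
WIDTH_PREFIX = 0x71FF  # hypothetical: an unused opcode word repurposed as a 64-bit prefix

def decode(words):
    """Toy decode loop: an optional width prefix promotes the following
    instruction word to a 64-bit operation; everything else decodes as before,
    so the only added decode cost is one extra comparison per word."""
    decoded, i = [], 0
    while i < len(words):
        width = 32
        if words[i] == WIDTH_PREFIX and i + 1 < len(words):
            width = 64
            i += 1  # consume the prefix word
        decoded.append((words[i], width))
        i += 1
    return decoded

print(decode([0x2040]))          # [(0x2040, 32)]  - unchanged 32-bit decode
print(decode([0x71FF, 0x2040]))  # [(0x2040, 64)]  - same opcode, promoted to 64-bit
```

The tradeoff the post describes is visible here: decode stays cheap, but every 64-bit operation costs an extra 16-bit word of code size versus a natively re-encoded 64-bit instruction.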

@matthey
Quote:

matthey wrote:
The DEC Alpha was in trouble before the Itanium. Initially, the Alpha 21064 was a leap in performance mostly because of the high clock speeds. DEC gained a large share of the high performance CPU market and thought they were untouchable so bet the farm on Alpha. Competitors soon caught up in performance with more practical but lower clocked designs. The PPC 604 exceeded the performance of the Alpha 21064 despite running at half the clock speed. The 200 MHz Pentium Pro had similar performance as the Alpha 21164 at 300 MHz. The later MIPS R10000 and HP PA-RISC 8000 had competitive if not better performance than the Alpha 21164 as well. Alpha cores usually were the highest clocked cores but were not always the best performance cores. These other processors were more practical than the extreme Alphas. High clock speeds were good for marketing but this was not enough for the Alpha. Compatibility for x86 played a big role in the demise of the Alpha too as the PC clone market was already unleashed and customers could see the performance of games. The advantage of compatibility was often underestimated in those days.

Clock rate has been the main driver behind computer performance, only recently starting to be displaced by other bottlenecks, but efficient clock rate through power management is still the main performance driver. The Alpha was designed to be the best performing design possible and succeeded in just that.
Yes, it was not always the best performing core in some benchmarks; however, that (and the x86 compatibility angle) misses the point that it was designed to replace previous DEC designs and to provide superior floating point performance. Bottlenecks are always there in any design; for instance, Alphas could be very slow accessing unaligned data, as they didn't support it in hardware and in the worst case a trap-and-emulate sequence would be needed. For the Intel Pentium Pro that was not a problem, but running legacy x86 code could slow it down significantly; of course, that was not shown in the SpecInt benchmarks you reference above.
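The unaligned-access cost exists because such a load must be composed from aligned ones. This Python sketch mimics the classic two-aligned-loads-plus-shifts sequence a compiler (or trap handler) would use on hardware without unaligned load instructions:

```python
def load_unaligned_u64(mem: bytes, addr: int) -> int:
    """Compose an unaligned little-endian 64-bit load from two aligned
    8-byte loads plus shifts - the kind of sequence emitted for hardware
    (like the Alpha) that has no unaligned load instruction."""
    base = addr & ~7                                   # round down to 8-byte alignment
    lo = int.from_bytes(mem[base:base + 8], "little")  # first aligned quadword
    shift = (addr - base) * 8
    if shift == 0:
        return lo                                      # already aligned: a single load
    hi = int.from_bytes(mem[base + 8:base + 16], "little")  # second aligned quadword
    return ((lo >> shift) | (hi << (64 - shift))) & ((1 << 64) - 1)

mem = bytes(range(32))
assert load_unaligned_u64(mem, 3) == int.from_bytes(mem[3:11], "little")
```

Even the in-line version costs two loads plus shift/mask work in place of one load, which is why heavily unaligned data was painful on such designs.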

The Pentium Pro 200MHz used a 0.35µm BiCMOS process while the 21164 300MHz used a 0.5µm process; one has to remember that, unlike now, Intel's process technology was absolutely superior in those days.

Don't know exactly which PPC 604 and which 21064 you are comparing; however, the PPC 604 was released at the end of 1994 and the 21064 in 1992, an eternity apart. The PPC used a 0.5µm process and the 21064 a 1µm process. A better comparison would be the 21164, which was released in 1995 on a 0.5µm process.

DEC expected to need a much higher clock rate than other designs, so the fact that some lower clocked design could compete isn't strange, nor does it show the design to be bad. What is the importance of a lower clock rate with higher performance per clock (brainiac) when the higher clocked design with lower performance per clock (speed demon) performs better in the end? Well, it would matter if power were a design target, but for the Alphas it didn't matter as long as they could be cooled.

Quote:

The Alpha 21164 was the first CPU to have a secondary cache on die (called the S-cache instead of L2) and had support for an external (L3) B-cache as well.

L1: 8kiB ICache direct mapped, 8kiB DCache direct mapped and dual ported
L2: 96kiB 3 way
L3: 1MB-64MB optional external direct mapped

This is a huge cache improvement over the Alpha 21064, although the L1 ICache is going to miss often, as I already pointed out. I like the small dual ported L1 DCache, but it would probably be better to slow down the clock speed some and make these L1 caches at least 2-way.

Yes but that's not a multi-level instruction cache as such? I assume that the designers simulated their target workloads and found this design to be the best performing. The processor microarchitects at DEC were among the absolute best at the time.

Couldn't find hit rates for the L1 ICache with a quick search. Maybe later.

Quote:

I agree that multi-level caches are important for CISC too. I like the idea of keeping cache access cycles low with smaller lowest level caches. Much of the performance of x86 has probably come from small loop and/or stack caches acting as a kind of L0 cache. I have read some about more conventional L0 caches although the Qualcomm Krait is the only core that I have seen use one and it was only for energy savings. It is likely possible to improve performance as well as save energy with an L0.

For variable-length instruction sets, a smaller L0 cache makes it possible to increase the fetch rate too.

Quote:

All cores compared were general purpose cores. There is nothing dishonest about comparing cores, although it is important to understand the target purpose and limitations. Every choice is a tradeoff. The extreme clock speeds of Alpha cores were not the best for performance and were poor for power and area. The PPC 604 was the last PPC core to be competitive in performance, but that was partially because of choice, probably a good choice. The PPC G3/G4 design was overall better than most Alpha designs. The multi-level caches DEC pioneered made the PPC G3/G4 design more efficient than the 604(e).

Better in what way? Highest memory bandwidth, throughput, power? DEC cared about the first two, not the last. It wasn't designed to be inexpensive and "wasted" a lot of expensive silicon to provide high performance; the 21164's double ported cache is one example, where the cache isn't a normal design but two caches in parallel with a shared write port - twice the size, but removing complications from the read path.

Quote:

True. ARM never was pure RISC. ARM ISA designers have done a good job of thinking outside the box.

The 68k ISA may not be the best for performance but I believe it can have one of the best performances for a highly compressed ISA. Code density is among the best, instruction counts are low and memory traffic is low. There is not enough research going into ISAs as ARM and x86_64 have stagnated development. There is RISC-V which is interesting but boring.

RISC-V isn't my favorite design by far. The My66000 effort of a regular comp.arch poster is, however, tempting, but not exactly what I'd call perfect: RISC with complex addressing mode support, limited variable length support (data only, not instructions), etc. The vector extension in development is interesting, with added metadata allowing the processor to convert a loop into parallel execution without specific SIMD instructions.
And while it's not a traditional RISC, it's still designed to be efficient in hardware, being the project of an experienced designer.

Hammer 
Re: Apple moving to arm, the end of x86
Posted on 13-Jul-2020 17:08:57
#110
Elite Member
Joined: 9-Mar-2003
Posts: 5275
From: Australia

@matthey

FYI, Zen 2 has an L0 cache. https://en.wikichip.org/wiki/amd/microarchitectures/zen_2

L0 Op Cache:
4,096 Ops, 8-way set associative
64 sets, 8 Op line size

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

Hammer 
Re: Apple moving to arm, the end of x86
Posted on 13-Jul-2020 17:15:20
#111
Elite Member
Joined: 9-Mar-2003
Posts: 5275
From: Australia

@megol

x86 translates variable-length instructions into fixed-length internal instructions, which is one of the RISC concepts.

Simple x86 instructions have a 1:1 translation with the internal instruction set.

Modern x86 CPUs like Zen have separate load and store units.

From https://arstechnica.com/features/2005/02/amd-hammer-1/5/

Quote:

A RISC instruction set's fixed-length instruction format does more than just simplify processor fetch and decode hardware; it also simplifies dynamic scheduling, making the instruction stream easier to reorder in the execution core.

In addition to being fixed-length, RISC instructions are also atomic in that each instruction tells the computer to perform one specific and carefully delimited task (e.g. multiply, divide, load, store, shift, rotate, etc.). A single x86 instruction, in contrast, can specify a whole series of tasks, e.g. a memory access followed by an arithmetic instruction, a multi-step BCD conversion, a multi-step string manipulation, etc.

This non-atomic aspect of x86 instructions renders them pretty well impossible for the execution core to reorder as-is. So in order for an x86 processor's instruction window to be able to rearrange the instruction stream for optimal execution, x86 instructions must first be converted into an instruction format that's uniform in size and atomic in function. This conversion process is called instruction set translation, and all modern x86 processors do some form of it.

AMD's Athlon and Hammer translate x86 instructions into sequences of small, RISC-like instructions called MacroOps. A MacroOp consists of either one or two parts; single-part MacroOps can be arithmetic operations or memory accesses, while two-part MacroOps consist of an arithmetic operation and memory access (a load or load-store). Note that two-part MacroOps are split at execution time, with the arithmetic operation going to the appropriate ALU and the memory access going to an AGU.

In general, x86 instructions can be categorized based on how many MacroOps they break down into. Most x86 operations break down into one or two MacroOps, while a small minority break down into more than two. The Athlon has two types of decoders: a hardware decoder for single- or two-MacroOp instructions and a microcode decoder for all the rest.
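The load-op splitting the article describes can be mimicked with a toy translator. The mnemonics and the "tmp0" register are invented for illustration; this is not AMD's actual MacroOp format:

```python
def translate(instr: str):
    """Split a toy 'op dst, src' x86-style instruction into RISC-like
    micro-ops: a memory source becomes a separate load into a temp register
    (the AGU's job), followed by a register-register op (the ALU's job).
    Register-only instructions translate 1:1."""
    op, operands = instr.split(None, 1)
    dst, src = (part.strip() for part in operands.split(","))
    if src.startswith("["):                 # memory operand: two-part split
        return [f"load tmp0, {src}", f"{op} {dst}, tmp0"]
    return [f"{op} {dst}, {src}"]           # register operand: 1:1 translation

print(translate("add eax, [rbx]"))  # ['load tmp0, [rbx]', 'add eax, tmp0']
print(translate("add eax, ebx"))    # ['add eax, ebx']
```

Once split like this, each piece is fixed-format and atomic, which is what lets the out-of-order core reorder them freely.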


The ASM code generated by Microsoft's and Intel's C++ compilers follows optimal x86 CPU instruction paths.

------
Intel's vs TSMC's vs Samsung's process tech can't be directly compared by simple nm marketing numbers.

https://www.pcgamer.com/au/chipmaking-process-node-naming-lmc-paper/
Quote:

Intel reports a density of 100.76MTr/mm2 (mega-transistor per squared millimetre) for its 10nm process, while TSMC's 7nm process is said to land a little behind at 91.2MTr/mm2 (via Wikichip)


Last edited by Hammer on 13-Jul-2020 at 05:29 PM.


Hammer 
Re: Apple moving to arm, the end of x86
Posted on 13-Jul-2020 17:47:39
#112
Elite Member
Joined: 9-Mar-2003
Posts: 5275
From: Australia

@Samurai_Crow

Quote:

Samurai_Crow wrote:
@ppcamiga1

Once Gunnar gets over his fetish for reconfigurable silicon and comes out with an ASIC it will clock much faster. The per-clock performance of the 68080 is equivalent to a Core2 Solo already.

Core 2 has 128-bit wide SIMD with SSE2/SSE3/SSSE3 (floating point and integers) instead of the 68080's MMX (integer only).



Intel's MMX 64-bit SIMD was inferior to AMD's 3DNow! 64-bit SIMD (floating point and integers). Nintendo's PowerPC 750 CPU has a custom 64-bit SIMD.

I'd rather see accelerated BVH (bounding volume hierarchy) hardware integrated with the CPU core, which would benefit raytraced graphics, physics, and audio workloads.

Last edited by Hammer on 13-Jul-2020 at 05:57 PM.


matthey 
Re: Apple moving to arm, the end of x86
Posted on 13-Jul-2020 22:36:29
#113
Elite Member
Joined: 14-Mar-2007
Posts: 2000
From: Kansas

Quote:

megol wrote:
Depends on how it's done. Using a prefix design will not increase the size of the decoders much (more down below); however, adding a more efficient instruction encoding could potentially double the transistors. Simple execution units would double in size in the worst case, while the multiplication and division unit(s?) could either stay the same size or increase greatly, depending on whether 64*64->128 bit and 128/64->64:64 bit variants should be added. Caches wouldn't change much, but the wider data paths add a little.
Decoding with an optional prefix would add a number of simple prefix decoders, plus widening of the logic to extract 64-bit data and some additional bits here and there.


Prefixes are not necessary. Gunnar does not use prefixes, although the core offers minimal 64-bit support other than SIMD instructions, especially for 64-bit addressing (not currently used by AmigaOS, but AROS could be testing 64-bit 68k addressing). I prefer a 64-bit mode which is partially re-encoded and does not need a prefix. A-line can provide MOVE.Q, and re-encoding can provide OP.Q instructions, which simplifies decoding (Gunnar wanted to make this encoding simplification, but it causes incompatibility if not done in a separate mode). The 32-bit 68k mode could be dropped for implementations which do not need compatibility. I believe better performance, security and 64-bit code density can be provided with a separate 64-bit mode. It should be possible to allow 32-bit mode processes for compatibility, like ARM modes.

The Apollo Core has 64 bit registers and limited 64 bit operations. The justification is probably that the SIMD instruction performance increases offset some of the slow down and transistor costs of 64 bit in the limited FPGA space. This shows that the cost of 64 bit is not much even though a SIMD unit with 64 bit registers and no floating point support likely has limited appeal.
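For reference, the free encoding space the A-line suggestion leans on is easy to test for. This snippet only checks the classic 68k "A-line" bit pattern; any MOVE.Q living there is the post's speculation, not an existing ISA:

```python
def is_line_a(opword: int) -> bool:
    """68k opcode words whose top four bits are 1010 ('A-line') are
    unallocated and trap on classic CPUs, leaving that sixteenth of the
    16-bit opcode space free for new encodings such as the post's
    hypothetical 64-bit MOVE.Q."""
    return (opword >> 12) == 0b1010

print(is_line_a(0xA123))  # True  - A-line, free for new encodings
print(is_line_a(0x2040))  # False - an existing 68k MOVE encoding
```

Because only the top nibble is examined, a decoder can route the whole A-line region to new 64-bit behavior without touching the decode of any legacy instruction.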

Quote:

Clock rate has been the main driver behind computer performance, only recently starting to be displaced by other bottlenecks, but efficient clock rate through power management is still the main performance driver. The Alpha was designed to be the best performing design possible and succeeded in just that.


General purpose performance came from pipelining, caches, superscalarity, OoO, super-pipelining, clock increases and SMT/SMP. Clock increases have been important but have occurred gradually over time, mostly made possible by die shrinks. If power management is important, OoO, super-pipelining and clock increases may not be worthwhile.

Alpha architects were no doubt some of the best at that time and likely made greater contributions to technology than other teams. Still, their extreme designs were not always the best. These designs often had bottlenecks, were difficult to program, and the ISA was primitive, with one of the worst code densities of any RISC ISA ever.

Quote:

Yes, it was not always the best performing core in some benchmarks; however, that (and the x86 compatibility angle) misses the point that it was designed to replace previous DEC designs and to provide superior floating point performance. Bottlenecks are always there in any design; for instance, Alphas could be very slow accessing unaligned data, as they didn't support it in hardware and in the worst case a trap-and-emulate sequence would be needed. For the Intel Pentium Pro that was not a problem, but running legacy x86 code could slow it down significantly; of course, that was not shown in the SpecInt benchmarks you reference above.

The Pentium Pro 200MHz used a 0.35µm Bi-CMOS process while the 21164 300MHz used a 0.5µm process, one have to remember that unlike now Intel process technology were absolutely superior in those days.

Don't know exactly what PPC 604 and what 21064 you are comparing however the PPC 604 was released in the end of 1994 and the 21064 in 1992, an eternity. The PPC used a 0.5µm process and the 21064 a 1µm process. A better comparison would be the 21164 that was released in 1995 at a 0.5µm process.


Die shrinks were the name of the game at that time and were partially responsible for the short-lived performance crowns. The 604 was performance king between the Alpha 21064 and the 21064A, which was also on a 0.5µm process but doubled the I and D caches. Even benchmark code may have been falling out of the ICache on the 21064, slowing it to memory speed. The large caches of the 604 were much better for multitasking and server use, while the Alpha 21064 caches were more appropriate for embedded applications with small, reused code.

PPC would have had a highly clocked contender if the Exponential X704 had made it to market (533MHz target around 1997).

Exponential PPC X704
L1: 2kiB ICache direct mapped, 2kiB DCache direct mapped
L2: 32kiB 8 way
L3: 512kiB-2MiB optional external direct mapped

The L1 is tiny but completely eliminated the load-use penalty. There may have been room for further clocking up with these small caches, although problems kept them from even achieving their initial target clock rating. Exponential was a startup that Apple strung along before cancelling their contract, citing the lower than estimated clock ratings as a breach.

Quote:

That DEC expected to need a much higher clock rate than other designs, so that a lower clocked one could compete, isn't strange and doesn't show the design to be bad. What does a lower clock rate with higher performance per clock (brainiac) matter when the higher clocked design with lower performance per clock (speed demon) performs better in the end? Well, it would matter if power were a design target, but for the Alphas it didn't, as long as they could be cooled.


Alpha showed the world how much heat is produced when clocking up, which was more than many people expected. Unfortunately for DEC, the power draw of their cores kept them out of the embedded market. Exponential also found itself without customers for its highly clocked chips. On the other hand, the startup P.A. Semi had embedded customers lined up for its low power PWRficient design and was acquired by Apple for this technology. Ironically, P.A. Semi was founded by Daniel Dobberpuhl (RIP 2019), the lead designer of the Alpha 21064.

Quote:

Yes but that's not a multi-level instruction cache as such? I assume that the designers simulated their target workloads and found this design to be the best performing. The processor microarchitects at DEC were among the absolute best at the time.

Couldn't find hit rates in the L1 I cache with a quick search. Maybe later.


An L2 cache is usually unified. If there are two levels of split caches, they are usually called an L0+L1 backed by a unified L2 cache, not that the latter arrangement is common or the terminology standardized.

Quote:

For variable length instruction sets a smaller L0 cache makes it possible to increase fetch rate too.


I did mention performance. Keeping the DCache small and close is probably more important to performance, especially for an in order design.

Quote:

Better in what way? Highest memory bandwidth, throughput, power? DEC cared about the first two, not the last. It wasn't designed to be inexpensive and "wasted" a lot of expensive silicon to provide high performance; the 21164's double-ported cache is one example, where the cache isn't a normal design but two caches in parallel with a shared write port: twice the size, but removing complications in the read path.


You can make the argument that the Alpha cores were throughput cores but, as far as I know, that is not how they were usually used. Weren't they used in high end PCs and workstations?

Quote:

RISC-V isn't my favorite design by far. The My66000 effort of a regular comp.arch poster is, however, tempting, though not exactly what I'd call perfect: RISC with complex addressing mode support, limited variable length support (data only, not instructions), etc. The vector extension in development is interesting, with added metadata allowing the processor to convert a loop into parallel execution without specific SIMD instructions.
And while it's not a traditional RISC, it's still designed to be efficient in hardware, being the project of an experienced designer.


The SonicBOOM RISC-V core has done a good job of adding performance enhancing features without adding complexity to the ISA. RISC-V open cores should gain market share among low to mid performance cores. The compressed extension is now supported by Linux and many new cores. It will be interesting to see what they standardize on for SIMD support. I just can't get excited about it.

 Status: Offline
Profile     Report this post  
matthey 
Re: Apple moving to arm, the end of x86
Posted on 13-Jul-2020 22:50:50
#114 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2000
From: Kansas

Quote:

Hammer wrote:
FYI, Zen 2 has an L0 cache. https://en.wikichip.org/wiki/amd/microarchitectures/zen_2

L0 Op Cache:
4,096 Ops, 8-way set associative
64 sets, 8 Op line size


The "L0 Op Cache" looks to me like a loop cache or maybe a small trace cache. I would *not* consider it an L0 ICache, although it provides some of the same benefits as well as removing some of the power used for decoding.

 Status: Offline
Profile     Report this post  
Hammer 
Re: Apple moving to arm, the end of x86
Posted on 14-Jul-2020 4:41:50
#115 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5275
From: Australia

@matthey

Quote:

matthey wrote:
Quote:

Hammer wrote:
FYI, Zen 2 has an L0 cache. https://en.wikichip.org/wiki/amd/microarchitectures/zen_2

L0 Op Cache:
4,096 Ops, 8-way set associative
64 sets, 8 Op line size


The "L0 Op Cache" looks to me like a loop cache or maybe a small trace cache? I would *not* consider this to be an L0 ICache although it provides some of the same benefits as well as removing some of the power used for decoding.



Zen 2's Op Cache holds 4K instructions.

The branch prediction unit feeds both the L1 instruction cache and the Op Cache.

Last edited by Hammer on 14-Jul-2020 at 04:44 AM.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

 Status: Offline
Profile     Report this post  
MEGA_RJ_MICAL 
Re: Apple moving to arm, the end of x86
Posted on 14-Jul-2020 5:53:45
#116 ]
Super Member
Joined: 13-Dec-2019
Posts: 1200
From: AMIGAWORLD.NET WAS ORIGINALLY FOUNDED BY DAVID DOYLE

Such a constructive discussion!

Makes me want to post a cool diagram too!

_________________
I HAVE ABS OF STEEL
--
CAN YOU SEE ME? CAN YOU HEAR ME? OK FOR WORK

 Status: Offline
Profile     Report this post  
LarsB 
Re: Apple moving to arm, the end of x86
Posted on 14-Jul-2020 9:27:09
#117 ]
Regular Member
Joined: 29-Jul-2019
Posts: 104
From: Unknown

@MEGA_RJ_MICAL You read my thoughts ;)

 Status: Offline
Profile     Report this post  
LarsB 
Re: Apple moving to arm, the end of x86
Posted on 14-Jul-2020 9:48:45
#118 ]
Regular Member
Joined: 29-Jul-2019
Posts: 104
From: Unknown

https://forums.hollywood-mal.com/viewtopic.php?f=8&t=3114

 Status: Offline
Profile     Report this post  
evilFrog 
Re: Apple moving to arm, the end of x86
Posted on 14-Jul-2020 13:03:42
#119 ]
Regular Member
Joined: 20-Jan-2004
Posts: 397
From: UK

@MEGA_RJ_MICAL

Disappointed you didn’t post the circuit diagram for ZORRAM.

_________________
"Knowledge is power. Power corrupts. Study hard, be evil."

 Status: Offline
Profile     Report this post  
megol 
Re: Apple moving to arm, the end of x86
Posted on 14-Jul-2020 20:48:51
#120 ]
Regular Member
Joined: 17-Mar-2008
Posts: 355
From: Unknown

@Hammer

Quote:

Hammer wrote:
@megol

x86 translates variable-length instructions into fixed-length internal instructions, which is one of the RISC concepts.

But not the only one, and that feature isn't really a defining characteristic anymore. ARM, MIPS and RISC-V all have variable-length encodings; AArch64 aka ARM64 doesn't, though.

Not all modern x86 use fixed length internal instructions BTW.

Quote:

Simple x86 instructions have 1:1 translation with the internal instruction set.

Modern X86 CPU like Zen has separate load and store units

From https://arstechnica.com/features/2005/02/amd-hammer-1/5/

Yes, and? The instructions are still stored as macro instructions that are split into one or more operations when the right time comes; they are treated as one instruction for retirement purposes, among other complications. The reason for separate AGU and load/store units is to increase instruction throughput in an out-of-order execution design, simple as that.

Now realize that a high performance x86 requires complicated per-instruction tracking to support very rare special cases where an instruction has to be re-executed in a weird way to stay compatible, or has to track some instruction chunks as one instruction. There are micro-exceptions to handle special cases for common instructions (I don't know if current Intel and AMD designs still do that). A microcode sequence is treated as a single instruction even though it is a complex sequence of operations that must be flushed if misspeculated, requiring extra tracking in timing sensitive paths on top of the more common but still complex flushing.

x86 isn't RISC.

Quote:

The ASM code generated from Microsoft and Intel C++ compiler development follows optimal X86 CPU instruction paths.

Let's say they do generate optimal code (they don't); what would the significance be?

Quote:

Intel vs TSMC vs Samsung process tech can't be directly compared by a simple nm marketing.

Never said it can be. But if you think Intel's 14nm++++++++++++++ process is better than TSMC's 7nm++ currently in production, well...
Note that Intel hasn't used their newer 10nm-the-return-and-actually-working-kind-of process for high performance designs, which illustrates your point that simple nm marketing isn't worth much.

 Status: Offline
Profile     Report this post  

Copyright (C) 2000 - 2019 Amigaworld.net.
Amigaworld.net was originally founded by David Doyle