Forum Index / General Technology (No Console Threads) / APX: Intel's new architecture
cdimauro 
APX: Intel's new architecture
Posted on 4-Sep-2023 6:02:23
#1
Elite Member
Joined: 29-Oct-2012
Posts: 3161
From: Germany

I've written a series of eight articles about Intel's new architecture: APX.
The last article closes the series by taking stock of the situation and offering some reflections. It also includes links to all the previous articles.
English: APX: Intel’s new architecture – 8 – Conclusions
Italian: APX: la nuova architettura di Intel – 8 – Conclusioni

amigagr 
Re: APX: Intel's new architecture
Posted on 4-Sep-2023 23:00:02
#2
Member
Joined: 2-Sep-2022
Posts: 16
From: Thessaloniki, Greece

@cdimauro

It's a good opportunity for me to learn computer terminology in Italian. Thank you very much!

matthey 
Re: APX: Intel's new architecture
Posted on 5-Sep-2023 10:07:15
#3
Super Member
Joined: 14-Mar-2007
Posts: 1757
From: Kansas

@cdimauro
Nice analysis and write up for APX. I hadn't heard of it before (or the x86-S simplification). When is it due to be implemented in Intel CPU cores?

I have a few comments and suggestions as an armchair technical proofreader.

cdimauro Quote:

Exactly the same could be said of ARM and its also blazoned 32-bit architecture, which, however, had the courage to put its hands to the project and re-establish it on new foundations when it decided to bring out its own 64-bit extension, AArch64 AKA AMD64, which is not compatible with the previous 32-bit ISA (although it has much in common and bringing applications to it does not require a total rewrite).


For part 8, should this have been "AArch64 AKA ARMv8-A"? You used AKA properly right before it so you obviously know what it means and it is even highlighted in red as a link yet it went unnoticed in both the English and Italian article.

cdimauro Quote:

This sounds rather strange to me, since I still remember very well how AMD had claimed, when introducing x86-64 AKA x64, to have evaluated the extension of x86 to 32 instead of 16 registers, but to have given up because the advantages did not prove to be significant (contrary to the switch from 8 to 16 registers, where the differences, instead, were quite tangible, as we have seen for ourselves) and did not justify the greater implementation complexity of such a solution.


For part 3, you may consider changing "x86-64 AKA x64" to "AMD64 AKA x64" since I believe that is what AMD originally called the ISA and there may be minor differences between the Intel x86-64 implementation.

https://people.eecs.berkeley.edu/~krste/papers/waterman-ms.pdf Quote:

RVC is a superset of the RISC-V ISA, encoding the most frequent instructions in half the size of a RISC-V instruction; the remaining functionality is still accessible with full-length instructions. RVC programs are 25% smaller than RISC-V programs, fetch 25% fewer instruction bits than RISC-V programs, and incur fewer instruction cache misses.



RVC improves performance substantially when the instruction working set does not fit in cache. For 6 of the 8 cache configurations, using RVC is more effective than doubling the associativity. Using RVC attains, on average, 80% of the speedup of doubling the cache size. A system with a 16 KB direct-mapped cache with RVC is 99% as fast as a system with a 32 KB direct-mapped cache without RVC.


For part 5, I see you took a play from my playbook by pulling out the RISC-V code density studies and documentation. Much of the RISC-V documentation is written to make RVC code density look good by comparing against the uncompressed RISC-V ISA, which resembles the old, extinct, fat RISC "desktop" ISAs like SPARC, MIPS, PA-RISC and Alpha. The more universal code density RISC-V research I like to quote is the following.

The RISC-V Compressed Instruction Set Manual Version 1.7 Quote:

The philosophy of RVC is to reduce code size for embedded applications and to improve performance and energy-efficiency for all applications due to fewer misses in the instruction cache. Waterman shows that RVC fetches 25%-30% fewer instruction bits, which reduces instruction cache misses by 20%-25%, or roughly the same performance impact as doubling the instruction cache size.


It's probably the same research and researcher, Waterman, but this is shorter and more easily understood in my opinion. In other words, every 25%-30% improvement in code density is roughly like doubling the size of the instruction cache.

The 1995 DEC Alpha 21164 CPU using the AXP ISA demonstrates the turning point of RISC fallacies: the L1 ICache had to be reduced to 8kiB to maintain timing at high clock speeds, but, given the poor code density, it performed more like a 1-2kiB L1 ICache would on a 68020 CPU. The Alpha 21164 pioneered the on-chip L2 cache, but the 96kiB L2 pushed the chip to 9.3 million transistors; it drew 56W at 333MHz, and the 433MHz version cost $1,492 in 1996. The 1994 68060 used a similar chip fab process and had one more pipeline stage than the Alpha 21164 (which is better for clocking up), yet used only 2.5 million transistors, ran cool enough at low clock speeds for a mobile device, and cost a fraction of the price. It was the 1995 Pentium Pro at 200MHz that was first to outperform a 300MHz Alpha 21164 in SPECint95 benchmarks, though. The Pentium Pro used a 14-stage pipeline and 5.5 million transistors, up from the 5 stages and 3.1 million transistors of the original P5 Pentium.

The moral of the story is that the best code density doesn't always win, even though the six ISAs with the worst code density died (Alpha, PA-RISC, MIPS, SPARC, PPC and original ARM). It is necessary to leverage an industry-leading combination of pipeline depth and code density by turning up the clock speed to show off performance, but the power-saving advantages are excellent for embedded use even if less flashy.
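The size figure in the Waterman quote can be sanity-checked with simple arithmetic. A minimal sketch (the 50% compressible fraction below is inferred from the quoted 25% savings, not taken from the thesis):

```python
# Static code size of RVC relative to fixed-width 32-bit RISC-V:
# compressible instructions shrink from 4 bytes to 2, the rest stay 4.
def rvc_size_ratio(compressible_fraction):
    return compressible_fraction * 0.5 + (1.0 - compressible_fraction) * 1.0

# Waterman reports RVC programs are ~25% smaller, which implies that
# roughly half of all instructions use the 16-bit compressed forms:
print(f"{1.0 - rvc_size_ratio(0.5):.0%} smaller")  # -> 25% smaller
```

The same arithmetic explains why RVC deliberately targets only the most frequent instructions: compressing the common half of the instruction stream captures most of the available savings.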

cdimauro Quote:

A hybrid solution between the two (as well as the preferable one) would be to PUSH the register to be used on the stack, and then use it taking into account where it is now located within it. Eventually, when no longer useful, a POP would restore the contents of the temporarily used register. This is a technique that I used in the first half of the 1990s, when I tried my hand at writing an 80186 emulator for Amiga systems equipped with a 68020 processor (or higher), and which has the virtue of combining the two previous scenarios, taking the best of them (minimising the cost of preserving and restoring the previous value in/from the stack).


For part 5, I see you are paying respect to the 68k Amiga. Well, the 68k still does it better and more elegantly than x86(-64)/APX in more than a few ways. Of course, mentioning the 68k isn't nearly as embarrassing as the Amiga, which is getting so embarrassing it pretty much requires a bag over our heads. The PPC A1222 AmigaNOne hardware received a "2 more weeks" announcement for a few hundred units, and it is expected to cost something like $1000 (or euros) for Raspberry Pi 3 (ARM Cortex-A53) class integer performance; now we also have the A600GS, a 68k emulation device with Cortex-A53 performance that is an attempt to build a 68k user base. I mentioned to them that they would be better off with x86-64 hardware, as the emulation and other OS support is better.

While x86-64 cores are fat and don't scale down very far, as demonstrated by the early in-order superscalar Atom CPUs, they have performance and don't use too much power on smaller chip processes. The successors to the Atom microarchitectures have been beefed up and are now used as the energy-efficient cores in newer Intel desktop and mobile CPUs (like ARM big.LITTLE/DynamIQ cores). Even these relatively little energy-efficient cores are not-so-little CISC powerhouses.
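The PUSH/use/POP technique quoted above can be sketched with a toy register/stack model: borrow a live register for scratch work and, while it is borrowed, reach its preserved value through a known stack offset instead of tying up a second register or a fixed memory variable. The names and layout here are illustrative, not taken from the original 80186 emulator:

```python
# Toy model of the PUSH / use / POP register-borrowing technique.
class Machine:
    def __init__(self):
        self.regs = {"d0": 0, "d1": 0}
        self.stack = []            # grows on push; index -1 is the top

    def push(self, reg):
        self.stack.append(self.regs[reg])

    def pop(self, reg):
        self.regs[reg] = self.stack.pop()

    def sp_rel(self, offset):
        """Read a value relative to the stack pointer (0 = top of stack)."""
        return self.stack[-1 - offset]

m = Machine()
m.regs["d1"] = 42        # d1 holds a live value we must preserve
m.push("d1")             # free d1 for temporary use...
m.regs["d1"] = 7         # ...scratch computation clobbers it
old = m.sp_rel(0)        # the preserved value is still reachable on the stack
m.pop("d1")              # restore d1 when done
assert old == 42 and m.regs["d1"] == 42
```

The win is that the save and restore each cost one cheap stack operation, while reads of the old value in between are ordinary SP-relative accesses.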

https://en.wikipedia.org/wiki/Gracemont_(microarchitecture)
https://upload.wikimedia.org/wikipedia/commons/d/dd/GracemontRevised.png
https://en.wikichip.org/wiki/intel/microarchitectures/gracemont

The 10nm Atom x7xxx line is only $39-$58 at 6W-12W TDP for a SoC with a decent GPU.

The Raspberry Pi 4 has 21% of the Geekbench 5 64 bit single core performance and 32% of the GPU single precision GFLOPS performance of the Intel Atom x7425E SoC.

https://www.cpu-monkey.com/en/compare_cpu-raspberry_pi_4_b_broadcom_bcm2711-vs-intel_atom_x7425e

The Atom SoCs are likely strong enough to emulate older and low-end PPC AmigaNOne hardware, but that is probably seen as the threat of assimilation. Emulation is good enough for the 68k but not for PPC AmigaNOne, yet. Assimilation is the easy route, but a hardware platform built on emulation destroys all the philosophies of the 68k Amiga. CPU performance metrics are important for competitiveness, but economies of scale win the war.

Your analysis and conclusion of APX appears accurate to me. I'd say Intel is getting nervous about the AArch64 ISA, which has better performance metrics in most categories, and often by more than a little bit. The x86-64 ISA bolt-on was just good enough when it came out, but it only has a small advantage in code density and a shorter average instruction size, while the only big advantage I see is CISC memory accesses. APX attacks the performance metrics where x86 and x86-64 are weak, like too many instructions to execute, too much memory traffic and too many branches, but it is challenging to fix them because they are inherent to the hardware needed to execute x86 and x86-64 code, and Intel doesn't want to give up their design expertise in this. They are willing to jettison some of the baggage with x86-S, and may even be able to get rid of x86 now, but they need x86-64 compatibility, which implies the same inherited warts due to bad legacy choices: poor orthogonality, limited encoding space, and 8-bit variable length encodings plus prefix hell adding a decoding tax.

I'm very skeptical that APX will help much, as the decreased number of instructions results in larger instructions that are more difficult to decode, and I don't see how the code density does not decrease despite the claims (one of your strong arguments too). I expect most of the decreased memory traffic to come from PUSH2 and POP2 but, again, decreased code density offsets some of the gains. The 16 new GP registers are all caller-saved (trash/volatile) registers, where an even split would likely be better; the result is more spills to the stack than expected from so many registers, more power used for the larger register file and, again, code density is likely to be negatively affected. The 3-operand addition reduces the number of instructions, but the instructions are larger and code density declines in most cases. The conditional instructions are flawed, as you point out, perhaps to match the legacy behavior of similar instructions.

I expect overall performance to improve, but I would be surprised to see more than 5%, even though some individual benchmarks may see 20%. They obviously ran prototype simulations using the SPEC CPU 2017 Integer benchmark suite, so they should have performance and code density numbers but, I agree, they are suspiciously absent. I wouldn't be surprised if APX is delayed or even canceled. If they were serious about getting more competitive, they would create a new encoding map based on a 16-bit variable length encoding. One 16-bit prefix on the 68k could achieve most of what APX is attempting, but with fewer instructions, less encoding overhead and better code density. The 68k ISA is in much better shape than x86(-64) as far as performance metrics go, especially for smaller and lower-power CPU cores.
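The 3-operand code density concern can be made concrete with rough instruction-size arithmetic. The byte counts below are my assumptions based on the published APX prefix sizes (2-byte REX2, 4-byte extended EVEX), not measured encodings:

```python
# Legacy x86-64: a non-destructive 3-operand add takes two instructions.
legacy = {
    "mov rax, rbx": 3,   # REX.W + opcode + ModRM (assumed encoding)
    "add rax, rcx": 3,   # REX.W + opcode + ModRM (assumed encoding)
}

# APX: one instruction, but the new-data-destination form carries a
# 4-byte extended EVEX prefix (assumed): prefix + opcode + ModRM.
apx = {"add rax, rbx, rcx": 4 + 1 + 1}

# 6 bytes vs 6 bytes: one fewer instruction to execute,
# but no code-size win in this simple case.
print(sum(legacy.values()), sum(apx.values()))
```

Under these assumptions the 3-operand form only pays off in instruction count, not in bytes, which is consistent with the skepticism above about APX code density.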

I expect x86-64 to hold off ARM AArch64 encroachment into the high performance computing markets for a while, despite ARM closing the gap. The desktop and server markets are very profitable for Intel and AMD compared to ARM's embedded and mobile markets. AArch64 is more scalable, its markets are expanding faster than a likely shrinking desktop market, and the ARM reference designs are improving in performance quickly. Intel has something to be worried about, but they also need to worry about their competitor AMD, who has been out-competing them recently. Maybe AMD has something up their sleeve, but there is only so much room for AMD64 improvement. The writing would be on the wall if we start to see more serious ARM development from Intel and AMD. The consoles have been a good predictor of ISA trends so far, so a switch to ARM by more than the Nintendo Switch could be an indicator.

Can there be just one ISA to rule them all? RISC-V found a niche and, I believe, has staying power with more open hardware, and its compressed RVC encoding helps. AArch64 doesn't scale all the way down to Thumb territory either. The problem with compressed RISC encodings like RVC and Thumb is that they need an increased number of small instructions, which decreases performance, a problem the 68k doesn't seem to have. The CPU market has large barriers to entry, mostly due to economies of scale, but there may be some opportunities and innovations yet. The CPU and ISA landscape would be pretty boring if there was only one ISA. We may be surprised, and the Mill Computing belt-machine architecture may pop back up in the news. More likely would be a more conservative effort starting small with a niche to fill.

Last edited by matthey on 05-Sep-2023 at 08:09 PM.
Last edited by matthey on 05-Sep-2023 at 03:55 PM.
Last edited by matthey on 05-Sep-2023 at 10:28 AM.

Fl@sh 
Re: APX: Intel's new architecture
Posted on 5-Sep-2023 12:31:52
#4
Regular Member
Joined: 6-Oct-2004
Posts: 252
From: Napoli - Italy

@cdimauro

I read it in Italian; nice article, well done!
Why was Intel APX not built on top of X86-S? Do you agree that removing all the old 16-bit legacy instructions could leave more room to implement better new APX extensions?

Why were your suggestions for an optimized APX implementation not considered by Intel? Are they blind, short on time, or what?

_________________
Pegasos II G4@1GHz 2GB Radeon 9250 256MB
AmigaOS4.1 fe - MorphOS - Debian 9 Jessie

matthey 
Re: APX: Intel's new architecture
Posted on 5-Sep-2023 19:11:42
#5
Super Member
Joined: 14-Mar-2007
Posts: 1757
From: Kansas

@Fl@sh Quote:

Why was Intel APX not built on top of X86-S?


Slim down x86 with x86-S and then fatten it back up with APX? Maybe too many experiments at once?

@Fl@sh Quote:

Do you agree that removing all the old 16-bit legacy instructions could leave more room to implement better new APX extensions?


The 68k uses a 16-bit variable length encoding while x86 uses an 8-bit variable length encoding, so legacy x86 instructions start at 8 bits. Many of the original x86 instructions are commonly used in x86-64 and APX, but with new variations that require prefixes. There are only 256 different base 8-bit encodings, and they were assigned for maximum code density on an 8-bit CPU accessing the stack and 8 general purpose registers. Instruction length should be based on the frequency of instructions used, and the original x86 encodings were suspect in this regard.

A variable length 8-bit encoding can likely achieve code density superior to a 16-bit variable length encoding if the encoding is efficient and based on instruction frequency. This can be seen from the CAST BA2 instruction set for 32-bit embedded cores, which allows variable length instructions of 16, 24, 32 and 48 bits (not 8 bits, to preserve more encoding space) and 16 or 32 GP registers, yet claims better code density than Thumb-2, which uses a 16-bit variable length encoding like the 68k. A variable length 16-bit encoding has performance advantages due to alignment (simplified decoding) and supports more powerful instructions with 16, 32, 48 and 64-bit lengths in the common case. Many compressed RISC ISAs use a 16-bit variable length encoding but only allow 16 or 32-bit lengths; the 68k-like ColdFire allowed 16, 32 and 48-bit lengths, losing performance and 68k compatibility for little benefit by not allowing 64-bit lengths. The 68k allows instruction lengths longer than 64 bits, but they are uncommon for a 32-bit ISA and don't need to be as fast. The x86 ISA has a shorter maximum instruction length than the 68k, and it is further restricted in x86-64, but in the worst case x86(-64) has to look at more 8-bit encodings/prefixes/extensions than the 68k has to look at 16-bit encodings/extensions. Both incur complex instruction overhead (usually microcode) for long instructions, but the longer instructions can add performance and flexibility, while it is the common shorter instructions that matter most for performance.
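The decode-simplicity point above can be illustrated with a toy length decoder: with 16-bit granularity the length falls out of one fetched word, while with byte granularity each prefix byte must be examined in turn before the opcode is even known. The encodings below are invented for illustration and do not match any real ISA:

```python
# Toy contrast between 16-bit-granular and 8-bit-prefixed length decode.

def length_16bit_granular(first_word):
    """Length in bytes, read directly from the top 2 bits of the
    first 16-bit word (hypothetical encoding)."""
    return {0: 2, 1: 4, 2: 6, 3: 8}[first_word >> 14]

def length_8bit_prefixed(code):
    """Scan 1-byte prefixes (0xF0-0xFF in this made-up scheme) one at a
    time until an opcode byte is found; each prefix is a decode step."""
    n = 0
    while code[n] >= 0xF0:
        n += 1
    return n + 2             # prefixes + opcode + one operand byte

assert length_16bit_granular(0x4ED0) == 4                  # top bits 01 -> 4 bytes
assert length_8bit_prefixed([0xF2, 0xF3, 0x01, 0xC0]) == 4 # two prefixes scanned
```

In the first scheme the length is known after a single aligned fetch; in the second, worst-case decode time grows with the number of prefixes, which is the "prefix hell" decoding tax mentioned above.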

The baggage of x86 goes far beyond instructions. x86 and x86-64 have many modes and submodes, like protected (virtual?) mode, real address mode, system management mode, 64-bit (long?) mode, compatibility mode and perhaps more. The old x86 CPUs are compatible with the 8086, whose 16-bit registers were combined with segment registers to address more memory, a scheme closer to memory banks than a flat memory model (the 68k ISA started with a 32-bit flat memory model, so it avoids all these legacy modes except for a possible 64-bit mode in the future). The x86 supported memory protection rings with the idea of improving security, and they were partially used in some OSs. The MMU support has changed over the years, with some of the early support outdated and likely rarely used (a modernized 68k MMU would likely need some changes as well, especially for a 64-bit mode). The x86 FPU was an old stack-based design that has long been deprecated, but there may be somewhat modern programs still using it due to functionality missing in its SIMD unit replacement. Many of the x86 instructions behave differently in different modes and with different configurations. I don't pretend to understand all the issues, but I can see it is a huge, ugly legacy mess.

Maybe much of the legacy can be wiped away, but the encoding maps are inefficient, and a 16-bit variable length encoding would be a big improvement, yet less compatible and requiring new hardware designs. The x86-64 ISA needs to be thrown away, but x86-64 CPUs are powerful, at least partially due to the CISC design. Nobody wants to develop a memory-munching CISC monster replacement to compete with it, because such designs are more difficult and expensive to develop and have a bad reputation due to x86(-64). Maybe AArch64 "RISC" has adopted enough of the performance benefits of CISC while leaving behind the legacy baggage to finally compete, but I believe a good CISC ISA has more potential than x86-64 or APX too.

@Fl@sh Quote:

Why were your suggestions for an optimized APX implementation not considered by Intel? Are they blind, short on time, or what?


Maybe Cesare's pursuit of an improved 64-bit x86 ISA led him away from current x86-64 hardware designs, which Intel management didn't want to hear? Maybe Intel's management is trying to tweak a little more performance out of existing x86-64 hardware without losing compatibility? Maybe Intel management can get his input for free from articles criticizing their tweaked APX ISA, without having to pay him for a better solution?

Last edited by matthey on 05-Sep-2023 at 07:27 PM.
Last edited by matthey on 05-Sep-2023 at 07:16 PM.

cdimauro 
Re: APX: Intel's new architecture
Posted on 5-Sep-2023 20:07:57
#6
Elite Member
Joined: 29-Oct-2012
Posts: 3161
From: Germany

@amigagr

Quote:

amigagr wrote:
@cdimauro

It's a good opportunity for me to learn computer terminology in Italian. Thank you very much!

You're welcome! If you have any doubts or other questions, just ask.

@Fl@sh

Quote:

Fl@sh wrote:
@cdimauro

I read it in Italian; nice article, well done!

Thanks.
Quote:
Why was Intel APX not built on top of X86-S?

They aren't incompatible. APX can be implemented on x86 processors as well as X86-S ones (when/if those arrive; my guess is they will: keeping x64 processors as they are is just a waste of silicon and power, due to functionality that hasn't been used in decades).
Quote:
Do you agree that removing all the old 16-bit legacy instructions could leave more room to implement better new APX extensions?

I don't know the impact of 16-bit code on a concrete core implementation, but my idea is that removing it might free some space for APX (at least).
Quote:
Why were your suggestions for an optimized APX implementation not considered by Intel? Are they blind, short on time, or what?

I haven't submitted them to Intel.

Maybe I'll write to a former colleague who works on Intel's compilers and share my articles with him. Let's see...

cdimauro 
Re: APX: Intel's new architecture
Posted on 5-Sep-2023 21:09:58
#7
Elite Member
Joined: 29-Oct-2012
Posts: 3161
From: Germany

@matthey

Quote:

matthey wrote:
@cdimauro
Nice analysis and write up for APX. I hadn't heard of it before (or the x86-S simplification). When is it due to be implemented in Intel CPU cores?

Thanks! I don't know, but the idea of Intel working on a new architecture has been floating around for several years. I think the time is ripe now, and the fact that there are preliminary SPEC2017 tests for it is a clear signal that something is already ready. To me it's a matter of a few years (maybe sooner).
Quote:
I have a few comments and suggestions as an armchair technical proofreader.

Much appreciated!
Quote:
cdimauro Quote:

Exactly the same could be said of ARM and its also blazoned 32-bit architecture, which, however, had the courage to put its hands to the project and re-establish it on new foundations when it decided to bring out its own 64-bit extension, AArch64 AKA AMD64, which is not compatible with the previous 32-bit ISA (although it has much in common and bringing applications to it does not require a total rewrite).


For part 8, should this have been "AArch64 AKA ARMv8-A"? You used AKA properly right before it so you obviously know what it means and it is even highlighted in red as a link yet it went unnoticed in both the English and Italian article.

That was a lapsus: I meant to write ARM64. Fixed now, thanks!
Quote:
cdimauro Quote:

This sounds rather strange to me, since I still remember very well how AMD had claimed, when introducing x86-64 AKA x64, to have evaluated the extension of x86 to 32 instead of 16 registers, but to have given up because the advantages did not prove to be significant (contrary to the switch from 8 to 16 registers, where the differences, instead, were quite tangible, as we have seen for ourselves) and did not justify the greater implementation complexity of such a solution.


For part 3, you may consider changing "x86-64 AKA x64" to "AMD64 AKA x64" since I believe that is what AMD originally called the ISA and there may be minor differences between the Intel x86-64 implementation.

x86-64 was the original name AMD used when it introduced its 64-bit extension to IA-32/x86, so I prefer to keep it.

AMD64 sounds much better, but to me x86-64 is more appropriate, especially in this context (since I was talking about the introduction of this new ISA).
Quote:
https://people.eecs.berkeley.edu/~krste/papers/waterman-ms.pdf Quote:

RVC is a superset of the RISC-V ISA, encoding the most frequent instructions in half the size of a RISC-V instruction; the remaining functionality is still accessible with full-length instructions. RVC programs are 25% smaller than RISC-V programs, fetch 25% fewer instruction bits than RISC-V programs, and incur fewer instruction cache misses.



RVC improves performance substantially when the instruction working set does not fit in cache. For 6 of the 8 cache configurations, using RVC is more effective than doubling the associativity. Using RVC attains, on average, 80% of the speedup of doubling the cache size. A system with a 16 KB direct-mapped cache with RVC is 99% as fast as a system with a 32 KB direct-mapped cache without RVC.


For part 5, I see you used a play from my playbook pulling out the RISC-V code density studies and documentation. Much of the RISC-V documentation is written to make RVC code density look good by comparing to the non-compressed RISC-V ISA which resembles the old extinct and fat RISC "desktop" ISAs like SPARC, MIPS, PA-RISC and Alpha.

Indeed. We know how it goes: they make comparisons only against selected products where they can show improvements (see also some papers from a recent Turing award winner).
Quote:
The more universal code density RISC-V research I like to quote is the following.

The RISC-V Compressed Instruction Set Manual Version 1.7 Quote:

The philosophy of RVC is to reduce code size for embedded applications and to improve performance and energy-efficiency for all applications due to fewer misses in the instruction cache. Waterman shows that RVC fetches 25%-30% fewer instruction bits, which reduces instruction cache misses by 20%-25%, or roughly the same performance impact as doubling the instruction cache size.


It's probably the same research and researcher, Waterman, but this is shorter and more easily understood in my opinion. In other words, every 25%-30% improvement in code density is roughly like doubling the size of the instruction cache.

That was exactly what I was looking for, thanks! Unfortunately I couldn't recall it while writing the article, so I opted to quote some excerpts from the original thesis.

I've now replaced it.
Quote:
The 1995 DEC Alpha 21164 CPU using AXP ISA demonstrates the turning point of RISC fallacies where the L1 ICache had to be reduced to 8kiB to maintain timing at the high clock speeds but it performed more like a 1-2kiB L1 ICache for a 68020 CPU. The Alpha 21164 pioneered the on-chip L2 cache but the 96kiB L2 caused the chip to use 9.3 million transistors, draw 56W of power at 333MHz, and the 433 MHz version cost $1,492 in 1996. The 1994 68060 used a similar chip fab process, had one more pipeline stage than the Alpha 21164 which is better for clocking up, only used 2.5 million transistors, was cool enough at low clock speeds for a mobile device and cost a fraction of the price. It was the 1995 Pentium Pro@200MHz that was first to outperform a 300 MHz Alpha 21164 in SPECint95 benchmarks though. The Pentium Pro used a 14 stage pipeline and 5.5 million transistors up from the 5 stages and 3.1 million transistors of the original P5 Pentium. The moral of the story is that the best code density doesn't always win even though the worst 6 died (Alpha, PA-RISC, MIPS, SPARC, PPC and ARM original). It is necessary to leverage an industry leading pipeline depth and code density by turning up the clock speed to show off performance but the power saving advantages are excellent for embedded use even if less flashy.

True, but it's also important to encode "more useful work" in each instruction, as APX is also showing. So a combination of the three factors is the winning element for an architecture.
Quote:
cdimauro Quote:

A hybrid solution between the two (as well as the preferable one) would be to PUSH the register to be used on the stack, and then use it taking into account where it is now located within it. Eventually, when no longer useful, a POP would restore the contents of the temporarily used register. This is a technique that I used in the first half of the 1990s, when I tried my hand at writing an 80186 emulator for Amiga systems equipped with a 68020 processor (or higher), and which has the virtue of combining the two previous scenarios, taking the best of them (minimising the cost of preserving and restoring the previous value in/from the stack).


For part 5, I see you are paying respect to the 68k Amiga.

I still love it.

The 68k was the main source of inspiration even for my own ISA, NEx64T (which is... an x86/x64 rewrite/extension). In fact, its most important features come directly from the 68k.
Quote:
Well, the 68k still does it better and more elegantly than x86(-64)/APX in more than a few ways.

I fully agree on that as well. x86/x64 is... what it is. And APX makes it even worse from an architectural/structural PoV.

The 68k is a piece of cake compared to them. And it's a much smaller/simpler ISA.
Quote:
Of course mentioning the 68k isn't nearly as embarrassing as the Amiga which pretty much requires a bag over our heads it is getting so embarrassing. PPC A1222 AmigaNOne hardware received a "2 more weeks" announcement for a few hundred units and it is expected to cost something like $1000 or Euros for Raspberry Pi 3 ARM Cortex-A53 like integer performance and now we have an A600GS 68k emulation device with Cortex-A53 performance that is an attempt to build a 68k user base. I mentioned to them they would be better off with x86-64 hardware as the emulation and other OS support is better. While x86-64 cores are fat and don't scale down very far as demonstrated by the early in-order superscalar Atom CPUs, they have performance and don't use too much power on smaller chip processes.

Indeed. However, the RPi has the advantage of being very cheap. I don't know how much a comparable x86-64 board would cost.

PowerPCs... I prefer a "no comment"...
Quote:
The successors to the Atom microarchitectures have been beefed up and are now used as the energy efficient cores in newer Intel desktop and mobile CPUs (like ARM big.LITTLE/DynamIQ cores). Even these relatively little energy efficient cores are not so little CISC power houses.

https://en.wikipedia.org/wiki/Gracemont_(microarchitecture)
https://upload.wikimedia.org/wikipedia/commons/d/dd/GracemontRevised.png
https://en.wikichip.org/wiki/intel/microarchitectures/gracemont

The 10nm Atom x7xxx line is only $39-$58 at 6W-12W TDP for a SoC with a decent GPU.

The Raspberry Pi 4 has 21% of the Geekbench 5 64 bit single core performance and 32% of the GPU single precision GFLOPS performance of the Intel Atom x7425E SoC.

https://www.cpu-monkey.com/en/compare_cpu-raspberry_pi_4_b_broadcom_bcm2711-vs-intel_atom_x7425e

The Atom SoCs are likely strong enough to emulate older and low end PPC AmigaNOne hardware but that is probably the threat of assimilation. Emulation is good enough for the 68k but not for PPC AmigaNOne, yet.

Absolutely. The so-called "E-Cores" are based on a new microarchitecture which is simply awesome, considering how it performs while drawing much less power than comparable cores.

Atoms based on it will shine, especially at emulation (they have a 6-way decoder with a 17-port backend: a monster!).
Quote:
Assimilation is the easy route, but a hardware platform built on emulation destroys all the philosophies of the 68k Amiga. CPU performance metrics are important for competitiveness, but economies of scale win the war.

Money is the most important factor. Unfortunately it doesn't lead to "nice" designs.
Quote:
Your analysis and conclusion on APX appear accurate to me. I'd say Intel is getting nervous about the AArch64 ISA, which has better performance metrics in most categories, often by more than a little bit. The x86-64 ISA bolt-on was just good enough when it came out, but it only has a small advantage in code density and a shorter average instruction size, while the only big advantage I see is CISC memory accesses. APX attacks the performance metrics where x86 and x86-64 are weak, like too many instructions to execute, too much memory traffic and too many branches, but fixing them is challenging because they are inherent to the hardware needed to execute x86 and x86-64 code, and Intel doesn't want to give up its design expertise in this.

Exactly. It's the ultimate move for Intel to be more competitive. But it'll be an advantage limited in time: then the competition will catch up, making Intel nervous again.
Quote:
They are willing to jettison some of the baggage with x86-S and may even be able to get rid of x86 now

No, x86 is here to stay: there's still too much software using it. In fact, X86-S doesn't abolish x86: it "just" makes it a "second-class citizen" (since it can be used only by userland applications).
Quote:
but they need x86-64 compatibility, which implies the same inherited warts due to bad legacy choices like poor orthogonality, limited encoding space, and 8 bit variable length encodings and prefix hell adding a decoding tax. I'm very skeptical that APX will help much, as the decreased number of instructions comes with larger instructions that are more difficult to decode, and I don't see how code density could not decrease, despite the claims (one of your strong arguments too).

In fact, to me it's unbelievable: the "new" instructions are really too long. I don't see how code density could remain similar to x64's.

That's why I can't wait to get my hands on some APX binaries to verify my impressions.
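Once APX binaries and a disassembler are available, the check is straightforward: total code bytes divided by instruction count. A minimal sketch in Python, assuming (hypothetically) a disassembly listing where each line begins with the instruction's bytes as hex pairs; the listing format and sample lines are illustrative, not from any real tool output:

```python
# Minimal code-density estimator: average instruction length from a
# disassembly listing. Assumes (hypothetically) lines like
#   "48 89 e5    mov rbp, rsp"
# where the instruction bytes come first as two-digit hex pairs.
def average_instruction_length(lines):
    total_bytes = 0
    count = 0
    for line in lines:
        tokens = line.split()
        # Count the leading tokens that look like hex byte pairs.
        n = 0
        for tok in tokens:
            if len(tok) == 2 and all(c in "0123456789abcdefABCDEF" for c in tok):
                n += 1
            else:
                break
        if n:
            total_bytes += n
            count += 1
    return total_bytes / count if count else 0.0

listing = [
    "55          push rbp",
    "48 89 e5    mov rbp, rsp",
    "c3          ret",
]
print(average_instruction_length(listing))  # 5 bytes / 3 instructions ≈ 1.67
```

Comparing this average (and the total byte count for the same compiled source) between x64 and APX builds would settle the code-density question empirically.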
Quote:
I expect most of the decreased memory traffic to come from PUSH2 and POP2 but, again, decreased code density offsets some of the gains. The 16 new GP registers are all caller saved (trash/volatile) registers where CISC would likely benefit from more callee saved registers causing more spills to the stack than expected,

Another choice which to me looks like nonsense. Bah...
Quote:
using more power for the larger register file

No, this will not change much, because modern processors already have hundreds of physical (renamed) registers. Adding 16 GP registers to the ISA makes practically no difference to running applications using them (besides a slightly longer context switch).
Quote:
and again code density is likely to be negatively affected. The 3 op addition reduces the number of instructions but they are larger and code density declines in most cases. The conditional instructions are flawed as you point out, perhaps to match legacy behavior of similar instructions.

That's my idea as well. However, with APX Intel introduced the CFCMOVcc instructions, which at least fix the misbehavior with exceptions, even though they contradict the legacy implementation (CMOVcc).

This is another bit of nonsense...
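The behavioral split can be modeled in a few lines. With a memory source, legacy CMOVcc performs the load regardless of the condition, so it can fault even when the move is suppressed, while APX's CFCMOVcc suppresses the access (and any fault) when the condition is false. A simplified Python model, where memory is just a dict and a missing address stands in for a faulting access:

```python
# Toy model of the exception behavior, not the full semantics:
# memory is a dict; reading a missing address raises KeyError,
# standing in for a page fault on the load.
def cmov(cond, dest, memory, addr):
    value = memory[addr]            # legacy CMOVcc: the load ALWAYS
    return value if cond else dest  # executes, so it can fault even
                                    # when cond is False

def cfcmov(cond, dest, memory, addr):
    if not cond:                    # APX CFCMOVcc: the access is
        return dest                 # suppressed when the condition is
    return memory[addr]             # false, so no fault can occur

mem = {0x1000: 42}
print(cfcmov(False, 7, mem, 0xDEAD))  # 7: the bad address is never touched
try:
    cmov(False, 7, mem, 0xDEAD)       # the legacy form still loads...
except KeyError:
    print("legacy CMOV faulted")      # ...and faults
```

This is why speculative/predicated code that may dereference a not-yet-validated pointer only works safely with the CFCMOVcc variant.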
Quote:
I expect overall performance to improve but I would be surprised to see more than 5% even though some individual benchmarks may see 20%. They obviously ran prototype simulations using the SPEC CPU 2017 Integer benchmark suite so they should have performance and code density numbers but, I agree, they are suspiciously absent. I wouldn't be surprised if APX is delayed and even canceled.

As I've said before, I believe that we'll see concrete products soon. This is Intel's last anchor in the fight with its competitors.
Quote:
If they were serious about getting more competitive, they would create a new encoding map based on a 16 bit variable length encoding.

That was my aim/wish. And it was/is also exactly what I've done with my new ISA.
Quote:
One 16 bit prefix on the 68k could achieve most of what APX is attempting to do, but with fewer instructions, less encoding overhead and better code density. The 68k ISA is in much better shape than x86(-64) as far as performance metrics go, especially for smaller and lower-power CPU cores.

Absolutely. And the 68k would have gained a 64-bit extension as well.
Quote:
I expect x86-64 to hold off ARM AArch64 encroachment into the high performance computing markets for awhile despite ARM closing the gap. The desktop and server markets are very profitable for Intel and AMD compared to ARM embedded and mobile markets. AArch64 is more scalable, their markets are expanding faster compared to a likely shrinking desktop market and the ARM reference designs are improving in performance quickly.

Indeed. See Apple's M1/M2 processors...
Quote:
Intel has something to be worried about but they need to also worry about their competitor AMD who has been out competing them recently. Maybe AMD has something up their sleeves

AMD owns the console market, which is a cash cow to milk.
Quote:
but there is only so much room for AMD64 improvement. The writing would be on the wall if we start to see more serious ARM development from Intel and AMD. The consoles have been a good predictor of ISA trends so far so a switch to ARM by more than the Nintendo Switch could be an indicator. Can there be just one ISA to rule them all? RISC-V found a niche and I believe has staying power with more open hardware and their compressed RVC encoding helps.

However, RISC-V isn't tailored for single-core/thread performance, even with super-aggressive implementations: this ISA is too simple and its instructions don't do much "useful work".

RISC-V's primary advantage is its license-free model.
Quote:
AArch64 doesn't scale all the way down to Thumb territory either. The problem with compressed RISC encodings like the RVC and Thumb encodings is that they need an increased number of small instructions, which decreases performance, a problem the 68k doesn't seem to have.

Because it provides more "useful work": the 68k shines at that, especially with its super-flexible memory-to-memory MOVE instruction.
Quote:
The CPU market has large barriers to entry, mostly due to economies of scale, but there may be some opportunities and innovations yet. The CPU and ISA landscape would be pretty boring if there was only one ISA.

I fully agree AND it's my hope too!
Quote:
We may be surprised and the Mill Computing Mill architecture belt machine may pop back up in the news.

Hmm. Too many announcements / too much propaganda and no concrete product. It's hard to believe in some revolution here, IMO.
Quote:
More likely would probably be a more conservative effort starting small with a niche to fill.

Indeed.

cdimauro 
Re: APX: Intel's new architecture
Posted on 5-Sep-2023 21:43:44
#8 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3161
From: Germany

@matthey

Quote:

matthey wrote:
@Fl@sh Quote:

Why was Intel APX not done on top of X86-S?


Slim down x86 with x86-S and then fatten it back up with APX? Maybe too many experiments at once?

I think that they'll come separately, but I see future processors adopting both.
Quote:
@Fl@sh Quote:

Do you agree that removing all the old 16-bit legacy instructions could leave more room to implement better new APX extensions?


The 68k uses a 16 bit variable length encoding while x86 uses an 8 bit variable length encoding so legacy x86 instructions start at 8 bits. Many of the original x86 instructions are commonly used in x86-64 and APX but with new variations that require prefixes. There are only 256 different base 8 bit encodings and they were encoded for maximum code density using an 8 bit CPU accessing the stack and 8 general purpose registers. Instruction length should be based on frequency of instructions used while original x86 encodings were suspect in this regard.

Because it was a primitive design without a vision of future needs and extensions.

The 8086 was very good when it was introduced, for the market of the time: a simple design with an incredible code density.

But it wasn't future-proof from this PoV and it required horrible patches for its extensions.
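Those layered patches are easy to see in the bytes: the same register-to-register ADD grows as each extension bolts on a new prefix. The first three byte sequences below are standard x86/x86-64 encodings; the last one follows Intel's published APX REX2 scheme (0xD5 plus a payload byte), with the payload value left as a placeholder rather than a real encoding:

```python
# How one ADD instruction accretes prefixes across x86 generations.
# The first three encodings are standard; the REX2 payload byte (0x00)
# is a placeholder, not the actual encoding for r16/r17.
encodings = {
    "add ax,  bx   (8086, 16-bit mode)":  bytes([0x01, 0xD8]),
    "add eax, ebx  (386, 32-bit mode)":   bytes([0x01, 0xD8]),
    "add rax, rbx  (x86-64, REX.W)":      bytes([0x48, 0x01, 0xD8]),
    # APX: reaching the new registers r16-r31 needs the 2-byte REX2
    # prefix (0xD5 + payload byte carrying the extra register bits).
    "add r16, r17  (APX, REX2)":          bytes([0xD5, 0x00, 0x01, 0xD8]),
}
for name, enc in encodings.items():
    print(f"{len(enc)} bytes  {name}")
```

Same operation, same core opcode byte (0x01), but 2 bytes in 1978 and 4 bytes under APX: the prefixes are pure extension tax.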
Quote:
A variable length 8 bit encoding can likely achieve superior code density to a 16 bit variable length encoding with an efficient encoding based on instruction frequency as can be seen by the Cast BA2 instruction set for 32 bit embedded cores which allows variable length instructions of 16, 24, 32 and 48 bits (not 8 bits to preserve more encoding space) and 16 or 32 GP registers yet claims to have better code density than Thumb2 which uses a 16 bit variable length encoding like the 68k.

That's impressive. I've searched around several times to find the BA2 ISA manual, but I never found it. It looks like the company is jealously keeping the opcode design to itself.

However, my idea is that it might be too specialized for the embedded market and not useful in other areas (mobile, desktop, servers, workstations, HPC).
Quote:
A variable length 16 bit encoding has performance advantages due to alignment (simplified decoding) and supports more powerful instructions with 16, 32, 48 and 64 bit lengths in the common case. Many compressed RISC ISAs use a 16 bit variable length encoding but only allow 16 or 32 bit lengths, and the 68k-like ColdFire allowed 16, 32 and 48 bit lengths, losing performance and 68k compatibility for little benefit by not allowing 64 bit lengths. The 68k allows instruction lengths longer than 64 bits, but they are uncommon for a 32 bit ISA and don't need to be as fast.

Well, it depends on the encodings.

I have very long instructions in my ISA, because they allow packing a lot of "useful work" into single instructions, and I've gained some code density using them on real code (e.g., pairs of x86 or x64 instructions which were "fused" into a single NEx64T instruction).

And this was limited only because my current "reassembler" can work with two instructions at a time (supporting more would be a nightmare to implement, even with my beloved Python). I can say this with certainty because I've clearly seen, several times in disassembled x86/x64 binaries, instruction patterns that could be "fused" (even 4 instructions into one).

So, having longer instructions might not be so uncommon with some ISAs.
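A pair-fusing pass of that kind can be sketched as a peephole over a disassembled instruction list. The pattern below (folding an address-forming LEA into the memory operand of the following MOV) is just one hypothetical example of such a rewrite, with instructions reduced to simple tuples; NEx64T's real patterns and representation are not public:

```python
# Sketch of a two-instruction peephole "fuser": scan adjacent pairs and
# replace known patterns with one combined instruction. Instructions
# are simplified (mnemonic, dest, src) tuples; the single pattern shown
# is illustrative, and it is only valid when the scratch register (the
# LEA destination) is dead after the pair.
def fuse_pairs(instructions):
    out = []
    i = 0
    while i < len(instructions):
        if i + 1 < len(instructions):
            a, b = instructions[i], instructions[i + 1]
            # Pattern: lea rX, [expr] ; mov rY, [rX]  ->  mov rY, [expr]
            if a[0] == "lea" and b[0] == "mov" and b[2] == f"[{a[1]}]":
                out.append(("mov", b[1], a[2]))
                i += 2
                continue
        out.append(instructions[i])
        i += 1
    return out

code = [
    ("lea", "rax", "[rbx+8*rcx+16]"),
    ("mov", "rdx", "[rax]"),
    ("ret", "", ""),
]
print(fuse_pairs(code))  # [('mov', 'rdx', '[rbx+8*rcx+16]'), ('ret', '', '')]
```

Extending this to 3- or 4-instruction windows multiplies the pattern combinations and the liveness checks, which is exactly why longer fusions become a nightmare to implement.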
Quote:
The x86 ISA has a shorter maximum instruction length than the 68k which is further restricted in x86-64 but x86(-64)

No, they are the same: max 15 bytes per instruction. Which is OK, because no instruction can exceed that length (except by adding more redundant prefixes).

On the other hand, an APX instruction can exceed 15 bytes, and that's why Intel says that, in those cases, the instruction should be split in two (e.g., using LEA to compute the address of the memory location and save it in a register, then referencing the memory through that register in the following APX instruction).
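The byte budget that makes this overflow possible is easy to add up. With APX's 4-byte extended-EVEX prefix, a worst-case instruction with full addressing and a 32-bit immediate already sits exactly at the 15-byte limit, so any extra legacy prefix (a segment override, say) pushes it over and forces the LEA split. An illustrative tally:

```python
# Worst-case APX instruction length, component by component.
# The 15-byte cap is the architectural x86 instruction-length limit.
components = {
    "extended EVEX prefix": 4,
    "opcode":               1,
    "ModRM":                1,
    "SIB":                  1,
    "disp32":               4,
    "imm32":                4,
}
length = sum(components.values())
print(length)      # 15: exactly at the limit
print(length + 1)  # 16: one segment-override prefix pushes it over,
                   # so the instruction must be split (LEA + access)
```

The LEA replaces the disp32/SIB-heavy addressing with a plain register base, bringing the second instruction comfortably back under the limit.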
Quote:
in the worst case has to look at more 8 bit encodings/prefixes/extensions than the 68k has to look at 16 bit encodings/extensions in the worst case. Both incur complex instruction overhead (usually microcode) for long instructions

I don't think so. The reason is that x86/x64 use 8-bit prefixes, and catching them in a byte stream is quite easy.

Whereas catching all the 68k's weird encodings is more difficult, and it's even worse with the 68020+ new extension words.

However, Apollo's 68080 seems to have no problems with that.
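Skimming prefixes off an x86 byte stream really is a small loop. A minimal sketch covering only the legacy one-byte prefixes and the 64-bit REX range; real decoders also handle VEX/EVEX/REX2 and the multi-byte opcode maps, which are out of scope here:

```python
# Split an x86-64 instruction's leading prefixes from the rest of its
# bytes. Only legacy one-byte prefixes and a trailing REX (0x40-0x4F)
# are handled; VEX/EVEX and the opcode itself are beyond this sketch.
LEGACY_PREFIXES = {
    0xF0, 0xF2, 0xF3,        # LOCK, REPNE, REP
    0x2E, 0x36, 0x3E, 0x26,  # CS, SS, DS, ES segment overrides
    0x64, 0x65,              # FS, GS segment overrides
    0x66, 0x67,              # operand-size, address-size
}

def split_prefixes(code):
    i = 0
    while i < len(code) and code[i] in LEGACY_PREFIXES:
        i += 1
    # In 64-bit mode a single REX byte may sit between the legacy
    # prefixes and the opcode.
    if i < len(code) and 0x40 <= code[i] <= 0x4F:
        i += 1
    return code[:i], code[i:]

# 66 48 0F B7 C3: operand-size prefix, REX.W, then the opcode bytes.
prefixes, rest = split_prefixes(bytes([0x66, 0x48, 0x0F, 0xB7, 0xC3]))
print(prefixes.hex(), "|", rest.hex())  # 6648 | 0fb7c3
```

By contrast, a 68k decoder can't strip anything up front: the addressing modes and extension words are only discovered by decoding the first 16-bit opword itself.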
Quote:
but the longer instructions can add performance and flexibility while it is the common shorter instructions that are more important for performance.

Exactly.
Quote:
The baggage with x86 goes far beyond instructions. The x86 and x86-64 have many modes and submodes, like protected (virtual?) mode, real address mode, system management mode, 64 bit (long?) mode, compatibility mode and perhaps more. The old x86 CPUs are compatible with the 808x line, which had 16 bit registers and used segment registers to extend addressing, a scheme closer to memory banks than a flat memory model (the 68k ISA started with a 32 bit flat memory model so it avoids all these legacy modes, except for a possible 64 bit mode in the future). The x86 supported memory protection rings with the idea of improving security, and these were partially used in some OSs. The MMU support has changed over the years, with some of the early support outdated and likely rarely used (a modernized 68k MMU would likely need some changes as well, especially for a 64 bit mode). The x86 FPU was an old stack-based design that has long been deprecated, but there may be somewhat modern programs still using it due to functionality missing in its SIMD replacement. Many of the x86 instructions behave differently in different modes and with different configurations. I don't pretend to understand all the issues, but I can see it is a huge ugly legacy mess. Maybe much of the legacy can be wiped away, but the encoding maps are inefficient, and a 16 bit variable length encoding would be a big improvement, yet less compatible and requiring new hardware designs.

I agree with almost all of that, besides the last part.

x64 before, and APX now, require new binaries which are not compatible with older processors. So binary compatibility is good to have, but not possible when so many changes need to be introduced.

A new hardware design IMO isn't required if the new ISA is done properly. Only the decoder needs most of the changes, while everything else could be kept as it is, with some small adjustments.
Quote:
The x86-64 ISA needs to be thrown away but x86-64 CPUs are powerful at least partially due to the CISC design. Nobody wants to develop a memory munching CISC monster replacement that competes with it because they are more difficult and expensive to develop and have a bad reputation due to x86(-64). Maybe AArch64 "RISC" has adopted enough performance benefits of CISC while leaving behind the legacy baggage to finally compete but I believe a good CISC ISA has more potential than x86-64 or APX too.

I fully agree and... I can prove it.
Quote:
@Fl@sh Quote:

Why were your suggestions for an optimized APX implementation not considered by Intel? Are they blind, lacking time, or what?


Maybe Cesare's pursuit of an improved 64 bit x86 ISA led him away from current x86-64 hardware designs which Intel management didn't want to hear?

That happened when I joined Intel. Talking with my first manager, he told me that the company isn't interested in anything different from its ISA. So my new ISA (which was just at v2 at the time; now it has reached v11 and changed dramatically) had no hope.

I think that this still applies: Intel doesn't want to change its ISA, only extend it to capitalize on its big legacy. Which is good... while it's working, but it isn't future-proof.
Quote:
Maybe Intel's management is trying to tweak a little more performance out of existing x86-64 hardware without losing compatibility?

Yup. See above.
Quote:
Maybe Intel management can get free input from his articles criticizing their tweaked APX ISA without having to pay him for a better solution?

They could do that, and I hope for their sake that they will at least consider those suggestions to improve APX. The suggestions are free, simple and effective: now it's up to them.


Copyright (C) 2000 - 2019 Amigaworld.net.
Amigaworld.net was originally founded by David Doyle