Forum Index / General Technology (No Console Threads) / APX: Intel's new architecture
cdimauro 
APX: Intel's new architecture
Posted on 4-Sep-2023 6:02:23
#1
Elite Member
Joined: 29-Oct-2012
Posts: 3161
From: Germany

I've written a series of eight articles about Intel's new architecture: APX.
The last article closes the series by taking stock of the situation and offering some reflections. It also includes links to all the previous articles.
English: APX: Intel’s new architecture – 8 – Conclusions
Italian: APX: la nuova architettura di Intel – 8 – Conclusioni

amigagr 
Re: APX: Intel's new architecture
Posted on 4-Sep-2023 23:00:02
#2
Member
Joined: 2-Sep-2022
Posts: 16
From: Thessaloniki, Greece

@cdimauro

It's a good opportunity for me to learn computer terminology in Italian. Thank you very much!

matthey 
Re: APX: Intel's new architecture
Posted on 5-Sep-2023 10:07:15
#3
Super Member
Joined: 14-Mar-2007
Posts: 1757
From: Kansas

@cdimauro
Nice analysis and write up for APX. I hadn't heard of it before (or the x86-S simplification). When is it due to be implemented in Intel CPU cores?

I have a few comments and suggestions as an armchair technical proofreader.

cdimauro Quote:

Exactly the same could be said of ARM and its also blazoned 32-bit architecture, which, however, had the courage to put its hands to the project and re-establish it on new foundations when it decided to bring out its own 64-bit extension, AArch64 AKA AMD64, which is not compatible with the previous 32-bit ISA (although it has much in common and bringing applications to it does not require a total rewrite).


For part 8, should this have been "AArch64 AKA ARMv8-A"? You used AKA properly right before it so you obviously know what it means and it is even highlighted in red as a link yet it went unnoticed in both the English and Italian article.

cdimauro Quote:

This sounds rather strange to me, since I still remember very well how AMD had claimed, when introducing x86-64 AKA x64, to have evaluated the extension of x86 to 32 instead of 16 registers, but to have given up because the advantages did not prove to be significant (contrary to the switch from 8 to 16 registers, where the differences, instead, were quite tangible, as we have seen for ourselves) and did not justify the greater implementation complexity of such a solution.


For part 3, you may consider changing "x86-64 AKA x64" to "AMD64 AKA x64" since I believe that is what AMD originally called the ISA and there may be minor differences between the Intel x86-64 implementation.

https://people.eecs.berkeley.edu/~krste/papers/waterman-ms.pdf Quote:

RVC is a superset of the RISC-V ISA, encoding the most frequent instructions in half the size of a RISC-V instruction; the remaining functionality is still accessible with full-length instructions. RVC programs are 25% smaller than RISC-V programs, fetch 25% fewer instruction bits than RISC-V programs, and incur fewer instruction cache misses.



RVC improves performance substantially when the instruction working set does not fit in cache. For 6 of the 8 cache configurations, using RVC is more effective than doubling the associativity. Using RVC attains, on average, 80% of the speedup of doubling the cache size. A system with a 16 KB direct-mapped cache with RVC is 99% as fast as a system with a 32 KB direct-mapped cache without RVC.


For part 5, I see you took a play from my playbook by pulling out the RISC-V code density studies and documentation. Much of the RISC-V documentation is written to make RVC code density look good by comparing against the uncompressed RISC-V ISA, which resembles the old, extinct, fat RISC "desktop" ISAs like SPARC, MIPS, PA-RISC and Alpha. The more universal code density RISC-V research I like to quote is the following.

The RISC-V Compressed Instruction Set Manual Version 1.7 Quote:

The philosophy of RVC is to reduce code size for embedded applications and to improve performance and energy-efficiency for all applications due to fewer misses in the instruction cache. Waterman shows that RVC fetches 25%-30% fewer instruction bits, which reduces instruction cache misses by 20%-25%, or roughly the same performance impact as doubling the instruction cache size.


It's probably the same research and researcher, Waterman, but this is shorter and more easily understood in my opinion. In other words, every 25%-30% improvement in code density is roughly like doubling the size of the instruction cache.

The 1995 DEC Alpha 21164 CPU using the AXP ISA demonstrates the turning point of RISC fallacies: the L1 ICache had to be reduced to 8kiB to maintain timing at high clock speeds, but, given the poor code density, it performed more like a 1-2kiB L1 ICache would on a 68020 CPU. The Alpha 21164 pioneered the on-chip L2 cache, but the 96kiB L2 pushed the chip to 9.3 million transistors; it drew 56W at 333MHz, and the 433MHz version cost $1,492 in 1996. The 1994 68060 used a similar chip fab process and had one more pipeline stage than the Alpha 21164 (which is better for clocking up), yet used only 2.5 million transistors, ran cool enough at low clock speeds for a mobile device, and cost a fraction of the price. It was the 1995 Pentium Pro at 200MHz that was first to outperform a 300MHz Alpha 21164 in SPECint95 benchmarks, though. The Pentium Pro used a 14-stage pipeline and 5.5 million transistors, up from the 5 stages and 3.1 million transistors of the original P5 Pentium.

The moral of the story is that the best code density doesn't always win, even though the six ISAs with the worst code density died (Alpha, PA-RISC, MIPS, SPARC, PPC and original ARM). It is necessary to leverage an industry-leading combination of pipeline depth and code density by turning up the clock speed to show off performance, but the power-saving advantages are excellent for embedded use even if less flashy.
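The size figure in the Waterman quote can be sanity-checked with simple arithmetic. A minimal sketch (the 50% compressible fraction below is inferred from the quoted 25% savings, not taken from the thesis):

```python
# Static code size of RVC relative to fixed-width 32-bit RISC-V:
# compressible instructions shrink from 4 bytes to 2, the rest stay 4.
def rvc_size_ratio(compressible_fraction):
    return compressible_fraction * 0.5 + (1.0 - compressible_fraction) * 1.0

# Waterman reports RVC programs are ~25% smaller, which implies that
# roughly half of all instructions use the 16-bit compressed forms:
print(f"{1.0 - rvc_size_ratio(0.5):.0%} smaller")  # -> 25% smaller
```

The same arithmetic explains why RVC deliberately targets only the most frequent instructions: compressing the common half of the instruction stream captures most of the available savings.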

cdimauro Quote:

A hybrid solution between the two (as well as the preferable one) would be to PUSH the register to be used on the stack, and then use it taking into account where it is now located within it. Eventually, when no longer useful, a POP would restore the contents of the temporarily used register. This is a technique that I used in the first half of the 1990s, when I tried my hand at writing an 80186 emulator for Amiga systems equipped with a 68020 processor (or higher), and which has the virtue of combining the two previous scenarios, taking the best of them (minimising the cost of preserving and restoring the previous value in/from the stack).


For part 5, I see you are paying respect to the 68k Amiga. Well, the 68k still does it better and more elegantly than x86(-64)/APX in more than a few ways. Of course, mentioning the 68k isn't nearly as embarrassing as the Amiga, which is getting so embarrassing it pretty much requires a bag over our heads. The PPC A1222 AmigaNOne hardware received a "2 more weeks" announcement for a few hundred units, and it is expected to cost something like $1000 (or euros) for Raspberry Pi 3 (ARM Cortex-A53) class integer performance; now we also have the A600GS, a 68k emulation device with Cortex-A53 performance that is an attempt to build a 68k user base. I mentioned to them that they would be better off with x86-64 hardware, as the emulation and other OS support is better.

While x86-64 cores are fat and don't scale down very far, as demonstrated by the early in-order superscalar Atom CPUs, they have performance and don't use too much power on smaller chip processes. The successors to the Atom microarchitectures have been beefed up and are now used as the energy-efficient cores in newer Intel desktop and mobile CPUs (like ARM big.LITTLE/DynamIQ cores). Even these relatively little energy-efficient cores are not-so-little CISC powerhouses.
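The PUSH/use/POP technique quoted above can be sketched with a toy register/stack model: borrow a live register for scratch work and, while it is borrowed, reach its preserved value through a known stack offset instead of tying up a second register or a fixed memory variable. The names and layout here are illustrative, not taken from the original 80186 emulator:

```python
# Toy model of the PUSH / use / POP register-borrowing technique.
class Machine:
    def __init__(self):
        self.regs = {"d0": 0, "d1": 0}
        self.stack = []            # grows on push; index -1 is the top

    def push(self, reg):
        self.stack.append(self.regs[reg])

    def pop(self, reg):
        self.regs[reg] = self.stack.pop()

    def sp_rel(self, offset):
        """Read a value relative to the stack pointer (0 = top of stack)."""
        return self.stack[-1 - offset]

m = Machine()
m.regs["d1"] = 42        # d1 holds a live value we must preserve
m.push("d1")             # free d1 for temporary use...
m.regs["d1"] = 7         # ...scratch computation clobbers it
old = m.sp_rel(0)        # the preserved value is still reachable on the stack
m.pop("d1")              # restore d1 when done
assert old == 42 and m.regs["d1"] == 42
```

The win is that the save and restore each cost one cheap stack operation, while reads of the old value in between are ordinary SP-relative accesses.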

https://en.wikipedia.org/wiki/Gracemont_(microarchitecture)
https://upload.wikimedia.org/wikipedia/commons/d/dd/GracemontRevised.png
https://en.wikichip.org/wiki/intel/microarchitectures/gracemont

The 10nm Atom x7xxx line is only $39-$58 at 6W-12W TDP for a SoC with a decent GPU.

The Raspberry Pi 4 has 21% of the Geekbench 5 64 bit single core performance and 32% of the GPU single precision GFLOPS performance of the Intel Atom x7425E SoC.

https://www.cpu-monkey.com/en/compare_cpu-raspberry_pi_4_b_broadcom_bcm2711-vs-intel_atom_x7425e

The Atom SoCs are likely strong enough to emulate older and low-end PPC AmigaNOne hardware, but that is probably seen as the threat of assimilation. Emulation is good enough for the 68k but not for PPC AmigaNOne, yet. Assimilation is the easy route, but a hardware platform built on emulation destroys all the philosophies of the 68k Amiga. CPU performance metrics are important for competitiveness, but economies of scale win the war.

Your analysis and conclusion of APX appears accurate to me. I'd say Intel is getting nervous about the AArch64 ISA, which has better performance metrics in most categories, and often by more than a little bit. The x86-64 ISA bolt-on was just good enough when it came out, but it only has a small advantage in code density and a shorter average instruction size, while the only big advantage I see is CISC memory accesses. APX attacks the performance metrics where x86 and x86-64 are weak, like too many instructions to execute, too much memory traffic and too many branches, but it is challenging to fix them because they are inherent to the hardware needed to execute x86 and x86-64 code, and Intel doesn't want to give up their design expertise in this. They are willing to jettison some of the baggage with x86-S, and may even be able to get rid of x86 now, but they need x86-64 compatibility, which implies the same inherited warts due to bad legacy choices: poor orthogonality, limited encoding space, and 8-bit variable length encodings plus prefix hell adding a decoding tax.

I'm very skeptical that APX will help much, as the decreased number of instructions results in larger instructions that are more difficult to decode, and I don't see how the code density does not decrease despite the claims (one of your strong arguments too). I expect most of the decreased memory traffic to come from PUSH2 and POP2 but, again, decreased code density offsets some of the gains. The 16 new GP registers are all caller-saved (trash/volatile) registers, where an even split would likely be better; the result is more spills to the stack than expected from so many registers, more power used for the larger register file and, again, code density is likely to be negatively affected. The 3-operand addition reduces the number of instructions, but the instructions are larger and code density declines in most cases. The conditional instructions are flawed, as you point out, perhaps to match the legacy behavior of similar instructions.

I expect overall performance to improve, but I would be surprised to see more than 5%, even though some individual benchmarks may see 20%. They obviously ran prototype simulations using the SPEC CPU 2017 Integer benchmark suite, so they should have performance and code density numbers but, I agree, they are suspiciously absent. I wouldn't be surprised if APX is delayed or even canceled. If they were serious about getting more competitive, they would create a new encoding map based on a 16-bit variable length encoding. One 16-bit prefix on the 68k could achieve most of what APX is attempting, but with fewer instructions, less encoding overhead and better code density. The 68k ISA is in much better shape than x86(-64) as far as performance metrics go, especially for smaller and lower-power CPU cores.
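The 3-operand code density concern can be made concrete with rough instruction-size arithmetic. The byte counts below are my assumptions based on the published APX prefix sizes (2-byte REX2, 4-byte extended EVEX), not measured encodings:

```python
# Legacy x86-64: a non-destructive 3-operand add takes two instructions.
legacy = {
    "mov rax, rbx": 3,   # REX.W + opcode + ModRM (assumed encoding)
    "add rax, rcx": 3,   # REX.W + opcode + ModRM (assumed encoding)
}

# APX: one instruction, but the new-data-destination form carries a
# 4-byte extended EVEX prefix (assumed): prefix + opcode + ModRM.
apx = {"add rax, rbx, rcx": 4 + 1 + 1}

# 6 bytes vs 6 bytes: one fewer instruction to execute,
# but no code-size win in this simple case.
print(sum(legacy.values()), sum(apx.values()))
```

Under these assumptions the 3-operand form only pays off in instruction count, not in bytes, which is consistent with the skepticism above about APX code density.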

I expect x86-64 to hold off ARM AArch64 encroachment into the high performance computing markets for a while, despite ARM closing the gap. The desktop and server markets are very profitable for Intel and AMD compared to ARM's embedded and mobile markets. AArch64 is more scalable, its markets are expanding faster than a likely shrinking desktop market, and the ARM reference designs are improving in performance quickly. Intel has something to be worried about, but they also need to worry about their competitor AMD, who has been out-competing them recently. Maybe AMD has something up their sleeve, but there is only so much room for AMD64 improvement. The writing would be on the wall if we start to see more serious ARM development from Intel and AMD. The consoles have been a good predictor of ISA trends so far, so a switch to ARM by more than the Nintendo Switch could be an indicator.

Can there be just one ISA to rule them all? RISC-V found a niche and, I believe, has staying power with more open hardware, and its compressed RVC encoding helps. AArch64 doesn't scale all the way down to Thumb territory either. The problem with compressed RISC encodings like RVC and Thumb is that they need an increased number of small instructions, which decreases performance, a problem the 68k doesn't seem to have. The CPU market has large barriers to entry, mostly due to economies of scale, but there may be some opportunities and innovations yet. The CPU and ISA landscape would be pretty boring if there was only one ISA. We may be surprised, and the Mill Computing belt-machine architecture may pop back up in the news. More likely would be a more conservative effort starting small with a niche to fill.

Last edited by matthey on 05-Sep-2023 at 08:09 PM.
Last edited by matthey on 05-Sep-2023 at 03:55 PM.
Last edited by matthey on 05-Sep-2023 at 10:28 AM.

Fl@sh 
Re: APX: Intel's new architecture
Posted on 5-Sep-2023 12:31:52
#4
Regular Member
Joined: 6-Oct-2004
Posts: 252
From: Napoli - Italy

@cdimauro

I read it in Italian; nice article, well done!
Why was Intel APX not built on top of X86-S? Do you agree that removing all the old 16-bit legacy instructions could leave more room to implement better new APX extensions?

Why were your suggestions for an optimized APX implementation not considered by Intel? Are they blind, short on time, or what?

_________________
Pegasos II G4@1GHz 2GB Radeon 9250 256MB
AmigaOS4.1 fe - MorphOS - Debian 9 Jessie

matthey 
Re: APX: Intel's new architecture
Posted on 5-Sep-2023 19:11:42
#5
Super Member
Joined: 14-Mar-2007
Posts: 1757
From: Kansas

@Fl@sh Quote:

Why was Intel APX not built on top of X86-S?


Slim down x86 with x86-S and then fatten it back up with APX? Maybe too many experiments at once?

@Fl@sh Quote:

Do you agree that removing all the old 16-bit legacy instructions could leave more room to implement better new APX extensions?


The 68k uses a 16-bit variable length encoding while x86 uses an 8-bit variable length encoding, so legacy x86 instructions start at 8 bits. Many of the original x86 instructions are commonly used in x86-64 and APX, but with new variations that require prefixes. There are only 256 different base 8-bit encodings, and they were assigned for maximum code density on an 8-bit CPU accessing the stack and 8 general purpose registers. Instruction length should be based on the frequency of instructions used, and the original x86 encodings were suspect in this regard.

A variable length 8-bit encoding can likely achieve code density superior to a 16-bit variable length encoding if the encoding is efficient and based on instruction frequency. This can be seen from the CAST BA2 instruction set for 32-bit embedded cores, which allows variable length instructions of 16, 24, 32 and 48 bits (not 8 bits, to preserve more encoding space) and 16 or 32 GP registers, yet claims better code density than Thumb-2, which uses a 16-bit variable length encoding like the 68k. A variable length 16-bit encoding has performance advantages due to alignment (simplified decoding) and supports more powerful instructions with 16, 32, 48 and 64-bit lengths in the common case. Many compressed RISC ISAs use a 16-bit variable length encoding but only allow 16 or 32-bit lengths; the 68k-like ColdFire allowed 16, 32 and 48-bit lengths, losing performance and 68k compatibility for little benefit by not allowing 64-bit lengths. The 68k allows instruction lengths longer than 64 bits, but they are uncommon for a 32-bit ISA and don't need to be as fast. The x86 ISA has a shorter maximum instruction length than the 68k, and it is further restricted in x86-64, but in the worst case x86(-64) has to look at more 8-bit encodings/prefixes/extensions than the 68k has to look at 16-bit encodings/extensions. Both incur complex instruction overhead (usually microcode) for long instructions, but the longer instructions can add performance and flexibility, while it is the common shorter instructions that matter most for performance.
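The decode-simplicity point above can be illustrated with a toy length decoder: with 16-bit granularity the length falls out of one fetched word, while with byte granularity each prefix byte must be examined in turn before the opcode is even known. The encodings below are invented for illustration and do not match any real ISA:

```python
# Toy contrast between 16-bit-granular and 8-bit-prefixed length decode.

def length_16bit_granular(first_word):
    """Length in bytes, read directly from the top 2 bits of the
    first 16-bit word (hypothetical encoding)."""
    return {0: 2, 1: 4, 2: 6, 3: 8}[first_word >> 14]

def length_8bit_prefixed(code):
    """Scan 1-byte prefixes (0xF0-0xFF in this made-up scheme) one at a
    time until an opcode byte is found; each prefix is a decode step."""
    n = 0
    while code[n] >= 0xF0:
        n += 1
    return n + 2             # prefixes + opcode + one operand byte

assert length_16bit_granular(0x4ED0) == 4                  # top bits 01 -> 4 bytes
assert length_8bit_prefixed([0xF2, 0xF3, 0x01, 0xC0]) == 4 # two prefixes scanned
```

In the first scheme the length is known after a single aligned fetch; in the second, worst-case decode time grows with the number of prefixes, which is the "prefix hell" decoding tax mentioned above.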

The baggage of x86 goes far beyond instructions. x86 and x86-64 have many modes and submodes, like protected (virtual?) mode, real address mode, system management mode, 64-bit (long?) mode, compatibility mode and perhaps more. The old x86 CPUs are compatible with the 8086, whose 16-bit registers were combined with segment registers to address more memory, a scheme closer to memory banks than a flat memory model (the 68k ISA started with a 32-bit flat memory model, so it avoids all these legacy modes except for a possible 64-bit mode in the future). The x86 supported memory protection rings with the idea of improving security, and they were partially used in some OSs. The MMU support has changed over the years, with some of the early support outdated and likely rarely used (a modernized 68k MMU would likely need some changes as well, especially for a 64-bit mode). The x86 FPU was an old stack-based design that has long been deprecated, but there may be somewhat modern programs still using it due to functionality missing in its SIMD unit replacement. Many of the x86 instructions behave differently in different modes and with different configurations. I don't pretend to understand all the issues, but I can see it is a huge, ugly legacy mess.

Maybe much of the legacy can be wiped away, but the encoding maps are inefficient, and a 16-bit variable length encoding would be a big improvement, yet less compatible and requiring new hardware designs. The x86-64 ISA needs to be thrown away, but x86-64 CPUs are powerful, at least partially due to the CISC design. Nobody wants to develop a memory-munching CISC monster replacement to compete with it, because such designs are more difficult and expensive to develop and have a bad reputation due to x86(-64). Maybe AArch64 "RISC" has adopted enough of the performance benefits of CISC while leaving behind the legacy baggage to finally compete, but I believe a good CISC ISA has more potential than x86-64 or APX too.

@Fl@sh Quote:

Why were your suggestions for an optimized APX implementation not considered by Intel? Are they blind, short on time, or what?


Maybe Cesare's pursuit of an improved 64-bit x86 ISA led him away from current x86-64 hardware designs, which Intel management didn't want to hear? Maybe Intel's management is trying to tweak a little more performance out of existing x86-64 hardware without losing compatibility? Maybe Intel management can get his input for free from articles criticizing their tweaked APX ISA, without having to pay him for a better solution?

Last edited by matthey on 05-Sep-2023 at 07:27 PM.
Last edited by matthey on 05-Sep-2023 at 07:16 PM.

cdimauro 
Re: APX: Intel's new architecture
Posted on 5-Sep-2023 20:07:57
#6
Elite Member
Joined: 29-Oct-2012
Posts: 3161
From: Germany

@amigagr

Quote:

amigagr wrote:
@cdimauro

It's a good opportunity for me to learn computer terminology in Italian. Thank you very much!

You're welcome! If you have any doubts or other questions, just ask.

@Fl@sh

Quote:

Fl@sh wrote:
@cdimauro

I read it in Italian; nice article, well done!

Thanks.
Quote:
Why was Intel APX not built on top of X86-S?

They aren't incompatible. APX can be implemented on x86 processors as well as X86-S ones (when/if those arrive; my guess is they will: keeping x64 processors as they are is just a waste of silicon and power, due to functionality that hasn't been used in decades).
Quote:
Do you agree that removing all the old 16-bit legacy instructions could leave more room to implement better new APX extensions?

I don't know the impact of 16-bit code on a concrete core implementation, but my idea is that removing it might free some space for APX (at least).
Quote:
Why were your suggestions for an optimized APX implementation not considered by Intel? Are they blind, short on time, or what?

I haven't submitted them to Intel.

Maybe I'll write to a former colleague who works on Intel's compilers and share my articles with him. Let's see...

cdimauro 
Re: APX: Intel's new architecture
Posted on 5-Sep-2023 21:09:58
#7
Elite Member
Joined: 29-Oct-2012
Posts: 3161
From: Germany

@matthey

Quote:

matthey wrote:
@cdimauro
Nice analysis and write up for APX. I hadn't heard of it before (or the x86-S simplification). When is it due to be implemented in Intel CPU cores?

Thanks! I don't know, but the idea of Intel working on a new architecture has been floating around for several years. I think the time is ripe now, and the fact that there are preliminary SPEC2017 tests for it is a clear signal that something is already ready. To me it's a matter of a few years (maybe sooner).
Quote:
I have a few comments and suggestions as an armchair technical proofreader.

Much appreciated!
Quote:
cdimauro Quote:

Exactly the same could be said of ARM and its also blazoned 32-bit architecture, which, however, had the courage to put its hands to the project and re-establish it on new foundations when it decided to bring out its own 64-bit extension, AArch64 AKA AMD64, which is not compatible with the previous 32-bit ISA (although it has much in common and bringing applications to it does not require a total rewrite).


For part 8, should this have been "AArch64 AKA ARMv8-A"? You used AKA properly right before it so you obviously know what it means and it is even highlighted in red as a link yet it went unnoticed in both the English and Italian article.

That was a lapsus: I meant to write ARM64. Fixed now, thanks!
Quote:
cdimauro Quote:

This sounds rather strange to me, since I still remember very well how AMD had claimed, when introducing x86-64 AKA x64, to have evaluated the extension of x86 to 32 instead of 16 registers, but to have given up because the advantages did not prove to be significant (contrary to the switch from 8 to 16 registers, where the differences, instead, were quite tangible, as we have seen for ourselves) and did not justify the greater implementation complexity of such a solution.


For part 3, you may consider changing "x86-64 AKA x64" to "AMD64 AKA x64" since I believe that is what AMD originally called the ISA and there may be minor differences between the Intel x86-64 implementation.

x86-64 was the original name AMD used when it introduced its 64-bit extension to IA-32/x86, so I prefer to keep it.

AMD64 sounds much better, but to me x86-64 is more appropriate, especially in this context (since I was talking about the introduction of this new ISA).
Quote:
https://people.eecs.berkeley.edu/~krste/papers/waterman-ms.pdf Quote:

RVC is a superset of the RISC-V ISA, encoding the most frequent instructions in half the size of a RISC-V instruction; the remaining functionality is still accessible with full-length instructions. RVC programs are 25% smaller than RISC-V programs, fetch 25% fewer instruction bits than RISC-V programs, and incur fewer instruction cache misses.



RVC improves performance substantially when the instruction working set does not fit in cache. For 6 of the 8 cache configurations, using RVC is more effective than doubling the associativity. Using RVC attains, on average, 80% of the speedup of doubling the cache size. A system with a 16 KB direct-mapped cache with RVC is 99% as fast as a system with a 32 KB direct-mapped cache without RVC.


For part 5, I see you used a play from my playbook pulling out the RISC-V code density studies and documentation. Much of the RISC-V documentation is written to make RVC code density look good by comparing to the non-compressed RISC-V ISA which resembles the old extinct and fat RISC "desktop" ISAs like SPARC, MIPS, PA-RISC and Alpha.

Indeed. We know how it goes: they make comparisons only against selected products where they can show improvements (see also some papers from a recent Turing award winner).
Quote:
The more universal code density RISC-V research I like to quote is the following.

The RISC-V Compressed Instruction Set Manual Version 1.7 Quote:

The philosophy of RVC is to reduce code size for embedded applications and to improve performance and energy-efficiency for all applications due to fewer misses in the instruction cache. Waterman shows that RVC fetches 25%-30% fewer instruction bits, which reduces instruction cache misses by 20%-25%, or roughly the same performance impact as doubling the instruction cache size.


It's probably the same research and researcher, Waterman, but this is shorter and more easily understood in my opinion. In other words, every 25%-30% improvement in code density is roughly like doubling the size of the instruction cache.

That was exactly what I was looking for, thanks! Unfortunately I couldn't recall it while writing the article, so I opted to quote some excerpts from the original thesis.

I've now replaced it.
Quote:
The 1995 DEC Alpha 21164 CPU using AXP ISA demonstrates the turning point of RISC fallacies where the L1 ICache had to be reduced to 8kiB to maintain timing at the high clock speeds but it performed more like a 1-2kiB L1 ICache for a 68020 CPU. The Alpha 21164 pioneered the on-chip L2 cache but the 96kiB L2 caused the chip to use 9.3 million transistors, draw 56W of power at 333MHz, and the 433 MHz version cost $1,492 in 1996. The 1994 68060 used a similar chip fab process, had one more pipeline stage than the Alpha 21164 which is better for clocking up, only used 2.5 million transistors, was cool enough at low clock speeds for a mobile device and cost a fraction of the price. It was the 1995 Pentium Pro@200MHz that was first to outperform a 300 MHz Alpha 21164 in SPECint95 benchmarks though. The Pentium Pro used a 14 stage pipeline and 5.5 million transistors up from the 5 stages and 3.1 million transistors of the original P5 Pentium. The moral of the story is that the best code density doesn't always win even though the worst 6 died (Alpha, PA-RISC, MIPS, SPARC, PPC and ARM original). It is necessary to leverage an industry leading pipeline depth and code density by turning up the clock speed to show off performance but the power saving advantages are excellent for embedded use even if less flashy.

True, but it's also important to encode "more useful work" in each instruction, as APX is also showing. So a combination of the three factors is the winning element for an architecture.
Quote:
cdimauro Quote:

A hybrid solution between the two (as well as the preferable one) would be to PUSH the register to be used on the stack, and then use it taking into account where it is now located within it. Eventually, when no longer useful, a POP would restore the contents of the temporarily used register. This is a technique that I used in the first half of the 1990s, when I tried my hand at writing an 80186 emulator for Amiga systems equipped with a 68020 processor (or higher), and which has the virtue of combining the two previous scenarios, taking the best of them (minimising the cost of preserving and restoring the previous value in/from the stack).


For part 5, I see you are paying respect to the 68k Amiga.

I still love it.

The 68k was the main source of inspiration even for my own ISA, NEx64T (which is... an x86/x64 rewrite/extension). In fact, its most important features come directly from the 68k.
Quote:
Well, the 68k still does it better and more elegantly than x86(-64)/APX in more than a few ways.

I fully agree on that as well. x86/x64 is... what it is. And APX makes it even worse from an architectural/structural PoV.

The 68k is a piece of cake compared to them. And it's a much smaller/simpler ISA.
Quote:
Of course mentioning the 68k isn't nearly as embarrassing as the Amiga which pretty much requires a bag over our heads it is getting so embarrassing. PPC A1222 AmigaNOne hardware received a "2 more weeks" announcement for a few hundred units and it is expected to cost something like $1000 or Euros for Raspberry Pi 3 ARM Cortex-A53 like integer performance and now we have an A600GS 68k emulation device with Cortex-A53 performance that is an attempt to build a 68k user base. I mentioned to them they would be better off with x86-64 hardware as the emulation and other OS support is better. While x86-64 cores are fat and don't scale down very far as demonstrated by the early in-order superscalar Atom CPUs, they have performance and don't use too much power on smaller chip processes.

Indeed. However, the RPi has the advantage of being very cheap. I don't know how much a comparable x86-64 board would cost.

PowerPCs... I prefer a "no comment"...
Quote:
The successors to the Atom microarchitectures have been beefed up and are now used as the energy efficient cores in newer Intel desktop and mobile CPUs (like ARM big.LITTLE/DynamIQ cores). Even these relatively little energy efficient cores are not so little CISC power houses.

https://en.wikipedia.org/wiki/Gracemont_(microarchitecture)
https://upload.wikimedia.org/wikipedia/commons/d/dd/GracemontRevised.png
https://en.wikichip.org/wiki/intel/microarchitectures/gracemont

The 10nm Atom x7xxx line is only $39-$58 at 6W-12W TDP for a SoC with a decent GPU.

The Raspberry Pi 4 has 21% of the Geekbench 5 64 bit single core performance and 32% of the GPU single precision GFLOPS performance of the Intel Atom x7425E SoC.

https://www.cpu-monkey.com/en/compare_cpu-raspberry_pi_4_b_broadcom_bcm2711-vs-intel_atom_x7425e

The Atom SoCs are likely strong enough to emulate older and low end PPC AmigaNOne hardware but that is probably the threat of assimilation. Emulation is good enough for the 68k but not for PPC AmigaNOne, yet.

Absolutely. The so-called "E-Cores" are based on a new microarchitecture which is simply awesome, considering how it performs while drawing much less power than comparable cores.

Atoms based on it will shine, especially at emulation (they have a 6-way decoder with a 17-port backend: a monster!).
Quote:
Assimilation is the easy route, but a hardware platform built on emulation destroys all the philosophies of the 68k Amiga. CPU performance metrics are important for competitiveness, but economies of scale win the war.

Money is the most important factor. Unfortunately it doesn't lead to "nice" designs.
Quote:
Your analysis and conclusion on APX appear accurate to me. I'd say Intel is getting nervous about the AArch64 ISA, which has better performance metrics in most categories, often by more than a little bit. The x86-64 ISA bolt-on was just good enough when it came out, but it only has a small advantage in code density and a shorter average instruction size, while the only big advantage I see is CISC memory accesses. APX attacks the performance metrics where x86 and x86-64 are weak, like too many instructions to execute, too much memory traffic and too many branches, but fixing them is challenging because they are inherent to the hardware needed to execute x86 and x86-64 code, and Intel doesn't want to give up its design expertise in this.

Exactly. It's the ultimate move for Intel to be more competitive. But it'll be an advantage limited in time: then the competition will catch up, making Intel nervous again.
Quote:
They are willing to jettison some of the baggage with x86-S and may even be able to get rid of x86 now

No, x86 is here to stay: there's still too much software using it. In fact, X86-S doesn't abolish x86: it "just" makes it a "second-class citizen" (since it can be used only by userland applications).
Quote:
but they need x86-64 compatibility, which implies the same inherited warts due to bad legacy choices like poor orthogonality, limited encoding space, and 8 bit variable length encodings and prefix hell adding a decoding tax. I'm very skeptical that APX will help much, as the decreased number of instructions comes with larger instructions that are more difficult to decode, and I don't see how code density could not decrease, despite the claims (one of your strong arguments too).

In fact, to me it's unbelievable: the "new" instructions are really too long. I don't see how code density could remain similar to x64's.

That's why I can't wait to get my hands on some APX binaries to verify my impressions.
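Once APX binaries and a disassembler are available, the check is straightforward: total code bytes divided by instruction count. A minimal sketch in Python, assuming (hypothetically) a disassembly listing where each line begins with the instruction's bytes as hex pairs; the listing format and sample lines are illustrative, not from any real tool output:

```python
# Minimal code-density estimator: average instruction length from a
# disassembly listing. Assumes (hypothetically) lines like
#   "48 89 e5    mov rbp, rsp"
# where the instruction bytes come first as two-digit hex pairs.
def average_instruction_length(lines):
    total_bytes = 0
    count = 0
    for line in lines:
        tokens = line.split()
        # Count the leading tokens that look like hex byte pairs.
        n = 0
        for tok in tokens:
            if len(tok) == 2 and all(c in "0123456789abcdefABCDEF" for c in tok):
                n += 1
            else:
                break
        if n:
            total_bytes += n
            count += 1
    return total_bytes / count if count else 0.0

listing = [
    "55          push rbp",
    "48 89 e5    mov rbp, rsp",
    "c3          ret",
]
print(average_instruction_length(listing))  # 5 bytes / 3 instructions ≈ 1.67
```

Comparing this average (and the total byte count for the same compiled source) between x64 and APX builds would settle the code-density question empirically.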
Quote:
I expect most of the decreased memory traffic to come from PUSH2 and POP2 but, again, decreased code density offsets some of the gains. The 16 new GP registers are all caller saved (trash/volatile) registers where CISC would likely benefit from more callee saved registers causing more spills to the stack than expected,

Another choice which to me looks like nonsense. Bah...
Quote:
using more power for the larger register file

No, this will not change much, because modern processors already have hundreds of physical (renamed) registers. Adding 16 GP registers to the ISA makes practically no difference to running applications using them (besides a slightly longer context switch).
Quote:
and again code density is likely to be negatively affected. The 3 op addition reduces the number of instructions but they are larger and code density declines in most cases. The conditional instructions are flawed as you point out, perhaps to match legacy behavior of similar instructions.

That's my idea as well. However, with APX Intel introduced the CFCMOVcc instructions, which at least fix the misbehavior with exceptions, even though they contradict the legacy implementation (CMOVcc).

This is another bit of nonsense...
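The behavioral split can be modeled in a few lines. With a memory source, legacy CMOVcc performs the load regardless of the condition, so it can fault even when the move is suppressed, while APX's CFCMOVcc suppresses the access (and any fault) when the condition is false. A simplified Python model, where memory is just a dict and a missing address stands in for a faulting access:

```python
# Toy model of the exception behavior, not the full semantics:
# memory is a dict; reading a missing address raises KeyError,
# standing in for a page fault on the load.
def cmov(cond, dest, memory, addr):
    value = memory[addr]            # legacy CMOVcc: the load ALWAYS
    return value if cond else dest  # executes, so it can fault even
                                    # when cond is False

def cfcmov(cond, dest, memory, addr):
    if not cond:                    # APX CFCMOVcc: the access is
        return dest                 # suppressed when the condition is
    return memory[addr]             # false, so no fault can occur

mem = {0x1000: 42}
print(cfcmov(False, 7, mem, 0xDEAD))  # 7: the bad address is never touched
try:
    cmov(False, 7, mem, 0xDEAD)       # the legacy form still loads...
except KeyError:
    print("legacy CMOV faulted")      # ...and faults
```

This is why speculative/predicated code that may dereference a not-yet-validated pointer only works safely with the CFCMOVcc variant.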
Quote:
I expect overall performance to improve but I would be surprised to see more than 5% even though some individual benchmarks may see 20%. They obviously ran prototype simulations using the SPEC CPU 2017 Integer benchmark suite so they should have performance and code density numbers but, I agree, they are suspiciously absent. I wouldn't be surprised if APX is delayed and even canceled.

As I've said before, I believe that we'll see concrete products soon. This is Intel's last anchor in the fight with its competitors.
Quote:
If they were serious about getting more competitive, they would create a new encoding map based on a 16 bit variable length encoding.

That was my aim/wish. And it was/is also exactly what I've done with my new ISA.
Quote:
One 16 bit prefix on the 68k could achieve most of what APX is attempting to do, but with fewer instructions, less encoding overhead and better code density. The 68k ISA is in much better shape than x86(-64) as far as performance metrics go, especially for smaller and lower-power CPU cores.

Absolutely. And the 68k would have gained a 64-bit extension as well.
Quote:
I expect x86-64 to hold off ARM AArch64 encroachment into the high performance computing markets for awhile despite ARM closing the gap. The desktop and server markets are very profitable for Intel and AMD compared to ARM embedded and mobile markets. AArch64 is more scalable, their markets are expanding faster compared to a likely shrinking desktop market and the ARM reference designs are improving in performance quickly.

Indeed. See Apple's M1/M2 processors...
Quote:
Intel has something to be worried about but they need to also worry about their competitor AMD who has been out competing them recently. Maybe AMD has something up their sleeves

AMD owns the console market, which is a cash cow to milk.
Quote:
but there is only so much room for AMD64 improvement. The writing would be on the wall if we start to see more serious ARM development from Intel and AMD. The consoles have been a good predictor of ISA trends so far so a switch to ARM by more than the Nintendo Switch could be an indicator. Can there be just one ISA to rule them all? RISC-V found a niche and I believe has staying power with more open hardware and their compressed RVC encoding helps.

However, RISC-V isn't tailored for single-core/thread performance, even with super-aggressive implementations: this ISA is too simple and its instructions don't do much "useful work".

RISC-V's primary advantage is its license-free model.
Quote:
AArch64 doesn't scale all the way down to Thumb territory either. The problem with compressed RISC encodings like the RVC and Thumb encodings is that they need an increased number of small instructions, which decreases performance, a problem the 68k doesn't seem to have.

Because it provides more "useful work": the 68k shines at that, especially with its super-flexible memory-to-memory MOVE instruction.
Quote:
The CPU market has large barriers to entry, mostly due to economies of scale, but there may be some opportunities and innovations yet. The CPU and ISA landscape would be pretty boring if there was only one ISA.

I fully agree AND it's my hope too!
Quote:
We may be surprised and the Mill Computing Mill architecture belt machine may pop back up in the news.

Hmm. Too many announcements / too much propaganda and no concrete product. It's hard to believe in some revolution here, IMO.
Quote:
More likely would probably be a more conservative effort starting small with a niche to fill.

Indeed.

cdimauro 
Re: APX: Intel's new architecture
Posted on 5-Sep-2023 21:43:44
#8 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3161
From: Germany

@matthey

Quote:

matthey wrote:
@Fl@sh Quote:

Why was Intel APX not done on top of X86-S?


Slim down x86 with x86-S and then fatten it back up with APX? Maybe too many experiments at once?

I think that they'll come separately, but I see future processors adopting both.
Quote:
@Fl@sh Quote:

Do you agree that removing all the old 16-bit legacy instructions could leave more room to implement better new APX extensions?


The 68k uses a 16 bit variable length encoding while x86 uses an 8 bit variable length encoding so legacy x86 instructions start at 8 bits. Many of the original x86 instructions are commonly used in x86-64 and APX but with new variations that require prefixes. There are only 256 different base 8 bit encodings and they were encoded for maximum code density using an 8 bit CPU accessing the stack and 8 general purpose registers. Instruction length should be based on frequency of instructions used while original x86 encodings were suspect in this regard.

Because it was a primitive design without a vision of future needs and extensions.

The 8086 was very good when it was introduced, for the market of the time: a simple design with an incredible code density.

But it wasn't future-proof from this PoV and it required horrible patches for its extensions.
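Those layered patches are easy to see in the bytes: the same register-to-register ADD grows as each extension bolts on a new prefix. The first three byte sequences below are standard x86/x86-64 encodings; the last one follows Intel's published APX REX2 scheme (0xD5 plus a payload byte), with the payload value left as a placeholder rather than a real encoding:

```python
# How one ADD instruction accretes prefixes across x86 generations.
# The first three encodings are standard; the REX2 payload byte (0x00)
# is a placeholder, not the actual encoding for r16/r17.
encodings = {
    "add ax,  bx   (8086, 16-bit mode)":  bytes([0x01, 0xD8]),
    "add eax, ebx  (386, 32-bit mode)":   bytes([0x01, 0xD8]),
    "add rax, rbx  (x86-64, REX.W)":      bytes([0x48, 0x01, 0xD8]),
    # APX: reaching the new registers r16-r31 needs the 2-byte REX2
    # prefix (0xD5 + payload byte carrying the extra register bits).
    "add r16, r17  (APX, REX2)":          bytes([0xD5, 0x00, 0x01, 0xD8]),
}
for name, enc in encodings.items():
    print(f"{len(enc)} bytes  {name}")
```

Same operation, same core opcode byte (0x01), but 2 bytes in 1978 and 4 bytes under APX: the prefixes are pure extension tax.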
Quote:
A variable length 8 bit encoding can likely achieve superior code density to a 16 bit variable length encoding with an efficient encoding based on instruction frequency as can be seen by the Cast BA2 instruction set for 32 bit embedded cores which allows variable length instructions of 16, 24, 32 and 48 bits (not 8 bits to preserve more encoding space) and 16 or 32 GP registers yet claims to have better code density than Thumb2 which uses a 16 bit variable length encoding like the 68k.

That's impressive. I've searched around several times to find the BA2 ISA manual, but I never found it. It looks like the company is jealously keeping the opcode design to itself.

However, my idea is that it might be too specialized for the embedded market and not useful in other areas (mobile, desktop, servers, workstations, HPC).
Quote:
A variable length 16 bit encoding has performance advantages due to alignment (simplified decoding) and supports more powerful instructions with 16, 32, 48 and 64 bit lengths in the common case. Many compressed RISC ISAs use a 16 bit variable length encoding but only allow 16 or 32 bit lengths, and the 68k-like ColdFire allowed 16, 32 and 48 bit lengths, losing performance and 68k compatibility for little benefit by not allowing 64 bit lengths. The 68k allows instruction lengths longer than 64 bits, but they are uncommon for a 32 bit ISA and don't need to be as fast.

Well, it depends on the encodings.

I have very long instructions in my ISA, because they allow packing a lot of "useful work" into single instructions, and I've gained some code density using them on real code (e.g., pairs of x86 or x64 instructions which were "fused" into a single NEx64T instruction).

And this was limited only because my current "reassembler" can work with two instructions at a time (supporting more would be a nightmare to implement, even with my beloved Python). I can say this with certainty because I've clearly seen, several times in disassembled x86/x64 binaries, instruction patterns that could be "fused" (even 4 instructions into one).

So, having longer instructions might not be so uncommon with some ISAs.
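A pair-fusing pass of that kind can be sketched as a peephole over a disassembled instruction list. The pattern below (folding an address-forming LEA into the memory operand of the following MOV) is just one hypothetical example of such a rewrite, with instructions reduced to simple tuples; NEx64T's real patterns and representation are not public:

```python
# Sketch of a two-instruction peephole "fuser": scan adjacent pairs and
# replace known patterns with one combined instruction. Instructions
# are simplified (mnemonic, dest, src) tuples; the single pattern shown
# is illustrative, and it is only valid when the scratch register (the
# LEA destination) is dead after the pair.
def fuse_pairs(instructions):
    out = []
    i = 0
    while i < len(instructions):
        if i + 1 < len(instructions):
            a, b = instructions[i], instructions[i + 1]
            # Pattern: lea rX, [expr] ; mov rY, [rX]  ->  mov rY, [expr]
            if a[0] == "lea" and b[0] == "mov" and b[2] == f"[{a[1]}]":
                out.append(("mov", b[1], a[2]))
                i += 2
                continue
        out.append(instructions[i])
        i += 1
    return out

code = [
    ("lea", "rax", "[rbx+8*rcx+16]"),
    ("mov", "rdx", "[rax]"),
    ("ret", "", ""),
]
print(fuse_pairs(code))  # [('mov', 'rdx', '[rbx+8*rcx+16]'), ('ret', '', '')]
```

Extending this to 3- or 4-instruction windows multiplies the pattern combinations and the liveness checks, which is exactly why longer fusions become a nightmare to implement.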
Quote:
The x86 ISA has a shorter maximum instruction length than the 68k which is further restricted in x86-64 but x86(-64)

No, they are the same: max 15 bytes per instruction. Which is OK, because no instruction can exceed that length (except by adding more redundant prefixes).

On the other hand, an APX instruction can exceed 15 bytes, and that's why Intel says that, in those cases, the instruction should be split in two (e.g., using LEA to compute the address of the memory location and save it in a register, then referencing the memory through that register in the following APX instruction).
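The byte budget that makes this overflow possible is easy to add up. With APX's 4-byte extended-EVEX prefix, a worst-case instruction with full addressing and a 32-bit immediate already sits exactly at the 15-byte limit, so any extra legacy prefix (a segment override, say) pushes it over and forces the LEA split. An illustrative tally:

```python
# Worst-case APX instruction length, component by component.
# The 15-byte cap is the architectural x86 instruction-length limit.
components = {
    "extended EVEX prefix": 4,
    "opcode":               1,
    "ModRM":                1,
    "SIB":                  1,
    "disp32":               4,
    "imm32":                4,
}
length = sum(components.values())
print(length)      # 15: exactly at the limit
print(length + 1)  # 16: one segment-override prefix pushes it over,
                   # so the instruction must be split (LEA + access)
```

The LEA replaces the disp32/SIB-heavy addressing with a plain register base, bringing the second instruction comfortably back under the limit.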
Quote:
in the worst case has to look at more 8 bit encodings/prefixes/extensions than the 68k has to look at 16 bit encodings/extensions in the worst case. Both incur complex instruction overhead (usually microcode) for long instructions

I don't think so. The reason is that x86/x64 use 8-bit prefixes, and catching them in a byte stream is quite easy.

Whereas catching all the 68k's weird encodings is more difficult, and it's even worse with the 68020+ new extension words.

However, Apollo's 68080 seems to have no problems with that.
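Skimming prefixes off an x86 byte stream really is a small loop. A minimal sketch covering only the legacy one-byte prefixes and the 64-bit REX range; real decoders also handle VEX/EVEX/REX2 and the multi-byte opcode maps, which are out of scope here:

```python
# Split an x86-64 instruction's leading prefixes from the rest of its
# bytes. Only legacy one-byte prefixes and a trailing REX (0x40-0x4F)
# are handled; VEX/EVEX and the opcode itself are beyond this sketch.
LEGACY_PREFIXES = {
    0xF0, 0xF2, 0xF3,        # LOCK, REPNE, REP
    0x2E, 0x36, 0x3E, 0x26,  # CS, SS, DS, ES segment overrides
    0x64, 0x65,              # FS, GS segment overrides
    0x66, 0x67,              # operand-size, address-size
}

def split_prefixes(code):
    i = 0
    while i < len(code) and code[i] in LEGACY_PREFIXES:
        i += 1
    # In 64-bit mode a single REX byte may sit between the legacy
    # prefixes and the opcode.
    if i < len(code) and 0x40 <= code[i] <= 0x4F:
        i += 1
    return code[:i], code[i:]

# 66 48 0F B7 C3: operand-size prefix, REX.W, then the opcode bytes.
prefixes, rest = split_prefixes(bytes([0x66, 0x48, 0x0F, 0xB7, 0xC3]))
print(prefixes.hex(), "|", rest.hex())  # 6648 | 0fb7c3
```

By contrast, a 68k decoder can't strip anything up front: the addressing modes and extension words are only discovered by decoding the first 16-bit opword itself.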
Quote:
but the longer instructions can add performance and flexibility while it is the common shorter instructions that are more important for performance.

Exactly.
Quote:
The baggage with x86 goes far beyond instructions. The x86 and x86-64 have many modes and submodes, like protected (virtual?) mode, real address mode, system management mode, 64 bit (long?) mode, compatibility mode and perhaps more. The old x86 CPUs are compatible with the 808x line, which had 16 bit registers and used segment registers to extend addressing, a scheme closer to memory banks than a flat memory model (the 68k ISA started with a 32 bit flat memory model so it avoids all these legacy modes, except for a possible 64 bit mode in the future). The x86 supported memory protection rings with the idea of improving security, and these were partially used in some OSs. The MMU support has changed over the years, with some of the early support outdated and likely rarely used (a modernized 68k MMU would likely need some changes as well, especially for a 64 bit mode). The x86 FPU was an old stack-based design that has long been deprecated, but there may be somewhat modern programs still using it due to functionality missing in its SIMD replacement. Many of the x86 instructions behave differently in different modes and with different configurations. I don't pretend to understand all the issues, but I can see it is a huge ugly legacy mess. Maybe much of the legacy can be wiped away, but the encoding maps are inefficient, and a 16 bit variable length encoding would be a big improvement, yet less compatible and requiring new hardware designs.

I agree with almost all of that, besides the last part.

x64 before, and APX now, require new binaries which are not compatible with older processors. So binary compatibility is good to have, but not possible when so many changes need to be introduced.

A new hardware design IMO isn't required if the new ISA is done properly. Only the decoder needs most of the changes, while everything else could be kept as it is, with some small adjustments.
Quote:
The x86-64 ISA needs to be thrown away but x86-64 CPUs are powerful at least partially due to the CISC design. Nobody wants to develop a memory munching CISC monster replacement that competes with it because they are more difficult and expensive to develop and have a bad reputation due to x86(-64). Maybe AArch64 "RISC" has adopted enough performance benefits of CISC while leaving behind the legacy baggage to finally compete but I believe a good CISC ISA has more potential than x86-64 or APX too.

I fully agree and... I can prove it.
Quote:
@Fl@sh Quote:

Why were your suggestions for an optimized APX implementation not considered by Intel? Are they blind, lacking time, or what?


Maybe Cesare's pursuit of an improved 64 bit x86 ISA led him away from current x86-64 hardware designs which Intel management didn't want to hear?

That happened when I joined Intel. Talking with my first manager, he told me that the company isn't interested in anything different from its ISA. So my new ISA (which was just at v2 at the time; now it has reached v11 and changed dramatically) had no hope.

I think that this still applies: Intel doesn't want to change its ISA, only extend it to capitalize on its big legacy. Which is good... while it's working, but it isn't future-proof.
Quote:
Maybe Intel's management is trying to tweak a little more performance out of existing x86-64 hardware without losing compatibility?

Yup. See above.
Quote:
Maybe Intel management can get free input from his articles criticizing their tweaked APX ISA without having to pay him for a better solution?

They could do that, and I hope for their sake that they will at least consider those suggestions to improve APX. The suggestions are free, simple and effective: now it's up to them.


Copyright (C) 2000 - 2019 Amigaworld.net.
Amigaworld.net was originally founded by David Doyle