/  Forum Index
   /  General Technology (No Console Threads)
      /  Amiga SIMD unit
Poll : Is AMMX a good standard for the 68k?
Yes
No
No opinion or pancakes
 
Hammer 
Re: Amiga SIMD unit
Posted on 20-Oct-2020 23:06:43
#161
Elite Member
Joined: 9-Mar-2003
Posts: 5286
From: Australia

@cdimauro

FYI, in terms of gate size, Intel's 14+++ nm variant is closer to TSMC's 1st gen 7nm.

https://www.techpowerup.com/272489/intel-14-nm-node-compared-to-tsmcs-7-nm-node-using-scanning-electron-microscope

the Intel 14 nm chip features transistors with a gate width of 24 nm, while the AMD/TSMC 7 nm one has a gate width of 22 nm (gate height is also rather similar)


World of Tanks software BVH raytracing also supports AVX1/AVX2/AVX-512 via Intel Embree middleware
https://www.youtube.com/watch?v=-w7wUs30OXk

https://www.embree.org/
The kernels are optimized for the latest Intel® processors with support for SSE, AVX, AVX2, and AVX-512 instruction sets


World of Tanks software BVH raytracing (Intel Embree middleware) works well on Ryzen 9 3900X (with RTX 2080, hardware RT not used), Core i9 9900K(with RTX 2080 Ti, hardware RT not used), and Core i7-7820X (with GTX 1080 Ti).

Intel Embree adds extra software BVH raytracing resources on top of hardware BVH raytracing, hence more raytracing effects.


https://wccftech.com/amnesia-rebirth-intel-xe-hpg-pc-recommendations/
Under OpenGL, Intel Xe HPG GPU is about AMD RX 580 level.


IF the price is right, Power9 4C/16T+Raptor Blackbird bundle can add to my modern CPU architecture collection.

https://www.phoronix.com/scan.php?page=article&item=blackbird-power9-4c&num=3
In workloads that had been tested/tuned for POWER, the 4-core POWER9 processor was competitive with the Intel/AMD processors of similar core counts. Of course, the IBM POWER9 4-core at $375 USD is at a premium over the Intel/AMD processors of similar spec

I'm already aware of Ryzen 7 2700 8C/16T beats Power9 4C/16T, let alone Ryzen 9 3900X, but Power9 4C/16T +Raptor Blackbird bundle is not bad.

Quote:

No, this isn't possible: you cannot abstract/virtualize the differences between the e500v2 core and all other PowerPC-compliant cores.

The only way could be to analyze the executables at load time, before running them, to see if there are instructions overlapping in the opcode space, and to try to patch the executable in memory to avoid problems. Which, as you can imagine, is difficult to implement and kills performance.

Do it the x86 way, i.e. check the CPU ID or feature ID and run the appropriate code path.

From my POV, the e500v2 can act like a PowerPC Book E core when its custom 64-bit SIMD is not used.

The SAM440's PowerPC 440 has Book E, hence I'm guessing the A1222 acts like a SAM440/460, but with an out-of-order, dual-issue e500v2 CPU. From YouTube videos, the A1222 is already running AmigaOS 4.1 and its apps. It would be silly to break existing userland PowerPC apps.


Only the performance-critical sections need the optimized code paths.


On detection of a CPU ID or feature ID, the different CPU instruction sets can be abstracted with a virtual CPU using LLVM JIT compilation.
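The "check a feature ID, then pick a code path" idea can be sketched in a few lines of C. This is only an illustration: the feature-flag names and the `cpu_features()` stub are hypothetical (a real implementation would read the PVR or ask the OS), and `dot_spe` simply falls back to the generic path so the sketch stays runnable.

```c
#include <stdint.h>

/* Hypothetical feature bits -- illustrative names, not a real OS4 API. */
enum { CPU_HAS_FPU = 1u << 0, CPU_HAS_SPE = 1u << 1, CPU_HAS_ALTIVEC = 1u << 2 };

/* Stub: a real system would read the PVR/feature registers or ask the OS. */
static uint32_t cpu_features(void) { return CPU_HAS_FPU; }

/* Generic and (hypothetically) SPE-optimized versions of the same job. */
static int32_t dot_generic(const int32_t *a, const int32_t *b, int n) {
    int64_t acc = 0;
    for (int i = 0; i < n; i++) acc += (int64_t)a[i] * b[i];
    return (int32_t)(acc >> 16);
}
static int32_t dot_spe(const int32_t *a, const int32_t *b, int n) {
    /* Would use SPE vector intrinsics; falls back so the sketch runs anywhere. */
    return dot_generic(a, b, n);
}

/* Select the code path once, at load time, x86-CPUID style. */
typedef int32_t (*dot_fn)(const int32_t *, const int32_t *, int);
static dot_fn select_dot(void) {
    return (cpu_features() & CPU_HAS_SPE) ? dot_spe : dot_generic;
}
```

The selection runs once; the rest of the program calls through the pointer and never re-tests the CPU.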

From https://www.youtube.com/watch?v=XPdr7MaGvLo
Question: SPE compiled or int / softfloat / fixpoint codecs?

Answer: No, it is not a special SPE version, but ffmpeg uses mostly fixedpoint, so in the current AmigaOS 4.1 state, the tabor is about 60% faster as sam 460 in decoding.
(From EntwicklerX)
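The fixed-point path mentioned in that answer is why FFmpeg runs on an FPU-less integer core: every operation is plain integer arithmetic. A minimal Q15 sketch (the `Q15`/`q15_*` names are mine, not FFmpeg's):

```c
#include <stdint.h>

/* Q15 fixed point: value x is stored as round(x * 32768).
   All arithmetic is plain integer math, so it runs on a core
   without a classic FPU (e.g. the e500v2's integer pipeline). */
typedef int16_t q15_t;

#define Q15(x) ((q15_t)((x) * 32768.0 + ((x) >= 0 ? 0.5 : -0.5)))

static q15_t q15_mul(q15_t a, q15_t b) {
    /* 16x16 -> 32-bit product, then shift back down to Q15. */
    return (q15_t)(((int32_t)a * b) >> 15);
}

static q15_t q15_add_sat(q15_t a, q15_t b) {
    int32_t s = (int32_t)a + b;   /* widen to avoid overflow   */
    if (s > 32767)  s = 32767;    /* saturate, as codecs do    */
    if (s < -32768) s = -32768;
    return (q15_t)s;
}
```

The `Q15` conversion macro uses a float constant only at build time; at run time everything is 16/32-bit integer work.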

Last edited by Hammer on 21-Oct-2020 at 02:03 AM.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

cdimauro 
Re: Amiga SIMD unit
Posted on 21-Oct-2020 5:46:45
#162
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Hammer Quote:

Hammer wrote:
@cdimauro

FYI, in terms of gate size, Intel's 14+++ nm variant is closer to TSMC's 1st gen 7nm.

https://www.techpowerup.com/272489/intel-14-nm-node-compared-to-tsmcs-7-nm-node-using-scanning-electron-microscope

the Intel 14 nm chip features transistors with a gate width of 24 nm, while the AMD/TSMC 7 nm one has a gate width of 22 nm (gate height is also rather similar)

Unfortunately that's only one parameter when you compare transistors. The most important metric is overall density, and there TSMC's 7nm process is clearly much better than Intel's 14++ nm process (but inferior to Intel's 10nm).
Quote:
World of Tanks software BVH raytracing also supports AVX1/AVX2/AVX-512 via Intel Embree middleware
https://www.youtube.com/watch?v=-w7wUs30OXk

https://www.embree.org/
The kernels are optimized for the latest Intel® processors with support for SSE, AVX, AVX2, and AVX-512 instruction sets

World of Tanks software BVH raytracing (Intel Embree middleware) works well on Ryzen 9 3900X (with RTX 2080, hardware RT not used), Core i9 9900K(with RTX 2080 Ti, hardware RT not used), and Core i7-7820X (with GTX 1080 Ti).

Intel Embree adds extra software BVH raytracing resources on top of hardware BVH raytracing, hence more raytracing effects.

Indeed. And expected: Embree is under Intel's umbrella.
Quote:
https://wccftech.com/amnesia-rebirth-intel-xe-hpg-pc-recommendations/
Under OpenGL, Intel Xe HPG GPU is about AMD RX 580 level.

I wouldn't say that. It's not a comparison, but a recommendation. Intel's Xe GPUs come with the "tile" concept, so maybe the reviewer got one with a single tile (around 10 TFLOPS of performance claimed by Intel).

I expect Intel's new Xe GPUs to be very competitive with modern nVidia and AMD ones. You can already see it in some Tiger Lake reviews: it embeds an Xe GPU, which overall outperformed AMD's latest iGPUs.
Quote:
IF the price is right, Power9 4C/16T+Raptor Blackbird bundle can add to my modern CPU architecture collection.

https://www.phoronix.com/scan.php?page=article&item=blackbird-power9-4c&num=3
In workloads that had been tested/tuned for POWER, the 4-core POWER9 processor was competitive with the Intel/AMD processors of similar core counts...

Sorry, but I see a completely different scenario. Only on a couple of data compression benchmarks does it look very good:
https://www.phoronix.com/scan.php?page=article&item=blackbird-power9-4c&num=2

But on all other tests it's clearly left in the dust (sometimes by a HUGE margin) by a much cheaper and simpler Intel i3...
Quote:
Of course, the IBM POWER9 4-core at $375 USD is at a premium over the Intel/AMD processors of similar spec

I'm already aware of Ryzen 7 2700 8C/16T beats Power9 4C/16T, let alone Ryzen 9 3900X, but Power9 4C/16T +Raptor Blackbird bundle is not bad.

OK. If it's for your collection, then it makes sense.
Quote:
Do it the x86 way, i.e. check the CPU ID or feature ID and run the appropriate code path.

This isn't possible now: it's too late. I'll explain it below.
Quote:
From my POV, e500v2 can act like PowerPC Book E when custom 64-bit SIMD is not used.

Yes, in THIS case.
Quote:
SAM440's PowerPC 440 has Book E, hence I'm guessing A1222 is acting like a SAM440/460, but with an out-of-order dual instruction issue e500v2 CPU. From youtube videos, A1222 is already running AmigaOS 4.1 and it's apps. It would be silly to break existing userland PowerPC apps.

Only the performance-critical sections need the optimized code paths.

On detection of a CPU ID or feature ID, the different CPU instruction sets can be abstracted with a virtual CPU using LLVM JIT compilation.

This isn't possible now, because there are already several AmigaOS4 executables which were compiled without taking the e500v2 into account. So there's no "CPUID"-like check, nor proper code paths. And if you run those binaries on Tabor you can get weird problems (they can produce wrong results).

To "solve" this problem you need to recompile all existing binaries which can cause troubles, using the above mechanism, but I don't know if this can be made.
Quote:
From https://www.youtube.com/watch?v=XPdr7MaGvLo
Question: SPE compiled or int / softfloat / fixpoint codecs?

Answer: No, it is not a special SPE version, but ffmpeg uses mostly fixedpoint, so in the current AmigaOS 4.1 state, the tabor is about 60% faster as sam 460 in decoding.
(From EntwicklerX)

That's normal: the CPU is much better, albeit we're talking about very low-end systems.

Hammer 
Re: Amiga SIMD unit
Posted on 21-Oct-2020 15:45:35
#163
Elite Member
Joined: 9-Mar-2003
Posts: 5286
From: Australia

@cdimauro

Good luck to Intel catching up to RTX 3080-level GPUs; AMD has claimed that its "Big Navi" can roughly rival the RTX 3080 class.

Raw TFLOPS mean little when rasterization hardware scaling is so important.

Supporting DirectX 12 Feature Level 12_2 is important to remain feature-matched with the XSS/XSX game consoles.

For a thin-and-light 2-in-1, I'm waiting for laptops whose IGP supports DirectX 12 Feature Level 12_2, per the game console feature-set standard.

Intel's Tiger Lake IGP beating AMD's aging 7nm Vega 8 CU GCN is not hard, but the Xbox Series S already has a 20 CU RDNA 2 part (22 CU design).


Tiger Lake's Xe-LP graphics with 96 EUs at a clock frequency of 1.35 GHz yields about 2.07 TFLOPS FP32.


AMD still has
OPN 100-0000000285
Ryzen 7 5800U (Zen 3 Cezanne)
8 cores / 16 threads
2.0GHz base - 4.4GHz boost
8 CU @ 2.0GHz yields about 2 TFLOPS.
16MB L3 cache
10-25W cTDP


The Ryzen 7 4800U's IGP has up to a 1.75 GHz clock speed, which yields about 1.79 TFLOPS FP32. https://www.amd.com/en/products/apu/amd-ryzen-7-4800u
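The peak-FP32 figures quoted in this post all follow from units × lanes × 2 (an FMA counts as two ops) × clock. As a sketch, assuming the publicly stated widths (8 FP32 lanes per Xe-LP EU, 64 per GCN/Vega CU), a throwaway helper reproduces them:

```c
/* Peak FP32 throughput: units * lanes_per_unit * 2 (FMA = mul+add) * clock.
   Clock is given in GHz, so the product is in GFLOPS; divide for TFLOPS. */
static double peak_tflops(int units, int lanes_per_unit, double clock_ghz) {
    return units * lanes_per_unit * 2.0 * clock_ghz / 1000.0;
}
```

With the assumed widths, `peak_tflops(96, 8, 1.35)` gives about 2.07 (Tiger Lake), `peak_tflops(8, 64, 2.0)` about 2.05 (Cezanne 8 CU), and `peak_tflops(8, 64, 1.75)` about 1.79 (4800U) — matching the numbers above. These are theoretical peaks only, which is exactly the caveat about TFLOPS made earlier in the post.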

AMD's Picasso APU die is about 209.78 mm² on a 12 nm process.

AMD's Renoir APU die is about 150 mm² on a 7 nm process.
https://www.anandtech.com/show/15381/amd-ryzen-mobile-4000-measuring-renoirs-die-size.

The Xbox Series S APU is about 190 mm². https://twitter.com/_rogame/status/1303745295382126594?s=20

Expect future AMD APUs to be based on the Xbox Series S APU, with Zen 3 CPU cores and mainstream DDR4 or DDR5 memory.



Quote:

This isn't possible now, because there are already several AmigaOS4 executables which were compiled without taking into account the e500v2. So, there's no "CPUID"-like check neither proper code paths. And if you run those binaries on Tabor you can get weird problems (since it can produce wrong results).

Again, it would be silly to break existing userland AmigaOS 4.x apps. Are you claiming the e500v2 broke the PowerPC Book E standard?

Existing AmigaOS 4.1 PPC apps will be unaware of the e500v2's extra SPE features, and AltiVec is already treated as a separate code path or a different app version.

The FFmpeg example is a fixed-point build, and it seems to run fine on the e500v2 as a PowerPC Book E fixed-point-compatible CPU.

A SoftFPU could be created for the e500v2's SPE to emulate a standard PowerPC FPU.
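A SoftFPU of this kind would trap each unimplemented FPU instruction, decode it, and apply it to an emulated register file. Below is a sketch of just the decode-and-execute step for a single PowerPC A-form instruction (fadd); the trap-handler installation and the rest of the opcode map are omitted, and this is my illustration, not an existing AmigaOS component:

```c
#include <stdint.h>

/* Emulated FPU state a trap handler would maintain per task. */
static double fpr[32];

/* Decode and emulate one PowerPC A-form FP instruction (fadd only here).
   A real SoftFPU would be installed as an illegal-instruction handler on
   the e500v2 and cover the whole FP opcode map -- this shows the shape.
   Returns 0 on success, -1 if the word isn't an instruction we handle. */
static int emulate_fp(uint32_t insn) {
    uint32_t opcd = insn >> 26;         /* primary opcode, bits 0-5 */
    uint32_t frt  = (insn >> 21) & 31;  /* destination FPR          */
    uint32_t fra  = (insn >> 16) & 31;  /* source A                 */
    uint32_t frb  = (insn >> 11) & 31;  /* source B                 */
    uint32_t xo   = (insn >> 1)  & 31;  /* A-form extended opcode   */

    if (opcd == 63 && xo == 21) {       /* fadd frt,fra,frb */
        fpr[frt] = fpr[fra] + fpr[frb];
        return 0;
    }
    return -1;                          /* unhandled: real trap path */
}
```

Every emulated instruction pays the trap/decode/return cost, which is the performance concern raised elsewhere in this thread.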

I don't have an A1222 to verify support for existing userland AmigaOS 4.x apps.


Last edited by Hammer on 21-Oct-2020 at 03:58 PM.


NutsAboutAmiga 
Re: Amiga SIMD unit
Posted on 21-Oct-2020 16:25:48
#164
Elite Member
Joined: 9-Jun-2004
Posts: 12818
From: Norway

@Hammer

Anyway, you can't use AltiVec code on the A1222, nor on any older G3s or AMCC 4x0 CPUs, so the issue is not that big anyway (most code doesn't use it).

AltiVec code, I guess, is only really used in things like FFmpeg, MPlayer and a few other projects; the compiler we have can't optimize for it, so it has to be hand-written as inline assembler (or with the AltiVec macros).

If some programs act badly, it's maybe because they are old and not updated: Hyperion broke AmigaOS 4 in 4.1 Final, Picasso96 is more or less useless now, and other things changed in the SDK as well, forcing developers to update their code.

Last edited by NutsAboutAmiga on 21-Oct-2020 at 04:29 PM.

_________________
http://lifeofliveforit.blogspot.no/
Facebook::LiveForIt Software for AmigaOS

cdimauro 
Re: Amiga SIMD unit
Posted on 21-Oct-2020 22:10:41
#165
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Hammer Quote:

Hammer wrote:
@cdimauro

Good luck to Intel catching up to RTX 3080 level GPU and AMD has claimed that their "BiG NAVI" can about rival RTX 3080 class GPU.

Well, I haven't worked for Intel for almost 4 years, so it wouldn't be a drama for me.

Anyway, Intel doesn't need luck, but good products, and this time it seems to be in a much better position compared to the past (Larrabee).
Quote:
TFLOPS means little when rasterization hardware scaling is very important.

True. That's why AMD's GPUs usually have more TFLOPS than nVidia's, but lower performance. TFLOPS are just numbers: very important, but it also matters how they can be used and sustained.
Quote:
Intel Tigerlake IGP beating AMD's aging 7nm Vega 8CU GCN is not hard but Xbox Series S already has RDNA 2 20 CU (22 CU design).

Tigerlake and consoles aren't direct competitors.
Quote:
Tiger Lake Xe-LP graphics with 96 EUs with a clock frequency of 1.35 GHz yields about 2.07 TFLOPs FP32.

Looks very efficient. See below.
Quote:
AMD still has
OPN 100-0000000285
Ryzen 7 5800U (Zen 3 Cezanne)
8 cores / 16 threads
2.0GHz base - 4.4GHz boost
8 CU @ 2.0GHz yields about 2 TFLOPS.
16MB L3 cache
10-25W cTDP

Ryzen 7 4800U IGP has up to 1.75 GHz clock speed which yields about 1.79 TFLOPS FP32. https://www.amd.com/en/products/apu/amd-ryzen-7-4800u

So, basically 1,000 FP32 operations per clock cycle (2 TFLOPS at 2.0 GHz): less than Tiger Lake's IGP, which reaches about 1,530 per clock (2.07 TFLOPS at 1.35 GHz).
Quote:
AMD Picasso APU's die size is about 209.78 mm2 with 12 nm process.

AMD Renoir APU's die size is about 150 mm2 with 7 nm process.
https://www.anandtech.com/show/15381/amd-ryzen-mobile-4000-measuring-renoirs-die-size.

Xbox Series S APU has 190 mm^2. https://twitter.com/_rogame/status/1303745295382126594?s=20

Do you have some data for Tigerlake?
Quote:
Expect the future AMD APU to be based on Xbox Series S APU with Zen 3 CPU cores and mainstream DDR4 or DDR5 memory.

Indeed. Something similar will happen with the Tigerlake successor.
Quote:
Again, it would be silly to break existing userland AmigaOS 4.x apps. Are you claiming the e500v2 broke the PowerPC Book E standard?

I don't remember now. What I can tell you is that the e500v2 core reuses some instruction encodings from the regular PowerPC ISA, AND some others from AltiVec. So it's clearly NOT compatible with binaries that use any of those.
Quote:
Existing AmigaOS 4.1 PPC apps will be unaware of e500v2's extra SPU features and Altivec is already treated as a separate code path or a different app version.

FFmpeg example is a fixpoint version and seems to run fine on e500v2 CPU as PowerPC Book E fixpoint compatible CPU.

It doesn't change what I said before: existing applications might have problems if those overlapping instructions are used.
Quote:
SoftFPU can be created for e500v2's SPU to fake a standard PowerPC FPU.

Yes, this is possible, but it'll be very slow, due to the trap-emulate-return overhead.
Quote:
I don't have A1222 to verify support for existing userland AmigaOS4.x apps.

Me neither.

@NutsAboutAmiga Quote:

NutsAboutAmiga wrote:
@Hammer

Anyway, you can't use AltiVec code on the A1222, nor on any older G3s or AMCC 4x0 CPUs, so the issue is not that big anyway (most code doesn't use it).

AltiVec code, I guess, is only really used in things like FFmpeg, MPlayer and a few other projects; the compiler we have can't optimize for it, so it has to be hand-written as inline assembler (or with the AltiVec macros).

If some programs act badly, it's maybe because they are old and not updated: Hyperion broke AmigaOS 4 in 4.1 Final, Picasso96 is more or less useless now, and other things changed in the SDK as well, forcing developers to update their code.

Yes, but we were talking about something different here. Even a binary that doesn't use AltiVec can have problems on the A1222 if it uses the instructions which were reused by its e500v2 core.

With AltiVec you can be quite safe, because usually you have proper, separate binaries, so you know they shouldn't be used on a system without this SIMD extension.

But it would be interesting to know if such AltiVec binaries have a check in their start-up code, to display a message/dialog and quit if AltiVec isn't detected...
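Such a start-up check could be just a guard run before any vector code executes. A minimal sketch; the `have_altivec()` stub is hypothetical (a real build would query the OS, or probe an AltiVec instruction under an exception handler):

```c
#include <stdio.h>

/* Stub standing in for a real capability query -- name and mechanism
   are assumptions for this sketch, not a real AmigaOS call. */
static int have_altivec(void) { return 0; }

/* Run before any vector code. Returns 0 if it is safe to continue,
   or a nonzero failure code after telling the user what went wrong. */
static int require_altivec(void) {
    if (!have_altivec()) {
        fprintf(stderr, "This build needs AltiVec; use the generic version.\n");
        return 20;   /* RETURN_FAIL-style exit code */
    }
    return 0;
}
```

The binary's entry code would call this and exit on a nonzero result, which is exactly the message-and-quit behavior wished for above.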

cdimauro 
Re: Amiga SIMD unit
Posted on 21-Oct-2020 22:50:43
#166
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Hammer Quote:

Hammer wrote:

AMD's Jaguar CPU is a dual-issue-per-cycle design with 128-bit AVX** SIMD, and the Xbox One X has 8 Jaguar CPU cores with a 40 CU GCN GPU.

**Subset of 256-bit AVX.

Jaguar doesn't have a subset of AVX: AVX is always a 256-bit-wide SIMD ISA. However, Jaguar has an internal 128-bit implementation.
Quote:
68080's quad instruction issue per cycle design needs to scale towards four CPU cores, gain 128-bit SIMD units and >1.5Ghz clock speed.

The 68080 needs to move to an out-of-order design to make better use of those (at most) 4 instructions executed per cycle, which is very unlikely to be sustained in an in-order design due to instruction dependencies.

2 or more cores are not useful at all in an Amiga-like system, because only one CPU can be used (unless you use the other cores as "coprocessors", offloading some tasks, like with the Blitter).

>1.5Ghz clocks cannot be achieved without moving to ASICs, which is unlikely (too high costs for a nano-niche market).


@matthey Quote:

matthey wrote:

Most of the x86_64 gain came from doubling the number of GP integer and SIMD registers which will vary by hardware but should be around 5% on average.

It should be around 15% on average, according to AMD when x86-64 was introduced, which is mostly reflected by the benchmarks.
Quote:
There are other ISA changes which could make a larger difference for specific algorithms. Some algorithms with low data memory traffic and not needing many registers are faster with the more compact 32 bit code and these benefit from faster load times as well.

This is shown as well with x86-32 (not to be confused with x86, which Intel calls IA-32), which is the "32-bit castrated" version of x86-64 using 32-bit pointers and 32-bit longs by default.


@Fl@sh Quote:

Fl@sh wrote:
@all

About PPC AltiVec (G4/G5) vs Intel SSE1/SSE2: on paper, both have the same potential.
Maybe AltiVec is still simpler and more similar to the AVX/AVX2 ISA than to SSE1/SSE2.

Indeed.
Quote:
E.g. this is a link covering the whole AltiVec instruction set, and yes, we also have FMADD, even with the FLOAT datatype, between vectors: http://mirror.informatimago.com/next/developer.apple.com/hardware/ve/instruction_crossref.html#compare

But it's missing the much more useful FMA instructions.
Quote:
For AltiVec we also have much more human-readable instructions

Questionable. Some are more readable because they have long, descriptive mnemonics, but others are not.
Quote:
and up to three operands per instruction.

The same for AVX+.

NutsAboutAmiga 
Re: Amiga SIMD unit
Posted on 21-Oct-2020 23:41:46
#167
Elite Member
Joined: 9-Jun-2004
Posts: 12818
From: Norway

@cdimauro

Quote:
But it'll be interesting to know if such Altivec binaries have some check-up on the start-up code, to display some message/dialog and quit the application if Altivec isn't detected...


That's up to the developer who wrote the program. I believe the linker refuses if you try to link two different PowerPC ISAs, but you can put the code in a library and let the program pick the library that fits best.

It can be easier to just compile two different versions of the program, like it was done on the 680x0, where you find demos for 030, 040 and 060 built with a few compiler switches. The problem is testing that the code works on the different CPUs; naturally, you can't expect developers to have a collection of computers with different configurations.


cdimauro 
Re: Amiga SIMD unit
Posted on 22-Oct-2020 6:16:08
#168
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@NutsAboutAmiga Quote:

NutsAboutAmiga wrote:
@cdimauro
Quote:
But it'll be interesting to know if such Altivec binaries have some check-up on the start-up code, to display some message/dialog and quit the application if Altivec isn't detected...

That's up to the developer who wrote the program. I believe the linker refuses if you try to link two different PowerPC ISAs, but you can put the code in a library and let the program pick the library that fits best.

Understood, but supporting Altivec doesn't require a new ISA: it's just an extension.

Then I hope that developers added this check to avoid issues for users.
Quote:
It can be easier to just compile two different versions of the program, like it was done on the 680x0, where you find demos for 030, 040 and 060 built with a few compiler switches.

Indeed. On x86/x64 there are usually fatter binaries, with the optimizations made inside, selecting different code paths depending on the specific micro-architecture / ISA extension(s).

Both have their pros and cons. I, as a user, prefer a single binary. I, as a coder, prefer a well optimized binary for each "hardware variant".
Quote:
The problem is testing that the code works, on different CPU’s naturally, can’t expect developers to have collection of computers, with different configurations.

(Win/E/FS)UAE is there for this (and even for use by end users).

@matthey Quote:

matthey wrote:
Register renaming helps performance more with fewer registers. x86 was still at a disadvantage to 32 register RISC in memory traffic but it didn't make as much of a difference in performance as expected. Only 8 XMM FPU/SIMD registers actually made a bigger difference to FP performance. "Performance Characterization of the 64-bit x86 Architecture from Compiler Optimizations’ Perspective" compared the x86_64 with 16 GP integer and 16 SIMD registers to x86 with 8 GP integer and 16 SIMD registers using benchmarks and found the following.

CINT2000 42% more memory traffic, 4.4% performance loss
CFP2000 78% more memory traffic, 5.6% performance loss

The data memory traffic performance disadvantage practically disappeared with x86_64 but so did much of the instruction memory traffic performance advantage with larger code size.

Interesting. After reading the paper, I found this:

"However, our results also show that using 12 GPRs and 12 XMM registers can achieve as competitive performance as employing 16 GPRs and 16 XMM registers."

I'm considering moving from 64 SIMD registers to 32.

In the end it was just an exercise to show that "I can do much more" (than x64/Xeon Phi), but I should also be practical and recognize that I already have a CISC design which allows me to save a lot of space and instructions, using the powerful memory (and quick/direct-immediate) addressing modes that I can apply to any instruction.

This way I'll gain 2-3 bits in the opcodes, which allows me to have many more short opcodes that I can allocate to other common operations. This should give a good boost in code density for the SIMD code (which is currently much better than AVX-512's, but still suffers a little compared to AVX/AVX2 and especially SSE/2/3/4, particularly in 32-bit mode), because now I can pack all SIMD instructions into the shorter versions (currently I have a few that I need to map to the longer opcodes, losing a little compared to AVX-512).

It'll be a pain to rework this part, because I've finally tuned both the ISA and the Python script I implemented to generate the stats. A lot of work, again (7th version of my ISA; hopefully the final one).

I'll keep the 32 GP registers, because I already have very good code density, and more GP registers can greatly help some code (emulators, compilers, virtual machines).

Fl@sh 
Re: Amiga SIMD unit
Posted on 22-Oct-2020 8:00:54
#169
Regular Member
Joined: 6-Oct-2004
Posts: 253
From: Napoli - Italy

@NutsAboutAmiga

Quote:
That’s up the developer who wrote the program, I believe the linker refuses if you try link two different PowerPC ISA’s, but you can put the code in a library, and let program pick the library that’s best.

It can be easier to just compile two different versions of the program, like it was done on 680x0, where find demos for 030, 040 and 060, a few compiler switches. The problem is testing that the code works, on different CPU’s naturally, can’t expect developers to have collection of computers, with different configurations.


You can mix SIMD and non-SIMD code in the same binary.
I did it a long time ago with FPU and non-FPU functions: in C, simply use a function-to-function call to pick the right one.
For SIMD it works the same way; of course, it's simpler to generate two different binaries.

Anyway, it's not related to the link phase.
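The function-to-function call described here is essentially a function pointer selected once at startup. A minimal sketch (the names and the `have_fpu` flag are illustrative, not from a real project):

```c
/* Two implementations of the same operation -- one using hardware
   floating point, one integer-only -- selected once at startup.
   This is the pattern for mixing FPU and non-FPU code in one binary. */
static long scale_fpu(long x)  { return (long)(x * 1.5); }
static long scale_soft(long x) { return x + (x >> 1); }  /* x * 1.5 in integers */

static long (*scale)(long);

/* Call once at program start; the flag stands in for a real FPU check. */
static void init_dispatch(int have_fpu) {
    scale = have_fpu ? scale_fpu : scale_soft;
}
```

After `init_dispatch()`, all callers just invoke `scale(...)` and never care which variant is behind the pointer — no linker involvement, as noted above.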

_________________
Pegasos II G4@1GHz 2GB Radeon 9250 256MB
AmigaOS4.1 fe - MorphOS - Debian 9 Jessie

matthey 
Re: Amiga SIMD unit
Posted on 22-Oct-2020 9:29:43
#170
Elite Member
Joined: 14-Mar-2007
Posts: 2010
From: Kansas

@cdimauro
I think you figured it out but the quote from me should have read the following.

Quote:

"Performance Characterization of the 64-bit x86 Architecture from Compiler Optimizations’ Perspective" compared the x86_64 with 16 GP integer and 16 SIMD registers to x86 with 8 GP integer and 8 SIMD registers using benchmarks and found the following.

CINT2000 42% more memory traffic, 4.4% performance loss
CFP2000 78% more memory traffic, 5.6% performance loss


It is amazing that so much DCache memory traffic from only 8 registers did not cause more of a performance decline. I wonder if x86 cores have been using a cached load/store queue with many entries to bypass L1 DCache loads of recent stores. This could be especially effective for the stack.

move.l d0,-(sp) ; var 2 to store buffer
move.l d1,-(sp) ; var 1 to store buffer
bsr function ; return value to store buffer
...

function:
move.l (4,sp),d0 ; load checks store buffer and finds var 1 value bypassing L1 DCache
add.l (8,sp),d0 ; load checks store buffer and finds var 2 value bypassing L1 DCache
rts ; load checks store buffer and finds return value bypassing L1 DCache or use return/link stack

The store buffer becomes a small load/store cache with very quick access, kind of like an L0 DCache (entries are retained after storing until needed for new stores). It saves energy and avoids using up the limited number of DCache accesses per cycle. Newer x86-64 cores often allow more DCache accesses per cycle, so I would expect the performance loss from fewer registers to decline further on newer hardware.
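The "L0 DCache" idea above can be modeled as a small store queue searched newest-first before the L1 DCache. A toy model (word-granular and purely illustrative; real store-to-load forwarding also handles partial overlaps and retirement):

```c
#include <stdint.h>

/* Toy model of a store buffer acting as an L0 DCache: entries linger
   after "retiring" so recent stack stores can satisfy later loads
   without touching the cache. */
#define SB_ENTRIES 8

struct store_buffer {
    uint32_t addr[SB_ENTRIES];
    uint32_t data[SB_ENTRIES];
    int      valid[SB_ENTRIES];
    int      head;               /* next slot to (re)use, round-robin */
};

static void sb_store(struct store_buffer *sb, uint32_t addr, uint32_t data) {
    sb->addr[sb->head]  = addr;
    sb->data[sb->head]  = data;
    sb->valid[sb->head] = 1;
    sb->head = (sb->head + 1) % SB_ENTRIES;
}

/* Returns 1 on a store-buffer hit (no DCache access would be needed). */
static int sb_load(const struct store_buffer *sb, uint32_t addr, uint32_t *out) {
    for (int i = 0; i < SB_ENTRIES; i++) {
        /* Search newest entry first so a reused address forwards
           the most recent value. */
        int slot = (sb->head - 1 - i + SB_ENTRIES) % SB_ENTRIES;
        if (sb->valid[slot] && sb->addr[slot] == addr) {
            *out = sb->data[slot];
            return 1;
        }
    }
    return 0;   /* miss: would fall through to the L1 DCache */
}
```

In the 68k sequence above, the two `move.l` pushes would populate the buffer and both loads in `function` would hit it, never touching the L1 DCache.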

The desirable number of registers depends not only on the architecture but also on which market is being targeted. More registers scale the target market up. There are many more processors sold into lower-end markets, but the margins are not as high. Embedded markets can increase chip volumes, lowering costs, and a good design can have a long production life there (amazingly, over 40 years for the 68000). Simpler cores also cost less to design. ISA specs seem to grow in an attempt to outdo competitors, but sometimes they scale right out of markets. PPC is a good example, with the Tabor design trying to scale down enough to compete with ARM. Early Atom processors tried to scale down with early 32-bit x86 standards to compete with ARM, but Intel gave up. Even the ColdFire project explored scaling down by reducing registers.

Quote:

It should also be noted that careful consideration and study was given to the possibility of supporting a reduced user programming model for the V1 core. In particular, a proposal to reduce the number of general-purpose registers (Rn) from 16 to 8 was given serious consideration. In this proposal, the number of address (An) and data (Dn) registers was halved, so that the V1 core would only support the {D0, D1, D6, D7} and {A0, A1, A6, A7} registers. This proposal was driven by the fact that the register file is the largest single structure in the core and a sizable reduction in this function could have an interesting impact on the overall core size. However, in the final analysis it was decided that code compatibility across the entire ColdFire family and the ability to reuse the existing development tools (compilers, debuggers, etc.) was more important than the gate savings achievable through this program-visible redefinition of the register set.


It makes sense to stay smaller and slimmer for the 68k, which was so strong in the embedded market. Nearly 5 million Amiga computers were sold, but 68k sales were over 50 million chips per year as late as 1996, when it was no longer used in desktops (nearly 13 times as many 32-bit chips as ARM, and about 107 times as many as PPC). ColdFire missed the mark but still likely sold 50-100 million units. I would guess somewhere around 250-500 million 68k and ColdFire chips were sold, worth billions of U.S. dollars. The 68k was a huge success and people loved it. It is in this realm that it would need to reemerge, and I would prefer not to up-scale it out of the markets where it has had the most success.

Last edited by matthey on 22-Oct-2020 at 03:36 PM.

cdimauro 
Re: Amiga SIMD unit
Posted on 22-Oct-2020 22:51:35
#171
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:

matthey wrote:
@cdimauro
I think you figured it out but the quote from me should have read the following.
Quote:
"Performance Characterization of the 64-bit x86 Architecture from Compiler Optimizations’ Perspective" compared the x86_64 with 16 GP integer and 16 SIMD registers to x86 with 8 GP integer and 8 SIMD registers using benchmarks and found the following.

CINT2000 42% more memory traffic, 4.4% performance loss
CFP2000 78% more memory traffic, 5.6% performance loss

It is amazing that so much DCache memory traffic from only 8 registers did not cause more of a performance decline.

I've read it, but I don't know how much sense the comparison between x86 and x64 makes for this particular aspect, because they have completely different ABIs and, for the same reason, the code doesn't look the same either.
x86 is substantially stack-based, whereas x64 is register-based. So, x86 executes A LOT of PUSHes and POPs, whereas x64 does so far less and uses MOVs instead.
I think that you've already read my article which reports such statistics.

And I assume that we all agree that what we do NOT want from an ISA is for it to be stack-based.
Quote:
I wonder if x86 Cores have been using a cached load/store queue with many entries to bypass L1 DCache loads of recent saves. This could be especially effective for the stack.

move.l d0,-(sp) ; var 2 to store buffer
move.l d1,-(sp) ; var 1 to store buffer
bsr function ; return value to store buffer
...

function:
move.l (4,sp),d0 ; load checks store buffer and finds var 1 value bypassing L1 DCache
add.l (8,sp),d0 ; load checks store buffer and finds var 2 value bypassing L1 DCache
rts ; load checks store buffer and finds return value bypassing L1 DCache or use return/link stack

The store buffer becomes a small load/store cache with very quick access kind of like a L0 DCache (entries are retained after storing until needed for new stores). It saves energy and avoids using the limited number of DCache accesses per cycle as well.

I think so. And maybe data in the store buffer(s) is written to stack memory only after certain conditions are met, greatly reducing the writes to the DCache.

This is pretty logical, looking at how x86 works. Maybe the load/store buffers are only used when the SP register is referenced.
Quote:
Newer x86-64 cores often allow more DCache accesses per cycle so I would expect the performance loss from fewer registers would decline further on newer hardware.

I agree here too.
Quote:
The desirable number of registers not only depends on the architecture but also what market is being targeted. More registers up scales the target market. There are many more processors sold into lower end markets but margins are not as high. Embedded markets can increase chip volumes lowering costs and can have a long production life for a good design (amazingly over 40 years for the 68000). Simpler cores also cost less to design. ISA specs seem to increase in an attempt to outdo competitors but sometimes they up scale right out of markets. PPC is a good example with the Tabor design trying to down scale enough to compete with ARM. Early Atom processors tried to downscale enough with early 32 bit x86 standards to compete with ARM but they gave up. Even the ColdFire project explored down scaling by reducing registers.
Quote:
It should also be noted that careful consideration and study was given to the possibility of supporting a reduced user programming model for the V1 core. In particular, a proposal to reduce the number of general-purpose registers (Rn) from 16 to 8 was given serious consideration. In this proposal, the number of address (An) and data (Dn) registers was halved, so that the V1 core would only support the {D0, D1, D6, D7} and {A0, A1, A6, A7} registers. This proposal was driven by the fact that the register file is the largest single structure in the core and a sizable reduction in this function could have an interesting impact on the overall core size. However, in the final analysis it was decided that code compatibility across the entire ColdFire family and the ability to reuse the existing development tools (compilers, debuggers, etc.) was more important than the gate savings achievable through this program-visible redefinition of the register set.

It makes sense to stay smaller and slimmer for the 68k, which was so strong in the embedded market. Nearly 5 million Amiga computers were sold, but 68k sales were over 50 million chips per year as late as 1996, when it was no longer being used in desktops (nearly 13 times as many 32-bit chips as ARM and about 107 times as many as PPC). ColdFire missed the mark but still likely sold 50-100 million units. I would guess somewhere around 250-500 million 68k and ColdFire chips were sold, worth billions of U.S. dollars. The 68k was a huge success and people loved it.

I understand and I agree on most of the things (especially on the stupid decisions of Motorola's management about the 68K family), but I think that there's an important point which you're not considering when talking about small cores (which means small area -> reduced costs): the process used to fabricate the chip.

What you said made absolute sense 20 or more years ago, but nowadays, with the production processes used in recent years, you can pack millions of transistors even into a few mm^2 of area. And chip area is mostly dominated by the caches (L1, L2, and even L3) and the "uncore".

In short: I don't think that the register file takes a lot of space compared to all the rest.

So, does it really make sense to try to limit the register file as much as possible? My feeling is clearly no, for the reasons I've given. But I'd like to see if other people have a different opinion, and it would be nice to have some concrete data about it (e.g.: register file area vs cache area vs uncore area vs total chip area, for the latest chips of course, i.e. using modern production processes), which would greatly help here.
Quote:
It is in this realm where it would need to reemerge and I would prefer to not up scale it out of the markets where it has had the most success.

Yes, I think that the embedded market is the only one which can/should be targeted for a 68K revival. Desktop and servers are out of the question because they are dominated by other architectures, and attacking those markets might be possible only if the (new) 68K first captures a sizable market segment somewhere else.

Another question for you: do you still want to keep as much as possible of the 68K legacy (e.g.: opcode structure & instructions, addressing modes)?
For the 32-bit code / execution mode it might make sense, because you can reuse the existing tools. But for the 64-bit code you are forced to make several non-compatible changes anyway.
64-bit is the future even in the embedded market, looking at the trend. And ARM has just announced that future ISA versions will be 64-bit only, to cite one important piece of news.

 Status: Offline
Profile     Report this post  
Hammer 
Re: Amiga SIMD unit
Posted on 23-Oct-2020 3:15:21
#172 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5286
From: Australia

@cdimauro

Quote:

Jaguar hasn't a subset of AVX. AVX is always a 256-bit wide SIMD ISA. However Jaguar has an internal 128-bit implementation.

I was referring to the actual hardware implementation, since Jaguar's 128-bit SIMD hardware implements AVX-128 natively, while AVX-256 is supported for compatibility.

The AVX instructions support both 128-bit and 256-bit SIMD.

The 128-bit AVX instructions can be useful to improve old code without needing to widen the vectorization, and they avoid the penalty of transitioning from SSE to AVX; they are also faster on some early AMD implementations of AVX. This mode is sometimes known as AVX-128.

https://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/i386-and-x86-64-Options.html#i386-and-x86-64-Options
-mprefer-avx128
This option instructs GCC to use 128-bit AVX instructions instead of 256-bit AVX instructions in the auto-vectorizer.

Jaguar's AVX-256 has a latency penalty when compared to AVX-128.

The instruction set is only part of the solution, since the micro-architecture implementation can influence the overall result.




_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

 Status: Offline
Profile     Report this post  
matthey 
Re: Amiga SIMD unit
Posted on 23-Oct-2020 3:18:40
#173 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2010
From: Kansas

Quote:

cdimauro wrote:
I've read it, but I don't know how much sense the comparison between x86 and x64 makes for this particular aspect, because they have completely different ABIs and, for the same reason, the code doesn't look the same either.
x86 is substantially stack-based, whereas x64 is register-based. So, x86 executes A LOT of PUSHes and POPs, whereas x64 does so far less and uses MOVs instead.
I think that you've already read my article which reports such statistics.


The comparison in the "Performance Characterization of the 64-bit x86 Architecture from Compiler Optimizations’ Perspective" paper actually uses x86-64 mode for the comparisons of REG_16, REG_12 and REG_8 GP registers. It was possible to reduce the number of available GP registers with the compiler. This is a better way to isolate the memory traffic and performance difference than to compare x86-64 and x86 with ISA differences. I expect this gives a pretty good indication of the advantage x86-64 gained over x86 by moving from 8 to 16 GP registers though.

REG_16 to REG_8 (16 to 8 GP registers)
CINT2000 42% more memory traffic, 4.4% performance loss
CFP2000 78% more memory traffic, 5.6% performance loss

REG_16 to REG_12 (16 to 12 GP registers)
CINT2000 14% more memory traffic, 0.9% performance loss
CFP2000 29% more memory traffic, 0.3% performance loss

I would expect that moving to more than 16 GP registers would provide less of a performance gain than the performance loss from 16 GP registers to 12 GP registers which should be less than 1% performance difference for these older benchmarks at least. We also agree that newer processors with more DCache accesses per cycle should be able to reduce the performance loss of DCache memory traffic more.
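As a toy illustration of why fewer registers inflates memory traffic (not a reproduction of the paper's SPEC methodology; the trace below is synthetic), one can map a stream of live values onto k physical registers with LRU eviction and count the spill/reload accesses:

```python
# Toy register-pressure model: map a reference trace of "virtual
# registers" onto k physical registers with LRU eviction, counting
# one store (spill) plus one load (reload) per miss once the file
# is full. Synthetic trace, for illustration only.
from collections import OrderedDict
import random

def memory_traffic(trace, num_regs):
    regs = OrderedDict()  # physical register file, LRU ordered
    traffic = 0
    for v in trace:
        if v in regs:
            regs.move_to_end(v)            # register hit: free
        else:
            if len(regs) == num_regs:
                regs.popitem(last=False)   # spill the LRU value
                traffic += 1               # store to stack
            regs[v] = None
            traffic += 1                   # reload from stack
    return traffic

random.seed(42)
# 10,000 references over 24 live values, skewed toward a hot dozen
trace = [random.randint(0, 23) if random.random() < 0.3
         else random.randint(0, 11) for _ in range(10000)]

t16 = memory_traffic(trace, 16)
t8 = memory_traffic(trace, 8)
print(t8 > t16)  # True: halving the register file raises traffic
```

Because LRU has the stack inclusion property, the 8-register run can never miss less than the 16-register run; the model only shows the direction of the effect, not the paper's percentages.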

Quote:

And I assume that we all agree that what we do NOT want from an ISA is that it should stack-based.


Elevated memory traffic is undesirable whether it is excessive DCache stack use or increased ICache use from poor code density. It is better to have more registers, but they reach the point of diminishing returns and start making code larger, which reduces ICache performance. CISC with 16 GP registers seems to be a good balance. It appears that the 68k uses less memory traffic than x86-64 even with a stack arg passing ABI, as function inlining reduces the overall cost of function stack args. The same "Performance Characterization of the 64-bit x86 Architecture from Compiler Optimizations’ Perspective" paper as above also found that for x86-64 with stack args instead of reg args, "On average, the CINT2000 is slowed down by 0.86%, and almost no noticeable slowdown for the CFP2000." It is good to encourage code reuse for ICache efficiency though. The 68k Amiga used function reg args for libraries but it was also forced to reuse code more because of the limited memory and small ICaches. A small footprint has advantages for embedded as well.

Quote:

I think so. And maybe data in the store buffer(s) is written on the stack memory only after that certain conditions are met, greatly reducing the writes to the DCache.

This is pretty logical, looking at how x86 works. Maybe the load/store buffers are only used when the SP register is referenced.


Writes held in the store buffer eventually need to reach memory, although another write to the same address can overwrite a value in the store buffer which has not been written back yet. Write combining of writes to sequential addresses from the store buffer can also decrease the total number of writes. This would be especially helpful for the stack (practically a stack cache). Little tricks like this likely helped overcome the dreadful memory traffic of the x86. Many cheap processors instead cut the store buffer down to an unintelligent one to save gates.
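A minimal Python model of such a store buffer (store-to-load forwarding, overwrite of pending stores, and write combining on drain), purely as a sketch of the idea rather than any particular core's design:

```python
# Toy store buffer acting as a tiny "L0" cache: stores land in the
# buffer, later loads to the same address are forwarded from it, a
# second store to the same address overwrites the pending entry, and
# adjacent pending stores are combined into one (simulated) DCache
# line write on drain.

class StoreBuffer:
    def __init__(self):
        self.entries = {}        # address -> value, pending stores
        self.dcache_writes = 0
        self.forwarded_loads = 0

    def store(self, addr, value):
        # Overwriting a pending store costs no extra DCache traffic.
        self.entries[addr] = value

    def load(self, addr, dcache):
        if addr in self.entries:           # store-to-load forwarding
            self.forwarded_loads += 1
            return self.entries[addr]
        return dcache.get(addr, 0)

    def drain(self, dcache, line_words=4):
        # Write combining: pending stores in the same aligned line
        # retire as a single DCache write.
        lines = {addr // line_words for addr in self.entries}
        self.dcache_writes += len(lines)
        dcache.update(self.entries)
        self.entries.clear()

dcache = {}
sb = StoreBuffer()
sb.store(100, 5)   # var 2, like move.l d0,-(sp)
sb.store(101, 7)   # var 1, like move.l d1,-(sp)
assert sb.load(101, dcache) == 7   # forwarded, DCache untouched
assert sb.load(100, dcache) == 5
sb.store(100, 9)   # overwrite the pending store before write-back
sb.drain(dcache)
print(sb.forwarded_loads, sb.dcache_writes)  # 2 1
```

In the sketch, the two pushed "args" are read back without touching the DCache, the overwritten store costs nothing extra, and the two pending stores retire as a single combined line write.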

Quote:

I understand and I agree on most of the things (especially on the stupid decisions of Motorola's management about the 68K family), but I think that there's an important point which you're not considering when talking about small cores (which means small area -> reduced costs): the process used to fabricate the chip.

What you said made absolute sense 20 or more years ago, but nowadays, with the production processes used in recent years, you can pack millions of transistors even into a few mm^2 of area. And chip area is mostly dominated by the caches (L1, L2, and even L3) and the "uncore".

In short: I don't think that the register file takes a lot of space compared to all the rest.

So, does it really make sense to try to limit the register file as much as possible? My feeling is clearly no, for the reasons I've given. But I'd like to see if other people have a different opinion, and it would be nice to have some concrete data about it (e.g.: register file area vs cache area vs uncore area vs total chip area, for the latest chips of course, i.e. using modern production processes), which would greatly help here.


A larger register file certainly has a higher percentage cost for a small core than a large high performance core kind of like the decoder cost which became insignificant on powerful x86-64 cores. The decoder and register file cost of x86-64 started to matter again with the slimmed down Atom and Larrabee cores (percentage cost went up per core and it is multiplied times the number of cores). A couple percent performance gain from increasing the register file size may have been worth it for a high performance core but now the slimmed down cores may not be as competitive in area and power. Processor design is a trade off for PPA (Power, Performance and Area).

A larger chip area costs more and newer fab processes cost more. Yes, newer processes do allow for better transistor density and give more chips per wafer but this is partially offset by higher production costs and lower (successful) yields. Current leakage is reducing the advantage of these newer processes as well. There is a sweet spot on the curve which gives the best cost per transistor with an older process which many cost sensitive embedded customers use. The cost to design and produce high performance processors using modern fabs is only done by the largest companies while many companies can afford to produce simpler and smaller ASICs on an older process. The higher performance your ISA is optimized for, the fewer companies there are who can afford to design and produce your processors and the fewer customers there are who can afford to use it. Maybe you like to go to the high stakes gambling table first though.

Quote:

Another question for you: do you still want to keep as much as possible of the 68K legacy (e.g.: opcode structure & instructions, addressing modes)?
For the 32-bit code / execution mode it might make sense, because you can reuse the existing tools. But for the 64-bit code you are forced to make several non-compatible changes anyway.
64-bit is the future even in the embedded market, looking at the trend. And ARM has just announced that future ISA versions will be 64-bit only, to cite one important piece of news.


The original 68000 was a 16/32 bit CPU. A new 68k should be a 32/64 bit CPU. The lack of a path or roadmap to 64 bit support for ColdFire likely contributed to its demise. I believe ARM is making a mistake if they plan to discontinue all 32 bit Thumb2 cores, as AArch64 doesn't have good enough code density for the low end embedded market. There isn't enough competition among 64 bit architectures with good code density, which is what the embedded market needs.

 Status: Offline
Profile     Report this post  
Hammer 
Re: Amiga SIMD unit
Posted on 23-Oct-2020 4:00:11
#174 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5286
From: Australia

@cdimauro

Quote:

True. That's why AMD's GPUs usually have more TFLOPS compared to nVidia's, but lower performance. TFLOPS are just numbers: very important, but it also has to be seen how they can be used and reached.

AMD's GCN TFLOPS deliver less effective performance when compared to NVIDIA's Pascal and Turing.

An RX Vega 56 at ~1.71 GHz OC with 12 TFLOPS can beat an RX Vega 64 at ~1.5 GHz (AIB ASUS Strix OC) with 13.1 TFLOPS.

A higher clock speed improves hardware rasterization along with any TFLOPS increase.

The RX 6800 XT (RDNA 2) is rumored to reach up to a 2.577 GHz clock speed with 64 CUs, which yields 21.1 TFLOPS along with a very high clock speed for the rasterization hardware.
https://videocardz.com/newz/amd-radeon-rx-6800xt-board-partner-card-allegedly-features-a-2577-mhz-boost-clock
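As a sanity check, the rumored figure matches the usual peak-FLOPS arithmetic (assuming RDNA 2 keeps 64 lanes per CU and counting a fused multiply-add as 2 FLOPs):

```python
# Peak FP32 TFLOPS = CUs x lanes per CU x 2 (FMA) x clock (GHz) / 1000.
# 64 lanes/CU and the 2.577 GHz boost clock are the rumored figures
# quoted above, not confirmed specifications.
cus = 64
lanes_per_cu = 64
flops_per_lane = 2          # one FMA = a multiply plus an add
clock_ghz = 2.577
tflops = cus * lanes_per_cu * flops_per_lane * clock_ghz / 1000
print(round(tflops, 1))  # 21.1
```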

In terms of TFLOPS, RDNA v1 is nearly at Turing's level.

Quote:

It doesn't change what I've said before: existing applications might have problems if those overlapping instructions are used.

As long as e500v2 didn't break the 32-bit Book E specifications for userland AOS4.X software, the issue is not a major problem. Privileged instructions are not guaranteed under Book E.

From my POV, e500v2 is being treated as a faster SAM460 Book E PowerPC machine.

-----------
On the Ryzen 4000U Vega 8 vs Tiger Lake 96EU topic

https://www.notebookcheck.net/Intel-Core-i7-1185G7-in-Review-First-Tiger-Lake-Benchmarks.494462.0.html
V-Ray Benchmark Next,

AMD Ryzen 7 4800U beats Intel Iris Xe Graphics G7 96EUs, Intel Core i7-1185G7 (28W)


GTA V - 1920x1080 High/On (Advanced Graphics Off) AA:2xMSAA + FX AF:8X
Intel Reference Design Laptop 28W
Intel Iris Xe Graphics G7 96EUs, Intel Core i7-1185G7 = 30 fps

Lenovo Yoga Slim 7-14ARE
AMD Radeon RX Vega 8 (Ryzen 4000), AMD Ryzen 7 4800U = 29.8 fps


Dota 2 Reborn - 1920x1080 ultra (3/3) best looking
Lenovo Yoga Slim 7-14ARE
AMD Radeon RX Vega 8 (Ryzen 4000), AMD Ryzen 7 4800U = 52.5 fps


Intel Reference Design Laptop 28W
Intel Iris Xe Graphics G7 96EUs, Intel Core i7-1185G7 = 49.4 fps.


https://www.notebookcheck.net/Intel-Core-i7-1185G7-in-Review-First-Tiger-Lake-Benchmarks.494462.0.html
Intel Reference Design Laptop 28W
Intel Iris Xe Graphics G7 96EUs, Intel Core i7-1185G7 has LPDDR4x-4266 RAM


Lenovo Yoga Slim 7-14ARE is gimped by LPDDR4x-2400 RAM.
https://www.notebookcheck.net/The-Ryzen-7-4800U-is-an-Absolute-Monster-Lenovo-Yoga-Slim-7-14-Laptop-Review.456068.0.html

It's not an absolute win for Intel Iris Xe Graphics G7 96EUs with 28W mode SOC.


Last edited by Hammer on 23-Oct-2020 at 04:13 AM.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

 Status: Offline
Profile     Report this post  
Hammer 
Re: Amiga SIMD unit
Posted on 23-Oct-2020 4:10:46
#175 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5286
From: Australia

@NutsAboutAmiga

Quote:

NutsAboutAmiga wrote:
@Hammer

Anyway you can’t use Altivec code on A1222, nor can you on any older G3’s or AMCC4x0 cpu’s, so the issue is not as big anyway. (most code does not have it.)

AltiVec code I guess is only really used in things like FFMPEG, Mplayer and few other things, the compiler we have can’t optimize for it, it has to be hand written as inline assembler, (or as the AltiVec macros).

If some programs are acting badly, it's maybe because they are old and not updated, as Hyperion broke AmigaOS4 in 4.1 Final; Picasso96 is more or less useless now, and other stuff changed in the SDK as well, forcing developers to update their code.

I'm aware that e500v2 lacks Altivec support.

I don't have an A1222, hence AOS4.1 userland apps' behavior on e500v2 is not visible to me.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

 Status: Offline
Profile     Report this post  
cdimauro 
Re: Amiga SIMD unit
Posted on 23-Oct-2020 5:48:01
#176 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Hammer Quote:

Hammer wrote:
@cdimauro
Quote:
Jaguar hasn't a subset of AVX. AVX is always a 256-bit wide SIMD ISA. However Jaguar has an internal 128-bit implementation.

I was referring to the actual hardware implementation, since Jaguar's 128-bit SIMD hardware implements AVX-128 natively, while AVX-256 is supported for compatibility.

The AVX instructions support both 128-bit and 256-bit SIMD.

The 128-bit AVX instructions can be useful to improve old code without needing to widen the vectorization, and they avoid the penalty of transitioning from SSE to AVX; they are also faster on some early AMD implementations of AVX. This mode is sometimes known as AVX-128.

https://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/i386-and-x86-64-Options.html#i386-and-x86-64-Options
-mprefer-avx128
This option instructs GCC to use 128-bit AVX instructions instead of 256-bit AVX instructions in the auto-vectorizer.

Jaguar's AVX-256 has a latency penalty when compared to AVX-128.

The instruction set is only part of the solution, since the micro-architecture implementation can influence the overall result.

The micro-architecture is an implementation detail: what counts when talking about real code is which ISA & extensions are exposed, and which of them applications can make use of.

From this PoV Jaguar has the complete AVX ISA extension, which means that the vector registers are 256 bits in size and both 128- and 256-bit instructions are available. It's like Ryzen 1 & 2, which had the AVX/AVX2 SIMD ISA available, but with an internal 128-bit implementation.

What you reported are microarchitecture-level tricks that are used to gain better performance for a specific implementation. There's no "AVX-128" mode.

@Hammer Quote:

Hammer wrote:
@cdimauro Quote:
True. That's why AMD's GPUs usually have more TFLOPS compared to nVidia's, but lower performance. TFLOPS are just numbers: very important, but it also has to be seen how they can be used and reached.

AMD's GCN TFLOPS deliver less effective performance when compared to NVIDIA's Pascal and Turing.

An RX Vega 56 at ~1.71 GHz OC with 12 TFLOPS can beat an RX Vega 64 at ~1.5 GHz (AIB ASUS Strix OC) with 13.1 TFLOPS.

A higher clock speed improves hardware rasterization along with any TFLOPS increase.

The RX 6800 XT (RDNA 2) is rumored to reach up to a 2.577 GHz clock speed with 64 CUs, which yields 21.1 TFLOPS along with a very high clock speed for the rasterization hardware.
https://videocardz.com/newz/amd-radeon-rx-6800xt-board-partner-card-allegedly-features-a-2577-mhz-boost-clock

Let's see, because rumors and slides have differed from the real-world products when we talk about AMD GPUs (AMD has been struggling to be competitive for a very long time).
Quote:
In terms of TFLOPS, RDNA v1 is nearly the Turing level.

AFAIR no: Turing still had an advantage.
Quote:
Quote:
It doesn't change what I've said before: existing applications might have problems if those overlapping instructions are used.

As long as e500v2 didn't break the 32-bit Book E specifications for userland AOS4.X software, the issue is not a major problem. Privileged instructions are not guaranteed under Book E.

From my POV, e500v2 is being treated as a faster SAM460 Book E PowerPC machine.

I don't remember now which instructions were removed, nor whether they are user or privileged ones. I'll check once I have some time.
Quote:
On Ryzen 4000U's Vega 8 vs Tirgerlake 96EUs topic

https://www.notebookcheck.net/Intel-Core-i7-1185G7-in-Review-First-Tiger-Lake-Benchmarks.494462.0.html
V-Ray Benchmark Next,

AMD Ryzen 7 4800U beats Intel Iris Xe Graphics G7 96EUs, Intel Core i7-1185G7 (28W)


GTA V - 1920x1080 High/On (Advanced Graphics Off) AA:2xMSAA + FX AF:8X
Intel Reference Design Laptop 28W
Intel Iris Xe Graphics G7 96EUs, Intel Core i7-1185G7 = 30 fps

Lenovo Yoga Slim 7-14ARE
AMD Radeon RX Vega 8 (Ryzen 4000), AMD Ryzen 7 4800U = 29.8 fps


Dota 2 Reborn - 1920x1080 ultra (3/3) best looking
Lenovo Yoga Slim 7-14ARE
AMD Radeon RX Vega 8 (Ryzen 4000), AMD Ryzen 7 4800U = 52.5 fps


Intel Reference Design Laptop 28W
Intel Iris Xe Graphics G7 96EUs, Intel Core i7-1185G7 = 49.4 fps.


https://www.notebookcheck.net/Intel-Core-i7-1185G7-in-Review-First-Tiger-Lake-Benchmarks.494462.0.html
Intel Reference Design Laptop 28W
Intel Iris Xe Graphics G7 96EUs, Intel Core i7-1185G7 has LPDDR4x-4266 RAM


Lenovo Yoga Slim 7-14ARE is gimped by LPDDR4x-2400 RAM.
https://www.notebookcheck.net/The-Ryzen-7-4800U-is-an-Absolute-Monster-Lenovo-Yoga-Slim-7-14-Laptop-Review.456068.0.html

It's not an absolute win for Intel Iris Xe Graphics G7 96EUs with 28W mode SOC.

I had a look at different reviews (Anandtech), where the 28W part was a win (obviously), but at 15W it was comparable to AMD.

 Status: Offline
Profile     Report this post  
cdimauro 
Re: Amiga SIMD unit
Posted on 23-Oct-2020 6:14:43
#177 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:

matthey wrote:
Quote:
cdimauro wrote:
I've read it, but I don't know how much sense the comparison between x86 and x64 makes for this particular aspect, because they have completely different ABIs and, for the same reason, the code doesn't look the same either.
x86 is substantially stack-based, whereas x64 is register-based. So, x86 executes A LOT of PUSHes and POPs, whereas x64 does so far less and uses MOVs instead.
I think that you've already read my article which reports such statistics.

The comparison in the "Performance Characterization of the 64-bit x86 Architecture from Compiler Optimizations’ Perspective" paper actually uses x86-64 mode for the comparisons of REG_16, REG_12 and REG_8 GP registers. It was possible to reduce the number of available GP registers with the compiler. This is a better way to isolate the memory traffic and performance difference than to compare x86-64 and x86 with ISA differences. I expect this gives a pretty good indication of the advantage x86-64 gained over x86 by moving from 8 to 16 GP registers though.

REG_16 to REG_8 (16 to 8 GP registers)
CINT2000 42% more memory traffic, 4.4% performance loss
CFP2000 78% more memory traffic, 5.6% performance loss

REG_16 to REG_12 (16 to 12 GP registers)
CINT2000 14% more memory traffic, 0.9% performance loss
CFP2000 29% more memory traffic, 0.3% performance loss

I would expect that moving to more than 16 GP registers would provide less of a performance gain than the performance loss from 16 GP registers to 12 GP registers which should be less than 1% performance difference for these older benchmarks at least. We also agree that newer processors with more DCache accesses per cycle should be able to reduce the performance loss of DCache memory traffic more.

OK, it's much clearer now and I agree. The only thing is that those benchmarks are really old. I remember that SPEC2006 gave different results. And now we have the SPEC2017 test suite. It would be good to have some updated benchmarks.
Quote:
Quote:
And I assume that we all agree that what we do NOT want from an ISA is for it to be stack-based.

Elevated memory traffic is undesirable whether it is excessive DCache stack use or increased ICache use from poor code density. It is better to have more registers but they reach the point of diminishing returns and start making code larger which reduces ICache performance.

Usually yes, and I agree, but, as I said before, I already get very good code density with my ISA (and without using many other features that can greatly reduce code size further), which has 32 GP registers (but "unused" currently: for my experiments and statistics I just translate x86/x64 instructions to an equivalent in my ISA, so the comparisons currently always use 8 registers for x86 and 16 for x64).
Quote:
CISC with 16 GP registers seems to be a good balance...

I agree. Memory access for (almost) all instructions, powerful addressing modes, and medium/large immediates are a clear win for CISCs (which are able to make good use of them).
Quote:
It appears that the 68k uses less memory traffic than x86-64 even with a stack arg passing ABI, as function inlining reduces the overall cost of function stack args.

That's strange. Both have the same number of registers and similar commonly used addressing modes. Any idea why that happens?
Quote:
Quote:
I understand and I agree on most of the things (especially on the stupid decisions of Motorola's management about the 68K family), but I think that there's an important point which you're not considering when talking about small cores (which means small area -> reduced costs): the process used to fabricate the chip.

What you said made absolute sense 20 or more years ago, but nowadays, with the production processes used in recent years, you can pack millions of transistors even into a few mm^2 of area. And chip area is mostly dominated by the caches (L1, L2, and even L3) and the "uncore".

In short: I don't think that the register file takes a lot of space compared to all the rest.

So, does it really make sense to try to limit the register file as much as possible? My feeling is clearly no, for the reasons I've given. But I'd like to see if other people have a different opinion, and it would be nice to have some concrete data about it (e.g.: register file area vs cache area vs uncore area vs total chip area, for the latest chips of course, i.e. using modern production processes), which would greatly help here.

A larger register file certainly has a higher percentage cost for a small core than a large high performance core kind of like the decoder cost which became insignificant on powerful x86-64 cores. The decoder and register file cost of x86-64 started to matter again with the slimmed down Atom and Larrabee cores (percentage cost went up per core and it is multiplied times the number of cores). A couple percent performance gain from increasing the register file size may have been worth it for a high performance core but now the slimmed down cores may not be as competitive in area and power. Processor design is a trade off for PPA (Power, Performance and Area).

I never heard of problems caused by the register file for Atom and Larrabee, unless we talk about the SIMD registers (which is the reason why low-end x86/x64 processors had only SSE integrated, and not AVX/AVX2). The decoder, on the other hand, was and is a sensitive element for x86 and x64.

Do you have some studies / numbers about the register file?
Quote:
A larger chip area costs more and newer fab processes cost more. Yes, newer processes do allow for better transistor density and give more chips per wafer but this is partially offset by higher production costs and lower (successful) yields. Current leakage is reducing the advantage of these newer processes as well. There is a sweet spot on the curve which gives the best cost per transistor with an older process which many cost sensitive embedded customers use. The cost to design and produce high performance processors using modern fabs is only done by the largest companies while many companies can afford to produce simpler and smaller ASICs on an older process. The higher performance your ISA is optimized for, the fewer companies there are who can afford to design and produce your processors and the fewer customers there are who can afford to use it. Maybe you like to go to the high stakes gambling table first though.

I agree, but I suspect that the oldest processes used for embedded aren't the ones that were available 20 or more years ago. AFAIR many are using 32-28nm processes, which allow a very good number of transistors to be packed into a small area.

I doubt that embedded SoCs are smaller than 1mm^2 in area.
Quote:
Quote:
Another question for you: do you still want to keep as much as possible of the 68K legacy (e.g.: opcode structure & instructions, addressing modes)?
For the 32-bit code / execution mode it might make sense, because you can reuse the existing tools. But for the 64-bit code you are forced to make several non-compatible changes anyway.
64-bit is the future even in the embedded market, looking at the trend. And ARM has just announced that future ISA versions will be 64-bit only, to cite one important piece of news.

The original 68000 was a 16/32 bit CPU. A new 68k should be a 32/64 bit CPU. The lack of a path or roadmap to 64 bit support for ColdFire likely contributed to its demise.

And that's why there are many discussions. But we (not only me and you) have (very) different opinions on how to fill that gap.
Quote:
I believe ARM is making a mistake if they plan to discontinue all 32 bit Thumb2 cores as AArch64 doesn't have good enough code density for the low end embedded market.

They simply won't develop the 32-bit ISA any further. Thumb-2 and similar are here to stay for the partners that want to use them. But future ARM ISAs will be entirely 64-bit AFAIR: so not even the 32-bit execution mode will be supported.
Quote:
There isn't enough competition of good code density 64 bit architectures which is needed for the embedded market

Indeed, and that's very good for the ones which are working on them.

 Status: Offline
Profile     Report this post  
Fl@sh 
Re: Amiga SIMD unit
Posted on 23-Oct-2020 12:09:08
#178 ]
Regular Member
Joined: 6-Oct-2004
Posts: 253
From: Napoli - Italy

@cdimauro

Quote:
I don't remember now which instructions were removed, either if they are user or privileged ones. I'll check once I've some time.


Privileged

_________________
Pegasos II G4@1GHz 2GB Radeon 9250 256MB
AmigaOS4.1 fe - MorphOS - Debian 9 Jessie

Hammer 
Re: Amiga SIMD unit
Posted on 23-Oct-2020 17:01:07
#179 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5286
From: Australia

@cdimauro

Quote:

The micro-architecture is an implementation detail: what counts when talking about real code is which ISA & extensions are exposed, and that applications can make use of.

From this PoV Jaguar has a complete AVX ISA extension, which means that vector registers are 256-bit in size and both 128 and 256-bit instructions are available. It's like Ryzen 1&2, which had an AVX/-2 ISA SIMD available, but the internal implementation is 128-bit.

What you reported are microarchitecture-level tricks that are used to gain better performance for a specific implementation. There's no "AVX-128" mode.

There is 128-bit AVX SIMD usage in practice — refer to GCC's 128-bit AVX auto-vectorization option.
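As a hedged sketch of what that GCC option does: with `-mprefer-avx128` (older GCC) or `-mprefer-vector-width=128` (newer GCC), the compiler auto-vectorizes loops like the one below with 128-bit (xmm) AVX instructions instead of 256-bit (ymm) ones. The function itself is plain portable C and computes the same result either way; only the generated code changes.

```c
#include <stddef.h>

/* A loop GCC auto-vectorizes at -O3.  Compiled with e.g.
 *   gcc -O3 -mavx2 -mprefer-vector-width=128 ...
 * GCC emits VEX-encoded 128-bit (xmm) instructions rather than
 * 256-bit (ymm) ones -- the safer choice on cores such as Jaguar
 * whose FP datapaths are 128 bits wide. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

The `restrict` qualifiers matter here: they tell the compiler the arrays don't alias, which is what lets it vectorize the loop without emitting runtime overlap checks.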

Jaguar's 256-bit AVX support is merely for forward compatibility and comes with a higher latency penalty; Jaguar wasn't designed to run 256-bit AVX workloads optimally.

Jaguar's load/store units are only 128 bits wide, unlike Zen 2's two 256-bit load units and one 256-bit store unit.

In Jaguar, 256-bit AVX operations complete as 2 x 128-bit operations, while all 128-bit operations execute without multiple passes through the pipeline; the extra pass increases time-to-completion latency.

Jaguar's store queue has 20 entries that are 16 bytes (128 bits) wide!

Jaguar's L1D can sustain a 128-bit read and a 128-bit write each cycle!

A 1.6 GHz Jaguar running 256-bit AVX is like a Jaguar at 800 MHz!

ASM/C/C++ programmers (especially on soon-to-be-obsolete game consoles) need to know a microarchitecture's weaknesses to minimize performance pitfalls.

I have plenty of criticisms against Jaguar's microarchitecture.

You can't ignore microarchitecture weaknesses when it comes to high-performance 3D game engines.


Quote:

Let's see, because rumors and slides were different from real-world products, when we talk about AMD GPUs (AMD is struggling to be competitive from very long time).

Reminder: Intel's GPU efforts are worse than AMD's.

The RX 5700 XT's feature set is obsolete: it doesn't support XSS/XSX's DirectX 12 Ultimate and DirectX 12 Feature Level 12_2, while Turing RTX does support them!

There's a very high chance of RDNA 2 reaching high clock speeds, judging from PS5's reveal.

Using PS5's GPU clock speed:

The RX 6800 XT's 64 CUs at 2230 MHz yield 18.268 TFLOPS, which is almost 2X the RX 5700 XT.

The RX 6900 XT's 80 CUs at 2230 MHz yield 22.835 TFLOPS, which is 2.34X over the RX 5700 XT.

Unlike Polaris/Vega's GCN TFLOPS, RDNA v1's TFLOPS are nearly on par with Turing RTX's TFLOPS.


Desktop PC SKUs are not limited by the game consoles' TDP constraints, and NVIDIA has thrown the PEG 300-watt standard design limit out of the window with Ampere RTX.

My MSI RTX 2080 Ti Gaming X Trio has three 8-pin PCI-E power sockets to blast past 300 watts, which can narrow the gap with the RTX 3080.

https://www.techpowerup.com/review/nvidia-geforce-rtx-3080-founders-edition/31.html
RTX 3080 FE's peak gaming power consumption is 348 watts.

https://www.techpowerup.com/review/nvidia-geforce-rtx-3080-founders-edition/34.html
The RTX 2080 Ti delivers 76% of the RTX 3080's performance.

https://www.techpowerup.com/review/msi-geforce-rtx-2080-ti-gaming-x-trio/33.html
My MSI RTX 2080 Ti Gaming X Trio in its factory overclock mode is 6% faster than the stock RTX 2080 Ti.

An end user's overclock can yield another 6.7% increase.
https://www.techpowerup.com/review/msi-geforce-rtx-2080-ti-gaming-x-trio/36.html

https://www.techpowerup.com/review/msi-geforce-rtx-2080-ti-gaming-x-trio/31.html
The MSI RTX 2080 Ti Gaming X Trio's peak gaming power consumption is 358 watts.

If NVIDIA can blast past PEG's 300-watt design limit, so can AIB RX 6800 XT and RX 6900 XT cards.

My argument is based on history: the PS4 has a GCN version 2.0 design with 20 CUs at 800 MHz, while the PC's R9 290X is a GCN version 2.0 design with 44 CUs at 1 GHz+.

The PS5 GPU is a 20-DCU (aka 40-CU) RDNA 2 design clocked up to 2.23 GHz.

The RX 6900 XT is a 40-DCU (aka 80-CU) RDNA 2 design.

The RX 6800 XT is a 32-DCU (aka 64-CU) RDNA 2 design.

Adding 200 MHz on top of the PS5 GPU's 2.23 GHz lands at 2.43 GHz.

Ampere doubled the CUDA cores within the SM units without increasing the rasterization hardware, hence game results are meh compared to an RTX 2080 Ti OC.

The RTX 3080 uses GA102, the same die as the RTX 3090, instead of the expected GA104.

Usually G?104 is assigned to the ?080-type SKU, e.g. GTX 1080 or RTX 2080.

NVIDIA knew AMD's expected RDNA 2 RX 6800 XT/RX 6900 XT results, and GA104 wouldn't have been enough.






Last edited by Hammer on 26-Oct-2020 at 02:59 AM.
Last edited by Hammer on 23-Oct-2020 at 05:14 PM.
Last edited by Hammer on 23-Oct-2020 at 05:08 PM.
Last edited by Hammer on 23-Oct-2020 at 05:07 PM.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

Hammer 
Re: Amiga SIMD unit
Posted on 23-Oct-2020 17:29:01
#180 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5286
From: Australia

@cdimauro

Quote:

They simply don't further develop the 32 ISA anymore. Thumb-2 et similar are here to stay for the partners that want to use them. But future AMD ISA will be entirely 64-bit AFAIR: so, not even the 32-bit execution mode will be supported.

Zen 3 still has native support for x86-32.

AMD doesn't officially support Windows 95/98/Me/NT/2K/XP/7, but those OSes still run fine on its CPUs.

https://www.youtube.com/watch?v=KFEpHEXBCbA
A Ryzen 7 2700 running retro DOS-based Windows 98 and the Doom 2 game just fine. No Motorola-style instruction-set kitbashing when it comes to running legacy Windows OSes and Doom 2.



Last edited by Hammer on 23-Oct-2020 at 05:32 PM.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)


Copyright (C) 2000 - 2019 Amigaworld.net.
Amigaworld.net was originally founded by David Doyle