Poll : Is AMMX a good standard for the 68k?
Yes
No
No opinion or pancakes
 
cdimauro 
Re: Amiga SIMD unit
Posted on 18-Oct-2020 21:45:21
#141 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:

matthey wrote:
You don't think a Motorola 68080 would have had a SIMD unit? Would that have been a mistake too?

Back in the 68K era, yes, it would have made sense. Not nowadays, with the 68K ISA dead.
Quote:
CPU cores with SIMD units
+ fast startup times
+ powerful and flexible for a high latency parallel processor
- tethered to fat CPU cores (the CELL architecture avoided this with separate SIMD like SPEs)

Cell PPUs had VMX/Altivec units as well.
Quote:
CPU cores can be slimmed down and are for GPGPUs. The Knights Landing project predecessors went as far back as the Pentium P54C design which is the 68060 era (best comparison to the 68060 and dominated by the 68060 in PPA).

We have talked about that several times: the Pentium vs 68060 comparison isn't fair!
Motorola dropped a lot of stuff, starting as early as the 68030, and even removed user-mode instructions. Plus, the 68060 had a very limited instruction queue (4 bytes = at most two 16-bit instructions; I don't remember how wide the Pentium's instruction fetch line is), and its FPU wasn't pipelined.
By contrast, the Pentium supported ALL the legacy stuff while adding more instructions (and "machine registers") and a 64-bit data bus: all of this cost transistors/area and power. And its FPU was pipelined, as I said.

Anyway, it would be interesting to know the SPECInt and SPECfp values for both the 68060 and the Pentium.
Quote:
SIMD support definitely bloats up a core and using the GPU with HSA hardware is much more efficient. It could be the blind following the blind. It wouldn't be the first time technology followed the hype down the wrong path.

I already answered this in my previous comment, but I want to add a couple more things.

First, HSA is much more limited because it relies on the integrated GPU. The big performance comes from discrete GPUs.

Second, many processor vendors keep adding more complex and wider SIMD units. If they are all wrong, well, they are in good company...
Quote:
Weaker CPUs have more advantages than powerful ones.

Weaker CPU
+ cheaper (production and development)
+ better security
+ easier to program

Highly questionable: you have to select the instructions carefully if you want to squeeze the most out of in-order designs.

Out-of-order designs freed both programmers and performance from this big burden...
Quote:
+ lower power
+ more reliable

Why?
Quote:
Extreme performance from the CPU is usually *not* a good idea. Usually a more balanced and smarter approach works better.

It depends on the specific application field. On general-purpose computers we have seen that high performance was and is a good thing.

cdimauro 
Re: Amiga SIMD unit
Posted on 18-Oct-2020 22:14:26
#142 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:
When I was on the Apollo Team, I brought up the StarCore DSP to Gunnar and we discussed it. Personally, I'm not a fan of it overall even though I like some ideas. StarCore probably is easier to program than most DSPs but is far from the 68k in consistency and ease of use.

The CRIS ISAs (there were two, very different) also look very interesting, and borrow some 68K ideas IMO.

However, like the StarCore, they are too oriented towards the embedded market.
Quote:
My rough estimate is that from RISC 32 GP integer registers to the 68k 16 GP integer registers is less than 10% memory traffic increase and less than 1% performance reduction in most designs. The 68k has several traits that reduce register usage (reg-mem/mem-mem, powerful addressing modes, large immediate support, PC relative addressing, register renaming).

The problem with the 68K is that it has overly complex addressing modes.
Quote:
More GP registers are anything but a free lunch. They use more transistors, draw more power, sometimes require more time and memory to save and restore extra (caller/callee/all) registers

That's a problem mostly for ISAs with fewer registers, because they have to spend much more time saving and restoring them. An ISA with more registers needs fewer saves/restores to/from memory, because more registers are available to both the caller and the callee.
Quote:
and require more encoding space increasing code sizes.

That's not a general rule. My ISA has 32 GP registers, with very good code density. It also has 64 SIMD registers (and 16 mask registers), and its code density is much better than that of AVX-512 (which has "only" 32+8 of them).
Quote:
ARM AArch64 has an optional Scalable Vector Extension (SVE) in addition to the standard fixed width SIMD instructions. Is this what you mean by "variable length SIMD"?

https://community.arm.com/developer/tools-software/hpc/b/hpc-blog/posts/technology-update-the-scalable-vector-extension-sve-for-the-armv8-a-architecture

It sounds like it would be less efficient today with the abstraction but may be more efficient tomorrow with wider SIMD units (although code recompiled for a fixed width tomorrow will be more efficient). It assumes that SIMD width will continue to grow

That's what all processor vendors have assumed.
Quote:
and that may not be practical as can be seen by Knights Landing down clocking cores using 512 bit wide SIMD instructions.

Yes, the clock drops, but performance is still better thanks to the wider registers and ALUs: more data is processed per instruction.
Quote:
Also, wider SIMD unit standards limit low end core designs.

Vector-length agnostic ISAs avoid this: you can have a simpler implementation for low-end core designs, without touching the existing code.
Quote:
I expect high end core SIMD designs need support and don't mind losing some performance but lower end designs would rather have the better performance of fixed width SIMD.

See above: it's the same for high-end cores. The code is the same, but the execution is faster. So, no explicit support is required.
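To make the vector-length agnostic idea concrete, here is a minimal Python sketch. It is not real SVE or any vendor's code: the function name `vla_add` and the `hw_lanes` parameter are hypothetical, with `hw_lanes` standing in for the number of elements the hardware handles per vector instruction. The point it tries to show is that the same loop runs unchanged on a narrow or a wide implementation; only the number of iterations differs.

```python
def vla_add(a, b, hw_lanes):
    """Add two equal-length lists the way a vector-length agnostic loop would.

    hw_lanes is a hypothetical stand-in for the hardware vector width
    (elements per vector instruction). The loop never hard-codes it.
    Returns (result, number_of_vector_iterations).
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if hw_lanes < 1:
        raise ValueError("hw_lanes must be at least 1")
    result = []
    iterations = 0
    i = 0
    while i < len(a):
        # Ask the "hardware" how many elements it can take this round;
        # the last round may be shorter, so no scalar tail loop is needed.
        vl = min(hw_lanes, len(a) - i)
        # One "vector instruction": vl element-wise additions at once.
        result.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        iterations += 1
        i += vl
    return result, iterations


data_a = list(range(10))
data_b = [10 * x for x in range(10)]

low_end = vla_add(data_a, data_b, hw_lanes=2)   # narrow, cheap core
high_end = vla_add(data_a, data_b, hw_lanes=8)  # wide, fast core
```

Both calls return the same sums; the narrow "core" simply needs more iterations than the wide one, which is the trade-off described above.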

cdimauro 
Re: Amiga SIMD unit
Posted on 18-Oct-2020 22:23:39
#143 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:
I expect SIMD instructions translate well to VLIW but the big problem for VLIW cores has been branches.

And loads (since they aren't predictable). That's why VLIWs aren't general purpose processors.
Quote:
Has Nvidia finally solved the VLIW "code morphing" problems?

Performance seems quite good, but Project Denver also embeds an ARM instruction decoder, AFAIR.
Quote:
Would they be interested in 68k or PPC "code morphing"?

Are we kidding?

@Lou Quote:

Lou wrote:
@matthey

To be clear. I think SIMD is great. I just believe putting it in the cpu is wrong vs an actually enhanced chip is where it belongs.

Personal beliefs can be (very) different from reality...

@dooz Quote:

dooz wrote:
@matthey

We already have kind of combined embedded SIMD/FPU unit on Amiga (except ALTIVEC).

P1022 SPE (Signal Processing Engine) unit from A1222 is a 64-bit, two element, single-instruction multiple-data (SIMD) ISA. The two-element vector fits within GPRs extended to 64-bit. It doesn't have dedicated floating-point registers. GPRs are used for integer operations, extended to 64-bit to support vector single precision and scalar double precision categories.

SPE can execute floating point and vector instructions.

Embedded scalar double-precision floating-point instructions treat the GPRs as 64-bit
single-element registers for double-precision computation.

Maybe it's worth mentioning.

Also detailed manual exists:

https://www.nxp.com/docs/en/reference-manual/SPEPEM.pdf

-dooz

That's cheap crap just for embedded systems.

Fl@sh 
Re: Amiga SIMD unit
Posted on 18-Oct-2020 22:40:49
#144 ]
Regular Member
Joined: 6-Oct-2004
Posts: 253
From: Napoli - Italy

@matthey

Please give the following link a quick read; it's about function calls and parameter passing on PPC:

The simplified 64-bit ABI - IBM POWER

It's quite simple and fast, even compared with 68k and other CISC CPUs.

Quote:

matthey wrote:
Quote:

Fl@sh wrote:
You are making things much harder for ppc.
I know you are in love with 68k and it can cause a loss of impartiality.

Anyway for ppc I have seen prologue and epilogue simpler than in your examples.


The PPC prologue and epilogue code is recommended from the Freescale/NXP "AltiVec Technology Programming Interface Manual".

Quote:

Obviously in ppc code you are saving and restoring even vector registers, not present in 68k.


That is one of the disadvantages of having more registers and different register files. For example, x86_64 has a shared SIMD and FPU register file which reduces this cost. If the 68k had also shared FPU and SIMD registers then the 68k code I posted may suffice. Sharing SIMD and FPU registers is common including from IBM (POWER and z/architecture), ARM, AMD and Intel.

Quote:

The best case scenario instead, IMHO, is to save nothing and use volatile registers as much as possible in ppc code, for normal user program function calls.


If you use only volatile registers on PPC then you have 11 gp integer registers which is less than the 68k and x86_64 with 16 and RISC needs more registers as performance degrades quickly when out of registers. The PPC stack frame is often needed for local variables, local storage, varargs, etc. (all but global variables which are usually discouraged and dynamic memory allocations) so typically the cost of the stack frame would already be incurred, especially with function inlining popularity today.

Quote:

You showed the worst-case scenario, where a function call needs to save GPR, FPU and vector registers all at once, and obviously restore them all at the end.


There are several entry points for the register saving functions. The top entries can be skipped if those registers do not require saving and even the whole branch is removed for each register file if no registers need saving. The SIMD unit is rarely used and more SIMD registers are volatile so rarely need saving and restoring.

Quote:

I guess that in such a complex function, where all these different registers are involved, saving and restoring the non-volatile resources is really a minor overhead compared with the complexity of the job the CPU is doing.


The compiler tries to figure out when it is worthwhile to save and restore non volatile registers but it is worthwhile in most cases with RISC because of the high overhead of using memory. Using 12 registers instead of 11 registers can generate a costly stack frame which the compiler may have trouble computing the cost of. Using more registers can sometimes slow performance. Context switches where all the registers are saved are slower too.

_________________
Pegasos II G4@1GHz 2GB Radeon 9250 256MB
AmigaOS4.1 fe - MorphOS - Debian 9 Jessie

cdimauro 
Re: Amiga SIMD unit
Posted on 18-Oct-2020 22:41:34
#145 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:

matthey wrote:
Dynamic vectorization avoids the encoding space bloat and corner cases. The following article compares dynamic vectorization to fixed SIMD (comments are interesting too).

https://www.sigarch.org/simd-instructions-considered-harmful/

The article is heavily biased towards vector-length agnostic ISAs. In fact, it's enough to look at the comments: some raise solid criticisms, and several of those never even got an answer from the eminent Prof. Patterson...
Quote:
The OpenPOWER Foundation has been much more cooperative but POWER isn't as suitable (although it now is a VLE but not using the more compact VLE).

Even using the VLE encoding doesn't put POWER at the same code-size level of other ISAs.

There's nothing in POWER that makes it desirable compared to other ISAs.

Hammer 
Re: Amiga SIMD unit
Posted on 19-Oct-2020 4:28:06
#146 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5275
From: Australia

@cdimauro

Quote:

Unfortunately 3DNow! had the same MMX limits: just 64-bit registers size, only 8 registers, and requires a context switch between the FPU and SIMD execution modes.

Between MMX vs 3DNow, I prefer 3DNow.

Ideally, GC/Wii's PowerPC's custom 64bit SIMD with compact instruction set CISC-RISC hybrid CPU core would be nicer.

My comment was made in the context of a low-budget, 64-bit-wide SIMD.

Neither MMX nor wMMX benefits floating-point geometry workloads.

Last edited by Hammer on 19-Oct-2020 at 04:37 AM.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

Hammer 
Re: Amiga SIMD unit
Posted on 19-Oct-2020 4:35:28
#147 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5275
From: Australia

@cdimauro

Larrabee's X86 CPU core was updated with X86-64 support like the Atom Bonnell microarchitecture.

Pentium P54C has 32bit ALUs while Atom Bonnell has the 64bit ALU.

Larrabee's X86 CPU cores can address more than 4 GB of RAM, which is beyond the 32-bit memory address limit.

Last edited by Hammer on 19-Oct-2020 at 04:37 AM.

cdimauro 
Re: Amiga SIMD unit
Posted on 19-Oct-2020 5:26:27
#148 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Hammer Quote:

Hammer wrote:
@cdimauro

Quote:

Unfortunately 3DNow! had the same MMX limits: just 64-bit registers size, only 8 registers, and requires a context switch between the FPU and SIMD execution modes.

Between MMX vs 3DNow, I prefer 3DNow.

Ideally, GC/Wii's PowerPC's custom 64bit SIMD with compact instruction set CISC-RISC hybrid CPU core would be nicer.

My comment's context is made with a low budget 64bit wide SIMD.

Neither MMX nor wMMX benefits floating-point geometry workloads.

OK, so if the requirements are low-budget AND geometry handling (packed floating point support, in general), I agree that 3DNow! is (obviously) better than MMX/wMMX.

@Hammer Quote:

Hammer wrote:
@cdimauro

Larrabee's X86 CPU core was updated with X86-64 support like the Atom Bonnell microarchitecture.

Pentium P54C has 32bit ALUs while Atom Bonnell has the 64bit ALU.

Larrabee's X86 CPU cores can address more than 4 GB of RAM, which is beyond the 32-bit memory address limit.

I know: I'm a former Intel engineer, and I worked on Xeon Phi (the debugging tools) when I was there.

But I have to make a correction: Larrabee, Knights Ferry, and Knights Corner weren't x86-64 (or EM64T, as Intel likes to call it) CPUs, because they lacked some instructions. In fact, they were used only as coprocessors, and it wasn't possible to use them as regular CPUs.

It's only with Knights Landing that those missing instructions were added back, and then the Xeon Phi ISA became fully x86-64 compatible.

cdimauro 
Re: Amiga SIMD unit
Posted on 19-Oct-2020 6:15:19
#149 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@kolla Quote:

kolla wrote:

Most 68k software also isn't optimized for 68020+, so why don't we just scrap all those pointless 020+ CPUs...

For what it's worth - a LOT of the software that _I_ use, benefit from FPU, but I cannot have an FPU that confuses software with silly rounding errors due to lack of precision.

There's a contradiction: if you use a lot of software which uses the FPU, then it should use 68020+ features as well, since there was no 68000 (and 010) processor with FPU.

However, I remember that several applications and games explicitly required at least a 68020 (the 80186 emulator that I was writing was one of them). Why are you so against 68020+?

I assume that by rounding errors / lack of precision you're strictly referring to software which MUST use extended precision (80-bit). That's true (IF this is the case), but that kind of extra precision requires more resources both on an FPGA and in an emulator.

All modern mainstream processors/ISAs support 64-bit precision at most (I know: x86 still has the old x86 FPU. But it's a really old design, which is stack-based and has only 8 registers). 128-bit precision may be the next step, but it's really too much for almost all existing software.
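As a small illustration of the precision gap discussed above: a 64-bit double carries a 53-bit significand, while the 80-bit extended format carries a 64-bit one. The sketch below is only a model, not an emulation of any FPU; the helper `round_to_significand` is hypothetical and simply rounds a non-negative integer to a given number of significant bits (round to nearest, ties to even).

```python
def round_to_significand(n, bits):
    """Round a non-negative integer to `bits` significant bits
    (round to nearest, ties to even). Purely illustrative helper."""
    if n < 0:
        raise ValueError("only non-negative integers are handled")
    excess = n.bit_length() - bits
    if excess <= 0:
        return n  # already fits, no rounding needed
    kept = n >> excess
    remainder = n & ((1 << excess) - 1)
    half = 1 << (excess - 1)
    # Round up when above the midpoint, or exactly at it with an odd result.
    if remainder > half or (remainder == half and (kept & 1)):
        kept += 1
    return kept << excess


value = 2**53 + 1  # needs 54 significant bits

as_double = round_to_significand(value, 53)    # 64-bit double significand
as_extended = round_to_significand(value, 64)  # 80-bit extended significand
```

With a 53-bit significand the low bit is lost, while the 64-bit significand keeps the value exact. Software that silently relies on those extra bits is the kind that can misbehave on a reduced-precision FPU.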

@matthey Quote:

matthey wrote:

After the 68000 ISA which was ahead of its time and forward thinking, the 68020 ISA was disappointing. However, there are some very important and commonly used additions, like 32/64 bit multiply and division, index register scaling for addressing modes and longer displacement implicitly PC relative branches.

Indeed, but the ISA also became messier starting with the 68020.

The Motorola engineers had already made several mistakes with the 68000 (too many exceptions that reuse parts of existing opcodes to do different things), but with the 68020 they decided to make it even worse...
Quote:
Some compiler support for the 68k FPU is based on older code which uses the extra precision, like vbcc direct FPU support, so new software can have errors too. With vbcc and SAS/C, these problems can be avoided by compiling using the Amiga math libraries which is much slower and lower precision but has the advantage that it works on Amigas without an FPU too. A fair amount of Amiga programs use the Amiga math libraries which abstracts the 68k FPU hardware away and allows for an incompatible 68k FPU replacement.

The problem here is that using the math libraries is really slow and, AFAIR, they don't support all FPU instructions. So it's even slower.

Why on Earth shouldn't an FPU be used directly? It happens on all other processors, with great benefit.

The problem exists only on the Amiga, where the Amiga OS designers took the very bad route of defining math libraries for FP calculations.

If you wanted to give FPU support to processors that didn't have one, then... emulate it! As was done on other architectures. Is it slow? True, but you'd have to be a really stupid user to expect good FP performance from a processor that has no FPU, because it will be slow even using the math libraries (without the "trap-emulate" tax, of course).
Quote:
I like the old school 68k FPU and appreciate the advantages it offers which are often overlooked today. It is much nicer than the old x86 FPU.

Yes, but still old and legacy. A better design than the crappy coprocessor interface would have helped.
Quote:
There are challenges for modernizing and there are advantages to replacing with an SIMD unit like x86.

SIMD is the natural FPU replacement, since it requires scalar support as well.
Quote:
Perhaps it would be a better fit for a vector processing unit (VPU) than SIMD unit but I'm still studying.

If VPU = vector-length agnostic ISA, then it's ok. If it's an external unit, then I don't agree and I've already written why.
Quote:
In any case, I don't set standards nor is there any standards committee to decide. There are only de facto standards like an incompatible 68k FPU.

That's in the usual Motorola tradition: a series of processors which aren't (fully) compatible with each other...

Anyway, I think that talking about revitalizing the 68K is pointless nowadays: there's already plenty of competition which has much better support and/or a cleaner design.

A new architecture might have a chance if it has a good design and something interesting or better to offer. But a 68K heir that also inherits the whole burden of bad design decisions? I don't think it can have a chance (again), even if the ISA is (badly) patched to add some modern features...

So, a 68K-inspired ISA might be interesting. And there's plenty of room for a nice design (more data registers, a few extra address registers, a new vector-length agnostic ISA) while keeping more or less the same cool features (better code size, lower memory traffic, fewer executed instructions).

Hammer 
Re: Amiga SIMD unit
Posted on 19-Oct-2020 15:18:18
#150 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5275
From: Australia

@cdimauro

Quote:

I know it: I'm a former Intel engineer, and I just worked with Xeon Phi (the debugging tools) when I was there.

But I've to make a correction: Larrabee, Knights Ferry, and Knights Corner weren't x86-64 (or EM64T, like Intel likes to call it) CPU, because they lacked some instructions. In fact, they were used only as coprocessors, and it wasn't possible to use them as a regular CPU.

It's only with Knights Landing that those missing instructions were added back, and then the Xeon Phi ISA became fully x86-64 compatible.

I didn't mention AMD64 in reference to Larrabee's X86 64bit GPR capability, i.e. Larrabee has a different X86-64 fork from AMD64, and my comment focused on the address range beyond the 4 GB limit.

In modern times, Intel refers to their AMD64 alternative as "Intel 64" e.g. https://www.intel.com/content/www/us/en/architecture-and-technology/microarchitecture/intel-64-architecture-general.html

The EM64T name is old.

From https://superuser.com/questions/931742/windows-10-64-bit-requirements-does-my-cpu-support-cmpxchg16b-prefetchw-and-la

X86-64 has evolved and early AMD64 and EM64T CPUs couldn't run Windows 10 X64 due to missing instructions. "X86-64" alone does not guarantee it will run Windows 8.1 X64 and Windows 10 X64.

At least Intel is not deleting instructions in the same style as Motorola's 68030 vs 68040 vs 68060.

When targeting Windows 8.1 and 10 X64, early AMD64 and EM64T CPUs join Larrabee's gimped 64bit X86s, but that's nothing new in that regard.

------------
Different lower-cost CPU approaches

1. Amiga A1222's NXP PowerPC e500v2 has 64-bit SIMD (INT and FP) with different instructions set from 128-bit Altivec/VMX.

2. AMD Bobcat has 128-bit SSE1/SSE2 instruction set with 64-bit wide SIMD hardware. Jaguar has a 256-bit AVX instruction set with 128-bit SIMD hardware. AMD's approach keeps software compatibility with fatter X86 CPUs while delivering lower-cost CPUs.

I prefer the Bobcat approach.
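A rough sketch of how a narrower datapath could execute a wider SIMD instruction in two passes is shown below. This is only an illustration of the general idea, not a description of AMD's actual design; the names `add_64bit_pass` and `add_128bit` are hypothetical, and 32-bit integer lanes are assumed for simplicity.

```python
LANE_MASK = 0xFFFFFFFF  # 32-bit lanes, wrap-around like packed integer adds


def add_64bit_pass(a_half, b_half):
    """One pass through a hypothetical 64-bit datapath: two 32-bit lanes."""
    return [(x + y) & LANE_MASK for x, y in zip(a_half, b_half)]


def add_128bit(a, b):
    """A 128-bit packed add (four 32-bit lanes) issued as two 64-bit passes."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected four 32-bit lanes per operand")
    low = add_64bit_pass(a[:2], b[:2])    # first pass: lanes 0-1
    high = add_64bit_pass(a[2:], b[2:])   # second pass: lanes 2-3
    return low + high


result = add_128bit([1, 2, 3, 0xFFFFFFFF], [10, 20, 30, 1])
```

The software sees one 128-bit operation either way, which is why binaries stay compatible with wider cores; the cheaper core just spends two passes on it.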

Last edited by Hammer on 19-Oct-2020 at 04:12 PM.

Hammer 
Re: Amiga SIMD unit
Posted on 19-Oct-2020 16:25:17
#151 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5275
From: Australia

@cdimauro
Quote:

All modern mainstream processors/ISAs support 64-bit precision at most (I know: x86 still has the old x86 FPU. But it's a really old design, which is stack-based and has only 8 registers). 128-bit precision may be the next step, but it's really too much for almost all existing software.

FYI, K7 Athlon's x87 FPU hardware is a fully superpipelined FPU despite the stack-based FPU instruction set. Internally, the K7 Athlon treats the x87 stack as a flat register file.

Intel's P6's x87 is partially pipelined when performing multiplies.

Being stack based X87 didn't stop AMD from designing a faster FPU. Micro-architecture design can influence performance.
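To show why a stack-based instruction set does not force a stack-shaped implementation, here is a toy Python model. It only captures the architectural rule that ST(i) names physical register (TOP + i) mod 8; the class `X87StackModel` is hypothetical and says nothing about how the K7 actually implements renaming.

```python
class X87StackModel:
    """Toy model of x87 register addressing (not the K7's real hardware).

    Architecturally, ST(i) names physical register (TOP + i) mod 8,
    so a stack reference can be resolved to a flat register index.
    """

    def __init__(self):
        self.regs = [0.0] * 8  # flat physical register file
        self.top = 0           # TOP field of the status word

    def physical(self, i):
        """Resolve stack-relative ST(i) to a flat register index."""
        return (self.top + i) % 8

    def push(self, value):
        """FLD-like push: decrement TOP, then write the new ST(0)."""
        self.top = (self.top - 1) % 8
        self.regs[self.top] = value

    def pop(self):
        """FSTP-like pop: read ST(0), then increment TOP."""
        value = self.regs[self.top]
        self.top = (self.top + 1) % 8
        return value

    def fxch(self, i):
        """Swap ST(0) and ST(i)."""
        a, b = self.physical(0), self.physical(i)
        self.regs[a], self.regs[b] = self.regs[b], self.regs[a]


fpu = X87StackModel()
fpu.push(1.5)   # lands in physical register 7
fpu.push(2.5)   # lands in physical register 6
st0_index = fpu.physical(0)
st1_index = fpu.physical(1)
```

Once every ST(i) reference is resolved to a flat index like this, the execution hardware can work on an ordinary register file. A real design could presumably go further and implement the swap by exchanging indices rather than data, but that is an assumption, not something this model shows.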

Last edited by Hammer on 19-Oct-2020 at 04:35 PM.

cdimauro 
Re: Amiga SIMD unit
Posted on 19-Oct-2020 22:00:12
#152 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Hammer Quote:

Hammer wrote:
@cdimauro

I didn't mention AMD64 in reference to Larrabee's X86 64bit GPR capability, i.e. Larrabee has a different X86-64 fork from AMD64, and my comment focused on the address range beyond the 4 GB limit.

That's normal, because Larrabee was a 64-bit design from the beginning.
Quote:
In modern times, Intel refers to their AMD64 alternative as "Intel 64" e.g. https://www.intel.com/content/www/us/en/architecture-and-technology/microarchitecture/intel-64-architecture-general.html

The EM64T name is old.

Yes, they changed it finally, with a simpler name.
Quote:
From https://superuser.com/questions/931742/windows-10-64-bit-requirements-does-my-cpu-support-cmpxchg16b-prefetchw-and-la

X86-64 has evolved and early AMD64 and EM64T CPUs couldn't run Windows 10 X64 due to missing instructions. "X86-64" alone does not guarantee it will run Windows 8.1 X64 and Windows 10 X64.

At least Intel is not deleting instructions in the same style as Motorola's 68030 vs 68040 vs 68060

When targeting Windows 8.1 and 10 X64, early AMD64 and EM64T CPUs join Larrabee's gimped 64bit X86s, but that's nothing new in that regard.

No, Larrabee (and the first two Xeon Phis) were crippled on purpose, because they were born just as coprocessors, as I said before. So Intel decided to remove some instructions from the ISA, only because of that.

The early AMD64 and EM64T had different issues (implementation not ready in time for Windows XP x64 support).
Quote:
------------
Different lower-cost CPU approaches

1. Amiga A1222's NXP PowerPC e500v2 has 64-bit SIMD (INT and FP) with different instructions set from 128-bit Altivec/VMX.

Not only different: the e500v2-specific instruction set partially overlaps with other PowerPC instructions (Altivec included). So not only is it not PowerPC compatible, but supporting it is dangerous, because an e500v2 can execute instructions that mean something completely different in the standard ISA (and vice versa for PowerPC processors executing e500v2 binaries).

BTW, the A1222 isn't an Amiga: it's an AmigaOne.
Quote:
2. AMD Bobcat has 128-bit SSE1/SSE2 instruction set with 64-bit wide SIMD hardware. Jaguar has a 256-bit AVX instruction set with 128-bit SIMD hardware. AMD's approach keeps software compatibility with fatter X86 CPUs while delivering lower-cost CPUs.

I prefer Bobcat approach.

Absolutely. No contest: Bobcat wins hands-down compared to the crappy e500v2 core.

@Hammer Quote:

Hammer wrote:
@cdimauro

FYI, K7 Athlon's x87 FPU hardware is a fully superpipelined FPU despite the stack-based FPU instruction set. Internally, the K7 Athlon treats the x87 stack as a flat register file.

Intel's P6's x87 is partially pipelined when performing multiplies.

Being stack based X87 didn't stop AMD from designing a faster FPU. Micro-architecture design can influence performance.

This is implementation dependent, and you pay for aggressive x87 FPU performance with many more transistors and more area and power.

But the implementation cannot change the ISA, and coders and compilers must still use the FPU in a stack-based manner, which is a pain.

matthey 
Re: Amiga SIMD unit
Posted on 20-Oct-2020 2:29:15
#153 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2001
From: Kansas

@cdimauro
I see you found my SIMD thread shortly after an extended Amigaworld.net downtime.

Quote:

cdimauro wrote:
AMMX from the Apollo Core has nothing to do with Intel's MMX, except that it's a 64-bit integer-only SIMD.


AMMX having "Nothing to do with Intel's MMX" is a bit harsh. AMMX is documented as being closer to WMMX which is loosely based on MMX. There are even a few instructions with the same name as MMX uses.

paddb, paddw, paddusb, paddusw, psubb, psubw, psubusb, psubusw, packuswb, pcmpb, pcmpw

The MMX pmullw and pmulhw were renamed to pmull and pmulh respectively which is consistent with Gunnar's inconsistency. There are more MMX SIMD instructions and datatypes which could be added if there was room in the FPGA. Maybe the 'A' in AMMX stands for Axed MMX or Arm MMX (Poor MMX in German).

Quote:

Something is missing. SSE3/4 (several new instructions added), AVX2 (which introduced some nice features like Gather & FMA; 2013), LRBNi (LarraBee New instructions; 2009), KNI (Knights Corner New Instructions, 2011), AVX-512 (2013).


I only listed major SIMD advancements and I didn't think the above enhancements were as influential and/or popular. I went back to the first post and added the MC88110 GPU which is an SIMD unit for a GP CPU which predates the PA-RISC MAX. I did not add the Intel i860 which MMX is based off of because I don't consider it to be GP as a VLIW processor.

Quote:

Cache thrashing can be avoided using instructions to properly "mark" some memory regions as "non-temporal". x86/x64 also has some "NT" instructions which directly implement this behavior, without requiring ad-hoc instructions for marking some areas. My architecture goes a step further: any instruction with a memory reference can be marked as non-temporal (so there's no need for specific NT instructions).


A temporal hint bit in load/store encodings stops the cache thrashing and is interesting.

+ better code density, if there is a free bit in the load/store encodings
- earlier hint instruction may be able to reduce load latency by pre-buffering stream data

Quote:

Data is very often processed using the SIMD registers. This can clearly be seen by disassembling SIMD code. This is also the reason why SIMD extensions usually have many registers (Power increased it to 64 with VMX2, by also using the FPU registers).


The SIMD data is streamed from memory, processed and then streamed back to memory. Yes, registers are used during the processing.

SIMD RISC 2 variables: load reg, load reg, op reg to reg, store reg
SIMD CISC 2 variables: load reg, op mem to reg, store reg

SIMD RISC 3 variables: load reg, load reg, load reg, op reg to reg, op reg to reg, store reg
SIMD CISC 3 variables: load reg, op mem to reg, op mem to reg, store reg

More SIMD registers are useful for avoiding load latency as uncached data from loads can't be touched for many cycles without a load-use stall. Also, matrix math can reuse many variables.
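
The 2 and 3 variable sequences above generalize in an obvious way. A small Python sketch, assuming one load per source on the load/store machine and one register-memory op per extra source on the CISC machine (`risc_sequence` and `cisc_sequence` are names made up for this illustration):

```python
def risc_sequence(n_vars):
    """Load/store style: every source is loaded before it can be used."""
    return (["load reg"] * n_vars
            + ["op reg to reg"] * (n_vars - 1)
            + ["store reg"])

def cisc_sequence(n_vars):
    """Register-memory style: only the first source needs a separate load."""
    return (["load reg"]
            + ["op mem to reg"] * (n_vars - 1)
            + ["store reg"])

for n in (2, 3, 4):
    print(n, len(risc_sequence(n)), len(cisc_sequence(n)))
```

Under these assumptions the counts are 2n instructions for the load/store form and n+1 for the register-memory form. This says nothing about cycles: the memory operand still has to be fetched, so load latency is hidden in the op rather than removed.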


Quote:

That's true, but this can be overcome by using some flags to "re-use/encode" the existing opcode space in a much better way.

AFAIR the Apollo Core has a specific flag on SR which signals the new execution mode / new features enabled. Unfortunately they used this possibility in a wrong way, because they didn't take the chance to re-encode the opcode space.
A very long time ago (in the amigacoding.de forum, which unfortunately is gone with its rich knowledge base) I suggested to use it to re-use the F-line for SIMD scalar instructions (completely removing the coprocessors support, which is anachronistic nowadays) and the A-line for the equivalent packed instructions: this would have opened the possibility to define a much more powerful (and easier to decode) opcode structure & SIMD ISA.


ARM and the 88k had similar co-processor support. The F-line co-processor ID really doesn't take that many encoding bits as it can be considered part of the opcode. A-line I use for MOVE.Q (like x86-64 MOVQ). I do re-encode for a 68k 64 bit mode though. The advantages of 64 bit sizes for ops (2 bits for b/w/l/q size), especially immediates (mostly compressed to 16 bits), and PC relative write support is too compelling not to. It allows for slimmed down 64 bit only cores with significantly better code density than x86-64.

Quote:

That's not true. Please take a look at the benchmarks (real applications: not synthetic tests) which clearly show the advantage of using SIMD code. There are many open source applications that can be recompiled using just the regular FPU, or any SIMD unit. Phoronix often publishes benchmarks like that.


From your article with Photoshop instruction counts which I would expect to use the SIMD often, the x86 did not have any SIMD instructions in the top 19 most used instructions which stopped with instructions used .42% of the time or less. The x86-64 Photoshop disassembly came up with 2 SIMD instructions in the top 19, MOVAPS 1.04% and MOVSS 0.88% but these are obviously because of the use of the SIMD unit as a FPU which you note in the following translation.

"Finally, the MOVAPS and MOVSS instructions denote the use of the SIMD unit instead of the FPU, which on x64 is, on the other hand, scarcely used."

Certainly as a static percentage, parallel (non-scalar) SIMD instructions are rare and most of those instructions are probably used for data movement and corner cases. Do x86-64 PADD (parallel add) instructions even break .25% of the static instruction count for Photoshop? What do you consider rare?

Quote:

And the argument: "let's use the GPU instead" is not valid. Yes, many workloads can be offloaded to them, but it's not a general rule that can be applied to every scenario.
Offloading tasks to the GPU requires memory allocation in the GPU, transferring the data to it, then waiting for the GPU to complete the tasks, moving the data back to the system/CPU memory, and finally freeing the GPU memory buffer. This "round-trip" can take very long, and it only makes sense if you have a huge amount of data which can justify the big overhead which I've just reported.
Last but not really least, there's some non-massive number crunching where SIMD instructions can be used to speed-up some "integer/scalar" algorithm.


A Heterogeneous System Architecture (HSA) eliminates much of the GPU transfer overhead and the performance has been good in console GPUs and AMD APUs. There is some overhead with the communication between CPU and GPU cores necessary to keep cache coherency. Removing the GPU cores and using CPU cores for non-fixed pipeline rendering simplifies cache coherency. The CPU cores need to be slimmed down for the amount of parallel processing necessary. This is likely adequate as a low end embedded GPU but a few stronger cores are likely necessary for a higher end system for better single thread performance necessary for modern software performance and playing games. Why didn't the Knights Landing/Mill add 4 strong x86_64 cores?

Quote:

Now to answer the poll: it's a clear No. I think that it's quite evident to any architecture expert/passionate that the AMMX implementation is the worst ever made: they decided to share the data and FPU registers with the new SIMD ones! This partial overlapping of such kind of (completely) different register sets is simply crazy.
The reason for this was that... context switching was faster. Another "design decision" (!) made by people who have a very limited vision, which is just "implementation-centric", and specific to the current FPGA implementation.
The same thing happened for the added instructions: they are just filling some holes left open by Motorola.
Finally, they also added the so-called "BANK" instruction which is just a prefix (yes: exactly like x86/x64!) used to "enable" 64-bits and/or the access to the new registers. The very bad thing is that on a 16-bit opcode size ISA like the 68K this means greatly reducing the code density, which was the great advantage of this microprocessor family.


Keeping the context switching overhead low is important for embedded use and CPU core 3D rendering. More SIMD registers are important for 3D rendering performance though too.

Registers with register bank overhead are not the best but the 68k 16 bit instructions becoming 32 bit instructions is no worse than RISC 32 bit fixed length instructions like PPC and AArch64. There is still a code density advantage as long as the 16 bit instructions are much more common which is likely for the integer units anyway. The x86-64 prefix overhead is less but has to be used too much. For example, only 8 integer registers can be used before a prefix and most 64 bit instructions need one. A 64 bit 68k can use 16 integer registers before a prefix and 64 bit instructions can still be 16 bit in length.
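
The code density argument can be put into numbers with a short Python sketch. The instruction mix below is purely hypothetical (not measured from any 68k binary), the prefix is assumed to be one 16-bit word, and the helper names are made up for this illustration. It compares average bytes per instruction only, not instructions per program.

```python
def average_length(mix):
    """Average instruction length in bytes for a {length_bytes: share} mix."""
    total = sum(mix.values())
    return sum(length * share for length, share in mix.items()) / total

def with_prefix(mix, prefixed_share, prefix_bytes=2):
    """Average length when a share of all instructions gains a prefix word."""
    return average_length(mix) + prefixed_share * prefix_bytes

def break_even_share(mix, fixed_bytes=4, prefix_bytes=2):
    """Prefixed share at which the average reaches a fixed-length encoding."""
    return (fixed_bytes - average_length(mix)) / prefix_bytes

# Hypothetical mix: 60% 16-bit, 30% 32-bit, 10% 48-bit instructions.
mix = {2: 0.6, 4: 0.3, 6: 0.1}
print(average_length(mix))        # average bytes per instruction, no prefixes
print(with_prefix(mix, 0.10))     # same mix with 10% of instructions prefixed
print(break_even_share(mix))      # prefixed share that matches 4-byte RISC
```

With this assumed mix, prefixes only erase the advantage over a fixed 32-bit encoding once about half of all instructions carry one. A real comparison would also have to account for the different number of instructions each ISA needs for the same program.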

Hammer 
Re: Amiga SIMD unit
Posted on 20-Oct-2020 3:02:05
#154 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5275
From: Australia

@cdimauro
Quote:

No, Larrabee (and the first two Xeon Phis) was crippled on purpose, because it was born just as a coprocessor, as I said before. So, Intel decided to remove some instructions from the ISA, only because of that.

The early AMD64 and EM64T had different issues (implementation not ready in time for Windows XP x64 support).

My comments were against Larrabee being just Pentium P54 with 512-bit SIMD attached.

Pentium P54 has 32-bit ALUs while Larrabee has 64-bit ALUs. I'm aware that Larrabee is a kitbashed X86-64 fork.

Larrabee is a failed product but AVX-512 is included in mass-produced shipping SKUs e.g. Skylake-SP, Skylake-X, Cannon Lake, Cascade Lake, Cooper Lake, Ice Lake, and Tiger Lake
Intel needs to stabilize AVX-512's optional instruction set.

For Intel dGPU with PC gaming, there's Xe-HPG SKU.

Centaur Technology "CNS" core (8C/8T) has support for AVX-512FCD (+D/VL/BW/DQ/IFMA/VBM) before AMD.

It looks like X64 CPU market will have healthy competition.



Quote:

This is implementation dependent, and you must pay the aggressive x87 FPU performances with much more transistors/area/power.

Not a major issue in X86's core markets.

Quote:

But the implementation cannot change the ISA, and coders and compilers should use the FPU in a stack-based manner, which is a pain.

Beyond FP32, there's the FP64 alternative from SSE2 and AVX. On Zen, X87 runs on the same SIMD units anyway.


I have read https://www.nxp.com/docs/en/application-note/AN3531.pdf
PPC e500 based CPU core's 64-bit SIMD recycled GPR instead of FP registers or VMX registers.
e500 does not implement PPC's FPR and VMX registers. e500 has the potential to be an isolated architecture.

Last edited by Hammer on 20-Oct-2020 at 04:21 AM.
Last edited by Hammer on 20-Oct-2020 at 04:12 AM.
Last edited by Hammer on 20-Oct-2020 at 03:18 AM.
Last edited by Hammer on 20-Oct-2020 at 03:12 AM.
Last edited by Hammer on 20-Oct-2020 at 03:04 AM.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

Hammer 
Re: Amiga SIMD unit
Posted on 20-Oct-2020 3:31:27
#155 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5275
From: Australia

@cdimauro
Quote:

So, a 68K-inspired ISA might be interesting. And there's plenty of room for getting a nice design (more data registers, some extra address register, new vector-length agnostic ISA) while keeping more or less the same cool features (better code size, lower memory-traffic, lower amount of executed instructions).

ColdFire was an attempt to clean up 68K and the Amiga market wasn't interested. AC68080's sales were pretty good for a zombie platform.

My primary selection for AC68080 is something faster than my former A3000's 68030 for my near mint A1200.

Wicher 508's 68000 at 50 MHz for my A500 is close to the A3000's 68030 at 25 MHz in SysInfo.

Last edited by Hammer on 20-Oct-2020 at 03:33 AM.

cdimauro 
Re: Amiga SIMD unit
Posted on 20-Oct-2020 6:10:56
#156 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:


matthey wrote:
@cdimauro
I see you found my SIMD thread shortly after an extended Amigaworld.net downtime.

I had already been lurking for a long time, but I had no time to write anything before.
Quote:
AMMX having "Nothing to do with Intel's MMX" is a bit harsh. AMMX is documented as being closer to WMMX which is loosely based on MMX. There are even a few instructions with the same name as MMX uses.

paddb, paddw, paddusb, paddusw, psubb, psubw, psubusb, psubusw, packuswb, pcmpb, pcmpw

That's expected because they are pretty normal for a SIMD unit with integer data types support. Not enough to be called MMX, IMO.
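
For readers who don't know the mnemonics: the "us" variants saturate instead of wrapping around. A minimal Python sketch of the lane-wise behavior, following the MMX definition of paddb/paddusb on eight 8-bit lanes packed in a 64-bit value. That AMMX behaves the same way is an assumption based only on the shared names.

```python
def paddb(a, b):
    """Wraparound add of eight unsigned 8-bit lanes packed in 64-bit ints."""
    result = 0
    for lane in range(8):
        shift = lane * 8
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        result |= ((x + y) & 0xFF) << shift   # carry out of the lane is dropped
    return result

def paddusb(a, b):
    """Unsigned saturating add of eight 8-bit lanes packed in 64-bit ints."""
    result = 0
    for lane in range(8):
        shift = lane * 8
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        result |= min(x + y, 0xFF) << shift   # clamp at 255 instead of wrapping
    return result

print(hex(paddb(0xFF00, 0x0101)))
print(hex(paddusb(0xFF00, 0x0101)))
```

In both cases no carry crosses a lane boundary, which is the whole difference from a plain 64-bit integer add. Saturation is what makes these instructions handy for pixel data, where wrapping from white back to black is never wanted.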
Quote:
The MMX pmullw and pmulhw were renamed to pmull and pmulh respectively which is consistent with Gunnar's inconsistency. There are more MMX SIMD instructions and datatypes which could be added if there was room in the FPGA. Maybe the 'A' in AMMX stands for Axed MMX or Arm MMX (Poor MMX in German).

Indeed. We know how "good" the guy is at designing ISAs...

BTW, there should be plenty of room in the FPGA. There was no room for an FPU, and then one appeared (after lots of criticism). There was no room for an MMU, and then it was discovered that one was there (albeit not the right one: he hates Paged-MMUs). And more instructions and features are added from time to time. There's always room for things which HE likes to add to the core...
Quote:
Cache trashing can be avoided using instructions to properly "mark" some memory regions as "non-temporal". x86/x64 has also some "NT" instructions which directly implement this behavior, without requiring to use ad-hoc instructions for marking some areas. My architecture goes a step further: any instruction with memory reference can be marked as non-temporal (so, there's no need for specific NT instructions).

A temporal hint bit in load/store encodings stops the cache trashing and is interesting.

+ better code density, if there is a free bit in the load/store encodings
- earlier hint instruction may be able to reduce load latency by pre-buffering stream data

Unfortunately there's a price to pay in my case: longer instructions, because it makes use of my instruction extender (not a prefix, but a clever mechanism). I can "extend" the behavior of any instruction (not the compressed ones) with some interesting features, but code density is affected. It's good anyway, because at least the number of executed instructions is reduced.

I kept all x86/x64 NT instructions as they are, so there's no code size penalty in those cases. A good compromise.
Quote:
ARM and the 88k had similar co-processor support.

Which ARM dropped with AArch64, to recover important space in the opcode (which was needed, with 32 registers and a lot of features/instructions to implement).
Quote:
The F-line co-processor ID really doesn't take that many encoding bits as it can be considered part of the opcode.

3-bits there are extremely important in that place. You'll see once you try to define a (modern) SIMD unit for the 68K. Worst case, if you still had the chance to pack the most important things, you can use them for defining a mask register.
Quote:
A-line I use for MOVE.Q (like x86-64 MOVQ). I do re-encode for a 68k 64 bit mode though.

I remember, and for your 64-bit 68K extension it is a good choice.
Quote:
The advantages of 64 bit sizes for ops (2 bits for b/w/l/q size), especially immediates (mostly compressed to 16 bits), and PC relative write support is too compelling not to. It allows for slimmed down 64 bit only cores with significantly better code density than x86-64.

Right, but it's not a surprise: x86-64 really sucks at code density. Did you have the chance to run some benchmarks with your extension, to see how it competes with the other ISAs (AArch64 and RISC-V/64, in particular)?
Quote:
From your article with Photoshop instruction counts which I would expect to use the SIMD often, the x86 did not have any SIMD instructions in the top 19 most used instructions which stopped with instructions used .42% of the time or less. The x86-64 Photoshop disassembly came up with 2 SIMD instructions in the top 19, MOVAPS 1.04% and MOVSS 0.88% but these are obviously because of the use of the SIMD unit as a FPU which you note in the following translation.

"Finally, the MOVAPS and MOVSS instructions denote the use of the SIMD unit instead of the FPU, which on x64 is, on the other hand, scarcely used."

Certainly as a static percentage, parallel (non-scalar) SIMD instructions are rare and most of those instructions are probably used for data movement and corner cases. Do x86-64 PADD (parallel add) instructions even break .25% of the static instruction count for Photoshop? What do you consider rare?

Yes, this needs a clarification: I was talking about the FPU, which is very rarely used on x64 binaries like the one which I've disassembled for my statistics.

However, pay attention that I haven't said anywhere in the article, nor here, that an application which makes intensive use of SIMD instructions should have a lot of them. I was just (and only) talking about the performance gain from their usage.

Another important thing: unfortunately my disassembler doesn't disassemble all instructions in a binary. It just starts from the executable entry point, and then disassembles as many instructions as possible, scraping addresses from CALL/JMP/Jcc instructions. This, unfortunately, means that only a very small percentage of the binary is disassembled, which was good enough for my purposes (having some benchmarks / statistics to see "a trend" for my ISA, compared to its direct competitors).
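
The approach described here is usually called recursive traversal disassembly. A minimal Python sketch on a toy program follows. This is not the actual tool: the program format, the one-unit instruction size and the `reachable` helper are all invented for illustration, and following fall-through after CALL/Jcc is an assumption.

```python
def reachable(program, entry):
    """Addresses reached from entry by following direct branch targets.

    program maps address -> (mnemonic, direct_target_or_None); every toy
    instruction occupies exactly one address unit.
    """
    seen = set()
    work = [entry]
    while work:
        addr = work.pop()
        while addr in program and addr not in seen:
            seen.add(addr)
            op, target = program[addr]
            if op in ("CALL", "JMP", "JCC") and target is not None:
                work.append(target)      # direct target: decode it later
            if op in ("JMP", "RET"):
                break                    # no fall-through after these
            addr += 1
    return seen

program = {
    0: ("CALL", 4),
    1: ("JCC", 3),
    2: ("JMP", None),   # indirect jump: target unknown statically
    3: ("RET", None),
    4: ("NOP", None),
    5: ("RET", None),
    6: ("NOP", None),   # only reachable through the indirect jump
    7: ("RET", None),
}

found = reachable(program, 0)
print(sorted(found), len(found) / len(program))
```

Addresses 6 and 7 are never decoded because nothing points to them directly. In real binaries, jump tables, virtual calls and callbacks play the role of that indirect jump, which is one plausible reason why only a small part of a large executable gets covered.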

I stopped with this approach because I need real-world benchmarks, and for this a real compiler is needed. I started looking at LLVM (and writing something), but it's an ENORMOUS task. My ISA is definitely much better as opcode structure(s), so way easier than x86/x64 (which is a monster on LLVM: almost 8MB of code only for the backend), but it's also a superset of those ISAs, so there's too much stuff to be implemented (including my "extender" mechanism which is something novel).
Quote:
A Heterogeneous System Architecture (HSA) eliminates much of the GPU transfer overhead and the performance has been good in console GPUs and AMD APUs. There is some overhead with the communication between CPU and GPU cores necessary to keep cache coherency. Removing the GPU cores and using CPU cores for non-fixed pipeline rendering simplifies cache coherency. The CPU cores need to be slimmed down for the amount of parallel processing necessary. This is likely adequate as a low end embedded GPU but a few stronger cores are likely necessary for a higher end system for better single thread performance necessary for modern software performance and playing games.

Indeed: a lot of SIMD code is still used on the CPUs, even if the GPUs are taking the biggest computational part.

HSA can help, but only on the consoles, because it's specialized hardware tailored for videogames. For the rest HSA isn't general purpose enough (as I said before, it cannot be used to completely replace the SIMD units in EVERY scenario), and discrete GPUs offer a lot more processing power (albeit with the round-trip costs).
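
The round-trip argument can be written as a back-of-the-envelope model. The Python sketch below assumes a fixed setup cost (allocation, launch, synchronization), a transfer in each direction, and constant throughput on both sides. Every number in it is invented for illustration and the function names are hypothetical.

```python
def cpu_time(n, cpu_rate):
    """Seconds to process n items with SIMD code on the CPU."""
    return n / cpu_rate

def gpu_time(n, gpu_rate, fixed_overhead, transfer_rate):
    """Seconds to offload n items: setup + copy in + compute + copy out."""
    return fixed_overhead + 2 * n / transfer_rate + n / gpu_rate

def break_even(cpu_rate, gpu_rate, fixed_overhead, transfer_rate):
    """Item count above which offloading wins, or None if it never does."""
    per_item_gain = 1 / cpu_rate - 1 / gpu_rate - 2 / transfer_rate
    if per_item_gain <= 0:
        return None
    return fixed_overhead / per_item_gain

# Made-up rates in items per second and a made-up 100 microsecond setup cost.
n_star = break_even(cpu_rate=1e9, gpu_rate=2e10,
                    fixed_overhead=1e-4, transfer_rate=8e9)
print(n_star)
```

Below the break-even size the CPU's SIMD unit wins even though the GPU is much faster per item, and if the transfer is slow enough the offload never pays off at all. A shared-memory design in the HSA style would mostly shrink the transfer term in this model, not the fixed overhead.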
Quote:
Why didn't the Knights Landing/Mill add 4 strong x86_64 cores?

Because it wasn't and isn't needed. Xeon Phi's domain is HPC, where weak cores (in-order, but with 4-8 way HyperThreading) with massive processing power (e.g.: huge vector units) are the best solution. I wonder why Intel moved to an OoO design starting from Knights Landing...
Quote:
Keeping the context switching overhead low is important for embedded use and CPU core 3D rendering. More SIMD registers are important for 3D rendering performance though too.

3D rendering doesn't match well with embedded use, and the Apollo core is clearly targeting a more "desktop" market / usage, where high performance is needed.

Yes, lowering context switch overhead is always good, but the Apollo Core design is really ridiculous: registers of THREE different domains (GP, FPU, SIMD) which are partially shared. Never saw a design worse than this...

If you want to be cheap, just re-use / share one of them: either the GP domain (NOT recommended for a CISC, as I've explained in a previous comment) or the FPU one (which fits better).
Quote:
Registers with register bank overhead are not the best but the 68k 16 bit instructions becoming 32 bit instructions is no worse than RISC 32 bit fixed length instructions like PPC and AArch64. There is still a code density advantage as long as the 16 bit instructions are much more common which is likely for the integer units anyway.

Yes, but you're extending to 32-bit only the 16-bit opcodes. The 68K has many other 32-bit opcodes, which then will be extended to 48-bit.

It's true that RISCs have 32-bit opcodes, but they are better organized.
Quote:
The x86-64 prefix overhead is less but has to be used too much. For example, only 8 integer registers can be used before a prefix and most 64 bit instructions need one.

I know, and as I said, x64 is very crap from this point-of-view. My ISA has much better code density and it can use 32 GP registers and 64 SIMD registers almost always. Unfortunately x64 was a quick and dirty extension of x86...
Quote:
A 64 bit 68k can use 16 integer registers before a prefix and 64 bit instructions can still be 16 bit in length.

True. And a better reimplementation can do the same most of the time with 16 data registers, 8 address registers, and a separate SP and possibly FP (frame pointer).

cdimauro 
Re: Amiga SIMD unit
Posted on 20-Oct-2020 6:16:34
#157 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Hammer Quote:

Hammer wrote:

My comments were against Larrabee being just Pentium P54 with 512-bit SIMD attached.

Pentium P54 has 32-bit ALUs while Larrabee has 64bit ALUs.

And I had nothing to say against that.
Quote:
I'm aware of Larrabee has kitbashed X86-64 fork.

Larrabee is a failed product but AVX-512 is included in mass-produced shipping SKUs e.g. Skylake-SP, Skylake-X, Cannon Lake, Cascade Lake, Cooper Lake, Ice Lake, and Tiger Lake
Intel needs to stabilize AVX-512's optional instruction set.

For Intel dGPU with PC gaming, there's Xe-HPG SKU.

Centaur Technology "CNS" core (8C/8T) has support for AVX-512FCD (+D/VL/BW/DQ/IFMA/VBM) before AMD.

It looks like X64 CPU market will have healthy competition.

That's true and welcome. I really like AVX-512, and I hope that its usage increases a lot.
Quote:
Beyond FP32, there's FP64 alterative from SSE2 and AVX. On Zen, X87 runs on the same SIMD units anyway.

Yes, but old applications still have to use the FPU in a stack-based manner. And the same goes for compilers and coders. x87 is still widely used...
Quote:
I have read https://www.nxp.com/docs/en/application-note/AN3531.pdf
PPC e500 based CPU core's 64-bit SIMD recycled GPR instead of FP registers or VMX registers.
e500 does not implement PPC's FPR and VMX registers. e500 has the potential to be isolated architecture.

No, it's not isolated. As I said before, this core is re-using some instructions on the regular PowerPC ISA.

@Hammer Quote:

Hammer wrote:

ColdFire was an attempt to clean up 68K and the Amiga market wasn't interested.

Because it was born for the embedded market, with a lot of compromises.
Quote:
AC68080's sales were pretty good for a zombie platform.

A few thousand boards sold still means that the platform is a zombie.
Quote:
My primary selection for AC68080 is something faster than my former A3000's 68030 for my near mint A1200.

Wicher 508's 68000 at 50 Mhz for my A500 is close to A3000's 68030 25Mhz in SysInfo

Nothing to say about that: the Apollo Core is the clear leader on the 68K HARDWARE market.

But it's still a nano-niche market...

Hammer 
Re: Amiga SIMD unit
Posted on 20-Oct-2020 12:57:29
#158 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5275
From: Australia

@cdimauro
Quote:

No, it's not isolated. As I said before, this core is re-using some instructions on the regular PowerPC ISA.

Wii's custom 64bit SIMD was isolated from the mainline PowerPC/Power64 SKUs and it died.

CELL's SPU instruction set died a similar way since it's not supported by mainline PowerPC/Power64 SKUs. It was a waste of time.

If I'm going to buy PowerPC/Power64 based system (e.g. business has IBM Power9 servers) for my own use, the IBM Power9 4C/16T + Blackbird Raptor motherboard bundle is nice. It's closest to my Ryzen 9 3900X + ASUS ROG X570 combo.

If only IBM Power9 4C/16T + Blackbird Raptor motherboard bundle's price matched Ryzen 9 3900X + ASUS ROG Strix X570 combo price.

PowerPC e600 has support for Altivec and FPR (PowerPC FP registers), hence e500 based CPU's 64-bit SIMD is a waste of time.

Developers for AmigaOS 4.1 can create an abstraction layer like OpenCL that runs on either e500's custom 64-bit SIMD or mainline PowerPC's 128-bit Altivec SIMD.

For AVX-512 exploration, I have Core i7-7820X Skylake X + ASUS ROG Strix X299 combo. PS3 emulator has some support for AVX-512.

For the Intel side, I'm waiting for Rocket Lake or the product after Rocket Lake.





Last edited by Hammer on 20-Oct-2020 at 01:16 PM.
Last edited by Hammer on 20-Oct-2020 at 01:01 PM.

Hammer 
Re: Amiga SIMD unit
Posted on 20-Oct-2020 13:26:58
#159 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5275
From: Australia

@cdimauro
Quote:

HSA can help, but only on the consoles, because it's specialized hardware tailored for videogames. For the rest HSA isn't general purpose enough (as I said before, it cannot be used to completely replace the SIMD units in EVERY scenario), and discrete GPUs offer a lot more processing power (albeit with the round-trip costs).

From
https://www.extremetech.com/gaming/274805-amd-announces-new-custom-apu-for-chinese-game-consoles

Note that AMD created an APU with quad-core / eight-thread Ryzen clocked at 3GHz plus a 24 CU Radeon Vega Graphics solution (1536 cores) with a 256-bit GDDR5 interface for a Chinese platform customer. The SoC connects to a mainboard with 8GB of GDDR5. This APU can run normal Windows OS.


https://www.youtube.com/watch?v=x0KSJg2sqJM
(Digital Foundry): Subor Z Plus Chinese PC/Console Hybrid - Ryzen+Vega AMD Analysis!


Xbox Series S's APU has eight-cores / 16-thread Ryzen Zen 2 clocked at 3.5GHz plus a 20 CU RDNA 2 GPU.

Last edited by Hammer on 20-Oct-2020 at 01:27 PM.

cdimauro 
Re: Amiga SIMD unit
Posted on 20-Oct-2020 22:30:25
#160 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Hammer Quote:

Hammer wrote:
Wii's custom 64bit SIMD was isolated from the mainline PowerPC/Power64 SKUs and it died.

CELL's SPU instruction set died a similar way since it's not supported by mainline PowerPC/Power64 SKUs. It was a waste of time.

I've absolutely no problem with custom solutions which will stay isolated in their universe.
Quote:
If I'm going to buy PowerPC/Power64 based system (e.g. business has IBM Power9 servers) for my own use, the IBM Power9 4C/16T + Blackbird Raptor motherboard bundle is nice. It's closest to my Ryzen 9 3900X + ASUS ROG X570 combo.

If only IBM Power9 4C/16T + Blackbird Raptor motherboard bundle's price matched Ryzen 9 3900X + ASUS ROG Strix X570 combo price.

Is there any reason why you wanted to buy this Power9 system? Because price and performance wise it's way inferior to the Ryzen.
Quote:
PowerPC e600 has support for Altivec and FPR (PowerPC FP registers), hence e500 based CPU's 64-bit SIMD is a waste of time.

Developers for AmigaOS 4.1 can create an abstraction layer like OpenCL that ran on both e500's custom 64 bit SIMD or mainline PowerPC's 128 bit Altivec SIMD.

No, this isn't possible: you cannot abstract/virtualize the differences between the e500v2 core and all other PowerPC-compliant cores.

The only way would be to analyze the executables at load time, before running them, to see if there are instructions overlapping in the opcode space, and to try to patch the executable in memory to avoid problems. Which, as you can imagine, is difficult to implement and kills performance.
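
A rough Python sketch of the first half of that idea, the load-time scan. It assumes the conflicting encodings sit under PowerPC primary opcode 4 (the top 6 bits of each 32-bit big-endian word), which is where, to my understanding, both the AltiVec and the e500 SPE instructions live. The function names are hypothetical, and a linear scan like this cannot tell code from embedded data.

```python
import struct

def primary_opcode(word):
    """Top 6 bits of a 32-bit PowerPC instruction word."""
    return word >> 26

def find_opcode4(code):
    """Byte offsets of words whose primary opcode is 4 (assumed overlap area)."""
    count = len(code) // 4
    words = struct.unpack(">%dI" % count, code[:count * 4])
    return [i * 4 for i, w in enumerate(words) if primary_opcode(w) == 4]

# Three example words: mflr r0, a word under primary opcode 4, and blr.
code = struct.pack(">3I", 0x7C0802A6, 0x10000000, 0x4E800020)
print(find_opcode4(code))
```

Finding the candidates is the easy part. Deciding which extension a flagged word was meant for, and replacing it with an equivalent sequence that fits in the same space, is where the difficulty and the performance cost mentioned above come from.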
Quote:
For AVX-512 exploration, I have Core i7-7820X Skylake X + ASUS ROG Strix X299 combo. PS3 emulator has some support for AVX-512.

Lucky you! I'm waiting for a more mainstream AVX-512 capable CPU.

I'm surprised that emulators are already supporting this new SIMD. Impressive...
Quote:
For the Intel side, I'm waiting for Rocket Lake or the product after Rocket Lake.

Rocket Lake should be a very good product: a consistent IPC improvement + the high frequencies of the best 14nm process.

Quote:
Hammer wrote:
From
https://www.extremetech.com/gaming/274805-amd-announces-new-custom-apu-for-chinese-game-consoles

Note that AMD created an APU with quad-core / eight-thread Ryzen clocked at 3GHz plus a 24 CU Radeon Vega Graphics solution (1536 cores) with a 256-bit GDDR5 interface for a Chinese platform customer. The SoC connects to a mainboard with 8GB of GDDR5. This APU can run normal WIndows OS.

https://www.youtube.com/watch?v=x0KSJg2sqJM
(Digital Foundry): Subor Z Plus Chinese PC/Console Hybrid - Ryzen+Vega AMD Analysis!

Makes sense for AMD: it already has everything in-house, with Sony's and Microsoft's consoles.

China is a HUGE market, and a low-cost "variant" can sell a lot.
Quote:
Xbox Series S's APU has eight-cores / 16-thread Ryzen Zen 2 clocked at 3.5GHz plus a 20 CU RDNA 2 GPU.

Microsoft did a very good job this time, largely surpassing Sony on the hardware.

Copyright (C) 2000 - 2019 Amigaworld.net.
Amigaworld.net was originally founded by David Doyle