Amigaworld.net - The Amiga Computer Community Portal Website

Amiga SIMD unit

matthey

Amiga SIMD unit
Posted on 7-Aug-2020 23:30:19

[ #1 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2751
From: Kansas

There are many misconceptions about the SIMD unit. I am no expert, but there are others who may be able to contribute as well. We will start by looking at the history and basic features. Then we will look at MMX, which the Apollo Core adopted as a standard. Finally, some questions from another thread will be answered.

History
MC88110 GPU (coprocessor like SFU extension) 1992
MAX - HP PA-RISC PA-7100LC 1994
MAX-2 - HP PA-RISC PA-8000 1996
MMX - Intel Pentium MMX 1997
SSE - Intel Pentium 3 1999
Altivec - Motorola PPC 7400 (G4) 1999
SSE2 - Intel Pentium 4 2000
Neon - ARM1136J (ARMv6) 2002
WMMX - Intel XScale PXA270 2004
AVX - Intel Sandy Bridge CPUs 2011
AArch64 - Apple A7 (iPhone 5S) 2013

Basic features when introduced
MC88110 GPU - 32x32b int regs (64b SIMD ops using reg pairs); int4x16, int8x8, int16x4, int32x2, uint4x16, uint8x8, uint16x4, uint32x2
MAX - 32x32b int regs; int16x2, uint16x2
MAX-2 - 32x64b int regs; int16x4, uint16x4
MMX - 8x64b regs shared with FPU; int8x8, int16x4, int32x2, uint8x8, uint16x4, uint32x2
SSE - 8x128b regs; fp32x4
Altivec/VMX - 32x128b regs; int8x16, int16x8, int32x4, uint8x16, uint16x8, uint32x4, fp32x4
SSE2 - 16x128b regs; int8x16, int16x8, int32x4, uint8x16, uint16x8, uint32x4, fp32x4, fp64x2
Neon - 16x128b regs shared with FPU; int8x16, int16x8, int32x4, uint8x16, uint16x8, uint32x4, fp32x4
WMMX - 16x64b regs; int8x8, int16x4, int32x2, uint8x8, uint16x4, uint32x2
AVX - 16x256b regs; fp32x8, fp64x4
AArch64 - 32x128b regs shared with FPU; int8x16, int16x8, int32x4, uint8x16, uint16x8, uint32x4, fp32x4, fp64x2

SIMD instruction research came about from VLIW processor parallelism research. The MC88110 Graphics Processing Unit (GPU) may have been the first SIMD extension to a general purpose CPU architecture ISA. It was implemented using the integer registers, which were 32 bits wide, but the SIMD operations were on 64 bits of data in register pairs, supporting 4, 8, 16 or 32 bit signed or unsigned datatypes with or without saturation. HP PA-RISC MAX was also implemented using the 32 bit wide integer registers and only supported 16 bit datatypes. MAX-2 was added when PA-RISC became 64 bit, expanding the SIMD register width to 64 bits. Performing SIMD operations on the integer registers is also how the Apollo Core implements SIMD operations. Advantages are small implementation size (MAX-2 took 0.1% of the silicon area of the PA-8000) and reduced shuffling of data between registers. Disadvantages are wasted integer register file space if expanding the width of SIMD registers beyond 64 bits and possibly slower integer instructions if floating point support is added. PA-RISC was the planned GPU for the C= Hombre. A while ago, I did a rough comparison of MPEG performance between the 68060 and PA-RISC processors where the 68060 outperformed early PA-RISC processors and held its own even against MAX accelerated PA-RISC.
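
To make the packed datatypes and saturation concrete, here is a minimal C sketch of an unsigned 8 bit x 8 add on a 64 bit integer, with and without saturation. It loops over the lanes in software purely for illustration; a real SIMD unit handles all lanes in one operation. The names add_u8x8_wrap, add_u8x8_sat and check_add_u8x8 are made up for this example and are not MC88110, MAX or AMMX instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: add eight unsigned 8 bit lanes packed into a
   64 bit word. Each lane wraps around on overflow (modulo 256). */
uint64_t add_u8x8_wrap(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        unsigned x = (unsigned)(a >> (8 * lane)) & 0xFFu;
        unsigned y = (unsigned)(b >> (8 * lane)) & 0xFFu;
        unsigned sum = (x + y) & 0xFFu;      /* carry out of the lane is discarded */
        result |= (uint64_t)sum << (8 * lane);
    }
    return result;
}

/* Hypothetical helper: same add, but each lane saturates at 255
   instead of wrapping. */
uint64_t add_u8x8_sat(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        unsigned x = (unsigned)(a >> (8 * lane)) & 0xFFu;
        unsigned y = (unsigned)(b >> (8 * lane)) & 0xFFu;
        unsigned sum = x + y;
        if (sum > 0xFFu)
            sum = 0xFFu;                     /* clamp to the largest 8 bit value */
        result |= (uint64_t)sum << (8 * lane);
    }
    return result;
}

/* Small self check: lane 0 holds 0xF0 + 0x20, which overflows 8 bits,
   while lane 1 holds 0x01 + 0x01, which does not. */
void check_add_u8x8(void)
{
    assert(add_u8x8_wrap(UINT64_C(0x01F0), UINT64_C(0x0120)) == UINT64_C(0x0210));
    assert(add_u8x8_sat(UINT64_C(0x01F0), UINT64_C(0x0120)) == UINT64_C(0x02FF));
}
```

Saturation matters for things like pixel brightness, where wrapping from white back to black is worse than clamping at white.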

http://eab.abime.net/showpost.php?s=9a7c7ee9b88f1ebc4dde420b41148e94&p=1142968&postcount=22

The first MMX implementation left a lot to be desired. The FPU and SIMD unit could not be used at the same time, and SIMD usage was limited by the narrow register width, few registers, few instructions and integer only support. More datatypes were supported than on PA-RISC, which was trend setting, but not floating point. Early on, MMX helped clones manipulate 2D data like a blitter since GPUs performed poorly and were non-standard. This use declined as GPUs became more powerful, used fp more and gained 3D T&L, at which point the focus switched from integer operations to fp operations. The Apollo Core does not have hardware 3D support so it is using MMX in the same way as clones did years ago and to emulate the blitter. The AMMX manual states it is based on Wireless MMX (WMMX) which Intel used for their ARM XScale embedded chips. WMMX did not have the implementation flaws of the original MMX and doubled the number of registers.

Doubling the SIMD unit register width doubles the number of operations per SIMD instruction. This doubles the theoretical performance, but actual performance is limited by memory bandwidth, cache efficiency (often more about cache bypassing techniques) and data alignment and ordering. Data is often accessed in memory with no cache but using a small read buffer. A SIMD unit requires huge amounts of encoding space, and supporting outdated SIMD extensions can use a significant number of transistors. SIMD units often go largely unused and are not general purpose but can provide a large boost to performance in some cases. Like VLIW processors, SIMD units have high theoretical performance but actual performance can be a fraction of this.
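
The width arithmetic is simple enough to sketch in C. This only computes the peak number of elements per register, assuming the element size divides the register width evenly; it says nothing about real throughput, which the limits above decide. The names lanes_per_register and check_lanes are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: number of elements of elem_bits each that fit in
   one SIMD register of reg_bits. Returns 0 for input that does not
   divide evenly. */
unsigned lanes_per_register(unsigned reg_bits, unsigned elem_bits)
{
    if (elem_bits == 0 || reg_bits % elem_bits != 0)
        return 0;
    return reg_bits / elem_bits;
}

void check_lanes(void)
{
    assert(lanes_per_register(64, 16) == 4);    /* MMX style int16x4 */
    assert(lanes_per_register(128, 16) == 8);   /* doubling the width doubles the lanes */
    assert(lanes_per_register(128, 8) == 16);   /* halving the datatype doubles them again */
}
```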

Last edited by matthey on 19-Oct-2020 at 08:17 PM.
Last edited by matthey on 08-Aug-2020 at 04:39 PM.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 8-Aug-2020 0:32:12

[ #2 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2751
From: Kansas

Answers to some questions from the following thread.

https://amigaworld.net/modules/newbb/viewtopic.php?mode=viewtopic&topic_id=43830&forum=17&start=160&viewmode=flat&order=0

Quote:

Hypex wrote:
I don't like the idea of anything Intel inside 68K. You put X86 into 68K and then suddenly you've got a K86!

But, VMX may have not been much better, since it's RISC based. So is ARM. Guess there wasn't much else to base it on from the CISC world. Apart from building on the 68K with the common operations to match the 68K ISA and naming schemes.

A load/store SIMD unit extension can be converted to a reg-mem style and loses nothing. The encodings would be different but the extension is usually defined by the instructions. It makes sense to reuse an extension to make compiler support easier and reuse existing programmer knowledge. The Apollo Team talked about using Altivec.

AMMX is not true reg-mem either. Neither is the 68k FPU. They are reg-mem on load but load/store on store. Loads are more common than stores so this gets most of the advantages of reg-mem (fewer instructions and registers needed) while simplifying support.

Quote:

Hypex wrote:
I hope it won't be an Amiga without a Workbench. It wouldn't be the same. Sounds complicated if there is an MPU and MMU. Most MMU support on Amiga is through the MMU libraries. Thought it would have been easier just to add support for the 080 there. If the project is still open.

It won't be an Amiga but rather an AROS fork. Yes, another AmigaOS like split and flavor.

Quote:

I think I've seen that before. What's all the Ps mean in the instructions? V would make sense to me. Bank switching? Oh no! It's going to look like the palette registers in AGA. Ax, Bx; Dx, Ex? Not again no! He really is turning the Motorola 68K into an Intel K86!

'P' is for Parallel. SIMD instructions execute several operations in parallel. 'V' makes more sense to me because the 68k has a PMMU with instructions that start with 'P' (PFLUSH, PLOAD, PLPA, PMOVE, PSAVE, etc.). There is one AMMX instruction VPERM which starts with 'V'. The LOAD and STORE instructions don't start with a 'P' or 'V' and look out of place on the 68k. Like the 68k FMOVE for the FPU, I would prefer VMOVE or VMOV if making the ending 'E' optional to reduce typing. The LSLQ and LSRQ instructions look like quick forms of instructions and should be LSL.Q and LSR.Q (or VLSL.Q and VLSR.Q) where Q=Quad. MOVEX looks like it uses the X bit like ADDX, SUBX and NEGX but does an endian conversion. Naming and consistency seem to be a little of this with a little of that thrown in. That is part of the problem of using an existing SIMD unit extension but the naming issues are more than that too.

Status: Offline

MEGA_RJ_MICAL

Re: Amiga SIMD unit
Posted on 8-Aug-2020 5:17:50

[ #3 ]

Super Member

Joined: 13-Dec-2019
Posts: 1200
From: AMIGAWORLD.NET WAS ORIGINALLY FOUNDED BY DAVID DOYLE

matthey's keyboard

Last edited by MEGA_RJ_MICAL on 08-Aug-2020 at 05:29 AM.

_________________
I HAVE ABS OF STEEL
--
CAN YOU SEE ME? CAN YOU HEAR ME? OK FOR WORK

Status: Offline

Wol

Re: Amiga SIMD unit
Posted on 8-Aug-2020 7:47:58

[ #4 ]

Super Member

Joined: 8-Mar-2003
Posts: 1009
From: UK.......Sol 3.

@matthey

You forgot the Z80...

Wol.

_________________
It is my conviction that killing under the cloak of war is nothing but an act of murder.~Albert Einstein

Status: Offline

Fl@sh

Re: Amiga SIMD unit
Posted on 8-Aug-2020 13:36:47

[ #5 ]

Regular Member

Joined: 6-Oct-2004
Posts: 253
From: Napoli - Italy

@matthey

The G5 has AltiVec, not VMX.

IMHO the major issue with AltiVec is the lack of float64; it was resolved later in VMX along with other features.
The AltiVec/VMX SIMD extensions were the most complete and future proof, and don’t share registers with the integer/FPU units.

_________________
Pegasos II G4@1GHz 2GB Radeon 9250 256MB
AmigaOS4.1 fe - MorphOS - Debian 9 Jessie

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 8-Aug-2020 17:43:09

[ #6 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2751
From: Kansas

Quote:

Fl@sh wrote:
G5 has AltiVec no VMX.

IBM usually called the PPC/POWER SIMD unit VMX on chips they developed or that were licensed from them. Motorola/Freescale called the SIMD unit AltiVec and appears to have licensed the name to IBM for the G5. It looks like the G5 was released before IBM started calling the SIMD unit VMX but they reused this SIMD unit design and called it VMX later so they didn't have to license the AltiVec name. I believe the G5 was the first IBM PPC or POWER design to have a SIMD unit which in 2002 was late for the introduction of a SIMD unit. POWER did not get a SIMD unit until POWER6 in 2007.

Quote:

IMHO the major issue with AltiVec is the lack of float64; it was resolved later in VMX along with other features.
The AltiVec/VMX SIMD extensions were the most complete and future proof, and don’t share registers with the integer/FPU units.

POWER shares registers between the SIMD unit and FPU (as does IBM's z/architecture) and at least POWER8 and newer support 64 bit floating point vector operations.

https://www.ibm.com/support/pages/vectorizing-fun-and-performance

It looks like PPC designs do *not* share the SIMD unit registers with other units. While this is good for performance, it uses more power and area which likely resulted in the SIMD unit being dropped from lower performance PPC chips. The result is that POWER, x86_64 and AArch64 have standard SIMD units available on all chips while PPC chips often don't have a SIMD unit and the ones that do have inferior support. There is a balancing act to provide the most performance and features while using the least resources and power.

Last edited by matthey on 08-Aug-2020 at 08:44 PM.

Status: Offline

Hypex

Re: Amiga SIMD unit
Posted on 8-Aug-2020 17:47:45

[ #7 ]

Elite Member

Joined: 6-May-2007
Posts: 11351
From: Greensborough, Australia

@matthey

I ended up voting MMX in the end as it looked more suitable for 68K. The others looked a bit too good with 32 registers. One based on 8 registers and 64-bit looked closer to the mark.

I see AltiVec came a couple years late after MMX in 1997, but could match most SSE operations a year earlier than SSE, in 1999.

Quote:
A load/store SIMD unit extension can be converted to a reg-mem style and loses nothing. The encodings would be different but the extension is usually defined by the instructions. It makes sense to reuse an extension to make compiler support easier and reuse existing programmer knowledge. The Apollo Team talked about using Altivec.

A main difference would be in loading and storing with memory. Core operations would be in memory. I don't see why PPC can't have things like "mr" between GPR and FPU or vectors, aside from size difference. But, vectors do tend to operate on memory data. It's just cached in CPU when loaded in then stored later. Good for fast memory copy loading in registers at max width.

Quote:
AMMX is not true reg-mem either. Neither is the 68k FPU. They are reg-mem on load but load/store on store. Loads are more common than stores so this gets most of the advantages of reg-mem (fewer instructions and registers needed) while simplifying support.

Sorry, reg-mem on load but load/store on store? Where is the loading from? The store is reading then writing?

Quote:
'P' is for Parallel. SIMD instructions execute several operations in parallel. 'V' makes more sense to me because the 68k has a PMMU with instructions that start with 'P' (PFLUSH, PLOAD, PLPA, PMOVE, PSAVE, etc.).

Ah okay that makes sense. Yes good point about the other P codes. Perhaps this PMMU conflicts with their own MPU?

Quote:
The LOAD and STORE instructions don't start with a 'P' or 'V' and look out of place on the 68k. Like the 68k FMOVE for the FPU, I would prefer VMOVE or VMOV if making the ending 'E' optional to reduce typing.

Any LOAD or STORE seems strange on 68K. It tends to go through a MOVE. MOVE is probably the single most used instruction I wonder. I think VMOV suits x86 better. The mnemonics are like that for x86, but 68K is always spelled out, at least for MOVE that I've noticed.

Quote:
The LSLQ and LSRQ instructions look like quick forms of instructions and should be LSL.Q and LSR.Q (or VLSL.Q and VLSR.Q) where Q=Quad.

Yes I was reading that and wondered what the Quick was about. Though a shift from register was faster, it's all coded into instruction. It makes some sense if the value is 1 to 63.

Quote:
MOVEX looks like it uses the X bit like ADDX, SUBX and NEGX but does an endian conversion. Naming and consistency seem to be a little of this with a little of that thrown in. That is part of the problem of using an existing SIMD unit extension but the naming issues are more than that too.

It does. Suppose a MOVER could have worked for Reverse. Is MOVES for Swap. Or copy Intel and do MOVELE. Since that has MOVEBE. Even MOVE.LX ,MOVE.LR or MOVE.LS for cross, reversed or swapped long.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 8-Aug-2020 23:32:17

[ #8 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2751
From: Kansas

Quote:

Hypex wrote:
I ended up voting MMX in the end as it looked more suitable for 68K. The others looked a bit too good with 32 registers. One based on 8 registers and 64-bit looked closer to the mark.

I would prefer 16 integer registers, 16 FPU registers and 16 vector unit registers. The FPU and vector registers could be shared using 128 bit wide registers, perhaps with an option for 256 bit wide registers in the future. Wider SIMD registers are the key to SIMD performance. The x86_64 has performed quite well with 16 integer and 16 SIMD registers. A 128 bit wide register shouldn't be a problem in FPGA if the parallel operations are on narrower datatypes although doubling the register width requires doubling the number of ALUs.

Quote:

I see AltiVec came a couple years late after MMX in 1997, but could match most SSE operations a year earlier than SSE, in 1999.

SSE2 is where the x86 SIMD unit started to outperform AltiVec. This is when the number of SSE registers was doubled from 8 to 16.

Quote:

A main difference would be in loading and storing with memory. Core operations would be in memory. I don't see why PPC can't have things like "mr" between GPR and FPU or vectors, aside from size difference. But, vectors do tend to operate on memory data. It's just cached in CPU when loaded in then stored later. Good for fast memory copy loading in registers at max width.

Sharing registers between units using a different pipeline may require some synchronization to make it safe. The 68060 only has a 2 cycle penalty for the FPU to use integer registers as a source or destination but this may be higher for other designs. The 68060 uses an integer pipe for FPU instructions until the last stage making it easier to access integer registers.

SIMD unit vector data is often *not* cached as data sets can be large enough to effectively flush the caches of more commonly used data thus reducing overall performance. This is why cache bypassing techniques and stream prefetching logic are so important for an SIMD unit. It does make it more expensive to transfer data between the SIMD unit and integer unit in the case of Altivec possibly limiting what the SIMD unit can be used for but it is *not* a general purpose unit.

Quote:

Sorry, reg-mem on load but load/store on store? Where is the loading from? The store is reading then writing?

fadd.d (4,a0),fp0 ; reg-mem load with an op possible on 68k FPU
fadd.d fp0,(4,a0) ; reg-mem store with op *not* possible on 68k FPU

fadd.d (4,a0),fp0 ; this pair of instructions replaces the 2nd fadd.d above
fmove.d fp0,(4,a0)

The Read-Modify-Write reg-mem store is avoided which is simpler. The reg-mem load above saves an instruction and a register compared to load/store, whereas a reg-mem store would only save an instruction. There are usually about twice as many reads as writes too so not much is lost.

Quote:

Ah okay that makes sense. Yes good point about the other P codes. Perhaps this PMMU conflicts with their own MPU?

I don't know. I've never seen documentation on the Apollo Core MPU. ThoR knows more and wasn't happy with the decisions.

Quote:

Any LOAD or STORE seems strange on 68K. It tends to go through a MOVE. MOVE is probably the single most used instruction I wonder. I think VMOV suits x86 better. The mnemonics are like that for x86, but 68K is always spelled out, at least for MOVE that I've noticed.

The MOVE instruction is the most common instruction in the 68k which may sound strange for reg-mem which can do an op while moving but the 68k has some simple mem-mem capabilities as well (a mem-mem architecture usually executes fewer instructions, has better code density and less memory traffic than a reg-mem architecture).

Most ISAs have simplified MOVE to MOV. It looks more modern even if it resembles x86 instruction names more, which isn't a bad thing IMO. The x86 instruction names are pretty good. It is the x86 inconsistencies, limitations, ancient cruft and more modern bloat which are the problem.

Quote:

Yes I was reading that and wondered what the Quick was about. Though a shift from register was faster, it's all coded into instruction. It makes some sense if the value is 1 to 63.

The only thing quick about LSLQ and LSRQ was the time it took thinking about how to add the instructions and name them.

Quote:

It does. Suppose a MOVER could have worked for Reverse. Is MOVES for Swap. Or copy Intel and do MOVELE. Since that has MOVEBE. Even MOVE.LX ,MOVE.LR or MOVE.LS for cross, reversed or swapped long.

The ColdFire uses BYTEREV which isn't too bad if a bit long. The x86/x86_64 uses MOVBE and BSWAP. MOVELE or MOVLE is pretty good. MOVES is "Move alternate address Space" in the 68020 ISA.
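
For anyone wondering what the endian conversion actually computes, here is a minimal C sketch of a 32 bit byte reversal. It is only an illustration of the operation and does not claim to match the exact operand handling of MOVEX, BYTEREV, MOVBE or BSWAP; the names byte_reverse32 and check_byte_reverse32 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: reverse the byte order of a 32 bit value,
   converting between big endian and little endian layouts. */
uint32_t byte_reverse32(uint32_t v)
{
    return (v >> 24)                          /* byte 3 moves to byte 0 */
         | ((v >> 8) & UINT32_C(0x0000FF00))  /* byte 2 moves to byte 1 */
         | ((v << 8) & UINT32_C(0x00FF0000))  /* byte 1 moves to byte 2 */
         | (v << 24);                         /* byte 0 moves to byte 3 */
}

void check_byte_reverse32(void)
{
    assert(byte_reverse32(UINT32_C(0x12345678)) == UINT32_C(0x78563412));
    /* Reversing twice gives back the original value. */
    assert(byte_reverse32(byte_reverse32(UINT32_C(0xDEADBEEF))) == UINT32_C(0xDEADBEEF));
}
```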

Last edited by matthey on 08-Aug-2020 at 11:50 PM.
Last edited by matthey on 08-Aug-2020 at 11:35 PM.

Status: Offline

Fl@sh

Re: Amiga SIMD unit
Posted on 9-Aug-2020 11:53:07

[ #9 ]

Regular Member

Joined: 6-Oct-2004
Posts: 253
From: Napoli - Italy

@matthey

IMHO the optimal SIMD unit is 256 bits wide with at least 16 dedicated registers and support up to the float128 datatype, based on the latest IBM POWER ISA implementation.
On embedded SoCs, all optimisations such as register renaming and other tricks would simply be removed to keep everything low power and reduce complexity.
It’s important to have the same instruction set across the whole CPU line, with only performance differing between the lowest and highest CPU cores.
A great mistake of all chipmakers is to have different ISAs for the same architecture.
The current POWER implementation is the best possible high performance ISA, ready for the future.
x86, looking at the next 10 to 15 years, is dead just as the mc68k was 15 years ago.
The future is ARM, RISC-V and IBM POWER, simply because they’re the best performers for low power and/or high compute tasks.
x86 is closed and totally controlled by Intel and AMD; it lacks security and most probably has USA government backdoors.
..Just like Huawei did for its HiSilicon CPU line and 5G infrastructures.

My2cents

_________________
Pegasos II G4@1GHz 2GB Radeon 9250 256MB
AmigaOS4.1 fe - MorphOS - Debian 9 Jessie

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 10-Aug-2020 2:04:49

[ #10 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2751
From: Kansas

Quote:

Fl@sh wrote:
IMHO the optimal SIMD unit is 256 bits wide with at least 16 dedicated registers and support up to the float128 datatype, based on the latest IBM POWER ISA implementation.

The x86_64 ISA is the only popular ISA I'm aware of which has made a 256 bit wide SIMD unit standard although a 512 bit wide SIMD unit was even tried. Knights Landing cores use a 512 bit wide SIMD unit but down clock the core frequency by 200MHz when using SIMD instructions and have special and expensive high bandwidth MCDRAM memory. The wider the SIMD bit width, the higher end the supporting hardware needs to be. POWER hardware is high end and has enough memory bandwidth to support a wider SIMD unit but has so far stopped at 128 bits wide. ARM's AArch64 ISA chose to double the number of SIMD unit registers rather than move to a 256 bit width and even this likely limits use in the mid to low end embedded market.

A 256 bit wide SIMD unit and/or more registers would be acceptable as a standard if it was found highly useful for certain applications. For example, if ray tracing could be dramatically accelerated using Amiga GPGPU hardware then I would go for it. The horse can get ahead of the carriage sometimes. An SIMD unit is specialized requiring some tuning for particular purposes that need to be known and researched. The Amiga added the specialized blitter because the 68000 was under powered and had poor performance with certain operations like shifting. The Amiga nearly received an AT&T DSP to handle audio and networking workloads. Specialized hardware can be useful but can also quickly become outdated. A blitter can be faster than a CPU core today but the startup time would limit it to large operations. The DSP could still handle audio and networking but would save a tiny fraction of CPU time with modern processors. These specialized processors are also difficult to use. The Mac Quadra AV models received the same AT&T DSP 3210 which was to be used on the Amiga. A Mac DSP programmer wrote the following, "BTW, the 'C' compiler is a complete piece of shit. It produces some of the worst code I have ever seen (trying to do a matrix multiply in 3210 'C' ran 5 times slower than the host 68k on a Quadra 700 - rewriting the same in assembler ran 7-8 times *faster*). If you do any serious 3210 programming, you will need to learn 3210 assembler." An SIMD unit has many of the same drawbacks as a DSP. Many compilers and programmers would be better off forgetting about a SIMD unit. However, it can provide a large performance boost for certain performance bottlenecks.

Quad precision 128 bit floating point in an SIMD unit is a waste of resources. SIMD instructions are more useful for narrower datatypes which allows more parallelism (half the datatype size and double the number of operations). Half precision 16 bit fp is more interesting for an SIMD unit, at least as a load and store format. Quad precision fp in the FPU makes more sense but only a few scientists and engineers are likely to use it. The extended precision 68k fp format has the same sized exponent providing the same range but with reduced precision of the fractional part. This was adequate for some scientists and engineers who continued to use the extended 80 bit fp format of the x86 FPU long after it was deprecated. Half precision support would be nice for the 68k FPU as well. Values stored in longer IEEE fp formats can often be exactly represented in shorter fp formats, which I recognized. I suggested an optimization which Frank Wille implemented in the Vasm assembler and the Vbcc compiler uses (GCC does *not* have this optimization). I found the optimization had converted every double precision fp immediate value in a compile of the Vbcc compiler to a single precision fp immediate. Many single precision fp immediates could likewise be compressed to half precision. Half precision fp can be used to reduce data and code (fp immediates) memory traffic.
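
The test behind the immediate compression idea can be sketched in a few lines of C: a double can be emitted as a single precision immediate only if converting it to float and back loses nothing. This is an illustrative round trip check and not the actual Vasm code, which is not shown in this thread; double_fits_in_single and check_double_fits are hypothetical names, and NaNs and infinities are conservatively rejected here.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>

/* Hypothetical helper: true if the double value d is exactly
   representable as a single precision float. */
bool double_fits_in_single(double d)
{
    if (d != d)                          /* NaN: rejected to keep the sketch simple */
        return false;
    if (d > FLT_MAX || d < -FLT_MAX)     /* outside single range, including infinities */
        return false;
    float narrowed = (float)d;           /* rounds to the nearest single */
    return (double)narrowed == d;        /* exact only if no bits were lost */
}

void check_double_fits(void)
{
    assert(double_fits_in_single(0.5));          /* a power of two is exact */
    assert(double_fits_in_single(3.0));          /* small integers are exact */
    assert(!double_fits_in_single(0.1));         /* 0.1 needs all 53 mantissa bits */
    assert(!double_fits_in_single(16777217.0));  /* 2^24 + 1 needs 25 mantissa bits */
}
```

The same round trip idea would apply when narrowing single precision immediates to half precision, given a half precision type to convert through.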

Quote:

On embedded SoCs, all optimisations such as register renaming and other tricks would simply be removed to keep everything low power and reduce complexity.
It’s important to have the same instruction set across the whole CPU line, with only performance differing between the lowest and highest CPU cores.
A great mistake of all chipmakers is to have different ISAs for the same architecture.

Lower end (embedded) processors may make some instructions multiple cycle because of the complexity of the ISA but this makes the core design more complex (and violates RISC principles). I understand the importance of standardization though. AArch64 has very much helped ARM performance and compiler support but it can't go as low end as Thumb2.

Quote:

The current POWER implementation is the best possible high performance ISA, ready for the future.
x86, looking at the next 10 to 15 years, is dead just as the mc68k was 15 years ago.
The future is ARM, RISC-V and IBM POWER, simply because they’re the best performers for low power and/or high compute tasks.

AArch64 and RISC-V have the most room to improve and are catching up. POWER and x86_64 are improving with technology advances which are slowing. You can have your "high performance" POWER ISA if you can afford the hardware for it. Most people don't want to pay twice as much as x86_64 hardware to get POWER hardware which is similar performance.

Quote:

x86 is closed and totally controlled by Intel and AMD; it lacks security and most probably has USA government backdoors.
..Just like Huawei did for its HiSilicon CPU line and 5G infrastructures.

IBM would never cooperate with the U.S. government like Intel and AMD.

Status: Offline

Lou

Re: Amiga SIMD unit
Posted on 10-Aug-2020 16:39:58

[ #11 ]

Elite Member

Joined: 2-Nov-2004
Posts: 4259
From: Rhode Island

I voted no because I don't see why *we* need to repeat the mistakes of every architecture that has come before us.

These operations were *moved* to the gpu because parallel processing is what they do well. Applications that require SIMD instructions generally required a whole lot of them...more than a 68k cache can hold.

When you start adding more and more kitchen sinks to a cpu - you will inevitably run out of rooms to add kitchen sinks to...

Having 'custom' chips working their magic is what allowed the Amiga to have a comparatively weaker cpu...when cpus were the high dollar part. That's one thing they did right back then.

Status: Offline

CosmosUnivers

Re: Amiga SIMD unit
Posted on 10-Aug-2020 17:18:51

[ #12 ]

Regular Member

Joined: 20-Sep-2007
Posts: 113
From: Unknown

Quote:

Lou wrote:
I voted no because I don't see why *we* need to repeat the mistakes of every architecture that has come before us.

Once they added the AMMX = they lost me forever...

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 11-Aug-2020 1:50:40

[ #13 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2751
From: Kansas

Quote:

Lou wrote:
I voted no because I don't see why *we* need to repeat the mistakes of every architecture that has come before us.

You don't think a Motorola 68080 would have had a SIMD unit? Would that have been a mistake too?

Quote:

These operations were *moved* to the gpu because parallel processing is what they do well.

Many CPU cores with SIMD units can do parallel processing well also.

CPU cores with SIMD units
+ fast startup times
+ powerful and flexible for a high latency parallel processor
- tethered to fat CPU cores (the CELL architecture avoided this with separate SIMD like SPEs)

GPU cores
+ Slim cores allow more cores and more parallelism
+ GPU usually isn't working at full capacity so has processing power to spare
- specialization for GPU makes them less powerful
- difficult to program
- slow startup times due to different CPU and GPU memory

CPU cores can be slimmed down and are for GPGPUs. The Knights Landing project predecessors went as far back as the Pentium P54C design which is the 68060 era (best comparison to the 68060 and dominated by the 68060 in PPA). Many newer ISAs are too fat from supporting too much and baggage from "mistakes" and antiquated support to slim down cores much.

The slow startup times of GPU cores can be improved with a Heterogeneous System Architecture (HSA).

https://en.wikipedia.org/wiki/Heterogeneous_System_Architecture

The memory of the CPU and GPU is unified with caches kept coherent and a shared MMU for both. There is no more memory copying over a bus and pointers to data in memory can be passed between the CPU and GPU. It reminds me of efficient pointer passing in Amiga messages. The OS needs to be adapted to HSA although the 68k AmigaOS would probably run on this hardware as is. Newer console hardware may already be using this technology but it will likely be slower for current MMU using OSs to adapt to. This hardware setup works best as one SoC. Perhaps this is the impetus for Nvidia to buy ARM so they can have an HSA SoC to compete with AMD. HSA with ray tracing would be cool.

The following link is an old 2013 article about PS4 unified memory architecture and the advantages.

https://www.gamasutra.com/view/feature/191007/inside_the_playstation_4_with_mark_.php

Quote:

Applications that require SIMD instructions generally required a whole lot of them...more than a 68k cache can hold.

There are many SIMD instructions added to an ISA to support SIMD which takes huge amounts of encoding space but few SIMD instructions are needed in code and it is usually good for code density (uses less ICache). The SIMD unit can flush the DCache by flooding it with single use data but this can be avoided in hardware (smart stream processing or not caching) or by prefetch hints. The SIMD unit can actually lead to more efficient DCache use by separating the streaming data processing and avoiding the DCache. Doing the same stream processing in the integer unit could lead to a flushed cache afterwards without stream recognition hardware or prefetch hints.

Quote:

When you start adding more and more kitchen sinks to a cpu - you will inevitably run out of rooms to add kitchen sinks to...

SIMD support definitely bloats up a core and using the GPU with HSA hardware is much more efficient. It could be the blind following the blind. It wouldn't be the first time technology followed the hype down the wrong path.

Quote:

Having 'custom' chips working their magic is what allowed the Amiga to have a comparatively weaker cpu...when cpus were the high dollar part. That's one thing they did right back then.

Weaker CPUs have more advantages than powerful ones.

Weaker CPU
+ cheaper (production and development)
+ better security
+ easier to program
+ lower power
+ more reliable
- slower to complete work
- less energy efficient (work is done slower using more total energy)

Extreme performance from the CPU is usually *not* a good idea. Usually a more balanced and smarter approach works better.

Last edited by matthey on 11-Aug-2020 at 03:23 PM.
Last edited by matthey on 11-Aug-2020 at 02:25 AM.
Last edited by matthey on 11-Aug-2020 at 02:03 AM.
Last edited by matthey on 11-Aug-2020 at 02:00 AM.

Status: Offline

Lou

Re: Amiga SIMD unit
Posted on 11-Aug-2020 13:18:18

[ #14 ]

Elite Member

Joined: 2-Nov-2004
Posts: 4259
From: Rhode Island

@matthey

68K is already CISC. We don't need it CISC-ier.
Enhance the actual chipset. CPUs don't need magic, they need efficiency and speed.
SIMD was a crutch because the GPU market at the time was crap.

Again - what applications are you using SIMD instructions for? Typically image processing. Guess what does it better? I don't mind 2 or 4 internal FPU units but at some point it's bloat.

A faster overall cpu is better than one with more features slowing it down. A rising tide raises all ships.

Focus should be on a real successor to AGA/AA or even AAA. Accelerating a 68k should be trivial. Perhaps make it 68k-64, now that would have some merit. Everything else is just fragmenting an already fragmented environment.

Status: Offline

MEGA_RJ_MICAL

Re: Amiga SIMD unit
Posted on 11-Aug-2020 18:19:53

[ #15 ]

Super Member

Joined: 13-Dec-2019
Posts: 1200
From: AMIGAWORLD.NET WAS ORIGINALLY FOUNDED BY DAVID DOYLE

Instruction and cycle counts imply CPI is less on x86 implementations: geometric mean CPI is 3.4 for A8, 2.2 for A9, 2.1 for Atom, and 0.7 for i7 across all suites. x86 ISA overheads, if any, are overcome by microarchitecture. I-cache minimizes code density impact
◦ Modern compilers pick mostly RISC insts; x86 and ARM µ-op latencies

_________________
I HAVE ABS OF STEEL
--
CAN YOU SEE ME? CAN YOU HEAR ME? OK FOR WORK

Status: Offline

NutsAboutAmiga

Re: Amiga SIMD unit
Posted on 11-Aug-2020 19:16:45

[ #16 ]

Elite Member

Joined: 9-Jun-2004
Posts: 12993
From: Norway

@Lou

Quote:
SIMD was a crutch because the GPU market at the time was crap.

SIMD is fast when working on system memory, which is what it was designed to do. The GPU can’t or shouldn’t access system memory. Yes, I know there are APUs / hybrid CPUs with integrated GPUs, but this is most inefficient. The GPU is most efficient at floating point math, but doesn’t really offer anything for integer math.

The CISC or RISC debate is sort of useless. There is a point where you can’t go faster than X GHz, so the more you do out of order the better; SIMD is designed for out-of-order execution in the most efficient way.

What we need is a .library that has a simple set of predefined routines that are easy to use without writing assembler. This way it won’t be an issue if some CPUs do not have a SIMD instruction set. I am thinking of a few transformation functions, like moving all dots in an array by +10, or something like that. bclear, memcpy and lots of stuff like that can be optimized for SIMD.

Last edited by NutsAboutAmiga on 11-Aug-2020 at 07:19 PM.
Last edited by NutsAboutAmiga on 11-Aug-2020 at 07:19 PM.
Last edited by NutsAboutAmiga on 11-Aug-2020 at 07:18 PM.
Last edited by NutsAboutAmiga on 11-Aug-2020 at 07:17 PM.

_________________
http://lifeofliveforit.blogspot.no/
Facebook::LiveForIt Software for AmigaOS

Status: Offline

Lou

Re: Amiga SIMD unit
Posted on 11-Aug-2020 19:54:47

[ #17 ]

Elite Member

Joined: 2-Nov-2004
Posts: 4259
From: Rhode Island

@NutsAboutAmiga

Quote:

NutsAboutAmiga wrote:
@Lou

Quote:
SIMD was a crutch because the GPU market at the time was crap.

SIMD is fast when working on system memory, which is what it was designed to do. The GPU can’t or shouldn’t access system memory. Yes, I know there are APUs / hybrid CPUs with integrated GPUs, but this is most inefficient. The GPU is most efficient at floating point math, but doesn’t really offer anything for integer math.

The CISC or RISC debate is sort of useless. There is a point where you can’t go faster than X GHz, so the more you do out of order the better; SIMD is designed for out-of-order execution in the most efficient way.

What we need is a .library that has a simple set of predefined routines that are easy to use without writing assembler. This way it won’t be an issue if some CPUs do not have a SIMD instruction set. I am thinking of a few transformation functions, like moving all dots in an array by +10, or something like that. bclear, memcpy and lots of stuff like that can be optimized for SIMD.

Comparing CPU to GPU depends almost entirely on the parallelisation opportunities in the code that you want to use for the comparison, with a secondary consideration being the memory hierarchy that the code needs to exploit to extract maximum performance and how the data you're working on will make it to the GPU and back to the host afterwards.

To answer the question, it makes sense to perform integer computation on the GPU only if there's parallelism to exploit and the gain outweighs the memory hierarchy and round-trip cost to the GPU. You can only determine those things on a case-by-case basis.
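
As a rough illustration of that case-by-case check, here is a minimal C sketch of a break-even estimate. The struct, its field names and the simple model (transfer time plus GPU time compared against CPU time) are assumptions made up for illustration, not measurements and not any real API.

```c
#include <stdbool.h>

/* Hypothetical inputs for one job; every value would have to be measured. */
struct offload_estimate {
    double cpu_seconds;          /* time to run the job on the CPU */
    double gpu_seconds;          /* time for the GPU kernel alone */
    double bytes_to_gpu;         /* data copied host -> GPU */
    double bytes_from_gpu;       /* data copied GPU -> host */
    double bus_bytes_per_second; /* assumed sustained copy bandwidth */
};

/* Total GPU-path time: round-trip copies plus the kernel itself. */
static double offload_total_seconds(const struct offload_estimate *e)
{
    double transfer = (e->bytes_to_gpu + e->bytes_from_gpu)
                      / e->bus_bytes_per_second;
    return transfer + e->gpu_seconds;
}

/* Offloading pays off only if the whole round trip beats the CPU. */
static bool offload_is_worth_it(const struct offload_estimate *e)
{
    return offload_total_seconds(e) < e->cpu_seconds;
}
```

In this model a small job can lose to the CPU purely because of the copies, while the same job with both byte counts set to zero (no copying at all) can win.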

Here is a game that uses OpenCL integer performance:
https://www.youtube.com/watch?v=OtP71P4ncJQ

But again, seeing 2 or 4 MPUs inside a core is not uncommon...but when a GPU contains 2000-5000 SIMD units...what is the point of 1 in the CPU? Again, it was put there at a time when GPUs were crap and standards were bad.

Sure you can create a "test case" that will show having an SIMD unit improves performance in THAT test case. But it's unrealistic. It's only there because your actual GPU (aka Amiga chipset) can't handle it. So my answer is to make a new chipset with those features. It's what we expected from SAGA until we found out it was just cpu functions on the 2nd thread rather than an evolved actual chipset.

Last edited by Lou on 11-Aug-2020 at 08:07 PM.
Last edited by Lou on 11-Aug-2020 at 08:00 PM.
Last edited by Lou on 11-Aug-2020 at 07:59 PM.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 12-Aug-2020 3:12:14

[ #18 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2751
From: Kansas

Quote:

NutsAboutAmiga wrote:
SIMD is fast when working on system memory, which is what it was designed to do. The GPU can’t or shouldn’t access system memory. Yes, I know there are APUs / hybrid CPUs with integrated GPUs, but this is most inefficient. The GPU is most efficient at floating point math, but doesn’t really offer anything for integer math.

Many APUs do *not* support a Heterogeneous System Architecture (HSA), so they have to copy data to and from GPU memory. The only HSA hardware that I'm aware of is the PlayStation 4 AMD SoC, the AMD "Kaveri" A-series APU and ARM's Mali-G71 GPU. It's a game changer as far as using the GPU to do parallel processing. See the article I linked above about the PS4 to see how important it was to that project (HSA advantages are about half of the long article). HSA hardware is likely the future, but taking full advantage requires standard hardware in an integrated SoC, like a console has and like the Amiga could have.

Many modern GPU processors have become more flexible and general purpose. Unified shader model GPUs use a universal shader processor for all shading (vertex, pixel, geometry, etc.). They support integer datatypes pretty well, including integer vector support in some cases. Even some older GPUs can be programmed in C-like languages and support an OS. For example, the Raspberry Pi VideoCore IV has a vbcc backend and the ThreadX RTOS is used to manage the board (not just the GPU). Most GPU integer datatypes use saturation math, which can keep them from being conformant with older language standards and compatible with many OSs.
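
To show what the saturation difference looks like, here is a minimal C sketch comparing the wraparound add that C defines for unsigned types with a saturating add. The function names are made up for illustration and are not from any GPU toolchain.

```c
#include <stdint.h>

/* Wraparound (modular) add: the behaviour C defines for unsigned types. */
static uint8_t add_wrap_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Saturating add: clamp to the largest value instead of wrapping around. */
static uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > UINT8_MAX ? UINT8_MAX : sum);
}
```

Adding 200 and 100 gives 44 with the wraparound version and 255 with the saturating one, so code that relies on one behaviour breaks on hardware that only provides the other.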

Quote:

The CISC or RISC debate is sort of useless. There is a point where you can’t go faster than X GHz, so the more you do out of order the better; SIMD is designed for out-of-order execution in the most efficient way.

I don't think of SIMD as being OoO. I see it as doing more work in order. SIMD instructions are often able to start and finish parallel operations sooner than an unrolled loop of RISC instructions giving some of the benefits of OoO.

Quote:

What we need is a .library that has a simple set of predefined routines that are easy to use without writing assembler. This way it won’t be an issue if some CPUs do not have a SIMD instruction set. I am thinking of a few transformation functions, like moving all dots in an array by +10, or something like that. bclear, memcpy and lots of stuff like that can be optimized for SIMD.

AmigaOS already has exec.library CopyMem() and CopyMemQuick() which could be SIMD optimized if worthwhile (a simple copy loop without SIMD instructions may be able to saturate memory). It is difficult to make a library specifically to use an SIMD unit because so much data needs to be specified in a particular order with a particular alignment including often difficult start and end cases. It is more important to have good compiler support (auto-vectorization and vector intrinsics).
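
To show what the start and end cases mean in practice, here is a minimal sketch in portable C of an "add a constant to every element" routine. The routine name and the 16 byte vector width are assumptions for illustration; the inner block loop only stands in for what a single SIMD instruction would do.

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16u /* assumed vector width, for illustration only */

/* Hypothetical library routine: add delta to every byte in buf[0..n). */
static void add_const_u8(uint8_t *buf, size_t n, uint8_t delta)
{
    size_t i = 0;

    /* Head: scalar steps until buf + i is aligned to the vector width. */
    while (i < n && ((uintptr_t)(buf + i) % VEC_BYTES) != 0) {
        buf[i] = (uint8_t)(buf[i] + delta);
        i++;
    }

    /* Body: whole aligned blocks; a SIMD unit would do each block at once. */
    for (; i + VEC_BYTES <= n; i += VEC_BYTES) {
        for (size_t j = 0; j < VEC_BYTES; j++) {
            buf[i + j] = (uint8_t)(buf[i + j] + delta);
        }
    }

    /* Tail: leftover elements that do not fill a whole block. */
    for (; i < n; i++) {
        buf[i] = (uint8_t)(buf[i] + delta);
    }
}
```

Only the middle loop benefits from the SIMD unit. The head and tail are pure overhead that every such routine has to carry, which is part of why a general-purpose SIMD library is awkward to design.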

Status: Offline

NutsAboutAmiga

Re: Amiga SIMD unit
Posted on 12-Aug-2020 10:07:52

[ #19 ]

Elite Member

Joined: 9-Jun-2004
Posts: 12993
From: Norway

@matthey

Quote:
It is more important to have good compiler support (auto-vectorization and vector intrinsics).

Well, that’s problematic. Most PowerPC chips used in AmigaONEs do not have a SIMD instruction set. The same is the case for 680x0: only the 68080 has one, and let’s say it’s not the most common chip around. You have to optimize for CPUs with or without an FPU, with or without SIMD, and with old or new instructions, or for embedded chips where they removed half of the instructions to save power and heat.

:-/

Last edited by NutsAboutAmiga on 12-Aug-2020 at 03:39 PM.
Last edited by NutsAboutAmiga on 12-Aug-2020 at 03:37 PM.
Last edited by NutsAboutAmiga on 12-Aug-2020 at 10:08 AM.

_________________
http://lifeofliveforit.blogspot.no/
Facebook::LiveForIt Software for AmigaOS

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 12-Aug-2020 22:08:01

[ #20 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2751
From: Kansas

Quote:

NutsAboutAmiga wrote:
Well, that’s problematic. Most PowerPC chips used in AmigaONEs do not have a SIMD instruction set. The same is the case for 680x0: only the 68080 has one, and let’s say it’s not the most common chip around. You have to optimize for CPUs with or without an FPU, with or without SIMD, and with old or new instructions, or for embedded chips where they removed half of the instructions to save power and heat.

:-/

The Amiga suffers from division and a lack of standardization. We can look at ARM to see how standardization has helped their support.

When the Raspberry Pi 3 came out, I noticed an AArch64 mode benchmark suite showed a large performance increase in some areas over AArch32 mode primarily from having a combination of standard hardware and more compatible IEEE fp hardware support. Raspbian (now called Raspberry Pi OS) and some other RPi OSs do *not* use AArch64 mode to be compatible with earlier RPi hardware which did not support AArch64. Even with the same hardware often available in AArch32 mode and a long list of compiler flags (below for example), performance is usually less.

-march=armv8-a+crc -mtune=cortex-a53 -mfpu=crypto-neon-fp-armv8 -mfloat-abi=hard -mneon-for-64bits -ftree-vectorize -funsafe-math-optimizations

AArch64 is still not perfect for SIMD utilization. Loops and arrays often have to be rewritten for auto-vectorization, SIMD code should be checked to confirm it is not slower, and SIMD code can be bigger, usually from dealing with the trailing data, whose length is often not a multiple of the SIMD width. The following article talks about auto-vectorization with some examples of issues.

https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/neon-programmers-guide-for-armv8-a/compiling-for-neon-with-auto-vectorization/single-page
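
As a small illustration of the kind of rewriting involved, here is a sketch of a loop written so that a compiler has a good chance of auto-vectorizing it. The function name is made up, and whether a given compiler actually vectorizes it depends on the target and the flags used, so the generated code should be inspected and timed.

```c
#include <stddef.h>

/*
 * dst[i] = a[i] + k * b[i]
 * restrict promises the compiler that the three arrays do not overlap,
 * and the loop is a plain counted loop with no early exits.
 */
static void scale_add(float *restrict dst, const float *restrict a,
                      const float *restrict b, float k, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + k * b[i];
    }
}
```

When n is not a multiple of the SIMD width, a vectorizing compiler typically has to emit extra scalar code for the leftover elements, which is where the code size increase mentioned above comes from.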

I expect the Raspberry Pi 3 and 4 AArch64 4 CPU cores have more SIMD performance than the VideoCore IV 4 QPU processors. If the SoC had supported a Heterogeneous System Architecture, the CPU cores could likely have accelerated the 3D graphics provided there was access to the other specialized 3D hardware. The AArch64 SIMD unit is easier to program and more flexible than a QPU too.

Last edited by matthey on 12-Aug-2020 at 10:11 PM.

Status: Offline


Amigaworld.net was originally founded by David Doyle