Amigaworld.net - The Amiga Computer Community Portal Website

Amiga SIMD unit

Hypex

Re: Amiga SIMD unit
Posted on 15-Aug-2020 17:02:12

[ #21 ]

Elite Member

Joined: 6-May-2007
Posts: 11351
From: Greensborough, Australia

@matthey

Quote:
I would prefer 16 integer registers, 16 FPU registers and 16 vector unit registers. The FPU and vector registers could be shared using 128 bit wide registers, perhaps with an option for 256 bit wide registers in the future. Wider SIMD registers are the key to SIMD performance. The x86_64 has performed quite well with 16 integer and 16 SIMD registers. A 128 bit wide register shouldn't be a problem in FPGA if the parallel operations are on narrower datatypes although doubling the register width requires doubling the number of ALUs.

After checking it out, it looks like they have 32 registers, or rather 4 banks of 8 registers. That is going back to 6502 memory limits or the AGA palette setup. The BANK instruction functions almost like a prefix. I'm not aware of the 68K being able to switch register banks apart from at an exception. In fact, the AMMX ASM doc isn't exactly clear, but it looks like the BANK instruction doesn't stick like it should. In the code example it keeps being called, as if it only affects the next instruction, which would mean they've put in a prefix.

However, I have my own idea of vectorisation, based on variable width vectors. The entire vector register file would take up 1024 bits in my example. With 8 vector registers, that would allow 128 bits each. But in this case they can be configured as 128x8 bits, 64x16 bits, 32x32 bits, 16x64 bits, 8x128 bits, 4x256 bits, 2x512 bits or 1x1024 bits. This would allow for maximum parallelism with the smallest element, a byte, or minimum parallelism with the largest element, 128 bytes.
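To make the arithmetic concrete, here is a minimal sketch that enumerates those configurations. The names `FILE_BITS` and `lane_configs` are made up for illustration; this only models the bookkeeping of the idea, not any real hardware.

```python
# Hypothetical model of the variable-width idea: one 1024-bit register file
# viewed as lanes of any power-of-two width from 8 bits up to the full file.
FILE_BITS = 1024


def lane_configs(file_bits=FILE_BITS, min_lane_bits=8):
    """Return (lane_count, lane_bits) pairs whose product is file_bits."""
    configs = []
    lane_bits = min_lane_bits
    while lane_bits <= file_bits:
        configs.append((file_bits // lane_bits, lane_bits))
        lane_bits *= 2
    return configs


for count, bits in lane_configs():
    print(f"{count:>4} x {bits:>4} bits")
```

Every row multiplies out to the same 1024 bits; only the split between lane count and lane width changes.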

Quote:
SSE2 is where the x86 SIMD unit started to outperform AltiVec. This is when the number of SSE registers was doubled from 8 to 16.

Which is still half of 32 in AltiVec.

I read they are now increasing the count to 64. Perhaps they should adopt my variable width idea. Using rigid register counts is like continuing to use a fixed number of hardware audio channels when a DSP can dynamically mix a variable number.

Quote:
Sharing registers between units using a different pipeline may require some synchronization to make it safe. The 68060 only has a 2 cycle penalty for the FPU to use integer registers as a source or destination but this may be higher for other designs. The 68060 uses an integer pipe for FPU instructions until the last stage making it easier to access integer registers.

The PPC does have a "mr" but that is GPR.

Quote:
SIMD unit vector data is often *not* cached as data sets can be large enough to effectively flush the caches of more commonly used data thus reducing overall performance. This is why cache bypassing techniques and stream prefetching logic are so important for an SIMD unit. It does make it more expensive to transfer data between the SIMD unit and integer unit in the case of Altivec possibly limiting what the SIMD unit can be used for but it is *not* a general purpose unit.

What I was thinking of, using my DSP as an example, is a simplified SIMD mixing routine that reads in vectors of samples and adds them to other vectors of samples. The total number of vectors needed would equal the total number of tracks mixed. In full, it would need to read both sets of vectors, add them, then store the result back. That read/modify/write (RMW) operation reads in two data sets, combines them in the modify step, then writes the result back. Of course, it would also need to extend the data read in to double the size if it needed extra headroom.
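A rough sketch of that mixing loop, assuming signed 8-bit input samples widened to a signed 16-bit accumulator. The function name `mix_tracks` and the sample values are invented for illustration, and the inner loop stands in for what a SIMD unit would do across all lanes at once.

```python
def mix_tracks(tracks):
    """Mix equal-length tracks of signed 8-bit samples into one widened track.

    Each track stands in for one vector of samples. The accumulator is
    clamped to the signed 16-bit range, modelling the doubled width.
    """
    length = len(tracks[0])
    acc = [0] * length                       # widened accumulator vector
    for track in tracks:                     # one read + add per track
        if len(track) != length:
            raise ValueError("tracks must have equal length")
        for i, sample in enumerate(track):   # a SIMD unit would do this per lane
            if not -128 <= sample <= 127:
                raise ValueError("sample is not signed 8-bit")
            acc[i] = max(-32768, min(32767, acc[i] + sample))
    return acc                               # the write-back step


mixed = mix_tracks([[100, -128, 127], [100, -128, 127], [50, 0, 1]])
print(mixed)
```

The sums in this example no longer fit in 8 bits, which is why the accumulator has to be wider than the input.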

Quote:
fadd.d (4,a0),fp0 ; reg-mem load with an op possible on 68k FPU
fadd.d fp0,(4,a0) ; reg-mem store with op *not* possible on 68k FPU

fadd.d (4,a0),fp0 ; this pair of instructions replaces the 2nd fadd.d above
fmove.d fp0,(4,a0)

The Read-Modify-Write reg-mem store is avoided which is simpler. The Reg-mem load above saves an instruction and register compared to load/store where a reg-mem store would only save an instruction. There are usually about twice as many reads as writes too so not much is lost.

I had forgotten earlier that all 68K instructions modifying memory also need to load the memory first. One of the benefits of CISC. It can be done as an atomic operation.

Quote:
I don't know. I've never seen documentation on the Apollo Core MPU. ThoR knows more and wasn't happy with the decisions.

It will all come out eventually if anyone else is to program it. Since they have AROS they can modify it as they like. Probably better than relying on third parties to add 080 optimisation to OS3.1.4 which may never come.

Quote:
The MOVE instruction is the most common instruction in the 68k which may sound strange for reg-mem which can do an op while moving but the 68k has some simple mem-mem capabilities as well (a mem-mem architecture usually executes fewer instructions, has better code density and less memory traffic than a reg-mem architecture).

Most ISAs have simplified MOVE to MOV. It looks more modern even if it resembles x86 instruction names more, which isn't a bad thing IMO. The x86 instruction names are pretty good. It is the x86 inconsistencies, limitations, ancient cruft and more modern bloat which are the problem.

The 6502 used three letter abbreviations but they looked neat. The 68000 always used full words in instruction tables I read where possible so I'm used to that style. I've seen other styles, such as ($xx,Dx) which seemed slightly foreign, as the standard was $xx(Dx). It does look neat enclosed fully, which the 6502 also did. ($xx,X)

The MOV just doesn't fit well with me. I guess I don't like it being abbreviated as a cut off. MVE could be more to my liking. But technically, I never thought MOVE was a correct term, anyway. When has MOVE actually MOVED data? MOVE implies it cuts and pastes it. To me a proper MOVE would move the data, by copying the source to the destination, then deleting the source with a zero wipe. Now that's a move. That or renaming the data location. Register renaming anyone? Well, not quite.

Therefore, I would call it a COPY instruction. I've only seen it copy. Let's see... CPY. Yes I'm fine with that. CPY instead would make me happy.

Quote:
The only thing quick about LSLQ and RSRQ was the time it took thinking about how to add the instructions and name them.

Looking back again, it looks like the Q is meant to mean Quad word. But then it should be LSL.Q and LSR.Q. I don't know, the assembler doc is incomplete. Too many standards. I gave up at the first BANK trying to understand it.

Quote:
The ColdFire uses BYTEREV which isn't too bad if a bit long. The x86/x86_64 uses MOVBE and BSWAP. MOVELE or MOVLE is pretty good. MOVES is "Move alternate address Space" in the 68020 ISA.

I thought MOVES was taken but my basic Google search failed me. Another one, taken from ppc64el. MOVEEL. Looks stinky.

Status: Offline

megol

Re: Amiga SIMD unit
Posted on 16-Aug-2020 14:23:47

[ #22 ]

Regular Member

Joined: 17-Mar-2008
Posts: 355
From: Unknown

Still think my idea of a prefix beats the BANK thing but then I'm biased. For a small not-that-wide core 32 registers are generally not much better than 16 and for a CISC design with wide immediate support the advantage is smaller still. They added a length field to the BANK instruction to make parsing the instruction length easier but that takes 2 bits, in total they "waste" 4 bits compared to a 16 register expansion without a length field.

I guess the length thing is to make it easier for 64 bit immediates whereas my 64 bit expansion proposal used a special quad encoding instead of the standard immediate type that scales with the operation size. Maybe they added it for some other reason though.

Lots of little weird design decisions IMHO.

--

Variable length SIMD or SIMD/vector (Cray type) hybrids seem to be the future; however, it's not realistic to expect something like that on an FPGA core designed to fit relatively small chips.

Status: Offline

Lou

Re: Amiga SIMD unit
Posted on 16-Aug-2020 15:53:42

[ #23 ]

Elite Member

Joined: 2-Nov-2004
Posts: 4259
From: Rhode Island

ARM is anything but standard once it's licensed.

Samsung will be integrating AMD chiplet gpus in their future ARM chips.

NVidia Tegra X1 powers the Nintendo Switch and it has a Maxwell-based gpu.
NVidia is trying to purchase ARM outright.

Apple does what Apple do.

The only standard is that there is no standard.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 17-Aug-2020 9:50:52

[ #24 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2755
From: Kansas

Quote:

megol wrote:
Still think my idea of a prefix beats the BANK thing but then I'm biased. For a small not-that-wide core 32 registers are generally not much better than 16 and for a CISC design with wide immediate support the advantage is smaller still. They added a length field to the BANK instruction to make parsing the instruction length easier but that takes 2 bits, in total they "waste" 4 bits compared to a 16 register expansion without a length field.

The register banks remind me of the StarCore DSP which has some similarities to the 68k.

o Sixteen 40-bit data registers (low bank D0-D7, high bank D8-D15)
o Eight low bank address registers (R0–R7)
o Eight high bank address registers (R8–R15), or alternatively, eight base address registers (B0–B7)
o Four offset registers (N0–N3)
o Four modifier registers (M0–M3)

In addition to all the registers, it uses prefixes and has 3 op instructions.

https://www.nxp.com/docs/en/reference-manual/MNSC140CORE.pdf

When I was on the Apollo Team, I brought up the StarCore DSP to Gunnar and we discussed it. Personally, I'm not a fan of it overall even though I like some ideas. StarCore probably is easier to program than most DSPs but is far from the 68k in consistency and ease of use.

My rough estimate is that from RISC 32 GP integer registers to the 68k 16 GP integer registers is less than 10% memory traffic increase and less than 1% performance reduction in most designs. The 68k has several traits that reduce register usage (reg-mem/mem-mem, powerful addressing modes, large immediate support, PC relative addressing, register renaming). More GP registers are anything but a free lunch. They use more transistors, draw more power, sometimes require more time and memory to save and restore extra (caller/callee/all) registers and require more encoding space increasing code sizes.

Quote:

I guess the length thing is to make it easier for 64 bit immediates whereas my 64 bit expansion proposal used a special quad encoding instead of the standard immediate type that scales with the operation size. Maybe they added it for some other reason though.

Lots of little weird design decisions IMHO.

You make it sound like it was a team design which I highly doubt.

Quote:

Variable length SIMD or SIMD/vector (Cray type) hybrids seem to be the future; however, it's not realistic to expect something like that on an FPGA core designed to fit relatively small chips.

ARM AArch64 has an optional Scalable Vector Extension (SVE) in addition to the standard fixed width SIMD instructions. Is this what you mean by "variable length SIMD"?

https://community.arm.com/developer/tools-software/hpc/b/hpc-blog/posts/technology-update-the-scalable-vector-extension-sve-for-the-armv8-a-architecture

It sounds like it would be less efficient today with the abstraction but may be more efficient tomorrow with wider SIMD units (although code recompiled for a fixed width tomorrow will be more efficient). It assumes that SIMD width will continue to grow and that may not be practical as can be seen by Knights Landing down clocking cores using 512 bit wide SIMD instructions. Also, wider SIMD unit standards limit low end core designs. I expect high end core SIMD designs need support and don't mind losing some performance but lower end designs would rather have the better performance of fixed width SIMD.
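As a rough illustration of the vector-length-agnostic style, here is a sketch of a loop that never hard-codes the vector width. It models the idea only, not actual SVE instructions; `vla_add` and `vector_lanes` are hypothetical names.

```python
def vla_add(a, b, vector_lanes):
    """Add two equal-length lists in chunks of `vector_lanes` elements.

    `vector_lanes` stands in for the hardware vector length, which the loop
    body never assumes. The final partial chunk is handled by a predicate
    (a per-lane on/off mask) instead of a separate scalar tail loop.
    """
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        predicate = [i + lane < n for lane in range(vector_lanes)]
        for lane, active in enumerate(predicate):
            if active:                       # inactive lanes do nothing
                out[i + lane] = a[i + lane] + b[i + lane]
        i += vector_lanes
    return out


a = list(range(10))
b = [10] * 10
# The same code gives the same result on a "narrow" and a "wide" machine.
print(vla_add(a, b, 4) == vla_add(a, b, 16))
```

The predicate bookkeeping on every iteration is the kind of abstraction cost mentioned above; a loop compiled for one fixed width can skip it.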

Last edited by matthey on 17-Aug-2020 at 10:17 AM.
Last edited by matthey on 17-Aug-2020 at 10:06 AM.
Last edited by matthey on 17-Aug-2020 at 09:58 AM.
Last edited by matthey on 17-Aug-2020 at 09:53 AM.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 18-Aug-2020 0:45:14

[ #25 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2755
From: Kansas

Quote:

Lou wrote:
ARM is anything but standard once it's licensed.

Samsung will be integrating AMD chiplet gpus in their future ARM chips.

NVidia Tegra X1 powers the Nintendo Switch and it has a Maxwell-based gpu.
NVidia is trying to purchase ARM outright.

Apple does what Apple do.

The only standard is that there is no standard.

Nvidia is using VLIW cores (Denver and Carmel) to execute AArch64 code with a "code morphing" translation layer like Transmeta (VLIW Intel i860, Transmeta, Itanium processors were EPIC failures for GP computing and financially). Nvidia at one time claimed their VLIW SoC chips (Tegra X2) were the highest performance ARM SoC chips (likely based on SpecInt2K). I expect SIMD instructions translate well to VLIW but the big problem for VLIW cores has been branches. Maybe Nvidia wants ARM so they can market their VLIW ARM cores. Rumor is that Nvidia tried to get Intel patents to code morph x86_64 but had trouble licensing the patents from Intel. While VLIW cores have had trouble with GP performance, they are low power. This could be good for embedded use, but how low end they can go is limited by the overhead of the "code morphing" translation (for example, optimized code sequences were stored in a 128 MB cache using main memory).

https://techreport.com/news/26906/nvidia-claims-haswell-class-performance-for-denver-cpu-core/

Has Nvidia finally solved the VLIW "code morphing" problems? Would they be interested in 68k or PPC "code morphing"?

It looks like Nvidia SoCs have supported a proprietary Heterogeneous System Architecture (HSA) since at least the Tegra X2 (SoC after the one used in the Nintendo Switch).

Last edited by matthey on 18-Aug-2020 at 12:51 AM.

Status: Offline

Lou

Re: Amiga SIMD unit
Posted on 19-Aug-2020 13:16:33

[ #26 ]

Elite Member

Joined: 2-Nov-2004
Posts: 4259
From: Rhode Island

@matthey

To be clear, I think SIMD is great. I just believe putting it in the cpu is wrong; a dedicated enhancement chip is where it belongs.

For instance, the SNES can run DOOM because the MATH was off-loaded to the Super-FX2 chip...
https://www.youtube.com/watch?v=JqP3ZzWiul0

...and yes that video is full of Amiga references...

Last edited by Lou on 19-Aug-2020 at 01:17 PM.

Status: Offline

bison

Re: Amiga SIMD unit
Posted on 19-Aug-2020 13:58:26

[ #27 ]

Elite Member

Joined: 18-Dec-2007
Posts: 2112
From: N-Space

@Lou

Quote:
For instance, the SNES can run DOOM because the MATH was off-loaded to the Super-FX2 chip... https://www.youtube.com/watch?v=JqP3ZzWiul0

That was an interesting video -- thanks for the link!

_________________
"Unix is supposed to fix that." -- Jay Miner

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 19-Aug-2020 19:14:32

[ #28 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2755
From: Kansas

Quote:

Lou wrote:
To be clear, I think SIMD is great. I just believe putting it in the cpu is wrong; a dedicated enhancement chip is where it belongs.

Custom hardware usually has a learning curve compared to SMP in CPU cores. Look how long it took for programmers to learn the Amiga custom hardware. It was nearly outdated by the time they learned to use it well. Custom hardware is necessary in some cases for performance, but the performance potential of flexible programmable processors (like SIMD unified shaders) is more difficult to fully utilize.

Quote:

For instance, the SNES can run DOOM because the MATH was off-loaded to the Super-FX2 chip...
https://www.youtube.com/watch?v=JqP3ZzWiul0

...and yes that video is full of Amiga references...

The Modern Vintage Gamer video is by Lantus360 who ported the Cannonball OutRun clone (which I beta tested), the Doom Engine based Strife and the OpenBOR game engine to the Amiga. He lives less than a half hour from me.

Is DOOM on the SNES really that impressive? The SNES came out 5-6 years after the Amiga and has stronger hardware and faster memory in some areas even if the CPU is slow (similar to Apple IIGS CPU but the 68000 performance isn't much better). It is closer to the time period of AGA Amigas which can handle DOOM at that reduced resolution. Even a 1MB ECS 68000 Amiga could have handled DOOM with limitations as the following video demonstrates.

https://www.youtube.com/watch?v=KeaNb5QzoU0

The Amiga has custom hardware too but it is not really what is needed for a game like Doom. Adding a higher performance CPU and memory solves all the problems and accelerates all types of games where the custom hardware was not as flexible and more difficult to use.

Status: Offline

Lou

Re: Amiga SIMD unit
Posted on 20-Aug-2020 3:05:03

[ #29 ]

Elite Member

Joined: 2-Nov-2004
Posts: 4259
From: Rhode Island

@matthey

Quote:

matthey wrote:

Custom hardware usually has a learning curve compared to SMP in CPU cores. Look how long it took for programmers to learn the Amiga custom hardware. It was nearly outdated by the time they learned to use it well. Custom hardware is necessary in some cases for performance, but the performance potential of flexible programmable processors (like SIMD unified shaders) is more difficult to fully utilize.

Eh? You proved why it should be on a GPU and accessible via an API.
All gpus today are GPGPUs. Moving such functions to a separate chip and hitting it via an api allows you to keep more backwards compatibility over time since "our" cpu would just be a faster 68k without introducing unnecessary complexities and incompatibilities.

Quote:

The Modern Vintage Gamer video is by Lantus360 who ported the Cannonball OutRun clone (which I beta tested), the Doom Engine based Strife and the OpenBOR game engine to the Amiga. He lives less than a half hour from me.

Is DOOM on the SNES really that impressive? The SNES came out 5-6 years after the Amiga and has stronger hardware and faster memory in some areas even if the CPU is slow (similar to Apple IIGS CPU but the 68000 performance isn't much better). It is closer to the time period of AGA Amigas which can handle DOOM at that reduced resolution. Even a 1MB ECS 68000 Amiga could have handled DOOM with limitations as the following video demonstrates.

https://www.youtube.com/watch?v=KeaNb5QzoU0

The Amiga has custom hardware too but it is not really what is needed for a game like Doom. Adding a higher performance CPU and memory solves all the problems and accelerates all types of games where the custom hardware was not as flexible and more difficult to use.

The SNES uses a 65C816 @ 3.58 MHz...basically a 16bit version of the good old 6502. You are telling me that THAT cpu is superior to any Amiga cpu? They WISELY included the SuperFX2 chip to handle the math because it runs independently of the CPU.

Applications that require a lot of SIMD instructions (aka 3D games) would flood the cache of a cpu. You'd have no room for actual game logic because all the time would be spent on 4x4 matrix math on millions of vertices. GPUs specialize in doing such math in PARALLEL. Having 1 unit in the CPU is peanuts. When you look at any gpu today with 2000+ CUDA cores (Nvidia terminology) or STREAM processors (AMD terminology)...those are basically how many SIMD units it has. One per cpu core is a joke for real world uses of SIMD...

Last edited by Lou on 20-Aug-2020 at 03:06 AM.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 20-Aug-2020 5:44:57

[ #30 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2755
From: Kansas

Quote:

Lou wrote:
Eh? You proved why it should be on a GPU and accessible via an API.
All gpus today are GPGPUs. Moving such functions to a separate chip and hitting it via an api allows you to keep more backwards compatibility over time since "our" cpu would just be a faster 68k without introducing unnecessary complexities and incompatibilities.

Let's take a closer look at how GPU shaders on a gfx card are typically used for parallel processing of data on the CPU side.

1) An intermediate language program is compiled (usually with LLVM) into a program for the GPU shaders.
2) The program and data are copied to gfx memory.
3) The CPU and/or GPU must distribute the program to shaders and start executing.
4) The GPU may be able to signal the CPU when done or the CPU has to poll the GPU until done.
5) Resulting data is copied back to CPU memory.

This process is very inefficient, has huge latency and wastes memory. With HSA as AMD proposes we would reduce this to the following steps.

1) An intermediate language program is compiled (usually with LLVM) into a program for the GPU shaders.
2) The CPU copies the program into a queue and the shaders begin executing.
3) The GPU signals the CPU when done or the CPU has to poll the GPU until done.

Much simpler. Step 1 could be eliminated if we always used the same GPU hardware but then we would be limited in how much we could change the shader ISA and maintain compatibility.
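The difference between the two flows can be sketched with a toy model that counts data copies. Everything here is a stand-in: `run_kernel` plays the compiled shader program, Python lists play the memory buffers, and none of the names correspond to a real GPU or HSA API.

```python
def run_kernel(buffer):
    """Stand-in for the compiled shader program: doubles each element in place."""
    for i, value in enumerate(buffer):
        buffer[i] = value * 2


def offload_discrete(host_data):
    """Five-step flow: copy to gfx memory, run, copy the result back."""
    copies = 0
    device_buffer = list(host_data)   # copy program input to gfx memory
    copies += 1
    run_kernel(device_buffer)         # distribute to shaders and wait
    result = list(device_buffer)      # copy resulting data back to CPU memory
    copies += 1
    return result, copies


def offload_shared(host_data):
    """HSA-style flow: queue a reference to memory both sides can address."""
    queue = [host_data]               # enqueue a reference, no data copy
    run_kernel(queue.pop())           # shaders work on the shared buffer
    return host_data, 0               # completion signal, nothing to copy back


print(offload_discrete([1, 2, 3]))
print(offload_shared([1, 2, 3]))
```

Both paths compute the same result; the shared-memory path just never duplicates the data, which is where the latency and memory savings would come from.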

Quote:

The SNES uses a 65C816 @ 3.58 MHz...basically a 16bit version of the good old 6502. You are telling me that THAT cpu is superior to any Amiga cpu? They WISELY included the SuperFX2 chip to handle the math because it runs independently of the CPU.

The 68000 has a much better ISA but instructions take too many cycles. The 68000 design was for quick development and not performance.

68000@7.16 MHz 1.25 MIPS
Ricoh 5A22@3.58 MHz ~1.5 MIPS
6502@1MHz .43 MIPS
6502@3.58MHz 1.54 MIPS

I'm not sure these numbers are correct but they look about right. MIPS aren't everything though. The 68000 doesn't need to execute as many instructions to keep up and likely still has the advantage in performance.
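For what it's worth, a quick sketch that normalizes the figures above to MIPS per MHz, using only the numbers quoted in this post (which may not be accurate) and made-up variable names:

```python
# (name, clock in MHz, MIPS) exactly as quoted above; the figures are unverified.
figures = [
    ("68000", 7.16, 1.25),
    ("Ricoh 5A22", 3.58, 1.5),
    ("6502", 1.0, 0.43),
    ("6502", 3.58, 1.54),
]


def mips_per_mhz(mhz, mips):
    """Instructions per clock cycle implied by a MIPS rating."""
    return mips / mhz


for name, mhz, mips in figures:
    print(f"{name:<10} @ {mhz:>4} MHz: {mips_per_mhz(mhz, mips):.3f} MIPS/MHz")

# How much more work per instruction the 68000 would need, on average,
# to break even with the 5A22 at these ratings.
break_even = 1.5 / 1.25
print(f"68000 break-even work per instruction: {break_even:.2f}x")
```

On these numbers the 68000 retires far fewer instructions per clock, so it only has to average a bit more work per instruction than the 5A22 to keep up.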

Quote:

Applications that require a lot of SIMD instructions (aka 3D games) would flood the cache of a cpu. You'd have no room for actual game logic because all the time would be spent on 4x4 matrix math on millions of vertices. GPUs specialize in doing such math in PARALLEL. Having 1 unit in the CPU is peanuts. When you look at any gpu today with 2000+ CUDA cores (Nvidia terminology) or STREAM processors (AMD terminology)...those are basically how many SIMD units it has. One per cpu core is a joke for real world uses of SIMD...

Good programming and/or a good SIMD unit design in a CPU core is going to avoid the DCache for large data stream processing. Many SMP CPU cores with SIMD units could do the same work as GPU shaders with the advantage that they could more easily be used for general purpose and parallel stream processing. The CPU cores would have reduced performance when slimmed down to allow more SIMD capable cores to be added. GPU shaders are already slimmed down allowing for many SIMD capable cores but their utilization is lower. The CPU SIMD shaders are appealing for lower end systems where the CPU cores are weaker anyway and the transistor budget wouldn't allow for enough shaders (for example, would the Raspberry Pi be better off with 8 CPU cores which can do shader processing or 4 CPU cores and 4 GPU shaders?). Higher performance systems need stronger cores, reducing the number of cores, and the transistor budget is high enough to allow more shaders.

Last edited by matthey on 20-Aug-2020 at 06:43 AM.
Last edited by matthey on 20-Aug-2020 at 05:51 AM.

Status: Offline

dooz

Re: Amiga SIMD unit
Posted on 20-Aug-2020 6:45:32

[ #31 ]

Member

Joined: 17-Jul-2013
Posts: 49
From: Unknown

@matthey

We already have a kind of combined embedded SIMD/FPU unit on Amiga (except ALTIVEC).

P1022 SPE (Signal Processing Engine) unit from A1222 is a 64-bit, two element, single-instruction multiple-data (SIMD) ISA. The two-element vector fits within GPRs extended to 64-bit. It doesn't have dedicated floating-point registers. GPRs are used for integer operations, extended to 64-bit to support vector single precision and scalar double precision categories.

SPE can execute floating point and vector instructions.

Embedded scalar double-precision floating-point instructions treat the GPRs as 64-bit
single-element registers for double-precision computation.
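To picture the two-element layout, here is a small sketch that packs two 32-bit integers into one 64-bit register value and adds them lane by lane. It only models the register layout described above; the helper names are invented and it does not claim to match the exact semantics of any SPE instruction.

```python
MASK32 = 0xFFFFFFFF


def pack(high, low):
    """Pack two unsigned 32-bit elements into one 64-bit register value."""
    return ((high & MASK32) << 32) | (low & MASK32)


def unpack(reg):
    """Split a 64-bit register value into its (high, low) 32-bit elements."""
    return (reg >> 32) & MASK32, reg & MASK32


def vec_add(ra, rb):
    """Lane-wise add modulo 2**32: no carry crosses from the low to the high lane."""
    ah, al = unpack(ra)
    bh, bl = unpack(rb)
    return pack(ah + bh, al + bl)


result = vec_add(pack(1, 0xFFFFFFFF), pack(2, 1))
print([hex(x) for x in unpack(result)])
```

The low lane wraps around without disturbing the high lane, which is the difference from a plain 64-bit integer add on the same register.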

Maybe it's worth mentioning.

Also detailed manual exists:

https://www.nxp.com/docs/en/reference-manual/SPEPEM.pdf

-dooz

Status: Offline

Lou

Re: Amiga SIMD unit
Posted on 20-Aug-2020 16:39:58

[ #32 ]

Elite Member

Joined: 2-Nov-2004
Posts: 4259
From: Rhode Island

@matthey

You're grasping at straws. You can always build a 'test case' to prove your example (as I said originally). It's very Gunnar-like of you.

..

...

However it's not real world.

In the real world your cpu would render at less than 5 fps and barely have anything left in the tank for sound, game logic and I/O.

Last edited by Lou on 20-Aug-2020 at 05:27 PM.
Last edited by Lou on 20-Aug-2020 at 04:41 PM.
Last edited by Lou on 20-Aug-2020 at 04:40 PM.

Status: Offline

megol

Re: Amiga SIMD unit
Posted on 20-Aug-2020 17:29:06

[ #33 ]

Regular Member

Joined: 17-Mar-2008
Posts: 355
From: Unknown

@matthey

Quote:

matthey wrote:
The register banks remind me of the StarCore DSP which has some similarities to the 68k.

o Sixteen 40-bit data registers (low bank D0-D7, high bank D8-D15)
o Eight low bank address registers (R0–R7)
o Eight high bank address registers (R8–R15), or alternatively, eight base address registers (B0–B7)
o Four offset registers (N0–N3)
o Four modifier registers (M0–M3)

In addition to all the registers, it uses prefixes and has 3 op instructions.

https://www.nxp.com/docs/en/reference-manual/MNSC140CORE.pdf

When I was on the Apollo Team, I brought up the StarCore DSP to Gunnar and we discussed it. Personally, I'm not a fan of it overall even though I like some ideas. StarCore probably is easier to program than most DSPs but is far from the 68k in consistency and ease of use.

DSPs have been the odd type of processor, not intended for general purpose code execution: strange address space designs, specialized address modes to accelerate FFT, separate code and data spaces (not inherently a bad idea but out of the norm), ...
Banks aren't nice for a scalable ISA as they add extra serialization; however, the BANK of Apollo seems to be a prefix, which isn't generally a problem in that regard.
But some of the strangeness of the Apollo may come from the idea of making a 68k compatible DSP core. Maybe that was the idea all along? To make a small DSP core that's easy to code for with good code density and sell it as a custom core?

Quote:

My rough estimate is that from RISC 32 GP integer registers to the 68k 16 GP integer registers is less than 10% memory traffic increase and less than 1% performance reduction in most designs. The 68k has several traits that reduce register usage (reg-mem/mem-mem, powerful addressing modes, large immediate support, PC relative addressing, register renaming). More GP registers are anything but a free lunch. They use more transistors, draw more power, sometimes require more time and memory to save and restore extra (caller/callee/all) registers and require more encoding space increasing code sizes.

Memory operands that aren't read-only are expensive, especially in something supporting multiprocessing/multithreading. One of my prefix ideas was to enable OP + Store without a previous read from the same address, and if I were to try to create a neo-CISC, load-op/op-store would be the first choice instead of load-op/load-op-store.
Registers are much cheaper than adding memory read ports and increasing cache capacity, that's one of the things the original RISCs got right - avoid touching memory if possible. For each memory read access there are register reads + address calculation + (TLB lookup) + n parallel lookups + n hit comparisons + data selection + (TLB hit comparison) + statistics update + data alignment shift. For each register read access there is a register read.
Memory is damn expensive and very slow, so slow that out of order execution is a requirement to reach high clock frequencies as in-order designs can't hide the 4 cycles of delay from a memory read. It's actually worse than the description above, as in a parallel pipelined design each new memory access has to be compared with the previous ones for forwarding and hazard reasons.

Quote:

You make it sound like it was a team design which I highly doubt.

Perhaps more of a dictatorship than a team then?

Quote:

ARM AArch64 has an optional Scalable Vector Extension (SVE) in addition to the standard fixed width SIMD instructions. Is this what you mean by "variable length SIMD"?

That and the RISC-V design.

Quote:

It sounds like it would be less efficient today with the abstraction but may be more efficient tomorrow with wider SIMD units (although code recompiled for a fixed width tomorrow will be more efficient). It assumes that SIMD width will continue to grow and that may not be practical as can be seen by Knights Landing down clocking cores using 512 bit wide SIMD instructions. Also, wider SIMD unit standards limit low end core designs. I expect high end core SIMD designs need support and don't mind losing some performance but lower end designs would rather have the better performance of fixed width SIMD.

It's a bit like microoptimization to a specific target vs having something scaling to the designs of tomorrow whether it's narrow or wide.

Quote:

matthey wrote:
Nvidia is using VLIW cores (Denver and Carmel) to execute AArch64 code with a "code morphing" translation layer like Transmeta (VLIW Intel i860, Transmeta, Itanium processors were EPIC failures for GP computing and financially). Nvidia at one time claimed their VLIW SoC chips (Tegra X2) were the highest performance ARM SoC chips (likely based on SpecInt2K). I expect SIMD instructions translate well to VLIW but the big problem for VLIW cores has been branches. Maybe Nvidia wants ARM so they can market their VLIW ARM cores. Rumor is that Nvidia tried to get Intel patents to code morph x86_64 but had trouble licensing the patents from Intel. While VLIW cores have had trouble with GP performance, they are low power. This could be good for embedded use, but how low end they can go is limited by the overhead of the "code morphing" translation (for example, optimized code sequences were stored in a 128 MB cache using main memory).

https://techreport.com/news/26906/nvidia-claims-haswell-class-performance-for-denver-cpu-core/

Has Nvidia finally solved the VLIW "code morphing" problems? Would they be interested in 68k or PPC "code morphing"?

Transmeta designs were general purpose processors with extra hardware to support x86 emulation; there are reverse engineered descriptions available somewhere out there, IIRC on realworldtech.com. Denver instead is something specifically designed to execute ARM code and even supports hardware decoding of ARM instructions without going through a translation process. This reduces startup delay and removes expensive binary translation for cold code.
A VLIW designed for 68k emulation would look a bit different.

Last edited by megol on 20-Aug-2020 at 05:39 PM.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 20-Aug-2020 23:03:18

[ #34 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2755
From: Kansas

Quote:

dooz wrote:
We already have kind of combined embedded SIMD/FPU unit on Amiga (except ALTIVEC).

Tabor is finally available after all the delays?

Quote:

P1022 SPE (Signal Processing Engine) unit from A1222 is a 64-bit, two element, single-instruction multiple-data (SIMD) ISA. The two-element vector fits within GPRs extended to 64-bit. It doesn't have dedicated floating-point registers. GPRs are used for integer operations, extended to 64-bit to support vector single precision and scalar double precision categories.

SPE can execute floating point and vector instructions.

Embedded scalar double-precision floating-point instructions treat the GPRs as 64-bit
single-element registers for double-precision computation.

The integer unit and a new cut down SIMD unit design share the same register file while the traditional FPU is removed. I don't like it for the following reasons.

1) The design is incompatible with PPC software which uses the traditional FPU or AltiVec.
2) The 32 bit CPU integer cores waste the upper 32 bits of the register file (wouldn't be as bad with 64 bit integer support).
3) Floating point instructions are introduced into the integer file (very rare and cringe worthy).
4) The SIMD unit is only 64 bits wide so can only do two 32 bit operations in parallel.
5) The floating point support is not IEEE compliant.

32x64 bit integer/SIMD unit register file

The designers could have kept the traditional FPU with register file and shared the register file with the cut down SIMD unit. This would have provided good PPC compatibility for existing software and doesn't require much more register file storage.

32x32 bit integer register file
32x64 bit FPU/SIMD unit shared register file

The FPU/SIMD unit register files are shared in several architectures including ARM, POWER and z/architecture.
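As a rough check on the two layouts above, the architectural storage can be added up directly. This sketch only counts visible register bits and ignores ports, rename registers and control state; the helper name is made up for illustration.

```python
def register_file_bits(files):
    """Sum the storage of a list of (register_count, width_in_bits) files."""
    return sum(count * width for count, width in files)


# SPE style: one shared 32x64 bit integer/SIMD file.
spe_bits = register_file_bits([(32, 64)])

# Alternative: 32x32 bit integer file plus a shared 32x64 bit FPU/SIMD file.
split_bits = register_file_bits([(32, 32), (32, 64)])

# A 64 bit register holds two 32 bit elements, hence two operations in parallel.
lanes_per_register = 64 // 32
```

By this count the split layout needs 3072 bits against 2048 bits for the SPE layout, so 1024 bits more of architectural state.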

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 21-Aug-2020 0:36:09

[ #35 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2755
From: Kansas

Quote:

Lou wrote:
You're grasping at straws. You can always build a 'test case' to prove your example (as I said originally). It's very Gunnar-like of you.

However it's not real world.

In the real world your cpu would render at less than 5 fps and barely have anything left in the tank for sound, game logic and I/O.

I'm open minded enough to consider different options for different applications and am really just floating ideas. I don't think you fully understand the complexity of a modern GPU most of which is hidden away and does its job like magic. Shaders aren't just added to a GPU. There has to be management of workloads (scheduling, queuing, balancing, error detection, security, etc.) which is done by a GPU core (Nvidia calls their management core a Front End Engine). The following youtube video talks about their RISC-V replacement of this core.

https://www.youtube.com/watch?v=gg1lISJfJI0

The management core has what is basically a simple OS for the GPU. Most of what it is doing is duplication of what the CPU OS does. Most GPU OSs and hardware don't support preemptive context switching but it has advantages as AMD notes in their "Heterogeneous System Architecture: A Technical Review" paper section 2.4.

Quote:

2.4. Preemption and Context Switching

TCUs provide excellent opportunities for offloading computation, but the current generation of TCU
hardware does not support pre-emptive context switching, and is therefore difficult to manage in a multiprocess environment. This has presented several problems to date:

• A rogue process might occupy the hardware for an arbitrary amount of time, because processes cannot be preempted.
• A faulted process may not allow other jobs to execute on the unit until the fault has been handled, again because the faulted process cannot be preempted.

HSA supports job preemption, flexible job scheduling, and fault-handling mechanisms to overcome the
above drawbacks. These concepts allow an HSA system (a combination of HSA hardware and HSA
system software) to maintain high throughput in a multi-process environment, as a traditional multi-user OS exposes the underlying hardware to the user.

To accomplish this, HSA-compliant hardware provides mechanisms to guarantee that no TCU process
(graphics or compute) can prevent other TCU processes from making forward progress within a
reasonable time.

http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/hsa10.pdf
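The rogue process problem in the quoted section can be illustrated with a toy scheduler simulation. This is only a sketch: the job names and work units are invented, and context switch cost is assumed to be zero, which flatters the preemptive case.

```python
from collections import deque


def run_to_completion(jobs):
    """Non-preemptive: each job holds the unit until it finishes."""
    clock = 0
    finish = {}
    for name, work in jobs:
        clock += work
        finish[name] = clock
    return finish


def round_robin(jobs, time_slice):
    """Preemptive: a job is switched out after time_slice units of work."""
    queue = deque(jobs)
    clock = 0
    finish = {}
    while queue:
        name, work = queue.popleft()
        ran = min(time_slice, work)
        clock += ran
        if work > ran:
            queue.append((name, work - ran))  # not done, back of the queue
        else:
            finish[name] = clock
    return finish


jobs = [("rogue", 1000), ("frame", 4), ("audio", 2)]
blocked = run_to_completion(jobs)
shared = round_robin(jobs, time_slice=2)
```

Without preemption the short jobs wait behind the rogue job for its full 1000 units. With a time slice they finish within the first few slices, and the rogue job still completes at the same total time.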

A CPU OS supporting SMP in an HSA system should be able to schedule the SMP cores to do the GPU workloads. This offers several advantages: a simpler design, less duplication (which saves resources), parallel processing power that can more easily be used for general purpose workloads, faster development time, and scaling with the number of SMP cores, in addition to the advantages mentioned in the HSA paper. Disadvantages include higher CPU core context switching overhead and fewer parallel processing cores as CPU cores are strengthened.

Looking at lower end hardware like the Raspberry Pi, the CPU cores appear to have more processing power than the GPU programmable cores (the CPU cores are clocked much higher but support down clocking to save energy). The SMP process affinity masks could be set to exclude the GPU workloads from the first few cores if process prioritization is inadequate for balancing loads.
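A minimal sketch of such an affinity mask, assuming one bit per core with core 0 as the lowest bit. The function names are hypothetical and how an OS would apply the mask is left out.

```python
def affinity_mask(total_cores, reserved_cores):
    """Bitmask of cores allowed to run shader work, lowest cores excluded."""
    if not 0 <= reserved_cores <= total_cores:
        raise ValueError("reserved_cores must be between 0 and total_cores")
    all_cores = (1 << total_cores) - 1
    reserved = (1 << reserved_cores) - 1
    return all_cores & ~reserved


def cores_in(mask):
    """List the core numbers whose bit is set in the mask."""
    return [core for core in range(mask.bit_length()) if mask & (1 << core)]


# Four cores, with core 0 kept free for the OS, game logic and I/O.
shader_mask = affinity_mask(total_cores=4, reserved_cores=1)
shader_cores = cores_in(shader_mask)
```

Here the shader workloads would be allowed on cores 1 to 3 only.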

There is still a need for specialized 3D hardware logic for the more consistent fixed workloads, and this does most of the work. The idea above only concerns the shader workloads, which need the flexibility that programmability provides.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 21-Aug-2020 2:58:17

[ #36 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2755
From: Kansas

Quote:

megol wrote:
DSPs have been the odd types of processors not intended for general purpose code execution. Strange address space design, specialized address modes to accelerate FFT, separate code and data spaces (not inherently a bad idea but out of the norm), ...
Banks aren't nice for a scalable ISA as they add extra serialization; however, the BANK of Apollo seems to be a prefix, which isn't generally a problem in that regard.
But some of the strangeness of the Apollo may come from the idea of making a 68k compatible DSP core, maybe that was the idea all along? To make a small DSP core that's easy to code for with good code density and sell as a custom core?

I did *not* get the feeling that Gunnar wanted to create a DSP but he supported improvements to the 68k to handle a wider variety of workloads thus improving its performance and value to make it universally more appealing. I believe he wanted to keep the Apollo Core general purpose which is a goal we shared.

Quote:

Memory operands that aren't read only are expensive, especially in something supporting multiprocessing/multithreading. One of my prefix ideas was to enable OP + Store without a previous read from the same address, and if I tried to create a neo-CISC, load-op/op-store would be the first choice instead of load-op/load-op-store.

The biggest benefit for RISC would probably come from the more common and simpler load+OP but maybe it can be done with code fusion. Losing the RMW OP+store would be a nice simplification but sometimes hurts.

addq.l #1,(a0) ; 3 instructions with RISC but only 2 if allowing load+OP
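The instruction counts in that comment can be spelled out as a small table. Only the 68k line comes from the post; the other mnemonics are invented pseudo-assembly for illustration, not any real ISA.

```python
def expand_increment(style):
    """Illustrative sequences for 'add 1 to the longword at (a0)'."""
    sequences = {
        # 68k read-modify-write: one instruction does load, add and store.
        "cisc_rmw": ["addq.l #1,(a0)"],
        # Hypothetical load+OP ISA: memory source operand, separate store.
        "load_op": ["add   r1, (a0), 1", "store r1, (a0)"],
        # Classic load/store RISC: every step is its own instruction.
        "load_store": ["load  r1, (a0)", "add   r1, r1, 1", "store r1, (a0)"],
    }
    return sequences[style]


counts = {
    style: len(expand_increment(style))
    for style in ("cisc_rmw", "load_op", "load_store")
}
```

The count says nothing about cycles, since a single RMW instruction still performs the memory read and write internally.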

Quote:

Registers are much cheaper than adding memory read ports and increasing cache capacity, that's one of the things the original RISCs got right - avoid touching memory if possible. For each memory read access there are register reads + address calculation + (TLB lookup) + n parallel lookups + n hit comparisons + data selection + (TLB hit comparison) + statistics update + data alignment shift. For each register read access there is a register read.

RISC has more memory traffic and more mostly dependent instructions to execute than CISC. Registers were added to reduce the memory traffic but instruction sizes grew to accommodate them so now you have more instructions and larger instructions on average needing more caches and causing an instruction fetch bottleneck when trying to keep up with CISC. It's good to reduce memory traffic but is it worth it for a less than 10% reduction in memory traffic most of which will hit in caches giving a barely measurable performance loss?

Adding registers and register ports gets expensive when accesses to the register file increase a cycle. The Alpha 21264 used 2 duplicate register files to reduce the register ports, likely to keep the register file access time 1 cycle. The Alpha 21264 only had 32 integer registers (reg file included 40 rename registers and 8 shadow registers) serving 2 integer units and 2 load/store units. Performance can drop in half in some cases when the reg file access time is increased by 1 cycle. Some of the performance (20% in SpecInt95) can be gained back by adding a 2nd level of bypass/forwarding logic but this is expensive too. A single cycle register file with bypass/forwarding provided a further 22% speedup.

https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf

Quote:

Memory is damn expensive and very slow, so slow that out of order execution is a requirement to reach high clock frequencies as in-order designs can't hide the 4 cycles of delay from a memory read. It's actually worse than the description above as in a parallel pipelined design each new memory access has to be compared with the previous ones for forwarding and hazard reasons.

OoO can only hide a few cycles of cache/mem latency sometimes. RISC has more instructions to execute to keep up with CISC and can take several cycles longer to calculate an EA because of dependent code (AArch64 is an exception as they added more powerful CISC like addressing modes). These delays mean RISC gets started later and OoO brings back some of the loss compared to CISC (gives better single thread performance to CISC too). There can be a lot of stalling waiting on more distant caches and memory for even OoO. Multi-threading helps with this more than OoO and is more energy efficient but doesn't give the good single thread performance that most people want.

Quote:

Perhaps more of a dictatorship than a team then?

The team is there to give ideas to and do work for Gunnar. German management style is commonly more along these lines with a top down "dictatorship" style management. U.S. management style commonly has more shared decision making and team effort. This idea of leadership styles came from a book called "Creating Modern Capitalism" which examines and compares the industrial revolutions in several countries.

Quote:

Transmeta designs were general purpose processors with extra hardware to support x86 emulation; there are reverse engineered descriptions available somewhere out there, IIRC on realworldtech.com. Denver instead is something specifically designed to execute ARM code and even supports hardware decoding of ARM instructions without going through a translation process. This reduces startup delay and removes expensive binary translation for cold code.
A VLIW designed for 68k emulation would look a bit different.

Some of the engineers for the Nvidia VLIW project came from the Transmeta project. It looks like code morphing to me. Yes, tuning the code morphing for a particular architecture is quite a difficult task and I was joking when I mentioned the 68k and PPC. It looks like they are using a trace cache to store converted code reducing the normally long latency of the "translation process". Longer latencies are probably tolerable for servers as payment is usually based on data throughput, especially if energy used is reduced. I wouldn't think a trace cache would work as well for server workloads where I would expect a larger variety of programs would reduce cache effectiveness.

Last edited by matthey on 22-Aug-2020 at 06:40 PM.

Status: Offline

Fl@sh

Re: Amiga SIMD unit
Posted on 22-Aug-2020 7:00:43

[ #37 ]

Regular Member

Joined: 6-Oct-2004
Posts: 253
From: Napoli - Italy

About GPU use for the AmigaLike market, IMHO it’s far better to consider OS/apps development based on the CPU’s SIMD resources rather than base any idea on a future GPU driver to be developed.
As seen in later years, AMD/Nvidia documentation is nearly nonexistent and driver development is really difficult and based mostly on Linux sources.
Instead, every CPU with SIMD support has very well described documentation and gcc is well optimised to generate vector code.

It’s also my opinion that the best choice for a winning SIMD unit would be dynamic datatype length support with the ability to process from 8bit to 256bit integer/float operations.

My2cents.

Last edited by Fl@sh on 23-Aug-2020 at 04:39 AM.

_________________
Pegasos II G4@1GHz 2GB Radeon 9250 256MB
AmigaOS4.1 fe - MorphOS - Debian 9 Jessie

Status: Offline

megol

Re: Amiga SIMD unit
Posted on 22-Aug-2020 18:56:03

[ #38 ]

Regular Member

Joined: 17-Mar-2008
Posts: 355
From: Unknown

@matthey

RMW instructions help in the case where memory has to be used as variable storage - however that has always been a suboptimal situation since at least the 90's. It's slower and wastes a lot of power.
The problem isn't effective address calculation but the other things I listed, things that (generally) can't be avoided especially in a high performance design. Normally requiring two instructions or more for a complex address isn't a problem as in the most common case the heavy work can be done outside the inner loop.

OoO in combination with wider execution is very powerful to effectively hide L1 and in many cases L2 latency. Looking in the literature of yesterday it's obvious that L1 latency needed to be as low as possible not to tank general performance, however the architectures of today are moving away from that instead having comparatively large L1 caches with longer latency. The difference is deeper OoO and more capable branch prediction.

RMW instructions have some uses but they shouldn't be seen as an alternative to larger register files. They are less power efficient, not as generally useful, and require more transistors to implement efficiently.

--
Binary translation doesn't need a trace cache as code is translated and stored in main memory, but it works kind of like a trace cache. One way to look at it is that a trace cache is binary translation done in hardware, as remapping of branches (=detecting presence in the cache) and code rewrite (=branch elimination, misc. compaction) are similar. Of course a hybrid software/hardware design would be possible and Denver may have something like that.
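A toy model of the translate-once, reuse-later behaviour described here. It is a sketch only: the class, the translate callback and the guest addresses are made up, and a real translator would also handle invalidation, block chaining and eviction.

```python
class TranslationCache:
    """Maps a guest code address to an already translated host block."""

    def __init__(self, translate):
        self._translate = translate  # callback doing the expensive translation
        self._blocks = {}            # guest_pc -> translated block (in "main memory")
        self.translations = 0        # how often the slow path was taken

    def lookup(self, guest_pc):
        block = self._blocks.get(guest_pc)
        if block is None:
            # Cold code: translate once and keep the result.
            block = self._translate(guest_pc)
            self._blocks[guest_pc] = block
            self.translations += 1
        return block


def toy_translate(guest_pc):
    # Stand-in for real code generation.
    return f"host_block_for_{guest_pc:#x}"


cache = TranslationCache(toy_translate)
trace = [0x1000, 0x1040, 0x1000, 0x1000, 0x1040]
blocks = [cache.lookup(pc) for pc in trace]
```

Five block executions here cost only two translations, which is where the similarity to a trace cache comes from: hot code hits the stored, already rewritten version.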

Status: Offline

Lou

Re: Amiga SIMD unit
Posted on 22-Aug-2020 23:28:14

[ #39 ]

Elite Member

Joined: 2-Nov-2004
Posts: 4259
From: Rhode Island

@Fl@sh

Quote:

Fl@sh wrote:
About GPU use for the AmigaLike market, IMHO it’s far better to consider OS/apps development based on the CPU’s SIMD resources rather than base any idea on a future GPU driver to be developed.
As seen in later years, AMD/Nvidia documentation is nearly nonexistent and driver development is really difficult and based mostly on Linux sources.
Instead, every CPU with SIMD support has very well described documentation and gcc is well optimised to generate vector code.

It’s also my opinion that the best choice for a winning SIMD unit would be dynamic datatype length support with the ability to process from 8bit to 256bit integer/float operations.

My2cents.

The reason AMD Linux drivers exist is because AMD has been quite open with their drivers. For instance it is because AMD open-sourced "Mantle" that the Vulkan API exists...and why DirectX 12 has improved.

I am personally sporting an AMD Radeon VII in my main machine with 3840 "stream processors"...
https://en.wikipedia.org/wiki/Stream_processing

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 24-Aug-2020 1:51:20

[ #40 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2755
From: Kansas

Quote:

Fl@sh wrote:
About GPU use for the AmigaLike market, IMHO it’s far better to consider OS/apps development based on the CPU’s SIMD resources rather than base any idea on a future GPU driver to be developed.
As seen in later years, AMD/Nvidia documentation is nearly nonexistent and driver development is really difficult and based mostly on Linux sources.
Instead, every CPU with SIMD support has very well described documentation and gcc is well optimised to generate vector code.

AMD documentation was open enough that open hardware Southern Islands GPU shaders could be created using the API.

http://miaowgpu.org/

There is an existing A-Eon Warp3D Nova driver for Southern Islands, although compatibility would not be as simple as adding shaders to the hardware. Other software that uses the GPU could more easily be ported as well. The downside is that newer GPU shaders are likely more flexible and capable of supporting ray tracing. Using CPU core SIMD/vector units is very flexible and easier to use in many ways but has disadvantages too.

Quote:

It’s also my opinion that the best choice for a winning SIMD unit would be dynamic datatype length support with the ability to process from 8bit to 256bit integer/float operations.

Dynamic vectorization avoids the encoding space bloat and corner cases. The following article compares dynamic vectorization to fixed SIMD (comments are interesting too).
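A back-of-the-envelope sketch of the encoding bloat argument. The operation count and the width lists are invented for illustration and are not taken from any real ISA. The assumption is that a fixed SIMD ISA needs a separate encoding per operation, element width and register width, while a dynamic vector ISA sets element width and length as state.

```python
def fixed_simd_encodings(operations, element_widths, register_widths):
    """One encoding per (operation, element width, register width) combination."""
    return operations * len(element_widths) * len(register_widths)


def dynamic_vector_encodings(operations):
    """Element width and vector length are configured, not encoded per opcode."""
    return operations


operations = 50                      # hypothetical number of distinct operations
element_widths = [8, 16, 32, 64]     # bits per element
register_widths = [128, 256, 512]    # one entry per SIMD generation

fixed = fixed_simd_encodings(operations, element_widths, register_widths)
dynamic = dynamic_vector_encodings(operations)
```

With these made-up numbers the fixed width approach needs twelve times as many encodings, and each new register width adds another full set.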

https://www.sigarch.org/simd-instructions-considered-harmful/

SIMD instructions are a large part of the 1500+ instructions in x86_64. RISC POWER and AArch64 are trying to compete for the most instructions (POWER came from the IBM 801 which had all single cycle instructions and AArch64 is supposedly for embedded where large cores are a disadvantage).

The open hardware Libre project is trying to create a hybrid CPU+GPU+VPU SoC using dynamic vectorization for embedded and Unix applications.

https://libre-soc.org/

AmigaBlitter gave the same link in the "POWER related news" thread. They were originally using RISC-V with a custom extension for dynamic vectorization as the RISC-V standard was inadequate for what they needed but had a falling out and partially switched to POWER (yes, weird dual POWER/RISC-V ISA). The OpenPOWER Foundation has been much more cooperative but POWER isn't as suitable (although it now is a VLE but not using the more compact VLE).

https://libre-soc.org/openpower/

It is a little bit difficult to understand what they are trying to do. The CPU core is a simple 64 bit in order RISC core with an OoO dynamic vector unit using their custom extensions. The following video goes over slides and explains more.

https://www.youtube.com/watch?v=HeVz-z4D8os

The 8th page of the slide gives the following point.

> Started work on a "Precise" CDC 6600 style OoO Engine, with help from Mitch Alsup, the designer of the M68000

The video talks more about Mitch Alsup. Mitch was the guy I was going to try and bring in to look at the Apollo Core source but I found out just how closed and controlled the project is.

Last edited by matthey on 24-Aug-2020 at 01:52 AM.

Status: Offline


Amigaworld.net was originally founded by David Doyle