Amigaworld.net - The Amiga Computer Community Portal Website

Amiga SIMD unit


A1200coder

Re: Amiga SIMD unit
Posted on 3-Oct-2020 13:04:20

[ #121 ]

New Member

Joined: 5-Oct-2019
Posts: 4
From: Unknown

@Hammer

Quote:

I made my statement with both real-world performance and microarchitecture considerations.

For Doom and Quake frame rate results, 68080's performance didn't match my old Pentium 166Mhz and S3 Trio 64U.

It's Doom benchmark time
https://www.complang.tuwien.ac.at/misc/doombench.html
doom -timedemo demo3

Pentium 166 with Diamond Stealth S3 Trio64V+ has 98.9 fps

Pentium 90 with Cirrus Logic 5434 has 50 fps

I was being generous.

It's nearly pointless to argue 68080's quad instruction issue per cycle capability when 68080's FPU is NOT Core 2 level.

This is why I argue for any future 68090 design should focus on multi-pipeline FPU to improve Quake-type game engines.

Intel Core 2 has multi-pipeline 128-bit SIMD integer and FPU hardware, 128-bit-wide load/store units, and 64-bit-wide ALUs/AGUs. You can't say the same for 68080!

Pentium III has 64bit SIMD SSE FADD and 64bit FMUL, hence Pentium III can do one 32bit ALU, 64bit SIMD SSE FADD, and 64bit FMUL which is effectively five 32bit instructions per clock cycle.

Doom is perhaps not a good benchmark, at least if you run it with a c2p version, as there is an extra pass that is not needed for the Vampire RTG display, which, like PC displays, uses chunky pixels. On the other hand, I would not expect a 68080 to be twice as fast as a Pentium MMX CPU unless you optimize the code for the 68080. The FPU of the 68080 is another matter; it's certainly not at Core 2 level (and I didn't claim that), and I'm not sure it's even a lot better than the FPU of the Pentium. You are also right that Pentium 3 SSE is better than MMX, but for integer/memory performance, the 68080 should be no slower than a Pentium 2 or 3.

But certainly the 68080 is faster than Pentium MMX, since Pentium, like 68060, can only execute max 2 instructions per clock.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 4-Oct-2020 1:15:44

[ #122 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2025
From: Kansas

Quote:

Hammer wrote:
According to https://twitter.com/tom_forsyth/status/641016896033173505
In HW functionality, GCN and LRB (Intel Larrabee, x86 GPU) are very close. Not exposed by most languages though.

Yes, the Larrabee architecture was likely intended to be not only a GPGPU but a GPU using many x86_64 cores.

https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing

AVX-512 was developed from the project. In order to fit more cores, the cores were based on the ancient in-order Pentium P54C (circa 1994 and the closest competitor to the 68060, but using 28% more transistors and 73% more power). I expect the reasons for failure were too high a power draw (the 72 core Knights Landing used 260 Watts, which would be about 151 Watts if power could be scaled down by the same percentage as the 68060 power advantage over the Pentium) and single thread performance that was weak by today's standards. The latter problem would mean that most modern games could not be played with acceptable performance despite the x86_64 compatible cores and huge parallel performance.
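As a quick sanity check of the power scaling above, here is a minimal sketch. It assumes whole-chip power can simply be divided by the 1.73 ratio implied by the "73% more power" figure, which is a simplification; the names `scale_power` and `scaled_watts` are only illustrative.

```python
# Figures are taken from the post; the scaling rule itself is an assumption.
KNIGHTS_LANDING_WATTS = 260.0  # 72 core Knights Landing, as quoted
PENTIUM_EXTRA_POWER = 0.73     # P54C said to use 73% more power than the 68060


def scale_power(watts: float, extra_fraction: float) -> float:
    """Scale a power figure down by the same ratio as the per-core advantage."""
    return watts / (1.0 + extra_fraction)


scaled_watts = scale_power(KNIGHTS_LANDING_WATTS, PENTIUM_EXTRA_POWER)
print(f"Scaled power: {scaled_watts:.1f} W")
```

Under that assumption the result lands at roughly 150 W, within about a watt of the figure mentioned above.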

The Cell architecture (PS3; the Xbox 360 used related in-order PPC cores) detached some of the SIMD units (SPEs) from the PPC core to allow for more SIMD performance. Weak in-order PPC cores were clocked up to give more SIMD performance. They were low enough power to fit in a console but were likely abandoned for lack of single thread performance for games (replaced with x86_64 hardware with excellent single thread performance) and difficulty of programming.

Both the Larrabee and Cell architectures had enough parallel processing power to serve as a scalable GPU. More x86_64 cores could be added for Larrabee and more SPEs for Cell yet both came up short.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 4-Oct-2020 3:20:42

[ #123 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2025
From: Kansas

Quote:

Fl@sh wrote:
You are making things much harder for PPC.
I know you are in love with 68k and it can cause a loss of impartiality.

Anyway for ppc I have seen prologue and epilogue simpler than in your examples.

The PPC prologue and epilogue code is recommended from the Freescale/NXP "AltiVec Technology Programming Interface Manual".

Quote:

Obviously in ppc code you are saving and restoring even vector registers, not present in 68k.

That is one of the disadvantages of having more registers and different register files. For example, x86_64 has a shared SIMD and FPU register file which reduces this cost. If the 68k also had shared FPU and SIMD registers then the 68k code I posted may suffice. Sharing SIMD and FPU registers is common, including from IBM (POWER and z/Architecture), ARM, AMD and Intel.

Quote:

The best case scenario instead, IMHO, is to save nothing and use volatile registers as much as possible in PPC code for normal user program function calls.

If you use only volatile registers on PPC then you have 11 gp integer registers which is less than the 68k and x86_64 with 16 and RISC needs more registers as performance degrades quickly when out of registers. The PPC stack frame is often needed for local variables, local storage, varargs, etc. (all but global variables which are usually discouraged and dynamic memory allocations) so typically the cost of the stack frame would already be incurred, especially with function inlining popularity today.

Quote:

You chose the worst scenario, where a function call needs to save GPR, FPU and vector registers all at once, and obviously restore them all at the end.

There are several entry points for the register saving functions. The top entries can be skipped if those registers do not require saving and even the whole branch is removed for each register file if no registers need saving. The SIMD unit is rarely used and more SIMD registers are volatile so rarely need saving and restoring.

Quote:

I guess in such a complex function, where all these different registers are involved, the save and restore of non-volatile resources is really a minor overhead compared with the complexity of the job the CPU is doing.

The compiler tries to figure out when it is worthwhile to save and restore non volatile registers but it is worthwhile in most cases with RISC because of the high overhead of using memory. Using 12 registers instead of 11 registers can generate a costly stack frame which the compiler may have trouble computing the cost of. Using more registers can sometimes slow performance. Context switches where all the registers are saved are slower too.

Quote:

I want to repeat that the PPC ISA is the most recent of all the others, with the exception of RISC-V, and it was developed with the future in mind. This is the reason why in the PPC world we had a soft 64-bit transition, little-endian support, the latest powerful VMX extensions, and embedded and custom cores ranging from the car industry to gaming consoles, all sharing the same code.

AArch64 is a newer ISA too. It was created for 64 bit from the start, has a powerful and standard SIMD unit, has better PC relative addressing, has more powerful addressing modes, has more friendly assembler and has better code density. PPC competes better with RISC-V but RISC-V has much better code density and is simpler. POWER has replaced PPC at the high end and ARM at the low/embedded end squeezing PPC out of existence. PPC is old in technology years and showing its age.

Quote:

On paper PPC is a good architecture, better than others; the handicap lies in vendors' implementations and the economies of scale of fab processes.
With the right investments in research and fab process, we could have today a Z80 clocked at 10GHz, maybe faster than any x86.

PPC had several top core designers including IBM and Motorola/Freescale/NXP and was using a competitive fab process at one time but couldn't compete in performance with x86_64. It was used for consoles and embedded cores where it offered acceptable performance for the power used, but x86_64 and AArch64 cores improved faster in performance. New PPC designs have stopped and the fab process of what is left of PPC chips on the shelves is getting old.

Quote:

With the right investments in research and fab process, we could have today a Z80 clocked at 10GHz, maybe faster than any x86.

A high clocked Z80 would have weak performance. It is missing too many modern features like pipelining and superscalarity. There is a pipelined 24 bit eZ80 embedded microprocessor, introduced in 2001, which clocks to 50MHz. Its strength is the Z80 compatibility and minimalist area though. A 68000 compatible core with pipelining could probably be clocked to more GHz than most modern cores but it too would have weak performance in comparison. High clocked processors are fine tuned engineering designs and not old cores clocked up.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 4-Oct-2020 4:43:48

[ #124 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2025
From: Kansas

Quote:

A1200coder wrote:
Doom is perhaps not a good benchmark, at least if you run it with a c2p version, as there is an extra pass that is not needed for the Vampire RTG display, which, like PC displays, uses chunky pixels. On the other hand, I would not expect a 68080 to be twice as fast as a Pentium MMX CPU unless you optimize the code for the 68080. The FPU of the 68080 is another matter; it's certainly not at Core 2 level (and I didn't claim that), and I'm not sure it's even a lot better than the FPU of the Pentium. You are also right that Pentium 3 SSE is better than MMX, but for integer/memory performance, the 68080 should be no slower than a Pentium 2 or 3.

The Amiga has RTG versions of Doom, and c2p overhead becomes small after advancing out of the stone age of processors. I believe Quake is a better overall benchmark as it uses floating point and stresses caches and memory. The Apollo Core FPU is missing even some instructions which the 68060 supports and has lost some precision, but it is pipelined and takes advantage of the larger caches and high memory bandwidth. The Pentium FPU still has many more instructions in hardware; as a funky stack based FPU it is more difficult to use, but it has good theoretical performance with hand laid assembler code. At the same clock speed, the winner would probably be decided by which instructions are used but I would go with the Apollo Core FPU.

The 68060@50MHz provided 600MB/s from the Caches to the pipelines and the integer execute engines could sustain 1200MB/s transfer rates. The Vampire stand alone uses DDR3 memory which provides a data rate of 6400MB/s at 100MHz. The DDR3 memory throughput is about ten times the throughput of the 68060 caches even though latency is much better for the caches. The Pentium 2 and 3 are going to be closer to a 68060 than to a modern CPU. These old processors could be connected to modern memory and would get a nice performance boost as if using off chip caches but the advancement in fab technology inside the CPU chip is huge in comparison.
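The ratios in the paragraph above can be checked with a few lines of arithmetic. This is only a sketch using the figures as quoted in the post; the helper name `bytes_per_cycle` is made up for illustration, and the comparison ignores latency entirely.

```python
# Throughput figures as quoted in the post (MB/s); nothing here is measured.
CACHE_MB_S = 600      # 68060@50MHz, caches to pipelines
EXECUTE_MB_S = 1200   # 68060@50MHz, integer execute engines
DDR3_MB_S = 6400      # Vampire stand alone DDR3 data rate
CLOCK_68060_MHZ = 50


def bytes_per_cycle(mb_per_s: float, clock_mhz: float) -> float:
    """Convert MB/s at a given clock into bytes moved per clock cycle."""
    return mb_per_s / clock_mhz


cache_ratio = DDR3_MB_S / CACHE_MB_S
execute_ratio = DDR3_MB_S / EXECUTE_MB_S
cache_bytes = bytes_per_cycle(CACHE_MB_S, CLOCK_68060_MHZ)
execute_bytes = bytes_per_cycle(EXECUTE_MB_S, CLOCK_68060_MHZ)

print(f"DDR3 vs 68060 caches:  {cache_ratio:.1f}x")
print(f"DDR3 vs 68060 execute: {execute_ratio:.1f}x")
print(f"68060 caches:  {cache_bytes:.0f} bytes per cycle")
print(f"68060 execute: {execute_bytes:.0f} bytes per cycle")
```

The "about ten times" figure corresponds to the first ratio; the second shows the gap is smaller when compared against what the execute engines can sustain.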

Quote:

But certainly the 68080 is faster than Pentium MMX, since Pentium, like 68060, can only execute max 2 instructions per clock.

The 68060 can execute a max of 3 instructions per clock.

Quote:

Additionally, the use of instruction folding techniques allow one or two instructions to be
simultaneously executed with a predicted taken Bcc (also for BRA and JMP instructions).

- M68060 User's Manual 10-8

Max or peak instructions per cycle isn't very useful for judging performance. It is average Instructions Per Cycle (IPC) or the inverse Cycles Per Instruction (CPI) that is important.

https://en.wikipedia.org/wiki/Instructions_per_cycle
https://en.wikipedia.org/wiki/Cycles_per_instruction

Motorola measured only 1.2 CPI on a range of desktop and embedded applications. Existing code issued pairs/triplets 45%-55% of the time while targeted 68060 code issued pairs/triplets 50%-65% of the time. Obviously there were many stalls with such small caches but this was actually pretty good for the time, an in order core and cache sizes. Old CPUs commonly used CPI while modern CPUs usually use IPC.
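Since CPI and IPC are simply inverses, the 1.2 CPI figure converts directly. The sketch below also derives a rough instructions-per-second number, assuming the 50MHz clock mentioned earlier for the 68060; the function names are illustrative only.

```python
MEASURED_CPI = 1.2   # average cycles per instruction, as quoted from Motorola
CLOCK_MHZ = 50.0     # assumed clock, matching the 68060@50MHz example above


def cpi_to_ipc(cpi: float) -> float:
    """IPC is the reciprocal of CPI."""
    return 1.0 / cpi


def native_mips(clock_mhz: float, cpi: float) -> float:
    """Millions of instructions per second for a given clock and average CPI."""
    return clock_mhz / cpi


ipc = cpi_to_ipc(MEASURED_CPI)
mips = native_mips(CLOCK_MHZ, MEASURED_CPI)
print(f"IPC:  {ipc:.2f}")
print(f"MIPS: {mips:.1f}")
```

An average IPC a little under 1 on a core with a peak of 3 shows how far the average sits from the peak figure.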

Too bad the performance counters were removed from the Apollo Core to save space.

Status: Offline

NutsAboutAmiga

Re: Amiga SIMD unit
Posted on 4-Oct-2020 10:11:25

[ #125 ]

Elite Member

Joined: 9-Jun-2004
Posts: 12825
From: Norway

@Hammer

I think your graphs are interesting, but I don’t think they are telling what you say they are.

CPU in

Xbox360: 34

PS3: 105

PS4: 98

Xbox One: 113

GPU graphics:

PS3:
NVIDIA G70 (previously known as NV47) architecture

PS4: 1600
(AMD GPGPU Graphics Core Next (GCN))
This is a RISC-based GPU, from what I read; funnily enough, it is used in these GPU cards:
(Radeon HD 7000, HD 8000, 200, 300, 400, 500 and Vega series of AMD Radeon graphics cards)

Xbox One: 830
GPU that’s very similar to the Radeon 7790.

Xbox One and PS3 CPU speed being slightly faster than the PS4's is maybe an indication that most games do not really need a major CPU speed boost; the change of CPU architecture was maybe more political, or had to do with factors other than speed.

The fundamental game play has not changed a lot between the first Tomb Raider on the PS1 and the last one on the PS4, but the difference in the graphics is major.

How many CPU cores do you need to play Tetris?

You can update the graphics of Tetris to make it real-time raytraced, but the game logic won’t change.

Last edited by NutsAboutAmiga on 04-Oct-2020 at 01:05 PM.
Last edited by NutsAboutAmiga on 04-Oct-2020 at 11:13 AM.
Last edited by NutsAboutAmiga on 04-Oct-2020 at 11:12 AM.
Last edited by NutsAboutAmiga on 04-Oct-2020 at 10:25 AM.
Last edited by NutsAboutAmiga on 04-Oct-2020 at 10:21 AM.
Last edited by NutsAboutAmiga on 04-Oct-2020 at 10:20 AM.
Last edited by NutsAboutAmiga on 04-Oct-2020 at 10:20 AM.
Last edited by NutsAboutAmiga on 04-Oct-2020 at 10:11 AM.

_________________
http://lifeofliveforit.blogspot.no/
Facebook::LiveForIt Software for AmigaOS

Status: Offline

NutsAboutAmiga

Re: Amiga SIMD unit
Posted on 4-Oct-2020 11:08:35

[ #126 ]

Elite Member

Joined: 9-Jun-2004
Posts: 12825
From: Norway

@Hammer

Quote:
Mai Logic has an incompetent Northbridge design, worse than VIA's, regardless of the Amiga's requirements.

I’m not sure. I remember having lots of issues with my Athlon PC, with crashes and freezes; that’s when I switched to using Linux for many years, because the drivers worked better. Maybe some quirks were worked out by the time we got the chips in the AmigaONE-SE/XEs, but the issues were bad on PCs.

https://www.vogons.org/viewtopic.php?f=46&t=43116

Quote:
I have a classic AmigaOS 4.1 FE PowerPC on WinUAE, and running a WHDLoad 68K game which triggers another UAE is LOL, which is not much different from AROS x86's UAE emulation box.

Well, it’s your choice to use AmigaOS 4.1 in WinUAE. Clearly, if you want to play WHDLoad games or ADFs you should maybe buy a Raspberry Pi and not waste your money on an expensive PC. Or buy a Minimig; the Vampire is maybe overkill.

You can also run AMOS games in Amos Kittens without using EUAE, if you have the source code.

A lot of work was put into running classic software that was not system friendly (NallePuh, Blitzen, CIAAgent), but there is a limit to what you can do, as you point out.

A lot of tools created early on used EUAE simply because a lot of source was not available. For example, I can’t find the PowerPacker code, but a lot of code was statically linked into other products like UADE, so tools like PPMore and powerpacker.library can be recreated from the UADE source code.

I guess you don’t have an issue with PPMore and powerpacker.library on WinUAE, because you have access to the chipset. I guess the issue is that it uses color flashes to tell the user that something is unpacking, so it writes directly to the hardware (Amiga 500 chipset) to recreate the C64 color flashing.

chipset.library is an attempt to provide the chipset on demand, making it easier to change the assembler code so that all hardware accesses are redirected, instead of using the slow EUAE sandbox.

Quote:
PS; I signed up for Apollo's Vampire 1200 V2 since mid-2020 for my A1200 and its price is not as crazy when compared to Amiga One XE, X1000 and X5000.

Yes, but you’re only getting a framebuffer as output on the Vampire; you don’t get a Radeon RX or HD card in your Vampire. So the price comparison is maybe not so bad, considering you get a lot less for your money.

Last edited by NutsAboutAmiga on 04-Oct-2020 at 11:38 AM.
Last edited by NutsAboutAmiga on 04-Oct-2020 at 11:22 AM.
Last edited by NutsAboutAmiga on 04-Oct-2020 at 11:19 AM.
Last edited by NutsAboutAmiga on 04-Oct-2020 at 11:15 AM.
Last edited by NutsAboutAmiga on 04-Oct-2020 at 11:14 AM.

_________________
http://lifeofliveforit.blogspot.no/
Facebook::LiveForIt Software for AmigaOS

Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 4-Oct-2020 14:46:30

[ #127 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5312
From: Australia

@NutsAboutAmiga

Quote:

NutsAboutAmiga wrote:
@Hammer

I think your graphs are interesting, but I don’t think they are telling what you say they are.

CPU in

Xbox360: 34

PS3: 105

PS4: 98

Xbox One: 113

My comments were about real-world IPC.

Clock speed:
Xbox 360 CPU: 3.2 GHz

PS3 CPU/SPU: 3.2 GHz

PS4 CPU: 1.6 GHz

XBO CPU: 1.75 GHz
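To make the per-clock comparison concrete, the quoted CPU figures can be divided by the clock speeds listed above. This is a hedged sketch: it assumes the quoted numbers are scores from the same benchmark, and score per GHz is only a crude stand-in for real-world IPC since it ignores core counts and workload. The dictionary names are illustrative.

```python
# CPU figures quoted from the earlier post (units not stated there).
scores = {"Xbox 360": 34, "PS3": 105, "PS4": 98, "Xbox One": 113}
# Clock speeds in GHz as listed in this post.
clocks_ghz = {"Xbox 360": 3.2, "PS3": 3.2, "PS4": 1.6, "Xbox One": 1.75}

# Score per GHz: a rough per-clock throughput proxy, not a measured IPC.
per_ghz = {name: scores[name] / clocks_ghz[name] for name in scores}

for name, value in sorted(per_ghz.items(), key=lambda item: item[1]):
    print(f"{name:9s} {value:6.2f} per GHz")
```

Normalised this way, the two lower-clocked consoles come out well ahead per clock, even though the raw figures for the PS3, PS4 and Xbox One are close.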

Quote:

GPU graphics:

PS3:
NVIDIA G70 (previously known as NV47) architecture

The PS3's RSX GPU was an aging design in 2006. A G70-type GPU is crap at Folding@home.

The PS3's RSX GPU is crap at GPGPU workloads.

Quote:

PS4: 1600
(AMD GPGPU Graphics Core Next (GCN))
This is a RISC-based GPU, from what I read; funnily enough, it is used in these GPU cards:
(Radeon HD 7000, HD 8000, 200, 300, 400, 500 and Vega series of AMD Radeon graphics cards)

GPUs have complex instructions such as gather and scatter in their TMUs, which are not RISC-style atomic operations.

GCN uses a MIMD (multiple instructions, multiple data) design, and it's closely related to the VLIW (very long instruction word, instruction-level parallelism) or EPIC (Explicitly Parallel Instruction Computing) approaches.

A GCN wave64 contains a payload of multiple instructions and data elements.

MIMD has greater flexibility when compared to very wide SIMD (single instruction, multiple data).

Quote:

Xbox One: 830
GPU that’s very similar to the Radeon 7790.

Xbox One and PS3 CPU speed being slightly faster than the PS4's is maybe an indication that most games do not really need a major CPU speed boost; the change of CPU architecture was maybe more political, or had to do with factors other than speed.

Unlike on the XBO, the PS3's Cell was patching up its aging RSX GPU design.

From https://forum.beyond3d.com/posts/1460125/

------------------------

"I could go on for pages listing the types of things the spu's are used for to make up for the machines aging gpu, which may be 7 series NVidia but that's basically a tweaked 6 series NVidia for the most part. But I'll just type a few off the top of my head:"

1) Two ppu/vmx units

There are three ppu/vmx units on the 360, and just one on the PS3. So any load on the 360's remaining two ppu/vmx units must be moved to spu.

2) Vertex culling

You can look back a few years at my first post talking about this, but it's common knowledge now that you need to move as much vertex load as possible to spu otherwise it won't keep pace with the 360.

3) Vertex texture sampling

You can texture sample in vertex shaders on 360 just fine, but it's unusably slow on PS3. Most multi platform games simply won't use this feature on 360 to make keeping parity easier, but if a dev does make use of it then you will have no choice but to move all such functionality to spu.

4) Shader patching

Changing variables in shader programs is cake on the 360. Not so on the PS3 because they are embedded into the shader programs. So you have to use spu's to patch your shader programs.

5) Branching

You never want a lot of branching in general, but when you do really need it the 360 handles it fine, PS3 does not. If you are stuck needing branching in shaders then you will want to move all such functionality to spu.

6) Shader inputs

You can pass plenty of inputs to shaders on 360, but do it on PS3 and your game will grind to a halt. You will want to move all such functionality to spu to minimize the amount of inputs needed on the shader programs.

7) MSAA alternatives

Msaa runs full speed on 360 gpu needing just cpu tiling calculations. Msaa on PS3 gpu is very slow. You will want to move msaa to spu as soon as you can.

8) Post processing

360 is unified architecture meaning post process steps can often be slotted into gpu idle time. This is not as easily doable on PS3, so you will want to move as much post process to spu as possible.

9) Load balancing

360 gpu load balances itself just fine since it's unified. If the load on a given frame shifts to heavy vertex or heavy pixel load then you don't care. Not so on PS3 where such load shifts will cause frame drops. You will want to shift as much load as possible to spu to minimize your peak load on the gpu.

10) Half floats

You can use full floats just fine on the 360 gpu. On the PS3 gpu they cause performance slowdowns. If you really need/have to use shaders with many full floats then you will want to move such functionality over to the spu's.

11) Shader array indexing

You can index into arrays in shaders on the 360 gpu no problem. You can't do that on PS3. If you absolutely need this functionality then you will have to either rework your shaders or move it all to spu.

Etc, etc, etc...
---------------

The goal of CISC was to take common coding patterns and accelerate them in hardware.

The goal of RISC was the opposite, i.e. to perform a few base functions as fast as possible.

A GPU combines extreme RISC and extreme CISC design approaches to create the fastest and most efficient graphics processor in the world.

A GPU has complex fixed-function hardware combined with MIMD stream compute hardware.

NVIDIA CUDA without rasterization hardware is not fast when compared to CUDA (shaders) + fixed-function rasterization hardware.

Last edited by Hammer on 04-Oct-2020 at 03:43 PM.
Last edited by Hammer on 04-Oct-2020 at 02:55 PM.
Last edited by Hammer on 04-Oct-2020 at 02:51 PM.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 4-Oct-2020 15:26:22

[ #128 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5312
From: Australia

@NutsAboutAmiga

Quote:

NutsAboutAmiga wrote:
@Hammer

Quote:
Mai Logic has an incompetent Northbridge design, worse than VIA's, regardless of the Amiga's requirements.

I’m not sure. I remember having lots of issues with my Athlon PC, with crashes and freezes; that’s when I switched to using Linux for many years, because the drivers worked better. Maybe some quirks were worked out by the time we got the chips in the AmigaONE-SE/XEs, but the issues were bad on PCs.

https://www.vogons.org/viewtopic.php?f=46&t=43116

There are two major crash types in Windows, i.e. an application crash or a BSOD crash.

A BSOD is a kernel-level crash, usually caused by drivers.

A Linux kernel panic is the equivalent of a Windows BSOD.

My K7 Athlon XP/GeForce 4 Ti/nForce 2/Windows XP was stable for the most part, i.e. it didn't suffer data corruption. I still have this machine as a DX6/DX7/DX8 legacy machine.

It's well known that certain NVIDIA nForce drivers can cause data corruption, hence I don't field nForce boards as workstation or server motherboards.

Quote:

Well, it’s your choice to use AmigaOS 4.1 in WinUAE. Clearly, if you want to play WHDLoad games or ADFs you should maybe buy a Raspberry Pi and not waste your money on an expensive PC.

For my "expensive" gaming PCs, Microsoft Game Pass, Blender 3D accelerated raytracing (RTX), Epic Unreal Engine 4, VMware Workstation, Visual Studio, and Netflix/Blu-ray say hi.

Tax deductions can be applied for PC hardware purchases when working in the industry.

I need my gaming PCs for my day job and primary entertainment. I use my Amigas for taking a break from Windows.

Quote:

Or buy a Minimig; the Vampire is maybe overkill.

Apollo's Vampire is nice "what if" 68K hardware, and my A1200 needs its performance boost.

MiSTer... I don't need another 68030 50MHz AGA level Amiga, i.e. I prefer a 68060/68080 level A1200.

Quote:

Yes, but you’re only getting a framebuffer as output on the Vampire; you don’t get a Radeon RX or HD card in your Vampire. So the price comparison is maybe not so bad, considering you get a lot less for your money.

Again, any Amiga platform that launches UAE will attract competition from AmigaForever/WinUAE + Windows 10 land, and I have already enabled 3D acceleration with WinUAE, which is backed by an NVIDIA RTX 2080 OC or RTX 2080 Ti OC GPU.

Radeon RX... why should I use these inferior AMD GPUs? AMD (RTG) needs to put in some more effort.

PS: I'm using bloated Amikit/AmigaOS3.x for WinUAE, hence Vampire hardware fits my Amikit/AmigaOS3.x existing hard drive setup. AmigaOS4.1 FE PPC/WinUAE was for exploration.

Last edited by Hammer on 06-Oct-2020 at 05:24 AM.
Last edited by Hammer on 06-Oct-2020 at 05:03 AM.
Last edited by Hammer on 04-Oct-2020 at 03:41 PM.
Last edited by Hammer on 04-Oct-2020 at 03:40 PM.
Last edited by Hammer on 04-Oct-2020 at 03:34 PM.
Last edited by Hammer on 04-Oct-2020 at 03:32 PM.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

Status: Offline

Hypex

Re: Amiga SIMD unit
Posted on 5-Oct-2020 18:17:57

[ #129 ]

Elite Member

Joined: 6-May-2007
Posts: 11228
From: Greensborough, Australia

@matthey

Quote:
PPC has a self imposed limitation of not allowing integer FP conversions in registers. Other load/store architectures allow it like AArch64. Even the Motorola 88k RISC processor which lost out to PPC at Motorola had it. It was likely a simplification as there is some logic and bus synchronization involved. The 'R' in RISC stood for "reduced" and that was the RISC philosophy then even though it also resulted in "reduced" performance which was supposed to be offset by higher clock speeds and better compiler support. The PPC and 88k were neither pure RISC philosophies but they were still reduced in some ways. Even the CISC 68k was reduced in some areas starting with the 68040 removing many FPU instructions which included the commonly used FINT/FINTRZ instruction also used for FP to integer conversions. This was perhaps a bigger mistake than not allowing integer FP conversions in registers and was added back for the 68060.

As a trade off, even if direct conversion is unsupported, I would expect a register to register move at the least.

Quote:
How does a core with a static number of registers allow a dynamic number of function calls?

That pretty much describes every CPU there is. I would say optimise for the least possible number of registers in each function block and only stack them when it reaches the limit. It's popular for GCC to optimise away a function and inline the code or even remove some functions.

Quote:
PPC has a link register (LR) which contains the return address for the last function (the return address is pushed to the stack with CISC). A branch and link instruction (bl) writes the return address to the LR and a branch and link return instruction (blr) branches to the address in the LR. This works well until a function calls another function in which case the LR has to be saved and restored in the prologue and epilogue of functions. Leaf (the last) functions called can return without touching memory which was faster than the CISC method until cores received a hardware link or return stack which stores more of the return addresses than the last one and more often quickly returns without needing to load the return address from memory than the LR register.

The need to store locals would force the neat register arrangement into memory. So the RISC register benefit would be lost if it needs to use CISC methods. Looks like a case where pyramid style code would be needed, with the top code being the heaviest and gradually working towards the lightest. Inverted pyramid maybe.

Quote:
PPC has a combined stack and stack frame pointer register which is R1. Data other than frame pointers are not allowed on the stack. This saves a register but has more setup/breakdown overhead and is annoying, especially for humans (the 68k can also use 1 register for the stack and stack frame pointer which is faster but loses easy management of the stack frame pointer for debugging). All local data is stored within the stack frame for the function. Additionally, there is a pointer within the stack frame which points to the previous stack frame. Simple functions can avoid creating the high overhead stack frames making PPC have similar function call overhead to other architectures. When we need storage and a stack frame, the prologue and epilogue of the function adds overhead.

I wonder if this relates to varargs being a problem on PPC? It does stack registers and other local data, it looks like, so I don't know why it can't use other data. But I recall PPC has to use particular functions for varargs, making porting from 68K harder. Given the stack was the common method, I don't know why using a stack to stack data was a problem, as long as the stack frame fits on the 16 byte boundary.

Quote:
To save space I have not included _savefpr_, _savegpr_, _restgpr_, _savevr20 and _restvr20 which resemble _restfpr_. Inlining for PPC allows code to share the prologue, epilogue and stack frame overhead so I would expect it to be faster most of the time.

Except for that function, which looks to store a lot and do very little.

Quote:
Let's compare the PPC function overhead to the 68k.

Yes, 68K has those multi stacking instructions. And fewer registers. Would it use fewer clocks by comparison?

Quote:
I expect x86_64 function overhead to be somewhere between PPC and the 68k although there are different ABIs used by Windows and Unix derived OSs. More volatile (scratch, caller save) registers allows a function to do more work without saving registers but lack of non-volatile (callee save) registers makes it slower to call functions to do the work. Generally a balance of both is good but for a SIMD unit there usually are few if any function calls so they should likely have more volatile registers. Shared FPU and SIMD unit registers are a compromise then.

All about the balance.

Quote:
Conclusions:

I would consider 68K ASM to be the best, x86-64 unreal mode surely easier than the cryptic x86 real mode ASM, and PPC at least understandable, sitting between the two.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 6-Oct-2020 4:58:15

[ #130 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2025
From: Kansas

Quote:

Hypex wrote:
As a trade off, even if direct conversion is unsupported, I would expect a register-to-register move at the least.

The fp to int conversion directly to register looks cleaner, is better for superscalar operation and uses less ICache but there isn't much performance difference on the 68060.

fp2int1:
fmove.l fp0,d0 ; 6 cycles

fp2int2:
fmove.l fp0,-(sp) ; 4 cycles
move.l (sp)+,d0 ; 1 cycle

These timings could be better but the 68k is good at using memory and the stack. In fp2int2, the move.l will wait (stall) until fmove.l writes its result before executing which may not be true of all architectures. It is best to put other instructions between anyway but that makes superscalar scheduling more difficult.
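As a quick tally of the two sequences, here is a minimal Python sketch that simply sums the cycle counts given in the listing comments above. It assumes no overlap between instructions and no stalls beyond what those comments already include; the names `FP2INT1`, `FP2INT2` and `total_cycles` are made up for illustration.

```python
# Cycle counts copied from the comments in the 68060 listing above.
FP2INT1 = [("fmove.l fp0,d0", 6)]
FP2INT2 = [("fmove.l fp0,-(sp)", 4), ("move.l (sp)+,d0", 1)]


def total_cycles(sequence):
    """Sum per-instruction cycle counts, assuming no overlap and no extra stalls."""
    return sum(cycles for _, cycles in sequence)


print(total_cycles(FP2INT1))  # direct register move
print(total_cycles(FP2INT2))  # store to stack, then reload
```

Under those assumptions the stack round trip comes out one cycle shorter, which is consistent with "not much performance difference".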

Quote:

That pretty much describes every CPU there is. I would say optimise for the least possible number of registers in each function block and only stack it when it reaches the limit. It's popular for GCC to optimise away a function and inline the code or even remove some functions.

I was specifically talking about where to save return addresses. Branching to a function on PPC writes the link register (LR) and if that function branches to another function, the LR is overwritten and lost. It is possible to move the LR to another general purpose register but a register is lost with each nested function. Most architectures are only limited by memory for how many nested functions can be called so the return addresses have to be placed in memory when there are not enough registers to hold them. Function args are also placed in memory when there are not enough registers to hold them. Function inlining reduces memory accesses in both cases but often requires more registers and reduces code sharing.
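The link register problem can be pictured with a toy model. This is only a sketch in Python, not PPC code: `ToyCPU`, `branch_and_link`, `save_lr` and `restore_lr` are hypothetical names, and the "addresses" are plain strings.

```python
class ToyCPU:
    """Toy model: one link register (lr) plus a memory stack for saved return addresses."""

    def __init__(self):
        self.lr = None
        self.stack = []

    def branch_and_link(self, return_address):
        # Like a branch-and-link: the return address overwrites whatever LR held.
        self.lr = return_address

    def save_lr(self):
        # Non-leaf prologue: spill LR to memory before making another call.
        self.stack.append(self.lr)

    def restore_lr(self):
        # Epilogue: reload LR so the function can return to its own caller.
        self.lr = self.stack.pop()


def nested_call(cpu, save):
    """main calls f, f calls leaf g; return the address f would return to."""
    cpu.branch_and_link("return to main")  # main -> f
    if save:
        cpu.save_lr()
    cpu.branch_and_link("return to f")     # f -> g, LR is overwritten here
    # g is a leaf and returns through LR; f now needs its own return address back.
    if save:
        cpu.restore_lr()
    return cpu.lr


print(nested_call(ToyCPU(), save=True))
print(nested_call(ToyCPU(), save=False))
```

Without the save, `f` is left holding its own call site as a return address, which is why non-leaf functions have to put the LR in memory (or burn a register per nesting level).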

Quote:

The need to store locals would force the neat register arrangement into memory. So the register RISC benefit would be lost if it needs to use CISC methods. Looks like a case where pyramid style code would be needed, with the top code being the heaviest and gradually working towards the lightest. Inverted pyramid maybe.

Compilers should keep the most commonly used function local variables in registers. Choosing when to inline functions with PPC is likely more difficult because it is RISC and because of the way it uses stack frames.

Quote:

I wonder if this relates to varargs being a problem on PPC? It looks like it does stack registers and other local data, so I don't know why it can't use other data. But I recall PPC has to use particular functions for varargs, making porting from 68K harder. Given the stack was the common method, I don't know why using the stack to store data was a problem, as long as the stack frame fits on the 16 byte boundary.

The compiler should know most of the time how many varargs arguments there are for the function and allocate the necessary space inside the PPC stack frame. It looks like PPC can have any alignments for datatypes and structures embedded in the stack frame as long as proper alignment padding is used around it. Perhaps varargs or Amiga taglists passed in to a function would have problems if it is necessary to copy them. Some language features like dynamic arrays could be more challenging to support as well. PPC does support dynamic stack space allocation although it is a more complicated process than most other architectures. See "System V ABI PPC Processor Supplement" 3-43 for details. Dynamic memory allocations would work too.
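For the 16 byte boundary mentioned in the quote, the padding a compiler has to add is just a round-up. A minimal Python sketch, assuming a power-of-two alignment; the byte counts in the example are hypothetical and not taken from the ABI document.

```python
def align_up(size, alignment=16):
    """Round size up to the next multiple of alignment (alignment must be a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (size + alignment - 1) & ~(alignment - 1)


# Hypothetical frame contents: 8 bytes of linkage, 12 bytes of saved
# registers and 20 bytes of locals need 40 bytes before padding.
raw_bytes = 8 + 12 + 20
print(raw_bytes, "->", align_up(raw_bytes))
```

The same helper applies to dynamic stack allocation: the requested size is rounded up so the stack pointer stays aligned after the adjustment.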

Quote:

Yes, 68K has those multi stacking instructions. And less registers. Would it use less clocks by comparison?

The 68060 MOVEM instruction can only save or restore one register per cycle and no superscalar operation is possible. Most older CPUs can only do one DCache access per cycle anyway. A more advanced 68k core may be able to save or restore consecutive pairs of registers in a single cycle. AArch64 chose to allow load/store pairs but no more as it is simpler to combine, but this only saves half the ICache of a load/store multiple. PPC has a load/store multiple instruction by the way. The "System V ABI PPC Processor Supplement" says the following.

Quote:

Note that "Load and Store Multiple" PowerPC instructions should not be used on Little-Endian
implementations because they cause alignment exceptions, or on Big-Endian implementations
because they are slower than the register-at-a-time saves.

A PPC function creating a stack frame would have used up half the ICache on a 68k processor before doing any work. The PPC prologue and epilogue have the extra cost of calling functions to save and restore the registers and other management instructions without post increment and pre decrement addressing modes which make it simple. It's not like PPC doesn't use microcode for more complex instructions as even POWER uses microcode.
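To put rough numbers on the MOVEM point above, here is a small Python sketch comparing one register per cycle with a hypothetical core that moves a pair per cycle. It counts only the register transfers and ignores decode, address calculation and cache misses; both function names are invented for this example.

```python
import math


def cycles_one_per_cycle(register_count):
    """One register saved or restored per cycle, as described for the 68060 MOVEM."""
    return register_count


def cycles_paired(register_count):
    """Hypothetical core that saves or restores two consecutive registers per cycle."""
    return math.ceil(register_count / 2)


for count in (4, 7, 8):
    print(count, cycles_one_per_cycle(count), cycles_paired(count))
```

An odd register count still costs a full cycle for the last register, so the pairing gain is at best a factor of two.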

Quote:

I would consider 68K ASM to be the best, x86-64 unreal mode surely easier than the cryptic x86 real mode ASM, and PPC at least can be understandable between the two.

ARM has done a good job of assembler for RISC processors (important for embedded use). SuperH assembler is good too, being like the 68k, but the 16 bit fixed encoding size really handicapped the performance. The 88k assembler looked nice other than a few confusing areas. Even MIPS and RISC-V assembler looks easier to me than PPC assembler other than the increased verboseness and tediousness to code due to simplicity, especially from lack of addressing modes. PPC assembler feels like it was made for compilers instead of humans.

Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 6-Oct-2020 6:24:58

[ #131 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5312
From: Australia

@matthey

Quote:

matthey wrote:

The 68060@50MHz provided 600MB/s from the Caches to the pipelines and the integer execute engines could sustain 1200MB/s transfer rates. The Vampire stand alone uses DDR3 memory which provides a data rate of 6400MB/s at 100MHz. The DDR3 memory throughput is about ten times the throughput of the 68060 caches even though latency is much better for the caches. The Pentium 2 and 3 are going to be closer to a 68060 than to a modern CPU. These old processors could be connected to modern memory and would get a nice performance boost as if using off chip caches but the advancement in fab technology inside the CPU chip is huge in comparison.

Classic Pentium has a 64-bit frontside bus while 68060 has a 32-bit frontside bus. 68060 has major bottlenecks with its motherboard or daughter card infrastructure.

As an example, Classic Pentium i430VX based PCChips M525 motherboard has 256KB L2 cache at frontside clock speed.

Pentium II (Klamath) has 16 KB instruction and 16 KB data cache. 512 KB L2 cache at half of the CPU frequency.

68060 has 8 KB instruction and 8 KB data cache.

In real terms, 68060 based Amiga 4000T motherboard and accelerator daughter cards are inferior to PCChips M525 motherboard.

In 1996, I owned A3000 and I have done cost vs performance benefits between Phase 5 CyberStorm MkII 68060 50Mhz + CyberGraphics 64 (S3 Trio 64) add-on upgrade cards vs new build Pentium 150/S3 Trio 64/PCChips M525 mobo based PC clone. Pentium 150 was easily overclocked to 166Mhz with FSB 66Mhz jumper (the same setting for Pentium 166Mhz). Quake will be faster on my selected Pentium clone box.

Pentium II's 440LX chipset has AGP 2X support. 440BX chipset has 100Mhz FSB support.

Besides raw specs, latency in the overall graphics rendering pipeline can influence the game's frame rate yields, hence Celeron 300A (Mendocino) has 128 KB L2 cache at the full CPU clock speed in 1998 which benefits games.

Vampire 1200 V2 and V4 are interesting "what if", but real-world performance doesn't match Socket 7 classic Pentiums.

Last edited by Hammer on 10-Oct-2020 at 01:31 AM.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 6-Oct-2020 8:22:29

[ #132 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2025
From: Kansas

@Hammer
Quote:

Classic Pentium has a 64-bit frontside bus while 68060 has a 32-bit frontside bus. 68060 has major bottlenecks with its motherboard or daughter card infrastructure.

The 68060 has a 32 bit data bus where the Pentium has a 64 bit data bus (but I believe only a 32 bit FSB). A 64 bit data bus has more memory bandwidth while a 32 bit data bus allows for cheaper memory. The 68060 was more of a balanced design than the Pentium performance design as it needed to sell into the embedded market where cheaper memory was an advantage. The 68060 had more efficient caches (more ways reduces conflict misses), about 20% better code density than x86 reducing ICache bandwidth loss and has about half the DCache memory traffic of x86 with only 8 GP registers. The Pentium had more memory bandwidth but it used significantly more memory bandwidth. The fact that the 68060 had similar performance with half the memory bandwidth shows this. Too bad C= didn't appreciate and leverage the cheaper memory of the 68060.
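As a back-of-the-envelope check on the bus width argument, peak bandwidth is bus width times clock. The Python sketch below assumes one transfer per bus clock and ignores burst timing, wait states and refresh, so real figures would be lower; the 66MHz and 50MHz clocks are the ones mentioned in this thread, and `peak_mb_per_s` is a name made up here.

```python
def peak_mb_per_s(bus_bits, clock_mhz):
    """Theoretical peak in MB/s, assuming one full-width transfer every bus clock."""
    return bus_bits // 8 * clock_mhz


print("64-bit bus at 66MHz:", peak_mb_per_s(64, 66), "MB/s")
print("32-bit bus at 50MHz:", peak_mb_per_s(32, 50), "MB/s")
```

This only bounds what the memory system can deliver; how much of it a CPU needs depends on code density and data traffic, which is the point being argued above.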

Quote:

As an example, Classic Pentium i430VX based PCChips M525 motherboard has 256KB L2 cache at frontside clock speed.

An external cache was better than nothing if all you can afford is a little expensive memory.

Quote:

Pentium II (Klamath) has 16 KB instruction and 16 KB data cache. 512 KB L2 cache at half of the CPU frequency.

The 16kiB L1 caches were a nice upgrade. Technology was moving right along.

Quote:

68060 has 8 KB instruction and 8 KB data cache.

Frozen in time just as the rate of fab technology improvement was nearing its peak. The 68060 didn't even get a memory controller on chip.

Quote:

In real terms, 68060 based Amiga 4000T motherboard and accelerator daughter cards are inferior to PCChips M525 motherboard.

I don't think C= ever produced a 68060 accelerator. As I recall, Amiga Technologies had one manufactured and it was pretty good but the 68060 had been out for some time by then. Even early Amiga 68060 accelerator cards were quickly outclassed by the Pentium being clocked up so quickly.

Quote:

In 1996, I owned A3000 and I have done cost vs performance benefits between Phase 5 CyberStorm MkII 68060 50Mhz + CyberGraphics 64 (S3 Trio 64) add-on upgrade cards vs new build Pentium 150/S3 Trio 64/PCChips M525 mobo based PC clone. Pentium 150 was easily overclocked to 166Mhz with FSB 66Mhz jumper (the same setting for Pentium 166Mhz). Quake will be faster on my selected Pentium clone box.

Most 68060 Amiga accelerators were still running at 50MHz as Pentiums clocked over 100MHz. Amiga owners couldn't compete with that. Motorola decided to abandon the 68k for PPC about this time.

Quote:

Pentium II's 440LX chipset has AGP 2X support. 440BX chipset has 100Mhz FSB support.

Besides raw specs, latency in the overall graphics rendering pipeline can influence the game's frame rate yields, hence Celeron 300A (Mendocino) has 128 KB L2 cache at the full CPU clock speed in 1998 which benefits games.

No on chip L2 caches yet?

Quote:

Vampire 1200 V2 and V4 are interesting "what if", but real-world performance doesn't match Socket 7 classic Pentiums.

Like "what if" a 68k core was stuck in a low cost FPGA forever? The Apollo Core still should come close to the faster clocked Pentiums with slow external L2 caches and even slower main memory slowing them down.

Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 10-Oct-2020 1:48:12

[ #133 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5312
From: Australia

@matthey

http://www.ic72.com/pdf_file/4/142652.pdf
For Intel 430VX chipset with Socket 7 and Pentium P54 or Pentium MMX P55
Source: Intel Corp

The Intel 430VX PCIset consists of the 82437VX System Controller (TVX), two 82438VX Data Paths (TDX), and the PCI ISA IDE Xcelerator (PIIX3). The PCIset forms a Host-to-PCI bridge and provides the second level cache control and a full-function 64-bit data path to main memory
....
TDX
Two TDXs create a 64-bit CPU memory data path.

68060 (at 85Mhz)'s and AC68080's Doom and Quake results are okay but they are less than my Pentium 166Mhz /430VX/S3 Trio 64V results.

It's Doom benchmark time (again)
https://www.complang.tuwien.ac.at/misc/doombench.html
doom -timedemo demo3

Pentium 166 with Diamond Stealth S3 Trio64V+ has 98.9 fps
Pentium 90 with Cirrus Logic 5434 has 50 fps

68060 at 85Mhz comparable to Pentium 90.... I don't think so!
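For anyone reproducing the numbers: Doom's timedemo reports game tics and real tics rather than fps, and the usual conversion uses the game's 35 tics per second. A minimal Python sketch of that conversion follows; the tic counts in the example are made up and are not results from the benchmark page.

```python
TICRATE = 35  # Doom game tics per second


def timedemo_fps(gametics, realtics):
    """Convert a 'timed N gametics in M realtics' report to frames per second."""
    if realtics <= 0:
        raise ValueError("realtics must be positive")
    return TICRATE * gametics / realtics


# Hypothetical run: 2000 game tics rendered in 1000 real tics.
print(timedemo_fps(2000, 1000))
```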

Last edited by Hammer on 10-Oct-2020 at 01:57 AM.

Status: Offline

bhabbott

Re: Amiga SIMD unit
Posted on 10-Oct-2020 20:28:17

[ #134 ]

Regular Member

Joined: 6-Jun-2018
Posts: 339
From: Aotearoa

Quote:

Hammer wrote:

68060 (at 85Mhz)'s and AC68080's Doom and Quake results are okay but they are less than my Pentium 166Mhz /430VX/S3 Trio 64V results.

20 years later and Amiga users still suffer from PC envy.

Who cares how fast some PC can run a crappy old game? Doom is plenty fast enough on an 060 or Vampire, and Quake is a boring game at any speed.

Quote:
68060 at 85Mhz comparable to Pentium 90.... I don't think so!

I had a 50MHz 060 in my A3000 and it felt about the same as a Pentium 90 - except for Quake which was apparently much slower (I never tried it on a PC). It was the only game I bought that was supposed to make use of the 060's power, and it stunk. Meanwhile my A1200 ran Amiga games perfectly - much more fun! No way a Pentium 90 could compete with that.

Any discussion today about Amiga vs PC speeds is silly. All that matters is can I make my Amiga fast enough to do what I want. For 99% of the Amiga stuff I have, my 50MHz 030 equipped A1200 is plenty fast enough. For a few applications such as web browsing and compiling C code the Vampire in my A600 is nicer. It runs Doom ridiculously fast even in hires, but I had just as much fun playing it on the A1200 where the lower frame rate made it more of a challenge - and it looks better on the big TV screen in composite.

To me the Amiga is about the total experience, not silly 3D benchmarks. A PC is just an appliance, a boring box full of forgettable hardware and an OS that gets less enjoyable with each generation. I use them when I have a job to do, but I use the Amiga when I want to savour the experience.

Quote:
It's Doom benchmark time (again)...

Ho hum. 50fps, 99fps, who cares?

Status: Offline

cdimauro

Re: Amiga SIMD unit
Posted on 18-Oct-2020 10:34:29

[ #135 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey
Quote:

matthey wrote:
There are many misconceptions about the SIMD unit. I am no expert but there are some others that may be able to contribute as well. We will start by looking at the history and basic features. MMX will be looked at which the Apollo Core adopted as a standard.

AMMX from the Apollo Core has nothing to do with Intel's MMX, except that it's a 64-bit integer-only SIMD.
Quote:
Finally some questions will be answered from another thread.

History
MAX - HP PA-RISC PA-7100LC 1994
MAX-2- HP PA-RISC PA-8000 1996
MMX - Intel Pentium MMX 1997
SSE - Intel Pentium 3 1999
Altivec - Motorola PPC 7400 (G4) 1999
SSE2 - Intel Pentium 4 2000
Neon - ARM1136J (ARMv6) 2002
WMMX - Intel XScale PXA270 2004
AVX - Intel Sandy Bridge CPUs 2011
AArch64 - Apple A7 (iPhone 5S) 2013

Basic features when introduced
MAX - 32x32b int regs; int16x2, uint16x2
MAX-2 - 32x64b int regs; int16x4, uint16x4
MMX - 8x64b regs shared with FPU; int8x8, int16x4, int32x2, uint8x8, uint16x4, uint32x2
SSE - 8x128b regs; fp32x4
Altivec/VMX - 32x128b regs; int8x16, int16x8, int32x4, uint8x16, uint16x8, uint32x4, fp32x4
SSE2 - 16x128b regs; int8x16, int16x8, int32x4, uint8x16, uint16x8, uint32x4, fp32x4, fp64x2
Neon - 16x128b regs shared with FPU; int8x16, int16x8, int32x4, uint8x16, uint16x8, uint32x4, fp32x4
WMMX - 16x64b regs; int8x8, int16x4, int32x2, uint8x8, uint16x4, uint32x2
AVX - 16x256b regs; fp32x8, fp64x4
AArch64 - 32x128b regs shared with FPU; int8x16, int16x8, int32x4, uint8x16, uint16x8, uint32x4, fp32x4, fp64x2

Something is missing. SSE3/4 (several new instructions added), AVX2 (which introduced some nice features like Gather & FMA; 2013), LRBNi (Larrabee New Instructions; 2009), KNI (Knights Corner New Instructions, 2011), AVX-512 (2013).
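The lane notation in the quoted feature list (int8x16, fp32x4 and so on) follows directly from register width divided by element width. A minimal Python sketch, with `lanes` as a made-up helper name:

```python
def lanes(register_bits, element_bits):
    """Number of elements of element_bits that fit in one SIMD register."""
    if register_bits % element_bits:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


print(lanes(64, 8))    # 64-bit register, 8-bit integers
print(lanes(128, 32))  # 128-bit register, 32-bit floats
print(lanes(256, 64))  # 256-bit register, 64-bit floats
```

This is also why doubling the register width doubles the operations per instruction on paper.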
Quote:
The Apollo Core does not have hardware 3D support so it is using MMX in the same way as clones did years ago and to emulate the blitter. The AMMX manual states it is based on Wireless MMX (WMMX) which Intel used for their ARM XScale embedded chips. WMMX did not have the implementation flaws of the original MMX and doubled the number of registers.

As I said before, AMMX is very different from MMX.
Quote:
Doubling the SIMD unit register width doubles the number of operations per SIMD instruction. This doubles the theoretical performance but is limited by memory bandwidth, cache efficiency (often more about cache bypassing techniques) and data alignment and ordering.

Cache thrashing can be avoided by using instructions to properly "mark" some memory regions as "non-temporal". x86/x64 also has some "NT" instructions which directly implement this behavior, without requiring ad-hoc instructions for marking some areas. My architecture goes a step further: any instruction with a memory reference can be marked as non-temporal (so there's no need for specific NT instructions).
Quote:
Data is often accessed in memory with no cache but using a small read buffer.

Data is very often processed using the SIMD registers. This can clearly be seen by disassembling SIMD code. This is also the reason why SIMD extensions usually have many registers (Power increased it to 64 with VMX2, by also using the FPU registers).
Quote:
The SIMD unit requires huge amounts of encoding space and supporting outdated SIMD extensions can use a significant amount of transistors.

That's true, but this can be overcome by using some flags to "re-use/encode" the existing opcode space in a much better way.

AFAIR the Apollo Core has a specific flag in the SR which signals the new execution mode / new features enabled. Unfortunately they used this possibility in the wrong way, because they didn't take the chance to re-encode the opcode space.
A very long time ago (in the amigacoding.de forum, which unfortunately is gone along with its rich knowledge base) I suggested using it to re-use the F-line for SIMD scalar instructions (completely removing the coprocessor support, which is anachronistic nowadays) and the A-line for the equivalent packed instructions: this would have opened the possibility of defining a much more powerful (and easier to decode) opcode structure & SIMD ISA.
Quote:
SIMD units are often rarely used and are not general purpose but can provide a large boost to performance in some cases. Like VLIW processors, SIMD units have high theoretical performance but actual performance can be a fraction of this.

That's not true. Please take a look at the benchmarks (real applications, not synthetic tests) which clearly show the advantage of using SIMD code. There are many open source applications that can be recompiled using just the regular FPU, or any SIMD unit. Phoronix often publishes benchmarks like that.

And the argument "let's use the GPU instead" is not valid. Yes, many workloads can be offloaded to GPUs, but it's not a general rule that can be applied to every scenario.
Offloading tasks to the GPU requires allocating memory on the GPU, transferring the data to it, waiting for the GPU to complete the tasks, moving the data back to the system/CPU memory, and finally freeing the GPU memory buffer. This "round-trip" can take very long, and it only makes sense if you have a huge amount of data which can justify the big overhead I've just described.
Last but not least, there's some non-massive number crunching where SIMD instructions can be used to speed up some "integer/scalar" algorithms.
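The round-trip argument can be turned into a simple break-even estimate. The Python sketch below is a deliberately crude model: it lumps allocation, both transfers and freeing into one fixed overhead (in reality the transfers also grow with data size), and every rate in the example is a hypothetical placeholder rather than a measurement.

```python
import math


def cpu_time(items, cpu_rate):
    """Seconds to process items on the CPU (with SIMD) at cpu_rate items per second."""
    return items / cpu_rate


def gpu_time(items, gpu_rate, fixed_overhead):
    """Seconds on the GPU: fixed round-trip overhead plus the actual processing."""
    return fixed_overhead + items / gpu_rate


def break_even_items(cpu_rate, gpu_rate, fixed_overhead):
    """Item count above which offloading to the GPU wins in this model."""
    if gpu_rate <= cpu_rate:
        return math.inf  # the GPU never catches up
    return fixed_overhead * cpu_rate * gpu_rate / (gpu_rate - cpu_rate)


# Hypothetical figures: CPU 1e6 items/s, GPU 1e7 items/s, 1 ms round-trip overhead.
print(break_even_items(1e6, 1e7, 1e-3))
```

Below the break-even point the CPU's SIMD unit finishes before the GPU has even received the data, which is the case described above.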

Now to answer the poll: it's a clear No. I think that it's quite evident to any architecture expert/enthusiast that the AMMX implementation is the worst ever made: they decided to share the data and FPU registers with the new SIMD ones! This partial overlapping of such (completely) different register sets is simply crazy.
The reason for this was that... context switching was faster. Another "design decision" (!) made by people who have a very limited vision, which is just "implementation-centric", and specific to the current FPGA implementation.
The same thing happened for the added instructions: they are just filling some holes left open by Motorola.
Finally, they also added the so-called "BANK" instruction, which is just a prefix (yes: exactly like x86/x64!) used to "enable" 64-bit operation and/or access to the new registers. The very bad thing is that on an ISA with 16-bit opcodes like the 68K this greatly reduces the code density, which was the great advantage of this microprocessor family.
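The code density cost of a prefix can be estimated with simple arithmetic. This Python sketch assumes the prefix is a single 16-bit word (the smallest unit a 68k instruction stream can grow by); the average instruction size and the fraction of prefixed instructions are hypothetical inputs, not measurements of real AMMX code.

```python
PREFIX_BYTES = 2  # assumption: the prefix occupies one 16-bit word


def average_instruction_bytes(base_average, prefixed_fraction, prefix_bytes=PREFIX_BYTES):
    """Average instruction size when a fraction of instructions carries a prefix."""
    if not 0.0 <= prefixed_fraction <= 1.0:
        raise ValueError("prefixed_fraction must be between 0 and 1")
    return base_average + prefixed_fraction * prefix_bytes


def code_growth(base_average, prefixed_fraction, prefix_bytes=PREFIX_BYTES):
    """Relative code size compared with the same code without any prefixes."""
    return average_instruction_bytes(base_average, prefixed_fraction, prefix_bytes) / base_average


# Hypothetical: 4-byte average instructions, a quarter of them prefixed.
print(code_growth(4.0, 0.25))
```

The shorter the average instruction, the larger the relative penalty, which is why a prefix hurts a dense 16-bit encoding more than it hurts x86.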

P.S. There are many other interesting messages in this spectacular thread (I love this kind of stuff!) which might deserve an answer. I'll do it once I have some time.

P.P.S. Sorry, no time to read again.

Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 18-Oct-2020 13:57:19

[ #136 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5312
From: Australia

@bhabbott
Quote:

20 years later and Amiga users still suffer from PC envy.

Who cares how fast some PC can run a crappy old game? Doom is plenty fast enough on an 060 or Vampire, and Quake is a boring game at any speed.

FYI, I have signed up for V1200 for my A1200. A1200 has a smaller footprint when compared to my ditched Pentium 166 Mhz PC mini-tower.

Quote:

I had a 50MHz 060 in my A3000 and it felt about the same as a Pentium 90 - except for Quake which was apparently much slower (I never tried it on a PC). It was the only game I bought that was supposed to make use of the 060's power, and it stunk. Meanwhile my A1200 ran Amiga games perfectly - much more fun! No way a Pentium 90 could compete with that.

FYI, I have 1993 to 1995 era PC games equivalent to my Amiga games running in DOSBox-X on Windows 10, but I prefer the AmigaOS 3.X GUI environment (with AmigaOS 4.x theme) with 1993 to 1995 era WHDload games e.g. I prefer WHDload Pinball Illusions AGA when compared to using MS-DOS to run Pinball Illusions.

Quote:

Any discussion today about Amiga vs PC speeds is silly. All that matters is can I make my Amiga fast enough to do what I want. For 99% of the Amiga stuff I have, my 50MHz 030 equipped A1200 is plenty fast enough. For a few applications such as web browsing and compiling C code the Vampire in my A600 is nicer. It runs Doom ridiculously fast even in hires, but I had just as much fun playing it on the A1200 where the lower frame rate made it more of a challenge - and it looks better on the big TV screen in composite.

To me the Amiga is about the total experience, not silly 3D benchmarks. A PC is just an appliance, a boring box full of forgettable hardware and an OS that gets less enjoyable with each generation. I use them when I have a job to do, but I use the Amiga when I want to savour the experience.

That's your opinion. Certain parts of the AmigaOS experience are primitive. Installing Windows 10 was smoother than installing AmigaOS 4.1 for classic hardware.

I accidentally damaged my first PCB on an Amiga because the Wicher 508's CPU pins for the 68000 socket are brittle, which forced me to do my first soldering job beyond a simple one- or two-wire job.

In the late 1980s, my A500 Rev 6 suffered a dead PSU and a cracked Agnus socket, which were replaced under warranty. My school friend's abandoned A500 Rev 5 also has a dead PSU, and this is the machine that I used to rebuild a German A500 Rev 6 motherboard during the COVID lockdown.

Quote:

Ho hum. 50fps, 99fps, who cares?

Read the context for my Doom benchmark example i.e. the poster made a claim and I responded with the counter-argument.

Last edited by Hammer on 18-Oct-2020 at 03:47 PM.

Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 18-Oct-2020 14:43:53

[ #137 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5312
From: Australia

@matthey

Quote:

Doubling the SIMD unit register width doubles the number of operations per SIMD instruction. This doubles the theoretical performance but is limited by memory bandwidth, cache efficiency (often more about cache bypassing techniques) and data alignment and ordering. Data is often accessed in memory with no cache but using a small read buffer. The SIMD unit requires huge amounts of encoding space and supporting outdated SIMD extensions can use a significant amount of transistors

On the same 28 nm process node, notice AMD Jaguar rivals ARM Cortex A15 in die size while Jaguar beats A15 in performance e.g. 3DMark Icestorm physics benchmark.

Quote:

SIMD units are often rarely used and are not general purpose but can provide a large boost to performance in some cases

What's your point? Offer an alternative solution to accelerating workloads such as Blender3D.

Blender3D is accelerated by RTX hardware which is very large scale MIMD hardware.

CAD and games heavily use vector math co-processors.

MATLAB supports AVX SIMD extensions. Many physics and engineering apps support SIMD units.

Swiftshader Vulkan CPU renderer can run Quake 3 with >150 fps with 640x480 resolution on Ryzen 9 3900X. Swiftshader Vulkan supports CPU's SIMD units.

Swiftshader Vulkan CPU renderer has Vulkan version 1.1 API + extensions support.

Modern x86 CPUs with SIMD units are recommended when emulating PS3, Wii, WiiU, Xbox 360, etc.

Sony's Naughty Dog has given lectures on optimizing cache with Jaguar CPUs e.g.

Don't assume AAA game developers are noobs when it comes to software optimizations.

Last edited by Hammer on 18-Oct-2020 at 02:56 PM.

Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 18-Oct-2020 15:21:26

[ #138 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5312
From: Australia

@cdimauro

For Xscale, Intel added Wireless MMX 64bit SIMD INT to ARM v5 before ARM established NEON SIMD.

WMMX carried ARM's 16 register model which is close to 68000's register count. Intel added 16 WMMX registers instead of IA-32 MMX's recycled X87 registers

Both WMMX and IA-32 MMX were supplanted by AMD's 3DNow 64bit SIMD and IBM's custom 64bit SIMD in Nintendo's GC/Wii/WiiU PowerPC CPUs, both of which supported FP32 data formats and were geared towards 3D workloads e.g. geometry transformations.

Consider WMMX to be between IA-32's MMX and SSE1.

For low budget SIMD, I prefer 3DNow e.g. Quake II has 3DNow optimizations

Last edited by Hammer on 18-Oct-2020 at 03:57 PM.

Status: Offline

OneTimer1

Re: Amiga SIMD unit
Posted on 18-Oct-2020 18:41:00

[ #139 ]

Cult Member

Joined: 3-Aug-2015
Posts: 984
From: Unknown

@Thread

Instead of an MMX command set with its own registers I would have favoured multimedia extensions for the existing registers.

For example commands for adding 4 ARGB bytes packed in a 32bit register to 4 ARGB bytes in another 32 bit register and of course the same with 16 bit values for sound. And of course with saturation arithmetic.
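What such commands would compute can be sketched in plain Python. This is only a reference model of the arithmetic, not a proposal for the actual instruction encoding; `add_argb_saturate` and `add_s16x2_saturate` are hypothetical names, and the choice of unsigned saturation for pixels and signed saturation for 16 bit samples is an assumption.

```python
def add_argb_saturate(a, b):
    """Add two packed 32-bit ARGB pixels byte by byte with unsigned saturation."""
    result = 0
    for shift in (24, 16, 8, 0):
        lane = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= min(lane, 0xFF) << shift  # clamp instead of wrapping around
    return result


def _to_signed16(value):
    """Interpret a 16-bit field as a two's complement signed number."""
    return value - 0x10000 if value & 0x8000 else value


def add_s16x2_saturate(a, b):
    """Add two pairs of signed 16-bit samples packed in 32 bits, with saturation."""
    result = 0
    for shift in (16, 0):
        lane = _to_signed16((a >> shift) & 0xFFFF) + _to_signed16((b >> shift) & 0xFFFF)
        lane = max(-0x8000, min(0x7FFF, lane))  # clamp to the int16 range
        result |= (lane & 0xFFFF) << shift
    return result


print(hex(add_argb_saturate(0xFF804020, 0x10904020)))
print(hex(add_s16x2_saturate(0x7FFF0001, 0x00010001)))
```

Saturation matters here because a wrapped pixel add turns bright into dark and a wrapped sample add produces an audible click.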

Conversion for YUV ( YUV 4:2:2 ) -> ARGB might also be a useful thing for video replay, but I heard this format is or will be supported by SAGA.
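For reference, the conversion itself is a small matrix multiply per pixel. The Python sketch below uses the full-range BT.601 coefficients and assumes a packed Y0 U Y1 V byte order for 4:2:2; which matrix and byte order SAGA or any given video source actually uses is not stated in this thread, so treat both as assumptions.

```python
def clamp8(value):
    """Round to the nearest integer and clamp to the 0..255 range."""
    return max(0, min(255, int(round(value))))


def yuv_to_argb(y, u, v):
    """Convert one full-range BT.601 YUV sample to packed opaque ARGB."""
    d, e = u - 128, v - 128
    r = clamp8(y + 1.402 * e)
    g = clamp8(y - 0.344136 * d - 0.714136 * e)
    b = clamp8(y + 1.772 * d)
    return (0xFF << 24) | (r << 16) | (g << 8) | b


def yuyv_to_argb_pair(y0, u, y1, v):
    """4:2:2 stores one U and one V per two pixels, so both pixels share the chroma."""
    return yuv_to_argb(y0, u, v), yuv_to_argb(y1, u, v)


print([hex(p) for p in yuyv_to_argb_pair(0, 128, 255, 128)])
```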

I have seen such commands on DSPs from TI and they seemed like a logical extension for graphic and sound manipulation.

Some DSPs have special addressing modes for FFTs; this could also be a logical type of enhancement for an existing ISA. The vectorisation used in MMX is still not really supported by compilers, while a single command like an Add-ARGB command could easily be implemented via macros.
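The FFT addressing mode most commonly found on DSPs is bit-reversed addressing, so that is assumed to be what is meant here. In software the same index permutation looks like this minimal Python sketch (`bit_reverse` and `bit_reversed_order` are illustrative names):

```python
def bit_reverse(index, bits):
    """Reverse the lowest 'bits' bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Index order a radix-2 FFT of length n (a power of two) reads or writes its data in."""
    bits = n.bit_length() - 1
    if n < 1 or (1 << bits) != n:
        raise ValueError("n must be a power of two")
    return [bit_reverse(i, bits) for i in range(n)]


print(bit_reversed_order(8))
```

A DSP does this permutation for free in the address generator; a general purpose CPU needs a loop or a lookup table like the one above.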

So if the Apollo team implements an MMX like extension they found somewhere, it might not really enhance their 68k implementation.

Last edited by OneTimer1 on 18-Oct-2020 at 06:42 PM.

Status: Offline

cdimauro

Re: Amiga SIMD unit
Posted on 18-Oct-2020 20:55:10

[ #140 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Hammer
Quote:

Hammer wrote:
@matthey
On the same 28 nm process node, notice AMD Jaguar rivals ARM Cortex A15 in die size while Jaguar beats A15 in performance e.g. 3DMark Icestorm physics benchmark.

Indeed. See also: The final ISA showdown: Is ARM, x86, or MIPS intrinsically more power efficient?
Quote:
Sony's Naughty Dog has given lectures on optimizing cache with Jaguar CPUs e.g.

Don't assume AAA game developers are noobs when it comes to software optimizations.

Well, those are things which an average assembly coder should know.
Quote:
Hammer wrote:
@cdimauro

For Xscale, Intel added Wireless MMX 64bit SIMD INT to ARM v5 before ARM established NEON SIMD.

WMMX carried ARM's 16 register model which is close to 68000's register count. Intel added 16 WMMX registers instead of IA-32 MMX's recycled X87 registers

Both WMMX and IA-32 MMX were supplanted by AMD's 3DNow 64bit SIMD and IBM's custom 64bit SIMD in Nintendo's GC/Wii/WiiU PowerPC CPUs, both of which supported FP32 data formats and were geared towards 3D workloads e.g. geometry transformations.

Consider WMMX to be between IA-32's MMX and SSE1.

That's interesting. But unfortunately it was an Intel extension, and that's why it was supplanted by the more standard NEON.
Quote:
For low budget SIMD, I prefer 3DNow e.g. Quake II has 3DNow optimizations

Unfortunately 3DNow! had the same limits as MMX: just 64-bit registers, only 8 of them, and a required context switch between the FPU and SIMD execution modes.

@OneTimer1
Quote:

OneTimer1 wrote:
@Thread

Instead of an MMX command set with its own registers I would have favoured multimedia extensions for the existing registers.

This depends on what your goal is.

If you just want a cheap (to implement) SIMD unit, then reusing the existing registers is the right way (but NOT mixing registers from different domains: either you use the FPU ones, or the data/general purpose ones).

If you care about performance (and you want to have a more general-purpose SIMD unit) then it's better to have a separate register set.

Another important thing which heavily influences some design decisions is whether your processor is a CISC or a RISC. It's very common nowadays for SIMD extensions to support so-called "masking" (using some registers as "masks" to filter out/select the results of the regular operations).
Since CISC instructions can access memory, and for that you usually need some integer/general purpose registers, it's better to use a separate, dedicated set of registers for the masks.
RISCs, on the contrary, only use registers for operations (except for loading/storing data), so they can just re-use the regular integer/general purpose registers as mask registers as well (also because SIMD-intensive code usually doesn't use the integer/general purpose registers much). Here the RISC-V designers failed, since they decided to use the same vector registers as masks...
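For readers who haven't met masking, this is what a masked SIMD add computes, modelled lane by lane in Python. It is a sketch of the semantics only: the mask is held here as a plain integer with one bit per lane, and inactive lanes take their value from a separate pass-through vector, which is one common convention (some ISAs zero inactive lanes instead). `masked_add` is a made-up name.

```python
def masked_add(a, b, mask, passthrough):
    """Lane-wise a + b where the mask bit is set; otherwise keep the pass-through lane."""
    return [
        x + y if (mask >> lane) & 1 else keep
        for lane, (x, y, keep) in enumerate(zip(a, b, passthrough))
    ]


# Mask 0b0101 enables lanes 0 and 2 only.
print(masked_add([1, 2, 3, 4], [10, 20, 30, 40], 0b0101, [0, 0, 0, 0]))
```

Where that mask integer lives (a general purpose register, a dedicated mask register, or a vector register) is exactly the design choice discussed above.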
Quote:
For example commands for adding 4 ARGB bytes packed in a 32bit register to 4 ARGB bytes in another 32 bit register and of course the same with 16 bit values for sound. And of course with saturation arithmetic.

Conversion for YUV ( YUV 4:2:2 ) -> ARGB might also be a useful thing for video replay, but I heard this format is or will be supported by SAGA.

I have seen such commands on DSPs from TI and they seemed like a logical extension for graphic and sound manipulation.

Some DSPs have special addressing modes for FFTs; this could also be a logical type of enhancement for an existing ISA.
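For reference, this is roughly what the packed ARGB add with saturation from the quote above would compute, sketched in plain C. The function names are made up for this example. The first version handles one 8-bit channel at a time; the second does all four channels at once inside a normal 32-bit register using bit tricks (no SIMD unit needed), which is only one possible way to do it:

```c
#include <stdint.h>

/* Reference version: add the four 8-bit channels one by one,
 * clamping each sum to 0xFF (unsigned saturation). */
uint32_t add_sat_u8x4_ref(uint32_t a, uint32_t b)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = ((a >> shift) & 0xFFu) + ((b >> shift) & 0xFFu);
        if (s > 0xFFu) {
            s = 0xFFu;
        }
        out |= s << shift;
    }
    return out;
}

/* "SIMD within a register" version: all four channels at once. */
uint32_t add_sat_u8x4(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of every channel; carries cannot cross channels. */
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* Fold in the top bit of every channel: wrapped per-channel sum. */
    uint32_t sum = low ^ ((a ^ b) & 0x80808080u);
    /* Carry out of bit 7 of every channel means that channel overflowed. */
    uint32_t carry = ((a & b) | ((a | b) & ~sum)) & 0x80808080u;
    /* Expand each carry bit to 0xFF for its channel and force saturation. */
    uint32_t mask = (carry >> 7) * 0xFFu;
    return sum | mask;
}
```

For example, adding 0x80FF10F0 and 0x80012020 overflows three of the four channels, so both functions return 0xFFFF30FF.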

You can add as many instructions as you want, but the risk is that you flood the ISA with too many specific instructions, which you can use only in specific (maybe even unique) scenarios.

That's not good for the ISA because it becomes too complex and too expensive to implement, and this might also affect performance (the ALU takes some time to "decide" which kind of operation to perform). Plus, all those instructions then become part of the legacy that future processor implementations have to carry with them.
Quote:
The vectorisation used in MMX is still not really supported by compilers; a single command like an Add-ARGB command could easily be implemented via macros.

So if the Apollo team implements an MMX-like extension they found somewhere, it might not really enhance their 68k implementation.

First, see above.

Second, those instructions need support from compilers or assemblers, and then from developers as well. The Amiga market is already too fragmented, and this new stuff increases the fragmentation further.

I don't think there's any possibility of future markets for the Amiga: time has passed, and the chances are gone. Those who think that it can become great again, or even compete with the current mainstream systems, are visionaries (in the negative sense).

The Amiga is gone, and it's only a retro platform. Enjoy it as it is, and stop dreaming too much.

Status: Offline


Amigaworld.net was originally founded by David Doyle