Amigaworld.net - The Amiga Computer Community Portal Website

Amiga SIMD unit

Hammer

Re: Amiga SIMD unit
Posted on 28-Sep-2020 4:35:50

[ #101 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6515
From: Australia

@Fl@sh

Quote:

Fl@sh wrote:
@all

About PPC AltiVec (G4/G5) vs Intel SSE1/SSE2: on paper, both have the same potential.
AltiVec may even be simpler, and closer to the AVX/AVX2 ISA than to SSE1/SSE2.

E.g. this link lists the whole AltiVec instruction set, and yes, we also have FMADD between vectors, even with the FLOAT datatype: http://mirror.informatimago.com/next/developer.apple.com/hardware/ve/instruction_crossref.html#compare

AltiVec also has much more human-readable instructions, with up to three arguments per instruction.

I don't know anything about the Apollo core due to my lack of interest in the 68k architecture, but maybe some choices, like the AMMX SIMD implementation, were driven by limited FPGA space, focusing on reusing logic where possible.

IMHO it is much better to bypass it for now and implement something more powerful and future-proof in a later, bigger FPGA version.

PowerPC splits branch, load, store, and float work into separate instructions, which can fill up to four instruction issue slots.

x86's CISC instructions can fold some of that work together and conserve instruction issue slots, which acts like instruction compression.

AMD Bulldozer has four-operand FMA (FMA4) alongside AVX, and Zen v1 has undocumented support for Bulldozer's FMA4.

AMD's current official support is FMA3 AVX/AVX-2.
Intel's current official support is FMA3 AVX/AVX-2/AVX-512.
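
To make the operand counts concrete, here is a minimal scalar sketch in C of the two multiply-add shapes discussed above. `Vec4`, `madd4` and `madd3` are names made up for this illustration, not real intrinsics, and plain `a * b + c` stands in for the hardware operation (a true fused multiply-add rounds once, while this C code may round twice).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 4-lane float vector, standing in for a 128-bit SIMD register. */
typedef struct { float lane[4]; } Vec4;

/* Four-operand shape (AltiVec vmaddfp, AMD FMA4): the destination is
   separate from the three sources, so no input is overwritten. */
Vec4 madd4(Vec4 a, Vec4 b, Vec4 c)
{
    Vec4 d;
    for (size_t i = 0; i < 4; i++)
        d.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    return d;
}

/* Three-operand shape (FMA3): the result overwrites one of the sources,
   modelled here as the accumulator. */
void madd3(Vec4 *acc, Vec4 a, Vec4 b)
{
    assert(acc != NULL);
    for (size_t i = 0; i < 4; i++)
        acc->lane[i] = a.lane[i] * b.lane[i] + acc->lane[i];
}
```

With the three-operand shape the compiler needs an extra register copy whenever all three inputs are still live afterwards; with the four-operand shape it does not.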

AVX2 has GPU-style gather instruction support.

AVX-512 has GPU-style scatter instruction support.
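
For readers unfamiliar with the terms, gather and scatter are indexed vector loads and stores. The following is a scalar C sketch of their semantics only; the function names and the `base_len` bounds parameter are assumptions for illustration, and real hardware performs all lanes with a single instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Gather: each lane loads from its own index into the base array. */
void gather(float *dst, const float *base, size_t base_len,
            const size_t *index, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++) {
        assert(index[i] < base_len);
        dst[i] = base[index[i]];
    }
}

/* Scatter: each lane stores to its own index in the base array. */
void scatter(float *base, size_t base_len, const size_t *index,
             const float *src, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++) {
        assert(index[i] < base_len);
        base[index[i]] = src[i];
    }
}
```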

I see you want to start another X86 vs PPC debate. I'm game for it.

https://barefeats.com/doom3.html
The real-world problems with PowerPC and Altivec when running Doom 3.

MAC GAME PERFORMANCE BRIEFING FROM THE DOOM 3 DEVELOPERS
Glenda Adams, Director of Development at Aspyr Media, has been involved in Mac game development for over 20 years. I asked her to share a few thoughts on what attempts they had made to optimize Doom 3 on the Mac and what barriers prevented them from getting it to run as fast on the Mac as in comparable Windows PCs. Here's what she wrote:

"Just like the PC version, timedemos should be run twice to get accurate results. The first run the game is caching textures and other data into RAM, so the timedemo will stutter more. Running it immediately a second time and recording that result will give more accurate results.

The performance differences you see between Doom 3 Mac and Windows, especially on high-end cards, is due to a lot of factors (in general order from smallest impact to largest):

1. PowerPC architectural differences, including a much higher penalty for float to int conversion on the PPC. This is a penalty on all games ported to the Mac, and can't be easily fixed. It requires re-engineering much of the game's math code to keep data in native formats more often. This isn't 'bad' coding on the PC -- they don't have the performance penalty, and converting results to ints saves memory and can be faster in many algorithms on that platform. It would only be a few percentage points that could be gained on the Mac, so its one of those optimizations that just isn't feasible to do for the speed increase.

2. Compiler differences. gcc, the compiler used on the Mac, currently can't do some of the more complex optimizations that Visual Studio can on the PC. Especially when inlining small functions, the PC has an advantage. Add to this that the PowerPC has a higher overhead for functional calls, and not having as much inlining drops frame rates another few percentage points.

----------
For benchmark-leading gaming workloads, the integer/FP conversion penalty must be very low. Function call overhead needs to be low. CPUs must be designed for optimal C/C++.

Don't get me started on the AMD Jaguar vs IBM PPE debate, i.e. PPE will lose the IPC war.

Last edited by Hammer on 28-Sep-2020 at 04:48 AM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

Status: Offline

Hypex

Re: Amiga SIMD unit
Posted on 29-Sep-2020 17:55:17

[ #102 ]

Elite Member

Joined: 6-May-2007
Posts: 11351
From: Greensborough, Australia

@Hammer

Quote:
1. PowerPC architectural differences, including a much higher penalty for float to int conversion on the PPC. This is a penalty on all games ported to the Mac, and can't be easily fixed. It requires re-engineering much of the game's math code to keep data in native formats more often. This isn't 'bad' coding on the PC -- they don't have the performance penalty, and converting results to ints saves memory and can be faster in many algorithms on that platform. It would only be a few percentage points that could be gained on the Mac, so its one of those optimizations that just isn't feasible to do for the speed increase.

I suspect this is because of the load/store nature of PPC and how it can't move from FPU to GPR directly. This seems like a lack of foresight to me, since it can obviously move from GPR to GPR. Of course the data format is different, but on a register-optimised architecture you would expect to be able to move data easily between register banks, regardless of any conversion needed.
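
As a rough model of the round trip being described, the C sketch below walks through the three steps a classic 32-bit PowerPC compiler emits for `(int)d`: `fctiwz` (convert inside the FPU register), `stfd` (store to a stack slot), `lwz` (load the word into a GPR). The function name and the `slot` buffer are hypothetical; this only models the data path and says nothing about cycle counts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of float-to-int conversion when there is no direct FPR-to-GPR move. */
int32_t float_to_int_via_memory(double d)
{
    /* Only values that fit in 32 bits are modelled here. */
    assert(d > -2147483649.0 && d < 2147483648.0);

    /* Step 1 (fctiwz): truncate toward zero; the 32-bit result sits in
       the low half of a 64-bit floating-point register. */
    uint64_t fpr = (uint32_t)(int32_t)d;

    /* Step 2 (stfd): the whole 64-bit register is stored to memory. */
    unsigned char slot[8];
    memcpy(slot, &fpr, sizeof fpr);

    /* Step 3 (lwz): the integer unit loads the result word back. */
    uint64_t reloaded;
    memcpy(&reloaded, slot, sizeof reloaded);
    uint32_t low = (uint32_t)reloaded;
    int32_t result;
    memcpy(&result, &low, sizeof result);
    return result;
}
```

The store followed by a dependent load is where the extra latency comes from, compared with an architecture that can move the value between register files directly.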

Quote:
2. Compiler differences. gcc, the compiler used on the Mac, currently can't do some of the more complex optimizations that Visual Studio can on the PC. Especially when inlining small functions, the PC has an advantage. Add to this that the PowerPC has a higher overhead for functional calls, and not having as much inlining drops frame rates another few percentage points.

This shouldn't be the case. Unlike a stack based architecture PPC doesn't need to store parameters on the stack and then do a JSR or whatever, save volatiles on the stack, do the work, unstack them, then unstack the return address and jump back. In fact, I don't think PPC actually has any kind of stack, the function calls use a stack area to manually store data. A set of parameters are kept in registers and most functions use registers over memory. Functions can be optimised to skip the stack frame and use registers when possible. Perhaps GCC wasn't as good then, no PPC compilers ever seem to be, but it shouldn't be a problem now.

Status: Offline

Fl@sh

Re: Amiga SIMD unit
Posted on 29-Sep-2020 20:02:21

[ #103 ]

Regular Member

Joined: 6-Oct-2004
Posts: 253
From: Napoli - Italy

Yes, using the stack to allocate variables in function calls is common on the x86 architecture, where the small number of registers forces it. On the PowerPC architecture, with so many registers, a good compiler tends to use them to hold as many variables as possible.

If I remember correctly, on PowerPC you can copy GPR registers to FPU registers and vice versa without any penalty.
Converting floats to ints is another story, and the clock cycles depend on each vendor's implementation.

_________________
Pegasos II G4@1GHz 2GB Radeon 9250 256MB
AmigaOS4.1 fe - MorphOS - Debian 9 Jessie

Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 1-Oct-2020 0:31:57

[ #104 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6515
From: Australia

@Fl@sh

X86-64 ISA has 16 registers (in addition to register renaming hardware) while PowerPC ISA has 32 registers.

Doom devs claim integer and floating-point conversions incur a performance penalty on PPC when compared to X86 counterpart.

CPU register count debate is nearly pointless when AMD GpGPU has thousands of registers that range into megabytes of SRAM storage. Eat that CELL.

Last edited by Hammer on 01-Oct-2020 at 12:33 AM.


Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 1-Oct-2020 0:40:54

[ #105 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6515
From: Australia

@Hypex

Quote:

Hypex wrote:
@Hammer

This shouldn't be the case. Unlike a stack based architecture PPC doesn't need to store parameters on the stack and then do a JSR or whatever, save volatiles on the stack, do the work, unstack them, then unstack the return address and jump back. In fact, I don't think PPC actually has any kind of stack, the function calls use a stack area to manually store data. A set of parameters are kept in registers and most functions use registers over memory. Functions can be optimised to skip the stack frame and use registers when possible. Perhaps GCC wasn't as good then, no PPC compilers ever seem to be, but it shouldn't be a problem now.

From a gaming-workload POV, PowerPC doesn't impress me when the PowerPC 970 is beaten by its K8 Athlon 64 competitor, let alone by the Core 2 and Core i series.

It doesn't look good for game-console-priced PowerPC when AMD Jaguar beats IBM PPE in IPC.

I'm still waiting for the "A500/A1200" replacement from the PowerPC/Power64 camp.

Offer a PowerPC alternative to the Xbox Series X, Series S, and PS5 APUs, which include 8 Zen 2 cores and an RDNA 2 GpGPU.

Intel has its Xe HPG gaming GPU coming in 2021 and already has an Xe GPU card competitive with Xbox One or 7790/R7-260 level GPUs. GpGPUs in x86 APUs are mostly used for vector math workloads.

Last edited by Hammer on 02-Oct-2020 at 03:11 AM.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 2-Oct-2020 2:23:41

[ #106 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2763
From: Kansas

Quote:

Hypex wrote:
I suspect this because of the load/store nature and how PPC can't move from FPU to GPR directly. This would seem to be a lack of foresight to me since it can obviously move from GPR to GPR. Of course the data format is different, but on a register optimised architecture, you would expect to be able to move data easily between register banks. Regardless of any conversion needed.

PPC has a self-imposed limitation of not allowing integer/FP conversions between registers. Other load/store architectures, like AArch64, allow it. Even the Motorola 88k RISC processor, which lost out to PPC at Motorola, had it. It was likely a simplification, as there is some logic and bus synchronization involved. The 'R' in RISC stood for "reduced" and that was the RISC philosophy then, even though it also resulted in "reduced" performance, which was supposed to be offset by higher clock speeds and better compiler support. Neither the PPC nor the 88k followed a pure RISC philosophy, but they were still reduced in some ways. Even the CISC 68k was reduced in some areas, starting with the 68040, which removed many FPU instructions, including the commonly used FINT/FINTRZ instructions that are also used for FP to integer conversions. This was perhaps a bigger mistake than not allowing integer/FP conversions in registers, and FINT/FINTRZ were added back for the 68060.
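
The opposite direction (integer to floating point) has the same problem on classic 32-bit PowerPC, and compilers commonly work around it with a well-known bit trick: build a double in memory whose mantissa holds the integer, load it into the FPU, and subtract a magic constant. The C sketch below reproduces that arithmetic portably; the function name is made up for illustration, and it assumes IEEE 754 doubles.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Convert a 32-bit signed integer to double without an int-to-FP
   instruction, going through a memory image of the double. */
double int_to_double_via_memory(int32_t x)
{
    assert(sizeof(double) == sizeof(uint64_t));

    /* High word 0x43300000 encodes 2^52; the low word holds x with its
       sign bit flipped, i.e. x + 2^31 as an unsigned value. */
    uint64_t bits       = 0x4330000000000000ULL | ((uint32_t)x ^ 0x80000000u);
    uint64_t magic_bits = 0x4330000080000000ULL;   /* 2^52 + 2^31 */

    double biased, magic;
    memcpy(&biased, &bits, sizeof biased);         /* the "store then lfd" step */
    memcpy(&magic, &magic_bits, sizeof magic);

    /* (2^52 + 2^31 + x) - (2^52 + 2^31) == x, exactly. */
    return biased - magic;
}
```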

Quote:

This shouldn't be the case. Unlike a stack based architecture PPC doesn't need to store parameters on the stack and then do a JSR or whatever, save volatiles on the stack, do the work, unstack them, then unstack the return address and jump back. In fact, I don't think PPC actually has any kind of stack, the function calls use a stack area to manually store data. A set of parameters are kept in registers and most functions use registers over memory. Functions can be optimised to skip the stack frame and use registers when possible. Perhaps GCC wasn't as good then, no PPC compilers ever seem to be, but it shouldn't be a problem now.

How does a core with a static number of registers allow a dynamic number of function calls?

PPC has a link register (LR) which contains the return address for the last function called (with CISC, the return address is pushed to the stack). A branch and link instruction (bl) writes the return address to the LR, and a branch to link register instruction (blr) branches to the address in the LR. This works well until a function calls another function, in which case the LR has to be saved and restored in the prologue and epilogue of the calling function. Leaf functions (the last ones in a call chain) can return without touching memory. That was faster than the CISC method until cores received a hardware link or return stack, which holds more return addresses than just the last one and so lets a return proceed quickly, without first loading the return address from memory, more often than the single LR does.
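
The bookkeeping described above can be sketched as a toy model in C. The `Cpu` struct, its fields and the helper names are all hypothetical; the model only counts memory accesses for return addresses and ignores everything else a real core does.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: one link register plus a memory stack for spilled copies. */
typedef struct {
    uint32_t lr;          /* link register */
    uint32_t stack[64];   /* memory used to save LR in non-leaf functions */
    int      sp;          /* number of saved entries */
    int      mem_ops;     /* loads + stores of return addresses */
} Cpu;

/* bl: record the return address in LR; no memory access. */
static void bl(Cpu *c, uint32_t return_addr) { c->lr = return_addr; }

/* Prologue of a non-leaf function: LR must be stored to memory. */
static void save_lr(Cpu *c)
{
    assert(c->sp < 64);
    c->stack[c->sp++] = c->lr;
    c->mem_ops++;
}

/* Epilogue of a non-leaf function: LR is reloaded before blr. */
static void restore_lr(Cpu *c)
{
    assert(c->sp > 0);
    c->lr = c->stack[--c->sp];
    c->mem_ops++;
}

/* Call a chain of `depth` nested functions; only the innermost is a leaf.
   Returns the address that the outermost blr would branch to. */
uint32_t call_chain(Cpu *c, int depth, uint32_t return_addr)
{
    bl(c, return_addr);
    if (depth > 1) {
        save_lr(c);                               /* the nested bl will clobber LR */
        call_chain(c, depth - 1, return_addr + 4);
        restore_lr(c);
    }
    return c->lr;                                 /* blr */
}
```

A single leaf call costs no memory operations in this model, while every non-leaf level adds one store and one load.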

PPC has a combined stack and stack frame pointer register which is R1. Data other than frame pointers are not allowed on the stack. This saves a register but has more setup/breakdown overhead and is annoying, especially for humans (the 68k can also use 1 register for the stack and stack frame pointer which is faster but loses easy management of the stack frame pointer for debugging). All local data is stored within the stack frame for the function. Additionally, there is a pointer within the stack frame which points to the previous stack frame. Simple functions can avoid creating the high overhead stack frames making PPC have similar function call overhead to other architectures. When we need storage and a stack frame, the prologue and epilogue of the function adds overhead.
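
The pointer to the previous stack frame (the back chain) turns the stack into a linked list that a debugger can walk. Below is a minimal C sketch of that structure; the `Frame` layout mirrors only the first two words of a 32-bit System V PPC frame (back chain at offset 0, LR save word at offset 4), and the type and function names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First two words of a stack frame as seen through r1. */
typedef struct Frame {
    struct Frame *back_chain;  /* 0(r1): caller's frame, NULL for the outermost */
    uint32_t      lr_save;     /* 4(r1): slot where a callee saves its return address */
} Frame;

/* Count how many frames are on the stack by following the back chain. */
size_t frame_depth(const Frame *sp)
{
    size_t depth = 0;
    for (; sp != NULL; sp = sp->back_chain)
        depth++;
    return depth;
}

/* A function saves LR in its caller's frame (see "stw r0,4(sp)" in the
   prologue), so its return address is found one link up the chain. */
uint32_t saved_return_address(const Frame *sp)
{
    assert(sp != NULL && sp->back_chain != NULL);
    return sp->back_chain->lr_save;
}
```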

function:
# Prologue begin
mflr r0 # Save return address ...
stw r0,4(sp) # ... in caller's frame.
ori r11,sp,0 # Save end of fpr save area
rlwinm r12,sp,0,28,28 # 0 or 8 based on SP alignment
subfic r12,r12,-len # Add in stack length
stwux sp,sp,r12 # Establish new aligned frame
bl _savefpr_14 # Save floating-point registers
addi r11,r11,-144 # Compute end of gpr save area
bl _savegpr_14_g # Save gprs and fetch GOT ptr
mflr r31 # Place GOT ptr in r31
# Save CR here if necessary
addi r30,r11,144 # Save pointer to incoming arguments
mfspr r0,vrsave # Save VRSAVE ...
stw r0,-220(r30) # ... in caller's frame.
oris r0,r0,0xff70 # Use v0-v10 and ...
ori r0,r0,0x0fff # v20-v31 (for example)
mtspr vrsave,r0 # Update VRSAVE
addi r0,sp,len-224 # Compute end of vr save area
bl _savevr20 # Save VRs
# Prologue end

# Body of function

# Epilogue begin
addi r0,sp,len-224 # Address of vr save area to r0
bl _restvr20 # Restore VRs
lwz r0,-220(r30) # Fetch prior value of VRSAVE
mtspr vrsave,r0 # Restore VRSAVE
addi r11,r30,-144 # Address of gpr save area to r11
bl _restgpr_14 # Restore gprs
addi r11,r11,144 # Address of fpr save area to r11
bl _restfpr_14_x # Restore fprs and return
# Epilogue end

_restfpr_14_x: lfd f14, -144(r11)
_restfpr_15_x: lfd f15, -136(r11)
_restfpr_16_x: lfd f16, -128(r11)
_restfpr_17_x: lfd f17, -120(r11)
_restfpr_18_x: lfd f18, -112(r11)
_restfpr_19_x: lfd f19, -104(r11)
_restfpr_20_x: lfd f20, -96(r11)
_restfpr_21_x: lfd f21, -88(r11)
_restfpr_22_x: lfd f22, -80(r11)
_restfpr_23_x: lfd f23, -72(r11)
_restfpr_24_x: lfd f24, -64(r11)
_restfpr_25_x: lfd f25, -56(r11)
_restfpr_26_x: lfd f26, -48(r11)
_restfpr_27_x: lfd f27, -40(r11)
_restfpr_28_x: lfd f28, -32(r11)
_restfpr_29_x: lfd f29, -24(r11)
_restfpr_30_x: lfd f30, -16(r11)
_restfpr_31_x: lwz r0, 4(r11)
lfd f31, -8(r11)
mtlr r0
ori r1, r11, 0
blr

To save space I have not included _savefpr_, _savegpr_, _restgpr_, _savevr20 and _restvr20 which resemble _restfpr_. Inlining for PPC allows code to share the prologue, epilogue and stack frame overhead so I would expect it to be faster most of the time.

Let's compare the PPC function overhead to the 68k.

function:
fmovem fp_reg_list,-(sp)
movem.l gp_reg_list,-(sp)

... ; Body of function

movem.l (sp)+,gp_reg_list
fmovem (sp)+,fp_reg_list
rts

I expect x86_64 function overhead to be somewhere between PPC and the 68k although there are different ABIs used by Windows and Unix derived OSs. More volatile (scratch, caller save) registers allows a function to do more work without saving registers but lack of non-volatile (callee save) registers makes it slower to call functions to do the work. Generally a balance of both is good but for a SIMD unit there usually are few if any function calls so they should likely have more volatile registers. Shared FPU and SIMD unit registers are a compromise then.

System V PPC:
11 volatile gp, 20 non-volatile gp, 8 gp param
9 volatile fp, 22 non-volatile fp, 8 fp param
20 volatile SIMD, 11 non-volatile SIMD, 12 SIMD param

Windows x86_64:
9 volatile gp, 7 non-volatile gp, 4 gp param
6 volatile SIMD, 10 non-volatile SIMD, ? gp param

System V x86_64:
9 volatile gp, 7 non-volatile gp, 6 gp param
16 volatile SIMD, 6 gp param

System V 68k
4 volatile gp, 12 non-volatile gp
2 volatile fp, 6 non-volatile fp

Newer versions of x86_64 ISAs and ABIs likely have changed.

Conclusions:
PPC has plenty of registers, including separate SIMD and FPU registers, and passes many arguments in registers, but function overhead can be high and moving registers between register files through memory may be slower. x86_64 overhead can vary depending on the ABI and we don't know what was used in the claim. In one paper, using all stack parameters gave a 0.86% slowdown in integer performance on x86_64 compared to register parameters, likely because of so much inlining; with compiler inlining disabled, this jumped to a 7.4% slowdown for integer and 1.2% for floating point. Like increasing the number of registers, the advantages are often overstated. PPC is powerful but feels unfriendly, unwieldy and wasteful, especially considering how much more instruction cache is used, as can be seen in the function prologue and epilogue above. Most PPC programmers don't have a good understanding of the hardware, which results in inferior code compared to x86_64. Low level programmers preferred x86_64 despite the warts, inconsistencies and kludges which is especially helpful for SIMD programming. AArch64 has shown that attaching a SIMD unit to a RISC processor can be done better and in a standard way. Even ARM doesn't want to compete with the beef or the bloat of the x86_64 SIMD unit though.

Status: Online!

Hammer

Re: Amiga SIMD unit
Posted on 2-Oct-2020 3:41:36

[ #107 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6515
From: Australia

@matthey

Intel's beefy AVX-512 SIMD was a Larrabee relic from Intel's attempt to compete against GpGPUs from AMD(RTG) and NVIDIA.

Intel has Xe GPU family to compete against the mentioned GpGPU vendors.

https://www.wepc.com/news/intel-xe-dg1-discrete-gpu-benchmark-results/

As for the results, Intel’s Xe DG1 ostensibly scored 5,538 points overall in 3DMark’s Fire Strike when coupled with an Intel Core i9-9900K CPU. The GPU hit 5,960 points graphics score, 22,957 points for physics, and a combined score of 2,075.

To say the results are underwhelming is undoubtedly an understatement. The scores put Intel’s Xe DG1 in league with GPUs that are starting to show their age. As one participant in the resulting conversation noted, this puts the DG1 slightly above NVIDIA’s GeForce GTX 750 Ti (5184 points), and lower than AMD’s Radeon RX 460 (5924 points). Both those results were obtained using the same Intel Core i9-9900K and, as such, can be reliably compared.

AMD's budget RDNA 2 comes from Xbox Series S's 20 CU iGPU which has similar render power to Xbox One X's semi-custom 40 CU Polaris GCN.

Intel’s Xe DG1 power budget fits within a single PEG slot just like RX 460.

RX 460 has 14 CU Polaris GCN. RX 460/560 has been replaced by RX 5300/5500 (NAVI 14, RDNA v1) SKUs.

Last edited by Hammer on 02-Oct-2020 at 03:47 AM.

Status: Offline

CosmosUnivers

Re: Amiga SIMD unit
Posted on 2-Oct-2020 7:32:41

[ #108 ]

Regular Member

Joined: 20-Sep-2007
Posts: 113
From: Unknown

@matthey

Quote:
PPC has plenty of registers including separate SIMD and FPU registers and passes many arguments in registers but function overhead can be high and moving registers between register files through memory may be slower. x86_64 overhead can vary depending on the ABI and we don't know what was used in the claim. Using all stack parameters gave a 0.86% slowdown in integer performance on x86_64 compared to register parameters in one paper likely because of so much inlining which jumped to a 7.4% slow down with compiler inlining disabled for integers and 1.2% for floating point. Like increasing the number of registers, the advantages are often over stated. PPC is powerful but feels unfriendly, unwieldy and wasteful especially considering how much more instruction cache is used as can be seen by the function prologue and epilogue above. Most PPC programmers don't have a good understanding of the hardware which results in inferior code compared to x86_64. Low level programmers preferred x86_64 despite the warts, inconsistencies and kludges which is especially helpful for SIMD programming. AArch64 has shown that attaching an SIMD unit to a RISC processor can be done better and in a standard way. Even ARM doesn't want to compete with the beef or the bloat of the x86_64 SIMD unit though.

The PPC is a big piece of junk; that's why Phase5 put this CPU in front of the 68k on the BlizzardPPC and CyberStormPPC: to dirty the 68k and the Amiga...

Same scenario with AMMX... Same story with ARM on the Warp 560/1260, ZZ9000 and A314...

They create new problems so that they can sell the solution afterwards: buying the new cards, of course; nothing is free with them...

I hope all Amiga users will finally understand one day: the enemies are inside our community...

The next f**k will certainly be a new Efika ARM and a new Pegasos ARM: they always use the same script, and they repeat it because they saw the previous ones fail = they are 100% certain their new hardware will fail too...

ARM will never work, AMMX will never work, PPC will never work, x86 (new MorphOS) will never work: and they **P*E*R*F*E*C*T*L*Y** know that...

Last edited by CosmosUnivers on 02-Oct-2020 at 07:36 AM.

Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 2-Oct-2020 12:46:11

[ #109 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6515
From: Australia

@CosmosUnivers

For what comes after the 68080, I would prefer a multi-pipelined FPU like the one in the Zen CPU family, i.e. two FADD and two FMAC/FMA/FMUL units, instead of AMMX.

Zen's multi-pipelined FPU still speeds up legacy x87 FPU code.

A 68090 with a multi-pipelined, out-of-order FPU would be nicer for Quake-like Amiga 68K code.

AMMX integer SIMD is an attempt to mirror the MAX-1 integer SIMD instruction set of the PA-7150 in the proposed Commodore-Amiga Hombre.

The 68080 is a good "what if" alternative to Motorola committing 68K suicide.

Quake used very optimized x86 code that interleaved FPU and integer instructions, as John Carmack had worked out that apart from instruction loading, which used the same registers, FPU and integer operations used different parts of the Pentium core and could effectively be overlapped. This nearly doubled the speed of FPU-intensive parts of the game's code.
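
The overlap itself is a hardware scheduling effect that C cannot express directly, but the code shape that makes it possible can be sketched. The example below is an assumption-laden illustration, not Quake's actual code: it does one floating-point divide per block of pixels (16 here, a value chosen for the sketch) and fills the pixels in between with integer fixed-point stepping, so the slow FPU divide for the next block has no dependency on the integer loop of the current one. `span_coords` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compute integer texture coordinates for n pixels of a span.
   u/z and 1/z vary linearly across the span (uz0, duz, iz0, diz).
   Coordinates are assumed non-negative in this sketch. */
void span_coords(float uz0, float duz, float iz0, float diz,
                 int n, int32_t *out)
{
    const int STEP = 16;                 /* pixels per perspective-correct block */
    assert(n >= 0 && out != NULL);

    float uz = uz0, iz = iz0;
    int32_t u_start = (int32_t)(uz / iz * 65536.0f);    /* 16.16 fixed point */

    for (int x = 0; x < n; x += STEP) {
        int count = (n - x < STEP) ? (n - x) : STEP;

        /* FPU work: exact coordinate at the end of this block. */
        uz += duz * (float)count;
        iz += diz * (float)count;
        int32_t u_end = (int32_t)(uz / iz * 65536.0f);

        /* Integer work: linear stepping between the two exact points. */
        int32_t du = (u_end - u_start) / count;
        int32_t u = u_start;
        for (int i = 0; i < count; i++) {
            out[x + i] = u >> 16;
            u += du;
        }
        u_start = u_end;
    }
}
```

On a CPU whose FPU and integer units run independently, the divide for one block can be in flight while the inner integer loop of another block executes, which is the effect the paragraph above describes.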

Half-Life: Uses MMX to great extent, particularly for the software DSP sound engine and probably also skeletal animation system

Unreal Engine 1: Uses MMX for its Galaxy sound engine. MMX was also used greatly for the software renderer. https://twitter.com/TimSweeneyEpic/status/640962491460268032
Unreal's software renderer had a 16x4 MMX mode for dealing with color back in 1997

PC game list that used MMX.
https://www.mobygames.com/attribute/sheet/attributeId,1478/

68080 is a Pentium MMX class CPU for the 68K CPU family.

NXP 68K CPU family is under EU control.

The main problem with an ARM + 68K accelerator is that it creates yet another Phase 5 style PowerUP fork and segregation within the Amiga community.

Phase 5 ~= bPlan GmbH, which in partnership with Genesi produced the Pegasos PPC.

It's better to create an FPGA 68K hardware decoder in front of an FPGA RISC CPU core design, in the footsteps of the 68060 or Pentium Pro.

The 68060 is a CISC-RISC hybrid CPU design closer to the Pentium Pro, but with the classic Pentium's dual instruction issue per cycle and the 68040's 32-bit front-side bus.

Last edited by Hammer on 02-Oct-2020 at 01:14 PM.

Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 2-Oct-2020 13:20:50

[ #110 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6515
From: Australia

@matthey

According to https://twitter.com/tom_forsyth/status/641016896033173505
In HW functionality, GCN and LRB (Intel Larrabee, x86 GPU) are very close. Not exposed by most languages though.


Status: Offline

NutsAboutAmiga

Re: Amiga SIMD unit
Posted on 2-Oct-2020 15:37:14

[ #111 ]

Elite Member

Joined: 9-Jun-2004
Posts: 12993
From: Norway

@Hypex

Quote:
PPC can't move from FPU to GPR directly.

Because the FPU and CPU are actually two different processors, something Quake took advantage of back in the day. When used efficiently, they can run out of sequence with each other, as long as the CPU does not depend on the result of the FPU and the FPU does not depend on the result of the CPU.

_________________
http://lifeofliveforit.blogspot.no/
Facebook::LiveForIt Software for AmigaOS

Status: Offline

NutsAboutAmiga

Re: Amiga SIMD unit
Posted on 2-Oct-2020 15:45:24

[ #112 ]

Elite Member

Joined: 9-Jun-2004
Posts: 12993
From: Norway

@Hammer

Well, with PowerPC and RISC in general, they noticed that more and more people were writing C code and that compilers were unable to take advantage of special instructions. The idea was that if they removed the dead weight, they could reduce power consumption and increase the clock frequency.

The result is a CPU that's great at being OK, but not great at being the fastest or the hottest CPU. The real problem is that they never really managed to agree on which instructions to include, and because of this it is hard to optimize code in assembler.

It was the perfect CPU to put in a high-end printer or router back in 2001.

POWER family CPUs are at the other end of the scale: here they don't care about heat or noise, they only care about performance. These CPUs are not meant for the office or for home computers.

Anyway, now that some older PowerPC ISAs are open source, that can lower the price of PowerPC in some performance/heat/price matrices.

Last edited by NutsAboutAmiga on 02-Oct-2020 at 04:04 PM.

Status: Offline

NutsAboutAmiga

Re: Amiga SIMD unit
Posted on 2-Oct-2020 16:55:06

[ #113 ]

Elite Member

Joined: 9-Jun-2004
Posts: 12993
From: Norway

@CosmosUnivers

Well, when Motorola decided to shut down the 680x0 line, there were only a few options: ARM, MIPS, PowerPC, Alpha and maybe a few more. After a lot of speculation, a lot of indecision and the bankruptcies of ESCOM and later VisCorp, they finally chose PowerPC. Sadly Phase5 did not make motherboards; bPlan did, but somehow forgot to put an Amiga chipset on the Pegasos I. Without that, 680x0 programs were lost, not finding what they needed.

Sadly Mai Logic did not have a background in Amiga hardware, and Eyetech were just a reseller, so it's hard to see any other outcome for the AmigaOne-SE.

However, if we imagine a parallel universe where someone put a chipset on a PCB that plugged into the DIMM socket, and made the needed changes to Exec and the Linux kernel to avoid registering the chipset as memory, a lot more software might have worked. In that case NG and classic become the same thing, and there would be no major division.

Last edited by NutsAboutAmiga on 03-Oct-2020 at 07:15 AM.

Status: Offline

Fl@sh

Re: Amiga SIMD unit
Posted on 2-Oct-2020 22:08:23

[ #114 ]

Regular Member

Joined: 6-Oct-2004
Posts: 253
From: Napoli - Italy

For the more technical guys, I suggest looking at the following link

http://studies.ac.upc.edu/ETSETB/SEGPAR/microprocessors/altivec%20(mpr).pdf

to refresh your memory of the main AltiVec SIMD features, and compare them with those of its direct competitors.
Any modern SIMD implementation would do well to have at least the same features as AltiVec and not share any integer or float registers with the SIMD unit.


Status: Offline

A1200coder

Re: Amiga SIMD unit
Posted on 2-Oct-2020 23:54:51

[ #115 ]

New Member

Joined: 5-Oct-2019
Posts: 4
From: Unknown

@Hammer

Quote:

68080 is a Pentium MMX class CPU for the 68K CPU family.

I would say that this is incorrect. The 68080 is able to execute 4 instructions in parallel, which makes it even better than the Pentium Pro; the Pentium Pro is basically the same core as used in the Pentium II/III. The Pentium III also introduced the SSE instruction set.
(Early P6 family: Pentium Pro/PII/PIII, and Pentium M. Also Pentium 4: a maximum of 3 instructions per cycle can be achieved.)

The memory performance of the 68080 is also exceptionally good, beating even some 1 GHz PPC CPUs in this area. The 68080 runs m68k software clearly faster than a 68060 at the same clock speed without using any special 68080 features such as AMMX. Some claim that, for some workloads, it corresponds to a 120 MHz 68060 at the current FPGA clock speed, which is around 85 MHz.

Last edited by A1200coder on 03-Oct-2020 at 12:30 AM.


Hammer

Re: Amiga SIMD unit
Posted on 3-Oct-2020 6:25:52

[ #116 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6515
From: Australia

@A1200coder

I made my statement with both real-world performance and microarchitecture considerations in mind.

In Doom and Quake frame rate results, the 68080's performance didn't match my old Pentium 166 MHz with an S3 Trio 64U.

It's Doom benchmark time
https://www.complang.tuwien.ac.at/misc/doombench.html
doom -timedemo demo3

Pentium 166 with Diamond Stealth S3 Trio64V+ has 98.9 fps

Pentium 90 with Cirrus Logic 5434 has 50 fps
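For anyone reproducing these numbers: -timedemo reports gametics and realtics rather than fps, and with Doom's 35 Hz tic rate the conversion looks like the sketch below. The function name and the sample tic counts are made up for illustration; they are not taken from the benchmark page.

```python
DOOM_TICRATE = 35  # Doom's game logic runs at 35 tics per second

def timedemo_fps(gametics: int, realtics: int) -> float:
    """Convert a 'timed N gametics in M realtics' result to frames per second."""
    return DOOM_TICRATE * gametics / realtics

# Hypothetical run: 1000 game tics rendered in 500 real tics.
print(timedemo_fps(1000, 500))
```

Fewer realtics for the same demo means a higher frame rate.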

I was being generous.

It's nearly pointless to argue 68080's quad instruction issue per cycle capability when 68080's FPU is NOT Core 2 level.

This is why I argue that any future 68090 design should focus on a multi-pipeline FPU to improve Quake-type game engines.

Intel Core 2 has multi-pipeline 128-bit SIMD integer and FPU hardware and 128-bit-wide load/store units, while its ALUs/AGUs are 64 bits wide. You can't say the same for the 68080!

Pentium III has a 64-bit SIMD SSE FADD unit and a 64-bit FMUL unit, hence Pentium III can do one 32-bit ALU operation, one 64-bit SIMD SSE FADD (two 32-bit floats), and one 64-bit FMUL (two 32-bit floats) per clock cycle, which is effectively five 32-bit operations per clock cycle.
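To spell out the arithmetic behind "five" (a sketch of the counting only, assuming each 64-bit SSE operation is counted as two 32-bit float lanes; the names below are made up for illustration):

```python
LANE_BITS = 32

# (unit, width in bits) issued in the same clock cycle, per the claim above
units = [("ALU", 32), ("SSE FADD", 64), ("SSE FMUL", 64)]

# Each unit contributes width / 32 operations: 1 + 2 + 2
ops_per_clock = sum(width // LANE_BITS for _, width in units)
print(ops_per_clock)  # 5
```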

Last edited by Hammer on 03-Oct-2020 at 07:12 AM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB


Hammer

Re: Amiga SIMD unit
Posted on 3-Oct-2020 6:45:28

[ #117 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6515
From: Australia

@NutsAboutAmiga

Quote:

NutsAboutAmiga wrote:
@Hammer

Well, with PowerPC and RISC in general they noticed that more and more people were writing C code, and that compilers were unable to take advantage of special instructions. The idea was that if they removed the dead weight, they could reduce power consumption and increase the clock frequency.

The result is a CPU that's great at being OK, but not great at being the fastest or the hottest CPU. The real problem is that they never really managed to agree on which instructions to include, and this makes it hard to optimize code in assembler.

It was the perfect CPU to put in a high-end printer or router back in 2001.

Power family CPUs are at the other end of the scale: there they don't care about heat or noise, they only care about performance. These CPUs are not meant for the office or for home computers.

Anyway, now that some older PowerPC ISAs are open-sourced, this could lower the price of PowerPC in some performance/heat/price matrices.

Theoretical arguments don't reflect reality, i.e. PowerPC being beaten by x86-64.

This is a physics simulation workload.

The instruction set is only part of the answer when Jaguar's micro-architecture IPC was shown to be superior when compared to "TEH CELL".

Note that the same physics simulation workload runs on AMD GCN version 1.1 for a massive smackdown on IBM's solution.

@IBM and other PPC fanboys: offer an alternative to the Xbox Series S APU solution.


Hammer

Re: Amiga SIMD unit
Posted on 3-Oct-2020 7:07:34

[ #118 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6515
From: Australia

@NutsAboutAmiga

Quote:

NutsAboutAmiga wrote:
@CosmosUnivers

Well, when Motorola decided to shut down the 680x0, you had only a few options: ARM, MIPS, PowerPC and a few more. After a lot of speculation, a lot of indecision and the bankruptcies of ESCOM and later VisCorp, they finally chose PowerPC. Sadly Phase5 did not make motherboards; bPlan did, but somehow forgot to put an Amiga chipset on the Pegasos I. Without it, 680x0 programs were lost, not finding what they needed.

Sadly Mai Logic did not have an Amiga background, and Eyetech were just a reseller, so it's hard to see any other outcome for the AmigaOne-SE.

However, if we imagine a parallel universe where someone put the chipset on a PCB that plugged into the DIMM socket, and made the needed changes to Exec and the Linux kernel to avoid registering the chipset as memory, a lot more software might have worked. In that case NG and classic would have become the same thing, and there would be no major division.

Mai Logic had an incompetent northbridge design, worse than VIA's, regardless of the Amiga's requirements.

I have AmigaOS 4.1 FE Classic (PowerPC) running on WinUAE, and running a WHDLoad 68K game on it, which triggers another UAE, is LOL and not much different from AROS x86's UAE emulation box.

PS: I signed up for Apollo's Vampire 1200 V2 in mid-2020 for my A1200, and its price is not as crazy as that of the AmigaOne XE, X1000 and X5000.


Hammer

Re: Amiga SIMD unit
Posted on 3-Oct-2020 7:35:37

[ #119 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6515
From: Australia

@Fl@sh

Quote:

Fl@sh wrote:
For the more technical readers, I suggest looking at the following link:

http://studies.ac.upc.edu/ETSETB/SEGPAR/microprocessors/altivec%20(mpr).pdf

It is a good refresher on the main AltiVec SIMD features, which you can then compare with those of its direct competitors.
Any modern SIMD implementation would do well to have at least the same features as AltiVec, and not to share any integer or float registers with the SIMD unit.

Integer SIMD is important for pixel shading in DX8-type workloads.

In DX8, geometry is floating-point while pixel shading is integer. DX9 added floating-point pixel shading.

AMD RDNA's vector math stream processors execute both integer (INT4, INT8, INT16, INT32) and floating-point (FP16, FP32) operations. Rapid Packed Math is enabled for datatypes narrower than INT32 and FP32.

Rapid Packed Math with INT4 is used for deep learning AI workloads.

The Xbox 360 GPU has FP10, FP16 and FP32 support.

Remember, 3D rasterization involves converting floating-point geometry into pixel grid integers.
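A minimal sketch of that conversion step, assuming normalized device coordinates in [-1, 1] and a top-left pixel origin (the function name and the 320x200 resolution are illustrative, not taken from any particular API):

```python
import math

def ndc_to_pixel(x_ndc, y_ndc, width, height):
    """Map floating-point NDC in [-1, 1] to integer pixel coordinates."""
    # Scale to continuous screen space; y is flipped so +1 is the top row.
    sx = (x_ndc + 1.0) * 0.5 * width
    sy = (1.0 - y_ndc) * 0.5 * height
    # Snap to the pixel grid and clamp so the far edge stays on screen.
    px = min(max(math.floor(sx), 0), width - 1)
    py = min(max(math.floor(sy), 0), height - 1)
    return px, py

print(ndc_to_pixel(0.0, 0.0, 320, 200))   # centre of the screen
print(ndc_to_pixel(-1.0, 1.0, 320, 200))  # top-left corner
```

The floating-point work happens before the floor; everything after it (pixel addressing, and in DX8-style pipelines the colour math too) is integer.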

Modern large-scale vector math in GPGPUs handles integer and floating-point datatypes without major issues.


Fl@sh

Re: Amiga SIMD unit
Posted on 3-Oct-2020 10:22:11

[ #120 ]

Regular Member

Joined: 6-Oct-2004
Posts: 253
From: Napoli - Italy

@matthey

You are making things look much harder for PPC than they are.
I know you are in love with 68k, and that can cause a loss of impartiality.

Anyway, for PPC I have seen simpler prologues and epilogues than the ones in your examples.
Obviously, in the PPC code you are also saving and restoring vector registers, which are not present in 68k.
The best-case scenario, IMHO, is instead to save nothing and use volatile registers as much as possible in PPC code, for normal user program function calls.

You showed the worst-case scenario, where a function call needs to save GPR, FPU and vector registers all at once, and obviously restore them all at the end.

I guess that in such a complex function, where all these different registers are involved, saving and restoring the non-volatile resources is really a minor overhead compared with the complexity of the job the CPU is doing.

I want to repeat that the PPC ISA is the most recent of all of them, with the exception of RISC-V, and it was developed with the future in mind. This is the reason why in the PPC world we had a soft 64-bit transition, little-endian support, the latest powerful VMX extensions, and embedded and custom cores ranging from the car industry to gaming consoles, all sharing the same code.

On paper PPC is a good architecture, better than the others; the handicap lies in the vendors' implementations and in economies of scale in fab processes.
With the right investments in research and fab process, we could have a Z80 clocked at 10 GHz today, maybe faster than any x86.

My 2 cents.



Amigaworld.net was originally founded by David Doyle