Amigaworld.net - The Amiga Computer Community Portal Website

One major reason why Motorola and 68k failed...

matthey

Re: One major reason why Motorola and 68k failed...
Posted on 28-May-2024 21:55:48

[ #141 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2754
From: Kansas

Gunnar Quote:

Reading from memory/cache/stack is a very limited resource that is very difficult to scale up.
On the other hand, reading several registers per cycle is easy and, in comparison, very cost effective to scale up.

Therefore a good design will always favor putting values in registers instead of cache/stack.

Sure, cache accesses are more expensive than register accesses. Cache accesses are often unavoidable, though, and very high performance CPU/FPU cores that would have large register files usually also allow more than one cache access per cycle. RISC cores need both for performance, as all data requires a load to become a usable variable for other instructions. There is no advantage for CISC cores in loading data that is used once into a register. In fact, there is the disadvantage of a wasted move instruction if a mem-reg instruction can be used instead. These mem-reg accesses can take advantage of pipelining, which was worthwhile even on the P5 Pentium with its crap stack based FPU.

https://www.jagregory.com/abrash-black-book/#pentium-floating-point-optimization

Despite the ugly FXCH instructions and handicapped FPU ISA with no orthogonal FPU registers, the pipelined P5 Pentium FPU did outperform most CISC competitors. With pipelining and register renaming, it is obvious a 68k FPU with just 8 FPU registers would have a significant advantage over the P5 Pentium FPU. It looks like 8 FPU registers would be adequate for the Quake Dot Product, Cross Product, Transformation and Projection inlines. Modern games likely use larger matrices but do they do the floating point math in the FPU or SIMD units?

I was reading a paper on cache hints which benchmarked global full cache hints vs one time cache hints. The Quake benchmark was the only benchmark code which did better with one time cache hints and significantly so. A large percentage of the Quake data is used one time and it is unavoidable that the data is loaded once no matter how many registers exist. Well, another possibility is that the data cache was too small resulting in cache thrashing but more registers would not help much.

Gunnar Quote:

Yes, the old FPU code could work OK with 8 regs.
But we all know that the old 68K FPU is not pipelined.
A pipelined FPU needs unrolled code to make full use of it.
To unroll code you need several times the number of registers.
This is very simple logic.

Most of today's pipelined FPUs (POWER/Intel/etc.) often have a latency of 6 clocks.
This means you generally want to unroll your work loops 4-6 times to be able to eat the latency.
For doing this you need 4 to 6 times the number of registers.

See above. I still think 16 FPU registers is a good practical number with additional registers likely better used for a SIMD unit supporting floating point.

Gunnar Quote:

We all know that IBM increased the number of FPU registers to 64 for POWER a few years ago?
Why did IBM do this? Because it's very useful for increasing performance.
And yes, IBM also has register renaming in addition to this!

Having 32 FPU registers plus being CISC gives the 68080 FPU a huge advantage over the "old 8 registers".

It wasn't POWER9 with 64 FPU registers that was chosen for the newest AmigaOS 4 hardware but the PPC QorIQ P1022 which removed the more reasonable but still large 32 PPC FPU registers. Perhaps 32 FPU registers would be the right number for a CISC CPU competing in the workstation and server markets. It is obviously too many standard FPU registers for the desktop market as AmigaOS 4 hardware targets the desktop market and the A1222 doesn't need any standard FPU registers.

Gunnar Quote:

Matt

If you want to learn more then I highly suggest you talk to the people who actually code.
Talk to coders writing real FPU code.
And talk to coders who wrote performance code for the 68080.
You can learn a lot.

If you never code real software, and your "knowledge" is based on Wikipedia ...
yes you can then also contribute to brainstorming and post here in this forum - "as a Wikipedia Quarterback"

For real serious development more knowledge will be useful.

I'm no expert like Gunnar. I'll sit back with my popcorn in my armchair to watch and learn how his FPGA CPU core takes market share away from POWER systems in the high end server/workstation markets. Maybe it will be a race between the A1222 on the desktop to see which of these markets brings back the Amiga first.

Gunnar Quote:

You have a good fantasy - we can see this.

Super-AGA is completely different in design from Thomas's NATAMI AGA - and is a completely unrelated development.

Super-AGA is based on a concept of internal DMA buffers and a decoupled prefetcher
with the ability to run exactly to pixel timing and also unrelated to pixel timing.
This is somewhat an extension of what Haynie already planned in AAA.
The reason for this internal design is to fully optimize Super-AGA to be able
to make perfect use of modern memory technology.

I can believe it was rewritten enough that it is just based on SAGA. So Super-AGA is not the same as SAGA? Couldn't think of a less confusing name for the new version?

Gunnar Quote:

Besides this fundamental design difference,
Thomas Hirsch uses AHDL as his coding language, which we don't use.
We simulate all our code in ModelSim - and ModelSim is not compatible with AHDL.
All our code is 100% written in VHDL - which is a completely different language.

Yes, SAGA was supposedly originally written in AHDL. The rumor is that it was converted to VHDL, perhaps by Gunnar, and that Thomas was originally annoyed by his meddling.

Last edited by matthey on 28-May-2024 at 10:10 PM.

Status: Offline

Lou

Re: One major reason why Motorola and 68k failed...
Posted on 28-May-2024 23:40:09

[ #142 ]

Elite Member

Joined: 2-Nov-2004
Posts: 4259
From: Rhode Island

@matthey

Quote:

matthey wrote:
Lou Quote:

Since we like benchmarks, here's one I found doing an admittedly simple benchmark, but it exposes why most of the time efficiency is key...especially when it costs less.

https://www.youtube.com/watch?v=2k_jP73Ly7A

Interesting that the SNES CPU benchmark is almost exactly 1/2 of the PC Engine's...while clocked at 1/2 the speed. The Megadrive CPU at 7.6 MHz lost to NEC's version of the 6502 despite being clocked slightly higher and, you know, having them extra 'bits' and registers...and both the SNES and PC Engine were operating on RAM, not an internal register.

68K's code was smaller...but who cares?

Did you read the assembly code? The 68000 did not use moveq or addq which is important as the 68000 uses fewer cycles to fetch smaller code. It is likely the 68000 wins the benchmark contest with these minor changes. Did you read the comments which suggest this?

the [q] ops are for small numbers.
Writing to Zero Page on a 6502 is faster than an absolute address. Once again advantage 6502.
Many revisions of the 6502 supported a relocatable Zero Page pointer.

In the 6502 code, the value is stored to an absolute address (sta RAM_x2000) every loop and to x2001 every 255 cycles. That isn't even happening in the 68000 code. The 68000 code cheats by never writing to memory and is still slower.

Last edited by Lou on 29-May-2024 at 12:03 AM.
Last edited by Lou on 29-May-2024 at 12:00 AM.
Last edited by Lou on 28-May-2024 at 11:59 PM.
Last edited by Lou on 28-May-2024 at 11:50 PM.

Status: Offline

Hammer

Re: One major reason why Motorola and 68k failed...
Posted on 29-May-2024 2:57:13

[ #143 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6505
From: Australia

@matthey

Quote:

It is not that x86 CPUs were so well optimized for integer to floating point conversions but rather that PPC CPUs had a memory bottleneck. Transfers through cached memory are potentially not much worse than passing registers directly between units but most RISC CPUs have a bottleneck because of load-to-use stalls and up to 2 more instructions per transfer to execute compared to CISC with mem-reg and reg-mem operations. Worst case through memory is also much worse. PPC developers likely expected OoO to solve these problems but PPC limited OoO has limited ability to solve these problems. Even more aggressive OoO may have problems due to synchronization of shared resources between units. It may be possible to rewrite the algorithm to use more fp and less mixed fp and integer code if the FPU is high enough performance but that is not an easy fix. The 68060 FPU requires mixed fp and integer code for best performance so many early 3D algorithms for x86 shouldn't perform too bad with the minimalist 68060 FPU. The 68k FPU supports all register conversions and transfers too.

Within a smaller transistor budget and at a given clock speed, the RISC CPU has an advantage in arithmetic intensity against the mostly ROM'ed microcode 68030, e.g. ARM60 @ 12.5 MHz vs 68030 @ 50 MHz playing Doom.

At the same clock speed, the 68030's mem-reg and reg-mem advantages weren't enough to close the arithmetic intensity gap against the ARM60.

Since the 68030 has a hardware barrel shifter, a few basic ADD and MUL instructions should be implemented in hardware to cover some of the arithmetic intensity gap, i.e. a 68035.

For the Saturn project, Sega rejected the 68030 for the SuperH2. Motorola was not selling the 68LC040-25 at 68030 @ 50 MHz prices. A near-$100 68EC040 is useless with the DMA'ed devices that are prevalent in game consoles, not just the Amiga. The MMU is a premium feature according to Motorola.
It's too bad the 68EC040 didn't have the 68030's cache behavior.

On wholesale prices, Motorola's 68040 prices didn't keep pace with Intel's 486 prices.

These are the early rounds in which Motorola was losing its customers, and this Motorola stupidity was repeated with smart handheld devices.

Motorola lost to the ARM9xx-T (with MMU, ARMv4T instruction set) during the rise of smart handhelds. Did Motorola think the 68000-based DragonBall was good enough to battle ARM9?

Last edited by Hammer on 29-May-2024 at 04:53 AM.
Last edited by Hammer on 29-May-2024 at 04:51 AM.
Last edited by Hammer on 29-May-2024 at 04:50 AM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

Status: Offline

matthey

Re: One major reason why Motorola and 68k failed...
Posted on 29-May-2024 3:01:26

[ #144 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2754
From: Kansas

Lou Quote:

the [q] ops are for small numbers.
Writing to Zero Page on a 6502 is faster than an absolute address. Once again advantage 6502.
Many revisions of the 6502 supported a relocatable Zero Page pointer.

MOVEQ allows 8 bits of immediate data to be moved into a register which is as large of datatype as many 8 bit CPUs can use. This is possible due to the 16 bit variable length encoding. ADDQ and SUBQ only use 3 bits of encoding for the immediate 1-8 but this covers the most common cases and is a big improvement over INC and DEC instructions more common on 8 bit CPUs.
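To make the encoding point concrete, here is a minimal Python sketch that packs MOVEQ and ADDQ.L into their single 16-bit opcode words. The helper names `encode_moveq` and `encode_addq_long` are made up for illustration, and only the data register destination form of ADDQ is covered.

```python
def encode_moveq(imm, dreg):
    """Encode MOVEQ #imm,Dn as one 16-bit word: 0111 rrr 0 dddddddd."""
    if not -128 <= imm <= 127:
        raise ValueError("MOVEQ immediate must fit in a signed 8-bit field")
    if not 0 <= dreg <= 7:
        raise ValueError("data register must be D0-D7")
    return 0x7000 | (dreg << 9) | (imm & 0xFF)


def encode_addq_long(imm, dreg):
    """Encode ADDQ.L #imm,Dn as one 16-bit word: 0101 ddd 0 10 000 rrr."""
    if not 1 <= imm <= 8:
        raise ValueError("ADDQ immediate must be 1-8")
    if not 0 <= dreg <= 7:
        raise ValueError("data register must be D0-D7")
    data = imm & 7  # the value 8 is stored as 000 in the 3-bit field
    return 0x5000 | (data << 9) | (0b10 << 6) | dreg


# Both instructions fit in 2 bytes; MOVE.L #1,D0 or ADDI.L #1,D0 would need
# 6 bytes each (opcode word plus a 32-bit immediate).
print(hex(encode_moveq(1, 0)))      # 0x7001
print(hex(encode_addq_long(1, 0)))  # 0x5280
```

The immediate lives inside the opcode word itself, so there is no extension word to fetch, which is where the saving on a 68000 comes from.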

Lou Quote:

In the 6502 code, the value is stored to an absolute address (sta RAM_x2000) every loop and to x2001 every 255 cycles. That isn't even happening in the 68000 code. The 68000 code cheats by never writing to memory and is still slower.

It's not the fault of the 68000 that the benchmark is unrealistic. The 68000 is very flexible and has many options. The 6502 memory stores are a disadvantage of an accumulator architecture CPU that a CISC CPU doesn't have. It is also possible to combine all 3 loops into one loop using the larger datatypes of the 68000. Using the full capabilities of the CPU isn't cheating. The 68000 is clearly a more capable CPU. Any slow memory access disadvantage the 68000 has is more than compensated by more GP registers reducing memory traffic, memory accesses using larger datatype sizes and powerful addressing modes compared to the 6502.

Status: Offline

Hammer

Re: One major reason why Motorola and 68k failed...
Posted on 29-May-2024 5:12:28

[ #145 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6505
From: Australia

@matthey

Quote:

Despite the ugly FXCH instructions and handicapped FPU ISA with no orthogonal FPU registers, the pipelined P5 Pentium FPU did outperform most CISC competitors. With pipelining and register renaming, it is obvious a 68k FPU with just 8 FPU registers would have a significant advantage over the P5 Pentium FPU. It looks like 8 FPU registers would be adequate for the Quake Dot Product, Cross Product, Transformation and Projection inlines. Modern games likely use larger matrices but do they do the floating point math in the FPU or SIMD units?

X86-64v1's standard IEEE FP32 and FP64 are on SSE2 path. SSE2 supports both scalar and vector (SIMD) use cases.


Status: Offline

Gunnar

Re: One major reason why Motorola and 68k failed...
Posted on 29-May-2024 5:23:59

[ #146 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

@matthey

Quote:
Cache accesses are often unavoidable though and very high performance CPU/FPU cores that would have large register files also usually allow more than one cache access per cycle.

Let's use a real world example to give the discussion some real content.
Let's say you want to 3D rotate points for a 3D game.

Here is an example of this math.
Arne coded this on the Amiga when he was 14 years old:
https://www.youtube.com/watch?v=FgrPX_NGwYM

Let's say you have a game that wants to rotate 10,000 points.
For this operation you have a rotation matrix, which is 6 values,
and for each point that you rotate you have 3 values: X, Y, Z.
For the matrix calculation you need 2 more temp registers.

This makes 11 values.
A good routine can use 11 registers for this.
The 68K FPU can do this well as it can use BOTH the 8 Dn registers and the 8 FPn registers as inputs.

Arne's example code is a nice example of coding this simply for an unpipelined FPU.

If you have a pipelined FPU then you can run this much faster with unrolled code.
You will make the code 4 times faster if you unroll it 4 times.
If you unroll 4 times, then the code will use 5*4 = 20 registers for the vectors + 6 registers for the matrix.
You want to reserve all the cache/memory loads for loading each vector just one time from memory.
Good code will never want to waste memory/cache accesses on reloading values inside the work loop.

This example makes it clear you want to have 26 registers for this algorithm.

As we know, the 3-matrix code is the "small" version that you can use for vector operations.
Sometimes you want to use the 4-matrix code in your program.
The 4-matrix code is similar but of course needs more values.
It needs 8 for the matrix and 6 for each vector (4*6 for the unroll).
This means a good routine will use 32 registers total for the unrolled loop.

Mind that an unrolled loop will run this operation about 4 times faster
than the non-unrolled version.

In my experience, talking about real world code examples always makes things much clearer.

Looking at this example, everyone can clearly see how much benefit and value more registers have.
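
The register counts in this post follow one simple formula: the matrix values are shared by every point, while each unrolled copy of the loop body needs its own set of per-point values. A small Python sketch of that arithmetic (the function name `registers_needed` is only illustrative) reproduces the numbers quoted above:

```python
def registers_needed(matrix_values, values_per_vector, unroll):
    """Live values if nothing is reloaded from cache inside the work loop."""
    # Matrix coefficients are shared; per-point values are duplicated per unrolled copy.
    return matrix_values + values_per_vector * unroll


# 3-matrix case: 6 matrix values, 3 coordinates + 2 temporaries per point.
print(registers_needed(6, 5, 1))  # 11
print(registers_needed(6, 5, 4))  # 26

# 4-matrix case: 8 matrix values, 6 values per point.
print(registers_needed(8, 6, 4))  # 32
```

This only counts values that are live at the same time. It assumes no renaming, no mem-reg operands and no reuse of temporaries between unrolled copies, which are exactly the points under dispute in the thread.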

Status: Offline

matthey

Re: One major reason why Motorola and 68k failed...
Posted on 29-May-2024 21:58:05

[ #147 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2754
From: Kansas

Hammer Quote:

X86-64v1's standard IEEE FP32 and FP64 are on SSE2 path. SSE2 supports both scalar and vector (SIMD) use cases.

The handicapped x86 FPU was replaced for a reason.

Gunnar Quote:

Let's use a real world example to give the discussion some real content.
Let's say you want to 3D rotate points for a 3D game.

Here is an example of this math.
Arne coded this on the Amiga when he was 14 years old:
https://www.youtube.com/watch?v=FgrPX_NGwYM

Let's say you have a game that wants to rotate 10,000 points.
For this operation you have a rotation matrix, which is 6 values,
and for each point that you rotate you have 3 values: X, Y, Z.
For the matrix calculation you need 2 more temp registers.

This makes 11 values.
A good routine can use 11 registers for this.
The 68K FPU can do this well as it can use BOTH the 8 Dn registers and the 8 FPn registers as inputs.

Arne's example code is a nice example of coding this simply for an unpipelined FPU.

Arne's code for drawing a 3D rotating cube only uses 5 of 8 FPU registers in the existing 68k FPU ISA. There are 6 integer data registers used which hold single precision fp constants but this is not optimum. Each FPU instruction using an integer register has a penalty on 68k FPUs (68060 requires 2 more cycles and does not allow superscalar issue). Using the cache provides better performance than using a data register on the 68060 but this is not optimal for multiple use fp variables/constants and with more parallel execution. A few more FPU registers would be valuable even in this simple 3D code but 16 FPU registers are already a big improvement.

Gunnar Quote:

If you have a pipelined FPU then you can run this much faster with unrolled code.
You will make the code 4 times faster if you unroll it 4 times.
If you unroll 4 times, then the code will use 5*4 = 20 registers for the vectors + 6 registers for the matrix.
You want to reserve all the cache/memory loads for loading each vector just one time from memory.
Good code will never want to waste memory/cache accesses on reloading values inside the work loop.

This example makes it clear you want to have 26 registers for this algorithm.

As we know, the 3-matrix code is the "small" version that you can use for vector operations.
Sometimes you want to use the 4-matrix code in your program.
The 4-matrix code is similar but of course needs more values.
It needs 8 for the matrix and 6 for each vector (4*6 for the unroll).
This means a good routine will use 32 registers total for the unrolled loop.

Mind that an unrolled loop will run this operation about 4 times faster
than the non-unrolled version.

In my experience, talking about real world code examples always makes things much clearer.

Looking at this example, everyone can clearly see how much benefit and value more registers have.

There are always going to be algorithms that would benefit from more registers. Adding more registers is far from free and provides diminishing returns. I still think 16 GP FPU registers is a good number for a CISC FPU while 32 is a good idea for a RISC FPU. CISC FPUs have options when short a few registers like loads from cache and Dn registers which have a minimal performance loss with limited use. Register renaming reduces register needs. Reducing pipelined FPU instruction latencies reduces the number of instructions needed for unrolling. The P5 Pentium had 3 cycle pipelined FADD and FMUL reducing the need to unroll code. Multiple parallel FPU units improve parallelism without unrolling code. I don't think there is enough code that would benefit from 32 FPU registers. Even if pipelined performance is 25% better with 32 FPU registers when a FPU pipeline can be kept full, it won't make much difference to overall FPU performance if this only occurs 0.25% of the time. I believe a 32 FPU register standard is too many registers for the embedded market where some implementations will want to reduce the number or remove the FPU registers completely. Code size will likely be increased to encode so many registers which is a turnoff for embedded use. Perhaps 32 FPU registers will allow Gunnar's FPGA FPU to better compete with the POWER FPU though.
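
The "25% better, 0.25% of the time" argument is an Amdahl's law style estimate. As a rough sketch, assuming the 0.25% refers to the share of total execution time that benefits (the function name `overall_speedup` is made up here):

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when only `fraction` of run time gets faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# 25% faster code (1.25x) that accounts for 0.25% of execution time.
print(overall_speedup(0.0025, 1.25))  # roughly 1.0005, i.e. about 0.05% faster overall
```

The two percentages are the poster's illustrative figures, not measurements. The formula just shows how quickly a local gain disappears when the affected code is rare.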

Status: Offline

Lou

Re: One major reason why Motorola and 68k failed...
Posted on 29-May-2024 23:00:50

[ #148 ]

Elite Member

Joined: 2-Nov-2004
Posts: 4259
From: Rhode Island

@matthey

Quote:

matthey wrote:

Lou Quote:

In the 6502 code, the value is stored to an absolute address (sta RAM_x2000) every loop and to x2001 every 255 cycles. That isn't even happening in the 68000 code. The 68000 code cheats by never writing to memory and is still slower.

It's not the fault of the 68000 that the benchmark is unrealistic. The 68000 is very flexible and has many options. The 6502 memory stores are a disadvantage of an accumulator architecture CPU that a CISC CPU doesn't have. It is also possible to combine all 3 loops into one loop using the larger datatypes of the 68000. Using the full capabilities of the CPU isn't cheating. The 68000 is clearly a more capable CPU. Any slow memory access disadvantage the 68000 has is more than compensated by more GP registers reducing memory traffic, memory accesses using larger datatype sizes and powerful addressing modes compared to the 6502.

That's a heck of a coping mechanism...

Fact is the 6502 can access zero page almost like extra registers. The 65CE02 has a Z register that can be used as an extra register or as a relocatable Zero Page address register.

This code didn't need to write to memory at all. In the 68k code it moves the #1 back into D0 to add 1 to it again...so the total is always 2. The point was not to count to 2^23; the point was to do a simple rudimentary addition that many times to test CPU efficiency, not memory read/write speed. So adding an extra step to the 6502 code was a handicap and the 68k still lost.

This is why 68k failed. The instructions per Clock wasn't competitive until the 68040. Too little too late. Too expensive.

Last edited by Lou on 29-May-2024 at 11:01 PM.

Status: Offline

matthey

Re: One major reason why Motorola and 68k failed...
Posted on 30-May-2024 1:33:30

[ #149 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2754
From: Kansas

Lou Quote:

That's a heck of a coping mechanism...

...

This is why 68k failed. The instructions per Clock wasn't competitive until the 68040. Too little too late. Too expensive.

I'm really not concerned about the 68000 losing a benchmark with unoptimized and unrealistic code. "Instructions per Clock" are important but your 68000 benchmark code isn't optimized to reduce them? How experienced is a 68000 assembly programmer that doesn't know about MOVEQ and ADDQ?

@hagopds Quote:

Very cool! I wasn't aware of the 68000's moveq instruction, and it appears to support signed 8-bit fields from -128 to 127. This would be equivalent to the other consoles loop of 1 to 255, and should run faster as moveq requires fewer clock cycles than move.b.

Status: Offline

Gunnar

Re: One major reason why Motorola and 68k failed...
Posted on 30-May-2024 7:32:27

[ #150 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

@matthey

Quote:
. Adding more registers is far from free and provides diminishing returns. ..
Register renaming reduces register needs.

You contradict yourself in your post.
Let me help you:

Fact: The cost of adding more registers is low.
Fact: The HW cost of register renaming is high.
And no 68K CPU, nor any ColdFire CPU, does register renaming.

To do register renaming in the sense you mean, the CPU needs to have more "hidden" registers internally.
This means you have to pay for both: the cost of more registers and the much higher cost of register renaming.

Matthey, this is the problem with talking with you.
You talk about stuff that you googled without understanding what it means.

Status: Offline

Gunnar

Re: One major reason why Motorola and 68k failed...
Posted on 30-May-2024 7:41:47

[ #151 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

@matthey

Quote:
The P5 Pentium had 3 cycle pipelined FADD and FMUL reducing the need to unroll code.

The P5 was designed for a 90MHz clock rate.
We spoke about modern FPUs (designs that can do Gigahertz clocks) and those have around 6 cycles of latency.

With a modern CPU, unrolling is very important for performance.
Every coder who has programmed FPU code knows this.

I would suggest you look at real world FPU code done by IBM, MOTOROLA, ARM, you name them.
State of the art code unrolls the work loop 4, 5, 6 times.
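
The link between FPU latency and unroll factor can be shown with a toy model. The sketch below assumes a single-issue, in-order, fully pipelined FPU where every point needs a chain of dependent operations, and where unrolling interleaves the chains of several points. The function name `simulate` and all parameters are invented for illustration; it is not a model of any real core.

```python
def simulate(points, ops_per_point, latency, unroll):
    """Cycles until the last result is ready on a toy in-order pipelined FPU."""
    ready = {}   # point index -> cycle at which its latest result is available
    cycle = 0    # next free issue slot (one instruction may issue per cycle)
    for base in range(0, points, unroll):
        group = range(base, min(base + unroll, points))
        for _ in range(ops_per_point):
            for p in group:
                # In-order issue: stall until this point's previous result is ready.
                issue = max(cycle, ready.get(p, 0))
                ready[p] = issue + latency
                cycle = issue + 1
    return max(ready.values())


for latency in (6, 3):
    for unroll in (1, 4, 6):
        print(latency, unroll, simulate(12, 4, latency, unroll))
```

For 12 points with 4 dependent operations each, this model gives 233, 71 and 53 cycles at a latency of 6 for unroll factors 1, 4 and 6. At a latency of 3 it gives 122 cycles without unrolling and 50 with a 4x unroll, so a shorter latency needs less unrolling to keep the pipeline full. Loads, register pressure and multiple issue are deliberately left out.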

Status: Offline

Gunnar

Re: One major reason why Motorola and 68k failed...
Posted on 30-May-2024 7:49:44

[ #152 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

@matthey

Quote:
Multiple parallel FPU units improves parallelism without unrolling code.

Was this a joke?
You say you find more registers too expensive and then you propose to have multiple FPUs instead.
What you propose is 10,000 times more costly.
And the multiple FPU units again each need more registers. Don't you know this?

This is like saying "You want to save the money for the subway, and you propose to buy a new helicopter instead"?

Matt, you talk about solutions without having any knowledge of the hardware costs of any of the options that you propose.

Status: Offline

Gunnar

Re: One major reason why Motorola and 68k failed...
Posted on 30-May-2024 7:55:06

[ #153 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

@matthey

Quote:
Arne's code for drawing a 3D rotating cube only uses 5 of 8 FPU registers in the existing 68k FPU ISA.

So you complain about Arne's Amiga demo?
Arne was 13 or 14 years old when he wrote this demo effect and shared the code with the Amiga community to help others learn asm. Arne's code gives us something to look at when we talk about FPU code.

The code is good for clearly showing us
that for doing a 4-matrix loop you need 8 constants and you need 4+2 = 6 registers per vector.

This means you have 14 variables to handle for the single (slow) non-unrolled case.
For a 4-way unroll this makes 32 variables.
And for a 6-way unroll this makes 44 variables.

Everyone with coding experience understands
that coding a loop with more variables gets much easier with more registers available.

Quote:
I don't think there is enough code that would benefit from 32 FPU registers

Could the reason be that you never coded anything like this?
Maybe the problem here is talking about things you never did and do not understand from your own experience.

Matthey, how about you write some matrix 3D rotation code for us as an example?
And then you unroll it 4 or 6 times for speed.

Please do this and then we can talk about the topic again.

Last edited by Gunnar on 30-May-2024 at 08:15 AM.

Status: Offline

kolla

Re: One major reason why Motorola and 68k failed...
Posted on 30-May-2024 17:39:28

[ #154 ]

Elite Member

Joined: 20-Aug-2003
Posts: 3475
From: Trondheim, Norway

@Gunnar

Maybe the world isn't filled with amigans obsessing about rotating 3D objects?

_________________
B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC

Status: Offline

matthey

Re: One major reason why Motorola and 68k failed...
Posted on 30-May-2024 22:51:59

[ #155 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2754
From: Kansas

Gunnar Quote:

You contradict yourself in your post.
Let me help you:

Fact: The cost of adding more registers is low.
Fact: The HW cost of register renaming is high.
And no 68K CPU, nor any ColdFire CPU, does register renaming.

To do register renaming in the sense you mean, the CPU needs to have more "hidden" registers internally.
This means you have to pay for both: the cost of more registers and the much higher cost of register renaming.

Register renaming has a higher hardware cost than adding architectural registers but register renaming is an optional core design decision where architectural registers defined in the ISA are a requirement. Register renaming has several advantages over an equivalent number of architectural registers.

1. Code is smaller with renamed registers instead of architectural registers due to fewer bits for registers in the instruction encoding. This makes a larger number of rename registers practical. For example, 128 registers uses 3x7=21 bits of encoding space for a 3 op instruction which is 66% of a 32 bit instruction and not practical for architectural registers but possible for renamed registers.

2. Register renaming makes instruction scheduling and programming in general easier resulting in more optimized code without depending on compiler support. Perfect instruction scheduling often is not possible and core designs which minimize stalls usually have better real world performance. For example, the simple and small in-order SiFive U74 CPU core designed to reduce stalls often outperforms a much more complex OoO PPC G5 CPU with more processing hardware but requiring perfect code.

3. Existing code benefits from register renaming where adding architectural registers requires recompiling with updated compiler support to have any performance advantage. A good instruction scheduler is required to maximize performance without register renaming.
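
The encoding cost in point 1 is easy to check. A minimal Python sketch, assuming a 3 operand instruction format where every operand names one architectural register (the function name `register_field_bits` is only illustrative):

```python
def register_field_bits(num_registers, operands=3):
    """Bits spent on register specifiers in one instruction."""
    bits_per_operand = (num_registers - 1).bit_length()  # ceil(log2(n)) for n >= 1
    return operands * bits_per_operand


for regs in (8, 16, 32, 128):
    bits = register_field_bits(regs)
    share = 100.0 * bits / 32
    print(f"{regs} registers: {bits} bits, {share:.1f}% of a 32 bit instruction")
```

For 128 registers this gives 3x7 = 21 bits, about 66% of a 32 bit instruction, matching the figure above. Rename registers never appear in the encoding, so they avoid this cost.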

It's true that the 68040 and 68060 FPUs were minimalist. Integer performance was the priority but the FPU performance was not bad for minimalist FPUs. Minimalist FPUs are still valuable for lower end embedded use while FPU performance has become more important for high end embedded use than it was for the desktop when the 68040 and 68060 were designed. The 68060 did receive integer register renaming for an in-order core which is not necessary but makes the 68060 more forgiving to program and improves performance. It's natural to assume that a higher performance 68k FPU would have received FPU pipelining and register renaming to allow existing code to perform well.

Gunnar Quote:

The P5 was designed for a 90MHz clock rate.
We spoke about modern FPUs (designs that can do Gigahertz clocks) and those have around 6 cycles of latency.

A similar 5/6 stage in-order P5 Pentium design could already reach 300MHz in the late 1990s.

P5@66MHz 800nm
P54C@100MHz 600nm
P54CS@200MHz 350nm
P55C@233MHz 280nm
Tillamook@300MHz 250nm

Targeting a higher clock speed with a deeper pipeline is likely to increase the latency with more FPU pipeline stages. We can see this with the in-order Intel Atom Bonnell microarchitecture based on the P5 Pentium. The Bonnell pipeline was aggressively increased to 16-19 stages to achieve over 2GHz using a 45nm process. The latency of the x87 FADD and FMUL instructions increased from 3 cycles to 5 cycles. A more practical design with 7-11 stages likely could achieve 1-2GHz with 3-4 cycle FPU latencies. A deeply pipelined core designed for 3-5GHz to compete with POWER likely would have 6+ cycle FPU instruction latencies. A core hyper optimized for a FPGA may also use many stages to increase the clock speed resulting in longer FPU instruction latencies. Perhaps we can see Gunnar's design priorities.

Gunnar Quote:

With a modern CPU, unrolling is very important for performance.
Every coder who has programmed FPU code knows this.

I would suggest you look at real world FPU code done by IBM, MOTOROLA, ARM, you name them.
State of the art code unrolls the work loop 4, 5, 6 times.

RISC FPUs need more registers and more loop unrolling to not only avoid the long FPU instruction latencies but also load-to-use stalls which can be longer for deep pipelines and FPU loads. A CISC FPU can avoid some of the unrolling and code enlargement especially if FPU instruction latencies are practical.
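
The relationship between FPU latency and the amount of unrolling can be sketched with a toy model. This is a minimal sketch, not a model of any real core: it assumes a single-issue, in-order, fully pipelined FADD unit, ignores loads and the final reduction of the partial sums, and the function name `cycles_for_sum` is made up for the example. Each "accumulator" stands for one FPU register holding a partial sum in an unrolled summation loop.

```python
def cycles_for_sum(n_adds, latency, accumulators):
    """Cycles until the last FADD result is available.

    Toy model: one FADD may issue per cycle, each result is ready
    `latency` cycles after issue, and add number i accumulates into
    register i % accumulators (round robin, as in an unrolled loop).
    """
    ready = [0] * accumulators  # cycle at which each accumulator is usable
    cycle = 0                   # earliest cycle the next FADD may issue
    for i in range(n_adds):
        acc = i % accumulators
        issue = max(cycle, ready[acc])  # stall until the accumulator is ready
        ready[acc] = issue + latency
        cycle = issue + 1
    return max(ready)


for latency in (3, 6):
    for accs in (1, 3, 6):
        print(latency, accs, cycles_for_sum(12, latency, accs))
```

In this model, 12 adds with a 3-cycle latency take 36 cycles with one accumulator and 14 with three. With a 6-cycle latency they take 72 cycles with one accumulator, 26 with three and 17 with six. The number of registers needed to hide the latency grows with the latency itself, which is why the practical FPU latency matters so much to both sides of this argument.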

Gunnar Quote:

Was this a joke?
You say you find more registers too expensive and then you propose to instead have multiple FPUs.
What you propose is 10,000 times more costly.
And the multiple FPU units again need more registers each. Don't you know this?

There are often separate sub units in the FPU like FADD, FMUL, FDIV, FMISC, etc. The units can potentially execute FPU instructions in parallel, although some logic overhead is necessary to make this possible. The P5 Pentium can execute FPU instructions in different FPU sub units at the same time, including multiple simultaneous pipelined FADD and FMUL instructions, while the 68060 cannot despite having similar FPU sub units. The logic to allow parallel instructions could be as simple as scoreboarding, which allows instructions using different resources to execute in parallel, or something closer to OoO complexity, with in-order completion and potentially register renaming.
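
A minimal sketch of the simple end of that range, assuming a single-issue in-order core with a register scoreboard. It is a toy, not a description of the P5 or 68060: the 3-cycle latencies, the four-instruction program and the names `run`, `LATENCY` and `PROGRAM` are assumptions for illustration, and write-after-write hazards are ignored. With `overlap=True` an instruction issues as soon as its source registers are ready. With `overlap=False` the whole FPU is treated as busy until the previous instruction completes.

```python
LATENCY = {"fadd": 3, "fmul": 3}  # assumed latencies in cycles

# (unit, destination register, source registers)
PROGRAM = [
    ("fmul", "fp2", ("fp0", "fp1")),
    ("fadd", "fp3", ("fp0", "fp1")),
    ("fmul", "fp4", ("fp0", "fp1")),
    ("fadd", "fp5", ("fp2", "fp3")),  # depends on the first two results
]


def run(program, latency, overlap):
    """Return the cycle at which the last result becomes available."""
    reg_ready = {}  # scoreboard: register -> cycle its value is ready
    fpu_free = 0    # cycle the FPU is free again (used when not overlapping)
    cycle = 0       # earliest cycle the next instruction may issue
    finish = 0
    for unit, dest, srcs in program:
        operands = max((reg_ready.get(r, 0) for r in srcs), default=0)
        start = max(cycle, operands)
        if not overlap:
            start = max(start, fpu_free)
        done = start + latency[unit]
        reg_ready[dest] = done
        fpu_free = done
        cycle = start + 1
        finish = max(finish, done)
    return finish


print("overlapped:", run(PROGRAM, LATENCY, overlap=True))
print("blocking:  ", run(PROGRAM, LATENCY, overlap=False))
```

For this program the overlapped case finishes in 7 cycles and the blocking case in 12, using the same sub units and the same 8-register-style code. The only extra state is one ready time per register.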

Status: Offline

Lou

Re: One major reason why Motorola and 68k failed...
Posted on 30-May-2024 23:02:39

[ #156 ]

Elite Member

Joined: 2-Nov-2004
Posts: 4259
From: Rhode Island

@matthey

Quote:

matthey wrote:
Lou Quote:

That's a heck of a coping mechanism...

...

This is why 68k failed. The instructions per Clock wasn't competitive until the 68040. Too little too late. Too expensive.

I'm really not concerned about the 68000 losing a benchmark with unoptimized and unrealistic code. "Instructions per Clock" are important but your 68000 benchmark code isn't optimized to reduce them? How experienced is a 68000 assembly programmer that doesn't know about MOVEQ and ADDQ?

The 6502 code had a worse disadvantage.
You're quite the deflector.

Let's add your 'q' instructions along with the writing to ram that the 6502 was uselessly doing and rerun then...

Status: Offline

matthey

Re: One major reason why Motorola and 68k failed...
Posted on 31-May-2024 0:22:48

[ #157 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2754
From: Kansas

Lou Quote:

The 6502 code had a worse disadvantage.
You're quite the deflector.

Let's add your 'q' instructions along with the writing to ram that the 6502 was uselessly doing and rerun then...

Rather than arbitrarily decide which CPUs can or should do what, how about using a simple benchmark with code that performs something useful, like the Byte Sieve benchmark? Dhrystone or BYTEmark/NBench benchmarks would be better, but the 6502 is primitive and has trouble supporting compilers.
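
For anyone who hasn't seen it, the core of the Byte Sieve is small. The sketch below is a Python rendering of the classic algorithm for reference only, assuming the usual array size of 8190. The function name `byte_sieve` is made up here, and a real comparison would use hand-written 6502 and 68000 assembly with the pass repeated (traditionally 10 iterations) and timed.

```python
SIZE = 8190  # array size used by the classic benchmark


def byte_sieve(size=SIZE):
    """One pass of the sieve; flags[i] stands for the odd number 2*i + 3."""
    flags = [True] * (size + 1)
    count = 0
    for i in range(size + 1):
        if flags[i]:
            prime = i + i + 3
            k = i + prime
            while k <= size:     # strike out odd multiples of this prime
                flags[k] = False
                k += prime
            count += 1
    return count


print(byte_sieve())  # the classic benchmark reports 1899 primes
```

The inner loop is just an indexed byte store plus an add and a compare, so it exercises addressing modes and loop overhead on both CPUs without favoring either by construction.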

Status: Offline

Kronos

Re: One major reason why Motorola and 68k failed...
Posted on 31-May-2024 6:46:24

[ #158 ]

Elite Member

Joined: 8-Mar-2003
Posts: 2766
From: Unknown

@kolla
Quote:

kolla wrote:
@Gunnar

Maybe the world isn't filled with amigans obsessing about rotating 3D objects?

Or maybe the world is filled with people who haven't heard about "GPU"s over the past 30+ years?

_________________
- We don't need good ideas, we haven't run out on bad ones yet
- blame Canada

Status: Offline

Gunnar

Re: One major reason why Motorola and 68k failed...
Posted on 31-May-2024 8:47:50

[ #159 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

@matthey

Quote:
The 68060 did receive integer register renaming for an in-order core which is not necessary but makes the 68060 more forgiving to program and improves performance.

This is not true.

The 68060 does not have register renaming.

Status: Offline

Gunnar

Re: One major reason why Motorola and 68k failed...
Posted on 31-May-2024 8:50:33

[ #160 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

@matthey

Quote:
There are often separate sub units in the FPU like FADD, FMUL, FDIV, FMISC, etc.

The APOLLO 68080 FPU is fully parallel and can do 22 FPU instructions in parallel at the same time.

But this does NOT solve the limitation of the registers.
To calculate and store the results of 22 FPU instructions you need a lot more than 8 Registers.

It's very simple to understand this.

Status: Offline

Amigaworld.net was originally founded by David Doyle