Amigaworld.net - The Amiga Computer Community Portal Website

home

features

news

forums

classifieds

faqs

links

search

6071 members

Amiga Q&A / Free for All / Emulation / Gaming / (Latest Posts)

Login

Lost Password?

Don't have an account yet?
Register now!

Support Amigaworld.net

Your support is needed and is appreciated as Amigaworld.net is primarily dependent upon the support of its users.

Menu

Main sections

»	Home
»	Features
»	News
»	Forums
»	Classifieds
»	Links
»	Downloads

Extras

»	OS4 Zone
»	IRC Network
»	AmigaWorld Radio
»	Newsfeed
»	Top Members
»	Amiga Dealers

Information

»	About Us
»	FAQs
»	Advertise
»	Polls
»	Terms of Service
»	Search

IRC Channel

Server: irc.amigaworld.net
Ports: 1024,5555, 6665-6669
SSL port: 6697
Channel: #Amigaworld
Channel Policy and Guidelines

Who's Online

8 crawler(s) on-line.

178 guest(s) on-line.

0 member(s) on-line.

You are an anonymous user.
Register Now!

matthey: 5 mins ago

Matt3k: 15 mins ago

NutsAboutAmiga: 24 mins ago

OneTimer1: 27 mins ago

pixie: 30 mins ago

Karlos: 1 hr 24 mins ago

OlafS25: 1 hr 28 mins ago

AMIGASYSTEM: 1 hr 59 mins ago

Hammer: 2 hrs 7 mins ago

CosmosUnivers: 2 hrs 15 mins ago

Forum Index

General Technology (No Console Threads)

Amiga SIMD unit

Poster

Thread

cdimauro

Re: Amiga SIMD unit
Posted on 23-Oct-2020 22:55:01

[ #181 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Fl@sh Quote:
Fl@sh wrote:
@cdimauro
Quote:
I don't remember now which instructions were removed, either if they are user or privileged ones. I'll check once I've some time.

Privileged

OK, thanks for confirming it. Then there's no problem. Only the overlapping with Altivec might cause issues.

@Hammer Quote:

Hammer wrote:
@cdimauro Quote:

The micro-architecture is an implementation detail: what counts when talking about real code is which ISA & extensions are exposed, and that applications can make use of.

From this PoV Jaguar has a complete AVX ISA extension, which means that vector registers are 256-bit in size and both 128 and 256-bit instructions are available. It's like Ryzen 1&2, which had an AVX/-2 ISA SIMD available, but the internal implementation is 128-bit.

What you reported are microarchitecture-level tricks that are used to gain better performance for a specific implementation. There's no "AVX-128" mode.

There is AVX 128-bit SIMD size refer to GCC's 128-bit AVX auto-vectorization option.

Jaguar's 256 bit AVX is merely for forward compatibility with a higher latency penalty. Jaguar wasn't optimally designed for 256-bit AVX workloads when there is higher plenty.

Jaguar's load/store units are only 128-bit wide not like Zen 2's two 256-bit wide load units and one 256 bit store unit.

In Jaguar, AVX operations complete as 2 x 128-bit operations, while all other 128-bit operations can execute without multiple passes through the pipeline, this increases time to completion latency.

Jaguar's Store queue has 20 entries that are 16bytes (128-bits) wide!

Jaguar’s L1D can sustain a 128-bit read and a 128-bit write each cycle!

1.6 Ghz Jaguar with AVX-256 is like Jaguar at 800Mhz!

ASM/C/C++ programmer (especially for soon to be obsolete game consoles) needs to know the microarchitecture's weaknesses to minimize performance pitfalls.

I have plenty of criticisms against Jaguar's microarchitecture.

OK, but this doesn't and cannot change what I've stated: you cannot take what a compiler introduces on its own as a reference of the reality.

What count is Jaguar's ISA, and this fully supports AVX: 16 256-bit registers in 64-bit mode, and all instructions implemented.
Quote:
You can't ignore microarchitecture weaknesses when it comes to high-performance 3D game engines.

I don't, in fact, but this is ANOTHER question. As a coder I also have to take into account the microarchitecture if I want to squeeze the most from my code running on that particular chip, but this isn't something new neither strictly related to Jaguar.

As I've said before, AMD's Ryzen 1 & 2 had EXACTLY the same implementation of Jaguar, and only with Ryzen 3 (AKA Zen2) there's a full 256-bit implementation for AVX/-2.

But nothing new as I said: it happened also with Intel's Pentium III when the SSE was introduced, as well as the Pentium IV (AFAIR).
Quote:
Quote:

Let's see, because rumors and slides were different from real-world products, when we talk about AMD GPUs (AMD is struggling to be competitive from very long time).

Reminder, Intel's GPU efforts are worst than AMD's.

Yes, Intel tried the first time with Larrabee and failed. But now, at the second time, the results look completely different.

Maybe you forgot that the father of Intel's Xe is Koduri, which was the father of GCN and RDNA1: I think he knows how to do competitive GPUs...
Quote:
RX 5700 XT is the feature set obsolete when it doesn't support XSS/XSX's DirectX12 Ultimate and DirectX12 Feature Level 12.2. Turing RTX supports DirectX12 Ultimate and DirectX12 Feature Level 12.2!

There's a very high chance for RDNA 2's high clock speed due to PS5's reveal.

Using PS5's GPU clock speed,
RX 6800 XT's 64 CU at 2230 Mhz yields 18.268 TFLOPS which is almost 2X of RX 5700 XT.

RX 6900 XT's 80 CU at 2230 Mhz yields 22.835 TFLOPS which is 2.34X over RX 5700 XT.

Unlike Polaris/ Vega's GCN TFLOPS, RDNA v1's TFLOPS nearly like Turing RTX's TFLOPS.

As I said before, I wouldn't compare GCN/RDNA's TFLOPS with Turing's ones: the efficiency of those architectures are different.

A better comparison would be with both GPUs at around the same TFLOPS, but looking at the results with some benchmarks (several different games and GPGPU applications): this makes much more sense.
Quote:
Desktop PC SKUs are not limited by the game console's TDP limitation and NVIDIA has thrown PEG 300 watts standard design limit out of the window with Ampere RTX.

My MSI RTX 2080 Ti Gaming X Trio has three 8 -pins PCI-E power sockets to blast past 300 watts which can narrow the gap with RTX 3080.

https://www.techpowerup.com/review/nvidia-geforce-rtx-3080-founders-edition/31.html
RTX 3080 FE's peak gaming power consumption is 348 watts.

https://www.techpowerup.com/review/nvidia-geforce-rtx-3080-founders-edition/34.html
RTX 2080 Ti is 76% of RTX 3080

https://www.techpowerup.com/review/msi-geforce-rtx-2080-ti-gaming-x-trio/33.html
My MSI RTX 2080 Ti Gaming X Trio with a factory overclock mode is 6% faster over the stock RTX 2080 Ti.

End user's overclock can yield another 6.7% increase.
https://www.techpowerup.com/review/msi-geforce-rtx-2080-ti-gaming-x-trio/36.html

https://www.techpowerup.com/review/msi-geforce-rtx-2080-ti-gaming-x-trio/31.html
RTX 3080 Ti's peak gaming power consumption is 358 watts.

If NVIDIA can blasted pass PEG's 300 watts design limit, so can AIB's RX 6800 XT and RX 6900 XT.

My argument is based on history when PS4 has GCN version 2.0 with 20 CUs design at 800Mhz while PC's R9-290X GCN version 2.0 with 44 CU design at +1Ghz.

PS5 GPU has 20 DCU (aka 40 CU) up to 2.23 Ghz RDNA 2 design.

RX 6900 XT has 40 DCU (aka 80 CU) RDNA 2 design.

RX 6800 XT has 32 DCU (aka 64 CU) RDNA 2 design.

Add 200 Mhz on top PS5 GPU's 2.23Ghz lands on 2.43 Ghz.

Ampere has doubled CUDA cores within SM units without increasing rasterization hardware, hence game results is meh when compared to RTX 2080 Ti OC.

RTX 3080 is using GA102 which the same as RTX 3090 instead of expected GA104.

Usually, G?104 is assigned for ?080 type SKU e.g. GTX 1080 or RTX 2080.

NVIDIA knows AMD's expected RDNA 2 RX 6800 XT/RX 6900 XT results and GA104 wouldn't be enough.

OK, but I don't understand all this long list of numbers, especially with coming graphic cards which we still don't know the full specs.

As I said, it's better to see at real-world number by benchmarking all cards with games & applications. Numbers in this context don't give anything useful IMO.

@Hammer Quote:
Hammer wrote:
@cdimauro Quote:

They simply don't further develop the 32 ISA anymore. Thumb-2 et similar are here to stay for the partners that want to use them. But future AMD ISA will be entirely 64-bit AFAIR: so, not even the 32-bit execution mode will be supported.

Zen 3 still has native support for X86-32.

Zen 3 still supports IA-32/x86. x86-32 is still supported by any x86-64/x64 processors (it's a different ABI for this ISA).
Quote:
AMD doesn't officially support Windows 95/98/Me/NT/2K/XP/7, but it still runs fine on it.

https://www.youtube.com/watch?v=KFEpHEXBCbA
Ryzen 7 2700 running retro DOS-based Windows 98 and Doom 2 game just fine. No Motorola style instruction set kitbashing when it comes to running legacy Windows OS and Doom 2.

The problem is with the drivers, unfortunately.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 24-Oct-2020 8:05:31

[ #182 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2014
From: Kansas

Quote:

cdimauro wrote:
Usually yes and I agree, but, as I said before, I've already a very good code density with my ISA (and without using many more other features that can greatly reduce it) which has 32 GP registers (but "unused" currently: for my experiments and statistics I just translates x86/x64 instructions to an equivalent of my ISA. So, currently the comparisons are always using 8 registers for x86 and 16 for x64).

There are a some complex functions which can use more than 16 registers but CISC can use a few variables in memory with little penalty.

The 68000 seemed to have more registers than needed but the 68060 seemed like it barely had enough. The 68000 could SWAP the upper and lower halves of the 32 bit data registers where the upper half was mostly unused. Datatype sizes increased, functions became more complex and the 68060 was pipelined for 32 bit which was more efficient than smaller sizes. While pointers today are 64 bits, a 32 bit size is adequate for most integers and the default integer size in some 64 bit ABIs including x86-64. While x86-64 likely made this choice more for compatibility, AArch64 found a savings from providing 32 bit operations in a RISC ISA and extending as necessary in powerful CISC like addressing modes if used as an index register. A 68k 32/64 bit ISA could be similar but also take advantage of the high 32 bit half of data registers as extra registers like the 68000 but perhaps with improved access over just having SWAP. The 68k address registers already sign extend all data to the size of the register which would be 64 bit for a 64 bit 68k but still allows 32 bit accesses for compatibility and a 32 bit ABI. Data registers are more like the x86 registers but the 68k has enough encoding space, especially after cleaned up, that it doesn't need to auto sign extend all values like x86-64 so the upper half of the registers can be used. This could give a 64 bit 68k 16 32 bit data registers or 8 64 bit data registers while retaining 8 64 bit address registers with improved PC relative addressing to reduce their usage. Simple operations like a 64 bit SWAP or MOVE.L Dm(h/l),Dn(h/l) can still be a 16 bit instruction but more complex instructions like ADD would need to grow if deemed worth adding. Rather than waste the upper half of data registers extending them with 32 bit integer datatypes and/or adding more 64 bit registers which rarely use the high half, I would like to more efficiently use the register file. This should return the 68000 feel of more registers while code pipelined for a 32 bit processor like the 68060 should perform well. That's a hint of what I'm thinking about for 68k registers. It is nice to have a few more registers especially if they are cheap enough.

Quote:

That's strange. Both have the same number of registers and similar commonly used addressing modes. Any idea why it's happening?

I'm not sure but the 68k seems to be more efficient at using the registers it has. The 68k is a more flexible combination of a reg-mem and mem-mem architecture and the x86 is a combination of a reg-mem and accumulator architecture. The 68k does have (An)+ and -(An) addressing modes which are commonly used, it is more flexible at accessing the stack and it has better PC relative addressing (x86-64 improved dramatically with RIP relative addressing but the 68k is still better for 32 bit other than the lack of PC relative writes which are more important for 64 bit). The number of memory accessing instructions is much lower but that is due to MOVEM and MOVE mem,mem instructions. The x86 is designed to use the stack often and code density degrades significantly without (x86-64 code density degrades further with the use of more than 8 registers or 64 bit). Perhaps the simpler and smaller code with more memory accesses is chosen sometimes in less optimized code. The 68k doesn't have to choose between performance or code density as the smallest code is often the best performance. I don't have any hard data but we (Apollo Team) did look at the stats from Amiga disassemblies of many programs (it is easier to accurately disassemble 68k code) and we were happy with the surprisingly low memory traffic. The Dr. Weaver code density comparison also showed the 68k to be competitive in memory traffic to 32 register RISC while optimizing for size where x86 and x86-64 were horrible.

Quote:

I never heard of problems caused by the register file for Atom and Larrabee, unless we talk about the SIMD ones (which is the reason why low-end x86/x64 processors had only SSE integrated, and not AVX/-2). The decoder, on the other end, was and is a sensible element for x86 and x64.

Power reduction was likely more important than area reduction for the slimmed down Atom and Larrabee cores. The decoder and instruction fetch (byte aligned code gives fewer instructions for any fetch size and larger x86-64 instructions makes it difficult to reduce) were probably priorities. Compatibility was important for both projects making it difficult to reduce the register file the same as the ColdFire team wisely decided not to reduce the number of registers for the same reason (the ColdFire Team sounded more concerned about area=cost than power with no mention of the decoder or fetch for the simplified 68k like ISA).

Quote:

Do you have some studies / numbers about the register file?

Registers are not flat SRAM memory storage the same as caches are not with ways and multiple ports. Registers have r/w ports which make them much more expensive (area, power and slower) and more powerful addressing modes require more ports. Most modern processors have rename registers which usually grow as the number of visible architectural registers grows (Core i7 had 128 and POWER7 almost 200 rename registers for example). OoO cores have many temporary locations for register values. Each thread of each core usually has its own registers. The cost of registers obviously depends on the design which goes up with the performance.

Wiki talks specifically about the register file arrangement of Atom and Larrabee cores which appears to be simplified, especially for the in order cores.

https://en.wikipedia.org/wiki/Register_file

Computer Architecture a Quantitative Approach (5th edition) is a great free resource on processor design which talks about register files and much more. See chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures which is more on topic for this thread.

http://acs.pub.ro/~cpop/SMPA/Computer%20Architecture%20A%20Quantitative%20Approach%20(5th%20edition).pdf

Finally, I provide an old paper which talks about register file timing. The believe then was that OoO RISC cores would grow wider and wider issue while pushing to higher and higher clock speeds. It was assumed that single cycle register file accesses would no longer be possible. The cost and benefits of different port and result forwarding/bypass configurations are shown in addition to the multi-banked register file comparison. Multi-core is popular today and high frequency is rarely pushed while process advances make timing easier but register file timing can still be a consideration in case you were wanting to make a 1024 bit wide SIMD unit. Perhaps the Knight's Landing cores reduced the core frequency because they would heat up causing the AVX-512 register files to have timing problems.

http://people.ac.upc.edu/cruz/paper/isca2000.pdf

Quote:

I agree, but I suspect that the oldest processes used for embedded aren't the ones available 20 or more years ago. AFAIR many are using 32-28nm processes, which allow a very good number of transistors packed in a small area.

I doubt that embedded SoCs are smaller than 1mm^2 area.

The most transistors for the price changes as newer fab processes become cheaper. Companies like NXP are generally a little ahead of the curve which gives a competitive advantage for performance and energy benchmarks and allows them to take advantage of cheaper production costs for popular chips. Prototype ASICs and small run custom ASICs are more likely to try to hit the peak.

Quote:

They simply don't further develop the 32 ISA anymore. Thumb-2 et similar are here to stay for the partners that want to use them. But future AMD ISA will be entirely 64-bit AFAIR: so, not even the 32-bit execution mode will be supported.

The EE Times survey from 2019 showed that 32 bit is far from done in the embedded market.

My current embedded project main processor is a:
10% 8 bit
11% 16 bit
61% 32 bit
15% 64 bit

ARM has tiny 32 bit Thumb 2 only processors for embedded. It will be interesting to see how well they support them. Their current designs will eventually become outdated as old fabs stop production of older processes.

Status: Offline

cdimauro

Re: Amiga SIMD unit
Posted on 24-Oct-2020 9:29:01

[ #183 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:

matthey wrote:
[...]
Conclusions:
PPC has plenty of registers including separate SIMD and FPU registers and passes many arguments in registers but function overhead can be high and moving registers between register files through memory may be slower.

I personally think that the PowerPC ABI is the biggest limit for this ISA: the prologue and epilogue are way too long and complicated, and I would have expected much less instructions to be used for them, since there are plenty of registers available.

IMO PowerPC can do much better with a different ABI (and maybe compilers).
Quote:
x86_64 overhead can vary depending on the ABI and we don't know what was used in the claim. Using all stack parameters gave a 0.86% slowdown in integer performance on x86_64 compared to register parameters in one paper likely because of so much inlining which jumped to a 7.4% slow down with compiler inlining disabled for integers and 1.2% for floating point.

Unfortunately even x86_64 can have big prologues and epilogues, due to pushing or moving some parameters to the stack, and recovering them back before returning.
The stack frame building & destroying doesn't take long, and this is the good part.

So, the main problem here is the luck of a MOVEM instruction, which greatly helps here. However those kind of instructions require the implementation of a pipeline which stalls while doing those kind of long operations, and become a big problem on real-time systems.

For those reasons I preferred to remove them from my ISA (PUSHA and POPA are emulated via millicode), and I've introduced another mechanism to save and restore the most commonly pushed/popped registers: it's not as elegant as on the 68K, but it's much cheaper/easier to implement, and should give better code density results (a short, 2 bytes instruction is used on the prologue and on the epilogue), but paying a small performance penalty (similar to the same solution adopted by RISC-V, but more compact).
Quote:
Like increasing the number of registers, the advantages are often over stated. PPC is powerful but feels unfriendly, unwieldy and wasteful especially considering how much more instruction cache is used as can be seen by the function prologue and epilogue above. Most PPC programmers don't have a good understanding of the hardware which results in inferior code compared to x86_64. Low level programmers preferred x86_64 despite the warts, inconsistencies and kludges which is especially helpful for SIMD programming.

Indeed. The above problems with function calls can be avoided by assembly coders, but doing it with the PowerPC ISA is a pain.
Quote:
AArch64 has shown that attaching an SIMD unit to a RISC processor can be done better and in a standard way. Even ARM doesn't want to compete with the beef or the bloat of the x86_64 SIMD unit though.

Correct, but the problem with the x86_64's (and IA-32) SIMD units is that those are incremental additions to the base ISA, which brought to some thousand of instructions filling the holes in the opcode table and by extensively using the infamous prefixes.

A clean design can implement all those instructions in a much better and efficient way. My ISA has a fully orthogonal SIMD extension, where each instruction can operate on:
- MMX, SSE, AVX, AVX-512 registers;
- scalar or packed data;
- integer of floating point data;
- 8, 16, 32, 64 (for ints) or 16, 32, 64, 128 (for floats) bits.
Additionally a mask can be applied as well as all AVX-512 extensions (yes: even for MMX/SSE/AVX).

This simplifies the life to all: assembly coders, compilers, and the engineers that have to implement the frontend (decoders) and backend (ALU design).

@NutsAboutAmiga Quote:

NutsAboutAmiga wrote:
@Hammer

Well with PowerPC and Risk in general they noticed, that more and more people where writing C code, and that compiler where unable to take advantage of special instructions, the idea was that if they removed the dead wight, they can reduce the power consumption and increases the clock frequency.

the result is CPU that’s great at being OK, but not great a being the fastest, or most HOT cpu, the real problem is they never really managed to agree on what instruction to include or not, because of this it make hard to optimize code in assembler.

It was perfect CPU to put in a highend printer, or router, back in 2001.

Not even for that: I don't see a single reason why a PowerPC can be preferred to a 68K, ARM, RISC-V in the embedded world.

@Fl@sh Quote:

Fl@sh wrote:
For most technical guys I suggest to look at following link

http://studies.ac.upc.edu/ETSETB/SEGPAR/microprocessors/altivec%20(mpr).pdf

And refresh most AltiVec simd features ..and compare them with all other present in direct competitors.

The document is very very old and is comparing Altivec to... MMX: Intel's first SIMD implementation.

Take a look at this: AVX: A leap forward for DSP performance
Quote:
In any modern simd implementation would be a good choice to have at least the same AltiVec features and not share any of integer and float registers with simd unit.

Yes for a completely new ISA. But nowadays a modern ISA can have no FPU, and just use the scalar instructions which are needed even on a SIMD unit/extension.

@Fl@sh Quote:

Fl@sh wrote:

I want repeat ppc isa is the more recent among all others, with exception of riscv, and it was developed with future in mind. This is reason why on Ppc word we had a soft 64bit transition, little endian support, latest VMX powerful extensions, embedded and custom cores spacing from car industry to gaming consoles, sharing the same code.

In the real world you have several differences and not all code can be shared, as we also discussed in some comments. And it's not only a PowerPC problem: the 68K has the same problems.

Incredibly it's Intel which did the best job from this PoV, because it's fully backward compatible with everything.

BTW, the last VMX version doubles the number of vector registers by reusing the FPU register: something which, for you shouldn't be made...
Quote:
On paper ppc is a good architecture, better than others, the handicap resides in vendors implementations and scale economy fab processes.
With right investments in research and fab process, we could have today a z80 clocked at 10ghz, maybe faster than any x86.

My2cents.

Yes: on paper. In reality there's nothing on PowerPC which makes it attractive on some scenario.

Status: Offline

cdimauro

Re: Amiga SIMD unit
Posted on 24-Oct-2020 9:51:19

[ #184 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@CosmosUnivers Quote:

CosmosUnivers wrote:
@matthey

The PPC is a big piece of junk, that's why Phase5 added this CPU in front of the 68k on BlizzardPPC and CyberStormPPC : to dirty the 68k and the Amiga...

Same scenario with the AMMX... Same story with ARM on Warp 560/1260, ZZ9000 and A314...

They add new problems for after create a solution : the purchase of the new cards of course, nothing is free with them...

I hope all the Amiga users will finally understand one day : the ennemies are inside in our community...

Next f**k will be certainly an new Efika ARM, and a new Pegasos ARM : they use always the same synopsis, they repeat because they saw the previous failed = they are 100% certain their new hardware will fail too...

ARM will never work, AMMX will never work, PPC will never work, x86 (new MorphOS) will never work : and they **P*E*R*F*E*C*T*L*Y** know that...

Your fundamentalism it totally insane.

@Hammer Quote:

Hammer wrote:

68080 is a good "what IF" instead of Motorola committing 68K suicide.

From a purely technical/ISA perspective I don't agree, looking at what they are doing.

I can only agree if we talk about the Motorola "philosophy": filling holes in the opcode table according to the "needed" additional instructions and/or extensions. Here I think that the Apollo team is a very good Motorola heir...

@A1200coder Quote:

A1200coder wrote:
@Hammer Quote:

68080 is a Pentium MMX class CPU for the 68K CPU family.

I would say that this is incorrect; the 68080 is able to execute 4 instructions in parallell, which makes it even better than Pentium Pro; basically Pentium Pro is the same CPU as used in Pentium 2/3. Pentium 3 introduced also SSE instruction set.
(Early P6-family: Pentium Pro/PII/PIII, and Pentium M. Also Pentium 4: a maximum of 3 instructions per cycle can be achieved.)

Those comparisons aren't real and don't make sense.

Executing 4 instructions per clock-cycle is very difficult even for an out-of-order design, due to the dependencies in the code flow.

The 68080 is an in-order processor, so it's much worse even compared to a PentiumPro (which was a 3-ways OoO design).

You're an experienced 68K assembly coder, so you should know how difficult was/is to pair instructions even on a simpler 68060...
Quote:
The memory performance is also exceptionally good for 68080, beating even some 1 GHz PPC CPUs in this area.

Those are synthetic benchmarks, which don't match the code executed on real world applications.
Quote:
The 68080 runs m68k software clearly faster than a 68060 at same clock speed without the use of any special features of 68080, like AMMX. Some claim that it corresponds to a 120 MHz 68060 for some workloads at current clockspeed of FPGA, which is around 85 MHz.

That's the reality: for SOME workloads...

@A1200coder Quote:

A1200coder wrote:

But certainly the 68080 is faster than Pentium MMX, since Pentium, like 68060, can only execute max 2 instructions per clock.

The Pentium hadn't the same 68060 limits when executing the 2 instructions, and it had a pipelined FPU...

Last edited by cdimauro on 24-Oct-2020 at 09:54 AM.

Status: Offline

cdimauro

Re: Amiga SIMD unit
Posted on 24-Oct-2020 10:22:47

[ #185 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:

matthey wrote:
Quote:
Fl@sh wrote:
Obviously in ppc code you are saving and restoring even vector registers, not present in 68k.

That is one of the disadvantage of having more registers and different register files. For example, x86_64 has a shared SIMD and FPU register file which reduces this cost.

This happens only for the MMX SIMD. Starting from the SSE x86/x64 uses a separate register file.
Quote:
Using more registers can sometimes slow performance. Context switches where all the registers are saved are slower too.

Depends on the context / application field. It's relevant on very low-performance cores, but not on other scenarios (on desktop, server, and HPC there are usually 100 context switches per second. 1000 in some cases).

@bhabbott Quote:

bhabbott wrote: Quote:

Hammer wrote:

68060 (at 85Mhz)'s and AC68080's Doom and Quake results are okay but they are less than my Pentium 166Mhz /430VX/S3 Trio 64V results.
20 years later and Amiga users still suffer from PC envy.

Who cares how fast some PC can run a crappy old game? Doom is plenty fast enough on an 060 or Vampire, and Quake is a boring game at any speed.

Boring FOR YOU, but it's really strange to see that so many amigans are continuously checking at Doom and Quake's performances: who knows why...
Quote:
Quote:
68060 at 85Mhz comparable to Pentium 90.... I don't think so!

I had a 50MHz 060 in my A3000 and it felt about the same as a Pentium 90 - except for Quake which was apparently much slower (I never tried it on a PC). It was the only game I bought that was supposed to make use of the 060's power, and it stunk. Meanwhile my A1200 ran Amiga games perfectly - much more fun! No way a Pentium 90 could compete with that.

Now you use your "feeling" to compare Amigas and PCs: a very good "metric", right?
Quote:
Any discussion today about Amiga vs PC speeds is silly. All that matters is can I make my Amiga fast enough to do what I want. For 99% of the Amiga stuff I have, my 50MHz 030 equipped A1200 is plenty fast enough. For a few applications such as web browsing and compiling C code the Vampire in my A600 is nicer. It runs Doom ridiculously fast even in hires, but I had just as much fun playing it on the A1200 where the lower frame rate made it more of a challenge - and it looks better on the big TV screen in composite.

It's fine if it fits YOUR expectations. But maybe you have low requirements / expectations from a computer...
Quote:
To me the Amiga is about the total experience, not silly 3D benchmarks.

Since when playing Doom or Quake can labelled as benchmarking?
Quote:
A PC is just an appliance, a boring box full of forgettable hardware and an OS that gets less enjoyable with each generation. I use them when I have a job to do, but I use the Amiga when I want to savour the experience.

An here comes out your real nature: another blind Amiga fanatic which is still stuck to the beginning of the 90s...
Quote:
Quote:
It's Doom benchmark time (again)...

Ho hum. 50fps, 99fps, who cares?

Then why you've written a so long post trying to make such silly justifications?

You talked about envy at the beginning of your post, and that's exactly what I can see...

Status: Offline

Fl@sh

Re: Amiga SIMD unit
Posted on 24-Oct-2020 10:39:20

[ #186 ]

Regular Member

Joined: 6-Oct-2004
Posts: 253
From: Napoli - Italy

@matthey

Quote:
There are a some complex functions which can use more than 16 registers but CISC can use a few variables in memory with little penalty.

The 68000 seemed to have more registers than needed but the 68060 seemed like it barely had enough. The 68000 could SWAP the upper and lower halves of the 32 bit data registers where the upper half was mostly unused. Datatype sizes increased, functions became more complex and the 68060 was pipelined for 32 bit which was more efficient than smaller sizes. While pointers today are 64 bits, a 32 bit size is adequate for most integers and the default integer size in some 64 bit ABIs including x86-64. While x86-64 likely made this choice more for compatibility, AArch64 found a savings from providing 32 bit operations in a RISC ISA and extending as necessary in powerful CISC like addressing modes if used as an index register. A 68k 32/64 bit ISA could be similar but also take advantage of the high 32 bit half of data registers as extra registers like the 68000 but perhaps with improved access over just having SWAP. The 68k address registers already sign extend all data to the size of the register which would be 64 bit for a 64 bit 68k but still allows 32 bit accesses for compatibility and a 32 bit ABI. Data registers are more like the x86 registers but the 68k has enough encoding space, especially after cleaned up, that it doesn't need to auto sign extend all values like x86-64 so the upper half of the registers can be used. This could give a 64 bit 68k 16 32 bit data registers or 8 64 bit data registers while retaining 8 64 bit address registers with improved PC relative addressing to reduce their usage. Simple operations like a 64 bit SWAP or MOVE.L Dm(h/l),Dn(h/l) can still be a 16 bit instruction but more complex instructions like ADD would need to grow if deemed worth adding. Rather than waste the upper half of data registers extending them with 32 bit integer datatypes and/or adding more 64 bit registers which rarely use the high half, I would like to more efficiently use the register file. This should return the 68000 feel of more registers while code pipelined for a 32 bit processor like the 68060 should perform well. That's a hint of what I'm thinking about for 68k registers. It is nice to have a few more registers especially if they are cheap enough.

You can call all these features, instead I call it hacks/triks.
This way to do requires much transistors and levels up a lot ISA complexity.

For me simpler is better, so I prefer max optimization and efficienty for 64bit isa archs and, if you need a 8, 16 or 32 bit datatype, it will be managed as a full 64 bit one.
Even for addressing modes I would prefer something simple with 3/4 addressing modes, no more.

Even a big register file is a no option, avoiding in this way stupid push/pop instructions and avoiding stack usage as much as possible, passing parameters by registers without using mem.
With smart ABI you can describe in a simple way how things need to be done and compilers can do it much nearly to a handwritten assembly code, whithout wasting cpu resources present, but never used.

Minimizing complexity and transistors count, a modern fab process can speedup clock and optimize efficiency at same time.

I want repeat.. even compilers can gain major improvements building a simpler code with less rules and exception to manage.

This is the reason why prefer no implementation of 68080 simd unit in this first phase.
Simply there's no transistor room to make it in a good way like modern Altivec/AVX implementations did.
An AMMX simd would be a big legacy mistake to add in next fpga versions.
IMHO for now would be much better improve compatibility with real 80bit 68k FPU or add some fmadd and others image/sound processing instructions expanding them even on floats.
Even akiko chip was a good idea to implement for emulation of some game based on bitplane displays.

IMHO an Intel like tick/tock strategy would be a good way to do.

_________________
Pegasos II G4@1GHz 2GB Radeon 9250 256MB
AmigaOS4.1 fe - MorphOS - Debian 9 Jessie

Status: Offline

Fl@sh

Re: Amiga SIMD unit
Posted on 24-Oct-2020 10:44:36

[ #187 ]

Regular Member

Joined: 6-Oct-2004
Posts: 253
From: Napoli - Italy

@cdimauro

Quote:
I personally think that the PowerPC ABI is the biggest limit for this ISA: the prologue and epilogue are way too long and complicated, and I would have expected much less instructions to be used for them, since there are plenty of registers available.

IMO PowerPC can do much better with a different ABI (and maybe compilers).

Sorry I don't agree.
please take 5 minutes for this The simplified 64-bit ABI - IBM POWER

_________________
Pegasos II G4@1GHz 2GB Radeon 9250 256MB
AmigaOS4.1 fe - MorphOS - Debian 9 Jessie

Status: Offline

cdimauro

Re: Amiga SIMD unit
Posted on 24-Oct-2020 17:33:38

[ #188 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:

matthey wrote:
Quote:
cdimauro wrote:
Usually yes and I agree, but, as I said before, I've already a very good code density with my ISA (and without using many more other features that can greatly reduce it) which has 32 GP registers (but "unused" currently: for my experiments and statistics I just translates x86/x64 instructions to an equivalent of my ISA. So, currently the comparisons are always using 8 registers for x86 and 16 for x64).

There are a some complex functions which can use more than 16 registers but CISC can use a few variables in memory with little penalty.

The 68000 seemed to have more registers than needed but the 68060 seemed like it barely had enough. The 68000 could SWAP the upper and lower halves of the 32 bit data registers where the upper half was mostly unused. Datatype sizes increased, functions became more complex and the 68060 was pipelined for 32 bit which was more efficient than smaller sizes. While pointers today are 64 bits, a 32 bit size is adequate for most integers and the default integer size in some 64 bit ABIs including x86-64. While x86-64 likely made this choice more for compatibility, AArch64 found a savings from providing 32 bit operations in a RISC ISA and extending as necessary in powerful CISC like addressing modes if used as an index register.

Makes sense. On my ISA I've decided to support only 32 and 64-bit operations for the quick/compact instructions in 64-bit mode, whereas in 32-bit mode I decided for 32 and 8-bit support. This was basically driven from the statistics which I've collected: 16 bits are rarely used in 32-bit mode, and 8/16 bits aren't used so much in 64-bit mode.
Quote:
A 68k 32/64 bit ISA could be similar but also take advantage of the high 32 bit half of data registers as extra registers like the 68000 but perhaps with improved access over just having SWAP. The 68k address registers already sign extend all data to the size of the register which would be 64 bit for a 64 bit 68k but still allows 32 bit accesses for compatibility and a 32 bit ABI. Data registers are more like the x86 registers but the 68k has enough encoding space, especially after cleaned up, that it doesn't need to auto sign extend all values like x86-64 so the upper half of the registers can be used.

x86-64 doesn't do sign-extension, but zero-extensions (when writing 32 bits to a register). Zero-extension is less useful (signed integers are used more often in programming languages), but looks like cheaper to implement.

For the 68K encoding space, I don't think there's enough in the 16-bit space in a 64-bit execution mode.
The MOVE.Size Mem,Mem takes 25% of that space, alone. Then you should at least reserve the same for ADD/SUB/AND/OR.Size Mem,Dn & the equivalents Dn, Mem: another 25% is used. Then you only have 50% of the encoding space, and you still has to reserve something for other important instructions like CMP, shift/rotate, bit testing, and the usual unary and implicit ones. And if you want to make a better use of 32-bit opcodes, you have to reserve a good amount for them, and as well as for the FPU and/or the SIMD unit.
A 16-bit opcode might give the impression to be endless, but once you start using it then it shrinks quickly.

On my ISA I've reserved roughly 50% of the opcode space for the quick/compact instructions, roughly 25% for the GP & FPU ones (but the x87 legacy takes really a small amount), and the remaining 25% for the SIMD unit.
Quote:
This could give a 64 bit 68k 16 32 bit data registers or 8 64 bit data registers while retaining 8 64 bit address registers with improved PC relative addressing to reduce their usage. Simple operations like a 64 bit SWAP or MOVE.L Dm(h/l),Dn(h/l) can still be a 16 bit instruction but more complex instructions like ADD would need to grow if deemed worth adding. Rather than waste the upper half of data registers extending them with 32 bit integer datatypes and/or adding more 64 bit registers which rarely use the high half, I would like to more efficiently use the register file. This should return the 68000 feel of more registers while code pipelined for a 32 bit processor like the 68060 should perform well. That's a hint of what I'm thinking about for 68k registers. It is nice to have a few more registers especially if they are cheap enough.

So basically you mimic what Intel did with the 8086: using the so called "high" registers to provide 8 8-bit of them using the 4 16-bit registers.
This is a quick and easy way to increase the register set, but at the expense of performance (partial registers access) and a bit more complicated implementation. It might be good for the low-end embedded market if you really want to save some space/transistors, but it's not recommended on a more general purpose ISA / implementation.

Since my ISA is 100% x86/x64 assembly-level compatible I had to support those 8-bit high registers, and with my ISA design I can apply the same trick to all register sizes. So, I can have 64 8-bit registers (32 "low" + 32 "high"), 64 16-bit, 64 32-bit, and... 64 64-bit (either by extending the registers to 128-bit or to have some "shadow" registers). However I decided to do NOT push for this trick, because I don't like the partial register access, and I don't want to complicate the ISA and the compilers. I support only the 8-bit "high" registers just for the x86/x64 compatibility, but there's a extra bit in the CPUID to signal the presence / usage of those high-registers, because on the future ISA I want to reuse the current bit in the opcode structure to select either zero or sign extension (currently I can only zero-extend any memory operand from a certain size, so without using the MOVZX instructions), which is much more useful.
Quote:
Quote:
That's strange. Both have the same number of registers and similar commonly used addressing modes. Any idea why it's happening?

I'm not sure but the 68k seems to be more efficient at using the registers it has. The 68k is a more flexible combination of a reg-mem and mem-mem architecture and the x86 is a combination of a reg-mem and accumulator architecture. The 68k does have (An)+ and -(An) addressing modes which are commonly used, it is more flexible at accessing the stack and it has better PC relative addressing (x86-64 improved dramatically with RIP relative addressing but the 68k is still better for 32 bit other than the lack of PC relative writes which are more important for 64 bit). The number of memory accessing instructions is much lower but that is due to MOVEM and MOVE mem,mem instructions. The x86 is designed to use the stack often and code density degrades significantly without (x86-64 code density degrades further with the use of more than 8 registers or 64 bit). Perhaps the simpler and smaller code with more memory accesses is chosen sometimes in less optimized code. The 68k doesn't have to choose between performance or code density as the smallest code is often the best performance. I don't have any hard data but we (Apollo Team) did look at the stats from Amiga disassemblies of many programs (it is easier to accurately disassemble 68k code) and we were happy with the surprisingly low memory traffic. The Dr. Weaver code density comparison also showed the 68k to be competitive in memory traffic to 32 register RISC while optimizing for size where x86 and x86-64 were horrible.

OK, but all of this IMO can only explain the increased memory traffic fetching the instructions (due to the worse code density) and the increase in the number of executed instructions (due to the missing (An)+ & -(An) addressing modes), but doesn't explain an increase in the memory traffic for fetching the data.
Quote:
Quote:
Do you have some studies / numbers about the register file?

Registers are not flat SRAM memory storage the same as caches are not with ways and multiple ports. Registers have r/w ports which make them much more expensive (area, power and slower) and more powerful addressing modes require more ports. Most modern processors have rename registers which usually grow as the number of visible architectural registers grows (Core i7 had 128 and POWER7 almost 200 rename registers for example). OoO cores have many temporary locations for register values. Each thread of each core usually has its own registers. The cost of registers obviously depends on the design which goes up with the performance.

Wiki talks specifically about the register file arrangement of Atom and Larrabee cores which appears to be simplified, especially for the in order cores.

https://en.wikipedia.org/wiki/Register_file

Computer Architecture a Quantitative Approach (5th edition) is a great free resource on processor design which talks about register files and much more. See chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures which is more on topic for this thread.

http://acs.pub.ro/~cpop/SMPA/Computer%20Architecture%20A%20Quantitative%20Approach%20(5th%20edition).pdf

Finally, I provide an old paper which talks about register file timing. The believe then was that OoO RISC cores would grow wider and wider issue while pushing to higher and higher clock speeds. It was assumed that single cycle register file accesses would no longer be possible. The cost and benefits of different port and result forwarding/bypass configurations are shown in addition to the multi-banked register file comparison. Multi-core is popular today and high frequency is rarely pushed while process advances make timing easier but register file timing can still be a consideration in case you were wanting to make a 1024 bit wide SIMD unit. Perhaps the Knight's Landing cores reduced the core frequency because they would heat up causing the AVX-512 register files to have timing problems.

http://people.ac.upc.edu/cruz/paper/isca2000.pdf

OK, thanks for explaining it and for the links (I'll study them once I've some time).

My impression is that for in-order cores a large register file isn't that much expensive.

The 68K has powerful addressing modes which stress more the registers for reading and also for writing them (due to the (An)+ & -(An) addressing modes).
Quote:
Quote:
They simply don't further develop the 32 ISA anymore. Thumb-2 et similar are here to stay for the partners that want to use them. But future AMD ISA will be entirely 64-bit AFAIR: so, not even the 32-bit execution mode will be supported.

The EE Times survey from 2019 showed that 32 bit is far from done in the embedded market.

My current embedded project main processor is a:
10% 8 bit
11% 16 bit
61% 32 bit
15% 64 bit

I remember those stats. You already posted them, and they give a good indication of the market trend. It'll be nice to have the numbers for 2020, to see how the market is moving.
Quote:
ARM has tiny 32 bit Thumb 2 only processors for embedded. It will be interesting to see how well they support them. Their current designs will eventually become outdated as old fabs stop production of older processes.

I don't think that ARM will have problems by just shrinking the existing cores on the new processes. 32-bit ARMs (especially the Cortex / Thumb-2 only) are here to stay for very long time.

@Fl@sh Quote:

Fl@sh wrote:

You can call all these features, instead I call it hacks/triks.
This way to do requires much transistors and levels up a lot ISA complexity.

For me simpler is better, so I prefer max optimization and efficienty for 64bit isa archs and, if you need a 8, 16 or 32 bit datatype, it will be managed as a full 64 bit one.

Fair enough. However too much simplicity can give worse results in terms of optimization / efficiency.
Quote:
Even for addressing modes I would prefer something simple with 3/4 addressing modes, no more.

Here I strongly disagree: powerful (but not too much complex) addressing modes help a lot on both performances and code density.
Quote:
Even a big register file is a no option, avoiding in this way stupid push/pop instructions and avoiding stack usage as much as possible, passing parameters by registers without using mem.

Unfortunately pushing/popping registers to/from the stack is needed even on ISAs with big register files. The net advantage is on the "leaf" function calls (the last one, where no other function will be called), but on other scenarios you still see those instructions.

A smart approach would be to use the "register windows" on function calls, like what SUN did with its SPARC, but this requires a very big register file (so, expensive to implement) and a mechanism to store or fetch the registers at every function call or return. So, problems also here...
Quote:
With smart ABI you can describe in a simple way how things need to be done and compilers can do it much nearly to a handwritten assembly code, whithout wasting cpu resources present, but never used.

I doubt that the best compilers can do better than the best assembly coders.
Quote:
This is the reason why prefer no implementation of 68080 simd unit in this first phase.
Simply there's no transistor room to make it in a good way like modern Altivec/AVX implementations did.
An AMMX simd would be a big legacy mistake to add in next fpga versions.
IMHO for now would be much better improve compatibility with real 80bit 68k FPU or add some fmadd and others image/sound processing instructions expanding them even on floats.
Even akiko chip was a good idea to implement for emulation of some game based on bitplane displays.

IMHO an Intel like tick/tock strategy would be a good way to do.

Unfortunately the 68080 design is already there and it's unlikely that those horrible patches will be removed.

@Fl@sh Quote:

Fl@sh wrote:
@cdimauro Quote:
I personally think that the PowerPC ABI is the biggest limit for this ISA: the prologue and epilogue are way too long and complicated, and I would have expected much less instructions to be used for them, since there are plenty of registers available.

IMO PowerPC can do much better with a different ABI (and maybe compilers).

Sorry I don't agree.
please take 5 minutes for this The simplified 64-bit ABI - IBM POWER

Well, but this is the simplified ABI, whereas I was talking about the regular / full ABI.

Yes, it's much better, and closely resembles the x86-64 one.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 25-Oct-2020 2:51:24

[ #189 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2014
From: Kansas

Quote:

cdimauro wrote:
For the 68K encoding space, I don't think there's enough in the 16-bit space in a 64-bit execution mode.
The MOVE.Size Mem,Mem takes 25% of that space, alone. Then you should at least reserve the same for ADD/SUB/AND/OR.Size Mem,Dn & the equivalents Dn, Mem: another 25% is used. Then you only have 50% of the encoding space, and you still has to reserve something for other important instructions like CMP, shift/rotate, bit testing, and the usual unary and implicit ones. And if you want to make a better use of 32-bit opcodes, you have to reserve a good amount for them, and as well as for the FPU and/or the SIMD unit.
A 16-bit opcode might give the impression to be endless, but once you start using it then it shrinks quickly.

There are a limited number of 68k 16 bit encodings for sure but there is a lot of room for 32 bit encodings. The 68000 encodings were simple but wasted encoding space with some less frequently used 16 bit instruction encodings some of which are better moved to 32 bit encodings or eliminated. MOVE EA,EA does take a huge amount of encoding space but this is the most used instruction like x86 and x86-64 yet is more powerful.

Quote:

So basically you mimic what Intel did with the 8086: using the so called "high" registers to provide 8 8-bit of them using the 4 16-bit registers.
This is a quick and easy way to increase the register set, but at the expense of performance (partial registers access) and a bit more complicated implementation. It might be good for the low-end embedded market if you really want to save some space/transistors, but it's not recommended on a more general purpose ISA / implementation.

It is not as bad for a 64 bit core to use the more frequent 32 bit size as 64 bit wastes processing on extending the upper bits, can have larger code and wastes more caches, memory and register file with extended data and for alignment.

Quote:

Rationale: The C/C++ LP64 and LLP64 data models – expected to be the most commonly used on AArch64 –both define the frequently used int, short and char types to be 32 bits or less. By maintaining this semantic information in the instruction set, implementations can exploit this information to avoid expending energy or cycles to compute, forward and store the unused upper 32 bits of such data types. Implementations are free to exploit this freedom in whatever way they choose to save energy.

- ARMv8 Instruction Set Overview

Partial register stalls and store forwarding stalls are a concern. My hope is that data registers would forward 32 bits while address registers would forward 64 bits so only partial register stalls would be a concern in the less common case of using 64 bits in data registers. Maybe the 68k address data register divide could prove useful to give fast and energy efficient 32 bit data handling with the fastest possible 64 bit address handling. I believe this would provide better performance for existing 68k code as the registers are 32 bit. I could use more technical advice though.

Quote:

OK, thanks for explaining it and for the links (I'll study them once I've some time).

My impression is that for in-order cores a large register file isn't that much expensive.

Or in other words, a large register file can be more expensive for OoO. Perhaps this was a concern for the Larrabee project with the large SIMD register files. Bonnell was a slimmed down core design which was in order.

Quote:

Each cycle two instructions are dispatched in-order. The scheduler can take a pair of instructions from a single thread or across threads. Bonnell in-order back-end resembles a traditional early 90s design featuring a dual ALU, a dual FPU and a dual AGU. Similarly to the front-end, in order to accommodate simultaneous multithreading, the Bonnell design team chose to duplicate both the floating-point and integer register files. The duplication of the register files allows Bonnell to perform context switching on each stage by maintaining duplicate states for each thread. The decision to duplicate this logic directly results in more transistors and larger area of the silicon. Overall implementing SMT still required less power and less die area than the other heavyweight alternatives (i.e., out-of-order and larger superscaler). Nonetheless the total register file area accounts for 50% of the entire core's die area which was single-handedly an important contributor to the overall chip power consumption.

https://en.wikichip.org/wiki/intel/microarchitectures/bonnell

I used to wonder if the sentence in bold above was an error but not after finding in ColdFire literature, "This proposal was driven by the fact that the register file is the largest single structure in the core and a sizable reduction in this function could have an interesting impact on the overall core size."

The choice of CISC for Larrabee was good because it allows better single threaded in order performance but the choice of x86-64 was poor because it produces too much heat in slimmed down cores as could be seen in the Knight's Landing down clocking with AVX-512 use. What if the Larrabee cores had used 42% less power (and used 28% fewer transistors for an area cost savings or more cores) like the in order 68060 advantage compared to the Pentium P54C?

Status: Offline

cdimauro

Re: Amiga SIMD unit
Posted on 25-Oct-2020 7:04:44

[ #190 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:

matthey wrote:
Quote:
cdimauro wrote:
[...]A 16-bit opcode might give the impression to be endless, but once you start using it then it shrinks quickly.

There are a limited number of 68k 16 bit encodings for sure but there is a lot of room for 32 bit encodings. The 68000 encodings were simple but wasted encoding space with some less frequently used 16 bit instruction encodings some of which are better moved to 32 bit encodings or eliminated. MOVE EA,EA does take a huge amount of encoding space but this is the most used instruction like x86 and x86-64 yet is more powerful.

OK, then we're aligned. So, you better reused the 32-bit opcode space.

Can you share something about the numbers? How much space you dedicated to 16-bit opcodes, and how much for the 32-bit ones?
Quote:
It is not as bad for a 64 bit core to use the more frequent 32 bit size as 64 bit wastes processing on extending the upper bits, can have larger code and wastes more caches, memory and register file with extended data and for alignment.
Quote:
Rationale: The C/C++ LP64 and LLP64 data models – expected to be the most commonly used on AArch64 –both define the frequently used int, short and char types to be 32 bits or less. By maintaining this semantic information in the instruction set, implementations can exploit this information to avoid expending energy or cycles to compute, forward and store the unused upper 32 bits of such data types. Implementations are free to exploit this freedom in whatever way they choose to save energy.

- ARMv8 Instruction Set Overview

Yes, supporting 32-bit for a 64-bit is mandatory if you care about performances. But here ARM was talking about the instructions for supporting 32-bit data types.
Quote:
Partial register stalls and store forwarding stalls are a concern. My hope is that data registers would forward 32 bits while address registers would forward 64 bits so only partial register stalls would be a concern in the less common case of using 64 bits in data registers. Maybe the 68k address data register divide could prove useful to give fast and energy efficient 32 bit data handling with the fastest possible 64 bit address handling. I believe this would provide better performance for existing 68k code as the registers are 32 bit. I could use more technical advice though.

The address registers aren't a concern on both 68K and x86-64 (just considering a whole 64-bit register only dedicated for holding/manipulating a pointer), because you never make a partial use of them.

Data registers are a completely different beast, and there I think that it'll be difficult to properly schedule the instructions in order to reduce the stalls due to partially sharing the data. Any optimization manual clearly states to avoid partial registers usage, because they have a sensible impact on performances.

But there is also another good thing for sharing two 32-bit registers on a single 64-bit register: you can push/pop both in a single instruction, and this can help function calls, as we discussed previously.
Quote:
Quote:
My impression is that for in-order cores a large register file isn't that much expensive.

Or in other words, a large register file can be more expensive for OoO. Perhaps this was a concern for the Larrabee project with the large SIMD register files.

AFAIR Larrabee was an in-order design since the beginning. And it made sense for the application area.
Quote:
Bonnell was a slimmed down core design which was in order.
Quote:
Each cycle two instructions are dispatched in-order. The scheduler can take a pair of instructions from a single thread or across threads. Bonnell in-order back-end resembles a traditional early 90s design featuring a dual ALU, a dual FPU and a dual AGU. Similarly to the front-end, in order to accommodate simultaneous multithreading, the Bonnell design team chose to duplicate both the floating-point and integer register files. The duplication of the register files allows Bonnell to perform context switching on each stage by maintaining duplicate states for each thread. The decision to duplicate this logic directly results in more transistors and larger area of the silicon. Overall implementing SMT still required less power and less die area than the other heavyweight alternatives (i.e., out-of-order and larger superscaler). Nonetheless the total register file area accounts for 50% of the entire core's die area which was single-handedly an important contributor to the overall chip power consumption.

https://en.wikichip.org/wiki/intel/microarchitectures/bonnell

I used to wonder if the sentence in bold above was an error but not after finding in ColdFire literature, "This proposal was driven by the fact that the register file is the largest single structure in the core and a sizable reduction in this function could have an interesting impact on the overall core size."

Thanks for the link, but I think that 50% of core size is too much. The good thing of the link is that it reports screenshots of the die and an explanation of the logical units: https://en.wikichip.org/wiki/intel/microarchitectures/bonnell#Die
As it can be seen, on the FPC (FPU, SIMD) it can clearly seen that the register file takes a huge part, but not half of the area. It's big, for sure, but maybe around 30-35 of the area.
A similar thing can be said for the IEC ("Integer" / GP) unit.
This, at least, checking the biggest "regular structures" which can be seen, which should be the register files.

Another important thing that I want to highlight is that FPC and IEC are the smallest parts of the die. The core is accounted for 28% of the die. But FPC and IEC together can represent around 40% (if not even less) of that, leaving the remaining 60% to MEC+FEC.
So, in terms of die area, FPC+IEC can count for around 11% of the die are. Definitely not the biggest part, and the register file area could be around 4% of the die area (considering 35% of space for the register files).

So, really small numbers, in die which is measuring 3.1 mm x 7.8 mm on an old 45nm process. You can imagine how small it can be on a more modern process.

And pay attention that those numbers are considering an SMP implementation: two times the number of registers. It means that on a non-SMP solution the register file is taking around 2% of the die area...

Last but not really least, the FPC implements a SIMD unit, which has 16 x 128-bit vector registers and dedicated ALUs. So, a lot of stuff here.
Quote:
The choice of CISC for Larrabee was good because it allows better single threaded in order performance but the choice of x86-64 was poor because it produces too much heat in slimmed down cores as could be seen in the Knight's Landing down clocking with AVX-512 use. What if the Larrabee cores had used 42% less power (and used 28% fewer transistors for an area cost savings or more cores) like the in order 68060 advantage compared to the Pentium P54C?

As it can be clearly seen from the above numbers, using 64-bit registers doesn't seem to be the biggest problem. But if you think that the x86-64 ISA for itself (so, NOT counting the 16 x 64-bit registers) could have been a problem, then I partially agree (see below).

However the real numbers (and problems) for Larrabee (and Xeon Phi, in general) are coming from the big vector unit, which is basically dominating the die area. We have 4 times 32 x 512-bit registers (plus 8 x 32-bit mask registers) because it's an SMT4 design. It's very easy, taking the previous numbers,to see that it's the biggest piece of the cake...

So, it's not really x86-64 that made it worse: any other ISA could have similar problems with this HUGE register file.
Yes, removing the x86-64 "tax" could have helped, but not dramatically IMO.

Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 26-Oct-2020 3:36:17

[ #191 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5287
From: Australia

@cdimauro

Quote:

OK, but this doesn't and cannot change what I've stated: you cannot take what a compiler introduces on its own as a reference of the reality.

What count is Jaguar's ISA, and this fully supports AVX: 16 256-bit registers in 64-bit mode, and all instructions implemented.

What count is the resulting performance. The real reality, modern 3D games are sensitive to latency.

The major reason for RDNA's advantage over Polaris/Vega GCN is instruction completion latency.

Quote:

I don't, in fact, but this is ANOTHER question. As a coder I also have to take into account the microarchitecture if I want to squeeze the most from my code running on that particular chip, but this isn't something new neither strictly related to Jaguar.

As I've said before, AMD's Ryzen 1 & 2 had EXACTLY the same implementation of Jaguar, and only with Ryzen 3 (AKA Zen2) there's a full 256-bit implementation for AVX/-2.

But nothing new as I said: it happened also with Intel's Pentium III when the SSE was introduced, as well as the Pentium IV (AFAIR).

For multimedia floating point operations,
1. SSE's packed FP32 SIMD is better than stack-based X87.

2. Pentium III's X87 is not fully pipelined with only FMUL is pipelined, hence SSE should be used. SSE can support scalar FP32 workloads.

3. Pentium IV has SSE2 which has FP64 which can replace X87 in most situations.
SSE2 can support scalar FP64 workloads.

AMD K8 has 128-bit FADD and 64-bit FMUL units which contribute to K8's superior IPC when compared to Pentium IV.

Core 2 has 128-bit FADD and 128-bit FMUL units with quad instruction issues per cycle decoders.

For desktop gaming PC usage, I avoided Zen 1.x since I already have Core i7-4790K (Devil's Canyon, 256-bit AVX-2 hardware capable). 256-bit AVX-2 hardware was largely useless during XBO/PS4 game ports era.

Sold my Core i7-4770K and i7-4790K funded my Ryzen 9 3900X purchase.
Sold GTX 980 Ti OC and R9-290X OC funded my ASUS ROG Strix X570 purchase
Sold GTX 1080 Ti funded my ASUS Strix 2080 EVO OC purchase.
My cash outlay is nearly zero.

I sold my Core i7-7820X and ASUS ROG Strix X299 for Intel RocketLake preparation.

Quote:

OK, but I don't understand all this long list of numbers, especially with coming graphic cards which we still don't know the full specs.

As I said, it's better to see at real-world number by benchmarking all cards with games & applications. Numbers in this context don't give anything useful IMO.

Sources such as Igors Labs and Videocardz has better credibility when compared to wccftech,

2.577 Ghz NAVI 21 is from Igors Lab's AIB partner.

Quote:

The problem is with the drivers, unfortunately.

Not a major problem with classic PCI to PCIe ribbon cable adapter and existing classic PCI add-on cards stockpile.

Last edited by Hammer on 26-Oct-2020 at 03:49 AM.
Last edited by Hammer on 26-Oct-2020 at 03:39 AM.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 26-Oct-2020 4:20:59

[ #192 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5287
From: Australia

@cdimauro

Quote:

From a purely technical/ISA perspective I don't agree, looking at what they are doing.

I can only agree if we talk about the Motorola "philosophy": filling holes in the opcode table according to the "needed" additional instructions and/or extensions. Here I think that the Apollo team is a very good Motorola heir...

As long Apollo delivers faster 68K hardware for my A1200 without 68060 Rev 6+accel card crazy pricing, then I'm ok with it.

I don't agree with Apollo's ISA direction, but pricing and certain 68K performance levels take a higher priority.

As for AMMX, to quote Gabe Newell on CELL's SPE, "a waste of time".

I use to own an A3000 with 68030 at 25Mhz, hence I don't see MiSter's 68020 at 40Mhz appealing.

Last edited by Hammer on 26-Oct-2020 at 04:23 AM.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

Status: Offline

cdimauro

Re: Amiga SIMD unit
Posted on 26-Oct-2020 5:48:36

[ #193 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Hammer Quote:

Hammer wrote:
@cdimauro Quote:

OK, but this doesn't and cannot change what I've stated: you cannot take what a compiler introduces on its own as a reference of the reality.

What count is Jaguar's ISA, and this fully supports AVX: 16 256-bit registers in 64-bit mode, and all instructions implemented.

What count is the resulting performance. The real reality, modern 3D games are sensitive to latency.

The major reason for RDNA's advantage over Polaris/Vega GCN is instruction completion latency.

I think that I've already said that I've nothing to say about that.

The only point that I wanted to stress is that there's no real AVX-128 mode, which was an artificial invention (optimization path) of GCC for AVX. As well as there's no x86-32 mode, which is an artificial invention (ABI) from Intel for x86-64.
Quote:
Quote:
I don't, in fact, but this is ANOTHER question. As a coder I also have to take into account the microarchitecture if I want to squeeze the most from my code running on that particular chip, but this isn't something new neither strictly related to Jaguar.

As I've said before, AMD's Ryzen 1 & 2 had EXACTLY the same implementation of Jaguar, and only with Ryzen 3 (AKA Zen2) there's a full 256-bit implementation for AVX/-2.

But nothing new as I said: it happened also with Intel's Pentium III when the SSE was introduced, as well as the Pentium IV (AFAIR).

For multimedia floating point operations,
1. SSE's packed FP32 SIMD is better than stack-based X87.

2. Pentium III's X87 is not fully pipelined with only FMUL is pipelined, hence SSE should be used. SSE can support scalar FP32 workloads.

3. Pentium IV has SSE2 which has FP64 which can replace X87 in most situations.
SSE2 can support scalar FP64 workloads.

AMD K8 has 128-bit FADD and 64-bit FMUL units which contribute to K8's superior IPC when compared to Pentium IV.

Core 2 has 128-bit FADD and 128-bit FMUL units with quad instruction issues per cycle decoders.

As I said before, nothing to say about taking into account the microarchitecture when deciding what to do (and buy) when performances count (especially on specific market segments / customer scenarios).
Quote:
For desktop gaming PC usage, I avoided Zen 1.x since I already have Core i7-4790K (Devil's Canyon, 256-bit AVX-2 hardware capable). 256-bit AVX-2 hardware was largely useless during XBO/PS4 game ports era.

Sold my Core i7-4770K and i7-4790K funded my Ryzen 9 3900X purchase.
Sold GTX 980 Ti OC and R9-290X OC funded my ASUS ROG Strix X570 purchase
Sold GTX 1080 Ti funded my ASUS Strix 2080 EVO OC purchase.
My cash outlay is nearly zero.

I sold my Core i7-7820X and ASUS ROG Strix X299 for Intel RocketLake preparation.

OK, so you're primarily interested on good gaming performances. That's why you're changing so often your system.

I'm not a gamer, and I don't need to change a system often. I'm more interested to top single-thread performances, but I also don't like to change often a system.
The only exception was the Core i7-4790K, which I've changed after a couple of yeas to a Core i7-6700K only because I had an offer from Intel's internal shop (for employees) which I cannot refuse...
Quote:
Quote:
OK, but I don't understand all this long list of numbers, especially with coming graphic cards which we still don't know the full specs.

As I said, it's better to see at real-world number by benchmarking all cards with games & applications. Numbers in this context don't give anything useful IMO.

Sources such as Igors Labs and Videocardz has better credibility when compared to wccftech,

2.577 Ghz NAVI 21 is from Igors Lab's AIB partner.

Let's see how efficient it will be.

Anyway, nVidia is using a crappy 8nm process from Samsung, so it has also huge margins of improvement once it'll switch to a much better process, like TSMC's 7nm+ or 5nm (but not soon: the orders are already fulfilled).
Quote:
Quote:
The problem is with the drivers, unfortunately.

Not a major problem with classic PCI to PCIe ribbon cable adapter and existing classic PCI add-on cards stockpile.

Yes, but it's a patch. Something which a passionate can do, just for fun.

@Hammer Quote:

Hammer wrote:
@cdimauro Quote:

From a purely technical/ISA perspective I don't agree, looking at what they are doing.

I can only agree if we talk about the Motorola "philosophy": filling holes in the opcode table according to the "needed" additional instructions and/or extensions. Here I think that the Apollo team is a very good Motorola heir...

As long Apollo delivers faster 68K hardware for my A1200 without 68060 Rev 6+accel card crazy pricing, then I'm ok with it.

I don't agree with Apollo's ISA direction, but pricing and certain 68K performance levels take a higher priority.

As for AMMX, to quote Gabe Newell on CELL's SPE, "a waste of time".

I use to own an A3000 with 68030 at 25Mhz, hence I don't see MiSter's 68020 at 40Mhz appealing.

I understand your needs as a customer, but if you buy a product then you're also promoting it, approving the decisions of the vendor which will affect the market and future products.

Nothing different from buying AmigaOS4 and supporting Hyperion's decisions, which have split the post-Commodore market, generated wars, and different platforms which is still hurting this nano-niche.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 27-Oct-2020 2:26:50

[ #194 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2014
From: Kansas

Quote:

cdimauro wrote:
OK, then we're aligned. So, you better reused the 32-bit opcode space.

Can you share something about the numbers? How much space you dedicated to 16-bit opcodes, and how much for the 32-bit ones?

Not yet.

There is more than enough encoding space to add 32 bit encodings for MMX instructions using the data registers. The x86 implementation was bad and I don't like the AMMX bank switching, especially the added address registers. It is lightweight, adds some nice functionality and there is existing code using it. I would want to have a plan for fp SIMD or vector use in the FPU registers though.

Quote:

Yes, supporting 32-bit for a 64-bit is mandatory if you care about performances. But here ARM was talking about the instructions for supporting 32-bit data types.

AArch64 only supports byte and 16 bit datatypes with load/store instructions. It has instructions for 32 and 64 bit operations. A 32 bit operation is nearly as cheap as an 8 or 16 bit operations but some 64 bit operations are more expensive than 32 bit operations like division, multiplication and shift. The 32 bit operations save energy in the ALU but a little energy is also wasted extending the 32 bits of 64 bit registers to 64 bit (extending is a cheap operation though). AArch64 reads the sources of 32 bit operations as 32 bits and zero extends a 32 bit result to 64 bits.

Quote:

The address registers aren't a concern on both 68K and x86-64 (just considering a whole 64-bit register only dedicated for holding/manipulating a pointer), because you never make a partial use of them.

Data registers are a completely different beast, and there I think that it'll be difficult to properly schedule the instructions in order to reduce the stalls due to partially sharing the data. Any optimization manual clearly states to avoid partial registers usage, because they have a sensible impact on performances.

But there is also another good thing for sharing two 32-bit registers on a single 64-bit register: you can push/pop both in a single instruction, and this can help function calls, as we discussed previously.

Partial register usage
32 bit read followed by 64 bit read - no stall
32 bit read followed by 64 bit write - no stall
32 bit write followed by 64 bit read - stall
32 bit write followed by 64 bit write - no stall
64 bit read followed by 32 bit read - no stall
64 bit read followed by 32 bit write - no stall
64 bit write followed by 32 bit read - no stall
64 bit write followed by 32 bit write - no stall

A 32 bit write requires an operation to recombine with the unused half of the register. Is this your understanding?

There are at least 4 ways 32 bit writes to 64 bit registers can be handled.

A) 32 bit operations only affect 32 bits of destination
68k data registers if consistent with 16 bit and 8 bit operations

B) sign extend 32 bit register writes to 64 bits
68k address registers

C) zero extend 32 bit register writes to 64 bits
x86-64, AArch64

D) sign extend signed datatypes and zero extend unsigned (or unknown) datatypes to 64 bits
? (some architectures keep track of sign/unsigned datatypes and some operations like MULS specify them)

It is nice to be consistent and store data in the upper half of the registers which can reduce the number of instructions needed, reduce the number of memory operations as you mention and more efficiently use the register file area. It is also nice to avoid partial register stalls. At least with a 68k SWAP instruction for 64 bits, the whole 64 bit register is written so I will leave it at that for now. The topic needs more technical input and/or test data.

Quote:

AFAIR Larrabee was an in-order design since the beginning. And it made sense for the application area.

Because 4-8 in order x86-64 cores with AVX-512 has more parallel performance than 1 OoO x86-64 core with AVX-512?

Quote:

Thanks for the link, but I think that 50% of core size is too much. The good thing of the link is that it reports screenshots of the die and an explanation of the logical units: https://en.wikichip.org/wiki/intel/microarchitectures/bonnell#Die
As it can be seen, on the FPC (FPU, SIMD) it can clearly seen that the register file takes a huge part, but not half of the area. It's big, for sure, but maybe around 30-35 of the area.
A similar thing can be said for the IEC ("Integer" / GP) unit.
This, at least, checking the biggest "regular structures" which can be seen, which should be the register files.

Another important thing that I want to highlight is that FPC and IEC are the smallest parts of the die. The core is accounted for 28% of the die. But FPC and IEC together can represent around 40% (if not even less) of that, leaving the remaining 60% to MEC+FEC.
So, in terms of die area, FPC+IEC can count for around 11% of the die are. Definitely not the biggest part, and the register file area could be around 4% of the die area (considering 35% of space for the register files).

So, really small numbers, in die which is measuring 3.1 mm x 7.8 mm on an old 45nm process. You can imagine how small it can be on a more modern process.

And pay attention that those numbers are considering an SMP implementation: two times the number of registers. It means that on a non-SMP solution the register file is taking around 2% of the die area...

Last but not really least, the FPC implements a SIMD unit, which has 16 x 128-bit vector registers and dedicated ALUs. So, a lot of stuff here.

The link also mentions the following.

Quote:

Bonnell supports Intel's Hyper-Threading, their marketing term for their own implementation of simultaneous multithreading. The notion of implementing simultaneous multithreading on such a low-power architecture might seem unusual at first. In fact, it's one of only a handful of ultra-low power architectures to support such feature. Intel justified this design choice by demonstrating that performance enjoys an uplift of anywhere from 30% to 50% while worsening power consumption by up to 20% (with an average of 30% performance increase for 15% more power). The toll on the die area was a mere 8%.

This may seem at odds with the earlier statement but here we talk about die area where the previous statement was about core area. What is considered the core area is 28% of the die area for the later Silverthorne.

Core 13,828,574
Uncore 2,738,951
L2 & L2 tag 30,644,682
total 47,212,207 transistors

The core is 13,828,574 transistors which is 29% of the die transistor count.
13,828,574 * .5 = 6914287 transistors which is 15% of the die transistor count

If the earlier Diamondville core was significantly smaller than the Silverthorne core (half?), then the earlier claim that "the total register file area accounts for 50% of the entire core's die area" may be possible while "The toll on the die area was a mere 8%." The density of transistors varies some for caches vs core but it gives a rough approximation. If the Diamondville core was half the size, then your estimate that the register files of Silverthorne used 2% of the total die area is possible as well.

Quote:

As it can be clearly seen from the above numbers, using 64-bit registers doesn't seem to be the biggest problem. But if you think that the x86-64 ISA for itself (so, NOT counting the 16 x 64-bit registers) could have been a problem, then I partially agree (see below).

However the real numbers (and problems) for Larrabee (and Xeon Phi, in general) are coming from the big vector unit, which is basically dominating the die area. We have 4 times 32 x 512-bit registers (plus 8 x 32-bit mask registers) because it's an SMT4 design. It's very easy, taking the previous numbers,to see that it's the biggest piece of the cake...

So, it's not really x86-64 that made it worse: any other ISA could have similar problems with this HUGE register file.
Yes, removing the x86-64 "tax" could have helped, but not dramatically IMO.

Removing the x86-64 "tax" should have lowered the temps and saved some area. Fetch and decode were expensive in an x86 OoO FPGA as the following paper talks about.

https://www.stuffedcow.net/files/henry-thesis-phd.pdf

I'm guessing early Larrabee cores were barely able to use AVX-512 because of the heat produced at that process size (not just because of the x86-64 "tax"). With Moore's Law ending, I wonder what the limit for SIMD register width will be.

Status: Offline

cdimauro

Re: Amiga SIMD unit
Posted on 27-Oct-2020 6:17:13

[ #195 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:

matthey wrote:
Quote:
cdimauro wrote:
OK, then we're aligned. So, you better reused the 32-bit opcode space.

Can you share something about the numbers? How much space you dedicated to 16-bit opcodes, and how much for the 32-bit ones?

Not yet.

There is more than enough encoding space to add 32 bit encodings for MMX instructions using the data registers.

I strongly suggest you to avoid it.

68K is a CISC design and a separated registers bank is recommended, since SIMD instructions are directly accessing memory, and GP registers are used to calculate the address (more rare for the data registers, but they are used on indexed modes and when you haven't enough address registers).
Quote:
The x86 implementation was bad and I don't like the AMMX bank switching, especially the added address registers.

Indeed.
Quote:
It is lightweight, adds some nice functionality and there is existing code using it. I would want to have a plan for fp SIMD or vector use in the FPU registers though.

Better to reuse the FPU registers, then.
Quote:
Quote:
The address registers aren't a concern on both 68K and x86-64 (just considering a whole 64-bit register only dedicated for holding/manipulating a pointer), because you never make a partial use of them.

Data registers are a completely different beast, and there I think that it'll be difficult to properly schedule the instructions in order to reduce the stalls due to partially sharing the data. Any optimization manual clearly states to avoid partial registers usage, because they have a sensible impact on performances.

But there is also another good thing for sharing two 32-bit registers on a single 64-bit register: you can push/pop both in a single instruction, and this can help function calls, as we discussed previously.

Partial register usage
32 bit read followed by 64 bit read - no stall
32 bit read followed by 64 bit write - no stall
32 bit write followed by 64 bit read - stall
32 bit write followed by 64 bit write - no stall
64 bit read followed by 32 bit read - no stall
64 bit read followed by 32 bit write - no stall
64 bit write followed by 32 bit read - no stall
64 bit write followed by 32 bit write - no stall

A 32 bit write requires an operation to recombine with the unused half of the register. Is this your understanding?

Yes. But AFAIR there should be some other cases where partial registers usages can cause problems. I'll check once I've more time.
Quote:
There are at least 4 ways 32 bit writes to 64 bit registers can be handled.

A) 32 bit operations only affect 32 bits of destination
68k data registers if consistent with 16 bit and 8 bit operations

This happens on IA-32/x64-64 as well.
Quote:
B) sign extend 32 bit register writes to 64 bits
68k address registers

C) zero extend 32 bit register writes to 64 bits
x86-64, AArch64

D) sign extend signed datatypes and zero extend unsigned (or unknown) datatypes to 64 bits
? (some architectures keep track of sign/unsigned datatypes and some operations like MULS specify them)

It is nice to be consistent and store data in the upper half of the registers which can reduce the number of instructions needed, reduce the number of memory operations as you mention and more efficiently use the register file area. It is also nice to avoid partial register stalls. At least with a 68k SWAP instruction for 64 bits, the whole 64 bit register is written so I will leave it at that for now. The topic needs more technical input and/or test data.

SWAPs aren't required with an EA and ISA reworking. But it's beyond your project.
Quote:
Quote:
AFAIR Larrabee was an in-order design since the beginning. And it made sense for the application area.

Because 4-8 in order x86-64 cores with AVX-512 has more parallel performance than 1 OoO x86-64 core with AVX-512?

Because for those big workloads the latency is not a problem and an SMT-4 in-order design can much better (re) use the ALUs, at a reduced cost compared to an OoO design with similar number of ALUs.
Quote:
Quote:
Thanks for the link, but I think that 50% of core size is too much. The good thing of the link is that it reports screenshots of the die and an explanation of the logical units: https://en.wikichip.org/wiki/intel/microarchitectures/bonnell#Die
As it can be seen, on the FPC (FPU, SIMD) it can clearly seen that the register file takes a huge part, but not half of the area. It's big, for sure, but maybe around 30-35 of the area.
A similar thing can be said for the IEC ("Integer" / GP) unit.
This, at least, checking the biggest "regular structures" which can be seen, which should be the register files.

Another important thing that I want to highlight is that FPC and IEC are the smallest parts of the die. The core is accounted for 28% of the die. But FPC and IEC together can represent around 40% (if not even less) of that, leaving the remaining 60% to MEC+FEC.
So, in terms of die area, FPC+IEC can count for around 11% of the die are. Definitely not the biggest part, and the register file area could be around 4% of the die area (considering 35% of space for the register files).

So, really small numbers, in die which is measuring 3.1 mm x 7.8 mm on an old 45nm process. You can imagine how small it can be on a more modern process.

And pay attention that those numbers are considering an SMP implementation: two times the number of registers. It means that on a non-SMP solution the register file is taking around 2% of the die area...

Last but not really least, the FPC implements a SIMD unit, which has 16 x 128-bit vector registers and dedicated ALUs. So, a lot of stuff here.

The link also mentions the following.
Quote:
Bonnell supports Intel's Hyper-Threading, their marketing term for their own implementation of simultaneous multithreading. The notion of implementing simultaneous multithreading on such a low-power architecture might seem unusual at first. In fact, it's one of only a handful of ultra-low power architectures to support such feature. Intel justified this design choice by demonstrating that performance enjoys an uplift of anywhere from 30% to 50% while worsening power consumption by up to 20% (with an average of 30% performance increase for 15% more power). The toll on the die area was a mere 8%.

This may seem at odds with the earlier statement but here we talk about die area where the previous statement was about core area. What is considered the core area is 28% of the die area for the later Silverthorne.

Core 13,828,574
Uncore 2,738,951
L2 & L2 tag 30,644,682
total 47,212,207 transistors

The core is 13,828,574 transistors which is 29% of the die transistor count.
13,828,574 * .5 = 6914287 transistors which is 15% of the die transistor count

If the earlier Diamondville core was significantly smaller than the Silverthorne core (half?), then the earlier claim that "the total register file area accounts for 50% of the entire core's die area" may be possible while "The toll on the die area was a mere 8%." The density of transistors varies some for caches vs core but it gives a rough approximation. If the Diamondville core was half the size, then your estimate that the register files of Silverthorne used 2% of the total die area is possible as well.

Those numbers don't match, according to my analysis and calculations, and I'm considering Silverthorne as a reference, because that was reported in the link.
Quote:
Quote:
Yes, removing the x86-64 "tax" could have helped, but not dramatically IMO.

Removing the x86-64 "tax" should have lowered the temps and saved some area. Fetch and decode were expensive in an x86 OoO FPGA as the following paper talks about.

https://www.stuffedcow.net/files/henry-thesis-phd.pdf

Thanks. I know how to use my coming vacation days.
Quote:
I'm guessing early Larrabee cores were barely able to use AVX-512 because of the heat produced at that process size (not just because of the x86-64 "tax").

I don't think so. The Larrabee project wasn't good for a GPU, and not because of the AVX-512-like ISA & micro-architecture. In fact, the project was reused for GPGPU, which much better results.
Quote:
With Moore's Law ending, I wonder what the limit for SIMD register width will be.

IMO it's not about Moore's Law. Having big registers causes causes big troubles.

Status: Offline

cdimauro

Re: Amiga SIMD unit
Posted on 27-Oct-2020 22:31:47

[ #196 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey: Quote:

matthey wrote:
This may seem at odds with the earlier statement but here we talk about die area where the previous statement was about core area. What is considered the core area is 28% of the die area for the later Silverthorne.

Core 13,828,574
Uncore 2,738,951
L2 & L2 tag 30,644,682
total 47,212,207 transistors

The core is 13,828,574 transistors which is 29% of the die transistor count.
13,828,574 * .5 = 6914287 transistors which is 15% of the die transistor count

If the earlier Diamondville core was significantly smaller than the Silverthorne core (half?), then the earlier claim that "the total register file area accounts for 50% of the entire core's die area" may be possible while "The toll on the die area was a mere 8%." The density of transistors varies some for caches vs core but it gives a rough approximation. If the Diamondville core was half the size, then your estimate that the register files of Silverthorne used 2% of the total die area is possible as well.

I've downloaded the PNG of the die, and made some rough measurements with Paint.NET.

Here follow some results:
Core: 709 x 407 + 91 x 226 = 288563 + 20566 = 309129 pixels
FPC: 462 x 105 + 305 x 74 = 48510 + 22570 = 71080 pixels
IEC: 219 x 170 = 37230 pixels
FPC + IEC = 71080 + 37230 = 108310 = pixels 35% of the core

So, it's even less than what I estimated before (40% of the core).

I also made a measurement of the big central part of the FPC, which should contain the register file:
FPC registers: 160 x 145 = 23200 = 33% of the FPC

For the IEC it's more tricky, because of the big text on top, and that the regular structures are small and spars over the area.

However, and even without the numbers for the IEC's register file, I think that it's quite evident that my previous estimates were good enough.

So, the register file is not that big, but on the contrary a small portion of the core, and a very small portion of die.

I wouldn't care about having a few more registers in the ISA, even for a low-end embedded core...

Status: Offline

Fl@sh

Re: Amiga SIMD unit
Posted on 27-Oct-2020 22:35:35

[ #197 ]

Regular Member

Joined: 6-Oct-2004
Posts: 253
From: Napoli - Italy

@cdimauro

If you don’t have registers you have to go with memory..
There’s no good reason to limit register count on modern cpus.

_________________
Pegasos II G4@1GHz 2GB Radeon 9250 256MB
AmigaOS4.1 fe - MorphOS - Debian 9 Jessie

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 28-Oct-2020 2:46:53

[ #198 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2014
From: Kansas

Quote:

cdimauro wrote:
I strongly suggest you to avoid it.

68K is a CISC design and a separated registers bank is recommended, since SIMD instructions are directly accessing memory, and GP registers are used to calculate the address (more rare for the data registers, but they are used on indexed modes and when you haven't enough address registers).

I don't see why there is a problem with memory. Streaming and non-temporal access support would be good but those can be provided for the integer units as well. Integer SIMD support is very cheap even on old and small cores.

Quote:

The area impact is truly insignificant: MAX-1 took about 0.2% of the PA-7100LC silicon area, and MAX-2 took less than 0.1% of the PA-8000 silicon area. Neither caused any cycle time impact on the on the processors nor lengthened the design schedules.

The Apollo Core allows AMMX instructions using integer (data) registers as well. It's pretty nice to be able to work on x_high:y_low pairs in registers.

; SIMD add in data registers
add2p:
paddl d1,d0 ; MMX paddd (paddd supported with alias)
rts

; non-SIMD add of pair in data registers
add2p:
ror.q #32,d0
ror.q #32,d1
add.l d1,d0 ; upper half of destination reg *not* zeroed
ror.q #32,d0 ; partial reg stall
ror.q #32,d1
add.l d1,d0 ; upper half of destination reg *not* zeroed
rts

; non-SIMD add of pair in data registers
add2p:
move.q d0,d2
lsr.q #32,d2
move.q d1,d3
lsr.q #32,d3
add.l d1,d0 ; upper half of destination reg zeroed
add.l d3,d2 ; upper half of destination reg zeroed
lsl.q #32,d2
or.q d2,d0
rts

Notice the 1st non-SIMD function does *not* clear the upper results of ADD.L and uses 2 less instructions and 2 less registers than the 2nd non-SIMD function but suffers from a partial register stall. The SIMD instruction is short and sweet. Using an SIMD unit would be most effective if the source registers were passed in SIMD registers but for just a simple X:Y pair a 256 bit SIMD unit is overkill doing a lot more work than necessary. It may even be nice to have simple integer SIMD support in the integer unit to more easily handle corner cases where a wide SIMD unit is not efficient. I expect SIMD would not even be used in this case leading to one of the other 2 non SIMD options above. How would you handle an X:Y pair add with a 256 bit SIMD unit?

Quote:

Better to reuse the FPU registers, then.

Without parallel conversion from integer to fp or fp to integer being common, there is not a big advantage using fp registers for integer values. There may even be an advantage to keeping all integer values in the same registers and all fp values in the same register.

Quote:

Yes. But AFAIR there should be some other cases where partial registers usages can cause problems. I'll check once I've more time.

Probably with the result forwarding/bypass where a 64 bit read after a 32 bit write which is not extended also causes a stall. Result forwarding is simplest for a 64 bit core when always forwarding 64 bit results. Note that the MMX SIMD operation results are always 64 bits allowing more functionality without the problems of partial register stalls.

Quote:

SWAPs aren't required with an EA and ISA reworking. But it's beyond your project.

A rotate with immediate larger than 8 is a 32 bit encoding currently so SWAPs are just shorter 16 bit encodings. I could add instruction shortcuts for other common immediate shifts and rotates.

Quote:

Those numbers don't match, according to my analysis and calculations, and I'm considering Silverthorne as a reference, because that was reported in the link.

The information for Diamondville is not given. The Diamondville core could be much smaller than the Silverthorne core. It is an inline core which could be closer in size to the early Pentium P5 cores which were only 3,100,000 transistors compared to the Silverthorne core of 13,828,574 transistors. Unlike the Pentium P5 which the Atom is based, Silverthorne is 64 bit and has 16-19 stages where the P5 is 32 bit and only 5 stages. The 68060 was 8 stages and only had 2,500,000 transistors so the deeper pipeline may not have added too many transistors. The pipeline is really too deep for what I like as it required stronger branch prediction using more transistors and the pre-decoder between the L1 ICache and L2 wasted part of the advantage CISC should have in code density to reduce instruction caches.

Quote:

Thanks. I know how to use my coming vacation days.

It sounds like you had not heard about the OoO x86 FPGA core.

Quote:

I don't think so. The Larrabee project wasn't good for a GPU, and not because of the AVX-512-like ISA & micro-architecture. In fact, the project was reused for GPGPU, which much better results.

Were the CPUs mostly used as GPGPUs in data centers (until Nvidia gfx cards took that market)?

Status: Offline

cdimauro

Re: Amiga SIMD unit
Posted on 28-Oct-2020 23:06:14

[ #199 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Fl@sh Quote:

Fl@sh wrote:
@cdimauro

If you don’t have registers you have to go with memory..
There’s no good reason to limit register count on modern cpus.

I think that there should be no doubt now that it's convenient and with little impact, at least looking at the numbers.

@matthey Quote:

matthey wrote:
Quote:
cdimauro wrote:
I strongly suggest you to avoid it.

68K is a CISC design and a separated registers bank is recommended, since SIMD instructions are directly accessing memory, and GP registers are used to calculate the address (more rare for the data registers, but they are used on indexed modes and when you haven't enough address registers).

I don't see why there is a problem with memory. Streaming and non-temporal access support would be good but those can be provided for the integer units as well. Integer SIMD support is very cheap even on old and small cores.

The problem is due to the usage of GP registers on SIMD instructions. For example:
PADD.W (A0, D0.L*2),D1,D2

Sharing the data registers with the SIMD unit not only can cause dependency issues, but it complicates the implementation, since you need to add more register ports to allow better parallel access to them.

If we have two different domains (data, SIMD), the cost is reduced, because you can finely tune both independently (since the registers have different sizes).

With a single domain you cannot have ay option: you have to consider the maximum register size for the implementation.
Quote:
Quote:
The area impact is truly insignificant: MAX-1 took about 0.2% of the PA-7100LC silicon area, and MAX-2 took less than 0.1% of the PA-8000 silicon area. Neither caused any cycle time impact on the on the processors nor lengthened the design schedules.

The Apollo Core allows AMMX instructions using integer (data) registers as well. It's pretty nice to be able to work on x_high:y_low pairs in registers.

; SIMD add in data registers
add2p:
paddl d1,d0 ; MMX paddd (paddd supported with alias)
rts

; non-SIMD add of pair in data registers
add2p:
ror.q #32,d0
ror.q #32,d1
add.l d1,d0 ; upper half of destination reg *not* zeroed
ror.q #32,d0 ; partial reg stall
ror.q #32,d1
add.l d1,d0 ; upper half of destination reg *not* zeroed
rts

; non-SIMD add of pair in data registers
add2p:
move.q d0,d2
lsr.q #32,d2
move.q d1,d3
lsr.q #32,d3
add.l d1,d0 ; upper half of destination reg zeroed
add.l d3,d2 ; upper half of destination reg zeroed
lsl.q #32,d2
or.q d2,d0
rts

Notice the 1st non-SIMD function does *not* clear the upper results of ADD.L and uses 2 less instructions and 2 less registers than the 2nd non-SIMD function but suffers from a partial register stall. The SIMD instruction is short and sweet. Using an SIMD unit would be most effective if the source registers were passed in SIMD registers but for just a simple X:Y pair a 256 bit SIMD unit is overkill doing a lot more work than necessary. It may even be nice to have simple integer SIMD support in the integer unit to more easily handle corner cases where a wide SIMD unit is not efficient. I expect SIMD would not even be used in this case leading to one of the other 2 non SIMD options above.

Sorry, but I don't see what's the problem here. Packed data are handled by SIMD units, whether we use GP or separated registers.
Quote:
How would you handle an X:Y pair add with a 256 bit SIMD unit?

It depends on the SIMD ISA. Vector-length-agnostic ISAs resolve it naturally by selecting the data length to 2 integers. Fixed-length ISAs can support different register sizes (Intel), so it should fit here. The ones that have just one register size which doesn't match the length of the two integers have to use proper masking of the data.
Quote:
Quote:
Better to reuse the FPU registers, then.

Without parallel conversion from integer to fp or fp to integer being common, there is not a big advantage using fp registers for integer values. There may even be an advantage to keeping all integer values in the same registers and all fp values in the same register.

Modern SIMDs have those conversions: just follow the path...
Quote:
Quote:
Yes. But AFAIR there should be some other cases where partial registers usages can cause problems. I'll check once I've more time.

Probably with the result forwarding/bypass where a 64 bit read after a 32 bit write which is not extended also causes a stall. Result forwarding is simplest for a 64 bit core when always forwarding 64 bit results.

Indeed.
Quote:
Note that the MMX SIMD operation results are always 64 bits allowing more functionality without the problems of partial register stalls.

Because those registers are independent from the GP ones. The only problem happens when you want to use the regular FPU.
Quote:
Quote:
SWAPs aren't required with an EA and ISA reworking. But it's beyond your project.

A rotate with immediate larger than 8 is a 32 bit encoding currently so SWAPs are just shorter 16 bit encodings. I could add instruction shortcuts for other common immediate shifts and rotates.

Maybe a SWAP.L can be added in this case.
Quote:
Quote:
Those numbers don't match, according to my analysis and calculations, and I'm considering Silverthorne as a reference, because that was reported in the link.

The information for Diamondville is not given. The Diamondville core could be much smaller than the Silverthorne core. It is an inline core which could be closer in size to the early Pentium P5 cores which were only 3,100,000 transistors compared to the Silverthorne core of 13,828,574 transistors. Unlike the Pentium P5 which the Atom is based, Silverthorne is 64 bit and has 16-19 stages where the P5 is 32 bit and only 5 stages. The 68060 was 8 stages and only had 2,500,000 transistors so the deeper pipeline may not have added too many transistors. The pipeline is really too deep for what I like as it required stronger branch prediction using more transistors and the pre-decoder between the L1 ICache and L2 wasted part of the advantage CISC should have in code density to reduce instruction caches.

Understood. So, do you want to address only the low-end embedded market?
Quote:
Quote:
Thanks. I know how to use my coming vacation days.

It sounds like you had not heard about the OoO x86 FPGA core.

Because I'm not an hardware guy, so I wasn't and in general I'm not interested on FPGA implementations, because I have not enough background. I prefer to focus on the ISA & micro-architecture levels.
Quote:
Quote:
I don't think so. The Larrabee project wasn't good for a GPU, and not because of the AVX-512-like ISA & micro-architecture. In fact, the project was reused for GPGPU, which much better results.

Were the CPUs mostly used as GPGPUs in data centers (until Nvidia gfx cards took that market)?

Yes. There were some supercomputers which used the Xeon Phis.

Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 29-Oct-2020 5:17:24

[ #200 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5287
From: Australia

@cdimauro
Quote:

The only point that I wanted to stress is that there's no real AVX-128 mode, which was an artificial invention (optimization path) of GCC for AVX. As well as there's no x86-32 mode, which is an artificial invention (ABI) from Intel for x86-64.

For 32-bit X86, Intel has IA-32.

X86-32 is vendor less reference to 32-bit X86. The alternative reference to the 32bit X86 code path is the old "i386".

Certain iterations of the IA-32 ISA are sometimes labeled i486, i586, and i686, referring to the instruction supersets offered by the 80486, the P5, and the P6 microarchitectures respectively.

AVX allows vectors of either 128 bits or 256 bits in length. AVX's 128-bit length is real and its optimization towards it is real e.g. Jaguar based game consoles.

From https://software.intel.com/content/www/us/en/develop/articles/avoiding-avx-sse-transition-penalties.html

Intel® AVX also includes 128-bit VEX encoded instructions equivalent to all legacy Intel® Streaming SIMD Extensions (Intel® SSE) 128-bit instructions.

Quote:

OK, so you're primarily interested on good gaming performances. That's why you're changing so often your system.

I'm not a gamer, and I don't need to change a system often. I'm more interested to top single-thread performances, but I also don't like to change often a system.
The only exception was the Core i7-4790K, which I've changed after a couple of yeas to a Core i7-6700K only because I had an offer from Intel's internal shop (for employees) which I cannot refuse...

I don't recall the Amiga desktop computing platform being an embedded platform.

I just eBay'ed my old items to fund newer items, hence near-zero cost outlays i.e. its opportunistic upgrade.

My income tax usually funds my PC hardware purchases which are applicable to employees working in the relevant IT industry.

Raytracing hardware has benefits for Blender3D besides games.

Quote:

Let's see how efficient it will be.

Anyway, nVidia is using a crappy 8nm process from Samsung, so it has also huge margins of improvement once it'll switch to a much better process, like TSMC's 7nm+ or 5nm (but not soon: the orders are already fulfilled).

According to AMD's gaming benchmark claims
https://www.techpowerup.com/273934/amd-announces-the-radeon-rx-6000-series-performance-that-restores-competitiveness

RX 6900 XT (300 watts) ~= RTX 3090
RX 6800 XT (300 watts) ~= RTX 3080 (320 watts)
RX 6800 (250 watts) beats RTX 2080 Ti

AMD's Zen team was involved with PC's "RDNA 2 Big NAVI" design, hence very fast 128 MB Infinity Cache was based on Zen's L3 cache.

In general, XBO's 32 MB eSRAM can support 1600x900 framebuffers without delta color compression (DCC).

128 MB Infinity Cache can support 4K framebuffer with DCC.

Quote:

I understand your needs as a customer, but if you buy a product then you're also promoting it, approving the decisions of the vendor which will affect the market and future products.

Nothing different from buying AmigaOS4 and supporting Hyperion's decisions, which have split the post-Commodore market, generated wars, and different platforms which is still hurting this nano-niche.

AmigaOS4 has "the name"(TM), LOL.

From my readings, A1222's price range is similar to Vampire V4.

Last edited by Hammer on 29-Oct-2020 at 05:27 AM.
Last edited by Hammer on 29-Oct-2020 at 05:23 AM.
Last edited by Hammer on 29-Oct-2020 at 05:20 AM.
Last edited by Hammer on 29-Oct-2020 at 05:19 AM.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

Status: Offline

[ home ][ about us ][ privacy ] [ forums ][ classifieds ] [ links ][ news archive ] [ link to us ][ user account ]

Amigaworld.net was originally founded by David Doyle