cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 2-Oct-2022 16:55:30
#221 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

@Hammer

Quote:

Hammer wrote:
@cdimauro

Quote:
It depends on how you create the ISA. For mine, I was inspired by the work that Stephen Morse did with Intel's 8086.

The 8086 was successful because it was almost 100% source-level compatible with 8085, so porting the existing applications from the latter was very simple and effective.

I know perfectly well that the software library is very, very important, and that's why my primary goal with NEx64T was to have 100% assembly-level compatibility with IA-32 and x86-64.

This means that usually a recompilation is enough to get a binary for my architecture, with the exception of applications that make assumptions about the instructions' opcode structure (assemblers, compilers, debuggers, JIT compilers).

However, since any IA-32/x86-64 instruction can usually be mapped to a corresponding one on NEx64T, adapting even those more difficult applications is quite simple.

Wintel desktop world doesn't tolerate Motorola 68K's instruction set bastardization i.e. "to be or not be" instruction set.

In fact, this isn't the case: NEx64T is a novel architecture, albeit fully source-level compatible with IA-32 and x86-64.
Quote:
Both Z80 and 8085 are not X86.

And the 8086 wasn't 8085-compatible, but it was almost fully source-level compatible. This allowed it to gain a big software library, which was a great advantage compared to other novel architectures.

I've made the same decision with NEx64T.
Quote:
Zilog Z80 is a software-compatible extension and enhancement of the Intel 8080 and it failed i.e. Z80 was defeated by Intel X86.

They were processors for completely different markets.
Quote:
Zilog Z8000 and Z80000 weren't binary compatible with the Z80 and they both failed.

And they also weren't source-level compatible, AFAIR (but I might not recall correctly).
Quote:
AMD's X86-64 is a software-compatible extension and enhancement of Intel's IA-32 with Microsoft being the kingmaker.

x86-64 is both binary and source incompatible with IA-32.

The thing is that processors implementing this ISA could ALSO execute software written for IA-32.
Quote:
Unlike Intel IA-64 Itanium, AMD's X86-64 doesn't compromise IA-32's legacy runtime performance which is important for the PC gaming market.

That was Intel's mistake: embedding an IA-32 hardware implementation whose performance sucked.

In fact, it was later completely removed and Itanium relied only on software emulation, which was much faster.

However, it was too late and the market was lost...
Quote:
IA-32 has two upgrade paths i.e. Intel's IA-64 and AMD's X86-64. Intel IA-64 is garbage at PC gaming.

It was garbage at many things.
Quote:
You can try to implement your NEx64T with Transmeta style Code Morph Software (CMS) translation that is similar to PiStorm/RPi 3A+/Emu68 method as part of the retro X86 scene i.e. NEx64T with CMS competes against AO486 FPGA port for MiSTer FPGA.

I'm not a hardware engineer, so that's out of my scope.

And it's not as important. Much more important is developing a backend for one of the existing compilers. This will give much better data with real-world applications and make it immediately and directly comparable with the competitors.
Quote:
ColdFire V1 to V4 is source code compatible with 68K and it's still unacceptable for Amiga users who have accumulated WHDLoad 68K games.

ColdFires weren't a viable alternative for Amigas, because they were both partially source-code incompatible (many instructions were missing) and slow, since the missing instructions had to be trapped and (slowly) emulated.
Quote:
I rather have AC68080 or PiStorm/Emu68 solutions over ColdFire V4.

My main reason for PiStorm/Emu68 is its relatively low price and respect for 68K Amiga legacy which includes accumulated WHDLoad 68K games.

Indeed.
Quote:
Binary re-compilation would break DRM /Anti-cheat certificate validation schemes, hence breaking PC games. This is why Valve worked with DRM /Anti-cheat middleware vendors to accept Valve's complied Proton/DXVK certificate for SteamOS. Open source Proton/DXVK is nearly meaningless for PC gaming without passing Microsoft's or Valve's valid certificate checks.

Valid certificate checks help enforce the PC's trusted computing initiative.

Emulation can overcome those issues, if required (e.g. when no native applications are available).

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 2-Oct-2022 17:02:32
#222 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

@Hammer

Quote:

Hammer wrote:
@cdimauro

Quote:

Bobcat used a better production process and it still sucked a lot, performance-wise, compared to the Atom.

Reminder, 45 nm Atom is an in-order X86 CPU while 40 nm Bobcat is an out-of-order X86 CPU.

https://www.anandtech.com/show/4023/the-brazos-performance-preview-amd-e350-benchmarked/3


AMD's Bobcat-based E-350 (40 nm) beats Intel Pineview Atom D510 (45 nm).

Both Bobcat and Pineview used similar process tech in the 40 to 45 nm range.


AMD's performance target for Bobcat was 90% of the performance of K8 at the same clock speed and our Photoshop CS4 benchmark shows that AMD can definitely say that it has met that goal. At 1.6GHz the E-350 manages to outperform a pair of K8s running at 1.5GHz in the Athlon X2 3250e
- Anandtech

For X264 benchmarks, Bobcat-based E-350 beats Intel Pineview Atom D510.

For 3dsmax 9, Bobcat-based E-350 beats Intel Pineview Atom D510.

For Single Threaded Cinebench R10, Bobcat-based E-350 beats Intel Pineview Atom D510.



Intel's Pineview Atom D510 sucked harder.

Your narrative is FALSE.

This is cherry-picking. Let's see a serious benchmark done with several applications, including the SPEC test suite:

The final ISA showdown: Is ARM, x86, or MIPS intrinsically more power efficient?







Bobcat had a 166 MHz lower clock frequency, but:
- it was out-of-order;
- it had more L1 data cache;
- it had 512 KB of L2 cache per core;
- it had a better production process (40 nm vs 45 nm).

And it sucked compared to Atom...

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 2-Oct-2022 17:05:27
#223 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

@Hammer

Quote:

Hammer wrote:
@bhabbott and @cdimauro

Quote:
So and in short: those experienced engineers completely failed...


Both of you don't know the context of the AGA development timeline.

READ https://www.landley.net/history/mirror/commodore/haynie.html

From Dave Haynie.

When he (Ali) got to Engineering, he hired a human bus error called Bill Sydnes to take over. Sydnes, a PC guy, didn't have the chops to run a computer, much less a computer design department. He was also an ex-IBMer, and spent much time trying to turn C= (a fairly slick, west-coast-style design operation), into the clunky mess that characterized the Dilbert Zones in most major east-coast-style companies.

He and Ali also decided that AA wasn't going to work, so they cancelled both AA projects (Amiga 3000+ and Amiga 1000+, either one better for the market than the A4000 was), and put it all on the backburner, intentionally blowing the schedule by six+ months.

They cancelled the A500, which was the only actively selling product ever cancelled in C= history, to my knowledge, and replaced it with the A600.

The A600 was originally the A300, George Robbins' idea of a cheaper-than-A500 Amiga; a new line, not a replacement. Sydnes added so much bloat, the A600 was $50 more than the A500, $100 over the goal price.

So, they were late (compared to competitors) AND their management sucked.

Which is what I've already said...
Quote:
With a fast enough CPU with 32-bit fast memory, the AGA dumb frame buffer role is sufficient for Doom.

Without a suitable compute power, AAA chipset wouldn't solve the Doom problem.

Neither would AGA at the time, given competitors that had evolved much further on the technology side...
Quote:
Commodore UK MD's David Pleasance advocated for accelerated CPU/Fast RAM equipped A1200 bundles, it was rejected by Commodore International's management.

It was still too late...

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 2-Oct-2022 18:50:55
#224 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

@Karlos

Quote:

Karlos wrote:
@cdimauro
Quote:

But no ternary instructions? Like fadd.s f0,f1,f2.

Looking at your Mandelbrot example it should gain a lot in terms of code density (those ternary instructions should use a 3 bytes encoding) and, especially, performance.


Just to be clear, the Mandelbrot code is not intended to be particularly performant, rather it's there as a test bed. There are versions which expressly use non-register addressing modes (e.g. stack, static etc) so that the relative difference between these can be assessed.

OK, got it.
Quote:
Ternary operations drift further from the 68K familiarity zone.

It depends on your goal.

I've also created my 64-bit 68k-inspired architecture, but that was just the base: I've also added other interesting features on top.

With NEx64T I did the same with IA-32/x86-64, but with much better and deeper improvements.

To me the initial ISA is just a starting point, and the limit is the encoding space which is left available.
Quote:
I'm not fundamentally opposed to this, but in this example, ternary only makes sense as a register only variant. Having to evaluate three full effective address cases is a lot slower.

It looks strange to me. If you have to execute two instructions then you have to evaluate 2 x 2 = 4 effective addresses.

That should be slower than evaluating 3 EAs in a single instruction.

But I'll stop here, because I don't recall your implementation.
Quote:
One example I'm considering, however, is a multiply-accumulate operation, e.g. fmac.d fp1, fp2, fp0 --> fp0 += fp1*fp2

Which makes sense, since it's very common and widely used.

But this requires the evaluation of 3 EAs, right? Then why not make it more general?
@Karlos

Quote:

Karlos wrote:
@Karlos

Also for 3 byte register to register encodings, you can easily define quaternary operations since the opcode needs only 1 byte. Those may be harder to find, but a general purpose multiply-add (as distinct from multiply accumulate) may serve as an example.

Actually this is one of the rare cases where a quaternary instruction is useful to have.

IA-32/x86-64 has some of them, but unfortunately not for the FMAC & FMA.
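A minimal C sketch of the semantics under discussion (my own illustration, using the standard C99 fma() from <math.h>; the fp0/fp1/fp2 names just mirror the example above):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double fp0 = 0.5, fp1 = 2.0, fp2 = 3.0;

    double two_ops = fp0 + fp1 * fp2;     /* separate multiply and add: two roundings */
    double fused   = fma(fp1, fp2, fp0);  /* fp1*fp2 + fp0 as one operation, one rounding */

    printf("%g %g\n", two_ops, fused);
    return 0;
}

(Link with -lm; on FMA-capable hardware a compiler can lower fma() to a single ternary instruction, which is exactly the code density and latency win being discussed.)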

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 2-Oct-2022 18:56:24
#225 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

@matthey

Quote:

matthey wrote:

Gunnar Quote:
I think it would have been better for MOTO if they would have used the
size="11" encoding of the immediate instructions for
OPP.L #16bit,(ea)

This encoding was a hole and the 020 used it to add all those CAS/CAS2
CMP2/CHK2 instructions.
The 060 dropped those instructions anyhow. I think no one needs them.
This encoding would have been ideal to solve this perfectly clean.
(See attached file: Decoder.ods)

What do you think?

LOL. How short and limited a vision he had: using size = 11 for... an operation with a 16-bit immediate. Unbelievable!

This is even worse than the encoding used for AMMX...
Quote:
I think it would be better if a 64 bit mode allowed size="11" for a 64 bit size. This encoding would have been ideal to solve this perfectly clean. This is much easier to decode than a prefix and there is no prefix growth for 64 bit sizes. I already gave you a more compatible 16 bit immediate compression without using the size="11" with an addressing mode encoding. What do you think?

Mit freundlichen Grüßen / Kind regards

This is a MUCH better solution. Of course.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 2-Oct-2022 19:26:03
#226 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

Updated Benchmark post.

EDIT: added other benchmarks.

Last edited by cdimauro on 02-Oct-2022 at 07:43 PM.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 2-Oct-2022 20:19:28
#227 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

@matthey

Quote:

matthey wrote:
I found the other old paper on register efficiency based on the number of GP registers.

High-Performance Extendable Instruction Set Computing
https://www.researchgate.net/publication/3888194_High-performance_extendable_instruction_set_computing

The data is based on a MIPS 3000 which has 27 GP registers. Data was recorded as the number of available GP registers in a compiler was reduced down to 8.

No. of Regs | Program size | Load/Store | Move
         27 |      100.00% |     27.90% | 22.58%
         24 |      100.35% |     28.21% | 22.31%
         22 |      100.51% |     28.34% | 22.27%
         20 |      100.56% |     28.38% | 22.24%
         18 |      100.97% |     28.85% | 21.93%
         16 |      101.62% |     30.22% | 20.47%
         14 |      103.49% |     31.84% | 19.28%
         12 |      104.45% |     34.31% | 16.39%
         10 |      109.41% |     41.02% | 10.96%
          8 |      114.76% |     44.45% |  8.46%

RISC architectures benefit from a few more than 16 GP registers. MIPS is pretty much worst case as it has few addressing modes and it needs several GP registers for address calculations. Most RISC architectures need a GP register free for loads as well (unless they have a reg to mem exchange instruction which is rare). The benefits of more than 16 GP registers is still small as the paper chose 16 GP registers based on the above chart for a compressed RISC encoding. The biggest concerns are elevated "Load/Store" mem accesses and increased instruction counts. Less than 16 GP registers has elevated mem accesses especially approaching 8 GP registers. From 16 to 27 GP registers is only 2.72% more memory accesses even for MIPS which is likely near worst case for RISC.

Interesting. It looks like 16 registers is the sweet spot between program size and loads/stores. Which is strange for a RISC: I was expecting more register pressure, i.e. more loads/stores, with a reduced set of registers.
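As a rough illustration of the register pressure those numbers measure (a sketch of mine, not from either paper): a loop that keeps more values live at once than the allocatable registers forces the compiler to spill some of them to the stack, producing exactly the extra load/store traffic in the table.

#include <stdint.h>

/* Eight running sums plus the pointer, the count and the index are all live at
 * once; with only 8 allocatable GP registers some of them have to be spilled
 * to the stack, adding loads/stores that disappear again with 16+ registers. */
int64_t many_live(const int64_t *a, long n)
{
    int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;

    for (long i = 0; i + 8 <= n; i += 8) {
        s0 += a[i];     s1 += a[i + 1];
        s2 += a[i + 2]; s3 += a[i + 3];
        s4 += a[i + 4]; s5 += a[i + 5];
        s6 += a[i + 6]; s7 += a[i + 7];
    }
    return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
}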
Quote:
Recall the "Performance Characterization of the 64-bit x86 Architecture from Compiler Optimizations’ Perspective" paper which also gave the increase in mem access when decreasing GP registers from 16 for x86-64 to 8 registers of x86.

https://link.springer.com/content/pdf/10.1007/11688839_14.pdf Quote:

In addition to performance, the normalized dynamic number of memory accesses (including stack accesses) for the CINT2000 and CFP2000 is shown in Fig. 8. On average, with the REG_8 configuration, the memory references are increased by 42% for the CINT2000 and by 78% for the CFP2000; with the REG_12 configuration, the memory references are increased by 14% for CINT2000 and by 29% for the CFP2000.


For whatever reason, the first paper shows an increase of 14.23% memory accesses while the 2nd paper shows an increase of 42% from 16 to 8 GP registers. Both papers show a large increase in the number of mem accesses from 16 to 8 GP registers but this only resulted in a 4.4% slowdown in the 2nd paper. The first paper data shows that from 16 to 27 GP registers is 16% of the mem access difference from 8 to 16 GP registers. 16% of the 4.4% slowdown would be a .71% slowdown that could be avoided with 27 instead of 16 GP registers.

The difference between the two papers is too big. Even taking into account the very different architectures (MIPS vs x86-64), it's difficult to find an explanation for it.
Quote:
RISC architectures often waste some of the 32 GP registers for a zero register, link register and other specialized registers because having all 32 GP registers doesn't make much difference in performance but the Apollo core needs 48 GP integer registers along with all the CISC techniques which reduce the need for GP registers like reg-mem accesses and powerful addressing modes. Maybe 8 more GP integer registers would have gained 1% performance on low memory bandwidth hardware but, no, it had to be 24 more GP registers that certainly wouldn't be used in low memory bandwidth embedded hardware. The extra registers are not orthogonal like the RISC 32 GP registers either. The evidence was given years ago but ignored.

In fact, it's total nonsense to have so many registers on a CISC architecture, and especially on a 68k one.
Quote:
The first paper above gives some code density comparisons but they are old EGCS compiles (EGCS was replaced by GCC for good reason). The 68020 was 6/24 architectures and PPC was 21/24 fairing worse than MIPS and SPARC.

I ran across a couple of other interesting papers while searching.

Comparative Architectures, CST Part II, 16 lectures, Lent Term 2005 (Ian Pratt)
https://dokumen.tips/documents/comparative-architectures-clcamacuk-8086-80286-80386-80486-pentium-pentium.html?page=1

Code Density Straw Poll (page 52)

gcc
arch  |    text |  data |    bss |   total
68k   |   36152 |  4256 |    360 |   40768
x86   |   29016 | 14861 |    468 |   44345
alpha |   46224 | 24160 |    472 |   70856
mips  |   57344 | 20480 |    880 |   78704
hp700 |   66061 | 15708 |    852 |   82621

gcc-cc1
arch  |    text |   data |   bss |   total
68k   |  932208 |  16992 | 57328 | 1006528
x86   |  995984 | 156554 | 73024 | 1225562
hp700 | 1393378 |  21188 | 72868 | 1487434
alpha | 1447552 | 272024 | 90432 | 1810008
mips  | 2207744 | 221184 | 76768 | 2505696

pgp
arch  |   text |  data |    bss |  total
68k   | 149800 |  8248 | 229504 | 387552
x86   | 163840 |  8192 | 227472 | 399504
hp700 | 188013 | 15320 | 228676 | 432009
mips  | 188416 | 40960 | 230144 | 459520
alpha | 253952 | 57344 | 222240 | 533536

The 68k has the best code density with this easy competition and a decent compiler. Alpha came out better than I expected while MIPS was worse.

The same paper on page 55 gives conditional branch frequency of about 16% for SPECint92 which is only behind load and ahead of ADD and CMP instructions (MIPS?).

This benchmark is very useful, because it not only shows the total size of each executable, but the data is also split into the three major sections: text (code), data, and bss (uninitialized data).

However, while the differences in the text/code section can easily be explained by knowing the architectures, the differences in data and/or bss are really hard to understand.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 2-Oct-2022 20:47:20
#228 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

@matthey

Quote:

matthey wrote:

AMD claimed the average instruction length from x86 to x86-64 only increased from 3.4 to 3.8 bytes for SPECint2000 (https://old.hotchips.org/wp-content/uploads/hc_archives/hc14/3_Tue/26_x86-64_ISA_HC_v7.pdf). It's not difficult to find x86-64 programs with average instruction lengths over 4 bytes though. Cdimauro's "integer" instructions for Photoshop show an increase from 3.2 to 4.3 bytes.

It also depends on the compilers used. Binaries generated by GCC are usually fatter than those generated by the Intel or Microsoft compilers.

However, AMD used GCC for its benchmarks.

I can give you some other data, obtained with one of the most recent Microsoft Visual Studio compilers, for 64-bit Excel:

Class   |   Count |     % | Avg sz | NEx64T | Diff (bytes) | Diff (%)
INTEGER | 5063995 | 99.03 |    3.9 |    3.2 |         -0.7 |   -18.0%
SSE     |   49419 |  0.97 |    4.9 |    4.7 |         -0.2 |    -4.4%
AVXFAKE |   49419 |  0.97 |    5.4 |    4.7 |         -0.7 |   -13.6%
AVX512F |   49419 |  0.97 |    7.1 |    4.7 |         -2.4 |   -34.1%
FPU     |     142 |  0.00 |    2.7 |    3.7 |          1.0 |   +37.5%
3DNOW   |      31 |  0.00 |    4.3 |    6.4 |          2.1 |   +50.0%
The average is 3.9 bytes per instruction, which is close to the numbers that AMD has shown. Unfortunately, I have no 32-bit Excel to compare against.
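For reference, this kind of average can be reproduced by counting opcode bytes in a disassembly. A rough C sketch of mine follows; it assumes GNU objdump's default x86 layout of "address:<TAB>hex bytes<TAB>mnemonic" on stdin (e.g. objdump -d some_binary | ./avg_insn_len), so it is only an approximation:

/* avg_insn_len.c -- rough estimate of the average instruction length from
 * "objdump -d" output read on stdin. Assumes the byte field lists individual
 * bytes (as for x86); continuation lines of a long instruction have no
 * mnemonic field and therefore add bytes but not instructions. */
#include <ctype.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[1024];
    long insns = 0, bytes = 0;

    while (fgets(line, sizeof line, stdin)) {
        char *tab1 = strchr(line, '\t');
        if (!tab1 || !memchr(line, ':', (size_t)(tab1 - line)))
            continue;                       /* not a disassembly line */

        char *tab2 = strchr(tab1 + 1, '\t');
        if (tab2)
            insns++;                        /* a mnemonic follows: new instruction */

        /* count the two-digit hex groups in the byte field */
        char *end = tab2 ? tab2 : line + strlen(line);
        for (char *p = tab1 + 1; p + 1 < end; ) {
            if (isxdigit((unsigned char)p[0]) && isxdigit((unsigned char)p[1]) &&
                (p + 2 >= end || p[2] == ' ' || p[2] == '\n')) {
                bytes++;
                p += 3;
            } else {
                p++;
            }
        }
    }

    if (insns)
        printf("%ld instructions, %ld bytes, %.2f bytes/instruction\n",
               insns, bytes, (double)bytes / insns);
    return 0;
}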
Quote:
With a 4 byte average, an ISA can have 32 GP registers.

Absolutely!
Quote:
If compatibility wasn't so important, it would have been better to start over with a better ISA than x86 and change to a 16 bit encoding base.

Exactly. Results could be much better, even with 32 GP registers.
Quote:
The 68k is in much better shape. A 64 bit mode allows to clean the encoding map up nicely without major decoding changes.

Not that much. IMO it's better to have a slightly changed encoding that makes the instructions easier to decode.

In the end a 64-bit 68k ISA will be 68k binary-incompatible anyway, so it's better to start from a better encoding...
Quote:
A prefix is following the x86-64 disaster and it costs 2 bytes instead of one for the 68k. If the 68k average instruction length is 3 bytes, an average instruction would increase to 5 bytes with a prefix. The common 2 byte instruction would increase to 4 bytes.

I don't expect such huge changes. It depends on how much 64-bit data is used. But using 64-bit pointers, i.e. address registers, should always require the prefix (except in addressing modes).
Quote:
It's easy to say that more than 16 GP registers is rarely used so the prefix contributes little to code size and instruction size increases but then if it is so rarely used then why have all these extra integer registers?

In fact, it's complete nonsense: they are too many.
Quote:
It's also easy to say that 64 bit operations requiring a 16 bit prefix are rarely used which they probably are now but do you want a gimp 64 bit ISA like x86-64 that needs a prefix for 64 bit operations?

It depends on which applications are to be run. Modern applications compiled for 64-bit ISAs use both 64-bit data and (especially) 64-bit addresses, which means that the prefix will be used often (albeit not 100% of the time).
Quote:
Do you want poor code density, longer instructions and more decoding overhead like x86-64 or a lean and mean 64 bit 68k ISA with one of the best possible 64 bit code densities?

Well, that's what was done: a design where code density wasn't improved; rather the exact contrary.

IF the above context is considered.

However, if those features are just appendixes because the processor is almost always used with a 32-bit OS and applications, then whatever encoding is used doesn't matter, since it's essentially a 68k with AMMX added.

Gunnar 
Re: The (Microprocessors) Code Density Hangout
Posted on 2-Oct-2022 20:56:56
#229 ]
Regular Member
Joined: 25-Sep-2022
Posts: 473
From: Unknown


Quote:

Interesting. It looks like that 16 registers is the sweet spot between program size and loads/stores.


No, for many algorithms 16 registers are not enough.
There is a good reason that CELL has 128 registers
and that IBM POWER has 64 FPU registers today.




Dear Cesare Di Mauro,

Could it be that you are trying to compensate for a lack of knowledge and quality with quantity in your postings?

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 2-Oct-2022 21:13:26
#230 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

@Gunnar

Quote:

Gunnar wrote:

Quote:

Interesting. It looks like that 16 registers is the sweet spot between program size and loads/stores.


No, for many algorithm 16 register are not enough.

According to the benchmark results, it looks like most algorithms don't need more than 16 registers.
Quote:
There is a good reason that CELL has 128 register

There's a good reason why CELL was a complete failure...
Quote:
and that IBM POWER have 64 FPU register today.

To accommodate the registers needed by the wider SIMD unit while not expanding the existing size to 256 bits.

Which was a ridiculous decision as well.
Quote:
Dear Cesare Di Mauro,

Could it be that try to compensate lack of knowledge and quality with quantity in your postings?

See above.

For the rest, don't describe yourself and pretend it's me: you've written several posts in the few days since you opened your account, and I'm still waiting for proof of all your claims.

You were also unable to provide the encoding of 3 MOVEA instructions of what's supposed to be YOUR processor. Which is quite strange.

[Comment removed, keep things civilized]

Last edited by sibbi on 04-Oct-2022 at 10:56 AM.

Gunnar 
Re: The (Microprocessors) Code Density Hangout
Posted on 2-Oct-2022 21:35:36
#231 ]
Regular Member
Joined: 25-Sep-2022
Posts: 473
From: Unknown

I see we forgot that Cesare is more clever than IBM and INTEL together...

matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 2-Oct-2022 21:37:35
#232 ]
Super Member
Joined: 14-Mar-2007
Posts: 1968
From: Kansas

cdimauro Quote:

It looks strange to me. If you have to execute two instructions then you've to evaluate 2 x 2 = 4 effective addresses.

Which should be slower than evaluate 3 EAs on a single instruction.


That's well on the road to a VAX recreation. The 68k allows 2 EAs but only for MOVE EA,EA which is common and gives a powerful instruction even if MOVE mem,mem takes 2 cycles. The main advantage is better code density and a register saved.

move.l (a0),(a1) ; 2 bytes
===
total: 2 bytes, 2 registers

vs

move.l (a0),d0 ; 2 bytes
move.l d0,(a1) ; 2 bytes
===
total: 4 bytes, 3 registers

Was this special case of allowing 2 EAs worth it? ColdFire kept MOVE mem,mem when they greatly simplified the 68k, even though they introduced addressing mode limitations by limiting instruction lengths. Of course they kept reg-mem operations as well, even read-modify-write ones, even with their RISC encoding and RISC execution pipelines. This leads me to believe that some of the choices are subjective.
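As a small source-level illustration (a sketch, not part of the original argument): this is the kind of copy that can compile down to the single MOVE mem,mem above on the 68k, while a load/store architecture needs a scratch register and two instructions.

#include <stdint.h>

/* *dst = *src can become a single "move.l (a0),(a1)" on the 68k;
 * a load/store RISC needs "load rX,(src)" followed by "store rX,(dst)". */
void copy32(int32_t *dst, const int32_t *src)
{
    *dst = *src;
}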

Mitch Alsup https://groups.google.com/g/comp.arch/c/wzZW4Jo5tbM?pli=1 Quote:

I went the other direction: the key data addressing mode in the MY 66000
ISA is :: (Rbase+Rindex*SC+Disp)

When Rbase == R0 then IP is used in lieu of any base register
When Rindex == R0 then there is no indexing (or scaling)
Disp comes in 3 flavors:: Disp16, Disp32, and Disp64

The assembler/linker is task with choosing the appropriate instruction form
from the following:

MEM Rd,(Rbase+Disp16)
MEM Rd,(Rbase+Rindex*SC)
MEM Rd,(Rbase+Rindex*SC+Disp32)
MEM Rd,(Rbase+Rindex*SC+Disp64)

Earlier RISC machines typically only had the first 2 variants. My experience
with x86-64 convinced me that adding the last 2 variants was of low cost
to the HW and of value to the SW.

In a low end machine, the displacement will be coming out of the decoder
and this adds nothing to the AGEN latency or data path width. The 2 gates
of delay (3-input adder) is accommodated by the 2 gates of delay associated
the scaling of the Rindex register (Rbase+Disp)+(Rindex*SC) without adding
any delay to AGEN.

Any high end machine these days will have 3-operand FMAC instructions. Those
few memory references that need 3 operands are easily serviced on those paths.

Having SW create immediates and displacements by executing instructions is
simply BAD FORM*. Immediates and displacements should never pass through the
data cache nor consume registers from the file(s), nor should they be found
in memory that may be subject to malicious intent.

(*) or lazy architecting--of which there is way too much.

The same issues were involved in adding 32-bit and 64-bit immediates to the
calculation parts of the ISA.

DIV R7,12345678901234,R19
is handled as succinctly as:
DIV R7,R19,12345678901234

Almost like somebody actually tried to encode it that way.


Mitch added 68k-like addressing modes and retained large immediates and displacements in the code, like the 68k, for his RISC ISA, while ColdFire castrated the powerful addressing modes and broke up immediates and displacements, preferring an increase of short instructions, but retained reg-mem operations. I don't think it is a case of one or the other being fine but not both in the case of the 68k. These 68k features have more complexity but also more performance potential, due to more powerful instructions and better code density.

Adding many EAs to instructions makes them difficult to execute in a single cycle, which is bad for pipelining on real hardware. Limiting EAs to one per instruction like the 68k, except for the special MOVE mem,mem case, keeps it from going full VAX.

cdimauro Quote:

LOL How short and limited vision he had. Using size = 11 for... operation with a 16-bit immediate. Unbelievable!


It was early in Apollo core development and we were looking at possibilities. I don't remember whether I initially suggested the idea or Gunnar did, but he liked it. Meynaf didn't like it from the start because of the incompatibility. I was open-minded about it from the beginning and tried to find a way to make it work, but I agreed with Meynaf that it wasn't a good idea given the high compatibility needed for retro use. We considered a separate 64 bit mode but Gunnar didn't like that at all. Some time later, I came up with using the addressing mode to compress immediates, which is very compatible and better, as it can be used with MOVE. Meynaf actually didn't like the addressing mode idea either, as it was only useful for the OP ".L" size (".Q" too, but it was only a 32 bit ISA then), but he didn't complain as much because it was more compatible. Meynaf and Gunnar argued for a long time over the incompatible immediate compression idea and it created bad blood between them. I was more neutral but agreed with Meynaf about the incompatibility. Reusing encodings without a new 64 bit mode is bad for compatibility. Gunnar felt I took Meynaf's side even though he later dropped his support for the incompatible idea, probably because it caused some incompatibility and I had found a better alternative.

cdimauro Quote:

This is a MUCH better solution. Of course.


If only Motorola had planned for and reserved the size="11" instruction encodings for 64 bit back in the '70s, I wonder if the 68k would still be alive today. It was no doubt difficult to foresee back then even when the 68000 developers had the foresight to add 32 bit ISA support to a 16 bit CPU microprocessor which revolutionized computing, from embedded to workstation CPU markets (perhaps the beginning of the 2nd computing revolution). The size="11" encoding was still open for the 68000 ISA and it was actually the 68020 ISA which poorly used it while 64 bit ISA support was easier to see at that point. The ColdFire ISA developers should have seen that 64 bit ISA planning helped scalability but they were too focused on scaling the 68k down by castrating it and didn't want it scaled up to compete with PPC. ColdFire had just eliminated byte and word sizes to be more RISC like so it couldn't add a 64 bit size. Some time later, ARM AArch64 added 32 and 64 bit sizes as ARM scaled up replacing PPC and leaving 32 bit Thumb2 for low end embedded markets. ColdFire was just a scaled down castrated and weakened 68k that couldn't scale up to replace PPC. Motorola/Freescale/NXP never leveraged the 68k CISC performance advantage and now they pay license fees to ARM.

Last edited by matthey on 02-Oct-2022 at 09:43 PM.
Last edited by matthey on 02-Oct-2022 at 09:41 PM.
Last edited by matthey on 02-Oct-2022 at 09:40 PM.
Last edited by matthey on 02-Oct-2022 at 09:39 PM.

matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 3-Oct-2022 0:33:32
#233 ]
Super Member
Joined: 14-Mar-2007
Posts: 1968
From: Kansas

cdimauro Quote:

Interesting. It looks like that 16 registers is the sweet spot between program size and loads/stores. Which is strange for a RISC: I was expecting more registers pressure AKA more loads/stores, using a reduced set of registers.


For CISC and embedded RISC, it looks to me like 16 GP integer registers is the sweet spot. For high performance RISC, 32 GP registers likely provides some performance benefit, but it is partially offset by decreased performance from larger code. Early RISC ISAs paid no attention to code density and I wouldn't be surprised if the performance loss from fat code often more than offset the performance gain from 32 GP registers. Modern RISC ISAs like AArch64 and compressed RISC-V likely make 32 GP registers worthwhile, while they could still be improved with a better variable-length encoding (Mitch Alsup's ISA?).

cdimauro Quote:

The differences between the two papers is too huge. Even taking into account the very different architectures (MIPS vs x86-64), it's difficult to have an explanation for this.


I would like to see similar data done with compiles for -Os, -O2 and -O3. I expect more loop unrolling and function inlining to require more GP registers. RISC needs more loop unrolling to reduce load-use stalls and can have longer function prologues and epilogs which encourages more inlining.

https://www.ibm.com/docs/en/aix/7.2?topic=overview-prologs-epilogs

RISC functions can't be too deep though as register spills are much more expensive than CISC.

RISC_out_of_regs:
store reg_var3
load reg_var2
// load-use stall (3 cycle on ARM Cortex-A53)
op reg_var2,reg_var1
===
total: 3 instructions, 2 mem accesses, 12 bytes (32 bit fixed length encoding), 3-6 cycles typical

CISC_out_of_regs:
op mem_var2,reg_var1
===
total: 1 instruction, 1 mem access, 4 bytes (68k), 1 cycle typical

This is a huge difference. RISC can reduce the load-use stall (difficult for in-order RISC CPUs) or add a reg-mem exchange instruction (rare, as it has CISC-like reg-mem complexity). CISC can dual-port the data cache and do 2 memory reads of variables per cycle, which is likely why we see so little performance degradation on x86(-64) from using so few GP registers despite the much increased memory accesses. There are some complex functions which would benefit from more GP integer registers. See the AMD slide on page 10.

https://old.hotchips.org/wp-content/uploads/hc_archives/hc14/3_Tue/26_x86-64_ISA_HC_v7.pdf

About 90% of functions only need 16 GP registers, but this may include SIMD unit registers, which likely benefit more from added GP registers. There are a couple of percent of functions that need more than 32 GP registers. Is it worth bloating up the code all the time for infrequent GP register needs with a RISC register-munching monster with a memory bottleneck, or is it better to have a CISC reg-mem munching monster?
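As a rough C-level counterpart of the RISC_out_of_regs / CISC_out_of_regs sequences above (a sketch of mine): the memory operand in the loop below folds straight into the add on a reg-mem CISC, e.g. add.l (a0)+,d0 on the 68k, while a load/store RISC has to issue a separate load and absorb its latency.

#include <stdint.h>

/* total += a[i] is one reg-mem instruction on a CISC (memory source operand),
 * but a load into a scratch register plus a register add on a load/store RISC. */
int32_t sum32(const int32_t *a, long n)
{
    int32_t total = 0;
    for (long i = 0; i < n; i++)
        total += a[i];
    return total;
}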

bhabbott 
Re: The (Microprocessors) Code Density Hangout
Posted on 3-Oct-2022 1:00:30
#234 ]
Regular Member
Joined: 6-Jun-2018
Posts: 330
From: Aotearoa

@cdimauro

Quote:

cdimauro wrote:

There's a good reason why CELL was a complete failure...

Which was?

The PS3's Cell Processor Is More Powerful Than Current Intel CPUs


matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 3-Oct-2022 3:04:22
#235 ]
Super Member
Joined: 14-Mar-2007
Posts: 1968
From: Kansas

bhabbott Quote:

The PS3's Cell Processor Is More Powerful Than Current Intel CPUs


The Cell CPU provided huge theoretical parallel-workload SIMD performance, but GPUs could already do that. Sony forgot that games need strong sequential-workload (single-threaded) performance, where the single PPC core is severely lacking. The CPU is clocked high to give more SIMD performance, but the PPC core is relatively weak and prone to stalls. The AMD64 CPU has so much more single-threaded performance and is so much easier to program that it makes the Cell look like a joke. Cell even came from the Roadrunner supercomputer design but still lacked single-threaded performance. The XBOX 360 Xenon had 3 cores and was easier to program, but again with weak PPC CPU cores.

Cell
~40 cycle load-hit-store pipeline stall (no store-to-load forwarding)
no barrel shifter (like the 68000, each shift takes more cycles)

https://www.extremetech.com/computing/274650-the-worst-cpus-ever-made Quote:

Dishonorable Mention: IBM PowerPC G5

Apple’s partnership with IBM on the PowerPC 970 (marketed by Apple as the G5) was supposed to be a turning point for the company. When it announced the first G5 products, Apple promised to launch a 3GHz chip within a year. But IBM failed to deliver components that could hit these clocks at reasonable power consumption and the G5 was incapable of replacing the G4 in laptops due to high power draw. Apple was forced to move to Intel and x86 in order to field competitive laptops and improve its desktop performance. The G5 wasn’t a terrible CPU, but IBM wasn’t able to evolve the chip to compete with Intel.

...

Dishonorable Mention: Cell Broadband Engine

We’ll take some heat for this one, but we’d toss the Cell Broadband Engine on this pile as well. Cell is an excellent example of how a chip can be phenomenally good in theory, yet nearly impossible to leverage in practice. Sony may have used it as the general processor for the PS3, but Cell was far better at multimedia and vector processing than it ever was at general purpose workloads (its design dates to a time when Sony expected to handle both CPU and GPU workloads with the same processor architecture). It’s quite difficult to multi-thread the CPU to take advantage of its SPEs (Synergistic Processing Elements) and it bears little resemblance to any other architecture.


PPC 603 killed the low end PPC desktop market. PPC G5 killed the high end PPC market. Cell killed the PPC console market.

Last edited by matthey on 03-Oct-2022 at 12:50 PM.
Last edited by matthey on 03-Oct-2022 at 04:28 AM.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 3-Oct-2022 5:45:32
#236 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

@Gunnar

Quote:

Gunnar wrote:
I see we forgot, that Cesare is more clever than IBM and INTEL together...


Gunnar, if we removed one point for each logical fallacy that you've written since you've been here, your IQ level would be equal to or below that of plants...

Quote:

Gunnar wrote:

Dear Cesare Di Mauro,

How do you think you can judge about what you are talking?

Please, tell me where I can buy the powerful CELL: I can't wait to have one... which of course should kill my PC.

And regarding POWER, enjoy:
POWER9 Benchmarks vs. Intel Xeon vs. AMD EPYC Performance On Debian Linux
Raptor Talos II POWER9 Benchmarks Against AMD Threadripper & Intel Core i9
Benchmarking A 10-Core Tyan/IBM POWER Server For ~$300 USD
POWER9 & ARM Performance Against Intel Xeon Cascadelake + AMD EPYC Rome
Quote:
When did you for example write something basic like a Matrix MUL FPU code for any CPU?

Irrelevant (another logical fallacy; of course).
Quote:
The problem with arm chair quarterbacks is that they talk like they have a clue, but in reality they have no clue.

Which the great genius behind the keyboard never proves: who knows why...
Quote:
You talk about optimal register sizes for CPUs but all your "wisdom" comes from Wikipedia.

Another logical fallacy.

Gunnar, the IQ should be greater than zero by definition, but if you continue this way it should be redefined...


@bhabbott

Quote:

bhabbott wrote:
@cdimauro

Quote:

cdimauro wrote:

There's a good reason why CELL was a complete failure...

Which was?

The PS3's Cell Processor Is More Powerful Than Current Intel CPUs

Sure. ON PAPER.

In fact, it was so powerful that it replaced the processors in PCs, right?


@matthey

Quote:

matthey wrote:
bhabbott Quote:

The PS3's Cell Processor Is More Powerful Than Current Intel CPUs


The Cell CPU provided huge theoretical parallel workload SIMD performance but GPUs can already do that. Sony forgot that games need strong sequential workload (single threaded) performance where the single PPC core is severely lacking. The CPU is clocked high to give more SIMD performance but the PPC core is relatively weak and prone to stalls. The AMD64 CPU has so much more single threaded performance and is so much easier to program, it makes the Cell look like a joke. Cell even came from the Roadrunner Supercomputer design but still lacked single threaded performance. The XBOX 360 Xenon had 3 cores and was better in ease of programming but again weak PPC CPU cores.

Quote:
https://www.extremetech.com/computing/274650-the-worst-cpus-ever-made Quote:
Dishonorable Mention: IBM PowerPC G5

Apple’s partnership with IBM on the PowerPC 970 (marketed by Apple as the G5) was supposed to be a turning point for the company. When it announced the first G5 products, Apple promised to launch a 3GHz chip within a year. But IBM failed to deliver components that could hit these clocks at reasonable power consumption and the G5 was incapable of replacing the G4 in laptops due to high power draw. Apple was forced to move to Intel and x86 in order to field competitive laptops and improve its desktop performance. The G5 wasn’t a terrible CPU, but IBM wasn’t able to evolve the chip to compete with Intel.

...

Dishonorable Mention: Cell Broadband Engine

We’ll take some heat for this one, but we’d toss the Cell Broadband Engine on this pile as well. Cell is an excellent example of how a chip can be phenomenally good in theory, yet nearly impossible to leverage in practice. Sony may have used it as the general processor for the PS3, but Cell was far better at multimedia and vector processing than it ever was at general purpose workloads (its design dates to a time when Sony expected to handle both CPU and GPU workloads with the same processor architecture). It’s quite difficult to multi-thread the CPU to take advantage of its SPEs (Synergistic Processing Elements) and it bears little resemblance to any other architecture.

LOL

Now the proud Gunnar will register at extremetech and start telling the article's author:

"are you more clever than IBM?"

"When did you for example write something basic like a Matrix MUL FPU code for any CPU?"

"all your "wisdom" comes from Wikipedia."

"The problem with arm chair quarterbacks is that they talk like they have a clue, but in reality they have no clue."

and so on...

Gunnar 
Re: The (Microprocessors) Code Density Hangout
Posted on 3-Oct-2022 6:52:22
#237 ]
Regular Member
Joined: 25-Sep-2022
Posts: 473
From: Unknown

Cesare and Matthew,

You discuss optimal CPU features, and you make claims about what features the optimal CPU needs to have. But all your "experience" is based on your understanding, or misunderstanding, of posts from people you don't know and which you found somewhere on the internet.

But when we point out some real facts about some CPUs, discuss the advantages for certain well-known, real-life programs, and ask you whether you have ever written such code?

Then you answer, again, only by quoting totally random posts from the internet.

Cesare, you remind me of someone who calls himself a football expert but who has never made a block, never made a tackle, and never thrown or caught a ball in his whole life - and instead bases his knowledge on quoting random nonsense found somewhere on the internet.

And you want us to believe that your playbook is the best in the world and wins every game?


cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 3-Oct-2022 7:01:29
#238 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

@matthey

Quote:

matthey wrote:
cdimauro Quote:

It looks strange to me. If you have to execute two instructions then you've to evaluate 2 x 2 = 4 effective addresses.

Which should be slower than evaluate 3 EAs on a single instruction.


Well on the road to a VAX recreation. The 68k allows 2 EAs but only for MOVE EA,EA which is common and gives a powerful instruction even if MOVE mem,mem takes 2 cycles. The main advantage is better code density and a register saved.

move.l (a0),(a1) ; 2 bytes
===
total: 2 bytes, 2 registers

vs

move.l (a0),d0 ; 2 bytes
move.l d0,(a1) ; 2 bytes
===
total: 4 byte, 3 registers

Was this special case of allowing 2 EAs worth it? ColdFire kept MOVE mem,mem when they greatly simplified the 68k even though they introduced addressing mode limitations by limiting instruction lengths. Of course they kept reg-mem operations as well even read-modify-write ones for their RISC encoding and RISC execution pipelines. This leads me to believe that some of the choices are subjective.

I agree.

Specifically in Karlos's case, you should consider that his architecture is meant to be executed on a virtual machine, so it has to be fast to decode for the processor that runs the VM. That's why I was wondering what problem evaluating 3 EAs instead of 2 could cause, when the total cost is / should be below the 2 + 2 of using two different instructions.

Anyway, in that part I was talking only about 3 operands using only registers. So there's no real EA decoding: just use the needed registers.
Quote:
Mitch Alsup https://groups.google.com/g/comp.arch/c/wzZW4Jo5tbM?pli=1
I went the other direction: the key data addressing mode in the MY 66000
ISA is :: (Rbase+Rindex*SC+Disp)

When Rbase == R0 then IP is used in lieu of any base register
When Rindex == R0 then there is no indexing (or scaling)

IA-32/x86-64 uses similar encoding exceptions for the general Rbase+Rindex*SC+Disp addressing mode, but I don't like them: they complicate the decoder.
Quote:
Disp comes in 3 flavors:: Disp16, Disp32, and Disp64

The assembler/linker is task with choosing the appropriate instruction form
from the following:

MEM Rd,(Rbase+Disp16)
MEM Rd,(Rbase+Rindex*SC)
MEM Rd,(Rbase+Rindex*SC+Disp32)
MEM Rd,(Rbase+Rindex*SC+Disp64)

Disp64 is overkill: really not needed.

You can easily see it in the statistics that I've collected: https://www.appuntidigitali.it/18192/statistiche-su-x86-x64-parte-5-indirizzamento-verso-la-memoria/

Adobe Photoshop CS6 32 bit

Addressing modes:
Address mode Count
[EBP-DISP8*4] 208695
[REG+DISP8] 187256
PC32 151488
[ESP+DISP8*4] 126268
PC8 100865
[REG] 66870
[DISP32] 42151
[REG+DISP32] 41599
[REG+REG*SC+DISP8] 30736
[EBP-DISP32] 23608
[REG+REG*SC] 20853
[REG+REG*SC+DISP32] 2707
[ESP+DISP32] 78


Adobe Photoshop CS6 64 bit
Addressing modes:
Address mode Count
[RSP+DISP8*8] 320474
[REG+DISP8] 170953
PC32 168789
PC8 116645
[REG+DISP32] 84105
[REG] 62352
[RIP+DISP32] 58299
[REG+REG*SC+DISP8] 49512
[REG+REG*SC] 33659
[RBP-DISP8*8] 25089
[REG+REG*SC+DISP32] 5544
[RSP+DISP32] 1643
[DISP8] 2


I also share the results with 64-bit Excel:
Addressing modes:
Address mode Count
[REG+DISP7] 594110
PC8 531251
[REG+DISP16] 428929
PC24 393766
[RSP+DISP8*8] 366152
[RIP+DISP32] 215223
[REG] 202264
PC32 162621
[RBP-DISP8*8] 111836
[REG+REG*SC] 85404
[DISP16] 14231
[REG+REG*SC+DISP8] 9670
[REG+REG*SC+DISP16] 4658
[REG+DISP32] 3784
[RSP+DISP24*8] 1661
[REG+REG*SC+DISP32] 1099
[RIP+DISP16] 693
[RBP-DISP32*8] 1
[0] 1


32-bit displacements are the least frequent ones. I strongly doubt that 64-bit displacements would be of any use.

In fact, on NEx64T I had room for 64-bit displacements, but I decided to use that encoding space for something better (which paid off a lot).
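A small sketch of why small displacements dominate: a typical struct/array access maps to a single [base + index*scale + disp] operand whose displacement is just a field offset, far below even the 8-bit range (the comment assumes the usual x86-64 SysV register assignment):

#include <stdint.h>

struct point { int32_t x, y; };

/* p[i].y is one memory operand, typically "mov eax,[rdi + rsi*8 + 4]":
 * the displacement is only the offset of the y field (4), so DISP8 is plenty. */
int32_t get_y(const struct point *p, long i)
{
    return p[i].y;
}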
Quote:
Having SW create immediates and displacements by executing instructions is
simply BAD FORM*. Immediates and displacements should never pass through the
data cache nor consume registers from the file(s), nor should they be found
in memory that may be subject to malicious intent.

(*) or lazy architecting--of which there is way too much.

I fully agree here.
Quote:
Adding many EAs to instructions makes instructions difficult to execute in a single cycle which is bad for pipelining on real hardware. Limiting EAs to one per instruction like the 68k, except for the special MOVE mem,mem case, keeps from going VAX.

I don't see the problem with using two EAs in instructions: the complexity should be the same as for the 68k's MOVE mem,mem.

In fact, on the 68k we also had ADD, SUB and CMP instructions with two memory operands.

That's the reason why on NEx64T any binary instruction can use up to two EAs (in some formats; it depends on the specific binary instruction). This improves code density and/or the number of instructions executed.
Quote:
cdimauro Quote:

LOL How short and limited vision he had. Using size = 11 for... operation with a 16-bit immediate. Unbelievable!


It was early in Apollo core development and we were looking at possibilities. I don't remember if I initially suggested the idea or Gunnar but he liked it. Meynaf didn't like it from the start because of incompatibility. I was open minded about it from the beginning and tried to find a way to make it work but I agreed with Meynaf it wasn't a good idea for high compatibility needed for retro use. We considered a separate 64 bit mode but Gunnar didn't like that at all.

And I think that he still hasn't added it.

That's why I've asked for the encodings of the 3 MOVEA instructions, but he systematically avoids providing them.
Quote:
Some time later, I came up with using the addressing mode to compress immediates which is very compatible and better as it can be used with MOVE. Meynaf actually didn't like the addressing the mode idea as it was only useful for the OP ".L" size (".Q" also but only a 32 bit ISA then) but he didn't complain as much because it was more compatible.

Actually, it's not clear what Meynaf wanted: neither Size = 11 (which IS GOOD!) nor using an UNUSED addressing mode.

Then what's his solution for this specific problem? No solution?
Quote:
Meynaf and Gunnar argued for a long time over the incompatible immediate compression idea and it created bad blood between them. I was more neutral but agreed with Meynaf about the incompatibility. Reusing encodings without a new 64 bit mode is bad for compatibility.

Absolutely. Only a fool could think of this solution. Or a hardware engineer with a very limited mind.
Quote:
cdimauro Quote:

This is a MUCH better solution. Of course.


If only Motorola had planned for and reserved the size="11" instruction encodings for 64 bit back in the '70s, I wonder if the 68k would still be alive today.

Actually that was THE problem. Another short-sighted decision by the hardware engineers of the time...
Quote:
It was no doubt difficult to foresee back then even when the 68000 developers had the foresight to add 32 bit ISA support to a 16 bit CPU microprocessor which revolutionized computing, from embedded to workstation CPU markets (perhaps the beginning of the 2nd computing revolution). The size="11" encoding was still open for the 68000 ISA and it was actually the 68020 ISA which poorly used it while 64 bit ISA support was easier to see at that point.

Hmm, no: the 68000 also used it, unfortunately. The most notable case: there's no room for MOVE.Q mem,mem.

But there might be some other cases where Size = 11 was also (re)used.
Quote:
The ColdFire ISA developers should have seen that 64 bit ISA planning helped scalability but they were too focused on scaling the 68k down by castrating it and didn't want it scaled up to compete with PPC. ColdFire had just eliminated byte and word sizes to be more RISC like so it couldn't add a 64 bit size.

Well, actually they cut so much that they completely killed the product...
Quote:
Some time later, ARM AArch64 added 32 and 64 bit sizes as ARM scaled up replacing PPC and leaving 32 bit Thumb2 for low end embedded markets. ColdFire was just a scaled down castrated and weakened 68k that couldn't scale up to replace PPC. Motorola/Freescale/NXP never leveraged the 68k CISC performance advantage and now they pay license fees to ARM.

Indeed...
Quote:

matthey wrote:
cdimauro Quote:

Interesting. It looks like that 16 registers is the sweet spot between program size and loads/stores. Which is strange for a RISC: I was expecting more registers pressure AKA more loads/stores, using a reduced set of registers.


For CISC and embedded RISC, it looks to me like 16 GP integer registers is the sweet spot. For high performance RISC, 32 GP registers likely provides some performance benefit but it is partially offset by decreased performance from larger code. Early RISC ISAs payed no attention to code density and I wouldn't be surprised if the performance loss from fat code often more than offset the performance gain from 32 GP registers. Modern RISC ISAs like AArch64 and RISC-V compressed likely make 32 GP registers worthwhile while they still could be improved with a better variable length encoding (Mitch Alsup's ISA?).

AArch64 cannot be improved.
Actually, some new instructions could be added for loading large immediates, for example, but I don't think that ARM will change the current instruction size: I strongly believe that they will keep it at 32 bits.

RISC-V already has encoding space for instructions up to 22 bytes (AFAIR), in multiples of 16 bits (guess why), plus an option for longer instructions.
So they could do it, but since they are RISC fanatics I don't believe that any such variable-length encoding will ever become part of the standard. Just take a look at the vector ISA extensions that they have designed and the hoops they have jumped through to keep everything in a 32-bit encoding.

BTW, Mitch Alsup's ISA isn't a RISC. Clearly.
Quote:
cdimauro Quote:

The differences between the two papers is too huge. Even taking into account the very different architectures (MIPS vs x86-64), it's difficult to have an explanation for this.


I would like to see similar data done with compiles for -Os, -O2 and -O3. I expect more loop unrolling and function inlining to require more GP registers. RISC needs more loop unrolling to reduce load-use stalls and can have longer function prologues and epilogs which encourages more inlining.

I believe so.
Quote:
https://www.ibm.com/docs/en/aix/7.2?topic=overview-prologs-epilogs

The most interesting part is the example of such prologues and epilogues.

I'd expect NutsAboutAmiga to complain a lot about all that "bloat". But since it's about PowerPCs, I bet I won't see any comment from him.
Quote:
RISC functions can't be too deep though as register spills are much more expensive than CISC.

RISC_out_of_regs:
store reg_var3
load reg_var2
// load-use stall (3 cycle on ARM Cortex-A53)
op reg_var2,reg_var1
===
total: 3 instructions, 2 mem accesses, 12 bytes (32 bit fixed length encoding), 3-6 cycles typical

CISC_out_of_regs:
op mem_var2,reg_var1
===
total: 1 instruction, 1 mem access, 4 bytes (68k), 1 cycle typical

This is a huge difference.

Indeed. A clear example of the CISC advantage.
Quote:
RISC can reduce the load-use stall (difficult for in order RISC CPUs) or add a reg-mem exchange instruction (rare with CISC like reg-mem complexity).

They already aren't RISCs: if you remove the only thing that is left (the load/store architecture), then how could they continue to be called... RISCs?
Quote:
CISC can dual port the data cache and do 2 mem reads per cycle of variables which is likely why we see so little performance degradation on x86(-64) from using so few GP registers despite the much increased mem accesses. There are some complex functions which would benefit from more GP integer registers. See the AMD slide on page 10.

https://old.hotchips.org/wp-content/uploads/hc_archives/hc14/3_Tue/26_x86-64_ISA_HC_v7.pdf

Yes, I know. CPython's main loop (ceval.c) also uses more than 16 registers.

Fortunately those aren't very common cases.
Quote:
About 90% of functions only need 16 GP registers but this may include SIMD unit registers which likely benefits more from added GP registers.

That might happen with the regular / normal SIMD extensions.

Vector extensions (register-length agnostic) require far fewer registers.
Quote:
There are a couple of percent of functions that need more than 32 GP registers. Is it worth bloating up the code all the time for infrequent GP register needs for a RISC register munching monster with a memory bottleneck or is it better to have a CISC reg-mem munching monster?

Correct. But then why does Mitch Alsup use 60 (or 64? I don't recall this detail now) registers in his ISA?

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 3-Oct-2022 7:05:45
#239 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

@Gunnar

Quote:

Gunnar wrote:
Cesare and Matthew,

You discuss the optimal CPU features, and you make claims about what features the optimal CPU needs to have. But all your "experience" is based on your understanding or misunderstanding
of posts from some people that you don't know and which you found somewhere in the internet.

When we point out that some real facts, of some CPUS and discuss the advantage for certain well know real live programs, and we ask you - if you have ever written such code?

Then you answer again only by quoting totally random posts from the internet.

Actually, YOU have never brought a single proof of your claims.

The only ones bringing FACTS here are me and Matt.
Quote:
Cesare you remind me on someone - who calls himself a football expert but who has never done a block, who has never done a tackle and who has never thrown a ball and never caught a ball in his whole live - and instead you base you knowledge on quoting random nonsense that you found somewhere in the internet.

And you want us to believe, that your playbook is the best in the world and wins every game?

Guess what: another logical fallacy!

Actually, logical fallacies are the ONLY thing that you're bringing here.

Gunnar, the great hardware engineer who, outside of his domain, shows an embarrassing lack of elementary logic...

Gunnar 
Re: The (Microprocessors) Code Density Hangout
Posted on 3-Oct-2022 7:21:31
#240 ]
Regular Member
Joined: 25-Sep-2022
Posts: 473
From: Unknown

Quote:

Quote:

Cesare you remind me on someone - who calls himself a football expert but who has never done a block, who has never done a tackle and who has never thrown a ball and never caught a ball in his whole live - and instead you base you knowledge on quoting random nonsense that you found somewhere in the internet.

And you want us to believe, that your playbook is the best in the world and wins every game?


Guess what: another logical fallacy!

Actually logical fallacies are the ONLY thing that you're bringing here.



Cesare,

there is nothing wrong with getting information from other sources.

Your problem is that the internet is not a 100% reliable source
where every post is 100% correct - the Internet is full of nonsense.

By quoting from the Internet you can prove that Elvis is still alive,
that the Russians are right now trying to save Ukraine from Nazis,
that the USA is secretly ruled by reptile aliens,
and that the TINA Amiga project has a 128-bit memory bus.
= the Internet is full of nonsense.

Without any personal experience of your own, you cannot judge the quality of what you quote from the Internet.

You have run into this trap of quoting "false facts" without understanding them before.
Let us recall your hundreds of posts about the TINA project:

In the TINA project you claimed absolutely false and technically impossible values as hardware facts.
You claimed impossible clock rates, you claimed an impossible bus width, and you claimed you had the CPU from NATAMI that you would use.

Cesare, you repeat this again and again.
Please stop quoting stuff that you do not understand.
