The (Microprocessors) Code Density Hangout
Forum: General Technology (No Console Threads)
cdimauro (Elite Member, from Germany)
Posted on 26-Sep-2022 20:38:38, post #141

@Gunnar

Quote:

Gunnar wrote:
@cdimauro

Quote:

Ask Olaf to make his forum public and you'll find almost everything there.


So you really think that posting a childish wishlist of features

It was NOT a wishlist, but a concrete design. Either you still have memory problems, or it's "just" bad faith...
Quote:
is designing a chipset,

I designED a chipset, at the time.
Quote:
or designing a CPU?

Not really. On Olaf's forum I only talked about architectures and possible 68k extensions.

That's because, at the time, I had already designed my 68k-inspired ISA (which I called C64K) and the first or second version (now I'm at v10) of my NEx64T (an IA-32/x86-64 100% assembly-level compatible ISA).
Quote:
What will you do next?

I just stopped, because I already had more than enough.
Quote:
Will you post a pipe-dream wishlist of ideas about future cars in a forum,
and then go to Porsche and tell them that you designed a much better and much more future-proof car than they did?

I hope they will be as impressed as we are.

Well, I think you could impress computer architecture engineers much more if you gave a talk showing how you designed your AMMX SIMD extension.

I've no doubt that they would laugh like crazy and give you the prize for the ugliest SIMD ever made, with its ridiculous split of data and FP registers extending the same set of additional SIMD registers.

Especially compiler writers will give you a standing ovation!

BTW, the Embedded World conference usually takes place in Nürnberg: you could take the chance to give your talk there. Let me know and I'll be sure to attend, to watch the people laughing.

cdimauro (Elite Member, from Germany)
Posted on 26-Sep-2022 20:41:36, post #142

@Bosanac

Quote:

Bosanac wrote:
@cdimauro

I’m not asking for PowerPoint slides.

I’m asking you to build it and sell it to me.

Do you understand that a processor is made by HUNDREDS of engineers with DIFFERENT roles?

What you're asking is NOT part of my expertise: I only design ISAs.
Quote:
I’m married to an Italian woman btw; I’m more than familiar with the Napoleon complex in many of you.

I'm from southern Italy and I can also understand most Neapolitan.

cdimauro (Elite Member, from Germany)
Posted on 26-Sep-2022 21:04:06, post #143

@Gunnar, don't forget this:

Quote:

Gunnar wrote:
@matthey

Quote:

matthey wrote:
Your post is technically wrong.

We'll see about that.
Quote:
Fact is: The 68080 does not need a prefix to manipulate the full width of the Address registers.
Why do you post stuff like this, if you really not understood the CPU?

Maybe because YOUR documentation sucks so much and has NOTHING reporting what you said?

You talk about FACTs, but the real fact is that I've already read all the documentation about your Apollo core / 68080 and there's absolutely nothing that even remotely resembles what you stated here.

Here's the main source for your documentation: http://apollo-core.com/index.htm?page=coding&tl=1 Even checking ALL the "tabs", there's NOTHING there about your statement.

I've also read ALL the documentation which A USER (so, NOT you nor anyone on your team) has collected and made available here: http://apollo-core.com/knowledge.php?b=5&note=38530
Specifically, the most interesting is something which is a crossover between an Architecture and Programmers Manual: http://cdn.discordapp.com/attachments/730698753513750539/883167019581722654/VampireProgrammingGuide2021.docx
But even in this manual, there's NOTHING that could confirm your statement.

Quite the contrary: for all the "A" instructions that use an address register as destination, I see something like this:

"The size of the operation may be specified as word or long. Word size source operands are sign extended to 32-bit quantities prior to the addition."

Pay attention to the highlighted parts, because it's clear that the text does NOT talk about 64-bit quantities, nor about 16- or 32-bit data being sign-extended to 64 bits.
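To pin down what the manual actually documents, here is a minimal C model (my own sketch, assuming plain 32-bit address registers; none of this comes from Apollo material):

#include <stdint.h>

/* Behavioral model of the documented ADDA.W semantics: the 16-bit
   source is sign-extended to a 32-bit quantity prior to the addition.
   Note that nothing in the manual text mentions 64 bits. */
uint32_t adda_w(uint32_t an, uint16_t src)
{
    int32_t extended = (int16_t)src;   /* sign-extend .W to 32 bits */
    return an + (uint32_t)extended;
}

/* What the claimed 68080 behavior would require instead (hypothetical): */
uint64_t adda_w_64(uint64_t an, uint16_t src)
{
    int64_t extended = (int16_t)src;   /* sign-extend .W to 64 bits */
    return an + (uint64_t)extended;
}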

So, care to PROVE your statement?

And, BTW, could you show me the encodings for the following instructions:
MOVEA.W A0,A1
MOVEA.L A0,A1
MOVEA.Q A0,A1

?

Don't delude your minions. What will they think if their leader/hero isn't able to prove his own statements?

Bosanac (Regular Member)
Posted on 26-Sep-2022 21:25:09, post #144

@cdimauro

Quote:
Do you understand that a processor is made by HUNDREDS of engineers with DIFFERENT roles?

What you're asking is NOT part of my expertise: I only design ISAs.


So go get it made, then I can buy it. Are you really as ignorant as you appear?

Taking risks is what makes a man. That's why I'm more successful than you, despite apparently being an idiot according to you.

This is why I say you will always be a wage slave working to make someone else richer than they already are.

Quote:
I'm from south Italy and I can also understand most of Neapolitan.


:facepalm:

cdimauro (Elite Member, from Germany)
Posted on 26-Sep-2022 21:35:05, post #145

@Bosanac

Quote:

Bosanac wrote:
@cdimauro

Quote:
Do you understand that a processor is made by HUNDREDS of engineers with DIFFERENT roles?

What you're asking is NOT part of my expertise: I only design ISAs.


So go get it made, then I can buy it. Are you really as ignorant as you appear?

Taking risks is what makes a man. That's why I'm more successful than you, despite apparently being an idiot according to you.

This is why I say you will always be a wage slave working to make someone else richer than they already are.

Hey, failed genius, what's your IQ level?

Do you know WHY I created a presentation for my processor? For fun?

You're really embarrassing...

Bosanac (Regular Member)
Posted on 26-Sep-2022 21:40:46, post #146

@cdimauro

We can play that game if you want.

What's your net worth?

If you are so convinced that your designs are as wonderful as you say they are then get them made and available to buy.

It's not difficult.

I can introduce you to one of the many VCs I'm colleagues with. It's not something I'd be interested in funding for sure but they might.

cdimauro (Elite Member, from Germany)
Posted on 26-Sep-2022 21:42:03, post #147

@Bosanac

Quote:

Bosanac wrote:
@cdimauro

We can play that game if you want.

What's your net worth?

If you are so convinced that your designs are as wonderful as you say they are then get them made and available to buy.

It's not difficult.

I can introduce you to one of the many VCs I'm colleagues with. It's not something I'd be interested in funding for sure but they might.

Thanks, but I have other plans.

Bosanac (Regular Member)
Posted on 26-Sep-2022 21:59:14, post #148

@cdimauro

Quelle surprise!

Now a man like me might say an idiot is happy to be a wage slave and not master of his own destiny...

matthey (Elite Member, from Kansas)
Posted on 26-Sep-2022 21:59:31, post #149

@Gunnar
Are you really Gunnar von Boehn or are you one of his goons creating a fake account to troll and attack us?

Gunnar Quote:

Do you really think, that more CPU register would give problem when developing an ASIC?
I saw you posting such before. This is of course absolute nonsense.

More register are no problem at all for going ASIC:
This should be obvious to everyone. As every CPU made today has many more registers.
IBM Power have over hundred of register, same for INTEL, AMD, and ARM.


What ASIC CPU has 48x64b integer registers? Large register files can become a timing problem at higher clock speeds.

https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf Quote:

The register file access time is one of the critical delays in current superscalar processors. Its impact on processor performance is likely to increase in future processor generations, as they are expected to increase the issue width (which implies more register ports) and the size of the instruction window (which implies more registers), and to use some kind of multithreading. Under this scenario, the register file access time could be a dominant delay and a pipelined implementation would be desirable to allow for high clock rates.

...

Most current dynamically scheduled microprocessors have a RISC-like instruction set architecture, and therefore, the majority of instruction operands reside in the register file. The access time of the register file basically depends on both the number of registers and the number of ports. To achieve high performance, microprocessor designers strive to increase the issue width.


Some CPUs have used a duplicate shadow register file because of timing constraints of a large register file.

https://en.wikipedia.org/wiki/Register_file#Microarchitecture Quote:

The Alpha 21264 (EV6), for instance, was the first large micro-architecture to implement a "Shadow Register File Architecture". It had two copies of the integer register file and two copies of the floating point register located in its front end (future and scaled file, each containing 2 read and 2 write ports), and took an extra cycle to propagate data between the two during a context switch. The issuing logic attempted to reduce the number of operations forwarding data between the two and greatly improved its integer performance, and helped reduce the impact of the limited number of general-purpose registers in superscalar architectures with speculative execution. This design was later adapted by SPARC, MIPS and some of the later x86 implementations.

The MIPS uses multiple register files as well. The R8000 floating-point unit had two copies of the floating-point register file, each with four write and four read ports, and wrote both copies at the same time with a context switch. However, it did not support integer operations, and the integer register file still remained as such. Later, shadow register files were abandoned in newer designs in favor of the embedded market.


While CISC CPU cores usually have fewer registers, they usually use more ports to support more powerful instructions. It would appear that the Apollo core has many registers and many ports. It may not be a problem for a low-clocked, cheap FPGA-to-ASIC conversion, though. Nor could the extra register banks be power gated as effectively as shadow register banks for fast interrupts.
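For readers who haven't met the concept: a rough C sketch of register banking for fast interrupts (my own illustration of the general idea, not of any particular core):

#include <stdint.h>

/* Two copies of the architectural register file; an interrupt flips the
   active bank instead of spilling registers to memory, and the inactive
   bank could in principle be power gated. */
enum { NREGS = 16 };

static uint32_t bank[2][NREGS];
static int active = 0;                /* 0 = normal, 1 = interrupt */

uint32_t *regfile(void)    { return bank[active]; }
void enter_interrupt(void) { active = 1; }
void leave_interrupt(void) { active = 0; }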

The bigger question is whether those extra integer registers are worth the resources. At the very least a larger register file uses more transistors and power. There has been research on how many registers are optimal before diminishing returns set in. If you were the real Gunnar, you would know that I already showed you the research all the way back on the Natami forum, and that you ignored it along with the general consensus of developers that more GP integer registers were unnecessary. Let's look again at the CISC research called "Performance Characterization of the 64-bit x86 Architecture from Compiler Optimizations’ Perspective".

https://link.springer.com/content/pdf/10.1007/11688839_14.pdf Quote:

Figure 7 gives the normalized execution time (with respect to the base compilation using 16 GPRs and 16 XMM registers) with the REG_8 and REG_12 register configurations for the SPEC2000 benchmarks. On average, with the REG_8 configuration, the CINT2000 exhibits a 4.4% slowdown, and the CFP2000 exhibits a 5.6% slowdown; with the REG_12 configuration, the CINT2000 is slowed down by 0.9%, and the CFP2000 is slowed down by 0.3%. Clearly, these results show that REG_12 already works well for most of the SPEC2000 benchmarks.


On x86-64, the integer performance gain from 8 to 16 GP registers was 4.4% and from 12 to 16 GP registers was 0.9%. This data would suggest that more than 16 GP registers would gain less than the 0.9% performance difference from 12 to 16 GP registers. Register spills increase the code size and memory accesses but so do more encoding bits for registers and prefixes. The same paper mentions another study that showed REX prefixes used to access more registers increased the code size by up to 17%.

https://link.springer.com/content/pdf/10.1007/11688839_14.pdf Quote:

Luna et al. experimented with different register allocators on AMD64 platform. Their studies show that with more registers, a fast linear scan based register allocation algorithm could produce competitive performance code with graph coloring based algorithm. It is interesting for us to evaluate the performance of a linear scan approach with different number of registers on the SPEC2000. In addition, Luna et al. show that the code size could be increased by 17% due to REX prefixes.
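To make that byte cost concrete, compare two register-to-register moves (standard x86-64 encodings; my own example, easy to verify with any assembler):

/* MOV r/m32, r32 is opcode 0x89 plus a ModRM byte. Touching r8-r15
   requires a REX prefix, which adds one byte to the instruction: */
static const unsigned char mov_eax_ebx[] = { 0x89, 0xD8 };        /* mov eax, ebx  (2 bytes) */
static const unsigned char mov_r8d_r9d[] = { 0x45, 0x89, 0xC8 };  /* mov r8d, r9d  (3 bytes) */

That one extra byte, multiplied over every instruction that needs a REX prefix, is where code size growth like the 17% figure above comes from.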


Prefixes to access more register banks are not free! The 68k does not use a tiered register bank scheme like x86-64 and can access all 16 GP registers without increasing the code size. I have seen no data indicating that 68k code exhibits more register pressure or does more memory accesses than x86(-64); the limited data available shows significantly less of both.

https://docs.google.com/spreadsheets/d/e/2PACX-1vTyfDPXIN6i4thorNXak5hlP0FQpqpZFk2sgauXgYZPdtJX7FvVgabfbpCtHTkp5Yo9ai6MhiQqhgyG/pubhtml?gid=909588979&single=true

More GP integer registers using more resources is definitely less appealing for embedded use where fat RISC architectures like PPC and MIPS (see above) have reduced registers to compete with the 68k, SuperH and ARM, the big 3 embedded champions. The desktop champion x86-64 gained a moderate performance benefit moving to 16 GP registers from 8 GP registers while the larger code from REX prefixes to use the additional registers (and 64b capabilities) partially offset performance gains. In another x86-64 paper, only 6 of 12 SPEC CPU2000int benchmarks showed a performance increase from x86 to x86-64 and the average performance increase was less than 1%. When the SPEC benchmark was revamped for x86-64, 8 of 12 SPEC CPU2006int benchmarks showed a performance increase of 7% on average from x86 to x86-64. The SPEC CPU2006int benchmarks with the biggest performance losses were the ones with the largest "Memory footprint increase" of 100% and 34%.

Performance Characterization of SPEC CPU2006 Integer Benchmarks on x86-64 Architecture
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.409.9850&rep=rep1&type=pdf

There is likely more room for performance gains from improving code density than from adding registers beyond the 16 GP registers of the 68k, whether for embedded use or the desktop. For sure, no CPU ASIC will be created for embedded use with 48 GP integer registers. The embedded market doesn't want the resource and power waste of the large register file, nor the code density decline from prefixes. The desktop market requires higher clock speeds, where the larger register file may start to be a problem, and the larger footprint is still a performance impediment. The 64 bit SIMD unit with no floating point is a joke for the desktop market. For retro use, maybe it can be jammed down people's throats, but they mostly care about compatibility and will simply ignore enhancements, as has been demonstrated by the lack of compiler support for the Apollo core. With no market, there will for sure be no ASIC.

Gunnar Quote:

Its comically amusing to see an armchair expert that has zero experience in ASIC development and no experience in CPU design, giving "smart" advice to people that develop high-end CPU chips for a living with decades of practical experience in making ASICs designs. The Apollo-Team core team consists of several people which many of them have worked on a number of the best IBM high ASIC CPU designs from PowerPC 970 "G5", to POWER 7, 8 and 9.


Forget propaganda. You are the most qualified for everything! Your marketing genius has customers with deep pockets waiting in line! How much longer before your ASIC is out? You didn't need me to find an embedded company that would help with ASIC development, or an embedded company building out IoT on a large enough scale to provide the economies of scale to fund an ASIC, or an experienced chief architect who could lead the development team? I hope to see your ASIC soon.

I believe Jens is an experienced CPU architect, since he designed and wrote the N68k CPU and probably helped with the Apollo core based on the N68k. Chris wrote the FPU. Thomas wrote SAGA. From all that I could see as an Apollo Team member, you designed the ISA completely by yourself. Where were Jens and Chris when it came to ISA development? What parts of these advanced CPU core designs did you work on? What parts did Jens, Chris or Thomas work on? What were your positions at IBM?

Gunnar Quote:

Seeing that your post about 68080 are often totally wrong - its obvious that you don't know what you talk about.
Many of your posts about AMMX or the instruction set or the CPU features are technically simple false and wrong. This shows me that you never coded for 68080 and all "your knowledge" is based on misreading and misunderstanding or hearsay of the features.

But was having no clue ever be a problem for an armchair expert?


I haven't written much about AMMX, although I don't need to know any more than that it uses 64 bit SIMD unit registers without floating point support. That alone disqualifies it from desktop competitiveness. The 48 integer GP registers with it disqualify it from embedded consideration as a reduced-resource SIMD unit for potential customers who need more than a MAC unit for DSP workloads. The Apollo core documentation is bad. I was working on good documentation, with little appreciation, until you went crazy with the ISA (registers and prefixes). Even with good documentation, I feel sorry for any compiler developer who would try to support that ISA. I don't claim to be an expert on your ISA and never have. I know enough to know that that ISA is going nowhere. Did you ever ask Jens what he thinks about the ISA, or are you "more qualified" than the original architect?

I still think you are a Vampire cult troll and not really Gunnar. Probably Bosanac.

Last edited by matthey on 26-Sep-2022 at 10:22 PM.
Last edited by matthey on 26-Sep-2022 at 10:20 PM.
Last edited by matthey on 26-Sep-2022 at 10:15 PM.
Last edited by matthey on 26-Sep-2022 at 10:09 PM.

Bosanac (Regular Member)
Posted on 26-Sep-2022 22:04:07, post #150

@matthey

Quote:
I still think you are a Vampire cult troll and not really Gunnar. Probably Bosanac.


I've got far better things to be doing with my time.

cdimauro (Elite Member, from Germany)
Posted on 26-Sep-2022 22:06:42, post #151

@Bosanac

Quote:

Bosanac wrote:
@cdimauro

Quelle surprise!

Now a man like me might say an idiot is happy to be a wage slave and not master of his own destiny...

Ah, ok: so you know my plans...

Bosanac (Regular Member)
Posted on 26-Sep-2022 22:07:56, post #152

@cdimauro

Pray tell.

I'm genuinely interested.

“Life is what happens when you're busy making other plans” – John Lennon

Last edited by Bosanac on 26-Sep-2022 at 10:09 PM.

Karlos (Elite Member, from "As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!")
Posted on 26-Sep-2022 22:09:04, post #153

It's getting tense in here...

https://imgflip.com/i/6usbkb

_________________
Doing stupid things for fun...

Bosanac (Regular Member)
Posted on 26-Sep-2022 22:09:50, post #154

@Karlos

Ouch!

cdimauro (Elite Member, from Germany)
Posted on 26-Sep-2022 22:19:41, post #155

@Bosanac

Quote:

Bosanac wrote:
@cdimauro

Pray tell.

I'm genuinely interested.

“Life is what happens when you're busy making other plans” – John Lennon

I'd rather go to sleep, since this ping-ponging has annoyed me.


@Karlos

Quote:

Karlos wrote:
It's getting tense in here...

https://imgflip.com/i/6usbkb

ROFL Beautiful!

Karlos (Elite Member, from "As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!")
Posted on 26-Sep-2022 22:21:06, post #156

@cdimauro

I don't want to toot my own horn, but I actually built that

Last edited by Karlos on 26-Sep-2022 at 10:21 PM.

_________________
Doing stupid things for fun...

matthey (Elite Member, from Kansas)
Posted on 27-Sep-2022 3:52:05, post #157

cdimauro Quote:

I finally had the time to read the above paper. It only applies to Thumb (the first one). In fact, Thumb-2 solved all the issues reported in that paper, and in a much more elegant way.

So the 30% instruction count increase doesn't apply anymore; it's much reduced (compared to the original ARM).


Thumb's up-to-30% instruction increase is bad. Thumb2 is much better, but still has elevated instruction counts (and memory accesses). Let's compare using Dr. Vince Weaver's study.

Compared to ARM EABI
SH-3 +36% instructions
Thumb +24% instructions
Thumb2 +12% instructions

The Thumb +24% instruction increase fits with the up-to-30% increase found by the paper. Here Thumb2 uses +12% instructions, half of Thumb and a third of SH-3. This is a big improvement, and not too bad considering that code density improved with each ISA revision. The comparison is with the original 32-bit fixed-length ARM ISA with only 15 GP registers, though (the PC is one of the 16). RISC ISAs without 32 GP registers have increased instruction counts, while CISC ISAs with 16 GP registers can have instruction counts competitive with RISC ISAs that have 32 GP registers. Let's compare instruction counts to the original embedded champion that these compressed RISC ISAs poorly copied.

Compared to 68k
SH-3 +47% instructions
Thumb +35% instructions
Thumb2 +21% instructions
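A toy example (my own, not from Weaver's study) of why a CISC with memory operands retires the same work in fewer instructions:

int accumulate(int acc, const int *p)
{
    /* 68k folds the load into the ALU op:  add.l (a0),d0    1 instruction
       a load/store RISC needs two:         ldr r2,[r1]
                                            add r0,r0,r2     2 instructions */
    return acc + *p;
}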

The 68k was still champion in both instruction counts and code density when Motorola decided to replace the 68k embedded champion with PPC. Eventually, newer embedded designs in newer chip fab processes were lower power and more powerful, but embedded customers often kept using the old 68k designs for a decade or more after Motorola neglected the 68k. ARM chips had another advantage: cost. ARM licensed out their core designs far and wide while encouraging new designs, while Motorola protected the baby they were killing off, using the ~300% markup above cost to invest in new PPC designs.

PPC low-end and embedded designs had consistently disappointing performance, and the reason is code density. PPC CPUs needed instruction caches 4 times the size of the 68k's to get the same performance, but they didn't get them; the 4kiB and 8kiB I-caches of early PPC designs ruined the reputation of PPC. Higher-performance PPC CPUs which initially did well soon hit the performance wall when they couldn't increase the cache size any more. The PPC 604e was one of the highest-performance CPUs when it launched with 32kiB I+D L1 caches, but it couldn't be clocked above 233MHz with such large L1 caches. This was already the fastest clock speed among competitors with 32kiB I+D caches (see fig. 3 of "The Microprocessor Today", http://www.ee.unlv.edu/~meiyang/ecg700/readings/micro-today.pdf).

The Sun UltraSPARC had to drop back to 16kiB I+D caches to reach 250MHz, and the Digital Alpha down to 8kiB I+D L1 caches to reach 500MHz (the instruction bottleneck prompted them to innovate the L2 cache, but they had already made the fatal mistake of thinking code density doesn't matter and designing the Alpha ISA around that belief). The Alpha CPU used the most transistors of its competitors, most of them for caches. The second-most transistors went to the PA-RISC PA-7300LC, which only clocked up to 160MHz, so 64kiB I+D caches were possible. The PA-8000 moved all the caches off chip, but this made for a very expensive CPU chip with lots of pins. PA-RISC was the RISC ISA with the second-worst code density behind the Alpha, and they had cache issues too. The Pentium Pro had the smallest L1 caches in this high-performance competition.

http://www.ee.unlv.edu/~meiyang/ecg700/readings/micro-today.pdf Quote:

Intel’s Pentium Pro has the smallest on-chip caches of any high-performance processor: a mere 8 Kbytes each for instructions and data. This is because the custom-designed level-two cache chip (described earlier) is mounted in the same package as the processor and can deliver near on-chip speeds. Most other high-performance processors have on-chip level-one caches of either 16 or 32 Kbytes each for instructions and data. HP’s 7300LC has the largest caches, at 64 Kbytes each.


The PPro still had only 8kiB I+D L1 caches while the L2 was off chip (the stacked-chip design cuts wire distances, boosting performance, and is returning to popularity today). The PPro was only middle of the pack in performance, but it had room to be clocked up and to double the cache size.

While the PPC 604e was stuck at 233MHz and couldn't be clocked higher due to the large caches, the PPC 603e with its smaller 16kiB I+D caches out-clocked it; the 4-stage pipeline likely limited performance, yet it still reached 300MHz on the same process size (none of the high-performance CPUs in figure 3 reached 200MHz with fewer than 5 pipeline stages, so I suspect the PPC 603e snuck in another die shrink to reach 300MHz). Steve Jobs complained about the low PPC clock speeds, which were likely due to large L1 caches and shallow pipelines. PPC only started to improve again when the L2 cache was improved with the G3 and finally placed on chip.

The early RISC ISAs designed with no consideration for code density were a mistake.

The best 32-bit code density ISA was the 68k, and the 68060, with an 8-stage pipeline (deeper even than the 500MHz Alpha 21164) and only 8kiB I+D caches, reached only 66MHz, for embedded use only; even with a die shrink most chips were marked 50MHz. I've never seen a rev6 MC68060 marked above 50MHz even though most clock to 100MHz. Having 4 times the code in the instruction cache makes the 8kiB I-cache act like the 32kiB I-cache of a 604e, and at least leaves enough memory bandwidth to service data cache misses. Increasing to 16kiB I+D caches and raising the clock to around 250MHz would have been better than pushing the clock speed further. With so few instructions, so few memory accesses, such easy instruction scheduling and leading code density, it would have been a little powerhouse champion of embedded for longer, and maybe would have made people rethink using it for the desktop. That's exactly what Motorola didn't want, as they AIMed to replace the 68k with PPC while ignoring the PPC cache issues.
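As a back-of-envelope check of the cache arithmetic above (my own sketch, using the 4x density ratio asserted in this post):

/* Effective instruction capacity of an I-cache scales with code density:
   8kiB of cache holding code 4x denser behaves like 32kiB. */
double effective_icache_kib(double physical_kib, double density_ratio)
{
    return physical_kib * density_ratio;   /* 8.0 * 4.0 = 32.0 */
}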

Desktop wannabe RISC ISAs
Alpha, PA-RISC (worst code density RISC first to die)
|
V
MIPS, SPARC, PPC (normal fat RISC died slowly)
|
V
AArch64, RISC-V C (new less RISCy improved code density RISC)

Embedded RISC ISAs
SuperH (worst code density, RISC instruction overload, first to die)
|
V
Thumb (normal embedded RISC died next, less RISC instruction overload)
|
V
Thumb2 (new improved code density RISC, tolerable RISC instruction fluff)

Some people say code density doesn't matter. It would take a lot of coincidences for the fattest RISC ISAs to have died first, been replaced by the less fat ones, and then for those to be replaced by the least fat ones today. Wouldn't it just be easier to use CISC ISAs, which can beat them all in code density? Wouldn't it be easier to use the best code-density CISC ISAs, to avoid being replaced like the RISC ISAs were? Where could we find one of the best code-density ISAs, with good performance traits and without as much decoding overhead as a fat CISC with a bunch of prefixes?

Last edited by matthey on 27-Sep-2022 at 06:38 AM.
Last edited by matthey on 27-Sep-2022 at 04:06 AM.
Last edited by matthey on 27-Sep-2022 at 03:58 AM.

bhabbott (Regular Member, from Aotearoa)
Posted on 27-Sep-2022 4:47:33, post #158

@matthey

Quote:

matthey wrote:
bhabbott Quote:

If the prefixed instruction executes much faster and is smaller than the vanilla code it replaces, it's still worth it.


Sigh. Code density is more about instruction cache efficiency.

And smaller code uses less cache. So you agree with me.


Quote:
There is also more decoding overhead for a prefix.

True. This is balanced against how much work the instruction does.

Quote:
Gunnar's logic is likely similar to yours. The 64 bit features will be rarely used so it is fine if they are high overhead using a prefix as long as they have a low resource cost now. Adding register banks and having a large register file is cheap in FPGA with a low CPU clock rate and may give some extra performance so plan for today. There will never be an ASIC either so optimize the ISA and design for a FPGA that will save resources and give benefits today. It is poor planning with a self fulfilling prophesy that the future will never come.

A bird in the hand is worth two in the bush.

I don't know about Gunnar's plans, but I doubt there will be an ASIC in the 68080's near future whatever its architecture. The huge advantage of FPGA is that it can be one thing today and something quite different tomorrow, without having to change any hardware. This protects and extends the user's investment. My Vampire is significantly better now than when I bought it a few years ago, simply by uploading a new bitstream. If it was an ASIC I would be stuck with outdated and possibly buggy hardware.

My personal wish is to be able to put user HDL into the FPGA to create 'hardware' modules for specific tasks, like PSoC MCUs do. This may never be realized either, but at least it is only some HDL code away rather than a million dollar ASIC that will probably be outdated the moment it is created.

In a month's time I will be 65 years old and on the pension, finally getting time to finish some of my projects. But how much time do I have left? One thing is becoming plain - I will not have enough time to turn all my dreams into reality. So I don't care much about some pie-in-the-sky CPU architecture that probably won't make it into silicon in my lifetime (if at all).

Like the saying goes "The perfect is the enemy of the good". As we get older we tend to want more and more perfection, with the result that nothing actually gets finished - until one day it's too late. This is the problem Commodore had with AAA - the engineers had grand plans for a chipset with 'bleeding edge' features and performance, but development dragged on and then they had to rush out the barely adequate AGA chipset before it was too late. If only they had settled for a less 'perfect' design in the first place they might have gotten AGA out 2 years earlier and the Amiga could have stayed relevant for a while longer (during which time they could have been developing the next generation chipset to replace it).

Aiming high is fine, but at some point you have to say "this works well and is good enough for what we need today". Look at the history of commercial CPU development and you will see that successful designs were 'good enough', while the 'perfect' ones never actually got perfected. Today there are so many variations on x86/64 it makes your head spin - all coming from the 8088 which was basically an 8080 with registers extended to 16 bits and prefixes thrown in to get more instructions. Nobody would have designed it that way today, but it was a good solution for what they needed at the time.

Quote:
A x86 prefix is 1 byte while a 68k prefix is 2 bytes which is twice as much code increase.

Not really. The average 68k instruction is 2 or 4 bytes long, so a 2-byte prefix makes it less than double the size on average. Depending on what operands the instruction needs, it could be a lot less. So while you are technically correct that the 'increase' is twice as large, this doesn't mean what it sounds like. Furthermore, with a 16-bit prefix some bits could be used to extend the instruction's functionality, similarly to how the extension word in 68k addressing mode 6 specifies the index register as well as the byte displacement.
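To put rough numbers on that (my own illustrative model, not from any study):

/* Overall code growth from adding an n-byte prefix to a fraction of all
   instructions: growth = prefix_rate * prefix_len / avg_insn_len.
   E.g. 3-byte average instructions, 5% of them prefixed, 2-byte prefix:
   0.05 * 2.0 / 3.0 = ~3.3% code size increase. */
double code_growth(double avg_insn_len, double prefix_rate, double prefix_len)
{
    return prefix_rate * prefix_len / avg_insn_len;
}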

Quote:
A 2 byte prefix can hold twice as much data so 64 bit extensions and extra register accesses could be placed in one prefix but then that wouldn't be common "if only a few instructions are prefixed" and extra registers shouldn't be needed as often as the 68k normally has 16 GP register while x86 only has 8 without a prefix.

x86 used prefixes in an attempt to stay close to the 8080 ISA (to ease the porting of CP/M assembly code, which was considered important at the time). Prefixes were applied to many instructions to make up for the lack of GP registers. Opcodes referencing 8 bit registers were extended to 2 bytes as it was considered that 16 bit registers should get the shorter opcodes. This made the 8088 slower than a Z80 running equivalent code at the same clock speed.

On the 68k only a few instructions might be prefixed, but that doesn't mean the extra overhead of 2-byte prefixes is onerous. On the contrary: since they are rarely used, their impact should be small. Also, these are expected to be powerful instructions that save many bytes and cycles over equivalent 'vanilla' code. Any saving from going to 1-byte prefixes could be gobbled up by alignment issues, including extra memory cycles and more gates required to steer the bytes around inside the CPU.

Last edited by bhabbott on 27-Sep-2022 at 04:57 AM.
Last edited by bhabbott on 27-Sep-2022 at 04:53 AM.

cdimauro (Elite Member, from Germany)
Posted on 27-Sep-2022 6:01:22, post #159

@matthey

Quote:

matthey wrote:
@Gunnar
Are you really Gunnar von Boehn or are you one of his goons creating a fake account to troll and attack us?

He's the one: his "style" is unique.
Quote:
Gunnar Quote:

Do you really think, that more CPU register would give problem when developing an ASIC?
I saw you posting such before. This is of course absolute nonsense.

More register are no problem at all for going ASIC:
This should be obvious to everyone. As every CPU made today has many more registers.
IBM Power have over hundred of register, same for INTEL, AMD, and ARM.


What ASIC CPU has 48x64b integer registers? Large register files can become a timing problem at higher clock speeds.

https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf Quote:

The register file access time is one of the critical delays in current superscalar processors. Its impact on processor performance is likely to increase in future processor generations, as they are expected to increase the issue width (which implies more register ports) and the size of the instruction window (which implies more registers), and to use some kind of multithreading. Under this scenario, the register file access time could be a dominant delay and a pipelined implementation would be desirable to allow for high clock rates.

...

Most current dynamically scheduled microprocessors have a RISC-like instruction set architecture, and therefore, the majority of instruction operands reside in the register file. The access time of the register file basically depends on both the number of registers and the number of ports. To achieve high performance, microprocessor designers strive to increase the issue width.


Some CPUs have used a duplicate shadow register file because of timing constraints of a large register file.

https://en.wikipedia.org/wiki/Register_file#Microarchitecture Quote:

The Alpha 21264 (EV6), for instance, was the first large micro-architecture to implement a "Shadow Register File Architecture". It had two copies of the integer register file and two copies of the floating point register located in its front end (future and scaled file, each containing 2 read and 2 write ports), and took an extra cycle to propagate data between the two during a context switch. The issuing logic attempted to reduce the number of operations forwarding data between the two and greatly improved its integer performance, and helped reduce the impact of the limited number of general-purpose registers in superscalar architectures with speculative execution. This design was later adapted by SPARC, MIPS and some of the later x86 implementations.

The MIPS uses multiple register files as well. The R8000 floating-point unit had two copies of the floating-point register file, each with four write and four read ports, and wrote both copies at the same time with a context switch. However, it did not support integer operations, and the integer register file still remained as such. Later, shadow register files were abandoned in newer designs in favor of the embedded market.


While CISC CPU cores usually have fewer registers, they usually use more ports to support more powerful instructions. It would appear that the Apollo core has many registers and many ports. It may not be a problem for a low-clocked, cheap FPGA-to-ASIC conversion, though. Nor could the extra register banks be power gated as effectively as shadow register banks for fast interrupts.

I don't know how the register files are used / implemented on modern processors, but currently there are around 200 (Intel, AMD) to 300 (Apple) internal registers used for register renaming.
Quote:
The bigger question is whether those extra integer registers are worth the resources. At the very least a larger register file uses more transistors and power.

They are very important for OoO micro-architectures, and that's why there are so many in modern ones.

But the internal register file is one thing, and the number of registers exposed to programmers is a completely different thing.
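A minimal sketch of that distinction (my own illustration; real renamers use free lists and checkpointing):

/* The ISA exposes a handful of architectural registers; the core maps each
   new destination onto a much larger physical file (roughly 200-300 entries
   on current Intel/AMD/Apple cores). */
enum { ARCH_REGS = 16, PHYS_REGS = 256 };

static int map[ARCH_REGS];           /* architectural -> physical */
static int next_phys = ARCH_REGS;    /* trivial allocator stand-in */

int rename_dest(int arch_reg)
{
    map[arch_reg] = next_phys;
    next_phys = (next_phys + 1) % PHYS_REGS;   /* no reclamation: sketch only */
    return map[arch_reg];
}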
Quote:
There has been research on how many registers are optimal before diminishing returns set in. If you were the real Gunnar, you would know that I already showed you the research all the way back on the Natami forum, and that you ignored it along with the general consensus of developers that more GP integer registers were unnecessary. Let's look again at the CISC research called "Performance Characterization of the 64-bit x86 Architecture from Compiler Optimizations’ Perspective".

https://link.springer.com/content/pdf/10.1007/11688839_14.pdf Quote:

Figure 7 gives the normalized execution time (with respect to the base compilation using 16 GPRs and 16 XMM registers) with the REG_8 and REG_12 register configurations for the SPEC2000 benchmarks. On average, with the REG_8 configuration, the CINT2000 exhibits a 4.4% slowdown, and the CFP2000 exhibits a 5.6% slowdown; with the REG_12 configuration, the CINT2000 is slowed down by 0.9%, and the CFP2000 is slowed down by 0.3%. Clearly, these results show that REG_12 already works well for most of the SPEC2000 benchmarks.


On x86-64, the integer performance gain from 8 to 16 GP registers was 4.4% and from 12 to 16 GP registers was 0.9%. This data would suggest that more than 16 GP registers would gain less than the 0.9% performance difference from 12 to 16 GP registers.

That looks really strange, because I saw a net performance increase going from IA-32 to x86-64. I mean: exactly the same applications performed better on the same processor, which supported both IA-32 and x86-64.

My experience disassembling x86-64 code has shown me that more than 8 registers are often used. That makes sense, because its ABI passes parameters in the available registers, whereas IA-32 uses the stack.

So, IMO, 16 registers are a good thing for x86-64 and improve its performance.

And IMO it should be a must-have on any architecture. Imagine a 68k crippled to just 8 registers (let's assume there's no data/address register distinction: they all work the same): would it have had the same performance? I strongly doubt it.
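That ABI difference is easy to see with any compiler (standard calling conventions; the assembly in the comments is simplified, my own illustration):

/* int add2(int a, int b) called as add2(1, 2):

     IA-32 cdecl:      arguments go on the stack
         push 2
         push 1
         call add2

     x86-64 System V:  arguments go in registers, no stack traffic
         mov esi, 2
         mov edi, 1
         call add2
*/
int add2(int a, int b) { return a + b; }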
Quote:
There is likely more room for performance gains from improving code density than from adding registers beyond the 16 GP registers of the 68k, whether for embedded use or the desktop. For sure, no CPU ASIC will be created for embedded use with 48 GP integer registers.

Actually not even a desktop-class processor...
Quote:
The 64 bit SIMD unit with no floating point is a joke for the desktop market.

It is. But it's ok for the Apollo project, since it has a limited scope / audience.
Quote:
I haven't written much about AMMX, although I don't need to know any more than that it uses 64 bit SIMD unit registers without floating point support.

That's a pity for you, because you missed the chance to see how those instructions were designed: he continued Motorola's tradition of patching things into the opcode space to get what he needs, crippling the design.

But as long as the product is tailored to Amiga aficionados, it's OK: it serves its purpose.
Quote:
That alone disqualifies it from desktop competitiveness. The 48 integer GP registers with it disqualify it from embedded consideration as a reduced-resource SIMD unit for potential customers who need more than a MAC unit for DSP workloads.

Forget it: the Apollo core has a completely different retro niche / micro-market.
Quote:
The Apollo core documentation is bad.

Indeed: absolutely unprofessional. It's a pain to try to get a realistic picture of the project.
Quote:
I still think you are a Vampire cult troll and not really Gunnar. Probably Bosanac.

No, he's the original one. No doubt about that.

But fortunately we aren't in one of his kingdoms, so he can't ban us for lèse-majesté (hey, vox: he did it again to you!).


@Karlos

Quote:

Karlos wrote:
@cdimauro

I don't want to toot my own horn, but I actually built that

Good for you.

Have you made an RTL/HDL design (that was the discussion before) out of one of your architectures?

cdimauro (Elite Member, from Germany)
Posted on 27-Sep-2022 6:12:31, post #160

@matthey

Quote:

matthey wrote:
cdimauro Quote:

I finally had the time to read the above paper. It only applies to Thumb (the first one). In fact, Thumb-2 solved all the issues reported in that paper, and in a much more elegant way.

So the 30% instruction count increase doesn't apply anymore; it's much reduced (compared to the original ARM).


Thumb's up-to-30% instruction increase is bad. Thumb2 is much better, but still has elevated instruction counts (and memory accesses). Let's compare using Dr. Vince Weaver's study.

Compared to ARM EABI
SH-3 +36% instructions
Thumb +24% instructions
Thumb2 +12% instructions

The Thumb +24% instruction increase fits with the up-to-30% increase found by the paper. Here Thumb2 uses +12% instructions, half of Thumb and a third of SH-3. This is a big improvement, and not too bad considering that code density improved with each ISA revision. The comparison is with the original 32-bit fixed-length ARM ISA with only 15 GP registers, though (the PC is one of the 16). RISC ISAs without 32 GP registers have increased instruction counts, while CISC ISAs with 16 GP registers can have instruction counts competitive with RISC ISAs that have 32 GP registers. Let's compare instruction counts to the original embedded champion that these compressed RISC ISAs poorly copied.

Compared to 68k
SH-3 +47% instructions
Thumb +35% instructions
Thumb2 +21% instructions

The 68k was still champion in both instruction counts and code density when Motorola decided to replace the 68k embedded champion with PPC.

The problem with Dr. Vince Weaver's study is that it's very, very limited: it covers a very small application (even the biggest version used is small), finely hand-tuned in assembly language.

So it depends on the skill of the person who wrote the source code for each specific architecture.

In fact, you and the other Italian guy made several changes to the 68k source, optimizing it as much as possible for this ISA. I did the same in this thread with the IA-32 and x86-64 versions, also getting better results.

For these reasons I strongly doubt that the sources for all architectures are perfectly optimized.

Anyway, the application is too small: bigger and more general-purpose / real-world ones are needed for better insight.
Quote:
Eventually, newer embedded designs in newer chip fab processes were lower power and more powerful, but embedded customers often kept using the old 68k designs for a decade or more after Motorola neglected the 68k. ARM chips had another advantage: cost. ARM licensed out their core designs far and wide while encouraging new designs, while Motorola protected the baby they were killing off, using the ~300% markup above cost to invest in new PPC designs.

PPC low-end and embedded designs had consistently disappointing performance, and the reason is code density. PPC CPUs needed instruction caches 4 times the size of the 68k's to get the same performance, but they didn't get them; the 4kiB and 8kiB I-caches of early PPC designs ruined the reputation of PPC. Higher-performance PPC CPUs which initially did well soon hit the performance wall when they couldn't increase the cache size any more. The PPC 604e was one of the highest-performance CPUs when it launched with 32kiB I+D L1 caches, but it couldn't be clocked above 233MHz with such large L1 caches. This was already the fastest clock speed among competitors with 32kiB I+D caches (see fig. 3 of "The Microprocessor Today", http://www.ee.unlv.edu/~meiyang/ecg700/readings/micro-today.pdf).

The Sun UltraSPARC had to drop back to 16kiB I+D caches to reach 250MHz, and the Digital Alpha down to 8kiB I+D L1 caches to reach 500MHz (the instruction bottleneck prompted them to innovate the L2 cache, but they had already made the fatal mistake of thinking code density doesn't matter and designing the Alpha ISA around that belief). The Alpha CPU used the most transistors of its competitors, most of them for caches. The second-most transistors went to the PA-RISC PA-7300LC, which only clocked up to 160MHz, so 64kiB I+D caches were possible. The PA-8000 moved all the caches off chip, but this made for a very expensive CPU chip with lots of pins. PA-RISC was the RISC ISA with the second-worst code density behind the Alpha, and they had cache issues too. The Pentium Pro had the smallest L1 caches in this high-performance competition.

http://www.ee.unlv.edu/~meiyang/ecg700/readings/micro-today.pdf Quote:

Intel’s Pentium Pro has the smallest on-chip caches of any high-performance processor: a mere 8 Kbytes each for instructions and data. This is because the custom-designed level-two cache chip (described earlier) is mounted in the same package as the processor and can deliver near on-chip speeds. Most other high-performance processors have on-chip level-one caches of either 16 or 32 Kbytes each for instructions and data. HP’s 7300LC has the largest caches, at 64 Kbytes each.


The PPro still had only 8kiB I+D L1 caches while the L2 was off chip (the stacked-chip design cuts wire distances, boosting performance, and is returning to popularity today). The PPro was only middle of the pack in performance, but it had room to be clocked up and to double the cache size.

While the PPC 604e was stuck at 233MHz and couldn't be clocked higher due to the large caches, the PPC 603e with its smaller 16kiB I+D caches out-clocked it; the 4-stage pipeline likely limited performance, yet it still reached 300MHz on the same process size (none of the high-performance CPUs in figure 3 reached 200MHz with fewer than 5 pipeline stages, so I suspect the PPC 603e snuck in another die shrink to reach 300MHz). Steve Jobs complained about the low PPC clock speeds, which were likely due to large L1 caches and shallow pipelines. PPC only started to improve again when the L2 cache was improved with the G3 and finally placed on chip.

Indeed. Nice study. What impressed me are the performance results for HP's top PA-RISCs: impressive! And surprising for those architectures because, as you said, their code density is among the worst.
Quote:
The early RISC ISAs designed with no consideration for code density were a mistake.

Well, what do you expect from the RISC propaganda: they trumpeted that fixed-length instructions were A Good Thing, while variable-length encodings were seen as the black plague and "banned".

But then their code density sucked a lot, and they "embraced" variable-length encodings. Nice 180-degree turn...
Quote:
The best 32-bit code density ISA was the 68k, and the 68060, with an 8-stage pipeline (deeper even than the 500MHz Alpha 21164) and only 8kiB I+D caches, reached only 66MHz, for embedded use only; even with a die shrink most chips were marked 50MHz. I've never seen a rev6 MC68060 marked above 50MHz even though most clock to 100MHz. Having 4 times the code in the instruction cache makes the 8kiB I-cache act like the 32kiB I-cache of a 604e, and at least leaves enough memory bandwidth to service data cache misses. Increasing to 16kiB I+D caches and raising the clock to around 250MHz would have been better than pushing the clock speed further. With so few instructions, so few memory accesses, such easy instruction scheduling and leading code density, it would have been a little powerhouse champion of embedded for longer, and maybe would have made people rethink using it for the desktop. That's exactly what Motorola didn't want, as they AIMed to replace the 68k with PPC while ignoring the PPC cache issues.

Indeed. It was a pity that Motorola abandoned one of the best processors ever for a newcomer that offered really nothing new or useful. Bah...
