The (Microprocessors) Code Density Hangout
matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 27-Jul-2025 16:23:27
#381 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2818
From: Kansas

Hammer Quote:

For embedded microcontroller workloads, the cheapest RISC-V implementation is good enough, hence, many single-purpose embedded vendors are using RISC-V e.g., mass storage devices.


There is a sizable jump from MCU limitations to a Cortex-A55 SoC. Some MCUs allow clocking up CPU cores for good performance using SRAM, but they mostly target lower power and so have limited amounts of SRAM. ARM dropping 32-bit ISA support for Cortex-A SoCs creates a large gap between 32-bit MCUs and 64-bit SoCs. SiFive series 7 SoCs are a nice hybrid between a SoC and a MCU. They allow the caches to be configured as SRAM like a MCU, which gives more SRAM than most MCUs at over 2MiB. The in-order series 7 cores are small enough that the SoCs are still affordable, unlike SoCs using OoO cores. RISC-V cores are likely significantly smaller than AArch64 cores with their much larger ISA and SIMD requirements. ARM is creating an opening for low-end embedded competition that RISC-V is exploiting. Whether RISC-V can scale up in performance with a mess of extensions has yet to be seen. Code density is not the only performance metric.

Hammer Quote:

Quote:

RISC versus CISC: A Tale of Two Chips
https://dl.acm.org/doi/pdf/10.1145/250015.250016


From https://dl.acm.org/doi/pdf/10.1145/250015.250016

On the SPEC92 suite, the RISC system is 16% to 53% faster than the CISC system on the integer benchmarks, with a 39% higher SPECint92 rating. On the floating point benchmarks, the RISC system is 72% to 261% faster, with a 133% higher SPECfp92 rating. On the SPEC95 suite, the Alpha 21164 is 5% to 68% faster on the integer benchmarks with a 22% higher SPECint95 rating; and 53% to 200% faster on FP benchmarks with a 128% higher SPECfp95 rating.


The Alpha 21164 or EV5 became available in 1995 at processor frequencies of up to 333 MHz. In July 1996, the 21164 line was ramped to 500 MHz.


The Alpha 21164 was still ahead in performance, but at what cost? Price efficiency (performance/$) and power efficiency (performance/W) are important. The reason I dug up this old article is that this was the turning point and the writing was on the wall. The Alpha 21164 performed reasonably well in the SPEC benchmarks, but the P6 had 80%-90% of the integer performance at half the clock speed, less than half the power, and a price that could be nearly half judging by the transistor counts. The Alpha 21164 may have been worth it where high FPU performance is required, but integer performance is more important for most applications, where it does not offer value over the P6. Furthermore, x86 software investments could be maintained with the P6. It was clear that RISC has a major memory bottleneck and CISC has a major integer performance advantage. The x86 ISA was not even a good example of CISC, as only 6 GP integer registers caused elevated instruction counts and increased data memory traffic. The 32-bit x86 with 6 GP integer registers, poor orthogonality and a bad stack-based FPU with 8 stack registers more than held its own against the 64-bit Alpha with 32 GP integer registers and 32 GP FPU registers. The 32-bit 68060 has 16 GP integer registers, good orthogonality and a good FPU ISA with 8 GP FPU registers, and it was obviously better than the in-order P5 Pentium equivalent. Motorola pulled the plug on the 68k for a RISC ISA more like Alpha though. I guess they could not read the writing on the wall.

Hammer Quote:

HP PA-8000

...

For the 1995 release, your argument doesn't address the $50 BOM cost range for the Amiga Hombre's CD3D game console and A1200 replacement.


Does that include the price of doubling or quadrupling the memory and instruction caches? How much cheaper was PPC AmigaOS 4 hardware with the enlarged memory footprint and the inability to use the PPC Efika because it did not have enough memory for fat PPC? RISC looks cheap when just considering the pipeline logic cost but when considering the caches and memory required and OoO to reduce the performance deficit from stalls, is it really cheap?

Last edited by matthey on 27-Jul-2025 at 04:32 PM.
Last edited by matthey on 27-Jul-2025 at 04:25 PM.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 17-Aug-2025 6:51:57
#382 ]
Elite Member
Joined: 29-Oct-2012
Posts: 4495
From: Germany

I've been busy with my new architecture lately. I've analyzed (again, for some) several architectures (ARM/Thumb-2, AArch64, ARM Helium, ARC, MRISC32, Xtensa, NanoMIPS, RISC-V with several extensions) to see if something was missing in both the code density and performance areas. I've also read some books about LLVM and the language reference (which was inspiring and, I think, should be a primary source when designing an ISA: directly mapping an instruction to the corresponding LLVM IR "instruction" is like winning the jackpot).

From all those activities I've seen some common patterns regarding what was introduced by the various ISAs to improve code density. I already had a good base with my architectures (especially the last one), but I've introduced a few more instructions to cover some aspects. What I actually miss are some (short/compact) instructions like the following:

Load/stores Reg,[RegBase + RegIndex]
Load/stores Reg,[RegBase + SmallImmediate]
ADD Reg1,Reg2,Reg3
SUB Reg1,Reg2,Reg3
ADD Reg1,Reg2,SmallUnsignedImmediate
SUB Reg1,Reg2,SmallUnsignedImmediate
SHIFT Reg1,Reg2,SmallUnsignedImmediate
SEXT Reg1,Reg2
ZEXT Reg1,Reg2
NOT Reg1,Reg2
NEG Reg1,Reg2
ADD Reg1,Reg2 * 1/2/4/8
BIT Reg,uimm5
ADD/SUB/CMP/MOV uimm8
MOVE2 (MOVE PAIR) RegDest1,RegDest2,RegSource1,RegSource2
AND Reg,SmallUnsignedImmediate
MUL Reg1,Reg2
SEXT Reg
ZEXT Reg

Some of them require a considerable amount of opcode space, which I don't have anymore (I've almost completely exhausted the 16-bit space). Some of them are particularly difficult to get with an architecture like mine (and for the 68k as well, since I share a key aspect of this ISA regarding code density), because it would require way too much encoding space.

That could be implemented in the future AFTER I get a backend and am able to generate statistics about which instructions are used most. In fact, with the latest changes, I might have the chance to free up a considerable amount of opcodes since I've added many more addressing modes (now I'm very close to 40 of them, albeit some are "mutually-exclusive" and/or reserved to specific "families" of instructions).

My gut feeling is that several load/store (MOV, in my jargon) instructions might be removed because my ISA is not a LD/ST one, but allows directly referencing up to 3 memory operands per instruction. So, the actual compact load/store instructions might be "absorbed" by regular (and longer) instructions using one of the addressing modes. It's a (simplified, yet quite extended) CISC: load + op or op + store are at its foundations.
Anyway, without numbers, I can't take further decisions about how to (re)model the ISA.

BTW, from NEx64T I've lost a bit of code density (+10 bytes now from 68k's ll.m68k_2_bis.s, instead of the +4 from the old ISA), but I've further reduced the number of executed instructions (-21 now, thanks also to the AArch64 example: comparing them, I found that with my new architecture I've taken very similar decisions, which explains the close results) using fewer registers (-3 from 68k).

Losing a bit of code density wasn't a drama, because an architecture is a synthesis, a compromise of several goals. The small amount that I've lost in code density is the big amount that I've gained in terms of simplified instruction formats, a whole set of very flexible bitfield formats (not instructions: formats!), several changes and new instructions for much better FP & integer scalar support in very common scenarios, and I now have an entire block for a complete set of predication instructions (like Thumb-2 IT and, especially, My 66000. But much broader and more general). I also have preliminary support for a 128-bit ISA, but I've yet to define some details (which aren't urgent).

Last but not least, I've found some interesting statistics about code density that I'll add to the first page of the thread once I finish replying to some pending posts.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 17-Aug-2025 7:01:11
#383 ]
Elite Member
Joined: 29-Oct-2012
Posts: 4495
From: Germany

@matthey

Quote:

matthey wrote:
cdimauro Quote:

matthey - does not work for immediates above a certain threshold like 8-bit or 16-bit

Could you please elaborate on the last point?


Take for example a 32-bit fixed-length RISC instruction encoding that has a 16-bit unsigned immediate field. A 17-bit unsigned immediate is above the threshold and unsigned/zero extension does not work. With simple RISC ISAs, after the immediate threshold has been exceeded, a 2nd immediate is usually loaded, shifted and combined with an or operation. Sign and unsigned extension work on integers of any number of bits, but encoded immediate fields create arbitrary thresholds. Even with a variable-length encoding like the 68k and my #d16.w EA encoding for immediates, signed values outside what a 16-bit two's complement integer can hold are above the threshold and sign extension cannot be performed. The variable-length encoding allows a 32-bit integer encoding of the immediate used by a single instruction instead of adding 3 dependent instructions with the 32-bit RISC encoding.

Going back to the BA2 ISA examples.

https://www.chipestimate.com/Extreme-Code-Density-Energy-Savings-and-Methods/CAST/Technical-Article/2013/04/02 Quote:





addq.l #$3,d7 ; 16-bit encoding adding 3-bit unsigned immediate (add of 8-bit immediate)

add.l #$6000,d7 ; 48-bit encoding with current 68k
*or*
add.l #$6000.w,d7 ; 32-bit encoding with my #d16.w EA sign extension (add of 16-bit immediate)

add.l #$7fffffff,d7 ; 48-bit encoding with current 68k (add of 32-bit immediate)


The example add of a 32-bit immediate in the pic has to be wrong for Thumb-2 and MIPS, as a 32-bit immediate cannot fit in a 32-bit encoding. The immediate field thresholds of the Thumb-2 and MIPS ISAs were exceeded and there is no encoding larger than 32-bit, so a 2nd immediate has to be introduced and combined with the 1st immediate using dependent instructions. This requires at least another 32-bit instruction and may require two more instructions. Once a RISC threshold is surpassed, the code bloat starts, which is not only code size but also multiple dependent instructions to execute. The scalability of BA2 immediates shows this does not have to be the case for a load/store architecture.

OK, now it's clear and I fully agree with you.
Quote:
However, what if the variable is in memory?


addq.l #$3,(a0) ; 16-bit encoding adding 3-bit unsigned immediate (add of 8-bit immediate)

add.l #$6000,(a0) ; 48-bit addi.l encoding with current 68k (add of 16-bit immediate)

add.l #$7fffffff,(a0) ; 48-bit addi.l encoding with current 68k (add of 32-bit immediate)


My EA mode compression idea does not work with addi unfortunately as the immediate is not an EA. (it does work with move.l EA,EA though).

Unfortunately, that's the only instruction with two EAs.

A generic add.size EA,EA would have allowed much more flexibility, but requiring a larger opcode and, more important, some challenges (e.g.: two reads and one write in memory for a single instruction).
Quote:
The 68k still has a large advantage compared to load/store architectures though.


BA2:
load r7,(r6) ; at least 16-bit encoding
b.addi r7, r7, 0x03 ; 16-bit encoding
store (r6),r7 ; at least 16-bit encoding

load r7,(r6) ; at least 16-bit encoding
b.addi r7, r7, 0x6000 ; 24-bit encoding
store (r6),r7 ; at least 16-bit encoding

load r7,(r6) ; at least 16-bit encoding
b.addi r7, r7, 0x7fffffff ; 48-bit encoding
store (r6),r7 ; at least 16-bit encoding


None of the BA2 examples at the link access memory, where it would not want to be compared to the 68k.

I don't think that they wanted to avoid a comparison with the 68k, because BA2 was developed when our beloved architecture was already abandoned by Motorola/Freescale.

BA2's direct competitors were other LD/ST architectures, and the examples they used are based on those common scenarios (albeit I doubt that Thumb-2 and MIPS can generate a big constant like 0x7FFFFFFF with a single instruction).
Quote:
but likely is still better code density than Thumb-2 and no doubt is better than MIPS, especially where MIPS stands for Microprocessor without Interlocked Pipelined Stages like R2000 and R3000 cores.


MIPS:
lw r7,(r6) ; 32-bit encoding
nop ; 32-bit encoding load-to-use delay slot without Interlocked Pipelined Stages
addi r7, r7, 0x03 ; 32-bit encoding
sw (r6),r7 ; 32-bit encoding

lw r7,(r6) ; 32-bit encoding
nop ; 32-bit encoding load-to-use delay slot without Interlocked Pipelined Stages
addi r7, r7, 0x6000 ; 32-bit encoding
sw (r6),r7 ; 32-bit encoding

lw r7,(r6) ; 32-bit encoding
lui r8, 0x7fff ; 32-bit encoding load upper 16-bit immediate instruction
ori r8, 0xffff ; 32-bit encoding
add r7, r7, r8 ; 32-bit encoding
sw (r6),r7 ; 32-bit encoding


The last MIPS example filled the load delay slot with an independent instruction which saved an instruction for the cost of using another register. The 68k code is one instruction using 6 bytes and one register where the MIPS code is 5 instructions using 20 bytes and 3 registers. The R2000/R3000 have single cycle throughput pipelined instructions though. The R4000 would have needed two NOP instructions in the first two examples but interlocked pipeline stages were added to stall the pipeline instead of bloating the code with NOP instructions. The last example is R4000 core ready and will not stall while the first two examples would stall for a cycle even with the NOP instructions. RISC simplification is a pain in the programming ass but to add insult to injury, the performance potential is lower.

That's why the delay slot was removed on more modern processors.

BTW, NanoMIPS allows loading full 32-bit constants, because it has a 16/32/48-bit variable-length encoding. It took a long time for MIPS to finally deliver a competitive architecture regarding code density, but it was too late.
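The lui/ori splitting that kicks in above the 16-bit immediate threshold can be sketched as follows (a hypothetical helper for illustration, not from any real assembler):

```python
def split32(imm: int) -> tuple:
    """Split a 32-bit constant into the two 16-bit halves a MIPS-style
    ISA needs once the immediate no longer fits one instruction:
    the upper half goes into lui, the lower half into ori."""
    imm &= 0xFFFFFFFF
    upper = imm >> 16          # lui r8, upper
    lower = imm & 0xFFFF       # ori r8, r8, lower
    return upper, lower

# 0x7FFFFFFF from the MIPS example needs lui 0x7FFF + ori 0xFFFF
assert split32(0x7FFFFFFF) == (0x7FFF, 0xFFFF)
```

With a variable-length encoding the full constant travels inside a single instruction instead, which is exactly what the 48-bit encodings discussed here buy.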
Quote:
cdimauro Quote:

Do you mean because the 68k FPU can load a FP32 immediate data, which is (automatically) expanded to extended precision? Or... can load integer immediate data. Or both?


The vasm peephole optimization can reduce the precision of fp immediates from extended precision down to double precision and from double precision down to single precision. If half precision was supported, it could further compress single precision down to half precision if possible. It is true that the 68k FPU supports integer immediates as well which could give the same compressions as half precision but the integer to fp conversion performance loss would make this a -Os compiler option.

I assume that the performance loss is only due to the current implementation. I mean, another 68k processor could do the conversion much faster.
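The exactness check behind that kind of peephole optimization can be sketched like this (a hypothetical test, not vasm's actual code):

```python
import struct

def fits_single(x: float) -> bool:
    """True if the double x survives a round trip through IEEE single
    precision, i.e. the fp immediate can be narrowed losslessly."""
    return struct.unpack('<f', struct.pack('<f', x))[0] == x

assert fits_single(0.5)      # exactly representable in single precision
assert not fits_single(0.1)  # needs the full double mantissa
```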

BTW, the 68k FPU lacks 64-bit integers, which is really strange, since even the weaker x87 FPU supported them (on load and store. Which is enough, considering how the x87 worked).

In the last days there's a discussion on EAB about Apple's usage of the 68k FPU to do 64-bit integer arithmetic. It wasn't a hack, as someone said, because it's a perfectly legit way to use the 68k FPU (which, BTW, supported up to 65-bit integers: one more bit, because the sign bit is separate from the 64-bit mantissa).
I don't know how useful it was, because I would have preferred to implement the 64-bit int arithmetic using the regular CPU, especially considering that the 68k FPU completely lacks direct support for 64-bit ints.

On the contrary, the x87 supported this datatype (albeit only for loads/stores; but that's enough, because calculations are always performed using extended precision), and the FPU was used on PCs to move data much more quickly (BTW, x87 instructions are also very compact).
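A small sketch of why the 64-bit mantissa of the extended-precision formats matters here (assuming CPython's float is an IEEE double, which it is on all common platforms):

```python
n = (1 << 63) - 1             # INT64_MAX: 63 significant bits
# An IEEE double has a 53-bit mantissa, so the low bits are lost...
assert int(float(n)) != n
# ...while the 64-bit mantissa of the x87/68882 extended format can
# hold any 64-bit integer exactly (plus a separate sign bit).
assert n.bit_length() <= 64
```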
Quote:
cdimauro Quote:

But it supports loading come important constants (pi, log2 e, etc.) from its ROM, with the best precision possible (extended).


The FMOVECR 68k FPU instruction is not supported on the 68040 or 68060. It likely does improve code density and reduces extended precision immediates in data caches, but I have not examined it closely. It was not a rare instruction, being used by the SAS/C compiler, for example for Lightwave.

Dammit. Another bad decision from Motorola...
Quote:
cdimauro Quote:

Yes, but it has 16-bit immediates only on some basic/common instructions (some GP/scalar ones. Which includes loads/stores, of course).

Besides that, all instructions support 32 or 64 bit immediates.

The primary problem, however, is that the instructions are multiples of 32 bits, which I think is not so code-density-friendly for general purpose code. But I've no statistics about that (Mitch only shared the number of executed instructions against RISC-V, which looks to be around 70%).


Not to criticize Mitch, but a 16-bit base variable length encoding is more practical than a 32-bit base variable length encoding. Mitch is targeting a very high performance ISA with many registers using many encoding bits. It remains to be seen if code density sabotages his efforts.

That's precisely my concern. However, I think that his architecture is purely devoted to massive number crunching (HPC & co.).
Quote:
BA2 uses an 8-bit base variable length encoding which is very good for code density

The base is 16 bits. Then you can have 24-, 32- and 48-bit encodings.
Quote:
but I expect has some alignment and decoding disadvantages which may affect performance.

Not so much. As Mitch stated some time ago, logic is very cheap, and relaxing the alignment constraints offers much better benefits compared to the slowdowns due to instruction misalignment.
Quote:
Cast offers RISC-V cores in addition to BA2 cores but I would not expect performance to be any better with RISC-V. It may just be that BA2 is more proprietary and protectionism reduces proliferation compared to open RISC-V hardware.

It's a pity that they have embraced RISC-V, because BA2 should have much better performance.
Quote:
cdimauro Quote:

They are already CISCs because they lost all RISC "principles/pillars".

However, I'm preparing the popcorns for when RISC-V will introduce 48-bit (or even more) instructions...


RISC-V 48-bit Load Long Immediate Extension
https://github.com/riscvarchive/riscv-code-size-reduction/blob/main/existing_extensions/Huawei%20Custom%20Extension/riscv_LLI_extension.rst

Still under discussion, nothing ratified. But I bet that in some time they will become part of the RISC-V ecosystem, because the value is so high.
Quote:
https://www.reddit.com/r/RISCV/comments/zrpi3m/why_48bit_instructions/ brucehoult Quote:

Encoding for 48 bit, 64 bit, and longer instructions in RISC-V has not been ratified. The stuff in the ISA manual is just a sketch of how things might work eventually, so all suggestions are welcome.

I've made some myself, and Claire Wolf riffed off my suggestions a little:

https://github.com/riscv/riscv-isa-manual/issues/280


To date there are no 48 bit instructions (and no ratified way to encode them) and multiple companies have strongly resisted introducing the first 48 bit instruction in e.g. the Vector extension, with the unfortunate result that the FMA instructions had to be made destructive (the only such instructions in the 32-bit encoding) and come in two versions depending on which operand is destructed.

The same happens with AArch64 SVE instructions: the predicated versions are only available in the destructive format.

It's the price to pay for the common problem of having only 32 bits available for the opcodes: you can't fit all the beautiful (and needed!) stuff there.
Quote:
Personally I think this is a pity as 48 bit instructions do provide a meaningful increase in code density in ISAs such as S/360 and

Absolutely. And not only for having the ability to directly load 32-bit constants.
Quote:
nanoMIPS (which seems to be dead, but it looks to be a very nice post-RISC-V ISA).

It is, but it was too late, as I've said before.
Quote:
Quote:
Having 48 bit instructions would also allow for including the vtype in every V instruction instead of the hack of inserting special vsetvli instructions between pairs of vector instructions, and thus using 64 bits per actual work-doing instruction. Going straight to 64 bit would give no program size advantage.


The last post was 3 years ago but talks about a code density advantage to larger encoding sizes. It does not even mention the reduced instruction advantage. Of course any code taking advantage of larger instruction sizes with RISC-V extensions would require a recompile to gain the benefits that the 68000 had in 1979 and all 68k Amigas already have. RISC-V extensions would have duplicate encodings wasting encoding space compared to an ISA which planned for scaling immediates/displacements using variable length encodings from inception.

The primary problem is the academic Talibans, who don't look at such long instructions with favour. It would be a clear admission of their failure (and incompetence).

Well, they also failed with the recent ratification of the compact pushm/popm instructions: registers are pushed (and popped) in the opposite order of the RISC-V ABI, which requires some correction, as WD suggested.
Those people are living in a parallel world. Really...
Quote:
The 68k is ancient technology long forgotten like Roman cement and Baalbek quarrying. Maybe some day the technology will be rediscovered if it is not forgotten and lost first.

Unfortunately. But "sometimes they come back": there's a rediscovery of 68k systems, and if the tools supporting the ISA get updated and the ISA itself gets some needed updates, then there might be a second chance (likely in embedded systems).

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 17-Aug-2025 7:04:18
#384 ]
Elite Member
Joined: 29-Oct-2012
Posts: 4495
From: Germany

@Hammer

Quote:

Hammer wrote:
@cdimauro

Quote:
Again, this has NOTHING to do with the CODE DENSITY.


You're wrong. It's CODE DENSITY for the 3D use case.

You continue to do NOT understand, at all, what code density is about.

You are a lost cause...

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 17-Aug-2025 7:06:43
#385 ]
Elite Member
Joined: 29-Oct-2012
Posts: 4495
From: Germany

@OneTimer1

Quote:

OneTimer1 wrote:
Quote:

Hammer wrote:

Fact: MIPS displaced 68K in mainstream 32-bit/64-bit embedded game consoles.


despite poor code density and although RAM and caches were smaller back then

And some months ago I wrote some code for a router, an embedded device using a MIPS core.

Sure. MIPS gained acceptance in embedded systems because of another important factor: its cores are very small, yet powerful enough for many purposes.

A 16-bit ISA was also defined, to further reduce the core size.

matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 19-Aug-2025 3:03:51
#386 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2818
From: Kansas

cdimauro Quote:

I've been busy with my new architecture lately. I've analyzed (again, for some) several architectures (ARM/Thumb-2, AArch64, ARM Helium, ARC, MRISC32, Xtensa, NanoMIPS, RISC-V with several extensions) to see if something was missing in both the code density and performance areas. I've also read some books about LLVM and the language reference (which was inspiring and, I think, should be a primary source when designing an ISA: directly mapping an instruction to the corresponding LLVM IR "instruction" is like winning the jackpot).

From all those activities I've seen some common patterns regarding what was introduced by the various ISAs to improve code density. I already had a good base with my architectures (especially the last one), but I've introduced a few more instructions to cover some aspects. What I actually miss are some (short/compact) instructions like the following:

Load/stores Reg,[RegBase + RegIndex]
Load/stores Reg,[RegBase + SmallImmediate]



Short versions are more applicable to an 8-bit base VLE. The 68k 16-bit base VLE is already pretty optimal for these.

cdimauro Quote:


ADD Reg1,Reg2,Reg3
SUB Reg1,Reg2,Reg3



Better for reducing the number of instructions than improving code density, as they need to be a 32-bit encoding with a 16-bit base VLE. LEA/PEA already provide a 3-op add with 2 preserved source register operands and even allow adding an immediate and shifting one of the source registers too. Change/use register stalls may result in some cases though.
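As a sketch of what LEA computes with the full 68020+ addressing mode (illustrative Python, register values assumed):

```python
def lea(base: int, index: int, scale: int, disp: int) -> int:
    """lea (disp,An,Dm.l*scale),Ap: a non-destructive 3-operand add.
    Both source registers are preserved and no condition codes change;
    the scale may be 1, 2, 4 or 8, and a displacement can be folded in."""
    assert scale in (1, 2, 4, 8)
    return (base + index * scale + disp) & 0xFFFFFFFF

# Ap = An + Dm*4 + 8, all in one instruction
assert lea(0x1000, 3, 4, 8) == 0x1014
```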

cdimauro Quote:


ADD Reg1,Reg2,SmallUnsignedImmediate
SUB Reg1,Reg2,SmallUnsignedImmediate



These are missing from the 68k and could be useful. I have considered a 32-bit encoding that can also be used for ADDC/ADDX and SUBC/SUBX with flags to optionally carry/borrow the C and/or X bits. This would bring the 68k closer to x86 functionality which is able to ADDC/SUBC a small immediate as I recall. This came about because the 68000 only had 16-bit instructions with data extensions where a 32-bit instruction would save encoding space considering the frequency of use. They could be useful for reencoded 68k ISAs like the 68k64.

cdimauro Quote:


SHIFT Reg1,Reg2,SmallUnsignedImmediate



An encoding with 2 registers and a small immediate is likely going to be 32-bit for a 16-bit base VLE. For a 16-bit base VLE, a 16-bit encoding of a single register by a small immediate is likely the limit. The 68k already has this which would become the quick versions with a 32-bit encoding which allows a 2nd register and a full immediate.

cdimauro Quote:


SEXT Reg1,Reg2
ZEXT Reg1,Reg2



The ColdFire MVS and MVZ instructions would provide this for the 68k with any other names being better. I chose SXTB/SXTW and ZXTB/ZXTW but x86 MOVSX and MOVZX may be preferred by some developers. This functionality would be very good for the 68k and unlike x86(-64) MOVSX/MOVZX, the ColdFire instructions improve code density, albeit using quite a bit of encoding space. Where most 68k and ColdFire compiler backends are shared, it is compelling to use the ColdFire encodings so the existing ColdFire support can be turned on.

cdimauro Quote:


NOT Reg1,Reg2
NEG Reg1,Reg2



I am not so sure they are common enough to warrant a 16-bit encoding with a 16-bit base VLE. More likely they would be 32-bit encodings with 2 registers, and NEG could have optional bits for C and X to subtract as NEGC/NEGX. A small immediate could be subtracted as well, but that is maybe not worthwhile for NEG. A 32-bit encoding for an integer ABS reg,reg would be nice as well. Gunnar did not like the idea of an integer ABS instruction as he preferred to predicate over short branches and the code equivalent of ABS, which is the same size. There are some advantages to the ABS instruction still, but they are minor and he had a good argument, so I did not push the point. Not every 68k implementation would support predication of short branches, and I wanted a 68k ISA standard that others could use, including for emulation and ASICs, while he wanted a closed 68k ISA optimized for his personal FPGA toy core, which was far from the goals and openness of the original Natami project.

cdimauro Quote:


ADD Reg1,Reg2 * 1/2/4/8



This looks like LEA/PEA with CISC addressing modes again.

cdimauro Quote:


BIT Reg,uimm5



BIT from which ISA? It is a 5-bit immediate so I expect this is a bit number equivalent to the 68k BTST. A BTST type instruction is used and a 16-bit encoding should be possible, but it still may not be worthwhile. BTST, BCLR, BSET and BCHG allow a more compact encoding than the equivalent logic operation with a mask, but they require more hardware to execute, as the mask for the proper logic operation has to be created, which may require a shift and/or bit operation unit unavailable in some execution pipelines. The 68060 has ALUs with shift capabilities in both execution pipelines but it only has a bit operation unit in the pOEP. BTST, BCLR, BSET and BCHG are "pOEP-only" as an immediate-to-register instruction even though they are single cycle, so no superscalar operation. The 68020/68030 have better timing for the logic operations between an immediate mask and a register than BTST, BCLR, BSET and BCHG as well. Compiled code generally prefers the logic operations on the 68k and I do not believe this is unusual. Most 68k CPUs have a large performance penalty for fetching extra code, and 32-bit immediates can reduce performance, as can partial register writes for 8-bit and 16-bit immediates and for some logic operations. My OP.L #data16.w,Dn addressing mode reduces these problems when a 16-bit immediate suffices.
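The hardware point can be made concrete with a small sketch (Python for illustration, helper names are mine): a BTST-style instruction carries a bit number and must build the mask at execute time, while the plain logic op carries the mask itself as an immediate.

```python
def btst(reg: int, bit: int) -> bool:
    """BTST #bit,Dn semantics: build the mask (needs a shift unit),
    AND it in, and produce only the Z flag; the register is untouched."""
    mask = 1 << bit            # runtime mask generation
    return (reg & mask) == 0   # Z flag

def and_test(reg: int, mask: int) -> bool:
    """Equivalent logic-op form: the mask is an immediate baked into
    the instruction stream, so no shift/bit unit is required."""
    return (reg & mask) == 0

# Testing bit 3 of 0b1000: the bit is set, so Z is clear either way
assert btst(0b1000, 3) == and_test(0b1000, 1 << 3) == False
```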

The other option may be that BIT is from the 6502 but changed to accept an immediate where the immediate is a mask but the op is not modified. BIT is an AND operation that affects the condition codes but does not modify the accumulator.

http://www.6502.org/users/obelisk/6502/reference.html#BIT

A non-destructive AND would be useful even for a single register. I suggested that it is possible to encode the immediate in the destination for this purpose. This may allow AND.L reg,#imm32 and AND.L reg,#imm16.w that would not actually alter the immediate in the code. Some other instructions are common where the result wastes a register like ADD which could benefit while SUB already has a non-destructive version in CMP and not limited to immediates and a reg. I do not know whether this idea would work well in implementations but it is orthogonal. Gunnar did not like the idea perhaps because move+op instructions may be fused to 3 op instructions. He added CMPIW which at least shortens the most common case of move.l+op.l+bcc sequences and the encoding is not in Line-A like some AC68080 code density improving instructions including MOVIW, MOV3Q, MOVS and MOVZ. Maybe he moved MOVS (MVS) and MOVZ (MVZ) back to the original ColdFire locations though.

cdimauro Quote:


ADD/SUB/CMP/MOV uimm8



If "unsigned" 8-bit immediates are needed, they can be encoded in a 16-bit signed immediate. Generally, signed immediate compression gives better code density.
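A quick Python sketch (my own illustration, not taken from any compiler) of why a dedicated uimm8 format adds no range: every unsigned 8-bit value already fits in a signed 16-bit immediate, so signed compression subsumes it:

```python
def fits_simm(value, bits):
    # Two's-complement range of a signed immediate field `bits` wide.
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1

def smallest_imm(value, widths=(8, 16, 32)):
    # Pick the narrowest signed immediate field that can hold `value`.
    for bits in widths:
        if fits_simm(value, bits):
            return bits
    return None
```

Note that -1 (a very common mask/constant) fits in a signed 8-bit field but not an unsigned one, which is part of why signed compression tends to win.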

cdimauro Quote:


MOVE2 (MOVE PAIR) RegDest1,RegDest2,RegSource1,RegSource2



Is moving a pair of registers common enough? A CISC CPU core with reg-mem mem-reg should not need to move pairs of registers too often and with 3 op instructions even less so, even for load/store ISAs. Maybe it would be used enough around function calls to move args to the right registers?

The AC68080 ISA added MOVE2 but it is for memory operations only even though the encoding could support an all register MOVE2 although at least one pair of registers would need to be sequential.

cdimauro Quote:


AND Reg,SmallUnsignedImmediate



Already covered.

cdimauro Quote:


MUL Reg1,Reg2



The 68k has a 16-bit encoding for MUL.W (16x16=32). The 68k Amiga would have to advance decades before needing a short/quick 16-bit encoding for 32x32=32 or 64x64=64. The 68k AmigaOS currently calls a function for 32x32=32 and 32x32=64 which is fine for dinosaur age emulation, just the way the Hyperion A-EonKit syndicate wants the 68k.

cdimauro Quote:


SEXT Reg
ZEXT Reg



If the source and destination register can be the same, the reg to reg versions will suffice which is the case for the ColdFire MVS and MVZ instructions. They are no larger than the old sign extend only 68k EXT(B) instructions and support so much more. The MVS/MVZ instructions were one thing ColdFire got right except for the dreadful names. These instructions should have been added much earlier to every 68k CPU as even the 68000 could support them and would gain a significant benefit from them.
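For reference, a small Python model (hypothetical helper names, mirroring the ColdFire-style semantics described above) of what MVS/MVZ compute: sign versus zero extension of the low byte or word into a 32-bit register:

```python
def mvs(value, size):
    # MVS.B/.W-style: sign-extend the low `size` bytes to 32 bits.
    bits = 8 * size
    value &= (1 << bits) - 1          # keep only the source byte/word
    if value & (1 << (bits - 1)):     # test the sign bit
        value -= 1 << bits            # propagate it upward
    return value & 0xFFFFFFFF

def mvz(value, size):
    # MVZ.B/.W-style: zero-extend, i.e. just mask off the high bits.
    return value & ((1 << (8 * size)) - 1)
```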

cdimauro Quote:

Some of them require a considerable amount of opcode space, which I no longer have (I've almost completely exhausted the 16-bit space). Some of them are particularly difficult to get with an architecture like mine (and for the 68k as well, since I share a key aspect of this ISA regarding code density), because it would require way too much encoding space.


If the above are the LLVM internal instructions, then I do not think the 68k with just a few additions I have suggested would map so poorly. Some of the short encodings are for 8-bit base VLE encodings which can achieve a little better code density but I still believe a 16-bit base VLE can have the performance advantage with a minimal loss of code density. The 8-bit base BA2 ISA may achieve a little better code density than the 68k and Thumb-2 ISAs with smart use of the encoding space, starting with not wasting 8-bit instruction encodings, but the vast majority of 8-bit base VLE ISAs have inferior code density, inferior performance metrics and likely additional decoding overhead.

cdimauro Quote:

BTW, from NEx64T I've lost a bit of code density (+10 bytes now from 68k's ll.m68k_2_bis.s, instead of the +4 from the old ISA), but I've further reduced the number of executed instructions (-21 now. Thanks also to the AArch64 example: comparing them, I've found that with my new architecture I've taken very similar decisions. Which explains the close results) using fewer registers (-3 from 68k).


It sounds like a good tradeoff if wanting more desktop like performance at the expense of embedded market suitability. The number of instructions executed is more important than code density for high performance "desktop" systems and AArch64 set the bar on reducing the number of instructions. The only problem is that a high performance ISA has more competition with both x86-64 and AArch64 entrenched when most market opportunities are for embedded use with improved code density.

cdimauro Quote:

Last but not least, I've found some interesting statistics about code density that I'll add to the first page of the thread once I finish replying to some pending posts.


I do not see changes to the first page of this thread yet.

cdimauro Quote:

BA2's direct competitors were other LD/ST architectures, and the examples that they have used are based on those common scenarios (albeit I doubt that Thumb-2 and MIPS can generate a big constant like 0x7FFFFFFF with a single instruction).


At least a 48-bit encoding is required to support a 32-bit immediate which is not supported by many ISAs.

cdimauro Quote:

BTW, NanoMIPS can load full 32-bit constants, because it has a 16/32/48-bit variable-length encoding. It took a long time for MIPS to finally deliver a competitive architecture regarding code density, but it was too late.


So for semi-modern general purpose ISAs that support 32-bit immediates/displacements, there is the following.

reg-mem architecture with 8-bit base VLE
x86
x86-64
VAX

reg-mem architecture with 16-bit base VLE
68k
ColdFire (a "variable-length RISC" architecture that still supports them)
Z/Architecture
NS32k (weird LE ISA with BE data encoding)

load/store architecture with 8-bit base VLE
BA2 (does not use 8-bit encodings but belongs here in my opinion)

load/store architecture with 16-bit base VLE
NanoMIPS

load/store architecture with 32-bit base VLE
POWER
Mitch Alsup's ISA

It is odd that 32-bit base VLE ISAs may be the most common for load/store architectures.

cdimauro Quote:

I assume that the performance loss is only due to the current implementation. I mean, another 68k processor could do the conversion much faster.

BTW, the 68k FPU lacks 64-bit integers, which is really strange, since even the weaker x87 FPU supported them (on load and store. Which is enough, considering how the x87 worked).

In the last days there's been a discussion on EAB about Apple's usage of the 68k FPU to do 64-bit integer arithmetic. It wasn't a hack, as someone said, because it's a perfectly legit way to use the 68k FPU (which, BTW, supported up to 65-bit integers: one more bit, because the sign bit is separate from the 64-bit mantissa).
I don't know how useful it was, because I would have preferred to implement the 64-bit int arithmetic using the regular CPU. Especially considering that the 68k FPU completely lacks direct support for 64-bit ints.

On the contrary, x87 supported this datatype (albeit only for load/stores. But that's enough, because calculations are always performed using the extended precision), and the FPU was used on PCs to move data much more quickly (BTW, x87 instructions are also very compact).


I believe 68k FPU integer to FP conversions could have better timing and performance with more hardware, newer silicon and more optimization. I doubt that integer immediates and registers for FP instructions could be made free though, at least without a shared integer and FP register file. The 68060 currently requires 3 more cycles for Fop instructions using integers and these instructions along with any Fop instruction using immediates can only execute "pOEP-only". The latter arbitrary restriction is likely due to instructions in the instruction buffer being limited to 48-bit max without being split up, the same restriction used for ColdFire which does not allow larger instructions at all. Recall that Gunnar found no difference in timing by supporting 64-bit instructions which would allow more 68060 instructions to execute superscalar, including for the FPU.

https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Coldfire-compatible-FPGA-core-with-ISA-enhancement-Brainstorming/td-p/238714?profile.language=en Quote:

2) Support for more/all EA-Modes in all instructions

In the current Coldfire ISA the instruction length is limited to 6 Byte, therefore some instructions have EA mode limitations.

E.g the EA-Modes available in the immediate instruction are limited.

That instructions can currently be either 2, 4, or 6 bytes in length reduces the complexity of the Instruction Fetch Buffer logic.

The complexity of this unit increases the more options the CPU supports - therefore not supporting a range from 2 to over 20 bytes like the 68K does reduce chip complexity.

Nevertheless in my tests it showed that adding support for 8 byte encoding came at relatively low cost.

With support for 8 byte instruction length, the FPU instructions can now use all normal EA-modes - which makes them a lot more versatile.

MOVE instructions become a lot more versatile, and the immediate instructions can now also operate on memory in much more flexible ways.

While with 10 byte instruction length support there are then no EA mode limitations from the user's perspective, in our core it showed that 10 byte support starts to impact clock rate - with 10 byte support enabled we no longer reached the 200 MHz clock rate in Cyclone FPGA that the core reached before.


The AC68080 does not use an instruction buffer to reduce instruction fetch, but it may be that the 68060 limitation was chosen to fit the transistor and power budget of the time and simply reused for ColdFire instruction buffers because it required the least amount of work. Supporting 8-byte/64-bit instructions would improve superscalar execution and allow single precision FP immediates, which would have been valuable for ColdFire instead of moving FP immediates into the data stream, but ColdFire was a 2nd class citizen with the most important requirement being to not compete with PPC.

Allowing 68k FPU 64-bit integer datatype memory accesses would be useful too. Not only would it have made 64-bit integer arithmetic in the FPU easier but it would have provided an easier solution to the 68040 major mistake of removing FINT/FINTRZ, which is the only FP instruction the 68060 added back. Motorola had the mindset to chop, chop, chop their way to compete with RISC instead of leveraging the performance advantage of the 68k. Despite the major chopping done to the 1994 68060, it had unmatched integer performance efficiency (performance/MHz) for an in-order core. ARM only surpassed the performance efficiency with a general purpose 8-stage in-order superscalar Cortex-A7 core 17 years later in 2011 and that was with larger caches.

cdimauro Quote:

Not so much. As Mitch stated some time ago, logic is very cheap, and relaxing the alignment constraints offers much better benefits compared to the slowdowns due to instruction misalignment.


It is worthwhile to handle alignment as efficiently as possible which is what is relatively cheap in logic. However, there is still a performance benefit to better aligned data in memory including instructions. The 68060 4-byte instruction fetch is much more likely to fetch whole instructions than a 4-byte fetch on x86, or worse x86-64. It is not only that bytes are more likely to straddle a particular fetch/buffer but x86 and x86-64 ISAs are more likely to have longer encodings. Consider the high tech scalar Cyrix 5x86.

Cyrix 5x86: Fifth-Generation Design Emphasizes Maximum Performance While Minimizing Transistor Count
https://dosdays.co.uk/media/cyrix/5x86/5X-DPAPR.PDF Quote:

Instruction Decode Unit

The instruction decode unit in the 5x86 decodes the variable-length x86 instructions. The instruction decode involves determining the length of each instruction, separating immediate and/or displacement operands, decoding addressing modes and register fields, and creating an entry point into the microcode ROM. As previously discussed, the input to the instruction decoder is eight bytes of instructions supplied by the IF unit. These bytes are shifted and aligned according to the instruction boundary of the last instruction decoded. The ID unit can decode and issue instructions at a maximum rate of one per clock. Instructions with one prefix and instructions of length less than or equal to eight bytes can be decoded in a single cycle.


An 8-byte instruction fetch was used for a scalar x86 CPU and "These bytes are shifted and aligned according to the instruction boundary of the last instruction decoded." Misaligned data memory loads have overhead too which is reduced for 16-bit aligned data compared to 8-bit aligned data. NOP instructions are often added inside x86(-64) code for alignment, whereas aligning instructions are rarely added inside executed 68k code. It is not unusual for x86(-64) cores to use boundary marker caches and the markers would use twice the space of a 68k boundary marker cache. The Pentium 4 even used a trace cache which improves code alignment. There are many signs of the increased overhead of less favorable x86(-64) instruction alignment. Despite the 68k ISA more often being compared to the VAX ISA, I believe the decoding overhead of VAX was more comparable to x86(-64) which also has an 8-bit base VLE.
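A rough Monte Carlo sketch in Python (the length distributions here are invented for illustration, not measured 68k/x86 statistics) of how instruction alignment affects the odds of straddling a 4-byte fetch window:

```python
import random

def straddle_rate(lengths, weights, fetch=4, trials=100000, seed=0):
    # Lay instructions end to end and count how many cross a
    # fetch-window boundary. Purely illustrative model.
    rng = random.Random(seed)
    pos = straddles = 0
    for _ in range(trials):
        n = rng.choices(lengths, weights)[0]
        if pos // fetch != (pos + n - 1) // fetch:
            straddles += 1
        pos += n
    return straddles / trials

# 68k-like: even instruction lengths on even addresses.
m68k = straddle_rate([2, 4, 6], [5, 3, 1])
# x86-like: arbitrary byte lengths on arbitrary addresses.
x86 = straddle_rate([1, 2, 3, 4, 5, 6, 7], [1, 2, 3, 3, 2, 1, 1])
```

One provable corner case: with even lengths on even addresses, a 2-byte instruction can never straddle a 4-byte window, while byte-aligned streams can straddle with any length above one byte.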

Last edited by matthey on 19-Aug-2025 at 09:47 PM.
Last edited by matthey on 19-Aug-2025 at 09:46 PM.
Last edited by matthey on 19-Aug-2025 at 03:09 AM.

 Status: Offline
Profile     Report this post  
Hammer 
Re: The (Microprocessors) Code Density Hangout
Posted on 19-Aug-2025 4:55:04
#387 ]
Elite Member
Joined: 9-Mar-2003
Posts: 6582
From: Australia

@cdimauro

Quote:
You continue to do NOT understand, at all, what code density is about.

You are a lost cause...


Code density's utility is dependent on the target use case.

68060's scalar code density advantage (fused data load with ALU op) is less useful for use cases that heavily use fused multiply math operations in a single instruction.

For 3D games in the 1990s, your argument position is useless in real workload vs cost terms.

Sony's PlayStation result has no problem standing up against gaming PCs on economic vs performance terms.

For 1993 to 1995 and given per unit budgetary limits, it's very difficult to pull off a 68EC040 or 68EC060 powered CD3D game console (which is the basis for A1200's replacement).

For 1996, the 68K IP holder wasn't interested in a StrongARM-style push of the 68LC040 into the 120 MHz to 140 MHz range.

Xbox (Project Midway) has fat X86 economies of scale and Microsoft's fat cash at bank.

For 1995-1996, IBM offered and designed a two PPC 602 @ 66 MHz game console CPU solution, within a 1 million transistor budget per CPU, for the 3DO M2 project. PPC 602 has a pipelined FP32 FPU. The 3DO team increased the transistor budget for 3DO M2's CPU item.

3DO targeted A500's retail US$699 price and hoped that their partners would cost-reduce it, e.g. LG/GoldStar 3DO.

CD32 targeted a US$399 initial launch price, hence a similar price range for CD3D.

Last edited by Hammer on 19-Aug-2025 at 05:34 AM.
Last edited by Hammer on 19-Aug-2025 at 04:56 AM.

_________________

Hammer 
Re: The (Microprocessors) Code Density Hangout
Posted on 19-Aug-2025 6:13:05
#388 ]
Elite Member
Joined: 9-Mar-2003
Posts: 6582
From: Australia

@matthey

Quote:

There is a sizable jump from MCU limitation to a Cortex-A55 SoC. Some MCUs allow to clock up CPU cores to have good performance using SRAM but they mostly target lower power so have limited amounts of SRAM. ARM dropping 32-bit ISA support for Cortex-A SoCs creates a large gap between 32-bit MCUs and 64-bit SoCs. SiFive series 7 SoCs are a nice hybrid between a SoC and MCU. They allow the caches to be configured as SRAM like a MCU which gives more SRAM than most MCUs with over 2MiB. The in-order series 7 cores are small enough that the SoCs are still affordable unlike for SoCs where OoO cores are used. RISC-V cores likely are significantly smaller than AArch64 cores with their much larger ISA and SIMD requirements. ARM is creating an opening for low end embedded competition that RISC-V is exploiting. Whether RISC-V can scale up in performance with a mess of extensions has yet to be seen. Code density is not the only performance metric.

Western Digital's SweRV is not a SiFive U74-series core.

SweRV Core EH1 is a 32-bit, 2-way superscalar core with a 9-stage pipeline. SweRV is open source and has since become the VeeR EH1 RISC-V core.


Quote:

The Alpha 21164 was still ahead in performance but at what cost? Price efficiency (performance/$) and power efficiency (performance/W) are important. The reason I dug up this old article was that this was the turning point and the writing was on the wall.

DEC Alpha continued to scale with the MHz race, but DEC was bleeding engineers to Intel and AMD. DEC also had StrongARM in 1996, which gave ARM CPU designs a large clock speed boost.

Using StrongARM CPUs in the Apple MessagePad, Apple was laying the foundation for MacOS X based iOS devices. Other smart handheld vendors followed a similar template.

Quote:

The Alpha 21164 performed reasonably well in the SPEC benchmarks but the P6 had 80%-90% of the integer performance at half the clock speed, less than half the power and the price could be nearly half judging by the transistors. The Alpha 21164 may have been worth it where high FPU performance is required but integer performance is more important for most applications where it does not offer value over the P6. Furthermore, x86 software investments could be maintained with the P6. It was clear that RISC has a major memory bottleneck and CISC has a major integer performance advantage. The x86 ISA was not even a good example of CISC as only 6 GP integer registers caused elevated instruction counts and increased data memory traffic.

FYI, x86 integer workloads use both the GPRs and the x87 registers.

There are design reasons for x86's fast GPR to x87 data transfers and for using x87 for integer workloads. This integer and floating point processing on x87 was carried over to SSE/SSE2.

Out of order with register renaming was implemented on X86 with Intel Pentium Pro in 1995 and AMD K5 in 1996.

X86 doesn't play by 68K rules.

Pentium Pro has three x86 decoders handling memory operands fused with ALU operations, vs Alpha's quad instruction decoders with discrete data load and ALU operation instructions.

Alpha would need about six instruction decoders. Apple M series RISC CPU approach is to scale up instruction decoders count that overcomes X86's fused instruction nature. RISC CPU has a different method for the brute force design path.

In response, both Intel and AMD are also scaling up instruction decoders count.

There are two very fat ARM-compatible core designs (Apple, Qualcomm) vs two very fat x86-64 core designs (Intel, AMD).

Pentium III only has one FPU pipeline port.
Alpha 21264 has two FPU pipeline ports i.e. FADD and FMUL.
K7 Athlon has three FPU pipeline ports i.e. FADD, FMUL, and a 3rd FPU pipeline for FMISC/FStore/others.

https://www.azillionmonkeys.com/qed/cpujihad.shtml

A floating point test I did that uses this strategy confirms that the K7 is indeed significantly faster than the P6's floating point performance. My test ran about 50% faster


https://www.tomshardware.com/reviews/athlon-processor,121-24.html
Floating Point Benchmarks
3D Studio Max with Windows NT in PPH (pictures per hour)
K7 Athlon @ 600 MHz = 81.8 PPH
Pentium III @ 600 MHz = 56.3 PPH

Using the same x87 instruction set, different performance results.

Intel would improve CPU core implementation design with Core 2.

My point, it depends on implementation.

Texture mapped 3D games like Quake play into floating point (geometry) and integer (pixel, texture) mixed instruction stream.

You can't deny Quake killed the Cyrix 6x86 despite good SPECint. John Carmack (aided by Michael Abrash) has an outsized influence over the gaming world that hammered nails into the Amiga, Cyrix, and S3.


Quote:

The 32-bit x86 with 6 GP integer registers, poor orthogonality and a bad stack based FPU with 8 stack based FPU registers more than held its own against the 64-bit Alpha with 32 GP integer registers and 32 GP FPU registers.

Pentium III introduced non-stack scalar and vector SSE extensions with eight SSE registers.
AMD K6-II introduced non-stack scalar and vector 3DNow extensions on eight x87 registers.

Pentium has a FXCH workaround for x87's stack behavior.
K7 Athlon's FXCH generates a NOP instruction with no dependencies. The top two stages of the x87 pipeline are a stack renaming step followed by an internal register renaming step.

Quote:

The 32-bit 68060 has 16 GP integer registers, good orthogonality, a good FPU ISA with 8 GP FPU registers and it was obviously better than the in-order P5 Pentium equivalent. Motorola pulled the plug on the 68k for a RISC ISA more like Alpha though. I guess they could not read the writing on the wall.

You ignored that x86 integer register use cases span both the GPRs and the x87 registers.

Again, X86 doesn't play by 68K rules or your artificial comparison limitations.

Beyond 68060 and AC68080 V4, 68K CPU design improvement is based on finance and engineering resources.


Last edited by Hammer on 19-Aug-2025 at 03:06 PM.
Last edited by Hammer on 19-Aug-2025 at 07:11 AM.
Last edited by Hammer on 19-Aug-2025 at 06:44 AM.
Last edited by Hammer on 19-Aug-2025 at 06:40 AM.
Last edited by Hammer on 19-Aug-2025 at 06:24 AM.

_________________

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 20-Aug-2025 4:44:24
#389 ]
Elite Member
Joined: 29-Oct-2012
Posts: 4495
From: Germany

@matthey

Quote:

matthey wrote:
cdimauro Quote:

I was busy with my new architecture in the last period. I've analyzed (again, for some) several architectures (ARM/Thumb-2, AArch64, ARM Helium, ARC, MRISC32, Xtensa, NanoMIPS, RISC-V with several extensions) to see if something was missing both in the code density and performance areas. I've also read some books about LLVM and the language reference (which was inspiring and, I think, should be a primary source when designing an ISA: directly mapping an instruction to the corresponding LLVM IR "instruction" is like winning the jackpot).

From all those activities I've seen some common patterns regarding what was introduced by the various ISAs to improve code density. I already had a good base with my architectures (especially the last one), but I've introduced a few more instructions to cover some aspects. What I actually miss are some (short/compact) instructions like the following:

Load/stores Reg,[RegBase + RegIndex]
Load/stores Reg,[RegBase + SmallImmediate]



Short versions are more applicable to an 8-bit base VLE. The 68k 16-bit base VLE is already pretty optimal for these.

The problem here is that the 68k (and my architectures, which share the same problem) don't (and very likely can't) provide 16-bit versions of such instructions (the list that I've shared was referred primarily to the missing instructions in the 16-bit opcode space), because they require too much opcode space.

For the 68k, it would have needed 4 bits for Reg, another 3 for RegBase, then 4 for RegIndex, and finally 2 for the size. Total: 13 bits, multiplied by two (for load & store) -> 14 bits.
Reducing RegIndex to 3 bits (only using data registers for that. Which makes absolute sense), wouldn't have helped much, because 13 bits is still A LOT.
With some restrictions we can go down to 12 bits, which is still an entire "quadrant" (1/16 of the opcode space, with that space split into 16 "quadrants").

What such compact ISAs / extensions did to reduce the number of bits was to reduce the number of registers that could be used. So, something like 3 bits (maximum! Some use even fewer registers) for the registers, plus some other restrictions.

Very similar considerations can be done for the SmallImmediate format.

A reduced version with far fewer registers can be implemented on 68k & co., but maybe it's not worth it, since the load/store operation can be "absorbed" by a regular ld + op or op + st instruction that already embeds the EA (CISC power!). However, no decision can be taken without solid statistics (with a compiler that easily allows such experiments).
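The opcode-space arithmetic above can be sanity-checked with a one-liner (a 16-bit base opcode map is assumed): 4 + 3 + 4 + 2 = 13 operand bits per instruction, and doubling for load & store adds one more bit, for 14 of the 16.

```python
def format_cost(operand_bits, base_bits=16):
    # Fraction of the 2**base_bits opcode map consumed by one
    # instruction format carrying `operand_bits` of operand fields.
    return (1 << operand_bits) / (1 << base_bits)
```

So the 13-bit version eats 1/8 of the whole 16-bit map, and even the 12-bit reduced version still consumes one full "quadrant" (1/16).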
Quote:
cdimauro Quote:


ADD Reg1,Reg2,Reg3
SUB Reg1,Reg2,Reg3



Better for reducing the number of instructions than improving the code density as they need to be a 32-bit encoding with a 16-bit VLE. LEA/PEA already provide a 3 op add with 2 preserved source reg ops and even allow to add an immediate and shift one of the source registers too. Change/use register stalls may result in some cases though.

It depends on how much those instructions are used: if it's relevant, then the above 16-bit encodings can certainly help the code density. Additions and subtractions are quite common operations.

LD/ST architectures don't need to reduce the number of executed instructions, because they already have binary (3 operands) instructions.

CISCs with a 16-bit VLE can have their own problem: a lack of encoding space for such compact versions.
Quote:
cdimauro Quote:


ADD Reg1,Reg2,SmallUnsignedImmediate
SUB Reg1,Reg2,SmallUnsignedImmediate



These are missing from the 68k and could be useful. I have considered a 32-bit encoding that can also be used for ADDC/ADDX and SUBC/SUBX with flags to optionally carry/borrow the C and/or X bits. This would bring the 68k closer to x86 functionality which is able to ADDC/SUBC a small immediate as I recall. This came about because the 68000 only had 16-bit instructions with data extensions where a 32-bit instruction would save encoding space considering the frequency of use. They could be useful for reencoded 68k ISAs like the 68k64.

Precisely. 68k (and successors) needs a 32-bit block of instructions to encode:
BINOP Source1,Source2,Dest
BINOP Source1,SIMM8,Dest

With BINOP being the list of all binary instructions.

Since the required space is a lot (at least on my architectures), I've only register versions of those instruction formats, plus the one with the 8-bit signed immediate, which only works with the basic arithmetic & logic instructions.

The 68k can certainly add a block or two for them, and it wouldn't take so much space. For this reason maybe there's some space for embedding an EA as the second source of such instructions, which can further help reduce both the number of executed instructions and code density (3 x 16-bit instructions reduced to 1 x 32-bit one).

A similar solution can be found for unary (two operands) instructions (but without the immediate version, of course).

Anyway, the above ADD and SUB are encoded in 16 bit for some of the architectures that I've analyzed, with the sole purpose of improving the code density.
Quote:
cdimauro Quote:


SHIFT Reg1,Reg2,SmallUnsignedImmediate



An encoding with 2 registers and a small immediate is likely going to be 32-bit for a 16-bit base VLE. For a 16-bit base VLE, a 16-bit encoding of a single register by a small immediate is likely the limit. The 68k already has this which would become the quick versions with a 32-bit encoding which allows a 2nd register and a full immediate.

That's exactly what I did as well. The only available space was for a Reg + 3 bit immediate shift.
Quote:
cdimauro Quote:


SEXT Reg1,Reg2
ZEXT Reg1,Reg2



The ColdFire MVS and MVZ instructions would provide this for the 68k with any other names being better. I chose SXTB/SXTW and ZXTB/ZXTW but x86 MOVSX and MOVZX may be preferred by some developers. This functionality would be very good for the 68k and unlike x86(-64) MOVSX/MOVZX, the ColdFire instructions improve code density,

They improve the code density on x86 and x64 as well: the minimum instruction length for a MOV is 2 bytes, and MOVSX/MOVZX just need an additional byte.
Quote:
albeit using quite a bit of encoding space.

It's worth the price to pay (but NOT on Line-A).
Quote:
Where most 68k and ColdFire compiler backends are shared, it is compelling to use the ColdFire encodings so the existing ColdFire support can be turned on.

I don't agree. Once you need to (re)compile your code, the only important thing is being source-compatible, and another space can be chosen for such instructions. A compiler can support both ColdFire and 68k+ encodings in a simple way.
Quote:
cdimauro Quote:


NOT Reg1,Reg2
NEG Reg1,Reg2



I am not so sure they are common enough to warrant a 16-bit encoding with a 16-bit base VLE. More likely they would be 32-bit encodings with 2 registers and NEG could have optional bits for C and X to subtract as NEGC/NEGX. A small immediate could be subtracted as well but maybe not worthwhile for NEG.

I agree: I don't see much value here.
Quote:
A 32-bit encoding for an integer ABS reg,reg would be nice as well. Gunnar did not like the idea of an integer ABS instruction as he preferred predication of short branches and the code equivalent of ABS which is the same size. There are some advantages to the ABS instruction still but they are minor and he had a good argument so I did not push the point.

I've added it to my architectures because it's common enough. Hence, useful. But with a longer encoding, due to the limited opcode space available.
Quote:
Not every 68k implementation would support predication of short branches and I wanted a 68k ISA standard that others could use including for emulation and ASICs while he wanted a closed 68k ISA optimized for his personal FPGA toy core which was far from the goals and openness of the original Natami project.

That's his problem. Remaining relegated to FPGAs will greatly limit the audience and the potential for expanding the market. Even worse without a proper vision on how to evolve the platform.
Quote:
cdimauro Quote:


ADD Reg1,Reg2 * 1/2/4/8



This looks like LEA/PEA with CISC addressing modes again.

Exactly, but... much more compact.
Quote:
cdimauro Quote:


BIT Reg,uimm5



BIT from which ISA? It is a 5-bit immediate so I expect this is a bit number equivalent to the 68k BTST. A BTST type instruction is used and a 16-bit encoding should be possible but it still may not be worthwhile. BTST, BCLR, BSET and BCHG allow a more compact encoding than the equivalent logic operation with a mask but they require more hardware to execute as the mask for the proper logic operation has to be created which may include a shift and/or bit operation unit unavailable in some execution pipelines. The 68060 has ALUs with shift capabilities in both execution pipelines but it only has a bit operation unit in the pOEP. BTST, BCLR, BSET and BCHG are "pOEP-only" for an immediate to a register instruction even though they are single cycle so no superscalar operation. The 68020/68030 have better timing for the logic operations between an immediate mask and a register than BTST, BCLR, BSET and BCHG as well. Compiled code for the 68k generally prefers the logic operations for the 68k and I do not believe this is unusual. Most 68k CPUs have a large performance penalty for fetching extra code and 32-bit immediates can reduce performance as well as partial register writes for 8-bit and 16-bit immediates and for some logic operations. My OP.L #data16.w,Dn addressing mode reduces these problems when a 16-bit immediate suffices.

The other option may be that BIT is from the 6502 but changed to accept an immediate where the immediate is a mask but the op is not modified. BIT is an AND operation that affects the condition codes but does not modify the accumulator.

http://www.6502.org/users/obelisk/6502/reference.html#BIT

Sorry, I forgot to clarify that BIT is the group of BIT instructions, so BTST, BCLR, BSET and BCHG for the 68k. On my architecture I use it to avoid repeating the same list of instruction on all formats where they are available (similar thing with SHIFT -> all 8 shift/rotate instructions. ARI -> the 8 base arithmetic & logic instructions).

Regarding the performance, I suggest avoiding thinking about the current limits of the 68060 or older processors. Talking about new instructions here means that a new processor & microarchitecture would be needed to implement them, and there's a great opportunity to phase out such anachronistic limits.
Quote:
A non-destructive AND would be useful even for a single register. I suggested that it is possible to encode the immediate in the destination for this purpose. This may allow AND.L reg,#imm32 and AND.L reg,#imm16.w that would not actually alter the immediate in the code. Some other instructions are common where the result wastes a register like ADD which could benefit while SUB already has a non-destructive version in CMP and not limited to immediates and a reg. I do not know whether this idea would work well in implementations but it is orthogonal.

With non-destructive do you mean that the operation is performed, but only the flags are updated accordingly?

Because that could be an interesting idea of using EA on a destination when this is an immediate: it doesn't require big changes, since it's almost completely implemented, and you can use it on ALL instructions which already have an EA as destination.
Quote:
Gunnar did not like the idea perhaps because move+op instructions may be fused to 3 op instructions.

I don't see this preventing the above solution. Maybe he simply didn't like it (SIC!).
Quote:
He added CMPIW which at least shortens the most common case of move.l+op.l+bcc sequences and the encoding is not in Line-A like some AC68080 code density improving instructions including MOVIW, MOV3Q, MOVS and MOVZ. Maybe he moved MOVS (MVS) and MOVZ (MVZ) back to the original ColdFire locations though.

That's fine for HIS (only) purposes, but it's a very bad decision (even from Motorola): Line-A should be kept as it is, for SIMD/Vector extensions.
Quote:
cdimauro Quote:


ADD/SUB/CMP/MOV uimm8



If "unsigned" 8-bit immediates are needed, they can be encoded in a 16-bit signed immediate. Generally, signed immediate compression gives better code density.

Unsigned is much more common, and the above have only 16-bit encoding. So, a good code density improvement, but it requires too much encoding space.

That's why I've decided on different paths (e.g. combining instructions and branches).
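The coverage point behind preferring signed 16-bit immediates can be checked quickly (an editor's sketch; `fits_simm16` is a made-up helper):

```python
def fits_simm16(value: int) -> bool:
    """True if a 32-bit constant can travel as a sign-extended 16-bit
    immediate (the OP.L #data16.w,Dn idea discussed in the thread)."""
    return -0x8000 <= value <= 0x7FFF

# Every unsigned 8-bit immediate (0..255) fits, so a dedicated uimm8
# format only buys a shorter encoding, not extra coverage.
assert all(fits_simm16(v) for v in range(256))
assert not fits_simm16(0x8000)   # this one needs a full 32-bit immediate
```

That is the tension in the exchange above: simm16 covers everything uimm8 does, but a dedicated uimm8 format could be shorter, at the price of opcode space.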
Quote:
cdimauro Quote:


MOVE2 (MOVE PAIR) RegDest1,RegDest2,RegSource1,RegSourc2



Is moving a pair of registers common enough?

I've found it on most of the ISAs which I've examined (even RISC-V has it).
Quote:
A CISC CPU core with reg-mem mem-reg should not need to move pairs of registers too often and with 3 op instructions even less so, even for load/store ISAs. Maybe it would be used enough around function calls to move args to the right registers?

That's exactly the use case.
Quote:
The AC68080 ISA added MOVE2 but it is for memory operations only even though the encoding could support an all register MOVE2 although at least one pair of registers would need to be sequential.

That's more like AArch64, but it's not matching the above use case.
Quote:
cdimauro Quote:


AND Reg,SmallUnsignedImmediate



Already covered.

Same, but not with a 5 or 6 bits immediate.
Quote:
cdimauro Quote:


MUL Reg1,Reg2



The 68k has a 16-bit encoding for MUL.W (16x16=32). The 68k Amiga would have to advance decades before needing a short/quick 16-bit encoding for 32x32=32 or 64x64=64. The 68k AmigaOS currently calls a function for 32x32=32 and 32x32=64 which is fine for dinosaur age emulation, just the way the Hyperion A-EonKit syndicate wants the 68k.

The 68k is OK and should be enough. With the Amiga, the OS is the problem: it's bound to library calls for common functionality that CPUs have provided natively for a very long time.
Quote:
cdimauro Quote:


SEXT Reg
ZEXT Reg



If the source and destination register can be the same, the reg to reg versions will suffice which is the case for the ColdFire MVS and MVZ instructions. They are no larger than the old sign extend only 68k EXT(B) instructions and support so much more. The MVS/MVZ instructions were one thing ColdFire got right except for the dreadful names. These instructions should have been added much earlier to every 68k CPU as even the 68000 could support them and would gain a significant benefit from them.

I agree, but not on Line-A.
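The MVS/MVZ behavior praised above amounts to a fused move + extend; a minimal sketch of the assumed byte-size semantics:

```python
# Sketch (assumed semantics) of ColdFire MVS.B / MVZ.B: copy a byte and
# sign- or zero-extend it to 32 bits in one instruction, replacing a
# move + extend (or clear + move) pair.
def mvs_b(value: int) -> int:
    """Sign-extend the low 8 bits of value."""
    b = value & 0xFF
    return b - 0x100 if b & 0x80 else b

def mvz_b(value: int) -> int:
    """Zero-extend the low 8 bits of value."""
    return value & 0xFF

assert mvs_b(0x80) == -128 and mvs_b(0x7F) == 127
assert mvz_b(0xFFFFFF80) == 0x80
```

The word-size variants (MVS.W/MVZ.W) follow the same pattern with a 16-bit field; the win is one instruction and a full register write instead of two instructions with a partial write.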
Quote:
cdimauro Quote:

Some of them require a considerable amount of opcode space, which I don't have anymore (I've almost completely exhausted the 16-bit space). Some of them are particularly difficult to get with an architecture like mine (and for the 68k as well, since I share a key aspect of this ISA regarding code density), because they would require way too much encoding space.


If the above are the LLVM internal instructions, then I do not think the 68k with just a few additions I have suggested would map so poorly.

No, the above are the instructions which I've found on several architectures for getting better code density; they do not come from LLVM.

LLVM's IR also has most of them, but in a more generic form (it's then up to the specific backend to map them to one or more native instructions).

Looking at LLVM, the 68k is certainly in good shape. The most important thing missing is support for SIMD and/or vectors. And, of course, a more mature backend which can better optimize the code.
Quote:
Some of the short encodings are for 8-bit base VLE encodings which can achieve a little better code density but I still believe a 16-bit base VLE can have the performance advantage with a minimal loss of code density. The 8-bit base BA2 ISA may achieve a little better code density than the 68k and Thumb-2 ISAs with smart use of the encoding space, starting with not using and wasting 8-bit instruction encodings, but the vast majority of the 8-bit base VLE ISAs have inferior code density, inferior performance metrics and likely additional decoding overhead.

I almost agree, except on the decoding overhead, where I think the 68k has more (due to the non-regular instruction mapping and the extension words).

BTW, BA2 can easily be upgraded to 64-bit, and a SIMD/vector extension can be added as well: there's still a lot of space. This architecture was very well designed, but unfortunately it was abandoned by its company (which recalls what Motorola did).
Quote:
cdimauro Quote:

BTW, from NEx64T I've lost a bit of code density (+10 bytes now from 68k's ll.m68k_2_bis.s, instead of the +4 from the old ISA), but I've further reduced number of executed instructions (-21 now. Thanks also to the AArch64 example: comparing them, I've found that with my new architecture I've taken very similar decisions. Which explains the close results) using less registers (-3 from 68k).


It sounds like a good tradeoff if wanting more desktop like performance at the expense of embedded market suitability.

It's a good tradeoff, but the above results come from just one example, with a lot of time spent manually squeezing out the most with the sole purpose of reducing the code size. So it can't represent all scenarios, nor does it use several other instructions and features. Last but not least, it's pure assembly, whereas the vast majority of software in the embedded field is C/C++, where my new architecture has better support (e.g. ABI -> function calls -> function prologues & epilogues, handling of parameters on the stack and the stack frame; just to give an example of notable differences from the 68k).

That's the reason why I won't spend further time on that; I'd rather try writing a backend for it (not an easy task) and get results from real-world software & benchmark suites, where I'm confident it'll tell a different story.
Quote:
The number of instructions executed is more important than code density for high performance "desktop" systems and AArch64 set the bar on reducing the number of instructions. The only problem is that a high performance ISA has more competition with both x86-64 and AArch64 entrenched when most market opportunities are for embedded use with improved code density.

The problem with existing architectures is that they usually do well only in some specific fields.

RISC-V's architects recognized this and tried to create an ISA suitable for all markets. However, they failed because they live in their academic bubble and missed important needs of the real world.

My last architecture certainly isn't pretending to be the best one on all relevant metrics & markets, but I've designed it to be pretty close while keeping the advantages in all other fields. And it's very customizable: it's possible to define/select all its features in a very granular way, to cover from the very low end (8 x 16-bit registers with a bunch of fixed-length instructions) to the very high end (HPC/supercomputers and variable-length instructions up to 34 bytes).
Quote:
cdimauro Quote:

Last but not least, I've found some interesting statistics about code density that I'll add to the first page of the thread once I finish replying to some pending posts.


I do not see changes to the first page of this thread yet.

Not yet. When I do it, I'll also write a new comment here pointing to the updated post.
Quote:
cdimauro Quote:

BA2's direct competitors were other LD/ST architectures, and the examples that they used are based on those common scenarios (albeit I doubt that Thumb-2 and MIPS can generate a big constant like 0x7FFFFFFF with a single instruction).


At least a 48-bit encoding is required to support a 32-bit immediate which is not supported by many ISAs.

Indeed, and it's good for the competitors that have it.
Quote:
cdimauro Quote:

BTW, NanoMIPS allows loading full 32-bit constants, because it has a 16/32/48-bit variable-length encoding. It took a long time for MIPS to finally deliver a competitive architecture regarding code density, but it was too late.


So for semi-modern general purpose ISAs that support 32-bit immediates/displacements, there is the following.

reg-mem architecture with 16-bit base VLE
68k
ColdFire (a variable-length RISC-style architecture that still supports them)
Z/Architecture
NS32k (weird LE ISA with BE data encoding)

reg-mem architecture with 8-bit base VLE
x86
x86-64
VAX

load/store architecture with 8-bit base VLE
BA2 (does not use 8-bit encodings but belongs here in my opinion)

load/store architecture with 16-bit base VLE
NanoMIPS

load/store architecture with 32-bit base VLE
POWER
Mitch Alsup's ISA

It is odd that 32-bit base VLE ISAs may be the most common for load/store architectures.

Indeed. It's a bit weird, but that's the case.
Quote:
cdimauro Quote:

I assume that the performance loss is only due to the current implementation. I mean, another 68k processor could do the conversion much faster.

BTW, the 68k FPU lacks 64-bit integers, which is really strange, since even the weaker x87 FPU supported them (on load and store. Which is enough, considering how the x87 worked).

In the last days there's a discussion on EAB about Apple's usage of the 68k FPU to do 64-bit integer arithmetic. It wasn't a hack, as someone said, because it's a perfectly legit way to use the 68k FPU (which, BTW, supported up to 65-bit integers: one more bit, because the sign bit is separate from the 64-bit mantissa).
I don't know how useful it was, because I would have preferred to implement 64-bit integer arithmetic using the regular CPU, especially considering that the 68k FPU completely lacks direct support for 64-bit ints.

On the contrary, x87 supported this datatype (albeit only for loads/stores; but that's enough, because calculations are always performed using extended precision), and the FPU was used on PCs to move data much more quickly (BTW, x87 instructions are also very compact).


I believe 68k FPU integer to FP conversions could have better timing and performance with more hardware, newer silicon and more optimization. I doubt that integer immediates and registers for FP instructions could be made free though, at least without a shared integer and FP register file. The 68060 currently requires 3 more cycles for Fop instructions using integers and these instructions along with any Fop instruction using immediates can only execute "pOEP-only".

I don't think the problem is the separate register files: it's the conversion required from integer to extended precision which took time on the 68060. But I'm pretty confident a modernized microarchitecture can greatly improve it.

BTW, the register file separation is only visible at the architecture/ISA level, but the implementation (microarchitecture) can have a unified register file (which makes sense only if the ISA is 64-bit: otherwise there would be too much wasted space in the registers).
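The 65-bit-integer point above can be demonstrated in Python, whose floats are IEEE doubles with a 53-bit mantissa, whereas the 68k/x87 extended format keeps 64 mantissa bits (an editor's sketch):

```python
# Sketch: why extended precision (64-bit mantissa) can carry 64-bit
# integers exactly, while double precision (53-bit mantissa) cannot.
n = (1 << 63) + 1                # needs all 64 mantissa bits

# Python floats are IEEE doubles: the low bit is rounded away.
assert int(float(n)) != n

# Any integer whose magnitude fits in 64 bits would survive a round trip
# through an 80-bit extended-precision register (64 mantissa bits plus a
# separate sign bit -- the "65-bit integers" mentioned above).
assert n.bit_length() <= 64
```

This is why routing 64-bit integer arithmetic through the extended-precision FPU was a legitimate technique and not a hack: no precision is lost on the way in or out.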
Quote:
The latter arbitrary restriction is likely due to instructions in the instruction buffer being limited to 48-bit max without being split up, the same restriction used for ColdFire which does not allow larger instructions at all. Recall that Gunnar found no difference in timing when supporting 64-bit instructions, which would allow more 68060 instructions to execute superscalar, including for the FPU.

https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Coldfire-compatible-FPGA-core-with-ISA-enhancement-Brainstorming/td-p/238714?profile.language=en Quote:

2) Support for more/all EA-Modes in all instructions

In the current Coldfire ISA the instruction length is limited to 6 Byte, therefore some instructions have EA mode limitations.

E.g the EA-Modes available in the immediate instruction are limited.

That currently instructions can either by 2,4, or 6 Byte length - reduces the complexity of the Instruction Fetch Buffer logic.

The complexity of this unit increases as more options the CPU supports - therefore not supporting a range from 2 to over 20 like 68K - does reduce chip complicity.

Nevertheless in my tests it showed that adding support for 8 Byte encoding came for relative low cost.

With support of 8 Byte instruction length, the FPU instruction now can use all normal EA-modes - which makes them a lot more versatile.

MOVE instruction become a lot more versatile and also the Immediate Instruction could also now operate on memory in a lot more flexible ways.

While with 10 Byte instructions length support - there are then no EA mode limitations from the users perspective - in our core it showed that 10 byte support start to impact clockrate - with 10 Byte support enabled we did not reach anymore the 200 MHz clockrate in Cyclone FPGA that the core reached before.

The AC68080 does not use an instruction buffer to reduce instruction fetch but it may be the limitation was used to fit the transistor and power budget of the time and simply reused for ColdFire instruction buffers because it required the least amount of work. Supporting 8-byte/64-bit instructions would improve superscalar execution and allow single precision FP immediates, which would have been valuable for ColdFire instead of moving FP immediates into the data stream, but ColdFire was a 2nd class citizen with the most important requirement being to not compete with PPC.

Isn't the 68080 fetching and scanning 16 bytes from the instruction cache? And now it has 3 decoders which can merge a couple of instructions, so it can extract a maximum of 6 instructions from those 16 bytes.

So, I assume that the above 10-byte limit is solved.
Quote:
Allowing 68k FPU 64-bit integer datatype memory accesses would be useful too. Not only would it have made 64-bit integer arithmetic in the FPU easier but it would have provided an easier solution to the 68040's major mistake of removing FINT/FINTRZ, which is the only FP instruction the 68060 added back. Motorola had the mindset to chop, chop, chop their way to compete with RISC instead of leveraging the performance of the 68k. Despite the major chopping done to the 1994 68060, it had unmatched integer performance efficiency (performance/MHz) for an in-order core. ARM only surpassed that performance efficiency with a general purpose 8-stage in-order superscalar Cortex-A7 core 17 years later in 2011, and that was with larger caches.

The 68k's FPU can be improved and Motorola's mistakes repaired. It was left in the dust, but it can be revitalized and made more modern and more competitive.
Quote:
cdimauro Quote:

Not so much. As Mitch stated some time ago, logic is very cheap, and relaxing the alignment constraints offers much better benefits compared to the slow downs due to the instructions misalignments.


It is worthwhile to handle alignment as efficiently as possible which is what is relatively cheap in logic. However, there is still a performance benefit to better aligned data in memory including instructions. The 68060 4-byte instruction fetch is much more likely to fetch whole instructions than a 4-byte fetch on x86, or worse x86-64. It is not only that bytes are more likely to straddle a particular fetch/buffer but x86 and x86-64 ISAs are more likely to have longer encodings. Consider the high tech scalar Cyrix 5x86.

Cyrix 5x86: Fifth-Generation Design Emphasizes Maximum Performance While Minimizing Transistor Count
https://dosdays.co.uk/media/cyrix/5x86/5X-DPAPR.PDF Quote:

Instruction Decode Unit

The instruction decode unit in the 5x86 decodes the variable-length x86 instructions. The instruction decode involves determining the length of each instruction, separating immediate and/or displacement operands, decoding addressing modes and register fields, and creating an entry point into the microcode ROM. As previously discussed, the input to the instruction decoder is eight bytes of instructions supplied by the IF unit. These bytes are shifted and aligned according to the instruction boundary of the last instruction decoded. The ID unit can decode and issue instructions at a maximum rate of one per clock. Instructions with one prefix and instructions of length less than or equal to eight bytes can be decoded in a single cycle.


An 8-byte instruction fetch was used for a scalar x86 CPU and "These bytes are shifted and aligned according to the instruction boundary of the last instruction decoded."

OK, but that's an old design, and even nowadays it works fine with current code: instructions longer than 8 bytes aren't common on x86 or x64 either.
Quote:
Misaligned data memory loads have overhead too which is reduced for 16-bit aligned data compared to 8-bit aligned data.

Sure. If you can align a data type to its natural size it's a Good Thing, and that's also what compilers try to do.

But you always have to take misaligned memory accesses into account. That's the reason why modern processors have removed this bottleneck, and there's no need for separate instructions just for aligned loads/stores.

An example is AVX-512: a big change from AVX is that there's finally no need for separate aligned/unaligned load/store instructions, because all instructions can access data regardless of alignment, and the memory controller does the re-alignment work. This model is now used on all modern processors, because it gives far fewer headaches at all levels.
Quote:
NOP instructions are often added inside x86(-64) code where aligning instructions are very rarely ever added inside 68k code that is executed.

Exactly: that's to align branch targets to 16-byte boundaries.

That's also the reason why x64 shows worse code density: the NOP instructions added purely for alignment purposes.
Quote:
It is not unusual for x86(-64) cores to use boundary marker caches and the markers would use twice the space as for a 68k boundary marker cache.

Indeed. 16-bit VLE encodings have a great advantage here.
Quote:
The Pentium 4 even used a trace cache which improves code alignment.

No, the trace cache is an internal detail, used to completely remove the L1 code cache and avoid decoding instructions at runtime.

It was a failure because micro-ops require much more space than the original instructions, so the trace cache held far fewer instructions, hurting performance.
Quote:
There are many signs of the increased overhead of less favorable x86(-64) instruction alignment. Despite the 68k ISA more often being compared to the VAX ISA, I believe the decoding overhead of VAX was more comparable to x86(-64) which also has an 8-bit base VLE.

IMO the VAX is easier to decode than x86/x64, because it has no prefixes, and by partially decoding the first byte you already know how many operands the instruction has. Then you can immediately check the next byte to see if/how many bytes are needed for that operand's EA, move on to the next one, and so on.
I think the effort can be comparable to the 68k, up to 2 operands, with the advantage of having only single bytes to check.

P.S. No time to read and correct typos and mistakes.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 20-Aug-2025 4:52:46
#390 ]
Elite Member
Joined: 29-Oct-2012
Posts: 4495
From: Germany

@Hammer

Quote:

Hammer wrote:
@cdimauro

Quote:
You continue to do NOT understand, at all, what code density is about.

You are a lost cause...


Code density's utility is dependant on use case target.

Code density is what the computer science literature defines it as, not what YOU think it should be.

So, once again, no: it doesn't depend on the use case; it's a general, absolute measure.

BTW, I gave you some homework on this topic, but I'm still waiting for the results...
Quote:
68060's scalar code density advantage (fused data load with ALU op) is less useful for use case that heavily use fused multiple math operations on a single instruction approach.

This is about PERFORMANCE -> OFF TOPIC.

BTW, which consoles (see below) had FMA instructions on 90s?
Quote:
For 3D games in the 1990s, you're argument position is useless in the real workload vs cost.

This is about PERFORMANCE and COSTS -> OFF TOPIC.
Quote:
Sony's PlayStation result has no problems standing up against gaming PC on economic vs performance terms.

PCs had better performance. The problem here was the cost.

And again, it's OFF TOPIC.
Quote:
For 1993 to 1995 and given per unit budgetary limits, it's very difficult to pull off 68EC040 or 68EC60 powered CD3D game console (which is the basis for A1200's replacement).

For 1996, 68K IP holder wasn't interested in StrongARM style pushing 68LC040 into 120 Mhz to 140 Mhz range.

Xbox (Project Midway) has fat X86 economies of scale and Microsoft's fat cash at bank.

For 1995-1996, IBM offered and designed two 1 million transistors budget PPC 602 @ 66Mhz game console CPU solution for 3DO M2 project. PPC 602 has a pipelined FP32 FPU. 3DO team increased the transistor budget for 3DO M2's CPU item.

3DO targeted A500's retail US$699 price and hopes that their partner cost reduce it e.g. LG/GoldStar 3DO.

CD32 targeted US$399 intial launch price, hence a similar price range for CD3D.

Same as above: NOTHING to do with code density -> OFF TOPIC.
Quote:

Hammer wrote:
@matthey

Quote:

The 32-bit 68060 has 16 GP integer registers, good orthogonality, a good FPU ISA with 8 GP FPU registers and it was obviously better than the in-order P5 Pentium equivalent. Motorola pulled the plug on the 68k for a RISC ISA more like Alpha though. I guess they could not read the writing on the wall.

You ignored X86 integer register use case are both GPR and x87 registers.

Out of curiosity, what do you mean by that?

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 21-Aug-2025 5:03:06
#391 ]
Elite Member
Joined: 29-Oct-2012
Posts: 4495
From: Germany

Added three new benchmarks on the usual post:
- CoreMark with ARM Cortex-M0+ vs PIC24, PIC18, RL78;
- CodeSize Benchmark with NanoMIPS vs ARM Cortex / Thumb-2;
- Zephyr compilation with RISC-V / PULP vs ARMv7 and ARMv8.

matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 21-Aug-2025 6:14:31
#392 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2818
From: Kansas

cdimauro Quote:

The problem here is that the 68k (and my architectures, which share the same problem) don't (and very likely can't) provide 16-bit versions of such instructions (the list that I've shared referred primarily to the instructions missing from the 16-bit opcode space), because they require too much opcode space.

For the 68k, it would have needed 4 bits for Reg, another 3 for RegBase, then 4 for RegIndex, and finally 2 for the size. Total: 13 bits; doubling that for load & store -> 14 bits.
Reducing RegIndex to 3 bits (only using data registers for that, which makes absolute sense) wouldn't have helped much, because 13 bits is still A LOT.
With some restrictions we can go down to 12 bits, which is still an entire "quadrant" (1/16 of the opcode space, with that space split into 16 "quadrants").

What such compact ISAs / extensions did to reduce the number of bits was to reduce the number of registers that could be used. So, something like 3 bit (maximum! Some use even less registers) for the registers, plus some other restrictions.

Very similar considerations can be done for the SmallImmediate format.

A reduced version with far fewer registers could be implemented on the 68k & co., but maybe it's not worth it, since the load/store operation can be "absorbed" by a regular ld + op or op + st instruction that already embeds the EA (CISC power!). However, no decision can be taken without solid statistics (with a compiler that easily allows such experiments).


Right. There is not a good way to create short encodings on the 68k for (An,Xn) and (d4,An) addressing modes that would allow the orthogonal use of existing instructions. The 68k employs the An and Dn register specialization trick already to reduce most register encodings to 3 bits for 8 registers like many compressed ISA encodings but still allows most instructions to be 16-bit encodings while using 16 registers more effectively than most of them. It would be possible to add short encodings with new EA modes for a particular register like (4,sp) and (8,sp) but there are 3 available encodings and I would like to use one for op.l #data16.w,Dn and another for a shorter encoding for (d32,An). Short addressing mode encodings for #1 and #-1 immediates may save more space but are likely not as good either. A short encoding like this for the #0 immediate could have saved the encoding space used by CLR (MOVE.L #0,EA) and TST (CMP.L #0,EA) which I considered for 68k64 using the 3rd open EA encoding. It would be possible to increase the EA size from 6 to 7 bits, at least for source EAs, so there would be more of these encodings available but even one more bit is difficult to spare with a 16-bit encoding.
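The encoding-budget arithmetic from the quoted post is easy to sanity-check (an editor's sketch; the field widths are the ones given above):

```python
# Fraction of a 16-bit opcode space consumed by a format whose operand
# fields need `field_bits` bits (numbers taken from the discussion above).
def opcode_fraction(field_bits: int, total_bits: int = 16) -> float:
    return 2 ** field_bits / 2 ** total_bits

# 4 (Reg) + 3 (RegBase) + 4 (RegIndex) + 2 (size) = 13 bits per format,
# and a load + store pair doubles that to 14 bits.
assert opcode_fraction(14) == 1 / 4    # a quarter of the whole opcode map
assert opcode_fraction(12) == 1 / 16   # exactly one "quadrant"
```

Seen this way, the reluctance is obvious: a single orthogonal load/store-pair format at 14 bits of fields would eat a quarter of the entire 16-bit opcode space.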

cdimauro Quote:

It depends on much those instructions are used: if it's relevant, then the above 16-bit encodings can certainly help the code density. Additions and subtractions are quite common operations.

LD/ST architectures don't need to reduce the number of executed instructions, because they already have binary (3 operands) instructions.

CISCs with a 16-bit VLE can have their own problem by lack of encoding space for such compact versions.


Not all load/store architectures have 3 op. For example, SuperH does not and fixed length 16-bit encodings make it very difficult, as well as 16 GP registers while removing the 68k An Dn split. SuperH has some instructions using 3 registers but one of them is usually implicit reducing orthogonality, often R0 or FR0. Implicit registers are also used to save encoding space while allowing useful immediates and displacements in 16-bit instructions, for example AND #data8,r0. Some instructions even use 2 implicit registers like AND.B #data8, @(R0, GBR). This is a short 16-bit encoding for an ISA copied from the 68k using an (An,Xn) addressing mode. SuperH would have been much better with a VLE as these short encodings are ok if there are longer more orthogonal encodings too. Otherwise, they are a pain to program including for compiler backends with the lack of orthogonality which Thumb and MIPS16 16-bit fixed length encoding ISAs suffer from too. Thumb and MIPS16 may have supported switching modes to a 3 op 32-bit fixed length ISA but it just made the embedded cores larger than SuperH cores. Thumb-2 and MicroMIPS with 16-bit and 32-bit encodings eased some of the limitations but some encodings are still limited to implicit registers and 8 GP registers. The 68k remains more orthogonal and consistent than all these compact ISAs while Thumb-2 finally came close in code density. Motorola/Freescale did little to improve 68k code density after the 68020 ISA which could have done more. The ColdFire ISA did more to improve code density late, perhaps too late, by restoring 68k ISA functionality that had been removed and finally with some ISA_B and ISA_C additions like MVS, MVZ and MOV3Q. 
Had they stayed with the 68k CPU32 ISA for embedded, added ColdFire instructions that improve code density, retained 68k compatibility and incrementally continued to update their chips, the 68k might still be the king of the embedded market instead of ARM taking the crown with Thumb-2 and better support for it.

cdimauro Quote:

They improve code density on x86 and x64 as well: the minimum instruction length for a MOV is 2 bytes, and MOVSX/MOVZX just need an additional byte.



MOV+CWDE are 3 bytes (MOVSX reg, reg is 3 bytes)
XOR+MOV are 4 bytes (MOVZX reg,reg is 3 bytes)


Overall, x86(-64) MOVSX/MOVZX are more of a performance improvement than code density improvement by reducing the number of instructions executed and eliminating partial register write stalls. There is at least one case where they result in larger code like where MOVSX is used on a single register without moving it instead of using CWDE (a peephole optimizer can likely eliminate this case though). I expect the average code savings per MOVSX/MOVZX instruction is less than 1 byte on x86 but it would be better on x86-64 as fewer instructions reduces prefix bytes, for example when using the new upper 8 GP integer registers. In comparison, the ColdFire MVS/MVZ instructions save 2 bytes in most cases compared to alternatives and I expect the average code savings per instruction to be at least 1.5 bytes. Also, I doubt they ever result in larger code. Sign and zero extend instructions are better for the 68k because there is more free encoding space while new x86(-64) instruction encodings get longer and longer often to the detriment of code density.


MOVE+EXT are 4 bytes (MVS reg, reg is 2 bytes)
CLR/MOVEQ+MOVE are 4 bytes (MVZ reg,reg is 2 bytes)


The larger code density gain and shorter instructions for the 68k should result in more of a performance improvement than the x86(-64) equivalent instructions.

cdimauro Quote:

It's worth the price to pay (but NOT on Line-A).


The ColdFire instructions in Line-A are MOV3Q and MAC unit instructions. The MOV3Q instruction is out of place there and there is encoding space available without moving to Line-A or Line-F. MOV3Q is a good example of a poorly thought out and encoded instruction that adds compact RISC ISA inconsistency to the consistent 68k ISA. The immediate of 1 to 7 and -1 is inconsistent with any other 68k or ColdFire instruction and is difficult for humans to remember. With 1 more bit used to specify the sign of the immediate, immediates of 1 to 8 and -1 to -8 could be allowed, which is more consistent with ADDQ and SUBQ that allow immediates of 1 to 8. It is still a 32-bit-only operation with no size specifier. MOVEQ and some other instructions are 32-bit only, at least in registers. MOV3Q is an effective short encoding for improving code density as my analysis often showed it saving more code than MVS and MVZ combined, although my analysis method underestimated the savings of MVS and MVZ where it was relatively easy to identify where MOV3Q would provide a savings. The big savings come from replacing MOVE.L #data32,EA which saves 4 bytes per occurrence, although my MOVE.L #data16.w,EA would reduce this to a 2 byte savings per occurrence. It still may be worthwhile to move it out of Line-A and keep it.
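The immediate rule being criticized can be written out explicitly (assumed encodings, for illustration only):

```python
# Sketch of the MOV3Q immediate field (assumed: field 0 encodes -1,
# fields 1..7 are literal) next to the ADDQ/SUBQ rule (field 0 encodes 8).
def mov3q_imm(field: int) -> int:
    return -1 if field == 0 else field     # 1..7, plus the odd -1 case

def quick_imm(field: int) -> int:
    return 8 if field == 0 else field      # 1..8, ADDQ/SUBQ style

# The proposed fix: spend one sign bit to get a consistent 1..8 / -1..-8.
def signed_quick_imm(sign: int, field: int) -> int:
    mag = quick_imm(field)
    return -mag if sign else mag

assert [mov3q_imm(f) for f in range(8)] == [-1, 1, 2, 3, 4, 5, 6, 7]
assert signed_quick_imm(1, 0) == -8 and signed_quick_imm(0, 5) == 5
```

Laid side by side, the inconsistency is clear: the same 3-bit field means -1 in one instruction and 8 in another, which is the "difficult for humans to remember" complaint above.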

cdimauro Quote:

Sorry, I forgot to clarify that BIT is the group of BIT instructions, so BTST, BCLR, BSET and BCHG for the 68k. On my architecture I use it to avoid repeating the same list of instruction on all formats where they are available (similar thing with SHIFT -> all 8 shift/rotate instructions. ARI -> the 8 base arithmetic & logic instructions).

Regarding the performance, I suggest you to avoid thinking about the current limits of the 68060 or older processors. Talking about new instructions here, it means that a new processor & microarchitecture would be needed for implementing them, and there's a great opportunity to phase out such anachronistic limits.


The 68060 was ahead of its time. It could execute not only a simple integer instruction in both execution pipelines but a shift too. The PPC603 could do neither. Low end superscalar RISC cores commonly had a single integer unit, and only high end cores could perform more than one shift per cycle for many years after the 68060 was released. The 68060 has most of the hardware needed in each execution pipeline to superscalar execute bit instructions, but they embraced the RISC chop, chop, chop mentality to simplify for the most common cases, later resulting in the simpler but reduced performance and code density ColdFire variable length RISC ISA. Development is actually easier if the superscalar execution pipelines are more similar and symmetrical.

cdimauro Quote:

With non-destructive do you mean that the operation is performed, but only the flags are updated accordingly?

Because that could be an interesting idea of using EA on a destination when this is an immediate: it doesn't require big changes, since it's almost completely implemented, and you can use it on ALL instructions which already have an EA as destination.


Yes, I mean an instruction where the CC flags are updated but not the destination. Some RISC ISAs allow this by specifying the zero register as the destination.

cdimauro Quote:

Unsigned is much more common, and the above have only 16-bit encoding. So, a good code density improvement, but it requires too much encoding space.

That's why I've opted for different paths (e.g., combining instructions and branches).


Is unsigned more common? In my experience, "int" is more popular than "unsigned int" in C programs, and it is not unusual to see int used even when the data is unsigned. Most of the time there is a minimal difference in performance, but where there is one, unsigned has the advantage. Unsigned multiply and divide sometimes have better timing, and unsigned multiply and divide by 2 can be converted to a shift.
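The divide-by-2 point can be made concrete with a small Python sketch emulating C semantics (`trunc_div2` is just an illustrative helper):

```python
import math

def trunc_div2(x):
    """C-style signed division by 2: truncates toward zero."""
    return math.trunc(x / 2)

# Unsigned: a single logical shift is always the correct divide-by-2.
assert all((u >> 1) == u // 2 for u in range(256))

# Signed: an arithmetic shift floors, so odd negative values need a
# fixup (e.g. add 1 before shifting when the value is negative).
print(-5 >> 1)          # arithmetic shift floors: -3
print(trunc_div2(-5))   # what C's -5/2 yields: -2
print((-5 + 1) >> 1)    # shift with the fixup: -2
```

So the unsigned divide is a single shift, while the signed one costs an extra test-and-adjust, which is where the timing advantage comes from.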

cdimauro Quote:

I almost agree, besides the decoding overhead, where I think that the 68k has more (due to the non-regular instructions mapping, and the extended word).


The 68k encodings may be "non-regular" but compressed RISC ISAs have many more instruction encoding formats and instructions than a 32-bit fixed length RISC ISA. Most of them chose a 16-bit base VLE like the 68k, which allows decoding 16 bits at a time instead of 8 bits at a time. While most of these RISC ISAs have simpler encodings with fewer encoding sizes, most are not as orthogonal as the 68k. Instructions using the 68k extension word formats are uncommon, with maybe 5% of instructions using them on average, and most of those use the far simpler to decode brief extension word format (d8,An,Xn*SF) instead of the more difficult to decode full extension word format (bd,An,Xn*SF). Compare this to how common prefixes are in x86(-64), which can even override previously decoded data.
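For reference, the brief extension word really is simple to pick apart. A sketch in Python, following the documented 68020 bit layout (on the 68000 the scale field reads as 00, i.e. scale 1):

```python
def decode_brief_ext(word):
    """Decode a 68k brief extension word (d8,An,Xn*SF).

    Bit layout (68020+):
      15    D/A   index is address (1) or data (0) register
      14-12 Xn    index register number
      11    W/L   index is sign-extended word (0) or long (1)
      10-9  scale 1/2/4/8 multiplier
      8     0 = brief format, 1 = full format
      7-0   d8    signed 8-bit displacement
    """
    if word & 0x0100:
        raise ValueError("full extension word, not brief")
    return {
        "index_reg": ("A" if (word >> 15) & 1 else "D") + str((word >> 12) & 7),
        "index_size": "L" if (word >> 11) & 1 else "W",
        "scale": 1 << ((word >> 9) & 3),
        "d8": (word & 0xFF) - 256 if word & 0x80 else word & 0xFF,
    }

print(decode_brief_ext(0x7804))  # D7.L index, scale 1, displacement 4
```

The full format, by contrast, has to be parsed further for base/outer displacement sizes and suppression bits before the instruction length is even known, which is the decoding asymmetry described above.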

cdimauro Quote:

I don't think that the problem is the separated register files: it's the conversion which is required from integer to extended precision, which took time with the 68060. But I'm pretty confident a modernized microarchitecture can greatly improve it.

BTW, the register files separation is only visible at the architecture/ISA level, but the implementation (microarchitecture) can have a unified register file (which makes sense only if the ISA is 64-bit: otherwise there would be too much waste of spaces for the registers).


I am not so sure the 68060 int to fp latency is due to extended precision overhead. A 3-cycle latency extended precision FMUL is possible so the int to fp is more likely a victim of chop, chop, chop less commonly used hardware. On modern silicon, I do not expect much if any difference in performance between extended and double precision fp except for FDIV and FSQRT.

cdimauro Quote:

Isn't the 68080 fetching and scanning 16 bytes from the instruction cache? And now it has a 3 decoders which can merge a couple of instructions. So, it can extract a maximum of 6 instructions from those 16 bytes.

So, I assume that the above 10 bytes limit is solved.


It is safe to say the AC68080 design solves 68060 fetch and superscalar execution limitations with brute force at the clock speeds the AC68080 operates in a FPGA.

cdimauro Quote:

IMO VAX is easier to decode compared to x86/x64, because it has no prefixes and partially decoding the first byte you already know how many operands the instruction has. Then you can immediately check the next byte and you know if/how many bytes are needed for the EA of this operand, and go to the next one, and so on.
I think that the effort can be comparable to 68k, up to 2 operands, with the advantage of having only single bytes to check.


It should be noted that the decoding of 808x instructions and maybe even 16-bit x86 instructions was likely simpler than for VAX. The cumulative baggage from 8-bit to 16-bit to 32-bit to 64-bit took its toll though.

 Status: Offline
Profile     Report this post  
cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 21-Aug-2025 18:24:34
#393 ]
Elite Member
Joined: 29-Oct-2012
Posts: 4495
From: Germany

@matthey

Quote:

matthey wrote:
cdimauro Quote:

The problem here is that the 68k (and my architectures, which share the same problem) don't (and very likely can't) provide 16-bit versions of such instructions (the list that I've shared was referred primarily to the missing instructions in the 16-bit opcode space), because they require too much opcode space.

For the 68k, it would have needed 4 bits for Reg, another 3 for RegBase, then 4 for RegIndex, and finally 2 for the size. Total: 13 bits; doubling that space for load & store needs one more bit -> 14 bits.
Reducing RegIndex to 3 bits (only using data registers for that, which makes absolute sense) wouldn't have helped much, because 13 bits is still A LOT.
With some restrictions we can go down to 12 bits, which is still an entire "quadrant" (1/16 of the opcode space, with that space split into 16 "quadrants").

What such compact ISAs / extensions did to reduce the number of bits was to reduce the number of registers that could be used: something like 3 bits at most (some use even fewer registers), plus some other restrictions.

Very similar considerations can be done for the SmallImmediate format.

A reduced version with far fewer registers can be implemented on the 68k & co., but maybe it's not worth it, since the load/store operation can be "absorbed" by a regular ld + op or op + st instruction that already embeds the EA (CISC power!). However, no decision can be taken without solid statistics (with a compiler that easily allows such experiments).


Right. There is not a good way to create short encodings on the 68k for (An,Xn) and (d4,An) addressing modes that would allow the orthogonal use of existing instructions. The 68k employs the An and Dn register specialization trick already to reduce most register encodings to 3 bits for 8 registers like many compressed ISA encodings but still allows most instructions to be 16-bit encodings while using 16 registers more effectively than most of them.

Precisely. IMO this is THE KEY feature of the 68k architecture, which contributes both to the code density and a much better reduction of opcode space usage.
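The opcode-space arithmetic in the quoted bit budget is easy to spell out. A trivial Python check, using exactly the field widths quoted above:

```python
# Opcode-space cost of a hypothetical 16-bit indexed load/store pair,
# using the bit budget quoted above.
FIELDS = {"reg": 4, "reg_base": 3, "reg_index": 4, "size": 2}
per_op = sum(FIELDS.values())   # operand-field bits per instruction
total = per_op + 1              # +1 bit to distinguish load vs store
frac = 2 ** total / 2 ** 16     # share of the whole 16-bit opcode space
print(per_op, total, frac)      # 13 operand bits -> 14 total -> 1/4 of the map
```

A 14-bit footprint is a quarter of the entire 16-bit opcode map, i.e. four of the sixteen "quadrants", which is why the idea is so expensive.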

Infineon copied it with its TriCore architecture, allowing it to double the number of both register sets at the cost of just 4 bits per register, while keeping a lot of functionality.

I've opted for a different approach, albeit strongly inspired by it in some parts.

However, for an architecture like the 68k, I would have rigidly separated the two register sets (at the ISA level, at least), because it would have opened the door for a consistent and very useful change. Maybe if I create another new architecture, 68k-inspired this time, I'll take the chance to design it this way.
Quote:
It would be possible to add short encodings with new EA modes for a particular register like (4,sp) and (8,sp) but there are 3 available encodings and I would like to use one for op.l #data16.w,Dn and another for a shorter encoding for (d32,An). Short addressing mode encodings for #1 and #-1 immediates may save more space but are likely not as good either. A short encoding like this for the #0 immediate could have saved the encoding space used by CLR (MOVE.L #0,EA) and TST (CMP.L #0,EA) which I considered for 68k64 using the 3rd open EA encoding.

Unfortunately, there's not much space left here.

One encoding is needed for (d32,An), as you've already stated.
Another for (d32,pc) if you plan to expand it to 64-bit (which, I think, you've already decided: there's no future for an ISA without 64-bit support).
The third encoding is free, and the best use is your compressed displacement. I've another idea about that, but I can't share it yet because I've used it on my architectures (several times on NEx64T; just once on the new one, because there's only one use case where it still made sense to have it), and I think that it could be worth a patent before making it public. I hope you understand.
Quote:
It would be possible to increase the EA size from 6 to 7 bits, at least for source EAs, so there would be more of these encodings available but even one more bit is difficult to spare with a 16-bit encoding.

In fact: there can be space only for some new 16-bit instructions, or a few existing ones that could be "extended" by adding another bit which is free in the opcode. But it sounds quite weird.

You can follow what I've done with my very old 68k-inspired 64-bit architecture (it's 15 years old now): introduce a 4 + 3 = 7 bits EA, but only for the 32-bit encodings (I've kept the 3 + 3 = 6 bits EA for the 16-bit encodings). This will double the number of addressing modes available, opening a lot of opportunities.

BTW, embedding the EA in the base opcode was the second great feature of the 68k ISA, which contributed a lot to the code density. Unfortunately, it takes some space, so it doesn't leave many bits for encoding many instructions. But the most important ones are there.
In the end, the existing 6-bit EA encoding can be seen as a "compressed version" of the more general EA.
Quote:
cdimauro Quote:

It depends on much those instructions are used: if it's relevant, then the above 16-bit encodings can certainly help the code density. Additions and subtractions are quite common operations.

LD/ST architectures don't need to reduce the number of executed instructions, because they already have binary (3 operands) instructions.

CISCs with a 16-bit VLE can have their own problem by lack of encoding space for such compact versions.


Not all load/store architectures have 3 op. For example, SuperH does not, and fixed length 16-bit encodings make it very difficult, as does having 16 GP registers while removing the 68k An/Dn split. SuperH has some instructions using 3 registers but one of them is usually implicit, reducing orthogonality, often R0 or FR0. Implicit registers are also used to save encoding space while allowing useful immediates and displacements in 16-bit instructions, for example AND #data8,R0. Some instructions even use 2 implicit registers like AND.B #data8,@(R0,GBR). This is a short 16-bit encoding for an ISA copied from the 68k using an (An,Xn) addressing mode. SuperH would have been much better with a VLE, as these short encodings are ok if there are longer, more orthogonal encodings too.

I agree. They insisted on the 16-bit-only opcode format, which severely limited and hurt this ISA. Not separating the address and data registers was the most stupid mistake they made.
Quote:
Otherwise, they are a pain to program including for compiler backends with the lack of orthogonality which Thumb and MIPS16 16-bit fixed length encoding ISAs suffer from too. Thumb and MIPS16 may have supported switching modes to a 3 op 32-bit fixed length ISA but it just made the embedded cores larger than SuperH cores. Thumb-2 and MicroMIPS with 16-bit and 32-bit encodings eased some of the limitations but some encodings are still limited to implicit registers and 8 GP registers. The 68k remains more orthogonal and consistent than all these compact ISAs while Thumb-2 finally came close in code density.

Sure, but with modern compilers the lack of orthogonality isn't a big problem anymore, since instructions can be properly organized and compilers introduced the register-aliasing concept that automatically takes care of those problems.

One less burden for backend developers...
Quote:
cdimauro Quote:

They improve the code density on x86 and x64 as well: the minimum instruction length for a MOV is 2 bytes, and MOVSX/MOVZX just need an additional byte.



MOV+CWDE are 3 bytes (MOVSX reg, reg is 3 bytes)
XOR+MOV are 4 bytes (MOVZX reg,reg is 3 bytes)


Overall, x86(-64) MOVSX/MOVZX are more of a performance improvement than a code density improvement, by reducing the number of instructions executed and eliminating partial register write stalls. There is at least one case where they result in larger code, like where MOVSX is used on a single register without moving it instead of using CWDE (a peephole optimizer can likely eliminate this case though). I expect the average code savings per MOVSX/MOVZX instruction is less than 1 byte on x86, but it would be better on x86-64 as fewer instructions reduce prefix bytes, for example when using the new upper 8 GP integer registers.
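What MOVSX/MOVZX actually compute is plain sign and zero extension, sketched here in Python (the helper names are mine, not any real API):

```python
def zero_extend(value, bits):
    """What MOVZX does: keep only the low `bits` bits."""
    return value & ((1 << bits) - 1)

def sign_extend(value, bits):
    """What MOVSX/CWDE do: replicate the sign bit upward."""
    value &= (1 << bits) - 1     # isolate the source-width bits
    sign = 1 << (bits - 1)       # position of the sign bit
    return (value ^ sign) - sign # flip-and-subtract trick

print(zero_extend(0xFF, 8))   # 255
print(sign_extend(0xFF, 8))   # -1
print(sign_extend(0x7F, 8))   # 127
```

Doing this in one instruction, with any source register, is what removes the MOV+CWDE or XOR+MOV pairs counted above.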

Yes, MOVSX is more relevant on x86-64. MOVZX contributes to both ISAs.
Quote:
cdimauro Quote:

It's worth the price to pay (but NOT on Line-A).


The ColdFire instructions in Line-A are MOV3Q and MAC unit instructions. The MOV3Q instruction is out of place there and there is encoding space available without moving to Line-A or Line-F. MOV3Q is a good example of a poorly thought out and encoded instruction that adds compact RISC ISA inconsistency to the consistent 68k ISA. The immediate range of 1 to 7 and -1 is inconsistent with every other 68k or ColdFire instruction and is difficult for humans to remember. With 1 more bit used to specify the sign of the immediate, immediates of 1 to 8 and -1 to -8 could be allowed, which is more consistent with ADDQ and SUBQ that allow immediates of 1 to 8. It is still a 32-bit only operation with no size specifier, although MOVEQ and some other instructions are 32-bit only too, at least in registers. MOV3Q is an effective short encoding for improving code density: my analysis often showed it saving more code than MVS and MVZ combined, although my analysis method underestimated the savings of MVS and MVZ while it was relatively easy to identify where MOV3Q would provide a savings. The big savings comes from replacing MOVE.L #data32,EA, which saves 4 bytes per occurrence, although my MOVE.L #data16.w,EA would reduce this to a 2 byte savings per occurrence. It may still be worthwhile to move it out of Line-A and keep it.

IF there's some encoding space, then it's better to have a MOV3Q version with the Size field. Out of Line-A, of course.

However, I'm not in favour of a symmetrical version for positive and negative numbers: positive numbers are way more common, and deserve much more encoding space, IMO.
Quote:
cdimauro Quote:

With non-destructive do you mean that the operation is performed, but only the flags are updated accordingly?

Because that could be an interesting idea of using EA on a destination when this is an immediate: it doesn't require big changes, since it's almost completely implemented, and you can use it on ALL instructions which already have an EA as destination.


Yes, I mean an instruction where the CC flags are updated but not the destination. Some RISC ISAs allow this by specifying the zero register as the destination.

Perfect. On the 68k you'd need some feature register where you can disable triggering the exception and just update the flags. Not a big deal.
Quote:
cdimauro Quote:

Unsigned is much more common, and the above have only 16-bit encoding. So, a good code density improvement, but it requires too much encoding space.

That's why I've opted for different paths (e.g., combining instructions and branches).


Is unsigned more common? In my experience, "int" is more popular than "unsigned int" in C programs, and it is not unusual to see int used even when the data is unsigned. Most of the time there is a minimal difference in performance, but where there is one, unsigned has the advantage. Unsigned multiply and divide sometimes have better timing, and unsigned multiply and divide by 2 can be converted to a shift.

Sorry, I recognize now that I was a bit ambiguous.

Of course, regular operations are more commonly signed (and that's the reason why almost all instructions in my new architecture are signed; the unsigned ones have a U appended at the end, where that's wanted).

But regarding the range of integer numbers, the most common are the unsigned ones, and that's what I was referring to previously in this specific context.

Think about handling the characters of strings: those operations are usually with unsigned chars.
Quote:
cdimauro Quote:

I almost agree, besides the decoding overhead, where I think that the 68k has more (due to the non-regular instructions mapping, and the extended word).


The 68k encodings may be "non-regular" but compressed RISC ISAs have many more instruction encoding formats and instructions than a 32-bit fixed length RISC ISA. Most of them chose a 16-bit base VLE like the 68k, which allows decoding 16 bits at a time instead of 8 bits at a time. While most of these RISC ISAs have simpler encodings with fewer encoding sizes, most are not as orthogonal as the 68k.

After seeing the number of instruction formats on IBM's Z architecture (one of the last CISCs left), which reached 6 GHz despite that, and then the number of instruction formats on AArch64, I'm not scared anymore about having a large number of formats to decode.

The important thing is to have them well organized / structured and with few exceptions.
Quote:
Instructions using the 68k extension word formats are uncommon, with maybe 5% of instructions using them on average, and most of those use the far simpler to decode brief extension word format (d8,An,Xn*SF) instead of the more difficult to decode full extension word format (bd,An,Xn*SF).

Yes, they are rare, but you don't know if you're in the good or bad case (e.g., an additional displacement needed) until you've decoded the content of the extension word(s): that's the big problem with the 68k.

And the problem is exacerbated by the fact that instructions using the EA are spread around the opcode space, and it's not easy to catch them all, unless you use a big LUT.

Just to give an example, my last architecture has some hundreds of GP binary instructions, but I've only two formats for them, and I need to check only a few bits in the opcode to catch all of them (and I also know the instruction length, whether it has displacements/immediates, and where they are located). That's because I've grouped all of them together in those two formats.

One mitigation to this problem is having an architecture which is source-compatible with the 68k, but having the opcodes of all instructions using EAs grouped together in a few places / formats. However, that would be binary-incompatible with the existing software. So, not an option.
Quote:
Compare this to how common prefixes are in x86(-64) which can even override previously decoded data.

That's different and, believe it or not, it's easier.

An x86/x86-64 decoder has simple logic to check the first bytes of an instruction to see if they are prefixes or not. After that you get a bit mask and you can very quickly check where the opcode byte is located. And that's it.
Seeking the ModR/M & SIB bytes is also very simple, and there are a few checks on the bytes that can be performed in parallel with the prefix checks, generating another bit mask.
Once you have both bit masks, you're done: you know very precisely how the instruction is structured / spread over such a sequence of bytes.

The byte comparisons are very cheap, and certainly don't require the very big 16-bit -> 16-bit LUT used on the 68060.
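The prefix scan described above can be sketched in a few lines. This is Python in loop form for clarity; real decoders do these byte compares in parallel, and this sketch ignores corner cases like the two-/three-byte opcode maps:

```python
# Minimal sketch of an x86-64 prefix scan: skip legacy prefixes, note
# an optional REX prefix, and return the offset of the opcode byte.
LEGACY_PREFIXES = {
    0x66, 0x67,                           # operand/address size override
    0xF0, 0xF2, 0xF3,                     # LOCK, REPNE, REP
    0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65,   # segment overrides
}

def find_opcode(insn):
    """Return (opcode_offset, rex_byte_or_None) for an instruction byte string."""
    i, rex = 0, None
    while insn[i] in LEGACY_PREFIXES:
        i += 1
    if 0x40 <= insn[i] <= 0x4F:           # REX must be the last prefix
        rex, i = insn[i], i + 1
    return i, rex

# 48 89 C3 = mov rbx, rax: opcode 0x89 at offset 1, after the REX.W prefix
print(find_opcode(bytes([0x48, 0x89, 0xC3])))  # (1, 72)
```

Each iteration is just a handful of byte equality checks, which is the point being made: cheap comparators rather than a large lookup table.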
Quote:
cdimauro Quote:

I don't think that the problem is the separated register files: it's the conversion which is required from integer to extended precision, which took time with the 68060. But I'm pretty confident a modernized microarchitecture can greatly improve it.

BTW, the register files separation is only visible at the architecture/ISA level, but the implementation (microarchitecture) can have a unified register file (which makes sense only if the ISA is 64-bit: otherwise there would be too much waste of spaces for the registers).


I am not so sure the 68060 int to fp latency is due to extended precision overhead. A 3-cycle latency extended precision FMUL is possible so the int to fp is more likely a victim of chop, chop, chop less commonly used hardware. On modern silicon, I do not expect much if any difference in performance between extended and double precision fp except for FDIV and FSQRT.

Right. It's Motorola that hasn't invested in making those conversion operations faster.

Consider that normalizing the result is roughly the last stage of each FP operation, and that with the int->FP conversion you don't even need to care about infinities and NaNs (so the implementation is cheaper and faster): you can draw the conclusions yourself...
Quote:
cdimauro Quote:

IMO VAX is easier to decode compared to x86/x64, because it has no prefixes and partially decoding the first byte you already know how many operands the instruction has. Then you can immediately check the next byte and you know if/how many bytes are needed for the EA of this operand, and go to the next one, and so on.
I think that the effort can be comparable to 68k, up to 2 operands, with the advantage of having only single bytes to check.


It should be noted that the decoding of 808x instructions and maybe even 16-bit x86 instructions was likely simpler than for VAX. The cumulative baggage from 8-bit to 16-bit to 32-bit to 64-bit took its toll though.

Yes, but see above: decoding such instructions isn't that complicated.

VAX remains easier to decode compared to x86/x86-64, IMO.

matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 21-Aug-2025 20:51:00
#394 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2818
From: Kansas

cdimauro Quote:

Added three new benchmarks on the usual post:

- CoreMark with ARM Cortex-M0+ vs PIC24, PIC18, RL78;


Not much info on code density but the SoC guide is very good. Results are good as expected for the Cortex-M0+ and look reasonable.

cdimauro Quote:

- CodeSize Benchmark with NanoMIPS vs ARM Cortex / Thumb-2;


An 11%-13% code density improvement over Thumb-2 is a large difference. It would be interesting to compare NanoMIPS and BA2 code densities, both of which are load/store architectures which support 32-bit immediates in a single instruction by adding a 48b/6B VLE size. The 9-stage in-order superscalar NanoMIPS core is claimed to have better performance than an 8-stage in-order Cortex-A53 core and an 11-stage OoO Cortex-R8 core too. This supports the expectation that supporting larger immediates/displacements in larger instructions improves the performance potential, at least with larger cores. The 68k and ColdFire support this and also had good performance. The 68060 and ColdFire likely left performance on the table though with arbitrary 6B instruction limitations, as a CISC reg-mem architecture has another gear in performance with load+op and op+load+store instructions in a single cycle. While 6B instruction support may be optimum for a 32-bit load/store architecture, raising the limitation to 8B instructions for a reg-mem architecture would very likely significantly increase performance. A 64-bit ISA could use 10B support for a 64-bit immediate/displacement, even for a load/store ISA. Large powerful CISC instructions are where much of the x86-64 performance comes from. Amazingly, the 68060 was still outperforming the P5 Pentium despite the 6B superscalar instruction execution limitation; it is clearly a better design and benefits from the better 68k orthogonality, double the GP integer registers and better code density. From looking at performance metrics, x86(-64) does not lead at anything yet became the king of desktop and server performance. The 68k performance metrics look much better, yet the already arbitrarily limited 68060 was castrated down to the ColdFire V5 that only HP may have used.
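As a back-of-envelope comparison of why the 48b/6B size matters, here are byte counts for materializing an arbitrary 32-bit constant. The counts cover the common cases only, and the mnemonics are approximate:

```python
# Bytes to load an arbitrary 32-bit constant into a register,
# illustrating the value of a 48-bit/6-byte encoding size.
COSTS = {
    "MIPS32 (lui + ori)":      8,  # two fixed 32-bit instructions
    "nanoMIPS (48-bit LI)":    6,  # one 48-bit instruction
    "68k (MOVE.L #imm32,Dn)":  6,  # 16-bit opcode + 32-bit immediate
    "x86 (mov r32, imm32)":    5,  # 1 opcode byte + imm32
}
for isa, size in sorted(COSTS.items(), key=lambda kv: kv[1]):
    print(f"{size} bytes  {isa}")
```

A single long instruction also means one issue slot instead of a dependent pair, which is the performance side of the same argument.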

cdimauro Quote:

- Zephyr compilation with RISC-V / PULP vs ARMv7 and ARMv8.


The results are suspect and not much info is given. ARMv8 looked the best which makes me wonder whether Thumb-2 was used on an ARMv8 CPU.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 22-Aug-2025 4:38:24
#395 ]
Elite Member
Joined: 29-Oct-2012
Posts: 4495
From: Germany

@matthey

Quote:

matthey wrote:
cdimauro Quote:

Added three new benchmarks on the usual post:

- CodeSize Benchmark with NanoMIPS vs ARM Cortex / Thumb-2;


An 11%-13% code density improvement over Thumb-2 is a large difference. It would be interesting to compare NanoMIPS and BA2 code densities, both of which are load/store architectures which support 32-bit immediates in a single instruction by adding a 48b/6B VLE size.

I was impressed by those results, and MIPS claims it's the same in several scenarios (so, not only on the presented benchmarks).

Unfortunately, there's nothing that can compare it to BA2.
Quote:
The 9-stage in-order superscalar NanoMIPS core is claimed to have better performance than an 8-stage in-order Cortex-A53 core and an 11-stage OoO Cortex-R8 core too. This supports the expectation that supporting larger immediates/displacements in larger instructions improves the performance potential, at least with larger cores. The 68k and ColdFire support this and also had good performance.

IF this is coming from supporting 32-bit constants (but only on some destructive instructions, like ADD Reg,imm32), then I would have expected the 68k to surpass Thumb-2 (which doesn't support them).

So, I'm wondering what's the key factor which influenced the code density so much on those benchmarks; I'd like to dig deeper to understand it (it could be useful to do the same on my architecture and perhaps on a 68k+).
Quote:
A 64-bit ISA could use 10B support for a 64-bit immediate/displacement, even for a load/store ISA.

Immediates will be automatically 64-bit when Size = 64-bit, but not displacements (do you want to use a reserved value on the scaled addressing modes?).

The absolute addressing mode is limited to 32 bits. But that's less important (worst case, you can add a MOVE instruction only for 64-bit addresses, like I did).
Quote:
Large powerful CISC instructions are where much of the x86-64 performance comes from. Amazingly, the 68060 was still outperforming the P5 Pentium with the 6B superscalar instruction execution limitation but it is clearly a better design and benefits from the better 68k orthogonality, double the GP integer registers and better code density. From looking at performance metrics, x86(-64) does not lead at anything yet became the king of desktop and server performance.

Actually, x86-64 sported the best results.

Quote:
The 68k performance metrics look much better yet the already arbitrarily limited 68060 was castrated down to the ColdFire V5 that only HP may have used.

Which shouldn't be the case with a new microarchitecture.

However, a 68k+ seriously needs non-destructive binary instructions, as we discussed before, purely for performance reasons.
Quote:
cdimauro Quote:

- Zephyr compilation with RISC-V / PULP vs ARMv7 and ARMv8.


The results are suspect and not much info is given. ARMv8 looked the best which makes me wonder whether Thumb-2 was used on an ARMv8 CPU.

No, that was ARMv8-M, which has some 32-bit variants with Thumb-2: https://en.wikipedia.org/wiki/ARM_architecture_family#Armv8-R_and_Armv8-M
Armv8-M does not include any 64-bit AArch64 instructions

michalsc 
Re: The (Microprocessors) Code Density Hangout
Posted on 22-Aug-2025 5:37:24
#396 ]
AROS Core Developer
Joined: 14-Jun-2005
Posts: 464
From: Germany

@cdimauro

Quote:
No, that was ARMv8-M, which has some 32-bit variants with Thumb-2: https://en.wikipedia.org/wiki/ARM_architecture_family#Armv8-R_and_Armv8-M
Armv8-M does not include any 64-bit AArch64 instructions


I wrote matthey some time ago in some other thread that ARMv8-M is a Thumb-2 subset and not AArch64 like ARMv8-A, but he didn't listen ;)

Last edited by michalsc on 22-Aug-2025 at 05:37 AM.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 23-Aug-2025 4:54:17
#397 ]
Elite Member
Joined: 29-Oct-2012
Posts: 4495
From: Germany

@michalsc

Quote:

michalsc wrote:
@cdimauro

Quote:
No, that was ARMv8-M, which has some 32-bit variants with Thumb-2: https://en.wikipedia.org/wiki/ARM_architecture_family#Armv8-R_and_Armv8-M
Armv8-M does not include any 64-bit AArch64 instructions


I wrote matthey some time ago in some other thread that ARMv8-M is a Thumb-2 subset and not the AArch64 like ARMv8-A, but he didn't listen ;)

Maybe it was just a memory problem. For example, I recall the discussions with you and Matt, but I don't remember this specific detail. We're aging...

Or maybe it's ARM which is a mess with the nomenclature of its architectures.

A couple of months ago I (en)joined a seminar with the QNX technical staff, and they reported that this OS is only 64-bit. However, they mentioned ARMv8-M as one of the supported architectures, and I told them that it's 32-bit...

matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 23-Aug-2025 5:56:44
#398 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2818
From: Kansas

cdimauro Quote:

Precisely. IMO this is THE KEY feature of the 68k architecture, which contributes both to the code density and a much better reduction of opcode space usage.

Infineon copied it with its TriCore architecture, allowing to double the number of both registers sets at the cost of just 4 bits per register, while keeping a lot of functionalities.

I've opted for a different approach, albeit strongly inspired by it in some parts.

However, for an architecture like the 68k, I would have rigidly separated the two registers sets (at ISA level, at least), because it would have opened the doors for a consistent and very useful change. Maybe if I create another new, 68k-inspired this time, architecture I'll take the chance to design it in this way.


For a 16-bit base VLE, the other common option is tiered registers, with the lower 3b reg encoding of 16-bit instructions accessing the first 8 registers and 32-bit encodings using 4b reg encodings. This initially appears cleaner as the visible Dn/An register split can disappear, but commonly used special registers like the SP need to be mapped to the low 8 registers, which does not look as clean, leaves fewer registers available, and I do not believe the code density is as good. Some of these ISAs create more instructions like PUSH and POP in the case of the SP, but this reduces orthogonality. The x86-64 ISA is similar but poorly implemented, as a couple of the lower 8 registers are not GP despite having PUSH and POP instructions, leaving only 6 GP registers before needing a prefix to access 8 more GP registers. Fortunately for x86-64, CISC cores with mem-reg and reg-mem memory accesses usually do not need as many registers and load-to-use stalls are avoided, one of the keys to why x86 with 6 GP registers stayed ahead in performance of fat RISC with 32 GP registers like Alpha, MIPS, PPC, etc. Fat RISC code density was another reason. RISC fanatics are slow learners but the failures eventually disappeared, leaving more competitive RISC architectures.

cdimauro Quote:

Unfortunately, there's not much space left here.

One encoding is needed for (d32,An), as you've already stated.
Another for (d32,pc) if you plan to expand it to 64-bit (which, I think, you've already decided: there's no future for an ISA without 64-bit support).
The third encoding is free, and the best use is your compressed displacement. I've another idea about that, but I can't share it yet because I've used it on my architectures (several times on NEx64T; just once on the new one, because there's only one use case where it still made sense to have it), and I think that it could be worth a patent before making it public. I hope you understand.


A (d32,An) encoding is unfortunately not available. All that is possible with the three remaining EA addressing mode encodings is a single implicit register like (d32,PC), (d32,SP) or (d32,BR). The 68k does not consistently use the same data base register, on the Amiga for example, but maybe it would be possible to configure which register is the data base register. I do not believe (d32,SP) is necessary: all offsets are positive and (d16,SP) only supports a positive 32kiB range, but stacks that grow much larger than 32kiB rarely access old data until the stack shrinks. Short encodings can be good here, like (4,SP), which would likely offer a noticeable code density improvement, much as short encodings for the small immediates #1, #-1 and #0 would even without CLR and TST instructions. There are just not enough of these encodings free on the 68k to use for such purposes, but they could be used in similar ISAs or a 68k64 mode.

cdimauro Quote:

In fact: there can be space only for some new 16-bit instructions, or a few existing ones could be "extended" by adding another bit which is free in the opcode. But it sounds quite weird.

You can follow what I've done with my very old 68k-inspired 64-bit architecture (it's 15 years old now): introduce a 4 + 3 = 7 bits EA, but only for the 32-bit encodings (I've kept the 3 + 3 = 6 bits EA for the 16-bit encodings). This will double the number of addressing modes available, opening a lot of opportunities.

BTW, embedding the EA in the base opcode was the second great feature of the 68k ISA, which contributed a lot to the code density. Unfortunately, it takes some space, so it doesn't leave many bits to encode many instructions. But the most important ones are there.
At the end, the existing 6-bit EA encoding can be seen like a "compressed version" of the more general EA.


It is interesting that we have both considered a 7b EA. Your idea allows for tiered addressing modes, where less common addressing modes could be moved to the longer encoding. I also considered different source and destination EA sizes: currently, immediates and PC relative modes can not be used in the destination, but many 68k instructions flip the source and destination, so it may be useful to allow immediates in the destination like I suggested, and opening up PC relative stores like x86(-64) has would improve code density. The PC relative store limitation is kind of like the hard divide between Dn/An registers, where An could be allowed as the source in most 68k instructions, which would provide a small benefit to code density and to the number of instructions executed. The PC relative store limitation exists for academic reasons, and the hard register split is due to a conceptual core design that allows more parallelization with separate address and data units and register files. Maybe the 68000 achieved some benefit from it, but later 68k designs have used unified Dn/An register files.

cdimauro Quote:

Sorry, I recognize now that I was a bit ambiguous.

Of course, regular operations are more common as signed (and that's the reason why almost all instructions on my new architecture are signed; unsigned ones have a U appended at the end, if that's what is wanted).

But regarding the range of integer numbers, the most common are the unsigned ones, and that's what I was referring to previously in this specific context.

Think about handling the characters of strings: those operations are usually with unsigned chars.


Oddly, character args for C functions usually use signed characters, which perhaps was an oversight and sometimes requires casting to unsigned for efficient processing. The upper half of the character range was rarely used in the early days of computing, but making every possible datatype signed was a thing even back then. I wonder how much having to type the long "unsigned" in C had to do with this mentality. Some C programmers made shorter aliases like uchar and uint, but it was not until C99 that this was improved.

cdimauro Quote:

After that I've seen the number of instructions formats on IBM's Z architecture (one of the last CISCs which is left), which reached 6Ghz despite that, and then the number of instructions formats on AArch64, I'm not scared anymore about having a large number of formats to be decoded.

The important thing is to have them well organized / structured and with few exceptions.


lol

cdimauro Quote:

Yes, they are rare, but you don't know if you're on the good or bad (e.g.: additional displacement needed) case until you've decoded the content of the extension(s) word: that's the big problem with the 68k.

And the problem is exacerbated by the fact that instructions using the EA are spread around the opcode space, and it's not easy to catch them all, unless you use a big LUT.

Just to give an example, my last architecture has some hundred GP binary instructions, but I've only two formats for them, and I need to check only a few bits in the opcode to catch all of them (and I also know the instruction length, whether it has displacements/immediates, and where they are located). That's because I've grouped all of them together in those two formats.

One mitigation to this problem is having an architecture which is source-compatible with the 68k, but having the opcodes of all instructions using EAs grouped together in a few places / formats. However, that would be binary-incompatible with the existing software. So, not an option.


It is very quick to check bit #8 of brief and full extension words before processing bits 0-7. Processing bits 0-7 of a full extension word is complex and likely uses a relatively large LUT, which is bad for FPGAs, but the AC68080 is able to do this with no additional EA calc time for full extension addressing modes (bd,An,Xn*SF). The double memory indirect modes take 2 cycles, but this is better than any 68k CPU and reasonable considering there are 2 memory accesses. The 68060 handles the brief extension word with no EA calc penalty while the full extension has a 1 cycle penalty, but the 68060 design is only optimized for the 80% most common code or something along those lines. At least they did not remove the support and use traps.
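For concreteness, here is a small sketch of that bit #8 check and the full-format base displacement decode. Field positions follow the published 68020+ extension word layout; the memory-indirect outer displacement (I/IS field) is left out for brevity.

```python
# 68020+ extension word: bit #8 selects brief vs full format; for the full
# format, the BD SIZE field (bits 5-4) gives the base displacement length.

BD_SIZE_WORDS = {0b01: 0,   # null displacement
                 0b10: 1,   # word (16-bit) displacement
                 0b11: 2}   # long (32-bit) displacement

def ext_word_info(ext):
    """Classify one extension word; returns (format, following_disp_words)."""
    if (ext >> 8) & 1 == 0:
        # Brief format: the 8-bit displacement is embedded in the word itself.
        return ("brief", 0)
    bd_size = (ext >> 4) & 0b11
    if bd_size == 0b00:
        raise ValueError("reserved BD SIZE encoding")
    return ("full", BD_SIZE_WORDS[bd_size])
```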

cdimauro Quote:

That's different and, believe it or not, it's easier.

An x86/x86-64 decoder has a simple logic to check the first bytes of an instruction to see if they are prefixes or not. After that you get a bit mask and you can very quickly check where the opcode byte is located. And that's it.
Seeking for the Mod/RM & SIB bytes is also very simple, and there are a few checks in the bytes that can be performed in parallel with the prefixes checks, generating another bit mask.
Once you've both bit masks, you're done: you know very precisely how the instruction is structured / spread on such sequence of bytes.

The byte comparisons are very cheap, and certainly don't require the very big 16-bit -> 16-bit LUT used on the 68060.


I do not believe 68k decoding is that bad. At least the reg fields are likely not part of the LUT/ROM lookup, and however large the LUT is, its size is likely due to the quick one-shot wide decoding that it makes possible. Also, ROMs require fewer transistors and less power than programmable LUTs like FPGAs use. It is certainly very different from x86(-64) decoding, which requires very fast sequential decoding of small and poorly aligned data. The basic x86(-64) decoding is fairly simple for a single instruction, but parallel instruction decoding is complex due to the many possibilities of earlier instructions, prefixes, many possible alignments and many VLE sizes. An x86-64 instruction has twice as many alignment possibilities as the 68k and 15 possible VLE sizes. A 68k instruction can have up to 22 bytes, but because they are 16-bit segments, there are only 11 possible VLE sizes. If the 68k does not use the brief or full extension formats, which are uncommon and usually known from looking at the first 16 bits of each instruction, integer size decode is easy. Floating point instructions on the 68k add some complexity but nothing like FPU and SIMD decoding on x86-64 with so many standards and instructions. The baggage with x86-64 is real, as Hammer would say.

cdimauro Quote:

I was impressed by those results, and MIPS says that it's the same in several scenarios (so, not only on the presented benchmarks).

Unfortunately, there's nothing that can compare it to BA2.


If the NanoMIPS and BA2 claims are real, I am surprised they could not find enough embedded market niches to survive.

cdimauro Quote:

IF this is coming from supporting 32-bit constants (but only on some destructive instructions, like ADD Reg,imm32), then I was expecting that the 68k surpassed Thumb-2 (which doesn't support them).

So, I'm wondering what's the key factor which influenced the code density so much on those benchmarks; I'd like to dig deeper to understand it (it could be useful to do the same on my architecture and perhaps on a 68k+).


Some of it is compiler support too. The 68k used to compete with Thumb ISAs in code density studies, but the few more modern code density studies that include the 68k usually show it behind the Thumb ISAs. If anything, 68k compiler support has improved a little recently, which is strange considering there are no new 68k ASICs, but perhaps it is due to retro interest. It would be difficult to enhance the 68k to reach 10+% better code density than Thumb-2 as NanoMIPS claims, but 5+% is likely realistic. Maybe hand optimized 68k assembly could reach 10+% though.

cdimauro Quote:

Immediates will be automatically 64-bit when Size = 64-bit, but not displacements (do you want to use a reserved value on the scaled addressing modes?).

The absolute addressing mode is limited to 32 bit. But that's less important (worst case, you can add a MOVE instruction only for 64-bit addresses, like I did).


The full extension word format's 2b field for the BD size has an unused/reserved encoding that could be used for a 64-bit displacement. There is no room for the index register to select 64-bit, but a 64-bit mode could drop support for 16-bit sign extended index registers and replace them with 32-bit sign extended registers; alternatively, bit #3 is unused, making it possible to support all 4 signed integer datatypes if desired, including byte, which is not currently supported.
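As a sketch of that repurposing (purely hypothetical: the 64-bit mode and the reuse of the reserved encoding are assumptions, not anything Motorola defined):

```python
# Hypothetical 68k64 tweak: reuse the reserved BD SIZE encoding (00) for a
# 64-bit base displacement. Lengths are counted in 16-bit extension words.

def bd_extra_words(bd_size, mode64=False):
    table = {0b01: 0,   # null displacement
             0b10: 1,   # 16-bit displacement
             0b11: 2}   # 32-bit displacement
    if bd_size == 0b00:
        if mode64:
            return 4    # assumed: 64-bit displacement in the 64-bit mode
        raise ValueError("reserved BD SIZE encoding in 32-bit mode")
    return table[bd_size]
```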

The full extension word format as is allows absolute addressing modes to be encoded. The expensive LUT provides flexibility at least. I have also considered removing some or all absolute addressing modes and repurposing the encodings for 64-bit. It is probably better to stay more conservative with changes for familiarity to 68k programmers though.

cdimauro Quote:

Actually, x86-64 was sporting the best results:


The competition is limited. Where is the 68k?

Many 68k code density enhancements decrease the number of instructions and sometimes a register used too. Some, like ColdFire MVS/MVZ and PC relative stores, x86(-64) already has.

cdimauro Quote:

No, that was ARMv8-M, which has some 32-bit variants with Thumb-2: https://en.wikipedia.org/wiki/ARM_architecture_family#Armv8-R_and_Armv8-M
Armv8-M does not include any 64-bit AArch64 instructions


I read right over the "-M" on the end of ARMv8-M. ARM Cortex-M cores do indeed still use Thumb ISAs and have better code density than RISC-V RVC. RISC-V can solve this code density deficit but how many ISA extensions will it take?

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 24-Aug-2025 7:25:34
#399 ]
Elite Member
Joined: 29-Oct-2012
Posts: 4495
From: Germany

@matthey

Quote:

matthey wrote:
cdimauro Quote:

Precisely. IMO this is THE KEY feature of the 68k architecture, which contributes both to the code density and a much better reduction of opcode space usage.

Infineon copied it with its TriCore architecture, allowing to double the number of both registers sets at the cost of just 4 bits per register, while keeping a lot of functionalities.

I've opted for a different approach, albeit strongly inspired by it in some parts.

However, for an architecture like the 68k, I would have rigidly separated the two registers sets (at ISA level, at least), because it would have opened the doors for a consistent and very useful change. Maybe if I create another new, 68k-inspired this time, architecture I'll take the chance to design it in this way.


For a 16-bit base VLE, the other common option is tiered registers with the lower 3b reg encoding of 16-bit instructions accessing the first 8 registers and 32-bit encodings using 4b reg encodings. This initially appears cleaner as the visible Dn/An register split can disappear but commonly used special registers like the SP need to be mapped to the low 8 registers which does not look as clean, leaves fewer registers available and I do not believe the code density is as good.

But the SP is one of the most used registers, so it makes sense to have it always accessible with the shortest encodings. That's the reason why it's R0 in my latest ISA.

The other possibility is leaving it out of the regular registers, with proper instructions for changing its content and with proper addressing modes. That's the solution I've implemented on my old "64k" ISA (the 68k-inspired one) to free some address registers (I've also moved the FP out as well, as a separate register).

Both solutions have pros and cons, of course. Deciding which is the best primarily depends on how many special encodings are available in the limited EA (6 bits aren't that many).

Regarding the tiered registers, that's what many architectures did when implementing compact / 16-bit extensions for code density. It works well as long as you can use most of the reduced register set, which covers many cases (but this depends on which kinds of registers are accessible, of course).
Quote:
Some of these ISAs create more instructions like PUSH and POP in the case of the SP but this reduces orthogonality.

That's not a great problem: compact instructions usually aren't orthogonal because of the choices which are needed to achieve the goal.

And it's definitely not a problem with modern compilers. Probably it's more of a burden for developers who write assembly code manually.
Quote:
The x86-64 ISA is similar but poorly implemented as a couple of the lower 8 registers are not GP despite having PUSH and POP instructions leaving only 6 GP registers before needing a prefix to access 8 more GP registers. Fortunately for x86-64, CISC cores with mem-reg and reg-mem memory accesses usually do not need as many registers and load-to-use stalls are avoided, one of the keys why x86 with 6 GP registers stayed ahead in performance of fat RISC with 32 GP registers like Alpha, MIPS, PPC, etc. Fat RISC code density was another reason. RISC fanatics are slow learners but the failures eventually disappeared leaving more competitive RISC architectures.

The failures "disappeared" only because there aren't RISCs anymore: substantially not a single "principle" of the RISC philosophy is still alive (considering the pipeline tricks to avoid the load-to-use stalls, which practically convert pairs of "RISC" instructions into load+op or op+store versions).
Quote:
cdimauro Quote:

Unfortunately, there's no much space left here.

One encoding is needed for (d32,An), as you've already stated.
Another for (d32,pc) if you plan to expand it to 64-bit (which, I think, you've already decided about it: there's no future for an ISA without 64-bit support).
The third encoding is free, and the best use is your compressed displacement. I've another idea about that, but I can't share it yet because I've used on my architectures (several times on NEx64T. Just one time on the new one, because there's only one use case where it still made sense to have it), and I think that it could be worth a patent before making it public. I hope you understand.


A (d32,An) encoding is unfortunately not available. All that is possible with the three remaining EA addressing mode encodings is a single implicit register like (d32,PC), (d32,SP) or (d32,BR). The 68k does not consistently use the same data base register, for example the Amiga, but maybe it would be possible to configure which register is the data base register.

With "data base register" do you mean something like a global base register? If yes, then it might be a good candidate for using one of those encodings.
Quote:
I do not believe (d32,SP) is necessary despite all offsets being positive, (d16,sp) only supporting a positive 32kiB range and stacks sometimes being much larger than 32kiB because old data is rarely accessed until the stack shrinks.

Correct. 32-bit offsets from the SP aren't common enough to justify using one of those precious encodings.
Quote:
Short encodings can be good here like (4,SP) which would likely offer a noticeable code density improvement like small immediates #1, #-1 and #0 without CLR and TST instructions.

I'm not in favour of using one of those encodings only for defining a single constant. I really don't like it, despite the possible code density benefit.
Quote:
There are just not enough of these encodings free on the 68k to use for such purposes but they could be used in similar ISAs or a 68k64 mode.

Then it's better to leave the remaining ones for some better usage.
Quote:
cdimauro Quote:

In fact: there can be space only for some new 16-bit instructions, or a few existing ones that could be "extended" adding another bit which is free in the opcode. But it sounds quite weird.

You can follow what I've done with my very old 68k-inspired 64-bit architecture (it's 15 years old now): introduce a 4 + 3 = 7 bits EA, but only for the 32-bit encodings (I've kept the 3 + 3 = 6 bits EA for the 16-bit encodings). This will double the number of addressing modes available, opening a lot of opportunities.

BTW, embedding the EA on the base opcode was the second great feature of the 68k ISA, which contributed a lot to the code density. Unfortunately, it takes some space, so it doesn't leave much bits for encode many instructions. But the most important ones are there.
At the end, the existing 6-bit EA encoding can be seen like a "compressed version" of the more general EA.


It is interesting that we have both considered a 7b EA. Your idea allows for tiered addressing modes where less common addressing modes could be moved to the longer encoding. I also considered different source and destination EA sizes as currently immediates and PC relative modes can not be used in the destination but many 68k instructions flip the source and destination, it may be useful to allow immediates in the destination like I suggested and opening up PC relative stores like x86(-64) would improve code density. PC relative stores are kind of like the hard divide between Dn/An registers where An could be allowed as the source in most 68k instructions which would provide a small benefit to code density and the number of instructions executed. The PC relative store limitation is for academic reasons

Exactly, and it should have been removed long ago: there is no reason for it to persist, severely restricting the ISA.
Quote:
and the hard register split is due to a conceptual core design that allows more parallelization with separate address and data units and register files. Maybe the 68000 achieved some benefit from it but later 68k designs have used unified Dn/An register files.

I don't know how much it benefitted, because using an address register also as an index means that there weren't really two separate register files.
Quote:
cdimauro Quote:

Sorry, I recognize now that I was a bit ambiguous.

Of course, regular operations are more common as signed (and that's the reason why almost all instructions on my new architecture are signed. Unsigned have a U appended at the end, if that's what it is wanted).

But regarding the range of integer numbers, the most common are the unsigned ones, and that's what I was referring to previously in this specific context.

Think about handling the characters of strings: those operations are usually with unsigned chars.


Oddly, character args for C functions usually use signed characters which perhaps was an oversight and sometimes requiring casting to unsigned for efficient processing.

I don't recall exactly now, but AFAIK "char" isn't signed by default in K&R / ANSI C: whether it is signed or unsigned depends on the compiler. But I might be wrong (it's been too long since I've read those standards).
Quote:
The upper half of characters were rarely used in the early days of computing but making every datatype possible signed was a thing even back then. I wonder how much having to type the long "unsigned" in C had to do with this mentality. Some C programmers made shorter aliases like uchar and uint but it was not until C99 when this was improved.

That's because too many things were left undefined by the standard, and it took far too long to have them set in stone. That was a pity for a language which is supposed to help abstract away the low level, but still kept some of this stuff.
Quote:
cdimauro Quote:

Yes, they are rare, but you don't know if you're on the good or bad (e.g.: additional displacement needed) case until you've decoded the content of the extension(s) word: that's the big problem with the 68k.

And the problem is exacerbated by the fact that instructions using the EA are spread around the opcode space, and it's not easy to catch them all, unless you use a big LUT.

Just to give an example, my last architecture has some hundred GP binary instructions, but I've only two formats for them, and I need to check only a few bits in the opcode to catch all of them (and I also know the instruction length, whether it has displacements/immediates, and where they are located). That's because I've grouped all of them together in those two formats.

One mitigation to this problem is having an architecture which is source-compatible with the 68k, but having the opcodes of all instructions using EAs grouped together in a few places / formats. However, that would be binary-incompatible with the existing software. So, not an option.


It is very quick to check bit #8 of brief and full extension words before processing bits 0-7. Processing bits 0-7 for a full extension word is complex and likely uses a relatively large LUT which is bad for FPGAs but the AC68080 is able to do this with no additional EA calc time for full extension addressing modes (bd,An,Xn*SF).

An 8-bit LUT isn't so large with an FPGA, so I think that the 68060 could have used it.

However, also for the main opcode, probably a different approach (proper comparisons) was used. A 16-bit -> 16-bit LUT is certainly not feasible for an FPGA.
Quote:
The double memory indirect modes take 2 cycles but this is better than any 68k CPU and reasonable considering there are 2 memory accesses.

That's not so important regarding the pure decoding phase.
Quote:
The 68060 handles the brief extension word with no EA calc penalty while the full extension has a 1 cycle penalty but the 68060 design is only optimized for the 80% most common code or something along those lines. At least they did not remove the support and use traps.

It was a good compromise at the time, but nowadays there are resources to do much better, even keeping it a 2-way in-order design and maintaining a small core (not taking the caches into account, of course).
Quote:
cdimauro Quote:

That's different and, believe it or not, it's easier.

An x86/x86-64 decoder has a simple logic to check the first bytes of an instruction to see if they are prefixes or not. After that you get a bit mask and you can very quickly check where the opcode byte is located. And that's it.
Seeking for the Mod/RM & SIB bytes is also very simple, and there are a few checks in the bytes that can be performed in parallel with the prefixes checks, generating another bit mask.
Once you've both bit masks, you're done: you know very precisely how the instruction is structured / spread on such sequence of bytes.

The bytes comparisons are very cheap, and certainly don't require the very big 16bit -> 16-bit LUT used on the 68060.


I do not believe 68k decoding is that bad. At least the reg fields are likely not part of the LUT/ROM lookup, and however large the LUT is, its size is likely due to the quick one-shot wide decoding that it makes possible.

The excerpt that you've reported was talking about a 16-bit -> 16-bit LUT, so the reg fields were certainly included. That's because there were some exceptions on the EA, and there were also 16-bit implicit instructions to take into account.
Quote:
Also, ROMs require fewer transistors and less power than programmable LUTs like FPGAs use.

Right. I'm not talking about FPGAs here: my reference is always an ASIC/RTL design.
Quote:
It is certainly very different than x86(-64) decoding which requires very fast sequential decoding of small and poorly aligned data.

The alignment isn't relevant anymore at this stage of the pipeline.

In fact, the scanned bytes / words are already aligned. We're talking about the next instruction to be decoded, so we know its PC / address, and its bytes/words have already been fetched and shifted to the beginning of the 16-byte buffer (for the usual x86 designs) which is used by the decoder.
Quote:
The basic x86(-64) decoding is fairly simple for a single instruction but parallel instruction decoding is complex due to the many possibilities of earlier instructions, prefixes, many possible alignments and many VLE sizes.

That's really not the case, because the decoding logic which I've explained before is very simple and scales easily with the number of decoded instructions. The only relevant thing is how big the buffer for scanning the bytes is. Assuming that it's 16 bytes, when the decoders start, the process is always the same regardless of the number of instructions to decode.

So, very simple 8-bit -> 1-bit LUTs (or comparators) are used to generate the 16-bit mask which marks the presence of prefixes.
In parallel, a few 8-bit -> 1-bit LUTs (very likely comparators, since there are only a few patterns to be checked) are used to understand if there's a Mod/RM or SIB, generating another 16-bit mask.
Again in parallel, comparators are used to determine if a Mod/RM has a displacement and, if so, its size (0, 1, 2 or 4 bytes). They generate a 2-bit x 16 mask.
An 8-bit -> 1-bit LUT is used to generate a 16-bit mask signalling whether an instruction has a Mod/RM or not.
An 8-bit -> 2-bit LUT is used to generate a 2 x 16-bit mask signalling whether an instruction has an immediate and, if so, its size (0, 1, 2 or 4 bytes).

All those comparisons can be done in parallel, require only very simple 8-bit LUTs or even comparators, and generate a set of 16-entry bitmasks that will be used by the following stage to seek the beginning of the opcode and then immediately determine (by combining all mask entries starting at that position) the length of the instruction.
This process can be parallelized as well, because all the bitmasks were already prepared in the previous stage, and it's just a matter of properly combining them and seeking the beginning/end of each instruction.
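The walk over those classifications can be simulated in software. The sketch below serialises what the hardware classifiers do in parallel and covers only a hand-picked subset of one-byte opcodes in 32-bit mode (two-byte opcodes and the SIB base=101 disp32 special case are omitted), so it illustrates the principle rather than a real decoder:

```python
# Minimal x86 (32-bit mode) length decode over a byte buffer. The per-byte
# membership tests stand in for the parallel 8-bit LUTs/comparators above.

PREFIXES = {0x66, 0x67, 0xF0, 0xF2, 0xF3,        # operand/addr size, LOCK, REP
            0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65}  # segment overrides

# opcode -> (has_modrm, immediate_size_in_bytes); tiny illustrative subset
OPCODES = {
    0x90: (False, 0),   # NOP
    0x89: (True, 0),    # MOV r/m32, r32
    0x8B: (True, 0),    # MOV r32, r/m32
    0x05: (False, 4),   # ADD EAX, imm32
    0xC3: (False, 0),   # RET
}

def modrm_disp_bytes(modrm):
    mod, rm = modrm >> 6, modrm & 7
    if mod == 0:
        return 4 if rm == 5 else 0   # mod=00, rm=101: disp32
    if mod == 1:
        return 1                     # disp8
    if mod == 2:
        return 4                     # disp32
    return 0                         # mod=11: register operand

def insn_length(buf, pos=0):
    """Length in bytes of one instruction starting at buf[pos]."""
    start = pos
    while buf[pos] in PREFIXES:      # prefix mask
        pos += 1
    has_modrm, imm = OPCODES[buf[pos]]
    pos += 1
    if has_modrm:
        modrm = buf[pos]
        pos += 1
        if (modrm >> 6) != 3 and (modrm & 7) == 4:
            pos += 1                 # SIB byte (base=101 disp32 case omitted)
        pos += modrm_disp_bytes(modrm)
    return pos - start + imm
```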
Quote:
An x86-64 instruction has twice as many alignment possibilities as the 68k and 15 possible VLE sizes. A 68k instruction can have up to 22 bytes, but because they are 16-bit segments, there are only 11 possible VLE sizes.

How many VLE sizes are used / possible is totally irrelevant, because the process to decode the instructions depends only on the length of the instruction buffer and on how many decoders we have.

In this case, x86/x64's 15-byte maximum length and the 68k's 11-word maximum length are also equivalent (4 bits are needed to encode the instruction length).
Quote:
If the 68k does not use the brief or full extension formats, which are uncommon and usually known from looking at the first 16 bits of each instruction, integer size decode is easy.

The problem is that it has to use them, because instructions need to be partially decoded to understand whether they have an EA or not, or even 2 EAs. After that, you need to know whether each specific EA uses an extension word. And then you've to partially decode that to see if there's an extra displacement.

You can mimic the above process used for x86/x64 by having a 10-bit -> 1-bit LUT generate an 8-bit mask (assuming that you've a 16-byte instruction buffer) to signal if an instruction has an EA or not. But you also need to handle the special case of two EAs.
Then you can use some comparators (because the logic is simple enough to avoid a 6-bit -> 1-bit LUT) to generate another 8-bit bitmask signalling which EAs have an extension word.
You also need some LUTs or comparators (if the logic is simple) to generate a 2-bit x 8 mask that signals whether the extension word has a displacement and its size (0, 2 or 4 bytes).
Then you combine all those masks to understand how long the instruction is.

I've left out the special cases (including the fact that there are 16 and 32-bit opcodes) and the implicit instructions for now, to simplify the discussion.
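A similar sketch for the 68k side, counting the extension words one EA contributes to the instruction length. It assumes the one-word brief index format and the classic 68000 mode map; the full-format cases with extra displacements are exactly the complication discussed above.

```python
# Extension words contributed by one 68000 effective address, given the
# 3-bit mode and register fields and the operand size in bytes.
# Simplification: index modes are assumed to use the one-word brief format.

def ea_extra_words(mode, reg, size_bytes=2):
    if mode <= 4:                 # Dn, An, (An), (An)+, -(An)
        return 0
    if mode == 5:                 # (d16,An)
        return 1
    if mode == 6:                 # (d8,An,Xn): brief extension word
        return 1
    # mode 7: absolute / PC-relative / immediate, selected by the reg field
    return {0: 1,                 # (xxx).W
            1: 2,                 # (xxx).L
            2: 1,                 # (d16,PC)
            3: 1,                 # (d8,PC,Xn)
            4: max(1, size_bytes // 2),  # #imm: byte and word use one word
            }[reg]
```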
Quote:
Floating point instructions on the 68k add some complexity

Not that much: they are part of the 32-bit base opcodes.
Quote:
but nothing like FPU

x87 instructions are very very simple to decode, and they are already covered by the above process which I've described.
Quote:
and SIMD decoding on x86-64 with so many standards and instructions.

Not that much here as well. Everything up to SSE is already fully covered by the above process.

AVX requires catching the "longer prefixes" (2 and 3 bytes) and AVX-512 an even longer prefix (4 bytes), but those cases are handled by a proper 8-bit -> 1-bit LUT or a very simple comparator (it's just those three specific prefixes).
Quote:
The baggage with x86-64 is real, as Hammer would say.

It's absolutely real, but it's roughly fixed. And it's the same for the 68k. The number of transistors used for handling the baggage has been practically constant for a very long time.
Quote:
cdimauro Quote:

I was impressed by those results, and MIPS talks that it's the same in several scenarios (so, not only on the presented benchmarks).

Unfortunately, there's nothing that can compare it to BA2.


If the NanoMIPS and BA2 claims are real, I am surprised they could not find enough embedded market niches to survive.

Same opinion. I would not have abandoned my golden goose if the numbers were really as reported.
Quote:
cdimauro Quote:

IF this is coming from supporting 32-bit constants (but only on some destructive instructions, like ADD Reg,imm32), then I was expecting that the 68k surpassed Thumb-2 (which doesn't support them).

So, I'm wondering what's the key factor which influenced the code density so much on those benchmarks; I'd like to dig deeper to understand it (it could be useful to do the same on my architecture and perhaps on a 68k+).


Some of it is compiler support too. The 68k used to compete with Thumb ISAs in code density studies but the few more modern code density studies with the 68k usually show the 68k to be behind Thumb ISAs. If anything, 68k compiler support has improved a little recently which is strange considering there are no new 68k ASICs but perhaps due to retro interest. It would be difficult to enhance the 68k to reach 10+% better code density than Thumb-2 as NanoMIPS claims but 5+% is likely realistic. Maybe hand optimized 68k assembly could reach 10+% though.

But it's very rarely used in the real world (I mean: NOT demos and similar things).

Better to focus on modern compilers where a lot can be gained.
Quote:
cdimauro Quote:

Immediates will be automatically 64-bit when Size = 64-bit, but not displacements (do you want to use a reserved value on the scaled addressing modes?).

The absolute addressing mode is limited to 32 bits. But that's less important (worst case, you can add a MOVE instruction only for 64-bit addresses, like I did).


The full extension word format's 2-bit field for the BD size has an unused/reserved encoding that could be used for a 64-bit displacement.

Then this is the right place.
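To make it concrete, here is a small C sketch of that BD SIZE field (bits 5-4 of the 68020 full extension word), with the reserved %00 encoding repurposed for a 64-bit base displacement; the repurposing itself is, of course, only the proposal under discussion:

```c
#include <stdint.h>

/* Sketch of the 68020 full extension word's BD SIZE field (bits 5-4).
 * %00 is reserved on real silicon; here it is hypothetically reused
 * for a 64-bit base displacement. Returns displacement size in bytes. */
static int bd_size_bytes(uint16_t ext)
{
    switch ((ext >> 4) & 3) {
    case 0: return 8; /* was reserved; proposed 64-bit displacement */
    case 1: return 0; /* null displacement */
    case 2: return 2; /* word displacement */
    case 3: return 4; /* long displacement */
    }
    return -1;        /* unreachable */
}
```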
Quote:
There is no room for the index register to select 64-bit, but a 64-bit mode could drop support for 16-bit sign-extended index registers and replace them with 32-bit sign-extended registers; alternatively, bit #3 is unused, making it possible to support all 4 signed integer datatypes if desired, including byte, which is not currently supported.

I don't see any better use for it, so probably it's good to be used for defining the index size.
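Purely as an illustration (the bit assignment below is my own, not anything specified), combining the existing W/L bit (bit 11) with the unused bit #3 yields all four signed index sizes:

```c
#include <stdint.h>

/* Hypothetical encoding: W/L bit (bit 11) plus the unused bit 3 of
 * the full extension word select among four signed index sizes.
 * Returns the index register size in bytes. */
static int index_size_bytes(uint16_t ext)
{
    int code = (((ext >> 11) & 1) << 1) | ((ext >> 3) & 1);
    switch (code) {
    case 0: return 1; /* byte index (new) */
    case 1: return 2; /* word index (existing) */
    case 2: return 4; /* long index (existing) */
    case 3: return 8; /* quad index (new) */
    }
    return -1;        /* unreachable */
}
```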
Quote:
The full extension word format as-is allows encoding the absolute addressing modes, and the expensive LUT at least provides flexibility. I have also considered removing some or all absolute addressing modes and repurposing the encodings for 64-bit. It is probably better to stay more conservative with changes, though, for familiarity to 68k programmers.

IMO it's not required, after the above clarification. All such problems are solved and the ISA is ready to be extended to 64-bit.
Quote:
cdimauro Quote:

Actually, x86-64 was sporting the best results:


The competition is limited. Where is the 68k?

No data available, unfortunately. But it can be obtained by reproducing the same benchmarks.
Quote:
Many 68k code density enhancements decrease the number of instructions, and sometimes a register used, too. Some, like ColdFire MVS/MVZ and PC-relative stores, x86(-64) already has.

More can come with the suggestions that I've given.
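For readers who don't know them: MVS/MVZ are single-instruction sign- and zero-extending moves (x86-64's MOVSX/MOVZX are the direct analogues). In C terms their effect is simply:

```c
#include <stdint.h>

/* C equivalents of ColdFire MVS.B / MVZ.B: a load that sign- or
 * zero-extends to 32 bits in one instruction, instead of a plain
 * move followed by a separate extend instruction. */
static int32_t  mvs_b(const int8_t *p)  { return (int32_t)*p; }  /* MVS.B */
static uint32_t mvz_b(const uint8_t *p) { return (uint32_t)*p; } /* MVZ.B */
```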
Quote:
cdimauro Quote:

No, that was ARMv8-M, which has some 32-bit variants with Thumb-2: https://en.wikipedia.org/wiki/ARM_architecture_family#Armv8-R_and_Armv8-M
Armv8-M does not include any 64-bit AArch64 instructions


I read right over the "-M" on the end of ARMv8-M. ARM Cortex-M cores do indeed still use Thumb ISAs and have better code density than RISC-V RVC. RISC-V can solve this code density deficit but how many ISA extensions will it take?

It's not able to solve it, even after combining all the ratified extensions and the proposed ones.

To me, this architecture has no chance of even reaching Thumb-2.

P.S. No time to read again, and I've other things to do today.

bhabbott 
Re: The (Microprocessors) Code Density Hangout
Posted on 24-Aug-2025 13:44:54
#400 ]
Cult Member
Joined: 6-Jun-2018
Posts: 563
From: Aotearoa

@cdimauro

Quote:

cdimauro wrote:

An 11%-13% code density improvement over Thumb-2 is a large difference.

No, it's trifling.

Quote:
Actually, x86-64 was sporting the best results:

Actually there's nothing in it (even assuming 'instruction count' has any relevance).

In the real world bloat swamps these trifling differences. And nobody cares. 64-bit means no limits and no reason to rein in the bloat.


Copyright (C) 2000 - 2019 Amigaworld.net.
Amigaworld.net was originally founded by David Doyle