Forum Index / General Technology (No Console Threads) / The (Microprocessors) Code Density Hangout
BigD 
Re: The (Microprocessors) Code Density Hangout
Posted on 27-Sep-2021 22:23:26
#81
Elite Member
Joined: 11-Aug-2005
Posts: 7307
From: UK

@Thread

Mr 68k says ....

_________________
"Art challenges technology. Technology inspires the art."
John Lasseter, Co-Founder of Pixar Animation Studios

 Status: Offline
Profile     Report this post  
cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 10-Sep-2022 7:56:06
#82
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

Updated Motorola 68K post.

Added NEx64t post.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 18-Sep-2022 5:02:27
#83
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

Added Benchmarks post.

matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 18-Sep-2022 20:55:00
#84
Super Member
Joined: 14-Mar-2007
Posts: 1968
From: Kansas

cdimauro Quote:

Added Benchmarks post.


It's amazing how the 68k could go from the best code density to merely better-than-average code density. All it takes are cobwebs in GCC, people forgetting that -fomit-frame-pointer is necessary on the 68k and x86 to get rid of bloating frame pointers (which most modern ISAs, and vbcc, have turned off by default), small data likely never being considered, etc. The 68k used to be practically tied with Thumb2 for GCC compiles of the SPEC2006 benchmark suite.

https://www.researchgate.net/publication/221306454_SPARC16_A_new_compression_approach_for_the_SPARC_architecture

I expect the 68k had more individual programs in the benchmark suite that were smaller than Thumb2, letting it come out on top for the geometric mean, while one or two of the large programs were significantly larger, allowing Thumb2 to be smaller for all programs combined. The GCC 68k backend had already declined significantly by the time of this paper, while ARM was on top of the embedded world. At least the 68k had a good showing, especially considering the less-than-efficient 1979 68k ABI, which still passes args on the stack.
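The geometric-mean vs. total-size distinction can be made concrete with a toy calculation (the sizes below are invented purely for illustration, not taken from the paper):

```python
# Illustrative only: the byte counts are made up to show how an ISA can
# win the geometric mean of per-program sizes while losing on total size.
from math import prod

def geomean(xs):
    return prod(xs) ** (1.0 / len(xs))

# Hypothetical sizes for four programs: the 68k wins three small ones
# but loses badly on one large program.
m68k   = [100, 100, 100, 5000]
thumb2 = [110, 110, 110, 4000]

ratios = [a / b for a, b in zip(m68k, thumb2)]
print(geomean(ratios) < 1.0)     # True: 68k wins on geometric mean...
print(sum(m68k) > sum(thumb2))   # True: ...but Thumb2 wins on total bytes
```

This is the effect described above: many small wins dominate the geometric mean, while one large loss dominates the combined size.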

Last edited by matthey on 18-Sep-2022 at 08:55 PM.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 19-Sep-2022 5:07:15
#85
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

@matthey: thanks. I'll add this as well once I have some time.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 19-Sep-2022 20:48:36
#86
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

Updated Benchmarks post with the results from the above research paper.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 19-Sep-2022 21:07:13
#87
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

@matthey

Quote:

matthey wrote:
cdimauro Quote:

Added Benchmarks post.


It's amazing how the 68k could go from the best code density to merely better-than-average code density. All it takes are cobwebs in GCC, people forgetting that -fomit-frame-pointer is necessary on the 68k and x86 to get rid of bloating frame pointers (which most modern ISAs, and vbcc, have turned off by default), small data likely never being considered, etc.

I've written a comment on the "The Totally Unscientific Code Density Competition!" blog and sent an email to "Code Density Compared Between Way Too Many Instruction Sets" asking them to recompile the applications with this command-line parameter. I hope we'll get some new results.
Quote:
The 68k used to be practically tied with Thumb2 for GCC compiles of the SPEC2006 benchmark suite.

https://www.researchgate.net/publication/221306454_SPARC16_A_new_compression_approach_for_the_SPARC_architecture

Results added to the benchmark post.
Quote:
I expect the 68k had more individual programs in the benchmark suite that were smaller than Thumb2, letting it come out on top for the geometric mean, while one or two of the large programs were significantly larger, allowing Thumb2 to be smaller for all programs combined.

Which looks strange. It would be good to know which programs those were and investigate why it happened: there might be room for better compiler optimizations for the 68k.
Quote:
The GCC 68k backend had already declined significantly by the time of this paper while ARM was on top of the embedded world. At least the 68k had a good showing also considering the less than efficient 1979 68k ABI which is still passing args on the stack.

Indeed.

matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 19-Sep-2022 22:50:29
#88
Super Member
Joined: 14-Mar-2007
Posts: 1968
From: Kansas

cdimauro Quote:

I've written a comment on the "The Totally Unscientific Code Density Competition!" blog and an email to "Code Density Compared Between Way Too Many Instruction Sets" asking to recompile the applications with this command line parameter. I hope that we could have some new results.


Good luck getting a response. Most people don't care about the old 68k and x86 architectures. Even x86(-64) fans often don't care, as fatter x86 benchmark programs make the poor x86-64 code density look more acceptable. There is a reason why x86 programs sometimes loaded faster and performed better despite having half the GP registers and passing args on the stack.

cdimauro Quote:

Results added to the benchmark post.

Which looks strange. It would be good to know those programs and investigate why it happened: there might be room for better compiler optimizations for the 68k.


There is plenty of room for 68k code density improvements in an old compiler and elsewhere. Most compilers share the backend between the 68k and ColdFire. Just enabling 3 ColdFire instructions, MVS, MVZ and MOV3Q, which have no conflicting 68k encodings, should improve code density by several percent with no other changes. MOV3Q saves a lot when compiling with Os optimization, as in the benchmark suite, because function inlining is greatly reduced, giving greater savings from the smaller code to pop stack args. A better ABI would save more, but we generally can't change an ABI that is already in use, for compatibility reasons (Amiga library functions use register args though). It could and should be changed, as x86-64 did, for new code using a new 68k 64-bit ISA. I don't think it would be difficult for a 68k ISA to be improved enough to consistently beat Thumb2 in 32-bit code density benchmarks, but the low-hanging fruit is 64-bit code density, where the limited competition is vulnerable to being trounced.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 20-Sep-2022 5:10:20
#89
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

@matthey

Quote:

matthey wrote:
cdimauro Quote:

I've written a comment on the "The Totally Unscientific Code Density Competition!" blog and an email to "Code Density Compared Between Way Too Many Instruction Sets" asking to recompile the applications with this command line parameter. I hope that we could have some new results.


Good luck getting a response.

At least I've tried.

I'll monitor the blog because there's no notification option, unfortunately.

For the webpage I've already got a reply, but it isn't from the author, rather from the web admin. The author is unknown; I mean, there's no contact. Searching around I've found this Twitter profile: https://twitter.com/ArcaneSciences but I have no account, so I cannot contact them.
Quote:
Most people don't care about the old 68k and x86 architectures. Even x86(-64) fans often don't care as fatter x86 benchmark programs make the poor x86-64 code density more acceptable.

Probably.

BTW, x86-64 also often has prologues & epilogues on functions/subroutines, so the situation might improve a bit by omitting frame pointer generation.
Quote:
There is a reason why x86 programs sometimes loaded faster and performed better despite having half the GP registers and passing args on the stack.

cdimauro Quote:

Results added to the benchmark post.

Which looks strange. It would be good to know those programs and investigate why it happened: there might be room for better compiler optimizations for the 68k.


There is plenty of room for 68k code density improvements in an old compiler and elsewhere.

People should start using Bebbo's GCC instead of the plain vanilla one.
Quote:
Most compilers share the backend between the 68k and ColdFire. Just enabling 3 ColdFire instructions MVS, MVZ and MOV3Q, which have no conflicting 68k encodings, should improve code density by several percent with no other changes.

MOV3Q could make a difference. I've implemented something similar (but more general and with broader immediates) on NEx64T and it compressed the code A LOT on both x86 and x64.

MVS and MVZ don't make that much difference. From my x86 & x64 statistics they aren't used that much (especially on x86): https://www.appuntidigitali.it/18362/statistiche-su-x86-x64-parte-8-istruzioni-mnemonici/
Quote:
MOV3Q saves a lot when compiling with Os optimization like in the benchmark suite as function inlining is greatly reduced giving a greater savings for smaller code to pop stack args. A better ABI would save more but we can't generally change the ABI that is already in use for compatibility reasons (Amiga library functions use register args though). It could and should be changed like x86-64 for new code using a new 68k 64 bit ISA though. I don't think it would be difficult for a 68k ISA to be improved enough to consistently beat Thumb2 in 32 bit code density benchmarks but the low hanging fruit is 64 bit code density where the limited competition is vulnerable to being trounced.

I agree, and a 68k 64-bit ISA could be the right moment for changing the ABI.

Karlos 
Re: The (Microprocessors) Code Density Hangout
Posted on 20-Sep-2022 12:43:25
#90
Elite Member
Joined: 24-Aug-2003
Posts: 4394
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@thread

Do you guys have any good statistics for the percentage of typical code that's made up of conditional branch instructions? Just asking... for a friend, like..

_________________
Doing stupid things for fun...

matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 20-Sep-2022 17:58:39
#91
Super Member
Joined: 14-Mar-2007
Posts: 1968
From: Kansas

cdimauro Quote:

People should start using Bebbo's GCC instead of the plain vanilla one.


Bebbo's changes should be reviewed and merged into the official GCC. The GCC developers don't care though. Recall that the 68k backend was almost removed until a bounty raised enough money for a developer to add the minimum improvements to keep it up to date. That says a lot about the GCC developers' attitude toward the 68k, which is no respect. There are compiler developers here and there who treat the 68k with respect even without 68k roots. A good example is Min-Yih Hsu, who works on the LLVM 68k backend although he is too young to have lived through the 68k era. Some people think the 68k should be well supported because of its history and retro appeal.

cdimauro Quote:

MOV3Q could make a difference. I've implemented something similar (but more general and with broader immediates) on NEx64T and it compressed A LOT the code on both x86 and x64.


We decided to use a more general immediate compression instead of MOV3Q too, although MOV3Q saves 2 more bytes where it can be used.

MOV3Q      // saves 4 bytes on MOVE.L of immediates -1, 1-7
OP.L #d16  // saves 2 bytes on OP.L EA of immediates -32768..32767
OP.L #d32  // most forms are 6 bytes (2 for instruction + 4 for d32)

I believe the Apollo core ISA still uses my immediate compression idea, which uses an unused addressing mode encoding in EAs. It's really good, as basic forms of OP.L with a source EA, and MOVE.L EA,dst with any destination, can be compressed, and signed 16-bit immediates give broad coverage. It could be used for OP.Q instructions in a 64-bit ISA as well.
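The size accounting in the table above can be sketched as a small helper (the function name and the 2-byte base-opcode assumption are mine, for illustration only, not part of any real encoder):

```python
# Hypothetical sketch of the compressed-immediate size accounting above:
# a basic OP.L opcode word is assumed to be 2 bytes, a compressed #d16
# immediate adds 2 bytes, and a full #d32 immediate adds 4 bytes.

def op_l_size(imm: int) -> int:
    """Bytes for OP.L #imm,EA using the compressed #d16 form when it fits."""
    if -32768 <= imm <= 32767:
        return 2 + 2   # OP.L #d16: 2-byte opcode + 2-byte immediate
    return 2 + 4       # OP.L #d32: 2-byte opcode + 4-byte immediate

print(op_l_size(1000))     # 4 bytes: fits the signed 16-bit immediate
print(op_l_size(100000))   # 6 bytes: needs the full 32-bit immediate
```

Since small immediates dominate real code, most OP.L occurrences would take the 4-byte form, which is where the broad coverage claimed above comes from.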

cdimauro Quote:

MVS and MVZ doesn't make so much difference. From my x86 & x64 statistics they aren't that much used (especially on x86): https://www.appuntidigitali.it/18362/statistiche-su-x86-x64-parte-8-istruzioni-mnemonici/


The x86(-64) equivalents are MOVSX and MOVZX, which I recall are something like 6 bytes? They may not even be used when compiling with Os optimization. The most common ColdFire versions of MVS and MVZ are only 2 bytes (4 bytes with an immediate, which is less useful). They only save 2 bytes per occurrence, but they can be used everywhere and anytime due to their small size. Vasm does peephole optimizations using them for the ColdFire (I suggested them in anticipation of adding MVS/MVZ to the 68k). With that said, code folding can be used to fold MOVEQ+MOVE.B/W into MVZ and MOVE.B/W+EXT into MVS, but there are other places where the instructions could be used. It is still debatable whether they are worth the encoding space with code folding, but they would improve code density by a few percent on average and give better ColdFire compatibility.

cdimauro Quote:

I agree, and a 68k 64-bit ISA could be the right moment for changing the ABI.


It's amazing that the 68k performs as well as it does and has the code density it has with that old, inefficient UNIX ABI liability.

Karlos Quote:

Do you guys have any good statistics for the percentage of typical code that's made up of conditional branch instructions? Just asking... for a friend, like..


As I recall, the general rule of thumb is that 1 in 5 instructions is a branch. There are several documents with instruction frequencies where the branch percentage is given or can be calculated. The PowerPC Compiler Writer's Guide gives 22.1% branch instructions for the integer SPEC92 benchmarks and 9.2% for the floating-point SPEC92 benchmarks (see figures C-1 and C-2 on page 188).

The PowerPC Compiler Writer's Guide
https://cr.yp.to/2005-590/powerpc-cwg.pdf

There are newer sources, like RISC-V compiles of SPEC2006, where instruction frequencies are given but you will have to calculate the branch frequencies yourself (compiled with Os optimization).

Enhancing the RISC-V Instruction Set Architecture
https://project-archive.inf.ed.ac.uk/ug4/20191424/ug4_proj.pdf

Cdimauro did an article on instruction frequencies for x86 and x86-64 and could probably give you the link and the branch frequency. CISC should give a higher percentage of branches, as it has fewer bloat instructions than RISC. I made a spreadsheet of Dr. Vince Weaver's code density competition of size-optimized code (like compiling with Os optimization).

https://docs.google.com/spreadsheets/d/e/2PACX-1vTyfDPXIN6i4thorNXak5hlP0FQpqpZFk2sgauXgYZPdtJX7FvVgabfbpCtHTkp5Yo9ai6MhiQqhgyG/pubhtml?gid=909588979&single=true

For the Linux Logo executable, which is common integer code, the percentage of branch instructions for various architectures comes out to something like the following.

68k - 29%
Thumb2 - 22%
RISCV32IMC - 28%
Thumb1 - 21%
RISCV64IMC - 28%
AArch64 - 26%
ARM EABI - 26%
PowerPC - 26%
SH-3 - 23%
SPARC - 24%
x86 - 27%
MIPS - 27%
x86-64 - 26%

The branch rule of thumb seems to apply more to ARM Thumb. The PowerPC Compiler Writer's Guide gave 22.1% branches for integer code, while this small program has 26% branches, which seems reasonable enough. I believe the 68k branch percentage is explained by the 68k needing so few instructions, leaving branch instructions as a higher percentage. Compressed RISC encodings significantly increase the number of instructions to obtain their code density, so the percentage of branches decreases. This is particularly obvious for Thumb and SH-3. One study found the instruction count of Thumb code was increased by 30%.
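The dilution effect is simple arithmetic; using the figures from this post (26% branches, ~30% instruction-count inflation for Thumb):

```python
# If the number of branches is fixed by the program's control flow while
# the total instruction count inflates (as with compressed RISC
# encodings), the branch *percentage* falls even though nothing about
# the control flow changed.
branches = 26             # branch instructions per 100 ARM instructions
total = 100
inflated = total * 1.30   # ~30% more instructions, per the Thumb study
print(round(100 * branches / inflated))   # 20 (% branches after inflation)
```

So a ~26% branch rate diluted by 30% extra instructions lands near the ~21-22% measured for Thumb above.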

Efficient Use of Invisible Registers in Thumb Code
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.208&rep=rep1&type=pdf Quote:

More than 98% of all microprocessors are used in embedded products, the most popular 32-bit processors among them being the ARM family of embedded processors. The ARM processor core is used both as a macrocell in building application specific system chips and standard processor chips. In the embedded domain, in addition to having good performance, applications must execute under constraints of limited memory. ARM supports dual width ISAs that are simple to implement and provide a tradeoff between code size and performance. In prior work we studied the characteristics of ARM and Thumb code and showed that for some embedded applications the Thumb code size was 29.8% to 32.5% smaller than the corresponding ARM code size. However, it was also observed that there was an increase in instruction counts for Thumb code which was typically around 30%. We studied the instruction sets and then compared the Thumb and ARM code versions to identify the causes of performance loss. The reasons we identified fall into two categories: Global inefficiency - Global inefficiency arises due to the fact that only half of the register file is visible to most instructions in Thumb code. Peephole inefficiency - Peephole inefficiency arises because pairs of Thumb instructions are required to perform the same task that can be performed by individual ARM instructions.


Thumb2 was a big improvement over the Thumb and SuperH 16-bit-only encodings, which didn't have enough room for immediates, displacements and GP register fields. Supporting 16 and 32 bit encodings allowed nearly 16 GP registers and larger immediates and displacements, but instruction counts were still elevated compared to fixed 32-bit RISC encodings (even compared to the classic ARM 32-bit ISA, also with ~16 GP registers). The 68k instruction counts are actually lower than all but AArch64. The 68k doesn't pay the RISC code density tax in the big 3 performance metrics, which are instruction counts, memory/cache accesses and branches. Its branch percentages are elevated, but only because the RISC fluff instructions are eliminated. All those extra RISC instructions with added dependencies have to be executed at an increased rate while avoiding the instruction fetch bottleneck caused by the increased instruction count.

Last edited by matthey on 20-Sep-2022 at 06:03 PM.

Karlos 
Re: The (Microprocessors) Code Density Hangout
Posted on 20-Sep-2022 18:13:30
#92
Elite Member
Joined: 24-Aug-2003
Posts: 4394
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@matthey

That's interesting. I'm considering changing the MC64K bytecode to make conditional branches a sub-opcode. This would free up a significant number of main opcodes to use for 2-byte register to register operations, at the cost of making branches slightly longer and slower.

There already is a subset of "fast path" register to register instruction encodings that gain speed by sidestepping the need to evaluate any effective addresses. However these are implemented as sub opcodes themselves, meaning they need 3 bytes each. Making these primary opcodes and removing the current subopcode indirection would make them significantly faster to interpret.
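The tradeoff Karlos describes can be sketched with a minimal two-level vs. one-level decode (the opcode values and byte layouts below are entirely made up for illustration, not the real MC64K encoding):

```python
# Hypothetical bytecode layouts (NOT the real MC64K encoding) showing why
# promoting fast-path ops to primary opcodes shrinks and speeds decoding.
#
# Current-style: one primary opcode introduces a sub-opcode table:
#   [FAST_PREFIX][sub_op][src<<4 | dst]  -> 3 bytes, two decode steps
# Proposed: each fast-path op owns a primary opcode:
#   [op][src<<4 | dst]                   -> 2 bytes, one decode step

FAST_PREFIX = 0xF0          # hypothetical prefix opcode
SUB_ADD     = 0x01          # hypothetical sub-opcode for register add
OP_ADD_R2R  = 0x40          # hypothetical promoted primary opcode

def decode_old(code, pc):
    """Two-level decode: prefix, then sub-opcode, then register pair."""
    assert code[pc] == FAST_PREFIX
    sub, regs = code[pc + 1], code[pc + 2]
    return sub, regs >> 4, regs & 0xF, pc + 3

def decode_new(code, pc):
    """Single-level decode: the primary opcode carries the operation."""
    op, regs = code[pc], code[pc + 1]
    return op, regs >> 4, regs & 0xF, pc + 2

old = bytes([FAST_PREFIX, SUB_ADD, 0x21])   # "add r2 -> r1", 3 bytes
new = bytes([OP_ADD_R2R, 0x21])             # same operation, 2 bytes
print(len(old) - len(new))                  # 1 byte saved per instruction
```

The 2-byte form is the "33% smaller" encoding mentioned later in the thread, and dropping the second table lookup is where the interpretation speedup comes from.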


cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 21-Sep-2022 5:06:04
#93
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

@Karlos

Quote:

Karlos wrote:
@thread

Do you guys have any good statistics for the percentage of typical code that's made up of conditional branch instructions? Just asking... for a friend, like..

I've already shared a link before:
https://www.appuntidigitali.it/18362/statistiche-su-x86-x64-parte-8-istruzioni-mnemonici/

Data collected from the public beta of Photoshop CS4.

For x86:

Mnemonic   Count    %      Avg sz
J          110954   6.35   2.9


For x64:
Mnemonic   Count    %      Avg sz
J          132638   7.63   3.0


The biggest disassembly / statistics which I've is about 64-bit Excel:
Mnemonic   Count    %      Avg sz  NEx64T  Diff
J          603578   11.80  3.2     0.8     -2.4

This is the more realistic figure, IMO.

So, around 12% of instructions are conditional jumps; more or less 1 in every 8 instructions.

Consider that the statistics are limited, because I start disassembling the executables from the entry point and then follow all jumps for which I have an address. So, I don't blindly disassemble everything.


@matthey thanks for the links. I'll take a look once I have some time.


@Karlos

Quote:

Karlos wrote:
@matthey

That's interesting. I'm considering changing the MC64K bytecode to make conditional branches a sub-opcode. This would free up a significant number of main opcodes to use for 2-byte register to register operations, at the cost of making branches slightly longer and slower.

There already is a subset of "fast path" register to register instruction encodings that gain speed by sidestepping the need to evaluate any effective addresses. However these are implemented as sub opcodes themselves, meaning they need 3 bytes each. Making these primary opcodes and removing the current subopcode indirection would make them significantly faster to interpret.

That's what I suggested to you some time ago. It will free some opcodes that you can reuse for something else.

Karlos 
Re: The (Microprocessors) Code Density Hangout
Posted on 21-Sep-2022 6:17:52
#94
Elite Member
Joined: 24-Aug-2003
Posts: 4394
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@cdimauro

Quote:
That's what I've suggested you some time ago. This will free some opcodes that you can reuse for something else


Yes. At the time I was focused on solving other things, but it's time to go back to that. In this case, the "something else" is a revised fast path. Compared to exvm, the peak throughput for R2R operations is a bit disappointing. This should help.


matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 21-Sep-2022 16:56:31
#95
Super Member
Joined: 14-Mar-2007
Posts: 1968
From: Kansas

I would define a branch instruction as any instruction that can change the flow of code (the PC does not advance to the next sequential instruction). This would include conditional branches, unconditional branches, indirect branches, subroutine/function calls/returns and system calls/returns.

https://en.wikipedia.org/wiki/Branch_(computer_science)

cdimauro Quote:

I've already shared a link before:
https://www.appuntidigitali.it/18362/statistiche-su-x86-x64-parte-8-istruzioni-mnemonici/

Data collected from the public beta of Photoshop CS4.

For x86:
Mnemonic   Count    %      Avg sz
J          110954   6.35   2.9



CALL 7.25%
J 6.35%
RET 1.52%
JMP 1.46%
---
16.58% branch instructions (at least)

Photoshop includes floating point code which uses fewer branches than pure integer code.

cdimauro Quote:

For x64:
Mnemonic   Count    %      Avg sz
J          132638   7.63   3.0



J 7.63%
CALL 7.59%
JMP 1.86%
RET 1.33%
---
18.41% branch instructions (at least)

I would expect the number of branch instructions to stay the same or possibly decrease for a large program when moving from a variable-length-encoded 32-bit ISA to a 64-bit ISA, due to support for longer branch displacements, but maybe x86 already supported branch instructions with a 32-bit displacement? The 68020 ISA reduced the number of branches from the 68000 ISA by increasing the max branch displacement from 16 bits to 32 bits, which eliminated whole tables of trampoline branches for code that was only a few hundred kiB (disassemble Frontier Elite for an example). Maybe the x86-64 code is so fat that it pushed branch displacements out of range, but that would mean a 16-bit displacement limitation, while the 32-bit 68020 supports 32-bit displacements for Bcc, BRA and BSR. Granted, the 32-bit 68020 allows greater branch displacements (branch ranges) than most 64-bit ISAs due to the advantage of a variable-length encoding, which encodes displacements (and immediates) together with instructions (most RISC ISA developers still don't comprehend this big advantage, except Mitch Alsup). You only list the 19 most frequent instructions, so maybe there are other, less frequent branch instructions making up the small difference. Or maybe the total number of instructions declined with x86-64 while the number of branch instructions did not, leaving a higher percentage of branches, as we saw for the 68k.

You mentioned that MOVSX and MOVZX were not too common above but MOVSXD is the 9th most common instruction and MOVZX is the 17th most common. This is only 3.19% of instructions between them which is about what I was seeing for the 68k by looking for instruction pairs that could be converted (a few more peephole optimizations are possible as vasm supports). The big difference is that in your x86-64 stats, MOVSXD is 4.5 bytes on average and MOVZX is 4.6 bytes on average while the ColdFire equivalents are 2 bytes in most cases and 4 bytes with an immediate (vasm could often peephole optimize immediate forms to a 2 byte MOVEQ as smaller immediates are more common). MOVSX and MOVZX likely wouldn't even be used with Os optimization on x86-64. x86-64 programmers have to choose large powerful instructions like these or good code density which is the sign of a poor ISA design. These are not even worst case with MOV at 5 bytes, LEA at 5.8 bytes and CALL at 5 bytes in the top 5 most frequent instructions. Ouch! At least LEA is doing the work of several RISC architecture instructions. Load and add are usually the most frequent instructions where LEA is being used more frequently instead of ADD likely because multiple adds and even a shift are possible in one instruction doing more work. The AArch64 ISA developers figured out why the 2nd most common x86-64 instruction was doing so much work in one powerful instruction using complex addressing modes. Competing with CISC performance means throwing away RISC traditional ideals.

Another interesting observation on the MVS/MOVSX and MVZ/MOVZX instructions for the 68k is found in the new LLVM 68k backend, which has no ColdFire support. We still find pseudo instructions which duplicate this functionality.

https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/M68k/M68kInstrData.td Quote:

/// Pseudo:
///
/// MOVSX [x] MOVZX [x] MOVX [x]
///

...

/// This group of Pseudos is analogues to the real x86 extending moves, but
/// since M68k does not have those we need to emulate. These instructions
/// will be expanded right after RA completed because we need to know precisely
/// what registers are allocated for the operands and if they overlap we just
/// extend the value if the registers are completely different we need to move
/// first.


The LLVM compiler intermediate representation uses these building blocks, calling MOVSX "sext" and MOVZX "zext". Any name is better than the ColdFire "MVS" and "MVZ" names, which can easily be renamed while providing the same functionality. I called them "SXT" and "ZXT" in the 68k ISAs I documented, but I'm flexible with names as long as they fit 68k naming conventions and friendliness. ColdFire instruction names were often poor choices and didn't seem to fit, IMO.

cdimauro Quote:

The biggest disassembly / statistics which I've is about 64-bit Excel:
Mnemonic   Count    %      Avg sz  NEx64T  Diff
J          603578   11.80  3.2     0.8     -2.4

This is the more realistic figure, IMO.

So, around 12% of instructions are conditional jumps; more or less 1 in every 8 instructions.

Consider that the statistics are limited, because I start disassembling the executables from the entry point and then follow all jumps for which I have an address. So, I don't blindly disassemble everything.


I have heard that disassembling x86(-64) code is not very reliable, but your numbers seem reasonable. Even disassembling 68k code is not 100% reliable, but it should be good enough for gathering statistics in most cases.

Last edited by matthey on 22-Sep-2022 at 01:14 AM.

Karlos 
Re: The (Microprocessors) Code Density Hangout
Posted on 21-Sep-2022 18:56:04
#96
Elite Member
Joined: 24-Aug-2003
Posts: 4394
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@matthey

Specifically, I'm only interested in conditional branches. These occupy about 25% of the available primary opcode space in MC64K. Those slots can be replaced with fast path register to register instructions that are 33% smaller than the current realisation and are even simpler to decode. Totally worth it, I think.


bhabbott 
Re: The (Microprocessors) Code Density Hangout
Posted on 21-Sep-2022 19:16:43
#97
Regular Member
Joined: 6-Jun-2018
Posts: 330
From: Aotearoa

Quote:
cdimauro wrote:
So, around 12% of instructions are conditional jumps. More or less 1 every 8 instructions.

This agrees with my analysis of Amiga programs, which typically come in at around 8-15%.

Quote:

matthey wrote:
I would define a branch instruction as any instruction that changes the flow of code (the PC does not advance to the next sequential instruction). This would include conditional branches, unconditional branches, indirect branches, subroutine/function calls/returns and system calls/returns.

The difference with conditional branches is that they don't always change the flow.

Quote:
Even disassembling 68k code is not 100% reliable but it should be good enough for gathering statistics in most cases.

Reliable unless strings are wrongly disassembled as code (conditional branch opcodes ($62xx-$6fxx) match up with the letters 'b' to 'o', and applications often have a lot of strings in them). cdimauro's method of following jumps should avoid this, so although it might miss a lot of code it should be accurate for the code it followed.
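bhabbott's letter range can be checked directly: 68k conditional-branch opcode words have a high byte of $62-$6F, and those values are exactly ASCII 'b' through 'o'.

```python
# 68k conditional-branch opcode words (Bhi..Ble) have a high byte of
# $62..$6F; BRA is $60xx and BSR is $61xx, so 'b'..'o' covers only the
# conditional forms. Any ASCII text containing those letters can be
# misread as branches by a naive start-to-end disassembler.
def looks_like_bcc(word: int) -> bool:
    return 0x62 <= (word >> 8) <= 0x6F

# Treat each character of a string as the high byte of an opcode word:
hits = [c for c in "hello world" if looks_like_bcc(ord(c) << 8)]
print(hits)   # ['h', 'e', 'l', 'l', 'o', 'o', 'l', 'd']
```

Most lowercase letters in ordinary strings land in that range, which is why string data disassembles so convincingly as branch-heavy code.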

matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 21-Sep-2022 20:17:39
#98
Super Member
Joined: 14-Mar-2007
Posts: 1968
From: Kansas

bhabbott Quote:

Difference with conditional branches is they don't always change the flow.


Good point. I made a minor but important change to my definition of a branch.

bhabbott Quote:

Reliable unless strings are wrongly disassembled as code (conditional branch opcodes ($62xx-$6fxx) match up with the letters 'b' to 'o', and applications often have a lot of strings in them). cdimauro's method of following jumps should avoid this, so although it might miss a lot of code it should be accurate for the code it followed.


I didn't have a problem disassembling branches with ADis (updated version by me is at the EAB link below).

http://eab.abime.net/showthread.php?t=82709

It is a smart disassembler that follows code paths within a memory range rather than disassembling the range linearly from start to end. Instructions and data are flagged with type information, and it will back up if data doesn't match, indicating a mistake like a RELOC in the middle of an instruction or an attempt to disassemble code already marked as data by a previous access. Dead code won't disassemble without adding a code path entrance, and Amiga libraries need their function entrances marked, which ADis can add using .fd files. ADis still has problems identifying small portions of code vs. data. This most commonly occurs for data near zero, which would disassemble as an ORI #data,EA instruction. I flag unusual and useless but valid variants of ORI as likely being data, but this only helps so much. The ISA developers should have put supervisor instructions at the start of the encoding map. That would not only have helped disassemblers but also helped stop and debug errant code that ends up executing data, which would then trap at that location.
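The code-path-following idea can be sketched with a toy instruction set (hypothetical one-byte opcodes, not real 68k encodings): trace from known entry points, follow branch targets, and mark only the offsets actually reached as code, so embedded data is never decoded.

```python
# Toy sketch (hypothetical 1-byte opcodes, NOT 68k encodings) of the
# code-path-following disassembly approach: worklist of entry points,
# follow branches, mark reached instruction-start offsets as code.
# Encoding: 0x00 = RTS, 0x01 = NOP, 0x02 t = JMP t, 0x03 t = Bcc t.
def trace(mem, entries):
    is_code = set()                  # offsets of instruction starts
    work = list(entries)             # pending code-path entry points
    while work:
        pc = work.pop()
        while pc < len(mem) and pc not in is_code:
            is_code.add(pc)
            op = mem[pc]
            if op == 0x00:                       # RTS: this path ends
                break
            elif op == 0x02:                     # JMP: continue at target only
                work.append(mem[pc + 1]); break
            elif op == 0x03:                     # Bcc: target + fall-through
                work.append(mem[pc + 1]); pc += 2
            else:                                # NOP: fall through
                pc += 1
    return is_code

prog = bytes([0x03, 0x05, 0x01, 0x00, 0xFF, 0x01, 0x00])  # 0xFF at 4 is data
print(sorted(trace(prog, [0])))      # offset 4 is never marked as code
```

The data byte at offset 4 is unreachable from the entry point, so it is never misinterpreted as an instruction; the trade-off, as noted above, is that dead code with no marked entrance is missed entirely.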

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 22-Sep-2022 5:15:58
#99 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

@matthey, sorry, but I have no time today either (it's quite a busy period).

But let me share many more statistics (from a bit more than 5 million decoded instructions) from the 64-bit Excel executable, which might be interesting:

Mnemonic  Count  %  Avg size (bytes)  NEx64T  Diff
MOV 1975796 38.64 4.3 3.2 -1.1
J 603578 11.80 3.2 0.8 -2.4
CALL 388107 7.59 5.2 5.3 0.1
TEST 347927 6.80 2.9 4.3 1.4
LEA 289504 5.66 5.1 3.8 -1.3
CMP 266880 5.22 4.1 5.6 1.5
JMP 189108 3.70 3.3 2.9 -0.4
XOR 163780 3.20 2.5 2.2 -0.4
POP 146984 2.87 1.5 1.0 -0.6
ADD 118495 2.32 4.0 2.7 -1.3
PUSH 80924 1.58 1.6 0.9 -0.7
AND 79310 1.55 4.3 4.3 0.0
SUB 74092 1.45 3.7 2.5 -1.3
MOVZX 62220 1.22 4.2 4.4 0.1
RET 47889 0.94 1.0 0.5 -0.5
OR 28174 0.55 4.0 3.7 -0.3
MOVSXD 26975 0.53 3.6 4.2 0.6
CMOV 23981 0.47 3.9 4.1 0.1
NOP 18301 0.36 1.4 2.3 0.9
MOVUPS 17995 0.35 4.5 4.6 0.1
SHR 17720 0.35 3.2 3.8 0.5
SET 16185 0.32 3.3 4.0 0.7
INC 15903 0.31 2.6 2.3 -0.3
INT 3 14189 0.28 1.0 2.0 1.0
SHL 10037 0.20 3.7 4.0 0.3
IMUL 8164 0.16 4.3 4.5 0.2
SAR 7298 0.14 3.1 3.0 -0.0
NEG 7298 0.14 2.3 2.6 0.2
DEC 6291 0.12 2.6 2.5 -0.1
SBB 5822 0.11 2.3 2.1 -0.2
MOVAPS 5098 0.10 4.9 4.8 -0.1
MOVSX 4730 0.09 4.6 4.5 -0.1
MOVDQU 4717 0.09 5.6 4.6 -1.0
MOVSD 4532 0.09 6.1 5.0 -1.1
MOVD 3033 0.06 4.4 4.1 -0.3
BT 2995 0.06 4.0 4.0 -0.0
CDQ 2876 0.06 1.0 2.0 1.0
NOT 2361 0.05 2.2 2.3 0.2
BTS 2169 0.04 5.5 5.8 0.3
MOVDQA 2107 0.04 6.1 4.9 -1.2
CDQE 2079 0.04 2.0 2.0 0.0
BTR 2029 0.04 5.8 6.2 0.4
XORPS 1841 0.04 3.0 2.0 -1.0
MOVSS 1407 0.03 5.9 4.9 -1.0
CVTDQ2PS 1372 0.03 3.0 4.0 1.0
CVTTSS2SI 1261 0.02 4.2 4.0 -0.2
DIVSS 1232 0.02 7.2 8.7 1.5
LOCK XADD 971 0.02 5.1 4.2 -0.9
IDIV 818 0.02 2.9 4.1 1.2
CVTDQ2PD 804 0.02 4.0 4.0 -0.0
MULSD 590 0.01 5.1 5.6 0.5
CVTTSD2SI 491 0.01 4.2 4.0 -0.2
PSRLDQ 437 0.01 5.0 6.0 1.0
MULSS 396 0.01 6.6 7.6 1.1
LOCK INC 393 0.01 4.0 4.2 0.2
MOVQ 360 0.01 5.0 4.0 -1.0
DIVSD 297 0.01 5.1 5.6 0.4
ADDSS 256 0.01 4.8 5.1 0.3
ADDSD 255 0.00 4.9 5.2 0.3
MUL 239 0.00 2.7 4.0 1.3
XCHG 228 0.00 3.6 4.5 0.9
CVTSI2SS 216 0.00 5.0 4.0 -1.0
CQO 195 0.00 2.0 2.0 0.0
LOCK CMPXCHG 181 0.00 6.7 5.5 -1.2
LOCK ADD 155 0.00 4.8 4.5 -0.2
REP STOS 148 0.00 2.6 2.0 -0.6
LOCK DEC 137 0.00 6.1 6.1 -0.0
COMISD 109 0.00 5.2 4.8 -0.5
CVTSI2SD 107 0.00 5.2 4.1 -1.0
CWDE 103 0.00 1.0 2.0 1.0
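
As a quick cross-check (figures copied from the table above), summing the control-flow mnemonics confirms the earlier estimate: conditional jumps (J) alone are near 12% (about 1 every 8 instructions), and all control flow together is roughly a quarter.

```python
# Shares (%) of control-flow mnemonics, copied from the x64 table above.
share = {"J": 11.80, "CALL": 7.59, "JMP": 3.70, "RET": 0.94}
total = sum(share.values())
print(round(total, 2))               # total control-flow share, in %
```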

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 22-Sep-2022 19:47:15
#100 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3621
From: Germany

@matthey

Quote:

matthey wrote:
cdimauro Quote:

People should start using Bebbo's GCC instead of the plain vanilla one.


Bebbo's changes should be reviewed and injected into the official GCC. The GCC developers don't care though. Recall that the 68k backend was almost removed until a bounty raised enough money for a developer to add the minimum improvements to keep it up to date. That says a lot about the GCC developers' attitude toward the 68k: no respect. There are compiler developers here and there who treat the 68k with respect even without 68k roots. A good example is Min-Yih Hsu, who works on the llvm 68k backend although he is too young to have lived through the 68k era. Some people think the 68k should be well supported because of its history and retro appeal.

It's not that the GCC developers don't care on purpose: it's that no one is interested in maintaining the 68k backend. That's it.

IMO Bebbo should step in and propose himself as the 68k maintainer. Then there's a chance of having his changes finally merged into the master branch.

Which is basically what happened with LLVM: before Min-Yih Hsu there was no maintainer for the 68k. Once they (he wasn't alone, AFAIR) produced a consistent and working set of patches, they were finally merged into the master branch.

That's the way to go with open source.
Quote:
Quote:
cdimauro Quote:
MOV3Q could make a difference. I've implemented something similar (but more general and with broader immediates) on NEx64T and it compressed A LOT the code on both x86 and x64.


We also decided to use a more general immediate compression instead of MOV3Q, although MOV3Q saves 2 more bytes where it can be used.

MOV3Q // saves 4 bytes on MOVE.L of immediates -1, 1-7
OP.L #d16 // saves 2 bytes on OP.L EA of immediates -32768-32767
OP.L #d32 // most forms are 6 bytes (2 for instruction + 4 for d32)

I believe the Apollo core ISA still uses my immediate compression idea, using an unused addressing mode encoding in EAs. It's really good, as basic forms of OP.L with a source EA and MOVE.L EA,dst with any destination can be compressed, and signed 16-bit immediates give broad coverage.

Absolutely: nice choice. But MOV3Q is also important to further reduce the code size.
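The size trade-off from the list above can be sketched as a shortest-encoding picker (a minimal sketch using matthey's byte counts; any extension words for the destination EA are ignored, and the function name is hypothetical):

```python
# Sketch (sizes from matthey's list above; destination-EA extension
# words ignored) of picking the shortest MOVE.L-immediate encoding.
def move_imm_size(v):
    if v == -1 or 1 <= v <= 7:
        return 2     # MOV3Q #v,<ea>: a single 16-bit opcode word
    if -32768 <= v <= 32767:
        return 4     # compressed #d16 form: opcode word + 16-bit immediate
    return 6         # full MOVE.L #d32: opcode word + 32-bit immediate

for v in (5, -1, 1000, 100000):
    print(v, "->", move_imm_size(v), "bytes")
```

Note that 0 is not in the MOV3Q range, so it falls back to the 4-byte #d16 form (on the real 68k a MOVEQ or CLR would usually be cheaper still).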
Quote:
It could be used for OP.Q instructions with a 64 bit ISA as well.

That poses a problem: do you want to keep the immediate always at 16 bits, regardless of the operand size? Because it's also useful to load 32-bit data into a 64-bit destination.
Quote:
Quote:
cdimauro Quote:
MVS and MVZ doesn't make so much difference. From my x86 & x64 statistics they aren't that much used (especially on x86): https://www.appuntidigitali.it/18362/statistiche-su-x86-x64-parte-8-istruzioni-mnemonici/


The x86(-64) equivalents are MOVSX and MOVZX which I recall are something like 6 bytes?

They are 3 bytes minimum, plus the offset (8 or 32 bits), plus the prefix for 64-bit operands and/or for accessing the additional registers (R8..R15).
Quote:
They may not even be used when compiling with Os optimizations. The most common ColdFire versions of MVS and MVZ are only 2 bytes (4 bytes with an immediate, which is less useful). They only save 2 bytes per occurrence, but they can be used everywhere and anytime due to their small size. Vasm does peephole optimizations using them for the ColdFire (I suggest them in anticipation of adding MVS/MVZ to the 68k). With that said, code folding can be used to fold MOVEQ+MOVE.B/W into MVZ and MOVE.B/W+EXT into MVS, but there are other places where the instructions could be used. It is still debatable whether they are worth the encoding space with code folding, but they would improve code density by a few percent on average and give better ColdFire compatibility.

IMO ad hoc instructions like MVS and MVZ are better for a 68k ISA: the code-folding version is always bigger and more complicated to handle.
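The folding alternative being debated can be sketched as a textual peephole pass (a toy sketch over hypothetical mnemonic strings, not real encodings or a real assembler API):

```python
# Toy peephole sketch (textual mnemonics, not real encodings) of folding
# MOVEQ #0,Dn + MOVE.B <ea>,Dn into a single MVZ.B <ea>,Dn.
def fold_mvz(insns):
    out, i = [], 0
    while i < len(insns):
        a = insns[i]
        b = insns[i + 1] if i + 1 < len(insns) else None
        if (b is not None and a.startswith("MOVEQ #0,")
                and b.startswith("MOVE.B ")
                and b.endswith("," + a.split(",")[1])):   # same Dn register
            out.append(b.replace("MOVE.B", "MVZ.B"))      # fold the pair
            i += 2
        else:
            out.append(a)
            i += 1
    return out

print(fold_mvz(["MOVEQ #0,D1", "MOVE.B (A0),D1", "ADD.L D1,D2"]))
```

The pass only fires when both instructions target the same data register, which illustrates the extra matching logic a folding decoder would need compared with a dedicated MVZ encoding.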
Quote:
cdimauro Quote:

I agree, and a 68k 64-bit ISA could be the right moment for changing the ABI.


It's amazing that the 68k performs as well as it does and has the code density it has with that old, inefficient UNIX ABI liability.

Absolutely. The good thing was that on the Amiga OS we used as many registers as possible for passing parameters to libraries, which helped improve code density (and execution speed).
Quote:
CISC should give a higher percentage of branches as it has fewer bloat instructions than RISC. I made a spreadsheet of Dr. Vince Weaver's code density competition of size optimized code (like compiling with Os optimization).

https://docs.google.com/spreadsheets/d/e/2PACX-1vTyfDPXIN6i4thorNXak5hlP0FQpqpZFk2sgauXgYZPdtJX7FvVgabfbpCtHTkp5Yo9ai6MhiQqhgyG/pubhtml?gid=909588979&single=true

For the Linux Logo executable which is common integer code, the percentage of branches for various architectures comes out to something like the following.

68k - 29%
Thumb2 - 22%
RISCV32IMC - 28%
Thumb1 - 21%
RISCV64IMC - 28%
AArch64 - 26%
ARM EABI - 26%
PowerPC - 26%
SH-3 - 23%
SPARC - 24%
x86 - 27%
MIPS - 27%
x86-64 - 26%

The branch rule of thumb seems to apply more to ARM Thumb.

There are two problems with this test/benchmark: it's carefully hand-written assembly, and it's just a single (and small) program. To be more representative it should consist of several applications of meaningful size. So, real-world code...
Quote:
The PowerPC Compiler Writer's Guide gave 22.1% branches for integer code, whereas this small program has 26% branches, which seems reasonable enough. I believe the 68k branch percentage is explained by the 68k having so few instructions, leaving branch instructions as a higher percentage. Compressed RISC encodings significantly increase the number of instructions to obtain their code density, so the percentage of branches decreases. This is particularly obvious for Thumb and SH-3.

Makes sense.
Quote:
One study found the instruction count of Thumb code was increased by 30%.

Efficient Use of Invisible Registers in Thumb Code
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.208&rep=rep1&type=pdf
Quote:

More than 98% of all microprocessors are used in embedded products, the most popular 32-bit processors among them being the ARM family of embedded processors. The ARM processor core is used both as a macrocell in building application specific system chips and standard processor chips. In the embedded domain, in addition to having good performance, applications must execute under constraints of limited memory. ARM supports dual width ISAs that are simple to implement and provide a tradeoff between code size and performance. In prior work we studied the characteristics of ARM and Thumb code and showed that for some embedded applications the Thumb code size was 29.8% to 32.5% smaller than the corresponding ARM code size. However, it was also observed that there was an increase in instruction counts for Thumb code which was typically around 30%. We studied the instruction sets and then compared the Thumb and ARM code versions to identify the causes of performance loss. The reasons we identified fall into two categories:
Global inefficiency - arises due to the fact that only half of the register file is visible to most instructions in Thumb code.
Peephole inefficiency - arises because pairs of Thumb instructions are required to perform the same task that can be performed by individual ARM instructions.


Thumb2 was a big improvement over the Thumb and SuperH 16-bit-only encodings, which didn't have enough room for immediates, displacements and GP register fields. Supporting 16 and 32 bit encodings allowed nearly 16 GP registers and larger immediates and displacements, but instruction counts were still elevated compared to fixed 32 bit RISC encodings (even compared to the classic ARM 32 bit ISA, also with ~16 GP registers).

Which means a huge performance penalty. "Nice" (for competitors).
Quote:
The 68k instruction counts are actually lower than all but AArch64. The 68k doesn't pay the RISC code density tax in the big 3 performance metrics which are instruction counts, memory/cache accesses and branches. The branch percentages are elevated but only because the RISC fluff instructions are eliminated. All those extra RISC instructions with added dependencies have to be executed at an increased rate while avoiding the instruction fetch bottleneck caused by the increased instructions.

I agree. The problem is that the RISC propaganda was, and is, successful in selling this macro-family as the best while sullying CISC as the source of all evil. Unfortunately, almost all processor vendors fell into the trap...


Copyright (C) 2000 - 2019 Amigaworld.net.
Amigaworld.net was originally founded by David Doyle