BigD 
Re: The (Microprocessors) Code Density Hangout
Posted on 27-Sep-2021 22:23:26
#81 ]
Elite Member
Joined: 11-Aug-2005
Posts: 6620
From: UK

@Thread

Mr 68k says ....

_________________
"Art challenges technology. Technology inspires the art."
John Lasseter, Co-Founder of Pixar Animation Studios

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 10-Sep-2022 7:56:06
#82 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3084
From: Germany

Updated Motorola 68K post.

Added NEx64t post.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 18-Sep-2022 5:02:27
#83 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3084
From: Germany

Added Benchmarks post.

matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 18-Sep-2022 20:55:00
#84 ]
Super Member
Joined: 14-Mar-2007
Posts: 1684
From: Kansas

cdimauro Quote:

Added Benchmarks post.


It's amazing how the 68k could go from best code density to only better than average code density. All it takes are cobwebs in GCC, people forgetting that -fomit-frame-pointer is necessary for 68k and x86 to get rid of bloating frame pointers which most modern ISAs (and vbcc) have turned off by default, small data is likely never considered, etc. The 68k used to be practically tied with Thumb2 for GCC compiles of the SPEC2006 benchmark suite.

https://www.researchgate.net/publication/221306454_SPARC16_A_new_compression_approach_for_the_SPARC_architecture

I expect the 68k had more individual programs of the benchmark suite that were smaller than Thumb2 in order to come out on top for geometric mean, but one or two of the large programs were significantly larger, allowing Thumb2 to be smaller for all programs combined. The GCC 68k backend had already declined significantly by the time of this paper while ARM was on top of the embedded world. At least the 68k had a good showing, also considering the less than efficient 1979 68k ABI which is still passing args on the stack.

Last edited by matthey on 18-Sep-2022 at 08:55 PM.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 19-Sep-2022 5:07:15
#85 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3084
From: Germany

@matthey: thanks. I'll add this as well once I've some time.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 19-Sep-2022 20:48:36
#86 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3084
From: Germany

Updated Benchmarks post with the results from the above research paper.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 19-Sep-2022 21:07:13
#87 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3084
From: Germany

@matthey

Quote:

matthey wrote:
cdimauro Quote:

Added Benchmarks post.


It's amazing how the 68k could go from best code density to only better than average code density. All it takes are cobwebs in GCC, people forgetting that -fomit-frame-pointer is necessary for 68k and x86 to get rid of bloating frame pointers which most modern ISAs (and vbcc) have turned off by default, small data is likely never considered, etc.

I've written a comment on the "The Totally Unscientific Code Density Competition!" blog and an email to "Code Density Compared Between Way Too Many Instruction Sets" asking to recompile the applications with this command line parameter. I hope that we could have some new results.
Quote:
The 68k used to be practically tied with Thumb2 for GCC compiles of the SPEC2006 benchmark suite.

https://www.researchgate.net/publication/221306454_SPARC16_A_new_compression_approach_for_the_SPARC_architecture

Results added to the benchmark post.
Quote:
I expect the 68k had more individual programs of the benchmark suite that were smaller than Thumb2 in order to come out on top for geometric mean, but one or two of the large programs were significantly larger, allowing Thumb2 to be smaller for all programs combined.

Which looks strange. It would be good to know those programs and investigate why it happened: there might be room for better compiler optimizations for the 68k.
Quote:
The GCC 68k backend had already declined significantly by the time of this paper while ARM was on top of the embedded world. At least the 68k had a good showing also considering the less than efficient 1979 68k ABI which is still passing args on the stack.

Indeed.

matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 19-Sep-2022 22:50:29
#88 ]
Super Member
Joined: 14-Mar-2007
Posts: 1684
From: Kansas

cdimauro Quote:

I've written a comment on the "The Totally Unscientific Code Density Competition!" blog and an email to "Code Density Compared Between Way Too Many Instruction Sets" asking to recompile the applications with this command line parameter. I hope that we could have some new results.


Good luck getting a response. Most people don't care about the old 68k and x86 architectures. Even x86(-64) fans often don't care as fatter x86 benchmark programs make the poor x86-64 code density more acceptable. There is a reason why x86 programs sometimes loaded faster and performed better despite having half the GP registers and passing args on the stack.

cdimauro Quote:

Results added to the benchmark post.

Which looks strange. It would be good to know those programs and investigate why it happened: there might be room for better compiler optimizations for the 68k.


There is plenty of room for 68k code density improvements in an old compiler and elsewhere. Most compilers share the backend between the 68k and ColdFire. Just enabling 3 ColdFire instructions MVS, MVZ and MOV3Q, which have no conflicting 68k encodings, should improve code density by several percent with no other changes. MOV3Q saves a lot when compiling with Os optimization like in the benchmark suite as function inlining is greatly reduced giving a greater savings for smaller code to pop stack args. A better ABI would save more but we can't generally change the ABI that is already in use for compatibility reasons (Amiga library functions use register args though). It could and should be changed like x86-64 for new code using a new 68k 64 bit ISA though. I don't think it would be difficult for a 68k ISA to be improved enough to consistently beat Thumb2 in 32 bit code density benchmarks but the low hanging fruit is 64 bit code density where the limited competition is vulnerable to being trounced.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 20-Sep-2022 5:10:20
#89 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3084
From: Germany

@matthey

Quote:

matthey wrote:
cdimauro Quote:

I've written a comment on the "The Totally Unscientific Code Density Competition!" blog and an email to "Code Density Compared Between Way Too Many Instruction Sets" asking to recompile the applications with this command line parameter. I hope that we could have some new results.


Good luck getting a response.

At least I've tried.

I'll monitor the blog because there's no notification option, unfortunately.

For the webpage I've already got a reply, but it isn't from the author; it's from the web admin. The author is unknown, meaning there is no contact information. Searching around I've found this Twitter profile: https://twitter.com/ArcaneSciences but I have no account, so I cannot contact it.
Quote:
Most people don't care about the old 68k and x86 architectures. Even x86(-64) fans often don't care as fatter x86 benchmark programs make the poor x86-64 code density more acceptable.

Probably.

BTW, x86-64 also often has prologues & epilogues on functions/subroutines, so the situation might improve a bit by omitting the frame pointer generation.
Quote:
There is a reason why x86 programs sometimes loaded faster and performed better despite having half the GP registers and passing args on the stack.

cdimauro Quote:

Results added to the benchmark post.

Which looks strange. It would be good to know those programs and investigate why it happened: there might be room for better compiler optimizations for the 68k.


There is plenty of room for 68k code density improvements in an old compiler and elsewhere.

People should start using Bebbo's GCC instead of the plain vanilla one.
Quote:
Most compilers share the backend between the 68k and ColdFire. Just enabling 3 ColdFire instructions MVS, MVZ and MOV3Q, which have no conflicting 68k encodings, should improve code density by several percent with no other changes.

MOV3Q could make a difference. I've implemented something similar (but more general and with broader immediates) on NEx64T and it compressed the code A LOT on both x86 and x64.

MVS and MVZ don't make that much of a difference. From my x86 & x64 statistics they aren't used that much (especially on x86): https://www.appuntidigitali.it/18362/statistiche-su-x86-x64-parte-8-istruzioni-mnemonici/
Quote:
MOV3Q saves a lot when compiling with Os optimization like in the benchmark suite as function inlining is greatly reduced giving a greater savings for smaller code to pop stack args. A better ABI would save more but we can't generally change the ABI that is already in use for compatibility reasons (Amiga library functions use register args though). It could and should be changed like x86-64 for new code using a new 68k 64 bit ISA though. I don't think it would be difficult for a 68k ISA to be improved enough to consistently beat Thumb2 in 32 bit code density benchmarks but the low hanging fruit is 64 bit code density where the limited competition is vulnerable to being trounced.

I agree, and a 68k 64-bit ISA could be the right moment for changing the ABI.

Karlos 
Re: The (Microprocessors) Code Density Hangout
Posted on 20-Sep-2022 12:43:25
#90 ]
Elite Member
Joined: 24-Aug-2003
Posts: 3118
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@thread

Do you guys have any good statistics for the percentage of typical code that's made up of conditional branch instructions? Just asking... for a friend, like..

_________________
Doing stupid things for fun...

matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 20-Sep-2022 17:58:39
#91 ]
Super Member
Joined: 14-Mar-2007
Posts: 1684
From: Kansas

cdimauro Quote:

People should start using Bebbo's GCC instead of the plain vanilla one.


Bebbo's changes should be reviewed and merged into the official GCC. The GCC developers don't care though. Recall that the 68k backend was almost removed until a bounty raised enough money for a developer to add the minimum improvements to keep it up to date. That says a lot about the GCC developer attitude toward the 68k, which is no respect. There are compiler developers here and there who treat the 68k with respect even without 68k roots. A good example is Min-Yih Hsu, who works on the LLVM 68k backend even though he is too young to have lived in the 68k era. Some people think the 68k should be well supported because of the history and retro appeal.

cdimauro Quote:

MOV3Q could make a difference. I've implemented something similar (but more general and with broader immediates) on NEx64T and it compressed A LOT the code on both x86 and x64.


We also decided to use a more general immediate compression instead of MOV3Q, although MOV3Q saves 2 more bytes where it can be used.

MOV3Q // saves 4 bytes on MOVE.L of immediates -1, 1-7
OP.L #d16 // saves 2 bytes on OP.L EA of immediates -32768-32767
OP.L #d32 // most forms are 6 bytes (2 for instruction + 4 for d32)

I believe the Apollo core ISA still uses my immediate compression idea using an unused addressing mode encoding in EAs. It's really good as basic forms of OP.L with a source EA and MOVE.L EA,dst with any destination can be compressed and signed 2^16 immediates has broad coverage. It could be used for OP.Q instructions with a 64 bit ISA as well.
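
To make the byte counts above concrete, here is a minimal sketch that applies the three size rules to a MOVE.L of an immediate. The function name and the two option flags are made up for illustration, and the model ignores every other encoding (MOVEQ, for instance), so treat it as arithmetic on the figures quoted in this post rather than as an assembler.

```python
def move_l_immediate_size(value: int, *, mov3q: bool = False, imm16: bool = False) -> int:
    """Size in bytes of a MOVE.L #value,<ea> under the rules quoted above.

    Hypothetical helper: only the three cases listed in the post are modelled.
    """
    # MOV3Q covers only -1 and 1..7 and saves 4 bytes versus the 6-byte form.
    if mov3q and (value == -1 or 1 <= value <= 7):
        return 2
    # The 16-bit immediate compression covers -32768..32767 and saves 2 bytes.
    if imm16 and -32768 <= value <= 32767:
        return 4
    # Plain form: 2 bytes of instruction plus a 4-byte immediate.
    return 6


if __name__ == "__main__":
    for v in (1, -1, 0, 100, 40000):
        plain = move_l_immediate_size(v)
        both = move_l_immediate_size(v, mov3q=True, imm16=True)
        print(f"#{v:>6}: plain {plain} bytes, with MOV3Q + imm16 {both} bytes")
```

Note that 0 falls outside the MOV3Q range, so in this model it only benefits from the 16-bit compression.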

cdimauro Quote:

MVS and MVZ don't make that much of a difference. From my x86 & x64 statistics they aren't used that much (especially on x86): https://www.appuntidigitali.it/18362/statistiche-su-x86-x64-parte-8-istruzioni-mnemonici/


The x86(-64) equivalents are MOVSX and MOVZX, which I recall are something like 6 bytes? They may not even be used when compiling with Os optimizations. The most common ColdFire versions of MVS and MVZ are only 2 bytes (4 bytes with an immediate, which is less useful). They only save 2 bytes per occurrence but they can be used everywhere and anytime due to their small size. Vasm does peephole optimizations using them for the ColdFire (I suggested them in anticipation of adding MVS/MVZ to the 68k). With that said, code folding can be used to fold MOVEQ+MOVE.B/W into MVZ and MOVE.B/W+EXT into MVS, but there are other places where the instructions could be used. It is still debatable whether they are worth the encoding space given code folding, but they would improve code density by a few percent on average and give better ColdFire compatibility.
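
The folding described above can be sketched as a toy peephole pass. This is not vasm's implementation: the tuple format, the function names and the lowercase mnemonics are assumptions made for the example, and it ignores condition-code differences between the instruction pair and the folded instruction, which a real optimizer would have to check.

```python
from typing import List, Optional, Tuple

# (mnemonic, source operand or None, destination operand), all lowercase.
Insn = Tuple[str, Optional[str], str]

DATA_REGS = {f"d{i}" for i in range(8)}


def _fold_pair(a: Insn, b: Insn) -> Optional[Insn]:
    """Return the single MVZ/MVS replacing the pair (a, b), or None."""
    (op_a, src_a, dst_a), (op_b, src_b, dst_b) = a, b
    if dst_a != dst_b or dst_a not in DATA_REGS:
        return None
    # moveq #0,Dn ; move.b/w src,Dn  ->  mvz.b/w src,Dn
    # Skip the fold if src uses Dn, since moveq would have cleared it first.
    if op_a == "moveq" and src_a == "#0" and op_b in ("move.b", "move.w"):
        if src_b is not None and dst_a not in src_b:
            return ("mvz" + op_b[4:], src_b, dst_a)
        return None
    # move.b src,Dn ; extb.l Dn  ->  mvs.b src,Dn
    if op_a == "move.b" and op_b == "extb.l" and src_b is None:
        return ("mvs.b", src_a, dst_a)
    # move.w src,Dn ; ext.l Dn  ->  mvs.w src,Dn
    if op_a == "move.w" and op_b == "ext.l" and src_b is None:
        return ("mvs.w", src_a, dst_a)
    return None


def fold_extending_moves(code: List[Insn]) -> List[Insn]:
    """Single left-to-right pass that folds adjacent pairs where possible."""
    out: List[Insn] = []
    i = 0
    while i < len(code):
        folded = _fold_pair(code[i], code[i + 1]) if i + 1 < len(code) else None
        if folded is not None:
            out.append(folded)
            i += 2
        else:
            out.append(code[i])
            i += 1
    return out


if __name__ == "__main__":
    sample = [
        ("moveq", "#0", "d0"), ("move.b", "(a0)", "d0"),
        ("move.w", "(a1)", "d1"), ("ext.l", None, "d1"),
        ("add.l", "d1", "d0"),
    ]
    for insn in fold_extending_moves(sample):
        print(insn)
```

Each successful fold replaces two instructions with one, which is where the 2 bytes per occurrence come from.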

cdimauro Quote:

I agree, and a 68k 64-bit ISA could be the right moment for changing the ABI.


It's amazing that the 68k performs as well as it does and has the code density it has with that old inefficient UNIX ABI liability.

Karlos Quote:

Do you guys have any good statistics for the percentage of typical code that's made up of conditional branch instructions? Just asking... for a friend, like..


As I recall, the general rule of thumb is that 1 in 5 instructions is a branch. There are several documents with instruction frequencies where the branch percentages are given or can be calculated. The PowerPC Compiler Writer's Guide gives 22.1% branch instructions for the integer SPEC92 benchmarks and 9.2% for the floating point SPEC92 benchmarks (see figures C-1 and C-2 on page 188).

The PowerPC Compiler Writer's Guide
https://cr.yp.to/2005-590/powerpc-cwg.pdf

There are newer sources, like a RISC-V compile of SPEC2006, where instruction frequencies are given, but you will have to calculate the branch frequencies yourself (compiled with Os optimization).

Enhancing the RISC-V Instruction Set Architecture
https://project-archive.inf.ed.ac.uk/ug4/20191424/ug4_proj.pdf

Cdimauro did an article on instruction frequencies for x86 and x86-64 and could probably give you the link and branch frequency. CISC should give a higher percentage of branches as it has fewer bloat instructions than RISC. I made a spreadsheet of Dr. Vince Weaver's code density competition of size optimized code (like compiling with Os optimization).

https://docs.google.com/spreadsheets/d/e/2PACX-1vTyfDPXIN6i4thorNXak5hlP0FQpqpZFk2sgauXgYZPdtJX7FvVgabfbpCtHTkp5Yo9ai6MhiQqhgyG/pubhtml?gid=909588979&single=true

For the Linux Logo executable which is common integer code, the percentage of branches for various architectures comes out to something like the following.

68k - 29% of instructions branches
Thumb2 - 22%
RISCV32IMC - 28%
Thumb1 - 21%
RISCV64IMC - 28%
AArch64 - 26%
ARM EABI - 26%
PowerPC - 26%
SH-3 - 23%
SPARC - 24%
x86 - 27%
MIPS - 27%
x86-64 - 26%

The branch rule of thumb seems to apply more to ARM Thumb. The PowerPC Compiler Writer's Guide gave 22.1% branches for integer code, so the 26% for PowerPC in this small program seems reasonable enough. I believe the 68k branch percentage is explained by the 68k having so few instructions, leaving branch instructions as a higher percentage. Compressed RISC encodings significantly increase the number of instructions to obtain their code density, so the percentage of branches decreases. This is particularly obvious for Thumb and SH-3. One study found the instruction count of Thumb code was increased by 30%.
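
For anyone who wants to compare the list above with the 1-in-5 rule of thumb, here is a small sketch. The percentages are copied from the list in this post; the dictionary and function names are only illustrative.

```python
# Branch share (% of instructions) for the Linux Logo executable, as listed above.
BRANCH_SHARE = {
    "68k": 29, "Thumb2": 22, "RISCV32IMC": 28, "Thumb1": 21,
    "RISCV64IMC": 28, "AArch64": 26, "ARM EABI": 26, "PowerPC": 26,
    "SH-3": 23, "SPARC": 24, "x86": 27, "MIPS": 27, "x86-64": 26,
}

RULE_OF_THUMB = 20.0  # "1 in 5 instructions is a branch"


def mean_share(shares: dict) -> float:
    """Arithmetic mean of the branch percentages."""
    return sum(shares.values()) / len(shares)


def closest_to_rule(shares: dict, rule: float = RULE_OF_THUMB) -> str:
    """Name of the architecture whose share is nearest to the rule of thumb."""
    return min(shares, key=lambda name: abs(shares[name] - rule))


if __name__ == "__main__":
    print(f"mean branch share: {mean_share(BRANCH_SHARE):.1f}%")
    print(f"closest to {RULE_OF_THUMB:.0f}%: {closest_to_rule(BRANCH_SHARE)}")
    for name, pct in sorted(BRANCH_SHARE.items(), key=lambda kv: kv[1]):
        print(f"{name:<11} {pct}%  (about 1 in {100 / pct:.1f})")
```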

Efficient Use of Invisible Registers in Thumb Code
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.208&rep=rep1&type=pdf Quote:

More than 98% of all microprocessors are used in embedded products, the most popular 32-bit processors among them being the ARM family of embedded processors. The ARM processor core is used both as a macrocell in building application specific system chips and standard processor chips. In the embedded domain, in addition to having good performance, applications must execute under constraints of limited memory. ARM supports dual width ISAs that are simple to implement and provide a tradeoff between code size and performance. In prior work we studied the characteristics of ARM and Thumb code and showed that for some embedded applications the Thumb code size was 29.8% to 32.5% smaller than the corresponding ARM code size. However, it was also observed that there was an increase in instruction counts for Thumb code which was typically around 30%. We studied the instruction sets and then compared the Thumb and ARM code versions to identify the causes of performance loss. The reasons we identified fall into two categories: Global inefficiency - Global inefficiency arises due to the fact that only half of the register file is visible to most instructions in Thumb code. Peephole inefficiency - Peephole inefficiency arises because pairs of Thumb instructions are required to perform the same task that can be performed by individual ARM instructions.


Thumb2 was a big improvement over Thumb and SuperH 16 bit only encodings which didn't have enough room for immediates, displacements and GP register fields. Supporting 16 and 32 bit encodings allowed nearly 16 GP registers and larger immediates and displacements but instruction counts were still elevated compared to fixed 32 bit RISC encodings (even compared to the classic ARM 32 bit ISA also with ~16 GP registers). The 68k instruction counts are actually lower than all but AArch64. The 68k doesn't pay the RISC code density tax in the big 3 performance metrics which are instruction counts, memory/cache accesses and branches. The branch percentages are elevated but only because the RISC fluff instructions are eliminated. All those extra RISC instructions with added dependencies have to be executed at an increased rate while avoiding the instruction fetch bottleneck caused by the increased instructions.

Last edited by matthey on 20-Sep-2022 at 06:03 PM.

Karlos 
Re: The (Microprocessors) Code Density Hangout
Posted on 20-Sep-2022 18:13:30
#92 ]
Elite Member
Joined: 24-Aug-2003
Posts: 3118
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@matthey

That's interesting. I'm considering changing the MC64K bytecode to make conditional branches a sub-opcode. This would free up a significant number of main opcodes to use for 2-byte register to register operations, at the cost of making branches slightly longer and slower.

There already is a subset of "fast path" register to register instruction encodings that gain speed by sidestepping the need to evaluate any effective addresses. However these are implemented as sub opcodes themselves, meaning they need 3 bytes each. Making these primary opcodes and removing the current subopcode indirection would make them significantly faster to interpret.

_________________
Doing stupid things for fun...

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 21-Sep-2022 5:06:04
#93 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3084
From: Germany

@Karlos

Quote:

Karlos wrote:
@thread

Do you guys have any good statistics for the percentage of typical code that's made up of conditional branch instructions? Just asking... for a friend, like..

I've already shared a link before:
https://www.appuntidigitali.it/18362/statistiche-su-x86-x64-parte-8-istruzioni-mnemonici/

Data collected from the public beta of Photoshop CS4.

For x86:

Mnemonic   Count    %      Avg sz
J          110954   6.35   2.9


For x64:
Mnemonic   Count    %      Avg sz
J          132638   7.63   3.0


The biggest disassembly / set of statistics that I have is for 64-bit Excel:
Mnemonic   Count    %       Avg sz   NEx64T   Diff
J          603578   11.80   3.2      0.8      -2.4

Which is the most realistic, IMO.

So, around 12% of instructions are conditional jumps. More or less 1 in every 8 instructions.

Consider that the statistics are limited, because I start disassembling the executables from the entry point and then follow all jumps for which I have an address. So, I don't brute-force disassemble everything.
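
The approach described above (start at the entry point, follow every jump with a known address, never sweep the whole file) can be sketched on a toy, pre-decoded program. Everything here is hypothetical: the `Insn` record, the `kind` labels and the sample program are invented for illustration and have nothing to do with the actual tool or with real x86 decoding.

```python
from typing import Dict, NamedTuple, Optional, Set


class Insn(NamedTuple):
    mnemonic: str
    size: int                      # bytes, used to find the fall-through address
    kind: str                      # "seq", "cond", "jump", "call" or "ret"
    target: Optional[int] = None   # None when the target is not statically known


def reachable(code: Dict[int, Insn], entry: int) -> Set[int]:
    """Addresses of instructions reached by following code paths from entry."""
    seen: Set[int] = set()
    work = [entry]
    while work:
        addr = work.pop()
        if addr in seen or addr not in code:
            continue
        seen.add(addr)
        insn = code[addr]
        # Follow the branch target only when an address is available.
        if insn.kind in ("cond", "jump", "call") and insn.target is not None:
            work.append(insn.target)
        # Sequential code, conditional branches and calls also fall through.
        if insn.kind in ("seq", "cond", "call"):
            work.append(addr + insn.size)
    return seen


def conditional_share(code: Dict[int, Insn], entry: int) -> float:
    """Percentage of reached instructions that are conditional branches."""
    addrs = reachable(code, entry)
    conds = sum(1 for a in addrs if code[a].kind == "cond")
    return 100.0 * conds / len(addrs)


TOY = {
    0x00: Insn("cmp", 2, "seq"),
    0x02: Insn("jne", 2, "cond", 0x08),
    0x04: Insn("mov", 2, "seq"),
    0x06: Insn("ret", 2, "ret"),
    0x08: Insn("jmp", 2, "jump", None),  # indirect jump: traversal stops here
    0x0A: Insn("add", 2, "seq"),         # only reachable via the indirect jump
    0x0C: Insn("ret", 2, "ret"),
}

if __name__ == "__main__":
    print(sorted(reachable(TOY, 0x00)))
    print(f"{conditional_share(TOY, 0x00):.1f}% conditional branches")
```

The two instructions behind the indirect jump are never visited, which is the kind of limitation mentioned above: the counts are accurate for the code that was followed, but some code is missed.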


@matthey thanks for the links. I'll take a look once I've some time.


@Karlos

Quote:

Karlos wrote:
@matthey

That's interesting. I'm considering changing the MC64K bytecode to make conditional branches a sub-opcode. This would free up a significant number of main opcodes to use for 2-byte register to register operations, at the cost of making branches slightly longer and slower.

There already is a subset of "fast path" register to register instruction encodings that gain speed by sidestepping the need to evaluate any effective addresses. However these are implemented as sub opcodes themselves, meaning they need 3 bytes each. Making these primary opcodes and removing the current subopcode indirection would make them significantly faster to interpret.

That's what I suggested to you some time ago. This will free some opcodes that you can reuse for something else.

Karlos 
Re: The (Microprocessors) Code Density Hangout
Posted on 21-Sep-2022 6:17:52
#94 ]
Elite Member
Joined: 24-Aug-2003
Posts: 3118
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@cdimauro

Quote:
That's what I suggested to you some time ago. This will free some opcodes that you can reuse for something else.


Yes. At the time I was focused on solving other things but it's time to go back to that. In this case, the "something else" is a revised fast path. Compared to exvm, the peak throughput for R2R operations is a bit disappointing. This should help.

_________________
Doing stupid things for fun...

matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 21-Sep-2022 16:56:31
#95 ]
Super Member
Joined: 14-Mar-2007
Posts: 1684
From: Kansas

I would define a branch instruction as any instruction that can change the flow of code (the PC does not advance to the next sequential instruction). This would include conditional branches, unconditional branches, indirect branches, subroutine/function calls/returns and system calls/returns.

https://en.wikipedia.org/wiki/Branch_(computer_science)

cdimauro Quote:

I've already shared a link before:
https://www.appuntidigitali.it/18362/statistiche-su-x86-x64-parte-8-istruzioni-mnemonici/

Data collected from the public beta of Photoshop CS4.

For x86:
Mnemonic              Count      % Avg sz
J 110954 6.35 2.9



CALL 7.25%
J 6.35%
RET 1.52%
JMP 1.46%
---
16.58% branch instructions (at least)

Photoshop includes floating point code which uses fewer branches than pure integer code.

cdimauro Quote:

For x64:
Mnemonic              Count      % Avg sz
J 132638 7.63 3.0



J 7.63%
CALL 7.59%
JMP 1.86%
RET 1.33%
---
18.41% branch instructions (at least)
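
The two totals can be re-added with a few lines of code. This is just the arithmetic on the percentages listed above; the dictionary and function names are illustrative.

```python
# Percentages of all disassembled instructions, from the figures above.
X86_BRANCHES = {"CALL": 7.25, "J": 6.35, "RET": 1.52, "JMP": 1.46}
X64_BRANCHES = {"J": 7.63, "CALL": 7.59, "JMP": 1.86, "RET": 1.33}


def branch_share(shares: dict) -> float:
    """Total share of branch instructions, rounded to two decimals."""
    return round(sum(shares.values()), 2)


def one_in_n(share_percent: float) -> float:
    """Express a percentage as 'one instruction in every N'."""
    return 100.0 / share_percent


if __name__ == "__main__":
    for name, shares in (("x86", X86_BRANCHES), ("x64", X64_BRANCHES)):
        total = branch_share(shares)
        print(f"{name}: {total}% branches, about 1 in {one_in_n(total):.1f}")
```

Both totals are lower bounds, since only the listed mnemonics are counted.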

I would expect the number of branch instructions to stay the same or possibly decrease for a large program when moving from a variable length encoded 32 bit ISA to a 64 bit ISA due to support for longer branch displacements, but maybe x86 already supported branch instructions with a 32 bit displacement? The 68020 ISA reduced the number of branches from the 68000 ISA by increasing the max branch displacements from 16 bits to 32 bits, which eliminated whole tables of trampoline branches for code that was only a few hundred kiB (disassemble Frontier Elite for an example). Maybe the x86-64 code is so fat that it pushed branch displacements out of range, but that would mean a 16 bit displacement limitation while the 32 bit 68020 supports 32 bit displacements for Bcc, BRA and BSR. Granted, the 32 bit 68020 allows greater branch displacements (branch ranges) than most 64 bit ISAs due to the advantage of a variable length encoding that stores displacements (and immediates) with the instructions (most RISC ISA developers still don't comprehend this big advantage, except Mitch Alsup). You only list the 19 most frequent branch instructions, so maybe there are other less frequent branch instructions making the small difference. Maybe the total number of instructions declined with x86-64 while the number of branch instructions did not, leaving a higher percentage of branches as we saw for the 68k.

You mentioned that MOVSX and MOVZX were not too common above but MOVSXD is the 9th most common instruction and MOVZX is the 17th most common. This is only 3.19% of instructions between them which is about what I was seeing for the 68k by looking for instruction pairs that could be converted (a few more peephole optimizations are possible as vasm supports). The big difference is that in your x86-64 stats, MOVSXD is 4.5 bytes on average and MOVZX is 4.6 bytes on average while the ColdFire equivalents are 2 bytes in most cases and 4 bytes with an immediate (vasm could often peephole optimize immediate forms to a 2 byte MOVEQ as smaller immediates are more common). MOVSX and MOVZX likely wouldn't even be used with Os optimization on x86-64. x86-64 programmers have to choose large powerful instructions like these or good code density which is the sign of a poor ISA design. These are not even worst case with MOV at 5 bytes, LEA at 5.8 bytes and CALL at 5 bytes in the top 5 most frequent instructions. Ouch! At least LEA is doing the work of several RISC architecture instructions. Load and add are usually the most frequent instructions where LEA is being used more frequently instead of ADD likely because multiple adds and even a shift are possible in one instruction doing more work. The AArch64 ISA developers figured out why the 2nd most common x86-64 instruction was doing so much work in one powerful instruction using complex addressing modes. Competing with CISC performance means throwing away RISC traditional ideals.

Another interesting observation on MVS/MOVSX and MVZ/MOVZX instructions for the 68k is found in the new LLVM 68k backend, which has no ColdFire support. We still find pseudo instructions which duplicate this functionality.

https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/M68k/M68kInstrData.td Quote:

/// Pseudo:
///
/// MOVSX [x] MOVZX [x] MOVX [x]
///

...

/// This group of Pseudos is analogues to the real x86 extending moves, but
/// since M68k does not have those we need to emulate. These instructions
/// will be expanded right after RA completed because we need to know precisely
/// what registers are allocated for the operands and if they overlap we just
/// extend the value if the registers are completely different we need to move
/// first.


The LLVM compiler intermediate representation uses these building blocks calling MOVSX "sext" and MOVZX "zext". Any name is better than the ColdFire "MVS" and "MVZ" names which can easily be renamed while providing the same functionality. I called them "SXT" and "ZXT" in the 68k ISAs I documented but I'm flexible with names as long as they fit with 68k naming conventions and friendliness. ColdFire instruction names were often poor choices and didn't seem to fit IMO.

cdimauro Quote:

The biggest disassembly / statistics which I've is about 64-bit Excel:
Mnemonic              Count      % Avg sz
J 603578 11.80 3.2 0.8 -2.4

Which is the more realistic, IMO.

So, around 12% of instructions are conditional jumps. More or less 1 in every 8 instructions.

Consider that the statistics are limited, because I start disassembling the executables from the entry point and then follow all jumps for which I have an address. So, I don't brute-force disassemble everything.


I have heard disassembling code on x86(-64) is not very reliable but your numbers seem reasonable. Even disassembling 68k code is not 100% reliable but it should be good enough for gathering statistics in most cases.

Last edited by matthey on 22-Sep-2022 at 01:14 AM.
Last edited by matthey on 22-Sep-2022 at 01:13 AM.
Last edited by matthey on 21-Sep-2022 at 07:31 PM.
Last edited by matthey on 21-Sep-2022 at 05:06 PM.

Karlos 
Re: The (Microprocessors) Code Density Hangout
Posted on 21-Sep-2022 18:56:04
#96 ]
Elite Member
Joined: 24-Aug-2003
Posts: 3118
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@matthey

Specifically, I'm only interested in conditional branches. These occupy about 25% of the available primary opcode space in MC64K. Those slots can be replaced with fast path register to register instructions that are 33% smaller than the current realisation and are even simpler to decode. Totally worth it, I think.
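
A rough way to sanity-check the trade-off is to plug an instruction mix into a few lines of arithmetic. This is only a sketch: the 3-byte and 2-byte sizes come from the posts above, while the 1-byte branch penalty and the instruction mix in the example are pure assumptions, not MC64K measurements.

```python
def fast_path_saving(old_bytes: int = 3, new_bytes: int = 2) -> float:
    """Relative size reduction of one fast-path instruction."""
    return (old_bytes - new_bytes) / old_bytes


def net_size_change(instructions: int,
                    cond_branch_frac: float,
                    fast_r2r_frac: float,
                    branch_penalty: int = 1,   # assumed extra bytes per conditional branch
                    r2r_saving: int = 1) -> float:
    """Estimated change in code size in bytes (negative means smaller)."""
    return instructions * (cond_branch_frac * branch_penalty
                           - fast_r2r_frac * r2r_saving)


if __name__ == "__main__":
    print(f"fast path saving: {fast_path_saving():.0%}")
    # Hypothetical mix: 12% conditional branches, 20% fast-path R2R operations.
    delta = net_size_change(1000, 0.12, 0.20)
    print(f"net change per 1000 instructions: {delta:+.0f} bytes")
```

Under these assumptions the code shrinks whenever fast-path register to register operations are more frequent than conditional branches; the interpreter speed gain is a separate question that this does not model.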

_________________
Doing stupid things for fun...

bhabbott 
Re: The (Microprocessors) Code Density Hangout
Posted on 21-Sep-2022 19:16:43
#97 ]
Regular Member
Joined: 6-Jun-2018
Posts: 227
From: Aotearoa

Quote:
cdimauro wrote:
So, around 12% of instructions are conditional jumps. More or less 1 every 8 instructions.

This agrees with my analysis of Amiga programs, which are typically around 8-15%.

Quote:

matthey wrote:
I would define a branch instruction as any instruction that changes the flow of code (the PC does not advance to the next sequential instruction). This would include conditional branches, unconditional branches, indirect branches, subroutine/function calls/returns and system calls/returns.

Difference with conditional branches is they don't always change the flow.

Quote:
Even disassembling 68k code is not 100% reliable but it should be good enough for gathering statistics in most cases.

Reliable unless strings are wrongly disassembled as code (conditional branch opcodes ($62xx-$6fxx) match up with the letters 'b' to 'o', and applications often have a lot of strings in them). cdimauro's method of following jumps should avoid this, so although it might miss a lot of code it should be accurate for the code it followed.
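
The overlap between lowercase letters and conditional branch opcodes is easy to demonstrate. The sketch below simply counts the word-aligned positions in a byte string whose high byte falls in the $62-$6f range mentioned above; the function name and the sample string are illustrative, and it is a demonstration of the pitfall, not a disassembler.

```python
BCC_FIRST = 0x62  # high byte of the first conditional branch opcode, ASCII 'b'
BCC_LAST = 0x6F   # high byte of the last conditional branch opcode, ASCII 'o'


def count_false_branches(data: bytes) -> int:
    """Count 16-bit words whose high byte would decode as a conditional branch.

    68k opcodes are word-aligned and big-endian, so the opcode's high byte
    sits at every even offset.
    """
    return sum(
        1
        for offset in range(0, len(data) - 1, 2)
        if BCC_FIRST <= data[offset] <= BCC_LAST
    )


if __name__ == "__main__":
    text = b"hello world!"
    words = len(text) // 2
    print(f"{count_false_branches(text)} of {words} words look like conditional branches")
```

A linear sweep over a string table would therefore inflate the conditional branch count, while following code paths from known entry points avoids decoding the strings at all.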

matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 21-Sep-2022 20:17:39
#98 ]
Super Member
Joined: 14-Mar-2007
Posts: 1684
From: Kansas

bhabbott Quote:

Difference with conditional branches is they don't always change the flow.


Good point. I made a minor but important change to my definition of a branch.

bhabbott Quote:

Reliable unless strings are wrongly disassembled as code (conditional branch opcodes ($62xx-$6fxx) match up with the letters 'b' to 'o', and applications often have a lot of strings in them). cdimauro's method of following jumps should avoid this, so although it might miss a lot of code it should be accurate for the code it followed.


I didn't have a problem disassembling branches with ADis (updated version by me is at the EAB link below).

http://eab.abime.net/showthread.php?t=82709

It is a smart disassembler that disassembles code paths within a memory range rather than disassembling a memory range from start to end. Instructions and data are flagged with type information, and it will back up if the data doesn't match, indicating a mistake such as a RELOC in the middle of an instruction or an attempt to disassemble code that a previous instruction access marked as data. Dead code won't disassemble without adding a code path entrance, and Amiga libraries need their function entrances marked, which ADis can add using .fd files. ADis still has problems distinguishing small portions of code from data. This most commonly occurs for data around zero that would disassemble as an ORI #data,EA instruction. I flag unusual and useless but valid variants of ORI as likely being data, but this only helps so much. The ISA developers should have put supervisor instructions at the start of the encoding map. This would have helped not only disassemblers but also with stopping and debugging errant code that ends up executing data, which would then trap at that location.
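The zero-data problem can be shown with a small Python sketch. This is not ADis code: `decode_ori` is a hypothetical helper that only handles ORI #imm,Dn, and the "immediate is zero, so probably data" rule is just one assumed example of the kind of heuristic described.

```python
SIZES = {0: "B", 1: "W", 2: "L"}


def decode_ori(words):
    """Decode ORI #imm,Dn from a list of 16-bit words.

    Returns (text, likely_data), or None if the words are not ORI #imm,Dn.
    """
    op = words[0]
    size_bits = (op >> 6) & 3
    if (op >> 8) != 0 or size_bits not in SIZES:
        return None  # ORI has 0000 0000 in the high byte; size 3 is not ORI
    mode, reg = (op >> 3) & 7, op & 7
    if mode != 0:
        return None  # this sketch handles data register direct only
    size = SIZES[size_bits]
    needed = 3 if size == "L" else 2
    if len(words) < needed:
        return None
    if size == "L":
        imm = (words[1] << 16) | words[2]
    else:
        imm = words[1] & (0xFF if size == "B" else 0xFFFF)
    # ORing with zero changes nothing, so treat it as suspicious (assumed rule).
    return "ORI.%s #$%X,D%d" % (size, imm, reg), imm == 0


print(decode_ori([0x0000, 0x0000]))  # ('ORI.B #$0,D0', True)
print(decode_ori([0x0040, 0x1234]))  # ('ORI.W #$1234,D0', False)
```

Two zero words are a perfectly valid instruction, which is why a block of cleared data looks like code to a naive disassembler.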

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 22-Sep-2022 5:15:58
#99 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3084
From: Germany

@matthey, sorry, but I have no time today either (it's quite a busy period).

But let me share some more statistics (from a bit more than 5 million decoded instructions) from the 64-bit version of Excel, which might be interesting:

Mnemonic        Count      %  Avg sz  NEx64T   Diff
MOV           1975796  38.64     4.3     3.2   -1.1
J              603578  11.80     3.2     0.8   -2.4
CALL           388107   7.59     5.2     5.3    0.1
TEST           347927   6.80     2.9     4.3    1.4
LEA            289504   5.66     5.1     3.8   -1.3
CMP            266880   5.22     4.1     5.6    1.5
JMP            189108   3.70     3.3     2.9   -0.4
XOR            163780   3.20     2.5     2.2   -0.4
POP            146984   2.87     1.5     1.0   -0.6
ADD            118495   2.32     4.0     2.7   -1.3
PUSH            80924   1.58     1.6     0.9   -0.7
AND             79310   1.55     4.3     4.3    0.0
SUB             74092   1.45     3.7     2.5   -1.3
MOVZX           62220   1.22     4.2     4.4    0.1
RET             47889   0.94     1.0     0.5   -0.5
OR              28174   0.55     4.0     3.7   -0.3
MOVSXD          26975   0.53     3.6     4.2    0.6
CMOV            23981   0.47     3.9     4.1    0.1
NOP             18301   0.36     1.4     2.3    0.9
MOVUPS          17995   0.35     4.5     4.6    0.1
SHR             17720   0.35     3.2     3.8    0.5
SET             16185   0.32     3.3     4.0    0.7
INC             15903   0.31     2.6     2.3   -0.3
INT 3           14189   0.28     1.0     2.0    1.0
SHL             10037   0.20     3.7     4.0    0.3
IMUL             8164   0.16     4.3     4.5    0.2
SAR              7298   0.14     3.1     3.0   -0.0
NEG              7298   0.14     2.3     2.6    0.2
DEC              6291   0.12     2.6     2.5   -0.1
SBB              5822   0.11     2.3     2.1   -0.2
MOVAPS           5098   0.10     4.9     4.8   -0.1
MOVSX            4730   0.09     4.6     4.5   -0.1
MOVDQU           4717   0.09     5.6     4.6   -1.0
MOVSD            4532   0.09     6.1     5.0   -1.1
MOVD             3033   0.06     4.4     4.1   -0.3
BT               2995   0.06     4.0     4.0   -0.0
CDQ              2876   0.06     1.0     2.0    1.0
NOT              2361   0.05     2.2     2.3    0.2
BTS              2169   0.04     5.5     5.8    0.3
MOVDQA           2107   0.04     6.1     4.9   -1.2
CDQE             2079   0.04     2.0     2.0    0.0
BTR              2029   0.04     5.8     6.2    0.4
XORPS            1841   0.04     3.0     2.0   -1.0
MOVSS            1407   0.03     5.9     4.9   -1.0
CVTDQ2PS         1372   0.03     3.0     4.0    1.0
CVTTSS2SI        1261   0.02     4.2     4.0   -0.2
DIVSS            1232   0.02     7.2     8.7    1.5
LOCK XADD         971   0.02     5.1     4.2   -0.9
IDIV              818   0.02     2.9     4.1    1.2
CVTDQ2PD          804   0.02     4.0     4.0   -0.0
MULSD             590   0.01     5.1     5.6    0.5
CVTTSD2SI         491   0.01     4.2     4.0   -0.2
PSRLDQ            437   0.01     5.0     6.0    1.0
MULSS             396   0.01     6.6     7.6    1.1
LOCK INC          393   0.01     4.0     4.2    0.2
MOVQ              360   0.01     5.0     4.0   -1.0
DIVSD             297   0.01     5.1     5.6    0.4
ADDSS             256   0.01     4.8     5.1    0.3
ADDSD             255   0.00     4.9     5.2    0.3
MUL               239   0.00     2.7     4.0    1.3
XCHG              228   0.00     3.6     4.5    0.9
CVTSI2SS          216   0.00     5.0     4.0   -1.0
CQO               195   0.00     2.0     2.0    0.0
LOCK CMPXCHG      181   0.00     6.7     5.5   -1.2
LOCK ADD          155   0.00     4.8     4.5   -0.2
REP STOS          148   0.00     2.6     2.0   -0.6
LOCK DEC          137   0.00     6.1     6.1   -0.0
COMISD            109   0.00     5.2     4.8   -0.5
CVTSI2SD          107   0.00     5.2     4.1   -1.0
CWDE              103   0.00     1.0     2.0    1.0
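For anyone who wants to play with these numbers, here is a rough Python sketch that turns rows like the above into a count-weighted average instruction size. It assumes "Avg sz" is the average x64 size in bytes and "NEx64T" the average NEx64T size, which is how the Diff column reads. Only a handful of rows are pasted in, so the result says nothing about the whole table; the helper names are invented for this example.

```python
ROWS = """\
MOV 1975796 38.64 4.3 3.2 -1.1
J 603578 11.80 3.2 0.8 -2.4
CALL 388107 7.59 5.2 5.3 0.1
TEST 347927 6.80 2.9 4.3 1.4
LEA 289504 5.66 5.1 3.8 -1.3
INT 3 14189 0.28 1.0 2.0 1.0
"""


def parse_row(line):
    """Split from the right, because mnemonics such as 'INT 3' contain spaces."""
    mnemonic, count, _pct, x64, nex, _diff = line.rsplit(None, 5)
    return mnemonic, int(count), float(x64), float(nex)


def weighted_sizes(text):
    """Return (x64, NEx64T) average bytes per instruction, weighted by count."""
    rows = [parse_row(line) for line in text.splitlines() if line.strip()]
    total = sum(count for _, count, _, _ in rows)
    x64 = sum(count * size for _, count, size, _ in rows) / total
    nex = sum(count * size for _, count, _, size in rows) / total
    return x64, nex


x64_avg, nex_avg = weighted_sizes(ROWS)
print("x64: %.2f bytes, NEx64T: %.2f bytes" % (x64_avg, nex_avg))
```

Weighting by count matters here: MOV and J dominate, so their per-instruction savings outweigh the rows where NEx64T is larger.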

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 22-Sep-2022 19:47:15
#100 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3084
From: Germany

@matthey

Quote:

matthey wrote:
cdimauro Quote:

People should start using Bebbo's GCC instead of the plain vanilla one.


Bebbo's changes should be reviewed and merged into the official GCC. The GCC developers don't care though. Recall that the 68k backend was almost removed until a bounty raised enough money for a developer to add the minimum improvements to keep it up to date. That says a lot about the GCC developers' attitude toward the 68k, which is one of no respect. There are compiler developers here and there who treat the 68k with respect even without 68k roots. A good example is Min-Yih Hsu, who works on the LLVM 68k backend even though he is too young to have lived through the 68k era. Some people think the 68k should be well supported because of its history and retro appeal.

It's not that the GCC developers don't care on purpose: it's because no one is interested in maintaining the 68k backend. That's it.

IMO Bebbo should step in and propose himself as the 68k maintainer. Then there's a chance to have his changes finally merged into the master branch.

Which is basically what happened with LLVM, because before Min-Yih Hsu there was no maintainer for the 68k. Once they (because he wasn't alone, AFAIR) had produced a consistent and working set of patches, these were finally merged into master.

That's the way to go with open source.
Quote:
cdimauro Quote:
MOV3Q could make a difference. I've implemented something similar (but more general and with broader immediates) on NEx64T and it compressed A LOT the code on both x86 and x64.


We also decided to use a more general immediate compression instead of MOV3Q, although MOV3Q saves 2 more bytes where it can be used.

MOV3Q // saves 4 bytes on MOVE.L of immediates -1, 1-7
OP.L #d16 // saves 2 bytes on OP.L EA of immediates -32768-32767
OP.L #d32 // most forms are 6 bytes (2 for instruction + 4 for d32)

I believe the Apollo core ISA still uses my immediate compression idea, which uses an unused addressing mode encoding in EAs. It's really good, as basic forms of OP.L with a source EA and MOVE.L EA,dst with any destination can be compressed, and signed 16-bit immediates have broad coverage.

Absolutely: nice choice. But MOV3Q is also important to further reduce the code size.
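The trade-off between the two schemes can be sketched in a few lines of Python. The sizes are the ones quoted above (6 bytes for the plain MOVE.L #d32 form, minus 4 with MOV3Q, minus 2 with the 16-bit immediate form); the function name and flags are invented for illustration, and existing shortcuts like MOVEQ or CLR are deliberately ignored.

```python
def move_l_immediate_bytes(value, mov3q=True, imm16=True):
    """Bytes needed for MOVE.L #value,<ea> under the sizes quoted in the post."""
    if mov3q and (value == -1 or 1 <= value <= 7):
        return 2  # MOV3Q: saves 4 of the 6 bytes
    if imm16 and -32768 <= value <= 32767:
        return 4  # compressed 16-bit immediate: saves 2 of the 6 bytes
    return 6      # opcode word + full 32-bit immediate


for value in (5, 1000, 100000):
    print(value,
          move_l_immediate_bytes(value),
          move_l_immediate_bytes(value, mov3q=False))
```

With both enabled, small constants get the 2-byte form and everything else in the signed 16-bit range still drops to 4 bytes, which is why the two ideas complement each other rather than compete.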
Quote:
It could be used for OP.Q instructions with a 64 bit ISA as well.

That poses a problem: do you want the immediate to always be 16-bit (so, regardless of the operand size)? Because it's also useful to load 32-bit immediates into a 64-bit destination.
Quote:
cdimauro Quote:
MVS and MVZ don't make that much difference. From my x86 & x64 statistics they aren't used that much (especially on x86): https://www.appuntidigitali.it/18362/statistiche-su-x86-x64-parte-8-istruzioni-mnemonici/


The x86(-64) equivalents are MOVSX and MOVZX which I recall are something like 6 bytes?

They are 3 bytes minimum. Plus the displacement (8 or 32 bit). Plus the prefix for 64-bit operands and/or for accessing the additional registers (R8..R15).
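As a rough worked example of that size breakdown, here is a simplified Python sketch. It assumes the usual two-byte opcode (0F B6/B7 for MOVZX, 0F BE/BF for MOVSX) plus a ModRM byte, and it leaves out the 16-bit operand-size prefix and segment overrides; the function and parameter names are made up.

```python
def movzx_length(disp_bytes=0, rex=False, sib=False):
    """Approximate encoded length in bytes of MOVZX/MOVSX reg, r/m."""
    if disp_bytes not in (0, 1, 4):
        raise ValueError("displacement is 0, 1 or 4 bytes")
    length = 2 + 1            # two opcode bytes + ModRM
    length += 1 if sib else 0  # SIB byte for scaled-index addressing
    length += disp_bytes       # 8-bit or 32-bit displacement
    length += 1 if rex else 0  # REX prefix: 64-bit operand and/or R8..R15
    return length


print(movzx_length())                        # 3: register to register
print(movzx_length(disp_bytes=1))            # 4: [reg + disp8]
print(movzx_length(disp_bytes=4, rex=True))  # 8: [reg + disp32] with REX
```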
Quote:
They may not even be used when compiling with Os optimizations. The most common ColdFire versions of MVS and MVZ are only 2 bytes (4 bytes with an immediate, which is less useful). They only save 2 bytes per occurrence but they can be used everywhere and anytime due to their small size. Vasm does peephole optimizations using them for the ColdFire (I suggested them in anticipation of adding MVS/MVZ to the 68k). With that said, code folding can be used to fold MOVEQ+MOVE.B/W into MVZ and MOVE.B/W+EXT into MVS, but there are other places where the instructions could be used. It is still debatable whether they are worth the encoding space given code folding, but they would improve code density by a few percent on average and give better ColdFire compatibility.

IMO ad hoc instructions like MVS and MVZ are better for a 68k ISA: the code folding version is always bigger and more complicated to handle.
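A toy version of the peephole idea, in Python, could look like the following. It is only a sketch under assumptions: instructions are plain tuples, the function name is invented, condition code differences are ignored, and it is not how vasm implements its optimization.

```python
def fold_extensions(instrs):
    """Fold MOVEQ #0 + MOVE.B/W into MVZ and MOVE.B/W + sign extension into MVS."""
    sign_ext = {"MOVE.B": "EXTB.L", "MOVE.W": "EXT.L"}
    out, i = [], 0
    while i < len(instrs):
        op, *args = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if (nxt and op == "MOVEQ" and args[0] == "#0"
                and nxt[0] in sign_ext and nxt[2] == args[1]
                and args[1] not in nxt[1]):
            # MOVEQ #0,Dn ; MOVE.B/W <ea>,Dn  ->  MVZ.B/W <ea>,Dn
            # (skipped when <ea> itself uses Dn, e.g. as an index register)
            out.append(("MVZ" + nxt[0][-2:], nxt[1], nxt[2]))
            i += 2
        elif (nxt and op in sign_ext and nxt[0] == sign_ext[op]
                and nxt[1] == args[1]):
            # MOVE.B/W <ea>,Dn ; EXTB.L/EXT.L Dn  ->  MVS.B/W <ea>,Dn
            out.append(("MVS" + op[-2:], args[0], args[1]))
            i += 2
        else:
            out.append(instrs[i])
            i += 1
    return out


code = [
    ("MOVEQ", "#0", "D0"), ("MOVE.B", "(A0)", "D0"),
    ("MOVE.W", "(A1)", "D1"), ("EXT.L", "D1"),
    ("RTS",),
]
print(fold_extensions(code))
```

Each folded pair goes from two 2-byte instructions to one, which is the 2 bytes per occurrence mentioned above.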
Quote:
cdimauro Quote:

I agree, and a 68k 64-bit ISA could be the right moment for changing the ABI.


It's amazing that the 68k performs as well as it does and has the code density it has with that old inefficient UNIX ABI liability.

Absolutely. The good thing was that on the Amiga o.s. we used as many registers as possible for passing parameters to libraries, which helped improve the code density (and execution speed).
Quote:
CISC should give a higher percentage of branches as it has fewer bloat instructions than RISC. I made a spreadsheet of Dr. Vince Weaver's code density competition of size optimized code (like compiling with Os optimization).

https://docs.google.com/spreadsheets/d/e/2PACX-1vTyfDPXIN6i4thorNXak5hlP0FQpqpZFk2sgauXgYZPdtJX7FvVgabfbpCtHTkp5Yo9ai6MhiQqhgyG/pubhtml?gid=909588979&single=true

For the Linux Logo executable which is common integer code, the percentage of branches for various architectures comes out to something like the following.

68k - 29% of instructions branches
Thumb2 - 22%
RISCV32IMC - 28%
Thumb1 - 21%
RISCV64IMC - 28%
AArch64 - 26%
ARM EABI - 26%
PowerPC - 26%
SH-3 - 23%
SPARC - 24%
x86 - 27%
MIPS - 27%
x86-64 - 26%

The branch rule of thumb seems to apply more to ARM Thumb.

There are two problems with this test / benchmark: it's hand-tuned assembly, and it's just one program (a small one, moreover). To be more representative it should be made up of various applications of substantial size. So, real world code...
Quote:
The PowerPC Compiler Writer's Guide gave 22.1% branches for integer code, whereas this small program has 26% branches, which seems reasonable enough. I believe the 68k branch percentage is explained by the 68k having so few instructions, leaving branch instructions as a higher percentage. Compressed RISC encodings significantly increase the number of instructions to obtain their code density, so the percentage of branches decreases. This is particularly obvious for Thumb and SH-3.

Makes sense.
Quote:
One study found the instruction count of Thumb code was increased by 30%.

Efficient Use of Invisible Registers in Thumb Code
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.208&rep=rep1&type=pdf
Quote:

More than 98% of all microprocessors are used in embedded products, the most popular 32-bit processors among them being the ARM family of embedded processors. The ARM processor core is used both as a macrocell in building application specific system chips and standard processor chips. In the embedded domain, in addition to having good performance, applications must execute under constraints of limited memory. ARM supports dual width ISAs that are simple to implement and provide a tradeoff between code size and performance. In prior work we studied the characteristics of ARM and Thumb code and showed that for some embedded applications the Thumb code size was 29.8% to 32.5% smaller than the corresponding ARM code size. However, it was also observed that there was an increase in instruction counts for Thumb code which was typically around 30%. We studied the instruction sets and then compared the Thumb and ARM code versions to identify the causes of performance loss. The reasons we identified fall into two categories: Global inefficiency - Global inefficiency arises due to the fact that only half of the register file is visible to most instructions in Thumb code. Peephole inefficiency - Peephole inefficiency arises because pairs of Thumb instructions are required to perform the same task that can be performed by individual ARM instructions.
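Those two figures can be combined into an implied average Thumb instruction length. The Python sketch below is just that arithmetic, assuming every ARM instruction is 4 bytes; the function name is made up, and the result is derived from the quoted percentages rather than measured.

```python
def thumb_bytes_per_instruction(size_reduction, count_increase, arm_bytes=4.0):
    """Implied average Thumb instruction size.

    size_reduction: fraction by which Thumb code is smaller than ARM code.
    count_increase: fraction by which the Thumb instruction count is higher.
    """
    return arm_bytes * (1.0 - size_reduction) / (1.0 + count_increase)


for reduction in (0.298, 0.325):
    print("%.1f%% smaller -> %.2f bytes per instruction"
          % (reduction * 100, thumb_bytes_per_instruction(reduction, 0.30)))
```

So with about 30% more instructions, the quoted size savings work out to slightly more than 2 bytes per Thumb instruction on average, compared with 4 for ARM.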


Thumb2 was a big improvement over Thumb and SuperH 16 bit only encodings which didn't have enough room for immediates, displacements and GP register fields. Supporting 16 and 32 bit encodings allowed nearly 16 GP registers and larger immediates and displacements but instruction counts were still elevated compared to fixed 32 bit RISC encodings (even compared to the classic ARM 32 bit ISA also with ~16 GP registers).

Which means a huge performance penalty. "Nice" (for competitors).
Quote:
The 68k instruction counts are actually lower than all but AArch64. The 68k doesn't pay the RISC code density tax in the big 3 performance metrics which are instruction counts, memory/cache accesses and branches. The branch percentages are elevated but only because the RISC fluff instructions are eliminated. All those extra RISC instructions with added dependencies have to be executed at an increased rate while avoiding the instruction fetch bottleneck caused by the increased instructions.

I agree. The problem is that RISC propaganda was/is successful at selling this macro-family as the best and, at the same time, at smearing CISCs as the source of all evils. Unfortunately almost all processor vendors fell into the trap...

Copyright (C) 2000 - 2019 Amigaworld.net.
Amigaworld.net was originally founded by David Doyle