The (Microprocessors) Code Density Hangout
cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 22-Sep-2022 21:14:22
#101 ]
Elite Member
Joined: 29-Oct-2012
Posts: 2775
From: Germany

@matthey

Quote:

matthey wrote:
I would define a branch instruction as any instruction that can change the flow of code (the PC does not advance to the next sequential instruction). This would include conditional branches, unconditional branches, indirect branches, subroutine/function calls/returns and system calls/returns.

https://en.wikipedia.org/wiki/Branch_(computer_science)

I disagree with this definition. "Branch" is commonly used for short jumps, not for all changes in the flow of code; "jump" is the general term for those, IMO.
Quote:
Quote:
cdimauro Quote:
I've already shared a link before:
https://www.appuntidigitali.it/18362/statistiche-su-x86-x64-parte-8-istruzioni-mnemonici/

Data collected from the public beta of Photoshop CS4.

For x86:
Mnemonic    Count     %      Avg sz
J           110954    6.35   2.9



CALL 7.25%
J 6.35%
RET 1.52%
JMP 1.46%
---
16.58% branch instructions (at least)

Photoshop includes floating point code which uses fewer branches than pure integer code.

cdimauro Quote:

For x64:
Mnemonic    Count     %      Avg sz
J           132638    7.63   3.0



J 7.63%
CALL 7.59%
JMP 1.86%
RET 1.33%
---
18.41% branch instructions (at least)

I would expect the number of branch instructions to stay the same or possibly decrease for a large program when moving from a variable length encoded 32 bit ISA to a 64 bit ISA due to support for longer branch displacements but maybe x86 already supported branch instructions with a 32 bit displacement?

Yes, x86 already had it. Besides the usual short conditional jumps (1-byte base opcode), there are near jumps (2-byte base opcode) with a 32-bit offset (it can also be 16-bit in 32-bit mode, but that's practically useless, since the final instruction pointer is ANDed with $0000FFFF, yielding a 16-bit address).
Quote:
The 68020 ISA reduced the number of branches from the 68000 ISA by increasing the max branch displacements from 16 bits to 32 bits which eliminated whole tables of trampoline branches for code that was only a few hundred kiB (disassemble Frontier Elite for an example). Maybe the x86-64 code is so fat that it pushed branch displacements out of range but that would mean a 16 bit displacement limitation while the 32 bit 68020 supports 32 bit displacements for Bcc, BRA and BSR.

No, see above: x86-64 supports 32-bit displacements as well, with the same instruction length.

IA-32/x86-64 has an advantage with CALL, because it uses a 5-byte instruction (1 byte for the opcode + 4 for the displacement).
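As a concrete illustration of that 5-byte CALL, here is a small sketch (the helper name is mine) of how the E8 opcode plus a little-endian 32-bit displacement, measured from the end of the instruction, are laid out:

```python
def encode_call_rel32(call_addr: int, target: int) -> bytes:
    """x86/x86-64 CALL rel32: one 0xE8 opcode byte followed by a
    little-endian 32-bit displacement relative to the *next* instruction
    (i.e. the address of the CALL plus its 5-byte length)."""
    disp = (target - (call_addr + 5)) & 0xFFFFFFFF
    return bytes([0xE8]) + disp.to_bytes(4, "little")

# A call at 0x1000 to a function at 0x2000 encodes the displacement 0xFFB.
encoding = encode_call_rel32(0x1000, 0x2000)
```

Backward calls work the same way: the displacement simply wraps around as a two's-complement negative value.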
Quote:
Granted, the 32 bit 68020 allows greater branch displacements (branch ranges) than most 64 bit ISAs due to the advantage of a variable length encoding that embeds displacements (and immediates) in the instructions (most RISC ISA developers still don't comprehend this big advantage, except Mitch Alsup).

Indeed. But Mitch is selling his new architecture as a RISC...
Quote:
You only list the 19 most frequent branch instructions so maybe there are other less frequent branch instructions making the small difference. Maybe the total number of instructions declined with x86-64 while the number of branch instructions did not, leaving a higher percentage of branches as we saw for the 68k.

I've reported a more complete table (removing only the statistics for instructions with a frequency of less than 100).
Quote:
You mentioned that MOVSX and MOVZX were not too common above but MOVSXD is the 9th most common instruction and MOVZX is the 17th most common. This is only 3.19% of instructions between them which is about what I was seeing for the 68k by looking for instruction pairs that could be converted (a few more peephole optimizations are possible as vasm supports). The big difference is that in your x86-64 stats, MOVSXD is 4.5 bytes on average and MOVZX is 4.6 bytes on average while the ColdFire equivalents are 2 bytes in most cases and 4 bytes with an immediate (vasm could often peephole optimize immediate forms to a 2 byte MOVEQ as smaller immediates are more common).

Why were MVS and MVZ used with an immediate? To load a smaller immediate into a 32-bit register?
Quote:
MOVSX and MOVZX likely wouldn't even be used with Os optimization on x86-64

Why not? The alternative is to generate two instructions to do the same thing, which might also need more space (zero extension requires an extra XOR instruction).
Quote:
x86-64 programmers have to choose large powerful instructions like these or good code density which is the sign of a poor ISA design.

The main problem with x86-64 is the usage of the REX prefix for enabling 64-bit operand sizes and/or accessing the new registers (R8..R15): this is the absolute major cause for the decreased code density.
Quote:
These are not even worst case with MOV at 5 bytes, LEA at 5.8 bytes and CALL at 5 bytes in the top 5 most frequent instructions. Ouch! At least LEA is doing the work of several RISC architecture instructions. Load and add are usually the most frequent instructions where LEA is being used more frequently instead of ADD likely because multiple adds and even a shift are possible in one instruction doing more work. The AArch64 ISA developers figured out why the 2nd most common x86-64 instruction was doing so much work in one powerful instruction using complex addressing modes. Competing with CISC performance means throwing away RISC traditional ideals.

Correct. But consider that CALL has a 32-bit displacement (the same for JMP), so you can reach a 4GB address range.
Quote:
Another interesting observation on the MVS/MOVSX and MVZ/MOVZX instructions for the 68k is found in the new llvm 68k backend, which has no ColdFire support. We still find pseudo instructions which duplicate this functionality.

https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/M68k/M68kInstrData.td Quote:

/// Pseudo:
///
/// MOVSX [x] MOVZX [x] MOVX [x]
///

...

/// This group of Pseudos is analogues to the real x86 extending moves, but
/// since M68k does not have those we need to emulate. These instructions
/// will be expanded right after RA completed because we need to know precisely
/// what registers are allocated for the operands and if they overlap we just
/// extend the value if the registers are completely different we need to move
/// first.


The LLVM compiler intermediate representation uses these building blocks calling MOVSX "sext" and MOVZX "zext". Any name is better than the ColdFire "MVS" and "MVZ" names which can easily be renamed while providing the same functionality. I called them "SXT" and "ZXT" in the 68k ISAs I documented but I'm flexible with names as long as they fit with 68k naming conventions and friendliness. ColdFire instruction names were often poor choices and didn't seem to fit IMO.

Indeed. And that's why I prefer longer but more meaningful names.
Quote:
cdimauro Quote:

The biggest disassembly/statistics set that I have is for 64-bit Excel:
Mnemonic              Count      % Avg sz
J 603578 11.80 3.2 0.8 -2.4

Which is more realistic, IMO.

So, around 12% of instructions are conditional jumps. More or less 1 every 8 instructions.

Consider that the statistics are limited, because I start disassembling the executables from the entry point and then follow all jumps for which I have an address. So I don't blindly disassemble everything.


I have heard disassembling code on x86(-64) is not very reliable but your numbers seem reasonable. Even disassembling 68k code is not 100% reliable but it should be good enough for gathering statistics in most cases.

Yup. The best would have been instrumenting application runs and recording the executed instructions. That would also have yielded another very important set of statistics: the dynamic instruction mix, both in terms of number of instructions executed and their sizes.
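The entry-point-following approach described above is essentially recursive-traversal disassembly. A minimal sketch over a toy instruction table (not a real ISA; the table format here is invented purely for illustration) shows why data that is never reached by a code path never gets decoded:

```python
def trace_disassemble(program, entry):
    """Recursive-traversal disassembly over a toy instruction table.
    `program` maps address -> (mnemonic, size, branch_target_or_None).
    Conditional branches ('jcc') both fall through and queue their
    target; 'jmp'/'ret' end the linear flow. Bytes never reached by a
    code path (e.g. embedded data/strings) are never decoded."""
    seen, work = set(), [entry]
    while work:
        addr = work.pop()
        while addr in program and addr not in seen:
            seen.add(addr)
            mnem, size, target = program[addr]
            if target is not None:
                work.append(target)   # follow the branch/jump target later
            if mnem in ("jmp", "ret"):
                break                 # unconditional flow change: stop scan
            addr += size              # fall through to the next instruction

    return sorted(seen)

# 0: mov -> 2: jcc (target 8, falls through) -> 4: ret; 8: add -> 10: ret.
# Address 12 holds data and is never referenced, so it stays undecoded.
toy = {0: ("mov", 2, None), 2: ("jcc", 2, 8), 4: ("ret", 2, None),
       8: ("add", 2, None), 10: ("ret", 2, None), 12: ("data", 2, None)}
```

The same worklist idea is what keeps strings from being mis-disassembled as branch opcodes, at the cost of missing code reached only through computed jumps.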

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 22-Sep-2022 21:28:41
#102 ]
Elite Member
Joined: 29-Oct-2012
Posts: 2775
From: Germany

@Karlos

Quote:

Karlos wrote:
@matthey

Specifically, I'm only interested in conditional branches. These occupy about 25% of the available primary opcode space in MC64K. Those slots can be replaced with fast path register to register instructions that are 33% smaller than the current realisation and are even simpler to decode. Totally worth it, I think.

Sure. 25% of the opcode space used for conditional branches is really too much!

I used the same amount, but for mapping the entire SIMD extensions (plus some other very small opcode spaces).

Short conditional branches occupy 1/16 of the opcode space in my ISA.


@bhabbott

Quote:

bhabbott wrote:
Quote:
cdimauro wrote:
So, around 12% of instructions are conditional jumps. More or less 1 every 8 instructions.

This agrees with my analysis of Amiga programs, which are typically around 8-15%.

OK, so more or less the same statistics. Makes sense, since they are both CISCs with many similarities.


@matthey

Quote:

matthey wrote:

bhabbott Quote:

Reliable unless strings are wrongly disassembled as code (conditional branch opcodes ($62xx-$6fxx) match up with the letters 'b' to 'o', and applications often have a lot of strings in them). cdimauro's method of following jumps should avoid this, so although it might miss a lot of code it should be accurate for the code it followed.


I didn't have a problem disassembling branches with ADis (updated version by me is at the EAB link below).

http://eab.abime.net/showthread.php?t=82709

It is a smart disassembler that disassembles code paths within a memory range rather than disassembling a memory range from start to end. Instructions and data are flagged with type information and it will back up if data doesn't match, indicating a mistake like a RELOC in the middle of an instruction or trying to disassemble code marked as data by a previous instruction access. Dead code won't disassemble without adding a code path entrance, and Amiga libraries need the function entrances marked, which it can add using .fd files. ADis still has problems with identifying small portions of code vs data. This most commonly occurs for data around zero that would disassemble as an ORI #data,EA instruction. I flag unusual and useless but valid variants of ORI as likely being data but this only helps so much.

Exactly. A similar issue happens with IA-32/x86-64.
Quote:
The ISA developers should have put supervisor instructions at the start of the encoding map. This would not only have helped disassemblers but also helped stop and debug errant code that ends up executing data, which would then trap at that location.

For my NEx64T I've defined something better (IMO).

The instruction with opcode $0000 is a conditional branch with a zero offset, so it just falls through to the next instruction (whatever the condition).

In this case it's very easy to identify some bad disassembly.

This (and only this) instruction could also be marked as special, which means that it generates an exception when executed.

The advantage over a supervisor instruction is that a supervisor instruction is still a legitimate instruction, which means that trick wouldn't work for code executed at the supervisor level. My instruction, instead, always generates an exception.

Similarly, the instruction with $FFFF as opcode is a memory-to-memory move which uses a particular addressing mode encoded with those bits. This instruction could also be marked as special and generate an exception.

So both common cases of data (sequences of $00 or $FF) executed as instructions are caught.
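A minimal model of this guard-opcode idea (the names are mine, not NEx64T's actual decoder): the all-zeros and all-ones 16-bit opcode words are reserved to fault, so zero- or ones-filled data executed as code traps immediately instead of decoding as something plausible.

```python
class IllegalOpcode(Exception):
    """Raised when a reserved guard opcode word is fetched for execution."""

def fetch_and_check(words, pc):
    """Fetch the 16-bit opcode word at index `pc` and fault on the two
    guard encodings: $0000 (the zero-offset conditional branch, marked
    special) and $FFFF (the special memory-to-memory move)."""
    op = words[pc]
    if op in (0x0000, 0xFFFF):
        raise IllegalOpcode(f"guard opcode ${op:04X} at word {pc}")
    return op
```

Executing into a run of $00 or $FF bytes (the most common padding/data patterns) therefore raises on the very first fetched word.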

Karlos 
Re: The (Microprocessors) Code Density Hangout
Posted on 22-Sep-2022 22:08:47
#103 ]
Elite Member
Joined: 24-Aug-2003
Posts: 2806
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@cdimauro

It wasn't too much insofar as I had already supported all the operations (and their size-based variations) I wanted in the initial version. However, the original version used effective address decode calls for all modes. While the performance was acceptable, it was markedly slower than exvm on register-bound code. The previous attempt at a fast path went some way to addressing it, but it is a suboptimal implementation that still needs 3 bytes to encode and is consequently still slower than exvm's 2-byte instruction format. I would like to have my cake and eat it, so I'm going the route of using a sub-opcode to encode the comparison type (and size) in the conditional cases. This also allows me to include unsigned comparisons (blo, bls, bhs, bhi) which were missing before.
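The sub-opcode approach can be sketched like this: pack the comparison condition and operand size into one byte, so a single primary opcode covers all conditional branches instead of one primary slot per condition. The field layout below is hypothetical, not MC64K's actual format:

```python
# Hypothetical field layout: condition in the low 4 bits of a sub-opcode
# byte, operand size code in the next 2 bits. One primary opcode plus
# this byte replaces a whole family of per-condition primary opcodes.
CONDS = {"eq": 0, "ne": 1, "lt": 2, "le": 3, "gt": 4, "ge": 5,
         "lo": 6, "ls": 7, "hs": 8, "hi": 9}   # incl. unsigned forms
SIZES = {1: 0, 2: 1, 4: 2, 8: 3}               # operand size in bytes

def pack_branch_subop(cond: str, size: int) -> int:
    """Pack (condition, operand size) into one sub-opcode byte."""
    return (SIZES[size] << 4) | CONDS[cond]

def unpack_branch_subop(subop: int):
    """Recover (condition, operand size) from a sub-opcode byte."""
    conds_rev = {v: k for k, v in CONDS.items()}
    sizes_rev = {v: k for k, v in SIZES.items()}
    return conds_rev[subop & 0x0F], sizes_rev[(subop >> 4) & 0x3]
```

The freed primary slots can then hold the fast-path register-to-register instructions mentioned above.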

_________________
Doing stupid things for fun...

matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 23-Sep-2022 9:33:37
#104 ]
Super Member
Joined: 14-Mar-2007
Posts: 1601
From: Kansas

cdimauro Quote:

IMO Bebbo should step-in and propose himself as the 68k maintainer. Then there's a chance to have his changes finally merged on the master branch.


You make it sound easy when in reality there are likely politics, responsibilities, commitments and extra work involved in becoming the official 68k maintainer.

cdimauro Quote:
Absolutely: nice choice. But MOV3Q is also important to further reduce the code size.


MOV3Q is not flexible enough to implement in my current opinion. Not only does it have a tiny (-1,1-7) immediate range but there is no size field so it is fixed to a longword (32 bit) move. It is a 16 bit instruction with a 3 bit immediate field and a 6 bit EA so it isn't that cheap either. It offers no new functionality and it is purely for improving code density while a much more flexible method was found to gain half the code density advantage of it. It's the only ColdFire instruction in A-line which is not good for 68k compatibility (not a problem for a 64 bit mode but then a 64 bit version would be more useful). Since A-line is unused on the 68k, it should be possible to add the 2 bit size field in the most common location of bits 6 and 7 but then it would eat up more of A-line. A 64 bit mode probably has bigger priorities for A-line. Some instructions would need to be moved to open up quadword in most size fields and A-line has free space and makes more sense than F-line.

cdimauro Quote:

It could be used for OP.Q instructions with a 64 bit ISA as well.

That poses a problem: do you want to keep the immediate always 16-bit (regardless of the operand size)? Because it's also useful to load 32-bit data into a 64-bit destination.


There are a few options here using EA immediate addressing mode compression.

OP.L #d16 // EA=%111 101
OP.Q #d16 // EA=%111 101

or

OP.L #d16 // EA=%111 101
OP.Q #d32 // EA=%111 101

or

OP.L #d16 // EA=%111 101
OP.Q #d16 // EA=%111 101
OP.Q #d32 // EA=%111 110

There are currently 3 unused addressing modes in EA available that do not allow registers so they have reduced value. The 1st 3 bits are the mode and the 2nd 3 bits are the register.

EA=%111 101
EA=%111 110
EA=%111 111

It is not possible to encode (d32,An) which would have been the most useful as the register field is already set. It is possible to encode (d32,PC) with one of the open slots which I believe is worthwhile, especially if opening up PC relative writes (some 68k purists will complain but x86-64 has shown the advantage of PC relative writes which saves a base register and improves code density). It would be possible to define particular base address registers and allow (d32,A5) for example but this is not orthogonal and (d32,PC) could reach a lot of data with joined sections. Another option would be (xxx).Q absolute addressing for 64 bit mode which would allow absolute addressing to the whole 64 bit address space but it is not efficient even though x86-64 uses it, perhaps more than it should. It would be possible to provide absolute addressing through the full format extension word EAs with a quad word size added although this would make the instruction 2 bytes longer and potentially slower. Then the EA mode slots could be used for #d16, #d32 or a particular number like #1 or #-1. So it would be possible to do the following.

EA=%111 101 // #d16
EA=%111 110 // #d32
EA=%111 111 // (d32,PC)

or

EA=%111 101 // #d16
EA=%111 110 // (d32,PC)
EA=%111 111 // (xxx).Q

My current preferences are #d16 and (d32,PC) while I'm undecided on what to do with the 3rd slot for now. I think #d16 would give the best compression for both .L and .Q sizes while #d32 would only be useful for .Q, where 64-bit immediates are less common and most are small.
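A toy model of this immediate compression (the mode names and extension-byte counts follow the options above; the chooser itself is my sketch): an assembler would pick the compressed #d16 slot whenever the value survives a 16-bit sign extension, and fall back to a full-width immediate otherwise.

```python
def fits_signed(value: int, bits: int) -> bool:
    """True if `value` round-trips through a signed field of `bits` bits."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))

def pick_imm_mode(value: int, op_bytes: int):
    """Return (EA mode, extension bytes) for an immediate operand.
    `op_bytes` is 4 for .L and 8 for .Q operations."""
    if fits_signed(value, 16):
        return ("#d16", 2)        # compressed: one 16-bit extension word
    return ("#full", op_bytes)    # classic immediate: full operand width
```

Since most immediates are small, the 2-byte #d16 slot wins most of the time, and the saving is biggest for .Q, where the full immediate would cost 8 bytes.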


cdimauro Quote:

I disagree on this definition. Branch is commonly used for short jumps: not for all changes in the flow of code. Jump is the general term for this, IMO.


A fork implies multiple paths or branches, but a branch can be straight, crooked, forked or not forked. It looks like Wiki considers branches to be more than conditional branches, not that it means much. Technical terms aren't necessarily well defined, or defined the same industry wide.

cdimauro Quote:

Yes, x86 already had it. Besides the usual short (1 byte base opcode) short conditional jumps, there are near jumps (2 bytes base opcode) with a 32-bit offset (it can also be 16-bit in 32-bit mode, but the it's practically useless, since the final RIP value is anded with $0000FFFF, so getting a 16-bit address).


x86-64 is removing some of the advantage of displacement scaling which is not good for code density.

x86 Jcc
8 bit displacement: 2 bytes
16 bit displacement: 5 bytes
32 bit displacement: 6 bytes

x86-64 Jcc
8 bit displacement: 2 bytes
16 bit displacement: 6 bytes (encoding removed so 32 bit displacement used)
32 bit displacement: 6 bytes

68020 Bcc
8 bit displacement: 2 bytes
16 bit displacement: 4 bytes
32 bit displacement: 6 bytes

What good is a variable length encoding with immediate and displacement scaling if the code is bigger than a fixed 32 bit encoding on average?
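The displacement-scaling gap can be put into a small byte-count helper. The sizes follow the standard encodings (the 2-byte 70+cc rel8 and 6-byte 0F 8x rel32 Jcc forms, and the 68020's in-word, one-extension-word, and two-extension-word Bcc forms); the function names are mine.

```python
def x86_64_jcc_size(disp: int) -> int:
    """x86-64 Jcc: 2-byte short form for rel8, otherwise the 6-byte
    0F 8x rel32 form (the rel16 encoding was dropped in 64-bit mode)."""
    return 2 if -128 <= disp <= 127 else 6

def m68020_bcc_size(disp: int) -> int:
    """68020 Bcc: the 8-bit displacement lives in the opcode word itself
    ($00 and $FF are escape values selecting the 16- and 32-bit forms,
    so displacements 0 and -1 can't use the short form)."""
    if -128 <= disp <= 127 and disp not in (0, -1):
        return 2
    if -32768 <= disp <= 32767:
        return 4
    return 6
```

For every displacement range the 68020 branch is at least as small, and for the common mid-range displacements it is 2 bytes smaller.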

cdimauro Quote:

Indeed. But Mitch is selling his new architecture as a RISC...


In my experience with Mitch, he is not biased against CISC but he likely recognizes that the mainstream computer market is. It most likely would be easier to introduce a new RISC architecture. However, introducing a new architecture is difficult and it is easier to reintroduce an upgraded existing architecture that already has developer support and software.

cdimauro Quote:

Why MVS and MVZ were used with an immediata? To load a smaller immediate on a 32-bit register?


Yes. MVS.B is completely useless for that purpose because MOVEQ takes a signed 8-bit immediate. The other forms are useful for loading immediates outside the range of MOVEQ into data registers. My immediate mode compression with addressing modes also removes most of the advantage.

cdimauro Quote:

Why not? The alternative is to generate two instructions for doing the same, which might also result in more space needed (for zero extension, which requires a XOR instruction).


It looks to me like some of the instruction combos to replace MOVSX and MOVZX would be smaller. Also, they are often used to avoid partial register stalls, which is done for performance reasons and isn't necessary when compiling for size.

cdimauro Quote:

The main problem with x86-64 is the usage of the REX prefix for enabling 64-bit operand sizes and/or accessing the new registers (R8..R15): this is the absolute major cause for the decreased code density.


Right. The Apollo core ISA makes the same mistake except the prefix is double the size of a x86-64 prefix. It's not like the 68k started handicapped by only 8 GP registers like x86.

bhabbott 
Re: The (Microprocessors) Code Density Hangout
Posted on 24-Sep-2022 5:16:51
#105 ]
Regular Member
Joined: 6-Jun-2018
Posts: 148
From: Aotearoa

@matthey

Quote:

matthey wrote:

Right. The Apollo core ISA makes the same mistake except the prefix is double the size of a x86-64 prefix. It's not like the 68k started handicapped by only 8 GP registers like x86.

It's only a 'mistake' if code density matters. If only a few instructions are prefixed then executable size is not an issue. Vampires have at least 120MB of RAM, so a few kB here and there is nothing. If the prefixed instruction executes much faster and is smaller than the vanilla code it replaces, it's still worth it.

You say "except the prefix is double the size of a x86-64 prefix" as if this makes it much worse, but x86-64 prefixes are a single byte which breaks word alignment. On 68k the repercussions of this would be severe.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 24-Sep-2022 8:55:26
#106 ]
Elite Member
Joined: 29-Oct-2012
Posts: 2775
From: Germany

@matthey

Quote:

matthey wrote:
cdimauro Quote:

IMO Bebbo should step-in and propose himself as the 68k maintainer. Then there's a chance to have his changes finally merged on the master branch.


You make it sound easy when in reality there are likely politics, responsibilities, commitments and extra work involved in becoming the official 68k maintainer.

Well, there's some extra work for sure: just providing patches is not enough. But I don't believe there's a prejudice / bias against the 68k.
Quote:
cdimauro Quote:

Absolutely: nice choice. But MOV3Q is also important to further reduce the code size.


MOV3Q is not flexible enough to implement in my current opinion. Not only does it have a tiny (-1,1-7) immediate range but there is no size field so it is fixed to a longword (32 bit) move. It is a 16 bit instruction with a 3 bit immediate field and a 6 bit EA so it isn't that cheap either. It offers no new functionality and it is purely for improving code density while a much more flexible method was found to gain half the code density advantage of it.

Yes, it's limited, but better than nothing. As you said, it already takes some encoding space. Adding a size field would have been a good plus, but then you'd need 11 bits of the 16-bit encoding space, which is huge and doesn't make it worth the cost.
Quote:
It's the only ColdFire instruction in A-line which is not good for 68k compatibility

Absolutely. Wrong choice (similar to the MOVE16, if I recall correctly).
Quote:
(not a problem for a 64 bit mode but then a 64 bit version would be more useful). Since A-line is unused on the 68k, it should be possible to add the 2 bit size field in the most common location of bits 6 and 7 but then it would eat up more of A-line. A 64 bit mode probably has bigger priorities for A-line. Some instructions would need to be moved to open up quadword in most size fields and A-line has free space and makes more sense than F-line.

As I've suggested on Olaf's Amiga developers forum (a pity that it isn't available anymore, even in read-only mode), it would be better to reorganize & reuse line-F and line-A in a 64-bit mode to implement scalar and packed instructions, respectively, for the new SIMD unit.

The encoding space (29 bits!) is enough for a very good SIMD extension that fits the 68k very well (16 vector registers should be enough). And one of the most compact, while being very modern (vector-length-agnostic, supporting both integer and floating-point types of all sizes, and even masks for predication!).

That's something I could hardly achieve on my C64K ISA, but it fits perfectly on the 68k (and a possible 64-bit extension which keeps the space for lines A & F).
Quote:
cdimauro Quote:

That poses a problem: do you want to keep the immediate always 16-bit (regardless of the operand size)? Because it's also useful to load 32-bit data into a 64-bit destination.


There are a few options here using EA immediate addressing mode compression.

OP.L #d16 // EA=%111 101
OP.Q #d16 // EA=%111 101

or

OP.L #d16 // EA=%111 101
OP.Q #d32 // EA=%111 101

or

OP.L #d16 // EA=%111 101
OP.Q #d16 // EA=%111 101
OP.Q #d32 // EA=%111 110

There are currently 3 unused addressing modes in EA available that do not allow registers so they have reduced value. The 1st 3 bits are the mode and the 2nd 3 bits are the register.

EA=%111 101
EA=%111 110
EA=%111 111

It is not possible to encode (d32,An) which would have been the most useful as the register field is already set. It is possible to encode (d32,PC) with one of the open slots which I believe is worthwhile, especially if opening up PC relative writes (some 68k purists will complain but x86-64 has shown the advantage of PC relative writes which saves a base register and improves code density).

I agree on this: (d32, PC) is a must have.
Quote:
It would be possible to define particular base address registers and allow (d32,A5) for example but this is not orthogonal

Not needed either: the frame pointer hardly needs more than 16-bit displacements.
Quote:
and (d32,PC) could reach a lot of data with joined sections. Another option would be (xxx).Q absolute addressing for 64 bit mode which would allow absolute addressing to the whole 64 bit address space but it is not efficient even though x86-64 uses it, perhaps more than it should. It would be possible to provide absolute addressing through the full format extension word EAs with a quad word size added although this would make the instruction 2 bytes longer and potentially slower.

Forget it: 64-bit absolutes aren't common even on x86-64.

Specifically, there's no general 64-bit absolute addressing mode on this ISA, just two specific move instructions from/to RAX.

You could do the same on your 64-bit 68k ISA using a 16-bit opcode with 1 + 3 bits (4 if you want to support address registers).
Quote:
Then the EA mode slots could be used for #d16, #d32 or a particular number like #1 or #-1. So it would be possible to do the following.

EA=%111 101 // #d16
EA=%111 110 // #d32
EA=%111 111 // (d32,PC)

or

EA=%111 101 // #d16
EA=%111 110 // (d32,PC)
EA=%111 111 // (xxx).Q

My current preferences are #d16 and (d32,PC) while I'm undecided on what to do with the 3rd slot for now. I think #d16 would give the best compression for both .L and .Q sizes while #d32 would only be useful for .Q where 64 bits immediates are less common and most are small.

I have much better (IMO) proposals for extended/quick immediates.

FIRST (16-bit-only):
OP.B #d16 // EA=%111 101: 16-bit zero-extended to 32-bit
OP.W #d16 // EA=%111 101: 16-bit zero-extended to 64-bit
OP.L #d16 // EA=%111 101: 16-bit negative-extended to 32-bit
OP.Q #d16 // EA=%111 101: 16-bit negative-extended to 64-bit
One encoding. This maximizes the use of the 16-bit immediate for all operand sizes.

SECOND (16 + 32-bit; full 32-bit support):
OP.B #d32 // EA=%111 101: 32-bit sign-extended to 64-bit
OP.W #d32 // EA=%111 101: 32-bit zero-extended to 64-bit
OP.L #d16 // EA=%111 101: 16-bit sign-extended to 32-bit
OP.Q #d16 // EA=%111 101: 16-bit sign-extended to 64-bit
One encoding. This maximizes use of the 32-bit immediate, but only for 64-bit operand sizes.

THIRD (16 + 32 bit; full support):
OP.W #d32 // EA=%111 101: 32-bit zero-extended to 64-bit
OP.L #d16 // EA=%111 101: 16-bit zero-extended to 32-bit
OP.Q #d16 // EA=%111 101: 16-bit zero-extended to 64-bit
OP.W #d32 // EA=%111 110: 32-bit negative-extended to 64-bit
OP.L #d16 // EA=%111 110: 16-bit negative-extended to 32-bit
OP.Q #d16 // EA=%111 110: 16-bit negative-extended to 64-bit
Two encodings. This maximizes the use of both 16-bit & 32-bit immediates for all operand sizes.

All three have pros and cons, of course. Now it's "just" a matter of picking the best fit for your ISA.
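All three proposals reduce to a handful of widening rules. A small sketch of those rules, reading "negative-extended" as filling the upper bits with ones (my interpretation of the term above):

```python
MASK = {16: 0xFFFF, 32: 0xFFFFFFFF, 64: 0xFFFFFFFFFFFFFFFF}

def zero_extend(value: int, frm: int, to: int) -> int:
    """Upper bits cleared: reaches the low constants only."""
    return value & MASK[frm]

def sign_extend(value: int, frm: int, to: int) -> int:
    """Upper bits copy the source sign bit."""
    v = value & MASK[frm]
    if v >> (frm - 1):                  # sign bit set: fill with ones
        v |= MASK[to] & ~MASK[frm]
    return v

def negative_extend(value: int, frm: int, to: int) -> int:
    """Upper bits always set: reaches the top constants (e.g. masks)."""
    return (value & MASK[frm]) | (MASK[to] & ~MASK[frm])
```

Pairing zero-extension with negative-extension (as in the FIRST and THIRD proposals) covers both the bottom and the top 64K of each operand size from one 16-bit field, which a single sign-extension rule cannot do.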
Quote:
cdimauro Quote:

Yes, x86 already had it. Besides the usual short (1 byte base opcode) short conditional jumps, there are near jumps (2 bytes base opcode) with a 32-bit offset (it can also be 16-bit in 32-bit mode, but the it's practically useless, since the final RIP value is anded with $0000FFFF, so getting a 16-bit address).


x86-64 is removing some of the advantage of displacement scaling which is not good for code density.

x86 Jcc
8 bit displacement: 2 bytes
16 bit displacement: 5 bytes
32 bit displacement: 6 bytes

x86-64 Jcc
8 bit displacement: 2 bytes
16 bit displacement: 6 bytes (encoding removed so 32 bit displacement used)
32 bit displacement: 6 bytes

Exactly. It was a bad choice, but I have to say that it depends on how the address size prefix was/is implemented, on IA-32 first and on x86-64 after. It could have been changed by Intel and/or AMD, but they decided not to further complicate the ALU(s) for this.
Quote:
68020 Bcc
8 bit displacement: 2 bytes
16 bit displacement: 4 bytes
32 bit displacement: 6 bytes

What good is a variable length encoding with immediate and displacement scaling if the code is bigger than a fixed 32 bit encoding on average?

Indeed. That's why my ISA has encodings similar to the 68020's (in terms of opcode space), which helped code density a lot.
Quote:
cdimauro Quote:

Indeed. But Mitch is selling his new architecture as a RISC...


In my experience with Mitch, he is not biased against CISC but he likely recognizes that the mainstream computer market is. It most likely would be easier to introduce a new RISC architecture. However, introducing a new architecture is difficult and it is easier to reintroduce an upgraded existing architecture that already has developer support and software.

But he's working on a completely new architecture...
Quote:
cdimauro Quote:

Why not? The alternative is to generate two instructions for doing the same, which might also result in more space needed (for zero extension, which requires a XOR instruction).

It looks to me like some of the instruction combos to replace MOVSX and MOVZX would be smaller.

This usually happens when operating on registers. MOVSX and MOVZX usually have the same or better size when accessing memory.

The only exception is represented by the special absolute address MOV instructions using the accumulator. But those are exceptions and very rarely used.
Quote:
Also, they are often used to avoid partial register stalls which is done for performance reasons not necessary when compiling for size.

Yes, that's another point in favour of the MOVSX and MOVZX instructions: they don't cause partial register stalls.

BTW, on my NEx64T, MOVSX and MOVZX can be avoided most of the time, because any instruction can automatically sign- or zero-extend its memory operand from its size to the size of the destination. This reduces both the code size (on average, even though this mechanism requires bigger opcodes) and the number of executed instructions, while also avoiding the use of an extra register and the pipeline stalls due to dependencies.
Quote:
cdimauro Quote:

The main problem with x86-64 is the usage of the REX prefix for enabling 64-bit operand sizes and/or accessing the new registers (R8..R15): this is the absolute major cause for the decreased code density.


Right. The Apollo core ISA makes the same mistake except the prefix is double the size of a x86-64 prefix. It's not like the 68k started handicapped by only 8 GP registers like x86.

In fact the Apollo core has an advantage here because the prefix is not needed to access the usual 8 + 8 68k registers.

The prefix is required only when the operand has 64-bit size. However, that happens with address registers too, and unfortunately using / manipulating them is quite common in regular 68k code.


@bhabbott

Quote:

bhabbott wrote:
@matthey

Quote:

matthey wrote:

Right. The Apollo core ISA makes the same mistake except the prefix is double the size of a x86-64 prefix. It's not like the 68k started handicapped by only 8 GP registers like x86.

It's only a 'mistake' if code density matters.

Which is the case: it matters. A LOT.
Quote:
If only a few instructions are prefixed then executable size is not an issue.

In 64-bit execution mode (which currently isn't the case for the Apollo), prefixes are used quite often: see above.
Quote:
Vampires have at least 120MB of RAM, so a few kB here and there is nothing.

It's not a matter of using a few kB. The most important thing is how much space is taken in the instruction cache.
Quote:
If the prefixed instruction executes much faster and is smaller than the vanilla code it replaces, it's still worth it.

Which shouldn't be the case.
Quote:
You say "except the prefix is double the size of a x86-64 prefix" as if this makes it much worse,

Which is exactly the case: it's MUCH worse, because the prefix uses two bytes instead of one.
Quote:
but x86-64 prefixes are a single byte which breaks word alignment.

In fact it isn't a problem for x86-64. It's a problem for the Apollo core, because 16-bit alignment must absolutely be kept, so a prefix must be (at least) 16 bits in size.
Quote:
On 68k the repercussions of this would be severe.

Indeed: they are.

 Status: Offline
Profile     Report this post  
matthey 
Re: The (Microprocessors) Code Density Hangout
Posted on 25-Sep-2022 5:42:32
#107 ]
Super Member
Joined: 14-Mar-2007
Posts: 1601
From: Kansas

bhabbott Quote:

It's only a 'mistake' if code density matters. If only a few instructions are prefixed then executable size is not an issue. Vampires have at least 120MB of RAM, so a few kB here and there is nothing. If the prefixed instruction executes much faster and is smaller than the vanilla code it replaces, it's still worth it.


Sigh. Code density is more about instruction cache efficiency. There is also more decoding overhead for a prefix.

Gunnar's logic is likely similar to yours. The 64-bit features will be rarely used, so it is fine if they have high overhead from a prefix as long as they have a low resource cost now. Adding register banks and having a large register file is cheap in an FPGA at a low CPU clock rate and may give some extra performance, so plan for today. There will never be an ASIC either, so optimize the ISA and design for an FPGA, which saves resources and gives benefits today. It is poor planning with a self-fulfilling prophecy that the future will never come.

bhabbott Quote:

You say "except the prefix is double the size of a x86-64 prefix" as if this makes it much worse, but x86-64 prefixes are a single byte which breaks word alignment. On 68k the repercussions of this would be severe.


An x86 prefix is 1 byte while a 68k prefix is 2 bytes, which is twice the code-size increase. A 2-byte prefix can hold twice as much data, so 64-bit extensions and extra register accesses could be placed in one prefix; but then that wouldn't be common "if only a few instructions are prefixed", and extra registers shouldn't be needed as often, since the 68k normally has 16 GP registers while x86 only has 8 without a prefix.
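The arithmetic behind this prefix-size comparison can be sketched with a toy model. The 40% prefix frequency and 3-byte average instruction length below are illustrative assumptions, not measured data:

```python
# Toy model (illustrative, not measured data): estimate relative code-size
# growth caused by an instruction prefix, given how often it is used.

def size_growth(frac_prefixed: float, prefix_bytes: int, avg_insn_bytes: float) -> float:
    """Relative code-size increase from prefixing a fraction of instructions."""
    return frac_prefixed * prefix_bytes / avg_insn_bytes

# Assumed 3-byte average instruction; a 1-byte REX-style prefix on 40% of
# instructions grows code by ~13%.
x86 = size_growth(0.40, 1, 3.0)

# A 2-byte 68k-style prefix under the same assumptions grows code by ~27%,
# i.e. exactly twice as much, as noted above.
m68k = size_growth(0.40, 2, 3.0)

assert abs(x86 - 0.40 / 3.0) < 1e-12
assert m68k == 2 * x86
```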

Last edited by matthey on 25-Sep-2022 at 05:55 AM.

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 25-Sep-2022 7:01:05
#108 ]
Elite Member
Joined: 29-Oct-2012
Posts: 2775
From: Germany

@matthey

Quote:

matthey wrote:
bhabbott Quote:

It's only a 'mistake' if code density matters. If only a few instructions are prefixed then executable size is not an issue. Vampires have at least 120MB of RAM, so a few kB here and there is nothing. If the prefixed instruction executes much faster and is smaller than the vanilla code it replaces, it's still worth it.


Sigh. Code density is more about instruction cache efficiency. There is also more decoding overhead for a prefix.

Gunnar's logic is likely similar to yours. The 64-bit features will be rarely used, so it is fine if they have high overhead from a prefix as long as they have a low resource cost now. Adding register banks and having a large register file is cheap in an FPGA at a low CPU clock rate and may give some extra performance, so plan for today. There will never be an ASIC either, so optimize the ISA and design for an FPGA, which saves resources and gives benefits today. It is poor planning with a self-fulfilling prophecy that the future will never come.

Exactly. Currently the design is good enough for the specific needs of the day. And as long as you just need to manipulate 64-bit data, you usually do it with the AMMX extension, which uses the F-line, and everything is fine.

But as soon as you want to use 64-bit addresses (especially) and/or 64-bit data in the data registers, then the prefix must be used, and code density will be severely hurt.

The same happens with the AMMX unit: it was designed on purpose for 64-bit data and there's no future plan for extending it. Poor Samurai Crow tried to explain that a sort of vector-length-agnostic version could have benefited performance a lot, but Gunnar stated that he'll stick with his design.

So, really, a very short and limited vision (BTW I doubt that Gunnar understood Samurai).

Gunnar 
Re: The (Microprocessors) Code Density Hangout
Posted on 26-Sep-2022 7:43:05
#109 ]
Member
Joined: 25-Sep-2022
Posts: 42
From: Unknown

@matthey

Quote:

matthey wrote:
Gunnar's logic is likely similar to yours. The 64-bit features will be rarely used, so it is fine if they have high overhead from a prefix as long as they have a low resource cost now. Adding register banks and having a large register file is cheap in an FPGA at a low CPU clock rate and may give some extra performance, so plan for today. There will never be an ASIC either, so optimize the ISA and design for an FPGA, which saves resources and gives benefits today. It is poor planning with a self-fulfilling prophecy that the future will never come.


Do you really think that more CPU registers would be a problem when developing an ASIC?
I saw you posting such things before. This is of course absolute nonsense.

More registers are no problem at all for going ASIC:
this should be obvious to everyone, as every CPU made today has many more registers.
IBM POWER has over a hundred registers; the same goes for INTEL, AMD, and ARM.

It's comically amusing to see an armchair expert who has zero experience in ASIC development and no experience in CPU design giving "smart" advice to people who develop high-end CPU chips for a living, with decades of practical experience in making ASIC designs. The Apollo core team consists of several people, many of whom have worked on some of the best IBM high-end CPU ASIC designs, from the PowerPC 970 "G5" to POWER7, 8 and 9.



Seeing that your posts about the 68080 are often totally wrong, it's obvious that you don't know what you're talking about.
Many of your posts about AMMX, the instruction set or the CPU features are simply technically false. This shows me that you never coded for the 68080 and that all "your knowledge" is based on misreading, misunderstanding or hearsay about its features.

But has having no clue ever been a problem for an armchair expert?

Karlos 
Re: The (Microprocessors) Code Density Hangout
Posted on 26-Sep-2022 7:54:26
#110 ]
Elite Member
Joined: 24-Aug-2003
Posts: 2806
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@Gunnar

Quote:

Its comically amusing to see an armchair expert that has zero experience in ASIC development


What makes you say this? Wheels for farm vehicles? They're Also Something Intrinsically Circular...

_________________
Doing stupid things for fun...

Gunnar 
Re: The (Microprocessors) Code Density Hangout
Posted on 26-Sep-2022 8:09:37
#111 ]
Member
Joined: 25-Sep-2022
Posts: 42
From: Unknown

Quote:

Quote:

Right. The Apollo core ISA makes the same mistake except the prefix is double the size of a x86-64 prefix. It's not like the 68k started handicapped by only 8 GP registers like x86.

In fact the Apollo core has an advantage here because the prefix is not needed to access the usual 8 + 8 68k registers.

The prefix is required only when the operand has 64-bit size. However it happens with address registers and unfortunately it's quite common using / manipulating them on the regular 68k code.


Your post is technically wrong.
Fact is: the 68080 does not need a prefix to manipulate the full width of the address registers.
Why do you post stuff like this if you really haven't understood the CPU?


A nice fact for all "small code size lovers":
the 68080 prefix not only allows using more registers (which is already very good for performance),
it also allows upgrading an instruction from 2-operand form to 3-operand form.

This means that you can replace 2 old instructions with 1 new one.
This improves performance and it improves code density,
as one three-operand instruction is faster and needs less space than two old two-operand instructions.
= Yes, 68080 code can have better density than plain 68000 code.

Last edited by Gunnar on 26-Sep-2022 at 08:12 AM.
Last edited by Gunnar on 26-Sep-2022 at 08:10 AM.

Karlos 
Re: The (Microprocessors) Code Density Hangout
Posted on 26-Sep-2022 10:22:45
#112 ]
Elite Member
Joined: 24-Aug-2003
Posts: 2806
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@Gunnar

Seriously, what do you even know about it? You were only instrumental in the design and execution of the thing, as if that matters. You don't have multiple years of epic length shitposts on pipeline lengths, cycle counts, cache sizes, instruction sets and the developmental costs for custom ASICs.

Coming over here with your factual corrections and been-there-and-built-that credentials. Who do you think you are, eh? Pfft, you only have like 2 posts!

Last edited by Karlos on 26-Sep-2022 at 10:24 AM.
Last edited by Karlos on 26-Sep-2022 at 10:24 AM.

_________________
Doing stupid things for fun...

Gunnar 
Re: The (Microprocessors) Code Density Hangout
Posted on 26-Sep-2022 10:55:20
#113 ]
Member
Joined: 25-Sep-2022
Posts: 42
From: Unknown

@Karlos

Quote:

Karlos wrote:
@Gunnar

Seriously, what do you even know about it? You were only instrumental in the design and execution of the thing, as if that matters. You don't have multiple years of epic length shitposts on pipeline lengths, cycle counts, cache sizes, instruction sets and the developmental costs for custom ASICs.


True :(



Do you think these "armchair experts" really think all readers are so stupid?
Do you recall how "cdimauro" posted a lot of technical bullshit advertising his "TINA" project not that long ago?

Making nice PowerPoints showing impressive 128-bit buses and a > 500MHz fantasy clock rate.
All of this was pure technical nonsense.

Did he think we'd forget all the bullshit he posted?
I did not forget.

Bosanac 
Re: The (Microprocessors) Code Density Hangout
Posted on 26-Sep-2022 11:35:53
#114 ]
Regular Member
Joined: 10-May-2022
Posts: 127
From: Unknown

@Gunnar

Quote:
True :(



Do you think these "armchair experts" really think all readers are so stupid?
Do you recall how "cdimauro" posted a lot of technical bullshit advertising his "TINA" project not that long ago?

Making nice PowerPoints showing impressive 128-bit buses and a > 500MHz fantasy clock rate.
All of this was pure technical nonsense.

Did he think we'd forget all the bullshit he posted?
I did not forget.


Those who can, do.

Those who can't, shitpost on forums with delusions of grandeur...

Karlos 
Re: The (Microprocessors) Code Density Hangout
Posted on 26-Sep-2022 12:53:33
#115 ]
Elite Member
Joined: 24-Aug-2003
Posts: 2806
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@Bosanac

Quote:
Those who can't, shitpost on forums with delusions of grandeur...


You certainly called me out!

_________________
Doing stupid things for fun...

Bosanac 
Re: The (Microprocessors) Code Density Hangout
Posted on 26-Sep-2022 13:47:53
#116 ]
Regular Member
Joined: 10-May-2022
Posts: 127
From: Unknown

@Karlos

hahahahahahahahahahahaah! :)

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 26-Sep-2022 16:26:39
#117 ]
Elite Member
Joined: 29-Oct-2012
Posts: 2775
From: Germany

@Gunnar

Quote:

Gunnar wrote:
@matthey

Quote:

matthey wrote:
Gunnar's logic is likely similar to yours. The 64-bit features will be rarely used, so it is fine if they have high overhead from a prefix as long as they have a low resource cost now. Adding register banks and having a large register file is cheap in an FPGA at a low CPU clock rate and may give some extra performance, so plan for today. There will never be an ASIC either, so optimize the ISA and design for an FPGA, which saves resources and gives benefits today. It is poor planning with a self-fulfilling prophecy that the future will never come.


Do you really think that more CPU registers would be a problem when developing an ASIC?
I saw you posting such things before. This is of course absolute nonsense.

More registers are no problem at all for going ASIC:
this should be obvious to everyone, as every CPU made today has many more registers.
IBM POWER has over a hundred registers; the same goes for INTEL, AMD, and ARM.

Those are internal registers, used for register renaming.
Quote:
It's comically amusing to see an armchair expert who has zero experience in ASIC development and no experience in CPU design giving "smart" advice to people who develop high-end CPU chips for a living, with decades of practical experience in making ASIC designs. The Apollo core team consists of several people, many of whom have worked on some of the best IBM high-end CPU ASIC designs, from the PowerPC 970 "G5" to POWER7, 8 and 9.

Designing ASICs or FPGAs is fine, but where have you and your team seen an architecture with so many registers available? 32 data + 16 address = 48 registers, which is a HUGE number. Especially for a CISC, and specifically for the 68000, which also has a powerful memory-to-memory instruction.

Do you expect that coders will use most or even all of them when writing finely tuned assembly programs?

Having the possibility to add more registers doesn't mean that you have to do it. Especially in such a huge quantity, because most of the time they will NOT be used.
@Gunnar

Quote:

Gunnar wrote:
Quote:

In fact the Apollo core has an advantage here because the prefix is not needed to access the usual 8 + 8 68k registers.

The prefix is required only when the operand has 64-bit size. However it happens with address registers and unfortunately it's quite common using / manipulating them on the regular 68k code.


Your post is technically wrong.

We'll see it.
Quote:
Fact is: the 68080 does not need a prefix to manipulate the full width of the address registers.
Why do you post stuff like this if you really haven't understood the CPU?

Maybe because YOUR documentation sucks so much and has NOTHING reporting what you said?

You talk about FACTs, but the real fact is that I've already read all the documentation about your Apollo core / 68080 and there's absolutely nothing which even remotely resembles what you stated here.

Here's the main source for your documentation: http://apollo-core.com/index.htm?page=coding&tl=1 Even checking ALL the "tabs", there's NOTHING there about your statement.

I've also read ALL the documentation which A USER (so, NOT you nor anyone from your team) has collected and made available here: http://apollo-core.com/knowledge.php?b=5&e=38530
Specifically, the most interesting piece is a crossover between an Architecture Manual and a Programmer's Manual: http://cdn.discordapp.com/attachments/730698753513750539/883167019581722654/VampireProgrammingGuide2021.docx
But even in this manual, there's NOTHING which could confirm your statement.

On the contrary: for all "A" instructions that work on an address register as a destination, I see something like this:

"The size of the operation may be specified as word or long. Word size source operands are sign extended to 32-bit quantities prior to the addition."

Pay attention to the highlighted parts: it clearly does NOT talk about 64-bit quantities, nor does it say that 16 or 32-bit data are sign-extended to 64-bit.

So, care to PROVE your statement?

And, BTW, could you show me the encodings for the following instructions:

MOVEA.W A0,A1
MOVEA.L A0,A1
MOVEA.Q A0,A1

?
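The documented 32-bit behaviour quoted above ("word size source operands are sign extended to 32-bit quantities") is easy to model; whether the 68080 extends to 64 bits the same way is exactly what's in dispute. A purely illustrative Python sketch:

```python
# Model of the classic 68k MOVEA.W semantics quoted above: a word-size
# source is sign-extended into the full-width address register.

def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Sign-extend a two's-complement value from from_bits to to_bits."""
    v = value & ((1 << from_bits) - 1)      # keep only the source-width bits
    if v & (1 << (from_bits - 1)):          # sign bit set: negative value
        v -= 1 << from_bits
    return v & ((1 << to_bits) - 1)         # wrap into the destination width

# MOVEA.W #$FFFE,A1 -> A1 = $FFFFFFFE (negative word, sign-extended to 32 bits)
assert sign_extend(0xFFFE, 16, 32) == 0xFFFFFFFE
# MOVEA.W #$7FFE,A1 -> A1 = $00007FFE (positive word, unchanged)
assert sign_extend(0x7FFE, 16, 32) == 0x00007FFE
```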
Quote:
A nice fact for all "small code size lovers":

Size matters, as you should know, although as a hardware engineer you might not be interested in that aspect. But it's not your domain, so that's OK.

BTW, for code density what matters is that it's small.
Quote:
The 68080 prefix not only allows using more registers (which is already very good for performance),
it also allows upgrading an instruction from 2-operand form to 3-operand form.

Right, and have you seen anyone here who stated the contrary?
Quote:
This means that you can replace 2 old instructions with 1 new one.
This improves performance and it improves code density,
as one three-operand instruction is faster and needs less space than two old two-operand instructions.

Which is wrong, since most of the time the first instruction is a move, which is 16 bits in size.

So, using the prefix, the total instruction size doesn't change overall. What changes is that you execute one instruction instead of two, but you're already doing that with instruction fusion...
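That byte-count argument can be sketched numerically. The 2-byte base opcodes and the 2-byte prefix below are assumptions taken from this discussion, not figures from official 68080 documentation:

```python
# Byte-count sketch (assumed sizes, not official 68080 encodings): replacing
# a move + 2-operand add pair with one prefixed 3-operand add.

PAIR = [("MOVE.L D1,D0", 2),    # copy the source so it isn't clobbered
        ("ADD.L  D2,D0", 2)]    # classic 2-operand add

FUSED = [("ADD.L D1,D2,D0", 2 + 2)]   # 3-operand form: 2-byte prefix + 2-byte op

pair_bytes = sum(size for _, size in PAIR)
fused_bytes = sum(size for _, size in FUSED)

assert pair_bytes == fused_bytes == 4     # code size: unchanged overall
assert len(FUSED) == len(PAIR) - 1        # executed instructions: one fewer
```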
Quote:
= Yes, 68080 code can have better density than plain 68000 code.

Don't try to shuffle the cards on the table: the 68000 has fewer instructions and addressing modes, so it has lower code density compared to the 68020 and its successors.

You can cheat people who don't understand what you're saying, but not me...
Quote:

Gunnar wrote:
@Karlos

Quote:

Karlos wrote:
@Gunnar

Seriously, what do you even know about it? You were only instrumental in the design and execution of the thing, as if that matters. You don't have multiple years of epic length shitposts on pipeline lengths, cycle counts, cache sizes, instruction sets and the developmental costs for custom ASICs.


True :(



Do you think these "armchair experts" really think all readers are so stupid?
Do you recall how "cdimauro" posted a lot of technical bullshit advertising his "TINA" project not that long ago?

Making nice PowerPoints showing impressive 128-bit buses and a > 500MHz fantasy clock rate.
All of this was pure technical nonsense.

Did he think we'd forget all the bullshit he posted?
I did not forget.

Wrong again. You recalled something, but it's NOT correct, since those specs came directly from the owner of the company behind the TiNA project. And I believed it was possible because, as you should know, I'm NOT a hardware designer. So, NOT my domain.

In fact, I was involved in the project only for designing the architecture (read: NOT the micro-architecture). The chipset specs / registers, specifically.

Do you recall it now, or do you still have memory problems?

BTW, I also don't forget the pile of bullsh*t that you've written. But unlike mine, yours is at least easy to verify. For example, your claim of 100% compatibility with the Amiga, which is still reported on your site.

And I can continue, eh! But you got my point, right?


@Bosanac

Quote:

Bosanac wrote:
@Gunnar

Quote:
True :(

Do you think these "armchair experts" really think all readers are so stupid?
Do you recall how "cdimauro" posted a lot of technical bullshit advertising his "TINA" project not that long ago?

Making nice PowerPoints showing impressive 128-bit buses and a > 500MHz fantasy clock rate.
All of this was pure technical nonsense.

Did he think we'd forget all the bullshit he posted?
I did not forget.


Those who can, do.

Those who can't, shitpost on forums with delusions of grandeur...

And those who are unable to sustain a discussion try to satisfy their wounded ego.

How easy it is to play the psychologist...

BTW you're wrong too. In fact, I did. The chipset that I've designed (most of the details are, again, on Olaf's old forum) runs circles around AAA and SAGA combined.

That's the difference from people who have no vision nor creativity, and just fill some hole for their contingent needs...

Bosanac 
Re: The (Microprocessors) Code Density Hangout
Posted on 26-Sep-2022 16:36:07
#118 ]
Regular Member
Joined: 10-May-2022
Posts: 127
From: Unknown

@cdimauro

Quote:
The chipset that I've designed (most of the details are, again, on Olaf's old forum) runs circles around AAA and SAGA combined.


Yet Gunnar DID and you talked.

Where can I buy your super duper chipset?

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 26-Sep-2022 16:51:40
#119 ]
Elite Member
Joined: 29-Oct-2012
Posts: 2775
From: Germany

@Bosanac

Quote:

Bosanac wrote:
@cdimauro

Quote:
The chipset that I've designed (most of the details are, again, on Olaf's old forum) runs circles around AAA and SAGA combined.


Yet Gunnar DID and you talked.

Where can I buy your super duper chipset?

It's quite evident that you do NOT read:

I'm NOT a hardware designer

Regarding my domain, I did MY part: designing the chipset.

Is that clear to you, or should I draw you a picture?

cdimauro 
Re: The (Microprocessors) Code Density Hangout
Posted on 26-Sep-2022 17:56:30
#120 ]
Elite Member
Joined: 29-Oct-2012
Posts: 2775
From: Germany

@matthey

Quote:

matthey wrote:

I made a spreadsheet of Dr. Vince Weaver's code density competition of size optimized code (like compiling with Os optimization).

https://docs.google.com/spreadsheets/d/e/2PACX-1vTyfDPXIN6i4thorNXak5hlP0FQpqpZFk2sgauXgYZPdtJX7FvVgabfbpCtHTkp5Yo9ai6MhiQqhgyG/pubhtml?gid=909588979&single=true

For the Linux Logo executable which is common integer code, the percentage of branches for various architectures comes out to something like the following.

68k - 29% of instructions are branches
Thumb2 - 22%
RISCV32IMC - 28%
Thumb1 - 21%
RISCV64IMC - 28%
AArch64 - 26%
ARM EABI - 26%
PowerPC - 26%
SH-3 - 23%
SPARC - 24%
x86 - 27%
MIPS - 27%
x86-64 - 26%

The branch rule of thumb seems to apply more to ARM Thumb. The PowerPC Compiler Writer's Guide gives 22.1% branches for integer code, where this small program has 26% branches, which seems reasonable enough. I believe the 68k branch percentage is explained by the 68k needing so few instructions, leaving branch instructions as a higher percentage. Compressed RISC encodings significantly increase the number of instructions to obtain their code density, so the percentage of branches decreases. This is particularly obvious for Thumb and SH-3. One study found the instruction count of Thumb code was increased by 30%.

Efficient Use of Invisible Registers in Thumb Code
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.208&rep=rep1&type=pdf Quote:

More than 98% of all microprocessors are used in embedded products, the most popular 32-bit processors among them being the ARM family of embedded processors. The ARM processor core is used both as a macrocell in building application specific system chips and standard processor chips. In the embedded domain, in addition to having good performance, applications must execute under constraints of limited memory. ARM supports dual width ISAs that are simple to implement and provide a tradeoff between code size and performance. In prior work we studied the characteristics of ARM and Thumb code and showed that for some embedded applications the Thumb code size was 29.8% to 32.5% smaller than the corresponding ARM code size. However, it was also observed that there was an increase in instruction counts for Thumb code which was typically around 30%. We studied the instruction sets and then compared the Thumb and ARM code versions to identify the causes of performance loss. The reasons we identified fall into two categories: Global inefficiency - Global inefficiency arises due to the fact that only half of the register file is visible to most instructions in Thumb code. Peephole inefficiency - Peephole inefficiency arises because pairs of Thumb instructions are required to perform the same task that can be performed by individual ARM instructions.


Thumb2 was a big improvement over Thumb and SuperH 16 bit only encodings which didn't have enough room for immediates, displacements and GP register fields. Supporting 16 and 32 bit encodings allowed nearly 16 GP registers and larger immediates and displacements but instruction counts were still elevated compared to fixed 32 bit RISC encodings (even compared to the classic ARM 32 bit ISA also with ~16 GP registers).

I finally had the time to read the above paper. It only applies to Thumb (the first one). In fact, Thumb-2 solved all the issues reported in that paper, and in a much more elegant way.

So the 30% instruction count increase doesn't apply anymore; it's much reduced (compared to the original ARM).
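As an aside, the branch-percentage figures quoted from matthey's table above are easy to sanity-check mechanically, and doing so makes the density/branch-share correlation visible at the extremes:

```python
# Sanity check over the branch percentages quoted from the table above.

branch_pct = {
    "68k": 29, "Thumb2": 22, "RISCV32IMC": 28, "Thumb1": 21,
    "RISCV64IMC": 28, "AArch64": 26, "ARM EABI": 26, "PowerPC": 26,
    "SH-3": 23, "SPARC": 24, "x86": 27, "MIPS": 27, "x86-64": 26,
}

mean = sum(branch_pct.values()) / len(branch_pct)

# The dense 68k sits at the top; the compressed-RISC Thumb variants, which
# need more (simpler) instructions for the same work, sit at the bottom.
assert max(branch_pct, key=branch_pct.get) == "68k"
assert min(branch_pct, key=branch_pct.get) == "Thumb1"
assert round(mean, 1) == 25.6
```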

