cdimauro 
NEx64T - #2: opcodes structure for simple decoding
Posted on 22-Nov-2023 17:55:53
#1 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

Second article of the series.
Let us now examine one of the main advantages (compared to x86/x64) of the new NEx64T architecture: the simplicity of decoding.
English: https://www.appuntidigitali.it/20905/nex64t-opcodes-structure-for-simple-decoding/
Italian: https://www.appuntidigitali.it/20831/nex64t-struttura-degli-opcode-per-una-decodifica-semplice/

matthey 
Re: NEx64T - #2: opcodes structure for simple decoding
Posted on 26-Nov-2023 4:01:38
#2 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2052
From: Kansas

@cdimauro
I finally read through your article. I had trouble staying awake during the long x86-64 encoding explanation. If you sit people down and fully explain x86-64 decoding and NEx64T decoding before voting on which they like best, I'm fairly certain NEx64T would come out on top. It's nice that decoding can be accomplished from the first 16 bits. With a 16 bit variable length encoding, I doubt less than 16 bits would ever be fetched. An 8 bit variable length encoding allows for a marginally smaller design fetching 8 bits at a time but that is done about as often as cores that perform byte by byte adds of data in little endian order. The history and pre-history of x86 go back that far with compatibility maintained except for one partial break while maintaining source level compatibility. NEx64T would be a similar source compatible break which should have occurred with x86-64. I would say NEx64T has good prospects but it requires a different hardware design, somewhere between an x86-64 core and 68k core design but perhaps more like that of a 68k core.
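
(As a purely illustrative aside, here is a minimal C sketch of what "length known from the first 16 bits" buys a decoder. The 2-bit length field and all the constants are invented for the example; they are not NEx64T's real format.)

#include <stdint.h>
#include <stdio.h>

/* Hypothetical 16-bit VLE: assume bits 15..14 of the first word give the
   total instruction length in 16-bit words (1..4). */
static unsigned insn_length_words(uint16_t first_word) {
    return ((first_word >> 14) & 0x3u) + 1u;
}

int main(void) {
    /* Fake instruction stream: lengths 1, 2 and 1 words under the rule above. */
    uint16_t code[] = { 0x1234, 0x5000, 0xBEEF, 0x0001 };
    size_t n = sizeof code / sizeof code[0];
    for (size_t pc = 0; pc < n; ) {
        unsigned len = insn_length_words(code[pc]);
        printf("instruction at word %zu is %u word(s) long\n", pc, len);
        pc += len; /* the start of the next instruction is known right away */
    }
    return 0;
}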

Karlos 
Re: NEx64T - #2: opcodes structure for simple decoding
Posted on 26-Nov-2023 13:33:14
#3 ]
Elite Member
Joined: 24-Aug-2003
Posts: 4415
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

Ultimately, what is the purpose of this? You have an assembler source level compatible x64 replacement ISA, but it isn't object code compatible.

I do have to wonder what the purpose of a more efficiently encoded, but ultimately binary incompatible ISA is. Almost all maintained code for x64 is in C or higher level languages, so they can be recompiled for almost anything already. I have some x64 code in C++ that makes use of AVX, but even that is realised through intrinsics, implying that the operations could be remapped to any other sequence of instructions that achieve the same end result.

Other than scratching the nerd itch (which I totally understand as motivation enough), what is the purpose of creating an instruction set layout and showing it to us, rather than showing it to a manufacturer or a bunch of venture capitalists who can actually pick up your ball and run with it? Could it be that there's no appetite for a more efficient encoding of the x64 instruction set when there's way more potential in creating a better instruction set in the first place? One that isn't likely going to end up with intel burying it?

Last edited by Karlos on 26-Nov-2023 at 01:45 PM.

_________________
Doing stupid things for fun...

cdimauro 
Re: NEx64T - #2: opcodes structure for simple decoding
Posted on 27-Nov-2023 6:01:44
#4 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey

Quote:

matthey wrote:
@cdimauro
I finally read through your article. I had trouble staying awake during the long x86-64 encoding explanation. If you sit people down and fully explain x86-64 decoding and NEx64T decoding before voting on which they like best, I'm fairly certain NEx64T would come out on top.

We know how x86/x64 encoding sucks a lot and how complicated it is (the only good thing is that it has to deal with 8 bit opcodes): it requires some time to "digest it".
Quote:
It's nice that decoding can be accomplished from the first 16 bits. With a 16 bit variable length encoding, I doubt less than 16 bits would ever be fetched.

Do you mean for decoding?
Quote:
An 8 bit variable length encoding allows for a marginally smaller design fetching 8 bits at a time but that is done about as often as cores that perform byte by byte adds of data in little endian order.

Being little-endian was a nightmare for the opcode design: it's "unnatural" (for me, being a human being). In fact, the opcode design was big-endian on the first NEx64T version.

However, I had to switch to little-endian at a certain point in time to be coherent with how the architecture worked.
Quote:
The history and pre-history of x86 go back that far with compatibility maintained except for one partial break while maintaining source level compatibility. NEx64T would be a similar source compatible break which should have occurred with x86-64.

Exactly: that's one of the main points which I wanted to highlight with my ISA.
Quote:
I would say NEx64T has good prospects but it requires a different hardware design, somewhere between an x86-64 core and 68k core design but perhaps more like that of a 68k core.

Actually it's a hybrid.

For the opcode structure and for some decisions I've borrowed important things from 68k. This primarily affects the processor's frontend.

Everything else is mostly related to x86/x64 and, of course, this means the backend.

A NEx64T implementation can take any x86/x64 microarchitecture and apply some changes, recycling a lot from it (which might look strange). This is because the frontend will be greatly simplified, despite working with 16 bits as the "base" for the opcodes instead of 8 bits.


@Karlos

Quote:

Karlos wrote:
Ultimately, what is the purpose of this?

First, fun. That's how NEx64T was born.

Just like the 64-bit spiritual successor to the 68k which I designed before it, and the spiritual successor to NEx64T which I've designed as well.
Quote:
You have an assembler source level compatible x64 replacement ISA, but it isn't object code compatible.

It's much better than x64 from this PoV, because x64 wasn't even fully assembly source level compatible with x86. Plus, that complicated the architecture design & implementation.
Quote:
I do have to wonder what the purpose of a more efficiently encoded, but ultimately binary incompatible ISA is. Almost all maintained code for x64 is in C or higher level languages, so they can be recompiled for almost anything already. I have some x64 code in C++ that makes use of AVX, but even that is realised through intrinsics, implying that the operations could be remapped to any other sequence of instructions that achieve the same end result.

The reason is that this architecture:
- can be implemented in a much simpler way, requiring far fewer transistors;
- gives fewer headaches when testing it;
- has a much better code density (especially for 64-bit code), so it can be equipped with smaller caches (saving transistors & power consumption);
- requires fewer instructions to be executed.

And being 100% assembly source level compatible makes it straightforward to get new binaries for it (much better than x64 from this PoV).
Quote:
Other than scratching the nerd itch (which I totally understand as motivation enough), what is the purpose of creating an instruction set layout and showing it to us, rather than showing it to a manufacturer or a bunch of venture capitalists who can actually pick up your ball and run with it? Could it be that there's no appetite for a more efficient encoding of the x64 instruction set when there's way more potential in creating a better instruction set in the first place? One that isn't likely going to end up with intel burying it?

Approaching venture capitalists with nothing better than a paper design doesn't appeal to them, and I've already tried that route (with the former Be Inc. CEO).

It is very difficult for a new ISA to gain consensus / support unless it offers great value.

But, as I've said, just a design is not enough. That's why I'm investing some time in writing an assembler (and then an emulator): it starts to show something concrete, which might get things moving.

Karlos 
Re: NEx64T - #2: opcodes structure for simple decoding
Posted on 27-Nov-2023 10:28:51
#5 ]
Elite Member
Joined: 24-Aug-2003
Posts: 4415
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@cdimauro

Maybe I'm not seeing the obvious but I don't understand how being assembly language compatible is particularly useful, at least here, in 2023. It sounds like you have a solution looking for a problem.

By the time x64 was released, assembler was well and truly being left behind on the "PC" platform for development, in favour of ever higher level programming languages. Today's compilers don't typically have an assembler stage in them; they compile from source to object code directly (well, they may use an abstract RTL in the middle) and only generate assembler when asked to do so, because you want a visual representation of the code they produce to study or to diagnose bugs in the compiler itself.

Apple have now run OSX on PPC, intel and ARM. They can do this because all of the software they need to port between those architectures is high level compiled code, and whatever assembler there is, is restricted to some very low level operations that need a rewrite.

The only use case I can see for being assembly compatible in 2023 is if you intend to disassemble old binaries and reassemble them because the original source is missing, but for everything else you are going to have C code or higher to work with, so worrying about assembly language compatibility would seem to be moot. And as you yourself say, x64 wasn't totally assembly source compatible with x86. Yet the world survived this issue and moved on.

I am sure the engineers at intel and AMD know the existing instruction decode is not as efficient as it could be but they made that specific trade off for the sake of retaining object code compatibility with previous generations. But as theoretically complex as the decode is, in practice it's self-evidently much simpler. After all, x64 processors aren't exactly slow and what actual percentage of the total die transistor count is the instruction decode stage?

It seems obvious to me that you want to have the mantle of "x64 compatible" here, but in reality you aren't - to be x64 compatible it needs to run x64 object code as-is. You have a solution that is to x64 what MC64K (almost - differences in design and intent are duly noted) is to 68K. The main difference there is that MC64K is purely for fun and purely so that you can write assembly language and make pixels flash on screen with it. It has no grandiose ideas of being an actual CPU or object code compatible. And yes, I know MC64K isn't actually assembler compatible with 68K; the key point is that it's not object code compatible.

If you have a simple design that requires a lot less logic but still requires compilers to be updated to support it anyway, why not go all the way and just design a whole architecture instead of trying to fix something with x64 that existing vendors don't seem to see as an issue?

Last edited by Karlos on 27-Nov-2023 at 10:52 AM.
Last edited by Karlos on 27-Nov-2023 at 10:50 AM.

_________________
Doing stupid things for fun...

cdimauro 
Re: NEx64T - #2: opcodes structure for simple decoding
Posted on 27-Nov-2023 21:57:36
#6 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Karlos

Quote:

Karlos wrote:
@cdimauro

Maybe I'm not seeing the obvious but I don't understand how being assembly language compatible is particularly useful, at least here, in 2023. It sounds like you have a solution looking for a problem.

I'll show you why it's still important.
Quote:
By the time x64 was released, assembler was well and truly being left behind on the "PC" platform for development, in favour of ever higher level programming languages. Today's compilers don't typically have an assembler stage in them; they compile from source to object code directly (well, they may use an abstract RTL in the middle) and only generate assembler when asked to do so, because you want a visual representation of the code they produce to study or to diagnose bugs in the compiler itself.

No, it's very common for compilers to generate assembly code and then leave to assemblers the burden of generating the final object code. A couple of relevant examples (besides VBCC) are GCC and Clang/LLVM.

GCC: https://gcc.gnu.org/install/configure.html
--with-as=pathname
Specify that the compiler should use the assembler pointed to by pathname, rather than the one found by the standard rules to find an assembler, which are:
[...]You may want to use --with-as if no assembler is installed in the directories listed above, or if you have multiple assemblers installed and want to choose one that is not found by the above rules.


CLang: https://clang.llvm.org/docs/Toolchain.html#assembler
Clang can either use LLVM’s integrated assembler or an external system-specific tool (for instance, the GNU Assembler on GNU OSes) to produce machine code from assembly. By default, Clang uses LLVM’s integrated assembler on all targets where it is supported. If you wish to use the system assembler instead, use the -fno-integrated-as option.

Those two are the reason why, after you mentioned it again, I've decided to start writing an assembler for NEx64T: I can take advantage of generating x86/x64 code from the two most used compilers and then directly assemble it for NEx64T.

I only use a part of my architecture (because it's a superset of x86/x64), but that should be enough to define the upper limit (in terms of code size and executed instructions), i.e. the benchmark reference.
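
For example, with the usual GNU-style toolchains the two steps can be split explicitly (file names are just placeholders):

gcc -S -O2 foo.c -o foo.s          # stop after generating assembly
as foo.s -o foo.o                  # assemble it separately with GNU as
clang -fno-integrated-as -c foo.c  # let Clang invoke the external system assembler

The idea is to intercept the generated x86/x64 assembly at that point and feed it to a different assembler.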
Quote:
Apple have now run OSX on PPC, intel and ARM. They can do this because all of the software they need to port between those architectures is high level compiled code, and whatever assembler there is, is restricted to some very low level operations that need a rewrite.

Sure, there's plenty of low-level stuff written in assembly. Usual stuff:
https://sourceware.org/git/?p=glibc.git&a=search&h=HEAD&st=commit&s=memcpy
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/i386/i686/memcpy.S;h=b86af4aac9995d7f8220690717f5b2661f631212;hb=c73c96a4a1af1326df7f96eec58209e1e04066d8

But not only that: it might surprise you that new, modern code for the latest ISA extensions is still being written. For example:
https://github.com/OpenMathLib/OpenBLAS/commit/e3368cbf1881cdbe65ccc20803f5596bab2b4c08

Intrinsics are much easier, of course, but assembly is still being used.
Quote:
The only use case I can see for being assembly compatible in 2023 is if you intend to disassemble old binaries and reassemble them because the original source is missing, but for everything else you are going to have C code or higher to work with, so worrying about assembly language compatibility would seem to be moot.

I agree that nowadays assembly is rarely used, but see above: it's still there, and new code is written for it...
Quote:
And as you yourself say, x64 wasn't totally assembly source compatible with x86. Yet the world survived this issue and moved on.

But it took a while for software to be ported to x64 (and the same for AArch64, which is even less compatible with ARM32/Thumb).
Quote:
I am sure the engineers at intel and AMD know the existing instruction decode is not as efficient as it could be but they made that specific trade off for the sake of retaining object code compatibility with previous generations.

Well, x64 was NOT object-code compatible with x86 (only partially).
Quote:
But as theoretically complex as the decode is, in practice it's self-evidently much simpler. After all, x64 processors aren't exactly slow and what actual percentage of the total die transistor count is the instruction decode stage?

A recent example: https://www.theregister.co.uk/2012/03/06/intel_xeon_2600_server_chip_launch/
The instruction decoder and the microcode ROM occupy a non-negligible area of each core. For example, on the Intel Xeon E5-2600 those components occupy approximately 17% of the area of each core

And in the past it was even worse: the decoder took about 30% (if I recall correctly) of the Pentium core and even 40% of the Pentium Pro core.

Isn't it a considerable piece of the cake?
Quote:
It seems obvious to me that you want to have the mantle of "x64 compatible" here, but in reality you aren't - to be x64 compatible it needs to run x64 object code as-is.

I know, but assembly-level is enough. Being binary-compatible doesn't make sense, because nothing changes here and we already know the situation.
Quote:
You have a solution that is to x64 what MC64K (almost - differences in design and intent are duly noted) is to 68K. The main difference there is that MC64K is purely for fun and purely so that you can write assembly language and make pixels flash on screen with it. It has no grandiose ideas of being an actual CPU or object code compatible. And yes, I know MC64K isn't actually assembler compatible with 68K; the key point is that it's not object code compatible.

MC64K is like my C64K (the 68k's spiritual 64-bit successor, which I designed 13 years ago).

However, designing new architectures without many constraints is fun, but the audience / usage is limited (unless you have a cool idea which is a game changer in some area).

With NEx64T I wanted to raise the bar and enforce a heavy constraint which made the challenge much harder but more fulfilling to me.
Quote:
If you have a simple design that requires a lot less logic but still requires compilers to be updated to support it anyway, why not go all the way and just design a whole architecture instead of trying to fix something with x64 that existing vendors don't seem to see as an issue?

Maybe because... NEx64T is effectively a whole new architecture? No joke: it's really new and has very, very little in common with x86/x64.

It resembles the latter because it's a superset from a certain PoV, but the opcode structure isn't the only thing which is very different. You'll see this in some other articles.

BTW, I've already designed another ISA which takes the most from NEx64T (especially the SIMD/vector extension), but it's not 100% assembly-level compatible with x86/x64, because I wanted to remove all the legacy which was still left and focus on a modern and "clean" CISC ISA.

Basically it's the equivalent of RISC-V from this PoV: a CISC designed with the experience of past CISC ISAs, but without any legacy burden to carry.

This way I was able to express and push other ideas which had been floating around in my mind for some time (completely removing the MOVZX/MOVSX instructions, for example: they aren't needed anymore because this functionality is already there, "for free", in the basic instructions).

matthey 
Re: NEx64T - #2: opcodes structure for simple decoding
Posted on 29-Nov-2023 2:41:12
#7 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2052
From: Kansas

cdimauro Quote:

We know how x86/x64 encoding sucks a lot and how complicated it is (the only good thing is that it has to deal with 8 bit opcodes): it requires some time to "digest it".

...

Do you mean for decoding?


Yes. It would be possible to decode a byte at a time for a 16 bit VLE but without an instruction buffer in a tiny core design, no instruction could be executed or transferred to the next stage from the decoded data in that cycle. Even the x86 ISA gains little value from being able to decode a byte at a time as only a few instructions are 1 byte in size. It's not worth it to deal with additional alignment overhead especially when no instruction should be a single byte as they waste too much encoding space. A very simple or specialized ISA may have enough 8 bit instructions to make 8 bit decoding worthwhile.

cdimauro Quote:

Being little-endian was a nightmare for the opcode design: it's "unnatural" (for me, being a human being). In fact, the opcode design was big-endian on the first NEx64T version.

However, I had to switch to little-endian at a certain point in time to be coherent with how the architecture worked.


I agree that little endian is unnatural. Starting at the least significant side is fine but it should start at the least significant bit not byte.

%00000001 // big endian #1
%00000001 // little endian #1
%10000000 // true little endian #1

Big endian bits are right to left and bytes are right to left
Little endian bits are right to left but bytes are left to right.
True little endian bits are left to right and bytes are left to right.

True little endian is logical, consistent and natural working with any datatype size from 1 bit to infinity while little endian is a datatype specific bastard ordering. The little endian choice of a byte is arbitrary because of the 8 bit ALU but what if the first little endian ALU had been 4 bits (a nibble like the Intel 4004 ancestor of Intel 8 bit CPUs but byte addressing at minimum) or 16 bits? This is a problem for little endian but not for true little endian. What uses true little endian? The 68k bitfield instructions in both memory and registers.

cdimauro Quote:

A recent example: https://www.theregister.co.uk/2012/03/06/intel_xeon_2600_server_chip_launch/
The instruction decoder and the microcode ROM occupy a non-negligible area of each core. For example, on the Intel Xeon E5-2600 those components occupy approximately 17% of the area of each core

And in the past it was even worse: the decoder took about 30% (if I recall correctly) of the Pentium core and even 40% of the Pentium Pro core.

Isn't it a considerable piece of the cake?


Is the following article where the 30% "decoder" figure for the Pentium came from?

https://websrv.cecs.uci.edu/~papers/mpr/MPR/ARTICLES/070402.pdf Quote:

Intel estimates about 30% of the transistors were devoted to compatibility with the x86 architecture. Much of this overhead is probably in the microcode ROM, instruction decode and control unit, and the adders in the two address generators, but there are other effects of the complex instruction set. For example, the higher frequency of memory references in x86 programs compared to RISC code led to the implementation of the dual-access data cache.


30% of 3.1 million transistors = 930,000 transistors

Most of the 30% of transistors would be decoder related, except the "two address generators". The author treats them as a CISC disadvantage made necessary by the x86 "frequency of memory references", as if any transistors beyond a traditional RISC pipeline were a waste. The pipeline design did help the Pentium overcome the very high memory traffic, but it also improved the SiFive U74 RISC-V core.

There is a die photo with unit labeling in figure 2 showing the Pentium instruction fetch, decode, instruction support and control logic taking about 20% of the chip area. Other x86 compatibility must be spread out in other units but it doesn't look like the decoder itself is 30% by area (transistors may be more densely packed in areas though). We can compare this to the 68060 Microprocessor Report which also has a die shot with unit labeling.

Motorola Introduces Heir to 68000 Line, Multi-issue 68060 Maintains Upward Compatibility
https://websrv.cecs.uci.edu/~papers/mpr/MPR/ARTICLES/080502.pdf

On the 68060, the instruction fetch unit, instruction buffer and pipeline control look like they take about 15% of the area.

68060 15% of 2.5 million transistors = 375,000 transistors
Pentium 20% of 3.1 million transistors = 620,000 transistors

This is a very rough estimate of the instruction fetch, decode and dispatch/issue overhead but it looks like the 68060 probably has less overhead.

Last edited by matthey on 29-Nov-2023 at 02:59 PM.
Last edited by matthey on 29-Nov-2023 at 09:50 AM.
Last edited by matthey on 29-Nov-2023 at 09:25 AM.
Last edited by matthey on 29-Nov-2023 at 04:00 AM.

Karlos 
Re: NEx64T - #2: opcodes structure for simple decoding
Posted on 29-Nov-2023 12:39:55
#8 ]
Elite Member
Joined: 24-Aug-2003
Posts: 4415
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@matthey

Machines tend to be byte addressable, not bit addressable. Little endian is perfectly fine for addressing RAM and it makes sense in relation to performing different sized operations on registers. Hell, even the 68K does this: a byte sized operation on a register affects the least significant 8 bits of the whole 32-bit register; likewise a word operation affects the least significant 16 bits.

Enumeration of bit positions within a word of some given size is a different area of concern and should not be conflated with it. Deciding that the least significant bit should be the leftmost is nonsense. PPC did this and it's nonsense. The concepts of shift/rotate directions are essentially universal across architectures and these happen on words of fixed sizes. Machines don't work with arbitrary sized bitfields. Even on machines that support the concept, like the 020+, you are modifying bits in a word of some given size.

Binary bit enumeration is one wheel with a lot of traction. You might say it's a tractor wheel that doesn't need reinventing.

Last edited by Karlos on 29-Nov-2023 at 01:10 PM.

_________________
Doing stupid things for fun...

NutsAboutAmiga 
Re: NEx64T - #2: opcodes structure for simple decoding
Posted on 29-Nov-2023 18:07:55
#9 ]
Elite Member
Joined: 9-Jun-2004
Posts: 12832
From: Norway

@Karlos

When you read a number, you start with the largest part.

For example:

One thousand, one hundred, and eleven

Of course, it's natural to start with the highest bit. You read bits like that, and you read hex numbers that way too: it makes sense that the first byte is the high byte.

In any case, Intel does not put the highest bit in the lowest position: bit 7 is always on the right side in a byte, and it's the same in BE or LE. It's only the byte positions that are swapped, not the bits.

It also makes sense that address 0 is the first byte: when you add RAM, you get more storage space,

and you naturally expect it to be at the end, like pages in a book. You don't start with the last page and read backwards.

Last edited by NutsAboutAmiga on 29-Nov-2023 at 06:15 PM.
Last edited by NutsAboutAmiga on 29-Nov-2023 at 06:14 PM.
Last edited by NutsAboutAmiga on 29-Nov-2023 at 06:13 PM.
Last edited by NutsAboutAmiga on 29-Nov-2023 at 06:10 PM.

_________________
http://lifeofliveforit.blogspot.no/
Facebook::LiveForIt Software for AmigaOS

Karlos 
Re: NEx64T - #2: opcodes structure for simple decoding
Posted on 29-Nov-2023 18:31:56
#10 ]
Elite Member
Joined: 24-Aug-2003
Posts: 4415
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@NutsAboutAmiga

It doesn't make the slightest difference how I or any other human or extraterrestrial intelligence reads a number. Computers operate on binary strings that may or may not represent a number. The machine does not give a crap about human sensibilities here. A 64 bit integer is a string of 8 bytes. What order that string is stored and retrieved in is an implementation detail specified by the architecture of that machine.

Once the byte string is in a register (named or temporary) and you are about to perform an arithmetic operation on it, then and only then does it matter which way around those bytes are handled, because each one represents a discrete octet of place values in a larger structure.

Do not try to present big (or little) as objectively better. They are just different.

Quote:
And you naturally expect it to be at the end, like pages in a book. You don't start with the last page and read backwards


Beginning and end are relative, like left and right. Are you suggesting that reading a book by turning the page to the left as you increase the page number is the "natural" way? That's positively anti-Semitic (language).

Last edited by Karlos on 29-Nov-2023 at 06:37 PM.

_________________
Doing stupid things for fun...

matthey 
Re: NEx64T - #2: opcodes structure for simple decoding
Posted on 29-Nov-2023 20:03:29
#11 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2052
From: Kansas

@Karlos
"Natural" is easier for humans and more logical which is important. We were biased by universal writing of numbers in big endian style starting with the most significant digits. This is the way the 4 bit Intel 4004 CPU started but then for the 8 bit 8008 they saw an advantage to fetching the low order bytes first so they could start a multi-byte add earlier even though this was inconsistent with the bit ordering which they did not bother turning around. It was an implementation and datatype specific optimization that makes little endian a bastard format.

big endian 16 bit order: 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
little endian 16 bit order: 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8
true little endian 16 bit order: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15

True little endian is the most natural although foreign to the big endian way we write numbers. Compare adding 2 numbers.

True little endian add: Start at the first bit of both numbers and add them together with a carry to the next bit to the right until all digits and carries of the numbers are exhausted.

Big endian add: Start at the last bit of both numbers and add them together with a carry to the left until all digits and carries are exhausted.

Little endian add: You describe the inconsistent and diabolical mess!

Human programmers program computers, even if the computer can hide nasty implementation details and inconsistencies caused by architects' mistakes. I'm open minded enough to see that true little endian has advantages over big endian, but little endian is an unnatural bastard format.

Karlos 
Re: NEx64T - #2: opcodes structure for simple decoding
Posted on 29-Nov-2023 20:45:18
#12 ]
Elite Member
Joined: 24-Aug-2003
Posts: 4415
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@matthey

Jeez. Where to begin here?

If you are finding it difficult to write/read numbers in your code because of endian layout you are using the wrong language, tooling or both.

Look, the 32-bit integer 0xABADCAFE is the same on big and little endian CPUs at the point where you are going to use it as a 32-bit integer. Who gives a rat's arse which way around the bytes are stored? It only matters when converting data from one scheme to the other. Even the most rudimentary debuggers let me see the big and little endian interpretation of a value at any given address should I need to see them, as do tools like ghex when interrogating a data file.

If I want to express that 0xABADCAFE 32 bit immediate in C (and any number of other high level languages), it's 0xABADCAFE, regardless of whether it's a big or little endian machine. Even in assembler, it's $ABADCAFE. It's never 0xFECAADAB, unless I look at it in memory on a little endian system. Which will be in a debugger, which will show it as ABADCAFE still because I've told it to interpret 4 byte strings as 32-bit integers on that little endian machine.

The only time I've seen it done differently is in code compiled for Amithlon's big endian memory model, where the *compiler* may take immediate values expressed one way in the code and output assembler with them expressed in the reverse byte order, for efficiency's sake. Entirely understandable, given how it works.

Now, regarding the adding of bits, first of all, computers haven't added digits one bit at a time for donkey's years. Secondly, when you have an integer value in your ALU, which is where you do the actual arithmetic operation, all the bits are (probably*) in place value order. Even your most hardcore direct memory operand supporting CISC processor has to load your memory operand into the bloody ALU in order to perform an operation on it.

*Given the fundamentally bit-parallel nature of modern ALUs, who knows? Or even cares, for that matter?

How much code have you written where you actually stumble into endian issues? Unless you are doing different sized operations on the same memory address (like C unions), interacting with hardware (e.g. raw network buffers are big endian), or handling data in a fixed layout that's different from your native CPU's, you will never encounter it.
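
To make the union case concrete, here's a minimal C sketch (the value and names are made up; it's just an illustration):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    union { uint32_t u32; uint8_t bytes[4]; } v;
    v.u32 = 0xABADCAFEu;
    /* Prints "FE CA AD AB" on a little endian host,
       "AB AD CA FE" on a big endian one. */
    for (int i = 0; i < 4; i++)
        printf("%02X ", v.bytes[i]);
    printf("\n");
    return 0;
}

That's about the only everyday situation where the stored byte order leaks into your code.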

The amount of nonsense crap people gush about endianness is unbelievable. I can count the number of endian bugs I've had on the fingers of one hand, and that includes working on a virtual machine that runs on both big and little endian hosts.

_________________
Doing stupid things for fun...

matthey 
Re: NEx64T - #2: opcodes structure for simple decoding
Posted on 29-Nov-2023 23:51:17
#13 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2052
From: Kansas

Karlos Quote:

Jeez. Where to begin here?

If you are finding it difficult to write/read numbers in your code because of endian layout you are using the wrong language, tooling or both.

Look, the 32-bit integer 0xABADCAFE is the same on big and little endian CPUs at the point where you are going to use it as a 32-bit integer. Who gives a rat's arse which way around the bytes are stored? It only matters when converting data from one scheme to the other. Even the most rudimentary debuggers let me see the big and little endian interpretation of a value at any given address should I need to see them, as do tools like ghex when interrogating a data file.

If I want to express that 0xABADCAFE 32 bit immediate in C (and any number of other high level languages), it's 0xABADCAFE, regardless of whether it's a big or little endian machine. Even in assembler, it's $ABADCAFE. It's never 0xFECAADAB, unless I look at it in memory on a little endian system. Which will be in a debugger, which will show it as ABADCAFE still because I've told it to interpret 4 byte strings as 32-bit integers on that little endian machine.


I understand the way little endian is wired and swaps to big endian in registers. The byte order in memory is different from the bit order, and the datatype size needs to be known to access it.

big endian
least significant bit of a datatype is higher in memory
least significant byte of a datatype is higher in memory
least significant bit of a datatype is higher in a register
least significant byte of a datatype is higher in a register

little endian
least significant bit of a datatype is higher in memory
least significant byte of a datatype is lower in memory
least significant bit of a datatype is higher in a register
least significant byte of a datatype is higher in a register

true little endian
least significant bit of a datatype is lower in memory
least significant byte of a datatype is lower in memory
least significant bit of a datatype is lower in a register
least significant byte of a datatype is lower in a register

Little endian is a pseudo big endian ordering with an architecture design optimization that only gives a benefit on the tiniest of core designs yet introduces unnatural behavior even though most of it is hidden from the programmer. It is more natural to write...

move.l #$ABADCAFE,(a0)

and see it in memory as well as write...

move.b #$AB,(a0)+
move.b #$AD,(a0)+
move.b #$CA,(a0)+
move.b #$FE,(a0)+

and have it be the same without it being dependent on the datatype.

True little endian is likely better than big endian, as the least significant bit and byte are lower in memory and the most significant are higher in memory, which is more natural than big endian but also consistent for both bits and bytes, unlike bastard little endian. It also has the advantage that partial calculations can start earlier, without seeking to the end of a large datatype in memory and calculating backwards. The 68k bitfield instruction designers understood the advantage of treating memory as a continuous bitfield or bit stream from lowest to highest memory and lowest to highest significant bits, which is not possible with bastard little endian inconsistency and is illogically backwards for big endian (any datatype can be selected by bit offset and length in memory). The BTST, BSET, BCLR and BCHG instructions use big endian bit numbering, which works for a defined datatype size but is confusing for a bitfield or bit stream.

Karlos Quote:

How much code have you written where you actually stumble into endian issues? Unless you are doing different sized operations on the same memory address (like C unions), interacting with hardware (e.g. raw network buffers are big endian), or handling data in a fixed layout that's different from your native CPU's, you will never encounter it.

The amount of nonsense crap people gush about endianness is unbelievable. I can count the number of endian bugs I've had on the fingers of one hand, and that includes working on a virtual machine that runs on both big and little endian hosts.


Little endian definitely works, as most computers use it now with minimal problems. There are some places where it even has an advantage. That doesn't stop me from considering it an unnatural and inconsistent bastard. Call me a purist if you like, but I prefer big endian over bastard little endian. I wish people would recognize that it is not true little endian, which is likely the most natural bit and byte ordering of all.

Last edited by matthey on 29-Nov-2023 at 11:57 PM.
Last edited by matthey on 29-Nov-2023 at 11:51 PM.

Karlos 
Re: NEx64T - #2: opcodes structure for simple decoding
Posted on 30-Nov-2023 0:33:15
#14 ]
Elite Member
Joined: 24-Aug-2003
Posts: 4415
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@matthey

There is no "true little endian", because no manufacturers ever adopted the idea of mirror writing bits that way. Why? Because it makes no bloody sense on a machine that is byte addressable. You just regard any entity bigger than a byte as a (2, 4, 8, 16, 32 etc length) string of bytes. You don't address individual bits, so the idea that a bit is in a lower memory location than another one in the same byte is a non-sequitur.

Even on PowerPC, which actually defines register bit positions in the *reverse* order (bit 0 is the most significant and bit 31/63 the least), operations that don't enumerate specific bit positions still work the way you'd expect on any sensible system.

You seem fixated on something that is absolutely nonsensical for a person that claims to be so well versed in machine architecture. Inside any real CPU, there's no complicated bit swapping and byte munging going on. You want to read a string of bytes into a register and the signals representing the data are routed such that the register contains the value as a congruent 16, 32 or 64 bit value for the operand you loaded. There's no particular advantage to either scheme, really, because a value in a register isn't big or little endian. It's a pattern of bits that only assumes a meaning implied by the operation you are performing on it.

Last edited by Karlos on 30-Nov-2023 at 12:46 AM.
Last edited by Karlos on 30-Nov-2023 at 12:35 AM.
Last edited by Karlos on 30-Nov-2023 at 12:34 AM.

_________________
Doing stupid things for fun...

Karlos 
Re: NEx64T - #2: opcodes structure for simple decoding
Posted on 30-Nov-2023 0:43:24
#15 ]
Elite Member
Joined: 24-Aug-2003
Posts: 4415
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

Even on some absolutely terrible hypothetical machine that has 64 bit register operations only but is attached by an 8 bit bus to little endian arranged memory, you'd still have no issues. Just read each byte into the register in the appropriate direction to get the "all bits in expected order" arrangement, e.g. the first byte read would go into lowest 8 bits of the register, then you'd switch gates so the next byte read goes into the next lowest 8 bits and so on until you filled it.

On an equivalent big endian memory system, you'd do the same but simply fill the register in the other direction.

Really. It doesn't matter.

Last edited by Karlos on 30-Nov-2023 at 12:44 AM.

_________________
Doing stupid things for fun...

kolla 
Re: NEx64T - #2: opcodes structure for simple decoding
Posted on 30-Nov-2023 4:01:42
#16 ]
Elite Member
Joined: 21-Aug-2003
Posts: 2940
From: Trondheim, Norway

@heymatt

If big-endian is "natural" for "humans", then why are little-endian and "mixed" endian so widespread in use?

_________________
B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC

Karlos 
Re: NEx64T - #2: opcodes structure for simple decoding
Posted on 30-Nov-2023 8:01:26
#17 ]
Elite Member
Joined: 24-Aug-2003
Posts: 4415
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@kolla

I always assumed it was an obvious evolution of the 8 bit approach to doing multibyte operations. On almost every 8 bit machine that I can think of that supports it, you'd read your lowest byte first, do an op, set the flags, and then read the next byte. Think ADC on the 6502.

As machines got wider registers, the need to perform multi word operations still existed and so you'd start with the lowest word, and do the same thing. Assuming your wider machine also supports 8 bit data types, you'd continue to support them the same way.

Think of the irony of the 68K, where addx also has to perform multiword operations least significant word first, or where 16 / 8 bit operations continue to affect only the least significant portion of the register. You almost begin to wonder what the purpose of big endian was in the first place. More than likely, human sensibilities like "but it's the way we write numbers" rather than how the logic actually works were involved somewhere. Thankfully, it's not a big deal; it just comes down to wiring for the most part.
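
In C terms, that low-byte-first carry chain looks something like this (a toy sketch, not any real CPU's microcode; the helper name is made up):

#include <stdio.h>
#include <stdint.h>

/* Add two n-byte integers stored least significant byte first,
   propagating the carry upwards, ADC-style. */
static void add_le(const uint8_t *a, const uint8_t *b, uint8_t *sum, int n) {
    unsigned carry = 0;
    for (int i = 0; i < n; i++) {
        unsigned t = a[i] + b[i] + carry;
        sum[i] = (uint8_t)t;
        carry = t >> 8;
    }
}

int main(void) {
    uint8_t a[4] = { 0xFF, 0xFF, 0x00, 0x00 }; /* 0x0000FFFF */
    uint8_t b[4] = { 0x01, 0x00, 0x00, 0x00 }; /* 0x00000001 */
    uint8_t s[4];
    add_le(a, b, s, 4);
    printf("%02X %02X %02X %02X\n", s[0], s[1], s[2], s[3]); /* 00 00 01 00 */
    return 0;
}

Storing the low byte at the low address means the loop index and the addresses both just count upwards.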

_________________
Doing stupid things for fun...

Karlos 
Re: NEx64T - #2: opcodes structure for simple decoding
Posted on 30-Nov-2023 8:05:10
#18 ]
Elite Member
Joined: 24-Aug-2003
Posts: 4415
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

For the avoidance of doubt, I personally prefer Big Endian - in part because it's the way I write numbers down, but that's not a good enough reason to make something work that way and risk filling it with contradictions. Hence why MC64K isn't big endian.

_________________
Doing stupid things for fun...

OneTimer1 
Re: NEx64T - #2: opcodes structure for simple decoding
Posted on 30-Nov-2023 16:17:31
#19 ]
Cult Member
Joined: 3-Aug-2015
Posts: 989
From: Unknown

@Karlos

Quote:

Karlos wrote:

On almost every 8 bit machine that I can think of that supports it, you'd read your lowest byte first, do an op, set the flags, and then read the next byte.


I know, it's logical for 8 bit engines: read the lowest byte first, make your calculations, read the next byte.

Theoretically, the higher byte to be read could have been at a lower address; I don't know what Motorola did on their 6800 systems.

cdimauro 
Re: NEx64T - #2: opcodes structure for simple decoding
Posted on 30-Nov-2023 17:38:00
#20 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey

Quote:

matthey wrote:
cdimauro Quote:

We know how x86/x64 encoding sucks a lot and how complicated it is (the only good thing is that it has to deal with 8 bit opcodes): it requires some time to "digest it".

...

Do you mean for decoding?


Yes. It would be possible to decode a byte at a time for a 16 bit VLE but without an instruction buffer in a tiny core design, no instruction could be executed or transferred to the next stage from the decoded data in that cycle. Even the x86 ISA gains little value from being able to decode a byte at a time as only a few instructions are 1 byte in size. It's not worth it to deal with additional alignment overhead especially when no instruction should be a single byte as they waste too much encoding space. A very simple or specialized ISA may have enough 8 bit instructions to make 8 bit decoding worthwhile.

Yes, it depends entirely on how the ISA is designed. You can have a 16-bit VLE, but 8 bits might be enough to decode the most important information.
Quote:
cdimauro Quote:

Being little-endian was a nightmare for the opcode design: it's "unnatural" (for me, being a human being). In fact, the opcode design was big-endian on the first NEx64T version.

However, I had to switch to little-endian at a certain point in time to be coherent with how the architecture worked.


I agree that little endian is unnatural. Starting at the least significant side is fine but it should start at the least significant bit not byte.

%00000001 // big endian #1
%00000001 // little endian #1
%10000000 // true little endian #1

Big endian bits are right to left and bytes are right to left
Little endian bits are right to left but bytes are left to right.
True little endian bits are left to right and bytes are left to right.

True little endian is logical, consistent and natural working with any datatype size from 1 bit to infinity while little endian is a datatype specific bastard ordering. The little endian choice of a byte is arbitrary because of the 8 bit ALU but what if the first little endian ALU had been 4 bits (a nibble like the Intel 4004 ancestor of Intel 8 bit CPUs but byte addressing at minimum) or 16 bits? This is a problem for little endian but not for true little endian. What uses true little endian? The 68k bitfield instructions in both memory and registers.

I agree with Karlos here: it's not that important from this PoV.

However, one thing should be clarified: endianness is all about byte order. Bit ordering is another thing, completely orthogonal, which can be applied to systems of whatever endianness.

My primary concern regarding a little-endian system is about the data representation for us, human beings.
But as long as bit #0 is the LSb, I'm perfectly fine. And this should apply to the little-endian systems which we know.
So, to be clearer, setting bit #8 means setting bit #0 of the second byte in memory (counting consecutively from the first / lowest-address byte).

As I've said, I only have a problem with the data representation. But I've solved it by displaying it as we do: from the MSb to the LSb. So, $0100 (which is the above example of setting bit #8) to me is:
%0000000100000000

That's how I've organized the opcode table of my ISAs, with the bits carrying the most important information located on the LSbs (bits #3..0, specifically).
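
To make the bit #8 example above concrete, here's a tiny C check (purely illustrative):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    uint16_t v = (uint16_t)1 << 8; /* set bit #8: value $0100 */
    uint8_t bytes[2];
    memcpy(bytes, &v, sizeof v);   /* look at the in-memory layout */
    /* On a little endian host this prints "00 01":
       bit #0 of the second (higher-address) byte is set. */
    printf("%02X %02X\n", bytes[0], bytes[1]);
    return 0;
}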
Quote:
cdimauro Quote:

A recent example: https://www.theregister.co.uk/2012/03/06/intel_xeon_2600_server_chip_launch/
The instruction decoder and the microcode ROM occupy a non-negligible area of each core. For example, on the Intel Xeon E5-2600 those components occupy approximately 17% of the area of each core

And in the past it was even worse: the decoder took about 30% (if I recall correctly) of the Pentium core and even 40% of the Pentium Pro core.

Isn't it a considerable piece of the cake?


Is the following article where the 30% "decoder" figure for the Pentium came from?

https://websrv.cecs.uci.edu/~papers/mpr/MPR/ARTICLES/070402.pdf

No, it was this: https://arstechnica.com/features/2004/07/pentium-1/7/
Searching around for the source I've found this (in German, but I'm able to understand many things).
And finally I was able to reach the real source. For the Pentium. For the Pentium Pro.
Quote:
Quote:
Intel estimates about 30% of the transistors were devoted to compatibility with the x86 architecture. Much of this overhead is probably in the microcode ROM, instruction decode and control unit, and the adders in the two address generators, but there are other effects of the complex instruction set. For example, the higher frequency of memory references in x86 programs compared to RISC code led to the implementation of the dual-access data cache.


30% of 3.1 million transistors = 930,000 transistors

Most of the 30% of transistors would be decoder related, except the "two address generators". The author treats them as a CISC disadvantage made necessary by the x86 "frequency of memory references", as if any transistors beyond a traditional RISC pipeline were a waste. The pipeline design did help the Pentium overcome the very high memory traffic, but it also improved the SiFive U74 RISC-V core.

Indeed: actually it's not a disadvantage, rather the opposite!
Quote:
There is a die photo with unit labeling in figure 2 showing the Pentium instruction fetch, decode, instruction support and control logic taking about 20% of the chip area. Other x86 compatibility must be spread out in other units but it doesn't look like the decoder itself is 30% by area (transistors may be more densely packed in areas though).

Other compatibility is not that important: to me the decoder is the most important part, and 20% of the core seems much more realistic, looking at the picture.
Quote:
We can compare this to the 68060 Microprocessor Report which also has a die shot with unit labeling.

Motorola Introduces Heir to 68000 Line, Multi-issue 68060 Maintains Upward Compatibility
https://websrv.cecs.uci.edu/~papers/mpr/MPR/ARTICLES/080502.pdf

Nice article, thanks.
Quote:
On the 68060, the instruction fetch unit, instruction buffer and pipeline control look like they take about 15% of the area.

68060 15% of 2.5 million transistors = 375,000 transistors
Pentium 20% of 3.1 million transistors = 620,000 transistors

This is a very rough estimate of the instruction fetch, decode and dispatch/issue overhead but it looks like the 68060 probably has less overhead.

Same opinion. The x86 tax is still high on low-end microarchitectures like the Pentium (though it exploded with the Pentium Pro, which was more high-end).

BTW, this: https://www.theregister.co.uk/2012/03/06/intel_xeon_2600_server_chip_launch/
no longer reports this:
The instruction decoder and the microcode ROM occupy a non-negligible area of each core. For example, on the Intel Xeon E5-2600 those components occupy approximately 17% of the area of each core

I don't understand why / how the content was changed. I've also searched around again, but I'm no longer able to find this information.

I'm really upset: it looks like the only way to be sure now is to make a dump of everything of interest, because saving a bookmark/URL and a short description isn't reliable anymore...
