_Steve_ 
Re: 68k Developement
Posted on 30-Sep-2018 18:25:50
#401
Team Member
Joined: 18-Oct-2002
Posts: 6808
From: UK

@thread

There has been quite a bit of poor behaviour in this thread. Insults and offensive remarks are not tolerated, and those reported have been removed and action taken.

Differences of opinion are a part of life. Being insulting about it for no reason is not acceptable.

Please be respectful when using this site.

_________________
Test sig (new)

ppcamiga1 
Re: 68k Developement
Posted on 1-Oct-2018 1:00:11
#402
Cult Member
Joined: 23-Aug-2015
Posts: 770
From: Unknown

The last Deluxe Paint I used was 4.x; I do not use Deluxe Paint V on my A1200.
Around 1994 I switched to Personal Paint because I needed to import/export graphics to the PC, and Deluxe Paint did not do that out of the box.
The only Amiga software worth the hardware's bang is Amiga 500 games.
Amiga 1200 games were crap.
I have an Amiga 500 from Commodore with a Gotek to play Amiga 500 games.
An Amiga 1200 with wmfh, Vampire, MiST etc. is all crap.
Amiga 500 games are good only on an Amiga 500.
Do the same. You do not have to have one Amiga for everything.
You may have two or more.
For example, an Amiga 500 from Commodore for games, and an Amiga 500 not from Commodore for productivity software.
Only the Amiga 500 has a chipset faster than its CPU. The Amiga 1200 has a built-in CPU faster than the original Amiga Blitter, which is as slow as it was in 1983.
Even if I bought an Amiga 1200 again, I would use it as an Amiga NG, simply because the A1200 has a CPU faster than its Blitter.
From a developer's point of view, every Amiga made since 1992 is a PC with a non-x86 CPU, no matter who made it: Commodore, Escom, Eyetech, ACube, A-Eon.
You have to accept that.


ppcamiga1 
Re: 68k Developement
Posted on 1-Oct-2018 1:04:18
#403
Cult Member
Joined: 23-Aug-2015
Posts: 770
From: Unknown

What is the point of a PPC Amiga?
Seamless integration with 68k software, a CPU a hundred times faster than the fastest 68k, and graphics thousands of times faster.
68k software works better and faster than under UAE on the fastest PC.

OlafS25 
Re: 68k Developement
Posted on 1-Oct-2018 9:16:47
#404
Elite Member
Joined: 12-May-2010
Posts: 6353
From: Unknown

@ppcamiga1

How so?

A hundred times faster?

A 5000 MHz PPC CPU? Where? I want one, I want one.

You are talking nonsense.

It cannot work better than on the newest PC, because you have "NO CHIPSET", and 68k under WinUAE is faster than all Amiga PPC platforms except the X1000/X5000 if you compare 68k native against PPC native. I know that because I ran benchmarks myself some time ago and could compare them to the results on PPC platforms...

Did you run any benchmarks? What is the basis of your claims? Beliefs?

cdimauro 
Re: 68k Developement
Posted on 1-Oct-2018 17:54:23
#405
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@ppcamiga1 Quote:
ppcamiga1 wrote:
The last Deluxe Paint I used was 4.x; I do not use Deluxe Paint V on my A1200.
Around 1994 I switched to Personal Paint because I needed to import/export graphics to the PC, and Deluxe Paint did not do that out of the box.

So, you haven't even tried DP V?
Quote:
The only Amiga software worth the hardware's bang is Amiga 500 games.
Amiga 1200 games were crap.

I beg to differ, and I don't think that I'm alone here. Anyway, can you prove it?
Quote:
I have an Amiga 500 from Commodore with a Gotek to play Amiga 500 games.
An Amiga 1200 with wmfh, Vampire, MiST etc. is all crap.

Again: any proof of that?
Quote:
Amiga 500 games are good only on an Amiga 500.

Many work well on other Amigas.

The primary problem with Amiga OCS/ECS games is that several didn't work on other Amiga machines because their developers were lamers who didn't know how to write code respecting Commodore's guidelines.
Quote:
Do the same. You do not have to have one Amiga for everything.
You may have two or more.
For example, an Amiga 500 from Commodore for games, and an Amiga 500 not from Commodore for productivity software.

WinUAE has clear advantages here.
Quote:
Only the Amiga 500 has a chipset faster than its CPU. The Amiga 1200 has a built-in CPU faster than the original Amiga Blitter, which is as slow as it was in 1983.
Even if I bought an Amiga 1200 again, I would use it as an Amiga NG, simply because the A1200 has a CPU faster than its Blitter.

That's not true: the Amiga 1200 CPU is faster than the Blitter for SOME operations (filling and copying data which is 32-bit aligned).

But for the most important operations (which involve shifts, masking, logical operations, and area filling), extensively used in games, the Blitter was still much faster.

Last but not least, on the Amiga 1200 there's much more bandwidth available for the Blitter, because the display logic can fetch 32-bit or 64-bit data per "color clock" cycle.
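
To make the comparison concrete, here is a minimal C sketch (hypothetical buffer names, one bitplane, no hardware access) of the per-word work a "cookie cut" blit needs on the CPU: shift the source and mask across the 16-bit word boundary, then combine through the classic D = (A AND C) OR (B AND NOT C) minterm. The Blitter performs exactly this combination in dedicated hardware, which is why the CPU's advantage on plain aligned copies doesn't carry over.

#include <stdint.h>

/* Illustrative only: names and layout are assumptions, not real OS calls. */
void cookie_cut_row(const uint16_t *src, const uint16_t *msk,
                    uint16_t *dst, int words, unsigned shift) /* shift: 0..15 */
{
    uint32_t carry_src = 0, carry_msk = 0;
    for (int i = 0; i < words; i++) {
        /* shift source and mask across the word boundary */
        uint16_t a = (uint16_t)(((carry_src << 16) | src[i]) >> shift);
        uint16_t c = (uint16_t)(((carry_msk << 16) | msk[i]) >> shift);
        dst[i] = (a & c) | (dst[i] & (uint16_t)~c);  /* mask + logical combine */
        carry_src = src[i];
        carry_msk = msk[i];
    }
}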
Quote:
From a developer's point of view, every Amiga made since 1992 is a PC with a non-x86 CPU, no matter who made it: Commodore, Escom, Eyetech, ACube, A-Eon.
You have to accept that.

From a developer's point of view it's clearly evident that you have absolutely no idea of how Amigas (OCS, ECS, and AGA) and PCs work.

Stop spreading lies.
Quote:
ppcamiga1 wrote:
What is the point of a PPC Amiga?
Seamless integration with 68k software, a CPU a hundred times faster than the fastest 68k, and graphics thousands of times faster.
68k software works better and faster than under UAE on the fastest PC.

Olaf has already replied to this, but I have to add Amithlon, which can also run 68K applications on a PC and which seems to be even faster than WinUAE (because the latter has to emulate the Amiga chipset too). I haven't used it, but it would be good if some Amithlon user could run some benchmarks with crunch-intensive Amiga apps.

I'm adding here the message that you wrote in the Vampire thread, because it says more or less the same things.
Quote:
ppcamiga1 wrote:
@cdimauro

PPC cannot be replaced because there are no faster CPUs that work in 32-bit big-endian mode.
PPC Amigas feel like better Amigas than those made by Commodore because they are many times faster than 68k Amigas and provide seamless integration with old 68k Amiga software.
Comparing them to a PC with Windows is dumb; Windows does not provide integration with old 68k Amiga software.

Integration with what? Is there any KILLER app which justifies the usage of PPC AND the seamless 68K integration you talked about?

Because if there's no killer PPC app, I don't see any reason why I shouldn't prefer WinUAE or Amithlon for running the 68K software.

It's true that there's no seamless integration of 68K apps on Windows (or Linux/macOS/etc.), but it might come in the future, as I've already stated in the Vampire thread.

OlafS25 
Re: 68k Developement
Posted on 1-Oct-2018 18:05:05
#406
Elite Member
Joined: 12-May-2010
Posts: 6353
From: Unknown

@cdimauro

For info: Amithlon is about twice as fast as WinUAE on supported hardware, but Amithlon is no longer in development, so the supported hardware ages and that advantage shrinks over time.

megol 
Re: 68k Developement
Posted on 2-Oct-2018 0:03:27
#407
Regular Member
Joined: 17-Mar-2008
Posts: 355
From: Unknown

@cdimauro
Quote:

cdimauro wrote:
@megol
Well, how limited is 68K today?

There are some complex parts there. In a CISC/RISC hybrid one could make other choices better suited to a modern design - more registers, fewer instruction formats, easier-to-decode complex operations, and support for complex but useful things. For example, the REP MOVSx instruction of the x86 is very useful, but it could be made even more useful if microcoded operations could be made more efficient. In a limited-complexity CISC, microcode could be almost "free", as the extras x86 requires aren't necessary.

Quote:

At least with my proposal you're still able to gain 64-bit and 8 more data registers without using prefixes, while keeping most of 68K advantages (code density included).

Except it's no longer a 68k processor. My feeling is that if one is to make something new, one shouldn't be limited to being like the 68k, and if one is to be 68k compatible, extensions should be as natural as possible. Prefixes are ugly hacks, but they allow the 68k to still be a 68k, with some density decrease of course.

Quote:

The 16-bit opcode space cannot be orthogonal as it was with the 68K, of course. Rethinking the 68K ISA needs a different mindset here: 16-bit opcodes should be seen not as regular instructions, but as compact versions of more general ones (which are 32 bits in size), as happens on other modern ISAs. 16-bit opcodes are there to save space, full stop.

A good idea IMHO. Even going the RISC-V way and being non-orthogonal for compact instructions, with a varying number of register bits etc., could be a good idea for real-world usage with compiled code. But not many 68k assembly coders would like something like that.
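
As a rough illustration of the "compact alias" idea (the encoding below is invented for this sketch and is not a real proposal), a decoder can expand a 16-bit form that only reaches a register subset and a short immediate into the one canonical 32-bit format, so everything past the decoder sees a single instruction layout:

#include <stdint.h>

/* Hypothetical 16-bit compact form: 4-bit opcode, 3-bit register (D0-D7 only),
 * 6-bit signed immediate. The assumed full 32-bit form has a 4-bit register
 * field and a 16-bit immediate. */
uint32_t expand_compact(uint16_t c)
{
    uint32_t op  = (c >> 12) & 0xFu;   /* minor opcode               */
    uint32_t reg = (c >> 9)  & 0x7u;   /* only 8 registers reachable */
    int32_t  imm = c & 0x3F;           /* 6-bit immediate...         */
    if (imm & 0x20)                    /* ...sign-extended           */
        imm -= 0x40;
    /* rebuild the canonical 32-bit form: opcode | reg | imm16 */
    return (op << 28) | (reg << 24) | ((uint32_t)imm & 0xFFFFu);
}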

Quote:

Quote:

True. However, what I'm trying to say is that if one manipulates 64-bit data, the resulting code with prefixes will be smaller than the comparable 32-bit code; if one doesn't need 64-bit data, the prefix can be eliminated in most cases, removing the overhead in some.

It would be possible to have the prefix without the extra register bits and the 64-bit extension and still have it be useful for zero/sign extension of normal instructions. Better than MVS/MVZ? I think so, but the initial cost is pretty high.

I understand your point, but having looked at a lot of disassembled code and collected statistics (limited, OK, but at least I have some data), I can see that 64-bit versions of the same applications (FirebirdSQL, FFMPEG, Photoshop CS6 public beta, Unreal Engine) don't take advantage of the possibility to handle 64-bit instead of 32-bit data. One evident benefit could be using 64-bit immediates; however, looking at how many MOV REG,Imm64 instructions are found in the code leads to the conclusion that they are rare birds, and there's substantially no gain in either instruction-count reduction or code density.

They are rare, and most values can be represented as sign-extended 32 bits. However, one shouldn't forget that while AMD64 is better than most (all?), its 64-bit immediates are special cases rather than a standard feature.

I think a 68k64 version should support 64-bit immediates even though they are rarely useful: it's orthogonal, it follows the general 68k design, and it doesn't need special handling by the compiler.
So an optimizing compiler is likely to make different choices on 68k64 than on any other 64-bit processor in at least some cases - skewing the statistics a bit.
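
A quick way to see why 64-bit immediates stay rare: the long form is only needed when the constant does not survive a sign-extended 32-bit round trip, which is essentially the test a code generator can apply before emitting a full 64-bit immediate. A minimal C check:

#include <stdint.h>
#include <stdbool.h>

/* True if v can be stored as a 32-bit immediate and recovered by sign
 * extension, i.e. no full 64-bit immediate is required. */
static bool fits_simm32(int64_t v)
{
    return v == (int64_t)(int32_t)v;
}

/* fits_simm32(-1)                 -> true  (all-ones pattern)
 * fits_simm32(0x7FFFFFFF)         -> true
 * fits_simm32(0x0123456789ABCDEF) -> false (needs the 64-bit form) */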

Quote:

What you can see by looking at the same application compiled in 32- and 64-bit is that most of the operations are 32-bit in the first binary, with a minor amount of byte operations and rare 16-bit operations. Whereas in the second binary the operations are almost always 32- and 64-bit, with very rare byte operations and almost zero 16-bit ones; so, basically, there's a good mixture of 32- and 64-bit operations, which leads to decreased code density due to more prefix usage for 64-bit operations.

And also here there is a source of skewed data: x86-32 and AMD64 (by extension) require a prefix for word operations, increasing size. Many word-size operations are also slower than dword or quadword operations, decreasing the chance that an optimizing compiler would choose them when an alternative is possible.

For a 64-bit 68k this isn't a problem. Word operations need no prefix, nor do byte or long operations. Even some quad-word operations could avoid prefixes, for example:
MOVE.Q #$0123456789abcdef, D0
ADD.Q #$0123456789abcdef, D5
ORI.Q #$0123456789abcdef, (A0)+ ; not 100% compatible

Quote:

Quote:
I'm still not sure how the most vital part of a 64-bit extension would be handled: addressing. I would like it to be possible to run 32-bit code in 64-bit mode unchanged, without needing a mode switch.

Yes, in theory. However, in practice applications compiled in 32- and 64-bit flavors have quite different mixtures of operations, as I've said before, and it would be wise to take advantage of that (if it's possible, of course).

The problem, as I mentioned above, is that the mapping can't be 100%, at least for my vision of a 64-bit 68k. How much of the data can be directly translated to a new extension, and how much of it is an artifact of different architectures/microarchitectures? I have no idea but suspect there can be a lot of artifacts.

Quote:

My previous ISA versions worked as you stated, because all instructions were orthogonal in size, using a 2-bit field to specify the instruction size. So, the difference between a 32-bit and a 64-bit application is that the first one simply didn't use Size=0b11.

Of course a 68k extension can't really do that, with a few exceptions...

Quote:

Quote:

That's the reason I like the MVS/MVZ space: 11 free bits. Two bits for the extension type (including one quadword variant) and 3 bits for register extension leaves 11-5=6 bits. Even with 4 bits for an additional register, two bits are available for other things, and they would be required for some special cases.

This would provide sign and zero extension of byte, word and long operations. There is space for other variants too, for instance one wild idea would be adding SIMD operations to the normal integer operations.

ADD.B D10, D13 ; normal with prefix
ADD.BZ D10, D13 ; result is zero extended to full register width
ADD.BS D10, D13 ; result is sign extended

And perhaps ADD.BQ D10, D13 for SIMD operation on byte quantities in a quadword.

Yes, SIMD operations on integer registers can be useful on very low-end embedded systems.

Anyway, and to give a general answer to your quote: why not use a longer opcode then, instead of a prefix? You can optimize instructions better with a longer opcode, because you can almost completely get rid of the unused/not-useful encodings coming from the application of the prefix to ALL existing opcodes. With the additional, clear advantage of simplifying the ISA implementation.

Why not longer new opcodes? To try to avoid modes while staying 68k compatible. :)
We don't seem to disagree too much about anything, we're just looking at the same thing with different goals.
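
As an aside, the "SIMD on the integer registers" idea quoted above (the hypothetical ADD.BQ) is essentially SWAR; a minimal C sketch of eight independent byte additions inside one 64-bit register, done by keeping carries from crossing byte boundaries:

#include <stdint.h>

/* Add eight packed bytes to eight packed bytes, wrapping per byte. */
static uint64_t add_bytes_swar(uint64_t a, uint64_t b)
{
    const uint64_t H = 0x8080808080808080ULL;  /* top bit of each byte */
    uint64_t low = (a & ~H) + (b & ~H);        /* add the low 7 bits   */
    return low ^ ((a ^ b) & H);                /* fix up the top bits  */
}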

Quote:

I know, from what I've read, that you and Matt want to extend the existing 68K ISA, and you're trying to find solutions to the problems which we have talked about. However, when an ISA reaches a critical mass of issues (and the 68K has collected many of them), then I think that it's better to completely rebuild it, taking the good parts.

That's what I've done with my x86/x64 "re-encodings". You know that both ISAs make common use of prefixes, but my ISAs have no prefixes at all, while still keeping the same possibilities AND also bringing A LOT of new features and enhancements. And they are... TRIVIAL to decode: a bit more complicated than Thumb-2 (to give an idea of the instruction formats to handle), but not that far off (only a few bits are needed from the first bytes in order to get the full instruction length, plus a lot of useful information about the instruction type and what it needs).

IMO it's worth seriously thinking about a 68K successor, because I see similar possibilities here.

It's just that IF one is to make a new processor, why not try to make something a bit different?
Like going back to word-addressed memory but supporting fast extract/insert for byte/word/long/quad/whatever - avoiding the problems of unaligned memory for super-high-frequency pipelines. Or supporting compound instructions, extracting more performance with shorter pipelines. Or having an explicit hierarchical register design to reduce renaming overheads and simplify out-of-order execution? Perhaps a RISC with segmentation-based addressing?

Just another CISC or RISC isn't too exciting and is unlikely to gain a following. Extending the 68k could possibly carve out a tiny niche, at least among retro computing enthusiasts and the few places that still have 68k code running.

Quote:

Quote:
Here it's enough to provide a regular 32-bit opcode.

Yes, but this would be a version using the prefix as usual (it would remove the new register field, though) without needing additional opcode space.
And with the extra mode bits choosing whether an additional register is wanted, both of these would be possible:
ASL.W #63, D13
ASL.W #3, D10, D13

But it's a bit messy. Not that complicated, but messy.

Whereas a 32-bit ad-hoc encoding will be much easier to implement.
Don't know if that's true - the prefix is designed to be transparent: an instruction without a prefix is decoded as if the extension bits were zero, while an instruction with a prefix decodes as normal and the prefix decodes in parallel, with the extension bits set accordingly.
Why is that relevant?

Let's assume the internal operation format is like: operation opsize reg_source1, reg_source2, reg_destination, immediate64
ASL.W #7, D5 decodes to the micro-op ASL WNX D5 - D5 #7

Which may be encoded as:
0000 0001 0101 xxxx 0101 0000...0111

ASL.W #63, D13 decodes to ASL WNX D5 - D5 #7 (SAME AS ABOVE!)

But the prefix decoder inserts the correct bits so the micro-operation result is:
0000 0001 1101 xxxx 1101 0000...0011 1111

ASL.W #7, D10, D13 decodes to ASL WNX D5 - D5 #7

The prefix decoder inserts the extension bits _and_ change the source register field:
0000 0001 1010 xxxx 1101 0000...0111

ASL.WSX #7, D10, D13 is similar with the result:
0000 0101 1010 xxxx 1101 0000...0111

ASL.Q #7, D10, D13 gives:
0000 1010 1010 xxxx 1101 0000...0111

As you see it's all designed to be easy to decode in parallel with a very inexpensive fusion of prefix extension bits and the micro-operation bits from the proper decoder.

In this example, opsize: xx00 = byte, xx01 = word, xx10 = long; 00xx = no extension with the upper 32 bits zeroed, 01xx = sign extension to the full register length, 10xx = zero extension.
As a no-extension long operation (0010) sets the upper 32 bits to zero, the zero-extending long format (1010) is repurposed for quadword operations, which of course need no type of extension.

The complication would be selecting whether four bits should be routed to replace one register field or to replace bits 6:3 in the immediate field. So two four-bit muxes plus a little decoding. Ugly, but not exactly hard to do. And it reuses the instruction formats already supported, leaving free opcode space for new things.

(There is the problem of having the prefix decoder completely separate from the opcode decoder: prefixes can be redundant or do useless things, like replacing bits 6:3 in a long immediate for instance. Simple fix: don't do that. A slightly harder fix: detect those cases after the main decode and trigger an illegal-instruction exception - as it isn't on a critical path, the prefix-opcode decode dependencies aren't a problem anymore.)
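
A rough C model of the fusion described above may make it easier to follow (the field layout is invented for this sketch; only the merge mechanism matters): the base decoder emits a micro-op from the 16-bit opcode alone, and the prefix decoder, working in parallel, ORs its extra bits into the size/extension and register fields.

#include <stdint.h>

/* Illustrative micro-op; field widths are assumptions, not a real format. */
typedef struct {
    uint8_t  opsize;  /* low 2 bits: B/W/L, high 2 bits: none/sx/zx (1010 = Q) */
    uint8_t  src;     /* 4-bit source register      */
    uint8_t  dst;     /* 4-bit destination register */
    uint64_t imm;
} uop_t;

/* Base decode sees only the legacy 16-bit opcode (3-bit registers, extension
 * bits zero); the prefix decoder merges its bits into the same fields. */
static void fuse_prefix(uop_t *u, unsigned ext_type,
                        unsigned src_hi, unsigned dst_hi)
{
    u->opsize |= (uint8_t)((ext_type & 0x3u) << 2); /* none / sx / zx / (quad) */
    u->src    |= (uint8_t)((src_hi   & 0x1u) << 3); /* reach D8-D15 as source  */
    u->dst    |= (uint8_t)((dst_hi   & 0x1u) << 3); /* ...and as destination   */
    /* (the "route 4 bits to a register field or to imm bits 6:3" mux from the
       text above is omitted here) */
}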

Quote:

Quote:

That's the problem for us all I think - it's still just ideas until tested... :/

But I've some data and a bit of experience: see above.

You can also collect statistics about existing applications, and you'll figure out for yourself how things stand.

I'm thinking now about using some profiler tool to instrument the execution of common applications, to get many more decoded instructions plus dynamic usage data for them (currently I only have static analysis).
This will give a lot of useful information and the possibility to generate more stats for my ISA. I still cannot make use of several enhancements which it provides, but even a 1:1 translation will be enough to show the advantages in terms of code density, and what can be achieved with proper compiler support.

That last thing can be a huge problem. It isn't easy to change a good 68k-compatible compiler to generate optimized code for some theoretical extension - as there isn't any really good compiler to change. That's my impression at least - hopefully a wrong one.

(Am very tired so there are likely a lot of errors/thinkos)
(Edit: there were!)

Last edited by megol on 02-Oct-2018 at 07:49 AM.

matthey 
Re: 68k Developement
Posted on 4-Oct-2018 1:42:20
#408
Elite Member
Joined: 14-Mar-2007
Posts: 2015
From: Kansas

Quote:

cdimauro wrote:
In many discussions I've seen people asking for more data registers (actually it seems that most coders prefer to have more data registers instead of address registers. I'm for more address registers, but it seems that I'm part of a minority).


The data registers are more general purpose, probably easier for a compiler to take advantage of, and more of them are easier to add. Address registers are base/index registers and important to the addressing modes, which are a strength of the 68k. There are fewer address registers available to begin with, as a7 (sp) is taken. If the a5 (frame pointer) and/or a4 (small data) registers are used as well, then there *is* more of a shortage of address registers. We can free a5 with frame pointers on the stack, and often a4 with better pc-relative support. Not locking the library base to a6 may help as well.

Quote:

You said that you want to be compatible with the 68K code, and that's OK. But you can have a 64 bit post-68K processor which works this way:
- 68K compatibility mode (like x86 compatibility mode on x64, or the traditional ARM32 on ARM64 processors);
- new execution mode (which isn't 68K-binary compatible) with a 68K-inspired ISA which can run in 64 or 32 bit.

A new 64-bit ISA / opcode table is needed because you don't want to use prefixes (which will lower the code density) while proposing consistent enhancements (64-bit first; 8 more data registers can be a reasonable and achievable goal too; a new FPU and/or SIMD unit as the last feature to think about), and dropping some legacy as well (e.g. double indirect modes, coprocessors, and maybe BCD instructions). This new ISA can itself run in 32- or 64-bit mode without any particular burden (like my NEx64T ISA: 32-bit = use any instruction, it's just that Size=64-bit is not allowed and some instructions decode differently).

So, the new ISA brings all enhancements, and the 68K one is kept only for backward-compatibility.

What do you think about it?


I would like to bring some of the very compatible enhancements to 68k_32 as well but if the code density of 68k_64 mode is very close then it is less important. I expect 68k_32 and 68k_64 to be more compatible than x86 and x86_64.

Quote:

OK. For the latter: only with some new instruction (e.g.: MOVLE ), or with a "little-endian data" execution mode?


I don't think a BE/LE execution mode is necessary or a good idea. An instruction is fine. It would be possible for memory pages/regions to be mapped as BE/LE but I wonder how well this can be used in an OS.
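
As a semantic reference for what such an instruction (the MOVLE idea) has to do, a minimal C model of a 32-bit little-endian load that works regardless of host byte order - a normal load plus a byte swap, with no global mode involved:

#include <stdint.h>
#include <string.h>

/* Load a 32-bit value stored little-endian at p. */
static uint32_t load_le32(const void *p)
{
    uint8_t b[4];
    memcpy(b, p, 4);                       /* plain aligned or unaligned load */
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);         /* assemble in little-endian order */
}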

Quote:

No, I wasn't talking about that. I was talking about the possibility to change some flag which can alter the meaning / decoding of some instructions, like what the 65C816 processor did when it extended the 65C02 to a 16-bit ISA.

My ISA allows changing some instruction decodings and/or memory addressing modes, based on some configurable flags, in order to better match an application-specific use case (the compiler and/or the developer decides how to configure it).


This could be useful, especially for compatibility. For example, being able to configure the alignment of the stack pointers would be great. It is my understanding that this kind of state/mode change can cause problems in some cases and it is generally preferable to use instructions which specify the configuration. For example, the 68k FPU instructions which specify the rounding mode are preferable and faster than changing the rounding mode in the FPCR register.

Quote:

If you change the 68K opcode table, removing / replacing instructions to make space for 64-bit encodings, then you no longer have a 68K ISA, but a 68K-inspired one.


Most programmers won't notice the difference until they find something missing. I'm hoping to minimize what is missing and keep the 68k "feel" in 64 bit mode.

Quote:

In this case I suggest, as I said before, completely rethinking the ISA - basically following what ARM did with its 64-bit ISA.

Keeping the legacy isn't a good idea for a new ISA: don't make the mistake that AMD made with x64.


I don't want to throw the baby out with the bathwater but I do recognize that the 68020 ISA was lacking in some areas. It is mostly some mistakes of the 68020 ISA which I would like to fix and recover some encoding space in 64 bit mode. There are some minor re-encodings as well.

Quote:

I think that ARM had no opcode space available for Thumb-2-like 16-bit instructions, and didn't want to introduce a crippled version.

For RISC-V it was much easier because they had already put a set of constraints on the available instruction formats (only 3 "base" ones; however, there's a lot that could be said about that, because it's mostly marketing), leaving plenty of room (75% of the space!) for compressed instructions.


IMO, ARM made a mistake by moving all their new CPUs to AArch64. They should have kept Thumb2 for low end processors (a big percentage of their current market). AArch64 is too fat/robust and could have better code density. An AArch64 compressed format probably could have helped the code density but it is still too fat for low end CPUs. They didn't leave enough encoding space available.

IMO, RISC-V is not robust enough and left too much encoding space free which could have reduced instruction counts and improved code density more. The short compressed encodings can not fully compensate for weaknesses in the base ISA even as the improved code density makes the ISA look better. I have doubts that the simplistic ISA, like other simple RISC ISAs, will be appealing for embedded use other than the "do it yourself" customization (smaller embedded developers may not like it as much as big ones).

Quote:

It'll end in a few years.

Anyway, larger vector sizes also mean more clock-skew issues. That's why Intel's high-performance SIMD implementations LOWER the clock when running intensive AVX/AVX-2 and, especially, AVX-512 code.


Having to lower the CPU clock speed would put a cap on SIMD units becoming wider. It makes a 256 bit wide SIMD unit look more appealing.

matthey 
Re: 68k Developement
Posted on 4-Oct-2018 22:53:07
#409
Elite Member
Joined: 14-Mar-2007
Posts: 2015
From: Kansas

Quote:

megol wrote:
That's why I wrote "semi-constants" in an earlier post; what I meant is things that are constant in a part of the code, like inner loops, but can change infrequently, like once per subroutine call.

Your examples all show real constants, where the advantage of a higher-Dn shortcut isn't really there. If there has to be some type of computation, it has to be done with integer registers and moved to the address register; this requires a temporary integer register and wastes an address register. Prefixes plus a high-Dn source eliminate the temporary integer register and a move, and use no address register, keeping them available for other uses.


Address registers allow addition and subtraction for up or down counters (common, but needing an extra instruction in TST/CMP) or offsets between two values. ADD, SUB, CMP, TST and LEA are very common operations. In the code I looked at, opening address register sources (and even destinations) only gives a minor improvement in instruction counts and code density. For similar reasons, I would expect prefixes adding d8-d15 to give a minor improvement in instruction counts and memory traffic, with worse code density, in the unusual cases where the divide would be a problem. The data and address register divide does not necessitate code as bad as some people would expect, although compilers have more problems with it than humans (d8-d15 would be helpful to compilers *if* they were completely orthogonal data registers). IMO, it is not that the 68k is inherently a difficult target for compilers but rather that there has been a lack of incentive to make modern 68k backends mature. I think Bebbo's GCC efforts show this, and I know this is the case for the lack of effort in vbcc's backend as well.

Quote:

It's ugly as it isn't orthogonal and doesn't fit into the original design; that's the same ugliness for both the high-Dn and An source versions. It is shoehorning something new into a place Motorola didn't intend to be used as such, no matter which version of the extension is selected - so for me they are both ugly, with some nicer parts depending on how one looks at them.
That some of the most useful instructions can't use this shortcut is just another reason why it is ugly IMHO. If the most used instruction (ADD) can't use this shortcut, why even bother?


ADD already has An sources *and* destinations open.

add.l d0,d1
add.l a0,d0
add.l d0,a0 ; converted to adda.l with no CC
add.l a0,a1 ; converted to adda.l with no CC

There are no other variations to add here, no pun intended. ADD is orthogonal other than the way the CC is set and the way address registers sign extend sizes less than the register size.

Quote:

I see the high-Dn as a better choice when combined with a prefix, not otherwise. I still don't really like the idea at all; perhaps those encodings could be used for something more useful?

Again, things like these require a good simulator plus compiler support to decide. Modifying the non-JIT processor emulation of (Win)UAE should be possible even if I don't understand the design 100% (I did a quick look); the focus would be different from the standard emulator's, of course. Precision while emulating everything from register files to execution pipelines and caches is important, while raw performance isn't. I've never done anything similar, so it may be harder than I can imagine. :)


Toni Wilen recently added some FPU instruction logging to WinUAE in a matter of days.

http://eab.abime.net/showpost.php?p=1272424&postcount=9

He could probably add dynamic trace support and CPU performance counters (same as Apollo Core?). This would probably take us (people unfamiliar with the program) weeks to add. I would not ask Toni for this without a more serious effort.

I was familiar with the ADis disassembler (which I have improved), so I created a quick version in a matter of days which generates static stats for disassembled programs. Gunnar did make some decisions based on some of the statistics it generated, and of course that is where some of my knowledge of 68k code comes from.

Frank Wille could create peephole optimizations for vasm (vbcc assembler) in a matter of days which would take me weeks. Vasm already has several ColdFire optimizations which could be turned on for a 68k target (some of which I suggested). My immediate compression could be a peephole optimization. These could give an idea of the code density improvement with the simplest ISA enhancements. As much as I have worked with Frank and as nice as he has been, I would not ask him to help with what would likely turn out to be a wasted effort. It would be worse than him asking Volker to improve the vbcc 68k backend which is a waste of time for a few thousand Amiga enthusiasts.
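
One example of the kind of peephole meant here, shown as a hedged C sketch with invented structures: the long-standing 68k rewrite of MOVE.L #imm,Dn (6 bytes) into MOVEQ #imm,Dn (2 bytes) whenever the immediate fits in a sign-extended 8-bit value, saving 4 bytes per hit.

#include <stdint.h>
#include <stdbool.h>

/* Illustrative instruction record; field names are assumptions. */
typedef struct { int opcode; int reg; int32_t imm; } ins_t;
enum { OP_MOVE_L_IMM_DN, OP_MOVEQ };

/* MOVEQ sign-extends an 8-bit immediate to 32 bits, so the rewrite is only
 * valid for immediates in -128..127. */
static bool compress_move_imm(ins_t *ins)
{
    if (ins->opcode == OP_MOVE_L_IMM_DN &&
        ins->imm >= -128 && ins->imm <= 127) {
        ins->opcode = OP_MOVEQ;
        return true;   /* 4 bytes saved */
    }
    return false;
}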

Quote:

It should be as efficient as doing it the other way as long as the cache prefetcher detects it.


Ok. It will probably be preferable to do base register update and then pre-decrement. It's not as easy to read but isn't too bad. I may not even keep the base register update addressing mode with added 64 bit support though.

Quote:

No, they almost never run out of registers, as almost no code requires more than the free number of registers for temporary values. Which is by choice.
But do the math: what if the RISC runs out of registers 10% of the time and then has an 80% overhead, compared to a CISC running out of registers 20% of the time (remember this isn't linear) with a 5% overhead? Assuming similar execution resources, I think (but welcome corrections) it would be something like:
RISC performance = frequency * (100%*(100%-10%) + (100%-80%)*10%) = frequency * 0.92
CISC performance = frequency * (100%*(100%-20%) + (100%-5%)*20%) = frequency * 0.99

Looks good for CISC, right? We could try to put in some more realistic numbers, which would mean the RISC running out of registers less than 1% of the time and a higher overhead for the CISC (occupying a load port). But one would also need to account for the decreased frequency of the CISC pipeline; the longer CISC pipeline also means a higher cost when mispredicting branches.

For the RISC above to perform as well as the CISC in the calculation above, it has to have an 8% higher frequency: a 108 MHz RISC to the 100 MHz CISC. This is with both the cost and the frequency of register spill/fill exaggerated.


It's difficult to know what numbers to plug in to the equations. I suspect running out of registers is less common.

registers available | load/store %
24 28.21%
22 28.34%
20 28.38%
18 28.85%
16 30.22%
14 31.84%
12 34.31%
10 41.02%
8 44.45%

Source: "High-Performance Extendable Instruction Set Computing", MIPS ISA

Each register spill from running out of registers requires at least 2 load/stores. From 8 to 16 registers reduces the load/store percentage by 14.23% but from 16 to 24 registers by only 2.01%. The overhead cost of RISC running out of registers makes it pretty easy to see that register spills are common for 8 registers and already uncommon at 16 registers (I would guess less than 10% out of register percentage and probably less than 5% with 16 registers). This is older and "various benchmark" code but probably doesn't vary too much from typical code today. CISC ISAs are generally register misers compared to RISC ISAs which may be partially offset by the 68k register split. In any case, the 68k out of register overhead is likely much lower (I would expect the load/store percentage increase when out of registers to be half of RISC). From various stats I have seen of 68k code, the 68k has low memory traffic and is mostly consistent whether optimizing for performance or size (unlike the x86/x86_64). Your own equations make 16 registers look adequate for common code as well.
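
Plugging numbers into the model quoted above makes the point visible; a small C sketch of the same formula, where the spill rates other than megol's two examples are guesses for illustration, not measurements:

#include <stdio.h>

/* perf = (1 - p_spill) + p_spill * (1 - overhead), from the quoted model */
static double eff(double p_spill, double overhead)
{
    return (1.0 - p_spill) + p_spill * (1.0 - overhead);
}

int main(void)
{
    printf("RISC 10%% spill, 80%% overhead: %.2f\n", eff(0.10, 0.80)); /* 0.92 */
    printf("CISC 20%% spill,  5%% overhead: %.2f\n", eff(0.20, 0.05)); /* 0.99 */
    printf("CISC  5%% spill,  5%% overhead: %.2f\n", eff(0.05, 0.05)); /* ~1.00, assumed rate */
    return 0;
}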

Modern mid-performance RISC CPUs have longer pipelines, comparable with the 68060's, so there is little difference in branch performance. Yes, simple RISC CPUs can be smaller than CISC, but many of them will not have 32 registers because of limited resources.

Quote:

The main power draw of a register file is due to switching, not just existing. So if we are to compare the power draw, while working, of a CISC with memory operands and a RISC with register operands, we'd have to include the overheads of a cache and the load/store unit in the equation.
When delivering operands to execution units, registers will win.

One can look at real-world high-performance CISC implementations, x86 and IBM Z, and see that they have large register files even though they support memory operands.


68k 16x32
x86 8x32
x86_64 16x64
z/Architecture 16x64
68k_64 16x64

Modern CISC integer register files look to be more medium sized to me. Specific CPU designs use more registers internally. Other units use larger register files.

Quote:

Again, look at skewed pipelines if you want to see in-order RISC designs with the same cost as for an inline CISC, a cost that is there all the time instead of being optional (if the code can be scheduled correctly).
The 68060 doesn't eliminate anything - the delay is small due to the design and target clock rate. Let's compare the latency of the 68060 with that of the DEC 21164, the Alpha using a 0.5µm process compared with the 0.45µm process of the 68060.

The size of the L1 cache (all the 68060 has, while the Alpha uses an L2 cache plus an optional external L3 cache) is the same, 8KiB. The pipeline length is harder to compare as the Alpha has a pipelined FPU; with all parts of the pipeline included they have the same length, but in practice the Alpha has 7 stages to the 10 stages of the 68060.
For the 68060 a load hitting the cache takes two cycles, AG and OC. For the Alpha a load hitting the cache takes two cycles - starting in one integer pipeline at S4, with data ready at the end of S5.
The 68060 runs at a maximum of 75MHz, the Alpha 21164 at a maximum of 300MHz. The Alpha has a 4 times lower load-to-use latency as measured in time. Obviously part of that is that the Alpha had a completely different design goal; however, it should at least illustrate that there is no inherent advantage to CISC other than (and I'm nagging ;P) instruction density and (for simple designs) somewhat easier construction.

RISC requires more scheduling than simple CISC; however, that is, just like the number of registers, a design choice.


The 68060 pipeline is generally considered to be an 8 stage design (worst case branch mis-prediction is 8 cycles). The Alpha 21164 has a 7 stage integer and 10 stage FPU pipeline. 68060 instructions which hit in the cache are usually single cycle. The Alpha 21164 is an extremely aggressive 4 wide superscalar high performance design with a dual ported L1 DCache compared to the more balanced (up to 3 instruction issue) superscalar 68060 design. Some people thought it was unfair to compare the 68060 to the more aggressive Pentium design but they are closer than the Pentium and Alpha. The energy requirements give a hint at the aggressiveness of the design.

Alpha 21164@300MHz 3.3V, .5um, 9.3 million transistors, 51W max

Pentium@75MHz 80502, 3.3V, 0.6um, 3.2 million transistors, 9.5W max
68060@75MHz 3.3V, 0.6um, 2.5 million transistors, ~5.5W max *1
PPC 601@75MHz 3.3V, 0.6um, 2.8 million transistors, ~7.5W max *2

There could be 3 68060 CPUs operating in parallel using fewer transistors, or 9 68060 CPUs in parallel for less than the energy usage of the Alpha 21164, and that is with the Alpha using a smaller die size. Too bad Motorola was not designing multi-core CPUs back then.

Quote:

It would be less expensive for an out-of-order processor, which can pipeline dependency checks without a significant decrease in performance, for instance.


Does single cycle dependency/hazard checking become a problem with wider in-order superscalar issue?

Quote:

So maybe it's better to use 16 registers, or at least do it the 68k way with split D/A registers. 4 register bits x 2 + 2 bits for operation size + 3 bits for addressing mode leaves 16-(8+2+3)=3 bits for the rest of the opcode, with a maximum of 7 usable opcodes. However, 2 bits for the addressing mode may be more practical: (An), (d16,An), (An)+, -(An), for instance, with other modes requiring a long instruction format. Maximum 15 opcodes, which may be enough.
Or one could do it like RISC-V: a smaller number of usable registers for small instructions, with full 32-bit instructions having 5-bit register fields.


I found some good uses for the unused 68k 6-bit EA encodings. They allow the compressed immediates as well as a fast (d32,pc) addressing mode, which is very useful for a 64-bit ISA. I think 16 registers will be fine. I would rather focus on good 64-bit support which, IMO, is where many other 64-bit ISAs are lacking.

Quote:

However, I don't agree with code density being very important. Even small microcontrollers don't strive for maximum density anymore - there's no need. Don't read this as me saying density doesn't matter; it's just not the most important thing.


Some simple microcontrollers don't need good code density if the embedded code being executed is tiny. Where it is needed, though, instruction fetch is expensive. Instruction supply consumes 42% of a 32-bit RISC embedded CPU, as shown in the following link.

http://www.cast-inc.com/blog/consider-code-density-when-choosing-embedded-processors

cdimauro 
Re: 68k Developement
Posted on 7-Oct-2018 12:11:02
#410
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@megol Quote:
megol wrote:
@cdimauro
Quote:

cdimauro wrote:
@megol
Well, how limited is 68K today?

There are some complex parts there. In a CISC/RISC hybrid one could make other choices better suited to a modern design - more registers, fewer instruction formats, easier-to-decode complex operations, and support for complex but useful things.

That's what I did with my CISC design.
Quote:
For example, the REP MOVSx instruction of the x86 is very useful, but it could be made even more useful if microcoded operations could be made more efficient.

Starting from Haswell, Intel boosted the most important REP instructions, which are now more convenient than equivalent versions using proper SIMD subroutines.
Quote:
In a limited-complexity CISC, microcode could be almost "free", as the extras x86 requires aren't necessary.

What kind of x86 "extras" do you mean here? Because my NEx64T ISA is a x86/x64 "redesign", but most of the legacy is essentially gone in the ISA/uArch or very limited (and optional to implement, in this case), while keeping 100% source assembly compatibility (and even more in 64-bit mode: all x86 instructions removed on x64, like BOUND, PUSHA, POPA, etc., are still usable).
Quote:
Quote:
At least with my proposal you're still able to gain 64-bit and 8 more data registers without using prefixes, while keeping most of 68K advantages (code density included).

Except it's no longer a 68k processor. My feeling is that if one is to make something new, one shouldn't be limited to being like the 68k, and if one is to be 68k compatible, extensions should be as natural as possible. Prefixes are ugly hacks, but they allow the 68k to still be a 68k, with some density decrease of course.

It's fine for me. One question here: what do you think about x64 vs x86, and ARM64 vs ARM32?
Quote:
Quote:
The 16-bit opcode space cannot be orthogonal as it was with the 68K, of course. Rethinking the 68K ISA needs a different mindset here: 16-bit opcodes should be seen not as regular instructions, but as compact versions of more general ones (which are 32 bits in size), as happens on other modern ISAs. 16-bit opcodes are there to save space, full stop.

A good idea IMHO. Even going the RISC-V way and being non-orthogonal for compact instructions, with a varying number of register bits etc., could be a good idea for real-world usage with compiled code. But not many 68k assembly coders would like something like that.

Assemblers can transparently choose the proper instruction, so assembly coders shouldn't care so much. Unless they want/need to carefully plan which registers to use, in order to achieve the best code density.
Quote:
Quote:
I understand your point, but having looked at a lot of disassembled code and collected statistics (limited, OK, but at least I have some data), I can see that 64-bit versions of the same applications (FirebirdSQL, FFMPEG, Photoshop CS6 public beta, Unreal Engine) don't take advantage of the possibility to handle 64-bit instead of 32-bit data. One evident benefit could be using 64-bit immediates; however, looking at how many MOV REG,Imm64 instructions are found in the code leads to the conclusion that they are rare birds, and there's substantially no gain in either instruction-count reduction or code density.

They are rare, and most values can be represented as sign-extended 32 bits. However, one shouldn't forget that while AMD64 is better than most (all?), its 64-bit immediates are special cases rather than a standard feature.

They are special because the only instruction which supports them is MOV. That means that if another instruction needs to use a 64-bit immediate, it must first load it into a register and then use that register. But the point is that the 64-bit immediate will be used anyway. See a bit more below.
Quote:
I think a 68k64 version should support 64-bit immediates even though they are rarely useful: it's orthogonal, it follows the general 68k design, and it doesn't need special handling by the compiler.

That's granted: it's part of the 68K ISA, and there shouldn't be an exception here. Any instruction which uses EA = immediate with Size = 64-bit will have a 64-bit value following.

I did the same thing with my ISA: all instructions (except the shift ones) which have an immediate can directly use a 64-bit value if their size is 64-bit.
Quote:
So an optimizing compiler is likely to make different choices on 68k64 than on any other 64-bit processor in at least some cases - skewing the statistics a bit.

I don't think so, at least compared to x64. The point is that if a 64-bit immediate value is needed on x64, then it's loaded using the proper MOV instruction, and that kind of instruction is quite rare, because full-sized (e.g. >32-bit) 64-bit values are uncommon.

Some data here, taking 64-bit Excel:

Immediates:
Size Count
IMM32 53862
IMM16 8456
IMM64 2345


I've removed immediates < 16-bit, because they take the largest part (of course), but the comparison between 32- and 64-bit data is significant. The 16-bit stats are strange (compared to the 32-bit ones), but it isn't a sporadic case (I found a similar pattern in other 64-bit executables).

So, I think that the x64 stats about 64-bit immediate usage are quite general, and can be taken as a reference for other ISAs too.
Quote:
Quote:
What you can see by looking at the same application compiled in 32- and 64-bit is that most of the operations are 32-bit in the first binary, with a minor amount of byte operations and rare 16-bit operations. Whereas in the second binary the operations are almost always 32- and 64-bit, with very rare byte operations and almost zero 16-bit ones; so, basically, there's a good mixture of 32- and 64-bit operations, which leads to decreased code density due to more prefix usage for 64-bit operations.

And also here there is a source of skewed data: x86-32 and AMD64 (by extension) require a prefix for word operations, increasing size. Many word-size operations are also slower than dword or quadword operations, decreasing the chance that an optimizing compiler would choose them when an alternative is possible.

Good point. This might be the case, yes. We know that on the 68K we tried to use word operations as much as possible, because they were the fastest (on the 68000 and 68010) compared to longword ones.
Quote:
For a 64-bit 68k this isn't a problem. Word operations need no prefix, nor do byte or long operations. Even some quad-word operations could avoid prefixes, for example:
MOVE.Q #$0123456789abcdef, D0
ADD.Q #$0123456789abcdef, D5
ORI.Q #$0123456789abcdef, (A0)+ ; not 100% compatible

How do you encode them? Using a reserved EA mode for specifying a 64-bit immediate operation?
Quote:
Quote:
Yes, in theory. However, in practice applications compiled in 32- and 64-bit flavors have quite different mixtures of operations, as I've said before, and it would be wise to take advantage of that (if it's possible, of course).

The problem, as I mentioned above, is that the mapping can't be 100%, at least for my vision of a 64-bit 68k. How much of the data can be directly translated to a new extension, and how much of it is an artifact of different architectures/microarchitectures? I have no idea but suspect there can be a lot of artifacts.

That's true. However, at least for 64-bit immediates, see above.
Quote:
Quote:
Anyway, and to give a general answer to your quote: why not use a longer opcode then, instead of a prefix? You can optimize instructions better with a longer opcode, because you can almost completely get rid of the unused/not-useful encodings coming from the application of the prefix to ALL existing opcodes. With the additional, clear advantage of simplifying the ISA implementation.

Why not longer new opcodes? To try to avoid modes while staying 68k compatible. :)
We don't seem to disagree too much about anything, we're just looking at the same thing with different goals.

Exactly. To quickly recap:
- you want to fully keep the 68K ISA / opcode table, even for 64-bit operations, using prefix(es);
- Matt wants to keep it only for 32-bit operations, introducing a 64-bit mode (without prefixes) which shares most of the 68K ISA / opcode table, rewriting it partially to make space for 64-bit operations;
- I prefer a completely new re-encoding for both 32- and 64-bit instructions / modes, without using prefixes (they can be avoided by a careful redefinition of opcodes).

Diversity isn't a bad thing by itself.
Quote:
Quote:
I know, from what I've read, that you and Matt want to extend the existing 68K ISA, and you're trying to find solutions to the problems which we have talked about. However, when an ISA reaches a critical mass of issues (and the 68K has collected many of them), then I think that it's better to completely rebuild it, taking the good parts.

That's what I've done with my x86/x64 "re-encodings". You know that both ISAs make common use of prefixes, but my ISAs have no prefixes at all, while still keeping the same possibilities AND also bringing A LOT of new features and enhancements. And they are... TRIVIAL to decode: a bit more complicated than Thumb-2 (to give an idea of the instruction formats to handle), but not that far off (only a few bits are needed from the first bytes in order to get the full instruction length, plus a lot of useful information about the instruction type and what it needs).

IMO it's worth seriously thinking about a 68K successor, because I see similar possibilities here.

It's just that IF one is to make a new processor, why not try to make something a bit different?
Like going back to word-addressed memory but supporting fast extract/insert for byte/word/long/quad/whatever - avoiding the problems of unaligned memory for super-high-frequency pipelines. Or supporting compound instructions, extracting more performance with shorter pipelines. Or having an explicit hierarchical register design to reduce renaming overheads and simplify out-of-order execution? Perhaps a RISC with segmentation-based addressing?

Just another CISC or RISC isn't too exciting and is unlikely to gain a following. Extending the 68k could possibly carve out a tiny niche, at least among retro computing enthusiasts and the few places that still have 68k code running.

I agree, and that could be a reasonable plan for a 68K "successor".

Even with my new ISA I've added some features which aren't usually found on other processors, and I've implemented some novel stuff: it isn't just an x86/x64 re-encoding, and actually it's more 68K-inspired in many choices/decisions.
Quote:
Quote:
Whereas a 32-bit ad-hoc encoding will be much easier to implement.

Don't know if that's true - the prefix is designed to be transparent: an instruction without a prefix is decoded as if the extension bits were zero, while an instruction with a prefix decodes as normal and the prefix decodes in parallel, with the extension bits set accordingly.
Why is that relevant?

Let's assume the internal operation format is like: operation opsize reg_source1, reg_source2, reg_destination, immediate64
ASL.W #7, D5 decodes to the micro-op ASL WNX D5 - D5 #7

Which may be encoded as:
0000 0001 0101 xxxx 0101 0000...0111

ASL.W #63, D13 decodes to ASL WNX D5 - D5 #7 (SAME AS ABOVE!)

But the prefix decoder inserts the correct bits so the micro-operation result is:
0000 0001 1101 xxxx 1101 0000...0011 1111

ASL.W #7, D10, D13 decodes to ASL WNX D5 - D5 #7

The prefix decoder inserts the extension bits _and_ change the source register field:
0000 0001 1010 xxxx 1101 0000...0111

ASL.WSX #7, D10, D13 is similar with the result:
0000 0101 1010 xxxx 1101 0000...0111

ASL.Q #7, D10, D13 gives:
0000 1010 1010 xxxx 1101 0000...0111

As you see it's all designed to be easy to decode in parallel with a very inexpensive fusion of prefix extension bits and the micro-operation bits from the proper decoder.

In this example, opsize: xx00 = byte, xx01 = word, xx10 = long; 00xx = no extension with the upper 32 bits zeroed, 01xx = sign extension to the full register length, 10xx = zero extension.
As a no-extension long operation (0010) sets the upper 32 bits to zero, the zero-extending long format (1010) is repurposed for quadword operations, which of course need no type of extension.

The complication would be selecting whether four bits should be routed to replace one register field or to replace bits 6:3 in the immediate field. So two four-bit muxes plus a little decoding. Ugly, but not exactly hard to do. And it reuses the instruction formats already supported, leaving free opcode space for new things.

I understand your mechanism, and I agree: the implementation is cheap, simple and fast.

I have implemented a similar thing with my ISA, but not using prefixes: just longer opcodes when I want to extend a "base" instruction in some way(s). My design looks more like a Matryoshka.
Quote:
(There is the problem of having the prefix decoder completely separate from the opcode decoder: prefixes can be redundant or do useless things, like replacing bits 6:3 in a long immediate for instance. Simple fix: don't do that. A slightly harder fix: detect those cases after the main decode and trigger an illegal-instruction exception - as it isn't on a critical path, the prefix-opcode decode dependencies aren't a problem anymore.)

This is an exception, and it's not critical to handle / implement.
Quote:
Quote:
But I've some data and a bit of experience: see above.
You can also collect statistics about existing applications, and you'll figure out for yourself how things stand.

I'm thinking now about using some profiler tool to instrument the execution of common applications, to get many more decoded instructions plus dynamic usage data for them (currently I only have static analysis).
This will give a lot of useful information and the possibility to generate more stats for my ISA. I still cannot make use of several enhancements which it provides, but even a 1:1 translation will be enough to show the advantages in terms of code density, and what can be achieved with proper compiler support.

That last thing can be a huge problem. It isn't easy to change a good 68k-compatible compiler to generate optimized code for some theoretical extension - as there isn't any really good compiler to change. That's my impression at least - hopefully a wrong one.

I think that it'll be much easier for your 68K extension. Using prefixes you can already recycle the whole 68K backend, and you can insert prefix usage in specific parts, without altering the compiler code too much.

Matt's 68K 64-bit extension requires much more effort.

Mine... no comment.
Quote:
(Am very tired so there are likely a lot of errors/thinkos)
(Edit: there were!)

As a non-native English speaker (albeit I was born in the USA), I'm the last one who can point a finger at someone else for errors.

cdimauro 
Re: 68k Developement
Posted on 7-Oct-2018 15:32:44
#411
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:
matthey wrote:
Quote:
cdimauro wrote:
In many discussions I've seen people asking for more data registers (actually it seems that most coders prefer to have more data registers instead of address registers. I'm for more address registers, but it seems that I'm part of a minority).

The data registers are more general purpose, probably easier for a compiler to take advantage of, and more of them are easier to add. Address registers are base/index registers and important to the addressing modes, which are a strength of the 68k. There are fewer address registers available to begin with, as a7 (sp) is taken. If the a5 (frame pointer) and/or a4 (small data) registers are used as well, then there *is* more of a shortage of address registers. We can free a5 with frame pointers on the stack, and often a4 with better pc-relative support. Not locking the library base to a6 may help as well.

A6 is there to stay, at least on the Amiga o.s. and its reimplementations. A4 and A5 can be freed, as you've written.

What I don't like about the 68K is the incomplete orthogonality of the ISA regarding address register usage, like using them as indexes in the more complex addressing modes: it shouldn't have been allowed, because that's the role of data registers.
Quote:
I would like to bring some of the very compatible enhancements to 68k_32 as well but if the code density of 68k_64 mode is very close then it is less important.

It should be close, from what you've stated 'til now.
Quote:
I expect 68k_32 and 68k_64 to be more compatible than x86 and x86_64.

Source assembly wise, right?
Quote:
Quote:
No, I wasn't talking about that. I was talking about the possibility of changing some flag which can alter the meaning / decoding of some instructions, like what the 65C816 processor did when it extended the 65C02 to a 16-bit ISA.

My ISA allows changing the decoding of some instructions and/or the memory addressing modes, based on configurable flags, in order to better match an application-specific use case (the compiler and/or the developer decides how to configure it).

This could be useful, especially for compatibility. For example, being able to configure the alignment of the stack pointers would be great. It is my understanding that this kind of state/mode change can cause problems in some cases and it is generally preferable to use instructions which specify the configuration. For example, the 68k FPU instructions which specify the rounding mode are preferable and faster than changing the rounding mode in the FPCR register.

No, my idea is to set such a register before starting the execution of the code, and usually leave it unchanged. It doesn't make sense to have specific instructions to change part of it, because the features that it enables or disables don't need to be changed often.

Yes, it also allows changing the stack pointer size and alignment (e.g. the minimum number of bytes to push).
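A rough sketch of what I have in mind (the layout and field names are invented for illustration, nothing final):

Code:
#include <stdint.h>

/* Hypothetical per-application ISA configuration word: set once before the
   code starts executing, then normally left untouched. */
typedef struct {
    uint8_t  sp_alignment;   /* stack pointer alignment in bytes (e.g. 2, 4, 8) */
    uint8_t  sp_min_push;    /* minimum number of bytes pushed per operation    */
    uint16_t decode_flags;   /* bits selecting alternate instruction decodings  */
} isa_config;

static const isa_config default_config = { 8, 8, 0 };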
Quote:
Quote:
If you change the 68K opcode table, removing / replacing instructions to make space for 64-bit encodings, then you no longer have a 68K ISA, but a 68K-inspired one.

Most programmers won't notice the difference until they find something missing. I'm hoping to minimize what is missing and keep the 68k "feel" in 64 bit mode.

The different opcode structure will only affect people who write compilers, assemblers, and disassemblers, and this should be acceptable.
Quote:
Quote:
Anyway, larger vector sizes also mean more clock skew issues. That's why Intel's high performance SIMD implementations LOWER the clock when running intensive AVX/AVX2 and, especially, AVX-512 code.

Having to lower the CPU clock speed would put a cap on SIMD units becoming wider. It makes a 256 bit wide SIMD unit look more appealing.

This is only a micro-architectural issue.

You can have a SIMD unit with huge vector registers, where such instructions are internally split into uops handling half (or even a quarter of) the size, to reduce the implementation costs. That's what Intel did with the Pentium III when it first introduced the new SSE SIMD (which internally split such 128-bit instructions into two 64-bit uops). And that's what AMD did with its first Ryzen processors (where 256-bit AVX/AVX2 instructions are split into 2 128-bit uops).

The advantage is that the SIMD ISA remains exactly the same, so the same amount of instruction bytes is fetched.
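To make the point concrete, a plain C stand-in (no intrinsics, nothing vendor-specific): the ISA exposes one 256-bit add, and the implementation is free to execute it as two 128-bit halves; the fetched code is identical either way.

Code:
#include <stdint.h>

typedef struct { uint32_t lane[8]; } v256;   /* one architectural 256-bit register */

/* One architectural operation: a narrow core can execute it as two 128-bit
   (4-lane) micro-ops, a wide core in a single pass; the program and the
   bytes it fetches stay exactly the same. */
static v256 vadd256(v256 a, v256 b)
{
    v256 r;
    for (int half = 0; half < 2; half++)          /* the "two uops" view */
        for (int i = 0; i < 4; i++)
            r.lane[half * 4 + i] = a.lane[half * 4 + i] + b.lane[half * 4 + i];
    return r;
}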

cdimauro 
Re: 68k Developement
Posted on 7-Oct-2018 16:14:35
#412 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:
matthey wrote:
Quote:
megol wrote:
No, they almost never run out of registers, as almost no code requires more than the number of free registers for temporary values. Which is by choice.
But do the math: what if the RISC runs out of registers 10% of the time and then has an 80% overhead, compared to a CISC running out of registers 20% of the time (remember this isn't linear) with a 5% overhead? Assuming similar execution resources I think (but welcome corrections) it would be something like:
RISC performance = frequency * (100%*(100%-10%) + (100%-80%)*10%) = frequency * 0.92
CISC performance = frequency * (100%*(100%-20%) + (100%-5%)*20%) = frequency * 0.99

Looks good for CISC, right? We could try to put in some more realistic numbers, which would mean the RISC running out of registers less than 1% of the time and a higher overhead for the CISC (occupying a load port). But one would also need to account for the decreased frequency of the CISC pipeline; the longer CISC pipeline also means a higher cost when mispredicting branches.

For the RISC above to perform as well as the CISC in the calculation above, it would have to have an 8% higher frequency: a 108MHz RISC against the 100MHz CISC. And this is with both exaggerated costs and an exaggerated frequency of register spill/fill.

It's difficult to know what numbers to plug in to the equations. I suspect running out of registers is less common.

registers available | load/store %
                 24 | 28.21%
                 22 | 28.34%
                 20 | 28.38%
                 18 | 28.85%
                 16 | 30.22%
                 14 | 31.84%
                 12 | 34.31%
                 10 | 41.02%
                  8 | 44.45%

Source: "High-Performance Extendable Instruction Set Computing", MIPS ISA

Each register spill from running out of registers requires at least 2 loads/stores. Going from 8 to 16 registers reduces the load/store percentage by 14.23 points, but going from 16 to 24 registers reduces it by only 2.01. The overhead cost of a RISC running out of registers makes it pretty easy to see that register spills are common with 8 registers and already uncommon at 16 registers (I would guess a less than 10% out-of-registers percentage, and probably less than 5% with 16 registers). This is older, "various benchmark" code but probably doesn't vary too much from typical code today.
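To make megol's back-of-the-envelope formula above concrete (the spill probabilities and overheads are the exaggerated example values from the quote, not measurements):

Code:
#include <stdio.h>

/* relative performance = time at full speed + time slowed by spill overhead */
static double relative_perf(double p_spill, double spill_overhead)
{
    return (1.0 - p_spill) + p_spill * (1.0 - spill_overhead);
}

int main(void)
{
    printf("RISC: %.2f\n", relative_perf(0.10, 0.80));   /* prints 0.92 */
    printf("CISC: %.2f\n", relative_perf(0.20, 0.05));   /* prints 0.99 */
    return 0;
}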

The study is very interesting, but the numbers don't look close to other code density studies/benchmarks. For example, 68K (68000, 68020) and 80386 code density is very similar, but Thumb (not even Thumb-2!) is shown to be far better than those ISAs.

Maybe the compiler used didn't generate good code for those ISAs. Or the programs and libraries used for the benchmark (which are not 100% known, unfortunately) didn't include several kinds of code. It's strange that they didn't use the well-known SPEC suite (maybe because it's costly? Unfortunately it's not free. -_-).
Quote:
Quote:
The main power draw of a register file is due to switching, not just existing. So if we are to compare the power draw of a CISC with memory operands and a RISC with register operands, we'd have to include the overheads of a cache and the load/store unit in the equation.
When delivering operands to execution units, registers will win.

One can look at real world high performance CISC implementations, x86 and IBM Z and see that they have large register files even though they support memory operands.

68k 16x32
x86 8x32
x86_64 16x64
z/Architecture 16x64
68k_64 16x64

Modern CISC integer register files look to be more medium sized to me. Specific CPU designs use more registers internally. Other units use larger register files.

NEx64T/32 32x32
NEx64T/64 32x64

With binary (and unary) instructions available in Mem-Mem format (but using longer opcodes).
Quote:
The 68060 pipeline is generally considered to be an 8 stage design (worst case branch mis-prediction is 8 cycles). The Alpha 21164 has a 7 stage integer and 10 stage FPU pipeline. 68060 instructions which hit in the cache are usually single cycle. The Alpha 21164 is an extremely aggressive 4 wide superscalar high performance design with a dual ported L1 DCache compared to the more balanced (up to 3 instruction issue) superscalar 68060 design. Some people thought it was unfair to compare the 68060 to the more aggressive Pentium design but they are closer than the Pentium and Alpha. The energy requirements give a hint at the aggressiveness of the design.

Alpha 21164@300MHz 3.3V, .5um, 9.3 million transistors, 51W max

Pentium@75MHz 80502, 3.3V, 0.6um, 3.2 million transistors, 9.5W max
68060@75MHz 3.3V, 0.6um, 2.5 million transistors, ~5.5W max *1
PPC 601@75MHz 3.3V, 0.6um, 2.8 million transistors, ~7.5W max *2

There could be 3 68060 CPUs operating in parallel using fewer transistors, or 9 68060 CPUs in parallel for less than the energy usage of the Alpha 21164, and that is with the Alpha using a smaller process geometry. Too bad Motorola was not designing multi-core CPUs back then.

I still think that the comparison is unfair for the reason which I gave in the past: it was too easy for Motorola to continuously cut features and instructions from the ISA, thus saving transistors and power.

Intel showed that this can be done with its processors for the embedded market, with the Quark family.
Quote:
Quote:
So maybe it's better to use 16 registers, or at least do it the 68k way with split D/A registers. 4 register bits x 2 + 2 bits for operation size + 3 bits for address mode leaves 16-(8+2+3)=3 bits for the rest of the opcode, with a maximum of 7 usable opcodes. However 2 bits for the address mode may be more practical: (An), (An+disp16), (An)+, -(An) for instance, with other modes requiring a long instruction format. That gives a maximum of 15 opcodes, which may be enough.
Or one could do it like RISC-V: a smaller number of usable registers in the small instructions, with full 32 bit instructions having 5 bit register fields.

I found some good uses for the unused 68k 6 bit EA encodings. They allow compressed immediates as well as a fast (d32,pc) addressing mode, which is very useful for a 64 bit ISA.

Those are good and expected additions to the 68K ISA.
Quote:
Quote:
However I don't agree with code density being very important. Even small microcontrollers don't strive for maximum density anymore - there's no need. Don't read this as me saying density doesn't matter; it's just not the most important thing.

Some simple micro-controllers don't need good code density if the embedded code being executed is tiny. Instruction fetch is expensive if needed. Instruction supply consumes 42% of a 32 bit RISC embedded CPU as shown in the following link.

http://www.cast-inc.com/blog/consider-code-density-when-choosing-embedded-processors

Another interesting link: thanks.

matthey 
Re: 68k Developement
Posted on 8-Oct-2018 1:05:27
#413 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2015
From: Kansas

Quote:

cdimauro wrote:
A6 is there to stay, at least on Amiga o.s. and rewritings. A4 and A5 can be saved, as you've written.


The Amiga library pointer (usually a6) could easily vary in a new 64 bit 68k mode. It would be beneficial to make some changes to library structures and format for 64 bit. Even a 32 bit mode may be able to take advantage of pc relative libraries with a (d32,pc) addressing mode and a version check. It is more important to get rid of a4 use in libraries including geta4() and __saveds of functions. These are error prone, difficult to understand and require specific compiler support.
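For what it's worth, one way library code can sidestep the a4/__saveds machinery entirely is to not touch globals at all and pass its data around explicitly (a plain C sketch, nothing Amiga-specific assumed; all names are invented):

Code:
/* Instead of __saveds/geta4() setting up a global data base register,
   the library's per-instance data travels as an explicit parameter. */
struct mylib_data {
    long open_count;     /* example field, purely illustrative */
};

long MyLib_Bump(struct mylib_data *ld)
{
    return ++ld->open_count;   /* no a4-relative global access needed */
}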

Quote:

Source assembly wise, right?


Yes. The 68k_64 mode should be mostly source compatible but not binary compatible. Many encodings will be the same as the 68k though.

Quote:

No, my idea is to set such a register before starting the execution of the code, and usually leave it unchanged. It doesn't make sense to have specific instructions to change part of it, because the features that it enables or disables don't need to be changed often.

Yes, it also allows changing the stack pointer size and alignment (e.g. the minimum number of bytes to push).


Do you then cause an exception if someone tries to configure it after a certain point?

It would be nice to be able to configure the stack pointer alignment at any point. No alignment would allow a7 to be used as a fully orthogonal address register. If I allow 68k_32 and 68k_64 modes at the same time, they will surely have different settings (at least 32 bit alignment for 64 bit mode). It would also allow to execute ColdFire code using a 32 bit aligned stack for compatibility.

Quote:

This is only a micro-architectural issue.

You can have a SIMD unit with huge vector registers, where such instructions are internally split into uops handling half (or even a quarter of) the size, to reduce the implementation costs. That's what Intel did with the Pentium III when it first introduced the new SSE SIMD (which internally split such 128-bit instructions into two 64-bit uops). And that's what AMD did with its first Ryzen processors (where 256-bit AVX/AVX2 instructions are split into 2 128-bit uops).

The advantage is that the SIMD ISA remains exactly the same, so the same amount of instruction bytes is fetched.


It looks like there is more disadvantage than advantage if SIMD operations need to be broken into smaller operations. I would rather not require breaking instructions down into uops either. An SIMD unit does not require an aggressive OoO CPU. Mid-performance multi-core CPUs with SIMD units may have some advantages like more cores with more total SIMD performance.

Quote:

cdimauro wrote:
The study is very interesting, but the numbers don't look close to other code density studies/benchmarks. For example, 68K (68000, 68020) and 80386 code density is very similar, but Thumb (not even Thumb-2!) is shown to be far better than those ISAs.

Maybe the compiler used didn't generate good code for those ISAs. Or the programs and libraries used for the benchmark (which are not 100% known, unfortunately) didn't include several kinds of code. It's strange that they didn't use the well-known SPEC suite (maybe because it's costly? Unfortunately it's not free. -_-).


The compiler and compiler options make a huge difference in benchmarks. The EISC paper has several suspect methods and results. The compiler was ancient EGCS 1.1 and the "various benchmarks" were not given. The 68k results are an outlier compared to most other papers I've seen which usually places the 68k in the top 3 for code density (Dr. Vince Weaver's results originally showed the 68k to have barely better than average code density until I ran it through a good peephole optimizing assembler). I can only hope the author of the EISC paper was able to gather relative stats from a reasonable amount of code. This is not as difficult as getting a compiler to generate good code quality for many targets.

Quote:

NEx64T/32 32x32
NEx64T/64 32x64

With binary (and unary) instructions available in Mem-Mem format (but using longer opcodes).


There aren't very many new CISC ISAs as RISC ISAs have generally improved and become more CISC like hybrids. It is nice to know that 32 registers with good code density is possible.

Quote:

I still think that the comparison is unfair for the reason which I gave in the past: it was too easy for Motorola to continuously cut features and instructions from the ISA, thus saving transistors and power.

Intel showed that this can be done with its processors for the embedded market, with the Quark family.


Motorola mostly cut instructions from the CPU designs rather than from the ISA. Most missing instructions were still supported by software emulation. Compatibility of old user level code was very good and performance was usually adequate because the new CPU was faster.

bhabbott 
Re: 68k Developement
Posted on 8-Oct-2018 6:58:38
#414 ]
Regular Member
Joined: 6-Jun-2018
Posts: 338
From: Aotearoa

Quote:

cdimauro wrote:

What I don't like about the 68K is the incomplete orthogonality of the ISA regarding address register usage, like using them as indexes in the more complex addressing modes: it shouldn't have been allowed, because that's the role of data registers.
Not permitting the use of address registers for indexing would make the ISA less orthogonal.

The program I am currently writing uses A4 as a 'virtual address' indexing into two different arrays (pointed to by A5 and A6). It seems more natural to use an address register for this purpose, and it frees up a data register for things that address registers can't do (bitmasking, byte operations etc.). Sometimes I need two different 'addresses', which would further reduce the number of available data registers, and I am already using all of them! Using an address register also permits incrementing the 'address' without affecting flags that might have been set by an earlier operation.

I considered using data registers for indexing but it would have made the code larger, slower, and harder to write. Being forced to use data registers for indexing while address registers lay spare would have been frustrating. Sensibly Motorola didn't make that restriction - another reason why 68000 code is such a joy to write compared to many other ISAs.
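In C terms it's nothing exotic, just one index walking two arrays; the interesting part is only which registers end up holding what (in my assembly A4 is the index and A5/A6 the two base pointers). A trivial sketch of the shape of it:

Code:
/* One 'virtual address' i indexes two different arrays. In the hand-written
   68k version i lives in A4 and the two base pointers in A5/A6; the C is
   only here to show the access pattern. */
void remap(const unsigned char *src, unsigned char *dst, unsigned long n)
{
    for (unsigned long i = 0; i < n; i++)
        dst[i] = src[i];       /* each access is a base + index reference */
}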

cdimauro 
Re: 68k Developement
Posted on 8-Oct-2018 19:42:10
#415 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:
matthey wrote:
Quote:
cdimauro wrote:
A6 is there to stay, at least on Amiga o.s. and rewritings. A4 and A5 can be saved, as you've written.

The Amiga library pointer (usually a6) could easily vary in a new 64 bit 68k mode. It would be beneficial to make some changes to library structures and format for 64 bit.

This might work only for AROS: the only Amiga o.s. reimplementation which currently supports 64-bit.
Quote:
Even a 32 bit mode may be able to take advantage of pc relative libraries with a (d32,pc) addressing mode and a version check.

How do you think this new addressing mode can be used to avoid the use of A6 in this case?
Quote:
Quote:
No, my idea is to set such a register before starting the execution of the code, and usually leave it unchanged. It doesn't make sense to have specific instructions to change part of it, because the features that it enables or disables don't need to be changed often.

Yes, it also allows changing the stack pointer size and alignment (e.g. the minimum number of bytes to push).

Do you then cause an exception if someone tries to configure it after a certain point?

No. My idea is to have this instruction "o.s. controllable", more or less like the I/O map for an application. So, it raises an exception if the o.s. doesn't allow its use. Otherwise the instruction is executed and the L1 cache (and the L0 cache, if present) will be flushed.
Quote:
It would be nice to be able to configure the stack pointer alignment at any point. No alignment would allow a7 to be used as a fully orthogonal address register. If I allow 68k_32 and 68k_64 modes at the same time, they will surely have different settings (at least 32 bit alignment for 64 bit mode). It would also allow to execute ColdFire code using a 32 bit aligned stack for compatibility.

It should be enough to load this special register per task/process/thread, exactly like the pointer to its MMU table.
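Roughly like this (a minimal sketch; the structure and helper names are invented, nothing from a real kernel):

Code:
#include <stdint.h>

/* The ISA configuration word is just more per-task state, saved and
   restored alongside the MMU root pointer. */
struct task_context {
    uint64_t regs[16];
    void    *mmu_root;      /* pointer to this task's translation table */
    uint32_t isa_config;    /* this task's ISA configuration word       */
};

/* Hypothetical low-level helpers provided by the kernel/HAL. */
extern void load_mmu_root(void *root);
extern void load_isa_config(uint32_t config);

void switch_to(struct task_context *next)
{
    load_mmu_root(next->mmu_root);
    load_isa_config(next->isa_config);
    /* ...then restore the registers and resume the task... */
}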
Quote:
Quote:
This is only a micro-architectural issue.

You can have a SIMD unit with huge vector registers, where such instructions are internally split into uops handling half (or even a quarter of) the size, to reduce the implementation costs. That's what Intel did with the Pentium III when it first introduced the new SSE SIMD (which internally split such 128-bit instructions into two 64-bit uops). And that's what AMD did with its first Ryzen processors (where 256-bit AVX/AVX2 instructions are split into 2 128-bit uops).

The advantage is that the SIMD ISA remains exactly the same, so the same amount of instruction bytes is fetched.

It looks like there is more disadvantage than advantage if SIMD operations need to be broken into smaller operations. I would rather not require breaking instructions down into uops either. An SIMD unit does not require an aggressive OoO CPU. Mid-performance multi-core CPUs with SIMD units may have some advantages like more cores with more total SIMD performance.

This is exactly the reason why having a SIMD unit with large vector registers is a good thing.

For mid-performance, the SIMD implementation can be cheap, not using an aggressive design. So you can split the instruction into 2 uops, or it simply takes longer in the execution pipeline. The P3 design.

For high-performance, you can have it executed OoO, even together with other SIMD instructions at the same time. The P4 design.

The difference? Absolutely nothing: the code is exactly the same, and doesn't need to be recompiled.
Quote:
Quote:
cdimauro wrote:
The study is very interesting, but the numbers don't look close to other code density studies/benchmarks. For example, 68K (68000, 68020) and 80386 code density is very similar, but Thumb (not even Thumb-2!) is shown to be far better than those ISAs.

Maybe the compiler used didn't generate good code for those ISAs. Or the programs and libraries used for the benchmark (which are not 100% known, unfortunately) didn't include several kinds of code. It's strange that they didn't use the well-known SPEC suite (maybe because it's costly? Unfortunately it's not free. -_-).

The compiler and compiler options make a huge difference in benchmarks. The EISC paper has several suspect methods and results. The compiler was ancient EGCS 1.1 and the "various benchmarks" were not given. The 68k results are an outlier compared to most other papers I've seen which usually places the 68k in the top 3 for code density (Dr. Vince Weaver's results originally showed the 68k to have barely better than average code density until I ran it through a good peephole optimizing assembler). I can only hope the author of the EISC paper was able to gather relative stats from a reasonable amount of code. This is not as difficult as getting a compiler to generate good code quality for many targets.

Ah, so I'm not the only one who thinks that this benchmark doesn't smell good. I agree with your observations as well.
Quote:
Quote:
NEx64T/32 32x32
NEx64T/64 32x64

With binary (and unary) instructions available in Mem-Mem format (but using longer opcodes).

There aren't very many new CISC ISAs as RISC ISAs have generally improved and become more CISC like hybrids. It is nice to know that 32 registers with good code density is possible.

Personally, I don't consider almost any of the currently available RISC processors to be RISCs: they are CISCs, at least looking at the design points of the first RISCs.

They are labeled RISC only because, unfortunately, the mantra that "RISC is good and CISC is bad" is deeply rooted in the collective imagination, and IMO this also "banned" new research and (real) CISC designs.
Quote:
Quote:
I still think that the comparison is unfair for the reason which I gave in the past: it was too easy for Motorola to continuously cut features and instructions from the ISA, thus saving transistors and power.

Intel showed that this can be done with its processors for the embedded market, with the Quark family.

Motorola mostly cut instructions from the CPU designs rather than from the ISA. Most missing instructions were still supported by software emulation. Compatibility of old user level code was very good and performance was usually adequate because the new CPU was faster.

But if you compare the 68060 (and the 68040 as well, where Motorola started with the drastic cuts) with the Pentium, the feature differences are so huge that they cannot be put on the same plane.

I've already reported all such differences. They cannot be ignored when making an honest comparison.

cdimauro 
Re: 68k Developement
Posted on 8-Oct-2018 20:00:22
#416 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@bhabbott Quote:
bhabbott wrote:
Quote:
cdimauro wrote:

What I don't like about the 68K is the incomplete orthogonality of the ISA regarding address register usage, like using them as indexes in the more complex addressing modes: it shouldn't have been allowed, because that's the role of data registers.

Not permitting the use of address registers for indexing would make the ISA less orthogonal.

Yes, and this is my point: if you (Motorola) have chosen a separate data and address register model, then it's better to have them work as designed, without mixing their natural usages.

I'll explain it better below.
Quote:
The program I am currently writing uses A4 as a 'virtual address' indexing into two different arrays (pointed to by A5 and A6). It seems more natural to use an address register for this purpose, and it frees up a data register for things that address registers can't do (bitmasking, byte operations etc.). Sometimes I need two different 'addresses', which would further reduce the number of available data registers, and I am already using all of them!

I understand it: you're running out of data registers, and you take advantage of the current 68K implementation, which allows using address registers where a data register would be the natural choice.

However I think that an address register should be used only for operations which make sense for pointer manipulation. And nothing else. So even adding two address registers together should be forbidden: we don't add pointers! Subtraction, instead, is a valid operation (but the result should go... to a data register! For obvious reasons).
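C itself draws the same line, by the way: subtracting two pointers is well defined and yields a plain integer ("data register" material), while adding two pointers isn't even allowed:

Code:
#include <stddef.h>

ptrdiff_t span(const char *start, const char *end)
{
    return end - start;        /* valid: the difference is a plain integer */
    /* return end + start; */  /* invalid C: you don't add two pointers    */
}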
Quote:
Using an address register also permits incrementing the 'address' without affecting flags that might have been set by an earlier operation.

This is a problem with 68K, because sometimes it's not desirable to set the flags even when using data registers.

With megol's proposal, a prefix with a "suppress flags modification" option (bit) might be a suitable solution.

In my ISA I have a similar mechanism (with some longer opcodes), but it works in a "complementary" manner: it allows suppressing the flags modification for instructions that normally change them, and vice versa setting the flags for the ones which normally don't (e.g. MOV).

As you can see, there are coherent solutions to solve your problem, without being forced to use an address register.
Quote:
I considered using data registers for indexing but it would have made the code larger, slower, and harder to write. Being forced to use data registers for indexing while address registers lay spare would have been frustrating. Sensibly Motorola didn't make that restriction -

Motorola allowed it to save space in the implementation. But it mixed (and ruined, IMO) the data and address register model.

A stricter separation would have allowed 16 data registers (reusing the current An EA mode) and 8 address registers, with all data registers usable as indexes in the (d8,An,Xn) -> (d8,An,Dn) EA mode (with Dn -> D0..D15).
Quote:
another reason why 68000 code is such a joy to write compared to many other ISAs.

I still enjoy 68K, but see above.

matthey 
Re: 68k Developement
Posted on 8-Oct-2018 21:18:03
#417 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2015
From: Kansas

Quote:

bhabbott wrote:
Not permitting the use of address registers for indexing would make the ISA less orthogonal.


Absolutely true. Much functionality would be lost in converting address (base/index) registers into base-only registers. While address registers support a limited number of instructions (MOVE, LEA, ADD, SUB, CMP, TST), these instructions generally account for at least 2/3 of executed integer instructions (more like 3/4 if conditional branches, which don't use a data register, are included). The next most common operations are OR, shift and AND, and the 68k addressing modes already provide the most common index shifts (scaling). This is why allowing address register indexes is almost as good as having 16 GP registers.

The 68k was heavily influenced by the 16 bit DEC PDP-11. The much loved PDP-11 was capable enough to allow C and Unix to first become popular, but had only 6-8 GP registers (R0-R7 with R7=PC and R6=SP) and a limited address space. The 68k could have added 16 fully orthogonal GP registers, but that would have reduced the excellent code density inherited from the PDP-11. Instead, the split data and address registers were added while the PC was dropped as a directly accessible register (dropping it is generally considered better for performance, even though a directly accessible PC remained popular on ARM). Nine mostly general purpose registers were gained while actually improving code density over the PDP-11, for an ISA with full 32 bit capability. The 8 address registers are generally considered to be GP despite the lack of orthogonality because they allow the most common instructions instead of being merely base registers.

Hitachi licensed the 68000 and created the SuperH RISC ISA based on the 68k (SuperH was used in the Sega Saturn, 32X and Dreamcast as well as many Japanese embedded products). The SuperH ISA added 16 GP registers (R0-R15) as well as a Global Base Register (GBR). I have talked to Brenden, who created the SH BJX1 ISA and has a SuperH emulator; he told me that the GBR is rarely used and he is removing it from the BJX1 64 bit ISA.

https://github.com/cr88192/bgbtech_shxemu/wiki/SH-BJX1-ISA

The SuperH is off patent (as is the 68k) and BJX1 is an attempt at improving the SH ISA for the open core J-core project. The J-Core project was using Dr. Vince Weaver's old code density stats, but even worse is that I found that SH needed 56% more memory access instructions and 47% more instructions than the 68k while wasting valuable DCache space on immediates (one of the most performance-handicapped of the once popular ISAs). Brendon has added variable length instructions, instructions like LEA, and half precision fp (which I suggested for compressing fp immediates/constants after explaining the success of the vasm double to single precision fp optimization), but he still recognizes the limitations of starting from an ISA with practically no free encoding space due to its 16 bit fixed length encoding. Of course, I explained the updated data and issues to the CEO behind the J-core push (he wants open cores for mass produced IoT), who is a big 68k fan and wants good code density. SuperH does have good code density and is extremely easy to decode but otherwise has massive performance obstacles. Perhaps a 16/32 variable length 68k-like RISC encoding with 16 GP registers could have been quite nice for low end embedded CPUs, but certainly adding 16 orthogonal GP registers and having good code density alone does not make a good ISA. The marketing literature was good and SuperH became quite popular.

https://www.hotchips.org/wp-content/uploads/hc_archives/hc06/2_Mon/HC6.S4/HC6.4.2.pdf

That concludes the history lesson for today.

matthey 
Re: 68k Developement
Posted on 9-Oct-2018 2:20:17
#418 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2015
From: Kansas

Quote:

cdimauro wrote:
How do you think this new addressing mode can be used to avoid the use of A6 in this case?


The AmigaOS calling convention assumes that A6 is used as the base register to access data in a library. It is possible to use all PC relative addressing in a library instead of A6 relative addressing. Then it doesn't matter which address register is used to JSR into the library. There are some utility.library math functions like SDivMod32(), SMult32(), UDivMod32(), UMult32(), SMult64() and UMult64() which are specifically documented to not require A6 to be used.

"Unlike other Amiga library function calls, the utility.library 32 bit math routines do NOT require A6 to be loaded with a pointer to the library base. A6 can contain anything the application wishes. This is in order to avoid overhead in calling them."

http://amigadev.elowar.com/read/ADCD_2.1/Includes_and_Autodocs_3._guide/node05AD.html

I have seen compilers take advantage of using other address registers to access the utility.library math functions and the result is usually better code quality.

Quote:

No. My idea is to have this instruction "o.s. controllable", more or less like the I/O map for an application. So, it raises an exception if the o.s. doesn't allow its use. Otherwise the instruction is executed and the L1 cache (and the L0 cache, if present) will be flushed.


The configuration would need to be from supervisor mode which is up to the OS to allow.

Quote:

It should be enough to load this special register per task/process/thread, exactly like the pointer to its MMU table.


Yes, that should be adequate.

Quote:

This is exactly the reason why having a SIMD unit with large vector registers is a good thing.

Yes, the cost benefit analysis of 32 SIMD registers may indicate it is worthwhile although it depends on the implementation. I just haven't looked into it yet.

Quote:
Personally, I don't consider almost any of the currently available RISC processors to be RISCs: they are CISCs, at least looking at the design points of the first RISCs.

They are labeled RISC only because, unfortunately, the mantra that "RISC is good and CISC is bad" is deeply rooted in the collective imagination, and IMO this also "banned" new research and (real) CISC designs.


CISC does have a bad reputation considering it is generally superior to RISC in instruction counts, memory traffic and code density. One problem is that many people equate CISC to x86/x86_64. Other than the fact that the most powerful consumer CPUs are x86_64, the x86_64 ISA is a poor example of CISC, especially for efficiency. The other problem is that people tend to think more registers is always better like higher clock speeds, more bits and more cores (more is good for marketing).

I don't want to call RISC hybrids CISC when they are load/store instead of register-memory. This is the defining difference between a RISC hybrid and a CISC hybrid for me.

Quote:

But if you compare the 68060 (and the 68040 as well, where Motorola started with the drastic cuts) with the Pentium, the feature differences are so huge that they cannot be put on the same plane.

I've already reported all such differences. They cannot be ignored when making an honest comparison.


The 68040 mostly cut FPU instructions. What other integer instructions did they cut from the 68020 ISA besides the CALLM/RTM instructions (which I'm not aware of anyone using)?

The percentage of instructions executed which were cut was small. Certain FPU algorithms used the trapped instructions often while others did not. Motorola went a little overboard on cutting instructions but it was logical to trim some fat with instructions which are rarely used or can't be fast in hardware.

Quote:

cdimauro wrote:
This is a problem with 68K, because sometimes it's not desirable to set the flags even when using data registers.

With megol's proposal, a prefix with a "suppress flags modification" option (bit) might be a suitable solution.

In my ISA I have a similar mechanism (with some longer opcodes), but it works in a "complementary" manner: it allows suppressing the flags modification for instructions that normally change them, and vice versa setting the flags for the ones which normally don't (e.g. MOV).


The more options like a CC-suppress flag and sign/zero extension are moved into a prefix, the more often the prefix will be used. The more the prefix is used, the more code density would suffer. Yes, there would be cases where more than one prefix option is used at once, but usually this is not the case, as can be seen with x86_64 where 1 to 2 prefixes are common (a byte encoding allows few options). From your conversations with Megol, you also expect 68k prefixes to have a significant adverse effect on code density.

Quote:

Motorola allowed it to save space for the implementation. But it mixed (and ruined, IMO) the data and address register model.

A stricter separation would have allowed 16 data registers (reusing the current An EA mode) and 8 address registers, with all data registers usable as indexes in the (d8,An,Xn) -> (d8,An,Dn) EA mode (with Dn -> D0..D15).


It may have been possible to squeeze in 16 data registers but it would have used more encoding space (especially MOVE which takes a huge amount of encoding space) and it would have created more instruction formats with less consistent register ports. I believe code density would have suffered as more instructions became 32 bit in size.

cdimauro 
Re: 68k Developement
Posted on 9-Oct-2018 6:16:13
#419 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:
matthey wrote:
Quote:
bhabbott wrote:
Not permitting the use of address registers for indexing would make the ISA less orthogonal.

Absolutely true. Much functionality would be lost in converting address (base/index) registers into base-only registers. While address registers support a limited number of instructions (MOVE, LEA, ADD, SUB, CMP, TST), these instructions generally account for at least 2/3 of executed integer instructions (more like 3/4 if conditional branches, which don't use a data register, are included).

That's impressive. I wouldn't have expected such a high frequency of "address register" instructions.
Quote:
The next most common operations are OR, shift and AND, and the 68k addressing modes already provide the most common index shifts (scaling). This is why allowing address register indexes is almost as good as having 16 GP registers.

Well, you can have 16 data registers (which can be used as indexes in the complex modes instead of an address register) and 8 address registers, with a clean separation between the two kinds of registers, as I said before. But I'll come back to it when replying to the last message.
Quote:
Hitachi licensed the 68000 and created the SuperH RISC ISA based on the 68k

I don't see how it can be based on the 68K: the ISA is different, and the opcode structure too. They implemented more or less the same addressing modes for the load/store instructions, but I don't see other similarities. And, of course, they used 16-bit opcodes, but the structure is absolutely different.
Quote:
(SuperH was used in the Sega Saturn, 32X and Dreamcast as well as many Japanese embedded products). The SuperH ISA added 16 GP registers (R0-15) as well as a Global Base Register (GBR). I have talked to Brenden who created the SH BJX1 ISA and has a SuperH emulator who told me that the GBR is rarely used and he is removing it from the BJX1 64 bit ISA.

https://github.com/cr88192/bgbtech_shxemu/wiki/SH-BJX1-ISA

The SuperH is off patent (as is the 68k) and BJX1 is an attempt at improving the SH ISA for the open core J-core project. The J-Core project was using Dr. Vince Weaver's old code density stats but even worse is that I found that SH needed 56% more memory access instructions and 47% more instructions than the 68k while wasting valuable DCache with immediates (one of the worst performance handicapped once popular ISAs). Brendon has added variable length instructions, instructions like LEA and half precision fp (which I suggested to compress fp immediates/constants after explaining the success of the vasm double to single precision fp optimization) but he still recognizes the limitations of starting with an ISA with practically no free encoding space from using a 16 bit fixed length encoding.

I briefly took a look at the BJX1 page, but I find it quite a complex ISA. I wonder how easy it can be to implement.
Quote:
Of course, I explained the updated data and issues to the CEO behind the J-core push (wants open cores for mass produced IoT) who is a big 68k fan and wants good code density. SuperH does have good code density and is extremely easy to decode but otherwise has massive performance obstacles. Perhaps a 16/32 variable length 68k like RISC encoding with 16 GP registers could have been quite nice for low end embedded CPUs but certainly adding 16 orthogonal GP registers and having good code density alone does not make a good ISA. The marketing literature was good and SuperH became quite popular.

https://www.hotchips.org/wp-content/uploads/hc_archives/hc06/2_Mon/HC6.S4/HC6.4.2.pdf

That concludes the history lesson for today.

Nice link, and I wonder here too, seeing that the SuperH-2 is shown with better code density than the 68K. I haven't seen other studies where it showed such good code density. Maybe, again, a compiler-biased analysis?

@matthey Quote:
matthey wrote:
Quote:
cdimauro wrote:
How do you think this new addressing mode can be used to avoid the use of A6 in this case?

The AmigaOS calling convention assumes that A6 is used as the base register to access data in a library. It is possible to use all PC relative addressing in a library instead of A6 relative addressing. Then it doesn't matter which address register is used to JSR into the library. There are some utility.library math functions like SDivMod32(), SMult32(), UDivMod32(), UMult32(), SMult64() and UMult64() which are specifically documented to not require A6 to be used.

"Unlike other Amiga library function calls, the utility.library 32 bit math routines do NOT require A6 to be loaded with a pointer to the library base. A6 can contain anything the application wishes. This is in order to avoid overhead in calling them."

http://amigadev.elowar.com/read/ADCD_2.1/Includes_and_Autodocs_3._guide/node05AD.html

I have seen compilers take advantage of using other address registers to access the utility.library math functions and the result is usually better code quality.

OK, but those are special cases. Usually libraries require A6 for their base, because they need to access their globally-shared data. They also don't know where the library base will be allocated, and that's particularly true for libraries which stay in ROM.

Take exec.library, for example, which is the worst case: how should it work without using A6?
Quote:
Quote:
No. My idea is to have this instruction "o.s. controllable", more or less like the I/O map for an application. So, it raises an exception if the o.s. doesn't allow its use. Otherwise the instruction is executed and the L1 cache (and the L0 cache, if present) will be flushed.

The configuration would need to be from supervisor mode which is up to the OS to allow.

That was my first idea. But I think that it could still be safe enough to allow an application to change this register if it really needs to. There might be parts of an application where performance and/or code density can be improved by a new register configuration, albeit it'll be a pain for compilers to handle. But I want to leave this possibility open.
Quote:
Quote:
This is exactly the reason why having a SIMD unit with large vector registers is a good thing.

Yes, the cost benefit analysis of 32 SIMD registers may indicate it is worthwhile although it depends on the implementation. I just haven't looked into it yet.

No, here I was talking about the vector register sizes, not their count.

Having many SIMD registers is certainly beneficial, as was shown by IBM's VSX2 paper.
Quote:
Quote:
Personally, I don't consider almost any of the currently available RISC processors to be RISCs: they are CISCs, at least looking at the design points of the first RISCs.

They are labeled RISC only because, unfortunately, the mantra that "RISC is good and CISC is bad" is deeply rooted in the collective imagination, and IMO this also "banned" new research and (real) CISC designs.

CISC does have a bad reputation considering it is generally superior to RISC in instruction counts, memory traffic and code density. One problem is that many people equate CISC to x86/x86_64. Other than the fact that the most powerful consumer CPUs are x86_64, the x86_64 ISA is a poor example of CISC, especially for efficiency.

Yes, I think that's the main reason, albeit their implementations have improved considerably over the last decades.
Quote:
The other problem is that people tend to think more registers is always better like higher clock speeds, more bits and more cores (more is good for marketing).

In general yes, I agree, but some things depend on the specific context.
Quote:
I don't want to call RISC hybrids CISC when they are load/store instead of register-memory. This is the defining difference between a RISC hybrid and a CISC hybrid for me.

The original RISC was much more than just a load/store ISA: it was also a reduced instruction set, without microcoded (and complex) instructions, and with fixed-length instructions.

The load/store paradigm is the only thing which survives today, and which is used to differentiate RISCs from CISCs (which kept all of their intrinsic characteristics. That's because they were "bad", eh!).
Quote:
Quote:
But if you compare the 68060 (and the 68040 as well, where Motorola started with the drastic cuts) with the Pentium, the feature differences are so huge that they cannot be put on the same plane.

I've already reported all such differences. They cannot be ignored when making an honest comparison.

The 68040 mostly cut FPU instructions. What other integer instructions did they cut from the 68020 ISA besides the CALLM/RTM instructions (which I'm not aware of anyone using)?

The percentage of instructions executed which were cut was small. Certain FPU algorithms used the trapped instructions often while others did not. Motorola went a little overboard on cutting instructions but it was logical to trim some fat with instructions which are rarely used or can't be fast in hardware.

This all takes space (transistors) and power.

But you forgot to mention the simplified MMU, whereas the Pentium carries all of its legacy and introduced other things as well.
Quote:
Quote:
cdimauro wrote:
This is a problem with 68K, because sometimes it's not desirable to set the flags even when using data registers.

With megol's proposal, a prefix with a "suppress flags modification" option (bit) might be a suitable solution.

In my ISA I have a similar mechanism (with some longer opcodes), but it works in a "complementary" manner: it allows suppressing the flags modification for instructions that normally change them, and vice versa setting the flags for the ones which normally don't (e.g. MOV).

The more different options like CC suppress flag and sign/zero extension which are moved into a prefix, the more used the prefix will become. The more used the prefix, the more code density would suffer. Yes, there would be cases where more than one prefix option would be used but usually this is not the case as can be seen with x86_64 where 1 to 2 prefixes are common (a byte encoding allows few options). From your conversations with Megol, you also expect 68k prefixes to significantly adversely affect code density.

Absolutely. As I've already stated, my ISA uses longer opcodes, so it's clear that code density will be affected when using them, albeit I have some mitigations (more features can be enabled at the same time, saving more instructions and/or registers). It's important to remark that by using those features (prefixes or long instructions) you also get other benefits: a lower instruction count and/or less memory traffic.
Quote:
Quote:
Motorola allowed it to save space for the implementation. But it mixed (and ruined, IMO) the data and address register model.

A stricter separation would have allowed 16 data registers (reusing the current An EA mode) and 8 address registers, with all data registers usable as indexes in the (d8,An,Xn) -> (d8,An,Dn) EA mode (with Dn -> D0..D15).

It may have been possible to squeeze in 16 data registers but it would have used more encoding space (especially MOVE which takes a huge amount of encoding space) and it would have created more instruction formats with less consistent register ports. I believe code density would have suffered as more instructions became 32 bit in size.

Not that much with a clever encoding, if we talk about just supporting 8, 16 and 32 bit operands (exactly like the 68K). I mean: this ISA would have been very, very similar to the current 68K one.

P.S. Sorry, no time to re-read.

cdimauro 
Re: 68k Developement
Posted on 9-Oct-2018 22:46:54
#420 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

I reviewed the opcode table again, looking at Motorola's 68000 manual. It took me a while, because the list is badly organized, and it's not easy to figure out which 16-bit "slots" were used (and especially which ones are left free to be used).

Anyway, I'm now firmly convinced that my idea of having 16 data registers and 8 address registers, with a clear separation between the two, is doable by shuffling some instructions, and it should maintain a very similar code density (with margin to improve it, using the new data registers).

BTW, looking at the opcode table I think that Motorola made a big mess: it looks like a patchwork, where its engineers just reused opcode space as they needed it, without caring or thinking about a clearer and simpler organization. Decoding isn't trivial, and there's a huge waste of opcode space which could have been used much better.

Do you still really want to expand this monster? You can keep 68K assembly-level compatibility while getting a much better opcode table, with many more 16-bit "slots" available (no, I'm not talking about adding 8 new data registers: just an opcode table redesign).

