matthey 
Re: 68k Developement
Posted on 26-Sep-2018 5:11:10
#381
Elite Member
Joined: 14-Mar-2007
Posts: 2015
From: Kansas

Quote:

Hypex wrote:
That would be confusing. That's worse than using x86 syntax for 68K code. But what would come first in the line-up? Address or data? They aren't equals.


Data registers come first because they are mode=0 and address registers are mode=1 in the EA encoding. Taking the whole 6-bit EA as a number gives 0-15 for registers r0-r15.
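
To sketch that encoding (the r0-r15 numbering is the unified naming being discussed, not standard Motorola syntax):

; 6-bit EA field = 3-bit mode + 3-bit register number
; mode %000, reg n -> Dn (unified r0-r7)
; mode %001, reg n -> An (unified r8-r15)
; e.g. %000010 = D2 = r2,  %001010 = A2 = r10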

Quote:

The problem is the time spent investing in it to get it working. It's okay for embedded work but a poor choice for our market. The Sam already gave trouble for those running WarpOS/UP apps under a wrapper. I've noticed the X1000 kernel can be unstable for things like 68K emulation or interrupt signalling; things that are fine on the XE and Sam. I don't know how they passed it. IMHO they should have left it or found another, more compatible CPU. Nice idea, but OS4 programs aren't optimised enough to run on slow hardware.


Right. The Tabor CPU would likely not have been as bad for a one and done embedded device. It is still rude to expect major support changes from compiler and tools developers when there is such a small advantage to the hardware changes. It seems like a "cheapened" CPU that software guys are expected to make "acceptable".

Quote:

I wonder how they merged it. And how transparently. If there were 32 of each what happens where?


The paper cdimauro linked gives the details (I think I originally gave him the link). See the text that is referring to figure 6. Legacy code compatibility is retained.

https://www.researchgate.net/publication/299472451_Workload_acceleration_with_the_IBM_POWER_vector-scalar_architecture

Quote:

And being adaptable since vector sizes change quickly. So some forward thinking to sizes would help here. Perhaps something like the 020 scale type instructions could help to support multiple widths and expansion.


I don't think vector sizes are changing quickly anymore. A mid-performance SIMD unit uses 128 or 256 bit wide registers. The high-performance and specialty SIMD units will probably use 512 or 1024 bit wide registers. Only high-performance hardware can load a whole register's worth of memory in one access to feed the SIMD monster, and the register files get to be huge. IPC does double with each doubling of the register width, which is a good gain, so I could be wrong, but Moore's law has also ended. Supporting multiple SIMD unit sizes or flexible widths takes more encoding space. There is no free lunch.

Quote:

Retro is nice but retro goodness can't be used in a modern web world with internet, office work and productivity.


Modern technology can make the retro goodness fast.

Overflow 
Re: 68k Developement
Posted on 26-Sep-2018 10:50:11
#382
Super Member
Joined: 12-Jun-2012
Posts: 1628
From: Norway

Found this nice little video, which should be of interest to others:

Quote:
In this seminar, Hellfire and Noname from Haujobb share some insights about coding for the Amiga. Based on our experience of jointly creating several Amiga AGA demos over the years, like the "Beam Riders" demo, we will tell you how we did it and share some tricks of the trade with you

This talk covers different aspects of the toolchain, such as coding and testing on PC, syncing with Rocket, and compiling for and profiling on Amiga. It is basically a "Beam Riders" making-of.


https://www.youtube.com/watch?v=s1lVS4tW33g

OlafS25 
Re: 68k Developement
Posted on 26-Sep-2018 12:26:52
#383
Elite Member
Joined: 12-May-2010
Posts: 6353
From: Unknown

@NutsAboutAmiga

There are lots of languages in use, like Amiga-E with lots of Amiga-specific includes.
Additionally there are lots of other languages.
For beginners and advanced users there are, for example, Amiga-specific BASIC dialects.
Of course there are C and C++.
If you are writing games and want them to be portable, I would prefer SDL anyway, develop on modern hardware and do only the final testing on anything Amiga-related. Or you really do program for the Amiga (chipset), and for that you need Amiga-based languages.

NG is much slower than modern hardware, so I do not really see the advantages. That is how, for example, Android development is normally done: the environment is on a PC (perhaps even with emulation) and only final testing is done on the slow hardware. If I were to use GCC I would not use NG for it but would use modern hardware for compiling and testing in UAE (that is even partly possible for 4.1). The final tests would be on slow 68k or PPC hardware.

You do not necessarily need updated compilers to develop software that runs on Vampire; everything compiled for the 68020 should work. Of course hand-optimized and adapted software (a game) might run faster, but then it only runs on Vampire, whereas if you do not use the really specific features the game works on old classic hardware (with enough resources), Vampire or UAE.

Last edited by OlafS25 on 26-Sep-2018 at 12:54 PM.

NutsAboutAmiga 
Re: 68k Developement
Posted on 26-Sep-2018 19:21:18
#384
Elite Member
Joined: 9-Jun-2004
Posts: 12819
From: Norway

@OlafS25

Quote:
if you are writing games and want it to be portable I would anyway prefer SDL and develop it on modern hardware and do only final testing on anything amiga-related.


I wrote an allegro5 wrapper that I use in some game-related projects on GitHub; it is not feature complete, but it wraps on top of graphics.library.

Sticking to vector classes and structs, you hardly need to think about pointers, and if you're using lots of refs instead of pointers in your code, you're less likely to need to reboot your computer. I enjoy the Workbench and simple text editors; I like coding on the Amiga, using GCC and makefiles.

I do sometimes pop into Visual Studio, and I don't think that's all bad; it can be somewhat complementary to just working with GCC. However, Visual Studio projects have a different build system, so you now have to maintain two build systems. It's not hard.
Quote:
Or you do really program for amiga (chipset) and for that you need amiga based languages


No, I code mostly for graphics.library and intuition.library, only system-friendly code, event based. I use standardized APIs like AHI when I code, so you can backport stuff to 3.x or port it to MorphOS or AROS or whatever is similar to AmigaOS. I don't want to lock myself to a sound card or an old floppy drive - horror.
Quote:
NG is much slower than modern hardware


Native compiler vs native compiler, true, but most of the project is not rebuilt; only the changed .cpp files are rebuilt, thanks to makefiles. The difference is just not noticeable on small to medium-sized projects, if the project is well structured.

Quote:
You do not necessarily need updated compilers to develop software that runs on vampire,


You will need updated development tools to use the new AMMX assembler instructions; the compiler needs to know it can optimize for these things.
For example, AmigaE is, as far as I know, not being developed anymore, as it has now been replaced by PortableE, which generally is wrapped around a C/C++ compiler.

Ideally you have CPU-optimized code in libraries, so the program is not locked into one CPU. But let's look back at WarpOS: remember how few things in the OS were optimized - a few datatypes and a few libs - and this is most likely the outcome of that. Only a few games supported WarpOS. To be successful it needs most developers to support the 68080, I think.

Last edited by NutsAboutAmiga on 26-Sep-2018 at 07:29 PM.
Last edited by NutsAboutAmiga on 26-Sep-2018 at 07:28 PM.
Last edited by NutsAboutAmiga on 26-Sep-2018 at 07:23 PM.

_________________
http://lifeofliveforit.blogspot.no/
Facebook::LiveForIt Software for AmigaOS

bison 
Re: 68k Developement
Posted on 26-Sep-2018 20:14:49
#385
Elite Member
Joined: 18-Dec-2007
Posts: 2112
From: N-Space

@matthey

Quote:
I expect the simpler in-order CPUs have had a surge of sales since Spectre type exploits came to light and I'm not sure OoO CPU sales have hit bottom yet.

Are there any in-order processors left, other than the Cortex A53 and A55? The Atom and Itanium are gone, and I can't think what else would still have in-order execution.

_________________
"Unix is supposed to fix that." -- Jay Miner

matthey 
Re: 68k Developement
Posted on 26-Sep-2018 20:34:45
#386
Elite Member
Joined: 14-Mar-2007
Posts: 2015
From: Kansas

Quote:

megol wrote:
@matthey
Using a prefix doesn't require changing a mode, it could be supported even in a pure 32 bit design with the 64 bit extension encoding reserved. Not even extending the number of registers requires a mode change, but it would require the context switch code of the OS saving/restoring all registers.

Using address register sources has no advantage compared to prefix+extended D register sources when it comes to compatibility; programs have to be updated to use both versions.


The "compatibility" of prefixes is appealing. They allow major enhancements with practically no changes to existing instruction formats and register ports. We have already discussed some of the disadvantages though.

Quote:

Using a prefix+normal instruction to initialize "constants" in the higher D registers and then using them as sources uses the same amount of bytes as using integer instructions+MOVE to A registers and then using these as sources, so there is no advantage there. The latter would also require more instructions, potentially lower performing.
Accessing the high D registers like this also preserves the A registers which otherwise would be used as integer data.

So if there is prefix support already using An source encodings as Dn+8 sources is IMHO better.


There is no need to initialize constants in data registers and then move them to an address register.

moveq #100,d0 ; 2 bytes
move.l d0,a0 ; 2 bytes (two-instruction sequence, 4 bytes total)

movea.w #100,a0 ; 4 bytes (single instruction, same total size)

Constants greater than 127, less than -128, or 0 take the same number of bytes of code (or fewer when an immediate MOVEA.W can be used) and fewer instructions to load directly into an address register. Fewer constants would be needed in registers as more of the compressed immediate constants would come directly from the code.

The Dn+8 registers would need new ADD, SUB and MOVE (.B/W/L/Q) encodings at least, as the An encoding is already in use. The MOVE instructions take a huge amount of encoding space if you plan on keeping the functionality orthogonal with the existing data registers. A full prefix implementation would be cleaner but would also reduce code density. You thought my opening of An register sources was ugly, but I was only using the encoding for its original purpose. Mixing An and Dn+8 encodings is less ugly?

Quote:

Nothing is ever free.

What I ask is what the use cases for these operations are and if they will be used so often that complicating a critical part of the pipeline is worth it. Both of your examples show a simple type of operation that is easy to support, but they aren't general.
Even the simple address modes are really much more complicated: basereg+indexreg*scale+displacement

So to update the base register one would reasonably require going through the address generation stage. Standard updates as in the existing 68k ISA need an incrementer/decrementer.
If the address generation stage takes more than one clock cycle*, instructions depending on the updated base register will have to wait; worse, even those that follow (a0)+/-(a0) will** have to wait the extra cycle(s).

Or one could reserve the base register update for the simple disp+basereg address mode only, most likely as fast as executing the increment/decrement versions that have to be supported anyway. That wouldn't be orthogonal but perhaps acceptable?


I really don't want to increase the critical path, slowing the pipeline. It is only the base register update with post-increment which may be a problem. It may be possible to add parallel hardware which would allow the post-increment? An investment in hardware which improves code density at least partially pays for itself. It is important to look at where code size increases when optimizing for performance, as this is common.

The base register update by itself is not a problem. It would be possible to start at the end of a memory data structure with a base register update instruction and work backward with pre-decrement mode (as my example showed) but fetching backward in memory can be less efficient. Perhaps this is preferable though?
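
For reference, the two standard 68k idioms being contrasted, a forward walk with post-increment and a backward walk with pre-decrement (a minimal sketch, not the proposed base-update instruction):

        ; forward: walk up through memory with post-increment
fwd:    move.l  (a0)+,(a1)+
        dbra    d0,fwd

        ; backward: pointers start just past the end, walk down with pre-decrement
bwd:    move.l  -(a0),-(a1)
        dbra    d0,bwd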

Quote:

But RISC almost never runs out of registers. That they require more registers is by design.
The CISC having to use memory data wastes energy at best, and decreases performance at the same time at worst.


RISC almost never runs out of registers when it has 32 GP registers. RISC uses more registers and has a much higher cost when out of registers, which is why 32 registers are used for high performance RISC ISAs.

The energy efficiency of more registers is not so clear. The minimum energy use is increased as the number of active transistors increases with a larger register file. The maximum energy use is increased when out of registers, but this is offset by worse code density, which has to do more ICache fetches often instead of more DCache fetches some of the time. Instead of wasting transistors on a larger register file and bigger ICaches, larger DCaches might be a more effective use of transistors and I expect can provide competitive energy efficiency, at least in the case of CISC with 16 GP registers.

Quote:

Every processor has load-use delays. The difference is that an in-order CISC with load stages inlined in the execution pipeline (Pentium, 68060, Cyrix 6x86, Apollo) _always_ pays the costs, even when not accessing memory. The 68060 compensated for that somewhat by executing some integer operations early, reducing the effects (the Pentium did not do that). For RISC the delay is instead exposed and optional.


It looks to me like RISC designs need better instruction scheduling (OoO required for good performance) and long unrolled loops to avoid the load-use delays. This gives bloated code and has excessive register requirements. The 68060 design certainly minimizes load-use delays, is conducive to good code density and is quite powerful for an in-order CPU design.

Quote:

2 LD, 1 ST is probably optimal for a standard design. The easiest way to do that would be duplicating the cache so that writes go to two blocks and each read port connects to a dedicated block.


Yes, 2 LD 1ST would be great as there are more loads than stores. It would make instruction scheduling very easy which is important for an in-order CPU design.

Quote:

Didn't know about the third pipeline, guess it decreased maximum clock frequency enough not to be worthwhile in the end?


I don't know. Gunnar didn't say much about it as usual. An instruction scheduler would be more important for an in-order 3 pipe design and no compiler supports this. Instruction scheduling would be challenging without dual porting the DCache. We can see a little bit of current pipe statistics by looking at performance counter results for the Apollo core. Pipe1 executes an instruction about 60% of the time and Pipe2 about 40% with hand optimized code. Surely this includes stalls (mostly DCache misses and store buffer full stalls as branch prediction was 98% correct). The 68060 could issue instruction pairs and triplets with optimized code 50%-65% of the time but this likely does not include stalls during execution. It is impossible to compare these percentages. We don't know how much the clock frequency dropped either. An FPGA has more limited space than an ASIC so even if the 3rd pipe could execute an instruction 20% of the time it may not be worthwhile. I wouldn't be surprised if a dual ported DCache could raise the percentages 5%-10% and probably more with unoptimized legacy code. The extra pipe should be cheap for an in-order CPU but I don't know of any stats for comparison.

Quote:

Why would a CISC have expensive register access? Even with my prefix hack the worst case is 50% code expansion. Also register files are much more energy efficient than accessing a cache.


CISC uses more encoding space. A 68k-like CPU with 16-bit instructions using 32 registers would need 7 bits for an EA, 4 for another register and 2 for a size, which leaves only 3 bits for an opcode. Tiered registers are possible with 32-bit instructions to access Dn+8. The prefix is probably the cleanest way to add 8 more data registers to the 68k without re-encoding, but maybe re-encoding would be the better choice at that point. The energy efficiency is as clear as mud, as I have already addressed above.
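
Spelling out that arithmetic (purely illustrative layout, using the field sizes above):

; hypothetical 16-bit instruction word with 32 registers:
; [ 3-bit opcode | 2-bit size | 4-bit register | 7-bit EA ]
;   3 + 2 + 4 + 7 = 16 bits  ->  only 8 top-level opcodes left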

BigD 
Re: 68k Developement
Posted on 26-Sep-2018 20:59:49
#387
Elite Member
Joined: 11-Aug-2005
Posts: 7323
From: UK

@Thread

Like, couldn't we just get a 060 core, shrink the die and then crank up the clock to 5GHz? That would be cool!

_________________
"Art challenges technology. Technology inspires the art."
John Lasseter, Co-Founder of Pixar Animation Studios

matthey 
Re: 68k Developement
Posted on 26-Sep-2018 21:16:07
#388
Elite Member
Joined: 14-Mar-2007
Posts: 2015
From: Kansas

Quote:

bison wrote:
Are there any in-order processors left, other than the Cortex A53 and A55? The Atom and Itanium are gone, and I can't think what else would still have in-order execution.


There are plenty of embedded CPUs like the Cortex-M and Cortex-R series.

https://en.wikipedia.org/wiki/ARM_Cortex-M

There are older ARM CPUs like the Cortex-A8 which are "superseded" but still available and popular. Fido is in-order. Most of the Cast CPUs are in-order.

http://www.cast-inc.com/ip-cores/processors32bit/index.html

Some designs are partial/limited OoO, like in-order issue with OoO completion. I expect this is what the BA25, most PowerPC, many ARM CPUs and probably the Apollo core are using. It is OoO issue with OoO completion (but still in-order graduation) which can be very complex and use insane amounts of energy.

Quote:

BigD wrote:
Like couldn't we just get a 060 core, shrink the die and then crank up the clock to 5Ghz? That would be cool!


Uh, no, but 500MHz maybe, just from die shrinks. Most of the 50MHz 68060s were clocking up to about 65MHz before the rev 6 die shrink, and after it they were clocking to about 100MHz. That is about a 50% clock speed improvement per die shrink. Now how many die shrinks have there been since the '90s?

Last edited by matthey on 27-Sep-2018 at 03:34 PM.
Last edited by matthey on 26-Sep-2018 at 09:26 PM.

OlafS25 
Re: 68k Developement
Posted on 27-Sep-2018 9:28:48
#389
Elite Member
Joined: 12-May-2010
Posts: 6353
From: Unknown

@NutsAboutAmiga

Yes and No

I integrated Amiga-E in Aros Vision and updated it as far as possible

PortableE is designed as a cross-platform language, not compiling directly to 68k but creating C code and compiling that to 68k (or whatever) in a next step, as far as I know. The Amiga-E I integrated was a real 68k compiler and thus quite fast (but of course not compiling for other platforms).

If you want to support Apollo-specific features (both 68080 and SAGA) you need both an adapted compiler and includes. As long as you do not need that (or do not want that, because you want to write a program that works on different configurations and not just Vampire) you do not need them.

I do not own a Vampire myself, but as long as they support the old chipsets (ECS/AGA) and the 68020 instructions, everything should work - just faster.

Still, it is possible to integrate asm code in AMOS and AmiBlitz, and I assume in most high-level languages, so it would be possible to speed up programs, but for that you need good knowledge. I do not know how many people are left who are able to do that... that also affects compilers - how many are left who could adapt compilers to the 68080?

Vampire (including the coming standalones) has benefits for the 68k platform because it lifts the general 68k hardware level to something like a 100 MHz 68060 with 128 MB and a graphics card. That makes better software possible than before. Of course that is much slower than today's average standards (even an RPi). I do not expect wonders there regarding new software. We will see what happens. But it is nice that something happens at all after the long period of stagnation and even shrinking. Hopefully it motivates some developments. Developer support certainly is a weak point in the concept.

Last edited by OlafS25 on 27-Sep-2018 at 09:44 AM.
Last edited by OlafS25 on 27-Sep-2018 at 09:40 AM.
Last edited by OlafS25 on 27-Sep-2018 at 09:38 AM.
Last edited by OlafS25 on 27-Sep-2018 at 09:32 AM.
Last edited by OlafS25 on 27-Sep-2018 at 09:31 AM.

ppcamiga1 
Re: 68k Developement
Posted on 29-Sep-2018 5:55:47
#390
Cult Member
Joined: 23-Aug-2015
Posts: 770
From: Unknown

@OlafS25

Vampire has the integer performance of a 50 MHz 68060. Only the RAM is faster than on old 68060 cards.

cdimauro 
Re: 68k Developement
Posted on 29-Sep-2018 6:06:07
#391
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@ppcamiga1 Quote:
ppcamiga1 wrote:
I have no printer. In the last ten years I have printed very rarely.
Almost always from a printer in a local shop, or on a PC at my work.
I do not need TurboPrint. What I need is a fast Amiga to convert .ps to .pdf, which I can use everywhere.

I prefer Personal Paint over Deluxe Paint. But in rare cases I use Deluxe Paint IV - it works without problems.

I do not need a chipset.

Have you tried other Deluxe Paint versions (e.g. V)?

In general, I assume that you don't run any (non-OS-friendly) Amiga games, right?
Quote:
Every usable 68k software from my old 68k Amiga works better on my NG Amiga without problems.

I have no PC-with-x86-CPU-removed-and-replaced-with-a-PowerPC-one machine (in short: PPC PC) to make a comparison, but the 68K software which I use on WinUAE:
- works quite well and very fast;
- has more free memory available than any PPC PC can give (a bit below 2GB in total, including RTG card VRAM);
- can use (emulate) several additional cards (graphic & sound in particular);
- can use my host internet connection without any driver needed (other than the WinUAE one: "one driver to rule them all [network cards]");
- can use my host shared folders;
- has very low-latency input and very smooth graphics updates with the latest 4.x version, which allows playing Amiga games like on the original machines;
- can use multiple monitors at the same time.
Quote:
Amiga is my hobby, not my work.
As user I do not want to spend my precious free time on something slower and less comfortable than cheap pc from Windows 95 era.
As a developer I do not want to waste my precious free time to optimize software to work on something slower than cheap pc from Windows 95 era.
If slower amiga is ok for You, then ok, we can cooperate in some way, but don expect us to downgrade.

Well, the point is that you, post-Amiga users, are still using 68K software, which was created almost always on the original machines (yes: the SLOOOOWER ones!). And you cannot get rid of it, either because the o.s. has dependencies (e.g.: AREXX) or because you have no PPC equivalents.

This 68K software can be run quite nicely on emulated environments (WinUAE, Amithlon) even outperforming the best PPC machines.

So, what's the point of using PPC PCs? Only for moving some windows around on a composited screen? To be more specific, is there any KILLER application (e.g.: ONLY available on PPC PCs, and which is NOT the o.s. itself, of course) to justify the usage of such obsolete and overpriced machines?

cdimauro 
Re: 68k Developement
Posted on 29-Sep-2018 7:24:01
#392
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:
matthey wrote:
Don't worry. I'm open minded enough to consider how and the cost of doing things I don't currently consider necessary like adding more registers.

So no additional registers, but see below.
Quote:
IMO, a 32 bit CPU hits the sweet spot for performance, has a good amount of address space, is easy to program and provides a small footprint. This is why they continue to be popular for embedded use. Personally, I would be happy with 32 bit CPUs with a small footprint. However, 64 bit CPUs are popular and hyped right now so it is necessary to have 64 bit plans for the future to gain any kind of respect. Some applications practically require 64 bit including advanced GPUs.

64-bit support.
Quote:
I think compatibility is more important to a 68k CPU than any entrenched architecture. The huge 68k software library and 68k fans depend on compatibility.

68K execution mode.
Quote:
I think 8+8 integer registers is enough but 16+8 is a possibility if necessary. 8+8 is the best for compatibility and code density while providing acceptable memory traffic from stats I have seen. The x86_64 ISA provides world dominating performance with only 16 GP registers.

In many discussions I've seen people asking for more data registers (actually it seems that most coders prefer to have more data registers instead of address registers. I'm for more address registers, but it seems that I'm part of a minority).

Let's leave the door open for a 16+8 design. However, this means that the separation between data and address registers might become more strict; we can talk about it later, if the final ISA uses 16+8 registers.
Quote:
Code density is important and becoming more competitive with new ISAs. I think it would be a mistake to decrease code density when the 68k can leverage the code density advantage.

Then I think that there's no room for prefixes. A 16-bit prefix decreases the code density too much.
Quote:
Quote:
Is binary compatibility required, or source (assembly) compatibility is fine? 100% or less?

I prefer to retain 68k binary compatibility but 100% compatibility with the 68000-68060 is not possible. It is possible to add ColdFire compatibility but I'm not sure whether source or binary is better at this point.

(almost) 100% compatibility can be achieved by having a 68K execution mode, as I mentioned before.

The point that needs to be clarified is whether you want to further extend such a 68K execution mode / ISA, or whether you want to keep the new things/improvements for the new execution mode(s). See below, where I talk specifically about those execution modes.
Quote:
Quote:
Do you want to keep the more complex addressing modes (e.g.: double memory indirect)?

I think it best to keep some form of support for compatibility. It may be possible that the single memory accessing variations are useful and not too problematic. It is generally better to split the 2 memory access variations into 2 instructions for better scheduling. They are useful for modern OO code but they are inherently multi-cycle. An OoO 68k CPU would likely have less problems with the double memory accessing addressing modes.

Yes, the 68K execution mode must keep those complex modes due to backwards compatibility, but they can be removed in the new execution modes. In such new modes double memory indirection can be emulated (e.g.: an assembler can transparently generate one or two additional instructions).
Quote:
Quote:
Do want to keep 68K compatibility (e.g.: the processor has a 68K compatibility mode) or only new (even 32-bit) mode(s) are available (e.g.: the new ISA is only 68K and/or x86/x64 inspired)? Or something like ARM, with ARM32 + Thumb-2 (like a 32-bit redesign) + ARM64 modes usable?

I prefer to allow the 32 bit mode at the same time as the 64 bit mode (like ARM). The 32 bit code could provide better compatibility and sometimes have improved code density.

OK, but let me better clarify what I was talking about before.

You said that you want to be compatible with the 68K code, and that's OK. But you can have a 64 bit post-68K processor which works this way:
- 68K compatibility mode (like x86 compatibility mode on x64, or the traditional ARM32 on ARM64 processors);
- new execution mode (which isn't 68K-binary compatible) with a 68K-inspired ISA which can run in 64 or 32 bit.

A new 64-bit ISA / opcode table is needed because you don't want to use prefixes (which would lower the code density) while proposing substantial enhancements (64-bit first; 8 more data registers can be a reasonable and achievable goal too; a new FPU and/or SIMD unit as the last feature to think about), and dropping some legacy as well (e.g.: double indirect modes, coprocessors, and maybe the BCD instructions). This new ISA can itself run in 32 or 64 bit mode without any particular burden (like my NEx64T ISA: in 32-bit mode any instruction can be used, just Size=64-bit is not allowed and some instructions decode differently).

So, the new ISA brings all enhancements, and the 68K one is kept only for backward-compatibility.

What do you think about?
Quote:
Big endian is needed for 68k compatibility. Of course there should be ways to accelerate little endian data accesses.

OK. For the latter: only with some new instruction (e.g.: MOVLE ), or with a "little-endian data" execution mode?
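
For context, reversing the byte order of a 32-bit value on a plain 68k today takes a sequence like this (the classic idiom that a MOVLE-style instruction or mode would replace):

        move.l  (a0),d0      ; fetch a longword stored little-endian
        rol.w   #8,d0        ; bytes ABCD -> ABDC
        swap    d0           ; ABDC -> DCAB
        rol.w   #8,d0        ; DCAB -> DCBA (byte order now reversed)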
Quote:
Quote:
How long an instruction can be? >16 bytes?

Longer than 16 bytes is necessary for compatibility and it would be difficult to limit instructions to 16 bytes for 64 bit support. With that said, I have been careful to make sure that the maximum instruction length does not grow from the 68020 ISA.

68020 = 22 bytes maximum instruction length, which means two 16-byte cache lines. So I think that a maximum of 30 bytes can be set as the upper limit.
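
For reference, a sketch of where those 22 bytes come from, assuming the usual worst case (MOVE.L with memory-indirect source and destination, 32-bit base and outer displacements on both sides; the addresses are just placeholders):

        move.l  ([$12345678,a0,d0.l*4],$12345678),([$12345678,a1,d1.l*4],$12345678)
; 2 (opcode) + 2+4+4 (source full extension word + bd.l + od.l)
;            + 2+4+4 (destination full extension word + bd.l + od.l) = 22 bytes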
Quote:
Quote:
Which market segments should be targeted?

Embedded, hobbyist/retro and education.

OK
Quote:
Quote:
Are prefix(es) OK?

I don't like them but they are an option if necessary.

Prefixes = lower code density. It's unlikely that you can keep both prefixes and the current code density.
Quote:
Quote:
Is instructions decoding changeable at runtime (by the kernel? Or even by application?)

In other words and to simplify, do you (already) have an idea of what you want achieve with the new processor (and the new ISA)?

Do you mean an FPGA CPU core or CPU microcode which can be updated? I would like to have a hard and standard CPU. Customization of the 68k CPU instructions would likely be limited. An FPGA on the board with the reset controlled by CPU software (but not tied to the CPU reset) is more interesting for customization. Software could then load various FPGA acceleration programs (for embedded or codec accelerators) or custom chip sets (console or computer) into the FPGA. There still hasn't been the ultimate and easy to use out of the box CPU and FPGA combo even as there are CPU+FPGA SoCs.

No, I wasn't talking about that. I was talking about the possibility of changing some flag which can alter the meaning / decoding of some instructions, like what the 65C816 processor did when it extended the 65C02 to a 16-bit ISA.

My ISA allows changing some instruction decodings and/or memory addressing modes, based on some configurable flags, in order to better match an application-specific use case (the compiler and/or the developer decides how to configure it).
Quote:
Quote:
I can say that I prefer to write new ISAs which are only INSPIRED by existing ones. I think that it's enough to take only the good parts/ideas, while creating something new. If it's 100% assembly compatible it's OK, but it's not strictly necessary (for my NEx64T it was my goal because I wanted to make really easy port software, and I was lucky being able to achieve it but I had to make several compromises).

So, how is looking your post-68K/inspired new processor/ISA?

I consider my incomplete 68k_64 ISA to still be 68k and not just inspired. Yes, I have thought about a universal SuperCISC ISA "inspired" by the 68k but nothing exists for it.

If you change the 68K opcode table, removing / replacing instructions to make space for 64-bit encodings, then you no longer have a 68K ISA, but an inspired one.

In this case I suggest, as I said before, completely rethinking the ISA. Basically following what ARM did with its 64-bit ISA.

Keeping the legacy baggage isn't a good idea for a new ISA: don't make the mistake that AMD did with x64.
Quote:
Quote:
Please read ALL comments.

Speculative execution is here to stay. Mitigated, certainly, but it'll not be abandoned.

I read all the comments. Of course speculation is here to stay but it is unlikely to be as deep, at least for awhile. I expect the simpler in-order CPUs have had a surge of sales since Spectre type exploits came to light and I'm not sure OoO CPU sales have hit bottom yet.

It seems that some Spectre variant can work on in-order processors too, according to what megol reported.

If you want to remove any possibility of side-channel attacks then you have to remove caches, TLBs, and maybe even some machine registers (counters). So, basically it means going back 40 years. IMO, it's completely unacceptable.
Quote:
Quote:
cdimauro wrote:
That's very strange, because I see that in most games the processors which show better performances are always the ones with HT enabled. At least on Intel side, whereas AMD processors shown to suffer it and AMD presented a "Game mode" for its Ryzen, which essentially disables SMT.

When I looked at stats a few years ago, Intel CPUs were having problems with multi-threading performance too. Even from slightly different Intel CPUs there was inconsistent performance.

https://www.guru3d.com/articles_pages/intel_core_i7_8700k_processor_review,18.html

i7 >>>> i5
Quote:
Quote:
OK, so three answers here to my above questions: 64-bit, embedded market, and a "mid" SIMD unit (with 16 registers, I assume).

I am not set on 16 SIMD registers although it would make several things easier. It is much more expensive to run out of SIMD registers as the data is often streaming. The SIMD unit is the last I will work on.

OK, but you have to leave some free space for it.
Quote:
Quote:
There are around 15 billions ARM core produced each year.

Western Digital announced last year that it will completely move from ARM to RISC-V for its micro-controllers, and it produces 1-2 billion per year.

Just to give an example. But other big CPU vendors have joined the RISC-V foundation/committee, so I see a threat for ARM business here.

AArch64 was a big gamble to go after higher performance markets (especially servers) but ARM has become less competitive in deeply embedded devices where customization, simplicity and code density are important. RISC-V has signed up some big names and people like the open cores with no licensing concept. ARM has a strong reputation in embedded but maybe they have given up Thumb2 like Motorola/Freescale gave up the 68k.

I think that ARM had no opcode space available for Thumb-2-like 16-bit instructions, and didn't want to introduce a crippled version.

For RISC-V it was much easier because they already put a set of constraints on the available instruction formats (only 3 "base" ones; however, there is a lot that could be said about that, because it's mostly marketing), leaving plenty of room (75% of the space!) for compressed instructions.
Quote:
Quote:
cdimauro wrote:
@matthey, @hth313, and to who's interested on code density topic:
RISC-V ISA: Understanding Limitations and Methods to Improve Code Density & Performance
https://www.youtube.com/watch?v=0oyTaCC8qQs


Most of the results were relative to some previous RISC-V result so are mostly meaningless. I found the RISC-V video which came on after it more interesting.

Actually the video which I reported is the newer one. And that's why it was/is interesting: because the results come after the ones from Celio's video.
Quote:
https://www.youtube.com/watch?v=Ii_pEXKKYUg

The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V
https://people.eecs.berkeley.edu/~krste/papers/EECS-2016-130.pdf

Christopher Celio mostly looks at instruction counts and code density in his comparison of x86_64, ARM32, AArch64 and RISC-V 64G(C). He uses dynamic traces of the SPEC CINT2006 benchmark programs. RV64GC does compare pretty well to AArch64 considering the complexity of AArch64 (load/store pair with post increment needs 3 write ports to execute in a single cycle as the article points out). Some of the comparisons are unfair, like including the macro-op (instruction) fusion results in some charts with RV64GC. Other architectures are doing fusing as well so I really don't see it as an advantage for RISC-V to reduce instructions/operations to less than them.

I absolutely agree.
Quote:
There are also disadvantages to macro-op fusion like reduced code density and more difficult instruction scheduling vs ISA supported instructions and addressing modes. The main advantages are the simpler ISA scales better to low performance hardware and dependent instructions can sometimes be removed.

Indeed. The only reason they stressed the discussion so much in this direction is that they have solemnly stated that the base ISA is set in stone, and so no changes will happen there.
Quote:
He really didn't compare to anything with good code density. Thumb2 was excluded and I guess the 68k was not "popular" enough.

The 68K isn't popular anymore, so that's the reason why they haven't used it in the comparison.

However, the Thumb-2 exclusion is a clear sign of how biased the paper is towards RISC-V, and I sincerely wonder how Dave Patterson put his name on this marketing research...
Quote:
He did find that x86_64 had an average instruction length of 3.71 bytes/instruction which is poor.

Well, I've found in my stats that it's even worse: around 4 bytes/instruction.

Anyway, that is limited to a static analysis of some millions of instructions.
Quote:
That is interesting as Vince Weaver's x86_64 code optimized for size gave a very good 2.29 bytes/instruction (uses only 8 registers and the stack often). Seeing as how x86_64 compiler support is good, perhaps we can see the cost of prefixes and larger instructions which access more registers with prefixes and less memory accesses. Of course Vince's static results show a different story when optimized for size.

Which isn't normally the case.

Another thing which I don't like about their research is that they used only GCC as a compiler, and an old version (newer were available).
Quote:
Fewest Instructions
1) AArch64 (power packed instructions for a RISC ISA)
2) 68k (I know a few ISA enhancements to reduce instructions)
3) ARM32/EABI (difficult to optimize but power packed for some algorithms)
4-5) RISCV32IMC & RISCV64IMC (good for a simple ISA)
6) PPC (good before the competition showed up)
7) MIPS
8) Thumb2 (increased code density increased instructions)
9) SPARC
10) Thumb1
11) x86
12) x86_64 (optimizing for size increases instruction counts and memory traffic)
13) SH-3 (16 bit fixed length encoding was not good for performance)

Best Code Density
1) 68k (I know a few ISA enhancements to improve code density)
2) Thumb2 (great code density but 21% more instructions than the 68k)
3) Thumb1
4) SH-3 (good code density but 47% more instructions than the 68k)
5) x86 (good code density but 31% more instructions than RISCV32IMC)
6) RISCV32IMC (good for a simple ISA)
7) x86_64 (good code density but 34% more instructions than RISCV64IMC)
8) RISCV64IMC (good for a simple ISA)
9) AArch64 (powerful for RISC but mediocre code density)
10) ARM32/EABI (RISC code density didn't used to be important)
11) PPC (mediocre before the competition showed up)
12) MIPS (beat SPARC but MIPS programs are larger because of data)
13) SPARC

68K: the best of both worlds.
Quote:
Christopher's conclusion was, "Our analysis using the SPEC CINT2006 benchmark suite shows that the RISC-V ISA can be both denser and higher performance than the popular, existing commercial CISC ISAs." They obviously didn't compare to a good CISC ISA but rather to one with many trade-offs.

Yeah, it's absolutely evident that this paper was good for RISC-V marketing...
Quote:
Where would the 68k be with a good compiler and enhancements?

Ask Bebbo.
Quote:
One last interesting find in the appendix.
Quote:
400.perlbench benchmarks the interpreted Perl language with some of the more OS-centric elements removed and file I/O reduced.
Although libc_malloc and _int_free routines make an appearance for a few percent of the instruction count, the only thing of any serious note in 400.perlbench is a significant amount of the instruction count spent on stack pushing and popping. This works against RV64G as its larger register pool requires it to spend more time saving and restoring more registers. Although counter-intuitive, this can be an issue if functions exhibit early function returns and end up not needing to use all of the allocated registers.

More registers *decreased* performance in functions with early returns.

This can happen, but it's not a general rule. Don't take a single case and extend the results to the totality.
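
For illustration, a minimal 68k sketch of that pattern (a hypothetical function whose early-out path still pays for the full save/restore):

func:   movem.l d2-d7/a2-a6,-(sp)   ; save 11 callee-saved registers up front (44 bytes of stack)
        tst.l   d0
        beq.s   done                ; early return: none of the saved registers were needed
        add.l   d1,d2               ; (stand-in for the real work)
done:   movem.l (sp)+,d2-d7/a2-a6
        rts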
Quote:
Large Register File Advantages and Disadvantages
+ decreased memory traffic

Quote:
- code density

Not always.
Quote:
- transistor count

Negligible.
Quote:
- more stack space used

Negligible.
Quote:
- more registers to save during exceptions

This depends on the language implementation.
Quote:
- early return functions have more registers to save and restore

Only if you use them.

@matthey Quote:
matthey wrote:
Quote:
Hypex wrote:
I wonder how they merged it. And how transparently. If there were 32 of each what happens where?

The paper cdimauro linked gives the details (I think I originally gave him the link).

I provided the link, but it's irrelevant: you also provide a lot of interesting links.
Quote:
Quote:
And being adaptable since vector sizes change quickly. So some forward thinking to sizes would help here. Perhaps something like the 020 scale type instructions could help to support multiple widths and expansion.

I don't think vector sizes are changing quickly anymore. A mid performance SIMD unit uses 128 or 256 bit wide registers. The high performance and specialty SIMD units will probably use 512 or 1024 bit wide registers. Only high performance hardware can load a whole register worth of memory in one access to feed the SIMD monster and the register files get to be huge. The IPC does double with each doubling of the register width which is a good gain so I could be wrong but Moore's law has also ended.

It'll end in a few years.

Anyway, larger vector sizes also mean more clock skew issues. That's why Intel's high-performance SIMD implementations LOWER the clock when running intensive AVX/AVX2 and, especially, AVX-512 code.
Quote:
Supporting multiple SIMD unit sizes or flexible widths takes more encoding space. There is no free lunch.

Indeed. But a fixed (configurable at the beginning of the code) width can be a good compromise.

NutsAboutAmiga 
Re: 68k Developement
Posted on 29-Sep-2018 9:01:21
#393
Elite Member
Joined: 9-Jun-2004
Posts: 12819
From: Norway

@cdimauro

Quote:
- has more free memory available of any PPC PC can give (a bit below 2GB in total, included RTG card VRAM);


The latest Warp3D Nova supports gigabytes of video memory; it uses a 64-bit addressing GPU. AmigaOS 4.1 supports virtual memory and is able to address outside of the 2GB/4GB barrier, but only parts of the OS support this, like the RAM disk.

The OS is not 64-bit, but it knows it has more RAM; programs can see the extra memory through a looking glass.

Quote:
has dependencies (e.g.: AREXX) or because you have no PPC equivalents.


But why fix something that works? Basically it is just a scripting language. AREXX in general is a slow interface, because it uses text for scripting; AREXX can send commands to programs, and these programs will handle the commands they get natively.

Quote:
So, what's the point on using PPC PCs? Only for moving some windows on a composited screen? To be more specific, is there any KILLER application (e.g.: ONLY available on PPC PCs. And which is NOT the o.s. itself, of course) to justify the usage of such obsolete and overpriced machines?


We see a lot of development in drivers, and things like composition improvements that are used in video players, and new games.

Last edited by NutsAboutAmiga on 29-Sep-2018 at 01:13 PM.

_________________
http://lifeofliveforit.blogspot.no/
Facebook::LiveForIt Software for AmigaOS

NutsAboutAmiga 
Re: 68k Developement
Posted on 29-Sep-2018 9:05:33
#394
Elite Member
Joined: 9-Jun-2004
Posts: 12819
From: Norway

@cdimauro

Quote:
is there any KILLER application


Inventing something new that someone else doesn't have is really hard, but improving on old programs and updating them to support modern screen resolutions and modern sound cards (16-bit/24-bit instead of 8-bit sound), working on making this less dependent on chip serial number 1, and allowing it to work on any chip from any manufacturer, provided there are drivers, is the way to go forward. (New chips that are better than the chips of yesterday are being made every day.) Why get stuck with old chips?

Last edited by NutsAboutAmiga on 29-Sep-2018 at 01:09 PM.
Last edited by NutsAboutAmiga on 29-Sep-2018 at 09:06 AM.

_________________
http://lifeofliveforit.blogspot.no/
Facebook::LiveForIt Software for AmigaOS

megol 
Re: 68k Developement
Posted on 29-Sep-2018 15:38:56
#395
Regular Member
Joined: 17-Mar-2008
Posts: 355
From: Unknown

@matthey
Quote:

matthey wrote:
Quote:

megol wrote:
@matthey
Using a prefix doesn't require changing a mode, it could be supported even in a pure 32 bit design with the 64 bit extension encoding reserved. Not even extending the number of registers require a mode change but it would require the context switch code of the OS saving/restoring all registers.

Using address register sources have no advantage compared to a prefix+extended D register sources when it comes to compatibility, programs have to be updated to use both versions.


The "compatibility" of prefixes is appealing. They allow major enhancements with practically no changes to existing instruction formats and register ports. We have already discussed some of the disadvantages though.

Quote:

Using a prefix+normal instruction to initialize "constants" in the higher D registers and then using them as sources uses the same amount of bytes as using integer instructions+MOVE to A registers and then using these as sources, there is no advantage there. The later would also require more instructions potentially lower performing.
Accessing the high D registers like this also preserves the A registers which otherwise would be used as integer data.

So if there is prefix support already using An source encodings as Dn+8 sources is IMHO better.


There is no need to initialize constants in data registers and then move them to an address register.

moveq #100,d0 ; 2 bytes
move.l d0,a0 ; 2 bytes

movea.w #100,a0 ; 4 bytes

Constants greater than 127, less than -128 or 0 take the same amount of bytes of code (or less when an immediate MOVEA.W can be used) and less instructions to load directly into an address register. Fewer constants would be needed in registers as more instruction compressed immediate constants would come directly from the code.

That's why I wrote semi-constants in an earlier post; what I meant is things that are constant in a part of the code, like inner loops, but can change infrequently, like once per subroutine call.

Your examples all show real constants, where the advantage of a higher-Dn shortcut isn't really there. If there has to be some type of computation it has to be done with integer registers and moved to the address register; this requires a temporary integer register and wastes an address register. Prefixes plus a high Dn source eliminate the temporary integer register and a move, and use no address register, making them available for other uses.

Quote:

The Dn+8 registers would need new ADD, SUB and MOVE (.B/W/L/Q) encodings at least as the An encoding is being used. The MOVE instructions take a huge amount of encoding space if you plan on keeping the functionality orthogonal with existing data registers. A full prefix implementation would be cleaner but also reduce code density. You thought my opening of An register sources was ugly but I was only using the encoding for its original purpose. Mixing An and Dn+8 encodings is less ugly?

It's ugly as it isn't orthogonal and doesn't fit into the original design; that's the same ugliness for both the high-Dn and An source versions. It is shoehorning in something new in a place Motorola didn't intend to be used as such, no matter which version of the extension is selected - so for me they are both ugly, with some nicer parts depending on how one looks at them.
That some of the most useful instructions can't use this shortcut is just another reason why it is ugly IMHO. If the most used instruction (ADD) can't use this shortcut, why even bother?

I see the high-Dn as a better choice when combined with a prefix, not otherwise. I still don't really like the idea at all; perhaps those encodings could be used for something more useful?

Again, things like these require a good simulator plus compiler support to decide. Modifying the non-JIT processor emulation of (win)UAE should be possible even if I don't understand the design 100% (did a quick look); the focus would be different from the standard emulator of course. Precision while emulating everything from register files, execution pipelines and caches is important while raw performance isn't. Never done anything similar so it may be harder than I can imagine. :)

Quote:

Quote:

Nothing is ever free.

What I ask is what the use cases is for these operations are and if they will be used so often that complicating a critical part of the pipeline is worth it. Both of your examples are showing a simple type of operation that is easy to support but they aren't general.
Even the simple address modes are really much more complicated: basereg+indexreg*scale+displacement

So to update the base register one would reasonably require going through the address generation stage. Standard updates as in the existing 68k ISA need an incrementer/decrementer.
If the address generation stage takes more than one clock cycle* instructions depending on the updated base register will have to wait, worse, even those that follows (a0)+/-(a0) will** have to wait the extra cycle(s).

Or one could reserve the base register update for the simple disp+basereg address mode only, most likely as fast as executing the increment/decrement versions that have to be supported anyway. That wouldn't be orthogonal but perhaps acceptable?


I really don't want to increase the critical path slowing the pipeline. It is only the base register update with post increment which may be a problem. It may be possible to add parallel hardware which would allow the post increment? Investments in hardware which improves code density at least partially pays for itself. It is important to look at where code size increases when optimizing for performance as this is common.

The base register update by itself is not a problem. It would be possible to start at the end of a memory data structure with a base register update instruction and work backward with pre-decrement mode (as my example showed) but fetching backward in memory can be less efficient. Perhaps this is preferable though?

It should be as efficient as doing it the other way as long as the cache prefetcher detects it.

Quote:

Quote:

But RISC almost never run out of registers. That they require more registers are per design.
The CISC having to use memory data wastes energy only at best, decreasing performance at the same time at worst.


RISC almost never runs out of registers when it has 32 GP registers. RISC uses more registers and has a much higher cost when out of registers which is why 32 registers is used for high performance RISC ISAs.

No, they almost never run out of registers as almost no code requires more than the number of free registers for temporary values. Which is by choice.
But do the math: what if the RISC runs out of registers 10% of the time and then has an 80% overhead, compared to a CISC running out of registers 20% of the time (remember this isn't linear) and having a 5% overhead? Assuming similar execution resources I think (but welcome corrections) it would be something like:
RISC performance = frequency * (100%*(100%-10%) + (100%-80%)*10%) = frequency * 0.92
CISC performance = frequency * (100%*(100%-20%) + (100%-5%)*20%) = frequency * 0.99

Looks good for CISC, right? We could try to put in some more realistic numbers, which would mean the RISC running out of registers less than 1% of the time and a higher overhead for the CISC (occupying a load port). But one would also need to account for the decreased frequency of the CISC pipeline; the longer CISC pipeline also means a higher cost when mispredicting branches.

For the RISC above to perform as well as the CISC in the calculation above it has to have an 8% higher frequency: a 108MHz RISC to the 100MHz CISC. This with both exaggerated costs and frequency of register spill/fill.

Quote:

The energy efficiency of more registers is not so clear. The minimum energy use is increased as the number of active transistors increases with a larger register file. The maximum energy use is increased when out of registers but this is offset by worse code density which has to do more ICache fetches often instead of more DCache fetches some of the time. Instead of wasting transistors on a larger register file and bigger ICaches, larger DCaches might be a more effective use of transistors and I expect can provide competitive energy efficiency, at least in the case of CISC with 16 GP registers.

The main power draw of a register file is due to switching, not just existing. So if we are to compare the power draw during operation of a CISC with memory operands and a RISC with register operands, we'd have to include the overheads of the cache and the load/store unit in the equation.
When delivering operands to execution units, registers will win.

One can look at real-world high-performance CISC implementations, x86 and IBM Z, and see that they have large register files even though they support memory operands.

Quote:

Quote:

Every processor have load-use delays. The difference is that a in order CISC with load stages inlined in the execution pipeline (Pentium, 68060, Cyrix 6x86, Apollo) _always_ pays the costs even when not accessing memory. The 68060 compensated for that somewhat by executing some integer operations early reducing the effects (the Pentium did not do that). For RISC the delay is instead exposed and optional.


It looks to me like RISC designs need better instruction scheduling (OoO required for good performance) and long unrolled loops to avoid the load-use delays. This gives bloated code and has excessive register requirements. The 68060 design certainly minimizes load-use delays, is conducive to good code density and is quite powerful for an in-order CPU design.

Again look at skewed pipelines if you want to see in-order RISC designs with the same cost as for an inline CISC, a cost that is there all the time instead of optional (if the code can be scheduled correctly).
The 68060 doesn't eliminate anything - the delay is small due to the design and target clock rate. Let's compare the latency of the 68060 with that of the DEC 21164, the Alpha using a 0.5µm process compared with the 0.45µm process of the 68060.

The L1 cache size (all the 68060 has, while the Alpha adds an on-chip L2 cache plus an optional external L3 cache) is the same, 8KiB. The pipeline length is harder to compare as the Alpha has a pipelined FPU; with all parts of the pipeline included they have the same length, but in practice the Alpha has 7 stages to the 10 stages of the 68060.
For the 68060 a load hitting the cache takes two cycles, AG and OC. For the Alpha a load hitting the cache also takes two cycles, starting in one integer pipeline at S4 with the data ready at the end of S5.
The 68060 runs at a maximum of 75MHz while the Alpha 21164 reaches 300MHz, so the Alpha has a 4 times lower load-to-use latency as measured in time. Obviously part of that is that the Alpha had a completely different design goal, but it should at least illustrate that there is no inherent advantage to CISC other than (and I'm nagging ;P) instruction density and (for simple designs) somewhat easier construction.

RISC requires more scheduling than simple CISC, but that is, just like the number of registers, a design choice.
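
As a small illustration of the scheduling/unrolling point discussed above, here is a C sketch (my own example, not code from this thread) of the kind of transformation that hides load-use latency on an in-order pipeline at the cost of more registers and more code bytes:

#include <stdint.h>
#include <stddef.h>

/* Naive sum: the loaded value feeds the add on the very next operation,
 * so a simple in-order pipe eats the load-use delay every iteration. */
int32_t sum_simple(const int32_t *a, size_t n)
{
    int32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled with independent accumulators: the four dependency chains let
 * the loads be scheduled well ahead of their uses, which is what an
 * in-order RISC wants - and why it needs the extra registers. */
int32_t sum_unrolled(const int32_t *a, size_t n)
{
    int32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* remainder */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}

Whether the delay is always paid (inlined-load CISC, skewed pipelines) or only when the schedule cannot be filled is exactly the trade-off being debated above.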

Quote:

Quote:

2 LD, 1 ST is probably optimal for a standard design. The easiest way to do that would be duplicating the cache so that writes go to both blocks and each read port connects to a dedicated block.


Yes, 2 LD, 1 ST would be great as there are more loads than stores. It would make instruction scheduling very easy, which is important for an in-order CPU design.

Quote:

Didn't know about the third pipeline, guess it decreased maximum clock frequency enough not to be worthwhile in the end?


I don't know. Gunnar didn't say much about it as usual. An instruction scheduler would be more important for an in-order 3 pipe design and no compiler supports this. Instruction scheduling would be challenging without dual porting the DCache. We can see a little bit of current pipe statistics by looking at performance counter results for the Apollo core. Pipe1 executes an instruction about 60% of the time and Pipe2 about 40% with hand optimized code. Surely this includes stalls (mostly DCache misses and store buffer full stalls as branch prediction was 98% correct). The 68060 could issue instruction pairs and triplets with optimized code 50%-65% of the time but this likely does not include stalls during execution. It is impossible to compare these percentages. We don't know how much the clock frequency dropped either. An FPGA has more limited space than an ASIC so even if the 3rd pipe could execute an instruction 20% of the time it may not be worthwhile. I wouldn't be surprised if a dual ported DCache could raise the percentages 5%-10% and probably more with unoptimized legacy code. The extra pipe should be cheap for an in-order CPU but I don't know of any stats for comparison.

It would be less expensive for an out-of-order processor, which can pipeline dependency checks without a significant decrease in performance, for instance.

Quote:

Quote:

Why would a CISC have expensive register access? Even with my prefix hack the worst case is 50% code expansion. Also register files are much more energy efficient than accessing a cache.


CISC uses more encoding space. A 68k-like CPU with 16-bit instructions and 32 registers would need 7 bits for an EA, 4 for another register and 2 for a size, which leaves only 3 bits for an opcode. Tiered registers are possible with 32-bit instructions to access Dn+8. The prefix is probably the cleanest way to add 8 more data registers to the 68k without re-encoding, but maybe re-encoding would be the better choice at that point. The energy efficiency is as clear as mud, as I have already addressed above.


So maybe it's better to use 16 registers, or at least do it the 68k way with split D/A registers. 4 register bits x 2 + 2 bits for operation size + 3 bits for address mode leaves 16-(8+2+3)=3 bits for the rest of the opcode, with a maximum of 7 usable opcodes. However 2 bits for the address mode may be more practical: (An), (An+disp16), (An)+, -(An) for instance, with other modes requiring a long instruction format. Maximum 15 opcodes, which may be enough.
Or one could do it like RISC-V: a smaller number of usable registers for the small instructions, with the full 32-bit instructions having 5-bit register fields.
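
For concreteness, here is a C sketch of the 2-bit-address-mode variant of that bit budget (the field order and positions are my own arbitrary choice, purely to show that the fields fit into 16 bits; this is not an actual proposal):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical 16-bit compact format, 2-bit mode variant:
 *   [15:12] opcode  (16 encodings, e.g. 15 usable + 1 escape to 32-bit forms)
 *   [11:10] size    (.B/.W/.L/reserved)
 *   [9:8]   mode    ((An), (d16,An), (An)+, -(An))
 *   [7:4]   An      (address register of the EA)
 *   [3:0]   Dn      (data register)
 * 4+2+2+4+4 = 16 bits, matching the arithmetic in the post above. */
typedef struct { unsigned op, size, mode, an, dn; } compact16;

static compact16 decode16(uint16_t w)
{
    compact16 i;
    i.op   = (w >> 12) & 0xF;
    i.size = (w >> 10) & 0x3;
    i.mode = (w >>  8) & 0x3;
    i.an   = (w >>  4) & 0xF;
    i.dn   =  w        & 0xF;
    return i;
}

int main(void)
{
    compact16 i = decode16(0x5A73);
    printf("op=%u size=%u mode=%u An=%u Dn=%u\n", i.op, i.size, i.mode, i.an, i.dn);
    return 0;
}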

However I don't agree that code density is very important. Even small microcontrollers don't strive for maximum density anymore - there's no need. Don't read this as me saying density doesn't matter; it's just not the most important thing.

megol 
Re: 68k Developement
Posted on 29-Sep-2018 15:58:51
#396 ]
Regular Member
Joined: 17-Mar-2008
Posts: 355
From: Unknown

@cdimauro

Quote:

cdimauro wrote:
@matthey Quote:
matthey wrote:
Code density is important and becoming more competitive with new ISAs. I think it would be a mistake to decrease code density when the 68k can leverage the code density advantage.

Then I think that there's no room for prefixes. A 16-bit prefix decreases the code density too much.

Just couldn't stop myself from protesting. :)

First a prefix isn't required everywhere, even in 64 bit code 32 bit or lower sized operations are common. For 64 bit or larger operations a prefix is much cheaper than doing two or more 32 bit operations.
Second it would enable features that can decrease code size: more registers, sign and zero extensions and potentially an optional third register operand. For ADDQ/SUBQ and shift #, Dn the immediate field can be extended to 6 bits etc.

It would decrease density certainly but too much? Don't think so. Have to test to be sure.

cdimauro 
Re: 68k Developement
Posted on 29-Sep-2018 17:07:12
#397 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@megol

Quote:

megol wrote:
So maybe it's better to use 16 registers, or at least do it the 68k way with split D/A registers. 4 register bits x 2 + 2 bits for operation size + 3 bits for address mode leaves 16-(8+2+3)=3 bits for the rest of the opcode, with a maximum of 7 usable opcodes. However 2 bits for the address mode may be more practical: (An), (An+disp16), (An)+, -(An) for instance, with other modes requiring a long instruction format. Maximum 15 opcodes, which may be enough.
Or one could do it like RISC-V: a smaller number of usable registers for the small instructions, with the full 32-bit instructions having 5-bit register fields.

That's something which I did around 8 years ago, when I created my 64-bit "68K successor": a 6-bit EA for 16-bit "quick/compact" instructions, and a 7-bit EA for 32-bit "normal" instructions. However I used the extra bit for extending the mode part (to 4 bits) and not the register part.

A 68K "re-encoding" can use the actual An EA for the high Dn, while providing specific MOVE/ADD/SUB instructions for An, in order to have 16-bit opcodes which can still access the high-Dn registers in some encodings, without requiring to go for the full 32-bit opcode version.

@megol Quote:

megol wrote:
@cdimauro

Quote:

cdimauro wrote:
@matthey Quote:
matthey wrote:
Code density is important and becoming more competitive with new ISAs. I think it would be a mistake to decrease code density when the 68k can leverage the code density advantage.

Then I think that there's no room for prefixes. A 16-bit prefix decreases the code density too much.

Just couldn't stop myself from protesting. :)
Indeed. I was waiting for your reply.
Quote:
First a prefix isn't required everywhere, even in 64 bit code 32 bit or lower sized operations are common.

On x64 8 and 16-bit sized operations are very rare birds. Code is using a mix of 32-bit and 64-bit operations.
Quote:
For 64 bit or larger operations a prefix is much cheaper than doing two or more 32 bit operations.

Yes, but if you have many 64-bit instructions, then the code size will be affected by the prefix usage. Like on x64, but much worse here due to the 16-bit prefix.
Quote:
Second it would enable features that can decrease code size: more registers, sign and zero extensions and potentially an optional third register operand.

A third register is difficult to put in the prefix: it requires too much space. You have to steal a big chunk of opcode space for it.

For the rest I agree: you can put many useful operations/extensions which can be enabled by a prefix. I don't use prefixes on my ISA, but only longer opcodes when I need to extend the base instructions or add another register as a first source operand (the memory operand becomes the second source in this case. AVX docet).
Quote:
For ADDQ/SUBQ and shift #, Dn the immediate field can be extended to 6 bits etc.

Here it's enough to provide a regular 32-bit opcode.
Quote:
It would decrease density certainly but too much? Don't think so. Have to test to be sure.

Well, having numbers would be good, but I have the feeling that the code density will suffer considerably. Just my idea.

cdimauro 
Re: 68k Developement
Posted on 29-Sep-2018 17:12:39
#398 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@NutsAboutAmiga Quote:

NutsAboutAmiga wrote:
@cdimauro

Quote:
- has more free memory available than any PPC PC can give (a bit below 2GB in total, RTG card VRAM included);


The latest Warp3D Nova has support for gigabytes of video memory,

How many applications use/can use it?
Quote:
it uses a GPU with 64-bit addressing.

Yes, but still with the old GART mechanism. Modern x64 processors allow mapping PCI-Express (GPU) memory into the same virtual address space, with clear advantages (e.g.: no aperture mapping operations).
Quote:
AmigaOS4.1 support virtual memory,

It's possible even with the Amiga o.s. 3.x, using third-party applications.

Anyway virtual memory doesn't work well on OS4, from what I've read.
Quote:
and is able to address memory outside of the 2GB/4GB barrier; only parts of the OS, like the RAM disk, support this.

Yes, I know. A special ram disk which uses no Amiga address space is also possible with WinUAE, if a proper handler is created.
Quote:
The OS is not 64-bit, but it knows it has more RAM; programs can see the extra memory through a looking glass.

Only apps that use the new APIs. How many do it?
Quote:
Quote:
has dependencies (e.g.: AREXX) or because you have no PPC equivalents.


but why fix something that works? Basically it's just a script language. AREXX in general is a slow interface because it uses text for scripting; AREXX can send commands to programs, and these programs will handle the commands they get natively.

Yes, but AREXX is still used...
Quote:
Quote:
So, what's the point of using PPC PCs? Only for moving some windows on a composited screen? To be more specific, is there any KILLER application (e.g.: ONLY available on PPC PCs. And which is NOT the o.s. itself, of course) to justify the usage of such obsolete and overpriced machines?


We see a lot of development in drivers, and things like composition improvements that are used in video players, and new games.

OK, but... any Killer app?

@NutsAboutAmiga
Quote:
NutsAboutAmiga wrote:
@cdimauro

Quote:
is there any KILLER application


Inventing something new that someone else doesn't have is really hard, but improving on old programs helps: updating them to support modern screen resolutions and modern sound cards (16-bit/24-bit instead of 8-bit sound), and working on getting them less dependent on chip serial number 1.

This is something which is already doable with WinUAE.
Quote:
And allowing it to work on any chip from any manufacturer, provided there are drivers, is the way to go forward (new chips are being made every day that are better than the chips of yesterday), so why get stuck with old chips?

Yes, but... again: any killer app?

megol 
Re: 68k Developement
Posted on 29-Sep-2018 23:09:26
#399 ]
Regular Member
Joined: 17-Mar-2008
Posts: 355
From: Unknown

@cdimauro
Quote:

cdimauro wrote:
@megol
Quote:

megol wrote:
So maybe it's better to use 16 registers, or at least do it the 68k way with split D/A registers. 4 register bits x 2 + 2 bits for operation size + 3 bits for address mode leaves 16-(8+2+3)=3 bits for the rest of the opcode, with a maximum of 7 usable opcodes. However 2 bits for the address mode may be more practical: (An), (An+disp16), (An)+, -(An) for instance, with other modes requiring a long instruction format. Maximum 15 opcodes, which may be enough.
Or one could do it like RISC-V: a smaller number of usable registers for the small instructions, with the full 32-bit instructions having 5-bit register fields.

That's something which I did around 8 years ago, when I created my 64-bit "68K successor": a 6-bit EA for 16-bit "quick/compact" instructions, and a 7-bit EA for 32-bit "normal" instructions. However I used the extra bit for extending the mode part (to 4 bits) and not the register part.

A 68K "re-encoding" can use the actual An EA for the high Dn, while providing specific MOVE/ADD/SUB instructions for An, in order to have 16-bit opcodes which can still access the high-Dn registers in some encodings, without requiring to go for the full 32-bit opcode version.

Interesting idea. It would require more opcode space, but there are some places where IMHO it's possible to simplify the 68k.
Changing MOVE to D/A, EA or EA, D/A only: 3 register bits + 1 type + 2 size + 6 EA bits needs two lines instead of three.
This probably makes some people go crazy: limiting load-op-store operations saves a lot.

However this would perhaps turn the re-encoding more into creating a new limited-complexity CISC? :P

Quote:
Quote:

@megol
megol wrote:
Just couldn't stop myself from protesting. :)

Indeed. I was waiting for your reply.
Quote:
First a prefix isn't required everywhere, even in 64 bit code 32 bit or lower sized operations are common.

On x64 8 and 16-bit sized operations are very rare birds. Code is using a mix of 32-bit and 64-bit operations.
Quote:
For 64 bit or larger operations a prefix is much cheaper than doing two or more 32 bit operations.

Yes, but if you have many 64-bit instructions, then the code size will be affected by the prefix usage. Like on x64, but much worse here due to the 16-bit prefix.

True. However what I'm trying to say is that if one manipulates 64-bit data, the resulting code with prefixes will be smaller than the comparable 32-bit code; if one doesn't need 64-bit data, the prefix can be eliminated in most cases, removing overheads in some.

It would be possible to have the prefix without extra register bits and 64 extension and still be useful with zero/sign extensions of normal instructions. Better than MVS/MVZ? I think so but the initial cost is pretty high.

I'm still not sure how the most vital part of a 64 bit extension would be handled: addressing. Would like it to be possible to run 32 bit code in 64 bit mode unchanged without needing a mode switch.

Quote:

Quote:
Second it would enable features that can decrease code size: more registers, sign and zero extensions and potentially an optional third register operand.

A third register is difficult to put in the prefix: it requires too much space. You have to steal a big chunk of opcode space for it.

That's the reason I like the MVS/MVZ space: 11 free bits. Two bits for extension type (including one quadword variant) and 3 bits for register extension leaves 11-5=6 bits. Even having 4 bits for an additional register two bits are available for other things, and they would be required for some special cases.

This would provide sign and zero extension of byte, word and long operations. There is space for other variants too, for instance one wild idea would be adding SIMD operations to the normal integer operations.

ADD.B D10, D13 ; normal with prefix
ADD.BZ D10, D13 ; result is zero extended to full register width
ADD.BS D10, D13 ; result is sign extended

And perhaps ADD.BQ D10, D13 for SIMD operation on byte quantities in a quadword.
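
Purely to visualise that bit budget, here is a C sketch of how the 11 payload bits of such a prefix could be carved up (the field order and exact positions are my own guess for illustration; only the 2+3+4+2 split follows the description above):

#include <stdint.h>
#include <stdio.h>

enum ext_type { EXT_NONE, EXT_ZERO, EXT_SIGN, EXT_QUAD };  /* .x / .xZ / .xS / .xQ */

typedef struct {
    enum ext_type ext;   /* 2 bits: extension type, incl. the quadword/SIMD variant */
    unsigned reg_hi;     /* 3 bits: extra register-number bits for the prefixed op  */
    unsigned third_reg;  /* 4 bits: optional third register operand                 */
    unsigned spare;      /* 2 bits: left over for special cases                     */
} prefix_fields;

static prefix_fields decode_prefix(uint16_t w)   /* low 11 bits are the payload */
{
    prefix_fields p;
    p.ext       = (enum ext_type)((w >> 9) & 0x3);
    p.reg_hi    = (w >> 6) & 0x7;
    p.third_reg = (w >> 2) & 0xF;
    p.spare     =  w       & 0x3;
    return p;
}

int main(void)
{
    prefix_fields p = decode_prefix(0x02A5);
    printf("ext=%d hi=%u third=D%u spare=%u\n", p.ext, p.reg_hi, p.third_reg, p.spare);
    return 0;
}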

Quote:

For the rest I agree: you can put many useful operations/extensions which can be enabled by a prefix. I don't use prefixes on my ISA, but only longer opcodes when I need to extend the base instructions or add another register as a first source operand (the memory operand becomes the second source in this case. AVX docet).
Quote:
For ADDQ/SUBQ and shift #, Dn the immediate field can be extended to 6 bits etc.

Here it's enough to provide a regular 32-bit opcode.

Yes but this would be a version using the prefix as usual (would remove the new register field though) without needing additional opcode space.
And with the extra mode bits choosing if an additional register is wanted both of these would be possible:
ASL.W #63, D13
ASL.W #3, D10, D13

But it's a bit messy. Not that complicated, but messy.

Quote:

Quote:
It would decrease density certainly but too much? Don't think so. Have to test to be sure.

Well, having numbers would be good, but I have the feeling that the code density will suffer considerably. Just my idea.

That's the problem for us all I think - it's still just ideas until tested... :/

cdimauro 
Re: 68k Developement
Posted on 30-Sep-2018 7:19:42
#400 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@megol Quote:
megol wrote:
@cdimauro
Quote:
cdimauro wrote:
That's something which I did around 8 years ago, when I created my 64-bit "68K successor": a 6-bit EA for 16-bit "quick/compact" instructions, and a 7-bit EA for 32-bit "normal" instructions. However I used the extra bit for extending the mode part (to 4 bits) and not the register part.

A 68K "re-encoding" can use the actual An EA for the high Dn, while providing specific MOVE/ADD/SUB instructions for An, in order to have 16-bit opcodes which can still access the high-Dn registers in some encodings, without requiring to go for the full 32-bit opcode version.

Interesting idea. It would require more opcode space, but there are some places where IMHO it's possible to simplify the 68k.
Changing MOVE to D/A, EA or EA, D/A only: 3 register bits + 1 type + 2 size + 6 EA bits needs two lines instead of three.
This probably makes some people go crazy: limiting load-op-store operations saves a lot.

However this would perhaps turn the re-encoding more into creating a new limited-complexity CISC? :P

Well, how limited is 68K today?

At least with my proposal you're still able to gain 64-bit and 8 more data registers without using prefixes, while keeping most of the 68K advantages (code density included).

The 16-bit opcode space cannot be orthogonal as it was with the 68K, of course. Rethinking the 68K ISA needs a different mindset here: 16-bit opcodes should be seen not as regular instructions, but as compact versions of more general ones (which are 32 bits in size), like it happens on other modern ISAs. 16-bit opcodes are there to save space: period.
Quote:
Quote:
Yes, but if you have many 64-bit instructions, then the code size will be affected by the prefix usage. Like on x64, but much worse here due to the 16-bit prefix.

True. However what I'm trying to say is that if one manipulates 64-bit data, the resulting code with prefixes will be smaller than the comparable 32-bit code; if one doesn't need 64-bit data, the prefix can be eliminated in most cases, removing overheads in some.

It would be possible to have the prefix without extra register bits and 64 extension and still be useful with zero/sign extensions of normal instructions. Better than MVS/MVZ? I think so but the initial cost is pretty high.

I understand your point, but having looked at a lot of disassembled code and collected statistics (limited, OK, but at least I have some data), I can see that the 64-bit versions of the same applications (FirebirdSQL, FFMPEG, Photoshop CS6 public beta, Unreal Engine) don't take advantage of the possibility of handling 64-bit instead of 32-bit data. One evident benefit could be using 64-bit immediates; however, looking at how many MOV REG,Imm64 are found in the code leads to the conclusion that they are rare birds, and there's substantially no gain in either instruction count reduction or increased code density.

What you can see by looking at the same application compiled for 32 and 64 bits is that most of the operations are 32-bit in the first binary, with a minor amount of byte operations and rare 16-bit operations. Whereas in the second binary the operations are almost always 32 and 64-bit, with very rare byte operations and almost zero 16-bit ones; so basically there's a good mixture of 32 and 64-bit operations, which leads to decreased code density due to more prefix usage for the 64-bit operations.
Quote:
I'm still not sure how the most vital part of a 64 bit extension would be handled: addressing. Would like it to be possible to run 32 bit code in 64 bit mode unchanged without needing a mode switch.

Yes in theory; however, in practice applications compiled in 32 and 64-bit flavors have quite different mixtures of operations, as I've said before, and it would be wise to take advantage of that (if it's possible, of course).

My previous ISA versions worked as you stated, because all instructions were orthogonal in size, using a 2-bit field to specify the instruction size. So the difference between a 32-bit and a 64-bit application is that the first one simply didn't use Size=0b11.

However in the last version I changed several things and took advantage of the findings about the different behavior (instruction mixture) of 32 and 64-bit applications. This resulted in gains in both code density and available opcode space, because instruction specialization opened several opportunities in this sense (which you can clearly see by looking at my 32 and 64-bit versions of the function call example which I gave in my posts #225 & #336).

It's true that there's no free lunch, but it's better to squeeze the most out of the available meal.
Quote:
Quote:
A third register is difficult to put in the prefix: it requires too much space. You have to steal a big chunk of opcode space for it.

That's the reason I like the MVS/MVZ space: 11 free bits. Two bits for extension type (including one quadword variant) and 3 bits for register extension leaves 11-5=6 bits. Even having 4 bits for an additional register two bits are available for other things, and they would be required for some special cases.

This would provide sign and zero extension of byte, word and long operations. There is space for other variants too, for instance one wild idea would be adding SIMD operations to the normal integer operations.

ADD.B D10, D13 ; normal with prefix
ADD.BZ D10, D13 ; result is zero extended to full register width
ADD.BS D10, D13 ; result is sign extended

And perhaps ADD.BQ D10, D13 for SIMD operation on byte quantities in a quadword.

Yes, SIMD operations on integer registers can be useful on very low-end embedded systems.

Anyway, and to give a general answer to your quote, why not use a longer opcode then, instead of a prefix? You can better optimize instructions with a longer opcode, because you can almost completely get rid of the unused/not-useful encodings that come from applying the prefix to ALL existing opcodes. With the additional, clear advantage of simplifying the ISA implementation.

I know, from what I've read, that you and Matt want to extend the existing 68K ISA, and you're trying to find solutions to the problems which we have talked about. However, when an ISA reaches a critical mass of issues (and the 68K has collected many of them), then I think that it's better to completely rebuild it, keeping the good parts.

That's what I did with my x86/x64 "re-encodings". You know that both ISAs make common use of prefixes, but my ISAs have no prefixes at all, while still keeping the same possibilities AND also bringing A LOT of new features and enhancements. And they are... TRIVIAL to decode: a bit more complicated than Thumb-2 (to give an idea of the instruction formats to handle), but not that far off (only a few bits are needed from the first bytes in order to get the full instruction length, plus a lot of useful information about the instruction type and what it needs).

IMO it's worth seriously thinking about a 68K successor, because I see similar possibilities here.
Quote:
Quote:
Here it's enough to provide a regular 32-bit opcode.

Yes but this would be a version using the prefix as usual (would remove the new register field though) without needing additional opcode space.
And with the extra mode bits choosing if an additional register is wanted both of these would be possible:
ASL.W #63, D13
ASL.W #3, D10, D13

But it's a bit messy. Not that complicated, but messy.

Whereas a 32-bit ad-hoc encoding will be much easier to implement.
Quote:
Quote:
Well, having numbers would be good, but I have the feeling that the code density will suffer considerably. Just my idea.

That's the problem for us all I think - it's still just ideas until tested... :/

But I have some data and a bit of experience: see above.

You can also collect statistics about existing applications, and you'll figure out the situation for yourself.

I'm now thinking about using a profiler tool to instrument the execution of common applications, to get many more decoded instructions plus their dynamic usage (currently I only have static analysis).
This will give a lot of useful information and the possibility to generate more stats for my ISA. I still cannot make use of several enhancements which it provides, but even a 1:1 translation will be enough to show the advantages in terms of code density, and what can be achieved with proper compiler support.
