cdimauro 
Re: 68k Developement
Posted on 21-Jul-2018 19:16:39
#21

Unbelievable: an interesting (not boring) thread here. :)

@Hypex

Quote:

Hypex wrote:

As to being advanced faster than PPC, a PPC what? PPC back then or PPC now? In any case sure I think 68K could have matched anything for PPC or x86. The 68K is like an 80x86. They are both CISC designs. The 68K was a better and more modern design than the x86 ever was, so it could certainly compete with it.

True.
Quote:
We only have to look at how they hacked and refactored the old x86 design over the centuries to become the monster it is today.

Well, the burden with x86 mostly comes from the prefixes that have to be decoded before the instruction proper, which fortunately can be handled easily by the decoder (they are specific 8-bit patterns).

The addressing modes have a few exceptions, with the most notable one being the SIB byte (introduced with the 386).

So it's relatively easy to handle that stuff (scan and mark the bytes first, then sum up the findings), albeit it requires a good number of transistors (the full decoder eats about 30% of them on the Pentium; 40% on the Pentium Pro).
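
To make that concrete, here is a minimal C sketch (my own illustration, not Intel's actual decoder) of how those fixed one-byte prefix patterns can be pre-scanned before the opcode proper is examined:

static int is_x86_prefix(unsigned char b)
{
    switch (b) {
    case 0xF0:                        /* LOCK */
    case 0xF2: case 0xF3:             /* REPNE / REP */
    case 0x2E: case 0x36: case 0x3E:  /* CS, SS, DS segment overrides */
    case 0x26: case 0x64: case 0x65:  /* ES, FS, GS segment overrides */
    case 0x66:                        /* operand-size override */
    case 0x67:                        /* address-size override */
        return 1;
    default:
        return 0;
    }
}

/* Count (mark) the prefix bytes sitting in front of one instruction. */
static unsigned count_prefixes(const unsigned char *code)
{
    unsigned n = 0;
    while (is_x86_prefix(code[n]))
        n++;
    return n;
}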

But the 68K has some problems too: a 16-bit opcode with a lot of exceptions (Motorola did a dirty job hacking the opcodes to squeeze instructions in), which makes it not so simple to figure out whether an instruction has an extension word and/or an immediate, and then the length of the extension word; plus... the double-indirect memory modes.
Quote:
So, with that in mind, I certainly think the 68K could have been engineered to the same point with multi cores, 64-bit data and address bus, 64* registers, 512 bit vectors and Ghz speeds!

64-bit data requires a prefix (à la x64), which will significantly drop the code density (x64 has a 1-byte prefix; here you'd need a TWO-byte one!). Or it requires an ad-hoc execution mode which defaults to 64-bit (suppressing/replacing which size in the current instructions? Byte is widely used; Word is used a lot in 68000 code, whereas Long is used more in 68020+ code), and this will create other problems, as you can imagine.

More than 16 registers certainly requires a prefix (so: see above), and anyway there's NO space for 64 registers. In fact, considering the worst cases (MOVE Mem,Mem; bitfields; long MUL/DIV, and maybe some other instructions which use many registers), you need 4 bits just to have 16 data and 16 address registers (and to extend bitfield offsets & widths to 6 bits -> up to 63 as a value). Plus another 2 bits if you want to remove the current data/address register division (joining the x64 "dark side": all registers general purpose). That means 6 bits. Plus another bit for the 64-bit size, and you've reached 7 bits, which is quite a good portion of the free 16-bit opcodes (Mat can be more precise, since he knows the current 68K opcode encodings much better).

However "32 registers" should be enough for anybody".

Finally, adding vector (SIMD) capabilities to the 68K is not trivial, because every decision that you take will impact other aspects (limits; constraints) of the extension.

Let's say that for the FPU it's much better to drop the current implementation, like AMD did with x64 (x87 is still there and usable, but deprecated). It's too complex, with a lot of legacy stuff (yes, even the 68K FPU carries legacy burdens), it offers only 8 registers, and it cannot handle 3 (or more) operands.

It's better to focus on a new SIMD unit, since SIMDs can offer an orthogonal scalar (like current FPUs) and vector/packed data model, and you can freely mix both kinds of data. Let's say that you want to propose a good, modern SIMD unit, with enough capabilities for future needs. Many ISA lovers/architects will probably tell you that it would be good to have 32 registers and 3 (and maybe even 4) operands. Here it means that you need at least 3 * 5 = 15 bits just for encoding the operands. Add another bit for specifying scalar or packed data, and we are at 16 bits.

Do you want 128, 256, 512 (as you mentioned) and maybe 1024 bits for vector register sizes? Add another couple of bits, and we are at 18 bits. Do you want both integer and floating point operations? Add another bit -> 19 bits needed now. Which integer and FP sizes do you want to handle: 8..64 bits for int, and 16, 32, 64, and 128 (it will be introduced in some processors in the future, for sure) for FP? Another couple of bits -> 21 bits of encoding space. Let's stop here, and add some bits for the EA field encoding (we have a CISC and we want to take the most advantage of it, right?). You need only 3 bits here, because we can reuse one operand (the second source) to encode the register (or some special addressing mode). Total: 24 bits.

Now take a look at the space available in the F-line: it's not even enough for encoding such fields. Let's say that you completely absorb (reuse) the A (for packed) and F (for scalar) lines for the new SIMD unit; then you have 13 + 16 = 29 bits of total space available for the new super cool SIMD unit. It means that you have only 29 - 24 = 5 bits = 32 completely orthogonal instructions (they can be scalar or packed, int8..64 or fp16..128, 128..1024 bits in size): it's not that much, but you can do some things.

If you drop the vector size specification, sticking to a specific size (128, 256, 512, or 1024 bits; maybe selectable at runtime, using a special register) or going for a vector-length-agnostic SIMD ISA (which is the latest trend in new SIMD units: ARM presented such a SIMD specification last year, and RISC-V is finalizing one, albeit with some difficulties due to the available opcode space), then you can recover 2 bits and have up to 128 general SIMD instructions: a good number.

However, if you want to add other cool stuff, like masks (introduced by Intel with the Xeon Phi / AVX-512), you need at least one bit (the RISC-V solution, due to the already mentioned opcode space constraints) and much more if you want to specify a register for it (the Intel solution: 8 mask registers -> 3 bits needed).

TL;DR: 32 bits of base opcode space for a SIMD extension is quite small if you want to introduce a very flexible and modern one.
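
For what it's worth, here is the same bit budget redone as a tiny C program (my own tally of the numbers above, not any official encoding):

#include <stdio.h>

int main(void)
{
    int operands   = 3 * 5; /* three operands, 32 registers each                 */
    int scalar_bit = 1;     /* scalar vs packed                                  */
    int vlen_bits  = 2;     /* 128/256/512/1024-bit vector length                */
    int type_bit   = 1;     /* integer vs floating point                         */
    int elem_bits  = 2;     /* element size (8..64-bit int, 16..128-bit FP)      */
    int ea_bits    = 3;     /* reduced EA field, reusing the second source       */
    int fields     = operands + scalar_bit + vlen_bits + type_bit + elem_bits + ea_bits;
    int space      = 13 + 16; /* A-line + F-line base words plus one extension word */

    printf("field bits: %d, left for opcodes: %d -> %d instructions\n",
           fields, space - fields, 1 << (space - fields));
    printf("dropping the vector-length field: %d instructions\n",
           1 << (space - fields + vlen_bits));
    return 0;
}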
Quote:
* By the time x86 was 64-bit it matched the 16 register count of the 68K, up from 8. So why not double what x86 is doing and then double again!

Not an easy task if you also want to keep the very good thing about the 68K: code density. And I think that here you cannot win the battle, because the 68K uses opcodes which are multiples of 16 bits. For the good and... the bad, unfortunately. See above.

BTW, and as already said, x64 has 16 general purpose registers: no data/address distinction. Which is a Good Thing (especially for compilers).

cdimauro 
Re: 68k Developement
Posted on 21-Jul-2018 19:24:57
#22

@pavlor

Quote:

pavlor wrote:
@g01df1sh

68k was no longer competitive in the early 90s.

In 1993, Intel had 486DX/2 66 MHz and Pentium... Motorola 68040 33 MHz. And no, there is no magic that will make 68040 faster than these Intel CPUs, far from that.

@ErikBauer

Quote:

ErikBauer wrote:
@pavlor

In 1993 68040 could run at 40Mhz, with a comparable speed to the 486/66Mhz, too bad it was way more expensive.

By 1994 68060 came out, a brilliant competitor to the Intel Pentium, but it was too late and could not be clocked at high enough clockrate to compete with it's Intel counterpart.

No, magic could not have saved 68K family... a better engineering that could allow a better performance/cost ratio maybe could have, maybe not: Microsoft and Intel were already growing titans by then.

@pavlor

Quote:

pavlor wrote:

Comparable? Somewhat yes. 68040 will be as fast in FPU operations, faster in bus related work, but most integer code will run faster on 486DX/2.


BTW, the 80486 came out in 1989: one year before the 68040. And the same happened with the Pentium (1993) and 68060 (1994), albeit the former was immediately introduced at frequencies up to 66MHz.
@ErikBauer

Quote:

ErikBauer wrote:
@pavlor

AFAIK 68040 was far more efficient than the i486 reguarding Integer Operations, making that comparison a good toe to toe situation.

The 40MHz 68040 was absolutely not enough to catch up with the higher clock frequencies reached by the 80486s.
Quote:
But the real problems of 68k line were overheating and production costs

Only overheating I think, because Motorola and Intel used similar production processes which had similar costs.
@pavlor

Quote:

pavlor wrote:
@ErikBauer

Quote:
AFAIK 68040 was far more efficient than the i486 reguarding Integer Operations, making that comparison a good toe to toe situation.


Far more? Few % maybe, but certainly not much more. However, benchmarking different CPU architectures was not that easy back then - too few application benchmarks across the platforms and no "synthetic" benchmarks one can really believe (well there are at least Spec CPU92 numbers for both 68040 and 80486DX/2).


Can you report such numbers? It'll be much more interesting instead of useless metrics like MIPS/MFLOPS.

cdimauro 
Re: 68k Developement
Posted on 21-Jul-2018 19:36:50
#23

@Korni

Quote:

Korni wrote:
I have a machine with 486 DX/4 100. Amiga with 040/40 is way more usable and fun to use, even with no magic involved.

You're comparing apples to oranges: completely different machines, where not only the CPU but also the chipset, OS, and applications differ.

The discussion was about just the processors: 68K on one side and x86/x64 (or other processors) on the other.

A fair comparison could be a 486 PC and an Amiga 040 with the same amount of memory and the same video card, running the same AROS version and the same applications.
@JimIgou

Quote:

JimIgou wrote:
@Korni

Agreed. I didn't see the point in adopting X86 until it neared 200 MHz.
And even then I started with Cyrix and then moved to AMD.

Intel...still the enemy.

And you still haven't grown out of those childish wars.

We are talking about processors: just bits of metal. And I don't think that you're an employee of Motorola, IBM, or another 68K or PowerPC company, and/or hold shares in them.

If you still hate Intel so much, then:
- don't use devices which have DRAMs, because they were invented by Intel;
- same for EPROMs;
- USB anyone? It came from the evil (not the desert);
- ... should I continue?

Ah, BTW, Cyrix and AMD implemented IA-32 (this is the correct name of the ISA: not x86. You can figure out yourself what the I stands for), which... was invented by Intel. So, here you're just trying to shift & hide the reality...

pavlor 
Re: 68k Developement
Posted on 21-Jul-2018 19:50:06
#24

@cdimauro

Quote:
Can you report such numbers? It'll be much more interesting instead of useless metrics like MIPS/MFLOPS.


68040 33 MHz:

17.8 SpecInt92 (full report)

12.9 SpecFp92 (full report)


80486DX/2 66 MHz (without secondary cache, results with cache are higher):

32.4 SpecInt92

16.1 SpecFp92

(source; I have full reports only for results with a secondary cache)

Edit: Typo (thanks CDimauro)

Last edited by pavlor on 22-Jul-2018 at 07:59 AM.
Last edited by pavlor on 21-Jul-2018 at 07:50 PM.

cdimauro 
Re: 68k Developement
Posted on 21-Jul-2018 20:05:49
#25

@matthey

Quote:

matthey wrote:

Yes, x86 won because of market forces at that time. Not only had x86 CPUs become commodities for PCs but 3D games placed emphasis on performance over other characteristics of the CPU. The 68060 was one of the best overall CPUs for a PC in 1994.

Pentium@75MHz 80502, 3.3V, 0.6um, 3.2 million transistors, 9.5W max
68060@75MHz 3.3V, 0.6um, 2.5 million transistors, ~5.5W max *1
PPC 601@75MHz 3.3V, 0.6um, 2.8 million transistors, ~7.5W max *2

*1 estimate based on 68060@50MHz 3.9W max, 68060@66MHz 4.9W max
*2 estimate based on 601@66MHz 7W max, 601@80MHz 8W max

The 68060 is the clear winner in PPA (Power, Performance and Area) often used to evaluate embedded CPUs today. The 68060 is 42% more energy efficient and is using 21% fewer transistors compared to the most comparable in order Pentium while giving similar performance (and better performance/MHz than the OoO PPC 601). The Performance is similar between these CPUs with the Pentium only having an advantage when it was clocked up due to mass production (for embedded uses lower clock speeds and better performance/clock is more reliable and cheaper).

It isn't a fair comparison.

Intel decided NOT to drop important things (features and/or instructions) from its ISAs, like Motorola systematically did with 68K, starting from the 68030 and continuing with the 68060; I don't have to report the full list, because you know it much better than me.

Pentiums also had a fully pipelined FPU, which the 68060 lacked.

Pentiums didn't have limits as hard as the 68060's on the instructions that could be paired and executed: the latter could only pair two 16-bit instructions (4 bytes total), whereas the former had a total limit of 16 bytes, AFAIR.

Pentiums also introduced new instructions.

Pentiums also introduced new model-specific registers, including the useful TimeStamp Counter.

Pentiums also introduced new debug registers, which allowed implementing and using super-fast instruction or data breakpoints.

All of those cost transistors (= area) and power.
Quote:
Yes, the 68k ISA could be improved and made 64 bit like the x86 to x86_64. The 68k has several advantages like more free encoding space.

Motorola re-used the Size=0b11 encoding space for other instructions. So it's not possible to (naturally) use it for specifying 64-bit data. But on this topic you can take a look at the reply which I've written to Hypex.
Quote:
One of the reasons why the 68060 performed so well compared to the Pentium and PPC is that it has good code density (probably about 10% better than the x86).

Unfortunately there aren't many extensive studies comparing code densities, apart from some trivial (and not really useful) ones (e.g. the infamous LZ + Linux logo one).
Quote:
The lack of free encoding space in the x86 ISA made x86_64 code really fat. A 64 bit 68k ISA can have a larger advantage in code density over x86_64 than the 68k did over the x86. A 68k 64 bit ISA can more easily and cleanly support more powerful addressing modes used by 64 bit software as well.

I beg to differ, but for this I've already replied to Hypex as well.
Quote:
The 68k can *not* easily encode more integer registers without fattening code, adding complexity to CPU designs and making compiler support more difficult. The x86_64 really needed more than 8 GP registers as 8 requires many more accesses of memory/caches but 16 is actually a good number for most algorithms. The extra x86_64 integer registers are not free either as the instructions are bigger unlike using all 16 68k registers.

But at least they are orthogonal and all general purpose, unlike the 68K, which splits them between data and address registers, something that is harder for compilers to handle.
Quote:
It would be possible for the 68k to free up the A4-A6 registers by using PC relative addressing more and accessing the frame pointer data from the SP (more efficient use of registers).

Not an easy task, and not always possible (e.g. A6 = library base pointer).

Unfortunately, one thing the 68K is short of is address registers. Unless you resort to using one of the complex addressing modes introduced by the 68020, but that can only work as a base pointer and requires a costly (for code density) 16-bit extension word.
Quote:
It would be easier to double the number of 68k FPU registers from 8 to 16 while maintaining good compatibility.

Please get rid of this old execution unit.
Quote:
The 68k could have a SIMD unit or other units with more registers.

I've replied to Hypex on this.

Quote:
The 68k was more competitive with x86 in the early '90s than PPC is with x86_64 today.

Questionable. See pavlor's messages too.
Quote:
The 68040 had better performance per clock than the 486 but it was *not* a good design.

Why wasn't it good, in your opinion?
Quote:
The 68040 could not easily be clocked up because of heat which also added to the cost of manufacturing. The 68060 was a new improved design which solved the problem and others. It really is a great start of what should have been many CPUs based on this design

I disagree with this too: see the top of my message.
Quote:
(the 68k could have stayed in order longer than the Pentium which moved quickly to OoO).

Because OoO was (and still is) the obvious way to greatly improve performance. Something which RISCs did much earlier, and which Intel (and others) followed as well, with very good reasons.

Really, are we talking about performance and still promoting an in-order design? It's complete nonsense.

Take a look at Intel's Atoms: when they moved from the 2-way in-order to the 2-way out-of-order superpipeline, performance dramatically improved.

cdimauro 
Re: 68k Developement
Posted on 21-Jul-2018 20:39:24
#26

@Hypex

Quote:

Hypex wrote:
It could do 8, 16 or 32 operations without addon hacks. And 64-bit math by combining registers, though that's a work around, for not having 64-bit wide registers.

The 386+ had the nice SHLD/SHRD instructions for this.
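
For readers who don't know those instructions, here is a rough C illustration (my sketch, assuming a shift count between 1 and 31) of the register-pair approach: the hi-half line is exactly what a single SHLD does on a 386+.

#include <stdint.h>

typedef struct { uint32_t hi, lo; } pair64;   /* a 64-bit value kept in two 32-bit "registers" */

static pair64 shl64(pair64 v, unsigned n)     /* assumes 0 < n < 32 */
{
    pair64 r;
    r.hi = (v.hi << n) | (v.lo >> (32 - n));  /* shift hi left, filling from lo's top bits */
    r.lo = v.lo << n;
    return r;
}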
Quote:
Though it did have 80-bit in FPU which was unheard of or matched today?

The x87 FPU has been deprecated since AMD introduced its x86_64/AMD64/x64 ISA, but no other (mainstream) FPU can actually handle 80-bit FP operations.
Quote:
I almost laughed when I found x86 had byte sized instructions. Yes I know it came from 8-bit world but still.

That's the point, and it offered a very compact ISA when 8086 was introduced.

However nowadays we don't use the 8086 (only some DOS applications still need it), and not even IA-32/x86 so much, since x64 has clearly been the target for several years now.
Quote:
I suppose it can have alignment restrictions lifted.

It never had alignment restrictions. However, x64 code usually has a lot of padding (using NOPs) in order to keep branch targets 16-byte aligned.

This also affects x64 code density. But it's clearly done to improve performance.
Quote:
I, of course, purposely blew up the register count out of humour.

Well, some ISAs have more than 32 registers. Without humour. But they didn't survive.
Quote:
Which is what they are putting in the Vampire 68K. As well as a 64-bit wide register file as I understand it.

Yes.
Quote:
Though it disturbs me they are basing SIMD on Intel MMX.

Actually it's only a marketing slogan, because the two SIMDs are quite different.

I think that they used MMX for the following reasons: having 64-bit SIMD registers (like MMX, indeed), operating on integers (MMX lacked FP instructions as well), and using existing registers for them (MMX used the x87 FPU registers; Vampire made an even worse decision, using the general-purpose registers).
Quote:
And they think themselves true to the 68K.

Well, in a certain way it's true: they are aggressively patching the 68K ISA putting whatever they think is good to have. Like what Motorola did, basically.

But Motorola made a lot of mistakes with such dirty patching (see other messages)...
Quote:
I know the game. Trying to turn the Amiga into another PC. Turning the 68K a K86!

Adopting some ideas or being inspired by x86 isn't necessarily an evil thing, and it will not turn an Amiga-like platform into a PC.

Chunky/packed modes are good things, which Amiga should have had since day 0.

16-bit audio, higher frequencies, more channels: isn't it good?

A SIMD is a very good thing too.

64 bits is a must-have for a modern ISA (RISC-V even has a 128-bit variant!).

Hyperthreading too (but this also means that they aren't able to utilize the existing processor resources that well. An OoO design can use the same resources MUCH better).

However they are making a lot of mistakes too, because they clearly have no good vision for the future of the platform.

cdimauro 
Re: 68k Developement
Posted on 21-Jul-2018 20:52:12
#27

@matthey

Quote:
He needs the core to be open. This means the complete HDL source is available to him, can be modified and is patent free.

Well, this doesn't mean open in the sense we normally think of. An open ISA and/or implementation is a different thing.

You can still keep your ISA closed (at least the implementation: I don't see any advantage in NOT having a public ISA. It can happen at the beginning, if you can file some patents, but once people want/need to use it, full and public documentation is required) and the HDL sources as well, by contract (so: they should not be published, but the company which creates the ASIC has access to them).
Quote:
He does *not* need an MMU. He does need some DSP/SIMD work done. Power needs to be fairly low but it needs to be balanced with performance. He wants a small footprint and highly values code density. He would like to eventually have simpler low end (low power 32 bit CPU with DSP) and advanced high end (higher performance 64 bit CPU) sensors but most architectures and CPU designs don't scale well.

OK, it makes sense. However, why does he need an ISA with such features/constraints? Is there a specific application field? Or are we talking about an ISA that can be used in many application fields? Any concrete project which needs it?
Quote:
He was looking at a more modern SuperH (most patents expired) design but the SuperH has severe limitations and some of his information on code density was skewed which I pointed out (the 68k has better code density and performance traits when optimizing for code size).

Is he looking only at already existing ISAs?

Is there any possibility of a novel ISA which matches his desiderata/requirements?

Does he absolutely need HDL sources?

cdimauro 
Re: 68k Developement
Posted on 21-Jul-2018 21:00:55
#28

@pavlor

Quote:

pavlor wrote:

68040 33 MHz:

17.8 SpecInt92 (full report)

12.9 SpecFp92 (full report)


80486DX/2 66 MHz (without secondary cache, results with cache are higher):

32.4 SpecInt92

16.1 SpecInt92

I think that the last one should be SpecFP92, right?

Thanks for the data. So, clock-for-clock, the 68040 and 80486 had similar integer performance (the former only slightly better: +10%), whereas on floating point the 68040 is the clear winner (+60%).
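
The arithmetic behind those percentages, as a quick sketch (my own per-MHz normalization of the numbers pavlor posted, not an official metric):

#include <stdio.h>

int main(void)
{
    double int_040 = 17.8 / 33.0, int_486 = 32.4 / 66.0;   /* SpecInt92 per MHz */
    double fp_040  = 12.9 / 33.0, fp_486  = 16.1 / 66.0;   /* SpecFp92 per MHz  */

    printf("int: 68040 %.3f vs 486DX/2 %.3f (+%.0f%%)\n",
           int_040, int_486, (int_040 / int_486 - 1.0) * 100.0);
    printf("fp:  68040 %.3f vs 486DX/2 %.3f (+%.0f%%)\n",
           fp_040, fp_486, (fp_040 / fp_486 - 1.0) * 100.0);
    return 0;
}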
Quote:
(source; I have full reports only for results with a secondary cache)

Let's leave out the secondary cache: it would be unfair.

However, since the 80486 reached much higher clock rates, it is the clear winner, even if the 68040 performed better clock-for-clock on FP operations.

matthey 
Re: 68k Developement
Posted on 21-Jul-2018 21:58:50
#29

Quote:

Hypex wrote:
@matthey

Quote:
My understanding from Mitch Alsup (one of the 88k chip architects who is active on comp.arch) is that the decision to drop the 88k and go with PPC was mostly politics. There were many competitive and similar RISC architectures to choose from at the time and PPC was an attempt to standardize on a single RISC architecture (probably pushed by computer manufacturers like Apple to have multiple CPU sources). As far as I know, the 88k was only developed by Motorola. Mitch stated that it was developed as an independent project in parallel with 68060 development.


Politics. How typical. And yet, it didn't help with PPC in the end, when it died on the desktop. Even if it survived in the other markets. Which ARM has taken over just about.


It was understandable to create a new standard like PPC. The 68k was common enough to be a standard competing against the x86. After the 68k, different RISC architectures were adopted dividing the market share where economies of scale are the most important competitive factor (also a major reason for the 68k decline). PPC was an attempt to standardize on a RISC architecture but it was only partially successful as there were other popular RISC architectures like Alpha, SPARC, MIPS, PA-RISC, etc.

RISC designers thought they had enough of a hardware advantage that they didn't need to worry about economies of scale. DEC Alpha CPUs were faster than any Intel CPUs for awhile and selling the fastest CPUs available was very profitable. They thought they were winning so invested more money and turned up the clock frequencies finding the first flaw of RISC (the heat from higher clock speeds grows exponentially instead of linearly). The Alpha had been simplified to reach these clock speeds abandoning basic CPU functionality like byte operations which compilers were supposed to handle (the 2nd RISC flaw was the idea that most CPU complexity could be moved into the compiler) which slowed adoption. Intel improved their CPUs and kept selling more CPUs winning the war with economies of scale and putting DEC out of business (while acquiring most of the Alpha CPU design talent).

The PPC AIM (Apple IBM Motorola) Alliance partners took a more conservative approach to challenging Intel. They at first focused on simpler, more efficient CPU designs which did not clock up well but were successful in some areas while Intel continued to win the high margin high performance market which gaming was pushing. Motorola never made an aggressive PPC design (nothing past the G3/G4 shallow pipeline designs) so IBM stepped in to make the PPC G5 based on their high performance POWER CPUs but this had disappointing performance and lost the power efficiency advantage of RISC. It didn't come as close as the DEC Alpha had. Apple then jumped to Intel CPUs ending any threat from PPC for good.

ARM is the new RISC challenger using efficient multiple cores to try to compete in performance and leveraging economies of scale from sales in the huge embedded market (which Motorola should have done with the 68k but perhaps it was too early). Ironically, the new ARM AArch64 ISA which ARM is challenging with is similar to PPC but more modern and friendly (PPC unfriendly assembler hurt it as game optimizers preferred the ugly x86 assembler and embedded users fled to ARM, 68k/ColdFire and MIPS).

The 88k probably would have failed on its own due to too many similar RISC competitors. From what I know of the ISA (documentation is scarce), it is friendlier than the PPC but has worse code density and some oddities. Mitch Alsup said the biggest mistake was not moving to 64 bit sooner (although 64 bit was much more resource expensive back then). He is designing a RISC ISA called the 66000 now.

Quote:

Quote:
The ColdFire is a separate architecture thus *not* 68k. There have been many successful 68k based (non-ColdFire) embedded CPUs. The Dragonball was early based on the 68000 ISA while most later 68k embedded CPUs used the CPU32 ISA.


The ColdFire is referred to as ColdFire/68k in the outside world so if it is totally separate and not at all opcode compatible in any way then a fairytale has been spread. It was also considered as a replacement CPU to speed up the Amiga. As below. In any case I didn't say they were compatible.

http://www.vesalia.de/e_dragon.htm


Most of the ColdFire encodings are the same as the 68k. The ColdFire is *not* compatible with the 68k though. The ColdFire ISA could have left all the 68k encodings open to be trapped and then it would have been used for 68k compatible applications. Most of the ColdFire encodings are open on the 68k so it is possible to have good ColdFire (ISA_C without MAC) compatibility with a 68k CPU. It is common to refer to the ColdFire and 68k architectures together as they have much in common. Most compiler backends support both together. This would make it easy for most compilers to generate ColdFire instructions into 68k code. I modified the ADis disassembler to disassemble Amiga Hunk format executables with ColdFire instructions.

Quote:

I wasn't aware the 68060 could be used in a PC back then.

A Mac, yes, but a PC?


I thought about typing out "personal computer" but then PC is the same thing. Ok, PC probably has two definitions in a tech dictionary now.

PC 1. Personal Computer 2. IBM personal computer and clones or x86/x86_64 CPU computer

Quote:

Quote:
The 68060 is the clear winner in PPA (Power, Performance and Area) often used to evaluate embedded CPUs today.


If only that was to last.


The 68060 won the battle but lost the war. It was guaranteed when Motorola surrendered on the 68k despite all the successes they had with it.

Quote:

Quote:
One of the reasons why the 68060 performed so well compared to the Pentium and PPC is that it has good code density (probably about 10% better than the x86).


It could do 8, 16 or 32 operations without addon hacks. And 64-bit math by combining registers, though that's a work around, for not having 64-bit wide registers. Though it did have 80-bit in FPU which was unheard of or matched today?


The way to make the 68k 64 bit is to make the integer unit registers 64 bit wide. There is often an encoding open to add .q (quadword=64 bit) operations although it would need to be used in a 64 bit mode for maximum compatibility. It is easier and more compatible to add 64 bit to the 68k than it was to add 64 bit to the x86 with x86_64. It can be done without adding prefixes.

The x87 FPU was 80 bit also. 80 bits gave the extra precision suggested by mathematicians to practically eliminate cumulative errors. It often makes algorithms shorter which reduces the number of FPU instructions and improves code density. Few Amiga compilers keep the extra precision usually because of the lack of extended precision support and lack of an ABI which passes arguments in registers (placing extended precision arguments on the stack is very inefficient). I could not easily add full extended precision support to vbcc for these reasons. Many mathematicians and engineers want more precision than 64 bit fp and 80 bit precision is much more practical in hardware than 128 bit.
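
A small, hedged illustration of that cumulative-error point (plain C; long double maps to the 80-bit extended format on x86, and on 68k compilers that support it, though not on every platform):

#include <stdio.h>

int main(void)
{
    double d = 0.0;
    long double x = 0.0L;

    for (int i = 0; i < 10000000; i++) {
        d += 0.1;    /* 0.1 is not exactly representable; rounding error piles up */
        x += 0.1L;
    }
    printf("double sum:      %.6f\n", d);    /* drifts visibly from 1000000  */
    printf("long double sum: %.6Lf\n", x);   /* stays much closer to 1000000 */
    return 0;
}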

Quote:

I almost laughed when I found x86 had byte sized instructions. Yes I know it came from 8-bit world but still. I suppose it can have alignment restrictions lifted.


Some people thought 8 bit instructions would give better code density. Dr. Vince Weaver had a code density contest where he wrote the following in the documentation.

Quote:

x86: The x86 code is currently the smallest, mainly because I had a running contest for a while with Stephan Walter until we got it below 1k. It does help that there are a lot of useful 1-byte instructions in the x86 command set, which give it an instant advantage over all of the RISC chips.

Lack of alignment makes string manipulating programs (like ll) a lot easier, as you can store 16 and 32 bit values w/o having to worry if the string is properly aligned.


Funny. Those byte instructions are better for eating up encoding space than improving code density. Instructions which are aligned on an odd byte are often slower on newer CPUs. It is recommended to add prefix padding to make them even aligned when optimizing for performance (the decoding overhead of prefix hell is already paid for with increased latency). I ran his 68k assembler code of the contest through vasm (with optimizations) instead of GAS and all of a sudden it was about 20% better code density than the best x86 code from his contest. Most of the optimizations were immediate and displacement reductions which a compiler would do (GCC does them too but vasm does them as peephole optimizations in the assembler).

Quote:

Quote:
The 68k could have a SIMD unit or other units with more registers.


Which is what they are putting in the Vampire 68K. As well as a 64-bit wide register file as I understand it. Though it disturbs me they are basing SIMD on Intel MMX. And they think themselves true to the 68K.

I know the game. Trying to turn the Amiga into another PC. Turning the 68K a K86!


The Apollo ISA (Vampire) shares the integer registers with the SIMD unit registers. This allows for faster transfers of data between units but practically limits the SIMD unit to 64 bits as 128 bit wide integer registers are impractical. Modern SIMD units have 256 or 512 bit wide registers and the Apollo ISA does not allow updating to this. This is more of a problem than choosing MMX or SSE (patents have expired on the early versions). It is a good idea to make the SIMD unit mostly code compatible with a popular SIMD unit. It was originally going to be Altivec compatible.

Disclosure: I was one of the earliest members of the Apollo team long ago.

matthey 
Re: 68k Developement
Posted on 22-Jul-2018 1:14:33
#30

cdimauro wrote:
@matthey
Quote:

Intel decided NOT to drop important things (features and/or instructions) from its ISAs, like Motorola systematically did with 68K, starting from the 68030 and continuing with the 68060; I don't have to report the full list, because you know it much better than me.


The 68060 did not drop much more than earlier 68k CPUs. Most of what was dropped was rarely used and fine to trap. The big mistake was removing 32x32=64 which was commonly used by GCC to convert integer division by a constant into a much faster multiplication. There was more gained in the FPU with the addition of the FINT/FINTRZ than was lost from the 68040 FPU. Yes, the 68040 had deprecated most of the 6888x instructions but that was smart and still compatible.

Quote:

Pentiums had also a fully pipelined FPU, where 68060s lacked it.


The Pentium stack based FPU was faster in theory and had all the rarely used x87 instructions. I would rather have the nicer ISA non-pipelined 68060 FPU though.

Quote:

Pentiums had not so hard limits like 68060s about the instructions that can be paired and executed, which was only 2 16-bit instructions (4 bytes total) for the latter, whereas the former had a total of 16 bytes limits AFAIR.


The 68060 could only fetch 4 bytes per cycle which was partially offset by the decoupled prefetch unit and instruction buffer. The instruction buffer could run dry right after a mis-predicted branch and/or with several large instructions in a row. I believe more than 4 bytes could be pulled from the instruction buffer per cycle but I don't know enough about the intermediate internal RISC encoding. A larger instruction fetch likely would have improved performance. There were too many limitations on the instruction pairs that could be executed. I don't know if this was because of a transistor limit or simply running out of time. The 68060 seems to me to be a very good start to a design which was unfinished or limited by a transistor budget.

Quote:

Pentiums also introduced new instructions.

Pentiums also introduced new model registers, included the useful TimeStampCounter.

Pentiums also introduced new debug registers, which allowed to implement and use super-fast instruction or data breakpoints.


The 68060 could have used some ISA improvements. The ColdFire MVS/MVZ instructions would have been useful for a CPU with only a 4 byte per cycle instruction fetch and that could only forward 4 byte results.

The 68060 had less of almost everything compared to the competition. Less is often faster and more energy efficient.

Quote:

All of those take transistors ( = area) and power usage.


The units should have been gated and not drawn much power when off. There is no excuse for the Pentium requiring so much more power than the 68060.

Quote:

Quote:
Yes, the 68k ISA could be improved and made 64 bit like the x86 to x86_64. The 68k has several advantages like more free encoding space.

Motorola re-used the Size=0b11 encoding space for putting other instructions. So, it's not possible to (naturally) use it for specifying a 64-bit data. But for this topic you can specifically take a look at the reply which I've written to Hypex.


A 64 bit mode is probably the best way to retain maximum encoding space as much as Gunnar didn't like the idea. I'm not a fan of many modes either but two is tolerable if enough compatibility and encoding space is gained.

Quote:

Quote:
The lack of free encoding space in the x86 ISA made x86_64 code really fat. A 64 bit 68k ISA can have a larger advantage in code density over x86_64 than the 68k did over the x86. A 68k 64 bit ISA can more easily and cleanly support more powerful addressing modes used by 64 bit software as well.

I beg to differ, but for this I've already replied to Hypex as well.


With a 64 bit mode, I believe prefixes can be avoided. There is even enough encoding space for a mov.q (a0),(a1) which is 2 bytes. There is encoding space for the addressing modes without a prefix also.

Quote:

Quote:
The 68k can *not* easily encode more integer registers without fattening code, adding complexity to CPU designs and making compiler support more difficult. The x86_64 really needed more than 8 GP registers as 8 requires many more accesses of memory/caches but 16 is actually a good number for most algorithms. The extra x86_64 integer registers are not free either as the instructions are bigger unlike using all 16 68k registers.

But at least they are orthogonal and all general purpose, unlike 68K which split them between data and address registers, which is difficult to handle for compilers.


The split is more difficult for compilers but we can use all 16 registers without the instruction growing which gives better code density. The separation can be eased by opening up address register sources in most instructions as well. The original limitation was for more parallelization with a split register file which is outdated today. All 68k CPUs after the 68000 have had a unified register file.

Quote:

Quote:
It would be possible for the 68k to free up the A4-A6 registers by using PC relative addressing more and accessing the frame pointer data from the SP (more efficient use of registers).

Not an easy task, and not always possible (e.g.: A6 = library pointer base, for example).

Unfortunately one thing which 68Ks lack is address registers. Unless you resort to use one complex address mode introduced by 68020, but it can work only as a base pointer, and requires a costly (for code density) 16-bit word extension.


A6 can be freed too. The register used to jsr to a library would be dynamically chosen by the compiler (this is easier for compilers too). This can be done with the 68020 already but the full format encoding is big and slow. All that is lacking is a (d32,pc) addressing mode which is fast and cheap.

libcall:
mov.q (mylibbase,pc),a0 ; 6 byte instruction when (d32,pc) is used
jsr (myfuncoffset,a0) ; 4 byte instruction

myfunc:
move.q (libdata,pc),d0 ; 6 byte instruction when (d32,pc) is used
rts ; 2 bytes

The assembler/compiler will use (d32,pc) when necessary but this is now efficient rather than inefficient. It allows for a size of +- 2GB from the PC which is quite large code with good code density and smaller (d16,pc) could be used sometimes. This would require a new ABI and some merging of sections which could be MMU/XD friendly with smart organization and some padding. The full format encoding space is available for adding a less efficient (d64,pc) for completeness, compiler support and those rare cases where PIC code is needed but the program grows over this size. The small data model would be outdated saving another address register. There is no efficient coding for (d32,An) on the 68k by the way and PC relative addressing saves a base address register when it can be used. Unlike Unix/Windows style libraries, Amiga libraries do not access data in the calling program or other libraries making them a good candidate for PC relative addressing inside the library.

Quote:

Quote:
It would be easier to double the number of 68k FPU registers from 8 to 16 while maintaining good compatibility.

Please get rid of this old execution unit.


I kind of like the extended precision FPU. It is easy to use and has real advantages, most of which have been forgotten. It has limitations like not supporting vectors easily but that is what a SIMD unit is for. Granted, a good SIMD unit could possibly replace the need for a FPU altogether. The biggest argument for keeping it is compatibility but that is a good argument. It has already been instruction reduced from the 6888x and probably wouldn't be that expensive to keep.

Quote:

Quote:
The 68040 had better performance per clock than the 486 but it was *not* a good design.

Why wasn't it good for you?


It was brute force instead of finesse. The 68060 was probably too much finesse and not enough brute force though.

Quote:

Quote:
(the 68k could have stayed in order longer than the Pentium which moved quickly to OoO).

Because OoO was (and still is) the obvious way to greatly improve performance. Something which RISCs made much earlier, but Intel (and others) followed as well with very good reasons.

Really, are we talking about performances and still want to promote an in-order design? It's a complete non-sense.

Take a look at Intel's Atoms: when they moved from the 2-way in-order to the 2-way out-of-order superpipeline, performances dramatically improved.


OoO is the way to performance but it is also the way to exponentially increase design costs and energy consumption. Yes, I will talk about in order performance. The Atom did jump in performance with OoO but die adjusted energy consumption went up leaving part of the market behind where they don't compete. Sure, they probably had more customers wanting better performance and full 64 bit support as potential customers wanting energy efficiency were probably already using ARM CPUs. The Atom did not have good performance as an in order CPU but I believe some of it was due to the x86_64 ISA. Long instructions with a small energy efficient instruction fetch is not a good combination for example. In order CPUs are still important for embedded use. Was it a mistake for the Raspberry Pi to use an in order CPU? Is performance not important to the Pi because it is in order? Why wasn't an OoO Intel Atom chosen for the Pi instead?

cdimauro 
Re: 68k Developement
Posted on 22-Jul-2018 7:04:35
#31

@matthey

Quote:

matthey wrote:

DEC Alpha CPUs were faster than any Intel CPUs for awhile and selling the fastest CPUs available was very profitable. They thought they were winning so invested more money and turned up the clock frequencies finding the first flaw of RISC (the heat from higher clock speeds grows exponentially instead of linearly).

AFAIR power is linearly linked to frequency. But I'm not an electronics engineer, and I may be mistaken.
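
(For what it's worth, the usual first-order approximation for dynamic power is P ≈ C · V² · f, so at a fixed voltage it is indeed roughly linear in frequency; but higher clocks generally also require a higher voltage, which is why in practice power grows much faster than linearly when a design is clocked up.)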
Quote:
Intel improved their CPUs and kept selling more CPUs winning the war with economies of scale and putting DEC out of business (while acquiring most of the Alpha CPU design talent).

Hmm. I don't remember now whether it was because Intel won a lawsuit here.
Quote:
Motorola never made an aggressive PPC design (nothing past the G3/G4 shallow pipeline designs) so IBM stepped in to make the PPC G5 based on their high performance POWER CPUs but this had disappointing performance and lost the power efficiency advantage of RISC. It didn't come as close as the DEC Alpha had. Apple then jumped to Intel CPUs ending any threat from PPC for good.

In reality Apple was already disappointed with PowerPC performance in the late '90s, and in fact it wanted to move to Intel. Mac OS X's primary hardware platform was an Intel one in 2000 (some months before it was released!), and PowerPC was the secondary one.

Apple decided to switch back to (and stay with) PowerPC only because an ambitious IBM manager promised the infamous G5 to Jobs.

This story was reported some years ago by the Freescale CEO, who was... that IBM manager.
Quote:
Ironically, the new ARM AArch64 ISA which ARM is challenging with is similar to PPC but more modern and friendly

IMO it looks more like Alpha/MIPS than PowerPC.
Quote:
(PPC unfriendly assembler hurt it as game optimizers preferred the ugly x86 assembler and embedded users fled to ARM, 68k/ColdFire and MIPS).

Well, the x86 assembler sucks compared to the 68K one, but I definitely prefer it over many RISC assemblers. I'm not a big fan of load/store architectures.
Quote:
The 88k probably would have failed on its own due to too many similar RISC competitors. From what I know of the ISA (documentation is scarce), it is friendlier than the PPC but has worse code density and some oddities. Mitch Alsup said the biggest mistake was not moving to 64 bit sooner (although 64 bit was much more resource expensive back then). He is designing a RISC ISA called the 66000 now.

I hope he doesn't make the same mistakes which were made with the 88K, primarily using the combined register file for GP/integer and floating point operations...
Quote:
The way to make the 68k 64 bit is to make the integer unit registers 64 bit wide. There is often an encoding open to add .q (quadword=64 bit) operations although it would need to be used in a 64 bit mode for maximum compatibility. It is easier and more compatible to add 64 bit to the 68k than it was to add 64 bit to the x86 with x86_64. It can be done without adding prefixes.

You're missing the Size=0b11 encoding in many places/instructions, so you need to provide an alternative, and since there isn't much encoding space available, a prefix is the better candidate for it. Unless you like to pollute the already dirty & patched ISA, adding all the missing 64-bit encodings (good luck to the decoder which has to deal with so many exceptions).

For x64 (sorry, I prefer it instead of the longer x86_64) the way was much easier because:
- the x86 (I prefer it instead of IA-32) ISA already used prefixes;
- AMD just removed the short (1-byte) INC/DEC instructions, and used that 4-bit chunk of space for the new REX prefix (which carries both the W=64-bit data override and the 3 additional bits for the new 8 GPRs & SIMD registers).
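
As a side note, the layout of that prefix is simple enough to show in a few lines of C (a sketch of the publicly documented REX format, not anything specific to this thread):

typedef struct { int w, r, x, b; } rex_t;

static int decode_rex(unsigned char byte, rex_t *rex)   /* REX occupies 0x40-0x4F: 0100WRXB */
{
    if ((byte & 0xF0) != 0x40)
        return 0;                  /* not a REX prefix */
    rex->w = (byte >> 3) & 1;      /* W: 64-bit operand size        */
    rex->r = (byte >> 2) & 1;      /* R: extends ModRM.reg          */
    rex->x = (byte >> 1) & 1;      /* X: extends SIB.index          */
    rex->b =  byte       & 1;      /* B: extends ModRM.rm/SIB.base  */
    return 1;
}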
Quote:
Many mathematicians and engineers want more precision than 64 bit fp and 80 bit precision is much more practical in hardware than 128 bit.

It's more practical only because it's already there, but the x87 and 68K FPU aren't such good designs.

128-bit precision is also part of the future: (big) customers have been asking for it for years.
Quote:
Some people thought 8 bit instructions would give better code density. Dr. Vince Weaver had a code density contest where he wrote the following in the documentation.
[...]
Funny. Those byte instructions are better for eating up encoding space than improving code density.

Well, the code density improved because of those 1-byte instructions, as you can see if you take a look at the disassembled code, especially for the 8086 version.

So, he was correct: it's because of the byte-sized instructions that the code density was so good.

Albeit I agree that there's A LOT of waste here (in fact, in my ISA I moved many of those single-byte instructions to longer encodings, especially the rare ones).
Quote:
Instructions which are aligned on an odd byte are often slower on newer CPUs. It is recommended to add prefix padding to make them even aligned when optimizing for performance (the decoding overhead of prefix hell is already payed for with increased latency).

No, the performance problem is not really related to instructions aligned to odd bytes (which x86 and x64 naturally handle very well), but to branch targets which aren't 16-byte aligned, as I reported before.

On x64, branch targets are usually 16-byte aligned (inserting NOPs before them, if needed) for this reason, which clearly makes code density even worse.

But, as you know, performance is usually weighted more heavily than code density: you don't compile with -Os, but with -O2 (or -O3).
Quote:
ran his 68k assembler code of the contest through vasm (with optimizations) instead of GAS and all of a sudden it was about 20% better code density than the best x86 code from his contest. Most of the optimizations were immediate and displacement reductions which a compiler would do (GCC does them too but vasm does them as peephole optimizations in the assembler).

Well, you also used some tricks in the 68K code which you submitted to Dr. Weaver, like caching the addresses of some subroutines in address registers in order to use a much shorter JSR (Ax) when needed.

I used the same trick with 3 spare registers in my ISA, and I was able to get the binary down to 4 bytes less than your 68K version (which I used as the base for my version, while also transplanting some instructions from the x64 disassembly).
Quote:
The Apollo ISA (Vampire) shares the integer registers with the SIMD unit registers. This allows for faster transfers of data between units but practically limits the SIMD unit to 64 bits as 128 bit wide integer registers are impractical. Modern SIMD units have 256 or 512 bit wide registers and the Apollo ISA does not allow updating to this. This is more of a problem than choosing MMX or SSE (patents have expired on the early versions). It is a good idea to make the SIMD unit mostly code compatible with a popular SIMD unit. It was originally going to be Altivec compatible.

And it is, in some ways: it offers three-operand instructions (and a four-operand one, AFAIR) and 32 vector registers.

But I agree on the rest.

cdimauro 
Re: 68k Developement
Posted on 22-Jul-2018 8:21:47
#32

@matthey

Quote:

matthey wrote:

The 68060 did not drop much more than earlier 68k CPUs. Most of what was dropped was rarely used and fine to trap.

Yes, but because Motorola already dropped a lot of stuff from the 68040...
Quote:
The big mistake was removing 32x32=64 which was commonly used by GCC to convert integer division by a constant into a much faster multiplication. There was more gained in the FPU with the addition of the FINT/FINTRZ than was lost from the 68040 FPU. Yes, the 68040 had deprecated most of the 6888x instructions but that was smart and still compatible.

That's the point.

However, Motorola also changed its PMMU, starting from the 68030.

And no, it wasn't a smart decision, because dropping instructions even from user space hurts compatibility. You cannot always load a trap handler which emulates the missing instructions, and the Amiga is clear evidence of that (some games stopped working on 68060 systems).
Quote:
The Pentium stack based FPU was faster in theory and had all the rarely used x87 instructions. I would rather have the nicer ISA non-pipelined 68060 FPU though.

The FPU was stack based, but the Pentium allowed an FXCH instruction to execute in parallel, which mitigates the performance penalty a lot.

Anyway, a pipelined FPU is a much better performer. Maybe pavlor has SPECint and SPECfp numbers for the Pentium and 68060 as well.
Quote:
The 68060 could only fetch 4 bytes per cycle which was partially offset by the decoupled prefetch unit and instruction buffer. The instructions buffer could run dry right after a mis-predicted branch and/or with several large instructions in a row. I believe more than 4 bytes could be pulled from the instruction buffer per cycle but I don't know enough about the intermediate internal RISC encoding. A larger instruction fetch likely would have improved performance. There were too many limitations on the instructions pairs that could be executed. I don't know if this was because of a transistor limit or simply running out of time. The 68060 seems to me to be a very good start to a design which was unfinished or limited by a transistor budget.

Which is what I was talking about before.

It's quite easy to have reduced area and power consumption once you cut A LOT of things from your processor.

That's why comparing Pentiums and 68060s isn't fair: Motorola played dirty trying to catch up with the competitor (and failed anyway).
Quote:
The 68060 could have used some ISA improvements.

Yes, but you already talked about the transistor budget. How can you improve the ISA if you're continuously looking for ways to cut the transistor budget?
Quote:
The ColdFire MVS/MVZ instructions would have been useful for a CPU with only a 4 byte per cycle instruction fetch and that could only forward 4 byte results.

True.
Quote:
The 68060 had less of almost everything compared to the competition. Less is often faster and more energy efficient.

More energy efficient for sure, since you're lacking MANY things.

Faster? Well, I strongly doubt it, because you have to resort to other compromises due to the missing stuff. Let's see if pavlor can post some data here.
Quote:
The units should have been gated and not drawn much power when off. There is no excuse for the Pentium requiring so much more power than the 68060.

The Pentium was a performance-oriented processor, and Intel kept ALL backward compatibility (including the FSINCOS monster): that's enough to explain it.

In fact, it was also designed to scale, and it did quite well: 66MHz versions were introduced from the beginning, and it easily broke the 100MHz barrier after some time.

What was/is the maximum official clock speed reached by the 68060?
Quote:
A 64 bit mode is probably the best way to retain maximum encoding space as much as Gunnar didn't like the idea. I'm not a fan of many modes either but two is tolerable if enough compatibility and encoding space is gained.

Well, maybe you have to seriously think about NOT preserving binary compatibility, because otherwise you end up with a cluttered ISA with so many dirty patches. Even if you don't want to use a prefix.

That's what ARM did with its ARM64: a completely new design, albeit with some source-level compatibility.

Once you reach a certain critical point, it's better to stop and rethink.

That's what I also did with my "64-bit 68K ISA" 8 years ago, which is largely 68K source-level compatible but binary incompatible. And that's what I did with the latest ISA, which is fully x86/x64 source-level compatible, but with a totally different binary structure (not even the ModRM/SIB survived).
Quote:
With a 64 bit mode, I believe prefixes can be avoided. There is even enough encoding space for a mov.q (a0),(a1) which is 2 bytes. There is encoding space for the addressing modes without a prefix also.

How do you encode the above mov.q instruction, since MOVE Mem,Mem only supports the B/W/L sizes? You have to find some hole in the opcode table, right? So you keep patching the ISA over and over, turning it into a big patchwork, to the "pleasure" of decoder implementers and compiler writers.

Yes, you can avoid prefixes. Definitely. The price to pay: a messed-up ISA...
Quote:
A6 can be freed too. The register used to jsr to a library would be dynamically chosen by the compiler (this is easier for compilers too). This can be done with the 68020 already but the full format encoding is big and slow. All that is lacking is a (d32,pc) addressing mode which is fast and cheap.

libcall:
mov.q (mylibbase,pc),a0 ; 6 byte instruction when (d32,pc) is used
jsr (myfuncoffset,a0) ; 4 byte instruction

myfunc:
move.q (libdata,pc),d0 ; 6 byte instruction when (d32,pc) is used
rts ; 2 bytes

But you need a free address register, which is not always available (A6 may already be used for other purposes).

However, you need to store libdata somewhere in the library and use ad-hoc instructions to load it when needed, plus this (d32,pc) addressing mode. All of this hurts code density...
Quote:
The assembler/compiler will use (d32,pc) when necessary but this is now efficient rather than inefficient. It allows for a size of +- 2GB from the PC which is quite large code with good code density and smaller (d16,pc) could be used sometimes.

That's good, and it's the way many ISAs (x64 too) implement PIC.
Quote:
This would require a new ABI and some merging of sections which could be MMU/XD friendly with smart organization and some padding.

But the Amiga OS isn't MMU-friendly...
Quote:
The full format encoding space is available for adding a less efficient (d64,pc) for completeness, compiler support and those rare cases where PIC code is needed but the program grows over this size. The small data model would be outdated saving another address register.

Avoid d64 addressing modes: they are not needed! They just waste addressing-mode encoding space.

A maximum of 2GB for code is already A LOT of space. And nobody needs to address elements which lie beyond 2GB from the base address of a structure or an array.

Let's use real-world scenarios.
Quote:
There is no efficient coding for (d32,An) on the 68k by the way and PC relative addressing saves a base address register when it can be used.

Pay attention that 32-bit offsets aren't uncommon in real-world applications. And here the 68K will suffer for sure, because it requires a full extension word plus the 32-bit offset: +6 bytes.
Quote:
I kind of like the extended precision FPU. It is easy to use and has real advantages, most of which have been forgotten. It has limitations like not supporting vectors easily but that is what a SIMD unit is for. Granted, a good SIMD unit could possibly replace the need for an FPU altogether. The biggest argument for keeping it is compatibility but that is a good argument. It has already been instruction reduced from the 6888x and probably wouldn't be that expensive to keep.

No (Motorola already did a "good job" of cutting down the FPU), but you're preventing the FPU from taking advantage of the instructions introduced by the SIMD unit.

Unless you want to extend the FPU as well, which I strongly advise against.
Quote:
OoO is the way to performance but it is also the way to exponentially increase design costs and energy consumption. Yes, I will talk about in order performance. The Atom did jump in performance with OoO but die adjusted energy consumption went up leaving part of the market behind where they don't compete. Sure, they probably had more customers wanting better performance and full 64 bit support as potential customers wanting energy efficiency were probably already using ARM CPUs. The Atom did not have good performance as an in order CPU but I believe some of it was due to the x86_64 ISA. Long instructions with a small energy efficient instruction fetch is not a good combination for example. In order CPUs are still important for embedded use. Was it a mistake for the Raspberry Pi to use an in order CPU? Is performance not important to the Pi because it is in order? Why wasn't an OoO Intel Atom chosen for the Pi instead?

That's a good question, but the answer isn't related to the ISA and/or the Atom micro-architecture: it comes down to Intel's politics.

Did you know that Jobs knocked at Intel's door to get an Atom for the upcoming iPhone? But Paul Otellini made a big mistake closing that door, because he didn't want to accept very low margins.

Regarding the ISA/microarchitecture, here's a nice article (outdated, but still valid) which clearly shows the potential of the very first Atom compared to the ARM designs available at the time:
The final ISA showdown: Is ARM, x86, or MIPS intrinsically more power efficient?
http://www.extremetech.com/extreme/188396-the-final-isa-showdown-is-arm-x86-or-mips-intrinsically-more-power-efficient

As you can see, the 2-way in-order Atom can compete in performance with the most aggressive 3-way out-of-order ARM design, while consuming MUCH less energy.

Only the 2-way in-order ARM consumes much less, but it sucks in performance.

P.S. Sorry, but reviewing the writing is taking too long, so I'll refrain from fixing any spelling/syntax errors.

matthey 
Re: 68k Developement
Posted on 23-Jul-2018 7:17:16
#33 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2009
From: Kansas

Quote:

cdimauro wrote:
@matthey

Quote:

matthey wrote:

DEC Alpha CPUs were faster than any Intel CPUs for awhile and selling the fastest CPUs available was very profitable. They thought they were winning so invested more money and turned up the clock frequencies finding the first flaw of RISC (the heat from higher clock speeds grows exponentially instead of linearly).

AFAIR power is linearly linked to frequency. But I'm not an electronic engineer, and I may be mistaken.


The RISC engineers made the same mistake. The equation looks linear.

P = C x V^2 x f

The problem is that the voltage has to be increased with the frequency. Some people say there is a cubic dependency between frequency and power consumption, so reducing the clock speed by 30% reduces the power to about 35% of the original, but this may be a best case. The following is a graph showing an Intel i7 where reducing the clock speed by 14% reduced the power by 28%.

https://qph.fs.quoracdn.net/main-qimg-61820a6b86759c354d41dbbd16a0385b

A linear relationship would be a straight line but instead we have significant curvature. This is one of the reasons why lower clocked multicore CPUs are energy efficient for parallel algorithms.
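
A back-of-the-envelope C sketch of why the curve bends: with P = C x V^2 x f and the simplifying assumption that the supply voltage has to rise roughly in proportion to the clock, power ends up growing roughly with the cube of the frequency. The constants below are purely illustrative, not measurements.

#include <stdio.h>

int main(void)
{
    /* Dynamic power P = C * V^2 * f, with the assumption that V scales */
    /* linearly with f. All numbers are made up for illustration.       */
    const double v_per_ghz = 1.0;                 /* assumed: 1.0 V at 1 GHz */

    for (double f = 1.0; f <= 3.01; f += 0.5) {
        double v = v_per_ghz * f;
        double p_rel = v * v * f;                 /* relative to 1 GHz = 1.0 */
        printf("f = %.1f GHz  V = %.2f V  relative power = %.2f\n", f, v, p_rel);
    }
    return 0;
}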

Quote:

Quote:
Intel improved their CPUs and kept selling more CPUs winning the war with economies of scale and putting DEC out of business (while acquiring most of the Alpha CPU design talent).

Hmm. I don't remember now if it was because Intel won a lawsuit here.


I recall a lawsuit also. DEC was bought by Compaq which was bought by HP. HP had been working with Intel to produce the Itanium which was slated to be the x86 replacement. HP made some kind of deal like giving Intel much of the chip design assets in return for state of the art Intel fab access and guarantees to produce HP CPUs for a certain length of time. DEC and AMD had been sharing technology though.

Quote:

Quote:
Motorola never made an aggressive PPC design (nothing past the G3/G4 shallow pipeline designs) so IBM stepped in to make the PPC G5 based on their high performance POWER CPUs but this had disappointing performance and lost the power efficiency advantage of RISC. It didn't come as close as the DEC Alpha had. Apple then jumped to Intel CPUs ending any threat from PPC for good.

In reality Apple was already disappointed with PowerPC performance in the late '90s, and in fact it wanted to move to Intel. Mac OS X's primary hardware platform was an Intel one in 2000 (some months before it was released!), and PowerPC was the secondary one.


Motorola wouldn't make an aggressive or new CPU design. They kept offering that shallow pipeline weak OoO design which wouldn't clock up without die shrinks (they are still selling a variation of the same design today). The low end PPCs were even more of a disaster when they tried to mass produce and sell the 603 with 8kB caches like the 68060 (sadly they didn't have half a clue about code density after the 68k). The poor performance of low end PPCs and lack of high end PPCs from Motorola was enough to make anyone want Intel CPUs.

Quote:

Apple decided to switch back to (and stay with) PowerPC only because an ambitious IBM manager promised the infamous G5 to Jobs.

This story was reported some years ago by the Freescale CEO, who was... that same IBM manager.


I had heard that too but it was a rumor for all I knew.

Quote:

Quote:
Ironically, the new ARM AArch64 ISA which ARM is challenging with is similar to PPC but more modern and friendly

IMO it looks more like Alpha/MIPS than PowerPC.


AArch64 is *not* a reduced instruction set RISC CPU. It has many instructions, addressing modes and ops for a RISC, like PowerPC. AArch64 does have a very different feel which I like better. MIPS and Alpha are simpler and closer to the classic "minimal" RISC design.

Quote:

Quote:
(PPC unfriendly assembler hurt it as game optimizers preferred the ugly x86 assembler and embedded users fled to ARM, 68k/ColdFire and MIPS).

Well, the x86 assembler sucks compared to the 68K one, but I definitely prefer it over many RISC assemblers. I'm not a big fan of load/store architectures.


That was the consensus of the comp.arch group also. PPC assembler was created for compiler use and *not* human use. As inferior as x86 assembler is to 68k in ease of use, the majority of programmers would choose x86 assembler over PPC. PPC fans would say it didn't matter but early games like Doom and Quake were significantly better optimized for the x86. There was even a guy in the group that had done assembler optimizations for the original Quake. PPC optimizers are certainly rare and usually less knowledgeable about the PPC.

Quote:

Quote:
The 88k probably would have failed on its own due to too many similar RISC competitors. From what I know of the ISA (documentation is scarce), it is friendlier than the PPC but has worse code density and some oddities. Mitch Alsup said the biggest mistake was not moving to 64 bit sooner (although 64 bit was much more resource expensive back then). He is designing a RISC ISA called the 66000 now.

I hope he doesn't make the same mistakes which were made with the 88K, primarily using the combined register file for GP/integer and floating point operations...


I asked specifically whether he thought the combined integer/fp register file was a mistake. He did *not* think it was. There are advantages and disadvantages to combined and separate. I can see his view point and respect it. Separate register files for different units just happen to be popular right now. If integer and fp values in the same register file were so bad then why do SIMD units do it?

Quote:

Quote:
The way to make the 68k 64 bit is to make the integer unit registers 64 bit wide. There is often an encoding open to add .q (quadword=64 bit) operations although it would need to be used in a 64 bit mode for maximum compatibility. It is easier and more compatible to add 64 bit to the 68k than it was to add 64 bit to the x86 with x86_64. It can be done without adding prefixes.

You're missing the Size=0b11 encoding in many places/instructions, so you need to provide an alternative, and since there isn't much encoding space available a prefix is the best candidate for it. Unless you like polluting the already dirty & patched ISA by adding all the missing 64-bit encodings (good luck to the decoder which has to deal with so many exceptions).


A table lookup decoder like the 68060 uses can handle what looks like disorganized and unsymmetrical encodings to the human eye. So far, it doesn't even look that bad but the last few encodings may make people with OCD feel uncomfortable. Using a 64 bit mode, I have been able to cleverly recover quite a bit of space without losing much. I will probably look at re-encoding everything to compare it and see which I like better. I believe most of the op.[b/w/l/q] EA,Rn instructions can be used as is though. There are a few useful instructions that need to find new encodings. The x86_64 usually needs 2 instructions and prefixes for a simple op.q. I can't do any worse than x86_64 really. :D
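
For anyone wondering what a table lookup decoder means in practice: conceptually it is just an array indexed by the 16-bit opcode word, so irregular encodings cost table entries rather than extra decode logic. A toy C sketch (the fields and contents are invented for illustration, not the real 68060 microarchitecture):

#include <stdint.h>

/* One entry per possible 16-bit opcode word. What goes in each entry is */
/* up to the implementation; these two fields are just examples.         */
struct decode_entry {
    uint8_t ext_words;   /* how many 16-bit extension words follow       */
    uint8_t unit;        /* 0 = illegal, 1 = integer, 2 = FPU, ...       */
};

static struct decode_entry decode_table[65536];   /* filled once at init */

static inline struct decode_entry decode(uint16_t opword)
{
    /* "Messy" encodings don't slow this down; they only change what     */
    /* gets written into the table entries.                              */
    return decode_table[opword];
}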

Quote:

For x64 (sorry, I prefer it instead of the longer x86_64) the way was much easier because:
- the x86 (I prefer it instead of IA-32) ISA already used prefixes;
- AMD just removed the short (1-byte) INC/DEC instructions, and used that opcode space for the new REX prefix, whose 4 low bits carry both the W=64-bit data override and the 3 additional bits for selecting the new 8 GPR & SIMD registers.


No blame can be given for continuing a bad thing.

Quote:

Quote:
Many mathematicians and engineers want more precision than 64 bit fp and 80 bit precision is much more practical in hardware than 128 bit.

It's more practical only because it's already there, but x87 and 68K FPU aren't so good designs.

The future also lies in 128-bit precision, because (big) customers have been asking for it for years.


128 bit IEEE quad precision hardware support would be half the performance of extended precision. It is the fraction/mantissa which requires a wide and slow ALU. Quad precision has 113 bits of fraction while extended precision has only 64 bits. Extended precision increases the exponent to the same size as quad precision, which gives a huge range compared to double precision, often giving a number as a result instead of infinity, NaN or a subnormal (which often trap to a slow software handler). With extended precision and quad precision having the same exponent, I wonder if a hardware extended precision operation could be expanded to quad precision with software (hard+soft quad precision support in a library). It sounds like an interesting project but I'm tired of doing successful projects for a dead platform that nobody uses. Really, we are wasting our time even discussing anything here. I should have never responded to this thread.
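
To put rough numbers behind that: extended precision keeps the 15-bit exponent of IEEE quad but only a 64-bit significand, so the wide/slow part of the datapath stays narrow while the range is already huge. A small C check of the formats, assuming a compiler where long double is the 80-bit x87 (or 96-bit 68881) extended format, e.g. GCC on x86 or m68k:

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* double: 53 significand bits, max exponent 1024                    */
    /* extended: 64 significand bits, max exponent 16384                 */
    /* IEEE quad: 113 significand bits, same 15-bit exponent as extended */
    printf("double:      %d significand bits, max exp %d\n",
           DBL_MANT_DIG, DBL_MAX_EXP);
    printf("long double: %d significand bits, max exp %d\n",
           LDBL_MANT_DIG, LDBL_MAX_EXP);
    return 0;
}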

Quote:

Quote:
Some people thought 8 bit instructions would give better code density. Dr. Vince Weaver had a code density contest where he wrote the following in the documentation.
[...]
Funny. Those byte instructions are better for eating up encoding space than improving code density.

Well, the code density improved because of such 1 byte instructions, if you take a look at the disassembled code, especially for the 8086 version.

So, he was correct: it's because of the byte-sized instructions that the code density was so good.


Debatable. The 8086 code density was specialized for a narrow application which primarily includes text handling and stack use. The x86 and x86_64 retain this specialization which now hurts the code density for general purpose and performance applications (optimizing for performance gives poor code density). I don't know of a good code density variable length 16 bit encoding specialized for the same purpose as the 8086 for comparison. The 68k code density is more general purpose and is more useful today. Let me show you what I mean. I roughly analyzed Vince Weaver's code density results and created a spreadsheet. The code is optimized for size.

https://docs.google.com/spreadsheets/d/e/2PACX-1vTyfDPXIN6i4thorNXak5hlP0FQpqpZFk2sgauXgYZPdtJX7FvVgabfbpCtHTkp5Yo9ai6MhiQqhgyG/pubhtml?gid=909588979&single=true

When optimizing for code size on the x86, using byte sizes and the stack gives the best code density. This results in 106 memory access instructions, mostly from using the stack, vs only 48 memory access instructions for the 68k. The 68k did not need to use the stack to achieve maximum code density and it could use all 16 registers instead of 8. If you think the x86_64 having 16 registers helps, it does not, as it has *more* memory accessing instructions at 112. This was worse than any other popular ISA I evaluated. Code density specialization was good for the 8086 but it is bad for the x86 and x86_64.

Vince's 8086 code does not run in Linux, so it lacks some of the overhead of a modern OS and executable. The 68k currently has the best code density of any of the Linux versions, and Vince hasn't even updated for the changes I sent him almost a year ago now.

Note: The counts for some ISAs I'm unfamiliar with could be off. I came up with a methodology to make them correct but Vince doesn't seem to be much of a researcher for being a doctor. I kind of wonder if he even understands why I added other categories besides code density.

Quote:

Quote:
ran his 68k assembler code of the contest through vasm (with optimizations) instead of GAS and all of a sudden it was about 20% better code density than the best x86 code from his contest. Most of the optimizations were immediate and displacement reductions which a compiler would do (GCC does them too but vasm does them as peephole optimizations in the assembler).

Well, you also used some tricks in the 68K code which you submitted to Dr. Weaver, like caching the addresses of some subroutines in address registers in order to use a much shorter JSR (Ax) when needed.


Actually, Dr. Weaver did the JSR (An). It was a big code saver before I switched most functions to implicitly PC relative BSR. There were a few branches left that were out of BSR.B range so I left the JSR instructions. I think it only saved something like 4 bytes and it took some rearranging of functions to even get that.

Quote:

I used the same trick with 3 spare registers in my ISA, and I was able to get the binary to use 4 bytes less than your 68K version (which I used as the base for my version, while also transplanting some instructions from the x64 disassembly).


I was able to save a few bytes with 68k ISA enhancements also but not much. The 68k has most of what is needed for a small and simple program like this. The trick is making an enhanced ISA easier for compilers to generate code which has similar code density.

Hypex 
Re: 68k Developement
Posted on 23-Jul-2018 12:03:42
#34 ]
Elite Member
Joined: 6-May-2007
Posts: 11215
From: Greensborough, Australia

@BigD

Quote:
This seems a very similar thread to this one!


Looks like I couldn't help myself there.

Hypex 
Re: 68k Developement
Posted on 23-Jul-2018 13:50:40
#35 ]
Elite Member
Joined: 6-May-2007
Posts: 11215
From: Greensborough, Australia

@cdimauro

Quote:
Unbelievable: an interesting (not boring) thread here. :)


LOL. I find most things interesting enough, but sometimes too long. OTOH interesting tends to mean long and time consuming after a while.

Quote:
Well, the burden with x86 is mostly represented by prefixes to decode the instructions, which fortunately can be easily handled by the decoder (they are specific 8-bit patterns).


Yes, it's here that the 8-bit history becomes obvious. And also the byte-oriented format of little-endian coded instructions. In a way this should make things easier, since things can be tacked on by putting them on the LSB side.

Quote:
But 68K have some problems too: a 16 bit opcode with a lot of exceptions (Motorola did a dirty job hacking the opcodes to fit instructions) which makes not-so-simple to figure out if an instruction has a extension word and/or an immediate; then the length of the extension word; plus... the double indirect memory modes.


Well, here we have a 16-bit base instruction, but when an extra 16-bit word is added, plus data after it, things get messy.

Quote:
64-bit data requires a prefix (ala x64), which will significantly drop the code density (x64 has 1 byte prefix; here you'll need a TWO bytes one!) Or requires an ad-hoc execution mode which defaults to 64-bit (suppressing/replacing which size in the currenct instructions? Byte is widely used; Word is used a lot in 68000 code, whereas Long is used more in 68020+ code), and this will create other problems, as you can image.


A 64-bit size should use the size field in the opcode, but after a brief look I can see they didn't future-proof it and used the spare encodings for other instructions. I hate the thought of a prefix. And yes, for the 68K it would need a minimum of two bytes. And there goes your neat 16-bit base opcode. I also notice at most three bits for the register, restricting most operations to an address or data register. But I guess that was the point.

Quote:
More than 16 registers requires certainly a prefix (so: see above), and anyway there's NO space for 64 registers. In fact, to consider the worst cases (MOVE Mem, Mem; bitfields; long MUL/DIV, and maybe some other instructions which use many registers), you need 4 bits just to have 16 data and 16 address registers (and to extend bitfields offsets & widths to 6 bits -> up to 63 as value). Plus another 2 bits if you want to remove the current data/address register division (joining the x64 "dark side": all registers are general purpose). It means 6 bits. Plus another bit for the 64-bit size, and you reached 7 bits, which is quite a good portion of the free 16-bit opcodes (Mat can be more precise, since he knows much better the current 68K opcodes encodings).


Yes, it would be too many, and it would overcomplicate it. Then, do you create a new format? Or bolt onto the existing one, with say 1 or 2 MSBs for the register in one word and the normal 3-bit field in the other? Not worth it.

Quote:
However "32 registers" should be enough for anybody".




Just like PPC.

Quote:
Finally, adding vector (SIMD) capabilities to the 68K is not trivial, because every decision that you take will impact other aspects (limits; constraints) of the extension.


Oh yes, with V0 to V7, 128 bits wide. I suppose having a format similar to the FPU would keep it consistent. But again it needs opcode space.

Quote:
Do you want 128, 256, 512 (as you mentioned) and maybe 1024 bits for vector registers sizes? Add another couple of bits, and we are at 18 bits. Do you want both integer and floating point operations? Add another bit -> 19 bits needed now. Which integer and FP sizes do you want to handle: 8..64 bits for int, and 16, 32, 64, and 128 (will be introduced in some processors in the future, for sure) for FP? Another couple of bits -> 21 bits of encoding space. Let's stop here, and add some bits for the EA field encoding (we have CISCs and we want to take most of the advantage, right?). You need only 3 bits here, because we can reuse one operand (the second source) to encode the register (or some special address mode). Total: 24 bits.


Given we're using large register sizes with vectors, we might as well use an extra opcode word to encode it all in. Another possibility, if it can be done, is to refer to the entire vector register file space as one width. Say we have space for 8x128-bit vectors: that's 1024 bits for vectors. How about using that as 1024 8-bit registers, or one 1024-bit wide register, and everything in between? Even more with an increased vector width. But I believe that is the point of having SIMD vectors operating on arrays of large data sizes.

Quote:
Now take a look at the space available for the F-line: it's not even enough for encoding such fields. Let's say that you completely absorb (reuse) the A (for packed) and F (for scalar) lines for the new SIMD unit, then you have 13 + 16 = 29 bits of total space available for the new super cool SIMD unit. It means that you have only 29 - 24 = 5 bits = 32 completely orthogonal instructions (they can be scalar or packed, int8..64 or fp16..128, 128..1024 bits in size): it's not that much, but you can do some things.


Oh no! I even hate the idea of reusing FPU registers like was done on x86. Then again, IIRC PPC couldn't directly load FPU values, or had some such limitation.

Quote:
TL;DR: 32 bits of base opcode space for a SIMD extension is quite small if you want to introduce a very flexible and modern one.


It would be.

Quote:
Not an easy task if you want to keep also the very good thing of 68Ks: code density. And I think that here you cannot win the battle, because 68Ks use opcodes which are multiple of 16-bits. For the good and... the bad, unfortunately. See above.


It wouldn't be. A bit superfluous adding more registers.

Quote:
BTW, and as already said, x64 has 16 general purpose registers: no data/address distinction. Which is a Good Thing (especially for compilers).


Oh no. They de-cluttered the complex register arrangement to something simpler than the 68K and similar to PPC. I never thought the address or data specification was too bad on the 68K. They had their purpose and could be re-purposed for other uses like register storage. It organised the registers, so it was neat in that respect.

Last edited by Hypex on 23-Jul-2018 at 01:52 PM.

Hypex 
Re: 68k Developement
Posted on 23-Jul-2018 16:04:19
#36 ]
Elite Member
Joined: 6-May-2007
Posts: 11215
From: Greensborough, Australia

@cdimauro

Quote:
But at least they are orthogonal and all general purpose, unlike 68K which split them between data and address registers, which is difficult to handle for compilers.


Compilers should grow some balls.

I've written 68K myself. Not that hard. A lot easier than the 6510 register restrictions those C64 coders have to deal with and the like.

Quote:
Not an easy task, and not always possible (e.g.: A6 = library pointer base, for example).


Yes, I found that suggestion strange. I also found the use of register A7 strange in the 68K. It was like they stole the top register and used it for the stack instead of a dedicated SP, when there were USP and SSP.

It prevented ExecBase going into A7.

Quote:
Unfortunately one thing which 68Ks lack is address registers.


I don't understand this statement. A0 to A6? Address registers?

Quote:
Why wasn't it good for you?


Rather intimate question about a CPU.

Quote:
386+ had nice SHLD/SHRD for this.


Looks like it.

Quote:
x87 FPU is deprecated since AMD introduced its x86_64/AMD64/x64 ISA, but no other (mainstream) FPUs can actually handle 80-bit FP operations.


I was reading up on the 68K FPU format. Looks like wasted space. The MSB side had the sign and 15-bit exponent, with the 64-bit mantissa on the LSB side. Of course, now I can't find what I was looking at it for.

Quote:
That's the point, and it offered a very compact ISA when 8086 was introduced.


Yes, for the 8086. But I expected that for the 80286 they would have updated this. After all, they added another digit.

Quote:
It never had alignment restrictions. However x64 code usually has a lot of padding (using NOPs) in order to always have targets of branches which are at least 16 bytes-aligned.


That's what I mean, being byte based, it wasn't like 68K and not as restrictive as PPC. Though it looks like it lost some on x64.

Quote:
Well, some ISAs have more than 32 registers. Without humour. But they didn't survived.


ARM is almost there with an odd 31 and PPC is holding onto that 32 like a thread.

Quote:
Actually it's only a marketing slogan, because the two SIMDs are quite different.


So a political move? What is popular now.

Quote:
I think that they used MMX for the following reasons: having 64-bit SIMD registers (like MMX, indeed), operating on integers (MMX lacked FP instructions as well), and using existing registers for them (MMX used x87 FPU registers. Vampire made even a worse decision, using the general purpose registers).


64-bit is a bit behind for vectors now. But what GPRs? You mean the data registers?

I see they couldn't follow the AltiVec model because they don't like PPC. Or Gunnar doesn't, and he is the one calling the shots. He seems to have a passion against it.

Quote:
Well, in a certain way it's true: they are aggressively patching the 68K ISA putting whatever they think is good to have. Like what Motorola did, basically.


Sure, yeah. But I don't know for what purpose. Vectors and native 64-bit never existed on the Amiga. The Vampire is like an accelerator on steroids. Well, almost. I don't know why they want to go beyond that. The hardware is at most 32-bit in design, as is the OS. I'm all for a speed advantage and including RTG+RTA features, but bolting features onto a deprecated CPU ISA that hasn't been updated in over 20 years seems a bit superfluous.

Quote:
But Motorola did a lot of mistakes with such dirty patching (see other messages)...


Yep. And the cut-downs on the 060. Also, they changed things since the 010 with MOVE from SR and related. But MOVE from CCR was better for user code.

Quote:
Adopting some ideas or being inspired from x86 isn't necessarily an evil thing and will not turn an Amiga-like platform to a PC.


I didn't see it but adding little endian instructions wouldn't go far astray.

Quote:
Chunky/packed modes are good things, which Amiga should have had since day 0.


Would have helped on ECS. But I don't know if they could have fit it in. Should have been there in AGA. Even the Atari had a 16-bit framebuffer.

Quote:
16-bit audio, higher frequencies, more channels: isn't it good?


Yes but 16-bit audio is outdated. 24-bit is where it's at and 32-bit FP next.

Quote:
A SIMD is a very good thing too.


But what Amiga software would know what it is?

Quote:
64-bits is a must have for a modern ISA (RISC-V have even a 128-bit ISA!).


Yes maybe. The 68K did have 64-bit support in dual registers. I expected a 128-bit ISA would be here. It's about time!

Quote:
Hyperthreading too (but this also means that they aren't able to utilize so well the existing processor resources. An OoO design can use the same resource MUCH better).


That would be more to think about. Some instructions worked more efficiently in a certain order. But parallel execution is another matter.

Quote:
However they are making a lot of mistakes too, because they clearly have no good vision for the future of the platform.


The platform stopped being produced in the 90's. But is there a reason to make it more than it is? I compare this with OS4, where a lot of people wanted to run the old 68K stuff and see what games worked. Myself included. But on the Vampire, they are introducing things that add incompatibilities, from hardware conflicts with other Amiga boards to software conflicts. If you can't run Amiga software and games on a super accelerator that plugs into the real deal, then what's the point? I know certain Amiga people. They don't see the point of OS4 because the AmigaOne machines don't have an Amiga chipset and can't natively play Amiga games. And there are others that touched OS4 but then went back to 68K. We're a strange bunch.

matthey 
Re: 68k Developement
Posted on 24-Jul-2018 22:56:34
#37 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2009
From: Kansas

@cdimauro and Hypex
I didn't mean to be rude when I said I should not have replied. It has nothing to do with you guys and has everything to do with the Amiga situation which has become an intellectual black hole. It is a waste of time, talent and energy to even be here as can be seen by this ghost town of a forum and thread. Ex-Amiga people find Amiga forums every once in a while and ask similar questions about what happened to the Amiga and 68k. It would be nice if they could at least receive semi-accurate answers about the history. Maybe a FAQ would be good but the answer is really simple. The Amiga and 68k were innovative and ahead of their times but the companies who owned the technology stopped innovating. Sadly, both C= and Motorola had the innovative people to take their products much further but poor management and politics got in their way.

Quote:

Hypex wrote:
Yes, I found that suggestion strange. I also found the use of register A7 strange in the 68K. It was like they stole the top register and used it for the stack instead of a dedicated SP, when there were USP and SSP.

It prevented ExecBase going into A7.


In user mode: USP=A7=SP
In supervisor mode: SSP=A7=SP

By using a regular address register for the stack, the 68k is able to provide all the powerful addressing modes of address registers while sharing their logic. Many CPUs are more limited on the operations of the stack.

My first instinct is that it would not be efficient to have a register dedicated as a base register for ExecBase. I think it would be more effective to allow any address register to be an Amiga library base and use PC relative addressing inside the library. The compiler could use a2-a6 as library base registers if it wanted to cache multiple base registers for multiple function calls or a0-a1 for one and done function calls. I believe this would result in better code quality and be simpler for compiler support (no need to swap the optional frame pointer from a6 to a5 for the Amiga for example). It would be less confusing for Amiga programmers as they wouldn't have to worry about the small data model and loading a4 inside libraries in order to get optimum performance. It can be done right now on libraries up to 64kB using (d16,PC) and up to 4GB if a fast and small (d32,PC) was added as I suggested.

Quote:

Quote:
Unfortunately one thing which 68Ks lack is address registers.


I don't understand this statement. A0 to A6? Address registers?


He was probably talking about commonly running out of address registers before data registers. I mentioned how we can save up to 3 address registers on the Amiga.

a4 - Small data base register is unnecessary if the data can be reached with PC relative addressing
a5 - frame pointer register is unnecessary if accessing the frame on the stack
a6 - The library base register would no longer be locked to one register


Quote:

I was reading up on 68K FPU format. Looks like wasted space. MSB side had sign and 15 bit exponent with 64-bit mantissa on LSB side. Of course I can't find what I was looking at that for now.


I know what you are talking about with the big gap between the exponent and the fraction/mantissa. The padding to 96 bits was added to be 32 bit aligned (I believe Intel did *not* add the padding for their 80 bit extended format) but I can't explain why they left the big gap in the middle. To me, it would have made more sense to add the gap at the end allowing for more digits to be added to the right if more precision became fast enough in hardware. This would have then been close to the IEEE quad precision format other than the explicit vs hidden fraction/mantissa bit. It would probably be possible to add a FPCR setting to save the 96 bits so they exactly match the first 96 bits of the IEEE quad precision format. This would allow easy conversion only requiring truncation. It could be useful if adding hardware+software quad precision floating point support too.
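
For reference, the 96-bit in-memory layout being discussed is: sign (1 bit) and 15-bit exponent in the first word, then the 16 unused pad bits (the "gap"), then the 64-bit mantissa with its explicit integer bit. A rough C sketch of that layout, only to visualise where the gap sits (not something code should rely on):

#include <stdint.h>

/* 68881/68882 extended precision as stored in memory (three longwords). */
struct m68k_extended {
    uint32_t word0;   /* bit 31: sign, bits 30-16: exponent, bits 15-0: zero pad */
    uint32_t word1;   /* mantissa bits 63-32 (bit 63 = explicit integer bit)     */
    uint32_t word2;   /* mantissa bits 31-0                                       */
};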

Quote:

Quote:
Well, some ISAs have more than 32 registers. Without humour. But they didn't survived.


ARM is almost there with an odd 31 and PPC is holding onto that 32 like a thread.


RISC CPUs typically need more registers for several reasons. CISC is usually better off with 16 registers as it gives better code density and is more energy efficient. One paper's test suite showed less than 1% performance difference between 12 and 16 registers for an x86_64 CPU.

https://pdfs.semanticscholar.org/c9e7/976e3be3eed6cf843f1148f17059220c2ba4.pdf

I have another paper that evaluates RISC registers called "High-Performance Extendable Instruction Set Computing" which is not online for free. A few quotes from this paper follow.

"The availability of sixteen general-purpose registers is close to optimum"

"There is little change in either the program size or the load and store frequency as the number general-purpose registers reduces from twenty down to sixteen. However, eight registers are clearly too few as, by that stage, the frequency of load and store instructions has almost doubled."

It looks like with 16 registers it is better to look at ways to better use the registers you have than add more. Register files use a substantial amount of energy (there is a reason the Tabor CPU removed the FPU register file). Everyone wants CPUs with higher clock speeds, more caches, more bits and more registers but you may actually get something which is lower performance, is loud and runs hot. Even CPU and ISA designers (cough, cough) can be obsessed with adding more registers when studies have shown it is not worthwhile.

Quote:

Quote:
I think that they used MMX for the following reasons: having 64-bit SIMD registers (like MMX, indeed), operating on integers (MMX lacked FP instructions as well), and using existing registers for them (MMX used x87 FPU registers. Vampire made even a worse decision, using the general purpose registers).


64-bit is a bit behind for vectors now. But what GPRs? No the data registers?

I see they couldn't follow the AltiVec model because they don't like PPC. Or that Gunnar doesn't who is firing the shots. He seems to have a a passion against it.


I didn't see much bias against PPC from Gunnar. He worked with PPC quite a bit at IBM and borrowed ideas from PPC (he obviously likes the 68k better though). I don't know why he switched from Altivec to MMX but it may have been as simple as popularity. It's not like he is going to get auto-vectorization compiler support for his weird ISA any time soon so it will have to be hand laid assembler and more people are familiar with MMX.

Quote:

Quote:
Well, in a certain way it's true: they are aggressively patching the 68K ISA putting whatever they think is good to have. Like what Motorola did, basically.


Sure, yeah. But I don't know for what purpose. Vectors and native 64-bit never existed on the Amiga. The Vampire is like an accelerator on steroids. Well almost. I don't know why they want to go beyond that. The hardware is at most 32-bit in design as well as the OS. I'm a for a speed advantage and including RTG+RTA features but bolting features onto a depreciated CPU ISA that hasn't been updated in over 20 years seems a bit superfluous.


It makes sense to add all kinds of modern features especially if compatibility isn't compromised (which isn't that bad). It supports the Amiga custom chips which seems to make it more popular than PPC hardware which does not. Sure, the 64 bit addressing support is lacking and no ABI but there is no real push to add 64 bit AmigaOS support or compiler support. Granted, the ISA should have been done by a team instead of Gunnar.

Quote:

I didn't see it but adding little endian instructions wouldn't go far astray.


There was a MOVEX instruction added for LE conversion. "Condition codes: X Not affected" stated in the docs. Sigh.

Quote:

Quote:
64-bits is a must have for a modern ISA (RISC-V have even a 128-bit ISA!).


Yes maybe. The 68K did have 64-bit support in dual registers. I expected a 128-bit ISA would be here. It's about time!


Do you realize how slow a 128 bit ISA in a CPU would be? Yea, there are some large servers that may need the address space but I'll find a smaller faster server as an alternative, thank you.

Quote:

Quote:
However they are making a lot of mistakes too, because they clearly have no good vision for the future of the platform.


The platform stopped being produced in the 90's. But, is there a reason to make it more than it is? I compare this a with OS4 where a lot of people wanted to run the old 68K stuff and see what games worked. My self included. But on the Vampire, they are introducing things that adds incompatibilities. From hardware conflicts with other Amiga boards to software conflicts. If you can't run Amiga software and games on a super accelerator that plugs into the real deal then what's the point? I know certain Amiga people. They don't see the point of OS4 because the AmigaOne machines don't have an Amiga chipset and can't naively play Amiga games. And there are others that touched OS4 but then went back to 68K. We're a strange bunch.


The Amiga platform is heading toward being an emulated one. UAE users are probably the largest Amiga group at this point. Vampire hardware is probably winning vs AOS4 hardware. The groups that are shrinking have less and less money to spend and get less and less improvements to their products. They can try to raise prices but it is a losing battle. When we are only emulated then it will be too late. The only way I see to bring the Amiga back is cheap hardware which requires mass production. The reason ARM processors and boards are so cheap is they are produced by the millions for embedded applications. The way to compete with them is to emulate them (find embedded partners who can increase production numbers and lower costs). Retro tech toys have been popular lately and don't require the most advanced desktop OS either. Amiga has had plenty of opportunities but the owners and managers have never been able to get the deals done they need. Next owner please.

See, I wasted more time. I really should stop posting. It is pointless.

itix 
Re: 68k Developement
Posted on 25-Jul-2018 11:41:23
#38 ]
Elite Member
Joined: 22-Dec-2004
Posts: 3398
From: Freedom world

@matthey

Quote:

My first instinct is that it would not be efficient to have a register dedicated as a base register for ExecBase. I think it would be more effective to allow any address register to be an Amiga library base and use PC relative addressing inside the library. The compiler could use a2-a6 as library base registers if it wanted to cache multiple base registers for multiple function calls or a0-a1 for one and done function calls. I believe this would result in better code quality and be simpler for compiler support (no need to swap the optional frame pointer from a6 to a5 for the Amiga for example). It would be less confusing for Amiga programmers as they wouldn't have to worry about the small data model and loading a4 inside libraries in order to get optimum performance. It can be done right now on libraries up to 64kB using (d16,PC) and up to 4GB if a fast and small (d32,PC) was added as I suggested.


If you are porting a library from Linux/BSD you are usually forced to use the small data model. This is because they are static link libraries turned into shared libraries using compiler magic, and they must address a different data section for each user.

But instead of using A6 it could have used A0. In my experience compilers are not good at caching the library base even in a6, and they reload a6 on each call even when making calls to the same library in a row. Only hardcore 68k assembler developers care about caching.

Quote:

See, I wasted more time. I really should stop posting. It is pointless.


But informative and interesting to me

_________________
Amiga Developer
Amiga 500, Efika, Mac Mini and PowerBook

Hypex 
Re: 68k Developement
Posted on 25-Jul-2018 18:38:03
#39 ]
Elite Member
Joined: 6-May-2007
Posts: 11215
From: Greensborough, Australia

@matthey

Quote:
The Amiga and 68k were innovative and ahead of their times but the companies who owned the technology stopped innovating. Sadly, both C= and Motorola had the innovative people to take their products much further but poor management and politics got in their way.


That's it in a nutshell. And today almost everyone has been driven out. To the point there isn't much competition anymore. Not in the way of different competing technologies. I mean, AMD and Intel, though they compete, are producing the same kind of CPU. They could be seen as competing against each other with a clone of an x86-64 ISA CPU. And AMD against NVidia. Looks different from the outside, but likely is more similar inside than it looks. But like having two fish and chip shops on the same strip of shops. Doing almost the exact same thing which looks pointless. Rather than a diverse alternative.

Quote:
By using a regular address register for the stack, the 68k is able to provide all the powerful addressing modes of address registers while sharing their logic. Many CPUs are more limited on the operations of the stack.


Yes that's true. It just looked strange learning about ASM back in the day, when it was referred to as SP in one section and A7 in another. The SP really just being an alias for A7. For hardcore coding A7 can still be used as regular register if it's saved and no stack is used in a routine.

Quote:
My first instinct is that it would not be efficient to have a register dedicated as a base register for ExecBase. I think it would be more effective to allow any address register to be an Amiga library base and use PC relative addressing inside the library.


This would be hard as most functions were in ROM unless patched. Code should be kept together and usually is. Then there are strings. But ROM routines wouldn't have been able to extract a base from PC relative addressing. Exec could have loaded ExecBase from $4, but that's all I can see. I've read PC relative addressing is faster though.

Quote:
He was probably talking about commonly running out of address registers before data registers. I mentioned how we can save up to 3 address registers on the Amiga.


Yes, I can see that. Though there were only 8 data registers as well. The global, local and base system is entrenched in AmigaOS. Knocking out locals is easy enough. The globals would have needed to be before or after the code for PC relative to work.

Quote:
I know what you are talking about with the big gap between the exponent and the fraction/mantissa. The padding to 96 bits was added to be 32 bit aligned (I believe Intel did *not* add the padding for their 80 bit extended format) but I can't explain why they left the big gap in the middle.


It does make more sense to fill in the gap and adhere to an existing standard.

Quote:
It looks like with 16 registers it is better to look at ways to better use the registers you have than add more. Register files use a substantial amount of energy (there is a reason the Tabor CPU removed the FPU register file). Everyone wants CPUs with higher clock speeds, more caches, more bits and more registers but you may actually get something which is lower performance, is loud and runs hot. Even CPU and ISA designers (cough, cough) can be obsessed with adding more registers when studies have shown it is not worthwhile.


That's interesting then. I saw no need to remove the FPU from the Tabor. Would have been best I think to stick a normal CPU on there and people would be using it now.

Quote:
I didn't see much bias against PPC from Gunnar. He worked with PPC quite a bit at IBM and borrowed ideas from PPC (he obviously likes the 68k better though). I don't know why he switched from Altivec to MMX but it may have been as simple as popularity. It's not like he is going to get auto-vectorization compiler support for his weird ISA any time soon so it will have to be hand laid assembler and more people are familiar with MMX.


I just got that impression. There are stats like how the Vampire is faster than an AmigaOne XE at memory copy. And other things I read just seemed to be negative against PPC. Some people in the community just don't like PPC and think it never belonged on the Amiga. But there are others who also think it should have gone Intel ASAP. I was reading Jim Drew of Fusion fame was coding some stuff for Vampire and thinks PPC is a garbage ISA. That's a bit harsh. He won't be porting his software to OS4 anytime soon. LOL.

About MMX, if you are familiar with x86 ASM and code in it as well as 68K ASM, then sure. But really, would Amiga guys actually code MMX? Would anyone even bother coding MMX by hand aside from some hardcore x86 demo coders? ASM is the way of the past except for isolated incidents.

Quote:
Sure, the 64 bit addressing support is lacking and no ABI but there is no real push to add 64 bit AmigaOS support or compiler support. Granted, the ISA should have been done by a team instead of Gunnar.


Okay, so it's just 64-bit integer? Like that 32-bit Tabor CPU with 36-bit addressing.

Well there could be an OS patch then. MOVEM.Q real quick. MULU.Q. MOVEQ.Q gets confusing though.

Gunnar looks like more of a Dunnar than a Gunna.

Quote:
There was a MOVEX instruction added for LE conversion. "Condition codes: X Not affected" stated in the docs. Sigh.


Okay, hmmm.

MOVE cross?

Quote:
Do you realize how slow a 128 bit ISA in a CPU would be? Yea, there are some large servers that may need the address space but I'll find a smaller faster server as an alternative, thank you


No, but it's the next move up. 64-bit has been here for a while. For so long I thought it would be classed as obsolete soon. The FPU was 64-bit for a long time. Hey, the 68K and x86 FPUs have loved greater-than-64-bit for a long time. Vectors were 128-bit years ago and are now massively increasing. Soon they will be picking up the slack.

Of course, soon they would need to use serial data and address bus lines, lest they run out of physical space. SATA is faster than PATA, even though PATA used parallelism. But I think CPU cores are still too fast before they can bolt serial interfaces on. Unless they are working on it.

Given how things are I wonder if a fresh CPU design would do much better? PPC is old now even. What would they design for the modern age? I've put ideas in my head for a new ISA, how many bits to support, with size of opcodes and all that.

Quote:
The Amiga platform is heading toward being an emulated one. UAE users are probably the largest Amiga group at this point. Vampire hardware is probably winning vs AOS4 hardware.


I'm the odd one out in my computer clubs. They've all embraced the Vampire. I stick to my A4000/060/RTG and AmigaOne at home. But they are for different markets. Vampire for Amiga people who can't move on and don't seem to think the Amiga should have moved into the 21st century. Then there are the OS4 people, who welcomed moving the Amiga forward but unfortunately are stuck at an earlier point in the 21st century.

Quote:
See, I wasted more time. I really should stop posting. It is pointless.


I agree. I've done it again.

matthey 
Re: 68k Developement
Posted on 25-Jul-2018 20:09:57
#40 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2009
From: Kansas

Quote:

itix wrote:
If you are porting library from Linux/BSD you are usually forced to small data model. This because they are static link libraries turned into shared libraries using compiler magic and must address different data section for each user.


It is usually recommended to statically link .so/.dll libraries at compile time on the Amiga. It is more efficient as it removes the GOT access overhead during execution (using 32 bit absolute addressing is faster than going through an offset table). Programs will become larger when linking at compile time but most modern ports would already be too large to use SD themselves. Maybe you were hoping to share code but .so/.dll libraries have been much inferior to Amiga libraries at sharing. Version and security issues often keep .so/.dll libraries from being shared in Linux/BSD/Windows also. It is possible to use PC relative addressing inside of .so/.dll libraries by *not* sharing the libraries between programs and merging all sections (create a new version of the library for every program accessing it, at least when there is a writable data section). Take a look at this old OpenBSD security mitigation slideshow which shows a .so library merged in the memory map (on the 11th slide but worth reading the whole slideshow).

https://www.openbsd.org/papers/auug04/mgp00001.html

Merging all the sections including the data section implies a new instance of the .so library for each process. Were you hoping to share the code of .so/.dll libraries where their inferior libraries are not even shared on their native machines?

PC relative addressing becomes more important when moving to a 64 bit ISA as 64 bit absolute addressing is very wasteful. A (d32,PC) addressing mode is half the size of a 64 bit absolute addressing mode while being able to access up to 4GB of data. The 68k also already has the implicitly PC relative branch instructions necessary and open encodings to easily (and more efficiently than x86_64) scale the whole 64 bit address range often using fewer instructions and with better code density than any other popular 64 bit ISA. Some changes would make this better including merging sections in programs, merging Amiga library sections with the library structure, enabling PC relative writes, enhancing any MMU to have a per page NX/XD bit, etc. It makes sense to design for security at the same time.
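
To make the size argument concrete: a signed 32-bit displacement from the PC reaches +-2GB around the instruction, using half the extension bytes a 64-bit absolute address would need. A trivial C sketch of the effective address calculation (names are mine):

#include <stdint.h>

/* (d32,PC): effective address = PC + sign-extended 32-bit displacement. */
/* A 4-byte extension covers +-2GB; a 64-bit absolute address needs 8.   */
static inline uint64_t ea_d32_pc(uint64_t pc, int32_t d32)
{
    return pc + (int64_t)d32;
}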

Quote:

But instead of using A6 it could have used A0. In my experience compilers are not good at caching library base even in a6 and they reload a6 on each call even when making calls to same library in a row. Only hardcore 68k assembler developers care about caching.


There have been compilers and versions of compilers which did a better job of handling a6 without reloading it constantly. It may be sometimes caused by the register being locked to a6. At the very least, being able to use different address registers as a library base reduces register contention and shuffling. GCC does better with assembler inlines which allow GCC to choose the registers than a human choosing specific registers to be used.

