matthey 
Re: 68k Developement
Posted on 25-Jul-2018 22:45:17
#41 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2015
From: Kansas

Quote:

Hypex wrote:
Yes that's true. It just looked strange learning about ASM back in the day, when it was referred to as SP in one section and A7 in another. The SP really is just an alias for A7. For hardcore coding A7 can still be used as a regular register if it's saved and no stack is used in a routine.


Yes, but a7 aligns writes to 16 bits. It is probably possible to have configuration bits that set this alignment for better orthogonality (no alignment), ColdFire compatibility (32 bit alignment), etc.

Quote:

Quote:
My first instinct is that it would not be efficient to have a register dedicated as a base register for ExecBase. I think it would be more effective to allow any address register to be an Amiga library base and use PC relative addressing inside the library.


This would be hard as most functions were in ROM unless patched. Code should be kept together and usually is. Then there are strings. But ROM routines wouldn't have been able to extract a base from PC relative addressing. Exec could have loaded ExecBase from $4 but that's all I can see. I've read PC relative addressing is faster though.


ROM is addressable and code can be executed directly from ROM using PC relative. Address 4 is the only address accessed for a library base in ROM also. Other libraries are opened as usual using Exec functions. Text is commonly in the same section as code so PC relative addressing can be used. There is already quite a bit of PC relative addressing used. It is odd that more effort was not put into making it easier for Amiga libraries to be PC relative for internal accesses.

Quote:

Quote:
He was probably talking about commonly running out of address registers before data registers. I mentioned how we can save up to 3 address registers on the Amiga.


Yes I can see that. Though there were only 8 data registers as well. The global, local and base system is entrenched in AmigaOS. Knocking out locals is easy enough. The globals would have needed to be before or after the code for PC relative to work.


Local variables are usually on the stack. Yes, PC relative requires merging sections as in the slideshow I posted for itix. Variable accesses in Amiga libraries are mostly reads while global variables in programs use more writes where opening up PC relative writes for the 68k would be highly advantageous. Some purists don't like the idea but x86_64 RIP relative addressing has shown the benefit of PC relative writes.

Quote:

That's interesting then. I saw no need to remove the FPU from the Tabor. Would have been best I think to stick a normal CPU on there and people would be using it now.


Removing the FPU gave little advantage compared to turning off the FPU but it is possible an embedded application needed extreme energy efficiency. I wonder how many man hours were spent on the FPU emulation for Tabor.

Quote:

I just got that impression. There are stats like how the Vampire is faster than an AmigaOne XE at memory copy. And other things I read just seemed to be negative against PPC. Some people in the community just don't like PPC and think it never belonged on the Amiga. But there are others who also think it should have gone Intel ASAP. I was reading Jim Drew of Fusion fame was coding some stuff for Vampire and thinks PPC is a garbage ISA. That's a bit harsh. He won't be porting his software to OS4 anytime soon. LOL.


Gunnar does know PPC well including the weaknesses. There is a certain level of competition as PPC CPUs set the mark in performance for real Amiga hardware. The Apollo core in a low end FPGA can surprisingly outperform some faster clocked PPC CPUs in some benchmarks. A CPU in an FPGA is at a large disadvantage.

I expect the PPC is one of the most hated architectures of all time. There are people who love it too but if you did a poll like is done with politicians (love, like, dislike, hate), it would not poll well. We saw how the hate factor played out in the last U.S. Presidential election which seemed to be a contest of which candidate was least hated. I think the 68k would poll as one of the most loved and least hated architectures of all time. I think x86_64 would beat PPC as well.

Quote:

About MMX, if you are familiar with x86 ASM and code in it as well as 68K ASM, then sure. But really, would Amiga guys actually code MMX? Would anyone even bother coding MMX by hand aside from some hardcore x86 demo coders? ASM is the way of the past except for isolated incidents.


You might be surprised at how much SIMD code still comes from humans (entirely or at least tweaked). Auto-vectorization has improved to the point where it usually gives good results but a human can usually do better. The places where SIMD is used are very important for performance.

Quote:

Quote:
Sure, the 64 bit addressing support is lacking and no ABI but there is no real push to add 64 bit AmigaOS support or compiler support. Granted, the ISA should have been done by a team instead of Gunnar.


Okay, so it's just 64-bit integer? Like that 32-bit Tabor CPU with 36-bit addressing.

Well there could be an OS patch then. MOVEM.Q real quick. MULU.Q. MOVEQ.Q gets confusing though.


There is support for some integer 64 bit data operations but limited support for 64 bit addressing (accessing addresses above 4GB). I don't know of any new addressing modes. The biggest reason to move to 64 bit is for more addressing space.

MOVEQ.Q is not necessary by the way. MOVEQ could simply extend to the whole register as it does now.
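
In C terms, the intended semantics are plain sign extension of the 8-bit immediate across the whole register; a minimal sketch (illustrative only, not from any spec):

#include <stdio.h>
#include <stdint.h>

/* Sketch of MOVEQ semantics extended to 64 bits: the 8-bit immediate
   is sign extended across the full destination register. */
int main(void) {
    int8_t  imm = -1;                 /* MOVEQ #-1,Dn */
    int64_t dn  = (int64_t)imm;       /* whole register becomes 0xFFFFFFFFFFFFFFFF */
    printf("%016llx\n", (unsigned long long)dn);
    return 0;
}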

Quote:

Quote:
There was a MOVEX instruction added for LE conversion. "Condition codes: X Not affected" stated in the docs. Sigh.


Okay, hmmm.

MOVE cross?


I guess. It is not MOVE eXtend obviously. IMO, anything would have been better on the 68k (MOVLE, BYTEREV, BYTESWAP).
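
For reference, the operation such an instruction would perform is a plain byte reversal; a minimal C sketch of the 32-bit case (semantics only, nothing about encodings):

#include <stdio.h>
#include <stdint.h>

/* What a BYTEREV/MOVLE-style instruction (or x86 BSWAP) does in one step:
   reverse the byte order of a 32-bit value. */
uint32_t byterev32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}

int main(void) {
    printf("%08x\n", byterev32(0x12345678u));   /* prints 78563412 */
    return 0;
}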

Quote:

Quote:
Do you realize how slow a 128 bit ISA in a CPU would be? Yea, there are some large servers that may need the address space but I'll find a smaller faster server as an alternative, thank you


No, but it's the next move up. 64-bit has been here for a while; for so long that I thought it would be classed as obsolete soon. The FPU has been 64-bit for a long time. In fact the 68K and x86 FPUs have been greater than 64-bit for a long time. Vectors were 128-bit years ago and are now massively increasing. Soon they will be taking up the slack.


The biggest reason to move to 64 bit is for more addressing space.

32 bit - 4GB
64 bit - 17179869184 GB
128 bit - 316912650057057350374175801344 GB

Pointers with 64 bit addressing can already be 1/2 the performance of 32 bit (DCache holds half as many 64 bit pointers as 32 bit pointers). Pointers with 128 bit addressing can be 1/4 the performance of 32 bit. Larger caches are slower (add latency) which is why there are multiple levels of cache (we could see L4 or maybe even L5 DCaches for a 128 bit system). Moore's law no longer applies so die shrinks can't overcome the slowdown as was the case for 64 bit. I wouldn't invest in 128 bit computers.
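
A quick back-of-envelope check of those figures and of the pointer density point; the 64-byte data cache line size below is my assumption, purely for illustration:

#include <stdio.h>
#include <math.h>

/* Rough arithmetic behind the address space table and the pointers-per-
   cache-line point. Assumes a 64-byte DCache line. */
int main(void) {
    const int line = 64;                                /* bytes per cache line */
    printf("32 bit : %llu GB, %d pointers per line\n",
           (1ULL << 32) >> 30, line / 4);
    printf("64 bit : %llu GB, %d pointers per line\n",
           1ULL << 34, line / 8);                       /* 2^64 B = 2^34 GB */
    printf("128 bit: ~%.3Le GB, %d pointers per line\n",
           powl(2.0L, 98.0L), line / 16);               /* 2^128 B = 2^98 GB */
    return 0;
}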

Quote:

Of course, soon they would need to use serial data and address bus lines, lest they run out of physical space. SATA is faster than PATA, even though PATA used parallelism. But I think CPU cores are still too fast before they can bolt serial interfaces on. Unless they are working on it.


There are already CPUs with high speed serial links between them and packed closely together in clusters.

Lou 
Re: 68k Developement
Posted on 26-Jul-2018 17:28:41
#42 ]
Elite Member
Joined: 2-Nov-2004
Posts: 4169
From: Rhode Island

@matthey

AMMX is great but it still needs to be in a highly parallelized custom chip with access to chip ram that can process 256 triangles at once to compete with N64/PS1/Saturn/Jaguar era machines...

Kind of a SuperAkiko that can do C2P on 1024 bytes at once along with AMMXx256... Apparently, I'm an A-hole for bringing this up...


cdimauro 
Re: 68k Developement
Posted on 26-Jul-2018 20:54:05
#43 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey

Quote:

matthey wrote:

The RISC engineers made the same mistake. The equation looks linear.

P = C x V^2 x f

The problem is that the voltage has to be increased with the frequency. Some people say there is a cubic dependency between frequency and power consumption so reducing the clock speed by 30% reduces the power by 35% but this may be best case. The following is a graph showing an Intel i7 where reducing the clock speed by 14% reduced the power by 28%.

https://qph.fs.quoracdn.net/main-qimg-61820a6b86759c354d41dbbd16a0385b

A linear relationship would be a straight line but instead we have significant curvature. This is one of the reasons why lower clocked multicore CPUs are energy efficient for parallel algorithms.

OK. Thanks for the clarification. However the strange thing is that frequencies rose over the years while voltages went down, so we should have expected power usage to stay at roughly the same level instead of exploding.
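
A minimal numeric sketch of the quoted relationship, assuming (as an idealisation) that voltage has to track frequency and ignoring leakage:

#include <stdio.h>

/* Idealised dynamic power model: P = C * V^2 * f, with the added
   assumption that V scales roughly with f. Static (leakage) power,
   which matters a lot on modern processes, is ignored here. */
int main(void) {
    const double scale[] = { 0.70, 0.86, 1.00, 1.30 };
    for (int i = 0; i < 4; i++) {
        double f = scale[i];
        double v = scale[i];            /* voltage tracks frequency (assumption) */
        printf("clock x%.2f -> power x%.2f\n", f, v * v * f);
    }
    return 0;
}
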
Quote:
Motorola wouldn't make an aggressive or new CPU design. They kept offering that shallow pipeline weak OoO design which wouldn't clock up without die shrinks (they are still selling a variation of the same design today). The low end PPCs were even more of a disaster when they tried to mass produce and sell the 603 with 8kB caches like the 68060 (sadly they didn't have half a clue about code density after the 68k). The poor performance of low end PPCs and lack of high end PPCs from Motorola was enough to make anyone want Intel CPUs.

I had heard that too but it was a rumor for all I knew.

Here's the interview: https://www.cnet.com/news/is-the-powerpc-due-for-a-second-wind/

and following is the most important excerpt:

DECEMBER 13, 2005

Weren't you there during the discussions when IBM convinced Apple to adopt the G5?
Mayer: In my previous job, I ran IBM's semiconductor business. So I've seen both sides of the Apple story, because I sold the G5 to Steve (Jobs) the first time he wanted to move to Intel.

Five years ago??
Mayer: Yeah, that's about right. So I sold the G5. First I told IBM that we needed to do it, and then I sold it to Apple that the G5 was good and it was going to be the follow-on of the PowerPC road map for the desktop. It worked pretty well. And then IBM decided not to take the G5 into the laptop and decided to really focus its chip business on the game consoles.


PowerPC wasn't competitive anymore by 2000 (2005 - 5), and that's why Apple already wanted to move to Intel at that time. Anyway, the switch was just delayed, as we know...
Quote:
AArch64 is *not* a reduced instruction set RISC CPU. It has many instructions, addressing modes and ops for RISC like PowerPC. AArch64 does have a very different feel which I like better. MIPS and Alpha are simpler and closer to the classic "minimal" RISC design.

Well, even the most recent Alpha and MIPS ISAs don't have a reduced instruction set. I think it's quite hard to find a modern RISC ISA which really remained loyal to its first letter.

AArch64 isn't Reduced, but it threw away the most complex (and microcoded) instructions, which PowerPC still keeps. The instructions are simpler, like Alpha/MIPS. It's from this PoV that they are similar (yes, it still has complex addressing modes).
Quote:
I asked specifically whether he thought the combined integer/fp register file was a mistake. He did *not* think it was. There are advantages and disadvantages to combined and separate. I can see his view point and respect it. Separate register files for units is popular right now is all. If integer and fp values in the same register file was so bad then why do SIMD units do it?

Because they have completely different usages.

SIMD was made to crunch A LOT of data which is usually "local", and that's why it is packed into big vector registers.

A GP/"integer" unit executes completely different type of code, with data which is almost always scalar, and often "non-local".

If you split the register file, you can optimize the micro-architecture for both, different, use cases.

To make a concrete example, to accelerate GP instruction execution, you know that (out-of-order) processors usually use a pile of rename registers, much bigger than the architectural register file. That's because the code is not so "linear", and you have to keep A LOT of "in flight" instructions with their temporary register values.
You don't need a similarly sized rename register file for SIMD code, because it's much more "linear", and you don't need to keep so many instructions "in flight".
So, you can organize the two rename register sets with ad-hoc, proper sizes, to better match the two different usages.

Now imagine a unified register set for both: to have at least the same performance of the previous split scenario, you need a rename register set which has at least n + m (n for GPR, m for SIMD) entries.

But the implementation cost is the same as before ONLY if the register size of both GPR and SIMD instructions is the same (32 and 32 bit, 64 and 64 bit): once they differ (32 and 64, for example), you clearly have a waste of space. And the bigger the vector size, the bigger the waste.

I think that's enough to prove that the unified register set has big drawbacks and should be avoided: keeping everything separate allows you to finely tune every execution unit according to the goals of the specific micro-architecture.
Quote:
A table lookup decoder like the 68060 uses can handle what looks like disorganized and unsymmetrical encodings to the human eye.

But if you start to use more 32-bit encodings to fill the 64-bit gap (there's no space in the 16-bit opcodes for it), spreading them over the opcode space (e.g.: not using a single 16-bit "main" opcode to map the needed 32-bit opcodes, but multiple ones), then you need more table lookups, and it becomes expensive.
Quote:
So far, it doesn't even look that bad but the last few encodings may make people with OCD feel uncomfortable. Using a 64 bit mode, I have been able to cleverly recover quite a bit of space without losing much. I will probably look at re-encoding everything to compare it and see which I like better. I believe most of the op.[b/w/l/q] EA,Rn instructions can be used as is though. There are a few useful instructions that need to find new encodings.

Strange. I remember that Motorola used plenty of Size=0b11 to map instructions, so it should require many mappings.

Anyway, how does your 64-bit mode work? Do you still keep the same instruction encodings (and only add the missing 64-bit ones), or do you "remap" some existing instructions to work differently (Size = W becomes Q, for example)?
Quote:
The x86_64 usually needs 2 instructions and prefixes for a simple op.q. I can't do any worse than x86_64 really. :D

Prefixes are needed, yes, but 2 instructions? Can you give an example?
Quote:
No blame can be given for continuing a bad thing.

From what I've written before it might sound like a contradiction, but I do NOT like at all the decision AMD made with x64 to use the new REX prefix to introduce both the 64-bit operand size and the extra registers. I would have preferred a completely new ISA, keeping good source-level compatibility. So, yes, they worsened the situation.

As I said before, once an ISA reaches a "critical mass" (too much complexity / legacy burden), it's better to stop, re-think, and create a new, cleaner ISA (but partially source compatible, to make it easier to port the existing code). That's why I appreciate what ARM did with its new 64-bit ISA. And that's why I think that it's better for a 68K "successor" to follow a similar approach (and the same for x86/x64).
Quote:
128 bit IEEE quad precision hardware support would be half the performance of extended precision. It is the fraction/mantissa which requires a wide and slow ALU. Quad precision is 113 bits of fraction while extended precision is only 64 bits. Extended precision increases the exponent to the same size as quad precision which gives a huge range compared to double precision and often giving a number as a result instead of infinity, NaN or a subnormal (which often trap to a slow software handler). With extended precision and quad precision having the same exponent, I wonder if a hardware extended precision operation could be expanded to a quad precision with software (hard+soft quad precision support in a library).

Despite those challenges, the point is that there's a request from customers, and it will certainly be satisfied in the future. So, if you want to create a future-proof ISA, it's better to consider this feature as part of it, because the time will come when you are forced to embrace it anyway, maybe requiring dirty patches.
Quote:
It sounds like an interesting project but I'm tired of doing successful projects for a dead platform that nobody uses. Really, we are wasting our time even discussing anything here. I should have never responded to this thread.

That would have been a pity, because I like this kind of discussion (I was following you on EAB too, but you decided to "exit", unfortunately), and there aren't many people who can sustain it. Also exchanging different opinions/PoVs can enrich all participants, IMO, giving also some reality check.

I know your frustration with the current 68K situation, and that you still see a lot of potential for this ISA. I feel the same for my ISA, but I prefer to continue to pursue my ideas: I love tinkering with computer architectures, and I love thinking, dreaming, and discussing about them.

So, why stop discussing architectures? As you have seen, there are interested people. And who knows: maybe, during the discussion, you (or me, or someone else) may have new ideas or find good solutions to problems.

BTW, I'm also short on time. That's why I write when I can.
Quote:
Debatable. The 8086 code density was specialized for a narrow application which primarily includes text handling and stack use. The x86 and x86_64 retain this specialization which now hurts the code density for general purpose and performance applications (optimizing for performance gives poor code density). I don't know of a good code density variable length 16 bit encoding specialized for the same purpose as the 8086 for comparison. The 68k code density is more general purpose and is more useful today. Let me show you what I mean. I roughly analyzed Vince Weaver's code density results and created a spreadsheet. The code is optimized for size.

https://docs.google.com/spreadsheets/d/e/2PACX-1vTyfDPXIN6i4thorNXak5hlP0FQpqpZFk2sgauXgYZPdtJX7FvVgabfbpCtHTkp5Yo9ai6MhiQqhgyG/pubhtml?gid=909588979&single=true

When optimizing for code size on the x86, using byte sizes and the stack gives the best code density. This results in 106 memory access instructions, mostly from using the stack, vs only 48 memory access instructions for the 68k. The 68k did not need to use the stack to achieve maximum code density and it could use all 16 registers instead of 8. If you think the x86_64 having 16 registers helps, it does not, as it has *more* memory accessing instructions at 112. This was worse than any other popular ISA I evaluated. Code density specialization was good for the 8086 but it is bad for the x86 and x86_64.

Vince's 8086 code does not run in Linux, so it lacks some of the overhead of a modern OS and executable. The 68k currently has the best code density of any of the Linux programs, and Vince hasn't even updated it with the changes I sent him almost a year ago now.

I fully understand you, I agree, and you did much better and more useful work with your spreadsheet (BTW, how do you count "memory accesses" for 68K's MOVE Mem,Mem instructions? And for instructions like ADD Reg,Mem, which require a read and a write to the same location?).

However I think that Weaver's "code density contest" isn't a good candidate for comparing different ISAs: it's too specific, tests only one algorithm plus some "boilerplate" code (the Linux logo), and it's all hand-optimized (an unlikely thing nowadays).

It would have been better to take the standard SPEC test suite and generate proper metrics from the binaries. This poses other problems, of course, like relying on compilers and their generated code, but at least can give comparable, real-world results.

Unfortunately this requires A LOT of time if you have a new ISA or you want to bring an ISA to comparable levels of mainstream ISAs for the generated code.

That's why I gave up: I prefer to disassemble some existing x86/x64 code and "mechanically" convert all instructions to my ISA, using a basic peephole optimizer for some simple cases. A Python script requires much less effort compared to writing an LLVM or GCC backend, and gives me an acceptable indication of how my ISA is doing against one of the best mainstream ones.
Quote:
Note: The counts for some ISAs I'm unfamiliar with could be off. I came up with a methodology to make them correct but Vince doesn't seem to be much of a researcher for being a doctor. I kind of wonder if he even understands why I added other categories besides code density.
Quote:
I was able to save a few bytes with 68k ISA enhancements also but not much. The 68k has most of what is needed for a small and simple program like this. The trick is making an enhanced ISA easier for compilers to generate code which has similar code density.

Yup: see above. Nowadays it's very important to have an ISA which can be easily exploited by compilers. But starting from an existing ISA and achieving that goal isn't easy.

cdimauro 
Re: 68k Developement
Posted on 26-Jul-2018 21:03:25
#44 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Hypex

Quote:

Hypex wrote:
@cdimauro

LOL. I find most things interesting enough, but sometimes too long. OTOH interesting tends to mean long and time consuming after a while.

And unfortunately it requires a massive amount of time... -_-
Quote:
64-bit size should use the size field in the opcode, but after a brief look I can see they didn't future proof it and used it for other instructions. I hate the thought of a prefix. And yes, for 68K it would need a minimum of two bytes. And there goes your neat 16-bit base opcode. I also notice at most three bits for the register, restricting most operations to an address or data register. But I guess that was the point.

It was a very nice choice at the time, but unfortunately it became a burden (a unified GPR is much more flexible).
Quote:
Yes it would be too many. And over complicate it. Then, do you create a new format? Or bolt onto the existing one, with say 1 or 2 MSBs for the register in one word and the normal 3-bit LSBs in the other? Not worth it.

If you want to keep the existing opcode structure, then you have to resort to a prefix where to put the required MSBs (like what AMD did with x64).

Of course I prefer a new format, because it doesn't require a prefix, but then you lose binary backwards compatibility, and maybe even source-level compatibility (albeit you can keep some compatibility here).
Quote:


Just like PPC.

AFAIR latest PPCs unified both Altivec and FPU register sets for a bigger SIMD ISA. So they have 64 registers here.
Quote:
Oh yes, with a V0 to V7. 128-bit wide. Suppose having a format similar to FPU would remain consistent. But again it needs opcode space.

Hum. Only 8 vector registers? Not enough.

Consider, also, that the latest trend is to introduce "matrix" (2D) instructions (which operate on groups of contiguous vector registers), and having a huge register set benefits here.
Quote:
Given we're using large register sizes with vectors, we might as well use an extra opcode to encode it all in.

Sorry, I didn't understand: can you better clarify?
Quote:
Another possibility, if it can be done, is to refer to the entire vector register file space as one width. Say we have space for 8x128-bit vectors. That's 1024 bits for vectors. How about using that as 128x8-bit registers, or one 1024-bit wide register, and everything in between? Even more with increased vector width. But I believe that is the point of having vector SIMD: operating on an array of large data sizes.

The problem is encoding all that information, which requires more bits. Hence -> less encoding space. And further complicates the SIMD implementation.
Quote:
Oh no! I even hate the idea of reusing FPU registers like was done on x86.

No, no! I was just thinking about reusing the opcode space for FPU (coprocessors, in general) instructions. Definitely NOT the same register set.

See the previous message to Mat: I'm a big fan of having separate register sets for groups of execution units (GP/Integer, FPU, SIMD, masks for SIMD, ...).
Quote:
Then again IIRC PPC couldn't directly load FPU values or some limitation,

AFAIR it cannot exchange (load, store) data with GP/Integer registers.
Quote:
Yes for the 8086. But I expected that for the 80286 they would have updated this. After all they added another digit.

They added Protected mode with the 80286: a BIG improvement. Albeit quite heavy... -_-
Quote:
That's what I mean, being byte based, it wasn't like 68K and not as restrictive as PPC. Though it looks like it lost some on x64.

If you want to optimize for speed, that's something you might consider for other ISAs as well. All processors nowadays have caches with some "line" length, and jumping to the beginning of a line (lines can be 32, 64, or even more bytes) gives the decoder the chance to decode as many instructions as possible, without the penalty of having to look at the following line.

If you care more about code density, then you eliminate this extra padding for x64.
Quote:
64-bit is a bit behind for vectors now. But what GPRs? No the data registers?

Yes, the data registers.
Quote:
I see they couldn't follow the AltiVec model because they don't like PPC. Or rather Gunnar doesn't, and he is the one calling the shots. He seems to have a passion against it.

IMO the question is simpler: Gunnar patches the architecture in order to make his life easier when implementing it (with the FPGA he is using). Which doesn't guarantee a good, future-proof design, as such choices clearly show...
Quote:
Sure, yeah. But I don't know for what purpose. Vectors and native 64-bit never existed on the Amiga. The Vampire is like an accelerator on steroids. Well almost. I don't know why they want to go beyond that. The hardware is at most 32-bit in design, as is the OS. I'm all for a speed advantage and including RTG+RTA features, but bolting features onto a deprecated CPU ISA that hasn't been updated in over 20 years seems a bit superfluous.

I agree. It would have been better to create first a fully-compatible Amiga hardware platform (ECS or AGA chipset, RTG for modern graphics software, AHI for modern audio software, and a 68030 or 68040 ISA), pushing the performance as much as possible where it makes sense (RTG and an aggressive out-of-order 68K core). Adding some more stuff (like 3D acceleration, faster bitplane fetching, faster Blitter) only if there was some space left on the FPGA.

Always with the primary goal of accelerating existing applications. Because this is the crude reality: it's very unlikely that new applications will be created to make use of "alien" Amiga technologies, like 64-bit, Hyperthreading, SIMD.

Of course, if there's still free space in the FPGA you can implement the above alien technologies, as well as adding new instructions to the 68K core. But this should be the last thought...
Quote:
Yep. And cut downs on the 060. Also they changed it since the 010 with MOVE SR and related. But MOVE CCR was better for user code.

That was a naive mistake, which they rapidly fixed with the 010.

However the worst thing was cutting most of the FPU instructions from the 040.
Quote:
I didn't see it but adding little endian instructions wouldn't go far astray.

AMMX isn't little-endian. Or are you referring to something else?
Quote:
Would have helped on ECS. But I don't know if they could have fit it in. Should have been there in AGA. Even the Atari has a 16-bit framebuffer.

Packed/chunky modes would have replaced bitplanes for everything, so they would have fit in the chipset space.

It would have simplified the o.s. as well, not having to deal with multiple bitplanes for a single screen: one pointer to the framebuffer and that's it, because you have everything there.

Adopting the bitplanes was the biggest mistake ever, which crippled the Amiga platform.
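
To make the difference concrete, a rough C sketch of writing one 8-bit pixel into a chunky framebuffer versus scattering it across eight bitplanes (function names, plane count and pitch are my own, purely illustrative):

#include <stdint.h>

/* Chunky: the pixel is one byte at one address, a single store. */
void put_pixel_chunky(uint8_t *fb, int pitch, int x, int y, uint8_t c) {
    fb[y * pitch + x] = c;
}

/* Planar (Amiga-style, 8 bitplanes): each bit of the colour lives in a
   different plane, so one pixel costs eight read-modify-write accesses. */
void put_pixel_planar(uint8_t *planes[8], int pitch, int x, int y, uint8_t c) {
    uint8_t mask = (uint8_t)(0x80u >> (x & 7));
    for (int p = 0; p < 8; p++) {
        uint8_t *b = &planes[p][y * pitch + (x >> 3)];
        if (c & (1u << p)) *b |= mask; else *b &= (uint8_t)~mask;
    }
}
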
Quote:
Yes but 16-bit audio is outdated. 24-bit is where it's at and 32-bit FP next.

The more, the better, yes.
Quote:
But what Amiga software would know what it is?

I fully agree: see above.
Quote:
Yes maybe. The 68K did have 64-bit support in dual registers. I expected a 128-bit ISA would be here. It's about time!

Hum. Being honest, I'm not a big fan of 128-bit ISAs either. In the end you only need it for a bigger virtual address space, which is quite a rare case.

And ISAs like RISC-V implement the 128-bit address space as a segmented space, with a 64-bit identifier and 64-bit offset.

If I want segmentation again, then I prefer an approach similar to the x86 one, because usually an application's working set uses a very small number of segments. This also avoids having 128-bit pointers for all types of data, which benefits the data caches and, in general, the whole memory hierarchy.
Quote:
That would be more to think about. Some instructions worked more efficiently in a certain order. But parallel execution is another matter.

Hyperthreading is a different thing, and works like SMP (it's only a lighter implementation).

However the Amiga o.s. is NOT SMP-aware, but the exact opposite, so it's quite difficult to achieve full backward-compatibility with the already existing applications.
Quote:
The platform stopped being produced in the 90's. But, is there a reason to make it more than it is? I compare this with OS4 where a lot of people wanted to run the old 68K stuff and see what games worked. Myself included. But on the Vampire, they are introducing things that add incompatibilities. From hardware conflicts with other Amiga boards to software conflicts. If you can't run Amiga software and games on a super accelerator that plugs into the real deal then what's the point? I know certain Amiga people. They don't see the point of OS4 because the AmigaOne machines don't have an Amiga chipset and can't natively play Amiga games. And there are others that touched OS4 but then went back to 68K. We're a strange bunch.

Yup. I mostly agree (I never tried OS4, BTW).

cdimauro 
Re: 68k Developement
Posted on 26-Jul-2018 21:18:10
#45 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey

Quote:

matthey wrote:

RISC CPUs typically need more registers for several reasons. CISC is usually better off with 16 registers as it gives better code density and is more energy efficient. One paper's test suite showed less than 1% performance difference between 12 and 16 registers for an x86_64 CPU.

https://pdfs.semanticscholar.org/c9e7/976e3be3eed6cf843f1148f17059220c2ba4.pdf

I have another paper that evaluates RISC registers called "High-Performance Extendable Instruction Set Computing" which is not online for free. A few quotes from this paper follow.

"The availability of sixteen general-purpose registers is close to optimum"

"There is little change in either the program size or the load and store frequency as the number general-purpose registers reduces from twenty down to sixteen. However, eight registers are clearly too few as, by that stage, the frequency of load and store instructions has almost doubled."

It looks like with 16 registers it is better to look at ways to better use the registers you have than add more. Register files use a substantial amount of energy (there is a reason the Tabor CPU removed the FPU register file). Everyone wants CPUs with higher clock speeds, more caches, more bits and more registers but you may actually get something which is lower performance, is loud and runs hot. Even CPU and ISA designers (cough, cough) can be obsessed with adding more registers when studies have shown it is not worthwhile.

The paper was interesting but too outdated (they were also using a P4 Prescott: the first x64 CPU from Intel, which performed really poorly in general, and particularly with 64-bit code), bound to a specific ISA, and maybe needs to be updated.

Apple gained A LOT moving its ARM CPUs from 32 to 64 bits. AFAIR there was an article from Anandtech which reported a comparison of exactly the same application compiled for 32 and 64 bits.

But in general, as a low-level coder, you may think about applications which can gain a lot from having more registers available. Emulators are the notable example here.
Quote:
Sure, the 64 bit addressing support is lacking and no ABI but there is no real push to add 64 bit AmigaOS support or compiler support.

The Amiga o.s. is strictly bound to 31 bits (32 with some tricks), so a 64-bit 68K is almost useless (unless you want to introduce the infamous "bank switching", like OS4 did).
Quote:
Granted, the ISA should have been done by a team instead of Gunnar.

I fully agree.

Hypex 
Re: 68k Developement
Posted on 27-Jul-2018 18:01:24
#46 ]
Elite Member
Joined: 6-May-2007
Posts: 11222
From: Greensborough, Australia

@matthey

Quote:
Yes, but a7 aligns writes to 16 bits. It is probably possible to have configuration bits that set this alignment for better orthogonality (no alignment), ColdFire compatibility (32 bit alignment), etc.


I seem to recall some alignment. Still okay if you keep this in mind.

Quote:
ROM is addressable and code can be executed directly from ROM using PC relative. Address 4 is the only address accessed for a library base in ROM also. Other libraries are opened as usual using Exec functions. Text is commonly in the same section as code so PC relative addressing can be used. There is already quite a bit of PC relative addressing used. It is odd that more effort was not put into making it easier for Amiga libraries to be PC relative for internal accesses.


Sure code can be PC relative that way. The issue I was thinking of was when it needed to access the library base which would sit in RAM. PC relative couldn't then be used to access it.

Quote:
Local variables are usually on the stack. Yes, PC relative requires merging sections as in the slideshow I posted for itix. Variable accesses in Amiga libraries are mostly reads while global variables in programs use more writes where opening up PC relative writes for the 68k would be highly advantageous. Some purists don't like the idea but x86_64 RIP relative addressing has shown the benefit of PC relative writes.


In that case you can use PC relative for variables. So then you have self modifying data as opposed to self modifying code, since the variables would need to be reached with PC relative code.

Quote:
Removing the FPU gave little advantage compared to turning off the FPU but it is possible an embedded application needed extreme energy efficiency. I wonder how many man hours were spent on the FPU emulation for Tabor.


I can only imagine but I think any amount, however big or small, adds up to being too much.

Quote:
Gunnar does know PPC well including the weaknesses. There is a certain level of competition as PPC CPUs set the mark in performance for real Amiga hardware. The Apollo core in a low end FPGA can surprisingly outperform some faster clocked PPC CPUs in some benchmarks. A CPU in an FPGA is at a large disadvantage.


What real hardware would this be? Like a CyberStormPPC? I've read the FPGA CPU can take advantage of internal DDR RAM, thus having access to a faster bus, over the slower SDRAM memory interface.

Quote:
I expect the PPC is one of the most hated architectures of all time. There are people who love it too but if you did a poll like is done with politicians (love, like, dislike, hate), it would not poll well. We saw how the hate factor played out in the last U.S. Presidential election which seemed to be a contest of which candidate was least hated. I think the 68k would poll as one of the most loved and least hated architectures of all time. I think x86_64 would beat PPC as well.


x86_64 would beat PPC in what, least loved or most hated? Just putting an EL spin on that.

This looks like a case of the lesser of two evils. I have to admit I didn't really like PPC when it was first introduced either. I looked up the instructions and how it worked. And it just looked so alien. It was not as friendly as 68K at all. I didn't see it as a follow up. Now I think I've pretty much accepted it. Mostly because of OS4. But it does have things in common with the Amiga copper, like a RISC core, 32-bit instruction words, 16-bit high word for opcode and low word for data. I made a list once. That's about the only thing it really has in common with the Amiga as I saw it.

Quote:
You might be surprised at how much SIMD code still comes from humans (entirely or at least tweaked). Auto-vectorization has improved to the point where it usually gives good results but a human can usually do better. The places where SIMD is used are very important for performance.


That's interesting, so the ASM coder can have a comeback?

Quote:
Sure, the 64 bit addressing support is lacking and no ABI but there is no real push to add 64 bit AmigaOS support or compiler support. Granted, the ISA should have been done by a team instead of Gunnar.


I thought Gunnar was leading a team? Well, except for a leak, they don't have the source to AmigaOS either. To my knowledge Hyperion does, but if OS4 still hasn't got 64-bit support now, there is no way they will tweak OS4 68K for 64-bit. But it would need an overhaul of all pointers IMHO, a completely new API.

I thought they may have done this with OS4, but they kept it real close to OS3, except when they broke things or made Amiga sources harder to compile. Right now they need a new 64-bit OS4, with PPC32 and 68K supported by a legacy API that would be a clean wrapper to an internal 64-bit API.

Quote:
There is support for some integer 64 bit data operations but limited support for 64 bit addressing (accessing addresses above 4GB). I don't know of any new addressing modes. The biggest reason to move to 64 bit is for more addressing space.


The problem with extending address space is the hardware is still 32-bit. Especially the address lines on the Amiga hardware wouldn't go above 32-bit. So except for a stand alone Apollo machine it would be redundant.

Quote:
MOVEQ.Q is not necessary by the way. MOVEQ could simply extend to the whole register as it does now.


For positive values it's fine. For negative values I thought there might be a problem. But it looks like it extends up to 64 bits with the 32-bit data being the same.

Quote:
I guess. It is not MOVE eXtend obviously. IMO, anything would have been better on the 68k (MOVLE, BYTEREV, BYTESWAP).


MOVER even. MOVE Reverse. I know. REMOVE. Reverse Endian MOVE.

Quote:
The biggest reason to move to 64 bit is for more addressing space.

32 bit - 4GB
64 bit - 17179869184 GB
128 bit - 316912650057057350374175801344 GB


17179869184 GB sounds like a lot of space, but how soon before it is obsolete? It happened before, it can happen again, while the earth is still spinning in orbit...

Quote:
Pointers with 64 bit addressing can already be 1/2 the performance of 32 bit (DCache holds half as many 64 bit pointers as 32 bit pointers). Pointers with 128 bit addressing can be 1/4 the performance of 32 bit. Larger caches are slower (add latency) which is why there are multiple levels of cache (we could see L4 or maybe even L5 DCaches for a 128 bit system). Moore's law no longer applies so die shrinks can't overcome the slowdown as was the case for 64 bit. I wouldn't invest in 128 bit computers.


I think at that point there would need to be a 128-bit base offset for code, which then uses a smaller virtual address for its address space. Some kind of context would be set that holds the real address, which an OS could switch to for multitasking operations. Possibly registers could be set with a context so they could read and write to another space using just a small, say 4GB, offset, which the CPU MMU would translate to the real address. I realise at this point it starts to sound like segmented addressing so perhaps not the best. We know what happens when there isn't a flat memory model to begin with.

Quote:
There are already CPUs with high speed serial links between them and packed closely together in clusters.


I knew it would happen eventually. But how long before it reaches the desktop? Or mobiletop for that matter.

Hypex 
Re: 68k Developement
Posted on 27-Jul-2018 18:04:35
#47 ]
Elite Member
Joined: 6-May-2007
Posts: 11222
From: Greensborough, Australia

@Lou

Quote:
Kind of a SuperAkiko that can do C2P on 1024 bytes at once along with AMMXx256... Apparently, I'm an A-hole for bringing this up...


An Apollo-hole?

Better then being an A-hole MMX.

matthey 
Re: 68k Developement
Posted on 27-Jul-2018 22:22:32
#48 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2015
From: Kansas

Quote:

Lou wrote:
AMMX is great but it still needs to be in a highly parallelized custom chip with access to chip ram that can process 256 triangles at once to compete with N64/PS1/Saturn/Jaguar era machines...

Kind of a SuperAkiko that can do C2P on 1024 bytes at once along with AMMXx256... Apparently, I'm an A-hole for bringing this up...


In a standalone FPGA based Amiga computer, the SIMD unit can access (much faster and more) chip ram. I believe there would be more disadvantages than advantages to moving the SIMD unit to a separate custom chip. Bus arbitration and cache coherency problems were likely partially responsible for stopping Natami development. It makes sense to go the other way, moving more functionality into the CPU and moving the custom chips closer as a SoC. Gunnar wanting to replace the Amiga blitter with the SIMD unit makes sense. I would keep the SIMD unit (and preferably register files) separate like the 68k FPU unit (6888x, 68040 and 68060) which allows for more parallelization but requires more logic and resources (instructions of different CPU units can often execute in parallel). C2P is no longer necessary as chunky is already available in most FPGA Amiga computers with RTG and there is little to be gained by further accelerating legacy bit plane support.

A wider SIMD unit would be awesome. The number of parallel operations doubles with each doubling of the SIMD register width so a 256 bit SIMD unit could do 4 times as many operations (*not* instructions) in parallel. However, the resources needed are much greater also (nearly quadruple?). Choosing a 64 bit SIMD unit and even merging it with the integer units is efficient on resources and fine for a one-off CPU for some embedded application but very bad for an ISA which needs to scale for mid to high performance CPUs. SIMD units are bad at scaling as is. All the outdated SIMD unit support has to be kept for compatibility (for x86 this includes MMX, SSE, SSE2, SSE3, SSSE3, SSE4, AVX, AVX2, AVX-512). All this baggage requires logic and uses encoding space (SIMD instruction lengths started at 3-4 bytes for MMX and are now 6+ bytes). It is difficult for mid performance and energy efficient CPUs to keep all this old support. Ideally, 68k SIMD unit support should avoid these problems by starting with a more modern SIMD register width and number of registers (probably 256 or 512 bits and 16 or 32 registers). Encodings for floating point support should be reserved if fp is not available at first. If there are not enough resources for a proper SIMD unit then don't add a standardized SIMD unit (keep it experimental as I suggested to Gunnar).

matthey 
Re: 68k Developement
Posted on 28-Jul-2018 3:07:22
#49 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2015
From: Kansas

Quote:

cdimauro wrote:
OK. Thanks for the clarification. However the strange thing is that frequencies rose over the years while voltages went down, so we should have expected power usage to stay at roughly the same level instead of exploding.


Lowering the V has helped but frequency has increased more. I believe capacitance increases also at higher clock speeds and with more (active) gates but this was partially offset with die shrinks. Die shrinks are no longer easy though because of current leakage as we go smaller (end of Moore's law). The current leakage increases the power requirements. The Power section in the following paper explains.

https://www.nxp.com/docs/en/white-paper/multicoreWP.pdf

Quote:

Here's the interview: https://www.cnet.com/news/is-the-powerpc-due-for-a-second-wind/

and following is the most important excerpt:

DECEMBER 13, 2005

Weren't you there during the discussions when IBM convinced Apple to adopt the G5?
Mayer: In my previous job, I ran IBM's semiconductor business. So I've seen both sides of the Apple story, because I sold the G5 to Steve (Jobs) the first time he wanted to move to Intel.

Five years ago??
Mayer: Yeah, that's about right. So I sold the G5. First I told IBM that we needed to do it, and then I sold it to Apple that the G5 was good and it was going to be the follow-on of the PowerPC road map for the desktop. It worked pretty well. And then IBM decided not to take the G5 into the laptop and decided to really focus its chip business on the game consoles.


PowerPC wasn't competitive anymore by 2000 (2005 - 5), and that's why Apple already wanted to move to Intel at that time. Anyway, the switch was just delayed, as we know...


IBM usually doesn't make CPUs without many guarantees up front. Apple wasn't anything then like it is now. Apple was bailed out in August 1997 by Microsoft with a $150 million preferred share purchase or it likely would have gone bankrupt. Those Apple shares would be worth tens of billions of dollars today if Microsoft had held onto them. The Amiga in the meantime was practicing "The Art of No Deal". The Amiga companies today are even less friendly and seem to be practicing "No Deal We Sue".

Quote:

Well, even the most recent Alpha and MIPS ISAs don't have a reduced instruction set. I think it's quite hard to find a modern RISC ISA which really remained loyal to its first letter.

AArch64 isn't Reduced, but it threw away the most complex (and microcoded) instructions, which PowerPC still keeps. The instructions are simpler, like Alpha/MIPS. It's from this PoV that they are similar (yes, it still has complex addressing modes).


RISC-V is simple and closer to the original Berkeley RISC. It is by some counts the 5th Berkeley RISC design. Alpha and MIPS were Spartan as they did not even have division or multiple load/store instructions as I recall. AArch64 and PPC are much more robust and closer to a RISC/CISC hybrid.

Quote:

Quote:
I asked specifically whether he thought the combined integer/fp register file was a mistake. He did *not* think it was. There are advantages and disadvantages to combined and separate. I can see his view point and respect it. Separate register files for units is popular right now is all. If integer and fp values in the same register file was so bad then why do SIMD units do it?

Because they have completely different usages.

SIMD was made to crunch A LOT of data which is usually "local", and that's why it is packed into big vector registers.

A GP/"integer" unit executes completely different type of code, with data which is almost always scalar, and often "non-local".

If you split the register file, you can optimize the micro-architecture for both, different, use cases.

To make a concrete example, to accelerate GP instruction execution, you know that (out-of-order) processors usually use a pile of rename registers, much bigger than the architectural register file. That's because the code is not so "linear", and you have to keep A LOT of "in flight" instructions with their temporary register values.
You don't need a similarly sized rename register file for SIMD code, because it's much more "linear", and you don't need to keep so many instructions "in flight".
So, you can organize the two rename register sets with ad-hoc, proper sizes, to better match the two different usages.

Now imagine a unified register set for both: to have at least the same performance of the previous split scenario, you need a rename register set which has at least n + m (n for GPR, m for SIMD) entries.

But the implementation cost is the same as before ONLY if the register size of both GPR and SIMD instructions is the same (32 and 32 bit, 64 and 64 bit): once they differ (32 and 64, for example), you clearly have a waste of space. And the bigger the vector size, the bigger the waste.

I think that's enough to prove that the unified register set has big drawbacks and should be avoided: keeping everything separate allows you to finely tune every execution unit according to the goals of the specific micro-architecture.


I understand. The integer unit usually uses superscalar parallelism on cached data while the SIMD unit usually uses SIMD instruction parallelism on streaming (often uncached) data. From a simpler perspective, there is nothing inherently bad about having integer and fp values in the same register file and even has advantages like faster fp2int/int2fp conversions and reduced resource usage. The big disadvantage is reduced parallelism from contention of the shared resources. I tend to prefer separate register files for separate units but I'm open minded enough to consider the advantages of shared especially where resources are limited. Your perspective is very much high performance with OoO and practically unlimited resources but this is the minority of CPUs (the majority of CPUs are embedded where efficient use of resources is often more important).

Quote:

Quote:
A table lookup decoder like the 68060 uses can handle what looks like disorganized and unsymmetrical encodings to the human eye.

But if you start to use more 32-bit encodings to fill the 64-bit gap (there's no space in the 16-bit opcodes for it), spreading them over the opcode space (e.g.: not using a single 16-bit "main" opcode to map the needed 32-bit opcodes, but multiple ones), then you need more table lookups, and it becomes expensive.


A table lookup for the 1st 16 bits and one for the 2nd 16 bits should still be fast. x86_64 CPUs usually have their limits on how many prefixes they can decode in one cycle also. The 68k variable length 16 bit encoding system has been copied by many ISAs. How many ISAs have copied the x86_64 variable length byte encodings with prefix hell?
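
A hedged sketch of that two-step table decode in C; the entry fields and table layout are invented for illustration and are not the 68060's actual tables:

#include <stdint.h>

/* Illustrative two-level decode: the first 16-bit opcode word indexes one
   table, and only encodings flagged as needing an extension word cost a
   second indexed load (a real decoder would split the secondary table per
   primary group rather than use one flat table). */
typedef struct {
    uint8_t unit;        /* which execution unit handles the op */
    uint8_t length;      /* total instruction length in 16-bit words */
    uint8_t needs_ext;   /* nonzero if the second word selects the op */
} decode_entry;

static decode_entry primary[65536];     /* indexed by the first opcode word */
static decode_entry secondary[65536];   /* indexed by the second opcode word */

decode_entry decode(const uint16_t *stream) {
    decode_entry e = primary[stream[0]];
    if (e.needs_ext)
        e = secondary[stream[1]];       /* one extra lookup for longer encodings */
    return e;
}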

Quote:

Strange. I remember that Motorola used plenty of Size=0b11 to map instructions, so it should require many mappings.


With a 64 bit mode, much can be trimmed.

Quote:

Anyway, how does your 64-bit mode work? Do you still keep the same instruction encodings (and only add the missing 64-bit ones), or do you "remap" some existing instructions to work differently (Size = W becomes Q, for example)?


.b/.w/.l use the same encoding and .q is used in the 4th slot. Most of those conflicting instructions are 68020 instructions which can be dealt with. It cleans up and simplifies the encoding map nicely.

Quote:

Quote:
The x86_64 usually needs 2 instructions and prefixes for a simple op.q. I can't do any worse than x86_64 really. :D

Prefixes are needed, yes, but 2 instructions? Can you make some example?


add.q #,Rn
mov.q (a0)+,(a1)+

Ok, I exaggerated with the "usually" and should have used "sometimes". The x86 ISA usually does need more instructions than the 68k though (Vince's x86 code needed 36% more instructions than the 68k).
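
Two small C fragments showing where the extra x86_64 instructions tend to come from (illustrative; the exact output depends on the compiler):

#include <stdint.h>

/* x86_64 has no ADD r/m64, imm64 form, so a full 64-bit constant is
   typically materialised with MOV (movabs) and then added: two
   instructions for what a single op.q #imm64,Rn would express. */
uint64_t add_imm64(uint64_t x) {
    return x + UINT64_C(0x123456789ABCDEF0);
}

/* A general memory-to-memory copy compiles to a separate load and store
   on x86_64, where the 68k style mov.q (a0)+,(a1)+ is one instruction. */
void copy_q(uint64_t **src, uint64_t **dst) {
    *(*dst)++ = *(*src)++;
}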

Quote:

Quote:
No blame can be given for continuing a bad thing.

From what I've written before it might sound like a contradiction, but I do NOT like at all the decision AMD made with x64 to use the new REX prefix to introduce both the 64-bit operand size and the extra registers. I would have preferred a completely new ISA, keeping good source-level compatibility. So, yes, they worsened the situation.

As I said before, once an ISA reaches a "critical mass" (too much complexity / legacy burden), it's better to stop, re-think, and create a new, cleaner ISA (but partially source compatible, to make it easier to port the existing code). That's why I appreciate what ARM did with its new 64-bit ISA. And that's why I think that it's better for a 68K "successor" to follow a similar approach (and the same for x86/x64).


I think the 68k is salvageable. It trades a little performance for great code density much like Thumb 2 (which looks like it copied from the 68k). The 68k is a little more difficult to decode but it has fewer instructions, fewer memory accesses and better code density. Thumb 2 has some modern improvements but I expect the 68k to have better performance with everything equal. The 68k has more free encoding space than most old ISAs for improvement.

Quote:

Quote:
128 bit IEEE quad precision hardware support would be half the performance of extended precision. It is the fraction/mantissa which requires a wide and slow ALU. Quad precision is 113 bits of fraction while extended precision is only 64 bits. Extended precision increases the exponent to the same size as quad precision which gives a huge range compared to double precision and often giving a number as a result instead of infinity, NaN or a subnormal (which often trap to a slow software handler). With extended precision and quad precision having the same exponent, I wonder if a hardware extended precision operation could be expanded to a quad precision with software (hard+soft quad precision support in a library).


Despite those challenges, the point is that there's a request from customers, and it will certainly be satisfied in the future. So, if you want to create a future-proof ISA, it's better to consider this feature as part of it, because the time will come when you are forced to embrace it anyway, maybe requiring dirty patches.


It would be cool if a CPU was created for scientists, engineers and educators. The feature rich extended precision FPUs had them more in mind instead of the later fast and simple FPUs (fast and simple is for the SIMD unit). I wouldn't mind seeing 128 bit quad precision IEEE hardware support but I expect it to be slow. Extended precision requires a 67 bit barrel shift for normalizing so quad precision would require something like a 115 bit barrel shift. It probably wouldn't add any more than a cycle in latency on modern hardware. Division would be more of a problem. Without a customer asking for it, hardware accelerated (hardware+software) quad precision support sounds like a better selling point.

Quote:

That would have been a pity, because I like this kind of discussion (I was following you on EAB too, but you decided to "exit", unfortunately), and there aren't many people who can sustain it. Also exchanging different opinions/PoVs can enrich all participants, IMO, giving also some reality check.


I thought exposing Gunnar's lies which resulted in his goons attacking Meynaf and me on EAB would stop the attacks but the moderators were more concerned about privacy policies (which ironically weren't even in the forum rules). Either the private e-mails I posted were lies in which case I didn't violate anyone's privacy or they were the truth and I violated the privacy of someone who had done much worse. Our laws here in the U.S. often protect criminals more than victims effectively punishing the victims too. My banning was more a case of absolute power by a moderator corrupted absolutely though. I guess there will be no new updates to any of my Amiga projects which were exclusively on EAB but there isn't much demand anyway as the Amiga is dead (It was just a nail in the coffin).

Quote:

I fully understand you, I agree, and you did a much better and more useful job with your spreadsheet (BTW, how do you count "memory accesses" for 68K's MOVE Mem,Mem instructions? And for instructions like ADD Reg,Mem, which require a read and a write to the same location?).


Each memory accessing instruction counts as one including implicit memory accessing instructions like BSR. I highlighted the instructions which accessed memory in the source and gave the files to Vince to add to his github page for public review. It is the simplest way to do it even if it favors multiple memory access instructions (which can often be faster than multiple memory accesses by separate instructions). For simplicity's sake, I did not break out read, write, read/modify/write, mem to mem, load/store multiple or pair instructions, etc. I would have thought CISC architectures would have an advantage here but x86/x86_64 performed the worst and several RISC architectures were a little better than the 68k. More registers can also be helpful at reducing memory access instructions.

Quote:

However I think that Weaver's "code density contest" isn't a good candidate for comparing different ISAs: it's too specific, tests only one algorithm and some "boilerplate" code (the Linux logo), and it's all hand-optimized (an unlikely thing nowadays).


It is a narrow and small benchmark but typical of common programs. Hand optimized assembler really is the only way to fairly compare ISAs as otherwise compiler support is the most important factor. We know which ISAs have the best compiler support.

Quote:

It would have been better to take the standard SPEC test suite and generate proper metrics from the binaries. This poses other problems, of course, like relying on compilers and their generated code, but at least can give comparable, real-world results.


I have thought about doing the same as you suggest. Compiler support and options can make a huge difference but it would still be interesting.

Quote:

Quote:
I was able to save a few bytes with 68k ISA enhancements also but not much. The 68k has most of what is needed for a small and simple program like this. The trick is making an enhanced ISA easier for compilers to generate code which has similar code density.


Yup: see above. Nowadays it's very important to have an ISA which can be easily exploited by compilers. But starting from an existing ISA and achieving that goal isn't easy.


It is possible to achieve very good code density on the 68k using a variety of optimizations and trash registers which compilers seem to have trouble with (Vince's 68k code uses these). My goal is to make it so the same or better code density can be achieved with simpler peephole optimizations (can be done by vasm) and using fewer registers. It would be possible to make a more dense enhanced 32 bit 68k ISA than 64 bit 68k ISA so I will be giving up a little there.

matthey 
Re: 68k Developement
Posted on 28-Jul-2018 6:44:35
#50 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2015
From: Kansas

Quote:

Hypex wrote:
Sure code can be PC relative that way. The issue I was thinking of was when it needed to access the library base which would sit in RAM. PC relative couldn't then be used to access it.


ExecBase is usually only read once per program and cached in more modern Amiga software. This is because some Amigas have address $4 in slow chip ram while the cached copy is kept in fast ram. The 68k has an absolute 16 bit addressing mode (xxx).W which is the same size as (d16,PC) so there is no difference in code density.
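
As a minimal C sketch of that pattern (the type is simplified here; real code would use <exec/execbase.h> and a proper struct ExecBase pointer):

/* Read location 4 (AbsExecBase) once at startup and keep the cached copy,
   instead of re-reading it on every call. */
static void *SysBase;        /* simplified; real code declares struct ExecBase * */

int main(void)
{
    SysBase = *(void **)4;   /* the only absolute address the system defines */
    /* ... later OS calls go through the cached SysBase ... */
    return 0;
}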

Quote:

In that case you can use PC relative for variables. So then you have self-modifying data as opposed to self-modifying code, since the variables would need to be reached with PC relative code.


Text/Data sections can be read/write or read only while BSS/uninitialized sections only make sense as read/write. The Amiga usually scatter loads hunks/sections to separate addresses. Merging them (with padding to MMU page boundaries when an MMU is present) is no problem for programs which are not re-entrant. Re-entrant code has 3 choices.

1) Create read only sections once (shared) but create writable sections for each re-entrant process. While convenient, this precludes merging the read only sections with the writable sections so PC relative addressing is no longer possible for all sections. A slow offset table (GOT) is used to access variables in the separate writable sections. This is likely what .so/.dll libraries originally did.

2) Abandon code sharing and create all the sections for every re-entrant process. There were .so/.dll library version and security issues so in some cases OSs likely do this. This does allow PC relative addressing of all sections but variables likely go through the slow GOT anyway for legacy compatibility.

3) Create all sections once (including writable sections) and let the re-entrant processes share the writable sections. This allows merging all sections and using fast PC relative addressing. Some care is needed for arbitration of writable data but this provides the best means for re-entrant sharing and good performance. I expect it to be better for security than "shared" .so/.dll libraries. Amiga libraries work like this.

Is it ok to introduce inferior library technology to the AmigaOS to make shovel ware software easier or is it just a sign of desperation for the Amiga? What is the point of the Amiga if it is a tech follower instead of leader?

Quote:

Quote:
Gunnar does know PPC well including the weaknesses. There is a certain level of competition as PPC CPUs set the mark in performance for real Amiga hardware. The Apollo core in a low end FPGA can surprisingly outperform some faster clocked PPC CPUs in some benchmarks. A CPU in an FPGA is at a large disadvantage.


What real hardware would this be? Like a CyberStormPPC? I've read the FPGA CPU can take advantage of internal DDR RAM, thus having access to a faster bus, over the slower SDRAM memory interface.


The Apollo Core sometimes outperforms PPC accelerators for the classic, old mac hardware, Efika and even Sam in some cases. Most benchmarks which are faster are partially due to faster memory and having an SIMD unit. The 68k is particularly strong in some areas where the PPC is weak even though the much higher clock speeds mask the advantage. Gunnar's Sortbench benchmark shows the 68k strength and PPC weakness when adjusting for clock speed. Even the in order 68060 outperforms most PPC CPUs.

ARM Cortex A4 0.30 MB/s/MHz
ColdFire v3 MCF5329 0.44 MB/s/MHz
Raspberry Pi ARM 1176JZF-S 0.652 MB/s/MHz
ARM Feroceon 88FR131 0.69 MB/s/MHz
IBM Power 6 0.69 MB/s/MHz
Intel Atom 0.84 MB/s/MHz
IBM Power 7 1.16 MB/s/MHz
AmigaOne X1000 PA6T-1682M 1.19 MB/s/MHz
PPC G4 7447 1.26 MB/s/MHz
68060 1.60 MB/s/MHz (1.87 MB/s/MHz with assembler optimizations)
Intel Core 2 Duo 2.61 MB/s/MHz
Apollo Core 3.70 MB/s/MHz
Intel i7 3770 4.32 MB/s/MHz (or more if smaller element size helps)

http://www.apollo-core.com/sortbench/index.htm?page=benchmarks

The benchmark is just a bubble sort. RISC can't schedule much between the load/store instructions in a short loop so they shoot lots of bubbles. The DCache is low latency on the 68060 and it is good at loops which helps also. This benchmark does scale up so an equally clocked 68k CPU would handily beat the same clocked PPC in MB/s. We can also see how much performance ColdFire lost. CPUs which can handle this benchmark tend to have good single core performance traits.

Quote:

x86_64 would beat PPC in what, least loved or most hated? Just putting an EL spin on that.

I expect the PPC would be more hated than the x86_64 (not in Amiga circles) but the PPC and x86_64 would be closer for most loved.

This looks like a case of a lesser of two evils. I have to admit I didn't really like PPC when it was first introduced as well. I looked up the instructions and how it worked. And it just looked so alien. Was not as friendly as 68K at all. I didn't see it as a follow up. Now I think I've pretty much accepted it. Mostly because of OS4. But it does have things in common with the Amiga copper, like RISC core, 32-bit instruction words, 16-bit high word for opcode and low word for data. I made a list once. That's about the only thing it really has in common with Amiga as I saw it.


The lesser of two evils is still evil. I choose neither.

Quote:

Quote:
You might be surprised at how much SIMD code still comes from humans (entirely or at least tweaked). Auto-vectorization has improved to the point where it usually gives good results but a human can usually do better. The places where SIMD is used are very important for performance.


That's interesting, so the ASM coder can have a come back?


Assembler coding is still an important niche. The leading OSs, games, applications, embedded software, etc. still have code which is highly optimized in places.

Quote:

I thought Gunnar was leading a team? Well, except for a leak, they don't have the source to AmigaOS either. To my knowledge, Hyperion do, but if OS4 still hasn't got 64-bit support now, there is no way they will tweak OS4 68K for 64-bit. But, it would need an overhaul of all pointers IMHO, a completely new API.

I thought they may have done this with OS4, but they kept it real close to OS3, except when they broke things or made Amiga sources harder to compile. Right now they need a new 64-bit OS4, with PPC32 and 68K supported by a legacy API, that would be a clean wrapper to an internal 64-bit API.


Gunnar assembled a team but didn't lead one.

Adding 64 bit support to the AmigaOS while maintaining maximum compatibility is tricky. Pointers in structures are all 32 bits which means they are currently limited to the lower 4GB 32 bit address space. Some memory above 4GB could be used though. New 64 bit structures and libraries could be introduced but it is tricky to share 32 and 64 bit resources. I believe it would be possible to keep a 68k CPU more compatible. I would like to allow 32 bit processes to run in 32 bit mode (limited to 4GB of address space) while 64 bit processes run in 64 bit mode. AROS added 64 bit support for the x86_64 but broke compatibility.
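
As a rough illustration of the problem (the field names only loosely follow exec's struct Node; this is not the real definition):

#include <stdint.h>
#include <stdio.h>

struct OldNode {            /* legacy 32-bit layout shared with old binaries       */
    uint32_t ln_Succ;       /* 32-bit "pointer": cannot reference memory above 4GB */
    uint32_t ln_Pred;
};

struct NewNode64 {          /* a 64-bit variant changes size and layout, so        */
    uint64_t ln_Succ;       /* existing 32-bit code cannot walk the same lists     */
    uint64_t ln_Pred;
};

int main(void)
{
    printf("%zu vs %zu bytes per node\n",
           sizeof(struct OldNode), sizeof(struct NewNode64));
    return 0;
}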

Quote:

The problem with extending address space is the hardware is still 32-bit. Especially the address lines on the Amiga hardware wouldn't go above 32-bit. So except for a stand alone Apollo machine it would be redundant.


The custom chips can only access data in chip memory and I don't think we have to worry about more than 4GB of chip memory yet.

Quote:

Quote:
MOVEQ.Q is not necessary by the way. MOVEQ could simply extend to the whole register as it does now.


For positive values it's fine. For negative I thought there may be a problem. But it looks like it extends up to 64-bits with the 32-bit data being the same.


MOVEQ sign extends values between -128 and 127. The only question is whether it should extend to 32 bits or 64 bits in a 64 bit 68k ISA.

Quote:

Quote:
The biggest reason to move to 64 bit is for more addressing space.

32 bit - 4GB
64 bit - 17179869184 GB
128 bit - 316912650057057350374175801344 GB


17179869184 GB sounds like a lot of space, but how soon before it is obsolete? It happened before, it can happen again, while the earth is still spinning in orbit...


The earth can spin all it wants but our problem is quantum physics and current leakage. We can't go much smaller with die shrinks.

Quote:

I think at that point there would need to be a 128-bit base offset for code. Which then uses a smaller virtual address to use its address space in. Some kind of context would be set that holds the real address. Which an OS could switch to for multitasking operations. Possibly registers could be set with a context so they could read and write to another space using just a small, say 4GB offset, which the CPU MMU would translate to the real address. I realise at this point it starts to sound like segmented addressing so perhaps not the best. We know what happens when there isn't a flat memory model to begin with.


Memory paging was awful for 8 bit and not good for 16 bit CPUs but it is not as bad when the pages are huge. PAE was ok but could have been better. The Amiga could keep from moving to 64 bit with something similar as it already has good code density. Moving to 64 bit would open up a practically unlimited address space for a personal computer although there is a performance cost.

Hypex 
Re: 68k Developement
Posted on 1-Aug-2018 18:03:18
#51 ]
Elite Member
Joined: 6-May-2007
Posts: 11222
From: Greensborough, Australia

@cdimauro

Quote:
And unfortunately it requires a massive amount of time... -_-


It does and glad it's not just me. Not that it helps.

Quote:
It was a very nice choice at the time, but unfortunately it became a burden (a unified GPR is much more flexible).


I wonder how they would fit any other way? There are data to data instructions as well as address to data. So there is the instruction, data operators and 3 bits each for address or data register. For GPR model, if instead of having A0-A7 and D0-D7, they had R0-R15, then they would need a whole byte to store a 4-bit source and 4-bit destination. Leaving the other 8-bits for instruction and address modes.

Quote:
Of course I prefer a new format, because it doesn't require a prefix, but then you lose binary backwards compatibility, and maybe even source-level compatibility (albeit you can keep some compatibility here).


That would be best. Just not good for the Amiga scene. But sometimes things just need to be replaced.

Quote:
AFAIR latest PPCs unified both Altivec and FPU register sets for a bigger SIMD ISA. So they have 64 registers here.


Wow. 64 registers at 128-bit each?

Quote:
Hum. Only 8 vector registers? Not enough.


Well we are talking about 68K here so the 68K standard is 8 specific registers per set. 8 data, 8 address, and a further 8 vectors make sense. But if you think there should be more perhaps we can squeeze 16 in there. But we'll need a prefix.

Quote:
Consider, also, that the last trend is to introduce "matrix" (2D) instructions (which operate on groups of contiguous vector registers), and having a huge register set benefits here.


I also proposed maximising the total register file to 1024 bits. With vectors being variable width up to that size. Suppose it will need to be 2048 bits for those 16 vectors now.

Quote:
Sorry, I didn't understand: can you better clarify?


I think I meant, given vectors now have extra large sizes, such as 512 bits, then might as well use an opcode prefix to code it all in. Since with that amount of data it will need an extra word to code it in anyway.

Quote:
The problem is encoding all that information, which requires more bits. Hence -> less encoding space. And further complicates the SIMD implementation.


Well there is a larger vector count or a larger vector field width. Either way it all has to be coded in.

Quote:
No, no! I was just thinking about reusing the opcode space for FPU (coprocessors, in general) instructions. Definitely NOT the same register set.


Oh ok. so the $Fxxx space. Well, since technically $Vxxx doesn't exist, perhaps so. But that breaks compatibility.

Is there also an MMU space?

Quote:
See the previous message to Mat: I'm a big fan of having separate register sets for groups of execution units (GP/Integer, FPU, SIMD, masks for SIMD, ...).


Okay. Just not address or data?

Quote:
AFAIR it cannot exchange (load, store) data with GP/Integer registers.


Yes that was it.

Quote:
They added Protected mode with the 80286: a BIG improvement. Albeit quite heavy... -_-


That's interesting, so the 80286 supported a flat memory model? And the 80386 fully brought 32-bit. It's never clear to me where the flat memory model comes into it with x86 when I look it up.

And it looks backwards, consistent with little endian, when I read about real mode. I would have thought real mode would be a real memory mode as in a flat memory mode. But no, it's backwards, describing original or segment mode.

Quote:
IMO the question is simpler: Gunnar patches the architecture in order to make his life easier at implementing it (with the FPGA which he is using). Which doesn't guarantee to have a good, future-proof design. Like it's clearly shown by such choices...


That's interesting and I can see the sense in doing that. But not the sense that makes it fully incompatible with a 68060. A good medium would be 68040 compatibility. That's pretty much the full 68K of the series.

Quote:
I agree. It would have been better to create first a fully-compatible Amiga hardware platform (ECS or AGA chipset, RTG for modern graphic software, AHI for modern audio software, and a 68030 or 68040 ISA), pushing as much as possible the performances where it makes sense (RTG and an aggressive out-of-order 68K core). Adding some more stuff (like 3D acceleration, faster bitplanes fetching, faster Blitter) only if there was some space left on the FPGA.


As it stands it will almost be like the PowerPC co-processor situation. Except there will be a 68K with an extended 68K on the same CPU. Just a co-pro in the same CPU family.

Quote:
That always with the primary goal to accelerate existing applications. Because this is the crude reality: it's very unlikely that new applications will be created to make use of "alien" Amiga technologies, like 64-bit, Hyperthreading, SIMD.


Alien Amiga. LOL. We can look back to PowerPC. Where it caused a division for those in the class that had exotic PPC hardware. Except now an exclusive 68K division. But since most 68K stuff is set in stone that shouldn't matter much. Getting the retro crowd to accept newer software is another thing.

Quote:
AMMX isn't little-endian. Or are you referring to something else?


I mean it being based off MMX. MMX derives from x86. So they are basing their AMMX on a vector extension of a little endian CPU. Not that it matters. If it makes any difference.

Quote:
Packed/chunky modes would have replaced bitplanes for everything, so they could have fit the chipset space.


Couldn't have worked then. But I think say being able to use the same colour depth either way would have been good. Like using 4-bit or 8-bit packed pixel modes, but restricted to one plane, so the same amount of memory was read in for either. Though, internally, instead of the usual bitmap to colour index conversion or however it worked, it had the indexes right there. So, if need be, it could have virtualised the depth and converted 16 pixels at a time from packed to planar direct in the bitplane registers. Like Akiko but internally in realtime in the chips.

Quote:
It would have simplified the o.s. as well, not having to deal with multiple bitplanes for a single screen: one pointer to the framebuffer and that's it, because you have everything there.


Can just use the first bitplane pointer. In my investigation of RTG that's what an RTG bitmap did. One large plane packed across.

Quote:
Adopting the bitplanes was the biggest mistake ever, which crippled the Amiga platform.


I've gone over this and I don't know. It was common at the time. Memory efficient. And it made all the features of the Amiga hardware possible. Parallax layers rely on it on the Amiga. It's fine for blitting images with. It only gets to be a real problem when you want to work one on one with pixels. Or do scaling.

Quote:
Hum. Being honest I'm not a big fan of 128-bit ISAs as well. At the very end you only need it for a bigger virtual address space, which is a quite rare case.


Yet.

Quote:
And ISAs like RISC-V implement the 128-bit address space as a segmented space, with a 64-bit identifier and 64-bit offset.


Oh no. I just lost interest.

Quote:
If I want the segmentation again, then I prefer an approach similar to the x86 one, because usually an application working set uses a very small number of segments


I never got this. I read about it for years and it never made sense. Until some time later I understood it. It was shifted four bits to the left. And then I didn't understand again. Because that just looked stupid. Why would they bother shifting it a nibble across? That ain't very future proof! So 16 bits extend to 1MB? What a waste! I just didn't get it. Why wouldn't they shift it a logical amount like 16-bit left? Or, in my mind, set a 16-bit high word, accessing it with a 16-bit low word.
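
For reference, the real mode calculation is just this (a small C sketch of the formula; the reset vector value is the classic example):

#include <stdint.h>
#include <stdio.h>

/* 8086 real mode: the 16-bit segment is shifted left by 4 and added to the
   16-bit offset, giving a 20-bit (1MB) physical address. */
static uint32_t real_mode_addr(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

int main(void)
{
    /* F000:FFF0 -> FFFF0, the 8086 reset vector */
    printf("%05X\n", (unsigned)real_mode_addr(0xF000, 0xFFF0));
    return 0;
}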

matthey 
Re: 68k Developement
Posted on 2-Aug-2018 0:00:49
#52 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2015
From: Kansas

Quote:

Hypex wrote:
Quote:
It was a very nice choice at the time, but unfortunately it became a burden (a unified GPR is much more flexible).


I wonder how they would fit any other way? There are data to data instructions as well as address to data. So there is the instruction, data operators and 3 bits each for address or data register. For GPR model, if instead of having A0-A7 and D0-D7, they had R0-R15, then they would need a whole byte to store a 4-bit source and 4-bit destination. Leaving the other 8-bits for instruction and address modes.


A full orthogonal CISC encoding with 16 GPR (r0-r15) registers uses substantial encoding space. With a variable length 16 bit encoding, it would require more of the encodings to become 32 bit encodings. The An and Dn specialization allows for the 68k's awesome code density, 16 registers without reducing code density (a tiered system where instruction sizes grow when using r8-r15, as on x86_64, is unnecessary) and leaves significant encoding space available.

The 68k EA encoding practically has a 4 bit register encoding. The EA consists of a 3 bit mode and 3 bit register encoding.

%0000 d0 r0
...
%0111 d7 r7
%1000 a0 r8
...
%1111 a7 r15

Many of the EA encodings use An *or* Dn as the register file was split on the 68000 (outdated on pipelined 68k CPUs like the 68020+). The ISA can be enhanced to allow An *and* Dn in most sources effectively giving a 4 bit register encoding. EA destinations could usually be opened also but allowing An destinations would require special handling of condition codes as An operation destinations usually do *not* set the condition codes as Dn destination operations do. Opening EA sources provides most of the benefits (fewer instructions, improved code density and better orthogonality) while preserving the "feel" of the 68k.
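
A small C sketch of that 6 bit EA field (illustrative only; the JSR opcode value is just a convenient example):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t opcode = 0x4E90;           /* JSR (a0): EA mode %010, register %000 */
    unsigned mode = (opcode >> 3) & 7;  /* bits 5-3: addressing mode field       */
    unsigned reg  = opcode & 7;         /* bits 2-0: register number field       */
    /* mode %000 selects d0-d7 and mode %001 selects a0-a7, so together the two
       fields behave much like the 4-bit r0-r15 numbering listed above.          */
    printf("mode=%u reg=%u\n", mode, reg);
    return 0;
}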

Quote:

Quote:
AFAIR latest PPCs unified both Altivec and FPU register sets for a bigger SIMD ISA. So they have 64 registers here.


Wow. 64 registers at 128-bit each?


I do *not* believe Altivec (v0-v31) register and FPU register (f0-f31) sets are "unified" on the PPC. The Altivec and FPU register files may be unified on particular designs but the ISA does *not* force any unification that I'm aware of.

Quote:

Quote:
Hum. Only 8 vector registers? Not enough.


Well we are talking about 68K here so the 68K standard is 8 specific registers per set. 8 data, 8 address, and a further 8 vectors make sense. But if you think there should be more perhaps we can squeeze 16 in there. But we'll need a prefix.


It is more important to have more SIMD unit registers than FPU registers for reasons cdimauro has stated. More FPU registers would be good with a superscalar FPU and FPU register spills are big and expensive also.

There is no need for a prefix on the 68k for an FPU or SIMD unit. I was able to add 16 FPU registers and 3 op operations without using any more than the current F-line coprocessor ID=1. The MMU uses coprocessor ID=0 leaving most of ID 2-7 available. This is a substantial amount of encoding space although an SIMD unit with 16+ registers and 3 op needs much more space than an FPU (I expected more than one coprocessor ID would be needed). If needed, many of the SIMD instructions could be made 6 byte long instructions *without* prefixes. A CISC style SIMD unit does need more encoding space but doesn't need as many registers.

Quote:

Quote:
Consider, also, that the last trend is to introduce "matrix" (2D) instructions (which operate on groups of contiguous vector registers), and having a huge register set benefits here.


I also proposed maximising the total register file to 1024 bits. With vectors being variable width up to that size. Suppose it will need to be 2048 bits for those 16 vectors now.


The x86_64 SIMD ISAs have sometimes masked half of the SIMD register width for compatibility with narrower sizes when doubling their SIMD unit register width. RISC-V has put off creating and publishing their SIMD unit ISA as they are trying to make a more scaleable SIMD unit. I'm no expert in SIMD units but I expect they will lose as much in not being able to program it as efficiently and close to the hardware as they will gain in increased flexibility but it will be interesting to see what they come up with.

Quote:

Quote:
IMO the question is simpler: Gunnar patches the architecture in order to make his life easier at implementing it (with the FPGA which he is using). Which doesn't guarantee to have a good, future-proof design. Like it's clearly shown by such choices...


That's interesting and I can see the sense in doing that. But not the sense that makes it fully incompatible with a 68060. A good medium would be 68040 compatibility. That's pretty much the full 68K of the series.


At the user level, there is not much difference between the 68040 and 68060 and not much added to these CPUs.

68030->68040 MOVE16
68040->68060 FINT/FINTRZ

The 68060 did make some significant supervisor changes.

Quote:

That always with the primary goal to accelerate existing applications. Because this is the crude reality: it's very unlikely that new applications will be created to make use of "alien" Amiga technologies, like 64-bit, Hyperthreading, SIMD.


Isn't that true for next gen Amigas also? A few hundred active users is not enough to attract development and much of what we do get is just "alien" shovelware instead of Amiga innovation.

Hypex 
Re: 68k Developement
Posted on 4-Aug-2018 16:55:20
#53 ]
Elite Member
Joined: 6-May-2007
Posts: 11222
From: Greensborough, Australia

@matthey

Quote:
ExecBase is usually only read once per program and cached in more modern Amiga software.


I never saw why $4 was exposed to the programs at all. Why wasn't ExecBase stuck in A6 already when a program started? Where the OS stored it in memory was OS business.

Quote:
Text/Data sections can be read/write or read only while BSS/uninitialized sections only make sense as read/write. The Amiga usually scatter loads hunks/sections to separate addresses. Merging them (with padding to MMU page boundaries when an MMU is present) is no problem for programs which are not re-entrant. Re-entrant code has 3 choices.


This neat separation presents another problem. While your 1, 2 and 3 present good pros and cons for a model, there is something else that would hold back PC relative access. Most OS libraries are in the ROM. The library base was like a global variable area. So for the code to reach a variable PC relative it would need to reach it from ROM to where it is in RAM. And the ROM code is static. Unless I have misinterpreted what you are referring to with PC relative. Disk based libraries would have more freedom.

Quote:
Is it ok to introduce inferior library technology to the AmigaOS to make shovel ware software easier or is it just a sign of desperation for the Amiga? What is the point of the Amiga if it is a tech follower instead of leader?


Okay so the SO objects? I was thinking 68K libraries all along.

I don't see how it makes things easier. I've tried to compile Linuxware and can not recall ever coming across a shared object. Of course I like to compile as static. So it works out of the box. SOs just give the programmer an option for something else they don't include in their program. OS4 doesn't support dependency tracking like Linux so I don't think it is a good idea. On OS4 the SO idea is inferior.

Quote:
The Apollo Core sometimes outperforms PPC accelerators for the classic, old mac hardware, Efika and even Sam in some cases. Most benchmarks which are faster are partially due to faster memory and having an SIMD unit. The 68k is particularly strong in some areas where the PPC is weak even though the much higher clock speeds mask the advantage. Gunnar's Sortbench benchmark shows the 68k strength and PPC weakness when adjusting for clock speed. Even the in order 68060 outperforms most PPC CPUs.


The figures were consistent with my XE G3 when I tested it in the memory copy operation.

Quote:
The benchmark is just a bubble sort. RISC can't schedule much between the load/store instructions in a short loop so they shoot lots of bubbles. The DCache is low latency on the 68060 and it is good at loops which helps also. This benchmark does scale up so an equally clocked 68k CPU would handily beat the same clocked PPC in MB/s. We can also see how much performance ColdFire lost. CPUs which can handle this benchmark tend to have good single core performance traits.


I wonder how hand coded RISC would do here?

Quote:
The lesser of two evils is still evil. I choose neither.


What else could you choose? What else could they have chosen back then?

Quote:
Assembler coding is still an important niche. The leading OSs, games, applications, embedded software, etc. still have code which is highly optimized in places.


I would have expected it to fall down to low level operations only needed when necessary.

Quote:
Gunnar assembled a team but didn't lead one.


Ha. That sounds funny. Was it lead astray?

Quote:
I believe it would be possible to keep a 68k CPU more compatible. I would like to allow 32 bit processes to run in 32 bit mode (limited to 4GB of address space) while 64 bit processes run in 64 bit mode. AROS added 64 bit support for the x86_64 but broke compatibility.


Somehow, IIRC, Linux PPC32 programs can run on a PPC64 kernel in user space. So with enough planning and underlying CPU support it looks possible.

Quote:
The custom chips can only access data in chip memory and I don't think we have to worry about more than 4GB of chip memory yet.


Yes good point. Not even the 2MB barrier has been broken. Will the Vampire virtualise 8MB or 16MB chip ram?

There are a few comments about AmigaOS being restricted to 31-bits due to some bad designs. But I don't think this can really hamper AmigaOS on 32-bit hardware. At most, a 32-bit machine can access around 2GB with memory mapped hardware; I saw this maxed out at around 3GB before the PC went 64-bit. Which seems funny, as I read that Intel chips had instructions to access hardware. But the 68K needed it memory mapped, which looks inferior and memory consuming by comparison.

So even a 68K AmigaOS would need a new design to be ported to 64-bit. But I don't see AROS taking its place. If I had a fancy new 64-bit Vampire accelerator in my Amiga I'd want to run a real Amiga OS on it, not a copy.

Quote:
MOVEQ sign extends values between -128 and 127. The only question is whether it should extend to 32 bits or 64 bits in a 64 bit 68k ISA.


64 bits should be fine. 32 bits would only be seen by 32 bit instructions.

Quote:
The earth can spin all it wants but our problem is quantum physics and current leakage. We can't go much smaller with die shrinks.


In that case start stacking up the chips in 3D space. To add cores and all sorts of math units in the same space is a feat in itself. But logistics suggest you need to expand out eventually and move on up to fit it all in there.

Quote:
Memory paging was awful for 8 bit and not good for 16 bit CPUs but it is not as bad when the pages are huge. PAE was ok but could have been better. The Amiga could keep from moving to 64 bit with something similar as it already has good code density. Moving to 64 bit would open up a practically unlimited address space for a personal computer although there is a performance cost.


Here the shared address space could be a burden. If there was separation it could open up options. For example, the kernel assigning each process a possible 4GB memory space, which would be mapped in by the MMU, but each process would need it mapped in when it had CPU time. Or, the same type of system, but with the kernel operating in 64-bit, and processes running 32-bit code with 64-bit pointers preloaded into registers. Tasks would need pointers to the real 64-bit addresses.

cdimauro 
Re: 68k Developement
Posted on 5-Aug-2018 16:41:42
#54 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey

Quote:

matthey wrote:

In a standalone FPGA based Amiga computer, the SIMD unit can access (much faster and more) chip ram. I believe there would be more disadvantages than advantages to moving the SIMD unit to a separate custom chip. Bus arbitration and cache coherency problems were likely partially responsible for stopping Natami development. It makes sense to go the other way moving more functionality into the CPU and moving the custom chips closer as a SoC. Gunnar wanting to replace the Amiga blitter with the SIMD unit makes sense.

If he wants to replace the Blitter with a SIMD unit, it's better to have a separate coprocessor (even a 68K derivative) to offload the CPU. So, resembling the Amiga.

A SIMD coprocessor with a banked register set, which is internally transferred once the CPU has set all registers and started the SIMD operation. So that the CPU can immediately start setting the registers for the next operation, without the need to wait for the coprocessor to finish the current operation.

That's something which I've missed with the regular Blitter.

Having a separate coprocessor is very important because, as we know, the Amiga o.s. is single core/thread and cannot use multiple processors. But it does allow having and controlling coprocessors.
Quote:
I would keep the SIMD unit (and preferably register files) separate like the 68k FPU unit (6888x, 68040 and 68060) which allows for more parallelization but requires more logic and resources (instructions of different CPU units can often execute in parallel).

Exactly. But a coprocessor is even better: see above.
Quote:
C2P is no longer necessary as chunky is already available in most FPGA Amiga computers with RTG and there is little to be gained by further accelerating legacy bit plane support.

I absolutely agree.
Quote:
A wider SIMD unit would be awesome. The number of parallel operations doubles with each doubling of the SIMD register width so a 256 bit SIMD unit could do 4 times as many operations (*not* instructions) in parallel. However, the resources needed are much more also (nearly quadruple?). Choosing a 64 bit SIMD unit and even merging with the integer units is efficient on resources and fine for a one off CPU for some embedded application but very bad for an ISA which needs to scale for mid to high performance CPUs.

Indeed. Modern SIMD units use a separate set of registers. Sharing the FPU registers set, or even the GPR one, is quite rare.
Quote:
SIMD units are bad at scaling as is. All the outdated SIMD unit support has to be supported for compatibility (for x86 this includes MMX, SSE, SSE2, SSE3, SSSE3, SSE4, AVX, AVX2, AVX-512). All this baggage requires logic and uses encoding space (SIMD instruction lengths started at 3-4 bytes for MMX and are now 6+ bytes). It is difficult for mid performance and energy efficient CPUs to keep all this old support.

In reality on x86/x64 there are only 4 "macro SIMD families": MMX, SSE, AVX, and AVX-512. The difference between the different versions that you reported is primarily related to the instructions which were added to the "base" (original?) SIMD ISA.

The interesting thing is that those SIMD families share many opcodes: it's only the prefixes used which determine the different behavior and/or enable new features for an instruction.

If you analyze the instructions opcodes you'll immediately figure it out. Take the PADD, for example, from Intel's Architecture Manual, and you'll see the (same) pattern / "base opcode".

Unfortunately you pay this simplicity by requiring a more complicated decoder.

Last but not least, there are rumors about the possibility that Intel will remove some legacy stuff from its future architectures. I think about MMX here, and this makes sense since I haven't found MMX code for a very long time (and never found it when disassembling x86/x64 executables): SSE is the bare minimum (SSE2 starting with x64).
Quote:
Ideally, 68k SIMD unit support should avoid these problems by starting with a more modern SIMD unit and number of registers (probably 256 or 512 bits and 16 or 32 registers).

32 registers is a good amount for a SIMD unit, especially with the new trends/challenges.

And 256 bit vector registers are really the bare minimum nowadays. The tendency is to have wider vector registers (but variable-length / size-agnostic).
Quote:
Encodings for floating point support should be reserved if fp is not available at first. If there are not enough resources for a proper SIMD unit then don't add a standardized SIMD unit (keep it experimental as I suggested to Gunnar).

I fully agree.

P.S. Sorry for my late answer(s), but I write once I've some (enough) free time.

cdimauro 
Re: 68k Developement
Posted on 6-Aug-2018 13:55:57
#55 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey

Quote:

matthey wrote:

Lowering the V has helped but frequency has increased more. I believe capacitance increases also at higher clock speeds and with more (active) gates but this was partially offset with die shrinks. Die shrinks are no longer easy though because of current leakage as we go smaller (end of Moore's law). The current leakage increases the power requirements. The Power section in the following paper explains.

https://www.nxp.com/docs/en/white-paper/multicoreWP.pdf

Very interesting, thanks. It also shows some nice data about single core vs dual core vs single core with double frequency, and talks about the L2 benefits. Cool.
Quote:
IBM usually doesn't make CPUs without many guarantees up front. Apple wasn't anything then like it is now. Apple was bailed out in August 1997 by Microsoft with a $150 million preferred share purchase or it likely would have gone bankrupt.

True, but at the time (2000) Apple recovered quite well and sold some millions of Macintoshes. IBM wanted to expand its business, and that's why the G5 came (as a cut-down of its POWER4).
Quote:
RISC-V is simple and closer to the original Berkeley RISC. It is by some counts the 5th Berkeley RISC design. Alpha and MIPS were Spartan as they did not even have division or multiple load/store instructions as I recall. AArch64 and PPC are much more robust and closer to a RISC/CISC hybrid.

Hum. Why do you think that AArch64 is a RISC/CISC hybrid? Because there are many instructions and/or some multicycle ones like multiplication? But AArch64 is not like PowerPC: it lacks the load/store multiple-registers instructions, which required microcode (or cracking the instruction into multiple uops). AArch64 is more similar to Alpha and MIPS from this PoV.

Regarding RISC-V, it's not that simple. The base ISA is simple and small, with only a few instructions, but they added many more of them through some extensions which are standardized. So, the ISA is not really "Reduced", because if you want a "desktop-capable" version you need several extensions and many instructions.
Last but not really least, they say that the ISA has only 3 instruction formats, but if you take a careful look you'll see several formats and exceptions too. Especially looking at the (almost finalized) Vector extension.
Quote:
I understand. The integer unit usually uses superscalar parallelism on cached data while the SIMD unit usually uses SIMD instruction parallelism on streaming (often uncached) data. From a simpler perspective, there is nothing inherently bad about having integer and fp values in the same register file and even has advantages like faster fp2int/int2fp conversions and reduced resource usage. The big disadvantage is reduced parallelism from contention of the shared resources. I tend to prefer separate register files for separate units but I'm open minded enough to consider the advantages of shared especially where resources are limited. Your perspective is very much high performance with OoO and practically unlimited resources but this is the minority of CPUs (the majority of CPUs are embedded where efficient use of resources is often more important).

RISC-V was also created for the embedded market, and it has separate register files (but with 16 registers instead of 32). Yes, it might not be so efficient, but this is the latest tendency even for embedded processors.
Quote:
A table lookup for the 1st 16 bits and one for the 2nd 16 bits should still be fast. x86_64 CPUs usually have their limits on how many prefixes they can decode in one cycle also.

Sure, but only the more complex cases require more cycles for decoding. The vast majority and most common instructions use the simpler decoders.
Quote:
The 68k variable length 16 bit encoding system has been copied by many ISAs.

Do you mean just using 16-bits as "unit of measure" for the various opcode parts, or are you talking about ISAs which specifically copied parts of the 68K's opcode structure? If so, can you give some examples?
Quote:
How many ISAs have copied the x86_64 variable length byte encodings with prefix hell?

None that I know of, but that's a different argument. The fact that x86 & x64 had to use several prefixes to extend the basic 8086 opcode structure doesn't mean that other ISAs are exempt from criticism about bad design decisions.
Quote:
With a 64 bit mode, much can be trimmed.

.b/.w/.l use the same encoding and .q is used in the 4th slot. Most of those conflicting instructions are 68020 instructions which can be dealt with. It cleans up and simplifies the encoding map nicely.

OK, got it, but this way you already lost binary compatibility with the 68K: your 64-bit mode is a new ISA, effectively. Similar, from this PoV, to what AMD did with x64: largely binary and source-level compatible with the 32-bit mode, but a new thing.

Don't get me wrong: this is a good step because you don't need prefixes here, unlike x64.

However this way you still keep the 68K issues with instruction decoding due to the same extension words format, which requires the decoder to take a look at them to finally compute the instruction length. Plus, the extension words inherit the same pitfalls as the 68020 extensions (particularly the double-indirect modes).
Quote:
Quote:
Prefixes are needed, yes, but 2 instructions? Can you make some example?


add.q #,Rn
mov.q (a0)+,(a1)+

Ok, I exaggerated with the "usually" and should have used "sometimes".

Understood, though there are counter-examples as well.
For the first instruction, x64 has short immediates which sign-extend an 8-bit value to 64 bits. Sure, the add.q allows the use of any 64-bit value, so it's more general, but in many (common) cases the 8-bit signed value doesn't require 2 instructions and allows a much better code density too (a 64-bit 68K needs at least 10 bytes).
For the second example, MOVSQ is usually used.

This is just to say that there are pros and cons for both 64-bit ISAs.
Quote:
The x86 ISA usually does need more instructions than the 68k though (Vince's x86 code needed 36% more instructions than the 68k).

That's because its target was only the code density, so it had to use short instructions and the so called "high registers" to keep the total number of bytes down.

This is another reason why I don't rely on this contest.
Quote:
I think the 68k is salvageable. It trades a little performance for great code density much like Thumb 2 (which looks like it copied from the 68k).

I don't see similarities with Thumb-2. Thumb-2 reintroduces the 32-bit ARM instructions, but without the conditional execution. It was the first Thumb which introduced the very compact ISA with 16-bit opcodes, but they still look quite different from the 68K, and borrow something from x86 too (the PUSH and POP instructions are the first example that comes to my mind).

Do you have some specific example of something which Thumb copied from the 68K?
Quote:
The 68k is a little more difficult to decode

Hum. No, Thumb and Thumb-2 are much, much easier to decode compared to the 68K. The 68K has too many pitfalls from this PoV, as we discussed a bit in some messages here (including this one, at the top). And the 64-bit version which you proposed can only complicate the decoder, which has to deal with several exceptions & remappings.
Quote:
but it has fewer instructions, fewer memory accesses and better code density. Thumb 2 has some modern improvements but I expect the 68k to have better performance with everything equal.

Including the number of transistors (logic gates) for implementing it? And what about power consumption?

I agree with your first sentence: the benefits are evident, but those aren't the only metrics to "measure" an ISA goodness.
Quote:
The 68k has more free encoding space than most old ISAs for improvement.

I don't know here because it's too generic a statement, since you don't say which ISAs and which improvements to add.

But to give a counterexample, if you want to introduce a modern SIMD unit with all the good "bells and whistles" which are required nowadays, while keeping the CISC "goodness" (AKA: allowing any EA for the second source of ternary operations), then you already know that the available encoding space is absolutely not enough for this task. I've already made precise calculations about it in one of my replies to Hypex, and there I assumed a very optimistic scenario (completely consuming all the A and F opcode lines).
Quote:
It would be cool if a CPU was created for scientists, engineers and educators. The feature rich extended precision FPUs had them more in mind instead of the later fast and simple FPUs (fast and simple is for the SIMD unit). I wouldn't mind seeing 128 bit quad precision IEEE hardware support but I expect it to be slow. Extended precision requires a 67 bit barrel shift for normalizing so quad precision would require something like a 115 bit barrel shift. It probably wouldn't add any more than a cycle in latency on modern hardware. Division would be more of a problem. Without a customer asking for it, hardware accelerated (hardware+software) quad precision support sounds like a better selling point.

Well, that's the point, and as I've said before, there are already customers asking for it, and they have been for a long time. It's just a matter of time, and they'll get 128-bit FP precision from "mainstream" ISAs.

RISC-V already provided room for it, because they already know that it's a market need that should be covered by this fully scalable ISA. They clearly and quite explicitly said that they want to dominate every market with this ISA: from embedded to HPC.

That's why it's better to think wider when talking about an ISA. Unless you want to cover only a specific market segment; but a generalized ISA will always be the first and strongest contender, and proposing a fully scalable ISA will definitely pay off over time due to the very strong ecosystem which will be built around it.
Quote:
I thought exposing Gunnar's lies which resulted in his goons attacking Meynaf and me on EAB would stop the attacks but the moderators were more concerned about privacy policies (which ironically weren't even in the forum rules). Either the private e-mails I posted were lies in which case I didn't violate anyone's privacy or they were the truth and I violated the privacy of someone who had done much worse. Our laws here in the U.S. often protect criminals more than victims effectively punishing the victims too. My banning was more a case of absolute power by a moderator corrupted absolutely though. I guess there will be no new updates to any of my Amiga projects which were exclusively on EAB but there isn't much demand anyway as the Amiga is dead (It was just a nail in the coffin).

I know the situation very well (I was attacked by Gunnar with lies as well, but on AROS-Exec; albeit his attempt to invoke censorship against me got no follow-up), and I know how some moderators behaved and still behave (especially DamianD, who has nothing "moderate" about him, since he usually attacks and STARTS flaming against some users he doesn't like), but there you made the mistake of giving them a reason to ban you after they had already warned everybody not to continue.

Whether we like it or not, they "have the upper hand". And now unfortunately you're out of EAB, which is a pity.
Quote:
Each memory accessing instruction counts as one including implicit memory accessing instructions like BSR. I highlighted the instructions which accessed memory in the source and gave the files to Vince to add to his github page for public review. It is the simplest way to do it even if it favors multiple memory access instructions (which can often be faster than multiple memory accesses by separate instructions).

Yup. This is another clear advantage of CISCs.
Quote:
For simplicity's sake, I did not break out read, write, read/modify/write, mem to mem, load/store multiple or pair instructions, etc. I would have thought CISC architectures would have an advantage here but x86/x86_64 performed the worst and several RISC architectures were a little better than the 68k.

The problem with x86 and (especially) x86_64 was the one which I talked about before: Vince optimized only for density, increasing the number of instructions and all other metrics. This is particularly evident with x86_64 (and x86_32), where basically he didn't use the extra 8 GPRs which the ISA makes available: it looks like 80386 code, without the benefits of the new ISA...
Quote:
More registers can also be helpful at reducing memory access instructions.

And memory-to-memory instructions too.
Quote:
It is a narrow and small benchmark but typical of common programs. Hand optimized assembler really is the only way to fairly compare ISAs as otherwise compiler support is the most important factor. We know which ISAs have the best compiler support.

True. But consider that hand optimized assembly code isn't common, and the vast majority of binary code is generated by (high-level) compilers. In the end, when you run an application, it's quite unlikely that it executes assembly-optimized code. That's the reality nowadays, and that's why an ISA needs good compiler support.
Quote:
I have thought about doing the same as you suggest. Compiler support and options can make a huge difference but it would still be interesting.

A good compromise would be to use the same mainstream compiler for all architectures. One can only hope for good support here, but usually many optimization tricks are general enough to be used by many ISAs.
Quote:
It is possible to achieve very good code density on the 68k using a variety of optimizations and trash registers which compilers seem to have trouble with (Vince's 68k code uses these). My goal is to make it so the same or better code density can be achieved with simpler peephole optimizations (can be done by vasm) and using fewer registers.

Peephole optimizations are tricky with the 68K (x86 has its problems here too, albeit a bit fewer), because most instructions alter the flags, and what's worse is that some only partially alter them.
Quote:
It would be possible to make a more dense enhanced 32 bit 68k ISA than 64 bit 68k ISA so I will be giving up a little there.

I think that it's better to focus on a 64-bit ISA: this is a must-have nowadays, which is where desktop and mobile have moved (servers and HPC joined 64-bit a very long time ago).

It's only the embedded segment where 32-bit (and even less, sometimes) ISAs still look good, but that's not enough for an ISA with so much competition (again, RISC-V).

P.S. Sorry, but I have no time now to re-read and fix typos/errors. I hope that the writing is still understandable.
EDIT: corrected a bad quote on-the-fly.

Last edited by cdimauro on 06-Aug-2018 at 04:41 PM.

cdimauro 
Re: 68k Developement
Posted on 7-Aug-2018 17:37:17
#56 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey
Quote:
matthey wrote:
3) Create all sections once (including writable sections) and let the re-entrant processes share the writable sections. This allows merging all sections and using fast PC relative addressing. Some care is needed for arbitration of writable data but this provides the best means for re-entrant sharing and good performance. I expect it to be better for security than "shared" .so/.dll libraries. Amiga libraries work like this.

Is it ok to introduce inferior library technology to the AmigaOS to make shovel ware software easier or is it just a sign of desperation for the Amiga? What is the point of the Amiga if it is a tech follower instead of leader?

Actually the .so/.dll model is a superset of the Amiga libraries, since they allow a per-process/thread data section. Amiga libraries are a special case, because there's only a global data section (the library base).
Quote:
The Apollo Core sometimes outperforms PPC accelerators for the classic, old mac hardware, Efika and even Sam in some cases. Most benchmarks which are faster are partially due to faster memory and having an SIMD unit. The 68k is particularly strong in some areas where the PPC is weak even though the much higher clock speeds mask the advantage. Gunnar's Sortbench benchmark shows the 68k strength and PPC weakness when adjusting for clock speed. Even the in order 68060 outperforms most PPC CPUs.

ARM Cortex A4 0.30 MB/s/MHz
ColdFire v3 MCF5329 0.44 MB/s/MHz
Raspberry Pi ARM 1176JZF-S 0.652 MB/s/MHz
ARM Feroceon 88FR131 0.69 MB/s/MHz
IBM Power 6 0.69 MB/s/MHz
Intel Atom 0.84 MB/s/MHz
IBM Power 7 1.16 MB/s/MHz
AmigaOne X1000 PA6T-1682M 1.19 MB/s/MHz
PPC G4 7447 1.26 MB/s/MHz
68060 1.60 MB/s/MHz (1.87 MB/s/MHz with assembler optimizations)
Intel Core 2 Duo 2.61 MB/s/MHz
Apollo Core 3.70 MB/s/MHz
Intel i7 3770 4.32 MB/s/MHz (or more if smaller element size helps)

http://www.apollo-core.com/sortbench/index.htm?page=benchmarks

The benchmark is just a bubble sort. RISC can't schedule much between the load/store instructions in a short loop so they shoot lots of bubbles. The DCache is low latency on the 68060 and it is good at loops which helps also. This benchmark does scale up so an equally clocked 68k CPU would handily beat the same clocked PPC in MB/s. We can also see how much performance ColdFire lost. CPUs which can handle this benchmark tend to have good single core performance traits.

I don't trust this sort benchmark, because it was specifically created to model/mimic 68K code, and in uncommon scenarios.
If you take a look at the source, it resembles 68K assembly code, and it's totally different from the usual bubble sort code which you can see in C:
https://en.wikibooks.org/wiki/Algorithm_Implementation/Sorting/Bubble_sort#C
or even in pseudocode:
https://en.wikipedia.org/wiki/Bubble_sort#Pseudocode_implementation

Last but not really least, the data array is filled in reverse order, so it always matches the worst case (i.e. swapping the elements). It means that the branch predictor isn't stressed at all: it always predicts to swap the data! And the code is even worse in general, since it always writes back the two analyzed values into the array, whether they were swapped or not.

So the code is far from classical implementations, and far from common scenarios (real data isn't always reverse-sorted).

Since the C stdlib has a quicksort implementation ( http://www.cplusplus.com/reference/cstdlib/qsort/ ), a fairer comparison would have been a simple benchmark using it on an array filled with random values (with a fixed random seed, in order to have a reproducible test).
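For illustration, a minimal sketch of such a fairer benchmark in plain C (standard library only; the element count and seed below are arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 100000          /* arbitrary element count */

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);       /* avoids overflow of x - y */
}

int main(void)
{
    long *data = malloc(N * sizeof *data);
    if (!data)
        return 1;

    srand(12345);                   /* fixed seed: reproducible input */
    for (int i = 0; i < N; i++)
        data[i] = rand();

    clock_t t0 = clock();
    qsort(data, N, sizeof *data, cmp_long);
    clock_t t1 = clock();

    printf("qsort of %d random longs: %.3f s\n",
           N, (double)(t1 - t0) / CLOCKS_PER_SEC);

    free(data);
    return 0;
}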
Quote:
Assembler coding is still an important niche. The leading OSs, games, applications, embedded software, etc. still have code which is highly optimized in places.

True, but it's a very small part. And in recent years the general trend has been to use intrinsics, which make it much easier to abstract and implement the code while leaving the burden of register allocation and spilling to the compiler.
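A tiny example of the idea, using x86 SSE2 intrinsics only because they are widespread (the same applies to AltiVec/NEON intrinsics): the programmer picks the operations, while the compiler takes care of register allocation and spilling.

#include <emmintrin.h>      /* SSE2 intrinsics */

/* Add four 32-bit integers at once; no registers are named explicitly. */
void add_u32x4(unsigned int *dst, const unsigned int *a, const unsigned int *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi32(va, vb));
}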
Quote:
Adding 64 bit support to the AmigaOS while maintaining maximum compatibility is tricky. Pointers in structures are all 32 bits which means they are currently limited to the lower 4GB 32 bit address space. Some memory above 4GB could be used though. New 64 bit structures and libraries could be introduced but it is tricky to share 32 and 64 bit resources.

You cannot. At least not with the Amiga o.s.

An o.s. with a good design allows mixing 32- and 64-bit applications that share some resources, since it has the necessary abstraction (e.g. no pointer passing: only "opaque" references/ids/handles are passed).
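A minimal sketch of what such an abstraction could look like, in generic C with hypothetical names (not any real Amiga/AROS API): user code only ever sees a 32-bit id, while the kernel keeps the real pointers, which may live above 4GB.

#include <stdint.h>
#include <stddef.h>

#define MAX_OBJECTS 256

static void *object_table[MAX_OBJECTS];   /* kernel side: the real pointers */

/* Register a kernel object and hand out an opaque id (0 = failure). */
uint32_t handle_create(void *object)
{
    for (uint32_t i = 1; i < MAX_OBJECTS; i++) {
        if (object_table[i] == NULL) {
            object_table[i] = object;
            return i;                 /* same 32-bit id for 32- and 64-bit callers */
        }
    }
    return 0;
}

/* Kernel-internal lookup: user code never dereferences the pointer itself. */
void *handle_lookup(uint32_t id)
{
    return (id > 0 && id < MAX_OBJECTS) ? object_table[id] : NULL;
}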
Quote:
I believe it would be possible to keep a 68k CPU more compatible. I would like to allow 32 bit processes to run in 32 bit mode (limited to 4GB of address space) while 64 bit processes run in 64 bit mode. AROS added 64 bit support for the x86_64 but broke compatibility.

This is possible, but 32- and 64-bit applications would be substantially isolated, and communication between them would not be possible. AROS broke compatibility with its 64-bit flavor because of that.
Quote:
For positive values it's fine. For negative I thought there may be a problem. But it looks like it extends up to 64-bits with the 32-bit data being the same.

MOVEQ sign extends values between -128 and 127. The only question is whether it should extend to 32 bits or 64 bits in a 64 bit 68k ISA.

IMO it should naturally extend to 64-bits.

The big question, however, is a different one: what happens in 64-bit mode when you load a register with a 32-bit value? With 8- and 16-bit values we already know how the 68K (and x86) work: they leave the upper bits unchanged (which is a very bad thing for superpipelined processors). x64 "solved" the problem by zeroing the upper 32 bits, so you can freely mix 32- and 64-bit instructions (which represent the VAST majority of code; 8- and 16-bit instructions are VERY RARELY seen in x64 executables) without adding bubbles to the pipeline due to partial register updates.
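The two alternatives, modelled in C just to make the data flow explicit (the function names are made up for illustration):

#include <stdint.h>

/* x86-64 style: writing a 32-bit value clears the upper half, so the
   new register value has no dependency on the old one. */
uint64_t load32_zero_extend(uint64_t old_reg, uint32_t value)
{
    (void)old_reg;                    /* old contents are irrelevant */
    return value;
}

/* 68K byte/word behaviour extended to longwords: the upper half is
   preserved, which forces a merge with (and a dependency on) the old
   register value, the source of partial-register stalls. */
uint64_t load32_preserve_upper(uint64_t old_reg, uint32_t value)
{
    return (old_reg & 0xFFFFFFFF00000000ULL) | value;
}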
Quote:
Memory paging was awful for 8 bit and not good for 16 bit CPUs but it is not as bad when the pages are huge. PAE was ok but could have been better. The Amiga could keep from moving to 64 bit with something similar as it already has good code density. Moving to 64 bit would open up a practically unlimited address space for a personal computer although there is a performance cost.

True. But 64-bit computing is the default nowadays. Only the embedded market is still stuck on 32 bits.

@Hypex
Quote:
Hypex wrote:

I never saw why $4 was exposed to the programs at all. Why wasn't ExecBase stuck in A6 already when a program started? Where the OS stored it in memory was OS business.

Indeed. Another bad design...
Quote:
This neat separation presents another problem. While your 1, 2 and 3 present good pros and cons for a model, there is something else that would hold back PC relative access. Most OS libraries are in the ROM. The library base was like a global variable area. So for the code to reach a variable PC relative it would need to reach it from ROM to where it is in RAM. And the ROM code is static. Unless I have misinterpreted what you are referring to with PC relative. Disk based libraries would have more freedom.

But nowadays you don't really have ROM code, unless you're talking about embedded systems, where an EEPROM is typically used to store the firmware.

So, we can assume that libraries are located in RAM (maybe read-only protected).
Quote:
I wonder how hand coded RISC would do here?

Well, the bubble sort code looks like 68K assembly. You could (completely) rewrite it so that it looks like PowerPC code (hence favoring that architecture through the compiler). At least it would be as "fair" as what Gunnar did...
Quote:
Somehow, IIRC, Linux PPC32 programs can run on a PPC64 kernel in user space. So with enough planning and underlying CPU support it looks possible.

Only because Linux has a good abstraction level. The Amiga o.s. (and its successors/clones) is in exactly the opposite situation.
Quote:
Yes good point. Not even the 2MB barrier has been broken. Will the Vampire virtualise 8MB or 16MB chip ram?

Actually I've seen support for 4MB of chip RAM. Not that much (since it could reach at least 8MB), but since this area only serves the old, bitplane-based chips, it can be enough. Modern software usually takes advantage of RTG cards, which are definitely much better in almost all respects.
Quote:
There are a few comments about AmigaOS being restricted to 31-bits due to some bad designs.

Indeed.
Quote:
But I don't think this can really hamper AmigaOS on 32-bit hardware. At most a 32-bit machine can access is 2GB with memory mapped hardware.

Or even less, if you have to give some address space to peripherals.
Quote:
I saw this maxed to around 3GB before the PC went 64-bit.

This was/is configurable, depending on the kernel implementation. You can have all 4GB of address space (and virtual memory) available for a (single) application, if the o.s. allows it (e.g. by completely isolating kernel memory from user land).
Quote:
Which seems funny as I read that chips Intel had instructions to access hardware. But the 68K needed it memory mapped which looks inferior and memory consuming by comparison.

Both have pros and cons. I prefer memory-mapped I/O anyway, because peripheral register banks don't take up that much address space, and with memory mapping you can access the registers with the same fast, optimized code used for ordinary memory.
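A minimal sketch of the memory-mapped style (the base address, register layout and bit names below are made up; on the Amiga the custom chips are mapped at $DFF000 in the same spirit): the device registers are just volatile memory locations, so ordinary loads and stores drive the hardware.

#include <stdint.h>

#define UART_BASE   0xDFF40000u                          /* hypothetical address */
#define UART_STATUS (*(volatile uint32_t *)(UART_BASE + 0x0))
#define UART_DATA   (*(volatile uint32_t *)(UART_BASE + 0x4))
#define TX_READY    (1u << 0)

void uart_putc(char c)
{
    while (!(UART_STATUS & TX_READY))    /* poll: a plain memory read */
        ;
    UART_DATA = (uint32_t)c;             /* a plain memory write hits the device */
}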
Quote:
So even a 68K AmigaOS would need a new design to be ported to 64-bit. But I don't see AROS taking it's place. If I had a fancy new 64-bit Vampire accelerator in my Amiga I'd want to run a real Amiga OS on it, not a copy.

AROS can run on the Vampire, which already has 64-bit support. However, this creates problems when mixing 64- and 32-bit applications; see my reply to Mat above.
Quote:
Here the shared address space could be a burden. If there was separation it could open up options. For example, the kernel assigning each process a possible 4GB memory space, which would be mapped into by the MMU but each process would need it mapped it when they had CPU time.

This actually happens with 32-bit applications on some o.ses (Linux, with the 4GB-on-userland patches).
Quote:
Or, same type of system, but kernel operating in 64-bit, and processes running in 32-bit code with 64-bit pointer pre loaded into registers. Task would need pointers to real 64-bit address.

Hum. This depends on what the 64-bit architecture does when it loads 32-bit data into a register: if it zero- or sign-extends it, then the 64-bit pointer which you loaded at the beginning is lost. If it preserves the upper 32 bits, then it's fine, but that hurts performance on superpipelined processors.

michalsc 
Re: 68k Developement
Posted on 7-Aug-2018 18:32:43
#57 ]
AROS Core Developer
Joined: 14-Jun-2005
Posts: 377
From: Germany

@cdimauro

Quote:
Last but not least, the data array is filled in reverse order, so it always matches the worst case (i.e. swapping the elements). This means that the branch predictor isn't stressed at all: it always predicts that the data will be swapped! And the code is even worse in general, since it always writes the two analyzed values back into the array, whether they were swapped or not.

So the code is far from classical implementations, and far from common scenarios (real data isn't always reverse-sorted).


I have looked at that code too. It's an awfully written bubble sort routine which aims to match the m68k architecture as much as possible. The best example of how to write a biased benchmark...

Quote:
True, but it's a very small part. And in recent years the general trend has been to use intrinsics, which make it much easier to abstract and implement the code while leaving the burden of register allocation and spilling to the compiler.


Indeed. Looking at the source code of the Linux kernel, or closer to Amiga land at the lowest-level sources of AROS, one can hardly see any assembler code. At best there are a few essential parts which have to be done in asm (interrupt/exception prologue/epilogue and such); the rest is in C, occasionally mixed with a few single asm lines added here and there.


Quote:
An o.s. with a good design allows mixing 32- and 64-bit applications that share some resources, since it has the necessary abstraction (e.g. no pointer passing: only "opaque" references/ids/handles are passed).


Yes, indeed. But even in such a case it is only the kernel supporting both 32- and 64-bit code. Most likely the userland (libraries and such) needs either complete 32-bit equivalents or at least middle-weight wrappers allowing 32-bit code to communicate with the rest of the libraries/applications. Just look at examples such as WoW64 (the set of 32-bit libraries necessary to run 32-bit applications on 64-bit Windows), or the whole set of 32-bit libraries in every 64-bit Linux distribution.

Quote:
This is possible, but 32- and 64-bit applications would be substantially isolated, and communication between them would not be possible. AROS broke compatibility with its 64-bit flavor because of that.


When making the AROS port to the x64 architecture I decided not to support 32-bit code at all. It was a deliberate decision, because in the case of an Amiga-like OS, supporting both 32- and 64-bit code would be total chaos, comparable to demanding native support for big-endian m68k code on a little-endian x86 running AROS.

Quote:
Somehow, IIRC, Linux PPC32 programs can run on a PPC64 kernel in user space. So with enough planning and underlying CPU support it looks possible.

Only because Linux has a good abstraction level.


And because on Linux there is a second (32-bit) set of the necessary libraries.

Quote:
Both have pros and cons. I prefer memory-mapped I/O anyway, because peripheral register banks don't take up that much address space, and with memory mapping you can access the registers with the same fast, optimized code used for ordinary memory.


Not to mention that the I/O space of x86 CPUs is awfully limited (64K).

cdimauro 
Re: 68k Developement
Posted on 7-Aug-2018 18:57:48
#58 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Hypex

Quote:
Hypex wrote:
Wow. 64 registers at 128-bit each?

Exactly. There's a paper about the evolution of the PowerPC SIMD units which explains why, with VSX, they decided to "unify" the FPU and AltiVec registers to provide a uniform, 64-entry register set in the latest SIMD extension: https://www.researchgate.net/publication/299472451_Workload_acceleration_with_the_IBM_POWER_vector-scalar_architecture
Quote:
Well we are talking about 68K here so the 68K standard is 8 specific registers per set. 8 data, 8 address, and a futher 8 vectors make sense. But if you think there should be more perhaps we can squeeze 16 in there. But we'll need a prefix.

Or an ad-hoc bigger opcode.

Anyway, the above paper also explains the reasons behind the need for more registers in SIMD computations. They found 64 to be a good compromise, but maybe 32 can be OK-ish for a CISC too.
Quote:
I also proposed maximising the total register file to 1024 bits. With vectors being variable width up to that size. Suppose it will need to be 2048 bits for those 16 vectors now.

If you want a vector-length-agnostic ISA then it's better not to put hard limits on the register size.

The total number of registers is also very important, for keeping more data in registers (loop unrolling included).
Quote:
I think I meant, given vectors now have extra large sizes, such as 512 bits, then might as well use an opcode prefix to code it all in. Since with that amount of data it will need an extra word to code it in anyway.

Well, it depends on what kind of support you want to give to programmers and compilers. You can set the length in stone (or, for vector-length-agnostic SIMD ISAs, at the beginning of the computation), and then you can avoid prefixes specifying which vector size to use.
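A rough sketch of the vector-length-agnostic idea in C: the width is queried once at run time instead of being encoded in the instructions. simd_elems_per_vector() is a hypothetical placeholder for whatever the ISA would expose.

#include <stddef.h>

/* Hypothetical: in a real VLA ISA this value would come from the hardware. */
static size_t simd_elems_per_vector(void) { return 8; }

void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    const size_t vl = simd_elems_per_vector();
    size_t i = 0;

    for (; i + vl <= n; i += vl)          /* one "vector" per iteration */
        for (size_t j = 0; j < vl; j++)   /* stands in for a single vector op */
            dst[i + j] = a[i + j] + b[i + j];

    for (; i < n; i++)                    /* scalar tail */
        dst[i] = a[i] + b[i];
}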
Quote:
Well there is a larger vector count or a larger vector field width. Either way it all has to be coded in.

The problem comes when/if you want both...
Quote:
Oh ok. so the $Fxxx space. Well, since technically $Vxxx doesn't exist, perhaps so. But that breaks compatibility.

Is there also an MMU space?

No. Since Motorola systematically changed the MMU across its processors, we can think about completely dropping the MMU instructions mapped onto the F-line and introducing some new MMU instructions (in some 32-bit opcode space) to deal with the MMU, as happens on all architectures which have no coprocessor support.
Quote:
Okay. Just not address or data?

Good question. There are pros and cons to having a unified GPR set versus separate ones for addresses and data. I prefer the former, but the latter is a clear win when code density is important.
Quote:
That's interesting, so the 80286 supported a flat memory model?

No, still segmented, but without overlapping segments (a la 8086) and with MANY available segments (8K global/shared among all processes, plus 8K local to each process).

However, each segment had single-byte granularity and could be transparently extended (up to the maximum limit of 64KB). Very nice features.
Quote:
And the 80386 fully brought 32-bit. It's never clear to me where the flat memory model comes into it with x86 when I look it up.

We can say that it came with the 80386. However, this processor is still segmented, exactly like the 80286. The main difference is that each segment can be up to 4GB in size (instead of 64KB).

Many o.ses decided to ignore / not use segmentation, because a single 4GB segment was enough, so basically everything was flattened.
Quote:
And it looks backwards, consistent with littel endian, when I read about real mode. I would have thought real mode would be a real memory mode as in a flat memory mode. But no, it's backwards, describing original or segment mode.

It's the original 8086 mode. But you can still use 32-bit code (with some limits) and 32-bit data (no limits), though only under DOS.

Other o.ses always ran/run in Protected Mode (but in a flat 4GB address space).
Quote:
I mean it being based off MMX. MMX derives from x86. So they are basing their AMMX on a vector extension of a little endian CPU. Not that it matters. If it makes any difference.

No, AMMX is very different from MMX. It only borrows some MMX concepts (e.g. integer operations, and not having a separate register set).
Quote:
Couldn't have worked then. But I think say being able to use same colour depth either way would have been good. Like using 4-bit or 8-bit packed pixel modes, but restricted to one plane, so the same amount of memory was read in for either. Though, internally, instead of the usual bitmap to colour index conversion or however it worked, it had the indexes right there. So, if need be, it could have virtualsed the depth and converted 16 pixels at a time from packed to planer direct in the bitplane registers. Like Akiko but internally in realtime in the chips.

No, you can also use "odd" values for the depth: 1, 2, 3, 5, 6, 7. Even with such "strange" depth values, packed/chunky modes were/are much easier to handle (and with better performance & efficiency) than bitplanes.
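To make the difference concrete, here is a hedged sketch of a single pixel write in both models (illustrative code, not any real graphics.library call): chunky is one store, planar is a read-modify-write on every bitplane.

#include <stdint.h>

/* Chunky, 8 bits per pixel: one store, whatever the depth. */
void put_pixel_chunky(uint8_t *bitmap, int bytes_per_row, int x, int y,
                      uint8_t color)
{
    bitmap[y * bytes_per_row + x] = color;
}

/* Planar with 'depth' bitplanes: touch one byte in every plane. */
void put_pixel_planar(uint8_t **planes, int bytes_per_row, int depth,
                      int x, int y, uint8_t color)
{
    uint8_t mask = 0x80 >> (x & 7);              /* bit within the byte */
    int offset   = y * bytes_per_row + (x >> 3);

    for (int p = 0; p < depth; p++) {
        if (color & (1 << p))
            planes[p][offset] |=  mask;
        else
            planes[p][offset] &= ~mask;
    }
}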
Quote:
Can just use the first bitplane pointer. In my invesigation of RTG that's what an RTG bitmap did. One large plane packed across.

Exactly.
Quote:
I've gone over this and I don't know. It was common at the time. Memory efficient.

Bitplanes are the exact opposite: they are memory-inefficient (in both space and bandwidth). See above, and think about it.
Quote:
And it made all the features of the Amiga hardware possible. Parallax layers rely on it on the Amiga.

You can do it with 2 packed pointers as well.
Quote:
It's fine for blitting images with.

Packed/chunky are much better. And they don't require a mask for cookie-cutting operations.
Quote:
It only gets to be a real problem when you want to work one on one with pixels. Or do scaling.

Or when you have smaller objects to draw. Think about gun bullets: they have a small width, but on the Amiga they needed to be 16 pixels wide...
Quote:
I never got this. I read about it for years and it never made sense. Until some time later I understood it. It was shifted four bits to the left. And then I didn't understand again. Because that just looked stupid. Why would they bother shifting it a nibble across? That ain't very future proof! So 16 bit extends to 1MB? What a waste! I just didn't get it. Why wouldn't they shift it a logical amount like 16-bit left? Or, in my mind, set a 16-bit high word, accessing it with a 16-bit low word.


There was a motivation for such overlapping segments, but I have no time now to explain it (nor to correct typos, sorry), and believe me: it isn't worth knowing. Ignore/forget it, and life will be better.
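For reference only, the arithmetic being discussed: the 16-bit segment is shifted left by 4 and added to the 16-bit offset, giving a 20-bit (1MB) physical address, with many segment:offset pairs aliasing the same byte.

#include <stdint.h>
#include <stdio.h>

/* 8086 real-mode address: (segment << 4) + offset, wrapped to 20 bits. */
static uint32_t real_mode_addr(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFF;
}

int main(void)
{
    /* Two different pairs hitting the same physical byte. */
    printf("%05X\n", real_mode_addr(0xB800, 0x0000));   /* B8000 */
    printf("%05X\n", real_mode_addr(0xB000, 0x8000));   /* B8000 again */
    return 0;
}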

cdimauro 
Re: 68k Developement
Posted on 9-Aug-2018 7:12:37
#59 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@michalsc

I fully agree with you. Just a couple of more details below.
Quote:
michalsc wrote:

I have looked at that code too. It's an awfully written bubble sort routine which aims to match the m68k architecture as much as possible. The best example of how to write a biased benchmark...

It's even worse: the code is specifically written to exploit the Apollo 68080 microarchitecture. That's why it loads the two values to be checked, checks & swaps them if needed (which is always the case, since the data is always filled in reverse order), and then always writes both back to memory.

If you take a look at the comment which reports the 68K assembly, this is quite evident. In fact, the "if" & "swap" are basically represented by two instructions: a conditional branch followed by the exchange. Those instructions are merged by the Apollo core and internally transformed into a single, conditionally executed instruction (like on 32-bit ARM, to be clear).

This means that the Apollo core will NEVER introduce bubbles in the pipeline due to a branch misprediction (as happens on other architectures that lack such a mechanism), because... there are no branches at all! And this happens in the most important (and performance-critical) part of the bubble sort.

So, I prefer not to call it "biased", but cheating...
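For clarity, a rough C rendering of the pattern described above (not the actual benchmark source): the two neighbours are always stored back, so the swap reduces to conditionally executed data movement with no unpredictable branch.

/* One pass of the always-store bubble sort variant discussed above. */
void bubble_pass_always_store(long *a, int n)
{
    for (int i = 0; i + 1 < n; i++) {
        long lo = a[i];
        long hi = a[i + 1];

        if (lo > hi) {        /* branch + exchange: reportedly fused into a  */
            long t = lo;      /* single conditionally executed operation on  */
            lo = hi;          /* the Apollo core                             */
            hi = t;
        }

        a[i]     = lo;        /* both values written back unconditionally */
        a[i + 1] = hi;
    }
}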
Quote:
When making the AROS port to the x64 architecture I decided not to support 32-bit code at all. It was a deliberate decision, because in the case of an Amiga-like OS, supporting both 32- and 64-bit code would be total chaos, comparable to demanding native support for big-endian m68k code on a little-endian x86 running AROS.

I absolutely agree.

You can find solutions (e.g. allocating o.s. structures, and public memory allocations, in the low 4GB of memory) to mitigate the issue of mixing 32- and 64-bit applications, but this is just a patch which doesn't solve the main problem, which will come back once you saturate those resources. And that's not even counting the complexity of implementing and maintaining such a dirty patch.

Better to avoid it completely, and not create false hopes...

kolla 
Re: 68k Developement
Posted on 9-Aug-2018 8:33:53
#60 ]
Elite Member
Joined: 21-Aug-2003
Posts: 2896
From: Trondheim, Norway

@michalsc

Quote:

And because on Linux there is a second (32-bit) set of the necessary libraries.


Yes, people seem not to understand (or somehow forget) that running 32-bit software on 64-bit Linux means having two OS installations on the disk, one 64-bit and one 32-bit.

_________________
B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC
