
/  Forum Index
   /  Amiga General Chat
      /  68k Developement
OlafS25 
Re: 68k Developement
Posted on 19-Sep-2018 18:37:09
#321

@kolla

that is correct...

Gunnar had a clear vision regarding hardware but not many ideas regarding software. You would need adapted libraries, improved compilers, and ideally a GCC that compiles optimized software. That is obviously not the case. But it might nonetheless help by lifting the "real hardware base" to a 68060 with 128 MB and RTG, compared to a 68030 with some RAM at the moment. On WinUAE you have endless resources anyway. Perhaps this will motivate more ambitious and demanding software (games) in the future. How many people will develop adapted Vampire-only software (games) remains to be seen. Big applications will not be developed on any Amiga platform because the user base (thousands) is much too small to support that.

Last edited by OlafS25 on 19-Sep-2018 at 06:52 PM.

OlafS25 
Re: 68k Developement
Posted on 19-Sep-2018 18:43:49
#322

@kolla

Regarding AROS, I asked if it boots on the Vampire but got no answer, so I have to assume the boot problems have not changed.

matthey 
Re: 68k Developement
Posted on 20-Sep-2018 1:52:09
#323

Sorry, I had some family issues to deal with lately. This thread looks like it has some issues of its own. I'll try to answer a few of the worthwhile posts I missed over the next few days.

Quote:

megol wrote:
My earlier design could execute it in theory (as it was never finished), as integer units could read address registers, but I still think it ugly. It starts changing the uses of A registers while not being too useful in itself.

Yes, it would open up the use of A registers for semi-constant values, but the initialization of the registers would still be over the D-A split, not orthogonal by design. It would require new code, so there is no inherent advantage over a normal register extension in compatibility; there would be a size advantage though (compared to using a prefix), and maybe that's enough.

But if the size advantage is the main point why not simply map the new data registers to the address sources? That would still be ugly of course and it would still be a special case.


If mapping 8 more data registers using the free An in most EAs, then both source and data would be wanted. With a 68k 64 bit mode, this can be done more cleanly but there are a few instructions using the address registers which need new versions. The non-EA register fields are only 3 bits (0-7) meaning using two d8-d15 registers in many common instructions (and sometimes one with for example SUB) is not possible and it would be unclear when a prefix or longer instruction would be necessary for these cases. IMO, it would be uglier (and decrease code density) than opening up An register sources which reduces the number of instructions, improves code density, improves orthogonality and can be done practically for free. IMO, if you want more registers, the prefix method by itself is cleaner although still ugly. The advantage here is being able to use register ports which are the same for both 32 and 64 bit mode. Otherwise, instruction level compatibility would be enough and a 64 bit SuperCISC ISA could be made which would map most 68k and x86/x86_64 instructions.

Quote:

Address registers as sources are much easier to support than address register destinations; in the design mentioned above that would mean increasing the size and capability of the address generation units. That's also why I opposed your base register update: increments and decrements are easy to handle, but updating from the EA can increase the latency of address register writes, potentially decreasing performance, _and_ would require a more complicated AGU design.


There is no use for an "Address Register Indirect Full Format with Predecrement".

forward_old:
lea (8,a0),a0
addq.l #1,(a0)+
clr.l (a0)+

forward_new:
addq.l #1,(8,a0)+ ; base register update + 4
clr.l (a0)+

reverse_old:
lea (12,a0),a0
clr.l (a0)
addq.l #1,-(a0)

reverse_new:
clr.l (12,a0)! ; base register update (predecrement unnecessary)
addq.l #1,-(a0)

An advantage is a decreased number of instructions. It is possible to improve code density in performance code by avoiding several (d,An) addressing modes, although there is a potential to reduce code density in some short sequences. The full format addressing modes are more complex to decode and execute, but the Apollo core is able to do most of them for free (no EA calc delay). This favorable timing makes them more appealing, but adding them would require bigger muxes and could slow decoding and/or the AGU. That is what I need to know: is there a way to do it practically for free? It would be possible to scale displacements by the instruction size, giving a larger range of displacement in 64 bit mode, which would also require more EA calc work. I may need the bits in the full format extension for better 32 bit compatibility in 64 bit mode anyway. My ISAs are for evaluation purposes and give a visual idea of some less than conservative ideas. Perhaps someone will see the idea, like it and figure out a way to gain an advantage. Perhaps it will be gone in a new ISA for evaluation from me.

Quote:

You are talking about a mid-90's design, when the optimal design point was completely different compared to now, even if implemented in (relatively) slow FPGAs. In fact having to use FPGAs shifts the optimal design point even further from the 68060, Pentium and (more comparable with the 68060) the Cyrix 6x86 type design.

For high performance in an FPGA the pipeline will have to be deeper than in an ASIC; there is a paper comparing ASIC and FPGA design that makes that clear (will try to find it later), and there are processor design papers that conclude an FPGA design has to fit the available hardware and be deeper than ASIC designs.

In the 68060 cache latency is low, 1 clock cycle. In a modern x86 it is about 4 cycles. In the 68060 decoding is fast; in modern designs it usually takes several stages even for RISC (not always apparent in pipeline descriptions, as each stage does multiple things).

I would prefer a deeply pipelined scalar CISC with out-of-order execution as a design point for FPGA, just as I did many years ago. I'd just scale down my earlier design's emphasis on pushing the clock rate as high as possible. And yes, I think it would be competitive with your in-order design, as it would clock higher and could tolerate cache misses and cache latency much better.


A longer pipeline is beneficial for increasing FPGA CPU core clock speeds, much like on a normal CPU. Of course there are major disadvantages to higher clock speeds. There are other ways to increase the IPC which work well in an FPGA. Higher clock speeds and a larger L1 DCache would increase the DCache latency, and it is more important to keep it low on a CISC "memory munching monster" CPU. I would hope that a 2-3 cycle latency would be enough for a 1GHz or lower core. I believe that the single cycle L1 DCache accesses of the 68060 were an important contributor to it having one of the best single core performances per MHz of its time.

Quote:

No, the load-use delay is a problem for all processors, and no, CISC has no advantage other than code size. The same operations have to be done, the same dependencies have to be solved and the same physical accesses to caches have to be executed.


I disagree. CISC significantly reduces the memory traffic (including the number of accesses, while RISC compensates with more registers) and the number of load-use delays: the most common 68k addressing modes have no load-use delay, where most RISC CPUs have a load-use delay on every load (requiring more registers to unroll code to try to avoid them). Did I mention that RISC needs more registers, and if you run out the cost is much higher? Don't I have enough examples showing all this?
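The load-use argument above can be made concrete with a deliberately naive toy model (a sketch under stated assumptions, not a claim about any real core): count cycles for summing N memory operands on a single-issue CISC that folds each load into the ALU op with no load-use delay, versus a single-issue RISC that issues a separate load and, with no independent work to fill the gap, eats a one-cycle load-use bubble before each dependent add.

```python
# Toy model (illustrative assumptions only): cycles to accumulate n memory
# operands, single-issue, all cache hits, no independent work to schedule.

def cisc_cycles(n):
    # add.l (a0)+,d0 style: one load+add instruction, no load-use delay.
    return n

def risc_cycles(n, load_use_delay=1):
    # Separate load, then a bubble (nothing to fill it with), then the add.
    return n * (2 + load_use_delay)

print(cisc_cycles(8), risc_cycles(8))  # 8 vs 24 in this worst case
```

With unrolling and more registers a RISC compiler hides most of those bubbles, which is exactly the "RISC compensates with more registers" point made in the post.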

Quote:

A simple in-order CISC processor can inline the cache fetch. It doesn't mean the address generation isn't there as in a RISC, it doesn't mean the cache latency isn't there, and it doesn't solve any dependency problems. In-order RISC can do the equivalent by skewing the execution pipeline so that the data cache results can be delivered to the execution stage of the dependent instruction - like in the POWER6 processor.
One potential advantage would be removing one dependency check for a load-op instruction. In practice the check still has to be there for any reasonable design.

The chain of FMUL instructions isn't illustrating anything interesting other than misdirection: it simply isn't relevant. Yes, if one has a two-wide scalar processor with two memory ports, one can do two memory operations per clock assuming a cache hit. And yes, a RISC would need four instructions to do the same thing. But the operation latency would be the same, as the same thing is done.

Now the reality is that the Apollo core, according to the (bad) documentation available, can't actually do two memory operations per clock. So we should be comparing a superscalar in-order RISC with the above. Per clock it could do one memory load and one FP multiplication. Same performance - with lower code density.


The Apollo core did have 3 in-order superscalar integer pipes at one time but I am not aware of it ever having a dual ported DCache (I suggested a 3rd pipe after looking at average instruction length statistics for 68k code). I told Gunnar a dual ported DCache looked really good and may be necessary to make the 3rd pipe "worthwhile". There would still be only one DCache write per cycle allowed but loads are much more common than stores and would help instruction scheduling of true dependencies instead of making them more difficult to schedule. I could write an instruction scheduler for the 68040+ which includes a hypothetical 3 pipe in-order superscalar core with dual ported DCache. I have been tempted to write a 68060 instruction scheduler for vbcc before but the backend code generation is bad enough that I would just be rescheduling many instructions which should be removed.

Quote:

And why does the 68k need so many registers? Because reality is what it is. We don't have a zero latency address generator, we don't have a zero latency cache and we don't have free multi-ported caches. Registers are much cheaper to access; multi-porting register files isn't cheap or free, but it still has to be done. With Xilinx FPGAs the register file is 64 entries for free; with Intel FPGAs 32 entries are free IIRC. That means using fewer registers is wasting resources, while using more requires extra effort.


Registers and register files are relatively cheap for a high performance CPU while less so for an energy efficient CPU. Accessing additional registers is considerably more expensive with CISC. I would still like to see a common benchmark which shows an overall performance boost of more than 2% with more integer registers on the 68k before looking for ways to add more registers.

Quote:

Not possible. The reason computing performance has increased is speculation; the important improvement of OoO execution isn't instruction reordering itself but providing support for speculative execution.
There are two problems to solve: increasing isolation between protection domains and decreasing information leakage. Both are possible with much less performance impact than that of removing speculative execution.


You sound more optimistic than the comp.arch guys.

Yet another speculation related security vulnerability
https://groups.google.com/forum/#!topic/comp.arch/QJjY8bX9qM0


bhabbott 
Re: 68k Developement
Posted on 20-Sep-2018 4:11:12
#324

Quote:

OlafS25 wrote:

But it might despite that help by lifting the "real hardware base" to a 68060 with 128 MB and RTG, compared to a 68030 with some RAM at the moment.
Not so much a 'base' as expanding the horizons of 68k Amiga owners. I bet the vast majority only have low-end machines, and many are probably happy with just enough RAM to run WHDLoad games. But for the few of us who want more, the Vampire opens up a whole new world.

Quote:
Perhaps this will motivate more ambitious and demanding software (games) in future.
Perhaps (if Vampire owners demand them) but 'more ambitious and demanding' doesn't mean they will be any good - it just means an even smaller market and therefore less incentive to put in much effort.

If I was an Amiga game developer I would target OCS and/or AGA machines first. Why?

1. There's still plenty that can be wrung out of these lower-end machines, particularly now that CF drives and RAM expansions are commonplace and cheap to add.

2. It fits into the whole Amiga 'retro' experience.

3. Producing a really good game takes a lot of time and effort on any platform, but generally less on lower-end machines - and your efforts may be appreciated more.

If I just wanted to make an 'ambitious and demanding' game I wouldn't even consider the Amiga, when PCs are so much more powerful. But then my puny efforts would compare badly to games that have had millions poured into their production. The Amiga offers a different challenge - how to produce a great game without demanding ever more powerful hardware.

Quote:
Big applications will not be developed on any Amiga platform because the user base (thousands) is much too small to support that.
Not sure what you mean by 'big', but high-end applications often successfully target a small market.

I don't know about other Vampire owners, but the reason I bought one wasn't to play games. My Vampired A600 is currently in the cupboard because I don't need it for what I am doing right now. It will come out again if IBrowse 2.5 is ever released, or if I need it to run some other 'demanding' 68k app. And I am willing to pay much more for a useful application than for a game I might only play once.

cdimauro 
Re: 68k Developement
Posted on 20-Sep-2018 19:55:33
#325

@CosmosUnivers Quote:

CosmosUnivers wrote:
@ppcamiga1
Quote:
Some have a problem with the fact that there has been no real progress in 68k development since the Commodore bankruptcy

100% agree : all is done by Hyperion, Cloanto, Phase5, Apollo Team, Jens Shoenfeld, Elbox, Bill McEvil and AmigaKit to make impossible any 68k progress...

They use licences, closed sources, and of course division to block everything...

I don't understand what you want: that those who hold LEGITIMATE copyright & license rights should give them up with nothing in exchange?

Are you paid for your work, or do you work for free? Just to know how much you value the work.

If you think that something should be "free", then you have 2 possibilities: either you PAY (maybe with a bounty) to open it, or you BOYCOTT the company until it follows in Commodore's footsteps (and hope that the owned rights go into limbo).

If you have alternatives you are free to expose them.

P.S. Moaning is NOT a valid alternative.

EDIT:
@OlafS25 Quote:

OlafS25 wrote:

@CosmosUnivers
why don't you contribute to Aros 68k then?

That's a very good point. And a valid alternative.

Last edited by cdimauro on 20-Sep-2018 at 07:56 PM.

cdimauro 
Re: 68k Developement
Posted on 20-Sep-2018 20:06:24
#326

@kolla

Quote:

kolla wrote:
@OlafS25

Quote:

Problem is Toni Wilen is not interested in supporting something like Vampire,


Wrong. Toni is not interested in supporting something like Apollo Core 68080 - he is just fine with "something like Vampire".

It's the 68080 that is the problem.

The reason why there is no 68080-optimised AROS is not a lack of 68k assembler coders; it is the lack of support in compilers - AROS is supposed to be portable C code. And the lack of support in compilers is due to the 68080 being a crusade by one or two CPU designers who appear to have little or no contact with software developers. Normally when CPU design takes place, it happens in coordination with compiler back-end developers, so that toolchains are updated along with the hardware. But in the case of the 68080, there is no such cooperation.

Bebbo's GCC fork might be a possible, and IMO viable, solution for introducing better support for the 68K family, and in the long run for the 68080 (once the ISA is stabilized, obviously).

I don't know how huge the patch is, but if it isn't merged into the official master repository, then it'll be hard to follow the GCC evolution and maintain the patch itself.

@OlafS25, @Overflow, @wawa. I quote nothing, but I substantially agree with many things expressed in your last messages.

cdimauro 
Re: 68k Developement
Posted on 20-Sep-2018 21:00:55
#327

@Hypex Quote:

Hypex wrote:
@cdimauro
Quote:
Exactly. There's a paper which talks about the evolution of the PowerPC SIMD units and which explains why, with VSX, they decided to "unify" the FPU and Altivec registers to provide a uniform, 64-entry register set with the last SIMD extension:

Read some of that and can understand the move. Sounds better than doing it the Intel way and hacking vectors into the FPU registers. Though vectors have been bolted onto both ISAs. Interesting how they can maintain backwards compatibility.

In reality it looks very similar to Intel's MMX approach. VMX2 is different primarily because it ALSO uses the existing Altivec registers.
Quote:
Quote:
What's very important is also the total number of registers, for keeping more data on registers (loops unrolling included).


And with a total register count you bring in another hard limit.
Quote:
The problem comes when/if you want both...

In this case no one can have their cake and eat it. Large data width at the expense of available registers. Or large register count at the expense of available width. A compromise between count and width. There usually is a compromise when making design decisions.

It depends on how much you want to spend / consume of the opcode space; so, basically, how large the opcodes for the SIMD instructions will be.

With AVX-512 Intel used 6 bytes minimum (1 byte is completely lost due to the reuse of the BOUND opcode space, so essentially only 5 bytes contribute to the definition of the instructions), and it offers 32 registers (and it's a CISC ISA!), 8 mask registers, vector sizes from 128 to 512 bits (but there's free encoding space for 1024), 3 (sometimes 4) operands, some flags to alter the behavior of the instructions (e.g. zeroing or merging, broadcasting of values, rounding, etc.), and up to 1024 instructions. There's really plenty of stuff packed in, and that's why you need at least 6 bytes.

If you lower the register count to 16 and fix the vector registers to one specific size, you can recover 5 bits. Removing the mask support recovers another 4 bits (3 for the mask and 1 for merge/zero). So, you can think about using 32 bits for SIMD instructions, albeit with the 68K it is a bit more difficult due to the presence of the EA field (which requires at least 6 bits, plus an additional one since a SIMD register requires 4 bits, and not 3 like the usual data/address registers) and to the limited space in the opcode table. I don't know how many features you can cut in order to stay within 32-bit base opcodes, but cutting too much can lead to a RISC design, which means throwing away CISC advantages (e.g. fused load+op). Anyway, see below.
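The bit-recovery arithmetic above can be tallied mechanically. Here is a rough sketch; the field widths are simplified assumptions (three register operands, a vector-length field, optional masks), not the actual EVEX layout, and `simd_encoding_bits` is a hypothetical helper:

```python
import math

# Rough per-instruction bit budget for an AVX-512-style SIMD encoding.
# Simplified model: three register operands, a vector-length field, masks.
def simd_encoding_bits(num_regs=32, num_vector_sizes=4, masks=True, operands=3):
    bits = operands * math.ceil(math.log2(num_regs))  # 5 bits/reg for 32 regs
    bits += math.ceil(math.log2(num_vector_sizes))    # 128/256/512(/1024) lengths
    if masks:
        bits += 3 + 1                                 # 8 mask regs + merge/zero flag
    return bits

full = simd_encoding_bits()                                             # 21 bits
cut = simd_encoding_bits(num_regs=16, num_vector_sizes=1, masks=False)  # 12 bits
print(full - cut)  # 9 bits recovered: 5 (16 regs + fixed size) + 4 (no masks)
```

The 9-bit saving matches the post's count: 5 bits from halving the register count and fixing the vector size, plus 4 bits from dropping mask support.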

Bigger opcodes (coming back to AVX-512 and similar solutions) mean worse code density, but in SIMD-intensive code that might not be so important, because there the bottleneck is the amount of data being "crunched", and consequently the bandwidth consumed. SIMD (and most FPU) code is usually more "linear" (e.g. not as branch-intensive as "integer/scalar" code can be) and busy looping, so the bandwidth wasted on instruction fetching is negligible/unimportant compared to the rest.

To sum it up, maybe a SIMD unit with bigger opcodes can be suitable for a CISC design: the advantages gained can overcome the reduced code density. That's the way I finally decided to follow with my ISA, which is an AVX-512 superset and offers 64 SIMD registers natively, albeit a low-end design (for the embedded market) with 16 registers still uses the big opcodes (losing both code density and the advantage of a large number of registers).
Quote:
Quote:
No. Since Motorola systematically changed the MMU for its processors, we can think about completely dropping the MMU instructions mapped on the F-line, and introduce some new MMU instructions (in some 32-bit opcode space) to deal with the MMU, like it happens on all architectures which have no coprocessor support.

Looks like they dropped the F-bomb there. Well, the F-line was more suitable for the FPU, since F stood for the FPU when there was an FPU.

On my 68K SIMD extension proposal line-F was/is still used for scalar (non-packed/vector) operations.
Quote:
Quote:
However each segment had single-byte granularity and could be transparently extended (up to the maximum limit: 64KB). Very nice features.

Okay, so a bit easier for small data model code? And making large data code more manageable?

It's better only because you have more segments (in reality they are called selectors, in protected mode) available, but still difficult to use for large data models.
Quote:
Quote:
We can say that it came from the 80386. However this processor is segmented, exactly like the 80286. The only difference is that every segment is at most 4GB in size (instead of 64KB).

That's massive. I wonder how they did that. Going from 8K to 64K and extending that to 4GB seems no mean feat. I mean, you've got what, 13 bits grown to 16, then extended to 32.

Yes, that was the solution: "just" extending the offsets from 16 to 32 bits, while keeping the selector model. In fact in the largest data model you have 48-bit pointers (16 bits for the selector, and 32 bits for the offset). But, as you can imagine, it's not practical.
Quote:
Quote:
Many OSes decided to ignore / not use segmentation because 4GB was enough for a segment, so basically flattening everything.

Nice trick. Makes sense. Until 64-bit comes along but then I bet the old trick comes back into play.

64-bit (long) mode still uses segmentation.
Quote:
Quote:
No, AMMX is very different from MMX. It only uses some MMX concepts (e.g.: integer operations, and not having a separate register set).

It would help if Gunnar didn't recommend reading a technical guide on Intel MMX if AMMX is very different. Looks as if he is really trying to convert the last of the 68K Amigans to Intel. Right before taking their 68K souls and sacrificing them to the Intel gods.

No, MMX is just a marketing label for the Vampire's AMMX. I don't know about you, but MMX sounds like a very cool thing, thanks to the marketing campaign that Intel ran when it introduced this (very important) ISA extension. So, reusing it even on a completely different project looks cool to people.
Quote:
Quote:
No, you can also use "odd" values for depths: 1, 2, 3, 5, 6, 7. Even with such "strange" depth values, packed/chunky modes were/are much easier to handle (and with better performance & efficiency) than bitplanes.

Well you could, but it doesn't pack as well as multiples of 2. They would likely need to be right-aligned to the closest nibble.

It's not really important. Do you care about nibble alignment with bitplanes? No. And you don't have to care about strange packed depths, like using 3 bits per CLUT entry.
Quote:
But a 1-bit depth is an interesting example, since at 1 bit packed and planar are exactly the same. So with that in mind, how does a 1-bit packed/chunky mode handle better than a 1-bit planar mode?

This is the only point of contact where they are the same, with the same pros and cons.
Quote:
Quote:
Bitplanes are the exact contrary: they are memory (space, bandwidth) inefficient. See above, and think about it.

I've thought about it. At least with an interleaved bitmap, a chunky and planar mode will take up exactly the same amount of memory, at the same 2 multiple bit depth. (1,2,4,8.) With each whole line taking up the same amount of space.

With other depths like 3,5,6 or 7 chunky would use more memory overall.

That leaves the hardware to read from each plane separately, which would incur a penalty. But given that hardware was slower back then, it seems a funny decision to make.

No, even interleaved bitmaps are very different and absolutely not as efficient as packed modes using the same depth.

Think about it.

On amigacoding.de (which unfortunately is down) I've clearly shown and proved it. Gunnar, like you, was sceptical and thought the opposite, but after my demonstration he changed his mind. Math cannot be faked.
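For readers who want to redo the math themselves, here is a minimal sketch of the line-size arithmetic (my own illustrative model, not the amigacoding.de demonstration): for a 320-pixel line, tightly packed pixels and bitplanes use the same raw storage at equal depth; packed only "wastes" space if each pixel is padded up to a power-of-two width, and the efficiency differences discussed above are really about access patterns (reading one planar pixel touches every plane) rather than raw bytes.

```python
import math

WIDTH = 320  # pixels per line (illustrative)

def planar_line_bytes(depth):
    # One 16-bit-word-aligned bitplane per bit of depth (Amiga style).
    return depth * math.ceil(WIDTH / 16) * 2

def packed_line_bytes(depth, pad_to_pow2=False):
    # Pixels stored back to back; optionally padded to a power-of-two width.
    bits = 1 << math.ceil(math.log2(depth)) if pad_to_pow2 else depth
    return math.ceil(WIDTH * bits / 8)

for d in (1, 3, 5, 8):
    print(d, planar_line_bytes(d), packed_line_bytes(d), packed_line_bytes(d, True))
# e.g. depth 3: planar 120 bytes, packed 120 bytes, nibble-padded packed 160 bytes
```

So raw sizes match at equal depth; the nibble-padded column is the case Hypex has in mind, while a single packed-pixel fetch versus `depth` separate plane fetches is the bandwidth point cdimauro is making.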
Quote:
Quote:
You can do it with 2 packed pointers as well. :-8

Maybe, but the dual playfield mode on the Amiga relies on bitplanes to function. As well as other parallax layers and effects. Any disputes over that you can take up with saimo.

No need to dispute. The dual playfield mode relies on bitplanes because on Amiga you only have them and you can use nothing else.

But to do exactly the same (and much better IMO) with packed graphic, you just need 2 packed "planes" and define the proper depth for each one. It's even more flexible, because you might define the background plane with more bits/depth, and the foreground one less, just to give an example.

The problem is that we, as Amiga coders, have grown up with the bitplane model in mind, and have only seen packed/chunky graphics as a single plane with only power-of-2 depths. If you remove this preconception, you'll see the benefits of packed graphics in a more general way.
Quote:
Quote:
Packed/chunky are much better. And they don't require a mask for cookie-cutting operations.

They require some form of masking out. But what about a 1 bit chunky bitmap?

With packed graphics you can still use 1-bit "packed" bitmaps for masks: nobody stops you from using them.

However it's much more efficient to auto-generate (!) the mask by looking at the color: 0 -> use the background graphic; all other values -> use this color. Plain simple and ultra-efficient compared to the ENORMOUS waste of reusing the mask for EVERY bitplane on the Amiga when you have to cookie-cut graphics.
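The cookie-cut point can be sketched in a few lines (a pure-Python model with made-up names, one list element per pixel or per plane word): with chunky pixels, colour 0 acts as an implicit mask in a single pass, while a planar blit repeats the mask AND/OR work once per bitplane.

```python
def chunky_blit(dest, src, transparent=0):
    # One pass over the pixels: keep dest wherever src is transparent.
    return [s if s != transparent else d for d, s in zip(dest, src)]

def planar_blit(dest_planes, src_planes, mask):
    # The same 1-bit mask is applied to EVERY plane: depth times the work.
    return [[(d & ~m) | (s & m) for d, s, m in zip(dp, sp, mask)]
            for dp, sp in zip(dest_planes, src_planes)]

print(chunky_blit([7, 7, 7, 7], [0, 3, 0, 5]))  # [7, 3, 7, 5]
```

The chunky path touches each pixel once; the planar path loops over all `depth` planes with the same mask, which is the repeated per-bitplane masking cost described above.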
Quote:
Quote:
Or having smaller objects to draw. Think about gun bullets: they have a small width, but on the Amiga they needed to be 16 pixels wide...

Which is where those 68K ANDing and ORing instructions come in handy. Or using the blitter.

Whatever solution you use, either CPU or Blitter, you'll end up wasting space and bandwidth. Do your math.
Quote:
Anyway, the decision was made; for better or worse the Amiga became famous or infamous for those damned bitplanes. They didn't update it. We just got a load of "fake" RTG cards with packed modes. So it stuck.

Yup. And that was really sad...

wawa 
Re: 68k Developement
Posted on 20-Sep-2018 21:02:44
#328

@cdimauro

Quote:
I don't know how huge is the patch, but if it isn't merged into the official master repository, then it'll be hard to follow up the GCC evolution, and maintain the patch itself.

That's the point in AROS, as I understand it.

cdimauro 
Re: 68k Developement
Posted on 20-Sep-2018 21:04:43
#329

@OlafS25

Quote:

OlafS25 wrote:
@kolla

that is correct...

Gunnar had a clear vision regarding hardware but not many ideas regarding software.

I don't want to open another discussion, but I beg to differ here too. Maybe you don't remember what was written on your forum, but Gunnar changed his mind on some important things.

The latest is about hitting the hardware.

NOW, setting up an RTG graphics screen on the Vampire is NOT recommended that way; they (Flype did, recently) really recommend using the OS.

/OT

cdimauro 
Re: 68k Developement
Posted on 20-Sep-2018 21:36:08
#330

@matthey Quote:

matthey wrote:
If mapping 8 more data registers using the free An in most EAs, then both source and data would be wanted. With a 68k 64 bit mode, this can be done more cleanly but there are a few instructions using the address registers which need new versions. The non-EA register fields are only 3 bits (0-7) meaning using two d8-d15 registers in many common instructions (and sometimes one with for example SUB) is not possible and it would be unclear when a prefix or longer instruction would be necessary for these cases. IMO, it would be uglier (and decrease code density) than opening up An register sources which reduces the number of instructions, improves code density, improves orthogonality and can be done practically for free. IMO, if you want more registers, the prefix method by itself is cleaner although still ugly. The advantage here is being able to use register ports which are the same for both 32 and 64 bit mode. Otherwise, instruction level compatibility would be enough and a 64 bit SuperCISC ISA could be made which would map most 68k and x86/x86_64 instructions.

Frankly speaking, I now don't know what you really want with the new ISA design. I'm not kidding you: I want to better understand your vision / goal. So, some questions.

Do you want 64-bit support?
How much 68K and/or x86/x64 instruction compatibility do you want to keep? In other words, how much can be thrown out?
Which register set is OK for you: the current 8 (data) + 8 (address), 16 + 16, 16 + 8, 32 + 16?
Is code density a dogma, or is a slight decrease acceptable?
Is binary compatibility required, or is source (assembly) compatibility fine? 100% or less?
Do you want to keep the more complex addressing modes (e.g.: double memory indirect)?
Do you want to keep 68K compatibility (e.g.: the processor has a 68K compatibility mode), or are only new (even 32-bit) modes available (e.g.: the new ISA is only 68K and/or x86/x64 inspired)? Or something like ARM, with ARM32 + Thumb-2 (like a 32-bit redesign) + ARM64 modes usable?
Are opcodes little or big endian?
Is data access little endian, big endian, or selectable?
How long can an instruction be? >16 bytes?
Which market segments should be targeted?
Are prefix(es) OK?
Is instruction decoding changeable at runtime (by the kernel? Or even by an application?)

In other words, and to simplify: do you (already) have an idea of what you want to achieve with the new processor (and the new ISA)?

I can say that I prefer to write new ISAs which are only INSPIRED by existing ones. I think that it's enough to take only the good parts/ideas while creating something new. If it's 100% assembly compatible, that's OK, but it's not strictly necessary (for my NEx64T it was my goal, because I wanted to make it really easy to port software, and I was lucky to be able to achieve it, although I had to make several compromises).

So, what does your post-68K/inspired new processor/ISA look like?
Quote:
You sound more optimistic than the comp.arch guys.

Yet another speculation related security vulnerability
https://groups.google.com/forum/#!topic/comp.arch/QJjY8bX9qM0

Please read ALL comments.

Speculative execution is here to stay. Mitigated, certainly, but it'll not be abandoned.

EDIT: added other questions. More may come.

Last edited by cdimauro on 20-Sep-2018 at 10:52 PM.

cdimauro 
Re: 68k Developement
Posted on 20-Sep-2018 21:37:14
#331 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@wawa Quote:

wawa wrote:
@cdimauro

Quote:
I don't know how huge is the patch, but if it isn't merged into the official master repository, then it'll be hard to follow up the GCC evolution, and maintain the patch itself.

thats the point in aros as i understand it.

Yes, but what does Bebbo want to do? Continue keeping his huge patch private? It's not a good idea.

wawa 
Re: 68k Developement
Posted on 20-Sep-2018 21:38:44
#332 ]
Elite Member
Joined: 21-Jan-2008
Posts: 6259
From: Unknown

@cdimauro

bebbo's is a different approach. good that there is an alternative.

cdimauro 
Re: 68k Developement
Posted on 20-Sep-2018 22:55:18
#333 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@wawa: absolutely nothing against alternatives.

I was wondering why he continues to keep his branch private, instead of merging it into the master one.

Merging it would also make life easier for people who want to define enhancements to the 68K ISA, or to define 68K-inspired new ISAs, because this way it would be possible to compare binaries against other ISAs using a common set of applications/code.

matthey 
Re: 68k Developement
Posted on 21-Sep-2018 2:13:12
#334 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2000
From: Kansas

Quote:

cdimauro wrote:
Atoms were 64-bit from the first day, albeit long mode was disabled on some of the first products.


What good is 64 bit if the 64 bit registers and instructions can't be used?

Quote:

BTW, adding 64-bit to the x86 ISA took around 5% more transistors according to AMD.


After using tens of millions of transistors for caches, OoO, etc., I suppose that may be true (including quadruple size integer register file?). It is kind of like the x86 decoder cost in transistors which used to be significant but doesn't matter much on modern high performance x86_64 CPUs. It does matter for an energy efficient mid performance CPU design. I expect the 68060 would need closer to 50% more transistors for 64 bit. It is usually a good idea to double the DCache size for the wider pointers at the same time.

Quote:

What do you mean with this, the instruction SIZE change (e.g.: default is 32 bit data but with the proper prefix you can select 64-bit or 16 bit)?


Length-changing prefixes
The instruction length decoder has a problem with certain prefixes that change the meaning of the subsequent opcode bytes in such a way that the length of the instruction is changed. This is known as length-changing prefixes.

For example, the instruction MOV AX,1 has an operand size prefix (66H) in 32-bit and 64-bit mode. The same code without operand size prefix would mean MOV EAX,1. The MOV AX,1 instruction has 2 bytes of immediate data to represent the 16-bit value 1, while MOV EAX,1 has 4 bytes of immediate data to represent the 32-bit value 1. The operand size prefix therefore changes the length of the rest of the instruction. The predecoders are unable to resolve this problem in a single clock cycle. It takes 6 clock cycles to recover from this error. It is therefore very important to avoid such length-changing prefixes. The Intel documents say that the penalty for a length-changing prefix is increased to 11 clock cycles if
the instruction crosses a 16-bytes boundary, but I cannot confirm this. My measurements show a penalty of 6 clock cycles in this case as well. The penalty may be less than 6 clock cycles if there are more than 4 instructions in a 16-bytes block.
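The effect described above is easy to see in the raw bytes. The encodings below are standard IA-32 (B8 is MOV eAX,imm; 66h is the operand-size prefix); the helper function is just for illustration:

```python
# The 66h operand-size prefix changes how many immediate bytes follow
# the B8 (MOV eAX, imm) opcode -- so the predecoder cannot know the
# total instruction length without first inspecting the prefix.

def mov_ax_imm(value, bits):
    """Encode MOV AX,imm / MOV EAX,imm for 32-bit mode (illustrative)."""
    if bits == 16:
        # 66h prefix, then only a 2-byte immediate
        return bytes([0x66, 0xB8]) + value.to_bytes(2, "little")
    # no prefix, 4-byte immediate
    return bytes([0xB8]) + value.to_bytes(4, "little")

mov_eax_1 = mov_ax_imm(1, 32)   # B8 01 00 00 00  -> 5 bytes
mov_ax_1  = mov_ax_imm(1, 16)   # 66 B8 01 00     -> 4 bytes

print(mov_eax_1.hex(" "), len(mov_eax_1))
print(mov_ax_1.hex(" "), len(mov_ax_1))
```

The single prefix byte shrinks the immediate from 4 bytes to 2, so the instruction length changes in a way the length decoder cannot resolve in one pass, which is where the quoted 6-cycle penalty comes from.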

Quote:

You can solve those problems using CPU affinity settings, if it's really important. Otherwise the o.s. scheduler should do a good job trying to spread the processes over the physical cores instead of just looking at the available hardware threads.

Anyway, hardware multi-threading can also improve single core/thread performances (!) in some scenarios. I've some idea about it.


If shared data from another program is in the caches, then it is possible for multi-threading to give a performance improvement. This is not uncommon when the threads are related as is common with Linux/BSD threads. It is less common for random processes to have shared data and more common for them to compete for per core resources instead.

I looked at benchmarks of modern games before deciding to buy my 4 core Intel i5 *without* multi-threading. There were more benchmarks where performance was reduced for games than helped. Few of the games were using more than 4 cores but the multi-threading setups were having trouble deciding which cores to place processes on. It is not enough to just look at the process priority and I don't like the idea of complex scheduling algorithms for multi-threading. I would rather not mess with CPU affinity settings or temporarily turn off all multi-threading. I would rather have more cores with no multi-threading.

Quote:

Not so well if you consider that you can only issue 1 FPU instruction per clock cycle, since the minimum size is 4 bytes for them, and the instruction window width is just 4 bytes for the 68060.


The large FPU instructions have multi-cycle latencies so execution rarely outpaces instruction fetch even though the fetch is relatively small. I suspect the 8kiB DCache size is more of a bottleneck for FPU code in programs with large data sets (like Quake or Lightwave).

Quote:

I don't know them. Do you have some data?


No, just hearsay from old posts. I'm kind of surprised the 68060 ever outperformed the Pentium considering how much better compiler support was for the Pentium. As I recall, Lightwave was compiled with SAS/C, which never had good support for the 68060. The code still uses many 6888x instructions which are trapped, and does no FPU instruction scheduling. I considered optimizing it for the 68060 (I had improved ADis enough that it was doing a good job of disassembling it). Back in the day, people probably would have paid for a patch which was 20% faster (gains probably limited by the DCache size again, as I suspected when optimizing Quake code).

Quote:

I see. But IMO the 4 bytes/cycle is still too small, considering that the smallest 68K instruction is 2 bytes and the 68060 can execute two. It means that the above pairing works with longer instructions only at the expense of wasting cycles executing only 1 or even no instructions at all.


Yes, a cycle is lost here and there. There are several cases where the low fetch rate would hurt performance: large instructions after a pipeline flush (for example from a mis-predicted branch) as well. I doubt any other CPU of that time at the 68060's level of performance had only a 4 byte/cycle fetch. RISC CPUs could not have been superscalar with a 4 byte/cycle fetch, and at least the later Pentiums fetched 16 bytes/cycle (they need a larger fetch with byte-aligned code). I expect you are overestimating the bottleneck though. Motorola probably knew how much of a bottleneck it was and considered it to be in an acceptable range.
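A toy model makes the fetch bottleneck visible. This is a deliberate simplification (it ignores branches, alignment, and the real 68060 instruction buffer depth), but it shows how instruction size interacts with a 4-byte/cycle fetch feeding a 2-way issue core:

```python
# Toy model: each cycle, fetch 4 bytes into a buffer, then issue up
# to two instructions whose bytes are fully buffered. Returns the
# sustained instructions-per-cycle over many cycles.

def ipc(instr_sizes, cycles=1000):
    buffered = 0   # bytes available in the fetch buffer (unbounded here)
    done = 0       # instructions completed
    idx = 0
    n = len(instr_sizes)
    for _ in range(cycles):
        buffered += 4                      # 4 bytes fetched per cycle
        for _ in range(2):                 # dual issue at best
            size = instr_sizes[idx % n]
            if buffered >= size:
                buffered -= size
                done += 1
                idx += 1
            else:
                break                      # starved: wait for more bytes
    return done / cycles

print(ipc([2]))      # all 2-byte ops: sustains 2 instructions/cycle
print(ipc([4]))      # all 4-byte (FPU-sized) ops: only 1/cycle
print(ipc([2, 6]))   # mixed 2+6 byte stream: still fetch-limited to 1/cycle
```

With an average instruction size above 2 bytes the 4-byte fetch, not the dual-issue execute, becomes the limit, which matches the point about large FPU instructions above.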

Quote:

Consider that the Pentium debuted 1 year before the 68060, and it was already introduced at 60 and 66 MHz. A year later, when the 68060 arrived, the Pentium was already running at up to 120 MHz.

And if we consider that the Pentium design had only a 5 stage pipeline, it's even stranger (or sad)...


We were fortunate the 68060 was not cancelled with Motorola's decision to use PPC for high performance CPUs. It was probably the embedded market which saved the 68060.

1996 32-bit embedded CPU market, figures in millions of units:

68xxx: 53.6
MIPS: 19
SuperH: 18
x86: 15
i960: 6.2
ARM: 4.2
AMD 29k: 2.1
ColdFire: 1
SPARC: 0.9
PowerPC: 0.5

Motorola was the ARM Holdings of the 1990s, until they shoved PPC down developers' throats and never marketed a 68k CPU with higher performance than the 68060. Since then, Motorola/Freescale/NXP almost went bankrupt and was bought by a foreign company, before finally abandoning PPC and licensing ARM technology from the competitor they gave their biggest market to. Embedded developers hated PPC, liked ARM, liked ColdFire and loved the 68k, but big companies won't give their customers what they want even when they already have a brilliant design like the 68060. That was the saddest part.

Quote:

That's true, but once you define a SIMD unit with "mid" target, then you have put a very big constraint to support more high-end market segment, and I don't think that you want to define two different SIMDs for different markets, right?

Think about Vampire's 68080: it has a very bad, crippled SIMD design (AMMX) which will prevent consistent enhancements.

Do you want a similar future for your 68K SIMD extension? It's better to think carefully before taking so much important decisions.


I would rather have standardization than target the highest performance SIMD extensions and features (competing with high-end x86_64 CPU performance is folly). Low end CPUs would have AMMX or, more likely, no SIMD unit. The target is mid performance with some compromises, as SIMD units don't scale well.

Quote:

I don't want to convince you about the choices for your 68K extension. I'm also biased because of my conflict of interest, but at least I can expose my ideas/opinions about technical facts in a professional way. Then you are smart enough (no joking: I'm quite serious) to take your time, evaluate the whole picture, and make your decision for your project.

I can only say that I fully understand your concerns: designing an ISA isn't a simple exercise where you fill holes in some tables. It took me around 7 years to define and try all the solutions/ideas which came to my mind, looking at statistics and making comparisons with other ISAs. Some decisions were painful but had to be taken.

Anyway, at the very end, an ISA is a big synthesis of many things which sometimes conflict, and you cannot expect it to excel in all possible contexts / scenarios / markets...


A 64 bit 68k ISA allows more freedom, which admittedly takes longer to evaluate. If the 68k is to have a 2nd chance at life after death, I think it would need embedded markets, where reducing the footprint is natural as embedded devices go smaller.

Quote:

Apple with its first 64-bit ARM implementation clearly showed that doubling the number of registers gained a lot of performance (even completely ignoring the FPU tests, which made use of the better SIMD unit, including the hash/crypto extensions), despite the pointer size doubling and the code density becoming worse. So, there's room for improvement here.

However when you define an ISA you have to think about not crippling it.


Thumb2 was mostly used before AArch64 and it doesn't even have 16 registers. It has more memory traffic than the 68k even though it is good for a RISC CPU with so few registers. I haven't seen any big performance advantage of AArch64 without SIMD use on mid performance CPUs like the Cortex-A53.

Quote:

RISC-V is a clear example: this ISA was defined from the beginning to address (almost) all possible market segments, from the low-end embedded to the HPC. In fact, it supports both a reduced ("cut") ISA with only 16 registers (instead of the standard 32), and an upcoming size-agnostic vector ISA. But it was an easy game for the RISC-V designers: they had to support absolutely NO legacy inheritance / constraints.


RISC-V has too many ISA variations and not enough standardization. It will have some embedded wins where this doesn't matter and customization is more important but I don't see it threatening ARM's AArch64 in the lucrative mid-performance CPU market.

Quote:

Another one is my ISA, NEx64T: I've defined it from the very beginning (and independently ) with similar goals. My ISA is natively 64-bit, with 32 GPRs, 64 SIMD registers (from 128 to 1024 bit), and 8 mask registers (plus the infamous x87 FPU with its registers, added only for legacy reasons); it has 32 and 64 bits mode, but this only changes a few opcodes.
Even with this huge design (there's a lot of stuff), it can be easily "cut", going down to a 32-bit only mode with 16 GPRs, and going up with an option to have 128 SIMD registers and 16 mask registers (with longer opcodes, of course) plus the possibility to remove some legacy stuff to introduce more modern features (e.g.: removing MMX support allows to introduce a size-agnostic vector unit, using exactly the same opcode structure for the standard/legacy SIMD).
As you can see, the ISA is very flexible, despite carrying a big burden: being 100% x86 and x64 assembly-level compatible (with a notable difference: the 64-bit mode allows executing all the legacy instructions which AMD removed with x64), with all the consequences that this implies (supporting segmentation, very odd instructions, the x87 FPU, MMX, etc.).
As I said before, designing an ISA is a big compromise of many things, and for mine I wanted full source assembly compatibility, because software is the most important thing, and a simple recompilation can bring TONs of applications even if they have assembly parts.


Too bad there is no way to correctly disassemble x86/x86_64 code and convert it to a new encoding (it's almost like a security measure). The 68k could do it most of the time if there was an ISA which had all the replacement instructions and addressing modes.

Quote:

But it also misses ternary instructions, which can save some register and/or reduce the number of instructions and/or improve the code density. For example, in the LZ77 source code that you've sent to Vince, you had to use a (data) register putting a fixed value because the shift operation is limited to max 8 as immediate value in the 68K.


My ISA documents a 32 bit encoding for shifts greater than 8 which can also access memory at the same time. It can save a register and an instruction but rarely improves code density. Shifts of less than 9 would be peephole optimized into the old immediate forms (like quick forms of instructions).
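The peephole choice described above can be sketched as follows. The encoding sizes are illustrative assumptions (the classic 2-byte quick form vs. a proposed 4-byte long form), not a finalized encoding:

```python
# Sketch of the peephole rule: shifts by 1..8 keep the classic 2-byte
# quick-immediate 68k form (LSL/ASL #n,Dn); larger counts use the
# proposed 32-bit encoding instead of burning a data register on the
# count (e.g. MOVEQ #16,D1 + LSL.L D1,D0: 4 bytes AND a register).
# Sizes here are assumptions for illustration only.

def shift_encoding(count):
    """Return (form, size_in_bytes) for an immediate shift count."""
    if 1 <= count <= 8:
        return ("quick-immediate", 2)   # classic 68k form
    if 9 <= count <= 63:
        return ("long-immediate", 4)    # hypothetical 32-bit encoding
    raise ValueError("shift count out of range")

print(shift_encoding(8))    # ('quick-immediate', 2)
print(shift_encoding(16))   # ('long-immediate', 4)
```

This matches the point above: the long form costs no register and no extra instruction, but at 4 bytes it is no denser than the old MOVEQ+shift pair, so code density rarely improves.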

Quote:

Since you posted your 68020, I do the same for my NEx64T ISA:

NEx64T 32-bit version:
mov eax,[esp+16] ; load d
imul eax,[esp+20] ; d = d * e
mov edx,[esp+12] ; load c
sub edx,eax ; c = c - d * e
mov eax,[esp+4] ; load a
sub eax,[esp+8] ; a = a - b
idiv eax,edx ; a = (a - b) / (c - d * e)
mov [esp+24],eax ; store a

instructions: 8
code size: 176 bits
memory traffic: 368 bits (176 + 6 * 4 * 8 bits of load/store operations)
registers used: 3

NEx64T 64-bit version:
mov rax,[rsp+32] ; load d
imul rax,[rsp+40] ; d = d * e
mov rdx,[rsp+24] ; load c
sub rdx,rax ; c = c - d * e
mov rax,[rsp+8] ; load a
sub rax,[rsp+16] ; a = a - b
idiv rax,rdx ; a = (a - b) / (c - d * e)
mov [rsp+48],rax ; store a

instructions: 8
code size: 176 bits
memory traffic: 560 bits (176 + 6 * 8 * 8 bits of load/store operations)
registers used: 3

So, a bit less efficient compared to your 68020 version, but pretty close (albeit directly referencing values into the stack).

I might have used the post-decrement addressing mode (I've all 4 of them for my ISA: pre/post inc/decrement), but paying an additional price in terms of code density, since it requires a longer encoding.


The code is still as ugly as any x86/x86_64 code but not a bad result. You were only outperformed by the 68020 ISA from 1984.
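The code-size and memory-traffic figures in the two listings above can be sanity-checked with quick arithmetic (8 instructions totalling 176 bits of code, plus 6 data accesses: 5 memory reads and 1 store):

```python
# Sanity check of the quoted NEx64T code-size / memory-traffic numbers.

code_bits = 176        # 8 instructions, 176 bits of code in both listings
accesses = 6           # 5 loads + 1 store per listing

traffic_32 = code_bits + accesses * 4 * 8   # 32-bit version: 4-byte operands
traffic_64 = code_bits + accesses * 8 * 8   # 64-bit version: 8-byte operands

print(traffic_32)   # 368
print(traffic_64)   # 560
```

The code size is identical in both modes; only the doubled operand size accounts for the 64-bit version's extra memory traffic.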

Quote:

I can reveal my finding here, about my ISA: in 64-bit mode it has around 20% better code density than x64.


That makes it the best code density 64 bit ISA I can think of and almost enough to reduce the ICaches by half.

wawa 
Re: 68k Developement
Posted on 21-Sep-2018 9:55:46
#335 ]
Elite Member
Joined: 21-Jan-2008
Posts: 6259
From: Unknown

@cdimauro

Quote:
I was wondering why he continues to keep his branch private, instead of merging it into the master one.


i dont know, but i dont think so. too many changes. the last time i remember anyone reporting anything upstream was kalamatee, on a compilation fault concerning m68k gcc-8.1.0.

Last edited by wawa on 21-Sep-2018 at 09:56 AM.

cdimauro 
Re: 68k Developement
Posted on 22-Sep-2018 7:53:58
#336 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:

matthey wrote:
Quote:
cdimauro wrote:
Atoms were 64-bit from the first day, albeit long mode was disabled on some of the first products.

What good is 64 bit if the 64 bit registers and instructions can't be used?

Testing, my friend. Introducing features in processors without publicizing them is very common, and it happened even on very old CPUs like the glorious 6502 and Z80. Sometimes the tests reveal that such features are too buggy, so they are not enabled in the final/official release.
Quote:
Quote:
BTW, adding 64-bit to the x86 ISA took around 5% more transistors according to AMD.

After using tens of millions of transistors for caches, OoO, etc., I suppose that may be true (including quadruple size integer register file?).

That's what I recall. If you consider that the reference was an Athlon XP with 37 million transistors, 5% is 1.85 million, which is basically nothing.
Quote:
It is kind of like the x86 decoder cost in transistors which used to be significant but doesn't matter much on modern high performance x86_64 CPUs.

x86 decoders were very expensive in the earlier superscalar designs. According to Patterson (if I recall the source correctly), decoders took 30% of the transistors on the Pentium, and a whopping 40% on the PentiumPro. However, the number of transistors used for decoders tends to stabilize over time, because the technology in this field is quite consolidated. For several years now the decoders haven't been a big piece of the cake.
Quote:
It does matter for an energy efficient mid performance CPU design. I expect the 68060 would need closer to 50% more transistors for 64 bit.

Why? If it's only about doubling the size of the GP registers, it should be a very small change (assuming that a 64-bit mode is introduced which looks very similar to the current mode).
Quote:
It is usually a good idea to double the DCache size for the wider pointers at the same time.

Not so good IMO. Pointers are doubled in size, but not all other data structures, which remain the same.
Quote:
Quote:
What do you mean with this, the instruction SIZE change (e.g.: default is 32 bit data but with the proper prefix you can select 64-bit or 16 bit)?

Length-changing prefixes
The instruction length decoder has a problem with certain prefixes that change the meaning of the subsequent opcode bytes in such a way that the length of the instruction is changed. This is known as length-changing prefixes.

For example, the instruction MOV AX,1 has an operand size prefix (66H) in 32-bit and 64-bit mode. The same code without operand size prefix would mean MOV EAX,1. The MOV AX,1 instruction has 2 bytes of immediate data to represent the 16-bit value 1, while MOV EAX,1 has 4 bytes of immediate data to represent the 32-bit value 1. The operand size prefix therefore changes the length of the rest of the instruction. The predecoders are unable to resolve this problem in a single clock cycle. It takes 6 clock cycles to recover from this error. It is therefore very important to avoid such length-changing prefixes. The Intel documents say that the penalty for a length-changing prefix is increased to 11 clock cycles if
the instruction crosses a 16-bytes boundary, but I cannot confirm this. My measurements show a penalty of 6 clock cycles in this case as well. The penalty may be less than 6 clock cycles if there are more than 4 instructions in a 16-bytes block.

OK, got it. But it's a very rare situation, because this usually happens only if you use 16-bit operands (with immediate data, in this specific case), which is quite rare in 32-bit code and almost non-existent in 64-bit code.

At least looking at the statistics which I've collected disassembling some x86 and x64 applications.
Quote:
Quote:
You can solve those problems using CPU affinity settings, if it's really important. Otherwise the o.s. scheduler should do a good job trying to spread the processes over the physical cores instead of just looking at the available hardware threads.

Anyway, hardware multi-threading can also improve single core/thread performances (!) in some scenarios. I've some idea about it.

If shared data from another program is in the caches, then it is possible for multi-threading to give a performance improvement. This is not uncommon when the threads are related as is common with Linux/BSD threads. It is less common for random processes to have shared data and more common for them to compete for per core resources instead.

Multithreading is also common in Windows applications; maybe even more common, because Windows has had a processes & threads model for a long time.

Anyway, I have very different ideas on how to take advantage of hardware threading for single core applications.
Quote:
I looked at benchmarks of modern games before deciding to buy my 4 core Intel i5 *without* multi-threading. There were more benchmarks where performance was reduced for games than helped. Few of the games were using more than 4 cores but the multi-threading setups were having trouble deciding which cores to place processes on. It is not enough to just look at the process priority and I don't like the idea of complex scheduling algorithms for multi-threading. I would rather not mess with CPU affinity settings or temporarily turn off all multi-threading. I would rather have more cores with no multi-threading.

That's very strange, because I see that in most games the processors which show better performance are always the ones with HT enabled. At least on the Intel side, whereas AMD processors were shown to suffer from it, and AMD introduced a "Game mode" for its Ryzen which essentially disables SMT.
Quote:
Quote:
I don't know them. Do you have some data?

No, just hearsay from old posts. I'm kind of surprised the 68060 ever outperformed the Pentium considering how much better compiler support was for the Pentium. As I recall, Lightwave was compiled with SAS/C which never had good support for the 68060. The code is still using many 6888x instructions which are trapped and does no FPU instruction scheduling.

I see. Very bad situation for the 68060 in this case.
Quote:
I considered optimizing it for the 68060 (I had improved ADis enough that it was doing a good job of disassembling it). Back in the day, people probably would have paid for a patch which was 20% faster (gains probably limited by the DCache size again as I suspected when optimizing Quake code).

With that performance improvement, yes: people would have paid good money.
Quote:
Quote:
I see. But IMO the 4 bytes/cycle is still too small, considering that the smallest 68K instruction is 2 bytes and the 68060 can execute two. It means that the above pairing works with longer instructions only at the expense of wasting cycles executing only 1 or even no instructions at all.

Yes, a cycle is lost here and there. There are several cases where the low fetch rate would hurt performance: large instructions after a pipeline flush (for example from a mis-predicted branch) as well. I doubt any other CPU of that time at the 68060's level of performance had only a 4 byte/cycle fetch. RISC CPUs could not have been superscalar with a 4 byte/cycle fetch, and at least the later Pentiums fetched 16 bytes/cycle (they need a larger fetch with byte-aligned code). I expect you are overestimating the bottleneck though. Motorola probably knew how much of a bottleneck it was and considered it to be in an acceptable range.

Well, you should know better than me how many long instructions are found in 68K code, especially FPU code. Recently there was a contest on the Apollo forum about 3D code, and some people dumped assembly code: you can see for yourself how big the opcodes are, and how they can affect a design like the 68060's (where the FPU is not pipelined).
Quote:
Quote:
Consider that the Pentium debuted 1 year before the 68060, and it was already introduced at 60 and 66 MHz. A year later, when the 68060 arrived, the Pentium was already running at up to 120 MHz.

And if we consider that the Pentium design had only a 5 stage pipeline, it's even stranger (or sad)...

We were fortunate the 68060 was not cancelled with Motorola's decision to use PPC for high performance CPUs. It was probably the embedded market which saved the 68060.

1996 32-bit embedded CPU market, figures in millions of units:

68xxx: 53.6
MIPS: 19
SuperH: 18
x86: 15
i960: 6.2
ARM: 4.2
AMD 29k: 2.1
ColdFire: 1
SPARC: 0.9
PowerPC: 0.5

Motorola was the ARM Holdings of the 1990s, until they shoved PPC down developers' throats and never marketed a 68k CPU with higher performance than the 68060. Since then, Motorola/Freescale/NXP almost went bankrupt and was bought by a foreign company, before finally abandoning PPC and licensing ARM technology from the competitor they gave their biggest market to. Embedded developers hated PPC, liked ARM, liked ColdFire and loved the 68k, but big companies won't give their customers what they want even when they already have a brilliant design like the 68060. That was the saddest part.

I agree, and we know it: 68Ks are a piece of cake to code in assembly. Motorola's lack of vision is unbelievable... -_-
Quote:
I would rather have standardization than target the highest performance SIMD extensions and features (competing with high-end x86_64 CPU performance is folly). Low end CPUs would have AMMX or, more likely, no SIMD unit. The target is mid performance with some compromises, as SIMD units don't scale well.
[...]
A 64 bit 68k ISA allows more freedom, which admittedly takes longer to evaluate. If the 68k is to have a 2nd chance at life after death, I think it would need embedded markets, where reducing the footprint is natural as embedded devices go smaller.

OK, so three answers here to my questions above: 64-bit, the embedded market, and a "mid" SIMD unit (with 16 registers, I assume).
Quote:
Quote:
Apple with its first 64-bit ARM implementation clearly showed that doubling the number of registers gained a lot of performance (even completely ignoring the FPU tests, which made use of the better SIMD unit, including the hash/crypto extensions), despite the pointer size doubling and the code density becoming worse. So, there's room for improvement here.

However when you define an ISA you have to think about not crippling it.

Thumb2 was mostly used before AArch64 and it doesn't even have 16 registers.

No, that was Thumb. Thumb-2 allowed the use of ARM 32-bit instructions (so, accessing all 16 registers), but without conditional execution. So you can basically interleave 16-bit and 32-bit instructions, whereas with Thumb you had to switch back and forth between the Thumb and ARM execution modes, which decreased both performance and code density.
Quote:
It has more memory traffic than the 68k even though it is good for a RISC CPU with so few registers.

True. In the end it's still a load/store architecture.
Quote:
I haven't seen any big performance advantage of AArch64 without SIMD use on mid performance CPUs like the Cortex-A53.

The Cortex-A53 is quite old and not that efficient.

ARM offers other mid-range 64-bit designs, but I've no data about them.
Quote:
Quote:
RISC-V is a clear example: this ISA was defined from the beginning to address (almost) all possible market segments, from the low-end embedded to the HPC. In fact, it supports both a reduced ("cut") ISA with only 16 registers (instead of the standard 32), and an upcoming size-agnostic vector ISA. But it was an easy game for the RISC-V designers: they had to support absolutely NO legacy inheritance / constraints.

RISC-V has too many ISA variations and not enough standardization.

They are pushing hard on standardization. The most common and important ISA variants are now set in stone. Basically only the SIMD (vector length-agnostic) extension is still a WIP.
Quote:
It will have some embedded wins where this doesn't matter and customization is more important but I don't see it threatening ARM's AArch64 in the lucrative mid-performance CPU market.

There are around 15 billion ARM cores produced each year.

Western Digital announced last year that it will completely move from ARM to RISC-V for its micro-controllers, and it produces 1-2 billion per year.

Just to give an example. But other big CPU vendors have joined the RISC-V foundation/committee too, so I see a threat to ARM's business here.
Quote:
Too bad there is no way to correctly disassemble x86/x86_64 code and convert it to a new encoding (it's almost like a security measure).

I think you mean a general tool which takes an x86/x64 application and extracts all its instructions (which can then be converted to something else), right?
Quote:
The 68k could do it most of the time if there was an ISA which had all the replacement instructions and addressing modes.

You can sort of emulate complex addressing modes, instructions, and some strange features (like segment overrides on x86/x64; that's what I did with the first 2 versions/iterations of my ISA), but it requires more instructions and hurts both performance and code density. Maybe some ad-hoc new instructions can help with some tasks here.
Quote:
Quote:
But it also misses ternary instructions, which can save some register and/or reduce the number of instructions and/or improve the code density. For example, in the LZ77 source code that you've sent to Vince, you had to use a (data) register putting a fixed value because the shift operation is limited to max 8 as immediate value in the 68K.

My ISA documents a 32 bit encoding for shifts greater than 8 which can also access memory at the same time. It can save a register and an instruction but rarely improves code density. Shifts of less than 9 would be peephole optimized into the old immediate forms (like quick forms of instructions).

That's very good. I have a 32-bit encoding for shift instructions in my ISA, but they operate only on registers (with 3 operands; the last one can be a 6-bit immediate). Shifts on memory are quite rare, so there isn't a bottleneck here.
Quote:
Quote:
Since you posted your 68020, I do the same for my NEx64T ISA:

NEx64T 32-bit version:
mov eax,[esp+16] ; load d
imul eax,[esp+20] ; d = d * e
mov edx,[esp+12] ; load c
sub edx,eax ; c = c - d * e
mov eax,[esp+4] ; load a
sub eax,[esp+8] ; a = a - b
idiv eax,edx ; a = (a - b) / (c - d * e)
mov [esp+24],eax ; store a

instructions: 8
code size: 176 bits
memory traffic: 368 bits (176 + 6 * 4 * 8 bits of load/store operations)
registers used: 3

NEx64T 64-bit version:
mov rax,[rsp+32] ; load d
imul rax,[rsp+40] ; d = d * e
mov rdx,[rsp+24] ; load c
sub rdx,rax ; c = c - d * e
mov rax,[rsp+8] ; load a
sub rax,[rsp+16] ; a = a - b
idiv rax,rdx ; a = (a - b) / (c - d * e)
mov [rsp+48],rax ; store a

instructions: 8
code size: 176 bits
memory traffic: 560 bits (176 + 6 * 8 * 8 bits of load/store operations)
registers used: 3

So, a bit less efficient than your 68020 version, but pretty close (albeit directly referencing values on the stack).

I could have used the post-decrement addressing mode (I have all four of them in my ISA: pre/post increment/decrement), but at an additional price in code density, since it requires a longer encoding.
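As a quick sanity check, the memory-traffic figures quoted for the two NEx64T samples follow from one formula: total traffic = code bits fetched + number of load/store operations times the operand size in bits. A minimal reproduction:

```python
# Reproduce the traffic arithmetic from the NEx64T samples above:
# traffic = code bits + mem_ops * operand_bytes * 8 bits per byte.
def traffic_bits(code_bits, mem_ops, operand_bytes):
    return code_bits + mem_ops * operand_bytes * 8

assert traffic_bits(176, 6, 4) == 368   # 32-bit version: 176 + 6*4*8
assert traffic_bits(176, 6, 8) == 560   # 64-bit version: 176 + 6*8*8
```

The doubling of the operand size from 4 to 8 bytes is what pushes the 64-bit version from 368 to 560 bits while the code size stays at 176 bits.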

The code is still as ugly as any x86/x86_64 code but not a bad result. You were only outperformed by the 68020 ISA from 1984.

Well, that's only because you used the trick of popping the frame content, basically destroying it. It wasn't specified whether the frame content could be destroyed or not, so it's fine to play it this way.

A more polite version would have copied the frame pointer to an address register and then used the trick, but this would have required an extra register and increased the instruction count by one, putting the 68020 version behind mine.

If we consider a more general use case (which I think was the scope of the study), where you have a function which is called with some input and output parameters and does more general work inside, then you cannot use this trick anymore, and accessing the parameters from the frame becomes much more expensive for the 68020.
My ISA, on the other hand, has ad-hoc instructions, addressing modes, and features which are specifically designed to achieve much better code density and performance in these common scenarios, as you can already see from the samples which I've provided.
Quote:
Quote:
I can reveal my finding here, about my ISA: in 64-bit mode it has around 20% better code density than x64.

That makes it the best code density 64 bit ISA I can think of and almost enough to reduce the ICaches by half.

Yes, this preliminary result is quite encouraging, but it's not enough for a serious comparison with all other ISAs. I need a compiler which targets the new ISA and can generate binaries for the most common applications used in such benchmarks (usually SPECint and SPECfp), but this requires a HUGE amount of work and I have no time now. It's even worse because currently I cannot take advantage of many features which I've defined (and which can further improve both code density and performance), since they require proper code generation (which might be complex to implement).

 Status: Offline
Profile     Report this post  
cdimauro 
Re: 68k Developement
Posted on 22-Sep-2018 7:58:04
#337 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@wawa

Quote:

wawa wrote:
@cdimauro

Quote:
I was wondering why he continues to keep his branch private, instead of merging it into the master one.


i dont know, but i dont think so. too many changes. the last time i remember anyone reporting anything upstream was kalamatee, on a compilation fault concerning m68k gcc-8.1.0.

OK, I understand. It's very unlikely that a huge patch will be accepted. The same happened to me when I created my CPython fork (WPython): although the results were very good, the patch wasn't integrated into the mainline because it was too complex and touched too many things.

Bebbo should create small patches addressing specific features/optimizations. This will help code reviews a lot and increase the chances of having them accepted by the community.

Hypex 
Re: 68k Developement
Posted on 22-Sep-2018 17:27:20
#338 ]
Elite Member
Joined: 6-May-2007
Posts: 11204
From: Greensborough, Australia

@matthey

Quote:
The 68020 ISA removed a few limitations. It became possible to use a data register like a base register for some addressing modes although it is slower and often larger. An address register can be tested. These did make the ISA more orthogonal. The 68020 ISA is a significantly easier target for compilers but a few of the features should have been avoided as they can't be fast in hardware.


Yes, I did see some modes with data registers in there. They looked slightly superfluous. It wouldn't seem worth it with a penalty like that.

Quote:
I have seen code which referred to the 68k registers as r0-r15. It may take some getting used to using d0-d7 and a0-a7, but it is easier to keep them organized where they lack orthogonality, and the register names are a consistent length and sometimes shorter.


That would be confusing. It's worse than using x86 syntax for 68K code. But what would come first in the line-up? Address or data? They aren't equals.

Quote:
The 68k address registers act much like RISC registers with the auto sign extension to register width and no condition code as default. This is fast avoiding some stalls but data registers are more flexible and powerful. It is nice to have both types of registers.


That may have caught me out in the early days of ASM. But I see how it is useful.

Quote:
The Tabor CPU abandoned both the AIM PPC ISA and ABI and in a way which made compatibility difficult. Freescale should have renamed these CPUs PPCE or something else to accurately reflect the lack of PPC compatibility. The 68k, x86, and ARM 32 bit CPUs will run code for their respective ISAs from practically the first CPU which supported the 1st version of their ISAs (compatibility sometimes being broken when they introduced 64 bit ISAs and ABIs). Old 32 bit PPC code on the Tabor 32 bit PPC CPU won't run though. It looks to me like false marketing to claim these CPUs are PPC compatible.


The problem is the time spent investing in it to get it working. It's okay for embedded work but a poor choice for our market. The Sam already gave trouble to those running WarpOS/UP apps under a wrapper. I've noticed the X1000 kernel can be unstable for things like 68K emulation or interrupt signalling; things that are fine on the XE and Sam. I don't know how they passed it. IMHO they should have left it alone or found a more compatible CPU. Nice idea, but OS4 programs aren't optimised enough to run on slow hardware.

Quote:
Unification of register files has advantages and disadvantages. The way the POWER ISA unified the FPU and SIMD registers allows for more combined registers than the same resources would give if they were separate but conflicts between the units for resources can reduce parallelism. Compatibility should be a primary concern.



I wonder how they merged it, and how transparently. If there were 32 of each, what happens where?

Quote:
The 68k uses a variable length encoding so longer instructions can be added if encoding space is needed. There is no need for a prefix or postfix with a newly created encoding. They are more commonly used to "fix" mistakes and limitations of old encodings.


I see that for x86 it was used for segment selection, but also for cool things like loops. However, the 68K has DBcc, so there's no need for a loop or repeat prefix, as an example.
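The DBcc semantics alluded to above are a bit unusual: for DBRA/DBF the counter is decremented and the branch is taken until the counter reaches -1, so a loop entered with Dn = n runs n + 1 iterations. A minimal model of that behaviour:

```python
# Model of 68k DBRA (DBF) loop semantics: decrement the counter and keep
# branching until it hits -1, so D0 = n yields n + 1 loop iterations.
def dbra_iterations(d0):
    iterations = 0
    while True:
        iterations += 1      # loop body executes once per pass
        d0 -= 1              # DBRA decrements the low word of Dn
        if d0 == -1:         # falls through when the counter reaches -1
            break
    return iterations

assert dbra_iterations(9) == 10   # D0 = 9 gives 10 iterations
assert dbra_iterations(0) == 1    # D0 = 0 still runs the body once
```

This "count+1" convention is why 68k loops are typically set up with the element count minus one in the data register.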

Quote:
As I recall, the x86/x86_64 SIMD instructions are up to 6 bytes in length not counting prefixes and data extensions but they have many outdated SIMD instruction sets like MMX with tiny 64 bit registers wasting encoding space (at least Altivec started with 128 bit registers). The 68k could avoid this waste by starting with at least a 256 bit wide SIMD unit.


And it should be adaptable, since vector sizes change quickly, so some forward thinking on sizes would help here. Perhaps something like the 020's scale-type instructions could help support multiple widths and future expansion.

Quote:
Moving on is likely to leave most of the Amiga users and software behind. Moving back to retro goodness and reuniting with classic Amiga users left behind looks better to me.


They just need to build in compatibility layers, as they have done. But they need to separate it: the old software just needs to think it is running in a 68K Amiga or OS4 environment.

Retro is nice, but retro goodness can't be used in a modern web world of internet, office work and productivity.

Hypex 
Re: 68k Developement
Posted on 22-Sep-2018 17:40:40
#339 ]
Elite Member
Joined: 6-May-2007
Posts: 11204
From: Greensborough, Australia

@wawa

Quote:
i dont like to call out examples by name, but i have observed not only ng users but notable developers dropping ng for amiga. you should recall, because afair you named one of them yourself as an example in one of your recent posts.


Kolla made a good point, but I think this highlights the type of NG user who does that. There are Amiga users who tasted NG, didn't take to it or tried a web browser, lol, then gave up and went back to the Amiga. But they would have been using a Mac or PC for their main computer activities for years, after giving up the Amiga as a main machine.

There would be very few Amiga users who were still using only their Amiga to this day, tried NG, and then went back to their real Amiga again. There is no way they would have been using their Amiga the same way for the past 20 years, unless they ignored the outside computer world in that time.

So, as I see it, those who tried NG (or OS4, as it were) but ended up going back to an Amiga must mainly be using a Mac or PC as well. In my case I have Macs and PCs as backups only, and my main work is done on my X1000 now as far as possible. Others may have embraced NG, eventually gave up and converted to Mac or PC, and went back to Amiga machines for enjoyment.

bison 
Re: 68k Developement
Posted on 22-Sep-2018 18:26:50
#340 ]
Elite Member
Joined: 18-Dec-2007
Posts: 2112
From: N-Space

@cdimauro

Quote:
The Cortex-A53 is quite old and not very efficient.

It has been superseded by the Cortex-A55, I think, although I don't know of any mainstream products that use it. Perhaps the Pi4 will.

Last edited by bison on 22-Sep-2018 at 06:39 PM.

_________________
"Unix is supposed to fix that." -- Jay Miner


Copyright (C) 2000 - 2019 Amigaworld.net.
Amigaworld.net was originally founded by David Doyle