Forum Index / Amiga General Chat / 68k Developement
wawa 
Re: 68k Developement
Posted on 16-Sep-2018 16:43:20
#221 ]
Elite Member
Joined: 21-Jan-2008
Posts: 6259
From: Unknown

@OneTimer1

Quote:
So I should be thankful for every product that's not available?


Judging by the customer base speaking up on the forums, the product is available, just not with the features you demand.

cdimauro 
Re: 68k Developement
Posted on 16-Sep-2018 16:47:02
#222 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Barana

Quote:

Barana wrote:
@cdimauro

I note with thanks your cordiality to me in your previous response.

I ask you to extend the same cordiality to Gunnar, who is a genuine Amigan - he used to hang out here quite a bit before you joined aw.net and made friends with many of us.
He and his team, who haven't finished yet, have talent I only dream of.

Unless, of course, you'd like to produce for us today an 080 core that rivals or betters his?
Again, thank you for the cordiality you showed me today. Could we throw some Gunnar's way? An industrious Amigan?

I thank you for your nice thought about me, but I cannot extend it to Gunnar.

Gunnar falsely accused me of lying on the aros-exec forum, whereas all my statements were (and still are) easily verifiable. So there's no chance that I can show cordiality towards him.

BTW, having designed an 080 core isn't a prerequisite for being entitled to speak about technical matters. That's a (well-known) logical fallacy.

Anyway, I already contributed to defining a new "neo Amiga" platform, more than 5 years ago, on the amigacoding.de forum (which is now offline, but you can use the Wayback Machine to find something). If Olaf (the site admin/owner) is honest enough, he can confirm it.
I gave a lot of ideas, even for new SIMD extensions for the 68K (which obviously Gunnar didn't follow). That was, and is, enough.

@OneTimer1, megol , umisef : I agree.

Overflow 
Re: 68k Developement
Posted on 16-Sep-2018 17:12:17
#223 ]
Super Member
Joined: 12-Jun-2012
Posts: 1628
From: Norway

@OneTimer1

Not available? I can clearly see the V600 sitting next to me. Is it a perfect product? No, it's clearly a work in progress. But given that I never owned anything beyond a Blizzard 030 + 16 MB fast RAM, and didn't experience RTG screenmodes back in the 90s, the user experience I get from the Vampire is quite comfortable/pleasing.
Of course, Gold 3 with both RTG and AGA/standard Amiga video through HDMI will further enhance user-friendliness from a NON-coder point of view. I can't speak for people who use it as a developer platform, though, or whether the compilers are mature enough, etc.

If being positive about any actual development in the Amiga community counts as being a fanatic, then guilty as charged. I'm happy to see any development, be it MOS, AROS, AOS x.x, PPC, etc.
If someone finds a use for xyz, that's cool.

I would think you could tag me with the label "hippie" instead of "fanatic" in that regard.

cdimauro 
Re: 68k Developement
Posted on 16-Sep-2018 17:18:14
#224 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@megol Quote:
megol wrote:
My earlier design could execute it in theory (as it was never finished) as integer units could read address registers but I still think it ugly. It starts changing the uses of A registers while not being too useful in itself.

Yes it would open up the use of A registers for semi-constant values but the initialization of the registers would still be over the D-A split, not orthogonal by design. It would require new code so there is no inherent advantage over a normal register extension in compatibility, there would be a size advantage though (compared to using a prefix) and maybe that's enough.

Opening up address registers as a source isn't that bad. In the end, pointers are also data.

What's missing on the 68K address registers is IMO an instruction to correctly align an address/pointer to a specific size. Yes, pointers aren't plain data, but we sometimes need aligned pointers, and currently this is an expensive operation on an address register.
Quote:
But if the size advantage is the main point why not simply map the new data registers to the address sources? That would still be ugly of course and it would still be a special case.

Yup, that can be a possibility. So you'd have 16 data registers and 8 address registers. That can be good enough, if some instructions are added for the ones that currently have the address-register EA mode already occupied.
Quote:
That's also why I opposed your base register update, increments and decrements are easy to handle but updating from the EA can increase the latency of address register writes potentially decrease performance _and_ would require a more complicated AGU design.

I can understand that for more complicated EAs, but a simple Base Address + Offset update should be as easy to implement as the canonical pre-decrement and post-increment modes.

Why might it decrease performance?
Quote:
An exception needn't take much more time than a pipeline flush like after a mispredicted branch. Switching to supervisor mode needn't take more than a few clocks.

Most hardware isn't optimized for such things but still one can look at the Itanium abomination that IIRC had a syscall cost of 30 cycles in a L4 kernel.

Strangely, I haven't found the timings for the SYSCALL/SYSRET instructions in Intel's manuals.
Quote:
Ah here we have the "problem": I'm assuming an OoO 68k processor implemented in FPGA, not a in-order short pipelined core like the Apollo or 68060. Yes for a simple design there are advantages however as soon as one goes OoO the disadvantage disappears as the loaded data can't just be forwarded to a waiting execution stage. It either have to be buffered in a register or in some other resource e.g. a stateful ROB.

OK, but at least we can save a physical register.

cdimauro 
Re: 68k Developement
Posted on 16-Sep-2018 17:27:50
#225 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:

matthey wrote:
@cdimauro
I hope you didn't think I was ignoring you. I was just catching up.

No problem: I was (and still am) quite busy. -_- Fortunately the weekend gives me some oxygen.
Quote:
Quote:
cdimauro wrote:
Atom almost doubled the performance going from in-order to out-of-order, while keeping the same limit of max 2 instructions decoded & executed per cycle.


How many transistors did it cost?

No idea at all.
Quote:
We know that doubling the superscalar ways roughly quadruples the transistors.

Hum. That should apply only to increasing the number of ways in a superscalar design. Here we're talking about moving from an in-order to an OoO design.
Quote:
We don't know the transistor cost of OoO but I expect it varies significantly. Atom added many features with the OoO so it is difficult to calculate in this case. They could have added more cores instead which are easier to turn off/clock down but only help with parallel processing. They probably went to OoO at the same time as a die shrink to offset most of the energy efficiency losses.

Could be; without data it's hard to guess.
Quote:
It probably made sense to move to x86_64 for many reasons including improved ISA/ABI and better modern software compatibility but also needed many more gates for 64 bit, caches and more instructions. In the end, we end up with a lower power and more modern Pentium M instead of a 32 bit ARM or Cortex-A53 competitor

The Atoms were 64-bit from day one, although long mode was disabled in some of the first products.

BTW, adding 64-bit support to the x86 ISA took around 5% more transistors, according to AMD.
Quote:
Quote:
This only talks about instruction length: prefixes aren't mentioned.


The micro-architecture manual talks about the decoder for most other micro-architectures but not the Atom. Many of the mid performance CPUs can only process so many prefixes per cycle (sometimes only 1 prefix per cycle)

It's enough. Usually 1 prefix is used on x86/x64 instructions for FS/GS base segment (selector) selection, but that's quite rare. 1 prefix is usually needed for SIMD instructions, and 2 prefixes are found on SIMD instructions which use the new registers on x64.

So, as you can see, prefixes aren't that common (except for SIMD instructions, which anyway are a very small percentage compared to the GP ones).
Quote:
and instruction length changing prefixes can give multi-cycle delays (6 cycles for a Core 2 and Nehalem).

What do you mean by this: the instruction SIZE change (e.g. the default is 32-bit data, but with the proper prefix you can select 64-bit or 16-bit)?
Quote:
The Bonnell wiki I linked says the decoder can handle 3 prefixes/cycle but does not mention instruction length changing prefixes.

Indeed. It's not clear.
Quote:
Quote:
They are not: see below. Average SIMD instruction length is around 5 bytes (at least for the executable that I've disassembled and generated statistics), which isn't a big problem for a 2-ways pipeline, if the instruction prefetch window is 16 bytes (but even if its only 8 bytes, 3 bytes are enough to correctly decode the instruction).

I read that instruction length (surely with multi-threading) was one of the excuses for up scaling the Atom but I couldn't find the article. The early Atoms had limited resources they were trying to share with multi-threading so it is no surprise.

I don't think so. If the instruction window is 16 bytes, as I've read, then the Atom had no problem at all with longer instructions. In fact, the instruction window is still 16 bytes even in the 4-way OoO designs.
Quote:
Maybe their real world results were closer to ARMs multi-threading results.

I've never seen a multi-threaded ARM core.
Quote:
Quote:
I don't think so. Multi-threading is/was introduced because an in-order design leaves a lot of execution units under utilization, so this allows to better utilize them. Even on out-of-order designs it's very useful, and that's the reason why it's still implemented on high-end processors.

I understand. A core does lots of waiting so it is natural to want to do some processing during this time. The more transistors invested in the core, the more important it looks. I still prefer multi-core for the reasons I gave in a previous thread. Multi-threading wouldn't be so bad if only the execution units were shared but then I doubt there would be a worthwhile savings over having another core that doesn't need to share.

You can solve those problems using CPU affinity settings, if it's really important. Otherwise the OS scheduler should do a good job of spreading processes across the physical cores instead of just looking at the available hardware threads.

Anyway, hardware multi-threading can also improve single-core/thread performance (!) in some scenarios. I have some ideas about it.
Quote:
Quote:
The 68060 used fewer transistors because Motorola traditionally cut features on its processors, and the 68060 has both supervisor- and user-mode changes (instructions removed, and a simplified MMU). Another important mistake was not providing a fully-pipelined FPU, which basically crippled its FPU performance. And this processor is only able to pair instructions which are both 2 bytes in length, with several pairing limitations. It also introduced no new instructions. And last but not least, the design didn't reach high frequencies.

The 68060 FPU is quite nice and performs well in mixed integer/FPU code as the FPU instructions operate in parallel.

Not so well if you consider that you can only issue 1 FPU instruction per clock cycle, since their minimum size is 4 bytes and the instruction window width is just 4 bytes on the 68060.
Quote:
The Pentium FPU is a fully pipelined stack based relic which has good theoretical performance but is not easy to use. The Pentium FPU only had a performance advantage with (usually hand) optimized FPU only code.

Hum. I don't think so. The x87 is a stack-based FPU, which is very easy for compilers to use/exploit.
Quote:
Look up some old Lightwave tests and tell me which performed better per clock.

I don't know them. Do you have some data?

Anyway, there should also be SPECfp data for both processors, which are more trustworthy results.
Quote:
The Pentium was clocked up much faster while Motorola was working on the PPC 603.

Not only the Pentium: Intel was already well on the way to raising frequencies with the 80486 design.
Quote:
I believe the 68060 superscalar instruction pairing allows at least 2x4=8 bytes per cycle.

Quote:
FIFO buffer implemented with 3 read ports
- if current pOEP instruction is located i of buffer, then buffer reads at location (i+1),(i+2),(i+3)
- Allows the {(i+1),(i+2)} or {(i+2),(i+3)} pair to be sent to OEPs


http://www.hotchips.org/wp-content/uploads/hc_archives/hc06/3_Tue/HC6.S8/HC6.8.3.pdf

I don't know if the pOEP instruction has already been read from (i+0) and I'm not even sure if the array has 16 bit or 32 bit indexes (16 bit opcode produces a decode longword in the IFP). The IFP could only fetch 4 bytes/cycle but the decoupling allows the FIFO buffer to feed the OEP pipes for awhile.

I see. But IMO 4 bytes/cycle is still too small, considering that the smallest 68K instruction is 2 bytes and the 68060 can execute two per cycle. It means that the above pairing only works with longer instructions at the cost of wasted cycles executing just 1 instruction, or even none at all.
Quote:
This works very well for a variable length instruction encoding although enough consecutive long instructions can cause a bottleneck. Unfortunately, the 68060 can only forward 32 bit results yet it can become bottlenecked by instruction immediates (the ColdFire MVS/MVZ instructions and my immediate compression addressing mode would practically eliminate this).

Exactly.
Quote:
Certainly executing more instructions in pairs raises the IPC quickly but surely these bottlenecks didn't come into play too often.

Hum. The most-executed instructions are 16 bits in size, as we know, but unfortunately bigger instructions aren't rare.
Quote:
I have heard rumors that Motorola wanted an 8 byte/cycle fetch for the successor to the 68060 so maybe it was significant enough.

It should have been that from the beginning, IMO. I don't know what it would have cost out of a 2.5-million-transistor budget, but I bet not that much.
Quote:
The 68060 didn't reach high frequencies because there wasn't enough demand or customers for die shrinks. Apple had switched to PPC, Commodore was buying 68EC020s and workstations had switched to RISC. Motorola was telling everyone the 68060 was the end of the 68k line and Apple pulled some nasty tricks to keep 68060 accelerators off the market or they wouldn't have been able to sell the lower performance PPC 601 and 603 Macs. The 68060 CPUs were over $300 U.S. so hardly cheap. They were used for high end embedded applications where they were perfect being high performance at low clock speeds. The high performance embedded market was just not that big back then. The 68060 has an 8 stage pipeline which should make it easier to clock up than those shallow pipeline PPCs which replaced the 68060 (the 603 has only 4 stages as I recall). Die shrinks alone would raise the 68060 clock speed. Today it may be possible to make a 68060@300MHz for $3 (using old underutilized fabs which would be several generations of die shrink for the 68060) and it would sell into the mid performance embedded market.

Consider that the Pentium's debut was 1 year before the 68060's, and it was already introduced at 60 and 66 MHz. A year later, when the 68060 finally arrived, the Pentium was already running at up to 120 MHz.

And if we consider that the Pentium design had only a 5-stage pipeline, it's even stranger (or sadder)...
Quote:
Quote:
16 SIMD registers are too few. Intel introduced AVX-512 (with the EVEX prefix) to bring the SIMD registers to 32, which is a decent number for a CISC architecture. IBM found in 64 SIMD registers a good compromise for the new VMX2 (there was a paper about it).

The highest performance CPUs may have 64 SIMD registers but how many mid performance variants of those CPUs do you see? Register files are expensive. The other option is to make the SIMD unit optional and not available on mid performance CPU variants but I would rather have a more modest SIMD unit on all variants.

That's true, but once you define a SIMD unit with a "mid" target, you have put a very big constraint on supporting the more high-end market segments, and I don't think you want to define two different SIMDs for different markets, right?

Think about the Vampire's 68080: it has a very bad, crippled SIMD design (AMMX) which will prevent consistent enhancements.

Do you want a similar future for your 68K SIMD extension? It's better to think carefully before taking such important decisions.
Quote:
AArch64 values mid performance and has 32x128.

Yes, but ARM presented a size-agnostic vector ISA which can handle registers ranging from 128 to 2048 bits.

Do you have enough encoding space in your 68K extension to support both kinds of SIMD units (or a single one to address both scenarios/needs)?
Quote:
A CISC SIMD unit saves registers so maybe we could get by with 16x256? It is the same sized register file, we can do twice as many operations per instruction and we might just be able to keep the base SIMD instructions 4 bytes with the saved encoding space. I know you want a high end 68k with 64x512 SIMD unit but you are going to have convince Intel the 68k is better first.

Well, my primary interest is convincing Intel (or AMD, or some other CPU vendor which might be interested) that MY ISA is better. I'll talk a bit about it later.

I don't want to convince you about the choices for your 68K extension. I'm also biased because of my conflict of interest, but at least I can expose my ideas/opinions about technical facts in a professional way. Then you are smart enough (no joking: I'm quite serious) to take your time, evaluate the whole picture, and take your own decision for your project.

I can only say that I fully understand your concerns: designing an ISA isn't a simple exercise where you fill holes in some tables. It took me around 7 years to define and try all the solutions/ideas which came to my mind, looking at statistics and making comparisons with other ISAs. Some decisions were painful, but they had to be taken.

Anyway, at the very end, an ISA is a big synthesis of many things which sometimes conflict, and you cannot expect it to excel in all possible contexts / scenarios / markets...
Quote:
Quote:
Why not? More registers allow a better ABI convention, putting more parameters into the registers instead of pushing (and popping) them into the stack.

Register files are expensive in both gates and energy use. Code density usually deteriorates with more than 16 registers which requires bigger caches (more than 8 in the case of x86_64 but that bad ISA needed to reduced memory traffic). Working with a few variables in the cache with CISC is really quite cheap. This was very common for the x86 with 8 registers yet it didn't cripple the integer performance. Adding 8 more registers to 16 with x86_64 gave overall less than a 10% performance boost even as the memory traffic is reduced by far greater percentages. From 16 to 32 registers for CISC would probably give an overall measurable reduction in memory traffic but may not give a measurable difference in performance. Many registers is certainly not a quick way to performance as the PPC 603 with 32 registers exhibited in comparison to the Pentium with just 8 registers.

Apple, with its first 64-bit ARM implementation, clearly showed that doubling the number of registers gained a lot of performance (even completely ignoring the FPU tests, which made use of the better SIMD unit, including the hash/crypto extensions), despite the pointer size doubling and the code density getting worse. So there's room for improvement here.

However, when you define an ISA you have to be careful not to cripple it.

RISC-V is a clear example: this ISA was defined from the beginning to address (almost) all possible market segments, from low-end embedded to HPC. In fact, it supports both a reduced ("cut") ISA with only 16 registers (instead of the standard 32) and an upcoming size-agnostic vector ISA. But it was an easy game for the RISC-V designers: they had absolutely NO legacy inheritance / constraints to support.

Another one is my ISA, NEx64T: I defined it from the very beginning (and independently) with similar goals. My ISA is natively 64-bit, with 32 GPRs, 64 SIMD registers (from 128 to 1024 bits), and 8 mask registers (plus the infamous x87 FPU with its registers, added only for legacy reasons); it has 32- and 64-bit modes, but this only changes a few opcodes.
Even with this huge design (there's a lot of stuff), it can easily be "cut", going down to a 32-bit-only mode with 16 GPRs, or going up with an option for 128 SIMD registers and 16 mask registers (with longer opcodes, of course), plus the possibility of removing some legacy stuff to introduce more modern features (e.g. removing MMX support allows introducing a size-agnostic vector unit, using exactly the same opcode structure as the standard/legacy SIMD).
As you can see, the ISA is very flexible, despite carrying a big burden: being 100% x86 and x64 assembly-level compatible (with a notable difference: the 64-bit mode allows executing all the legacy instructions which AMD removed with x64), with all the consequences this also entails (supporting segmentation, very odd instructions, the x87 FPU, MMX, etc.).
As I said before, designing an ISA is a big compromise of many things, and for mine I wanted full source-assembly compatibility, because software is the most important thing, and a simple recompilation can bring over TONs of applications even if they had assembly parts.
Quote:
Quote:
I agree with you and thank you very much for the interesting paper, albeit it's too much outdated and an update with the more modern processors/ISAs would have been much appreciated.

What impresses me is both the first (Stack) and last (Mem-Mem) results. However the 68020 got a very nice and balanced result.

It looks to me like the stack architecture example is flawed as the variables are already in the CPU. Perhaps this is because it is difficult to load variables for a stack architecture? All the other examples do proper loading of variables.

I don't think so. The code assumes, for all architectures, that the 6 variables are located on the stack or relative to a frame pointer, so basically like a function call where they are input and output parameters. From this point of view the example is correct.

However, I don't know if the stack ISA had problems loading variables.
Quote:
The 68020 code size could be reduced too.

move.l -(a6),d0
sub.l -(a6),d0
move.l -(a6),d1
move.l -(a6),d2
muls.l -(a6),d2
sub.l d2,d1
divs.l d1,d0
move.l d0,-(a6)

68020 reg-mem
instructions: 8
code size: 160 bits
memory traffic: 402 bits
registers used: 4

The memory traffic is a bit better: 352 bits (160 bits + 6 * 4 * 8 bits of load/store operations).
Quote:
MIPS load-store
instructions: 10
code size: 266 bits
memory traffic: 458 bits
registers used: 9

I added the number of registers used. My code used one more register than the original 68k code. The MIPS RISC code still used 9 registers to my 4. RISC needs more registers.

This is wrong. The compiler generated very poor code for the MIPS, basically using a new register for every value it needed to store. If you change it manually, it ends up needing only 4 registers, which makes sense. The rest stays exactly the same (poor results).
Quote:
The 68k is not pure reg-mem either as it has the most common and cheapest mem-mem instructions although they are not used in the sample code. It looks like they may reduce instruction counts, code size, memory traffic and registers used further.

But it also misses ternary instructions, which can save some registers and/or reduce the number of instructions and/or improve the code density. For example, in the LZ77 source code that you sent to Vince, you had to put a fixed value into a (data) register, because the shift operation's immediate is limited to a maximum of 8 on the 68K.

Since you posted your 68020 code, I'll do the same for my NEx64T ISA:

NEx64T 32-bit version:
mov eax,[esp+16] ; load d
imul eax,[esp+20] ; d = d * e
mov edx,[esp+12] ; load c
sub edx,eax ; c = c - d * e
mov eax,[esp+4] ; load a
sub eax,[esp+8] ; a = a - b
idiv eax,edx ; a = (a - b) / (c - d * e)
mov [esp+24],eax ; store a

instructions: 8
code size: 176 bits
memory traffic: 368 bits (176 + 6 * 4 * 8 bits of load/store operations)
registers used: 3

NEx64T 64-bit version:
mov rax,[rsp+32] ; load d
imul rax,[rsp+40] ; d = d * e
mov rdx,[rsp+24] ; load c
sub rdx,rax ; c = c - d * e
mov rax,[rsp+8] ; load a
sub rax,[rsp+16] ; a = a - b
idiv rax,rdx ; a = (a - b) / (c - d * e)
mov [rsp+48],rax ; store a

instructions: 8
code size: 176 bits
memory traffic: 560 bits (176 + 6 * 8 * 8 bits of load/store operations)
registers used: 3

So, a bit less efficient compared to your 68020 version, but pretty close (albeit directly referencing values on the stack).

I could have used the post-decrement addressing mode (my ISA has all 4 of them: pre/post increment/decrement), but at an additional price in terms of code density, since it requires a longer encoding.
Quote:
Quote:
Consider that RISC-V will be a strong contender for all current leading architectures.


IMO, RISC-V is most likely to replace MIPS like AArch64 is replacing PPC. These are more efficient ISA replacements.

RISC-V aims to replace ARM and AArch64 as well: I see this clearly, and it also emerges from the various presentations/talks at the RISC-V workshops.
Quote:
Next up should be the 68k_64 replacing the x86_64.

That will be hard, because of another competitor.
Quote:
Quote:
It's really impressive. Do you have the architecture manual for this processor? I'd like to take a look at it (and at the opcodes structure, of course).

There are multiple embedded CPUs from CAST using the BA2 ISA. The simplest is only one pipeline stage for deeply embedded uses.

http://www.cast-inc.com/ip-cores/processors32bit/index.html

Unfortunately, I have been unable to find any online documentation on the ISA. You would probably have to contact them. It is a 3 op RISC ISA with 16, 24, 32 and 48 bit instructions and supports up to 32 registers. It looks like they have short encodings for 2 op instructions. The following has some examples of code.

https://www.chipestimate.com/Extreme-Code-Density-Energy-Savings-and-Methods/CAST/Technical-Article/2013/04/02

They claim, "We believe the BA22 has the greatest code density in the industry, estimated up to 20% better than the impressive ARM Thumb-2 ISA."

http://www.cast-inc.com/blog/consider-code-density-when-choosing-embedded-processors

I see, thanks. However, it seems to be too embedded-oriented.

Analyzing the few examples (I also haven't found an architecture manual), it seems that this BA2 ISA doesn't support variable operand sizes in its instructions: they all operate on 32 bits, which is expected, and this saves 2 bits in the opcode structure, which is a BIG deal (especially for the 16-bit base encoding). This is something which the 68K and x86/x64 ISAs cannot do, due to their legacy burden.
Quote:
I believe the 68020 ISA is probably up to 5% better code density than Thumb2 (hand optimized code as 68k compiler support has deteriorated). I was seeing up to 5% better code density for the 68k with a few enhancements using peephole optimizations like the vasm assembler can do. The 68k with enhancements could probably be 5%-15% better code density although supporting 64 bit instead would use up some of the encoding space which could be used to make a more dense 32 bit ISA.

That'll happen for sure. There's no free lunch: if you enable Size = 0b11 for 64-bit, you have to remove some useful instructions from the 16-bit opcode space.
Quote:
A 68k_64 ISA which is 25%-30% better code density than x86_64 or AArch64 may earn more respect.

I can reveal my finding here about my ISA: in 64-bit mode it has around 20% better code density than x64.

This comes from disassembling some x64 executables (I've already talked about it) and reassembling them for NEx64T, also applying a peephole optimizer for a few things (enabling the use of the new MOV Mem,Mem, PEA, and TEST/CMP + Jcc instructions). But a lot more can be done to improve both code density and performance, since I don't yet use many new features (new registers, new addressing modes, new ternary reg-reg-reg and reg-reg-simm8 instructions, etc.). I still have to work on an idea which I found in RISC-V regarding function prologues and epilogues, but in a better and more efficient way, which will help code density a lot (while decreasing performance a bit, as for RISC-V).

So, you know that you'll have a big contender for your 68k_64 ISA in the race to supplant x64.
Quote:
I'm not worried about the BA2 ISA being competition. It is nice to see some RISC guys that are smart about code density and energy efficiency.

Well, they aren't even RISC designs IMO. They are CISCs.
Quote:
I still think a 16 bit variable length encoding can offer a better combination of code density and performance (better alignment). The 68k also has the advantage of a huge code base.

I feel the same, even when considering an x86/x64 ISA rewrite/enhancement.

megol 
Re: 68k Developement
Posted on 16-Sep-2018 18:50:40
#226 ]
Regular Member
Joined: 17-Mar-2008
Posts: 355
From: Unknown

@wawa
I don't see how my post can be misunderstood, at least in that way: yes, the early releases used a standard slow core that Majsta (I always default to nicks/usernames) adapted to his board.

What I meant is that the Apollo core and the Vampire accelerators aren't a product coming from one source in the first place, and not from Gunnar alone specifically. Why allow the effort of the rest of the team to be pushed into the shadows to serve some agenda (namely, that the people noticing things aren't shipping a product)?

And frankly, even with the performance of the Apollo, I am still more impressed by the will and determination of an individual to push forward towards something he thought was missing.

Barana 
Re: 68k Developement
Posted on 16-Sep-2018 19:33:51
#227 ]
Cult Member
Joined: 1-Sep-2003
Posts: 843
From: Straya!

@OneTimer1
'The entitled generation': give me what I want NOW! Or I will rail against you, cast aspersions, and troll you until I get it.
Yeah, nah.

_________________
Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

I serve King Jesus.
What/who do you serve?

 Status: Offline
Profile     Report this post  
Barana 
Re: 68k Developement
Posted on 16-Sep-2018 19:35:25
#228 ]
Cult Member
Joined: 1-Sep-2003
Posts: 843
From: Straya!

@gregthecanuck

Spot on ;)

_________________
Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

I serve King Jesus.
What/who do you serve?

 Status: Offline
Profile     Report this post  
OneTimer1 
Re: 68k Developement
Posted on 16-Sep-2018 20:15:11
#229 ]
Cult Member
Joined: 3-Aug-2015
Posts: 983
From: Unknown

@Barana

Quote:

Barana wrote:
@OneTimer1
'The entitled generation': give me what I want NOW! Or I will rail against you, cast aspersions, and troll you until I get it.
Yeah, nah.


I don't know what generation you are writing about.

I'm of the generation fooled by hundreds of announcements: fooled by the Natami announcements, fooled by pictures of Natami's 3D Blitter that never existed, fooled by bullshit talk about broken FPUs, fooled about MMUs.

The Apollo team should stop their fantasy announcements and get 'real'.

 Status: Offline
Profile     Report this post  
JimIgou 
Re: 68k Developement
Posted on 16-Sep-2018 20:35:02
#230 ]
Regular Member
Joined: 30-May-2018
Posts: 114
From: Unknown

@OneTimer1

As long as there are people with fantasies that the 68K can regain supremacy over X64, fantasies and delusion will be the order of the day.

But pathological psychology is no substitute for real intellect, and the majority of you are engaged in mere mental masturbation.

 Status: Offline
Profile     Report this post  
ppcamiga1 
Re: 68k Developement
Posted on 16-Sep-2018 21:08:21
#231 ]
Cult Member
Joined: 23-Aug-2015
Posts: 771
From: Unknown

Ten years ago Gunnar von Boehn announced his 68k classic Amiga wunderwaffe.
The wonderful Natami would run Workbench many times faster than UAE on a fast PC (in 2008 that meant a Core 2 Duo).
The wonderful Natami would have better and faster graphics than the then-current Amiga NG AGP solutions (in 2008 that meant the Radeon R200 family).
The wonderful Natami would be the Amiga that Commodore would have made had it not gone bankrupt.
Some Gunnar acolyte even wrote on Amigaworld that the wonderful Natami at a 200 MHz clock would be faster than a 600 MHz G4.
The wonderful Natami would cost less than 100 Euro.
After ten years of lies and cheating, Gunnar von Boehn has the Vampire V2: the integer performance of a 50 MHz 060, an FPU working only in selected applications, no MMU, no 3D acceleration, but faster 100 MHz memory.
There is some progress compared to the old 060 cards: memory is six times faster, and 2D is at least as fast as the BVision/PCI solutions.
But in performance the Vampire V2 is far away from the announced wonderful Natami.
Workbench on a 2008 Core 2 Duo runs many times faster than on the Vampire V2.
The Vampire V2 is still not good enough to compete with a cheap PC from the Windows 95 era.
The Vampire is not even fast enough to compete with computers using the good old PA-7100 PA-RISC which Commodore wanted to use in the Hombre chipset.

 Status: Offline
Profile     Report this post  
ppcamiga1 
Re: 68k Developement
Posted on 16-Sep-2018 21:15:39
#232 ]
Cult Member
Joined: 23-Aug-2015
Posts: 771
From: Unknown

After three years on the market, there are almost no Vampire applications.
Quake 1, Duke Nukem, Riva.
In 1997, we were proud of the Quake port on the Amiga.
I finished Duke Nukem on the Amiga, under ShapeShifter, in 1998.
But that was twenty years ago.
Riva? Even Commodore made hardware for playing VCD - that was FMV.


 Status: Offline
Profile     Report this post  
hth313 
Re: 68k Developement
Posted on 16-Sep-2018 21:16:25
#233 ]
Regular Member
Joined: 29-May-2018
Posts: 159
From: Delta, Canada

@cdimauro

Quote:

cdimauro wrote:
I still have to work to an idea which I've found on RISC-V regarding functions prologue and epilogue, but in a better and more efficient way, which will help a lot code density (but decreasing a bit the performances, like for RISC-V).


Do you have a link to or a description of this prologue/epilogue idea for RISC-V? I am very curious.

 Status: Offline
Profile     Report this post  
ppcamiga1 
Re: 68k Developement
Posted on 16-Sep-2018 21:20:38
#234 ]
Cult Member
Joined: 23-Aug-2015
Posts: 771
From: Unknown

From the POV of a happy PPC Amiga NG user, there is no reason to change to the Vampire.
The compatibility of the Amiga NG is good enough; every software hit from the 68k era works OK.
The CPU is 100 times faster than the Vampire, and the graphics are faster than the Vampire.
The MMU works, 3D works, the FPU works.
The Vampire is underpowered, overpriced crap which is not real retro, and not a good NG machine.

Last edited by ppcamiga1 on 16-Sep-2018 at 09:22 PM.

 Status: Offline
Profile     Report this post  
megol 
Re: 68k Developement
Posted on 16-Sep-2018 22:07:54
#235 ]
Regular Member
Joined: 17-Mar-2008
Posts: 355
From: Unknown

@cdimauro

cdimauro wrote:
Quote:

Opening up address registers as source isn't that bad. At the very end, pointers are also data.

What's missing on 68K address registers is IMO some instruction to correctly align an address/pointer to a given size. Yes, pointers aren't data, but we sometimes need aligned pointers, and currently this is an expensive operation for an address register.

3 instructions and 8 bytes (MOVE A,D; ANDI #mask,D; MOVE D,A). Not nice, but also not too expensive unless the register has to be aligned often.

It wouldn't be too bad to open up some simple operations to address register destinations even if they have to be executed in the AGU. And, or, add, sub for instance.
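For reference, the masking operation discussed above can be sketched in C (a hypothetical helper for illustration, not from the thread; it assumes a power-of-two alignment, which is what a single ANDI mask can express):

```c
#include <stdint.h>

/* Align an address down to a power-of-two boundary. This is the
   operation that on the 68k needs the MOVE A,D / ANDI #mask,D /
   MOVE D,A round-trip through a data register. */
static uintptr_t align_down(uintptr_t addr, uintptr_t align)
{
    return addr & ~(align - 1);   /* clear the low log2(align) bits */
}
```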

Quote:

Yup. Can be a possibility. So you have 16 data registers and 8 address registers. Can be good enough, if some instructions are added for the ones that currently have the address register EA mode already occupied.


But it would be ugly. That is maybe not a huge problem (processors don't care of aesthetics) but it would make instruction selection a little bit harder for compilers.

Quote:

I can understand that for more complicated EAs, but a simple base address + offset should be as easy to implement as the canonical pre-decrement and post-increment ones.

Why might it decrease performance?

I don't like special casing without good reasons. Can't most cases be handled with LEA plus auto-increment addressing? Couldn't the bit in the full extension word be used for something else?

It might decrease performance if the address generation path is 2 stages while the (normal) increment/decrement/register move path is one stage for instance. It also complicates handling of indirect address modes.

Also let's imagine handling indirect address modes with a specialized simplified AGU close to the cache itself to reduce load on the main AGU and potentially decrease latency. That would make base register updates very painful.

Quote:

Strangely I haven't found the timings for SYSCALL/SYSRET instructions on Intel's manuals.

Hard to find information about them for all processors.

Quote:

OK, but at least we can save a physical register.

Is that worth much?

 Status: Offline
Profile     Report this post  
cdimauro 
Re: 68k Developement
Posted on 16-Sep-2018 22:09:11
#236 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@hth313

Quote:

hth313 wrote:
@cdimauro

Quote:

cdimauro wrote:
I still have to work to an idea which I've found on RISC-V regarding functions prologue and epilogue, but in a better and more efficient way, which will help a lot code density (but decreasing a bit the performances, like for RISC-V).


Do you have a link to or a description of this prologue/epilogue idea for RISC-V? I am very curious.

Sure. Here's the document: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-1.pdf

Section 5.6: The Load-Multiple and Store-Multiple Instructions, starting from pag. 66 (79 in the PDF).
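As a rough illustration of the trade-off (my reading of the idea, not the paper's exact mechanism): instead of each function inlining its own register save/restore sequence, functions call shared save/restore helpers, paying a call/return per prologue/epilogue (a bit slower) in exchange for keeping only one copy of the sequence in the binary (denser code). A toy C analogue, with a hypothetical 4-entry callee-saved register file:

```c
#include <stdint.h>

#define NREGS 4

/* Hypothetical callee-saved register file, modeled as globals. */
static uint32_t regs[NREGS];

/* Shared "store-multiple" helper: one copy serves every prologue. */
static void save_regs(uint32_t *frame)
{
    for (int i = 0; i < NREGS; i++)
        frame[i] = regs[i];
}

/* Shared "load-multiple" helper: one copy serves every epilogue. */
static void restore_regs(const uint32_t *frame)
{
    for (int i = 0; i < NREGS; i++)
        regs[i] = frame[i];
}
```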

 Status: Offline
Profile     Report this post  
Barana 
Re: 68k Developement
Posted on 16-Sep-2018 22:19:21
#237 ]
Cult Member
Joined: 1-Sep-2003
Posts: 843
From: Straya!

@JimIgou

Two americanisms in reply : 'ain't that The truth' and 'Case closed'

_________________
Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

I serve King Jesus.
What/who do you serve?

 Status: Offline
Profile     Report this post  
cdimauro 
Re: 68k Developement
Posted on 16-Sep-2018 22:26:02
#238 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@megol Quote:
megol wrote:
@cdimauro

cdimauro wrote: Quote:
Yup. Can be a possibility. So you have 16 data registers and 8 address registers. Can be good enough, if some instructions are added for the ones that currently have the address register EA mode already occupied.

But it would be ugly.

It is. I don't like the idea, but it's on the table as a possibility, like using prefixes.
Quote:
That is maybe not a huge problem (processors don't care of aesthetics) but it would make instruction selection a little bit harder for compilers.

Sure. However, compilers (and JITters too) only need to be adapted once. Of course, someone has to introduce that change, which is a pain to implement.
Quote:
Quote:
I can understand that for more complicated EAs, but a simple base address + offset should be as easy to implement as the canonical pre-decrement and post-increment ones.

Why might it decrease performance?

I don't like special casing without good reasons. Can't most cases be handled with LEA plus auto-increment addressing?

Then you require two instructions to be executed -> worse performance and code density.
Quote:
Couldn't the bit in the full extension word be used for something else?

Sure. Some other idea?
Quote:
It might decrease performance if the address generation path is 2 stages while the (normal) increment/decrement/register move path is one stage for instance. It also complicates handling of indirect address modes.

Also let's imagine handling indirect address modes with a specialized simplified AGU close to the cache itself to reduce load on the main AGU and potentially decrease latency. That would make base register updates very painful.

I'm not a micro-architecture expert; however, I don't think it would be painful.

However complex the EA mode used to reference memory is, the processor has to calculate the final (virtual) address before accessing it. That means that at a precise stage in the pipeline you already have that address, and you certainly know which base address register was used (in the case of a proper EA mode). Then it's only a matter of "just" saving that calculated address to the proper register.

So, I don't see any difference from the pre-decrement or post-increment address modes (except that the second one makes it easy to do the calculation after the virtual address is used): first you need to calculate the virtual address, and then you can update the address register (and this can be executed in parallel with other things).
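The point can be illustrated with a toy C model (purely illustrative, assuming 32-bit address registers): in every mode the effective address is computed first, and the register write-back is a separate step that can overlap with the memory access.

```c
#include <stdint.h>

/* Toy model of 68k-style address generation: each helper returns the
   effective address and performs the register update, mirroring the
   order described above (EA first, write-back after / in parallel). */

/* (An)+ : the EA is the old register value, then the register is bumped. */
static uint32_t ea_postinc(uint32_t *an, uint32_t size)
{
    uint32_t ea = *an;
    *an = ea + size;          /* write-back can overlap the access */
    return ea;
}

/* -(An) : the subtraction must happen before the access. */
static uint32_t ea_predec(uint32_t *an, uint32_t size)
{
    uint32_t ea = *an - size;
    *an = ea;
    return ea;
}

/* Hypothetical "base + offset with update" mode: the same pattern,
   except the EA is base+offset and the write-back stores it into An. */
static uint32_t ea_base_offset_update(uint32_t *an, int32_t offset)
{
    uint32_t ea = *an + (uint32_t)offset;
    *an = ea;
    return ea;
}
```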
Quote:
Quote:
OK, but at least we can save a physical register.

Is that worth much?

It might be, if you don't have that many registers.

 Status: Offline
Profile     Report this post  
Barana 
Re: 68k Developement
Posted on 16-Sep-2018 22:37:19
#239 ]
Cult Member
Joined: 1-Sep-2003
Posts: 843
From: Straya!

@Barana

From the Apollo site
WoW


Gunnar von Boehn
(Apollo Team Member)
Posts 3503
21 Aug 2018 06:21

Regarding the questions about ASIC.

Apollo 68080 is clearly the most advanced 68K CPU.
Those advanced features are the reason Apollo is the fastest 68k CPU.
The Motorola 68060 is the 2nd fastest 68K CPU.
If you compare both Apollo 68080 and the M68060 you can clearly see the difference those improvements make.


Doing an ASIC needs a lot of preparation time and a big investment.
And then an ASIC is a one time shot.

Right now we use FPGAs for the Vampire cards.
The FPGA allows "upgrading" the core version.
So far the Vampire users have regularly gotten a new CPU version, each one improved. We continuously develop the core, and Apollo continuously gets faster and more powerful.

We have a good list of development ideas for the future.
The new Out Of Order features give a big speed boost for some applications - we saw benefits of up to +40% in some cases.

Last edited by Barana on 16-Sep-2018 at 10:40 PM.

_________________
Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

I serve King Jesus.
What/who do you serve?

 Status: Offline
Profile     Report this post  
kolla 
Re: 68k Developement
Posted on 17-Sep-2018 9:14:01
#240 ]
Elite Member
Joined: 21-Aug-2003
Posts: 2896
From: Trondheim, Norway

@wawa

Quote:

wawa wrote:
@OneTimer1

Quote:
So I should be thankful for every product that's not available?


judging by customer base speaking up on the forums, the product is available, just not with the features you demand.


The V4 is still not available, which was where this argument came from, was it not?

Or maybe not.

The thing that is not available, and that has never been available, is the mythical "full Apollo Core". Remember why there is a V2? Because the V1 was nowhere near big enough for the "full Apollo Core" (including FPU). V2 was supposed to take care of that. But then it too turned out to be too small, so eventually V4 came around.

The "standalone" existed (as in "it is here, we are testing it, available real soon now!") long before the V4 was announced. I suppose that was "V3"? V2 was made by some "Vampire Team" that was/is not to be confused with the "Apollo Team", while V4 is not really a "Vampire" as it is NOT designed made by Majsta (who owns the Vampire brand), but rather is an actual "Apollo Team" product, or maybe one should say "V4 Team", Majsta offering facilities for production.

Confused yet? Don't be, it is all crystal clear! Like HDMI output!

_________________
B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC

 Status: Offline
Profile     Report this post  

Copyright (C) 2000 - 2019 Amigaworld.net.
Amigaworld.net was originally founded by David Doyle