Forum Index / Amiga General Chat / 68k Developement
cdimauro 
Re: 68k Developement
Posted on 15-Oct-2018 8:09:01
#441
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@megol Quote:
megol wrote:
@cdimauro
Quote:
cdimauro wrote:
That's the point: they are quite rare. It's better to use this encoding for more important EA modes.

Yes. But is that 68k extended to 64 bit? Philosophy. :)

Not yet: I'm just philosophizing.
Quote:
(Not feeling well so while I read this thread don't expect long replies)

I hope that you're getting better now.

It's a bad period for me too: I had surgery four days ago, and I'm still taking painkillers. Writing anything on the PC is a huge pain. -_-

cdimauro 
Re: 68k Developement
Posted on 15-Oct-2018 8:25:20
#442
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@JimIgou Quote:
JimIgou wrote:
@matthey
Quote:
That was the conclusion I finally came to after my efforts over the years to bring the 68k and Amiga back.

As a devoted MorphOS user, I'd like to see us focus on Power 9 (and its successors) rather than move to X64 once PPCs quietly die.

You're just prolonging the agony.
Quote:
But I don't agree that the 68K efforts are a complete waste, as they will allow us to have full backward compatibility without emulation (JIT or otherwise).

And more powerful 68K cores would enable backporting of OS4 and MorphOS, finally unifying our efforts.

Difficult to see this happening.
Quote:
Oh, and btw, even if none of this succeeds, I anticipate being able to run OS3.1-3.9, OS4, and MorphOS all via QEMU on a Raptor Talos system (concurrently) in the near future. So I feel I'm staying true to my roots.

I should be able to run X64 OS' and apps in a similar manner as well.

You can use QEMU on x86/x64 and ARM systems as well, without buying very expensive systems (which, besides, don't perform well in single-core/thread workloads).

BTW, QEMU on the Amiga OS and Amiga-like systems uses just one core, so I wonder how you could run all that stuff concurrently on a virtualized platform.
Quote:
Hypervisor-enabled, bi-endian processors are too cool.

Only if you really can make use of it.
Quote:
Finally, as to the previously mentioned Super H, I thought that was a great processor. And the re-implemented H2 core (BSD licensed as J2) looks promising. As H2+ and H4 patents expire, this line of open cores will expand and all appear to have advantages over the 68K.

Questionable. I still believe that the 68K offers better code density and performance. I don't trust a single (IMO biased) study, whereas other studies show that the 68K offers the best compromise between code density and executed instructions.
Quote:
@megol
Quote:
Yes. But is that 68k extended to 64 bit?


Is 64-bit absolutely essential? I still have some 32-bit OS' running, even on 64-bit processors (i.e. 32-bit Windows on an i7, MorphOS on a PowerMac G5).

Our current problem isn't 64-bit capability; it's that our legacy is 31-bit addressing, not 32, limiting us to 2 GB instead of 4 GB.

Just by re-working the software and the OS' we can double our memory capability, all on the same processors.

As I said before, you're only asking to delay the agony. This is just a patch, a workaround trying to postpone the inevitable: jumping to a real 64-bit platform.

@hth313: I fully agree.

@OneTimer1 Quote:
OneTimer1 wrote:
@hth313
Quote:
hth313 wrote:

That is just me, a fairly normal developer. Then there are people doing in-memory databases; try telling them that 32-bit is enough...

AmigaOS is restricted to an address space of 2GB; a 64-bit CPU won't make the OS 64-bit compatible. For a 64-bit system you would need an AmigaOS that was compiled for a 64-bit CPU, something like AROS 64.

But I don't believe a 64-bit AROS could have AOS-compatible structures for (real) 68k software, and without compatibility with the original AmigaOS all these 64-bit extensions to the CPU would make no sense.

Why? You can run the original Amiga games using UAE (which is still required even on the current 32-bit PowerPC ports/re-implementations of the Amiga OS), and applications using a "virtualizer" (once someone writes one).
Quote:
All this talk about CPU expansion and a 64-bit extension of the 68k seems useless. A 64-bit AOS would need UAE-like sandboxes for AOS 68k software. If someone wants this, he can have it under Windows.

This is where the Amiga ends: a 68k 64-bit CPU won't give more compatibility than UAE on Windows; it will only result in lower performance (FPGA) and higher system prices (custom hardware).

That's true, but this thread is about general 68K development. We already talked about possible market segments to address (embedded, mid-performance) with such 68K "evolutions", which doesn't necessarily mean having an Amiga OS flavor running on them. But I can also see AROS/64-bit running on a 68K_64, for people who like it.

cdimauro 
Re: 68k Developement
Posted on 15-Oct-2018 9:05:04
#443
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Hypex Quote:
Hypex wrote:
@cdimauro
Quote:
That's wrong: alignment is VERY important (and expensive) using bitplanes.

I thought you were talking about pixel bit alignment, where every LSb to MSb is packed together separately. Not the planes.

It's a general question, which applies to pixel alignment as well. I'll talk about it specifically and more clearly below.
Quote:
Quote:
That's what you also do with bitplanes: shifting and masking using the most common operations.

Good if you only need one colour.

Which is the same with packed graphics at depth = 1.
Quote:
Quote:
So, what's the point in not accepting "weird" packed modes, like 3-bit depth? It's only due to a mental schema which insists that packed modes be power-of-two (BTW, 2 and 4-bit packed modes require shifting & masking as well, and they are p-o-t).

No, it's because packing odd pixel depths is impractical to work with. Okay, even nibbles can't be plugged in directly since they take up half a byte. But a depth like 3 would have to be shifted and masked out in a format like this:
11122233 34445556 66777888 999AAA00

Do such formats exist? I can only imagine them being used for a scrolling background or similar, where the data didn't have to be modified.

I don't know if they existed or not: I'm only talking about the possibility of using them, and the comparison with bitplanes.

And to give a general answer, yes: they can be used everywhere; so not only for scrolling backgrounds.

I know that they sound odd, but they aren't difficult to use, even by the CPU. This is the worst case, of course, because the CPU has to access misaligned data, doing shifting and masking, as we already discussed.
In general, you need to read 2 bytes, combine them, then shift + mask ("and") to get the index value. To replace the index value, you need to read 2 bytes, combine them, mask + shift + insert ("or") the new value, split them again, and then write the 2 bytes back. It may sound complicated, but this is a general method which works for all "odd" packed depths (3, 5, 6, 7); "good" packed depths (1, 2, 4, 8) are of course much easier (and faster) to handle.
Now think about doing the same with bitplanes and tell me how many operations the CPU needs to do exactly the same work (read and write a pixel).
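Since we're on it, here is a minimal C sketch of that general method for 3-bit packed pixels (my own illustration, not from any real codebase: the function names are mine, pixels are packed MSB-first as in the layout quoted above, and the buffer is assumed to have one spare byte at the end so the 16-bit load never reads past it):

```c
#include <stdint.h>

/* Read the 3-bit pixel at index i. Pixel 0 occupies bits 7..5 of byte 0,
   matching the "11122233 34445556 ..." layout quoted above. */
static unsigned get_pixel3(const uint8_t *buf, unsigned i)
{
    unsigned bit   = i * 3;                 /* absolute bit offset  */
    unsigned byte  = bit >> 3;              /* first byte touched   */
    /* read 2 bytes so a pixel straddling a byte boundary is covered */
    unsigned word  = ((unsigned)buf[byte] << 8) | buf[byte + 1];
    unsigned shift = 16 - 3 - (bit & 7);
    return (word >> shift) & 7;             /* shift + mask ("and") */
}

/* Replace the 3-bit pixel at index i: read 2 bytes, mask out the old
   value, "or" in the new one, write the 2 bytes back. */
static void set_pixel3(uint8_t *buf, unsigned i, unsigned val)
{
    unsigned bit   = i * 3;
    unsigned byte  = bit >> 3;
    unsigned word  = ((unsigned)buf[byte] << 8) | buf[byte + 1];
    unsigned shift = 16 - 3 - (bit & 7);
    word = (word & ~(7u << shift)) | ((val & 7u) << shift);
    buf[byte]     = word >> 8;
    buf[byte + 1] = word & 0xFF;
}
```

The same two helpers work for any odd depth by swapping the 3 and the 7 mask; the planar equivalent needs one read + shift + mask (or read-modify-write) per bitplane instead.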

Talking about the Blitter, it would work in a similar way with packed graphics, because cookie-cut operations (the worst case, albeit the most interesting and a very common one) require shifting and masking anyway, plus some small changes for handling masks (either provided by the user or "auto-generated").

Finally, talking about the display controller, packed graphics work more or less the same way, but with some notable exceptions, because you don't need:
- several DMA pointers to fetch data for all needed bitplanes: one pointer is enough (two if you want to implement a Dual-Playfield mode);
- several buffers for the fetched data: one (or two) is enough;
- to extract & combine the fetched data to get the color index, because you already have it;
- to delay outputting pixels (for the above reason: you already have the color index right after having fetched the data);
- to pre-fetch so much data, stealing precious DMA slots from sprites (AGA is the worst case here: 64-bit fetching with scrolling leaves just one sprite available!).

I think that should be enough, right?
Quote:
Quote:
Now add one pixel horizontally, and do again the calculations. Surprise, surprise: bitplanes are the worst in terms of wasted space, especially for an AGA screen.

321 pixels across is a weird amount. Even 320 is odd, not being a multiple of two. Yes, I get your point, but no one in their right mind would use such an odd amount.

It was just to show how inefficient bitplanes are compared to packed graphics under the same conditions.

And no, it's not a matter of a right mind: think about the constraints that the AGA display logic imposes if you want to take the most advantage of the available bandwidth. A 320 x 200 (or 256 for PAL/SECAM) screen using 64-bit fetch means you need to fetch five 64-bit values for each bitplane's row. That's good, because it perfectly matches the alignment constraints. Now think about adding some horizontal overscan or... some scrolling: you need to add a 64-bit "column" to each bitplane's row. Of course, multiplied by ALL bitplanes -> the more bitplanes, the more waste of both space AND bandwidth.

And we talked only about the display controller logic. Now think about a game which has graphics (tiles, BOBs) to be drawn. "Fortunately" the Blitter wasn't updated to fetch 32 or 64-bit data, so it wastes only a limited amount of space and bandwidth, but it's still enough if you consider that ALL graphics, even a small bullet, require data to be 16-bit aligned.

Last but not least, BOBs (and moving overlapping windows) require cookie-cutting operations, so you need a mask for the object to be drawn. That means you need space for the mask, but also A LOT of bandwidth in the case of graphics with a big depth because, as I said before, you need to fetch that mask AGAIN for EVERY bitplane where the operation is applied.
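To put numbers on that waste, here is a small C sketch (my own illustration, assuming the 64-bit fetch granularity described above): adding one 64-bit scroll "column" costs one extra fetch per bitplane per row in planar mode, but only one extra fetch per row for packed pixels.

```c
/* Bytes fetched per scanline, planar vs. packed, with 64-bit fetches.
   extra_cols = additional 64-bit columns (e.g. 1 for scrolling). */
static unsigned planar_row_bytes(unsigned width_px, unsigned depth,
                                 unsigned extra_cols)
{
    /* each plane carries 1 bit per pixel; the waste repeats per plane */
    unsigned words_per_plane = (width_px + 63) / 64 + extra_cols;
    return words_per_plane * 8 * depth;
}

static unsigned packed_row_bytes(unsigned width_px, unsigned depth,
                                 unsigned extra_cols)
{
    /* all depth bits of a pixel travel together; the waste is paid once */
    unsigned words = (width_px * depth + 63) / 64 + extra_cols;
    return words * 8;
}
```

For a 320-pixel, 5-bitplane row both layouts start at 200 bytes, but one scroll column raises the planar row to 240 bytes versus 208 for packed.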
Quote:
Quote:
Yes, but it doesn't mean that packed graphics have no advantages as well. You can also do some scroll operations with packed graphics, more or less the same as with bitplanes on Amigas (even better/easier with 8, 16, 24 and 32-bit packed graphics: you just need to change the pointer to scroll the area).

Makes sense. No scroll offset needed?
What about 1/4 pixel scrolling?

In this case you can use exactly the same technique (and register).
Quote:
Quote:
I don't remember the CGA, but EGA used bitplanes, like Amigas...


Well what a strange thing to do.

Were they formatted the same? By the bit? Or did they split the planes up into packed data?

The 4 bitplanes are separate, and selectable independently when applying some logical operations (AND, OR, XOR, NOT). Not like our super-flexible Blitter, of course, but EGA's bitplanes are very similar to the ones used in the Amiga (and perfectly fluid scrolling was doable as well).
Quote:
Quote:
In this case it's exactly the opposite: you're wasting A LOT of (chip) memory bandwidth by reloading the same mask for EVERY bitplane where you need to apply it.

Is it worse for interleaved BOBs?

No, they are exactly the same here. Interleaved graphics were only useful for graphics saving & restoring.
Quote:
Quote:
No chunky is possible with bitplanes.

I didn't think so. Not directly. Can't believe they let the last Atari be superior to AGA.

Then how can you do packed graphics with bitplanes (no, using the Copper is not allowed here)?
Quote:
Quote:
Don't worry, he can prove nothing. As I said before, I've mathematically proved it on the amigacoding.de forum, and even Gunnar (who was advocating bitplanes) was surprised.

He needs bitplanes for his SAGA story to continue...

That was the original plan, with weird ideas to have even 16 bitplanes. Gosh.

Fortunately he implemented and pushed a lot of RTG graphics with the Vampire, albeit he has introduced a 9th bitplane... Bah.
Quote:
Quote:
Creative people can have challenges with packed graphic as well.

It's just more "normal".

BTW, HP printers support planar. They also support ink colour planes, which are like byte planes. But RGB triplets are more common now.

If you want another example, JPEG 2000 used bitplanes in the last stage of its graphics pipeline (when the data is coded using prediction & arithmetic coding). I know it because I wrote a JPEG 2000 decoder for my internship thesis, to be implemented in hardware by STMicroelectronics.

P.S. Sorry, no time to re-read: I'm feeling too bad now. -_-

matthey 
Re: 68k Developement
Posted on 16-Oct-2018 2:09:33
#444
Elite Member
Joined: 14-Mar-2007
Posts: 2014
From: Kansas

Quote:

cdimauro wrote:
I agree with you, and as I've already said before: IMO CISC research was "banned" in favor of RISCs. RISCs, which look more and more like CISCs...

Interesting papers, anyway, especially the third one (16-bit vs. 32-bit instructions), because there's a nice comparison of the DLX ISA with some variants using a combination of only 16 registers and only 2 operands. It shows that using 32 registers and 3 operands (while keeping the same 32-bit opcode format) gives an advantage in both code density and path length (especially the latter). Regarding register usage, there are some cases where the 16-regs + 3-op version gave better code density results than the 32-regs + 3-op one, but overall the latter does better.

This is a good indication of possible benefits coming from a CISC design which implements those features/instructions.


The Apollo team discussed adding 3 op instructions. We decided there was little advantage with MOVE+op instruction fusing for a CPU core which executes 3 op instructions internally like the Apollo core. Assuming MOVE+op could always be fused (not possible), the advantages of 3 op occur when out of registers and when an argument is needed in a specific GP register which are not common. The former situation with 2 op is more of a problem for load/store which must do a register spill while the latter can sometimes be avoided by the compiler swapping argument registers. The 3 op instructions take valuable encoding space, increase the complexity of decoding and take more register ports. The 68k already has lower instruction counts than most load/store architectures and the Apollo core 3 op fusion would lower them more. I expect 3 op would be more beneficial to load/store RISC architectures than CISC.

Adding 3 op and 32 registers is cheap for a 32 bit fixed length encoding. These are good ideas as every advantage is needed to reduce instruction counts and code size after starting with poor code density. Starting with variable length instructions and reg-mem operations is quite different. More registers is generally a performance advantage but with diminishing returns. The 68k and x86_64 overall performance advantage of adding more than 16 GP registers is likely less than 2% while reducing code density by much more. Even with load/store RISC, the paper expected the energy efficiency improvement to outweigh the performance loss by enough to consider the 16 bit RISC ISA a success.

Quote:

OK, now it's clear. But this will be a new library model to implement.


Libraries would benefit from changes for 64 bit. The BRA (xxx).L in the LVO table is limiting for 64 bit (requires libraries to be in the lower 4GiB). This could be BRA (d16,pc) now but this limits libraries to a size of 64kiB. A new (d32,pc) address mode takes the same space as (xxx).L and would allow libraries to be 4GiB in size. A flag and OS support for non-ROM libraries using all PC relative addressing could be a minor change also.

Quote:

cdimauro wrote:
Who knows: maybe it's bad luck 'til now. I still believe that something can change in the CPU/ISA panorama, otherwise I would never have invested so much time in my ISA.


I don't know. The Intel/ARM duopoly is making them richer and more locked in.

Rivals ARM and Intel make peace to secure Internet of Things
https://www.reuters.com/article/us-arm-intel/rivals-arm-and-intel-make-peace-to-secure-internet-of-things-idUSKCN1MP1K4

"Chipmakers are expected to ship around 100 billion ARM-based IoT devices in the next four to five years, matching the total number of ARM chips shipped in the last 25 years, Mukkamala said.

ARM has predicted that as many as 1 trillion IoT devices will be put to work in the world over the next two decades."

The Amiga mostly missed the embedded revolution even using the most popular embedded 68k CPU of the time. It has largely ignored the FPGA age and is oblivious of IoT as well. I only talked to a few people in embedded when I was part of the Apollo team and I was already talking about creating products. Amiga companies seem to think there is more money in suing each other over peanuts while ignoring the 500 pound goose laying golden eggs in the room.

Quote:

Some embedded CPUs have 32 registers. BA21 is one example, but another notable one is Atmel AVR.


The CAST CPUs allow reducing the number of registers. I expect 16 registers is more common on their low end CPUs but then they go pretty low end.

Atmel's AVR32 has 16 registers: R0-R12, SP, LR and PC. The PC is part of the GP register file like legacy ARM so would not be considered GP (15 GP registers vs 16 for the 68k).

https://en.wikipedia.org/wiki/AVR32

Quote:

According to Wikipedia https://en.wikipedia.org/wiki/Motorola_68060
"The 68060 was introduced at 50 MHz on Motorola's 0.6 µm manufacturing process. A few years later it was shrunk to 0.42 µm and clock speed raised to 66 MHz and 75 MHz."

That can explain why such a low power consumption (~5.5W max) is reported compared to the Pentium and PowerPC (which were using a 0.6µm process).


The M68060UM gives a max of 4.9W for a full 68060@66MHz (see the "12.3 Power Dissipation" chart).

http://cache.freescale.com/files/32bit/doc/ref_manual/MC68060UM.pdf

All copyrights are from 1994 and I don't see any revision/version updates so I assume this is the old 0.6um version of the 68060.

Quote:

"64-bit external databus doubles the amount of information possible to read or write on each memory access and therefore allows the Pentium to load its code cache faster than the 80486; it also allows faster access and storage of 64-bit and 80-bit x87 FPU data."


The 68060 only has a 32 bit data bus while the Pentium and PPC 603(e) CPUs had a 64 bit data bus. Not a fair comparison. The 68060 was doing more with less.

Quote:

Consider another thing: IMM16 covers only a small range of integer data. I haven't put all IMM statistics, but the vast majority is already covered by immediates < 16-bits, so not covered by IMM16.

Whereas IMM32 collects all integer values which are < -32768 or > 32767.


I "considered" you did not show immediates smaller than 16 bits. Unfortunately, immediates which are sign extended can't be compared to those which are not leaving us guessing. My guess is that many (likely most) of those 32 bit immediates could become 16 bit immediates if the latter was also sign extended. Negative number can be included when sign extending which would increase the percentage.

matthey 
Re: 68k Developement
Posted on 16-Oct-2018 18:25:56
#445
Elite Member
Joined: 14-Mar-2007
Posts: 2014
From: Kansas

Quote:

Hypex wrote:
The problem with prefixes on the 68K is that they must have 16-bit alignment, since that is the minimum instruction size. It also doesn't look like it was designed for such a mechanism, since it has whole instructions for performing certain tasks. It works on x86 because x86 is byte-based and prefixes easily fit into the ISA design. I see they are also used for some neat things like setting up loop counters.


Not only is the minimum alignment 16 bits but the minimum encoding block is 16 bits. A variable length 16 bit encoding is better for performance than the variable length 8 bit x86 encoding but, yes, it does require twice as much prefix code for the most simple and common prefix functionality. It makes sense to then pack as much functionality into a 16 bit prefix as possible which further increases the use of prefixes. Compilers will then use the prefix often because it (minimally) improves performance and they don't have an equation to calculate the performance loss from reduced code density (difficult to calculate and variable). I agree that prefixes are more appealing with x86_64 where they lost code density but had little choice as less than 8 GP registers *is* a performance bottleneck.

Quote:

Hypex wrote:
I wonder when Vampire emulation will be added to UAE?


Toni Wilen has already answered (Vampire support is unlikely).

http://eab.abime.net/showthread.php?t=84264

One of the arguments is that the ISA keeps changing, and this is true. The CPU performance counters I was talking about disappeared after they were implemented and documented, and after Flype wrote CPUMon080 using them.

https://www.youtube.com/watch?v=ees1PExo4PA
https://www.youtube.com/watch?v=jO0_SkHogrI&t=97s

matthey 
Re: 68k Developement
Posted on 16-Oct-2018 20:11:42
#446
Elite Member
Joined: 14-Mar-2007
Posts: 2014
From: Kansas

Quote:

JimIgou wrote:
As a devoted MorphOS user, I'd like to see us focus on Power 9 (and its successors) rather than move to X64 once PPCs quietly die.


POWER makes some sense for a high performance Amiga as it is mostly PPC compatible and supports big endian. However, there would be no affordable low to mid performance Amiga which is desperately needed to increase users.

Quote:

But I don't agree that the 68K efforts are a complete waste, as they will allow us to have full backward compatibility without emulation (JIT or otherwise).

And more powerful 68K cores would enable backporting of OS4 and MorphOS, finally unifying our efforts.


With enough CPU performance, JIT emulation makes a lot of sense. However, emulation is a death sentence for the Amiga as users transition to the host OS.

Quote:

Oh, and btw, even if none of this succeeds, I anticipate being able to run OS3.1-3.9, OS4, and MorphOS all via QEMU on a Raptor Talos system (concurrently) in the near future. So I feel I'm staying true to my roots.

I should be able to run X64 OS' and apps as well in a similar manner as well.

Hypervisor-enabled, bi-endian processors are too cool.


The problem is that most Amiga users don't want to pay for much higher spec hardware to be able to use inefficient JIT emulation and hypervisor software. It is not elegant to lose over half of the CPU performance to run software even though the hardware may be more flexible.

Quote:

Finally, as to the previously mentioned Super H, I thought that was a great processor. And the re-implemented H2 core (BSD licensed as J2) looks promising. As H2+ and H4 patents expire, this line of open cores will expand and all appear to have advantages over the 68K.


SuperH is an easy to use ISA for low end micro-controllers. The cores can be smaller than 68k cores but I see major performance obstacles to scaling the performance up, especially much elevated instruction counts and memory traffic primarily due to limitations of the 16 bit fixed length ISA. It is good to have more open cores but I'm not sure SuperH offers much value over RISC-V. IMO, the 68k has much more performance potential and existing software while the patents have been expired for longer.

Quote:

Our current problem isn't 64-bit capability; it's that our legacy is 31-bit addressing, not 32, limiting us to 2 GB instead of 4 GB.


I was surprised to find that my Mediator was using the upper 2GiB of address space. ThoR's Remus (like OxyPatcher/CyberPatcher) uses the upper 2GiB of address space. It is dangerous to allow old programs to execute at addresses in the upper 2GiB of address space as they may use the wrong branch types but it is not a problem for new tested software. The ram disk could easily use memory above 2GiB if it was available. The 68k Amiga has a MMU and support for alternate address spaces (SFC/DFC registers with MOVES instruction) which would even allow a PAE like extension to use more than 32 bits of addressing if the external address lines existed.
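A tiny C illustration of why that is risky (my own sketch): under signed 32-bit arithmetic, every address at or above 0x80000000 looks negative, so legacy code that compares or branches on addresses signedly misbehaves in the upper 2GiB.

```c
#include <stdint.h>

/* An old-style signed comparison of two addresses: correct while both
   stay below 2GiB, wrong once one operand crosses 0x80000000. */
static int signed_addr_less(uint32_t a, uint32_t b)
{
    return (int32_t)a < (int32_t)b;
}
```

For example, signed_addr_less(0x80000000u, 0x7FFFFFFFu) returns true even though the first address is numerically larger, which is the kind of wrong ordering (and wrong branch direction) untested old programs can hit up there.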

hth313 
Re: 68k Developement
Posted on 17-Oct-2018 4:51:51
#447
Regular Member
Joined: 29-May-2018
Posts: 159
From: Delta, Canada

@matthey

Quote:

matthey wrote:
Not only is the minimum alignment 16 bits but the minimum encoding block is 16 bits. A variable length 16 bit encoding is better for performance than the variable length 8 bit x86 encoding but, yes, it does require twice as much prefix code for the most simple and common prefix functionality. It makes sense to then pack as much functionality into a 16 bit prefix as possible which further increases the use of prefixes. Compilers will then use the prefix often because it (minimally) improves performance and they don't have an equation to calculate the performance loss from reduced code density (difficult to calculate and variable). I agree that prefixes are more appealing with x86_64 where they lost code density but had little choice as less than 8 GP registers *is* a performance bottleneck.


Regarding prefix bytes:

When developing compilers, I have seen a couple of smaller ISAs that use prefix bytes, and they are never really good. It typically ends up being a kludge with somewhat limited benefits, adopted for the purpose of binary compatibility.

Case one: 8080 and Z80. Here the Z80 prefixed the main 16-bit register (HL) to form two index registers, IX and IY. In addition, they thought that adding a displacement byte would be a good thing, and it (mostly) is. The result is that a 1-byte instruction involving HL now takes 3 bytes for IX/IY, and while it is more powerful, the fact that you can only access a single byte at a time using these addressing modes makes the IX/IY route expensive to use.

Case two: the 6800 was extended into the 68HC11, adding a prefix byte to provide a second index register (a 16-bit Y index register in addition to the existing 16-bit X). The extra cost of using Y makes it tricky for a compiler to use well. X is also very busy, which often meant that Y held a variable (using somewhat expensive instructions) and X ended up being used most of the time for all kinds of purposes.
What is interesting here is that Motorola later introduced the 6812 (CPU12), which is assembly-language compatible with the 68HC11 but dropped binary compatibility: Motorola reworked the encoding patterns. If I remember right, they also introduced stack-relative addressing and various displacement sizes. As a result, it is a lot cleaner to use and has noticeably better code density compared to the 68HC11. And you could just re-use your old assembly-language sources without any changes!

I also have a third example, where OKI made a (slightly modified) 8051 use a prefix byte, with somewhat questionable results, but I will not go down that rabbit hole again.

Prefix bytes, in my experience, have the benefit of maintaining binary compatibility with the old ISA. Code density tends to improve less than you probably hoped for compared to the original design. You will get improvements, for obvious reasons, as you gain more possibilities while keeping the old instructions. But as I said, the benefits tend to be far smaller than you may have hoped for. The real downside is that you end up with a kludge that makes the ISA less nice to work with.

However, if done right on an original design that is "strangled", it can open things up for improvement. It sounds like x86 might have been a case like this, but I am not familiar with it. The Z80 was helped a bit, but it really is a pain to write a compiler for. For small ISAs, what you really need are instructions operating on wider operands (at least "int" sized), to kind of "un-strangle" it. On the other hand, adding a couple of registers tends to be not so beneficial.

I am not sure if and how this applies to what you are thinking about, but I thought it might be of some use.

cdimauro 
Re: 68k Developement
Posted on 17-Oct-2018 8:37:46
#448
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:
matthey wrote:
Quote:
cdimauro wrote:
Interesting papers, anyway, especially the third one (16-bit vs. 32-bit instructions), because there's a nice comparison of the DLX ISA with some variants using a combination of only 16 registers and only 2 operands. It shows that using 32 registers and 3 operands (while keeping the same 32-bit opcode format) gives an advantage in both code density and path length (especially the latter). Regarding register usage, there are some cases where the 16-regs + 3-op version gave better code density results than the 32-regs + 3-op one, but overall the latter does better.

This is a good indication of possible benefits coming from a CISC design which implements those features/instructions.

The Apollo team discussed adding 3 op instructions. We decided there was little advantage with MOVE+op instruction fusing for a CPU core which executes 3 op instructions internally like the Apollo core. Assuming MOVE+op could always be fused (not possible), the advantages of 3 op occur when out of registers and when an argument is needed in a specific GP register which are not common. The former situation with 2 op is more of a problem for load/store which must do a register spill while the latter can sometimes be avoided by the compiler swapping argument registers. The 3 op instructions take valuable encoding space, increase the complexity of decoding and take more register ports.

I think the last sentence depends on the specific ISA, because in mine I've found a good "hole" where reg1.Size = reg2 Op reg3/simm8 instructions fit (with Op = the most used/common instructions in the simm8 case, but any binary operation in the reg3 case), and they are very easy to decode.

The last part of the sentence should hold in general, but... why should it take more register ports? Both 2 and 3 op instructions fetch the same 2 source operands, and then store the result to the destination one. The only difference is that in the former the first source coincides with the destination operand, whereas in the latter it can come from another register.
Quote:
The 68k already has lower instruction counts than most load/store architectures and the Apollo core 3 op fusion would lower them more. I expect 3 op would be more beneficial to load/store RISC architectures than CISC.

Adding 3 op and 32 registers is cheap for a 32 bit fixed length encoding.

Also for some variable-length CISCs.
Quote:
These are good ideas, as every advantage is needed to reduce instruction counts and code size after starting with poor code density. Starting with variable length instructions and reg-mem operations is quite different. More registers is generally a performance advantage but with diminishing returns. The overall performance advantage for the 68k and x86_64 of adding more than 16 GP registers is likely less than 2% while reducing code density by much more. Even with load/store RISC, the paper expected an energy-efficiency improvement large enough to outweigh the performance loss and make the 16 bit RISC ISA a success.

The lack of registers and of ternary instructions can cause performance loss too.

When I was at Intel I had the chance to view some very cool internal webinars which explored our micro/architectures (down to the micro-code level, with some examples shown: it was awesome!), and one series was devoted to Atom, which was compared to the corresponding ARM using some benchmarks (EEsomething; sorry, I don't remember the correct name of this embedded test suite now). The overall comparison showed an average advantage for Atom, but on some benchmarks ARM won by a HUGE margin (I don't remember whether performance was almost double that of the Atom, but it was impressive). AFAIR, the explanation of the presenting colleague was that it was primarily due to the presence of ternary operands, and then to the conditionally-executed instructions (so it wasn't Thumb or Thumb-2, but the ARM ISA).

Unfortunately we lack a better, modern study exploring such topics using a large test/application suite.
Quote:
Quote:
cdimauro wrote:
Who knows: maybe it's just been bad luck 'til now. I still believe that something can change in the CPU/ISA panorama, otherwise I would never have invested so much time in my ISA.

I don't know. The Intel ARM duopoly is making them richer and more locked in.

Rivals ARM and Intel make peace to secure Internet of Things
https://www.reuters.com/article/us-arm-intel/rivals-arm-and-intel-make-peace-to-secure-internet-of-things-idUSKCN1MP1K4

"Chipmakers are expected to ship around 100 billion ARM-based IoT devices in the next four to five years, matching the total number of ARM chips shipped in the last 25 years, Mukkamala said.

ARM has predicted that as many as 1 trillion IoT devices will be put to work in the world over the next two decades."

Intel essentially exited the embedded / IoT market, handing ARM the whole cake. But it was strange to see an agreement between Intel and ARM, because of the former's security technology.
Quote:
The Amiga mostly missed the embedded revolution even using the most popular embedded 68k CPU of the time. It has largely ignored the FPGA age and is oblivious of IoT as well. I only talked to a few people in embedded when I was part of the Apollo team and I was already talking about creating products. Amiga companies seem to think there is more money in suing each other over peanuts while ignoring the 500 pound goose laying golden eggs in the room.

Sad but true. This is the right time to propose some alternative to ARM and the upcoming RISC-V: once the embedded/IoT market is full of devices with those ISAs, there will be no space for newcomers...
Quote:
Quote:
cdimauro wrote:
Some embedded CPUs have 32 registers. BA21 is one example, but another notable one is Atmel AVR.

The CAST CPUs allow reducing the number of registers. I expect 16 registers is more common on their low-end CPUs, but then they go pretty low end.

I think so.
Quote:
Atmel's AVR32 has 16 registers: R0-R12, SP, LR and PC. The PC is part of the GP register file like legacy ARM so would not be considered GP (15 GP registers vs 16 for the 68k).

https://en.wikipedia.org/wiki/AVR32

I was talking about AVR, not AVR32: https://en.wikipedia.org/wiki/Atmel_AVR_instruction_set#Processor_registers

It's strange to find an 8-bit microcontroller with 32 registers.
Quote:
Quote:
cdimauro wrote:
According to Wikipedia https://en.wikipedia.org/wiki/Motorola_68060
"The 68060 was introduced at 50 MHz on Motorola's 0.6 µm manufacturing process. A few years later it was shrunk to 0.42 µm and clock speed raised to 66 MHz and 75 MHz."

That can explain why such a low power consumption (~5.5W max) is reported compared to the Pentium and PowerPC (which were using a 0.6µm process).

The M68060UM gives a max of 4.9W for a full 68060@66MHz (see the "12.3 Power Dissipation" chart).

http://cache.freescale.com/files/32bit/doc/ref_manual/MC68060UM.pdf

All copyrights are from 1994 and I don't see any revision/version updates so I assume this is the old 0.6um version of the 68060.

My guess is that they didn't change the year reported in the manual and just updated it for the new processors. The 4.9W for the 68060@66MHz is quite comparable with the 5.5W of the 75MHz part. The document also reports the value for the 50MHz part, which drew just 3.5W: too little for the 0.6µm process at that level of performance.
Quote:
Quote:
cdimauro wrote:
"64-bit external databus doubles the amount of information possible to read or write on each memory access and therefore allows the Pentium to load its code cache faster than the 80486; it also allows faster access and storage of 64-bit and 80-bit x87 FPU data."

The 68060 only has a 32 bit data bus while the Pentium and PPC 603(e) CPUs had a 64 bit data bus. Not a fair comparison. The 68060 was doing more with less.

This is my point from the beginning: there cannot be a (fair) comparison because the Pentium brought too many features compared to the 68060.

I don't think that last sentence is acceptable, for the same reason.
Quote:
Quote:
cdimauro wrote:
Consider another thing: IMM16 covers only a small range of integer data. I haven't put all IMM statistics, but the vast majority is already covered by immediates < 16-bits, so not covered by IMM16.

Whereas IMM32 collects all integer values which are < -32768 or > 32767.

I "considered" that you did not show immediates smaller than 16 bits. Unfortunately, immediates which are sign extended can't be compared to those which are not, leaving us guessing. My guess is that many (likely most) of those 32 bit immediates could become 16 bit immediates if the latter were also sign extended. Negative numbers can be included when sign extending, which would increase the percentage.

All the immediates which I've shown are sign-extended (except the 64-bit data, for obvious reasons). So, no 16-bit immediates end up in the 32-bit statistics.
Quote:
The 68k Amiga has a MMU and support for alternate address spaces (SFC/DFC registers with MOVES instruction) which would even allow a PAE like extension to use more than 32 bits of addressing if the external address lines existed.

Not suitable when using the MOVES instruction: it's too slow for accessing large amounts of data in user space, for example.

Intel solved the problem by using the extra FS and GS segment (selector) registers, which allow accessing kernel memory from user land, and vice versa, by using the proper FS or GS prefix. Not elegant, but quite easy and functional.

 Status: Offline
Profile     Report this post  
cdimauro 
Re: 68k Developement
Posted on 17-Oct-2018 8:52:04
#449 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@hth313 Quote:
hth313 wrote:
@matthey
Quote:
matthey wrote:
Not only is the minimum alignment 16 bits but the minimum encoding block is 16 bits. A variable length 16 bit encoding is better for performance than the variable length 8 bit x86 encoding but, yes, it does require twice as much prefix code for the most simple and common prefix functionality. It makes sense to then pack as much functionality into a 16 bit prefix as possible which further increases the use of prefixes. Compilers will then use the prefix often because it (minimally) improves performance and they don't have an equation to calculate the performance loss from reduced code density (difficult to calculate and variable). I agree that prefixes are more appealing with x86_64 where they lost code density but had little choice as less than 8 GP registers *is* a performance bottleneck.

Regarding this with prefix bytes.

When developing compilers, I have seen a couple of smaller ISAs that use prefix bytes and they are never really good. It typically ends up being a kludge with somewhat limited benefits, for the purpose of binary compatibility.

Case one, 8080 and Z80. Here the Z80 prefixed the main 16-bit register (HL) to form two index registers, IX and IY. In addition, they thought that adding a displacement byte would be a good thing, and it (mostly) is. The result is that a 1-byte instruction involving HL now takes 3 bytes for IX/IY, and while it is more powerful, the fact that you can only access a single byte at a time using these addressing modes makes the IX/IY route expensive to use.

Case two, the 6800 was extended into the 68HC11, adding a prefix byte to give a second index register (a 16-bit Y index register in addition to the existing 16-bit X). The extra cost of using Y makes it tricky for a compiler to use well. X is also very busy, which often meant that Y held a variable (using somewhat expensive instructions) and X ended up being used most of the time for all kinds of purposes.
What is interesting here is that Motorola later introduced the 6812 (CPU12), which is assembly language compatible with the 68HC11 but dropped binary compatibility. Motorola redid the encoding patterns. If I remember right, they also introduced stack-relative addressing and various displacement sizes. As a result, it is a lot cleaner to use and has noticeably better code density compared to the 68HC11. You could just re-use your old assembly language sources without any changes!

Very interesting experience, thanks!

That's what Intel did with the 8086 too, which was binary incompatible with the 8085 but almost fully assembly-language compatible (using a source translator).
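hth313's Z80 size observation above can be sketched with the actual encoding lengths: ADD A,(HL) is a single opcode byte, while the indexed form ADD A,(IX+d) needs a DD/FD prefix byte plus a displacement byte. A trivial Python tally, just to illustrate the 3x cost:

```python
# Encoded length, in bytes, of an 8-bit ALU operation on the Z80:
#   via (HL):          one opcode byte, e.g. ADD A,(HL) = 0x86
#   via (IX+d)/(IY+d): prefix byte (0xDD/0xFD) + opcode + displacement d
def z80_alu_len(indexed: bool) -> int:
    return 3 if indexed else 1

print(z80_alu_len(False), z80_alu_len(True))  # 1 3
```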
Quote:
I also have a third example where OKI made (a slightly modified) 8051 use prefix byte, with somewhat questionable results, but I will not go into that rabbit hole again.

Prefix bytes in my experience have the benefit of maintaining binary compatibility with the old ISA. Code density tends to improve less than you probably hoped for compared to the original design. You will get improvements for obvious reasons, as you get more possibilities while keeping the old instructions. But as I said, the benefits tend to be far smaller than you may have hoped for. The real downside is that you end up with a kludge that makes the ISA less nice to work with.

However, if done right on an original design that is "strangled", it can open things up for improvement. It sounds like x86 might have been a case like this, but I am not familiar with it. The Z80 was helped a bit, but it really is a pain to write a compiler for. For small ISAs, what you really need are instructions operating on wider operands (at least "int" sized), to kind of "un-strangle" them. On the other hand, adding a couple of registers tends to be not so beneficial.

Right. This is the reason why I'm pushing for a complete re-encoding of the ISA. Once an ISA reaches a "critical mass" of bad encodings / design decisions, it's time to rewrite it.

That's what I did with my ISA, which is a complete x86/x64 rewrite (and more); being 100% assembly-language compatible makes it easy to recompile the vast amount of existing applications.

The same can be done with a 68K "successor", recycling the good parts.

BTW, some solutions can be found for applications whose source code is not available. Having an ISA which is source-level compatible makes it easy to translate decoded/disassembled instructions from the old ISA to the new one, usually with a 1:1 mapping. Some analysis can also be done over this first, rough translation to further improve the result (I've applied a small peephole optimizer for my ISA, which considerably improved both code density and instruction count compared to the original x86/x64 executable).
Quote:
I am not sure if and how this applies to what you are thinking about, but I thought it might be of some use.

Absolutely. It was in line with what we already discussed here. Thanks for your contribution: much appreciated.

matthey 
Re: 68k Developement
Posted on 18-Oct-2018 3:56:17
#450 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2014
From: Kansas

@hth313
I tend to agree about prefixes. They are best avoided.

Quote:

cdimauro wrote:
I think the last sentence depends on the specific ISA, because in mine I've found a good "hole" in which to fit reg1.Size = reg2 Op reg3/simm8 instructions (with Op being one of the most used/common instructions for the simm8 case, but any binary operation for the reg3 case), and they are very easy to decode.

The last part of the sentence should be general, but... why should it take more register ports? Both 2 op and 3 op instructions have to fetch the same 2 source operands and then store the result to the destination. The only difference is that in the former the first source coincides with the destination operand, whereas in the latter it can come from another register.


Oops. Both 2 op and 3 op need 2 read ports and 1 write port. I should have said that 3 op requires reading more register fields, which isn't nearly as costly as register ports. The big disadvantage is of course the encoding space used.

Quote:

Also for some variable-length CISCs.


3 op and 32 registers are not as cheap for a variable length encoding, as more instruction formats and differently sized and located register fields are required.

Quote:

The lack of registers and of ternary instructions can cause performance loss too.

When I was at Intel I had the chance to view some very cool internal webinars which explored our micro/architectures (down to the micro-code level, with some examples shown: it was awesome!), and one series was devoted to Atom, which was compared to the corresponding ARM using some benchmarks (EEsomething; sorry, I don't remember the correct name of this embedded test suite now). The overall comparison showed an average advantage for Atom, but on some benchmarks ARM won by a HUGE margin (I don't remember whether performance was almost double that of the Atom, but it was impressive). AFAIR, the explanation of the presenting colleague was that it was primarily due to the presence of ternary operands, and then to the conditionally-executed instructions (so it wasn't Thumb or Thumb-2, but the ARM ISA).

Unfortunately we lack a better, modern study exploring such topics using a large test/application suite.


ARM ISAs have done a good job of reducing the number of branches. This is especially helpful on less than high performance hardware where good branch prediction requires many transistors and uses energy (15% of the energy use of a Cortex A15 core is for branch prediction). The x86/x86_64 CPU designs have done a good job of improving branch prediction even as their pipelines and branch prediction penalties have generally been high.

Quote:

Intel essentially exited the embedded / IoT market, handing ARM the whole cake. But it was strange to see an agreement between Intel and ARM, because of the former's security technology.


Atom CPUs are used in embedded applications where performance is more important than energy efficiency. They just didn't reach the energy efficiency they were targeting which makes them less appealing for deeply embedded uses, IoT and mobile devices (the big embedded markets). I expect they have done ok with tablets and Netbooks.

Quote:

Sad but true. This is the right time to propose some alternative to ARM and the upcoming RISC-V: once the embedded/IoT market is full of devices with those ISAs, there will be no space for newcomers...


True. Switching customers to another product usually requires significant and compelling advantages. Still, it looks like a ripe market.

Quote:

I was talking about AVR, not AVR32: https://en.wikipedia.org/wiki/Atmel_AVR_instruction_set#Processor_registers

It's strange to find an 8-bit microcontroller with 32 registers.


AVR is a weird little ISA. It is strange that they didn't move to 16 16-bit registers instead. It looks like that would have allowed more consistent instruction formats and maybe even contiguous register fields.

Quote:

My guess is that they didn't change the year reported in the manual and just updated it for the new processors. The 4.9W for the 68060@66MHz is quite comparable with the 5.5W of the 75MHz part. The document also reports the value for the 50MHz part, which drew just 3.5W: too little for the 0.6µm process at that level of performance.


The 5.5W at 75MHz was based on my estimates/interpolation.

Motorola was poor at updating/editing documentation but pretty good at marking documents as revised with a new copyright somewhere. See the MC68040UM and MC68030UM for example. The MC68060UM was last copyrighted in 1994 and I see nothing about revisions. The rev 6 68060 was probably not out until 1995 or 1996. I expect the 4.9W 68060@66MHz figure was for 0.6um, although I can't be sure.

Remember that the 68060 had a 4 byte/cycle instruction fetch and only a 32 bit data bus, half or less of most other similar performance CPUs (possible due to good code density and small average instruction length). It made significant use of power gating, probably because Motorola realized that it could be used for embedded even if it was unsuccessful on the desktop. It supported copyback caches where the Pentium used write-through, saving memory bus accesses. The 68060 used 4-way set-associative L1 caches where the Pentium used 2-way, reducing cache misses and memory accesses. Then there was the x86 decoder tax, which was significant on the Pentium. I expect instruction decode and instruction fetch were the biggest energy consumers, and the 68060 was much more energy efficient than the Pentium in these areas.

Quote:

This is my point from the beginning: there cannot be a (fair) comparison because the Pentium brought too many features compared to the 68060.

I don't think that last sentence is acceptable, for the same reason.


I think it is clear that some combination of the 68060 design and ISA is better than the Pentium design and ISA. Sadly, the 68060's performance was overlooked even by Motorola, which chose to market inferior PPC CPUs for desktops and laptops instead. Compilers largely ignored the 68060; even GCC does not have an instruction scheduler for it today. The only places it was appreciated were embedded use and a few Amiga and Atari users.

P.S. I found in the MC68060UM the following, "Whenever instructions are loaded into the OEP, the instruction buffer attempts to load a 16-bit operation word and 32-bits of extension words into both the pOEP and sOEP." The 68060 can load 6 bytes per cycle into each pipe as long as the instruction buffer has instructions. Not so bad even though an 8 byte/cycle instruction fetch would save a few cycles before the instruction buffer has time to fill.

Quote:

All the immediates which I've shown are sign-extended (except the 64-bit data, for obvious reasons). So, no 16-bit immediates end up in the 32-bit statistics.


Are the 16 bit immediates sign extended to 64 bits like the 32 bit immediates?

Quote:

Not suitable when using the MOVES instruction: it's too slow for accessing large amounts of data in user space, for example.

Intel solved the problem by using the extra FS and GS segment (selector) registers, which allow accessing kernel memory from user land, and vice versa, by using the proper FS or GS prefix. Not elegant, but quite easy and functional.


MOVES is single cycle although it is a supervisor mode instruction, pOEP only (no superscalar pairing) and doesn't take advantage of reg-mem operations. I wouldn't call it slow or fast and it has drawbacks which are not good for a micro-kernel. I believe 64 bit addressing would be preferable.

cdimauro 
Re: 68k Developement
Posted on 19-Oct-2018 7:10:03
#451 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:
matthey wrote:
@hth313
I tend to agree about prefixes. They are best avoided.

He was also underlining how important it was (and is, IMO) to re-encode the ISA while keeping assembly source-level compatibility.
Quote:
Quote:
cdimauro wrote:
I think the last sentence depends on the specific ISA, because in mine I've found a good "hole" in which to fit reg1.Size = reg2 Op reg3/simm8 instructions (with Op being one of the most used/common instructions for the simm8 case, but any binary operation for the reg3 case), and they are very easy to decode.

The last part of the sentence should be general, but... why should it take more register ports? Both 2 op and 3 op instructions have to fetch the same 2 source operands and then store the result to the destination. The only difference is that in the former the first source coincides with the destination operand, whereas in the latter it can come from another register.

Oops. Both 2 op and 3 op need 2 read ports and 1 write port. I should have said that 3 op requires reading more register fields, which isn't nearly as costly as register ports. The big disadvantage is of course the encoding space used.
Quote:
Also for some variable-length CISCs.

3 op and 32 registers are not as cheap for a variable length encoding, as more instruction formats and differently sized and located register fields are required.

As I said, it depends on the ISA: I've found a good encoding space (fixed 32-bit opcodes) in which to put such ternary instructions, and they are quite simple to decode.

I also have encodings for other ternary instructions (for the same binary "base" ones) which allow referencing memory as the second source (reg1.Size = reg2 Op EA) and applying some "extensors/modifiers" (e.g. sign/zero extension for EA, flags inverter, etc.).
For a limited (more common) set of instructions my ISA also provides a general ternary operator with any immediate as second source (reg1.Size = reg2 Op Imm8/16/32/64).
Both encodings are of course longer and variable length (still easy to decode), but they provide enough flexibility and contribute to improving code density and/or instruction counts.

What's the "secret"? By re-encoding x86/x64 I was able to utilize the opcode space MUCH better and provide an opcode format which is WAY easier to decode (trivial, compared to both x86/x64 and 68K), while offering significant improvements to the ISA at little effort/implementation cost.

I know that's a route you and (especially) megol don't want to take, but it's an example (I hope a valuable one) of what hth313 stated in his post. You cannot imagine what opportunities this complete re-encoding opened up, because I have only talked about some of the enhancements: there are other features (which exploit the CISC paradigm in a novel way too) that I haven't talked about, and I prefer to keep something for myself currently (I hope you understand).
Quote:
Quote:
The lack of registers and of ternary instructions can cause performance loss too.

When I was at Intel I had the chance to view some very cool internal webinars which explored our micro/architectures (down to the micro-code level, with some examples shown: it was awesome!), and one series was devoted to Atom, which was compared to the corresponding ARM using some benchmarks (EEsomething; sorry, I don't remember the correct name of this embedded test suite now). The overall comparison showed an average advantage for Atom, but on some benchmarks ARM won by a HUGE margin (I don't remember whether performance was almost double that of the Atom, but it was impressive). AFAIR, the explanation of the presenting colleague was that it was primarily due to the presence of ternary operands, and then to the conditionally-executed instructions (so it wasn't Thumb or Thumb-2, but the ARM ISA).

Unfortunately we lack a better, modern study exploring such topics using a large test/application suite.

ARM ISAs have done a good job of reducing the number of branches. This is especially helpful on less than high performance hardware where good branch prediction requires many transistors and uses energy (15% of the energy use of a Cortex A15 core is for branch prediction). The x86/x86_64 CPU designs have done a good job of improving branch prediction even as their pipelines and branch prediction penalties have generally been high.

ARM also decided to go for a more traditional approach with its 64-bit ISA, removing general conditional execution of instructions and allowing only a very small subset of instructions to be conditionally executed.

However, as I explained before, the Intel colleague explicitly mentioned that the huge performance advantage of ARM on some benchmarks was primarily related to the use of ternary operands.

From what you've said 'til now, I feel that you don't like them. You gave some reasons for not being in favor of them, but they might be related to the 68K and not to other ISAs.
Quote:
Quote:
Intel essentially exited the embedded / IoT market, handing ARM the whole cake. But it was strange to see an agreement between Intel and ARM, because of the former's security technology.

Atom CPUs are used in embedded applications where performance is more important than energy efficiency. They just didn't reach the energy efficiency they were targeting which makes them less appealing for deeply embedded uses, IoT and mobile devices (the big embedded markets). I expect they have done ok with tablets and Netbooks.

And that's where Intel is focusing. Yes, it still has products for the IoT and embedded markets, but it has already dropped several products in both segments, and I think this trend will lead to their closure.
Quote:
Quote:
I was talking about AVR, not AVR32: https://en.wikipedia.org/wiki/Atmel_AVR_instruction_set#Processor_registers

It's strange to find an 8-bit microcontroller with 32 registers.

AVR is a weird little ISA. It is strange that they didn't move to 16 16-bit registers instead. It looks like that would have allowed more consistent instruction formats and maybe even contiguous register fields.

Same feeling. But IMO, for a microcontroller which mostly has to deal with digital and 8-bit "analog" I/O, this might be a good choice.
Quote:
Quote:
My guess is that they didn't change the year reported in the manual and just updated it for the new processors. The 4.9W for the 68060@66MHz is quite comparable with the 5.5W of the 75MHz part. The document also reports the value for the 50MHz part, which drew just 3.5W: too little for the 0.6µm process at that level of performance.

The 5.5W at 75MHz was based on my estimates/interpolation.

Motorola was poor at updating/editing documentation but pretty good at marking documents as revised with a new copyright somewhere. See the MC68040UM and MC68030UM for example. The MC68060UM was last copyrighted in 1994 and I see nothing about revisions. The rev 6 68060 was probably not out until 1995 or 1996. I expect the 4.9W 68060@66MHz figure was for 0.6um, although I can't be sure.

Unfortunately there's not much information, and what exists is contradictory. For example, I haven't found any source for the 0.42um process. Looking at the PowerPC processors from Motorola, I've only found references to the 0.5um process.

Another thing is the 3.5W for the 68060@50MHz: it's really too low for a 0.6um process.
Quote:
Remember that the 68060 had a 4 byte/cycle instruction fetch and only a 32 bit data bus, half or less of most other similar performance CPUs (possible due to good code density and small average instruction length). It made significant use of power gating, probably because Motorola realized that it could be used for embedded even if it was unsuccessful on the desktop. It supported copyback caches where the Pentium used write-through, saving memory bus accesses. The 68060 used 4-way set-associative L1 caches where the Pentium used 2-way, reducing cache misses and memory accesses.

Yes, the differences are quite evident.
Quote:
Then there was the x86 decoder tax, which was significant on the Pentium. I expect instruction decode and instruction fetch were the biggest energy consumers, and the 68060 was much more energy efficient than the Pentium in these areas.

That's for sure. As I said before, the Pentium uses 30% of its transistor budget just for the decoder, which is also the most used/stressed unit in a processor (without technologies like the LSD aka L0 cache).
Quote:
Quote:
This is my point from the beginning: there cannot be a (fair) comparison because the Pentium brought too many features compared to the 68060.

I don't think that last sentence is acceptable, for the same reason.

I think it is clear that some combination of the 68060 design and ISA is better than the Pentium design and ISA. Sadly, the 68060's performance was overlooked even by Motorola, which chose to market inferior PPC CPUs for desktops and laptops instead. Compilers largely ignored the 68060; even GCC does not have an instruction scheduler for it today. The only places it was appreciated were embedded use and a few Amiga and Atari users.

No surprise: maintaining and enhancing a compiler is costly. The advantage of a compiler able to better schedule instructions was already evident when the Pentium was introduced, and incredibly, Pentium-optimized code sometimes showed better performance even on 80486 processors.
Quote:
P.S. I found in the MC68060UM the following, "Whenever instructions are loaded into the OEP, the instruction buffer attempts to load a 16-bit operation word and 32-bits of extension words into both the pOEP and sOEP." The 68060 can load 6 bytes per cycle into each pipe as long as the instruction buffer has instructions. Not so bad even though an 8 byte/cycle instruction fetch would save a few cycles before the instruction buffer has time to fill.

Not so bad, but one should check under which real conditions (e.g. code patterns) it can happen. I fully agree that an 8 byte/cycle fetch would have been a much better solution (even keeping the same 32-bit data bus).
Quote:
Quote:
All the immediates which I've shown are sign-extended (except the 64-bit data, for obvious reasons). So, no 16-bit immediates end up in the 32-bit statistics.

Are the 16 bit immediates sign extended to 64 bits like the 32 bit immediates?

Yes. And to be clearer, I don't use the size in bytes of the immediate as encoded in the instruction. I take the real value, and then classify it according to the number of bits required to represent it. So, IMM1 -> the -1..0 range; IMM2 -> the values -2 and 1 (-1 and 0 are already covered by IMM1); etc.
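As a sketch of that classification (my reading of it, in plain Python): a value lands in bucket IMMk when k is the minimum number of bits that can hold it in two's complement.

```python
def imm_bucket(value: int) -> int:
    """Smallest k such that value fits in a k-bit two's-complement field."""
    k = 1
    while not -(1 << (k - 1)) <= value <= (1 << (k - 1)) - 1:
        k += 1
    return k

print(imm_bucket(0), imm_bucket(-1))          # IMM1: 1 1
print(imm_bucket(1), imm_bucket(-2))          # IMM2: 2 2
print(imm_bucket(32767), imm_bucket(-32768))  # IMM16: 16 16
print(imm_bucket(32768))                      # needs 17 bits: 17
```

Classifying by the value's minimal width, rather than by the field width the encoder happened to use, is what keeps the 16-bit and 32-bit statistics disjoint.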
Quote:
Quote:
Not suitable when using the MOVES instruction: it's too slow for accessing large amounts of data in user space, for example.

Intel solved the problem by using the extra FS and GS segment (selector) registers, which allow accessing kernel memory from user land, and vice versa, by using the proper FS or GS prefix. Not elegant, but quite easy and functional.

MOVES is single cycle although it is a supervisor mode instruction, pOEP only (no superscalar pairing) and doesn't take advantage of reg-mem operations.

That's the point.
Quote:
I wouldn't call it slow or fast and it has drawbacks which are not good for a micro-kernel. I believe 64 bit addressing would be preferable.

In what sense? Can you explain it further?

P.S. Sorry, no time to read it again.

matthey 
Re: 68k Developement
Posted on 20-Oct-2018 6:45:58
#452 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2014
From: Kansas

Quote:

cdimauro wrote:
He was also underlining how important it was (and is, IMO) to re-encode the ISA while keeping assembly source-level compatibility.


Re-encoding is an option. I see three 68k ISA enhancement options.

1) compatible changes to the existing 68k ISA
+ simple
+ binary compatibility
- limited changes are possible while retaining compatibility
- limited encoding space while retaining compatibility
2) add a new 64 bit mode
+ encoding space can be recovered
+ binary compatibility retained in 32 bit mode
+ inefficiencies can be removed in 64 bit mode
- more transistors needed for 2 modes
3) re-encoding
+ encoding space can be recovered
+ inefficiencies can be removed
- binary compatibility is lost

Option 1 above is restrictive. The Apollo team had arguments about what is compatible enough. Gunnar probably chose this option because it is simpler and saves transistors which are limited in an FPGA. Megol likes this option with prefixes although I expect he is likely to run into limitations while losing code density. I'm currently exploring option 2 which is reasonable if the instruction formats are largely similar. Binary compatibility is a great advantage with a huge library of existing 68k programs. Re-encoding existing binaries can be problematic. Option 3 provides the most flexibility. I will consider it if I can't create option 2 to my liking. I can see why you went that route for an x86/x86_64 replacement but you had a poor ISA to start with. I believe the 68k ISA is better and serviceable.

Quote:

ARM also decided to go for a more traditional approach with its 64-bit ISA, removing the general conditional execution of instructions, and only allowing for a very small subset of instructions which are conditionally-executed.


AArch64 has moved to simpler conditional instructions. Instead of an if-CC instruction like CMOV, these select between 2 registers based on a condition code. This is more easily pipelined without requiring predication. My ISA documents a SELcc instruction which operates in a similar manner (it can sometimes remove 2 branches with a simple if-then-else). There are quite a few variants of these types of instructions in AArch64.
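As an illustration (a generic branchless sketch, not the actual SELcc or AArch64 CSEL semantics from any manual), such a selection computes `rd = cc ? ra : rb` with pure data flow, which is what makes it pipeline-friendly:

```python
def select(cc: bool, a: int, b: int) -> int:
    """Branchless conditional select over 32-bit values:
    mask is all-ones when cc is true, all-zeros otherwise."""
    mask = -int(cc) & 0xFFFFFFFF            # 0xFFFFFFFF or 0x00000000
    return (a & mask) | (b & ~mask & 0xFFFFFFFF)
```

A simple if-then-else assigning one of two values to a variable can thus be compiled to a compare plus one select, removing both branches.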

Quote:

However, and like what I've explained before, the Intel colleague explicitly mentioned that the huge performance advantage of ARM on some benchmark is primarily related to the usage of ternary operators.


I would think x86/x86_64 instruction fusion would mostly make up for such cases. The x86/x86_64 can fuse MOV+OP, OP+MOV and even MOV+OP+MOV in many cases.

https://dendibakh.github.io/blog/2018/02/04/Micro-ops-fusion

The x86/x86_64 can also MOV to and from memory practically creating a 3 op fusion using memory. ARM can't do that. Maybe you are talking about a low end Atom without these fusions though?

There is another good article from the same author which shows the advantage of dual porting the L1 data cache.

https://dendibakh.github.io/blog/2018/03/21/port-contention

The L1 DCache latency can sometimes be hidden even though modern latencies have increased with larger caches and higher clock speeds (which Megol was concerned about). This setup with instruction fusion turns the x86/x86_64 into a memory munching monster. I believe we could have a more efficient 68k memory munching monster which does not break instructions down as far. The Apollo core is already doing most of the fusions without micro-op decoding but lacks the dual ported L1 DCache. I believe the 68k could compete with ARM where the Atom failed.

Quote:

Another thing is the 3.5W for the 68060@50Mhz: it's really too low for a 0.6um process.


It is 3.9W for the 68060@50MHz with 0.6um process. The "Superscalar Hardware Architecture of the MC68060" by Joe Circello gives 3.9W (which agrees with the MC68060UM). The paper was an early introduction/promotional literature saying, "Sampling now; production ramp 4Q94". There would be no reason to change the wattage without updating the production status in this paper. It is possible for a CISC CPU to use less energy than comparable RISC CPUs because code density improves energy efficiency.

Quote:

No surprise: maintaining and enhancing a compiler is costly. The advantage of a compiler able to better schedule instructions was already evident when the Pentium was introduced, and incredibly, some Pentium-optimized code sometimes showed better performance on 80486 processors.


Instruction scheduling for the 68060 would likely improve 68040 performance as well. The 68040 often has change/use stalls in the same locations even though the number of cycles is often different. The 68040 would benefit from 68040 specific instruction scheduling as a FMOVE can often be done in parallel with another FPU instruction and mixed integer/FPU code can be scheduled for better parallelism.

Quote:

Yes. And to be more clear, I don't use the size / bytes used for the immediate which is used/encoded on instructions. I take the real value, and then classify it according to the number of bits which are required to represent it. So, IMM1 -> -1..0 range. IMM2 -> -2 and 1 values (-1 and 0 already covered by IMM1), etc.


Ok, thanks. That is a good way to evaluate immediates.

Quote:

In which sense? Can you better explain it?


Most micro-kernel drivers are in user space so it would be very inefficient to switch to supervisor mode to access data in another space. True 64 bit addressing is preferable if more than 32 bits of address space is needed.

 Status: Online!
Hypex 
Re: 68k Developement
Posted on 20-Oct-2018 15:43:15
#453 ]
Elite Member
Joined: 6-May-2007
Posts: 11220
From: Greensborough, Australia

@cdimauro

Quote:
It's a general question, which applies to pixel alignment as well. I'll talk specifically and more clearly after.


We've been discussing it for a while. Is it worth going into more detail? I just hope it isn't prolonging the agony.

Quote:
And to give a general answer, yes: they can be used everywhere; so not only for scrolling backgrounds.


They could, but if the pixels were on odd alignments it would be good to avoid it.

Quote:
In general, you need to read 2 bytes, pack + shift + mask ("and") to get the index value. And you need to read 2 bytes, pack + mask + shift + replace ("or"), unpack, and then write back the 2 bytes if you want to replace the index value.


This would be similar to encoding any binary data, such as in an encoder/decoder.

Quote:
Now think about doing the same with bitplanes and tell me how many operations are required by the CPU to do exactly the same operations (read and write pixel).


For one pixel you'd need to separately read/write each bit, isolate it with a mask and shift it into place. A one-pixel job is expensive. It could be optimised when writing a pixel by loading the whole pixel index into a register and then writing it into the planes.

Of course, it becomes obvious that bitplanes aren't suitable for single-pixel operations and, at least on the Amiga, these should be done in blocks of 16 pixels. So a block of 16x4 pixels could be stored with four word writes, the same amount as for packed at two bytes a write, but it can't be done sequentially since it must be split into the planes.
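A rough sketch (Python, illustrative only, with an assumed Amiga-style layout of 16-bit words per plane) of why the single-pixel case is expensive: reading one pixel's colour index from planar data costs one memory access per plane, plus a shift and mask for each, where packed data needs a single read.

```python
def planar_read_pixel(planes, width_words, x, y):
    """Read one pixel's colour index from a list of bitplanes,
    each stored as a flat list of 16-bit words (MSB-first)."""
    word = y * width_words + x // 16        # which word holds the pixel
    bit = 15 - (x % 16)                     # MSB-first within the word
    index = 0
    for p, plane in enumerate(planes):      # one memory access per plane
        index |= ((plane[word] >> bit) & 1) << p
    return index
```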

Quote:
I think that it should be enough, right?


Yes, I think so. Most of those were already known. However, I think in some circumstances bitplanes would be useful. Such as, if you wanted to have four separate graphic layers on screen. Only two-colour layers in this case, but with a colour palette. Static layers, obviously. With packed it would have the opposite effect if layers were desired in the same way, since here different bits would need to be masked and shifted off. A rare example, perhaps.

Quote:
Now think about adding some horizontal overscan or... some scrolling: you need to add a 64-bit "column" for each bitplane's row. Of course, multiplied by ALL bitplanes -> the more bitplanes -> the more waste of both space AND bandwidth.


When I read about these fetch modes years ago I thought it looked cool because it was in 64-bit. 64 was the in thing in those days.

Quote:
"Fortunately" the Blitter wasn't updated to fetch 32 or 64-bit data, so it wastes only a limited amount of space and bandwidth, but it's still enough if you think that ALL graphic, even a small bullet, requires data to be 16-bit aligned.


It was updated to blit at larger sizes, which was a good thing. However, it needed operations specific to bitplanes, since they continued using them. For example, the single 16-bit texture was of limited use. It would have been good if a mode similar to line mode had been put in, which could do angles by blitting bitmaps at angles into the destination. And also warping, where it could take a source rectangle and blit it in the shape of the destination rectangle. Even simple scaling would have been useful, even if it would have involved shrinking or expanding bitmap data internally, in 16-bit amounts I suppose. These would have helped advance the games and would have allowed real 3D acceleration, rather than the lame chunky-to-planar games we got using no hardware features at all.

Quote:
In this case you can use exactly the same technique (and register).


How would that work? I mean, if a screen resolution was set at 320x256, how could the screen be scrolled at a 1280x256 granularity? If a frame pointer was just changed.

Quote:
Then, how you can do packed graphic with bitplanes (no: using the Copper is not allowed here)?


I meant I can't believe they let the last Atari be superior to AGA. Meaning it had chunky modes and at least hi-colour modes. But apart from that, I wouldn't know how you can do packed graphics with bitplanes. Perhaps a rocket scientist can work it out.

Quote:
That was the original plan, with weird ideas to have even 16 bitplanes. Gosh.


That might be because AAA or Hombre had those kinds of ideas. I can see the sense in doing it up to 8 planes, since the original Amiga hardware had 8 slots for the planes. But beyond 8 bits it makes no sense. Up to 8-bit CLUT you have a real bitmap. Beyond that you go into direct RGB values, at least with depths like 12 and 15/16 bits. And bit-splitting that up makes as much sense as splitting a hi-colour word into little endian, which is already done, and looks totally impractical.

Quote:
Fortunately he implemented and pushed a lot of RTG graphics with the Vampire, albeit he has introduced a 9th bitplane... Bah.


Oh no. That's too odd. I did read of some UHRes pointer in AGA, better stick to that.

Quote:
If you want another example, JPEG 2000 used bitplanes on the last stage of its graphic pipeline (when the data should be coded using predictions & arithmetic coding). I know it, because I've written a JPEG 2000 decoder for my stage + thesis, to be implemented in hardware by STMicroelectronics.


How interesting.

Quote:
P.S. Sorry, no time to re-read: I'm feeling too bad now. -_-


Well some things are best left in the past. They served their purpose. That's how it goes.

Last edited by Hypex on 20-Oct-2018 at 03:49 PM.

 Status: Offline
cdimauro 
Re: 68k Developement
Posted on 20-Oct-2018 19:54:25
#454 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:
matthey wrote:
Quote:
cdimauro wrote:
He was also underlining how important was (and is, IMO) to re-encode the ISA, while keeping assembly source-level compatibility.

Re-encoding is an option. I see three 68k ISA enhancement options.

1) compatible changes to the existing 68k ISA
+ simple
+ binary compatibility
- limited changes are possible while retaining compatibility
- limited encoding space while retaining compatibility
2) add a new 64 bit mode
+ encoding space can be recovered
+ binary compatibility retained in 32 bit mode
+ inefficiencies can be removed in 64 bit mode
- more transistors needed for 2 modes
3) re-encoding
+ encoding space can be recovered
+ inefficiencies can be removed
- binary compatibility is lost

Option 1 above is restrictive. The Apollo team had arguments about what is compatible enough. Gunnar probably chose this option because it is simpler and saves transistors which are limited in an FPGA. Megol likes this option with prefixes although I expect he is likely to run into limitations while losing code density. I'm currently exploring option 2 which is reasonable if the instruction formats are largely similar. Binary compatibility is a great advantage with a huge library of existing 68k programs. Re-encoding existing binaries can be problematic. Option 3 provides the most flexibility. I will consider it if I can't create option 2 to my liking. I can see why you went that route for an x86/x86_64 replacement but you had a poor ISA to start with. I believe the 68k ISA is better and serviceable.

There's not that much to add, because your analysis is mostly shareable.

I believe that binary compatibility is not so important: assembly source compatibility can be good enough. That's why I preferred option 3 for my x86/x64 re-encoding+enhancement. If you have the sources, you can easily recompile them for the new ISA. If you have no sources, a JIT can be used which is both easy to implement and fast to execute, because instructions mapping is 1:1.
Limiting our focus to Amiga software: games need binary compatibility, but they also don't need a new ISA nor better performance (only some 3D games take advantage of it, but there the bottleneck is the Blitter used for drawing polygons), and (Win)UAE can be THE solution here. Applications are a different thing, and option 2 is favored, but taking advantage of the new ISA requires the approach which I've briefly described.

Finally, an Option 2.5 can be added to the list: re-encoding for new 32 & 64 bit ISA (new execution mode which can select one of them) and 68K compatibility mode only for the old applications. It's useful to "bridge" the applications to the new ISA, while keeping the existing code base.
Quote:
Quote:
However, and like what I've explained before, the Intel colleague explicitly mentioned that the huge performance advantage of ARM on some benchmark is primarily related to the usage of ternary operators.

I would think x86/x86_64 instruction fusion would mostly make up for such cases. The x86/x86_64 can fuse MOV+OP, OP+MOV and even MOV+OP+MOV in many cases.

https://dendibakh.github.io/blog/2018/02/04/Micro-ops-fusion

The x86/x86_64 can also MOV to and from memory practically creating a 3 op fusion using memory. ARM can't do that. Maybe you are talking about a low end Atom without these fusions though?

Could be. Now I don't remember if it was the first Atom version (which was 2-way in-order) or the new one (still 2-way, but OoO).
Quote:
There is another good article from the same author which shows the advantage of dual porting the L1 data cache.

https://dendibakh.github.io/blog/2018/03/21/port-contention

The L1 DCache latency can sometimes be hidden even though modern latencies have increased with larger caches and higher clock speeds (which Megol was concerned about). This setup with instruction fusion turns the x86/x86_64 into a memory munching monster. I believe we could have a more efficient 68k memory munching monster which does not break instruction down as far. The Apollo core is already doing most of the fusions without micro-op decoding but lacks the dual ported L1 DCache. I believe the 68k could compete with ARM where the Atom failed.

Nice links. Consider, however, that Intel CPUs have 256-bit ports (2 for read and 1 for write), and that's also the reason why they are memory munching monsters.
Quote:
Quote:
Another thing is the 3.5W for the 68060@50Mhz: it's really too low for a 0.6um process.

It is 3.9W for the 68060@50MHz with 0.6um process. The "Superscalar Hardware Architecture of the MC68060" by Joe Circello gives 3.9W (which agrees with the MC68060UM). The paper was an early introduction/promotional literature saying, "Sampling now; production ramp 4Q94". There would be no reason to change the wattage without updating the production status in this paper.

OK, but that's really strange: only a 0.4W difference between the 0.6 and the 0.5/0.42 processes?
Quote:
It is possible for a CISC CPU to use less energy than comparable RISC CPUs because code density improves energy efficiency.

CISCs have to pay a "tax" for instruction decoding, unless they use technologies like Intel's LSD or have a simple opcode structure.
Quote:
Quote:
No surprise: maintaining and enhancing a compiler is costly. The advantage of a compiler able to better schedule instructions was already evident when the Pentium was introduced, and incredibly, some Pentium-optimized code sometimes showed better performance on 80486 processors.

Instruction scheduling for the 68060 would likely improve 68040 performance as well. The 68040 often has change/use stalls in the same locations even though the number of cycles is often different. The 68040 would benefit from 68040 specific instruction scheduling as a FMOVE can often be done in parallel with another FPU instruction and mixed integer/FPU code can be scheduled for better parallelism.

That's incredible: it seems that the 68040 is working much better than the 68060 in such cases (the 68060 can issue only one 32-bit instruction per cycle).

 Status: Offline
cdimauro 
Re: 68k Developement
Posted on 20-Oct-2018 20:14:58
#455 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Hypex Quote:
Hypex wrote:
@cdimauro Quote:
It's a general question, which applies to pixel alignment as well. I'll talk specifically and more clearly after.

We've been discussing it for a while. Is it worth going into more detail? I just hope it isn't prolonging the agony.

I've already provided plenty of details with my previous post, so I hope that it's not necessary anymore.
Quote:
Quote:
And to give a general answer, yes: they can be used everywhere; so not only for scrolling backgrounds.

They could, but if the pixels were on odd alignments it would be good to avoid it.

Same for me: I prefer 8, 16 and 32 bits packed pixels. But if we need to optimize space and/or bandwidth usage (like we were discussing), then they should be considered.
Quote:
Quote:
In general, you need to read 2 bytes, pack + shift + mask ("and") to get the index value. And you need to read 2 bytes, pack + mask + shift + replace ("or"), unpack, and then write back the 2 bytes if you want to replace the index value.

This would be similar to encoding any binary data, such as in an encoder/decoder.

More or less. But the problem is mostly related to the 68000, which cannot access unaligned data (hence the read 2 bytes + pack operation, and the unpack + write 2 bytes).

The 68020 and x86/x64 don't have that constraint, so the operation is MUCH easier and faster. For the 68020 it's also a piece of cake, because you can use just ONE bitfield instruction to extract the value or insert the new one. More modern x86/x64 CPUs have similar instructions.
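As a sketch (Python, illustrative only) of the packed case being discussed: with 4-bit chunky pixels, extracting or inserting an index is just a shift and a mask over one byte, roughly what a single 68020 bitfield instruction (BFEXTU/BFINS) does in hardware.

```python
def get_pixel4(buf: bytearray, x: int) -> int:
    """Extract the 4-bit pixel at position x from packed
    (chunky) 4-bit-per-pixel data: one read, shift, mask."""
    byte = buf[x // 2]
    return (byte >> 4) if x % 2 == 0 else (byte & 0x0F)

def set_pixel4(buf: bytearray, x: int, index: int) -> None:
    """Insert a 4-bit pixel: mask out the old field, OR in
    the new one, write the byte back."""
    i = x // 2
    if x % 2 == 0:
        buf[i] = (buf[i] & 0x0F) | ((index & 0x0F) << 4)
    else:
        buf[i] = (buf[i] & 0xF0) | (index & 0x0F)
```

On a 68000, which cannot read a word at an odd address, the same operation needs the extra byte reads and pack/unpack steps described above.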
Quote:
Quote:
Now think about doing the same with bitplanes and tell me how many operations are required by the CPU to do exactly the same operations (read and write pixel).

For one pixel you'd need to separately read/write each bit, isolate it with a mask and shift it into place. A one-pixel job is expensive. It could be optimised when writing a pixel by loading the whole pixel index into a register and then writing it into the planes.

Of course, it becomes obvious that bitplanes aren't suitable for single-pixel operations and, at least on the Amiga, these should be done in blocks of 16 pixels. So a block of 16x4 pixels could be stored with four word writes, the same amount as for packed at two bytes a write, but it can't be done sequentially since it must be split into the planes.

In short: bitplanes are way more inefficient.
Quote:
Quote:
I think that it should be enough, right?

Yes, I think so. Most of those were already known. However, I think in some circumstances bitplanes would be useful. Such as, if you wanted to have four separate graphic layers on screen. Only two-colour layers in this case, but with a colour palette. Static layers, obviously. With packed it would have the opposite effect if layers were desired in the same way, since here different bits would need to be masked and shifted off. A rare example, perhaps.

As I said before, the ONLY advantage of bitplanes comes when you have to access a single plane, or a few planes (fewer than the depth), which is quite rare.

In all other cases packed graphics wins hands down...
Quote:
Quote:
Now think about adding some horizontal overscan or... some scrolling: you need to add a 64-bit "column" for each bitplane's row. Of course, multiplied by ALL bitplanes -> the more bitplanes -> the more waste of both space AND bandwidth.

When I read about these fetch modes years ago I thought it looked cool because it was in 64-bit. 64 was the in thing in those days.

Even more than 64. The problem is only for bitplanes: the wider the data bus, the worse the alignment restriction -> wasted space & memory bandwidth.
Quote:
Quote:
"Fortunately" the Blitter wasn't updated to fetch 32 or 64-bit data, so it wastes only a limited amount of space and bandwidth, but it's still enough if you think that ALL graphic, even a small bullet, requires data to be 16-bit aligned.

It was updated to blit at larger sizes, which was a good thing. However, it needed operations specific to bitplanes, since they continued using them. For example, the single 16-bit texture was of limited use. It would have been good if a mode similar to line mode had been put in, which could do angles by blitting bitmaps at angles into the destination. And also warping, where it could take a source rectangle and blit it in the shape of the destination rectangle. Even simple scaling would have been useful, even if it would have involved shrinking or expanding bitmap data internally, in 16-bit amounts I suppose. These would have helped advance the games and would have allowed real 3D acceleration, rather than the lame chunky-to-planar games we got using no hardware features at all.

You're asking for a completely different thing here. The Blitter was good for its specific purpose, and it cannot be changed in the way you stated. A new, dedicated 3D unit would be better.
Quote:
Quote:
In this case you can use exactly the same technique (and register).

How would that work? I mean, if a screen resolution was set at 320x256, how could the screen be scrolled at a 1280x256 granularity? If a frame pointer was just changed.

No, I mean: if you want to achieve 1/2 or 1/4 pixel scrolling, then you can use the same BPLCON1 register for applying it. BPLCON1 is also needed for packed graphic which has a depth < 8.

Believe me: there's absolutely no difference from the Amiga scrolling. There's only an advantage if you have packed graphics with depth 8, 16, 24 or 32, and IF you don't need 1/2 and 1/4 pixel scrolling, because in that case you can implement the hardware scrolling just by updating the packed "plane" pointer (whereas on the Amiga you need to update all bitplane pointers AND set BPLCON1).
Quote:
Quote:
Then, how you can do packed graphic with bitplanes (no: using the Copper is not allowed here)?

I meant I can't believe they let the last Atari be superior to AGA. Meaning it had chunky modes and at least hi-colour modes. But apart from that, I wouldn't know how you can do packed graphics with bitplanes. Perhaps a rocket scientist can work it out.

AFAIR the Atari ST line had bitplanes. And NO Blitter.
Quote:
Quote:
Fortunately he implemented and pushed a lot RTG graphic with Vampire, albeit he has introduced a 9th bitplane... Bah.

Oh no. That's too odd. I did read of some UHRes pointer in AGA, better stick to that.

You don't need it: better to avoid it on AGA re-implementations.

 Status: Offline
matthey 
Re: 68k Developement
Posted on 22-Oct-2018 1:31:17
#456 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2014
From: Kansas

Quote:

cdimauro wrote:
I believe that binary compatibility is not so important: assembly source compatibility can be good enough. That's why I preferred option 3 for my x86/x64 re-encoding+enhancement. If you have the sources, you can easily recompile them for the new ISA. If you have no sources, a JIT can be used which is both easy to implement and fast to execute, because instructions mapping is 1:1.
Limiting our focus to Amiga software: games need binary compatibility, but they also don't need a new ISA nor better performance (only some 3D games take advantage of it, but there the bottleneck is the Blitter used for drawing polygons), and (Win)UAE can be THE solution here. Applications are a different thing, and option 2 is favored, but taking advantage of the new ISA requires the approach which I've briefly described.


One 68k CPU option would be to have custom chipsets in an FPGA so other 68k computers and consoles could be supported which greatly expands the potential target market of a board. UAE is only a solution for Amiga software, especially games. It would be possible to have an emulator for every 68k system but then why not use an ARM CPU with emulation? A real 68k CPU is a more unique product at least.

Quote:

Finally, an Option 2.5 can be added to the list: re-encoding for new 32 & 64 bit ISA (new execution mode which can select one of them) and 68K compatibility mode only for the old applications. It's useful to "bridge" the applications to the new ISA, while keeping the existing code base.


I think it makes more sense to have the 32 bit mode as the compatibility mode with binary compatibility. If re-encoding, a single 64 bit mode might be preferable.

Quote:

Nice links. Consider, however, that Intel CPUs have 256-bit ports (2 for read and 1 for write), and that's also the reason why they are memory munching monsters.


Of course it is necessary to remove bottlenecks which would keep the memory munching monster from being fed.

Quote:

OK, but that's really strange: only a 0.4W difference between the 0.6 and the 0.5/0.42 processes?


I believe the "Superscalar Hardware Architecture of the MC68060" has a typo that says "0.5u" when it should be "0.6u". I believe the MC68060UM data for 50MHz and 66MHz refer to a 0.6um process also. The 60MHz and 66MHz parts are just tested at higher frequency and up marked. I know the 60MHz rated parts existed in the old 0.6um process at least. I have never seen a full 66MHz rated part. Only the last rev 6 parts are 0.42um and there is little information available on them.

Quote:

CISCs have to pay a "tax" for instruction decoding, unless they use technologies like Intel's LSD or have a simple opcode structure.


My point was that the CISC savings from improved code density can more than pay for the increased cost of decoding.

Quote:

That's incredible: it seems that the 68040 is working much better than the 68060 in such cases (the 68060 can issue only one 32-bit instruction per cycle).


The 68040 outperforms the 68060 sometimes. It doesn't have to share memory between multiple pipes and larger instructions don't slow it down. The FPU parallel FMOVE is a nice feature but the 68060 has better instruction timings and FINT/FINTRZ. Making the FPU 3 op internally and using instruction fusion is probably a better strategy to efficiently handle FMOVE (Apollo core?). Usually a finesse superscalar CPU will outperform a brute force scalar CPU.

 Status: Online!
cdimauro 
Re: 68k Developement
Posted on 22-Oct-2018 7:37:47
#457 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:
matthey wrote:
Quote:
cdimauro wrote:
I believe that binary compatibility is not so important: assembly source compatibility can be good enough. That's why I preferred option 3 for my x86/x64 re-encoding+enhancement. If you have the sources, you can easily recompile them for the new ISA. If you have no sources, a JIT can be used which is both easy to implement and fast to execute, because instructions mapping is 1:1.
Limiting our focus to Amiga software: games need binary compatibility, but they also don't need a new ISA nor better performance (only some 3D games take advantage of it, but there the bottleneck is the Blitter used for drawing polygons), and (Win)UAE can be THE solution here. Applications are a different thing, and option 2 is favored, but taking advantage of the new ISA requires the approach which I've briefly described.

One 68k CPU option would be to have custom chipsets in an FPGA so other 68k computers and consoles could be supported which greatly expands the potential target market of a board. UAE is only a solution for Amiga software, especially games. It would be possible to have an emulator for every 68k system but then why not use an ARM CPU with emulation? A real 68k CPU is a more unique product at least.

Sure, that was the goal. I only cited UAE for handling Amiga games, which strictly need 68K compatibility (and chipset compatibility).
Quote:
Quote:
Finally, an Option 2.5 can be added to the list: re-encoding for new 32 & 64 bit ISA (new execution mode which can select one of them) and 68K compatibility mode only for the old applications. It's useful to "bridge" the applications to the new ISA, while keeping the existing code base.

I think it makes more sense to have the 32 bit mode as the compatibility mode with binary compatibility. If re-encoding, a single 64 bit mode might be preferable.

The problem with a 64-bit mode is that pointers and return addresses are 64 bits. A new 32-bit mode allows taking advantage of the new encoding (which is just a "cut-down" version of the 64-bit mode) while not stressing the data cache when your application doesn't need to access more than 4GB of virtual memory.
Quote:
Quote:
OK, but that's really strange: only a 0.4W difference between the 0.6 and the 0.5/0.42 processes?

I believe the "Superscalar Hardware Architecture of the MC68060" has a typo that says "0.5u" when it should be "0.6u". I believe the MC68060UM data for 50MHz and 66MHz refer to a 0.6um process also. The 60MHz and 66MHz parts are just tested at higher frequency and up marked. I know the 60MHz rated parts existed in the old 0.6um process at least. I have never seen a full 66MHz rated part. Only the last rev 6 parts are 0.42um and there is little information available on them.

What a mess. :-/ However, 3.5W for the 50MHz 0.5u version seems like a really low value.
Quote:
Quote:
That's incredible: it seems that the 68040 is working much better than the 68060 in such cases (the 68060 can issue only one 32-bit instruction per cycle).

The 68040 outperforms the 68060 sometimes. It doesn't have to share memory between multiple pipes and larger instructions don't slow it down. The FPU parallel FMOVE is a nice feature but the 68060 has better instruction timings and FINT/FINTRZ. Making the FPU 3 op internally and using instruction fusion is probably a better strategy to efficiently handle FMOVE (Apollo core?). Usually a finesse superscalar CPU will outperform a brute force scalar CPU.

The problem with implementing a 3-op FPU via instruction fusion is that the combined FPU instructions take (at least) 8 bytes (FMOVE + Fop).

 Status: Offline
matthey 
Re: 68k Developement
Posted on 22-Oct-2018 20:26:10
#458 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2014
From: Kansas

Quote:

cdimauro wrote:
The problem with a 64-bit mode is that pointers and return addresses are 64 bits. A new 32-bit mode allows taking advantage of the new encoding (which is just a "cut-down" version of the 64-bit mode) while not stressing the data cache when your application doesn't need to access more than 4GB of virtual memory.


If ISA compatibility with 32 bit 68k was good enough, a 32 bit ABI should suffice. It is easy to forget that AMD64/x86_64 is practically an ISA and ABI combination. The x32 ABI allows 32 bit code using the x86_64 ISA without x86_64 ABI. There were too many ISA changes to allow x32 to keep IA-32/x86 compatibility while operating in 64 bit x86_64 mode though.

Quote:

The problem with implementing the FPU with 3 op by instructions fusing is that the combined FPU instructions take (at least) 8 bytes (FMOVE + Fop).


True. Most of the FPU instructions are 32 bits in length so providing 3 op FPU instructions results in more compact code. I have the FPU encoding map I created for Gunnar using 16 registers and 3 op (all FPU registers only). I thought it was good but Gunnar wanted more registers and a source mem op (FPU reg-mem stores have limitations). I don't see the need for more than 16 fp registers if the SIMD unit supports floating point. Having done extensive work on the 68k FPU support code in vbcc, 8 registers was rarely a limitation. It would be nice to have 8 more scratch registers and enough registers to interleave instructions if there were 2 FPU pipes but 16 is adequate.

 Status: Online!
Profile     Report this post  
cdimauro 
Re: 68k Developement
Posted on 23-Oct-2018 6:23:40
#459 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:
matthey wrote:
Quote:

cdimauro wrote:
The problem with 64-bit mode is that pointers and return addresses are 64-bit. A new 32-bit mode allows to take benefit of the new encoding (which is just a "cut-down" version of the 64-bit mode) while not stressing the data cache when your application doesn't need to access more than 4GB of virtual memory.

If ISA compatibility with 32 bit 68k was good enough, a 32 bit ABI should suffice. It is easy to forget that AMD64/x86_64 is practically an ISA and ABI combination. The x32 ABI allows 32 bit code using the x86_64 ISA without x86_64 ABI. There were too many ISA changes to allow x32 to keep IA-32/x86 compatibility while operating in 64 bit x86_64 mode though.

It depends on how you organize your 68K_64 ISA.

x64 by default zero-extends all 32-bit operations, and the LEA can be forced to generate 32-bit addresses, so it's easy to have/handle 32-bit pointers.
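The zero-extension rule can be sketched in a few lines (a simplified register model, not an emulator): on x86-64, any write to a 32-bit destination clears the upper half of the 64-bit register, whereas 8-bit and 16-bit writes merge into it, which is exactly why 32-bit pointers are cheap to handle in 64-bit mode:

```python
# Sketch of x86-64's implicit zero-extension: a 32-bit destination write
# (e.g. `mov eax, imm`) clears the high 32 bits of the full register,
# while an 8-bit write (e.g. `mov al, imm`) merges into it.

MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def write32(reg64, value):
    return value & MASK32                       # high half forced to zero

def write8(reg64, value):
    return (reg64 & ~0xFF | value & 0xFF) & MASK64   # high bytes preserved

rax = MASK64                 # register full of ones
rax32 = write32(rax, 0x1234) # -> 0x0000_0000_0000_1234
rax8 = write8(rax, 0x34)     # -> 0xFFFF_FFFF_FFFF_FF34
```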
Quote:
Quote:
The problem with implementing the FPU with 3 op by instructions fusing is that the combined FPU instructions take (at least) 8 bytes (FMOVE + Fop).

True. Most of the FPU instructions are 32 bits in length so providing 3 op FPU instructions results in more compact code. I have the FPU encoding map I created for Gunnar using 16 registers and 3 op (all FPU registers only). I thought it was good but Gunnar wanted more registers and a source mem op (FPU reg-mem stores have limitations). I don't see the need for more than 16 fp registers if the SIMD unit supports floating point. Having done extensive work on the 68k FPU support code in vbcc, 8 registers was rarely a limitation. It would be nice to have 8 more scratch registers and enough registers to interleave instructions if there were 2 FPU pipes but 16 is adequate.

As I said some time ago, I don't know how worthwhile investing in a traditional FPU is nowadays.

SIMD units support scalar operations too (because they are needed even when using packed data), so you basically get scalar floating point instructions "for free". For some embedded markets you can simply disable or not implement the packed versions.
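The scalar-for-free point can be sketched as follows (plain Python lists standing in for vector registers): a scalar operation is just the packed operation restricted to lane 0, with the remaining lanes passed through from the first operand, much as SSE pairs ADDSS with ADDPS:

```python
# Sketch: scalar FP ops as a degenerate case of packed SIMD ops.
# A "register" is a 4-lane list of floats.

def padd(a, b):
    """Packed add: all lanes."""
    return [x + y for x, y in zip(a, b)]

def sadd(a, b):
    """Scalar add: lane 0 computed, upper lanes pass through from a."""
    return [a[0] + b[0]] + a[1:]

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
```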

The problem with SIMD units is that they take a lot of encoding space. If you still want to use the EA as the second source operand (basically following "the CISC way"), then it's likely that you need to go for 6-byte opcodes, or drastically drop some features.

I'm for completely reusing line-A for SIMD packed and line-F for SIMD scalar operations in the new ISA. With 16 registers, no masks, and the vector length selectable at runtime, it might fit in 32-bit opcodes.
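A rough bit budget (hypothetical field sizes, just to show the encoding fits) suggests a fixed 32-bit opcode can carry a 3-op SIMD instruction with 16 registers plus an optional EA source, with a couple of bits to spare:

```python
# Hypothetical field budget for a 3-op SIMD instruction in a 32-bit opcode,
# assuming 16 vector registers (4 bits per operand) and no mask fields.
fields = {
    "line":    4,   # line-A / line-F selector
    "opcode":  8,   # operation + element size
    "dst":     4,
    "src1":    4,
    "src2":    4,
    "ea_mode": 3,   # classic 68k EA mode field for a memory source
    "ea_reg":  3,
}
used = sum(fields.values())   # 30 bits
spare = 32 - used             # 2 bits left over
```

With the vector length selected at runtime rather than encoded, the length field disappears from the opcode, which is what makes the budget close.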

 Status: Offline
Profile     Report this post  
OneTimer1 
Re: 68k Developement
Posted on 23-Oct-2018 20:03:50
#460 ]
Cult Member
Joined: 3-Aug-2015
Posts: 983
From: Unknown

This babbling about non-existing possibilities is sick; it's like geeks masturbating over retro technology. The thread started with someone asking: "Why has Motorola abandoned this CPU?"
The answers were given in the first few postings; the topic has now drifted to fantasy architectures that no one could use in an existing 68k system.

If the Apollo team wants to be taken seriously outside the Amiga retro market, they must make this CPU Linux compatible; they will need an MMU and a reliable, flawless FPU. SPI and I2C interfaces would be needed too.

But the FPGA is much too expensive to compete with an ARM, though maybe not for customers who need 68k compatibility. And customers needing a 64-bit CPU could easily switch to ARM; they don't need 68k compatibility.

Last edited by OneTimer1 on 23-Oct-2018 at 08:14 PM.
Last edited by OneTimer1 on 23-Oct-2018 at 08:07 PM.

 Status: Offline
Profile     Report this post  
Goto page ( Previous Page 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 Next Page )

[ home ][ about us ][ privacy ] [ forums ][ classifieds ] [ links ][ news archive ] [ link to us ][ user account ]
Copyright (C) 2000 - 2019 Amigaworld.net.
Amigaworld.net was originally founded by David Doyle