matthey 
Re: 68k Developement
Posted on 10-Sep-2018 18:17:41
[ #181 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2015
From: Kansas

Quote:

WolfToTheMoon wrote:
Quote:
68060 2,500,000 transistors
ARM Cortex-A9 (32 bit) 26,000,000 transistors
Atom (32 bit) 47,000,000 transistors


Yes, but I think that the biggest difference in transistor count between these CPUs is the far, far larger cache memory on the newer CPUs.


Most caches are SRAM, which uses 6 transistors per bit. Some slower caches may be DRAM, which uses only 1 transistor per bit but needs a capacitor, uses more energy and generates more heat, reducing the density (transistors/area).
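To put rough numbers on the cell counts, here is a back-of-the-envelope sketch in C (data array only; tag, valid and LRU bits are ignored, so real totals run higher):

#include <stdio.h>

/* Back-of-the-envelope cache transistor counts using the textbook
   figures: 6 transistors per SRAM bit, 1 transistor (plus a capacitor)
   per DRAM bit. Data array only; tags and control bits are ignored. */
static long sram_transistors(long bytes) { return bytes * 8L * 6L; } /* 6T cell  */
static long dram_transistors(long bytes) { return bytes * 8L * 1L; } /* 1T1C cell */

int main(void)
{
    long icache = 8L * 1024L; /* 8 KiB, the 68060/603 L1 ICache size */
    printf("8 KiB as 6T SRAM: %ld transistors\n", sram_transistors(icache)); /* 393216 */
    printf("8 KiB as 1T DRAM: %ld transistors\n", dram_transistors(icache)); /* 65536  */
    return 0;
}

The 393,216 figure for 8 KiB of 6T SRAM is the same one used in the ICache comparison further down.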

Quote:

The first Atom was basically an updated Pentium which had a comparable transistor count to the 060 - which is a good indication of the hypothetical "modern" 060 revival and how it would perform (minus the non-pipelined FPU on the 060)


The 47,000,000 transistor figure is for a Bonnell architecture Silverthorne, which is a 1st generation Atom (the 2nd generation is 140,000,000 transistors, but with memory controller and GPU).

https://en.wikichip.org/wiki/intel/microarchitectures/bonnell

Unlike the Pentium's, Atom's transistors were cheap and were used en masse to target energy efficiency and performance.

Quote:

Each cycle two instructions are dispatched in-order. The scheduler can take a pair of instructions from a single thread or across threads. Bonnell's in-order back-end resembles a traditional early 90s design featuring a dual ALU, a dual FPU and a dual AGU. Similarly to the front-end, in order to accommodate simultaneous multithreading, the Bonnell design team chose to duplicate both the floating-point and integer register files. The duplication of the register files allows Bonnell to perform context switching on each stage by maintaining duplicate states for each thread. The decision to duplicate this logic directly results in more transistors and a larger area of the silicon. Overall, implementing SMT still required less power and less die area than the other heavyweight alternatives (i.e., out-of-order and larger superscalar). Nonetheless the total register file area accounts for 50% of the entire core's die area, which was single-handedly an important contributor to the overall chip power consumption.


Wow! The register file took 50% of the core's die area? Register files are expensive, but this may be wrong (it would not be the first time a wiki was wrong).

Quote:

Bonnell supports Intel's Hyper-Threading, their marketing term for their own implementation of simultaneous multithreading. The notion of implementing simultaneous multithreading on such a low-power architecture might seem unusual at first. In fact, it's one of only a handful of ultra-low power architectures to support such a feature. Intel justified this design choice by demonstrating that performance enjoys an uplift of anywhere from 30% to 50% while worsening power consumption by up to 20% (with an average of 30% performance increase for 15% more power). The toll on the die area was a mere 8%.


It seems the die area taken for multi-threading is only 8%, yet according to the 1st quote 25% of the die area would go to the 2nd thread's register file alone (half of a register file occupying 50% of the core). This quote seems more reasonable, as Intel claimed the Pentium 4's hyper-threading used 5% more die area and gave 15-30% better performance. That is overall throughput; single core performance likely drops while multi-threading is active because of shared resources. ARM was no fan of multi-threading.

Quote:

Perhaps surprisingly, perhaps not, ARM (one of whose major selling points is the small die size of their solutions), state that it is considerably better to double your silicon area and stick two cores on, than it is to go for a more complex single core with SMT support, their reasoning being that a well-designed multi-core system, while bigger, will actually use less power. They claim up to 46% savings in energy over an SMT solution with four threads.

Also, moving an application to two threads on a single SMT-enabled core will increase cache-thrashing by 42%, whereas it will decrease by 37% when moving to two cores.


https://www.theinquirer.net/inquirer/news/1037948/arm-fan-hyperthreading

ARM cores are usually smaller, with less area and fewer transistors, so it is cheaper for them to add more cores. Cache thrashing can be a real problem, especially with unrelated threads (likely with the AmigaOS). Another problem with multi-threading is that it requires thread management so that 2 high performance threads are not stuck sharing the same core when they would be better off on different cores. These factors can cause inconsistent performance and jitter. Multi-threading can decrease single core performance for non-parallel processing, and more cores are better for parallel processing. Designers of CPUs with lots of area and transistors per core tend to like multi-threading. ARM has preferred various big.LITTLE multi-core configurations which pair a weak, energy efficient CPU (usually in-order) with a high performance CPU (usually OoO) to provide more energy savings than a high performance CPU with clock scaling. The problem is that their energy efficient in-order cores are too weak and take too long to get work done, while beefing up the weak cores with OoO takes too much area and too many transistors (this is one of the reasons they moved to AArch64).

Quote:

Performance and Energy
- Higher performance allows doing more in less time, and sleeping for a longer time
  - Requires less energy to complete a given task
- Higher performance allows lower clock rates
  - Reduces clock tree and CPU power when active
  - Enables use of HVT cells and allows smaller implementation, both decreasing power leakage

Area and Energy
- With short active periods and long idle periods, idle energy can be equally or more important than active energy
- Leakage currents, and therefore idle energy, are more or less proportional to area


https://www.slideshare.net/castcores/cpu-subsystem-total-power-consumption-understanding-the-factors-and-selecting-the-best-ip

Once again it is worth reading the whole slideshow. It is from a company called CAST which makes embedded 32 bit CPUs with a higher code density than the 68k or Thumb2 and ends up with stronger cores than ARM's. They achieve the high code density with a byte based encoding, but the minimum instruction size starts at 16 bits (Intel 8051 roots). Their cores have very good single core performance and energy efficiency (see the comparisons to weak ARM cores in the slideshow).

I hope it is becoming clear now that we want strong single core performance using the fewest transistors and smallest area. This allows the most efficient multi-core scaling. Does architecture matter?

Exploring Microprocessor Architectures for Gigascale Integration
https://www.semanticscholar.org/paper/Exploring-Microprocessor-Architectures-for-Codrescu-Pant/d22d47e3a257bdb364ef92c0b70e5078e04c510e

Some people are going to criticize this old paper because of its age as it predates multi-core and considers multi-processor configurations. The math, science and scaling are much the same though.

Performance = Clock frequency * IPC * Sp

where IPC is of course instructions per cycle and Sp is the parallel speedup offered by multiple nodes (cores).

Highly parallel tasks are better off with more cores, while non-parallel processing needs stronger cores. The area of superscalar multi-issue cores was found to roughly quadruple when the number of ways doubles. The article examines the "efficiency of superscalar nodes" by taking the IPC and dividing it by the issue width.
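As a quick illustration of that metric, a small C sketch (the cores and IPC values below are invented for illustration, not taken from the paper):

#include <stdio.h>

/* Toy numbers only: Performance = f * IPC * Sp and
   node efficiency = IPC / issue width, for two hypothetical cores. */
struct core { const char *name; double f_mhz, ipc; int width; };

int main(void)
{
    struct core cores[] = {
        { "2-wide in-order (hypothetical)", 100.0, 0.9, 2 },
        { "4-wide OoO (hypothetical)",      100.0, 1.4, 4 },
    };
    double sp = 1.0; /* single node, so no parallel speedup */

    for (int i = 0; i < 2; i++) {
        double perf = cores[i].f_mhz * cores[i].ipc * sp; /* MIPS */
        double eff  = cores[i].ipc / cores[i].width;
        printf("%-30s perf=%5.1f MIPS efficiency=%2.0f%%\n",
               cores[i].name, perf, eff * 100.0);
    }
    return 0;
}

The hypothetical 2-wide core comes out at 45% efficiency and the 4-wide at 35%, consistent with the 30-50% range the paper reports.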

Efficiency of Superscalar Nodes Chart
https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/d22d47e3a257bdb364ef92c0b70e5078e04c510e/7-Table2-1.png

The node/core superscalar efficiency is typically between 30-50%, with a geometric mean of just under 40%. These old CPUs are nice to compare because they were transistor limited, which gives us an idea of the architecture efficiency. The best efficiency of the 9 CPUs listed belongs to the (Intel P5) Pentium and the (PA-RISC) PA8000. The P5 Pentium is a dual pipeline in-order design, which the author does not seem to be aware of. Most, if not all, of the other CPUs are OoO, which needs more transistors. The 68060 is not shown, but it has close to the same integer performance and IPC as a P5 Pentium (I would take the 68060 in a coding contest, but then it is much easier to program in addition to likely being faster with integers). The 68060 does it with far fewer transistors (and better energy efficiency), and we are looking for the best single core performance with the fewest transistors.

The PPC 603 did not perform well, but this was one of the worst PPC CPUs ever. PPC has about 40% worse code density than the 68060, yet the same 8kiB L1 ICache was used. For every 25%-30% code density improvement, the ICache can be halved (the miss rate drops in half, see "The RISC-V Compressed Instruction Set Manual"), so the PPC needs at least twice the instruction cache (per core) of the 68k. The PPC designers quickly doubled the cache sizes and the 603e at least had tolerable performance (the Mac Performas with the 603 gave the PPC a bad name though). The bare bones 603 core was small in comparison to the 68060 with only 1.6 million transistors, but we should add at least another 8kiB (393,216 transistors) of ICache, bringing us to about 2 million vs the 2.5 million of the 68060. Adding 8kiB of DCache brings us to about the same core size as the 68060 (2.5 million transistors) and roughly agrees with the 2.6 million transistor count for the 603e on wiki. The PPC 603e still does not match the 68060's single core performance, and spending the transistors on OoO instead of a longer pipeline like the 68060's means it can not be as easily clocked up (the downfall of the PPC efficient shallow pipeline OoO designs).
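The cache sizing argument, as a small C sketch (the halving rule and the 6T cell count are from the text above; everything else is illustrative):

#include <stdio.h>

/* Sizing rule from the text: each 25-30% code density improvement
   halves the ICache needed for a given miss rate, so an ISA with ~40%
   worse density than the 68k needs at least twice the ICache. The 6T
   SRAM cell count covers data bits only; numbers are illustrative. */
int main(void)
{
    long base_bytes  = 8 * 1024;       /* 68k: 8 KiB L1 ICache */
    long ppc_bytes   = base_bytes * 2; /* PPC: at least double  */
    long extra_cells = (ppc_bytes - base_bytes) * 8 * 6;
    printf("68k %ld KiB vs PPC %ld KiB ICache: %ld extra transistors\n",
           base_bytes / 1024, ppc_bytes / 1024, extra_cells); /* 393216 */
    return 0;
}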

The PA-8000 has 3.8 million transistors and is 4 wide. Let's divide the transistor count by 4 to compare with a 2 wide design, giving 0.95 million transistors. Not bad, except all the caches are external and much slower, which surprises me given the performance. PA-RISC was one of the early adopters of SIMD and I wouldn't be surprised if the IPC reflects SIMD instruction usage. I did a comparison of earlier PA-RISC CPUs with the 68060 on EAB.

http://eab.abime.net/showpost.php?p=1142968&postcount=22

My 68060 was still competitive with the PA-RISC with SIMD in RiVA performance. One of the likely reasons is the poor performance of external caches. Why were the caches external? Because PA-RISC is even more cache hungry (worse code density) than the PPC, and larger dies were expensive (Alpha paid up for new and larger dies, and DEC went bankrupt when Intel's CISC once again outperformed their RISCy attempt to out-clock them).

I hope we can see now that CISC has the single core performance advantage. The 68k likely has a single core performance per transistor advantage over x86, which is good for energy efficiency and multi-core scaling. The big disadvantage of CISC CPUs is that they take longer to design, but a good ISA helps and should improve performance and energy efficiency.

OneTimer1 
Re: 68k Developement
Posted on 10-Sep-2018 18:44:49
[ #182 ]
Cult Member
Joined: 3-Aug-2015
Posts: 983
From: Unknown

@wawa

Quote:

wawa wrote:
@OneTimer1

Quote:
not even Vesalia seems to be interested


and you think vesalia only sells to 10k+ audience? whooa!


Don't tell such stupid lies about me, I have not written anything like that.

And you don't know shit about what I think.

Last edited by OneTimer1 on 10-Sep-2018 at 06:49 PM.

OneTimer1 
Re: 68k Developement
Posted on 10-Sep-2018 18:45:59
[ #183 ]
Cult Member
Joined: 3-Aug-2015
Posts: 983
From: Unknown

@OlafS25

Quote:

OlafS25 wrote:

The Vampire are a kind of replacement/add ...


I know what they are.

OneTimer1 
Re: 68k Developement
Posted on 10-Sep-2018 18:47:40
[ #184 ]
Cult Member
Joined: 3-Aug-2015
Posts: 983
From: Unknown

@CosmosUnivers

Quote:

CosmosUnivers wrote:

CyberGraphX : blocked by Phase5


Rubbish, it belongs to Frank Mariak. If you want it, ask him.

Last edited by OneTimer1 on 10-Sep-2018 at 06:49 PM.

JimIgou 
Re: 68k Developement
Posted on 10-Sep-2018 19:22:09
[ #185 ]
Regular Member
Joined: 30-May-2018
Posts: 114
From: Unknown

@OneTimer1

Quote:
...it belongs to Frank Mariak. If you want it, ask him.


As in, owned by Frank, incorporated into MorphOS when he joined with Ralph Schmidt to develop that OS.

Phase 5 is dissolved and owns nothing.
And btw, DCE still exists.

Hasn't anyone been paying attention to the evolution of the Amiga community?

Last edited by JimIgou on 10-Sep-2018 at 07:22 PM.

cdimauro 
Re: 68k Developement
Posted on 10-Sep-2018 21:02:01
[ #186 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@megol

Quote:

megol wrote:
@matthey

Now it's obvious that I'm not going crazy: twice I have tried to reply to your post and twice the whole text has just disappeared when previewing. This is strange and frustrating whether it's due to the browser (Chrome) or the website. :(
So here is a reply without quotes (they have to be previewed to come out right IME) and shorter (third time I write this down).

I suggest you copy the quoted text to an editor, write your comments there, and then copy it back into the forum form to preview the post and finally submit it. That works for me, and has saved a lot of time and headaches...

@matthey

Quote:

matthey wrote:
https://www.slideshare.net/castcores/cpu-subsystem-total-power-consumption-understanding-the-factors-and-selecting-the-best-ip

Once again it is worth reading the whole slideshow. It is from a company called CAST which makes embedded 32 bit CPUs with a higher code density than the 68k or Thumb2 and end up with stronger cores than ARM. They achieve the high code density with a byte based encoding but the minimum size instruction starts at 16 bits (Intel 8051 roots). Their cores have very good single core performance and energy efficiency (see comparisons to weak ARM cores in the slideshow).

It's really impressive. Do you have the architecture manual for this processor? I'd like to take a look at it (and at the opcode structure, of course).
Quote:
I hope it is becoming clear now that we want strong single core performance using the fewest transistors and smallest area. This allows the most efficient multi-core scaling.

If we want strong single core performance (which personally I prefer), yes.
Quote:
Exploring Microprocessor Architectures for Gigascale Integration
https://www.semanticscholar.org/paper/Exploring-Microprocessor-Architectures-for-Codrescu-Pant/d22d47e3a257bdb364ef92c0b70e5078e04c510e

Some people are going to criticize this old paper because of its age as it predates multi-core and considers multi-processor configurations. The math, science and scaling are much the same though.

Don't be harsh. Some studies are limited by their context and period of time; in the meanwhile, new technologies and processor architectures & microarchitectures have been developed, which can bring new and/or different results.

You also cite RISC-V sometimes, which is the latest arrival in the field...
Quote:
I hope we can see now that CISC has the single core performance advantage.


Quote:
The 68060 is not shown but the 68060 has close to the same integer performance and IPC as a P5 Pentium (I would take the 68060 in a coding contest but then it is much easier to program in addition to likely being faster with integers). The 68060 is using much fewer transistors (and energy efficiency) where we are looking for the best single core performance with the fewest transistors.
[...]
The 68k likely has a single core performance per transistor advantage over x86 which is good for energy efficiency and multi-core scaling.

A contest isn't needed to verify it: SPECInt and SPECfp for both 68060 and Pentium should be available. Pavlor?
Quote:
The big disadvantage of CISC CPUs is that they take longer to design but a good ISA can help and should improve performance and energy efficiency.

That's absolutely true. A CISC ISA is more complex and needs time to be designed & fine-tuned. And after that there's the big problem of compiler support... -_-

matthey 
Re: 68k Developement
Posted on 10-Sep-2018 23:59:40
[ #187 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2015
From: Kansas

Quote:

megol wrote:
This is the final attempt to post a reply; I have tried 4 times now but closed down the window, succeeded in erasing the whole thing, updated the browser forgetting about the message, and finally thought I had posted but apparently didn't! Must be going crazy :/


I always copy to the clipboard after typing and before hitting any buttons. I had similar problems with messages disappearing after previewing, quotes not working, and messages lost because I was logged out. Formatting control is also bad. Maybe the forum can be upgraded to whatever amiga.org is using, since that upgrade is going so well.

Quote:

With a simple predecode scheme that tags each word of instruction data with the opcode length (1 or 2 words) plus the length of any immediate or displacement data that follows, finding the position of an address extension word (AEW - brief or full) or of the next instruction is trivial. The problem becomes finding out whether there is an AEW and, if so, how many extension words it needs.

...

IOW I don't see adding a prefix complicating decode significantly. The usefulness of doing that is worthy of discussion however.


I have not done anything in HDL. That would be the last step for me (ISA/ABI design, compiler support, emulation, OS support, CPU core HDL). I have some understanding of what you are talking about but I'm not the person to help with the details of a core design. I do think it is worth considering a prefix for the 68k but it is probably a better fit for the x86. If you want to add more registers, then it would be good to compare to a complete re-encoding for 64 bit but this would make supporting the existing 32 bit for compatibility more expensive (advantage prefix). I don't see any great need to add more registers as 32 bit performs quite well with 16, code density is better with 16, context switches are faster with 16, cores are smaller with 16 and energy efficiency is better with 16. For a general purpose CISC CPU, 16 registers is not only adequate but probably optimal.

Quote:

megol wrote:
While the 1995 paper by Liedtke is a classic it is also outdated.

Later revisions of the L3/L4 family moved towards a portable design without losing focus on performance. See the paper below:
From L3 to seL4 What Have We Learnt in 20 Years of L4 Microkernels?
http://sigops.org/sosp/sosp13/papers/p133-elphinstone.pdf

Especially see section 4.5 on page 145 "Non-portability" including:
"This argument was debunked by Liedtke himself ... Careful design and implementation made it possible to develop an implementation that was 80–90% architecture-agnostic"


Awesome paper! I was not aware of it. This is why it is good to share. I'm still digesting it. Indeed a micro-kernel requires careful design and implementation. I have read L4 documentation before, but this summary across flavors is just what I was interested in.

Quote:

The best way to architect TLBs for the Amiga would be essentially removing them: keep a protection cache that can be slower than a TLB without causing performance problems (assume a memory access is legal and restart from a checkpoint if an access violation occurs) and a memory-level address translator. This is possible as the Amiga OS assumes a single address space.


The TLB paging system continues to be a major source of jitter. Virtually indexed, physically tagged (VIPT) and multilevel TLBs help, but 64 bit memory is more taxing, with often 4 levels of page tables in memory to access on a miss and more likely spatial separation of pointers. Large page sizes actually make the jitter worse even as they improve the TLB hit rate. Virtualization hardware adds another layer of virtual address lookup. The page tables grow with 64 bits, aren't generally cached and can even be paged out to disk (less likely with 64 bit, and it increases memory requirements). Address switching is the most expensive part of a context switch. For all the cost, paging and the whole virtual address concept is still a very useful feature. Doing away with it completely would be tricky for a dynamic OS, but even reducing the number of pages with shared memory regions, perhaps with protection rings, could be helpful.
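To see why a page table walk is such a source of jitter, here is a rough AMAT-style model in C (all latencies and miss rates are invented for illustration, not measurements):

#include <stdio.h>

/* Average and worst case address translation cost: a TLB hit costs ~1
   cycle, a miss walks N page table levels in memory. Illustrative only. */
int main(void)
{
    double hit_cycles  = 1.0;
    double mem_latency = 150.0; /* cycles per memory access during a walk */
    int    levels      = 4;     /* typical 64 bit page table depth */
    double miss_rates[] = { 0.001, 0.01, 0.05 };
    double worst = hit_cycles + levels * mem_latency;

    for (int i = 0; i < 3; i++) {
        double m   = miss_rates[i];
        double avg = (1.0 - m) * hit_cycles + m * worst;
        printf("miss rate %5.3f: average %6.1f cycles, worst case %.0f cycles\n",
               m, avg, worst);
    }
    return 0;
}

Even at a 0.1% miss rate the average stays near 1.6 cycles while the worst case is over 600, and that is before any page is swapped out to disk.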

Quote:

The Fido is using a barrel style processor combined with an axed cache and isn't really comparable to a mainstream processor for several reasons. The deterministic cache is actually a feature of most embedded processors and even a few x86 processors: cache line locking. Read things into the cache and disable updates of that cache line - voila! Actually this have been used with mainstream x86 processors even without explicit support for the early startup when DRAM haven't been initialized, that's more of a hack though.


Fido can lock programs in SRAM which acts like everything is in L1 caches. It's not uncommon for embedded systems with small programs but doesn't work for general purpose CPUs.

Quote:

Also, while latency varies (the variance being the time jitter), it doesn't matter except for specific hard real-time tasks. And even for hard real-time it doesn't matter if the worst case timing is lower than the maximum allowed response time.


Right. The big advantage is that a cheaper and lower clocked CPU and board can be used. Most higher performance general purpose CPUs need to be clocked much faster to reach the same level of responsiveness. This leaves the common mid performance CPUs as less than responsive. If you get sluggishness and stuttering, go buy a faster CPU. That is the problem with general purpose CPUs today. The Amiga was so amazing when it came out because it avoided this on mid performance, affordable hardware.

Quote:

The jitter is also unavoidable, as it isn't usually the TLBs or protection causing it but the mere fact that _caches_ are needed for any reasonably powerful processor that isn't a barrel style one. For an example of a high performance design without caches one can look at the Tera MTA and realize the inherent delay for each and every memory access that had to be covered by explicit multi-threading.


Yes, caches add jitter but are practically required for a general purpose CPU. Cacheless CPUs are often high latency, long pipeline designs unsuitable for general purpose use because of branches. I'm not advocating the elimination of caches but more efficient use of them. I do believe the paging system is a major source of worst case jitter. An MMU page paged out to disk can take thousands of cycles for an address translation. Of course this makes virtual caches wait for the translation. With hardware virtualization there can be another layer of address translation, which can also miss. There can be multiple layers of high latency, dependent and sequential operations. Intel likes this, as it is always making the highest performance CPUs to take care of any sluggishness. On the other hand, we should have responsive and high enough performance CPUs for most uses for less than the price of a Raspberry Pi.

Quote:


--
Using the oldest memory protection design (base and bounds), like in the Fido, seems like a bad idea. Even going further and developing that into a segmentation design (multiple bases and bounds) is IMO a bad idea as long as it will be applied to the Amiga and the Amiga OS. It would require a complete redesign.

And I say that as someone who has always liked the idea of a proper segmentation design, and the x86 is about as far as one can get from a proper one. It could be a nice hobby project to make a usable segmentation based processor and OS - but the result wouldn't be an Amiga*.

(* or perhaps it would, it just seems to be a label nowadays)


I agree that Fido has embedded limitations. That's what I meant when I said Fido was not dynamic enough for general purpose use. It does show that paging is not necessary for process isolation and protection. How about the single cycle hardware context switches? It is kind of nice that the scheduling, preemptive interrupt and timers are handled in hardware which is quick (avoids 2 supervisor mode context switches?) and more secure. This is a quote from your L4 link.

Quote:

Nevertheless, none of the designers of L4 kernels to date claim that they have developed a "pure" microkernel in the sense of strict adherence to the minimality principle. For example, all of them have a scheduler in the kernel, which implements a particular scheduling policy (usually hard-priority round-robin). To date, no one has come up with a truly general in-kernel scheduler or a workable mechanism which would delegate all scheduling policy to user-level without imposing high overhead.


Could a hardware scheduler allow the first "pure" microkernel? Even if a lower priority context gets corrupted or even falls into an infinite loop, the upper level contexts still operate and get full priority. This is an innovative yet simple little CPU.

megol 
Re: 68k Developement
Posted on 11-Sep-2018 19:38:59
[ #188 ]
Regular Member
Joined: 17-Mar-2008
Posts: 355
From: Unknown

@Matthey
Sigh... Now I can't even get the quote function to work at all. Latest Chrome, no extensions running on this domain, no problem on any other site. Going manual for now until I've dug up the password to test in another browser.

Quote:

I have not done anything in HDL. That would be the last step for me (ISA/ABI design, compiler support, emulation, OS support, CPU core HDL). I have some understanding of what you are talking about but I'm not the person to help with the details of a core design. I do think it is worth considering a prefix for the 68k but it is probably a better fit for the x86. If you want to add more registers, then it would be good to compare to a complete re-encoding for 64 bit but this would make supporting the existing 32 bit for compatibility more expensive (advantage prefix). I don't see any great need to add more registers as 32 bit performs quite well with 16, code density is better with 16, context switches are faster with 16, cores are smaller with 16 and energy efficiency is better with 16. For a general purpose CISC CPU, 16 registers is not only adequate but probably optimal.

Code density optimal perhaps, and not far from 32 registers in a design with register renaming. The problem is of course that the 68k isn't a processor with 16 general registers, far from it IMHO. Going to a 16 data/16 address register split would still not be equivalent to a 32 register machine in this respect; however, the differences would shrink to almost nothing.

Quote:

The TLB paging system continues to be a major source of jitter. Virtually indexed, physically tagged (VIPT) and multilevel TLBs help, but 64 bit memory is more taxing, with often 4 levels of page tables in memory to access on a miss and more likely spatial separation of pointers. Large page sizes actually make the jitter worse even as they improve the TLB hit rate. Virtualization hardware adds another layer of virtual address lookup. The page tables grow with 64 bits, aren't generally cached and can even be paged out to disk (less likely with 64 bit, and it increases memory requirements). Address switching is the most expensive part of a context switch. For all the cost, paging and the whole virtual address concept is still a very useful feature. Doing away with it completely would be tricky for a dynamic OS, but even reducing the number of pages with shared memory regions, perhaps with protection rings, could be helpful.

With a single address space system the virtual memory jitter could be reduced to almost zero:
First, one can use VIVT caches - as a virtual address is the same in the whole system there are no aliasing problems. This means no TLB lookups are needed as long as accesses are to cached data; the (virtual) address wanted is the same as in the cache tag.

Second, one can have a large but slow virtual to physical translation done at the memory interface layer. Here a longer latency doesn't matter much as the DRAM itself is the bottleneck. So instead of having two or more layers of TLB one can have one per memory controller with more entries. E.g. a 16 core system with 1 memory controller can have 8Ki translation entries instead of (numbers from AMD Ryzen) 16 x ((8+64+512)+(64+1536)) = 34944 entries. Assuming 1 MiB L2 caches for each processor plus a shared 16 MiB L3, 8192 entries can map all cached RAM using 4096 byte pages.

Third, one can use alternative translation designs without the problems standard processor designs have with them. For instance a hashed page lookup would make mapping huge amounts of memory easy; for the problems it can cause standard designs, one could search for Linus Torvalds calling it names.

But shorter: the jitter would become that of the caches themselves rather than that of TLB and translation itself.
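The arithmetic checks out; a quick verification in C (entry counts as quoted above; the cache sizes are the assumed ones):

#include <stdio.h>

/* Per-core TLB entries (figures quoted for AMD Ryzen above) vs one big
   translation table at the memory controller, and how much RAM 8Ki
   entries of 4 KiB pages cover. */
int main(void)
{
    int  per_core = (8 + 64 + 512) + (64 + 1536); /* 2184 entries per core */
    long mib      = 1024L * 1024L;

    printf("16 cores x %d = %d TLB entries\n", per_core, 16 * per_core); /* 34944 */

    long mapped = 8L * 1024L * 4096L;    /* 8Ki entries x 4 KiB pages = 32 MiB */
    long cached = 16L * mib + 16L * mib; /* 16 x 1 MiB L2 + 16 MiB shared L3   */
    printf("mapped %ld MiB vs cached %ld MiB\n", mapped / mib, cached / mib);
    return 0;
}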

Quote:

Yes, caches add jitter but are practically required for a general purpose CPU. Cacheless CPUs are often high latency long pipeline designs unsuitable for general purpose use because of branches. I'm not advocating for the elimination of caches but for more efficient use of caches. I do believe the paging system is a major source of worst case jitter. A MMU page paged out to disk can take thousands of cycles for an address translation. Of course this makes virtual caches wait for the translation. With hardware virtualization there can be another layer of address translation which can miss. There can be multiple layers of high latency dependent and sequential operations. Intel likes this as it is always making the highest performance CPU to take care of any sluggishness. On the other hand, we should have responsive and high enough performance CPUs for most uses for less than the price of a Raspberry Pi.

One shouldn't really page VM translation pages out to slow media. :)
I'm old enough that I can remember measuring noticeable slowdowns from enabling the VM mechanism alone (10% or so), but unless doing something stupid like swapping critical pages to a slow HDD it isn't a large problem nowadays. And using virtual memory doesn't mean one has to support swapping at all; modern systems use it to their advantage to reduce code complexity and improve performance (copy on write, lazy zeroing of pages etc). Many of those tricks sadly aren't directly usable with the translation system described above due to the fixed virtual-to-physical mapping; there are hybrids possible though.

Quote:

I agree that Fido has embedded limitations. That's what I meant when I said Fido was not dynamic enough for general purpose use. It does show that paging is not necessary for process isolation and protection. How about the single cycle hardware context switches? It is kind of nice that the scheduling, preemptive interrupt and timers are handled in hardware which is quick (avoids 2 supervisor mode context switches?) and more secure. This is a quote from your L4 link.
(snip quote)
Could a hardware scheduler allow the first "pure" microkernel? Even if a lower priority context gets corrupted or even falls into an infinite loop, the upper level contexts still operate and get full priority. This is an innovative yet simple little CPU.

It is possible to do without a kernel at all, in a sense. A hardware process/thread scheduler in combination with an integrated communication/message system would leave only the managing of threads/communication nodes to the privileged "kernel". For an example of something similar one can look at the classic 80's Transputer.

But I don't know if that would make the resulting system pure. The problem with purity referred to above is AFAIK how to provide primitives so that the upper layers can implement whatever scheduling system they want - and with a hardware scheduler this choice would still be made by the hardware.

matthey 
Re: 68k Developement
Posted on 11-Sep-2018 20:34:56
[ #189 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2015
From: Kansas

@cdimauro
I hope you didn't think I was ignoring you. I was just catching up.

Quote:

cdimauro wrote:
Atom almost doubled the performance going from in-order to out-of-order, while keeping the same limit of max 2 instructions decoded & executed per cycle.


How many transistors did it cost? We know that doubling the superscalar ways roughly quadruples the transistors. We don't know the transistor cost of OoO, but I expect it varies significantly. Atom added many features along with the OoO, so it is difficult to calculate in this case. They could have added more cores instead, which are easier to turn off or clock down but only help with parallel processing. They probably went to OoO at the same time as a die shrink to offset most of the energy efficiency losses. It probably made sense to move to x86_64 for many reasons, including an improved ISA/ABI and better modern software compatibility, but it also needed many more gates for 64 bit, caches and more instructions. In the end, we ended up with a lower power and more modern Pentium M instead of a 32 bit ARM or Cortex-A53 competitor.

Quote:

This only talks about instruction length: prefixes aren't mentioned.


The micro-architecture manual talks about the decoder for most other micro-architectures but not the Atom. Many of the mid performance CPUs can only process so many prefixes per cycle (sometimes only 1 prefix per cycle) and instruction length changing prefixes can give multi-cycle delays (6 cycles for a Core 2 and Nehalem). The Bonnell wiki I linked says the decoder can handle 3 prefixes/cycle but does not mention instruction length changing prefixes.

Quote:

They are not: see below. Average SIMD instruction length is around 5 bytes (at least for the executable that I've disassembled and generated statistics for), which isn't a big problem for a 2-way pipeline, if the instruction prefetch window is 16 bytes (but even if it's only 8 bytes, 3 bytes are enough to correctly decode the instruction).


I read that instruction length (surely with multi-threading) was one of the excuses for up-scaling the Atom, but I couldn't find the article. The early Atoms had limited resources they were trying to share with multi-threading, so it is no surprise. Maybe their real world results were closer to ARM's multi-threading results.

Quote:

I don't think so. Multi-threading is/was introduced because an in-order design leaves a lot of execution units underutilized, so this allows them to be better utilized. Even on out-of-order designs it's very useful, and that's the reason why it's still implemented on high-end processors.


I understand. A core does lots of waiting, so it is natural to want to do some processing during this time. The more transistors invested in the core, the more important it looks. I still prefer multi-core for the reasons I gave in a previous thread. Multi-threading wouldn't be so bad if only the execution units were shared, but then I doubt there would be worthwhile savings over having another core that doesn't need to share.

Quote:

Quote:
The 68060 is 42% more energy efficient and is using 21% fewer transistors compared to the most comparable in-order Pentium that the early Atom designs went back to for energy efficiency. The 68060 has good performance with just a 4 byte/cycle instruction fetch so I expect the 68060 could be successful where the Atom was not

That's an incorrect and unfair comparison, which you've already reported before.


The Pentium is a more aggressive design (more transistors, more bandwidth) with fewer energy saving features (they cost transistors and a little performance), yet the 68060 with finesse was right there. I think they are close enough to make a comparison but yeah, it is not fair to the Pentium.

Quote:

The 68060 used fewer transistors because Motorola traditionally cut features on its processors, and the 68060 has both super and user mode changes (instructions removed, and a simplified MMU). Another important mistake was not providing a fully-pipelined FPU, which basically crippled its FPU performance. And this processor is only able to pair instructions which are both 2 bytes in length, with several pairing limitations. It also introduced no new instructions. And last but not least, the design didn't reach high frequencies.


The 68060 FPU is quite nice and performs well in mixed integer/FPU code as the FPU instructions operate in parallel. The Pentium FPU is a fully pipelined stack based relic which has good theoretical performance but is not easy to use. The Pentium FPU only had a performance advantage with (usually hand) optimized FPU only code. Look up some old Lightwave tests and tell me which performed better per clock. The Pentium was clocked up much faster while Motorola was working on the PPC 603.

I believe the 68060 superscalar instruction pairing allows at least 2x4=8 bytes per cycle.

Quote:

FIFO buffer implemented with 3 read ports
- if the current pOEP instruction is located at entry i of the buffer, then the buffer reads locations (i+1), (i+2), (i+3)
- Allows the {(i+1),(i+2)} or {(i+2),(i+3)} pair to be sent to the OEPs


http://www.hotchips.org/wp-content/uploads/hc_archives/hc06/3_Tue/HC6.S8/HC6.8.3.pdf

I don't know if the pOEP instruction has already been read from (i+0) and I'm not even sure whether the array has 16 bit or 32 bit indexes (a 16 bit opcode produces a decode longword in the IFP). The IFP can only fetch 4 bytes/cycle, but the decoupling allows the FIFO buffer to feed the OEP pipes for a while. This works very well for a variable length instruction encoding, although enough consecutive long instructions can cause a bottleneck. Unfortunately, the 68060 can only forward 32 bit results, yet it can become bottlenecked by instruction immediates (the ColdFire MVS/MVZ instructions and my immediate compression addressing mode would practically eliminate this). Certainly executing more instructions in pairs raises the IPC quickly, but surely these bottlenecks didn't come into play too often. I have heard rumors that Motorola wanted an 8 byte/cycle fetch for the successor to the 68060, so maybe it was significant enough.
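The fetch bottleneck is easy to model: the FIFO absorbs bursts, but sustained IPC is capped at fetch bandwidth divided by average instruction length. A tiny C sketch (the instruction mixes are invented for illustration):

#include <stdio.h>

/* Sustained IPC cap for a 4 byte/cycle fetch feeding two pipes. The
   FIFO smooths bursts but cannot raise the long-run average. */
int main(void)
{
    double fetch_bytes = 4.0;                    /* 68060 IFP fetch per cycle */
    double avg_len[]   = { 2.0, 3.0, 4.0, 6.0 }; /* average bytes/instruction */

    for (int i = 0; i < 4; i++) {
        double cap = fetch_bytes / avg_len[i];   /* fetch-limited IPC */
        double ipc = cap > 2.0 ? 2.0 : cap;      /* dual issue caps IPC at 2 */
        printf("avg %.0f byte instructions: sustained IPC <= %.2f%s\n",
               avg_len[i], ipc, cap >= 2.0 ? " (dual issue sustainable)" : "");
    }
    return 0;
}

Only an all-2-byte mix keeps both pipes busy every cycle, which is why long instructions and immediates show up as bottlenecks.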

The 68060 didn't reach high frequencies because there wasn't enough demand or enough customers to justify die shrinks. Apple had switched to PPC, Commodore was buying 68EC020s and workstations had switched to RISC. Motorola was telling everyone the 68060 was the end of the 68k line, and Apple pulled some nasty tricks to keep 68060 accelerators off the market or they wouldn't have been able to sell the lower performance PPC 601 and 603 Macs. The 68060 CPUs were over $300 U.S., so hardly cheap. They were used for high end embedded applications where they were perfect: high performance at low clock speeds. The high performance embedded market was just not that big back then. The 68060 has an 8 stage pipeline, which should make it easier to clock up than the shallow pipeline PPCs which replaced it (the 603 has only 4 stages as I recall). Die shrinks alone would have raised the 68060 clock speed. Today it may be possible to make a 68060@300MHz for $3 (using old underutilized fabs, which would mean several generations of die shrink for the 68060) and it would sell into the mid performance embedded market.

Quote:

16 SIMD registers are too few. Intel introduced AVX-512 (with the EVEX prefix) to bring the SIMD registers to 32, which is a decent number for a CISC architecture. IBM found 64 SIMD registers to be a good compromise for the new VMX2 (there was a paper about it).


The highest performance CPUs may have 64 SIMD registers, but how many mid performance variants of those CPUs do you see? Register files are expensive. The other option is to make the SIMD unit optional and not available on mid performance CPU variants, but I would rather have a more modest SIMD unit on all variants. AArch64 values mid performance and has 32x128. A CISC SIMD unit saves registers, so maybe we could get by with 16x256? It is the same sized register file, we can do twice as many operations per instruction and we might just be able to keep the base SIMD instructions at 4 bytes with the saved encoding space. I know you want a high end 68k with a 64x512 SIMD unit, but you are going to have to convince Intel the 68k is better first.

Quote:

Why not? More registers allow a better ABI convention, putting more parameters into registers instead of pushing (and popping) them on the stack.


Register files are expensive in both gates and energy use. Code density usually deteriorates with more than 16 registers, which requires bigger caches (more than 8 in the case of x86_64, but that bad ISA needed to reduce memory traffic). Working with a few variables in the cache is really quite cheap with CISC. This was very common for the x86 with 8 registers, yet it didn't cripple the integer performance. Adding 8 more registers to reach 16 with x86_64 gave overall less than a 10% performance boost even as memory traffic was reduced by far greater percentages. Going from 16 to 32 registers for CISC would probably give an overall measurable reduction in memory traffic but may not give a measurable difference in performance. Many registers are certainly not a quick way to performance, as the PPC 603 with 32 registers showed in comparison to the Pentium with just 8.

Quote:

68K is also short in address registers, which is a pain even for assembly coders.


It's not that bad. We need to do more PC relative addressing, especially for 64 bit, which can save a register. Make the library base not require a6 by using PC relative addressing in libraries, and don't require the small data register to be loaded, saving A4. Put the frame on the stack to save the a5 register (the default in vbcc). We need to use what we have more efficiently, and it is not the end of the world to work with a few variables in the cache.

Quote:

I agree with you, and thank you very much for the interesting paper, albeit it's quite outdated and an update with more modern processors/ISAs would have been much appreciated.

What impresses me is both the first (Stack) and last (Mem-Mem) results. However the 68020 got a very nice and balanced result.


It looks to me like the stack architecture example is flawed as the variables are already in the CPU. Perhaps this is because it is difficult to load variables for a stack architecture? All the other examples do proper loading of variables. The 68020 code size could be reduced too.

move.l -(a6),d0    ; d0 = value popped from the a6-based stack
sub.l -(a6),d0     ; d0 -= next popped value
move.l -(a6),d1    ; d1 = next popped value
move.l -(a6),d2    ; d2 = next popped value
muls.l -(a6),d2    ; d2 *= next popped value (signed 32x32 multiply)
sub.l d2,d1        ; d1 -= d2
divs.l d1,d0       ; d0 /= d1 (signed divide)
move.l d0,-(a6)    ; store the result back on the stack

68020 reg-mem
instructions: 8
code size: 160 bits
memory traffic: 402 bits
registers used: 4

MIPS load-store
instructions: 10
code size: 266 bits
memory traffic: 458 bits
registers used: 9

I added the number of registers used. My code uses one more register than the original 68k code. The MIPS RISC code still uses 9 registers to my 4: RISC needs more registers. The 68k is not a pure reg-mem design either, as it has the most common and cheapest mem-mem instructions, although they are not used in the sample code. Using them might reduce the instruction count, code size, memory traffic and registers used even further.

Quote:

Consider that RISC-V will be a strong contender for all current leading architectures.


IMO, RISC-V is most likely to replace MIPS, like AArch64 is replacing PPC. These are more efficient replacement ISAs. Next up should be a 68k_64 replacing x86_64.

Quote:

It's really impressive. Do you have the architecture manual for this processor? I'd like to take a look at it (and at the opcodes structure, of course).


There are multiple embedded CPUs from CAST using the BA2 ISA. The simplest has only one pipeline stage, for deeply embedded uses.

http://www.cast-inc.com/ip-cores/processors32bit/index.html

Unfortunately, I have been unable to find any online documentation on the ISA. You would probably have to contact them. It is a 3 op RISC ISA with 16, 24, 32 and 48 bit instructions and support for up to 32 registers. It looks like they have short encodings for 2 op instructions. The following has some code examples.

https://www.chipestimate.com/Extreme-Code-Density-Energy-Savings-and-Methods/CAST/Technical-Article/2013/04/02

They claim, "We believe the BA22 has the greatest code density in the industry, estimated up to 20% better than the impressive ARM Thumb-2 ISA."

http://www.cast-inc.com/blog/consider-code-density-when-choosing-embedded-processors

I believe the 68020 ISA probably has up to 5% better code density than Thumb2 (hand optimized code, as 68k compiler support has deteriorated). I was seeing up to 5% better code density for the 68k with a few enhancements, using peephole optimizations like the vasm assembler can do. The 68k with enhancements could probably reach 5%-15% better code density, although supporting 64 bit would use up some of the encoding space which could otherwise make a denser 32 bit ISA. A 68k_64 ISA with 25%-30% better code density than x86_64 or AArch64 might earn more respect. I'm not worried about the BA2 ISA as competition. It is nice to see some RISC guys who are smart about code density and energy efficiency. I still think a 16 bit variable length encoding can offer a better combination of code density and performance (better alignment). The 68k also has the advantage of a huge code base.

matthey 
Re: 68k Developement
Posted on 12-Sep-2018 2:18:52
[ #190 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2015
From: Kansas

Quote:

megol wrote:
Code density optimal perhaps, and not far from 32 registers in a design with register renaming. The problem is of course that the 68k isn't a processor with 16 general registers, far from it IMHO. Going to a 16 data/16 address register split would still not be equivalent to a 32 register machine in this respect; however, the differences would shrink to almost nothing.


The hard split between An and Dn registers can be softened with improved orthogonality. Opening up An sources where possible is a good start. There won't ever be a split register file again, so why keep the wall up? The encoding for LEA EA,Dn is open (Gunnar hated the idea). These changes save a temp register and an instruction and improve code density where they can be used. Other individual improvements in register efficiency can be made here and there. I have suggested and documented several.

Compared to ia32, the x32 ABI gains registers (15, plus RIP relative addressing), 64 bit math in 64 bit registers, and register args. All of these gains together give a 7-10% average integer performance boost (RIP relative addressing gave a 10% performance advantage for PIC, which is impressive). I expect most of these gains came from the additional registers. From other studies I've seen, I'd guess reg args give 0-2%, 64 bit math 0-3% and the additional registers 3-7%. Going from 16 to 32 registers would give a much smaller performance boost. Honestly, I would expect to see 0-1% and would be surprised to see even 1-2% overall improvement. The improved PC relative addressing (probably including PC relative writes) I was talking about looks good though.

Performance Data (for x32 ABI)
On Core i7 2600K 3.40GHz:
Improved SPEC CPU 2K/2006 INT geomean by 7-10% over ia32 and 5-8% over Intel64.
Improved SPEC CPU 2K/2006 FP geomean by 5-11% over ia32.
Very little changes in SPEC CPU 2K/2006 FP geomean, comparing against Intel64.
Comparing against ia32 PIC, x32 PIC:
Improved SPEC CPU 2K INT by another 10%.
Improved SPEC CPU 2K FP by another 3%.
Improved SPEC CPU 2006 INT by another 6%
Improved SPEC CPU 2006 FP by another 2%.

http://www.linuxplumbersconf.net/2011/ocw//system/presentations/531/original/x32-LPC-2011-0906.pptx

In light of this data, what do you expect the average integer performance boost from 16 to 32 registers for a CISC CPU to be?

Quote:

With a single address space system the virtual memory jitter could be reduced to almost zero:
First, one can use VIVT caches - as a virtual address is the same in the whole system there are no aliasing problems. This means no TLB lookups are needed as long as accesses are to cached data; the (virtual) address wanted is the same as in the cache tag.


Why use VAs at all if VA=PA? EA=PA but we still have to look up page protection/access bits even if we store them in the caches?

The 68060 uses VIVT TLBs which normally have to be flushed on address space changes but it has a global bit in the ATC (TLB) and page tables which allows these global entries to not be flushed with a PFLUSHAN or PFLUSHN. This is very good for the AmigaOS shared/public memory regions without giving the disadvantages of VIPT TLBs which are common today. The transparent translation registers (TTRs) are good for cheaply mapping large blocks of memory without large page sizes but there are only 4. The 68060 MMU setup is very good for the time period and may actually be better for the AmigaOS than many paging only MMUs today. The MC68060UM is certainly easier to read than any PPC MMU documentation with their acronym hell. Long acronyms are not easy for humans!

Quote:

Second, one can have a large but slow virtual to physical translation done at the memory interface layer. Here a longer latency doesn't matter much as the DRAM itself is the bottleneck. So instead of having two or more layers of TLB one can have one per memory controller with more entries. E.g. a 16 core system with 1 memory controller can have 8Ki translation entries instead of (numbers from AMD Ryzen) 16 x ((8+64+512)+(64+1536)) = 34944 entries. Assuming 1 MiB L2 caches for each processor plus a shared 16 MiB L3, 8192 entries can map all cached RAM using 4096 byte pages.

Third, one can use alternative translation designs without the problems standard processor designs have with them. For instance a hashed page lookup would make mapping huge amounts of memory easy; for the problems it can cause standard designs, one could search for Linus Torvalds calling it names.

But shorter: the jitter would become that of the caches themselves rather than that of TLB and translation itself.


Have there been any implementations or documentation describing such a setup?

Quote:

One shouldn't really page VM translation pages out to slow media. :)
I'm old enough that I can remember measuring noticeable slowdowns from enabling the VM mechanism alone (10% or so), but unless doing something stupid like swapping critical pages to a slow HDD it isn't a large problem nowadays. And using virtual memory doesn't mean one has to support swapping at all; modern systems use it to their advantage to reduce code complexity and improve performance (copy on write, lazy zeroing of pages etc). Many of those tricks sadly aren't directly usable with the translation system described above due to the fixed virtual-to-physical mapping; there are hybrids possible though.


With 64 bit, page swapping is rarely necessary for saving memory, not that we had to worry about that too much with the memory miser AmigaOS. Not only does swapping to media add jitter but also CoW and lazy updating.

Quote:

It is possible to do without a kernel at all, in a sense. A hardware process/thread scheduler in combination with an integrated communication/message system would leave only the managing of threads/communication nodes to the privileged "kernel". For an example of something similar one can look at the classic 80's Transputer.

But I don't know if that would make the resulting system pure. The problem with purity referred to above is AFAIK how to provide primitives so that the upper layers can implement whatever scheduling system they want - and with a hardware scheduler this choice would still be made by the hardware.


A hardware scheduler can be versatile too. All that is needed are hardware registers to define settings like the quantum. You just set it all up with the settings you want and let it run. I don't know if it would be a pure microkernel either but it would minimize the amount of supervisor kernel code and supervisor time which is what a microkernel is all about.

Last edited by matthey on 13-Sep-2018 at 06:54 PM.
Last edited by matthey on 12-Sep-2018 at 02:24 AM.

megol 
Re: 68k Developement
Posted on 15-Sep-2018 16:46:32
[ #191 ]
Regular Member
Joined: 17-Mar-2008
Posts: 355
From: Unknown

@matthey

Quote:

matthey wrote:
Quote:

megol wrote:
Code density optimal perhaps, and not far from 32 registers in a design with register renaming. The problem is of course that the 68k isn't a processor with 16 general registers, far from it IMHO. Going to a 16 data/16 address register split would still not be equivalent to a 32 register machine in this respect; however, the differences would shrink to almost nothing.


The hard split between An and Dn registers can be softened with improved orthogonality. Opening up An sources where possible is a good start. There won't ever be a split register file again, so why keep the wall up? The encoding for LEA EA,Dn is open (Gunnar hated the idea). These changes save a temp register and an instruction and improve code density where they can be used. Other individual improvements in register efficiency can be made here and there. I have suggested and documented several.


I strongly disagree. The result would be removal of an obvious difference between A registers and D registers while full orthogonality is impossible. And IMHO not even worth reaching for.
It would be a hack and an ugly one at that.

Quote:

Compared to ia32, the x32 ABI gains registers (15, plus RIP relative addressing), 64 bit math in 64 bit registers, and register args. All of these gains together give a 7-10% average integer performance boost (RIP relative addressing gave a 10% performance advantage for PIC, which is impressive). I expect most of these gains came from the additional registers. From other studies I've seen, I'd guess reg args give 0-2%, 64 bit math 0-3% and the additional registers 3-7%. Going from 16 to 32 registers would give a much smaller performance boost. Honestly, I would expect to see 0-1% and would be surprised to see even 1-2% overall improvement. The improved PC relative addressing (probably including PC relative writes) I was talking about looks good though.

Performance Data (for x32 ABI)
On Core i7 2600K 3.40GHz:
Improved SPEC CPU 2K/2006 INT geomean by 7-10% over ia32 and 5-8% over Intel64.
Improved SPEC CPU 2K/2006 FP geomean by 5-11% over ia32.
Very little changes in SPEC CPU 2K/2006 FP geomean, comparing against Intel64.
Comparing against ia32 PIC, x32 PIC:
Improved SPEC CPU 2K INT by another 10%.
Improved SPEC CPU 2K FP by another 3%.
Improved SPEC CPU 2006 INT by another 6%
Improved SPEC CPU 2006 FP by another 2%.

http://www.linuxplumbersconf.net/2011/ocw//system/presentations/531/original/x32-LPC-2011-0906.pptx

In light of this data, what do you expect the average integer performance boost from 16 to 32 registers for a CISC CPU to be?


With the CISC I think you mean the availability of memory operands? That isn't really an advantage anymore, at least looking at a reasonably modern core. One can see such operations as a code density advantage; however, the increasing load-use latency dictated by physics makes them a performance problem.

But to answer your question, I honestly have no idea. While there are numbers available, many of those are for RISC processors or AMD64, which can't apply 100% to the 68k.

The 68k has more than 8 registers; however, how many are effectively available depends on the needs of the code. Something that needs 9 data registers (including semi-constant values) will see the processor as limited to 8 registers, as would something that needs 9 address registers. Add to that the two-operand limitation, which increases the need for registers.

A compiler patched to generate code with a register extension plus a cycle exact simulator would be needed to provide a reasonable answer.

Quote:

Quote:

With a single address space system the virtual memory jitter could be reduced to almost zero:
First one can use VIVT caches - as one virtual address is the same in the whole system there are no problems. This means there are no TLB lookups needed as long as accesses are done to cached data, the address wanted (virtual) is the same as in the cache tag.


Why use VAs at all if VA=PA? EA=PA but we still have to look up page protection/access bits even if we store them in the caches?


VA != PA normally. They have a 1-to-1 mapping, so any given physical address has a corresponding virtual address (assuming that memory is mapped) and vice versa, but the addresses aren't (normally) the same.

If every process and even hardware device has the same view of the address space, the translation can be moved to the memory controller, but it still has to be there.
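As a minimal sketch of such a fixed 1-to-1 (but non-identity) mapping in C, with the window base purely an assumption for illustration (a real memory controller would hold such mappings in a table):

#include <stdint.h>

/* Single-address-space style mapping: every physical address has exactly
 * one virtual alias, here simply offset by a constant window base. */
#define VA_BASE 0x4000000000ull  /* assumed start of the mapped window */

static inline uint64_t va_to_pa(uint64_t va) { return va - VA_BASE; }
static inline uint64_t pa_to_va(uint64_t pa) { return pa + VA_BASE; }

The point is only that the relation is fixed and invertible system-wide, so the translation step can sit behind the caches.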

Quote:

The 68060 uses VIVT TLBs which normally have to be flushed on address space changes but it has a global bit in the ATC (TLB) and page tables which allows these global entries to not be flushed with a PFLUSHAN or PFLUSHN. This is very good for the AmigaOS shared/public memory regions without giving the disadvantages of VIPT TLBs which are common today. The transparent translation registers (TTRs) are good for cheaply mapping large blocks of memory without large page sizes but there are only 4. The 68060 MMU setup is very good for the time period and may actually be better for the AmigaOS than many paging only MMUs today. The MC68060UM is certainly easier to read than any PPC MMU documentation with their acronym hell. Long acronyms are not easy for humans!


The PPC MMU is... interesting. The 68060 is more of a standard design.

Quote:

Quote:

Second, one can have large but slow virtual-to-physical translation done at the memory interface layer. Here a longer latency doesn't matter much as the DRAM itself is the bottleneck. So instead of having two or more layers of TLB one can have one per memory controller with more entries. E.g. a 16 core system with 1 memory controller can have 8Ki translation entries instead of (numbers from AMD Ryzen) 16x( (8+64+512)+(64+1536) ) = 34944 entries. Assuming 1MiB of L2 cache per processor plus a shared 16MiB L3, 8192 entries can map all cached RAM using 4096 byte pages.

Third, one can use alternative translation designs without the problems standard processor designs have with them. For instance, using a hashed page lookup would make mapping huge amounts of memory easy; for the problems it can cause in standard designs, one could search for Linus Torvalds calling it names.

In short: the jitter would become that of the caches themselves rather than that of the TLB and translation.


Have there been any implementations or documentation describing such a setup?


Don't know actually. Most operating systems want the (modern) Unix model and this type of system doesn't map cleanly to that.
The Mill processor seems to use something like this, but mixed with other things to make Unix-type systems possible. I haven't been keeping track of that project for a while, however.
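To make the hashed-lookup idea above concrete, here is a minimal C sketch; the table size, hash constant and PTE layout are all illustrative assumptions, not any shipping design:

#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define N_BUCKETS  (1u << 16)

typedef struct pte {
    uint64_t    vpn;   /* virtual page number mapped by this entry */
    uint64_t    pfn;   /* physical frame number */
    struct pte *next;  /* collision chain */
} pte_t;

static pte_t *bucket[N_BUCKETS];

/* Hash the VPN into a bucket, then walk a short collision chain. */
static uint64_t translate(uint64_t va)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    uint32_t h = (uint32_t)((vpn * 0x9E3779B97F4A7C15ull) >> 48); /* 16-bit hash */

    for (pte_t *p = bucket[h]; p != NULL; p = p->next)
        if (p->vpn == vpn)
            return (p->pfn << PAGE_SHIFT) | (va & ((1ull << PAGE_SHIFT) - 1));
    return 0; /* miss: refill/fault path */
}

Since a bucket can map a page anywhere in the address space, covering huge sparse spaces doesn't force multi-level table walks; the cost is the chain walk on collisions, which is what the complaints about standard designs are about.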

Quote:

Quote:

One shouldn't really page VM translation pages to slow media. :)
I'm old enough that I can remember measuring noticeable slowdowns from enabling the VM mechanism alone (10% or so), but unless doing something stupid like swapping critical pages to a slow HDD it isn't a large problem nowadays. And using virtual memory doesn't mean one has to support swapping at all; modern systems use it to their advantage to reduce code complexity and improve performance (copy on write, lazy zeroing of pages, etc). Many of those tricks sadly aren't directly usable with the translation system described above due to the fixed virtual→physical mapping; there are hybrids possible though.


With 64 bit, page swapping is rarely necessary for saving memory, not that we had to worry about that too much with the memory-miser AmigaOS. Swapping to media adds jitter, but so do CoW and lazy updating.


But CoW can bring huge performance improvements too. It's hard to eliminate jitter given caches (hardware and software) and things like speculative execution.
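As a sketch of why CoW pays off, here is the classic write-fault handler in C; PAGE_SIZE and the four helpers are hypothetical placeholders, not any real kernel API:

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

extern unsigned frame_refcount(uint64_t pfn);  /* hypothetical */
extern uint64_t alloc_frame(void);             /* hypothetical */
extern void    *frame_ptr(uint64_t pfn);       /* hypothetical */
extern void     map_page(uint64_t vpn, uint64_t pfn, int writable);

void cow_write_fault(uint64_t vpn, uint64_t shared_pfn)
{
    if (frame_refcount(shared_pfn) == 1) {
        map_page(vpn, shared_pfn, 1); /* last user: just make it writable */
        return;
    }
    /* Still shared: copy once, on first write only, then remap writable. */
    uint64_t new_pfn = alloc_frame();
    memcpy(frame_ptr(new_pfn), frame_ptr(shared_pfn), PAGE_SIZE);
    map_page(vpn, new_pfn, 1);
    /* the shared frame's refcount would be decremented here */
}

Pages that are never written are never copied, which is where the gain comes from - and also where the jitter comes from, since the copy cost lands on whoever writes first.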

Quote:

Quote:

It is possible to do without a kernel at all in a sense. Hardware process/thread scheduler in combination with an integrated communication system/messages would leave only managing threads/communication nodes to the privileged "kernel". For an example of something similar one can look at the classic 80's Transputer.

But I don't know if that would make the resulting system pure. The problem with purity referred to above is AFAIK how to provide primitives so that the upper layers can provide whatever scheduling system they want - and with a hardware scheduler this choice would still be made just by the hardware.


A hardware scheduler can be versatile too. All that is needed are hardware registers to define settings such as the quantum. You set it all up with the settings you want and let it run. I don't know whether it would be a pure microkernel either, but it would minimize the amount of supervisor kernel code and supervisor time, which is what a microkernel is all about.


Yes, it could eliminate almost all kernel overheads in theory. But just providing efficient hardware-supported context swapping and interrupt/exception handling would do that too.
That leaves the communication overhead, which is much harder to eliminate, especially if one wants to reduce jitter to a minimum.

matthey 
Re: 68k Developement
Posted on 15-Sep-2018 21:11:39
#192 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2015
From: Kansas

Quote:

megol wrote:
I strongly disagree. The result would be removal of an obvious difference between A registers and D registers while full orthogonality is impossible. And IMHO not even worth reaching for.
It would be a hack and an ugly one at that.


Let's compare some code with address register sources open. The following is code to determine if an address in an address register is odd (d0 is 0 for even and 1 for odd).

A:
move.l a0,d1
moveq #1,d0
and.l d1,d0
instructions: 3
code size: 6
registers used: 3

B:
move.l a0,d0
and.l #1,d0
instructions: 2
code size: 8 (6 bytes with my immediate compression enhancement)
registers used: 2

C:
moveq #1,d0
and.l a0,d0
instructions: 2
code size: 4
registers used: 2

Code A above is optimal with the best code density for most existing 68k CPUs, but it needs an extra instruction and 2 scratch registers, and is not easy for compilers to generate. Code B is easier for compilers to generate but gives poor code density. Code C is the most compact and is not difficult for compilers. Simpler and shorter looks better to me and *not* ugly.

I documented addressing mode sources being opened in the old 68kF pdf docs (yellow highlighting for new enhancements).

http://eab.abime.net/showthread.php?t=83642

There is barely a noticeable difference, unlike when I tried to open up address register destinations (a few encoding conflicts and different CC behavior). My new ISAs are called 68k_64 and 68k_32, with significant changes to allow for 64-bit support, which I am still working on.

Quote:

With the CISC I think you mean the availability of memory operands? That isn't really an advantage anymore, at least looking at a reasonably modern core. One can see such operations as a code density advantage; however, the increasing load-use latency dictated by physics makes them a performance problem.


I wouldn't be so sure about that. Register renaming (which the 68060 uses) removes false data dependencies and makes instruction scheduling easier. The 68060 uses some good techniques to reduce load-use stall penalties, like early instruction completion in the AG stage (EA calc) and instruction forwarding (the 68060 has pretty good performance even with limited or no instruction scheduling, which is common given the poor 68k compiler support). The only real superscalar scheduling problems are true dependencies (sequential code), load-use stalls and the limit of one read, write or read/write DCache access per cycle. True dependencies often cannot be removed, but load-use stall penalties can be reduced and the DCache can be dual ported. Dual porting the DCache is expensive but it can turn CISC CPUs into memory munching monsters. A whole reg-mem operation can be scheduled along with register-only operations, which is very powerful. I think an in-order superscalar 68k with 3 integer pipes and a dual ported DCache would outperform many 4-wide superscalar RISC CPUs with OoO. A good instruction scheduler would be necessary for maximum performance, but I expect such a design would perform better than any existing 68k CPU even on existing code.

Quote:

But to answer your question, I honestly have no idea. While there are numbers available, many of those are for RISC processors or AMD64, which can't apply 100% to the 68k.

The 68k has more than 8 registers; however, how many are effectively available depends on the needs of the code. Something that needs 9 data registers (including semi-constant values) will see the processor as limited to 8 registers, as would something that needs 9 address registers. Add to that the two-operand limitation, which increases the need for registers.

A compiler patched to generate code with a register extension plus a cycle exact simulator would be needed to provide a reasonable answer.


The 68k should be closer to the CISC AMD64 in register usage than to RISC, although it is more efficient with memory (the 68k is a reg-mem/mem-mem hybrid with more powerful addressing modes where x86 is a reg-mem/accumulator hybrid). One important consideration is the penalty when out of registers.

RISC:
st -(sp),r0
ld r0,(r3)
add r0,r0,r1
st (r3),r0
ld r0,(sp)+
instructions: 5
code size: 20 bytes
registers used: 3
cycles: at least 7 cycles

CISC:
add.l d1,(a3)
instructions: 1
code size: 2 bytes
registers used: 2
cycles: at least 1 cycle

It is RISC that has problems with load-use stalls after loads. Gunnar wrote the following on the Apollo forum recently.

Quote:

In general one can say that "Out of Order" is much less important for 68K than e.g. for PPC.
The reason is easy to understand:

A 68k instruction can do a lot of work - often a 68k can do with one instruction the amount of work for which a PPC needs 3 instructions.

Example:

ADD.L D1,(A0)+

This instruction will load the data from memory to which A0 points, add the value of D1 to it, and save the result back into memory.

The PPC needs 3-4 instructions for this.
Example of how the PPC would do it (using 68K syntax for clarity):

LOAD (a0),R2
ADD D1,R2
STORE R2,(A0)
UPDATE A0+

The one instruction needs 1 CYCLE on the 68080.
The PPC equivalent needs 3-4 instructions which are sequentially dependent - the timing will typically look like this:

LOAD (a0),R2 -- 1 clock
-- 3 clock load usage bubble
ADD D1,R2 -- 1 clock
STORE R2,(A0) -- 1 clock
UPDATE A0+ -- 1 clock

We see that the dependencies require the instructions to be executed one after another, and the DCache access creates a load-use bubble.
This makes it take a total of 7 clocks.
The PPC needs OoO to fill the gaps.

Without OoO the PPC will under-perform badly.
OoO allows the PPC to fill the bubble with some code.

And if you run such an operation in a LOOP then the decoder will put this operation into the execution pipe several times - like this:

LOAD (a0),R2 -- 1 clock
-- 3 clock load usage bubble
ADD D1,R2 -- 1 clock
STORE R2,(A0) -- 1 clock
UPDATE A0+ -- 1 clock

LOAD (a0),R2 -- 1 clock
-- 3 clock load usage bubble
ADD D1,R2 -- 1 clock
STORE R2,(A0) -- 1 clock
UPDATE A0+ -- 1 clock

R2, which is really a TMP variable here, is now used 2 times in the execution pipe. In reality these 2 usages of the temp variable R2 are not dependent. But as the same name R2 is used both times, the CPU can NOT utilize OoO here to re-order and can not speed up the code. Only by renaming them to two different variables, e.g. T2 / T3, can the core use OoO fully and reorder the operations to avoid some of the bubbles.

As you see, OoO and register renaming are very important for the PPC to even get "acceptable" performance.

The 68k, on the other hand, by design does not have these problems.


The RISC load-use stall after a load is equal to the DCache access time (large caches have increased access times). In another thread, Gunnar wrote the following.

Quote:

APOLLO 68080 can do a free DCache read per cycle
So technically, instead of loading values into registers in advance, you can also just do this:

FMUL.S (a0),Fp0
FMUL.S 4(a0),Fp1
FMUL.S 8(a0),Fp2

Every instruction can read from the cache; even if you re-read the same value, there is no disadvantage.


With a single FP unit, there is no slowdown with the code above. With a dual ported DCache and 2 FP units, there is no slowdown either. If we can schedule a memory access instruction together with a non-memory-access instruction, then there is no slowdown. Why does the 68k need so many registers?

Quote:

VA != PA normally. They have a 1-to-1 mapping, so any given physical address has a corresponding virtual address (assuming that memory is mapped) and vice versa, but the addresses aren't (normally) the same.

If every process and even hardware device has the same view of the address space, the translation can be moved to the memory controller, but it still has to be there.


I see. You want to eliminate the synonym case, which has advantages and disadvantages.

The following idea also eliminates some synonym problems but is different.

Quote:

6.5. Eliminating TLB

Wood et al. [10] propose eliminating TLB and instead using VT data cache for storing PTEs. To avoid synonym problem, they provision that all processes share a single global VA space which must be used by processes sharing data. In their design, PTEs are stored in same cache as data and instructions and are accessed using VAs. For translating a VA, its PTE needs to be fetched for determining the PA. For this, global VA of PTE is computed. By enforcing that all page tables are contiguous in VA space, PTE can be obtained by accessing the page table with VPN as the index. For multicore processors, their approach avoids TLB consistency problem since page tables are cached only in data cache of a core and not TLB. Hence, cached PTEs are updated consistently as directed by the cache coherency policy. Since translation is performed only on cache misses, storing PTEs in cache has minimal impact on its normal behavior and the overall performance. They show that performance of their approach depends crucially on cache miss rate (and hence cache size) and for large sized caches, their approach provides good performance.


[10] Wood DA, Eggers SJ, Gibson G, Hill MD, Pendleton JM. An in-cache address translation mechanism. ACM SIGARCH Computer Architecture News, vol. 14, 1986; 358–365.

This sounds like a good idea but I wonder how much the cache sizes would grow.
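A minimal C sketch of the mechanism as the quote describes it; PT_BASE_VA and the PTE layout are illustrative assumptions:

#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1ull << PAGE_SHIFT) - 1)
#define PT_BASE_VA 0xFFFF800000000000ull /* assumed global VA of the contiguous page tables */

typedef uint64_t pte_t; /* assumed layout: frame number in the high bits, flags below */

/* Runs only on a cache miss; hits are served by the virtually tagged
 * cache with no translation at all. The PTE load itself goes through
 * the same virtually addressed cache. */
static uint64_t translate_on_miss(uint64_t va)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    const pte_t *pte = (const pte_t *)(PT_BASE_VA + vpn * sizeof(pte_t));
    uint64_t pfn = *pte >> PAGE_SHIFT; /* strip the flag bits (assumed) */
    return (pfn << PAGE_SHIFT) | (va & PAGE_MASK);
}

The growth to worry about is exactly this: the PTEs now compete with data and instructions for cache lines, so effective capacity shrinks unless the caches are enlarged.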

Quote:

The PPC MMU is... interesting. The 68060 is more of a standard design.


Which one, the old AIM or new Book E version?

Quote:

In short: the jitter would become that of the caches themselves rather than that of the TLB and translation.


Not bad. Embedded uses with small programs could lock the caches and have very low jitter.

Quote:

Don't know actually. Most operating systems want the (modern) Unix model and this type of system doesn't map cleanly to that.


That's the problem. It would be nice to have an MMU which is still able to run Linux and BSD.

Quote:

But CoW can bring huge performance improvements too. It's hard to eliminate jitter given caches (hardware and software) and things like speculative execution.


Speculative execution may be on the decline with Spectre.

Quote:

Yes, it could eliminate almost all kernel overheads in theory. But just providing efficient hardware-supported context swapping and interrupt/exception handling would do that too.
That leaves the communication overhead, which is much harder to eliminate, especially if one wants to reduce jitter to a minimum.


Crossing over to supervisor mode still has many times the overhead of a subroutine call on the user side, yet it is significantly cheaper than an interrupt/exception such as a timer-interrupted context switch. Both are performance concerns for a micro-kernel, even as they have become cheaper on modern hardware.

BigD 
Re: 68k Developement
Posted on 15-Sep-2018 22:05:54
#193 ]
Elite Member
Joined: 11-Aug-2005
Posts: 7323
From: UK

@matthey

Quote:
Let's compare some code with address register sources open.


Oh please, oh please, talk nerdy to us

Do you speak hexadecimal too?

So with this knowledge will you lead the 68k resurgence or simply catalogue the nature and whereabouts of the register interrupts as the battle rages?

_________________
"Art challenges technology. Technology inspires the art."
John Lasseter, Co-Founder of Pixar Animation Studios

cdimauro 
Re: 68k Developement
Posted on 16-Sep-2018 9:01:15
#194 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

Sorry for the late answers, but I'm still focused on moving to the new house. -_-
You've written many things meanwhile, so I'm trying to split the discussion. I hope I don't make more of a mess doing so... :-/

@matthey Quote:
matthey wrote:
I do think it is worth considering a prefix for the 68k but it is probably a better fit for the x86. If you want to add more registers, then it would be good to compare to a complete re-encoding for 64 bit but this would make supporting the existing 32 bit for compatibility more expensive (advantage prefix).

IMO prefixes aren't a good match even for x86. To be more precise, I would have preferred a complete re-encoding for x64 (let me shorten x86_64 this way) instead of the miserable patchwork that was made to introduce both 64-bit and 16 registers (GPR and SIMD).

So, I don't see anything bad in keeping the original opcode scheme for the 32-bit mode while adopting a completely new encoding for the 64-bit mode. BTW, that's what ARM did with its new 64-bit ISA, where it also doubled the registers.
Yes, compilers and disassemblers would have to be changed, but that would happen anyway even with prefixes (which would otherwise be neither recognized nor used).

Of course, it also depends on the specific "base" ISA. In fact, extending the 68K ISA to double the amount of data and address registers is a big challenge, which might require some compromises while keeping most of the original design intact.
Quote:
I don't see any great need to add more registers as 32 bit performs quite well with 16, code density is better with 16, context switches are faster with 16, cores are smaller with 16 and energy efficiency is better with 16. For a general purpose CISC CPU, 16 registers is not only adequate but probably optimal.

Apart from the context switches (which don't happen that often, BTW), this is not necessarily true, and benefits can come even to a CISC which has reg-mem (and mem-reg) forms for most instructions.

Taking 64-bit Excel, for example: I've disassembled around 4.37M instructions. Around 1M use reg,reg operands, but around 450K use stack or bp memory accesses (and it might be even more if the other 500K memory accesses also include sp/bp usages), which is still a big number (almost half of the regs-only usage!).

But this isn't related only to Excel: other 64-bit applications which I've disassembled (MAME, MS Access, NodeJS, Photoshop CS6 public beta, Unreal Engine, WinUAE - though with fewer disassembled instructions) show similar patterns, with some notable exceptions: PS CS6 showed sp/bp accesses on a par with reg-reg, and MAME sp/bp accesses 3 times reg-reg!

Consider that I start disassembling from the executable entry point and from there try to catch/gather all jump targets, so the results should be taken with a grain of salt, but at least I have some data.
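The described traversal amounts to the usual worklist walk; a minimal C sketch, where decode_insn() is a hypothetical decoder (returning instruction length, a branch target if any, and a fall-through flag) standing in for a real one:

#include <stdint.h>

typedef struct { uint32_t len; uint64_t target; int falls_through; } insn_t;
extern insn_t decode_insn(const uint8_t *image, uint64_t addr); /* hypothetical */

#define MAX_ADDR (1u << 20) /* assumed bound on the image size */
static uint8_t  visited[MAX_ADDR];
static uint64_t work[4096];
static int      n_work;

/* addr is an offset into the loaded image here, starting at the entry point. */
void walk(const uint8_t *image, uint64_t entry)
{
    work[n_work++] = entry;
    while (n_work > 0) {
        uint64_t addr = work[--n_work];
        while (addr < MAX_ADDR && !visited[addr]) {
            visited[addr] = 1;
            insn_t i = decode_insn(image, addr);
            if (i.target && n_work < 4096)
                work[n_work++] = i.target; /* gathered jump: follow later */
            if (!i.falls_through)
                break;                     /* ret or unconditional jump */
            addr += i.len;
        }
    }
}

Anything reached only through indirect jumps or jump tables is missed by such a walk, which is one reason the numbers above deserve the grain of salt.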

Anyway, it's quite likely that with more registers the memory accesses could be lowered further, improving both code density (if instructions don't become longer on average, of course) and performance (less memory pressure).

@megol Quote:
megol wrote:
Code density optimal perhaps and not far from 32 registers in a design with register renaming. The problem is of course that the 68k isn't a processor with 16 general registers, far from it IMHO. Going to a 16 data/16 address register split would still not be equivalent to a 32 register machine in this aspect however the differences would shrink to almost nothing.

@matthey Quote:
matthey wrote:
The hard split between An and Dn registers can be softened with improved orthogonality. Opening up An sources where possible is a good start. There won't ever be a split register file again, so why keep the wall up? The encoding for LEA EA,Dn is open (Gunnar hated the idea). These changes save a temp register and an instruction, and improve code density where they can be used. Other individual improvements in register efficiency can be made here and there. I have suggested and documented several.

@megol Quote:
megol wrote:
I strongly disagree. The result would be removal of an obvious difference between A registers and D registers while full orthogonality is impossible. And IMHO not even worth reaching for.
It would be a hack and an ugly one at that.

This is something which I, as a 68K coder, always wanted to have, but I also respect and consider your position about not opening LEA to data registers (as well as not adding other operations using address registers to "mimic" the data ones). I think that, from a strict 68K ISA perspective, we all agree that the best design decision was the register file split as Motorola made it, to clearly distinguish data and address/pointer usage (with the code density benefit it also brought).

Quote:
With the CISC I think you mean the availability of memory operands? That isn't really an advantage anymore, at least looking at a reasonably modern core. One can see such operations as a code density advantage; however, the increasing load-use latency dictated by physics makes them a performance problem.

I don't see it. I think this is still one of the biggest CISC advantages, which not only increases code density but also saves a register and avoids instruction dependencies, with overall benefits on performance too.

IMO this is quite evident on processor designs which aren't out-of-order.
Quote:
The 68k has more than 8 registers; however, how many are effectively available depends on the needs of the code. Something that needs 9 data registers (including semi-constant values) will see the processor as limited to 8 registers, as would something that needs 9 address registers. Add to that the two-operand limitation, which increases the need for registers.

That's absolutely true, and it's the reason why it's better to have more registers on a 68K-like ISA design.
Quote:
A compiler patched to generate code with a register extension plus a cycle exact simulator would be needed to provide a reasonable answer.

Indeed, but it's a hard task.

@matthey Quote:
matthey wrote:
The x32 ABI gains 9 registers to reach 15 (plus RIP relative addressing), 64-bit math with 64-bit registers and register args over ia32. All of these gains together give a 7-10% average integer performance boost (RIP relative addressing gave a 10% performance advantage for PIC, which is impressive). I expect most of these gains came from the addition of registers. From other studies I've seen, I'll guess reg args give 0-2%, 64-bit math 0-3% and the additional registers 3-7%. Going from 16 to 32 registers would give a much smaller performance boost. Honestly, I would expect to see 0-1% and would be surprised to see even 1-2% overall improvement. The improved PC relative addressing (probably including PC relative writes) I was talking about looks good though.

Performance Data (for x32 ABI)
On Core i7 2600K 3.40GHz:
Improved SPEC CPU 2K/2006 INT geomean by 7-10% over ia32 and 5-8% over Intel64.
Improved SPEC CPU 2K/2006 FP geomean by 5-11% over ia32.
Very little changes in SPEC CPU 2K/2006 FP geomean, comparing against Intel64.
Comparing against ia32 PIC, x32 PIC:
Improved SPEC CPU 2K INT by another 10%.
Improved SPEC CPU 2K FP by another 3%.
Improved SPEC CPU 2006 INT by another 6%
Improved SPEC CPU 2006 FP by another 2%.

http://www.linuxplumbersconf.net/2011/ocw//system/presentations/531/original/x32-LPC-2011-0906.pptx

In light of this data, what do you expect the average integer performance boost from 16 to 32 registers for a CISC CPU to be?

Difficult to answer, but I think it might heavily affect some important scenarios, like emulation, databases, parsers, etc. The overall average across a large set of applications might be only 2%, but it could still be worthwhile if some important types of code get a large boost.

Barana 
Re: 68k Developement
Posted on 16-Sep-2018 9:33:57
#195 ]
Cult Member
Joined: 1-Sep-2003
Posts: 843
From: Straya!

http://www.apollo-core.com/knowledge.php?b=4&note=16571&z=XEhsEg

_________________
Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

I serve King Jesus.
What/who do you serve?

cdimauro 
Re: 68k Developement
Posted on 16-Sep-2018 9:43:59
#196 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey Quote:

matthey wrote:
Let's compare some code with address register sources open. The following is code to determine if an address in an address register is odd (d0 is 0 for even and 1 for odd).

A:
move.l a0,d1
moveq #1,d0
and.l d1,d0
instructions: 3
code size: 6
registers used: 3

B:
move.l a0,d0
and.l #1,d0
instructions: 2
code size: 8 (6 bytes with my immediate compression enhancement)
registers used: 2

C:
moveq #1,d0
and.l a0,d0
instructions: 2
code size: 4
registers used: 2

Code A above is optimal with the best code density for most existing 68k CPUs, but it needs an extra instruction and 2 scratch registers, and is not easy for compilers to generate. Code B is easier for compilers to generate but gives poor code density. Code C is the most compact and is not difficult for compilers. Simpler and shorter looks better to me and *not* ugly.

I documented addressing mode sources being opened in the old 68kF pdf docs (yellow highlighting for new enhancements).

http://eab.abime.net/showthread.php?t=83642

Here it makes sense to open up address register usage, because they are used as sources.
Quote:
I think an in-order superscalar 68k with 3 integer pipes and dual ported DCache would outperform many 4 wide superscalar RISC CPUs with OoO. A good instruction scheduler would be necessary for maximum performance but I expect it would perform better than any 68k CPU on existing code as well.

I don't think so. Two instructions per cycle is already the best compromise for an in-order design. Going wider leads to too many dependencies that have to be found and resolved by compilers (even worse for human beings), which is basically what killed in-order designs like EPIC/Itanium.

Unless you like VLIW designs, but they aren't general purpose, and even in the embedded market they aren't common.
Quote:
The 68k should be closer to the CISC AMD64 in register usage than to RISC, although it is more efficient with memory (the 68k is a reg-mem/mem-mem hybrid with more powerful addressing modes where x86 is a reg-mem/accumulator hybrid). One important consideration is the penalty when out of registers.

RISC:
st -(sp),r0
ld r0,(r3)
add r0,r0,r1
st (r3),r0
ld r0,(sp)+
instructions: 5
code size: 20 bytes
registers used: 3
cycles: at least 7 cycles

CISC:
add.l d1,(a3)
instructions: 1
code size: 2 bytes
registers used: 2
cycles: at least 1 cycle

It is RISC that has problems with load-use stalls after loads.

Well, here you're saving and restoring the used register, which is unlikely to be needed on a RISC thanks to the large number of available registers.

A fairer comparison would have been 3 instructions for the RISC example.
Quote:
The RISC load-use stall after a load is equal to the DCache access time (large caches have increased access times). In another thread, Gunnar wrote the following.
[...]
With a single fp unit, there is no slowdown with the code above. With a dual ported DCache and 2 fp units, there is no slowdown. If we can schedule a memory access instruction with a non memory access instruction then there is no slowdown. Why does the 68k need so many registers?

Because using memory accesses decreases code density? O|-) FPU instructions are already at least 32 bits in size, and the provided example makes them larger, which isn't always desirable... at least given your strong interest in code density, right?
Quote:
Speculative execution may be on the decline with Spectre.

This isn't the case: out-of-order and speculative execution are here to stay, because the performance improvement is so huge (compared to in-order designs) even when applying the full mitigations.

cdimauro 
Re: 68k Developement
Posted on 16-Sep-2018 9:55:34
#197 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@mattheyQuote:
matthey wrote:
Fido can lock programs into SRAM, which acts as if everything were in the L1 caches. That's not uncommon for embedded systems with small programs but doesn't work for general purpose CPUs.

It's been used on x86 processors for a long time, albeit only at the very beginning of boot (right after power-on, when no memory interface is available/configured yet).

This might be applied/opened up on embedded x86 designs as well.
Quote:
Right. The big advantage is that a cheaper and lower clocked CPU and board can be used. Most higher performance and general purpose CPUs need much faster CPUs to get the same level of responsiveness. This leaves the common mid performance CPUs as less than responsive. If you get sluggishness and stuttering, go buy a faster CPU. That is the problem with general purpose CPUs today. The Amiga was so amazing when it came out because it avoided this on mid performance affordable hardware.

Well, this was due to the Amiga OS, which avoided supervisor/user mode context switches.

AROS can do the same on any architecture.
Quote:
I agree that Fido has embedded limitations. That's what I meant when I said Fido was not dynamic enough for general purpose use. It does show that paging is not necessary for process isolation and protection. How about the single cycle hardware context switches? It is kind of nice that the scheduling, preemptive interrupts and timers are handled in hardware, which is quick (avoids 2 supervisor mode context switches?) and more secure.

This is also the reason why FIDO isn't 68K compatible: it changed the supervisor mode and part of the user mode.

FIDO is certainly a good example when talking about embedded designs, but it had to remove the more general-purpose features of standard 68K processors.

Custom designs for custom needs...
Quote:
Could a hardware scheduler allow the first "pure" microkernel? Even if a lower priority context gets corrupted or even falls into an infinite loop, the upper level contexts still operate and get full priority. This is an innovative yet simple little CPU.

Intel tried it with the 80286 and 80386 processors (hardware task switching), but AFAIK that solution was never widely adopted.

BTW, @megol: http://minix.net/minix.html
MINIX 2.0 (Intel CPUs from 8088 to Pentium)
https://en.wikipedia.org/wiki/MINIX
Tanenbaum originally developed MINIX for compatibility with the IBM PC and IBM PC/AT microcomputers available at the time.

To address the nonsense that meynaf says when he talks about things which he clearly doesn't know.

Barana 
Re: 68k Developement
Posted on 16-Sep-2018 9:56:34
#198 ]
Cult Member
Joined: 1-Sep-2003
Posts: 843
From: Straya!

@BigD

Intel no #1 !

https://youtu.be/xN0vUlljX0I

_________________
Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

I serve King Jesus.
What/who do you serve?

cdimauro 
Re: 68k Developement
Posted on 16-Sep-2018 9:59:29
#199 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Barana

Quote:

Barana wrote:
http://www.apollo-core.com/knowledge.php?b=4&note=16571&z=XEhsEg

Already read it, and matthey already reported parts of the messages.

However, a 10% improvement isn't that much: I was expecting A LOT more moving from the in-order to the new out-of-order design, like what happened with other architectures.

It might be related to the lack of FPGA resources, but this is a mantra which was used many times to justify the absence of an FPU implementation, which after some months (of criticism)... "magically" appeared...

Barana 
Re: 68k Developement
Posted on 16-Sep-2018 10:10:39
#200 ]
Cult Member
Joined: 1-Sep-2003
Posts: 843
From: Straya!

@cdimauro

Ah OK thx.
Compared to an Amiga with a stock 040, the improvement adds up to something staggering. Considering that a while ago they were at the level of a 603e with 'only' a 68k.
As someone once said, opinions are like bums, everyone has one.
And... The proof is in the pudding.
Anyway... 68k no #1

_________________
Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

I serve King Jesus.
What/who do you serve?

