      /  Retro Games Limited - THEA500 Mini - Future?
Hammer 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 17-Aug-2023 3:43:59
[ #21 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5125
From: Australia

@matthey

Quote:

Copycat products like the A600GS will likely be late to the game and appeal to a niche of a niche market while prices will be higher due to lack of mass production. Without mass production, a custom board like the A600GS uses increases the cost compared to higher production products like THEA500 Mini and Raspberry Pi SBCs which could have been embedded in a custom case. Is there a killer feature that justifies a custom SBC or is this "Good Stuff" a gimmick that acts like a dongle?

PC clones following a certain standard enabled the platform to survive the "big iron" RISC challenge.

The "single Apple-only vendor" mindset (using Copycat negative term) has problems with the PC's clone business model.

I have no problems with A600GS coexisting with TheA500mini.

Quote:

It is certainly possible to offer a product with more value than THEA500 Mini but it would require more investment. The current legal situation is an impediment and likely even endangers an Amiga Maxi followup, at least as good of one as possible. Greedy Amiga people have carved out their niches in the Amiga market and fragmented it. These ARM boards using emulation are the low end of the Amiga market.

Nope. https://amigang.com/amigamini-thea500/
The emulated 68040 with JIT reached 173 MIPS and 116 MFLOPS.

As long as ARM CPUs evolve, any TheA500mini or A600GS follow-up can also evolve in a cost-effective price range.

Quote:

FPGA Amiga users (MiSTer, MiST, Vamp/AC), x86-64 UAE users, original hardware Amiga users and even PPC AmigaNOne users likely have a better experience at least some of the time compared to cheap ARM emulation.

AmigaOS 4.1 and MorphOS have an OS-hosted, userland-only 68K emulator without custom chip support. Look in the mirror when PPC OSes use UAE!

The ARM Cortex-A57 and its successor, the A72, are the entry points for ARM's out-of-order-execution ARMv8 64-bit cores.

ARM's UAE 68K CPU JIT emulation hosted on Linux is less optimized than the single-task Emu68.

Quote:

ARM emulation may offer some value due to cheapness but will never replace better Amiga hardware and unify the Amiga again.

This is not ARM's fault. Blame Motorola, Freescale and Hector Ruiz.

Quote:

It's obvious that the only way to offer value that destroys all the competition and unifies the Amiga again is with the 68k and Amiga custom chips in real silicon as a single chip SoC.

Not economically viable.

Quote:

A 68060+AGA chipset uses fewer transistors than the U.S. $1 RP2040 SoC chip which likely costs a fraction of that to mass produce.

68060's microarchitecture is an obsolete CPU design and FPU is not even pipelined.

68060's microarchitecture has two 68K hardware decoders for the RISC core.

RP2040's low price is produced in large economies of scale. Wafer start is not cheap.

If there's money for a wafer start, it's better to design a hardware-accelerated 68K decoder front-end (hardware-accelerated Emu68) for ARM's big OOOE CPU core.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

matthey 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 18-Aug-2023 0:11:17
[ #22 ]
Super Member
Joined: 14-Mar-2007
Posts: 1950
From: Kansas

Hammer Quote:

PC clones following a certain standard enabled the platform to survive the "big iron" RISC challenge.

The "single Apple-only vendor" mindset (using Copycat negative term) has problems with the PC's clone business model.

I have no problems with A600GS coexisting with TheA500mini.


Some of the early PC clones were not fully IBM compatible. In some cases this reduced the cost and in other cases it increased performance but these disappeared when PC clones came out that were more compatible with little performance and cost difference. All PC clones were copycats but the most compatible and best value hardware was the winner. The best copycats won but Amiga emulation on ARM is not the most compatible or the highest performance. It is cheap but still only the 2nd cheapest potential Amiga hardware solution.

Hammer Quote:

Nope. https://amigang.com/amigamini-thea500/
The emulated 68040 with JIT reached 173 MIPS and 116 MFLOPS.

As long as ARM CPUs evolve, any TheA500mini or A600GS follow-up can also evolve in a cost-effective price range.


JIT is less compatible by nature and adds jitter/stuttering to program execution. Toni Wilen usually won't accept bug reports for WinUAE with JIT turned on so it barely has support. While JIT may work pretty well for software bound system friendly software it may have more problems with hardware intensive software with tighter timing requirements like many Amiga games. There is a reason why THEA500 Mini has JIT disabled by default.

Single core integer performance is the most important performance metric for an Amiga and 173 DMIPS is poor. The M68060UM states the 68060 has "superscalar integer performance of over 100 MIPS at 66 MHz". A 68060@114MHz should offer at least 173 DMIPS of performance according to Motorola documentation. A rev6 68060 may be able to achieve this with very good cooling so even the 1990s original silicon integer performance is still competitive with THEA500 Mini which is clocked much higher. I believe the 68060 actually had better integer performance than this but lacked compiler support to show it and the reason why Motorola left it as "over" rather than giving an exact number. ARM is known for low power cores with weak single core performance while the 68060 had one of the best integer single core performances at the time in DMIPS/MHz.
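
The 114 MHz figure can be reproduced with a quick back-of-envelope calculation. This is only a sketch of the arithmetic: it assumes performance scales linearly with clock and treats the manual's "over 100 MIPS at 66 MHz" as a floor. The function and variable names are made up for illustration.

```python
def mips_per_mhz(mips: float, clock_mhz: float) -> float:
    """Per-clock rate implied by a MIPS rating at a given clock."""
    return mips / clock_mhz


def clock_for_target(target_mips: float, rate_per_mhz: float) -> float:
    """Clock (MHz) needed to reach a target, assuming linear scaling with clock."""
    return target_mips / rate_per_mhz


# Figures quoted in the post: "over 100 MIPS at 66 MHz" and the 173 score.
rate_68060 = mips_per_mhz(100.0, 66.0)
needed_mhz = clock_for_target(173.0, rate_68060)

if __name__ == "__main__":
    print(f"68060 floor: {rate_68060:.2f} MIPS/MHz")
    print(f"clock needed for 173: {needed_mhz:.0f} MHz")
```

Because the 100 MIPS figure is a lower bound, the clock computed here is an upper bound on what the 68060 would need.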

Single Core DMIPS/MHz
---
Cortex-M0+ 1.16 (RPi Pico/RP2040)
68060 1.52+
PPC440/460EX 1.80 (Sam440/460)
ColdFireV5 1.83
QorIQ-P1022/e500v2 2.4 (A1222)
Cortex-A53 2.88 (RPi 3)
Cortex-A72 5.45 (RPi 4)

Motorola/Freescale/NXP results and IBM/AMCC results come from documentation or employees. RPi 3 and 4 results come from measured results at https://forums.gentoo.org/viewtopic-t-1101000.html . RP2040 measured results are from https://www.magazinmehatronika.com/en/banana-pi-bpi-pico-rp2040-review/ .

The Cortex-M0+, 68060, ColdFireV5 and Cortex-A53 are lower power, smaller and cheaper in-order designs while the PPC440, e500v2 and Cortex-A72 are much larger and should be significantly higher performance. Of course the chip process plays a role as newer cores on newer silicon have a fraction of the distance for the electricity to travel compared to the oldest silicon in the list. Older and low power cores may have smaller L1 caches as well (Cortex-M0+ 16kiB XIP cache but may run directly from SRAM memory which is what caches use, 68060 has only 8kiB-I/8kiB-D, ColdFireV5 has 16kiB-I/16kiB-D while all others use 32kiB-I/32kiB-D). The ColdFireV5 superscalar core design is very similar to the 68060 but slightly more modern and improved with double the L1 caches, 2 pipeline stages added, a hardware return/link stack, likely some code folding/fusion and fully synthesizable (easier to change but lower performance). The PPC440 has poor integer performance for an OoO core. The OoO PPC e500v2 has respectable integer performance but is still outperformed by the in-order Cortex-A53 which shouldn't happen. The 68060 performs well for the age, cache sizes and an in-order core. While a modernized 68060 may not reach Cortex-A53 integer performance, performance wouldn't be 1/3 due to emulation overhead or, to put it another way, it would be 3 times greater for 68k performance. Modern x86-64 cores can have roughly twice the single core integer performance of the Cortex-A72 but they are huge and expensive OoO beasts which are in a different class. The smaller cores above can be produced for less than $1 where the RP2040 has been very successful despite a lack of a standard software platform like the 68k Amiga which has a huge software library with a tiny footprint.
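
To put rough numbers on the emulation overhead argument, here is a small sketch. The DMIPS/MHz values are copied from the list above, and the 1/3 factor is the assumption stated in this post, not a measurement. The names are hypothetical.

```python
# DMIPS/MHz figures copied from the list in the post.
NATIVE_DMIPS_PER_MHZ = {
    "68060": 1.52,
    "Cortex-A53": 2.88,
    "Cortex-A72": 5.45,
}

# Assumption taken from the post: emulated 68k code runs at about 1/3 of native speed.
EMULATION_FACTOR = 1.0 / 3.0


def emulated_rate(core: str, factor: float = EMULATION_FACTOR) -> float:
    """Effective DMIPS/MHz for 68k code run under emulation on the given core."""
    return NATIVE_DMIPS_PER_MHZ[core] * factor


if __name__ == "__main__":
    for core in ("Cortex-A53", "Cortex-A72"):
        print(f"{core}: {emulated_rate(core):.2f} effective DMIPS/MHz for 68k code")
    print(f"68060 native: {NATIVE_DMIPS_PER_MHZ['68060']:.2f} DMIPS/MHz")
```

This compares per-MHz efficiency only. The ARM cores run at far higher clocks than any existing 68060, so absolute performance is a separate question.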

The info above is what was censored (deleted with no reason given) in the A1222 thread. Maybe it won't be as threatening in this thread.

Hammer Quote:

Not economically viable.


If a few million dollars is too much of an investment in cheap hardware then it is better not to be in the hardware business. Economies of scale are too important to be ignored. Amiga hardware businesses are an embarrassment but Trevor needs a tax deduction. It's like he doesn't want to make money and wants low production numbers for his rare bastard Amiga collection.

Hammer Quote:

68060's microarchitecture is an obsolete CPU design and FPU is not even pipelined.

68060's microarchitecture has two 68K hardware decoders for the RISC core.


The 68060 core design is modern enough to do well in the chart above despite many handicaps compared to more modern cores. The lack of a fully pipelined FPU was a good compromise in the day to save transistors without too much of a hit to less important floating point performance, especially for the most common mixed integer and floating point code. The integer pipeline processes up to several stages of the FPU instructions so only the last stages in the FPU lack pipelining. Transistors are much cheaper today so it probably makes sense to have a fully pipelined FPU as well as adding back some of the removed instructions which would improve 68k compatibility as well as performance. Keeping a core small though may allow it to compete better in cost with the likes of the RP2040.

Hammer Quote:

RP2040's low price is produced in large economies of scale. Wafer start is not cheap.

If there's money for a wafer start, it's better to design a hardware-accelerated 68K decoder front-end (hardware-accelerated Emu68) for ARM's big OOOE CPU core.


No. The Amiga goes nowhere with emulation on a big hot ARM OoO CPU running at 1/3 performance. It may go somewhere as an easy-to-use, cheap and competitive building block with lots of software for standard hardware with a tiny footprint.

Last edited by matthey on 18-Aug-2023 at 12:20 AM.

Hammer 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 18-Aug-2023 5:25:46
[ #23 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5125
From: Australia

@matthey

Quote:

Some of the early PC clones were not fully IBM compatible. In some cases this reduced the cost and in other cases it increased performance but these disappeared when PC clones came out that were more compatible with little performance and cost difference. All PC clones were copycats but the most compatible and best value hardware was the winner. The best copycats won

I'm aware of this. I'm old enough for "microcomputer" word usage.

Quote:

but Amiga emulation on ARM is not the most compatible or the highest performance.

That's debatable. For WHDLoad games, "Turtle mode" is recommended for very fast 68K Amigas.

Without UAE, PowerPC Amigas have zero compatibility with WHDLoad Amiga games.

Different UAE profiles are needed for different Amiga game eras.

Quote:

It is cheap but still only the 2nd cheapest potential Amiga hardware solution.

This was Commodore's domain.


Quote:

JIT is less compatible by nature and adds jitter/stuttering to program execution.

When performance matters, e.g. Doom and Quake, my PiStorm-Emu68-RPi 3A+ (ARM Cortex-A53) beats the TF1260's 68060 rev1 @ 62.5 MHz.

The "jitter/stuttering" is BS when Emu68 is running as a single-task program i.e. no context switching with the host OS.

I have FreeSync/G-Sync-enabled monitors for WinUAE, hence 50 Hz PAL is not a problem.

A 144 Hz target requires a 6.9 ms completion time interval.
A 120 Hz target requires an 8.3 ms completion time interval.
A 60 Hz target requires a 16.7 ms completion time interval.
A 50 Hz target requires a 20 ms completion time interval.

I have a PC 4K FreeSync/G-Sync monitor with 144 Hz, hence a 6.9 ms completion time interval. It's currently set to normal 120 Hz mode.

For 50 Hz, the JIT and target 68k workload processing shouldn't exceed 20 ms.
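
The completion intervals above are simply 1000 ms divided by the refresh rate. A minimal sketch of that arithmetic follows; the function names are made up for illustration, and `work_ms` stands for whatever the JIT plus the emulated 68k workload takes per frame.

```python
def frame_budget_ms(refresh_hz: float) -> float:
    """Time available per frame, in milliseconds, at a given refresh rate."""
    return 1000.0 / refresh_hz


def fits_in_frame(work_ms: float, refresh_hz: float) -> bool:
    """True if the per-frame work completes within the frame budget."""
    return work_ms <= frame_budget_ms(refresh_hz)


if __name__ == "__main__":
    for hz in (144, 120, 60, 50):
        print(f"{hz} Hz -> {frame_budget_ms(hz):.1f} ms per frame")
```

For example, a frame that takes 18 ms of work fits a 50 Hz PAL target but misses a 60 Hz one.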

Quote:

Toni Wilen usually won't accept bug reports for WinUAE with JIT turned on so it barely has support.

Don't care. I know the limitation of WinUAE-JIT.


Quote:

While JIT may work pretty well for software bound system friendly software it may have more problems with hardware intensive software with tighter timing requirements like many Amiga games. There is a reason why THEA500 Mini has JIT disabled by default.

Any JIT is not useful for most WHDLoad 2D games.

WHDLoad-patched Amiga games are recommended for Amigas with full 32-bit CPU accelerators. Extra "turtle" commands are added to the WHDLoad startup. Emu68's JIT depth can change on the fly.

I collected WHDLoad Amiga games, both multi-parallax and non-multi-parallax, that run smoothly on my A500+PiStorm-Emu68. I removed "bad" Amiga games from my collection.

Without using the degrader tool(1), A3000's compatibility with A500 disk games is not good.
1. https://aminet.net/package/util/misc/Degrader

A1200's 68EC020 CPU's cache breaks Dread.

Using a Wicher 508i accelerated A500 with a 68HC000 @ 50 MHz and 16-bit Fast RAM, the timings caused graphics corruption in WHDLoad Settlers. I usually use the 68HC000 @ 25 MHz. The Wicher 508i can disable Fast RAM when required.

PiStorm32-Emu68 has a hotkey to disable the PiStorm accelerator and boot into stock A1200 (with built-in IDE micro-SD boot). PiStorm accelerator's micro-SD boot partition has a higher boot priority.


Quote:

Single core integer performance is the most important performance metric for an Amiga and 173 DMIPS is poor. The M68060UM states the 68060 has "superscalar integer performance of over 100 MIPS at 66 MHz".

173 DMIPS is SysInfo's score from TheA500mini's UAE-JIT.

The 68060's claim of "over 100 MIPS at 66 MHz" is useless when the front-side bus is still a 68040 32-bit bus.

Classic Pentium has a 64-bit front-side bus to feed the dual 32-bit integer pipelines.

From https://www.youtube.com/watch?v=skU70bb-5ak
This TF1260's 68060 at 62.5 MHz scores 49.58 MIPS in SysInfo instead of over 100 MIPS.

Amiga's 68060 accelerators didn't have an onboard L2 cache (64-bit 66 MHz SRAM) like classic Pentium motherboards.

I also have TF1260.

Quote:

Single Core DMIPS/MHz
---
Cortex-M0+ 1.16 (RPi Pico/RP2040)
68060 1.52+
PPC440/460EX 1.80 (Sam440/460)
ColdFireV5 1.83
QorIQ-P1022/e500v2 2.4 (A1222)
Cortex-A53 2.88 (RPi 3)
Cortex-A72 5.45 (RPi 4)

These are useless arguments when the load-store units are major bottlenecks. Instructions without data are useless.

Hint: Emu68's EmuControl tool shows ARM MIPS in near real-time.

Cortex-A72 has three ARM decoders and two load-store units, hence it can sustain 1 load/1 store, and burst two loads.

A 1500 MHz Cortex-A72 delivering sustained 4,500 MIPS (3 IPC) or 7,500 MIPS (5 IPC) ... LOL you're in dreamland.
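
The two MIPS figures here are just clock times IPC. The sketch below shows that arithmetic and, for comparison, multiplies out the 5.45 DMIPS/MHz entry from the quoted list. One thing worth keeping in mind: DMIPS is a Dhrystone score normalized to the VAX 11/780 (1757 Dhrystones per second counts as 1 DMIPS), so DMIPS/MHz is not the same thing as native instructions per clock. The function names are hypothetical.

```python
VAX_11_780_DHRYSTONES_PER_SEC = 1757.0  # reference score that defines 1 DMIPS


def native_mips(clock_mhz: float, ipc: float) -> float:
    """Native instructions per second, in millions, at a sustained IPC."""
    return clock_mhz * ipc


def dmips(clock_mhz: float, dmips_per_mhz: float) -> float:
    """Dhrystone MIPS at a given clock from a DMIPS/MHz rating."""
    return clock_mhz * dmips_per_mhz


def dmips_from_dhrystones(dhrystones_per_sec: float) -> float:
    """Convert a raw Dhrystone score into DMIPS."""
    return dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC


if __name__ == "__main__":
    print(native_mips(1500, 3))       # the 3 IPC case
    print(native_mips(1500, 5))       # the 5 IPC case
    print(round(dmips(1500, 5.45)))   # the list's DMIPS/MHz figure at 1500 MHz
```

Emu68's EmuControl readout counts ARM instructions, so it lines up with `native_mips` rather than with a DMIPS rating.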

https://youtu.be/AEkFu6QHyHY?t=158
Emu68's EmuControl tool shows ARM MIPS in near real-time.


RPI 4B Emu68 screenshot from https://www.patreon.com/posts/emu68-0-15-3-new-85948607

As for picking on lowly Cortex-M0

https://community.arm.com/support-forums/f/architectures-and-processors-forum/5176/arm-cortex-m0-details

Without going into details, some of the low cost Cortex-M0 microcontrollers on the market has less than 50K gates and that included bus system, peripherals, and possibly DMA support, etc (exclude memory area and analog components). The 12K gate number is based on minimum configuration at 180ULL process. However, you can get different gate count using different processes, some gives better figure and some give larger areas. For the Cortex-M0 DesignStart, as it has got 16 interrupts and the SysTick timer, the area would be a bit larger than 12K.



Quote:

The 68060 core design is modern enough to do well in the chart above despite many handicaps compared to more modern cores. The lack of a fully pipelined FPU was a good compromise in the day to save transistors without too much of a hit to less important floating point performance, especially for the most common mixed integer and floating point code. The integer pipeline processes up to several stages of the FPU instructions so only the last stages in the FPU lack pipelining. Transistors are much cheaper today so it probably makes sense to have a fully pipelined FPU as well as adding back some of the removed instructions which would improve 68k compatibility as well as performance. Keeping a core small though may allow it to compete better in cost with the likes of the RP2040.

68060 micro-architecture wasn't designed for high clock speed.

68060 at 0.6 μm vs. Pentium's P54C at 0.6 μm, which reached 100 MHz. My TF1260's 68060 rev1 couldn't reach 70 MHz without a freeze.

68060 at 0.42 μm vs. Pentium's P54CS at 0.35 μm, which reached 200 MHz, and P55C at 0.35 μm, which reached 233 MHz.

Using Digital's 0.75 μm, Alpha 21064 reached 150 MHz (in 1992) and later reached 200 MHz (in 1993).

RDNA v2's 1 extra stage pipeline compared to RDNA v1 is for reaching higher clock speeds.
When compared to the smaller Zen 4C, the normal Zen 4's larger core design is for higher clock speed.

Designing a micro-architecture for high clock speed is an art, not just process tech.


Last edited by Hammer on 18-Aug-2023 at 07:26 AM.


Rob 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 19-Aug-2023 8:03:55
[ #24 ]
Elite Member
Joined: 20-Mar-2003
Posts: 6334
From: S.Wales

@matthey

https://docs.justia.com/cases/federal/district-courts/washington/wawdce/2:2007cv00631/143245/147/1.html

Part (b) of the agreement basically sets out what Amiga Inc can and can't do with regard to operating systems. Their activities with Amiga OS seem to be limited to C64DTV-type devices with their own ASIC or possibly an FPGA, but not emulation devices, and the Amiga OS UI can't be exposed to the end user.

Amiga Inc would not have been able to license the Kickstart ROM to Retro Games for The A500 Mini because it wouldn't be able to run it without emulation.

According to attachment 1, Cloanto's licenses grant them "Rights sufficient to support Amiga Forever, including emulation modules".

Where do the rights to license a product like The A500 mini come from?

I'm no legal expert so I'd be interested to hear the interpretations of others, or if they have any further info to share.

If Cloanto had caused a breach of the 2009 settlement agreement, I'm sure Hyperion would have already seized upon that so there must be something I'm missing here.

amigakit 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 19-Aug-2023 11:08:33
[ #25 ]
Amiga Kit
Joined: 28-Jun-2004
Posts: 2508
From: www.amigakit.com

@Rob

I think the THEA500Mini does not reveal the underlying AmigaOS system to the user, to keep with the prior US legal agreements. Hence there are no Amiga applications present; it is purely a retro games platform.

We have developed our own system files over the last few years which, when compiled for 68K, take the V46 versioning. On PPC it is V54 versioning. The A600GS will contain V46 commands, tools, commodities, gadgets and datatypes. PPaint and OctaMED boot using these commands.

_________________
Amiga Kit Amiga Store
Links: www.amigakit.com | New Products | A600GS

Rob 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 19-Aug-2023 11:27:43
[ #26 ]
Elite Member
Joined: 20-Mar-2003
Posts: 6334
From: S.Wales

@amigakit

Quote:
I think the THEA500Mini does not reveal the underlying AmigaOS system to the user to keep with the prior US legal agreements.


That's what I thought until I re-read this part.
"(iv) Amiga may distribute the Software in Object Code form "as is" (i.e. in a form unmodified from that in definition "l", without additional functionality commonly associated with an Operating System) and whereby the User Interface of the Software is not exposed to the end-user, solely in conjunction with gaming content or in conjunction with self-contained hardware devices containing such gaming content (e.g. Joysticks) which are capable of executing the Object Code form of the Software without software emulation."

The A500 Mini isn't "without software emulation".

kolla 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 19-Aug-2023 13:36:07
[ #27 ]
Elite Member
Joined: 20-Aug-2003
Posts: 2821
From: Trondheim, Norway

@Rob

Isn’t that because Cloanto already had the contracts and rights when it came to “with software emulation”? And now Amiga and Cloanto are the one and same, and hence can do both with and without software emulation?

_________________
B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC

matthey 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 19-Aug-2023 21:09:59
[ #28 ]
Super Member
Joined: 14-Mar-2007
Posts: 1950
From: Kansas

Hammer Quote:

173 DMIPS is SysInfo's score from TheA500mini's UAE-JIT.


SysInfo uses non-compliant DMIPS code and is not superscalar aware. It is complete rubbish.

Hammer Quote:

The 68060's claim of "over 100 MIPS at 66 MHz" is useless when the front-side bus is still a 68040 32-bit bus.

Classic Pentium has a 64-bit front-side bus to feed the dual 32-bit integer pipelines.


The 68k has better code density than the x86, shorter instructions on average (less than 3 bytes/instruction where x86 is closer to 3.5) and more favorable code alignment than x86 (16 bit aligned instructions instead of 8 bit aligned instructions).

The Superscalar Architecture of the MC68060 Quote:

We used dynamic code analysis of existing 68k applications to determine the instruction-fetch bandwidth necessary to support the superscalar operand-execution pipelines. The chip's instruction-set architecture contains 16-bit and larger instructions, with a measured average instruction length of less than 3 bytes. Simulations based on trace data indicate that, holding the rest of the architecture constant, with the combination of a branch-prediction driven prefetch and an instruction buffer, a 64-bit instruction prefetch would be only marginally faster than a 32-bit instruction prefetch. Based on this analysis, the instruction-fetch pipeline interface has separate 32-bit address and data buses. All instruction fetches are 32-bit aligned fetches. The instruction cache supports a continuous one instruction fetch per cycle rate.


The 68060 has significant room to improve performance but adding a 64 bit instruction fetch by itself would only marginally improve performance. Low power and cost were more important to the 68060 where it was heavily targeted at the embedded market while the Pentium was primarily targeting the desktop with a higher performance design on paper. The superior 68k ISA and better core design allowed the 68060 balanced core design to compete in performance with the Pentium high performance design and I believe even outperform it in integer performance showing that the 32 bit instruction fetch is not a significant bottleneck.
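
As a rough illustration of the fetch bandwidth argument, the sketch below divides the fetch width by the average instruction length to get an upper bound on instructions fetched per cycle. The instruction lengths are the ones quoted above (under 3 bytes for 68k, about 3.5 for x86). This ignores the instruction buffer, branches and cache misses, so it is not a substitute for the trace-driven simulations in the paper. The function name is made up.

```python
def fetch_bound_ipc(fetch_bits: int, avg_instr_bytes: float) -> float:
    """Upper bound on instructions fetched per cycle for a given fetch width."""
    return (fetch_bits / 8.0) / avg_instr_bytes


if __name__ == "__main__":
    # 68k at the quoted ceiling of 3 bytes/instruction (real average is lower).
    print(f"68k, 32-bit fetch: {fetch_bound_ipc(32, 3.0):.2f} instr/cycle")
    print(f"68k, 64-bit fetch: {fetch_bound_ipc(64, 3.0):.2f} instr/cycle")
    # x86 at the quoted ~3.5 bytes/instruction.
    print(f"x86, 32-bit fetch: {fetch_bound_ipc(32, 3.5):.2f} instr/cycle")
    print(f"x86, 64-bit fetch: {fetch_bound_ipc(64, 3.5):.2f} instr/cycle")
```

Since the 68k average is below 3 bytes, the 32-bit result is a floor on the bound, which is consistent with a 32-bit fetch keeping a dual-issue in-order core reasonably fed.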

Hammer Quote:

Amiga's 68060 accelerators didn't have an onboard L2 cache (64-bit 66 MHz SRAM) like classic Pentium motherboards.


The Pentium needed more caches and fat x86 cores didn't leave enough transistors for an on chip L2 cache (onboard != on chip). The off chip L2 cache required expensive memory so there was no free lunch, at least until economies of scale kicked in.

Hammer Quote:

These are useless arguments when the load-store units are major bottlenecks. Instructions without data are useless.

Hint: Emu68's EmuControl tool shows ARM MIPS in near real-time.

Cortex-A72 has three ARM decoders and two load-store units, hence it can sustain 1 load/1 store, and burst two loads.

A 1500 MHz Cortex-A72 delivering sustained 4,500 MIPS (3 IPC) or 7,500 MIPS (5 IPC) ... LOL you're in dreamland.


The OoO Cortex-A72 gets destroyed by modern OoO x86-64 cores as well but at a cost. The Cortex-A72 may have cost tens of millions to develop, may require a $5 chip due to size and borderline needs a fan while the x86-64 core may cost hundreds of millions to develop, may require a $50 chip due to size and definitely needs a fan. The in-order Cortex-A53 is the most popular ARM core and with native code outperforms the Cortex-A72 using emulated code at a significantly lower cost. A modernized 68060 wouldn't even have to get that close in performance to the Cortex-A53 to be able to outperform the Cortex-A72 when executing 68k code and in-order cores are much cheaper to develop, perhaps in the low millions and can be produced for under $1 per chip.

The 68060 core design is intrinsically good at accessing memory, especially for an in-order design. RISC designs need OoO to avoid load-to-use penalties (pipeline bubbles) of separately pipelined load/use and execution pipelines. CISC instructions often have the operation and memory access together, which means they can be pipelined together, which the 68060 does, avoiding most load-to-use penalties without the overhead of OoO. Each of the two integer pipelines of the 68060 can calculate an effective address and access memory. The LEA instruction is common and doesn't access memory so this is beneficial even when only a single cache/memory access per cycle is allowed as is common with most in-order CPU cores. Furthermore, the 68060 has 4 independent data cache banks which allows multiple cache accesses.

The Superscalar Architecture of the MC68060 Quote:

The interface between the data cache and the execution-operand pipeline has a 32-bit address bus and a bidirectional 64-bit data bus. In any given superscalar dispatch, only one memory referencing operation can occur. However, the data cache has four independent banks. It can perform one cache read and one cache write per cycle if these operations go to different banks. Trace-driven analysis and simulation shows the optimal MC68060 strategy to be one data cache read port with early memory read (operand-fetch cycle stage) and late memory write (write-back stage). Static trace analysis shows that, for example, holding the rest of the architecture constant, the performance penalty of a single data cache read port versus a dual-ported design is 3 to 5 percent. The cost of a dual-ported data cache design greatly exceed its benefits.


I believe multi-bank interleaving was started on the Pentium II (only 2 banks though) before x86-64 cores later switched to dual ported caches turning these cores into CISC memory munching monsters that they are known for today. The 68060 can also execute read-modify-write instructions in a single cycle which I don't believe was possible on any x86-64 core and is not possible with RISC. I don't know if the 68060 multi-bank feature would allow two read-modify-write instructions per cycle simultaneously from each integer pipe allowing up to 2 reads and 2 writes from memory per cycle exceeding the Cortex-A72 load/store performance but the OoO execution and much larger caches more than makes up for any memory munching deficit. The in-order 68060 really should be compared to the in-order Cortex-A53 but we can only guess at a level playing field with a modernized 68060. The ColdFireV5 core gets us partway there albeit with a lower power target than the already low power 68060 core design.

Hammer Quote:

As for picking on lowly Cortex-M0

https://community.arm.com/support-forums/f/architectures-and-processors-forum/5176/arm-cortex-m0-details

Without going into details, some of the low cost Cortex-M0 microcontrollers on the market has less than 50K gates and that included bus system, peripherals, and possibly DMA support, etc (exclude memory area and analog components). The 12K gate number is based on minimum configuration at 180ULL process. However, you can get different gate count using different processes, some gives better figure and some give larger areas. For the Cortex-M0 DesignStart, as it has got 16 interrupts and the SysTick timer, the area would be a bit larger than 12K.



I didn't mean to pick on the Cortex-M0+. The core design has an extremely low power and area target which likely scales lower than is practical for a 68k core. The 12k gates you mention for the Cortex-M0 someone estimated at ~72,000 transistors but ARM would not confirm. This would be larger than the original 68000 core (and likely the smallest ColdFire cores) which is likely due to extreme power gating to save power but uses additional transistors. Caches also use at least 6 transistors per bit and more can be used to save power. The Cortex-M0+ has a 2 stage instead of 3 stage pipeline which saves transistors and a minimally featured core may be down to half the size of a 68000 which has some advanced features like a hardware multiplier. The RP2040 has 2 cores which together I expect are a little bigger than a 68000 core. It is the 264kiB of SRAM on the RP2040 that quickly uses more transistors than a 68060+AGA. Transistors are cheap and the SoC is available for $1 with profit margin included. A single chip Amiga SoC would likely target more performance and doesn't need to be that cheap. The key Amiga advantage is standard hardware with lots of software in a tiny footprint. Lower power and cost is important for embedded use but there isn't much competition until reaching hardware that can run standard Linux distros.
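
The SRAM point can be checked with simple arithmetic, using the 6 transistors per bit and 264kiB figures from above. The 68060 transistor count in the sketch is the commonly cited figure of roughly 2.5 million; it does not come from this thread, so treat it as an assumption. No figure is included for AGA. The names are hypothetical.

```python
# Assumption (not from this thread): commonly cited 68060 transistor count.
M68060_TRANSISTORS_ASSUMED = 2_500_000


def sram_transistors(size_kib: int, transistors_per_bit: int = 6) -> int:
    """Transistors in the SRAM cell array alone, ignoring decoders and sense amps."""
    bits = size_kib * 1024 * 8
    return bits * transistors_per_bit


if __name__ == "__main__":
    rp2040_sram = sram_transistors(264)
    print(f"RP2040 SRAM array: {rp2040_sram:,} transistors")
    print(f"Ratio to assumed 68060 count: {rp2040_sram / M68060_TRANSISTORS_ASSUMED:.1f}x")
```

This counts only the 6T cell array, so the real SRAM transistor count would be somewhat higher.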

Hammer Quote:

68060 micro-architecture wasn't designed for high clock speed.

68060 at 0.6 μm vs. Pentium's P54C at 0.6 μm, which reached 100 MHz. My TF1260's 68060 rev1 couldn't reach 70 MHz without a freeze.

68060 at 0.42 μm vs. Pentium's P54CS at 0.35 μm, which reached 200 MHz, and P55C at 0.35 μm, which reached 233 MHz.

Using Digital's 0.75 μm, Alpha 21064 reached 150 MHz (in 1992) and later reached 200 MHz (in 1993).

RDNA v2's 1 extra stage pipeline compared to RDNA v1 is for reaching higher clock speeds.
When compared to the smaller Zen 4C, the normal Zen 4's larger core design is for higher clock speed.

Designing a micro-architecture for high clock speed is an art, not just process tech.


Before the 68060 was released, it became clear the 68060 was mostly an embedded CPU, and for embedded use more performance at a lower clock speed is better (DMIPS/MHz). It costs money to increase the clock rating and Motorola didn't want the 68060 competing with the "high end" PPC chips, especially where the lower end shallow pipeline PPC designs like the PPC603(e) using newer processes had trouble reaching the integer performance of the 68060 per clock and shallower pipelines limited clock speeds. The 8 stage 68060 pipeline should have been easier to clock up than the early Pentiums and the shallow pipeline PPC designs. The 7 stage integer pipeline of the Alpha 21064 only achieved about 1.1 DMIPS/MHz, well below the 68060 and Pentiums. ARM cores had shallow pipelines and weak integer performance until the professional StrongARM design showed they could be clocked up with caches added, but they still had poor DMIPS/MHz. The ColdFireV5 could achieve 610 DMIPS @330MHz (0.13um in 2002) with a superscalar design very close to the 68060 but with 2 stages added. This was a fully synthesizable design, making it easy to work with and change processes while lacking custom blocks and other optimizations which could improve performance. Obviously, embedded cores are not optimized as much as high end desktop cores, although ARM has started optimizing their embedded designs more, which is why PPC designs are so outdated. It's the end of Moore's Law and AArch64 needed to look good compared to the RISC competition (PPC and ColdFire were finished off) and to be sure that 64 bit cores with their additional overhead were never slower than older ARM 32 bit cores. They are also more competitive now for the low end desktop in stealth desktop hardware like the Raspberry Pi.

Last edited by matthey on 19-Aug-2023 at 09:21 PM.

matthey 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 19-Aug-2023 21:33:46
#29 ]
Super Member
Joined: 14-Mar-2007
Posts: 1950
From: Kansas

kolla Quote:

Isn’t that because Cloanto already had the contracts and rights when it came to “with software emulation”? And now Amiga and Cloanto are the one and same, and hence can do both with and without software emulation?


Yea. No amount of intimidation and coercion by Ben could get around existing exclusive licenses despite the financial duress of Amiga Inc. after Pentti Kouri died in 2009, the same year the 2009 contract was signed. Vultures are opportunistic predators of the dead and waste no time swooping in for the corpse, even if not completely dead yet. It is impressive that Ben could turn from a contracted business partner that failed to deliver anything to challenging ownership of practically everything Amiga Inc. owned, which was the one thing forbidden in the contract he wrote.

Last edited by matthey on 19-Aug-2023 at 09:48 PM.
Last edited by matthey on 19-Aug-2023 at 09:41 PM.
Last edited by matthey on 19-Aug-2023 at 09:35 PM.

Hammer 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 21-Aug-2023 5:40:01
#30 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5125
From: Australia

@matthey

Quote:

Sysinfo uses non-compliant DMIPS code and is not superscalar aware. It is complete rubbish.

SysInfo's MIPS comparison is ONLY useful with other SysInfo MIPS results.

You compared SysInfo's 173 DMIPS with the CPU vendor's theoretical DMIPS.

Quote:

The 68k has better code density than the x86, shorter instructions on average (less than 3 bytes/instruction where x86 is closer to 3.5) and more favorable code alignment than x86 (16 bit aligned instructions instead of 8 bit aligned instructions).

1. Per clock, 68060 is missing half of Pentium's front-side bandwidth.

2. 68060's FPU is not pipelined!

3. My Taiwanese PC-Partner Intel 430VX chipset motherboard has an onboard 512 KB L2 cache. https://theretroweb.com/motherboards/s/pcpartner-mb520n-35-8258-xx

4. 68060 didn't have an on-chip L2 cache.

Quote:

The Pentium needed more caches and fat x86 cores didn't leave enough transistors for an on chip L2 cache (onboard != on chip). The off chip L2 cache required expensive memory so there was no free lunch, at least until economies of scale kicked in.

More Blah.

I owned TF1260's 68060 Rev 1 @ 62.5Mhz and the resulting Quake frame rate was substantially inferior to my old Pentium 150 with S3 Trio 64UV PCI-based PC clone.

100 Mhz 68060 Rev 6 Quake results are similar to 83 Mhz Pentium Overdrive on a 32bit FSB 486 motherboard!

Your repeated performance promises don't reflect the real world!

Quote:

The OoO Cortex-A72 gets destroyed by modern OoO x86-64 cores as well but at a cost.

This is well known. Hint: I have been posting ARM Cortex A57/A72 vs AMD Jaguar benchmarks for several years. Hint: 3DMarks Storm Physics benchmarks.

Cortex-A77 has two AGUs (connected to the Load buffer) and two Store units (connected to the Store buffer). Cortex-A77 leads into A78 and X1.

Quote:

The Cortex-A72 may have cost tens of millions to develop, may require a $5 chip due to size and borderline needs a fan while the x86-64 core may cost hundreds of millions to develop, may require a $50 chip due to size and definitely needs a fan.

1. I have slightly overclocked RPi 4B into 1.6 Ghz and it doesn't need a cooling fan.

2. X86-64 core's size is dependent on the microarchitecture's implementation. The need for a fan is dependent on the CPU and cooling configuration.

3. $50 chip, look in https://pcpartpicker.com/products/cpu/#sort=price&F=96,99,98,101
AMD Ryzen 3 4100 has $62.99 USD at retail and this is a "speed-bin" salvage Renoir APU with 156 mm2 chip size at TSMC's 7 nm process node.

4. ARM and X86 can spread their respective R&D cost over large unit sales. Large "economies of scale" matter.

Quote:

The in-order Cortex-A53 is the most popular ARM core and with native code outperforms the Cortex-A72 using emulated code at a significantly lower cost.

That's a useless comparison for this subject's use case.

Quote:

A modernized 68060 wouldn't even have to get that close in performance to the Cortex-A53 to be able to outperform the Cortex-A72 when executing 68k code and in-order cores are much cheaper to develop, perhaps in the low millions and can be produced for under $1 per chip.

Speculation is NOT fact.

Quote:

The 68060 core design is intrinsically good at accessing memory, especially for an in-order design. RISC designs need OoO to avoid load-to-use penalties (pipeline bubbles) of separately pipelined load/use and execution pipelines.

Your repeated performance promises don't reflect the real world!

Your argument didn't deliver superior Quake scores.

Quote:

CISC instructions often have the operation and memory access together which means they can be pipelined together which the 68060 does avoiding most load-to-use penalties without the overhead of OoO. Each of the two integer pipelines of the 68060 can calculate an effective address and access memory. The LEA instruction is common and doesn't access memory so this is beneficial even when only a single cache/memory access per cycle is allowed as is common with most in-order CPU cores. Furthermore, The 68060 has 4 independent data cache banks which allows multiple cache accesses.

More Blah.

Your repeated performance promises don't reflect the real world!

Your argument didn't deliver superior Quake scores.

For Quake, no 68060 Rev 6 configuration (including Warp 1260 with P96 RTG) has beaten my old Pentium 150 with S3 Trio 64UV PCI-based PC clone.

PS: Amiga SuperBuster Zorro III's effective bandwidth is trash compared to the Intel 430VX PCI chipset.

Quote:

The 68060 can also execute read-modify-write instructions in a single cycle which I don't believe was possible on any x86-64 core and is not possible with RISC.

Modern X86 CPUs still have read-modify-write (RMW) instructions, but it's preferable to use gather-scatter AVX-512 instructions, and then apply arithmetic instructions.

Zen 2 RMW example
ADD (mem32, reg) is 1 cycle latency.

Zen 4 RMW example
ADD (mem, reg32) is 1 cycle latency.

It depends on the microarchitecture's implementation.

Your argument doesn't factor in 68060's 32-bit front-side bus.


Quote:

I don't know if the 68060 multi-bank feature would allow two read-modify-write instructions per cycle simultaneously from each integer pipe allowing up to 2 reads and 2 writes from memory per cycle exceeding the Cortex-A72 load/store performance but the OoO execution and much larger caches more than makes up for any memory munching deficit.

Your argument doesn't factor in 68060's 32-bit front-side bus.


Quote:

The in-order 68060 really should be compared to the in-order Cortex-A53 but we can only guess at a level playing field with a modernized 68060.

"Modernized 68060" doesn't exist unless FPGA AC68080 (missing 68K MMU) is factored in.

Quote:

The ColdFireV5 core gets us partway there albeit with a lower power target than the already low power 68060 core design.

ColdFire V5 gained full superscalar for the ColdFire family.

Last edited by Hammer on 21-Aug-2023 at 08:08 AM.
Last edited by Hammer on 21-Aug-2023 at 07:21 AM.
Last edited by Hammer on 21-Aug-2023 at 07:03 AM.
Last edited by Hammer on 21-Aug-2023 at 05:49 AM.
Last edited by Hammer on 21-Aug-2023 at 05:48 AM.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

Hammer 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 21-Aug-2023 7:33:11
#31 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5125
From: Australia

@kolla

Quote:

kolla wrote:
@Rob

Isn’t that because Cloanto already had the contracts and rights when it came to “with software emulation”? And now Amiga and Cloanto are the one and same, and hence can do both with and without software emulation?


Cloanto Corporation and Amiga Corporation are still separate legal entities.

TheA500mini's AmigaOS IP is from Cloanto which is effectively an "Amiga Forever" emulation package with a custom UI, custom case, ARM-based platform, and USB physical controls.

Amiga Forever emulation package exposes AmigaOS's Workbench UI before Hyperion entered an agreement with Amiga Inc.

https://www.amigaforever.com/kb/14-117
Topic: This article discusses a number of issues affecting the 1997 Preview Edition of Amiga Forever.

Escom went bust in 1996.

Last edited by Hammer on 21-Aug-2023 at 07:57 AM.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

matthey 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 22-Aug-2023 1:57:36
#32 ]
Super Member
Joined: 14-Mar-2007
Posts: 1950
From: Kansas

Hammer Quote:

SysInfo's MIPS comparison is ONLY useful with other SysInfo MIPS results.

You compared SysInfo's 173 DMIPS with the CPU vendor's theoretical DMIPS.


Usually CPU vendors supply actual DMIPS benchmarks for released hardware. It would be scandalous if a large vendor gave benchmarks they couldn't back up which could result in lawsuits or criminal charges for fraud.

Hammer Quote:

1. Per clock, 68060 is missing half of Pentium's front-side bandwidth.

2. 68060's FPU is not pipelined!

3. My Taiwanese PC-Partner Intel 430VX chipset motherboard has an onboard 512 KB L2 cache. https://theretroweb.com/motherboards/s/pcpartner-mb520n-35-8258-xx

4. 68060 didn't have an on-chip L2 cache.


The Alpha 21164 released in 1995 was the first CPU to have an on-chip L2 cache (only 96kiB). The 8kiB L1 instruction cache was left small as this reduces access time, but an 8kiB Alpha L1 holds only a fraction of the instructions that a 68060 8kiB L1 instruction cache holds. To avoid this major instruction cache bottleneck, which is common with RISC CPUs, the L2 was added, which allowed the clock speed to increase from an Alpha 21064A CPU max of 300MHz to 500MHz. Although this on-chip L2 was innovative and increased performance, the RISC instruction bottleneck remained while the CPU chip increased from 2.85 million transistors to 9.3 million transistors and the power used increased from 33W to 56W. Even the 4th and last gen Pentium P55C released in 1996 only used 4.5 million transistors with a TDP of 17W (~26W by 1.5 rule of thumb multiplier) for the 1997 Pentium@233MHz. Perhaps Alpha's greatest innovation was their greatest contribution to their downfall. The idea that ISA code density doesn't matter and higher clock speeds would scale RISC performance up was severely flawed. The 68060 with ~2.5 million transistors and running cool enough for passive cooling and mobile applications was never really given a chance. The 68060 "replacement" PPC tried a different RISC gamble with cheap to design shallow pipeline designs with relatively large caches. This worked well until customers complained about the lack of performance due to not clocking PPC up and the RISC instruction cache bottleneck was exposed again.
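To illustrate the instruction cache point, here is a small sketch of how many instructions fit in an 8kiB cache at different average instruction sizes. The 3 bytes/instruction value for the 68k is an assumption taken from the "less than 3 bytes/instruction" figure quoted earlier in the thread, and the Alpha value reflects its fixed 32-bit instruction encoding. The sketch does not model any difference in the number of instructions needed to do the same work, so it only shows the capacity side of the argument.

```python
# Instructions that fit in an L1 instruction cache, by average instruction size.
# Assumptions (illustrative only):
#   68k   : 3 bytes/instruction (the thread quotes "less than 3", so conservative)
#   Alpha : fixed 4-byte instructions
CACHE_BYTES = 8 * 1024


def instructions_in_cache(cache_bytes: int, avg_instr_bytes: float) -> int:
    """Whole instructions that fit in a cache of `cache_bytes` bytes."""
    return int(cache_bytes // avg_instr_bytes)


m68k = instructions_in_cache(CACHE_BYTES, 3)
alpha = instructions_in_cache(CACHE_BYTES, 4)
print(f"68k:   {m68k} instructions")
print(f"Alpha: {alpha} instructions ({alpha / m68k:.0%} of the 68k count)")
```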

As far as 68060 performance goes, the 68060 lacks several high performance features compared to the Pentium as you mentioned above (including an on-motherboard L2 cache but excluding an on-chip L2 cache, which would have made the Pentium more expensive than the Alpha 21164) and the 68060 received much less developer support compared to the Pentium. Recall the ByteMark benchmark I have previously posted which demonstrates specific weaknesses in 68060 compiler support.

https://amigaworld.net/modules/newbb/viewtopic.php?topic_id=44391&forum=25#847418

Early versions of the GCC compiler 68k backend started to mature and integer benchmarks show a roughly 40% integer advantage of the 68060 vs Pentium with on board 256kiB L2 cache at the same clock speed. The same GCC 68k code demonstrated poor results for the 68060 FPU but my vbcc backend FPU improvements point toward poor GCC FPU support for the 68060, with the 68060 being nearly on par with the Pentium reference benchmark (the common case of mixed integer and FPU code performs well as I suspected despite the lack of a fully pipelined 68060 FPU). The vbcc compiler lacks the integer backend maturity of the old GCC for the 68k but still shows a 21% integer advantage for the 68060 with vbcc vs the Pentium reference benchmark at the same clock speed. The ByteMark benchmark is actually more comprehensive than the Dhrystone benchmark without requiring the large caches of modern CPU cores and may require proprietary code that is not free. Both GCC and vbcc lack a superscalar scheduler for the 68060, which is usually very detrimental to in-order CPU core performance, especially RISC cores, yet the 68060 has impressive performance without this scheduling. Motorola realized the importance of an instruction scheduler with their 68060 compiler support, which may be why their Dhrystone results are so good.

The Superscalar Architecture of the MC68060 Quote:

The execution example comes from the Dhrystone benchmark, a well known synthetic integer program. In the example, the processor executes code generated by the Diab Data 3.4A C compiler. Developed to fully exploit the chip's superscalar architecture, this compiler uses a number of advanced instruction-scheduling techniques.


I doubt any Amiga developer has seen 68060 code generated by the Diab Data compiler, which is the only 68060 compiler instruction scheduler I'm aware of. This clearly demonstrates that Motorola was working with the Dhrystone benchmark on the 68060 rather than calculating a "theoretical" DMIPS result. Most likely their given Dhrystone result was a work in progress result while they could see that the 68060 had significantly more potential. Mitch Alsup pointed out that one of the weaknesses of Motorola was not having in house compiler support, as we can see here where the 68060 compiler support didn't go anywhere. Not only was he an architect of Motorola CPU cores but he also worked on compiler support.

The ByteMark benchmark doesn't prove the 68060 had better performance than the Pentium. Compilers for x86 could have been poor also despite much more development effort but x86 backends were good enough to outperform Alpha, PPC and ARM CPU chips after x86 Pentiums received a deeper pipeline like the 68060 allowing them to be clocked up. I feel like I'm repeating myself and wasting my time here like I did when I was developing for the Amiga. I'd prefer not to continue this conversation with you repeating the same points ad nauseam. I'm more than aware of the 68060 lack of high performance features but I don't believe it hurt the performance nearly as much in comparison to the Pentium as the lack of compiler support.

Hammer Quote:

More Blah.

I owned TF1260's 68060 Rev 1 @ 62.5Mhz and the resulting Quake frame rate was substantially inferior to my old Pentium 150 with S3 Trio 64UV PCI-based PC clone.

100 Mhz 68060 Rev 6 Quake results are similar to 83 Mhz Pentium Overdrive on a 32bit FSB 486 motherboard!

Your repeated performance promises don't reflect the real world!


I wouldn't be surprised if more man hours have been spent on optimizing Quake for the Pentium than all the man hours spent on 68060 compiler support combined.

Hammer Quote:

1. I have slightly overclocked RPi 4B into 1.6 Ghz and it doesn't need a cooling fan.

2. X86-64 core's size is dependent on the microarchitecture's implementation. The need for a fan is dependent on the CPU and cooling configuration.

3. $50 chip, look in https://pcpartpicker.com/products/cpu/#sort=price&F=96,99,98,101
AMD Ryzen 3 4100 has $62.99 USD at retail and this is a "speed-bin" salvage Renoir APU with 156 mm2 chip size at TSMC's 7 nm process node.

4. ARM and X86 can spread their respective R&D cost over large unit sales. Large "economies of scale" matter.


Using a small enough chip process, the Cortex-A72 and a weak GPU do not need cooling. The Raspberry Pi Foundation moved from a cheaper 40nm process for the in-order Cortex-A53 to a more expensive 28nm process for the OoO Cortex-A72 to make passive cooling possible. This was probably a good trade off for customers wanting to use the RPi 4 like a desktop computer but I'm not sure they could have started there without their early embedded and hobby market grab. They could also move to a better GPU where the need for an OoO CPU reduces the power budget available for the GPU and a more expensive chip foundry process would be required. Of course it's difficult for ARM CPU performance to improve much from here and they would be competing with significantly higher performance OoO x86-64 CPUs with a better GPU than it is practical to provide for "cheap" hardware.

When talking about the cost and development time of OoO x86-64 CPUs, I was trying to give an idea of the order of magnitude of these monsters compared to OoO ARM chips. The performance/$ and performance/W of x86-64 CPUs is high but at least the ARM reference designs have a ways to go to compete in high performance and desktop gaming markets.

Hammer Quote:

More Blah.

Your repeated performance promises don't reflect the real world!

Your argument didn't deliver superior Quake scores.

For Quake, no 68060 Rev 6 configuration (including Warp 1260 with P96 RTG) has beaten my old Pentium 150 with S3 Trio 64UV PCI-based PC clone.

PS: Amiga SuperBuster Zorro III's effective bandwidth is trash compared to the Intel 430VX PCI chipset.


The earliest Pentium@150MHz was not released until 1996 and already had at least 3 and maybe 4 die shrinks compared to the original Pentium. As I recall, PCI was adopted as the de facto industry standard in 1995 and the majority of new PCs may not have come with it until 1996 due to the popularity of older hardware (PCI generally arrived starting with 2nd gen Pentiums). C= was going bankrupt before PCI became popular and left handicapped and buggy Zorro III hardware that could only achieve a fraction of the potential bandwidth of the Zorro III standard.

Hammer Quote:

Modern X86 CPUs still have read-modify-write (RMW) instructions, but it's preferable to use gather-scatter AVX-512 instructions, and then apply arithmetic instructions.

Zen 2 RMW example
ADD (mem32, reg) is 1 cycle latency.

Zen 4 RMW example
ADD (mem, reg32) is 1 cycle latency.

It depends on the microarchitecture's implementation.


Your Zen RMW examples are not RMW operations. They are read operations.

Last edited by matthey on 22-Aug-2023 at 02:07 AM.

BigD 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 22-Aug-2023 6:42:29
#33 ]
Elite Member
Joined: 11-Aug-2005
Posts: 7277
From: UK

@matthey

Benchmarking 68k vs Pentium CPUs is about as far from the ideals of THEA500 Mini as could be! Whether a 1996 150Mhz Pentium beat a 1994 060 in performance is far removed from an Arm based console emulating roughly an 030 AGA Amiga setup in 2023! What are you getting at?

Quote:
When talking about the cost and development time of OoO x86-64 CPUs, I was trying to give an idea of the order of magnitude of these monsters compared to OoO ARM chips. The performance/$ and performance/W of x86-64 CPUs is high but at least the ARM reference designs have a ways to go to compete in high performance and desktop gaming markets.


That part of your comment seems on topic IMHO. Agreed.

Last edited by BigD on 22-Aug-2023 at 06:45 AM.

_________________
"Art challenges technology. Technology inspires the art."
John Lasseter, Co-Founder of Pixar Animation Studios

matthey 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 22-Aug-2023 20:42:19
#34 ]
Super Member
Joined: 14-Mar-2007
Posts: 1950
From: Kansas

BigD Quote:

Benchmarking 68k vs Pentium CPUs is about as far from the ideals of THEA500 Mini as could be! Whether a 1996 150Mhz Pentium beat a 1994 060 in performance is far removed from an Arm based console emulating roughly an 030 AGA Amiga setup in 2023! What are you getting at?


Hammer mentioned the following claim of THEA500 Mini.

Hammer Quote:

Nope. https://amigang.com/amigamini-thea500/
The emulated 68040 with JIT reached 173 MIPS and 116 MFLOPS.

As long as ARM CPUs evolve, any TheA500mini or A600GS follow-up can also evolve in a cost-effective price range.


My reply was that a 68060@114MHz should have 173+ DMIPS according to Motorola claims which he unfortunately also refuted. While this may be barely adequate for emulating standard 68000+ECS and 68020+AGA Amigas, it is poor performance especially for the hardware.
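The 114MHz figure can be reproduced with a short sketch. It assumes the 1.52 DMIPS/MHz Motorola claim for the 68060 that is cited elsewhere in this thread and treats the 173 MIPS reported for THEA500 Mini as if it were directly comparable, which is exactly the point under dispute above. The names are only illustrative.

```python
import math

# Assumptions: 1.52 DMIPS/MHz is the Motorola claim cited in this thread;
# 173 is the emulation figure reported for THEA500 Mini (SysInfo MIPS),
# treated here as comparable to DMIPS for the sake of the estimate.
M68060_DMIPS_PER_MHZ = 1.52
TARGET_DMIPS = 173


def clock_for_dmips(target_dmips: float, dmips_per_mhz: float) -> float:
    """Clock in MHz needed to reach `target_dmips` at a given DMIPS/MHz."""
    return target_dmips / dmips_per_mhz


needed = clock_for_dmips(TARGET_DMIPS, M68060_DMIPS_PER_MHZ)
rounded = math.ceil(needed)
print(f"Clock needed: {needed:.1f} MHz, rounded up to {rounded} MHz")
print(f"DMIPS at {rounded} MHz: {rounded * M68060_DMIPS_PER_MHZ:.2f}")
```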

https://amigang.com/amigamini-thea500/ Quote:

The A500 mini features an All Winner H6 chip which is an ARM Cortex A53 CPU; this is the same chip that powers Raspberry Pi 3 and Pi Zero 2. The clock speed is unknown but it's also stated that the system is faster than a Pi 3B+ which was clocked at 1.4GHz. A53 chips can run up to 2GHz.


It takes a 1.4+ GHz ARM Cortex-A53 CPU (~40nm chip process) to emulate a 68060@114MHz (~500nm) and Amiga custom chips (~5000nm) and this is a good place to start evolving like it is an Amiga base platform? Why would Amiga users with 68060 hardware be interested considering the reduced compatibility and higher latency of emulation? Why would WinUAE x86-64 users want reduced performance emulation? Why would new and NG Amiga users want to experience 1990s CPU performance again? Maybe this is ok for THEA500 Mini "toy" but is this all we will get for an Amiga Maxi and A600GS with general purpose Amiga capabilities?

BigD 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 22-Aug-2023 20:51:14
#35 ]
Elite Member
Joined: 11-Aug-2005
Posts: 7277
From: UK

@matthey

Since the Emu68 approach results in higher performance, that's maybe something AmigaKit are trying to replicate through streamlining their System54 OS optimisation ported to 68k?

Personally 030/50 performance is OK for the Mini but given the hardware is also able to emulate the PS1, N64 and even some Dreamcast games, there's probably room for optimisation on the Amiga side. Pandory is able to get Jim Power to run smoothly whereas out of the box it's a dog!

Last edited by BigD on 22-Aug-2023 at 08:54 PM.
Last edited by BigD on 22-Aug-2023 at 08:52 PM.

_________________
"Art challenges technology. Technology inspires the art."
John Lasseter, Co-Founder of Pixar Animation Studios

matthey 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 22-Aug-2023 22:01:15
#36 ]
Super Member
Joined: 14-Mar-2007
Posts: 1950
From: Kansas

BigD Quote:

Since the Emu68 approach results in higher performance, that's maybe something AmigaKit are trying to replicate through streamlining their System54 OS optimisation ported to 68k?

Personally 030/50 performance is OK for the Mini but given the hardware is also able to emulate the PS1, N64 and even some Dreamcast games, there's probably room for optimisation on the Amiga side. Pandory is able to get Jim Power to run smoothly whereas out of the box it's a dog!


One small step for Amiga emulation, one giant leap backward for Amiga technology and innovation. Based on Moore's Law, Intel's David House predicted in 1975 that computer chip performance would roughly double every 18 months. In AmigaNever land, hardware performance can't double in 18 years. Is there any Amiga embarrassment great enough to signal the Amiga has hit rock bottom? Isn't it obvious the Amiga market can support mass production offering much better value after THEA500 Mini success without "Amiga" branding, without AmigaOS for general purpose use, without ethernet/WiFi, using cheap barely adequate ARM emulation and not being particularly low cost?
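For scale, here is a minimal sketch of what an 18 month doubling period would compound to over 18 years. It simply applies the doubling rule stated above as given and is not a claim about actual hardware history; the function name is hypothetical.

```python
# Compound growth under a fixed doubling period.
# Assumption: performance doubles every 18 months, as stated in the post.
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Multiplicative growth after `years` with one doubling per period."""
    doublings = years * 12 / doubling_months
    return 2.0 ** doublings


for years in (1.5, 18):
    print(f"{years} years -> {growth_factor(years):,.0f}x")
```

Eighteen years is 12 doubling periods, so the rule implies a factor of 4096 rather than a factor of 2.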

Last edited by matthey on 22-Aug-2023 at 10:11 PM.
Last edited by matthey on 22-Aug-2023 at 10:04 PM.

BigD 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 23-Aug-2023 7:53:40
#37 ]
Elite Member
Joined: 11-Aug-2005
Posts: 7277
From: UK

@matthey

Unless you are still hankering after the Amiga Team topping the Compute Folding Charts in 2023 or thinking we can retake a significant slice of the desktop computing market I can't see your concern!

Apps that require a doubling of performance on 68k are only now being written due to Vampire and PiStorm hardware availability! No one yearns for PPC software or anything developed for AROS or WinUAE only when 68k was dormant IMHO!

_________________
"Art challenges technology. Technology inspires the art."
John Lasseter, Co-Founder of Pixar Animation Studios

matthey 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 24-Aug-2023 5:01:03
#38 ]
Super Member
Joined: 14-Mar-2007
Posts: 1950
From: Kansas

BigD Quote:

Unless you are still hankering after the Amiga Team topping the Compute Folding Charts in 2023 or thinking we can retake a significant slice of the desktop computing market I can't see your concern!


The PPC AmigaNOne targets the desktop market and a few thousand units have sold, mostly for that purpose. The Raspberry Pi targets hobbyist, embedded and educational markets and has sold roughly 50 million units, yet likely has millions of users using it as a low end desktop. The AmigaNOne has too small of a user base to attract serious development while the Raspberry Pi has desktop like productivity software that AmigaNOne users can only dream about. The Raspberry Pi Foundation isn't hyping desktop hardware as it is more productive to focus on value, even though they gave low end desktop users more performance and memory with the Raspberry Pi 4. They have also gone smaller with the RP2040 SoC and became vertically integrated to try to unlock more value by cutting out the large markup of off the shelf commodity chips. They can use ARM a la carte IP, saving development time, but it doesn't come free when trying to minimize production cost for mass production. The $1 SoC is used by many embedded and hobby devices and I expect has sold in the millions by now despite not having a standard OS with a GUI. The SoC has two 133 MHz ARM Cortex-M0+ cores, 264 kiB SRAM, a DMA controller and I/O. This isn't too far from the original Amiga 1000 spec with a GUI, but this SoC uses more transistors than a 68060+AGA because of the transistors used for the 264 kiB SRAM (6 transistors per bit). The Amiga can scale down to about this footprint but more resources are more comfortable, transistors are cheap and it is not necessary to target as low a power level as the RP2040. It would be advantageous to have at least 2MiB of memory for an Amiga SoC because this gives the Amiga most of the retro 68k Amiga games market, a much larger market than any retro Acorn RISC OS market and which THEA500 Mini has shown to be valuable.

Some people think ARM cores are superior to the "outdated" 68060. The 68060 lacks many modern features of ARM in-order cores but stands up pretty well. More performance at a lower clock speed is better for embedded use as higher clock speeds generate more heat and use more power. The 133MHz Cortex-M0+ cores in the RP2040 are only 0.99 DMIPS/MHz (ARM claim) while a 68060@90MHz with 1.52+ DMIPS/MHz (Motorola claim) is higher performance. The ARM counter to low integer single core performance is to add more cores but this only works well for parallel workloads which Amiga emulation is not. David House predicted performance doubles every 18 months but I believe ARM designs were not able to surpass the 1994 68060 in integer single core performance as measured by DMIPS/MHz for over a decade until the 2005 Cortex-A8 which had 2.0 DMIPS/MHz (ARM claim). The in-order superscalar Cortex-A8 had 13 pipeline stages compared to the 68060 8 stages which gives more instruction level parallelism and potentially higher clock speeds at the cost of longer pipeline refills from mispredicted branches and interrupts and more transistors used. Despite the deeper pipeline not really being a good tradeoff for an embedded core, the Cortex-A8 was very popular (it was kind of like the Pentium 4 of ARM CPUs). By this time, a Cortex-A8 had at least twice and more commonly four times the L1 cache sizes of the 68060 and often came with a 128kiB or 1MiB L2 cache and, despite the ARM reputation for small cores, used many more transistors and likely more than the later 2011 in-order superscalar Cortex-A7 which returned to the more practical 8 stage pipeline (dropping to 1.9 DMIPS/MHz) and typically using about four times the transistors of the 68060, mostly for caches. The chip process for the 68060 started at 500nm while the Cortex-A8 started at 65nm and the Cortex-A7 started at 40nm. 
All the cores mentioned so far have been 32 bit cores with roughly equivalent extreme code compression which improves cache efficiency, reduces memory requirements and reduces memory bandwidth requirements. ARM's compressed Thumb encodings allowed them to gain major embedded market share and compete against the 68k which was not getting many new and especially not high performance designs. Motorola suits decided fat PPC was good for embedded use and lost all but the highest end embedded market to ARM Thumb cores (PPC cores could achieve better DMIPS/MHz than the weak performance ARM cores but needed more expensive hardware). Motorola created a simplified 68k ISA called ColdFire for low end embedded use but lost most of their 68k embedded market by not making it compatible enough to the 68k due to wanting to kill off the 68k so it couldn't compete with PPC. Despite ColdFire losing some performance and compression (code density) compared to the 68020 ISA, the 2002 in-order superscalar ColdFireV5 achieved 1.83 DMIPS/MHz with a design copied from the 68060 with a few more modern features added but still only 16kiB I+D caches, no L2 cache and using only a 130nm chip process. I believe the ColdFireV5 had better single core integer performance than any ARM designed core when it became available and even remained competitive against the significantly newer 2005 Cortex-A8 and 2011 Cortex-A7. Up to this point, the biggest 68k performance impediment was Motorola management who threw their baby out with the bathwater. The story is not over though. In 2012, ARM introduced the extremely popular 64 bit in-order superscalar Cortex-A53 with a higher performance AArch64 ISA now achieving 2.3 DMIPS/MHz. AArch64 does have some high performance features that likely explain some of the performance boost from previous year Cortex-A7 with 1.9 DMIPS/MHz. 
The number of integer general purpose registers was increased to 32, which is the same as PPC, and CISC-like addressing modes were added, as on the 68k, helping to decrease the elevated instruction counts of Thumb encodings (equivalent Thumb2 code could use 20% more instructions than 68020 code despite being similar size). AArch64 code can be 50% larger than 68020 and Thumb2 32 bit code but PPC code density is worse (PPC 32 bit code is probably about 20% larger). 64 bit pointers are bigger and slower, often making data harder to push as well. AArch64 makes compiler support easier and I believe ARM worked hard to improve it for a successful launch. I believe ARM stepped up their design quality and increased the design options like supporting more chip process sizes and more cache sizes. It is possible for the Cortex-A53 to use a 40nm chip process with 8kiB L1I/D caches and no L2. This CPU would be a total dog and I'd be surprised to see 2.3 DMIPS/MHz, but at 10nm with 32kiB L1I/D and 2MiB L2, maybe a Cortex-A53@1.4GHz could emulate more than a 68060@114MHz. The cost of 64 bit in the Cortex-A53 isn't too bad though as the transistors used are maybe roughly five times that of a 68060 not counting the other 3 cores which are commonly in a chip. For the early 64 bit OoO Cortex-A57 this jumps to roughly 30 times the transistors of a 68060 for one core and doesn't even double the single core integer performance at 4.1 DMIPS/MHz. ARM went big to try and compete in the desktop and server markets while focusing less on their historic small and cheap embedded market. Sorry for the long paragraph but I hope it gives some perspective that I'm not talking about anything close to the desktop.
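
For anyone who wants to check the DMIPS/MHz arithmetic, here is a rough Python sketch. It only multiplies the vendor-claimed DMIPS/MHz figures quoted above by the clock speed; the function names are made up for illustration and real workloads (caches, bus, memory) will not scale this cleanly.

```python
# Rough sketch: peak Dhrystone throughput = DMIPS/MHz rating x clock in MHz.
# The ratings are the vendor claims quoted in the post, not measurements.
CORES = {
    "Cortex-M0+ (RP2040)": (0.99, 133),  # (DMIPS/MHz, MHz), ARM claim
    "68060": (1.52, 90),                 # (DMIPS/MHz, MHz), Motorola claim
}


def dmips(per_mhz, mhz):
    """Total DMIPS for a core with the given rating and clock speed."""
    return per_mhz * mhz


def mhz_to_match(target_dmips, per_mhz):
    """Clock speed a core with this rating would need to reach target_dmips."""
    return target_dmips / per_mhz


if __name__ == "__main__":
    for name, (per_mhz, mhz) in CORES.items():
        print(f"{name} @ {mhz}MHz: {dmips(per_mhz, mhz):.1f} DMIPS")
    target = dmips(*CORES["68060"])
    needed = mhz_to_match(target, CORES["Cortex-M0+ (RP2040)"][0])
    print(f"Cortex-M0+ clock needed to match the 68060@90MHz: {needed:.1f}MHz")
```

The same two helpers can be fed any of the other DMIPS/MHz claims in this post to compare cores at a given clock.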

BigD Quote:

Apps that require a doubling of performance on 68k are only now being written due to Vampire and PiStorm hardware availability! No one yearns for PPC software or anything developed for AROS or WinUAE only when 68k was dormant IMHO!


It's easier to compile more modern software and not have to optimize it to run on ancient hardware. We just need to make Amiga hardware cheap enough for the masses so everyone who wants to can upgrade. Who doesn't want to see what 68k hardware could do without Motorola management "limitations" and what Amiga hardware could do without C= management "limitations"? Only big businesses could fund and do the kind of work they did, and chose not to do, but not anymore. The 68060 is a small in-order core that could be developed by a small team today and the Amiga chipset is simple enough that the logic has been reverse engineered and put into FPGA several times already, sometimes by a single person. This is not rocket science and it has never been easier and cheaper to develop and produce an Amiga SoC (single chip Amiga). Road blocks and highway robbers would rather have rare bastard AmigaNOnes for the classes on the desktop though. I could always do some more comparisons of modern desktop CPUs with their many billions of transistors compared to the old AmigaNOne embedded PPC cores using tens of millions but I'll spare you.

Last edited by matthey on 24-Aug-2023 at 01:28 PM.

BigD 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 24-Aug-2023 8:12:33
#39 ]
Elite Member
Joined: 11-Aug-2005
Posts: 7277
From: UK

@matthey

Quote:
I could always do some more comparisons of modern desktop CPUs with their many billions of transistors compared to the old AmigaNOne embedded PPC cores using tens of millions but I'll spare you.


Thank you!

_________________
"Art challenges technology. Technology inspires the art."
John Lasseter, Co-Founder of Pixar Animation Studios

Hammer 
Re: Retro Games Limited - THEA500 Mini - Future?
Posted on 25-Aug-2023 2:03:05
#40 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5125
From: Australia

@matthey

Quote:

Usually CPU vendors supply actual DMIPS benchmarks for released hardware. It would be scandalous if a large vendor gave benchmarks they couldn't back up which could result in lawsuits or criminal charges for fraud.

Read the fine print. SysInfo DMIPS is only useful within the SysInfo DMIPS context.

There are other benchmarks such as MacBench 4.0 (runs within Shapeshifter's MacOS 68K), Doom, Quake, Cinema 4D, Lightwave 3D, etc.

Motorola's DMIPS claim for the 68060 falls apart when the workload has sustained external bus access.

Quote:

The Alpha 21164 released in 1995 was the first CPU to have an on-chip L2 cache (only 96kiB).

The 8kiB instruction cache L1 was left small as it reduces access time but an 8kiB L1 has a fraction of the instructions in this L1 compared to a 68060 8kiB L1 instruction cache.

To avoid this major instruction cache bottleneck which is common with RISC CPUs, the L2 was added which allowed the clock speed to increase from an Alpha 21064A CPU max of 300MHz to 500MHz. Although this on-chip L2 was innovative and increased performance, the RISC instruction bottleneck remained while the CPU chip increased from 2.85 million transistors to 9.3 million transistors and the power used increased from 33W to 56W. Even the 4th and last gen Pentium P55C released in 1996 only used 4.5 million transistors with a TDP of 17W (~26W by 1.5 rule of thumb multiplier) for the 1997 Pentium@233MHz. Perhaps Alpha's greatest innovation was their greatest contribution to their downfall. The idea that ISA code density doesn't matter and higher clock speeds would scale RISC performance up were severely flawed. The 68060 with ~2.5 million transistors and running cool enough for passive cooling and mobile applications was never really given a chance. The 68060 "replacement" PPC tried a different RISC gamble with cheap to design shallow pipeline designs with relatively large caches. This worked well until customers complained about the lack of performance due to not clocking PPC up and the RISC instruction cache bottleneck was exposed again.

Pentium P55C refers to the Pentium MMX, which was released in January 1997. The Pentium MMX's announcement in October 1996 was a paper launch and it was too late for Xmas Q4 1996 sales.

Pentium (P54CS) reached 200 MHz in 1996.
Pentium Pro (P6) reached 200 MHz in 1995. The 1995-era Pentium Pro has an on-package L2 cache in multi-chip module (MCM) format.

Engineers from DEC Alpha made their attempt at combining code density with a high clock speed in AMD's K7 Athlon.

IBM has designed some 64-bit POWER CPUs with very high clock speeds, such as the PPE and POWER6, which take a long-pipeline, high-clock-speed approach similar to AMD's Bulldozer and Intel's NetBurst Pentium 4.

What's important is application performance, e.g. the 68060 didn't deliver the quickest render time for Lightwave in the 1995-1996 time period. Reaching a high clock speed is a design feature.

Quote:

As far as 68060 performance, the 68060 lacks several high performance features compared to the Pentium as you mentioned above (including an on mother board L2 cache but excluding an on-chip L2 cache which would have made the Pentium more expensive than the Alpha 21164)

For 1996, the classic Pentium is a "Celeron" relative to the Pentium Pro. The 1998-era Celeron 300A has an on-chip 128 KB L2 cache and it's good enough for games.

The 68060 competition from Phase 5 and QuikPak lacks an on-PCB L2 cache.

After Escom purchased Commodore in 1995, the 1996-era QuikPak A4000T-060 had a 68060 @ 66 MHz SKU.

I owned an A3000/030@25MHz (with 4MB Fast RAM, 2MB Chip RAM) in 1996 and Quake's frame rate vs. cost was an important factor, i.e. a Phase 5 Cyberstorm 060/Cybervision 64 (S3 Trio 64U) wasn't cost-effective compared to a Pentium 150/S3 Trio 64UV+.

My TF1260's 68060 rev1 is overclocked to 62.5 MHz and its overall performance is about that of a Pentium 60.

An A1200 in AGA 256 color mode with PiStorm32-Lite/Emu68 on an RPi 4B can drive the Quake demo3 320x200 benchmark into the 47+ fps range.

Quote:

and the 68060 received much less developer support compared to the Pentium. Recall the ByteMark benchmark I have previously posted which demonstrates specific weaknesses in 68060 compiler support.

A good compiler wouldn't have solved the 68060's legacy 68040 32-bit front-side bus problem.

For the P5 Pentium, Intel replaced the 486's 32-bit bus for a reason. For the Pentium OverDrive @ 83 MHz, released in 1995, the 486's 32-bit bus is a major performance bottleneck.

Last edited by Hammer on 25-Aug-2023 at 03:12 AM.
Last edited by Hammer on 25-Aug-2023 at 02:59 AM.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

Copyright (C) 2000 - 2019 Amigaworld.net.
Amigaworld.net was originally founded by David Doyle