Poster | Thread |
redfox
| |
Re: Trevor's Amiga Blog Posted on 21-Apr-2023 23:16:44
| | [ #1541 ] |
|
|
|
Elite Member |
Joined: 7-Mar-2003 Posts: 2078
From: Canada | | |
|
| @number6
Thanks for the link to Trevor's blog.
I hope we get some A1222+ news soon.
--- redfox
|
|
Status: Offline |
|
|
dirkzwager
| |
Re: Trevor's Amiga Blog Posted on 23-Apr-2023 11:39:26
| | [ #1542 ] |
|
|
|
Regular Member |
Joined: 9-Aug-2019 Posts: 129
From: Belgium, Limburg, Bilzen | | |
|
| @redfox
I have hoped for that for so many years. I have now had my Sam460 for 3 years. _________________ Amiga 500+, Pi 3b+ and Amiga os 3.10 Amikit XE usb Stick Powerbook 17" and MorphOS 3.13 PowerMac G5 2.3 MP and MorphOS 3.13 Sam 460 with Amiga os 4.1 and checkmate 1500 www.sitedesign.be |
|
Status: Offline |
|
|
amigakit
| |
Re: Trevor's Amiga Blog Posted on 5-Jun-2024 22:58:08
| | [ #1543 ] |
|
|
|
Amiga Kit |
Joined: 28-Jun-2004 Posts: 2595
From: www.amigakit.com | | |
|
| |
Status: Offline |
|
|
Hammer
| |
Re: Trevor's Amiga Blog Posted on 6-Jun-2024 3:27:10
| | [ #1544 ] |
|
|
|
Elite Member |
Joined: 9-Mar-2003 Posts: 5906
From: Australia | | |
|
| @NutsAboutAmiga
SAM460 seems to have less technical drama when compared to Articia S AmigaOnes and a certain S54 with a nonstandard PPC FPU. _________________ Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68) Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68) Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB |
|
Status: Offline |
|
|
kolla
| |
Re: Trevor's Amiga Blog Posted on 6-Jun-2024 7:20:54
| | [ #1545 ] |
|
|
|
Elite Member |
Joined: 20-Aug-2003 Posts: 3235
From: Trondheim, Norway | | |
|
| @amigakit
Why does Trevor write that the screenshot is of sysinfo under Amibench when it clearly is kickstart 1.x components that are listed? Why isn’t it showing sysinfo under Amibench? Why is it so hard to get things right? Why, mummy, why why why? _________________ B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC |
|
Status: Offline |
|
|
Cheese
| |
Re: Trevor's Amiga Blog Posted on 6-Jun-2024 19:18:20
| | [ #1546 ] |
|
|
|
Regular Member |
Joined: 23-Oct-2006 Posts: 315
From: Unknown | | |
|
| @kolla
_________________ x86/MorphOS 4.0
"Delving into the past can be a dangerous exercise." -hyperionmp
"I've been a supporter of "REACTION" GUI because is an Amiga OS thing." -Snuffy
"I personally prefer a vision of do'ers and makers rather than |
|
Status: Offline |
|
|
matthey
| |
Re: Trevor's Amiga Blog Posted on 7-Jun-2024 3:23:37
| | [ #1547 ] |
|
|
|
Elite Member |
Joined: 14-Mar-2007 Posts: 2355
From: Kansas | | |
|
| amigakit Quote:
"Innovation", "passion", "creativity" and "pioneering" created personal computers like the Amiga but the Amiga today is about endless books, emulation and preserving what will soon be forgotten. RIP Amiga.
|
|
Status: Offline |
|
|
pavlor
| |
Re: Trevor's Amiga Blog Posted on 7-Jun-2024 14:44:02
| | [ #1548 ] |
|
|
|
Elite Member |
Joined: 10-Jul-2005 Posts: 9636
From: Unknown | | |
|
| @amigakit
Thanks for posting! |
|
Status: Offline |
|
|
number6
| |
Re: Trevor's Amiga Blog Posted on 28-Aug-2024 18:55:24
| | [ #1549 ] |
|
|
|
Elite Member |
Joined: 25-Mar-2005 Posts: 11619
From: In the village | | |
|
| @thread
Just adding Trevor's blog posting from August 6, 2024:
Back to winter down-under
#6
_________________ This posting, in its entirety, represents solely the perspective of the author. *Secrecy has served us so well* |
|
Status: Offline |
|
|
matthey
| |
Re: Trevor's Amiga Blog Posted on 29-Aug-2024 19:59:51
| | [ #1550 ] |
|
|
|
Elite Member |
Joined: 14-Mar-2007 Posts: 2355
From: Kansas | | |
|
| #6 Quote:
Trevor is just as oblivious and delusional as ever. All is well in Amiga Neverland.
http://blog.a-eon.biz/blog/index.php/2024/08/06/back-to-winter-down-under/ Quote:
AmiBench’s performance is further boosted by AmigaKit’s ARMgraphics.library, which bypasses the 68K graphics bottleneck to accelerate graphics rendering.
|
There was an Amiga chip memory bottleneck due to cheap CBM chipsets and memory but it had nothing to do with the 68k and it is removed on the A600GS with chip memory speed "a massive 412.98 compared to the chip speed of an A600". Where is there a "68K graphics bottleneck"?
No 68k CPU has "graphics" inside the CPU so logically any bottleneck would be from accessing the external address space of memory and hardware registers but this is where the load/store ARM Cortex-A53 has bottlenecks not 68k CPUs.
1. load-to-use stall bottleneck
2. load/store bottleneck
3. fetch and instruction cache bottleneck due to poor code density
Native ARM code with instruction scheduling can partially remove the huge ARM Cortex-A53 bottleneck due to load-to-use stalls. Simple and quick emulation conversion of 68k to AArch64 code usually does not include instruction scheduling though. Worst case instruction scheduling can be expected as the 68k CPUs do not have load-to-use stalls so the code is not scheduled to avoid them (load results are usually accessed by the next instruction stalling the Cortex-A53). A load-to-use penalty of 3 cycles is a performance killer and bad for an 8-stage CPU especially when 68k CPUs didn't have any load-to-use penalty. Then there is the RISC load/store instruction bottleneck that requires more instructions, memory traffic and registers than CISC.
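The scheduling point can be made concrete with a toy model. This is a minimal sketch, assuming a single-issue in-order pipeline where a load result becomes usable 1 cycle plus the load-to-use penalty after the load issues; the `Instr` type, the `cycles` function and the register names are hypothetical illustrations, not a Cortex-A53 or 68060 simulator (both real cores are dual-issue).

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str            # "load" or "alu"
    dest: str          # register written by the instruction
    srcs: tuple = ()   # registers read by the instruction


def cycles(program, load_to_use_penalty):
    """Cycles needed to issue `program` on a single-issue in-order pipeline."""
    ready = {}   # register -> first cycle in which its value can be consumed
    cycle = 0    # first cycle in which the next instruction may issue
    for ins in program:
        # In-order issue: wait for the pipeline slot and for every source register.
        issue = max([cycle] + [ready.get(reg, 0) for reg in ins.srcs])
        latency = 1 + (load_to_use_penalty if ins.op == "load" else 0)
        ready[ins.dest] = issue + latency
        cycle = issue + 1
    return cycle


# 68k-style ordering: each load result is consumed by the very next instruction.
unscheduled = [
    Instr("load", "r1"), Instr("alu", "r2", ("r1",)),
    Instr("load", "r3"), Instr("alu", "r4", ("r3",)),
]
# Same work with both loads hoisted ahead of their uses.
scheduled = [
    Instr("load", "r1"), Instr("load", "r3"),
    Instr("alu", "r2", ("r1",)), Instr("alu", "r4", ("r3",)),
]

print(cycles(unscheduled, 3))  # 10
print(cycles(scheduled, 3))    # 6
print(cycles(unscheduled, 0))  # 4
```

In this model the unscheduled sequence pays the full 3 cycle penalty on every load, scheduling hides only part of it, and a zero-penalty pipeline runs the unscheduled code with no stalls at all.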
A typical general purpose CPU workload of instructions will average to approximately the following.
load 26% (~25% which is 5 out of 20 instructions)
store 10% (2 out of 20 instructions)
ALU 49% (~50% which is 10 out of 20 instructions)
branch 15% (3 out of 20 instructions)
20 instruction typical workload for ARM Cortex-A53 emulating 68k code
load x5 with load-to-use stalls (5*4= 20 cycles)
store x2 (2*0.5 to 2*1= 1-2 cycles with store buffer)
ALU x10 (10*0.5 to 10*1= 5-10 cycles)
branch x3 (most are conditional predicted branches so ~0 cycles)
---
total: 26-32 cycles
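The Cortex-A53 tally above can be reproduced with a few lines of arithmetic. This is only a sketch of the estimate as stated in the post: the per-instruction costs are the post's figures (4 cycles per load, counted as 1 issue cycle plus a 3 cycle stall), and the `cycle_range` helper is a hypothetical name.

```python
def cycle_range(rows):
    """Sum (count, best_cycles_each, worst_cycles_each) rows into a (best, worst) total."""
    best = sum(count * lo for count, lo, _ in rows)
    worst = sum(count * hi for count, _, hi in rows)
    return best, worst


# Per-instruction costs taken from the estimate above, not from measurements.
a53_rows = [
    (5, 4.0, 4.0),    # loads: 1 issue cycle + 3 cycle load-to-use stall
    (2, 0.5, 1.0),    # stores, absorbed by the store buffer
    (10, 0.5, 1.0),   # ALU instructions, dual-issued at best
    (3, 0.0, 0.0),    # predicted branches, treated as free
]

print(cycle_range(a53_rows))  # (26.0, 32.0)
```

The loads alone account for 20 of the 26 to 32 cycles, which is why the load-to-use stall dominates the estimate.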
The Cortex-A53 load instruction execution throughput is 1 cycle but the latency is 3 cycles and the load result can't be used for 3 cycles. Much of the Cortex-A53 performance improvement over the predecessor Cortex-A7 is due to improved superscalar execution but this requires very good and sometimes impossible instruction scheduling due to the increased load-to-use penalty. The Cortex-A53 has two simple integer execution pipelines so it can execute two ALU instructions in a cycle but so could the 68060 which also has most of the design features without the bottlenecks.
20 instruction typical workload for 68060
load+ALU x5 (5*0.5 to 5*1= 2.5-5 cycles)
store x2 (2*0.5 to 2*1= 1-2 cycles with store buffer)
ALU x5 (5*0.5 to 5*1= 2.5-5 cycles)
branch x3 (most are conditional predicted branches so ~0 cycles)
---
total: 6-12 cycles
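The same arithmetic applied to the 68060 tally gives a per-clock comparison. This is a rough sketch using only the figures in the post: the 26-32 cycle Cortex-A53 total is entered as a constant, `total_cycles` is a hypothetical helper, and the result compares cycle counts only, so it says nothing about clock frequency or wall-clock time.

```python
def total_cycles(rows):
    """Sum (count, best_cycles_each, worst_cycles_each) rows into a (best, worst) total."""
    best = sum(count * lo for count, lo, _ in rows)
    worst = sum(count * hi for count, _, hi in rows)
    return best, worst


# Per-instruction costs taken from the 68060 estimate above.
m68060_rows = [
    (5, 0.5, 1.0),   # fused load+ALU instructions, no load-to-use stall
    (2, 0.5, 1.0),   # stores, absorbed by the store buffer
    (5, 0.5, 1.0),   # remaining ALU instructions
    (3, 0.0, 0.0),   # predicted branches, treated as free
]

# Cortex-A53 totals as quoted in the post for the equivalent emulated workload.
a53_best, a53_worst = 26, 32

m68060_best, m68060_worst = total_cycles(m68060_rows)
ratio_best = a53_best / m68060_best      # best case against best case
ratio_worst = a53_worst / m68060_worst   # worst case against worst case

print(m68060_best, m68060_worst)  # 6.0 12.0
print(f"A53 cycles per 68060 cycle: {ratio_worst:.2f} to {ratio_best:.2f}")
```

Under these assumptions the Cortex-A53 needs roughly 2.7 to 4.3 times as many cycles as the 68060 for the same work.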
The in-order superscalar 68060 design not only eliminates bottleneck #1, load-to-use stalls, but CISC load+ALU instructions avoid bottleneck #2 also. The load+ALU instructions are pipelined superscalar executing in a single cycle where the Cortex-A53 requires two single cycle instructions with a 3 cycle load-to-use stall between without instruction scheduling. The 68060 design makes instruction scheduling much easier and the 68060 has amazing performance for an in-order design considering most compilers have no 68060 specific instruction scheduler. The SiFive U74 core architects used a similar 68060 like design for RISC-V with impressive results even though only bottleneck #1 is removed as RISC-V is load/store so suffers from bottleneck #2 and bottleneck #3 (RVC compressed and AArch64 still have inferior code density to the 68060). The simple and tiny in-order SiFive U74 core not only outperforms the in-order Cortex-A53 for integer performance but the OoO PPC G5 in some benchmarks with a fraction of the units and transistors. An equivalent RVC L1I instruction cache may contain twice as much code as a PPC L1I cache so bottleneck #3 is reduced even though a 68k L1I cache may contain twice as much code as RVC and four times the amount of code as PPC. Maybe the better code density was the larger part of the SiFive U74 core advantage because OoO is supposed to remove most of load-to-use stalls and reduce performance loss from poor instruction scheduling. One of the big selling points of the in-order Cortex-A53 was that it offered similar if not better performance compared to the OoO Cortex-A9 with a smaller lower power in-order core.
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/the-top-5-things-to-know-about-cortex-a53 Quote:
3. Higher performance than Cortex-A9: smaller and more efficient too
The Cortex-A9 features an out-of-order pipeline, dual issue capability, and a longer pipeline than Cortex-A53 that enables 15% higher frequency operation. However the Cortex-A53 achieves higher single thread performance by pushing a simpler design farther - some of the key factors enabling the performance of the Cortex-A53 include the integrated low latency level 2 cache, the larger 512 entry main TLB, and the complex branch predictor. The Cortex-A9 has set the bar for the high end of the smartphone market through 2012 – by matching and exceeding that level of performance in a smaller footprint and power budget, the Cortex-A53 delivers performance to entry level devices that was previously enjoyed by high-end flagship mobile devices – in a lower power budget and at lower cost. The graph below compares the single thread performance of the high efficiency Cortex-A processors with the Cortex-A9. At the same frequency, Cortex-A53 delivers more than 20% higher instruction throughput than the Cortex-A9 for representative workloads.
|
The ARM doc above also verifies that a longer pipeline enables higher frequency but it can result in larger performance killing load-to-use penalties with RISC designs. The early RISC shallow pipeline performance advantage quickly disappeared as longer pipeline less microcoded CISC designs appeared that avoided classic RISC bottlenecks #1-3. The Cortex-A53 suffers from all RISC bottlenecks with bottleneck #1, the load-to-use stall, worse than the classic RISC pipeline by a cycle while bottleneck #3, code density, is somewhat better than most classic RISC ISAs. Even the classic RISC pipeline 2 cycle load-to-use penalty was researched with simulations in an old paper showing zero cycle loads (zero cycle load-to-use penalty) provided more integer performance than 32 GP registers vs 8 GP registers and an in-order CPU design with zero cycle loads was surprisingly close to an aggressive OoO design.
https://ftp.cs.wisc.edu/sohi/papers/1995/micro.zcl.pdf Quote:
The column labeled Cycle(In+ZCL)=Cycle(Out) repeats the experiments, except the in-order issue processor has support for zero-cycle loads. For the integer codes, the performance of the two processors is now much closer - both out performing each other in some cases, with slightly better performance on the out-of-order issue processor.
This result is striking when one considers the clock cycle and design time advantages typically afforded to in-order issue processors. It may be the case that for workloads where untolerated latency is dominated by data cache access latencies (as in the case of the integer benchmarks), an in-order issue design with support for zero-cycle loads may consistently out perform an out-of-order issue processor.
|
In-order CPU designs are at a disadvantage due to long load latencies stalling the CPU waiting on data or instructions not in the L1 cache but RISC load-to-use stalls, load/store instructions and poor code density increase the bottlenecks while CISC designs improve them. Smaller, lower power and cheaper in-order core designs are compelling which is a major reason why the Cortex-A53 was/is likely the most popular ARM core ever and why it was naively chosen for THEA500 Mini and A600GS despite the 3 cycle load-to-use stall making it far from ideal for emulation.
https://www.anandtech.com/show/6420/arms-cortex-a57-and-cortex-a53-the-first-64bit-armv8-cpu-cores Quote:
ARM claims that on the same process node (32nm) the Cortex A53 is able to deliver the same performance as a Cortex A9 but at roughly 60% of the die area. The performance claims apply to both integer and floating point workloads. ARM tells me that it simply reduced a lot of the buffering and data structure size, while more efficiently improving performance. From looking at Apple's Swift it's very obvious that a lot can be done simply by improving the memory interface of ARM's Cortex A9. It's possible that ARM addressed that shortcoming while balancing out the gains by removing other performance enhancing elements of the core.
|
I've explained all this before although maybe not all in one place. The memory/cache/load bottlenecks are with RISC CPUs compared with CISC CPUs and added together they have negative synergies and are significant. Existing 68k code is written assuming CISC advantages which can only be fully realized on CISC designs. CISC advantages can provide a cost advantage along with cheaper hardware requirements without emulation and avoiding ARM royalties but mass production is required. Emulation is not competitive and provides more bottlenecks on top of the RISC bottlenecks, especially for cheap in-order CPU designs with much increased cache requirements.
Last edited by matthey on 29-Aug-2024 at 08:24 PM. Last edited by matthey on 29-Aug-2024 at 08:09 PM.
|
|
Status: Offline |
|
|
pavlor
| |
Re: Trevor's Amiga Blog Posted on 30-Aug-2024 16:43:33
| | [ #1551 ] |
|
|
|
Elite Member |
Joined: 10-Jul-2005 Posts: 9636
From: Unknown | | |
|
| @number6
Thanks!
Boing Ball cup? I really want that! |
|
Status: Offline |
|
|
K-L
| |
Re: Trevor's Amiga Blog Posted on 30-Aug-2024 19:01:49
| | [ #1552 ] |
|
|
|
Super Member |
Joined: 3-Mar-2006 Posts: 1427
From: Oullins, France | | |
|
| @Trevor
Thanks a lot for this new entry _________________ PowerMac G5 2,7Ghz - 2GB - Radeon 9650 - MorphOS 3.14 AmigaONE X1000, 2GB, Sapphire Radeon HD 7700 FPGA Replay + DB 68060 at 85Mhz |
|
Status: Offline |
|
|