Poster | Thread |
Karlos
|  |
Re: 32-bit PPC on FPGA Posted on 14-Feb-2024 16:07:40
| | [ #161 ] |
|
|
 |
Elite Member  |
Joined: 24-Aug-2003 Posts: 4843
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition! | | |
|
| @Gunnar
Quote:
In my experience, reading difficulty is also a real pain when you debug code. When you debug, you often step through the assembler code one instruction at a time and follow it. |
I don't disagree, especially when your integer literal #100000 is split into two 16-bit immediate values 0x02 and -31072.
_________________ Doing stupid things for fun... |
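For readers wondering where 0x02 and -31072 come from, here is a minimal C sketch of the split. The helper names `imm_ha`, `imm_lo`, `imm_join` and `check_split` are hypothetical; they are meant to mirror what the assembler's `@ha` and `@l` operators compute, on the assumption that the low half is later added as a sign-extended 16-bit immediate (as `addi` does), so the high half has to be bumped by one whenever the low half comes out negative.

```c
#include <assert.h>
#include <stdint.h>

/* Low half, sign-extended the way addi treats its 16-bit immediate. */
int16_t imm_lo(int32_t value)
{
    uint16_t low = (uint16_t)((uint32_t)value & 0xFFFFu);
    return (int16_t)(low >= 0x8000u ? (int32_t)low - 0x10000 : (int32_t)low);
}

/* High half, pre-adjusted by +1 whenever the low half is negative. */
int16_t imm_ha(int32_t value)
{
    uint32_t adjusted = (uint32_t)value + 0x8000u;
    uint16_t high = (uint16_t)(adjusted >> 16);
    return (int16_t)(high >= 0x8000u ? (int32_t)high - 0x10000 : (int32_t)high);
}

/* Recombine as addis followed by addi would: (ha << 16) + lo, modulo 2^32. */
int32_t imm_join(int16_t ha, int16_t lo)
{
    return (int32_t)((uint32_t)ha * 65536u + (uint32_t)(int32_t)lo);
}

/* Sanity check: the two halves must always rebuild the original value. */
void check_split(int32_t value)
{
    assert(imm_join(imm_ha(value), imm_lo(value)) == value);
}
```

For 100000 (0x186A0) the low 16 bits are 0x86A0, which reads as -31072 when sign-extended, so the high half becomes 2 instead of 1: 2 * 65536 - 31072 = 100000.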
|
Status: Offline |
|
|
Gunnar
|  |
Re: 32-bit PPC on FPGA Posted on 14-Feb-2024 16:20:25
| | [ #162 ] |
|
|
 |
Cult Member  |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @Karlos
Quote:
especially when your integer literal #100000 is split into two 16-bit immediate values 0x02 and -31072. |
Agreed. I personally found the loading of 64-bit pointers very ugly to read.
How many instructions do you need for loading a 64-bit pointer? |
|
Status: Offline |
|
|
ppcamiga1
|  |
Re: 32-bit PPC on FPGA Posted on 14-Feb-2024 16:25:05
| | [ #163 ] |
|
|
 |
Cult Member  |
Joined: 23-Aug-2015 Posts: 985
From: Unknown | | |
|
| @Karlos
PPC is still something that we used in 1997. Accept that. Stop this assembler shit. PPC works, and was faster than 68k many years ago.
|
|
Status: Offline |
|
|
ppcamiga1
|  |
Re: 32-bit PPC on FPGA Posted on 14-Feb-2024 16:31:03
| | [ #164 ] |
|
|
 |
Cult Member  |
Joined: 23-Aug-2015 Posts: 985
From: Unknown | | |
|
| My dream Amiga would be an FPGA with 68k and PPC cores, with OCS for old games and better graphics in parallel to OCS for the rest.
|
|
Status: Offline |
|
|
Karlos
|  |
Re: 32-bit PPC on FPGA Posted on 14-Feb-2024 16:41:30
| | [ #165 ] |
|
|
 |
Elite Member  |
Joined: 24-Aug-2003 Posts: 4843
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition! | | |
|
| @Gunnar
Quote:
How many instructions do you need for loading a 64-bit pointer? |
Can you clarify? That can be interpreted a number of ways. I interpreted it as this:
extern long myscore;
long const* get() { return &myscore; }
Compiled for Power64, gcc 13.2 -Ofast:
.LC0:
    .quad myscore
get():
    .quad .L.get(),.TOC.@tocbase,0
.L.get():
    addis 3,2,.LC0@toc@ha
    ld 3,.LC0@toc@l(3)
    blr
    .long 0
    .byte 0,9,0,0,0,0,0,0
Some of this is ABI overhead from the whole TOC lookup; the second instruction loads the pointer. |
|
Status: Offline |
|
|
Gunnar
|  |
Re: 32-bit PPC on FPGA Posted on 14-Feb-2024 16:52:36
| | [ #166 ] |
|
|
 |
Cult Member  |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @Karlos
yeeees but ...
This code loads a 64-bit value from memory/d-cache into a register... And this code assumes you already have a pointer to the TOC. This is close, but not what I meant.
Where does your pointer come from? How do you create the pointer in the first place, and how many instructions do you need for doing this?
And this way of loading from the TOC with the 16-bit (An) displacement mode... works pretty nicely, but many programs run out of TOC space very quickly. What do you do when your program uses more than 64K of TOC?
And what does the access look like then? This gets very ugly very fast.
|
|
Status: Offline |
|
|
Karlos
|  |
Re: 32-bit PPC on FPGA Posted on 14-Feb-2024 17:07:53
| | [ #167 ] |
|
|
 |
Elite Member  |
Joined: 24-Aug-2003 Posts: 4843
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition! | | |
|
| @Gunnar
This is why I asked for clarification. The example was already using the large (4G) TOC model which is where the first add instruction comes from. If I specify a small TOC model (-mcmodel=small):
.LC0:
    .quad myscore
get():
    .quad .L.get(),.TOC.@tocbase,0
.L.get():
    ld 3,.LC0@toc(2)
    blr
    .long 0
    .byte 0,9,0,0,0,0,0,0
You will notice the add immediate shifted step has gone now. |
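To put numbers on the "more than 64K TOC" question, here is a rough C sketch of when the one-instruction form still reaches an entry. It assumes 8-byte TOC entries and that r2 points 0x8000 bytes past the start of the TOC so that a signed 16-bit displacement covers the full 64K; that bias is how I understand the 64-bit ELF ABI, but treat it as an assumption. The names `toc_displacement` and `toc_load_instructions` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { TOC_BIAS = 0x8000, TOC_ENTRY_SIZE = 8 };

/* Displacement from r2 for entry `index`, assuming r2 = TOC start + TOC_BIAS. */
int64_t toc_displacement(uint32_t index)
{
    return (int64_t)index * TOC_ENTRY_SIZE - TOC_BIAS;
}

/* 1 if a single "ld rX,disp(r2)" reaches the entry, else 2 (addis + ld). */
int toc_load_instructions(uint32_t index)
{
    int64_t disp = toc_displacement(index);

    /* Beyond this even the addis + ld pair cannot reach the entry. */
    assert(disp <= (int64_t)INT32_MAX - 0x8000);

    return (disp >= INT16_MIN && disp <= INT16_MAX) ? 1 : 2;
}
```

Under those assumptions the first 8192 entries (64K / 8 bytes) are reachable with one load, and entry 8192 onwards needs the addis step from the earlier listing.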
|
Status: Offline |
|
|
matthey
|  |
Re: 32-bit PPC on FPGA Posted on 14-Feb-2024 17:34:02
| | [ #168 ] |
|
|
 |
Elite Member  |
Joined: 14-Mar-2007 Posts: 2451
From: Kansas | | |
|
| geen_naam Quote:
Because amigaos was designed at the same time when the steam engine was invented, it is hopelessly outdated. Because memory copies are byte aligned instead of long word aligned, you have to do byte copies in at least 33% of the cases. Or risk stalling your cpu. Therefore optimised copy routines do not have much effect as they could do on a modern OS.
|
The AmigaOS developers should have known to naturally align all structure data with padding even though they were programming for a 16 bit CPU. The 32 bit 68020 was introduced in 1984 so they had time to improve alignments before release. AmigaOS 4 developers had time to improve alignments but they chose 68k compatibility instead for a NG AmigaOS. The PPC ISA and CPUs are not as good at handling 16 bit data and poor alignment as most 68k CPUs. Even the 68060 is forgiving of poor alignment.
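A small C sketch of what natural alignment with padding means in practice. The two structs are hypothetical and are not real AmigaOS structures; `#pragma pack(push, 2)` is a compiler extension (GCC, Clang and MSVC accept it) used here only to imitate a 68k-style layout where 32-bit fields are aligned to 2 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record laid out 68k-style: fields aligned to at most 2 bytes. */
#pragma pack(push, 2)
struct record_68k_style {
    uint16_t flags;
    uint32_t length;   /* lands at offset 2: misaligned for a 32-bit access */
};
#pragma pack(pop)

/* Same record with explicit padding so every field is naturally aligned. */
struct record_natural {
    uint16_t flags;
    uint16_t pad;      /* explicit padding */
    uint32_t length;   /* offset 4 */
};

/* A field is naturally aligned when its offset is a multiple of its size. */
int is_naturally_aligned(size_t offset, size_t size)
{
    assert(size != 0);
    return offset % size == 0;
}
```

The padded version costs two bytes per record, and in exchange every 32-bit access is aligned on any CPU.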
geen_naam Quote:
Managing cache is a must in any application. Not only in copy loops. Because the performance penalty of a cache miss is huge on our multi GHz processors. Waiting for data to arrive from slow DDR can take up several 100 clock cycles.
Intel and AMD understand their target audience. Those coders have mostly no clue about the hardware they are running on. Their software can run "in the clouds" for all they care. Therefore those platforms offer plenty of memory, hardware prefetchers, advanced predictors and instruction reordering. Optimization is done first and foremost in hardware and OS. Not in applications like we used to be used to.
So despite the compiler generated code, the PC and server CPU is still able to get maximum out of its potential.
|
PPC was developed when it was thought that simplifying CPU cores to clock higher and moving complexity into the compiler gives a RISC advantage. The RISC philosophy of breaking instructions into many weak instructions not only increased instruction counts but produced code with more dependent instructions, made instruction scheduling more difficult, introduced load-to-use stalls and clogged up instruction caches.
68k: // 1 instruction, 10 bytes, 1 cycle execution possible
    add.l #100000,myscore
PPC: // 5 instructions, 20 bytes, 6 cycle execution common for PPC
    lis 10,myscore@ha
    lwz 9,myscore@l(10)    // dependent on r10 in previous instruction
                           // load-to-use stall (1 cycle)
    addis 9,9,0x2          // dependent on result in r9 and must wait for load-to-use delay
    addi 9,9,-31072        // dependent on result in r9
    stw 9,myscore@l(10)    // dependent on r9 result
This is assuming the "Figure 4-15. Common Model Instruction Delays" single cycle load-to-use (load-use) delay given for early shallow pipeline PPC core designs.
The PowerPC Compiler Writer’s Guide https://cr.yp.to/2005-590/powerpc-cwg.pdf
A two cycle load-to-use penalty is more common for modern RISC CPU cores that abandoned shallow pipeline designs. The most common RISC core in the world, the ARM Cortex-A53, has a 3 cycle load-to-use penalty. Even a programmer should be able to see the major RISC disadvantage here. The problem is that RISC architectures require perfect code while CISC architectures are inherently forgiving of low quality code and are naturally higher performance.
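The cycle counts above can be reproduced with a deliberately crude C model. It assumes a single-issue, in-order core where every instruction in the chain depends on the previous one, each takes one cycle, and a load adds the load-to-use penalty before its consumer can start. It ignores superscalar issue, cache misses and store latency; the `struct insn` type and `chain_cycles` function are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* One instruction in a strictly dependent chain. */
struct insn {
    const char *text;
    int is_load;   /* nonzero if the next instruction consumes a loaded value */
};

/* The PPC sequence above for: myscore += 100000 */
const struct insn ppc_add_sequence[] = {
    { "lis   10,myscore@ha",   0 },
    { "lwz   9,myscore@l(10)", 1 },
    { "addis 9,9,0x2",         0 },
    { "addi  9,9,-31072",      0 },
    { "stw   9,myscore@l(10)", 0 },
};

/* Total cycles: one per instruction plus a stall after each consumed load. */
int chain_cycles(const struct insn *seq, size_t count, int load_use_penalty)
{
    assert(load_use_penalty >= 0);
    int cycles = 0;
    for (size_t i = 0; i < count; i++) {
        cycles += 1;
        if (seq[i].is_load && i + 1 < count)
            cycles += load_use_penalty;
    }
    return cycles;
}
```

With a 1 cycle penalty the five instructions take 6 cycles in this model, matching the comment in the listing; with the 2 and 3 cycle penalties mentioned above the same chain takes 7 and 8.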
geen_naam Quote:
Well, I actually do know about both ASIC design and FPGAs. As I was part of ASIC design teams myself.
You only use FPGAs to verify functionality. Because simulation is painfully slow.
We never optimised our ASIC design for the resources in an FPGA. Which is totally useless. Since the fab make use of their own libraries with primitives which are tuned to the process node. Timing behaviour is completely different. Therefore you have to simulate the post layout Verilog with timing which you get from your fab.
Umisef claims that you designed your "68080" to make best use of the FPGA resources available. Which I think is very plausible.
|
Far worse than optimizing a CPU core based on FPGA resources is optimizing an ISA for an FPGA core. There are legitimate reasons to use an FPGA CPU core in hardware, for example low-volume production applications. Better performance due to better utilization of FPGA resources is a competitive advantage. However, some customers will want to move up to an ASIC, where an ISA optimized for an FPGA is a handicap.
|
|
Status: Offline |
|
|
Gunnar
|  |
Re: 32-bit PPC on FPGA Posted on 14-Feb-2024 17:36:29
| | [ #169 ] |
|
|
 |
Cult Member  |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @Karlos
Correct: with the small TOC, you can load using 1 instruction. With the big TOC, you can load using 2 instructions, spending one more temp register.
But can you make an example of how to create the TOC pointer? This was my original question... Maybe I did not word it clearly :) |
|
Status: Offline |
|
|
Karlos
|  |
Re: 32-bit PPC on FPGA Posted on 14-Feb-2024 18:08:27
| | [ #170 ] |
|
|
 |
Elite Member  |
Joined: 24-Aug-2003 Posts: 4843
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition! | | |
|
| @Gunnar
I think I understand, but is that not a bit of a strawman argument? The ABI defines which register is expected to hold the TOC base and setting it all up is a job for the loader/linker.
Application code doesn't generally need to worry about that. |
|
Status: Offline |
|
|
Gunnar
|  |
Re: 32-bit PPC on FPGA Posted on 14-Feb-2024 18:35:21
| | [ #171 ] |
|
|
 |
Cult Member  |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @Karlos
Quote:
Application code doesn't generally need to worry about that. |
Unless you work on the side that develops the OS backend, as IBM does. Then of course you see what I was talking about, and you see this in many places.
But never mind ... |
|
Status: Offline |
|
|
Karlos
|  |
Re: 32-bit PPC on FPGA Posted on 14-Feb-2024 19:28:58
| | [ #172 ] |
|
|
 |
Elite Member  |
Joined: 24-Aug-2003 Posts: 4843
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition! | | |
|
| |
Status: Offline |
|
|
Hammer
 |  |
Re: 32-bit PPC on FPGA Posted on 15-Feb-2024 6:11:01
| | [ #173 ] |
|
|
 |
Elite Member  |
Joined: 9-Mar-2003 Posts: 6161
From: Australia | | |
|
| @matthey
Quote:
A two cycle load-to-use penalty is more common for modern RISC CPU cores that abandoned shallow pipeline designs. The most common RISC core in the world, the ARM Cortex-A53, has a 3 cycle load-to-use penalty. Even a programmer should be able to see the major RISC disadvantage here. The problem is that RISC architectures require perfect code while CISC architectures are inherently forgiving of low quality code and are naturally higher performance.
|
There's a reason why Apple's M1 has 8 decoders to rival AMD's four (decoder count = IEU count match) and Intel's six decoders.
Zen 5 increases IEU units to 6.
A "very fat" RISC microarchitecture can be designed to rival a "very fat" X86-64 microarchitecture.
Qualcomm Snapdragon X's Oryon CPU is from M1. Oryon CPU appears on Qualcomm SnapDragon 8 Gen 4.
Qualcomm Oryon CPU is available for multiple desktop PC OEMs and makes this CPU an existential threat to AMD and Intel.
Snapdragon X Elite will debut in June 2024. Snapdragon X Elite's (87 watts and 58 watts) 12 cores/12 threads rival or beat AMD's 8 cores/16 threads Ryzen 7 7840HS.
In Cinebench 2024 MT, the Snapdragon X Elite in its 58 watt configuration is similar to the Ryzen 7 7840HS (35 to 54 watts, unknown config).
Snapdragon X Elite = 950 score with 12 threads
Ryzen 7 7840HS = 979 score with 16 threads
i7-13800H = 996 score with 20 threads
From https://www.anandtech.com/show/21112/qualcomm-snapdragon-x-elite-performance-preview-a-first-look-at-whats-to-come
Ryzen 7 8700G (45 to 64 watts) = 986 with 16 threads
From https://www.topcpu.net/en/cpu-r/cinebench-2024-multi-core
Intel ArrowLake and AMD Zen 5 will be released this year to counter Qualcomm Oryon.
Last edited by Hammer on 15-Feb-2024 at 06:17 AM. Last edited by Hammer on 15-Feb-2024 at 06:16 AM.
_________________ Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68) Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68) Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB |
|
Status: Offline |
|
|
ppcamiga1
|  |
Re: 32-bit PPC on FPGA Posted on 15-Feb-2024 6:15:14
| | [ #174 ] |
|
|
 |
Cult Member  |
Joined: 23-Aug-2015 Posts: 985
From: Unknown | | |
|
| An Amiga in FPGA could be something nice: 68k and RISC, with OCS and better graphics, finally merging classic and NG.
|
|
Status: Offline |
|
|
kolla
|  |
Re: 32-bit PPC on FPGA Posted on 15-Feb-2024 6:34:47
| | [ #175 ] |
|
|
 |
Elite Member  |
Joined: 20-Aug-2003 Posts: 3352
From: Trondheim, Norway | | |
|
| @ppcamiga1
Quote:
Stop this shite shit
_________________ B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC |
|
Status: Offline |
|
|
Hammer
 |  |
Re: 32-bit PPC on FPGA Posted on 15-Feb-2024 7:11:58
| | [ #176 ] |
|
|
 |
Elite Member  |
Joined: 9-Mar-2003 Posts: 6161
From: Australia | | |
|
| @Gunnar
Quote:
Gunnar wrote:
To whom do you talk? I never said this.
|
You stated, "Normally code runs around 2 instructions per clock."
The norm can change when out-of-order and prefetch depth capabilities are fattened.
Quote:
The Apollo 68080 CPU has 6 EXECUTION units. 2 EA, 2 IALU, 1 AMMX, 1 FPU The Apollo 68080 can do up to 4 instructions per cycle.
|
Zen 4 can do 9 micro-ops per cycle from the micro-op cache (6.75K entries) and four micro-ops from its 4-way decoders.
Those 13 micro-ops are bottlenecked by the Rename / Dispatch unit, which can dispatch 6 micro-ops per cycle, while the Register Alias Tables can retire 8 micro-ops.
What's the AC68080's instruction retirement rate?
The instruction retirement rate can be a bottleneck.
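The bottleneck argument boils down to "the narrowest stage sets the ceiling", which can be written as a tiny C sketch. The function name `sustained_width` is made up, and the stage widths in the note below are simply the figures quoted in this post, not independently verified numbers.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on sustained micro-ops per cycle: the narrowest stage wins. */
int sustained_width(const int *stage_widths, size_t count)
{
    assert(count > 0);
    int narrowest = stage_widths[0];
    for (size_t i = 1; i < count; i++) {
        if (stage_widths[i] < narrowest)
            narrowest = stage_widths[i];
    }
    return narrowest;
}
```

Feeding it the front-end, dispatch and retire widths quoted above (13, 6, 8) gives 6, so in this simplified view dispatch rather than retirement is the limit for Zen 4; the same calculation for the 68080 needs its retirement rate, which is the open question here.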
------------------------------- https://ko-fi.com/post/Lightwave-5-benchmarking-and-findings-Z8Z3I8IOX Lightwave-5 benchmarks

The A4000/060 is a 50 MHz 68060 config.
|
Status: Offline |
|
|
Karlos
|  |
Re: 32-bit PPC on FPGA Posted on 15-Feb-2024 9:40:23
| | [ #177 ] |
|
|
 |
Elite Member  |
Joined: 24-Aug-2003 Posts: 4843
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition! | | |
|
| To all the doubters, just look at AmiBerry in the benchmarks above. Contrast the results with the X5000.
@NutsAboutAmiga
Do you still believe that 68K applications get a huge boost from having access to native OS calls? That's only true if calling the OS is what the application spends the majority of its time doing.
Most real application software spends most of its time either computing something, or sitting around idle, waiting to be triggered by some external action. LW rendering is a good example of the former. The further along the spectrum from compute bound to IO bound/event driven, the less raw speed matters anyway.
Last edited by Karlos on 15-Feb-2024 at 09:51 AM.
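The point about native OS calls is essentially Amdahl's law, and a short C sketch makes the scale obvious. The function name `overall_speedup` is made up, and the fractions in the note below are hypothetical illustrations, not measurements of any real application.

```c
#include <assert.h>

/* Overall speedup when only a fraction of the run time is accelerated. */
double overall_speedup(double fraction_accelerated, double factor)
{
    assert(fraction_accelerated >= 0.0 && fraction_accelerated <= 1.0);
    assert(factor > 0.0);
    return 1.0 / ((1.0 - fraction_accelerated) + fraction_accelerated / factor);
}
```

If an application spent 10% of its time in OS calls and those became 10 times faster, the whole program would speed up by only about 1.1x; it would have to spend around 90% of its time in the OS to see roughly 5.3x.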
|
Status: Offline |
|
|
pixie
 |  |
Re: 32-bit PPC on FPGA Posted on 15-Feb-2024 12:55:17
| | [ #178 ] |
|
|
 |
Elite Member  |
Joined: 10-Mar-2003 Posts: 3411
From: Figueira da Foz - Portugal | | |
|
| @geen_naam
Quote:
Amiberry is running a WinUAE-based, heavily optimised 68k JIT on a core that is twice as fast. Yet it only manages to achieve similar results compared to MOS/OS4 on an X5000. |
An emulator that actually optimizes stuff it should be optimizing... I know, mind blowing stuff!
 _________________ Indigo 3D Lounge, my second home. The Illusion of Choice | Am*ga |
|
Status: Offline |
|
|
Karlos
|  |
Re: 32-bit PPC on FPGA Posted on 15-Feb-2024 13:12:15
| | [ #179 ] |
|
|
 |
Elite Member  |
Joined: 24-Aug-2003 Posts: 4843
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition! | | |
|
| |
Status: Offline |
|
|
pixie
 |  |
Re: 32-bit PPC on FPGA Posted on 15-Feb-2024 13:24:45
| | [ #180 ] |
|
|
 |
Elite Member  |
Joined: 10-Mar-2003 Posts: 3411
From: Figueira da Foz - Portugal | | |
|
| @geen_naam
Quote:
1. Amiberry 68k JIT is based on heavily optimised WinUAE JIT. (68k emulation is primary function) |
On ARM? 'heavily optimised WinUAE JIT' on ARM? I'll ask again in case you don't understand... 'heavily optimised WinUAE JIT' on ARM???
Quote:
2. MorphOS and AmigaOS4 JIT are less heavily optimised |
I would think they would be quite optimized for PPC... Perhaps you don't know the gap between JIT and non-JIT code: in a heavily optimized (de facto) CPU combo such as WinUAE/x86 it's huge. Perhaps MOS or AmigaOS does some magic (i.e. point 4) to pull those numbers, but perhaps it's just simple JIT...
|
Status: Offline |
|
|