Amigaworld.net - The Amiga Computer Community Portal Website

32-bit PPC on FPGA


Karlos

Re: 32-bit PPC on FPGA
Posted on 14-Feb-2024 16:07:40

[ #161 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4491
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@Gunnar

Quote:
In my experience, difficulty reading the code is also a real pain when you debug.
When you debug, you often go through the assembler code one instruction at a time and follow it.

I don't disagree, especially when your integer literal #100000 is split into two 16-bit immediate values 0x02 and -31072.
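
For anyone following along, here is a minimal C sketch of how 100000 (0x186A0) ends up as the pair 0x02 and -31072. It assumes the usual PowerPC convention that addi sign-extends its 16-bit immediate, so the high half has to be adjusted upward by one whenever the low half is negative (this is what the assembler's @ha operator does). The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Low half as addi sees it: a sign-extended 16-bit immediate. */
static int16_t lo16(uint32_t value) {
    int32_t low = (int32_t)(value & 0xFFFFu);
    return (int16_t)(low >= 0x8000 ? low - 0x10000 : low);
}

/* High-adjusted half: the upper 16 bits, plus one when the low half
   is negative, so that the later addi lands on the right value. */
static uint16_t ha16(uint32_t value) {
    return (uint16_t)((value + 0x8000u) >> 16);
}

/* Recombine the halves the way addis followed by addi would. */
static uint32_t recombine(uint16_t ha, int16_t lo) {
    return ((uint32_t)ha << 16) + (uint32_t)(int32_t)lo;
}

/* Sanity check: splitting and recombining must round-trip. */
static void check_split(uint32_t value) {
    assert(recombine(ha16(value), lo16(value)) == value);
}
```

With value = 100000, ha16 gives 2 and lo16 gives -31072, and 2 * 65536 - 31072 is 100000 again.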

_________________
Doing stupid things for fun...

Status: Offline

Gunnar

Re: 32-bit PPC on FPGA
Posted on 14-Feb-2024 16:20:25

[ #162 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

@Karlos

Quote:
especially when your integer literal #100000 is split into two 16-bit immediate values 0x02 and -31072.

Agreed. I personally found the loading of 64-bit pointers very ugly to read.

How many instructions do you need to load a 64-bit pointer?

Status: Offline

ppcamiga1

Re: 32-bit PPC on FPGA
Posted on 14-Feb-2024 16:25:05

[ #163 ]

Cult Member

Joined: 23-Aug-2015
Posts: 829
From: Unknown

@Karlos

ppc is still something that we used in 1997
accept that
stop this assembler shit
ppc works and was faster than 68k many years ago

Status: Offline

ppcamiga1

Re: 32-bit PPC on FPGA
Posted on 14-Feb-2024 16:31:03

[ #164 ]

Cult Member

Joined: 23-Aug-2015
Posts: 829
From: Unknown

my dream amiga will be fpga with 68k and ppc core
with ocs for old games
and better graphics parallel to ocs for rest

Status: Offline

Karlos

Re: 32-bit PPC on FPGA
Posted on 14-Feb-2024 16:41:30

[ #165 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4491
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@Gunnar

Quote:
How many instructions do you need to load a 64-bit pointer?

Can you clarify? That can be interpreted a number of ways. I interpreted it as this:


extern long myscore;

long const* get() {
    return &myscore;
}

Compiled for Power64, gcc 13.2 -Ofast


.LC0:
        .quad   myscore
get():
        .quad   .L.get(),.TOC.@tocbase,0
.L.get():
        addis 3,2,.LC0@toc@ha
        ld 3,.LC0@toc@l(3)
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0

Some of this is ABI overhead with the whole toc lookup, the second instruction loads the pointer.

_________________
Doing stupid things for fun...

Status: Offline

Gunnar

Re: 32-bit PPC on FPGA
Posted on 14-Feb-2024 16:52:36

[ #166 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

@Karlos

yeeees but ...

This code loads a 64-bit value from memory/d-cache into a register...
And this code assumes you have a pointer to the TOC in memory.
This is close, but not what I meant.

Where does your pointer come from?
How do you create the pointer in the first place, and how many instructions do you need to do it?

And this way of loading from the TOC with the 16-bit (An) mode...
works pretty nicely, but many programs run out of TOC space very quickly.
What do you do when your program uses more than 64K of TOC?

And what does the access look like then?
This gets very ugly very fast.

Status: Offline

Karlos

Re: 32-bit PPC on FPGA
Posted on 14-Feb-2024 17:07:53

[ #167 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4491
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@Gunnar

This is why I asked for clarification. The example was already using the large (4G) TOC model which is where the first add instruction comes from. If I specify a small TOC model (-mcmodel=small):


.LC0:
        .quad   myscore
get():
        .quad   .L.get(),.TOC.@tocbase,0
.L.get():
        ld 3,.LC0@toc(2)
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0

You will notice the add immediate shifted step has gone now.
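
To put a number on the small model's limit, here is a rough C sketch. It assumes 8-byte TOC entries and that r2 points 0x8000 bytes past the start of the TOC (my understanding of the 64-bit ELF ABI, so treat it as an assumption), which lets a signed 16-bit displacement cover the whole 64K. It also relies on ld being a DS-form instruction, meaning the displacement must be a multiple of 4. The names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    TOC_ENTRY_BYTES = 8,      /* one 64-bit pointer per entry */
    TOC_BIAS        = 0x8000  /* assumed offset of r2 from the TOC start */
};

/* Displacement from r2 to the n-th TOC entry. */
static int64_t toc_displacement(int64_t entry_index) {
    assert(entry_index >= 0);
    return entry_index * TOC_ENTRY_BYTES - TOC_BIAS;
}

/* True when a single "ld rT,disp(r2)" can encode the displacement:
   signed 16 bits with the low two bits clear. */
static bool fits_single_ld(int64_t disp) {
    return disp >= INT16_MIN && disp <= INT16_MAX && disp % 4 == 0;
}
```

Under those assumptions entries 0 through 8191 are reachable with one ld, and entry 8192 is the first that needs the addis/ld pair from the earlier listing.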

_________________
Doing stupid things for fun...

Status: Offline

matthey

Re: 32-bit PPC on FPGA
Posted on 14-Feb-2024 17:34:02

[ #168 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2150
From: Kansas

geen_naam Quote:

Because amigaos was designed at the same time when the steam engine was invented, it is hopelessly outdated. Because memory copies are byte aligned instead of long word aligned, you have to do byte copies in at least 33% of the cases. Or risk stalling your cpu. Therefore optimised copy routines do not have much effect as they could do on a modern OS.

The AmigaOS developers should have known to naturally align all structure data with padding even though they were programming for a 16 bit CPU. The 32 bit 68020 was introduced in 1984 so they had time to improve alignments before release. AmigaOS 4 developers had time to improve alignments but they chose 68k compatibility instead for a NG AmigaOS. The PPC ISA and CPUs are not as good at handling 16 bit data and poor alignment as most 68k CPUs. Even the 68060 is forgiving of poor alignment.
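
The alignment point can be illustrated with a small C sketch. It assumes a compiler that supports #pragma pack (GCC, Clang and MSVC do) and uses a made-up two-field structure, not a real AmigaOS one. Packing to 2 bytes mimics the 68k-style layout, where a 32-bit field can land on an address that is not a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure with 68k-style 2-byte packing. */
#pragma pack(push, 2)
struct PackedNode {
    uint16_t type;
    uint32_t size;   /* lands at offset 2: not 4-byte aligned */
};
#pragma pack(pop)

/* Same fields with the compiler's natural alignment and padding. */
struct NaturalNode {
    uint16_t type;
    uint32_t size;   /* padded to offset 4 on typical 32/64-bit ABIs */
};

/* A field is naturally aligned when its offset is a multiple of its width. */
static int is_naturally_aligned(size_t offset, size_t width) {
    assert(width != 0);
    return offset % width == 0;
}
```

The packed version saves two bytes per structure but leaves the 32-bit field misaligned, which is the case that CPUs less forgiving than the 68k handle poorly.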

geen_naam Quote:

Managing cache is a must in any application. Not only in copy loops. Because the performance penalty of a cache miss is huge on our multi GHz processors. Waiting for data to arrive from slow DDR can take up several 100 clock cycles.

Intel and AMD understand their target audience. Those coders have mostly no clue about the hardware they are running on. Their software can run "in the clouds" for all they care. Therefore those platforms offer plenty of memory, hardware prefetchers, advanced predictors and instruction reordering. Optimization is done first and foremost in hardware and OS. Not in applications like we used to be used to.

So despite the compiler generated code, the PC and server CPU is still able to get maximum out of its potential.

PPC was developed when it was thought that simplifying CPU cores to clock higher and moving complexity into the compiler gives a RISC advantage. The RISC philosophy of breaking instructions into many weak instructions not only increased instruction counts but produced code with more dependent instructions, made instruction scheduling more difficult, introduced load-to-use stalls and clogged up instruction caches.

68k: // 1 instruction, 10 bytes, 1 cycle execution possible
add.l #100000,myscore

PPC: // 5 instructions, 20 bytes, 6 cycle execution common for PPC
lis 10,myscore@ha
lwz 9,myscore@l(10) // dependent on r10 in previous instruction
// load-to-use stall (1 cycle)
addis 9,9,0x2 // dependent on result in r9 and must wait for load-to-use delay
addi 9,9,-31072 // dependent on result in r9
stw 9,myscore@l(10) // dependent on r9 result

This is assuming the "Figure 4-15. Common Model Instruction Delays" single cycle load-to-use (load-use) delay given for early shallow pipeline PPC core designs.
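
The cycle count above can be reproduced with a deliberately crude C model. It assumes a single-issue in-order core where every instruction in the chain depends on the previous one and each load adds a fixed stall before its result can be used. Real cores are more complicated, and the function name is made up.

```c
#include <assert.h>

/* Cycles for a fully dependent instruction chain on a single-issue,
   in-order model: one cycle per instruction, plus a stall for every
   load whose result is needed by the very next instruction. */
static int chain_cycles(int instructions, int dependent_loads,
                        int load_to_use_delay) {
    assert(instructions >= 0);
    assert(dependent_loads >= 0 && dependent_loads <= instructions);
    assert(load_to_use_delay >= 0);
    return instructions + dependent_loads * load_to_use_delay;
}
```

chain_cycles(5, 1, 1) gives the 6 cycles quoted for the PPC sequence; plugging in a 2 or 3 cycle delay gives 7 or 8 under the same simplifications.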

The PowerPC Compiler Writer's Guide
https://cr.yp.to/2005-590/powerpc-cwg.pdf

A two cycle load-to-use penalty is more common for modern RISC CPU cores that abandoned shallow pipeline designs. The most common RISC core in the world, the ARM Cortex-A53, has a 3 cycle load-to-use penalty. Even a programmer should be able to see the major RISC disadvantage here. The problem is that RISC architectures require perfect code while CISC architectures are inherently forgiving of low quality code and are naturally higher performance.

geen_naam Quote:

Well, I actually do know about both ASIC design and FPGAs. As I was part of ASIC design teams myself.

You only use FPGAs to verify functionality. Because simulation is painfully slow.

We never optimised our ASIC design for the resources in an FPGA, which would be totally useless, since the fab makes use of its own libraries with primitives which are tuned to the process node. Timing behaviour is completely different. Therefore you have to simulate the post-layout Verilog with timing which you get from your fab.

Umisef claims that you designed your "68080" to make best use of the FPGA resources available. Which I think is very plausible.

Far worse than optimizing a CPU core based on FPGA resources is optimizing an ISA for an FPGA core. There are legitimate reasons to use an FPGA CPU core in hardware, for example low production applications. Better performance due to better utilizing FPGA resources is a competitive advantage. However, some customers will want to move up to an ASIC where an ISA optimized for an FPGA is a handicap.

Status: Offline

Gunnar

Re: 32-bit PPC on FPGA
Posted on 14-Feb-2024 17:36:29

[ #169 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

@Karlos

Correct: with a small TOC, you can load using 1 instruction.
With a big TOC, you can load using 2 instructions, spending one more temp register.

But can you give an example of how to create the TOC pointer?
This was my original question...
Maybe I did not word it clearly :)

Status: Offline

Karlos

Re: 32-bit PPC on FPGA
Posted on 14-Feb-2024 18:08:27

[ #170 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4491
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@Gunnar

I think I understand, but is that not a bit of a strawman argument? The ABI defines which register is expected to hold the TOC base and setting it all up is a job for the loader/linker.

Application code doesn't generally need to worry about that.

_________________
Doing stupid things for fun...

Status: Offline

Gunnar

Re: 32-bit PPC on FPGA
Posted on 14-Feb-2024 18:35:21

[ #171 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

@Karlos

Quote:
Application code doesn't generally need to worry about that.

Unless you work on the same side as IBM, developing the OS backend..
Then you of course see what I was talking about, and you see this in many places.

But never mind ...

Status: Offline

Karlos

Re: 32-bit PPC on FPGA
Posted on 14-Feb-2024 19:28:58

[ #172 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4491
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@Gunnar

Can you provide the example code in question? I'm intrigued.

_________________
Doing stupid things for fun...

Status: Offline

Hammer

Re: 32-bit PPC on FPGA
Posted on 15-Feb-2024 6:11:01

[ #173 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5616
From: Australia

@matthey

Quote:

A two cycle load-to-use penalty is more common for modern RISC CPU cores that abandoned shallow pipeline designs. The most common RISC core in the world, the ARM Cortex-A53, has a 3 cycle load-to-use penalty. Even a programmer should be able to see the major RISC disadvantage here. The problem is that RISC architectures require perfect code while CISC architectures are inherently forgiving of low quality code and are naturally higher performance.

There's a reason why Apple's M1 has 8 decoders to rival AMD's four (decoder count = IEU count match) and Intel's six decoders.

Zen 5 increases IEU units to 6.

A "very fat" RISC microarchitecture can be designed to rival a "very fat" X86-64 microarchitecture.

Qualcomm Snapdragon X's Oryon CPU comes from the M1 lineage. The Oryon CPU also appears in the Qualcomm Snapdragon 8 Gen 4.

Qualcomm Oryon CPU is available for multiple desktop PC OEMs and makes this CPU an existential threat to AMD and Intel.

Snapdragon X Elite will debut in June 2024. The 12 cores/12 threads of the Snapdragon X Elite (87-watt and 58-watt configurations) rival or beat AMD's 8-core/16-thread Ryzen 7 7840HS.

Cinebench 2024 MT
The Snapdragon X Elite in its 58-watt configuration is similar to the Ryzen 7 7840HS (35 to 54 watts, unknown config).

Snapdragon X Elite = 950 score with 12 threads
Ryzen 7 7840HS = 979 score with 16 threads
i7-13800H = 996 score with 20 threads

From https://www.anandtech.com/show/21112/qualcomm-snapdragon-x-elite-performance-preview-a-first-look-at-whats-to-come

Ryzen 7 8700G (45 to 64 watts) = 986 with 16 threads

From https://www.topcpu.net/en/cpu-r/cinebench-2024-multi-core

Intel ArrowLake and AMD Zen 5 will be released this year to counter Qualcomm Oryon.

Last edited by Hammer on 15-Feb-2024 at 06:17 AM.
Last edited by Hammer on 15-Feb-2024 at 06:16 AM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

Status: Offline

ppcamiga1

Re: 32-bit PPC on FPGA
Posted on 15-Feb-2024 6:15:14

[ #174 ]

Cult Member

Joined: 23-Aug-2015
Posts: 829
From: Unknown

amiga in fpga it may be something nice
with 68k and risc
with ocs and better graphics
finally merged classic and ng

Status: Offline

kolla

Re: 32-bit PPC on FPGA
Posted on 15-Feb-2024 6:34:47

[ #175 ]

Elite Member

Joined: 21-Aug-2003
Posts: 3072
From: Trondheim, Norway

@ppcamiga1

Quote:

stop this assembler shit

Stop this shite shit

_________________
B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC

Status: Offline

Hammer

Re: 32-bit PPC on FPGA
Posted on 15-Feb-2024 7:11:58

[ #176 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5616
From: Australia

@Gunnar

Quote:

Gunnar wrote:

To whom do you talk?
I never said this.

You stated, "Normally code runs around 2 instructions per clock."

The norm can change when out-of-order and prefetch-depth capabilities are fattened.

Quote:

The Apollo 68080 CPU has 6 EXECUTION units.
2 EA, 2 IALU, 1 AMMX, 1 FPU
The Apollo 68080 can do up to 4 instructions per cycle.

Zen 4 can do 9 micro-ops from the micro-op cache (6.75K entries) and four micro-ops from its 4-way decoders.

Those 13 micro-ops are bottlenecked into the Rename/Dispatch unit, which can dispatch 6 micro-ops, while the Register Alias Tables can retire 8 micro-ops.

What's AC68080's instruction retirement rate?

The instruction retirement rate can be a bottleneck.
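
As a back-of-the-envelope sketch of the bottleneck argument: sustained throughput through a pipeline cannot exceed its narrowest stage. The C helper below (hypothetical name, three stages only, ignoring everything else that limits a real core) just takes the minimum of the per-cycle widths.

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Upper bound on sustained micro-ops per cycle for a three-stage
   front-end -> dispatch -> retire pipeline: the narrowest stage wins. */
static int sustained_width(int frontend, int dispatch, int retire) {
    assert(frontend > 0 && dispatch > 0 && retire > 0);
    return min_int(frontend, min_int(dispatch, retire));
}
```

With the Zen 4 figures quoted above, sustained_width(13, 6, 8) is 6, so dispatch rather than retirement would be the limit in that case. For a core where retirement is the narrowest stage, the same helper would return the retirement width instead.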

-------------------------------
https://ko-fi.com/post/Lightwave-5-benchmarking-and-findings-Z8Z3I8IOX
Lightwave-5 benchmarks

A4000/060 is a 50 MHz 68060 config.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

Status: Offline

Karlos

Re: 32-bit PPC on FPGA
Posted on 15-Feb-2024 9:40:23

[ #177 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4491
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

To all the doubters, just look at AmiBerry in the benchmarks above. Contrast the results with the X5000.

@NutsAboutAmiga

Do you still believe that 68K applications get a huge boost from having access to native OS calls? That's only true if that's the only thing it spends the majority of its time doing.

Most real application software spends most of its time either computing something, or sitting around idle, waiting to be triggered by some external action. LW rendering is a good example of the former. The further along the spectrum from compute bound to IO bound/event driven, the less raw speed matters anyway.
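
This is essentially Amdahl's law. A minimal C sketch, with made-up parameter names and no measured data behind it: if only a fraction of the run time is spent in OS calls, then making those calls faster can only shrink that fraction.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction of run time
   (os_fraction, between 0 and 1) is made os_call_speedup times
   faster and the rest is unchanged. */
static double overall_speedup(double os_fraction, double os_call_speedup) {
    assert(os_fraction >= 0.0 && os_fraction <= 1.0);
    assert(os_call_speedup > 0.0);
    return 1.0 / ((1.0 - os_fraction) + os_fraction / os_call_speedup);
}
```

For example, a program spending 10% of its time in OS calls gains less than 10% overall even if those calls become 10 times faster, while one spending 90% of its time there gains more than 5x.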

Last edited by Karlos on 15-Feb-2024 at 09:51 AM.

_________________
Doing stupid things for fun...

Status: Offline

pixie

Re: 32-bit PPC on FPGA
Posted on 15-Feb-2024 12:55:17

[ #178 ]

Elite Member

Joined: 10-Mar-2003
Posts: 3215
From: Figueira da Foz - Portugal

@geen_naam

Quote:
Amiberry is running a WinUAE-based, heavily optimised 68k JIT on a twice as fast core. Yet, it only manages to achieve similar results compared to MOS/OS4 on a X5000.

An emulator that actually optimizes stuff it should be optimizing... I know, mind blowing stuff!

_________________
Indigo 3D Lounge, my second home.
The Illusion of Choice | Am*ga

Status: Offline

Karlos

Re: 32-bit PPC on FPGA
Posted on 15-Feb-2024 13:12:15

[ #179 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4491
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@geen_naam

What is the cost/performance ratio of AmiBerry versus X5000 here?

_________________
Doing stupid things for fun...

Status: Offline

pixie

Re: 32-bit PPC on FPGA
Posted on 15-Feb-2024 13:24:45

[ #180 ]

Elite Member

Joined: 10-Mar-2003
Posts: 3215
From: Figueira da Foz - Portugal

@geen_naam

Quote:
1. Amiberry 68k JIT is based on heavily optimised WinUAE JIT. (68k emulation is primary function)

On ARM? 'heavily optimised WinUAE JIT' on ARM? I'll ask again in case you don't understand... 'heavily optimised WinUAE JIT' on ARM???

Quote:
2. MorphOS and AmigaOS4 JIT are less heavily optimised

I would think they would be quite optimized for PPC... Perhaps you don't know the gap between JIT and non-JIT code: in a heavily optimized (de facto) CPU combo such as WinUAE/x86 it's huge. Perhaps MOS or AmigaOS does some magic (i.e. point 4) to pull those numbers, but perhaps it's just simple JIT...

_________________
Indigo 3D Lounge, my second home.
The Illusion of Choice | Am*ga

Status: Offline


Amigaworld.net was originally founded by David Doyle