Forum Index / Amiga General Chat / Integrating Warp3D into my 3D engine
matthey 
Re: Integrating Warp3D into my 3D engine
Posted on 19-Feb-2025 17:59:59
#101 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2602
From: Kansas

Karlos Quote:

Write your preferred worst case instruction sequences for 68K under emulation and benchmark them on real 68K and under Emu68, Petunia etc.

Otherwise what you are writing remains a hypothetical worst case and not a practical one.


When does Emu68 not translate a 68k "OP mem,Rn" instruction into two sequential ARM64 load+op instructions with a load-to-use stall between?

code_68k:
add.l mem,Rn

code_RISC:
load mem,Rm ; load-to-use stall before the result is usable
add Rm,Rn

Did Michal write an instruction scheduler for Emu68? Is an instruction scheduler for a JIT compiler even a good idea? Are loads common enough to have a performance impact?

A typical general purpose CPU workload of instructions will average to approximately the following.

load 26%
store 10%
ALU 49%
branch 15%

About 1 in 4 instructions is a load, and without an instruction scheduler the simple translation of the most common 68k loads to RISC code incurs a load-to-use stall with no chance to begin executing instructions in either execution pipeline in between. With instruction scheduling, the Cortex-A53 could have executed 6-9 simple instructions during the load-to-use stall. The behavior of both execution pipelines has been documented down to the cycle for the Cortex-A53 and the PPC "common model" (PPC601/PPC603). I have described the behavior exactly and there is nothing "hypothetical" about it. The current worst-case performance impact is for 68k programs that perform many loads, and the best case is for programs with few loads. The best case could be improved in 3 ways.
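To make the cost concrete, here is a minimal back-of-envelope model (a sketch, not a measurement: the 3-cycle load-to-use penalty below is an assumed illustrative value, and the instruction mix is the one quoted above):

```python
# Estimate effective CPI when every load's consumer immediately follows it,
# i.e. a naive translation with no instruction scheduling.
# Instruction mix from above: load 26%, store 10%, ALU 49%, branch 15%.

def effective_cpi(load_fraction: float, stall_cycles: float, base_cpi: float = 1.0) -> float:
    """Base CPI plus the stall cycles charged to every load."""
    return base_cpi + load_fraction * stall_cycles

# Assumed 3-cycle load-to-use penalty (illustrative, not a measured A53 value):
print(f"{effective_cpi(0.26, 3):.2f}")  # 1.78 -> ~78% more cycles than the ideal 1.0
```

With no stall at all the model collapses back to the base CPI, which is the gain an instruction scheduler is chasing.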

1. a JIT compilation instruction scheduler but it may not be worthwhile
o load-to-use stall gains are offset by time to schedule instructions
2. AOT compilation could perform instruction scheduling without affecting execution performance
o additional data, storage and complexity required
3. a 68k compiler backend for RISC emulation but it is not worthwhile
o split 68k OP mem,Rn instructions into MOVE+OP instructions
o schedule instructions for RISC load-to-use stalls

These are not good options because native code gives better performance. Just use native ARM programs if you want to use ARM hardware. The 68k Amiga elegance and advantages are all lost with emulation. The 68k Amiga tech is good and in some ways better than the RISC tech, but it is different enough that it is nowhere close to competitive with translation overhead. It works better the other way, where CISC emulates RISC, as can be seen by x86-64 emulators emulating Nintendo's PPC G3 consoles at full speed even with the endian-swapping overhead. Of course that is high-end x86-64 hardware, and even x86 hardware does not scale down as far as 68k hardware could, if we had real hardware.

Heimdall Quote:

Sorry, I only have time right now to react to this paragraph.
But you're correct. I understand the instruction scheduling issue from the Jaguar's RISC, because I spent several years working with its GPU/DSP RISC and have gathered plenty of benchmark data on how stalls affect performance. It's not pretty.

So, while in theory you can get 1 cycle/op (instead of the typical 3 cycles/op) if there are no pipeline bubbles, in reality it takes a lot of effort to refactor the code to even get 1.5 c/op. An unoptimized first draft is often worse than ~2.2 c/op.

I only had a few hand-optimized inner loops that were approaching ~1.2 c/op. They're unmaintainable after a week despite the comments. Might as well toss them at that point.


Not all RISC CPU core designs are equal but most have been bubblicious. The philosophy of RISC was to simplify, which means minimal hardware: only one unit can perform a given operation, and the load/store unit uses a separate pipeline, so load results have a load-to-use penalty. Even simple superscalar RISC designs maintained this philosophy and gained very little from being superscalar because instruction scheduling was nearly impossible. The PPC "common model" covers the most common PPC CPUs, the superscalar PPC601 and PPC603(e), which rarely executed a pair of instructions in the examples from the PPC Compiler Writer's Guide. The problem is that they only have a single integer unit, so they can do a load+add but not an add of the most recent load because of the load-to-use stall, meaning loads need to be unrolled, wasting registers, bloating code and requiring instruction scheduling. The 68060 has no problem with a load+add even if the add is of the most recent load.

code_68060:
add.l mem,Rn ; 1 cycle pOEP|sOEP
move.l mem,Rm ; 1 cycle pOEP|sOEP, optimization forwards result to sOEP
add.l Rm,Rn ; 1 cycle pOEP|sOEP, optimization receives result from pOEP

The superscalar PPC601 and PPC603 cannot perform an add+add or other ALU+ALU instruction pair in the same cycle either. The 68060 has no problem with this either, even if one ALU operation is from memory.

code_68060:
add.l mem,Rn ; 1 cycle pOEP|sOEP
add.l Rx,Ry ; 1 cycle pOEP|sOEP

With existing code compiled for scalar 68k CPUs, 45%-55% of instructions issue as pairs/triplets on the 68060, and 50%-65% of instructions issue as pairs/triplets with 68060-optimized code. The PPC601 and PPC603 are barely superscalar in comparison, with much of the performance made up by doubling the 68060's cache sizes (the PPC601 already had double the 68060 caches, and the PPC603 was quickly replaced by the PPC603e with double the 68060 caches due to poor performance).

The PPC G3 (PPC603e successor), PPC604(e) and Cortex-A53 do have 2 integer units, which makes instruction scheduling easier and finally makes a superscalar RISC CPU worthwhile even with load-to-use stalls. However, instruction scheduling is still very important, as load-to-use stalls need to be avoided, especially when the load-to-use penalty is large, as is the case for the Cortex-A53. Most cores only have a single load/store unit and only allow a single memory access per cycle. The PPC604 is an exception with 2 load/store units, which makes instruction scheduling easier since a sequential load+load in code is no longer a superscalar problem, but the PPC604 is 4-issue and 3-completion capable so a sequential load+load+load in code may be a problem. Instruction scheduling for the 2-issue 68060 is still much easier, and a dual-ported data cache like the PPC604 and later x86 CPUs have would increase instruction scheduling opportunities and pair/triplet issue rates, especially for existing 68k code. The in-order 68060 with a dual-ported data cache would still be a smaller and simpler CPU core than an OoO PPC604 core, which was abandoned for the simpler and lower power G3 design.

Heimdall Quote:

I already mentioned it here, but on Jaguar the RAM latency was brutal. Loading a value from RAM (outside of 4 KB RISC cache) reduced instruction throughput drastically. I recall that a tight 4-op loop couldn't write #0 to RAM more than 19,000 times within 1 frame.

I'd be curious as to what that latency is on modern RISCs, like the ones in A500Mini.


Most embedded hardware either uses MCUs with SRAM at the low end or multi-level caches like desktop hardware at the mid to high end. The Cortex-A53 uses multi-level caches, so memory performance is less important except on startup, when streaming data, and with large programs and data which blow out the caches. Modern memory access latencies vary depending on whether you are hitting the L1, the L2, the L3 (not present on the Cortex-A53) or DRAM. For a typical Cortex-A53, see the chart at the following link.

https://www.7-cpu.com/cpu/Cortex-A53.html
https://www.7-cpu.com/ (for other CPUs)

I hope that helps.

Karlos 
Re: Integrating Warp3D into my 3D engine
Posted on 19-Feb-2025 18:49:00
#102 ]
Elite Member
Joined: 24-Aug-2003
Posts: 4937
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@matthey

I take it you aren't willing to write this test case.

_________________
Doing stupid things for fun...

Karlos 
Re: Integrating Warp3D into my 3D engine
Posted on 19-Feb-2025 19:37:05
#103 ]
Elite Member
Joined: 24-Aug-2003
Posts: 4937
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@matthey

You're welcome:

https://github.com/0xABADCAFE/emububble

This tests 10M iterations of adding a long from a memory location to a register. Example output:


Got Timer, frequency is 709379 Hz
Iterations: 10000000, step: 3
Result: 30000000, expected 30000000
Time: 41264 EClock ticks (58 ms)


All we need now is someone with a PiStorm to test it and someone on real silicon (pref 060) to test it.
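For anyone checking the numbers: the EClock tick counts convert to the reported milliseconds with truncating integer division at the reported timer frequency (a sketch of the conversion only, not code from the benchmark itself):

```python
# EClock ticks -> milliseconds, at the frequency the tool reports.
ECLOCK_HZ = 709379  # PAL Amiga EClock frequency, as printed by the tool

def ticks_to_ms(ticks: int) -> int:
    # Truncating division matches the ms values printed in this thread.
    return ticks * 1000 // ECLOCK_HZ

print(ticks_to_ms(41264))  # 58, as in the output above
```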

Last edited by Karlos on 19-Feb-2025 at 07:40 PM.

_________________
Doing stupid things for fun...

kriz 
Re: Integrating Warp3D into my 3D engine
Posted on 19-Feb-2025 20:18:50
#104 ]
Regular Member
Joined: 20-Mar-2005
Posts: 243
From: No (R) Way

@Karlos

On pistorm 3B (1.2ghz)

Got Timer, frequency is 709379 Hz
Iterations: 10000000, step: 3
Result: 30000000, expected 30000000
Time: 178730 EClock ticks (251 ms)

matthey 
Re: Integrating Warp3D into my 3D engine
Posted on 19-Feb-2025 20:45:34
#105 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2602
From: Kansas

MagicSN Quote:

As to a 1 GHz ASIC 080 I am unsure if it would be the big deal really.

If we assume linear speedup (and probably it will be less than linear ?) on a 1 GHz 080 we might get lowres H2 at 80 fps (10x current fps). PiStorm cm4 already has 50 fps there. So while this is a bit faster than the CM4 it is not THAT much faster. And Pi5 is (even with amiberry) still 2x faster than PiStorm Cm4. And when I read discussion on Apollo Plans I always read about 500-600 MHz, not 1 GHz. A 500 MHz 080 would probably not even reach the speed of PiStorm Cm4. Though it would enable playing games like Heretic 2 on it of course


There are different levels of ASICs.

o Full custom IC
o Cell-based IC
o Mask-programmable gate array
o Platform/structured

The AC68080@500-600MHz is likely the cheapest FPGA-to-ASIC conversion. It is relatively cheap, the ASIC chips are cheaper than the equivalent FPGA chips, and the FPGA HDL code does not require as much modification as more advanced ASIC designs do. The performance and power are improved but the results are limited. A larger, more expensive FPGA can be used as the source FPGA with a larger transistor budget, so performance could improve from larger caches, but an L2 cache may be needed, which is likely not used at the current low AC68080 clock speed and limited affordable FPGA transistor budgets. Large caches increase access times, which is why multi-level caches are used. L1 cache performance does improve linearly with the CPU core clock speed while memory performance does not. I asked Gunnar once what the cache size was and he did not tell me, but knowing him he likely changed it often. I expect it is much larger than the 68060's 8kiB I+D caches though. The 68060 on semi-modern silicon could easily have the L1 caches increased to 32kiB I+D, and this alone would improve performance. I would expect a 20%-30% increase in performance from enlarging to 16kiB I+D caches, which is what Motorola claimed for the 68060+.

Motorola Introduces Heir to 68000 Line (Microprocessor Report Vol. 8, No. 5, April 18, 1994)
https://websrv.cecs.uci.edu/~papers/mpr/MPR/ARTICLES/080502.pdf Quote:

o 68060+ - undisclosed architectural enhancements that increase performance 20-30% independent of clock frequency.

Process improvements will bring further speed gains and, perhaps more important, cost reduction.


The 68060 at 2.5 million transistors could have doubled the caches, and I believe that was the plan for the 68060+ along with higher clock speeds, but the 68060 core outperforms the PPC601 and PPC603 cores with the same sized caches, and the 8-stage 68060 could be clocked higher than the 4-stage PPC601 and PPC603. Did anyone bother telling Steve Jobs before he decided to switch to PPC, only to find that the shallow PPC pipelines did not clock up?

The L1 cache increase from 16kiB I+D to 32kiB I+D gives a smaller performance gain at about 15%.

Arthur Revitalizes PowerPC Line (Microprocessor Report Vol. 11, No. 2, February 17, 1997)
https://websrv.cecs.uci.edu/~papers/mpr/MPR/ARTICLES/110203.PDF Quote:

Somerset estimates the larger on-chip caches alone add about 15% in performance for typical Mac applications, compared with the 603e. The changes to the core - the extra integer unit, improved fetch rate, hardware TLB miss handler, and dynamic branch prediction - add another 15% or so. The biggest performance gain comes from the new L2 cache bus; this change raises performance by 50% or more.


Arthur was the internal name for the PPC G3 which increased the PPC603e L1 caches from 16kiB I+D to 32kiB I+D. The early G3 had no L2 caches but added on-chip L2 tags which increased performance by 50% or more. Moving the whole L2 cache on-chip would of course increase performance more and was done for late PPC G3 cores when transistor budgets allowed it. Perhaps this is adequate to see that substantial performance increases are possible with larger caches independent of clock speed.

The in-order Pentium MMX/P55C@233MHz with 16kiB I+D was already reaching about 48 fps in Quake.

https://thandor.net/benchmark/33

Even a cheap FPGA-to-ASIC conversion of the AC68080@500MHz would be on newer silicon with larger caches than the Pentium MMX, with otherwise similar if not better performance. I would like to see a 68k Amiga ASIC conversion more in line with the ColdFire ASICs, which used full auto layout of the also fully synthesizable cores.

MOTOROLA THAWS COLDFIRE V4 (Microprocessor Report May 15, 2000)
https://www.cecs.uci.edu/~papers/mpr/MPR/2000/20000515/142001.PDF Quote:

The larger caches are the biggest reason that the die didn’t shrink dramatically. Another reason is starkly visible in Figure 2, the die photo. ColdFire is the only family of processors from Motorola that’s entirely synthesized from high-level models with automated design tools. There’s no custom circuit layout at all. Compiled chips are bigger, slower, and less power-efficient than full-custom designs, but they are much quicker and cheaper to create. Where a hand-packed design typically has neat blocks of function units inside a Piet Mondrian grid of buses, the 5407 has an amorphous mass of compiler-generated circuits on a Jackson Pollock canvas of silicon. The only semblance of order comes from the caches and on-chip memories around the periphery of the die. They’re compiled too, but SRAM arrays obediently fall into dense rows and columns, even without a guiding hand.


The ColdFire V5@333MHz using a 130nm process in 2002 reached 610 DMIPS with 32kiB I+D caches using auto layout tools. The ColdFire V5 and 68060 are both similar fully static designs written in Verilog (the AC68080 may not be a fully static design and is written in VHDL). There were 8 years of silicon improvement from the 1994 68060@50MHz to the ColdFire V5@333MHz, and another 23 years of silicon improvement since, which I expect would allow a 68060@1GHz using a ColdFire-like ASIC process.

MagicSN Quote:

(Using H2 as Benchmark here)

A 2 GHz system of course would be totally different, that would be faster than everything else, including x5000...


Achieving 2GHz is possible with more pipeline stages but has diminishing returns and drawbacks (the AC68080 pipeline is likely longer than the 68060 pipeline to reach a higher clock speed in an FPGA). It may be possible to optimize performance-critical parts of a core to allow higher clock speeds, or to license already optimized blocks that would allow the core to clock higher with fewer or no drawbacks. Other enhancements would likely have priority over such optimizations and a better process. Starting with a core that is in good shape helps, like the fully static and modular 68060 design which is fully "MC" certified, has no known bugs remaining and brings respect.

kolla Quote:

I’m not sure what you are asking - chunky modes is what’s native modes on the Linux/ARM system which the THEA500 is. If you’re asking if it ships with P96 pre-installed and with software that use RTG, the answer is no. But nothing prevents the owner/user to install full AmigaOS with P96 and whatever, and even bring the system online. At this point it’s the owner who’s responsible for the device though.


Thanks. You answered the question. There are no built-in RTG chunky modes for THEA500 Mini. It would be possible to write a P96 driver if documentation could be found for the GPU but that is the problem with everyone using custom hardware instead of standard RPi hardware for 68k Amiga emulation.

Last edited by matthey on 19-Feb-2025 at 10:52 PM.

Karlos 
Re: Integrating Warp3D into my 3D engine
Posted on 19-Feb-2025 20:52:57
#106 ]
Elite Member
Joined: 24-Aug-2003
Posts: 4937
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@kriz

As a super crude approximation then, there are 3 68K instructions in this loop. 10,000,000 x 3 / 0.251 gives 119.5 emulated "MIPS" for this particular example.

I'd like to see the same for a 68060. One hopes it should get away with folding the branch out so it's just the cost of the two instructions.
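The approximation above, spelled out as code (values taken from the Pi 3B result in the previous post):

```python
# Emulated-MIPS estimate: total executed 68k instructions / elapsed seconds.
iterations = 10_000_000
insns_per_iteration = 3   # add.l + subq.l + bgt per loop pass
elapsed_s = 0.251         # PiStorm / Pi 3B timing from the previous post

mips = iterations * insns_per_iteration / elapsed_s / 1e6
print(f"{mips:.1f} emulated MIPS")  # ~119.5
```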

Last edited by Karlos on 19-Feb-2025 at 09:06 PM.

_________________
Doing stupid things for fun...

matthey 
Re: Integrating Warp3D into my 3D engine
Posted on 19-Feb-2025 21:31:37
#107 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2602
From: Kansas

Karlos Quote:

As a super crude approximation then, there are 3 68K instructions in this loop. 10,000,000 x 3 / 0.251 gives 119.5 emulated "MIPS" for this particular example.

I'd like to see the same for a 68060. One hopes it should get away with folding the branch out so it's just the cost of the two instructions.


The RPi 4 Cortex-A72@1.8GHz has 18 times the number of cycles in a given time as a 68060@100MHz, or 36 times as a 68060@50MHz. I expect the load-to-use stalls would make an in-order Cortex-A53@1.8GHz perform no better than a 68060@300MHz, but the Cortex-A72@1.8GHz is OoO, so it depends on how well the OoO design can reschedule instructions to fill the load-to-use stall. I expect an RPi 3 Cortex-A53@1.4GHz to perform at best like a 68060@233MHz. There is no doubt that a 68060@100MHz can not match the performance of even an RPi 3, but clock the 68060 up to the same frequency on similar silicon and it destroys the Cortex-A53, and is cheaper and lower power on similar silicon than the Cortex-A72. The 68060 does not even need larger caches for this test, but for overall performance it would need similarly modernized caches and memory, which would make it close to Cortex-A53 sized yet a fraction of the area of the Cortex-A72. The problem is not just weak performance but value. Would anyone buy hardware with a Cortex-A53@233MHz to use native code?

Karlos 
Re: Integrating Warp3D into my 3D engine
Posted on 19-Feb-2025 21:35:18
#108 ]
Elite Member
Joined: 24-Aug-2003
Posts: 4937
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

The 58ms result I got was on a 3GHz i7 running Amiberry 7, it was very consistent.

If we take it at face value, it's only 4.3x faster than the 1.2GHz Pi 3. That's with 2.5x the clock speed, so factoring that in we only get a factor of 1.72x per clock.

_________________
Doing stupid things for fun...

Karlos 
Re: Integrating Warp3D into my 3D engine
Posted on 19-Feb-2025 21:38:04
#109 ]
Elite Member
Joined: 24-Aug-2003
Posts: 4937
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@matthey

Are you going to run it on real silicon or are you just going to opine?

_________________
Doing stupid things for fun...

michalsc 
Re: Integrating Warp3D into my 3D engine
Posted on 19-Feb-2025 22:12:52
#110 ]
AROS Core Developer
Joined: 14-Jun-2005
Posts: 433
From: Germany

@Karlos

Please also make a version with a slightly unrolled loop.

ZXDunny 
Re: Integrating Warp3D into my 3D engine
Posted on 19-Feb-2025 22:47:37
#111 ]
New Member
Joined: 7-Feb-2025
Posts: 7
From: Unknown

PiStorm32lite, A1200 + CM4 (Mild OC 2.2Ghz).

Got Timer, frequency is 709379 Hz
Iterations: 10000000, step: 3
Result: 30000000, expected 30000000
Time: 91287 EClock ticks (128 ms)

Hope that helps.

Last edited by ZXDunny on 19-Feb-2025 at 10:48 PM.

ZXDunny 
Re: Integrating Warp3D into my 3D engine
Posted on 19-Feb-2025 22:52:03
#112 ]
New Member
Joined: 7-Feb-2025
Posts: 7
From: Unknown

And from a friend with an 060:

B1260/50Mhz

Got Timer, frequency is 709379 Hz
Iterations: 10000000, step: 3
Result: 30000000, expected 30000000
Time: 290104 EClock ticks (408 ms)

(x-posted from EAB)

Karlos 
Re: Integrating Warp3D into my 3D engine
Posted on 19-Feb-2025 23:42:47
#113 ]
Elite Member
Joined: 24-Aug-2003
Posts: 4937
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@michalsc

Pushed a version with a second 4x unrolled loop to test.

_________________
Doing stupid things for fun...

matthey 
Re: Integrating Warp3D into my 3D engine
Posted on 20-Feb-2025 0:24:31
#114 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2602
From: Kansas

Karlos emububble benchmark results:

CPU@MHz            | time   | 68060 equivalent
68060@50MHz        | 408 ms | 68060@50MHz
Cortex-A53@1200MHz | 251 ms | 68060@81MHz
Cortex-A72@2200MHz | 128 ms | 68060@159MHz
i7@3000MHz         | 58 ms  | 68060@352MHz

The code all fits in the L1 cache so I would expect performance to scale linearly with the clock speed. I expected an RPi 3 Cortex-A53@1400MHz would be at best equivalent to a 68060@233MHz, and the tested Cortex-A53 is clocked a little lower. The results are lower than I expected, but this is close to a worst-case benchmark for RISC cores with load-to-use penalties. The benchmark is tough even for OoO cores, as there may not be enough instructions to reschedule to avoid load-to-use stalls.
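The equivalence column follows from scaling the 68060@50MHz baseline by the runtime ratio (a sketch recomputing from the timings posted in this thread, rounded to whole MHz):

```python
# "68060 equivalent" clock = baseline clock * (baseline time / measured time).
BASELINE_MS, BASELINE_MHZ = 408, 50  # B1260 68060@50MHz result

def equivalent_mhz(time_ms: int) -> int:
    return round(BASELINE_MHZ * BASELINE_MS / time_ms)

for name, ms in [("Cortex-A53@1200MHz", 251),
                 ("Cortex-A72@2200MHz", 128),
                 ("i7@3000MHz", 58)]:
    print(name, equivalent_mhz(ms))
```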

michalsc Quote:

Please make also version with slightly unrolled loop.


Maybe 2 versions. The following code would test whether OoO can fill the load-to-use slots better without the branch. The in-order Cortex-A53 will still struggle but should improve a small amount along with the 68060.

.loop:
add.l mem,d1
add.l mem,d2
add.l mem,d3
add.l mem,d4
subq.l #1,d0
bgt.s .loop

The original code is just an "add.l mem,d1" in a code- and register-saving loop, and the 68060 performs well. I do not expect much of an improvement from unrolling the loop, which shows how advanced the loop handling was for 1994. A 3rd version could be RISCified 68k code using RISC scheduling to compare to the other results.

.loop:
move.l mem,d1
move.l mem,d2
move.l mem,d3
move.l mem,d4
add.l d1,a1
add.l d2,a2
add.l d3,a3
add.l d4,a4
subq.l #1,d0
bgt.s .loop

I would expect the last version to have the best performance on RISC cores with load-to-use penalties but the worst performance on the 68060.

Edit: My post was too late. You chose the first version. Maybe you should have reduced the loop count to 1/4 of the original loop when there are 4 adds per iteration so the results would compare more easily?

Last edited by matthey on 20-Feb-2025 at 12:50 AM.
Last edited by matthey on 20-Feb-2025 at 12:49 AM.
Last edited by matthey on 20-Feb-2025 at 12:35 AM.

ZXDunny 
Re: Integrating Warp3D into my 3D engine
Posted on 20-Feb-2025 0:36:17
#115 ]
New Member
Joined: 7-Feb-2025
Posts: 7
From: Unknown

Again, PiStorm32Lite CM4 at 2.2GHz:

Got Timer, frequency is 709379 Hz
Iterations: 10000000, step: 3
Result: 30000000, expected 30000000
Time: 89085 EClock ticks (125 ms)
Unrolled (4x):
Result: 30000000, expected 30000000
Time: 22325 EClock ticks (31 ms)


Karlos 
Re: Integrating Warp3D into my 3D engine
Posted on 20-Feb-2025 0:49:25
#116 ]
Elite Member
Joined: 24-Aug-2003
Posts: 4937
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@matthey

From Paraj @ EAB, the unrolled version on 060/50
Quote:

Results from that (still B1260/50):
Got Timer, frequency is 709379 Hz
Iterations: 10000000, step: 3
Result: 30000000, expected 30000000
Time: 290172 EClock ticks (409 ms)
Unrolled (4x):
Result: 30000000, expected 30000000
Time: 180747 EClock ticks (254 ms)


Looks like the 4x unroll definitely does help the 060 too. That or I've messed it up somehow...

_________________
Doing stupid things for fun...

Karlos 
Re: Integrating Warp3D into my 3D engine
Posted on 20-Feb-2025 1:01:33
#117 ]
Elite Member
Joined: 24-Aug-2003
Posts: 4937
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@matthey

Quote:

Edit: My post was too late. You chose the first version. Maybe you should have reduced the loop count to 1/4 of the original loop when there are 4 adds per iteration so the results would compare more easily?


It is doing this. Note the lsr.l #2,d0 before entering the unrolled loop.

_________________
Doing stupid things for fun...

matthey 
Re: Integrating Warp3D into my 3D engine
Posted on 20-Feb-2025 1:16:04
#118 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2602
From: Kansas

Karlos Quote:

Looks like the 4x unroll definitely does help the 060 too. That or I've messed it up somehow...


More 68060 improvement from unrolling than I expected. Unrolled loops sometimes provide a large performance improvement and sometimes a small one, and I never figured out why. My 68060 CopyMem() patch only used 2x move.l (a0)+,(a1)+ in a loop, which was not the best performance but close enough to it that I did not unroll it any further. A move16 loop wanted more unrolling, and plenty of other code does as well. These are complex cores that are difficult to predict sometimes. Unrolling improves the 68060's performance and competitiveness even though it requires more code, which is bad for caches, but not as bad as on RISC cores, which have fat instructions and fatter code from requiring more loop unrolling to avoid load-to-use stalls. A 4x unroll is not always enough to remove all of the Cortex-A53 load-to-use penalty. RISC synergies of bloat at work.
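The unroll gains implied by the timings in this thread work out as follows (rough ratios of rolled to 4x-unrolled times, labels taken from the posts above):

```python
# Speedup of the 4x-unrolled loop over the rolled loop.
timings_ms = {
    "68060@50MHz (B1260)":         (409, 254),  # (rolled, unrolled)
    "Emu68 PiStorm32 CM4@2.2GHz":  (125, 31),
}
speedup = {name: rolled / unrolled for name, (rolled, unrolled) in timings_ms.items()}
for name, s in speedup.items():
    print(f"{name}: {s:.2f}x")  # ~1.61x on the 68060, ~4.03x under Emu68
```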

Karlos Quote:

It is doing this. Note the lsr.l #2,d0 before entering the unrolled loop.


I overlooked that. Good job anticipating the comparison of results.

Last edited by matthey on 20-Feb-2025 at 01:29 AM.

kolla 
Re: Integrating Warp3D into my 3D engine
Posted on 20-Feb-2025 3:36:13
#119 ]
Elite Member
Joined: 20-Aug-2003
Posts: 3418
From: Trondheim, Norway

@matthey

Quote:
Thanks. You answered the question. There are no built-in RTG chunky modes for THEA500 Mini. It would be possible to write a P96 driver if documentation could be found for the GPU but that is the problem with everyone using custom hardware instead of standard RPi hardware for 68k Amiga emulation.


The hardware is well understood, there’s native OpenGL for native games as well as some of the emulators, and for Amiberry there’s uaegfx.card, the built-in generic P96 driver.

https://youtu.be/Ib6AMVXCGtc?t=660

Using a Raspberry Pi doesn’t help here; Amiberry on RPi is pretty much the same situation, while the Musashi P96 driver isn’t super great. The one that stands out is the 68k native driver for Emu68, but then you (as of now) also need Amiga hardware of some sort, either real or “simulated” (Minimig on FPGA).

But frankly, with the THEA500 mainly being a Linux/ARM gaming console, with tons of emulators as well as native games (with just about every Amiga RTG game being a port from PC, for which there are also ARM native ports)… AmigaOS RTG isn’t really so important.

Last edited by kolla on 20-Feb-2025 at 03:47 AM.
Last edited by kolla on 20-Feb-2025 at 03:46 AM.
Last edited by kolla on 20-Feb-2025 at 03:41 AM.

_________________
B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC

michalsc 
Re: Integrating Warp3D into my 3D engine
Posted on 20-Feb-2025 6:29:54
#120 ]
AROS Core Developer
Joined: 14-Jun-2005
Posts: 433
From: Germany

@Karlos

Unroll on Emu68 has several effects on the generated code:

1. The short loop contains a branch instruction, which also contributes to the instruction count.
2. The branch itself is more complex than one might think, since it eventually needs to leave the JIT block and because of that performs some preparations - effectively the branch generates many more instructions than the ADD.L and SUBQ.L together.
3. The GT branch is a complex one and as such manipulates the ARM condition flags, which adds a penalty. Try BNE instead - it will already be faster.
4. The loop results in condition code generation in the SUBQ.L opcode, which adds its own penalty on the ARM side.

Now the unrolled loop:
1. ADD.L mem,reg generates the very same code as before, generating 3 ARM instructions - two for pushing the absolute mem address into a register and one fetch. Address generation using the instructions I used is, on more advanced ARM architectures, squeezed into a single operation.
2. SUBQ.L causes condition code generation four times less frequently.
3. The branch is four times less frequent.

I would follow matthey's suggestion and also test the other kind of loop - four fetches followed by four add operations. This effectively generates more ARM instructions but avoids the fetch penalty to some degree. Actually, on the A76 and above this would be even faster, since those CPUs have two parallel fetch units.

Second suggestion - try a loop where BNE is used instead of BGT :)

And of course, thanks for spending your time on writing such a benchmark - it shows what I was saying every time I was asked: there is no linear speedup with JIT, it all depends on the code behind it, and because of that Emu68 can be either just very, ridiculously, or ludicrously fast :)


Copyright (C) 2000 - 2019 Amigaworld.net.
Amigaworld.net was originally founded by David Doyle