Poster | Thread |
Jose
|  |
Re: Integrating Warp3D into my 3D engine Posted on 6-Feb-2025 21:49:01
| | [ #81 ] |
|
|
 |
Super Member  |
Joined: 10-Mar-2003 Posts: 1001
From: Unknown | | |
|
| I was surprised recently when, by accident, I saw some Amiga-related videos on YouTube with huge view counts. Heck, even alternative platforms like Rumble or BitChute, which surfaced around the COVID era and have far fewer users, have Amiga-related videos! I think that's an indication that there are many more potential Amiga users than is apparent. But it's also not evident that the platform is more than just nostalgia for them. With the recent (or maybe not so recent, for those who have been paying attention) direction big companies like Microsoft and Apple are taking regarding their platforms and the lack of user freedom, I think there could be a huge potential user base, at least in countries that don't interfere with and block newcomers in those industries in the name of security. A good example of that is the EU fines for platforms like Facebook and Google, where the fines are paid off in one or two days of profit, so they amount to no more than a cost of doing business and eliminating the competition. So would the US or the EU even allow a new CPU / platform they don't heavily control, in the name of security? Would they even abstain from interfering in it?
_________________
 José |
|
Status: Offline |
|
|
Heimdall
|  |
Re: Integrating Warp3D into my 3D engine Posted on 17-Feb-2025 19:12:05
| | [ #82 ] |
|
|
 |
Member  |
Joined: 20-Jan-2025 Posts: 99
From: North Dakota | | |
|
| @matthey
Quote:
THEA500 Mini emulates AGA and reaches 68030 levels of performance with JIT turned on which is off by default for better compatibility. It has better performance than probably 90% of the Amigas Commodore sold but is low performance for 3D. |
Well, but today, while browsing, I found some interesting benchmark figures for the A500 Mini which would actually imply it's faster than the V4 for my 3D engine (which uses only integer ops, no floating-point).
Allegedly, it does 220 MIPS, which is more than the V4SA (~170, I think) and close to the A600GS (257 MIPS).
More importantly, it shouldn't suffer from the 32-bit bandwidth issues of classic AGA systems, because it's a modern CPU (ARM, IIRC).
So, is RTG supported out of the box on the A500 Mini, or do you have to fiddle with it manually to install it there? |
|
Status: Offline |
|
|
MagicSN
|  |
Re: Integrating Warp3D into my 3D engine Posted on 17-Feb-2025 23:06:58
| | [ #83 ] |
|
|
 |
Hyperion  |
Joined: 10-Mar-2003 Posts: 764
From: Unknown | | |
|
| @Heimdall
| No wonder - I think the A500 Mini uses an A53 CPU or similar (same as the A600GS), which is factors faster than the Vampire V4.
Using my port of Heretic2 as benchmark:
- V4: 320x256, 8 fps
- A53 (PiStorm 3): 640x480, 10 fps
The PiStorm 4/CM4 is still 2-3x faster than the A53: 640x480 at around 20 fps on an unoverclocked Pi4 and around 25-30 fps on a CM4. An overclocked Pi5 (for example a standalone Pi5 with AmiKit running AmigaOS) gets around 48-56 fps (56 is overclocked), so around 6x faster than the Vampire at a 4x higher resolution than the Vampire test; the 4/CM4 is only 3-4x faster than the Vampire, again at a 4x higher resolution.
Note that with 2D games using integer math the difference is smaller. In my test with "secret project #1", PiStorm was only around 2.5x faster than the Vampire at 1024x768; at 800x600 the difference was even smaller (probably memory speed or video access made a difference here?) Last edited by MagicSN on 18-Feb-2025 at 06:21 AM.
|
|
Status: Offline |
|
|
matthey
|  |
Re: Integrating Warp3D into my 3D engine Posted on 18-Feb-2025 6:01:34
| | [ #84 ] |
|
|
 |
Elite Member  |
Joined: 14-Mar-2007 Posts: 2518
From: Kansas | | |
|
| Heimdall Quote:
Well, but today, while browsing, I found some interesting benchmark figures for A500 Mini, which would actually imply it's faster for my 3D engine (uses only Integer ops, no floating-point) than V4.
Allegedly, It does 220 MIPS, which is more than V4SA (~170 I think) and is close to A600 GS (257 MIPS).
|
Be careful that the claim is not SysInfo MIPS instead of DMIPS. SysInfo MIPS, which come from a popular Amiga program called SysInfo, are useless. The V4SA MIPS claim sounds like DMIPS. The best Motorola DMIPS claim for the 68060 was 1.8 DMIPS/MHz, so a 68060@100MHz is about 180 DMIPS. The AC68080 clock speed and overall performance are similar even though the strengths are often different. I do not know the DMIPS of a 68k CPU emulated on Cortex-A53 hardware. ARM Cortex-A53 specs vary and emulation results may not be consistent. I believe the A600GS hardware is a little higher spec and performance than THEA500 Mini. Performance is likely to be dominated by the 3 cycle load-to-use stalls of 68k loads translated to ARM code without instruction scheduling. Roughly 1 in 4 instructions in typical code is a load, and with unscheduled code it often has a latency of 5 cycles on the Cortex-A53.
code_68060:
  add.l (mem),Rn        ; pOEP

code_RISC:
  load (mem),Rm         ; pOEP
  load-to-use bubble    ; sOEP
  load-to-use bubble    ; pOEP
  load-to-use bubble    ; sOEP
  load-to-use bubble    ; pOEP
  load-to-use bubble    ; sOEP
  load-to-use bubble    ; pOEP
  load-to-use bubble    ; sOEP
  add Rm,Rn             ; pOEP
About 25% of 68k instructions introduce 6+ bubbles in the Cortex-A53 execution pipelines when accessing the L1 data cache. The RISC solution to awful performance efficiency is to clock up the core. The 1994 8-stage in-order 68060's 1.8 DMIPS/MHz was not surpassed by an 8-stage in-order ARM core until the 2011 Cortex-A7 with 1.9 DMIPS/MHz, the predecessor of the 8-stage in-order Cortex-A53 with 2.3 DMIPS/MHz. ARM only needed 17 years to surpass the 68060's performance efficiency, but who cares about efficiency with modern silicon and high clock speeds. The ARM DMIPS/MHz rating is measured with compiled, instruction-scheduled code, while it is easy to see from the example above that performance efficiency falls off a cliff without instruction scheduling.
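To make the pattern above concrete in something a compiler sees, here is a minimal C sketch. It is only my illustration and not from any of the cited material; the function names and the four-way unroll are arbitrary. The chained version has a load feeding the very next instruction on every iteration, while the independent version gives a scheduler (or an OoO core) other work to overlap with the loads.

/* Two loops with the same loads and adds but different dependency structure. */
#include <stddef.h>

struct node { struct node *next; long value; };

/* Every add depends on the load just before it, and the next load depends
   on the pointer that was just loaded (pointer chasing). */
long sum_chained(const struct node *p)
{
    long sum = 0;
    while (p) {
        sum += p->value;
        p = p->next;
    }
    return sum;
}

/* Four independent accumulators let the loads of a[i]..a[i+3] overlap. */
long sum_independent(const long *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}

The second loop is the kind of code instruction scheduling can help; the first cannot be helped by any scheduler because the dependency is in the data itself.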
Heimdall Quote:
More importantly, it shouldn't suffer from 32-bit bandwidth performance issue, like classic AGA systems, because it's a modern CPU (ARM, IIRC).
|
High end 68k accelerators using fast memory bypass the low bandwidth AGA chip mem in many cases, especially with RTG. Memory bandwidth is still nowhere close to ARM memory bandwidth, but ARM has other overhead, including load-to-use stalls. It is the large caches and high clock speeds, relative to the 68k, that make low-end ARM cores acceptable for emulating the 68k.
Heimdall Quote:
So, is RTG supported out of the box on A500 Mini or do you have to fiddle with it manually to install it there?
|
I do not own THEA500 Mini and am not sure. Some emulation and FPGA Amiga hardware includes P96 RTG support and some does not. I am surprised someone with a Mini has not answered your question.
Last edited by matthey on 18-Feb-2025 at 08:06 AM.
|
|
Status: Offline |
|
|
Karlos
|  |
Re: Integrating Warp3D into my 3D engine Posted on 18-Feb-2025 14:31:18
| | [ #85 ] |
|
|
 |
Elite Member  |
Joined: 24-Aug-2003 Posts: 4907
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition! | | |
|
| @matthey
You come up with this line of argument every time. So, why not put it to the test for real?
Compile a small binary that executes this class of worst-case code in a loop and time a million iterations.
Then it can be tested on real silicon and under Emu68.
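Something along these lines would do for a start - just a sketch, with the iteration count, the chain length and the clock()-based timing all picked arbitrarily - compiled once for 68k and then run on a real 060, under Emu68, Petunia and so on:

/* Rough timing harness: follows an index chain so every iteration has a
   load whose result is needed immediately (a load-to-use dependency). */
#include <stdio.h>
#include <time.h>

#define ITERATIONS 1000000L
#define CHAIN_LEN  256

static long chain[CHAIN_LEN];

int main(void)
{
    long i, idx = 0, sum = 0;
    clock_t t0, t1;

    /* Fill the table so each loaded value is the next index to load. */
    for (i = 0; i < CHAIN_LEN; i++)
        chain[i] = (i * 97L + 1L) % CHAIN_LEN;

    t0 = clock();
    for (i = 0; i < ITERATIONS; i++) {
        idx = chain[idx];   /* loaded value is used on the very next line */
        sum += idx;
    }
    t1 = clock();

    printf("sum=%ld ticks=%ld (CLOCKS_PER_SEC=%ld)\n",
           sum, (long)(t1 - t0), (long)CLOCKS_PER_SEC);
    return 0;
}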
Real numbers, please. _________________ Doing stupid things for fun... |
|
Status: Offline |
|
|
kolla
|  |
Re: Integrating Warp3D into my 3D engine Posted on 18-Feb-2025 18:25:25
| | [ #86 ] |
|
|
 |
Elite Member  |
Joined: 20-Aug-2003 Posts: 3373
From: Trondheim, Norway | | |
|
| @matthey
Quote:
I am surprised someone with a Mini has not answered your question. |
There's more than just one here? _________________ B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC |
|
Status: Offline |
|
|
Heimdall
|  |
Re: Integrating Warp3D into my 3D engine Posted on 18-Feb-2025 19:16:40
| | [ #87 ] |
|
|
 |
Member  |
Joined: 20-Jan-2025 Posts: 99
From: North Dakota | | |
|
| @MagicSN
Quote:
MagicSN wrote: @Heimdall
No wonder i think a500mini uses some A53 cpu or similar (same as 600gs) which is factors faster than Vampire v4.
Using my port of Heretic2 as benchmark:
- V4: 320x256, 8 fps
- A53 (PiStorm 3): 640x480, 10 fps
The PiStorm 4/CM4 is still 2-3x faster than the A53: 640x480 at around 20 fps on an unoverclocked Pi4 and around 25-30 fps on a CM4. An overclocked Pi5 (for example a standalone Pi5 with AmiKit running AmigaOS) gets around 48-56 fps (56 is overclocked), so around 6x faster than the Vampire at a 4x higher resolution than the Vampire test; the 4/CM4 is only 3-4x faster than the Vampire, again at a 4x higher resolution.
Note that with 2D games using integer math the difference is smaller. In my test with "secret project #1", PiStorm was only around 2.5x faster than the Vampire at 1024x768; at 800x600 the difference was even smaller (probably memory speed or video access made a difference here?) |
Those are some really nice benchmark results! I spent ages on EAB trying to get such numbers. I even wrote a separate benchmark for my engine that does multiple synthetic tests (3D transform, quad set-up, scanline traversal) and finally full-scene rendering, but I wasn't successful in finding anyone with a PiStorm to run it.
So, I really appreciate you sharing these numbers, as that's the closest I've gotten so far to guesstimating how it might run on a PiStorm compared to my V4SA!
I don't have the results near me, but on the V4SA my game was playable up to 1280x720, though we all have different frame rate preferences, of course.
And I always wondered what a PiStorm with 2,500 MIPS would do! |
|
Status: Offline |
|
|
Heimdall
|  |
Re: Integrating Warp3D into my 3D engine Posted on 18-Feb-2025 19:40:18
| | [ #88 ] |
|
|
 |
Member  |
Joined: 20-Jan-2025 Posts: 99
From: North Dakota | | |
|
| @matthey
1. SysInfo - it is a number from SysInfo, because I distinctly recall going through the hoops to upload the binary to my V4 and running it there. I also recall that the new core ran at a higher clock, so the SysInfo number was higher than two years ago. If I am not mistaken, there's been another new core in the last 12 months that is clocked even higher, so the number I mentioned is likely obsolete today.
2. RISC bubbles - I'm familiar with them from years of coding the Jaguar's DSP and GPU. In theory you can get 1 op per cycle, but the bubbles are quite bad. Random code needs around 2.3 cycles per op. If you spent a day refactoring an inner loop, you could get down to 1.4 cycles per op. It was very rare to have a loop that got below 1.25. Of course, the main disadvantage of such optimization is that you will never touch that code again. It's humanly unreadable after two months, despite the comments...
3. RTG accelerators - from what I have been told, the majority support 256 colors and 24-bit color depth is basically unheard of, which is understandable given the era. But we're in 2025 now, and there's a plethora of devices like the A500 Mini, A600GS, Vampire, Minimig, PiStorm and others that don't have RAM bandwidth issues like it's 1992. It only makes sense to first get it running on modern HW. Besides, that's the only HW I can get myself anyway, as I am not paying the exorbitant eBay prices for, ehm, "museum HW"
4. A500 Mini owners - I noticed there aren't many around here. Besides, the target HW here by definition reaches people who don't want to spend a month configuring it. It's a plug-and-play box, after all. |
|
Status: Offline |
|
|
Karlos
|  |
Re: Integrating Warp3D into my 3D engine Posted on 18-Feb-2025 20:14:31
| | [ #89 ] |
|
|
 |
Elite Member  |
Joined: 24-Aug-2003 Posts: 4907
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition! | | |
|
| @Heimdall
| Matthey utterly *despises* 68K emulation. He's also no fan of the whole Vampire / 68080, because the designers didn't want to listen to him on how they should implement it. It's not that what he says is without merit, but he's chasing a dream in which a "true successor" to the 68060 can be built as a genuine ASIC, capable of scaling to GHz clock rates, and all of this done at a reasonable cost (a few million USD). He's been saying this for years, and I honestly wish some eccentric billionaire would come along and make it happen, but I don't see it happening in reality.
TLDR, I would take what he says with a pinch of salt, but the concept he's describing is still real and I'd love to see a genuine, measured in-situ test of these suggested worst-case examples. _________________ Doing stupid things for fun... |
|
Status: Offline |
|
|
NutsAboutAmiga
|  |
Re: Integrating Warp3D into my 3D engine Posted on 18-Feb-2025 20:51:50
| | [ #90 ] |
|
|
 |
Elite Member  |
Joined: 9-Jun-2004 Posts: 12974
From: Norway | | |
|
| |
Status: Offline |
|
|
MagicSN
|  |
Re: Integrating Warp3D into my 3D engine Posted on 18-Feb-2025 22:21:26
| | [ #91 ] |
|
|
 |
Hyperion  |
Joined: 10-Mar-2003 Posts: 764
From: Unknown | | |
|
| @Heimdall
| Not sure what you mean by "RTG accelerators", but most graphics cards, even from the old days, support up to 24 (32) bit. Of course, HighColor and TrueColor on a Zorro card or whatever would be a bit slow for a game.
What is currently relevant (PCI graphics cards, PiStorm, Vampire) runs higher color depths fast enough (well, as for H2 on the Vampire, that's a bit beyond what it can do...).
Here, BTW, is my complete Heretic 2 benchmark page with lots of different systems:
http://amigagaming.de/test/benchmark-h2 |
|
Status: Offline |
|
|
matthey
|  |
Re: Integrating Warp3D into my 3D engine Posted on 18-Feb-2025 23:54:54
| | [ #92 ] |
|
|
 |
Elite Member  |
Joined: 14-Mar-2007 Posts: 2518
From: Kansas | | |
|
| Karlos Quote:
You come up with this line of argument every time. So, why not put it to the test for real?
Compile a small binary that executes this class of worst case code in a loop and time a million interactions.
Then it can be tested on real silicon and under Emu68
Real numbers, please.
|
| Heimdall has likely not heard my explanation of the problem of load-to-use stalls for 68k to ARM code translation. It is instruction scheduling 101, which he should already understand as a RISC assembly coder. We have looked at Emu68 traces before; it naively translates 68k mem-reg instructions into separate ARM64 load and op instructions. The "worst case" occurs on every 68k OP mem-reg instruction translated to ARM instructions this way. An in-order Cortex-A53 core has no way to avoid the full 3-5 cycle load-to-use stall, unlike OoO cores.
https://tech-blog.sonos.com/posts/assembly-still-matters-cortex-a53-vs-m1/ Quote:
The main differentiating factor between CPUs is the pipeline implementation. Some chips have an in-order pipeline, while others have an out-of-order pipeline.
Seeing as the Cortex-A53 is an old and modest CPU, it uses the "in-order, partial dual issue" pattern; in other words, the issuing logic is just smart enough to dispatch one or two instructions at each cycle to two different ports. Instructions will be started in the exact order in which the program presents them.
On the other hand, the Apple M1 features an out-of-order pipeline that will consider a handful of upcoming instructions at each cycle, start as many as it can in each cycle, and even move instructions before their turn if it does not change the program semantics. As a consequence, for such an advanced CPU, it is enough that the program logic is arranged in the simplest way possible to let the processor issue logic reorder and reorganize instructions in the most favorable order.
On the Cortex-A53 things are harder. The in-order pipeline makes it critical that the developer is aware of "dependency chains" in the code. For example, in our first kernel, the second ld1 instruction loads values into v4 from memory. The next instruction actually uses this value in a computation. Loading a value typically takes 3 to 5 cycles, assuming the memory we are reading is in cache. During these cycles, the Cortex-A53 has nothing else to do and thus must wait for the result. During this time it cannot start the next fmla as it needs v4 to be loaded. The core is stalling; it's wasting cycles, time, and energy. It is up to the assembly developer to find a way to organize the computations so that these dependency chains do not block the processor.
|
The last paragraph is explaining a load-to-use or load-use stall. The 2nd to last paragraph is more nuanced than stated. OoO cores can sometimes remove or reduce load-to-use stalls but instruction scheduling is recommended and may improve performance.
The PowerPC Compiler Writer's Guide https://cr.yp.to/2005-590/powerpc-cwg.pdf Quote:
The example in Figure 4-6 uses pointer chasing to illustrate how execution pipelines can stall because of the latency for cache access. This latency stalls dispatch of the dependent compare, creating an idle execution cycle in the pipeline. Moving an independent instruction between the compare and the branch can hide the stall, that is, perform useful work during the delay. The delay is referred to as the load-use delay. The same principle applies to any instruction which follows the load and has operands that depend on the result of the load.
|
There is a PPC load-to-use instruction scheduling example in Fig 4-17.
Cycle | Unscheduled Assembly Fragment
  0     lwz R0,4(R3)    # load a[i+1]
  0                     # load-to-use bubble/stall
  1                     # load-to-use bubble/stall
  1                     # load-to-use bubble/stall
  2     add R5,R5,R0    # r = r + a[i+1]
  3     lwz R6,8(R3)    # load a[i+2]
  3                     # load-to-use bubble/stall
  4                     # load-to-use bubble/stall
  4                     # load-to-use bubble/stall
  5     add R5,R5,R6    # r = r + a[i+2]
The instructions are scheduled/rearranged to avoid the PPC "common model" 1 cycle load-to-use stalls, even on OoO PPC as all the PPC cores covered in the guide are OoO.
Cycle | Assembly Fragment Scheduled to Account for Cache Latency
  0     lwz R0,4(R3)    # load a[i+1]
  1     lwz R6,8(R3)    # load a[i+2]
  2     add R5,R5,R0    # r = r + a[i+1]
  3     add R5,R5,R6    # r = r + a[i+2]
PPC developers understood that load-to-use stalls are performance killers, which is one of the reasons why early shallow PPC pipelines with small load-to-use penalties were favored. The 8-stage in-order Cortex-A53 has a minimum 3 cycle load-to-use stall; with no scheduling, the same fragment would look like the following, using PPC assembly for all you PPC assembly lovers out there.
Cycle | Unscheduled Assembly Fragment with 3 cycle load-to-use penalty
  0     lwz R0,4(R3)    # load a[i+1]
  0                     # load-to-use bubble/stall
  1                     # load-to-use bubble/stall
  1                     # load-to-use bubble/stall
  2                     # load-to-use bubble/stall
  2                     # load-to-use bubble/stall
  3                     # load-to-use bubble/stall
  3                     # load-to-use bubble/stall
  4     add R5,R5,R0    # r = r + a[i+1]
  4     lwz R6,8(R3)    # load a[i+2]
  5                     # load-to-use bubble/stall
  5                     # load-to-use bubble/stall
  6                     # load-to-use bubble/stall
  6                     # load-to-use bubble/stall
  7                     # load-to-use bubble/stall
  7                     # load-to-use bubble/stall
  8     add R5,R5,R6    # r = r + a[i+2]
The 68060 can execute similar code in 2 cycles.
  0     add.l mem,Rn    ; lwz+add
  1     add.l mem,Rn    ; lwz+add
The 68060 can often perform a load and a store access per cycle, but not 2 loads per cycle. A dual-ported data cache would allow 2 loads per cycle, reducing the timing of the code above to a single cycle, making instruction scheduling easier still, and improving performance with legacy scalar 68k code and unscheduled 68k code. The performance advantage of zero-cycle loads approaches the performance of OoO and is much cheaper in resources.
Zero-Cycle Loads: Microarchitecture Support for Reducing Load Latency https://ftp.cs.wisc.edu/sohi/papers/1995/micro.zcl.pdf Quote:
This result is striking when one considers the clock cycle and design time advantages typically afforded to in-order issue processors. It may be the case that for workloads where untolerated latency is dominated by data cache access latencies (as in the case of the integer benchmarks), an in-order issue design with support for zero-cycle loads may consistently out perform an out-of-order issue processor.
|
The RISC way of increasing the number of instructions, bloating the code, using more GP registers and introducing stalls is not good for performance. That the 68060 uses fewer cycles than a PPC603, which in turn uses fewer cycles than a Cortex-A53 core, does not matter in the end, as good designs are easily surpassed by worse designs on newer silicon.
Karlos Quote:
Matthey utterly *despises* 68K emulation. He's also no fan of the whole vampire / 68080 because the designers didn't want to listen to him on how they should implement it. It's not that what he says is without any merit, but he's chasing a dream in which a "true successor" to the 68060 can be built as a genuine ASIC, capable of scaling to GHz clock rates and that all this could be done at a reasonable cost (few million USD). He's been saying this for years and I honestly wish some eccentric billionaire would come and make it happen, but I don't see it in reality.
|
I do not despise 68k emulation. WinUAE makes sense where x86-64 hardware is already owned, or where an RPi user needs or wants to use some 68k Amiga software. Buying hardware specifically for 68k emulation is where I cringe at the sad state of 68k Amiga hardware and what was once an elegant design. A 68k Amiga ASIC would not require a billionaire. A professionally developed in-order superscalar 68k Amiga ASIC would likely require 3-7 million USD, which is far from billionaire territory. The Vamp AC68080@100MHz, in an FPGA for life, is not competing well against even low-end ARM emulation of the 68k. An AC68080 ASIC could be created for maybe 1-2 million USD that would clock at 1-2 GHz and turn the tables. Mistakes like the 64-bit SIMD using the integer register file, so it cannot easily be widened to a more competitive width, would then become more apparent. All the added registers that are difficult to use would not make the AC68080 competitive for high performance, and would more likely make it less competitive for embedded use, where an ASIC would have had a chance. I was in talks with a business that develops embedded ASICs about creating an AC68080 ASIC, but Gunnar proved unprofessional and chose to play with his toy instead. BoXeR, Natami and Vamp/AC had so much FPGA development effort heading in the right direction, and then unfortunately fizzled.
|
|
Status: Offline |
|
|
matthey
|  |
Re: Integrating Warp3D into my 3D engine Posted on 19-Feb-2025 0:35:23
| | [ #93 ] |
|
|
 |
Elite Member  |
Joined: 14-Mar-2007 Posts: 2518
From: Kansas | | |
|
| kolla Quote:
There's more than just one here?
|
So does THEA500 Mini have RTG chunky modes built in? Does it support 8-bit, 16-bit, 24-bit or 32-bit RTG?
Heimdall Quote:
1. Sysinfo - it is a number from sysinfo because I distinctly recall going through the hoops to upload the binary to my V4 and running it there. I also recall that the new core ran at higher clock so the Sysinfo number was higher than two years ago. If I am not mistaken, there's been another new core in last 12 months that is clocked even higher, thus the number I mentioned is likely obsolete today.
|
The SysInfo MIPS code consists of random instructions that are nothing like useful code and gain minimal benefit from superscalar pipelines. At least the DMIPS benchmark resembles real-world compiled code and can benefit from superscalar pipelines. Comparing SysInfo MIPS and DMIPS is comparing apples and oranges, which is useless.
Heimdall Quote:
3. RTG accelerators - from what I have been told, the majority support 256 colors and 24-bit color depth is basically unheard of, which is understandable given the era. But we're in 2025 now, and there's a plethora of devices like the A500 Mini, A600GS, Vampire, Minimig, PiStorm and others that don't have RAM bandwidth issues like it's 1992. It only makes sense to first get it running on modern HW. Besides, that's the only HW I can get myself anyway, as I am not paying the exorbitant eBay prices for, ehm, "museum HW"
|
Most old 68k Amiga graphics cards support RTG 16-bit and 24-bit or 32-bit modes, but they are slow in these modes, and 1-4 MiB of graphics board memory is typical. Many users use RTG 8-bit or 16-bit modes, which are faster and leave more graphics board memory free. Most hardware 3D boards used 16-bit modes for 3D as that was practical at the time.
It is not just memory that has come a long way. The number of transistors on a chip now allows small amounts of memory, like the 2 MiB of AGA chip memory or an RTG buffer, to use SRAM, which has better performance than any DRAM. SRAM on slightly older silicon may be able to outperform modern graphics cards using the latest DRAM in some areas. High-performance cards usually have large GPU caches made of higher-performance SRAM thanks to die shrinks, though. SRAM scaling is slowing down, with the benefits greatly reduced on the latest process nodes.
Heimdall Quote:
4. A500 Mini owners - I noticed there's not many around here. Besides, the target HW here by definition reaches out to people who don't want to spend a month configuring it. It's a plug and play box, after all.
|
I know several THEA500 Mini owners on this forum. They are just not answering your question of whether THEA500 Mini supports RTG modes.
Last edited by matthey on 19-Feb-2025 at 12:37 AM.
|
|
Status: Offline |
|
|
Karlos
|  |
Re: Integrating Warp3D into my 3D engine Posted on 19-Feb-2025 9:59:01
| | [ #94 ] |
|
|
 |
Elite Member  |
Joined: 24-Aug-2003 Posts: 4907
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition! | | |
|
| @matthey
Write your preferred worst case instruction sequences for 68K under emulation and benchmark them on real 68K and under Emu68, Petunia etc.
Otherwise, what you are writing remains a hypothetical worst case and not a practical one.
_________________ Doing stupid things for fun... |
|
Status: Offline |
|
|
Heimdall
|  |
Re: Integrating Warp3D into my 3D engine Posted on 19-Feb-2025 12:28:19
| | [ #95 ] |
|
|
 |
Member  |
Joined: 20-Jan-2025 Posts: 99
From: North Dakota | | |
|
| Quote:
Oh, wow! That is a fantastic benchmark! I never saw any numbers from those systems before.
Are you in contact with those testers? Because I have a separate benchmark executable for my engine that lets you select the resolution via a requester (all the way up to 1920x1080) and executes a set of synthetic and rendering tests.
Last year I was trying to get it tested on the V4 Discord, hoping someone there would at least have a PiStorm, but nobody did. |
|
Status: Offline |
|
|
Heimdall
|  |
Re: Integrating Warp3D into my 3D engine Posted on 19-Feb-2025 12:40:27
| | [ #96 ] |
|
|
 |
Member  |
Joined: 20-Jan-2025 Posts: 99
From: North Dakota | | |
|
| Quote:
@matthey wrote:
Heimdall has likely not heard my explanation of the problem of load-to-use stalls for 68k to ARM code translation. It is instruction scheduling 101, which he should already understand as a RISC assembly coder. | Sorry, I only have time right now to react to this paragraph. But you're correct. I understand instruction scheduling from the Jaguar's RISC, because I spent several years working with its GPU/DSP RISCs and have gathered plenty of benchmark data on how stalls affect performance. It's not pretty.
So, while in theory you can get 1 cycle/op (instead of the typical 3 cycles/op) if there are no pipeline bubbles, in reality it takes a lot of effort to refactor the code to even get to 1.5 c/op. An unoptimized first draft is often worse than ~2.2 c/op.
I only had a few hand-optimized inner loops that were approaching ~1.2 c/op. They're unmaintainable after a week despite the comments. Might as well toss them at that point.
I already mentioned it here, but on the Jaguar the RAM latency was brutal. Loading a value from RAM (outside of the 4 KB RISC cache) reduced instruction throughput drastically. I recall that a tight 4-op loop couldn't write #0 to RAM more than 19,000 times within one frame.
I'd be curious what that latency is on modern RISC cores, like the one in the A500 Mini.
It's probably even mentioned in this thread, I just need to spend more time on it, but I am traveling across the pond and am without internet most of the time these few days... |
|
Status: Offline |
|
|
MagicSN
|  |
Re: Integrating Warp3D into my 3D engine Posted on 19-Feb-2025 13:08:37
| | [ #97 ] |
|
|
 |
Hyperion  |
Joined: 10-Mar-2003 Posts: 764
From: Unknown | | |
|
| @Heimdall
Most of the testers are the beta testers from my beta-testing team for upcoming games, so yes, I am in contact with most of them.
As for PiStorm, I have a lot of users on my beta tester list (after AmigaOS 4 systems, it is the most common hardware in my beta tester team). Myself, I also have (in addition to a PiStorm CM4) a Pi5 + AmiKit system. |
|
Status: Offline |
|
|
MagicSN
|  |
Re: Integrating Warp3D into my 3D engine Posted on 19-Feb-2025 13:12:24
| | [ #98 ] |
|
|
 |
Hyperion  |
Joined: 10-Mar-2003 Posts: 764
From: Unknown | | |
|
| @matthey
As to a 1 GHz ASIC 080, I am unsure if it would really be that big a deal.
If we assume linear speedup (and it will probably be less than linear?), a 1 GHz 080 might get low-res H2 to 80 fps (10x the current fps). The PiStorm CM4 already gets 50 fps there. So while this is a bit faster than the CM4, it is not THAT much faster. And the Pi5 is (even with Amiberry) still 2x faster than the PiStorm CM4. And when I read discussions of Apollo plans I always read about 500-600 MHz, not 1 GHz. A 500 MHz 080 would probably not even reach the speed of the PiStorm CM4, though it would of course enable playing games like Heretic 2 on it.
(Using H2 as Benchmark here)
A 2 GHz system of course would be totally different; that would be faster than everything else, including the X5000... Last edited by MagicSN on 19-Feb-2025 at 01:18 PM.
|
|
Status: Offline |
|
|
Karlos
|  |
Re: Integrating Warp3D into my 3D engine Posted on 19-Feb-2025 13:21:57
| | [ #99 ] |
|
|
 |
Elite Member  |
Joined: 24-Aug-2003 Posts: 4907
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition! | | |
|
| @MagicSN
| It's irrelevant. The PiStorm could run at 1/10th its current performance and it would still be infinitely faster than an ASIC that doesn't exist.
Matthey's objections pit the performance of a real solution against that of an imaginary one. You only have to look at real-world application benchmarks to see that the PiStorm runs extremely well. The LightWave tests alone ought to be full of many of the worst-case code examples he likes to use to explain why emulation is bad: it's full of random memory accesses and branching, because that's the very nature of software ray tracing at its most fundamental level. Last edited by Karlos on 19-Feb-2025 at 01:26 PM.
_________________ Doing stupid things for fun... |
|
Status: Offline |
|
|
kolla
|  |
Re: Integrating Warp3D into my 3D engine Posted on 19-Feb-2025 14:59:11
| | [ #100 ] |
|
|
 |
Elite Member  |
Joined: 20-Aug-2003 Posts: 3373
From: Trondheim, Norway | | |
|
| @matthey
Quote:
So does THEA500 Mini have RTG chunk modes built in? Does it support 8-bit, 16-bit, 24-bit or 32-bit RTG? |
| I'm not sure what you are asking - chunky modes are what's native on the Linux/ARM system that THEA500 is. If you're asking whether it ships with P96 pre-installed and with software that uses RTG, the answer is no. But nothing prevents the owner/user from installing full AmigaOS with P96 and whatever, and even bringing the system online. At that point it's the owner who's responsible for the device, though._________________ B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC |
|
Status: Offline |
|
|