Amigaworld.net - The Amiga Computer Community Portal Website

Packed Versus Planar: FIGHT


Karlos

Re: Packed Versus Planar: FIGHT
Posted on 6-Oct-2022 9:57:44

[ #441 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4491
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@Gunnar

Quote:
Will you come at A37 next week?
Worlds biggest Amiga event?

We can look at the number together there ;)

Forgetting the rest of this silly exchange we've been having, with heartfelt honesty, I'd absolutely love to! But sadly I won't be able to. Hopefully there'll be plenty of footage recorded though. Are there any plans to livestream?

_________________
Doing stupid things for fun...

Status: Offline

Karlos

Re: Packed Versus Planar: FIGHT
Posted on 6-Oct-2022 10:03:00

[ #442 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4491
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@cdimauro

Quote:

cdimauro wrote:
@Karlos

Quote:

Karlos wrote:
@Gunnar

Those are bold claims. Your workloads must be very selective. Alpha blending is a good example though. Suppose I have two large (e.g. 1080p) arrays of ARGB 32-bit pixels and I want to alpha blend buffer B onto buffer A using B's alpha channel.

Are you claiming the 68080, at its normal clock rate, using AMMX will complete this in less time than a 1GHz PPC using AltiVec instructions to perform this task?

But those are still synthetic benchmarks.

As I've said before:

It's important to test an entire application / game and not just single routines.

Sure, it's synthetic and in all probability would be memory bound. It would still be interesting to see though.
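
For anyone following along, the blend in question can be sketched in plain C. This is only a scalar reference under assumptions that are not stated in the thread: non-premultiplied ARGB with alpha in the top byte, buffer A keeping its own alpha, and the hypothetical names `blend_channel` and `blend_argb_over`. It is not the AMMX or AltiVec code either side would actually benchmark.

```c
#include <stddef.h>
#include <stdint.h>

/* Blend one 8-bit channel: round((src*a + dst*(255-a)) / 255). */
static uint32_t blend_channel(uint32_t src, uint32_t dst, uint32_t a)
{
    uint32_t t = src * a + dst * (255u - a) + 128u;
    return (t + (t >> 8)) >> 8;  /* division by 255 without a divide */
}

/* Alpha blend buffer B (src) onto buffer A (dst) using B's alpha. */
void blend_argb_over(uint32_t *dst, const uint32_t *src, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint32_t s = src[i];                 /* read B */
        uint32_t d = dst[i];                 /* read A */
        uint32_t a = s >> 24;                /* B's alpha, 0..255 */
        uint32_t r = blend_channel((s >> 16) & 0xFFu, (d >> 16) & 0xFFu, a);
        uint32_t g = blend_channel((s >> 8) & 0xFFu, (d >> 8) & 0xFFu, a);
        uint32_t b = blend_channel(s & 0xFFu, d & 0xFFu, a);
        /* write A; A's own alpha byte is kept (an assumption) */
        dst[i] = (d & 0xFF000000u) | (r << 16) | (g << 8) | b;
    }
}
```

Each pixel costs two reads and one write, so for 1080p buffers the loop moves roughly 24 MB per pass, which is why the result is likely to be memory bound.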

_________________
Doing stupid things for fun...

Status: Offline

Gunnar

Re: Packed Versus Planar: FIGHT
Posted on 6-Oct-2022 11:04:40

[ #443 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

@Karlos

Quote:

Forgetting the rest of this silly exchange we've been having, with heartfelt honesty, I'd absolutely love to! But sadly I won't be able to. Hopefully there'll be plenty of footage recorded though. Are there any plans to livestream?

Where are you from?

Regarding a live stream, I don't know.
In past years a TV channel even came and did interviews and broadcasts.

The event is really nice.

Status: Offline

cdimauro

Re: Packed Versus Planar: FIGHT
Posted on 6-Oct-2022 20:19:24

[ #444 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3760
From: Germany

@Gunnar

Quote:

Gunnar wrote:
@cdimauro

Dear Cesare Di Mauro,

Quote:

Let me report it here again:
But when you talk about performance and say that AMMX is faster than a PowerPC, then you're lying.

As the average Joe can understand, I was GENERICALLY talking about the PERFORMANCE of AMMX vs PowerPC.

So, how short and simple some specific routine is... does NOT matter.

It does NOT matter that it's simpler to code.

It does NOT matter if it needs fewer clock cycles (it's called "efficiency", and I've ALREADY talked about it before).

And finally it does NOT matter if it needs less bandwidth (again, it's called "efficiency", and I've ALREADY talked about it before).

Everything that you reported is NOT pertinent to my GENERAL statement. But I'll add something more at the bottom to further clarify it.

Same as above: it does NOT matter.

This is about EFFICIENCY whereas I clearly talked about PERFORMANCE in the specific statement that you quoted.

We can explain this to you.
I will give some real examples to help you understand what we are talking about here.

Let's have a look at real games and how they are coded.

Let's look at the game DIABLO,
Let's look at Command and Conquer,
Let's look at the new RTG version of SONIC the Hedgehog,
Let's look at ROBIN HOOD (MorphOS game),
Let's look at 194x Deluxe,
Let's look at Apollo Invader and Apollo-X.

All of these are RTG games for Amiga that came out in the last few years.
They are real-world examples of games that came out for Amiga.

Diablo uses 256 colors, the other games use 16-bit or even truecolor.
Sonic is 320 resolution, the others use 640 or even higher resolutions (800/960).

Some of the games are based on SDL ports of PC games.
For example ROBIN HOOD, which I ported to MorphOS, is based on SDL.
But SONIC, DIABLO and COMMAND & CONQUER are also based on SDL PC versions.

Some games like SONIC, 194x and Apollo-X use dual or multi-playfield effects.
Dual/multi-playfield on RTG is done by copying the playfields on top of each other.
This means that the playfields are coded like "huge" soft sprites.

ROBIN HOOD uses many houses and castles, or animated elements like windmills.

Most of the games have hundreds of sprites on the screen.
Some of the games, like Apollo-Invader, Robin-Hood, Apollo-X and 194X, use alpha blending and light effects.

The internal coding of these games is similar in many ways.
All these games are coded by using the CPU to create the screen.
This is the normal coding style for PC ports, for 2D SDL games and for Amiga RTG games.

To make these games run you need enough CPU power, and you need to copy a lot of memory.
In other words, the performance of all of these games is limited by memory performance.
This is also the reason why ROBIN HOOD was slower on PEGASOS 1 compared to PEGASOS 2, as the bus was slower on the Pegasos 1.

You see this also when playing DIABLO on AmigaOne or Pegasos and comparing it with DIABLO on Vampire. The game runs much faster on Vampire.
This is because of 3 reasons:
a) The memory is faster on the Vampire
b) The G3 and G4 PowerPC CPUs do not support automatic memory prefetching, but the 68080 CPU does
c) You need 3 memory accesses on the PowerPC, while only 2 memory accesses on the 68080 with AMMX, for doing GFX combines.

Those are SOME examples and ALL GAMES. What do you want to prove with your examples?

They are particular cases. Whereas my statement was GENERAL.
Quote:
Let us explain this with numbers:

The memory on the Vampire gives you ballpark 600 MB/sec.

For the AmigaOne I need someone to run Stream/Minibench again for exact numbers,
but from memory I think we are talking about 100-150 MB/sec.
It would be nice if someone could run minibench or BUSTEST and correct me if I recall incorrectly.

For the sake of argument let's say 150 MB/sec.
As you can clearly see, the Vampire starts with a 4 times memory performance advantage:
600 MB/sec versus 150 MB/sec.
This is a huge difference, and the main reason why these types of games run faster on the Vampire.

150MB/s for the AmigaOnes looks too low. Even ridiculous, since they can mount 133MHz SDRAMs, which means >1GB/s of available bandwidth.
Quote:
Now on the POWERPC you need 3 memory operations to process a screen element.
This means that to do things like rendering a sprite, a bullet, an animated background piece or a dual playfield, you always need 3 memory accesses.

You're using the SDRAM's Data Mask. Nice trick.

However it works only on 8-bit / 256 colors games.

With 16 and 32-bit graphics this trick isn't possible, because STOREM is available only for byte stores (no 16-bit and 32-bit versions of the instruction are available, AFAIR).

Specifically, it doesn't work when performing the alpha-blending operation.
Quote:
This means you have to divide your memory bandwidth by 3:
150 MB/sec / 3 == 50 MB/sec
At the end of the day the AmigaOne has a Sprite/Render/Playfield/Background construct performance of 50 MB/sec.

You can do good games with 50 MB/sec performance.
But it limits you a lot in possible resolutions, number of sprites and frame rate.

Yes, but as I've said above, it looks too low compared to the available bandwidth.
Quote:
Let's look at the Vampire.

Because AMMX is more efficient you can render a sprite, a bullet, an animated background piece or a dual playfield with 2 memory accesses.

This means you have to divide your memory bandwidth only by 2:
600 MB/sec / 2 == 300 MB/sec

At the end of the day the Vampire has a Sprite/Render/Playfield/Background construct performance of 300 MB/sec.

As you see, this comparison started with a 4:1 memory performance advantage,
and with the higher efficiency of AMMX over ALTIVEC it ends with a total 6:1 performance advantage.

Yes, IF the AmigaOnes have such ridiculous memory bandwidth.
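
The arithmetic in the quoted estimate can be written out explicitly. The sketch below takes the quoted 600 MB/sec and the assumed 150 MB/sec at face value (neither is a measurement made here); `effective_mb_s` and `print_estimate` are hypothetical helper names.

```c
#include <stdio.h>

/* Raw memory bandwidth divided by the number of memory accesses each
   drawn element costs gives the usable drawing throughput. */
static double effective_mb_s(double raw_mb_s, int accesses_per_element)
{
    return raw_mb_s / accesses_per_element;
}

/* Print the two estimates from the quoted post and their ratio. */
void print_estimate(void)
{
    double amigaone = effective_mb_s(150.0, 3);  /* assumed figure, 3 accesses */
    double vampire  = effective_mb_s(600.0, 2);  /* quoted figure, 2 accesses */
    printf("AmigaOne: %.0f MB/sec\n", amigaone);
    printf("Vampire:  %.0f MB/sec\n", vampire);
    printf("Ratio:    %.0f:1\n", vampire / amigaone);
}
```

The 6:1 result depends entirely on the two raw bandwidth inputs, which is the point under dispute.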
Quote:
Lets clarify a few things:
a) We all love PowerPC.
I did design parts of some of the best PowerPC cores.
I wrote tons of Altivec performance code for IBM. And I ported and wrote software for MorphOS.

b) The above games are typical examples of RTG games that came to Amiga systems in the last years. There are also other games of course. But I picked the above games because I wrote them or ported them myself or was involved in it, so I know for all of these games perfectly how they are working.

Let's clarify (again) one thing: those are SOME, SPECIFIC, cases, whereas my statement was GENERAL.

Status: Offline

kolla

Re: Packed Versus Planar: FIGHT
Posted on 6-Oct-2022 21:42:18

[ #445 ]

Elite Member

Joined: 21-Aug-2003
Posts: 3072
From: Trondheim, Norway

I have both G4 systems with altivec and an actual Vampire system.

_________________
B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC

Status: Offline

cdimauro

Re: Packed Versus Planar: FIGHT
Posted on 7-Oct-2022 5:35:19

[ #446 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3760
From: Germany

@Hypex

Quote:

Hypex wrote:
@cdimauro

Quote:
So, you still haven't the original version.

No, but this one is on Aminet:
http://aminet.net/package/demo/mega/StateOfTheArt

That's the same one that I have.
Quote:
Quote:
I think it was a typo from Skid Row: it should have been "WITH" (not WITHOUT) 680x0.

Yes. When x != 0.

Indeed.

Status: Offline

Gunnar

Re: Packed Versus Planar: FIGHT
Posted on 7-Oct-2022 7:41:46

[ #447 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

Dear Cesare Di Mauro,

Quote:

Those are SOME examples and ALL GAMES. What do you want to prove with your examples?

I explained with examples how these games create their game graphics.
This showed why memory performance is very important for many games.
Many people don't know how games are coded, so it was useful to explain this.

I also explained why AMMX greatly improves how much you can draw with the available memory bandwidth of your system.
The reason is that on PowerPC and other CPUs you always need 3 memory operations for common drawing techniques. AMMX allows you to do the same with only 2 memory operations.
This means AMMX gives you a 50% advantage here.
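
The three-operation pattern looks roughly like this in portable C when several pixels are handled per word, as a SIMD routine would. This is a minimal sketch for an 8-bit colour-keyed sprite; the key value 0 and the names `nonzero_byte_mask` and `blit_keyed_8` are illustrative assumptions, not code from any of the games mentioned.

```c
#include <stddef.h>
#include <stdint.h>

/* Build a mask with 0xFF in every byte lane where the sprite pixel
   is not the transparent key (assumed to be 0). */
static uint32_t nonzero_byte_mask(uint32_t w)
{
    uint32_t mask = 0;
    for (int b = 0; b < 4; b++) {
        if ((w >> (8 * b)) & 0xFFu)
            mask |= 0xFFu << (8 * b);
    }
    return mask;
}

/* Draw a colour-keyed 8-bit sprite, four pixels per 32-bit word. */
void blit_keyed_8(uint32_t *dst, const uint32_t *src, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        uint32_t s = src[i];              /* access 1: read sprite */
        uint32_t d = dst[i];              /* access 2: read background */
        uint32_t m = nonzero_byte_mask(s);
        dst[i] = (s & m) | (d & ~m);      /* access 3: write merged word */
    }
}
```

A masked store, which is how the thread describes STOREM, would write only the selected lanes and so make the background read (access 2) unnecessary. No AMMX code is shown because its syntax is not given in the thread.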

Quote:

Quote:

For the sake of argument let's say 150 MB/sec memory performance for the AmigaONE XE.
As you can clearly see, the Vampire starts with a 4 times memory performance advantage:
600 MB/sec versus 150 MB/sec.
This is a huge difference, and the main reason why these types of games run faster on the Vampire.

150MB/s for the AmigaOnes looks too low. Even ridiculous, since they can mount 133MHz SDRAMs, which means >1GB/s of available bandwidth.

Yes, the AmigaOne XE does significantly underperform in memory performance.
This is a well-known fact.

The PowerPC G3 and G4 CPUs cannot do automatic memory prefetches,
which is why they are generally never good in normal memory performance.
The G2 of the Efika and the 4xx CPU of the SAM also cannot do memory prefetches.
They are also very poor in memory performance.

The G5 970 IBM PowerPC is significantly better than all the others in memory performance.
The G5 does do automatic memory prefetches.
The APOLLO 68080 also does automatic prefetches and has automatic stream detection.
This is the reason why the APOLLO 68080 is better in memory performance than G3/G4 and 440/460 systems.
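
Where the hardware does not prefetch by itself, software can issue prefetch hints instead. The sketch below is a generic illustration using the GCC/Clang builtin `__builtin_prefetch`; the function name `sum_with_prefetch` and the look-ahead distance of 64 elements are assumptions chosen for illustration, not tuned values for any of the CPUs discussed.

```c
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_AHEAD 64  /* elements to look ahead; an untuned guess */

/* Sum an array while hinting the cache about data needed soon. */
uint64_t sum_with_prefetch(const uint32_t *data, size_t count)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < count; i++) {
#if defined(__GNUC__)
        /* Issue one hint every 8 elements (32 bytes), read-only,
           low temporal locality. The hint never changes the result. */
        if ((i % 8) == 0 && i + PREFETCH_AHEAD < count)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 0);
#endif
        sum += data[i];
    }
    return sum;
}
```

Whether this helps, and by how much, depends on the CPU and memory controller; it has to be measured per machine.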

BTW, the AmigaOne XE also underperforms in bus performance. The AmigaOne XE has an AGP2 graphics port which in "theory" could reach good speed, but the real bus performance of this port on the AmigaOne is low.

Quote:

Quote:

Now on the POWERPC you need 3 memory operations to process a screen element.
This means that to do things like rendering a sprite, a bullet, an animated background piece or a dual playfield, you always need 3 memory accesses.

You're using the SDRAM's Data Mask. Nice trick.

However it works only on 8-bit / 256 colors games.

You are wrong.

The 50% efficiency boost of AMMX works in any GFX format.
The Apollo 68080 CPU can use this in 8-bit/256 color mode, in 15-bit mode, in 16-bit mode and also in the truecolor modes.

The games that I posted use all of these modes.
Some games use 8-bit, some are 16-bit, some are truecolor. Some games even mix modes.
All of the games benefit greatly from the AMMX memory efficiency boost.

Quote:

With 16 and 32-bit graphics this trick isn't possible, because STOREM is available only for byte stores (no 16-bit and 32-bit versions of the instruction are available, AFAIR).

Your claim is wrong.

You have never coded AMMX, and you have not understood the documentation.
Help us understand WHY you always make claims if you don't know anything about this topic?

Quote:

Quote:

b) The above games are typical examples of RTG games that came to Amiga systems in the last years. There are also other games of course. But I picked the above games because I wrote them or ported them myself or was involved in it, so I know for all of these games perfectly how they are working.

Let's clarify (again) one thing: those are SOME, SPECIFIC, cases, whereas my statement was GENERAL.

The general problem with many of your statements is that you speak without having a clue.
And then you talk nonsense.

Above we have several examples of this:
You did not know about the memory performance problems of the AmigaOne.
You did not know that the G4 PowerPC cannot do automatic memory prefetches and therefore has a general performance problem, also on the Mac and other systems.
You did not know that AMMX can use the 50% memory performance advantage in any graphics mode.
On the TINA project you did not know that the FPGA cannot use a 128-bit memory bus,
and you did not know that the FPGA cannot reach a 400MHz clock rate (in fact it struggles to reach 200).

There is nothing wrong with not knowing stuff. No one can know everything.
What is not good is making false claims if you do not know the topic.

No one expects people here to have experience and knowledge of how games are typically coded.
And this is OK.
To explain how games are often coded, I gave some examples showing this.

Status: Offline

Gunnar

Re: Packed Versus Planar: FIGHT
Posted on 7-Oct-2022 7:47:20

[ #448 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

@kolla

Quote:

kolla wrote:
I have both G4 systems with altivec and an actual Vampire system.

Maybe you can help Cesare with some numbers then?

Maybe you can run some memory benchmark on your AmigaOne XE?
Like bustest, stream, minibench?

Please also state with the results what version your XE is, which CPU model it has and which clock rate it runs at.

Thanks

Status: Offline

Gunnar

Re: Packed Versus Planar: FIGHT
Posted on 7-Oct-2022 8:09:30

[ #449 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

Dear Cesare Di Mauro,

please be so kind and stop pretending you know stuff if you have no knowledge in this area.

On the AmigaOne memory problem topic your post was:
"The low performance can't be correct, as there is 133 printed on the memory stick"

I think the Americans call this "talking out of your ass"?
Why did you not ask someone to run "BUSTEST" for you to give you more information?

In the AMMX memory performance efficiency discussion your post was:
"STOREM only supports 8bit mode, this will not work for 16bit or truecolor!"
Your claim was wrong.
Why could you not ask "I see that STOREM does improve 8-bit, but does it also support 16-bit and 32-bit modes?"

Cesare, some of your posts are clever and good.
But your habit of making false claims without knowing does not help anyone, really!
Could you maybe please consider changing this habit?
What do you think?

Status: Offline

kolla

Re: Packed Versus Planar: FIGHT
Posted on 7-Oct-2022 9:05:50

[ #450 ]

Elite Member

Joined: 21-Aug-2003
Posts: 3072
From: Trondheim, Norway

@Gunnar

Quote:

Gunnar wrote:
@kolla

Quote:

kolla wrote:
I have both G4 systems with altivec and an actual Vampire system.

Maybe you can help Cesare with some numbers then?

Maybe you can run some memory benchmark on your AmigaOne XE?
Like bustest, stream, minibench?

Please also state with the results what version your XE is, which CPU model it has and which clock rate it runs at.

Thanks

Vampire:


1.Minne:> bustest SIZE 16m FAST
BusSpeedTest 0.19 (mlelstv)   Buffer:   16777216 Bytes, Alignment: 32768
========================================================================
memtype   addr       op         cycle     calib         bandwidth
fast      $08EC8000  readw      13.2 ns   normal     151.0 * 10^6 byte/s
fast      $08EC8000  readl      16.3 ns   normal     244.8 * 10^6 byte/s
fast      $08EC8000  readm      19.7 ns   normal     202.9 * 10^6 byte/s
fast      $08EC8000  writew     27.2 ns   normal      73.6 * 10^6 byte/s
fast      $08EC8000  writel     13.7 ns   normal     292.0 * 10^6 byte/s
fast      $08EC8000  writem     26.8 ns   normal     149.4 * 10^6 byte/s

PowerMac10,1 (287 - Mac mini), 7447A @ 1410 MHz:


----------------------------------------------------------------
STREAM Memory Benchmark v0.3
Gunnar von Boehn
----------------------------------------------------------------
The Test will run some minutes please be patient.
Total memory required = 32.0 MB.
Each test is run 3 times, but only the *best* time for each is used.
----------------------------------------------------------------


Memory throughput Working on Arrays of 16 MB.
----------------------------------------------------------------
Read test (summing up the array).
----------------------------------------------------------------
Function      Rate (MB/s)     Avg time     Min time     Max time
read 8           615.3605       1.3004       1.3001       1.3007
read 32         1846.3039       0.4334       0.4333       0.4335
read 64         2395.3678       0.3342       0.3340       0.3343
read 32x2       2017.0087       0.3967       0.3966       0.3968
read 32x4       2145.0935       0.3729       0.3729       0.3730
read 32 CP3     7854.4467       0.1019       0.1019       0.1020
read 32 CP4     9005.5561       0.0889       0.0888       0.0890
read 32 CP5  * 10557.5513       0.0758       0.0758       0.0759
read 32 CP6    10477.6398       0.0764       0.0764       0.0765
read 32x4 CP3   8709.1708       0.0919       0.0919       0.0920
read 32x4 CP4  10168.0097       0.0787       0.0787       0.0788
read 32x4 CP5  10900.5250       0.0734       0.0734       0.0735
read 32x4 CP6  10416.2640       0.0768       0.0768       0.0769
----------------------------------------------------------------
Write test (setting array A).
----------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
write 8          672.8457       1.1892       1.1890       1.1895
write 32        1859.6173       0.4710       0.4302       0.5117
write 64        2719.3868       0.2944       0.2942       0.2946
write 32x2      2126.3414       0.3763       0.3762       0.3763
write 32x4      2264.1714       0.3534       0.3533       0.3535
memset 750   *  7871.3796       0.1017       0.1016       0.1017
memset 750 0    7895.4763       0.1014       0.1013       0.1014
libmoto memset  7831.1660       0.1024       0.1022       0.1027
glibc memset    4566.6517       0.1752       0.1752       0.1752
glibc memset0   7926.2126       0.1010       0.1009       0.1010
----------------------------------------------------------------
ACompare test (comparing the source and destination arrays).
----------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
cmp 8            971.4780       1.6470       1.6470       1.6470
cmp 32          2839.0042       0.5640       0.5636       0.5645
cmp 64          3602.7040       0.4629       0.4441       0.4816
cmp 32x2        3090.6772       0.5181       0.5177       0.5185
cmp 32x4        3277.2513       0.4889       0.4882       0.4896
cmp 32 CP2      7448.5126       0.2152       0.2148       0.2155
cmp 32 CP3   * 10228.4817       0.1566       0.1564       0.1568
cmp 32 CP4     10071.1889       0.1591       0.1589       0.1592
cmp 32 CP5     10278.4092       0.1559       0.1557       0.1561
cmp 32 CP6      9889.4712       0.1620       0.1618       0.1621
cmp 32x4 CP2   10295.8656       0.1559       0.1554       0.1564
cmp 32x4 CP3   10797.1741       0.1482       0.1482       0.1483
cmp 32x4 CP4   10368.5308       0.1544       0.1543       0.1546
cmp 32x4 CP5   10650.0368       0.1504       0.1502       0.1506
cmp 32x4 CP6   10132.5458       0.1579       0.1579       0.1579
libmoto memcmp 419430400.0000       0.1959       0.0000       0.3918
glibc memcmp   12647.1357       0.1582       0.1265       0.1898
----------------------------------------------------------------
Copy test (copying array A -> B).
----------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
copy 8          1050.2341       1.5236       1.5235       1.5238
copy 32         2569.0632       0.6228       0.6228       0.6228
copy 64         3429.5173       0.4672       0.4665       0.4678
copy 32x2       2306.4525       0.6939       0.6937       0.6940
copy 32x4       2284.5656       0.7005       0.7004       0.7007
copy 32 CP2     6425.8055       0.2491       0.2490       0.2491
copy 32 CP3     7962.4195       0.2010       0.2009       0.2011
copy 32 CP4  *  8337.8099       0.1920       0.1919       0.1922
copy 32 CP5     8559.1163       0.1871       0.1869       0.1872
copy 32x4 CP2   7647.2719       0.2093       0.2092       0.2093
copy 32x4 CP3   7997.4812       0.2003       0.2001       0.2005
copy 32x4 CP4   8376.9640       0.1910       0.1910       0.1911
copy 32x4 CP5   8350.5504       0.1918       0.1916       0.1919
copy 64x4 CP4   8356.0507       0.1915       0.1915       0.1915
copy 64x4 CP4C  8723.2878       0.1834       0.1834       0.1835
glibc memcpy    5139.4565       0.3114       0.3113       0.3116
bmove512        5116.5921       0.3129       0.3127       0.3132
FC64            8247.9293       0.1941       0.1940       0.1943
libmoto memcpy  8660.4487       0.1848       0.1847       0.1849
memcpy 750      6418.2035       0.2495       0.2493       0.2496
----------------------------------------------------------------

_________________
B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC

Status: Offline

Gunnar

Re: Packed Versus Planar: FIGHT
Posted on 7-Oct-2022 9:45:34

[ #451 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

@kolla

Quote:

Vampire:

1.Minne:> bustest SIZE 16m FAST
BusSpeedTest 0.19 (mlelstv)   Buffer:   16777216 Bytes, Alignment: 32768
========================================================================
memtype   addr       op         cycle     calib         bandwidth
fast      $08EC8000  readw      13.2 ns   normal     151.0 * 10^6 byte/s
fast      $08EC8000  readl      16.3 ns   normal     244.8 * 10^6 byte/s
fast      $08EC8000  readm      19.7 ns   normal     202.9 * 10^6 byte/s
fast      $08EC8000  writew     27.2 ns   normal      73.6 * 10^6 byte/s
fast      $08EC8000  writel     13.7 ns   normal     292.0 * 10^6 byte/s
fast      $08EC8000  writem     26.8 ns   normal     149.4 * 10^6 byte/s

Your core seems to be outdated by some years; maybe you should update it.

For comparison, this is what I get here:
Quote:

fast      $02CB8000  writel      7.2 ns   normal     556.4 * 10^6 byte/s
fast      $02CB8000  writem      7.2 ns   normal     555.8 * 10^6 byte/s

Note that both values are not only higher than yours but also the same speed as each other.

Keep in mind that the display DMA also eats bandwidth.
This looks like a disadvantage for the Vampire and an advantage for the AmigaONE XE, but in reality it is the opposite: the Super-AGA display reading directly from fast memory is a major advantage for these games.

Let me help you understand why:

These 2D games typically compose their screen inside fast memory.
Then the game copies the frame to the GFX card.
This copy needs to read the frame from fast memory and then write it over the bus to the GFX card, which takes significant time.
It takes more time than the display DMA eats when just reading from fast memory.
Everyone will understand that read + write takes longer than just read.

So far we have only spoken about the "compose" time. We did not mention that the AmigaONE XE needs extra time to copy the frame too, while the Vampire does not need to copy to display.

I assume you ran this in 1280x720 32-bit mode? Could this be?
This means you measured, on an old slower core, the bandwidth left over after the screen "display" was done.

Please note that the AmigaONE XE does not even have the power to just copy the screen in this resolution to the GFX card fluently (50/60Hz).
The AmigaOne can never run such games fluently in this resolution.

The Vampire is also not designed for 1280x720 32-bit action games at 50Hz.
This is clear. The Vampire has more power than the AmigaOne, but not enough for fast action games in this resolution.

If you want to make a reasonable comparison then set your display to a mode used by the games.
Then you will see what bandwidth is available for composing.
SONIC uses 320x200 16-bit and DIABLO uses 640x480 8-bit, for example.

The numbers from your Mac are obviously cache values and not memory speed.
But you probably saw this yourself.

Please note that STREAM runs the test twice and gives 2 results.
The first run measures memory speed; the second run is "forced" to fit in the cache size, to measure cache performance.

So please make sure to post the 1st result and the 2nd block!

But maybe this was just a wrong EXE with only the cache test enabled.
The EXEs were built 16 years ago, when I created Linux memcpy optimizations for POWER. But the official website for this does not even exist anymore. I just grabbed an old EXE from some old folder without being able to check it, as I have no PPC running anymore.

Sorry if the EXE only ran one test; then this was not the right version. Maybe you can run BUSTEST on a real AMIGAONE XE?
Or run minibench?
http://apollo-core.com/minibench/index.htm?page=downloads

Or download and compile STREAM from here:
https://www.cs.virginia.edu/stream/
Or run minibench?
http://apollo-core.com/minibench/index.htm?page=downloads

Or download and compile STREAM from here:
https://www.cs.virginia.edu/stream/

Last edited by Gunnar on 07-Oct-2022 at 11:39 AM.
Last edited by Gunnar on 07-Oct-2022 at 09:52 AM.
Last edited by Gunnar on 07-Oct-2022 at 09:46 AM.

Status: Offline

Gunnar

Re: Packed Versus Planar: FIGHT
Posted on 7-Oct-2022 11:08:42

[ #452 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

Meanwhile I found and reinstalled an old website backup, so we have the real numbers now.
Let's have a look at them:

Quote:

PEGASOS 1 600MHz glibc/memcpy 110 MB/sec

EFIKA 400MHz glibc/memcpy 175 MB/sec

AMIGA ONE 800MHz glibc/memcpy 178 MB/sec

AMIGA ONE 933MHz glibc/memcpy 177 MB/sec

PEGASOS 2 1000MHz glibc/memcpy 180 MB/sec (Original firmware)

PEGASOS 2 1000MHz glibc/memcpy 368 MB/sec (Firmware with tuned BIOS settings)

MAC G4 ibook 1420MHz glibc/memcpy 412 MB/sec

Please note that memcpy results count both READ + WRITE together.
This means a result of 110 MB/sec means that 55 MB were READ and 55 MB were WRITTEN per second.

For the AMIGA ONE this means that 180 MB/sec =
90 MB READ + 90 MB WRITE per second

Yes, 180 MB/sec is less than what the memory module could in theory deliver.
But I think we explained well enough that the G4 lacking prefetch and the bad AmigaOne memory controller both ruin the result.

What do the values mean for gaming?
Kolla, you ran the test on the Vampire in 1280x720 32-bit resolution, right?

To display this resolution the AmigaONE would need to do a memcpy of 360 MB/sec!
It would need to read 180 MB and write 180 MB to the GFX card per second.
The AmigaOne unfortunately has only 50% of the performance needed to do the copy to the GFX card.
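
The frame-copy figure can be reproduced with a few lines of arithmetic. This sketch assumes a 50 Hz refresh and 1 MB = 10^6 bytes, neither of which is stated for this particular number, although both land close to the 360 MB/sec quoted; `frame_bytes` and `copy_traffic_per_second` are hypothetical helper names.

```c
#include <stdint.h>

/* Size of one frame in bytes. */
static uint64_t frame_bytes(uint32_t width, uint32_t height,
                            uint32_t bytes_per_pixel)
{
    return (uint64_t)width * height * bytes_per_pixel;
}

/* Memory traffic of copying every frame to the graphics card:
   each frame is read once and written once, counted the same way
   as the memcpy results above (read + write together). */
static uint64_t copy_traffic_per_second(uint32_t width, uint32_t height,
                                        uint32_t bytes_per_pixel,
                                        uint32_t frames_per_second)
{
    return 2u * frame_bytes(width, height, bytes_per_pixel)
              * frames_per_second;
}
```

For 1280x720 at 4 bytes per pixel and 50 Hz this gives 368,640,000 bytes per second, roughly twice the 178 MB/sec memcpy result listed for the AmigaOne. For the 320x200 16-bit mode mentioned for SONIC it is only 12.8 MB/sec.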

Is it clear to everyone that the AmigaONE by far lacks the performance to even do the display update?
Or should I explain this better?

The Vampire can display this resolution, and the DMA needed for this was already included in Kolla's results.
And after doing the DMA for the display, the Vampire still has about 240 MB/sec read and 300 MB/sec write performance left, so it could still run some games at a lower FPS.

Is it now clear to everyone why the Vampire with only 85MHz still runs games like DIABLO faster than an OS4 machine with a 1000MHz G4?

For many games memory performance is very important.

The Vampire here has 4 key advantages:
a) Good memory speed
b) It can display the screen without needing to copy it. This is a major advantage!
c) The 68080 CPU will automatically detect memory streams and prefetch
d) AMMX allows you to use the available bandwidth more efficiently

And let's be clear. Our competition is only a battle in kindergarten.
The Vampire and all the Amiga PPC systems are "toy" systems compared to a PS5, the latest INTEL machines or IBM big servers. But we all know this.

Last edited by Gunnar on 07-Oct-2022 at 12:26 PM.
Last edited by Gunnar on 07-Oct-2022 at 11:35 AM.

Status: Offline

Karlos

Re: Packed Versus Planar: FIGHT
Posted on 7-Oct-2022 12:31:38

[ #453 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4491
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@Gunnar

These benchmarks are good for memory access bound tasks. What about just overall FPS for game X ?

Without knowing when the PPC memcpy benchmarks were made and which version of the C library they used, it's tough to draw too many conclusions.

Just the other day, I was experimenting with some memory filling operations for MC64K. The idea being that you have memset() like functions that can set 16,32 and 64 bit values. The most naive code was getting 10GB/s regardless but memset() got 26. So I made some AVX2 intrinsics versions and they still only got 10. Then, I realised I was using store and not stream. Then I got 26...
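
The store-versus-stream difference can be illustrated with a small fill routine. This is not the MC64K code, which isn't shown here. It is a sketch that swaps in the SSE2 intrinsic `_mm_stream_si128` (baseline on x86-64) in place of the AVX2 version described, falls back to a plain loop elsewhere, and uses the hypothetical name `fill32_stream`.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Fill count 32-bit words with value, using non-temporal (streaming)
   stores where SSE2 is available so the data bypasses the cache. */
void fill32_stream(uint32_t *dst, uint32_t value, size_t count)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Plain stores until the address is 16-byte aligned. */
    while (i < count && ((uintptr_t)(dst + i) & 15u) != 0)
        dst[i++] = value;

    __m128i v = _mm_set1_epi32((int)value);
    for (; i + 4 <= count; i += 4)
        _mm_stream_si128((__m128i *)(dst + i), v);  /* non-temporal store */

    _mm_sfence();  /* make the streamed data visible before returning */
#endif
    /* Tail, or the whole buffer when SSE2 is not available. */
    for (; i < count; i++)
        dst[i] = value;
}
```

A regular store first pulls the destination cache line into the cache, so a large fill pays for a read it never uses. The streaming variant avoids that, which is consistent with the jump from 10 to 26 GB/s reported above, although the exact gain depends on the machine.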

Last edited by Karlos on 07-Oct-2022 at 12:41 PM.
Last edited by Karlos on 07-Oct-2022 at 12:39 PM.

_________________
Doing stupid things for fun...

Status: Offline

Gunnar

Re: Packed Versus Planar: FIGHT
Posted on 7-Oct-2022 13:07:12

[ #454 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

Hello Karlos,

Quote:

Karlos wrote:
@Gunnar

These benchmarks are good for memory access bound tasks. What about just overall FPS for game X ?

Depends on the game and the scene.
DIABLO for example gets up to 85 frames per second on the Vampire.
I think it gets about a quarter of that framerate on the OS 4 PowerPC.
It's still playable on the OS 4 machine.

Some games like APOLLO INVADER are not available for OS 4.
APOLLO INVADER uses AMMX a lot and also uses SAGA audio mixing features, like stereo positioning of sound FX. These audio features are "free" on the Amiga Super AGA chipset.
If you want to do this on PowerPC with "AHI" then the CPU would need to recalculate the audio stream in real time. This would eat extra CPU.
I don't think that APOLLO INVADER would be playable on an AmigaOne system.

Quote:

Without knowing when the PPC memcopy benchmarks were made and which version of the C library they used, it's tough to draw too many conclusions.

The test runs several memcopy options; the GLIBC one is just one of them.

For comparison, here are the full results on the AmigaOneXE:

Copy test (copying array A -> B).
-------------------------------------------------------------
Function        Rate (MB/s)   Avg time   Min time   Max time
copy 8             160.4779     0.9982     0.9970     0.9993
copy 32            155.3301     1.0334     1.0301     1.0367
copy 64            152.3254     1.0518     1.0504     1.0531
copy 32x2          177.3958     0.9020     0.9019     0.9020
glibcb memcpy      178.1908     0.8992     0.8979     0.9005
bmove512           155.6844     1.0281     1.0277     1.0286
-------------------------------------------------------------

As you can see, the GLIBC result was one of the best.

copy 64 does 64-bit moves via the FPU; this is the way the official "stream" copy routine does it. https://www.cs.virginia.edu/stream/ref.html
So the "official" STREAM result would be 152 MB/sec for the AmigaOne.
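
For anyone who wants to reproduce numbers like these, here is a minimal sketch of how such a rate can be derived, assuming the STREAM convention of dividing the bytes moved (read plus written) by the minimum time over several runs. The names copy_fn, copy_with_memcpy, min_seconds and rate_mb_per_s are made up for illustration and are not part of the test program above. The figure of roughly 160 million bytes per run is only inferred from the table (160.4779 MB/s times 0.9970 s), not stated anywhere in the post.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical common signature so different copy routines can be timed alike. */
typedef void (*copy_fn)(void *dst, const void *src, size_t bytes);

/* Example routine under test: a thin wrapper around the C library memcpy. */
void copy_with_memcpy(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* Run fn `runs` times and return the shortest time in seconds.
 * clock() measures CPU time; it is used here only because it is portable.
 * A wall-clock timer may be a better fit on a real benchmark. */
double min_seconds(copy_fn fn, void *dst, const void *src, size_t bytes, int runs)
{
    assert(runs > 0);
    double best = 0.0;
    for (int r = 0; r < runs; ++r) {
        clock_t t0 = clock();
        fn(dst, src, bytes);
        double s = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (r == 0 || s < best)
            best = s;
    }
    return best;
}

/* MB/s with 1 MB = 10^6 bytes. bytes_moved should count bytes read
 * plus bytes written, i.e. twice the array size for a copy. */
double rate_mb_per_s(double bytes_moved, double seconds)
{
    return 1.0e-6 * bytes_moved / seconds;
}
```

With the inferred 160.0e6 bytes and the 0.9970 s minimum time of the copy 8 row, rate_mb_per_s() gives about 160.48, which is close to the 160.4779 in the table (the small gap would be explained by the time being rounded to four digits).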

BTW, did you notice the difference for the PEGASOS 2 with the tuned BIOS memory firmware settings? It makes a 100% performance difference.

Here is the source of the versions:

/* Byte-wise copy; size is in bytes. */
void copy_8(char *source, char *destination, int size) {
    int j;
    for (j = 0; j < size; j++)
        destination[j] = source[j];
}
/* 32-bit copy; size is in bytes and a 4-byte int is assumed. */
void copy_32(int *source, int *destination, int size) {
    int j;
    size = size / 4;
    for (j = 0; j < size; j++)
        destination[j] = source[j];
}
// Original stream copy
/* 64-bit copy through double values; size is in bytes. */
void copy_64(double *source, double *destination, int size) {
    int j;
    size = size / 8;
    for (j = 0; j < size; j++)
        destination[j] = source[j];
}
/* Two 32-bit loads followed by two 32-bit stores per iteration. */
void copy_32x2(int *source, int *destination, int size) {
    int a, b;
    int i;

    size = size / 8;
    for (i = 0; i < size; ++i) {
        a = *source++;
        b = *source++;
        *destination++ = a;
        *destination++ = b;
    }
}
/* Four 32-bit loads followed by four 32-bit stores per iteration. */
void copy_32x4(int *source, int *destination, int size) {
    int a, b, c, d;
    int i;

    size = size / 16;
    for (i = 0; i < size; ++i) {
        a = *source++;
        b = *source++;
        c = *source++;
        d = *source++;

        *destination++ = a;
        *destination++ = b;
        *destination++ = c;
        *destination++ = d;
    }
}
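/* Block copy with a fully unrolled loop body. length is in bytes and is
 * rounded down to a multiple of 512. The do/while body always runs at
 * least once, so callers must pass a length of at least one full block. */
void bmove512(int* to, int* from, unsigned int length)
{
    register unsigned long *f, *t, *end;

    length = (length >> 9) << 9;

    /* Cast matches the declared type of end (was a plain long*). */
    end = (unsigned long*) ((char*) from + length);

    f = (unsigned long*) from;
    t = (unsigned long*) to;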

#if defined(m88k) || defined(sparc) || defined(HAVE_LONG_LONG)
do {
t[0]=f[0]; t[1]=f[1]; t[2]=f[2]; t[3]=f[3];
t[4]=f[4]; t[5]=f[5]; t[6]=f[6]; t[7]=f[7];
t[8]=f[8]; t[9]=f[9]; t[10]=f[10]; t[11]=f[11];
t[12]=f[12]; t[13]=f[13]; t[14]=f[14]; t[15]=f[15];
t[16]=f[16]; t[17]=f[17]; t[18]=f[18]; t[19]=f[19];
t[20]=f[20]; t[21]=f[21]; t[22]=f[22]; t[23]=f[23];
t[24]=f[24]; t[25]=f[25]; t[26]=f[26]; t[27]=f[27];
t[28]=f[28]; t[29]=f[29]; t[30]=f[30]; t[31]=f[31];
t[32]=f[32]; t[33]=f[33]; t[34]=f[34]; t[35]=f[35];
t[36]=f[36]; t[37]=f[37]; t[38]=f[38]; t[39]=f[39];
t[40]=f[40]; t[41]=f[41]; t[42]=f[42]; t[43]=f[43];
t[44]=f[44]; t[45]=f[45]; t[46]=f[46]; t[47]=f[47];
t[48]=f[48]; t[49]=f[49]; t[50]=f[50]; t[51]=f[51];
t[52]=f[52]; t[53]=f[53]; t[54]=f[54]; t[55]=f[55];
t[56]=f[56]; t[57]=f[57]; t[58]=f[58]; t[59]=f[59];
t[60]=f[60]; t[61]=f[61]; t[62]=f[62]; t[63]=f[63];
#ifdef HAVE_LONG_LONG
t+=64; f+=64;
#else
t[64]=f[64]; t[65]=f[65]; t[66]=f[66]; t[67]=f[67];
t[68]=f[68]; t[69]=f[69]; t[70]=f[70]; t[71]=f[71];
t[72]=f[72]; t[73]=f[73]; t[74]=f[74]; t[75]=f[75];
t[76]=f[76]; t[77]=f[77]; t[78]=f[78]; t[79]=f[79];
t[80]=f[80]; t[81]=f[81]; t[82]=f[82]; t[83]=f[83];
t[84]=f[84]; t[85]=f[85]; t[86]=f[86]; t[87]=f[87];
t[88]=f[88]; t[89]=f[89]; t[90]=f[90]; t[91]=f[91];
t[92]=f[92]; t[93]=f[93]; t[94]=f[94]; t[95]=f[95];
t[96]=f[96]; t[97]=f[97]; t[98]=f[98]; t[99]=f[99];
t[100]=f[100]; t[101]=f[101]; t[102]=f[102]; t[103]=f[103];
t[104]=f[104]; t[105]=f[105]; t[106]=f[106]; t[107]=f[107];
t[108]=f[108]; t[109]=f[109]; t[110]=f[110]; t[111]=f[111];
t[112]=f[112]; t[113]=f[113]; t[114]=f[114]; t[115]=f[115];
t[116]=f[116]; t[117]=f[117]; t[118]=f[118]; t[119]=f[119];
t[120]=f[120]; t[121]=f[121]; t[122]=f[122]; t[123]=f[123];
t[124]=f[124]; t[125]=f[125]; t[126]=f[126]; t[127]=f[127];
t+=128; f+=128;
#endif
} while (f < end);
#else
do {
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
} while (f < end);
#endif
return;
} /* bmove512 */

Last edited by Gunnar on 07-Oct-2022 at 01:09 PM.

Status: Offline

Karlos

Re: Packed Versus Planar: FIGHT
Posted on 7-Oct-2022 13:14:34

[ #455 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4491
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@Gunnar

That code looks exactly like the sort of manually unrolled code that even when vectorised may not use the best fit operation. Just like my first attempt to use avx2 to set memory, I was using an operation that interacts with the cache and was no better than a naive scalar loop (which was probably auto vectorised to the same thing). Changing it to one that bypasses the cache and immediately writes to memory increased the performance 2.6x.

It was my first day...

_________________
Doing stupid things for fun...

Status: Offline

Gunnar

Re: Packed Versus Planar: FIGHT
Posted on 7-Oct-2022 13:31:09

[ #456 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

Hello Karlos,

Quote:

That code looks exactly like the sort of manually unrolled code that even when vectorised may not use the best fit operation.

BMOVE512 is the code that MYSQL uses internally for bigger memcopies.
And yes, it's not better than the other options. I included it only for comparison.

Last edited by Gunnar on 07-Oct-2022 at 01:32 PM.

Status: Offline

Karlos

Re: Packed Versus Planar: FIGHT
Posted on 7-Oct-2022 13:56:41

[ #457 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4491
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@Gunnar

On compilers with good autovectorisation those methods may be OK, but even then, for larger copies they may be suboptimal due to interaction with the cache. Explicit use of the right vector instructions, preferably those that are fastest for stream operations like setting and copying, could have a significant impact on the outcome. Like I said, my most naive implementation of 16/32/64-bit memory filling, with no manual optimisation of any kind, achieved about 10GB/s on this machine. I wrote what I thought was a reasonable AVX2 intrinsics based version, but it still only got 10GB/s. Analysis of the compiler output showed that the naive version had been autovectorised into something resembling the manual AVX2 version. Then I went away, read the AVX manuals (which aren't great, Intel) and realised that changing from _mm256_store_si256() to _mm256_stream_si256() might be what I needed. And that was it. From 10GB/s to 26GB/s just like that.
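
A rough sketch of the store-versus-stream difference described above, for a 32-bit fill only. This is not the MC64K code; the names fill32_naive, fill32_stream and fill32_stream_avx2 are invented for the example, and it assumes GCC or Clang (it relies on the target attribute and __builtin_cpu_supports). The intrinsics themselves, _mm256_set1_epi32(), _mm256_stream_si256() and _mm_sfence(), are the standard Intel ones. On a CPU without AVX2, or on a non-x86 build, it simply falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define HAVE_X86_AVX2_PATH 1
#endif

/* Naive scalar fill: the baseline that a compiler may autovectorise itself. */
void fill32_naive(uint32_t *dst, uint32_t value, size_t count)
{
    assert(dst != NULL || count == 0);
    for (size_t i = 0; i < count; ++i)
        dst[i] = value;
}

#ifdef HAVE_X86_AVX2_PATH
/* Fill using non-temporal (streaming) 256-bit stores that bypass the cache.
 * Replacing _mm256_stream_si256 with _mm256_store_si256 would give the
 * cached variant mentioned in the post. */
__attribute__((target("avx2")))
static void fill32_stream_avx2(uint32_t *dst, uint32_t value, size_t count)
{
    size_t i = 0;

    /* Scalar head: the streaming store needs a 32-byte aligned address. */
    while (i < count && ((uintptr_t)(dst + i) & 31u) != 0)
        dst[i++] = value;

    const __m256i v = _mm256_set1_epi32((int)value);
    for (; i + 8 <= count; i += 8)
        _mm256_stream_si256((__m256i *)(dst + i), v);

    /* Make the non-temporal stores visible before ordinary stores follow. */
    _mm_sfence();

    /* Scalar tail for the remaining 0..7 elements. */
    for (; i < count; ++i)
        dst[i] = value;
}
#endif

/* Picks the streaming path when the CPU reports AVX2, else the scalar loop. */
void fill32_stream(uint32_t *dst, uint32_t value, size_t count)
{
#ifdef HAVE_X86_AVX2_PATH
    if (__builtin_cpu_supports("avx2")) {
        fill32_stream_avx2(dst, value, count);
        return;
    }
#endif
    fill32_naive(dst, value, count);
}
```

Whether the streaming version is actually faster depends on the buffer size and the machine; for buffers that fit in cache and are read again soon after, the cached store may well win.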

This reminded me of the fun I had with move16 on the 040. For things like pixel format conversion, I'd have a 16-byte aligned buffer on the stack, use move16 to transfer data to it from the source, operate on it in place, then move16 to transfer from there to the destination. Happy days.
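
The staging-buffer pattern can be sketched in portable C like this. It is only an illustration of the data flow: memcpy() stands in for the 68040 move16 instruction, the function name convert_argb_to_abgr is invented, and the particular conversion (swapping the R and B bytes of four 32-bit pixels stored as A, R, G, B in memory) is just an assumed example of "operate on it in place".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert pixels 16 bytes (four 32-bit pixels) at a time through an
 * aligned staging buffer. bytes must be a multiple of 16. */
void convert_argb_to_abgr(uint8_t *dst, const uint8_t *src, size_t bytes)
{
    _Alignas(16) uint8_t line[16];   /* 16-byte aligned buffer on the stack */

    assert(bytes % 16 == 0);
    for (size_t off = 0; off < bytes; off += 16) {
        /* Stand-in for move16: source -> staging buffer. */
        memcpy(line, src + off, 16);

        /* Operate in place: swap bytes 1 and 3 (R and B) of each pixel. */
        for (int p = 0; p < 16; p += 4) {
            uint8_t r = line[p + 1];
            line[p + 1] = line[p + 3];
            line[p + 3] = r;
        }

        /* Stand-in for move16: staging buffer -> destination. */
        memcpy(dst + off, line, 16);
    }
}
```

On the real hardware the source and destination would also need to be 16-byte aligned for move16; the C version does not require that.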

Last edited by Karlos on 07-Oct-2022 at 01:56 PM.

_________________
Doing stupid things for fun...

Status: Offline

Hammer

Re: Packed Versus Planar: FIGHT
Posted on 7-Oct-2022 14:01:18

[ #458 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5616
From: Australia

@Gunnar

Quote:
I don't think that APOLLO INVADER would be playable on an AmigaOne system.

Have you made a generic 68K RTG version available for 3rd party testing?

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

Status: Offline

Gunnar

Re: Packed Versus Planar: FIGHT
Posted on 7-Oct-2022 17:01:05

[ #459 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

Hello Hammer,
how are you?

Quote:

Quote:
I don't think that APOLLO INVADER would be playable on an AmigaOne system.

Have you made a generic 68K RTG version available for 3rd party testing?

Please note that the game Apollo-Invader was written by Arne, not by me.
I only support Arne in this, mostly with "smart-ass" comments.

Arne started development on Invader this year when he was 14 years old; he is 15 now.
Apollo-Invader is Arne's second complete Amiga game.
His first game was Apollo-Blocks, which he wrote last year when he was 13 years old.
Arne also helped me on some games, namely APOLLO-CROWN, DR-APOLLO, and APOLLO-MENACE.

Apollo-Invader is written 100% in assembly.
Apollo-Invader was also 100% written on Amiga.
Arne wrote the game using his V4-STANDALONE, a machine which he also soldered himself.

Arne is a huge Amiga fan.
Actually, Arne was born an Amiga fan. I think he had no choice.
He has helped us with testing since the early NATAMI days, at a time when he still went to kindergarten.

Arne discovered his love of Amiga assembly coding at 13.
In the last two years he has written a number of cool demo effects.
Arne runs a small Youtube channel showing his Amiga demo effects, and he also provides them as free downloads for others to use, or to learn from as examples of how to code in assembly.

Here are some examples:

A Fractal in Assembly
https://www.youtube.com/watch?v=1Qb9624x4kc

Some nice Math function:
https://www.youtube.com/watch?v=ZJdU0wEWTLo

A nice Cube effect from him:
https://www.youtube.com/watch?v=Rtw2tn0d_V0

The game is coded fully in assembly and makes good use of AMMX instructions for easier coding and for more speed. The game also uses special Super-AGA hardware features like 16-bit audio playback with stereo positioning of sounds.

Making a second version that does not use the existing hardware features would have been a lot more work, for something less good and much slower. As you can imagine, Arne did not waste time on this.

Maybe some of you have met Arne before.
He has been at several Amiga events in recent years, for example the OUF Amiga party in Switzerland, or RETRO CLASSIC.

You can also meet Arne at the A37 party, where we will also show his game.
APOLLO-INVADERS was first shown at GAMESCOM this year.
At A37 we will run an APOLLO-INVADERS play competition with some cool prizes to win!

Arne and I will also give a live assembly coding session at A37.
We will bring 8 systems with keyboards and monitors for people to sit at and use.
Together with the interested people we will code a complete Amiga game from scratch.
We have prepared all the GFX and music, and we will show people, hands on, how to write a complete game for Amiga in assembly in 1.5 hours.

See you at A37!

Last edited by Gunnar on 07-Oct-2022 at 05:06 PM.

Status: Offline

kolla

Re: Packed Versus Planar: FIGHT
Posted on 7-Oct-2022 19:24:07

[ #460 ]

Elite Member

Joined: 21-Aug-2003
Posts: 3072
From: Trondheim, Norway

@Gunnar

The core is the just-released 2.16.

_________________
B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC

Status: Offline


Amigaworld.net was originally founded by David Doyle