Poster | Thread |
Hypex
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 7:22:52
| | [ #401 ] |
|
|
|
Elite Member |
Joined: 6-May-2007 Posts: 11299
From: Greensborough, Australia | | |
|
| @Karlos
Quote:
Maybe. The point was, it was using dedicated hardware to accelerate graphics operations rather than SIMD instructions on the CPU. I'm not opposed either way but the former seems more in the "spirit" of the Amiga way of doing things. |
Yes, I understood that, which how I also see it. An Atari ST however, where it lacks hardware scrolling and sprites, is a perfect candidate for AMMX. Atari MMX! |
|
Status: Offline |
|
|
Gunnar
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 8:07:31
| | [ #402 ] |
|
|
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @Karlos
Quote:
It doesn't matter what your opinion of alignment is because both versions are free to use the most optimal alignment for the test. You already claimed 68080 would outperform a GHz class PPC using AltiVec optimised code for the same task.
|
ALTIVEC has severe performance problems in some areas and can not be used for all problems. Let me help you understand this.
The design goal of a CISC CPU is to make coding easy and to solve problems in hardware. The design goal of RISC is the opposite of CISC. The design goal of RISC is to make the hardware of the chip simply so that the CPU manufacturer has the easy part - and this makes coding more difficult. I assume you all know this.
ALTIVEC can operate on 128bit vector. This is good. Some algorithms benefit a lot from the speedup that ALTIVEC can give. ALTIVEC has also a severe design limitation which makes ALTIVEC useless for some problems for example ALTIVEC is relative useless for doing game graphics. The reason for this is that ALTIVEC can only read and write 128bit alignment memory.
Lets say you want to use ALTIVEC paint alpha blended sprites on a screen. In real games sprites and bullets can have any alignment in game screen memory. If a bullet flies then it could have X-position 0, then X=1, then X=2 and so on. With ALTIVEC you can only read and write to Position X=0, or to Position X=16 or Position X=32 You can not easily read or write any other positions. This problem should be easy to understand for you.
You will understand and agree that its very difficult to use ALTIVEC for games, or for moving windows on the Workbench around as ALTIVEC can not read/write to any screen position.
AMMX is designed for making games. AMMX has several instruction which are a lot stronger for game coding. These AMMX instructions can do more work per instruction than ALTIVEC and they save memory bandwidth. If memory bandwidth is your game coding limitation than using AMMX will allow you to get much better results = more FPS than trying to do the same with ALTIVEC. And very important feature AMMX can operate on any alignment. This makes coding a lot simpler and also makes the program faster.
ALTIVEC is not designed for drawing games and its not good for accelerating Workbench. ALTIVEC is designed for other tasks.
AMMX is designed for speeding up Workbench graphics and for making game coding easier and faster.
If you try to use ALTIVEC for game drawing then this is like trying to eat soup with a fork.
AMMX is faster than ALTIVEC for game coding.
And yes a game drawing routine need to be able to draw to ANY X-Position! This means if you want to make a comparison than your routine needs be able to READ/WRITE misaligned - as in games the sprites are never limited to Position X=0 or X=16 only.
And yes in doing 2D game GFX the 85 Mhz Vampire is in many areas stronger than the 800MHz AmigaONE XE. You see this for example in games like DIABLO, DIABLO runs on the Vampire several times faster than on GigaHerz OS 4 machines.
There are several easy to understand reasons why the Vampire is stronger in some areas. a) AMMX is designed for doing games b) ALTIVEC is _not_ designed for doing games and can _not_ read/write misaligned. This means ALTIVEC is pretty useless for games c) the speed for memory access to the Graphic cards on AmigaONE system is very low d) the speed for memory access graphic memory access on the Vampire is very good e) the memory speed of Fastmem is very low for the PowerPC system f) the memory speed of Fastmem is very fast for the 68080
The memory speed problems of the AmigaOne and Pegasos are well known. You can very easily also measure this with Amiga tools like BUSTEST. Benchmark MEMCOPY or run BUSTEST then you see this clearly.
The 68080 system reach 500-700 MB/sec memory performance. This is a lot more than what an AmigaOne XE can reach.
Lets sum this up:
- 68080 CPU is a lot easier to code for. - The G4 PowerPC does not support misaligned load/store for SIMD. This lack of hardware alignment support gives programmers a real problem, and makes ALTIVEC pretty useless for game drawing. - The Vampire systems are better in memory performance than AmigaOne Xe or Pegasos - AMMX is designed for game coding.
|
|
Status: Offline |
|
|
Gunnar
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 8:21:41
| | [ #403 ] |
|
|
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @Karlos
Quote:
Alpha Blend a source 1080p 32-bit ARGB buffer onto a destination 1080p 32-bit ARGB buffer, each optimally aligned in the fastest RAM you can read from/write to.
|
I look very much forward to your benchmark result - but we know the result already.
Your problem is memory bound. This means your performance is limited by the speed of your Fast memory.
All the Vampire Accelerators for Amiga100/500/2000/600/1200 and the Standalone have much better memory performance than the PowerPC AmigaOne XE. The A1 XE can not win against them in this area.
The AMIGA program BUSTEST is a good tool for measuring memory speed. Run BUSTEST or a similar tool. |
|
Status: Offline |
|
|
Gunnar
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 8:40:47
| | [ #404 ] |
|
|
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @Karlos
Quote:
It doesn't matter what your opinion of alignment is because both versions are free to use the most optimal alignment for the test. You already claimed 68080 would outperform a GHz class PPC using AltiVec optimised code for the same task. |
You do not read what I wrote
Let me help you and again quote what I said. Please read it again carefully:
Quote:
If you look at what real life, real memory, gaming blitting performance they deliver = then the 68080 AMMX system does outperform 1GHz PowerPC systems in the maximum real screen game/sprite blitting performance.
|
I never said that an 80MHz 68K does beat a 800 MHz PowerPC at any random task. I was very specific and explained that in area of sprite/game and window rendering the 68K beats even the 10 times higher clocked PowerPC. Yes Sprite/Game/Window drawing does by design require support of misaligment.
We know that coding alignment support in Altivec is a real pain in the ass. But for game routines, alignment support is required, and therefore you will have to code it. If you have problems coding the alignment support, feel free to "spy" in the 400 instruction Altivec memcopy that I posted before - the code includes the alignment correction. The aligment code is part of the reason the memcopy is so ugly and big.
Last edited by Gunnar on 05-Oct-2022 at 08:48 AM.
|
|
Status: Offline |
|
|
FairBoy
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 9:09:56
| | [ #405 ] |
|
|
|
Member |
Joined: 8-Jun-2020 Posts: 76
From: Unknown | | |
|
| @Gunnar Quote:
Yes Sprite/Game/Window drawing does by design require support of misaligment. |
No, in all fairness, that's nonsense again. Such tasks don't have such requirements by design. If misalignment-support is required depends on the respective system's alignment requirements and if the used pixel format and the bitmap's memory layout match those. |
|
Status: Offline |
|
|
Gunnar
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 9:15:15
| | [ #406 ] |
|
|
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @Karlos Quote:
Maybe. The point was, it was using dedicated hardware to accelerate graphics operations rather than SIMD instructions on the CPU. I'm not opposed either way but the former seems more in the "spirit" of the Amiga way of doing things.
|
Yes, Super-AGA has all this too.
Super-AGA has DMA based Amiga Audio channels, supporting not only the old 8bit but also 16bit and 32bit stere now.
Super-AGA has DMA based Sprites, which support the old OCS and AGA modes and also have enhanced modes with more colors.
Super-AGA supports Scrolling and supports PLANAR and CHUNKY modes
Super-AGA has DMA hardware support for Video YUV modes.
Super-AGA has DMA hardware support for Picture in Picture.
Super-AGA has the Amiga Copper, and is improved to support also 32bit Copper Moves
Super-AGA has the DMA Amiga Blitter, and has it improved to be faster and support 64bit
Super-AGA even has DMA based hardware 3D acceleration (Maggie chip)
You can program your games fully in the Amiga spirit.
AMMX gives you the opportunity to also solve some task very efficient with the CPU. This option can make coding easier and give the programmer more freedom.
The benefit should be easy to understand for any coder. Let me help you understand it.
The DMA based Amiga blitter design is great for doing stuff in parallel to the CPU. As you all know to use the Blitter you need to first tell the Blitter what todo. This "job giving to Blitter" does require a number of CPU instructions. For small blit jobs it can be often easier and faster to do the job yourself than spending the overhead to call the Blitter. But this is no news also in Amiga this was already the case before.
And of course if you use the CPU for the job then you have more flexibility this can pay of when you for example want to do some extra processing like color conversion or format changes or calculation some effects. Modern GFX cards also use a CPU for this. They call them Shaders.
|
|
Status: Offline |
|
|
Gunnar
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 9:21:48
| | [ #407 ] |
|
|
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @FairBoy
Quote:
Quote:
Yes Sprite/Game/Window drawing does by design require support of misaligment.
|
No, in all fairness, that's nonsense again. Such tasks don't have such requirements by design.
|
Help us understand your post.
Lets say we code a game like Xenon 2. In this game the player controls a spaceship. The player can move his ship freely to left or right to any X position on the screen in this game. The render needs to support this. Only being able to have the Sprite on position X=0, or position X=16 or position X=32 - do we agree that this would be bad? Yes this would make a bad game. And the bullets, we also expect them to be displayed on every screen position, right?
Now if you move on the Workbench a Window around We agree that it is nice to be able to move the fluently to any positions. Jumping from X=0 to X=16 = this would look not good.
If you have a 256 color game then one pixel = 1 BYTE if you have a Hicolor game then one pixel = 1 WORD if you use a 32bit game mode then one pixel = 1 LONG
Can ALTIVEC write to BYTE,WORD or LONG address? No, ALTIVEC can not write or read to BYTE, to WORD, or to LONG addresses ALTIVEC can only access addresses on 128bit boundary.
This means screen positions on all Graphic formats are misaligned from ALTIVEC perspective. You can still use ALTIVEC to copy memory or copy GFX data. But you need to code the alignment support yourself. This alignment support does add many instructions, this makes the code hard to write and hard to read and of course makes it also slower.
Let me give you a simple example for you to better understand
MOVE.l D0,$1 -- we move 4 byte to address $1 This is a misaligned write., The 68020 CPU and higher do support this in hardware. You can in 1 instruction write 4 byte to the address $1 $2 $3 $4
Now if you your hardware not supports misaligned write and you still want to use the LONG write then can this by copy the D0 data into 2 register. And shift/rotate it by 8bit in them. Then you create yourself 2 Mask register These have the value of: #$00FFFFFF and "$FF00000" then you read from memory two long from Address $0 and address $4 then AND these values with the MASK you do the same inverted with your data, then you OR both and then you copy the 2 LONGS back to memory.
As you see this is much more complicated needs a lot more instructions and also adds extra memory access. It should be very clear to understand that this is slower than the simple MOVE.L before
With ALTIVEC this is basically the same but with 16byte instead 4 byte
Last edited by Gunnar on 05-Oct-2022 at 10:05 AM. Last edited by Gunnar on 05-Oct-2022 at 09:37 AM. Last edited by Gunnar on 05-Oct-2022 at 09:25 AM.
|
|
Status: Offline |
|
|
Hypex
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 10:03:47
| | [ #408 ] |
|
|
|
Elite Member |
Joined: 6-May-2007 Posts: 11299
From: Greensborough, Australia | | |
|
| @cdimauro
Quote:
I've no statistics, but when I've checked WHDLoad games quite some of them required some Kickstart. I would be good to have some numbers. |
When I installed WHDLoad it wouldn't work without any Kickstart. And that was before I installed any games. Must remember to check next time I boot my A4000.
Quote:
Unfortunately not. And it was expensive for me, at the time. |
I missed out. A local music shop was selling a boxed heap for real cheap. But I didn't buy one because I already had AGA in my A1200.
Quote:
Well, that's much more expensive! |
It would have been. But he must have got it cheap. Expensive to ship!
Quote:
Same here with The Nut which tends to move people to another "protected" environment. |
I smell Acorns.
Quote:
Correct. That's why you can talk about efficiency and then AMMX is clearly more efficient for some operations.
But it's when you talk about performances and saying that AMMX is faster than a PowerPC then you're lying.
The difference might sound subtle but it's essential. |
It's all in the fine details and what was meant. It's easy enough to talk about performance in a misleading way. Of course how it performs outright is based on a number of factors including clock, memory speed and efficiency.
But, what we want to know is, the resulting speed. What is the quickest at doing an operation. Even so this is still apples and oranges. For one thing an ASIC CPU core isn't the same as an FPGA embedded CPU core. For another it's a different ISA. And different memory is used. There's talk of fast ram but an AmigaOne doesn't have fast ram in the Amiga chipset context of it. And a Vampire SA doesn't have fast ram either in the Amiga chipset context.
I thought Karlos provided a reasonable first test case for starters which was simple enough. But Gunnar wants to complicate it with a non-aligned vector soft sprite test. It might as well be an Amiga chip ram byte copy test vs. Vampire SA where the Apollo is faster because it doesn't have non-cached limitations of chip ram.
Quote:
No, x64 have also a better microarchitecture. I've shared several links to benchmarks on the code density thread and you can see that there isn't a huge clock difference between the tested systems, but x64 platforms as much faster on average compared to PowerPCs. |
I've avoided that thread because I spend too much time reading here already. Well, there goes the RISC argument then. After AMD64 I thought x86 arch become a real mess. All these extensions put onto it. Extended opcodes. More lettering complicating simple register indexes and width specifying. And at end of the day x86-64 still wins.
I'm still not going to learn x64 ASM. It might be more complicated. But PPC64 ASM still looks more understandable!
Quote:
Sorry, but it's still wrong. Either you remove the & and replace it with a space, like I did it, or replace it with the HTML equivalent, like kolla suggested (which is much better). |
Damn it. I know what it is. You cannot preview a post. If you preview is destroys HTML. I already tested it as working then made the mistake of previewing it but I think a pure HTML link is lost cause.
http://apollo-core.com/knowledge.php?b=4¬e=38817 Patents to Improve CPU Core?
Quote:
Maybe on some synthetic benchmark. Because the Sams are running at 1Ghz, so they are much faster than an Apollo core.
It's important to test an entire application / game and not just single routines. |
That's why I had the idea for my Doom test. I was interested in testing a full game engine. Where I could modify it enough to give me results in a real world test.
Quote:
Indeed. In fact, Intel invented also the EPROM. |
It just keeps getting better. At this point I'd like to admit that I think an Intel would have been a good CPU choice for the Hombre Amiga project. An Intel Itanium!
Quote:
Hum. V is better to be used for variable-length vectors. Maybe S = SIMD is a better prefix for those registers. |
A for address. D for data. It logically follows to have V for vectors. At least to me. Length should be coded as per the 68K dot width protocol.
I thought S was a grey as I recalled a MOVES but it's not listed in online searches. In any case both A and D stand for words but an S would stand for an acronym. (Or initialism).
Technically, I see 68K has facing the same challenges as x86, since they are similar designs with a core CISC ISA. The 68K is less restrictive as it was designed with 16 registers with 32 bit width existing, divided into address and data as they may be, so not as pure GPR but later cores addressed this and gave more freedom. x86 was designed with a smaller register count and width which then had to be retrofitted on to expand the design. So 68K should be a better base to add vectors onto. However, both designs do have separate GPR and floats, in the register files, and where transistors are limited re-purposing floats does make sense. They already gave 80 bits width so enough for 64-bit vectors with 16-bits to spare or 5 words. But it would restrict it as floats and vectors would overlap. However, in 68K ISA, it would also be expected to also perform memory to memory vector ops.
Quote:
floats could be 32-bit also. |
That would be even more inferior to 80 bit and 64 bit floats.
Quote:
No, it this case the instruction is an AMMX one, which always operates on 64-bit data = quad word. |
Apollo has 64 bit data registers then? Since it stored in a Dx. That looks wrong. Load/store? It's not a RISC. PPC needs to load then store. That looks more wrong for a 68K extension.
Quote:
It would have been better to use a size suffix, like .q, but we know that Motorola also used default sizes for some instructions, so this is choice is in-line with what Motorola did. |
Word default except for MOVEQ with long words yes. But it's still ambiguous. There are examples like MOVE16, so a LOAD8.B would fit in better. Aside from LOAD having no hints it's a vector op. Or a MOVE8.B since 68K standardised on MOVE.
Quote:
Much better would have been to use a different syntax for the SIMD registers, like S2 in this case instead of D2, because D2 is causing confusion, like for you. |
Yes it tells me it's loading into D2. I don't see it any other way. Redesign it before they start writing assembler parsers.
The store is even worse. A "storem.b" ? That tells it stores multiple bytes into memory but then only gives D2. What a waste, that looks pointless, like a "movem.b d2,(a1)+".
Quote:
No, it merged the bytes from D2 with the bytes on (A1). If a byte is zero, then the corresponding byte on (A1) isn't changed. Otherwise it's replaced with the one from D2. |
That's not a store, that's a merge!
Quote:
It's a merge / blend operation. Maybe a better name should have been used. |
Yes, like merge. O blend. Mix.
NZORA. Non-zero-OR-AND. Combine bits with OR, if source is non-zero then apply mask with AND from source. Store to destination. Just made it up.
Quote:
There some examples and a Programmer's guide on Discord. I've posted several links on the Code Density thread. |
I'll take a link. At least I don't have to use Discord. Have an account but never used.
Quote:
To me it looks complicated. |
Suppose it is. It's in a few lines of code. I don't recall if I ran it through MonAm but there was clearly something wrong with it.
Quote:
Sorry for that. |
So was I. I knew he had major health conditions. But I was disappointed when all the last work he was doing wasn't published and I don't know what his family did with his Amigas or the data. I'm thinking I should at least release the work put I into it. Not like it has any big secrets. It uses a modified MultPlayer source to support and play a number of module formats and do scopes. I ported a few to AHI and they work fine on OS4. The way I approached it, unlike other AHI module players, at least on OS4 I've seen, is to use the module features of AHI to handle timing and mxing. It even checks and preselects the proper Paula mode if it exists so it uses hardware mixing.
If Mary or Monica in the Vampire core has an AHI driver which it should by now, it could be used to play classic modules in 16-bit resolution using hardware mixing. The way it should be.
Quote:
No, but I'll get there. Maybe next year. That would still be quicker than the four years an update took last time.
Quote:
As I've said before, I would have preferred 8 or, even better, 16 audio channels. Leaving only a few sprites for the mouse pointer or some extra, small stuff. |
That would have put more pressure on the blitter. They would have needed to speed it up. Few sprites for a mouse pointer looks like such a VGA thing.
This doesn't seem to be well known or brought up in the subject. But did know the C16 actually did have a hardware sprite? Yes it had a hardware cursor!
That's not so funny. It was the way of the future. VGA chipsets embraced hardware cursors. C64 people thinking those 8 sprites were great. The C16 showed them where it was at! The way of the future. One sprite for your cursor. |
|
Status: Offline |
|
|
Karlos
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 10:10:24
| | [ #409 ] |
|
|
|
Elite Member |
Joined: 24-Aug-2003 Posts: 4534
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition! | | |
|
| @Gunnar
Quote:
Gunnar wrote: @Karlos
Quote:
It doesn't matter what your opinion of alignment is because both versions are free to use the most optimal alignment for the test. You already claimed 68080 would outperform a GHz class PPC using AltiVec optimised code for the same task. |
You do not read what I wrote
|
Untrue. I read exactly what you wrote, which is what triggered this whole conversation. You've posted a lot long winded things since then, hoping to obscure your original claim away from view, so I refer you back to your own post here
And should you feel the urge to retroactively change it, the point of contention is cited below:
Quote:
Gunnar wrote:
Quote:
Karlos wrote: @Gunnar
Those are bold claims.
|
I have both 68080 and PowerPC Systems here
Quote:
Your workloads must be very selective. Alpha blending is a good example though. Suppose I have two large pixel, e.g 1080p arrays of ARGB 32-bit pixels and I want to alpha blend buffer B onto buffer A using B's alpha channel.
Are you claiming the 68080, at it's normal clock rate, using AMMX will complete this in less time than a 1GHz PPC using altivec instructions to perform this task?
|
Yes correct. |
Anyone with any comprehension of the subject matter at all will conclude that you are claiming that 68080/AMMX will be able to perform an alpha blend of a 1080p ARGB buffer onto another faster than a GHz class PowerPC using Altivec for the same task.
Your whole diversion into alignment came after you made this claim. Now in fairness to you, I didn't qualify it as "using aligned data" but since I am talking about one whole 1080p buffer being blended wholesale onto another, I didn't think I had to. I assumed it ought to be apparent I am not talking about a misaligned use case here. However, unless AMMX is badly broken, there's no reason why it's performance on aligned data would be worse than on misalgined.
As for being memory bound, then this is where you may have a case. The memory performance of the articia based A1 is suboptimal to say the least. But your initial claim didn't include this caveat and I only said a "GHz class PPC using altivec.". This could be a Mac or a server of some kind.
As it stands, I don't actually care either way as I don't have a horse in the race. My current opinions on PPC are a matter of record here. However, I do call out anything that looks like BS._________________ Doing stupid things for fun... |
|
Status: Offline |
|
|
Gunnar
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 10:19:15
| | [ #410 ] |
|
|
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @Hypex
Quote:
Yes, like merge. O blend. Mix.
NZORA. Non-zero-OR-AND. Combine bits with OR, if source is non-zero then apply mask with AND from source. Store to destination. Just made it up.
|
Cute, but wrong.
I understand where you come from as this would be the "default" way to code this. But this would be slow, very slow. AMMX allows you to do this much faster and much more efficient.
Let me help you understand this.
A "normal" sprite operation would read the Sprite, read the Background, then do a MASK operation similar to what do did assume and then write the result back to memory. These are in total 3 memory access.
AMMX supports MASKED writes, where you can define a MASK as source, or the instruction can create the MASK on the fly depending on the content. Using this instruction you only have 2 memory access, 1st sprite, 2nd Screen write.
This means the CPU not need to read to from the screen for this. In other words the system saves 33% memory bandwidth.
Yes other CPUs not support this. And if you would want to code this operation with e.g. the PowerPC than you need to LOAD, LOAD, combine and STORE - you need 50% more memory bandwidth for this task than the 68080 needs.
This is another good example why some game are so much faster on Vampire than on AmigeONE. The 68080 has a lot more memory bandwidth than the AmigaOne and when you code the same on PowerPC you need/waste 50% more memory bandwidth then the 68080 needs using the AMMX instructions.
By the way, you can find the the AMMX instruction documentation on the Apollo-Core website. If you read it then you will understand this and other advantages even better and you not need to "guess" wrongly again.
|
|
Status: Offline |
|
|
Gunnar
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 10:25:01
| | [ #411 ] |
|
|
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @Karlos
Quote:
The memory performance of the articia based A1 is suboptimal to say the least. But your initial claim didn't include this caveat and I only said a "GHz class PPC using altivec.". This could be a Mac or a server of some kind.
|
Its correct that AmigaOne and Pegasos have bad memory performance. Yes we all know this. I told from day one, that beating the AmigaOne is very easy in this area.
Yes G3 or G4 MACs have better memory performance than AmigaOne, this is also correct. I used to a have a nice G4 Powerbook and yes its memory performance was much better than AmigaOne, the memory performance of the Mac Powerbook was actually not as good as the Vampire, but better than the AmigaOne.
You can easily measure memory performance yourself. On the PPC you can run Stream or you can run Minibench. Minibench will measure this for you. Here are EXE and source for you http://apollo-core.com/minibench/
The AMIGA program BUSTEST does also measure this very good. Run it on the systems and see yourself! You can get BUSTEST here: https://aminet.net/package/util/moni/bustest
Last edited by Gunnar on 05-Oct-2022 at 10:33 AM. Last edited by Gunnar on 05-Oct-2022 at 10:28 AM.
|
|
Status: Offline |
|
|
Gunnar
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 10:43:56
| | [ #412 ] |
|
|
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @Karlos
Quote:
Quote:
[quote] [quote] Alpha blending is a good example though. Suppose I have two large pixel, e.g 1080p arrays of ARGB 32-bit pixels and I want to alpha blend buffer B onto buffer A using B's alpha channel.
Are you claiming the 68080, at it's normal clock rate, using AMMX will complete this in less time than a 1GHz PPC using altivec instructions to perform this task?
|
Yes correct.
|
Anyone with any comprehension of the subject matter at all will conclude that you are claiming that 68080/AMMX will be able to perform an alpha blend of a 1080p ARGB buffer onto another faster than a GHz class PowerPC using Altivec for the same task. [/quote]
Dear Karlos, you will recall and you will agree with me that I spoke all the time about AmigaONE, Pegasos and about 68080 CPU.
I knew that the problems we talk about are all memory bound and as we all know that the Vampire outperforms both AmigaONE and Pegasos in memory speed. I was easy for me to say that the 68080 will beat them. We know result of this in advance. Reread what I wrote, I told you that the problems are memory bound and that the CPU clock is not helping here, you could have a 4 GHz PPC in your AmigaOne - it would still loose this comparison.
Yes of course I always knew which benchmark we win and which we can not loose. This is maybe not fair ... You have to mind that many of Apollo-Team are IBM PowerPC/Power CPU developers. Our team has participated in designing and building many of the PowerPC chips that you like. Of course we know all their strength and weaknesses.
On the PowerPC when you want to use Altivec you on top have the problem of handling the alignment issues in software. For sure you understand this problem now too.
Quote:
As it stands, I don't actually care either way as I don't have a horse in the race
|
You want to give up? Come on! Show some sportmanship!
I love see your Altivec code with alignment handling
Last edited by Gunnar on 05-Oct-2022 at 10:49 AM.
|
|
Status: Offline |
|
|
Karlos
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 10:58:20
| | [ #413 ] |
|
|
|
Elite Member |
Joined: 24-Aug-2003 Posts: 4534
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition! | | |
|
| @Gunnar
I don't have an AmigaOne. I have a dead teron evaluation board. It used to be an A1, once.
I'm not going to write any PPC code as I have no way of verifying it even assembles, let alone executes or is correct for the problem. And since the problem I posted allows both machines to use the best possible alignment because entire buffers are being blended together I wouldn't bother making it handled misaligned code that's never going to be used, even if I did. I may be an obsessive autist but I'm not an idiot.
_________________ Doing stupid things for fun... |
|
Status: Offline |
|
|
FairBoy
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 11:00:32
| | [ #414 ] |
|
|
|
Member |
Joined: 8-Jun-2020 Posts: 76
From: Unknown | | |
|
| @Gunnar Thanks for bringing up the OCS sprite example because that underlines my statement. Sprites need source data alignment (hw requirement). No writing to destination bitmap required. That's the idea of sprites. No handling of misaligned data required at all. This alone renders your statement invalid, which was my point.
For the blitter (BOBs / windows) again the data has to be aligned (hw requirement). For x you have shift and mask capabilities, for y there's nothing to do at all. Nowhere we got access to misaligned data here neither.
The point is: your generalized statement is just wrong. The correct statement is: it depends.
But it was off track anyway because you started to derail to not further get into this here: Quote:
Karlos: "I just asked about blending a pair of 1080p buffers." |
There are no alignment issues here, even with Altivec. Unless you misalign those buffers on purpose, which wouldn't be a real-world scenario.
Now, what about your performance claim for this specific task? That was the question and so far you tried hard not to answer it.
And regarding more complex scenarios like "alpha-blending of arbitrary rectangles using altivec vs AMMX": I agree that AMMX is doing the job nicely and I don't think anybody here would say different. But that's not the point. It's about performance of a well coded real world example. In real world you'd have prepared some optimized blitting functions for the cases where the x-coordinate doesn't fit altivec's 128 byte alignment requirements and you'd probably also tweak your (let's call them) BOBs and framebuffer to have some (transparent) extra pixels. Yes, that's definitively more ugly than the coresponding AMMX code, but the question is: is it also slower? And if so, is it slower by a magnitude as you claimed? Or is this claim just bullshit like other clearly doctored Apollo vs PPC benchmarks you presented in the past?
What's missing is prove for your claims. And for a start we'd be happy to see how Karlos' simple scenario compares. Since you made those claims, it's you who must prove them. Go on, we're waiting.
|
|
Status: Offline |
|
|
Gunnar
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 11:06:51
| | [ #415 ] |
|
|
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| Dear Fanboy,
Quote:
Sprites need source data alignment (hw requirement). No writing to destination bitmap required. That's the idea of sprites. No handling of misaligned data required at all. This alone renders your statement invalid, which was my point.
|
HW sprite need data alignment. But we not spoke about HW sprites. What we here spoke about was using ALTIVEC for doing a memcopy or for doing variations of a memcopy. A Window copying is in fact also a type of memcopy.
You need to read what the people talk about before you make wrong points.
Quote:
Yes, that's definitively more ugly than the coresponding AMMX code, but the question is: is it also slower? And if so, is it slower by a magnitude as you claimed?
|
Maybe you missed that all these applications, game 2D copy, typical SDL games, Window movement .. You will agree that all these are from of a Memcopy. For Memcopy the speed of your Memory Interface is important.
The PowerPC system we spoke about have much weaker memory interface than the Vampire we spoke about. Therefore the ALTIVEC code will always run slower. And yes the alignment handling in software makes the code of course also slower.
The whole topic would be different on the 970 PowerPC The 970 has a much better memory performance. And the 970 has a very important feature which the G3 and G4 not have. The 970 can automatically prefetch memory into its caches. This will highly improve the performance.
The G3 and G4 PowerPC are not able to do this. Therefore there always bad in memory.
The Apollo 68080 CPU can like the 970 PowerPC automatically prefetch memory streams. This is the reason its so fast and this is the secret why it easily beats the G3 and G4 PPC.
We dont need to doctor benchmarks. I said it before. The developers from the Apollo-Team are ex-IBM PowerPC developers. We have made many of the high end Power chips we talk about here.
So comparing a PowerPC to 68080 is a homerun for us. As actually we participated in developing both the PPC and the 68080. We know where each has his strength or weaknesses. And we know the undocumented features of many PowerPC. We know where we solved in the 68080 problems of some PowerPC. Like for example the memory streaming, or the memory bandwidth saving with AMMX
Last edited by Gunnar on 05-Oct-2022 at 11:21 AM. Last edited by Gunnar on 05-Oct-2022 at 11:14 AM.
|
|
Status: Offline |
|
|
Cool_amigaN
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 12:34:11
| | [ #416 ] |
|
|
|
Super Member |
Joined: 6-Oct-2006 Posts: 1228
From: Athens/Greece | | |
|
| In the past I did the RiVA benchmark and the MorphOS G4@1.6Ghz was around +60% faster (if I recall correctly) compared to Vampire but I can't remember if it was against V2 or V4. I did post the result in this forum but can't relocate them at the moment.
I have also tested the DevilutionX 68k build and it loaded within 10 secs on MorphOS compared to 30-40 secs on the Vampire (again, don't remember the version). Of course filesystem could be altering the result here.
I am also getting real life copy rates of 16Mb/sec (steady) between ram/hd and about 13-15Mb/sec between hd partitions and around 6.1Mb/sec from LAN on my old Sawtooth 3.1 (1999) with MorphOS which I think are faster than Vamp. from what I have seen on real life events.
Anyway, I think koszer did and maintained some real LW rendering tests and fastest possible was i7 9th gen(?) winuae, followed by MorphOS G5. There is a diagram on ppa.pl I think...
Don't get me wrong, not saying that Vamp. is a subpar system but if you advertise as a product which gets closer to NG but lacks compared to my 2 decades old system (plus it gets crippled by the accompanying OS/apps, compared to a real Amiga NG OS), it just far of the truth. Last edited by Cool_amigaN on 05-Oct-2022 at 12:35 PM.
_________________
|
|
Status: Offline |
|
|
Gunnar
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 13:01:42
| | [ #417 ] |
|
|
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @Cool_amigaN
Hallo Cool_Amigan, How are you?
Quote:
I have also tested the DevilutionX 68k build and it loaded within 10 secs on MorphOS compared to 30-40 secs on the Vampire (again, don't remember the version). Of course filesystem could be altering the result here.
|
Did we not spoke before about GFX render performance? When we spoke before about DIABLO=DevilutionX - we talked about that the Vampire reaches a higher frame rate than the OS4 PowerPC compile running on a G4 PowerPC 1000 MHz Did you accidentally switch topic to hard drive speed now?
Quote:
In the past I did the RiVA benchmark and the MorphOS G4@1.6Ghz was around +60% faster (if I recall correctly)
|
This is nice. But your 1.6 GHz system has two times the clockrate as the AmigaOne XE 800MHz that we spoke about, right? So in other words does this imply that the Vampire play RIVA videos faster than the AmigOne 800 MHz and Pegasos 1000MHz?
If RIVA play videos faster on the Vampire than on the AmigaONE 800Mhz, this is not a very nice result for Vampire?
Quote:
I am also getting real life copy rates of 16Mb/sec (steady) between ram/hd
|
This is good. You can with the Super-AGA chipset also set different IDE speeds. Of course per default the system runs in compatibility slowest mode. If you enable faster modes then the IDE speed will increase significantly. The V4 reaches up to 20MB/sec Disk read speed (reliable)
But maybe we should not get into hard drive speed of your PowerPC system... I fear already a Windows lover jump-in now and tell you that his SSD on his PC is X-many times faster than your MOS system. *Yes Hammer, shut up please! And dont post Off- topic INTEL pics!*
Isn't it strange how this discussion deformed?
I think it all started with me trying to make a point that "performance" can not be judged by looking only at a clockrate, or at a instruction description as in real live for a system many factors are important for performance, like memory latency.
Last edited by Gunnar on 05-Oct-2022 at 01:07 PM.
|
|
Status: Offline |
|
|
Karlos
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 13:13:54
| | [ #418 ] |
|
|
|
Elite Member |
Joined: 24-Aug-2003 Posts: 4534
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition! | | |
|
| @Gunnar
It started because I asked you a very clear (effectively impossible for anyone technical to misunderstand) question and you gave a very direct answer that unconditionally and unambiguously affirmative.
Q) Are you claiming that the 68080/AMMX at its nominal clockspeed will alphablend one 1080p buffer onto another faster than a GHz class PPC using altivec optimised code for the same task?
A) Yes, correct.
I didn't specify which PPC or in which system but let's assume I didn't mean one that wasn't broken by design, e.g. a last generation G4 powemac.
You've since made a lot of noise but haven't actually substantiated the claim. I'm asking the question but I'm not in a position to answer it since I have neither platform to hand.
My gut instinct is that as long as memory throughput doesn't end up being the limit, the PPC should be faster. But I don't actually care ans would happily be proved wrong by a benchmark of the proposed case. Last edited by Karlos on 05-Oct-2022 at 01:23 PM. Last edited by Karlos on 05-Oct-2022 at 01:14 PM.
_________________ Doing stupid things for fun... |
|
Status: Offline |
|
|
Gunnar
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 13:38:06
| | [ #419 ] |
|
|
|
Cult Member |
Joined: 25-Sep-2022 Posts: 512
From: Unknown | | |
|
| @Karlos
Dear Karlos,
Quote:
I didn't specify which PPC or in which system but let's assume I didn't mean one that wasn't broken by design, e.g. a last generation G4 powemac.
|
OK, does this mean you not want to talk here about PPC-AMIGA anymore but about MAC now?
Did this not start with us talking about whether the 68080 is faster in some areas than a PPC G4 AmigaOne? And you debate this.... and you wanted some proof, right?
And after we explained in detail the technical reasons why the 68080 beats the PPC G4 AmigaOne in some areas .... Do you change your argument now? Are you now saying: OK the Vampire is faster than the AMIGAONE but can it also beat a POWERMAC? You not try to use MAC now to defend the honor your AmigaOne? Come on, this is getting funny. I really wonder which computer will be next after the MAC.
Let us answer your question to the G4 PowerMac. Yes, the memory interface of the PowerMac is better than the AmigaOne. This is correct. The memory interface of the AmigaOne is very poor.
But can the PowerMAC with Altivec beat the 68080 with AMMX doing the Sprite copies workloads? I think that the Vampire will still win in many cases The Vampire will sometimes win with a with small margine but in cases where it can benefit from the 30% AMMX efficiency advantage for Sprite operations there the Vampire can really outclass the MAC.
Why is this?
As far I recall the Apple Titanium G4 Powerbook has a slightly lower memory performance than the Vampire. The Mac is much better than the AmigaOne but slightly lower than the Vampire.
And please recall that the Apollo 68080 CPU has the feature to automatically prefetch memory streams. On the PowerPC side only the 970 (IBM G5) has this feature. The G4 does not have this feature. This means programs on the G4 PPC have a real disadvantage.
Dear Karlos,
I've explained the advantage of the AMMX storeMask operations before. Was the example clear to you and easy to understand?
Let us quickly look at them again. This will help you to understand why the MAC has here a disadvantage.
The PowerPC needs to do 3 Memory operations for a Sprite copy: - Read Sprite - Read Screen - Write Screen AMMX allows the 68080 to do this with only 2 Memory operations: - Read Sprite - Write Screen
Is this benefit of this clear to understand? This basically means that if both systems have the same memory bandwidth then the Vampire can because the more efficient AMMX render 50% more softsprites than the MAC.Last edited by Gunnar on 05-Oct-2022 at 01:49 PM.
|
|
Status: Offline |
|
|
Karlos
| |
Re: Packed Versus Planar: FIGHT Posted on 5-Oct-2022 13:43:22
| | [ #420 ] |
|
|
|
Elite Member |
Joined: 24-Aug-2003 Posts: 4534
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition! | | |
|
| @Gunnar
I wasn't talking about PPC Amiga in the first place. It was you that brought that example up. I was talking about PPC in general. This is why I said "GHz class PPC running altivec optimised code."
Be in no doubt. If I had meant "compared to an A1 XE 800 MHz" that's exactly what I would've said. I'm not responsible for your fixation on poor examples of hardware implementation. By far the majority of all altivec enabled PPC machines don't use crippled chipsets. _________________ Doing stupid things for fun... |
|
Status: Offline |
|
|