Packed Versus Planar: FIGHT

MEGA_RJ_MICAL

Re: Packed Versus Planar: FIGHT
Posted on 9-Aug-2022 23:17:00

[ #121 ]

Super Member

Joined: 13-Dec-2019
Posts: 1200
From: AMIGAWORLD.NET WAS ORIGINALLY FOUNDED BY DAVID DOYLE

ZORRAM!!!!!!!!!!!!!!!!!!!!!!

_________________
I HAVE ABS OF STEEL
--
CAN YOU SEE ME? CAN YOU HEAR ME? OK FOR WORK

Status: Offline

Karlos

Re: Packed Versus Planar: FIGHT
Posted on 9-Aug-2022 23:46:49

[ #122 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4958
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@MEGA_RJ_MICAL

I always thought care in the community was a poor idea.

_________________
Doing stupid things for fun...

Status: Offline

Hammer

Re: Packed Versus Planar: FIGHT
Posted on 10-Aug-2022 0:08:33

[ #123 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6503
From: Australia

@cdimauro

Quote:
Not relevant: see below.

It's relevant when the product is marketed as retro and software legacy.

Quote:

That's why I've said that the Apollo's 68080 is more like Intel: it was Intel that reused the FPU to introduce its first SIMD unit, the MMX.

Intel reused 80-bit X87 registers for MMX 64-bit registers.

From around 1994 with HP PA-RISC MAX-1 integer SIMD, HP design team opted for the elegant use of the existing facilities in the CPU, which were slightly modified to understand new, packed subword data.

For HP's POV, PA-RISC replaced their Motorola 680x0 Unix workstations.
For C='s POV, PA-RISC-based Amiga Hombre replaced Motorola 680x0 in the Amiga.

For Acorn's POV, ARM replaced C= CSG/MOS's 65xx CPUs.
ARM is a modern candidate to replace 680x0 ASIC in the "classic" Amiga via PiStorm/Emu68.

Quote:

Besides that, the 68080 also introduced instructions fusion, which is another thing introduced by Intel (with the Banias micro-architecture).

Not relevant.

FACT: AC68080 does NOT have 68K's MMU.

FACT: Intel did NOT remove the full-featured X87 for MMX.
FACT: Intel/AMD did NOT remove the full-featured X86 MMU.

P6 Banias family had a fused-domain RS, as well as ROB, so micro-fusion helped increase the effective size of the out-of-order window. But SandyBridge family simplified the uop format making it more compact, allowing larger RS sizes that are helpful all the time, not just for micro-fused instructions.

Quote:
Not relevant: see above..

It's relevant when the product is marketed as retro and software legacy.

Your micro-architecture implementation argument doesn't address the software interface boundary.

Quote:

I don't agree, for the above reasons.

Fact: Apollo-Core is not the original 68K vendor and designer i.e. Motorola/Freescale.

Quote:

Redundant / useless.

Your arguments are useless.

You have started a flame war.

Last edited by Hammer on 10-Aug-2022 at 12:59 AM.
Last edited by Hammer on 10-Aug-2022 at 12:41 AM.
Last edited by Hammer on 10-Aug-2022 at 12:29 AM.
Last edited by Hammer on 10-Aug-2022 at 12:28 AM.
Last edited by Hammer on 10-Aug-2022 at 12:13 AM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

Status: Offline

Karlos

Re: Packed Versus Planar: FIGHT
Posted on 10-Aug-2022 0:43:25

[ #124 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4958
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@bhabbott

Quote:
...including the code. And I want real proof, ie. a working example that can be tested on a real machine

That's going to be a tough ask, seeing as how there is no "real machine" implementation of N-bit packed pixels. Unless you choose either N=1, in which case you'll get identical results on your native hardware or you use N=8 and compare AGA to an RTG board of your choice. We know that 8 bit packed pixels run rings around planar already.

The only thing I think that you can fairly test is the basic performance of software access to memory formatted as planes or packed pixels, which is what the gist at the beginning of this thread sets up for you. The challenge is to write a simple set-pixel routine for each of the two modes, fill a large area pixel by pixel and time both.

I'm all but certain that planar will be linearly slower than packed for this task as the depth increases.

Obviously, block fills are a different matter and it should be possible to write routines for both that are basically equivalent in performance.
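To make the set-pixel comparison concrete, here is a minimal sketch in Python rather than 68k code. It counts byte read-modify-write operations instead of timing anything, and every name in it (`set_pixel_planar`, `set_pixel_packed`, `fill_demo`) is hypothetical. It assumes MSB-first pixel order, a width that is a multiple of 8 for the planar layout, packed rows stored back to back with no padding, and a depth of at most 8 bits.

```python
def set_pixel_planar(planes, width, x, y, colour):
    """Write one pixel into len(planes) bitplanes (1 bit per pixel, MSB first).

    Returns the number of byte read-modify-write operations performed.
    """
    offset = y * (width // 8) + (x >> 3)  # same byte offset in every plane
    bit = 0x80 >> (x & 7)
    for n, plane in enumerate(planes):
        if (colour >> n) & 1:
            plane[offset] |= bit
        else:
            plane[offset] &= ~bit & 0xFF
    return len(planes)


def get_pixel_planar(planes, width, x, y):
    offset = y * (width // 8) + (x >> 3)
    bit = 0x80 >> (x & 7)
    return sum(1 << n for n, plane in enumerate(planes) if plane[offset] & bit)


def set_pixel_packed(buf, width, depth, x, y, colour):
    """Write one depth-bit pixel into a packed buffer (MSB first, depth <= 8).

    Returns the number of bytes that had to be read, modified and written.
    """
    bitpos = (y * width + x) * depth
    first = bitpos >> 3
    last = (bitpos + depth - 1) >> 3
    nbytes = last - first + 1                     # 1 or 2 for depth <= 8
    shift = nbytes * 8 - (bitpos & 7) - depth     # distance from the LSB
    mask = ((1 << depth) - 1) << shift
    word = int.from_bytes(buf[first:last + 1], "big")
    word = (word & ~mask) | ((colour << shift) & mask)
    buf[first:last + 1] = word.to_bytes(nbytes, "big")
    return nbytes


def get_pixel_packed(buf, width, depth, x, y):
    bitpos = (y * width + x) * depth
    first = bitpos >> 3
    chunk = bytes(buf[first:first + 2]).ljust(2, b"\0")
    word = int.from_bytes(chunk, "big")
    return (word >> (16 - (bitpos & 7) - depth)) & ((1 << depth) - 1)


def fill_demo(width, height, depth):
    """Fill an area pixel by pixel in both formats and count the accesses."""
    planes = [bytearray(width // 8 * height) for _ in range(depth)]
    packed = bytearray((width * height * depth + 7) // 8)
    planar_ops = packed_ops = 0
    for y in range(height):
        for x in range(width):
            colour = (x + y) & ((1 << depth) - 1)
            planar_ops += set_pixel_planar(planes, width, x, y, colour)
            packed_ops += set_pixel_packed(packed, width, depth, x, y, colour)
    return planar_ops, packed_ops


for depth in (1, 3, 5, 8):
    print(depth, fill_demo(16, 2, depth))
```

In this model the planar count grows by one access per pixel for every extra bitplane, while the packed count stays at one or two bytes per pixel. It says nothing about real bus timing, caches or the Blitter.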

_________________
Doing stupid things for fun...

Status: Offline

Hammer

Re: Packed Versus Planar: FIGHT
Posted on 10-Aug-2022 0:53:59

[ #125 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6503
From: Australia

@cdimauro

Quote:

Performance isn't similar: you have to look at the video more carefully, and you'll see that the Amiga generates fewer FPS than the PC.

It's only a minor difference. AGA's Doom frame rates will continue to scale with a faster CPU.

Quote:

It would have been good to run Doom's timedemo to see the effective framerate of both games (with the Amiga port which introduced some optimizations, BTW).

PC Doom has its own optimizations.

Quote:

P.S. I saw that you had the good sense to edit your comment and remove your previous statement (that the Amiga was able to do... full motion! LOL ).

Only for slower CPU 68030 context.

https://www.youtube.com/watch?v=fl-gYdkIXCk
Amiga 1200 with 68040 @33Mhz with AGA playing Doom.

https://youtu.be/BrYjAjDem1k?t=181
Doom (ADoom) playing on an Amiga 1200 with 68060.

When a faster CPU is available, C=- AGA dumb frame buffer is better than IBM VGA.

IBM VGA is rubbish regardless of throwing K7 Athlon XP CPU at it.

Amiga AGA is able to do full motion video.

https://www.youtube.com/watch?v=Kqfbe-DUOKg
PiStorm emu68 Doom test with ECS's EHB (6 bitplanes) mode.

With a fast CPU, Amiga OCS/ECS is able to do full motion video with Doom.

Last edited by Hammer on 10-Aug-2022 at 01:11 AM.
Last edited by Hammer on 10-Aug-2022 at 01:09 AM.
Last edited by Hammer on 10-Aug-2022 at 01:04 AM.
Last edited by Hammer on 10-Aug-2022 at 01:03 AM.
Last edited by Hammer on 10-Aug-2022 at 01:01 AM.
Last edited by Hammer on 10-Aug-2022 at 12:56 AM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

Status: Offline

Karlos

Re: Packed Versus Planar: FIGHT
Posted on 10-Aug-2022 1:14:45

[ #126 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4958
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

Interlude:

Speaking of Doom, I can't wait for this mod to be finished...

https://youtu.be/juyMqm5vguI

Right, back to the flame war...

_________________
Doing stupid things for fun...

Status: Offline

cdimauro

Re: Packed Versus Planar: FIGHT
Posted on 10-Aug-2022 5:19:57

[ #127 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4438
From: Germany

@Hypex

Quote:

Hypex wrote:

Not that I know of, but I was referring to the display controller first and the blitter second. In any case, misaligned pixels should pose no problem. Just mask out any edges and fill it all in between. If the source data is off alignment, just shift it like in planar.

Exactly.
Quote:
The only real problem I see would be if the source is a different pixel width. In which case it would need to scale the pixels and spend time packing them in, which is 3D territory.

It's not like 3D, but yes: in this specific case you have to unpack the source pixels to reach the size of the destination. Difficult for a CPU (but easier for one having a SIMD unit), but much easier in hardware.
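As a rough software illustration of that unpack-and-repack step, the sketch below widens a stream of 3-bit packed pixels to 5 bits per pixel. The helper names are hypothetical, pixels are assumed to be stored MSB first with zero padding at the end of the stream, and hardware would do the equivalent with shifters rather than big integers.

```python
def unpack_pixels(data, depth, count):
    """Extract `count` pixels of `depth` bits each from an MSB-first stream."""
    bits = int.from_bytes(data, "big")
    total = len(data) * 8
    mask = (1 << depth) - 1
    return [(bits >> (total - (i + 1) * depth)) & mask for i in range(count)]


def pack_pixels(pixels, depth):
    """Pack pixel values into an MSB-first stream, zero-padded to a byte."""
    bits = 0
    for p in pixels:
        bits = (bits << depth) | (p & ((1 << depth) - 1))
    nbits = len(pixels) * depth
    pad = -nbits % 8
    return (bits << pad).to_bytes((nbits + pad) // 8, "big")


def widen(data, src_depth, dst_depth, count):
    """Re-pack `count` pixels from src_depth to dst_depth bits per pixel."""
    return pack_pixels(unpack_pixels(data, src_depth, count), dst_depth)


source = pack_pixels([1, 7, 0, 5, 3, 2, 6, 4], 3)   # 8 pixels in 3 bytes
wide = widen(source, 3, 5, 8)                       # same pixels in 5 bytes
print(len(source), len(wide), unpack_pixels(wide, 5, 8))
```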
Quote:
As it happens planar has no problem with different depths since you just blit the planes you need.

No. If the source has a different depth from the destination (for example: 3-bit source, 5-bit destination), then you have to blit the missing bitplanes anyway, otherwise you mess up the graphics.

Unless your framebuffer + CLUT is organized to implement some transparency effect using one or two bitplanes (two at most: going beyond that makes less sense, because you're spending the color palette only on the transparency effects).
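The depth-mismatch point can be shown with a tiny sketch: copying only the three source planes into a five-plane destination leaves stale bits in planes 3 and 4, which changes the colour index. The names `blit_planes` and `pixel` are hypothetical, and each plane is modelled as a single row of bytes with MSB-first pixels.

```python
def blit_planes(src_planes, dst_planes, clear_missing):
    """Copy source planes over the destination, plane by plane.

    When clear_missing is True, destination planes with no source
    counterpart are zeroed; otherwise they keep their old contents.
    """
    for n, dst in enumerate(dst_planes):
        if n < len(src_planes):
            dst[:] = src_planes[n]
        elif clear_missing:
            dst[:] = bytes(len(dst))


def pixel(planes, x):
    """Read the colour index of pixel x from a single-row planar bitmap."""
    bit = 0x80 >> (x & 7)
    return sum(1 << n for n, p in enumerate(planes) if p[x >> 3] & bit)


# 3-plane source whose first pixel is colour 5 (binary 101)
src = [bytearray([0x80]), bytearray([0x00]), bytearray([0x80])]

stale = [bytearray([0xFF]) for _ in range(5)]   # background colour 31
blit_planes(src, stale, clear_missing=False)

clean = [bytearray([0xFF]) for _ in range(5)]
blit_planes(src, clean, clear_missing=True)

print(pixel(stale, 0), pixel(clean, 0))
```

Only the second blit reproduces the source colour, at the cost of touching all five planes.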
Quote:
I tend to think, even if possible, that the benefit would not outweigh any practical advantage.

That's because you haven't written videogames on Amiga.
Quote:
If the extra logic could be used to support 8 bit colour, and it was more simple to do so, then I think it would be better and more practical than supporting some obscure widths.

We're talking about a system similar to the (original) Amiga but with packed graphic instead.

When the age of 8-bit pixel sizes came, we had more hardware resources (CPU, memory) and it was more evident that packed was the way to go.
Quote:
But, I'm not a chip designer, so the logic may be easier than I imagine it to be.

Me neither, but I've some idea. As I've already said, it's implementing the masking which is more complex with packed graphics: the rest is simpler.

@Karlos

Quote:

Karlos wrote:

The only thing I think that you can fairly test is the basic performance of software access to memory formatted as planes or packed pixels, which is what the gist at the beginning of this thread sets up for you. The challenge is to write a simple set-pixel routine for each of the two modes, fill a large area pixel by pixel and time both.

I'm all but certain that planar will be linearly slower than packed for this task as the depth increases.

And as the data bus size increases as well, wasting more bandwidth. And space, if you don't properly pack the graphics; but then the misalignment cases increase, so again wasting bandwidth.
Quote:
Obviously, block fills are a different matter and it should be possible to write routines for both that are basically equivalent in performance.

The more aligned to the data bus size the areas to be filled are, the more equivalent the two graphics formats are. But it's the opposite case (less aligned) which widens the gap in favor of packed.
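A back-of-the-envelope way to check the alignment claim is to count how many bus-width words a horizontal span touches in each format. This is only a sketch under simple assumptions: each row starts on a word boundary, every touched word costs one access (partially covered edge words would really need a read-modify-write), and the function names are hypothetical.

```python
def words_touched(first_bit, nbits, bus_bits):
    """Number of bus-width words covered by a run of nbits from first_bit."""
    if nbits == 0:
        return 0
    return (first_bit + nbits - 1) // bus_bits - first_bit // bus_bits + 1


def fill_accesses_planar(x, w, depth, bus_bits):
    """One 1-bit-per-pixel run per bitplane."""
    return depth * words_touched(x, w, bus_bits)


def fill_accesses_packed(x, w, depth, bus_bits):
    """A single run of depth bits per pixel."""
    return words_touched(x * depth, w * depth, bus_bits)


# (x, width) spans on one row, 5 bits per pixel, 32-bit bus
for x, w in [(0, 32), (3, 4), (30, 4)]:
    print((x, w),
          fill_accesses_planar(x, w, 5, 32),
          fill_accesses_packed(x, w, 5, 32))
```

With the fully aligned 32-pixel span the two counts are equal; with the short or misaligned spans the planar count is a multiple of the depth while the packed count stays small.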

@Hammer

Quote:

Hammer wrote:
@cdimauro

Quote:
Not relevant: see below.

It's relevant when the product is marketed as retro and software legacy.

Care to prove it?
Quote:
Quote:

That's why I've said that the Apollo's 68080 is more like Intel: it was Intel that reused the FPU to introduce its first SIMD unit, the MMX.

Intel reused 80-bit X87 registers for MMX 64-bit registers.

Ah, it was Intel: NOT AMD. Finally you admit it...
Quote:
From around 1994 with HP PA-RISC MAX-1 integer SIMD, HP design team opted for the elegant use of the existing facilities in the CPU, which were slightly modified to understand new, packed subword data.

For HP's POV, PA-RISC replaced their Motorola 680x0 Unix workstations.
For C='s POV, PA-RISC-based Amiga Hombre replaced Motorola 680x0 in the Amiga.

For Acorn's POV, ARM replaced C= CSG/MOS's 65xx CPUs.
ARM is a modern candidate to replace 680x0 ASIC in the "classic" Amiga via PiStorm/Emu68.

Again, irrelevant and useless: the Apollo team followed the Intel (INTEL) way for implementing its SIMD unit. Is it clear to you?
Quote:
Quote:

Besides that, the 68080 also introduced instructions fusion, which is another thing introduced by Intel (with the Banias micro-architecture).

Not relevant.

FACT: AC68080 does NOT have 68K's MMU.

Nobody asserted this.
Quote:
FACT: Intel did NOT remove the full-featured X87 for MMX.

FACT: the Apollo team followed the Intel way for its SIMD unit. Do you understand it?
Quote:
FACT: Intel/AMD did NOT remove the full-featured X86 MMU.

Not relevant.
Quote:
P6 Banias family had a fused-domain RS, as well as ROB, so micro-fusion helped increase the effective size of the out-of-order window. But SandyBridge family simplified the uop format making it more compact, allowing larger RS sizes that are helpful all the time, not just for micro-fused instructions.

Again, your nonsense padding.

FACT: the Apollo team implemented the same technology that Intel introduced. Do you understand this?
Quote:
Quote:
Not relevant: see above..

It's relevant when the product is marketed as retro and software legacy.

As above: care to prove the connection?
Quote:
Your micro-architecture implementation argument doesn't address the software interface boundary.

Maybe because I didn't intend to talk about it, but rather about the design decisions of the Apollo team for their 68080, which resemble Intel more than AMD?

Maybe some day you'll get it...
Quote:
Quote:

I don't agree, for the above reasons.

Fact: Apollo-Core is not the original 68K vendor and designer i.e. Motorola/Freescale.

Irrelevant: see above. You continue to miss the context and the reasons of my replies.
Quote:
Quote:

Redundant / useless.

Your arguments are useless.

I agree: it happens when I've to deal with people that don't understand them.
Quote:

You have started a flame war.

See above. Plus, it happens specifically with you when you want to put your beloved AMD where it isn't the case.
Quote:

Hammer wrote:
@cdimauro

Quote:

Performance isn't similar: you have to look at the video more carefully, and you'll see that the Amiga generates fewer FPS than the PC.

It's only a minor difference. AGA's Doom frame rates will continue to scale with a faster CPU.

Yes, and? I was talking SPECIFICALLY about the video that you posted.

Care to show the generated FPS for both systems?
Quote:
Quote:

It would have been good to run Doom's timedemo to see the effective framerate of both games (with the Amiga port which introduced some optimizations, BTW).

PC Doom has its own optimizations.

The Amiga version seems to have further optimizations, as was reported in the comments on the video (have you read them?).
Quote:
Quote:

P.S. I saw that you had the good sense to edit your comment and remove your previous statement (that the Amiga was able to do... full motion! LOL ).

Only for slower CPU 68030 context.

https://www.youtube.com/watch?v=fl-gYdkIXCk
Amiga 1200 with 68040 @33Mhz with AGA playing Doom.

https://youtu.be/BrYjAjDem1k?t=181
Doom (ADoom) playing on an Amiga 1200 with 68060.

And... in your opinion is it going to 25/30 FPS? Because you talked about FULL MOTION, right?
Quote:
When a faster CPU is available, C=- AGA dumb frame buffer is better than IBM VGA.

Again?!? This is your usual apples to oranges comparison: '92 technology vs '87 one.

Complete nonsense...
Quote:
IBM VGA is rubbish regardless of throwing K7 Athlon XP CPU at it.

Ah, yes. I'm wondering why you didn't mention the MDA...
Quote:
Amiga AGA is able to do full motion video.

See above for full motion.
Quote:
https://www.youtube.com/watch?v=Kqfbe-DUOKg
PiStorm emu68 Doom test with ECS's EHB (6 bitplanes) mode.

With a fast CPU, Amiga OCS/ECS is able to do full motion video with Doom.

And here again.

Now I want to see when you'll understand the context and give PROPER replies. Difficult to expect from you, as your forum history speaks for you...

Status: Offline

Hammer

Re: Packed Versus Planar: FIGHT
Posted on 10-Aug-2022 7:10:05

[ #128 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6503
From: Australia

@cdimauro

Quote:
Care to prove it?

1st example, the clusterfuk mess between Jim Drew (Fusion Mac) vs Gunnar. Gunnar's counterargument is to use Shapeshifter which modifies Apple's ROMs.

Disabling AC68080 V2's non-compliant Motorola 68K FPU wannabe clone can force Apple's scalar CPU-only code path.

2nd example, there's Lightwave's minor render difference between AC68080 V2's non-compliant Motorola 68K FPU results and genuine Motorola 68K FPU results.

Pentium FDIV bug has a minor result difference from X87's result and caused Intel to recall the product.

This is not apollo-core.com's biased forum: you can't hide, nor can you censor me.

3rd example, read https://eab.abime.net/showpost.php?p=1555083&postcount=28
Clusterfk for V2 owners.

There's a reason why I halted the V2 purchase.

Quote:

Again, irrelevant and useless: the Apollo team followed the Intel (INTEL) way for implementing its SIMD unit. Is it clear to you?

Apollo team didn't follow Intel's and AMD's X86/X87 legacy software preservation practices. Is it clear to you?

Quote:

FACT: the Apollo team implemented the same technology that Intel introduced. Do you understand this?

FACT: IBM PowerPC 970 also has instruction fusion, hence Intel doesn't have a monopoly on this design concept. The Power4 design on which the PowerPC 970 is based was released in 2001.

IBM PowerPC 970 was released in 2002.

Intel Pentium M (P6 Banias) was released 2003.

Quote:

Not relevant.

Apollo team didn't follow Intel's and AMD's X86/X87 legacy software preservation practices. Is it clear to you?

AC68080's wannabe boat-anchor SIMD extension is not relevant for 68K legacy preservation.

AMD and Intel can copy each other's SIMD extensions, hence fulfilling the second source requirement.

Motorola has DSP 56001, Apollo team didn't support this.

C= selected AT&T DSP3210, Apollo team didn't support this.

I rather support ARM NEON or RISC-V over AC68080's wannabe boat-anchor SIMD extensions.

As for the "Apollo team", Igor Majstorovic has aired the dirty laundry.

Quote:

Maybe because I wasn't intended to talk about it, rather about the design decisions of the Apollo team for their 68080, which resemble more Intel than AMD?

FACT: IBM PowerPC 970 also has instruction fusion, hence Intel doesn't have a monopoly on this design concept. The Power4 design on which the PowerPC 970 is based was released in 2001!

IBM PowerPC 970 was released in 2002.

Intel Pentium M (P6 Banias) was released 2003.

Quote:

See above. Plus, it happens specifically with you when you want to put your beloved AMD where it isn't the case.

Your assumption is wrong since I'm aware of IBM's PowerPC 970's instruction fusion concepts.

Don't assume.

Again, Apollo-core is a wannabe 68K cloner and it's NOT Motorola/Freescale. AMD moved from wannabe X86 cloner into X86-64 standards driver.

Don't pretend Apollo-core is in Intel's "genuine" position.

Apollo-core's custom SIMD extension is NOT relevant for 68K software legacy protection and serves as a distraction.

Quote:

Yes, and? I was talking SPECIFICALLY about the video that you posted.

https://www.youtube.com/watch?v=1B1jKjrRUmk

For Doom, performance experience is similar between A1200 with 68030 @ 50Mhz and AGA vs 386DX-40 with ET4000AX.

Quote:

Care to show the generated FPS for both systems?

The benchmark argument will produce different results from two different systems.

The poster for https://www.youtube.com/watch?v=1B1jKjrRUmk has stated very similar performance.

ET4000AX's Doom frame rates can continue to scale to around 33 fps with faster CPUs.

Last edited by Hammer on 10-Aug-2022 at 08:05 AM.
Last edited by Hammer on 10-Aug-2022 at 07:54 AM.
Last edited by Hammer on 10-Aug-2022 at 07:51 AM.
Last edited by Hammer on 10-Aug-2022 at 07:51 AM.
Last edited by Hammer on 10-Aug-2022 at 07:38 AM.
Last edited by Hammer on 10-Aug-2022 at 07:35 AM.
Last edited by Hammer on 10-Aug-2022 at 07:27 AM.
Last edited by Hammer on 10-Aug-2022 at 07:21 AM.
Last edited by Hammer on 10-Aug-2022 at 07:14 AM.
Last edited by Hammer on 10-Aug-2022 at 07:12 AM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

Status: Offline

Hypex

Re: Packed Versus Planar: FIGHT
Posted on 10-Aug-2022 16:21:25

[ #129 ]

Elite Member

Joined: 6-May-2007
Posts: 11351
From: Greensborough, Australia

@cdimauro

Quote:
It'll be closed once my article is published.

Which will be by next week hopefully, but we're still working to restart the tech blog. It's already done, BTW, albeit in Italian (because the blog currently publishes articles in that language), but you can use Google to translate it keeping the formatting, or deepl.com for a more accurate translation (text-only, unfortunately).

Speaking of which, I'll be posting an article of my own soon. Just on here. It needs some cleaning up. A bunch of text isn't that interesting, so I will collage some images so it can be visualised. It almost turned out to be an essay or mini thesis.

Quote:
This isn't needed for an packed-graphics (only) Amiga. Maybe I'll cover this on a new, more specific, article in the near future.

In my case it's for a planar and packed mode Amiga.

Quote:
An example would be beneficial.

This is all hypothetical, but something like the following copper list would simulate it, such as a user copper list with the OS setting up the exact values. In this case an extended BPLCON would be used to activate it while bitplane DMA is disabled.


CMOVE BPLEN, DMACON; Bitplane disable
CMOVE $8000, BPLCON2; Enable packed pass through mode (for example)
CWAIT 0,0; Wait for top left of display start
DB.B $00, $01, $02, $03, $04, $05, $06, $07; Packed data fed direct to internal serial CLUT indexer
...
CEND; Revert to copper again after all lines are read

But, I think it changes behaviour too much, to be a possible extension.

Quote:

BTW, for planar graphics I've covered the 3 most famous formats: Atari ST-like (per words interleaved), interleaved (per row), and "Amiga" (completely free: a pointer for each bitplane).

That's interesting to know. ST sounds similar to Amiga sprites. Seems to be a word theme with these computers.

Quote:

No duplication, because you don't have even or odd planes with packed graphics: you just have pointers to plane's data, and those planes & data are completely independent from each other.

This will be covered on the new article.

I will also show below an example.

Quote:

No. I mean that the Amiga requires to separate the bitplane pointers in even and odd to save / retrieve the data for each of the two different playfields. So, bitplane pointers #0, #2, #4 (and #6 on AGA) define the data for playfield #1 and #1, #3, #5 (and #7 on AGA) for playfield #2.

For one field it's not usually considered what's even and what's odd unless it affects other display elements.

Quote:

Whereas with packed data you just have bitplane pointer #0 which points to the data of playfield #1 and bitplane pointer #1 to the data of playfield #2.

Everything else is exactly the same.

But if you only wanted one field two fields shouldn't come into it.

However I can now demonstrate what I mean. Failing to find an example of it I found some old source of mine doing a user copper list so just modified it to show. I used FS-UAE for a screen shot so hope it's accurate enough.

Okay so here are two images. 2 bit depth. Palette is set to black, white, green, blue. A check pattern I set up in blue squares. Both using standard single field mode.

On top standard view.

On bottom I modify scroll offsets every 32 lines. At 0, 0/0; at 32, 4/0; at 64, 0/8; at 96, 4/8.

Quote:

And, as you can already see, packed graphics is much less complex (and more efficient, looking at the numbers).

What I can't see is how to duplicate the above trick, without using software rendering or using dual playfield, with a packed setup.

I'm programming the copper like this in my AmigaE code:


  CWAIT(myucoplist,0,0)
  CMOVEA(myucoplist,BPLCON1,$00)
  CWAIT(myucoplist,32,0)
  CMOVEA(myucoplist,BPLCON1,$40)
  CWAIT(myucoplist,64,0)
  CMOVEA(myucoplist,BPLCON1,$08)
  CWAIT(myucoplist,96,0)
  CMOVEA(myucoplist,BPLCON1,$48)
  CEND(myucoplist)
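For readers who want to map those BPLCON1 values back to the scroll offsets listed earlier, here is a small decoding sketch. It relies on the OCS/ECS register layout, where bits 7-4 hold PF2H and bits 3-0 hold PF1H; in single-playfield mode PF1H is understood to delay the odd-numbered bitplanes (BPL1, BPL3, ...) and PF2H the even-numbered ones. The function name is hypothetical, and the "a/b" notation in the post is read here as PF2H/PF1H, which is an assumption.

```python
def decode_bplcon1(value):
    """Split an OCS/ECS BPLCON1 value into its two 4-bit scroll fields."""
    return {"PF2H": (value >> 4) & 0xF, "PF1H": value & 0xF}


# (first scanline, BPLCON1 value) pairs taken from the copper list above
copper_moves = [(0, 0x00), (32, 0x40), (64, 0x08), (96, 0x48)]

for line, value in copper_moves:
    fields = decode_bplcon1(value)
    print(f"from line {line:3}: PF2H={fields['PF2H']} PF1H={fields['PF1H']}")
```

With a 2-bit-deep screen, that means one bitplane of the check pattern is shifted independently of the other, which is the effect the second screenshot is described as showing.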

Quote:

If you need code (I don't: math is enough, as I've already said), you can start writing it.

I thought I would provide code so you could see how it's generated.

Quote:

Explained in the article, and is tightly bounded to the above mentioned horizontal scrolling (think about it, and you'll see why "odd widths" aren't really odd).

It can just as easily be scrolled across.

Quote:

No, you can have much more flexibility with packed graphics: see my previous comment on that specific point.

Just to clarify I was thinking about chipsets like VGA that have no dual playfield modes.

Quote:

It can be, if you're computing the final framebuffer taking the data from the two playfields and combining them.

Like above, but I mean by using hardware scrolling. With only one layer that can be scrolled it needs to be computed. Or do it the C64 way which can do parallax scrolling without dual layers. Or maybe the C16 which doesn't have sprites, so more comparable to VGA which also doesn't have sprites and also has hardware scrolling.

Quote:

Correct: this is an easy and very cheap way to implement a dual playfield screen on a system which only has a 8-bit packed mode.

However you're wasting a lot of colors.

Yes, that's a limitation of doing it. But, in hardware, AGA also lost colours in 16+16 dual layer mode. It's possible, but I don't know of any games that just used the blitter to blit scrolling layers into the bitmap so that multiple layers with more colours could be displayed.
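One plausible reading of the "cheap dual playfield on an 8-bit packed mode" trick is to give each layer one nibble of every pixel and let the palette decide priority. The sketch below builds such a palette; the nibble split, the function name and the rule that front colour 0 is transparent are all assumptions for illustration, not something stated in the thread.

```python
def build_dual_playfield_palette(front, back):
    """Build a 256-entry palette for pixels laid out as (front << 4) | back.

    front, back: 16 colours each. Front colour 0 is treated as transparent,
    so the back layer shows through wherever the high nibble is zero.
    """
    assert len(front) == 16 and len(back) == 16
    palette = []
    for index in range(256):
        f, b = index >> 4, index & 0xF
        palette.append(front[f] if f != 0 else back[b])
    return palette


# Hypothetical colour ramps: a red front layer and a green back layer
front = [(i * 17, 0, 0) for i in range(16)]
back = [(0, i * 17, 1) for i in range(16)]
palette = build_dual_playfield_palette(front, back)

print(len(palette), len(set(palette)))
```

All 256 palette entries are consumed, yet only 15 front colours plus 16 back colours can ever appear on screen, which is the colour waste being discussed. Moving one layer independently would still mean rewriting its nibbles in software.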

Quote:

Indeed, and I think that this was the way that it was implemented in PC games with multiple overlapping playfields.

That's good to know. I didn't think it would be that efficient. But when you are working on hardware with only one layer and need multiple layers there's not much choice other than a brute force method.

Quote:

WOW! Finally you did it: got how things go with packed graphics. Then I agree with almost everything.

Lol. I actually calculated this and listed some figures in the other thread but I got out the calculator to check again.

Quote:

Only one thing about masking: it can be done exactly like with planar, but implementing it requires a bit more logic (some extra work on pipeline stage to properly prepare the mask according to the source or sources channels' pixel size or sizes).

Given blitters were around for packed data, they would have worked this out in other hardware. But on the Amiga it relied on the bitmask. I'm sure packed could have avoided that, with zero being transparent and all else opaque.
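A small sketch of the "zero is transparent" idea for packed data: the mask is derived from the source word itself, one pixel-sized lane at a time, and then used for an ordinary cookie-cut merge. The function names are hypothetical, and the pixel depth is assumed to divide the word size evenly.

```python
def transparency_mask(word, depth, word_bits=32):
    """Return a mask with all bits set in every lane whose pixel is non-zero."""
    if word_bits % depth:
        raise ValueError("depth must divide the word size in this sketch")
    lane = (1 << depth) - 1
    mask = 0
    for shift in range(0, word_bits, depth):
        if (word >> shift) & lane:
            mask |= lane << shift
    return mask


def blit_word(dst, src, depth, word_bits=32):
    """Cookie-cut merge: source pixels replace destination unless they are 0."""
    mask = transparency_mask(src, depth, word_bits)
    return (dst & ~mask) | (src & mask)


src = 0x0A0005F0        # eight 4-bit pixels, zeros are transparent
dst = 0x11111111        # background of colour 1
print(hex(transparency_mask(src, 4)), hex(blit_word(dst, src, 4)))
```

The per-lane test is the extra step compared with planar, where a separate 1-bit mask plane serves every bitplane unchanged.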

Quote:

It would have been good to run Doom's timedemo to see the effective framerate of both games (with the Amiga port which introduced some optimizations, BTW).

I'm compiling some test results at present. But it's time consuming. Seems to be there for ten minutes or more doing it.

Status: Offline

matthey

Re: Packed Versus Planar: FIGHT
Posted on 10-Aug-2022 18:16:16

[ #130 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2748
From: Kansas

cdimauro Quote:

As a game developer, I don't agree: it wasn't fast memory which was required, but chip mem. ONLY chip mem. This would have helped A LOT on the vast majority of cases, since the CPU is often used just to load the chipset registers to start some operations (specifically: setting up the Blitter).

The size of chip memory was a limitation of some games depending on the complexity. However, fast memory improved total system bandwidth and more chip memory bandwidth remained. The caches of later 68k processors reduced memory accesses also leaving more system and chip bandwidth. Even the little 256 byte ICache of the 68020 combined with the very good code density of the 68k removes most instruction memory traffic as most code loops fit in 256 bytes.

cdimauro Quote:

The lack of memory bandwidth is ALSO due to the unlucky choice of planar instead of chunky graphics.

My article will show it with numbers in hand, since it specifically targets and reports how many read and/or write accesses to memory are required for the display controller or for some graphic primitives.
I've only considered the case with 8 colors = 3 bits per pixel to keep the article simple (it's already 40kB of text, as I've said), but the analysis could easily be redone for any pixel size.
I've also primarily considered a system with an 8-bit data bus, but sometimes I've reported analysis for 16, 32, and 64-bit data bus sizes, to show how the analysis and numbers scale (in favor of packed graphics, of course: the wider the data access granularity, the bigger the benefits of this format).

I can't think of many cases where planar would use fewer memory accesses and some graphics primitives like writing a single pixel are horrible. Your article should be interesting.

cdimauro Quote:

See above: games required more bandwidth, and Commodore completely failed to give it since the better memory bandwidth (compared to OCS/ECS) was only reserved to the display controller.

The main problem with AGA is that the Blitter was left EXACTLY the same, so handling only 16 bits at a time, instead of 32 or 64 bits. The ONLY advantage of AGA in this case is that using screens with 32 or 64-bit memory access freed more memory slots compared to OCS/ECS, so the Blitter "automatically gained" more memory accesses and bandwidth, but it wasn't enough to sustain certain loads.

However, I have to say in its defense that using data bus sizes of 32 and 64 bits could have helped on some workloads, but would have made some other cases worse, greatly wasting memory and/or memory bandwidth. That's because with planar the exact opposite of packed graphics happens: the wider the data bus, the more inefficient this format is (see above). This is also explained in my article.

Yeah, no blitter improvement until AA+, and then it was minor and kind of an afterthought, especially considering 16-bit chunky arrived with it and more powerful 68k CPUs were being used that had competitive if not superior blitting performance.

cdimauro Quote:

Previously you also said that chunky is cheap to implement: why this regression here?

It depends on the context. Packed/chunky would have been cheap to implement in custom chips in the 1990s and would be practically free today but in 1985 every transistor was important. I'm not saying planar was a better choice but it could have saved a few valuable transistors or a few kiB of very expensive memory which explains the choice. Performance was not as important as constraints when the Amiga custom chips were developed and the choice needs to be evaluated in this context. Engineering design is full of compromises.

cdimauro Quote:

That's incorrect. CGA and MDA already had some flexibility. However EGA introduced A LOT more flexibility and programmability.

VGA was EGA-compatible, so inherited all of them, and "just" added something more (8-bit packed graphics, 18-bit color palette, more bandwidth, more memory).

The Amiga custom chips also retained compatibility and became more flexible. Early graphics standards were often less flexible to save a few transistors which was very important then but changed quickly. Sadly, the PC compatible world advanced quicker than the Amiga which started out ahead in technology.

cdimauro Quote:

SIMD/Vector makes sense only for 8, 16, 24 (more difficult) and 32-bit pixel sizes. It's much more difficult to use such units for less than 8 bits pixel sizes.

I disagree. The narrower the integer datatypes, the more SIMD calculations can be performed per instruction thus improving efficiency. There are diminishing returns as narrower integer datatypes become less useful and supporting them uses up encoding space. The 4 bit datatype has become popular for AI because it allows more calculations per instruction or cycle as Hammer showed.

cdimauro Quote:

Indeed. The 68060 was too late. Before that the CPU could be used in some scenarios to replace the Blitter, but not in the most complicated ones.

Amiga users were using software blitting starting with the 68020 but it certainly became more advantageous as the performance of the CPU increased.

cdimauro Quote:

This is a logical fallacy, similar to the Argumentum ad populum.

No, something doesn't need to exist for you to prove whether it's good or not.

To me the lack of current implementations of odd pixel sizes is probably due to a lack of creativity / thinking outside the box: computer scientists and programmers are used to powers of two, so they think that "this is the (only) way" (TM).

My article gives some details about the implementation of the display controller (specifically), and some graphic primitives. This could give some input to someone who actually implements it, if people are demanding something "concrete".

To me math is and should be enough, and I'll give proper analysis and numbers.

I was not trying to prove anything. It was just an observation that the natural entropy of design choices tends toward good choices and away from bad ones, minus the copycat effect. I like to examine, and re-examine, technology choices and older technologies which were discarded due to copycat technology, unfair research, political reasons and popularity.

Status: Offline

matthey

Re: Packed Versus Planar: FIGHT
Posted on 10-Aug-2022 20:23:55

[ #131 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2748
From: Kansas

Hypex Quote:

It wasn't for low end but my reply was for a higher end model Amiga. Still, it has some interesting features. It was created around the same time as the Amiga was and in some respects looks superior with what it offers. Makes me wonder if the Amiga would have turned out better using these chips than the effort custom designing what it had?

The TI TMS34010 chip had more advanced graphics capabilities than the Amiga custom chips when it came out. It was higher end and cost significantly more. It could have been used on a Zorro card like the A2410 graphics card with RTG to provide the Amiga with a high end graphics solution. The Zorro II bus bottlenecked the memory bandwidth and even the bugs in Zorro III would have limited the performance. CBM could have licensed the technology from TI and integrated it eliminating the cost of a high performance bus as TI was trying to license the technology to be used in the console market. CBM's vision was to reduce the cost of the Amiga into a C64 and they didn't seem interested in producing a high performance Amiga. The high end Amiga hardware market was left for 3rd party companies that could not achieve the economies of scale to be competitive.

Hypex Quote:

The Amiga was no C64. Apart from an Amiga not being a real Commodore in the C64 sense, it was too expensive. At the time of the Amiga they had just realised the hard work of the C128 that tried to be a C64. The Amiga didn't make sense compared to that.

The Amiga 500 was cheaper to manufacture than the C128. It was the C128, CDTV and Amiga 600 that didn't make sense. These "mistakes" were too expensive and divided the market between CBM products instead of fewer products with more margin increasing economies of scale.

Hypex Quote:

I've seen examples of this and it does work. But, without a copper, would be less efficient I imagine. Also, VGA could do raster interrupts, but that's a rather primitive method. So can a C16 and not many thought that was special doing screen splits with BASIC commands. OS4 builds on the P96 screen dragging but it would be blitted somehow. I suppose I don't think anything can compare to live copper effects and rendering to a back buffer or screen just isn't the same.

As I recall from ThoR, AmigaOS 4 uses compositing/overlay support to allow screen dragging but this required too much CPU performance for AmigaOS 3.

Hypex Quote:

I read they did this on Apollo. Of course, they still have to support the blitter in hardware, but did they increase the speed? Even so, using a CPU for blit operations still looks backwards to me. I mean, I would compare it with software 3d like Doom against hardware 3d. I just like dedicated hardware.

Dedicated hardware, like a blitter or DSP, allows the use of a cheaper CPU and lowers the hardware cost. However, upgrading the CPU performance and capabilities gives more performance all the time for general purpose use and not just when blitting or doing DSP workloads. Using a thread for blitting uses CPU performance that would normally be wasted during stalls.

Hypex Quote:

Also, there were plans for a blitter per plane. Parallel sounds good but did it include speeding up? I tend to think one operation that blit all planes would be useful. Suppose parallel would have covered that.

I believe the blitter per plane was a rumor (from Dave Haynie comment?) and likely never seriously considered. The big obstacle is a memory access per parallel blitter. The memory could be banked (or have separate memory controllers) but for each access to fall within a memory range corresponding to that bank requires rigid resolutions as some PC graphics hardware used. Another option would be to divide chip memory into different banks with separate memory pools and allocate each bitplane of a bitmap in a different bank but this would have created more fragmentation and reduced the amount of available chip memory. Any option like this would have likely been prohibitively expensive as well. The blitter ALU work time is short for a simple blitter that is clocked high so I believe a pipelined blitter with pipelined memory accesses makes more sense.

Hypex Quote:

Yes, the problem is, including Commodore producing it, the 68020 was more expensive. I still like the idea of the blitter, though it's more suited to large block transfers. It also does line drawing. But I don't know if bitfields would have been speedy enough. I was doing some bitfield testing on my A4000/060 recently and found them useful but slow. I ended up changing back to loading and shifting which worked faster. Still, bitfields are said to be how Virtual GP could do fast texture mapping with planar graphics, so perhaps I did something wrong.

I couldn't find the cost of a 68020 in 1985 but one source said it would cost $150 for the CPU while hardware expense for a full 32 bit CPU would increase much more than this. It still could have been worthwhile considering how many times the performance would increase and the much better shift and new bitfield instructions which could have replaced the blitter. Software blitting is often used on 68020+ Amigas so I would assume the performance is adequate on a 68020.

The 68060 can usually do a mask and shift in fewer cycles than using a bitfield instruction so code will usually be faster when not using them. The bitfield instructions do improve code density and the 68060 is sensitive to large code so it is beneficial to use them when not in performance optimized loops but compilers are often not smart and the best performance for 68060 optimized code is to turn all bitfields off. The 68040 is the opposite as bitfield instructions take the fewest cycles relative to shifts so they should always be used in 68040 optimized code. The 68020 and 68030 are between the 68040 and 68060 and it is usually better to use bitfield instructions there for better code density though the performance difference isn't going to be much.
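To make the "mask and shift" alternative concrete, here is a minimal sketch of what an unsigned bitfield extract computes on a 32-bit value. The helper name is hypothetical; the only detail taken from the 68020 is that the field offset counts from the most significant bit. Fields that wrap around the register are not handled.

```python
def bfextu(value, offset, width):
    """Unsigned bit field extract from a 32-bit value using shift and mask.

    offset counts from the most significant bit (bit 31), as the 68020
    bit field instructions do. Wrapping fields are rejected in this sketch.
    """
    if not (0 <= offset < 32 and 1 <= width <= 32 - offset):
        raise ValueError("field must lie inside the 32-bit value")
    shift = 32 - offset - width            # distance of the field from bit 0
    return (value >> shift) & ((1 << width) - 1)

print(hex(bfextu(0x12345678, 8, 8)))       # second byte from the top
```

The shift and the mask are the two operations a compiler would emit when bitfield instructions are turned off.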

As part of the Apollo team, we looked closely at bitfield instructions. With Virtual GP then, Dungeon Master uses bitfield instructions often but many games don't use them at all. Some compilers generate bitfield instructions very often while others not at all. The surprising amount of compiler generated bitfield instructions shows how well compilers can use them and how general purpose they are which is important for deciding whether they are worthwhile. They improve code density modestly as well. The final factor of whether they should be included is whether they can be optimized to few enough cycles and indeed they can. Perhaps the 68060 was trying to do away with them by not optimizing them. The mask and shift method would sometimes be as fast but the bitfield instructions usually give better code density. Gunnar wanted to trap them for the Apollo core but Meynaf and I managed to convince him to include them and he even optimized them for an Apollo core advantage. That may have been the only debate Meynaf and I won against Gunnar where Gunnar actually seemed to change his mind.

Hypex Quote:

Not that I know of but I was referring to display controller first and blitter secondary. In any case misaligned pixels should pose no problem. Just mask out any edges and fill it all in between. If the source data is off alignment just shift it like in planar. The only real problem I see would be if the source is a different pixel width. In which case it would need to scale the pixels and spend time packing them in which is 3d territory. As it happens planar has no problem with different depths since you just blit the planes you need.

I tend to think, even if possible, that the benefit would not outweigh any practical advantage. If the extra logic could be used to support 8 bit colour, and it was more simple to do so, then I think it would be better and more practical than supporting some obscure widths. But, I'm not a chip designer, so the logic may be easier than I imagine it to be.

RISC philosophy was to eliminate hardware support for misaligned memory accesses in the CPU and most early RISC architectures didn't support it (PPC was one of the first RISC architectures to optionally but often support them in big endian mode). Today, most RISC architectures have adopted this CISC like feature like so many other useful CISC associated features. Yes, even a blitter could handle misaligned memory accesses and I wouldn't be surprised if it was actually worthwhile despite taking a few more transistors.
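The "mask out any edges and fill it all in between" idea can be sketched as follows. This assumes 32-bit words, pixel 0 in the most significant bits of each word, and a pixel size that divides 32; none of that is from the thread, and `fill_span` is a made-up name.

```python
WORD_BITS = 32
FULL = 0xFFFFFFFF

def fill_span(row, x0, x1, colour, bpp=4):
    """Fill pixels x0 .. x1-1 of a packed row (list of 32-bit words)."""
    if x1 <= x0:
        return
    per_word = WORD_BITS // bpp
    pattern = 0
    for _ in range(per_word):              # replicate the colour across a word
        pattern = (pattern << bpp) | colour
    first, last = x0 // per_word, (x1 - 1) // per_word
    for w in range(first, last + 1):
        mask = FULL
        if w == first:                     # left edge: drop pixels before x0
            mask &= FULL >> ((x0 % per_word) * bpp)
        if w == last:                      # right edge: drop pixels from x1 on
            keep = ((x1 - 1) % per_word + 1) * bpp
            mask &= (FULL << (WORD_BITS - keep)) & FULL
        row[w] = (row[w] & ~mask & FULL) | (pattern & mask)

row = [0, 0, 0]
fill_span(row, 3, 18, 0xA)
print([hex(w) for w in row])
```

With `bpp=1` the same routine fills a span in a single bitplane, so the edge handling is the same for both formats; only the word pattern differs.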

Status: Offline

cdimauro

Re: Packed Versus Planar: FIGHT
Posted on 10-Aug-2022 20:53:36

[ #132 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4438
From: Germany

@Hammer

Quote:

Hammer wrote:
@cdimauro

Quote:
Care to prove it?

1st example, the clusterfuk mess between Jim Drew (Fusion Mac) vs Gunnar. Gunnar's counterargument is to use Shapeshifter which modifies Apple's ROMs.

Disabling AC68080 V2's non-compliant Motorola 68K FPU wannabe clone can force Apple's scalar CPU-only code path.

2nd example, there's Lightwave's minor render difference AC68080 V2's non-compliant Motorola 68K FPU vs genuine Motorola 68K FPU results.

Then please tell me about Motorola's decisions to remove FPU instructions from the 68040 and 68060: those were DESIGN decisions, which affected the existing software.

And it was NOT a retro market here.
Quote:
Pentium FDIV bug has a minor result difference from X87's result and caused Intel to recall the product.

This is a bug: NOT a design decision. That's why Intel decided to recall the processors.

Something which did NOT happen with Motorola and its processors which were missing FPU instructions.
Quote:
This is not apollo-core.com's bias forum, you can't hide, nor you can censor me.

LOL Please, tell me how I can censor something which is public and, more important, WHY should I attempt something stupid like that.
Quote:
3rd example, read https://eab.abime.net/showpost.php?p=1555083&postcount=28
Clusterfk for V2 owners.

There's a reason why I halted the V2 purchase.

But nobody stopped you when you bought a Motorola processor which was lacking instructions (hence: incompatible with some existing software)...
Quote:
Quote:

Again, irrelevant and useless: the Apollo team followed the Intel (INTEL) way for implementing its SIMD unit. Is it clear to you?

Apollo team didn't follow Intel's and AMD's X86/X87 legacy software preservation practices. Is it clear to you?

Neither did Motorola: see above.

Besides that, you continue to talk about the FPU when I clearly mentioned 68080's SIMD unit. Since the beginning.

This is a Red Herring: another logical fallacy...
Quote:
Quote:

FACT: the Apollo team implemented the same technology that Intel introduced. Do you understand this?

FACT: IBM PowerPC 970 also has instruction fusion hence Intel doesn't have a monopoly on this design concept. PowerPC 970's Power4-based design is released in 2001.

IBM PowerPC 970 was released in 2002.

Intel Pentium M (P6 Banias) was released 2003.

If it's a fact (for you) then you can provide sources for your claims.

If you find them, because you might have confused POWER4 / PowerPC 970 instruction cracking with fusion. But cracking is the exact opposite of fusion.

Or, you might have confused their FMA instructions with instruction fusion. But, again, those are completely different things.

Anyway, it isn't my problem: the thesis is yours, and you are the one who has to prove it.
Quote:
Quote:

Not relevant.

Apollo team didn't follow Intel's and AMD's X86/X87 legacy software preservation practices. Is it clear to you?

See above.
Quote:
AC68080's wannabe boat-anchor SIMD extension is not relevant for 68K legacy preservation.

And here you clearly don't know the Amiga as a platform, nor how Apollo's AMMX is used.

You seem to follow the Apollo forum, but you look quite ignorant about what was written there.

For your information (since you don't know it) some Amiga software also uses RTG. On Apollo the "packed" Blitter which accelerates the RTG graphic operations is performed by... drum roll... the second 68080 thread using the AMMX unit.

So, as you can see, the "AC68080's wannabe boat-anchor SIMD extension" IS relevant for 68K legacy preservation, dear ignorant.
Quote:
AMD and Intel can copy each other's SIMD extensions, hence fulfilling the second source requirement.

Oh, yes. But you're again and desperately changing the topic.

Apollo's 68080 has AMMX and instructions fusion. Who designed the MMX? Intel or AMD? Who implemented instructions fusion? Intel or AMD?

Care to give some answer, Mr. Fonzarelli: https://www.youtube.com/watch?v=WkqgDoo_eZE ?
Quote:
Motorola has DSP 56001, Apollo team didn't support this.

C= selected AT&T DSP3210, Apollo team didn't support this.

See above: Motorola didn't even support its own processors, cutting FPU, MMU, supervisor, and even user mode instructions (!!!), changing the exception formats, and changing the MMU for every processor, but you complain about Apollo not having implemented two coprocessors that nobody used on the Amiga.

Don't you think that you're looking so ridiculous? Even more ridiculous, considering that the Apollo CPU was/is designed to support the Amiga software?
Quote:
I rather support ARM NEON or RISC-V over AC68080's wannabe boat-anchor SIMD extensions.

See above, dear ignorant. Better that you shut up instead of talking about things you have no clue about.
Quote:
As for the "Apollo team", Igor Majstorovic has aired the dirty laundry.

And? Other padding, as usual...
Quote:
Quote:
Maybe because I didn't intend to talk about it, but rather about the design decisions of the Apollo team for their 68080, which resemble Intel more than AMD?

FACT: IBM PowerPC 970 also has instruction fusion hence Intel doesn't have a monopoly on this design concept. PowerPC 970's Power4-based design is released in 2001!

IBM PowerPC 970 was released in 2002.

Intel Pentium M (P6 Banias) was released 2003.

Don't repeat things, parrot: to me one time is enough to understand.

Anyway, see above: I'm waiting for the proof of your statement on that topic. Don't let me grow old...
Quote:
Quote:

See above. Plus, it happens specifically with you when you want to put your beloved AMD where it isn't the case.

Your assumption is wrong since I'm aware of IBM's PowerPC 970's instruction fusion concepts.

Then prove it!
Quote:
Don't assume.

I assume because it's exactly your pattern. You behave like Pavlov's dog when talking about x86 processors, always trying to put AMD in good light, and Intel in bad light.

Specifically, I quote again you:

"Apollo-core's AC68080 attempted to be "AMD" for the 68K family"

Which is clearly false, since the most important technologies implemented on the 68080 are the SIMD unit and instructions fusion (which allows this processor to execute up to 4 instructions per clock). And both were designed by Intel. So, the statement should have been this, instead, as I've said:

"Actually Apollo's 68080 is attempted to be more like Intel instead of AMD, due to the design decisions."
Quote:
Again, Apollo-core is a wannabe 68K cloner and it's NOT Motorola/Freescale.

Who stated the contrary?
Quote:
AMD moved from wannabe X86 cloner into X86-64 standards driver.

Another padding... you like reporting useless stuff.
Quote:
Don't pretend Apollo-core being Intel's "genuine" position.

I do, because of the technologies that they implemented, as I've already explained several times, and you don't want to read or don't understand.
Quote:
Apollo-core's custom SIMD extension is NOT relevant for 68K software legacy protection and serves as a distraction.

See above: keep silence, ignorant!
Quote:
Quote:

Yes, and? I was talking SPECIFICALLY about the video that you posted.

https://www.youtube.com/watch?v=1B1jKjrRUmk

For Doom, performance experience is similar between A1200 with 68030 @ 50Mhz and AGA vs 386DX-40 with ET4000AX.

Quote:

Care to show the generated FPS for both systems?

The benchmark argument will produce different results from two different systems.

The poster for https://www.youtube.com/watch?v=1B1jKjrRUmk has stated very similar performance.

ET4000AX's Doom frame rates can continue to scale to around 33 fps with faster CPUs.

So, you aren't able to judge for yourself how the Amiga Doom is working compared to the PC one, and you have to resort to the comment of someone else.

Well, I'm not superman and I don't have his sight, but I can see the differences. Better that you ask for a good pair of glasses...

Status: Offline

cdimauro

Re: Packed Versus Planar: FIGHT
Posted on 10-Aug-2022 21:09:50

[ #133 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4438
From: Germany

@Hypex

Quote:

Hypex wrote:
@cdimauro

Quote:
Speaking off, I'll be posting an article of my own soon. Just on here. Needs some cleaning up. A bunch of text isn't that interesting so I will collage some images so it can be visualised. It almost turned out to be an essay or mini thesis.

It looks like the period. What's the topic?
Quote:
Quote:
An example would be beneficial.

This is all hypothesis but something like the following copper list simulating it. Such as user copper list with OS setting up exact values. In this case an extended BPLCON would be used to activate it while bitplane DMA is disabled.

CMOVE BPLEN, DMACON; Bitplane disable
CMOVE $8000, BPLCON2; Enable packed pass through mode (for example)
CWAIT 0,0; Wait for top left of display start
DB.B $00, $01, $02, $03, $04, $05, $06, $07; Packed data fed direct to internal serial CLUT indexer
...
CEND; Revert to copper again after all lines are read

But, I think it changes behaviour too much, to be a possible extension.

To me it looks super-complicated and inefficient.
Quote:
Quote:
Whereas with packed data you just have bitplane pointer #0 which points to the data of playfield #1 and bitplane pointer #1 to the data of playfield #2.

Everything else is exactly the same.

But if you only wanted one field two fields shouldn't come into it.

If you have one playfield then only bitplane pointer #0 is used on packed.

It's only when you need more playfields that more bitplane pointers are needed.
Quote:
However I can now demonstrate what I mean. Failing to find an example of it I found some old source of mine doing a user copper list so just modified it to show. I used FS-UAE for a screen shot so hope it's accurate enough.

Okay so here are two images. 2 bit depth. Palette is set to black, white, green, blue. A check pattern I set up in blue squares. Both using standard single field mode.

On top standard view.

On bottom I modify scroll offsets every 32 lines. At 0, 0/0; at 32, 4/0; at 64, 0/8; at 96, 4/8.

Quote:
And, as you can already see, packed graphics is much less complex (and more efficient, looking at the numbers).

What I can't see is how to duplicate the above trick, without using software rendering or using dual playfield, with a packed setup.

I'm programming the copper like this in my AmigaE code:

CWAIT(myucoplist,0,0)
CMOVEA(myucoplist,BPLCON1,$00)
CWAIT(myucoplist,32,0)
CMOVEA(myucoplist,BPLCON1,$40)
CWAIT(myucoplist,64,0)
CMOVEA(myucoplist,BPLCON1,$08)
CWAIT(myucoplist,96,0)
CMOVEA(myucoplist,BPLCON1,$48)
CEND(myucoplist)

Honestly, I don't see the problem: you can change the offset as well with packed graphics.

As I said before, there's absolutely no difference with the planar graphics, besides when you access / modify just single bitplanes.

Changes which affect all bitplanes in planar graphics can be done exactly the same with the packed one.
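For reference, the BPLCON1 values in the copper list above can be decoded with a few lines. This sketch assumes the OCS register layout as I understand it (bits 3-0 are the playfield 1 delay for the odd bitplanes, bits 7-4 are the playfield 2 delay for the even bitplanes); the function names are made up.

```python
def bplcon1(pf1_scroll, pf2_scroll):
    """Build an OCS-style BPLCON1 value from two 4-bit scroll delays."""
    return ((pf2_scroll & 0xF) << 4) | (pf1_scroll & 0xF)

def decode_bplcon1(value):
    """Return (pf1_scroll, pf2_scroll) from a BPLCON1 value."""
    return value & 0xF, (value >> 4) & 0xF

# The four values written by the copper list at lines 0, 32, 64 and 96.
for line, value in ((0, 0x00), (32, 0x40), (64, 0x08), (96, 0x48)):
    pf1, pf2 = decode_bplcon1(value)
    print(f"line {line}: odd planes delayed {pf1}, even planes delayed {pf2}")
```

Under that layout, in a 2 bit deep single playfield screen the two delays apply to bitplane 1 and bitplane 2 separately, so each written value moves one plane relative to the other.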
Quote:
Quote:
If you need code (I don't: math is enough, as I've already said), you can start writing it.

I thought I would provide code so you could see how it's generated.

Nice. Thanks.
Quote:
Quote:
No, you can have much more flexibility with packed graphics: see my previous comment on that specific point.

Just to clarify I was thinking about chipsets like VGA that have no dual playfield modes.

Yes, understood.
Quote:
Quote:
It can be, if you're computing the final framebuffer taking the data from the two playfields and combining them.

Like above, but I mean by using hardware scrolling. With only one layer that can be scrolled it needs to be computed. Or do it the C64 way which can do parallax scrolling without dual layers. Or maybe the C16 which doesn't have sprites, so more comparable to VGA which also doesn't have sprites and also has hardware scrolling.

VGA has hardware scrolling, but for the only playfield / framebuffer.

If you want to simulate two playfields with the VGA, each with its own independent scrolling, then you have to "compose" the framebuffer by properly reading the data of each playfield. Which means: taking their scrolling into account when you start reading their data.
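A minimal sketch of that composition for one scanline, assuming 8-bit packed pixels, color 0 as transparent in the front playfield, and playfields that wrap horizontally (all assumptions for illustration, not a description of any particular game):

```python
def compose_line(front, back, front_scroll, back_scroll, width):
    """Build one framebuffer line from two independently scrolled playfields."""
    out = []
    for x in range(width):
        f = front[(x + front_scroll) % len(front)]   # apply each scroll offset
        b = back[(x + back_scroll) % len(back)]      # when reading the source
        out.append(f if f != 0 else b)               # color 0 shows the back
    return out

front = [0, 5, 0, 7]
back = [1, 2, 3, 4]
print(compose_line(front, back, 1, 0, 4))
```

Scrolling either playfield only changes where reading starts; the cost is that every visible pixel is recomputed by the CPU each frame.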
Quote:
Quote:
Indeed, and I think that this was the way that it was implemented in PC games with multiple overlapping playfields.

That's good to know. I didn't think it would be that efficient. But when you are working on hardware with only one layer and need multiple layers there's not much choice than a brute force method.

Ehm... there's another choice: compiled sprites.

On PC with VGA and 8-bit packed graphics you could have code executed when you need to display a specific sprite. Why? Because you can completely avoid masking (e.g. checking if color was 0, so don't display the sprite's graphic).
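Real compiled sprites are generated machine code, but the idea can be sketched in a high-level way: do the transparency test once, when the sprite is "compiled", and keep only a list of unconditional writes. Names and the flat framebuffer layout are assumptions for illustration.

```python
def compile_sprite(pixels, sprite_width, screen_width):
    """Turn a sprite into a draw routine with no per-pixel color 0 test."""
    writes = []
    for i, colour in enumerate(pixels):
        if colour != 0:                      # transparency decided here, once
            row, col = divmod(i, sprite_width)
            writes.append((row * screen_width + col, colour))

    def draw(screen, x, y):
        base = y * screen_width + x
        for offset, colour in writes:        # plain stores, nothing to check
            screen[base + offset] = colour
    return draw

screen = bytearray(4 * 3)                    # 4x3 screen, 8-bit packed pixels
draw = compile_sprite([0, 9, 9, 0], sprite_width=2, screen_width=4)
draw(screen, 1, 1)
print(list(screen))
```

The trade-off is memory: each sprite frame becomes its own routine instead of a block of pixel data.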
Quote:
Quote:
Only one thing about masking: it can be done exactly like with planar, but implementing it requires a bit more logic (some extra work on pipeline stage to properly prepare the mask according to the source or sources channels' pixel size or sizes).

Given blitters were around for packed data they would have worked this out in other hardware. But, on Amiga it relied on the bitmask. I'm sure with packed that could have avoided that with zero being transparent and all else opaque.

Yes, with packed graphics the mask can be calculated on-the-fly, by checking for color 0. So, you don't need extra data only for the mask.

However this further complicates the Blitter.
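To illustrate the extra work, here is a sketch of deriving the mask from the source data itself for 4-bit packed pixels in a 32-bit word. It is a software analogy only; how a real Blitter stage would do this is not described in the thread.

```python
FULL = 0xFFFFFFFF

def transparency_mask(word):
    """Return 0xF in every nibble whose pixel value is not color 0."""
    t = word | (word >> 1)        # fold bit 1 into bit 0, bit 3 into bit 2
    t |= t >> 2                   # bit 0 of each nibble = OR of its four bits
    t &= 0x11111111               # keep only that one flag bit per pixel
    return t * 0xF                # widen each flag to a full nibble

def blit_word(dst, src):
    """Copy the non-transparent pixels of src over dst."""
    mask = transparency_mask(src)
    return (dst & ~mask & FULL) | (src & mask)

print(hex(transparency_mask(0x0A000030)))
print(hex(blit_word(0x11111111, 0x0A000030)))
```

With planar data the mask is a separate 1-bit plane that is simply read; here it has to be computed per pixel size, which is the added logic mentioned above.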
Quote:
Quote:
It would have been good to run Doom's timedemo to see the effective framerate of both games (with the Amiga port which introduced some optimizations, BTW).

I'm compiling some test results at present. But it's time consuming. Seems to be there for ten minutes or more doing it.

Thanks for this as well.

Status: Offline

cdimauro

Re: Packed Versus Planar: FIGHT
Posted on 10-Aug-2022 21:33:31

[ #134 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4438
From: Germany

@matthey

Quote:

matthey wrote:
cdimauro Quote:

As a game developer, I don't agree: it wasn't fast memory which was required, but chip mem. ONLY chip mem. This would have helped A LOT on the vast majority of cases, since the CPU is often used just to load the chipset registers to start some operations (specifically: setting up the Blitter).

The size of chip memory was a limitation of some games depending on the complexity. However, fast memory improved total system bandwidth and more chip memory bandwidth remained.

Not enough benefits for an Amiga with 68000 and OCS/ECS. As I've said, on such machines the CPU was mostly setting the Blitter, which was the one doing the vast majority of work.

Just to give an example, on such Amigas but with 1MB chip memory my parallax routine for the Fightin' Spirits floor was "for free". Whereas it was impossible on the classic 512KB chip mem machines...

And on USA Racing (unreleased) I could have added at least another 48x48 pixels competitor car on the screen.

To me fastmem was/is useless..
Quote:
The caches of later 68k processors reduced memory accesses also leaving more system and chip bandwidth. Even the little 256 byte ICache of the 68020 combined with the very good code density of the 68k removes most instruction memory traffic as most code loops fit in 256 bytes.

With 68020 it's different, and having fast mem could have been interesting.

But, still, the more chip mem, the better. In fact, I was happy to have 2MB of chip ram with my Amiga 1200: Fightin' Spirit worked much better.
Quote:
Quote:
cdimauro Quote:
Previously you also said that chunky is cheap to implement: why this regression here?

It depends on the context. Packed/chunky would have been cheap to implement in custom chips in the 1990s and would be practically free today but in 1985 every transistor was important. I'm not saying planar was a better choice but it could have saved a few valuable transistors or a few kiB of very expensive memory which explains the choice. Performance was not as important as constraints when the Amiga custom chips were developed and the choice needs to be evaluated in this context. Engineering design is full of compromises.

I fully agree, and that's why I think that packed graphics could have been easier to implement, AKA fewer transistors used. The only thing a bit more complicated is implementing the masking on the Blitter.
Quote:
Quote:
cdimauro Quote:
SIMD/Vector makes sense only for 8, 16, 24 (more difficult) and 32-bit pixel sizes. It's much more difficult to use such units for pixel sizes below 8 bits.

I disagree. The narrower the integer datatypes, the more SIMD calculations can be performed per instruction thus improving efficiency. There are diminishing returns as narrower integer datatypes become less useful and supporting them uses up encoding space.

That was my point. Think about supporting pixel sizes of 2, 3, 4, 5, 6, and 7: you require proper instructions for each of those cases.

You cannot think about making them "first-class citizens", like the other regular data types: the encoding space explodes (3 bits required!).
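Some quick arithmetic on this point, as a sketch: how many pixels of each size fit in a 64-bit SIMD register, how many bits are left over, and how wide a size field an instruction would need to select among sizes 1 to 8. The 64-bit register width and the 1 to 8 range are assumptions; the "3 bits" figure is read here as the width of that size field.

```python
def simd_lanes(register_bits, pixel_bits):
    """Return (lanes, leftover_bits) for one register."""
    lanes = register_bits // pixel_bits
    return lanes, register_bits - lanes * pixel_bits

pixel_sizes = range(1, 9)                          # 1 .. 8 bits per pixel
size_field_bits = (len(pixel_sizes) - 1).bit_length()

for bits in pixel_sizes:
    lanes, leftover = simd_lanes(64, bits)
    print(f"{bits}-bit pixels: {lanes} lanes, {leftover} bits unused")
print("size field:", size_field_bits, "bits")
```

Sizes that do not divide the register width leave unused bits or force lanes to straddle registers, on top of the extra opcode space.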
Quote:
The 4 bit datatype has become popular for AI because it allows more calculations per instruction or cycle as Hammer showed.

But usually this is very limited. So, you have a bunch of instructions just enough for implementing the AI calculations.

So, they aren't "first-class citizens". Those are special cases, like hash instructions, crypto instructions, string comparisons, etc. And are very limited (only on their specific usage scenarios).
Quote:
Quote:
cdimauro Quote:
Indeed. The 68060 was too late. Before that the CPU could be used in some scenarios to replace the Blitter, but not in the most complicated ones.

Amiga users were using software blitting starting with the 68020 but it certainly became more advantageous as the performance of the CPU increased.

Correct. But see above, for gaming machines.
Quote:
Quote:
cdimauro Quote:
This is a logical fallacy, similar to the Argumentum ad populum.

No, something doesn't need to exist for you to prove whether it's good or not.

To me the lack of current implementations of odd pixel sizes is probably due to a lack of creativity / thinking outside the box: computer scientists and programmers are used to powers of two, so they think that "this is the (only) way" (TM).

My article gives some details about the implementation of the display controller (specifically), and some graphic primitives. This could give some input to someone who actually implements it, if people are demanding something "concrete".

To me math is and should be enough, and I'll give proper analysis and numbers.

I was not trying to prove anything. It was just an observation that the natural entropy of design choices tends toward good choices and away from bad ones, minus the copycat effect. I like to examine, and re-examine, technology choices and older technologies which were discarded due to copycat technology, unfair research, political reasons and popularity.

OK, np with that.

As I've said, IMO the problem is that computer scientists / microelectronics engineers are used to thinking in powers of two. So, they haven't thought about using "odd" pixel sizes.

I, on the contrary, am used to thinking outside the box. If I have a problem, then I try to think of the best possible solution, whatever it is.

Example: runtime checking of memory accesses. Intel developed MPX. Some other processor vendors developed tagged memory. But in Intel's drawer there's my idea, which is much better (faster, less memory usage) than Intel's MPX and tagged memory. And it also has some interesting "side effects" which allow improving performance in some common algorithms.

And this is just an example: I've plenty of (bizarre) ideas to solve specific problems...

Status: Offline

matthey

Re: Packed Versus Planar: FIGHT
Posted on 10-Aug-2022 22:22:32

[ #135 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2748
From: Kansas

Hammer Quote:

1st example, the clusterfuk mess between Jim Drew (Fusion Mac) vs Gunnar. Gunnar's counterargument is to use Shapeshifter which modifies Apple's ROMs.

Disabling AC68080 V2's non-compliant Motorola 68K FPU wannabe clone can force Apple's scalar CPU-only code path.

2nd example, there's Lightwave's minor render difference AC68080 V2's non-compliant Motorola 68K FPU vs genuine Motorola 68K FPU results.

Pentium FDIV bug has a minor result difference from X87's result and caused Intel to recall the product.

WinUAE didn't have a "compatible" extended precision 68k FPU until 2018 despite being first released in 1995. WinUAE changed back to an incompatible double precision FPU as default in 2020. WinUAE is or was the most popular Amiga, at least before THEA500 Mini, and it has the same incompatibility by default. Maybe this incompatibility had become the Amiga industry standard despite the possibility to produce unintended results and even crashes.
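A small sketch of why the two formats can disagree: extended precision carries a 64-bit mantissa, double precision only 53 bits. The helper below just rounds a positive integer to a given number of significant bits (round half to even); it does not emulate a 68881/68882 or WinUAE, and its name is made up.

```python
def round_to_mantissa(n, bits):
    """Round a positive integer to 'bits' significant bits, ties to even."""
    excess = n.bit_length() - bits
    if excess <= 0:
        return n                           # fits exactly, nothing to round
    half = 1 << (excess - 1)
    q, r = divmod(n, 1 << excess)
    if r > half or (r == half and q & 1):
        q += 1
    return q << excess

n = 2**53 + 1
print(round_to_mantissa(n, 64) == n)       # exact with a 64-bit mantissa
print(round_to_mantissa(n, 53) == 2**53)   # low bit lost with 53 bits
print(float(n) == float(2**53))            # the host double agrees
```

Software that relies on those extra mantissa bits, or compares against results computed on a real FPU, can therefore see different output on a double precision implementation.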

Hammer Quote:

Apollo team didn't follow Intel's and AMD's X86/X87 legacy software preservation practices. Is it clear to you?

I warned the Apollo team that reducing the precision wouldn't be compatible. Gunnar knew but chose performance and logic savings over compatibility. WinUAE also chose performance over compatibility as the default.

Hammer Quote:

FACT: IBM PowerPC 970 also has instruction fusion, hence Intel doesn't have a monopoly on this design concept. PowerPC 970's Power4-based design was released in 2001.

IBM PowerPC 970 was released in 2002.

Intel Pentium M (P6 Banias) was released in 2003.

Motorola called instruction combining "instruction folding" and it goes back to at least 1994.

M68060 User's Manual 10-8 Quote:

Additionally, the use of instruction folding techniques allow one or two instructions to be
simultaneously executed with a predicted taken Bcc (also for BRA and JMP instructions).

The 68060 only used "instruction folding" with branches which is sometimes called "branch folding". Instruction folding was brought back in the V4 ColdFire and expanded though.

MOTOROLA THAWS COLDFIRE V4 (May 15, 2000) Quote:

The instruction folding technique enables limited parallel execution without the extra logic of dual issue superscalar pipelines. The second section of the V4 pipeline automatically folds certain pairs of instructions into a single cycle operation. For example, it combines MOV.l mem,Rx and ADD.l Ry,Rx to create ADD.l mem,Ry,Rx. Motorola says these kinds of instruction pairs occur frequently in embedded programs. Programmers writing in assembly language could make sure it happens by deliberately pairing those kinds of instructions inside critical loops.

Instruction folding was brought back in the CF5407 which is the V4 ColdFire chip the article above is about.
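
The pairing described in the article can be modelled as a simple peephole pass over an instruction list. This is only a software sketch of the idea, not Motorola's pipeline logic: the `Insn` tuple, the `fold` function, and the convention that a memory operand is written in parentheses (e.g. `(a0)`) are all assumptions made for the example.

```python
from typing import List, NamedTuple, Tuple


class Insn(NamedTuple):
    op: str
    src: str
    dst: str


def is_memory(operand: str) -> bool:
    # Assumed convention for this sketch: memory operands look like "(a0)".
    return operand.startswith("(")


def fold(insns: List[Insn]) -> List[Tuple[str, ...]]:
    """Fold MOV.l mem,Rx followed by ADD.l Ry,Rx into ADD.l mem,Ry,Rx."""
    out: List[Tuple[str, ...]] = []
    i = 0
    while i < len(insns):
        cur = insns[i]
        nxt = insns[i + 1] if i + 1 < len(insns) else None
        if (
            nxt is not None
            and cur.op == "MOV.l"
            and is_memory(cur.src)
            and nxt.op == "ADD.l"
            and nxt.dst == cur.dst   # both write the same Rx
            and nxt.src != cur.dst   # Ry must be a different register
        ):
            # One three-operand operation replaces the pair.
            out.append(("ADD.l", cur.src, nxt.src, cur.dst))
            i += 2
        else:
            out.append(tuple(cur))
            i += 1
    return out


program = [
    Insn("MOV.l", "(a0)", "d1"),
    Insn("ADD.l", "d2", "d1"),
    Insn("SUB.l", "d3", "d4"),
]
for folded in fold(program):
    print(folded)
```

The adjacency requirement in the sketch mirrors the article's advice to assembly programmers: the two instructions have to sit next to each other for the fold to happen.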

Hammer Quote:

AC68080's wannabe boat-anchor SIMD extension is not relevant for 68K legacy preservation.

AMD and Intel can copy each other's SIMD extensions, hence fulfilling the second source requirement.

Motorola has DSP 56001, Apollo team didn't support this.

C= selected AT&T DSP3210, Apollo team didn't support this.

I rather support ARM NEON or RISC-V over AC68080's wannabe boat-anchor SIMD extensions.

A partial implementation of a FPU and a partial implementation of a SIMD unit doesn't count as one of something? If only someone had suggested leaving the SIMD unit experimental and implementing a fully compatible FPU. Oh, yea, someone did.

Hammer Quote:

As for the "Apollo team", Igor Majstorovic has aired the dirty laundry.

It sucks that there was bad blood between them but the "Apollo team" is on a mission to dominate the high end Amiga FPGA market.

Last edited by matthey on 10-Aug-2022 at 10:25 PM.

Hammer

Re: Packed Versus Planar: FIGHT
Posted on 12-Aug-2022 4:41:53

[ #136 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6503
From: Australia

@matthey

Quote:

WinUAE didn't have a "compatible" extended precision 68k FPU until 2018 despite being first released in 1995. WinUAE changed back to an incompatible double precision FPU as default in 2020.

WinUAE still offers 80-bit FP compatibility when users need this feature, just as all modern X86-64 CPUs offer 80-bit FP compatibility in the slower performance path, but X86-64 CPUs have the option to brute force this issue at >4 GHz and >5 GHz. The incoming AMD Zen 4 and Intel Raptor Lake have clock speeds nearing 6 GHz.

The 68060 integrates a Floating-Point Unit that is compatible with Motorola 68881 / 68882 co-processors. The 68060 FPU provides hardware support only for the most common floating-point instructions and data types, while unsupported instructions and data types are emulated in the slower software path. The 68060 FPU provides hardware support for FP32, FP64, and FP80.

Motorola 68060's software support package (M68060SP) doesn't work with Amiga kick-the-OS games since it wasn't placed below the OS level. WHDload game patches mitigate this issue.

Unlike Intel with the Pentium's X87 FPU, Motorola wasn't able to implement pipelining for the 68060 FPU.
AMD implemented a fully pipelined X87 FPU with the K7 Athlon.

AC68080 V2 has non-compliant IEEE-754 FP52. Apollo-Core didn't place unsupported FP instructions in the slower performance path.

If Apollo-Core wants to follow Intel/AMD, the Apollo team should have placed unsupported FP instructions in the slower firmware microcode path and used higher clock speeds and general architecture improvements (e.g. larger cache, wider I/O) to brute force performance improvements.

My argument is about respecting legacy software, not about the micro-architecture implementation argument. The micro-architecture implementation argument is useless when it doesn't respect legacy software.

Quote:

Maybe this incompatibility had become the Amiga industry standard despite the possibility to produce unintended results and even crashes.

It stems from Motorola's mindset that doesn't respect legacy software.

On an 8 GB USB flash drive with MS-DOS 7.1, I run DOS games such as Pinball Fantasies, Doom, and Super Street Fighter 2 Turbo on a Core i7-3770K/GeForce GTX 1050/UEFI-CSM-based PC.

Quote:

I warned the Apollo team that reducing the precision wouldn't be compatible. Gunnar knew but chose performance and logic savings over compatibility. WinUAE also chose performance over compatibility as the default.

WinUAE still offers 80-bit FP compatibility as an option.

Quote:

Motorola called instruction combining "instruction folding" and it goes back to at least 1994.

Additionally, the use of instruction folding techniques allow one or two instructions to be
simultaneously executed with a predicted taken Bcc (also for BRA and JMP instructions).

According to Intel documentation: Instruction Fusion is when multiple RISC-like assembly instructions are merged into one CISC-like assembly instruction.

Macro-Operation Fusion (also Macro-Op Fusion, MOP Fusion, or Macrofusion) is a hardware optimization technique found in many modern microarchitectures whereby a series of adjacent macro-operations are merged into a single macro-operation prior to or during decoding. Those instructions are later decoded into fused-µOPs.

This particular argument is about the micro-architecture implementation, which is irrelevant to the argument about respecting legacy software.

Quote:

A partial implementation of a FPU and a partial implementation of a SIMD unit doesn't count as one of something? If only someone had suggested leaving the SIMD unit experimental and implementing a fully compatible FPU. Oh, yea, someone did.

If only the Apollo team had implemented the unsupported 68881/68882/68040/68060 FPU instructions in a slower path, just as the X86 world implements fast and slow paths.

Quote:

It sucks that there was bad blood between them but the "Apollo team" is on a mission to dominate the high end Amiga FPGA market.

Too bad the standalone Vampire V4 didn't follow C= A500/A1200's CPU expandability characteristics.

FYI, Minimig FPGA version 1.8 supports a real 68000 socket, hence PiStorm/Pi 3a (or Pi 2W)/Emu68 and any other A500 68K CPU socket accelerator works with it, e.g. Vampire AC68080, Warp 560, TF536, etc. Minimig FPGA version 1.8 followed C= A500's CPU expandability characteristics.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

Hammer

Re: Packed Versus Planar: FIGHT
Posted on 12-Aug-2022 4:53:59

[ #137 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6503
From: Australia

@matthey

Quote:
The size of chip memory was a limitation of some games depending on the complexity. However, fast memory improved total system bandwidth and more chip memory bandwidth remained. The caches of later 68k processors reduced memory accesses also leaving more system and chip bandwidth. Even the little 256 byte ICache of the 68020 combined with the very good code density of the 68k removes most instruction memory traffic as most code loops fit in 256 bytes.

https://www.youtube.com/watch?v=GojpwZMBHz4
The Amiga 1200's stock 68EC020 CPU with Fast RAM was able to handle an arcade-quality Final Fight game port. This port includes a parallax background.

The 68020's small cache is not enough. The arcade-quality Final Fight port was programmed in 68020 assembly language.

I support the push by David Pleasance, then MD of Commodore UK, for an out-of-the-box A1200 SKU with Fast RAM.

Last edited by Hammer on 12-Aug-2022 at 04:57 AM.

Hammer

Re: Packed Versus Planar: FIGHT
Posted on 12-Aug-2022 6:23:35

[ #138 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6503
From: Australia

@cdimauro

Quote:

Then please tell me about Motorola decisions to remove FPU instructions from 68040 and 68060: those were DESIGN decisions, which affected the existing software.

68060 FPU hardware supports FP32, FP64, and FP80. Unsupported instructions are in the M68060SP software support package.

Motorola's M68060SP support package wasn't placed below the OS level.

Quote:

This is a bug: NOT a design decision. That's why Intel decided for recalling the processors.

X86 market pressure has defined classic Pentium FDIV flaws as a bug.

However, it has been noted that far fewer failures are found in single precision than in double or extended precisions.

For example, correct values all would round to 1.3338, but the returned values are 1.3337, an error in the fifth significant digit.
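
As a rough check of the digits quoted above, the sketch below uses the operand pair most often cited for the FDIV flaw, 4195835 / 3145727. Neither the operands nor the flawed quotient come from this thread: the flawed value is hard-coded from widely published reports, since current hardware cannot reproduce it, so treat it as an assumption of the example.

```python
# Operand pair commonly cited for the Pentium FDIV flaw (not from this thread).
x, y = 4195835.0, 3145727.0

correct = x / y
# Quotient widely reported for affected Pentiums; hard-coded, not computed.
flawed = 1.333739068902037589

print(f"correct: {correct:.4f}")
print(f"flawed:  {flawed:.4f}")

# Relative error of the flawed result.
rel_err = abs(correct - flawed) / correct
print(f"relative error: {rel_err:.2e}")

# Classic self-test: x - (x / y) * y should be about 0 on a correct FPU.
residual_flawed = x - flawed * y
print(f"residual with flawed quotient: {residual_flawed:.1f}")
```

The error sits in the fifth significant digit, which is why software that works at low precision or tolerates small errors (such as a game renderer) could carry on, while numerically sensitive work could not.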

Classic Pentium with FDIV bug can still run Quake-type games.

Quote:

Something which did NOT happen with Motorola and its processors which were missing FPU instructions.

68060 FPU supports FP32, FP64, and FP80. Unsupported instructions are in the M68060SP software support package a.k.a the slow path.

Quote:

If it's a fact (for you) then you can provide sources for your claims.

If you find them, because you might have confused POWER4 / PowerPC 970 instructions cracking with fusion. But cracking is the exact opposite of fusion.

Or, you might have confused their FMA instructions with instructions fusions. But, again, those are completely different things.

Anyway, it isn't my problem: yours is the thesis and you are the one that has to prove it.

Your micro-architecture implementation argument is useless when it doesn't respect Amiga legacy software.

Quote:

I assume because it's exactly your pattern. You behave like Pavlov's dog when talking about x86 processors, always trying to put AMD in good light, and Intel in bad light.

Specifically, I quote again you:

"Apollo-core's AC68080 attempted to be "AMD" for the 68K family"

Which is clearly false, since the most important technologies implemented on the 68080 are the SIMD unit and instructions fusion (which allows this processor to execute up to 4 instructions per clock). And both were designed by Intel. So, the statement should have been this, instead, as I've said:

"Actually Apollo's 68080 is attempted to be more like Intel instead of AMD, due to the design decisions."

My argument is not about the micro-implementation argument i.e. my argument is about resulting output, performance, and respecting Amiga's legacy software.

Your micro-implementation argument is meaningless when AC68080 V2 is not compliant with 68881/68882/68060's FP64 and FP80.

AC68080 V4's FP64 support debunks your argument that FP52 is "good enough".

Apollo team didn't respect Amiga's legacy software at the same level as Intel/AMD's X86/X87 legacy software support.

Quote:

So, you aren't able to judge yourself how the Amiga Doom is working compared to the PC one, and you've to resort to the comment of someone else.

That's a fluff argument.

I have two RTX 3080 Ti OC AIB cards i.e. one ASUS ROG Strix and the other is an MSI Gaming X variant. In terms of benchmarks, the slightly faster RTX 3080 Ti is the MSI Gaming X variant. The minor frame rate differences are not major issues when ASUS is the largest AIB PC GPU vendor.
If absolute highest frame rates were a major factor, I would have purchased MSI's Supreme X or EVGA FTW3 RTX 3080 Ti models.

My 68K configurations for C= Amiga hardware platforms are 68HC000 @ 50 MHz overclocked, 68EC020 @ 14 MHz with 8 MB 68882 @ 50 MHz card, 68LC060 rev 4 (reached 75 MHz), 68060 rev 1 @ 62.5 MHz overclocked and PiStorm/Pi 3a/Emu68.

I sold my A3000/030 @ 25 MHz in 1996 for a Pentium 150/S3 Trio 64/Yamaha Sonata (16-bit sound card) based clone PC.

My A3000/030 @ 25 MHz was useless for Doom!

For Quake, my A3000 with the proposed Cyberstorm 060 @ 50 MHz and CyberGraphics 64 (S3 Trio 64) upgrades wasn't cost-effective when compared to a Pentium 150/S3 Trio 64/Yamaha Sonata (16-bit sound card) based clone PC.

68HC000 @ 50 MHz overclocked, PiStorm/Pi 3a/Emu68 for C= A500 rev 6A.

68EC020 @ 14 MHz with 8 MB 68882 @ 50 MHz card, 68LC060 rev 4 (reached 75 MHz), 68060 rev 1 @ 62.5 MHz overclocked for C= A1200 rev 1D1.

I don't plan to purchase another Motorola "68030" performance-level CPU.

Quote:

Well, I'm not superman and I don't have his sight, but I can see the differences. Better that you ask for a good pair of glasses

Unlike IBM VGA, AGA's Doom frame rate continues to scale with a faster CPU.

At a certain point with faster CPUs, AGA can be a bottleneck, but the Hollywood industry's full motion video standard is 24 fps. Many XBO/PS4 game console titles have a 30 fps target.

My 386DX-33 with ET4000 was my gaming PC in the 1992 to 1996 era, and it performed like an A1200 with a fast 030 card.

Last edited by Hammer on 12-Aug-2022 at 06:37 AM.

Karlos

Re: Packed Versus Planar: FIGHT
Posted on 12-Aug-2022 10:56:18

[ #139 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4958
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

Rumour has it that your long posts and rebuttals resulted in the outage...

Last edited by Karlos on 12-Aug-2022 at 10:58 AM.

Karlos

Re: Packed Versus Planar: FIGHT
Posted on 12-Aug-2022 11:06:19

[ #140 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4958
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@Hammer

Quote:
At a certain point with faster CPUs, AGA can be a bottleneck, but the Hollywood industry's full motion video standard is 24 fps. Many XBO/PS4 game console titles have a 30 fps target.

Lower frame rates are fine for non-interactive media like movies. For games, though, there is the whole input/feedback latency consideration: how long between pressing a control and seeing the reaction? FPS games of the era were single threaded and needed to process user input and game physics, sequentially each frame. As the frame rate goes down, so does the responsiveness until the perceived lag becomes a challenge.

There is a much bigger difference going from 30 to 60 FPS for gaming purposes than there is for basic video consumption.
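
To put rough numbers on that, here is a small sketch of frame time versus frame rate. The latency model is an assumption for illustration only: it supposes a single-threaded loop that samples input once at the start of a frame, so a press arriving just after sampling waits almost one full frame and then needs one more frame to be simulated and drawn. Real engines and displays add their own delays on top.

```python
def frame_time_ms(fps: float) -> float:
    """Duration of one frame in milliseconds."""
    return 1000.0 / fps


def worst_case_latency_ms(fps: float) -> float:
    # Assumed model: up to one frame waiting to be sampled,
    # plus one frame to process the input and show the result.
    return 2.0 * frame_time_ms(fps)


for fps in (24, 30, 60):
    print(
        f"{fps:>2} fps: {frame_time_ms(fps):5.1f} ms per frame, "
        f"up to about {worst_case_latency_ms(fps):5.1f} ms from press to screen"
    )
```

Under this model, going from 30 to 60 FPS removes roughly 33 ms of worst-case lag, which is easy to feel on a control input even though the same change matters far less for passive video.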

Last edited by Karlos on 12-Aug-2022 at 11:10 AM.

Amigaworld.net was originally founded by David Doyle