Amigaworld.net - The Amiga Computer Community Portal Website

An experimental Doom speed test and feasibility study based on a virtual packed-planar mode

Hypex

An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 13-Aug-2022 15:32:32

[ #1 ]

Elite Member

Joined: 6-May-2007
Posts: 11204
From: Greensborough, Australia

Hello Doom fans!

This test is based on an experiment I recently conducted with packed pixels and I present my findings here. In recent months there has been debate over packed and planar pixel formats and their relation to the Amiga. This experiment is based on an idea that predates those discussions but only recently did I create and perform some real world tests. The idea is to see how fast the display hardware would have been had the Amiga featured an actual packed pixel mode.

For this I used Doom as a base to work from. Not only because it remains popular to this day, especially among Amiga fans, but also because it has a testable 3D game engine whose source can be modified to perform real world tests. The time cost of a full packed to planar bitmap conversion, using blitter assisted or CPU only logic, is said to be minimal compared with the whole process of rendering an actual 3D frame into a screen buffer. This test measures how fast the hardware can be, either by rendering directly into chip ram or by copying an off screen render to chip ram when ready.

I called this virtual packed mode packed-planar. It represents an 8 bit linear packed pixel buffer. In hardware I imagine it would have been implemented as one plane with each pixel taking up one byte in linear order. Of course, to simulate this, I had to use an actual 8 bit depth bit-planar bitmap mode. This is possibly slower than an actual packed-planar mode may have been, but since chip ram writes are still involved any difference would likely have been marginal. The bitmap is also configured as an interleaved bitmap, for a few reasons. Using an interleaved bitmap means that each line in packed format takes up exactly the same space as in planar, so one line can be analogous to another. For example, an 8 bit packed screen 320 pixels wide takes up 320 bytes per line, and with an interleaved bitmap of 8 bit depth each line also takes up exactly 320 bytes (8 planes of 40 bytes). The pixel organisation is obviously different between formats but each line takes up the same space since it contains all planes for one line. This also means that the on screen visual representation makes some sense. It is still confusing, but the confusion is confined to each line.
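To make the line layout concrete, here is a minimal sketch of where a packed byte ends up when it is written unchanged into one interleaved planar line. The function names are made up for illustration, and the sketch assumes plane 0 is stored first within each interleaved line.

```python
WIDTH = 320                      # pixels per line
DEPTH = 8                        # bitplanes
BYTES_PER_PLANE_ROW = WIDTH // 8 # 40 bytes hold 320 one-bit pixels


def packed_row_bytes(width=WIDTH):
    """One byte per pixel in the packed format."""
    return width


def interleaved_row_bytes(width=WIDTH, depth=DEPTH):
    """All planes of one line stored back to back."""
    return (width // 8) * depth


def where_packed_byte_lands(x):
    """Return (plane, first_pixel, last_pixel) covered by packed pixel x
    when the packed line is copied as-is into an interleaved planar line.
    Assumes plane 0 comes first in the line (an assumption of this sketch)."""
    plane = x // BYTES_PER_PLANE_ROW
    column = x % BYTES_PER_PLANE_ROW
    return plane, column * 8, column * 8 + 7


if __name__ == "__main__":
    print(packed_row_bytes(), interleaved_row_bytes())
    for x in (0, 39, 40, 319):
        print(x, where_packed_byte_lands(x))
```

So the first 40 packed pixels of a line are smeared across all 320 pixels of one plane, the next 40 across the next plane, and so on, which is why the raw image is hard to read.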

In addition, I also wrote and tested some ideas that came to mind after my initial testing. These are palette remapping to test how well the palette can be remapped so planar can resemble packed; ordered bitmapping where 8 sequential pixels of packed bytes are written to 8 bytes of sequential planes; and pixel tabling where a set of 8 sequential packed pixels are used to calculate an index into a table of 8 planar pixels copied into the bitmap. These will be explained in more detail below.

My results are based on several tests that provide a baseline, all using a game screen dimension of 320x200x8: native planar, RTG and direct RTG. These are compared against off screen and direct packed-planar, with additional speed tests for the ordered and tabled bitmap modes.

My test hardware is as follows:
Amiga A4000D.
CyberStorm 68060 @ 50MHz with 128MB RAM installed.
CyberVision 64 graphics card sitting in Zorro slot.
PicassoIV mainly used as flicker fixer sitting in Video slot.
CyberSCSI controller adapted to SCA HDD.
DVD/RW connected to internal IDE.

Alternative methods
A few different experimental methods of planar writing were also tested and are discussed here: palette remapping, ordered bitmapping and pixel tabling.

• Palette remapping: This remaps the palette entries to better match what a pixel should look like against what it does look like. However, it’s almost impossible to do so. In the standard case of an 8 bit interleaved bitmap, each run of 40 sequential packed pixels ends up spread across 320 pixels of a single plane, with the runs overlaid on each other plane by plane, since each plane line of 320 pixels takes up 40 bytes. A contiguous bitmap with separate planes as whole bitmaps would be even worse.

• Ordered bitmapping: This orders bit plane writes so each sequential 8x1 packed pixel block is written as an 8x1 planar pixel block in each plane. That is, 8 packed pixels are written into 8 plane bytes. It is still confused, but the confusion is reduced to an 8x1 block. In this format the screen makes more sense and can actually be made out. It’s also more suitable for palette remapping, though still in a limited capacity.

• Pixel tabling: This takes a set of packed pixel values, combines them into an index, and uses it with a look-up table holding the same pixels in planar format. Essentially it’s a pre-calculated packed to planar table. Since the minimum amount of pixels in planar is 8 pixels per byte it makes sense to convert 8 pixels at once. Unfortunately, since 8 packed pixels at 8 bits each make 64 bits, this means a table with a 64-bit index would be needed. Not even a modern 64-bit system has the memory for that! Therefore, sacrifices have to be made. In order to test this I reduced depth to 4 bits and halved resolution, so each block of 8x 8 bit pixels becomes a block of 4x 4 bit pixels, which converts into a 16 bit index. I also tested using 5 bit depth which converts to a 20 bit index and looks better. Initially I had experimented with “numerology” to quickly reduce the 64 bits by adding the 8 pixels up into an 11 bit index. But, since the sum only ranges from 0 to 2,040, the result is too blocky to be of any practical value and almost unrecognisable.
To put it into perspective, here are the memory requirements to convert sets of 8 pixels at different bit depths. They very quickly get out of hand at an exponential rate:
At 2 bits packed, 8x 2 bits needs a 16 bit index, with a 128KB planar table.
At 3 bits packed, 8x 3 bits needs a 24 bit index, with a 48MB planar table!
At 4 bits packed, 8x 4 bits needs a 32 bit index, with a 16GB planar table being out of the range of possibility.
If resolution is halved it’s more feasible:
4x 2 bits would only need an 8 bit index, with a 512 byte table.
4x 3 bits would only need a 12 bit index, with a 12KB table.
4x 4 bits would only need a 16 bit index, with a 256KB table.
4x 5 bits would only need a 20 bit index, with a 5MB table.
While 4x 6 bits needs a 24 bit index, with a 96MB table!
Another way would be to divide the pixels into nibbles, then load in the left nibbles and logically OR in the right nibbles, which would allow for more resolution and depth using a smaller table. But it also requires double look-ups, though with enough registers the planes need only be written once. Compared to Akiko, where pixel data needs to be read, written, read, then written to planes, tabling only needs two reads and a write: the pixel data is read, the table is read, then the result is written to planes, and the latter two steps can be combined into a single copy operation. Another variant I had considered is using a jump table with routines that directly write the associated values in, or jumping to a routine from a calculated offset, which would likely speed up operations with no table read overhead, at the cost of higher memory requirements.
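The table sizes quoted for pixel tabling can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes each table entry stores one byte per bitplane (8 planar pixels per plane), which is the assumption that matches the figures above; the function name is made up for illustration.

```python
def table_size(pixels_per_entry, bits_per_pixel):
    """Return (index_bits, table_bytes) for a packed-to-planar look-up table.

    Assumes one byte per bitplane in every entry (illustrative assumption).
    """
    index_bits = pixels_per_entry * bits_per_pixel
    entry_bytes = bits_per_pixel          # one plane byte per bit of depth
    return index_bits, (1 << index_bits) * entry_bytes


if __name__ == "__main__":
    for pixels in (8, 4):
        for depth in (2, 3, 4, 5, 6):
            bits, size = table_size(pixels, depth)
            print(f"{pixels}x {depth} bits: {bits} bit index, {size:,} bytes")
```

Every extra bit of depth adds as many index bits as there are pixels per entry, so the 8 pixel tables grow by a factor of 256 per bit of depth and the 4 pixel tables by a factor of 16 (before the small increase in entry size).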

Packed-Planar visualisation
To give you a visual of how this looks on screen I compiled a collage of images. These are from frame 100 and frame 1,000. It still looks confusing but it is interesting to see what happens when you write packed pixel data into a planar bitmap. In addition I show how the ordered and tabled images look. Ordered is quite interesting as you can actually make out the images and see projectiles firing as well as the gun moving. Here tabled is shown in 5-bit precision. I've also shown what remapping looks like. For this I used a pre-calculated palette mapping table from that frame based on inverting the pixel palette indexes, from what they were on planar, to how they should be in packed. The remapping is rather simple, and simply remaps the 64,000 pixels of a whole screen straight into 256 colour entries. The result looks like sepia tones.

From the top:
Actual.
Packed-Planar, Ordered Packed-Planar, Tabled Packed-Planar.
Remapped Packed-Planar, Remapped Ordered Packed-Planar, Remapped Tabled Packed-Planar.

Frame 100:

Frame 1000:

Actual results
So you came here for some actual results and here they are!

I set up a baseline to measure the packed-planar against. On my PAL system this is a native PAL Lowres 320x256x8 screenmode. I also tested the NTSC equivalent. In addition I set up baselines for RTG and direct RTG on both my PicassoIV and CyberVision 64 cards, using the CyberGfx API. The Doom game screen is 320x200x8 as standard.

Default Packed-Planar uses CopyMemQuick(), Quick Packed-Planar uses a small 68040 MOVE16 instruction loop, while Direct Packed-Planar uses the Doom engine to directly render into the planar bitmap.

I let ADoom choose the appropriate routines for chunky to planar conversion on my hardware so these are left at default. I ran a timedemo of DEMO1 from the shareware WAD 1.666 with sound FX disabled, as is recommended for Doom speed tests. I then filtered out the relevant info and tabled it here, including arguments, screen mode info, game ticks with FPS result and any extra timings. Here we go.

Baselines:

Native PAL:
ADoom -nosfx -timedemo DEMO1 -forcedemo -native
Screen Mode $00021000 is NATIVE-PLANAR 8-BIT
Error: timed 6305 gametics in 11380 realtics ( 19.4 fps)
Total number of frames = 6345
Total Chunky2Planar time     = 78289942 us  (12338 us/frame)

Native NTSC:
ADoom -nosfx -timedemo DEMO1 -forcedemo -native
Screen Mode $00011000 is NATIVE-PLANAR 8-BIT
Error: timed 6305 gametics in 11404 realtics ( 19.4 fps)
Total number of frames = 6345
Total Chunky2Planar time     = 79496059 us  (12528 us/frame)

RTG PicassoIV:
ADoom -nosfx -timedemo DEMO1 -forcedemo -rtg
Screen Mode $50001000 is FOREIGN 8-BIT CYBERGRAPHX
Error: timed 6305 gametics in 10348 realtics ( 21.3 fps)
Total number of frames = 6345
Total WritePixelArray8 time  = 47129091 us  (7427 us/frame)

RTG Direct PicassoIV:
ADoom -nosfx -timedemo DEMO1 -forcedemo -rtg -directcgx
Screen Mode $50001000 is FOREIGN 8-BIT CYBERGRAPHX
Error: timed 6305 gametics in 11546 realtics ( 19.1 fps)
Total LockBitMap time        = 618419 us  (97 us/frame)

RTG CyberVision 64:
ADoom -nosfx -timedemo DEMO1 -forcedemo -rtg
Screen Mode $53001000 is FOREIGN 8-BIT CYBERGRAPHX
Error: timed 6305 gametics in 11106 realtics ( 19.9 fps)
Total number of frames = 6346
Total WritePixelArray8 time  = 41406955 us  (6524 us/frame)

RTG Direct CyberVision 64:
ADoom -nosfx -timedemo DEMO1 -forcedemo -rtg -directcgx
Error: timed 6305 gametics in 10616 realtics ( 20.8 fps)
Total number of frames = 6345
Total LockBitMap time        = 638144 us  (100 us/frame)

Packed-Planar:

Packed-Planar PAL:
ADoom -nosfx -timedemo DEMO1 -forcedemo -packed
Screen Mode $00021000 is NATIVE 8-BIT PACKED-PLANAR
Error: timed 6305 gametics in 11282 realtics ( 19.6 fps)
Total number of frames = 6344
Total CopyMemQuick time      = 75325641 us  (11873 us/frame)

Packed-Planar NTSC:
ADoom -nosfx -timedemo DEMO1 -forcedemo -packed
Screen Mode $00011000 is NATIVE 8-BIT PACKED-PLANAR
Error: timed 6305 gametics in 11266 realtics ( 19.6 fps)
Total number of frames = 6346
Total CopyMemQuick time      = 76827458 us  (12106 us/frame)

Packed-Planar Quick PAL:
ADoom -nosfx -timedemo DEMO1 -forcedemo -packed -quick -remapped
Screen Mode $00021000 is NATIVE 8-BIT REMAPPED PACKED-PLANAR
Error: timed 6305 gametics in 11097 realtics ( 19.9 fps)
Total number of frames = 6345
Total packed quick time      = 74771061 us  (11784 us/frame)

Packed-Planar Quick NTSC:
ADoom -nosfx -timedemo DEMO1 -forcedemo -packed -quick -remapped
Screen Mode $00011000 is NATIVE 8-BIT REMAPPED PACKED-PLANAR
Error: timed 6305 gametics in 11290 realtics ( 19.5 fps)
Total number of frames = 6345
Total packed quick time      = 78466078 us  (12366 us/frame)

Packed-Planar Direct PAL:
ADoom -nosfx -timedemo DEMO1 -forcedemo -packed -direct
Screen Mode $00021000 is NATIVE 8-BIT DIRECT PACKED-PLANAR
Error: timed 6305 gametics in 12291 realtics ( 18.0 fps)
Total number of frames = 6345

Packed-Planar Direct NTSC:
ADoom -nosfx -timedemo DEMO1 -forcedemo -packed -direct
Screen Mode $00011000 is NATIVE 8-BIT DIRECT PACKED-PLANAR
Error: timed 6305 gametics in 12547 realtics ( 17.6 fps)
Total number of frames = 6345

Packed-Planar Ordered PAL:
ADoom -nosfx -timedemo DEMO1 -forcedemo -packed -direct -ordered
Screen Mode $00021000 is NATIVE 8-BIT REMAPPED ORDERED PACKED-PLANAR
Error: timed 6305 gametics in 18354 realtics ( 12.0 fps)
Total number of frames = 6326
Total packed order time      = 276991436 us  (43786 us/frame)

Packed-Planar Ordered NTSC:
ADoom -nosfx -timedemo DEMO1 -forcedemo -packed -direct -ordered
Screen Mode $00011000 is NATIVE 8-BIT ORDERED PACKED-PLANAR
Error: timed 6305 gametics in 18486 realtics ( 11.9 fps)
Total number of frames = 6326
Total packed order time      = 283743360 us  (44853 us/frame)

Packed-Planar Tabled PAL:
ADoom -nosfx -timedemo DEMO1 -forcedemo -packed -direct -tabled
Screen Mode $00021000 is NATIVE 8-BIT TABLED PACKED-PLANAR
Building raster table...
Error: timed 6305 gametics in 18794 realtics ( 11.7 fps)
Total number of frames = 6325
Total packed table time      = 290725884 us  (45964 us/frame)

Packed-Planar Tabled NTSC:
ADoom -nosfx -timedemo DEMO1 -forcedemo -packed -direct -tabled
Screen Mode $00011000 is NATIVE 8-BIT TABLED PACKED-PLANAR
Building raster table...
Error: timed 6305 gametics in 19003 realtics ( 11.6 fps)
Total number of frames = 6325
Total packed table time      = 298191358 us  (47144 us/frame)
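For anyone wanting to check the logs above, the FPS figures follow from Doom's fixed rate of 35 tics per second. The sketch below recomputes them and also estimates what share of the whole run was spent in the copy or conversion step. It assumes realtics are counted at the same 35Hz, which is consistent with the logged FPS values; the function names are made up for illustration.

```python
TICRATE = 35  # Doom game tics per second


def timedemo_fps(gametics, realtics):
    """FPS as printed by the timedemo line."""
    return TICRATE * gametics / realtics


def elapsed_seconds(realtics):
    """Wall time of the run, assuming realtics tick at 35Hz."""
    return realtics / TICRATE


def share_of_run(total_us, realtics):
    """Fraction of the run spent in a timed step (C2P, copy, and so on)."""
    return (total_us / 1_000_000) / elapsed_seconds(realtics)


if __name__ == "__main__":
    # Native PAL baseline figures taken from the log above.
    print(round(timedemo_fps(6305, 11380), 1))
    print(round(share_of_run(78289942, 11380), 3))
```

By this estimate the native PAL run spent roughly a quarter of its time in Chunky2Planar, and the CopyMemQuick runs spent a very similar share just copying.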

Charting visuals
A load of figures and statistics isn't very interesting to look at. So I also produced some bar charts with a visual of the FPS results against the Doom standard of 35 FPS. These are grouped into categories. You will find below the Baselines, Packed-Planar, Alternative with Ordered and Tabled Packed-Planar. Then below that comparisons with Packed-Planar versus Native-Planar, Packed-Planar versus RTG and Direct Packed-Planar versus Direct RTG.

Baselines:

Comparisons:

Reviewing Results
After examining the results across the board you can see there isn’t too much difference between native planar and RTG. The Picasso with native packed returned the fastest result overall but was also the slowest with direct rendering! The CyberVision was the next best, being faster than native planar and faster again in direct mode, but it somehow ended up slightly slower than the Picasso even though it should be the faster chipset. There were also differences between PAL and NTSC, with PAL being slightly faster overall. This is interesting as NTSC mode is meant to run faster, but it is running on a PAL machine, though that shouldn’t make much difference.

Now, what we really want to know is how a packed planar mode would compare. As it turns out, directly copying the screen buffer is only marginally faster than the logic operation of converting the screen buffer into planar, with the quick copy just slightly faster again. But what’s really interesting is that directly rendering the screen buffer into chip ram turned out to be slower! Yes, rendering the screen directly into chip ram, simulating native packed planar, was slower than converting a packed buffer into planar and writing it into chip ram. So it looks like a native packed mode would have given a slight speed up, but not enough to give any real gains. Given chip ram access is slower than fast ram, it’s likely that rendering directly to chip ram incurred a penalty, as the game engine also copies the screen buffer out at times. Also, an A4000 is said to be slower at chip access than an A1200, which I’d also like to test. Given the A4000 was sold as an expensive workstation, while the A1200 was relegated to a cheaper games machine, it would be rather an insult if an A4000 bus operation turned out to be slower than an A1200.

What’s even slower are the ordered and tabled modes. The ordered mode makes byte copies which would slow it down, as opposed to a fast copy at 32 bits per read and write. For a 320x200 screen, that means 64,000 bytes copied, compared to 16,000 long words, or even the quick copy doing 1,000 copy operations of 16 long words (512 bits) at a time. Likewise, the tabled mode suffers the same penalty from byte sized writes, with the added demand of reading blocks of 8 pixels (here a word at a time) and reducing them down to a 16 bit or 20 bit index for the table lookup. Also, even though the 5-bit tabled mode as tested here technically only needs 5 bitplanes, it is tested by writing into all 8.
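As a rough illustration of why the ordered mode ends up doing byte sized writes, here is a minimal sketch of the reordering for one line. It is not the code used in the test: the choice of sending byte n of each 8 pixel block to plane n, the interleaved layout with plane 0 first, and the function names are all assumptions of this sketch.

```python
WIDTH = 320
DEPTH = 8
PLANE_ROW = WIDTH // 8   # 40 bytes per plane per line


def ordered_line(packed_line):
    """Reorder one 320 byte packed line into an interleaved planar line,
    so each 8x1 packed block occupies the same 8x1 area in all 8 planes."""
    assert len(packed_line) == WIDTH
    out = bytearray(PLANE_ROW * DEPTH)
    for block in range(PLANE_ROW):
        for plane in range(DEPTH):
            # One byte read and one byte write per packed pixel.
            out[plane * PLANE_ROW + block] = packed_line[block * 8 + plane]
    return bytes(out)


def copy_ops(frame_bytes, bytes_per_op):
    """Number of copy operations needed for a whole frame."""
    return frame_bytes // bytes_per_op


if __name__ == "__main__":
    frame = WIDTH * 200
    print(copy_ops(frame, 1), copy_ops(frame, 4), copy_ops(frame, 64))
```

Because consecutive source bytes go to destinations 40 bytes apart, the writes cannot simply be merged into long words without first gathering four blocks in registers.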

Conclusion
The figures show one thing. With little difference in speed between packed to planar conversion and a straight copy, and with a direct render being even slower, it’s obvious that chip ram is the bottleneck slowing it down and that the Amiga needed VRAM. Had the Amiga featured VRAM for its chip memory, even with bitplanes, it would have sped up graphics even further. Add to this an actual packed planar mode with VRAM and then we’d have had a winning combination!

Last edited by Hypex on 15-Aug-2022 at 01:06 PM.

Status: Offline

Hypex

Re: An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 13-Aug-2022 15:33:19

[ #2 ]

Elite Member

Joined: 6-May-2007
Posts: 11204
From: Greensborough, Australia

Place holder 1

Status: Offline

Hypex

Re: An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 13-Aug-2022 15:33:32

[ #3 ]

Elite Member

Joined: 6-May-2007
Posts: 11204
From: Greensborough, Australia

Place holder 2

Status: Offline

Karlos

Re: An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 13-Aug-2022 15:49:40

[ #4 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4402
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@Hypex

The 32-bit aligned write bandwidth (from the CPU) for vanilla AGA is about 7MB/s. If you are "doing it right", you'd want to ensure that your engine is doing that, and not doing single byte writes. This also applies to any RTG card.

Assuming this bandwidth was your only bottleneck, at 7MB/s you ought to be able to fill about 115 frames per second at 320*200 for 8 bits per pixel.
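The 115 frames per second figure works out if 7MB/s is read as 7 MiB/s. A short sketch of that arithmetic follows, together with the reverse calculation applied to the 11873 us/frame CopyMemQuick figure from post #1. The second number is only an estimate, since it assumes the timed step does nothing but copy one 64,000 byte frame; the function names are made up for illustration.

```python
FRAME_BYTES = 320 * 200   # 8 bits per pixel, 64,000 bytes
MIB = 1024 * 1024


def max_fps(bandwidth_bytes_per_s, frame_bytes=FRAME_BYTES):
    """Upper bound on frame rate if write bandwidth is the only limit."""
    return bandwidth_bytes_per_s / frame_bytes


def implied_bandwidth(us_per_frame, frame_bytes=FRAME_BYTES):
    """Bytes per second implied by a measured per-frame copy time."""
    return frame_bytes / (us_per_frame / 1_000_000)


if __name__ == "__main__":
    print(round(max_fps(7 * MIB)))                    # 7 MiB/s bound
    print(round(implied_bandwidth(11873) / MIB, 2))   # from post #1
```

On those assumptions the measured copy corresponds to a little over 5 MiB/s, somewhat below the 7MB/s quoted for aligned 32-bit writes.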

I don't remember exactly, but if memory serves, a loop unrolled move16 based fill I wrote for the BVPPC managed something like 17MB/s (aligned, cached source data).

I used to have a small benchmark tool that measured VRAM bandwidth using a variety of techniques and had some data for different cards, including PIV. Alas I lost that a long time ago.

Last edited by Karlos on 13-Aug-2022 at 03:50 PM.

_________________
Doing stupid things for fun...

Status: Offline

cdimauro

Re: An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 13-Aug-2022 16:35:58

[ #5 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Karlos: correct.

The problem here is that the 68060@50MHz is a monster compared to AGA's chipmem. So, if you directly render to chip ram instead of fast ram, then you're killing performance. Especially if you don't combine the writes to maximize the chipmem bandwidth usage with 32-bit writes.

Also, AFAIR, chipmem cannot be cached.
And maybe the Amiga 4000 had issues using the burst mode. The MOVE16 might not be used for this reason. But here I don't recall correctly, so it might be ok. To be checked.

In short: direct rendering to chip ram is expected to give poor results. Better to render to fast mem, and then use a quick routine to copy to chip ram: the "usual way". But then the goal of this experiment (testing packed graphics speed against planar) cannot be achieved...

Status: Offline

NutsAboutAmiga

Re: An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 13-Aug-2022 20:13:34

[ #6 ]

Elite Member

Joined: 9-Jun-2004
Posts: 12817
From: Norway

@Hypex

Interesting to see this score, but the score does not necessarily say everything.
I guess it also depends on how the graphics card is connected, like the Blizzard Vision being connected directly to the CPU accelerator card instead of through a PCI adapter or directly through the Zorro slots.

You can’t really test an imaginary chipset, you can only run analyses, but yeah, chip ram will be the limiting factor you hit, and it does not matter how fast the CPU is.

In any case the CPU and the GPU have to share access to the CHIP RAM.

_________________
http://lifeofliveforit.blogspot.no/
Facebook::LiveForIt Software for AmigaOS

Status: Offline

Karlos

Re: An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 13-Aug-2022 21:32:41

[ #7 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4402
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@Hypex

It just occurred to me that this thread split out of the PvP thread (now overrun with x86 v 68k legacy compatibility discussion), which was itself split out of another thread ...

_________________
Doing stupid things for fun...

Status: Offline

matthey

Re: An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 13-Aug-2022 22:45:27

[ #8 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2000
From: Kansas

A graphics card that transfers across a bus is limited by both the bus and graphics memory bandwidth. Zorro II can only transfer about 4MiB/s which is worse than ISA so even a graphics card that supports packed/chunky should be slower for Doom than AGA since the whole frame has to be updated (not true for scrolling, line blitting, rendering local bitmaps/textures, etc where some data can be reused from the card memory). Zorro III can only transfer about 15MiB/s but even that is uncommon due to Buster bugs that CBM did not prioritize fixing (EISA could transfer 20MiB/s while Zorro III should have been capable of more than twice that). It would be interesting to see bus tests to the graphics card memories of the test machines. The bandwidth is probably enough to not restrict performance though.

The Amiga custom chips primarily only have to worry about memory bandwidth. VMEM would have been optimal as Jay Miner noted as dual ported memory allows the custom chips and CPU to access memory at the same time as well as doubling the bandwidth. It probably seemed like VMEM was created for the Amiga and then CBM didn't use it which must have been extremely frustrating for Jay. His motivation was obvious in his drive to finish the Ranger chipset specs before he left CBM only to be ignored and an inferior ECS introduced 3 years after he completed the Ranger specs. Pure incompetence by CBM!

Last edited by matthey on 13-Aug-2022 at 10:50 PM.

Status: Offline

NutsAboutAmiga

Re: An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 13-Aug-2022 22:54:15

[ #9 ]

Elite Member

Joined: 9-Jun-2004
Posts: 12817
From: Norway

@matthey

DMA transfer between FAST and CHIP might help a lot, saving the CPU from doing the work. Something like GART, but that was introduced as an AGP feature pretty late (1997), so it is not relevant to any hardware from 1992. DMA was at least a known technology, though. I guess SCSI worked with internal memory buffers.

_________________
http://lifeofliveforit.blogspot.no/
Facebook::LiveForIt Software for AmigaOS

Status: Offline

Karlos

Re: An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 13-Aug-2022 23:33:35

[ #10 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4402
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

One thing that is worth noting is that if you are doing pure software rendering, you are most likely to want to use a frame buffer in fast local memory while rendering, especially if you have to do any reading (e.g. transparency effects etc). This means that you're going to have to rely on how fast you can copy data from this memory to the display memory.

The most optimum C2P routines on higher end 68K processors approach "copy speed" for 8-bit chunky on AGA, that is to say, the bandwidth bottleneck masks the overhead of performing the C2P step.

_________________
Doing stupid things for fun...

Status: Offline

matthey

Re: An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 14-Aug-2022 4:38:41

[ #11 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2000
From: Kansas

Karlos Quote:

One thing that is worth noting is that if you are doing pure software rendering, you are most likely to want to use a frame buffer in fast local memory while rendering, especially if you have to do any reading (e.g. transparency effects etc). This means that you're going to have to rely on how fast you can copy data from this memory to the display memory.

The most optimum C2P routines on higher end 68K processors approach "copy speed" for 8-bit chunky on AGA, that is to say, the bandwidth bottleneck masks the overhead of performing the C2P step.

Or more simply you could say an optimized C2P conversion is practically free on the 68060?

Status: Offline

cdimauro

Re: An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 14-Aug-2022 6:04:54

[ #12 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Karlos, matthey: I don't think so. C2P requires some time on such processors which isn't "free".

Reading 4 8-bit packed pixels at a time is quite easy, but then you have to unpack them bit by bit, and shift & insert them into 8 different data registers (address registers can't be used for shifting & masking, AFAIR) until you've filled up those registers. And finally write those 8 32-bit values to 8 different locations which are in chip mem (which isn't cacheable).

So, it would be good to have some measurements to see how long it actually takes on such a fast processor / system.
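To show what that bit shuffling amounts to, here is a deliberately naive sketch that converts 8 chunky pixels into one byte for each of the 8 planes. It is not an optimised routine (real 68k C2P code works on 32-bit registers with merge and swap passes), and the function names are made up for illustration. It assumes the usual Amiga convention that the leftmost pixel is the most significant bit and plane 0 holds bit 0 of the colour index.

```python
def c2p_8_pixels(chunky):
    """Convert 8 chunky pixels (one byte each) into 8 plane bytes."""
    assert len(chunky) == 8
    planes = [0] * 8
    for x, pixel in enumerate(chunky):
        for plane in range(8):
            bit = (pixel >> plane) & 1
            planes[plane] |= bit << (7 - x)   # leftmost pixel is the MSB
    return planes


def p2c_8_pixels(planes):
    """Inverse operation, handy for checking the conversion."""
    assert len(planes) == 8
    chunky = [0] * 8
    for plane, value in enumerate(planes):
        for x in range(8):
            bit = (value >> (7 - x)) & 1
            chunky[x] |= bit << plane
    return chunky


if __name__ == "__main__":
    pixels = [0, 1, 2, 3, 252, 253, 254, 255]
    planes = c2p_8_pixels(pixels)
    print(planes, p2c_8_pixels(planes) == pixels)
```

The inner loop touches 64 bits individually for every 8 pixels, which is the work an optimised routine has to hide behind the chip ram writes.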

Status: Offline

pavlor

Re: An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 14-Aug-2022 6:22:38

[ #13 ]

Elite Member

Joined: 10-Jul-2005
Posts: 9583
From: Unknown

@Hypex

As others have written above, fast CPU, slow GFX/bus, don't expect miracles. It is similar to my 486SX 25 MHz notebook with ISA VGA, which gives the very same result in the Doom benchmark as my A1200 (68030 50 MHz) - you see there are far worse GFX options for Doom than AGA.

Status: Online!

Karlos

Re: An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 14-Aug-2022 9:53:53

[ #14 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4402
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@cdimauro

To clarify, it takes CPU time, of course. However what it means is that the elapsed time required to convert a frame in fast ram to planar data in chip ram is about the same as just copying the data. It ceases to be the limiting factor to your framerate. I've not looked for a while but I'm sure some of the better routines were close to copy speed. I used to chase the same ideal for RGB pixel conversion under RTG. Aim to saturate the write bandwidth.

_________________
Doing stupid things for fun...

Status: Offline

cdimauro

Re: An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 14-Aug-2022 10:07:22

[ #15 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Karlos: I'd appreciate it if you or someone else has some numbers, because it looks strange to me.

Maybe Hypex could test only this C2P part.

Status: Offline

Karlos

Re: An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 14-Aug-2022 11:45:40

[ #16 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4402
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@cdimauro

There are some old discussions here http://eab.abime.net/showthread.php?t=74008

I guess the point is that the code is running entirely from the instruction cache and the only memory IO are long reads from fast memory and long writes to chip memory. There are older discussions on Google groups too.

-edit-

If the bus is completely busy on useful data transfer then I think it's fair to say the code is copy speed.

Last edited by Karlos on 14-Aug-2022 at 12:53 PM.

_________________
Doing stupid things for fun...

Status: Offline

NutsAboutAmiga

Re: An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 14-Aug-2022 12:51:45

[ #17 ]

Elite Member

Joined: 9-Jun-2004
Posts: 12817
From: Norway

@Karlos

I agree, C2P vs optimized memory copy routine.
Some crazy Amiga people will probably say that C2P is faster.

So, you're making the assumption that memory is slower than what the CPU can output,
but many of the accelerators have onboard fast memory.

Last edited by NutsAboutAmiga on 14-Aug-2022 at 12:56 PM.
Last edited by NutsAboutAmiga on 14-Aug-2022 at 12:52 PM.

_________________
http://lifeofliveforit.blogspot.no/
Facebook::LiveForIt Software for AmigaOS

Status: Offline

Karlos

Re: An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 14-Aug-2022 14:06:54

[ #18 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4402
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@NutsAboutAmiga

Essentially what I'm saying is that the chip ram write speed from the CPU is the limit on fast 68k with local fast memory. The best C2P routines in those circumstances get close to just copying data to chip ram from fast ram because the write is the bottleneck. Obviously in a fast to fast c2p it is evident that copy would be quicker as there is much more memory bandwidth available.

_________________
Doing stupid things for fun...

Status: Offline

matthey

Re: An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 14-Aug-2022 19:16:04

[ #19 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2000
From: Kansas

Here is another way to describe a copy speed C2P conversion...

A 68060 CPU core is much faster than memory (why we have caches). A memory access takes many cycles while the 68060 core can access data in as little as one cycle throughput. Some early CPUs had to wait until a memory write finished to execute another instruction. Later CPUs were designed to not only allow more instructions to execute until another write occurred but they added a write buffer which allows a queue of pending writes as well. If the write buffer becomes full, then a stall (wait) occurs until a write is finished and there is space in the write buffer again. Many instructions can be executed between the writes especially in the case of the 68060 which can perform up to 2 shifts or masks per cycle as needed by a C2P conversion. A naive copy to chip memory stalls most of the time waiting on the memory writes while a copy speed C2P conversion processes data instead of stalling and finishes in about the same amount of time.

The CV64 and PIV graphics boards have 5%-10% faster fps so what happened to the copy speed C2P being free? The CV64 uses VRAM and the PIV uses 45ns EDO memory which may have twice the bandwidth of AGA chip memory or more yet they barely have more fps. Is the difference the cost of C2P from perhaps an inefficient 68060 C2P conversion routine? Maybe, but I have a hunch that the difference is the newer graphics card memory compared to the cheapest possible AGA chip memory. I would like to see memory write benchmarks from fast memory to the different graphic card and AGA chip memories. Do the graphics boards spend 5%-10% less time writing graphics memory than AGA spends writing chip memory? I'm surprised there is not a bigger difference considering the technology advantage of the graphic card memory but this is low resolution and most of the CPU time is spent on the game rather than writing graphics memory.

Status: Offline

Karlos

Re: An experimental Doom speed test and feasibility study based on a virtual packed-planar mode
Posted on 14-Aug-2022 20:16:50

[ #20 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4402
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@matthey

I could get around 15* MB/s write bandwidth from a 68040@25MHz to the video memory on the Permedia2. 68060 managed better, IIRC. In any event, it was plenty fast enough for Doom.

Up to 18 with dubious move16 hacks...

_________________
Doing stupid things for fun...

Status: Offline

Amigaworld.net was originally founded by David Doyle