Amigaworld.net - The Amiga Computer Community Portal Website

General Technology (No Console Threads)

The (Microprocessors) Code Density Hangout

Gunnar

Re: The (Microprocessors) Code Density Hangout
Posted on 27-Sep-2022 5:59:02

[ #161 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

@cdimauro

Quote:

In fact, I've already created the architecture that kills not only x64, but all others (that I know).

OK, you said several times that you have no knowledge about hardware development,
you have no clue about logic design, but you were able to beat INTEL and all other major CPU companies of the world. IMPRESSIVE!

And you saved so much money!
INTEL and IBM spend millions to verify their design ideas and to develop working prototypes.
What fools they must be. You can do all this without making prototypes: you only need your head.

Do you really think that you designed anything?

Gunnar

Re: The (Microprocessors) Code Density Hangout
Posted on 27-Sep-2022 6:54:47

[ #162 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

@cdimauro

We all know that you are a self-proclaimed Amiga Hardware and CPU expert.
Wasn't "Amiga Hardware expert" your title on the TINA project which you announced in 2013?

Quote:

TiNA is a new board that is being designed and developed in Italy by a team of Amiga enthusiasts.

Their main goal is to make a complete implementation of the Amiga 500 and/or Amiga 1200 by the use of powerful FPGAs. They want to make this come true with a 68020 CPU that is even more powerful than any existing 68060 CPU. It’ll be able to execute 2 in-order instructions per clock cycle, but at 400 (!) MHz

http://www.tinaproject.it/index.html

You announced a new 400MHz, 800 MIPS 68K CPU, running in a low-cost FPGA.
These are very impressive numbers.
Even more impressive if you know that the FPGA you announced you would use cannot come anywhere near 400MHz.

Can you help us understand how much of this CPU you did "design" before announcing it?
Was the CPU working?
Or did you announce all this, including making the nice website and marketing pictures before doing any development on it?

You announced your CPU does 2 instructions per clock cycle and will run at an impressive 400MHz in a low-cost FPGA.

Can you help us understand where you got these numbers from?
How did you know your CPU executes 2 instructions per cycle?
How did you know it runs at 400MHz?

Last edited by Gunnar on 27-Sep-2022 at 07:15 AM.

Gunnar

Re: The (Microprocessors) Code Density Hangout
Posted on 27-Sep-2022 7:38:18

[ #163 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

@cdimauro

cdimauro, please help all readers here understand who you really are?

Are you a "raw diamond", a genius, a guy without any hardware knowledge who was able to develop the world's best CPU architecture, outclassing INTEL and IBM?

Are you an impressive genius who can get several times the achievable clock rate out of a low-end ALTERA FPGA, much more than even ALTERA, the company that builds these FPGAs, could reach?

Or are you just talking a lot of bullshit?

Karlos

Re: The (Microprocessors) Code Density Hangout
Posted on 27-Sep-2022 11:26:26

[ #164 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4958
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@Karlos

Quote:

Karlos wrote:
@matthey

Specifically, I'm only interested in conditional branches. These occupy about 25% of the available primary opcode space in MC64K. Those slots can be replaced with fast path register to register instructions that are 33% smaller than the current realisation and are even simpler to decode. Totally worth it, I think.

Somewhat returning to the topic of code density. I made the changes indicated above in a branch. To recap:
- Normal instructions are of the form [ opcode ] [ ea dst ] [ ... ] [ ea src ] [...]
- Where the operand EA is a register direct type, this was a single byte, but the EA function at runtime still has to be called.
- A fast path rearranged these into [ fast prefix ] [ opcode ] [ reg pair ]

This meant that a typical "fast path" operation like add.l d0, d1 still took 3 bytes to encode, even if it did skip the whole EA decode logic at runtime.

About 25% of the Opcode values were taken up with a fairly rich set of compare and branch instructions (as there's no CC).
- These were changed into just 2 (for now) that use a sub-opcode as the condition to check.
- This freed up a large number of opcodes to use as a reworked fast path of the form [ opcode ] [ reg pair ] which is much more in line with my other load/store VM designs.
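A minimal sketch of the two fast-path layouts described above, using add.l d0, d1 as the example. It assumes 16 registers, so a register pair fits in one byte with one register number per nibble; the prefix byte, the opcode value and the nibble order are hypothetical and are not taken from the MC64K sources.

```python
# Hypothetical values, for illustration only (not taken from the MC64K sources).
FAST_PREFIX = 0xFF   # prefix byte that introduced the old 3-byte fast path
ADD_L = 0x10         # opcode for add.l


def pack_reg_pair(dst: int, src: int) -> int:
    """Pack two 4-bit register numbers into one byte: dst high, src low."""
    if not (0 <= dst < 16 and 0 <= src < 16):
        raise ValueError("register number out of range")
    return (dst << 4) | src


def unpack_reg_pair(byte: int) -> tuple[int, int]:
    """Split a register-pair byte back into (dst, src)."""
    return byte >> 4, byte & 0x0F


def encode_old_fast_path(opcode: int, dst: int, src: int) -> bytes:
    """[ fast prefix ] [ opcode ] [ reg pair ]: 3 bytes."""
    return bytes([FAST_PREFIX, opcode, pack_reg_pair(dst, src)])


def encode_new_fast_path(opcode: int, dst: int, src: int) -> bytes:
    """[ opcode ] [ reg pair ]: 2 bytes; the opcode value itself selects the fast path."""
    return bytes([opcode, pack_reg_pair(dst, src)])


# add.l d0, d1 (68k-style order: source first, so d1 is the destination)
old = encode_old_fast_path(ADD_L, dst=1, src=0)
new = encode_new_fast_path(ADD_L, dst=1, src=0)
print(len(old), len(new))        # 3 2
print(unpack_reg_pair(new[1]))   # (1, 0)
```

Dropping the prefix is only possible because freeing the branch opcodes left enough primary opcode values for every fast-path operation to have its own.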

To test this, I have a simple, absolutely naive Mandelbrot generation program that plots the set at 2048 x 2048 with a max iteration depth of 128.

- The original version (using the 3-byte fast path encoding where possible) produces a bytecode chunk in the binary of 483 bytes.

- The code-density version produces a bytecode chunk of 428 bytes, a reduction of about 11.4%

However, what matters is execution. Each version was run a fixed number of times, and the best values reported by the time command were used (the VM was compiled without the internal instrumentation option, which would skew the results).

- Original: 5.181 seconds (user time)

- Code Density: 3.873 seconds (user time)

That's a speed increase of 33.7%, almost exactly the same as the effective reduction in fast-path instruction size going from 3 bytes to 2. However, another factor is that the 2-byte instructions are also simpler to decode, as they have no prefix.
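The percentages above can be rechecked from the quoted figures. The post measures speed as throughput (old time / new time), which comes out at about 33.8% when rounded; expressed as time saved, the same run is about 25% shorter. The helper names are illustrative only.

```python
def size_reduction(old: float, new: float) -> float:
    """Percentage by which a quantity shrank."""
    return (old - new) / old * 100


def speedup(old_time: float, new_time: float) -> float:
    """Percentage increase in throughput: old time / new time - 1."""
    return (old_time / new_time - 1) * 100


# Figures quoted in the post above (Mandelbrot test).
print(f"bytecode: {size_reduction(483, 428):.1f}% smaller")             # 11.4
print(f"speed:    {speedup(5.181, 3.873):.1f}% faster")                 # 33.8
print(f"time:     {size_reduction(5.181, 3.873):.1f}% shorter")         # 25.2
print(f"encoding: {size_reduction(3, 2):.1f}% smaller (3 -> 2 bytes)")  # 33.3
print(f"peak:     {(620 / 500 - 1) * 100:.0f}% more MIPS")              # 24
```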

Peak interpretive performance for the new fast path instruction format is about 620 MIPS (add.q r1, r0) on my machine. It was about 500 previously. I probably need to improve the benchmarking methodology, e.g. by locking to a single core and preventing speed stepping.

Further improvements are possible because there are no equivalent "fast path" variants for any compare and branch (except for dbnz). The Mandelbrot code's bailout check has an fbgt.s instruction (which is executed up to 128 times per pixel) that would benefit from:
- A fast path for register to register (currently is EA decode)
- A short form given it currently only supports 32-bit branch displacements.
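A sketch of the short-form idea for a register-to-register compare and branch: use an 8-bit displacement when it fits and fall back to 32 bits otherwise. The layout [ opcode ] [ condition ] [ reg pair ] [ displacement ] and all opcode and condition values are hypothetical, chosen only to illustrate the size difference (4 bytes versus 7).

```python
import struct

# Hypothetical encodings, for illustration only (not MC64K's real values).
BCC_REG_SHORT = 0x20   # [ op ] [ cond ] [ reg pair ] [ disp8 ]
BCC_REG_LONG = 0x21    # [ op ] [ cond ] [ reg pair ] [ disp32, little-endian ]
COND_GT = 0x05         # hypothetical condition sub-opcode for "greater than"


def encode_branch(cond: int, reg_a: int, reg_b: int, displacement: int) -> bytes:
    """Encode a compare-and-branch, using the short form when the
    displacement fits in a signed byte."""
    pair = (reg_a << 4) | reg_b
    if -128 <= displacement <= 127:
        return bytes([BCC_REG_SHORT, cond, pair]) + struct.pack("<b", displacement)
    return bytes([BCC_REG_LONG, cond, pair]) + struct.pack("<i", displacement)


def decode_branch(code: bytes) -> tuple[int, int, int, int]:
    """Return (cond, reg_a, reg_b, displacement) for either form."""
    op, cond, pair = code[0], code[1], code[2]
    if op == BCC_REG_SHORT:
        (disp,) = struct.unpack_from("<b", code, 3)
    elif op == BCC_REG_LONG:
        (disp,) = struct.unpack_from("<i", code, 3)
    else:
        raise ValueError(f"not a branch opcode: {op:#04x}")
    return cond, pair >> 4, pair & 0x0F, disp


# A backward loop branch easily fits the short form; a far jump does not.
loop_back = encode_branch(COND_GT, 1, 2, -20)
far_jump = encode_branch(COND_GT, 1, 2, 100_000)
print(len(loop_back), len(far_jump))   # 4 7
print(decode_branch(loop_back))        # (5, 1, 2, -20)
```

Tight inner loops like the bailout check are exactly where the branch target is close, so the short form would cover the hot path.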

Last edited by Karlos on 27-Sep-2022 at 02:07 PM.

_________________
Doing stupid things for fun...

Hammer

Re: The (Microprocessors) Code Density Hangout
Posted on 27-Sep-2022 11:31:51

[ #165 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6504
From: Australia

@cdimauro

Quote:

As usual, you don't read what people write. In fact, I've already created the architecture that kills not only x64, but all others (that I know). AND I've posted some preliminary results (preliminary because they could be much better with a backend for it, which would allow it to fully exploit its features) in this thread:
https://amigaworld.net/modules/newbb/viewtopic.php?mode=viewtopic&topic_id=44169&forum=17&start=0&viewmode=flat&order=0#841312
Nice, eh?

BTW I haven't yet shown the slides with the comparisons between its vector unit and Intel's AVX-512, RISC-V's vector extension, and ARM's SVE (vector extension as well), because it's more embarrassing for them.

To quote a typical Sony fanboy, it has no games.

Creating a new CPU instruction set with no games.... that's a failure for most of Amiga's audience.

Have you departed from Amiga's primary target audience?

Vampire's success is mostly due to carrying the WHDLoad 68k Amiga game library legacy onto next-level performance hardware without abandoning that legacy the way the PowerPC camp did; Apple's 68K-to-PPC migration model wouldn't work for the Amiga's game-console-like nature.

PiStorm/Emu68 also respects the WHDLoad 68k Amiga game library legacy.

Your fictional X64 CPU replacement does nothing for Geekbench, Cinebench R20/R23, Blender 3D, X64 PC games and 'etc'.

AMD is pretty good at packing transistors into a given area.

AMD's 40 nm Bobcat with 400 million transistors and a 74 mm² die vs Intel's 45 nm Pineview with 178 million transistors and an 87 mm² die.
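For a rough comparison, the quoted counts and areas work out to the transistor densities below, a factor of about 2.6. This sketch uses only the figures in the post. The two chips are on different nodes (40 nm vs 45 nm), and an ideal shrink from 45 nm to 40 nm alone would account for roughly (45/40)² ≈ 1.27x, so not all of the difference can be credited to layout.

```python
def density_m_per_mm2(transistors: float, area_mm2: float) -> float:
    """Millions of transistors per square millimetre."""
    return transistors / area_mm2 / 1e6


# Figures as quoted in the post above.
bobcat = density_m_per_mm2(400e6, 74.0)     # AMD Bobcat, 40 nm
pineview = density_m_per_mm2(178e6, 87.0)   # Intel Pineview, 45 nm
ideal_shrink = (45 / 40) ** 2               # area scaling from the node change alone

print(f"Bobcat:   {bobcat:.2f} M/mm^2")     # 5.41
print(f"Pineview: {pineview:.2f} M/mm^2")   # 2.05
print(f"ratio {bobcat / pineview:.2f}, ideal 45->40 nm shrink {ideal_shrink:.2f}")
```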

For the PS4 game console win at 28 nm TSMC, AMD packed in Jaguar cores that competed with the ARM Cortex-A15 on chip area. Unlike the 32-bit Cortex-A15, Jaguar is a 64-bit core, and it also has 128-bit SIMD and out-of-order processing.

The instruction set alone doesn't complete the solution for a chip product when layout skills are also important.

Last edited by Hammer on 27-Sep-2022 at 11:54 AM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

Karlos

Re: The (Microprocessors) Code Density Hangout
Posted on 27-Sep-2022 11:41:39

[ #166 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4958
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@cdimauro

Quote:

cdimauro wrote:

@Karlos

Quote:

Karlos wrote:
@cdimauro

I don't want to toot my own horn, but I actually built that

Good for you.

Have you made an RTL/HDL design (that was the discussion before) out of one of your architectures?

No. While I've messed with hardware, I'm really not a hardware designer. I don't think my designs would necessarily lend themselves well to hardware implementation:
- They tend to be based on bytecode enumerations rather than individual bits having any specific meaning. While this is good for software realisations, I expect this is a poor match for actual hardware logic.

- They don't generally have any condition codes. I started that way but soon stopped because the code that tends to get written doesn't need them until there's a branch and it's generally a simpler proposition to have a "compare and branch" instruction than maintain a ton of state you don't use most of the time. I expect this is also a poor fit for hardware because I can envisage it's quite easy to route ALU signals from any operation into a bit position of some CC register - effectively something you get "for free". Some architectures, e.g. PPC have variations that do or do not update the CC but I've yet to see any real hardware that doesn't rely on condition codes for implementing branching.
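The two branching styles contrasted above can be sketched side by side: a fused compare-and-branch evaluates its condition only when the branch executes, while a CC-style design records flags after the compare and tests them later. This is an illustrative model, not MC64K's implementation. It uses Python's unbounded integers, so the overflow (V) flag that real fixed-width hardware needs for signed comparisons is deliberately left out.

```python
import operator

# Fused compare-and-branch: the condition is evaluated only when a branch
# executes, so nothing has to keep flag state up to date after every ALU op.
CONDITIONS = {
    "eq": operator.eq, "ne": operator.ne,
    "lt": operator.lt, "le": operator.le,
    "gt": operator.gt, "ge": operator.ge,
}


def compare_and_branch(cond: str, a: int, b: int, pc: int, target: int) -> int:
    """Next pc for a hypothetical 'b<cond> a, b, target' instruction."""
    return target if CONDITIONS[cond](a, b) else pc + 1


# CC-style alternative: a compare (subtract) records flags, a later branch tests them.
def flags_after_cmp(a: int, b: int) -> dict[str, bool]:
    result = a - b          # unbounded int, so no overflow (V) flag is needed here
    return {"Z": result == 0, "N": result < 0}


def branch_on_flags(cond: str, flags: dict[str, bool], pc: int, target: int) -> int:
    z, n = flags["Z"], flags["N"]
    taken = {
        "eq": z, "ne": not z,
        "lt": n, "ge": not n,
        "gt": not n and not z, "le": n or z,
    }[cond]
    return target if taken else pc + 1


# Both styles pick the same successor for every condition.
print(all(
    compare_and_branch(c, a, b, 10, 99) == branch_on_flags(c, flags_after_cmp(a, b), 10, 99)
    for c in CONDITIONS for a in range(-3, 4) for b in range(-3, 4)
))  # True
```

In a software VM the first style avoids computing flags nobody reads; in hardware, as noted above, the flags come almost for free, which is why the trade-off differs.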

Hammer

Re: The (Microprocessors) Code Density Hangout
Posted on 27-Sep-2022 12:18:21

[ #167 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6504
From: Australia

@Gunnar

Quote:

Gunnar wrote:
@cdimauro

cdimauro, please help all readers here understand who you really are?

Are you a "raw diamond", a genius, a guy without any hardware knowledge who was able to develop the world's best CPU architecture, outclassing INTEL and IBM?

Are you an impressive genius who can get several times the achievable clock rate out of a low-end ALTERA FPGA, much more than even ALTERA, the company that builds these FPGAs, could reach?

Or are you just talking a lot of bullshit?

cdimauro seems to be Cesare Di Mauro, Principal QA Engineer at BMW Car IT GmbH, Neu-Ulm, Bavaria, Germany.

Embedded processor mindset doesn't nothing for close source legacy software.

I have read Cesare Di Mauro's http://docplayer.net/50921629-X86-x64-assembly-python-new-cpu-architecture-to-rule-the-world.html

For PC gaming, the CPU-to-PCIe pathway, the lowest latency, the highest effective IPC, and the highest clock speed have higher priorities.

Cesare Di Mauro's re-compilation of closed-source X86/X86-64 PC legacy software would break DRM/anti-cheat certificate checks. Valve has invested in R&D with DRM providers for SteamOS's Proton/DXVK layer.

Xbox Series X and PS5 have transparent decompression/compression hardware I/O.

Last edited by Hammer on 27-Sep-2022 at 01:42 PM.

Bosanac

Re: The (Microprocessors) Code Density Hangout
Posted on 27-Sep-2022 12:26:19

[ #168 ]

Regular Member

Joined: 10-May-2022
Posts: 257
From: Unknown

@cdimauro

Apologies for using your real name earlier, didn't expect you to get doxxed by stalkers.

Karlos

Re: The (Microprocessors) Code Density Hangout
Posted on 27-Sep-2022 12:34:26

[ #169 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4958
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@Hammer

Uncool doxxing aside...

Quote:
Embedded processor mindset doesn't nothing for close source legacy software.

Beep boop. This sentence doesn't.

Hammer

Re: The (Microprocessors) Code Density Hangout
Posted on 27-Sep-2022 12:57:27

[ #170 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6504
From: Australia

@Karlos

Cesare Di Mauro's information is on the public internet.

There's nothing wrong with proposing yet another instruction set, but there are complications.

EuroPython 2013 is a public forum.

Adding Python hardware acceleration is an interesting idea e.g. https://www.digikey.com.au/en/maker/blogs/2018/python-on-hardware

AMD has added AVX-512 to mainstream desktop PC CPUs, and what's coming after AVX-512 is a matter of debate.

Last edited by Hammer on 27-Sep-2022 at 01:21 PM.

Hammer

Re: The (Microprocessors) Code Density Hangout
Posted on 27-Sep-2022 13:01:45

[ #171 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6504
From: Australia

@Bosanac

For the record, I didn't get his real name from AW.net. I searched Google for "NEx64T instruction" and found his real name.

There's nothing wrong with Cesare's proposed instruction set advocacy and I support his free speech.

My own interest is running my existing software library with higher performance and at cost-effective prices.

Jensen Huang of NVIDIA, Dr. Lisa Su of AMD, and Patrick Gelsinger of Intel do not hide from the market.

Last edited by Hammer on 27-Sep-2022 at 01:14 PM.

Karlos

Re: The (Microprocessors) Code Density Hangout
Posted on 27-Sep-2022 13:23:07

[ #172 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4958
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@Hammer

Quote:
Jensen Huang of NVIDIA, Dr. Lisa Su of AMD

They may not hide from the market, but you never see them together in the same room at the same time, do you? Just sayin'.

Hammer

Re: The (Microprocessors) Code Density Hangout
Posted on 27-Sep-2022 13:27:20

[ #173 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6504
From: Australia

@Karlos

https://www.techtimes.com/articles/253736/20201030/fact-check-nvidia-ceo-uncle-amds-dr-lisa-su.htm
Technically, it is safe to say that Lisa Su's own grandfather is actually Jen-Hsun Huang's uncle. Although they aren't really niece and uncles, they are very close relatives. Jen-Hsun Huang (otherwise known as Jensen) owns a degree in electrical engineering and was a co-founder of NVIDIA back in 1993 during his 30th birthday

AMD's Lisa Su is related to NVIDIA's Jensen Huang.

Remember, Jensen Huang is an ex-AMD engineer. It's in the family.

Last edited by Hammer on 27-Sep-2022 at 01:29 PM.

Karlos

Re: The (Microprocessors) Code Density Hangout
Posted on 27-Sep-2022 19:39:33

[ #174 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4958
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@Hammer

That's what they want you to think...

Karlos

Re: The (Microprocessors) Code Density Hangout
Posted on 27-Sep-2022 19:50:22

[ #175 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4958
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

The code density improvements to mc64k are now merged. The Mandelbrot test shows around a 42% performance improvement. Pretty much all the arithmetic in the main loop is now using one of the 2-byte register to register operations.

https://github.com/IntuitionAmiga/MC64000/blob/main/assembler/test_projects/mandelbrot/src/sp_register.s

matthey

Re: The (Microprocessors) Code Density Hangout
Posted on 28-Sep-2022 2:20:07

[ #176 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2752
From: Kansas

I found the other old paper on register efficiency based on the number of GP registers.

High-Performance Extendable Instruction Set Computing
https://www.researchgate.net/publication/3888194_High-performance_extendable_instruction_set_computing

The data is based on a MIPS 3000 which has 27 GP registers. Data was recorded as the number of available GP registers in a compiler was reduced down to 8.

No. of Regs | Program size | Load/Store | Move
27          | 100.00%      | 27.90%     | 22.58%
24          | 100.35%      | 28.21%     | 22.31%
22          | 100.51%      | 28.34%     | 22.27%
20          | 100.56%      | 28.38%     | 22.24%
18          | 100.97%      | 28.85%     | 21.93%
16          | 101.62%      | 30.22%     | 20.47%
14          | 103.49%      | 31.84%     | 19.28%
12          | 104.45%      | 34.31%     | 16.39%
10          | 109.41%      | 41.02%     | 10.96%
8           | 114.76%      | 44.45%     | 8.46%

RISC architectures benefit from a few more than 16 GP registers. MIPS is pretty much the worst case, as it has few addressing modes and needs several GP registers for address calculations. Most RISC architectures need a free GP register for loads as well (unless they have a reg-to-mem exchange instruction, which is rare). The benefits of more than 16 GP registers are still small, as the paper chose 16 GP registers based on the above chart for a compressed RISC encoding. The biggest concerns are elevated "Load/Store" mem accesses and increased instruction counts. Fewer than 16 GP registers leads to elevated mem accesses, especially approaching 8 GP registers. 16 GP registers cause only 2.32% more memory accesses than 27, even for MIPS, which is likely near the worst case for RISC. Recall the "Performance Characterization of the 64-bit x86 Architecture from Compiler Optimizations’ Perspective" paper, which also gave the increase in mem accesses when decreasing GP registers from the 16 of x86-64 to the 8 of x86.

https://link.springer.com/content/pdf/10.1007/11688839_14.pdf

Quote:

In addition to performance, the normalized dynamic number of memory accesses (including stack accesses) for the CINT2000 and CFP2000 is shown in Fig. 8. On average, with the REG_8 configuration, the memory references are increased by 42% for the CINT2000 and by 78% for the CFP2000; with the REG_12 configuration, the memory references are increased by 14% for CINT2000 and by 29% for the CFP2000.

For whatever reason, the first paper shows an increase of 14.23% memory accesses while the 2nd paper shows an increase of 42% from 16 to 8 GP registers. Both papers show a large increase in the number of mem accesses from 16 to 8 GP registers, but this only resulted in a 4.4% slowdown in the 2nd paper. The first paper's data shows that the mem access difference between 16 and 27 GP registers is 16% of the difference between 8 and 16 GP registers. 16% of the 4.4% slowdown would be a 0.71% slowdown that could be avoided with 27 instead of 16 GP registers. RISC architectures often waste some of their 32 GP registers on a zero register, link register and other specialized registers because having all 32 GP registers doesn't make much difference in performance, yet the Apollo core needs 48 GP integer registers along with all the CISC techniques that reduce the need for GP registers, like reg-mem accesses and powerful addressing modes. Maybe 8 more GP integer registers would have gained 1% performance on low memory bandwidth hardware but, no, it had to be 24 more GP registers, which certainly wouldn't be used in low memory bandwidth embedded hardware. The extra registers are not orthogonal like the 32 GP registers of RISC either. The evidence was given years ago but ignored.
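The arithmetic above can be rechecked against the Load/Store column of the MIPS table, treating that column as a percentage share (an assumption about the first paper's exact metric). One caveat: the 14.23 figure is a difference in percentage points of that share, whereas the 42% from the second paper is a relative increase; measured relatively, the first paper's 16-to-8 change is about 47%. The proportional estimate reproduces the post's 16% and a gain of roughly 0.7%.

```python
# Load/Store column of the MIPS table above, in percent.
load_store = {27: 27.90, 16: 30.22, 8: 44.45}
slowdown_16_to_8 = 4.4   # % slowdown from 16 to 8 registers, from the second paper


def points(a: int, b: int) -> float:
    """Difference in percentage points between register counts a and b."""
    return load_store[a] - load_store[b]


def relative(a: int, b: int) -> float:
    """Relative increase (in %) going from b registers to a registers."""
    return (load_store[a] / load_store[b] - 1) * 100


pp_16_to_8 = points(8, 16)        # extra points when dropping from 16 to 8 regs
pp_27_to_16 = points(16, 27)      # extra points when dropping from 27 to 16 regs
fraction = pp_27_to_16 / pp_16_to_8
estimated_gain = fraction * slowdown_16_to_8   # the post's proportional estimate

print(f"16 -> 8:  +{pp_16_to_8:.2f} points, {relative(8, 16):.0f}% relative")
print(f"27 -> 16: +{pp_27_to_16:.2f} points, {relative(16, 27):.1f}% relative")
print(f"fraction {fraction:.0%}, estimated gain ~{estimated_gain:.2f}%")
```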

The first paper above gives some code density comparisons, but they are from old EGCS compiles (EGCS was replaced by GCC for good reason). The 68020 ranked 6th of 24 architectures and PPC ranked 21st of 24, faring worse than MIPS and SPARC.

I ran across a couple of other interesting papers while searching.

Comparative Architectures, CST Part II, 16 lectures, Lent Term 2005 (Ian Pratt)
https://dokumen.tips/documents/comparative-architectures-clcamacuk-8086-80286-80386-80486-pentium-pentium.html?page=1

Code Density Straw Poll (page 52)

gcc
arch  |  text |  data | bss | total
68k   | 36152 |  4256 | 360 | 40768
x86   | 29016 | 14861 | 468 | 44345
alpha | 46224 | 24160 | 472 | 70856
mips  | 57344 | 20480 | 880 | 78704
hp700 | 66061 | 15708 | 852 | 82621

gcc-cc1
arch  |    text |   data |   bss |   total
68k   |  932208 |  16992 | 57328 | 1006528
x86   |  995984 | 156554 | 73024 | 1225562
hp700 | 1393378 |  21188 | 72868 | 1487434
alpha | 1447552 | 272024 | 90432 | 1810008
mips  | 2207744 | 221184 | 76768 | 2505696

pgp
arch  |   text |  data |    bss |  total
68k   | 149800 |  8248 | 229504 | 387552
x86   | 163840 |  8192 | 227472 | 399504
hp700 | 188013 | 15320 | 228676 | 432009
mips  | 188416 | 40960 | 230144 | 459520
alpha | 253952 | 57344 | 222240 | 533536

The 68k has the best code density against this easy competition and with a decent compiler. Alpha came out better than I expected, while MIPS was worse.
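A small sketch that ranks the straw-poll figures relative to the 68k, using only the numbers in the tables above. On text size alone, the usual code density measure, x86 comes out ahead of the 68k for gcc (29016 vs 36152); the 68k leads that row on total size mainly because of its much smaller data segment. For gcc-cc1 and pgp, the 68k has the smallest text as well.

```python
PARTS = ("text", "data", "bss")

# (text, data, bss) from the straw poll above.
SIZES = {
    "gcc": {
        "68k": (36152, 4256, 360), "x86": (29016, 14861, 468),
        "alpha": (46224, 24160, 472), "mips": (57344, 20480, 880),
        "hp700": (66061, 15708, 852),
    },
    "gcc-cc1": {
        "68k": (932208, 16992, 57328), "x86": (995984, 156554, 73024),
        "hp700": (1393378, 21188, 72868), "alpha": (1447552, 272024, 90432),
        "mips": (2207744, 221184, 76768),
    },
    "pgp": {
        "68k": (149800, 8248, 229504), "x86": (163840, 8192, 227472),
        "hp700": (188013, 15320, 228676), "mips": (188416, 40960, 230144),
        "alpha": (253952, 57344, 222240),
    },
}


def part_size(program: str, arch: str, part: str) -> int:
    """Size of one segment, or of all three when part == "total"."""
    sizes = SIZES[program][arch]
    return sum(sizes) if part == "total" else sizes[PARTS.index(part)]


def relative(program: str, part: str, baseline: str = "68k") -> dict[str, float]:
    """Each architecture's size divided by the baseline's size."""
    base = part_size(program, baseline, part)
    return {arch: part_size(program, arch, part) / base for arch in SIZES[program]}


for program in SIZES:
    for part in ("text", "total"):
        ratios = relative(program, part)
        ranking = sorted(ratios, key=ratios.get)
        print(f"{program:8} {part:5} " + ", ".join(f"{a} {ratios[a]:.2f}" for a in ranking))
```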

The same paper on page 55 gives conditional branch frequency of about 16% for SPECint92 which is only behind load and ahead of ADD and CMP instructions (MIPS?).

Last edited by matthey on 28-Sep-2022 at 02:31 AM.

cdimauro

Re: The (Microprocessors) Code Density Hangout
Posted on 28-Sep-2022 4:21:06

[ #177 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4438
From: Germany

@all. No problem with my personal information: it was my decision to make it public, starting with my nickname.

I've also linked my technology blog here on my profile, so it's easy to know who I am and to reach my LinkedIn or Facebook profiles as well.

Hammer

Re: The (Microprocessors) Code Density Hangout
Posted on 28-Sep-2022 4:21:28

[ #178 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6504
From: Australia

@matthey

https://barefeats.com/doom3.html
The real-world problems with PowerPC and Altivec when running Doom 3.

MAC GAME PERFORMANCE BRIEFING FROM THE DOOM 3 DEVELOPERS

Glenda Adams, Director of Development at Aspyr Media, has been involved in Mac game development for over 20 years. I asked her to share a few thoughts on what attempts they had made to optimize Doom 3 on the Mac and what barriers prevented them from getting it to run as fast on the Mac as in comparable Windows PCs. Here's what she wrote:

"Just like the PC version, timedemos should be run twice to get accurate results. The first run the game is caching textures and other data into RAM, so the timedemo will stutter more. Running it immediately a second time and recording that result will give more accurate results.

The performance differences you see between Doom 3 Mac and Windows, especially on high-end cards, is due to a lot of factors (in general order from smallest impact to largest):

1. PowerPC architectural differences, including a much higher penalty for float to int conversion on the PPC. This is a penalty on all games ported to the Mac, and can't be easily fixed. It requires re-engineering much of the game's math code to keep data in native formats more often. This isn't 'bad' coding on the PC -- they don't have the performance penalty, and converting results to ints saves memory and can be faster in many algorithms on that platform. It would only be a few percentage points that could be gained on the Mac, so its one of those optimizations that just isn't feasible to do for the speed increase.

2. Compiler differences. gcc, the compiler used on the Mac, currently can't do some of the more complex optimizations that Visual Studio can on the PC. Especially when inlining small functions, the PC has an advantage. Add to this that the PowerPC has a higher overhead for functional calls, and not having as much inlining drops frame rates another few percentage points.

--------------
There are other issues besides code densities.

The best 68K implementation (minus the 68K MMU and FP80) is the AC68080, followed by the MC68060 Rev 6 @ 105 MHz (with the 68K MMU and FP80, but missing a few 68000/68020 instructions).

PiStorm/RPI 3A+(ARM Cortex A53)/Emu68 is effectively similar to Transmeta's Code Morph Software that targeted X86/X86-64 external instruction set on VLIW-based microarchitecture CPU.

Last edited by Hammer on 28-Sep-2022 at 04:23 AM.

cdimauro

Re: The (Microprocessors) Code Density Hangout
Posted on 28-Sep-2022 5:09:20

[ #179 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4438
From: Germany

@Gunnar

Quote:

Gunnar wrote:
@cdimauro

Quote:

In fact, I've already created the architecture that kills not only x64, but all others (that I know).

OK, you said several times that you have no knowledge about hardware development,
you have no clue about logic design, but you were able to beat INTEL and all other major CPU companies of the world. IMPRESSIVE!

And you saved so much money!
INTEL and IBM spend millions to verify their design ideas and to develop working prototypes.
What fools they must be. You can do all this without making prototypes: you only need your head.

Do you really think that you designed anything?

Do you know what, Gunnar? You might be a smartass hardware engineer, but when it comes to anything outside that, you seem to be a dumbass. And it's getting ridiculous, since you're supposed to be a professional and should be able to understand some very basic concepts, like "interfaces" and "implementations", as I said before.

Anyway, let me briefly explain the concept to you, at elementary school level, hoping that it finally gets into your head.

When talking about computer architectures there are several levels of abstraction involved. Let's focus on the most important ones for this discussion / context:

- Architecture. For a processor it defines the so-called Instruction Set Architecture, AKA ISA: how many registers, their sizes, their purposes, how many instructions, how they work, and how they are encoded (e.g. the opcode structure). For a chipset, like the one on the Amiga, it defines the same level of detail, but for all those peripherals: the purpose of each peripheral, its registers, their sizes, their purposes, and how they interact with the system. Let's stop here, but it should be clear enough.

- Microarchitecture. It's a specific implementation of an Architecture. For a processor it defines the external address and data bus sizes, how many registers are actually used (example: more physical registers than architectural ones for an OoO implementation), whether some registers are split internally (example: SSE 128-bit registers might be broken up into two 64-bit registers), how many read ports and write ports are used for a register file (different register domains might have different numbers), whether all instructions are microcoded, directly executed, or a combination of both, how many caches there are, with their sizes, associativities and line granularity, how many ALUs / execution units there are and what they specifically do (which kinds of instructions they execute), etc. There are other things, but I'll stop here because the scope should be clear enough. For the chipset the situation is similar: you have the address and data bus sizes for the peripherals, internal buffers for data fetched from / sent to the external bus (examples: for the serial port, or for the display controller, the disk drive, etc.), how many read and write ports are implemented for the CLUT, whether or not there's a cache for the instructions executed by coprocessors like the Copper, how many pipelines are used to implement the Blitter operations, how the audio channels are converted from digital to analog (example: using a 1-bit DAC), etc. I'll stop here, but it should be clear.

- RTL/HDL. It's the concrete implementation of a Microarchitecture, which is usually done with hardware-level languages like VHDL or Verilog. The sources describe and implement the exact details exposed on the Microarchitecture and Architecture.
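The Architecture / Microarchitecture split described above can be illustrated in software as an interface with two implementations. In this sketch the tiny 8-register ISA and both core classes are hypothetical: one core has exactly one storage slot per architectural register, the other renames registers onto a larger physical register file, and software running on either sees identical architectural state.

```python
from abc import ABC, abstractmethod

NUM_ARCH_REGS = 8   # hypothetical tiny ISA with 8 architectural registers
MASK32 = 0xFFFFFFFF


class Architecture(ABC):
    """What software sees: registers and the semantics of each instruction."""

    @abstractmethod
    def read(self, reg: int) -> int: ...

    @abstractmethod
    def write(self, reg: int, value: int) -> None: ...

    def add(self, rd: int, rs: int) -> None:
        # Architectural semantics of "add rd, rs": rd <- rd + rs, 32-bit wraparound.
        self.write(rd, (self.read(rd) + self.read(rs)) & MASK32)


class SimpleCore(Architecture):
    """Microarchitecture 1: one storage slot per architectural register."""

    def __init__(self) -> None:
        self.regs = [0] * NUM_ARCH_REGS

    def read(self, reg: int) -> int:
        return self.regs[reg]

    def write(self, reg: int, value: int) -> None:
        self.regs[reg] = value


class RenamingCore(Architecture):
    """Microarchitecture 2: more physical registers than architectural ones,
    with a rename table pointing at the latest copy of each register."""

    def __init__(self, num_physical: int = 32) -> None:
        self.physical = [0] * num_physical
        self.rename = list(range(NUM_ARCH_REGS))          # arch reg -> physical reg
        self.free = list(range(NUM_ARCH_REGS, num_physical))

    def read(self, reg: int) -> int:
        return self.physical[self.rename[reg]]

    def write(self, reg: int, value: int) -> None:
        new = self.free.pop(0)                  # allocate a fresh physical register
        self.physical[new] = value
        # Simplification: the old copy is freed at once; a real out-of-order
        # core would free it only when the writing instruction retires.
        self.free.append(self.rename[reg])
        self.rename[reg] = new


def run_program(core: Architecture) -> list[int]:
    """Same software on any implementation: r0 = 5, r1 = 7, r0 += r1, r0 += r0."""
    core.write(0, 5)
    core.write(1, 7)
    core.add(0, 1)
    core.add(0, 0)
    return [core.read(r) for r in range(NUM_ARCH_REGS)]


print(run_program(SimpleCore()) == run_program(RenamingCore()))   # True
```

The `Architecture` class plays the role of the ISA (the interface), and each subclass is one possible microarchitecture; an RTL/HDL description would be the next level down, which this sketch does not attempt.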

With this clarified: I'm an expert in Architectures, have good knowledge of Microarchitectures, and zero knowledge of RTL/HDL.

So, as it should be very easy to understand now (HOPEFULLY! Otherwise you're hopeless), what you said before is complete bullsh*t, since this is clearly NOT my domain / expertise.

Is that clear?
Quote:

Gunnar wrote:
@cdimauro

We all know that you are a self-proclaimed Amiga Hardware and CPU expert.

Am I not? Care to show why?

And please, let me know if there's an official process to acquire those titles, because I'm curious.
Quote:
Wasn't "Amiga Hardware expert" your title on the TINA project which you announced in 2013?

Yes, and? Isn't my expertise enough for it?
Quote:
Quote:

TiNA is a new board that is being designed and developed in Italy by a team of Amiga enthusiasts.

Their main goal is to make a complete implementation of the Amiga 500 and/or Amiga 1200 by the use of powerful FPGAs. They want to make this come true with a 68020 CPU that is even more powerful than any existing 68060 CPU. It’ll be able to execute 2 in-order instructions per clock cycle, but at 400 (!) MHz

http://www.tinaproject.it/index.html

You announced a new 400MHz, 800 MIPS 68K CPU, running in a low-cost FPGA.
These are very impressive numbers.
Even more impressive if you know that the FPGA you announced you would use cannot come anywhere near 400MHz.

Can you help us understand how much of this CPU you did "design" before announcing it?
Was the CPU working?
Or did you announce all this, including making the nice website and marketing pictures before doing any development on it?

You announced your CPU does 2 instructions per clock cycle and will run at an impressive 400MHz in a low-cost FPGA.

Can you help us understand where you got these numbers from?
How did you know your CPU executes 2 instructions per cycle?
How did you know it runs at 400MHz?

I don't know if it's because you're ageing and tend to forget things, or if Mother Nature did a bad job with you, but I've already explained it twice (which doesn't seem to be enough for you):

the memory bus and the hardware implementation were driven by the owner of the company

It was his hardware engineers who reported that what you described was possible.

My duties on the project were OTHER (designing the Architecture; see above for more details). Understood now?

Regarding the site, I haven't written a single byte / character of the TiNA site: everything was made by the company's employees. I was just a user on the forum, with additional moderator privileges only, so not even an admin.

Is it clear now? Let's see how many more times I'll have to repeat it, old grumpy Gunnar...
Quote:

Gunnar wrote:
@cdimauro

cdimauro, please help all readers here understand who you really are?

Someone already did it...
Quote:
Are you a "raw diamond", a genius, a guy without any hardware knowledge who was able to develop the world's best CPU architecture, outclassing INTEL and IBM?

Are you an impressive genius who can get several times the achievable clock rate out of a low-end ALTERA FPGA, much more than even ALTERA, the company that builds these FPGAs, could reach?

See above: you continue to confuse my expertise and what I did with my project.
Quote:
Or are you just talking a lot of bullshit?

Only in your limited mind, since you're unable to conceive the differences between Architectures, Microarchitectures, RTL/HDL, etc., and you mix everything on a big cauldron.

The problem with you is that, as a hardware engineer, you see the world only from this perspective.

It's like a square that thinks the world is made of squares, and if/when it sees something different it complains because it should have been a square and work as a square, of course...

Anyway, you continue to write A LOT but you find no time to PROVE your previous statement about the 68080 address registers. I've asked several times, but still nothing. Then to me it means that what you said is a pure bullsh*t and you're a liar.

@all: I've no time to reply to other messages. I'll do when I can.

P.S. No time to read.

Gunnar

Re: The (Microprocessors) Code Density Hangout
Posted on 28-Sep-2022 6:58:50

[ #180 ]

Cult Member

Joined: 25-Sep-2022
Posts: 512
From: Unknown

@Hammer

Quote:

1. PowerPC architectural differences, including a much higher penalty for float to int conversion on the PPC.

Actually the Float to Int conversion has no penalty on PPC.
What used to have a penalty was moving the int to an integer register.

68K and others have FPU instructions that can read values into the FPU from integer registers, and also instructions to write results back from the FPU into integer registers.

IBM regarded this as unneeded, and the FPU could write results only to an FPU register.
The FPU register could then be stored to a memory location.
There was no instruction available on PPC to move an FPU result directly to an integer register.
This means the FPU needed to store the value to memory and then the integer unit needed to load it from memory.

For most FPU code this is not a problem at all, and normal academic / industrial algorithms run perfectly without needing it. Sometimes, as in the game Quake, this move is wanted, and you suffer a delay from the memory access.
Recent POWER CPUs did solve this.
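The data path described above can be modeled with the classic PowerPC idiom: `fctiwz` converts in the FPU and leaves the integer in the low word of a floating-point register, `stfd` stores that register to memory, and `lwz` loads the low word into an integer register. This is a behavioral sketch only. It models the unused high word as zero (on real hardware its contents should not be relied on) and says nothing about timing, where the cost is the store followed immediately by a load from the same address.

```python
import math
import struct

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def fctiwz(x: float) -> bytes:
    """Model of PowerPC fctiwz: convert to a 32-bit integer, rounding toward zero,
    saturating out-of-range values, and leave it in the low word of a 64-bit
    FPR image. The high word is modeled as zero (an assumption)."""
    if math.isnan(x):
        i = INT32_MIN
    elif x >= 2**31:
        i = INT32_MAX
    elif x <= -2**31:
        i = INT32_MIN
    else:
        i = math.trunc(x)
    return struct.pack(">II", 0, i & 0xFFFFFFFF)   # big-endian FPR image


def float_to_int_via_memory(x: float) -> int:
    """FPU result -> memory -> integer register, as on classic PowerPC."""
    fpr = fctiwz(x)                          # result stays in an FPU register
    stack_slot = bytearray(8)
    stack_slot[:] = fpr                      # stfd: store the whole FPR to memory
    # lwz: the integer unit loads the low word (read here as signed 32-bit)
    (gpr,) = struct.unpack_from(">i", stack_slot, 4)
    return gpr


print(float_to_int_via_memory(3.9), float_to_int_via_memory(-3.9))   # 3 -3
```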

Amigaworld.net was originally founded by David Doyle