Amigaworld.net - The Amiga Computer Community Portal Website

Market Size For New Games requiring 68040+ (060, 080)


Hammer

Re: Market Size For New Games requiring 68040+ (060, 080)
Posted on 20-Mar-2025 7:13:24

[ #181 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6409
From: Australia

@matthey

Quote:
Would you prefer that I call it a kludge?

Backward compatibility is important, and it wouldn't matter once 1993's P5 had evolved into the P6 Pentium Pro in 1995.

X86 rapidly evolved with aggressive competition between Intel and AMD. Cyrix failed to keep up with X86's evolution pace.

The aggressive competition between Intel and AMD allowed MS to enter the games console market with the x86-based original Xbox.

The initial BOM costings for the original Xbox
https://www.neogaf.com/threads/3do-mx-chipset-the-technology-nintendo-almost-used-in-an-n64-successor-for-1999.350196/#post-14521193

Brown said the goals were to make money, expand Microsoft's technology into the living room, and create the perception that Microsoft was leading the
charge in the new era of consumer appliances. The initial cost estimate was for a machine with a bill of materials (engineering talk for cost) of $303. That
machine would debut in the fall of 2000 and use a $20 microprocessor running at 350 megahertz from Advanced Micro Devices. The machine would also have
a $55 hard disk drive with two gigabytes of storage, a $27 DVD drive to play movies, a $35 graphics chip, $25 worth of memory chips, and a collection of
other standard parts like a motherboard, and power supply. Over time, these prices would decline.

$20 Intel-compatible microprocessor and a $30 graphics chip from Nvidia. The highest-priced item on the list of materials was $40 for memory chips. But the
rest of the bill of materials was complete, down to $2.14 for the cables and $4.85 for screws

Xbox's BOM cost parameters are close to mainstream Amiga AGA.

https://forum.beyond3d.com/threads/og-xbox-was-planned-to-launch-with-an-amd-cpu-until-last-minute.62562/#post-2225089
Xbox's CPU increased to K7 Duron before switching to Intel Coppermine 128K.
NVIDIA's nForce/Xbox is mostly an AMD 76x chipset that NVIDIA bought the rights to modify.

The AMD and nForce AMD IDE controllers are fully compatible. (Linux kernel: "AMD 755/756/766/8111 and nVidia nForce/2/2s/3/3s/CK804/MCP04 IDE driver for Linux.")
The I2C/SMBus controller on the nForce is fully AMD-756/766/68 compatible.
The audio controller is i810 compatible - as is the audio controller of the AMD-768 and the AMD-8111.
The nForce and AMD-768 modems are compatible.
At least one register ("VGA_en") in the nForce PCI-to-AGP bridge is compatible with the AMD chipset (AMD-761, 24081.pdf, page 136).
The nForce uses HyperTransport.

https://archive.computerhistory.org/resources/access/text/2013/04/102723369-05-01-acc.pdf
On page 174 of 487, in the entry for October 12, 1998, the 68040-25 had reached $24.20, by which time it was obsolete for a 1998-era game console.

The original 68000's price/performance success in the 1980s wasn't repeated by the full 32-bit 68k CPUs.

Quote:

The d_scan.c source code would most likely generate a double precision FDIV with Visual C++ and default/global precision set to double precision. The FP 1.0 data is representable in single precision which saves memory but when it is loaded into the FPU it is translated to higher precision and calculations are done at higher precision. It is possible to override the default precision and use single precision but it has significant overhead to change and Michael Abrash warned about repercussions of setting the default rounding to single all the time. The following WinQuake assembly source code has functions to change the default precision.

https://github.com/id-Software/Quake/blob/master/WinQuake/sys_wina.s

I do not see C(Sys_LowFPPrecision) or C(Sys_HighFPPrecision) function calls. Maybe I missed them or maybe lowering the precision to single precision was just experimental. Find these function calls in the source code somewhere to show that single precision was used. The video you linked gives the single precision Pentium FDIV latency but that does not mean it was used. It also says FDIV "can't be pipelined" and the AC68080 is claimed to have a pipelined FDIV.

On a fresh Visual C++ 2.0 install on a fresh Windows NT 4.0 install under VirtualBox 7, sizeof(float) is 4 bytes, i.e. 32 bits, hence FP32.
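
The two things being argued about here, the storage size of a value and the precision used for the calculation, can be separated in a small sketch. This is only an illustration in Python, not WinQuake code: Python floats are FP64, and the helper `to_fp32` is a made-up name that uses the standard `struct` module to round a value to FP32 and widen it back, roughly what storing to and reloading from a 4-byte float does.

```python
import struct


def to_fp32(x: float) -> float:
    """Round an FP64 value to the nearest FP32 value and widen it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


fp32_size = struct.calcsize("<f")   # storage size of a single-precision float
fp64_size = struct.calcsize("<d")   # storage size of a double-precision float

# 1.0 is exactly representable in FP32, so storing it as a float loses nothing.
one_survives = to_fp32(1.0) == 1.0
# 0.1 is not, so the FP32 copy differs from the FP64 value.
tenth_survives = to_fp32(0.1) == 0.1

print(fp32_size, fp64_size, one_survives, tenth_survives)
```

A 4-byte float only tells you how the data is stored. Whether the divide itself runs at single or double precision depends on the FPU precision setting, which is the part still in dispute above.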

Are you claiming that WinQuake recompiled with anything from Visual C++ 5.0 to Visual Studio 2015 is broken?

https://github.com/nicolasboulenc/WinQuake
Visual Studio 2015's WinQuake source build.

Using FDIV FP64 with a non-science use case like games is unwise.

https://github.com/jagregory/abrash-black-book/blob/master/src/chapter-65.md

There are several interesting points to Listing 65.3. First, floating-point arithmetic is used throughout the clipping process. While it is possible to use fixed-point, doing so requires considerable care regarding range and precision. Floating-point is much easier - and, with the Pentium generation of processors, is generally comparable in speed. In fact, for some operations, such as multiplication in general and division when the floating-point unit is in single-precision mode, floating-point is much faster. Check out Chris Hecker's column in the February 1996 Game Developer for an interesting discussion along these lines.

The Pentium FDIV bug also affects FP32.

https://blogs.mathworks.com/cleve/2013/04/29/pentium-division-bug/
The computed quotient is accurate to only 14 bits
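
The "14 bits" figure can be checked with the best-known test case for the bug, 4195835 / 3145727. A minimal sketch in Python: the correct quotient comes from the host FPU, while the flawed quotient is entered as a constant taken from the widely published result of an affected Pentium (an assumption here, since it cannot be reproduced on modern hardware).

```python
import math

x, y = 4195835.0, 3145727.0

correct = x / y                    # computed by the host FPU
flawed = 1.333739068902037589      # widely reported result from a flawed Pentium

residual = x - correct * y         # about 0 on a correct FPU
flawed_residual = x - flawed * y   # about 256 with the flawed quotient

rel_error = abs(correct - flawed) / correct
good_bits = -math.log2(rel_error)  # leading bits on which both quotients agree

print(f"residual on this machine: {residual}")
print(f"residual with the flawed quotient: {flawed_residual:.3f}")
print(f"flawed quotient agrees to about {good_bits:.1f} bits")
```

An error at roughly the 14th bit sits inside the 24-bit FP32 significand, which is why single-precision results were affected as well.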

With FP52, AC68080 V2 was able to run Quake 68K ports.

-----------------

FDIV wasn't pipelined on the Pentium.

AMD K10 vs Intel Core 2 FDIV FP64 comparisons.
https://www.anandtech.com/show/2386/6

The only weakness remaining in the Core x87 architecture is the FP divider. Notice how even a relatively low percentage of divisions (the 4th number in the mix) kills the performance of our 65nm Xeon. The Opteron 22xx and 23xx are 70% faster (sometimes more) when it comes to double precision FP divisions. However, the new Xeon 54xx closes this gap completely thanks to lowering the latency of a 64-bit FDIV from 32 cycles (Xeon 53xx) to 20 cycles (Xeon 54xx, Opteron 23xx). The Xeon 54xx is only 1% to 5% slower in the scenarios where quite a few divisions happen. That is because the Opterons are capable of somewhat pipelining FDIVs, which allows them to retire one FDIV every 17 cycles. The clock speed advantage of the 45nm Xeon (3.2GHz vs. 2.5GHz maximum at present) will give it a solid lead in x87 performance.

(skip)

When it comes to raw SSE performance, the Intel architectures are 3% to 14% faster in the add/subtract/multiply scenarios. When there are divisions involved, Barcelona absolutely annihilates the 65nm Core architecture with up to 80% better SSE performance, clock for clock. It even manages to outperform the newest 45nm Xeon, but only by 8% to 18%. Notice once again the vast improvement from the 2nd generation Opteron to the 3rd generation Opteron when it comes to SIMD performance, ranging from 55% to 150%.

AMD Barcelona K10 Opteron has pipelined FDIVs with a 17 cycle retirement.
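
To put the latency and retirement numbers from the AnandTech quote side by side, here is a back-of-the-envelope model in Python. It is a toy under stated assumptions: the divides are independent and issued back to back, nothing else competes for the FPU, and the Xeons are treated as fully unpipelined (issue interval equal to latency), which the article implies but does not state as a number. The function name `total_cycles` is made up for this sketch.

```python
def total_cycles(n_divs: int, latency: int, issue_interval: int) -> int:
    """Cycles to finish n independent divides: the first result arrives after
    `latency` cycles, each further one `issue_interval` cycles later."""
    if n_divs <= 0:
        return 0
    return latency + (n_divs - 1) * issue_interval


N = 100
xeon_53xx = total_cycles(N, latency=32, issue_interval=32)     # assumed unpipelined
xeon_54xx = total_cycles(N, latency=20, issue_interval=20)     # assumed unpipelined
opteron_23xx = total_cycles(N, latency=20, issue_interval=17)  # one FDIV retired per 17 cycles

print(xeon_53xx, xeon_54xx, opteron_23xx)
```

In this model the partial pipelining only matters for runs of independent divides. A chain of dependent divides is bound by latency alone, where the Xeon 54xx and Opteron 23xx are equal at 20 cycles.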

For X87, the Core 2 Xeon's raw clock speed advantage wins over K10 Opteron.

In a pure floating-point SSE comparison, AMD Barcelona beats the Intel Core 2-based Xeon, but games are concurrent mixed integer and floating-point workloads, where K10 Barcelona's three-instructions-per-cycle front end lost to Core 2's four-instructions-per-cycle front end.

https://www.anandtech.com/show/18871/arm-unveils-armv92-mobile-architecture-cortex-x4-a720-and-a520-64bit-exclusive/3
For ARM Cortex A720, a Cortex-X4 variant

Another improvement is the Pipelined FDIV/FSQRT (division + square root), which performs operations on floating point numbers using the pipelines.

Quote:

Replacing the crap x87 FPU with a SIMD FPU tends to do that. Intel was proud of the x87 FPU before they moved to deeper pipelines for higher clock speeds and decided to replace it. The x87 baggage remains so Quake still runs though.

SSE2 arrived with Intel's Pentium 4 Willamette (November 2000), followed by Pentium 4 Northwood (January 2002) and AMD's K8 Opteron SledgeHammer (April 2003, X86-64). Intel lost X86 ISA leadership with X86-64.

Pentium III's SSE replaced X87 on FP32 workloads.

For 68K, the biggest problem is money.

Last edited by Hammer on 20-Mar-2025 at 10:45 PM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB


Hammer

Re: Market Size For New Games requiring 68040+ (060, 080)
Posted on 20-Mar-2025 7:37:33

[ #182 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6409
From: Australia

@matthey

Quote:

CPU and FPU registers rarely increase or grow wider. The x86 ISA doubled the integer registers going to x86-64 but with a new mode and practically a new ISA in a less than efficient way for decoding and code density.

AVX-512 doubled the SIMD register count (from 16 to 32) and still works with 128-bit, 256-bit, and 512-bit wide vectors.

Quote:

The AC68080 added CPU and FPU registers without a new mode and it is a non-orthogonal mess with less than the 100% claimed compatibility. That really only leaves SIMD registers which x86-64 has increased with baggage.

The baggage remained similar since X86 decoders are relatively tiny compared to the rest of the CPU.

The 68060 can't even run an unmodified Mac OS on a formerly 68040-based Macintosh.

https://www.nxp.com/docs/en/supporting-information/MC68060AR.pdf
"Porting software" within the same 68K CPU family is a joke.

On the Amiga, moving an old code base to the newer 68060 means reinventing the wheel, while an unmaintained Windows NT 3.5 still works on modern PCs via UEFI CSM or hardware-accelerated virtual machines. Modern PCs can run unmodified MS-DOS from 1987, while ARMv9 needs a modified RISC OS, i.e. it doesn't run 1987's RISC OS.

Quote:

That does lead to the problem of SIMD units not scaling well and already having enough issues at 512b wide, including being resource hogs, that this size has not become standard.

AVX-512 units are not resource hogs. Intel's Skylake-X implementation is old.

https://www.phoronix.com/review/amd-ryzen-9-9950x3d-linux/4
Intel Embree 4.3's Pathtracer middleware is known to use AVX-512

https://www.phoronix.com/review/amd-ryzen-9-9950x3d-linux/8
OpenVINO is known to use AVX-512 Ice Lake

AMD-based next-generation game consoles will enforce the standard for X86-64 v4 since they have Zen 5 or Zen 6.

https://overclock3d.net/news/cpu_mainboard/sonys-ps6-will-be-turbocharged-by-amds-x3d-tech-leaker-claims/
PS6 with Zen 5 X3D. Desktop PCs would have Zen 6 X3D with 12 cores per CCD.

https://www.techpowerup.com/316868/microsofts-next-gen-xbox-for-2028-to-combine-amd-zen-6-and-rdna5-with-a-powerful-npu-and-cloud-integration
Xbox Next with Zen 6.

The baseline X86-64 version in the next-generation game consoles is at least Zen 4's AVX-512 (X86-64 v4 + IceLake AVX-512 extensions). Zen 5's AVX-512 has Intel Tigerlake extensions.

https://wccftech.com/amd-surpasses-100-million-gaming-consoles-this-generation
AMD surpassed 100 million gaming consoles this generation with X86-64 v3 (AVX2). Since the original Xbox with SSE, game consoles have helped to push X86's SIMD extensions.

Motorola didn't understand gaming, and the 68K was kicked out of the mainstream gaming market during the transition to full 32-bit.

Last edited by Hammer on 20-Mar-2025 at 10:29 PM.

cdimauro

Re: Market Size For New Games requiring 68040+ (060, 080)
Posted on 21-Mar-2025 6:06:24

[ #183 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4349
From: Germany

@matthey

Quote:

matthey wrote:
cdimauro Quote:

Then the 68060 cannot be directly compared to the other processors, which were aimed at the desktop market and not at the embedded one.
In fact, at least the Pentium has not only retained the full legacy from all its predecessors, but added more features. All this needs transistors and draws more power.
If the 68060 was designed specifically for the embedded market and Motorola decided to remove many features to make it more suitable for that, then I've no problem accepting it (processors should be adapted to their specific needs). But, as I've said, any comparison is then no longer possible.

Apples and oranges can still be compared even though a pear is a better comparison for an apple and a lemon is a better comparison for an orange.

Quote:
The 68881 was supposedly 155,000 transistors and the 68882 176,000 transistors. Restoring the 64-bit result multiply and divide with the full FPU is still likely to be less than 3 million transistors which is less than the P5 Pentium. Maybe a pipelined 68060 FPU would come close but the 68060 has a deeper integer pipeline, twice as many GP integer registers, more cache associativity, a 2nd barrel shifter and an instruction buffer. The 68060 has a clear PPA advantage as the hardware design does not even consider the better 68k CPU ISA and FPU ISA advantage with better orthogonality that allows higher instruction multi-issue rates.

Add a fully-fledged 68851 PMMU, which would have to be duplicated (due to the double pipeline) & enforced, all missing instructions (CALLM/RTM included), all exception stacks, a 64-bit data bus, a 16-byte instruction fetch window, debug registers/features, new instructions, and the machine registers.

It's a lot of stuff, which would likely have put the 68060 on par with or even above the Pentium in transistor count, considerably raising its power consumption.

That's why I still think that those processors aren't comparable: the 68060 was a heavily cut-down version of its predecessors, made to remove A LOT of legacy baggage (and good features as well). That's OK if the target is the embedded market, but Intel was clearly going after the desktop and server markets, which are quite different.
Quote:
cdimauro Quote:

np. We're just brainstorming here.

Honestly, I don't like such mixed usage of registers (for the same reasons that I don't like what Gunnar did with the 68080).

In my vision it's better to have the FPU and SIMD/Vector registers completely separated: scalar from one side and SIMD/Vector on the other side.
The primary reason is that in this way the SIMD/Vector unit can freely "grow" completely independent. So, its registers can increase in size without wasting space, because only vector data is stored there, and which is fully used (e.g.: no partial register usage).

CPU and FPU registers rarely increase or grow wider. The x86 ISA doubled the integer registers going to x86-64 but with a new mode and practically a new ISA in a less than efficient way for decoding and code density. The AC68080 added CPU and FPU registers without a new mode and it is a non-orthogonal mess with less than the 100% claimed compatibility. That really only leaves SIMD registers which x86-64 has increased with baggage.

x64 increased both the number and the size of its SIMD registers with AVX-512: 32 x 512-bit registers, plus another 8 x 64-bit registers for the masks. Which is A LOT.
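
To give "A LOT" a number, the architectural register state can be added up directly. A quick Python sketch using the register counts and widths of x86-64 SSE, AVX2 and AVX-512; the helper name `regfile_bytes` is made up here, and the totals cover architectural state only, not the larger physical register files used for renaming.

```python
def regfile_bytes(count: int, bits: int) -> int:
    """Architectural state of a register file in bytes."""
    return count * bits // 8


sse = regfile_bytes(16, 128)          # XMM0-XMM15 in 64-bit mode
avx2 = regfile_bytes(16, 256)         # YMM0-YMM15
avx512_vec = regfile_bytes(32, 512)   # ZMM0-ZMM31
avx512_mask = regfile_bytes(8, 64)    # k0-k7 mask registers
avx512 = avx512_vec + avx512_mask

print(sse, avx2, avx512)
print(f"AVX-512 state is {avx512 / avx2:.3f}x the AVX2 state")
```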
Quote:
That does lead to the problem of SIMD units not scaling well and already having enough issues at 512b wide,

512b is now the standard, because the AVX-10 subset with 256-bit registers was completely dropped by Intel just two days ago.

It looks like Intel learned how to properly implement such big registers even on its smaller E-Cores, so that restriction isn't needed anymore. AFAIK Panther Lake should offer AVX-512 on both P and E-Cores.
Quote:
including being resource hogs,

Yes, many transistors are required for the registers, but register renaming has already pushed the physical register count well above 32 for a long time.
Quote:
that this size has not become standard. AArch64 just made 128b wide their standard.

No, 512b has become more standard in recent years. RISC-V cores which sport the Vector extension usually have 512b vector registers. The same goes for ARM cores which support SVE/SVE2 (Fujitsu, with its AArch64/SVE supercomputer, was the first one).
All of them have 32 architectural vector registers, so they exactly match AVX-512.
Quote:
Using pairs of SIMD instructions for 256b SIMD is not so inefficient either.

Yes, it's common for smaller cores.

Even the Pentium III and IV internally split the SSE/SSE2 instructions into pairs of 64-bit SIMD operations (which "someone" could have improperly and wrongly called SSE-64/SSE2-64).
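
The splitting trick is invisible to software because a packed operation has no dependencies between lanes. A minimal Python sketch of the idea, with made-up helper names and plain lists standing in for a 128-bit register of four FP32 lanes; it says nothing about how the real hardware sequences the two halves.

```python
def add_lanes(a, b):
    """Lane-wise add, as a full-width SIMD unit would do in one pass."""
    return [x + y for x, y in zip(a, b)]


def add_128_as_two_64(a, b):
    """The same 4-lane add done as two 2-lane passes (low half, then high half)."""
    low = add_lanes(a[:2], b[:2])
    high = add_lanes(a[2:], b[2:])
    return low + high


a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, 0.25, 0.125, 0.0625]

full = add_lanes(a, b)
split = add_128_as_two_64(a, b)
print(full, split, full == split)
```

The result is identical either way. The cost of the narrow datapath is throughput, since each 128-bit instruction occupies the unit for two passes.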
Quote:
Then there are vector units which are more scalable but have higher latency and are more difficult to program. It is necessary to find a low resource balance that fits with the 68k code density small footprint but the AC68080 64-bit SIMD is too narrow to be worthwhile for FP and is barely worthwhile for integer. A MAC unit like ColdFire uses is even worse as a pipelined FPU with FMA would be a better choice. I want a standard full featured extended precision 68k FPU but I am not so sure about a standard SIMD unit even though I do understand the advantage and disadvantage of having it standard. I am open minded on options and too inexperienced with SIMD and hardware to make some ISA decisions by myself.

This summer I'll try to design one during my vacation. At the moment I'm too busy with too many projects (I'm also helping a friend with an open source project to protect wild animals).
Quote:
My 68k ISA goal was to document innovative ideas and proposals that could be decided on by a group of experienced developers. Of course Gunnar decided he was the group of experienced developers, aka the Apollo team. I wanted an enhanced 68k ISA that others could agree on including emulation but I found out most emulation developers want nothing to do with enhancements because emulation is retro and EOL. Even the AmigaOS is devolving backwards from the 68020 for AmigaOS 3.1 to the 68000. Nobody cares anymore about 68k ISA enhancements than they do about a new ABI that properly supports extended precision FP. A faster host CPU for emulating the 68k is good enough.

Maybe an optional, modern SIMD unit could help.
Quote:
cdimauro Quote:

The second reason is that scalar and SIMD/vector instructions aren't usually mixed: you execute either the former or the latter. And they have different usages.
To be clearer, an algorithm that requires many scalar instructions to complete might require far fewer SIMD/vector instructions. This means that the read/write ports of the scalar and SIMD/vector register files can differ and can be finely tuned for the specific microarchitecture / market.
If you unify (or use a mix like the above) the scalar and SIMD/vector register files, then you end up heavily limiting either the scalar or the SIMD/vector unit (e.g. more ports will be more difficult and expensive to implement for the SIMD/vector unit, whereas fewer ports will penalize the performance of the scalar unit).

Most register writes would be of the whole FPU/SIMD register so I would think the ports would be similar. I get that the SIMD registers may want to be increased or expanded for higher end uses though which would be more difficult and the ports would become less efficient.

Yes. That's why it's better to separate them.
Quote:
cdimauro Quote:

IMO extending the data registers to 80 bits and using them as the extra 8 FPU registers is "the lesser of two evils". It certainly looks odd at the first sight, but this way you keep the scalar and SIMD/Vector units well separated and free to be implemented according to the specific needs. With a very little expense in terms of space (especially thinking about a 64 bit ISA).

Food for thinking...

My thinking is that separate register files would be less wasteful for the 68k ISA. The advantage of unified int/FPU registers would be more efficient FP to int and int to FP for the cost of wasted wider integer registers.

I haven't unified the int/FP scalar registers just to get a more efficient conversion between data types. In fact, I didn't mention that among the reasons in my previous posts.

IMO such conversion could simply be implemented in a more efficient way, whatever the register organization is.
Quote:
My thinking is that FPU/SIMD registers would share scratch registers better and heavy FPU and SIMD use at the same time would be uncommon with the disadvantage that it would be more difficult to increase the SIMD registers or their width in the future.

If you don't feel comfortable with extending the data registers to 80 bits, then yes: add another 8 FPU registers, with the option to share FPU and SIMD registers with the future extension.
Quote:
cdimauro Quote:

Understood, and I mostly agree.

One point where I don't is about the x86/x87 bugs. Bugs are... bugs. Something that can happen and that can also be fixed: the infamous Pentium FDIV bug is a reminder of what can happen... and of what can be done to fix issues.

Another point where I don't agree with Kahan is about the x87 design. In fact, it wasn't true that it was limited by the opcode space.
At the time a lot of opcodes were still not used (besides the 8 x "escape" ones used for the x87): several bytes are free in the opcode table, and could have been used to implement a register-based FPU instead of a stack based, even supporting three operands.
It would have required longer opcodes, but x87 needs extra FXCH instructions anyway to "solve" this problem...

The Intel x87 opcode space requirement may have been like the Intel requirement for Stephen Morse to build an upgraded 8080 with full compatibility. Morse was smart enough to realize that it would not have been much of an upgrade and after "protesting" received the go ahead to create a decently upgraded 8086 with partial 8080 compatibility instead. It probably was not Kahan's place to protest but maybe John Palmer should have. Palmer was trying to keep the x87 project from being canceled though as Intel marketing did not think enough math coprocessors could be sold to pay for development. If you watched the Kahan video I posted, Intel unexpectedly sold roughly as many coprocessors as they did CPUs (1:1).

Unbelievable! It means that the x86/x87 pairs were bought by scientists and/or professionals, who had been waiting for a much cheaper solution.
Quote:
I doubt that was the case for the 6888x as at least Commodore wanted to sell cheapened business computers without a FPU, MMU or even desktop class CPU. Commodore gave us cheapened embedded CPUs, Motorola gave us cheapened embedded CPUs and A-Eon gave us cheapened embedded CPUs. At least Commodore and Motorola gave us cheap and 68k compatible cheapened embedded CPUs.

I fully agree here.
Quote:
cdimauro Quote:

No, My 66000 is a radical new architecture, quite different from 68k, and uses 32-bit VLE. I think that it's different from 88100 as well, but I should refresh my studies on this ISA to make a concrete comparison.

Due to the 32-bit base opcode/alignment I think that the code density shouldn't be so good, but Mitch says that it's better than RISC-V. Let's see. I have some doubts about that, but benchmarks are needed.
What's true is that it requires far fewer executed instructions compared to RISC-V (the 64-bit "G" variant). Well, not surprising, since RISC-V is such a weak ISA...

I recall Mitch talking about the encoding which was the first RISC ISA I had heard of to use variable sized immediates and displacements like the 68k which is one of the less copied reasons for CISC performance. I recall he gave some examples of encodings but I do not recall it being a 32-bit variable length encoding which indeed would be bad for code density. I do recall it having many GP registers though which would increase instruction sizes.

Maybe he changed the design. Currently, instructions are 32-bit sized/aligned and there's a unified file of 32 registers.

But the ISA is still under development and there are discussions about possibly adding instructions (recently an equivalent of ARM's very good CCMP was proposed, but there isn't enough consensus to add it to the ISA for now).
Quote:
The 88110 design did use shared int/FPU registers although it was still only 32-bit which was the biggest mistake of the 88k according to Mitch. Oddly, a new extended precision FPU register file was added for the 88110 too. The 88k ISA is still fairly minimal like most RISC but far more friendly than PPC. Some oddities include packed/pixel Pop SIMD instructions using 32-bit register pairs and a XMEM instruction like the x86 XCHG instruction. The 88k ISA is strange but the register sharing is similar to Mitch's new ISA. The 88k ISA is not as valuable today as the 88110 OoO design which is a very flexible dual issue design for 10 units that makes instruction scheduling very easy and has high multi-issue rates. It uses a shallow 4-stage integer pipeline but has a fully pipelined extended precision FPU.

http://www.bitsavers.org/components/motorola/88000/MC88110UM_88110_Users_Manual_1991.pdf

Thanks for sharing it.

I think that Mitch is inspired by the 88100, because I see similarities and he talks about that processor quite often.
Quote:
A decoupled instruction fetch pipeline could be added for a variable length encoding which would give a medium depth pipeline like is popular today and would more consistently feed the existing superscalar execution pipeline. There are man years of labor in the CPU design and likely a good base design here which is easier than starting over with a new design. I expect Mitch would be interested in reacquiring or licensing the IP and a package deal for old cores like the 68k, 88110 and ColdFire V5 may improve the value of a deal.

AFAIK Mitch is proceeding on his own and hasn't asked for licenses. Which makes sense: he has "enough" experience.
Quote:
cdimauro Quote:

I don't think that microcode is needed. My mem-mem-mem design is much simpler than the VAX one, and requires only some bits (the LSBs) to determine the position and length of each of the three operands.
In general, decoding the first bits (within the first 16-bit word; 15 bits at most, to be precise, in the worst-case encoding) is enough to figure out the instruction length and the position and length of all its operands.
VAX, on the other hand, requires parsing the byte stream one byte at a time, advancing byte by byte to do the same. That is the reason why it wasn't possible to pipeline it (at the time), which led to its failure.
Having three memory operands can have its own problems, of course, but only when they actually point to memory locations. Since I can reference registers, immediates, and constants (in ROM; no FMOVECR is required for them: just use one of the few that are defined in the EA), there's often only a single memory reference, so only one AGU is needed.
Another practical case is when there are only two memory operands, and one of them is the destination: in this case processing the destination's EA can easily be delayed until an AGU is free (which is very similar to the 68k case with the MOVE instruction).
The worst case, of course, is when there are two source memory operands (so, mem-mem with destination as first source operand, and mem-mem-mem), because they need to be evaluated ASAP to get their values; but 68k has the same problem with the add mem-mem instructions (albeit the available EAs are few and fixed).
I don't know if there are intrinsic issues in dealing with more than one EA per instruction, but if there are, I'm curious to understand why.
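
The decoding difference described above can be sketched with two toy formats. Everything in this Python example is invented for illustration: the size tables, the field positions and the function names are hypothetical, and neither format is the real VAX encoding nor the ISA from the post. It only shows why one scheme is inherently sequential while the other is a single lookup on the first word.

```python
# Toy encodings invented for illustration; neither is the real VAX format
# nor the ISA described in the post.

SPEC_BYTES = {0: 1, 1: 2, 2: 3, 3: 5}   # hypothetical: specifier kind -> its total size
EXT_BYTES = {0: 0, 1: 2, 2: 4, 3: 8}    # hypothetical: size code -> extension bytes


def serial_length(stream: bytes, n_operands: int):
    """Byte-serial style: operand k can only be located after operand k-1
    has been inspected, so the scan is inherently sequential."""
    pos = 1                       # skip a 1-byte opcode
    inspected = []
    for _ in range(n_operands):
        inspected.append(pos)
        kind = stream[pos] & 0x3  # low 2 bits select the specifier kind
        pos += SPEC_BYTES[kind]
    return pos, inspected


def header_length(first_word: int) -> int:
    """Header style: three 2-bit size codes sit in the low bits of the first
    16-bit word, so the total length needs no look at later bytes."""
    codes = [(first_word >> shift) & 0x3 for shift in (0, 2, 4)]
    return 2 + sum(EXT_BYTES[c] for c in codes)


stream = bytes([0x10, 0x01, 0xAA, 0x02, 0xBB, 0xCC, 0x00])
serial_result = serial_length(stream, 3)   # length plus the offsets that had to be read
header_result = header_length(0b00_10_01)  # size codes 1, 2, 0 for the three operands

print(serial_result, header_result)
```

In the serial format the offsets of the second and third specifiers are unknown until the earlier ones are decoded. In the header format all three operand positions follow from one word, which is what makes parallel or pipelined decode straightforward.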

I agree that the 8-bit VLE was the major reason for the downfall of VAX. It is like the bad 8-bit VLE x86 encoding but the x86 ISA is simpler. The VAX ISA is more orthogonal than the x86 ISA but it is also more complex. The 68k 16-bit VLE is easier to decode and orthogonal with ISA complexity between the x86 and VAX ISA. The mem,mem did not go too far except for maybe the ADDX/SUBX mem,mem type instructions and the double memory indirect addressing modes but they are no problem for the 68060 in hardware.

That's why the 68k is the primary source of inspiration for my architectures: they've more in common with the 68k than with x86/x64 (even NEx64T). With very good reasons.
Quote:
cdimauro Quote:

Mitch reported a slightly different list about such extra rounding modes:
Table 4: Rounding Modes

Mode                      Status         Encoding
Round Nearest Even        IEEE 754       000
Round Nearest Odd         Experimental   001
Round Nearest Magnitude   IEEE 754-2008  010
Round Away From Zero      Experimental   100
Round Towards Zero        IEEE 754       101
Round Towards +Infinity   IEEE 754       110
Round Towards -Infinity   IEEE 754       111

Among the new ones, Round Away From Zero (Experimental) matches your Round Away from Zero, but the other two don't match.
I've extended my new architecture with this table:
Round to Nearest, Ties to Even - round()
Round Toward Zero - trunc()
Round Down, Toward -Infinity - floor()
Round Up, Toward +Infinity - ceil()
Round to Nearest, Ties Away from Zero
Round to Nearest, Ties toward Zero
Round to Nearest, Ties to Max Magnitude
Round Away from Zero

I don't know if it's right, but I'm not an expert on this field.
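
The differences between these modes only show up on ties and on negative values, so a tie case makes a quick sanity check. A hedged sketch using Python's standard `decimal` module, whose rounding constants cover seven of the eight modes listed; "Ties to Max Magnitude" is left out because it has no counterpart there. The mapping of list entries to `decimal` constants is my reading of the names, not something taken from either ISA document.

```python
from decimal import (Decimal, ROUND_HALF_EVEN, ROUND_DOWN, ROUND_FLOOR,
                     ROUND_CEILING, ROUND_HALF_UP, ROUND_HALF_DOWN, ROUND_UP)

# Assumed mapping from the mode names in the list to decimal's constants.
MODES = {
    "nearest, ties to even": ROUND_HALF_EVEN,
    "toward zero": ROUND_DOWN,
    "toward -infinity": ROUND_FLOOR,
    "toward +infinity": ROUND_CEILING,
    "nearest, ties away from zero": ROUND_HALF_UP,
    "nearest, ties toward zero": ROUND_HALF_DOWN,
    "away from zero": ROUND_UP,
}


def round_all(value: str) -> dict:
    """Round a decimal string to an integer under every mode in MODES."""
    d = Decimal(value)
    return {name: int(d.quantize(Decimal("1"), rounding=mode))
            for name, mode in MODES.items()}


positive = round_all("2.5")
negative = round_all("-2.5")

for name in MODES:
    print(f"{name:30s} {positive[name]:3d} {negative[name]:3d}")
```

One naming caveat: in C, `round()` rounds ties away from zero, while `rint()` and `nearbyint()` follow the current rounding mode (ties to even by default). Python's built-in `round()` is ties to even.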

Anyway, the encoding meanings might change. The important thing is that I've extended the field in the status registers to support such 8 rounding modes, and I've also extended the previously mentioned CONV instruction (which now takes a whopping 192 encodings from the list of available binary instructions. Fortunately I've many of them, but CONV already took a good part of it).

I did not try to give an exhaustive list of other rounding modes. There are many ways to round numbers and I was listing some common ones we have likely learned, used and are useful but are not part of the IEEE standard. Rounding modes use many different names too. I was just saying a 3-bit encoding for at least 8 rounding modes is a good idea. IEEE FP rounding is a little tricky because there is a sticky bit and guard bit.

Round Nearest Magnitude IEEE 754-2008

This is probably the optional IEEE 754-2008 rounding mode needed by double double and double extended numbers which is basically multiprecision FP math in hardware. I would have to brush up on how exactly the different rounding takes place to be sure. Supporting additional rounding modes is likely very cheap in hardware although the instruction encoding space is not as cheap and knowing which ones are worthwhile to support is not clear.

Edit: Mitch's "Round nearest, magnitude IEEE 754-2008" is different than the optional "Round nearest, ties away from zero" listed at the following link.

https://en.wikipedia.org/wiki/IEEE_754#Rounding_rules

I am not sure that either is the rounding mode used for multiprecision arithmetic. No easy answers anyway.

OK, np. What you've shared is already great, and it allowed me to solve some important problems in this area.

I think that my ISA is now complete, at least from an embedded/user-land perspective (for the supervisor, MMU, etc., I think that I'll get inspiration from Mitch's My 66000).

Status: Offline

Hammer

Re: Market Size For New Games requiring 68040+ (060, 080)
Posted on 21-Mar-2025 12:07:50

[ #184 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6409
From: Australia

@matthey

For the N64 project,
https://web.archive.org/web/20150208022940/http://www.nytimes.com/1993/08/21/business/company-news-video-game-link-is-seen-for-nintendo.html

A computer industry official said MIPS, a subsidiary of Silicon Graphics,
had developed a version of its R4000 processor that operated on less than
one-half watt and could be produced for about $40 each

Commodore's multimedia group (Jeff Porter) used a 40 MHz MIPS-X CPU variant inside the CL-450 SoC for the FMV module, i.e. a full 32-bit 68K had "cost vs performance" problems. The FMV module's CL-450 SoC is in a sub-$50 price range at 10,000 units.

PS1's MIPS 3051 CPU (MIPS R3000A compatible) has a $30 price range.

For low-cost, next-generation, high-performance 32-bit entertainment experiences, Motorola's 32-bit 68K is not the answer.

Last edited by Hammer on 21-Mar-2025 at 12:08 PM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB

Status: Offline

Hammer

Re: Market Size For New Games requiring 68040+ (060, 080)
Posted on 21-Mar-2025 12:40:51

[ #185 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6409
From: Australia

@cdimauro

Quote:
It looks like Intel learned how to properly implement such big registers even on its smaller E-Cores, so such a restriction isn't needed anymore. AFAIK Panther Lake should offer AVX-512 to both P and E-Cores.

Gracemont implements 256-bit AVX2 on 128-bit hardware units.

The Cinebench 2024 benchmark doesn't use 256-bit vector AVX and makes only minimal use of 128-bit SIMD.

Cinebench 2024 extensively uses scalar AVX2's FMA3 instructions. Cinebench's codebase is like old-school RISC FMA3 FPU.

Reference
https://chipsandcheese.com/p/cinebench-2024-reviewing-the-benchmark

Quote:

Even the Pentium III and IV internally split the SSE/SSE2 instructions in pairs of 64-bit SIMD instructions (which "someone" could have improperly and wrongly called SSE-64/SSE2-64).

Don't assume. SSE's lowest SIMD width is not 64-bit like 3DNow's SIMD, i.e. SSE has a 128-bit wide instruction set implemented on 64-bit SIMD hardware on the Pentium III and Pentium IV.

AMD K7 XP's SSE FADD and SSE FMUL hardware are 64-bit wide. They are decoded via the fast doubles path.

AMD K8's SSE FADD hardware is 128-bit wide while SSE FMUL hardware is 64-bit wide.

Intel Core 2 is the first X86 CPU with full 128-bit SSE hardware implementation followed by AMD K10.

AVX supports multiple SIMD widths with 128-bit being the lowest mode.

I only focus on AVX-128 since it's optimal for AMD Jaguar's 1x FADD and 1x FMUL 128-bit SIMD hardware and there are more than 100 million game consoles with this embedded game console CPU. AVX-256 bit will add an extra clock cycle of latency on the Jaguar CPU. I'm being transparent about the fact that Jaguar's implementation is not a true 256-bit AVX implementation, i.e. fake 256-bit AVX.

There are more than 100 million Xbox Series X, Xbox Series S, and PS5 game consoles with Zen 2's 256-bit AVX2 ISA with 256-bit hardware implementation.

PowerPC G4's 128bit Altivec is real 128bit SIMD hardware, not some fake BS from the X86 camp. X86's higher clock speed design feature mitigates SSE's fake 128-bit SIMD marketing PR.

I skipped Zen 1.x's 8 cores due to per core 2X 128bit FADD + 2X 128bit FMA3 units for Core i7-7820X 8 cores with AVX-512 and Core i9-9900K. Ryzen 9 3900X is my 1st AMD desktop CPU since the early 2000s K8 Athlon 64s.

Last edited by Hammer on 21-Mar-2025 at 01:10 PM.
Last edited by Hammer on 21-Mar-2025 at 12:52 PM.
Last edited by Hammer on 21-Mar-2025 at 12:50 PM.
Last edited by Hammer on 21-Mar-2025 at 12:48 PM.


Status: Offline

cdimauro

Re: Market Size For New Games requiring 68040+ (060, 080)
Posted on 21-Mar-2025 21:01:14

[ #186 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4349
From: Germany

@Hammer

Quote:

Hammer wrote:
@cdimauro

Quote:
You still don't get it and continue to grab and report things related to the topic, but without understanding the real topic.

Karlos has just written another comment trying again to clarify it and let you know, but I doubt that you'll ever learn it.

Learn something that, BTW, was quite obvious and heavily used at the time ("the time" -> when people were trying to squeeze the most from the limited available resources).

But this is something which is obvious only to people who have had their hands on this stuff. So, nothing that you seem to have done in your life (despite having reported being a developer, though likely working in completely different areas).

Wrong.

Karlos wrote:

FDIV ?

The only time you should be using floating point division is if there's no alternative and an angry person has a gun to your head. As soon as more than one number needs to be divided by the same divisor, it should be converted into its reciprocal so that you can multiply by it instead.
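The reciprocal trick described above can be sketched in a few lines. This is only an illustration with made-up function names: one division produces the reciprocal, and every element is then scaled with a multiplication. Multiplying by a rounded reciprocal is not guaranteed to be bit-identical to a true division, so the two results are best compared with a small tolerance.

```python
import math


def scale_by_division(values, divisor):
    """Baseline: one floating point division per element."""
    return [v / divisor for v in values]


def scale_by_reciprocal(values, divisor):
    """One division up front, then one multiplication per element."""
    inv = 1.0 / divisor  # the only division in this function
    return [v * inv for v in values]


if __name__ == "__main__":
    data = [1.0, 2.0, 7.5, 10.0]
    a = scale_by_division(data, 3.0)
    b = scale_by_reciprocal(data, 3.0)
    # Results may differ in the last bit, so compare with a tolerance.
    print(all(math.isclose(x, y, rel_tol=1e-15) for x, y in zip(a, b)))
```

In Python the speed difference is negligible; the point is the shape of the transformation, which pays off on hardware where a divide costs many more cycles than a multiply.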

Again, Karlos assumed my Quake FDIV argument was per pixel.

I'm already aware of Quake's conservative FDIV usage.

I already stated FDIV was executed out of order while other quicker instructions are processed.. LOL

Learn to read properly.

Try again.

Karlos assumed nothing like that: he has written PRECISE statements which are very common and obvious for people optimizing code which uses divisions (not only FP, I have to say). Which is clearly NOT you.

Here's his original post, the one which I replied to.

And don't tell me to read: that's exactly what I've done before writing my post, which was perfectly in line.

But you still don't get it. Because you're the problem here (see also my following posts).

Status: Offline

cdimauro

Re: Market Size For New Games requiring 68040+ (060, 080)
Posted on 21-Mar-2025 21:07:44

[ #187 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4349
From: Germany

@matthey

Quote:

matthey wrote:
Hammer Quote:

FXCH is not a trick, it's a programmer control register renaming function.

Would you prefer that I call it a kludge?

Simply a kludge? No, it's a HORRIBLE kludge!

Continuously swapping registers with FXCH instructions is clear proof of how much crap the x87 coprocessor was/is. For people who understand it, of course.

Status: Offline

cdimauro

Re: Market Size For New Games requiring 68040+ (060, 080)
Posted on 21-Mar-2025 21:40:48

[ #188 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4349
From: Germany

@Hammer

Quote:

Hammer wrote:
@cdimauro

Quote:
It looks like Intel learned how to properly implement such big registers even on its smaller E-Cores, so such a restriction isn't needed anymore. AFAIK Panther Lake should offer AVX-512 to both P and E-Cores.

Gracemont implements 256-bit AVX2 on 128-bit hardware units.

Irrelevant.
Quote:
The Cinebench 2024 benchmark doesn't use 256-bit vector AVX and makes only minimal use of 128-bit SIMD.

Irrelevant.
Quote:
Cinebench 2024 extensively uses scalar AVX2's FMA3 instructions. Cinebench's codebase is like old-school RISC FMA3 FPU.

Irrelevant.
Quote:
Reference
https://chipsandcheese.com/p/cinebench-2024-reviewing-the-benchmark

Irrelevant.
Quote:
Quote:

Even the Pentium III and IV internally split the SSE/SSE2 instructions in pairs of 64-bit SIMD instructions (which "someone" could have improperly and wrongly called SSE-64/SSE2-64).

Don't assume.

I assume, because you continue to not understand. See below.
Quote:
SSE SIMD mode's lowest width is not 64bit like 3DNow SIMD i.e. SSE's SIMD

First of all, there's no "lowest width".

But even assuming that you meant instructions accessing SIMD registers with a smaller size compared to their normal size, I correctly assume that you've never read an x86/x64 architecture manual in your life because... drum roll... there are SSE instructions which access only the bottom 64 or 32 bits of the SIMD register.

You continuously talk about things that you have no clue about. AT ALL!
Quote:
has a 128-bit wide instruction set implemented on 64-bit SIMD hardware for Pentium III and Pentium IV.

Correct.
Quote:
AMD K7 XP's SSE FADD and SSE FMUL hardware are 64-bit wide. They are decoded via the fast doubles path.

AMD K8's SSE FADD hardware is 128-bit wide while SSE FMUL hardware is 64-bit wide.

Intel Core 2 is the first X86 CPU with full 128-bit SSE hardware implementation followed by AMD K10.

Irrelevant.
Quote:
AVX supports multiple SIMD widths with 128-bit being the lowest mode.

Wrong. Again, you talk about things that you have no clue about at all, because you've never opened an x86/x64 architecture manual.

In fact, AVX instructions always change the entire content of the SIMD destination register. Read again: ALL CONTENT.

So, not only are all 256 bits changed on x86/x64 microarchitectures which have 256-bit SIMD registers, but on microarchitectures which have 512-bit SIMD registers (e.g.: the microarchitecture supports AVX-512) an AVX (and AVX2, of course) 128-bit instruction... changes the content of ALL 512 bits of the destination register.
Read again: an AVX/AVX2 128-bit instruction, which was conceived when AVX-512 was not even on the minds of its architects, changes the FULL content of the SIMD destination register.
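The behaviour described above can be modelled with plain Python integers standing in for a SIMD register. This is a toy sketch, not an emulator: `MAXVL = 512` assumes an AVX-512 capable core, and the function names are invented for the example. It contrasts a legacy (non-VEX) 128-bit SSE add, which leaves the bits above 127 of the destination untouched, with a VEX-encoded 128-bit add, which zeroes every destination bit above 127 up to the full register width.

```python
MAXVL = 512                      # assumed register width of the modelled core
FULL = (1 << MAXVL) - 1
LOW128 = (1 << 128) - 1
MASK32 = 0xFFFFFFFF


def add_lanes32(a: int, b: int, nbits: int) -> int:
    """Lane-wise 32-bit integer add over the low `nbits` of a and b."""
    out = 0
    for shift in range(0, nbits, 32):
        lane = (((a >> shift) & MASK32) + ((b >> shift) & MASK32)) & MASK32
        out |= lane << shift
    return out


def legacy_sse_paddd(dst: int, src: int) -> int:
    """Toy model of a legacy 128-bit op: upper destination bits are preserved."""
    upper = dst & FULL & ~LOW128
    return upper | add_lanes32(dst, src, 128)


def vex128_vpaddd(src1: int, src2: int) -> int:
    """Toy model of a VEX.128 op: destination bits 128..MAXVL-1 are zeroed."""
    return add_lanes32(src1, src2, 128)


if __name__ == "__main__":
    reg = (0xDEADBEEF << 256) | 1   # junk in the upper part of the register
    print(hex(legacy_sse_paddd(reg, 2)))
    print(hex(vex128_vpaddd(reg, 2)))
```

In the VEX case the whole destination register is rewritten, which is why a "128-bit" AVX instruction still affects all 512 bits on a core with 512-bit registers.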

Got it now?

And this is the very basic, elementary stuff that anyone who has at least studied the x86/x64 architecture with its SIMD extensions has known from the very beginning. Read: NOT you.
Quote:
I only focus on AVX-128

Again with your totally dumb invention: there exists NO AVX-128! When do you plan to RTFM ONE TIME IN YOUR LIFE?!?
Quote:
since it's optimal for AMD Jaguar's 1x FADD and 1x FMUL 128-bit SIMD hardware and there are more than 100 million game consoles with this embedded game console CPU.

Irrelevant.
Quote:
AVX-256 bit will add an extra clock cycle of latency on the Jaguar CPU. I'm being transparent about the fact that Jaguar's implementation is not a true 256-bit AVX implementation, i.e. fake 256-bit AVX.

There's NO "fake" 256-bit AVX.

You still don't understand the difference between an ISA/architecture and one of its possible implementations (read: a microarchitecture).

All processors that implement AVX (not even AVX2) should implement this instruction set AS PER SPECS. Which means: 16 x 256-bit registers, and the full set of defined instructions.

IF a microarchitecture has 128-bit SIMD units this does NOT mean that it's a "fake" 256-bit AVX: it just and only means that its architects decided to save transistors with a weaker implementation. AND NOTHING ELSE!

That's because a microarchitecture is totally transparent from an ISA/architecture PoV: how it INTERNALLY works does NOT matter from this perspective. It only matters to COMPILERS or DEVELOPERS that want to optimize the code for this SPECIFIC implementation of the ISA.
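The ISA versus microarchitecture distinction can be illustrated with a toy model. Both functions below are hypothetical: one pretends the core has a 256-bit unit and handles all eight 32-bit lanes in one pass, the other pretends the core has a 128-bit unit and issues two four-lane micro-ops. Registers are modelled as Python lists of eight 32-bit lanes. The architectural result is identical; only the internal work (and, on real hardware, the timing) differs.

```python
MASK32 = 0xFFFFFFFF


def add_lanes(a, b):
    """Wrap-around 32-bit add of two equally long lane lists."""
    return [(x + y) & MASK32 for x, y in zip(a, b)]


def vpaddd_256_native(a, b):
    """Hypothetical core with a 256-bit unit: all 8 lanes in one pass."""
    assert len(a) == len(b) == 8
    return add_lanes(a, b)


def vpaddd_256_split(a, b):
    """Hypothetical core with a 128-bit unit: two 4-lane micro-ops."""
    assert len(a) == len(b) == 8
    low = add_lanes(a[:4], b[:4])    # first micro-op, lanes 0..3
    high = add_lanes(a[4:], b[4:])   # second micro-op, lanes 4..7
    return low + high


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 0xFFFFFFFF]
    y = [10] * 8
    print(vpaddd_256_native(x, y) == vpaddd_256_split(x, y))
```

Software that only looks at the result cannot tell the two apart, which is the sense in which the split implementation is still a complete implementation of the 256-bit instruction.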

How many times should I repeat this simple, elementary concept? You're hopeless!
Quote:
There are more than 100 million Xbox Series X, Xbox Series S, and PS5 game consoles with Zen 2's 256-bit AVX2 ISA with 256-bit hardware implementation.

Irrelevant.
Quote:
PowerPC G4's 128bit Altivec is real 128bit SIMD hardware,

There's no "real" 128-bit SIMD hardware: see above. The architects decided this way, but they could have split the instructions into 64-bit micro-ops for lower cost implementations, because of what I fruitlessly tried to explain to you above for the nth time.
Quote:
not some fake BS from the X86 camp.

Same as above: hopeless IGNORANT!
Quote:
X86's higher clock speed design feature mitigates SSE's fake 128-bit SIMD marketing PR.

Same as above + higher clock speeds only belong to the Pentium IV: the Pentium III had much lower clocks, yet it implemented the novel SSE instructions using 64-bit micro-ops. Which was a LEGIT and fully understandable decision.

But, hey, I'm talking to a stone that can never understand (yes, I'm losing my time here, but only for the benefit of OTHER people and not you).
Quote:
I skipped Zen 1.x's 8 cores due to per core 2X 128bit FADD + 2X 128bit FMA3 units for Core i7-7820X 8 cores with AVX-512 and Core i9-9900K. Ryzen 9 3900X is my 1st AMD desktop CPU since the early 2000s K8 Athlon 64s.

Totally irrelevant.

Now, go study for the first time in your life, IGNORANT!

Status: Offline

Karlos

Re: Market Size For New Games requiring 68040+ (060, 080)
Posted on 21-Mar-2025 22:42:40

[ #189 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4943
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@cdimauro

United we stand, divided we stall.

_________________
Doing stupid things for fun...

Status: Offline


Amigaworld.net was originally founded by David Doyle