bhabbott

Re: Market Size For New Games requiring 68040+ (060, 080)
Posted on 9-Mar-2025 10:20:30  [ #161 ]

Cult Member
Joined: 6-Jun-2018
Posts: 518
From: Aotearoa
@Heimdall

Quote:

Heimdall wrote:
With a little bit of work you can absolutely avoid this normalization division. I have been avoiding this division myself on Jaguar and Amiga in my engine.
|

Good to hear.

What performance (fps) do you expect from your engine in the game(s) you want to produce, on a) 25MHz 040, b) 50MHz 060, c) V2 Vampire?
Status: Offline

Heimdall

Re: Market Size For New Games requiring 68040+ (060, 080)
Posted on 9-Mar-2025 12:39:41  [ #162 ]

Regular Member
Joined: 20-Jan-2025
Posts: 103
From: North Dakota
Quote:

bhabbott wrote:
What performance (fps) do you expect from your engine in the game(s) you want to produce, on a) 25MHz 040, b) 50MHz 060, c) V2 Vampire?
|

The following benchmark data is from the V4SA:

Quote:

Vampire V4 (no AMMX/Maggie, just CPU)
-------------------------------------
PolyCount:             1,020 tris
Color depth:           32-bit
Total frames rendered: 1,000
Method:                brute force (no LOD)
-------------------------------------
 320x200:    77  fps
 640x200:    63  fps
 480x270:    62  fps
 640x360:    53  fps
 640x400:    39  fps
 720x576:    33  fps
 848x480:    26  fps
 800x600:    22  fps
 960x540:    20  fps
1280x720:    13  fps
1440x900:     7.4 fps
1920x1080:    4.8 fps
|
I can estimate performance with great precision because of a detailed benchmark (a separate build I can create at any time). Think of it as LEGO bricks - each engine component takes a certain amount of time.

Quote:

Engine stage throughput per one NTSC frame (1/60 s):
----------------------------------------------------
  7,819: Triangle 3D transform
  3,945: Triangle set-up / clipping (stage prior to traversal)
 41,383: Scanline traversal (no pixel fill, just traversal)
637,484: Pixel fill
|
This means that during 1/60 s you can either transform 7,819 tris, OR set up 3,945 triangles, OR traverse 41,383 scanlines, OR fill 637,484 pixels within those scanlines. This is why 3D rasterizing performance differs so greatly from frame to frame: even though you have the exact same number of triangles in the scene, the amount of scanline traversal and clipping differs greatly, and certain clipping scenarios take vastly more CPU time than others.

This data allows me to do a preliminary 3D design in Excel first: figure out the polycount BEFORE I start creating 3D assets, factor in some scanline-count variability at the target resolution, and only then start creating the assets.
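For illustration only, here is a minimal C sketch of how a budget model like that turns into an fps estimate, using nothing but the stage throughputs quoted above (the function and the example scene numbers are made up, not taken from the actual engine):

#include <stdio.h>

/* Per-1/60s throughput of each engine stage, from the V4SA benchmark. */
#define TRIS_TRANSFORMED_PER_FRAME   7819.0
#define TRIS_SETUP_PER_FRAME         3945.0
#define SCANLINES_PER_FRAME         41383.0
#define PIXELS_PER_FRAME           637484.0

/* Each stage consumes a fraction of one 1/60 s frame; the fractions add. */
static double estimate_fps(double tris, double scanlines, double pixels)
{
    double frames = tris      / TRIS_TRANSFORMED_PER_FRAME
                  + tris      / TRIS_SETUP_PER_FRAME
                  + scanlines / SCANLINES_PER_FRAME
                  + pixels    / PIXELS_PER_FRAME;
    return 60.0 / frames;
}

int main(void)
{
    /* Hypothetical scene: 1,020 tris, ~20,000 traversed scanlines,
     * ~32,000 filled pixels (half of a 320x200 screen). */
    printf("%.1f fps\n", estimate_fps(1020.0, 20000.0, 32000.0));
    return 0;
}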
Status: Offline

Heimdall

Re: Market Size For New Games requiring 68040+ (060, 080)
Posted on 10-Mar-2025 0:46:58  [ #163 ]

Regular Member
Joined: 20-Jan-2025
Posts: 103
From: North Dakota
Quote:

bhabbott wrote:
What performance (fps) do you expect from your engine in the game(s) you want to produce, on a) 25MHz 040, b) 50MHz 060?
|

I expect a very high framerate on the 040-060, because I ported the engine to the Lynx's 4 MHz 6502, where it performed surprisingly well.

If it can run great on an 8-bit 4 MHz 6502, I'm sure it won't have issues on an 040-060 that pulls 20-40 MIPS (minus the C2P cost).
On the Jaguar's 13 MHz 68000, I had a set-up that did 60 fps at 768x200 (65,536 colors) with a StunRunner 3D scene. It used the Blitter to draw the scanlines in parallel with the RISC GPU processing the scanline traversal, and the DSP was doing soft-synth in parallel while the 68000 handled all the gameplay and whatever engine components didn't fit into the 4 KB RISC cache.

Since the V4SA is roughly as fast as the Atari Jaguar in my flatshader (the Jaguar's 26.6 MHz RISC GPU coupled with the Blitter is quite a beast), I reckon the 040-060 will perform at the corresponding CPU fraction of the V4SA (whatever their percentage of CPU performance against the 080 is). Minus C2P, of course.

While, on the surface, it might seem weird how on Earth the Jaguar could remotely be a match for the V4SA's ~150 MIPS, the V4SA is doing all the rasterizing on the CPU, whereas on the Jaguar:
- the framebuffer is cleared for free, in parallel with the GPU starting the 3D transformation
- scanlines are rendered for ~free, in parallel with the GPU working through the scanline traversal

Right before I stopped working on the Jaguar, I was experimenting with using its DSP (which had additional HW bugs on top of the GPU bugs) for 3D processing in parallel with the GPU, Blitter and 68000. It's an exercise in synchronisation across multiple processors, but the performance gain is worth it. That engine build blows even the V4SA out of the water, let alone the poor 040-060.

But you still have only 2 MB of RAM, unlike on the Amiga. And at 768x200 with 65,536 colors, the two framebuffers take up 0.6 MB of the 2 MB, so not much fits there in terms of art assets...
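As a quick sanity check of that figure (plain arithmetic, assuming 2 bytes per pixel for the 65,536-color mode):

#include <stdio.h>

int main(void)
{
    long one_buffer = 768L * 200L * 2L;   /* 307,200 bytes per buffer */
    long both       = 2L * one_buffer;    /* 614,400 bytes ~ 0.6 MB   */
    long total_ram  = 2L * 1024L * 1024L; /* the Jaguar's 2 MB        */
    printf("framebuffers: %ld bytes, left over: %ld bytes\n",
           both, total_ram - both);
    return 0;
}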
Status: Offline

cdimauro

Re: Market Size For New Games requiring 68040+ (060, 080)
Posted on 11-Mar-2025 5:21:08  [ #164 ]

Elite Member
Joined: 29-Oct-2012
Posts: 4193
From: Germany
@matthey

Quote:

matthey wrote:

cdimauro Quote:

Exactly. So, either you need a complete set of instructions where you directly define the precision or rounding mode, or you need a mechanism which allows you to override them on the fly.

On my new architecture I've normal instructions which use the rounding mode set in the status register, and longer instructions where it's directly set in the opcode. Problem solved, but costly (2 more bytes).
|

The 68k FPU encodings are not particularly compact, but the extra encoding space was useful for adding the FSop and FDop instructions. They may seem free after lunch was already paid for. If you only support single and double precision, then it is only one encoding bit to support the two precisions. With the bit, it is possible to support SIMD-style FPUs that perform single precision ops in single precision instead of converting to double precision like classic FPUs.
|
Still thinking about that: maybe for the 68k's FPU you could use two bits of the coprocessor id (line-F) to directly specify one of the four rounding modes.

This way you have the normal instruction with coprocessor id #1 -> use the rounding mode defined in the FPU status register. Coprocessor ids #4..7 are used for the same instruction, but force the rounding mode given by the low 2 bits of the coprocessor id.

I think that this is an elegant way to solve the problem, while keeping the same code density.

That's only if directly selecting the rounding mode is more common/useful than selecting the precision (single or double instead of extended). Otherwise it should be the other way around (the lowest two bits of the coprocessor id select the desired precision, and specific instructions should be introduced for applying a different rounding mode).
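A minimal C sketch of what that decode could look like, assuming the usual 68k coprocessor-id field in bits 11-9 of the F-line opcode word and, purely for illustration, the 68881 FPCR RND numbering for the two forced-mode bits (all names are made up):

#include <stdint.h>

enum rounding_mode {
    RND_FROM_FPCR, /* use whatever mode the FPCR currently selects */
    RND_NEAREST,   /* forced: to nearest        (low bits 00)      */
    RND_TO_ZERO,   /* forced: toward zero       (low bits 01)      */
    RND_MINUS_INF, /* forced: toward -infinity  (low bits 10)      */
    RND_PLUS_INF   /* forced: toward +infinity  (low bits 11)      */
};

/* Coprocessor id #1 keeps today's behaviour (mode from the FPCR);
 * ids #4..7 force the mode encoded in their two low bits. */
static enum rounding_mode decode_rounding(uint16_t opword)
{
    unsigned cpid = (opword >> 9) & 0x7; /* bits 11..9: coprocessor id */

    if (cpid >= 4)
        return (enum rounding_mode)(RND_NEAREST + (cpid & 0x3));
    return RND_FROM_FPCR;                /* id #1 (and others): FPCR   */
}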
The main problem with the 68k is that Motorola unfortunately wasted a lot of encoding space on this coprocessor interface, removing three precious bits which could have encoded much more useful information for the FPU instructions.

As I suggested several years ago on Olaf's forum, it would be much better to set a bit in the status register (one of the three unused bits of the CCR, to be more specific) to allow redefining line-F for the FPU/scalar instructions and line-A for the SIMD/vector instructions, to better use the 12 bits available on those lines (defining mirrored instructions: the same encodings are used, but for scalar or packed data, depending on the specific line).
Status: Offline

matthey

Re: Market Size For New Games requiring 68040+ (060, 080)
Posted on 11-Mar-2025 5:59:32  [ #165 ]

Elite Member
Joined: 14-Mar-2007
Posts: 2547
From: Kansas
cdimauro Quote:

16B/cycle has been very common for many years, and it should be the bare minimum for an enhanced 68060+ design.
|

The 68060 developers looked at and benchmarked lots of 68k code before likely choosing to target something like 80% of code. Some compilers were simplistic, and much existing code was small, where future code should have been considered too. Program sizes were growing even for embedded use, but the 68060 was built around 68000 instruction sizes and 68k code sizes where many of the programs were 64kiB or smaller using a small data model. A 68060+ upgrade with an 8B/cycle fetch, an 8B/instruction superscalar execution limit and 32kiB I+D L1 caches certainly sounds more reasonable today.
cdimauro Quote:

This could be reduced with a new implementation. Or hidden in the pipeline by having an additional stage which is used to "align" all data sources to the same size.
|

If nothing else, silicon improvements could improve integer conversion latency, much like the ARM FPU pipeline improvements in the picture I showed in my previous post in this thread. ARM is being smart to improve latencies with shorter FPU pipelines rather than maximizing clock speeds.
cdimauro Quote:

OK, now I'll share a detail about my new architecture: it has a unified register set for all GP and scalar (integer, FP) operations. Which, coincidentally, is the same choice that Mitch made with his My 66000.

On paper, a (micro)architecture with registers distributed across different domains MIGHT be more efficient, because you can carefully design every single register group with the proper, unique number of read and write ports. ON PAPER.

On paper, because my experience tells me that in the real world there is not that much difference. Which means that a unified register set can be overall as good as a register set which is split across different domains (as it was for my NEx64T. But there my problem was that I had to be 100% source compatible with x86/x64, which had different domains for all instructions. So, even for AVX-512 I had to have separate registers for the masks).

On the other side, having a unified register set provides some benefits: moving registers is "free", you can apply operations to data which normally belong to different domains (for example, a conditional move is datatype-agnostic: it can move integer or FP data), and you don't need to duplicate instructions (all AVX-512 mask registers are gone in my new architecture: I can just use the regular GP / integer scalar operations).

The only "problem" I have is that scalar operations extend the datatype according to the specific scalar type. So, a bitwise xor operation fully extends to the register size (16, 32 or 64 bit) as an integer operation, whereas a half-precision FP operation also fully extends to the maximum size (but with the proper IEEE format: up to FP128).

To go back to your problem: sharing the same register set for both GP/data and FPU could be a good solution for an enhanced 68k processor, and that's my recommendation.
|

I think a unified integer and FPU register file is ok if the ISA is designed for that and all the registers are the same 64-bit width. I do not like it for the 68k because the 68k FPU registers are extended precision 80-bit width, while widening the integer registers past 64 bits makes no sense and narrowing the 80-bit FPU registers is incompatible. It would make more sense to share the FPU and SIMD unit register files, which I believe most x86(-64) cores do. Compatibility is important! Frank Wille contacted me about VBCC test code that failed a test, thinking it was my code changes, but it was because he thought he could save some time by testing using WinUAE's 64-bit FPU emulation. Some 68k code takes advantage of the extended precision and will fail with only double precision!
cdimauro Quote:
For SIMD/Vector instructions I think that it's better to have a separated register file, with independent read/write ports. That's, again, due to my experience. However, there's an option to use the GP/scalar register for low-cost embedded chips.
Mitch introduced a completely different concept: virtual vector registers, which reuse the regular (and only) register set. It's a nice idea, but it has one limit: it can only vectorize a single loop (e.g.: no nested vector loops).
I'm considering adding an option to support something similar as well. I've a lot of flexibility with my new architecture, because it's completely new (e.g.: no x86/x64 chains).
Since 68k is "new" from this perspective, you can consider similar solutions for SIMD extension(s).
|
I realize that some implementations have had issues with shared FPU and SIMD unit register files, but both register files are large and expensive without sharing. I am open to SIMD ideas, and Mitch certainly has some good ideas, but some may not be a good fit for the 68k, where compatibility is a top priority. Different and new ISAs and ideas are good to see. New ISA designs have been stifled by x86-64 and ARM64 even though they are far from perfect, especially in their ability to scale. RISC-V is the only other new ISA that has gained momentum; it is likely not good enough to replace either ARM or x86-64 but has found some niches. ISAs have advantages and disadvantages, yet we have large one-size-fits-all ISAs other than RISC-V. Innovation has been replaced by an oppressive duopoly. ARM used to be the 4th most popular 32-bit embedded ISA and now there are not even 4 ISAs.
cdimauro Quote:
Thanks for this great explanation: it clarified this very delicate point, which I wasn't aware of, and that I must take into account!
|
The "Innocuous Double Rounding of Basic Arithmetic Operations" had a nice introduction but was too math heavy and proof like for most people. The following paper is better for the programmer with good code examples.
The pitfalls of verifying floating-point computations https://hal.science/hal-00128124
This is the paper that helped me figure out that function call FPU args need to retain the intermediate calculation precision.
The pitfalls of verifying floating-point computations Quote:

A common optimisation is inlining - that is, replacing a call to a function by the expansion of the code of the function at the point of call. For simple functions (such as small arithmetic operations, e.g. x to x^2), this can increase performance significantly, since function calls induce costs (saving registers, passing parameters, performing the call, handling return values). C99 (ISO, 1999, §6.7.4) and C++ have an inline keyword in order to pinpoint functions that should be inlined (however, compilers are free to inline or not to inline such functions; they may also inline other functions when it is safe to do so). However, on x87, whether or not inlining is performed may change the semantics of the code!

Consider what gcc 4.0.1 on IA32 does with the following program, depending on whether the optimisation switch -O is passed:

static inline double f(double x) { return x/1E308; }

double square(double x) { double y = x*x; return y; }

int main(void) { printf("%g\n", f(square(1E308))); }

gcc does not inline functions when optimisation is turned off. The square function returns a double, but the calling convention is to return floating point values into a x87 register - thus in long double format. Thus, when square is called, it returns approximately 10^716, which fits in long double but not double format. But when f is called, the parameter is passed on the stack - thus as a double, +∞. The program therefore prints +∞. In comparison, if the program is compiled with optimisation on, f is inlined; no parameter passing takes place, thus no conversion to double before division, and thus the final result printed is 10^308.
|
VBCC using the 68k FPU has the same issue as this simple x87 FPU program. The solution is not to get rid of the extended precision FPU, but for function args to always be passed in extended precision, as are FPU intermediate variables and register spills. VBCC already stores FPU variables in extended precision. Neither the SystemV stack-based arg ABI nor the SAS/C register-based arg ABI specifies saving function args in extended precision. Probably the best solution would be to violate the SAS/C ABI slightly, with documentation, for FPU code. It would be some work to implement though, and 68k development is EOL. A double precision FPU, like most emulators and the AC68080 use, solves the double rounding problem but is incompatible with some code. It is possible to have a properly functioning extended precision FPU with all the benefits it was designed to have, but there is no need for it when the original hardware is being replaced by hardware with an incompatible 68k double precision FPU!
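A minimal C sketch of the idea, assuming a compiler/ABI where long double maps to the 80-bit extended format (this mirrors the paper's example above; it is an illustration of the ABI point, not VBCC's actual code generation):

#include <stdio.h>

/* If this result travels through the ABI as a 64-bit double, 1E308 * 1E308
 * (~1E616) overflows to +inf; kept in an extended precision register or
 * passed as long double, it survives. */
static double square(double x) { double y = x * x; return y; }
static double f(double x)      { return x / 1E308; }

static long double square_x(long double x) { return x * x; }
static long double f_x(long double x)      { return x / 1E308L; }

int main(void)
{
    /* Depending on compiler, FPU and inlining this may print inf ...   */
    printf("double args:   %g\n",  f(square(1E308)));
    /* ... while carrying the intermediate in extended precision prints
     * ~1e+308, which is the behaviour the ABI change would guarantee.  */
    printf("extended args: %Lg\n", f_x(square_x(1E308L)));
    return 0;
}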
The next example has 4 different possible results on x86(-64) hardware with GCC. The GCC devs document but do not fix the problem.

https://gcc.gnu.org/wiki/FloatingPointMath
https://gcc.gnu.org/wiki/x87note
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30255
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=323

The extended precision x87 FPU remains valuable enough for scientists and engineers, but it is difficult to use and error prone. If affordable 68k hardware were available with proper extended precision support, there would be some interest. It may also be possible to provide perfect quad precision support with double-extended arithmetic by adding a new rounding mode. Double-double arithmetic does not have the exponent range of quad precision, but extended precision does.

https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Double-double_arithmetic

A full quad precision FPU would be too slow and large for low cost hardware, but extended precision is still practical.
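For reference, the core of double-double style arithmetic is an error-free transformation like Knuth's TwoSum, sketched here in C with long double under the assumption that it maps to the 68k/x87 extended format (a "double-extended-double" value is then carried as the unevaluated sum hi + lo):

/* Error-free addition: hi + lo equals a + b exactly, with hi the rounded
 * sum and lo the rounding error that was lost. */
typedef struct { long double hi, lo; } dx_t;

static dx_t two_sum(long double a, long double b)
{
    dx_t r;
    long double bv;
    r.hi = a + b;
    bv   = r.hi - a;                      /* the part of b that made it in */
    r.lo = (a - (r.hi - bv)) + (b - bv);  /* what rounding threw away      */
    return r;
}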
cdimauro Quote:
That's exactly the reason why I've a similar model with my new ISA, while supporting SIMD/vector as well.
To be more frank, I've completely moved the scalar part from NEx64T, so that the SIMD unit is purely and uniquely "vector", leaving all scalar stuff to the "good old" GP/scalar unit. This simplified a lot the architecture, whilst handling some scalar stuff on the vector unit (e.g.: broadcasting, and extracting/inserting scalars from/on vector lanes).
|
It sounds interesting. The x86(-64) SIMD unit is likely larger than it needs to be because it replaced the FPU and because of all the baggage of older SIMD ISAs. A leaner meaner and more SIMD focused ISA could be better.
cdimauro Quote:
That's something which should be done with a 68k successor: FMA + passing parameters to FP registers + extending the register letting data registers to be used for the FPU.
The 68k's FPU, as it is, can't be competitive with the modern designs & requirements.
|
The 68060 FPU is still good for a minimalist FPU by embedded standards. Most 68k fans would like to have a more full featured and modernized FPU though. Many would like to see 6888x instructions and features returned which is a possibility. It would be cool to see how Lightwave and other 3D programs would run with reduced traps. Mitch is good at transcendental instruction implementations and optimizations as well as having CISC and GPU architecture design experience. He is the guy I would try to get for a chief architect. Plus he is a good teacher. I get the feeling that he does not like the BS and constraints of working for some big businesses though.
cdimauro Quote:

Indeed. I can better cover all such scenarios, because I've just added a flag to the larger FP instructions to select a different precision for the operation, so that I can cover this corner case. Your clarification was providential (and I was lucky, because I had a spare, unused bit in this instruction format which I had no idea how to use)!
|

Flexibility is good if the long encoding bit is free anyway. So the default and short encoding of the instruction maintains the current precision, and the long version of the instruction has the option to round/convert the result to double if single and single if double?

fadd.d  ; double add, round to double (short encoding)
fsadd.d ; double add, round to single (long encoding)
fdadd.s ; single add, round to double (long encoding)
fadd.s  ; single add, round to single (short encoding)

Like so, using 68k FPU style notation?
cdimauro Quote:
BTW, Mitch said that he didn't want to automatically expand single datatypes to double, to avoid complicating the My 66000 design introducing something like the x87 did. I assume that he was referring to the default precision to be used for FP operations.
|
Ok. With a double precision FPU, it would have ended up like the PPC ISA and ABI, with all loads converted to double precision. His choice gives results like a SIMD lane, which is consistent at the cost of precision. Single precision calculations lose accuracy quickly, but there is pressure to conform to this from all the SIMD FPUs. With a free bit in the encoding, maybe it would be possible to choose consistency or accuracy.
cdimauro Quote:

I wouldn't care much about code density for FP/SIMD/vector operations.

Like you, I was always obsessed with keeping instruction length as small as possible, to reduce the code size.

However, in the real world such instructions aren't used that much. Code density is basically dominated by the regular GP/scalar integer instructions.

For this reason, having bigger FP/SIMD/vector instructions isn't really a concern. I have some short versions for the most common cases, but that's about it: if I need the full power/flexibility of my new architecture I have to use much longer encodings.

In this case, performance is much more important than code density. Plus, I've greatly reduced the number of instructions and simplified the opcode structure a lot, which is another big bonus.
|

I agree that the less frequently used instructions can and should be longer in a properly designed ISA. I am not complaining about most 68k FPU instructions being 4 bytes. More can be saved by compressing fp immediates, as they can easily be longer than the FPU instruction itself, and it is better to have at least single precision immediates, and maybe double precision too, in the more predictable instruction stream. Too large instructions can be a challenge for superscalar execution and may require extra cycles, at least for lower power cores.
cdimauro Quote:

ARM is purely focused on performance, and that's the reason why Thumb-2 wasn't extended to 64 bit. OK, also because there's no encoding space available.

However, I think that it's a problem in the long run. Code density is very important even for 64-bit architectures, because the benefits are very well known. And ARM has nothing to propose in this area.

That's a big chance for competitors.
|

I agree. It is not difficult to make a variable length encoded 64-bit ISA with better code density. Even RISC-V did it, although not by much and with a weaker performance ISA. More 68k- and Thumb-2-like code density is possible for a 64-bit ISA. The GP registers would likely need to be reduced to 16, but CISC ISAs with 16 GP registers can provide similar performance to RISC ISAs with 32 GP registers. Even the 8 GP register x86 killed the 32 GP register PPC, but nobody professional wants to develop a better CISC ISA like the 68k.
cdimauro Quote:

Still thinking about that: maybe for the 68k's FPU you could use two bits of the coprocessor id (line-F) to directly specify one of the four rounding modes.

This way you have the normal instruction with coprocessor id #1 -> use the rounding mode defined in the FPU status register. Coprocessor ids #4..7 are used for the same instruction, but force the rounding mode given by the low 2 bits of the coprocessor id.

I think that this is an elegant way to solve the problem, while keeping the same code density.

That's only if directly selecting the rounding mode is more common/useful than selecting the precision (single or double instead of extended). Otherwise it should be the other way around (the lowest two bits of the coprocessor id select the desired precision, and specific instructions should be introduced for applying a different rounding mode).
|

Changing the rounding mode is uncommon except for fp to int conversions, and there all that is needed is a variation of the FINT instruction.

FINT             - round to int using the FPCR settings (exists)
FINTRZ or FTRUNC - round to zero (exists)
FINTRP or FCEIL  - round to plus infinity (does not exist)
FINTRM or FFLOOR - round to minus infinity (does not exist)
FINTRN or FROUND - round to nearest (does not exist)

The names correspond to the C trunc(), ceil(), floor() and round() functions.
http://en.cppreference.com/w/c/numeric/math/trunc
http://en.cppreference.com/w/c/numeric/math/ceil
http://en.cppreference.com/w/c/numeric/math/floor
http://en.cppreference.com/w/c/numeric/math/round
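To make the four behaviours concrete, a tiny C example (standard library functions only, nothing 68k-specific):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double v[] = { 2.5, -2.5, 1.3, -1.3 };
    for (int i = 0; i < 4; i++)
        printf("%5.1f: trunc=%5.1f ceil=%5.1f floor=%5.1f round=%5.1f\n",
               v[i], trunc(v[i]), ceil(v[i]), floor(v[i]), round(v[i]));
    return 0;
}

/* Output:
 *   2.5: trunc=  2.0 ceil=  3.0 floor=  2.0 round=  3.0
 *  -2.5: trunc= -2.0 ceil= -2.0 floor= -3.0 round= -3.0
 *   1.3: trunc=  1.0 ceil=  2.0 floor=  1.0 round=  1.0
 *  -1.3: trunc= -1.0 ceil= -1.0 floor= -2.0 round= -1.0
 */

(Note that C round() rounds halfway cases away from zero, while the FPU's round-to-nearest mode is ties-to-even, so an FINT in round-to-nearest mode would give 2.0 for 2.5 rather than the 3.0 shown here.)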
The encoding is open for all variations, and it is nice to be able to use one instruction for each rounding mode. With this and the FSop and FDop instructions, the FPCR precision and rounding modes would rarely need to be changed, saving the 6-8 cycle FMOVE to FPCR twice, as the old FPCR is usually restored shortly after. Gunnar ignored this simple proposal though.
Last edited by matthey on 11-Mar-2025 at 06:38 AM. Last edited by matthey on 11-Mar-2025 at 06:01 AM.
Status: Offline

cdimauro

Re: Market Size For New Games requiring 68040+ (060, 080)
Posted on 17-Mar-2025 5:59:49  [ #166 ]

Elite Member
Joined: 29-Oct-2012
Posts: 4193
From: Germany
@matthey

Quote:

matthey wrote:

cdimauro Quote:

16B/cycle has been very common for many years, and it should be the bare minimum for an enhanced 68060+ design.
|

The 68060 developers looked at and benchmarked lots of 68k code before likely choosing to target something like 80% of code. Some compilers were simplistic, and much existing code was small, where future code should have been considered too. Program sizes were growing even for embedded use, but the 68060 was built around 68000 instruction sizes and 68k code sizes where many of the programs were 64kiB or smaller using a small data model. A 68060+ upgrade with an 8B/cycle fetch, an 8B/instruction superscalar execution limit and 32kiB I+D L1 caches certainly sounds more reasonable today.
|

Absolutely! The 68060 designers clearly got this wrong, because the average instruction length for 68k is around 3 bytes. So an instruction pair would require at least a 6B/cycle fetch (which is an odd size; 8B is more reasonable).

Quote:
cdimauro Quote:
This could be reduced with a new implementation. Or hidden into the pipeline having an additional stage which is used to "align" all data sources to the same size.
|
If nothing else, silicon improvements could improve integer conversion latency much like the ARM FPU pipeline improvements in the picture I showed in my previous post in this thread. ARM is being smart to improve latencies with shorter FPU pipelines rather than maximizing clock speeds. |
That's because it has to reduce power consumption, which is the most important factor (not the only one, of course) on its markets (embedded & mobile). Quote:
cdimauro Quote:
OK, now I share a detail about my new architecture: it has a unified register set for all GP and scalar (integer, FP) operations. Which, coincidentally, was the same choice that Mitch has done with its My 66000.
On paper a (micro)architecture with registers distributed on different domains MIGHT be more efficient, because you can carefully design every single register group with the proper, unique, number of read and write ports. ON PAPER.
On paper, because my experience tells me that in the real world there are not so much differences. Which means, that a unified register set can be overall as good as a register set which is split on different domains (as it was for my NEx64T. But here my problem was that I had to be 100% source compatible with x86/x64, which had different domains for all instructions. So, even for AVX-512 I had to have separate registers for the masks).
On the other side, having a unified register sets allows provides some benefits: moving registers is "free", you can apply operations to data which normally belong to different domains (for example, a conditional move is datatype-agnostic: it can move integer or FP data), and you don't need to duplicate instructions (all AVX-512 mask registers are gone on my new architecture: I can just use the regular GP / Integer scalar operations).
The only "problem" which I've is that scalar operations extend the datatype regarding the specific scalar type. So, a bitwise xor operation fully extends to the register size (16, 32 or 64 bit) as an integer operation, whereas an half-precision FP operation fully extends as well to the maximum size (but with the proper IEEE format: up to FP128).
To go back to your problem: sharing the same register set for both GP/data and FPU could be a good solution for an enhanced 68k processor, and that's my recommendation.
|
I think a unified integer and FPU register file is ok if the ISA is designed for that and all the registers are the same 64-bit width. I do not like it for the 68k because the 68k FPU registers are extended precision 80-bit width while widening the integer registers past 64-bits makes no sense and narrowing the 80-bit FPU registers is incompatible. It would make more sense to share the FPU and SIMD unit register files which I believe most x86(-64) cores do. |
At least from an architectural PoV, no: the FPU (x87) and SIMD (SSE+) registers are separate.

Only MMX was (re)using the x87 registers.

Microarchitectures are a different story, but there what they do is totally transparent (and irrelevant) to the ISA.

For the 68k, we're talking about the architecture level, where it IS relevant whether the additional FPU registers are mapped onto the data registers or onto FPU/SIMD registers.

Let's take a look at some solutions.

Extending the data registers to 80 bits (96 bits effective in memory):
- 32-bit ISA -> wasting 8 x (80 - 32) bits
- 64-bit ISA -> wasting 8 x (80 - 64) bits

Extending the FP registers to 16 and letting them hold SIMD data as well:
- 128-bit SIMD -> wasting 16 x (128 - 80) bits
- 256-bit SIMD -> wasting 16 x (256 - 80) bits
- 512-bit SIMD -> wasting 16 x (512 - 80) bits
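Worked out, those formulas give the following figures (a trivial computation, added just to put numbers on the argument):

#include <stdio.h>

int main(void)
{
    /* 8 data registers widened to 80 bits */
    printf("32-bit ISA:   %4d wasted bits\n",  8 * (80  - 32)); /*  384 */
    printf("64-bit ISA:   %4d wasted bits\n",  8 * (80  - 64)); /*  128 */
    /* 16 FP registers widened to hold SIMD data */
    printf("128-bit SIMD: %4d wasted bits\n", 16 * (128 - 80)); /*  768 */
    printf("256-bit SIMD: %4d wasted bits\n", 16 * (256 - 80)); /* 2816 */
    printf("512-bit SIMD: %4d wasted bits\n", 16 * (512 - 80)); /* 6912 */
    return 0;
}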
So, the more you scale up the SIMD unit (which makes sense, at least up to 256-bit), the more space is wasted when a register is used only for holding a scalar.

That's exactly one of the major reasons (but not the only one) why I've decided to completely separate the SIMD/vector register file from the scalar one. Quote:
Compatibility is important! |
Absolutely, but I don't see compatibility issues here: new kernels should take care of properly saving & restoring such additional bits, yes, but it's not needed for the old kernels (they are not aware of such extension -> extra bits not used -> no need to save & restore them). Quote:
Frank Wille contacted me about VBCC test code that failed a test thinking it was my code changes but it was because he thought he could save some time by testing using WinUAE 64-bit FPU emulation. Some 68k code takes advantage of the extended precision and will fail with only double precision! |
Indeed. That's why anyone who claims "100% compatibility" without supporting the extended precision is lying. Quote:
cdimauro Quote:
For SIMD/Vector instructions I think that it's better to have a separated register file, with independent read/write ports. That's, again, due to my experience. However, there's an option to use the GP/scalar register for low-cost embedded chips.
Mitch introduced a completely different concept: virtual vector registers, which reuse the regular (and only) register set. It's a nice idea, but it has one limit: it can only vectorize a single loop (e.g.: no nested vector loops).
I'm considering adding an option to support something similar as well. I've a lot of flexibility with my new architecture, because it's completely new (e.g.: no x86/x64 chains).
Since 68k is "new" from this perspective, you can consider similar solutions for SIMD extension(s).
|
I realize that some implementations have had issues with shared FPU and SIMD unit register files but both register files are large and expensive without sharing. I am open to SIMD ideas and Mitch certainly has some good ideas but some may not be a good fit for the 68k where compatibility is a top priority. Different and new ISAs and ideas are good to see. New ISA designs have been stifled by x86-64 and ARM64 when they are far from perfect, especially in their ability to scale. |
At least ARM64 improved a lot with SVE/SVE2, but x64 is still stuck in the past... Quote:
RISC-V is the only other new ISA that has gained momentum and is likely not good enough to replace either ARM or x86-64 but has found some niches. |
RISC-V is an academic ISA which is weak and badly designed (those people were living in a parallel universe, not taking into account the needs of real code).

It took ages for them to define the vector extension (so, plenty of time to think about it), and yet they failed (see how they introduced the mask/predicate support). Quote:
ISAs have advantages and disadvantages yet we have large one size fits all ISAs other than RISC-V. Innovation has been replaced by an oppressive duopoly. ARM used to be the 4th most popular 32-bit embedded ISA and now there are not even 4 ISAs. |
Unfortunately, it's very difficult to introduce a new ISA. Even if it can fit all markets. Quote:
Quote:
cdimauro Quote:

Thanks for this great explanation: it clarified this very delicate point, which I wasn't aware of, and that I must take into account!
|
The "Innocuous Double Rounding of Basic Arithmetic Operations" had a nice introduction but was too math heavy and proof like for most people. The following paper is better for the programmer with good code examples.
The pitfalls of verifying floating-point computations https://hal.science/hal-00128124
This is the paper that helped me figure out that function call FPU args need to retain the intermediate calculation precision.
The pitfalls of verifying floating-point computations Quote:

A common optimisation is inlining - that is, replacing a call to a function by the expansion of the code of the function at the point of call. For simple functions (such as small arithmetic operations, e.g. x to x^2), this can increase performance significantly, since function calls induce costs (saving registers, passing parameters, performing the call, handling return values). C99 (ISO, 1999, §6.7.4) and C++ have an inline keyword in order to pinpoint functions that should be inlined (however, compilers are free to inline or not to inline such functions; they may also inline other functions when it is safe to do so). However, on x87, whether or not inlining is performed may change the semantics of the code!

Consider what gcc 4.0.1 on IA32 does with the following program, depending on whether the optimisation switch -O is passed:

static inline double f(double x) { return x/1E308; }

double square(double x) { double y = x*x; return y; }

int main(void) { printf("%g\n", f(square(1E308))); }

gcc does not inline functions when optimisation is turned off. The square function returns a double, but the calling convention is to return floating point values into a x87 register - thus in long double format. Thus, when square is called, it returns approximately 10^716, which fits in long double but not double format. But when f is called, the parameter is passed on the stack - thus as a double, +∞. The program therefore prints +∞. In comparison, if the program is compiled with optimisation on, f is inlined; no parameter passing takes place, thus no conversion to double before division, and thus the final result printed is 10^308.
|
|
Thanks! It's another great and very insightful explanation! Quote:
VBCC using the 68k FPU has the same issue as this simple x87 FPU program. The solution is not to get rid of the extended precision FPU but for function args to always be extended precision as well as FPU intermediate variables and register spills. VBCC stores FPU variables in extended precision already. Neither the SystemV stack based arg or SAS/C register based arg ABI specify to save function args in extended precision. Probably the best solution would be to violate the SAS/C ABI slightly with documentation for FPU code. It would be some work to implement though and 68k development is EOL. A double precision FPU like most emulators use and the AC68080 solves the double rounding problem but it is incompatible with some code. |
I fully agree: compatibility is very important and I don't see why the extended precision offered by x87 and 68k shouldn't be properly and correctly used (since it has clear advantages). Quote:
It is possible to have a properly functioning extended precision FPU with all the benefits it was designed to have but there is no need with the original hardware being replaced by hardware with an incompatible 68k double precision FPU! |
Do you refer to a hardware implementation of a 68k FPU? Or to proper support at the compiler level? Or both? Quote:

The next example has 4 different possible results on x86(-64) hardware with GCC. The GCC devs document but do not fix the problem.

https://gcc.gnu.org/wiki/FloatingPointMath
https://gcc.gnu.org/wiki/x87note
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30255
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=323
|

That's criminal! The bug is 25 years old now and has been duplicated around a hundred times. Why don't they fix it?!? Quote:
If affordable 68k hardware was available with proper extended precision support, there would be some interest. It may be possible to provide perfect quad precision support with double extended arithmetic by adding a new rounding mode also. Double double arithmetic does not have the exponent range of quad precision but extended precision does.
https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Double-double_arithmetic
A full quad precision FPU would be too slow and large for low cost hardware but extended precision is still practical. |
That's an awesome trick for getting much more precision by reusing the same, limited, double precision. Kudos to the inventors! Quote:
cdimauro Quote:
That's exactly the reason why I've a similar model with my new ISA, while supporting SIMD/vector as well.
To be more frank, I've completely moved the scalar part from NEx64T, so that the SIMD unit is purely and uniquely "vector", leaving all scalar stuff to the "good old" GP/scalar unit. This simplified a lot the architecture, whilst handling some scalar stuff on the vector unit (e.g.: broadcasting, and extracting/inserting scalars from/on vector lanes).
|
It sounds interesting. The x86(-64) SIMD unit is likely larger than it needs to be because it replaced the FPU and because of all the baggage of older SIMD ISAs. A leaner meaner and more SIMD focused ISA could be better. |
That's what I've done. Now the ISA (compared also to NEx64T) looks like a dwarf... Quote:
Quote:
cdimauro [quote] That's something which should be done with a 68k successor: FMA + passing parameters to FP registers + extending the register letting data registers to be used for the FPU.
The 68k's FPU, as it is, can't be competitive with the modern designs & requirements.
|
The 68060 FPU is still good for a minimalist FPU by embedded standards. Most 68k fans would like to have a more full featured and modernized FPU though. Many would like to see 6888x instructions and features returned which is a possibility. It would be cool to see how Lightwave and other 3D programs would run with reduced traps. Mitch is good at transcendental instruction implementations and optimizations as well as having CISC and GPU architecture design experience. He is the guy I would try to get for a chief architect. Plus he is a good teacher. I get the feeling that he does not like the BS and constraints of working for some big businesses though. |
He is already retired, but he's still active. And he has some patents on fast implementations of FPU trigonometric etc. instructions.

However, he's fully involved in his own ISA. Quote:
cdimauro Quote:
Indeed. I can better cover all such scenarios, because I've just added a flag on the larger FP instructions, to select a different precision for the operation. So that I can cover this corner case. Your clarification was providential (and I was lucky because I had a spare, unused bit on this instruction format, which I had no idea on how to use for)!
|
Flexibility is good if the long encoding bit is free anyway. So the default and short encoding of the instruction is to maintain the current precision and the long version of the instruction has the option to round/convert the result to double if single and single if double? |
Roughly yes. But see below for more detail. Quote:
fadd.d  ; double add, round to double (short encoding)
fsadd.d ; double add, round to single (long encoding)
fdadd.s ; single add, round to double (long encoding)
fadd.s  ; single add, round to single (short encoding)

Like so, using 68k FPU style notation?
|
The situation is a bit more complicated, because in my new ISA I have the same instruction packaged in different (opcode) formats, which enable more features/extensions thanks to the longer encodings. On top of that, I have some overlapping between GP and scalar (integer or FP) instructions. They aren't the same thing: there are instructions in the first group and not in the second, and vice versa. And there are instructions in both. For example, ADD is in both groups, whereas CALL is only in the first one, and MAX only in the second one.

Let's focus only on the scalar FP version of ADD. I haven't yet defined the final syntax, because I'm still deciding between ADD F0,F1,F2 and FADD R0,R1,R2 (in both, the first register is the destination).

Taking the above list as a reference and assuming that the maximum precision is FP128, here follow the equivalent versions with my new ISA, and a bit more:

add f0,f1,f2      ; extended add, no precision rounding (short encoding)
add{d} f0,f1,r2.d ; double add, round to double (normal encoding)
add{s} f0,f1,r2.d ; double add, round to single (normal encoding)
add{d} f0,f1,r2.s ; single add, round to double (normal encoding)
add{s} f0,f1,r2.s ; single add, round to single (normal encoding)

The short encoding is the common one for all scalar operations and always operates on the full register size (e.g. FP128 for scalar FP and 64-bit for scalar int). It's the most compact because it works only with registers (and since there are three of them, the encoding space is very limited).

The normal encoding is the one which allows defining an EA for the second source, plus some additional bits for the desired precision and exception suppression for FP data, and sign or zero extension for int data (there I have two spare bits which aren't used yet, because there's no precision-selection concept).

I have other encodings / opcode formats for defining an immediate as second source (in a more compact form compared to the different types of immediates which can be defined with an EA), mem-mem and mem-mem-mem versions (both of which allow defining the precision of each operand), and a general mechanism which is applicable to all the above cases (except the short encoding) and further defines other extensions/"enrichments" of the instruction (including its conditional execution).

SIMD/vector instructions have exactly equivalent encodings for the same instructions (but the short encodings are much more limited: not much encoding space!). However, there's no precision rounding which can be selected for FP data (operations are always executed with the given data type size). A mask can be selected, as well as broadcasting of the last source EA, zero/merge of lanes, and the size of the "vector". Quote:
cdimauro Quote:
I wouldn't care much about code density for FP/SIMD/vector operations.
As you, I was always obsessed to keep instruction length as small as possible, to reduce the code density.
However and on the real world, such instructions aren't so much used. The code density is basically dominated by the regular GP/Scalar integer instructions.
For this reason, having bigger FP/SIMD/Vector instructions isn't really a concern. I've some short versions for the most common cases, but that's all about : if I need to full power/flexibility of my new architecture I've to use much longer encoding.
In this case, performance is much more important than code density. Plus, I've greatly reduced the number of instructions & simplified a lot the opcodes structure, which is another big bonus.
|
I agree that the less frequently used instructions can and should be longer in a properly designed ISA. I am not complaining about most 68k FPU instructions being 4 bytes. More can be saved by compressing fp immediates as they can easily be longer than the FPU instruction and it is better to have at least single precision immediates and maybe double precision too in the more predictable instruction stream. |
Do you mean using shorter immediates for instructions which work with a given precision? For example:

fadd.d #fp16imm,fp0 ; double add, loading an FP16 immediate as the second source

?
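For illustration, expanding such a compressed immediate is cheap; here is a minimal C sketch of an IEEE 754 binary16 to double conversion, the kind of work a decoder or assembler would do (the function name is made up):

#include <stdint.h>
#include <math.h>

static double fp16_to_double(uint16_t h)
{
    int      sign = (h >> 15) & 0x1;
    int      exp  = (h >> 10) & 0x1F;  /* 5 exponent bits, bias 15 */
    uint32_t frac =  h        & 0x3FF; /* 10 fraction bits         */
    double   v;

    if (exp == 0)                      /* zero or subnormal        */
        v = ldexp((double)frac, -24);
    else if (exp == 31)                /* infinity or NaN          */
        v = frac ? NAN : INFINITY;
    else                               /* normal, implicit 1 bit   */
        v = ldexp((double)(frac | 0x400), exp - 25);

    return sign ? -v : v;
}

An FP16 immediate covers constants like 0.5, 1.0, 2.0 and small integers in 2 bytes instead of the 8 bytes of a full double, which is exactly the code density argument being made here.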
Quote:

Too large instructions can be a challenge for superscalar execution and may require extra cycles, at least for lower power cores.
|
As you can see above, I've got a lot of "beef", but it's paid for with longer encodings (compared to the 68k and other architectures): I could have focused only on the common case (full precision operations).

The 68k is in better shape regarding code density for FP instructions, but it has less flexibility (e.g. it's missing 3-operand forms, which is the major point to be addressed, IMO). Quote:
cdimauro Quote:
Still thinking about that, maybe for the 68k's FPU you can use two bits from the coprocessor id (Line-F) to directly specify one of the four rounding modes.
This way you've the normal instruction with coprocessor id #1 -> use the rounding mode defined on the FPU status register. Coprocessor ids #4..7 are used for the same instruction, but forcing the rounding mode coming from the low 2 bits of the coprocessor id.
I think that this is an elegant way to solve the problem, while keeping the same code density.
That's only if directly selecting the rounding mode is more common/useful compared to selecting the precision (single or double instead of extended). Otherwise it should be the other way around (the lowest two bits of the coprocessor id select the desired precision, and specific instructions should be introduced for applying a different rouding mode).
|
Changing the rounding mode is uncommon except for fp to int conversions and there all that is needed is a variation of the FINT instruction.
FINT             - round to int using the FPCR settings (exists)
FINTRZ or FTRUNC - round to zero (exists)
FINTRP or FCEIL  - round to plus infinity (does not exist)
FINTRM or FFLOOR - round to minus infinity (does not exist)
FINTRN or FROUND - round to nearest (does not exist)
The names correspond to the C trunc(), ceil(), floor() and round() functions.
http://en.cppreference.com/w/c/numeric/math/trunc http://en.cppreference.com/w/c/numeric/math/ceil http://en.cppreference.com/w/c/numeric/math/floor http://en.cppreference.com/w/c/numeric/math/round
The encoding is open for all variations and it is nice to be able to use one instruction for each rounding mode. With this and the FSop and FDop instructions, the FPCR precision and rounding modes would rarely need to be changed saving the 6-8 cycle FMOVE to FPCR times two as the old FPCR is often restored shortly after. Gunnar ignored this simple proposal though. |
Indeed. That would have been very helpful, but we know how it "works"...
Thanks for the clarification, anyway. I've slightly changed the ISA according to the above.
I've had a CONV instruction for a very long time, as a general datatype-to-datatype conversion, but it was limited to truncation or round-to-nearest when converting from FP to int data. Now it allows any of the four rounding modes to be specified.

Fortunately, CONV can use this mechanism for the destination in the short encoding (while using the full precision for the source operand).

BTW, I have no PMOVS*/PMOVZ* nor any PACK/UNPACK SIMD/vector instructions, because CONV is used to cover all those cases.

P.S. No time to read it again.
Status: Offline

matthey

Re: Market Size For New Games requiring 68040+ (060, 080)
Posted on 18-Mar-2025 4:15:49  [ #167 ]

Elite Member
Joined: 14-Mar-2007
Posts: 2547
From: Kansas

cdimauro Quote:
Absolutely! 68060 were clearly wrong, because the average instruction length for 68k is around 3 bytes. So, an instruction pair should have required at least 6B/cycle fetch (which is odd. 8B is more reasonable).
|
Wrong is not how I would word the 68060 4B/cycle instruction fetch design decision. It was a design choice to lower power at the expense of performance, which may not have been the wrong choice for the embedded market at the time. The high-end, high-performance embedded market was smaller then than it is today. Most RISC cores could not reach the integer performance efficiency (performance/MHz) of the 68060, and the 68060 still competed with desktop x86 and PPC CPUs in performance, but at reduced power and area for the embedded market.

68060@75MHz        3.3V, 600nm, 2.5 million transistors, ~5.5W max
Pentium-P54C@75MHz 3.3V, 600nm, 3.3 million transistors,  9.5W max
PPC601@75MHz       3.3V, 600nm, 2.8 million transistors, ~7.5W max

The 68060 4B/cycle instruction fetch is not optimal for performance. It is nice that the in-order 68060 design, with so much performance already, obviously has more performance potential, but boosting the instruction fetch alone would likely provide only a modest performance improvement. The similar but lower power ColdFire V5 design increased the instruction fetch to 8B/cycle, so there is little doubt that it is a worthwhile performance upgrade.
cdimauro Quote:
That's because it has to reduce power consumption, which is the most important factor (not the only one, of course) on its markets (embedded & mobile).
|
Yes, another PPA trade-off. Reduced pipeline depth for lower power, smaller area and reduced instruction execution latencies in cycles at the expense of max clock speed. Modern silicon is still providing some of the timing improvement as a 13-stage integer pipeline CPU core can have a double precision FADD with a 2 cycle latency today. The 68060 had one of the best integer 32x32 multiply latencies at 2 cycles but it could likely be single cycle with possible multi-issue capabilities today. From 500nm to 50nm, the electricity has 1/10 of the distance to travel and from 500nm to 5nm the electricity has 1/100 of the distance to travel.
cdimauro Quote:
At least from an architectural PoV, no: FPU (x87) and SIMD (SSE+) registers are separated.
Only MMX was (re)using x87's registers.
Microarchitectures are a different story, but here what they do is totally transparent (and irrelevant) to the ISA.
For 68k, we're talking about the architecture level, where it IS relevant if the additional FPU registers are mapped to the data or FPU/SIMD registers.
Let's take a look at some solutions.
Extending the data registers to 80 bits (96 bits effective in memory):
- 32-bit ISA -> wasting 8 x (80 - 32) bits
- 64-bit ISA -> wasting 8 x (80 - 64) bits

Extending the FP registers to 16 and letting them hold SIMD data as well:
- 128-bit SIMD -> wasting 16 x (128 - 80) bits
- 256-bit SIMD -> wasting 16 x (256 - 80) bits
- 512-bit SIMD -> wasting 16 x (512 - 80) bits
So, the more you scale with the SIMD unit (which makes sense. At least up to 256-bit), the more space is wasted when one register is used only for holding a scalar.
That's exactly one of the major reason (but not the only one) why I've decided to completely separate the SIMD/Vector registers file from the scalar one.
|
Register files are expensive, especially for embedded use. This makes register sharing tempting. Sharing 64-bit integer and 64-bit FPU registers makes sense with a 64-bit FPU. The 80-bit 68k FPU has advantages, and it is not worth breaking compatibility to reduce it to 64-bit. There are basically two options that I see for the 68k FPU.

1. 80-bit FPU registers  - quad precision supported with a pair of extended precision registers
2. 128-bit FPU registers - default precision extended, quad precision requires a 2nd FPU ALU pass

The latter choice is cleaner, and a 128-bit size is reasonable for SIMD unit registers if sharing. Maybe keep the first 8 FPU registers for the FPU only and compatibility, make the 2nd 8 FPU registers volatile and shared with the SIMD unit, and make the last 8 128-bit registers SIMD only, referenced backward (the first SIMD register is the last register in the register file), for 24 total 128-bit registers? A sketch of the layout follows.

Just throwing the idea out there rather than thinking all the details through.
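Purely as an illustration of the indexing (the struct and function names are made up; nothing here is a committed design):

#include <stdint.h>

/* 24 x 128-bit slots, following the sharing idea above:
 *   slots  0..7  : FP0-FP7, FPU only (the existing 80-bit registers kept
 *                  for compatibility, stored in the low bits of a slot)
 *   slots  8..15 : FP8-FP15, volatile, shared between FPU and SIMD
 *   slots 16..23 : SIMD only; SIMD registers are numbered backward, so
 *                  V0 is slot 23, V1 is slot 22, ..., V15 is slot 8.   */
typedef struct { uint64_t lo, hi; } reg128_t;
typedef struct { reg128_t slot[24]; } fpu_simd_file_t;

static reg128_t *fp_reg(fpu_simd_file_t *rf, unsigned n)   /* FP0..FP15 */
{
    return &rf->slot[n];
}

static reg128_t *simd_reg(fpu_simd_file_t *rf, unsigned n) /* V0..V15   */
{
    return &rf->slot[23 - n];
}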
cdimauro Quote:
Indeed. That's why who claims "100% compatibility" without supporting the extending precision is lying.
|
Motorola wanted to castrate the 68k extended precision FPU for embedded use but at least ColdFire does not claim to be 68k compatible. Lack of 68k compatibility was a major reason why ColdFire failed too.
cdimauro Quote:
RISC-V is an academic ISA which is weak and badly designed (such people were living on a parallel universe, due to not taking into account what are the needs with real code).
It took ages for them to defined the vector extension (so, plenty of time to think about it), and yet they failed (see how they introduced the mask/predicate support).
|
But David Patterson is an award-winning RISC expert with experience from RISC-I, RISC-II, RISC-III and RISC-IV. Now you are telling me a RISC-VI do-over is needed?
cdimauro Quote:
Do you refer to hardware implementation of a 68k FPU? Or to the proper support at the compiler level? Or both?
|
The 68881 through 68060 already have the necessary hardware for full extended precision support. The only piece missing is an ABI standard that specifies saving FPU function args, in registers or on the stack, in extended precision. As I recall, VBCC already uses extended precision by default, uses and spills extended precision variables, context switches save the extended precision registers, and FPU return values are in FP0. I believe the only missing part is that FPU function args are placed on the stack as either double or single precision, according to the arg's original datatype, rather than as the extended precision intermediate result. Popular 68k ABIs are a mess.

1. SysV/Unix ABI - ancient, all args passed on the stack using natural alignment
   o used by VBCC & LLVM

2. GCC/Linux ABI - all args passed on the stack, relaxed stack alignment is poor for performance
   o used by official GCC

3. SAS/C Fastcall ABI - args passed in scratch registers with the remainder on the stack, poor documentation
   o used by SAS/C, VBCC & I believe the unofficial Geek Gadgets GCC up to GCC 3.4 when register args are selected

https://m680x0.github.io/doc/abi.html
https://m680x0.github.io/ref/sysv-abi-download
https://eab.abime.net/showthread.php?t=108456

The SAS/C ABI with a small change, passing non-scratch-register FPU function args on the stack in extended precision, would suffice, but it would introduce incompatibility with an existing standard. I offered to document a new ABI (a modified SAS/C ABI) with the changes, but there is not much motivation to adopt it for a dead 68k platform. I was developing for VBCC and there was resistance to change for a dead 68k, and that is not even talking about getting strangers to agree to and adopt a new ABI.

Part of the problem is that the 68k FPU ISA with extended precision gets lumped in with the x86 FPU ISA with extended precision. The x86 FPU ISA has problems, and there have been bugs in the CPU hardware, making it difficult to provide better extended precision support and making it a nightmare for compiler developers, as the GCC developers' unwillingness to touch the existing support shows. Even William Kahan, the father of floating point, who helped design the 8087 FPU, knows the 68k ISA is better.
https://history.siam.org/%5C/pdfs2/Kahan_final.pdf Quote:
HAIGH: So you weren't sensing that the 8086 architecture was going to dominate?

KAHAN: Oh, no. If I had known that, it would have curdled my blood. It had a horrible architecture. No, but the arithmetic was what I was designing. Then a problem arose, in that they found that the op code space was somewhat tight; they didn't really have space for op codes that would use a regular register architecture. They didn't have enough space in the op code to have two registers routinely. It looked as if they could have in many cases only one register. They can have only one register; the other must be implicit. That means you've got to have a stack, so everything goes to the top of the stack. It's a goddamned bottleneck in performance, but it does allow you to get by with this op code limitation.

...

KAHAN: Well, because Intel had so large a share of the market, it was really selling a vast number of these chips at fairly high prices - much higher than the cost of production - so they had money rolling in and could afford to have literally hundreds of engineers working on the next generation of Intel chips. The other people, like the SPARC people and so on, didn't have that luxury. They really had to scramble for the money and for the engineering. The Pentium architecture is appalling. It's really bad. It's descended from the Intel X86 architecture, and perpetuates most of its faults. But the Pentium isn't exactly what's there. What's there is a little microengine which executes the most frequent instructions as any RISC chip would, in a very direct fashion and, as soon as the instructions get at all interesting, it goes to microcode. So how can you tell the difference? The RISC chips - they've got microcode too. Even the DEC Alpha, which must surely be the epitome of a RISC-chip design, had what they called PAL code, which was essentially microcode to handle exceptional or rare events. Ultimately, they used PAL code to cope with the penalties imposed by what were called trap barriers, but that's somebody else's history.

...

And Peter Markstein - I'd met him originally in Yorktown Heights - he was now in Austin working for IBM on this particular project. He found lots and lots of ways to use the fused multiply-add to advantage, so it turned out to be a great thing for them. Then, when Apple was thrown into bed with IBM by the Pepsi-Cola man, Scully, Apple adopted the same architecture. Motorola got a license to produce the chips, initially. So Motorola was very disappointed to lose Apple as a customer for their 68000 family, and also for their subsequent RISC family. That was the 88110.

The 88110 was a Motorola chip with a RISC architecture and beautiful floating point, just like the floating point on the 68040. The 68040 did thoroughly what John Palmer had wanted to do, but didn't have enough transistors to do. They did a really nice job. Mind you, they made one mistake, and it was a funny mistake. See, they used a CORDIC algorithm (remember, I said there's CORDIC, which is part of a family of algorithms). I was using pseudo-multiply, pseudo-divide. They were using a CORDIC algorithm, and the CORDIC algorithm isn't quite so accurate as pseudo-multiply, pseudo-divide, but on the other hand, the CORDIC algorithm can run faster. It can go some 30% faster.

The CORDIC algorithm was written up really well by Steve Walther. He worked at IBM in the Deer Creek Road research facility. He wrote a paper on CORDIC algorithms for computing transcendental functions, not merely trig functions - hyperbolic functions and, consequently, log, exponential, things like that - and he presented the algorithms in APL language, so he really had everything you needed.

But he left one thing out of his paper: how many extra digits must you carry in order to get a certain number of digits right? And the answer wasn't perfectly obvious, but when the guys at Motorola were in the throes of implementing this algorithm, they were calling Steve Walther for advice whenever they came to a sticking point, and Steve Walther, being a good-natured fellow, was giving them advice until his management at Hewlett-Packard said, "What are you doing giving away this kind of consulting information to a rival or at least potential rival?" So the next time they called, Steve had to say, "I'm sorry. My boss has told me I shouldn't talk with you guys."

What were they calling him about? They wanted to know how many extra digits they should carry in order to get 64 correct. Well, he couldn't tell them, and they didn't figure it out quite right so, in fact, they were losing a few bits of the 64. On the initial coprocessors for the Motorola 68000 family (that's the 68881 and the 68882, slightly faster), they were losing bits, so their transcendental functions were not so accurate as Intel's and, indeed, therefore not so accurate as Cyrix's. But they were pretty fast, and they had built the complete library on the chip, doing what John Palmer had only hoped to do and, ultimately, they redesigned things so that they could put the floating point on the same chip as the processor and not need a coprocessor, and that was the 68040. For the 68040, they employed Peter Tang as a consultant. He wrote a beautiful library for them in which they got all the accuracy they needed. And I have a 68040 and I'm delighted with it. It runs in my Macintosh Quadra. It does a beautiful job but, unfortunately, it does it with a 33 MHz clock.

Well, it looked as though the RISC architecture fad would be inappropriate for the 68000 because it was a CISC architecture, if ever you saw one, a complicated instruction set architecture with an orthogonal instruction set and everything else. So one could have said, "Well, you've got a slow machine," but Motorola had designed a RISC family of machines which had the same floating point as the 68040, just implemented a little bit differently. It used the same transcendental function codes, just implemented a little bit differently, and Apple rejected the use of that processor because Scully wanted Apple and IBM to become buddies. So that killed the 88110, which was a good architecture. I think I like the 960 better, but the 88110 was a pretty good architecture. I've got the manual here, even though I haven't had the chance to look at it carefully, and that screwed things up at Apple.

Apple had an excellent numerical environment called SANE, Standard Apple Numerical Environment. It had so many of the things that I had wanted. They didn't necessarily do it in quite the way I liked, but they did a very conscientious job, and programmers came to like it a lot. Of course, it takes a while. There's a lag. You know, initially the programmers are complaining about everything. They'll complain about any change, they'll complain about having to spend extra time to get the right answer and stuff like that, but ultimately, the programmers were really liking it. These guys were getting unsolicited testimonials from programmers and some letters to the editor and so on about what a great thing that SANE was, and that was killed in order to switch to the new Power PC architecture.
|
The 808x ISA is horrible and the 68k ISA was "beautiful" and "orthogonal", everything they wanted the 808x to be. Then Apple's Scully killed the 68k to switch to PPC. The Kahan interview goes on to talk about the x86 FPU overflow/underflow stack register bug, the Pentium FDIV and FP-to-int conversion bugs, and mentions reduced precision from some transcendental functions (FSIN inaccuracy). The 6888x FPU just had a minor loss of precision from using faster CORDIC functions. The 8087 FPU was historically important in bringing math and science into FPU design, but it was the 68k FPU that did it right, and then RISC cheapened it, ignoring much of his work.
cdimauro Quote:
He already retired, but he's still active. And he has some patents for fast implementation of FPU trig etc. instructions.
However, he's fully involved on his ISA.
|
Mitch's ISA uses a 16-bit VLE like the 68k. Variable sized immediates and displacements are encoded in the code like the 68k. The hardware and decoding are not so much different between the 68k and his ISA. He may be interested in licensing some ancient 68k, 88110 and ColdFire V5 CPU cores. He was an architect on the 88k and helped with 68k development, which would likely increase interest.
cdimauro Quote:
The situation is a bit more complicated, because on my new ISA I've the same instruction, but packaged in different (opcode) formats which enable more features/extensions thanks to the longer encodings. On top of that, I've some overlapping between GP and scalar (integer or FP) instructions. They aren't the same thing: there are instructions on the first group and not on the second, and viceversa. And there are instructions on both. For example, ADD is both groups, whereas CALL is only on the first one, and MAX only on the second one.
Let's focus only to the scalar FP version of ADD. I haven't yet defined the final syntax, because I'm still thinking about ADD F0,F1,F2 and FADD R0,R1,R1 (on both the first register is the destination).
Taking as a reference the above list and assuming that the maximum precision is FP128, here follow the equivalent versions with my new ISA and a bit more:
add f0,f1,f2       ; extended add, no precision rounding (short encoding)
add{d} f0,f1,r2.d  ; double add, round to double (normal encoding)
add{s} f0,f1,r2.d  ; double add, round to single (normal encoding)
add{d} f0,f1,r2.s  ; single add, round to double (normal encoding)
add{s} f0,f1,r2.s  ; single add, round to single (normal encoding)

The short encoding is the common one for all scalar operations, and always operate to the full register size (e.g. FP128 for scalar FP and 64-bit for scalar int). It's the most compact because it works only with registers (and since there are three of them, the encoding space is very limited).
The normal encoding is the one which allows to define an EA for the second source and some additional bits for the desired precision & exception suppression for FP data, and sign or zero extension for int data (but I've two spare bits which aren't yet used, because there's no precision selection concept here).
I've other encodings / opcode formats for defining an immediate as second source (in a more compact form compared to different types of immediates which can be defined with an EA), mem-mem and mem-mem-mem versions (both which allow to define the precision of each operand), and a general mechanism which is applicable to all the above cases (except the short encoding) which further defines other extensions/"enrichments" of the instruction (included its conditional execution).
SIMD/vector instructions have exact equivalent encodings for the same instructions (but short encodings are much more limited: not much encoding space!). However, there's no precision rounding which can be selected for FP data (operations are always executed with the given data type size). A mask can be selected, as well as broadcasting of the last source EA, zero/merge of lanes, and the size of the "vector".
|
Looks good other than the mem-mem-mem encodings. Even a VAX could likely be microcoded with decent performance, but mem-mem-mem may increase the microcode size and/or complexity.
cdimauro Quote:
Do you mean using shorter immediates for instructions which work with a given precision? For example:
fadd.d #fp16imm,fp0 ; double add, loading an FP16 immediate as second source ?
|
Yes. That is the kind of FP peephole optimization VASM currently performs for the 68k FPU although half precision FP is not supported, another idea Gunnar did not like. FP immediates can be very long but often can be exactly represented with a lower precision FP datatype.
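The core of that kind of peephole is just an exact round-trip test. A minimal C sketch (illustration only, not vasm's actual code): a double immediate may be emitted in single precision only when narrowing and widening it again gives back the identical value; the same test works for half precision where a half type is available.

#include <stdbool.h>

/* true if the double constant x can be emitted as a single
   precision immediate without changing its value */
bool fits_in_float(double x)
{
    float narrowed = (float)x;        /* round to single precision */
    return (double)narrowed == x;     /* exact round trip? */
}

(NaN immediates need special-casing, since NaN never compares equal to itself.)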
cdimauro Quote:
As you can see above, I've a lot of "beef". But this is paid for with longer encodings (compared to the 68k and other architectures): I could have focused only on the common case (full precision operations).
68k is in a better shape regarding the code density for FP instructions, but it has less flexibility (e.g. it's missing 3 operands, which is the major point to be addressed, IMO).
|
As I recall, few register to register FMOVE instructions were used in the VBCC support code I worked on. There were more to begin with but I managed to get rid of most of them. The 68040 had the FMOVE parallel operation optimization but I would have rather had FINT/FINTRZ in hardware. I can see a pipelined FPU using more FMOVE instructions and 3 op becoming more important but I still expect mem-reg operations to be more common than 3 op FPU instructions. I could be wrong and it would be interesting to see some comparisons of optimized code.
cdimauro Quote:
Indeed. That would have been very helpful, but we know how it "works"...
Thanks for the clarification, anyway. I've slightly changed the ISA according to the above.
I already had a CONV instruction since very long time, for a general data-type to data-type instruction, but it was limited only to truncation or nearest when converting from FP to int data. Now it allows to define any of the four rounding modes.
Fortunately, CONV can use this mechanism for the destination short encoding (but using the full precision for source operand).
BTW, I've no PMOVS*/PMOVZ* neither some PACK/UNPACK SIMD/vector instruction, because CONV is used to cover all those case.
|
There are other rounding modes possible but those 4 are the core IEEE rounding modes.

IEEE rounding modes
Round to Nearest, Ties to Even - round()
Round toward Zero - trunc()
Round toward +Infinity - ceil()
Round toward -Infinity - floor()

other rounding modes
Round to Nearest, Ties away from 0
Round to Nearest, Ties toward 0
Round away from Zero

The IEEE FP round to nearest, ties to even is different from the rounding taught in school but is preferred for statistics. Other rounding modes can be useful and are more often taught in schools. The double-double or double-quad trick uses a rounding mode not required by IEEE. A 3-bit rounding mode field could be useful in the FPCR for future support, although a 2-bit field may be fine in instruction encodings for now.
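The four core modes can be exercised from C99 with <fenv.h>; rint() follows the currently selected mode, while the library's round() always rounds ties away from zero (the "school" rounding mentioned above). A small sketch:

#include <fenv.h>
#include <math.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    const double x = 2.5;

    fesetround(FE_TONEAREST);  printf("nearest, ties to even: %g\n", rint(x)); /* 2 */
    fesetround(FE_TOWARDZERO); printf("toward zero          : %g\n", rint(x)); /* 2 */
    fesetround(FE_UPWARD);     printf("toward +infinity     : %g\n", rint(x)); /* 3 */
    fesetround(FE_DOWNWARD);   printf("toward -infinity     : %g\n", rint(x)); /* 2 */

    printf("round() (ties away)  : %g\n", round(x));                           /* 3 */

    fesetround(FE_TONEAREST);  /* restore the default mode */
    return 0;
}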
Last edited by matthey on 18-Mar-2025 at 11:20 PM.
|
| Status: Offline |
| | Hammer
 |  |
Re: Market Size For New Games requiring 68040+ (060, 080) Posted on 18-Mar-2025 23:15:30
| | [ #168 ] |
| |
 |
Elite Member  |
Joined: 9-Mar-2003 Posts: 6289
From: Australia | | |
|
| @Karlos
Quote:
I didn't say it was. I said that fdiv is a pretty stupid metric to base system performance in general on as most code will avoid division if it's optimised for performance. Division, whether integer or floating point, is still the slowest of the regular arithmetic operations even today, let alone in the mid 90's. You use division when you need exact/precise results.
Quake uses it for perspective correction, but it's not strictly necessary: if the divisor range can be normalised to some fixed range that can then be expressed as an integer, that integer can then be used to select a reciprocal from a lookup that is then used as a multiplier instead. Abrash didn't do this because he was able to rearrange the code manually to mask the fdiv latency behind a whole block of other integer instructions. Plus the lookup might end up being too large, depending on the necessary range.
I'm sure a similar approach could be implemented for the 68K as the FPU can calculate the division while the IU is busy doing something else too. However, do any of the Amiga ports have an equivalently hand optimised version of the draw16 code or are they just using the vanilla C?
There were numerous other bits of Quake optimised by Abrash for x86 too that probably lack equivalent attention for 68K.
Comparing release Quake on the Pentium and 68060 is oranges and apples until you are dealing with ports having equivalent optimisation effort. |
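A hypothetical sketch of the reciprocal-lookup idea described in the quote above (illustration only; Quake itself does not do this): the divisor is normalised into a fixed range, quantised to a table index, and the stored reciprocal is used as a multiplier.

#define TABLE_BITS 10
#define TABLE_SIZE (1 << TABLE_BITS)

static float recip_table[TABLE_SIZE];

/* build reciprocals for divisors normalised to the range [1.0, 2.0) */
void init_recip_table(void)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        float d = 1.0f + (float)i / TABLE_SIZE;
        recip_table[i] = 1.0f / d;
    }
}

/* approximate x / d, for d already normalised into [1.0, 2.0),
   using a multiply instead of a divide; accuracy is limited by
   the table resolution, which is the size trade-off mentioned above */
float approx_div(float x, float d)
{
    int index = (int)((d - 1.0f) * TABLE_SIZE);
    return x * recip_table[index];
}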
AC68080 is 68K optimised for Quake, hence its strength is shown with Quake instead of Lightwave._________________ Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68) Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68) Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB |
| Status: Offline |
| | Karlos
|  |
Re: Market Size For New Games requiring 68040+ (060, 080) Posted on 18-Mar-2025 23:29:50
| | [ #169 ] |
| |
 |
Elite Member  |
Joined: 24-Aug-2003 Posts: 4928
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition! | | |
|
| @Hammer
I read somewhere that the 68080 has a particularly fast floating point divide. Are the sources for this version available? (I mean, they *should* be.) _________________ Doing stupid things for fun... |
| Status: Offline |
| | Hammer
 |  |
Re: Market Size For New Games requiring 68040+ (060, 080) Posted on 19-Mar-2025 1:13:56
| | [ #170 ] |
| |
 |
Elite Member  |
Joined: 9-Mar-2003 Posts: 6289
From: Australia | | |
|
| @matthey
Quote:
The Pentium FXCH trick is a kludge workaround that makes a pipelined FPU worthwhile but it is still far from optimal. Linus Torvalds also talks about the x87 FPU deficiencies (see link for full x87 FPU issues).
|
FXCH is not a trick; it's a programmer-controlled register renaming function.

3DNow recycles the x87 FPRs with non-stack register access. 3DNow extends MMX into floating point, which removes the MMX/x87 context switch overhead, but Intel had other ideas, e.g. separate SSE registers.
SSE2 with FP64 has reduced the need for x87 FPR.
Quote:
Linus calls the x87 stack registers "a horribly horribly bad idea" while I called the whole x87 FPU ISA "horrible". Maybe we are both biased because we started on the 68k and have both made comments about the 68k being easier to program?
|
The 68881 was introduced in 1984, which is too late for the IBM PC 5150, since it already supported the 8087 with INT32, INT64, FP32, FP64 and FP80. The "killer app" Lotus 123 2.0 was in development for the IBM PC platform and was released in 1985 with x87 support.

The first mainstream 68K platform that officially supported an FPU was Apple's Mac II with a 68881, in March 1987.

The A2620's release was delayed by the late design change from Commodore's 2nd-generation custom MMU to the Motorola 68851 MMU. The A2620 was demonstrated at CeBIT in March 1988. Dave Haynie had to take over Bob Welland's incomplete A2620 project, which affected the A3000's R&D.
Unlike Commodore's system engineering group, Apple focused on the MacOS platform, not somebody else's Unix platform.
"Missing in action" is not a competitor.
Commodore was late on the 68881. Commodore was late on ECS. Commodore didn't officially support the 68020/68881 for the Amiga's mass market Amiga 500 model. If you wanted a business Amiga, you had to cross the Jack Tramiel-inspired A500 / A2000 product segmentation divide. That's anti-business for Commodore's mass-production Amiga 500 model.

Commodore UK crushed the pizza-box Checkmate A1500 (an A500 desktop conversion) with the Amiga 1500 model.

Commodore didn't leverage the mass-produced Amiga 500 into the business market.
Quote:
Intel favored abandoning the x87 FPU for scalar operations in the SIMD unit as it moved to deeper and deeper pipelines for higher clock speeds while AMD had more practical pipeline depths that were slowly growing deeper too. Deeper FPU pipelines make the register shortage worse although x86 superpipelining popularity faded after the Pentium 4 and more practical pipelining with lower instruction latencies returned as there are fewer bubbles/stalls, fewer registers needed and smaller code which is true for RISC cores too.
|
Thanks to ex-Motorola (Semiconductor Products) Hector Ruiz and Mike Butler (Bulldozer's chief architect), AMD's Bulldozer has a 20-stage pipeline, which is near Pentium 4 Northwood level. It was AMD's "me too" moment: Pentium 4-style high clock speed mixed with a server multi-threading bias at the expense of the gaming PC market.

Hector Ruiz and Mike Butler (Bulldozer's chief architect) were pushed out of AMD. Mike Butler moved to Samsung. F__kups have a price at AMD, i.e. being pushed out of the company.

Bulldozer's follow-on Piledriver FPU/SIMD units were recycled for Zen, since FMA4 instructions can run on them.

For the integer ALU pipelines, Zen 1 to Zen 4 have 14-stage pipelines and Zen 5 has 16-stage pipelines. FPU/SIMD pipelines are longer. Zen 5 is designed for 5.2+ GHz with profitable yields. Zen 4's worst silicon grade is the Ryzen 5 7500F at 5.0 GHz. The Ryzen 5 7500F has 2 defective CPU cores and a defective IGP; its silicon grade vs electrical noise/leakage is not good when compared to the 7950X's CCDs. Zen 5's worst silicon grade is the Ryzen 5 9600 at 5.2 GHz.
Quote:
FPU instructions are multi-cycle on the 68060 giving more time for the instruction buffer to fill. For example, FMUL has a 3 cycle execution latency and 12 bytes can be fetched in that time while the base FMUL instruction is 4 bytes in length. The 68060 instruction fetch is usually adequate for FPU heavy code without FPU pipelining. Mixed integer and FPU code is tighter but usually a few cycles are lost here and there allowing the decoupled fetch to catch back up.
|
You're not thinking about the pipeline in relation to the fetch stages.

The following university lecture shows the ARM Cortex-A53 and A57 full pipeline diagrams:
https://web.cs.wpi.edu/~cs4515/d15/Protected/LecturesNotes_D15/CS4515-TeamB-Presentation.pdf
ARM Cortex-A53's 3-cycle fetch is part of the pipeline.
AMD Jaguar's and Intel Atom Bonnell's 3 cycle fetch stages are part of the pipeline and they are designed for high clock speed with profitable yields. Xbox One and PS4 game consoles have strict specification cut-off and less tolerance with lesser yields. To increase yields, the PC market can tolerate different CPU "speed bin" grades.
Last edited by Hammer on 19-Mar-2025 at 03:11 AM. Last edited by Hammer on 19-Mar-2025 at 03:05 AM. Last edited by Hammer on 19-Mar-2025 at 01:21 AM.
_________________ Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68) Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68) Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB |
| Status: Offline |
| | Hammer
 |  |
Re: Market Size For New Games requiring 68040+ (060, 080) Posted on 19-Mar-2025 1:28:14
| | [ #171 ] |
| |
 |
Elite Member  |
Joined: 9-Mar-2003 Posts: 6289
From: Australia | | |
|
| @Karlos
Quote:
Karlos wrote: @Hammer
I didn't say it was. I said that fdiv is a pretty stupid metric to base system performance in general on as most code will avoid division if it's optimised for performance. Division, whether integer or floating point, is still the slowest of the regular arithmetic operations even today, let alone in the mid 90's. You use division when you need exact/precise results.
Quake uses it for perspective correction, but it's not strictly necessary: if the divisor range can be normalised to some fixed range that can then be expressed as an integer, that integer can then be used to select a reciprocal from a lookup that is then used as a multiplier instead. Abrash didn't do this because he was able to rearrange the code manually to mask the fdiv latency behind a whole block of other integer instructions. Plus the lookup might end up being too large, depending on the necessary range.
I'm sure a similar approach could be implemented for the 68K as the FPU can calculate the division while the IU is busy doing something else too. However, do any of the Amiga ports have an equivalently hand optimised version of the draw16 code or are they just using the vanilla C?
There were numerous other bits of Quake optimised by Abrash for x86 too that probably lack equivalent attention for 68K.
Comparing release Quake on the Pentium and 68060 is oranges and apples until you are dealing with ports having equivalent optimisation effort.
|
You wrote,
FDIV ?
The only time you should be using floating point division is if there's no alternative and an angry person has a gun to your head. As soon as more than one number needs to be divided by the same divisor, it should be converted into its reciprocal so that you can multiply it instead.
Again, did you assume my Quake FDIV argument was per pixel?
I'm already aware of Quake's conservative FDIV usage.
Last edited by Hammer on 19-Mar-2025 at 01:31 AM. Last edited by Hammer on 19-Mar-2025 at 01:29 AM.
_________________ Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68) Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68) Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB |
| Status: Offline |
| | Hammer
 |  |
Re: Market Size For New Games requiring 68040+ (060, 080) Posted on 19-Mar-2025 1:30:36
| | [ #172 ] |
| |
 |
Elite Member  |
Joined: 9-Mar-2003 Posts: 6289
From: Australia | | |
|
| @cdimauro
Quote:
You still don't get it and continue to grab and report things related to the topic, but without understanding the real topic.
Karlos has just written another comment trying again to clarify it and let you know, but I doubt that you'll ever learn it.
Learn something that, BTW, was quite obvious and heavily used at the time ("the time" -> when people were trying to squeeze the most from the limited available resources).
But this is something which is obvious only to people who have had their hands on this stuff. So, nothing that it seems you've done in your life (despite your having reported being a developer, but likely working in completely different areas). |
Wrong.
Karlos wrote:
FDIV ?
The only time you should be using floating point division is if there's no alternative and an angry person has a gun to your head. As soon as more than one number needs to be divided by the same divisor, it should be converted into its reciprocal so that you can multiply it instead.
Again, Karlos assumed my Quake FDIV argument was per pixel.
I'm already aware of Quake's conservative FDIV usage.
I already stated FDIV was executed out of order while other, quicker instructions are processed. LOL
Learn to read properly.
Try again.
Last edited by Hammer on 19-Mar-2025 at 01:34 AM.
_________________ Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68) Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68) Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB |
| Status: Offline |
| | Hammer
 |  |
Re: Market Size For New Games requiring 68040+ (060, 080) Posted on 19-Mar-2025 1:37:57
| | [ #173 ] |
| |
 |
Elite Member  |
Joined: 9-Mar-2003 Posts: 6289
From: Australia | | |
|
| @Karlos
Quote:
Karlos wrote: @Hammer
Why don't you read the actual source code: https://github.com/id-Software/Quake/blob/master/WinQuake/d_draw16.s
Perspective correction is performed every 16th pixel with linear interpolation to *avoid* having to call fdiv too often. Moreover, the one fdiv call is set amongst code that can execute around it, hiding the latency it adds. More importantly, look at how many times values are multiplied by values based on the reciprocals calculated by division (illustrating my earlier point about reciprocal multiplication).
Ideally this is a case where you'd want to perform the calculation for every pixel for the best visual quality, which would require a division per pixel.
This entire source exists and was optimised accordingly because that's just too expensive. And it's too expensive because it depends on floating point division.
|
Quake does not do one divide per pixel; it is done in steps of 8 pixels, i.e. d_scan.c in WinQuake:
https://github.com/id-Software/Quake/blob/master/WinQuake/d_scan.c
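The per-span scheme looks roughly like the following simplified C sketch (an illustration of the technique, not the actual id Software code): the 1/z quantities are stepped linearly, one divide is done per run of pixels, and the texture coordinates are interpolated linearly inside the run.

#define RUN 8   /* pixels per perspective divide (8 or 16 in the real code) */

void draw_span(unsigned char *dst, const unsigned char *tex, int tex_w,
               int count, float sdivz, float tdivz, float zi,
               float sdivz_step, float tdivz_step, float zi_step)
{
    float z = 1.0f / zi;               /* divide at the span start */
    float s = sdivz * z, t = tdivz * z;

    while (count > 0) {
        int run = (count > RUN) ? RUN : count;

        /* step the 1/z-space values to the end of this run,
           then do the one divide for the whole run */
        sdivz += sdivz_step * run;
        tdivz += tdivz_step * run;
        zi    += zi_step    * run;
        z = 1.0f / zi;

        float snext = sdivz * z, tnext = tdivz * z;
        float sstep = (snext - s) / run;   /* a shift in the fixed-point original */
        float tstep = (tnext - t) / run;

        for (int i = 0; i < run; i++) {    /* linear interpolation inside the run */
            *dst++ = tex[(int)t * tex_w + (int)s];
            s += sstep;
            t += tstep;
        }
        s = snext;
        t = tnext;
        count -= run;
    }
}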
Back at you.
_________________ Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68) Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68) Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB |
| Status: Offline |
| | Karlos
|  |
Re: Market Size For New Games requiring 68040+ (060, 080) Posted on 19-Mar-2025 1:45:16
| | [ #174 ] |
| |
 |
Elite Member  |
Joined: 24-Aug-2003 Posts: 4928
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition! | | |
|
| @Hammer
Where have I claimed it does division per pixel? The code in draw16.asm optimises it by hiding the cost of the division through better instruction arrangement and (tbc) halving the overall amount by doing it once per 16, rather than per 8.
The point is, does the 68060 version of quake have a 68K optimised equivalent of draw16.asm, or is it just using the C code above? _________________ Doing stupid things for fun... |
| Status: Offline |
| | Hammer
 |  |
Re: Market Size For New Games requiring 68040+ (060, 080) Posted on 19-Mar-2025 1:58:57
| | [ #175 ] |
| |
 |
Elite Member  |
Joined: 9-Mar-2003 Posts: 6289
From: Australia | | |
|
| @matthey
Quote:
Visual C++ 2.0 compiler is what Quake used from "Michael Abrash's Graphics Programming Black Book Special Edition".
|
https://youtu.be/DWVhIvZlytc?t=842
Pentium's Quake FDIV instruction interval is 19 clock cycles, i.e. it's FP32 (single precision).

From WinQuake's d_scan.c:
https://github.com/id-Software/Quake/blob/master/WinQuake/d_scan.c
The datatype declaration is "float" instead of "double"._________________ Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68) Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68) Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB |
| Status: Offline |
| | Hammer
 |  |
Re: Market Size For New Games requiring 68040+ (060, 080) Posted on 19-Mar-2025 2:28:02
| | [ #176 ] |
| |
 |
Elite Member  |
Joined: 9-Mar-2003 Posts: 6289
From: Australia | | |
|
| @Karlos
Quote:
Where have I claimed it does division per pixel? The code in draw16.asm optimises it by hiding the cost of the division through better instruction arrangement and (tbc) halving the overall amount by doing it once per 16, rather than per 8.
|
The 8 vs 16 pixel span debate for FDIV wouldn't matter, since the FDIV instruction is used sparingly.
Render shortcuts are part of visual smoke and mirrors i.e. interactive "Hollywood" entertainment.
------------ Modern sparse render techniques include checkerboard rendering.
Skipped pixels are pixel "reconstructed".
Sparse render methods are used in real time raytracing with raytrace denoise or ray reconstruction with deep learning. Rays are used sparingly.
Blender also uses raytrace denoise method to speed up raytracing.
NVIDIA's RTX game titles usually have higher ray saturation to gimp AMD GPU's RT cores. RX 9070 XT (RDNA 4) improves raytracing situation.
Quote:
The point is, does the 68060 version of quake have a 68K optimised equivalent of draw16.asm, or is it just using the C code above?
|
There's an 8-pixel span version from https://github.com/id-Software/Quake/blob/master/WinQuake/d_draw.s

A cross-platform benchmark is less useful when Quake on different platforms doesn't have the same render shortcuts.
_________________ Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68) Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68) Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB |
| Status: Offline |
| | cdimauro
|  |
Re: Market Size For New Games requiring 68040+ (060, 080) Posted on 19-Mar-2025 6:06:27
| | [ #177 ] |
| |
 |
Elite Member  |
Joined: 29-Oct-2012 Posts: 4193
From: Germany | | |
|
| @matthey
Quote:
matthey wrote: cdimauro Quote:
Absolutely! 68060 were clearly wrong, because the average instruction length for 68k is around 3 bytes. So, an instruction pair should have required at least 6B/cycle fetch (which is odd. 8B is more reasonable).
|
Wrong is not how I would word the 68060 4B/cycle instruction fetch design decision. It was a design choice to lower power at the expense of performance which may not have been the wrong choice for the embedded market at the time. The high end high performance embedded market at the time was smaller than today. Most RISC cores could not reach the integer performance efficiency (performance/MHz) of the 68060 and the 68060 still competed with desktop x86 and PPC CPUs in performance but at a reduced power and area for the embedded market.
68060@75MHz, 3.3V, 600nm, 2.5 million transistors, ~5.5W max
Pentium-P54C@75MHz, 3.3V, 600nm, 3.3 million transistors, 9.5W max
PPC601@75MHz, 3.3V, 600nm, 2.8 million transistors, ~7.5W max
The 68060 4B/cycle instruction fetch is not optimal for performance. It is nice that the in-order 68060 design with so much performance already obviously has more performance potential but boosting the instruction fetch alone would likely only provide a modest performance improvement. The similar but lower power ColdFire V5 design increased the instruction fetch to 8B/cycle so there is little doubt that it is a worthwhile performance upgrade. |
Then the 68060 cannot be directly compared to the other processors, which were aimed at the desktop market and not at the embedded one. In fact, at least the Pentium not only retained the full legacy from all its predecessors, but also added more features. All this needs transistors and draws more power. If the 68060 was designed specifically for the embedded market and Motorola decided to remove many features to make it more suitable for that, then I've no problem accepting it (processors should be adapted to the specific needs). But, as I've said, then any comparison is not possible anymore. Quote:
cdimauro Quote:
At least from an architectural PoV, no: FPU (x87) and SIMD (SSE+) registers are separated.
Only MMX was (re)using x87's registers.
Microarchitectures are a different story, but here what they do is totally transparent (and irrelevant) to the ISA.
For 68k, we're talking about the architecture level, where it IS relevant if the additional FPU registers are mapped to the data or FPU/SIMD registers.
Let's take a look at some solutions.
Extending data registers to 80 bits (96 bits effective in memory):
- 32-bit ISA -> wasting 8 x (80 - 32) bits
- 64-bit ISA -> wasting 8 x (80 - 64) bits

Extending FP registers to 16 and to hold SIMD data as well:
- 128-bit SIMD -> wasting 16 x (128 - 80) bits
- 256-bit SIMD -> wasting 16 x (256 - 80) bits
- 512-bit SIMD -> wasting 16 x (512 - 80) bits
So, the more you scale with the SIMD unit (which makes sense. At least up to 256-bit), the more space is wasted when one register is used only for holding a scalar.
That's exactly one of the major reason (but not the only one) why I've decided to completely separate the SIMD/Vector registers file from the scalar one.
|
Register files are expensive, especially for embedded use. This makes register sharing tempting. Sharing 64-bit integer and 64-bit FPU registers makes sense with a 64-bit FPU. The 80-bit 68k FPU has advantages and it is not worth breaking compatibility to reduce it to 64-bit. |
I wasn't thinking about dropping the extended precision and supporting only double, but rather about extending the 8 data registers to 80 bits to support it. Quote:
There are basically two options that I see for the 68k FPU.
1. 80-bit FPU registers - quad precision supported with a pair of extended precision registers
2. 128-bit FPU registers - default precision extended, quad precision requires 2nd FPU ALU pass
The latter choice is cleaner and a 128-bit size is reasonable for SIMD unit registers if sharing. Maybe keep the first 8 FPU registers for the FPU only and compatibility, the 2nd 8 FPU registers volatile and shared with the SIMD unit and the last 8 128-bit registers as SIMD only referenced backward (first SIMD register is the last register in the register file) for 24 total 128-bit registers?
Just throwing the idea out there rather than thinking all the details through. |
np. We're just brainstorming here.
Honestly, I don't like such mixed usage of registers (for the same reasons that I don't like what Gunnar did with the 68080).
In my vision it's better to have the FPU and SIMD/Vector registers completely separated: scalar on one side and SIMD/Vector on the other side. The primary reason is that in this way the SIMD/Vector unit can "grow" completely independently, so its registers can increase in size without wasting space, because only vector data is stored there and it is fully used (e.g. no partial register usage).

The second reason is that scalar and SIMD/Vector instructions aren't usually mixed up: either you execute the former or the latter, and they have different usages. To be more clear, an algorithm that requires more instructions to complete using scalar instructions might require much fewer SIMD/Vector instructions. Which means that the read/write ports of the scalar and SIMD/Vector register files can be different and can be finely tuned depending on the specific microarchitecture / market. If you unify (or use a mix like the above) the scalar and SIMD/Vector register files, then you end up heavily limiting either the scalar or the SIMD/Vector unit (e.g. more ports will be more difficult and expensive to implement for the SIMD/Vector unit, whereas fewer ports will penalize the performance of the scalar unit).
IMO extending the data registers to 80 bits and using them as the extra 8 FPU registers is "the lesser of two evils". It certainly looks odd at first sight, but this way you keep the scalar and SIMD/Vector units well separated and free to be implemented according to the specific needs, with a very small expense in terms of space (especially thinking about a 64-bit ISA).
Food for thinking... Quote:
cdimauro Quote:
RISC-V is an academic ISA which is weak and badly designed (such people were living in a parallel universe, not taking into account the needs of real code).

It took ages for them to define the vector extension (so, plenty of time to think about it), and yet they failed (see how they introduced the mask/predicate support).
|
But David Patterson is an award wining RISC expert with experience from RISC-I, RISC-II, RISC-III and RISC-IV. |
He's also one of the recent Turing Award recipients.

Nevertheless, I've strong arguments that he was wrong (and actually I've proved it, in the series of articles that I've published), and he's still wrong because he has an academic vision when designing computer architectures. Quote:
Now you are telling me a RISC-VI do over is needed? |
They succeeded one time (after four failures), but I don't think they can have another chance, mainly because RISC-V has already gained a lot of consensus, despite its bad design, which cost several companies a lot of investment, and they want to keep it for a very long time. Quote:
cdimauro Quote:
Do you refer to hardware implementation of a 68k FPU? Or to the proper support at the compiler level? Or both?
|
The 68881 through 68060 already have the necessary hardware for full extended precision support. The only piece missing is an ABI standard that specifies saving FPU function args in registers or on the stack in extended precision. As I recall, VBCC already uses extended precision by default, uses and spills extended precision variables, context switches save the extended precision registers and FPU return values are in FP0. I believe the only missing part is that FPU function args are placed on the stack as either double or single precision according to the arg original datatype rather than the extended precision intermediate result. Popular 68k ABIs are a mess.
1. SysV/Unix ABI - ancient, all args passed on the stack using natural alignment
   o used by VBCC & LLVM
2. GCC/Linux ABI - all args passed on the stack, relaxed stack alignment is poor for performance
   o used by official GCC
3. SAS/C ABI - args passed in scratch registers with remainder on stack, poor documentation
   o used by SAS/C, VBCC & I believe unofficial Geek Gadgets GCC up to GCC 3.4 when reg args selected

https://m680x0.github.io/doc/abi.html
https://m680x0.github.io/ref/sysv-abi-download
The SAS/C ABI with a small change to pass non-scratch register FPU function args on the stack in extended precision would suffice, but it would introduce incompatibility with an existing standard. I offered to document a new ABI (a modified SAS/C ABI) with the changes, but there is not much motivation to adopt it for a dead 68k platform. I was developing for VBCC and there was resistance to change for a dead 68k, and that is before even talking about getting strangers to agree to and adopt a new ABI. |
OK, now it's more clear and I agree. A new ABI was definitely needed to better use the 68k (not only for the FPU). Quote:
Part of the problem is that the 68k FPU ISA with extended precision gets lumped in with the x86 FPU ISA with extended precision. The x87 FPU ISA has problems, and there have been bugs in the CPU hardware, making it difficult to provide better extended precision support and making it a nightmare for compiler developers, as the GCC developers' reluctance to touch the existing support shows. Even William Kahan, the father of IEEE floating point, who helped design the 8087 FPU, knows the 68k FPU ISA is better.
https://history.siam.org/%5C/pdfs2/Kahan_final.pdf Quote:
[... Kahan interview excerpt, quoted in full earlier in the thread ...]
|
The 808x ISA is horrible and the 68k ISA was "beautiful" and "orthogonal", everything they wanted the 808x to be. Then Apple's Scully killed the 68k to switch to PPC. The Kahan interview goes on to talk about the x86 FPU overflow/underflow stack register bug, the Pentium FDIV and FP-to-int conversion bugs, and mentions reduced precision from some transcendental functions (FSIN inaccuracy). The 6888x FPU just had a minor loss of precision from using faster CORDIC functions. The 8087 FPU was historically important in bringing math and science into FPU design, but it was the 68k FPU that did it right, and then RISC cheapened it, ignoring much of his work. |
Understood, and I mostly agree.
One point where I don't is about the x86/x87 bugs. Bugs are... bugs. Something that could happen and that can also be fixed: the infamous Pentium FDIV bug is here to recall what can happen... and what can be done with fixing issues.
Another point where I don't agree with Kahan is about the x87 design. In fact, it isn't true that it was limited by the opcode space. At the time a lot of opcodes were still unused (besides the 8 "escape" ones used for the x87): several bytes are free in the opcode table and could have been used to implement a register-based FPU instead of a stack-based one, even supporting three operands. It would have required longer opcodes, but the x87 needs extra FXCH instructions anyway to "solve" this problem... Quote:
cdimauro Quote:
He already retired, but he's still active. And he has some patents for fast implementation of FPU trig etc. instructions.
However, he's fully involved on his ISA.
|
Mitch's ISA uses a 16-bit VLE like the 68k. Variable sized immediates and displacements are encoded in the code like the 68k. The hardware and decoding are not so much different between the 68k and his ISA. He may be interested in licensing some ancient 68k, 88110 and ColdFire V5 CPU cores. He was an architect on the 88k and helped with 68k development, which would likely increase interest. |
No, My 66000 is a radical new architecture, quite different from 68k, and uses 32-bit VLE. I think that it's different from 88100 as well, but I should refresh my studies on this ISA to make a concrete comparison.
Due to the 32-bit base opcode/alignment I think that the code density shouldn't be so good, but Mitch says that it's better than RISC-V's. Let's see. I've some doubt about that, but benchmarks are needed. What's true is that it requires many fewer executed instructions compared to RISC-V (the 64-bit "G" variant). Well, not surprising, since RISC-V is such a weak ISA... Quote:
cdimauro Quote:
The situation is a bit more complicated, because on my new ISA I've the same instruction, but packaged in different (opcode) formats which enable more features/extensions thanks to the longer encodings. On top of that, I've some overlapping between GP and scalar (integer or FP) instructions. They aren't the same thing: there are instructions on the first group and not on the second, and viceversa. And there are instructions on both. For example, ADD is both groups, whereas CALL is only on the first one, and MAX only on the second one.
Let's focus only to the scalar FP version of ADD. I haven't yet defined the final syntax, because I'm still thinking about ADD F0,F1,F2 and FADD R0,R1,R1 (on both the first register is the destination).
Taking as a reference the above list and assuming that the maximum precision is FP128, here follow the equivalent versions with my new ISA and a bit more:
add f0,f1,f2       ; extended add, no precision rounding (short encoding)
add{d} f0,f1,r2.d  ; double add, round to double (normal encoding)
add{s} f0,f1,r2.d  ; double add, round to single (normal encoding)
add{d} f0,f1,r2.s  ; single add, round to double (normal encoding)
add{s} f0,f1,r2.s  ; single add, round to single (normal encoding)

The short encoding is the common one for all scalar operations, and always operate to the full register size (e.g. FP128 for scalar FP and 64-bit for scalar int). It's the most compact because it works only with registers (and since there are three of them, the encoding space is very limited).
The normal encoding is the one which allows to define an EA for the second source and some additional bits for the desired precision & exception suppression for FP data, and sign or zero extension for int data (but I've two spare bits which aren't yet used, because there's no precision selection concept here).
I've other encodings / opcode formats for defining an immediate as second source (in a more compact form compared to different types of immediates which can be defined with an EA), mem-mem and mem-mem-mem versions (both which allow to define the precision of each operand), and a general mechanism which is applicable to all the above cases (except the short encoding) which further defines other extensions/"enrichments" of the instruction (included its conditional execution).
SIMD/vector instructions have exact equivalent encodings for the same instructions (but short encodings are much more limited: not much encoding space!). However, there's no precision rounding which can be selected for FP data (operations are always executed with the given data type size). A mask can be selected, as well as broadcasting of the last source EA, zero/merge of lanes, and the size of the "vector".
|
Looks good other than the mem-mem-mem encodings. Even a VAX could likely be microcoded with decent performance, but mem-mem-mem may increase the microcode size and/or complexity. |
I don't think that microcode is needed. My mem-mem-mem design is much simpler than the VAX one, and requires only some bits (the LSBs) to determine the position and length of each of the three operands. In general, decoding the first bits (within the first word / 16 bits; 15 bits maximum, to be more precise, in the worst case scenario/encoding) is enough to figure out the instruction length and the position and length of all its operands. The VAX, on the other hand, requires parsing the byte stream one byte at a time, advancing byte by byte to do the same, which is the reason why it wasn't possible to pipeline it (at the time), and which led to its failure.

Having three memory operands can have its own problems, of course, but only when they are effectively pointing to memory locations. Since I can reference registers, immediates, and constants (in ROM; no FMOVECR is required for them: just use one of the few that are defined in the EA), many times there's only a single memory reference, so only one AGU is needed. Another practical case is when there are only two memory operands and one of them is the destination: in this case processing the destination's EA can easily be delayed until an AGU is free (which is very similar to the 68k case with the MOVE instruction). The worst case, of course, is when there are two source memory operands (so, mem-mem with the destination as first source operand, and mem-mem-mem), because they need to be evaluated ASAP to get their values; but the 68k has the same problem with the ADD mem-mem instructions (albeit the available EAs are few and fixed).

I don't know if there are intrinsic issues in having to deal with more than one EA per instruction, but if so I'm curious to understand why. Quote:
cdimauro Quote:
Do you mean using shorter immediates for instructions which work with a given precision? For example:
fadd.d #fp16imm,fp0 ; double add, loading an FP16 immediate as second source ?
|
Yes. That is the kind of FP peephole optimization VASM currently performs for the 68k FPU although half precision FP is not supported, another idea Gunnar did not like. FP immediates can be very long but often can be exactly represented with a lower precision FP datatype. |
I fully agree. That's why I support it on my architecture in different ways (as short FP9/FP5/FP16+/FP32+ in the EA, and with a specific set of instructions for FP16/32/64/128).
That's something which the 68k strongly needs, because it's not acceptable to require 12 bytes for an extended precision value which could be represented with 16 bits. A slot in the few EA modes can be used for this. I've done something similar since I designed NEx64T, but I've found a particular way which I can't disclose now (I think that it might be worth a patent, if there's no prior art). Quote:
cdimauro Quote:
As you can see above, I've a lot of "beef". But this is paid for with longer encodings (compared to the 68k and other architectures): I could have focused only on the common case (full precision operations).
68k is in a better shape regarding the code density for FP instructions, but it has less flexibility (e.g. it's missing 3 operands, which is the major point to be addressed, IMO).
|
As I recall, few register to register FMOVE instructions were used in the VBCC support code I worked on. There were more to begin with but I managed to get rid of most of them. The 68040 had the FMOVE parallel operation optimization but I would have rather had FINT/FINTRZ in hardware. I can see a pipelined FPU using more FMOVE instructions and 3 op becoming more important but I still expect mem-reg operations to be more common than 3 op FPU instructions. I could be wrong and it would be interesting to see some comparisons of optimized code. |
A third operand could have been easily introduced if Motorola had properly used the bits in the F-line format. Anyway, I think that 3-op instructions are common at least in scientific FP code. I've seen some asm dumps from Mitch of one Fortran routine that used several immediates but also many registers. Internally there's the GNU numeric library which is being used for benchmarks (such a routine is one of them, but it was ported to C). Quote:
cdimauro Quote:
Indeed. That would have been very helpful, but we know how it "works"...
Thanks for the clarification, anyway. I've slightly changed the ISA according to the above.
I already had a CONV instruction since very long time, for a general data-type to data-type instruction, but it was limited only to truncation or nearest when converting from FP to int data. Now it allows to define any of the four rounding modes.
Fortunately, CONV can use this mechanism for the destination short encoding (but using the full precision for source operand).
BTW, I've no PMOVS*/PMOVZ* neither some PACK/UNPACK SIMD/vector instruction, because CONV is used to cover all those case.
|
There are other rounding modes possible but those 4 are the core IEEE rounding modes.

IEEE rounding modes
Round to Nearest, Ties to Even - round()
Round toward Zero - trunc()
Round toward +Infinity - ceil()
Round toward -Infinity - floor()

other rounding modes
Round to Nearest, Ties away from 0
Round to Nearest, Ties toward 0
Round away from Zero

The IEEE FP round to nearest, ties to even is different from the rounding taught in school but is preferred for statistics. Other rounding modes can be useful and are more often taught in schools. The double-double or double-quad trick uses a rounding mode not required by IEEE. A 3-bit rounding mode field could be useful in the FPCR for future support, although a 2-bit field may be fine in instruction encodings for now.
|
Mitch reported a slightly different list about such extra rounding modes:
Table 4: Rounding Modes
Mode                    | Status        | Encoding
Round Nearest Even      | IEEE 754      | 000
Round Nearest Odd       | Experimental  | 001
Round Nearest Magnitude | IEEE 754-2008 | 010
Round Away From Zero    | Experimental  | 100
Round Towards Zero      | IEEE 754      | 101
Round Towards +Infinity | IEEE 754      | 110
Round Towards -Infinity | IEEE 754      | 111

For the new ones, Round Away From Zero (Experimental) matches your Round away from Zero, but the other two don't. I've extended my new architecture with this table:
Round to Nearest, Ties to Even - round()
Round Toward Zero - trunc()
Round Down, Toward -Infinity - floor()
Round Up, Toward +Infinity - ceil()
Round to Nearest, Ties Away from Zero
Round to Nearest, Ties toward Zero
Round to Nearest, Ties to Max Magnitude
Round Away from Zero
I don't know if it's right, but I'm not an expert on this field.
Anyway, the encoding meanings might change. The important thing is that I've extended the field in the status registers to support these 8 rounding modes, and I've also extended the previously mentioned CONV instruction (which now takes a whopping 192 encodings from the list of available binary instructions; fortunately I have many of them, but CONV already took a good part of them).
P.S. As usual, no time to read again: I've to start working.
Last edited by cdimauro on 19-Mar-2025 at 06:08 AM.
|
| Status: Offline |
| | matthey
|  |
Re: Market Size For New Games requiring 68040+ (060, 080) Posted on 19-Mar-2025 17:45:12
| | [ #178 ] |
| |
 |
Elite Member  |
Joined: 14-Mar-2007 Posts: 2547
From: Kansas | | |
|
| Hammer Quote:
FXCH is not a trick, it's a programmer-controlled register renaming function.
|
Would you prefer that I call it a kludge?
Hammer Quote:
SSE2 with FP64 has reduced the need for x87 FPR.
|
Replacing the crap x87 FPU with a SIMD FPU tends to do that. Intel was proud of the x87 FPU before they moved to deeper pipelines for higher clock speeds and decided to replace it. The x87 baggage remains so Quake still runs though.
Hammer Quote:
You're not thinking about the pipeline in relation to the fetch stages.
The following university lecture shows the full ARM Cortex-A53 and A57 pipeline diagrams: https://web.cs.wpi.edu/~cs4515/d15/Protected/LecturesNotes_D15/CS4515-TeamB-Presentation.pdf The ARM Cortex-A53's 3-cycle fetch is part of the pipeline.
AMD Jaguar's and Intel Atom Bonnell's 3-cycle fetch stages are part of the pipeline, and they are designed for high clock speed with profitable yields. The Xbox One and PS4 game consoles have strict specification cut-offs and less tolerance for lower yields. To increase yields, the PC market can tolerate different CPU "speed bin" grades.
|
The Cortex-A53 diagram is incomplete. There is a short pipeline between the L2 cache and L1 cache that pre-decodes data before placing the predecoded data in the L1 cache.
https://chipsandcheese.com/p/arms-cortex-a53-tiny-but-important Quote:
To further reduce power and frontend latency, the L1i stores instructions in an intermediate format, moving some decode work to predecode stages before the L1i is filled. That lets ARM simplify the main decode stages, at the cost of using more storage per instruction.
|
The article linked above has a diagram that shows the L1i cache with a description of "Predecode Cache" but still does not show the predecode pipeline or unit. I cannot find a diagram of the Cortex-A53 that shows it, but the PPC Cell CPU design that we were recently discussing is similar.

The pipeline after the predecoded L1i cache is shortened, reducing latency, but fewer instructions fit in the L1i because the predecode data uses space, the latency to the L2 is increased, and code in SRAM cannot be used directly (e.g. MCU SRAM). Not only can the instruction fetch for the main pipeline be a bottleneck, but the fetch from the L2 for predecoding into the L1i can also be a bottleneck for instructions coming from the L2 and above. The whole L1i cache becomes a predecoded instruction buffer, versus the more limited instruction buffer between the instruction fetch pipeline (IFP) and execution pipelines (OEPs) in a 68060/ColdFire-like design.
classic design (fetch/cycle needs to be large enough for execution/cycle needs)
P5 Pentium, Cyrix 6x86, 68040, AC68080 (?)

instruction buffer decoupled IFP and OEP design
68060, ColdFire, SiFive 7 series

predecoded L1i design
Cortex-A53, Cell PPC PPE, PPC G3

L1i boundary marker cache design
Bonnell Atom

trace cache design (predecoded traces of instructions rather than individual instructions)
Pentium 4, early AC68k (experimented with it but likely abandoned)
Bonnell moves the L1i predecode data into a separate boundary marker cache, and two pipeline stages are skipped in the main pipeline when the predecode data is available, rather than having a separate predecode pipeline between the L2 and L1i. The predecode data is written to the boundary marker cache if it is not available when encountered in the main pipeline. It is an interesting design I have not seen elsewhere.
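To make the boundary marker idea a bit more concrete, here is a purely conceptual C sketch (the sizes, fields and names are invented and do not describe Bonnell's actual hardware): one bit per byte of a fetch block records where instructions start, a hit lets the decoder skip its length-determination work, and a miss is filled in after the main pipeline has decoded the block once.

#include <stdint.h>

#define BLOCK_BYTES  16            /* bytes covered by one marker entry */
#define MARKER_SETS  256           /* invented capacity, direct mapped  */

typedef struct {
    uint32_t tag;                  /* address of the fetch block        */
    uint16_t start_mask;           /* bit i set => instruction starts at byte i */
    uint8_t  valid;
} marker_entry;

static marker_entry marker_cache[MARKER_SETS];

/* Look up the boundary marks for a fetch block; returns 1 on a hit, in
 * which case the length-decode work can be skipped.                    */
static int marker_lookup(uint32_t block_addr, uint16_t *mask_out)
{
    marker_entry *e = &marker_cache[(block_addr / BLOCK_BYTES) % MARKER_SETS];
    if (e->valid && e->tag == block_addr) {
        *mask_out = e->start_mask;
        return 1;
    }
    return 0;                      /* decode the slow way, then call marker_fill */
}

/* After the main pipeline has decoded the block once, remember the marks. */
static void marker_fill(uint32_t block_addr, uint16_t start_mask)
{
    marker_entry *e = &marker_cache[(block_addr / BLOCK_BYTES) % MARKER_SETS];
    e->tag = block_addr;
    e->start_mask = start_mask;
    e->valid = 1;
}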
 https://en.wikichip.org/wiki/Bonnell#Front_End
It is true that higher clock speed designs require not only more pipeline stages but may also require more fetch stages. One pipeline stage may not be long enough to fetch from the L1i or L1d. The time it takes to access a cache depends on the cache size though; smaller caches can be accessed faster. Bonnell requires 3 cycles for a 32kiB L1i cache access, and the motivation behind a separate 4kiB boundary marker cache, rather than storing the predecode data in the L1i cache, may have been to keep the access time to a minimum.

Good code density also makes it possible to keep more instructions in a smaller L1i without increasing the access time (fewer fetch stages but maybe more decode stages). This effect should not be underestimated, as an 8kiB 68k L1i has about the performance of a 32kiB PPC L1i according to RISC-V code density research. The ColdFire V5 chose to increase the L1i and L1d to 32kiB each, requiring 2 stages each for access with a 9-stage pipeline (they were able to remove one of the 68060 IFP stages). They likely could have reduced the L1i to 16kiB and accessed it in one stage, which would have had the performance of a 64kiB PPC L1i, but the RISC-V research was not available yet. Silicon improvements allow more to be done in one stage, so a 68060+ with an 8-stage pipeline and 32kiB L1i+L1d may only require single stage cache accesses on modern silicon.
Hammer Quote:
The d_scan.c source code would most likely generate a double precision FDIV with Visual C++ and the default/global precision set to double precision. The FP 1.0 data is representable in single precision, which saves memory, but when it is loaded into the FPU it is converted to higher precision and calculations are done at higher precision. It is possible to override the default precision and use single precision, but changing it has significant overhead, and Michael Abrash warned about the repercussions of setting the default precision to single all the time. The following WinQuake assembly source code has functions to change the default precision.
https://github.com/id-Software/Quake/blob/master/WinQuake/sys_wina.s
I do not see C(Sys_LowFPPrecision) or C(Sys_HighFPPrecision) function calls. Maybe I missed them, or maybe lowering the precision to single was just experimental. Find these function calls somewhere in the source code to show that single precision was used. The video you linked gives the single precision Pentium FDIV latency, but that does not mean it was used. It also says FDIV "can't be pipelined", while the AC68080 is claimed to have a pipelined FDIV.
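As a side note, the effect of such precision-switching helpers can be illustrated in C on MSVC/x86 with the _controlfp call from <float.h> (a hedged sketch only; the function names below are invented analogues, not the WinQuake assembly itself):

#include <float.h>

static unsigned int saved_pc;            /* previously active precision-control bits */

void low_fp_precision(void)              /* invented analogue of Sys_LowFPPrecision */
{
    saved_pc = _controlfp(0, 0) & _MCW_PC;  /* read the current x87 control word */
    _controlfp(_PC_24, _MCW_PC);            /* 24-bit (single precision) results */
}

void high_fp_precision(void)             /* invented analogue of Sys_HighFPPrecision */
{
    _controlfp(saved_pc, _MCW_PC);          /* restore the saved precision field */
}

On the Pentium-class x87, lowering the precision field mainly speeds up FDIV and FSQRT, which is exactly why whether Quake actually switched it matters for the latency argument.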
Last edited by matthey on 19-Mar-2025 at 05:58 PM. Last edited by matthey on 19-Mar-2025 at 05:51 PM.
|
| Status: Offline |
| | matthey
|  |
Re: Market Size For New Games requiring 68040+ (060, 080) Posted on 19-Mar-2025 19:34:36
| | [ #179 ] |
| |
 |
Elite Member  |
Joined: 14-Mar-2007 Posts: 2547
From: Kansas | | |
|
| Karlos Quote:
I read somewhere that 68080 has particularly fast floating point divide. Are the sources for this version available (I mean they *should* be).
|
http://www.apollo-core.com/documentation/AC68080PRM.pdf Quote:
INSTRUCTION TIMING
Integer CPU & AMMX instructions normally are 1 cycle.
MUL = 2 or 3, DIV = 18 or less
MOVE16 = 4
MOVEM = int(1 + n/2) (all=15, long=8)
FMOVEM = n (all=8, when compact apollo-format)
JMP, JSR = 1, except with a calculated ea, then it's 4 { like JSR -6(a6) }

FPU instructions:
FNEG, FABS, FMOVE are 1 cycle
FADD, FCMP, FSUB, FMUL = 6
FDIV = 10, FSQRT = 22
FMUL, FADD, FSUB, FDIV, SQRT are all fully pipelined.
Other calculations (FSINCOS, FTWOTOX etc) use a kind of micro-code that takes about 100-200+ cycles.

Integer ea calculations cost nothing, except exotic ones with indirect memory, which cost about 4 cycles. ([d16,An],Dn.s*n,d16)

Float conversions cost nothing: Dn.s Dn.d #.s #.d #.x (single double extended)
Integer conversion adds 1 cycle. (byte word long)
|
The FPU has very odd timing. FADD, FSUB, FCMP, FMUL latency is twice that of the 68060 while FDIV latency is about 1/4 of the 68060 and FSQRT latency is about 1/3 of the 68060.
instruction | AC68080 | 68060 | 88110
FABS        |       1 |     1 |     1
FADD        |       6 |     3 |     3
FMUL        |       6 |     3 |     3
FDIV        |      10 |    37 |    26
FSQRT       |      22 |    68 |  trap
The AC68080 FPU is for double precision while the 68060 FPU and 88110 FPU are for extended precision. The 68060 could have borrowed the earlier and lower latency 88110 FDIV but the developers were focused on the most common instructions and FDIV must not have been common enough.
The AC68080PRM shows a 7-stage pipeline, which is shallower than I expected and perhaps more like the Cyrix 6x86 except with split L1 caches. The following additional specs are also given.
http://www.apollo-core.com/documentation/AC68080PRM.pdf Quote:
SPECS
• 64bit memory Data-bus.
• 16kb ICache, 1 cycle = 16 bytes to the CPU every cycle.
• 128kb DCache, 3-ported. 1 cycle = 8 bytes read AND 8 bytes written to/from the CPU AND talk to mem.
• mem burst = 32 bytes (4x8), latency is around 12 CPU cycles.
The CPU itself detects continuous memory accesses and will automatically prefetch the memory.
|
Only a 16kiB L1i, but with single cycle latency, and it has the performance of a 64kiB PPC L1i, which is the advantage of code density. This is more info than I was ever given as an Apollo team member.
Karlos Quote:
Where have I claimed it does division per pixel? The code in draw16.asm optimises it by hiding the cost of the division through better instruction arrangement and (tbc) halving the overall amount by doing it once per 16, rather than per 8.
The point is, does the 68060 version of quake have a 68K optimised equivalent of draw16.asm, or is it just using the C code above?
|
Frank Wille's version of Quake has quite a bit of assembly code and the source is available.
http://sun.hasenbraten.de/quake1/
Also, I made an assembly FDIV optimization by rearranging integer instructions below it for NovaCoder's AmiQuake 2.
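The general shape of that optimization can be sketched in C (a hedged illustration only; this is neither Frank Wille's nor NovaCoder's code, and all names are invented): do the perspective divide once per 16-pixel span instead of per pixel, and start the divide for the next span before the current span's pixel loop, so the long FDIV latency is hidden behind independent per-pixel work.

typedef unsigned char u8;

/* Illustrative nearest-neighbour texture fetch (bounds assumed valid). */
static u8 tex_sample(const u8 *tex, int tex_w, float s, float t)
{
    return tex[(int)t * tex_w + (int)s];
}

/* Draw one scanline in 16-pixel spans.  s/z, t/z and 1/z vary linearly
 * across the scanline; a divide is needed only at span ends, and the
 * divide for the NEXT span is issued before the current pixel loop so
 * its latency overlaps the independent integer work inside the loop.   */
void draw_span16(u8 *dst, int count,
                 float sz, float tz, float iz,     /* s/z, t/z, 1/z at x0 */
                 float dsz, float dtz, float diz,  /* per-pixel gradients */
                 const u8 *tex, int tex_w)
{
    float z     = 1.0f / iz;                       /* scanline start      */
    float s     = sz * z, t = tz * z;
    float z_end = 1.0f / (iz + diz * 16.0f);       /* end of first span   */

    while (count > 0) {
        int n = count > 16 ? 16 : count;
        if (n < 16)                                /* final partial span  */
            z_end = 1.0f / (iz + diz * (float)n);

        float s_end = (sz + dsz * (float)n) * z_end;
        float t_end = (tz + dtz * (float)n) * z_end;

        /* Divide for the end of the NEXT span, started early on purpose. */
        float z_next_end = 1.0f / (iz + diz * (float)(n + 16));

        /* Step linearly between the two perspective-correct endpoints.
         * (Real code can use a constant reciprocal, since n is 16 on all
         * full spans.)                                                   */
        float inv_n = (n == 16) ? (1.0f / 16.0f) : 1.0f / (float)n;
        float ds = (s_end - s) * inv_n;
        float dt = (t_end - t) * inv_n;

        for (int i = 0; i < n; i++) {
            *dst++ = tex_sample(tex, tex_w, s, t);
            s += ds;
            t += dt;
        }

        sz += dsz * (float)n;
        tz += dtz * (float)n;
        iz += diz * (float)n;
        s = s_end;
        t = t_end;
        z_end = z_next_end;
        count -= n;
    }
}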
Last edited by matthey on 19-Mar-2025 at 07:36 PM.
|
| Status: Offline |
| | matthey
|  |
Re: Market Size For New Games requiring 68040+ (060, 080) Posted on 20-Mar-2025 3:10:23
| | [ #180 ] |
| |
 |
Elite Member  |
Joined: 14-Mar-2007 Posts: 2547
From: Kansas | | |
|
| cdimauro Quote:
Then the 68060 cannot be directly compared to the other processors, which were aimed at the desktop market and not at the embedded one. In fact, at least the Pentium has not only retained the full legacy from all its predecessors, but added more features. All this needs transistors and draws more power. If the 68060 was designed specifically for the embedded market and Motorola decided to remove many features to make it more suitable for that, then I've no problem accepting it (processors should be adapted to the specific needs). But, as I've said, then any comparison is not possible anymore.
|
Apples and oranges can still be compared even though a pear is a better comparison for an apple and a lemon is a better comparison for an orange.
The 68881 was supposedly 155,000 transistors and the 68882 176,000 transistors. Restoring the 64-bit result multiply and divide along with the full FPU is still likely to stay under 3 million transistors, which is less than the P5 Pentium. Maybe a pipelined 68060 FPU would come close, but the 68060 has a deeper integer pipeline, twice as many GP integer registers, more cache associativity, a 2nd barrel shifter and an instruction buffer. The 68060 has a clear PPA advantage, and that is before considering the better orthogonality of the 68k CPU and FPU ISAs, which allows higher instruction multi-issue rates.
cdimauro Quote:
np. We're just brainstorming here.
Honestly, I don't like such mixed usage of registers (for the same reasons that I don't like what Gunnar did with the 68080).
In my vision it's better to have the FPU and SIMD/Vector registers completely separated: scalar from one side and SIMD/Vector on the other side. The primary reason is that in this way the SIMD/Vector unit can freely "grow" completely independent. So, its registers can increase in size without wasting space, because only vector data is stored there, and which is fully used (e.g.: no partial register usage).
|
CPU and FPU registers rarely increase in number or grow wider. The x86 ISA doubled the integer registers going to x86-64, but with a new mode and practically a new ISA, in a less than efficient way for decoding and code density. The AC68080 added CPU and FPU registers without a new mode, and it is a non-orthogonal mess with less than the claimed 100% compatibility. That really only leaves SIMD registers, which x86-64 has increased along with baggage.

That leads to the problem of SIMD units not scaling well; they already have enough issues at 512b wide, including being resource hogs, that this size has not become standard. AArch64 just made 128b wide their standard. Using pairs of SIMD instructions for 256b SIMD is not so inefficient either. Then there are vector units, which are more scalable but have higher latency and are more difficult to program. It is necessary to find a low resource balance that fits the 68k's code density and small footprint, but the AC68080 64-bit SIMD is too narrow to be worthwhile for FP and is barely worthwhile for integer. A MAC unit like ColdFire uses is even worse, as a pipelined FPU with FMA would be a better choice.

I want a standard full featured extended precision 68k FPU, but I am not so sure about a standard SIMD unit, even though I do understand the advantages and disadvantages of having it standard. I am open minded on options and too inexperienced with SIMD and hardware to make some ISA decisions by myself. My 68k ISA goal was to document innovative ideas and proposals that could be decided on by a group of experienced developers. Of course Gunnar decided he was the group of experienced developers, aka the Apollo team. I wanted an enhanced 68k ISA that others could agree on, including emulation, but I found out most emulation developers want nothing to do with enhancements because emulation is retro and EOL. Even the AmigaOS is devolving backwards from the 68020 for AmigaOS 3.1 to the 68000. Nobody cares any more about 68k ISA enhancements than they do about a new ABI that properly supports extended precision FP. A faster host CPU for emulating the 68k is good enough.
cdimauro Quote:
The second reason is that scalar and SIMD/vector instructions aren't usually mixed up: either you execute the former or the latter, and they have different usages. To be clearer, an algorithm that requires many instructions to complete with scalar instructions might require far fewer SIMD/Vector instructions. This means that the read/write ports of the scalar and SIMD/Vector register files can be different and can be finely tuned depending on the specific microarchitecture / market. If you unify (or use a mix like the above) the scalar and SIMD/Vector register files, then you end up heavily limiting either the scalar or the SIMD/Vector unit (e.g. more ports will be more difficult and expensive to implement for the SIMD/Vector unit, whereas fewer ports will penalize the performance of the scalar unit).
|
Most register writes would be of the whole FPU/SIMD register, so I would think the ports would be similar. I do get that the SIMD registers may need to be increased in number or widened for higher end uses, though, which would be more difficult, and the ports would become less efficient.
cdimauro Quote:
IMO extending the data registers to 80 bits and using them as the extra 8 FPU registers is "the lesser of two evils". It certainly looks odd at first sight, but this way you keep the scalar and SIMD/Vector units well separated and free to be implemented according to the specific needs. And with very little expense in terms of space (especially thinking about a 64 bit ISA).
Food for thought...
|
My thinking is that separate register files would be less wasteful for the 68k ISA. The advantage of unified int/FPU registers would be more efficient FP-to-int and int-to-FP moves, at the cost of wasted wider integer registers. With separate files, the FPU/SIMD registers would share scratch registers better, and heavy FPU and SIMD use at the same time would be uncommon, with the disadvantage that it would be more difficult to increase the SIMD register count or width in the future.
cdimauro Quote:
Understood, and I mostly agree.
One point where I don't is about the x86/x87 bugs. Bugs are... bugs. Something that could happen and that can also be fixed: the infamous Pentium FDIV bug is here to recall what can happen... and what can be done with fixing issues.
Another point where I don't agree with Kahan is about the x87 design. In fact, it isn't true that it was limited by the opcode space. At the time a lot of opcodes were still not used (besides the 8 "escape" ones used for the x87): several bytes are free in the opcode table, and they could have been used to implement a register-based FPU instead of a stack-based one, even supporting three operands. It would have required longer opcodes, but the x87 needs extra FXCH instructions anyway to "solve" this problem...
|
The Intel x87 opcode space requirement may have been like the Intel requirement for Stephen Morse to build an upgraded 8080 with full compatibility. Morse was smart enough to realize that it would not have been much of an upgrade and after "protesting" received the go ahead to create a decently upgraded 8086 with partial 8080 compatibility instead. It probably was not Kahan's place to protest but maybe John Palmer should have. Palmer was trying to keep the x87 project from being canceled though as Intel marketing did not think enough math coprocessors could be sold to pay for development. If you watched the Kahan video I posted, Intel unexpectedly sold roughly as many coprocessors as they did CPUs (1:1). I doubt that was the case for the 6888x as at least Commodore wanted to sell cheapened business computers without a FPU, MMU or even desktop class CPU. Commodore gave us cheapened embedded CPUs, Motorola gave us cheapened embedded CPUs and A-Eon gave us cheapened embedded CPUs. At least Commodore and Motorola gave us cheap and 68k compatible cheapened embedded CPUs.
cdimauro Quote:
No, My 66000 is a radical new architecture, quite different from 68k, and uses 32-bit VLE. I think that it's different from 88100 as well, but I should refresh my studies on this ISA to make a concrete comparison.
Due to the 32-bit base opcode/alignment I think that the code density shouldn't be so good, but Mitch says that it's better than RISC-V. Let's see. I've some doubt about that, but benchmarks are needed. What's true is that it requires far fewer executed instructions compared to RISC-V (the 64-bit "G" variant). Well, not surprising, since RISC-V is such a weak ISA...
|
I recall Mitch talking about the encoding, which was the first RISC ISA I had heard of to use variable sized immediates and displacements like the 68k, one of the less copied reasons for CISC performance. I recall he gave some examples of encodings, but I do not recall it being a 32-bit variable length encoding, which indeed would be bad for code density. I do recall it having many GP registers though, which would increase instruction sizes.

The 88110 design did use shared int/FPU registers, although it was still only 32-bit, which was the biggest mistake of the 88k according to Mitch. Oddly, a new extended precision FPU register file was added for the 88110 too. The 88k ISA is still fairly minimal like most RISC but far more friendly than PPC. Some oddities include packed/pixel Pop SIMD instructions using 32-bit register pairs and an XMEM instruction like the x86 XCHG instruction. The 88k ISA is strange, but the register sharing is similar to Mitch's new ISA. The 88k ISA is not as valuable today as the 88110 OoO design, which is a very flexible dual issue design for 10 units that makes instruction scheduling very easy and has high multi-issue rates. It uses a shallow 4-stage integer pipeline but has a fully pipelined extended precision FPU.
http://www.bitsavers.org/components/motorola/88000/MC88110UM_88110_Users_Manual_1991.pdf
A decoupled instruction fetch pipeline could be added for a variable length encoding, which would give a medium depth pipeline like is popular today and would more consistently feed the existing superscalar execution pipeline. There are man-years of labor in the CPU design and likely a good base design here, which is easier than starting over with a new design. I expect Mitch would be interested in reacquiring or licensing the IP, and a package deal for old cores like the 68k, 88110 and ColdFire V5 may improve the value of a deal.
cdimauro Quote:
I don't think that microcode is needed. My mem-mem-mem design is much simpler than the VAX one, and requires only some bits (the LSBs) to determine the position and length of each of the three operands. In general, decoding the first bits (within the first word / 16 bits; 15 bits maximum, to be more precise, in the worst case scenario/encoding) is enough to figure out the instruction length and the position and length of all its operands. VAX, on the other hand, requires parsing the byte stream one byte at a time, and advancing byte-by-byte to do the same. That is the reason why it wasn't possible to pipeline it (at the time), which led to its failure.

Having three memory operands can have its own problems, of course, but only when they effectively point to memory locations. Since I can reference registers, immediates, and constants (in ROM; no FMOVECR is required for them: just use one of the few that are defined in the EA), many times there's only a single memory reference, so only one AGU is needed. Another practical case is when there are only two memory operands and one of them is the destination: in this case, processing the destination's EA can easily be delayed until an AGU is free (which is very similar to the 68k case with the MOVE instruction). The worst case, of course, is when there are two source memory operands (so, mem-mem with the destination as first source operand, and mem-mem-mem), because they need to be evaluated ASAP to get their values; but the 68k has the same problem with the add mem-mem instructions (albeit the available EAs are few and fixed). I don't know if there are intrinsic issues in dealing with more than one EA per instruction, but if so I'm curious to understand why.
|
I agree that the 8-bit VLE was the major reason for the downfall of the VAX. It is like the bad 8-bit VLE x86 encoding, but the x86 ISA is simpler. The VAX ISA is more orthogonal than the x86 ISA, but it is also more complex. The 68k 16-bit VLE is easier to decode and orthogonal, with ISA complexity between the x86 and VAX ISAs. The 68k's mem,mem support did not go far, except for maybe the ADDX/SUBX mem,mem type instructions and the double memory indirect addressing modes, but they are no problem for the 68060 in hardware.
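To illustrate why the 68k's 16-bit VLE is straightforward to length-decode, here is a hedged C sketch restricted to the 68000 base addressing modes (the 68020 full-format extension words and memory indirect modes are deliberately left out): the extension word count of each EA, and therefore the total instruction length, falls out of the mode/reg fields in the first instruction word.

#include <stdint.h>

/* Extra 16-bit extension words needed by one 68000 effective address.
 * 'mode' and 'reg' are the 3-bit EA fields; 'size' is the operand size
 * in bytes (needed only for the immediate case).                       */
static int ea_ext_words(unsigned mode, unsigned reg, unsigned size)
{
    switch (mode) {
    case 0: case 1: case 2: case 3: case 4:
        return 0;                         /* Dn, An, (An), (An)+, -(An)   */
    case 5: return 1;                     /* (d16,An)                     */
    case 6: return 1;                     /* (d8,An,Xn) brief extension   */
    case 7:
        switch (reg) {
        case 0: return 1;                 /* (xxx).W                      */
        case 1: return 2;                 /* (xxx).L                      */
        case 2: return 1;                 /* (d16,PC)                     */
        case 3: return 1;                 /* (d8,PC,Xn)                   */
        case 4: return size == 4 ? 2 : 1; /* #imm: .B/.W one word, .L two */
        }
    }
    return 0;
}

/* Example: total length in 16-bit words of a 68000 MOVE instruction,
 * whose single opcode word holds both the source and destination EAs.
 * (The size field in bits 13-12 is assumed decoded by the caller.)     */
static int move_length_words(uint16_t opword, unsigned size)
{
    unsigned src_mode = (opword >> 3) & 7, src_reg =  opword       & 7;
    unsigned dst_mode = (opword >> 6) & 7, dst_reg = (opword >> 9) & 7;
    return 1 + ea_ext_words(src_mode, src_reg, size)
             + ea_ext_words(dst_mode, dst_reg, size);
}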
cdimauro Quote:
Mitch reported a slightly different list about such extra rounding modes:
Table 4: Rounding Modes
Mode                    | Status        | Encoding
Round Nearest Even      | IEEE 754      | 000
Round Nearest Odd       | Experimental  | 001
Round Nearest Magnitude | IEEE 754-2008 | 010
Round Away From Zero    | Experimental  | 100
Round Towards Zero      | IEEE 754      | 101
Round Towards +Infinity | IEEE 754      | 110
Round Towards -Infinity | IEEE 754      | 111

For the new ones, Round Away From Zero (Experimental) matches your Round away from Zero, but the other two don't. I've extended my new architecture with this table:
Round to Nearest, Ties to Even - round()
Round Toward Zero - trunc()
Round Down, Toward -Infinity - floor()
Round Up, Toward +Infinity - ceil()
Round to Nearest, Ties Away from Zero
Round to Nearest, Ties toward Zero
Round to Nearest, Ties to Max Magnitude
Round Away from Zero
I don't know if it's right, but I'm not an expert on this field.
Anyway, the encoding meanings might change. The important thing is that I've extended the field in the status registers to support these 8 rounding modes, and I've also extended the previously mentioned CONV instruction (which now takes a whopping 192 encodings from the list of available binary instructions; fortunately I have many of them, but CONV already took a good part of them).
|
I did not try to give an exhaustive list of other rounding modes. There are many ways to round numbers, and I was listing some common ones that we have likely learned and used and that are useful but are not part of the IEEE standard. Rounding modes go by many different names too. I was just saying that a 3-bit encoding for at least 8 rounding modes is a good idea. IEEE FP rounding is a little tricky because there is a sticky bit and a guard bit.
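A small self-contained C demo of the four core IEEE modes makes the ties-to-even behavior visible (standard <fenv.h>; FE_UPWARD and FE_DOWNWARD may be missing on some minimal C libraries, and the volatile operands keep the compiler from folding the calls at compile time):

#include <fenv.h>
#include <math.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    const struct { int mode; const char *name; } modes[] = {
        { FE_TONEAREST,  "to nearest, ties to even" },
        { FE_TOWARDZERO, "toward zero" },
        { FE_UPWARD,     "toward +infinity" },
        { FE_DOWNWARD,   "toward -infinity" },
    };
    volatile double pos = 2.5, neg = -2.5;   /* 2.5 -> 2.0 under ties-to-even */

    for (int i = 0; i < 4; i++) {
        fesetround(modes[i].mode);           /* rint() honours the dynamic mode */
        printf("%-25s rint(2.5)=%4.1f  rint(-2.5)=%4.1f\n",
               modes[i].name, rint(pos), rint(neg));
    }
    fesetround(FE_TONEAREST);                /* restore the default mode */
    return 0;
}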
Round Nearest Magnitude IEEE 754-2008
This is probably the optional IEEE 754-2008 rounding mode needed by double double and double extended numbers, which is basically multiprecision FP math in hardware. I would have to brush up on how exactly the different rounding takes place to be sure. Supporting additional rounding modes is likely very cheap in hardware, although the instruction encoding space is not as cheap, and knowing which ones are worthwhile to support is not clear.
Edit: Mitch's "Round nearest, magnitude IEEE 754-2008" is different than the optional "Round nearest, ties away from zero" listed at the following link.
https://en.wikipedia.org/wiki/IEEE_754#Rounding_rules
I am not sure either one is the rounding mode used for multiprecision arithmetic. No easy answers anyway.
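For what it is worth, the double double trick mentioned earlier needs nothing exotic in round-to-nearest: Knuth's TwoSum recovers the rounding error of an addition exactly, which is the building block of double double arithmetic. A minimal C sketch (it assumes strict IEEE double arithmetic, so no fast-math style reassociation):

#include <stdio.h>

typedef struct { double hi, lo; } dd;   /* value represented as hi + lo */

/* Knuth's TwoSum: a + b == s.hi + s.lo exactly, with s.hi = fl(a + b).
 * No ordering of |a| and |b| is required.                              */
static dd two_sum(double a, double b)
{
    dd s;
    s.hi = a + b;
    double bv = s.hi - a;                /* the part of hi that came from b */
    s.lo = (a - (s.hi - bv)) + (b - bv); /* exact rounding error of the add */
    return s;
}

int main(void)
{
    dd s = two_sum(1.0, 1e-30);          /* 1e-30 vanishes in a plain double add */
    printf("hi = %.17g  lo = %.17g\n", s.hi, s.lo);
    return 0;
}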
Last edited by matthey on 20-Mar-2025 at 06:52 PM.
|
| Status: Offline |
| |