matthey 
Re: some words on senseless attacks on ppc hardware
Posted on 1-May-2024 20:59:33
[ #1261 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2086
From: Kansas

Hammer Quote:

68060's FPU wasn't pipelined. Your 8-stage pipelines are for integers. CPU's clock speed potential is only as good as the weak point.


The 68060 FPU is not considered pipelined as a unit, but FPU instructions pass through more pipeline stages than instructions in the classic RISC pipeline.

1. IF (Instruction Fetch)
2. ID (Instruction Decode)
3. EX (EXecute)
4. MEM (MEMory access)
5. WB (register Write Back)

https://en.wikipedia.org/wiki/Classic_RISC_pipeline

Now let's look at the 68060 instruction pipeline.

1. IAG (Instruction Address Generation)
2. IC (Instruction fetch Cycle)
3. IED (Instruction Early Decode)
4. IB (Instruction Buffer)
5. DS (Decode instruction and Select)
6. AG (operand Address Generation)
7. OC (Operand fetch Cycle)
8. EX (instruction EXecution)
--- stages 9 and 10 are optional ---
9. DA (Data Available)
10. WB (Write Back)

Integer and FPU instructions pass through similar stages on the 68060. The difference is the EX stage, which the integer operand execution pipelines (OEPs) perform for integer instructions, while FPU instructions are sent to the FPU. The FPU is actually in the primary OEP (pOEP), though. The 68060 FPU's internal operation is serial and not pipelined, but the FPU can execute FPU instructions in parallel with the integer OEPs. The FPU EX stage may have more work to perform than typical integer instructions, such as normalizing fp numbers (including a 67 bit barrel shift) and extra result and error checking, but CISC pipelines have multi-cycle integer instructions like division too (the classic RISC pipeline has no hardware MUL or DIV). From what I've read, the critical clock speed limiting stages are more likely to be related to MMU or cache accesses, where professional chip designs use custom or licensed optimized IP blocks. The 68060 designers likely just needed more time to analyze where the critical timing areas were and optimize them.

Hammer Quote:

Without a 64-bit front-side bus, sustained FP64 and dual INT32 wouldn't be optimal.


Each 68060 OEP can access the banked data cache in the same cycle, so this would be 2x32 bit data cache accesses. The FPU has a 64 bit path to the data cache. Any workload that needs sustained accesses to memory is not optimal. The Pentium P5 and P6 have a small advantage here, made even smaller by subtracting the memory bandwidth used by loading ~20% more code. I still prefer Motorola's strategy with the 68060+ of providing competitive performance at a much lower cost.

Hammer Quote:

AMD's K5 has a 6-stage pipeline and runs into a 133 Mhz clock-speed wall. AMD's K6 (with technology from NexGen's ex-Alpha DEC engineers) still has 6-stage pipelines and can reach higher clock speeds.

Pipeline stages are one aspect of clock speed potential.


The 8 stage 68060 ran into a pencil pusher wall at only 50MHz, making Motorola look incompetent. The high performance at a low clock speed was perfect for embedded use though. As I recall, the 68060 life was over a decade, and the last revision and die shrink came in 1999 but was still rated at 50MHz. Even though Motorola handed ARM with Thumb2 the embedded market by shoving fat PPC down customers' throats, ARM didn't have a CPU core which could match 68060 DMIPS/MHz performance until about 2005 with the Cortex-A8. That core had a less general purpose 13 stage integer pipeline that improves ILP at the expense of branch performance (more of a media processor like the Pentium 4 but with less heat).

Hammer Quote:

That's a flawed argument when Pentium Pro's larger cache and front end cover the gap between the slower 64-bit 66 Mhz front side bus and the CPU's higher clock speed.

Additional transistors are spent when there's a large gap between the slower front side bus and CPU clock speed.


I was comparing the 68060+ with 16kiB I+D to the PPro.

68060+
8 stage in-order superscalar design
16kiB I+D (64 bit internal data paths to data cache where advantageous like FPU)
32 bit data bus to memory (reduces CPU, memory and board costs)
single die ~3.3 million transistors

Pentium Pro
14 stage OoO design (OoO and longer than necessary pipeline wastes transistors)
L1 8kiB I+D
L2 256kiB
64 bit data bus to memory
2-3 dies bonded together to make expensive CPU module of ~5.5 million transistors

The 32 bit data bus of the 68060 was usually not a bottleneck, but doubling the caches with the 68060+ reduces any disadvantage the 68060 had, because double the data can be accessed using internal 64 bit paths to the caches instead of accessing memory over the 32 bit data bus. Memory accesses are also reduced with the larger caches, making the 32 bit data bus less of an issue. It is true that higher CPU clock speeds would increase data requirements, making the 32 bit data bus more of an issue. The PPro definitely has the advantage for larger workloads with the L2 cache and 64 bit data bus, but the 68060+ would have been potent while much more efficient and much cheaper.

Let's not forget that the 68k code density is about 20% better than x86 code density, which reduces memory accesses and improves instruction cache efficiency. RISC-V research found that every 25%-30% code density improvement is like doubling the instruction cache size. With ISA code density improvements like adding ColdFire instructions, the 16kiB instruction cache may have held as much code as an x86 32kiB instruction cache (adding instructions to x86 worsens code density because the lack of free encoding space forces larger instructions, while adding instructions to the 68k can improve code density). It certainly looks like code density improvements were worthwhile enough to add to the AC68080, giving code density beyond the Thumb2 embedded standard and without the performance loss that RISC compressed encodings suffer from increased instruction counts. The ColdFire instructions would have helped 68060 performance not just through the cache efficiency gained from better code density but also by eliminating partial register writes and by producing 32 bit results that improve forwarding/bypassing capabilities.
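To put rough numbers on the code density argument, here is a minimal sketch. It assumes x86 code is about 1.2 times the size of equivalent 68k code (the ~20% figure from the post) and reads the RISC-V rule of thumb as compounding, so that a 25% density gain counts as one cache doubling. The function names and the compounding interpretation are my own assumptions for illustration, not measurements.

```python
import math

KIB = 1024


def x86_equivalent_bytes(cache_bytes, density_gain=0.20):
    """Bytes of x86 code that do the same work as a full cache of 68k code.

    Assumes x86 code is (1 + density_gain) times larger than 68k code.
    """
    return cache_bytes * (1.0 + density_gain)


def effective_cache_bytes(cache_bytes, density_gain=0.20, gain_per_doubling=0.25):
    """Cache size a less dense ISA might need for comparable behaviour.

    Assumption: the rule of thumb compounds, so each gain_per_doubling
    improvement in code density is worth one doubling of the I-cache.
    """
    doublings = math.log(1.0 + density_gain) / math.log(1.0 + gain_per_doubling)
    return cache_bytes * 2.0 ** doublings


if __name__ == "__main__":
    icache = 16 * KIB  # instruction cache size proposed for the 68060+
    print(f"x86 code held: {x86_equivalent_bytes(icache) / KIB:.1f} KiB")
    for step in (0.25, 0.30):
        eff = effective_cache_bytes(icache, gain_per_doubling=step)
        print(f"effective size at {step:.0%} per doubling: {eff / KIB:.1f} KiB")
```

Under these assumptions the 20% baseline alone puts a 16kiB 68k instruction cache somewhere between a 16kiB and a 32kiB x86 cache, so reaching the 32kiB figure would depend on the further density gains the post attributes to ColdFire instructions.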

Hammer Quote:

http://archive.computerhistory.org/resources/access/text/2013/04/102723315-05-01-acc.pdf
Page 86 of 417, DataQuest 1995

1994 Worldwide Microprocessor Market Share Ranking.

For 1994 Market Share
1. Intel, 73.2%
2. AMD, 8.6%
3. Motorola, 5.2%
4. IBM, 2.2%

Page 84 of 417,


This is MPU market share by revenue, which shows how much higher the margins were in high end computer markets such as desktops and workstations than in the embedded market, which Motorola led with the 68k.

Hammer Quote:

Supply Base for 32-Bit Microprocessors—1994,
For Product's Share of Total 32-Bit-and-Up MPU Market 1994

68000, 17%
80386SX/SL, 3%
80386DX, 3%
80486SX, 16%
80486DX, 21%
683XX, 9%
68040, 3%
68030, 1%
68020, 3%
80960, 4%
AM29000, 1%
32X32, 3%
R3000/R4000, 1%
Sparc, 1%
Pentium, 4%
Others, 10%

Motorola wasn't able to convert 68000's success for 68020, 68030 and 68040. This factor has weakened Motorola's independent R&D capability.


Now you are talking about volume where the 68k looks much better because of the high volume of the embedded market.

68000, 17%
683XX, 9%
68040, 3%
68020, 3%
68030, 1%
---
68k is ~33% of 32+ bit MPU market by volume in 1994 but most of this is for the low margin embedded market. The 68000 should really be classified as a 16 bit CPU but it does have a 32 bit ISA.

80386SX/SL, 3%
80386DX, 3%
80486SX, 16%
80486DX, 21%
Pentium, 4%
---
x86 is ~47% of 32+ bit MPU market by volume in 1994 but most of this is for the high margin desktop market.
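For anyone who wants to check the sums, here is a small sketch that adds up the quoted Dataquest percentages per family. The grouping tuples and the function name are mine; the numbers are copied from the quoted table.

```python
# Unit-volume shares (percent) of the 1994 32-bit-and-up MPU market,
# copied from the Dataquest table quoted in this thread.
SHARES_1994 = {
    "68000": 17, "80386SX/SL": 3, "80386DX": 3, "80486SX": 16,
    "80486DX": 21, "683XX": 9, "68040": 3, "68030": 1, "68020": 3,
    "80960": 4, "AM29000": 1, "32X32": 3, "R3000/R4000": 1,
    "Sparc": 1, "Pentium": 4, "Others": 10,
}

# Hypothetical groupings for illustration. The 80960 (i960) is an Intel
# part but not x86, so it is left out of the x86 group.
M68K = ("68000", "683XX", "68040", "68030", "68020")
X86 = ("80386SX/SL", "80386DX", "80486SX", "80486DX", "Pentium")


def family_share(shares, members):
    """Sum the percentage shares of the listed parts."""
    return sum(shares[name] for name in members)


if __name__ == "__main__":
    print("68k:", family_share(SHARES_1994, M68K), "%")
    print("x86:", family_share(SHARES_1994, X86), "%")
    print("table total:", sum(SHARES_1994.values()), "%")
```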

Last edited by matthey on 01-May-2024 at 09:10 PM.

Hammer 
Re: some words on senseless attacks on ppc hardware
Posted on 2-May-2024 7:37:51
[ #1262 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5407
From: Australia

@matthey
Quote:

Both 68060 integer instruction stages and FPU instruction stages are similar with the difference being the EX stage which is performed by the integer operand execution pipelines (OEPs) while FPU instructions are sent to the FPU. The FPU is actually in the primary OEP (pOEP) though. The 68060 FPU internal operation is serial and not pipelined but the FPU can execute FPU instructions in parallel with the integer OEPs. The FPU EX stage may have more work to perform than typical integer instructions like normalizing fp numbers including 67 bit barrel shift and extra result and error checking but CISC pipelines have multi cycle integer instructions like division too (no hardware MUL or DIV for classic RISC pipeline). From what I've read, critical clock speed limiting stages are more likely to be related to MMU or cache accesses where professional chip designs use custom or licensed optimized IP blocks. The 68060 designers likely just needed more time to analyze where the critical timing areas are and optimize them.

Show a 68060 Rev 6 at 100 MHz beating a Pentium 100 in a Quake benchmark. Hint: the Warp1260 with RTG couldn't do it.

The Pentium's floating-point unit (FPU) has an eight-stage pipeline, i.e.
Prefetch (PF),
Decode-1 (D1),
Decode-2 (D2),
Execute (dispatch),
Floating Point Execute-1 (X1)
Floating Point Execute-2 (X2)
Write Float (WF)
Error Reporting (ER)

The Pentium MMX's floating-point unit (FPU) has a nine-stage pipeline, i.e.
Prefetch (PF),
Fetch (F),
Decode-1 (D1),
Decode-2 (D2),
Execute (dispatch),
Floating Point Execute-1 (X1)
Floating Point Execute-2 (X2)
Write Float (WF)
Error Reporting (ER)

Pipelining enables "instructions-in-flight".

-----
The P54C Pentium 75/90/100/120/133/150/166/200's internal instruction bus is 256 bits wide, which is an improvement over the 128-bit internal instruction bus of the P5 Pentium 60/66 released in 1993. (Cite: Intel's Pentium Processor Family Developer’s Manual, 1997, Page 24 of 609).

The P55C Pentium includes MMX.

Quote:

Each 68060 OEP can access the banked data cache in the same cycle so this would be 2x32 bit data cache accesses. The FPU has a 64 bit path to the data cache. Anytime sustained accesses to memory are necessary is not optimal. The Pentium P5 and P6 have a small advantage here made even smaller by subtracting the memory bandwidth used by loading ~20% more code. I still prefer Motorola's strategy with the 68060+ to provide competitive performance at a much lower cost.

Show a 68060 Rev 6 at 100 MHz matching a Pentium 100 in a Quake benchmark.

The lower platform cost argument for the 68060 is a joke in practice.

Quote:

The 8 stage 68060 ran into a pencil pusher wall at only 50MHz making Motorola look incompetent. The high performance at a low clock speed was perfect for embedded use though. As I recall, the 68060 life was over a decade and the last revision and die shrink came in 1999 but was still rated at 50MHz.

Amiga and Atari Falcon 68060 accelerators are not limited to Motorola's official 50 MHz rating.

Quote:

Even though Motorola handed ARM with Thumb2 the embedded market by shoving fat PPC down customers throats,

The PowerPC 601 has 2.8 million transistors, scaled to a 120 MHz clock speed in 1995, and has a good FPU.
68060 has 2.5 million transistors.

Power Macintosh 8100's PPC 601 reached 80Mhz in March 1994.

Intel Pentium reached 100 Mhz in March 1994.

Power Macintosh 8100/110's PPC 601 reached 110 Mhz in Nov 1994.

PowerPC came out strong in 1994.

-------
In 1995...Pentium Pro 150,166,180 and 200 Mhz models were released in Nov 1995.
Pentium 133 Mhz was released in June 1995.

PowerPC 604 reached 132 Mhz with Power Mac 9500/132's June 1995 release.


-------
Pentium 150/166 in Jan 1996.

Power Mac 9500/150's 604 reached 150 MHz CPU in April 1996.

Pentium 200 in June 1996.
AMD K5-100 reached 100 MHz in June 1996.

Power Mac 9500/200's 604e reached 200 MHz CPU in August 1996.

-------
1997
Power Mac 9600's 604e reached 233 Mhz around February 1997.

AMD K6 with MMX SIMD (mainstream Socket 7) reached 233 Mhz in Apr 1997.

Pentium II 233, 266, and 300 Mhz with MMX SIMD was released in May 1997.

Pentium MMX (mainstream Socket 7) 233 Mhz was released in June 1997.

Power Mac 9600's 604e reached 350 Mhz around August 1997.
-------
1998
Pentium II "Deschutes" 266, 300, and 333 Mhz were released in Jan 1998.
K6 266Mhz was released in Jan 1998.

Pentium II "Deschutes" 350 and 400 Mhz were released in April 1998.
K6 300Mhz was released in April 1998.

PowerPC camp ran into a clock speed wall in 1998.

Pentium II "Deschutes" 450 MHz was released in August 1998.

K6-2 400 Mhz with 3DNow SIMD was released in Nov 1998.
-------
1999 to 2000, the Ghz race between Intel Pentium III (reached 1Ghz on March 8, 2000), AMD Athlon (reached 1Ghz on March 6, 2000), and Alpha EV67 (750 Mhz)/EV68 (1Ghz in 2001).



Quote:

ARM didn't have a CPU core which could match 68060 DMIPS/MHz performance until about 2005 with the Cortex-A8 but it had a less general purpose 13 stage integer pipeline that improves ILP at the expense of branch performance (more of a media processor like the Pentium 4 but with less heat).

You omitted that the ARM Cortex-A8 includes a 64-bit wide NEON SIMD unit.


Quote:

This is MPU market share by revenue which shows how much higher margin high end computer markets like desktop and workstation markets were than the embedded market which Motorola led with the 68k.

A higher revenue margin is important for R&D health.

Quote:

Now you are talking about volume where the 68k looks much better because of the high volume of the embedded market.

68000, 17%
683XX, 9%
68040, 3%
68020, 3%
68030, 1%
---
68k is ~33% of 32+ bit MPU market by volume in 1994 but most of this is for the low margin embedded market. The 68000 should really be classified as a 16 bit CPU but it does have a 32 bit ISA.

80386SX/SL, 3%
80386DX, 3%
80486SX, 16%
80486DX, 21%
Pentium, 4%
---
x86 is ~47% of 32+ bit MPU market by volume in 1994 but most of this is for the high margin desktop market.


You missed my point on why I listed specific 68K models, i.e. to remove the 68000 cloak that hides the volume market share of the 68020, 68030 and 68040.

The 68000 is not a Doom-capable CPU.

The 80386SX has a 16-bit front side bus, so its 32-bit ALU is gimped and it is in a similar boat to the 68000/68010. The 80386SX has a built-in i386 MMU.

Removing the lame-duck 32-bit CPUs with 16-bit buses:

68040, 3%
68020, 3%
68030, 1%
Total: 7%

80386DX, 3%
80486SX, 16%
80486DX, 21%
Pentium, 4%
Total: 44%

At 1992 wholesale prices, Motorola's 68030-25 wasn't cost-effective against AMD's Am386-40.

Last edited by Hammer on 02-May-2024 at 09:15 AM.
Last edited by Hammer on 02-May-2024 at 08:54 AM.
Last edited by Hammer on 02-May-2024 at 08:05 AM.
Last edited by Hammer on 02-May-2024 at 07:59 AM.
Last edited by Hammer on 02-May-2024 at 07:47 AM.
Last edited by Hammer on 02-May-2024 at 07:42 AM.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A ECS, KS 3.2, PiStorm/RPi 3A+/Emu68)

matthey 
Re: some words on senseless attacks on ppc hardware
Posted on 2-May-2024 23:21:09
[ #1263 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2086
From: Kansas

Hammer Quote:

Show a 68060 Rev 6 at 100 MHz beating a Pentium 100 in a Quake benchmark. Hint: the Warp1260 with RTG couldn't do it.


Compiler support for the P5 Pentium, which became the most popular desktop CPU, and for the 68060, which was a high end and thus lower volume embedded CPU, does not compare. The 68060 received minimal compiler support, if any, yet high performance CPU cores usually require highly tuned and optimized code to get anywhere close to their potential. The 68060 performance with old and poorly optimized code is impressive, but more is possible. My VBCC support code enhancements demonstrate what improving one part of compiler support can do, but that is just scratching the surface of what is possible across the many parts.

o compiler backend (68k compilers lack good int and fp code generation)
o compiler support software (GCC support primitive, VBCC has some assembly code and inlines)
o compiler instruction scheduler (no 68060 specific instruction scheduler for GCC or VBCC)
o compiler other (VBCC uses VASM which has the best peephole optimizer for the 68k)
o game optimizations (Amiga Quake versions may have assembly code but some not 68060 optimized)
o OS optimizations (AmigaOS is optimized for a 16 bit 68000 CPU with a 25 year old compiler)
o gfx/RTG driver optimizations (many drivers are compiled and therefore poorly optimized)

While compiler support is the most important for performance, Quake was designed and highly optimized for an x86 PC target, likely with man-years of optimization that only a profitable game market can provide. OS and gfx drivers play a part, and the quality of their code usually depends on compiler support as well.

Hammer Quote:

The Pentium's floating-point unit (FPU) has an eight-stage pipeline, i.e.
Prefetch (PF),
Decode-1 (D1),
Decode-2 (D2),
Execute (dispatch),
Floating Point Execute-1 (X1)
Floating Point Execute-2 (X2)
Write Float (WF)
Error Reporting (ER)

The Pentium MMX's floating-point unit (FPU) has a nine-stage pipeline, i.e.
Prefetch (PF),
Fetch (F),
Decode-1 (D1),
Decode-2 (D2),
Execute (dispatch),
Floating Point Execute-1 (X1)
Floating Point Execute-2 (X2)
Write Float (WF)
Error Reporting (ER)

Pipelining enables "instructions-in-flight".


The 68060 allows multiple FPU "instructions-in-flight" in the 68060 pipeline but only one can be executed at a time inside the FPU. There are several single cycle latency FPU instructions like FMOVE, FABS, FNEG, FCMP and FTST where there is no disadvantage to a non-pipelined FPU. FDIV and FSQRT are usually not pipelined in pipelined FPUs. FADD, FSUB and FMUL are the important FPU instructions to pipeline as they are used often but there is no performance loss if they are scheduled 3 cycles apart which is their execution latency. The 68060 can likely have more pipelined "instructions-in-flight" than the Pentium.
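The scheduling point can be sketched with a toy model. This is a deliberately simplified in-order, single-issue machine with a non-pipelined FPU that stays busy for 3 cycles per FADD/FSUB/FMUL (the latency stated above). It ignores the second integer pipe, data dependencies and memory, and the function name and op labels are made up for illustration.

```python
def run(program, fp_latency=3):
    """Return (issue_cycles, stall_cycles) for a list of "fp"/"int" ops.

    Toy model: one instruction issues per cycle, in order. An "fp" op
    occupies the non-pipelined FPU for fp_latency cycles, so the next
    "fp" op has to wait until the FPU is free. "int" ops never wait.
    Only issue time is counted, not completion of the last fp op.
    """
    cycle = 0
    fpu_free_at = 0
    stalls = 0
    for op in program:
        if op == "fp":
            if cycle < fpu_free_at:
                # FPU still busy: stall until it can accept a new op.
                stalls += fpu_free_at - cycle
                cycle = fpu_free_at
            fpu_free_at = cycle + fp_latency
        cycle += 1
    return cycle, stalls


if __name__ == "__main__":
    back_to_back = ["fp", "fp", "fp"]
    scheduled = ["fp", "int", "int", "fp", "int", "int", "fp"]
    print("back to back:", run(back_to_back))
    print("scheduled:   ", run(scheduled))
```

In this model both sequences take the same number of issue cycles, but the scheduled one fits four integer instructions into slots that the back-to-back version spends stalled, which is the sense in which spacing FP ops 3 cycles apart costs nothing.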

https://websrv.cecs.uci.edu/~papers/mpr/MPR/ARTICLES/061505.pdf Quote:

One advantage that the 68060 will have over Pentium is that it can issue an integer instruction in parallel with most floating-point instructions. Also, it can continue to issue integer instructions into both pipes while a long-latency operation continues in the floating-point unit. Pentium, in contrast, ties up the entire processor when performing floating-point calculations.


https://en.wikipedia.org/wiki/Motorola_68060#Architecture Quote:

Against the Pentium, the 68060 can perform better on mixed code; Pentium's decoder cannot issue an FP instruction every opportunity and hence the FPU is not superscalar as the ALUs were. If the 68060's non-pipelined FPU can accept an instruction, it can be issued one by the decoder. This means that optimizing for the 68060 is easier: no rules prevent FP instructions from being issued whenever was convenient for the programmer other than well understood instruction latencies. However, with properly optimized and scheduled code, the Pentium's FPU is capable of double the clock for clock throughput of the 68060's FPU.


Most code that uses the FPU is mixed code (int+fp instructions), where the 68060 has the performance advantage. Heavy FPU code (mostly fp instructions) has a much higher theoretical FPU performance limit on the Pentium, but that limit is rarely realized, especially by compilers. The 68060 designers focused on integer performance and were likely trying to save transistors to double the caches with the 68060+, which would have improved FPU and integer performance. A pipelined FPU and/or multiple FPU sub units (FADD/FSUB/FMUL, FDIV/FSQRT and FMOVE/FABS/FNEG/FCMP/FTST instructions executing in parallel) could have been added once transistors became cheaper.

Jay Miner Quote:

They say that engineering is the art of compromise and I can really attest to that.

https://youtu.be/n-MqC35aWrQ?t=439

Hammer Quote:

Show 68060 Rev 6 100 Mhz match Pentium 100 at Quake benchmark.

The lower platform cost argument from 68060 is a joke in practice.


The low price of the 68060 allowed it to survive and succeed in the lower volume high end embedded market. In April of 1994, the 68060@50MHz had a price of $308 each while the Pentium P54C@100MHz price was $995 each for 1000 units. The 68060 had better performance/price than the Pentium, even though that advantage was reduced by the low clock speed, which only grew worse as the chip was held at 50MHz, and by lackluster compiler benchmarks compared to the Pentium, which received much better compiler support.
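As a quick sanity check on the performance/price claim, this sketch computes how much faster the more expensive chip would need to be just to break even. Only the two prices come from the post; no benchmark numbers are assumed, and the function name is mine.

```python
PRICE_68060_50MHZ = 308.0   # USD, April 1994, as quoted above
PRICE_P54C_100MHZ = 995.0   # USD, April 1994, per 1000 units, as quoted above


def breakeven_speedup(cheap_price, expensive_price):
    """Speedup the pricier CPU needs to match the cheaper one's perf/price."""
    return expensive_price / cheap_price


if __name__ == "__main__":
    ratio = breakeven_speedup(PRICE_68060_50MHZ, PRICE_P54C_100MHZ)
    print(f"Pentium 100 must be {ratio:.2f}x faster to match perf/price")
```

So at those list prices the Pentium 100 would need a bit more than three times the 68060@50MHz performance on a given workload before it pulled ahead on performance per dollar.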

Hammer Quote:

PowerPC 601 has 2.8 million transistors, scaled to 120 Mhz clock speed in 1995, and good FPU.
68060 has 2.5 million transistors.

Power Macintosh 8100's PPC 601 reached 80Mhz in March 1994.

Intel Pentium reached 100 Mhz in March 1994.

Power Macintosh 8100/110's PPC 601 reached 110 Mhz in Nov 1994.

PowerPC came out strong in 1994.


The PPC 601 has a 32kiB unified cache where the 68060 and Pentium have 8kiB I+D (16kiB total). This is a good example of the RISC concept of using a simpler shorter (4 stage) pipeline and applying the transistor savings to the caches. Performance/MHz was competitive with the 68060 and Pentium but the shallow pipeline and 32kiB cache likely limited the max clock speed resulting in the PPC 601+ with an expensive die shrink. As I recall, the PPC 601 and PPC 603 FPUs were only fully pipelined for single precision while compilers used practically all double precision instructions before C99. Still, it wasn't difficult to outperform the ugly stack based Pentium FPU or the minimalist 68060 FPU.


Hammer Quote:

In 1995...Pentium Pro 150,166,180 and 200 Mhz models were released in Nov 1995.
Pentium 133 Mhz was released in June 1995.

PowerPC 604 reached 132 Mhz with Power Mac 9500/132's June 1995 release.


The 14 stage PPro pipeline was overkill and wasted transistors, but it allowed a high max clock speed. The PPC 604 had a longer 6 stage pipeline that allowed it to be clocked up, and the split 16kiB I+D (32kiB total) was an improvement over the PPC 601. As I recall, the PPC 604 received an FPU that was fully pipelined for double precision fp, which was also a nice upgrade. It's a powerful practical design but now 3.6 million transistors, and Steve Jobs wants more MHz to compete with the deeply pipelined PPro.

Hammer Quote:

Pentium 150/166 in Jan 1996.

Power Mac 9500/150's 604 reached 150 MHz CPU in April 1996.

Pentium 200 in June 1996.
AMD K5-100 reached 100 MHz in June 1996.

Power Mac 9500/200's 604e reached 200 MHz CPU in August 1996.
-------
1997
Power Mac 9600's 604e reached 233 Mhz around February 1997.

AMD K6 with MMX SIMD (mainstream Socket 7) reached 233 Mhz in Apr 1997.

Pentium II 233, 266, and 300 Mhz with MMX SIMD was released in May 1997.

Pentium MMX (mainstream Socket 7) 233 Mhz was released in June 1997.

Power Mac 9600's 604e reached 350 Mhz around August 1997.


The PPC 604e received the familiar die shrink with doubled caches that had boosted the clock speed in the past. The caches were now 32kiB I+D (64kiB total), which may have limited the max clock speed, but the poor code density required more instruction cache at higher speeds, the 6 stage pipeline was barely adequate at higher clock speeds, and the 5.1 million transistor chip was expensive. The PPC 604e was one of the most powerful desktop CPUs for a short period of time, but the design didn't have much potential left. PPC looked like a dead end until the PPC G3, with L2 cache to feed the RISC instruction fetch bottleneck, replaced the 604(e) design. The early PPC AmigaOne line used some of the first PPC G3 CPUs to have on-chip L2 caches, now using 20+ million transistors, mostly for caches to feed the RISC monster. Those shallow pipeline PPC cores are definitely smaller than more powerful and more deeply pipelined CISC cores though.

Hammer Quote:

1998
Pentium II "Deschutes" 266, 300, and 333 Mhz were released in Jan 1998.
K6 266Mhz was released in Jan 1998.

Pentium II "Deschutes" 350 and 400 Mhz were released in April 1998.
K6 300Mhz was released in April 1998.

PowerPC camp ran into a clock speed wall in 1998.


PPC G3 designs started appearing about this time, although the first ones did not have an on-chip L2. PPC designs generally stopped trying to compete for max clock speeds, which was always the intention of the PPC ISA. The PPC ISA breaks from the classic RISC philosophy, choosing more complexity to better compete with simpler RISC ISAs. This strategy was ahead of its time, as most RISC ISAs added complexity and adopted more CISC like features. So what went wrong? PPC designs focused on shallow pipeline, limited OoO designs to minimize load-to-use stalls and branch prediction logic. While generally a good idea, they were too locked into this concept with PPC designs for too long. Another problem was that their code density was not enough of an improvement over older classic RISC ISAs. ARM Thumb ISAs took the low end of the PPC market with much better, 68k like code density, but they lacked the 68k like performance needed to finish off PPC, which outperformed Thumb. AArch64, which is similar to PPC but has better performance and significantly better code density, then did.

Hammer Quote:

Pentium II "Deschutes" 450 MHz was released in August 1998.

K6-2 400 Mhz with 3DNow SIMD was released in Nov 1998.
-------
1999 to 2000, the Ghz race between Intel Pentium III (reached 1Ghz on March 8, 2000), AMD Athlon (reached 1Ghz on March 6, 2000), and Alpha EV67 (750 Mhz)/EV68 (1Ghz in 2001).


The odd man out here is the DEC Alpha, whose architects pioneered the L2 cache to minimize the RISC instruction fetch bottleneck. Just because Alpha, MIPS, SPARC, PA-RISC and PPC are dead RISC ISAs with some of the worst code densities doesn't mean their death was all about code density. Performance matters too, as a 1GHz x86 CPU is a lot more powerful than a 1GHz classic RISC CPU. CISC ISAs are stronger because both register and cache accesses can be pipelined, fewer but more powerful instructions and addressing modes are used, memory traffic is reduced with fewer memory accesses and fewer code fetches, and cache space is saved with compressed code. Well, x86 is far from a good example of CISC, but it was good enough to take down the RISC competition, which has 4 times the number of GP registers.

Hammer Quote:

You omitted that the ARM Cortex-A8 includes a 64-bit wide NEON SIMD unit.


A SIMD unit back then didn't make any difference to compiled benchmarks and most general purpose compiled software. Even today, it is tricky for auto-vectorization to extract consistent performance gains from general purpose compiled code.

Hammer Quote:

A higher revenue margin is important for R&D health.


True. Intel and AMD won not just because of x86 CISC superiority but because of economies of scale in their high margin markets. Had Motorola not thrown out their 68k baby, they may have been able to leverage larger volume economies of scale in the embedded market like ARM did. Granted, the embedded market is much larger now but I believe Intel is more worried about ARM trying to push up into their high end high margin markets than AMD competition because ARM is more destructive to margins than AMD.

Hammer Quote:

You missed my point on why I have shown specific 68K models i.e. to remove 68000 cloak that hides 68020's, 68030's, and 68040's volume market share.

68000 is not Doom CPU capable.

80386SX has a 16-bit front side bus, hence its 32-bit ALU is gimped i.e. hence it joins a similar boat as 68000/68010. 80386SX has a built-in i386 MMU.

Removing the "16-bit" lame 32-bit CPU ducks.

68040, 3%
68020, 3%
68030, 1%
Total: 7%

80386DX, 3%
80486SX, 16%
80486DX, 21%
Pentium, 4%
Total: 44%

In 1992's wholesale price, Motorola's 68030-25 wasn't cost-effective against AMD's Am386-40.


The 68EC030 was very good value from historic prices I have seen. CBM should have moved to a 68EC030@28MHz instead of 68EC020@14MHz but they needed a chipset that could run at 28MHz. A 68EC030@28MHz with AA+ would have competed better against cheap 386s and saved their reputation. Motorola generally expected closer to desktop margins for their full CPUs though. Intel's pricing seemed to be more about higher margins for higher clock speed rated CPUs which they pushed more but it made more sense to push for high margin markets.

Last edited by matthey on 02-May-2024 at 11:39 PM.
Last edited by matthey on 02-May-2024 at 11:27 PM.

Hammer 
Re: some words on senseless attacks on ppc hardware
Posted on 3-May-2024 0:20:26
[ #1264 ]
Elite Member
Joined: 9-Mar-2003
Posts: 5407
From: Australia

@matthey

Quote:
While compiler support is the most important for performance, Quake was designed and highly optimized for an x86 PC target using likely man years of optimizations that only a profitable game market can provide. OS and gfx drivers play a part and the quality of their code usually depends on compiler support as well.


Quake was designed for the Pentium FPU.
https://youtu.be/DWVhIvZlytc?t=934
K6 vs Pentium FPU with Quake and Quake 2.

K6-3 includes the full FPU design fix. It took AMD about 23 months to fix K6's FPU with concurrent K7 Athlon's R&D.

Quote:

The 68060 allows multiple FPU "instructions-in-flight" in the 68060 pipeline but only one can be executed at a time inside the FPU. There are several single cycle latency FPU instructions like FMOVE, FABS, FNEG, FCMP and FTST where there is no disadvantage to a non-pipelined FPU. FDIV and FSQRT are usually not pipelined in pipelined FPUs. FADD, FSUB and FMUL are the important FPU instructions to pipeline as they are used often but there is no performance loss if they are scheduled 3 cycles apart which is their execution latency. The 68060 can likely have more pipelined "instructions-in-flight" than the Pentium.

Nope. The 68060 has a 32-bit front-side bus issue in addition to its FPU design issues.

https://www.youtube.com/watch?v=0_dW-21gdkw
Warp 1260 with RTG playing Quake. The Warp 1260's RTG doesn't have a Zorro III/Super Buster bottleneck. The Warp 1260 includes an on-PCB L2 cache.

The Warp 1260's 68060 @ 100 MHz has a 32-bit 100 MHz front side bus, while the Pentium 100 MHz has a 64-bit 66 MHz front side bus. A 32-bit 100 MHz front side bus is equivalent to a 50 MHz 64-bit front side bus.
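The equivalence is just width times clock. A minimal sketch, assuming one transfer per bus clock and ignoring burst timing and wait states (the function name is hypothetical, and the clocks are the round figures quoted here):

```python
def peak_bandwidth_mb_s(width_bits, clock_mhz):
    """Theoretical peak bus bandwidth in MB/s at one transfer per clock."""
    return width_bits / 8 * clock_mhz


if __name__ == "__main__":
    buses = {
        "68060 @ 100 MHz, 32-bit 100 MHz bus": (32, 100),
        "hypothetical 64-bit 50 MHz bus": (64, 50),
        "Pentium 100, 64-bit 66 MHz bus": (64, 66),
    }
    for name, (width, clock) in buses.items():
        print(f"{name}: {peak_bandwidth_mb_s(width, clock):.0f} MB/s")
```

By this measure the 32-bit 100 MHz bus and a 64-bit 50 MHz bus both peak at 400 MB/s, against 528 MB/s for the 64-bit 66 MHz bus.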

The Warp 1260's 68060 @ 100 MHz gets Pentium 75 level results.

K6-3 has up to 100 Mhz 64-bit front side bus Super Socket 7 that competed against Pentium II 450's 100Mhz 64-bit front side bus.

K6-2 has 100Mhz Super Socket 7 support, but without the full FPU design fix.

Motorola didn't port the 88110's 60x bus to the 68060. By 1994, Motorola's revenues were less than AMD's and Motorola was fully focused on PowerPC. Motorola designed two distinct PowerPC cores, the 603 and the 604, with different CPU core die sizes for its high/low product segmentation.

Intel just cut Pentium II's L2 cache and called it Celeron for its high/med/low product segmentation i.e. Xeon, Pentium II, and Celeron.

PowerPC 601/603/604 FPU has a simplified FP64 design instead of Pentium's and 68060's FP80.

683XX led to the lesser tier Coldfire R&D.

Quote:

True. Intel and AMD won not just because of x86 CISC superiority but because of economies of scale in their high margin markets. Had Motorola not thrown out their 68k baby, they may have been able to leverage larger volume economies of scale in the embedded market like ARM did.

ARM's supporters create a safe high-margin market space for ARM i.e. handheld smart phones.

Quote:

Granted, the embedded market is much larger now but I believe Intel is more worried about ARM trying to push up into their high end high margin markets than AMD competition because ARM is more destructive to margins than AMD.

Nope. Intel's Raptor Lake Refresh suffered another instability debacle like the Pentium III GHz race. https://wccftech.com/only-5-out-of-10-core-i9-13900k-2-out-of-10-core-i9-14900k-cpus-stable-in-auto-profile-intel-board-partners-stability-issues/


The tester is the owner of a studio that buys several CPUs for their own needs. In invoices shared by the tester, it is revealed that he has bought and tested at least 100s of Intel Core i9-13900K and Core i9-14900K CPUs and it looks like almost all of the chips he acquired had some sort of issue in terms of stability. Motherboards used by the studio include ASUS's Z790, B760, Z690 and B660 boards.

The software he runs requires each CPU and PC to pass through a certain variety of tests and at the Auto profile set in the ASUS motherboards, the majority of CPUs fail this test and have to be resold. Based on these tests, the tester determined a probability rate respective to the CPU's stability & it is shared below:

Intel Core i9-13900K "AUTO -253W" - 40/50% (4/5 out of 10 units stable)
Intel Core i9-13900K "Reduced Loadline" - 50-60% (5/6 out of 10 units stable)
Intel Core i9-13900K "B760/B660 Board" - 60-70% (6/7 out of 10 units stable)
Intel Core i9-14900K "AUTO - 253W" - 20% (2 out of 10 units stable)
Intel Core i9-14900K "Reduced Loadline" - ~30% (3 out of 10 units stable)
Intel Core i9-14900K "B760/B660 Board" - 40% (4 out of 10 units stable)

So the out-of-the-box experience on an Intel 13th and 14th Gen CPUs is bad. It is reported that the chips might work fine for a week or a little over a month but usually end up producing stability issues.




Quote:

The 68EC030 was very good value from historic prices I have seen. CBM should have moved to a 68EC030@28MHz instead of 68EC020@14MHz but they needed a chipset that could run at 28MHz. A 68EC030@28MHz with AA+ would have competed better against cheap 386s and saved their reputation. Motorola generally expected closer to desktop margins for their full CPUs though. Intel's pricing seemed to be more about higher margins for higher clock speed rated CPUs which they pushed more but it made more sense to push for high margin markets.


In 1992, Motorola should have recognized they were not a leading CPU vendor and should have adopted AMD-style price disruption tactics.

For the Saturn, Sega rejected the 68030 on pricing. Motorola focused on Intel instead of non-Intel competitors.

For "kick-the-OS hit-the-metal" games, the 68020/68030 weren't 100% instruction set compatible with the 68000.

Motorola's premium pricing for MMU-equipped parts didn't encourage Linux on 68k. ARM's uptake came with the wave of low-cost, low-power, MMU-equipped ARM710T (ARMv4T) CPUs running the Linux kernel.

Last edited by Hammer on 03-May-2024 at 12:53 AM.
Last edited by Hammer on 03-May-2024 at 12:48 AM.
Last edited by Hammer on 03-May-2024 at 12:42 AM.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A ECS, KS 3.2, PiStorm/RPi 3A+/Emu68)

matthey 
Re: some words on senseless attacks on ppc hardware
Posted on 3-May-2024 20:57:38
#1265 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2086
From: Kansas

Hammer Quote:

Quake was designed for Pentium FPU.
https://youtu.be/DWVhIvZlytc?t=934
K6 vs Pentium FPU with Quake and Quake 2.

K6-3 includes the full FPU design fix. It took AMD about 23 months to fix K6's FPU with concurrent K7 Athlon's R&D.


There are many FPU design decisions which affect FPU performance.

o ISA
o ABI
o FPU algorithm efficiency (FPU instruction latency)
o FPU pipelining (FPU instruction throughput)
o FPU register renaming (FPU registers can sometimes be reused, easier instruction scheduling)
o FPU parallel sub units (like OoO execution for FPU)
o FPU to CPU interface
o memory and cache resources
o code optimization

AMD was unfortunate with the OoO K6 in that independent instructions could not be retired under an FDIV. This is not really a bug but maybe a design oversight. Most floating point programs do not use FDIV as much as Quake does for its 3D transformations, and only a few years later perspective-correcting T&L 3D gfx boards reduced the importance of FDIV performance. Intel dodged a bullet by having good FPU performance for the Pentium P5, which was weak at the more important integer performance compared to the CISC competition.

How well can the minimalist 68060 FPU handle FDIV for Quake in comparison? The Pentium P5 has an FDIV latency of 19 cycles for single precision, 33 for double precision and 39 for extended precision. All precisions have a latency of 37 cycles on the 68060. The 68k FPU ISA has a single precision division instruction called FSGLDIV, which on the 6888x has a 69 cycle latency instead of the 103 cycle latency of FDIV. The 68040 and 68060 also received FSop and FDop instructions, which round the instruction result to single and double precision and which it may be possible to optimize for less precision. The advantage of these FPU instructions is that the FPCR register does not need to be changed and changed back to select a different precision, both of which are expensive operations.

FMOVE Dn,FPCR ; 8 cycles on 68060
FMOVE FPCR,Dn ; 4 cycles on 68060

FLDCW ; 8 cycles on P5 Pentium
FNSTCW ; 2 cycles on P5 Pentium

I believe x86 has to change the equivalent of the FPCR (CW?) and there are no instructions which select a different precision without this expensive overhead. Quake needs more than single precision variables sometimes so it is not possible to set the global precision to single precision all the time. Integer instructions on the 68060 can continue to execute in parallel with the FDIV, as on the P5 Pentium, and shouldn't ever stall as they do on the AMD K6. The 68060 single precision FDIV takes 18 cycles longer than on the P5 Pentium but integer instructions can continue to execute, while the AMD K6 stalls for 14 cycles with limited execution of instructions. The 68060 avoids the overhead of changing the precision in the FPCR/CW as well. Since we know integer instructions execute in parallel with the FDIV, I would say the 68060 is likely to have better performance for this specific case than the AMD K6. The K6 came out in 1997 with OoO, 32kiB I+D caches and 8.8 million transistors to the 68060's 2.5 million, so overall the K6 FPU performance was likely better. The earlier P5 Pentium FPU no doubt has more potential than the 68060 FPU, but the x86 stack-based FPU ISA reduces the advantage and the 68060's clean GP register FPU ISA boosts the fp performance to be surprisingly close in practice. Quake may have been the exception, with Pentium P5 specific optimizations and lots of hand optimized assembly code necessary to unlock the difficult to extract x86 FPU performance.
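
To put rough numbers on the precision switching overhead, here is a minimal sketch using only the cycle counts quoted above. It assumes a worst case where the P5 saves the control word, sets single precision, divides and restores the control word around every single divide; real code would normally amortise the switch over many divides, and integer overlap is not modelled at all. The function names are made up for illustration.

```python
# Cycle counts as quoted in this post.
P5_FDIV = {"single": 19, "double": 33, "extended": 39}
P5_FNSTCW = 2      # store control word
P5_FLDCW = 8       # load control word
M68060_FDIV = 37   # same latency for all precisions

def p5_divide_with_switch(precision):
    """Worst case (assumed): save CW, set precision, divide, restore CW."""
    return P5_FNSTCW + P5_FLDCW + P5_FDIV[precision] + P5_FLDCW

def m68060_divide():
    """FSDIV/FDDIV select the rounding precision without touching the FPCR."""
    return M68060_FDIV

raw_gap = m68060_divide() - P5_FDIV["single"]
print(raw_gap, p5_divide_with_switch("single"), m68060_divide())
```

In that assumed worst case the P5's 18 cycle single precision advantage is used up entirely by the control word traffic, which is only an upper bound on the overhead and not a measurement.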

Hammer Quote:

Nope. 68060 has a 32-bit front-side bus issue in addition to 68060's FPU design issues.


The max number of "instructions-in-flight" is a best case scenario with everything cached and is highly dependent on the pipeline length, where the 68060 has the advantage over the P5 Pentium. The 2 extra stages of the P5 Pentium FPU don't make up the shortfall. On average, the 68060 has more "instructions-in-flight" too, as it has fewer superscalar multi-issue restrictions and multi-issues a lot more (45%-55% dual/triple issue with existing 68k code and 50%-65% dual/triple issue with 68060 code). Also, the P5 Pentium is really good at tying up the integer units with FXCH instructions while FPU instructions are executing rather than executing integer instructions in parallel with FPU instructions. Don't underestimate the advantage of a more orthogonal and cleaner CISC ISA.

Hammer Quote:

683XX led to the lesser tier Coldfire R&D.


Motorola should have adopted the CPU32 ISA (6833x) across all their embedded products and upgraded it with ColdFire instructions. This would have given good 68k compatibility with some simplification over the 68020 ISA, full 32 bit ISA support and better than Thumb2 code density. Instead, ColdFire destroyed 68k compatibility, reduced compiler support and further divided the 68k embedded market.

Hammer Quote:

ARM's supporters create a safe high-margin market space for ARM i.e. handheld smart phones.


I'm not sure how safe Qualcomm considers that market space for ARM.

Hammer Quote:

In 1992, Motorola should have recognized they were not a leading CPU vendor and should have adopted AMD-style price disruption tactics.

For the Saturn, Sega rejected the 68030 on pricing. Motorola focused on Intel instead of non-Intel competitors.

For "kick-the-OS hit-the-metal" games, the 68020/68030 weren't 100% instruction set compatible with the 68000.

Motorola's premium pricing for MMU-equipped parts didn't encourage Linux on 68k. ARM's uptake came with the wave of low-cost, low-power, MMU-equipped ARM710T (ARMv4T) CPUs running the Linux kernel.


The 68000 had Japanese 2nd sources in Hitachi and Toshiba, which improved the likelihood of use in Japanese products. The falling-out and lawsuits with Hitachi resulted in Motorola becoming spooked by 2nd sources and producing the high end 68020+ chips themselves. As I recall, Hitachi tried to get an injunction that would stop the 68030 from being sold. This likely means that Motorola used some of Hitachi's fab technology to produce the 68030 and Hitachi may have been slated to 2nd source it, perhaps to supply Sega. Sega sided with the Japanese Hitachi and went with SuperH instead of 68k.

I didn't like Motorola's 68k pricing strategy either. It really didn't make sense with the 68040+, which could have and perhaps should have included a standard MMU and FPU at no additional cost, with Motorola just charging more for higher clocked and enhanced versions. For example, the first 68060 dies included the MMU and FPU and were also sold as EC and LC versions, likely including chips that failed MMU or FPU testing alongside others that were fully functional. Their biggest priority was to make new dies for EC and LC chips without the MMU and/or FPU, but by this time there wasn't much silicon savings from eliminating them. Economies of scale were better when making more of one standard chip. They also could have made a 68060+ with double the caches for not much more development effort than the EC and LC chips, and it could have been sold with a significantly higher margin. Don't downgrade customers, upgrade them. Intel knew how to upgrade their products and customers. Upgrading the 68060 would have damaged the PPC market though, so down was the only direction for the 68k until it was stripped down to the ColdFire. It's kind of like the CBM strategy of stripping down and cost reducing products like the 68000+OCS/ECS Amiga to become a C64, only to find that the now obsolete product has little value and demand.

Last edited by matthey on 03-May-2024 at 09:51 PM.

Copyright (C) 2000 - 2019 Amigaworld.net.
Amigaworld.net was originally founded by David Doyle