Amigaworld.net - The Amiga Computer Community Portal Website


Amiga SIMD unit

Hammer

Re: Amiga SIMD unit
Posted on 1-Sep-2020 5:41:10

[ #61 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5286
From: Australia

@matthey
Quote:

Xilinx Versal FPGAs have been using a 7nm process for about a year (TSMC claims ~20% speed improvement or ~40% power reduction from their 10nm to 7nm process and Altera claims up to 40% higher performance or 40% lower power for the 10nm Agilex over the 14nm Stratix 10). Intel's lack of 7nm process is likely hurting Altera high end FPGA sales. These are high margin data center focused FPGA accelerators doing parallel DSP like workloads with AI datatype support to compete with GPUs. FPGAs use high tech processes to make up for routing inefficiencies.

https://www.extremetech.com/computing/296154-how-are-process-nodes-defined

Q: Why Do People Claim Intel 10nm and TSMC/Samsung 7nm Are Equivalent?

A: Because the manufacturing parameters for Intel’s 10nm process are very close to the values TSMC and Samsung use for what they call a 7nm process. The chart below is drawn from WikiChip, but it combines the known feature sizes for Intel’s 10nm node with the known feature sizes for TSMC’s and Samsung’s 7nm node. As you can see, they’re very similar.

Due to Intel's 10 nm supply issues, Intel is using its 10 nm process in the market segments with tougher competition, e.g. mobile (Tiger Lake), FPGA embedded and server (Ice Lake Xeon).

Quote:

Psygnosis retains some autonomy from Sony and has released games for other platforms after they were purchased. They were often only a publisher of games in the Amiga days as they were to be with Ultracore. They were known for their Amiga support and using Amiga tools but so was Digital Illusions (DICE) who created Ultracore. Digital Illusions was started by a Swedish Amiga demo group and created Pinball Dreams/Fantasies/Illusions and Benefactor for the Amiga before being bought by EA. The video from the official site mentions the Sega Mega Drive, Sega Genesis, PS4, PSVita but no mention of the Amiga.

I'm not naive enough to think Psygnosis wasn't biased towards the PS1 after Sony bought it. Psygnosis cancelled Ultracore for both the Amiga and the Sega Genesis.

Last edited by Hammer on 01-Sep-2020 at 05:50 AM.
Last edited by Hammer on 01-Sep-2020 at 05:45 AM.

_________________
Ryzen 9 7900X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB
Amiga 1200 (Rev 1D1, KS 3.2, PiStorm32lite/RPi 4B 4GB/Emu68)
Amiga 500 (Rev 6A, KS 3.2, PiStorm/RPi 3a/Emu68)

Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 1-Sep-2020 6:06:21

[ #62 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5286
From: Australia

@megol

Major current design targets for Intel and AMD CPUs are Geekbench PR and Blender 3D.

AMD claims Zen 3 is an integer monster, while Intel increased Tiger Lake's L1 cache size with Geekbench BS PR in mind; Apple, for one, loves using Geekbench in its marketing.


Status: Offline

megol

Re: Amiga SIMD unit
Posted on 1-Sep-2020 9:13:14

[ #63 ]

Regular Member

Joined: 17-Mar-2008
Posts: 355
From: Unknown

@Hammer

I don't agree that Geekbench or Blender are important targets for optimization. Are there any reasons to believe that is the case? Intel, and to a lesser degree AMD, are having problems optimizing for anything these days, as we are closing in on fundamental limitations with all the "easy" ways to extract performance gone.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 2-Sep-2020 2:25:15

[ #64 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2013
From: Kansas

Quote:

megol wrote:
That's independent of ISA and x86 wasn't mentioned at all.
It's suboptimal as memory accesses even when hitting L1 are slow (latency not throughput) and it wastes a lot of power.

You have to put it all in perspective and not just the L1 DCache accesses. Everything in processor design is a trade off. Let's use the 68k as a baseline and do a hypothetical compare to some other architectures.

PPC +10% instructions, +40% ICache traffic, -5% DCache traffic
Thumb2 +20% instructions, +20% DCache traffic

Do you think this would be a good tradeoff for PPC? Do you think the extra 16 GP registers would be worth it for 5% less DCache traffic at the cost of 40% increased ICache traffic and 10% more instructions to execute? Do you think the PPC would save energy overall for the same workload? Is Thumb2 a good tradeoff? Is it worth it for Thumb2 to decrease ICache traffic by 40% if it increases DCache traffic by 25% and increases instructions to execute by 10% in comparison to PPC? Do you think Thumb2 will save energy overall for the same workload?
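To make the Thumb2 versus PPC comparison concrete, here is a minimal sketch that turns the hypothetical figures above into ratios. It assumes the percentages are multiplicative factors against a 68k baseline of 1.0, and that Thumb2's ICache traffic equals the baseline because no figure was listed for it; the dictionary names are purely illustrative.

```python
# Hypothetical figures from the post, expressed relative to a 68k baseline of 1.0.
# Assumption: Thumb2 ICache traffic is 1.0 because no figure was listed for it.
PPC = {"instructions": 1.10, "icache": 1.40, "dcache": 0.95}
THUMB2 = {"instructions": 1.20, "icache": 1.00, "dcache": 1.20}


def relative_change(a, b):
    """Percent change of each metric when moving from architecture b to a."""
    return {k: round((a[k] / b[k] - 1.0) * 100.0, 1) for k in a}


thumb2_vs_ppc = relative_change(THUMB2, PPC)
print(thumb2_vs_ppc)
```

Under these assumptions the DCache and instruction count differences come out close to the 25% and 10% quoted, while the ICache reduction is nearer 29% as a ratio; the 40% figure corresponds to percentage points of the 68k baseline.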

Throughput is what matters most of the time when the L1 DCache is pipelined. The latency is exposed less often on the 68060 where the AGU is before ALU while on most RISC cores (with separate load/store units and ALU units) the latency increases the load-use penalty. DCache access times are very important to performance and there are various techniques to increase performance and decrease energy besides pipelining the L1 DCache.
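As a rough illustration of why throughput tends to dominate when the L1 DCache is pipelined, the sketch below counts cycles for a run of independent accesses. It assumes an idealized cache that accepts one new access per cycle when pipelined and no dependent loads; the function name and the numbers are illustrative, not taken from any real core.

```python
def total_cycles(accesses, latency, pipelined):
    """Cycles to complete a run of independent cache accesses.

    Assumption: a pipelined cache accepts one new access per cycle,
    so only the first access exposes the full latency.
    """
    if accesses == 0:
        return 0
    if pipelined:
        return latency + (accesses - 1)
    return latency * accesses


# 100 independent hits with a 3-cycle access latency.
print(total_cycles(100, 3, pipelined=True))
print(total_cycles(100, 3, pipelined=False))
```

The latency only becomes visible again when an access depends on the result of the previous one, which is the load-use penalty mentioned above.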

Register access is preferable to DCache access of course but it is no free lunch and usually the registers are used inefficiently. Register values are 80%-90% of the time used once or not at all with register bypass/forwarding. Usually the same few registers receive most of the accesses. There have been register cache hierarchy/bank proposals and pipelining proposals made for the register file to allow more registers and register ports when accesses would take more than a single cycle but it may be better to use fewer registers more efficiently. Register renaming and bypass/forwarding are popular design decisions which increase register efficiency. PC relative addressing, powerful addressing modes and large immediate support are examples of ways to increase register efficiency in the ISA.

Quote:

No I was talking about hardware not an idealized best case scenario. The latency isn't 1 cycle, the throughput can be.

If latency is more important than throughput then why are cores pipelined?

Quote:

For something that isn't generally useful. Accumulator style operations on memory operands aren't efficient.

Accessing memory with fewer single cycle instructions isn't useful or efficient?

RISC had an advantage when it used single cycle instructions partially offset by the ICache bottleneck it created with many simple instructions. This advantage was lost when CISC used more powerful single cycle instructions. RISC has been trying to catch up ever since. They started by adding more complex instructions like PPC and the original ARM ISA. Then they added variable length encodings to address the ICache bottleneck like Thumb2 and RISC-V. AArch64 even adds powerful addressing modes like the 68k which they found most cores could execute in a single cycle. Practically all research has gone into RISC and it has led full circle back to CISC advantages which have a bad reputation because what little research that has been done on CISC is for x86. Maybe we could dispose of the biases and admit that hybrid cores are more CISC like today than pure RISC? Maybe we could even consider possible advantages of other CISC traits?

Quote:

But you are still touching memory wasting power and adding latency.

You are assuming that RMW instructions are only beneficial when using variables in memory spilled from registers. You are assuming it is better to always move rarely used variables into registers and that it never requires spilling a register to make room. You are assuming CISC DCache overhead is higher than RISC instruction fetch overhead. You are assuming CISC DCache latency has a higher cost than adding aggressive OoO to avoid executing more simple and often dependent instructions and avoid the increased load-use penalty of cache accesses.

Quote:

Yes but I was talking about the intrinsic latency due to pipelining, SRAM access and wire delays (routing). The one that is about 3-4 cycles on a cache hit in a modern architecture due to physics not implementation nor ISA. That will likely increase in the future however OoO can only cover for so much latency inside that critical microarchitectural loop.
Deep OoO can cover even some L2 hits but of course if they start accumulating it's stall time.

I don't see L1 DCache sizes or latencies growing much in the future. Multi-level caches are still growing but mostly L3 and L4 with diminishing returns. Clock frequencies are mostly increasing with die size shrinks and few core designs are trying to push the limits of wide superscalar OoO. Security issues are limiting the complexity and performance gains of new designs. Single thread performance of mature core designs has mostly flattened other than die shrink gains and fine tuning.

Quote:

You should be comparing to register accesses. RMW isn't magical and generally useful only as a replacement for missing registers. Then it becomes slower and more power consuming.
Sure it's possible to use it when memory have to be touched anyway and then not having to go through an additional EA stage will save some power but not much - the real expense is in cache accesses themselves.

A RISC capable of decoding 3 instructions per clock will have the same throughput and latency as your CISC for this uncommon operation and will execute more common code patterns faster than the CISC with one decoder. And the RISC is likely using fewer transistors to do it, variable length decoding is very transistor intensive. This is assuming the same number of L1 R/W ports.

No. Not talking about power optimized slow embedded controllers. Look instead at a modern processor optimized for performance and compare a complex decoder with a simple one.
The AMD Ryzen is a good example where decoding takes several clock cycles and is very parallel with a late selection/fusion of partial decodes. And before you repeat the cheap trick above: no, 68k decoding is complex too and would need a similar design.

RISC processors are moving to variable length encodings too. Even with the simple RISC-V, the most pure-RISC-like of the new ISAs, the compressed variant is the most common for Linux distributions. Why would they add latency into their pipeline unless there was an overall benefit? Every 25%-30% code size reduction allows the L1 ICache size to be halved without reducing performance (a 16kiB RV32C ICache has similar performance to a 32kiB PPC ICache and an 8kiB 68k ICache has similar performance to a 32kiB Alpha ICache). Reducing ICache sizes reduces access latency, saves transistors and saves energy. I read "The RISC-V Compressed Instruction Set Manual" where the research was mentioned.
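The rule of thumb above can be sketched as a small calculation. This assumes the rule compounds (each further 25%-30% reduction halves the cache again) and picks 27.5% as the midpoint; both the function name and that midpoint are assumptions made for illustration.

```python
import math


def equivalent_icache_kib(reference_kib, code_size_ratio, reduction_per_halving=0.275):
    """ICache size giving similar performance for denser code.

    code_size_ratio: denser code size divided by reference code size.
    reduction_per_halving: assumed midpoint of the 25%-30% rule of thumb.
    """
    halvings = math.log(code_size_ratio) / math.log(1.0 - reduction_per_halving)
    return reference_kib / (2.0 ** halvings)


# One step of the rule: code 27.5% smaller than the reference.
print(equivalent_icache_kib(32, 0.725))
# Two steps of the rule: code at 0.725 * 0.725 of the reference size.
print(equivalent_icache_kib(32, 0.725 * 0.725))
```

Read this way, the 16kiB RV32C versus 32kiB PPC pairing corresponds to one step of the rule and the 8kiB 68k versus 32kiB Alpha pairing to two steps.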

Quote:

Sure with a mid-90's style pipeline in a slow low power processor things look different but why discuss that in 2020?

Few core designs are competing for top performance. There are more cores competing with practical, energy-efficient and secure designs for embedded use which more closely resemble the simpler '90s designs than the micro-op based aggressive OoO behemoths.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 2-Sep-2020 4:03:12

[ #65 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2013
From: Kansas

Quote:

Hammer wrote:
https://www.extremetech.com/computing/296154-how-are-process-nodes-defined

Q: Why Do People Claim Intel 10nm and TSMC/Samsung 7nm Are Equivalent?

A: Because the manufacturing parameters for Intel’s 10nm process are very close to the values TSMC and Samsung use for what they call a 7nm process. The chart below is drawn from WikiChip, but it combines the known feature sizes for Intel’s 10nm node with the known feature sizes for TSMC’s and Samsung’s 7nm node. As you can see, they’re very similar.

Due to Intel's 10 nm supply issues, Intel is using its 10 nm process in the market segments with tougher competition, e.g. mobile (Tiger Lake), FPGA embedded and server (Ice Lake Xeon).

The Intel 10nm process is competitive with the TSMC 7nm process but TSMC has moved on to 5nm. Intel isn't expected to have 7nm process production until 2021. Intel is falling behind although it is not as far as 10nm Intel process vs 5 nm TSMC process makes it sound. It is still far enough that Intel is considering using other fabs.

Quote:

I'm not naive enough to think Psygnosis wasn't biased towards the PS1 after Sony bought it. Psygnosis cancelled Ultracore for both the Amiga and the Sega Genesis.

Sony offered Psygnosis for sale because of friction, likely including multi-platform titles, and reportedly received an offer as high as $300 million (more than ten times what Sony paid for the company just three years before) but decided not to sell. Maybe Psygnosis was more loyal to Sony when they were first purchased though. Maybe Psygnosis believed 3D gaming was the end of 2D gaming which it practically was until retro gaming made a comeback.

Last edited by matthey on 02-Sep-2020 at 04:06 AM.

Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 2-Sep-2020 6:09:25

[ #66 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5286
From: Australia

@megol

https://www.notebookcheck.net/Quad-core-Intel-Tiger-Lake-processor-outscores-AMD-s-Ryzen-9-3950X-in-Geekbench-single-core-test.451119.0.html

A Tiger Lake processor from Intel has been spotted being put through its paces on Geekbench. The unnamed Tiger Lake-U quad-core chip posted 4,920 in the multi-core test, but its single-core score was quite outstanding. At 1,400 points, this places the Intel processor ahead of such enthusiast-level chips as AMD’s Ryzen 9 3950X.

https://www.phoronix.com/scan.php?page=news_item&px=Intel-Software-Blender-Gold
Intel Ramping Up Their Investment In Blender Open-Source 3D Modeling Software


Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 2-Sep-2020 6:22:07

[ #67 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5286
From: Australia

@matthey

Quote:

The Intel 10nm process is competitive with the TSMC 7nm process but TSMC has moved on to 5nm. Intel isn't expected to have 7nm process production until 2021. Intel is falling behind although it is not as far as 10nm Intel process vs 5 nm TSMC process makes it sound. It is still far enough that Intel is considering using other fabs.

Don't compare nm PR marketing numbers when there's no standard.

Read https://www.pcgamer.com/au/chipmaking-process-node-naming-lmc-paper/

In a recent paper, researchers suggest that semiconductor companies should ditch the loosely defined transistor gate length as a measure of technological advancement (i.e 7nm or 14nm), and instead focus on transistor density
....
Intel reports a density of 100.76MTr/mm2 (mega-transistor per squared millimetre) for its 10nm process, while TSMC's 7nm process is said to land a little behind at 91.2MTr/mm2 (via Wikichip)

Last edited by Hammer on 02-Sep-2020 at 06:23 AM.
Last edited by Hammer on 02-Sep-2020 at 06:22 AM.


Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 15-Sep-2020 5:27:51

[ #68 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5286
From: Australia

@matthey

Quote:

matthey wrote:
There are many misconceptions about the SIMD unit. I am no expert but there are some others that may be able to contribute as well. We will start by looking at the history and basic features. MMX will be looked at which the Apollo Core adopted as a standard. Finally some questions will be answered from another thread.

The IGPs in SoCs from Intel, AMD, and NVIDIA have MIMD (multiple instructions, multiple data) math co-processors.

Wider SIMD such as AVX-512 is more limited than MIMD.

The Xbox Series S SoC has 8 Zen 2 cores with a 20 CU RDNA 2 iGPU (4 TFLOPS stream, ~4.3 TFLOPS hardware raytracing), hence AMD has effectively designed a small IGP with the latest DirectX 12 Ultimate features for APUs.

Last edited by Hammer on 15-Sep-2020 at 05:32 AM.


Status: Offline

Samurai_Crow

Re: Amiga SIMD unit
Posted on 16-Sep-2020 5:46:32

[ #69 ]

Elite Member

Joined: 18-Jan-2003
Posts: 2320
From: Minnesota, USA

@Hammer

Knowing Gunnar I'd expect him to use a wider bus width on the FPU on the Vampire v4 to make the single-precision SIMD use wider registers within the FPU infrastructure. The SIMD Floating point may be big enough to warrant a bigger FPGA though and push development into the range of a Vampire 5th generation machine. Of course it will be easy to add a second, external FPGA but I don't know where the cost effectiveness breakdown falls.

Since the 68020+ architecture supports multiple external coprocessors he'll have no trouble finding instruction set space for such a unit. Maybe he'll make wider opcode fusions to improve SIMD into MIMD by auto-detecting DBRA counters. This part is less certain than my prediction about the wide SIMD unit. At least it's a path forward.

Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 16-Sep-2020 7:33:51

[ #70 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5286
From: Australia

@Samurai_Crow

Gunnar could have added a PCI-E port, or added a GCN CU to the FPGA instead, e.g.

https://www.phoronix.com/scan.php?page=news_item&px=MTgyNTE
The MIAOW project is an open source FPGA GCN CU based on AMD's Southern Islands. It supports the existing OpenCL codebase.

MIMD support would jump ahead of SIMD.

A 68080 with a GCN CU would be a strange and interesting APU.

A MIAOW-like solution would need to be combined with SAGA.


Status: Offline

Samurai_Crow

Re: Amiga SIMD unit
Posted on 17-Sep-2020 1:22:50

[ #71 ]

Elite Member

Joined: 18-Jan-2003
Posts: 2320
From: Minnesota, USA

@Hammer

Interesting idea. Maybe once he gets the bugs worked out of the SAGA core he can add something like that on a separate FPGA. He had tried to add a polygon plotter when he was working on the old NatAmi MX board before the Vampire 1 came out.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 17-Sep-2020 19:26:14

[ #72 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2013
From: Kansas

Quote:

Samurai_Crow wrote:
Interesting idea. Maybe once he gets the bugs worked out of the SAGA core he can add something like that on a separate FPGA. He had tried to add a polygon plotter when he was working on the old NatAmi MX board before the Vampire 1 came out.

It is usually cheaper to use a bigger FPGA than "add a separate FPGA". When the display controller is integrated with the Amiga custom chips (SAGA) of the primary SoC FPGA, it makes little sense to do 3D processing in another FPGA. Two FPGAs with more total pins are likely to be more expensive than a single larger FPGA with fewer pins.

FPGAs are good at parallel processing of narrower datatypes but adding mass parallelism would likely require a larger, more expensive FPGA than most Amiga users would be willing to pay for. Without an ASIC, Vamp development advances will likely slow to the pace of the FPGA size and performance improvements allowed by FPGA price reductions over time.

Status: Offline

Samurai_Crow

Re: Amiga SIMD unit
Posted on 18-Sep-2020 1:17:11

[ #73 ]

Elite Member

Joined: 18-Jan-2003
Posts: 2320
From: Minnesota, USA

@matthey

I see what you mean. That Verilog source must have been in a MASSIVE Xilinx FPGA to fit all 40 of those compute units in there.

Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 18-Sep-2020 1:39:17

[ #74 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5286
From: Australia

@Samurai_Crow

40 CU GCN clone... that's almost like PS4 Pro APU's iGPU 36 CU or Xbox One X's iGPU 40 CU.

PC APUs have 11 CU.

68090 + 40 CU GCN clone in an APU.... that would be something.


Status: Offline

Lou

Re: Amiga SIMD unit
Posted on 18-Sep-2020 13:28:54

[ #75 ]

Elite Member

Joined: 2-Nov-2004
Posts: 4169
From: Rhode Island

Adding a modern GPU to a 68k is like adding a V8 engine to a pedal bike...while keeping the pedal bike's gear+chain transmission...and wheels.

Status: Offline

Samurai_Crow

Re: Amiga SIMD unit
Posted on 18-Sep-2020 17:34:37

[ #76 ]

Elite Member

Joined: 18-Jan-2003
Posts: 2320
From: Minnesota, USA

@Lou

Quote:

Lou wrote:
Adding a modern GPU to a 68k is like adding a V8 engine to a pedal bike...while keeping the pedal bike's gear+chain transmission...and wheels.

Given that it would be running on a $3500 FPGA board the weak link won't be the CPU or 68080 instruction set, but the OS and drivers.

Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 18-Sep-2020 22:16:15

[ #77 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2013
From: Kansas

Quote:

Lou wrote:
Adding a modern GPU to a 68k is like adding a V8 engine to a pedal bike...while keeping the pedal bike's gear+chain transmission...and wheels.

Pedals are optional.

It's faster than a V8 in a heavy vehicle.

Status: Offline

umisef

Re: Amiga SIMD unit
Posted on 19-Sep-2020 10:06:37

[ #78 ]

Super Member

Joined: 19-Jun-2005
Posts: 1714
From: Melbourne, Australia

@matthey

Come on, Matt, there was a better image for that :)

Status: Offline

Hammer

Re: Amiga SIMD unit
Posted on 19-Sep-2020 14:41:16

[ #79 ]

Elite Member

Joined: 9-Mar-2003
Posts: 5286
From: Australia

@Lou

Quote:

Lou wrote:
Adding a modern GPU to a 68k is like adding a V8 engine to a pedal bike...while keeping the pedal bike's gear+chain transmission...and wheels.

An FPGA 40 CU clone at 100 MHz yields 512 GFLOPS, whereas the real 40 CU GCN at 1172 MHz yields 6000 GFLOPS. 512 GFLOPS is 2006-era GeForce 8800 GTX level compute performance.

Try a smaller CU count.

Without OpenGL acceleration, the Vampire V4 is not an Amiga Hombre-level chipset.
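The GFLOPS figures above follow from the usual peak-throughput formula for a GCN compute unit. A minimal sketch, assuming 64 lanes per CU and 2 FLOPs per lane per clock (one fused multiply-add); the constant and function names are illustrative.

```python
# Assumption: each GCN CU has 64 lanes and each lane retires one
# fused multiply-add (2 FLOPs) per clock at peak.
FLOPS_PER_CU_PER_CLOCK = 64 * 2


def peak_gflops(compute_units, clock_mhz):
    """Theoretical single-precision peak in GFLOPS."""
    return compute_units * FLOPS_PER_CU_PER_CLOCK * clock_mhz / 1000.0


print(peak_gflops(40, 100))    # FPGA clone at 100 MHz
print(peak_gflops(40, 1172))   # real 40 CU GCN at 1172 MHz
```

These are peak numbers only; sustained throughput depends on memory bandwidth and occupancy, which an FPGA implementation may not match.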

Last edited by Hammer on 19-Sep-2020 at 02:45 PM.


Status: Offline

matthey

Re: Amiga SIMD unit
Posted on 19-Sep-2020 21:21:02

[ #80 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2013
From: Kansas

Quote:

Hammer wrote:
An FPGA 40 CU clone at 100 MHz yields 512 GFLOPS, whereas the real 40 CU GCN at 1172 MHz yields 6000 GFLOPS. 512 GFLOPS is 2006-era GeForce 8800 GTX level compute performance.

Try a smaller CU count.

Without OpenGL acceleration, the Vampire V4 is not an Amiga Hombre-level chipset.

The Hombre PA-7150 only had an FPU for floating point. Its MAX SIMD extension supported 2x16-bit integer operations only. The PA-7150 was clocked at around 125MHz, so similar to the Apollo Core, and the Apollo Core is likely more powerful per clock and has more memory bandwidth, which is important for 3D performance. Hombre had dedicated 3D hardware and an enhanced blitter but it was either inflexible or, more likely, used mostly fixed point integer 3D math. The FLOPS were likely very low. With software 3D libraries and a few hardware modifications and additions, it may not take much to match the performance of the Hombre target. Supporting modern 3D APIs like OpenGL may be difficult though as they rely on floating point datatypes.
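For readers unfamiliar with the two integer techniques mentioned, here is a minimal sketch of a 2x16-bit packed add and a 16.16 fixed point multiply. It is a generic illustration, not the actual semantics of the PA-RISC MAX instructions or of the Hombre hardware; the wrap-around behaviour, the Q16.16 format and all names are assumptions.

```python
MASK16 = 0xFFFF
FRAC_BITS = 16  # assumed Q16.16 fixed point format


def packed_add_2x16(a, b):
    """Add two 32-bit words as two independent 16-bit lanes (wrap-around)."""
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo


def to_fixed(x):
    """Convert a float to Q16.16."""
    return int(round(x * (1 << FRAC_BITS)))


def fixed_mul(a, b):
    """Multiply two Q16.16 values using integer operations only."""
    return (a * b) >> FRAC_BITS


def to_float(a):
    """Convert a Q16.16 value back to a float for display."""
    return a / (1 << FRAC_BITS)


# The carry out of the low lane must not leak into the high lane.
print(hex(packed_add_2x16(0x0001FFFF, 0x00010001)))
print(to_float(fixed_mul(to_fixed(1.5), to_fixed(2.0))))
```

Both operations need only an integer ALU, which is why a design without floating point SIMD can still do useful 3D math at reduced precision.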

Status: Offline


Amigaworld.net was originally founded by David Doyle