Amigaworld.net - The Amiga Computer Community Portal Website

News about Vampire and Apollo

cdimauro

Re: News about Vampire and Apollo
Posted on 20-Oct-2018 19:49:57

[ #1661 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4438
From: Germany

Forgot to reply to this thread. Sorry for the late reply.
@matthey Quote:
matthey wrote:
Quote:
cdimauro wrote:
Just some notes here, since I was primarily working on Xeon Phi products when I was at Intel.

Larrabee failed because GPUs were (and still are, although something is changing) heavily optimized for raster graphics, which let them make better use of the silicon for that specific task. The first Larrabee versions (never released) had no fixed-function units at all, so the x64 cores had to do all the work, which led to much worse performance; this forced Intel to add some fixed-function units (texturing) to improve the situation. However, it wasn't enough to compete with GPUs.
The paradox is that nVidia has just presented GPUs with hardware-based ray tracing, which means that NOW a Larrabee design would have had a much better chance of competing...

The Amiga has an association with ray tracing as the Amiga was better equipped to render static ray traced images back when it was introduced. When I was part of the Apollo team back in 2013, one of the potential investors I was talking to was all about ray tracing (Monte Carlo bi-directional path & wave tracer). He was wanting FPGA accelerated ray tracing as a GPU before it became popular again. It certainly is possible to do real time ray tracing at lower resolutions than rasterization. Eventually, highly specialized hardware will likely take over for ray tracing like rasterization but the algorithms are being improved so it could be useful to use more flexible hardware like many core GPGPU/CPUs and/or FPGAs. While an SIMD unit can likely handle most of the vector computations, it looks to me like the algorithms can use many branches for trees (shorter integer pipelines and/or better branch prediction helpful) and quasi-random numbers (less discrepancy than pseudo-random numbers) which perhaps could use acceleration. Does this agree with your understanding of ray tracing workloads and did I miss some requirements to make it fast?
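
As an aside on the quasi-random numbers mentioned above: a low-discrepancy sequence covers the unit interval more evenly than pseudo-random draws do. A minimal Python sketch of the van der Corput / Halton construction follows. It is one common choice, not necessarily what any particular tracer uses, and the function names are illustrative.

```python
def van_der_corput(n, base=2):
    """Radical inverse of n: mirror its base-`base` digits around the point."""
    q, denom = 0.0, 1.0
    while n:
        n, digit = divmod(n, base)
        denom *= base
        q += digit / denom
    return q


def halton_2d(count):
    """First `count` 2D Halton points, using bases 2 and 3 (index starts at 1)."""
    return [(van_der_corput(i, 2), van_der_corput(i, 3))
            for i in range(1, count + 1)]


print([van_der_corput(i) for i in range(1, 5)])  # [0.5, 0.25, 0.75, 0.125]
```

Each new sample lands in the largest remaining gap, which is why such sequences tend to converge faster than plain pseudo-random sampling in Monte Carlo integration.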

Shorter pipelines are neither a requirement nor important in this case. Since the ray-tracing code is quite distributable/scalable, it's better to implement good vector cores with SMP/HT capability, in order to mask latencies and better optimize resource (execution unit) usage. In-order cores are better than OoO ones, because they can be smaller and consume less power.
Quote:
Quote:
Larrabee and the first Xeon Phi products weren't called CPUs because they were just coprocessors (they lacked some instructions, so they weren't fully x64-compatible) and were sold only as PCI-Express cards.

Starting from Knights Landing they are called CPUs, because they have a full x64 ISA. They were still sold as PCI-Express cards, but also as standalone processors (which offered much better performance too, since Knights Landing processors didn't have to go through the very slow PCI-Express to share memory: NUMA works much better).

A CPU integrated ray tracing GPU could probably offset some of the additional cost of ray tracing. The Amiga used to have integrated graphics which has a performance advantage with Moore's law expiring but now PCIe seems to be the NG "Amiga way". I love the scale-ability of a highly parallel multi-core GPGPU/CPU which can use cores for both the CPU and GPU. I believe it would be easier to program and optimize than highly specialized hardware (Hans was complaining about the unfriendliness of modern rasterizing GPUs).

I agree. Larrabee was a piece of cake to program, and you can clearly feel it looking at the several slides and videos explaining how scene rasterization was implemented.
Quote:
I believe the 68k could be a more efficient mid performance base CPU than a Pentium/Atom like x86/x86_64. The 68k ISA is not as fat, instructions don't need to be broken down as far, decoding is less expensive (percentage of a mid performance core was as high as 30-40% for x86 according to you) and may require fewer pipeline stages for better branch performance, has significantly better code density which may allow halving the L1 ICache, etc., which is all a per-core saving in transistors and energy use that adds up. Too bad the "Amiga way" does not include hardware any more. Software only is the route to being assimilated.

Unfortunately 68K lacks a modern SIMD unit, which is also big (many vector registers and instructions. Larrabee and AVX-512 have mask registers too) and wide (512-bit vector registers).

P.S. Yes, the Pentium and Pentium Pro used 30% and 40% of their transistor budget, respectively, for the decoder alone. However, the 68K is difficult to decode too.

@matthey Quote:
matthey wrote:
The fully pipelined OoO FPU doesn't seem to be blowing away the partially pipelined in-order 68060 FPU. Maybe I was right in arguing that the fully pipelined Pentium FPU didn't have much advantage over the 68060 in mixed integer/fp code.

Maybe Apollo's FPU isn't that advanced compared to the Pentium one, so a comparison might still not be doable.

Status: Offline

retro

Re: News about Vampire and Apollo
Posted on 21-Oct-2018 0:31:19

[ #1662 ]

Super Member

Joined: 16-Dec-2003
Posts: 1050
From: Unknown

@BigD Yeah, hahaha. I say good luck finding a CyberVision PPC / CyberStorm PPC for a classic Amiga at a reasonable/fair price.

060 or better CPU replacement
Sound card
Max fast mem, really fast memory
GFX card or a solution anyhow
SD card slot you can use as a hard disk
New Kickstart
Fast network
And so on and so forth

Status: Offline

matthey

Re: News about Vampire and Apollo
Posted on 21-Oct-2018 21:37:06

[ #1663 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2750
From: Kansas

Quote:

BigD wrote:
Duke Nukem 3D / Quake 2 and Alien Breed 3D 2 would all benefit from the Vampire. As for hardware accelerated 3D; just buy a Tabor and mess with that! Mediator, while good for some is a PCI kludge for Zorro. Just use a Zorro II/III card and enjoy Amiga with Amiga hardware. The Picasso IV is the limit of how far to push a classic 'Miggy IMHO. Mediator and CyberVision PPC / CyberStorm PPC all got a bit crazy with little to no software support IMHO.

I have a CSMKIII 68060@75MHz (50ns memory with fastest memory setting) Mediator with Voodoo 4. It does 25 fps with Quake 1 at 512x384x16. This looked and played good enough that my PC and console using nephew never complained about the frame rate, lack of resolution or lack of colors. I was using tweaked (bug fixes and optimizations) Warp3D drivers. The last official Warp3D for the 68k is poorly optimized and an embarrassment.

Avenger (Voodoo) libraries
1) Z-buffer bugs trash memory.
2) An AllocVec() was accessing memory past the allocated size.
3) CheckIdle() is so slow it causes responsiveness problems.
4) Code compiled for the 68040 with lack of FPU FINT/FINTRZ slows and bloats the code for the 68060.
5) Code compiled for the 68040 does not work on the 68881/68882.
6) Endian swap code is slow.
7) No texture compression to reduce the bottleneck of slow Zorro-PCI transfers.
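
On item 6: the 68k is big-endian while PCI devices such as the Voodoo expect little-endian data, so 16-bit and 32-bit values have to be byte-swapped on the way to the card. The Python sketch below only illustrates what such a swap does. It is not the Warp3D code, and the helper names are hypothetical.

```python
import struct


def swap16(words):
    """Swap the two bytes of every 16-bit value (e.g. 16-bit texels)."""
    return [((w & 0xFF) << 8) | (w >> 8) for w in words]


def swap32(value):
    """Reverse the byte order of one 32-bit value."""
    return struct.unpack("<I", struct.pack(">I", value))[0]


print([hex(w) for w in swap16([0x1234, 0xABCD])])  # ['0x3412', '0xcdab']
print(hex(swap32(0x11223344)))                      # 0x44332211
```

On a real 68k the fast path would be a couple of rotate/swap instructions per longword in a tight loop, which is presumably where the optimization work went.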

Permedia2.library
1) FPU rounding problems cause wrong colors.

Warp3D.library
1) Indirect mode almost doesn't work because it was so poorly optimized. The indirect mode queue system works well once optimized and is smoother and multitasks better than direct mode.
2) The overall library optimization is horrible. This library could be half the size.

Warp3D is far from a thin or optimized API layer for the 68k but a 68060 with Voodoo 3-5 (and likely Permedia2 even though it is slower at 3D) can still outperform the Apollo Core in 3D in any Vampire accelerator. Perhaps the Apollo Core would have come closer if it had supported single precision float and wider 128/256 bit operations in the SIMD unit but that road has been closed by the ISA. A PCI or PCIe slot could have been supported on the Vampire with some additional cost (FPGA with more I/O pins or SerDes respectively). The "team" looked at these options and Gunnar decided against them. Now Vampire owners will likely get whatever 3D support is possible after cutting down and squeezing in other units and support. There is no plan beyond minor FPGA size upgrades.

Status: Offline

matthey

Re: News about Vampire and Apollo
Posted on 21-Oct-2018 22:39:12

[ #1664 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2750
From: Kansas

Quote:

cdimauro wrote:
Shorter pipelines are neither a requirement nor important in this case. Since the ray-tracing code is quite distributable/scalable, it's better to implement good vector cores with SMP/HT capability, in order to mask latencies and better optimize resource (execution unit) usage. In-order cores are better than OoO ones, because they can be smaller and consume less power.

Longer pipelines use more transistors and consume more power. Longer pipelines require better branch prediction which also needs more transistors and consumes more power. These are per core costs which add up. It looks like the ray tracing code needs some integer unit performance so probably a medium pipeline depth would be good (7-14 stages) and 1GHz-2GHz clock speeds?
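
A rough way to put numbers on the per-branch cost of a deeper pipeline is the textbook effective-CPI estimate. The sketch below assumes the flush penalty is about equal to the pipeline depth, and the workload figures are made up purely for illustration.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    """Average cycles per instruction including branch misprediction stalls."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty


# Hypothetical workload: 20% branches, 10% of them mispredicted.
for depth in (7, 14):
    print(depth, round(effective_cpi(1.0, 0.20, 0.10, depth), 2))
```

With these assumed figures, doubling the depth from 7 to 14 stages roughly doubles the stall cycles lost to mispredictions, unless the predictor improves or the clock gain outweighs it.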

Yes, SIMD unit and SMP performance would be more important.

Quote:

I agree. Larrabee was a piece of cake to program, and you can clearly feel it looking at the several slides and videos explaining how scene rasterization was implemented.

Ease of programming is too often overlooked. IMO, too often hardware developers trade ease of programming for nearly unobtainable theoretical maximum performance.

Quote:

Unfortunately 68K lacks a modern SIMD unit, which is also big (many vector registers and instructions. Larrabee and AVX-512 have mask registers too) and wide (512-bit vector registers).

If the 68k had received an SIMD unit, it would be outdated today. It is better to start out new (but similar to an existing design) and learn from the past mistakes of others. The only thing unfortunate is that the base SIMD instructions may need to be 6 bytes in length for a very high performance SIMD unit.

Quote:

Maybe Apollo's FPU isn't that advanced compared to the Pentium one, so a comparison might still not be doable.

By features, the Apollo core is more advanced than the Pentium or 68060 but it has a major handicap in FPGA.

Status: Offline

cdimauro

Re: News about Vampire and Apollo
Posted on 22-Oct-2018 6:51:28

[ #1665 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4438
From: Germany

@matthey Quote:
matthey wrote:
Quote:
cdimauro wrote:
Shorter pipelines are neither a requirement nor important in this case. Since the ray-tracing code is quite distributable/scalable, it's better to implement good vector cores with SMP/HT capability, in order to mask latencies and better optimize resource (execution unit) usage. In-order cores are better than OoO ones, because they can be smaller and consume less power.

Longer pipelines use more transistors and consume more power. Longer pipelines require better branch prediction which also needs more transistors and consumes more power. These are per core costs which add up.

Yes, but:
- you don't need better branch predictors with this kind of code;
- more pipeline stages allow complicated tasks to be broken into simpler ones, which makes higher frequencies reachable.
Quote:
It looks like the ray tracing code needs some integer unit performance so probably a medium pipeline depth would be good (7-14 stages) and 1GHz-2GHz clock speeds?

Integer code is needed for sure, but it isn't critical for the ray tracing code. Regarding frequencies, Larrabee and the subsequent Knights* family ran at around 1GHz.
Quote:
Quote:
I agree. Larrabee was a piece of cake to program, and you can clearly feel it looking at the several slides and videos explaining how scene rasterization was implemented.

Ease of programming is too often overlooked. IMO, too often hardware developers trade ease of programming for nearly unobtainable theoretical maximum performance.

Theoretical maximum performance is good with Larrabee/Knights*, thanks to the SMT4 (four hardware threads per core) design.

Having an easy to program ISA is important when you need to squeeze the most from critical hot spots, which might be written in assembly or using intrinsics. Assembly is very rarely used nowadays, but when you have/need to do it, then having a nice ISA helps a lot.
Quote:
Quote:
Unfortunately 68K lacks a modern SIMD unit, which is also big (many vector registers and instructions. Larrabee and AVX-512 have mask registers too) and wide (512-bit vector registers).

If the 68k had received an SIMD unit, it would be outdated today. It is better to start out new (but similar to an existing design) and learn from the past mistakes of others. The only thing unfortunate is that the base SIMD instructions may need to be 6 bytes in length for a very high performance SIMD unit.

Exactly. More registers and several instructions (maybe with masking) -> longer opcodes. No free lunch here.
Quote:
Quote:
Maybe Apollo's FPU isn't that advanced compared to the Pentium one, so a comparison might still not be doable.

By features, the Apollo core is more advanced than the Pentium or 68060 but it has a major handicap in FPGA.

According to Gunnar, the new OoO core gives 10% better performance (compared to the previous in-order one): it's a very small gain compared to what happened to Atom when it switched from a 2-way in-order to a 2-way OoO core.

Anyway, here you were taking Apollo 68080 results to "promote" the 68060, but they are different: none of the Apollo core results can be used to change anything in the 68060 vs Pentium comparison.

Status: Offline

megol

Re: News about Vampire and Apollo
Posted on 22-Oct-2018 13:52:16

[ #1666 ]

Regular Member

Joined: 17-Mar-2008
Posts: 355
From: Unknown

@cdimauro
Do we actually know anything about the Apollo OoO?

As I understand it this change could be as simple as adding out of order retire of FP instructions (something the in order Atom had from the beginning), perhaps adding OoO retire for some safe integer operations or something similar. Adds some complexity to the hazard checks.

Or it could be that each pipeline can choose to start execution early under some circumstances. Hazard checks, perhaps some renaming logic.

Any of those would be using out of order execution but considerably less powerful than what is commonly referred to as out of order execution.

Given the team's propensity for cherry-picking peak performance figures, I guess that it could be an early-out mechanism (OoO retire) which gives a 10% performance increase in some hand-coded loop. Or not. Would be nice if the 68k joined the likes of the Pentium Pro as a modern CISC implementation. :)

Status: Offline

cdimauro

Re: News about Vampire and Apollo
Posted on 22-Oct-2018 14:14:49

[ #1667 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4438
From: Germany

@megol: unfortunately there's no detail at all. Gunnar only mentioned this 10% performance increase, giving no other information about it.

I fully agree with the rest, especially the last part.

Status: Offline

matthey

Re: News about Vampire and Apollo
Posted on 22-Oct-2018 20:38:06

[ #1668 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2750
From: Kansas

Quote:

cdimauro wrote:
Yes, but:
- you don't need better branch predictors with this kind of code;
- more pipeline stages allow complicated tasks to be broken into simpler ones, which makes higher frequencies reachable.

Distributing the code to more cores and working in parallel helps with branch performance but there are limitations to how many cores are practical too. There are branches in the ray tracing algorithms and SIMD units can't do branches. Perhaps 2 bit saturating prediction with a BTB would be adequate.
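
For reference, the 2-bit saturating scheme is small enough to sketch in a few lines of Python. This is a minimal illustration with a hypothetical table size and class name; the BTB (which caches branch targets, not directions) is not modelled.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [1] * entries  # start at "weakly not taken"

    def predict(self, pc):
        return self.counters[pc % self.entries] >= 2

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def mispredictions(outcomes, pc=0):
    """Count wrong predictions for one branch given its sequence of outcomes."""
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop branch taken 9 times and then falling through, run 3 times over.
loop = ([True] * 9 + [False]) * 3
print(mispredictions(loop), "misses out of", len(loop))  # 4 misses out of 30
```

The hysteresis is the point: after warm-up, a loop exit costs one misprediction per pass instead of two, because a single not-taken outcome does not flip the prediction.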

Quote:

Integer code is needed for sure, but it isn't critical for the ray tracing code. Regarding frequencies, Larrabee and the subsequent Knights* family ran at around 1GHz.

It looks like Knights Landing went up to 1.5GHz with AVX512 code and up to 1.7GHz without. 1GHz would probably be more practical and maybe less without state of the art die sizes.

https://en.wikipedia.org/wiki/Xeon_Phi

Quote:

According to Gunnar, the new OoO core gives 10% better performance (compared to the previous in-order one): it's a very small gain compared to what happened to Atom when it switched from a 2-way in-order to a 2-way OoO core.

The transistor count jumped too when the Atom went OoO. The energy use would have increased dramatically also but it can be hidden with die shrinks where transistor counts can not.

Status: Offline

matthey

Re: News about Vampire and Apollo
Posted on 22-Oct-2018 21:38:26

[ #1669 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2750
From: Kansas

Quote:

megol wrote:
Do we actually know anything about the Apollo OoO?

I don't know much either.

Quote:

As I understand it this change could be as simple as adding out of order retire of FP instructions (something the in order Atom had from the beginning), perhaps adding OoO retire for some safe integer operations or something similar. Adds some complexity to the hazard checks.

Is this what you are talking about with the same terminology?

fetch in-order -> issue in-order -> execution/completion in-order -> graduation/retirement OoO

I expect this is the most likely option. Most cores (including OoOE) hold the completed instructions until they can be retired in-order. It is possible to retire independent instructions OoO with a dependency/hazard check as you say. This is good for loads and long latency instructions as the other pipes are not kept waiting. There is a little more to recording and being able to roll back or re-execute retired instructions in the case of an interrupt or exception. The claimed 10% performance benefit isn't far off from the 6.56% average performance improvement found in the following paper.

A Study of Out-of-Order Completion for the MIPS R10K Superscalar Processor
https://pdfs.semanticscholar.org/b12f/e5d909324249c9349427b7ea6338e782d40b.pdf

The 6.56% average performance improvement is for a wide superscalar CPU with OoO issue, OoO completion and (normally) in-order graduation. The actual improvement would depend heavily on the L1 DCache setup also. Cache misses could easily push the average performance improvement to 10% or more.

Status: Offline

cdimauro

Re: News about Vampire and Apollo
Posted on 23-Oct-2018 5:32:11

[ #1670 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4438
From: Germany

@matthey Quote:
matthey wrote:
Quote:
cdimauro wrote:
Yes, but:
- you don't need better branch predictors with this kind of code;
- more pipeline stages allow complicated tasks to be broken into simpler ones, which makes higher frequencies reachable.

Distributing the code to more cores and working in parallel helps with branch performance but there are limitations to how many cores are practical too. There are branches in the ray tracing algorithms and SIMD units can't do branches. Perhaps 2 bit saturating prediction with a BTB would be adequate.

As I said before, those designs use SMT4 (four hardware threads per core) to "address" those issues and better optimize the execution unit usage.
Quote:
Quote:
Integer code is needed for sure, but it isn't critical for the ray tracing code. Regarding frequencies, Larrabee and the subsequent Knights* family ran at around 1GHz.

It looks like Knights Landing went up to 1.5GHz with AVX512 code and up to 1.7GHz without.

Don't consider the 1.7GHz "without". Larrabee introduced the EVEX opcode and the LRBni instructions, which were the foundation of the new massive vector unit; this was then turned into the instruction set of the Knights* family and later renamed AVX-512. In short: it doesn't make sense to avoid using the massive vector unit, because you completely lose the main advantage of those families.

It was surprising to see that KNL reached 1.5GHz. I don't remember such frequencies when I worked on it at Intel (before leaving, in December 2017), but usually we had pre-production samples at lower frequencies. That's really nice, considering the massive computation power that it provides.
Quote:
1GHz would probably be more practical and maybe less without state of the art die sizes.

https://en.wikipedia.org/wiki/Xeon_Phi

Well, if you can do more than 1Ghz while keeping the power usage... why not?
Quote:
Quote:
According to Gunnar, the new OoO core gives 10% better performance (compared to the previous in-order one): it's a very small gain compared to what happened to Atom when it switched from a 2-way in-order to a 2-way OoO core.

The transistor count jumped too when the Atom went OoO. The energy use would have increased dramatically also but it can be hidden with die shrinks where transistor counts can not.

That's what happened.

Last edited by cdimauro on 23-Oct-2018 at 05:32 AM.

Status: Offline

megol

Re: News about Vampire and Apollo
Posted on 23-Oct-2018 11:16:32

[ #1671 ]

Regular Member

Joined: 17-Mar-2008
Posts: 355
From: Unknown

@matthey


  0       1       2       3
Fetch
Issue   Fetch
Exec    Issue   Fetch
Exec    Exec    Issue   Fetch
Exec    Retire  Exec    Issue
Retire          Retire  Exec
                        Retire

Here instruction 0 is allowed to continue executing while the newer instructions 1 and 2 execute to completion and write the architectural registers. As long as that doesn't violate the architecture rules, that is fine and commonly done.
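
The diagram above can be reproduced with a toy model. This is only a sketch under stated assumptions (one fetch per cycle, a one-cycle issue stage, independent instructions, enough execution units); the function name and the lock-step baseline are illustrative and say nothing about how the Apollo core actually works.

```python
def retire_cycles(latencies, ooo_retire):
    """Return the cycle in which each instruction retires.

    Instruction i is fetched in cycle i and issued in cycle i + 1.
    With ooo_retire=False the pipeline is lock-step: an instruction
    cannot enter execute until the previous one has left it.
    """
    retire = []
    prev_exec_end = -1
    for i, latency in enumerate(latencies):
        exec_start = i + 2
        if not ooo_retire:
            exec_start = max(exec_start, prev_exec_end + 1)
        exec_end = exec_start + latency - 1
        retire.append(exec_end + 1)  # retire the cycle after execute finishes
        prev_exec_end = exec_end
    return retire


latencies = [3, 1, 1, 1]  # instruction 0 is the long one, as in the diagram
print(retire_cycles(latencies, ooo_retire=True))   # [5, 4, 5, 6]
print(retire_cycles(latencies, ooo_retire=False))  # [5, 6, 7, 8]
```

The first result matches the rows of the diagram (instruction 1 retires in cycle 4, ahead of instruction 0); the second shows the same stream when everything queues behind the long instruction.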

Haven't had time to read the linked paper but a quick glance gives me the impression that it's about something different: adding OoO completion to a standard OoO pipeline.
The normal thing to do is to have a reorder buffer that makes sure results are written to architectural resources in order, to preserve the illusion of being a (fast) in-order machine.
So that would be an improvement of the general OoO pipeline rather than something that can, under some circumstances, avoid stalling an in-order pipeline. The performance changes would come from a completely separate mechanism, so they are not comparable.

Status: Offline

matthey

Re: News about Vampire and Apollo
Posted on 23-Oct-2018 21:17:22

[ #1672 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2750
From: Kansas

Quote:

megol wrote:

  0       1       2       3
Fetch
Issue   Fetch
Exec    Issue   Fetch
Exec    Exec    Issue   Fetch
Exec    Retire  Exec    Issue
Retire          Retire  Exec
                        Retire

Here instruction 0 is allowed to continue executing while the newer instructions 1 and 2 execute to completion and write the architectural registers. As long as that doesn't violate the architecture rules, that is fine and commonly done.

Yes, that is what I was talking about too.

Quote:

Haven't had time to read the linked paper but a quick glance gives me the impression that it's about something different: adding OoO completion to a standard OoO pipeline.
The normal thing to do is to have a reorder buffer that makes sure results are written to architectural resources in order, to preserve the illusion of being a (fast) in-order machine.
So that would be an improvement of the general OoO pipeline rather than something that can, under some circumstances, avoid stalling an in-order pipeline. The performance changes would come from a completely separate mechanism, so they are not comparable.

The paper should have been called "A Study of Out-of-Order Graduation for the MIPS R10K Superscalar Processor". Despite the name of the paper, the study refers to OoO graduation/retirement and *not* OoO completion/execution (the paper says: "Completion refers to completion of execution and graduation refers to committing the results."). The MIPS R10000 already has OoOE but in-order graduation. The study looks at the performance gain of the MIPS R10000 with OoO graduation. This CPU is much different than an in-order CPU but the performance gain should be at least as high with an in-order CPU. Read the Summary.

"We present here a study of performance impact of out-of-order graduation for MIPS R10K processor. The basic observation is that the longer latency operations e.g., load, multiply, division etc. are holding the smaller latency instructions (completed execution) in active list. As a result the active list is getting full and hence decode is getting stalled. The performance impact becomes more observable when the load operation misses.

A careful out-of-order graduation has shown 6.56% performance improvement on an average for DSM and multimedia kernels. The independent instructions, the potential candidate for out-of-order graduation, are generated due to software pipelining, loop index improvement, array index computation or loop termination check etc. We show that long latency operations contribute a lot to the performance loss. The load miss situation is noticeable. However we have not studied the impact of different memory subsystem [5] and thereby different load miss situation on out-of-order graduation."
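
The stall mechanism described in the summary can be illustrated with a toy active-list model. It is a sketch only: it assumes one decode per cycle, independent instructions and made-up latencies, and it is not the simulator used in the paper.

```python
def cycles_to_drain(latencies, list_size, ooo_graduation):
    """Cycles needed to decode and graduate every instruction.

    The active list holds the finish cycle of each in-flight instruction
    in program order. Decode stalls whenever the list is full.
    """
    active = []
    cycle = 0
    next_instr = 0
    while next_instr < len(latencies) or active:
        if ooo_graduation:
            # Any finished instruction may leave the list.
            active = [finish for finish in active if finish > cycle]
        else:
            # Only the oldest instruction may leave, and only once finished.
            while active and active[0] <= cycle:
                active.pop(0)
        if next_instr < len(latencies) and len(active) < list_size:
            active.append(cycle + latencies[next_instr])
            next_instr += 1
        cycle += 1
    return cycle


stream = [10, 1, 1, 1, 1, 1]  # one long "load miss" followed by short ops
print(cycles_to_drain(stream, list_size=2, ooo_graduation=False))  # 15
print(cycles_to_drain(stream, list_size=2, ooo_graduation=True))   # 11
```

With in-order graduation the short instructions pile up behind the long one and decode stalls; letting finished instructions leave early keeps a slot free. The size of the gain depends entirely on the assumed list size and latencies.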

Last edited by matthey on 23-Oct-2018 at 09:20 PM.

Status: Offline

megol

Re: News about Vampire and Apollo
Posted on 24-Oct-2018 12:46:23

[ #1673 ]

Regular Member

Joined: 17-Mar-2008
Posts: 355
From: Unknown

@matthey

Quote:

matthey wrote:
The paper should have been called "A Study of Out-of-Order Graduation for the MIPS R10K Superscalar Processor". Despite the name of the paper, the study refers to OoO graduation/retirement and *not* OoO completion/execution (the paper says: "Completion refers to completion of execution and graduation refers to committing the results."). The MIPS R10000 already has OoOE but in-order graduation. The study looks at the performance gain of the MIPS R10000 with OoO graduation. This CPU is much different than an in-order CPU but the performance gain should be at least as high with an in-order CPU. Read the Summary.

It could be much higher.

Quote:

"We present here a study of performance impact of out-of-order graduation for MIPS R10K processor. The basic observation is that the longer latency operations e.g., load, multiply, division etc. are holding the smaller latency instructions (completed execution) in active list. As a result the active list is getting full and hence decode is getting stalled. The performance impact becomes more observable when the load operation misses.

Then my impression was correct, the paper looks at the performance advantage incorporating OoO graduation/retire in a standard OoO pipeline and not adding OoO retire to an in-order pipeline.

Quote:

A careful out-of-order graduation has shown 6.56% performance improvement on an average for DSM and multimedia kernels. The independent instructions, the potential candidate for out-of-order graduation, are generated due to software pipelining, loop index improvement, array index computation or loop termination check etc. We show that long latency operations contribute a lot to the performance loss. The load miss situation is noticeable. However we have not studied the impact of different memory subsystem [5] and thereby different load miss situation on out-of-order graduation."

The OoO design allows later instructions to bypass earlier ones and pass their results to later dependent instructions; it's just that they are speculative until they retire. In this case retire means updating the architectural resources, making the result non-speculative. By adding OoO retire, some instructions can update architectural resources early as long as there is no violation of the ISA specification; this makes it less likely that the processor will stall waiting for reordering resources to be freed.

In the in-order design the retire (or more commonly writeback) stage simply writes the architectural registers without using any reorder mechanism. Adding OoO retire to the pipeline allows later instructions to execute if that doesn't violate the ISA specification.

Improvements in the OoO case come from not stalling when multiple long-latency instructions (loads that miss L1 or even L2) fill the reorder structures. Improvements in the in-order case come from allowing some degree of OoO execution at all. Not the same thing.

Status: Offline

bhabbott

Re: News about Vampire and Apollo
Posted on 24-Oct-2018 23:50:09

[ #1674 ]

Cult Member

Joined: 6-Jun-2018
Posts: 554
From: Aotearoa

Quote:
A careful out-of-order graduation has shown 6.56% performance improvement on an average for DSM and multimedia kernels.

6.56% Woohoo!

Status: Offline

megol

Re: News about Vampire and Apollo
Posted on 25-Oct-2018 15:07:29

[ #1675 ]

Regular Member

Joined: 17-Mar-2008
Posts: 355
From: Unknown

@bhabbott
I don't know how to interpret that.

Status: Offline

matthey

Re: News about Vampire and Apollo
Posted on 25-Oct-2018 18:46:50

[ #1676 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2750
From: Kansas

Quote:

megol wrote:
It could be much higher.

Yes. The MIPS R10000 L1 DCache likely has a low miss rate and there are often enough pipes to execute new instructions without graduating old ones. I expect a CPU design which is more in-order or has fewer pipes to benefit more from OoO retirement/graduation. The paper shows that higher L1 DCache miss rates result in a higher performance gain for OoO retirement/graduation. The 6.56% performance gain is likely close to a minimum average expected improvement.

Quote:

Then my impression was correct, the paper looks at the performance advantage incorporating OoO graduation/retire in a standard OoO pipeline and not adding OoO retire to an in-order pipeline.

Yes.

Quote:

The OoO design allows later instructions to bypass earlier ones and pass their results to later dependent instructions; it's just that they are speculative until they retire. In this case retire means updating the architectural resources, making the result non-speculative. By adding OoO retire, some instructions can update architectural resources early as long as there is no violation of the ISA specification; this makes it less likely that the processor will stall waiting for reordering resources to be freed.

In the in-order design the retire (or more commonly writeback) stage simply writes the architectural registers without using any reorder mechanism. Adding OoO retire to the pipeline allows later instructions to execute if that doesn't violate the ISA specification.

Improvements in the OoO case come from not stalling when multiple long-latency instructions (loads that miss L1 or even L2) fill the reorder structures. Improvements in the in-order case come from allowing some degree of OoO execution at all. Not the same thing.

I don't see much difference between the gains of adding OoO retirement to an in-order design vs an OoO design. In both cases without OoO retirement, the CPU stalls while waiting for long latency instructions to complete. The OoO design may be able to issue a few more instructions before it stalls so I would expect the gains from OoO retirement to be less. Another way to deal with the stalls is to use multi-threading and switch to another thread when stalled. The Apollo core supposedly supported multi-threading for a while. I wonder if it still does with likely reduced gains from using OoO retirement.

The OoO retired instructions and results must be recorded somewhere as they must be rolled back if there is an exception on an earlier instruction which is still executing. Since OoO retired instructions must be independent, maybe it is possible to use the rename registers for results until the speculative instructions retire? Even if only one extra instruction could be executed OoO it would be helpful if it was cheap enough (poor man's OoO).

Status: Offline

Overflow

Re: News about Vampire and Apollo
Posted on 18-May-2019 18:27:52

[ #1677 ]

Super Member

Joined: 12-Jun-2012
Posts: 1628
From: Norway

https://www.youtube.com/watch?v=nNYRZ_bXpF0

Quote:

Testing the game on the Vampire V4 Stand Alone in the proper way, with the CDAudio for full amazement!

With the V4 we can plug up to 4 IDE devices so connecting a CDrom drive is easy.

T-Zer0 is a nice looking AGA Shoot em up game for the Amiga published by PXL Computers in 1999.

Status: Offline

bennymee

Re: News about Vampire and Apollo
Posted on 19-May-2019 19:08:43

[ #1678 ]

Cult Member

Joined: 19-Aug-2003
Posts: 701
From: Netherlands

@Overflow

Looks and runs good at first sight, no glitches, no stuttering.

Status: Offline

Overflow

Re: News about Vampire and Apollo
Posted on 20-May-2019 17:34:28

[ #1679 ]

Super Member

Joined: 12-Jun-2012
Posts: 1628
From: Norway

Vampire V1200 First Videos - SysInfo

https://www.youtube.com/watch?v=4MA-8aPH9Sg&feature=youtu.be

Vampire V1200 First Videos - AIBB

https://www.youtube.com/watch?v=yWGSLQb_9HQ&feature=youtu.be

Status: Offline

_Steve_

Re: News about Vampire and Apollo
Posted on 20-May-2019 20:33:27

[ #1680 ]

Team Member

Joined: 17-Oct-2002
Posts: 6823
From: UK

@bennymee

Quote:
@Overflow

Looks and runs good at first sight, no glitches, no stuttering.

You need to look again more closely.

There was definitely one graphical glitch when the ship selection came up (there was a lot of frantic button pressing at the time it was displayed).

The ship graphics at the bottom of the screen showed a lot of corruption, but aside from that one moment, the rest of the game seemed to be running well.

It has actually been such a long time since I played T-Zer0, I may have to dig out my copy and give it a go again (although I recall being particularly bad at it).

Last edited by _Steve_ on 20-May-2019 at 08:34 PM.

_________________
Test sig (new)

Status: Offline


Amigaworld.net was originally founded by David Doyle