Amigaworld.net - The Amiga Computer Community Portal Website

Vector abstraction for fun and profit.

Karlos

Vector abstraction for fun and profit.
Posted on 7-Jan-2023 15:56:18

[ #1 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4958
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

So, the recent discussion around PiStorm32 got me thinking. I definitely appreciate the reasoning why @michalsc isn't keen to expose native AArch64 code execution and other resources to avoid fragmenting the userbase with "yet another binary target". As a developer, though, I would definitely like to be able to make use of those resources.

I was thinking about the VPU library suggestion. At first glance, aside from a few common operations like copying, blending and such, there aren't that many well defined stream operations that you would want to put into it. The real power of SIMD is that you can write your own streaming operations for it.

So, here's a thought. Suppose you defined a virtual SIMD processor, with some modest number of registers, say 8 or 16, with some sensible width, perhaps 16 or 32 bytes. Add to that a handful of simple integer registers for counters and pointers. Define a good set of vector operations, as bytecode and suitable load/store type addressing modes that make use of the scalar registers. Obvious candidates are post increment, etc. Also include some basic status register that you can check.

Let's say your library provides a context allocator/deallocator that returns a structure representing the state of the above registers that you can read/write as you see fit.

Next let's say the library provides another function that accepts some string of bytecode for this virtual unit and can compile it to something for the underlying machine architecture to execute, that you never directly see. The return from this is a handle structure that contains a function pointer to call. It also contains a pointer to a state structure that you have to set. This way, multiple compiled SIMD functions can share the same state object if desired.

You populate your state structure and invoke your callable. If it all works, the status register in the state is all good and your natively compiled vector code did something awesome, and with any luck, did it quickly, making use of native vector operations.
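
A minimal sketch of the API shape described above, written in Python purely for readability. Everything named here (`VectorState`, `Handle`, `compile_bytecode`, the `vadd8` and `vxor` opcodes, the register counts) is an assumption for illustration, and the "compiler" just builds a Python closure where a real library would emit native code.

```python
from dataclasses import dataclass, field
from typing import Callable, List

VECTOR_BYTES = 16   # assumed register width (the post suggests 16 or 32)
NUM_VREGS = 8       # assumed register count (the post suggests 8 or 16)

@dataclass
class VectorState:
    """Caller-visible state: vector registers, scalar registers, status."""
    v: List[bytearray] = field(
        default_factory=lambda: [bytearray(VECTOR_BYTES) for _ in range(NUM_VREGS)])
    s: List[int] = field(default_factory=lambda: [0] * 4)
    status: int = 0  # 0 means "all good" in this sketch

@dataclass
class Handle:
    """What the compile step returns: a callable plus the state it uses."""
    call: Callable[[], None]
    state: VectorState

def compile_bytecode(program, state):
    """Stand-in for the JIT: hypothetical (op, dst, src_a, src_b) tuples in."""
    def run():
        for op, d, a, b in program:
            if op == "vadd8":    # lane-wise 8-bit add, wrapping
                state.v[d] = bytearray((x + y) & 0xFF
                                       for x, y in zip(state.v[a], state.v[b]))
            elif op == "vxor":   # lane-wise exclusive or
                state.v[d] = bytearray(x ^ y
                                       for x, y in zip(state.v[a], state.v[b]))
            else:
                state.status = 1  # unknown opcode
                return
        state.status = 0
    return Handle(call=run, state=state)

state = VectorState()
state.v[0][:] = bytes(range(16))
state.v[1][:] = bytes([250] * 16)
add = compile_bytecode([("vadd8", 2, 0, 1)], state)
xor = compile_bytecode([("vxor", 3, 2, 2)], state)  # shares the same state
add.call()
xor.call()
```

Both handles point at the same `VectorState`, which is the sharing arrangement the post describes: the second routine picks up `v2` exactly as the first one left it.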

To polish it off, a function that can compile the bytecode from some assembler string definition, similar to how GLSL and other shader languages work, would be ideal. This would allow, for instance, executables to carry precompiled vector code in data sections or to have the assembler representation converted at run time, for maximum flexibility.
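
As a rough illustration of the assembler-string idea, here is a sketch of a tiny text-to-bytecode step. The mnemonics, opcode numbers and the fixed 4-byte instruction layout are all invented for this example; nothing in the thread defines an encoding.

```python
import struct

# Hypothetical mnemonics and opcode numbers, chosen only for illustration.
OPCODES = {"vload": 0x01, "vstore": 0x02, "vadd8": 0x10, "vxor": 0x11, "loop": 0x20}

def assemble(source: str) -> bytes:
    """Turn text like 'vadd8 v2, v0, v1' into fixed 4-byte instructions."""
    out = bytearray()
    for line in source.splitlines():
        line = line.split(";", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        mnemonic, _, rest = line.partition(" ")
        operands = [tok.strip() for tok in rest.split(",") if tok.strip()]
        regs = [int(tok[1:]) for tok in operands]  # 'v2' or 's0' -> register index
        regs += [0] * (3 - len(regs))              # pad to three operand bytes
        out += struct.pack("4B", OPCODES[mnemonic], *regs)
    return bytes(out)

code = assemble("""
    vload  v0, s0      ; load 16 bytes from the address in s0
    vadd8  v2, v0, v1
    vstore v2, s1
""")
```

The resulting `bytes` object is the kind of blob that could sit in a data section, while the source string could equally be shipped as-is and assembled at load time.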

Such a machine can be implemented on hosts that don't even have a vector unit. It can even be implemented in pure 68K as an interpreter model - not for speed (obviously) but as a baseline for defining the expected behaviour and debugging purposes.

Last edited by Karlos on 07-Jan-2023 at 10:55 PM.

_________________
Doing stupid things for fun...

Status: Offline

Karlos

Re: Vector abstraction for fun and profit.
Posted on 7-Jan-2023 23:58:20

[ #2 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4958
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

In the interests of keeping it simple, the additional scalar operations would be limited to things like loop counters. The processing model would be simple, linear imperative. You define a sequence of vector operations (loads, stores, arithmetic/logic, etc) and some looping primitives. Conditional logic would be extremely limited and the code would generally just execute the operation sequence until the loop or exit condition is done. There's no need to support calling other routines within the vector code since it complicates things unnecessarily. SIMD code tends to be very linear anyway.
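
The linear model with a single looping primitive can be sketched as a very small interpreter. This is only a guess at what such a machine might look like: the op names (`vload+`, `vstore+`, `vadd8`, `loop`), the tuple encoding and the four-register file are assumptions, and memory is modelled as a Python `bytearray`.

```python
def run(program, memory, s, lanes=16):
    """Interpret a linear op list; 'loop' repeats it until its counter hits 0.

    memory is a bytearray, s the scalar registers (pointers and counters).
    """
    v = [bytearray(lanes) for _ in range(4)]
    pc = 0
    while pc < len(program):
        op, a, b = program[pc]
        if op == "vload+":            # v[a] = memory at s[b], then s[b] += lanes
            v[a][:] = memory[s[b]:s[b] + lanes]
            s[b] += lanes
        elif op == "vstore+":         # memory at s[b] = v[a], then s[b] += lanes
            memory[s[b]:s[b] + lanes] = v[a]
            s[b] += lanes
        elif op == "vadd8":           # v[a] += v[b], lane-wise, wrapping
            v[a][:] = bytes((x + y) & 0xFF for x, y in zip(v[a], v[b]))
        elif op == "loop":            # decrement s[a]; jump to index b if not zero
            s[a] -= 1
            if s[a] != 0:
                pc = b
                continue
        pc += 1
    return s

# Two source streams at offsets 0 and 32, destination at offset 64.
memory = bytearray(range(32)) + bytearray([1] * 32) + bytearray(32)
scalars = [0, 32, 64, 2]              # s0, s1, s2 = pointers, s3 = loop count
program = [
    ("vload+", 0, 0),
    ("vload+", 1, 1),
    ("vadd8", 0, 1),
    ("vstore+", 0, 2),
    ("loop", 3, 0),
]
run(program, memory, scalars)
```

The only control flow is the counter test in `loop`, so the op sequence simply runs top to bottom twice and streams through all 32 bytes.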

_________________
Doing stupid things for fun...

Status: Offline

Karlos

Re: Vector abstraction for fun and profit.
Posted on 8-Jan-2023 14:10:56

[ #3 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4958
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

I felt sure this thread would have lured out a couple of chip heads...

_________________
Doing stupid things for fun...

Status: Offline

pixie

Re: Vector abstraction for fun and profit.
Posted on 11-Jan-2023 8:16:38

[ #4 ]

Elite Member

Joined: 10-Mar-2003
Posts: 3475
From: Figueira da Foz - Portugal

@Karlos

Could this be done in a way that it would work on x86 (winuae) or arm (emu68k)?

_________________
Indigo 3D Lounge, my second home.
The Illusion of Choice | Am*ga

Status: Offline

kolla

Re: Vector abstraction for fun and profit.
Posted on 11-Jan-2023 9:59:18

[ #5 ]

Elite Member

Joined: 20-Aug-2003
Posts: 3475
From: Trondheim, Norway

@Karlos

Any profit yet?

_________________
B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC

Status: Offline

Karlos

Re: Vector abstraction for fun and profit.
Posted on 11-Jan-2023 12:07:14

[ #6 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4958
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@pixie

Yes, that's the point. As a VM it can be implemented as a pure interpreter, on the 68K itself. You won't get any actual benefit in that case except for the ability to test and debug things.

Last edited by Karlos on 11-Jan-2023 at 12:15 PM.

_________________
Doing stupid things for fun...

Status: Offline

Karlos

Re: Vector abstraction for fun and profit.
Posted on 11-Jan-2023 12:08:31

[ #7 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4958
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@kolla

None. But on the flipside, I do enjoy implementing virtual machines and this is a virtual machine. It has some interesting constraints.

The absolute MVP requires the definition of a vector register set, some arithmetic/logic operations, basic load/store with some support for incremental addressing (streams tend to work that way) and a basic loop counter. No conditional logic at all, except testing whether the loop counter has run out. Finally, an interpreter that can execute the operations above.

Obviously the goal is to have a JIT implementation that uses appropriate vector operations (and even if it doesn't, the raw native scalar performance may help).

I might have a go at a proof of concept later just for fun.

Last edited by Karlos on 11-Jan-2023 at 12:32 PM.
Last edited by Karlos on 11-Jan-2023 at 12:10 PM.

_________________
Doing stupid things for fun...

Status: Offline

cdimauro

Re: Vector abstraction for fun and profit.
Posted on 22-Jan-2023 9:05:57

[ #8 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4438
From: Germany

@Karlos

Quote:

Karlos wrote:
I felt sure this thread would have lured out a couple of chip heads...

It was/is definitely interesting for me, but in recent months I was very busy using my little spare time to write a Telegram bot (in Python, of course). It aggregates news / threads / comments from the online newspaper & forums which I follow and pushes notifications in a proper / rich format. So I no longer waste time checking "what's new".
Now it's done and I have some time to write comments.
Quote:

Karlos wrote:
So, the recent discussion around PiStorm32 got me thinking. I definitely appreciate the reasoning why @michalsc isn't keen to expose native AArch64 code execution and other resources to avoid fragmenting the userbase with "yet another binary target". As a developer, though, I would definitely like to be able to make use of those resources.

I was thinking about the VPU library suggestion. At first glance, aside from a few common operations like copying, blending and such, there aren't that many well defined stream operations that you would want to put into it. The real power of SIMD is that you can write your own streaming operations for it.

So, here's a thought. Suppose you defined a virtual SIMD processor, with some modest number of registers, say 8 or 16,

I think that 16 is the bare minimum.
Quote:
with some sensible width, perhaps 16 or 32 bytes.

The New Big Thing in this area is vector length-agnostic SIMD registers. So, don't define register widths: the width should be determined at runtime and used "transparently" (via proper instructions).
Quote:
Add to that a handful of simple integer registers for counters and pointers.

Which means that you also need a regular "integer / scalar" ISA.

Here lies the biggest challenge, IMO: it should be small but powerful enough to sustain the (big) SIMD unit.

Any ideas on that already, or is it completely open / a blank sheet?
Quote:
Define a good set of vector operations,

That's fairly easy, because it's enough to take a look at the "competitors" (Intel, ARM, RISC-V) and start by "borrowing" the most used/common operations from them.

However opcode space is needed to fill the gaps with future instructions.
Quote:
as bytecode and suitable load/store type addressing modes that make use of the scalar registers.

A big question here: is the new ISA a CISC (e.g.: any instruction can directly reference memory) or L/S (AKA "RISC")?
Quote:
Obvious candidates are post increment, etc.

Scaled indexing? If yes, don't use fixed scale factors (*1, *2, *4, *8) but directly use the whole size of the operation.
Quote:
Also include some basic status register that you can check.

For SIMD and/or integer unit? Which flags?
Quote:
Let's say your library provides a context allocator/deallocator that returns a structure representing the state of the above registers that you can read/write as you see fit.

Next let's say the library provides another function that accepts some string of bytecode for this virtual unit and can compile it to something for the underlying machine architecture to execute, that you never directly see. The return from this is a handle structure that contains a function pointer to call. It also contains a pointer to a state structure that you have to set. This way, multiple compiled SIMD functions can share the same state object if desired.

You populate your state structure and invoke your callable.

Better to avoid directly populating the state structure: getters/setters should be defined.
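
A small sketch of what accessor-based state could look like, to make the suggestion concrete. The class name, method names and the 16-byte width are assumptions for illustration only; the point is that the library can validate and copy data at the boundary instead of exposing its storage.

```python
class VectorContext:
    """Opaque context: callers use accessors, never the register storage."""
    WIDTH = 16  # assumed register width in bytes

    def __init__(self, num_vregs=16):
        self._v = [bytes(self.WIDTH) for _ in range(num_vregs)]

    def set_vector(self, index, data):
        if len(data) != self.WIDTH:
            raise ValueError(f"expected {self.WIDTH} bytes, got {len(data)}")
        self._v[index] = bytes(data)   # copy, so later caller edits don't leak in

    def get_vector(self, index):
        return self._v[index]          # bytes is immutable, safe to hand out

ctx = VectorContext()
source = bytearray(range(16))
ctx.set_vector(3, source)
source[0] = 99                         # does not affect the stored register
```

With accessors in place, the internal layout is free to change (for instance to match a native register file) without breaking callers.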
Quote:
If it all works, the status register in the state is all good and your natively compiled vector code did something awesome, and with any luck, did it quickly, making use of native vector operations.

To polish it off, a function that can compile the bytecode from some assembler string definition, similar to how GLSL and other shader languages work, would be ideal. This would allow, for instance, executables to carry precompiled vector code in data sections or to have the assembler representation converted at run time, for maximum flexibility.

Such a machine can be implemented on hosts that don't even have a vector unit. It can even be implemented in pure 68K as an interpreter model - not for speed (obviously) but as a baseline for defining the expected behaviour and debugging purposes.

The idea looks nice and is fun, however where is the "profit"?
Quote:

Karlos wrote:
In the interests of keeping it simple, the additional scalar operations would be limited to things like loop counters. The processing model would be simple, linear imperative. You define a sequence of vector operations (loads, stores, arithmetic/logic, etc) and some looping primitives. Conditional logic would be extremely limited and the code would generally just execute the operation sequence until the loop or exit condition is done. There's no need to support calling other routines within the vector code since it complicates things unnecessarily. SIMD code tends to be very linear anyway.

Yes, but if you look at the real code you'll see the vector instructions aren't enough: the executed code is also made of a lot of integer/scalar instructions.

A ray tracer, for example, does a lot of number-crunching, but it also executes a lot of regular instructions.

So, it's better to define a good ISA outside of the SIMD one.

Status: Offline

Karlos

Re: Vector abstraction for fun and profit.
Posted on 22-Jan-2023 9:14:12

[ #9 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4958
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@cdimauro

"Fun and profit" is something of a joke term for anything implemented as free and open software. What I want here is the simplest MVP implementation that can do useful stream processing. Raytracing isn't an ideal candidate for SIMD anyway. It's handy for representing some basic concepts like RGB tuples, vectors and points but as you say, it's a heavy mix of vector and scalar together. This "machine" is more about defining a simple, loopable block of SIMD operations that will rip through blocks of memory.

I'm not particularly bothered to have varying length vector sizes for this purpose, I just want something trivially mappable to real hardware for iteration zero.

As for directly accessing the state structure I don't really mind given you're going to be directly programming it at a machine level anyway. It's the only state you will see, anything else is hidden away.

I've made a start but the initial implementation is going to be a pure interpreter library on 68K to evaluate how useful it is. It may not be useful at all!

As you say it will need some scalar operations too. The bare minimum I've identified are things like loop counters and other simple control flow. All vector operations are 128 bit and have to be aligned. Think old school SIMD here. I'm not interested in overcomplicating the design to something unimplementable just yet.

Last edited by Karlos on 22-Jan-2023 at 09:23 AM.

_________________
Doing stupid things for fun...

Status: Offline

cdimauro

Re: Vector abstraction for fun and profit.
Posted on 22-Jan-2023 9:52:54

[ #10 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4438
From: Germany

@Karlos

Quote:

Karlos wrote:
@cdimauro

"Fun and profit" is something of a joke term for anything implemented as free and open software.

You can have profit with free/open stuff as well.
Quote:
What I want here is the simplest MVP implementation that can do useful stream processing. Raytracing isn't an ideal candidate for SIMD anyway. It's handy for representing some basic concepts like RGB tuples, vectors and points but as you say, it's a heavy mix of vector and scalar together. This "machine" is more about defining a simple, loopable block of SIMD operations that will rip through blocks of memory.

OK, got it.
Quote:
I'm not particularly bothered to have varying length vector sizes for this purpose, I just want something trivially mappable to real hardware for iteration zero.

I think that a length-agnostic vector ISA is even better from this PoV, because it allows you to generate proper instructions for the native SIMD unit, even unrolling the loops if it's a good fit for the specific ISA / microarchitecture.

Whereas in the simplest case you can just replace a vector instruction with the equivalent SIMD instruction plus the loop counter ones.
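
The length-agnostic idea can be sketched as a strip-mined loop in which the program never names a width. This is a plain Python model, not real SIMD: `hw_lanes` is a hypothetical stand-in for whatever width the host reports at run time, and the inner `for` loop represents what would be a single native vector instruction.

```python
def vadd8_agnostic(dst, a, b, hw_lanes):
    """Add two byte buffers without the program ever naming a vector width.

    hw_lanes stands in for whatever the host reports at run time
    (e.g. 16 for a 128-bit unit, 32 for a 256-bit one).
    """
    n = len(a)
    i = 0
    while i < n:
        vl = min(hw_lanes, n - i)          # lanes handled this iteration
        for k in range(i, i + vl):         # one native SIMD op on real hardware
            dst[k] = (a[k] + b[k]) & 0xFF
        i += vl
    return dst

a = bytes(range(40))
b = bytes([200] * 40)
narrow = vadd8_agnostic(bytearray(40), a, b, hw_lanes=16)
wide = vadd8_agnostic(bytearray(40), a, b, hw_lanes=32)
```

The same routine gives identical results at either width, including the partial final step, which is what lets a JIT pick the width (or unroll) per target.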
Quote:
I've made a start but the initial implementation is going to be a pure interpreter library on 68K to evaluate how useful it is. It may not be useful at all!

You can try with some typical vector / SIMD code, like BLAS/LINPACK routines, or image processing kernels.
Quote:
As you say it will need some scalar operations too. The bare minimum I've identified are things like loop counters and other simple control flow.

That's enough for starting with a minimal implementation.
Quote:
All vector operations are 128 bit

See above: if you expose this detail then you'll make the same mistake that Gunnar made with his 68080, where he fixed the vector width to 64-bit.
Quote:
and have to be aligned.

Which is another big problem with SIMDs: it's not always the case and you require boilerplate code to handle misaligned memory accesses.

That's something which could (should, IMO) be hidden inside the implementation, when you have to JIT the code for the specific microarchitecture.
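
One common way to hide misalignment inside the implementation is to split each buffer into a scalar head, an aligned vector body and a scalar tail. The helper below is a hypothetical sketch of that bookkeeping only, assuming 16-byte vectors; a real JIT might instead use the target's unaligned load instructions where they exist.

```python
def split_for_alignment(addr, length, width=16):
    """Split [addr, addr+length) into scalar head, aligned vector body, scalar tail.

    Returns three (start, size) pairs; the body start is a multiple of width
    and its size is a whole number of vectors.
    """
    head = min((-addr) % width, length)     # bytes up to the next boundary
    body = (length - head) // width * width
    tail = length - head - body
    return (addr, head), (addr + head, body), (addr + head + body, tail)

parts = split_for_alignment(5, 50)
```

For a 50-byte buffer starting at address 5, this yields an 11-byte head, a 32-byte body starting at address 16, and a 7-byte tail.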
Quote:
Think old school SIMD here.

I see.
Quote:
I'm not interested in overcomplicating the design to something unimplementable just yet.

I don't think that a length-agnostic vector ISA is necessarily an overcomplication (unless you take the RISC-V vector extension as a reference).

IMO it's the opposite: it allows you to define a simple ISA and delegate the costs to the specific JIT / target microarchitecture.

Anyway, it's up to you to define your ISA: I'm just expressing my opinion. :)

Status: Offline

Karlos

Re: Vector abstraction for fun and profit.
Posted on 22-Jan-2023 10:08:01

[ #11 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4958
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@cdimauro

I take the point on vector length but I chose 128 bit as a starting point for a few reasons. As an extension to a 32-bit ISA, it only supports (u)int8-32 and float32. I didn't add float64 yet. On the fence about it.
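
To make the element types concrete, here is a sketch of how one 128-bit register can be viewed as lanes of each type listed above. The `lanes` helper and the type names are assumptions for illustration; big-endian byte order is used because that is the 68K convention.

```python
import struct

REGISTER_BYTES = 16  # 128 bits

# struct format codes for the element types named in the post
ELEMENT_FORMATS = {"int8": "b", "uint8": "B", "int16": "h", "uint16": "H",
                   "int32": "i", "uint32": "I", "float32": "f"}

def lanes(register: bytes, element: str):
    """Reinterpret 16 raw bytes as a tuple of big-endian elements (68K order)."""
    code = ELEMENT_FORMATS[element]
    count = REGISTER_BYTES // struct.calcsize(">" + code)
    return struct.unpack(f">{count}{code}", register)

reg = struct.pack(">4f", 1.0, 2.0, 0.5, -4.0)
as_floats = lanes(reg, "float32")
as_words = lanes(reg, "uint32")
```

The same 16 bytes give 16, 8 or 4 lanes depending on element size, so adding float64 later would mean two lanes per register without changing the register width.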

Going arbitrary length/alignment may be for V2. I have to prove v1 has legs first.

One challenge already is breaking out of the emulator to native code. I don't think that Mr Wilen is particularly happy about maintaining support for it in UAE and I've no idea yet if Emu68 even supports it.

_________________
Doing stupid things for fun...

Status: Offline

cdimauro

Re: Vector abstraction for fun and profit.
Posted on 22-Jan-2023 10:44:41

[ #12 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4438
From: Germany

@Karlos

Quote:

Karlos wrote:
@cdimauro

I take the point on vector length but I chose 128 bit as a starting point for a few reasons. As an extension to a 32-bit ISA, it only supports (u)int8-32 and float32. I didn't add float64 yet. On the fence about it.

OK, but vector register size and supported data types are different / independent.

You can initially avoid implementing float64, whatever the register size (see Intel's SSE extension: double support arrived with SSE2).
Quote:
Going arbitrary length/alignment may be for V2. I have to prove v1 has legs first.

Makes sense. It's a hobby project, in the end.
Quote:
One challenge already is breaking out of the emulator to native code. I don't think that Mr Wilen is particularly happy about maintaining support for it in UAE

Maybe I'm misremembering, but AFAIK there should already be support for calling native code from UAE.
Quote:
and I've no idea yet if Emu68 even supports it.

Same for me. However it could be useful to implement offloading of some tasks (some datatypes, for example).

Status: Offline

Karlos

Re: Vector abstraction for fun and profit.
Posted on 22-Jan-2023 13:04:24

[ #13 ]

Elite Member

Joined: 24-Aug-2003
Posts: 4958
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!

@cdimauro

Sure, I understand the difference between vector size and element type :) I also want it to work on PPC/Altivec, so there's something of a minimal viable intersection to consider too. I don't know if it will actually end up viable; it may turn out that the processing model is just too inflexible. We'll see.

In any event implementing an even remotely efficient base interpreter for 68040+ is going to be fun for its own sake.

As for breaking out into native, I've found the sources to Wazp3D, which can call the native OpenGL, so I'm sure the secret sauce is in there somewhere. Given the performance of the mc64k interpreter, it might be acceptably fast compared with a pure scalar 68K implementation even if the native code isn't particularly optimal.