Karlos

Re: MC64K - Imaginary 64-bit 680x0
Posted on 19-Oct-2022 11:44:13 [ #141 ]

Elite Member
Joined: 24-Aug-2003  Posts: 4405
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!
For the CPU nerds here, the 770 MIPS rating for add.q r1, r0 can be broken down:

The code uses a switch-case interpreter loop with some optimisations. One of those is that the MC64K program counter is globally allocated in r12, and a goto short-circuit avoids checking the machine's "status register" after any instruction that can't push it into a failure state, e.g. this basic add.
The instruction being tested here is "fast path", meaning that it consists of the opcode enumeration followed directly by a byte that packs the destination and source registers as nybbles (destination first). For the compilation options used, the generated x64 code comes to 18 instructions for the complete fetch, decode and execute cycle:
.skip_status_check:
.L1901:
    # switch (*puProgramCounter++) {
    movq    %r12, %rax
    cmpb    $-16, (%rax)
    leaq    1(%r12), %r12
    ja      .L1655                  # default: case
    movzbl  (%rax), %edx
    movslq  (%r14,%rdx,4), %rcx
    addq    %r14, %rcx              # r14 holds the switch/case jump table location
    jmp     *%rcx

    [ opcode switch-case jump table here ]

    # case Opcode::R2R_ADD_Q: {
.L1760:
    # readRegPair(); // uint8 uRegPair = *puProgramCounter++
    movzbl  1(%rax), %r11d          # MEM[(const uint8 *)puProgramCounter.866_2 + 1B], uRegPair
    movq    _ZN5MC64K7Machine11Interpreter5aoGPRE@GOTPCREL(%rip), %rsi # tmp2090
    # dstGPRQuad() += srcGPRQuad(); // aoGPR[uRegPair & 0x0F].iQuad += aoGPR[uRegPair >> 4].iQuad
    movq    %r11, %rcx              # uRegPair, _337
    shrb    $4, %r11b               # tmp2093
    andl    $15, %r11d              # tmp2095
    leaq    2(%rax), %r12           # puProgramCounter
    movq    (%rsi,%r11,8), %rax     # aoGPR[_340].iQuad, tmp2101
    andl    $15, %ecx               # _337
    addq    %rax, (%rsi,%rcx,8)     # tmp2101, aoGPR[_337].iQuad
    # goto skip_status_check; }
    jmp     .L1901                  # rinse and repeat
Thus, 8 instructions to fetch the opcode and branch to the appropriate handler, followed by 10 more to perform the operation.

What this means is that in order to reach 770 virtual MIPS here, the host CPU was executing 13860 MIPS for the above code. This is a slight underestimate, because the VM code was only loop-unrolled 10x, so every 10th iteration took a slightly longer path thanks to the dbnz at the end.
Ignoring this, the CPU tops out at 2.7 GHz so 13860 / 2700 gives ~5.13 instructions per cycle throughput (single thread).
_________________
Doing stupid things for fun...
Karlos

Re: MC64K - Imaginary 64-bit 680x0
Posted on 19-Oct-2022 19:45:17 [ #142 ]
I really must put some effort into a JIT. Even if it hits only 80% of native scalar performance, that would be up to 10,000 MIPS for something approximating 64-bit 68K assembler to play with.
cdimauro

Re: MC64K - Imaginary 64-bit 680x0
Posted on 19-Oct-2022 22:14:52 [ #143 ]

Elite Member
Joined: 29-Oct-2012  Posts: 3650
From: Germany

@Karlos
Quote:
Karlos wrote: @cdimauro
I need to put some effort into optimising the EA decode path. It's around 5x slower than the R2R fast path implementation:
Loading object file as host 'Standard Test Host'
Linking 2 exported symbols...
    Matched 0 0x56148cf02cb0 [--x] main
    Matched 1 0x56148cf02d14 [--x] exit
Runtime: Executable instance loaded at 0x56148cf038d0 for binary 'test_projects/bench/bin/bench.64x'
Stack of 256 allocated at 0x56148cf01bd0 ... 0x56148cf01cd0
Beginning run at PC:0x56148cf02cb0...

Benchmarking 10x unrolled using 80000000 iterations
Loop Calibration took: 375582962 nanoseconds 213.0022 MIPS
Baseline: add.q r1, r0 took: 1038852816 nanoseconds 770.0802 MIPS 1.0000 relative
Benchmarking: add.q r1, (r10) took: 5059109489 nanoseconds 158.1306 MIPS 4.8699 relative
Benchmarking: add.q r1, (r10)+/- took: 5225708577 nanoseconds 153.0893 MIPS 5.0303 relative
Benchmarking: add.q r1, +/-(r10) took: 5285572311 nanoseconds 151.3554 MIPS 5.0879 relative
Benchmarking: add.q r1, 8(r10) took: 4999233568 nanoseconds 160.0245 MIPS 4.8123 relative
Benchmarking: add.q (r11), (r10) took: 5080120650 nanoseconds 157.4766 MIPS 4.8901 relative
Benchmarking: add.q r1, label took: 6031763240 nanoseconds 132.6312 MIPS 5.8062 relative
Benchmarking: add.q #1, r0 took: 5797901749 nanoseconds 137.9810 MIPS 5.5811 relative
Benchmarking: biz.q r0, label (when taken) took: 2967526205 nanoseconds 269.5848 MIPS 2.8565 relative
Benchmarking: biz.q r0, label (when not taken) took: 2603414498 nanoseconds 307.2888 MIPS 2.5060 relative
Benchmarking: bsr/rts (round trip) took: 4164010331 nanoseconds 192.1225 MIPS 4.0083 relative
Benchmarking: bsr/rts (round trip, stack misaligned) took: 4047076063 nanoseconds 197.6736 MIPS 3.8957 relative
Benchmarking: bsr.b/rts (round trip, short negative displacement) took: 3927501483 nanoseconds 203.6918 MIPS 3.7806 relative
Benchmarking: bsr.b/rts (round trip, short negative displacement, stack misaligned) took: 4129881186 nanoseconds 193.7102 MIPS 3.9754 relative
Benchmarking: hcf #0, #0 (no op vector) took: 4003440809 nanoseconds 199.8281 MIPS 3.8537 relative
Benchmarking: link r5, #-64/unlk r5 (round trip) took: 4623583909 nanoseconds 173.0260 MIPS 4.4507 relative
(Tested on a mobile i7-7500, 2.7 GHz) |

I took a look at the code, and it's hard to do much better.
Maybe you can arrange this part a little bit differently: https://github.com/IntuitionAmiga/MC64000/blob/main/core/src/cpp/machine/interpreter_ea.cpp
initDisplacement();
uint8 uEffectiveAddress = *puProgramCounter++;
uint8 uEALower = uEffectiveAddress & 0x0F; // Lower nybble varies, usually a register.

// Switch based on the mode
switch (uEffectiveAddress >> 4) {

to:

uint8 uEffectiveAddress = *puProgramCounter++;
uint8 uEAMode  = uEffectiveAddress >> 4;
uint8 uEALower = uEffectiveAddress & 0x0F; // Lower nybble varies, usually a register.

// Switch based on the mode
switch (uEAMode) {

So: remove initDisplacement() (what was its purpose? Is it always used?) and interleave some instructions, so that the EA mode is already... ready when it's needed.
Quote:
Karlos wrote: For the CPU nerds here, the 770 MIPS rating for add.q r1, r0 can be broken down:
The code uses a switch case interpreter loop with some optimisations. One of those is that we globally allocate the MC64K's program counter in r12 and it uses a goto short circuit to avoid checking the "status register" of the machine for any instruction that can't push it into a failure state, e.g. this basic add. |
Makes sense. This trick is used on other VMs as well. Quote:
The instruction being tested here is "fast path" meaning that it consists of the opcode enumeration followed directly by a byte that packs the destination and source as nybbles (destination first). For the compilation options used, the x64 code generated results in 18 x64 instructions for the complete fetch, decode and execute cycle:
[ ...assembly listing snipped; see above... ]
Thus, 8 instructions to fetch the opcode and branch to the appropriate handler, followed by 10 more instructions necessary to perform the operation. |
Unfortunately the compiler isn't able to generate optimized code even for switch/cases like that, which are very common.

5 instructions should be enough for the main loop here. And maybe a couple of registers aren't needed. Plus, and that's even worse, the jump table could be better optimized. Quote:
What this means is that in order to reach 770 virtual MIPS here, the host cpu was hitting 13860 MIPS for the above code. This is a slight underestimation because the VM code was only loop unrolled 10x, so every 10th iteration a slightly longer path was taken since there's a dbnz to deal with.
Ignoring this, the CPU tops out at 2.7 GHz so 13860 / 2700 gives ~5.13 instructions per cycle throughput (single thread). |
Which is a very good result.
Quote:
Karlos wrote: I really must put some effort into a JIT. Even if it's only 80% of native scalar performance that'd be up to 10,000 MIPS for something approximating 64-bit 68K assembler to play with. |
Indeed. But it's much more difficult and requires a lot of work. You may take a look at Michal's Emu68 as a starting point. |
Karlos

Re: MC64K - Imaginary 64-bit 680x0
Posted on 19-Oct-2022 22:37:17 [ #144 ]
cdimauro

Re: MC64K - Imaginary 64-bit 680x0
Posted on 20-Oct-2022 5:53:39 [ #145 ]

@Karlos
Quote:
OK, then nevermind. Quote:
Regarding the jump table the compiler emits for the switch case, the entries are all 32-bit but I'm sure none of the actual values are. Seems a bit wasteful for something that will end up in your L1 cache. |
Exactly: that's a huge waste on one of the most important caches.
I can't believe that this is still happening nowadays with a modern compiler. Quote:
Quote:
Indeed. But it's much more difficult and requires a lot of work. You may take a look at Michal's Emu68 as a starting point. |
Yeah but the incentive... |
IMO you should do something different, because you already have an ecosystem which is working.

What's more important now is a compiler backend for your architecture, so that you can compile regular C/C++ applications and get binaries for it. |
Karlos

Re: MC64K - Imaginary 64-bit 680x0
Posted on 20-Oct-2022 9:00:59 [ #146 ]

@cdimauro
The emitted jump table is even worse if you don't compile for PIC: then the entries are full 64-bit address slots.
Quote:
IMO you should do something different, because you already have an ecosystem which is working. |
To be fair, building a compiler front end for it wasn't really on the roadmap. That's not to say I won't consider it, but there's other fun stuff to do first.
What I do want to do is implement some virtual hardware for it. There's already a basic chunky display with a built-in beam racer (similar to the copper), but I want to make that truly asynchronous so that it can run on a second CPU core. Plus there's also the prospect of implementing some sound synthesis, again to run asynchronously. Ultimately the idea here is that your virtual "custom chips" ought to be able to make use of real spare computing power rather than competing with the virtual CPU for it.

Last edited by Karlos on 20-Oct-2022 at 09:33 AM.
Karlos

Re: MC64K - Imaginary 64-bit 680x0
Posted on 20-Oct-2022 15:46:05 [ #147 ]

@cdimauro
Quote:
Exactly: that's a huge waste on one of the most important caches. |
At least for GCC there is another option: https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
I've written a tiny proof of concept and can get a 16-bit jump table out of it. As ugly as it looks, this still compiles with --std=c++17 -Wall -Wextra -W
#include <cstdio>

int __attribute__((noinline)) test(unsigned char const* ops) {
    int x = 0;
    static short const handler[] = {
        (short)((char*)&&ret - (char*)&&begin),
        (short)((char*)&&inc - (char*)&&begin),
        (short)((char*)&&dec - (char*)&&begin),
        (short)((char*)&&rst - (char*)&&begin)
    };

    #define next() goto *((char*)&&begin + handler[*ops++]);
    next();

begin:
ret:
    std::puts("\tret");
    return x;
inc:
    std::puts("\tinc");
    x++;
    next();
dec:
    std::puts("\tdec");
    x--;
    next();
rst:
    std::puts("\trst");
    x = 0;
    next();
    return -1;
}

int main() {
    unsigned char code[] = { 2, 3, 1, 0 };
    int x = test(code);
    std::printf("x: %d\n", x);
    return 0;
}
Looking at the compiler output for this example at the "inc:" label, the entire opcode handler, plus the threaded branch to the next handler, is 5 instructions:

# jumptbl.cpp:21:     x++;
    incl    %r8d
# jumptbl.cpp:22:     next();
    movswq  (%rcx,%rax,2), %rax
    incq    %rdi
    addq    %rdx, %rax
    jmp     *%rax
Another advantage of this approach is that the generated jump table is 16 bits per entry regardless of whether or not you compile with -fPIC.

Last edited by Karlos on 20-Oct-2022 at 03:47 PM.
cdimauro

Re: MC64K - Imaginary 64-bit 680x0
Posted on 21-Oct-2022 6:22:30 [ #148 ]

@Karlos
Quote:
Karlos wrote: @cdimauro
The emitted jump table is even worse if you don't compile for PIC. Then the entries are full 64-bit address slots. |
Shocking. It's hard to believe how inefficient a compiler can still be with such common (and important) cases. Quote:
Quote:
IMO you should do something different, because you already have an ecosystem which is working. |
To be fair, building a compiler front for it wasn't really on the roadmap. That's not to say I won't consider it but there's other fun stuff to do first.
What I do want to do is implement some virtual hardware for it. There's already a basic chunky display with built-in beam racer (similar to the copper) but I want to make that truly asynchronous so that it can run on a second CPU core. Plus there's also the prospect of implementing some sound synthesis, again to run asynchronously. Ultimately the idea here is that your virtual "custom chips" ought to be able to make use of real spare computing power rather than competing with the virtual CPU for it. |
Got it. Then... have fun with the JIT: it's also a very interesting project to work on, and very satisfying once the results arrive. |
cdimauro

Re: MC64K - Imaginary 64-bit 680x0
Posted on 21-Oct-2022 6:24:51 [ #149 ]

@Karlos
Quote:
Karlos wrote: @cdimauro
Quote:
Exactly: that's a huge waste on one of the most important caches. |
At least for GCC there is another option: https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
I've written a tiny proof of concept and can get a 16-bit jump table out of it. As ugly as it looks, this still compiles with --std=c++17 -Wall -Wextra -W
[ ...code snipped; see above... ]
Another advantage of this approach is that the generated jump table is 16 bits regardless of whether or not you compile with -fPIC. |
That's The Way! 5 instructions, as I was expecting.

The only mess is building the handler table, but a macro could help.

I assume that this is the next change to your VM.

The only problem is that it's not portable, since it's GCC-only. |
Karlos

Re: MC64K - Imaginary 64-bit 680x0
Posted on 21-Oct-2022 7:46:55 [ #150 ]

@cdimauro
LLVM supports labels as values too, and those are the only compilers I'm targeting for now.

I probably won't implement the handlers exactly as demonstrated here, since this threaded-dispatch approach puts the computed goto in every handler, increasing the length of each. What we save in the jump table we lose again in duplicated code.

So I think it may be better to have a central point in the code that is unconditionally branched to, where the next jump location is calculated.

Obviously all these details can be hidden by macros, so I can test both approaches without having to rewrite a ton of code.

Last edited by Karlos on 21-Oct-2022 at 08:00 AM.
Karlos

Re: MC64K - Imaginary 64-bit 680x0
Posted on 21-Oct-2022 20:18:05 [ #151 ]
Well, I did that thing...

The runtime can now be compiled with -DINTERPRETER_CUSTOM to select the custom jump table rather than the standard switch case, and additionally with -DTHREADED_DISPATCH to embed the next opcode decode/jump onto the tail of each handler.

Under equivalent conditions, the previous fast switch/case baseline for add.q r1, r0 was ~740 MIPS. Using the custom jump table this increased to ~830 MIPS. Enabling the threaded dispatch increased it to ~915 MIPS. I had hoped to break 1000, but I guess it's not quite tuned enough ;)

I should be able to turn the body of the interpreter into an include that satisfies both the switch-case and the custom jump table versions, as it relies on macros to define the entry point and exit method for each handler. I don't really like having this much duplication.

Last edited by Karlos on 21-Oct-2022 at 08:29 PM.
cdimauro

Re: MC64K - Imaginary 64-bit 680x0
Posted on 22-Oct-2022 5:41:10 [ #152 ]

@Karlos
Quote:
Karlos wrote: @cdimauro
LLVM supports label as value too, and those are the only compilers I'm targeting for now. |
Which should be enough: they cover the majority of platforms. Quote:
Karlos wrote: Well I did that thing...
The runtime can be compiled with -DINTERPRETER_CUSTOM to select the custom jump table rather than the standard switch case and further use -DTHREADED_DISPATCH to optionally embed the next opcode decode/jump onto the tail of the handler code.
Under equivalent conditions, the previous fast switch/case baseline for add.q r1, r0 was ~740 MIPS. Using the custom jump table this increased to ~830. Enabling the threaded dispatch increased to ~915. I had hoped to break 1000, but I guess it's not quite tuned enough ;) |
That's already a very good gain. You can't work miracles here. Quote:
I should be able to turn the body of the interpreter into an include that satisfies both the switch case and the custom jump table version as it relies on macros to define the entry point and exit method for each handler. I don't really like having this much duplication. |
Makes sense. |