Karlos

Re: MC64K - Imaginary 64-bit 680x0
Posted on 19-Oct-2022 11:44:13 [ #141 ]

Elite Member
Joined: 24-Aug-2003  Posts: 4405
From: As-sassin-aaate! As-sassin-aaate! Ooh! We forgot the ammunition!
For the CPU nerds here, the 770 MIPS rating for add.q r1, r0 can be broken down:

The code uses a switch-case interpreter loop with some optimisations. One of those is that the MC64K program counter is globally allocated in r12, and a goto short-circuit avoids checking the machine's "status register" after any instruction that can't push it into a failure state, e.g. this basic add.
The instruction being tested here is "fast path", meaning that it consists of the opcode enumeration followed directly by a byte that packs the destination and source registers as nybbles (destination first). For the compilation options used, the generated x64 code comes to 18 instructions for the complete fetch, decode and execute cycle:
.skip_status_check:
.L1901:
    # switch (*puProgramCounter++) {
    movq    %r12, %rax
    cmpb    $-16, (%rax)
    leaq    1(%r12), %r12
    ja      .L1655                  # default: case
    movzbl  (%rax), %edx
    movslq  (%r14,%rdx,4), %rcx
    addq    %r14, %rcx              # r14 holds the switch/case jump table location
    jmp     *%rcx

    [ opcode switch-case jump table here ]

    # case Opcode::R2R_ADD_Q: {
.L1760:
    # readRegPair(); // uint8 uRegPair = *puProgramCounter++
    movzbl  1(%rax), %r11d          # MEM[(const uint8 *)puProgramCounter.866_2 + 1B], uRegPair
    movq    _ZN5MC64K7Machine11Interpreter5aoGPRE@GOTPCREL(%rip), %rsi # tmp2090
    # dstGPRQuad() += srcGPRQuad(); // aoGPR[uRegPair & 0x0F].iQuad += aoGPR[uRegPair >> 4].iQuad
    movq    %r11, %rcx              # uRegPair, _337
    shrb    $4, %r11b               # tmp2093
    andl    $15, %r11d              # tmp2095
    leaq    2(%rax), %r12           # puProgramCounter
    movq    (%rsi,%r11,8), %rax     # aoGPR[_340].iQuad, tmp2101
    andl    $15, %ecx               # _337
    addq    %rax, (%rsi,%rcx,8)     # tmp2101, aoGPR[_337].iQuad
    # goto skip_status_check; }
    jmp     .L1901                  # rinse and repeat
Thus, 8 instructions to fetch the opcode and branch to the appropriate handler, followed by 10 more to perform the operation.

What this means is that in order to reach 770 virtual MIPS here, the host CPU was executing 13860 MIPS for the above code. This is a slight underestimate, because the VM code was only loop-unrolled 10x, so every 10th iteration took a slightly longer path thanks to the dbnz at the end.
Ignoring this, the CPU tops out at 2.7 GHz so 13860 / 2700 gives ~5.13 instructions per cycle throughput (single thread).
_________________
Doing stupid things for fun...
Karlos

Re: MC64K - Imaginary 64-bit 680x0
Posted on 19-Oct-2022 19:45:17 [ #142 ]
I really must put some effort into a JIT. Even if it hits only 80% of native scalar performance, that would be up to 10,000 MIPS for something approximating 64-bit 68K assembler to play with.
cdimauro

Re: MC64K - Imaginary 64-bit 680x0
Posted on 19-Oct-2022 22:14:52 [ #143 ]

Elite Member
Joined: 29-Oct-2012  Posts: 3650
From: Germany

@Karlos
Quote:
Karlos wrote: @cdimauro
I need to put some effort into optimising the EA decode path. It's around 5x slower than the R2R fast path implementation:
Loading object file as host 'Standard Test Host'
Linking 2 exported symbols...
    Matched 0 0x56148cf02cb0 [--x] main
    Matched 1 0x56148cf02d14 [--x] exit
Runtime: Executable instance loaded at 0x56148cf038d0 for binary 'test_projects/bench/bin/bench.64x'
Stack of 256 allocated at 0x56148cf01bd0 ... 0x56148cf01cd0
Beginning run at PC:0x56148cf02cb0...

Benchmarking 10x unrolled using 80000000 iterations
Loop Calibration took: 375582962 nanoseconds 213.0022 MIPS
Baseline: add.q r1, r0 took: 1038852816 nanoseconds 770.0802 MIPS 1.0000 relative
Benchmarking: add.q r1, (r10) took: 5059109489 nanoseconds 158.1306 MIPS 4.8699 relative
Benchmarking: add.q r1, (r10)+/- took: 5225708577 nanoseconds 153.0893 MIPS 5.0303 relative
Benchmarking: add.q r1, +/-(r10) took: 5285572311 nanoseconds 151.3554 MIPS 5.0879 relative
Benchmarking: add.q r1, 8(r10) took: 4999233568 nanoseconds 160.0245 MIPS 4.8123 relative
Benchmarking: add.q (r11), (r10) took: 5080120650 nanoseconds 157.4766 MIPS 4.8901 relative
Benchmarking: add.q r1, label took: 6031763240 nanoseconds 132.6312 MIPS 5.8062 relative
Benchmarking: add.q #1, r0 took: 5797901749 nanoseconds 137.9810 MIPS 5.5811 relative
Benchmarking: biz.q r0, label (when taken) took: 2967526205 nanoseconds 269.5848 MIPS 2.8565 relative
Benchmarking: biz.q r0, label (when not taken) took: 2603414498 nanoseconds 307.2888 MIPS 2.5060 relative
Benchmarking: bsr/rts (round trip) took: 4164010331 nanoseconds 192.1225 MIPS 4.0083 relative
Benchmarking: bsr/rts (round trip, stack misaligned) took: 4047076063 nanoseconds 197.6736 MIPS 3.8957 relative
Benchmarking: bsr.b/rts (round trip, short negative displacement) took: 3927501483 nanoseconds 203.6918 MIPS 3.7806 relative
Benchmarking: bsr.b/rts (round trip, short negative displacement, stack misaligned) took: 4129881186 nanoseconds 193.7102 MIPS 3.9754 relative
Benchmarking: hcf #0, #0 (no op vector) took: 4003440809 nanoseconds 199.8281 MIPS 3.8537 relative
Benchmarking: link r5, #-64/unlk r5 (round trip) took: 4623583909 nanoseconds 173.0260 MIPS 4.4507 relative
(Tested on a mobile i7-7500, 2.7 GHz) |

I took a look at the code, and it's hard to do much better.
Maybe you can arrange this part a little bit differently: https://github.com/IntuitionAmiga/MC64000/blob/main/core/src/cpp/machine/interpreter_ea.cpp
initDisplacement();
uint8 uEffectiveAddress = *puProgramCounter++;
uint8 uEALower = uEffectiveAddress & 0x0F; // Lower nybble varies, usually a register.

// Switch based on the mode
switch (uEffectiveAddress >> 4) {

to:

uint8 uEffectiveAddress = *puProgramCounter++;
uint8 uEAMode  = uEffectiveAddress >> 4;
uint8 uEALower = uEffectiveAddress & 0x0F; // Lower nybble varies, usually a register.

// Switch based on the mode
switch (uEAMode) {

So: remove initDisplacement() (what was its purpose? Is it always used?) and interleave some instructions, so that the EA mode is already... ready when it's needed.
Quote:
Karlos wrote: For the CPU nerds here, the 770 MIPS rating for add.q r1, r0 can be broken down:
The code uses a switch case interpreter loop with some optimisations. One of those is that we globally allocate the MC64K's program counter in r12 and it uses a goto short circuit to avoid checking the "status register" of the machine for any instruction that can't push it into a failure state, e.g. this basic add. |
Makes sense. This trick is used on other VMs as well. Quote:
The instruction being tested here is "fast path" meaning that it consists of the opcode enumeration followed directly by a byte that packs the destination and source as nybbles (destination first). For the compilation options used, the x64 code generated results in 18 x64 instructions for the complete fetch, decode and execute cycle:
[ ...assembly listing snipped; see above... ]
Thus, 8 instructions to fetch the opcode and branch to the appropriate handler, followed by 10 more instructions necessary to perform the operation. |
Unfortunately the compiler isn't able to generate optimized code even for switch/cases like that, which are very common.

5 instructions should be enough for the main loop here. And maybe a couple of registers aren't needed. Plus, and that's even worse, the jump table could be better optimized. Quote:
What this means is that in order to reach 770 virtual MIPS here, the host cpu was hitting 13860 MIPS for the above code. This is a slight underestimation because the VM code was only loop unrolled 10x, so every 10th iteration a slightly longer path was taken since there's a dbnz to deal with.
Ignoring this, the CPU tops out at 2.7 GHz so 13860 / 2700 gives ~5.13 instructions per cycle throughput (single thread). |
Which is a very good result.
Quote:
Karlos wrote: I really must put some effort into a JIT. Even if it's only 80% of native scalar performance that'd be up to 10,000 MIPS for something approximating 64-bit 68K assembler to play with. |
Indeed. But it's much more difficult and requires a lot of work. You may take a look at Michal's Emu68 as a starting point. |
Karlos

Re: MC64K - Imaginary 64-bit 680x0
Posted on 19-Oct-2022 22:37:17 [ #144 ]
cdimauro

Re: MC64K - Imaginary 64-bit 680x0
Posted on 20-Oct-2022 5:53:39 [ #145 ]

@Karlos
Quote:
OK, then nevermind. Quote:
Regarding the jump table the compiler emits for the switch case, the entries are all 32-bit but I'm sure none of the actual values are. Seems a bit wasteful for something that will end up in your L1 cache. |
Exactly: that's a huge waste on one of the most important caches.
I can't believe that this is still happening nowadays with a modern compiler. Quote:
Quote:
Indeed. But it's much more difficult and requires a lot of work. You may take a look at Michal's Emu68 as a starting point. |
Yeah but the incentive... |
IMO you should do something different, because you already have an ecosystem which is working.

What's more important now is a compiler backend for your architecture, so that you can compile regular C/C++ applications and get binaries for it. |
Karlos

Re: MC64K - Imaginary 64-bit 680x0
Posted on 20-Oct-2022 9:00:59 [ #146 ]

@cdimauro
The emitted jump table is even worse if you don't compile for PIC: then the entries are full 64-bit address slots.
Quote:
IMO you should do something different, because you already have an ecosystem which is working. |
To be fair, building a compiler front end for it wasn't really on the roadmap. That's not to say I won't consider it, but there's other fun stuff to do first.
What I do want to do is implement some virtual hardware for it. There's already a basic chunky display with a built-in beam racer (similar to the copper), but I want to make that truly asynchronous so that it can run on a second CPU core. Plus there's also the prospect of implementing some sound synthesis, again to run asynchronously. Ultimately the idea here is that your virtual "custom chips" ought to be able to make use of real spare computing power rather than competing with the virtual CPU for it.

Last edited by Karlos on 20-Oct-2022 at 09:33 AM.
Karlos

Re: MC64K - Imaginary 64-bit 680x0
Posted on 20-Oct-2022 15:46:05 [ #147 ]

@cdimauro
Quote:
Exactly: that's a huge waste on one of the most important caches. |
At least for GCC there is another option: https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
I've written a tiny proof of concept and can get a 16-bit jump table out of it. As ugly as it looks, this still compiles with --std=c++17 -Wall -Wextra -W
#include <cstdio>

int __attribute__((noinline)) test(unsigned char const* ops) {
    int x = 0;
    static short const handler[] = {
        (short)((char*)&&ret - (char*)&&begin),
        (short)((char*)&&inc - (char*)&&begin),
        (short)((char*)&&dec - (char*)&&begin),
        (short)((char*)&&rst - (char*)&&begin)
    };

    #define next() goto *((char*)&&begin + handler[*ops++]);
    next();

begin:
ret:
    std::puts("\tret");
    return x;
inc:
    std::puts("\tinc");
    x++;
    next();
dec:
    std::puts("\tdec");
    x--;
    next();
rst:
    std::puts("\trst");
    x = 0;
    next();
    return -1;
}

int main() {
    unsigned char code[] = { 2, 3, 1, 0 };
    int x = test(code);
    std::printf("x: %d\n", x);
    return 0;
}
Looking at the compiler output for this example at the "inc:" label, the entire opcode handler, plus the threaded branch to the next handler, is 5 instructions:

# jumptbl.cpp:21:     x++;
    incl    %r8d
# jumptbl.cpp:22:     next();
    movswq  (%rcx,%rax,2), %rax
    incq    %rdi
    addq    %rdx, %rax
    jmp     *%rax
Another advantage of this approach is that the generated jump table is 16 bits per entry regardless of whether or not you compile with -fPIC.

Last edited by Karlos on 20-Oct-2022 at 03:47 PM.
cdimauro

Re: MC64K - Imaginary 64-bit 680x0
Posted on 21-Oct-2022 6:22:30 [ #148 ]

@Karlos
Quote:
Karlos wrote: @cdimauro
The emitted jump table is even worse if you don't compile for PIC. Then the entries are full 64-bit address slots. |
Shocking. It's hard to believe how inefficient a compiler can still be with such common (and important) cases. Quote:
Quote:
IMO you should do something different, because you already have an ecosystem which is working. |
To be fair, building a compiler front for it wasn't really on the roadmap. That's not to say I won't consider it but there's other fun stuff to do first.
What I do want to do is implement some virtual hardware for it. There's already a basic chunky display with built-in beam racer (similar to the copper) but I want to make that truly asynchronous so that it can run on a second CPU core. Plus there's also the prospect of implementing some sound synthesis, again to run asynchronously. Ultimately the idea here is that your virtual "custom chips" ought to be able to make use of real spare computing power rather than competing with the virtual CPU for it. |
Got it. Then... have fun with the JIT: it's also a very interesting project to work on, and very satisfying once the results arrive. |
cdimauro

Re: MC64K - Imaginary 64-bit 680x0
Posted on 21-Oct-2022 6:24:51 [ #149 ]

@Karlos
Quote:
Karlos wrote: @cdimauro
Quote:
Exactly: that's a huge waste on one of the most important caches. |
At least for GCC there is another option: https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
I've written a tiny proof of concept and can get a 16-bit jump table out of it. As ugly as it looks, this still compiles with --std=c++17 -Wall -Wextra -W
[ ...code snipped; see above... ]
Another advantage of this approach is that the generated jump table is 16 bits regardless of whether or not you compile with -fPIC. |
That's The Way! 5 instructions, as I was expecting.

The only mess is building the handler table, but a macro could help.

I assume that this is the next change to your VM.

The only problem is that it's not portable, since it's GCC-only. |
Karlos

Re: MC64K - Imaginary 64-bit 680x0
Posted on 21-Oct-2022 7:46:55 [ #150 ]

@cdimauro
LLVM supports labels as values too, and those are the only compilers I'm targeting for now.

I probably won't implement the handlers exactly as demonstrated here, since this threaded-dispatch approach puts the computed goto in every handler, increasing the length of each. What we save in the jump table we lose again in duplicated code.

So I think it may be better to have a central point in the code that is unconditionally branched to, where the next jump location is calculated.

Obviously all these details can be hidden by macros, so I can test both approaches without having to rewrite a ton of code.

Last edited by Karlos on 21-Oct-2022 at 08:00 AM.
Karlos

Re: MC64K - Imaginary 64-bit 680x0
Posted on 21-Oct-2022 20:18:05 [ #151 ]
Well, I did that thing...

The runtime can now be compiled with -DINTERPRETER_CUSTOM to select the custom jump table rather than the standard switch case, and additionally with -DTHREADED_DISPATCH to embed the next opcode decode/jump onto the tail of each handler.

Under equivalent conditions, the previous fast switch/case baseline for add.q r1, r0 was ~740 MIPS. Using the custom jump table this increased to ~830 MIPS. Enabling the threaded dispatch increased it to ~915 MIPS. I had hoped to break 1000, but I guess it's not quite tuned enough ;)

I should be able to turn the body of the interpreter into an include that satisfies both the switch-case and the custom jump table versions, as it relies on macros to define the entry point and exit method for each handler. I don't really like having this much duplication.

Last edited by Karlos on 21-Oct-2022 at 08:29 PM.
cdimauro

Re: MC64K - Imaginary 64-bit 680x0
Posted on 22-Oct-2022 5:41:10 [ #152 ]

@Karlos
Quote:
Karlos wrote: @cdimauro
LLVM supports label as value too, and those are the only compilers I'm targeting for now. |
Which should be enough: they cover the majority of platforms. Quote:
Karlos wrote: Well I did that thing...
The runtime can be compiled with -DINTERPRETER_CUSTOM to select the custom jump table rather than the standard switch case and further use -DTHREADED_DISPATCH to optionally embed the next opcode decode/jump onto the tail of the handler code.
Under equivalent conditions, the previous fast switch/case baseline for add.q r1, r0 was ~740 MIPS. Using the custom jump table this increased to ~830. Enabling the threaded dispatch increased to ~915. I had hoped to break 1000, but I guess it's not quite tuned enough ;) |
That's already a very good gain. You can't work miracles here. Quote:
I should be able to turn the body of the interpreter into an include that satisfies both the switch case and the custom jump table version as it relies on macros to define the entry point and exit method for each handler. I don't really like having this much duplication. |
Makes sense. |