Amigaworld.net - The Amiga Computer Community Portal Website

Productivity Amiga Emulation

Wanderer

Re: Productivity Amiga Emulation
Posted on 20-Jul-2015 13:25:38

[ #61 ]

Cult Member

Joined: 16-Aug-2008
Posts: 654
From: Germany

@fishy_fis

OS4 and MOS run only on exotic hardware, not on a machine from Best Buy, which I have anyway.
Amithlon is great, but it does not run on current hardware (maybe in VESA mode, without sound and network, if you are lucky). And with it I don't have the host in the background for the usual stuff AmigaOS sucks at.

_________________
--
Author of
HD-Rec, Sweeper, Samplemanager, ArTKanoid, Monkeyscript, Toadies, AsteroidsTR, TuiTED, PosTED, TKPlayer, AudioConverter, ScreenCam, PerlinFX, MapEdit, AB3 Includes and many more...
Homepage: http://www.hd-rec.de

Status: Offline

thellier

Re: Productivity Amiga Emulation
Posted on 20-Jul-2015 14:57:25

[ #62 ]

Regular Member

Joined: 2-Nov-2009
Posts: 263
From: Paris

@Wanderer

Perhaps Janus-UAE on AROS-hosted is close to what you want to do, no?
If I remember correctly, Janus-UAE uses the native (x86) AROS libraries to render the GUI.
I mean that a 68k call to intuition.library is converted into an x86 call to AROS' intuition.library.

Alain Thellier

Status: Offline

cdimauro

Re: Productivity Amiga Emulation
Posted on 21-Jul-2015 6:16:38

[ #63 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@thellier: what happens if AROS' (native) intuition.library returns a 64-bit pointer to the screen structure? The 68K code cannot use it.

Status: Offline

NutsAboutAmiga

Re: Productivity Amiga Emulation
Posted on 21-Jul-2015 7:49:45

[ #64 ]

Elite Member

Joined: 9-Jun-2004
Posts: 12820
From: Norway

@fishy_fis

Quote:
Aren't OS4, MOS, or Amithlon pretty much what you're proposing; a 68k compatible system leaning towards system friendly software.

Well no, the native mode of AmigaOS4 and MorphOS is PowerPC, and most programs that run on these systems are native PowerPC. Yes, these operating systems run 68K software, but AmigaOS4 and MorphOS have native drivers for the hardware; they do not run on top of anything else.

(Well, MorphOS has the Quark kernel, but as I understand it, it is just a sort of boot loader and low-level management kernel.)

Amithlon is more like what he proposes: an emulator with minimal chipset support that uses the host OS for the stuff AmigaOS sucks at.

_________________
http://lifeofliveforit.blogspot.no/
Facebook::LiveForIt Software for AmigaOS

Status: Offline

NutsAboutAmiga

Re: Productivity Amiga Emulation
Posted on 21-Jul-2015 7:50:27

[ #65 ]

Elite Member

Joined: 9-Jun-2004
Posts: 12820
From: Norway

@Wanderer

Quote:
OS4 and MOS run only on exotic hardware, not on machine from Best Buy which I have anyway.

Well, if there were an Intel or AMD chip in it, it would be no different from what you buy at Best Buy.
The only exotic thing is the firmware (BIOS): U-Boot/OpenFirmware and CFE.

The real difference is in the OS not in the hardware.

Last edited by NutsAboutAmiga on 21-Jul-2015 at 07:51 AM.

_________________
http://lifeofliveforit.blogspot.no/
Facebook::LiveForIt Software for AmigaOS

Status: Offline

thellier

Re: Productivity Amiga Emulation
Posted on 21-Jul-2015 9:14:10

[ #66 ]

Regular Member

Joined: 2-Nov-2009
Posts: 263
From: Paris

@cdimauro

I don't know: I haven't used Janus-UAE myself. I just remember having read that it retargets the GUI calls.

Alain Thellier

Status: Offline

cdimauro

Re: Productivity Amiga Emulation
Posted on 21-Jul-2015 18:25:55

[ #67 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@NutsAboutAmiga

Quote:

NutsAboutAmiga wrote:
@Wanderer

Quote:
OS4 and MOS run only on exotic hardware, not on machine from Best Buy which I have anyway.

Well, if there were an Intel or AMD chip in it, it would be no different from what you buy at Best Buy.
The only exotic thing is the firmware (BIOS): U-Boot/OpenFirmware and CFE.

The real difference is in the OS not in the hardware.

The same sentence used by Steve Jobs when he announced the transition from PowerPC to Intel.

However, you don't change the hardware if it's not good anymore.

Status: Offline

cdimauro

Re: Productivity Amiga Emulation
Posted on 21-Jul-2015 18:32:30

[ #68 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@thellier

Quote:

thellier wrote:
@cdimauro

I don't know: I haven't used Janus-UAE myself. I just remember having read that it retargets the GUI calls.

Alain Thellier

Yes, I recall the same.

However, I think the situation is not that simple, because of the different endianness and/or pointer size.

But it is quite premature to talk about it now. Let's first see what Wanderer achieves with his new JITer.

Status: Offline

Wanderer

Re: Productivity Amiga Emulation
Posted on 23-Jul-2015 14:16:25

[ #69 ]

Cult Member

Joined: 16-Aug-2008
Posts: 654
From: Germany

@cdimauro

Ok, I took your advice. My previous emulator core-code looked like this:
Quote:

// Run Emulator
void M68kEmulator::Run(M68k_Opcode* code_address) {
  PC = code_address;
  Running = true;
  while (Running) {
    try {
      while (true) {
        EvalLUT[*PC++]();
      }
    } catch(char *e) {
      ; // handle memory hit to HW registers, NULL, 4 etc.
      ; // if we cannot handle, set Running = false;
    }
  }
}

"EvalLUT" is a lookup table of function pointers to parameterless functions like this:

Quote:

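void Eval_11010000_01101111_ADD_W() {
  // ADD.W (d16,A7),D0: the source is memory at A7 plus a signed 16-bit displacement
  int16* src = (int16*)(A[7].addr_ + (int32)(*((int16*)PC)));
  PC++;  // step past the displacement word (assumes M68k_Opcode is 16 bits wide)
  int16* dst = (&D[0].word_);
  // Zero-extend both operands so that bit 16 of the sum is the carry out
  unsigned int a = (unsigned short)*src;
  unsigned int b = (unsigned short)*dst;
  unsigned int res = a + b;
  *dst = (int16)res;
  N = ((res & 0x00008000) != 0);
  Z = ((res & 0x0000FFFF) == 0);
  // Overflow: both operands share a sign that the result does not have
  V = (((a ^ res) & (b ^ res) & 0x00008000) != 0);
  C = ((res & 0x00010000) != 0);
  X = C;
}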
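
The flag rules for a 16-bit add can be checked in isolation. The following is a minimal sketch, not the poster's code: `Flags` and `add_w` are hypothetical names, and the operands are zero-extended so that bit 16 of the sum is the carry.

```cpp
#include <cstdint>

// Hypothetical container for the condition codes (the post keeps them as globals).
struct Flags { bool n, z, v, c, x; };

// Adds two 16-bit operands the way ADD.W does and reports the 68k condition codes.
inline uint16_t add_w(uint16_t src, uint16_t dst, Flags& f) {
    uint32_t res = uint32_t(src) + uint32_t(dst);  // bit 16 holds the carry out
    uint16_t r16 = uint16_t(res);
    f.n = (r16 & 0x8000u) != 0;
    f.z = (r16 == 0);
    // Overflow: both operands have the same sign and the result's sign differs.
    f.v = ((src ^ r16) & (dst ^ r16) & 0x8000u) != 0;
    f.c = (res & 0x10000u) != 0;
    f.x = f.c;                                     // ADD copies carry into X
    return r16;
}
```

For example, adding 1 to 0x7FFF sets N and V but not C, while adding 1 to 0xFFFF sets Z and C but not V.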


However, this results in ~64K functions, which is pretty huge. Even though they are autogenerated, it comes to several megabytes of source code and quite a large executable, which is bad for the cache!
So I took your advice and switch on the first 10 bits. The functions need to be implemented manually, but there are at most 1024 of them. The loop now looks like this:

Quote:

// Run Emulator
void M68kEmulator::Run(M68k_Opcode* code_address) {
  PC = code_address;
  Running = true;
  while (Running) {
    try {
      while (true) {
        M68k_Opcode opcode = *PC++;
        switch((opcode >> 6) & 0x3FF) {
          case 0x0000: // 00000000.00000000
            Eval_ORI_ORICCR(opcode);
            break;
          ...
          case 0x03FF: // 11111111.11000000
            Eval_IllegalOpcode(opcode);
            break;
          default:
            Eval_IllegalOpcode(opcode);
        }
      }
    } catch(char *e) {
      ; // handle memory hit to HW registers, NULL, 4 etc.
      ; // if we cannot handle, set Running = false;
    }
  }
}

Do you think this is the right approach? Again, speed is not the most important here, as a JIT will follow. Most of the work could be reused for JIT of course.
Some concepts I think about :

1. Replacing AllocMem() with host side malloc, removing the emulator "offset" for every memory access.

2. Host side functionality as
  a) Coprocessor instructions
  b) 68020 "CALLM" module
  c) Magic memory addresses (WinUAE like)
  d) special opcodes

c) is probably the easiest from 68k side, since you can do it in every programming language, no "new" opcodes are required. In combination with 1, this will cause an exception however.

3. Accesses to "4" or the "DFF...." registers are captured with exceptions thrown by a signal handler. This is not really a nice way to do it, but I don't have to check memory accesses at all. Is this a good idea? Possibilities I can think of:
  a) test every address before access (probably very slow)
  b) trap as proposed as access violation exception, no extra cost during regular opcodes
  c) somehow manipulating the host MMU (never did this, probably complicated and host dependent)
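
To make the cost of option a) concrete, here is a rough sketch of an explicit address test. It is only an illustration: the `Region` enum, the `classify` function and the `ram_size` parameter are hypothetical, and only the well-known ranges (vector area at 0-$3FF, CIAs around $BFxxxx, custom chips at $DFFxxx) are covered.

```cpp
#include <cstdint>

enum class Region { LowPage, Ram, Cia, CustomChips, Unmapped };

// Classifies a guest address before the access is performed.
// ram_size is an assumed parameter: the amount of plain RAM mapped from address 0.
inline Region classify(uint32_t addr, uint32_t ram_size) {
    if (addr < 0x400u) return Region::LowPage;       // exception vectors; $4 holds the ExecBase pointer
    if (addr >= 0xBF0000u && addr <= 0xBFFFFFu) return Region::Cia;
    if (addr >= 0xDFF000u && addr <= 0xDFFFFFu) return Region::CustomChips;
    if (addr < ram_size) return Region::Ram;
    return Region::Unmapped;
}
```

Every emulated load and store would pay for these compares, which is why option b) moves the cost to the rare faulting access instead.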

Thanks!

Last edited by Wanderer on 23-Jul-2015 at 02:17 PM.

Status: Offline

cdimauro

Re: Productivity Amiga Emulation
Posted on 23-Jul-2015 17:13:19

[ #70 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Wanderer

Quote:

Wanderer wrote:
@cdimauro

Ok, I took your advice. My previous emulator core-code looked like this:
Quote:

// Run Emulator
void M68kEmulator::Run(M68k_Opcode* code_address) {
  PC = code_address;
  Running = true;
  while (Running) {
    try {
      while (true) {
        EvalLUT[*PC++]();
      }
    } catch(char *e) {
      ; // handle memory hit to HW registers, NULL, 4 etc.
      ; // if we cannot handle, set Running = false;
    }
  }
}

"EvalLUT" is a lookup table of function pointers to parameterless functions like this:

Quote:

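void Eval_11010000_01101111_ADD_W() {
  // ADD.W (d16,A7),D0: the source is memory at A7 plus a signed 16-bit displacement
  int16* src = (int16*)(A[7].addr_ + (int32)(*((int16*)PC)));
  PC++;  // step past the displacement word (assumes M68k_Opcode is 16 bits wide)
  int16* dst = (&D[0].word_);
  // Zero-extend both operands so that bit 16 of the sum is the carry out
  unsigned int a = (unsigned short)*src;
  unsigned int b = (unsigned short)*dst;
  unsigned int res = a + b;
  *dst = (int16)res;
  N = ((res & 0x00008000) != 0);
  Z = ((res & 0x0000FFFF) == 0);
  // Overflow: both operands share a sign that the result does not have
  V = (((a ^ res) & (b ^ res) & 0x00008000) != 0);
  C = ((res & 0x00010000) != 0);
  X = C;
}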

This however results in ~64K functions, which is pretty huge. Even they are autogenerated, it gets several megabytes of source code and quite an executable, bad for cache!
So I took your advice switch on the first 10 bits, the functions need to be implemented manually but there are only max. 1024. The loop it looks like this:

Quote:

// Run Emulator
void M68kEmulator::Run(M68k_Opcode* code_address) {
  PC = code_address;
  Running = true;
  while (Running) {
    try {
      while (true) {
        M68k_Opcode opcode = *PC++;
        switch((opcode >> 6) & 0x3FF) {
          case 0x0000: // 00000000.00000000
            Eval_ORI_ORICCR(opcode);
            break;
          ...
          case 0x03FF: // 11111111.11000000
            Eval_IllegalOpcode(opcode);
            break;
          default:
            Eval_IllegalOpcode(opcode);
        }
      }
    } catch(char *e) {
      ; // handle memory hit to HW registers, NULL, 4 etc.
      ; // if we cannot handle, set Running = false;
    }
  }
}

Do you think this is the right approach? Again, speed is not the most important here, as a JIT will follow. Most of the work could be reused for JIT of course.

The problem here is that you still have to duplicate some code, because of the many repetitions in the opcode patterns. I mean that for line A, line F, the big MOVE EA,EA, etc., you have to repeat the code in each case.

IMO, if you use even a simple table like the first one I showed you previously, you can better aggregate the opcodes into "macro-opcodes", greatly reducing the big switch. You only need a table of 1024 two-byte (word) entries for this, or even 1024 single bytes if you can compress the macro-opcodes to at most 256. It will also be much more cache-friendly.
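
A rough sketch of such a split table, assuming one byte per entry: the `MacroOp` names and `build_table` are made up for illustration and cover only a handful of opcode groups. It shows how the 64 table slots of line A, line F or the Bcc family each collapse into a single macro-opcode.

```cpp
#include <array>
#include <cstdint>

// Hypothetical macro-opcode ids; a real table would have many more.
enum MacroOp : uint8_t { MO_ILLEGAL = 0, MO_MOVEQ, MO_BCC, MO_LINE_A, MO_LINE_F };

// Builds the 1024-entry table indexed by the top 10 bits of the opcode.
inline std::array<uint8_t, 1024> build_table() {
    std::array<uint8_t, 1024> t{};               // zero-filled, i.e. MO_ILLEGAL
    for (unsigned i = 0; i < 1024; ++i) {
        unsigned line = i >> 6;                  // top 4 bits of the opcode
        bool bit8 = ((i >> 2) & 1u) != 0;        // opcode bit 8 is index bit 2
        if (line == 0x6) t[i] = MO_BCC;          // BRA, BSR, Bcc
        else if (line == 0xA) t[i] = MO_LINE_A;
        else if (line == 0xF) t[i] = MO_LINE_F;
        else if (line == 0x7 && !bit8) t[i] = MO_MOVEQ;
    }
    return t;
}

// First-level decode: one byte load, then the caller switches on the result.
inline MacroOp decode(const std::array<uint8_t, 1024>& t, uint16_t opcode) {
    return MacroOp(t[opcode >> 6]);
}
```

The main loop would then `switch` on the returned macro-opcode, so the switch has as many cases as there are macro-opcodes rather than 1024.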
Quote:
Some concepts I think about :

1. Replacing AllocMem() with host side malloc,

Right. And you can replace many more APIs, but that's one of the most important.
Quote:
removing the emulator "offset" for every memory access.

How do you plan to do so? The problem is that the host o.s. usually doesn't give you access to the first page (address 0 and beyond). On Windows I'm 100% sure of that.

That's another reason why I don't like using a host o.s.: it limits the tricks that can be used to speed up the emulation.
Quote:
2. Host side functionality as
a) Coprocessor instructions

Ignore them now.
Quote:
b) 68020 "CALLM" module

AFAIK it's not used on Amiga. Ignore it.
Quote:
c) Magic memory addresses (WinUAE like)
d) special opcodes

32-bit ones?
Quote:
c) is probably the easiest from 68k side, since you can do it in every programming language, no "new" opcodes are required. In combination with 1, this will cause an exception however.

Please, can you clarify?
Quote:
3. Access to "4" or "DFF...." registers are captured with exceptions thrown by a SignalHandler. This is not really a nice way to do this, but I don't have to check memory access at all. Is this a good idea?

It depends on the application: if it accesses location $4 many times, performance will degrade a lot.

As for accessing the custom chipset, the idea is to replace the APIs so that it is avoided completely, or reduced to a minimum.

For now you can assume that there is normal memory at that location.
Quote:
Possibilities I can think ok:
a) test every address before access (probably very slow)

Exactly.
Quote:
b) trap as proposed as access violation exception, no extra cost during regular opcodes

Good for normal scenarios. More modern applications in particular, which are also the most demanding ones, are o.s.-friendly (no direct access to the hardware) and don't spend their time reading $4 frequently.
Quote:
c) somehow manipulating the host MMU (never did this, probably complicated and host dependent)

Thanks!

That's host-o.s. dependent, for sure. But you can focus on mainstream o.ses now, and leave other o.ses for the future.

Status: Offline

Wanderer

Re: Productivity Amiga Emulation
Posted on 23-Jul-2015 18:07:20

[ #71 ]

Cult Member

Joined: 16-Aug-2008
Posts: 654
From: Germany

@cdimauro

Quote:

IMO if you just use even a simple table like the first one which I shown you previously, you can better aggregate the opcodes to some "macro-opcodes", greatly reducing the big switch case. You need only a 1KB * 2 (word) table for this, or even a 1KB if you can compress the macro-opcodes to 256 at maximum. It'll be also much more cache-friendly.

If I test only the upper 8 bits, it shifts much more work onto the Eval_OpcodeXYZ() functions.
E.g. examining the upper 10 bits, I get cases like

case 0x0139: // 01001110.01000000
Eval_TRAP_LINK_UNLK_MOVEtoUSP_MOVEfromUSP_RESET_NOP_STOP_RTE_RTD_RTS_TRAPV_RTR_MOVEfromCC_MOVEtoCC(opcode);
break;

I just looked up your post about the opcode table. It is basically the same as what I do, just offline. I map the corresponding Eval_Opcode_XYZ function to the first 10 bits. In your example, you need only a 1024 * 2 byte table from opcode to split opcode, but also another one from the macro opcode to the actual function. This might get bigger than a direct 1024 x address LUT.

Another issue is how you handle the various addressing modes, if not with separate functions. Should they introduce another switch? Then we would get
Quote:

switch(macroOpcode) {
  switch(src_ea) {
    switch(dest_ea) {
      switch(size) {
        move.x ,
      }
    }
  }
}

Quote:

Quote:

removing the emulator "offset" for every memory access.

How do you plan to do so? The problem is that the host o.s. usually doesn't give you access to the first page (address 0 and beyond). On Windows I'm 100% sure.

Neither does AmigaOS. 0x4 is an exception here. If I can trap such accesses as illegal memory accesses, I can return a reasonable result. Of course this is much slower, but normally you don't poll 0x4 a lot.
A JIT could actually catch this special case, because it is often addressed absolutely.

And again, it would cause an "Access Violation" on all "low" addresses like 0, 4 ..., which the emulator can handle gracefully, acting like "Enforcer". Chipset addresses would also cause an AV, since they are located at "DFFx0000", which is in the upper 2 GB.

The only thing that worries me is that this will force the emulator executable to be 32-bit, or it would need something like malloc32bit(), which probably doesn't exist.

Status: Offline

Wanderer

Re: Productivity Amiga Emulation
Posted on 23-Jul-2015 19:52:56

[ #72 ]

Cult Member

Joined: 16-Aug-2008
Posts: 654
From: Germany

Another idea:

A 13-bit table would disambiguate almost all opcodes, except some trivial ones.
So I could create an 8192-entry LUT with 1 byte per entry, which indexes the actual function.

The reason for this is that if you look at the upper 10 bits only, you get quite a funny mix, e.g. with MOVEP, because it is "hacked" into some gaps between other opcodes.

Quote:

switch((opcode >> 0x6) & 0x3FF) {
  case 0x0000: // 00000000.00000000
    Eval_ORI_ORICCR(opcode);
    break;
  case 0x0001: // 00000000.01000000
    Eval_ORI_ORISR(opcode);
    break;
  case 0x0002: // 00000000.10000000
    Eval_ORI(opcode);
    break;
  case 0x0003: // 00000000.11000000
    Eval_CHK2xCMP2(opcode);
    break;
  case 0x0004: // 00000001.00000000
    Eval_BTST2_MOVEP(opcode);
    break;
  case 0x0005: // 00000001.01000000
    Eval_BCHG2_MOVEP(opcode);
    break;
  case 0x0006: // 00000001.10000000
    Eval_BCLR2_MOVEP(opcode);
    break;
  case 0x0007: // 00000001.11000000
    Eval_BSET2_MOVEP(opcode);
    break;
  ...

Looking at the first 13 bits would solve that and allow one function for most opcodes.
That limits things to 256 "Meta" opcodes, but I think this is enough even for 68040+FPU.

Look at this table: http://goldencrystal.free.fr/M68kOpcodes.pdf

Status: Offline

cdimauro

Re: Productivity Amiga Emulation
Posted on 23-Jul-2015 20:01:22

[ #73 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Wanderer

Quote:

Wanderer wrote:
@cdimauro

Quote:

IMO if you just use even a simple table like the first one which I shown you previously, you can better aggregate the opcodes to some "macro-opcodes", greatly reducing the big switch case. You need only a 1KB * 2 (word) table for this, or even a 1KB if you can compress the macro-opcodes to 256 at maximum. It'll be also much more cache-friendly.

If I test only the upper 8 bit, it shifts much more work on the Eval_OpcodeXYZ() function.

Exactly. That's why I suggested from the beginning to take the topmost 10 bits: it better covers the 68K (main) opcodes structure.
Quote:
E.g. examining the upper 10 bits, I get cases like

case 0x0139: // 01001110.01000000
Eval_TRAP_LINK_UNLK_MOVEtoUSP_MOVEfromUSP_RESET_NOP_STOP_RTE_RTD_RTS_TRAPV_RTR_MOVEfromCC_MOVEtoCC(opcode);
break;

I just looked up your post about the opcode table. It is basically the same I do, just offline. I map the corresponding Eval_Opcode_XYZ function to the first 10 bits.

The compiler will internally generate at least a 1024 * sizeof(pointer) table to handle such a big switch, which means 4 or 8KB of L1 data cache used only for that. Using a split table instead, you can better reorganize the (macro) opcodes and use much less space, especially with a maximum of 256 macro-opcodes. It can positively influence code execution. Consider that an out-of-order processor can better mask the latency of the split table look-up if you can subsequently execute other instructions that extract other useful information to be used inside the switch/case construct.
Quote:
In your example, you need only a 1024 * 2 bytes opcode => splitted opcode table, but another one from the macro opcode to the actual function.

Yes, but as I said, the compiler generates an internal function pointer table to dispatch your big switch.

However, the advantage of your solution is that you don't need to pass parameters to a called function, because you continue to work inside your eval loop.

I am aware of it, and was even before I posted my solution in the previous comments, but in general I don't like big function bodies, and I prefer to split the code into subroutines to better isolate and manage them.

But from a pure performance point of view, the big switch is generally more convenient.
Quote:
This might get bigger than a direct 1024 x address LUT.

No, since you have far fewer macro-opcodes.
Quote:
another issue is how you handle various addeessing modes, if not with separate functions. Should they introduce another switch? Then we would get
Quote:

switch(macroOpcode) {
switch(src_ea) {
switch(dest_ea) {
switch(size) {
move.x ,
}
}
}

Yes, you need a couple of function pointer tables for evaluating the source and destination EAs. But they are very small (only 64 + 64 entries).
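
As an illustration of where those 64-entry indices come from, here is a small sketch that pulls the two effective-address fields out of a MOVE opcode. The `Ea` struct and the function names are hypothetical; the bit layout is the standard 68k one, where the destination field of MOVE stores register and mode in the opposite order from the source field.

```cpp
#include <cstdint>

struct Ea { uint8_t mode; uint8_t reg; };

// Source EA: bits 5..3 are the mode, bits 2..0 the register.
inline Ea source_ea(uint16_t opcode) {
    return Ea{ uint8_t((opcode >> 3) & 7u), uint8_t(opcode & 7u) };
}

// MOVE destination EA: bits 11..9 are the register, bits 8..6 the mode.
inline Ea move_dest_ea(uint16_t opcode) {
    return Ea{ uint8_t((opcode >> 6) & 7u), uint8_t((opcode >> 9) & 7u) };
}

// 6-bit index (mode:reg) suitable for a 64-entry table of EA handlers.
inline unsigned ea_index(Ea ea) {
    return (unsigned(ea.mode) << 3) | ea.reg;
}
```

With `0x3210` (MOVE.W (A0),D1) the source decodes to mode 2, register 0 and the destination to mode 0, register 1, so the two table lookups use indices 16 and 1.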
Quote:
Quote:

How do you plan to do so? The problem is that the host o.s. usually doesn't give you access to the first page (address 0 and beyond). On Windows I'm 100% sure.

Neither does AmigaOS. 0x4 is an exception here. If I can trap them as illegal memory access, I can return a resonable result. Of course this is much slower, but normally you don't poll 0x4 a lot.
A JIT could actually catch this special case, because often this is addressed absolutely.

Yes. The only problem is with applications that don't use the short or long absolute addressing modes for it; those require the trap.
Quote:
And again, it would cause an "Access Violation" on all "low" addresses like 0, 4 ... which the emulator can gracefully handle and act like "Enforcer". ChipSet addresses would also cause an AV, since they are located at "DFFx0000", which is in the upper 2 GB.

No, the chipset addresses are in the 24-bit address range. It's $DFFxxx; consider also $Bxxxxx for the CIAs and maybe $DExxxx (I don't remember now) for the real-time clock.
Quote:
Only thing that worries me that this will force the emulator executable to be 32 bit, or it would need something like malloc32bit() which probably doesn't exist.

It doesn't exist, but on Windows you can use VirtualAlloc to allocate memory directly at the address you want, as long as it is >= 64KB and less than 2GB (I don't remember now if it's 2GB - 64KB, but that's just a detail).

You can also have an application which uses a Large Address Space, which extends the application's address space to 3GB, but you need to enable it (and reboot) before using it. That's for a 32-bit version of Windows. If I remember correctly, a 64-bit Windows should allow up to 4GB of application space to be used without enabling that parameter; but I'm not sure (I studied it a long time ago), so it's better to double-check if you're interested.

That was my plan to eliminate the need to add the offset: put the application code and internal data above 2GB (if possible; in the worst case use the upper MBs close to the 2GB limit), and leave the remaining bottom 2GB of memory for the Amiga emulation.

EDIT: I removed the "less than" characters, because the forum cut the last part of the comment.

Last edited by cdimauro on 23-Jul-2015 at 10:03 PM.
Last edited by cdimauro on 23-Jul-2015 at 10:02 PM.
Last edited by cdimauro on 23-Jul-2015 at 09:08 PM.

Status: Offline

cdimauro

Re: Productivity Amiga Emulation
Posted on 23-Jul-2015 20:06:05

[ #74 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Wanderer

Quote:

Wanderer wrote:
Another idea:

A 13-bit table would disambiguate almost all opcodes, except some trivial ones.
So I could create a 8192 byte sized LUT that contains 1 byte, which indexes the actual function.

The reason for this is, if you look at the upper 10 bits only, you get quite a funny mix, e.g. with MOVEP because it is "hacked" into some gaps between other opcodes.

Quote:

switch((opcode >> 0x6) & 0x3FF) {
case 0x0000: // 00000000.00000000
Eval_ORI_ORICCR(opcode);
break;
case 0x0001: // 00000000.01000000
Eval_ORI_ORISR(opcode);
break;
case 0x0002: // 00000000.10000000
Eval_ORI(opcode);
break;
case 0x0003: // 00000000.11000000
Eval_CHK2xCMP2(opcode);
break;
case 0x0004: // 00000001.00000000
Eval_BTST2_MOVEP(opcode);
break;
case 0x0005: // 00000001.01000000
Eval_BCHG2_MOVEP(opcode);
break;
case 0x0006: // 00000001.10000000
Eval_BCLR2_MOVEP(opcode);
break;
case 0x0007: // 00000001.11000000
Eval_BSET2_MOVEP(opcode);
break;
...

Looking at the first 13 bits would solve that and allow for most opcodes one function.

If it's only for such strange and rare instructions (which were also removed on the 68060), you can avoid it. It's better to focus on the more common and frequent instructions.
Quote:
That limits to 256 "Meta" Opcodes, but I think this is enough even for 68040+FPU.

Looks at this table: http://goldencrystal.free.fr/M68kOpcodes.pdf

If you limit the macro-opcodes to 256, that's a good thing. But you are wasting some precious L1 data cache this way.

However, you can experiment and then decide what's better. It's no big deal to play with such mask sizes now and change them whenever you want.

Status: Offline

Wanderer

Re: Productivity Amiga Emulation
Posted on 24-Jul-2015 13:33:25

[ #75 ]

Cult Member

Joined: 16-Aug-2008
Posts: 654
From: Germany

@cdimauro

I think I am going for this solution:

An 8KB, 13-bit lookup table that maps the upper 13 bits of the opcode to at most 256 "MetaOpcodes", which are what is actually implemented as functions. E.g. "ADD", "ADDX", "SUB", "SUBX", "MOVE", "MOVEA" will be MetaOpcodes. The mapping table allows me to add opcodes gradually and postpone the implementation of rare opcodes until they are actually needed.
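
One possible shape for such a mapping table is sketched below, under a few assumptions: handlers are registered by a mask/pattern pair on the opcode word, only the upper 13 bits of each are honoured, and everything unregistered falls through to handler 0. `Decoder`, `g_last` and the `on_*` handlers are placeholder names for the example only.

```cpp
#include <cstdint>
#include <vector>

using Handler = void (*)(uint16_t opcode);

struct Decoder {
    std::vector<uint8_t> lut = std::vector<uint8_t>(8192, 0);  // upper 13 bits -> handler id
    std::vector<Handler> handlers;                               // id 0 is the fallback

    explicit Decoder(Handler illegal) { handlers.push_back(illegal); }

    // Routes every opcode with (opcode & mask) == pattern to h.
    // Returns false once the 256 ids that fit in one byte are used up.
    bool add(uint16_t mask, uint16_t pattern, Handler h) {
        if (handlers.size() >= 256) return false;
        uint8_t id = uint8_t(handlers.size());
        handlers.push_back(h);
        uint16_t m = mask & 0xFFF8u;            // the low 3 bits are not part of the index
        for (unsigned i = 0; i < 8192; ++i) {
            uint16_t op = uint16_t(i << 3);
            if ((op & m) == (pattern & m)) lut[i] = id;
        }
        return true;
    }

    void dispatch(uint16_t opcode) const { handlers[lut[opcode >> 3]](opcode); }
};

// Placeholder handlers that only record which one ran.
int g_last = 0;
void on_illegal(uint16_t) { g_last = -1; }
void on_moveq(uint16_t)   { g_last = 1; }
void on_line_d(uint16_t)  { g_last = 2; }   // ADD/ADDA/ADDX share line D in this sketch
```

Starting with only the fallback registered and adding `add(0xF100, 0x7000, on_moveq)` and similar calls one at a time matches the idea of implementing opcodes gradually.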

The first milestone should be to run a simple "Hello World" 68k exe. The 68k interpreter is a vehicle to get to the actual JIT implementation, a way for me to learn how the 68k works, and a source of code that is reusable for the JIT.

I have also checked out LLVM (there are also some alternatives like libJIT), which looks pretty straightforward and a much better idea than implementing my own JIT compiler for each platform. LLVM supports x86, x86-64, ARM and PPC very well. Other architectures may have some features missing, but they are not important at the moment. Right now I am looking at Windows, MacOS, Linux (x86/x86-64) and Android.

Some design issues remain, however.

When compiled for 64-bit, using host memory directly won't work. It will need an indirection that adds the virtual machine's memory offset. That means every memory access has not only the endianness flip, but also an ADD instruction. I was hoping to get around this, but that would mean we need to allocate memory on the host that can be addressed with 32 bits.
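
The two costs mentioned here, the base offset and the byte-order flip, can be seen in a small sketch. `GuestMemory` is a hypothetical wrapper around a host allocation; it does no bounds checking and assembles the value byte by byte, so it behaves the same on little- and big-endian hosts.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct GuestMemory {
    std::vector<uint8_t> bytes;   // host allocation backing the 68k address space

    explicit GuestMemory(std::size_t size) : bytes(size, 0) {}

    // Reads a big-endian 32-bit value at a guest address (no bounds check in this sketch).
    uint32_t read32(uint32_t addr) const {
        const uint8_t* p = bytes.data() + addr;   // the extra ADD: host base + guest address
        return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
               (uint32_t(p[2]) << 8)  |  uint32_t(p[3]);
    }

    // Writes a 32-bit value in big-endian byte order.
    void write32(uint32_t addr, uint32_t value) {
        uint8_t* p = bytes.data() + addr;
        p[0] = uint8_t(value >> 24);
        p[1] = uint8_t(value >> 16);
        p[2] = uint8_t(value >> 8);
        p[3] = uint8_t(value);
    }
};
```

If guest memory could be placed so that guest address equals host address, the `bytes.data() + addr` step would disappear and only the byte swap would remain.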

The hardware registers at DFF... are also an issue. One solution could be to count on them always being addressed absolutely, and to add a check to those addressing modes. This is not super fast, but also not speed-critical, especially when a JIT is used. Normal addressing would not be affected. Generally, the credo is not to make compromises because of the chipset emulation. But if we can get some programs running that would otherwise crash horribly, it's worth supporting.

Status: Offline

umisef

Re: Productivity Amiga Emulation
Posted on 24-Jul-2015 14:58:57

[ #76 ]

Super Member

Joined: 19-Jun-2005
Posts: 1714
From: Melbourne, Australia

@Wanderer

Quote:
8KB for 13bit lookup table that maps the upper 13bits of the opcodes to max. 256 "MetaOpcodes", which are actually implemented as functions. E.g. "ADD", "ADDX", "SUB", "SUBX", "MOVE", "MOVEA" will be MetaOpcodes. The mapping table allows to gradually add opcodes and postpone the implementation of rare opcodes until they are actually needed.

Frankly, worrying about the efficiency of instruction decoding at this stage is probably a bit premature. Once you have something (anything) going, you are in a position to actually gather some statistics from the software you are targeting that can inform such optimisations. For example, out of the 65,536 possible opcodes, the vast majority essentially never gets used, and the vast majority of code actually uses a very small number of opcodes --- meaning that merely having a large lookup table won't do bad things to your caches, because most of it barely ever gets touched.
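
Gathering such statistics needs very little machinery. A minimal sketch, with `OpcodeStats` as a made-up name: the interpreter would call `record` once per executed opcode word, and `top` lists the most frequent ones afterwards.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct OpcodeStats {
    std::vector<uint64_t> counts = std::vector<uint64_t>(65536, 0);

    void record(uint16_t opcode) { ++counts[opcode]; }

    // Number of distinct opcode words seen so far.
    std::size_t distinct() const {
        return std::size_t(std::count_if(counts.begin(), counts.end(),
                                         [](uint64_t c) { return c != 0; }));
    }

    // The n most frequently executed opcodes as (opcode, count), most frequent first.
    std::vector<std::pair<uint16_t, uint64_t>> top(std::size_t n) const {
        std::vector<std::pair<uint16_t, uint64_t>> seen;
        for (unsigned op = 0; op < 65536; ++op)
            if (counts[op] != 0) seen.emplace_back(uint16_t(op), counts[op]);
        std::sort(seen.begin(), seen.end(),
                  [](const std::pair<uint16_t, uint64_t>& a,
                     const std::pair<uint16_t, uint64_t>& b) {
                      return a.second != b.second ? a.second > b.second : a.first < b.first;
                  });
        if (seen.size() > n) seen.resize(n);
        return seen;
    }
};
```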

Quote:
I have also checked out LLVM

You will need to think about a higher level design for your JIT emulation --- when does JIT translation get triggered (simply translating the first time you see untranslated code risks sluggish startup, because you spend translation resources on startup code which only ever gets run once)? How many conditional branches/jumps/JSR calls do you intend to compile through? How do you keep track of the various places where the thread of execution can end up? How do you deal with computed jumps/JSRs (very popular due to the library system)? How are you going to chain your compilation units together? How/where do you keep 68k processor state between compiled blocks? And how do you interface between interpretative and compiling emulation (do they share the same state? Or do you need to translate back and forth?).

Those are the easy questions. Some of the harder ones are: How do you handle instruction cache flush instructions? They may be limited to a particular memory range, but also might not be. And even if they are --- how will you work out which of your JIT compiled units are affected by the given memory range? And what are you going to do to them --- invalidate them? If so, how will such invalidation affect the chaining?
Also, how are you going to handle interrupts, if at all? And I don't think you can get away completely without --- somehow 68k needs to be preempted and the 68k context switched (unless you run one emulation task per 68k task, each with its own 68k state. There are serious problems with that approach, too). Similarly, there are definitely 68k instructions which can cause exceptions --- which might happen in the middle of a compilation unit. How are you going to deal with such exceptions?
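
For the range question, one simple bookkeeping scheme is to index compiled units by the 68k address range they were translated from. The sketch below is only an illustration with hypothetical names (`Block`, `BlockCache`); it drops every unit that overlaps a flushed range and deliberately ignores the harder part, namely fixing up chained jumps into the dropped units.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>

// A compiled unit covering the guest address range [start, end).
struct Block { uint32_t start; uint32_t end; };

struct BlockCache {
    std::map<uint32_t, Block> blocks;   // keyed by guest start address

    void insert(uint32_t start, uint32_t end) { blocks[start] = Block{start, end}; }

    // Drops every block overlapping [lo, hi) and returns how many were dropped.
    std::size_t invalidate(uint32_t lo, uint32_t hi) {
        std::size_t dropped = 0;
        for (auto it = blocks.begin(); it != blocks.end() && it->first < hi; ) {
            if (it->second.end > lo) { it = blocks.erase(it); ++dropped; }
            else ++it;
        }
        return dropped;
    }
};
```

A flush without a range would simply clear the whole map.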

And then, there is....
Quote:
Also the hardware registers DFF... are an issue. One solution could be to count on that they are always absolute addressed, and adding a check to those addressing modes.

...the issue of memory mapped I/O. Do NOT count on ANYTHING being consistently done in a sane manner. Sorry for shouting, but I know from personal (painful) experience that anything bizarre and convoluted you can think of (and quite a few things I, at least, couldn't think of) is being done by someone in a supposedly-system-friendly Amiga program. The UAE JIT keeps track while running the interpretative emulation of which 68k instructions access only real memory, and which ones access something else, and uses that trace info while later compiling the code. Which reduces the number of issues, but is far from getting them all.
People use memcpy() to access the memory mapped I/O. They really do. So whatever you do, don't rely on things like addressing modes to detect MMIO.

(The UAE/Amithlon solution is to let the MMU catch memory accesses to things which aren't memory, and then route those accesses only through the hardware emulation. What that requires, however, is decoding the faulting host instruction and adjusting the host CPU state according to what comes out of the hardware emulation. That is a complete nightmare on x86/32 already, even for the hand-generated code which uses nothing but MOV for memory accesses, with a limited number of addressing modes, treating the x86 as a RISC-like load/store architecture. LLVM, GCCJIT or similar libraries will almost certainly generate arithmetic/logical instructions operating directly on memory, and use all possible addressing modes. Heck, the x86 can happily move from one memory location directly to another, so the same instruction can even fault twice...)

TL;DR version: Don't worry about performance for now. Worry about getting stuff to work; making it work fast is the second step.

Status: Offline

Wanderer

Re: Productivity Amiga Emulation
Posted on 24-Jul-2015 21:32:29

[ #77 ]

Cult Member

Joined: 16-Aug-2008
Posts: 654
From: Germany

@umisef

I don't optimize premature. The idea with the 13bit table is not necessarily speed, it is to handle the problem of assigning the right function to the right opcode. Since the opcodes are scrambled and interleaved all over the place, making a mapping makes sense. Using the first 13 bit disambiguates most of the 16bit opcodes. 10bit, as proposed earlier, mixes a lot of opcodes into one that needs another test.

I know that the instructions are not evenly distributed, of course. When it is time to optimize, I can take care of the most frequent ones and speed them up. But I probably won't even do that; I will probably go for the JIT first.

I know there are a lot of issues that need to be addressed.

Hardware Addresses:
I don't know exactly how yet. Initially I thought that simply leaving those addresses unallocated and catching accesses with a signal handler would be a good idea, since it costs no time during regular execution. If I can program the host MMU, that would be much better, of course.
However, I want to keep things simple, and chipset emulation is not a priority. The point is rather that programs which hit those addresses don't cause a segfault. Generally, the idea is to run software that does not bang the hardware directly, and to retarget the OS calls that would.
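A rough POSIX sketch of the "leave it unallocated and catch the signal" idea, under some loud assumptions: the helper names (reserve_unmapped, guarded_read8, on_fault) are invented here, the region is reserved with PROT_NONE instead of being truly unmapped, and the handler does not decode or resume the faulting instruction. It just records the address and jumps back, which is enough to not crash but not enough for real hardware emulation.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* MAP_ANONYMOUS and sigaction on glibc */
#endif
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

static sigjmp_buf fault_env;
static void *volatile fault_addr;   /* last faulting host address */

/* Record where the fault happened and bail out of the guarded access. */
static void on_fault(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    fault_addr = info->si_addr;
    siglongjmp(fault_env, 1);
}

/* Install the handler; some hosts report SIGBUS instead of SIGSEGV. */
static int install_fault_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    return sigaction(SIGBUS, &sa, NULL);
}

/* Reserve address space that faults on every access (stand-in for the
   hardware register area). Returns NULL on failure. */
static uint8_t *reserve_unmapped(size_t size)
{
    void *p = mmap(NULL, size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : (uint8_t *)p;
}

/* Returns 0 and stores the byte if the read worked, 1 if it faulted. */
static int guarded_read8(const volatile uint8_t *p, uint8_t *out)
{
    assert(p != NULL && out != NULL);
    if (sigsetjmp(fault_env, 1) != 0)
        return 1;
    *out = *p;
    return 0;
}
```

Normal memory accesses pay nothing for this; only the faulting path is slow. Wrapping every access in sigsetjmp as done here is only for the demo. Code that keeps running after the fault would have to patch up host registers in the handler, which is the hard part described earlier in the thread.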

It would be nice if I could get more advice and insights from people with more emulation experience than me.
So I will keep posting my thoughts, feel free to bash them or support them.

Status: Offline

Wanderer

Re: Productivity Amiga Emulation
Posted on 24-Jul-2015 21:45:47

[ #78 ]

Cult Member

Joined: 16-Aug-2008
Posts: 654
From: Germany

About the JIT (LLVM):

The same opcode mapping from the interpreter can be used to emit LLVM code instead of executing the opcode.
LLVM can then compile and optimize it, e.g. LLVM will gracefully eliminate all unnecessary flag calculations with its SSA model. LLVM will return a pointer to the code, which can be stored in a cache.

Granularity: without having really looked at WinUAE's JIT code, my intuition would be to translate all opcodes until a jump/branch happens or a maximum number of opcodes has been translated. If the code is small and the jump goes forward, the jump target could be included too; only loops won't work, since their execution time is unpredictable. Unrolling n iterations of a loop would be possible, though. Ideally we always get around the same "N" instructions in one chunk.

This will be packed into an LLVM function and called. From the AmigaOS side, such a chunk of opcodes would behave atomically: no interrupts or scheduling possible. Given the sheer speed of today's hardware, I don't think that would become an issue. A chunk of 100 opcodes is probably faster than a single instruction on a 68000, which was atomic too. You can probably run 1000 opcodes in the time of one DIVS. The actual number needs to be determined; maybe it's less, maybe much more, before you start to "feel" it.
The state of the 68k CPU would be shared between the interpreter and the JIT, so execution can switch between them seamlessly. E.g. a program could start running some code interpreted while LLVM creates a native chunk for it. The second or third time, when LLVM is done, it will switch to the JIT'ed code. This avoids a noticeable delay when starting a new program.
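A small sketch of that switch-over, with everything simplified: Cpu68k, Block, run_block and HOT_THRESHOLD are hypothetical names, the "compiler" is a stub that hands back a ready-made function instead of calling LLVM, and compilation happens synchronously rather than in the background. The point is only that both paths work on the same state struct, so a block can change from interpreted to native between two runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 68k state shared by both execution paths. */
typedef struct {
    uint32_t d[8], a[8];
    uint32_t pc;
    uint16_t sr;
} Cpu68k;

typedef void (*BlockFn)(Cpu68k *cpu);

typedef struct {
    uint32_t start_pc;
    uint32_t hits;       /* how often the block has been entered */
    BlockFn interpret;   /* always available */
    BlockFn compiled;    /* NULL until native code exists */
} Block;

#define HOT_THRESHOLD 3u   /* assumption: compile on the third entry */

/* Hypothetical hook that would hand the block to the JIT backend. */
typedef BlockFn (*CompileFn)(const Block *b);

/* Run one block, promoting it to native code once it is hot. */
static void run_block(Cpu68k *cpu, Block *b, CompileFn compile)
{
    assert(b->interpret != NULL);
    b->hits++;
    if (b->compiled == NULL && b->hits >= HOT_THRESHOLD)
        b->compiled = compile(b);
    if (b->compiled != NULL)
        b->compiled(cpu);
    else
        b->interpret(cpu);
}

/* Stand-ins for a real interpreter and real generated code:
   both have the same effect on the shared state. */
static int jit_runs;
static void demo_interpret(Cpu68k *cpu) { cpu->d[0] += 1; cpu->pc += 2; }
static void demo_native(Cpu68k *cpu)    { cpu->d[0] += 1; cpu->pc += 2; jit_runs++; }
static BlockFn demo_compile(const Block *b) { (void)b; return demo_native; }
```

With the threshold at 3, the first two entries run interpreted and every later one runs the native version, so startup code that runs only once never costs any translation time.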

Status: Offline

cdimauro

Re: Productivity Amiga Emulation
Posted on 25-Jul-2015 8:44:02

[ #79 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@Wanderer

Quote:

Wanderer wrote:
@cdimauro

I think I am going for this solution:

8KB for 13bit lookup table that maps the upper 13bits of the opcodes to max. 256 "MetaOpcodes", which are actually implemented as functions. E.g. "ADD", "ADDX", "SUB", "SUBX", "MOVE", "MOVEA" will be MetaOpcodes. The mapping table allows to gradually add opcodes and postpone the implementation of rare opcodes until they are actually needed.

Fine. Only one thing, which I thought of while looking at your metaopcode switch/case: if you simply call the proper metaopcode function inside it, and all of them have the same prototype, you can avoid the switch completely and use a table of function pointers, so your code will end up looking like:
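[code]/* unsigned: with up to 256 MetaOpcodes a plain (possibly signed) char index could go negative */
unsigned char macro_opcode = split_table[opcode >> 3];
macro_opcodes[macro_opcode](opcode);
[/code]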
Quote:
When compiled for 64bit, using Host memory directly won't work. It will need an indirection by adding the virtual machines memory offset. Means every memory access has not only the endianess flip, but also an ADD instruction. I was hoping to get around this, but that would mean we need to allocate memory on the host that can be addressed by 32bit.

As I said before, you can use VirtualAlloc on Windows, which solves your problem. For Linux and MacOS there should be equivalent functions.
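For reference, this is roughly what the indirection from the quote looks like when guest memory cannot be placed at host addresses that match the 68k ones. It is only a sketch; GuestMem, guest_read32 and guest_write32 are hypothetical names. Each access is one ADD (base + addr) plus the big-endian conversion, written byte by byte here so it is independent of host endianness.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t *base;   /* host address that corresponds to guest address 0 */
    uint32_t size;   /* size of the guest address space in bytes */
} GuestMem;

/* Read a big-endian 32-bit value at guest address `addr`. */
static uint32_t guest_read32(const GuestMem *m, uint32_t addr)
{
    assert(m->size >= 4 && addr <= m->size - 4);
    const uint8_t *p = m->base + addr;   /* the extra ADD */
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Write a 32-bit value in big-endian order at guest address `addr`. */
static void guest_write32(GuestMem *m, uint32_t addr, uint32_t value)
{
    assert(m->size >= 4 && addr <= m->size - 4);
    uint8_t *p = m->base + addr;
    p[0] = (uint8_t)(value >> 24);
    p[1] = (uint8_t)(value >> 16);
    p[2] = (uint8_t)(value >> 8);
    p[3] = (uint8_t)value;
}
```

Optimizing compilers typically turn this byte-wise pattern into a single load or store plus a byte swap, so the cost is usually close to the "ADD plus endianness flip" mentioned in the quote. If the guest memory can be allocated so that base is 0 or a fixed constant, the ADD disappears as well.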
Quote:
Also the hardware registers DFF... are an issue. One solution could be to count on that they are always absolute addressed, and adding a check to those addressing modes. This is not super fast, but also not speed critical, especially when a JIT is used.

You can intercept when problematic addresses (chipset, CIAs, real time clock, etc.) are loaded directly into an address register (via immediate MOVE or LEA), keep track of that, and propagate it to other instructions (other MOVEs, ADDs, SUBs, or LEAs which use the "dirty" address register, even for calculating new addresses). But it's a bit complicated: leave it for now.
Quote:
Normal addressing would not be affected. Generally, the credo is not to make compromises because of the chipset emu. But if we can get some programs running that otherwise would horribly crash, its worth to support.

Yes, but right now just focus on the JIT. Once that starts to work, you can think about this problem.

Status: Offline

cdimauro

Re: Productivity Amiga Emulation
Posted on 25-Jul-2015 9:19:01

[ #80 ]

Elite Member

Joined: 29-Oct-2012
Posts: 3650
From: Germany

@umisef

Quote:

umisef wrote:

You will need to think about a higher level design for your JIT emulation --- when does JIT translation get triggered (simply translating the first time you see untranslated code risks sluggish startup, because you spend translation resources on startup code which only ever gets run once)?

So, basically you're suggesting that it's better to care about hot spots.
Quote:
How do you deal with computed jumps/JSRs (very popular due to the library system)?

They are also used with OOP coding (VMTs) and in general with function pointers.

A simple solution might be to use a map of 68K addresses -> translated addresses.
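A minimal sketch of such a map, assuming a fixed-size open-addressing hash table keyed by the 68k address. The names and the table size are made up for illustration, and removing entries is left out (with linear probing that would need tombstones or a full rebuild).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_SLOTS 1024u   /* hypothetical size; must stay 2^10 to match slot_for() */

typedef struct {
    uint32_t guest_pc;   /* 68k address of the block's first instruction */
    void *host_code;     /* entry point of the translated block, NULL = empty slot */
} MapEntry;

static MapEntry code_map[MAP_SLOTS];

/* 68k instructions are word aligned, so bit 0 carries no information.
   Multiplicative hash; the top 10 bits select one of 1024 slots. */
static uint32_t slot_for(uint32_t guest_pc)
{
    return ((guest_pc >> 1) * 2654435761u) >> 22;
}

/* Insert or update a mapping. Returns 0 on success, -1 if the table is full. */
static int map_insert(uint32_t guest_pc, void *host_code)
{
    assert(host_code != NULL);
    uint32_t i = slot_for(guest_pc);
    for (uint32_t n = 0; n < MAP_SLOTS; n++, i = (i + 1) & (MAP_SLOTS - 1)) {
        if (code_map[i].host_code == NULL || code_map[i].guest_pc == guest_pc) {
            code_map[i].guest_pc = guest_pc;
            code_map[i].host_code = host_code;
            return 0;
        }
    }
    return -1;
}

/* Returns the translated entry point, or NULL if the address has not been
   translated yet (the caller would then interpret or invoke the JIT). */
static void *map_lookup(uint32_t guest_pc)
{
    uint32_t i = slot_for(guest_pc);
    for (uint32_t n = 0; n < MAP_SLOTS; n++, i = (i + 1) & (MAP_SLOTS - 1)) {
        if (code_map[i].host_code == NULL)
            return NULL;
        if (code_map[i].guest_pc == guest_pc)
            return code_map[i].host_code;
    }
    return NULL;
}
```

A computed JSR through a library base would then do map_lookup on the target address at run time and fall back to the interpreter or the translator on a miss.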

Something much more efficient can be made, but it's quite complicated. I'd also prefer not to talk about it.
Quote:
How are you going to chain your compilation units together? How/where do you keep 68k processor state between compiled blocks? And how do you interface between interpretative and compiling emulation (do they share the same state? Or do you need to translate back and forth?).

Let's have some JITed code, and then we can see what solutions can be applied.

I'm not saying that they are not important. They are, definitely, but let's see how the project evolves. I don't think that you had addressed all of them before writing UAE's JIT and Amithlon.
Quote:
Those are the easy questions. Some of the harder ones are: How do you handle instruction cache flush instructions (which may be limited to a particular memory range, but also might not be. And even if they are --- how will you work out which of your JIT compiled units are affected by the given memory range? And what are you going to do to them --- invalidate them? If so, how will such invalidation affect the chaining?

The first thing which comes to my mind: collect all the affected addresses, and throw away all compiled blocks that cover them. Chained/dependent blocks are collected too, and patched to call the JITer instead of the discarded block(s).
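As a sketch of that idea, assuming each compiled block records the 68k range it was translated from and at most one chained successor. The names (JitBlock, add_block, flush_range) are invented, and a next value of -1 stands for "leave through the JITer/dispatcher".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BLOCKS 64

typedef struct {
    uint32_t start, end;   /* 68k range [start, end) the block was translated from */
    int valid;             /* 1 while the translated code may be used */
    int next;              /* index of the chained successor, -1 = go via the JITer */
} JitBlock;

static JitBlock blocks[MAX_BLOCKS];
static int block_count;

/* Register a translated block; returns its index. */
static int add_block(uint32_t start, uint32_t end, int next)
{
    assert(block_count < MAX_BLOCKS && start < end && next < MAX_BLOCKS);
    blocks[block_count].start = start;
    blocks[block_count].end = end;
    blocks[block_count].valid = 1;
    blocks[block_count].next = next;
    return block_count++;
}

/* Handle a cache flush for the 68k range [lo, hi): discard every block that
   overlaps it, then unchain every surviving block that jumped into a
   discarded one. Returns the number of discarded blocks. */
static int flush_range(uint32_t lo, uint32_t hi)
{
    int dropped = 0;
    for (int i = 0; i < block_count; i++) {
        if (blocks[i].valid && blocks[i].start < hi && blocks[i].end > lo) {
            blocks[i].valid = 0;
            blocks[i].next = -1;
            dropped++;
        }
    }
    for (int i = 0; i < block_count; i++) {
        int n = blocks[i].next;
        if (blocks[i].valid && n >= 0 && !blocks[n].valid)
            blocks[i].next = -1;   /* patched to call the JITer again */
    }
    return dropped;
}
```

The linear scan is only for clarity. With many blocks, an index by memory page (or something similar) would be needed to find the affected blocks quickly, and a flush without a range simply discards everything.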
Quote:
Also, how are you going to handle interrupts, if at all? And I don't think you can get away completely without --- somehow 68k needs to be preempted and the 68k context switched

Do we really need to implement interrupts for this? I mean, JITed (or interpreted) 68K code can be "interrupted" in some ways without implementing 68K Interrupts.

A simple solution: check after the execution of a compiled block, or after a certain number of them.
Quote:
(unless you run one emulation task per 68k task, each with its own 68k state. There are serious problems with that approach, too).

Let's stick with the simple case now: one process for all 68K tasks.
Quote:
Similarly, there are definitely 68k instructions which can cause exceptions --- which might happen in the middle of a compilation unit. How are you going to deal with such exceptions?

Only a few instructions can do that, and they are rare. You can generate ad hoc host instructions to check whether the exception is raised, and handle it accordingly.
Quote:
And then, there is....
...the issue of memory mapped I/O. Do NOT count on ANYTHING being consistently done in a sane manner. Sorry for shouting, but I know from personal (painful) experience that anything bizarre and convoluted you can think of (and quite a few things I, at least, couldn't think of) is being done by someone in a supposedly-system-friendly Amiga program.

I can imagine, and it would be very interesting to see what you've found. If you have the time and wish to do it, of course.
Quote:
The UAE JIT keeps track while running the interpretative emulation of which 68k instructions access only real memory, and which ones access something else, and uses that trace info while later compiling the code. Which reduces the number of issues, but is far from getting them all.
People use memcpy() to access the memory mapped I/O. They really do. So whatever you do, don't rely on things like addressing modes to detect MMIO.

(The UAE/Amithlon solution is to let the MMU catch memory accesses to things which aren't memory, and then route those accesses only through the hardware emulation.

That's my idea too. It doesn't make sense to check every accessed address: it kills performance.
Quote:
What that requires, however, is decoding the faulting host instruction and adjusting the host CPU state according to what comes out of the hardware emulation. That is a complete nightmare on x86/32 already, even for the hand-generated code which uses nothing but MOV for memory accesses, with a limited number of addressing modes, treating the x86 as a RISC-like load/store architecture. LLVM, GCCJIT or similar libraries will almost certainly generate arithmetic/logical instructions operating directly on memory, and use all possible addressing modes. Heck, the x86 can happily move from one memory location directly to another, so the same instruction can even fault twice...)

I know. It's very hard, but there are some solutions to this problem too. Quite complicated solutions, but they can also be very efficient. Anyway, intercepting hardware accesses doesn't have to be fast: that's not the purpose of this project.

BTW, I appreciated your comment a lot. Very instructive. I hope that you can share some more opinions and your great experience with UAE and Amithlon.

And one question: what do you think of Wanderer's project? Does it make sense / sound good / seem useful?

Status: Offline


Amigaworld.net was originally founded by David Doyle