Amigaworld.net - The Amiga Computer Community Portal Website

Productivity Amiga Emulation

Wanderer

Re: Productivity Amiga Emulation
Posted on 13-Jul-2015 9:46:09

[ #41 ]

Cult Member

Joined: 16-Aug-2008
Posts: 654
From: Germany

WinUAE is a beast. Just checked out the source code :-O. This would take several man-years to implement.
However, the vast majority of the code is not needed for productivity emulation. Most of it deals with hardware emulation, timings etc. E.g. we don't care how many cycles each instruction costs.

BTW, the idea is not like Wine, where the entire OS API is replaced. This would be a lot of work too. Actually this would be something like a big-endian compiled AROS with 68K emu.

What I have in mind is only to replace the timing critical things, and things that remove the necessity to emulate the actual hardware. E.g. if the Joystick/Mouse registers are not emulated, then the input.device must be replaced (or some functions of it) with direct input from the host.

_________________
--
Author of
HD-Rec, Sweeper, Samplemanager, ArTKanoid, Monkeyscript, Toadies, AsteroidsTR, TuiTED, PosTED, TKPlayer, AudioConverter, ScreenCam, PerlinFX, MapEdit, AB3 Includes and many more...
Homepage: http://www.hd-rec.de

Status: Offline

cdimauro

Re: Productivity Amiga Emulation
Posted on 13-Jul-2015 17:15:07

[ #42 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4432
From: Germany

@Wanderer

Quote:

Wanderer wrote:
WinUAE is a beast. Just checked out the source code :-O. This would take several man-years to implement.
However, the vast majority of the code is not needed for productivity emulation. Most of it deals with hardware emulation, timings etc. E.g. we don't care how many cycles each instruction costs.

Yes, and that's why I said it's easier to rewrite it than to try to adapt it.
Quote:
BTW, the idea is not like Wine, where the entire OS API is replaced. This would be a lot of work too. Actually this would be something like a big-endian compiled AROS with 68K emu.

That's something we already discussed some time ago. I suggested using a proper compiler, like the Intel one, to build a big-endian o.s. on a little-endian architecture, to solve the "communication" problems between the two different worlds. I don't know if GCC has bi-endian support.
Quote:
What I have in mind is only to replace the timing critical things, and things that remove the necessity to emulate the actual hardware. E.g. if the Joystick/Mouse registers are not emulated, then the input.device must be replaced (or some functions of it) with direct input from the host.

Why not remove the Amiga hardware entirely and replace libraries and devices with proper native code? If you have to (re)write an emulator for part of the hardware anyway, I think it's more convenient to spend a similar effort on a native platform, which would be much faster and easier to handle & maintain.

AROS has most of the things already implemented: I think it might be a good starting point.

Last edited by cdimauro on 13-Jul-2015 at 05:15 PM.

Status: Offline

Deniil715

Re: Productivity Amiga Emulation
Posted on 14-Jul-2015 10:15:54

[ #43 ]

Elite Member

Joined: 14-May-2003
Posts: 4238
From: Sweden

@Wanderer

Very good idea! I only do productivity, but it's hard to be mobile with the X1000 being a tower...

I have installed Amiga Forever on a laptop, but I did notice how it loads the CPU all the time for no reason while in WB using RTG.

This kind of emulation would be much like my old A1200 with BPPC+BVision, Prelude, FastATA: that is, it no longer supported "classic" HW-banging software because everything was retargeted. Fine!

But OS4 emulation would be the most preferable.

_________________
- Don't get fooled by my avatar, I'm not like that (anymore, mostly... maybe only sometimes)
> Amiga Classic and OS4 developer for OnyxSoft.

Status: Offline

Heinz

Re: Productivity Amiga Emulation
Posted on 14-Jul-2015 15:33:05

[ #44 ]

Regular Member

Joined: 10-Oct-2005
Posts: 212
From: Unknown

@Wanderer

Isn't AROS-hosted doing most of what you propose?
Wouldn't it be possible to create a 68k-compiled AROS hosted?

Status: Offline

cdimauro

Re: Productivity Amiga Emulation
Posted on 14-Jul-2015 19:46:56

[ #45 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4432
From: Germany

@Heinz: no, because the AROS/68K version is used under a 68K emulator (E-UAE-based), whereas Wanderer wants to remove the emulator to maximize performance and reduce battery usage when the system is (more or less) idle.

Status: Offline

Wanderer

Re: Productivity Amiga Emulation
Posted on 14-Jul-2015 21:23:13

[ #46 ]

Cult Member

Joined: 16-Aug-2008
Posts: 654
From: Germany

@Heinz

If you compile AROS for 68K, you need a 68K emulator. Just like for AmigaOS3.x.

AROS is interesting to get quick insight of what specific functions are doing internally. It is not necessarily the same for AmigaOS and AROS. So the emulator is OS dependent. I hope for AROS and AmigaOS3.x this would be the same emulator, but e.g. for OS4 or MOS, apart from the PPC emulation, I guess some changes are necessary.

Again, the idea is to emulate a virtual Amiga 68K machine, but with more or less passive support for Chipset/Hardware. It might be sufficient to not make programs crash, but probably not sufficient to work properly, e.g. Paula might stay silent, copper won't do anything.

The speedup kicks in when OS functions are directly replaced, ideally with multi core support, e.g. this is totally doable with graphics and files.

Status: Offline

cdimauro

Re: Productivity Amiga Emulation
Posted on 15-Jul-2015 5:18:31

[ #48 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4432
From: Germany

@Wanderer: using only the o.s. APIs, in theory it's also possible to process the generated copper list(s), and recreate the view (e.g.: generating the final framebuffer), like an original Amiga. That's provided that the Copper instructions only act on the display registers (colors, bitplanes, etc.) and do not access other things (Blitter, Audio). This way you do not need to fully emulate the Copper, which is quite expensive.
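A minimal sketch of that idea, assuming the copper list is already available as host-endian 16-bit words. It walks the list, applies MOVEs that target the colour registers (COLOR00 to COLOR31 at offsets 0x180 to 0x1BE) to a palette, and merely counts every other MOVE so the caller can decide whether a fallback is needed. The names `view_state_t` and `walk_copper_list` are hypothetical, and WAIT/SKIP are only skipped over here; a real implementation would record the beam position they encode to rebuild per-line changes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COLOR00 0x180u
#define COLOR31 0x1BEu

typedef struct {
    uint16_t palette[32];   /* 12-bit colour values written by the list */
    unsigned other_moves;   /* MOVEs to anything that is not a colour register */
} view_state_t;

/* Walks up to max_instr copper instructions (two words each).
 * Returns the number of instructions processed before the conventional
 * end marker (WAIT 0xFFFF,0xFFFE) or before max_instr was reached. */
size_t walk_copper_list(const uint16_t *list, size_t max_instr, view_state_t *view)
{
    size_t i;
    assert(list != NULL && view != NULL);
    for (i = 0; i < max_instr; i++) {
        uint16_t w1 = list[2 * i];
        uint16_t w2 = list[2 * i + 1];
        if ((w1 & 1u) == 0) {                     /* MOVE: bit 0 of first word is 0 */
            uint16_t reg = w1 & 0x01FEu;          /* register offset in bits 8-1 */
            if (reg >= COLOR00 && reg <= COLOR31)
                view->palette[(reg - COLOR00) / 2] = w2 & 0x0FFFu;
            else
                view->other_moves++;
        } else if (w1 == 0xFFFFu && w2 == 0xFFFEu) {
            return i;                             /* end-of-list marker */
        }
        /* otherwise WAIT or SKIP: ignored in this sketch */
    }
    return i;
}
```

If `other_moves` stays at zero (or only display registers are touched), the list could be replayed on the host without a cycle-exact Copper.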

Regarding MorphOS, it's similar to the Amiga o.s. 3.x and AROS, so something similar is possible there too, provided there is a proper PowerPC JIT. OS4 is a bit different and requires more work, but it's also feasible.

Status: Offline

Wanderer

Re: Productivity Amiga Emulation
Posted on 18-Jul-2015 17:00:07

[ #49 ]

Cult Member

Joined: 16-Aug-2008
Posts: 654
From: Germany

BTW, is anyone in the possession of the 68881/2 FPU opcodes?

Like for the 680x0 opcodes here, this would be ideal.

http://oldwww.nvg.ntnu.no/amiga/MC680x0_Sections/index.HTML

Status: Offline

itix

Re: Productivity Amiga Emulation
Posted on 18-Jul-2015 18:30:05

[ #50 ]

Elite Member

Joined: 22-Dec-2004
Posts: 3398
From: Freedom world

@cdimauro

Quote:

@Heinz: no, because the AROS/68K version is used under a 68K emulator (E-UAE-based), whereas Wanderer wants to remove the emulator to maximize performance and reduce battery usage when the system is (more or less) idle.

Then the best approach would be the one used in Basilisk II. It uses a stripped-down UAE core where only the CPU is emulated and everything else was thrown away. AFAIK it is not an unreasonable amount of work to strip it down and also keep it in sync with upstream.

_________________
Amiga Developer
Amiga 500, Efika, Mac Mini and PowerBook

Status: Offline

cdimauro

Re: Productivity Amiga Emulation
Posted on 18-Jul-2015 19:14:49

[ #51 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4432
From: Germany

@Wanderer: you find everything here.

@itix: then Basilisk II can be a starting point.

Status: Offline

Wanderer

Re: Productivity Amiga Emulation
Posted on 19-Jul-2015 8:24:04

[ #52 ]

Cult Member

Joined: 16-Aug-2008
Posts: 654
From: Germany

@cdimauro

Thanks, exactly what I needed...

I'll start with an opcode-mask table to generate the real lookup table mapping each 16-bit opcode to ASM/C code.
The only annoying thing is the 32-bit opcodes. A flat table covering all 32 bits (2^32 entries) is not practical, so I need a hierarchical table, which costs time on lookup. (This is needed for a few instructions only, like DIV?.L or FPU.)
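A minimal sketch of such a hierarchical lookup, assuming host-endian opcode words. The first level is indexed by the 16-bit opcode; entries that need the extension word point to a small second-level table. DIVU.L/DIVS.L share the first word 0x4C40 | ea and are told apart by bit 11 of the extension word. All names (`entry_t`, `dispatch`, the `op_*` handlers) are hypothetical, the handlers only return a tag so the dispatch can be observed, and invalid EA modes are not filtered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*handler_fn)(uint16_t op, uint16_t ext);

/* Placeholder handlers: each returns a distinct tag. */
static int op_illegal(uint16_t op, uint16_t ext) { (void)op; (void)ext; return 0; }
static int op_nop(uint16_t op, uint16_t ext)     { (void)op; (void)ext; return 1; }
static int op_divu_l(uint16_t op, uint16_t ext)  { (void)op; (void)ext; return 2; }
static int op_divs_l(uint16_t op, uint16_t ext)  { (void)op; (void)ext; return 3; }

typedef struct {
    handler_fn        fn;     /* used when sub == NULL */
    const handler_fn *sub;    /* second-level table, or NULL */
    uint8_t           shift;  /* extension-word bits that select the sub-entry */
    uint8_t           mask;
} entry_t;

static entry_t level1[65536];
static const handler_fn div_long[2] = { op_divu_l, op_divs_l };

void init_tables(void)
{
    for (size_t i = 0; i < 65536; i++)
        level1[i].fn = op_illegal;
    level1[0x4E71].fn = op_nop;                 /* NOP */
    for (uint16_t ea = 0; ea < 64; ea++) {      /* DIVx.L <ea>: 0x4C40..0x4C7F */
        entry_t *e = &level1[0x4C40 | ea];
        e->sub   = div_long;
        e->shift = 11;                          /* bit 11: 0 = unsigned, 1 = signed */
        e->mask  = 1;
    }
}

/* Only opcodes with a second level pay for the extra fetch and lookup. */
int dispatch(const uint16_t *pc)
{
    assert(pc != NULL);
    uint16_t op = pc[0];
    const entry_t *e = &level1[op];
    if (e->sub == NULL)
        return e->fn(op, 0);
    uint16_t ext = pc[1];
    return e->sub[(ext >> e->shift) & e->mask](op, ext);
}
```

The extra cost is one NULL check for ordinary opcodes and one more indexed load for the few 32-bit ones.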

Status: Offline

cdimauro

Re: Productivity Amiga Emulation
Posted on 19-Jul-2015 9:29:32

[ #53 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4432
From: Germany

@Wanderer: don't underestimate hierarchical tables. For my 8086 emulator (written in Python) I used one because it fit the ISA model perfectly (8-bit opcodes), and I think something similar can be done for the 68K without hurting performance too much, or even with a gain.

Consider that a large opcode table cannot stay in the L1 data cache, so the processor has to go at least to the L2, which is considerably slower. And if the L2 cache isn't big, it gets worse (an L3 access if you're lucky, or... main memory!). All of that also affects code execution, because the L2 (and L3) cache is shared with code, so a lack of space for code can hurt performance a lot in the end.

Small look-up tables, on the other hand, require a hierarchical approach, which creates dependencies between successive look-ups. But such dependencies should have a much lower impact, because the processor is stalled for only a few pipeline stages (the L1 data cache latency plus the execution finalization/write stage of the previous instruction). And in the end all the small look-up tables can stay in the L1 data cache, or at most in part of the L2, leaving plenty of space for other tables (function pointers to the real code) or for the code itself.

So I suggest the second approach. Try a 10-bit opcode table (indexed by the topmost 10 bits of the opcode) that returns a 16-bit word; meanwhile you can mask out the lowest 6 bits, which can be passed as an argument (usually the EA) to the function being called.

Just a rough example:

[code]
# The split_opcode table returns a 16-bit value.
# The low byte is the "macro-opcode" to be executed.
# The high byte is the parameter of the macro-opcode (e.g. a finer "qualifier").
splitted_opcode = split_opcode[opcode >> 6]

ea = opcode & 63
size = (opcode >> 6) & 3
macro_opcode = splitted_opcode & 255
argument = splitted_opcode >> 8

# opcode_table is a table of function pointers
opcode_table[macro_opcode](argument, size, ea)  # Calls the proper function
[/code]

Inside the proper function you can handle your special cases only when needed (e.g.: implicit instructions, line-A, etc.), leaving the most common/regular code to the normally executed instructions. You can handle the 32-bit opcodes here, of course, using other look-up tables if needed.

I think that such code is extremely cache-friendly, and takes into account also the processor out-of-order execution model (e.g.: calculating the parameters is something that can be executed while the processor is busy with the memory table access).

Status: Offline

Wanderer

Re: Productivity Amiga Emulation
Posted on 19-Jul-2015 10:06:40

[ #54 ]

Cult Member

Joined: 16-Aug-2008
Posts: 654
From: Germany

@cdimauro

I have a 16bit table now. The idea is to roll out the ea so there are only constants, no variables.

e.g. there is a separate function for each of those (simplified, no flags etc.):

[code]
void ADD_D0_D0(void) { D0 += D0; }
void ADD_D1_D0(void) { D0 += D1; }
void ADD_D2_D0(void) { D0 += D2; }
...
void ADD_D7_D7(void) { D7 += D7; }
[/code]

You mean, e.g.

[code]
void ADD_Dn_Dn (int opcode) {
    int src_reg = MASK_OUT_SRCREG(opcode);
    int dst_reg = MASK_OUT_DSTREG(opcode);
    D[dst_reg] += D[src_reg];  /* destination accumulates the source, as in ADD_D1_D0 above */
}
[/code]

* D[] is the emulator object's variable for the data registers

may be faster? It would also mean to mask and shift the register number out of the opcode.
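For reference, a minimal sketch of what the masking and shifting could look like for the register-to-register long form, ADD.L Dy,Dx. In the 68K encoding of ADD <ea>,Dn the destination register sits in bits 11-9 and, for data-register-direct mode, the source register in bits 2-0. The macro bodies, the `cpu_t` struct and the function name are assumptions made for illustration; flags are left out, as in the simplified examples above.

```c
#include <assert.h>
#include <stdint.h>

#define MASK_OUT_DSTREG(op) (((op) >> 9) & 7)   /* bits 11-9 */
#define MASK_OUT_SRCREG(op) ((op) & 7)          /* bits 2-0  */

typedef struct { uint32_t D[8]; } cpu_t;        /* data registers only */

/* ADD.L Dy,Dx: 1101 xxx 010 000 yyy (flags omitted in this sketch) */
void ADD_L_Dn_Dn(cpu_t *cpu, uint16_t opcode)
{
    assert((opcode & 0xF1F8) == 0xD080);        /* reject anything else */
    unsigned src = MASK_OUT_SRCREG(opcode);
    unsigned dst = MASK_OUT_DSTREG(opcode);
    cpu->D[dst] += cpu->D[src];
}
```

That is one shift and two ANDs per instruction, against 64 separate functions for this single form when the registers are rolled out as constants.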

On the other hand, I am aiming for JIT and not interpreted code. The interpreted code is only to get things running.
My idea for the JIT is to translate the opcodes to LLVM code and then let LLVM do the optimization. This would also make it easy to switch the target CPU architecture, e.g. x86, x86-64, ARM, PPC etc. What do you think? Or does a JIT have to translate directly to opcodes of the target CPU? I can imagine I would do a poor job compared to LLVM.

Last edited by Wanderer on 19-Jul-2015 at 10:13 AM.

Status: Offline

cdimauro

Re: Productivity Amiga Emulation
Posted on 19-Jul-2015 14:01:14

[ #55 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4432
From: Germany

@Wanderer

Quote:

Wanderer wrote:
@cdimauro

I have a 16bit table now. The idea is to roll out the ea so there are only constants, no variables.

e.g. there is a separate function for each of those (simplified, no flags etc.):

[code]
void ADD_D0_D0(void) { D0 += D0; }
void ADD_D1_D0(void) { D0 += D1; }
void ADD_D2_D0(void) { D0 += D2; }
...
void ADD_D7_D7(void) { D7 += D7; }
[/code]

You mean, e.g.

[code]
void ADD_Dn_Dn (int opcode) {
    int src_reg = MASK_OUT_SRCREG(opcode);
    int dst_reg = MASK_OUT_DSTREG(opcode);
    D[dst_reg] += D[src_reg];  /* destination accumulates the source, as in ADD_D1_D0 above */
}
[/code]

* D[] is the emulator object's variable for the data registers

may be faster? It would also mean to mask and shift the register number out of the opcode.

Yes, the idea is like that, but I use three levels of hierarchy to achieve the same result. For a generic binary operation, it'll be something like:

[code]
void BINARY_Dn_EA_8 (int operation, int source_reg, int dest_ea) {
    /* Resolve the destination effective address to a host pointer */
    char *address = decode_dest_ea_8(dest_ea);
    binary_8_execute[operation](registers[source_reg], address[0], address);
}
[/code]

binary_8_execute is a table of function pointers to the real code of the operation to be executed. For the generic addition of two 8-bit values, the function will look like this:

[code]
void add_8bit (char source1, char source2, char *destination) {
    /* Save the operands and the result: flags are computed later from them */
    flags_first_operand = source1;
    flags_second_operand = source2;
    flags_result = source1 + source2;

    /* In C the 0xff mask is redundant: storing to a char already truncates */
    destination[0] = flags_result & 0xff;

    flags_operation = FLAGS_ADD8;
}
[/code]

Quite complex, right? Yes, because I prefer to aggregate code instead of writing tens of thousands of functions, trying to fit as much as possible in the L1 and L2 caches. To achieve this I make 3 function calls (BINARY_Dn_EA_8, decode_dest_ea_8, and add_8bit) whereas you make just one.

However, pay attention to the biggest problem in writing an emulator, which you haven't covered yet: flags handling. It's a NIGHTMARE! The best compromise I found is NOT to calculate them at every operation; otherwise it will kill performance for sure: the function I use for the full flags calculation is the biggest and most complex one. What I do is quite simple: I store the original operands, the result of the operation, and the "rough" operation that was executed. Only when I need the flags (or only some of them) do I call the ad hoc function.
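A minimal sketch of this deferred flag evaluation for an 8-bit ADD, assuming the state lives in a small struct instead of globals. The names `lazy_flags_t` and `flag_*` are hypothetical; only the store-operands-now, compute-later idea comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { FLAGS_NONE, FLAGS_ADD8 } flag_op_t;

typedef struct {
    flag_op_t op;       /* which "rough" operation ran last */
    uint8_t   a, b;     /* its original operands */
    uint16_t  result;   /* kept wider than 8 bits so the carry survives */
} lazy_flags_t;

/* Executes the addition and only records what is needed for the flags. */
uint8_t add_8bit(lazy_flags_t *f, uint8_t a, uint8_t b)
{
    f->op = FLAGS_ADD8;
    f->a = a;
    f->b = b;
    f->result = (uint16_t)(a + b);
    return (uint8_t)f->result;
}

/* Each flag is derived on demand from the recorded state. */
bool flag_carry(const lazy_flags_t *f)
{
    assert(f->op == FLAGS_ADD8);
    return (f->result >> 8) & 1;
}

bool flag_zero(const lazy_flags_t *f)
{
    assert(f->op == FLAGS_ADD8);
    return (uint8_t)f->result == 0;
}

bool flag_negative(const lazy_flags_t *f)
{
    assert(f->op == FLAGS_ADD8);
    return (f->result & 0x80) != 0;
}

/* Signed overflow: both operands differ in sign from the result. */
bool flag_overflow(const lazy_flags_t *f)
{
    assert(f->op == FLAGS_ADD8);
    return ((f->a ^ f->result) & (f->b ^ f->result) & 0x80) != 0;
}
```

A run of arithmetic instructions then costs four stores each, and the flag logic runs only when a conditional instruction actually reads a flag.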

I prefer my implementation because it's much easier to handle and experiment with, and it should also give good enough performance (but I currently lack a C implementation; I use only Python for my 8086 emulator).

Looking at yours, with 65536 functions (to cover the full 16-bit opcode space) you need 256KB (32-bit host architecture) or 512KB (64-bit host architecture) just for the function pointer table, which is huge. Basically, you're eating (almost) all of L1 and L2 (the latter is typically 256 or 512KB per core, or per core cluster).
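The arithmetic behind those figures, as a small sketch. The flat table holds one pointer per 16-bit opcode; the split layout assumes the 10-bit table of 16-bit entries suggested earlier in the thread plus at most 256 macro-opcode pointers (the low byte of each entry). The function names are hypothetical, and the pointer size is passed in so both host widths can be compared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flat table: one function pointer per 16-bit opcode. */
size_t flat_table_bytes(size_t ptr_size)
{
    assert(ptr_size > 0);
    return (size_t)65536 * ptr_size;
}

/* Split tables: 1024 16-bit entries plus up to 256 macro-opcode pointers. */
size_t split_table_bytes(size_t ptr_size)
{
    assert(ptr_size > 0);
    return (size_t)1024 * sizeof(uint16_t) + (size_t)256 * ptr_size;
}
```

With 8-byte pointers that is 524288 bytes for the flat table against 4096 bytes for the split one, and the latter fits comfortably in a typical L1 data cache.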

It also means you need a lot of space for the code of the 65536 functions, since even a very common operation like MOVE changes the flags (not all of them, unfortunately! Partial flag updates can be a nightmare too!), so you need to calculate them or at least defer them (my preferred solution), and that takes code (possibly a lot).

Summing up, the risk is that, even if you execute just one function per opcode, the final execution is slow because you need to fetch data and/or code from the L3 (if you're lucky) or from memory (and I'm not even counting memory operand accesses). Such expanded code is also difficult to manage (you'll end up using a lot of macros, which are hard to debug).

So I invite you to use a hierarchical approach, at least at the beginning, just to have something working in less time. You can always experiment with unrolling one or two levels of the hierarchy later, once you have time.

Especially if you plan to use a JIT, as you stated, where the biggest advantage lies in executing the JITed code and not in an ultra-fast decoder.
Quote:
On the other hand, I am aiming for JIT and not interpreted code. The interpreted code is only to get things running.

Exactly, so don't spend too much time trying to optimize the interpreter, also because the final execution time can end up worse.

BTW: for a project like yours, having only an interpreter makes no sense. There are already emulators which do that, and which provide a JIT as well. For your project to succeed you have to offer something more: a much better JIT, which gives greater performance for applications.
Quote:
My idea for the JIT is to translate the opcodes to LLVM code and then let LLVM do the optimization. This would also make it easy to switch the target CPU architecture, e.g. x86, x86-64, ARM, PPC etc. What do you think? Or does a JIT have to translate directly to opcodes of the target CPU? I can imagine I would do a poor job compared to LLVM.

With LLVM you quickly get a good JIT, so I recommend using it.

But if you want to get the most out of it, I think an architecture-specific JITer can give you much better results. Of course, you need to spend much more time. A JIT isn't simply: "I take one 68K instruction and I generate the corresponding host instructions".

Anyway, it's something you can think about once you have a stable and working platform. There's always room for imagination and "dirty" ideas to come up and be implemented.

Status: Offline

Belxjander

Re: Productivity Amiga Emulation
Posted on 19-Jul-2015 14:26:27

[ #56 ]

Cult Member

Joined: 4-Jan-2005
Posts: 557
From: Chiba prefecture Japan

@Wanderer

68K reference documentation in PDF form from the manufacturer as Deep links,

Use this location *WITH* a filename added...
http://www.freescale.com/files/archives/doc/ref_manual/

Document Filenames are *EXACTLY* as follows...
M68000PRM.pdf
MC68020UM.pdf
MC68030UM.pdf
MC68040UM.pdf
MC68060UM.pdf
MC68881UM.pdf

Does this satisfy any 68K documentation requirements?

If the above links don't work...PM me and we can work out transferring the pdf's you need.

EDIT: added the FPU co-processor document name

Last edited by Belxjander on 19-Jul-2015 at 02:35 PM.

Status: Offline

cdimauro

Re: Productivity Amiga Emulation
Posted on 19-Jul-2015 14:41:27

[ #57 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4432
From: Germany

@Belxjander: unfortunately they don't work (404 - Page Not Found ).

You need to use the cache (base) link for it:

http://cache.freescale.com/files/32bit/doc/ref_manual/

Status: Offline

Wanderer

Re: Productivity Amiga Emulation
Posted on 19-Jul-2015 20:31:41

[ #58 ]

Cult Member

Joined: 16-Aug-2008
Posts: 654
From: Germany

@cdimauro

Thanks for your advice, I appreciate it.

Actually what I do is 2 steps:

1. I have written an opcode decoder that outputs C code and runs at build time. It creates an include file that is included in the actual emulator. This way I don't have to use macros; the code for each opcode is explicitly written out.
However, I can still change hundreds of functions at once if I change the template.

2. The generated code is included in the emulator and a LUT points to it. The functions are parameter-less, since they know everything they need. That this is not cache friendly is a valid point.

Storing the input/output of each instruction rather than calculating the flags is an option. Let's see.
For the JIT this is not an issue. All calculations that get overwritten without a READ should be removed by the optimizer.
My idea is to generate LLVM very naively, just walk through the code like the interpreted mode would do. Then, LLVM can optimize unnecessary calculations away and re-order the instructions for better pipelining. Cache should not be an issue anymore since no LUT is involved during execution.
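A minimal sketch of what step 1 could look like for a single template, assuming the simplified ADD Dy,Dx form shown earlier in the thread. The generator writes one parameter-less function per register pair; changing the format string changes all 64 functions at once. The function names and the exact output format are hypothetical, not the actual generator.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats one generated function into buf; returns the snprintf result. */
int emit_add_dn_dn(char *buf, size_t size, int src, int dst)
{
    assert(src >= 0 && src < 8 && dst >= 0 && dst < 8);
    return snprintf(buf, size,
                    "void ADD_D%d_D%d(void) { D%d += D%d; }\n",
                    src, dst, dst, src);
}

/* Writes all 64 variants to out (e.g. the include file); returns the count. */
int emit_all_add_dn_dn(FILE *out)
{
    char line[64];
    int count = 0;
    for (int dst = 0; dst < 8; dst++) {
        for (int src = 0; src < 8; src++) {
            emit_add_dn_dn(line, sizeof line, src, dst);
            fwrite(line, 1, strlen(line), out);
            count++;
        }
    }
    return count;
}
```

Calling `emit_all_add_dn_dn` on a file opened for writing produces the text that the emulator would then `#include`.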

Status: Offline

fishy_fis

Re: Productivity Amiga Emulation
Posted on 20-Jul-2015 0:16:24

[ #59 ]

Elite Member

Joined: 29-Mar-2004
Posts: 2170
From: Australia

@Wanderer

I'm not sure I understand the point.
Aren't OS4, MOS, or Amithlon pretty much what you're proposing: a 68k-compatible system leaning towards system-friendly software?
Granted, OS4 is crazy expensive in a bang-per-buck sort of way, but both MOS and Amithlon are pretty affordable.
My current Amithlon machine is a Core 2 Quad @ 4.4GHz, which provides more processing power when running 68k software than any PPC system running PPC code, so it's not like there's a shortage of available grunt. On top of that it can also run big endian x86 code, providing even more grunt if a person wants it.

Status: Offline

cdimauro

Re: Productivity Amiga Emulation
Posted on 20-Jul-2015 5:16:33

[ #60 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4432
From: Germany

@Wanderer: just pay attention to instructions that only partially update the flags. Like MOVE which, unfortunately, leaves X as is. And MOVE is the most used instruction, often followed by a Bcc instruction (so you need to evaluate the flags there).
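A minimal sketch of that partial update for MOVE.L, using the 68K CCR bit layout (C = bit 0, V = bit 1, Z = bit 2, N = bit 3, X = bit 4). MOVE sets N and Z from the moved value, clears V and C, and leaves X untouched. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    CCR_C = 1 << 0,
    CCR_V = 1 << 1,
    CCR_Z = 1 << 2,
    CCR_N = 1 << 3,
    CCR_X = 1 << 4
};

/* Returns the CCR after MOVE.L of `value`: X is preserved, the rest rewritten. */
uint8_t ccr_after_move_l(uint8_t ccr, uint32_t value)
{
    assert((ccr & ~0x1F) == 0);      /* only the five defined bits may be set */
    ccr &= CCR_X;                    /* keep X; clear N, Z, V and C */
    if (value == 0)
        ccr |= CCR_Z;
    if (value & 0x80000000u)
        ccr |= CCR_N;
    return ccr;
}

/* Condition tested by a following BEQ. */
bool beq_taken(uint8_t ccr)
{
    return (ccr & CCR_Z) != 0;
}
```

With deferred flags, this means the X bit has to be carried over from whatever operation last defined it, separately from the operands recorded for N and Z.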

@fishy_fis: the only software which is similar to Wanderer's project is Amithlon, but he wants to create something more os and architecture friendly. With a better JIT, of course.

Status: Offline


Amigaworld.net was originally founded by David Doyle