Amigaworld.net - The Amiga Computer Community Portal Website

Assembly startup codes for ECX compiler in VAsm?

Samurai_Crow

Assembly startup codes for ECX compiler in VAsm?
Posted on 23-Jan-2020 20:45:11

[ #1 ]

Elite Member

Joined: 18-Jan-2003
Posts: 2320
From: Minnesota, USA

I'm in the process of starting a new fork of ECX, the AmigaE compiler for PPC and 68020+. I've run into a snag. The startup codes are bootstrapped in Assembly and I don't know what the settings should be for MorphOS and OS4 on VAsm.

I've already gotten the PAsm PPC sources massaged into VAsmPPC format and the AsmOne sources massaged into VAsm68k. What settings do I need to get the right object file formats? I know OS4 uses an ELF derivative. I think MorphOS does too for that matter. Yet the startup codes are different for OS4 and MorphOS.


Templario

Re: Assembly startup codes for ECX compiler in VAsm?
Posted on 23-Jan-2020 20:48:47

[ #2 ]

Elite Member

Joined: 22-Jun-2004
Posts: 3678
From: Unknown

@Samurai_Crow
What is VAsm and PAsm?


Samurai_Crow

Re: Assembly startup codes for ECX compiler in VAsm?
Posted on 23-Jan-2020 22:34:29

[ #3 ]

Elite Member

Joined: 18-Jan-2003
Posts: 2320
From: Minnesota, USA

@Templario

They are assemblers: they convert assembly source (PPC in the case of PAsm; PPC, 68k and others in the case of VAsm) into object files. Assembly is one step up from the raw binary.


jPV

Re: Assembly startup codes for ECX compiler in VAsm?
Posted on 24-Jan-2020 6:18:48

[ #4 ]

Cult Member

Joined: 11-Apr-2005
Posts: 840
From: .fi

@Samurai_Crow

Would this help: https://library.morph.zone/An_Introduction_to_MorphOS_PPC_Assembly

_________________
- The wiki based MorphOS Library - Your starting point for MorphOS
- Software made by jPV^RNO


Samurai_Crow

Re: Assembly startup codes for ECX compiler in VAsm?
Posted on 24-Jan-2020 8:47:07

[ #5 ]

Elite Member

Joined: 18-Jan-2003
Posts: 2320
From: Minnesota, USA

@jPV

Thanks! Yes it does! The tutorial even uses VAsm!


Hypex

Re: Assembly startup codes for ECX compiler in VAsm?
Posted on 7-Aug-2021 16:39:38

[ #6 ]

Elite Member

Joined: 6-May-2007
Posts: 11351
From: Greensborough, Australia

@Samurai_Crow

How are you going with your fork? You did well to convert it to VAsm. I had a go but gave up as I didn't understand the errors. I've never done PPC ASM like 68K but it was supposed to replace PAsm somehow. I didn't try the 68K ASM file. I did try gas with OS4 but it vomited like vasm did and I didn't know what switches it needed.


matthey

Re: Assembly startup codes for ECX compiler in VAsm?
Posted on 8-Aug-2021 1:26:42

[ #7 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2736
From: Kansas

Hypex Quote:

How are you going with your fork? You did well to convert it to VAsm. I had a go but gave up as I didn't understand the errors. I've never done PPC ASM like 68K but it was supposed to replace PAsm somehow. I didn't try the 68K ASM file. I did try gas with OS4 but it vomited like vasm did and I didn't know what switches it needed.

The MorphOS assembler file isn't going to work in AmigaOS 4. While ELF is supported in AmigaOS 4, it is tricky to use as ELF is a Unix format which defaults to loading at a set address instead of using a relocatable scatter loader. It may be easier to use the extended hunk format (EHF). Just take the non-libc PPC hello world written by Frank Wille from the following link and name the text file HelloWorld.s.

https://wiki.amigaos.net/wiki/The_Hacking_Way:_Part_1_-_First_Steps#Writing_programs_in_assembler

The AmigaOS 4 devs were kind enough to use Frank Wille's PPC hello world assembler code for vasm and convert it to GCC without giving instructions for assembling with vasm. Typical arrogant bias. It should be possible to assemble directly to an EHF PPC program without an object file.

vasmppc_std -Fhunkexe -o ram:HelloWorld HelloWorld.s

The file ram:HelloWorld would then be the AmigaOS 4 executable.

It is funny how the MorphZone MorphOS PPC assembly example DOS OpenLibrary code takes 12 instructions and 48 bytes while the equivalent 68k code takes 4 instructions and 14 bytes. The PPC code takes 3 times as many instructions and is ~3.43 times larger. Then the MorphZone article author tries to justify the discrepancy with the following comment.

Quote:

One could be forgiven for thinking that the PPC code snippet looks a little ungainly compared to the former. Whereas every PPC instruction is four bytes long, 68k instructions can be as little as two bytes but up to as many as ten. A single 68k instruction can load a value from a 32 bit memory address specified by one instruction operand and store it at another 32 bit memory address specified by the second instruction operand. The 68k can also perform other operations beyond simple loads and stores directly on memory. Or at least appear to... In truth, computer memory only sends and receives data - no other data processing (like adding, subtracting, etc.) occurs in memory. While the 68k instruction add.l #$12345678,(a0) appears to add the immediate value of its first operand to whatever may already be stored at the address pointed to by a0, the contents of that address are actually loaded from memory into a private work register, the addition is performed and then the result is stored back to the same memory location. So this instruction actually performs two memory accesses where it might appear that there was only one. Contrast this with PPC assembly programming where memory loads and stores are all done explicitly. Before data can be operated upon it must be loaded from memory into a GPR (General Purpose Register), zero (in the case of a simple memory copy) or more operations can then be performed and then the data may be stored back to memory.

The "add.l #$12345678,(a0)" read-modify-write example does do a read and write access but the write can usually be written to the write buffer and the core need not wait for it to complete. This instruction executes in 1 cycle on the 68060 when the data pointed to is in the DCache and the write back to cache/memory can be deferred in the write buffer queue until it is convenient to write. If the 68k did not support RMW instructions, the following code would be necessary.

move.l (a0),d0 ; this read occurs no sooner than with RMW
add.l #$12345678,d0
move.l d0,(a0) ; this write occurs later
---
We now have 3 dependent instructions to execute instead of 1, our code increased in size from 6 bytes to 10 bytes and we waste a GP register. PPC needs another dependent instruction to load the 32 bit immediate (due to lack of immediate encoding bits in a fixed 32 bit encoding) and all instructions are 4 bytes so the code now takes 16 bytes. The PPC code is 4 times the number of instructions and ~2.67 times larger than the 68k code while an extra GP register is wasted and the write is delayed longer. This example was supposed to make the 68k look bad? lol
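As a rough check of the arithmetic above, the counts can be tallied in a few lines of Python. The sizes are the ones quoted in this post; the variable names are illustrative, and the PPC sequence is assumed to be one load, two instructions that apply the 32-bit immediate in 16-bit halves, and one store.

```python
# Code size of "add a 32-bit immediate to a memory location",
# using the figures quoted in the post. Names are illustrative only.

M68K_RMW_INSNS = 1
M68K_RMW_BYTES = 2 + 4        # add.l #$12345678,(a0): opcode word + 32-bit immediate

# Load/store style on the 68k:
#   move.l (a0),d0 (2 bytes), add.l #imm,d0 (6 bytes), move.l d0,(a0) (2 bytes)
M68K_SPLIT_INSNS = 3
M68K_SPLIT_BYTES = 2 + 6 + 2

# Assumed PPC sequence: load, two immediate instructions, store (4 bytes each)
PPC_INSNS = 4
PPC_BYTES = PPC_INSNS * 4

insn_ratio = PPC_INSNS / M68K_RMW_INSNS
size_ratio = PPC_BYTES / M68K_RMW_BYTES

print(M68K_SPLIT_BYTES, PPC_BYTES, insn_ratio, round(size_ratio, 2))
```

The same tally reproduces the 6 to 10 byte growth for the split 68k version and the roughly 2.67 times larger PPC version.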


kolla

Re: Assembly startup codes for ECX compiler in VAsm?
Posted on 8-Aug-2021 3:09:57

[ #8 ]

Elite Member

Joined: 20-Aug-2003
Posts: 3473
From: Trondheim, Norway

@matthey

I am quite confident that “making 68k look bad” was NOT on their agenda.

_________________
B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC


matthey

Re: Assembly startup codes for ECX compiler in VAsm?
Posted on 8-Aug-2021 7:47:23

[ #9 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2736
From: Kansas

kolla Quote:

I am quite confident that “making 68k look bad” was NOT on their agenda.

Ancient Chinese marketing proverbs:

"If you can't make your product look good then try to make the competition's product look bad."

"Know your competition's products as well as your own."

"Sell product first let customer try to use product later."


kolla

Re: Assembly startup codes for ECX compiler in VAsm?
Posted on 8-Aug-2021 8:18:08

[ #10 ]

Elite Member

Joined: 20-Aug-2003
Posts: 3473
From: Trondheim, Norway

@matthey

Looks more like US proverbs to me, but anyhow, there never was any “competition” to begin with, so what you’re saying is nonsense.

_________________
B5D6A1D019D5D45BCC56F4782AC220D8B3E2A6CC


jPV

Re: Assembly startup codes for ECX compiler in VAsm?
Posted on 9-Aug-2021 9:13:59

[ #11 ]

Cult Member

Joined: 11-Apr-2005
Posts: 840
From: .fi

Quote:

matthey wrote:
kolla Quote:

I am quite confident that “making 68k look bad” was NOT on their agenda.

Ancient Chinese marketing proverbs:

"If you can't make your product look good then try to make the competition's product look bad."

"Know your competition's products as well as your own."

"Sell product first let customer try to use product later."

Just a reminder that the linked article in the MorphOS Library wiki was written by an individual user who isn't part of the MorphOS Team. It's a public, user-driven wiki. So everyone should be careful with the word "them", or with talking as if it came from the product authors, to avoid putting words into mouths where they don't belong ;)

_________________
- The wiki based MorphOS Library - Your starting point for MorphOS
- Software made by jPV^RNO


Hypex

Re: Assembly startup codes for ECX compiler in VAsm?
Posted on 9-Aug-2021 15:51:04

[ #12 ]

Elite Member

Joined: 6-May-2007
Posts: 11351
From: Greensborough, Australia

@matthey

Quote:
The MorphOS assembler file isn't going to work in AmigaOS 4. While ELF is supported in AmigaOS 4, it is tricky to use as ELF is a Unix format which defaults to loading at a set address instead of using a relocatable scatter loader. It may be easier to use the extended hunk format (EHF). Just take the non-libc PPC hello world written by Frank Wille from the following link and name the text file HelloWorld.s.

The ASM source in question is a generic PPC source file designed to compile on both OS4 and MOS. But, I didn't know you could write an ASM source, designed to assemble into an object, that would refuse to assemble because of the OS the assembler is running on.

The details are vague, but the intended output format is in fact hunk. The compiler is designed to load in the PPC code embedded in 68K hunk format which PAsm can output. The same sources are used for MOS and OS4. OS4 is selected with a define.

Quote:
The AmigaOS 4 devs were kind enough to use Frank Wille's PPC hello world assembler code for vasm and convert it to GCC without giving instructions for assembling with vasm. Typical arrogant bias. It should be possible to assemble directly to an EHF PPC program without an object file. vasmppc_std -Fhunkexe -o ram:HelloWorld HelloWorld.s

Well, the point was to work with the OS4 SDK, which has GCC included. Including a vasm guide would have complicated it, with an extra source and instructions needed. A fresh example would have been best in this case.

Quote:
It is funny how the MorphZone MorphOS PPC assembly example DOS OpenLibrary code takes 12 instructions and 48 bytes while the equivalent 68k code takes 4 instructions and 14 bytes. The PPC code takes 3 times as many instructions and is ~3.43 times larger. Then the MorphZone article author tries to justify the discrepancy with the following comment.

They didn't store or check the result in either case. But, another thing we have to factor in here is 68K ABI emulation. This is a MorphOS ABOX call. And that emulates the 68K ABI in the function process. I can't say why they wanted to emulate the ABI, which was fully register based, and store all the parameters in a memory array, on a CPU that has double the registers. But they did, perhaps to emulate the 68k ABI, so porting was more direct. Amiga code is more compatible with MOS than OS4 in C.

The OS4 ASM would likely be fewer instructions for opening DOS, even though the OS4 ABI gets criticised for being doubly indirected in function calls. I count 7 instructions for OS4 including the unrolling of the CALLOS macro. But, to be fair and exact to the 68K example, ExecBase must be loaded. On OS4 the IExec interface also needs loading, so that is more code. This causes bloat. Getting IExec is another 7 instructions without checking results, like the example. Then after DosBase, IDos is needed, which is another 9 instructions. The equivalent OS4 code needs about 27 instructions, including checks. The edge it had over MOS is just lost. But once everything is loaded it can gain again on library calls.

But just to add, in some cases Exec is preloaded. I can't find any info on the _start register layout. You'd think such a search would be easy. It's provided in interrupts. There should be no need to load it directly from 0x4 like the ASM example is doing. Technically this means the first thing an OS4 program does is crash to get the ExecBase pointer. But on both 68K and OS4, Exec should have been handed to each program on startup.

If we want to add insult to injury then an AROS x86 library call example is in order!

Quote:
The "add.l #$12345678,(a0)" read-modify-write example does do a read and write access but the write can usually be written to the write buffer and the core need not wait for it to complete. This instruction executes in 1 cycle on the 68060 when the data pointed to is in the DCache and the write back to cache/memory can be deferred in the write buffer queue until it is convenient to write. If the 68k did not support RMW instructions, the following code would be necessary.

move.l (a0),d0 ; this read occurs no sooner than with RMW
add.l #$12345678,d0
move.l d0,(a0) ; this write occurs later

That's almost exactly what PPC needs to do. Another thing to take into account is that on 68K the operation is atomic. On PPC it must be split up, so it is not atomic and not thread safe. So on PPC that must be taken into account.

Quote:
We now have 3 dependent instructions to execute instead of 1, our code increased in size from 6 bytes to 10 bytes and we waste a GP register. PPC needs another dependent instruction to load the 32 bit immediate (due to lack of immediate encoding bits in a fixed 32 bit encoding) and all instructions are 4 bytes so the code now takes 16 bytes. The PPC code is 4 times the number of instructions and ~2.67 times larger than the 68k code while an extra GP register is wasted and the write is delayed longer. This example was supposed to make the 68k look bad? lol

I didn't exactly read it as trying to make the 68K look bad. It points out the differences. It does look like a negative on PPC as data must be loaded from memory first and into a user GPR. But, regardless of what separate operations are performed in the process, 68K will always use less code.

There's no way around needing to load 32 bits in two halves on PPC in code, taking up 8 bytes for the whole process. But one trick may be to store a 32-bit value before the prologue, so one instruction could load it in, and fast, as it would be in the CPU instruction cache. The same could be done with a load from data, but then it needs to be in the CPU data cache. So in code is better, I think. Now, the entire "payload" will still be 8 bytes, so the process may outweigh the gain. However, one instruction instead of two may give a small speed increase.
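To make the "two halves" concrete, here is a small Python sketch (not PPC code) of how a 32-bit constant splits into two 16-bit immediates. The function names are made up for illustration. The first split assumes a lis/ori pair; the second assumes the low half is applied with a sign-extending addi, which is why the high half needs the "@ha" style adjustment.

```python
MASK32 = 0xFFFFFFFF

def split_for_lis_ori(value):
    """Return (high, low) 16-bit halves for a lis/ori pair."""
    value &= MASK32
    return value >> 16, value & 0xFFFF

def split_for_lis_addi(value):
    """Return (adjusted_high, low) for lis/addi, where addi sign-extends low."""
    value &= MASK32
    low = value & 0xFFFF
    high = ((value + 0x8000) >> 16) & 0xFFFF   # compensate for a negative low half
    return high, low

def sign_extend16(x):
    """Interpret a 16-bit field as a signed value, as addi does."""
    return x - 0x10000 if x & 0x8000 else x

def rebuild_lis_ori(high, low):
    return ((high << 16) | low) & MASK32

def rebuild_lis_addi(high, low):
    return ((high << 16) + sign_extend16(low)) & MASK32

for constant in (0x12345678, 0x1234ABCD, 0xFFFF8000):
    assert rebuild_lis_ori(*split_for_lis_ori(constant)) == constant
    assert rebuild_lis_addi(*split_for_lis_addi(constant)) == constant
```

Either way it is two 4-byte instructions, which is the 8 bytes mentioned above.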

Despite forward planning for 64 bit, it looks like they stuffed up PPC64, because it needs 5 instructions to load in a 64 bit value! It should only need 4 like PPC32 only needs 2 for a full word load. But it needs a 68K like 64 bit swap to finish the process. In any case, needing to load a full 64-bit value may not always be common. And, with my instruction data code trick, the "payload" to load 64-bit on PPC would be brought down from 20 bytes to a more compact friendly 12 bytes. But, if they stuffed up PPC64 further, so it can't actually load in a 64 bit value, they are trying to create the most inefficient 64 bit CPU ever!

Last edited by Hypex on 10-Aug-2021 at 02:47 AM.


NutsAboutAmiga

Re: Assembly startup codes for ECX compiler in VAsm?
Posted on 9-Aug-2021 16:14:33

[ #13 ]

Elite Member

Joined: 9-Jun-2004
Posts: 12993
From: Norway

@Hypex

Quote:
because it needs 5 instructions to load in a 64 bit value!

It makes no sense to write code like that; it is better to load the value from memory (ld or ldx) with one instruction. If you think about it, all PowerPC instructions are 32 bits wide: you can’t fit 64 bits into 32 bits, and you need space for the opcode, so it can’t all be data.

Last edited by NutsAboutAmiga on 09-Aug-2021 at 04:18 PM.

_________________
http://lifeofliveforit.blogspot.no/
Facebook::LiveForIt Software for AmigaOS


matthey

Re: Assembly startup codes for ECX compiler in VAsm?
Posted on 10-Aug-2021 3:36:34

[ #14 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2736
From: Kansas

Hypex Quote:

The ASM source in question is a generic PPC source file designed to compile on both OS4 and MOS. But, I didn't know you could write an ASM source, designed to assemble into an object, that would refuse to assemble because of the OS the assembler is running on.

The details are vague, but the intended output format is in fact hunk. The compiler is designed to load in the PPC code embedded in 68K hunk format which VAsm can output. The same sources are used for MOS and OS4. OS4 is selected with a define.

That is confusing. MorphOS, AmigaOS 4, WarpOS and PowerUp are all different vbcc targets for the PPC and AROS PPC isn't even supported yet. They all are probably invoked with "vasmppc_std" yet have different configurations and default outputs. Cross compilers require more care to use but it doesn't help that there are so many PPC Amiga targets and standards.

Hypex Quote:

They didn't store or check the result in either case. But, another thing we have to factor in here is 68K ABI emulation. This is a MorphOS ABOX call. And that emulates the 68K ABI in the function process. I can't say why they wanted to emulate the ABI, which was fully register based, and store all the parameters in a memory array, on a CPU that has double the registers. But they did, perhaps to emulate the 68k ABI, so porting was more direct. Amiga code is more compatible with MOS than OS4 in C.

There are 18 GP callee save (non-volatile) registers R14-R31 using the System V PPC ABI. Keeping all 16 68k registers in PPC registers would only leave 1 GP callee save register available. Using a mix of callee save and caller save PPC GP registers for the 68k registers may be more trouble than it is worth.

Hypex Quote:

The OS4 ASM would likely be fewer instructions for opening DOS, even though the OS4 ABI gets criticised for being doubly indirected in function calls. I count 7 instructions for OS4 including the unrolling of the CALLOS macro. But, to be fair and exact to the 68K example, ExecBase must be loaded. On OS4 the IExec interface also needs loading, so that is more code. This causes bloat. Getting IExec is another 7 instructions without checking results, like the example. Then after DosBase, IDos is needed, which is another 9 instructions. The equivalent OS4 code needs about 27 instructions, including checks. The edge it had over MOS is just lost. But once everything is loaded it can gain again on library calls.

Both MorphOS and AmigaOS 4 are bad compared to the 68k. How can something so simple be made so difficult?

Hypex Quote:

If we want to add insult to injury then an AROS x86 library call example is in order!

The x86 code equivalent should be more compact than PPC.

move.l 4.w,a6 ; better scheduling and optimize to absolute short addressing (saves 2 bytes)
lea (dosName,pc),a0 ; use pc relative addressing (saves 2 bytes)
moveq #0,d0
jsr _LVOOpenLibrary(a6)

I don't believe x86 can do the equivalent of the jsr instruction with the call instruction so it may become 2 instructions. Using x86-64 would solve the register shortage as some Amiga libraries use quite a few register args.

Hypex Quote:

That's almost exactly what PPC needs to do. Another thing to take into account is that on 68K the operation is atomic. On PPC it must be split up, so it is not atomic and not thread safe. So on PPC that must be taken into account.

The 68k RMW instructions are atomic because there is only one core. With single core multitasking, the instruction can't be interrupted. With SMP, a much more expensive locked RMW instruction is required to be thread safe. The C/C++ atomic functions use locked RMW instructions on the x86. This is also confusing because locked RMW instructions can be used to implement lock free programming (data sharing) with SMP.

When the PPC is emulating the 68k, the RMW instructions are likely very expensive to implement. The PPC core would need to exclusively lock the cache line being modified, use fences around the cache line or disable interrupts while doing the equivalent of the RMW. These RMW instructions are common on the 68k like with library open counts.

Hypex Quote:

There's no way around needing to load 32 bits in two halves on PPC in code, taking up 8 bytes for the whole process. But one trick may be to store a 32-bit value before the prologue, so one instruction could load it in, and fast, as it would be in the CPU instruction cache. The same could be done with a load from data, but then it needs to be in the CPU data cache. So in code is better, I think. Now, the entire "payload" will still be 8 bytes, so the process may outweigh the gain. However, one instruction instead of two may give a small speed increase.

Putting large immediate/constant numbers near the code is not optimum. Ideally, the instruction loads use the ICache and data loads/stores use the DCache with separate L1 caches which are common today. Simple hardware would end up with the same data in both the L1 ICache and L1 DCache which is not efficient. While putting immediate/constant numbers near the code reduces the number of instructions, the actual code is more spread out and more likely to miss in the ICache. RISC is often fat anyway which means fewer instructions per cache line. ICache pre-fetching is easier to predict than DCache but an instruction bottleneck can develop.

It is best to compress immediate/constant numbers in the instructions without adding dependent instructions. This is more easily done with a variable length instruction set (although most RISC VLEs have done a poor job as RISC-V and Thumb2 still need multiple instructions for large immediates). ARM has often improved constant loading by allowing free shifts although shifts are a bit expensive. Sign extending smaller numbers is cheaper and does a good job. For floating point, numbers which are powers of 2 can be exactly represented in smaller fp formats.

Hypex Quote:

Despite forward planning for 64 bit, it looks like they stuffed up PPC64, because it needs 5 instructions to load in a 64 bit value! It should only need 4 like PPC32 only needs 2 for a full word load. But it needs a 68K like 64 bit swap to finish the process. In any case, needing to load a full 64-bit value may not always be common. And, with my instruction data code trick, the "payload" to load 64-bit on PPC would be brought down from 20 bytes to a more compact friendly 12 bytes. But, if they stuffed up PPC64 further, so it can't actually load in a 64 bit value, they are trying to create the most inefficient 64 bit CPU ever!

PPC64 looks like it has poor code density although I was never able to find papers with data. AArch64 looks more efficient for a 64 bit ISA even though it also has a 32 bit fixed length encoding. Add in a more standard ISA that is easier to use and it is easy to see why AArch64 is the PPC killer.

NutsAboutAmiga Quote:

It makes no sense to write code like that; it is better to load the value from memory (ld or ldx) with one instruction. If you think about it, all PowerPC instructions are 32 bits wide: you can’t fit 64 bits into 32 bits, and you need space for the opcode, so it can’t all be data.

A 32 bit immediate can't even be fit into a 32 bit instruction encoding. The PPC immediate field is usually 16 bits which is common for RISC ISAs. PPC64 still has 32 bit instruction encodings but 4 instructions with a 16 bit immediate field gives 64 bits. Fortunately, immediates which use all 64 bits are rare. The upper bits of most 64 bit values are all 0s or all 1s.
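A quick Python model of the usual five-instruction PPC64 constant load shows where the fifth instruction comes from: four 16-bit immediates plus one shift to move the upper pair into place. This is only a sketch of the register arithmetic; the function names are invented here, and the sequence assumed is lis / ori / sldi 32 / oris / ori.

```python
MASK64 = (1 << 64) - 1

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a signed integer."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def load_imm64(value):
    """Model lis / ori / sldi 32 / oris / ori building a 64-bit constant."""
    highest, higher, high, low = [(value >> s) & 0xFFFF for s in (48, 32, 16, 0)]
    reg = sign_extend(highest << 16, 32) & MASK64   # lis  r,highest (sign-extends)
    reg |= higher                                   # ori  r,r,higher
    reg = (reg << 32) & MASK64                      # sldi r,r,32
    reg |= high << 16                               # oris r,r,high
    reg |= low                                      # ori  r,r,low
    return reg

def fits_single_li(value):
    """True when the upper 48 bits just repeat the sign bit, so one li suffices."""
    return -0x8000 <= sign_extend(value & MASK64, 64) < 0x8000

assert load_imm64(0x0123456789ABCDEF) == 0x0123456789ABCDEF
assert load_imm64(0xFEDCBA9876543210) == 0xFEDCBA9876543210
assert fits_single_li(42) and fits_single_li(MASK64)   # 42 and -1
assert not fits_single_li(0x12345)
```

Five instructions at 4 bytes each gives the 20 bytes mentioned earlier in the thread, while values whose upper bits are all 0s or all 1s avoid the long sequence entirely.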

Last edited by matthey on 10-Aug-2021 at 03:41 AM.


NutsAboutAmiga

Re: Assembly startup codes for ECX compiler in VAsm?
Posted on 10-Aug-2021 15:29:13

[ #15 ]

Elite Member

Joined: 9-Jun-2004
Posts: 12993
From: Norway

@Hypex

Double indirect?

If you run a 68K program, you trigger an illegal instruction that triggers the stub; the stub calls the native PowerPC version of the function, and the native PowerPC function calls other functions using “Self->” internally.

The good thing about using Self is that if a function is patched, the patch affects everywhere it is used. If “Self->” were not used and __ppc__function(bla,bla) were called directly, you couldn’t guarantee that it was replaced everywhere, or that the patch would affect every place the function is used internally.

For something like AllocVec:

You have __stub_68k__AllocVec, then __ppc__AllocVecTagList is called from the self-table.

For AllocVecTags, it calls
__stub_68k__AllocVecTags( self, ULONG *regs ), which calls __ppc__AllocVecTags( self, size, __va_args__ ) using the self-table, and that calls __ppc__AllocVecTagList( self, size, struct Tag * ) using the self-table, internally.

“Tags” functions are promoted, but they are built for convenience, not for speed. Converting var args into a tag list is not free; even if salloc is used, you need to read the arguments out in a loop, and the next function will need to read Tag *tags in a loop, so that’s not good.

TagList functions are what you should be using; that’s what everything defaults to anyway.

From a computer language standpoint, it would be good if Amiga E converted tags into a taglist before compiling; there is something to be gained there.

I never benchmarked the difference between Tags and TagList functions; it might be interesting.
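The cost described above can be pictured with a small Python model: a flat argument list is walked once to build the tag list, and the callee walks the tag list again to find what it needs. Everything here is a hypothetical illustration, not the OS4 implementation; TAG_DONE = 0 matches the utility.library terminator, while the two example tag values and the function names are invented.

```python
TAG_DONE = 0                       # terminator tag, as in utility.library
EXAMPLE_TAG_TYPE = 0x80000001      # hypothetical tag value for illustration
EXAMPLE_TAG_ALIGN = 0x80000002     # hypothetical tag value for illustration

def varargs_to_taglist(*args):
    """First loop: collect (tag, data) pairs from a flat list up to TAG_DONE."""
    taglist = []
    i = 0
    while i + 1 < len(args) and args[i] != TAG_DONE:
        taglist.append((args[i], args[i + 1]))
        i += 2
    taglist.append((TAG_DONE, 0))
    return taglist

def find_tag(taglist, wanted, default=None):
    """Second loop: the TagList function scans the list for one tag."""
    for tag, data in taglist:
        if tag == TAG_DONE:
            break
        if tag == wanted:
            return data
    return default

tags = varargs_to_taglist(EXAMPLE_TAG_TYPE, 3, EXAMPLE_TAG_ALIGN, 16, TAG_DONE)
assert find_tag(tags, EXAMPLE_TAG_ALIGN) == 16
```

A caller that builds the tag list once, ahead of time, skips the first loop, which is the gain suggested for Amiga E.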

Last edited by NutsAboutAmiga on 10-Aug-2021 at 03:39 PM.
Last edited by NutsAboutAmiga on 10-Aug-2021 at 03:31 PM.

_________________
http://lifeofliveforit.blogspot.no/
Facebook::LiveForIt Software for AmigaOS


Hypex

Re: Assembly startup codes for ECX compiler in VAsm?
Posted on 11-Aug-2021 14:45:25

[ #16 ]

Elite Member

Joined: 6-May-2007
Posts: 11351
From: Greensborough, Australia

@matthey

Quote:
That is confusing. MorphOS, AmigaOS 4, WarpOS and PowerUp are all different vbcc targets for the PPC and AROS PPC isn't even supported yet. They all are probably invoked with "vasmppc_std" yet have different configurations and default outputs. Cross compilers require more care to use but it doesn't help that there are so many PPC Amiga targets and standards.

You must have hit reply before my edit. I mistyped Pasm. Yes that is confusing. Vasm does support all that, some are derivatives of ELF and others are hybrids of hunk and ELF. Talk about confusing!

But, what I meant was, the ECX PPC source with a PPC code generator back-end is made to assemble with Pasm and output to a 68K style hunk. Somewhere along the way it loads it in. In fact it loads it at run time because I managed to compile a broken build and kept getting this scanados error. I checked the code around it which kept failing on a simple module load and realised it loaded the PPC code in from an internal hunk loader. Then I re-read the read me and confirmed from a one liner it needs hunk format. Programmers expecting other programmers to understand from vague instructions.

Quote:
There are 18 GP callee save (non-volatile) registers R14-R31 using the System V PPC ABI. Keeping all 16 68k registers in PPC registers would only leave 1 GP callee save register available. Using a mix of callee save and caller save PPC GP registers for the 68k registers may be more trouble than it is worth.

There's no need for most. On 68K a library function call can have up to 14 parameters in registers, D0-D7 and A0-A5. A6 is the library base, A7 the stack. In practice 12 is more realistic, as in C code A4 holds globals and A5 holds locals. D0-D1/A0-A1 are available as scratch as usual. The PPC SysV ABI allows 8 register parameters and the rest must be stacked, so an original call using D0-D7 or D0-D3/A0-A3, or any combination of up to 8, can be passed in registers on PPC. What I'm comparing here is a MorphOS library function call with a SysV function call.

OTOH, OS4 doesn't use any of these shenanigans for library calls, and uses SysV conventions. What OS4 lacks is native calling from library bases, since they introduced interfaces. But once you have the jump table it works a similar way. By comparison, here's some PPC code from ECX itself, with an OS4 and MOS function call:


.MACRO M_INLINE_NEW
   ADDI R3, R3, 8  # add memheader size
   STW R3, 12(R1) # save size
.IFDEF _AMIGAOS4_
   OR R4, R3, R3
   LWZ R3, IEXEC(GLOB)
   LWZ R0, IALLOCMEM(R3)
   MTCTR R0          # to ctr
   ADDI R5, 0, 1<<12 # memf_shared
   ORIS R5, R5, 1    # memf_clear
.ELSE # morphos
   LWZ R0, EMULCALLDIRECTOS(SYS)
   MTCTR R0         # to ctr
   STW R3, 0(SYS) # emulhandle.dregs[0] := size
   ADDI R3, 0, 1  # memf_public
   ORIS R3, R3, 1 # memf_clear
   STW R3, 4(SYS)   # emulhandle.dregs[1] := flags
   LWZ R3, EXECBASE(GLOB)
   STW R3, 56(SYS)  # emulhandle.aregs[6] := execbase
   ADDI R3, 0, ALLOCMEM
.ENDIF
   BCTRL            # call AllocMem()
   OR. R3, R3, R3
   BEQ _\@ipnew_end
   LWZ R0, MEMLIST(GLOB)
   STW R0, 0(R3)   # set next
   STW R3, MEMLIST(GLOB) # add to list
   LWZ R0, 12(R1)
   STW R0, 4(R3)   # set size
   ADDI R3, R3, 8  # skip head
_\@ipnew_end:
.ENDM

Quote:

The x86 code equivalent should be more compact than PPC.

move.l 4.w,a6 ; better scheduling and optimize to absolute short addressing (saves 2 bytes)
lea (dosName,pc),a0 ; use pc relative addressing (saves 2 bytes)
moveq #0,d0
jsr _LVOOpenLibrary(a6)

I don't believe x86 can do the equivalent of the jsr instruction with the call instruction so it may become 2 instructions. Using x86-64 would solve the register shortage as some Amiga libraries use quite a few register args.

Easily. Also, AROS is using some ABI, and also got stuck in the 64-bit switch discussion, when things like ABIv0 and ABIv1 came into it. Funny, after a brief search, I can find no AROS x86 example. Given how popular ASM is with Amiga people, it's surprising there are no common examples. Plenty of 68K till the cows come home. Lots of PPC stuff that caught attention from the 90s. But AROS, using a CISC CPU, turns nothing up!

Now, PPC doesn't have PC-relative modes that I know about, apart from branches. You can locate the PC, or IP as they call it, using a cheat: a bl to the following instruction and picking the address out of the lr. But not immediate like the above. In fact, a format like ELF doesn't lend itself well to this style of coding, since it likes to divide text code, strings and bss into segments. It's organised. But I can duplicate the DOS open operation on PPC; aside from the ABI, the best I can come up with is this:


lwz r3,4(r0) ; This uses a trick where specifying R0 in EA results in 16 bit absolute address
lis r4,dosName@h
ori r4,r4,dosName@l ; Can't be avoided really
li r5,0
lwz r0,_LVOOpenLibrary(r3)
mtctr r0 ; Can do register load from base relative but not jump
bctrl ; PPC needing three instructions for a function jump isn't good.

I reduced it as much as I could but still only get down to 28 bytes. Funny. Twice as large.
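
The two-instruction address load is worth spelling out, because the half-word operators are easy to mix up. As far as I understand the usual PPC assembler conventions, `@h` (plain high half) pairs with `ori`, which does not sign extend, while `@ha` (high half adjusted) pairs with `addi`, which does. Here is a sketch of the arithmetic in C. The function names are made up for illustration.

```c
#include <stdint.h>

/* symbol@h: plain upper 16 bits. */
static uint16_t half_hi(uint32_t v) { return (uint16_t)(v >> 16); }

/* symbol@l: lower 16 bits. */
static uint16_t half_lo(uint32_t v) { return (uint16_t)(v & 0xFFFFu); }

/* symbol@ha: upper 16 bits, bumped by one when the low half will be
   sign extended to a negative number by addi. */
static uint16_t half_ha(uint32_t v) { return (uint16_t)((v + 0x8000u) >> 16); }

/* Models: lis rD,v@h ; ori rD,rD,v@l */
static uint32_t build_lis_ori(uint32_t v)
{
    return ((uint32_t)half_hi(v) << 16) | half_lo(v);
}

/* Models: lis rD,v@ha ; addi rD,rD,v@l (addi sign extends its immediate). */
static uint32_t build_lis_addi(uint32_t v)
{
    uint32_t upper = (uint32_t)half_ha(v) << 16;
    uint32_t lower = half_lo(v);
    if (lower & 0x8000u)
        lower |= 0xFFFF0000u;      /* sign extend the 16-bit immediate */
    return upper + lower;          /* wraps modulo 2^32 like the CPU does */
}
```

Both pairings rebuild the original value. Mixing them (for example `@ha` with `ori`) gives a result that is off by 0x10000 whenever bit 15 of the low half is set.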

But, the Amiga way of calling a kernel routine by jumping into a function table, is considered "weird" by modern conventions. Most examples of calling on a kernel routine will call a trap. PPC can do that of course with sc, like 68k trap.

Quote:

The 68k RMW instructions are atomic because there is only one core. With single core multitasking, the instruction can't be interrupted. With SMP, a much more expensive locked RMW instruction is required to be thread safe. The C/C++ atomic functions use locked RMW instructions on the x86. This is also confusing because locked RMW instructions can be used to implement lock free programming (data sharing) with SMP.

The 68K missed out on that time in the computer world.

Quote:

When the PPC is emulating the 68k, the RMW instructions are likely very expensive to implement. The PPC core would need to exclusively lock the cache line being modified, use fences around the cache line or disable interrupts while doing the equivalent of the RMW. These RMW instructions are common on the 68k like with library open counts.

With emulation it makes it easier, especially on a single core OS, since that is what OS4 is. But each emulated task, as well as each native task, still needs that same atomic operation. Usually a forbid lock takes care of that. OS4 also includes a mutex as well as the classic semaphore. But as far as I know, neither is used to protect any system lists, which would have been a good idea.

The PPC also has an RMW variant for atomic use cases, a load-modify-store sequence. It needs a few instructions, as expected on PPC, and uses a conditional busy loop that looks bad:


    ; atomically increment the word stored at address r3
loop:
    lwarx   r4, 0, r3         ; load with reservation
    addi    r4, r4, 1         ; increment
    stwcx.  r4, 0, r3         ; store conditional
    bne-    loop              ; if failed (unlikely), try again
    ; on exit r4 contains incremented value

From Microsoft docs of all places!

https://devblogs.microsoft.com/oldnewthing/20180814-00/?p=99485
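
For comparison, the same retry loop can be written in portable C11 with `<stdatomic.h>`. This is only a sketch of the pattern, not what any Amiga compiler or OS actually emits. The function name `atomic_increment` is made up for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically increment *counter and return the incremented value.
   The weak compare-exchange is allowed to fail spuriously, which is
   how it can map onto a lwarx/stwcx. pair on PPC. */
static uint32_t atomic_increment(_Atomic uint32_t *counter)
{
    uint32_t old = atomic_load(counter);       /* read the current value */

    /* On failure 'old' is refreshed with the value currently stored,
       so the loop simply retries with up to date data. */
    while (!atomic_compare_exchange_weak(counter, &old, old + 1u)) {
        /* retry */
    }
    return old + 1u;
}
```

The loop has the same shape as the assembly: load, compute, try to store, and go around again if someone else got in between.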

Quote:

Putting large immediate/constant numbers near the code is not optimum. Ideally, the instruction loads use the ICache and data loads/stores use the DCache with separate L1 caches which are common today. Simple hardware would end up with the same data in both the L1 ICache and L1 DCache which is not efficient. While Putting immediate/constant numbers near the code reduces the number of instructions, the actual code is more spread out and more likely to miss in the ICache. RISC is often fat anyway which means fewer instructions per cache line. ICache pre-fetching is easier to predict than DCache but an instruction bottleneck can develop.

Given the same payload is needed regardless, loading 32 bits on PPC also means the complication is not worth it. But it's mainly with integer values where it's needed. Even accessing an absolute address at different locations doesn't need two half loads. It's common to load the high word of a base address into one register then use it to access other locations with only the low word needed. As with any CPU, code should be optimised for that particular CPU.

Also, static data can be set up to be read in later. To compact size. It just takes up other space such as globals or other address base.

Quote:

It is best to compress immediate/constant numbers in the instructions without adding dependent instructions. This is more easily done with a variable length instruction set (although most RISC VLEs have done a poor job as RISC-V and Thumb2 still need multiple instructions for large immediates). ARM has often improved constant loading by allowing free shifts although shifts are a bit expensive. Sign extending smaller numbers is cheaper and does a good job. For floating point, numbers which are powers of 2 can be exactly represented in smaller fp formats.

At best, the PPC has a 16 bit version of a $7000 moveq, with li loading a sign extended 16 bit integer.

For a LEA type load I don't know what PPC offers. I suppose an addi or an ori would do it. PPC does have the advantage of being able to perform an operation on two registers and store in a third.

Quote:

PPC64 looks like it has poor code density although I was never able to find papers with data. AArch64 looks more efficient for a 64 bit ISA even though it also has a 32 bit fixed length encoding. Add in a more standard ISA that is easier to use and it is easy to see why AArch64 is the PPC killer.

As with anything PPC, it's only an issue if a full word load is needed. Most pointers would be fine with base relatives. This puts 16 bit limit on globals but OOP code tends to substitute that with a self pointer. But at least branching can be up to 24-bits. Finally, a win for PPC, a BSR.W beater.

Last edited by Hypex on 11-Aug-2021 at 03:18 PM.

Status: Offline

Hypex

Re: Assembly startup codes for ECX compiler in VAsm?
Posted on 12-Aug-2021 7:50:28

[ #17 ]

Elite Member

Joined: 6-May-2007
Posts: 11351
From: Greensborough, Australia

@NutsAboutAmiga

Yes, there seems to be some idea about OS4 using double indirection, which I read from time to time. In fact here you tackled the subject last month:
https://amigaworld.net/modules/newbb/viewtopic.php?mode=viewtopic&topic_id=44226&forum=14&start=40&viewmode=flat&order=0#843263

So it looks like a misunderstanding to me. Unless each interface call actually calls a stub that then jumps to the library routine. In fact the 68K calls work exactly this way. You call a function by a JSR to it, then the code jumps off somewhere else. 68K jump tables aren't exactly direct.
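
To make the 68K side concrete: a library call is a JSR to a negative offset from the library base, and each slot there is 6 bytes holding a JMP to the real routine. Below is a rough C model of that idea. It is only an approximation, since real slots contain instructions rather than function pointers, and all names in it are invented for illustration.

```c
#include <stddef.h>

#define LVO_SLOT_SIZE 6   /* bytes per 68K jump table entry (JMP abs.l) */

typedef long (*LibFunc)(long);

/* Byte offset from the library base of the n-th vector, counting from 1. */
static long lvo_offset(int slot)
{
    return -(long)LVO_SLOT_SIZE * slot;
}

/* Two stand-in "library routines". */
static long real_double(long x) { return x * 2; }
static long real_negate(long x) { return -x; }

/* Vectors grow downwards from the base, so slot 1 is the last array
   element and the base points one past the end of the table. */
static LibFunc table[2] = { real_negate, real_double };
static LibFunc *const lib_base = &table[2];

/* The caller only knows the base and the slot number. The hop through
   the table is the indirection being discussed. */
static long call_slot(LibFunc *base, int slot, long arg)
{
    return base[-slot](arg);
}
```

For example `lvo_offset(92)` gives -552, which matches the exec.library offset of OpenLibrary, and `call_slot(lib_base, 1, 21)` reaches `real_double` by way of the table.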

There is mention of a base to interface, the library base I assume, but it isn't needed for function calls and would be grabbed by the function routine if it needs it.

As per the example code I listed, MOS isn't exactly direct either. Or in this case not using a standard ABI call. It's using a custom call that places the parameters in a 68K register object, duplicating the 68K register layout, then calling a hook or kernel routine with an index of what Amiga OS call it wants. Considering the whole operation, that's not exactly a direct function call without any kind of indirection happening.

So this was about PPC only code calling PPC code via OS functions. However I have been testing the OS4 emutrap lately. Which stores the registers exactly as MOS does for an OS call as it happens. D0-D7/A0-A7.

Had the trap been only 6 bytes, one code and one pointer, it could have fit into the jump table. So on OS4 a 68K function call will jump a few places before native code takes over.

Technically, there should be no difference between a var list and tag list. Both would be a list or array of long words. On PPC they are said to be problematic. Without knowing the internal details I can only imagine it's a stack issue, as PPC doesn't exactly have a stack where items can be pushed on at will, nor branches that save a return address on it. It lacks those common features, and on PPC it's done by hand and backwards, as code must set up a stack frame in the function and store the return address itself.

However, on E, the convention isn't to use tag lists for tag functions but an E list. Sure, an object can be built up with the tags set in place, but that's more manual set up, so it's easier to put them in a list. Somehow it is able to build that dynamically in the code. It must be in local space somewhere, since PPC doesn't have the luxury of dynamically stacking parameters like C functions did. Even though PPC is said to be designed with compilers in mind. Suppose a parameter stack could be purposed to do it, so there would be a stack for each purpose, that each function could stack onto dynamically and when it is fully loaded tag it with a list length and pass as pointer. I haven't looked into it.

I never knew why there were Tags and TagList functions for doing exactly the same thing. They are from 68K API and on 68K a dynamic list would be built on the stack and then a pointer passed as the tag list. In both cases a pointer to a list is given so why confuse the functions with a distinction?

Last edited by Hypex on 12-Aug-2021 at 08:01 AM.

Status: Offline

NutsAboutAmiga

Re: Assembly startup codes for ECX compiler in VAsm?
Posted on 12-Aug-2021 18:40:56

[ #18 ]

Elite Member

Joined: 9-Jun-2004
Posts: 12993
From: Norway

@Hypex

Quote:

Had the trap been only 6 bytes, one code and one pointer, it could have fit into the jump table. So on OS4 a 68K function call will jump a few places before native code takes over.

I believe it results in misalignment, won't it?

https://developer.ibm.com/articles/pa-dalign/

I have noticed that hardware/custom.h does not force packed structures,
which can be a problem. GCC will auto pad structures
if you are not telling it to use packed structures.
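
A small sketch of what that padding looks like in practice, using GCC's `#pragma pack`. The two structs are hypothetical and not taken from hardware/custom.h. They just put a 16-bit field in front of a 32-bit one, which is the case where 68K and PPC layouts tend to disagree.

```c
#include <stddef.h>
#include <stdint.h>

/* Default layout: the compiler may insert padding after 'flags' so that
   'pointer' sits on its natural alignment (ABI dependent). */
struct NaturalRec {
    uint16_t flags;
    uint32_t pointer;
};

/* 68K style layout: members are aligned to at most 2 bytes. */
#pragma pack(push, 2)
struct PackedRec {
    uint16_t flags;
    uint32_t pointer;
};
#pragma pack(pop)

/* Helpers so the offsets can be inspected at run time. */
static size_t natural_pointer_offset(void)
{
    return offsetof(struct NaturalRec, pointer);
}

static size_t packed_pointer_offset(void)
{
    return offsetof(struct PackedRec, pointer);
}
```

With the pack pragma the 32-bit field stays at offset 2 and the struct is 6 bytes. Without it, most 32-bit and 64-bit ABIs move the field to offset 4 and grow the struct to 8 bytes, so anything overlaid on memory laid out the 68K way ends up reading the wrong addresses.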

Quote:
Technically, there should be no difference between a var list and tag list. Both would be a list or array of long words.

Well yes, I won’t be surprised if you found a way to pass the address of an arg into the next function, but it's typically done with va_start, va_arg, va_end.

The difference, I guess, is that a tag list can be const while variable length args are built at runtime.
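
A sketch of the two flavours side by side may help here. The `TagItem` struct and `TAG_DONE` below are local stand-ins modelled on the usual AmigaOS tag item layout, and both function names are invented for the example. The varargs version collects its arguments with va_arg into a local array and then hands a pointer to the array version.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins, declared here so the example is self-contained. */
struct TagItem { uint32_t ti_Tag; uint32_t ti_Data; };
#define TAG_DONE 0u
#define MAX_TAGS 16

/* "TagList" flavour: takes a pointer to an array ending in TAG_DONE. */
static uint32_t sum_tag_data(const struct TagItem *tags)
{
    uint32_t sum = 0;
    for (; tags->ti_Tag != TAG_DONE; tags++)
        sum += tags->ti_Data;
    return sum;
}

/* "Tags" flavour: variable arguments, copied into an array at run time
   and forwarded. Extra tags beyond MAX_TAGS are ignored in this sketch. */
static uint32_t sum_tag_data_va(uint32_t first_tag, ...)
{
    struct TagItem items[MAX_TAGS + 1];
    size_t n = 0;
    uint32_t tag = first_tag;
    va_list ap;

    va_start(ap, first_tag);
    while (tag != TAG_DONE && n < MAX_TAGS) {
        items[n].ti_Tag = tag;
        items[n].ti_Data = va_arg(ap, uint32_t);
        n++;
        tag = va_arg(ap, uint32_t);
    }
    va_end(ap);

    items[n].ti_Tag = TAG_DONE;    /* terminate the copied list */
    items[n].ti_Data = 0;
    return sum_tag_data(items);
}
```

On 68K the wrapper could skip the copy, because the arguments already sit in order on the stack and their address is a valid tag list. With the PPC SysV convention the first arguments travel in registers, so they are not contiguous in memory and something like the copy above (or compiler support) is needed. The array version can be fed a static const table with no run time set up at all.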

Last edited by NutsAboutAmiga on 12-Aug-2021 at 07:18 PM.
Last edited by NutsAboutAmiga on 12-Aug-2021 at 07:08 PM.
Last edited by NutsAboutAmiga on 12-Aug-2021 at 06:56 PM.
Last edited by NutsAboutAmiga on 12-Aug-2021 at 06:43 PM.

_________________
http://lifeofliveforit.blogspot.no/
Facebook::LiveForIt Software for AmigaOS

Status: Offline

matthey

Re: Assembly startup codes for ECX compiler in VAsm?
Posted on 13-Aug-2021 8:33:48

[ #19 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2736
From: Kansas

Hypex Quote:

But, what I meant was, the ECX PPC source with a PPC code generator back-end is made to assemble with Pasm and output to a 68K style hunk. Somewhere along the way it loads it in. In fact it loads it at run time because I managed to compile a broken build and kept getting this scanados error. I checked the code around it which kept failing on a simple module load and realised it loaded the PPC code in from an internal hunk loader. Then I re-read the read me and confirmed from a one liner it needs hunk format. Programmers expecting other programmers to understand from vague instructions.

I'm guessing it was ELF which may have worked if you knew what variation was needed for AmigaOS 4 and how to specify it. Any AmigaOS variation so far needs unusual ELF handling for the relocatable scatter loader.

Hypex Quote:

There's no need for most. On 68K a library function call can have up to 14 parameters in registers, D0-D7 and A0-A5. A6 is library base, A7 stack. In practise 12 is more realistic as in C code A4 holds globals and A5 holds locals. D0-D1/A0-A1 available as scratch as usual. PPC SysV ABI allows 8 registered parameters and the rest must be stacked, so an original call using D0-D7 or D0-D3/A0-A3 or any up to 8 can be parametrically registered on PPC. Since what I'm comparing here is a MorphOS library function call with a SysV function call.

The a4 register is used as a pointer to global data when compiling with small data which has become less common as program sizes have grown (also used in libraries sometimes requiring a geta4() call so a4 points to library data). The a5 register is the frame pointer on the 68k Amiga but this is often turned off on the Amiga (vbcc defaults to no frame pointer) as it is more efficient to use the a7 stack pointer for local variables (a5 frame pointer is only for debugging). It is rare for programs to use a4 or a5 for args as they may be used but it is also rare to have that many pointer args. Too many register args can cause the function called to spill registers to gain working registers which defeats the purpose of register args.

Hypex Quote:

OTOH, OS4 doesn't use any of these shenanigans for library calls, and uses SysV conventions. What OS4 lacks is native calling from library bases, since they introduced interfaces. But once you have the jump table it works a similar way. By comparison, here's some PPC code from ECX itself, with an OS4 and MOS function call:

I have to say the AmigaOS 4 library call setup looks better even with the double memory indirect. MOS using the 68k registers in memory isn't efficient.

Hypex Quote:

Easily. Also, AROS is using some ABI, and also got stuck in the 64-bit switch discussion, when things like ABIv0 and ABIv1 came into it. Funny, after a brief search, I can find no AROS x86 example. Given how popular ASM is with Amiga people, it's surprising there are no common examples. Plenty of 68K till the cows come home. Lots of PPC stuff that caught attention from the 90's. But AROS, using a CISC CPU, turns nothing up!

Most 68k fans are not interested in switching to x86-64. The CPUs are so high performance that there is no need for efficient code anyway.

Hypex Quote:

Now PPC doesn't have PC relative modes I know about apart from a branch. You can locate the PC, or IP as they call it, using a cheat by calling a bl to the following instruction and picking it out of the lr. But not immediate like the above. In fact, a format like ELF doesn't lend itself well to this style of coding, since it likes to divide text code, strings and bss in segments. It's organised. But I can duplicate the DOS open operation on PPC, aside from ABI, and the best I can come up with is this:

lwz r3,4(r0) ; This uses a trick where specifying R0 in EA results in 16 bit absolute address
lis r4,dosName@h
ori r4,r4,dosName@l ; Can't be avoided really
li r5,0
lwz r0,_LVOOpenLibrary(r3)
mtctr r0 ; Can do register load from base relative but not jump
bctrl ; PPC needing three instructions for a function jump isn't good.

I reduced it as much as I could but still only get down to 28 bytes. Funny. Twice as large.

PC relative addressing has become more important in modern ISAs and with 64 bit addressing. The x86-64 RIP relative addressing was a big improvement for x86 which had poor PC relative support due to early segment use. AArch64 has improved PC relative support.

Quote:

There is improved support for position-independent code and data addressing:

• PC-relative literal loads have an offset range of ±1MiB. This permits fewer literal pools, and more sharing of literal data between functions – reducing I-cache and TLB pollution.
• Most conditional branches have a range of ±1MiB, expected to be sufficient for the majority of conditional branches which take place within a single function.
• Unconditional branches, including branch and link, have a range of ±128MiB. Expected to be sufficient to span the static code segment of most executable load modules and shared objects, without needing linker-inserted trampolines or “veneers”.
• PC-relative load/store and address generation with a range of ±4GiB may be performed inline using only two instructions, i.e. without the need to load an offset from a literal pool.

It is difficult for a fixed length encoding to provide good PC relative support. The fewest instructions and best code density comes from progressively longer PC relative encodings like the 68k uses.

(d8,pc) in 16 bit encodings, (d16,pc) in 32 bit encodings, (d32,pc) in 48 bit encodings

The 68k has good PC relative support but it could be better for 64 bit addressing including a (d32,pc) explicit mode (it has implicit support for branches) and PC relative writes. RIP relative addressing of x86-64 has shown the advantage of PC relative writes even though some purists don't think they should be allowed even though restricting writes adds little to security. Some OSs have separate sections for code (read only protected) and text areas (read only and no execute) but this reduces PC relative use and code density. PC relative accesses save a register while often providing more compact code. Absolute addressing is very inefficient with 64 bit addressing and (d16,Rn) addressing which was common for the 68k and PPC only accesses 64kiB of data which is limiting for large programs today.

Hypex Quote:

But, the Amiga way of calling a kernel routine by jumping into a function table, is considered "weird" by modern conventions. Most examples of calling on a kernel routine will call a trap. PPC can do that of course with sc, like 68k trap.

The trap method of system calls allows more separation of user and supervisor code but it is also more expensive. Switching to Supervisor mode and flushing the pipeline makes a trap relatively more expensive on modern processors.

Hypex Quote:

The 68K missed out on that time in the computer world.

There were a few multiprocessor 68k computers but no multicore SMP processors. A multiprocessor computer would have similar data sharing issues though.

Hypex Quote:

With emulation it makes it easier, especially on a single core OS, since that is what OS4 is. But, each emulated task, as well as each native task, still needs that same atomic operation. Usually a forbid lock takes care of that. OS4 also includes a mutex as well as classic semaphore. But none are used to protect any system lists I know of that would have been a good idea.

I would think disabling interrupts would be necessary and that a forbid would not be enough. Mutexes and semaphores are tricky to use.

Hypex Quote:

The PPC also has an RMW variant for atomic use cases, a load-modify-store sequence. It needs a few instructions, as expected on PPC, and uses a conditional busy loop that looks bad:

; atomically increment the word stored at address r3
loop:
lwarx r4, 0, r3 ; load with reservation
addi r4, r4, 1 ; increment
stwcx. r4, 0, r3 ; store conditional
bne- loop ; if failed (unlikely), try again
; on exit r4 contains incremented value

From Microsoft docs of all places!

It's no worse than using a CAS instruction which can also fail.

Hypex Quote:

At best, the PPC has a 16 bit version of a $7000 moveq, with li loading a sign extended 16 bit integer.

For a LEA type load I don't know what PPC offers. I suppose an addi or an ori would do it. PPC does have the advantage of being able to perform an operation on two registers and store in a third.

While PPC has 3 op and twice the number of 68k registers, it's not unusual for the 68k to use half the number of instructions and half the code size. This is what happens when a RISC processor touches memory but the problems include PPC deficiencies too. PPC may make up some ground when doing complex functions but how many instructions on average are executed without using memory?

Hypex Quote:

As with anything PPC, it's only an issue if a full word load is needed. Most pointers would be fine with base relatives. This puts 16 bit limit on globals but OOP code tends to substitute that with a self pointer. But at least branching can be up to 24-bits. Finally, a win for PPC, a BSR.W beater.

The 68020+ has BSR.L and Bcc.L so the PPC only beats the 68000. AArch64 has better PC relative addressing than PPC in many cases too as I posted above. In some ways, PPC is more outdated than the 68k.

Status: Offline

Hypex

Re: Assembly startup codes for ECX compiler in VAsm?
Posted on 13-Aug-2021 15:21:23

[ #20 ]

Elite Member

Joined: 6-May-2007
Posts: 11351
From: Greensborough, Australia

@NutsAboutAmiga

Quote:
I believe it results in misalignment, won't it?

Usually, and I thought this would affect it, but this isn't a usual case. I was seeing crashes and thought my EmuTrap object, which I added to my AmigaE code, was at fault. There is a 32-bit trap code, a 16-bit return flag and a 32-bit function pointer. As you can see that's odd, so I padded it to long alignment. This turned out to be a mistake. The crash was due to a missing global base.

It didn't exactly matter what alignment it sat on. Because the trap is part of 68K code and the 68K code is word aligned. So even if the object was all in alignment, it wouldn't matter, because there are cases it would sit on a word. Or half word as PPC calls it.

Quote:
I have noticed that hardware/custom.h does not force packed structures, which can be a problem. GCC will auto pad structures if you are not telling it to use packed structures.

There should be a pragma pack (2) in there somewhere. I found this myself with CIABase structure years ago.

Quote:
Well yes, I won’t be surprised if you found a way to pass the address of an arg into the next function, but it's typically done with va_start, va_arg, va_end.

The difference, I guess, is that a tag list can be const while variable length args are built at runtime.

Yes tag lists can be static but on the other end is a function that expects a pointer to a tag list. Whether it is a local object or loaded onto the stack both would end up as a pointer. But I suppose I don't know the internals either. It's hard to find info online that has the LVO with an Amiga function. It used to be common in books but it is hardly found online. So I can't check if the Tags and TagList function have a different LVO and two functions for the same exact thing. Which looks like a waste.

I also wonder how an OS supports C++. To me this looks like a compiler feature with calling methods. But with an OS it must work with a specific ABI. A compiler can't define the ABI nor construct a class how it wants to privately. Like how the OS4 methods work. It's customised to work on OS4. Somehow on modern systems they have some kind of transparency with calling methods.

Status: Offline


Amigaworld.net was originally founded by David Doyle