Amigaworld.net - The Amiga Computer Community Portal Website

The (Microprocessors) Code Density Hangout

simplex

Re: The (Microprocessors) Code Density Hangout
Posted on 8-May-2021 2:40:41

[ #61 ]

Cult Member

Joined: 5-Oct-2003
Posts: 896
From: Hattiesburg, MS

@matthey

Thanks. My father is the only person I've known to criticize the 68000 series, so I was really surprised. Maybe I hang out in the wrong joints.

If I may ask a followup: do you agree with his complaint about the loss of the software interrupt? I know what interrupts are, but am not familiar enough with the 6809 and 68000 architectures to understand that. I remember that the 6809 had a fast interrupt, and it was apparently new to the chip (whether it was new to Motorola's chips or new in general to computing I don't know), and a lot of people were wowed by it. When I look briefly at what Wikipedia says about the 68000's interrupts, though, it looks as if the worst you could say is that the 68000 had something more general, and in that sense better, even if it lost the precise interrupt he's talking about. I also see that the 68000 proper apparently had some issues that prevented virtualization, and I actually understand what WP says there, but I don't know if the 6809 had that issue or... well, yeah, I don't know. So anything you don't mind to say would be great. Whereas if it doesn't interest you don't worry about it.

_________________
I've decided to follow an awful lot of people I respect and leave AmigaWorld. If for some reason you want to talk to me, it shouldn't take much effort to find me.

Status: Offline

bison

Re: The (Microprocessors) Code Density Hangout
Posted on 8-May-2021 3:20:04

[ #62 ]

Elite Member

Joined: 18-Dec-2007
Posts: 2112
From: N-Space

@matthey

Quote:
The 68000 could not recover from a failed memory access (address or bus error) because it did not include enough processor state data to resume the faulted instruction. This prevented the use of virtual memory where the 68000 was being used by workstations (some used dual 68000 processors as a workaround). The 68010 fixed this problem.

I seem to remember that Sun used a custom in-house MMU with the 68000, but I don't recall the details. It was slightly before my time. (They were using SPARC by the time I got my first Unix job.)

Back in the late 80s there was this ongoing debate at my local Amiga user's group about putting 68010s in Amigas, and whether it did any good or not.

Last edited by bison on 08-May-2021 at 03:27 AM.

_________________
"Unix is supposed to fix that." -- Jay Miner

Status: Offline

cdimauro

Re: The (Microprocessors) Code Density Hangout
Posted on 8-May-2021 5:39:24

[ #63 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4441
From: Germany

@simplex Quote:

simplex wrote:
@cdimauro

OK, here's what he wrote 3 years ago when I asked him about this.
Quote:
It took Intel several years to catch up with the 6809. It also irritated me no end when Motorola built a new, inferior design rather than expanding the 6809. ...Indeed it was the 68K. Rather than expand the elegant 8/16-bit architecture to 16 and/or 32 bits, they cobbled together a (to me) haphazard collection of special and general purpose registers with a (to me) haphazard instruction set, and dumped the elegant software interrupt with its inherent safety features. Yes, it was much inferior.

To be honest, at this point I wondered if he was misremembering Motorola's products, but he spent decades doing this and I'm at best a software guy, so I dropped the subject.

Thanks for sharing. I don't know how software interrupts work on the 6809 (I only studied the 68HC12, which is a "rewriting" of the 6809, from an ISA PoV), so I can't express any opinion about it (BTW, I've read that an ISA extension was proposed for RISC-V to implement "software interrupts").
However, with all respect, and as Matt also reported, I beg to disagree regarding the architecture comparison: the 68000, even with all the mistakes made by Motorola's engineers (who seem to be used to it, unfortunately), clearly has a superior design.

To me the 68HC12 is one of the best, if not the best, 16-bit processors (I like to classify it as 16-bit, because almost everything is 16-bit) which I've seen, and that's the reason why I've paid homage to it.

@matthey Quote:

matthey wrote:
cdimauro Quote:

I've no time now to make a code example using the 68HC12's instruction set, and I don't know if it makes sense, because the example has no constraints: there's no size defined for the data, so 8 or 16 bits can be used, which gives very different results on such a processor.

The pdf page 3 says, "Assume all six variables are 4-bytes integers." This would mean the accumulator architecture would be 32 bits with hardware multiply and division. It certainly looks like a hypothetical ISA. Maybe the example should have used the default integer size as used by C of the architecture.

OK, thanks for clarifying it. Then it doesn't make sense to write a 6800 version.
Quote:
cdimauro Quote:
Anyway, this comparison is too simple to make a general statement about different computer architecture families.
I also don't like the fact that it reports only some bits for the instructions instead of the full size. For example, the instruction size for the 68020 is 28 bytes = 224 bits, and for the VAX it is 24 bytes = 192 bits.

I agree. The code size calculations are strange. It also takes away from the memory traffic calculation which includes instruction and data traffic. The architecture comparison idea was good but it has some flaws. VAX code density is good and sometimes beats the 68k. It would be interesting to see what a modern micro-op OoO VAX CPU would look like.

The VAX ISA is really, really impressive: one of the best ISAs which I've seen. Despite the apparent complexity (due to arbitrary operands, and up to 3 operands usable), the design is tremendously simple: opcode (+ operand 1) (+ operand 2) (+ operand 3).

I think that the implementation was also very simple and cost-effective at the time, but it has challenges for a modern OoO microarchitecture, because it has to "seek" up to 3 operands which have variable length, to also see when the instruction ends: quite difficult even for a simple ISA like that.
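
To illustrate the "seek" problem described above, here is a minimal sketch of a length decoder for a VAX-style format (opcode followed by up to three variable-length operand specifiers). The opcode numbers and the addressing-mode table are made-up assumptions for illustration, not the real VAX encoding; the point is only that the end of the instruction is unknown until every operand has been walked in order.

```python
# Hypothetical opcode table: opcode byte -> number of operands.
OPERAND_COUNT = {0x01: 0, 0x10: 1, 0x20: 2, 0x30: 3}

# Hypothetical addressing modes (high nibble of the specifier byte)
# -> number of extra bytes following the specifier byte.
EXTRA_BYTES = {
    0x5: 0,  # register
    0xA: 1,  # byte displacement
    0xC: 2,  # word displacement
    0xE: 4,  # long displacement
}


def instruction_length(code: bytes, offset: int = 0) -> int:
    """Return the length in bytes of the instruction starting at `offset`."""
    pos = offset
    opcode = code[pos]
    pos += 1
    # Each specifier must be decoded before the next one can even be located.
    for _ in range(OPERAND_COUNT[opcode]):
        mode = code[pos] >> 4
        pos += 1 + EXTRA_BYTES[mode]
    return pos - offset


# Three operands: register, byte displacement, long displacement.
three_operand = bytes([0x30, 0x51, 0xA2, 0x08, 0xE3, 0, 0, 0, 0x10])
print(instruction_length(three_operand))
```

A wide decoder would need this serial walk for several instructions per cycle, which is presumably where the difficulty for an OoO design comes from.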

The funny thing is that it needed no specific "code compression techniques / instructions" to improve the code density: the particular nature of the ISA already allowed all information to be embedded in a very compact way.
To further improve the code density I would reorganize the opcodes a little bit, to make space for a few binary and ternary instructions, and for a SIMD extension. The rest is already cool and well-made.
Quote:
cdimauro Quote:
LOL However the RPi are selling some millions, whereas the embedded market is much larger.

Sure, it is just a small piece of the big embedded market pie. It is impressive that a little hobbyist board took substantial embedded market share and created what has likely become the most popular embedded form factor standard by introducing cheap hardware. Another hobbyist board form factor based on the Arduino (project started in Italy) was the previous most popular embedded form factor and the clear target of the new Raspberry Pi Pico (with new Raspberry Pi Foundation custom SoC) which you seem to think is primitive (it is but that is not the point). I expect the board to take market share from Arduino as the price is a fraction of that of the Arduino.

I used the Arduino, and it really sucks as a platform. However, the only reason why Arduino is cool is that it has vast support, and allows you to quickly prototype something until you get a good working version of your project. After that you can rework it to get very cost-effective boards when produced in large quantities, and there the RPi cannot compete at all.
Quote:
The Raspberry Pi Foundation management is really smart. Why can't the Amiga have management like that?

You're asking the impossible: Commodore and post-Commodore management were equally crazy.
Quote:
cdimauro Quote:
Do you "only" need a chip architect = HDL/RTL expert?

HDL proficiency is not enough. Someone with processor design understanding and experience is needed. The chief architect is not the place to be cheap. The right person would give instant credibility to such a project.

Understood. But this is a high profile, and I think that it's difficult to find someone that can embrace projects like that. Unless the project is properly backed...

Status: Offline

cdimauro

Re: The (Microprocessors) Code Density Hangout
Posted on 8-May-2021 6:26:31

[ #64 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4441
From: Germany

@megol Quote:

megol wrote:
@cdimauro
To see how important it really is for mainstream computing Itanium and x86 are good examples: for Itanium which wastes many bytes in a normal, optimal(!), instruction stream the bad density wasn't a problem.

Matt already replied, but let me add something. It's the exact opposite of what you said: Itanium is the clear proof that code density matters.

Performance was so much worse that Intel had to add a 2/4MB external (because it was too big at the time: 2001!) cache. Large caches were needed primarily because of its very poor code density.
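
A rough way to put numbers on this: an IA-64 bundle is 128 bits and holds three 41-bit instruction slots plus a 5-bit template, so even a perfectly filled bundle costs 16/3 ≈ 5.33 bytes per instruction. The sketch below compares how many useful instructions fit in a cache under that format. The 16 KiB cache size and the 4-byte fixed-length comparison ISA are illustrative assumptions, not measurements.

```python
BUNDLE_BYTES = 16      # one IA-64 bundle = 128 bits
SLOTS_PER_BUNDLE = 3   # three 41-bit slots + 5-bit template


def ia64_instructions(cache_bytes: int, useful_slots: int = SLOTS_PER_BUNDLE) -> int:
    """Useful instructions that fit when each bundle carries `useful_slots` of them."""
    return (cache_bytes // BUNDLE_BYTES) * useful_slots


def fixed_length_instructions(cache_bytes: int, instruction_bytes: int) -> int:
    """Instructions that fit for a fixed-length ISA."""
    return cache_bytes // instruction_bytes


CACHE = 16 * 1024  # hypothetical 16 KiB instruction cache

print(ia64_instructions(CACHE))             # every slot does useful work
print(ia64_instructions(CACHE, 2))          # one slot per bundle is a NOP
print(fixed_length_instructions(CACHE, 4))  # hypothetical 4-byte RISC
```

The second line is the more realistic case for compiler output, since slots that cannot be filled are padded with NOPs.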
Quote:
For x86 it's so important that compilers and operating systems don't care about adding waste bytes - it's not important.

It's so unimportant that compilers have specific -Os flags, there is a TON of research in that field, and $$$ of investment going solely into enhancing code density...

Yes, even on x86, since memory, bandwidth, cache sizes and silicon area matter as well.
Quote:
And of course we can add the argument that 68k most likely would have moved in the same direction as x86 to optimize performance like aligning critical loops to cache line boundaries to maximize decode throughput. This and several other standards are part of what makes modern x86 binaries' code density bad.

I suggest you disassemble some binaries to see and count how many NOPs or LEAs (rarer in recent years: NOPs are the preferred way) are used for 16-byte padding.
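
For reference, the amount of padding needed per aligned label is easy to compute. A minimal sketch (the addresses are made up; the single-byte 0x90 NOP is used for simplicity, while real toolchains usually emit longer multi-byte NOPs):

```python
def padding_bytes(address: int, alignment: int = 16) -> int:
    """Bytes to insert so that the next instruction starts on an aligned address."""
    return -address % alignment


def nop_padding(address: int, alignment: int = 16) -> bytes:
    """Padding made of single-byte x86 NOPs (opcode 0x90)."""
    return b"\x90" * padding_bytes(address, alignment)


# Hypothetical loop entry points inside a function.
for addr in (0x1003, 0x1010, 0x102F):
    print(hex(addr), padding_bytes(addr))
```

On average a 16-byte alignment costs 7.5 bytes per aligned label, assuming the unaligned addresses are uniformly distributed.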

In the meantime you can compare the x86 and x64 binaries and see how different they are in code size (the first is stack-based and the second register-based: this causes an increase in code size in the latter), especially looking at the single instructions (a lot of $4x prefixes are used), WITHOUT taking the alignment into account.
Quote:
We have so much memory that operating systems and programs can waste it without problem, caches are large enough that density doesn't improve performance etc. etc.

We? Who? You can't extend your needs, or the needs of a subset of users, to the generality.
Quote:
And those arguments and a lot more have been talked about before thus the "This thread again".

Those arguments were spread around in different threads talking about different things, not in a single thread whose purpose is to talk specifically about code density.

@Fl@sh Quote:

Fl@sh wrote:
@megol

I tried to raise the same objections about cache sizes and today's memory speed/abundance; I hope your reply will be given more consideration.

I suggest you actively follow conferences about microprocessors, and you'll see how often the code density argument comes up: you'll also find talks specifically about it.
Quote:
We all know that in today's programming practice nearly all code is written in C/C++ or other high-level languages.
Sometimes only a few critical parts of the code are still written in assembly.

So compilers can make a big difference in code generation, but sadly compilers will never produce code of the same quality as a skilled human coder can.

It depends. Compilers can easily take A LOT of variables into account when generating code, which would take a human coder A LOT of time to do.
Quote:
Anyway, microprocessors are quite complex and you should know very well, for each architecture, the instruction clock cycles and applicable addressing modes, and a lot of other things too.
What can be faster on one CPU is slower on another; a classic example is the Intel vs AMD war, where some code is faster on one brand and slower on the other.
Often, even staying with the same brand, code can be the best for one generation of CPU and the worst for another.

This is all about micro-architectures.
Quote:
What I want to point out is that coders can't be locked into hand-optimizing for a particular ISA implementation; future microprocessors have to be much simpler to program, with fewer instructions and fewer addressing modes.

All this to reach the goal of producing less complex CPUs, with fewer transistors and an easy way for compilers to choose/manage instructions.

My hope is to get less, but done as well as possible.

That's not true, and not even what's really happening. Even RISCs have HUNDREDs of instructions, because this reduces the execution time and/or the code size (which also contributes to the execution time). Some RISCs also have complex addressing modes for the same reason.
Compilers can support them, although it takes time for some complex addressing modes (e.g.: pre/post de/increment, or scaled index).
Quote:
PowerPC has three addressing modes

Simple or complex?
Quote:
and full orthogonality,

Simply no. Please take a look at the list of instructions and the registers.
Quote:
it was future proof for 32bit->64bit change,

Correct. And it's expected, since it's a modern ISA. So, it was expected and obvious.
Quote:
it had SIMD units from start

Absolutely wrong. SIMD units were introduced only after Intel did it.
Quote:
and its ISA can be easily expanded.

True, but it's of no particular value.
Quote:
For me the PowerPC project was much better than any 68k, x86-64 or ARM ISA.

I beg to differ.
Quote:
Today IBM is developing POWER (mostly backward compatible with the old PowerPC ISA) and its features are, at least for me, the most impressive of all (Power 10).
IBM POWER CPUs are currently developed mainly for supercomputers/high-end servers, but if we are discussing in theory "what a good CPU should be", this technology can't be excluded from the discussion.

It's not a PowerPC, doesn't bring important innovations, and it struggles to compete even with x86/x64. POWER will remain bound to a nano-niche of the market.

@NutsAboutAmiga Quote:

NutsAboutAmiga wrote:
@cdimauro

So in the old days, code used to be mostly 16-bit; now that we more often use 32-bit data, we waste more memory, but we have more memory to waste.

I disagree: see above.
Quote:
in some cases you can use 1 x 32-bit instruction to do the work of 2 x 16-bit instructions, or 4 x 8-bit instructions, but that is not how it's commonly done.

We do it with AltiVec instructions, but don’t do it with normal instructions.

Correct, but we have different goals with regular instructions and SIMD instructions.
Quote:
you just need to get your head around thinking this way, then you see that 16-bit and 8-bit instructions are almost redundant.

Yes as instructions. No as data types. Even for "general purpose" computing.
Quote:
the idea behind fixed-length instructions is a simplified decoding path, reducing transistors and latency. A more complex design would use non-fixed-length instructions, but why not put in a larger CPU cache instead.

Because it costs: money (silicon) AND energy.
Quote:
Smaller code does not equal faster code, that’s a misconception; for example, tight loops will be slower than unrolled loops. But unrolled loops will take up more space. You can test this yourself with the compiler flags -Os and -O3 on GCC; optimize for size or for speed, and the one optimized for speed will more often be larger.

You're mixing up two different things, which are NOT mutually exclusive: you can have loops unrolled (taking more space) but using smaller instructions (less space).

It depends on what you need.
Quote:
And yes, including an FPU allows faster calculation of floating point values; you might not notice any speed improvement if you use too old a benchmark tool, don’t do any 3D work, or do not use the computer for any complex math.

Code density matters ALSO for FP calculations: the smaller the code, the less cache is used, the less bandwidth is used --> performance benefits. It also has energy benefits --> frequencies can get higher --> again performance benefits.
Quote:
And yes, an MMU is useful for debugging, virtualization and emulation, and security; not having an MMU is a minus.

Again: it depends on what you need. The Amiga o.s. & applications didn't require MMUs, with some exceptions (emulators that used the accessed bit to see which memory had changed in order to update the graphics). MMUs were primarily useful for developers.

Status: Offline

matthey

Re: The (Microprocessors) Code Density Hangout
Posted on 8-May-2021 7:07:46

[ #65 ]

Elite Member

Joined: 14-Mar-2007
Posts: 2756
From: Kansas

simplex Quote:

If I may ask a followup: do you agree with his complaint about the loss of the software interrupt? I know what interrupts are, but am not familiar enough with the 6809 and 68000 architectures to understand that. I remember that the 6809 had a fast interrupt, and it was apparently new to the chip (whether it was new to Motorola's chips or new in general to computing I don't know), and a lot of people were wowed by it. When I look briefly at what Wikipedia says about the 68000's interrupts, though, it looks as if the worst you could say is that the 68000 had something more general, and in that sense better, even if it lost the precise interrupt he's talking about. I also see that the 68000 proper apparently had some issues that prevented virtualization, and I actually understand what WP says there, but I don't know if the 6809 had that issue or... well, yeah, I don't know. So anything you don't mind to say would be great. Whereas if it doesn't interest you don't worry about it.

I'm not sure what he is referring to with a software interrupt. Perhaps it is a lower priority interrupt than a hardware interrupt which can be set up and triggered somehow. The Amiga has what are called software interrupts like this.

https://wiki.amigaos.net/wiki/Exec_Interrupts#Software_Interrupts

I'm not sure how these work under the hood but I suspect that they temporarily insert a high priority task into the task scheduler which calls a prioritized chain of custom handlers.

If he is talking about a customizable hardware vector so that user code can generate a hardware interrupt which calls custom "software" then the 68k has that too. The exception vectors can be changed to point to custom code which can then be accessed with a TRAP instruction or A-line Trap for example. There are also many user defined vectors but I don't recall how useful they are.
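
As a small illustration of where those vectors live: on the 68000 each exception vector is a 32-bit address stored at vector number × 4, TRAP #0 to #15 use vectors 32 to 47, and the A-line emulator trap uses vector 10. From the 68010 on, the table base can be moved through the VBR. The helper names below are made up for this sketch.

```python
VECTOR_SIZE = 4          # each vector is a 32-bit address
A_LINE_VECTOR = 10       # unimplemented A-line opcode
TRAP_BASE_VECTOR = 32    # TRAP #0


def vector_address(vector: int, vbr: int = 0) -> int:
    """Address of an exception vector; `vbr` is 0 on a plain 68000."""
    if not 0 <= vector <= 255:
        raise ValueError("vector number must be 0..255")
    return vbr + vector * VECTOR_SIZE


def trap_vector_address(n: int, vbr: int = 0) -> int:
    """Address of the vector used by the instruction TRAP #n."""
    if not 0 <= n <= 15:
        raise ValueError("TRAP number must be 0..15")
    return vector_address(TRAP_BASE_VECTOR + n, vbr)


print(hex(trap_vector_address(0)))          # TRAP #0
print(hex(trap_vector_address(15)))         # TRAP #15
print(hex(vector_address(A_LINE_VECTOR)))   # A-line trap
```

Installing a custom handler then amounts to writing its address to that location (in supervisor mode on real hardware).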

Fast interrupts are usually done by having duplicate shadow registers which are swapped in during interrupts. This would be cheap for an accumulator architecture as there are few registers to shadow. ARM added fast interrupts that swap some registers to shadow registers. The 68k (CPU32) FIDO has fast interrupts using shadow registers in what are called fast vectored contexts.
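
The shadow register idea can be sketched with a toy model: instead of pushing registers to the stack on interrupt entry, the CPU just switches which bank is visible. This is only a conceptual illustration with made-up names and a three-register accumulator-style set; it does not model any specific CPU.

```python
class ToyCpu:
    """Toy CPU with a normal register bank and one shadow bank."""

    def __init__(self, register_names=("A", "B", "X")):
        self.banks = {
            "normal": dict.fromkeys(register_names, 0),
            "shadow": dict.fromkeys(register_names, 0),
        }
        self.active = "normal"

    @property
    def regs(self):
        """The register bank currently visible to running code."""
        return self.banks[self.active]

    def enter_fast_interrupt(self):
        # No registers are saved to memory: only the visible bank changes.
        self.active = "shadow"

    def return_from_interrupt(self):
        self.active = "normal"


cpu = ToyCpu()
cpu.regs["A"] = 42           # interrupted program state
cpu.enter_fast_interrupt()
cpu.regs["A"] = 7            # handler scribbles on "A" freely
cpu.return_from_interrupt()
print(cpu.regs["A"])         # the interrupted program still sees its own value
```

The cost is one extra copy of every banked register, which is why it is cheap when there are few registers to shadow.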

Status: Offline

cdimauro

Re: The (Microprocessors) Code Density Hangout
Posted on 8-May-2021 7:12:55

[ #66 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4441
From: Germany

@Hammer Quote:

Hammer wrote:
@cdimauro Quote:

Matt already replied, but let me add a few things worth mentioning.

First, the numbers that you criticized are coming from compiler-generated code. Specifically, when compiling the SPEC suite.

The goal for the SPEC benchmark suite is to reach the highest score and the smallest code doesn't automatically yield the fastest code path.

First, code size matters, whatever the goal is: fastest code or smallest code. See my other comments above.

Second, SPEC is also widely used as a code size benchmark, since it has a large set of common applications / algorithms.

Third, please take a look at the study next time, because it clearly says this:
"In order to select the base architecture for our experiments, we evaluated 15 variations of 7 different ISAs regarding to code size for the same set of programs, the SPEC 2006 benchmark. To evaluate them, we used gcc built specifically for each architecture variation and the same global gcc options to allow a fair comparison. By global options we mean all the options that are not architecture specific. From the architecture specific options, we used the ones that generate the smaller code size."
Quote:
Quote:

Whereas your above considerations are taking a specific contest where only manually-written code was used and compared.
So, you're comparing apples and oranges (despite that 8086 wasn't running Linux, as it was reported).

There's https://en.wikipedia.org/wiki/Embeddable_Linux_Kernel_Subset a.k.a Linux-8086

It isn't Linux, so it doesn't matter.

Also, from the link: "Rudimentary Ethernet and FAT support were added in 2017".
Quote:
Quote:
Second, this contest is of very limited use for comparing architectures, because:
- assembly code is rarely used (even on embedded systems);
- the used program was really tiny (in the order of 1KB);
- it isn't using so much real-life code;
- the provided solutions aren't all equally "top-notch".

Regarding the last point, for example, the 68K code provided by Matt and ross reorganized the LZSS code (removing one branch instruction) and used a neat trick (using the X flag as a flag/signal). Something similar can be used on x86/x64 as well, but I had no time to adapt their code (I only did it for my ISA).

Anyway, and as I've said, those kinds of contests are just for fun: IMO the important thing is to look at compiler-generated code.

Code density is also dependent on the compiler (e.g. ICC vs VC++ vs GCC)

Let me quote myself again:
"Specifically, on Windows I found that binaries compiled with GCC have poor code density compared to binaries generated by other compilers (Visual Studio, primarily).

So, and as you can see, compilers also matter when talking about code density.".

Please read the thread before writing.
Quote:
and the OS.

If the o.s. used in the benchmark is the same, then it's an invariant AKA doesn't matter.
Quote:
With the PS4 programming guide, Sony (Naughty Dog dev team) recommends
1. keeping high-performance data small and contiguous within the data cache.
2. keeping high-performance code small within the instruction cache.

PPT slide from Sony's Naughty Dog team.

Those are super-obvious statements that an average assembly coder has known for a long time.

@Hammer Quote:

Hammer wrote:
@matthey Quote:

x86 was used some for embedded devices but that was probably as much about cheap prices and good developer tool support from mass production. The x86-64 architecture is less appealing as can be seen by the Intel Atom processors being limited to only the highest performance embedded systems despite many offerings.

Intel Atom CPU is not a small chip when compared to AMD embedded CPU counterpart.

Intel Atom Pine View = 9.7 mm2 on 45 nm process
AMD Bobcat = 4.6 mm2 on 40 nm process GoFlo

Intel Atom Clovertrail = 5.6 mm2 on 32 nm process
AMD Jaguar = 3.1 mm2 on 28 nm process TSMC, includes out-of-order processing with dual 128-bit SIMD units. Used for MS's Xbox One and Sony's PS4.

ARM Cortex A15 = 2.7 mm2 on 28 nm process TSMC.

Again, apples and oranges comparisons.

Different chips have different uncore elements, as well as different amounts (and type) of L2/L3 caches.

If you want to compare something, which is related to the (micro)architectures, then you have to consider only the core.

Status: Offline

cdimauro

Re: The (Microprocessors) Code Density Hangout
Posted on 8-May-2021 7:28:30

[ #67 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4441
From: Germany

A few notes that I forgot previously.

@matthey Quote:

matthey wrote:
3. x86-64 is still optimized for stack usage and byte sizes which are not used much in code optimized for performance so longer instructions which enlarge code are used instead;

x64 is optimized for register usage, and for 32-bit data sizes.
Quote:
4. decoding is simpler on the 68k

Depends on the micro-architecture.

Having a lot of instruction formats, as well as a lot of exceptions in instructions, makes the decoding task quite difficult, especially on an ISA where opcodes are multiples of 16 bits (LUTs are difficult to use, or much more expensive. Compared to 8-bit opcode ISAs, at least).
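
To put rough numbers on the LUT remark: a table indexed directly by the opcode needs one entry per possible opcode value, so it grows exponentially with the opcode width. A minimal sketch (`lut_entries` is just an illustrative helper, and real decoders rarely use one flat table):

```python
def lut_entries(opcode_bits: int) -> int:
    """Entries needed by a flat lookup table indexed by the full opcode."""
    return 1 << opcode_bits


for bits in (8, 16):
    print(f"{bits}-bit opcode -> {lut_entries(bits)} table entries")
```

A 16-bit opcode word therefore needs a table 256 times larger than an 8-bit one, unless the decoder splits the word into fields and decodes them in stages.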
Quote:
so instructions don't need to be broken down as far saving energy and micro-op caches which may benefit from increased code alignment are more likely to be avoidable

The reason why micro-ops caches are widely used (even on RISC designs: ARM, for example) is because of performance reasons (they are immediately available --> shorter pipeline --> less branch-misprediction penalties) AND energy reasons (the decoders can be turned-off. Decoders are energy-vampires!).

@matthey Quote:

The first POWER processor design used a simple "pure RISC" philosophy but the newest POWER processor is more complex than any 68k processor ever was

Well, POWER was already more complex a very long time ago. In fact, it used to break complex instructions into "more RISCy" ones as well.
Quote:
and competes with modern x86-64 processors as one of the most complex ever.

But not on performance.

Status: Offline

JimIgou

Re: The (Microprocessors) Code Density Hangout
Posted on 8-May-2021 15:10:51

[ #68 ]

Regular Member

Joined: 30-May-2018
Posts: 114
From: Unknown

@simplex

I loved working with the 6809, first microprocessor to offer real position independent code as well as reentrant code, it had nice instruction set. The Hitachi 6309 improved on that design. And I used to work for a company that built systems based on Peripheral Technology's PT68K4 and PT68K5 motherboards (68000 and 68020 based respectively).
Motorola was on a roll there, the only thing I wasn't fond of was the lack of backward compatibility between the 68000 and the 6809.
They are different, but similar enough that I moved from the 6809 to the 68000 easily enough.

And the 68000 core is still used today.

Oh, and the 68010 supports memory virtualization, as do all later 680XXs.

We had some great UNIX-like OS' running on the 68000 series.

Status: Offline

cdimauro

Re: The (Microprocessors) Code Density Hangout
Posted on 13-May-2021 6:00:03

[ #69 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4441
From: Germany

I've updated the i386 and x86-64 code for the LZSS loop, mimicking the 68K one (X -> C carry flag loop, and reordering the blocks to eliminate some jump instructions). I haven't checked it, but I think that it should work out.

New i386 LZSS loop:
Quote:
# LZSS decompression algorithm implementation
# by Stephan Walter 2002, based on LZSS.C by Haruhiko Okumura 1989
# optimized some more by Vince Weaver

# we used to fill the buffer with FREQUENT_CHAR
# but, that only gains us one byte of space in the lzss image.
# the lzss algorithm does automatic RLE... pretty clever
# so we compress with NUL as FREQUENT_CHAR and it is pre-done for us

mov $(N-F), %bp # R

mov $logo, %esi # %esi points to logo (for lodsb)

mov $out_buffer, %edi # point to out_buffer
push %edi # save this value for later

stc # +1;+1I. C=1

decompression_loop:
lodsb # load in a byte

# mov $0xff, %bh # re-load top as a hackish 8-bit counter
# mov %al, %bl # move in the flags
not %eax # -2;-1I | ~flags (68k penalty..) {.b}

test_flags:
ror $1, %al # shift bottom bit into carry flag (use C as 'moving' counter)
jz decompression_loop
jc offset_length # if set, we jump to match copy

discrete_char:
lodsb # load a byte
inc %ecx # we set ecx to one so byte
# will be output once
# (how do we know ecx is zero?)

jmp store_byte # and cleverly store it

offset_length:
lodsw # get match_length and match_position
mov %eax,%edx # copy to edx
# no need to mask dx, as we do it
# by default in output_loop

shr $(P_BITS),%eax
add $(THRESHOLD+1),%al
mov %al,%cl # cl = (ax >> P_BITS) + THRESHOLD + 1
# (=match_length)

output_loop:
and $POSITION_MASK,%dh # mask it
mov text_buf(%edx), %al # load byte from text_buf[]
inc %edx # advance pointer in text_buf
store_byte:
stosb # store it

mov %al, text_buf(%ebp) # store also to text_buf[r]
inc %ebp # r++
and $(N-1), %bp # mask r

loop output_loop # repeat until k>j

# or %bh,%bh # if 0 we shifted through 8 and must
# jnz test_flags # re-load flags

# jmp decompression_loop
# -6;-3I.

cmp $logo_end, %esi # have we reached the end?
je test_flags # if so, exit

# STATISTICS (compared to original i386). Size: -7 bytes. Instructions: -3.
# end of LZSS code

New x86-64 LZSS loop:
Quote:
# LZSS decompression algorithm implementation
# by Stephan Walter 2002, based on LZSS.C by Haruhiko Okumura 1989
# optimized some more by Vince Weaver

# we used to fill the buffer with FREQUENT_CHAR
# but, that only gains us one byte of space in the lzss image.
# the lzss algorithm does automatic RLE... pretty clever
# so we compress with NUL as FREQUENT_CHAR and it is pre-done for us

mov $(N-F), %ebp # R

mov $logo, %esi # %esi points to logo (for lodsb)

mov $out_buffer, %edi # point to out_buffer
push %rdi # save this value for later

xor %ecx, %ecx

stc # +1;+1I. C=1

decompression_loop:
lodsb # load in a byte

# mov $0xff, %bh # re-load top as a hackish 8-bit counter
# mov %al, %bl # move in the flags
not %eax # -2;-1I | ~flags (68k penalty..) {.b}

test_flags:
ror $1, %al # shift bottom bit into carry flag (use C as 'moving' counter)
jz decompression_loop
jc offset_length # if set, we jump to match copy

discrete_char:
lodsb # load a byte
inc %ecx # we set ecx to one so byte
# will be output once
# (how do we know ecx is zero?)

jmp store_byte # and cleverly store it

offset_length:
lodsw # get match_length and match_position
mov %eax,%edx # copy to edx
# no need to mask dx, as we do it
# by default in output_loop

shr $(P_BITS),%eax
add $(THRESHOLD+1),%al
mov %al,%cl # cl = (ax >> P_BITS) + THRESHOLD + 1
# (=match_length)

output_loop:
and $POSITION_MASK,%dh # mask it
mov text_buf(%rdx), %al # load byte from text_buf[]
inc %edx # advance pointer in text_buf
store_byte:
stosb # store it

mov %al, text_buf(%rbp) # store also to text_buf[r]
inc %ebp # r++
and $(N-1), %bp # mask r

loop output_loop # repeat until k>j

# or %bh,%bh # if 0 we shifted through 8 and must
# jnz test_flags # re-load flags

# jmp decompression_loop
# -6;-3I.

cmp $logo_end, %esi # have we reached the end?
je test_flags # if so, exit

# STATISTICS (compared to original x86_64). Size: -7 bytes. Instructions: -3.
# end of LZSS code

Both versions save 7 bytes and 3 instructions.
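
For readers who don't read x86 assembly, here is a minimal high-level sketch of the decompression scheme the listings implement: a flag byte announces eight items, each of which is either a literal byte or a 16-bit (length, position) reference into a ring buffer. The constants are assumptions, since the posts don't define them (N=1024, F=64, THRESHOLD=2, P_BITS=10), and so is the flag polarity (bit set = literal), which is my reading of the listing. It is not a translation of the optimized loop above, which is untested by its author's own admission.

```python
N = 1024          # ring buffer size (assumed)
F = 64            # maximum match length (assumed)
THRESHOLD = 2     # matches shorter than this are stored as literals (assumed)
P_BITS = 10       # low bits of the match word hold the position (assumed)


def lzss_decode(data: bytes) -> bytes:
    """Decode an LZSS stream: flag byte, then 8 literal/match items, repeated."""
    text_buf = bytearray(N)   # zero-filled, i.e. NUL as the "frequent char"
    r = N - F                 # write position in the ring buffer
    out = bytearray()
    i = 0
    while i < len(data):
        flags = data[i]
        i += 1
        for _ in range(8):
            if i >= len(data):
                break
            if flags & 1:
                # Literal: copy one byte straight through.
                run = [data[i]]
                i += 1
            else:
                # Match: little-endian 16-bit word = length bits | position bits.
                if i + 2 > len(data):
                    raise ValueError("truncated match word")
                word = data[i] | (data[i + 1] << 8)
                i += 2
                pos = word & (N - 1)
                length = (word >> P_BITS) + THRESHOLD + 1
                run = None
            if run is not None:
                out.append(run[0])
                text_buf[r] = run[0]
                r = (r + 1) & (N - 1)
            else:
                # Copy byte by byte so overlapping matches behave like RLE.
                for _ in range(length):
                    c = text_buf[pos]
                    pos = (pos + 1) & (N - 1)
                    out.append(c)
                    text_buf[r] = c
                    r = (r + 1) & (N - 1)
            flags >>= 1
    return bytes(out)


# Three literals "abc", then a 3-byte match pointing back at them (position N-F).
stream = bytes([0b00000111, ord("a"), ord("b"), ord("c"), 0xC0, 0x03])
print(lzss_decode(stream))
```

The assembly versions fold the per-item flag test, the literal path and the match copy into one loop, which is where the byte savings come from.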

Status: Offline

noXLar

Re: The (Microprocessors) Code Density Hangout
Posted on 15-May-2021 8:45:38

[ #70 ]

Cult Member

Joined: 8-May-2003
Posts: 737
From: Norway

@cdimauro

so, any progress on your cpu remake project?
i find it very interesting to read, and enjoy it. but i have to admit
i have no idea what is being discussed or the idea behind it.

is it about making a theoretical super 68k?

_________________
nox's in the house!


cdimauro

Re: The (Microprocessors) Code Density Hangout
Posted on 15-May-2021 10:54:39

[ #71 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4441
From: Germany

@noXLar Quote:

noXLar wrote:
@cdimauro

so, any progress on your cpu remake project?

I completed the architecture definition (essentially only the ISA) at the end of last year, so I've nothing left to do from this PoV: the project is done.

However, I'll take the chance to share some statistics from the latest versions & iterations:
NEx64T Overall Statistics.xlsx

As you can see there's a lot of data, which comes from a Python script that I've created to automatically generate tables and charts. Some of it deserves a quick explanation.

Tabs:
Instructions 32-bit -> data collected for 32-bit binaries (x86).
Length 32-bit -> Charts about the instructions lengths.
Executed 32-bit -> Charts about the number of executed instructions.
GCC 32-bit -> It's exactly like Instructions 32-bit, but it's reporting the statistics only for binaries compiled with GCC.
64-bit versions are exactly like the 32-bit ones, but data is collected for 64-bit binaries (x64).

Columns:
Instructions -> Total number of (disassembled from the binary) instructions.
Size - x86 -> The size, in bytes, of all (disassembled) instructions for x86.
Len. - x86 -> The average instruction length for x86.
Size - No opt -> The size, in bytes, of all the x86 instructions after they were re-assembled for NEx64T (my architecture) as they are (no optimization at all: just a 1:1 conversion from x86 to NEx64T).
Len - No opt -> The average instruction length for NEx64T.
Δ - No opt -> The average instruction length delta (difference) between NEx64T and x86. Negative = Blue = gain = advantage. Positive = Red = loss = disadvantage.
Exec. - No opt -> The number of executed instructions for NEx64T.
Δ - No opt -> The executed instruction delta (difference) between NEx64T and x86.
The following columns have the same meaning, but for different optimizations, architecture versions, or minor architecture changes.
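To make the column definitions concrete, here is a small Python sketch of how the average length and its delta could be derived from the raw counts. This is not the script used for the spreadsheet: the class and field names are invented, the numbers in the usage lines are placeholders, and it assumes the delta is a plain difference in bytes per instruction (the post doesn't say whether it's absolute or a percentage).

```python
from dataclasses import dataclass


@dataclass
class StatsRow:
    """One binary's row; names are hypothetical, not from the spreadsheet."""
    instructions: int   # "Instructions" column
    size_x86: int       # "Size - x86" column, in bytes
    size_nex64t: int    # "Size - No opt" column, in bytes

    @property
    def len_x86(self) -> float:
        # "Len. - x86": average x86 instruction length
        return self.size_x86 / self.instructions

    @property
    def len_nex64t(self) -> float:
        # "Len - No opt": a 1:1 conversion keeps the instruction count
        return self.size_nex64t / self.instructions

    @property
    def delta_len(self) -> float:
        # Negative means NEx64T instructions are shorter on average (a gain)
        return self.len_nex64t - self.len_x86


# Placeholder numbers, only to show the arithmetic.
row = StatsRow(instructions=1000, size_x86=3500, size_nex64t=3000)
print(row.len_x86, row.len_nex64t, row.delta_len)   # 3.5 3.0 -0.5
```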

The most relevant versions are:
No opt -> The x86/x64 instruction is just converted to the NEx64T equivalent. No opt starts with the NEx64T v6 architecture.
Opt -> Some simple optimizations are performed (for example: AND Mem,0 is converted to MOV Mem,0, etc.).
Rip -> Only for x86 (not for x64). 32-bit absolute addresses are converted to RIP + Offset (on NEx64T the addressing modes are the same in both cases: they can be used in 32 or 64-bit execution mode).
Peepholer -> Two x86/x64 instructions are combined into a single NEx64T instruction.
Quickcalls -> Two or more x86/x64 instructions are replaced by a short "quick call" instruction, which calls a small subroutine that just executes the original instructions. Used to reduce the size of function prologues and epilogues.
V7 -> Here starts NEx64T v7 (several changes made to the ISA, compared to v6).
V8 -> Here starts NEx64T v8 (several changes made to the ISA, compared to v7).
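As an illustration of the kind of rewrite the "Opt" step describes, here is a toy Python pass for the one example given (AND Mem,0 becoming MOV Mem,0). The tuple representation and the function name are assumptions made for this sketch, and it deliberately ignores flags: AND updates the condition flags while MOV does not, so a real pass would presumably also have to check that the flags aren't used afterwards.

```python
def simple_opt(instructions):
    """Rewrite `AND dest, 0` as `MOV dest, 0`; leave everything else alone.

    Each instruction is a hypothetical (mnemonic, dest, src) tuple.
    Flag liveness is NOT checked in this sketch.
    """
    result = []
    for mnemonic, dest, src in instructions:
        if mnemonic == "AND" and src == 0:
            # ANDing with zero always yields zero, so store zero directly.
            result.append(("MOV", dest, 0))
        else:
            result.append((mnemonic, dest, src))
    return result


code = [("AND", "[mem]", 0), ("ADD", "eax", 1), ("AND", "ebx", 0xFF)]
print(simple_opt(code))
# [('MOV', '[mem]', 0), ('ADD', 'eax', 1), ('AND', 'ebx', 255)]
```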

Anyway, those details are mostly for geeks interested in studying the details of architectures.

Charts are definitely more interesting, and easier to read.
Quote:
i find it very interesting to read, and enjoy it. but i have to admit
i have no idea what is being discussed or the idea behind it.

It's difficult to discuss my ISA, because there are A LOT of ideas behind it, which I can't publish yet.

BTW I'm just using a few of my ISA's features & instructions for the data / charts. Many others can be applied to reduce both the code size and the number of executed instructions.

However I'm currently more focused on creating a backend for my ISA. Simulations don't make sense anymore, because I've already got a good trend for both 32 and 64-bit versions of my ISA.
Quote:
is it about making a theoretical super 68k?

No, the super 68K is Matt's project.


noXLar

Re: The (Microprocessors) Code Density Hangout
Posted on 16-May-2021 2:16:18

[ #72 ]

Cult Member

Joined: 8-May-2003
Posts: 737
From: Norway

@cdimauro

hehe, it's hieroglyphics for me, but still interesting.

found this site that was made yesterday, since you all talked about the 6502 cpu earlier. thought i'd share it here.

https://www.pagetable.com/?p=1520


cdimauro

Re: The (Microprocessors) Code Density Hangout
Posted on 16-May-2021 5:48:29

[ #73 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4441
From: Germany

@noXLar: interesting reading. I wasn't aware that MOS (then Commodore) had a cross-assembler for the 6502. When I started programming the 8-bit Commodore machines I only had paper, pencils, and the list of opcodes...


OneTimer1

Re: The (Microprocessors) Code Density Hangout
Posted on 19-May-2021 16:19:17

[ #74 ]

Super Member

Joined: 3-Aug-2015
Posts: 1263
From: Germany

@noXLar

There are 6502 assemblers written in BASIC, so I'm not surprised to find some written in Fortran.

Look here:
https://archive.org/details/6502_Assembler_in_BASIC/page/n5/mode/2up?view=theater


AmigaBlitter

Re: The (Microprocessors) Code Density Hangout
Posted on 4-Jun-2021 10:29:11

[ #75 ]

Elite Member

Joined: 26-Sep-2005
Posts: 3523
From: Unknown

@thread

https://www.extremetech.com/computing/323245-risc-vs-cisc-why-its-the-wrong-lens-to-compare-modern-x86-arm-cpus

_________________
retired


Hammer

Re: The (Microprocessors) Code Density Hangout
Posted on 4-Jun-2021 14:04:13

[ #76 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6505
From: Australia

@cdimauro

Quote:

"In order to select the base architecture for our experiments, we evaluated 15 variations of 7 different ISAs regarding to code size for the same set of programs, the SPEC 2006 benchmark. To evaluate them, we used gcc built specifically for each architecture variation and the same global gcc options to allow a fair comparison. By global options we mean all the options that are not architecture specific. From the architecture specific options, we used the ones that generate the smaller code size."

Flawed argument with GCC.

https://www.spec.org/cpu2006/results/res2009q1/cpu2006-20090316-06788.html
Intel C++ Compiler 11.0 for Linux is being used.

For work, I used Intel C++ Compiler and the company pays for it.

For 2018 examples
https://www.spec.org/cpu2006/results/res2018q1/cpu2006-20171224-51360.txt
Compiler: C/C++: Version 17.0.3.191 of Intel C/C++ Compiler for Linux was used for Intel Xeon Gold 6150.

https://www.spec.org/cpu2006/results/res2018q1/cpu2006-20171212-51335.txt
Compiler: C/C++: Version 4.5.2.1 of x86 Open64 Compiler Suite (from AMD) was used for EPYC 7451. AMD's Compiler Suite can be downloaded from https://developer.amd.com/x86-open64-compiler-suite

The highest priority with the SPEC benchmark is to obtain the highest score for the hardware product.

Last edited by Hammer on 04-Jun-2021 at 02:23 PM.

_________________
Amiga 1200 (rev 1D1, KS 3.2, PiStorm32/RPi CM4/Emu68)
Amiga 500 (rev 6A, ECS, KS 3.2, PiStorm/RPi 4B/Emu68)
Ryzen 9 7950X, DDR5-6000 64 GB RAM, GeForce RTX 4080 16 GB


cdimauro

Re: The (Microprocessors) Code Density Hangout
Posted on 26-Sep-2021 7:39:04

[ #77 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4441
From: Germany

@AmigaBlitter

Quote:

AmigaBlitter wrote:
@thread

https://www.extremetech.com/computing/323245-risc-vs-cisc-why-its-the-wrong-lens-to-compare-modern-x86-arm-cpus

Thanks. Very interesting article.

I've written another one in reply to it (matthey will appreciate it for sure. :D), but it'll take a while before I publish it.

@Hammer

Quote:

Hammer wrote:
@cdimauro

Quote:

"In order to select the base architecture for our experiments, we evaluated 15 variations of 7 different ISAs regarding to code size for the same set of programs, the SPEC 2006 benchmark. To evaluate them, we used gcc built specifically for each architecture variation and the same global gcc options to allow a fair comparison. By global options we mean all the options that are not architecture specific. From the architecture specific options, we used the ones that generate the smaller code size."

Flawed argument with GCC.

Where is the flaw here?
Quote:
https://www.spec.org/cpu2006/results/res2009q1/cpu2006-20090316-06788.html
Intel C++ Compiler 11.0 for Linux is being used.

For work, I used Intel C++ Compiler and the company pays for it.

For 2018 examples
https://www.spec.org/cpu2006/results/res2018q1/cpu2006-20171224-51360.txt
Compiler: C/C++: Version 17.0.3.191 of Intel C/C++ Compiler for Linux was used for Intel Xeon Gold 6150.

So what? When I was working at Intel I VALIDATED our C/C++ (and Fortran) compiler for our HPC suite.

You don't have to tell me how good our compiler was/is.
Quote:
https://www.spec.org/cpu2006/results/res2018q1/cpu2006-20171212-51335.txt
Compiler: C/C++: Version 4.5.2.1 of x86 Open64 Compiler Suite (from AMD) was used for EPYC 7451. AMD's Compiler Suite can be downloaded from https://developer.amd.com/x86-open64-compiler-suite

So what?
Quote:
The highest priority with the SPEC benchmark is to obtain the highest score for the hardware product.

Sure. Who has argued against it?

But maybe you completely missed the point of the discussion and the thread. Please re-read the thread, and understand why SPEC was used and, in general, why it's still used ALSO for code density metrics.


Hammer

Re: The (Microprocessors) Code Density Hangout
Posted on 26-Sep-2021 8:00:33

[ #78 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6505
From: Australia

@cdimauro

Quote:
Where is the flaw here?

GCC is not the best compiler for the platform, hence a flawed comparison.

Quote:

So what? When I was working at Intel I VALIDATED our C/C++ (and Fortran) compiler for our HPC suite.

You don't have to tell me how good our compiler was/is.

Red herring. You missed my point. GCC is not the best compiler for the platform.

Quote:

So what?

You missed my point. GCC is not the best compiler for the platform.

Quote:

Sure. Who has argued against it?

But maybe you completely missed the point of the discussion and the thread. Please re-read the thread, and understand why SPEC was used and, in general, why it's still used ALSO for code density metrics.

Are you going to argue for GCC being the best compiler for X86 and X86-64(AMD64)?

Last edited by Hammer on 26-Sep-2021 at 08:12 AM.

Hammer

Re: The (Microprocessors) Code Density Hangout
Posted on 26-Sep-2021 8:17:58

[ #79 ]

Elite Member

Joined: 9-Mar-2003
Posts: 6505
From: Australia

@cdimauro

Quote:

Again, apples and oranges comparisons.

Different chips have different uncore elements, as well as different amounts (and type) of L2/L3 caches.

If you want to compare something, which is related to the (micro)architectures, then you have to consider only the core.

It's apples to apples.

ARM Cortex A15 (28 nm node) = 2.7 mm²

AMD Jaguar (28 nm node) = 3.1 mm², delivering out-of-order processing with 128-bit AVX v1 hardware.

Intel Atom (Bonnell) Pineview (32 nm node) = 5.6 mm², in-order processing.

AMD won the cost-sensitive Xbox One and PS4 game consoles and has its well-known embedded X86-64 devices.

Intel's 32 nm Atom (Bonnell) Pineview is crap, i.e. releasing something close to the classic Pentium design in 2008 wasn't wise. For 2013, 22nm Atom Silvermont was missing AVX.

AMD 28 nm Jaguar's die area is close enough to ARM Cortex A15 with Jaguar having superior performance when compared to both 28 nm ARM Cortex A15 and Intel's 32 nm Atom (Bonnell) Pineview.

Last edited by Hammer on 26-Sep-2021 at 08:45 AM.

cdimauro

Re: The (Microprocessors) Code Density Hangout
Posted on 27-Sep-2021 5:03:16

[ #80 ]

Elite Member

Joined: 29-Oct-2012
Posts: 4441
From: Germany

@Hammer Quote:

Hammer wrote:
@cdimauro Quote:
Where is the flaw here?

GCC is not the best compiler for the platform, hence a flawed comparison.

GCC is not the best compiler for many platforms: Intel has a better compiler for x86/x64, ARM has a better compiler for its processors, Renesas has a better compiler for its processors, and so on. Even the very old compilers for the Amiga generate better 68K code than GCC does.

However, the point of a code density comparison is to fix all variables except one, the architecture, so that it's the only thing that changes and the study can focus on it.

GCC is an "average" (speed, code density) compiler for many architectures, it's also one of the most used, and it's the one which covers the most architectures. That's why it's used in such studies.

Otherwise, let me know how you can do this without a similar tool. I'll quote the relevant part of the article again, highlighting the important parts:
"In order to select the base architecture for our experiments, we evaluated 15 variations of 7 different ISAs regarding to code size for the same set of programs, the SPEC 2006 benchmark. To evaluate them, we used gcc built specifically for each architecture variation and the same global gcc options to allow a fair comparison. By global options we mean all the options that are not architecture specific. From the architecture specific options, we used the ones that generate the smaller code size."

Is it clear now?
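To spell out what "same compiler, same global options, only the architecture changes" buys you, here is a tiny Python sketch of how such code sizes are usually put side by side: normalise every architecture's size to one baseline. The function name is mine, and the ISA labels and byte counts are placeholders, not results from the article.

```python
def relative_code_size(text_sizes, baseline):
    """Return each ISA's code size as a ratio to the baseline ISA.

    text_sizes maps an ISA label to the code size in bytes of the SAME
    programs, built with the SAME compiler and global options.
    """
    base = text_sizes[baseline]
    return {isa: size / base for isa, size in text_sizes.items()}


# Placeholder labels and sizes, only to show the arithmetic.
sizes = {"isa_a": 1000, "isa_b": 1250, "isa_c": 800}
print(relative_code_size(sizes, baseline="isa_a"))
# {'isa_a': 1.0, 'isa_b': 1.25, 'isa_c': 0.8}
```

With a different compiler per architecture, these ratios would mix compiler quality with ISA density, which is exactly what the study tries to avoid.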
Quote:
Quote:
So what? When I was working at Intel I VALIDATED our C/C++ (and Fortran) compiler for our HPC suite.

You don't have to tell me how good our compiler was/is.

Red herring. You missed my point. GCC is not the best compiler for the platform.
Quote:
So what?

You missed my point. GCC is not the best compiler for the platform.

I missed the point simply because you hadn't given this information. Next time don't write a sentence that only you can understand because the relevant part is still locked in your brain, where people cannot access it.
Quote:
Quote:
Sure. Who has argued against it?

But maybe you completely missed the point of the discussion and the thread. Please re-read the thread, and understand why SPEC was used and, in general, why it's still used ALSO for code density metrics.

Are you going to argue for GCC being the best compiler for X86 and X86-64(AMD64)?

That's unbelievable, really! By what kind of "logic" do you think I've given you even the slightest reason to reach this "conclusion"?

I suggest again that, before replying to other messages, you read more carefully AND understand what other people are saying (AND the topic... possibly!).

To me it looks like you read a sentence that doesn't sound right to you, and then reply to it by projecting your own thoughts onto it. But that's too fast: take your time to give a correct reply, because your current process isn't working.
And what's even worse, you're putting words in people's mouths, distorting what they said. I'm only responsible for what I, and only I, have written: not for what you (wrongly) understood.

So, again: take your time.

@Hammer Quote:
Hammer wrote:
@cdimauro Quote:
Again, apples and oranges comparisons.

Different chips have different uncore elements, as well as different amounts (and type) of L2/L3 caches.

If you want to compare something, which is related to the (micro)architectures, then you have to consider only the core.

It's apples to apples.

I've checked chip-architects.com but I've found no article which shows how those pictures were obtained.
Quote:
ARM Cortex A15 (28 nm node) = 2.7 mm²

AMD Jaguar (28 nm node) = 3.1 mm², delivering out-of-order processing with 128-bit AVX v1 hardware.

Intel Atom (Bonnell) Pineview (32 nm node) = 5.6 mm², in-order processing.

AMD won the cost-sensitive Xbox One and PS4 game consoles and has its well-known embedded X86-64 devices.

Intel's 32 nm Atom (Bonnell) Pineview is crap, i.e. releasing something close to the classic Pentium design in 2008 wasn't wise.

At least Intel offered something for the very low-power mobile market. Please tell me: what did AMD offer for that market in 2008? I can't wait for your reply...

And it was something which was competitive. See below, in the last part.
Quote:
For 2013, 22nm Atom Silvermont was missing AVX.

Nevertheless, it destroyed AMD's Jaguar even while using only the old SSE extensions.

Enjoy: https://www.anandtech.com/show/7314/intel-baytrail-preview-intel-atom-z3770-tested/2

AND at a fraction of the power consumption.
Quote:
AMD 28 nm Jaguar's die area is close enough to ARM Cortex A15 with Jaguar having superior performance when compared to both 28 nm ARM Cortex A15 and Intel's 32 nm Atom (Bonnell) Pineview.

I'll let you in on a secret: the same year Intel produced its Silvermont which... you can see above how it performed.

And, again, it's quite evident that you're changing the cards on the table, only to put your beloved AMD in a good light. Yes, because now you're talking about performance: hence considering the whole platform, and not only the cores. And by comparing only what you like to see.

First, let me show how good Bonnell was compared to Bobcat:
https://www.extremetech.com/extreme/188396-the-final-isa-showdown-is-arm-x86-or-mips-intrinsically-more-power-efficient/2
As you can see, out of 4 benchmarks, the first was clearly won by Bonnell, the second narrowly by Bobcat, the third clearly by Bobcat, and the fourth was essentially a draw. This is from a pure performance perspective (first picture).
But if you check from a power consumption perspective (second picture), Bonnell does WAY better than Bobcat.
It gets even worse if we look at efficiency (performance AND power consumption): Bonnell destroys Bobcat.
The same pictures also include some ARM processors: A8, A9 and A15. So you can look at the same benchmarks and draw your own conclusions.

Last but not least, let's see how this was possible. So, the micro-architecture side: https://www.extremetech.com/extreme/188396-the-final-isa-showdown-is-arm-x86-or-mips-intrinsically-more-power-efficient
Here we see that Intel's platform was a 2-way in-order design, with 24KB of L1 data cache, 512KB of shared L2 cache, on a 45nm process.
Whereas AMD's was a 2-way out-of-order design, with 32KB of L1 data cache, 512KB of L2 cache per core, on a 45nm process.

So, what you classified as "crap" was able to compete with your beautiful design in terms of pure performance, while doing much, much better on power consumption AND efficiency, using far fewer resources (less L1 data cache, less L2 cache) and a weaker (in-order) execution design.

BTW, according to AMD, Jaguar's IPC was only about 15% better than Bobcat's:
https://www.slideshare.net/AMDPhil/bobcat-to-jaguarv2

Now let's see if you can stop your propaganda.



Amigaworld.net was originally founded by David Doyle