Forum Index / General Technology (No Console Threads) / Understanding CPU and integer performance
cdimauro 
Re: Understanding CPU and integer performance
Posted on 21-Feb-2014 22:19:10
#21 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@WolfToTheMoon

Quote:

WolfToTheMoon wrote:
@olegil

Quote:
PPC has a MUCH better SIMD


Compared to what, MMX? Sure.
But x64 now uses SSE and AVX.

SSE was not that good at the time, especially in its first (P3) implementation. But then came SSE2, which greatly enhanced it.

Now there's AVX, which is a HUGE step forward, and literally obliterates Altivec: http://mil-embedded.com/articles/avx-leap-forward-dsp-performance/

Then will come AVX-512 (soon, on the next Xeon Phi, which already has a completely new SIMD vector unit which is... impressive! Take a look at it, especially if you're a CISCy guy): http://software.intel.com/en-us/blogs/2013/avx-512-instructions

Focusing on SIMD, there's no RISC that can compete with a CISC SIMD unit. Absolutely. And we know how important a SIMD unit is "nowadays" (i.e., for the last 20 years).

Many benchmarks don't use SIMD. It's unbelievable to me. And I don't know why people pay so much attention to this crap...

cdimauro 
Re: Understanding CPU and integer performance
Posted on 21-Feb-2014 22:32:44
#22 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@minator

Quote:

minator wrote:
CISC was designed when memory was very expensive so instructions were small to save money - but it made the processors complex.

RISC processors were thought up when memory was cheaper and being simple meant you could go faster. Being simpler also meant they were cheaper to make. Compare the die of a G5 to a contemporary x86 and you'll find it's substantially smaller.

That was true when CPU cores were made of thousands of transistors. Now we have hundreds of millions, or even billions, of transistors, so the CISC complexity doesn't have the weight it had in the past.
Quote:
The advantage of CISC is smaller instructions but ...he who giveth also taketh away.
CISC instructions have to be taken apart and this takes a load of logic. In x86 they are converted into micro-ops which are basically RISC instructions

The same applies to some RISCs, like the G5 and POWER.
Quote:
(x86 hasn't been proper CISC for a long time).

x86 is still a CISC, because it has the attributes which have historically defined a CISC member (basically variable-length instructions, and instructions which directly access memory instead of using loads/stores).
Quote:
All this also has to be tracked through the pipeline so while CISC saves room in the cache it has a cost in area and power. Area = $$$ and these days chip design is all about power.

How much does that cost when you have billions of transistors?
Quote:
At the high end the only company doing CISC now is IBM and that's only in their mainframes. x86 is more of a hybrid and has been using RISC internally since the 90s. Other than those and some microcontrollers, RISC has pretty much won.

x86 is a CISC, as I said. Don't confuse the internal implementation, which is totally hidden, with the ISA. It's the ISA which is exposed outside, and which defines a processor, NOT the internal implementation.

If RISC were that good, why has no company built a RISC processor with an ISA similar to the one used internally by x86 et al.? The answer is quite simple: despite lacking the x86 CISC complexity, the performance would be MUCH slower...
Quote:
BTW you can also do multi-length instructions in RISC. e.g. 32 bit ARMs can do Thumb instructions which are 16 bits long.

Thumb is a CISC ISA, not a RISC ISA, since it's variable-length (16- and 32-bit opcodes).
Quote:
Even code size is not a CISC advantage.

Nein. Code size is a CISC advantage due to its variable-length nature, which RISCs lack.

The only way for some RISC-based processors to be competitive from this point of view was to... become CISC, using variable-length opcodes.

matthey 
Re: Understanding CPU and integer performance
Posted on 21-Feb-2014 22:42:53
#23 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2010
From: Kansas

Quote:

itix wrote:
In my experience 32 registers are not as useful as one might think. You can't use registers straight away: you must save a register to the stack, then load the new value into the register, and at the end of the function restore the register. If you need that register only once, it is inefficient versus a read+write instruction.

15 registers on the 68k (A7 was the SP) was often enough, except that they were not general purpose registers and you could run out of address registers.


RISC needs 32 general purpose registers but 16 is enough for CISC. You are correct that the 68k does not have 16 truly general purpose registers. It has the unusual An/Dn split, which came about because the original register file was split. The encoding still has some advantages (code density) but the register file is now unified (68060 and Apollo). Any advanced future design would probably be unified, so it is possible to open up address registers in instructions where there is an EA. This gives better orthogonality between An/Dn (good for compilers), reduces the number of needed instructions, reduces the need for scratch registers and improves code density.

It is better to avoid altering address registers while they are being used as pointers, because of change-of-use delays that are normally hidden (but allow powerful addressing modes), and because address register results are not forwarded (68060 behavior also). Being able to do an AND.L An,Dn or a BTST #0,An is a big improvement and would have no penalty unless An was just loaded or changed.

There are other ideas to decrease the number of scratch registers and increase the efficiency of the registers used. One common one is that MOVEQ #d8,Dn + OP.L Dm,Dn is no longer needed for OP.L #d8,Rn, as 32 bit immediates are automatically compressed to the same size if they fit in a signed 16 bits. We are still evaluating ideas but I think we can make it feel like there is another register or 2 available. Address registers will still be tight, especially with programs using small data (small programs).

You shouldn't be using a stack frame anymore unless your debugger needs it. Vbcc has this turned off for 68k code by default as it generates more efficient code using the stack. Programs using absolute addressing will not be any less efficient beyond the loss of code density, which should be minor with larger caches.

Note that A7 is pretty much a general purpose register even though it is used as the stack pointer practically everywhere. It is possible to save A7 to an absolute address and use it for other things, with only one minor difference: byte writes are aligned with padding, which could be seen as a feature by some but it is less orthogonal too. More could be done as far as orthogonality, registers and decoding efficiency if we re-encoded the 68k, but we would lose binary compatibility, and the source of a lot of 68k software is no longer available. It's barely worthwhile, as an enhanced 68k can be quite powerful with tiny little code and is still one of the easiest and most fun processors to program in assembler.

Quote:
Quote:

PowerPC compilers may need to add code to check the alignment to make sure the access is aligned if it can't 100% determine this. For one access the compiler probably wouldn't add any prefetch instructions but what if we want to add 1 to a series of addresses in memory? PowerPC compilers would start by aligning the memory access if necessary and generating a prefetch instruction. Next it would unroll the loop much as your example shows above except the "do stuff" would just be adding 1 a bunch of times in different registers.


Compilers don't do that. They assume the developer has properly aligned memory accesses. Nor have I seen compilers generating prefetch instructions...


Compilers align data allocated by the compiler and know that data is safe to access. There are times when the compiler doesn't know what a pointer is pointing to, though. I have seen extra aligning and alignment-proof code generated when compiling for a 68000. The 68000 traps on any odd non-byte access. I believe most PowerPC processors have a trap handler so it won't guru. It's still a fraction of normal speed though. Are there any PowerPC programs that show the hits from unaligned memory accesses?

It's interesting that there are no prefetch instructions. The PowerPC would never get up to full speed in memory without prefetch or stream detection with prefetching optimizations. Maybe there are load/store operations with caching hints? The whole DCache could be trashed by a stream the same size as the DCache. I asked Gunnar which processors have stream detection since I had never heard of it before:

Quote:

The late 68050 version did have this also.
I think new POWER and Intel cores also stream detect, but I do not know if they use such cache-trashing prevention.
G2/G3/G4 PowerPC chips did not have this.
They sucked when working with memory.
The G5 (970) did have some simple form of L2 prefetching, which was not perfect but worked OK.

cdimauro 
Re: Understanding CPU and integer performance
Posted on 21-Feb-2014 22:50:26
#24 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

The thread is interesting, but I've very little time now to reply to some cool messages.

I express my opinion only for some things now:

@KimmoK

Quote:

Multiple integer execution units make it possible to take two or more instructions into processing (pipelines) in one cycle.
(With a wide data path, a CPU can get (fetch) multiple instructions in. So far up to 4? Or even 8?)

It depends on the ISA, and is strongly related to the maximum number of instructions that can be decoded (and by their average length, of course).
Quote:
Modern CPUs (since the 68060 and Pentium) decode the fetched instructions into micro-operations that can be handled in smaller parts and with more units.

Modern CPUs have multiple integer units that execute micro-operations (Athlon has 3, i7 has up to 8).

…as a result…
An i7 processes up to 8 integer instructions per clock cycle (as an average) per core.
A PA6T processes up to 2.2 integer instructions per clock cycle (as an average) per core.

Maybe it's better to talk about how many instructions can be decoded and issued to the internal queue to be executed, and how many of them can be retired, per clock cycle.

IMO we can consider this: MAX IPC = MAX(MaxDecodedInstructions, MaxRetiredInstructions). The same for MIN IPC (using MIN in all parts).

But comparing IPCs from different architectures doesn't make sense, since the "useful work" done by the instructions can be very different. Generally CISCs do more work-per-instruction, so a lower IPC doesn't mean that a CISC processor is less powerful than a RISC one with a greater IPC.
Quote:
Often some instruction execution results depend on the results of previous instructions, in this situation the processing needs to stop and wait for previous result to become out of execution. (execution units stand idle unless provided more work in the meanwhile)

Usually CISCs have fewer dependencies, as matt explained.
Quote:
Also many instructions need data to process, if the data is not available in CPU registers, it need to be read from L1 cache or from L2 or from system RAM. Instruction processing is stopped until data becomes available. (execution units stand idle unless provided more work in the meanwhile)

One of the CISC advantages is the possibility to directly load large immediates from the code cache, avoiding a costly load from memory and gaining better code density.

About code density, I've written a series of articles about x86 and x64. This is the last one: http://www.appuntidigitali.it/18371/statistiche-su-x86-x64-parte-9-legacy-e-conclusioni/ It's in Italian, but the Google translation is good enough to make it understandable. At the end of the article there are links to the previous 8.

matthey 
Re: Understanding CPU and integer performance
Posted on 22-Feb-2014 1:39:44
#25 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2010
From: Kansas

Quote:

cdimauro wrote:
One of the CISC advantages is the possibility to directly load large immediates from the code cache, avoiding a costly load from memory and gaining better code density.


And using simple compression of those immediates also ;).

Quote:

About code density, I've written a series of articles about x86 and x64. This is the last one: http://www.appuntidigitali.it/18371/statistiche-su-x86-x64-parte-9-legacy-e-conclusioni/ It's in Italian, but the google translation is good enough to make it understandable. At the end of the article there are the links to the previous 8 ones.


Your article stats:

Quote:

x86
Size Count
2 440045
3 419064
1 298078
5 190101
6 157035
4 136394
7 97835
10 4660
8 2411
11 829
9 117
Average length: 3.2


Quote:

x86_64
Size Count
3 362650
5 353288
4 283352
2 240530
8 172284
7 131164
1 91535
6 80322
9 12945
10 3725
11 3257
12 1997
13 126
14 89
15 67
Average length: 4.3


Statistics on vbcc compiled by vbcc for the 68060+FPU:

> ADisS -m vbccm68k

ADisS xxx (23.01.14)
by Martin Apel & Matt Hey


Statistics
----------------------------------------
Instructions 2 bytes in length = 66810
Instructions 4 bytes in length = 37739
Instructions 6 bytes in length = 19471
Instructions 8 bytes in length = 1721
Instructions 10 bytes in length = 339
Instructions 12 bytes in length = 0
Instructions 14 bytes in length = 0
Instructions 16 bytes in length = 0
Instructions 18 bytes in length = 0
Instructions 20 bytes in length = 0
Instructions 22 bytes in length = 0
Instruction total = 126080
Code total bytes = 418560

3863 op.l #,Rn -> op.l #.w,Rn : bytes saved = 7726
0 opi.l #,Dn -> op.l #.w,Dn : 68kF1 bytes saved = 0
1234 opi.l #,EA -> opi.l #.w,EA : 68kF2 bytes saved = 2468
892 pea (xxx).w -> mov3q #,EA : bytes saved = 1784
342 move.l #,EA -> mov3q #,EA : bytes saved = 1368 68kF bytes saved = 684

EA modes used
----------------------------------------
Dn = 19833
An = 11056
# = 486
# = 589
# = 4203
# = 0
# = 0
(xxx).w = 2259
(xxx).l = 15081
(An) = 7835
(An)+ = 7451
-(An) = 13485
(d16,An) = 23170
(d8,An,Xn*SF) = 2277
(bd,An,Xn*SF) = 23
(d16,PC) = 2139
(d8,PC,Xn*SF) = 39
(bd,PC,Xn*SF) = 0

Integer instructions
----------------------------------------
Instructions with 0 ops = 666
Instructions 1 op reg = 8336
Instructions 1 op imm = 21355
Instructions 1 op mem = 17211
Instructions 2 op reg,reg = 12728
Instructions 2 op imm,reg = 22670
Instructions 2 op reg,imm = 553
Instructions 2 op mem,reg = 19974
Instructions 2 op reg,mem = 12026
Instructions 2 op imm,mem = 2617
Instructions 2 op mem,mem = 7394

It should be pretty reliable except the last part, which depends on how instructions are classified. I should separate out integer and fp.

The immediate longword compression may be disappointing to some, but it's really not bad for one code density improvement (these are what would be saved and are NOT included in the other stats). The more impressive improvement is the code density gained from the floating point compression (this is included in the stats, as it is already being done by vbcc+vasm but not GCC). Every single double (64 bit immediate) used by the compiler (compilers default to double) was compressed to 32 bits by the current vasm optimization. Many of these should compress to 16 bit immediates if we add half-precision IEEE fp to the FPU as I have proposed.

Also note that the ColdFire MOV3Q is getting most of its gain from PEA (pushing data on the stack). Our 32 bit longword immediate compression is better, and MOV3Q would be mostly unnecessary with a new ABI with register passing.

One thing I have discovered with my ADisS (ADis Stats special version) is that 68k code statistics vary a lot by compiler and program. glQuake was unable to compress 100% of its double fp immediates.

Gunnar decided to add another register write port (expensive) due to the heavy use of -(An) and (An)+, so there will be no penalty for using them, and it will help some other instructions that write 2 registers :).

Last edited by matthey on 22-Feb-2014 at 01:57 AM.
Last edited by matthey on 22-Feb-2014 at 01:43 AM.

tlosm 
Re: Understanding CPU and integer performance
Posted on 22-Feb-2014 9:15:13
#26 ]
Elite Member
Joined: 28-Jul-2012
Posts: 2746
From: Amiga land

The only way for me to understand a CPU is to use the same OS with the same benchmark tool :)


This is the quad i7 2.93 GHz on my Intel iMac

System Info
Xbench Version 1.3
System Version 10.6.8 (10K549)
Physical RAM 8192 MB
Model iMac11,3
Drive Type SAMSUNG HD103SJ

CPU Test 237.03
GCD Loop 368.39 19.42 Mops/sec
Floating Point Basic 212.52 5.05 Gflop/sec
vecLib FFT 137.81 4.55 Gflop/sec
Floating Point Library 454.65 79.17 Mops/sec
Thread Test 888.92
Computation 917.54 18.59 Mops/sec, 4 threads
Lock Contention 862.03 37.08 Mlocks/sec, 4 threads
Memory Test 413.81
System 408.35
Allocate 311.79 1.14 Malloc/sec
Fill 399.97 19447.33 MB/sec
Copy 610.11 12601.52 MB/sec
Stream 419.42
Copy 397.15 8203.01 MB/sec
Scale 382.45 7901.20 MB/sec
Add 457.79 9751.83 MB/sec
Triad 450.49 9637.00 MB/sec


This is the quad G5 2.5 GHz on my Power Mac... not bad :)



Xbench Version 1.3
System Version 10.5.8 (9L31a)
Physical RAM 8192 MB
Model PowerMac11,2
Processor PowerPC G5x4 @ 2.50 GHz
L1 Cache 64K (instruction), 32K (data)
L2 Cache 1024K @ 2.50 GHz
Bus Frequency 1 GHz
Drive Type Maxtor 6B160M0 Maxtor 6B160M0


CPU Test 148.84
GCD Loop 114.48 6.03 Mops/sec
Floating Point Basic 168.00 3.99 Gflop/sec
AltiVec Basic 328.82 13.11 Gflop/sec
vecLib FFT 128.74 4.25 Gflop/sec
Floating Point Library 123.50 21.50 Mops/sec
Thread Test 190.14
Computation 227.99 4.62 Mops/sec, 4 threads
Lock Contention 163.08 7.02 Mlocks/sec, 4 threads
Memory Test 159.79
System 169.92
Allocate 247.08 907.36 Kalloc/sec
Fill 242.22 11777.41 MB/sec
Copy 105.48 2178.74 MB/sec
Stream 150.80
Copy 148.40 3065.10 MB/sec [G5]
Scale 145.83 3012.88 MB/sec [G5]
Add 152.66 3252.03 MB/sec [G5]
Triad 156.77 3353.60 MB/sec [G5]


And consider that 10.5.8 doesn't run in 64-bit mode and it can still do this... I hope to have the X1900 soon so I can put Lubuntu on it and see the real performance of these CPUs

http://www.youtube.com/watch?v=GsTw6AN-0uE

Last edited by tlosm on 22-Feb-2014 at 09:26 AM.
Last edited by tlosm on 22-Feb-2014 at 09:20 AM.
Last edited by tlosm on 22-Feb-2014 at 09:17 AM.

_________________
I love Amiga and new hope by AmigaNG
A 500 + ; CDTV; CD32;
PowerMac G5 Quad 8GB,SSD,SSHD,7800gtx,Radeon R5 230 2GB;
MacBook Pro Retina I7 2.3ghz;
#nomorea-eoninmyhome

itix 
Re: Understanding CPU and integer performance
Posted on 22-Feb-2014 9:57:22
#27 ]
Elite Member
Joined: 22-Dec-2004
Posts: 3398
From: Freedom world

@matthey

Quote:

The 68000 traps with one odd non-byte access. I believe most PowerPC processors have a trap handler so it won't guru. It's still a fraction of normal speed though. Are there any PowerPC programs that show the hits from unaligned memory accesses?


PowerPC can read 32-bit integers from unaligned addresses. Floats and doubles must be accessed from a word-aligned (32-bit) address or the CPU throws an alignment exception. It is silently handled by the operating system. Should I say unfortunately, because WarpUp programs don't align floats and doubles properly and run like a duck.

I don't know any program that would show alignment exceptions (except the OS itself, when it can't handle the exception and gurus).

Quote:

It's interesting that there are no prefetch instructions. The PowerPC would never get up to full speed in memory without prefetch or stream detection with prefetching optimizations. Maybe there are load/store operations with caching hints? The whole DCache could be trashed by a stream the same size as the DCache. I asked Gunnar which processors have stream detection since I had never heard of it before:


I just read the GCC documentation and there is some support for prefetching, at least via __builtin_prefetch(). I haven't tried newer GCC versions (4.x) lately since I am still defaulting to GCC 2.

_________________
Amiga Developer
Amiga 500, Efika, Mac Mini and PowerBook

cdimauro 
Re: Understanding CPU and integer performance
Posted on 22-Feb-2014 18:53:57
#28 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey

Quote:

matthey wrote:
Quote:

cdimauro wrote:
One of the CISC advantages is the possibility to directly load large immediates from the code cache, avoiding a costly load from memory and gaining better code density.


And using simple compression of those immediates also ;).

Quote:

About code density, I've written a series of articles about x86 and x64. This is the last one: http://www.appuntidigitali.it/18371/statistiche-su-x86-x64-parte-9-legacy-e-conclusioni/ It's in Italian, but the google translation is good enough to make it understandable. At the end of the article there are the links to the previous 8 ones.


Your article stats:

Quote:

x86
Size Count
2 440045
3 419064
1 298078
5 190101
6 157035
4 136394
7 97835
10 4660
8 2411
11 829
9 117
Average length: 3.2


Quote:

x86_64
Size Count
3 362650
5 353288
4 283352
2 240530
8 172284
7 131164
1 91535
6 80322
9 12945
10 3725
11 3257
12 1997
13 126
14 89
15 67
Average length: 4.3


Note that the x86_64 code is generally compiled for speed and not for size, as I've written in the article. In fact, it has A LOT of NOPs to properly align branch/call targets to 16-byte boundaries.

I also found a bug in the compiler used, which produces redundant REX prefixes in some instructions.
Quote:

Statistics on vbcc compiled by vbcc for the 68060+FPU:

> ADisS -m vbccm68k

ADisS xxx (23.01.14)
by Martin Apel & Matt Hey


Statistics
----------------------------------------
Instructions 2 bytes in length = 66810
Instructions 4 bytes in length = 37739
Instructions 6 bytes in length = 19471
Instructions 8 bytes in length = 1721
Instructions 10 bytes in length = 339
Instructions 12 bytes in length = 0
Instructions 14 bytes in length = 0
Instructions 16 bytes in length = 0
Instructions 18 bytes in length = 0
Instructions 20 bytes in length = 0
Instructions 22 bytes in length = 0
Instruction total = 126080
Code total bytes = 418560

Very good, as expected from the 68K: about 3.3 bytes average instruction length.
Quote:

3863 op.l #,Rn -> op.l #.w,Rn : bytes saved = 7726
0 opi.l #,Dn -> op.l #.w,Dn : 68kF1 bytes saved = 0
1234 opi.l #,EA -> opi.l #.w,EA : 68kF2 bytes saved = 2468
892 pea (xxx).w -> mov3q #,EA : bytes saved = 1784
342 move.l #,EA -> mov3q #,EA : bytes saved = 1368 68kF bytes saved = 684

Does the above "Code total bytes" already take these values into account?
Quote:

EA modes used
----------------------------------------
Dn = 19833
An = 11056
# = 486
# = 589
# = 4203
# = 0
# = 0
(xxx).w = 2259
(xxx).l = 15081
(An) = 7835
(An)+ = 7451
-(An) = 13485
(d16,An) = 23170
(d8,An,Xn*SF) = 2277
(bd,An,Xn*SF) = 23
(d16,PC) = 2139
(d8,PC,Xn*SF) = 39
(bd,PC,Xn*SF) = 0

So no double indirect modes are used. But why do I see so many .w & .l addressing modes? Are they for the PEAs?

It's very strange to see more -(An) than (An)+, but it should be due to the pushes and pops of values on the stack.
Quote:


Integer instructions
----------------------------------------
Instructions with 0 ops = 666
Instructions 1 op reg = 8336
Instructions 1 op imm = 21355
Instructions 1 op mem = 17211
Instructions 2 op reg,reg = 12728
Instructions 2 op imm,reg = 22670
Instructions 2 op reg,imm = 553
Instructions 2 op mem,reg = 19974
Instructions 2 op reg,mem = 12026
Instructions 2 op imm,mem = 2617
Instructions 2 op mem,mem = 7394

op mem,mem is mostly MOVE and a very little bit of CMPM, right?
Quote:

It should be pretty reliable except the last part which depends on how instructions are classified. I should separate out integer and fp.

The script that I've written does it. I have general stats, and specific stats for integer, FPU, MMX, SSE, and AVX instructions.
Quote:

The immediate longword compression may be disappointing to some but it's really not bad for one code density improvement (these are what would be saved and NOT included in the other stats).

I don't see any problem here. The 8086 has had it from the beginning, allowing 8-bit immediates in 16-bit instructions (and in 32-bit instructions on the 386).

The only thing that I don't like is using specific addressing modes to define hard-coded immediates like 0, 1, -1.
Quote:

The more impressive improvement is the code density gained from the floating point compression (this is included in the stats as it is already being done by vbcc+vasm but not GCC). Every single double (64 bit immediate) used by the compiler (compilers default to double) was compressed to 32 bits by the current vasm optimization. Many of these should compress to 16 bit immediates if we add half IEEE fp to the FPU as I have proposed.

Frankly speaking, I don't agree with enhancing an old FPU unit. It's better to spend resources on a new, modern SIMD unit, which can handle both scalar and vector data in a more uniform way.

AVX-512 has specific addressing modes to convert compact values to the more common (and used) formats, albeit they must stay in memory and cannot be loaded as immediates.
Quote:

Also note that the ColdFire MOV3Q is getting most of its gain from PEA (pushing data on the stack). Our 32 bit longword immediate compression is better, and MOV3Q would be mostly unnecessary with a new ABI with register passing.

I don't know the MOV3Q opcode format, but if it's a 16-bit one it'll still have an advantage over the 32-bit longword compression (which uses an additional 16-bit immediate, I suppose).
Quote:

One thing I have discovered with my ADisS (ADis Stats special version) is that the 68k code statistics vary a lot by compiler and program.

Compiling the same source?
Quote:

glQuake was unable to compress 100% of its double fp immediates.

Strange. I've always thought that it uses FP32 values instead of FP64 ones.
Quote:

Gunnar decided to add another register write port (expensive) due to the heavy use of -(An) and (An)+, so there will be no penalty for using them

Wise decision. Such addressing modes are heavily used and are one of the keys to the success of the 68Ks in terms of execution speed and code density.
Quote:

and it will help some other instructions that write 2 registers :).

Like long MULS and DIVx?

cdimauro 
Re: Understanding CPU and integer performance
Posted on 22-Feb-2014 18:59:12
#29 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

Quote:

tlosm wrote:
The only way for me to understand a CPU is to use the same OS with the same benchmark tool :)

But the OS is not the same: here you compared OS X 10.6.8 and 10.5.8.
Quote:

This is The Quad I7 2.93 ghz on my intel Imac

System Info
Xbench Version 1.3
System Version 10.6.8 (10K549)
Physical RAM 8192 MB
Model iMac11,3
Drive Type SAMSUNG HD103SJ

CPU Test 237.03
GCD Loop 368.39 19.42 Mops/sec
Floating Point Basic 212.52 5.05 Gflop/sec
vecLib FFT 137.81 4.55 Gflop/sec
Floating Point Library 454.65 79.17 Mops/sec
Thread Test 888.92
Computation 917.54 18.59 Mops/sec, 4 threads
Lock Contention 862.03 37.08 Mlocks/sec, 4 threads
Memory Test 413.81
System 408.35
Allocate 311.79 1.14 Malloc/sec
Fill 399.97 19447.33 MB/sec
Copy 610.11 12601.52 MB/sec
Stream 419.42
Copy 397.15 8203.01 MB/sec
Scale 382.45 7901.20 MB/sec
Add 457.79 9751.83 MB/sec
Triad 450.49 9637.00 MB/sec


This is the quad G5 2.5 GHz on my Power Mac... not bad :)



Xbench Version 1.3
System Version 10.5.8 (9L31a)
Physical RAM 8192 MB
Model PowerMac11,2
Processor PowerPC G5x4 @ 2.50 GHz
L1 Cache 64K (instruction), 32K (data)
L2 Cache 1024K @ 2.50 GHz
Bus Frequency 1 GHz
Drive Type Maxtor 6B160M0 Maxtor 6B160M0


CPU Test 148.84
GCD Loop 114.48 6.03 Mops/sec
Floating Point Basic 168.00 3.99 Gflop/sec
AltiVec Basic 328.82 13.11 Gflop/sec
vecLib FFT 128.74 4.25 Gflop/sec
Floating Point Library 123.50 21.50 Mops/sec
Thread Test 190.14
Computation 227.99 4.62 Mops/sec, 4 threads
Lock Contention 163.08 7.02 Mlocks/sec, 4 threads
Memory Test 159.79
System 169.92
Allocate 247.08 907.36 Kalloc/sec
Fill 242.22 11777.41 MB/sec
Copy 105.48 2178.74 MB/sec
Stream 150.80
Copy 148.40 3065.10 MB/sec [G5]
Scale 145.83 3012.88 MB/sec [G5]
Add 152.66 3252.03 MB/sec [G5]
Triad 156.77 3353.60 MB/sec [G5]

Not so bad for an old G5, albeit I don't like synthetic benchmarks.

Anyway, I see AltiVec for the G5, but no SSE or (the much better) AVX for your i7. Are these SIMDs not supported by your benchmark tool?
Quote:

And consider that 10.5.8 doesn't run in 64-bit mode and it can still do this... I hope to have the X1900 soon so I can put Lubuntu on it and see the real performance of these CPUs

Don't expect something better than what you already got: 64-bit mode on PowerPCs is almost always SLOWER than the (usual) 32-bit one.

cdimauro 
Re: Understanding CPU and integer performance
Posted on 22-Feb-2014 19:05:03
#30 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

Quote:

itix wrote:
@matthey

Quote:

The 68000 traps with one odd non-byte access. I believe most PowerPC processors have a trap handler so it won't guru. It's still a fraction of normal speed though. Are there any PowerPC programs that show the hits from unaligned memory accesses?


PowerPC can read 32-bit integers from unaligned addresses.

I'm not a PowerPC expert, but I suppose they have a specific load instruction which is slower than the usual (aligned) one, right?

matthey 
Re: Understanding CPU and integer performance
Posted on 22-Feb-2014 19:36:28
#31 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2010
From: Kansas

Quote:

itix wrote:
PowerPC can read 32-bit integers from unaligned addresses. Floats and doubles must be accessed from a word-aligned (32-bit) address or the CPU throws an alignment exception. It is silently handled by the operating system. Should I say unfortunately, because WarpUp programs don't align floats and doubles properly and run like a duck.


I thought the PowerPC ISA pretty well left it open how to handle unaligned accesses, although most modern implementations are more friendly. Maybe they changed it in a later version of the ISA. The larger the data sizes, the higher the cost of hardware alignment support, though. Even the 68060 takes a hit in performance with unaligned 64-bit doubles. The 68040-68060 and Apollo are free or nearly free on unaligned accesses.

tlosm 
Re: Understanding CPU and integer performance
Posted on 22-Feb-2014 19:44:09
#32 ]
Elite Member
Joined: 28-Jul-2012
Posts: 2746
From: Amiga land

@cdimauro

These are the XBench default configs; there are no SIMDs.

About the OS:
From 10.5.8 to 10.6 the difference is only some improvements in the interface and OS internals, plus compatibility with some boards and components... nothing special.
Mac OS X became fully 64-bit only from 10.8 (Mountain Lion)..
The big problem is that all the software is made to run in 32-bit :( :( :(
Plus Virtual PC uses only one CPU... I was really curious to know how good real emulation is on this machine..

Last edited by tlosm on 22-Feb-2014 at 07:51 PM.

_________________
I love Amiga and new hope by AmigaNG
A 500 + ; CDTV; CD32;
PowerMac G5 Quad 8GB,SSD,SSHD,7800gtx,Radeon R5 230 2GB;
MacBook Pro Retina I7 2.3ghz;
#nomorea-eoninmyhome

matthey 
Re: Understanding CPU and integer performance
Posted on 22-Feb-2014 21:29:38
#33 ]
Elite Member
Joined: 14-Mar-2007
Posts: 2010
From: Kansas

Quote:

cdimauro wrote:
Pay attention that x86_64 code is generally compiled for speed and not for size, as I've written in the article. In fact, it has A LOT of NOPs to properly align branch/call targets to 16-byte boundaries.


Almost all code is compiled for best speed.

Quote:

Very good, as expected by 68K: about 3.3 bytes average instruction length.


That's with light FPU use mixed code. It's only -O1 in vbcc although vbcc doesn't improve that much with -O2 or higher. The vbcc 68k backend is very simple and below average in code generation quality but that means it's not doing a lot of CPU specific optimizations. Note that vbcc still generates average code because of vasm optimizations, good quality vclib static include libs, inlined code and some intelligent high level optimizations. My beta version of vbcc is less than a frame per second behind GCC 2.95.3 which probably generates the best quality integer 68k code of any 68k compiler.

Quote:
Quote:

3863 op.l #,Rn -> op.l #.w,Rn : bytes saved = 7726
0 opi.l #,Dn -> op.l #.w,Dn : 68kF1 bytes saved = 0
1234 opi.l #,EA -> opi.l #.w,EA : 68kF2 bytes saved = 2468
892 pea (xxx).w -> mov3q #,EA : bytes saved = 1784
342 move.l #,EA -> mov3q #,EA : bytes saved = 1368 68kF bytes saved = 684

Does the above "Code total bytes" already take these values into account?


No. The stats only count the existing disassembled code. This many more bytes would be saved and the average instruction length would drop. The MOV3Q savings would not be as great with the immediate compression though. I have a newer version of ADisS which shows partial MVS/MVZ (x86 MOVSX/MOVZX) savings which are small but only a fraction of the uses.

Quote:
Quote:

EA modes used
----------------------------------------
Dn = 19833
An = 11056
# = 486
# = 589
# = 4203
# = 0
# = 0
(xxx).w = 2259
(xxx).l = 15081
(An) = 7835
(An)+ = 7451
-(An) = 13485
(d16,An) = 23170
(d8,An,Xn*SF) = 2277
(bd,An,Xn*SF) = 23
(d16,PC) = 2139
(d8,PC,Xn*SF) = 39
(bd,PC,Xn*SF) = 0

So no double indirect modes are used. But why do I see so many .w & .l addressing modes? Are they for the PEAs?


Vbcc doesn't generate any double indirect addressing modes that I know of. Some versions of GCC and SAS/C do (usually jump tables). Yes, the (xxx).w are usually used with PEA for pushing immediates on the stack. Most (xxx).l are RELOCs (relocatable addresses) that are filled in when the program starts. This is common for programs that are too big to compile with small data. Absolute addressing is actually one of the simplest addressing modes for the processor and it saves a register used for a pointer. The speed is good with adequate instruction fetch and ICache but it's not as good for code density.

Quote:

It's very strange to see more -(An) than (An)+, but it should be due to the pushes and pops of values onto the stack.


I think many -(An) are for stack use. I thought the -(An) being more common was also strange, but other 68k programs are similar. I can't rule out my code having a bug, but looking at the disassembly I do see a lot of -(An).

Quote:

op mem,mem is mostly MOVE and a very little bit of CMPM, right?


Correct. CMPM would be in 2nd place but a long way back.

Quote:

The immediate longword compression may be disappointing to some but it's really not bad for one code density improvement (these are what would be saved and NOT included in the other stats).

I don't see any problem here. The 8086 had it from the beginning, allowing 8-bit immediates in 16-bit instructions (and in 32-bit instructions on the 386).


Aren't most x86 small immediates based on unsigned values? The 68k is mostly signed, using sign extension. This takes a little more CPU power, but it gives small negative numbers, which is better IMO. Sign extension is much simpler than the shifting which early ARM did and is a good way to uncompress small immediates.

Quote:

The only thing that I don't like is using specific addressing modes to define hard-coded immediates like 0, 1, -1.


That would mostly be gone with the replacement of PEA with MOVE and a compressed longword immediate. We do need a new ABI passing variables in registers. Amiga programs that use libraries extensively don't have this problem because they do pass variables in registers.

Quote:

Frankly speaking, I don't agree on enhancing an old FPU unit. It's better to spend resources on a new, modern, SIMD unit, which can handle both scalar and vector data, in a more uniform way.


I understand. Some would say not to implement the 68k FPU at all. The 68k FPU is easy to use and flexible though. Branches and integer conversions are common. The SIMD will not be getting double fp support for a long time if ever and that is the default for C. The FPU is all that is necessary for light to moderate fp use by a compiler. Any resources for enhancements to the FPU are small compared to an SIMD. I am opposed to doubling the number of FPU registers as would be needed for heavy parallel work as that is SIMD territory. My focus for the FPU is giving compilers what they need, reducing the code for common compiler operations and opening up what is mostly free and easy performance enhancements. I'm testing and finishing up the new vclib c99 math support for vbcc. I have some good ideas from my work and the resource cost should be low. It's a completely different focus than performance maximizing the resource hungry SIMD.

Quote:
Quote:

Also note that the ColdFire MOV3Q is getting most of it's gain from PEA (pushing data on the stack). Our 32 bit longword immediate compression is better and MOV3Q would be mostly unnecessary with a new ABI with register passing.

I don't know the MOV3Q opcode format, but if it's a 16-bit one it'll still have an advantage over the 32-bit longword compression (which uses an additional 16-bit immediate, I suppose).


MOV3Q is a 16 bit encoding and it would still provide a small code density benefit as well as help ColdFire compatibility. However, the encoding is in A-line which many 68k OSs used for trapping to function calls. MOV3Q seems out of place like an add-on to the 68k also. We can convert it as we won't be binary compatible to the ColdFire but rather highly source compatible.

Quote:
Quote:

glQuake was unable to compress 100% of it's double fp immediates.

Strange. I've always thought that it uses FP32 values instead of FP64 ones.


Quake does use mostly single precision float but double is needed sometimes. Quake uses a lot of fp.

Quote:
Quote:

Gunnar decided to add another register write port (expensive) due to the heavy use of -(An)+ so there will be no penalty for using them

Wise decision. Such addressing modes are heavily used and are one of the keys to the success of the 68K in terms of execution speed and code density.

and it will help some other instructions that write 2 registers :).
Like long MULS and DIVx?


I advised adding an extra register write port earlier but write ports are much more expensive than read ports because of forwarding. Less port sharing does simplify some of the logic elsewhere. Statistics are useful for decision making ;).

Yes this should also help with integer 64 bit MULx.L and DIVx.L performance which will be back in hardware.

Last edited by matthey on 22-Feb-2014 at 09:30 PM.

cdimauro 
Re: Understanding CPU and integer performance
Posted on 23-Feb-2014 7:01:55
#34 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@tlosm

Quote:

tlosm wrote:
@cdimauro

These are the XBench default configs; there are no SIMDs.

But I saw an Altivec result for your G5, so it seems to support SIMD, at least on PowerPC.
Quote:
about os
From 10.5.8 to 10.6 the difference is only some improvements in the interface and OS internals, plus compatibility with some boards and components... nothing special.

OK, but... they are different OSes.
Quote:
Mac OS X became fully 64-bit only from 10.8 (Mountain Lion)..

AFAIK Apple introduced full support for 64 bits with Snow Leopard (10.6). I wrote an article about it: http://www.appuntidigitali.it/4913/apple-ri-scopre-i-64-bit/
Quote:
The big problem is that all the software is made to run in 32-bit :( :( :(

Which is GOOD, since running in 64-bit mode on PowerPCs slows down execution.

On a 64-bit PowerPC platform it's better to be able to run multiple 32-bit applications, each with up to 4GB of private address space, using the extra (>4GB) otherwise unused memory for caching disk data. This way you don't slow down execution, while still taking some advantage of the extra physical memory. I think that's how OS4 will "support" 64 bits in the future.
Quote:
Plus Virtual PC is using only one CPU...

What about VirtualBox? If it runs on a Mac OS X PowerPC machine, you can try it. Maybe it can use more cores.
Quote:
I was really curious to know how good real emulation is on this machine...

I don't understand. :|

cdimauro 
Re: Understanding CPU and integer performance
Posted on 23-Feb-2014 7:40:30
#35 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@matthey

Quote:

matthey wrote:
Quote:

cdimauro wrote:
Pay attention that x86_64 code is generally compiled for speed and not for size, as I've written in the article. In fact, it has A LOT of NOPs to properly align branch/call targets to 16-byte boundaries.


Almost all code is compiled for best speed.

Not always, and x86 vs. x64 shows it. The code generated for the former isn't aligned to 16-byte boundaries, while for the latter it is. It's strange, since both can run on exactly the same processor, which has the same decoder (so aligning jump targets to 16 bytes helps, by letting it decode more instructions from the same cache line and refill the - now empty - pipeline more quickly).
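For what it's worth, GCC exposes exactly this trade-off through its alignment flags; a sketch of the two configurations (flag names as documented in the GCC manual, prog.c a placeholder):

```shell
# Speed-oriented: pad with NOPs so branch/call targets start at a
# fresh 16-byte fetch block (what x86_64 compilers tend to default to):
gcc -O2 -falign-functions=16 -falign-jumps=16 -falign-loops=16 prog.c

# Size-oriented: drop the padding and optimize for code density instead:
gcc -Os prog.c
```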

Quote:

Quote:

Very good, as expected by 68K: about 3.3 bytes average instruction length.


That's with light FPU use mixed code. It's only -O1 in vbcc although vbcc doesn't improve that much with -O2 or higher. The vbcc 68k backend is very simple and below average in code generation quality but that means it's not doing a lot of CPU specific optimizations. Note that vbcc still generates average code because of vasm optimizations, good quality vclib static include libs, inlined code and some intelligent high level optimizations. My beta version of vbcc is less than a frame per second behind GCC 2.95.3 which probably generates the best quality integer 68k code of any 68k compiler.

Isn't it possible to port some optimizations (not code!) to VBCC?

Quote:

Quote:

Does the above "Code total bytes" already take these values into account?


No. The stats only count the existing disassembled code. This many more bytes would be saved and the average instruction length would drop. The MOV3Q savings would not be as great with the immediate compression though. I have a newer version of ADisS which shows partial MVS/MVZ (x86 MOVSX/MOVZX) savings which are small but only a fraction of the uses.

Very good. So the code density will improve, maybe going down to 3.2 bytes on average.

Quote:

Quote:

So no double indirect modes are used. But why do I see so many .w & .l addressing modes? Are they for the PEAs?


Vbcc doesn't generate any double indirect addressing modes that I know of. Some versions of GCC and SAS/C do (usually jump tables). Yes, the (xxx).w are usually used with PEA for pushing immediates on the stack. Most (xxx).l are RELOCs (relocatable addresses) that are filled in when the program starts. This is common for programs that are too big to compile with small data. Absolute addressing is actually one of the simplest addressing modes for the processor and it saves a register used for a pointer. The speed is good with adequate instruction fetch and ICache but it's not as good for code density.

But unfortunately it's not good for pure / relocatable code. Introducing a (d32,PC) mode could help in this direction, albeit code density wouldn't change.

Quote:

Quote:

It's very strange to see more -(An) than (An)+, but it should be due to the pushes and pops of values onto the stack.


I think many -(An) are for stack use. I thought the -(An) being more common was also strange, but other 68k programs are similar. I can't rule out my code having a bug, but looking at the disassembly I do see a lot of -(An).

Even on x86 (much less on x64) there are more pushes than pops. That usually happens because parameters are passed on the stack and then not popped out: the SP is simply adjusted to "clean up" the stack and discard them.

For me it's strange, because I supposed that a 68K ABI would prefer registers for passing parameters instead of the stack, since there are many of them.

Quote:

Quote:

I don't see any problem here. The 8086 had it from the beginning, allowing 8-bit immediates in 16-bit instructions (and in 32-bit instructions on the 386).


Aren't most x86 small immediates based on unsigned values?

No. Most of them are signed. Only some rare cases (INT, IN, OUT, ENTER, shifts; maybe some others which I don't remember now) are unsigned.

Quote:
The 68k is mostly signed, using sign extension. This takes a little more CPU power, but it gives small negative numbers, which is better IMO. Sign extension is much simpler than the shifting which early ARM did and is a good way to uncompress small immediates.

I agree.

Quote:

Quote:

Frankly speaking, I don't agree on enhancing an old FPU unit. It's better to spend resources on a new, modern, SIMD unit, which can handle both scalar and vector data, in a more uniform way.


I understand. Some would say not to implement the 68k FPU at all.

It should be, IMO, because it's widely used, at least on the Amiga in number-crunching applications, and for many of them there's no source available for a recompile.

Quote:
The 68k FPU is easy to use and flexible though. Branches and integer conversions are common. The SIMD will not be getting double fp support for a long time if ever and that is the default for C. The FPU is all that is necessary for light to moderate fp use by a compiler. Any resources for enhancements to the FPU are small compared to an SIMD. I am opposed to doubling the number of FPU registers as would be needed for heavy parallel work as that is SIMD territory.

I absolutely agree. It's better to leave the FPU as is, introducing very few changes if needed, since there's no future in the old FPU model for getting better performance.

Quote:
My focus for the FPU is giving compilers what they need, reducing the code for common compiler operations and opening up what is mostly free and easy performance enhancements. I'm testing and finishing up the new vclib c99 math support for vbcc. I have some good ideas from my work and the resource cost should be low. It's a completely different focus than performance maximizing the resource hungry SIMD.

It's fine. But FPGAs have enough resources now, even for a SIMD unit.

Quote:

Quote:

I don't know the MOV3Q opcode format, but if it's a 16-bit one it'll still have an advantage over the 32-bit longword compression (which uses an additional 16-bit immediate, I suppose).


MOV3Q is a 16 bit encoding and it would still provide a small code density benefit as well as help ColdFire compatibility. However, the encoding is in A-line which many 68k OSs used for trapping to function calls. MOV3Q seems out of place like an add-on to the 68k also. We can convert it as we won't be binary compatible to the ColdFire but rather highly source compatible.

Motorola did a good job of messing up the wonderful 68K ISA...

Personally I'm totally against using line-A to introduce new instructions, because it hurts binary compatibility, UNLESS you find a way to make it totally backward compatible.

When I discussed a new SIMD unit for the 68K, my plan was to use line-A (and line-F) to introduce the new opcodes in a compact and efficient way, but in a totally backward-compatible way (so, definitely NOT like Motorola did with ColdFire).

tlosm 
Re: Understanding CPU and integer performance
Posted on 23-Feb-2014 9:13:37
#36 ]
Elite Member
Joined: 28-Jul-2012
Posts: 2746
From: Amiga land

@cdimauro

AFAIK Apple introduced full support for 64 bits with Snow Leopard (10.6). I wrote an article about it: http://www.appuntidigitali.it/4913/apple-ri-scopre-i-64-bit/

Yes, but 10.8 became the real 64-bit OS, and for PPC there is only 10.5.8 :(
That's why I'm waiting for the ATI X1900, to have a well-performing Linux on my G5.

About VirtualBox:
It is only for Intel Macs, not for PPC.

About Virtual PC: it uses only one CPU out of 4... if I had a single-core G5 2.5GHz I would have the same performance.

Last edited by tlosm on 23-Feb-2014 at 09:14 AM.

_________________
I love Amiga and new hope by AmigaNG
A 500 + ; CDTV; CD32;
PowerMac G5 Quad 8GB,SSD,SSHD,7800gtx,Radeon R5 230 2GB;
MacBook Pro Retina I7 2.3ghz;
#nomorea-eoninmyhome

itix 
Re: Understanding CPU and integer performance
Posted on 23-Feb-2014 10:28:15
#37 ]
Elite Member
Joined: 22-Dec-2004
Posts: 3398
From: Freedom world

@cdimauro

Quote:

I'm not a PowerPC expert, but I suppose that they have a specific load instruction which is slower than the usual (aligned) one, right?


I recall reading something like that long ago, but I can't find anything like that in the PowerPC instruction set.

Certain special load/store instructions work only with word alignment, for example lwarx/stwcx., which are used to implement atomic operations. When atomic 8-bit/16-bit operations are needed, code must test alignment and implement paths for all possible alignment types.

_________________
Amiga Developer
Amiga 500, Efika, Mac Mini and PowerBook

itix 
Re: Understanding CPU and integer performance
Posted on 23-Feb-2014 10:32:32
#38 ]
Elite Member
Joined: 22-Dec-2004
Posts: 3398
From: Freedom world

@cdimauro

Quote:

Even on x86 (much less on x64) there are more pushes than pops. That usually happens because parameters are passed on the stack and then not popped out: the SP is simply adjusted to "clean up" the stack and discard them.

For me it's strange, because I supposed that a 68K ABI would prefer registers for passing parameters instead of the stack, since there are many of them.


On 68K, passing parameters on the stack is the norm, but compilers may support passing parameters in registers.

If the disassembled code is from an AmigaOS executable, it can be using SetAttrs() or DoMethod() calls, which are (sort of) variadic functions used a lot in modern UI programming. Parameters to those calls are almost always constructed on the stack.

_________________
Amiga Developer
Amiga 500, Efika, Mac Mini and PowerBook

KimmoK 
Re: Understanding CPU and integer performance
Posted on 23-Feb-2014 12:22:56
#39 ]
Elite Member
Joined: 14-Mar-2003
Posts: 5211
From: Ylikiiminki, Finland

@SIMDs

Looking at my AMD E-350, it does have:
MMX instructions
SSE / Streaming SIMD Extensions
SSE2 / Streaming SIMD Extensions 2
SSE3 / Streaming SIMD Extensions 3
SSSE3 / Supplemental Streaming SIMD Extensions 3
SSE4a

but not those new Intel SIMD instructions...

I imagine it's not simple to use SIMD on x86 if you want the SW to run on older CPUs as well, etc...?

_________________
- KimmoK
// For freedom, for honor, for AMIGA
//
// Thing that I should find more time for: CC64 - 64bit Community Computer?

cdimauro 
Re: Understanding CPU and integer performance
Posted on 2-Mar-2014 8:02:01
#40 ]
Elite Member
Joined: 29-Oct-2012
Posts: 3650
From: Germany

@tlosm

Quote:

tlosm wrote:
@cdimauro

AFAIK Apple introduced full support for 64 bits with Snow Leopard (10.6). I wrote an article about it: http://www.appuntidigitali.it/4913/apple-ri-scopre-i-64-bit/

Yes, but 10.8 became the real 64-bit OS,

Maybe with 10.8 Apple presented 64-bit versions of all its software, but for sure the ground-breaking release was 10.6.
Quote:
and for PPC there is only 10.5.8 :(

Apple cannot and doesn't want to support old hardware. Don't blame them.

Also, 64-bit PowerPC code runs slower on average, so it doesn't make much sense to support it.
Quote:
That's why I'm waiting for the ATI X1900, to have a well-performing Linux on my G5.

Have you run 32- and 64-bit Linux flavors on your G5 machine? Take a look at the performance difference.
Quote:
About VirtualBox:
It is only for Intel Macs, not for PPC.

But it's open source...
Quote:
About Virtual PC: it uses only one CPU out of 4... if I had a single-core G5 2.5GHz I would have the same performance.

Understood. And here it's unlikely that you can convince Microsoft...


Copyright (C) 2000 - 2019 Amigaworld.net.
Amigaworld.net was originally founded by David Doyle