AmigaNoob
J-core in Embedded | Posted on 2-Nov-2023 19:58:18 | [ #1 ]
Member | Joined: 14-Oct-2021 | Posts: 14 | From: Unknown
Status: Offline
cdimauro
Re: J-core in Embedded | Posted on 3-Nov-2023 5:21:18 | [ #2 ]
Elite Member | Joined: 29-Oct-2012 | Posts: 3313 | From: Germany

@AmigaNoob
Quote:
J-Core looks promising, but I don't think that it can compete with the 68k or something similar in terms of code density and executed instructions.
They are trying to compete with RISC-V, which now has a HUGE ecosystem. It's very very difficult to beat it.
The very good thing of J-Core is that it requires 43k gates, which is really small.
With its good code density it can have some chance in the embedded market, if someone invests in it. But against RISC-V, which needs no licenses, it's difficult.
It should offer more value: better code density and at least a comparable number of executed instructions.
Status: Offline
matthey
Re: J-core in Embedded | Posted on 3-Nov-2023 23:02:26 | [ #3 ]
Super Member | Joined: 14-Mar-2007 | Posts: 1852 | From: Kansas

AmigaNoob Quote:
Jeff Dionne started assembly programming on the 68k. He mentions the 68k first among the architectures considered before choosing SuperH/J-core for revival. He thought the 68k hardware was too complex and had some bad data on code density, leading him to believe SuperH had better code density than the 68k, which I informed him was false. He was already committed to SuperH though. The SuperH ISA allows for small and simple cores but I don't expect performance to scale up.
Open hardware 68k development is more likely to complement the J-core project than to compete against it. SuperH is a RISC ISA based on the 68k ISA, which makes code translation between the ISAs easier. They are both natively big endian, a trait that has practically disappeared elsewhere but improves compatibility between them. While SuperH has performance obstacles, a high enough clocked J-Core CPU core on semi-modern silicon would still outperform an old 68060 on old silicon even though the 68060 has better performance per MHz. A scalar non-pipelined TG68 on semi-modern silicon would likely outperform a 68060 too, though.
cdimauro Quote:
J-Core looks promising, but I don't think that it can compete with the 68k or something similar in terms of code density and executed instructions.
|
SuperH code density is good. I have SuperH documentation that shows SH-2 code is more dense than 68000 code. It's clever that they compared to 68000 code instead of 68020/CPU32 code. Compile the 68k without omitting frame pointers and without small data as well, and I can believe the propaganda. I have SuperH code density beating RV32IMC and even x86 by a little bit while it is well behind the 68020/CPU32 and Thumb encodings.
cdimauro Quote:
They are trying to compete with RISC-V, which now has a HUGE ecosystem. It's very very difficult to beat it.
|
You mean it's difficult to beat the RISC-V promotion and propaganda? Other than SuperH having a small advantage in code density, RV32IMC has better performance metrics despite being weak.
cdimauro Quote:
The very good thing of J-Core is that it requires 43k gates, which is really small.
|
It's nothing special. A 68000 may have fewer gates but the J-core has a 5-stage pipeline giving it a performance advantage. Add a 5-stage pipeline to a 68k CPU core and a RISC core can have a cache instead. Add a cache to the 68k core and it catches back up to RISC cores, although not by as much given the good code density of J-core. The big problem is all the dependent instructions J-core has to execute.
cdimauro Quote:
With its good code density it can have some chance in the embedded market, if someone invests in it. But against RISC-V, which needs no licenses, it's difficult.
It should offer more value: better code density and at least a comparable number of executed instructions. |
SuperH/J-core needs a complete remapping into a variable length encoding to be competitive. A 16 bit fixed length instruction leaves too few bits for encoding immediates and displacements resulting in many dependent instructions to execute or increased memory traffic from putting immediates/constants in memory. The latter is often worse for performance with load-to-use penalties but it likely happens (ColdFire docs call for floating point immediates/constants to be loaded from memory to limit instruction size too). Compressed RISC encodings typically increase instruction counts but where Thumb2 has about 20% more instructions to execute than the 68k, SuperH/J-core is closer to 50% more instructions to execute. That is a huge performance deficit to overcome. ARM upgraded the performance of Thumb2 to AArch64 on all but the smallest cores and Thumb2 had significantly better code density and fewer instructions to execute vs SuperH/J-core.
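As a rough back-of-envelope illustration of why those ratios matter (the +20% and +50% figures are the ones above; the baseline instruction count and the assumption of equal IPC are hypothetical, purely for illustration):

# Back-of-envelope only: how the dynamic instruction count ratios above
# translate into cycles if, hypothetically, every ISA sustained the same IPC.
BASELINE = 1_000_000                 # hypothetical 68k dynamic instruction count
RATIOS = {"68k": 1.0, "Thumb2": 1.2, "SuperH/J-core": 1.5}
IPC = 1.0                            # assumed equal for comparison only; real cores differ

for isa, ratio in RATIOS.items():
    cycles = BASELINE * ratio / IPC
    print(f"{isa:14s} {cycles:12,.0f} cycles ({(ratio - 1) * 100:+.0f}%)")

Stalls, issue width and load-to-use penalties move the numbers further, but the instruction count gap alone already sets a floor on the deficit.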
SH-4 is the first SuperH superscalar CPU core introduced in 1998 with an 8kiB instruction and 16kiB data cache. Preliminary Hitachi docs show 1.5 DMIPS/MHz using a 350nm process which is surprising considering the performance metrics and the 2 cycle load-to-use penalty.
SH-4 CPU Core Architecture Quote:
If an executing instruction locks any resource, i.e. a function block that performs a basic operation, a following instruction that happens to attempt to use the locked resource must be stalled (Figure 42 (h)). This kind of stall can be compensated by inserting one or more instructions independent of the locked resource to separate the interfering instructions. For example, when a load instruction and an ADD instruction that references the loaded value are consecutive, the 2-cycle stall of the ADD is eliminated by inserting three instructions without dependency. Software performance can be improved by such instruction scheduling.
|
Let's compare SH-4 to the 68060.
SuperH:     mov.l @(var,r3),r4
            bubble
            bubble
            bubble
            add r4,r5

68060RISC:  move.l (var,a3),d4
            add.l d4,d5

68060CISC:  add.l (var,a3),d5
The 68060CISC code is the same code size as the SuperH equivalent. The 68060 is able to execute both the RISC compiler code and the CISC compiler code in 1 cycle while the SH-4 core is executing bubbles, granted the shorter SH-4 pipeline has a smaller load-to-use penalty than the ARM Cortex-A53. With 50% more instructions to execute, load-to-use penalties in memory and only 1 simple integer unit and 1 load/store unit for SH-4 vs 68060 with 2 simple integer units and 2 AGU/mem units, either the DMIPS/MHz of the 68060 is low or the SH-4 DMIPS/MHz is high. Hitachi had some strong marketing but they weren't very conservative. The SH-4 manual is very good though.
Last edited by matthey on 04-Nov-2023 at 02:09 PM.
Status: Offline

cdimauro
Re: J-core in Embedded | Posted on 4-Nov-2023 8:03:51 | [ #4 ]
Elite Member | Joined: 29-Oct-2012 | Posts: 3313 | From: Germany

@matthey
Quote:
matthey wrote: AmigaNoob Quote:
Jeff Dionne started assembly programming on the 68k. He mentions the 68k first among the architectures considered before choosing SuperH/J-core for revival. He thought the 68k hardware was too complex and had some bad data on code density, leading him to believe SuperH had better code density than the 68k, which I informed him was false. |
Interesting. Any source for this besides the Weaver's paper? Quote:
He was already committed to SuperH though. The SuperH ISA allows for small and simple cores but I don't expect performance to scale up. |
Then it's a dead end. Quote:
cdimauro Quote:
J-Core looks promising, but I don't think that it can compete with the 68k or something similar in terms of code density and executed instructions.
|
SuperH code density is good. I have SuperH documentation that shows SH-2 code is more dense than 68000 code. It's clever that they compared to 68000 code instead of 68020/CPU32 code. Compile the 68k without omitting frame pointers and without small data as well, and I can believe the propaganda. |
Looks like. Quote:
I have SuperH code density beating RV32IMC and even x86 by a little bit while it is well behind the 68020/CPU32 and Thumb encodings. |
Hum. x86 shouldn't be that bad. Thumb-2 and 68k should be better, anyway (leaving aside BA2). Quote:
cdimauro Quote:
They are trying to compete with RISC-V, which now has a HUGE ecosystem. It's very very difficult to beat it.
|
You mean it's difficult to beat the RISC-V promotion and propaganda? |
Exactly! Quote:
Other than SuperH having a small advantage in code density, RV32IMC has better performance metrics despite being weak. |
That's bad for SuperH, then. I don't see how it can compete. It should have at least a strong advantage in code density to "offer something". Quote:
cdimauro Quote:
The very good thing of J-Core is that it requires 43k gates, which is really small.
|
It's nothing special. A 68000 may have fewer gates |
How? The 68000 ISA is more complicated and the decoder isn't trivial and requires more transistors. Quote:
but the J-core has a 5-stage pipeline giving it a performance advantage. Add a 5-stage pipeline to a 68k CPU core and a RISC core can have a cache instead. Add a cache to the 68k core and it catches back up to RISC cores, although not by as much given the good code density of J-core. The big problem is all the dependent instructions J-core has to execute. |
Apollo's 68080 has already shown what a modernized 68k could do.
However I don't know how many resources its implementation takes compared to a J-core. Quote:
cdimauro Quote:
With its good code density it can have some chance in the embedded market, if someone invests in it. But against RISC-V, which needs no licenses, it's difficult.
It should offer more value: better code density and at least a comparable number of executed instructions. |
SuperH/J-core needs a complete remapping into a variable length encoding to be competitive. |
But this means that you have a different architecture and it requires more gates to be implemented. Quote:
A 16 bit fixed length instruction leaves too few bits for encoding immediates and displacements resulting in many dependent instructions to execute or increased memory traffic from putting immediates/constants in memory. The latter is often worse for performance with load-to-use penalties but it likely happens (ColdFire docs call for floating point immediates/constants to be loaded from memory to limit instruction size too). |
That's a great advantage that the 68k family has over x86: FP immediates.
The only problem (if it's a problem) is that instructions might be too long (e.g.: 12 bytes for doubles and 16 for extended precision). Quote:
Compressed RISC encodings typically increase instruction counts but where Thumb2 has about 20% more instructions to execute than the 68k, SuperH/J-core is closer to 50% more instructions to execute. That is a huge performance deficit to overcome. ARM upgraded the performance of Thumb2 to AArch64 on all but the smallest cores and Thumb2 had significantly better code density and fewer instructions to execute vs SuperH/J-core. |
Thumb-2 also requires a small amount of transistors (I mean: a Cortex-M which only implements this ISA and not the ARM32).
I don't see why people should use J-Core, with all those disadvantages. Quote:
SH-4 is the first SuperH superscalar CPU core introduced in 1998 with an 8kiB instruction and 16kiB data cache. Preliminary Hitachi docs show 1.5 DMIPS/MHz using a 350nm process which is surprising considering the performance metrics and the 2 cycle load-to-use penalty.
SH-4 CPU Core Architecture Quote:
If an executing instruction locks any resource, i.e. a function block that performs a basic operation, a following instruction that happens to attempt to use the locked resource must be stalled (Figure 42 (h)). This kind of stall can be compensated by inserting one or more instructions independent of the locked resource to separate the interfering instructions. For example, when a load instruction and an ADD instruction that references the loaded value are consecutive, the 2-cycle stall of the ADD is eliminated by inserting three instructions without dependency. Software performance can be improved by such instruction scheduling.
|
Let's compare SH-4 to the 68060.
SuperH:     mov.l @(var,r3),r4
            bubble
            bubble
            bubble
            add r4,r5

68060RISC:  move.l (var,a3),d4
            add.l d4,d5

68060CISC:  add.l (var,a3),d5
The 68060CISC code is the same code size as the SuperH equivalent. The 68060 is able to execute both the RISC compiler code and the CISC compiler code in 1 cycle while the SH-4 core is executing bubbles, granted the shorter SH-4 pipeline has a smaller load-to-use penalty than the ARM Cortex-A53. With 50% more instructions to execute, load-to-use penalties in memory and only 1 simple integer unit and 1 load/store unit for SH-4 vs 68060 with 2 simple integer units and 2 mem units, either the DMIPS/MHz of the 68060 is low or the SH-4 DMIPS/MHz is high. Hitachi had some strong marketing but they weren't very conservative. The SH-4 manual is very good though.
|
I agree: it looks like Hitachi's marketing.
68060 is on another level and transistor-wise might be similar to SH4 (because the latter is using a lot of L1 cache). |
Status: Offline

matthey
Re: J-core in Embedded | Posted on 6-Nov-2023 0:28:16 | [ #5 ]
Super Member | Joined: 14-Mar-2007 | Posts: 1852 | From: Kansas

cdimauro Quote:
Interesting. Any source for this besides the Weaver's paper?
|
It is mostly old papers that would show SuperH and 68k code density. Also, SuperH is off the radar of most English papers as it was much more popular in Asia for embedded and console use. I can give a little bit of an idea to back up the updated Weaver code density results though. There is a paper called "High Performance Extendable Instruction Set Computing" which gives the following compiled code density relative sizes.
PowerPC      1.92
ARM-7        1.64
ColdFire     1.43
I80386       1.39
SH-3         1.38
68000        1.35
68020/CPU32  1.32
Thumb        1.13
The compressed RISC EISC ISA is the reference and has the best code density of course. At least relatively, the above is about the right order. The ColdFire chip mentioned is ISA_A which has disappointing code density despite claims of "Best-in-Class Code Density" by Motorola. It was ISA_B that added back 68k functionality and new instructions to improve code density like MOV3Q, MVS, MVZ, Bcc.L, MOVE.(B|W) #data,mem, CMP.(B|W), etc. ARM Thumb is too far ahead of the 68020 in the list above, which is backed up by another, more credible paper called "SPARC16: A new compression approach for the SPARC architecture". It has the 68k with the best code density by geometric mean of benchmarks and Thumb with the best code density by arithmetic mean. There is no SuperH, unfortunately, but it has a more modern ColdFire cfv4e, which I believe is ISA_C; it has the 3rd best code density, is nearly on par with the 68k, and is well ahead of the i686 in 4th place. Of course MOV3Q, MVS and MVZ would improve 68k code density as well, although I'm not a fan of MOV3Q, preferring a more universal and wider-range method of immediate compression for code. MVS and MVZ are good for a 68k32 ISA, improving performance (avoiding partial register writes and allowing bypass/forwarding), code density and ColdFire compatibility, but it is possible to define other behavior and do away with them, saving encoding space for a 68k64 mode.
Last and least there is the following website from a RISC-V fan.
Code Density Compared Between Way Too Many Instruction Sets https://portal.mozz.us/gemini/arcanesciences.com/gemlog/22-07-28/
PPC32          -  984,396 bytes  -  476FP target
SH-4A LE       -  842,884 bytes  -  No target specific settings
ARM64 LE       -  779,936 bytes  -  Cortex-A76 target, FP-ARMv8
x86_64         -  747,224 bytes  -  Haswell target
RV64GC         -  741,856 bytes  -  ilp64d ABI
RV32GC         -  719,916 bytes  -  ilp32d ABI
x86            -  713,916 bytes  -  i686 target
m68k           -  698,776 bytes  -  M68040 target
ARM Thumb2 LE  -  599,248 bytes  -  Thumb2, VFPv4-D16, Cortex-A7 target
The M68040 target gives the worst code density of any 680x0 target (M68020 target usually gives the best code density). If the stated compiler options along with -Os are the only ones given, then frame pointers are conveniently not omitted which reduces the code density of some old CISC targets like the 68k and x86. The SH-4A LE target has poor code density here which could be explained by the choice of "LE" if the bi-endian support is not good and/or SuperH having trouble addressing and branching in the large executable with so few displacement bits. It is interesting to see the effect of a large executable on code density but this smells of more RISC and RISC-V bias even though ARM looks more like the winner here. A newer GCC compiler version was likely used which better supports current and more popular architectures while code generation for older architectures like the 68k has declined, perhaps including SuperH as well.
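To make the relative code density easier to read, here is a small sketch that simply normalizes the byte counts quoted above to the smallest result (Thumb2); the sizes are that site's measurements reproduced as listed, not mine:

# Normalize the executable sizes quoted above to the smallest (Thumb2) so the
# relative code density is easier to read.
sizes = {
    "PPC32": 984_396, "SH-4A LE": 842_884, "ARM64 LE": 779_936,
    "x86_64": 747_224, "RV64GC": 741_856, "RV32GC": 719_916,
    "x86": 713_916, "m68k": 698_776, "ARM Thumb2 LE": 599_248,
}
best = min(sizes.values())
for isa, size in sorted(sizes.items(), key=lambda kv: kv[1]):
    print(f"{isa:14s} {size:9,d} bytes  {size / best:.2f}x Thumb2")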
cdimauro Quote:
SuperH/J-core may be a dead end in that it is unlikely to scale up in performance to challenge AArch64 or x86-64, but RISC-V has the same problem. They can at least scale lower and they are license and royalty free which keeps them viable ISAs. The BA2 ISA can likely scale lower and is at least royalty free. An enhanced 68k ISA may be able to surpass Thumb2 and BA2 code density but a 68k core is unlikely to scale as small. An enhanced 68k ISA and CPU has more potential to scale up in performance than any of these RISC ISAs. The 68k can scale down much further than x86(-64) as demonstrated by its embedded market dominance at one time. The fat x86(-64) ISA and cores even struggled to scale down to a low enough power and area mid-market superscalar in-order core with the original Pentium and later Atom, where the 68060 demonstrated significantly superior PPA.
SuperH was successful enough to be number 1 in the 32 bit CPU/MCU embedded market after Motorola forced the conversion of 68k developers to fat PPC which didn't scale as low and ColdFire which was a bastardized truncated 68k to scale only a little lower than the original 68000 (CPU32 was better and the CF enhancements could have been added while retaining good 68k compatibility). Where Motorola created a disjointed overlapping embedded lineup of 68000, 680x0, CPU32, ColdFire and PPC CPU/MCUs, Hitachi had a nicely featured and complete lineup of SuperH embedded cores that scaled up to mid performance which they advertise as "Pentium Class CPU/FPU Performance".
https://www.renesas.com/us/en/document/fly/superh-platform-brochure
The best integer performance on the brochure is 2.47 DMIPS/MHz which I believe is better than any ARM core up to 2010. I don't see anything about OoO cores so these appear to be superscalar in-order cores where 2.47 DMIPS/MHz is difficult to believe. The lower clocked MCU has better DMIPS/MHz so this result may be from on-chip SRAM while their "Pentium Class" cores using DDR2 are a more believable 1.8 DMIPS/MHz. It's not a bad result considering how many instructions have to be executed.
cdimauro Quote:
How? The 68000 ISA is more complicated and the decoder isn't trivial and requires more transistors.
|
The 68000 ISA is small and fairly simple. There are only 56 instructions which is less than many so called RISC ISAs today.
Architecture | Number of Instructions
RV32I        | 47 (RISC-V base)
68000        | 56
Cortex-M0(+) | 56 (subset of Thumb1 and Thumb2)
RV32IM       | 57 (RISC-V base+mul/div extension, similar features to 68000 but inferior code density)
MCF5202      | 67 (ColdFire ISA_A)
RV32IMC      | 69 (RISC-V base+mul/div+compressed extensions, so features & code density more like 68000)
68020        | 101 (CALLM & RTM dropped for 68030+; TAS, CAS & CAS2 illegal on Amiga so 96 for Amiga)
Cortex-M3    | 115 (full Thumb1 and Thumb2)
SH-4         | 208
PowerPC      | 222 (standard with FPU but not counting SIMD unit or embedded extensions)
SH-5         | 417
AArch64      | ~1300 (https://www.reddit.com/r/arm/comments/mod62/how_many_instructions_in_armv8/)
x86          | 1,503
x86-64       | 3,684
The 68000 has more memory addressing modes than most RISC ISAs with 9 but 3 are small variations of others. Some would claim CISC like reg-mem accesses are too expensive but the cut down ColdFire ISA_A kept them. Decoding for the 68000 is easiest of all 68k ISAs even though the 68000 design used uCode which Hitachi engineers copied from the 68000 for RISC SuperH (see page 12 at the following link and look at the decoder stage).
Turtles all the Way Down: Running Linux on Open Hardware https://j-core.org/talks/japan-2015.pdf
cdimauro Quote:
Apollo's 68080 has already shown what a modernized 68k could do.
However I don't know how many resources its implementation takes compared to a J-core.
|
The Apollo core goal is high performance in a FPGA and not a small size. Gunnar was constantly experimenting with the number of pipeline stages but expect more than the 8 stages of the 68060 in order to increase the clock speed in the FPGA in a similar way more shorter stages can increase clock speeds in an ASIC at the cost of more transistors and requiring better branch prediction.
One of the goals of the J-Core project is small low cost CPU cores (see page 16 of the last link above).
https://j-core.org/talks/japan-2015.pdf Quote:
So, how do you use it for anything?
o Releasing VHDL and build system under BSD license
o Make any chip you want - Royalty free
o 180nm ASIC of SOC we're demoing costs less than 10 yen - Processor only, about 2 and a half cents
o Disposable computing at "free toy inside" level
o Think IoT : ‘Trillion Sensor Network’ economics, but running Linux
|
A $0.03 USD CPU core is a small scale core potentially for extreme volume "trillion sensor" production. This is low enough cost to go on disposable items which has potential for many products.
cdimauro Quote:
That's a great advantage that the 68k family has over x86: FP immediates.
The only problem (if it's a problem) is that instructions might be too long (e.g.: 12 bytes for doubles and 16 for extended precision).
|
Immediates in the more predictable instruction stream is a good thing. Yes, large instructions from large FP immediates could be challenging for small core designs. Many FP immediates can be exactly represented in smaller precision forms providing FP immediate compression though. I suggested and Frank Wille coded the optimization for vasm which is on by default for 68k FPU compiled vbcc code. I was surprised to see that all FP immediates in the vbcc executable were compressed from double to single FP precision (If half precision FP was supported, there would be more compression). GCC does not have this optimization for the 68k. This optimization along with my FPU support code for vbcc had vbcc easily outperforming GCC in the ByteMark FP benchmark for the 68060.
cdimauro Quote:
Thumb-2 also requires a small amount of transistors (I mean: a Cortex-M which only implements this ISA and not the ARM32).
I don't see why people should use J-Core, with all those disadvantages.
|
The Cortex-M0(+) is small as far as instructions and has better performance metrics than SuperH. The ARM licensing and royalty fees are low for these small cores too. License and royalty free are still appealing though. ARM usually charges $1-$10 million USD for a core license and 1.5% plus royalty with small cores being at the low end of this. Royalty free especially is a competitive advantage. Larger AArch64 Cortex-A cores are 3% plus royalty fees and increase with the number of cores, custom IP blocks, GPU, etc. The a la carte support is easy and nice but has a cost. J-core and RISC-V open hardware is very appealing too, especially as the design process gets easier and more open hardware IP blocks become available. It's too bad he chose SuperH instead of 68k which he prefers to program for but it likely would have been at least somewhat more difficult.
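To make the fee structure concrete, here is a rough sketch of the per-chip cost implied by those figures; the license range and royalty percentages are the ones quoted above, while the shipment volume and chip price are hypothetical inputs of mine:

# Rough per-chip cost of an ARM core license: amortized license fee plus a
# royalty taken as a percentage of the chip's selling price.
def per_chip(license_fee, royalty_rate, chip_price, volume):
    return license_fee / volume + royalty_rate * chip_price

VOLUME = 10_000_000        # assumed units shipped over the license lifetime
CHIP_PRICE = 2.00          # assumed average selling price of the chip in USD

small = per_chip(1_000_000, 0.015, CHIP_PRICE, VOLUME)   # small Cortex-M class core
large = per_chip(10_000_000, 0.03, CHIP_PRICE, VOLUME)   # larger Cortex-A class core
print(f"small core: ${small:.3f}/chip, large core: ${large:.3f}/chip")

At high volumes the royalty percentage dominates, which is what makes license and royalty free cores so attractive for disposable-cost products.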
cdimauro Quote:
I agree: it looks like Hitachi's marketing.
68060 is on another level and transistor-wise might be similar to SH4 (because the latter is using a lot of L1 cache). |
SH-4 had larger L1 and even received L2 caches as it made it to newer chip processes than the 68060. The cache SRAM transistors quickly outnumber the CPU core logic transistors which is a good reason to have good code density. The SH-4 cores were used on a more complete SoC where the 68060 was only a simple, at least by today's standards, CPU.
Last edited by matthey on 07-Nov-2023 at 04:48 PM.
Status: Offline

Hypex
Re: J-core in Embedded | Posted on 7-Nov-2023 15:15:07 | [ #6 ]
Elite Member | Joined: 6-May-2007 | Posts: 11054 | From: Greensborough, Australia

@AmigaNoob
Well, if you are targeting a RISC CPU for competition, then using a ColdFire makes more sense. The ColdFire is actually like a 68K RISC. They cut instructions out to streamline it to be more like RISC. So they removed some addressing modes and features.
Source: https://microapl.com/Porting/ColdFire/cf_68k_diffs.html
Opinion: It doesn't look like it worked out to be any better than the full CISC they had. Motorola already had the 88K to take over from the 68K which wasn't a crippled attempt to convert 68K into RISC. But instead they replaced it with PPC which shared no family line. All along Intel souped up the x86 and went the opposite direction. Proving you didn't need to cripple or replace a CPU core in order to go faster. But can build on and complicate a CPU core even further to become more powerful. Now Motorola RISC and PPC combined is even smaller than RISC today. Last edited by Hypex on 07-Nov-2023 at 03:17 PM.
Status: Offline

matthey
Re: J-core in Embedded | Posted on 8-Nov-2023 6:26:59 | [ #7 ]
Super Member | Joined: 14-Mar-2007 | Posts: 1852 | From: Kansas

Hypex Quote:
Well, if you are targeting a RISC CPU for competition, then using a ColdFire makes more sense. The ColdFire is actually like a 68K RISC. They cut instructions out to streamline it to be more like RISC. So they removed some addressing modes and features.
Source: https://microapl.com/Porting/ColdFire/cf_68k_diffs.html
|
I would say SuperH is "like a 68k RISC" while ColdFire is like a stripped down bastardized 68k CISC ISA. Don't let the ColdFire "Variable-Length RISC" marketing fool you. It is reg-mem and not load/store. ColdFire can scale smaller than the 68k by trimming some lesser used instructions and limiting the instruction size to 6 bytes while losing performance, orthogonality and 68k compatibility. It really doesn't scale much smaller than the original 68000 ISA and is not significantly simpler. ColdFire is a 68000 ISA with a few instructions and features cut and a few features borrowed from the 68020 ISA. After adding back the 68k features that had been cut too deeply and adding MVS, MVZ and MOV3Q, it was usable and even efficient for a minimalist ISA for a small core. ColdFire was trying to scale down to compete with SuperH and then its licensed successor ARM Thumb. Then ARM replaced most of its embedded cores with ARMv8/AArch64 which has many times the number of instructions of the 68060 and many similar CISC like complex addressing modes. AArch64 only lacks reg-mem accesses instead of load/store and a variable length RISC encoding yet these features were kept on the cut down ColdFire while the number of instructions and complex addressing modes were cut. Is the 68k bad because it has too many CISC features even though AArch64 "RISC" has half the features and the "Variable-Length RISC" architecture of ColdFire has the other half?
Hypex Quote:
Opinion: It doesn't look like it worked out to be any better than the full CISC they had. Motorola already had the 88K to take over from the 68K which wasn't a crippled attempt to convert 68K into RISC. But instead they replaced it with PPC which shared no family line. All along Intel souped up the x86 and went the opposite direction. Proving you didn't need to cripple or replace a CPU core in order to go faster. But can build on and complicate a CPU core even further to become more powerful. Now Motorola RISC and PPC combined is even smaller than RISC today. |
PPC was considered robust and not very RISC like when it came out. PPC has 222 standard instructions while I only count 66 instructions for the 88k of which 16 are FPU and 9 are SIMD instructions. The standard PPC ISA doesn't include SIMD instructions so it is more like PPC with 222 CPU+FPU instructions vs 88k with 57 CPU+FPU instructions. The 88k has a few more addressing modes than most RISC ISAs including a complex "register indirect with scaled index" addressing mode like the 68k, x86 and AArch64. PPC has more addressing modes than average for most early RISC ISAs too. The 88k looks more assembler friendly than PPC and many of the instructions resemble 68k instructions. Even today, something can be said of a simple and elegant ISA. PPC doesn't have it. AArch64 is better but too many instructions. What the 68k lacks in simplicity it gains in elegance. Actually, it is simple from the programmer perspective but moderately complex in hardware.
Status: Offline

Hypex
Re: J-core in Embedded | Posted on 10-Nov-2023 15:09:02 | [ #8 ]
Elite Member | Joined: 6-May-2007 | Posts: 11054 | From: Greensborough, Australia

@matthey
Quote:
I would say SuperH is "like a 68k RISC" while ColdFire is like a stripped down bastardized 68k CISC ISA. Don't let the ColdFire "Variable-Length RISC" marketing fool you. It is reg-mem and not load/store. |
The SuperH also looks like it could have been a possible 68K replacement. With some benefits it looks to have over ARM. I do wonder if perhaps Phase 5 choosing the Coldfire instead of copying the Mac with PPC would have been a better choice?
Quote:
Is the 68k bad because it has too many CISC features even though AArch64 "RISC" has half the features and the "Variable-Length RISC" architecture of ColdFire has the other half? |
I wouldn't say so. The 68K is designed to be CISC. Perhaps a CPU with the AArch64 and CF-VLR features making up a whole CPU would be better. Or it might end up as some hybrid monster CPU like x86 has become.
Quote:
PPC was considered robust and not very RISC like when it came out. PPC has 222 standard instructions while I only count 66 instructions for the 88k of which 16 are FPU and 9 are SIMD instructions. |
I considered it annoying when I learnt about the instructions. Even now I avoid PPC ASM. It's ok for small routines but I couldn't imagine writing a program in it like with 68K. But I didn't know 88K already had SIMD. It took years before SIMD made it into desktop PPC.
Quote:
PPC has more addressing modes than average for most early RISC ISAs too. |
They would seem clouded by the load/store standard it adheres to. Even the load/store update, similar to (Ax)+ and -(Ax), is quirky, with compiler code loading an address as one less. But what seems most crippled by the enforced RISC everything-in-a-register convention is function calls, since any address, be it direct or loaded from memory, must be loaded into a register, then moved to a call register, then finally called.
Quote:
The 88k looks more assembler friendly than PPC and many of the instructions resemble 68k instructions |
Anything could look more friendly than PPC. But resembling 68K would be a bonus since it was meant to follow it as the next generation. |
Status: Offline

matthey
Re: J-core in Embedded | Posted on 11-Nov-2023 6:00:05 | [ #9 ]
Super Member | Joined: 14-Mar-2007 | Posts: 1852 | From: Kansas

Hypex Quote:
The SuperH also looks like it could have been a possible 68K replacement. With some benefits it looks to have over ARM. I do wonder if perhaps Phase 5 choosing the Coldfire instead of copying the Mac with PPC would have been a better choice?
|
ColdFire wouldn't have been any better. It was stripped down and its performance weakened to scale down smaller for the embedded market, which was fine, but then they tried to increase the performance to scale back up. Why execute more weak instructions instead of going back to the 68k and executing fewer and more powerful instructions? Motorola/Freescale wouldn't allow that as it would have competed with PPC. ColdFire permanently and deliberately had its performance clipped for the low end embedded market.
Hypex Quote:
I wouldn't say so. The 68K is designed to be CISC. Perhaps a CPU with the AArch64 and CF-VLR features making up a whole CPU would be better. Or it might end up as some hybrid monster CPU like x86 has become.
|
At least the 68k is CISC and good at being CISC. ColdFire is CISC and not good at being CISC or RISC.
Hypex Quote:
I considered it annoying when I learnt about the instructions. Even now I avoid PPC ASM. It's ok for small routines but I couldn't imagine writing a program in it like with 68K. But I didn't know 88K already had SIMD. It took years before SIMD made it into desktop PPC.
|
PPC standards were conservative and slow to change. PA-RISC and the 88k added SIMD instructions early but they were minimal implementations. Basic integer SIMD support is very cheap. The PA-RISC SIMD support used 0.1% - 0.2% of the silicon area in early PA-RISC CPUs with no cycle time impact.
Hypex Quote:
They would seem clouded by the load/store standard it adheres to. Even the load/store update, similar to (Ax)+ and -(Ax), is quirky, with compiler code loading an address as one less. But what seems most crippled by the enforced RISC everything-in-a-register convention is function calls, since any address, be it direct or loaded from memory, must be loaded into a register, then moved to a call register, then finally called.
|
The link register isn't so bad. Accessing memory is just a pain with RISC.
Hypex Quote:
Anything could look more friendly than PPC. But resembling 68K would be a bonus since it was meant to follow it as the next generation. |
Perhaps 68k fans would have been more willing to adopt the 88k than PPC. Instructions and addressing modes were more similar, the ISA is simpler and the assembler was more readable. I think I would have preferred a variable length encoded SuperH for a load/store architecture though.
Status: Offline

cdimauro
Re: J-core in Embedded | Posted on 11-Nov-2023 12:35:07 | [ #10 ]
Elite Member | Joined: 29-Oct-2012 | Posts: 3313 | From: Germany

@matthey
Quote:
matthey wrote: cdimauro Quote:
Interesting. Any source for this besides the Weaver's paper?
|
It is mostly old papers that would show SuperH and 68k code density. Also, SuperH is off the radar of most English papers as it was much more popular in Asia for embedded and console use. I can give a little bit of an idea to back up the updated Weaver code density results though. There is a paper called "High Performance Extendable Instruction Set Computing" which gives the following compiled code density relative sizes.
PowerPC      1.92
ARM-7        1.64
ColdFire     1.43
I80386       1.39
SH-3         1.38
68000        1.35
68020/CPU32  1.32
Thumb        1.13
The compressed RISC EISC ISA is the reference and has the best code density of course. |
With good reasons due to the design choices, but it had no success and I assume that the primary reason is related to the dependency chains on the "extended" instructions mechanism that they've implemented for allowing the construction of bigger immediates and offsets. Quote:
At least relatively, the above is about the right order. The ColdFire chip mentioned is ISA_A which has disappointing code density despite claims of "Best-in-Class Code Density" by Motorola. It was ISA_B that added back 68k functionality and new instructions to improve code density like MOV3Q, MVS, MVZ, Bcc.L, MOVE.(B|W) #data,mem, CMP.(B|W), etc. ARM Thumb is too far ahead of the 68020 in the list above, which is backed up by another, more credible paper called "SPARC16: A new compression approach for the SPARC architecture". It has the 68k with the best code density by geometric mean of benchmarks and Thumb with the best code density by arithmetic mean. There is no SuperH, unfortunately, but it has a more modern ColdFire cfv4e, which I believe is ISA_C; it has the 3rd best code density, is nearly on par with the 68k, and is well ahead of the i686 in 4th place. Of course MOV3Q, MVS and MVZ would improve 68k code density as well, although I'm not a fan of MOV3Q, preferring a more universal and wider-range method of immediate compression for code. MVS and MVZ are good for a 68k32 ISA, improving performance (avoiding partial register writes and allowing bypass/forwarding), code density and ColdFire compatibility, but it is possible to define other behavior and do away with them, saving encoding space for a 68k64 mode. |
Thanks for the sources and the analysis.
BTW, adding MOV3Q isn't mutually exclusive with adding a more general method for compressed immediates: a 68k ISA can have both and gain their advantages (MOV3Q is very very compact, albeit the range is super limited. However those small constants are also the most common ones). Quote:
Last and least there is the following website from a RISC-V fan.
Code Density Compared Between Way Too Many Instruction Sets https://portal.mozz.us/gemini/arcanesciences.com/gemlog/22-07-28/
PPC32          -  984,396 bytes  -  476FP target
SH-4A LE       -  842,884 bytes  -  No target specific settings
ARM64 LE       -  779,936 bytes  -  Cortex-A76 target, FP-ARMv8
x86_64         -  747,224 bytes  -  Haswell target
RV64GC         -  741,856 bytes  -  ilp64d ABI
RV32GC         -  719,916 bytes  -  ilp32d ABI
x86            -  713,916 bytes  -  i686 target
m68k           -  698,776 bytes  -  M68040 target
ARM Thumb2 LE  -  599,248 bytes  -  Thumb2, VFPv4-D16, Cortex-A7 target
The M68040 target gives the worst code density of any 680x0 target (M68020 target usually gives the best code density). If the stated compiler options along with -Os are the only ones given, then frame pointers are conveniently not omitted which reduces the code density of some old CISC targets like the 68k and x86. The SH-4A LE target has poor code density here which could be explained by the choice of "LE" if the bi-endian support is not good and/or SuperH having trouble addressing and branching in the large executable with so few displacement bits. It is interesting to see the effect of a large executable on code density but this smells of more RISC and RISC-V bias even though ARM looks more like the winner here. |
Those results aren't consistent with many other benchmarks. You can also see how good ARC looks there (not reported by you), whereas usually its code density isn't that good.
I'm pretty sure that the guy made some mistakes in performing those benchmarks. I don't want to think that he deliberately made this mess on purpose (being a fan isn't bad per se, but being fanatical is different). Quote:
A newer GCC compiler version was likely used which better supports current and more popular architectures while code generation for older architectures like the 68k has declined, perhaps including SuperH as well. |
I agree, but maybe it's better to take a look at some modern compiler like Clang/LLVM which recently added 68k support as well. Quote:
Where Motorola created a disjointed overlapping embedded lineup of 68000, 680x0, CPU32, ColdFire and PPC CPU/MCUs, Hitachi had a nicely featured and complete lineup of SuperH embedded cores that scaled up to mid performance which they advertise as "Pentium Class CPU/FPU Performance".
https://www.renesas.com/us/en/document/fly/superh-platform-brochure
The best integer performance on the brochure is 2.47 DMIPS/MHz which I believe is better than any ARM core up to 2010. I don't see anything about OoO cores so these appear to be superscalar in-order cores where 2.47 DMIPS/MHz is difficult to believe. |
Same opinion: it's too high for this kind of (micro)architecture. Quote:
The lower clocked MCU has better DMIPS/MHz so this result may be from on-chip SRAM while their "Pentium Class" cores using DDR2 are a more believable 1.8 DMIPS/MHz. It's not a bad result considering how many instructions have to be executed. |
I agree. But SuperH executes too many instructions and this is a major pain for getting good performance. Quote:
cdimauro Quote:
How? The 68000 ISA is more complicated and the decoder isn't trivial and requires more transistors.
|
The 68000 ISA is small and fairly simple. There are only 56 instructions which is less than many so called RISC ISAs today.
Architecture | Number of Instructions
RV32I        | 47 (RISC-V base)
68000        | 56
Cortex-M0(+) | 56 (subset of Thumb1 and Thumb2)
RV32IM       | 57 (RISC-V base+mul/div extension, similar features to 68000 but inferior code density)
MCF5202      | 67 (ColdFire ISA_A)
RV32IMC      | 69 (RISC-V base+mul/div+compressed extensions, so features & code density more like 68000)
68020        | 101 (CALLM & RTM dropped for 68030+; TAS, CAS & CAS2 illegal on Amiga so 96 for Amiga)
Cortex-M3    | 115 (full Thumb1 and Thumb2)
SH-4         | 208
PowerPC      | 222 (standard with FPU but not counting SIMD unit or embedded extensions)
SH-5         | 417
AArch64      | ~1300 (https://www.reddit.com/r/arm/comments/mod62/how_many_instructions_in_armv8/)
x86          | 1,503
x86-64       | 3,684
|
I'd refrain from using such numbers: they don't make sense to me.
Counting ISAs' instructions isn't as trivial a task as one might think.
For example, you stated that the 68k has only 56 instructions. However I assume that you count ADD.B, ADD.W and ADD.L (and a future ADD.Q) as a single instruction, whereas the ALU is very likely to have 3 different "sub ALUs" to perform those computations.
Same for x86: I assume that they are counting all instructions of one of the latest x86 processors. Whereas we know that x86 means 8086, 80186, 80286, ..., which had far fewer instructions compared to the latest processors.
It also depends on how you "group" the instructions. For example, my NEx64T ISA is source-level compatible with both x86 and x64. However it has far fewer instructions because I've grouped all SIMD instructions by their "base" ones. Taking the ADD instructions, my ?ADD (with ? defining the vector register size: MMX, scalar or packed SSE, scalar or packed AVX/AVX-512, or packed length-agnostic) maps in one instruction all the following x86/x64 ones:
VADDPD, VADDPS, VADDSD, VADDSS, VPADDB, VPADDW, VPADDD, VPADDQ ADDPD, ADDPS, ADDSD, ADDSS, PADDB, PADDW, PADDD, PADDQ
To add more on this topic, this ?ADD is just a "base" instruction, because besides its normal usage it can also be used in other ways. For example, as mem-mem-reg and with the reg operand freely positioned anywhere:
VADD.D [R0 + R1 * 8 + 0x12345678],V0,[R4]+
Another example, it can be used by operating directly on memory using the famous REP extension:
REP VADD.D [RDI]+,[RSI]-,[RBX]+
Vector reduction is also possible:
REP VADD.D RAX,[RSI]+
How do you count all those? As different instructions? To me it's just one instruction which is "more versatile". Quote:
The 68000 has more memory addressing modes than most RISC ISAs with 9 but 3 are small variations of others. Some would claim CISC like reg-mem accesses are too expensive but the cut down ColdFire ISA_A kept them. Decoding for the 68000 is easiest of all 68k ISAs even though the 68000 design used uCode which Hitachi engineers copied from the 68000 for RISC SuperH (see page 12 at the following link and look at the decoder stage).
Turtles all the Way Down: Running Linux on Open Hardware https://j-core.org/talks/japan-2015.pdf |
Are you sure that it's the uCode? Quote:
cdimauro Quote:
Apollo's 68080 has already shown what a modernized 68k could do.
However I don't know how many resources its implementation takes compared to a J-core.
|
The Apollo core goal is high performance in a FPGA and not a small size. Gunnar was constantly experimenting with the number of pipeline stages but expect more than the 8 stages of the 68060 in order to increase the clock speed in the FPGA in a similar way more shorter stages can increase clock speeds in an ASIC at the cost of more transistors and requiring better branch prediction. |
Maybe more, if you consider that he also added instruction fusing, which requires more analysis of the decoded instructions to understand if/how they could be fused. Quote:
One of the goals of the J-Core project is small low cost CPU cores (see page 16 of the last link above).
https://j-core.org/talks/japan-2015.pdf Quote:
So, how do you use it for anything?
o Releasing VHDL and build system under BSD license
o Make any chip you want - Royalty free
o 180nm ASIC of SOC we're demoing costs less than 10 yen - Processor only, about 2 and a half cents
o Disposable computing at "free toy inside" level
o Think IoT : ‘Trillion Sensor Network’ economics, but running Linux
|
A $0.03 USD CPU core is a small scale core potentially for extreme volume "trillion sensor" production. This is low enough cost to go on disposable items which has potential for many products. |
Impressive! And I agree. With such low numbers it can have a chance to exploit this super-embedded market. Quote:
cdimauro Quote:
That's a great advantage that the 68k family has over x86: FP immediates.
The only problem (if it's a problem) is that instructions might be too long (e.g.: 12 bytes for doubles and 16 for extended precision).
|
Immediates in the more predictable instruction stream is a good thing. Yes, large instructions from large FP immediates could be challenging for small core designs. Many FP immediates can be exactly represented in smaller precision forms providing FP immediate compression though. I suggested and Frank Wille coded the optimization for vasm which is on by default for 68k FPU compiled vbcc code. I was surprised to see that all FP immediates in the vbcc executable were compressed from double to single FP precision (If half precision FP was supported, there would be more compression). GCC does not have this optimization for the 68k. This optimization along with my FPU support code for vbcc had vbcc easily outperforming GCC in the ByteMark FP benchmark for the 68060. |
For compressed FP immediates do you mean that you load a single-precision FP immediate and then use it in subsequent FP instructions? Quote:
cdimauro Quote:
Thumb-2 also requires a small amount of transistors (I mean: a Cortex-M which only implements this ISA and not the ARM32).
I don't see why people should use J-Core, with all those disadvantages.
|
The Cortex-M0(+) is small as far as instructions and has better performance metrics than SuperH. The ARM licensing and royalty fees are low for these small cores too. License and royalty free are still appealing though. ARM usually charges $1-$10 million USD for a core license and 1.5% plus royalty with small cores being at the low end of this. Royalty free especially is a competitive advantage. Larger AArch64 Cortex-A cores are 3% plus royalty fees and increase with the number of cores, custom IP blocks, GPU, etc. The a la carte support is easy and nice but has a cost. J-core and RISC-V open hardware is very appealing too, especially as the design process gets easier and more open hardware IP blocks become available. |
Makes sense: there's a lot of opportunity outside of ARM. Quote:
It's too bad he chose SuperH instead of 68k which he prefers to program for but it likely would have been at least somewhat more difficult. |
68k is more complicated to implement. Maybe this is the reason.
However Gunnar succeeded, but starting from the 68050 source AFAIR and after so many years. Quote:
cdimauro Quote:
I agree: it looks like Hitachi's marketing.
68060 is on another level and transistor-wise might be similar to SH4 (because the latter is using a lot of L1 cache). |
SH-4 had larger L1 and even received L2 caches as it made it to newer chip processes than the 68060. The cache SRAM transistors quickly outnumber the CPU core logic transistors which is a good reason to have good code density. The SH-4 cores were used on a more complete SoC where the 68060 was only a simple, at least by today's standards, CPU.
|
I agree. The good thing is that modern processors have very large caches, so the transistors used by legacy architectures like 68k or x86 don't take up so much space, and they can be competitive with L/S architectures (which usually have worse code density, so they require larger caches). |
Status: Offline

matthey
Re: J-core in Embedded | Posted on 12-Nov-2023 4:41:18 | [ #11 ]
Super Member | Joined: 14-Mar-2007 | Posts: 1852 | From: Kansas

cdimauro Quote:
With good reasons due to the design choices, but it had no success and I assume that the primary reason is related to the dependency chains on the "extended" instructions mechanism that they've implemented for allowing the construction of bigger immediates and offsets.
|
Most compressed load/store ISAs are handicapped. They have limited GP registers, limited instructions available without switching modes, limited immediates and displacements, limited addressing modes, etc. The BA2 ISA is the only one that did it right and it was done by supporting more variable length instruction sizes like a CISC ISA. Maybe the ColdFire developers would have considered up to 8 byte instruction sizes to be acceptable if BA2 had been around then to show them code density is more important than following RISC norms for scaling smaller.
cdimauro Quote:
Thanks for the sources and the analysis.
BTW, adding MOV3Q isn't mutually exclusive with adding a more general method for compressed immediates: a 68k ISA can have both and gain their advantages (MOV3Q is very very compact, albeit the range is super limited. However those small constants are also the most common ones).
|
The MOV3Q encoding is open on the 68k; it would further improve code density and it would improve ColdFire compatibility, but it is in the A-line, which is used by other 68k OSs, and, yes, it has a very limited immediate range.
cdimauro Quote:
I agree, but maybe it's better to take a look at some modern compiler like Clang/LLVM which recently added 68k support as well.
|
Old ISA support in compilers grows outdated while new compiler support takes time to mature. New hardware is required for good compiler support.
cdimauro Quote:
I'd refrain from using such numbers: they don't make sense to me.
Counting ISAs' instructions isn't as trivial a task as one might think.
For example, you stated that the 68k has only 56 instructions. However I assume that you count ADD.B, ADD.W and ADD.L (and a future ADD.Q) as a single instruction, whereas the ALU is very likely to have 3 different "sub ALUs" to perform those computations.
Same for x86: I assume that they are counting all instructions of one of the latest x86 processors. Whereas we know that x86 means 8086, 80186, 80286, ..., which had far fewer instructions compared to the latest processors.
It also depends on how you "group" the instructions. For example, my NEx64T ISA is source-level compatible with both x86 and x64. However it has far fewer instructions because I've grouped all SIMD instructions by their "base" ones. Taking the ADD instructions, my ?ADD (with ? defining the vector register size: MMX, scalar or packed SSE, scalar or packed AVX/AVX-512, or packed length-agnostic) maps in one instruction all the following x86/x64 ones:
VADDPD, VADDPS, VADDSD, VADDSS, VPADDB, VPADDW, VPADDD, VPADDQ ADDPD, ADDPS, ADDSD, ADDSS, PADDB, PADDW, PADDD, PADDQ
To add more on this topic, this ?ADD is just a "base" instruction, because besides its normal usage it can also be used in other ways. For example, as mem-mem-reg and with the reg operand freely positioned anywhere:
VADD.D [R0 + R1 * 8 + 0x12345678],V0,[R4]+
Another example, it can be used by operating directly on memory using the famous REP extension:
REP VADD.D [RDI]+,[RSI]-,[RBX]+
Vector reduction is also possible:
REP VADD.D RAX,[RSI]+
How do you count all those? As different instructions? To me it's just one instruction which is "more versatile".
|
I agree. It is difficult to count the number of instructions as minor or major variations may or may not be counted. I knew some ISA instruction counts from various sources and Googled the rest for fun. The number is mostly meaningless but it does give a general idea of how many instructions these ISAs have; more instructions do use more transistors, limiting how small cores can scale, and RISC stands for Reduced Instruction Set Computer, so the number of instructions is important when judging RISC ISAs.
cdimauro Quote:
Are you sure that it's the uCode?
|
The pipeline shows uCode pages feeding a uCode sequencer perhaps to expand instructions in the decode stage before feeding into RISC Pipeline Control. It's possible some SuperH cores do not use uCode but it may be the easiest way to handle some 68k like features, especially if Hitachi had the 68000 uCode to look at.
cdimauro Quote:
Maybe more, if you consider that he also added instruction fusing, which requires more analysis of the decoded instructions to understand if/how they could be fused.
|
ColdFire v4 and v5 do instruction fusing/folding. Even the 68060 uses a very simple form for predicted branches. It wouldn't necessarily require another stage if kept simple. It may be possible to do it at the early decode stage on the 68060 which wouldn't affect the execution pipelines due to the decoupled instruction buffer. Without an instruction buffer, Gunnar may do it at the instruction dispatch/issue stage but either way it may be possible to do it at least partially in parallel. Deeper logic slowing the electricity through many gates is what creates timing problems where wider but shallower logic allows parallelism with all the cheap transistors today.
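As a purely illustrative sketch of the general idea, not the actual 68060 or ColdFire folding rules, and with an invented decoded-op format, fusing at decode can be as simple as merging an adjacent compare and conditional branch into one macro-op:

# Illustrative only: merge an adjacent compare + conditional branch into one
# macro-op while scanning the decoded instruction stream.
def fuse(decoded):
    out, i = [], 0
    while i < len(decoded):
        op = decoded[i]
        nxt = decoded[i + 1] if i + 1 < len(decoded) else None
        if op["op"] == "cmp" and nxt and nxt["op"].startswith("b"):
            out.append({"op": "cmp+" + nxt["op"], "srcs": op["srcs"],
                        "target": nxt["target"]})   # one fused macro-op
            i += 2
        else:
            out.append(op)
            i += 1
    return out

stream = [{"op": "cmp", "srcs": ("d0", "d1")},
          {"op": "beq", "target": "loop_end"},
          {"op": "add", "srcs": ("d2", "d3")}]
print(fuse(stream))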
cdimauro Quote:
For compressed FP immediates do you mean that you load a single-precision FP immediate and then use it in subsequent FP instructions?
|
No. Let's say we want to add 1.5 to FP0 and we are using double precision for the variable.
fadd.d #1.5,fp0               ; 12 bytes
fadd.d #$3ff8000000000000,fp0 ; 12 bytes, 4 for FADD + 8 for immediate.d ext

fadd.s #1.5,fp0
fadd.s #$3fc00000,fp0         ; 8 bytes, 4 for FADD + 4 for immediate.s ext
The number 1.5 can be exactly represented in both single and double precision but we can save 4 bytes if we use single precision. The 68k FPU converts the 1.5 to extended precision before any calc and/or setting the FPU condition codes so "fadd.d #1.5,fp0" can be exactly and safely replaced by "fadd.s #1.5,fp0" saving 4 bytes of code. Vasm will compress extended precision immediates down to double or single saving even more code if possible. If half precision fp format was supported, there would be more savings as many small numbers can be exactly represented. Most fp numbers using few mantissa/fraction bits and small exponents have exact representations using a small fp format and can be compressed while repeating decimals like .333... from 1/3 are inexact and can't be compressed. This can be explored by using an online floating point converter.
https://www.exploringbinary.com/floating-point-converter/
Select precision both double and single and for output select raw binary and raw hexadecimal. Enter a decimal number to convert at the top and look at the flags at the bottom after converting. If there is no inexact or subnormal flag then this number can be compressed from double to single precision. This optimization is in vasm and turned on by default when compiling with vbcc as it is safe for all "Fop #immediate,FPn" and "FMOVE #immediate,FPn" instructions. GCC does not have this optimization or some of the other peephole optimizations of vasm which is why some developers use vasm as the assembler for GCC. It would be better if the 68k backend for vbcc was improved but it is difficult to justify the time spent for an emulated 68k CPU.
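The representability test itself is easy to sketch; here is a minimal stand-alone check (my own sketch in Python, not vasm's actual code) of whether a double-precision immediate survives a round trip through single precision and can therefore be compressed:

# A double can be compressed to a single-precision immediate only if it is
# exactly representable there; a round trip through an IEEE-754 single leaves
# such values unchanged. Values are assumed to be within single range.
import struct

def fits_in_single(value: float) -> bool:
    return struct.unpack("<f", struct.pack("<f", value))[0] == value

for x in (1.5, 0.1, 1.0 / 3.0, 2.0 ** -20):
    print(f"{x!r}: needs {4 if fits_in_single(x) else 8} byte immediate")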
cdimauro Quote:
68k is more complicated to implement. Maybe this is the reason.
However Gunnar succeeded, but starting from the 68050 source AFAIR and after so many years.
|
Jens didn't completely disappear in the post-Natami era. He helped with the AC design and programmed some of the most difficult and critical logic. He is the professional core architect at IBM, not Gunnar, from my understanding, which may also be out of date. Gunnar had misc other duties at IBM.
Last edited by matthey on 14-Nov-2023 at 03:55 PM.
Status: Offline

cdimauro
Re: J-core in Embedded | Posted on 12-Nov-2023 17:18:32 | [ #12 ]
Elite Member | Joined: 29-Oct-2012 | Posts: 3313 | From: Germany

@matthey
Quote:
matthey wrote: cdimauro Quote:
With good reasons due to the design choices, but it had no success and I assume that the primary reason is related to the dependency chains on the "extended" instructions mechanism that they've implemented for allowing the construction of bigger immediates and offsets.
|
Most compressed load/store ISAs are handicapped. They have limited GP registers, limited instructions available without switching modes, limited immediates and displacements, limited addressing modes, etc. The BA2 ISA is the only one that did it right and it was done by supporting more variable length instruction sizes like a CISC ISA. Maybe the ColdFire developers would have considered up to 8 byte instruction sizes to be acceptable if BA2 had been around then to show them code density is more important than following RISC norms for scaling smaller. |
I fully agree.
And ColdFire should have been based on the 68020+ ISA with the following two simple changes:
- limited instruction length: 8 bytes looks like the best compromise;
- removal of the double indirection addressing modes for all instructions except JMP/JSR, and even there limited to ([bd,An,Xi*SF]) only. However it would be best to also support ([bd,An,Rn.Size*Scale],od.W/L) because some interesting things would be possible in some common scenarios, but only if it doesn't complicate the implementation. Quote:
cdimauro Quote:
Maybe more, if you consider that he also added instruction fusing, which requires more analysis of the decoded instructions to understand if/how they can be fused.
|
ColdFire v4 and v5 do instruction fusing/folding. Even the 68060 uses a very simple form for predicted branches. It wouldn't necessarily require another stage if kept simple. It may be possible to do it at the early decode stage on the 68060, which wouldn't affect the execution pipelines due to the decoupled instruction buffer. Without an instruction buffer, Gunnar may do it at the instruction dispatch/issue stage, but either way it may be possible to do it at least partially in parallel. Deep logic, where signals have to propagate through many gate levels, is what creates timing problems, while wider but shallower logic allows parallelism with all the cheap transistors available today. |
I think that it requires some additional pipeline stages. The decoding stages are already quite complicated for a 68020, but they can always forward some important information to the following fusing/folding stage(s). Quote:
cdimauro Quote:
For compressed FP immediates do you mean that you load single precision FP immediate and then you use it in subsequent FP instructions?
|
No. Let's say we want to add 1.5 to FP0 and we are using double precision for the variable.
fadd.d #1.5,fp0 ; 12 bytes, assembled as fadd.d #$3ff8000000000000,fp0 (4 for FADD + 8 for the immediate.d extension)
fadd.s #1.5,fp0 ; 8 bytes, assembled as fadd.s #$3fc00000,fp0 (4 for FADD + 4 for the immediate.s extension)
The number 1.5 can be exactly represented in both single and double precision but we can save 4 bytes if we use single precision. The 68k FPU converts the 1.5 to extended precision before any calculation and/or setting the FPU condition codes so "fadd.d #1.5,fp0" can be exactly and safely replaced by "fadd.s #1.5,fp0" saving 4 bytes of code. Vasm will compress extended precision immediates down to double or single saving even more code if possible. If a half precision fp format was supported, there would be more savings as many small numbers can be exactly represented. Most fp numbers using only a few mantissa/fraction bits and small exponents have exact representations using a small fp format and can be compressed, while repeating decimals like .333... from 1/3 are inexact and can't be compressed. This can be explored by using an online floating point converter.
https://www.exploringbinary.com/floating-point-converter/
Select both double and single precision, and for output select raw binary and raw hexadecimal. Enter a decimal number to convert at the top and look at the flags at the bottom after converting. If there is no inexact or subnormal flag then this number can be compressed from double to single precision. |
OK, then it was like I was thinking about it. It's more or less what I've added to my ISA(s), but I can't support immediates of arbitrary lengths and I'm limited to a bit more than single precision (even for FP128 data).
Here I think that 68k's FPU can give the best results from this PoV with this capability of loading immediates/constants of a defined precision, that then are expanded to the full/extended precision.
Actually the only advantage that I have is that I support three operands and up to 32 registers. However the minimum size for the base opcode which allows referencing memory or (short) immediates is 6 bytes (with those 6 bytes I can load some very limited FP constants, but it's too limited: not enough space for even an FP16). So the opcode is big compared to the 68k FPU's one (4 bytes).
But the 68k's FPU can also be greatly improved by using some bits from its 32-bit base line-F opcode (there are plenty which aren't used) to introduce a third operand and more registers. More registers is very easy (just set some bits to 1 for the specific arguments that need to access the additional registers). There are 3 unused bits in the first word, which allow access to an additional 8 FP registers. Adding a third operand is also possible, reusing the Source Specifier, but then it's backward incompatible with the existing software. I've seen that you've introduced it in your 68kF2, but it looks backward-incompatible (or, at least, you can miss the new FP registers). However there's a trick to get the second source operand in a totally backward-compatible way (I can share it, but not here publicly: mail me if you're interested). Quote:
This optimization is in vasm and turned on by default when compiling with vbcc as it is safe for all "Fop #immediate,FPn" and "FMOVE #immediate,FPn" instructions. GCC does not have this optimization or some of the other peephole optimizations of vasm which is why some developers use vasm as the assembler for GCC. It would be better if the 68k backend for vbcc was improved but it is difficult to justify the time spent for an emulated 68k CPU. |
Then why not just use vasm instead of gas, if it's compatible with the latter? |
| Status: Offline |
| | matthey
|  |
Re: J-core in Embedded Posted on 13-Nov-2023 1:52:19
| | [ #13 ] |
| |
 |
Super Member  |
Joined: 14-Mar-2007 Posts: 1852
From: Kansas | | |
|
| cdimauro Quote:
I fully agree.
And ColdFire should have been based on the 68020+ ISA with the following two simple changes:
- limited instruction length: 8 bytes looks like the best compromise;
- removal of the double indirection addressing modes for all instructions except JMP/JSR, and even there limited to ([bd,An,Xi*SF]) only. However it would be best to also support ([bd,An,Rn.Size*Scale],od.W/L) because some interesting things would be possible in some common scenarios, but only if it doesn't complicate the implementation.
|
The full extension word format is where most of the 68020 ISA decoding complexity comes from. I can see why they would want to do away with it for smaller cores. The ColdFire ISA just doesn't make any sense for the larger ColdFire cores like v4 and v5, where it would have been better to go back to something that looks more like CPU32 with some ColdFire instructions added for compatibility.
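To make that decoding cost concrete, here is a rough C model of the memory indirect pre-indexed mode ([bd,An,Xn.size*scale],od) which the full extension word format enables (my own sketch with a toy memory array; the real mode has additional suppression options). The extra read32() inside the effective address calculation is the double indirection that ColdFire dropped:

#include <stdint.h>

/* Tiny fake 64KiB memory so the sketch is self-contained; addresses
   are masked rather than bounds checked, illustration only. */
static uint8_t mem[(1 << 16) + 4];

static uint32_t read32(uint32_t addr)
{
    addr &= 0xFFFF;   /* keep the toy address in range */
    /* big endian fetch, like the 68k */
    return ((uint32_t)mem[addr] << 24) | ((uint32_t)mem[addr + 1] << 16) |
           ((uint32_t)mem[addr + 2] << 8) | (uint32_t)mem[addr + 3];
}

/* ([bd,An,Xn.size*scale],od): the bracketed part is itself a memory
   location holding a pointer, so one load happens before the operand
   address is even known. */
static uint32_t ea_mem_indirect_preindexed(uint32_t an, int32_t bd,
                                           int32_t xn, int scale, int32_t od)
{
    uint32_t inner = an + (uint32_t)bd + (uint32_t)(xn * scale);
    return read32(inner) + (uint32_t)od;   /* extra load inside the EA calculation */
}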
cdimauro Quote:
I think that it requires some additional pipeline stages. The decoding stages are already quite complicated for a 68020, but they can always forward some important information to the following fusing/folding stage(s).
|
What is the code fusion pipeline stage on CPU cores with code fusion called? I haven't seen such a stage documented even with deep pipelines.
cdimauro Quote:
OK, then it was like I was thinking about it. It's more or less what I've added to my ISA(s), but I can't support immediates of arbitrary lengths and I'm limited to a bit more than single precision (even for FP128 data).
Here I think that 68k's FPU can give the best results from this PoV with this capability of loading immediates/constants of a defined precision, that then are expanded to the full/extended precision.
Actually the only advantage that I have is that I support three operands and up to 32 registers. However the minimum size for the base opcode which allows referencing memory or (short) immediates is 6 bytes (with those 6 bytes I can load some very limited FP constants, but it's too limited: not enough space for even an FP16). So the opcode is big compared to the 68k FPU's one (4 bytes).
But the 68k's FPU can also be greatly improved by using some bits from its 32-bit base line-F opcode (there are plenty which aren't used) to introduce a third operand and more registers. More registers is very easy (just set some bits to 1 for the specific arguments that need to access the additional registers). There are 3 unused bits in the first word, which allow access to an additional 8 FP registers. Adding a third operand is also possible, reusing the Source Specifier, but then it's backward incompatible with the existing software. I've seen that you've introduced it in your 68kF2, but it looks backward-incompatible (or, at least, you can miss the new FP registers). However there's a trick to get the second source operand in a totally backward-compatible way (I can share it, but not here publicly: mail me if you're interested).
|
My FPU enhancement proposal was not good enough for Gunnar as it only offered 16 FPU registers with 3 op instructions and support for more datatypes while reusing some of the existing encodings and not adding a prefix. It would also still support the immediate fp compression including half precision which would improve code density beyond what is possible with the current encodings. It didn't take Gunnar long to complain about only 16 FPU registers though. The 68k is not the PPC he is used to working on, and mem-reg FPU instructions do a lot to reduce FPU register needs.
cdimauro Quote:
Then why not just use vasm instead of gas, if it's compatible with the latter? |
Vasm is not 100% compatible with GAS although some developers have been working with Frank to improve compatibility. It is not a fun job as GAS does not use Motorola syntax. Slow steady development is often easier than making many changes at once.
Last edited by matthey on 13-Nov-2023 at 01:53 AM.
|
| Status: Offline |
| | Hypex
 |  |
Re: J-core in Embedded Posted on 13-Nov-2023 4:06:19
| | [ #14 ] |
| |
 |
Elite Member  |
Joined: 6-May-2007 Posts: 11054
From: Greensborough, Australia | | |
|
| @matthey
Quote:
ColdFire wouldn't have been any better. It was stripped down and performance weakened to scale down smaller for the embedded market which was fine but then they tried to increase the performance to scale back up. Why execute more weak instructions instead of going back to the 68k and executing fewer and more powerful instructions? Motorola/Freescale wouldn't allow that as it would have competed with PPC. ColdFire permanently and deliberately had its performance clipped for the low end embedded market. |
The ColdFire was the closest relative to the 68K. But it was scaled down, whereas they needed a superior replacement. It doesn't make sense to scale the ColdFire back up when they already had the scaled-up design in the 68K. The CF V4 core looks the most scaled up, with the closest specs to the 68040.
Found this article: https://www.cpushack.com/2019/11/01/cpu-of-the-day-motorola-mc68040vl/
Quote:
At least the 68k is CISC and good at being CISC. ColdFire is CISC and not good at being CISC or RISC. |
The 68K was solid. The ColdFire was confusing to say the least. Were they trying to bring back a hybridised 6800?
Quote:
PPC standards were conservative and slow to change. PA-RISC and the 88k added SIMD instructions early but they were minimal implementations. Basic integer SIMD support is very cheap. The PA-RISC SIMD support used 0.1% - 0.2% of the silicon area in early PA-RISC CPUs with no cycle time impact. |
The PPC implementation did include dedicated registers which would have cost silicon. The FPRs were double the width of the GPRs while the AVRs were double the FPRs.
So for the registers:
GPR: 32 x 32 bit
FPU: 32 x 64 bit
AVR: 32 x 128 bit
Quote:
The link register isn't so bad. Accessing memory is just a pain with RISC. |
I forgot to mention that. The link register is good for fast subroutines and avoiding stack use. Well, the PPC technically has no stack, since it must be handled manually. But when it comes to call stacks, it must eventually stack things as it will soon run out of register storage. I imagine modern OOP code would use more stack than traditional routines working in a single file.
Quote:
Perhaps 68k fans would have been more willing to adopt the 88k than PPC. Instructions and addressing modes were more similar, the ISA is simpler and the assembler was more readable. I think I would have preferred a variable length encoded SuperH for a load/store architecture though. |
I'd predict that would be the case. The PPC had some acceptance in the community but it was almost like a placebo effect. Some liked it and showed the features, others just didn't like it at all. It may seem quaint these days, but part of the Amiga was writing programs in assembler, which the PPC made harder than it was before. There was also a real commercial example with Macs using PPC to replace 68K so it made sense. And by this stage the Amiga was sitting in the Macs shadow.
The AmigaOne was and continues to be a point of contention. Just the other day I heard again how it's not a real Amiga. But the same people would use Amithlon which didn't run on an Amiga at all. I've never heard of any Mac users in the past say the iMac or an Intel Mac isn't a real Mac. Yet the comparison is similar. However the Amiga lacked a parent company since Commodore were lost and the later Amiga Inc could not replace it.
I do wonder if it is the CPU or surrounding hardware that's the focus of rejection. We are aware Commodore had plans to replace the Amiga with both different CPU and different chipset. So what we know to be Amiga would have changed. Just like how what Mac users knew to be Mac would be changed by Apple. Different hardware, same name. Unfortunately Commodore never produced anything to replace the Amiga as we know it. So we don't know how another Amiga would have been received. What we do know is anything without that original chipset and without that 68K CPU with any Amiga label applied to it does get rejected by the community. But, the Amiga was only produced by Commodore, like a record company produces a band, so there's no reason why the Amiga needed to be stuck with Commodore since a band can change record companies. The Amiga, however, didn't get a chance of independence and so was lost with Commodore. |
| Status: Offline |
| | cdimauro
|  |
Re: J-core in Embedded Posted on 13-Nov-2023 5:57:22
| | [ #15 ] |
| |
 |
Elite Member  |
Joined: 29-Oct-2012 Posts: 3313
From: Germany | | |
|
| @matthey
Quote:
matthey wrote: cdimauro Quote:
I think that it requires some additional pipeline stages. The decoding stages are already quite complicated for a 68020, but they can always forward some important information to the following fusing/folding stage(s).
|
What is the code fusion pipeline stage on CPU cores with code fusion called? I haven't see such a stage documented even with deep pipelines. |
Me neither. I just assumed that there should be some stage for the fusion process.
Think about a peephole optimizer: it scans the decoded instructions and looks for possibilities to remove or combine instructions. But you need all the instructions to be decoded before doing it. Quote:
cdimauro Quote:
But the 68k's FPU can also be greatly improved by using some bits from its 32-bit base line-F opcode (there are plenty which aren't used) to introduce a third operand and more registers. More registers is very easy (just set some bits to 1 for the specific arguments that need to access the additional registers). There are 3 unused bits in the first word, which allow access to an additional 8 FP registers. Adding a third operand is also possible, reusing the Source Specifier, but then it's backward incompatible with the existing software. I've seen that you've introduced it in your 68kF2, but it looks backward-incompatible (or, at least, you can miss the new FP registers). However there's a trick to get the second source operand in a totally backward-compatible way (I can share it, but not here publicly: mail me if you're interested).
|
My FPU enhancement proposal was not good enough for Gunnar as it only offered 16 FPU registers with 3 op instructions and support for more datatypes while reusing some of the existing encodings and not adding a prefix. It would also still support the immediate fp compression including half precision which would improve code density beyond what is possible with the current encodings. It didn't take Gunnar long to complain about only 16 FPU registers though. The 68k is not the PPC he is used to working on, and mem-reg FPU instructions do a lot to reduce FPU register needs. |
Well, if you recall it (on Olaf's Amiga dev forum), Gunnar didn't want to extend the 68k register set because it was already good enough for him.
Now it came with a hybrid/bastardized core which has 32 registers for GP/FPU/SIMD use (and 16 for AR). "Coherent"...
Anyway, 16 registers for the FPU should be good. A requirement, I would say: 8 are really too little.
@Hypex: PowerPCs had no stack, but handling the stack with them is a nightmare even using the load/store multiple registers... |
| Status: Offline |
| | Hypex
 |  |
Re: J-core in Embedded Posted on 13-Nov-2023 14:14:42
| | [ #16 ] |
| |
 |
Elite Member  |
Joined: 6-May-2007 Posts: 11054
From: Greensborough, Australia | | |
|
| @cdimauro
Quote:
PowerPCs had no stack, but handling the stack with them is a nightmare even using the load/store multiple registers... |
Yes, the only stack that really exists on PPC is a software construct, in the ABI. For this reason, I let the C compiler manage stack frames for me, and insert ASM in the middle if I want to use it. |
| Status: Offline |
| | matthey
|  |
Re: J-core in Embedded Posted on 14-Nov-2023 4:33:49
| | [ #17 ] |
| |
 |
Super Member  |
Joined: 14-Mar-2007 Posts: 1852
From: Kansas | | |
|
| Hypex Quote:
The ColdFire was the closest relative to the 68K. But it was scaled down, whereas they needed a superior replacement. It doesn't make sense to scale the ColdFire back up when they already had the scaled-up design in the 68K. The CF V4 core looks the most scaled up, with the closest specs to the 68040.
Found this article: https://www.cpushack.com/2019/11/01/cpu-of-the-day-motorola-mc68040vl/
|
ColdFire obviously borrowed 68k logic but made changes too like adding decoupling between the instruction fetch pipeline and the execution pipeline including on the CF5102/68040VL.
68020: 3 stage, split cache
68030: 3 stage, split cache
CF v2: 2+2 stage, unified cache

68040: 6 stage, split cache
CF5102: 2+4 stage, split cache
CFv3: 2+4 stage, unified cache

68060: 4+4 stage, split cache, superscalar + branch folding
CFv4: 4+5 stage, split cache, instruction fusion/folding
CFv5: 4+5 stage, split cache, superscalar + instruction fusion/folding
The CF5102/68040VL was almost what C= needed as an in between upgrade from the 68030 to 68060. The 5V 68040 ran too hot and cost too much for a console or fanless computer. The CF5102/68040VL was 3.3V, reduced the cache sizes (2kiB I+1kiB D) and removed the MMU and FPU to lower power, reduce the cost and boost integer performance over the 68030. There were 2 problems though. C= went bankrupt before the CF5102/68040VL was available and it isn't fully 68k compatible. Calling it a 68040VL would have been false advertising and even the internal Motorola documentation says it is "User-mode Compatible with M68K Instruction Set" and "The XCF5102 is fully ColdFire code compatible." There are a couple of instructions/encodings that are incompatible. One of them is the MULU.L instruction.
ColdFire Family Programmer’s Reference Manual Quote:
Note that CCR V is always cleared by MULU, unlike the 68K family processors.
|
The 68060 removed from hardware the 32x32=64 bit multiply instructions while retaining 32x32=32 like ColdFire but sets the CCR overflow flag correctly. Perhaps it is necessary to examine the upper 32 bit result to detect the overflow which the 68060 kept and the ColdFire lopped off? Does that mean the 68060 has the upper 32 bits of 32x32=64 available but still doesn't support 32x32=64? Was the 68060 unable to write a 2nd register in the same cycle and they didn't want to add a cycle to 3 cycles for 32x32=64? Could it be that they just wanted to save a few transistors even though 32x32=64 is not rare or is it that writing 2 registers is not RISC like and they wanted to be variable length RISC?
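For what it is worth, detecting signed overflow of a 32x32 multiply in the general case does seem to require looking at the upper half of the product; here is a hedged C sketch of the check (my reconstruction of the documented MULS.L flag behavior, not the 68060's actual logic):

#include <stdint.h>

/* Signed 32x32 multiply that reports overflow the way MULS.L is
   documented to set V: the 64 bit product does not fit in 32 bits.
   My reconstruction, not the 68060's real implementation. */
static int32_t muls_l(int32_t a, int32_t b, int *overflow)
{
    int64_t full = (int64_t)a * (int64_t)b;             /* 32x32=64 */
    *overflow = (full > INT32_MAX || full < INT32_MIN); /* needs the upper half */
    return (int32_t)full;                               /* low 32 bits land in the register */
}

Keeping only the low 32 bits, as ColdFire does, throws away exactly the information this check depends on, which fits the observation that ColdFire always clears V.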
Hypex Quote:
The 68K was solid. The ColdFire was confusing to say the least. Were they trying to bring back a hybridised 6800?
|
You mean hybrid 68000? I believe the CF5102/68040VL would have been user mode compatible with the 68000. I don't see any encoding conflicts between the 68000 ISA and the ColdFire ISA but I may have missed something. The 68k features ColdFire brought back were a recognition that they had made a mistake cutting the 68k ISA too deep. It's kind of like the return of the standard PPC FPU in the e500mc core after the P1022 SoCs with the e500v2 core were unloaded in the trash where Trevor found them.
Hypex Quote:
The PPC implementation did include dedicated registers which would have cost silicon. The FPRs were double the width of the GPRs while the AVRs were double the FPRs.
So for the registers:
GPR: 32 x 32 bit
FPU: 32 x 64 bit
AVR: 32 x 128 bit
|
The PPC developers thought more complex instructions, more registers and the limited OoO designs would give them enough of an advantage over classic RISC. PPC did have a small performance advantage over classic RISC but not enough when other load/store architectures moved toward CISC.
Hypex Quote:
I forgot to mention that. The link register is good for fast subroutines and avoiding stack use. Well, the PPC technically has no stack, since it must be handled manually. But when it comes to call stacks, it must eventually stack things as it will soon run out of register storage. I imagine modern OOP code would use more stack than traditional routines working in a single file.
|
A memory access per branch to subroutine is usually required with or without a link register. There is rarely a difference in performance with a hardware link/return stack common today. Any advantage for PPC of not saving the link register on leaf functions is smaller than the overhead of setting up the stack frame in memory if any local variables or register saving will occur.
Hypex Quote:
I'd predict that would be the case. The PPC had some acceptance in the community but it was almost like a placebo effect. Some liked it and showed the features, others just didn't like it at all. It may seem quaint these days, but part of the Amiga was writing programs in assembler, which the PPC made harder than it was before. There was also a real commercial example with Macs using PPC to replace 68K so it made sense. And by this stage the Amiga was sitting in the Macs shadow.
|
Most PPC users liked the hyped technical features but even among developers there are few PPC assembler programmers and debugging is more challenging. Many 68k programmers can at least read 68k assembler well enough to see what the code is doing, often without a manual. The AmigaOS was designed for and around CISC like memory accesses. PPC isn't even worst case for load/store inefficiency or code density yet the PPC AmigaOS 4 footprint more than doubled.
Hypex Quote:
The AmigaOne was and continues to be a point of contention. Just the other day I heard again how it's not a real Amiga. But the same people would use Amithlon which didn't run on an Amiga at all. I've never heard of any Mac users in the past say the iMac or an Intel Mac isn't a real Mac. Yet the comparison is similar. However the Amiga lacked a parent company since Commodore were lost and the later Amiga Inc could not replace it.
I do wonder if it is the CPU or surrounding hardware that's the focus of rejection. We are aware Commodore had plans to replace the Amiga with both different CPU and different chipset. So what we know to be Amiga would have changed. Just like how what Mac users knew to be Mac would be changed by Apple. Different hardware, same name. Unfortunately Commodore never produced anything to replace the Amiga as we know it. So we don't know how another Amiga would have been received. What we do know is anything without that original chipset and without that 68K CPU with any Amiga label applied to it does get rejected by the community. But, the Amiga was only produced by Commodore, like a record company produces a band, so there's no reason why the Amiga needed to be stuck with Commodore since a band can change record companies. The Amiga, however, didn't get a chance of independence and so was lost with Commodore. |
Would a MacOne be a Mac? Sometimes a space is important as "AmigaOne" is a brand while "Amiga One" is the model "One" of the "Amiga" brand. AmigaOne and AmigaOS 4 are less accepted because they are licensed with division and shenanigans over the licensing. Many of the people who liked shiny new high tech toys have moved on from PPC AmigaOne as it has declined while the situation for 68k Amiga users has improved. Remaining 68k Amiga users like the classic Amiga where compatibility is more important than shiny features and performance. It's a retro nostalgia thing which AmigaOne lacks.

Many of us are not against porting the AmigaOS to other architectures either but AmigaOS on a RPi is still a RPi and AmigaOS on a Mac is still a Mac. It would be very difficult to move the Amiga brand to Amiga branded hardware on another architecture as credibility must be built and the business would need to be seen as a legitimate successor to the Amiga. Amiga Technologies was building that kind of legitimacy and possibly could have eventually switched architectures while Amiga Inc. and AmigaAnywhere did not. The business name is important like the brand. Amiga Corporation is a great name to return to as it already shows an understanding of the Amiga history and C= mistakes.

Products which customers want need to be delivered to build credibility and legitimacy as a successor while the largest Amiga market is retro 68k Amiga. There is a tightrope to walk for retro products. THEA500 Mini hit the mark without Amiga branding. The RPi is a successful spiritual successor of Acorn without branding. The new Atari despite branding and plenty of spending has struggled with questionably retro enough products. Intellivision flopped and was perceived as a scam. AmigaOne did better than Intellivision and is perhaps comparable to the new Atari VCS console. Not retro enough to succeed but it has enough of a connection to appeal to some serious fans.
cdimauro Quote:
Me neither. I just assumed that should be some stage for the fusion process.
Think about a peephole optimizer: it scans the decoded instructions and looks for possibilities to remove or combine instructions. But you need all the instructions to be decoded before doing it.
|
The macro-op instruction is more like intermediate code with placeholders for registers and EAs to be filled in later. I expect some instruction fusion is possible but more info would provide more fusion opportunities.
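As a loose software analogy for that decode-time peephole, here is a small C sketch (purely hypothetical, not modeled on any real core) that walks a decode buffer and marks a compare immediately followed by a conditional branch as one fused macro-op:

#include <stddef.h>

/* Hypothetical decoded-instruction record; a real core carries far
   more fields (registers, EAs, immediates) per macro-op. */
enum op_kind { OP_CMP, OP_BCC, OP_OTHER };

struct decoded {
    enum op_kind kind;
    int fused_with_next;   /* set when this op issues together with the next */
};

/* Peephole-style pass over a few buffered, already decoded
   instructions: a CMP directly followed by a Bcc is marked so the
   pair can issue as a single fused compare-and-branch. */
static void mark_cmp_branch_fusion(struct decoded *buf, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        if (buf[i].kind == OP_CMP && buf[i + 1].kind == OP_BCC)
            buf[i].fused_with_next = 1;
    }
}

In hardware the equivalent pattern match only needs to look at a couple of adjacent entries, which is why it can plausibly hide inside an existing decode or dispatch stage rather than requiring a new one.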
cdimauro Quote:
Well, if you recall it (on Olaf's Amiga dev forum), Gunnar didn't want to extend the 68k register set because it was already good enough for him.
Now it came with a hybrid/bastardized core which has 32 registers for GP/FPU/SIMD use (and 16 for AR). "Coherent"...
|
Actually, I don't recall Gunnar ever saying no to more registers. I recall on the Natami forum when the consensus was that the current integer registers were adequate, he became upset after failing to convince developers otherwise.
cdimauro Quote:
Anyway, 16 registers for the FPU should be good. A requirement, I would say: 8 are really too little.
|
The 68k FPU can do surprisingly well with just 8 FPU registers. I rarely had to spill FPU registers for the vbcc support code even with some fairly complex polynomial equations. There isn't much that can be done for 68k FPU performance except schedule integer instructions between the FPU instructions though as the FPUs have not been pipelined. With pipelining, the result of long latency FPU instructions can't be touched without register renaming which is not always implemented for FPUs. I can see some large matrix math with many intermediate results benefiting from more registers as well. With Fop mem-reg instructions, 16 FPU registers seems like a good number to me too.
|
| Status: Offline |
| | cdimauro
|  |
Re: J-core in Embedded Posted on 18-Nov-2023 5:54:58
| | [ #18 ] |
| |
 |
Elite Member  |
Joined: 29-Oct-2012 Posts: 3313
From: Germany | | |
|
| @Hypex
Quote:
Hypex wrote: @cdimauro
Quote:
PowerPCs had no stack, but handling the stack with them is a nightmare even using the load/store multiple registers... |
Yes, the only stack that really exists on PPC is a software construct, in the ABI. For this reason, I let the C compiler manage stack frames for me, and insert ASM in the middle if I want to use it. |
Good luck with the "beautiful" PowerPC assembly...
@matthey
Quote:
matthey wrote:
cdimauro Quote:
Well, if you recall it (on Olaf's Amiga dev forum), Gunnar didn't want to extend the 68k register set because it was already good enough for him.
Now it came with a hybrid/bastardized core which has 32 registers for GP/FPU/SIMD use (and 16 for AR). "Coherent"...
|
Actually, I don't recall Gunnar ever saying no to more registers. I recall on the Natami forum when the consensus was that the current integer registers were adequate, he became upset after failing to convince developers otherwise. |
Strange. Two different positions.
Unfortunately Olaf is keeping all those gems for himself and it isn't possible to retrieve them even with the web Wayback Machine. Quote:
cdimauro Quote:
Anyway, 16 registers for the FPU should be good. A requirement, I would say: 8 are really too little.
|
The 68k FPU can do surprisingly well with just 8 FPU registers. I rarely had to spill FPU registers for the vbcc support code even with some fairly complex polynomial equations. There isn't much that can be done for 68k FPU performance except schedule integer instructions between the FPU instructions though as the FPUs have not been pipelined. With pipelining, the result of long latency FPU instructions can't be touched without register renaming which is not always implemented for FPUs. I can see some large matrix math with many intermediate results benefiting from more registers as well. |
That's the point. Even for 3D games, handling vector & pixel data requires several registers (without resorting to a SIMD unit). Quote:
With Fop mem-reg instructions, 16 FPU registers seems like a good number to me too. |
I agree. Think about adding them to your ISA. |
| Status: Offline |
| | Hypex
 |  |
Re: J-core in Embedded Posted on 18-Nov-2023 15:25:33
| | [ #19 ] |
| |
 |
Elite Member  |
Joined: 6-May-2007 Posts: 11054
From: Greensborough, Australia | | |
|
| @matthey
Quote:
The CF5102/68040VL was almost what C= needed as an in between upgrade from the 68030 to 68060. The 5V 68040 ran too hot and cost too much for a console or fanless computer. The CF5102/68040VL was 3.3V, reduced the cache sizes (2kiB I+1kiB D) and removed the MMU and FPU to lower power, reduce the cost and boost integer performance over the 68030. There were 2 problems though. |
I'd add a 3rd. A missing MMU didn't help badly coded applications. By that stage an OS without any kind of memory protection could hardly be taken seriously as a high end desktop.
Quote:
The 68060 removed from hardware the 32x32=64 bit multiply instructions while retaining 32x32=32 like ColdFire but sets the CCR overflow flag correctly. Perhaps it is necessary to examine the upper 32 bit result to detect the overflow which the 68060 kept and the ColdFire lopped off? |
The problem would be a full 32 bit result without overflow. A change of sign could mean overflow, but an unchanged sign might not rule it out. It could also store the result as 33 bits to catch it, though that would be a bit odd. But overflow was detected on lesser CPUs, not always for multiply but for operations like add, such as on the 6502.
Quote:
You mean hybrid 68000? |
Sorry no, I missed some detail. I did mean the 6800. My point was, if they were trying to create a 68000 RISC, which is built around load/store, were they attempting to retrofit the 68K back into an 8 bit 6800, which is like an early RISC?
Quote:
The PPC developers thought more complex instructions, more registers and the limited OoO designs would give them enough of an advantage over classic RISC. PPC did have a small performance advantage over classic RISC but not enough when other load/store architectures moved toward CISC. |
I recall it was competitive with CISC at around 200MHz or just below. Then it lost any edge soon after that. Not being able to compete with other RISC would put it behind its own kind. The last major change to PPC I noticed was VLE, but it just seems wrong splitting it into 16 bits and off long alignment. Like converting the copper from 32 bits to 16 bits. Sacrilegious. 
Quote:
A memory access per branch to subroutine is usually required with or without a link register. There is rarely a difference in performance with a hardware link/return stack common today. Any advantage for PPC of not saving the link register on leaf functions is smaller than the overhead of setting up the stack frame in memory if any local variables or register saving will occur. |
There would be, since at least 16 GPRs are non-volatile, and more if the FPU or vectors need saving. The frame needs 16 byte alignment, but that's just a pointer, so the overhead there is space. More space is needed than for a usual stacked call, so the PPC ABI would need more memory that way. And with a higher register count, like PPC has, it needs to stack more unless it uses only the volatile registers.
Quote:
Most PPC users liked the hyped technical features but even among developers there are few PPC assembler programmers and debugging is more challenging. Many 68k programmers can at least read 68k assembler well enough to see what the code is doing, often without a manual. The AmigaOS was designed for and around CISC like memory accesses. PPC isn't even worst case for load/store inefficiency or code density yet the PPC AmigaOS 4 footprint more than doubled. |
Some code is close in size. For example, when comparing a Hollywood executable the 68K and PPC were similar in size with only a few MB difference. But what is usually levelled against OS4 is double indirection in function calls. It isn't made clear if this is because of the OS4 ABI or the PPC calling conventions. The OS calls in the OS4 API are called from a function table in the form of the so called interface, which just contains pointers to functions, so on a call they are loaded in and then branched to. Compare this with the 68K, where OS functions are in a jump table, so a call can jump indirectly. But that entry mostly just jumps off to the actual routine. This is less of an imposition for the 68K since a call is one instruction. But it does mean each OS call needs two jumps to reach it. Perhaps double indirection is just a fact of Amiga OS design.
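For readers who haven't seen it, a rough C picture of the OS4-style call path being described (the struct layout and member name here are made up for illustration, not the real ExecIFace): the caller first loads a function pointer out of the interface table and then branches through it, whereas the classic 68k path is a JSR to a fixed negative offset of the library base where a resident JMP forwards to the routine.

/* Hypothetical interface layout, only to show the call path; the real
   OS4 interfaces have different members and reserved fields. */
struct Interface {
    void *(*AllocSomething)(unsigned long size);   /* made-up member */
};

static void *call_through_interface(struct Interface *iface, unsigned long size)
{
    /* Step 1: data load of the function pointer from the table.
       Step 2: indirect branch through that pointer.
       The 68k library call instead jumps to a jump table entry, so the
       caller performs no data load of a code pointer. */
    return iface->AllocSomething(size);
}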
Quote:
Would a MacOne be a Mac? |
Interesting question. Only if Bill "MacEwan" was behind the rebooting of the Mac on PPC. It's a Mac Jim, but not as we know it. 
The Mac hardware was clearly superior to the AmigaOne. But both are similar enough. So, an iMac or eMac running on PPC, could be the closest to a MacOne. Especially if produced under Apple. Apple would be an authority on their model naming.
And also again, when the Mac in PPC form was rebooted as an Intel Mac. The 'i' had changed meaning in iMac.
Quote:
Sometimes a space is important as "AmigaOne" is a brand while "Amiga One" is the model "One" of the "Amiga" brand |
When expanding that it becomes AmigaOne Amiga One which looks rather silly. 
Quote:
AmigaOne and AmigaOS 4 are less accepted because they are licensed with division and shenanigans over the licensing. |
I never let that bother me much. But yes, as happens in life, humans had to be involved. I don't agree with all I've read about Hyperion or all OS design changes made even though I'm an "OS4 user". What matters to me is the source. That the source of the OS4 codebase is from the original code. Some would say I'm being pedantic since code is replaced over time and when porting AmigaOS3 to OS4 code would have been replaced. But in use I've found it to be the closest to be a modern AmigaOS.
Quote:
Remaining 68k Amiga users like the classic Amiga where compatibility is more important than shiny features and performance. |
I've noticed there is a tie to the chipset. Which has high importance. This goes against the modern Amiga of the 90s where the chipset was taking a back seat to PC cards in PCI bridge boards. So wanting to go back is a bit retro. When you consider the later standards of the Amiga with hardware 3d support and planar problems replaced by fast RTG it looks a bit silly to go backwards to an older 80's 2d chipset. Of course those modern standards were expensive.
Quote:
It's a retro nostalgia thing which AmigaOne lacks. |
Some, like my XE, are so old that they are retro now. My X1000 is now over ten years old and, for OS4 at least, can still power it. And on the XE back then, it could play WB games, and OS friendly games to an extent, and I did enjoy seeing what would work. Sure, you could do that with UAE. But there was a novelty in the early days of a native PPC AmigaOS running completely on RTG and seeing what Amiga games worked.
However, a mistake I think was the lack of PowerUP/WarpUP compatibility. It didn't make sense that Hyperion, a game company, produced an OS under their banner for a CPU where their own games didn't work. It also caused confusion, as not every Amiga person knew all the technical differences, so they wondered why a game like Heretic for PPC didn't work directly and why, when it was made to work, it would just crash.
Quote:
Many of us are not against porting the AmigaOS to other architectures either but AmigaOS on a RPi is still a RPi and AmigaOS on a Mac is still a Mac. It would be very difficult to move the Amiga brand to Amiga branded hardware on another architecture as credibility must be built and the business would need to be seen as a legitimate successor to the Amiga. |
I think it's a bit late for that now. The problem with another architecture is it isn't 68K, and if the old AGA chipset isn't dragged over it would be unacceptable. Amiga users are like a stick in the mud, Lol. The problem with the Amiga and moving it forward is that the Amiga is stuck in time. It was frozen with the last Amiga. The AmigaOne got stuck in this. There was such a large gap between the real thing and the One Amiga that there was virtually nothing left to offer the Amiga market. And the hardware was nothing special and had no special features like the original. But like I've said before, the original brought colour to a black and white world; the AmigaOne was bringing colour into a coloured world, and there was nothing it could offer to be a next Amiga. It couldn't even compete with the fantastic designs we read about in CU Amiga for what could have been the next big Amiga for a modern world.
Quote:
Amiga Technologies was building that kind of legitimacy and possibly could have eventually switched architectures while Amiga Inc. and AmigaAnywhere did not. |
Amiga the Inc (as I call them because I keep thinking of the Spaceballs Pizza Hut joke) wanted to move the brand forward onto x86. I met Bill personally and I could see what he wanted. AmigaAnywhere was one plan but all it did was take the Amiga name and place it where it didn't belong. The AmigaOne and OS4 have been criticised for two decades now but at least they offered a platform and an updated Amiga OS that could run 68K apps. AmigaDE didn't even run on an Amiga nor run any Amiga apps. It was Amiga in name only. I bought into it even though I had no hardware to run it. It was said to support PowerPC and a host of other CPUs which was still relevant. But I think they promoted that as a buzzword and then just had plans to support x86 and nothing else. One clue was that it didn't even run on a real Amiga with the PowerPC card. To me that lost it all credibility. I don't know what they told the folk at Redding but a PC only version was so lazy. What did that offer the Amiga market?
Now, OS3.9 was, to me, produced under duress, because they had nothing to offer Amiga users. OS3.9 balanced that, though I got the feeling Bill didn't want to do it. Sure it was on the old 68K. But it was a product for actual Amiga users. I was there at the dinner when Bill announced it at Ace2K. This guy shouted out, "Why can't you let it die?" Why does his computer deserve to live? Unfortunately I didn't think of a quick witted response in time, nor did anyone else, so I kinda regret not having any public retort to that.
Quote:
AmigaOne did better than Intellivision and is perhaps comparable to the new Atari VCS console. Not retro enough to succeed but it has enough of a connection to appeal to some serious fans. |
Lol!
Well, the point of the AmigaOne wasn't to be retro but to modernise the market. It provided some equality with common gfx and sound card support. And I must say it felt good to just pick a reasonably priced card off the shelf knowing I could use it. Now of course people bring up that the board was expensive. I didn't think it was too bad for the time; it's way worse now! But using standard parts did balance the cost. What people don't see, or ignore perhaps because they see it as irrelevant, is that the AmigaOne supported peripherals that cost exactly the same as on the common PC. This brought the overall cost down. Had the AmigaOne had exclusive cards that did exactly the same as the PC version, but with the Amiga tax on top making them more expensive, it would have been so much worse, and everyone would have had a case against it. Not even the Mac could offer this!
See, where we came from was expensive Amiga PCI bridge cards, just so we could plug in a cheap PCI card. Like the Mediator, an expensive way to add cheap cards. The logic of that makes no sense! And then it got worse. The Spider. Exactly the same as an NEC USB2 card but with a firmware hack. Add on the Amiga tax. Ohhh, exclusive! This really went against the Mediator idea. You paid lots of money so you could save it on the cards, not pay a premium for both! The AmigaOne did away with all this. The board did what the Mediator did and more, with a CPU on board. It did away with the Amiga card and HDD tax. Sure, it had to be supported, but you didn't need to buy an 80GB HDD with Amiga tax in the Amiga section at twice the price of a 500GB HDD in the PC section. People seem to forget this. Oh it's not good enough, it's still not x86, well they can get over it. It's not a perfect Amiga world. 
Quote:
Actually, I don't recall Gunnar ever saying no to more registers. I recall on the Natami forum when the consensus was that the current integer registers were adequate, he became upset after failing to convince developers otherwise. |
I don't think it would fit either. With a 16 bit base for each instruction code it isn't designed for anything else. The 68K series supports "parameters" but the standard is 8 data/8 address/8 FPU. Possibly 8 vectors can fit but they need to fit in the encoding and the Apollo has them somewhere. But they must be extra 16 bit codes, after the main code. They cannot be prefixes like on x86. The 68K is not x86 nor 68x or K86. Adding prefixes of any size to the 68K is wrong. If BigGun has taken his Intel ideas and coded them that way it would be wrong. That is not in the standard. No need to mock the 68K with such an abomination.  Last edited by Hypex on 18-Nov-2023 at 03:52 PM.
|
| Status: Offline |
| | matthey
|  |
Re: J-core in Embedded Posted on 18-Nov-2023 20:11:37
| | [ #20 ] |
| |
 |
Super Member  |
Joined: 14-Mar-2007 Posts: 1852
From: Kansas | | |
|
| Hypex Quote:
I'd add a 3rd. A missing MMU didn't help badly coded applications. By that stage an OS without any kind of memory protection could hardly be taken seriously as a high end desktop.
|
A high end Amiga should have been using a full 68040 with MMU at that time, not that there was a high end Amiga during the 68040 and 68060 era due to the gimped low end Amiga chipset. I was referring to low end hardware for the masses, like an intermediate upgrade for the CD32+ between a 68EC030 and a 68040.
Hypex Quote:
The problem would be a full 32 bit result without overflow. A change of sign could mean overflow, but an unchanged sign might not rule it out. It could also store the result as 33 bits to catch it, though that would be a bit odd. But overflow was detected on lesser CPUs, not always for multiply but for operations like add, such as on the 6502.
|
No doubt overflow is more difficult to detect for multiply than for add. The ColdFire RISC castration just cut off all the upper 32 bits including the overflow detection. Higher end 32 bit RISC cores have at least a MUL and MULH, although two 2-5 cycle multiplies often have to be performed for 32x32=64 to avoid the hardware complexity of writing results to 2 registers.
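A small C sketch of that split, under the usual MUL/MULH convention (my illustration, not tied to any specific ISA): each instruction returns one half of the product, so recovering all 64 bits of a 32x32 multiply costs both of them.

#include <stdint.h>

/* Low half of the product: what a MUL-style instruction writes. */
static uint32_t mul_lo(uint32_t a, uint32_t b)
{
    return a * b;   /* wraps modulo 2^32 */
}

/* High half of the product: what a separate MULH-style instruction
   writes, typically costing a second multi-cycle multiply. */
static uint32_t mul_hi(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Reassembling the full 32x32=64 result needs both halves. */
static uint64_t mul_full(uint32_t a, uint32_t b)
{
    return ((uint64_t)mul_hi(a, b) << 32) | mul_lo(a, b);
}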
Hypex Quote:
Sorry no, I missed some detail. I did mean the 6800. My point was, if they were trying to create a 68000 RISC, which is built around load/store, were they attempting to retrofit the 68K back into an 8 bit 6800, which is like an early RISC?
|
Not likely. The 6800 is an accumulator architecture which is closer to the opposite of RISC in some ways, perhaps more so than CISC.
Accumulator architecture
examples: 4004, 8008, 6800, 6502, 8051
registers: few
mem access: reg-mem
mem traffic: high
code density: mediocre

CISC general purpose register architecture
examples: PDP-11, VAX, 8086, 68000, NS32000
registers: several to many
mem access: reg-mem
mem traffic: low
code density: good

RISC general purpose register architecture
examples: 801, RISC-I, MIPS, 88000, ARM2
registers: many
mem access: load-store
mem traffic: mediocre
code density: bad
The Motorola 88000 borrowed from the 68k. ColdFire borrowed heavily from the 68k. MCore was "microRISC" that did not borrow much from the 68k and failed miserably.
Hypex Quote:
I recall it was competitive with CISC at around 200MHz or just below. Then it lost any edge soon after that. Not being able to compete with other RISC would put it behind its own kind. The last major change to PPC I noticed was VLE, but it just seems wrong splitting it into 16 bits and off long alignment. Like converting the copper from 32 bits to 16 bits. Sacrilegious. 
|
PPC performance improved when maximizing the L1 caches like upgrading to 32kiB I+D L1 caches in the 604e. The shallow pipelines and slower access times of larger caches held PPC back until the on chip L2 cache became practical and PPC had a 2nd wind with the G3 and G4. PPC has always been cache hungry.
IBM used CodePack and Motorola/Freescale VLE for code compression. Lack of standardized hardware played a part in killing PPC.
Hypex Quote:
There would be, since at least 16 GPRs are non-volatile, and more if the FPU or vectors need saving. The frame needs 16 byte alignment, but that's just a pointer, so the overhead there is space. More space is needed than for a usual stacked call, so the PPC ABI would need more memory that way. And with a higher register count, like PPC has, it needs to stack more unless it uses only the volatile registers.
|
PPC has all those GP registers so compilers can unroll loops and inline functions. Every variable needs to be pre-loaded into registers for RISC to have good performance and this means spilling non-volatile registers to gain more registers. AmigaOS 4 needing in the order of ten times the stack space says something about the increased memory traffic to and from the PPC stack frame.
Hypex Quote:
Some code is close in size. For example, when comparing a Hollywood executable the 68K and PPC were similar in size with only a few MB difference. But what is usually levelled against OS4 is double indirection in function calls. It isn't made clear if this is because of the OS4 ABI or the PPC calling conventions. The OS calls in the OS4 API are called from a function table in the form of the so called interface, which just contains pointers to functions, so on a call they are loaded in and then branched to. Compare this with the 68K, where OS functions are in a jump table, so a call can jump indirectly. But that entry mostly just jumps off to the actual routine. This is less of an imposition for the 68K since a call is one instruction. But it does mean each OS call needs two jumps to reach it. Perhaps double indirection is just a fact of Amiga OS design.
|
Hollywood executables may contain more data than code.
The 68k has efficient branch support which starts with small displacements for good code density but scales up to cover the whole 32 bit address range. Immediates are handled efficiently like this too. RISC ISAs have been slow to adapt even with variable length ISAs as Thumb and RVC don't take full advantage. Only BA2 and the load/store ISA Mitch Alsup was working on take full advantage that I am aware of.
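As a concrete illustration of that scaling, here is a small C sketch of the size selection an assembler can make for a 68k Bcc displacement (simplified and based on the documented encodings; a real assembler also iterates, because shrinking one branch can let others shrink too):

#include <stdint.h>

/* Bytes needed for a 68k conditional branch with a given displacement.
   Bcc.B packs an 8 bit displacement into the opcode word itself, but
   0x00 means "16 bit displacement follows" and 0xFF means "32 bit
   displacement follows" (68020+), so those two values fall through to
   the larger forms. */
static int bcc_bytes(int32_t disp)
{
    if (disp >= -128 && disp <= 127 && disp != 0 && disp != -1)
        return 2;                  /* Bcc.B: opcode word only */
    if (disp >= -32768 && disp <= 32767)
        return 4;                  /* Bcc.W: opcode + 16 bit extension word */
    return 6;                      /* Bcc.L: opcode + 32 bit extension word (68020+) */
}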
Hypex Quote:
I never let that bother me much. But yes, as happens in life, humans had to be involved. I don't agree with all I've read about Hyperion or all OS design changes made even though I'm an "OS4 user". What matters to me is the source. That the source of the OS4 codebase is from the original code. Some would say I'm being pedantic since code is replaced over time and when porting AmigaOS3 to OS4 code would have been replaced. But in use I've found it to be the closest to be a modern AmigaOS.
|
AmigaOS 4 may retain more of the 68k AmigaOS design and feel than MorphOS or AROS for better or for worse. I don't know if this is due to being more closely derived from it or a deliberate design philosophy. Perhaps it is some of both.
Hypex Quote:
I've noticed there is a tie to the chipset. Which has high importance. This goes against the modern Amiga of the 90s where the chipset was taking a back seat to PC cards in PCI bridge boards. So wanting to go back is a bit retro. When you consider the later standards of the Amiga with hardware 3d support and planar problems replaced by fast RTG it looks a bit silly to go backwards to an older 80's 2d chipset. Of course those modern standards were expensive.
|
I expect many other Amiga users would have liked to have Amiga chipset compatibility on their RTG PCI cards so they didn't have to switch video outputs. The original Cybervision card has Amiga video passthrough which was more convenient making these cards very desirable today. There was no other choice for high end graphics as C= kept the Amiga chipset low end to turn the Amiga into the next C64 instead of the Next NeXT.
Hypex Quote:
I think it's a bit late for that now. The problem with another architecture is it isn't 68K, and if the old AGA chipset isn't dragged over it would be unacceptable. Amiga users are like a stick in the mud, Lol. The problem with the Amiga and moving it forward is that the Amiga is stuck in time. It was frozen with the last Amiga. The AmigaOne got stuck in this. There was such a large gap between the real thing and the One Amiga that there was virtually nothing left to offer the Amiga market. And the hardware was nothing special and had no special features like the original. But like I've said before, the original brought colour to a black and white world; the AmigaOne was bringing colour into a coloured world, and there was nothing it could offer to be a next Amiga. It couldn't even compete with the fantastic designs we read about in CU Amiga for what could have been the next big Amiga for a modern world.
|
There are no ground breaking, jaw dropping advancements in computers anymore. There is only more performance or a cheaper price for the existing technology. Features vary enough to allow for favorite computer environments while compatibility and software availability play a key role in adoption.
Hypex Quote:
Lol!
Well, the point of the AmigaOne wasn't to be retro but to modernise the market. It provided some equality with common gfx and sound card support. And I must say it felt good to just pick a reasonably priced card off the shelf knowing I could use it. Now of course people bring up that the board was expensive. I didn't think it was too bad for the time; it's way worse now! But using standard parts did balance the cost. What people don't see, or ignore perhaps because they see it as irrelevant, is that the AmigaOne supported peripherals that cost exactly the same as on the common PC. This brought the overall cost down. Had the AmigaOne had exclusive cards that did exactly the same as the PC version, but with the Amiga tax on top making them more expensive, it would have been so much worse, and everyone would have had a case against it. Not even the Mac could offer this!
See, where we came from was expensive Amiga PCI bridge cards, just so we could plug in a cheap PCI card. Like the Mediator, an expensive way to add cheap cards. The logic of that makes no sense! And then it got worse. The Spider. Exactly the same as an NEC USB2 card but with a firmware hack. Add on the Amiga tax. Ohhh, exclusive! This really went against the Mediator idea. You paid lots of money so you could save it on the cards, not pay a premium for both! The AmigaOne did away with all this. The board did what the Mediator did and more, with a CPU on board. It did away with the Amiga card and HDD tax. Sure, it had to be supported, but you didn't need to buy an 80GB HDD with Amiga tax in the Amiga section at twice the price of a 500GB HDD in the PC section. People seem to forget this. Oh it's not good enough, it's still not x86, well they can get over it. It's not a perfect Amiga world. 
|
AmigaOS 4 has good gfx card support but...
1. gfx card drivers don't come with the OS, drive up the total cost and are less user friendly to obtain and install
2. PPC hardware doesn't have SMP, a standard FPU (since the A1222) or standard SIMD/vector support
3. PPC hardware is too expensive to leverage the advantage, and integrated graphics are replacing gfx cards except for the highest end desktop and workstation systems
Poor quality Linux gfx card drivers are a major problem for ARM and RISC-V desktop systems. PCIe cards for the RPi 4 and 5 were a joke with all but one or two not working. Similar problems likely affect the RISC-V SiFive boards. Using integrated graphics allows for cheaper standard hardware thus avoiding the problem but SiFive doesn't have an open hardware GPU where open hardware is one of their selling points.
Hypex Quote:
I don't think it would fit either. With a 16 bit base for each instruction code it isn't designed for anything else. The 68K series supports "parameters" but the standard is 8 data/8 address/8 FPU. Possibly 8 vectors can fit but they need to fit in the encoding and the Apollo has them somewhere. But they must be extra 16 bit codes, after the main code. They cannot be prefixes like on x86. The 68K is not x86 nor 68x or K86. Adding prefixes of any size to the 68K is wrong. If BigGun has taken his Intel ideas and coded them that way it would be wrong. That is not in the standard. No need to mock the 68K with such an abomination. 
|
For a 16 bit reg-mem instruction, 8 registers and 2 op instructions are the limit. With a 32 bit reg-mem instruction, 16 registers and 3 op instructions are possible (68k FPU instructions are 32 bits). It is possible to add a 16 bit prefix to access more registers and allow 3 op instructions. It normally isn't so bad if code mostly uses the lower registers, which offer better code density. The x86-64 ISA uses a prefix to access the upper 8 of its 16 GP registers, resulting in a decline of code density. A 68k prefix adds an extra 16 bits, but the 68k already has 16 registers, and 16 more registers would be available as well as 3 op instructions, even leaving a few more bits available.
16 bit prefix
1 bit - high bit of source register
1 bit - high bit of destination register
1 bit - high bit of index register (may be avoidable by adding bit to full format extension)
4 bits - 3 op 2nd source register

Other potential options
1 bit - CC toggle
1 bit - saturation toggle
1 bit - SIMD toggle
1 bit - 64 bit operation datatype size
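To show how cheaply such fields decode, here is a C sketch that pulls them out of a 16 bit prefix word. The bit positions are my own arbitrary choice for illustration; no actual 68k prefix encoding is being defined here.

#include <stdint.h>

/* Hypothetical field assignment within the proposed 16 bit prefix;
   positions chosen arbitrarily for the example. */
struct prefix_fields {
    unsigned src_hi : 1;   /* high bit of the source register number */
    unsigned dst_hi : 1;   /* high bit of the destination register number */
    unsigned idx_hi : 1;   /* high bit of the index register number */
    unsigned src2   : 4;   /* 3 op 2nd source register */
};

static struct prefix_fields decode_prefix(uint16_t word)
{
    struct prefix_fields f;
    f.src_hi = (word >> 0) & 1;
    f.dst_hi = (word >> 1) & 1;
    f.idx_hi = (word >> 2) & 1;
    f.src2   = (word >> 3) & 0xF;
    return f;
}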
I believe the AC68080 uses a prefix like this with a 64 bit operation datatype size bit, which I really don't like: 64 bit operations are common on a fully utilized 64 bit CPU, greatly reducing code density, overriding the instruction size and length is bad for decoding, and address register operations are already auto extended to 64 bit if following current 68k behavior. It may be better to have a separate 64 bit mode where size=byte, word, long or quad after cleaning up the encodings. This is natural and efficient, as Gunnar wanted, and he even tried to do this, but it is not compatible. A 68k32 high compatibility mode and a 68k64 new mode likely use more transistors and require more design work, so they were rejected by Gunnar.
Last edited by matthey on 19-Nov-2023 at 02:02 AM.
|
| Status: Offline |
| |
|
|
|
[ home ][ about us ][ privacy ]
[ forums ][ classifieds ]
[ links ][ news archive ]
[ link to us ][ user account ]
|