GCC vs Renesas SHC compiler
Posted on 02/07/2024 22:04
This topic first came up in a discussion between Sentaro21 and me on e-Gadget. According to him, the main reason C.Basic is still built with the fx-9860G SDK and the fxCG miniSDK is that they use the Renesas SHC compiler, which has better floating-point arithmetic performance than GCC. He wanted to know how GCC's current emulated FPU library performs, to see whether switching is worthwhile.
So months later, after opening a megathread on the transition to fxSDK/gint in the C.Basic Git repository, I began working on benchmarks for SH-based calculators, covering Whetstone (floating-point) and Dhrystone (integer). I asked Sentaro21 for the Dhrystone add-in that was used for Ftune/Ptune, and he kindly provided not only that benchmark but also the Whetstone and 8-queen ones.
The adapted Whetstone benchmark was run on my fx-9860GII SD (SH4) without overclocking, compiled with both the fx-9860G SDK (Renesas SHC compiler) and fxSDK/gint@dev (GCC 11.1). The results are as follows:
Renesas: 523.83 KWIPS
GCC (-Os, 56504 bytes): 321.95 KWIPS (-38.54%)
GCC (-O3, 57544 bytes): 455.86 KWIPS* (-12.98%)
GCC (-Ofast, 57704 bytes): 456.38 KWIPS* (-12.88%)
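The percentages in parentheses are relative differences against the Renesas score. A minimal sketch of that arithmetic follows; `relative_change_percent` is a hypothetical helper name and is not part of either benchmark.

```c
#include <assert.h>

/* Relative change of `score` against `baseline`, in percent.
   A negative value means the build scores lower than the baseline. */
double relative_change_percent(double score, double baseline)
{
    assert(baseline != 0.0); /* a zero baseline has no meaningful ratio */
    return (score - baseline) / baseline * 100.0;
}
```

For example, `relative_change_percent(321.95, 523.83)` gives roughly -38.54, matching the -Os line above.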
Well, the performance of the -Os build lags considerably behind that of Renesas. While the results of the -O3 and -Ofast builds are much closer to the original, they are invalid, as Lephe found that those builds skip module 8 of the Whetstone test.
Lephenixnoir wrote:
The -O3 figure (452 Kwhet/s) is probably invalid since its optimization removes module 8 entirely (which is 0.7 second)
One possible cause is that GCC has IEEE 754-compliant fp arithmetic and maybe SDK doesn't. And as far as library functions are concerned different compromises will lead to different speeds, maybe OpenLibm tries harder to be precise and so has slower functions
Benchmark source:
fxSDK/gint project: attached to this post
fx-9860G SDK project:
https://pm.matrix.jp/fx-bench2.zip
Attached file
Posted on 02/07/2024 22:11
For extra clarification, there are 3 contributors to performance in each build:
It's highly unlikely that SHC generates better code than GCC. It'd be like throwing away a few decades of compiler research. I've started looking into individual Whetstone modules and the SDK add-in outperforms the GCC add-in consistently, so it doesn't appear to be a case of one outlier operation being extremely bad.
My main suspicion right now is precision—the soft-fp library could be less than perfectly IEEE-754-precise (which libgcc is), and the math library could be less precise than OpenLibm, which is overall quite good.
Posted on 03/07/2024 09:40
To my surprise, the SuperH build of libgcc still uses a quite old soft-fp library called fp-bit; it is documented to be rather slow, and many targets have been moved to a more recent implementation tersely called "soft-fp", which comes from the glibc project (and is purported to give speedups in the 15-30% range). These efforts have been ongoing for at least 10 years now.
I think it would be a good idea to try and build the "soft-fp" library in libgcc to see if we can make it work and how much that improves performance. The starting point is the sh case of libgcc/config.host using the t-make file t-fdpbit.
Posted on 03/07/2024 10:53
I was able to make a build of libgcc that uses the "soft-fp" library. For context, this is something that, if we validate that it works, will be changed in the fxSDK and distributed automatically to everyone with the next GCC update in the fxSDK.
I haven't really tested whether the computations work yet, apart from a single test I added which checks that squaring √2 gives the expected result.
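A sanity check of this kind could look like the sketch below. It is not the test that was actually added; the names `SQRT2`, `abs_double` and `sqrt2_square_ok` and the choice of tolerance are assumptions made for illustration.

```c
#include <assert.h>

/* Nearest double to the square root of 2. */
#define SQRT2 1.4142135623730951

/* Absolute value without pulling in the math library. */
double abs_double(double x)
{
    return x < 0.0 ? -x : x;
}

/* Returns 1 if squaring sqrt(2) lands within `tolerance` of 2.0.
   The volatile operand keeps the compiler from folding the product at
   compile time, so the multiplication is done at run time (through the
   software floating-point routines on a target without an FPU). */
int sqrt2_square_ok(double tolerance)
{
    assert(tolerance >= 0.0);
    volatile double root = SQRT2;
    double square = root * root;
    return abs_double(square - 2.0) <= tolerance;
}
```

A tolerance of a few units in the last place (for instance `1e-15`) is used rather than exact equality, since the rounded product of two doubles need not be exactly 2.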
With this version (development GCC 15.0 with "soft-fp" in libgcc), the benchmark finishes in 0.950s (1052 Kwhet/s) at -O2.
Now that's more like it. x)
Posted on 03/07/2024 16:08
Can confirm the results with soft-fp library: the time needed is now reduced to 1.24s (806 KWIPS) for -Os and 0.929s (1076 KWIPS) for -O2 👍
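The time and score pairs quoted in this thread (0.950s and 1052, 1.24s and 806, 0.929s and 1076) are all consistent with score = 1000 / time. The sketch below encodes that relation; the constant `ASSUMED_WORKLOAD_KWI` is inferred from those figures, not taken from the benchmark source, and `kwips_from_seconds` is a hypothetical helper name.

```c
#include <assert.h>

/* Assumed fixed workload, in thousands of Whetstone instructions.
   Inferred from the posted time/score pairs, not from the source. */
#define ASSUMED_WORKLOAD_KWI 1000.0

/* Converts a measured run time in seconds into a KWIPS score. */
double kwips_from_seconds(double seconds)
{
    assert(seconds > 0.0); /* a run cannot take zero or negative time */
    return ASSUMED_WORKLOAD_KWI / seconds;
}
```

Under that assumption, a run of 1.24s works out to about 806 KWIPS.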
Posted on 03/07/2024 19:53
I also tested by graphing some trigonometric curves and it seems to be working as intended.
I will check a game or two that uses floating-point and then add this change to a GCC 14 build for the next version of the fxSDK.
I assume these performance results are satisfactory for a potential port of C.Basic? :3
Posted on 03/07/2024 20:08
Most likely
GCC's integer mode performance has been high for some time, but if the latest version of the floating-point emulator library is better than Renesas' library, it is worth switching.
Posted on 03/07/2024 20:10
I forgot to mention it, but you can build SDK-like add-ins (with no gint) using GCC; you should probably start there.
Posted on 03/07/2024 20:23
Ah, then the build test could be much easier, but what else should I add to let GCC compile it?
Posted on 03/07/2024 20:26
This file: https://www.planet-casio.com/storage/forums/bin-12970.zip
It contains crt0.s, which is the initialization code (it replaces the init code in your main file; you should also rename AddIn_main() to main()), the fxlib, and a linker script.
You can compile your project with a Makefile of your own. Since you'll be linking with fxlib, you should compile with -mrenesas; otherwise variadic-argument functions like printf() will break down. You can use the linker script with -T addin.ld when linking.
Posted on 03/07/2024 21:34 | Attached file
Ported the Dhrystone benchmark. Here are the results:
Format: <x1 score>,<x80 score>
Renesas: 32320 Dhry/s, 2820 Dhry/s
GCC -Os: 35080 Dhry/s, 29263 Dhry/s
GCC -O2: 37758 Dhry/s, 37187 Dhry/s
GCC -O3: 38139 Dhry/s, 37578 Dhry/s
The two Dhrystones are tests with and without the cache enabled.
(The second one takes a little time)
I'm kind of surprised by GCC's x80 scores, which far outperform the Renesas compiler at every optimization level.
Uploaded my updated gint benchmark.
Posted on 04/07/2024 07:28 | Attached file
The 8-queen benchmark results are out:
Format: <8q>, <8q_un>, <8q_asm>
Renesas: 3193ms, 2712ms, 1355ms
GCC -Os: 3321ms, 2839ms, X
GCC -O2: 2987ms, 2656ms, X
GCC -O3: 2839ms, 2702ms, X
The 8q_asm test is unavailable with GCC because the original asm file only works with the fx-9860G SDK, and its ported .s counterpart caused the add-in to crash.
Also, I noticed a performance drop in the Dhrystone test after including the 8-queen benchmark:
(Edit: Updated x80 score)
GCC -Os: 32512 Dhry/s, 4838 Dhry/s
GCC -O2: 34822 Dhry/s, 4505 Dhry/s
GCC -O3: 35118 Dhry/s, 4487 Dhry/s
So I separated the 8-queen benchmark, and the results show a performance increase for -Os:
GCC -Os: 2991ms, 2784ms, X
GCC -O2: 3003ms, 2667ms, X
GCC -O3: 2882ms, 2785ms, X
I also ran the Dhrystone test alone, but the results were consistent with the ones in the last reply. I'm not too sure why the 8-queen source affects the previous benchmarks.
Posted on 04/07/2024 07:59 | Attached file
A quick update from Sentaro21:
@CalcLoverHK
Thanks for the quick work.
I didn't know GCC's FPU emulator was that old.
Now I see nothing wrong with moving the development environment to a GCC-based one.
GCC's integer arithmetic performance remains very impressive.
As for the x80 code, it is simply the source concatenated 80 times, so you need to make sure that the second and subsequent copies are not removed by the optimizer.
SHC did not optimize that much, but we will need to do some work to keep GCC from optimizing the copies away.
The code to prevent GCC optimization would be:
#pragma GCC push_options
#pragma GCC optimize ("O0")
...
#pragma GCC pop_options
Updated the above Dhrystone test results and benchmark add-in.
Posted on 04/07/2024 08:33
Setting -O0 via a pragma is way too much; it will inhibit all optimizations. -O0 code is just plain garbage; that's not something you'd ever consider performance measurements for. It is sufficient instead to sink an output of the test into a volatile variable so that GCC keeps all the iterations.
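The volatile-sink approach can be sketched as follows. This is an illustration under assumptions, not the code used in the benchmark add-in: `bench_sink`, `workload` and `run_benchmark` are hypothetical names, and `workload` is a stand-in for one copy of the real test body.

```c
#include <assert.h>
#include <stdint.h>

/* Sink for benchmark outputs. Accesses to a volatile object are
   observable side effects, so the compiler must perform each of them. */
volatile uint32_t bench_sink;

/* Hypothetical stand-in for one copy of the benchmark body. */
uint32_t workload(uint32_t seed)
{
    uint32_t acc = seed;
    for (uint32_t i = 0; i < 100; i++)
        acc = acc * 1664525u + 1013904223u;
    return acc;
}

/* Runs the workload `repeats` times at the normal optimization level.
   Each seed is read back from the sink and each result is stored to it,
   so no repetition can be discarded as dead code. */
void run_benchmark(uint32_t repeats)
{
    assert(repeats > 0); /* a benchmark with no repetitions measures nothing */
    for (uint32_t i = 0; i < repeats; i++) {
        uint32_t seed = bench_sink;      /* volatile read: unknown at compile time */
        bench_sink = workload(seed + i); /* volatile write: must be performed */
    }
}
```

Unlike the -O0 pragma, this leaves the test body fully optimized, so the measurement still reflects the code GCC would normally generate.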
I am interested in maintaining these benchmarks as part of gintctl so I can keep an eye on them over time. Is there a specific license attached to libbench?
Posted on 04/07/2024 09:15
I am interested in maintaining these benchmarks as part of gintctl so I can keep an eye on them over time. Is there a specific license attached to libbench?
Oh, I forgot about this one. libbench is licensed under GPLv2 or later as of now, though I have to confirm the license of fxbench2 with Sentaro21 before uploading it to Git.
Posted on 05/07/2024 06:56
@Lephenixnoir
Thanks for your quick and precise support.
@CalcLoverHK
Thanks again.
You are our main development member for the upcoming GCC-based port.
I am looking forward to seeing the FPU emulator library issues resolved and how much performance can be improved by porting to GCC.
If it is optimized for SH4 as well as SH3, we can expect a further performance increase.
The overclocking utilities Ftune/Ptune2/Ptune3 are also available.
If you have any questions or a bug report, feel free to let me know.