academic, developer, with an eye towards a brighter techno-social life
2022-04-06
Hand-optimizing the TCC code generator
I am on a variety of compiler mailing lists. One of these is the mailing list for the Tiny C Compiler. Imagine my surprise when a recent email mentioned our recent series and suggested trying a similar approach to improve the code generator in TCC. I was eager to see if there was any low-hanging fruit that might produce gains similar to those O provides for QBE.
Differences between tcc and cproc
We should first familiarize ourselves with how both tcc and cproc generate their output. To simplify things, we will presume that both compiler front-ends are completely identical. I know they're not, but our interest right now is not in the front-end.
As we previously learned, cproc outputs QBE IL that is further processed by QBE to produce assembly, which is then processed by the system assembler to produce object code and (finally) linked with the system linker to produce binaries and libraries. Because each of the major steps in this pipeline is handled by a different standalone executable, we were able to write our own standalone executable, O, and insert it into the compilation pipeline after QBE but before the system assembler. That allowed us to leverage knowledge about the structure of QBE-generated assembly to notice deficiencies and write a simple pattern-matching program to output optimized versions of specific patterns. Our O program is a relatively simple peephole optimizer.
In contrast, tcc combines the features of all the utilities needed to create a complete binary or library into a single executable. That means that tcc contains a C preprocessor, C compiler, and linker all in one. There is also an integrated assembler in order to support inline assembly. Unlike cproc, tcc does not output assembly; it directly outputs object files. That is great in terms of speed: saving a whole step compared to cproc (and gcc, for that matter) is sure to result in faster compiles, all else being equal.
However, taking the single-binary approach combined with directly outputting object code means that we are not going to be able to take the O approach to optimization. There is no pipeline of standalone executables that we can insert a new utility into. And even if there were, while the ELF format does divide code into sections that we could selectively query and run a peephole optimizer over, this would be far more work than reading over assembly code, requiring at a minimum a manual so we could work out what the binary decodes to. As x86 and x86_64 are notoriously difficult to decode as it is, we would either have to spend endless sleepless nights poring over Intel manuals (and still probably get it wrong sometimes) or trust someone else's decoder to get things right (and still probably get it wrong sometimes). Neither of these options is ideal, so we have to find another approach.
Hand-optimizing the tcc code generator
Because tcc directly outputs object code, we could take an approach where we read through the code generator, spot object code generation that is less than ideal, and teach tcc about better code to generate instead. To start out, I am only going to worry about x86_64 code generation. There are code generators for x86, arm, aarch64, riscv64, and TMS320C67x as well. Each code generator is its own file and there is no intermediate code, so there are no IL-based optimizations that would benefit all the code generators. On the other hand, I suspect there is enough usage of the x86_64 code generator to make this a good starting point.
The x86_64 code generator can be found here. We are also going to use a very helpful tool to decode the object code that tcc generates: objdump. This utility is from the GNU binutils, so chances are pretty good you already have it installed on your system or can easily get it. On my OpenBSD system, I have a little script that every night grabs the latest git HEAD of binutils and builds and installs it, so on my machine I am using GNU objdump (GNU Binutils) 2.38.50.20220406, but I am willing to bet that even old versions of objdump will work fine for today.
What we will do is compile some simple sample programs with tcc and read the disassembled output from objdump in order to find opportunities to teach the code generator how to do things better.
Therefore, when we talk about optimizing the code generator today, we are not necessarily talking about making the code generator smaller or faster. We are talking about making the code generator smarter. If we are lucky, even though we are going to add some code to make the code generator smarter, the improved code generator will be run on itself, so perhaps the optimization improvements will outweigh the additional code we add.
With that said, there is an overarching concern that I want to keep in mind: the overall goal of TCC is to be fast. It is about an order of magnitude faster than gcc. It turns out optimizations can be expensive to process and to get right. I want to avoid making TCC take longer to compile code. Hopefully, our work today can make TCC both faster and produce better code. If not, we should at least not make TCC slower.
Studying Hello World
Let's start with perhaps the simplest C program. Even simpler than Hello World. How about the true program? How about something even smaller:
main(){}
That's it. That is perhaps the simplest complete C program I know how to write. Let's compile it with tcc and read the disassembly with objdump:
/home/brian $ tcc -c true.c
/home/brian $ objdump -d true.o

true.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <main>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   48 81 ec 00 00 00 00    sub    $0x0,%rsp
   b:   b8 00 00 00 00          mov    $0x0,%eax
  10:   c9                      leave
  11:   c3                      ret
Our old friend movl $0, %eax is back. We know what to do with that, and now we can also see why we would want to make the change: movl $0, %eax is five bytes. For comparison, xorl %eax, %eax is only two bytes (0x31 0xc0), so we get a 60% reduction in code size for this optimization. It is also the idiom recommended by Intel for zeroing a register.
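The 60% figure is simple byte counting, and can be checked with a small sketch. The mov bytes come from the disassembly above and the xor bytes are the two quoted in the text; the array and function names are made up for illustration and do not appear in TCC.

```c
#include <stddef.h>

/* movl $0, %eax: opcode 0xb8 followed by a four-byte immediate of zero. */
static const unsigned char mov_zero_eax[] = { 0xb8, 0x00, 0x00, 0x00, 0x00 };

/* xorl %eax, %eax: the two bytes quoted in the text. */
static const unsigned char xor_eax_eax[] = { 0x31, 0xc0 };

/* Percentage of bytes saved when old_len bytes are replaced by new_len bytes. */
static double percent_saved(size_t old_len, size_t new_len)
{
    return 100.0 * (double)(old_len - new_len) / (double)old_len;
}

/* Savings for this particular replacement: (5 - 2) / 5 of the bytes. */
static double zeroing_savings(void)
{
    return percent_saved(sizeof mov_zero_eax, sizeof xor_eax_eax);
}
```

The saving applies only to each zeroing instruction itself, which is why the whole-binary reductions reported for O are far smaller than 60%.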
As we learned when developing O, optimizing all instances of mov $0 to xor consistently reduced binary sizes by between 2% and 3%. Let's teach tcc how to make this improvement.
Wading through the code generator
The code generator (and most of TCC) is rather opaque in its coding: lots of one-letter and two-letter variable and function names. Fortunately for us, some comments make it abundantly clear where we should focus our attention: there is a comment that clearly says /* mov $xx, r */ and that is the exact idiom we are looking for. We want to teach the code generator about a special case where, if the immediate is zero, we can optimize this instruction to an xor. There is separate code for the 32-bit and 64-bit cases. Let's start with the 32-bit case.
The 32-bit case code can be found here. Let's imagine that we do not know what the orex() function does. There is still some information we can glean from this block. The next line appears to emit a 32-bit immediate. It looks like the variable fc represents the 32-bit immediate, so that should be the value for us to check. If fc is zero, then we should apply the optimization. Even though we still do not know what the orex() function does, one of its arguments is REG_VALUE(r), and its value is added to 0xb8 to form a mov opcode, so it appears to be a register modifier.
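The way 0xb8 and the register number combine can be sketched in isolation. This sketch assumes that REG_VALUE(r) keeps the low three bits of the register number, which is how the x86 encoding of mov $imm32, r32 selects its destination. The helper name mov_imm32_opcode is hypothetical and is not a TCC function.

```c
/*
 * Opcode byte for "mov $imm32, r32".
 * Assumption: REG_VALUE(r) in TCC behaves like (r & 7), so the register
 * number is folded into the opcode itself: 0xb8 is eax, 0xb9 is ecx, and
 * so on up to 0xbf for edi.
 */
static unsigned char mov_imm32_opcode(int reg)
{
    return (unsigned char)(0xb8 + (reg & 7));
}
```

With register 0 this yields 0xb8, which matches the b8 00 00 00 00 bytes in the disassembly of our tiny program.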
Let's see what xorl %eax, %eax encodes to. Yes, I know I already gave it away earlier in this post, but let's verify things. We can write a new assembly file named xor.s and inside put:
xorl %eax, %eax
My version of the GNU assembler is GNU assembler (GNU Binutils) 2.38.50.20220406. I have seen examples where newer versions of GNU as use better encodings for certain instructions than older versions, but I do not think that will matter much today. When I run as -o xor.o xor.s and then run objdump -d xor.o, here is the output I see:
xor.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <.text>:
   0:   31 c0                   xor    %eax,%eax
Confirmed. We want to turn the binary pattern b8 00 00 00 00 into 31 c0. At least, that will turn movl $0, %eax into xorl %eax, %eax. It may be slightly different for the other registers. How different? Let's find out.
Generalizing the pattern
One of the things to notice is that TCC does not use all the available registers. Looking at the list, it appears that TCC avoids using registers that the System V AMD64 ABI says are callee-saved, because using callee-saved registers almost certainly complicates things. So we only have rax, rcx, rdx, rsi, rdi, r8, r9, r10, and r11. While rsp is in that enum, it is a callee-saved register; I am pretty sure it is only in this enum to handle stack frame setup and teardown. Let's not worry about the xmm registers; those are for SSE instructions. The st0 register is for the floating-point unit.
So let's follow the order in this enum and edit our xor.s test file to add more xor lines for each of the registers we have available to us:
xorl %eax, %eax
xorl %ecx, %ecx
xorl %edx, %edx
xorl %esp, %esp
xorl %esi, %esi
xorl %edi, %edi
xorl %r8d, %r8d
xorl %r9d, %r9d
xorl %r10d, %r10d
xorl %r11d, %r11d
That assembly produces this output:
xor.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <.text>:
   0:   31 c0                   xor    %eax,%eax
   2:   31 c9                   xor    %ecx,%ecx
   4:   31 d2                   xor    %edx,%edx
   6:   31 e4                   xor    %esp,%esp
   8:   31 f6                   xor    %esi,%esi
   a:   31 ff                   xor    %edi,%edi
   c:   45 31 c0                xor    %r8d,%r8d
   f:   45 31 c9                xor    %r9d,%r9d
  12:   45 31 d2                xor    %r10d,%r10d
  15:   45 31 db                xor    %r11d,%r11d
It appears that the r8 through r11 registers require an additional byte to encode. That is still smaller than a mov into these registers. However, I have already done a good bit of testing and it turns out that getting that far down the register list is so rare I have never seen it happen. So we are just not going to bother with the three-byte ones.
For the remaining six registers, the difference between each second byte is 9 (c9-c0, d2-c9, ff-f6); don't forget that we are missing rbx and rbp, and those fill in the missing steps. The enum we previously saw does account for the missing registers. Let's put a pin in this for now.
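The step of nine follows from how the second byte is laid out. It is a ModRM byte with the top two bits set (0xc0) and the register number stored twice, once in bits 3 to 5 and once in bits 0 to 2, so a register xored with itself contributes r * 8 + r = r * 9. Here is a minimal sketch that reproduces the objdump output above. The function name encode_xor_self is hypothetical, and the sketch assumes standard x86-64 register numbers (eax is 0, ecx is 1, and so on up to r15d at 15).

```c
#include <stddef.h>

/*
 * Encode "xor r32, r32" (a register with itself) into out.
 * Returns the number of bytes written: 2 for registers 0..7, 3 for 8..15.
 */
static size_t encode_xor_self(int reg, unsigned char *out)
{
    size_t n = 0;
    int low = reg & 7;                  /* low three bits of the register number */

    if (reg >= 8)
        out[n++] = 0x45;                /* REX prefix with R and B set, for r8d..r15d */
    out[n++] = 0x31;                    /* xor opcode */
    out[n++] = (unsigned char)(0xc0 + low * 9); /* same as 0xc0 | low << 3 | low */
    return n;
}
```

Registers 3 (ebx) and 5 (ebp) produce db and ed for the second byte, which are exactly the steps missing between d2, e4, and f6 in the listing.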
Adding our first optimization
Armed with this new knowledge, let's improve the 32-bit case. We want to check whether the fc variable is zero; if it is, and the register we are working with is one of the registers that encodes to a two-byte xor, then we should output code to xor the register with itself. Otherwise, we should do what we have been doing: mov the immediate into the register.
In code form, that would look like this:
if (fc==0 && r …
That's it. That should improve the code generator for 32-bit mov. Note that we do REG_VALUE(r) * 9 because we learned that each xor of a register with itself is nine away from the next one. What's in between? The encodings you get if you want to xor a register with a different register.
Improving 64-bit mov
We can follow the exact same pattern for 64-bit mov optimization. Note here that TCC is using sv->c.i to store the value of the immediate. Even though this is a 64-bit optimization, we will want to use the 32-bit xor. This is because 32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register. Or, a little more succinctly, 32-bit operations clear the upper half of the 64-bit destination register. It takes three bytes to encode a 64-bit xor, but since the 32-bit xor will clear the upper half of the register anyway, we can save a byte and get the same result. Here's the code:
if (sv->c.i==0 && r … c.i); }
Was it really that easy? Yes, yes it was.
...almost
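The decision described in the last two sections can be sketched as a standalone function. This is not the diff that was posted to the mailing list. The name load_imm32 and the output-buffer interface are invented for illustration, whereas TCC emits bytes through its own helpers such as orex() and o(). The sketch takes the two-byte xor only when the immediate is zero and the register is one of the first eight, and otherwise falls back to the usual mov.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Load a 32-bit immediate into register reg (x86-64 numbering, 0..15).
 * Writes the instruction bytes to out and returns how many were written.
 * The same bytes also zero the full 64-bit register, because a 32-bit
 * write clears the upper half of the destination.
 */
static size_t load_imm32(int reg, uint32_t imm, unsigned char *out)
{
    size_t n = 0;
    int low = reg & 7;

    if (imm == 0 && reg < 8) {
        out[n++] = 0x31;                            /* xor r32, r32 */
        out[n++] = (unsigned char)(0xc0 + low * 9); /* register with itself */
        return n;                                   /* two bytes */
    }

    if (reg >= 8)
        out[n++] = 0x41;                            /* REX.B selects r8d..r15d */
    out[n++] = (unsigned char)(0xb8 + low);         /* mov $imm32, r32 */
    out[n++] = (unsigned char)(imm & 0xff);         /* immediate, little-endian */
    out[n++] = (unsigned char)((imm >> 8) & 0xff);
    out[n++] = (unsigned char)((imm >> 16) & 0xff);
    out[n++] = (unsigned char)((imm >> 24) & 0xff);
    return n;                                       /* five or six bytes */
}
```

The zero check happens only on the path that was already going to emit a mov of an immediate, so the extra work per instruction is a single comparison.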
An additional optimization location
Further down the code generator, there is a place where a hardcoded mov of an immediate into %eax takes place. If that immediate is zero, then this is yet another place where we have to teach TCC about our xor optimization. It took me a while to notice it, and I only noticed because subsequent runs of objdump told me I wasn't making all the optimizations I thought I should be.
Here is the improvement:
if (vtop->type.ref->f.func_type != FUNC_NEW) { /* implies FUNC_OLD or FUNC_ELLIPSIS */
    if (nb_sse_args == 0)
        o(0xc031); /* xor eax, eax */
    else
        oad(0xb8, nb_sse_args …
Testing
While we can congratulate ourselves on improving the code generator, we do not really know we have made an improvement until we test things out. I want to check two different things: first, whether we are producing smaller binaries; and second, whether we are producing faster binaries. Again, I do not want to slow tcc down. I am pretty confident we can produce smaller binaries. But if tcc takes longer to compile code, I am not sure it would be worth it.
First, I ran the TCC test suite to make sure that I didn't break anything. Everything passed. So far, so good.
Next, I calculated the difference in the size of the .text section of TCC binaries, libraries, and object files before and after applying this diff. Here are the numbers I saw:
Binary | Old size | New size | Difference in bytes | Percent size reduction |
---|---|---|---|---|
tcc | 328786 | 321358 | 7428 | 2.26% |
libtcc.a | 307288 | 300252 | 7036 | 2.29% |
bcheck.o | 23254 | 22801 | 453 | 1.95% |
bt-exe.o | 4732 | 4550 | 182 | 3.85% |
bt-log.o | 648 | 639 | 9 | 1.39% |
libtcc1.a | 12678 | 12119 | 559 | 4.41% |
That is pretty incredible. Even though we added code to the compiler, we still walked away with a pretty significant size reduction. That is a big win.
I also ran a battery of build time tests and could find no statistically significant changes in compile time. In fact, the first time I ran the TCC test suite with the improved compiler, I was surprised to see that the times for that run were lower than they were with the old TCC. But it is not statistically significant as far as I can tell. That is not all that surprising, as we really only added five truthiness checks, and those checks do not even run every time the code generator produces a new instruction.
Another TCC user posted some build times and .text sizes for SQLite3: a 2.60% reduction in binary size and no difference in build times (though there is not quite enough data in the email to make a claim about the significance of the build times).
I will call this a win. Smaller binaries and no difference in compile times. This is an optimization worth sharing with the rest of the TCC community.
Work continues…
I posted the full diff to the TCC mailing list. Other than a drive-by commit I made as part of my role as OpenBSD port maintainer of TCC to support the riscv64 arch, I haven't personally made any other commits to TCC, though I did author the recent diff to finish TCC linker support on OpenBSD. I am not exactly sure what the process is for committing something that is going to affect everyone to the mob branch, so I will just wait until I get a go-ahead. Until then, you can grab the patch from the mailing list. I suspect I will commit it soon.
I hope you enjoyed coming on this journey with me to hand-optimize the TCC x86_64 code generator. There are surely more such improvements to be found in the x86_64 code generator and all the other code generators. Let's find these opportunities and teach them to TCC.