Hand-optimizing the TCC code generator


academic, developer, with an eye towards a brighter techno-social life




2022-04-06

I am on a number of different compiler mailing lists. One of those is the mailing list for the Tiny C Compiler. Imagine my surprise when a recent email mentioned our latest series and suggested trying a similar approach to improve the code generator in TCC. I wanted to see if there was any low-hanging fruit that might yield similar benefits as O provides for QBE.

Differences between tcc and cproc

We should familiarize ourselves with how both tcc and cproc generate their output. For the purposes of simplifying things, we will presume that both compiler front-ends are exactly identical. I know they are not, but our interest right now is not in the front-end.

As we previously learned, cproc outputs QBE IL that is further processed by QBE to produce assembly, which is then processed by the system assembler to produce object code and (finally) linked with the system linker to produce binaries and libraries. Because of this pipeline, where each of the main steps is processed by a different standalone executable, we were able to write our own standalone executable, O, that we inserted into this compilation pipeline after QBE but before the system assembler. That allowed us to leverage knowledge about the structure of QBE-generated assembly to identify deficiencies and write a simple pattern-matching program to output optimized versions of specific patterns. Our O program is a relatively simple peephole optimizer.

In contrast, tcc combines the functions of all the utilities needed to create a complete binary or library into a single executable. That means that tcc contains a C preprocessor, C compiler, and linker all in one. There is also an integrated assembler in order to support inline assembly. Unlike cproc, tcc does not output assembly; it directly outputs object files. That is great when it comes to speed: saving an entire step compared to cproc (and gcc, for that matter) is bound to result in faster compiles, all else being equal.

However, taking the single-binary approach combined with directly outputting object code means that we are not going to be able to take the O approach to optimization. There is no pipeline of standalone executables that we can insert a new utility into. And even if there were, while the ELF format does divide code into sections that we could selectively read and run a peephole optimizer over, this would be far more work than reading over assembly code, requiring at a minimum some way to work out what the binary decodes to. As x86 and x86_64 are notoriously difficult to decode as it is, we would either need to spend endless sleepless nights poring over Intel manuals (and still probably get it wrong sometimes) or trust someone else's decoder to get things right (and still probably get it wrong sometimes). Neither of these options is ideal, so we have to find another approach.

Hand-optimizing the tcc code generator

Because tcc directly outputs object code, we could take an approach where we read through the code generator, spot object code generation that is less than ideal, and teach tcc about better code to generate instead. To start out, I am only going to worry about x86_64 code generation. There are code generators for x86, arm, aarch64, riscv64, and TMS320C67x as well. Each code generator is its own file and there is no intermediate code, so there are no IL-based optimizations that would benefit all the code generators. On the other hand, I suspect there is enough usage of the x86_64 code generator to make this a good starting point.

The x86_64 code generator (x86_64-gen.c in the TCC source tree) can be found here. We are also going to use a really helpful tool to decode the object code that tcc generates: objdump. This utility is from GNU binutils, so chances are pretty good you already have it installed on your system or can easily get it for your system. On my OpenBSD system, I have a little script that every night grabs the latest git HEAD of binutils and builds and installs it, so on my machine I am using GNU objdump (GNU Binutils) 2.38.50.20220406, but I am willing to bet that even old versions of objdump will work fine for today.

What we will do is create some simple sample programs with tcc and read the disassembled output from objdump in order to find opportunities to teach the code generator how to do things better.

Therefore, when we talk about optimizing the code generator today, we are not necessarily talking about making the code generator smaller or faster. We are talking about making the code generator smarter. If we are lucky, even though we are likely to add some code to make the code generator smarter, since the improved code generator will be run on itself, perhaps the optimization improvements outweigh the additional code we will add.

With that said, there is an overarching concern that I want to keep in mind: the overall goal of TCC is to be fast. It is about an order of magnitude faster than gcc. It turns out optimizations can be expensive to process and get right. I want to avoid making TCC take longer to compile code. Hopefully, our work today can make TCC both faster and produce better code. If not, we should at least not make TCC slower.

Studying Hello World

Let's start with probably the simplest C program. Even simpler than Hello World. How about the true program? How about something even smaller:

main(){}

That is it. That is perhaps the simplest complete C program I know how to write. Let's compile it with tcc and read the disassembly with objdump:

/home/brian $ tcc -c true.c
/home/brian $ objdump -d true.o

true.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <main>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   48 81 ec 00 00 00 00    sub    $0x0,%rsp
   b:   b8 00 00 00 00          mov    $0x0,%eax
  10:   c9                      leave
  11:   c3                      ret

Our old good friend movl $0, %eax is back. We know what to do with that, and now we can also see why we would want to make the change: movl $0, %eax is five bytes. For comparison, xorl %eax, %eax is only two bytes (0x31 0xc0), so we get a 60% reduction in code size for this optimization. It is also recommended by Intel for zeroing a register.

As we learned when developing O, optimizing all instances of mov $0 to xor consistently reduced binary sizes between 2% and 3%. Let's teach tcc how to make this improvement.

Wading thru the code generator

The code generator (and most of TCC) is rather opaque in its coding. Lots of one-letter and two-letter variable and function names. Fortunately for us, some comments make it abundantly clear where we should focus our attention: there is a comment that clearly says /* mov $xx, r */ and that is the exact idiom we are looking for. We want to teach the code generator about a special case where, if the immediate is zero, we can optimize this instruction to an xor. There is separate code for the 32-bit and 64-bit cases. Let's start with the 32-bit case.

The 32-bit case code can be found here. Let's imagine that we do not know what the orex() function does. There is still some information we can glean from this block. The next line appears to emit a 32-bit immediate. It looks like the variable fc represents the 32-bit immediate. So that must be the value for us to check. If fc is zero, then we should apply the optimization. Even though we still do not know what the orex() function does, one of its arguments is REG_VALUE(r), and its value is added to 0xb8 to form a mov opcode, so it appears to be a register modifier.

Let's see what xorl %eax, %eax encodes to. Yes, I know I already spoiled it earlier in this post, but let's verify things. We can write a new assembly file named xor.s and inside put:

	xorl %eax, %eax

My version of the GNU assembler is GNU assembler (GNU Binutils) 2.38.50.20220406. I have seen examples where newer versions of GNU as use better encodings for certain instructions than older versions, but I do not think that will matter much today. When I run as -o xor.o xor.s and then run objdump -d xor.o, here is the output I see:


xor.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <.text>:
   0:   31 c0                   xor    %eax,%eax

Confirmed. We want to turn the binary pattern b8 00 00 00 00 into 31 c0. At least, that will turn movl $0, %eax into xorl %eax, %eax. It will be slightly different for the other registers. How different? Let's find out.

Generalizing the pattern

One of the things to notice is that TCC does not use all the available registers. Looking at the list, it appears that TCC avoids using registers that the System V AMD64 ABI says are callee-saved registers, because using callee-saved registers almost certainly complicates things. So we only have rax, rcx, rdx, rsi, rdi, r8, r9, r10, and r11. While rsp is in that enum, it is a callee-saved register; I am pretty sure it is only in this enum to handle stack frame setup and teardown. Let's not worry about the xmm registers; these are for SSE instructions. The st0 register is for the floating-point unit.

So let's follow the order in this enum and edit our xor.s test file to add more xor lines for each of these registers we have available for us to use:

	xorl %eax, %eax
	xorl %ecx, %ecx
	xorl %edx, %edx
	xorl %esp, %esp
	xorl %esi, %esi
	xorl %edi, %edi
	xorl %r8d, %r8d
	xorl %r9d, %r9d
	xorl %r10d, %r10d
	xorl %r11d, %r11d

That assembly produces this output:


xor.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <.text>:
   0:   31 c0                   xor    %eax,%eax
   2:   31 c9                   xor    %ecx,%ecx
   4:   31 d2                   xor    %edx,%edx
   6:   31 e4                   xor    %esp,%esp
   8:   31 f6                   xor    %esi,%esi
   a:   31 ff                   xor    %edi,%edi
   c:   45 31 c0                xor    %r8d,%r8d
   f:   45 31 c9                xor    %r9d,%r9d
  12:   45 31 d2                xor    %r10d,%r10d
  15:   45 31 db                xor    %r11d,%r11d

It appears that the r8 through r11 registers require an additional byte to encode. It is still smaller than a mov into these registers. However, I have already done a good bit of testing and it turns out that getting that far down the register list is so rare I have never seen it happen. So we are just not going to bother with the three-byte ones.

For the remaining six registers, the difference between the second byte of each encoding is 9 (c9-c0, d2-c9, ff-f6); do not forget that we are missing rbx and rbp, and these fill in the missing steps. The enum we previously saw does account for the missing registers. Let's put a pin in this for now.

Adding our first optimization

Armed with this new knowledge, let's improve the 32-bit case. We want to check whether the fc variable is zero and, if it is and the register we are working with is one of the registers that encodes to a two-byte xor, output code to xor the register with itself. Otherwise, we should do what we have been doing: mov the immediate into the register.

In code form, that would look like this:
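The step of 9 falls out of how that second byte is put together. Here is a minimal sketch, assuming the standard x86 ModRM layout (mod in bits 7-6, reg in bits 5-3, rm in bits 2-0) and the usual hardware register numbers (rax=0, rcx=1, rdx=2, rbx=3, rsp=4, rbp=5, rsi=6, rdi=7). The helper name xor_self_modrm is made up for illustration and is not part of TCC:

```c
#include <assert.h>

/*
 * Second byte (ModRM) of "xorl %reg, %reg" for one of the eight
 * legacy registers. mod = 3 selects register-direct operands, and
 * the same register number fills both the reg and rm fields:
 *
 *     0xc0 | (reg << 3) | reg  ==  0xc0 + reg * 9
 *
 * xor_self_modrm is a hypothetical helper, not a TCC function.
 */
unsigned char xor_self_modrm(unsigned reg)
{
    assert(reg < 8); /* r8 through r11 also need a REX prefix byte */
    return (unsigned char)(0xc0 | (reg << 3) | reg);
}
```

Plugging in 0, 1, 2, 4, 6, and 7 gives c0, c9, d2, e4, f6, and ff, matching the objdump listing; 3 and 5 (rbx and rbp) give the db and ed values that fill the gaps.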

            if (fc==0 && r 

That's it. That should improve the code generator for 32-bit mov. Note that we do REG_VALUE(r) * 9 because we learned that each xor of a register with itself is nine steps away from the next one. What's in between? The encodings for xoring a register with a different register.
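To make the idea concrete outside of TCC, here is a standalone sketch of the same decision. It does not use TCC's own o(), orex(), or gen_le32() helpers; the code_buf struct and the emit8, emit_le32, and load_imm32 names are all invented for this example, and it assumes register numbers 0 through 7 only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a compiler's output buffer; not TCC's real data structure. */
struct code_buf {
    uint8_t bytes[16];
    size_t  len;
};

void emit8(struct code_buf *cb, uint8_t b)
{
    assert(cb->len < sizeof cb->bytes);
    cb->bytes[cb->len++] = b;
}

/* Append a 32-bit value in little-endian byte order. */
void emit_le32(struct code_buf *cb, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        emit8(cb, (uint8_t)(v >> (8 * i)));
}

/*
 * Load a 32-bit immediate into register number reg (0..7 only).
 * Zero becomes "xor reg, reg" (2 bytes); anything else stays
 * "mov $imm, reg" (5 bytes).
 */
void load_imm32(struct code_buf *cb, unsigned reg, uint32_t imm)
{
    assert(reg < 8);
    if (imm == 0) {
        emit8(cb, 0x31);                      /* xor opcode */
        emit8(cb, (uint8_t)(0xc0 + reg * 9)); /* ModRM: reg, reg */
    } else {
        emit8(cb, (uint8_t)(0xb8 + reg));     /* mov $imm32, reg */
        emit_le32(cb, imm);
    }
}
```

Calling load_imm32 with register 0 and immediate 0 leaves the two bytes 31 c0 in the buffer, while a non-zero immediate still produces the five-byte mov.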

Improving 64-bit mov

We can follow the exact same pattern for the 64-bit mov optimization. Note here that TCC is using sv->c.i to store the value of the immediate. Even though this is a 64-bit optimization, we are going to want to use the 32-bit xor. This is because 32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register. Or, a little more succinctly, 32-bit operations clear the upper half of the 64-bit destination register. It takes three bytes to encode a 64-bit xor, but since the 32-bit xor will clear the upper half of the register anyway, we can save a byte and get the same result. Here's the code:

            if (sv->c.i==0 && r c.i);
            }
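As a rough illustration of why the 32-bit xor is the one to pick here, the sketch below models the 64-bit case on its own. It assumes the usual encodings (a REX.W prefix of 0x48 followed by 0xb8+reg and an 8-byte immediate for the 64-bit mov) and registers 0 through 7 only; code_buf64, put8, and load_imm64 are hypothetical names, not TCC functions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in output buffer for this example only. */
struct code_buf64 {
    uint8_t bytes[16];
    size_t  len;
};

void put8(struct code_buf64 *cb, uint8_t b)
{
    assert(cb->len < sizeof cb->bytes);
    cb->bytes[cb->len++] = b;
}

/*
 * Load a 64-bit immediate into register number reg (0..7 only).
 * Zero uses the 2-byte 32-bit xor, which also clears the upper
 * half of the 64-bit register. Anything else uses the 10-byte
 * "mov $imm64, reg" form.
 */
void load_imm64(struct code_buf64 *cb, unsigned reg, uint64_t imm)
{
    assert(reg < 8);
    if (imm == 0) {
        put8(cb, 0x31);                      /* 32-bit xor, no REX prefix */
        put8(cb, (uint8_t)(0xc0 + reg * 9)); /* ModRM: reg, reg */
    } else {
        put8(cb, 0x48);                      /* REX.W: 64-bit operand size */
        put8(cb, (uint8_t)(0xb8 + reg));     /* mov $imm64, reg */
        for (int i = 0; i < 8; i++)
            put8(cb, (uint8_t)(imm >> (8 * i)));
    }
}
```

In this model a zero immediate shrinks from ten bytes to two, and skipping the REX.W prefix on the xor is what saves the extra byte compared with a 64-bit xor.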

Was it really that easy? Yes, yes it was.

...nearly

An additional optimization location

Further down the code generator, there's a place where a hardcoded mov immediate into %eax is taking place. If that immediate is zero, then here's yet another place where we have to teach TCC about our xor optimization. It took me a while to notice it at first, and I only noticed it because subsequent runs of objdump told me I wasn't making all the optimizations I thought I should be.

Here is the change:

    if (vtop->type.ref->f.func_type != FUNC_NEW) { /* implies FUNC_OLD or FUNC_ELLIPSIS */
        if (nb_sse_args == 0)
            o(0xc031); /* xor eax, eax */
        else
            oad(0xb8, nb_sse_args 
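The same zero check can be modeled in isolation. The sketch below is not TCC code: call_buf, call_put8, and set_sse_arg_count are names invented here, and capping the count at 8 is an assumption based on the System V AMD64 convention that %al carries an upper bound on the number of vector registers used (at most eight) when calling a variadic or unprototyped function:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in output buffer for this example only. */
struct call_buf {
    uint8_t bytes[8];
    size_t  len;
};

void call_put8(struct call_buf *cb, uint8_t b)
{
    assert(cb->len < sizeof cb->bytes);
    cb->bytes[cb->len++] = b;
}

/*
 * Set %eax to the number of SSE register arguments before a call.
 * Zero is emitted as "xor %eax, %eax" (2 bytes); any other count
 * is emitted as "mov $n, %eax" (5 bytes).
 */
void set_sse_arg_count(struct call_buf *cb, unsigned nb_sse_args)
{
    if (nb_sse_args == 0) {
        call_put8(cb, 0x31); /* xor %eax, %eax */
        call_put8(cb, 0xc0);
    } else {
        /* Assumed cap: at most 8 vector registers carry arguments. */
        uint32_t n = nb_sse_args < 8 ? nb_sse_args : 8;
        call_put8(cb, 0xb8); /* mov $n, %eax */
        for (int i = 0; i < 4; i++)
            call_put8(cb, (uint8_t)(n >> (8 * i)));
    }
}
```

Since plenty of calls pass no floating-point arguments at all, the zero branch is the common one, which would explain why missing it showed up so clearly in the objdump output.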

Testing

While we can congratulate ourselves on improving the code generator, we do not really know we have made an improvement until we test things out. I want to check two different things: first, if we're producing smaller binaries; and second, if we're producing faster binaries. Again, I do not want to slow tcc down. I am quite confident we will produce smaller binaries. But if tcc takes longer to compile code, I am not sure it would be worth it.

First, I ran the TCC test suite to make sure that I didn't break anything. Everything passed. So far, so good.

Next, I calculated the difference in the .text section of TCC binaries, libraries, and object files before and after applying this diff. Here are the numbers I saw:

Binary      Old size   New size   Difference in bytes   Percent size reduction
tcc         328786     321358     7428                  2.26%
libtcc.a    307288     300252     7036                  2.29%
bcheck.o    23254      22801      453                   1.95%
bt-exe.o    4732       4550       182                   3.85%
bt-log.o    648        639        9                     1.39%
libtcc1.a   12678      12119      559                   4.41%

That is pretty incredible. Even though we added code to the compiler, we still walked away with a pretty significant size reduction. That is a big win.

I also did a battery of build time tests and could find no statistically significant changes in compile time. In fact, the first time I ran the TCC test suite with the improved compiler, I was surprised to see that the times for that run were lower than they were with the old TCC. But it's not statistically significant as far as I can tell. That is not all that surprising, as we really only added five truthiness checks, and these checks do not even run every time the code generator produces a new instruction.

Another TCC user posted some build times and .text sizes for SQLite3. A 2.60% reduction in binary size and no difference in build times (though there's not quite enough data in the email to make a claim on significance of build times).

I will call this a win. Smaller binaries and no difference in compile times. This is an optimization worth sharing with the rest of the TCC community.
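For anyone who wants to double-check the last column, the percentages are just the byte difference divided by the old size. A small sketch of that arithmetic, with percent_reduction being a helper name chosen for this example:

```c
#include <assert.h>

/*
 * Percent by which a .text section shrank, given its old and new
 * sizes in bytes. percent_reduction is a name made up for this
 * example.
 */
double percent_reduction(unsigned long old_size, unsigned long new_size)
{
    assert(old_size > 0 && new_size <= old_size);
    return 100.0 * (double)(old_size - new_size) / (double)old_size;
}
```

Feeding in the tcc row (328786 and 321358) gives roughly 2.26, and the libtcc1.a row (12678 and 12119) gives roughly 4.41, in line with the table.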

Work continues…

I posted the complete diff to the TCC mailing list. Other than a drive-by commit I made as part of my role as OpenBSD port maintainer of TCC to support the riscv64 arch, I haven't personally made any other commits to TCC, though I did author the recent diff to complete TCC linker support on OpenBSD. I am not exactly sure what the process is for committing something that is going to affect everyone to the mob branch. So I will just wait until I get a go-ahead. Until then, you can grab the patch from the mailing list. I suspect I will commit it soon.

I hope you enjoyed coming on this journey with me to hand-optimize the TCC x86_64 code generator. There are surely more such improvements in the x86_64 code generator and all the other code generators. Let's find these opportunities and teach them to TCC.
