# Changing std:sort at Google’s Scale and Beyond

These items are adorable.

TL;DR; We are changing std::sort in LLVM’s libcxx. That’s a long story of what it took us to get there and all possible consequences, bugs you might encounter with examples from open source. We provide some benchmarks, perspective, why we did this in the first place and what it cost us with exciting ideas from Hyrum’s Law to reinforcement learning. All changes went into open source and thus I can freely talk about all of them.

This article is split into 3 parts, the first is history with all details of recent (and not so) past of sorting in C++ standard libraries. Second part is about what it takes to switch from one sorting algorithm to another with various bugs. The final one is about the implementation we have chosen with all optimizations we have done.

Sorting algorithms have been extensively researched since the start of computer science.1 Specifically, people tried to optimize the number of comparisons on average, in the worst case, in certain cases like partially sorted data. There is even a sorting algorithm that is based on machine learning2 — it tries to predict the position where the elements should go with pretty impressive benchmarks! One thing is clear—sorting algorithms do evolve even now with better constant factors and reduced number of comparisons made.

In every programming language, sorting calls exist and it’s up to the library to decide which one to use, we’ll talk about different choices in languages later. There are still debates over which sorting is the best on Hackernews3, papers4, repos5.

As Donald Knuth said

It would be nice if only one or two of the sorting methods would dominate all of the others, regardless of application or the computer being used. But in fact, each method has its own peculiar virtues. […] Thus we find that nearly all of the algorithms deserve to be remembered, since there are some applications in which they turn out to be best.

— Donald Knuth, The Art Of Computer Programming, Volume 3

C++ history

std::sort has been in C++ since the invention of so-called “STL” by Alexander Stepanov9 and C++ standard overall got an interesting innovation back then called “complexity”. At the time the complexity was set to being comparisons on average. We know from Computer Science courses that quicksort is comparisons on average, right? This algorithm was first implemented in the original STL.

How was really the first std::sort implemented?

It used a simple quicksort with a median of 3 (median from (first, middle, last) elements). Once the recursion hits less than 16 elements, it bails out and at the end uses insertion sort as it is believed to work faster for small arrays.

You can see the last stage where it tries to “smooth out” inaccurate blocks of size 16.

A minor problem with quicksort

Well, that’s definitely true that quicksort has on average complexity, however, C++ STL may accept third parameter, called comp function:

This actually gives us an opportunity to self-modify the array, or, in other words, make decisions along the way the comp function is called, and introduce a worst time complexity on any data. The code below will make sense in a little while:

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.

Learn more about bidirectional Unicode characters

If we consider any quick sort algorithm with the following semantics:

Find some pivot among elements (constant number)Partition by pivot, recurse on both sides

“gas” value represents unknown, infinity, the value is set to the left element only when two unknown elements are compared, and if one element is gas, then it is always greater.

At the first step you pick some pivot among at most elements. While partitioning, all other elements will be to the right of the pivot.

At step you know the relation of at most elements and all elements will still be partitioned to the right. Take equals and you already have quadratic behavior.

If we run this code against the original STL10, we clearly are having a quadratic number of comparisons.

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.

Learn more about bidirectional Unicode characters

No matter which randomization we are going to introduce, even quicksort implementation of std::sort is quadratic on average (with respect to arbitrary comparator) and implementation technically is not compliant.

The example above does not try to prove something and is quite artificial, even though there were some problems with the wording in the standard regarding “average” case.

Moving on with quadratic behavior

Quicksort worst case testing was first introduced by Douglas McIlroy11 in 1998 and called “Antiquicksort”. In the end it influenced the decision to move std::sort complexity being worst case rather than average which has been changed in the C++11 standard. The decision was partially made due to the fact there are lots of efficient worst case sorts out there as well.

Well, however there is more to the story.

Are modern C++ standard libraries actually compliant?

There are not so many places C++ standard specifies the wording “on average“. One more example is std::nth_element call.

What is std::nth_element?

You might guess it finds the nth element in the range. More specifically std::nth_element(first, nth, last, comp) is a partial sorting algorithm that rearranges elements in a range such that:

The element pointed at by nth is changed to whatever element would occur in that position if [first, last) were sorted.All of the elements before this new nth element are less than or equal to the elements after the new nth element.

You can see that the complexity still states “on average“.

This decision was in place due to the existing quickselect12 algorithm. However, this algorithm is susceptible to the same trickery for both GNU and LLVM implementations.

https://gcc.godbolt.org/z/xqqqKWv4r

For LLVM/clang version it’s obviously degrading to quadratic behavior, for GNU version for is around which are very close to reported numbers. If you read carefully the implementation13, you’ll see that the fallback algorithm uses heap select – by creating heap and extracting elements from it. And heap extraction is known to be logarithmic.

However, for finding nth element there are not so many worst case linear algorithms, one of them is median of medians16 which has a really bad constant. It took us around 20 years to find something really appropriate, thanks to Andrei Alexandrescu14 . My post on selection algorithms discussed that quite extensively but got too little attention in my humble opinion 🙂 (and it has an implementation from Andrei as well!). We found great speedups on real data for real SQL queries of type SELECT * from table ORDER BY x LIMIT N.

What happened to std::sort?

It started to use Introspective Sort, or, simply, introsort, upon too many levels of quicksort, more specifically , fall back to heap sort, worst case known algorithm. Even Wikipedia has all good references15 regarding introsort implementations.

Here is the worst case sorting introsort for GNU implementation:

LLVM history

When LLVM tried to build C++0x version of STL, Howard Hinnant made a presentation6 on how it all was going with the implementation. Back then we recognized some really interesting benchmarks and more and more benchmarked sorts on different data patterns.

Howard Hinnant’s slide on sorting in 2010

That gave us one interesting thought when we found this slide on what makes a sorting successful and efficient. Clearly not all data is random and some patterns happen in prod, how important is it to balance or recognize it?

For example even at Google as we use lots of protobufs, there are frequent calls to std::sort which come from the proto library7 which sorts all tags of fields presented in the message:

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.

Learn more about bidirectional Unicode characters

Learn more about bidirectional Unicode characters

It makes a first quite important point: we need to recognize “almost sorted” patterns as they do happen. Obvious cases are ascending/descending, some easy mixes of those like pipes, and multiple consecutive increasing or decreasing runs. TL;DR; we did not do a very good job here but most modern algorithms do recognize quite obvious ones.

Theory: presortedness

Presortedness was first formally described in8:

Condition 1 is a normalizer, condition 2 states that we should care only about comparisons and not elements, condition 3 shows that if you can sort a supersequence, you should be able to sort a subsequence in fewer amount of comparisons, condition 4 is an upper limit on sorted parts: you should be able to sort if you can sort and , condition 5 is a more general upper limit – you should be able to find position of in linear amount of comparisons.

There are many existing presortedness measures like

m01: 1 if not sorted, 0 if sorted. A pretty stupid one.Block: Number of elements in a sequence that aren’t followed by the same element in the sorted sequence.Mono: Minimum number of elements that must be removed from X to obtain a sorted subsequence, which corresponds to |X| minus the size of the longest non-decreasing subsequence of X.Dis: Maximum distance determined by an inversion.Runs: number of non-decreasing runs in X minus one.Etc

Some presortedness measures are better than others, meaning if there exists an algorithm is optimal towards some measure (optimality means number of comparisons for all input behaves logarithmically on number on the inputs which have not bigger measure value: ), then it is also optimal towards another. And at this point, theory starts to differ much from reality. Theory found a nice “common” presortedness measure, it’s very complicated and out of scope for this article.

https://arxiv.org/pdf/1407.6183.pdf

Unfortunately, among all measures above only Mono, Dis and Runs are linear time (others are and it’s an open question whether they have lower complexity). If you want to report some of these measures, you need to sample heavily or add extra to the sorting itself which is not great for performance. We could have done more work in that area but generally all we tried were microbenchmarks + several real world workloads.

Anyway, I guess you are tired of theory and let’s get to something more practical.

LLVM history continues

As LLVM libcxx was developed before C++11, the first version was also based on quicksort. What was the difference to the GNU sort?

The libcxx implementation handled a few particular cases specially. Collections of length 5 or less are sorted using special handwritten comparison based sortings.

Learn more about bidirectional Unicode characters

Learn more about bidirectional Unicode characters

Depending on the data type being sorted, collections of lengths up to 30 are sorted using insertion sort. Trivial types are easy to swap and assembly is quite good for them.

There is a special handling case for collections with most items being equal and for collections that are almost sorted. It tries to use insertion sort upon the limit of 8 transpositions: if during the outer loop we see more than 8 pair elements where , we bail out to recursion, otherwise we sort it and don’t go there. That’s really great for almost sorted patterns.

Learn more about bidirectional Unicode characters

LLVM libcxx sort on random input

However, if you look at ascending inputs, you can see libstdcxx does lots of unnecessary work compared to libcxx sort which matters in practice. First is literally running 4.5 times faster!

Last distinction was that the median of 5 was chosen when the number of elements in a quicksort partition is more than 1000. No more differences, for me the biggest impact of this sort is in trying to identify common patterns which is not cheap but gets lots of benefits for real world cases.

When we changed libstdcxx to libcxx at Google, we saw significant improvements (dozens of percent) spent in std::sort. From then, the algorithm hasn’t been changed, and the usage has been growing.

Quadratic problem

Given LLVM libcxx was developed for C++03, the first implementation targeted on average case we talked about earlier. That has been addressed several times in the past, in 2014, 2017, 201817, 18.

In the end we managed to submit an improvement same as GNU library has with introsort. We add an additional parameter to the algorithm that indicates the maximum depth of the recursion the algorithm can go, then the remaining sequence on that path is sorted using heapsort. The number of partitions allowed is set to . Since heapsort’s worst case time complexity is , the modified algorithm also has a worst case time complexity of . This change19 has been committed to the LLVM trunk and released with LLVM 14.

Learn more about bidirectional Unicode characters

How many real world cases got there into heap sort?

We also were curious how much performance went into deep levels of quicksort and confirmed that one in several thousand of all std::sort calls got into the fallback heapsort.

That was slightly unusual to discover. It also did not show any statistically significant performance improvements, i.e. no obvious or significant quadratic improvements have been found. Quicksort is really working ok on real world data, however, this algorithm can be exploitable.

At first it looks easy to just change the implementation and win resources: sorting has order and, for example, if you sort integers, the API does not care about the implementation; the range should be just ordered correctly.

However, the C++ API can take compare functions, which may be for simplicity lambda functions. We will call them “comparators.” These can break our assumptions about sorting being deterministic in several ways. Sometimes I refer to this problem a.k.a. “ties can be resolved differently”.

Learn more about bidirectional Unicode characters

More serious examples might involve SQL queries of type:

Learn more about bidirectional Unicode characters

And we know that users like to write golden tests with queries of that sort. Even though nobody guarantees the order of equal elements, users do depend on that behavior as it might be buried down in code they have never heard of. That’s a classic example of Hyrum’s Law

With a sufficient number of users of an API,

it does not matter what you promise in the contract:

all observable behaviors of your system

will be depended on by somebody.

Hyrum Wright

Golden tests can be confusing if the diff is too big: are we breaking something or is the test too brittle to show anything useful to us? Golden tests are not a typical unit test because they don’t enforce any behavior. They simply let you know that the output of the service changed. There is no contract about the meaning of these changes; it is entirely up to the user to do whatever they want with this information.

When we tried to find all such cases, we understood it made the migration almost impossible to automate — how did we know these changes were the ones that the users wanted? In the end we learned a pretty harsh lesson that even slight changes in how we use primitives lead us to problems with goldens. It’s better if you use unit tests instead of golden ones or pay more attention to determinism of the code written.

Actually, about finding all Hyrum’s Law cases.

How to find all equal elements dependencies?

As equal elements are mostly indistinguishable during the compare functions (we found only a small handful of examples of comparators doing changes to array along the way), it is enough to randomize the range before the actual call to std::sort. You can figure out the probabilities and prove it is enough on your own.

We decided to submit such functionality into LLVM under debug mode20 for calls std::sort, std::nth_element, std::partial_sort.

Learn more about bidirectional Unicode characters

Seeding techniques

We used ASLR (address space layout randomization)21 technique for seeding the random number generator, meaning static variables will be in random addresses upon the start of the program and we can use it as a seed. This provides the same stability guarantee within a run but not through different runs, for example, for tests to become flaky and eventually be seen as broken. For platforms which do not support ASLR, the seed is fixed during build. Using other techniques from header was not possible as header recursively depended on and in such a low level library, we implemented a very simple linear generator.

Learn more about bidirectional Unicode characters

This randomization was enabled in a debug build mode as performance penalty might be significant for shuffling for all cases.

Partial vs nth danger

Also if you look closely at the randomization changes above, you may notice some difference between std::nth_element and std::partial_sort. That can be misleading.

std::partial_sort and std::nth_element have a difference in the meaning of their parameters that is easy to get confused. Both take 3 iterators:

begin – the beginning of the rangenth or middle – the meaning (and name) of this parameter differs between these functionsend – the end of the range

For std::partial_sort, the middle parameter is called middle, and points right after the part of the range that should end up sorted. That means you have no idea which element middle will point to – you only know that it will be one of the elements that didn’t need to be sorted.

For std::nth_element, this middle parameter is nth. It points to the only element that will be sorted. For all of the elements in [begin, nth) you only know that they’ll be less than or equal to *nth, but you don’t know what order they’ll be in.

That means that if you want to find the 10th smallest element of a container, you have to call these functions a bit differently:

Learn more about bidirectional Unicode characters

In the end, after dozens of runs of all tests at Google and with the help of a strong prevailing wind of randomness, we measured a couple of thousands of tests to be dependent on the stability of sorting and selection algorithms. As we also planned on updating sorting algorithms, this effort helped doing it gradually and sustainably.

All in all, it took us around a year to fix all of them.

Which failures will you probably discover?

Goldens

First of all, we, of course, discovered numerous failures regarding golden tests described above, that’s inevitable. From open source, you can try to look at ClickHouse22, 23, they also decided to introduce randomness described above.

Typical golden test updates

Most changes will look like this by adjusting the right ordering and updating golden tests.

Unfortunately, golden tests might be quite sensitive to production workloads, for example, during streaming engine rollout — what if some instances produce slightly different results for the same shard? Or what if some compression algorithm by accident uses std::sort and compares the checksum from another service which hasn’t updated its implementation? That might cause checksum mismatch, higher error rate, users suffering and even data loss, and you cannot easily swap the algorithm right away as it can break production workloads in unusual ways. Hyrum’s Law at its best and worst. For example, we needed to inject in a couple of places old implementations to allow teams to migrate.

Oh, crap, determinism

Some other changes might require a transition from std::sort to std::stable_sort if determinism is required. We recommend writing a comment on why this is important as stable_sort guarantees that equal elements will be in the same order as before the sort.

Side note: defaults in other languages are different and that’s probably good

In many languages24, including Python, Java, Rust, sort() is stable by default and, if being honest, that’s a much better engineering decision, in my opinion. For example, Rust has .sort_unstable() which does not have stability guarantees but explicitly tells what it does. However, C++ has a different priority, or, you may say, direction, i.e. usages of something should not do more than requested (a.k.a “You don’t pay for what you don’t use“). From our benchmarks std::stable_sort was 10-15% slower than std::sort, and it allocated linear memory. For C++ code that was quite critical given performance benefits. I like to think sometimes that Rust assumes more restrictive defaults with possibilities to relax them whereas C++ assumes less restrictive defaults with possibilities to tighten them.

Logical Bugs

We found several places where users invoked undefined behavior or made inefficiencies. Let’s get them from less to more important.

Sorting of binary data

If you compare by a boolean variable, for example, partition data by existence of something, it’s very tempting to write std::sort call.

Learn more about bidirectional Unicode characters

However, for compare functions that compare only by boolean variables, we have much faster linear algorithms, named std::partition and for stable version, std::stable_partition.

Learn more about bidirectional Unicode characters

Even though modern algorithms do a good job in detection of cardinality, try to prefer std::partition at least for readability issues.

Sorting more than needed

We saw a pattern of sort+resize a lot.

Learn more about bidirectional Unicode characters

You can work out from the code above that although each element must be inspected, sorting the whole of ‘vector’ (beyond the -th element) is not necessary. The compiler likely cannot optimize it away.

Learn more about bidirectional Unicode characters

Unfortunately, there is no stable std::partial_sort analogue, so fix a comparator if the determinism is required.

C++ is hard

If you have a mismatched type in a comparator, C++ will not warn you even with -Weverything. In the picture below zero warnings have been produced when sorting a vector of floats with std::greater comparator.

Not following strict weak ordering

When you call any of the ordering functions in C++ including std::sort, compare functions much comply with the strict weak ordering which formally means the following:

All these conditions make sense and algorithms actually use all those for optimization purposes. First 3 conditions set strict partial order, the 4th one is introducing equivalence relations on incomparable elements.

As you might imagine, we faced the violation of all conditions. In order to demonstrate those, I will post screenshots below where I found them through Github codesearch (https://cs.github.com). I promise I haven’t tried much to find bugs. The biggest emphasis is that violations do happen. After them we will discuss how they can be exploited

Violation of irreflexivity and asymmetry

This is a slideshow, look through it.

All of them violate irreflexivity, comp(x, x) returns true. You may say this might not be used in practice, however, we learned a tough lesson that even testing does not always help.

30 vs 31 elements. Happy execution vs SIGSEGV

You may remember that up to 30 elements for trivial types (and 6 for non-trivial), LLVM/libcxx sort uses insertion sort and after that it bails out to quicksort. Well, if you submit a comparator where conditions for irreflexivity or asymmetry are not met, you will find that with 31 elements the program might get into segfault whereas with 30 elements it works just fine. Consider this example, we want to move all negative elements to the right and sort all positive, same as some examples above.

https://gcc.godbolt.org/z/17r76q7eo

We saw users writing tests for small ranges, however, when the number of elements grows, std::sort can result in SIGSEGV, and this may slip during testing, and be an interesting attack vector to kill the program.

This is used in the implementation of libcxx to wait for some condition to be false knowing we will at some point compare two equal elements:

Learn more about bidirectional Unicode characters

https://github.com/llvm/llvm-project/blob/34a68037ddb4dff972c5d8c599cf5edf08fadf6b/libcxx/include/__algorithm/sort.h#L451

Violation of transitivity of incomparability

https://webrtc-review.googlesource.com/c/src/+/251681

You can construct an example where the 4 condition is violated

The worst that can happen in that case with the current implementation that the elements would not be sorted (not segfaults, although this needs a proof but the article is already too big for that), check out:

https://gcc.godbolt.org/z/c71qzM97f

And again, if you use fewer than 7 elements, insertion sort is used, and you will not construct a counter-example where std::is_sorted is not working. Even though, on paper this is undefined behavior, this is very hard to detect by sanitizers or tests, and in reality it passes simple cases.

Honestly, this snippet can be as simple as:

Learn more about bidirectional Unicode characters

Why? As doubles/floats can be NaN which means x

Are they actually faster? Well, in 18->17 case it is not always like that because the removed mov is greatly pipelined. It is still less work for the processor frontend to decode the instruction but you are not likely to see anything in the benchmarks. For 46->43 the situation is the same.

Sort3 18 vs 17 instructions https://quick-bench.com/q/0THJhOe5q2-vQsEy4v5DrAFGzBc. For Sort5 46 vs 43, see https://quick-bench.com/q/NXRFUJdHUoG6KMFSQ9B3fiVSKZ4

LLVM review claimed to save around 2-3% for integer benchmarks but they all mostly come from the transition from branchy version to a branchless one. Instruction reduction does not help much but anyway is a good story how machine learning can help driving compiler optimizations in such primitives as sorting networks together with assembly generation. I highly recommend reading the patch for all C++ lovers, it has lots of lovely ideas and sophisticated discussions on how to make it right31.

How can you help?

There are many things that are not yet done. Here is by no means an exhaustive list of things you can help with:

In debug modes introduce randomization described above.This can be done for the GNU library, Microsoft STL, other languages like Rust, D, etc.Introduce strict weak ordering debug checks in all functions that require it.std::sort, std::partial_sort, std::nth_element, std::min_element, std::max_element, std::stable_sort, others, in all C++ standard libraries. In all other languages like Rust, Java, Python, D, etc. As we said, checking at most 20 elements per call seems to be ok. You can also introduce sampling if needed.In your C++ project try to introduce a debug mode which sets _LIBCPP_DEBUG to some level27.Consider randomization for the APIs that can be relied on at least in testing/debug mode. Seeding the hash function differently for not relying on the order of iteration of hashtables. If the function requires to be only associative, try to accumulate results in different order, etc.Fix worst case std::nth_element in all standard library implementations.Optimize assembly generation for sorts (small, big, partitions) even further. As you can see, there is room for optimizations there as well!

Final thoughts

We started this process more than a year ago (of course, not full time), and the first primary goal was performance. However, it turned out to be a much more sophisticated issue. We found several hundred bugs (including pretty critical ones). In the end, we figured out a way to prevent bugs from happening in the future which will help us to adopt any correct implementation and, for example, see wins right away without being blocked by broken tests. We suggest if your codebase is huge, adopt the build flag from libcxx and prepare yourself for the migration. Most importantly, this effort produced a story on how to change even simplest things at scale, how to fight Hyrum’s Law, and I am glad to be a part of the effort to help open source learn from it.

Acknowledgements

Thanks to Nilay Vaish who pushed the changes for a new sort to LLVM, thanks to Louis Dionne, the maintainer of libcxx, who patiently accepted our changes. Thanks to Morwenn for outstanding work on sorting from which I learned a lot5. Thanks to Orson Peters and pdqsort which greatly improved the performance of modern in-memory sorting.

The famous textbook from Cormen and others Introduction to Algorithms devotes over 50 pages to sorting algorithms.Ani Kristo, Kapil Vaidya, Ugur Çetintemel, Sanchit Misra, and Tim Kraska. 2020. The Case for a Learned Sorting Algorithm. SIGMOD ’20.Hackernews: “I think it’s fair to say that pdqsort (pattern-defeating quicksort) is overall the best”In-place Parallel Super Scalar Samplesort (IPS4o)https://github.com/Morwenn/cpp-sort — collection of most fun and efficient sorting algorithms written in C++libc++: A Standard Library for C++0x, Howard Hinnant, 2010 LLVM Developers’ MeetingOne of the most popular sorting calls at Google ListFieldsMayFailOnStrippedH. Mannila, “Measures of Presortedness and Optimal Sorting Algorithms,” in IEEE Transactions on Computers, vol. C-34, no. 4, pp. 318-325, April 1985, doi: 10.1109/TC.1985.5009382.https://en.wikipedia.org/wiki/C%2B%2B_Standard_LibraryOriginal std::sort implementation by Stepanov and LeeAntiquicksort: https://www.cs.dartmouth.edu/~doug/aqsort.cQuickselectnth_element worst case fallback in GNU libstdc++Fast Deterministic Selection by Andrei AlexandrescuIntrosortMedian of medians[libc++] Introsort based sorting function (2017, unsubmitted)libc++ has quadratic std::sort (reddit discussion)Add introsort to avoid O(n^2) behavior and a benchmark for adversarial quick sort input (submitted).[libcxx][RFC] Unspecified behavior randomization in libcxxASLR (address space layout randomization)Sort added equal items ranges randomization (ClickHouse)bitsetsort peformance check (ClickHouse)Sort stability defaults in different programming languagesAll found open source issues (slides with repo names, commits and files)Inline variables in C++Debug mode in libcxxStefan Edelkamp and Armin Weiß. 2019. BlockQuicksort: Avoiding Branch Mispredictions in Quicksort. ACM J. Exp. Algorithmics 24, Article 1.4 (2019), 22 pages. DOI:https://doi.org/10.1145/3274660.Kaligosi, K., & Sanders, P. (2006). How Branch Mispredictions Affect Quicksort. Algorithms – ESA 2006, 780–791. doi:10.1007/11841036_69Proving 50-Year-Old Sorting Networks OptimalIntroduce branchless sorting functions for sort3, sort4 and sort5 (LLVM review)

We are changing std::sort in libcxx. That’s a long story of what it took us to get there, all bugs you might encounter with many examples from OSS. We submitted almost all changes to upstream and thus I can freely talk about them and how you can use themhttps://t.co/3MzIJQD4wP

— Danila Kutenin (@Danlark1) April 20, 2022

Read More

Share this on knowasiak.com to discuss with people on this topic__Sign up on Knowasiak.com now if you’re not registered yet.__