Large-scale, semi-automated Go GC tuning

How We Saved 70K Cores Across 30 Mission-Critical Services (Large-Scale, Semi-Automated Go GC Tuning @Uber)

As part of Uber Engineering’s company-wide efforts to reach profitability, our team recently focused on reducing the cost of compute capacity by improving efficiency. Some of the most impactful work was around GOGC optimization. In this blog post, we share our experience with a highly effective, low-risk, large-scale, semi-automated Go GC tuning mechanism.

Uber’s tech stack is composed of thousands of microservices, backed by a cloud-native, scheduler-based infrastructure. Most of these services are written in Go. Our team, Maps Production Engineering, has previously played an instrumental role in significantly improving the efficiency of multiple Java services by tuning GC. At the beginning of 2021, we explored the possibilities of having a similar impact on Go-based services. We ran several CPU profiles to assess the current state of affairs and we found that GC was the top CPU consumer for a vast majority of mission-critical services. Here is a representation of some CPU profiles where GC (identified by the runtime.scanobject method) is consuming a significant portion of allocated compute resources.

Service #1

Figure 1: GC CPU cost of Example Service #1

Service #2

Figure 2: GC CPU cost of Example Service #2

Emboldened by this finding, we set out to tune GC for the relevant services. To our delight, Go’s GC implementation and the simplicity of tuning allowed us to automate the bulk of the detection and tuning mechanism. We detail our approach and its impact in the following sections.

GOGC Tuner 

The Go runtime invokes a concurrent garbage collector at periodic intervals unless a triggering event occurs first. The triggering events are based on memory back pressure. Because of this, GC-impacted Go services benefit from more memory, since it reduces the number of times GC has to run. In addition, we realized that our host-level CPU-to-memory ratio is 1:5 (1 core : 5 GB RAM), while most Golang services were configured with a 1:1 to 1:2 ratio. Therefore, we were confident that we could use more memory to reduce GC CPU impact. This is a service-agnostic mechanism that can yield a large impact when applied judiciously.

Delving deep into Go’s garbage collection is beyond the scope of this article, but here are the relevant bits for this work: Garbage collection in Go is concurrent and involves analyzing all objects to identify which ones are still reachable. We would call the reachable objects the “live dataset.” Go offers only one knob, GOGC, expressed in percentage of live dataset, to control garbage collection. The GOGC value acts as a multiplier for the dataset. The default value of GOGC is 100%, which means Go runtime will reserve the same amount of memory for new allocations as the live dataset. For instance:

hard_target = live_dataset + live_dataset * (GOGC / 100).

The pacer is then in charge of predicting the best time to trigger GC (the soft target) so that the heap does not reach the hard target.
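To make the formula concrete, here is a minimal sketch that evaluates it for a 150MB live dataset at a few GOGC values. The function name hardTarget is only illustrative and is not part of the Go runtime; the sketch also ignores the pacer and simply applies the arithmetic above.

```go
package main

import "fmt"

// hardTarget applies the formula from the text:
// live_dataset + live_dataset * (GOGC / 100).
// It is an illustrative helper, not a Go runtime API.
func hardTarget(liveDataset uint64, gogc int) uint64 {
	return liveDataset + liveDataset*uint64(gogc)/100
}

func main() {
	const mb = 1 << 20
	for _, gogc := range []int{50, 100, 200} {
		fmt.Printf("GOGC=%d live=150MB hard target=%dMB\n",
			gogc, hardTarget(150*mb, gogc)/mb)
	}
	// Prints:
	// GOGC=50 live=150MB hard target=225MB
	// GOGC=100 live=150MB hard target=300MB
	// GOGC=200 live=150MB hard target=450MB
}
```

With the default GOGC of 100, the heap is allowed to grow to twice the live dataset before a collection has to finish, which is why a small live dataset leads to frequent collections.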

Figure 3: Example heap with default configuration.

Dynamic and Diverse: One Size Does Not Fit All

We identified that a fixed GOGC value-based tuning is not suitable for services at Uber. Some of the challenges are:

  • It is not aware of the maximum memory assigned to the container and can cause out of memory issues.
  • Our microservices have a significantly diverse memory utilization portfolio. For example, a sharded system can have very different live datasets across instances. We experienced this in one of our services, where the p99 utilization was 1GB but the p1 was 100MB; as a result, the 100MB instances suffered a huge GC impact.

A Case for Automation

The pain points previously presented are the reason for the conception of GOGCTuner. GOGCTuner is a library which simplifies the process of tuning garbage collection for service owners and adds a reliability layer on top of it.

GOGCTuner dynamically computes the correct GOGC value in accordance with the container’s memory limit (or the upper limit from the service owner) and sets it using Go’s runtime API. Following are the specifics of the GOGCTuner library’s features:

  • Simplified configuration for easier reasoning and deterministic calculations. GOGC at 100% is not clear for beginner Go developers and it is not deterministic, because it still depends on the live dataset. On the other hand a 70% limit ensures that the service is always going to use 70% of the heap space.
  • Protection against OOMs (Out Of Memory): The library reads the memory limit from the cgroup and uses a default hard limit of 70%, a safe value from our experience.
        • It is important to note that there is a limit to this protection. The tuner can only adjust the buffer allocation, so if your service’s live objects exceed the limit, the tuner falls back to a default lower bound of 1.25X your live object utilization.
  • Allow higher GOGC values for corner cases like:
        • As we mentioned above, manual GOGC is not deterministic. We are still relying on the size of the live dataset. What if live_dataset doubles our last peak value? GOGCTuner would enforce the same memory limit at the cost of more CPU. Manual tuning instead could cause OOMs. Therefore, service owners used to give plenty of buffer for these types of scenarios. See the example below:

Normal traffic (live dataset is 150M)

Figure 4: Normal operation. Default configuration on the left, manually tuned on the right.

Traffic increased 2X (live dataset is 300M)

Figure 5: Double the load. Default configuration on the left, manually tuned on the right.

Traffic increased 2X with GOGCTuner at 70% (live dataset is 300M)

Figure 6: Double the load, but using the tuner. Default configuration on the left, GOGCTuner tuned on the right.
  • Services using MADV_FREE memory policy that results in wrong memory metrics. For instance, our observability metrics were showing 50% memory utilization (when it actually had already released 20% of that 50%). Then service owners were only tuning GOGC using this “inaccurate” metric.
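The calculation described in this section can be sketched roughly as follows. This is not the GOGCTuner source: computeGOGC, limitPercent, and minGOGC are hypothetical names, the memory limit is passed in directly rather than read from the cgroup, and the 1.25X floor is modeled as a minimum GOGC of 25. Only debug.SetGCPercent is a real Go runtime API.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

const (
	limitPercent = 70 // assumed default: use up to 70% of the memory limit
	minGOGC      = 25 // assumed floor: heap may still grow to 1.25X live data
)

// computeGOGC is a hypothetical helper that derives a GOGC value so that
// live_dataset * (1 + GOGC/100) stays at or below limitPercent of memLimit.
func computeGOGC(memLimit, liveDataset uint64) int {
	if liveDataset == 0 {
		return 100 // nothing measured yet; keep Go's default
	}
	target := memLimit * limitPercent / 100
	if target <= liveDataset {
		return minGOGC // live data alone exceeds the target; use the floor
	}
	gogc := int((target - liveDataset) * 100 / liveDataset)
	if gogc < minGOGC {
		return minGOGC
	}
	return gogc
}

func main() {
	const mb = 1 << 20
	memLimit := uint64(1000 * mb) // assumed container limit for illustration
	for _, live := range []uint64{100 * mb, 150 * mb, 300 * mb, 800 * mb} {
		fmt.Printf("live=%dMB -> GOGC=%d\n", live/mb, computeGOGC(memLimit, live))
	}

	// Apply a computed value through the runtime API.
	previous := debug.SetGCPercent(computeGOGC(memLimit, 150*mb))
	fmt.Println("previous GOGC:", previous)
}
```

Note how the computed GOGC drops as the live dataset grows: the memory ceiling stays fixed and the cost shows up as more frequent collections, which matches the doubled-traffic scenario above.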

Observability

We found that we lacked some critical metrics which would give us more insights into garbage collection of each service.

  • Intervals between garbage collections: useful to know if we can still tune. For instance, Go forces a garbage collection every 2 minutes. If your service is still having high GC impact, but you already see 120s for this graph, it means that you can no longer tune using GOGC. In this case you would need to optimize your allocations.
Figure 7: Graph for intervals between GCs.
  • GC CPU impact: allows us to know which services are the most affected by GC.
Figure 8: Graph for p99 GC CPU cost.
  • Live dataset size: helps us to identify memory leaks. The concern noted by service owners was that they saw an increase in memory utilization. In order to show them there was no memory leak we added the “live usage” metric, which showed a steady utilization.
Figure 9: Graph for estimated p99 live dataset.
  • GOGC value: useful to know how the tuner is reacting.
Figure 10: Graph for min, p50, p99 GOGC value assigned to the application by the tuner.

Implementation

Our initial approach was to run a ticker every second to monitor the heap metrics and then adjust the GOGC value accordingly. This approach has two disadvantages. The overhead becomes considerable, because reading heap metrics (ReadMemStats) requires Go to stop the world (STW). It is also somewhat inaccurate, because more than one garbage collection can happen per second.

Luckily we were able to find a good alternative. Go has finalizers (SetFinalizer), which are functions that run when the object is going to be garbage collected. They are mainly useful for cleaning memory in C code or some other resources. We were able to employ a self-referencing finalizer that resets itself on every GC invocation. This allows us to reduce any CPU overhead. For instance:

Figure 11: Example code for GC triggered events.

Calling runtime.SetFinalizer(f, finalizerHandler) inside finalizerHandler is what allows the handler to run on every GC; it effectively never lets the reference die. This is acceptable because the reference is not a costly resource to keep alive (it is just a pointer).

Impact

After deploying GOGCTuner across a few dozen of our services, we dove deep on a few that showed significant, double-digit improvement in their CPU utilization. Accumulated cost savings from these services alone are around 70K cores. Following are 2 such examples:

Figure 12: Observability service that operates on thousands of compute cores with high standard deviation for live_dataset (max value was 10X of the lowest value), showed ~65% reduction in p99 CPU utilization.
Figure 13: Mission-critical Uber Eats service that operates on thousands of compute cores, showed ~30% reduction in p99 CPU utilization.

The resulting CPU utilization reduction improves p99 latency (and associated SLA, user experience) tactically, and cost of capacity strategically (since services are scaled based on their utilization). 

Garbage collection is one of the most elusive and underestimated performance influencers of an application. Go’s robust GC mechanism and simplified tuning, our diverse, large-scale Go services footprint, and a robust internal platform (Go, compute, observability) collectively allowed us to make such a large-scale impact. We expect to continue improving how we tune GC as the problem space itself is evolving, due to changes in the tech and our competency.

To reiterate what we mentioned in the introduction: there is no one-size-fits-all solution. We believe GC performance will remain variable in cloud-native setups due to the highly variable performance of both public clouds and the containerized workloads that run within them. Coupled with the fact that a vast majority of the CNCF landscape projects that we use are written in Golang (Kubernetes, Prometheus, Jaeger, etc.), this means any large-scale deployment outside Uber could also benefit from such an effort.

2 Comments

  1. The fact you can make such savings tweaking the GC, suggests the applications themselves could be improved not to produce so much garbage in the first place.

    A win is a win, and it's still a very nice saving for barely touching the application code.

  2. I've written my own Go (subset + extensions) -> C++ transpiler and using it on a game project: https://www.youtube.com/watch?v=8He97Sl9iy0 — No GC, it does have slices and has access to an entity/component API and with that I think you're basically set and don't need GC for games.

    Example transpiler input / output: https://github.com/nikki93/gx/blob/master/example/main.gx.go… becomes https://gist.github.com/nikki93/97ff376abb6718427387bb9cca2f… Can call to C/C++ (including templates) w/o overhead.

    That said, for logic that is heavy on async and escaping closures like how a lot of Go server code tends to be, a GC is maybe a reasonable tradeoff?