Optimizing the kernel to saturate a 100Gbps link (2017)

By Drew Gallatin

Netflix Technology Blog

In the summer of 2015, the Netflix Open Connect CDN team decided to take on an ambitious project. The goal was to leverage the new 100GbE network interface technology just coming to market to be able to serve at 100 Gbps from a single FreeBSD-based Open Connect Appliance (OCA) using NVM Express (NVMe)-based storage.

At the time, the bulk of our flash storage-based appliances were close to being CPU limited while serving at 40 Gbps using a single-socket Xeon E5-2697v2. The first step was to find the CPU bottlenecks in the existing platform while we waited for newer CPUs from Intel, for newer motherboards with PCIe Gen3 x16 slots that could run the new Mellanox 100GbE NICs at full speed, and for systems with NVMe drives.

Normally, most of an OCA's content is served from disk, with only 10–20% of the most popular titles being served from memory (see our previous blog post, Content Popularity for Open Connect, for details). However, our early pre-NVMe prototypes were limited by disk bandwidth. So we set up a contrived experiment where we served only the very most popular content on a test server. This allowed all content to fit in RAM and therefore avoid the temporary disk bottleneck. Surprisingly, the performance actually dropped from being CPU limited at 40 Gbps to being CPU limited at only 22 Gbps!

After doing some very basic profiling with pmcstat and flame graphs, we suspected that we had a problem with lock contention. So we ran the DTrace-based lockstat lock profiling tool that is provided with FreeBSD. Lockstat told us that we were spending most of our CPU time waiting for the lock on FreeBSD's inactive page queue. Why was this happening? And why did it get worse when serving only from memory?

A Netflix OCA serves large media files using NGINX via the asynchronous sendfile() system call. (See NGINX and Netflix Contribute New sendfile(2) to FreeBSD.) The sendfile() system call fetches the content from disk (unless it is already in memory) one 4 KB page at a time, wraps it in a network memory buffer (mbuf), and passes it to the network stack for optional encryption and transmission via TCP. After the network stack releases the mbuf, a callback into the VM system causes the 4K page to be released. When the page is released, it is either freed into the free page pool, or inserted into a list of pages that are likely to be needed again, called the inactive queue. Because we were serving entirely from memory, NGINX was advising sendfile() that nearly all of the pages would be needed again — so almost every page on the system went through the inactive queue.

The problem here is that the inactive queue is structured as a single list per non-uniform memory (NUMA) domain, and is protected by a single mutex lock. By serving everything from memory, we moved a large percentage of the page release activity from the free page pool (where we already had a per-CPU free page cache, thanks to earlier work by Netflix's Randall Stewart and Scott Long, Jeff Roberson's team at Isilon, and Matt Macy) to the inactive queue. The obvious fix would have been to add a per-CPU inactive page cache, but the system still needs to be able to find a page when it needs it again. Pages are hashed to the per-NUMA queues in a predictable way.

The best solution we came up with is what we call "Fake NUMA". This approach takes advantage of the fact that there is one set of page queues per NUMA domain. All we had to do was lie to the system and tell it that we have one Fake NUMA domain for every 2 CPUs. After we did this, our lock contention nearly disappeared and we were able to serve at 52 Gbps (limited by the PCIe Gen3 x8 slot) with substantial CPU idle time.
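As a toy illustration of why this helps, here is a minimal sketch (all names are hypothetical, and a counter stands in for the real mutex) of one inactive queue per fake domain of 2 CPUs, with pages hashed to domains in a predictable way so they can always be found again:

```c
#include <assert.h>

#define NCPU 16
#define CPUS_PER_FAKE_DOMAIN 2
#define NDOMAINS (NCPU / CPUS_PER_FAKE_DOMAIN)

/* One inactive queue (and therefore one lock) per fake NUMA domain. */
struct page_queue { long lock_acquisitions; };
static struct page_queue inactive_queue[NDOMAINS];

/* Pages hash to a domain by physical page number, so a page can
 * always be located again in the same queue. */
static int domain_for_page(unsigned long pfn) {
    return (int)(pfn % NDOMAINS);
}

static void release_page(unsigned long pfn) {
    struct page_queue *q = &inactive_queue[domain_for_page(pfn)];
    /* Stands in for: mtx_lock(&q->lock); enqueue; mtx_unlock(&q->lock); */
    q->lock_acquisitions++;
}
```

With 8 fake domains instead of 1, contention on any single queue lock drops by roughly 8x, while lookup stays deterministic.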

Once we had newer prototype machines, with an Intel Xeon E5-2697v3 CPU, PCIe Gen3 x16 slots for the 100GbE NIC, and more disk storage (4 NVMe or 44 SATA SSD drives), we hit another bottleneck, also related to a lock on a global list. We were stuck at around 60 Gbps on this new hardware, and we were constrained by pbufs.

FreeBSD uses a "buf" structure to manage disk I/O. Bufs that are used by the paging system are statically allocated at boot time and kept on a global linked list that is protected by a single mutex. This was done long ago, for several reasons, mainly to avoid needing to allocate memory when the system is already low on memory and trying to page or swap things out in order to free memory. Our problem is that the sendfile() system call uses the VM paging system to read files from disk when they are not resident in memory. Therefore, all of our disk I/O was constrained by the pbuf mutex.

Our first problem was that the list was too small. We were spending quite a lot of time waiting for pbufs. This was easily fixed by increasing the number of pbufs allocated at boot time via the kern.nswbuf tunable. However, this change revealed the next problem, which was lock contention on the global pbuf mutex. To solve this, we changed the vnode pager (which handles paging to files, rather than to the swap partition, and hence handles all sendfile() I/O) to use the normal kernel zone allocator. This change removed the lock contention, and boosted our performance into the 70 Gbps range.
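Enlarging the pool is a one-line boot-time change. A hypothetical /boot/loader.conf entry might look like the following — the value shown is illustrative only; the appropriate number depends on the workload, and the article does not say what value we chose:

```
# /boot/loader.conf
# Enlarge the statically allocated pbuf pool (illustrative value).
kern.nswbuf="4096"
```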

As noted above, we make heavy use of the VM page queues, especially the inactive queue. Eventually, the system runs short of memory and these queues need to be scanned by the page daemon to free up memory. At full load, this was happening roughly twice per minute. When this happened, all NGINX processes would go to sleep in vm_wait() and the system would stop serving traffic while the pageout daemon worked to scan pages, often for several seconds. This had a severe impact on key metrics that we use to determine an OCA's health, especially NGINX serving latency.

The basic system health could be expressed as follows (I wish this was a cartoon):


Time1: 15GB memory free. Everything is fine. Serving at 80 Gbps.

Time2: 10GB memory free. Everything is fine. Serving at 80 Gbps.

Time3: 05GB memory free. Everything is fine. Serving at 80 Gbps.

Time4: 00GB memory free. OH MY GOD!!! We're all gonna DIE!! Serving at 0 Gbps.

Time5: 15GB memory free. Everything is fine. Serving at 80 Gbps.


This problem is actually made progressively worse as one adds NUMA domains, because there is one pageout daemon per NUMA domain, but the page deficit that it is trying to clear is calculated globally. So if the vm pageout daemon decides to clear, say, 1GB of memory and there are 16 domains, each of the 16 pageout daemons will individually attempt to clear 1GB of memory.

To solve this problem, we decided to proactively scan the VM page queues. In the sendfile path, when allocating a page for I/O, we run the pageout code several times per second on each VM domain. The pageout code is run in its lightest-weight mode, in the context of one unlucky NGINX process. Other NGINX processes continue to run and serve traffic while this is happening, so we can avoid bursts of pager activity that block traffic serving. Proactive scanning allowed us to serve at roughly 80 Gbps on the prototype hardware.

TCP Large Receive Offload (LRO) is the technique of combining several packets received for the same TCP connection into a single large packet. This technique reduces system load by reducing trips through the network stack. The effectiveness of LRO is measured by the aggregation rate. For example, if we can receive four packets and combine them into one, then our LRO aggregation rate is 4 packets per aggregation.

The FreeBSD LRO code will, by default, manage up to eight packet aggregations at one time. This works quite well on a LAN, when serving traffic over a small number of very fast connections. However, we have tens of thousands of active TCP connections on our 100GbE machines, so our aggregation rate was rarely better than 1.1 packets per aggregation on average.

Hans Petter Selasky, Mellanox's 100GbE driver developer, came up with an innovative solution to our problem. Modern NICs provide a Receive Side Scaling (RSS) hash result to the host. RSS is a standard developed by Microsoft wherein TCP/IP traffic is hashed by source and destination IP address and/or TCP source and destination ports. The RSS hash result will almost always uniquely identify a TCP connection. Hans' idea was that rather than just passing the packets to the LRO engine as they arrive from the network, we should receive the packets in a large batch, and then sort the batch of packets by RSS hash result (and original time of arrival, to keep them in order). After the packets are sorted, packets from the same connection are adjacent even when they arrive widely separated in time. Therefore, when the packets are passed to the FreeBSD LRO routine, it can aggregate them.

With this new LRO code, we were able to achieve an LRO aggregation rate of over 2 packets per aggregation, and were able to serve at well over 90 Gbps for the first time on our prototype hardware, for mostly unencrypted traffic.

An RX queue containing 1024 packets from 256 connections would have 4 packets from the same connection in the ring, but the LRO engine would not be able to recognize that those packets belonged together, because it maintains just a handful of aggregations at once. After sorting by RSS hash, the packets from the same connection appear adjacent in the queue, and can be fully aggregated by the LRO engine.
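The batching-and-sorting step can be sketched as follows. The descriptor layout and function names are hypothetical, and a group counter stands in for the real LRO engine:

```c
#include <stdlib.h>

/* Hypothetical received-packet descriptor: the RSS hash identifies the
 * connection; seq preserves arrival order within a connection. */
struct rx_pkt { unsigned int rss_hash; unsigned int seq; };

/* Sort by (rss_hash, arrival order) so same-connection packets
 * become adjacent while staying in order. */
static int rx_cmp(const void *a, const void *b) {
    const struct rx_pkt *p = a, *q = b;
    if (p->rss_hash != q->rss_hash)
        return p->rss_hash < q->rss_hash ? -1 : 1;
    return p->seq < q->seq ? -1 : (p->seq > q->seq);
}

/* Sort a batch, then count how many aggregated "super-packets" an
 * LRO engine walking the sorted batch would emit. */
static int lro_aggregations_after_sort(struct rx_pkt *batch, int n) {
    qsort(batch, n, sizeof(batch[0]), rx_cmp);
    int groups = (n > 0);
    for (int i = 1; i < n; i++)
        if (batch[i].rss_hash != batch[i - 1].rss_hash)
            groups++;
    return groups;
}
```

Eight interleaved packets from two connections collapse into two aggregations (4 packets per aggregation) instead of eight, without the engine needing to track many connections at once.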

So the job was done. Or was it? The next goal was to achieve 100 Gbps while serving only TLS-encrypted streams.

By this point, we were using hardware which closely resembles today's 100GbE flash storage-based OCAs: four NVMe PCIe Gen3 x4 drives, 100GbE ethernet, and a Xeon E5-2697A v4 CPU. With the improvements described in the Protecting Netflix Viewing Privacy at Scale blog entry, we were able to serve TLS-only traffic at roughly 58 Gbps.

In the lock contention problems we'd seen above, the cause of any increased CPU use was fairly obvious from normal system-level tools like flame graphs, DTrace, or lockstat. The 58 Gbps limit was comparatively strange. As before, CPU use would increase linearly as we approached the 58 Gbps limit, but then, as we neared the limit, CPU use would increase almost exponentially. Flame graphs just showed everything taking longer, with no clear hotspots.

We finally had a hunch that we were limited by our system's memory bandwidth. We used the Intel® Performance Counter Monitor tools to measure the memory bandwidth we were consuming at peak load. We then wrote a simple memory thrashing benchmark that used one thread per core to copy between large memory chunks that did not fit into cache. According to the PCM tools, this benchmark consumed the same amount of memory bandwidth as our OCA's TLS-serving workload. So it was clear that we were memory limited.

At this point, we became focused on reducing memory bandwidth usage. To assist with this, we began using the Intel VTune profiling tools to identify memory loads and stores, and to identify cache misses.

Because we use sendfile() to serve data, encryption is done from the virtual memory page cache into connection-specific encryption buffers. This preserves the normal FreeBSD page cache, allowing hot data to be served from memory to many connections. One of the first things that stood out to us was that the ISA-L encryption library was using half again as much memory bandwidth for memory reads as it was for memory writes. From the VTune profiling data, we saw that ISA-L was somehow reading both the source and destination buffers, rather than just writing to the destination buffer.

We realized that this was because the AVX instructions used by ISA-L for encryption on our CPUs worked on 256-bit (32-byte) quantities, whereas the cache line size was 512 bits (64 bytes) — thus triggering the system to do read-modify-writes when data was written. The problem is that the CPU normally accesses the memory system in 64-byte, cache line-sized chunks, reading an entire 64 bytes to access even just a single byte. In this case, the CPU wanted to write 32 bytes of a cache line, but using read-modify-writes to handle these writes meant that it was reading the entire 64-byte cache line in order to write the first 32 bytes. This was especially unfortunate, since the very next thing to happen would be that the second half of that cache line would be written.

After a brief email exchange with the ISA-L team, they provided us with a new version of the library that used non-temporal instructions when storing encryption results. Non-temporal stores bypass the cache, allowing the CPU direct access to memory. This meant that the CPU was no longer reading from the destination buffers, and this increased our bandwidth from 58 Gbps to 65 Gbps.
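A back-of-envelope model — an assumption-laden sketch, not a measurement — shows why read-modify-writes inflate read bandwidth and why non-temporal stores eliminate that component:

```c
/* Model: encrypting n bytes reads the plaintext once and writes the
 * ciphertext once.  With 32-byte AVX stores against 64-byte cache
 * lines, the first partial store to each output line triggers a
 * read-modify-write that pulls in the full line, adding one line of
 * reads per line written.  Non-temporal stores skip that line fill. */
struct traffic { unsigned long reads, writes; };

#define CACHE_LINE 64UL

static struct traffic encrypt_traffic(unsigned long n, int non_temporal) {
    struct traffic t;
    t.writes = n;                                   /* ciphertext out */
    t.reads  = n;                                   /* plaintext in  */
    if (!non_temporal)
        t.reads += (n / CACHE_LINE) * CACHE_LINE;   /* RMW line fills */
    return t;
}
```

In this simplified model the destination line fills double the read traffic; the real measured ratio (reads about 1.5x writes) also reflects caching effects the model ignores.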

In parallel with this optimization, the spec for our final production machines was changed from cheaper DDR4-1866 memory to DDR4-2400 memory, the fastest supported memory for our platform. With the faster memory, we were able to serve at 76 Gbps.

We spent quite a lot of time with the VTune profiling data, re-working a large number of core kernel data structures to get better alignment, and using minimally-sized types to represent the possible ranges of data expressed there. Examples of this work include rearranging the fields of kernel structs related to TCP, and re-sizing many of the fields that were originally declared in the 1980s as "longs", which needed to hold 32 bits of data, but which are 64 bits on 64-bit platforms.

Another trick we use is to avoid accessing rarely used cache lines of large structures. For example, FreeBSD's mbuf data structure is extremely flexible, and allows referencing many different types of objects and wrapping them for use by the network stack. One of the biggest sources of cache misses in our profiling was the code to release pages sent by sendfile(). The relevant part of the mbuf data structure looks like this:

```c
struct m_ext {
    volatile u_int *ext_cnt;    /* pointer to ref count info */
    caddr_t      ext_buf;       /* start of buffer */
    uint32_t     ext_size;      /* size of buffer, for ext_free */
    uint32_t     ext_type:8,    /* type of external storage */
                 ext_flags:24;  /* external storage mbuf flags */
    void        (*ext_free)     /* free routine if not the usual */
                    (struct mbuf *, void *, void *);
    void        *ext_arg1;      /* optional argument pointer */
    void        *ext_arg2;      /* optional argument pointer */
};
```


The problem is that ext_arg2 fell in the third cache line of the mbuf, and was the only thing accessed in that cache line. Even worse, in our workload ext_arg2 was almost always NULL. So we were paying to read 64 bytes of data for every 4 KB we sent, where that pointer was NULL nearly all of the time. After failing to shrink the mbuf, we decided to extend ext_flags with enough state in the first cache line of the mbuf to determine whether ext_arg2 was NULL. If it was, then we just passed NULL explicitly, rather than dereferencing ext_arg2 and taking a cache miss. This gained almost 1 Gbps of bandwidth.
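A minimal sketch of the idea — the flag name and layout below are hypothetical, not the actual FreeBSD definitions:

```c
#include <stddef.h>

/* Hypothetical flag recording, in the first cache line, that
 * ext_arg2 is NULL so later cache lines need not be touched on
 * the page-release path. */
#define EXT_FLAG_ARG2_NULL 0x0100

struct m_ext_sketch {
    unsigned int ext_flags;  /* first cache line: hot metadata */
    char pad[120];           /* stand-in for the intervening fields */
    void *ext_arg2;          /* lands in a later cache line */
};

static void *ext_arg2_for_free(const struct m_ext_sketch *ext) {
    if (ext->ext_flags & EXT_FLAG_ARG2_NULL)
        return NULL;         /* common case: no cache miss on ext_arg2 */
    return ext->ext_arg2;    /* rare case: actually read it */
}
```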

VTune and lockstat pointed out a number of oddities in system performance, most of which came from the data collection that is done for monitoring and statistics.

The first example is a metric monitored by our load balancer: TCP connection count. This metric is needed so that the load balancing software can tell whether a system is underloaded or overloaded. The kernel did not export a connection count, but it did provide a way to export all TCP connection data, which allowed user-space tools to calculate the number of connections. This was fine for smaller scale servers, but with tens of thousands of connections, the overhead was noticeable on our 100GbE OCAs. When asked to export the connections, the kernel first took a lock on the TCP connection hash table, copied it to a temporary buffer, dropped the lock, and then copied that buffer out to userspace. Userspace then had to iterate over the table, counting connections. This caused both cache misses (lots of unneeded memory activity) and lock contention on the TCP hash table. The fix was fairly simple. We added per-CPU lockless counters that tracked TCP state changes, and exported a count of connections in each TCP state.
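The per-CPU counter pattern can be sketched as follows (this is an illustration of the technique, not FreeBSD's counter(9) API): each CPU increments its own slot without a lock, and a reader sums the slots. The sum may be momentarily stale, which is fine for a load-balancer health metric:

```c
#define NCPU 8
#define TCP_NSTATES 11   /* CLOSED .. TIME_WAIT */

/* One row per CPU; each CPU touches only its own row, so no lock
 * and no cache-line ping-pong between CPUs. */
static long state_count[NCPU][TCP_NSTATES];

static void tcp_state_change(int cpu, int from, int to) {
    state_count[cpu][from]--;
    state_count[cpu][to]++;
}

/* Reader: sum the per-CPU slots for one state. */
static long connections_in_state(int state) {
    long n = 0;
    for (int cpu = 0; cpu < NCPU; cpu++)
        n += state_count[cpu][state];
    return n;
}
```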

Another example is that we were gathering detailed TCP statistics for every TCP connection. The goal of these statistics is to monitor the quality of clients' sessions. The detailed statistics were fairly expensive, both in terms of cache misses and in terms of CPU. On a fully loaded 100GbE server with many tens of thousands of active connections, the TCP statistics consumed 5–10% of the CPU. The solution to this problem was to keep detailed statistics on only a small percentage of connections. This dropped the CPU used by TCP statistics to below 1%.
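One simple sampling policy (an assumed illustration, not the production heuristic) is to keep detailed stats only when a hash of the connection falls below a threshold, giving a deterministic sample of roughly the desired percentage:

```c
/* Sample ~SAMPLE_PCT% of connections deterministically: a given
 * connection is either always sampled or never sampled, so its
 * detailed statistics stay coherent over its lifetime. */
#define SAMPLE_PCT 2

static int should_sample(unsigned int conn_hash) {
    return conn_hash % 100 < SAMPLE_PCT;
}
```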

These changes resulted in a speedup of 3–5 Gbps.

The FreeBSD mbuf system is the workhorse of the network stack. Every packet that transits the network is composed of one or more mbufs, linked together in a list. The FreeBSD mbuf system is very flexible, and can wrap nearly any external object for use by the network stack. FreeBSD's sendfile() system call, which serves the bulk of our traffic, makes use of this feature by wrapping each 4K page of a media file in an mbuf, each with its own metadata (free function, arguments to the free function, reference count, and so on).

The drawback to this flexibility is that it leads to quite a lot of mbufs being chained together. A single 1 MB HTTP range request going through sendfile can reference 256 VM pages, and each one will be wrapped in an mbuf and chained together. This gets messy quickly.

At 100 Gbps, we're moving about 12.5 GB/s of 4K pages through our system unencrypted. Adding encryption doubles that to 25 GB/s worth of 4K pages. That's about 6.25 million mbufs per second. When you add in the additional 2 mbufs used by the crypto code for TLS metadata at the beginning and end of each TLS record, that works out to another 1.6M mbufs/sec, for a total of about 8M mbufs/sec. With roughly 2 cache line accesses per mbuf, that's 128 bytes × 8M, which is 1 GB/s (8 Gbps) of data that is accessed at multiple layers of the stack (alloc, free, crypto, TCP, socket buffers, drivers, and so on).
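The back-of-envelope arithmetic above can be reproduced directly. The 16 KB TLS record size is an assumption on my part, chosen because it is the TLS maximum and is consistent with the quoted ~1.6M metadata mbufs/sec:

```c
/* Estimate mbuf allocations per second when serving `gbps` of TLS
 * traffic via sendfile: one mbuf per 4K page (plaintext + ciphertext
 * streams), plus 2 metadata mbufs per TLS record. */
static double mbufs_per_sec(double gbps) {
    double bytes_per_sec = gbps / 8.0 * 1e9;     /* 100 Gbps -> 12.5 GB/s */
    double page_stream   = bytes_per_sec * 2.0;  /* plaintext + ciphertext */
    double page_mbufs    = page_stream / 4096.0; /* one mbuf per 4K page   */
    double tls_records   = bytes_per_sec / 16384.0; /* assumed 16 KB records */
    return page_mbufs + 2.0 * tls_records;       /* + header/trailer mbufs */
}
```

At 100 Gbps this lands near 8M mbufs/sec; at roughly 2 cache lines (128 bytes) touched per mbuf, that is on the order of 1 GB/s of metadata traffic before any payload is moved.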

To reduce the number of mbufs in transit, we decided to extend mbufs to allow carrying several pages of the same type in a single mbuf. We designed a new type of mbuf that can carry up to 24 pages for sendfile, and which can also carry the TLS header and trailer data in-line (reducing a TLS record from 6 mbufs down to 1). That change reduced the above 8M mbufs/sec down to less than 1M mbufs/sec, and resulted in a speedup of roughly 7 Gbps.

This was not without some challenges. Most notably, FreeBSD's network stack was designed to assume that it can directly access any part of an mbuf using the mtod() (mbuf to data) macro. Given that we're carrying the pages unmapped, any mtod() access will panic the system. We had to extend a number of functions in the network stack to use accessor functions to access the mbufs, teach the DMA mapping system (busdma) about our new mbuf type, and write several accessors for copying mbufs into uios, and so on. We also had to examine every NIC driver in use at Netflix and verify that it was using busdma for DMA mappings, and not accessing parts of mbufs using mtod(). At this point, we have the new mbufs enabled for most of our fleet, except for a few very old storage platforms that are disk, and not CPU, limited.

At this level, we’re able to reduction 100% TLS web page visitors very effortlessly at 90 Gbps the spend of the default FreeBSD TCP stack. Nonetheless, the goalposts preserve transferring. We’ve found that when we spend extra developed TCP algorithms, comparable to RACK and BBR, we are quiet a tiny brief of our aim. Now we get several ideas that we are for the time being pursuing, which vary from optimizing the new TCP code to increasing the effectivity of LRO to trying to wreck encryption nearer to the switch of the guidelines (both from the disk, or to the NIC) in an effort to steal higher reduction of Intel’s DDIO and fix memory bandwidth.


