Racing against the clock – hitting a tiny kernel race window


TL;DR:

How to make a tiny kernel race window really large, even on kernels without CONFIG_PREEMPT:

  • use a cache miss to widen the race window a little bit
  • make a timerfd expire in that window (which will be handled in an interrupt handler – in other words, in hardirq context)
  • ensure that the wakeup triggered by the timerfd has to churn through 50,000 waitqueue items created by epoll

Racing one thread against a timer also avoids accumulating timing variations from two threads in each race attempt – hence the title. On the other hand, it also means you now have to deal with how hardware timers actually work, which introduces its own flavors of weird timing variations.

I recently discovered a race condition (https://crbug.com/project-zero/2247) in the Linux kernel. (While trying to explain to someone how the fix for CVE-2021-0920 worked – I was explaining why the Unix GC is now safe, and then got confused because I couldn't actually figure out why it's safe after that fix, eventually realizing that it actually isn't safe.) It's a fairly narrow race window, so I was wondering whether it could be hit with a small number of tries – especially on kernels that aren't built with CONFIG_PREEMPT, which would make it possible to preempt a thread with another thread, as I described at LSSEU2019.


Here's a writeup of how I managed to hit the race on a normal Linux desktop kernel, with a success rate somewhere around 30% per attempt once the proof of concept has been tuned for the specific machine. I didn't write a full exploit though; I stopped at getting proof of use-after-free (UAF) accesses (with the help of a very large file descriptor table and userfaultfd, which may not be available to normal users depending on system configuration) because that's the part I was curious about.

This also demonstrates that even very tight race conditions can still be exploitable if someone sinks enough time into writing an exploit, so be careful when you dismiss very tight race windows as unexploitable, or when you don't treat such issues as security bugs.

The UAF reproducer is in our bugtracker.

In the UNIX domain socket garbage collection code (which is needed to deal with reference loops formed by UNIX domain sockets that use SCM_RIGHTS file descriptor passing), the kernel tries to figure out whether it can account for all references to some file by comparing the file's refcount with the number of references from inflight SKBs (socket buffers). If they are equal, it assumes that the UNIX domain sockets subsystem effectively has exclusive access to the file because it owns all references.


(The same pattern also appears for files as an optimization in __fdget_pos(), see this LKML thread.)

The problem is that a struct file can be referenced from an RCU read-side critical section (which you can't detect by looking at the refcount), and such an RCU reference can be upgraded into a refcounted reference using get_file_rcu() / get_file_rcu_many() by __fget_files() as long as the refcount is non-zero. For example, when this happens in the dup() syscall, the resulting reference will then be installed in the FD table and be available for subsequent syscalls.
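The "upgrade the reference only if the refcount is still non-zero" step can be sketched in userspace C with an atomic compare-exchange loop. This is an illustration of the pattern, not the kernel's actual implementation; the struct and function names below are invented:

```c
#include <stdatomic.h>
#include <stdbool.h>

struct file_sketch { atomic_long f_count; };

// Increment the refcount only if it is currently nonzero, mirroring the
// semantics of the kernel's get_file_rcu(): a reference found via RCU may
// only be upgraded to a counted reference while the object is still counted.
static bool get_ref_if_live(struct file_sketch *f) {
    long old = atomic_load(&f->f_count);
    while (old != 0) {
        if (atomic_compare_exchange_weak(&f->f_count, &old, old + 1))
            return true; // successfully upgraded to a counted reference
    }
    return false; // refcount already hit zero; the file may be on its way out
}
```

The key property is that a refcount that has dropped to zero is never resurrected – but, as the bug shows, a non-zero refcount alone doesn't prove the file is in a state where taking a new reference is safe.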

When the garbage collector (GC) believes that it has exclusive access to a file, it will perform operations on that file that violate the locking rules used in normal socket-related syscalls such as recvmsg() – unix_stream_read_generic() assumes that queued SKBs can only be removed under the ->iolock mutex, but the GC removes queued SKBs without using that mutex. (Thanks to Xingyu Jin for explaining that to me.)

One way of looking at this bug is that the GC is working correctly – here is a state diagram showing some of the possible states of a struct file, with more specific states nested under less specific ones and with the state transition performed by the GC marked:


All relevant states are RCU-accessible. An RCU-accessible object can have either a zero refcount or a positive refcount. Objects with a positive refcount can be either live or owned by the garbage collector. When the GC attempts to grab a file, it transitions from the state "positive refcount, live" to the state "positive refcount, owned by the GC".

__fget_files(), however, is making an incorrect assumption about the state of the struct file while it is trying to narrow down its possible states – it checks whether get_file_rcu() / get_file_rcu_many() succeeds, which narrows the file's state down somewhat, but not far enough:

__fget_files() first uses get_file_rcu() to conditionally narrow the state of a file from "zero or positive refcount" to "positive refcount" – but that still doesn't distinguish between a live file and one owned by the GC.

And this directly leads to how the bug was fixed (there is another follow-up patch, but that one just tries to clarify the code and recover some of the resulting performance loss) – the fix adds another check in __fget_files() to properly narrow down the state of the file such that the file is guaranteed to be live:

The fix is to properly narrow the state from "positive refcount" to "positive refcount and live" by re-checking the FD table entry after the refcount increment.

The fix ensures that a live reference can only be derived from another live reference, by comparing with an FD table entry, which is guaranteed to point to a live object.
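The fixed lookup follows a "load pointer, take a conditional reference, re-check the pointer" pattern. A userspace sketch of that loop, with invented names standing in for the FD table lookup and the re-check added by the fix:

```c
#include <stdatomic.h>
#include <stddef.h>

struct file_s { atomic_long count; };
struct fdtable_s { struct file_s *_Atomic *fd; };

// Lockless FD lookup with the post-fix re-check: load the file pointer, take
// a reference only if the refcount is nonzero, then confirm the pointer is
// still in the table. Sketch of __fget_files()'s pattern, not kernel code.
static struct file_s *lookup_fd_ref(struct fdtable_s *t, unsigned fd) {
    for (;;) {
        struct file_s *file = atomic_load(&t->fd[fd]);   // racy lookup
        if (!file)
            return NULL;                                 // FD is closed
        long old = atomic_load(&file->count);
        while (old != 0 &&
               !atomic_compare_exchange_weak(&file->count, &old, old + 1))
            ;
        if (old == 0)
            continue;  // refcount already zero: never resurrect, retry lookup
        // Re-check added by the fix: the reference is only valid if the file
        // is still reachable from the FD table (and therefore live).
        if (atomic_load(&t->fd[fd]) == file)
            return file;
        atomic_fetch_sub(&file->count, 1);               // lost a race; undo
    }
}
```

Without the final pointer re-check, a successful refcount increment could have been taken on a file that the GC already considers its own – which is exactly the bug.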

[Sidenote: This scheme is similar to the one used for struct page – gup_pte_range() also uses the "grab pointer, increment refcount, recheck pointer" pattern for locklessly looking up a struct page from a page table entry while ensuring that new refcounted references can't be created without holding an existing reference. This is really important for struct page because a page can be given back to the page allocator and reused while gup_pte_range() holds an uncounted reference to it – freed pages still have their struct page, so there's no need to delay freeing of the page – so if this went wrong, you'd get a page UAF.]


My initial suggestion was to instead fix the issue by changing how unix_gc() ensures that it has exclusive access, letting it set the file's refcount to zero to prevent turning RCU references into refcounted ones; this would have avoided adding any code to the hot __fget_files() path, but it would have only fixed unix_gc(), not the __fdget_pos() case I found later, so it's probably a good thing this isn't how it was fixed:

[Sidenote: In my original bug report I wrote that you’d have to wait an RCU grace period in the GC for this, but that wouldn’t be necessary as long as the GC ensures that a reaped socket’s refcount never becomes non-zero again.]

There are multiple race conditions involved in exploiting this bug, but by far the trickiest to hit is that we have to race an operation into the tiny race window in the middle of __fget_files() (which can e.g. be reached via dup()), between the file descriptor table lookup and the refcount increment:

static struct file *__fget_files(struct files_struct *files, unsigned int fd,
                                 fmode_t mask, unsigned int refs)
{
        struct file *file;

        rcu_read_lock();
loop:
        file = files_lookup_fd_rcu(files, fd); // race window start
        if (file) {
                /* File object ref couldn't be taken.
                 * dup2() atomicity guarantee is the reason
                 * we loop to catch the new file (or NULL pointer)
                 */
                if (file->f_mode & mask)
                        file = NULL;
                else if (!get_file_rcu_many(file, refs)) // race window end
                        goto loop;
        }
        rcu_read_unlock();

        return file;
}

In this race window, the file descriptor must be closed (to drop the FD's reference to the file) and a unix_gc() run must get past the point where it checks the file's refcount ("total_refs = file_count(u->sk.sk_socket->file)").


In the Debian 5.10.0-9-amd64 kernel at version 5.10.70-1, that race window looks as follows:

<__fget_files> cmp    r10,rax
<__fget_files> sbb    rax,rax
<__fget_files> mov    rdx,QWORD PTR [r11+0x8]
<__fget_files> and    eax,r8d
<__fget_files> lea    rax,[rdx+rax*8]
<__fget_files> mov    r12,QWORD PTR [rax]       ; RACE WINDOW START
; r12 now contains file*
<__fget_files> test   r12,r12
<__fget_files> je     ffffffff812e3df7 <__fget_files>
<__fget_files> mov    eax,r9d
<__fget_files> and    eax,DWORD PTR [r12+0x44]  ; LOAD (for ->f_mode)
<__fget_files> jne    ffffffff812e3df7 <__fget_files>
<__fget_files> mov    rax,QWORD PTR [r12+0x38]  ; LOAD (for ->f_count)
<__fget_files> lea    rdx,[r12+0x38]
<__fget_files> test   rax,rax
<__fget_files> je     ffffffff812e3def <__fget_files>
<__fget_files> lea    rcx,[rsi+rax*1]
<__fget_files> lock cmpxchg QWORD PTR [rdx],rcx ; RACE WINDOW END (on cmpxchg success)

As you can see, the race window is fairly small – around 12 instructions, assuming that the cmpxchg succeeds.

Luckily for us, the race window contains the first few memory accesses to the struct file; therefore, by making sure that the struct file is not present in the fastest CPU caches, we can widen the race window by as much time as those memory accesses take. The standard way to do that is to use an eviction pattern / eviction set; but we can also just make the cache line dirty on another core (see Anders Fogh's blog post for more detail). (I'm not really sure about the intricacies of how much latency this adds on different manufacturers' CPU cores, or on different CPU generations – I've only tested different versions of my proof of concept on Intel Skylake and Tiger Lake. Differences in cache coherency protocols or snooping mechanisms could make a big difference.)

For the cache line containing the flags and refcount of a struct file, this can be done by briefly bumping its refcount up on another CPU and then dropping it back down, e.g. with close(dup(fd)) (or just by accessing the FD in pretty much any way from a multithreaded process).
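A minimal sketch of that cache-dirtying trick: pin to another core, take and drop a reference with close(dup(fd)) – which writes the file's refcount cache line on that core, dirtying it from the point of view of the racing core – then pin back. The CPU numbers and helper names here are arbitrary choices, not taken from the PoC:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

// Pin the calling thread to one CPU (the CPU is assumed to exist).
static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);
}

// Touch fd's struct file refcount from another core: dup() bumps f_count and
// close() drops it again, leaving the refcount cache line dirty over there.
static void bounce_file_refcount(int fd, int other_cpu, int my_cpu) {
    pin_to_cpu(other_cpu);
    close(dup(fd));
    pin_to_cpu(my_cpu);
}
```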

However, when we are trying to hit the race in __fget_files() via dup(), we don't want any cache misses to occur before we reach the race window – that would slow us down and probably make us miss the race. To prevent that from happening, we can call dup() with a different FD number for a warm-up run shortly before attempting the race. Because we also want the relevant cache line in the FD table to be hot, we should pick the FD number for the warm-up run such that it uses the same cache line of the file descriptor table.

Okay, a cache miss might buy us something like a few dozen or maybe a hundred nanoseconds – that's better, but it's not great. What else can we do to make this tiny stretch of code much slower to execute?

On Android, kernels usually set CONFIG_PREEMPT, which would have allowed abusing the scheduler to somehow interrupt the execution of this code. The way I've done this in the past was to give the victim thread a low scheduler priority and pin it to a specific CPU core together with another high-priority thread that is blocked on a read() syscall on an empty pipe (or eventfd); when data is written to the pipe from another CPU core, the pipe becomes readable, so the high-priority thread (which is registered on the pipe's waitqueue) becomes schedulable, and an inter-processor interrupt (IPI) is sent to the victim's CPU core to force it to enter the scheduler immediately.

One problem with that approach, aside from its reliance on CONFIG_PREEMPT, is that any timing variability in the kernel code involved in sending the IPI makes it harder to actually preempt the victim thread in the right spot.

(Thanks to the Xen security team – I think the first time I heard the idea of using an interrupt to widen a race window may have been from them.)

A nicer way to do this on an Android phone would be to trigger the scheduler not from an IPI, but from an expiring high-resolution timer on the same core, although I didn't get that to work (probably because my code was broken in unrelated ways).

High-resolution timers (hrtimers) are exposed through many userspace APIs. Even the timeout of select()/pselect() uses an hrtimer, although this is an hrtimer that usually has some slack applied to it to allow batching it with timers that are scheduled to expire slightly later. An example of a non-hrtimer-based API is the timeout used for reading from a UNIX domain socket (and perhaps also other types of sockets?), which can be set via SO_RCVTIMEO.

The thing that makes hrtimers "high-resolution" is that they don't just wait for the next periodic clock tick; instead, the expiration time of the next hrtimer on the CPU core is programmed into a hardware timer. So we could set an absolute hrtimer for some time in the future via something like timer_settime() or timerfd_settime(), and then, at exactly the programmed time, the hardware will raise an interrupt! We've made the timing behavior of the OS irrelevant for the second side of the race; the only thing that matters is the hardware! Or… well, almost…

So we pick some absolute time at which we want to be interrupted, and tell the kernel using a syscall that accepts an absolute time, in nanoseconds. Then, when that timer is the next one scheduled, the OS converts the absolute time to whatever clock base/scale the hardware timer is based on, and programs it into hardware. And the hardware usually supports programming timers with an absolute time – e.g. on modern X86 (with X86_FEATURE_TSC_DEADLINE_TIMER), you can simply write an absolute Time Stamp Counter (TSC) deadline into MSR_IA32_TSC_DEADLINE, and when that deadline is reached, you get an interrupt. The situation on arm64 is similar, using the timer's comparator register (CVAL).

However, on both X86 and arm64, even though the clockevent subsystem is theoretically able to pass absolute timestamps to clockevent drivers (via ->set_next_ktime()), the drivers instead only implement ->set_next_event(), which takes a relative time as its argument. This means that the absolute timestamp has to be converted into a relative one, only to be converted back to absolute a short moment later. The delay between those two operations is effectively added to the timer's expiration time.

Luckily this didn't really seem to be a problem for me; if it had been, I would have tried to repeatedly call timerfd_settime() shortly before the planned expiry time to make sure that the last time the hardware timer is programmed, the relevant code path is hot in the caches. (I did do some experimentation on arm64, where this seemed to maybe help a little, but I didn't really analyze it properly.)

Okay, so all the stuff I said above might be useful on an Android phone with CONFIG_PREEMPT, but what if we are targeting a normal desktop/server kernel that doesn't have that turned on?

Well, we can still trigger hrtimer interrupts the same way – we just can't use them to directly enter the scheduler and preempt the thread anymore. But instead of using the interrupt for preemption, we could just try to make the interrupt handler run for a really long time.

Linux has the concept of a "timerfd", which is a file descriptor that refers to a timer. You can e.g. call read() on a timerfd, and that operation will block until the timer has expired. Or you can monitor the timerfd using epoll, and it will show up as readable when the timer expires.

When a timerfd becomes ready, all of the timerfd's waiters (including epoll watches), which are queued up in a linked list, are woken up via the wake_up() path – just like when e.g. a pipe becomes readable. Therefore, if we can make the list of waiters really long, the interrupt handler will have to spend a lot of time iterating over that list.

And for any waitqueue that is wired up to a file descriptor, it is fairly easy to add a ton of entries thanks to epoll. Epoll ties its watches to specific FD numbers, so if you duplicate an FD with hundreds of dup() calls, you can then use a single epoll instance to install hundreds of waiters on the file. Additionally, a single process can have hundreds of epoll instances. I used 500 epoll instances, each watching 100 duplicate FDs, resulting in 50,000 waitqueue items.

A nice aspect of this race condition is that if you only hit the difficult race (close() the FD and run unix_gc() while dup() is preempted between the FD table lookup and the refcount increment), no memory corruption happens yet, but you can observe that the GC has incorrectly removed a socket buffer (SKB) from the victim socket. Even better, if the race fails, you can also see in which direction it failed, as long as no FD numbers below the victim FD are unused:

  • If dup() returns -1, it was called too late / the interrupt happened too early: the file* was already gone from the FD table when __fget_files() tried to load it.
  • If dup() returns a file descriptor:
      • If it returns an FD higher than the victim FD, the victim FD was only closed after dup() had already incremented the refcount and allocated a new FD. This means dup() was called too early / the interrupt happened too late.
      • If it returns the old victim FD number:
          • If recvmsg() on the FD returned by dup() returns no data, the race succeeded: the GC wrongly removed the queued SKB.
          • If recvmsg() returns data, the interrupt happened between the refcount increment and the allocation of a new FD. dup() was called slightly too early / the interrupt happened slightly too late.
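This classification can be captured in a small helper function (the names and outcome encoding here are mine, not from the PoC):

```c
enum race_outcome {
    DUP_TOO_LATE,       // dup() failed: FD was already gone from the table
    DUP_TOO_EARLY,      // new FD above victim: refcount bumped before close()
    RACE_SUCCESS,       // victim FD reused and its SKB wrongly reaped by GC
    SLIGHTLY_TOO_EARLY  // victim FD reused, but the SKB data is still there
};

// Classify one race attempt. dup_ret is dup()'s return value, victim_fd is
// the FD being raced, recv_len is the recvmsg() result on the duplicated FD.
// Assumes no FD numbers below victim_fd are free.
static enum race_outcome classify(int dup_ret, int victim_fd, int recv_len) {
    if (dup_ret == -1)
        return DUP_TOO_LATE;   // the interrupt fired too early
    if (dup_ret > victim_fd)
        return DUP_TOO_EARLY;  // the interrupt fired too late
    return recv_len <= 0 ? RACE_SUCCESS : SLIGHTLY_TOO_EARLY;
}
```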

Based on this, I repeatedly tested different timing offsets, using a spinloop with a variable number of iterations to skew the timing, and plotted what outcomes the race attempts had depending on the timing skew.

Results: Debian kernel, on Tiger Lake

I tested this on a Tiger Lake laptop, with the same kernel as shown in the disassembly. Note that "0" on the X axis is offset -300 ns relative to the timer's programmed expiry.

This graph shows histograms of race attempt outcomes (too early, success, or too late), with the timing offset at which the outcome occurred on the X axis. The graph shows that depending on the timing offset, up to around 1/3 of race attempts succeeded.

Results: Other kernel, on Skylake

This graph shows similar histograms for a Skylake processor. The exact distribution is different, but again, depending on the timing offset, around 1/3 of race attempts succeeded.

These measurements are from an older laptop with a Skylake CPU. Here "0" on the X axis is offset -1 µs relative to the timer. (These timings are from a system that's running a different kernel from the one shown above, but I don't think that makes a difference.)

The exact timings of course look different between CPUs, and they probably also change with CPU frequency scaling? But still, if you know what the right timing is (or measure the machine's timing before actually attempting to exploit the bug), you can hit this narrow race with a success rate of about 30%!

The previous section showed that with the right timing, the race succeeds with a probability of around 30% – but it doesn't show whether the cache miss is actually important for that, or whether the race would still work fine without it. To verify that, I patched my test code to try to make the file's cache line hot (present in the cache) instead of cold (not present in the cache):

@@ -312,8 +312,10 @@
     }

+#if 0
     // bounce socket's file refcount over to other cpu
     pin_to(2);
     close(SYSCHK(dup(RESURRECT_FD+1-1)));
     pin_to(1);
+#endif

     //printf("setting timer\n");

@@ -352,5 +354,5 @@
     close(loop_root);
     while (ts_is_in_future(spin_stop))
-      close(SYSCHK(dup(FAKE_RESURRECT_FD)));
+      close(SYSCHK(dup(RESURRECT_FD)));
     while (ts_is_in_future(my_launch_ts)) /*spin*/;

With that patch, the race outcomes look like this on the Tiger Lake laptop:

This graph is a histogram of race outcomes depending on timing offset; it looks similar to the previous graphs, except that almost no race attempts succeed anymore.

If you have been paying attention, you may have noticed that the timing graphs I've been showing are actually pretty weird. If we were deterministically hitting the race in exactly the same way every time, the timing graph should look like this (looking just at the too-early and too-late cases for simplicity):

A sketch of a histogram of race outcomes where the too-early outcomes transition into the too-late outcomes in a single sharp step.

Sure, maybe there is some microarchitectural state that differs between runs, causing timing variations – cache state, branch predictor state, frequency scaling, or something along those lines – but a small number of unaccounted-for discrete events should be adding steps to the graph. (If you are mathematically inclined, you can model the outcome distribution as the convolution of the ideal timing graph with the timing delay distributions of the individual discrete events.) For two unaccounted events, that might look like this:

A sketch of a histogram of race outcomes where the transition between too-early and too-late outcomes happens in two smaller steps.

But what the graphs actually show is more of a smooth, linear transition, like this:

A sketch of a histogram of race outcomes where the transition between too-early and too-late outcomes is a smooth linear slope.

And that looks to me like there is still something fundamentally off. Sure, if a sufficiently large number of discrete events were mixed together, the curve would eventually just look like a smooth smear – but it seems unlikely to me that there is such a large number of roughly evenly distributed random discrete events. And sure, we do get a small amount of timing inaccuracy from sampling the clock in a spinloop, but that should be bounded by the execution time of that spinloop, and the timing smear is much larger than that.

So it looks like there is a source of randomness that isn't a discrete event, but something that introduces a random amount of timing delay within some window. So I became suspicious of the hardware timer. The kernel is using MSR_IA32_TSC_DEADLINE, and the Intel SDM tells us that that MSR is programmed with a TSC value, which makes it look as if the timer has very high granularity. But MSR_IA32_TSC_DEADLINE is a newer mode of the LAPIC timer, and the older LAPIC timer modes were instead programmed in units of the APIC timer frequency. According to the Intel SDM, Volume 3A, section 10.5.4 "APIC Timer", that is "the processor's bus clock or core crystal clock frequency (when TSC/core crystal clock ratio is enumerated in CPUID leaf 0x15) divided by the value specified in the divide configuration register". This frequency is significantly lower than the TSC frequency. So maybe MSR_IA32_TSC_DEADLINE is really just a front-end to the same old APIC timer?

I tried to measure the difference between the programmed TSC value and the point when execution was actually interrupted (not when the interrupt handler starts running, but when the old execution context is interrupted – you can measure that if the interrupted execution context is just running RDTSC in a loop); that looks as follows:

A graph showing noise. Delays from deadline TSC to last successful TSC read before interrupt look essentially random, in the range from around -130 to around 10.

As you can see, the expiry of the hardware timer indeed adds a bunch of noise. The scale of the timing difference is very close to the core crystal clock period – the TSC/core crystal clock ratio on this machine is 117. So I tried plotting the absolute TSC values at which execution was interrupted, modulo the TSC/core crystal clock ratio, and got this:

A graph showing a clear grouping around 0, roughly in the range -20 to 10, with some noise scattered over the rest of the graph.

This confirms that MSR_IA32_TSC_DEADLINE is (apparently) an interface that internally converts the specified TSC value into the less granular bus clock / core crystal clock time, at least on some Intel CPUs.

But there is still something really odd here: the TSC values at which execution was interrupted were at negative offsets relative to the programmed expiry time, as if the timeouts were rounded down to the less granular clock, or something along those lines. To get a better idea of how timer interrupts work, I measured on yet another system (an old Haswell CPU) with a patched kernel when execution is interrupted and when the interrupt handler starts executing, relative to the programmed expiry time (and also plotted the difference between the two):

A graph showing that the skid from programmed interrupt time to execution interruption is around -100 to -30 cycles, the skid to interrupt entry is around 360 to 420 cycles, and the time from execution interruption to interrupt entry has much less timing variance and is at around 440 cycles.

So it looks like the CPU starts handling timer interrupts slightly before the programmed expiry time, but interrupt handler entry takes so long (~450 TSC clock cycles?) that by the time the CPU starts executing the interrupt handler, the timer expiry time has long passed.

Anyway, the important bit for us is that when the CPU interrupts execution due to timer expiry, it is always at a LAPIC timer edge; and LAPIC timer edges occur when the TSC value is a multiple of the TSC/LAPIC clock ratio. An exploit that doesn't take that into account and wrongly assumes that MSR_IA32_TSC_DEADLINE has TSC granularity will have its timing smeared by one LAPIC clock period, which can be something like 40 ns.

The ~30% success rate we can achieve with the current PoC with the right timing is already not terrible; but if we control for the timer's weirdness, can we do better?

The problem is that we are effectively launching the race with two timers that behave differently: one timer based on calling clock_gettime() in a loop (which uses the high-resolution TSC to compute a time), the other a hardware timer based on the lower-resolution LAPIC clock. I see two options to fix this:

  1. Try to ensure that the second timer is set at the start of a LAPIC clock period – that way, the second timer should hopefully behave exactly like the first (or have an additional fixed offset, but we can compensate for that).
  2. Shift the first timer's expiry time down based on the distance from the second timer to the previous LAPIC clock period.
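Option 1 boils down to rounding the programmed TSC deadline to a LAPIC clock edge. A sketch of the arithmetic, assuming the TSC/core crystal clock ratio has been obtained (e.g. from CPUID leaf 0x15); the value 117 used below is just the example ratio from this machine:

```c
#include <stdint.h>

// Round a TSC deadline down/up to a multiple of the TSC / core crystal clock
// ratio, i.e. to a LAPIC timer edge, so that the hardware's internal
// conversion to the coarser clock can't shift our expiry time.
static uint64_t round_to_lapic_edge_down(uint64_t tsc_deadline, uint64_t ratio) {
    return tsc_deadline - (tsc_deadline % ratio);
}

static uint64_t round_to_lapic_edge_up(uint64_t tsc_deadline, uint64_t ratio) {
    return round_to_lapic_edge_down(tsc_deadline + ratio - 1, ratio);
}
```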

(One annoyance with this is that while we can extract information on how wall/monotonic time is calculated from the TSC from the vvar mapping used by the vDSO, the clock is subject to tiny additional corrections at every clock tick, which happen every 4 ms on standard distro kernels (with CONFIG_HZ=250) as long as any core is running.)

I tried to see whether the timing graph would look nicer if I accounted for this LAPIC clock rounding and also used a custom kernel to cheat and control for possible skid introduced by the absolute-to-relative-and-back conversion of the expiry time (see further up), but that still didn't help all that much.

One thing I should have thought about much earlier is that, of course, clock speed matters. On newer Intel CPUs with P-states, the CPU is normally in control of its own frequency, and dynamically adjusts it as it sees fit; the OS just provides some hints.

Linux has an interface that claims to tell you the "current frequency" of each CPU core in /sys/devices/system/cpu/cpufreq/policy<n>/scaling_cur_freq, but when I tried using it, I got a different "frequency" every time I read that file, which seemed suspicious.

Looking at the implementation, it turns out that the value shown there is computed in arch_freq_get_on_cpu() and its callees – the value is calculated on demand when the file is read, with results cached for around 10 milliseconds. The value is determined as the ratio between the deltas of MSR_IA32_APERF and MSR_IA32_MPERF between the last read and the current one. So if you have some tool that polls these values every few seconds and wants to show the average clock frequency over that period, it's probably a good way of doing things; but if you actually want the current clock frequency, it's a poor fit.
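The frequency estimate itself is just the APERF delta over the MPERF delta, scaled by the base frequency. A sketch of that computation on two samples (actually reading the MSRs requires ring 0 or /dev/cpu/*/msr, which is not shown here; the parameter names are mine):

```c
#include <stdint.h>

// Estimate average core frequency in kHz between two (APERF, MPERF) samples.
// APERF counts cycles at the actual clock, MPERF at the base clock, so the
// ratio of their deltas scales the base frequency to the average frequency
// over the sampling interval.
static uint64_t avg_freq_khz(uint64_t aperf0, uint64_t mperf0,
                             uint64_t aperf1, uint64_t mperf1,
                             uint64_t base_khz) {
    uint64_t da = aperf1 - aperf0, dm = mperf1 - mperf0;
    if (dm == 0)
        return 0; // no base-clock cycles elapsed between the samples
    return da * base_khz / dm;
}
```

Taking the two samples far apart (as the sysfs interface effectively does) averages over the whole interval; taking them in quick succession approximates the instantaneous frequency.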

I hacked a helper into my kernel that samples both MSRs twice in quick succession, and that gives much cleaner results. When I measure the clock speeds and timing offsets at which the race succeeds, the result looks like this (showing just two clock speeds; the Y axis is the number of race successes at the clock offset specified on the X axis and the frequency scaling specified by the color):

[A graph showing that the timing of successful race attempts depends on the CPU's performance setting – at 11/28 performance, most successful race attempts occur around clock offset -1200 (in TSC units), while at 14/28 performance, most successful race attempts occur around clock offset -1000.]

So clearly, dynamic frequency scaling has a major influence on the timing of the race – I guess that's to be expected, really.

But even accounting for all this, the graph still looks kind of smooth, so clearly there is still something more that I'm missing – oh well. I decided to stop experimenting with the race's timing at this point, since I didn't want to sink too much time into it. (Or maybe I actually just stopped because I got distracted by newer and shinier things?)

Anyway, I could probably spend a lot more time trying to analyze the timing differences (and probably mostly bang my head against a wall, because the details of execution timing are really complicated to understand in depth, and to understand them fully, it might be necessary to use something like Gamozo Labs' "Sushi Roll" and then go through each instruction in detail and compare the observations to the internal architecture of the CPU). Let's not do that, and get back to actually figuring out how to exploit this bug!

To turn this bug into memory corruption, we have to abuse the fact that the recvmsg() path assumes that SKBs on the receive queue are protected from deletion by the socket mutex, while the GC actually deletes SKBs from the receive queue without touching the socket mutex. For that purpose, while the unix GC is running, we have to start a recvmsg() call that looks up the victim SKB, block until the unix GC has freed the SKB, and then let recvmsg() continue operating on the freed SKB. This is fairly straightforward – while it is a race, we can easily slow down unix_gc() by multiple milliseconds by creating lots of sockets that are not directly referenced from the FD table and have many tiny SKBs queued up – here is a graph showing the unix GC execution time on my machine, depending on the number of queued SKBs that the GC has to scan through:

[A graph showing the time spent per GC run depending on the number of queued SKBs. The relationship is roughly linear.]

To turn this into a UAF, it is necessary to get past the following check near the end of unix_gc():

        /* All candidates should have been detached by now. */

        BUG_ON(!list_empty(&gc_candidates));

gc_candidates is a list that previously contained all sockets that were deemed to be unreachable by the GC. Then, the GC tried to free all those sockets by eliminating their mutual references. If we manage to keep a reference to one of the sockets that the GC thought was going away, the GC detects that with the BUG_ON().

But we don't actually need the victim SKB to reference a socket that the GC thinks is going away; in scan_inflight(), the GC targets any SKB whose socket is marked UNIX_GC_CANDIDATE, meaning it just needed to be a candidate for being scanned by the GC. So by making the victim SKB hold a reference to a socket that is not directly referenced from a file descriptor table, but is indirectly referenced by a file descriptor table through another socket, we can ensure that the BUG_ON() won't trigger.

I extended my reproducer with this trick and some userfaultfd trickery to make recv() run with the right timing. By default you don't necessarily have full access to userfaultfd as a normal user, but since I'm just trying to demonstrate the concept, and there are plenty of alternatives to userfaultfd (using FUSE or just slow disk access), that's good enough for this blogpost.

When the normal distro kernel is running normally, the UAF reproducer's UAF accesses won't actually be noticeable; but if you add the kernel command line flag slub_debug=FP (to enable SLUB's poisoning and sanity checks), the reproducer quickly crashes twice, first with a poison dereference and then a poison overwrite detection, showing that one byte of the poison was incremented:

general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] SMP NOPTI

CPU: 1 PID: 2655 Comm: hardirq_loop Not tainted 5.10.0-9-amd64 #1 Debian 5.10.70-1

[…]

RIP: 0010:unix_stream_read_generic+0x72b/0x870

Code: fe ff ff 31 ff e8 85 87 91 ff e9 a5 fe ff ff 45 01 77 44 8b 83 80 01 00 00 85 c0 0f 89 10 01 00 00 49 8b 47 38 48 85 c0 74 23 bf 00 66 85 c0 0f 85 20 01 00 00 4c 89 fe 48 8d 7c 24 58 44 89

RSP: 0018:ffffb789027f7cf0 EFLAGS: 00010202

RAX: 6b6b6b6b6b6b6b6b RBX: ffff982d1d897b40 RCX: 0000000000000000

RDX: 6a0fe1820359dce8 RSI: ffffffffa81f9ba0 RDI: 0000000000000246

RBP: ffff982d1d897ea8 R08: 0000000000000000 R09: 0000000000000000

R10: 0000000000000000 R11: ffff982d2645c900 R12: ffffb789027f7dd0

R13: ffff982d1d897c10 R14: 0000000000000001 R15: ffff982d3390e000

FS:  00007f547209d740(0000) GS:ffff98309fa40000(0000) knlGS:0000000000000000

CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

CR2: 00007f54722cd000 CR3: 00000001b61f4002 CR4: 0000000000770ee0

PKRU: 55555554

Call Trace:

[…]

 unix_stream_recvmsg+0x53/0x70

[…]

 __sys_recvfrom+0x166/0x180

[…]

 __x64_sys_recvfrom+0x25/0x30

 do_syscall_64+0x33/0x80

 entry_SYSCALL_64_after_hwframe+0x44/0xa9

[…]

---[ end trace 39a81eb3a52e239c ]---

=============================================================================

BUG skbuff_head_cache (Tainted: G      D          ): Poison overwritten

-----------------------------------------------------------------------------

INFO: 0x00000000d7142451-0x00000000d7142451 @offset=68. First byte 0x6c instead of 0x6b

INFO: Slab 0x000000002f95c13c objects=32 used=32 fp=0x0000000000000000 flags=0x17ffffc0010200

INFO: Object 0x00000000ef9c59c8 @offset=0 fp=0x00000000100a3918

Object   00000000ef9c59c8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000097454be8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000035f1d791: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   00000000af71b907: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000000d2d371e: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkklkkkkkkkkkkk

Object   0000000000744b35: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   00000000794f2935: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000006dc06746: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000005fb18682: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000072eb8dd2: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   00000000b5b572a9: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000085d6850b: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000006346150b: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000000ddd1ced: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.

Padding  00000000e00889a7: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ

Padding  00000000d190015f: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ

CPU: 7 PID: 1641 Comm: gnome-shell Tainted: G    B D           5.10.0-9-amd64 #1 Debian 5.10.70-1

[…]

Call Trace:

 dump_stack+0x6b/0x83

 check_bytes_and_report.cold+0x79/0x9a

 check_object+0x217/0x260

[…]

 alloc_debug_processing+0xd5/0x130

 ___slab_alloc+0x511/0x570

[…]

 __slab_alloc+0x1c/0x30

 kmem_cache_alloc_node+0x1f3/0x210

 __alloc_skb+0x46/0x1f0

 alloc_skb_with_frags+0x4d/0x1b0

 sock_alloc_send_pskb+0x1f3/0x220

[…]

 unix_stream_sendmsg+0x268/0x4d0

 sock_sendmsg+0x5e/0x60

 ____sys_sendmsg+0x22e/0x270

[…]

 ___sys_sendmsg+0x75/0xb0

[…]

 __sys_sendmsg+0x59/0xa0

 do_syscall_64+0x33/0x80

 entry_SYSCALL_64_after_hwframe+0x44/0xa9

[…]

FIX skbuff_head_cache: Restoring 0x00000000d7142451-0x00000000d7142451=0x6b

FIX skbuff_head_cache: Marking all objects used

RIP: 0010:unix_stream_read_generic+0x72b/0x870

Code: fe ff ff 31 ff e8 85 87 91 ff e9 a5 fe ff ff 45 01 77 44 8b 83 80 01 00 00 85 c0 0f 89 10 01 00 00 49 8b 47 38 48 85 c0 74 23 bf 00 66 85 c0 0f 85 20 01 00 00 4c 89 fe 48 8d 7c 24 58 44 89

RSP: 0018:ffffb789027f7cf0 EFLAGS: 00010202

RAX: 6b6b6b6b6b6b6b6b RBX: ffff982d1d897b40 RCX: 0000000000000000

RDX: 6a0fe1820359dce8 RSI: ffffffffa81f9ba0 RDI: 0000000000000246

RBP: ffff982d1d897ea8 R08: 0000000000000000 R09: 0000000000000000

R10: 0000000000000000 R11: ffff982d2645c900 R12: ffffb789027f7dd0

R13: ffff982d1d897c10 R14: 0000000000000001 R15: ffff982d3390e000

FS:  00007f547209d740(0000) GS:ffff98309fa40000(0000) knlGS:0000000000000000

CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

CR2: 00007f54722cd000 CR3: 00000001b61f4002 CR4: 0000000000770ee0

PKRU: 55555554

Hitting a race can become easier if, instead of racing two threads against each other, you race one thread against a hardware timer to create a reliable timing window for the other thread – hence the title! On the other hand, it introduces extra complexity, because now you have to think about how timers actually work, and it turns out, time is a complicated concept…

This shows that at least some really tight races can still be hit, and we should treat them as security bugs, even if it looks like they'd be very hard to hit at first glance.

Also, precisely timing races is hard, and the details of how long it actually takes the CPU to get from one point to another are mysterious. (As not only exploit writers know, but also anyone who's ever wanted to benchmark a performance-relevant change…)

I did also mess around with this stuff on arm64 a bit, and I was wondering: at which points do interrupts actually get delivered? Does an incoming interrupt force the CPU to drop everything immediately, or do in-flight operations finish first? This gets especially interesting on phones that contain two or three different types of CPUs mixed together.

On a Pixel 4 (which has 4 slow in-order cores, 3 fast cores, and 1 faster core), I tried firing an interval timer at 100Hz (using timer_create()), with a signal handler that logs the PC register, while running this loop:

  400680:        91000442         add        x2, x2, #0x1

  400684:        91000421         add        x1, x1, #0x1

  400688:        9ac20820         udiv        x0, x1, x2

  40068c:        91006800         add        x0, x0, #0x1a

  400690:        91000400         add        x0, x0, #0x1

  400694:        91000442         add        x2, x2, #0x1

  400698:        91000421         add        x1, x1, #0x1

  40069c:        91000442         add        x2, x2, #0x1

  4006a0:        91000421         add        x1, x1, #0x1

  4006a4:        9ac20820         udiv        x0, x1, x2

  4006a8:        91006800         add        x0, x0, #0x1a

  4006ac:        91000400         add        x0, x0, #0x1

  4006b0:        91000442         add        x2, x2, #0x1

  4006b4:        91000421         add        x1, x1, #0x1

  4006b8:        91000442         add        x2, x2, #0x1

  4006bc:        91000421         add        x1, x1, #0x1

  4006c0:        17fffff0         b        400680

The logged interrupt PCs had the following distribution on a slow in-order core:

[A histogram of PC register values, where most instructions in the loop have roughly equal frequency, the instructions after udiv instructions have twice the frequency, and two other instructions have zero frequency.]

and this distribution on a fast out-of-order core:

[A histogram of PC register values, where the first instruction of the loop has very high frequency, the following 4 instructions have near-zero frequency, and the following instructions have low frequencies.]

As always, out-of-order (OOO) cores make everything weird, and the start of the loop seems to somehow "provide cover" for the following instructions; but on the in-order core, we can see that more interrupts arrive after the slow udiv instructions. So apparently, when one of those is executing while an interrupt arrives, it continues executing and doesn't get aborted somehow?

With the following loop, which has an LDR instruction mixed in that accesses a memory location that is constantly being modified by another thread:

  4006a0:        91000442         add        x2, x2, #0x1

  4006a4:        91000421         add        x1, x1, #0x1

  4006a8:        9ac20820         udiv        x0, x1, x2

  4006ac:        91006800         add        x0, x0, #0x1a

  4006b0:        91000400         add        x0, x0, #0x1

  4006b4:        91000442         add        x2, x2, #0x1

  4006b8:        91000421         add        x1, x1, #0x1

  4006bc:        91000442         add        x2, x2, #0x1

  4006c0:        91000421         add        x1, x1, #0x1

  4006c4:        9ac20820         udiv        x0, x1, x2

  4006c8:        91006800         add        x0, x0, #0x1a

  4006cc:        91000400         add        x0, x0, #0x1

  4006d0:        91000442         add        x2, x2, #0x1

  4006d4:        f9400061         ldr        x1, [x3]

  4006d8:        91000421         add        x1, x1, #0x1

  4006dc:        91000442         add        x2, x2, #0x1

  4006e0:        91000421         add        x1, x1, #0x1

  4006e4:        17ffffef         b        4006a0

the cache-missing loads clearly have a huge influence on the timing. On the in-order core:

[A histogram of interrupt instruction pointers, showing that most interrupts are delivered with PC pointing to the instruction after the high-latency load instruction.]

On the OOO core:

[A similar histogram as the previous one, except that an even larger fraction of interrupt PCs are after the high-latency load instruction.]

What's interesting to me here is that the timer interrupts again seem to mostly arrive after the slow load – implying that if an interrupt arrives while a slow memory access is in progress, the interrupt handler may not get to execute until the memory access has finished? (Unless maybe on the OOO core the interrupt handler can start speculating already? I wouldn't really expect that, but could imagine it.)

On an x86 Skylake CPU, we can do a similar test:

    11b8:        48 83 c3 01                  add    $0x1,%rbx

    11bc:        48 83 c0 01                  add    $0x1,%rax

    11c0:        48 01 d8                     add    %rbx,%rax

    11c3:        48 83 c3 01                  add    $0x1,%rbx

    11c7:        48 83 c0 01                  add    $0x1,%rax

    11cb:        48 01 d8                     add    %rbx,%rax

    11ce:        48 03 02                     add    (%rdx),%rax

    11d1:        48 83 c0 01                  add    $0x1,%rax

    11d5:        48 83 c3 01                  add    $0x1,%rbx

    11d9:        48 01 d8                     add    %rbx,%rax

    11dc:        48 83 c3 01                  add    $0x1,%rbx

    11e0:        48 83 c0 01                  add    $0x1,%rax

    11e4:        48 01 d8                     add    %rbx,%rax

    11e7:        eb cf                        jmp    11b8

with a similar result:

[A histogram of interrupt instruction pointers, showing that almost all interrupts were delivered with RIP pointing to the instruction after the high-latency load.]

This implies that if the first access to the file terminated our race window (which is not the case), we probably wouldn't be able to win the race by making the access to the file slow – instead we'd have to slow down one of the operations before that. (But note that I have only tested simple loads, not stores or read-modify-write operations, here.)
