Don’t try to sanitize input. Escape output. (2020)

2
Don’t try to sanitize input. Escape output. (2020)

February 2020

Every so often developers talk about “sanitizing user input” to prevent cross-site scripting attacks. This is well-intentioned, but leads to a false sense of security, and sometimes mangles perfectly good input.

How does cross-site scripting happen?

A website is vulnerable to cross-site scripting (XSS) attacks if users can enter information that the site repeats back to them verbatim in a page’s HTML. This might cause minor issues (HTML that breaks the page layout) or major ones (JavaScript that sends the user’s login cookie to an attacker’s site).

Let’s walk through a concrete example:

  1. NaiveSite allows you to enter your name, which is output as is on your profile page.
  2. Billy the Kid enters his name as Billy .
  3. Anyone who visits Billy’s profile page gets some HTML including the unescaped script tag, which their browser runs.
  4. If the alert() were changed to something more malicious, like sendCookies('https://billy.com/cookie-monster'), Billy may now be collecting the unsuspecting visitor’s login information.

Side note: it isn’t quite this simple, as login cookies are usually marked HttpOnly, which means they’re not accessible to JavaScript. But this is NaiveSite, so it’s likely they made both an XSS mistake and a cookie one.

Why input filtering isn’t a great idea

The developer has heard of “input filtering” or “sanitizing input”, so they write some code to strip out unsafe HTML characters <>& from the name before storing it. Problem solved!

But there are two problems with this. For one, a couple might sign up to NaiveSite as Bob & Jane Smith, but the filtering code strips the &, and suddenly Bob is on his own, with a middle name of Jane.

Or if the filter is a bit more zealous and also strips ' and ", someone like Bill O’Brien becomes Bill OBrien. Messing up people’s names is not a good look.

Perhaps more importantly, it gives a false sense of security. What does “unsafe” mean? In what context? Sure, <>& are unsafe characters for HTML, but what about CSS, JSON, SQL, or even shell scripts? Those have a completely different set of unsafe characters.

For example, NaiveSite might have a PHP template that looks like this:


...

If an attacker sets their name to include double quotes, like "; badFunc(); ", they can run arbitrary JavaScript on any NaiveSite pages that display the user’s name (which, if you’re logged in, is probably all of them).

Another example of this kind of thing is SQL injection, an attack that’s closely related to cross-site scripting. NaiveSite is powered by MySQL, and it finds users like so:

$query = "SELECT * FROM users WHERE name = '{$name}'"

When a boy named Robert'); DROP TABLE users; comes along, NaiveSite’s entire user database is deleted. Oops!

Incidentally, the mother in the xkcd comic says, “I hope you’ve learned to sanitize your database inputs.” Which is somewhat confusing, but I’ll give Randall the benefit of the doubt and assume he meant “escape your database parameters”.

In short, it’s no good to strip out “dangerous characters”, because some characters are dangerous in some contexts and perfectly safe in others.

Escape your output instead

The only code that knows what characters are dangerous is the code that’s outputting in a given context.

So the better approach is to store whatever name the user enters verbatim, and then have the template system HTML-escape when outputting HTML, or properly escape JSON when outputting JSON and JavaScript.

And of course use your SQL engine’s parameterized query features so it properly escapes variables when building SQL:

$stmt = $db->prepare('SELECT * FROM users WHERE name = ?');
$stmt->bind_param('s', $name);

This is sometimes called “contextual escaping”. If you happen to use Go’s html/template package, you get automatic contextual escaping for HTML, CSS, and JavaScript. Most other templating systems at least give you automatic HTML escaping, for example React, Jinja2, and Rails templates.

But what if you want raw input?

One tricky situation is when your app’s purpose is allowing a user to enter HTML or Markdown for display. In this case you can’t escape when rendering output, because the whole purpose is to allow users to add links, images, headings, etc.

So you have to take a different approach. If you’re using Markdown, you can either:

  1. Allow them to only enter pure Markdown, and convert that to HTML on render (many Markdown libraries allow raw HTML by default; be sure to disable that). This is the most secure option, but also more restrictive.
  2. Allow them to use HTML in the Markdown, but only a whitelist of allowed tags and attributes, such as and . Both Stack Exchange and GitHub take this second approach.

If you’re not using Markdown but want to let your users enter HTML directly, you only have the second option – you must filter using a whitelist. This is harder to get right than you’d think (for example, ), so be sure to use a mature, security-vetted library like DOMPurify.

So in cases where you do need to “echo” raw user input, carefully filter input based on a restrictive whitelist, and store the result in the database. When you come to output it, output it as stored without escaping.

The parallel for SQL injection might be if you’re building a data charting tool that allows users to enter arbitrary SQL queries. You might want to allow them to enter SELECT queries but not data-modification queries. In these cases you’re best off using a proper SQL parser (like this one) to ensure it’s a well-formed SELECT query – but doing this correctly is not trivial, so be sure to get security review.

What about validation?

Input sanitization is usually a bad idea, but input validation is a good thing.

For example, when you’re parsing form fields, and you have a number field that’s not a number, or an email address without an @, or a “post status” drop-down that can only be one of draft, published, or archived – then by all means validate it and return an error if it’s invalid.

Good web form validation shows errors inline so the user knows exactly what to fix:

Web form validation

You must do validation at least on the backend, otherwise an attacker could bypass the frontend validation and POST bogus data to your endpoint directly. In addition, you can also validate early on the frontend to show errors more real-time, without a round trip to the server.

Further reading

OWASP has two cheat sheets on Cross Site Scripting Prevention and SQL Injection Prevention that contain a lot of further information on escaping.

There’s also a StackOverflow answer to “How can I sanitize user input with PHP?” that is somewhat PHP-specific, but I found it succinct and helpful. It links to a page on PHP magic quotes, which were a bad idea and actually removed in PHP 5.4 – the discussion there is very much in line with what I’ve written above.

If you have any feedback on this article, please get in touch! Or see the comments on Hacker News and programming reddit.

NOW WITH OVER +8500 USERS. people can Join Knowasiak for free. Sign up on Knowasiak.com
Read More

Leave a Reply

2 thoughts on “Don’t try to sanitize input. Escape output. (2020)

  1. Aditya avatar

    I think a better way to think of this may be in terms of canonicalization. Inside your application, you should decide on a single canonical way to represent data, one which fits the type of processing and expected use of the application. For example, you might decide that all strings should be UTF8, and should be interpreted (and stored) as whatever the user initially wrote. You might decide that any structured data should be parsed and then stored as protobufs in a BigTable. Or you might decide that an RDBMS is your native datastore and use whatever the native string encoding is for it, as well as parse & normalize data into tables upon input.

    Then, whenever you take input, your job is to validate and encode it. If you get a Windows-1252 string, you should re-encode it to utf8 for further storage. If it has data that are invalid UTF-8 codepoints, you should either strip, replace with a replacement character, or notify the user with a validation failure. Same with structured data that fails your normalization rules – you should usually notify the user.

    And when you send output, you should escape based on the intended output device. If you're putting it in an HTML page, HTML-escape it. If it's a URL, url-encode it. If it's a database query, SQL escape it. If it's a CSV, quote it.

    Thinking in these terms keeps the internal logic of your application simple (there are no format conversions except at system boundaries), and it also gives you a lot of flexibility to preserve the user's intent and add new output formats later.

  2. Aditya avatar

    It's cool to see how these posts are becoming less and less important in the wake of today's frameworks/tools protecting devs by default.

    From ORMs escaping SQL, to FE frameworks escaping html/js, to browsers starting to default to same-site=lax. It feels like we've slowly pulled ourselves out of OWASP hell. Pretty nice to see!

    Obviously it's still important (see log4j) to know it all especially when its not so clear cut, but still good progress.