Fuzz Testing at Airtable

By Keyhan Vakil

Customers use Airtable’s real-time collaborative database to build complex business workflows and apps. Correctness of the database is imperative for our customers. This blog post describes how we used fuzzing in order to automatically find bugs in our product, create regression tests for those bugs, and ultimately fix them!

“A fuzzer is a program which injects automatically semi-random data into a program to detect bugs.” (OWASP Security)

Airtable lets users create views, which are filtered and sorted copies of the original dataset.

An example of configuring filters on an Airtable view. This view shows only records where the “Book” column contains the text “Sein” or where the “Canceled” column is checked.

Views power other Airtable features. Most of our users interact with filtered and/or sorted views of their data rather than the entire table. Customers use Airtable’s shared views feature in order to share filtered down data with third-party businesses and outside contractors. Airtable’s view feature also powers automations — one of our most common automations is “send an email when data enters this view.”

For this reason, it’s important that views work correctly. If a filter doesn’t work or is not set up properly by a user, a view might incorrectly display information to other users. There are other downstream impacts as well: for example, since you can add filters on formulaic columns, incorrect formula computation can also cause a view to display the wrong records.

Views also need to be efficient, since our larger customers have many views. To solve this problem, we have the Recalculator. The Recalculator analyzes the writes performed in a transaction and determines which formulas, filters, and sorts need to be recalculated in order to produce correct results. Since it runs on every write transaction, it needs to be fast, so it tries to do the minimum amount of recalculation by constructing a dependency graph.

A dependency graph used by the Recalculator. Here “Name” is a formulaic column which depends on “First Name” and “Last Name”, and there are two view filters. Note that if the “First Name” column is updated, only its dependencies (colored green) need to be recalculated.
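In code, the idea behind the graph can be sketched like this (a minimal illustration with hypothetical names and data structures, not Airtable’s actual implementation):

```typescript
// A minimal sketch of dependency-driven recalculation (hypothetical names).
// Edges point from a value to the things computed from it, mirroring the
// graph in the figure above.
type NodeId = string;

const dependents: Map<NodeId, NodeId[]> = new Map([
  ["First Name", ["Name"]],
  ["Last Name", ["Name"]],
  ["Name", ["View filter 1", "View filter 2"]],
]);

// Walk the graph from the changed node and collect everything downstream;
// anything not reached does not need to be recalculated.
function nodesToRecalculate(changed: NodeId): Set<NodeId> {
  const dirty = new Set<NodeId>();
  const stack: NodeId[] = [changed];
  while (stack.length > 0) {
    const node = stack.pop()!;
    for (const dep of dependents.get(node) ?? []) {
      if (!dirty.has(dep)) {
        dirty.add(dep);
        stack.push(dep);
      }
    }
  }
  return dirty;
}
```

Updating “First Name” dirties only “Name” and the two filters that read it; a column with no dependents dirties nothing.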

The Recalculator contains many more optimizations to avoid extraneous recalculations, making it quite complicated. In order to find potentially obscure Recalculator bugs, we complemented our existing unit and integration tests with a custom fuzzer. The fuzzer executes randomly generated actions against an Airtable base. There are a myriad of actions the fuzzer can perform: adding and deleting columns, changing column schema configurations, adding and deleting rows, creating and modifying views, undoing and redoing previous actions, et cetera. While doing so, it continuously rechecks the Recalculator’s computations to ensure that recalculation is occurring as necessary.

One example of a rare bug the fuzzer found involved a combination of filtering, linked records (foreign keys), column deletion, and an undo interleaved with another action. The bug is complicated and not necessary for understanding the rest of this article, so we’ve included the full description in the appendix.

Internally, all core base actions in Airtable are called CRUD requests (Create/Read/Update/Delete requests). For example, cloning an existing column is a CRUD request. It is called column/createAsClone, and takes in a JSON object containing a columnId, which is a 17-character string starting with “fld” (short for “field”). The columnId tells our backend which column should be cloned.

This poses a problem for traditional fuzzers which send entirely random data to the application. In the case of column cloning, the chance of randomly generating a valid JSON object which references a valid columnId is vanishingly small. In addition, the set of valid columnIds changes as columns are created and destroyed.
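To see why “vanishingly small” is no exaggeration, here is a back-of-envelope sketch; it assumes, purely for illustration, that an ID is “fld” followed by 14 case-sensitive alphanumeric characters, and that a naive fuzzer emits 17 uniformly random alphanumeric characters:

```typescript
// Back-of-envelope estimate (illustrative assumptions, not Airtable's
// actual ID scheme).
const ALPHABET = 62; // a-z, A-Z, 0-9

// Probability that 17 random alphanumeric characters equal one specific ID.
const pExactId = Math.pow(1 / ALPHABET, 17);

// Even if a million valid columnIds existed at once, a random hit on any
// of them would still be negligible.
const pAnyOfMillion = pExactId * 1_000_000;

console.log(pExactId);      // on the order of 1e-31
console.log(pAnyOfMillion); // on the order of 1e-25
```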

This was the main motivation for building our own fuzzer: the desire to generate valid CRUD requests while accounting for changing state.

Let’s see how we generate these CRUD requests with an example. Here is the generator for the column cloning CRUD request:

switch (randomStream.getRandomAction()) {
    /* ... */
    case columnCreateAsClone: {
        const columnIdToClone = randomSeenIdGenerator.getRandomId(
            randomStream.next(),
        );
        const shouldDuplicateCells = randomStream.randomBoolean();
        const result = await ownerSession.column().createAsCloneAsync({
            columnIdToClone,
            shouldDuplicateCells,
        });
        if (result.value) {
            randomSeenIdGenerator.addIds(result.value);
            /* ... */
        }
    }
    /* ... */
}

There are two fuzzer-specific constructs here: randomStream and randomSeenIdGenerator. randomStream is a lazily-generated (conceptually infinite) stream of random numbers in [0, 1). It comes with helper functions that use the stream to generate random typed values, like booleans or IDs. We’ll talk more about why we use randomStream instead of generating purely random numbers directly in the next section. For now, note that a short prefix of the stream determines this action: one number to choose the action, one to pick the column to clone, and one for the shouldDuplicateCells boolean.
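One possible shape for such a stream, sketched under assumptions (the class name and helpers here are ours, not Airtable’s actual implementation). The key property is that every number handed out is recorded, so the same sequence of decisions can be replayed or mutated later:

```typescript
// A minimal sketch of a replayable, lazily-generated random stream.
class RandomStream {
  readonly numbers: number[];
  private index = 0;

  // Replaying an existing prefix reproduces the same decisions; once the
  // prefix is exhausted, fresh random numbers are appended lazily.
  constructor(prefix: number[] = []) {
    this.numbers = [...prefix];
  }

  next(): number {
    if (this.index >= this.numbers.length) {
      this.numbers.push(Math.random());
    }
    return this.numbers[this.index++];
  }

  randomBoolean(): boolean {
    return this.next() < 0.5;
  }

  // Choose one of n possibilities, e.g. which action to perform.
  randomInt(n: number): number {
    return Math.floor(this.next() * n);
  }
}
```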

Next, we’ll focus on randomSeenIdGenerator. This takes in a random number and uses it to choose an ID from the set of IDs which are currently “valid”. In exchange, we need to keep the randomSeenIdGenerator up-to-date as IDs become valid (through being added) or invalid (through deletion). column/createAsClone returns the cloned column ID if it succeeds. We use randomSeenIdGenerator.addIds(result.value) to inform the randomSeenIdGenerator about the newly created column, which allows later generated requests to reference it.
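A sketch of what such an ID pool could look like (hypothetical names and behavior):

```typescript
// A sketch of an ID pool kept in sync with the base. getRandomId maps a
// number from the random stream onto one of the currently valid IDs.
class RandomSeenIdGenerator {
  private ids: string[] = [];

  addIds(...newIds: string[]): void {
    this.ids.push(...newIds);
  }

  // Called when a column or row is destroyed, so we stop referencing its ID.
  removeId(id: string): void {
    this.ids = this.ids.filter((existing) => existing !== id);
  }

  getRandomId(random: number): string | undefined {
    if (this.ids.length === 0) return undefined;
    return this.ids[Math.floor(random * this.ids.length)];
  }
}
```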

Finally, we have await ownerSession.column().createAsCloneAsync(...). This building block isn’t anything fuzzer specific. It’s just our ordinary integration testing code for making a request. It generates the JSON corresponding to the request and sends it to the base for processing.

Note that this results in syntactically valid requests, since the resulting JSON sent to the server is well-formed by construction. The requests are also semantically valid, since the use of randomSeenIdGenerator ensures that we don’t try to do nonsensical things like clone a column which doesn’t exist. The end result is that the majority of CRUD requests are successfully processed by the base.

This lets us generate random requests. But there is no guarantee that the resulting requests are any fun. We want to let the fuzzer explore different types of requests, but we also want to bias the fuzzer towards requests which exercise more complex logic. The classical way to achieve this is coverage-guided fuzzing. The goal is to encourage the fuzzer to explore requests which get new code coverage.

Most fuzzers which collect code coverage rely on instrumenting the binary at compile time. Airtable’s backend is written mostly in TypeScript and runs on top of Node.js, so we can’t copy that approach. Instead, we use the V8 Inspector API to collect coverage before and after each CRUD request’s execution. A CRUD request with no change in coverage has uncovered no new code; we score each request by the amount of new coverage it produced, and requests with high scores are the ones we likely want to fuzz further.
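Here is a sketch of what in-process coverage collection can look like. Profiler.enable, Profiler.startPreciseCoverage, and Profiler.takePreciseCoverage are real V8 Inspector protocol methods exposed through Node’s built-in inspector module; the helpers and scoring scheme around them are simplified assumptions, not our production code:

```typescript
import { Session } from "node:inspector";

const session = new Session();
session.connect();

// Promisified wrapper around the callback-based Session#post.
function post(method: string, params?: object): Promise<any> {
  return new Promise((resolve, reject) => {
    (session as any).post(method, params, (err: Error | null, result: any) =>
      err ? reject(err) : resolve(result),
    );
  });
}

// Count code ranges that executed since the last call. Per the protocol,
// takePreciseCoverage also resets V8's execution counters.
async function takeCoveredRangeCount(): Promise<number> {
  const { result } = await post("Profiler.takePreciseCoverage");
  let covered = 0;
  for (const script of result) {
    for (const fn of script.functions) {
      covered += fn.ranges.filter((r: any) => r.count > 0).length;
    }
  }
  return covered;
}

// Score an action by the coverage it newly exercises.
async function coverageScore(action: () => void): Promise<number> {
  await post("Profiler.enable");
  await post("Profiler.startPreciseCoverage", {
    callCount: false,
    detailed: true,
  });
  await takeCoveredRangeCount(); // drain coverage accumulated before the action
  action();
  return takeCoveredRangeCount();
}
```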

Now we want to generate CRUD requests with high scores. How can we do this? One idea is a mutational approach: the fuzzer takes a CRUD request with a high score and mutates it to create a similar CRUD request.

For example, if I told you that cloning a certain column uncovered a lot of new code paths, you might come up with some followup actions to try:

  1. “What if we try a different action on the same column, like deleting it?”
  2. “What if we try cloning a different column?”
  3. “What if we try cloning the column with different parameters, like changing the shouldDuplicateCells boolean?”

The fuzzer can actually do all of these mutations. It does so by mutating the random numbers in the randomStream described earlier. There are a variety of mutations implemented. One example mutation we have is “choose a random number in the stream, and replace that number with 1 minus that number”. If this mutation were applied to the third number in the random stream from before, it would flip the shouldDuplicateCells boolean.
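A sketch of that mutation (the function name is hypothetical):

```typescript
// Replace the number at `position` with 1 minus that number, leaving the
// rest of the recorded stream untouched.
function mutateStream(numbers: number[], position: number): number[] {
  const mutated = [...numbers];
  mutated[position] = 1 - mutated[position];
  return mutated;
}
```

Because the first two numbers are untouched, replaying the mutated stream chooses the same action and the same column; only the boolean decision driven by the third number changes.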

This approach has the advantage that we don’t need to write custom mutators for each CRUD action, which is great because we have many CRUD actions. It has the disadvantage that mutations of a CRUD request are not necessarily “similar” to the existing CRUD request. For example, if we apply the same mutation to the first number in the stream, we would likely try a completely different action. Changing the action will also cause the rest of the randomStream to be interpreted very differently. The resulting CRUD request will likely have nothing to do with cloning a column. In the future, we may explore writing specialized mutators for our requests.

We want to choose CRUD requests which will have good mutations, i.e. their mutations will unlock even more new coverage points. But note that just because a CRUD request got a bunch of new coverage in the past, that doesn’t mean that mutations of it will be successful. The CRUD request may have exhausted all of its usefulness. We have a notion of energy which we use as the key to a priority queue to decide which CRUD request to mutate next. Here is some pseudo-code:

// The first element of the pair is the priority, the second is the element.
// Higher priorities are popped from the queue first.
priorityQueue = new PriorityQueue([(0, new RandomStream())])
while (!crashed) {
    (originalEnergy, randomStream) = priorityQueue.pop()
    mutatedRandomStream = mutate(randomStream)
    run the CRUD request generated by mutatedRandomStream
    mutantEnergy = number of new coverage points since last execution
    originalEnergy = (originalEnergy + mutantEnergy) / 2
    priorityQueue.push((originalEnergy, randomStream))
    if (mutantEnergy > 0)
        priorityQueue.push((mutantEnergy, mutatedRandomStream))
}

We update the energy of an element by averaging its previous energy with the energy of its mutant. In cases where none of the mutants get new coverage, this causes the energy to decay exponentially fast, ensuring that we don’t spend too much time on CRUD requests which were initially promising. As time goes on, the maximum energy in the queue decreases exponentially, with spikes when new coverage is periodically found. This also has the effect that new coverage later is worth exponentially more than new coverage early on.
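The exponential decay falls directly out of the averaging rule. A toy calculation (assumed semantics, not production code):

```typescript
// If an element's mutants keep finding no new coverage (mutantEnergy = 0),
// the update originalEnergy = (originalEnergy + 0) / 2 halves its energy
// each time it is popped from the queue.
function decayedEnergy(initial: number, fruitlessRounds: number): number {
  let energy = initial;
  for (let i = 0; i < fruitlessRounds; i++) {
    energy = (energy + 0) / 2; // mutant earned no new coverage points
  }
  return energy;
}
```

After six fruitless rounds, an element that started with energy 64 has decayed to 1, so a newly discovered coverage point easily outranks it.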

Ultimately, this energy computation model is just a heuristic that we need to tune, and we’re eager to explore different heuristics which could improve efficiency further.

With the algorithm above, we can send CRUD requests to the base. But how does the fuzzer know whether the results are unexpected or incorrect? Traditional fuzzers look for crashes as a sign of unexpected behavior. But the bug we described earlier produced no crashes at all.

In fact, we already had a way to detect errors in our codebase: integrity checks. These are test- and development-only checks which verify correctness (we also run shorter integrity checks in production). One example is the Recalculator integrity check, which simply recomputes everything from scratch and compares the result to the Recalculator’s output. If they differ, it throws an error. This helps catch bugs that are exercised in test or development environments.
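The Recalculator integrity check can be pictured like this (a sketch with hypothetical types; the real check covers formulas, filters, and sorts):

```typescript
// Compare incrementally maintained results against a from-scratch
// recomputation; any divergence indicates a Recalculator bug.
type CellValues = Map<string, unknown>;

function checkRecalculatorIntegrity(
  incremental: CellValues,
  recomputeFromScratch: () => CellValues,
): void {
  const expected = recomputeFromScratch();
  for (const [cell, value] of expected) {
    if (incremental.get(cell) !== value) {
      throw new Error(
        `integrity failure at ${cell}: ` +
          `got ${incremental.get(cell)}, expected ${value}`,
      );
    }
  }
}
```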

In fuzzing mode, we enable these integrity checks. This is sometimes called “property-based fuzzing” in academic literature because we are testing “properties” of the resulting state. Note that this doesn’t tell us if the CRUD action is correct per se. For example, if we just did nothing for every CRUD request, we would pass all of our integrity checks. In the future, we plan to explore other fuzzing methods, such as differential fuzzing, which could catch subtler regressions.

Let’s zoom out and discuss how the fuzzer integrates with the rest of our system.

The fuzzer runs every 12 hours via a Jenkins CI job. It logs the results to an AWS S3 bucket, and reports crashes to Rollbar. The nice thing about Rollbar is that it can deduplicate crashes by their stack traces, and if it sees a new crash it posts the stack trace to a Slack channel that I monitor. I download the artifacts and run the minimizer, which narrows down the crash to the few actions which are necessary to reproduce the issue via a technique similar to delta debugging. I do some semi-manual conversion, create an integration test documenting the defect, and file a tracking issue. Eventually, someone will debug and fix the issue, using the regression test for test-driven development. The regression tests also run as part of our regular CI build, ensuring that fixed issues don’t resurface.
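The minimizer’s core idea can be sketched as a greedy reducer in the spirit of delta debugging (simplified; real ddmin also splits into n-way chunks and tests complements). Here, reproduces replays a candidate action list and reports whether the crash still occurs; both names are ours:

```typescript
// Repeatedly try removing chunks of actions; keep any removal after which
// the crash still reproduces, shrinking the chunk size when stuck.
function minimize<T>(
  actions: T[],
  reproduces: (subset: T[]) => boolean,
): T[] {
  let current = actions;
  let chunkSize = Math.ceil(current.length / 2);
  while (chunkSize >= 1) {
    let removedSomething = false;
    for (let start = 0; start < current.length; start += chunkSize) {
      const candidate = [
        ...current.slice(0, start),
        ...current.slice(start + chunkSize),
      ];
      if (candidate.length < current.length && reproduces(candidate)) {
        current = candidate;
        removedSomething = true;
        start -= chunkSize; // retry the same position after a removal
      }
    }
    if (!removedSomething) chunkSize = Math.floor(chunkSize / 2);
  }
  return current;
}
```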

The fuzzer has found many bugs of various severity, most of which had been lurking in our systems for a long time. One fun anecdote: the fuzzer found a bug in a pre-launch feature, we came up with a fix, and closed the bug. Then the fuzzer found another issue introduced by the fix. That’s something which you rarely get from regular tests!

Database engines are complicated yet performance-sensitive pieces of software, and validating their correctness is a difficult problem. At Airtable, our real-time database powers our customers’ apps and business workflows, and correctness is imperative for many of their use cases. Fuzzing has helped uncover bugs in our product and increased our confidence in it. But as we’ve mentioned throughout, there’s plenty of remaining work to further improve the fuzzer’s efficiency and coverage. Internally, we have an Airtable base called “Fuzzer Ideas” with many more of them. If you’d like to join us to work on interesting and challenging problems like this one, we are hiring.

Want to learn more about fuzzing? The Fuzzing Book is a great resource for practitioners.

Appendix

Here is an example bug that our fuzzer found. You don’t need to understand all the details; the point is how complex the situation is.

This bug intersected multiple complicated product features:

  1. Linked records & the ability to filter on them: when you filter on a linked record, you are filtering on the primary column of another record, a concept which is surprisingly hard to implement correctly and efficiently.
  2. Deleting a linked record column: Airtable does its best to stay intuitive even for non-database users. Creating a linked record column to another table creates a symmetric column in that table. When you delete a linked record column, silently deleting its symmetric column would be unintuitive, so the product converts the symmetric column to a text column instead.
  3. Undo & redo with interleaving actions: Airtable tries to have good undo/redo semantics, even in the face of interleaving edits, but it’s not easy to ensure that this works well.

Here is the reproduction in text form:

  1. Create two tables A and B.
  2. Create a linked record (i.e. a foreign key) column from A to B. This creates a symmetric column for linked records from B to A in table B.
  3. Create a filter on table A in the form of “linked record column contains b”. (This filters all rows out of the view, since the linked record column is currently empty)
  4. Create a row (call it row X) in table A. Link it to a row named b in table B. (This causes row X to appear in the view.)
  5. Open table B. Delete the symmetric linked record column. (Airtable product note: deleting a linked record column in one table converts the linked record column in the other table to a text column. Therefore row X remains in the view we created for table A.)
  6. Open table A. Duplicate row X to create row X’. (Now both row X and row X’ are in the view.)
  7. Open table B. Undo the destruction of the linked record column.
  8. Open table A. Refresh the page. We do optimistic updates on the client, so the client’s view may differ from the server’s view in case of bugs like this one. In the absence of bugs, the view updates correctly in real-time. Note that row X’ appears in the view, even though its column value is empty. This is a bug: the Recalculator did not correctly recompute the filter even though the underlying row data had changed.

As far as we can tell, none of our users have ever experienced this issue. Even though we had test cases for all the individual components, we never had a test for this interleaving because it was so complicated. It’s the sort of bug which would be very hard for QA to find manually — but it was found automatically by our fuzzer.
