Are popular toxicity models simply profanity detectors?

Are popular toxicity models simply profanity detectors?

Imagine that after months of waiting, your favorite singer – Britney Spears – has released a new album. It lives up to all the hype, and you rush to post on your favorite social media site:

Holy shit. This album is fucking amazing!

Strangely, you get an instant notification that your post is Under Review.

Maybe it’s the word “fucking”? You try again:

Holy shit! This album is amaaazing

No dice. You try one last time:

fuck yes. the OG bad bitch is BACK

This time, you get warned of an account suspension.

The challenges of language

It’s tempting to believe that AI has progressed far enough that identifying hateful content is a solved problem. Isn’t this the promise of contextual word embeddings and transformers?

But language is complicated. The strongest profanities are often used in the most positive, life-affirming ways. This is a problem: people’s biggest and most enthusiastic fans – the ones whose content you love seeing and spreading – are getting hidden. Talk about terrible false positives!

In our work, we run into cases like these often. Much of the problem stems from poor training and test data: NLP datasets are often created using non-fluent labelers who pattern match on profanity.

For example, here are 10 examples from Google’s GoEmotions dataset that were labeled as Anger by the India-based raters that they used.

  • Hot damn.
  • you almost blew my fucking mind there.
  • damn, congratulation 🤣
  • A-fucking-men! 🙂
  • Wow! Good for her! I’m so glad she was able to see through the bullshit!
  • > Best ~~3pt~~ shooter fucking ever. FTFY
  • Lindt don’t fuck about
  • >I clearly have no fuck I clue what I’m doing hire someone

Clearly, these aren’t actually Anger. But without good data, your models can’t learn about context and nuance either!

Jigsaw’s Perspective API

So how well do popular toxicity models handle profanity? We decided to investigate, by evaluating the Perspective API by Google’s Jigsaw unit, which contains a popular, open source toxicity model.

(For background, Jigsaw defines toxicity as “rude, disrespectful, or unreasonable language that is likely to make someone leave a discussion”. Their annotation instructions make a special note about positive profanity: profane language used in a positive way is unlikely to cause people to leave the discussion, and should not be labeled as toxic.)

So here are the 3 examples above:

93.23% likely to be toxic

92.89% likely to be toxic

98.13% likely to be toxic

All three are scored as very likely toxic! (Scores: 0.9323, 0.9289, 0.9813)

Is it due to the profanity? Let’s try removing it and seeing what happens:

  • Holy cow. This album is amazing (Score: 0.2477191)
  • yes. the OG britney is BACK (Score: 0.07772986)

As expected, once the (positive) profanity is removed, Perspective no longer scores these so high.

A benchmark for evaluating toxicity

Of course, three examples live in the realm of anecdote.

For a larger, real-world benchmark, our labeling team of native English speakers gathered 500 examples of non-toxic profanity posted by actual social media users, as well as 500 examples of toxic profanity.

For example:

Fuck yeah!!!!!

Fuck dude, nurses are the shit.

Afterlife is a fucking rollercoaster. Brilliance @rickygervais 👏

How did the Perspective API fare on these?

Toxicity score: 93.26%

Toxicity score: 98.62%

Toxicity score: 93.88%

In fact, Perspective returned a toxicity score above 0.9 (the default threshold suggestion) for 61% of the real world, non-toxic profanity examples (and 87% of the toxic profanity examples).

If you’d like to benchmark your own toxicity or hateful content model, we’ll be releasing the dataset on our Github soon, or just reach out to in the meantime!

Improving language models

We love what the Jigsaw team is building, so this isn’t to suggest that Perspective doesn’t have great applications. Especially for its suggested usage — as a first-pass filter, leaving final judgments to human decision makers — marking all profanity as toxic can make perfect sense. Perhaps you don’t want your kids to see curse words, regardless of the sentiment behind them.

The larger problem is that your models may be misbehaving in the real world, on the examples you care about the most – but your labels and labelers may not be accurate enough for you to tell. Think hard about your data!

Have you faced similar challenges? Reach out at email us at! Our mission involves creating a safer Internet, but we don’t want to miss out on our favorite content because of AI flaws in the meantime. And if you’re interested in being an early beta user of our own Hate Speech and Toxicity APIs, please sign up here.

After all, it’s Britney’s best track for a reason.

Surge AI is a data labeling workforce and platform that provides world-class data to top AI companies and researchers. Interested in $50 of free

NOW WITH OVER +8500 USERS. people can Join Knowasiak for free. Sign up on
Read More

Ava Chan

Ava Chan

I'm a researcher at Utokyo :) and a big fan of Ava Max