I won free load testing


Long story short: a couple of my articles got really popular on a bunch of
sites, and someone, somewhere, went “well, let’s see how much traffic that
smart-ass can handle”, and suddenly I was on the receiving end of a couple of
DDoS attacks.

It really doesn’t matter what the articles were about — the attack is certainly
not representative of how folks on either side of any number of debates
generally behave.

My assumption is that it’s a small group (maybe a Discord?) with a botnet, who
wanted to have fun.

And, friends: fun was had.

The main attack
My main site (“the blog”) received about 34M requests over 72h – in three
spikes. It’s behind Cloudflare at the time of this writing, so, here’s a pretty
graph (the granularity / bucket size is 1 hour):

To give you an idea of scale, the article that “blew up” on multiple news
aggregators only got around 130K hits. This is what organic traffic looks like
(with a long tail and everything):

In some ways, that attack was unsophisticated: it hit a single route, / (the
front page), and two thirds of the requests used a single user agent:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36

Which is just the user agent of Chrome’s current version on 64-bit Linux. The
traffic doesn’t look like it came from something like headless Chrome
(otherwise there’d be a lot more requests for assets like fonts, stylesheets,
images, etc.), but it could’ve been.

On the other hand, the attack was pretty well distributed:

Here are the top 15 AS the traffic came from:

AS4134: China Telecom (China, backbone)
AS61317: Heficed (cloud provider)
AS398465: Rackdog (cloud provider)
AS18209: Actcorp (India, telecom)
AS14061: DigitalOcean (cloud provider)
AS17639: Converge ICT (Philippines, telecom)
AS9829: Bharat Sanchar Nigam Limited (India, backbone)
AS7713: Telkom Indonesia (Indonesia, telecom)
AS208294: CIA Triad LLC
  - IP prefix is in the US
  - phone number has an Australian prefix (+61)?
  - associated with Zwiebelfreunde, who provide Tor exit nodes

AS8452: TE-AS (Egypt, telecom)
AS205100: F3 Netze, (Germany, associated with Freifunk Franken)
AS9299: PLDT (Philippines, telecom)
AS36947: Algérie Telecom (Algeria, telecom)
AS141995: Contabo Asia (Germany, cloud provider)
AS27699: Telefonica (Brazil, telecom)

So, you know. The flip side of compute and bandwidth becoming a very affordable
commodity worldwide is that… it’s also affordable for bad actors.

Also, I’m sure Heficed, Rackdog, DigitalOcean and Contabo all say in their
Terms of Service “don’t use us to DDoS someone else”, but it appears you can
definitely do it for a few hours without them noticing (whether it was through
compromised instances or not).

The secondary attack
The other target was my recently-launched video platform, which is a separate
service, and runs on fly.io.

Here’s that attack as seen from fly.io metrics (that I see as a user):

Here is the same attack as seen from Honeycomb,
where data is self-reported (from multiple instances of my app) and sampled (not
all traces are sent to Honeycomb):

Why yes, I do see something that stands out.

Let’s use BubbleUp
on that whole area there:

The single biggest asset (the 4K@60 stream for a 50-minute video): makes sense.

They even left me a note! How thoughtful.

(That’s a reference to one of the articles that got popular this week.)

I don’t track which ranges were requested, but they did do some range requests,
and some full requests. Not sure if that was random or just two individuals
going at it:

That attack was a lot less distributed: it came from a bunch of
Vultr IPs, and it stopped after banning the whole AS
(just for my app).

How effective were the attacks?
To evaluate whether or not the attacks were successful, we need to discuss why
people on the internet perform them.

Arguably, the only reason is “for the lulz” (for laughs), but the lulz has
multiple tiers: the primary goal is to “deny service” – to overload the
server(s) so that some content is not accessible anymore.

Then there’s secondary goals: because providers typically bill for bandwidth, if
it costs the target some money, that’s even more fun. And if said providers
decide that actually, they don’t want a customer who’s getting targeted like
that, and boot them off their platform, then that’s maximum fun.

And then of course there’s the power trip, the idea that you have control over
someone else, the ability to “punish” them for something they did. Just like
any other form of harassment really.

So, were those goals met?

The primary goal definitely was: the site was inaccessible for a few hours over
the course of Saturday, April 30th. As far as HTTP status codes go, I’m counting:

11M “499 Client Closed Request”
6.5M “503 Service Unavailable”
6M “403 Forbidden”
4M “200 OK”
3M “524 Origin Timeout”
2M “522 Origin Connection time-out”
1.5M “429 Too Many Requests”
500K “520 Origin Error”
100K “521 Origin Down”

A few people reached out to let me know about it, bummed they couldn’t read the
article for the time being. But I mean, as long as The Internet
Archive is up, is anything really down? News aggregator
users reflexively save articles that become somewhat popular, expecting them to
go down, or in case they need receipts.

As for the secondary objectives: none of them were met.

It didn’t cost me a single dollar:

Hetzner doesn’t bill for bandwidth, it’s a fixed monthly price kind of deal
Cloudflare eats bigger attacks for breakfast, and they don’t charge for it either.

As for fly.io, well, I work there, so, they pay me.

It didn’t get me in trouble either. On the contrary, every time something like
that happens, I make new friends and hear back from old friends.

And honestly? Everyone brought popcorn and took notes. An attack like that is
both entertaining and informative. It kinda gave me a kick in the butt to
address some issues with my website. And it gave everyone ideas on how to better
protect against this kind of thing.

Just like this article right here is good intel for anyone who is looking to
DDoS me. But for me, sharing that info is part of the fun, and I don’t really
believe you can achieve resiliency through obscurity.

Why the attack worked
A service “going down” can mean many things: being temporarily suspended by a
cloud provider is one failure mode, for example. Having resource utilization
increase so much that a machine is constantly swapping and everything is 1000x
slower is another.

Here, there was a single point of failure: my origin, a single
Hetzner dedicated server, running (some guessed it)
some Rust code.

It’s not exactly a secret: I outlined how my website is
built back in 2020, and although I’ve made
multiple incremental improvements since, it still
works essentially the same way.

Before I wrote my own server, I was using static site generators (nanoc, Hugo,
etc.), and deploying the result directly to an S3 bucket, configured for static
site serving, behind Cloudflare. That was an easy, reliable setup.

My site wouldn’t have gone down if I was still using that setup. However, I
would also be looking at a rather large AWS bill right about now. (And I’d
rather talk to AWS friends about anything other than billing).

Wait, a large AWS bill?

I thought you just said it was behind Cloudflare?

I know, I was surprised too. I’m fairly sure it wasn’t always that way, but if
you consult the Cloudflare docs, it clearly says:

Cloudflare only caches based on file extension and not by MIME type. The
Cloudflare CDN does not cache HTML by default. Additionally, Cloudflare caches a
website’s robots.txt.

And then, further down:

To cache additional content, see Page
Rules to
create a rule to cache everything.

Over the years, Cloudflare has saved me a lot of bandwidth costs (well, it
would’ve if I was paying for it), but “only” for images, CSS, JS, etc.

Not video, see Section 2.8 of their Self-Serve Subscription Agreement:

2.8 Limitation on Serving Non-HTML Content
The Services are offered primarily as a platform to cache and serve web pages
and websites. Unless explicitly included as part of a Paid Service purchased by
you, you agree to use the Services solely for the purpose of (i) serving web
pages as viewed through a web browser or other functionally equivalent
applications, including rendering Hypertext Markup Language (HTML) or other
functional equivalents, and (ii) serving web APIs subject to the restrictions
set forth in this Section 2.8.

Use of the Services for serving video or a disproportionate percentage of
pictures, audio files, or other non-HTML content is prohibited, unless purchased
separately as part of a Paid Service or expressly allowed under our Supplemental
Terms for a specific Service. If we determine you have breached this Section
2.8, we may immediately suspend or restrict your use of the Services, or limit
End User access to certain of your resources through the Services.

(So, pissing off a kid with a botnet will not get you booted off of Cloudflare,
but building a video platform on top of it will. They want you to use their
product for that)

And, as it turns out, not for HTML.

Both top colors there are uncached requests: the orange spikes are the requests
my origin did manage to serve, and the grey ones are the ones Cloudflare
generated for me when the origin struggled too much (or when I stopped it).
The first thing I tried to do was add a cache-control: public, max-age=120
response header. Still, the cache status was “dynamic”, so every request went
to the origin.

I already had the cache-control header set for static assets (images,
stylesheets, scripts), just not for HTML, because in two years of repeatedly
hitting the front page of various sites, it was never a performance concern.
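The logged-in split is the crux here, so as an aside, the header policy can be sketched in a few lines (the `private, no-store` value for logged-in users is my assumption; the article only specifies the anonymous one):

```rust
/// Sketch of the cache-control split the site needs: anonymous pages are
/// safe for a shared cache to hold briefly, personalized pages are not.
/// (The logged-in value is an assumption, not from the article.)
fn cache_control(has_valid_session: bool) -> &'static str {
    if has_valid_session {
        // personalized content: shared caches must not store it
        "private, no-store"
    } else {
        // anonymous content: cacheable at the edge for two minutes
        "public, max-age=120"
    }
}

fn main() {
    assert_eq!(cache_control(false), "public, max-age=120");
    assert_eq!(cache_control(true), "private, no-store");
    println!("ok");
}
```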

Creating a page rule didn’t immediately fix everything (but I’m convinced I was
holding it wrong, since that’s the thing they want you to do), but even if it
had, at least on the Free and Pro plans, you can only match by URL: so, this
would break the website for any logged-in users (who support me financially, and
have access to some articles in advance).

On the Business and Enterprise plans, there’s a “Bypass Cache on Cookie”
feature, which even lets you specify regex patterns, like PHPSESSID=.*.

However, it wouldn’t take a very motivated attacker to figure out that they can
hammer your origin with a well-formed cookie: there’s no way for Cloudflare’s
edge to know whether the session cookie is actually valid or not, so it only
helps for legitimate requests, not for attacks.

Long story short: without upgrading to the next plan over ($200/month), I can’t
use Cloudflare to cache HTML in a way that doesn’t break my site. And that
wouldn’t help protect against attacks.

Meanwhile, back in my origin server…

Shell session

$ sudo ss | wc -l

Oh, 350K connections.

That’s a lot of connections

That’s certainly more connections than there should EVER be between Cloudflare
and my server. But Cloudflare expects origin servers to signal when they’re
overloaded: origins should return 429, or 503, or start refusing connections;
they should do something, anything other than just accept a ridiculous number
of concurrent connections and let them sit there — that’s just a bad deal for
everybody.

And my site didn’t do that! Because I half-assed that bit, and it worked for two
years straight anyway. Almost as if… it was possible to move fast, delivering
value, leaving some problems for later, even with Rust…

Right, right, sorry.
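That “do something, anything” boils down to a counter with a ceiling: refuse new work instead of queueing it. Here’s a stdlib-only sketch of the idea (the tower layers my server actually uses come up later; `InFlightLimiter` and `Guard` are names I made up):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Toy in-flight limiter: at most `max` guards exist at once.
/// Anything past that is shed immediately (e.g. answered with a 503).
struct InFlightLimiter {
    max: usize,
    current: AtomicUsize,
}

/// RAII guard: releases one slot when the request finishes.
struct Guard<'a> {
    limiter: &'a InFlightLimiter,
}

impl InFlightLimiter {
    fn new(max: usize) -> Self {
        Self { max, current: AtomicUsize::new(0) }
    }

    /// Returns a guard if there's capacity, `None` if we should shed load.
    fn try_acquire(&self) -> Option<Guard<'_>> {
        let mut cur = self.current.load(Ordering::Relaxed);
        loop {
            if cur >= self.max {
                return None;
            }
            match self.current.compare_exchange_weak(
                cur,
                cur + 1,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Some(Guard { limiter: self }),
                Err(actual) => cur = actual,
            }
        }
    }
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        self.limiter.current.fetch_sub(1, Ordering::AcqRel);
    }
}

fn main() {
    let limiter = InFlightLimiter::new(2);
    let a = limiter.try_acquire();
    let b = limiter.try_acquire();
    assert!(a.is_some() && b.is_some());
    // a third concurrent request gets shed instead of piling up
    assert!(limiter.try_acquire().is_none());
    drop(a);
    // capacity freed: new requests are accepted again
    assert!(limiter.try_acquire().is_some());
    println!("ok");
}
```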

So: the lack of caching wasn’t actually that surprising to me. I do consider
my HTML content dynamic, just like Cloudflare does: it is different for
logged-in users, there’s random bits (“what to read/watch next”), there’s a
full-text search engine (really just SQLite’s, nothing fancy there yet).

However, I did expect Cloudflare’s DDoS
protection, their #1 selling point, to
kick in.

And it mostly didn’t.

Although Cloudflare did block or present a challenge for some fraction of the
incoming traffic, it mostly did so when I turned on the “I’m Under Attack”
mode, and even then, definitely not enough to let my origin serve legitimate
requests again.

By the time the second wave happened, I was already in touch with an engineer at
Cloudflare, who helped some, and let me know something interesting: the reason
protection wasn’t kicking in was that the attacker was staying just below their
detection threshold.

Which seems to indicate the attacker knew what they were doing. But then why use
a fixed user-agent? And hit just the one endpoint?

Because that’s all it took?

Eh, fair enough.

Which brings us to another interesting point: the number of requests is
impressive in total, and very spiky, but it was still under Cloudflare’s
detection threshold most of the time, which means… my server probably
should’ve handled it like a champ!

And just to be clear: it didn’t, right?

Oh no, not at all. Memory usage was quite alright (hardly went past 5%), but all
eight CPU cores were nearly maxed out, and, well, I couldn’t get a single 200 OK
out of it during the attack, even locally.

But then again I didn’t sit there patiently for minutes waiting for my
connection to be accepted.

See, at that time, my server was quite busy, doing a lot of stuff. Here’s a
representative sample of what was going on, obtained with perf
(the exact command is just sudo perf top):

As I mentioned, my website is mostly static content, but not just. And as I
described in 2020, I’m not optimizing for
“maximum server performance”, I’m optimizing for “maximum content authoring
convenience” (and also nice features for readers).

And so, Rust maximalism be damned, most of my website is actually powered by
liquid templates and SQL queries.

Only the very base: file watching, hashing, DB management, the HTTP stack, and
some very useful custom liquid filters (to render markdown, including my custom
extensions, etc.), is actually written in Rust.

The front page, for example, runs several SQL queries to retrieve the latest
videos, articles, and series. It then runs several liquid filters, not to
process markdown (I cache that), but to truncate the HTML body so I can show just
an excerpt with a “Read more” button.

For the curious, it looks something like this:

{% include "html/prologue.html" %}

{% capture sql_start %}
SELECT pages.*
FROM pages
JOIN revision_routes ON revision_routes.hapa=pages.hapa
WHERE revision_routes.revision=?1
{% endcapture %}

{% capture sql_end %}
{% unless config.drafts %}
AND NOT pages.draft
{% endunless %}
{% unless config.future %}
AND datetime(pages.date) <= datetime('now')
{% endunless %}
{% endcapture %}

Static assets (of the png or jpg kind) were never a problem whenever an
article got popular, since any single article only involves a couple of them.

Fixing the origin didn’t involve any rewriting whatsoever: just adding
cache-control headers. But I did make other changes to help next time, and I
thought it’d be neat to show what they look like in code. There’s a whole
bunch of it, so skip ahead if you want.

Limiting the maximum number of in-flight requests
My site uses warp (an HTTP framework); the video platform uses axum. Both are
based on hyper, which means I can use standard tower layers to limit
in-flight requests:

Rust code

async fn serve(config: Config) -> color_eyre::Result<()> {
    let addr: SocketAddr = config.address.parse()?;
    let svc = warp::service(all_routes.with(access_log));

    // this gets called for every new accepted connection: it makes its
    // own clone of our service (which mostly increases reference counts)
    let make_svc = hyper::service::make_service_fn(move |_: &AddrStream| {
        let svc = svc.clone();
        async move { Ok::<_, Infallible>(svc) }
    });

    let server = hyper::Server::bind(&addr).serve(make_svc);
    server.await?;
    Ok(())
}

A tower::ServiceBuilder layer (which uses an internal semaphore) takes care of
the in-flight limit. I deployed this between waves, and it does exactly what
it says on the tin; but during an abnormally high volume of requests,
connections just keep piling up, and resource utilization still goes… well,
up.

A load-shed layer would’ve been smarter (I didn’t think of it at the time):
some requests get in line, the others immediately error out. Erroring out
immediately feels like a bad thing, but once you’re determined to degrade
gracefully, it’s exactly what you want. Especially when the user agent sitting
at the edge (Cloudflare, in this case) is capable of backing off, retrying,
etc.

Limiting concurrent connections
Letting 350K connections pile up is excessive, so the next step was limiting
how many connections get accepted in the first place. Here’s the code:

Rust code

use std::convert::Infallible;

use futures::future::{ready, Ready};
use hyper::server::conn::AddrStream;
use tokio::sync::OwnedSemaphorePermit;
use tokio_util::sync::PollSemaphore;
use tower::Service;

pub struct ServiceFactory<S> {
    inner: S,
    semaphore: PollSemaphore,
    permit: Option<OwnedSemaphorePermit>,
}

impl<S> Service<&AddrStream> for ServiceFactory<S>
where
    S: Clone,
{
    type Response = PermitService<S>;
    type Error = Infallible;
    type Future = Ready<Result<Self::Response, Self::Error>>;

    fn poll_ready(
        &mut self,
        cx: &mut std::task::Context<'_>,
    ) -> std::task::Poll<Result<(), Self::Error>> {
        if self.permit.is_none() {
            self.permit = Some(
                futures::ready!(self.semaphore.poll_acquire(cx))
                    .expect("semaphore should never be closed"),
            );
        }
        std::task::Poll::Ready(Ok(()))
    }

    fn call(&mut self, _req: &AddrStream) -> Self::Future {
        let permit = self
            .permit
            .take()
            .expect("poll_ready should be called first");
        ready(Ok(PermitService {
            inner: self.inner.clone(),
            permit,
        }))
    }
}

(PermitService simply carries the permit alongside the inner service, so the
semaphore slot is released when the connection goes away.)

And it’s used like this:

Rust code

let conns_limit = Semaphore::new(128);
let svc = ServiceBuilder::new()
    // (in-flight limiting layers elided)
    .service(svc);

let factory = ServiceFactory {
    inner: svc,
    semaphore: PollSemaphore::new(Arc::new(conns_limit)),
    permit: None,
};

let server = hyper::Server::bind(&addr).serve(factory);
info!("listening on {}", addr);

That one’s significantly more verbose, but it’s also very re-usable, that’s
kind of tower’s deal. Which means it probably
exists in a crate somewhere and I’m an idiot for rewriting it in every project.
Or I should just publish my own crate.

But the elegance here is that… it’s a Service that takes an &AddrStream
and returns a Service. That delicious meta bit can get confusing at times,
but it turns out to be super convenient.

Shortly after I deployed that change, things broke: the site appeared
unavailable again. The attack hadn’t resumed yet though — the available 512
connection slots had simply filled up, were idle, and Cloudflare edge nodes
weren’t able to establish any more connections.

My solution was to simply enforce idle read and write timeouts: any connection
that hasn’t been read from or written to in a few seconds gets reset.
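My server does this asynchronously via tokio-io-timeout, but the same idea exists on the standard library’s blocking sockets; this self-contained sketch shows an idle read erroring out instead of hanging forever:

```rust
use std::{
    io::{ErrorKind, Read},
    net::{TcpListener, TcpStream},
    time::Duration,
};

fn main() -> std::io::Result<()> {
    // a local listener that accepts and then stays silent, like an idle peer
    let ln = TcpListener::bind("127.0.0.1:0")?;
    let addr = ln.local_addr()?;
    std::thread::spawn(move || {
        let _conn = ln.accept();
        std::thread::sleep(Duration::from_secs(5));
    });

    let mut stream = TcpStream::connect(addr)?;
    // any read that makes no progress for 200ms errors out instead of
    // sitting there forever, holding resources
    stream.set_read_timeout(Some(Duration::from_millis(200)))?;

    let mut buf = [0u8; 1];
    let err = stream.read(&mut buf).unwrap_err();
    // Unix surfaces this as WouldBlock, Windows as TimedOut
    assert!(matches!(err.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut));
    println!("idle read timed out as expected");
    Ok(())
}
```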

Because I brought in a custom “acceptor” for that, it was also a good occasion
to close another hole: my server was listening on all interfaces. If you could
guess the IP (and there are entire sites dedicated to that), you could hammer
it directly, without proxying through Cloudflare.

That didn’t happen, but it sure could have.

Instead, we want the origin to only accept connections from Cloudflare IP
ranges.

So, here’s an acceptor that does both, and updates the IP ranges every now
and then:

Rust code

use std::{
    collections::HashSet,
    net::{IpAddr, SocketAddr},
    str::FromStr,
    sync::Arc,
    time::Duration,
};

use arc_swap::ArcSwap;
use hyper::server::accept::{from_stream, Accept};
use ipnet::IpNet;
use tokio_io_timeout::{TimeoutReader, TimeoutWriter};
use tracing::{debug, info, warn};

pub const IPS_V4: &str = include_str!("ips-v4.txt");
pub const IPS_V6: &str = include_str!("ips-v6.txt");
pub const IPS_LOCAL: &str = "127.0.0.0/8";

pub fn timeout_acceptor(addr: SocketAddr) -> impl Accept {
    let ip_nets = parse_ip_nets(&[IPS_V4, IPS_V6, IPS_LOCAL]).unwrap();
    info!("Loaded {} ip nets", ip_nets.len());
    let ip_nets = Arc::new(ArcSwap::from_pointee(ip_nets));

    // refresh the allowed ranges in the background, every now and then
    tokio::spawn({
        let ip_nets = ip_nets.clone();
        async move {
            loop {
                tokio::time::sleep(Duration::from_secs(3600)).await;
                if let Err(e) = update_ip_nets(&ip_nets).await {
                    warn!("Could not update IP nets: {e}");
                }
            }
        }
    });

    let localhost = IpAddr::from([127, 0, 0, 1]);

    from_stream(async_stream::try_stream! {
        let ln = tokio::net::TcpListener::bind(addr).await?;
        loop {
            let (stream, addr) = ln.accept().await?;

            if let Some(net) = ip_nets.load().iter().find(|net| net.contains(&addr.ip())) {
                debug!("Allowing {} through net {net}", addr.ip());
            } else {
                debug!("Disallowing {}", addr.ip());
                continue;
            }

            // local connections are exempt from timeouts
            let should_timeout = addr.ip() != localhost;

            let mut stream = TimeoutReader::new(stream);
            if should_timeout {
                stream.set_timeout(Some(Duration::from_secs(5)));
            }
            let mut stream = TimeoutWriter::new(stream);
            if should_timeout {
                stream.set_timeout(Some(Duration::from_secs(5)));
            }
            yield Box::pin(stream);
        }
    })
}

fn parse_ip_nets(sources: &[&str]) -> color_eyre::Result<HashSet<IpNet>> {
    let mut set: HashSet<IpNet> = Default::default();
    for input in sources {
        for line in input.lines() {
            let line = line.trim();
            if line.is_empty() || line.starts_with('#') {
                continue;
            }
            let ip_net = IpNet::from_str(line)?;
            set.insert(ip_net);
        }
    }
    Ok(set)
}

async fn update_ip_nets(ip_nets_handle: &ArcSwap<HashSet<IpNet>>) -> color_eyre::Result<()> {
    let client = reqwest::Client::new();

    let mut sources = vec![IPS_LOCAL.to_string()];
    for url in [
        "https://www.cloudflare.com/ips-v4",
        "https://www.cloudflare.com/ips-v6",
    ] {
        sources.push(client.get(url).send().await?.text().await?);
    }
    let sources: Vec<&str> = sources.iter().map(|x| x.as_str()).collect();
    let ip_nets = parse_ip_nets(&sources[..])?;
    info!(
        "Loaded {} ip nets (had {})",
        ip_nets.len(),
        ip_nets_handle.load().len()
    );
    ip_nets_handle.store(Arc::new(ip_nets));
    Ok(())
}

Featured here: reqwest for easy async HTTP
requests, arc-swap to avoid wrapping the
IP set in a Mutex or RwLock, and
async-stream to be able to create
an async Stream using generator syntax (yield etc.).

It’s pretty naive code, but I think it looks pretty neat.

A better way to only accept connections from Cloudflare IPs would be to set up
firewall rules (and update them periodically). Then any disallowed connection
could be simply refused (thus not taking space in the accept queue), or dropped
(thus wasting the attacker’s time).

But hey, gotta leave stuff to do for later!
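For the curious, the core of that allowlist check (is this peer inside an allowed prefix?) is plain bit masking. Here's a stdlib-only IPv4 sketch of what ipnet's `contains` does for the acceptor (`cidr_contains` is my name for it):

```rust
use std::net::Ipv4Addr;

/// Returns true if `ip` falls inside the CIDR block `net`/`prefix_len`.
/// (ipnet's IpNet::contains does this, plus IPv6 and parsing.)
fn cidr_contains(net: Ipv4Addr, prefix_len: u32, ip: Ipv4Addr) -> bool {
    if prefix_len == 0 {
        return true; // 0.0.0.0/0 matches everything
    }
    // keep only the top `prefix_len` bits of each address, compare the rest
    let mask = u32::MAX << (32 - prefix_len);
    (u32::from(net) & mask) == (u32::from(ip) & mask)
}

fn main() {
    // 198.51.100.0/24 is a documentation range, used here as an example
    let net = Ipv4Addr::new(198, 51, 100, 0);
    assert!(cidr_contains(net, 24, Ipv4Addr::new(198, 51, 100, 7)));
    assert!(!cidr_contains(net, 24, Ipv4Addr::new(198, 51, 101, 7)));
    println!("ok");
}
```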

Also, using it is a breeze:

Rust code

let acceptor = timeout_acceptor(addr);
// instead of `Server::bind`:
let server = hyper::Server::builder(acceptor).serve(factory);
info!("listening on {}", addr);

Because it doesn’t return an AddrStream but instead a pinned, boxed stream, I
had to adjust a few types here and there, but that was about it.

While I was at it, I also made sure every request goes through a
tracing-instrumented tower service…

Rust code

/// Extracts opentelemetry context from HTTP headers
pub struct IncomingHttpSpanService<S> {
    inner: S,
}

impl<S> Service<Request<Body>> for IncomingHttpSpanService<S>
where
    S: Service<Request<Body>> + Clone + Send + 'static,
{
    type Response = S::Response;
    type Error = S::Error;
    type Future = PostFuture<S::Future>;

    // (poll_ready / call elided: `call` opens an info-level span for the
    // request and wraps the inner future in a PostFuture)
}

// in PostFuture's `poll`: once the inner future resolves, record the
// response status on the span
let this = self.project();
let res = futures::ready!(this.inner.poll(cx));
if let Ok(res) = &res {
    this.span.record("http.status_code", &res.status().as_u16());
}
…so that all HTTP requests have their own info-level span, and show up in
Honeycomb like so:

Hey, those latencies don’t line up with the figures you mentioned earlier.

Haha, one thing at a time.

That lets me answer questions like “what RSS readers (that aren’t browsers) is
my audience using?”

I’m not sure what to do with that particular answer, but hey, now we know.

Also incredibly valuable: because spans are being sent instead of simple log
messages, we get these nice visualizations of where we’re actually spending time.

This is for a request to the index:

As I explained earlier, there’s a few DB queries involved, it’s fetching page
markup (which is pre-rendered, but also needs a DB query), and then it
truncates, which takes… forever.

So, immediately, I jump back to the code and find a ton of actionable
information there: for example, the size of my “connection pool” is only 10
(r2d2-sqlite’s default): under load, that’s
not enough.

How do I know? I can see it waiting for a checkout:

And this is all it took:

Rust code

let conn = info_span!("sql.checkout").in_scope(|| self.content_pool.get())?;

I also noticed a couple other things, like:

I’m running blocking code (SQLite queries) in an async context. I should
be using spawn_blocking for that. (There’s no good lint for that)
I’m not caching prepared statements. This doesn’t show up as a hotspot, but
switching from prepare to prepare_cached is trivial, so why not?
I don’t have any indexes in my SQLite database! What a great freebie I kept for myself.

And finally, I’m spending way more time on truncate_html than I thought.

It is non-trivial:

Rust code

impl Filter for TruncateHtmlFilter {
    #[instrument(name = "truncate_html", skip(self, input, runtime), fields(html_len))]
    fn evaluate(
        &self,
        input: &dyn ValueView,
        runtime: &dyn liquid_core::Runtime,
    ) -> liquid_core::Result<Value> {
        let span = Span::current();

        convert_errors(|| {
            let kinput = input.to_kstr();
            let input = kinput.as_str();
            span.record("html_len", &input.len());

            let mut output: Vec<u8> = Vec::new();
            let output_sink = |c: &[u8]| {
                output.extend_from_slice(c);
            };

            // the filter's (optional) length argument, in characters of text
            let max: u64 = self
                .max
                .as_ref()
                .and_then(|p| p.try_evaluate(runtime))
                .and_then(|l| l.as_scalar().and_then(|s| s.to_integer().map(|x| x as u64)))
                .unwrap_or(120);

            let char_count = AtomicU64::new(0);
            let mut skip: bool = false;

            let mut rewriter = HtmlRewriter::new(
                Settings {
                    element_content_handlers: vec![
                        element!("*", |el| {
                            // past the limit, drop everything starting
                            // from the next paragraph
                            if char_count.load(Ordering::SeqCst) > max && el.tag_name() == "p" {
                                skip = true;
                            }
                            if skip {
                                el.remove();
                            }
                            Ok(())
                        }),
                        text!("*", |txt| {
                            if matches!(txt.text_type(), TextType::Data) {
                                char_count.fetch_add(txt.as_str().len() as u64, Ordering::SeqCst);
                            }
                            Ok(())
                        }),
                    ],
                    ..Settings::default()
                },
                output_sink,
            );

            rewriter
                .write(input.as_bytes())
                .map_err(|e| liquid_core::Error::with_msg(format!("rewriting error: {:?}", e)))?;
            drop(rewriter);

            Ok(Value::scalar(String::from_utf8_lossy(&output[..]).to_string()))
        })
    }
}


…but also, it feeds the whole article through, well past the initial 120
characters of text it’s trying to keep. That’s wasteful, and it’s a very
predictable transform that could be cached, just like I cache rendered markdown.
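Caching a deterministic transform like that is just “key the output by a hash of the input”. A minimal stdlib sketch of the idea (not my site’s actual cache; `TransformCache` and its API are names I made up):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Memoizes an expensive, deterministic transform, keyed by input hash.
/// (DefaultHasher is fine for a sketch; a real cache would also worry
/// about eviction and collisions.)
struct TransformCache {
    entries: HashMap<u64, String>,
}

impl TransformCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Returns the cached output, computing it with `f` only on a miss.
    fn get_or_compute(&mut self, input: &str, f: impl Fn(&str) -> String) -> &str {
        let mut h = DefaultHasher::new();
        input.hash(&mut h);
        self.entries.entry(h.finish()).or_insert_with(|| f(input)).as_str()
    }
}

fn main() {
    use std::cell::Cell;
    let mut cache = TransformCache::new();
    let calls = Cell::new(0);
    let transform = |s: &str| {
        calls.set(calls.get() + 1);
        s.to_uppercase()
    };
    assert_eq!(cache.get_or_compute("hello", &transform), "HELLO");
    assert_eq!(cache.get_or_compute("hello", &transform), "HELLO");
    // the second lookup was a hit: the transform only ran once
    assert_eq!(calls.get(), 1);
    println!("ok");
}
```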

So what I did next…

You fixed all of these, right?

Absolutely not! I ignored all of these, and jumped straight to implementing
caching.

Latency was always “fine” when the site wasn’t under attack. Knowing about these
bothers me, but they can wait. What I really needed was some form of whole-page
cache.
I’ve put that off for so long, and it was so easy.

I simply plugged in moka, a “fast, concurrent
cache library for Rust”, that supports async, and time-based expiry.

So, boom, build a cache:

Rust code

let server_state = ServerState {
    config: Arc::clone(&config),
    // this is the only new field
    rendered_templates_cache: Cache::builder()
        // TTL: 5 minutes
        .time_to_live(Duration::from_secs(5 * 60))
        // TTI: 1 minute
        .time_to_idle(Duration::from_secs(60))
        .build(),
};

And the serve_template function becomes:

Rust code

#[instrument(skip(self, globals), fields(cache.status))]
async fn serve_template(
    &self,
    template_name: &str,
    mut globals: Object,
    content_type: &'static str,
) -> Result<Response> {
    // (body elided: try the cache first, render and store the resulting
    // Bytes on a miss)
}

Cloning a Bytes, which
happens on cache hit, simply increments a reference count, so everything is nice
and cheap.

The cache-control header here is technically wrong, since I don’t actually
ever want Cloudflare to cache it, but since it already doesn’t… I’ll fix it
later.

With that little change (it rendered unconditionally beforehand), all pages for
logged-out users are cached for 1m-5m, depending on whether they’re being
requested a lot.
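That “1m to 5m, depending” behavior falls out of combining TTL and TTI: an entry dies five minutes after creation no matter what, or after one idle minute, whichever comes first. A stdlib sketch of just the expiry rule moka applies (`Entry` and `Expiry` are my names):

```rust
use std::time::{Duration, Instant};

/// Expiry bookkeeping for one cache entry, moka-style:
/// time-to-live counts from creation, time-to-idle from last access.
struct Entry {
    created: Instant,
    last_access: Instant,
}

struct Expiry {
    ttl: Duration,
    tti: Duration,
}

impl Expiry {
    fn is_expired(&self, e: &Entry, now: Instant) -> bool {
        now.duration_since(e.created) >= self.ttl
            || now.duration_since(e.last_access) >= self.tti
    }
}

fn main() {
    let expiry = Expiry {
        ttl: Duration::from_secs(5 * 60), // hard cap: 5 minutes
        tti: Duration::from_secs(60),     // idle cap: 1 minute
    };
    let created = Instant::now();
    let mut entry = Entry { created, last_access: created };

    // 90s in, but touched 30s ago: still alive (TTI not exceeded)
    entry.last_access = created + Duration::from_secs(60);
    assert!(!expiry.is_expired(&entry, created + Duration::from_secs(90)));

    // untouched for 61s: idle-expired
    assert!(expiry.is_expired(&entry, created + Duration::from_secs(121)));

    // constantly accessed, but past 5 minutes total: TTL-expired anyway
    entry.last_access = created + Duration::from_secs(299);
    assert!(expiry.is_expired(&entry, created + Duration::from_secs(300)));
    println!("ok");
}
```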

And that makes the difference between this:

Shell session

$ oha -z 5s http://localhost -H '(valid cookies omitted)'
Success rate: 1.0000
Total: 5.0011 secs
Slowest: 1.0898 secs
Fastest: 0.0791 secs
Average: 0.5314 secs
Requests/sec: 87.9810

Total data: 14.61 MiB
Size/request: 34.00 KiB
Size/sec: 2.92 MiB

Response time histogram:
0.171 [20] |■■■■■■■
0.263 [18] |■■■■■■
0.355 [48] |■■■■■■■■■■■■■■■■■
0.447 [50] |■■■■■■■■■■■■■■■■■
0.539 [89] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.630 [82] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.722 [63] |■■■■■■■■■■■■■■■■■■■■■■
0.814 [37] |■■■■■■■■■■■■■
0.906 [21] |■■■■■■■
0.998 [7] |■■
1.090 [5] |■

Latency distribution:
10% in 0.2837 secs
25% in 0.3926 secs
50% in 0.5350 secs
75% in 0.6570 secs
90% in 0.7618 secs
95% in 0.8596 secs
99% in 0.9995 secs

Details (average, fastest, slowest):
DNS+dialup: 0.0006 secs, 0.0001 secs, 0.0011 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs

Status code distribution:
[200] 440 responses

(87 requests per second, painfully — P95 of 860ms)

And this:

Shell session

$ oha -z 5s http://localhost
Success rate: 1.0000
Total: 5.0017 secs
Slowest: 0.0077 secs
Fastest: 0.0002 secs
Average: 0.0015 secs
Requests/sec: 34073.2242

Total data: 5.53 GiB
Size/request: 34.00 KiB
Size/sec: 1.10 GiB

Response time histogram:
0.001 [7190] |■■■■
0.001 [22000] |■■■■■■■■■■■■■■■
0.001 [40329] |■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.002 [46719] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.002 [32402] |■■■■■■■■■■■■■■■■■■■■■■
0.002 [14861] |■■■■■■■■■■
0.003 [4874] |■■■
0.003 [1364] |
0.004 [450] |
0.004 [130] |
0.004 [106] |

Latency distribution:
10% in 0.0008 secs
25% in 0.0011 secs
50% in 0.0014 secs
75% in 0.0018 secs
90% in 0.0022 secs
95% in 0.0024 secs
99% in 0.0029 secs

Details (average, fastest, slowest):
DNS+dialup: 0.0008 secs, 0.0001 secs, 0.0025 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0003 secs

Status code distribution:
[200] 170425 responses

(34K requests per second, easily — P95 of 2.4ms)

It’s important to note that this code correctly identifies valid session
cookies. These are signed, so if you want to hammer an uncached endpoint, you
now need to either:

Somehow reverse how cookies are signed + find the secret key (which is super
easy to swap, it’ll just log out everyone once. No biggie)
Become a subscriber so you get a legitimate cookie (which will show up in
traces, easy to block)
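To make the first option concrete: signed cookies are just “value plus keyed digest”. Here’s a TOY sketch of the shape (DefaultHasher is not cryptographic and this is not my site’s actual scheme; a real implementation would use an HMAC):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// TOY signing: appends a keyed digest to the value.
/// NOT secure; illustrates the structure only.
fn sign(secret: &str, value: &str) -> String {
    let mut h = DefaultHasher::new();
    (secret, value).hash(&mut h);
    format!("{value}.{:016x}", h.finish())
}

/// Verifies `cookie` ("value.signature") against the secret.
/// The edge can't run this check without the secret, which is exactly why
/// cookie-based cache bypass can't tell real sessions from fake ones.
fn verify(secret: &str, cookie: &str) -> Option<String> {
    let (value, _sig) = cookie.rsplit_once('.')?;
    if sign(secret, value) == cookie {
        Some(value.to_string())
    } else {
        None
    }
}

fn main() {
    let secret = "rotate-me-to-log-everyone-out";
    let cookie = sign(secret, "user-1234");
    assert_eq!(verify(secret, &cookie), Some("user-1234".to_string()));
    // a cookie signed under a different secret is rejected
    assert_eq!(verify("some-other-secret", &cookie), None);
    println!("ok");
}
```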

Which leaves cached endpoints as durable attack targets: but because the maximum
RPS (for the costliest page on the site) went from ~90 to 34K (a 37677%
increase, for those keeping track), an attack would need to be above
Cloudflare’s threshold to even make a dent, at which point their DDoS protection
would kick in.

As for the video platform: I made no changes. None. It already had good
observability, and I had planned for this eventuality from the start: the only
thing I was trying to prevent wasn’t downtime, it was a huge AWS bill. And I’m
happy to report that 100% of the requests that were served, were served from the
SSD cache.

Also, most requests were served, since there’s eight separate instances of the
video app in eight different regions (Paris, Tokyo, Washington, São Paulo, etc.)

After the storm: collateral damage
Another secondary objective of a DDoS is to generate collateral damage: to force
the target to block legitimate traffic in response to the attack, losing out on
potential business and/or generally annoying people.

Blocking an entire AS is often excessive (but oh-so-convenient), blocking Tor
exit nodes is very tempting, and sometimes you just don’t want to hear from a
certain country for a while. But these are all overshoots.

Ever since the attack, I’ve gotten messages from regular readers who couldn’t
access my website, because they happened to be running the latest Google Chrome
on Linux.

I tried everything: the “I’m Under Attack” mode has been disabled for a while,
I started allowlisting the IPs of my readers, and forcing a “managed challenge”
for that user-agent (hoping it would override the automated block), but nothing
seemed to do the trick, except for switching to another browser.

Eventually, I realized Cloudflare had nothing to do with it. During the attack,
I had started banning that same user-agent (returning a 403) from my origin
directly, and I just… forgot to disable that after the attack was over.

The confusing bit was that readers were seeing a Cloudflare error page saying
“Sorry, you have been blocked”.

And even after I removed my counter-measure, they’re still seeing it. Living
behind someone else’s edge is a mixed blessing.

This is also partly the point: you’re running left and right trying to make
things better, you get sloppy. That’s how it’s supposed to work.


Performing a DDoS is technically an “illegal cybercrime”, but realistically,
receiving one is just another Saturday.

Until next time, take excellent care of yourselves!

Update: hi again!
Minutes after I posted this article, the attack resumed. Same shit, different
AS. Here are the newcomers:

AS45102: Alibaba US (Global, cloud provider)
AS16276: OVH (France, cloud provider)
AS141677: Nathosts Limited (Hong Kong)
AS8075: Microsoft (US/HK/Brazil, “Corp MSN”?)
AS328386: Adnexus (South Africa)
AS7303: Telecom Argentina
AS24940: Hetzner (Germany, cloud provider)
AS45758: Triple T Broadband (Thailand, ISP)
AS37963: Alibaba Hangzhou (Global, cloud provider)
AS63949: Linode (US, VPS provider)

At first, the site immediately went down, returning 522 (Origin Connection Time-out).

Turns out a limit of 256 is way too low: Cloudflare has ~275 POPs, and each of
them might establish a few connections to my origin. I raised the limit to 2048
(both for connections and in-flight requests), and immediately the 522 line went
down, and a 200 line went up!

Shortly after, Cloudflare caught up with them and they saw a 403 spike. Then
they went away. Site’s back up and fast as ever.
