Optimizing Docker image size and why it matters

Why does size matter? Docker images are a core component in our development and production lifecycles. Having a large image can make every step of the process slow and tedious. Size affects how long it takes to build the image locally, on CI/CD, or in production, and it affects how quickly we can spin up new instances that run our code. Reducing the size of the image benefits both developers and users.

[Figure: Illustration of pulling a large Docker image layer]

So, what can you do about it?

There are several important considerations that go into picking a base image. In the context of optimizing image size, each base image comes with its own dependencies and footprint.

Usually, the first choice you need to make is which distribution you want. Image sizes vary between them:

  • Debian – 124 MB
  • Ubuntu – 73 MB
  • Alpine – 6 MB
  • CentOS – 231 MB
  • Fedora – 153 MB

It’s not just a matter of image size, though. Each of these images comes with its own philosophy and tools you might prefer. Alpine is lightweight, security-focused, and based on musl-libc instead of glibc. Ubuntu has long-term enterprise support, comes bundled with many utilities, supports a vast number of packages, and so on.

Next, you can decide if you want your parent image to come bundled with additional dependencies. Often you need to weigh the convenience of having a base image with all dependencies pre-installed against the size of the resulting image.

For example, if you have a Node.js app you can use the node image, or python if you’re using Python, etc. Within those images, usually you can specify the distribution you’d like using the appropriate tag, for example, node:alpine for Alpine Linux and python:3-bullseye for Debian Bullseye.

The less specific or specialized your parent image is, the more control you have over the image size:

A closer look at node:16-bullseye shows that it has buildpack-deps as its parent image, which comes with lots of dependencies you might not need. So if you’re willing to take care of the Node.js installation, you can do it directly on the Debian image and reduce the image size considerably.
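
A minimal sketch of that approach is shown here. It assumes the Node.js version packaged by Debian is acceptable for your app (it may be older than the one shipped in node:16-bullseye), it uses the slim Debian variant as the base, and index.js is a hypothetical entry point:

```dockerfile
# Start from a plain Debian base instead of node:16-bullseye.
FROM debian:bullseye-slim

# Install only Node.js and npm from Debian's own repositories.
# Assumption: the distribution-packaged Node.js version is recent enough.
RUN apt-get update \
    && apt-get install -y --no-install-recommends nodejs npm \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY . .

# index.js is a hypothetical entry point used for illustration.
CMD ["node", "index.js"]
```

The trade-off is that you now own the Node.js installation and its updates, in exchange for leaving out everything buildpack-deps would have brought along.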

Docker makes it especially easy to add files you didn’t mean to add to an image. Each ADD or COPY and even the RUN commands in your Dockerfile can include files you weren’t expecting.

It isn’t easy to see exactly which files are added and where. So the first step is to be able to quickly inspect which files are added to each layer. Each layer corresponds to specific commands in your Dockerfile, and from there we can decide what and how to optimize.

There are 3 easy methods you can use:

Docker CLI

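You can save any local image as a tar archive and then inspect its contents. In the commands below, <image-name> and <layer-name> are placeholders for your image and for one of the layer directories inside the extracted archive. The exact layout of the archive can vary between Docker versions.

    bash-3.2$ docker save -o image.tar <image-name>
    bash-3.2$ mkdir image
    bash-3.2$ tar -xf image.tar -C image
    bash-3.2$ cd image
    bash-3.2$ tar -xf <layer-name>/layer.tar
    bash-3.2$ ls
    etc  tmp  usr  var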

Dive

An excellent open-source tool to visualize and analyze local Docker images.

Contains.dev

Our contains.dev offers many tools to analyze layers, their contents, and their size, including a navigable treemap of your image.

With these methods, you’re set up to assess improvements to your image size. There are a few common areas that have straightforward solutions that improve the overall image size:

.dockerignore

An important way to ensure you’re not bringing in unintended files is to define a .dockerignore file. This file uses a syntax similar to .gitignore.
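
As an illustration, a .dockerignore for a Node.js project might look like the following. The entries are assumptions about a typical project layout, not taken from a specific repository:

```dockerignore
# Version control metadata
.git

# Dependencies that get reinstalled inside the image
node_modules

# Local logs and environment files
*.log
.env
```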

Then, when you run COPY . ., Docker skips the files matched by your .dockerignore. Defining this file has the added benefit of reducing the size of the Docker build context, which is the set of files Docker gathers when building an image. A smaller build context results in faster build times.

Package managers

Depending on the package manager you’re using, you can instruct it to install only the packages you explicitly asked for.

For example:

  • apt-get install -y --no-install-recommends – don’t install optional recommended packages.
  • npm install --production – don’t install development dependencies.
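
In a Dockerfile, the npm flag might be used as in the sketch below. It assumes a project with package.json and package-lock.json at its root and a hypothetical index.js entry point. Newer npm releases also accept --omit=dev for the same purpose.

```dockerfile
FROM node:16-alpine
WORKDIR /app

# Copy only the manifests first so the dependency layer can be cached.
COPY package.json package-lock.json ./

# Skip devDependencies; only packages needed at runtime are installed.
RUN npm install --production

# Copy the application code (index.js is a hypothetical entry point).
COPY . .
CMD ["node", "index.js"]
```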

Caches

Many processes create temporary files, caches, and other files that have no benefit to your specific use case. For example, running apt-get update updates internal files that you don’t need to persist once the packages you need are installed. So we can add rm -rf /var/lib/apt/lists/* as part of the same layer to remove them (removing them in a separate RUN keeps them in the original layer, see “Avoid duplicating files”). Docker recognizes that this is an issue and went as far as running apt-get clean automatically in its official Debian and Ubuntu images.

Each layer in your image might have a leaner version that is sufficient for your needs. The best way to see that is to audit your layers with the techniques mentioned above.

Docker uses read-only layers of files that are overlaid on top of each other. This means that when you make changes to files that come from previous layers, they’re copied into the new layer you’re creating. It isn’t always obvious that this is happening, for example:
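
A minimal sketch of such a case, using a hypothetical script named entrypoint.sh:

```dockerfile
FROM debian:bullseye-slim

# Layer 1: the script is added to the image.
COPY entrypoint.sh /usr/local/bin/entrypoint.sh

# Layer 2: changing the mode stores a second full copy of the file.
RUN chmod +x /usr/local/bin/entrypoint.sh
```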

Running chmod on an existing file looks harmless, but Docker can’t change the file in its original layer, so the step produces a new layer in which the file is copied in its entirety with the new permissions.

In newer versions of Docker, with BuildKit enabled, the permissions can be set as part of the COPY instruction itself through its --chmod flag, which avoids this issue.
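
A sketch of the BuildKit variant, again with the hypothetical entrypoint.sh script (it assumes the build runs with BuildKit enabled):

```dockerfile
# syntax=docker/dockerfile:1
FROM debian:bullseye-slim

# Set the mode while copying, so the file is stored only once.
COPY --chmod=755 entrypoint.sh /usr/local/bin/entrypoint.sh
```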

Other non-intuitive cases of file duplication between layers:
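
A sketch of such a Dockerfile, with a hypothetical file named data.bin:

```dockerfile
FROM debian:bullseye-slim

# Copy 1: the file enters the image.
COPY data.bin /data/data.bin

# Copy 2: changing the file's permissions stores it again in a new layer.
RUN chmod 600 /data/data.bin

# Copy 3: moving the file stores it once more under the new path.
RUN mv /data/data.bin /data/renamed.bin

# Deleting only hides the file; the earlier layers keep their copies.
RUN rm /data/renamed.bin
```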

In this example, we created 3 copies of our file throughout different layers of the image. Despite removing the file in the last layer, the image still contains the file in other layers, which contributes to the overall size of the image.

Making a small change to a file or moving it will create an entire copy of the file. Deleting a file will only hide it from the final image, but it will still exist in its original layer, taking up space. This is all a result of how images are structured as a series of read-only layers. This structure makes layers reusable and makes storing and running images efficient. But it also means we need to be aware of the underlying structure and take it into account when we create our Dockerfile.

Some Dockerfile steps exist only to build the program and aren’t needed at runtime.

The Dockerfile might include several steps that take care of setting up an environment for compiling the program that will run at runtime. This is especially common for compiled languages like Go.

To solve this issue, Docker introduced multi-stage builds starting from Docker Engine v17.05. This allows us to perform all preparation steps as before, but then copy only the essential files or output from these steps.

The effect on image size can be dramatic.

Take a basic example that compiles a simple Go program. The naive single-stage build results in a 961 MB image. With a multi-stage build, we copy just the compiled binary into the final image, which results in a 7 MB image. The single-stage build can be improved by choosing a leaner parent image, but it would still fall short of the multi-stage result.
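
A multi-stage Dockerfile for such a program might look like the sketch below. The image tags, the stage name build, and the single main.go source file are assumptions made for illustration, and the resulting size depends on the program and the base images chosen:

```dockerfile
# Stage 1: compile the program using the full Go toolchain image.
FROM golang:1.17 AS build
WORKDIR /src
# main.go is a hypothetical single-file program.
COPY main.go .
# Disable cgo so the binary is statically linked and runs on Alpine.
RUN CGO_ENABLED=0 go build -o /usr/local/bin/app main.go

# Stage 2: start over from a small base and copy only the binary.
FROM alpine:3.15
COPY --from=build /usr/local/bin/app /usr/local/bin/app
ENTRYPOINT ["/usr/local/bin/app"]
```

Only the last stage ends up in the final image, so the Go toolchain and the source code from the first stage are left behind.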

Multi-stage builds introduce a lot of flexibility with support for advanced cases like multiple FROM statements, copying a single file from an external image, and more. These techniques can be combined to reduce the image size to a minimum. Check out the official Docker docs for more info.
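
For instance, copying a single file from an external image can be sketched as follows. The nginx image and the paths are only an illustration, and the base image is an assumption:

```dockerfile
FROM alpine:3.15

# Pull one file out of another image without using it as a base.
COPY --from=nginx:latest /etc/nginx/nginx.conf /etc/nginx/nginx.conf
```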

Keeping your image optimized and small pays huge dividends in the development process and in going to production. The techniques above will help you gain a good understanding of what’s going on inside your image, which has benefits beyond the optimization work.

One thought on “Optimizing Docker image size and why it matters”

  1. Aditya

    A common mistake that's not covered in this article is the need to perform your add & remove operations in the same RUN command. Doing them separately creates two separate layers which inflates the image size.

    This creates two image layers – the first layer has all the added foo, including any intermediate artifacts. Then the second layer removes the intermediate artifacts, but that's saved as a diff against the previous layer:

        RUN ./install-foo
        RUN ./cleanup-foo
    

    Instead, you need to do them in the same RUN command:

        RUN ./install-foo && ./cleanup-foo
    

    This creates a single layer which has only the foo artifacts you need.

    This is why the official Dockerfile best practices show[1] the apt cache being cleaned up in the same RUN command:

        RUN apt-get update && apt-get install -y \
            package-bar \
            package-baz \
            package-foo \
            && rm -rf /var/lib/apt/lists/*
    

    [1] https://docs.docker.com/develop/develop-images/dockerfile_be…