Rock IT

How to write excellent Dockerfiles

Hi! I've been working with Docker for some time now. Writing Dockerfiles is an essential part of that process, and I want to share a few tips on how to make them better.

Our goals:
We want to minimize image size, build time and number of layers.
We want to maximize build cache usage and Dockerfile readability.
We want to make working with our container as pleasant as possible.

TL;DR

This post is filled with examples and detailed descriptions, so here is a quick summary.

  • Write .dockerignore file
  • Container should do one thing
  • Understand Docker caching! Order your COPY and RUN commands to take advantage of it.
  • Merge multiple RUN commands into one
  • Remove unneeded files after each step
  • Use proper base image (alpine versions should be enough)
  • Set WORKDIR and CMD
  • Use ENTRYPOINT when you have more than one command and/or need to update files using runtime data
  • Use exec inside entrypoint script
  • Prefer COPY over ADD
  • Specify default environment variables, ports, and volumes inside Dockerfile

Practical example

So you just finished reading my tips. That's cool! But you may ask: how do I introduce them into my Dockerfiles, and what's the difference, anyway?

I've prepared a small Dockerfile with almost all possible mistakes. Next, we'll fix it! Let's assume that we want to dockerize a small Node.js web application. Here it is (the CMD is convoluted and probably doesn't work, but it's just an example):

FROM ubuntu

ADD . /app

RUN apt-get update
RUN apt-get upgrade -y
RUN apt-get install -y nodejs ssh mysql
RUN cd /app && npm install

# this should start three processes, mysql and ssh
# in the background and node app in foreground
# isn't it beautifully terrible? <3
CMD mysql & sshd & npm start

We could build it using docker build -t wtf .

Can you spot all mistakes here? No? Let's fix them together, one by one.

1. Write .dockerignore

When building an image, Docker first has to prepare the context - gather all files that may be used in the process. The default context contains all files in the Dockerfile's directory. Usually we don't want to include the .git directory, downloaded libraries, or compiled files there. A .dockerignore file works exactly like .gitignore, for example:

.git/
node_modules/
dist/

2. Container should do one thing

Technically, you CAN start multiple processes inside a Docker container. You CAN put a database, frontend and backend applications, ssh, and supervisor into one Docker image. But it will bite you:

  • long build times (change in e.g. frontend will force the whole backend to rebuild)
  • very large images
  • hard logging from many applications (no more simple stdout)
  • wasteful horizontal scaling
  • problems with zombie processes - you have to remember to use a proper init process

My advice is to prepare a separate Docker image for each component, and use Docker Compose to easily start multiple containers at the same time.
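For our example, that split could be sketched in a docker-compose.yml like the one below. This is only an illustration, not part of the original setup - the service names, ports, and the password are placeholders:

```yaml
# docker-compose.yml - a minimal sketch; names and values are examples
version: "2"
services:
  app:
    build: .
    ports:
      - "3000:3000"
    depends_on:
      - db
  db:
    image: mysql:5.7
    environment:
      MYSQL_ROOT_PASSWORD: example
```

A single docker-compose up then starts both containers, each doing one thing.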

Let's remove unnecessary packages from our Dockerfile. SSH can be replaced with docker exec.

FROM ubuntu

ADD . /app

RUN apt-get update
RUN apt-get upgrade -y

# we should remove ssh and mysql, and use
# separate container for database 
RUN apt-get install -y nodejs  # ssh mysql
RUN cd /app && npm install

CMD npm start

3. Merge multiple RUN commands into one

Docker is all about layers. Knowing how they work is essential.

  • Each instruction in a Dockerfile creates a so-called layer
  • Layers are cached and reused
  • Invalidating the cache of a single layer invalidates all subsequent layers
  • A layer is invalidated when its instruction changes, when the copied files differ, or when a build variable is different than before
  • Layers are immutable, so if we add a file in one layer and remove it in the next one, the image STILL contains that file (it's just not visible in the container)!
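The last point is easy to see for yourself. In this sketch (the file name and size are made up for illustration), the second RUN only hides the file - the image stays about 50 MB larger, because each layer is stacked immutably on the previous one:

```dockerfile
FROM alpine:3.4

# this layer permanently adds ~50 MB to the image...
RUN dd if=/dev/zero of=/tmp/big_file bs=1M count=50

# ...and this one only hides the file in a new layer on top
RUN rm /tmp/big_file
```

Inspecting the result with docker history shows the 50 MB layer is still there. Creating and removing the file in a single RUN command would avoid this.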

I like to compare a Docker image to an onion:

They both make you cry... err, no, not that. They both have layers. To access and modify an inner layer, you have to remove all the layers above it. Remember this and everything will be OK.

Let's optimize our example. We'll merge all RUN commands into one and remove apt-get upgrade, as it makes our build non-deterministic (we rely on base image updates instead):

FROM ubuntu

ADD . /app

RUN apt-get update \
    && apt-get install -y nodejs \
    && cd /app \
    && npm install

CMD npm start

Keep in mind that you should merge commands with a similar probability of changing. Currently, every time our source code changes, Node.js gets reinstalled. So, a better option is:

FROM ubuntu

RUN apt-get update && apt-get install -y nodejs 
ADD . /app
RUN cd /app && npm install

CMD npm start

4. Do not use 'latest' base image tag

The latest tag is the default one, used when no other tag is specified, so our instruction FROM ubuntu actually does exactly the same as FROM ubuntu:latest. But the latest tag will point to a different image when a new version is released, and your build may break. So, unless you are creating a generic Dockerfile that must stay up-to-date with the base image, provide a specific tag.

In our example, let's use 16.04 tag:

# it's that easy!
FROM ubuntu:16.04

RUN apt-get update && apt-get install -y nodejs 
ADD . /app
RUN cd /app && npm install

CMD npm start

5. Remove unneeded files after each RUN step

So, let's assume we updated apt-get sources, installed a few packages required for compiling others, and downloaded and extracted archives. We obviously don't need them in our final image, so let's clean up. Size matters!

In our example we can remove apt-get lists (created by apt-get update):

FROM ubuntu:16.04

RUN apt-get update \
    && apt-get install -y nodejs \
    # added lines
    && rm -rf /var/lib/apt/lists/*

ADD . /app
RUN cd /app && npm install

CMD npm start

6. Use proper base image

In our example, we used ubuntu. But why? Do we really need a general-purpose base image when we just want to run a node application? A better option is to use a specialized image with node already installed:

FROM node

ADD . /app
# we don't need to install node 
# anymore and use apt-get
RUN cd /app && npm install

CMD npm start

Or even better, we can use the alpine version (Alpine is a very tiny Linux distribution, about 4 MB in size, which makes it a perfect candidate for a base image):

FROM node:7-alpine

ADD . /app
RUN cd /app && npm install

CMD npm start

Alpine has its own package manager, called apk. It's a bit different from apt-get, but still quite easy to learn. It also has some really useful features, like the --no-cache and --virtual options. That way, we choose exactly what we want in our image, nothing more. Your disk will love you :)
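A sketch of how those two options are typically combined - the package names here are just a common example for native npm modules, not something our app necessarily needs:

```dockerfile
FROM node:7-alpine
WORKDIR /app
COPY package.json /app

# --no-cache: don't store the apk package index in the image
# --virtual .build-deps: group the build-only packages under one alias,
# so they can all be removed again in the same layer after npm install
RUN apk add --no-cache --virtual .build-deps make gcc g++ python \
    && npm install \
    && apk del .build-deps
```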

7. Set WORKDIR and CMD

The WORKDIR instruction changes the default directory in which subsequent RUN / CMD / ENTRYPOINT commands are executed.

CMD is the default command, executed when a container is started without another command specified. It's usually the most frequently performed action. Let's add both to our Dockerfile:

FROM node:7-alpine

WORKDIR /app
ADD . /app
RUN npm install

CMD ["npm", "start"]

You should put your command inside an array, one word per element (more in the official documentation).
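The array form matters for more than style. A quick comparison of the two alternatives (use one or the other, not both):

```dockerfile
# shell form: Docker wraps the command in "/bin/sh -c ...", so the shell
# becomes PID 1 and signals like SIGTERM may never reach npm itself
CMD npm start

# exec form: npm runs directly as PID 1 and receives signals itself
CMD ["npm", "start"]
```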

8. Use ENTRYPOINT (optional)

It's not always necessary, because an entrypoint adds complexity. How does it work?

The entrypoint is a script that runs instead of the command and receives the command as its arguments. It's a great way to create executable Docker images:

#!/usr/bin/env sh
# $0 is the script name,
# $1, $2, $3 etc. are the passed arguments
# $1 is our command
CMD=$1

case "$CMD" in
  "dev" )
    npm install
    export NODE_ENV=development
    exec npm run dev
    ;;

  "start" )
    # we can modify files here, using ENV variables passed in
    # the "docker create" command. It can't be done during the build process.
    echo "db: $DATABASE_ADDRESS" >> /app/config.yml
    export NODE_ENV=production
    exec npm start
    ;;

  * )
    # Run a custom command. Thanks to this branch we can still use
    # "docker run our_image /bin/bash" and it will work.
    # "$@" is the full command with its arguments (POSIX sh has no ${@:2})
    exec "$@"
    ;;
esac

Save it in your root directory as entrypoint.sh and make it executable (chmod +x entrypoint.sh). Usage in the Dockerfile:

FROM node:7-alpine

WORKDIR /app
ADD . /app
RUN npm install

ENTRYPOINT ["./entrypoint.sh"]
CMD ["start"]

Now we can run this image in an executable-like way:
docker run our-app dev
docker run our-app start
docker run -it our-app /bin/bash - this one will work too

9. Use "exec" inside entrypoint script

As you can see in the example entrypoint, we're using exec. Without it, we would not be able to stop our application gracefully (SIGTERM would be swallowed by the shell script). exec basically replaces the script's process with a new one, so all signals and exit codes work as intended.
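You can observe the effect outside Docker entirely. The sketch below writes two tiny "entrypoints" to /tmp (made-up file names, just for the demo): one that execs its command and one that plainly runs it. The exec'd command reports the same PID as its wrapper - which is exactly why signals sent to a container's PID 1 reach the app:

```shell
#!/bin/sh
# An entrypoint that uses exec: the command REPLACES the shell
cat > /tmp/entry_exec.sh <<'EOF'
#!/bin/sh
echo "$$"            # PID of the entrypoint shell
exec sh -c 'echo $$' # exec: prints the SAME PID
EOF

# An entrypoint without exec: the command runs as a forked child
cat > /tmp/entry_fork.sh <<'EOF'
#!/bin/sh
echo "$$"            # PID of the entrypoint shell
sh -c 'echo $$'      # fork: prints a DIFFERENT PID
EOF

chmod +x /tmp/entry_exec.sh /tmp/entry_fork.sh

# Count distinct PIDs each variant prints: 1 with exec, 2 without
echo "distinct PIDs with exec:    $(/tmp/entry_exec.sh | sort -u | wc -l)"
echo "distinct PIDs without exec: $(/tmp/entry_fork.sh | sort -u | wc -l)"
```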

10. Prefer COPY over ADD

COPY is simpler. ADD has extra logic for downloading remote files and extracting archives; more in the official documentation. Just stick with COPY.

EDIT: This point needs some explanation. ADD may be useful if your build depends on external resources, and you want proper build cache invalidation on change. It's not the best practice, but sometimes it's the only way.

Let's ADD... oops, COPY this to our example:

FROM node:7-alpine

WORKDIR /app

COPY . /app
RUN npm install

ENTRYPOINT ["./entrypoint.sh"]
CMD ["start"]

11. Optimize COPY and RUN

We should put the least frequently changing instructions at the top of our Dockerfiles to leverage caching.

In our example, the code will change often, and we don't want to reinstall the dependencies each time. We can copy package.json before the rest of the code, install the dependencies, and then add the remaining files. Let's apply that improvement to our Dockerfile:

FROM node:7-alpine

WORKDIR /app

COPY package.json /app
RUN npm install
COPY . /app

ENTRYPOINT ["./entrypoint.sh"]
CMD ["start"]

12. Specify default environment variables, ports and volumes

We probably need some environment variables to run our container. It's good practice to set default values in the Dockerfile. We should also expose all used ports and define volumes. If you're wondering why you'd create a volume in a Dockerfile, look here.

Next improvement to our example:

FROM node:7-alpine

# env variables required during build
ENV PROJECT_DIR=/app

WORKDIR $PROJECT_DIR

COPY package.json $PROJECT_DIR
RUN npm install
COPY . $PROJECT_DIR

# env variables that can change
# volume and port settings
# and defaults for our application
ENV MEDIA_DIR=/media \
    NODE_ENV=production \
    APP_PORT=3000

VOLUME $MEDIA_DIR
EXPOSE $APP_PORT

ENTRYPOINT ["./entrypoint.sh"]
CMD ["start"]

These variables will be available inside the container. If you need build-only variables, use build args instead.
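The difference can be sketched like this - APP_VERSION is a made-up example, not something our app actually uses:

```dockerfile
FROM node:7-alpine

# ARG values exist only while the image is being built
# and are NOT available in running containers
ARG APP_VERSION=1.0.0
RUN echo "building version $APP_VERSION"

# ENV values persist into every container created from the image
ENV NODE_ENV=production
```

The default can be overridden at build time with docker build --build-arg APP_VERSION=2.0.0 .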

13. Add metadata to image using LABEL

You can add metadata to an image, such as the maintainer's name or an extended description, using the LABEL instruction (previously there was a MAINTAINER instruction for this, but it's now deprecated). Metadata is sometimes used by external programs; for example, nvidia-docker requires the com.nvidia.volumes.needed label to work properly.

An example of metadata in our Dockerfile:

FROM node:7-alpine
LABEL maintainer "jakub.skalecki@example.com"
...

14. Add HEALTHCHECK

We can start a Docker container with the --restart always option; after the container crashes, the Docker daemon will try to restart it. That's very useful if your container has to be operational all the time. But what if the container is running, yet not available (infinite loop, invalid configuration, etc.)? With the HEALTHCHECK instruction we can tell Docker to periodically check our container's health. It can be any command that returns a 0 exit code if everything is OK, and 1 otherwise. You can read more about health checks in this excellent article.

Final change to our example:

FROM node:7-alpine
LABEL maintainer "jakub.skalecki@example.com"

ENV PROJECT_DIR=/app
WORKDIR $PROJECT_DIR

COPY package.json $PROJECT_DIR
RUN npm install
COPY . $PROJECT_DIR

ENV MEDIA_DIR=/media \
    NODE_ENV=production \
    APP_PORT=3000

VOLUME $MEDIA_DIR
EXPOSE $APP_PORT
# note: alpine-based images don't ship curl, so we install it first
RUN apk add --no-cache curl
HEALTHCHECK CMD curl --fail http://localhost:$APP_PORT || exit 1

ENTRYPOINT ["./entrypoint.sh"]
CMD ["start"]

curl --fail returns a non-zero exit code if the request fails.

For advanced users

This post is getting really long, so even though I have a few more ideas, I won't cover them here. If you want to know more, take a look at the STOPSIGNAL, ONBUILD, and SHELL instructions. Two build options are also very useful: --no-cache (especially on a CI server, when you want to be sure the build can be done on a fresh Docker installation) and --squash (more here). Have fun :)

Conclusion

That's all. It's a long post, but I think it contains useful information. If you have your own tips, share them in the comments. All feedback is welcome!

PS. I've recently written another post, aimed at anyone trying to integrate Docker into their project. You can find it here.
