April 8, 2017 · dsp2017 docker infrastructure

How to write excellent Dockerfiles

Hi! I've been working with Docker for some time now. Writing Dockerfiles is an essential part of that process, and I wanted to share a few tips on how to do it better.

Our goals:
We want to minimize image size, build time and number of layers.
We want to maximize build cache usage and Dockerfile readability.
We want to make working with our container as pleasant as possible.


TL;DR

This post is filled with examples and detailed descriptions, so here is a quick summary:

1. Write .dockerignore
2. Container should do one thing
3. Merge multiple RUN commands into one
4. Do not use 'latest' base image tag
5. Remove unneeded files after each RUN step
6. Use proper base image
7. Set WORKDIR and CMD
8. Use ENTRYPOINT (optional)
9. Use "exec" inside entrypoint script
10. Prefer COPY over ADD
11. Optimize COPY and RUN
12. Specify default environment variables, ports and volumes
13. Add metadata to image using LABEL
14. Add HEALTHCHECK

Practical example

So you just finished reading my tips. That's cool! But you may ask: how do I introduce them into my Dockerfiles, and what's the difference, anyway?

I've prepared a small Dockerfile with almost all possible mistakes. Next, we'll fix it! Let's assume that we want to dockerize a small Node.js web application. Here it is (the CMD is complicated and probably doesn't work, but it's just an example):

FROM ubuntu

ADD . /app

RUN apt-get update  
RUN apt-get upgrade -y  
RUN apt-get install -y nodejs ssh mysql  
RUN cd /app && npm install

# this should start three processes, mysql and ssh
# in the background and node app in foreground
# isn't it beautifully terrible? <3
CMD mysql & sshd & npm start  

We could build it using docker build -t wtf .

Can you spot all mistakes here? No? Let's fix them together, one by one.

1. Write .dockerignore

When building an image, Docker first has to prepare the context - gather all files that may be used in the process. The default context contains all files in the Dockerfile's directory. We usually don't want to include the .git directory, downloaded libraries and compiled files there. A .dockerignore file looks exactly like .gitignore, for example:

.git/
node_modules/  
dist/  

2. Container should do one thing

Technically, you CAN start multiple processes inside a Docker container. You CAN put a database, frontend and backend applications, ssh and supervisor into one Docker image. But it will bite you later.

My advice is to prepare a separate Docker image for each component, and use Docker Compose to easily start multiple containers at the same time.
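For illustration, a minimal docker-compose.yml for our app and its database could look like this (the service names and the mysql tag are just assumptions, not part of our example so far):

```yaml
# docker-compose.yml - one container per component
version: '2'
services:
  app:
    build: .            # our Node.js app, built from the Dockerfile
    ports:
      - "3000:3000"
    depends_on:
      - db
  db:
    image: mysql:5.7    # the database runs in its own container
    environment:
      MYSQL_ROOT_PASSWORD: example
```

A single docker-compose up then starts both containers together.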

Let's remove unnecessary packages from our Dockerfile. SSH can be replaced with docker exec.

FROM ubuntu

ADD . /app

RUN apt-get update  
RUN apt-get upgrade -y

# we should remove ssh and mysql, and use
# separate container for database 
RUN apt-get install -y nodejs  # ssh mysql  
RUN cd /app && npm install

CMD npm start  

3. Merge multiple RUN commands into one

Docker is all about layers. Knowing how they work is essential.

I like to compare a Docker image to an onion:

They both make you cry... err, no, not that. They both have layers. To access and modify an inner layer, you have to remove all the layers above it. Remember this and everything will be OK.

Let's optimize our example. We merge all RUN commands into one, and remove apt-get upgrade, as it makes our build non-deterministic (we should rely on base image updates instead):

FROM ubuntu

ADD . /app

RUN apt-get update \  
    && apt-get install -y nodejs \
    && cd /app \
    && npm install

CMD npm start  

Keep in mind that you should merge commands with a similar probability of changing. Currently, every time our source code changes, we reinstall Node.js as well. So, a better option is:

FROM ubuntu

RUN apt-get update && apt-get install -y nodejs  
ADD . /app  
RUN cd /app && npm install

CMD npm start  

4. Do not use 'latest' base image tag

The latest tag is the default one, used when no other tag is specified. So our instruction FROM ubuntu in reality does exactly the same as FROM ubuntu:latest. But the 'latest' tag will point to a different image when a new version is released, and your build may break. So, unless you are creating a generic Dockerfile that must stay up-to-date with its base image, provide a specific tag.

In our example, let's use 16.04 tag:

# it's that easy!
FROM ubuntu:16.04

RUN apt-get update && apt-get install -y nodejs  
ADD . /app  
RUN cd /app && npm install

CMD npm start  

5. Remove unneeded files after each RUN step

So, let's assume we updated apt-get sources, installed a few packages required for compiling others, and downloaded and extracted archives. We obviously don't need them in our final image, so let's clean up. Size matters!

In our example we can remove apt-get lists (created by apt-get update):

FROM ubuntu:16.04

RUN apt-get update \  
    && apt-get install -y nodejs \
    # added lines
    && rm -rf /var/lib/apt/lists/*

ADD . /app  
RUN cd /app && npm install

CMD npm start  

6. Use proper base image

In our example, we used ubuntu. But why? Do we really need a general-purpose base image when we just want to run a node application? A better option is to use a specialized image with node already installed:

FROM node

ADD . /app  
# we don't need to install node 
# anymore and use apt-get
RUN cd /app && npm install

CMD npm start  

Or even better, we can choose the alpine version (Alpine is a very tiny Linux distribution, about 4 MB in size, which makes it a perfect candidate for a base image):

FROM node:7-alpine

ADD . /app  
RUN cd /app && npm install

CMD npm start  

Alpine has its own package manager, called apk. It's a bit different from apt-get, but still quite easy to learn. Also, it has some really useful features, like the --no-cache and --virtual options. That way, we choose exactly what we want in our image, nothing more. Your disk will love you :)
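For illustration, here's how those apk options might be used in a Dockerfile (the package names and the native module are just examples, not part of our app):

```dockerfile
FROM node:7-alpine

# --no-cache installs packages without keeping the package
# index in the image, so no separate cleanup step is needed
RUN apk add --no-cache curl

# --virtual groups build-only packages under one name,
# so they can all be removed again in a single step
RUN apk add --no-cache --virtual .build-deps make gcc g++ \
    && npm install some-native-module \
    && apk del .build-deps
```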

7. Set WORKDIR and CMD

The WORKDIR instruction changes the default directory in which our RUN / CMD / ENTRYPOINT commands are run.

CMD is the default command, run when a container is created without any other command specified. It's usually the most frequently performed action. Let's add both to our Dockerfile:

FROM node:7-alpine

WORKDIR /app  
ADD . /app  
RUN npm install

CMD ["npm", "start"]  

You should put your command inside an array, one word per element (more in the official documentation).
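The array form matters in practice: the string (shell) form wraps the command in /bin/sh -c, so the shell becomes PID 1 and your process may never receive SIGTERM. A quick comparison:

```dockerfile
# shell form: actually runs /bin/sh -c "npm start",
# so the shell is PID 1 and signals go to it, not to npm
CMD npm start

# exec form: runs npm directly as PID 1,
# so it receives signals as expected
CMD ["npm", "start"]
```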

8. Use ENTRYPOINT (optional)

It's not always necessary, because an entrypoint adds complexity. How does it work?

An entrypoint is a script that runs instead of the command and receives the command as its arguments. It's a great way to create executable Docker images:

#!/usr/bin/env sh
# $0 is a script name, 
# $1, $2, $3 etc are passed arguments
# $1 is our command
CMD=$1

case "$CMD" in  
  "dev" )
    npm install
    export NODE_ENV=development
    exec npm run dev
    ;;

  "start" )
    # we can modify files here, using ENV variables passed in 
    # "docker create" command. It can't be done during build process.
    echo "db: $DATABASE_ADDRESS" >> /app/config.yml
    export NODE_ENV=production
    exec npm start
    ;;

   * )
    # Run custom command. Thanks to this line we can still use 
    # "docker run our_image /bin/bash" and it will work
    exec "$@"
    ;;
esac  

Save it in your root directory as entrypoint.sh, and make it executable (chmod +x entrypoint.sh). Usage in a Dockerfile:

FROM node:7-alpine

WORKDIR /app  
ADD . /app  
RUN npm install

ENTRYPOINT ["./entrypoint.sh"]  
CMD ["start"]  

Now we can run this image in an executable-like way:
docker run our-app dev
docker run our-app start
docker run -it our-app /bin/bash - this one will work too

9. Use "exec" inside entrypoint script

As you can see in the example entrypoint, we're using exec. Without it, we would not be able to stop our application gracefully (SIGTERM would be swallowed by the shell script). exec basically replaces the script's process with a new one, so all signals and exit codes work as intended.
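A tiny sketch of what replacement means: everything after exec is never reached, because the shell process has been replaced by the new command:

```shell
#!/bin/sh
# exec replaces this shell process with "echo", so the script's
# PID now belongs to echo and the line below is never executed
exec echo "I replaced the shell"
echo "never printed"
```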

10. Prefer COPY over ADD

COPY is simpler. ADD has some extra logic for downloading remote files and extracting archives; more in the official documentation. Just stick with COPY.

EDIT: This point needs some explanation. ADD may be useful if your build depends on external resources, and you want proper build cache invalidation on change. It's not the best practice, but sometimes it's the only way.
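For completeness, a sketch of what ADD can do that COPY can't (the URL and archive name here are made up):

```dockerfile
# ADD can download remote files...
ADD https://example.com/config.json /app/config.json

# ...and it automatically extracts local tar archives
# into the destination directory
ADD vendor.tar.gz /app/vendor/
```

Note that remote files fetched with ADD are not extracted; only local archives are.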

Let's ADD... oops, COPY this to our example:

FROM node:7-alpine

WORKDIR /app

COPY . /app  
RUN npm install

ENTRYPOINT ["./entrypoint.sh"]  
CMD ["start"]  

11. Optimize COPY and RUN

We should put least frequent changes at the top of our Dockerfiles to leverage caching.

In our example, the code will change often, and we don't want to reinstall all packages each time. We can copy package.json before the rest of the code, install dependencies, and only then add the other files. Let's apply that improvement to our Dockerfile:

FROM node:7-alpine

WORKDIR /app

COPY package.json /app  
RUN npm install  
COPY . /app

ENTRYPOINT ["./entrypoint.sh"]  
CMD ["start"]  

12. Specify default environment variables, ports and volumes

We probably need some environment variables to run our container. It's a good practice to set default values in the Dockerfile. Also, we should expose all used ports and define volumes. If you're wondering why to create a volume in the Dockerfile, look here.

Next improvement to our example:

FROM node:7-alpine

# env variables required during build
ENV PROJECT_DIR=/app

WORKDIR $PROJECT_DIR

COPY package.json $PROJECT_DIR  
RUN npm install  
COPY . $PROJECT_DIR

# env variables that can change
# volume and port settings
# and defaults for our application
ENV MEDIA_DIR=/media \  
    NODE_ENV=production \
    APP_PORT=3000

VOLUME $MEDIA_DIR  
EXPOSE $APP_PORT

ENTRYPOINT ["./entrypoint.sh"]  
CMD ["start"]  

These variables will be available in the container. If you need build-only variables, use build args instead.
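A quick sketch of build args (the APP_VERSION name is just an example):

```dockerfile
# ARG values exist only at build time and, unlike ENV,
# are not persisted in the final image
ARG APP_VERSION=unknown
RUN echo "Building version $APP_VERSION"
```

Pass a value during the build with docker build --build-arg APP_VERSION=1.2.3 -t our-app .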

13. Add metadata to image using LABEL

There's an option to add metadata to the image, such as who the maintainer is or an extended description. We need the LABEL instruction for that (previously we could use the MAINTAINER instruction, but it's now deprecated). Metadata is sometimes used by external programs; for example, nvidia-docker requires the com.nvidia.volumes.needed label to work properly.

Example of a metadata in our Dockerfile:

FROM node:7-alpine  
LABEL maintainer="jakub.skalecki@example.com"  
...

14. Add HEALTHCHECK

We can start a docker container with the option --restart always. After a container crash, the docker daemon will try to restart it. It's very useful if your container has to be operational all the time. But what if the container is running, but not available (infinite loop, invalid configuration etc.)? With the HEALTHCHECK instruction we can tell Docker to periodically check our container's health. It can be any command, returning exit code 0 if everything is OK, and 1 otherwise. You can read more about healthchecks in this excellent article.

Final change to our example:

FROM node:7-alpine  
LABEL maintainer="jakub.skalecki@example.com"

ENV PROJECT_DIR=/app  
WORKDIR $PROJECT_DIR

COPY package.json $PROJECT_DIR  
RUN npm install  
COPY . $PROJECT_DIR

ENV MEDIA_DIR=/media \  
    NODE_ENV=production \
    APP_PORT=3000

VOLUME $MEDIA_DIR  
EXPOSE $APP_PORT  
HEALTHCHECK CMD curl --fail http://localhost:$APP_PORT || exit 1

ENTRYPOINT ["./entrypoint.sh"]  
CMD ["start"]  

curl --fail returns a non-zero exit code if the request failed. Keep in mind that curl must be installed in the image for this to work.
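HEALTHCHECK also accepts timing options; a sketch (the values here are arbitrary):

```dockerfile
# probe every 30s, fail a probe after a 5s timeout, and mark
# the container unhealthy after 3 consecutive failures
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD curl --fail http://localhost:3000 || exit 1
```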

For advanced users

This post is getting really long, so even though I have a few more ideas, I won't cover them here. If you want to know more, take a look at the STOPSIGNAL, ONBUILD and SHELL instructions. Also, very useful options during build are --no-cache (especially on a CI server, if you want to be sure the build can be done on a fresh Docker installation) and --squash (more here). Have fun :)

Conclusion

That's all. It was a long post, but I think it contains useful information. If you have your own tips, share them in the comments. All feedback is welcome!
