April 26, 2017 · dsp2017 docker infrastructure

Tips for using Docker Swarm mode in production

Since you are here, you're probably using Docker on your development machine, and maybe on a single production host. Then you found out that a single host is not enough anymore. What should you do? I've been there! Below are a couple of tips on how to prepare yourself for Docker Swarm in production, a possible solution to your problem. It's all based on my own, nearly year-long experience.

P.S. If you haven't already, take a look at my previous post My experience with Docker Swarm - when you may need it? It might be helpful if you're unfamiliar with Swarm.

Prerequisite - read official tutorial

I don't want to repeat the official guide here. Although a bit short, it will give you a nice overview of how things work. I also won't cover setting up Swarm; there are plenty of resources on the net. Check Digital Ocean or just google a bit.

P.S. Personally, I'm using this great Ansible role.

Facts about production Docker Swarm usage

Swarm overhead is quite low. From my observation, the CPU overhead of scheduling and communication inside Swarm is really small. Thanks to that, managers can be (and are, by default) worker nodes at the same time. If you are going to work on very big clusters (1000+ nodes), managers require much more resources, but it's negligible for small to medium sized installations. Here you can read about Swarm3k, an experiment with a 4700-node Docker Swarm cluster.

The routing mesh (service discovery, load balancing and cross-container communication) is really solid. It just works. You publish a port of a service, and you can access it on any host in the Swarm. Load balancing is done entirely under the hood. I had some problems in the past (read below), but since 1.13 everything has been OK.
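For example (any-swarm-node below is just a placeholder for the address of any machine in the cluster):

# publish a port; the routing mesh makes it answer on every node
docker service create \
  --name web \
  --publish 8080:80 \
  --replicas 2 \
  nginx

# works no matter which node actually runs a container
curl http://any-swarm-node:8080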

You need just a few commands after the initial configuration. Below you can find pretty much everything I'm using on a daily basis.

# let's create a new service
docker service create \
  --name nginx \
  --replicas 2 \
  nginx

# ... update service ...
docker service update \  
  --image nginx:alpine \
  nginx 

# ... and remove
docker service rm nginx

# but usually it's better to scale down
docker service scale nginx=0

# you can also scale up
docker service scale nginx=5

# show all services
docker service ls

# show containers of service with status
docker service ps nginx

# detailed info
docker service inspect nginx  

It's easy to make a zero-downtime deployment. It's also perfect for Continuous Deployment.

# let's build a new version and push it to the registry
docker build -t hub.docker.com/image .
docker push hub.docker.com/image

# and now just update (on a manager node)
docker service update --image hub.docker.com/image service

It's easy to start. Distributed systems are complicated on their own. Compared to other solutions (Mesos, Kubernetes), Swarm has the smallest learning curve. It took me about a week, without prior Swarm knowledge, to migrate from a single-host docker-compose deployment to a 20-host, distributed, scalable solution.

No more quick hacks. Your containers live on many hosts at the same time. To change anything, you need a new Docker image. A proper testing / deployment pipeline is the key to success.

Decide which containers should live inside Swarm

Not everything should be put inside Swarm. Databases and other stateful services are very bad candidates. Theoretically, you can pin a container to a specific node using labels, but it's much harder to access it from outside the Swarm (there's no convenient way in 1.12; in 1.13+ you can use an attachable overlay network). If you try to expose, for example, a database for external access, it will be available on all nodes, and that's probably not exactly what you want. Also, cross-host mounted volumes in Docker Swarm were not reliable a few months ago, so even simple user uploads can cause problems.

Good candidates are all stateless containers, driven by ENV variables. Consider preparing your own Docker images of the open source tools you use, for example Nginx with full configuration baked in.

My Swarm services:

Outside-of-Swarm containers:

Today I would probably keep Nginx out of Swarm, or at least run it in host network mode, due to a problem with getting the real IP address of clients (see issue), but in 1.12 it was the only possible option.
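If you hit the same problem, one workaround worth trying (a sketch, assuming a 1.13+ cluster) is host-mode port publishing, which bypasses the routing mesh so the proxy sees the original client address:

# publish directly on each node instead of through the routing mesh;
# global mode runs one replica per node
docker service create \
  --name nginx \
  --mode global \
  --publish mode=host,target=80,published=80 \
  nginx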

Set up Docker Registry

You need it! Either host your own, or use an existing one, like Docker Hub or the one provided by GitLab.com (my choice). Building images server-side doesn't work anymore, since you have many hosts and you have to specify an image on docker service create. Also, if your registry is private, remember to add the --with-registry-auth option, otherwise other nodes won't be able to pull the image.
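For example (registry.example.com is a placeholder for your registry):

# log in once on a manager node
docker login registry.example.com

# forward the credentials, so every node can pull the image
docker service create \
  --name webapp \
  --with-registry-auth \
  registry.example.com/webapp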

You should also start tagging the versions of your releases. That allows you to roll back easily if something goes wrong.
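A minimal sketch, with hypothetical image names and version tags:

# tag and push an immutable release
docker build -t registry.example.com/webapp:1.4.2 .
docker push registry.example.com/webapp:1.4.2

# something went wrong? point the service back at the previous tag
docker service update --image registry.example.com/webapp:1.4.1 webapp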

Make your semi-stateless containers truly stateless

Semi-stateless means the container has some shared, but not critical, files. You can give volumes a chance, but a probably better option is to migrate to S3 or another cloud storage. Remember, when going wide, the cloud is your friend!

In my case, I had to create my own Nginx image with the proper configuration baked in. Sharing it through a volume was unreliable and inconvenient.

Prepare log aggregation service

When working with distributed systems, a single place where you can explore logs and metrics is a must-have. ELK stack, Grafana, Graylog... there are many options, both open source and SaaS. Setting everything up in a reliable way is complicated, so my advice is to start with cloud services (e.g. Loggly or Logentries), and when costs start to rise, set up your own stack.

Example ELK stack logging configuration:

docker service update \  
  --log-driver gelf \
  --log-opt gelf-address=udp://monitoring.example.com:12201 \
  --log-opt tag=example-tag \
  example-service

Create attachable network (1.13+)

It's a game changer. Remember to use it, otherwise you won't have an option to run a one-off container inside the Swarm network. It's a 1.13+ feature; if you're using a previous Docker version, better upgrade.

Snippet:

docker network create --driver=overlay --attachable core  
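After that, any one-off container can join the Swarm network. For example, assuming a service named webapp is attached to core:

# a throwaway container that can reach Swarm services by name
docker run --rm -it --network core alpine ping webapp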

Start with ENV variables, consider secrets API later

If you're creating Docker images in accordance with best practices, you probably allow everything to be configured through ENV variables. And if you've done so, you won't have problems with the move to Docker Swarm.

Useful commands:

docker service create \  
  --env VAR=VALUE \
  --env-file FILENAME \
  ...

docker service update \  
  --env-add VAR=NEW_VALUE \
  --env-rm VAR \
  ...

The next level is to use the secrets API. In short, it allows you to mount secrets as files inside containers, great for longer content (authorized keys, SSL certs, etc.). I'm not using them (yet!), so I can't tell much, but it's worth considering.
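A minimal sketch of how it works (db_password is a hypothetical secret name; the value shows up as a file under /run/secrets/ inside containers):

# create a secret from stdin
echo "changeme" | docker secret create db_password -

# give an existing service access to it
docker service update --secret-add db_password webapp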

Set proper number of instances and parallel updates

You should keep the number of replicas high enough to handle the whole traffic and survive failures. On the other hand, remember that too many of them may cause fights over CPU and increased RAM usage (obviously :P).

Also, the default value of the update-parallelism setting is 1, which means that only one replica is taken down and updated at a time. That's usually too low; my recommended value is replicas / 2.

Related commands:

docker service update \  
  --update-parallelism 10 \
  webapp

# You can scale multiple services at once
docker service scale redis=1 nginx=4 webapp=20

# Check scaling status
docker service ls

# Check details of a service (without stopped containers)
docker service ps webapp | grep -v "Shutdown"  

Keep your Swarm configuration as code

The best option is to use the Docker Compose v3 syntax. It allows you to specify almost all service options and keep them in the same place as the code. Personally, I'm using docker-compose.yml for development, and docker-compose.prod.yml with the Swarm configuration for production. Deploying services described by a docker-compose file requires the docker stack deploy command (part of the new stack command family).

Example file and command:

# docker-compose.prod.yml
version: '3'  
services:  
  webapp:
    image: registry.example.com/webapp
    networks:
      - core
    deploy:
      replicas: ${WEBAPP_REPLICAS}
      mode: replicated
      restart_policy:
        condition: on-failure

  proxy:
    image: registry.example.com/webapp-nginx-proxy
    networks:
      - core
    ports:
      - 80:80
      - 443:443
    deploy:
      replicas: ${NGINX_REPLICAS}
      mode: replicated
      restart_policy:
        condition: on-failure

networks:
  core:
    external: true

Example deployment (either initial or update):

export NGINX_REPLICAS=2 WEBAPP_REPLICAS=5

docker login registry.example.com  
docker stack deploy \  
  -c docker-compose.prod.yml \
  --with-registry-auth \
  frontend

TIP: the docker-compose file supports env variables (${VARIABLE}), so you can dynamically adjust the configuration for staging, test, etc.

Set limits

From my experience, you should limit the CPU usage of all services. It prevents situations where one container takes all of a host's resources.

The reserve-cpu option is also useful. I used it when I wanted to spread all containers evenly between hosts, and when I wanted to be sure that a process would have enough resources to operate.

Example:

docker service update \
  --limit-cpu 0.25 \
  --reserve-cpu 0.1 \
  webapp

Monitor connections

I've had some problems with Swarm networking. A few times, all requests were routed to just one container, even though 9 others were alive and operational. Try scaling down / up; if that fails, change the routing type (the --endpoint-mode option).
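A sketch of the last option (note: DNS round-robin can't be combined with ports published through the routing mesh, and switching the mode on a live service may not be possible, so recreating it could be required):

# create the service with DNS round-robin instead of a virtual IP
docker service create \
  --name webapp \
  --endpoint-mode dnsrr \
  --network core \
  registry.example.com/webapp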

Discovering this is really hard without proper log aggregation.


That's everything for now. If you liked it, please share this post on social networks. If not, write what should be improved. Feedback is always welcome :)
