I had intended to turn this list into some kind of monster article or split it into several. I suck at getting around to writing things, so rather than let this fester unpublished I'm going to publish it as is.
This experience comes from 2 and a bit years evaluating and using docker for actual work. Some of it is probably rubbish. The largest deployment of nodes I've managed is 50 so not uber scale. Some things will be different beyond that scale but I'm pretty sure most of it applies below that.
An orchestrator is a nice idea but ask yourself if you really need one or if you just think the idea is cool. If you think you need one, try docker swarm first. It’s not as sexy but if you can run your stuff in swarm you can definitely run it on any other orchestrator. Look for the gaps in what swarm provides and make an orchestrator decision based on that.
Orchestrators are not free; they will cost you in operational complexity and will reduce the ability to tell what the hell is going on while you figure out how to work with them.
Letting developer machines build the binary artifacts you will drop in production is a terrible idea; it leaves you open to a myriad of bad things. Images must run on dev, but artifacts should be generated at the start of your CI pipeline and promoted to production. Anything else is crazy talk.
Now we’re getting somewhere. While you may have need for some workloads to run on beefy machines or machines with special hardware or machines with special deployment characteristics, most probably don’t.
By treating all nodes roughly equal you can simplify management of them a ton. For starters you only need one configuration management role for all machines; they’re all docker machines!
It is also crazy easy to repurpose a node. If you’ve ever tried to completely pivot a machine from one purpose to another with configuration management tools you will know how nasty this gets.
With containers it is as simple as: remove old containers, start new containers.
Obviously apps are isolated with their own set of deps when you run them in a container. This is cool for the app but also cool for the nodes the app runs on. You get to worry far less (almost to the point of not at all) about host package upgrades breaking applications.
I don't know if this has a name specific to containers but Joyent call it "The Autopilot pattern".
The idea is that the container is completely self contained in terms of ability to configure itself, configure healthchecks with external services and provide / configure monitoring for itself. They have written a tool called containerpilot to make much of this easier.
Containerpilot provides service registration for consul, the ability to run coprocesses (like metrics exporters), scheduled tasks (like cron) etc.
Making your containers self-sufficient can go a long way to cure some headaches with dynamic placement.
The usual argument against tools aware of the platform that they're embedded in is that they're no longer generic if you do that; the container should be metrics provider agnostic so you can share it. Bullshit. The generic argument applies to publicly consumable images only. If you have something generally applicable that you'd like multiple images to use, have this functionality in a separate image (or better yet, push it upstream) and derive all others from that. Using `FROM upstream-image:version` you have no excuse to not create a local Dockerfile specific to your tooling, and you should!
I never understood the "single process per container" rule. Personally I think it should be "single purpose per container". You don't want loads of random crap in your image (and definitely not something like SSHD) but you are just giving yourself unnecessary headaches by enforcing a single process per container policy.
Take nginx and PHP-FPM as an example.
nginx.conf requires specific fpm config including the webroot, a socket to talk to fpm and the backend block. FPM doesn't need anything specific from nginx but if you run it in a separate container you will need to match webroots and listen on TCP because you can't share a unix socket.
If fpm dies then nginx can't serve PHP (you get 502) but it will keep running. Fair enough if you have lots of independent static content (but no one using PHP that I have ever worked for has). If nginx dies then fpm is useless but will keep running doing... not much.
If you put them both in the same container then webroots naturally match, you can use a unix socket and using containerpilot you can have nginx as the primary process and fpm as a coprocess which will just get restarted if it dies.
There is very little advantage to separating these two and a lot of advantage to putting them in the same container.
These two aren't the only examples of this but they are the most pointed argument against "single process per container" that I've come across.
Do not use publicly available images and then fudge a load of crap in to another container that pulls shared volume backflips because you don't want to modify the public one. Take the public one, derive it and make it specific. This allows you to manage the lifecycle of this image and if you adopt having a Dockerfile in your own repos for every image you use (even if you don't modify them) then they can naturally have the same CI process as the rest of your images.
Sidekick containers are a hack. Docker has no way of coupling the lifecycle of these containers so if you kill your main service but not your prometheus sidekick, then it will sit there spewing errors forever (well, unless it has error policies but you shouldn't have to rely on this).
"Cron" containers that poke in to your service are hacks too for the same reason.
Kubernetes sort of gets away with it (a bit) because it manages lifecycles of pod containers but I would still recommend smart containers over relying on pods if only because smart containers are portable away from kubernetes and pods aren't.
Parameterizing all the things upfront is completely unnecessary. The more parameterization variables you have, the more chance of horribly broken combinations.
ENV vars are a great way of parameterizing containers but make sure you document them somehow (like in README.md next to Dockerfile).
The biggest change I had to get to grips with is that configuration logic shifts from converge to run. Those all encompassing apache chef templates… yeah, you need to populate those as the container starts.
At some point you will need templates in the container. Many options. consul-template if you are pulling from consul.
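To make the converge-to-run shift a bit more concrete, here is a minimal sketch of populating a config template from ENV vars as the container starts. It uses nothing but Python's standard `string.Template`. The variable names (`LISTEN_PORT`, `WEBROOT`), the defaults and the template text are all made up for illustration; a real image might just as well use consul-template.

```python
import os
from string import Template

# Hypothetical template; in a real image this would live in a file.
NGINX_TEMPLATE = Template(
    "server {\n"
    "    listen ${LISTEN_PORT};\n"
    "    root ${WEBROOT};\n"
    "}\n"
)

# Every variable the template understands, with a documented default.
DEFAULTS = {"LISTEN_PORT": "8080", "WEBROOT": "/var/www/html"}


def render_config(env=None):
    """Render the template from env vars, falling back to DEFAULTS."""
    env = os.environ if env is None else env
    values = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    # substitute() raises KeyError if the template references a variable
    # that is not in values, so a broken template fails loudly at start.
    return NGINX_TEMPLATE.substitute(values)


if __name__ == "__main__":
    print(render_config(), end="")
```

Keeping the defaults in one dict doubles as a list of the knobs the image exposes, which keeps the number of parameters honest.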
Dockerfiles are best. Yes, you can use config management tools for it but you really don’t need to. You’re just over complicating it because these tools are designed for running systems w/ services and drift to worry about. Images are not that.
You will at some point need to mash JSON around for something or other. jq is the bomb for this.
Always create a user in the container and use gosu or such to drop privileges but avoid USER in the Dockerfile. It messes with the build process and can be particularly confusing if deriving from an image that uses it.
Label images with metadata (git rev, build date, git url etc). See microbadger. You will thank yourself later if you find an old container floating about with no image label. This can happen when an image tag is reused but the old image keeps running. Not something you should make a habit of but it happens.
This is also useful if your docker labels don't match git versions as you can get to a repo & SHA quickly from a running container.
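As a rough sketch of the labelling idea, the helper below builds the `--label` arguments you could hand to `docker build` from your CI job. The label keys follow the label-schema.org naming; that choice is my assumption for the example, as is the function name. Use whatever convention suits you as long as it is consistent.

```python
from datetime import datetime, timezone


def build_label_args(git_rev, git_url, build_date=None):
    """Return `--label key=value` arguments for `docker build`."""
    if build_date is None:
        build_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    labels = {
        # Key names are an assumption (label-schema.org style).
        "org.label-schema.vcs-ref": git_rev,
        "org.label-schema.vcs-url": git_url,
        "org.label-schema.build-date": build_date,
    }
    args = []
    for key, value in labels.items():
        args += ["--label", f"{key}={value}"]
    return args
```

The result can be spliced into a command list such as `["docker", "build"] + build_label_args(rev, url) + ["-t", "my-image:v2", "."]`. Later on, `docker inspect --format '{{ json .Config.Labels }}' <container>` should give you the labels back from a running container.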
Reusing tags on images is a really bad idea for a few reasons:
- The meaning of `FROM my-image:v1` changes, and you can't guarantee it changes in a way that won't break derived images
- When someone pulls the new version of this image, any running or old containers using that image lose their image name and tag!
- Makes it more difficult for derived images to determine if they are using latest or just how far behind they are
If you need to add a bunch of files to an image I have found that `ADD root/ /` is by far the nicest way to do it. It makes the intended location of the files obvious without digging in the Dockerfile and it adds a sense of consistency if you do it across all of your images. It also means that your files are all added in one layer.
The only exception here is files that are conditional and ones you need to interpolate or generate. Those should live outside this structure and be handled case by case.
If you are downloading files for use in containers, you must validate them. A simple `sha256sum -c` check is sufficient. For this reason, fetch them with `RUN curl http://... && echo "<sha> <file>" | sha256sum -c` so that a checksum mismatch fails the build.
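If a download happens outside a Dockerfile (say in a CI script), the same validation is easy to sketch with Python's standard `hashlib`. The function names here are mine, and the expected checksum is something you pin yourself from a source you trust.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large downloads do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Raise ValueError unless the file matches the pinned checksum."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
```

Raising instead of returning a boolean means a forgotten check of the return value cannot let a bad file through.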
Bear in mind that you are trusting any third party container authors to play nice. Docker registry does not support immutable tags so things can absolutely change underneath you unless you’re using Notary. On top of that most containers are likely to be “I hacked this thing up once” projects and will not be maintained. I’m not saying don’t use them but I would suggest the following:
- Get hold of the source used to build the image and build it in your own CI pipeline.
- Run it through all the scanners and CI processes you have.
- Treat it with the same scrutiny as if an intern with more interest in facebook than is healthy wrote it.
- Adopt it as internal beyond initial review.
Consider a registry that supports security scans (like quay.io); without it you should be scanning images yourself in your CI process. These tools are nascent and IME need a lot of perseverance. coreos/clair and openscap/container-compliance are two examples.
You need to consider how you will keep up to date (or not) with upstream changes so that you're not going to be running a known vulnerable image.
This only really applies if you are compiling your app in a container. Your production deployment (probably) doesn't need gcc or the go compiler, much less your source code.
Make sure that if you're building in a container, you're exporting the artifact and only packaging what you need.
Big images take unnecessary time to deploy, space to store and bandwidth to transport.
Prefer consul-template as a coprocess over containerpilot backends for watching services. consul-template is much more full featured and I’ve had issues with the backend onChange script not triggering when I thought it should.
Pass the consul address in as a CONSUL env var so the same image can run in different envs.
Put all your lifecycle scripts in one place. I use /scripts . It makes them easy to find and trigger manually if required.
Use gossip encryption and use a different key per env. This will allow you to keep clusters apart by force should they ever come in to contact. I can speak from experience when I say that the self healing properties of the consul gossip protocol are very effective.
If you’re operating in HA, make sure you understand how the quorum works and how to safely remove servers. It’s super simple and will save you headaches later. If consul loses quorum it won't serve requests.
Quorum math is integer based. The minimum requirement for quorum is N/2 + 1 nodes available where N is the number of servers the cluster is aware of. So if N=3 then you need N/2+1=2 servers available to serve requests.
Three servers is the minimum you should have for HA if only for maintenance reasons. If you need consul to be always available but still want to actually maintain consul then you need at least one spare to take down.
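The quorum arithmetic above fits in a couple of lines. This is just the N/2 + 1 formula with integer division, plus the number of servers you can lose as a consequence; the function names are mine.

```python
def quorum(servers):
    """Minimum servers that must be available: N/2 + 1, integer division."""
    return servers // 2 + 1


def failures_tolerated(servers):
    """How many servers can be down while quorum still holds."""
    return servers - quorum(servers)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"servers={n} quorum={quorum(n)} tolerated={failures_tolerated(n)}")
```

With one or two servers nothing can go down, which is why three is the practical minimum. It also shows that going from three to four servers buys no extra tolerance, since four servers still only survive one failure.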
Avoid exposing consul to a public interface if at all possible. This is what we’ve done.
TLS support seems a bit hit and miss in consul clients. Same for ACL tokens. If I were doing this again I’d use both TLS and a deny all ACL by default and try to work around any consul client issues.
IF you have failure tolerant services. For that to work, at least two instances of every critical service must be load balanced / in service. Run away from services that can’t be load balanced. Be wary of services that require passive replicas as they (IME) usually have clunky failover procedures.
Shut down services, kill networking, generally abuse infrastructure and make sure things keep ticking. If you have a "dragons be here" type service that no one wants to touch you definitely need to drill with that one.
Data portability is not a solved problem. This is the “don’t containerize stateful services” bear.
Some services can automatically replicate but that doesn’t mean you get portability for free. Rebalancing data is intensive beyond tiny data sets.
Volume managers like flocker should be able to solve this within a single DC but I haven’t gone there yet.
In case you don’t know, these work by provisioning SAN / cloud volumes and attaching them to a machine. If a docker container goes down the volume gets detached, and when you start a new one it comes up with the volume reattached (which could be on a different machine). Obviously this has limits in terms of range of portability but ultimately we shouldn’t need tools like this.
It is just a PITA to get the benefits of containers out of some services. Jenkins is a great example. Because the executor queue is internal state it's very difficult to load balance or migrate to a new host.
If it makes sense to put a service on the host, do it. Cadvisor is a good example. It is a single binary + systemd unit, and if you want to run it in docker you must bind mount 7 different locations or whatever it needs. It smacks of feature envy (in programmer terms).
Incidentally if you are using CentOS and cadvisor you MUST put it on the host or you will fall foul of a nasty little bug around nested bind mounts present in kernels < 3.19 that will prevent you from stopping any container cleanly while cadvisor is running.
You still need to provision those hosts. If you have a disciplined team or only a few servers, use ansible. With a not so disciplined team or lots (like hundreds) of servers, consider chef. Ansible can totally handle that many, but not with the idiomatic, well documented approach. Chef is overkill for only a few servers, but the eventual convergence and crazy good performance of chef server make handling lots of machines a no brainer.
Make sure it converges frequently and completely. There can be a tendency for ansible to get run one play at a time which can leave stuff in bits.
Make sure your config management tool can reprovision docker storage pool (if applicable). Invaluable if you drop the ball and it gets full.
If you deploy on an old kernel you will have a bad time. Old here is anything older than 4.x.
CentOS7 / RHEL ship with a 3.10 kernel, and they don’t backport all bugfixes. There are some very nasty bugs that have already been fixed in more recent kernels waiting to bite you in the ass.
If you insist on CentOS7 / RHEL because your IT people are all narcissists, devicemapper with a dedicated LVM thinpool is the best option for storage.
Overlay on CentOS7 is unsupported. I mean, it works… until it doesn’t. It seems fine if you don’t start / stop containers too often. We were starting & stopping containers as frequently as a few a minute at peak times though and my primary experience here is recovering nodes that ran out of /var space due to filesystem errors caused by the overlay.
For the love of everything that is lovable, don’t use the default /dev/loop storage. It's flakey as hell if you have to start and stop containers with any frequency, it'll fill your /var partition (where all the logs also usually go) and you can't recover space from the sparsefile once it is gone.
I recommend overlay2 storage because it has better perf, is more stable and is in the mainline kernel now.
Devicemapper on ubuntu may work fine but I’ve no direct experience and I’ve heard the RH / CentOS folks have DM magic sauce. YMMV. I can't think why you would do that with overlay on the table though.
Beware the overlay driver eating inodes. I don't recall the exact circumstances but they fixed it with overlay2; use that in preference to overlay if you can.
If you have large images, you may need to up the dm.basesize option.
If considering a swarm or exposing engines via HTTP use TLS from the start; it’s a pain to retrofit in to a running swarm. square/certstrap is a huge help here.
If you are running systemd, journald is hands down the best logging option IMO.
All system and container logs go through journald w/ metadata and can be exported to JSON. This can then be piped to logstash or some remote syslog. I’d prefer if the elastic folks would write a native journald beat but there is a community one floating around (it just didn’t work for my requirements at the time).
Even if you’re not centralizing logs this gives you the option easily and docker logs still works (unlike with most of the other log drivers).
If using DM + thinpool, DO NOT let it run out of space. You will need a reboot to fix it. You can’t restart docker daemon and file handles that were open when it ran out of space will stay open even after killing stuff in our experience.
Use something to clean up stopped containers and unused images or they’ll eat all of your space and mess with swarm scheduling for containers without quotas.
Use named volumes instead of host bind mounts. This will allow orchestrators to schedule to where volumes are located or allow you to use an alternative docker volume plugin transparently to the container. It also keeps all of the service related files in one place instead of bind mounting from all over the host.
Do not populate config on host with config management tools and then bind mount all the things. This is the easy thing to do, especially if you’re using generic containers but it makes replacing a node or migrating a workload harder than it should be.
If you don’t feel a need for an orchestrator, don’t use one. They’re not free.
The Ansible docker module provides decent functionality if you have needs beyond docker-compose, but avoid using machine specific facts when launching if you can.
A service registry can provide great insight in to the health of your platform. Consul is excellent and easy to manage. Can also act as KV backplane if using swarm.
Label all the things. Who started the container & when. Who owns it if it goes fubar? By tagging containers with these labels when you launch you will save some serious head scratching if your machines are shared with other teams (or you’re using a shared swarm).
- Secrets are tricksy. Consider deploying Hashicorp Vault. Makes extra sense if you’re running consul.
- If using consul, consul-template has support for Vault. Distribution / acquisition of vault tokens is still up to you but this integration allows you to request all types of stuff from vault. Keys, TLS certs, SSH logins, TOTP etc.
- Use TLS
- Use consul for KV because service discovery and health check stuff is integral and it’s important. HA consul is also nice to operate once you get the hang of quorum requirements.
- Only deploy HA swarm if you need it. If you do need it, use haproxy in TCP mode to load balance the swarm masters. This will preserve functionality like docker exec and docker logs through the swarm since they hijack the http connection to operate.
- Use resource quotas when running anything through the swarm. Without this the scheduler relies on the number of containers on each node, INCLUDING stopped ones.
- If resource quotas aren’t appropriate, make sure you have fast removal of stopped containers on all members of the swarm or they will never balance correctly.
- Swarms are single tenant and will list everything running on the nodes that are a part of them. There is no way to restrict visibility without opting for the docker commercial offering of UCP which is a different beast entirely.
- Use containerpilot + health checks for any containers running in swarm. The service will register in the right place and can be found by other services using consul service discovery mechanisms.
- Try not to run your own registry. Lots of cheap options out there for image storage and it will save you headaches.
- If you have to, make sure you use an object store.
- Make sure you use an authenticating portal from the off. Getting everyone / thing to start authenticating is painful. Portus looks nice but we never actually used it.
- Reclaiming space requires running garbage collection. Garbage collection can only run when the registry is shut down. Garbage collection is not fast; you will have to schedule periods of downtime to handle this!
- Docker DTR is … improving … but still a bit rough in places. Requires UCP.
Wow; still enormous. If you'd like any clarifications on any points (or combinations thereof) feel free to drop a comment.
Same goes for calling out BS. I'm wrong all the time ;)