We have to stop treating production (virtual) machines like they are desktops – and that means resisting the urge to log in to a running machine once it begins its production lifecycle. Later in this post I will contrast the way VM containers are managed with how Linux containers are managed by solutions like Docker.
Every time you log in to a machine, for whatever reason, you create side effects you are unaware of. Over time, the state of the machine deviates from the desired state in which it started its lifecycle. This is the root cause of many nasty problems, some of which can be very difficult to diagnose. Seemingly innocuous commands typed in the machine’s console can wreak havoc right away (if you are lucky) or linger for months before they cause disruption (if you are unlucky). A surprisingly common example is changing the permissions on a directory to enable some other operation, then forgetting to change them back. One such situation was reported at Google many years ago: someone removed the execute permission on the directory containing the Linux dynamic loader, causing several machines to lose the ability to exec() any binaries, including their own health monitoring agents. Fortunately Google’s design was resilient enough that end users never noticed the disruption. But I have been in several customer crit-sits (critical situations) where similar accidentally introduced problems took business applications down.
But how can we avoid logging in to a production VM? Don’t we need to install/update software inside it and start/stop services within it? Yes and yes, but you don’t need to log in to the production VM to do it.
If you need to install/update software in a production VM, follow these steps instead:
- Start a maintenance instance of the VM image you created the production VM instance from. This assumes you are following the best practice I blogged about in an earlier post, so that every uniquely configured VM instance in your production environment has a corresponding VM image from which it was deployed.
- Install/update and test on this maintenance VM instance. Then shut it down and capture it as a new, updated VM image. This is a good time to version the VM image so you can compare/diff it, track the provenance of the changes you made to it, etc.
- Deploy an updated VM instance from this new image in place of the original production VM instance. Depending on how your application is set up, you may need a small downtime window to do this swap.
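The three steps above can be sketched as a script. Every `cloud ...` command below is a hypothetical placeholder for whatever CLI your virtualization platform provides (gcloud, aws, virsh, and so on), and the image names are made up; only the image-versioning helper is real, runnable shell:

```shell
#!/bin/sh
# Sketch of the maintenance workflow. The "cloud" commands in comments are
# hypothetical placeholders for your platform's CLI, not a real tool.
set -eu

# Derive the next versioned image name, e.g. appserver-v2 -> appserver-v3.
next_image() {
  base=${1%-v*}     # strip the trailing -vN suffix
  n=${1##*-v}       # extract the current version number
  echo "${base}-v$((n + 1))"
}

CURRENT_IMAGE="appserver-v2"   # the image the production VM was deployed from
NEW_IMAGE=$(next_image "$CURRENT_IMAGE")

# 1. Start a maintenance instance from the current image:
#      cloud vm create maint-box --image "$CURRENT_IMAGE"
# 2. Install/update and test inside maint-box, then shut it down and
#    capture it as a new, versioned image:
#      cloud vm stop maint-box
#      cloud image create "$NEW_IMAGE" --from-vm maint-box
# 3. Swap the production instance over to the new image:
#      cloud vm replace prod-box --image "$NEW_IMAGE"
echo "would roll production from $CURRENT_IMAGE to $NEW_IMAGE"
```

Versioning the image name this way gives you the compare/diff and provenance tracking mentioned in step two for free.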
To start/stop/restart services inside the VM, a better approach is to bake a small utility into the image that listens on a designated port for start/stop/restart commands and executes them locally.
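A minimal sketch of such a utility, assuming a netcat (`nc`) with a listen mode and systemd’s `systemctl` as the service manager – the port number and the agent itself are my invention, not a standard tool. The service-manager command is kept in a variable so the dispatch logic can be exercised without touching real services:

```shell
#!/bin/sh
# Minimal in-VM control agent sketch: read "start|stop|restart <service>"
# lines from a TCP port and run them locally. Assumes systemd; SVC_CMD is
# overridable (handy for dry runs and testing).
set -eu
SVC_CMD=${SVC_CMD:-systemctl}
PORT=${PORT:-9999}

handle_line() {
  action=$1; service=$2
  case $action in
    start|stop|restart) "$SVC_CMD" "$action" "$service" ;;
    *) echo "unknown action: $action" >&2; return 1 ;;
  esac
}

serve() {
  # Requires a netcat that supports -l (listen); one command per connection.
  while true; do
    nc -l "$PORT" | while read -r action service; do
      handle_line "$action" "$service" || true
    done
  done
}
# serve   # uncomment to run the listener
```

In production you would of course restrict who can reach this port (firewall rules, or binding to a management network) so the agent does not become a remote-control hole.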
The key point is this: think of a VM image as the handoff between the Dev and Ops halves of your application lifecycle. It should contain all of the software environment necessary for execution, maintenance, and monitoring. No externally induced side effects should be permitted once a VM image is instantiated as a running VM instance. Following this simple rule can improve the manageability of your operational environment a lot more than you think.
A good analogy is to think about a VM image as the a.out binary produced by a compilation process. When you run the binary, you get a process. You don’t “login” to a running process – indeed there is no such capability. And that is a good thing, because then the process’s runtime behavior is governed by the state of that exact a.out binary, which in turn is governed by the exact source code version that was used to build it.
I hate to say this, but deployment tools like Chef and Puppet violate this simple principle. They make the deployment process that is under their control more repeatable and robust, but they induce side effects on the system that are not modeled in the deployment recipe and therefore remain invisible to the tool chain. The right way to use these tools is to integrate them with VM image build tools like Ubuntu VM-Builder so that executing a deployment recipe results in a VM image, not a running VM instance. That VM image then represents the fully realized system image produced by a deployment recipe, in exactly the sense that an a.out binary corresponds to the source code from which it was compiled.
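As one illustration of this integration (an example, not the only way to do it), a Packer template can run a Chef recipe as a provisioner, so that the sole build artifact is a machine image; the AMI ID, region, and recipe name below are hypothetical placeholders:

```json
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "ami-xxxxxxxx",
    "instance_type": "t2.micro",
    "ssh_username": "ubuntu",
    "ami_name": "appserver-v3"
  }],
  "provisioners": [{
    "type": "chef-solo",
    "cookbook_paths": ["cookbooks"],
    "run_list": ["recipe[appserver]"]
  }]
}
```

Running a build on such a template ends in an image, not a running instance: the recipe’s side effects are captured inside the artifact rather than applied to a live machine.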
How Docker got it right
I have been tinkering with Linux containers and Docker recently, and one thing that really struck me was how Docker has followed this simple principle with (Docker) images and (Docker) containers. You can technically “log in” to a Docker container (get a tty into it) with this command:
```shell
docker run -t -i <myimage> <myshell>
```
But there is little need to ever do this, because a variety of Docker commands let you peek and poke at the container from outside, without ever logging in to a shell inside it. For example, you can stop/start a container (docker stop, docker start), watch its lifecycle events (docker events), list the processes running inside it (docker top), get its logs (docker logs), inspect its current configured state (docker inspect), and even see which files have changed since the container started (docker diff).
These utilities make Docker more than just a user-friendly wrapper around Linux containers. It is a fundamentally different abstraction for managing the lifecycle of applications – one where the image is treated as the immutable contract between the Dev (e.g. docker build) and Ops (e.g. docker run) halves of the DevOps lifecycle. This has the potential to disrupt the VM-instance-centric ecosystem of tools and platforms that are in vogue today.