
Never log in to your production containers


We have to stop treating production (virtual) machines like they are desktops – and that means resisting the urge to log in to a running machine once it begins its production lifecycle. Later in this post I will contrast the way VMs are managed with how Linux containers are managed by solutions like Docker.

Every time you log in to a machine, no matter for what reason, you create side-effects that you are unaware of. Over time, the state of the machine will deviate from the desired state in which it started its lifecycle. This is the root cause of many nasty problems, some of which can be very difficult to diagnose. Seemingly innocuous commands typed in the machine’s console can wreak havoc right away (if you are lucky) or linger for months before they create disruption (if you are unlucky). A surprisingly common example is changing the permissions on a directory to enable some other operation, then forgetting to change it back. There was one such situation reported at Google many years ago, when someone removed the executable permission on the directory containing the Linux dynamic loader, causing several machines to lose the ability to exec() any binaries, including their own health monitoring agents. Fortunately Google had a sufficiently resilient design that the impact of this disruption was not noticed by end users. But I have been in several customer crit-sits (critical situations) where similar accidentally introduced problems have taken business applications down.

But how can we avoid logging in to a production VM? Don’t we need to install/update software inside it and start/stop services within it? Yes and yes, but you don’t need to log in to the production VM to do it.

If you need to install/update software in a production VM, follow these steps instead:

  1. Start a maintenance instance of the VM image you created the production VM instance from. This assumes you are following the best practice I blogged about in an earlier post, so that every uniquely configured VM instance in your production environment has a corresponding VM image from which it was deployed.
  2. Install/update and test this maintenance VM instance. Then shut it down and capture it as a new, updated VM image (a sketch of this capture step follows the list). This would be a good time to version the VM image so you can compare/diff it, track the provenance of changes you made to it, etc.
  3. Deploy an updated VM instance from this new image, in place of the original production VM instance. You may need a small downtime window to do this swap depending on how your application is set up.
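Here is a minimal sketch of the capture-and-version operation in step 2, assuming a libvirt/KVM environment. The domain name, disk path, and image naming scheme are placeholders; a real pipeline would also wait for the guest to finish shutting down and would register the new image in your image catalog.

import hashlib
import subprocess

MAINT_DOMAIN = "myapp-maintenance"                        # hypothetical libvirt domain
MAINT_DISK = "/var/lib/libvirt/images/myapp-maint.qcow2"  # hypothetical disk path
NEW_IMAGE = "/images/myapp-v1.3.qcow2"                    # hypothetical versioned image name

# Ask the maintenance instance to shut down (virsh shutdown is asynchronous;
# a real script would poll 'virsh domstate' until the guest is actually off).
subprocess.run(["virsh", "shutdown", MAINT_DOMAIN], check=True)

# Capture the maintenance disk as a new, versioned VM image.
subprocess.run(["qemu-img", "convert", "-O", "qcow2", MAINT_DISK, NEW_IMAGE], check=True)

# Record a content hash so the image can be tracked and compared later.
h = hashlib.sha1()
with open(NEW_IMAGE, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
print(f"captured {NEW_IMAGE} sha1={h.hexdigest()}")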

To start/stop/restart services inside the VM, a better approach is to install a small utility in the image that listens on a designated port for start/stop/restart commands and executes them locally.
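As an illustration, here is a minimal sketch of such a utility, assuming a systemd-based guest. The port number and URL scheme are arbitrary choices for this example, and a real agent would obviously add authentication.

import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

ALLOWED_ACTIONS = {"start", "stop", "restart"}

class ServiceControlHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expected request form: POST /<action>/<service>, e.g. POST /restart/nginx
        parts = self.path.strip("/").split("/")
        if len(parts) != 2 or parts[0] not in ALLOWED_ACTIONS:
            self.send_error(400, "use POST /{start|stop|restart}/<service>")
            return
        action, service = parts
        result = subprocess.run(["systemctl", action, service],
                                capture_output=True, text=True)
        self.send_response(200 if result.returncode == 0 else 500)
        self.end_headers()
        self.wfile.write((result.stdout + result.stderr).encode())

if __name__ == "__main__":
    # Listen on a designated port for start/stop/restart commands.
    HTTPServer(("0.0.0.0", 9000), ServiceControlHandler).serve_forever()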

The key point here is this. Think of a VM image as the handoff between the Dev and Ops halves of your application lifecycle. It should contain all of the software environment necessary for execution, maintenance, and monitoring. No externally induced side-effects should be permitted once a VM image is instantiated as a running VM instance. Following this simple rule can improve the manageability of your operational environment a lot more than you think.

A good analogy is to think about a VM image as the a.out binary produced by a compilation process. When you run the binary, you get a process. You don’t “login” to a running process – indeed there is no such capability. And that is a good thing, because then the process’s runtime behavior is governed by the state of that exact a.out binary, which in turn is governed by the exact source code version that was used to build it.

I hate to say this, but deployment tools like Chef and Puppet violate this simple principle. They make the deployment process that is under their control more repeatable and robust, but they induce side effects on the system that are not modeled in the deployment recipe and therefore remain invisible to the tool chain. The right way to use these tools is to integrate them with VM image build tools like Ubuntu VM-Builder so that executing a deployment recipe results in a VM image, not a running VM instance. That VM image then represents the fully realized system image produced by a deployment recipe, in exactly the sense that an a.out binary corresponds to the source code from which it was compiled.

How Docker got it right

I have been tinkering with Linux containers and Docker recently, and one thing that really struck me was how Docker has followed this simple principle with (Docker) images and (Docker) containers. You can technically “log in” to a Docker container (get a tty into it) with this command:

docker run -t -i <myimage> <myshell>

But there is little need to ever do this, because a variety of Docker commands allow you to peek and poke at the container from outside, without ever logging in to a shell within the container. For example, you can stop/start the processes within a container (docker stop, docker start), watch events and processes within the container (docker events, docker top), get logs from a container (docker logs), peek at a container’s current configured state (docker inspect), and even see what files changed since you started a container (docker diff).
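For instance, here is a sketch of that outside-in management style using the Docker SDK for Python, assuming a local Docker daemon and the stock nginx image:

import docker

client = docker.from_env()
container = client.containers.run("nginx", detach=True)  # start a container from an image

print(container.top())    # processes inside the container   (docker top)
print(container.logs())   # output captured so far            (docker logs)
print(container.attrs)    # current configured state          (docker inspect)
print(container.diff())   # files changed since start         (docker diff)

container.stop()          # docker stop -- no shell inside the container needed
container.remove()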

Utilities like these make Docker more than just a user-friendly wrapper around Linux containers. It is a fundamentally different abstraction for managing the lifecycle of applications, one where the image is treated as the immutable contract between the Dev (e.g. docker build) and Ops (e.g. docker run) halves of the DevOps lifecycle. This has the potential to disrupt the VM-instance-centric ecosystem of tools and platforms that are in vogue today.

References

  1. Chef – IT Automation for Speed and Awesomeness
  2. Docker – Build, Ship, Run Any App Anywhere
  3. Linux containers

Index and version your VM images


Most people think of VM images as black boxes, whose contents only matter when the image is instantiated as a VM instance. Even virtualization-savvy customers treat a VM image as nothing more than a disk-in-a-file, more of a storage and transportation nuisance than anything of significant value to their IT operations. In fact, it is common practice to use VM images only for the basic OS layer: all of the middleware and applications are installed using deployment tools (like Chef, Puppet, etc.) after the OS image is instantiated. Thus, a single “master” VM image is used to create many VM instances that each have a different personality. Occasionally, VM images are used to snapshot the known good state of a running VM. But even so, the snapshot images are archived as unstructured disks, and management tools are generally unaware of the semantically rich file-system-level information locked within.

There is a smarter way to use VM images, one that can result in many improvements in the way a data center environment is managed. Instead of a 1:N mapping of images to instances (a single image from which N uniquely configured instances are created), consider for a moment what would happen if we had an N:N mapping. In order to create a uniquely configured VM instance, you first create a VM image that contains that configuration (OS, middleware, and applications, all fully configured to give the image that unique personality), then you instantiate it. If you need many instances of the same configuration, you can start multiple instances of the unique VM image containing that configuration, as before. The invariant you want to enforce is that for every uniquely configured machine in your data center, you have a VM image that contains that exact configuration.

This is very useful for a number of reasons:

  1. Your VM images are a concrete representation of the “desired state” you intended their VM instances to have when you first instantiated them. This is valuable for drift detection: understanding whether any of those instances have deviated from this desired state and therefore may need attention. The VM image provides a valuable reference point for problem diagnosis of running instances.
  2. You can index the file system contents of your VM images without perturbing the running VM instances that were launched from them. This is useful in optimizing compliance and security scanning operations in a data center. For example, if a running VM instance only touches 2% of the originally deployed file system state, then you only need to do an online scan of this 2% in the running VM instance. The offline scan results for the remaining 98% of the file system can be taken from the VM image that the instance was started from. This could result in smaller maintenance windows. The same optimization also applies to the indexing of other file system state, such as the contents of important configuration files within VM instances.
  3. You can version VM images just like you version source code. VM image build and update tools can work with branches, tag versions, compare/diff versions, etc. These are very useful in determining the provenance of changes made to a system. The ability to track the evolution of images over time is also useful in determining how and when a security problem manifested itself.

Many years ago, my team developed a system called Mirage, which was designed to be a VM image library providing these capabilities. At the lowest level, you could think of Mirage as a Git for VM images: it used a similar design to reduce the storage required to keep thousands of VM images, by exploiting the file-level redundancies that exist across images. In addition, it provided Git-like version control APIs, enabling operations like compare, branching, tagging, and so on.
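To make the comparison idea concrete, here is a toy sketch of an image diff over manifests, assuming each manifest is simply a mapping from file paths to content hashes (real Mirage manifests carry much richer metadata):

def diff_manifests(parent, child):
    """parent/child: dict mapping file path -> content hash."""
    added    = sorted(p for p in child if p not in parent)
    removed  = sorted(p for p in parent if p not in child)
    modified = sorted(p for p in child if p in parent and child[p] != parent[p])
    return added, removed, modified

# Example: compare a deployed image against a re-captured snapshot of one of its instances.
v1 = {"/etc/ssh/sshd_config": "ab12", "/usr/bin/python": "cd34"}
v2 = {"/etc/ssh/sshd_config": "ff99", "/usr/bin/python": "cd34", "/tmp/x": "0000"}
print(diff_manifests(v1, v2))   # (['/tmp/x'], [], ['/etc/ssh/sshd_config'])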

Here is a diagram showing the use of VM image version control:

[Diagram: VM image version control]

The scenario above shows three different people, whose roles are to maintain and update three different layers of the software stack. This is a common situation in many Enterprises and IT Services organizations. Traditionally, only the “OS Admin” team creates VM images – the others merely instantiate that image and then install/configure their respective software layer within the running instance. With Mirage, there is an incentive for all three teams to collaboratively develop a VM image, similar to the way a development team with different responsibilities creates a single integrated application. Working with large VM images is very simple and fast with Mirage, because most operations are performed on image manifests, which are metadata about an image’s file system contents automatically extracted by Mirage.

The key insight in engineering Mirage was that a block-level representation of a VM image is much clunkier than a file-level representation. The former is good for transporting an image to a host to be instantiated as a running VM instance (for example, you can use copy-on-write to demand-page disk blocks into a local cache kept on the host). But the latter is better for installation and maintenance operations, because it exposes the internal file system contents contained within the disk image.

When an image is imported, Mirage indexes the file system contents of the image disk. The libguestfs library is an excellent utility over which such a capability could be built today (at the time we built the first Mirage prototype, this library was in its infancy). Here is an overview of how the indexing process works:

[Diagram: the Mirage image indexing process]

The file system metadata (including the disk, partition table, and file system structure) is preserved as a stripped-down VM image in which every file is truncated to zero size. Mirage indexes this content-less structure into an image metadata manifest that it consults to provide various services. The contents of each file are hashed (we used SHA-1), and if a hash is not already known, the corresponding contents are stored. Such a content-addressed store is similar to the one used by systems like Git: it achieves storage efficiency by exploiting file content redundancy. The mapping between file path names and their corresponding hashes is maintained in the image metadata manifest.
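A stripped-down sketch of this indexing step might look like the following, assuming the guest file system has already been mounted read-only on the host (for example via libguestfs/guestmount); the mount point and store location are placeholders:

import hashlib
import os
import shutil

MOUNT_POINT = "/mnt/guest"            # mounted VM image file system (placeholder)
OBJECT_STORE = "/var/mirage/objects"  # content-addressed blob store (placeholder)

def index_image(root, store):
    manifest = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            h = hashlib.sha1()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digest = h.hexdigest()
            blob = os.path.join(store, digest)
            if not os.path.exists(blob):      # store each unique content exactly once
                shutil.copyfile(path, blob)
            manifest[os.path.relpath(path, root)] = digest
    return manifest                           # the path -> hash image metadata manifest

manifest = index_image(MOUNT_POINT, OBJECT_STORE)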

The Mirage VM image library was a very successful project at IBM. It now forms the core of the IBM Research Compute Cloud, which is the Cloud infrastructure used by thousands of Research employees around the world (4 data centers spread across multiple geographic zones). It is also the nucleus of the IBM Virtual Image Library, a product that is used by many Enterprise customers to manage large VM environments.

Fast forward to today, and we see Linux containers emerging as a viable alternative (some would argue a complementary one) to VMs as a vehicle to encapsulate and isolate applications. Applications like Docker that build on Linux containers are headed in the right direction here. With Docker, you build a separate Docker image per uniquely configured Docker container. This allows Docker to provide valuable image-level utilities (e.g. docker diff). What Docker needs now is a Git for Docker images: something like Mirage, but for Linux container images rather than VM images. Many of the core concepts used in Mirage would also be useful here.

References

  1. Virtual Machine Images as Structured Data: the Mirage Image Library. Glenn Ammons, Vasanth Bala, Todd Mummert, Darrell Reimer, Xiaolan Zhang. USENIX HotCloud. 2011.
  2. Libguestfs – tools for accessing and modifying Virtual Machine disk images.

Long-term preservation of executable content


With the onset of the digital revolution a few decades ago, preservation of digital content became a challenge. The process of archival, indexing, and curation that libraries and museums had used for centuries required a massive transformation in order to work for digital artifacts. The Library of Congress now archives digital media (text, audio and video), as do a number of libraries around the world.

Despite all this progress, however, we have overlooked one important category of digital content, whose preservation may matter even more than the text, audio, and video data we archive today.

An increasing portion of the world’s intellectual output is now in the form of executable content. Examples include simulations, education systems, expert systems, data visualization tools, interactive games, etc. Even content that appears static, such as a Web site, is often dynamically generated by code that customizes the content and appearance for individual readers at runtime.

Consider also the applications required to read the digital data we depend on today. We preserve important digital information in personal and Cloud-hosted backup systems, without bothering to also preserve the applications we depend on to process it. How many of you are able to read that WordPerfect document you wrote in the 1980s, or the TurboTax income tax return you created in the 1990s? Now roll the clock forward another ten years and ask yourself how you would be impacted if the digital formats you create your precious data in today could not be processed anymore.

Execution fidelity

For digital content like photographs, “fidelity” is a straightforward concept to define: we want all of the pixel data preserved without any loss, in addition to any metadata about the photograph, like the location coordinates, date, etc. But when it comes to executable content, fidelity is much more difficult to define precisely. It could depend on many things: the computer hardware, the operating system, dynamically linked libraries, and so on.

Simply preserving the software code, or even the compiled binary (both are different types of digital text) is not sufficient – the tool chain to compile this software along with all of its dependencies also has to be preserved, and there is no guarantee that all of this will work a decade from now.

This problem is also different from that of data decay (aka bit rot), which is the degradation of the storage media on which the digital data is kept. Data decay is analogous to the degradation of ancient manuscripts written before the invention of acid-free paper. We are talking about the content here, not the storage medium that content is kept on. The latter is an orthogonal problem, though also a critical one from a historical preservation perspective.

VM images are ideal for encapsulating executable content with high enough fidelity to make preservation of many useful executable environments practical. A VM is essentially a hardware instruction set emulator of such high accuracy that the OS and applications within it are unable to detect its presence. The VM’s emulated instruction set interface is tiny relative to the diversity of software that runs over it, and to the diversity of hardware on which this interface can be efficiently emulated. This makes the VM a very durable abstraction for historical preservation of executable content, and a considerably more attractive alternative than mothballing the entire physical computer hardware.

There are of course scenarios where a VM is insufficient to reproduce a program’s execution fidelity. For example, if an application uses an external Web Service, like the Google Maps API, its execution dependencies cannot be fully encapsulated in the VM image. Still, there are enough scenarios where VMs offer sufficient execution fidelity for future generations to experience much of today’s executable content.

Olive: a public domain VM library

A collaboration between IBM Research and Carnegie Mellon University, supported by grants from IBM, the Sloan Foundation, and IMLS.org, is building Olive, a public domain library for preserving executable content as VMs.

The idea of using VMs for software preservation is not new. VMs are already used commercially for distributing pre-installed and pre-configured software environments. They have also been used in preservation efforts in the past.

What makes Olive different is that it aims to tackle three problems that are crucial for an online digital library to be practical and usable by the public. First is the problem of how to “check out” a VM from the library without resorting to a long and slow download process. Second is the problem of how to search for something in the library without depending entirely on the VM metadata. And third is the problem of how to easily contribute new executable content to the library without having to be an expert in VM creation tools. We have built a fully functional prototype that addresses the first problem; technologies to address the other two are works in progress.

To “check out” and run VMs published in the Olive library, we have created the VMNetX application, which (currently) runs on Linux and uses the open-source KVM virtual machine monitor. VMNetX can execute VMs directly from any web server – no special server software is required. VMNetX can be installed on a user’s local laptop, or provided as a Cloud service where Olive VMs are automatically executed and the user interacts with the VM’s display over the Internet. VMNetX is developed on GitHub and released under the GPLv2 license.

VMNetX is built on Internet Suspend/Resume (ISR), a technique developed at CMU to “stream” VMs over the Internet. The user experience is similar to playing a video from YouTube: a user clicks on a link, and the VM corresponding to that link is demand-paged to a machine where the VM executes. Demand paging allows the ISR system to move only the part of the VM’s state (disk and memory) that is required by the executing applications within it, resulting in a much faster and smoother experience for the user. This works because executable content tends to spend a lot of time within working sets, which are generally much smaller than the state of the entire VM. Once the pages that comprise a working set are locally cached, the VM’s execution touches only this local state, and the execution fidelity is good.
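Here is a toy sketch of the demand-paging idea, assuming the VM image is served from an ordinary web server that supports HTTP range requests. The URL and chunk size are placeholders; in ISR/VMNetX the equivalent logic sits underneath the virtual machine monitor’s block and memory devices.

import urllib.request

IMAGE_URL = "https://olive.example.org/vms/pacman/disk.img"  # placeholder URL
CHUNK_SIZE = 256 * 1024
cache = {}   # chunk index -> bytes: the locally cached working set

def read_chunk(index):
    """Return one chunk of the remote disk, fetching it only on first access."""
    if index not in cache:
        start = index * CHUNK_SIZE
        end = start + CHUNK_SIZE - 1
        req = urllib.request.Request(IMAGE_URL,
                                     headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:   # server must honor Range requests
            cache[index] = resp.read()
    return cache[index]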

When a VM is published into Olive, its file system contents can be introspected and indexed. This indexing process allows automatic inference of the contents of a VM by looking them up in a table of known content hashes. A technical challenge here is to index the contents of the file system within the VM image (which have high semantic value) rather than the image’s disk blocks (which have low semantic value). This work is still ongoing, but such a capability would allow users to search for VMs by content, instead of relying solely on the metadata associated with every VM to tell what it actually contains. It is also valuable in determining the provenance of the content, and in certifying it for security purposes.
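A sketch of that inference, assuming we already have the image’s path-to-hash manifest from indexing and a table of known content hashes (the hashes and labels below are made up for illustration):

KNOWN_HASHES = {              # content hash -> software it is known to belong to
    "5f3a9c...": "Microsoft Office 6.0",
    "b77e21...": "NCSA Mosaic",
}

def infer_contents(manifest):
    """Count how many files in the image match each known piece of software."""
    hits = {}
    for digest in manifest.values():
        label = KNOWN_HASHES.get(digest)
        if label:
            hits[label] = hits.get(label, 0) + 1
    return hits               # e.g. {"NCSA Mosaic": 42}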

Finally, Olive aims to enable anyone to contribute VMs to the library without having to install or understand complex VM image building tools. It does so through a process called dynamic VM synthesis. The diagram below is a brief overview of how this might work:

[Diagram: dynamic VM synthesis in Olive]

There are three different clients of Olive in this diagram. Let us suppose that Client 1 publishes a base OS image into Olive – our initial assumption is that this will be a carefully controlled process, so only members of the Olive team can perform this first step. Client 2 has an application (say a Pac-Man game) that requires that specific OS to run. Let us assume that the application binary is present on Client 2’s local machine – the details of how the binary was transferred from its original storage medium onto the client’s local machine are not relevant to this discussion. All that Client 2 needs to do is retrieve and run the original VM containing just the OS using the VMNetX client. The Pac-Man application binary can then be installed inside this locally running VM – the bits can be moved into the VM either over the network (which even early versions of Mac, Windows, and Linux OSes support), or by exporting the guest OS’s file system to the host (which may require drivers that understand the guest file system to be bundled into the VMNetX distribution). Client 2 then publishes the modified VM to Olive. Olive maintains metadata, passed via the VMNetX client, that allows it to determine that this new Pac-Man image is a delta over the original OS image. It can then compute the delta and store only the delta internally, with a back pointer to the parent OS image. When Client 3 later retrieves the Pac-Man VM, Olive can dynamically synthesize the VM from the original OS image and the Pac-Man delta image, and stream it to her.
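At the file level, the delta computation and later synthesis could look like the following sketch, again assuming path-to-hash manifests. Olive’s internal representation may differ, and this toy version ignores file deletions; the example paths and hashes are made up.

def compute_delta(parent, child):
    """Keep only the entries that are new or changed relative to the parent image."""
    return {path: h for path, h in child.items() if parent.get(path) != h}

def synthesize(parent, delta):
    """Reconstruct the child image's manifest from the parent plus the stored delta."""
    merged = dict(parent)
    merged.update(delta)
    return merged

base_os = {"/Windows/command.com": "aa11", "/Windows/win.ini": "bb22"}
pacman  = {"/Windows/command.com": "aa11", "/Windows/win.ini": "bb22",
           "/Games/pacman.exe": "cc33"}

delta = compute_delta(base_os, pacman)       # only the Pac-Man files are stored
assert synthesize(base_os, delta) == pacman  # Client 3 gets the full VM back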

The Olive library prototype now has a number of VMs that contain historically significant executable environments. Examples include The Great American History Machine, Microsoft Office 6.0, the NCSA Mosaic browser on Mac OS 7.5, TurboTax 1997, etc. Here are some screenshots of these VMs in action:

[Screenshots: Olive VMs in action]

References

  1. Collaborating with Executable Content Across Space and Time. Mahadev Satyanarayanan, Vasanth Bala, Gloriana St Clair, Erika Linke. International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), Orlando, FL. 2011.
  2. Olive Executable Archive (Olive project website)
  3. VMNetX client for running VMs published in the Olive library
  4. Virtual Machine Images as Structured Data: the Mirage Image Library. Glenn Ammons, Vasanth Bala, Todd Mummert, Darrell Reimer, Xiaolan Zhang. USENIX HotCloud. 2011.

 

Query the data center like you query the Web


Say you want to query thousands of systems in your data center for something – e.g. the string “9.22.33.4”. Maybe you want to know what systems might be impacted if you were to change this IP address somewhere, like in a firewall rule. How would you implement it?

Most people would send this query to agents running on each of the thousands of computers, have them execute the query locally by inspecting their machine’s state, and have the results shipped back. Not only is this a terribly clunky approach in practice, it also scales poorly as the number of systems grows. Your query latency is gated by the slowest machine in your data center – you have to wait until every machine responds before you have your answer. What if one of the machines is wedged and its response never comes back? How long should you wait?

Now let us change the context completely. Say you want to query millions of sites on the Web for the string “9.22.33.4”. How would you implement it?

This is a no-brainer. You query a central index, not the individual web sites. The index is constantly fed by crawlers that scan every web site periodically to extract changes made to that site since the last crawl. And here is the key: the crawlers have no knowledge of what queries will be asked of the index. Your query latency is independent of the current state of every website.

This approach is not only scalable, it also enables a more intuitive human interface. It is scalable because (a) crawling is a non-intrusive task (unlike running an agent inside a machine), enabling web sites to be monitored frequently enough to keep the index continuously refreshed, and (b) the data extraction and indexing process is decoupled from the query handling process, enabling each to be optimized independently. By decoupling queries from the crawling, there is no requirement to tune the query format to suit the needs of the data crawler – which in turn allows the query interface to be designed for human consumption, and the crawler interface to be designed for machine consumption.

Search engines like Google, Bing, and Yahoo are able to keep the index remarkably close to the real-time state of billions of web sites, debunking the myth that such an approach risks having the index become too stale to support real-time situational awareness requirements.

So, how can we query the data center like we query the Web?

We must begin by re-thinking how systems are monitored. In an earlier post I talked about “introspection” as an alternative way to monitor the real-time state of a system without the use of in-system agents. Introspection provides the foundation for building a new kind of “crawler”, one that continuously indexes the state of systems in a data center, similar to the way a Web crawler works on documents and web sites. This is because introspection enables crawling systems without disrupting their operation in any way.

In essence, introspection enables us to think about a running system as a series of point-in-time snapshots, where each snapshot is a document containing the metadata about that system’s state extracted by the crawler at a particular point in time. If you think about the system as a movie, you can think of this document as a frame. Frames are literally just documents. You can imagine translating all sorts of useful system state into a simple JSON dictionary, for example, which would look something like this:

{
  '_frame': {
    JSON entry with timestamp and other metadata
  },
  'file': {
    one JSON entry per monitored file
  },
  'process': {
    one JSON entry per running process
  },
  'connection': {
    one JSON entry per open connection
  },
  'package': {
    one JSON entry per installed package
  },
  ...
}

This is the “frame” output by every crawl of a system: it is the document you have to index to provide a Google-like query interface. And yes, the query response can return faceted search results, rank-ordered by various heuristics that make intuitive sense in a data center context. Your mind immediately jumps to abstractions that are familiar in the Web search domain. Few tools to manage data centers look anything like this today – they are made for consumption by skilled IT Ops people, not regular humans like the rest of us. Why must this be so?
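To make the idea concrete, here is a toy end-to-end sketch: a crawl emits a frame like the JSON above, frames are fed into a central inverted index, and queries hit only the index, never the machines. For brevity the sketch reads the local /proc instead of introspecting a VM from the outside, and fills in only the process section; all names in it are illustrative.

import json
import os
import re
import time

def crawl(hostname):
    """Produce one frame (a plain document) describing this system right now."""
    frame = {"_frame": {"host": hostname, "timestamp": time.time()},
             "process": {}}          # file/connection/package sections omitted here
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                cmd = f.read().replace(b"\0", b" ").decode(errors="replace").strip()
            frame["process"][pid] = {"cmdline": cmd}
        except OSError:
            continue                 # the process exited while we were crawling
    return frame

index = {}                           # token -> set of (host, timestamp)

def ingest(frame):
    key = (frame["_frame"]["host"], frame["_frame"]["timestamp"])
    for token in re.findall(r"[\w./:-]+", json.dumps(frame)):
        index.setdefault(token, set()).add(key)

def query(term):
    """e.g. query('9.22.33.4') -> which systems mentioned it, and when."""
    return index.get(term, set())

ingest(crawl("prod-db-01"))
print(query("9.22.33.4"))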

The Origami project, which my team has been working on for the last couple of years, has been exploring this very question: why can’t systems in the data center be queried and indexed like documents on the Web? In fact, the state of many websites changes at a rate faster than that of your typical production server, and yet we get reasonably good real-time query results from the index. There really is no good reason why these two worlds have to be so far apart.