
Index and version your VM images

Most people think of VM images as black boxes whose contents only matter when the image is instantiated as a VM instance. Even virtualization-savvy customers treat a VM image as nothing more than a disk-in-a-file, more of a storage and transportation nuisance than anything of significant value to their IT operations. In fact, it is common practice to use VM images only for the basic OS layer: all of the middleware and applications are installed using deployment tools (such as Chef or Puppet) after the OS image is instantiated. Thus, a single “master” VM image is used to create many VM instances that each have a different personality. Occasionally, VM images are used to snapshot the known-good state of a running VM. But even then, the snapshot images are archived as unstructured disks, and management tools are generally unaware of the semantically rich file-system-level information locked within.

There is a smarter way to use VM images, one that can improve many aspects of how a data center environment is managed. Instead of a 1:N mapping of images to instances (a single image from which N uniquely configured instances are created), consider for a moment what would happen if we had an N:N mapping. To create a uniquely configured VM instance, you first create a VM image that contains that configuration (OS, middleware, and applications, all fully configured to give the image its unique personality), and then you instantiate it. If you need many instances of the same configuration, you can start multiple instances of the unique VM image containing that configuration, as before. The invariant you want to enforce is that for every uniquely configured machine in your data center, you have a VM image that contains that exact configuration.

This is very useful for a number of reasons:

  1. Your VM images are a concrete representation of the “desired state” you intended each of its VM instances to have when you first instantiated them.  This is valuable in drift detection: understanding if any of those instances have deviated from this desired state, and therefore may need attention. The VM image provides a valuable reference point for problem diagnosis of running instances.
  2. You can index the file system contents of your VM images without perturbing the running VM instances that were launched from them. This is useful in optimizing compliance and security scanning operations in a data center. For example, if a running VM instance only touches 2% of the originally deployed file system state, then you only need to do an online scan of this 2% in the running VM instance. The offline scan results for the remaining 98% of the file system can be taken from the VM image that the instance was started from. This could result in smaller maintenance windows. The same optimization also applies to the indexing of other file system state, such as the contents of important configuration files within VM instances.
  3. You can version VM images just like you version source code. VM image build and update tools can work with branches, tag versions, compare/diff versions, etc. These are very useful in determining the provenance of changes made to a system over time. The ability to track the evolution of images over time may also be useful in determining how a security problem manifested itself over time.
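To make the drift-detection idea in point 1 concrete, here is a minimal sketch in Python. This is not Mirage's actual code: the manifest format, the `drift` helper, and the example paths are all invented for illustration. The idea is simply to diff a file-to-hash manifest captured from the image (the desired state) against one crawled from the running instance:

```python
import hashlib

def manifest(files):
    """Map each file path to a SHA-1 digest of its contents (bytes)."""
    return {path: hashlib.sha1(data).hexdigest() for path, data in files.items()}

def drift(image_manifest, instance_manifest):
    """Report files added, removed, or modified relative to the image."""
    added = set(instance_manifest) - set(image_manifest)
    removed = set(image_manifest) - set(instance_manifest)
    modified = {p for p in set(image_manifest) & set(instance_manifest)
                if image_manifest[p] != instance_manifest[p]}
    return {"added": sorted(added), "removed": sorted(removed),
            "modified": sorted(modified)}

# Desired state captured in the VM image vs. observed state of an instance.
image = manifest({"/etc/ssh/sshd_config": b"PermitRootLogin no\n",
                  "/etc/motd": b"welcome\n"})
instance = manifest({"/etc/ssh/sshd_config": b"PermitRootLogin yes\n",
                     "/etc/motd": b"welcome\n",
                     "/tmp/dropper": b"\x7fELF"})
report = drift(image, instance)
```

The report pinpoints exactly which files deviated from the desired state, which is where problem diagnosis of the running instance would begin.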

Many years ago, my team developed a system called Mirage, which was designed as a VM image library providing these capabilities. At the lowest level, you could think of Mirage as a Git for VM images: it used a similar design to reduce the storage required to keep thousands of VM images, by exploiting the file-level redundancies that exist across images. In addition, it provided Git-like version control APIs, enabling operations like compare, branching, tagging, and so on.

Here is a diagram showing the use of VM image version control:

[Diagram: VM image version control]


The scenario above shows three different people whose roles are to maintain and update three different layers of the software stack. This is a common situation in many enterprise and IT services organizations. Traditionally, only the “OS Admin” team creates VM images – the others merely instantiate that image and then install and configure their respective software layer within the running instance. With Mirage, there is an incentive for all three teams to collaboratively develop a VM image, similar to the way a development team with different responsibilities creates a single integrated application. Working with large VM images is simple and fast with Mirage, because most operations are performed on image manifests, metadata about an image’s file system contents that Mirage extracts automatically.

The key insight in engineering Mirage was that a block-level representation of a VM image is much clunkier than a file-level representation. The former is good for transporting an image to a host to be instantiated as a running VM instance (for example, you can use copy-on-write to demand-page disk blocks into a local cache kept on the host). But the latter is better for installation and maintenance operations, because it exposes the internal file system contents contained within the disk image.

When an image is imported into Mirage, it indexes the file system contents of the image disk. The libguestfs library is an excellent foundation on which such a capability can be built today (at the time we built the first Mirage prototype, this library was in its infancy). Here is an overview of how the indexing process works:

[Diagram: the Mirage image indexing process]


The file system metadata (including the disk, partition table, and file system structure) is preserved as a stripped-down VM image in which every file is truncated to zero size. Mirage indexes this content-less structure into an image metadata manifest that it consults to provide various services. The contents of each file are first hashed (we used SHA-1), and if the hash is not already known, the contents are stored. Such a content-addressed store is similar to the one used by systems like Git, achieving storage efficiency by exploiting file content redundancy. The mapping between file path names and their corresponding hashes is maintained in the image metadata manifest.
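The content-addressed store can be sketched in a few lines of Python. This is a simplified illustration, not Mirage's implementation: the class and helper names are invented, and in practice the file contents would be read out of the disk image via a tool like libguestfs rather than passed in as a dictionary:

```python
import hashlib

class ContentStore:
    """A toy content-addressed store: blobs keyed by SHA-1, each stored once."""
    def __init__(self):
        self.blobs = {}  # sha1 hex digest -> file contents

    def put(self, data):
        digest = hashlib.sha1(data).hexdigest()
        if digest not in self.blobs:  # dedup: identical content is stored once
            self.blobs[digest] = data
        return digest

def import_image(store, files):
    """Index an image's files; return its manifest (path -> content hash)."""
    return {path: store.put(data) for path, data in files.items()}

store = ContentStore()
# Two images that share most of their files, e.g. a common base OS layer.
m1 = import_image(store, {"/bin/sh": b"shell", "/etc/issue": b"release 19"})
m2 = import_image(store, {"/bin/sh": b"shell", "/etc/issue": b"release 20"})
# The shared /bin/sh contents occupy storage only once across both images.
```

Across thousands of images built from a handful of base OS layers, this kind of file-level deduplication is what makes keeping an image per unique configuration affordable.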

The Mirage VM image library was a very successful project at IBM. It now forms the core of the IBM Research Compute Cloud, which is the Cloud infrastructure used by thousands of Research employees around the world (4 data centers spread across multiple geographic zones). It is also the nucleus of the IBM Virtual Image Library, a product that is used by many Enterprise customers to manage large VM environments.

Fast forward to today, and we see Linux containers emerging as a viable alternative (some would argue a complementary one) to VMs as a vehicle to encapsulate and isolate applications. Tools like Docker that build on Linux containers are taking the right direction here. With Docker, you build a separate Docker image per unique Docker container. This allows Docker to provide valuable image-level utilities (e.g. docker diff). What Docker needs now is a Git for Docker images: something like Mirage, but for Linux container images rather than VM images. Many of the core concepts used in Mirage would also be useful here.


  1. Virtual Machine Images as Structured Data: the Mirage Image Library. Glenn Ammons, Vasanth Bala, Todd Mummert, Darrell Reimer, Xiaolan Zhang. USENIX HotCloud, 2011.
  2. libguestfs: tools for accessing and modifying virtual machine disk images.

Query the data center like you query the Web

Say you want to query thousands of systems in your data center for something – e.g. the string “”. Maybe you want to know which systems might be impacted if you were to change this IP address somewhere, like in a firewall rule. How would you implement it?

Most people would send this query to agents running on each of those thousands of computers, have them execute the query locally by inspecting their machine’s state, and have the results shipped back. Not only is this a terribly clunky approach in practice, it also scales poorly as the number of systems grows. Your query latency is gated by the slowest machine in your data center – you have to wait until every machine responds before you have your answer. What if one of the machines is wedged and its response never comes back? How long should you wait?
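The latency problem can be seen in a small simulation (illustrative only; the host counts and latencies are made up): with agent-based scatter-gather querying, the answer is complete only when the slowest host replies.

```python
import random
import statistics

random.seed(0)

def agent_query(latencies_ms):
    """Scatter a query to every host's agent; the answer is complete
    only when the slowest host has replied."""
    return max(latencies_ms)

# Hypothetical data center: 999 hosts answer within tens of milliseconds,
# but one wedged machine takes a full minute to respond.
latencies = [random.uniform(5, 50) for _ in range(999)] + [60_000]

typical = statistics.median(latencies)  # what most hosts achieve
waited = agent_query(latencies)         # what the operator actually waits
```

One sick machine out of a thousand turns a tens-of-milliseconds query into a minute-long one, and in the real case you cannot even know up front how long the timeout should be.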

Now let us change the context completely. Say you want to query millions of sites on the Web for the string “”. How would you implement it?

This is a no-brainer. You query a central index, not the individual web sites. The index is constantly fed by crawlers that scan every web site periodically to extract changes made to that site since the last crawl. And here is the key: the crawlers have no knowledge of what queries will be asked of the index. Your query latency is independent of the current state of every website.

This approach is not only scalable, it also enables a more intuitive human interface. It is scalable because (a) crawling is a non-intrusive task (unlike running an agent inside a machine), enabling web sites to be monitored frequently enough to keep the index continuously refreshed, and (b) the data extraction and indexing process is decoupled from the query handling process, enabling each to be optimized independently. By decoupling queries from the crawling, there is no requirement to tune the query format to suit the needs of the data crawler – which in turn allows the query interface to be designed for human consumption, and the crawler interface to be designed for machine consumption.

Search engines like Google, Bing, and Yahoo are able to keep the index remarkably close to the real-time state of billions of web sites, debunking the myth that such an approach risks having the index become too stale to support real-time situational awareness requirements.

So, how can we query the data center like we query the Web?

We must begin by re-thinking how systems are monitored. In an earlier post I talked about “introspection” as an alternative way to monitor the real-time state of a system without the use of in-system agents. Introspection provides the foundation for building a new kind of “crawler”, one that continuously indexes the state of systems in a data center, similar to the way a Web crawler works on documents and web sites. This is because introspection enables crawling systems without disrupting their operation in any way.

In essence, introspection enables us to think about a running system as a series of point-in-time snapshots, where each snapshot is a document containing the metadata about that system’s state extracted by the crawler at a particular point in time. If you think about the system as a movie, you can think about this document as a frame. Frames are literally just documents. You can imagine translating all sorts of useful system state into a simple JSON dictionary for example, that would look something like this:

  {
    '_frame': {
      JSON entry with timestamp and other metadata
    },
    'file': {
      one JSON entry per monitored file
    },
    'process': {
      one JSON entry per running process
    },
    'connection': {
      one JSON entry per open connection
    },
    'package': {
      one JSON entry per installed package
    }
  }

This is the “frame” output by every crawl of a system: it is the document you have to index, to provide a Google-like query interface. And yes, the query response can return faceted search results, rank ordered by various heuristics that make intuitive sense in a data center context. Your mind immediately jumps to abstractions that are familiar in the Web search domain. Few tools to manage data centers look anything like this today – they are made for consumption by skilled IT Ops people, not regular humans like the rest of us. Why must this be so?
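A toy version of this pipeline might look like the following. This is an illustrative sketch, not Origami's implementation: the `crawl` fields are a subset of the frame layout, and a trivial whitespace-token inverted index stands in for a real search engine. The point is only that queries are answered entirely from the index, never by contacting hosts:

```python
import json
import time
from collections import defaultdict

def crawl(host, processes, connections):
    """Produce one point-in-time 'frame' document for a system."""
    return {
        "_frame": {"host": host, "timestamp": time.time()},
        "process": processes,
        "connection": connections,
    }

class FrameIndex:
    """Inverted index over frames: token -> set of hosts, like a web index."""
    def __init__(self):
        self.index = defaultdict(set)

    def add(self, frame):
        host = frame["_frame"]["host"]
        # Crude tokenization of the frame document; a real indexer would
        # tokenize field-by-field and support faceting.
        for token in json.dumps(frame).replace('"', " ").split():
            self.index[token].add(host)

    def query(self, term):
        """Answer from the index alone; no host is contacted at query time."""
        return sorted(self.index.get(term, set()))

idx = FrameIndex()
idx.add(crawl("web01", processes=["nginx"], connections=["10.0.0.5:443"]))
idx.add(crawl("db01", processes=["postgres"], connections=["10.0.0.5:5432"]))
idx.add(crawl("app01", processes=["gunicorn"], connections=["10.1.2.3:80"]))

hits = idx.query("10.0.0.5:443")
```

Because the crawler and the index know nothing about the queries in advance, the query side is free to grow Google-like conveniences (facets, ranking, free text) without ever touching the crawler.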

The Origami project, which my team has been working on for the last couple of years, has been exploring this very question. Why can’t systems in the data center be queried and indexed like documents on the Web? In fact, the state of many websites changes at rates faster than that of your typical production server, and yet we get reasonably good real-time query results from the index. There really is no good reason why these two worlds have to be so far apart.