Category Archives: Computing

Technical blogs on a variety of computer science topics

Systems as Data

In an earlier post I asked why systems (aka executable state) cannot be indexed and queried just like documents. There is a much broader idea here, one that I have been obsessing over since around 1997. And that is the idea that systems can be treated as simply another form of data, and familiar paradigms from the data domain such as indexing, fingerprinting, clustering, markup, tagging, etc can apply to systems too.

The inception of this idea occurred in 1997 when I was trying to build a native binary interpreter called Dynamo at HP Labs. This was a degenerate interpreter in the sense that it’s input binary source was the same as its output binary target. You can find out more about how this was engineered in my earlier post on JIT acceleration. The kernel of the interpreter loop essentially involved rewriting segments of the input binary stream into a code cache so that the code in that cache could be manipulated in ways that could boost the overall runtime performance of the interpreted program. It struck me that what I was really doing was manipulating executable content (the source binary image) as data.

This made me wonder if there were other things I could do with a binary image that one normally only does with non executable data like audio, video and documents.

For instance, could a binary image be streamed on demand over a network, and executed by a “player” on a remote machine without ever requiring a local install step? I presented this idea at the IBM TJ Watson Research Center in May 2001. The talk, titled “Software as Content” was also my interview seminar. Although I got the job, I must admit the audience had this look of bewilderment on their faces. Clearly I had to articulate the commercial value of this vision to convince people of its power.

During my first two years there, with the help of a small team, I prototyped an application streaming service that eventually became the Progressive Deployment System (PDS), and a new product shipped by IBM for streaming pre-installed and pre-configured apps to desktops within an Enterprise. The end user experience was identical to viewing a video clip via YouTube. You could click a link on a webpage, and that would trigger a locally installed software stream player to communicate with a remote
streaming server to push a binary application image over the network, while simultaneously starting its execution on the local desktop.

Software images could indeed be treated like video images. A number of video related paradigms suddenly made sense in the software domain: streaming, compressed encoding, edge of network caching, etc.

Still we had a hard time convincing people that viewing software and systems as data was a valuable thing. Our next attempt to make the case was to create Mirage, a system that would index and version VM images in the same way that a source control system like Git versions text files. As VM images started to proliferate in data centers, the value of such a solution became ever more obvious. For me the cool thing was that Mirage made system images feel like documents – pretty much anything you could do in a document version control system you could now do with VM images.

The Origami project, which followed Mirage tried to explore yet another dimension of the Systems as Data idea. In this project we explored similarity detection algorithms that one normally used to cluster documents, to cluster VM images.

Despite these demonstrations most people remained unconvinced about the value and power of the Systems as Data idea. Over the years my personal obsession with this has only grown further. Perhaps some day I will get a chance to explore it again.

Never login to your production containers

We have to stop treating production (virtual) machines like they are desktops – and that means resisting the urge to login to the running machine, once it begins its production lifecycle. Later in this post I will contrast the way VM containers are managed with how Linux containers are managed by solutions like Docker.

Every time you log in to a machine, no matter for what reason, you create side-effects that you are unaware of. Over time, the state of the machine will deviate from the desired state in which it started its lifecycle. This is the root cause of many nasty problems, some of which can be very difficult to diagnose. Seemingly innocuous commands typed in the machine’s console can wreak havoc right away (if you are lucky) or linger for months before they create disruption (if you are unlucky). A surprisingly common example is changing the permissions on a directory to enable some other operation, then forgetting to change it back. There was one such situation reported at Google many years ago, when someone removed the executable permission on the directory containing the Linux dynamic loader, causing several machines to lose the ability to exec() any binaries, including their own health monitoring agents. Fortunately Google had a sufficiently resilient design that the impact of this disruption was not noticed by end users. But I have been in several customer crit-sits (critical situations) where similar accidentally introduced problems have taken business applications down.

But how can we avoid logging in to a production VM? Don’t we need to install/update software inside it and start/stop services within it? Yes and yes, but you don’t need to log in to the production VM to do it.

If you need to install/update software in a production VM, follow these steps instead:

  1. Start a maintenance instance of the VM image you created the production VM instance from. This assumes you are following the best practice I bogged about in an earlier post so every uniquely configured VM instance in your production environment has a corresponding VM image from which it was deployed.
  2. Install/update and test this maintenance VM instance. Then shutdown and capture it as a new updated VM image. This would be a good time to version the VM image so you can compare/diff it, track the provenance of changes you made to it, etc.
  3. Deploy an updated VM instance from this new image, in place of the original production VM instance. You may need a small downtime window to do this swap depending on how your application is set up.

To start/stop/restart services inside the VM, the better way to do this is to install a utility that listens on a designated port for start/stop/restart commands and executes them locally.

The key point here is this. Think of a VM image as the handoff between the Dev and Ops halves of your application lifecycle. It should contain all of the software environment necessary for execution, maintenance, and monitoring. No externally induced side-effects should be permitted once a VM image is instantiated as a running VM instance. Following this simple rule can improve the manageability of your operational environment a lot more than you think.

A good analogy is to think about a VM image as the a.out binary produced by a compilation process. When you run the binary, you get a process. You don’t “login” to a running process – indeed there is no such capability. And that is a good thing, because then the process’s runtime behavior is governed by the state of that exact a.out binary, which in turn is governed by the exact source code version that was used to build it.

I hate to say this, but deployment tools like Chef and Puppet violate this simple principle. They make the deployment process that is under their control more repeatable and robust, but they induce side effects on the system that are not modeled in the deployment recipe and therefore remain invisible to the tool chain. The right way to use these tools is to integrate them with VM image build tools like Ubuntu VM-Builder so that executing a deployment recipe results in a VM image, not a running VM instance. That VM image then represents the fully realized system image produced by a deployment recipe, in exactly the sense that an a.out binary corresponds to the source code from which it was compiled.

How Docker got it right

I have been tinkering with Linux containers and Docker recently, and one thing that really struck me was how Docker has followed this simple principle with (Docker) images and (Docker) containers. You can technically “log in” to a Docker container (get a tty into it) with this command:

docker run -t -i <myimage> <myshell>

But there is little need to ever do this, because a variety of Docker commands allow you to peek and poke the container from outside, without ever logging in to a shell within the container.  For example, you can stop/start the processes within a container (docker stop, docker start), watch events within the container (docker events, docker top), get logs from a container (docker logs), peek at a container’s current configured state (docker inspect), and even see what files changed since you started a container (docker diff).

These utilities makes Docker more than just a user-friendly wrapper around Linux containers. It is a fundamentally different abstraction for managing the lifecycle of applications. An abstraction where the image is treated as the immutable contract between the Dev (e.g. docker build) and Ops (e.g. docker run) halves of the DevOps lifecycle. This has the potential to disrupt the VM-instance centric ecosystem of tools and platforms that are in vogue today.


  1. Chef – IT Automation for Speed and Awesomeness
  2. Docker – Build, Ship, Run Any App Anywhere
  3. Linux containers

Index and version your VM images

Most people think about VM images as black boxes, whose contents only matter when the image is instantiated as a VM instance. Even virtualization savvy customers treat a VM image as nothing more than a disk-in-a-file, more of a storage and transportation nuisance than anything of significant value to their IT operations. In fact, it is common practice to use VM images only for the basic OS layer: all of the middleware and applications are installed using deployment tools (like Chef, Puppet, etc) after the OS image is instantiated. Thus, a single “master” VM image is used to create many VM instances that each have a different personality. Occasionally, VM images are used to snapshot the (known good state) of a running VM. But even so, the snapshot images are archived as unstructured disks, and management tools are generally unaware of the semantically rich file system level information locked within.

There is a smarter way to use VM images, that can result in many improvements in the way a data center environment is managed. Instead of a 1:N mapping of images to instances (a single image from which N uniquely configured instances are created), consider for a moment what would happen if we had a N:N mapping. In order to create a uniquely configured VM instance, you first create a VM image that contains that configuration (OS, middleware, and applications, all fully configured to give the image that unique personality), then you instantiate it. If you need to create many instances of the same configuration, you can start multiple instances of the unique VM image containing that configuration, as before. The invariant you want to enforce is that for every uniquely configured machine in your data center, you have a VM image that contains that exact configuration.

This is very useful for a number of reasons:

  1. Your VM images are a concrete representation of the “desired state” you intended each of its VM instances to have when you first instantiated them.  This is valuable in drift detection: understanding if any of those instances have deviated from this desired state, and therefore may need attention. The VM image provides a valuable reference point for problem diagnosis of running instances.
  2. You can index the file system contents of your VM images without perturbing the running VM instances that were launched from them. This is useful in optimizing compliance and security scanning operations in a data center. For example, if a running VM instance only touches 2% of the originally deployed file system state, then you only need to do an online scan of this 2% in the running VM instance. The offline scan results for the remaining 98% of the file system can be taken from the VM image that the instance was started from. This could result in smaller maintenance windows. The same optimization also applies to the indexing of other file system state, such as the contents of important configuration files within VM instances.
  3. You can version VM images just like you version source code. VM image build and update tools can work with branches, tag versions, compare/diff versions, etc. These are very useful in determining the provenance of changes made to a system over time. The ability to track the evolution of images over time may also be useful in determining how a security problem manifested itself over time.

Many years ago, my team developed a system called Mirage, that was designed to be a VM Image Library that provided these capabilities. At the lowest level, you could think of Mirage as a Git for VM images: it used a similar design to reduce the storage required to keep thousands of VM images by exploiting the file level redundancies that exist across images. In addition it provided Git like version control APIs, enabling operations like compare, branching, tagging, and so on.

Here is a diagram showing the use of VM image version control:

Screen Shot 2014-06-17 at 11.46.19 AM


The scenario above shows three different people, whose roles are to maintain and update three different layers of the software stack. This is a common situation in many Enterprises and IT Services organizations. Traditionally, only the “OS Admin” team creates VM images – the others merely instantiate that image and then install/configure their respective software layer within the running instance. With Mirage, there is an incentive for all three teams to collaboratively develop a VM image, similar to the way a development team with different responsibilities creates a single integrated application. Working with large VM images is very simple and fast with Mirage, because most operations are performed on image manifests, which are metadata about an image’s file system contents automatically extracted by Mirage.

The key insight in engineering Mirage is to realize that a block level representation of a VM image is much clunkier than a file level representation. The former is good for transporting an image to a host to be instantiated as a running VM instance (you can use copy-on-write to demand page disk blocks to a local cache kept on the host for example). But the latter is better for installation and maintenance operations, because it exposes the internal file system contents contained within the disk image.

When an image is imported into Mirage, it indexes the file system contents of the image disk. The libguestfs library is an excellent utility over which such a capability can be built today (at the time we built the first Mirage prototype, this library was in its infancy). Here is an overview of how the indexing process works:

Screen Shot 2014-06-17 at 11.46.52 AM


The file system metadata (including the disk, partition table, and file system structure) is preserved as a stripped down VM image, in which the size of every file is truncated to zero size. Mirage indexes this content-less structure into an image metadata manifest that it consults to provide various services. The contents of each file are first hashed (we used SHA1), and if this hash was not already known, the contents would be stored. Such a content addressed store is similar to that used by systems like Git for storage efficiency by exploiting file content redundancy. The mapping between file path names and their corresponding hashes was maintained in the image metadata manifest.

The Mirage VM image library was a very successful project at IBM. It now forms the core of the IBM Research Compute Cloud, which is the Cloud infrastructure used by thousands of Research employees around the world (4 data centers spread across multiple geographic zones). It is also the nucleus of the IBM Virtual Image Library, a product that is used by many Enterprise customers to manage large VM environments.

Fast forward to today, and we see Linux containers emerging as a viable alternative (some would argue it is complementary) to VMs as a vehicle to encapsulate and isolate applications. Applications like Docker that build on Linux containers are taking the right direction here. With Docker, you build a separate docker-image per unique docker-container. This allows Docker to provide image-level utilities that are valuable (e.g. docker diff). What Docker needs now, is a Git for docker images, like Mirage, except for linux container images not VM images. Many of the core concepts used in Mirage would also be useful here.


  1. Virtual Machine Images as Structured Data: the Mirage Image Library. Glenn Ammons, Vasanth Bala, Todd Mummert, Darrell Reimer, Xiaolan Zhang. USENIX HotCloud. 2011.
  2. Libguestfs – tools for accessing and modifying Virtual Machine disk images.

Long-term preservation of executable content

With the onset of the digital revolution a few decades ago, preservation of digital content became a challenge. The process of archival, indexing, and curation that libraries and museums had used for centuries required a massive transformation in order to work for digital artifacts. The Library of Congress now archives digital media (text, audio and video), as do a number of libraries around the world.

Despite all this progress however, we have overlooked one important category of digital content, whose preservation may matter even more than the text, audio and video data we archive today.

An increasing portion of the world’s intellectual output is now in the form of executable content. Examples include simulations, education systems, expert systems, data visualization tools, interactive games, etc. Even content that appears static, such as a Web site, is often dynamically generated by code that customizes the content and appearance for individual readers at runtime.

Consider also the applications required to read the digital data we depend on today. We preserve important digital information in personal and Cloud-hosted backup systems, without bothering to also preserving the applications we depend on to process them. How many of you are able to read that Word Perfect document you wrote in the 1980s, or the Turbo Tax income tax return you created in the 1990s? Now roll the clock forward another ten years and ask yourself how you would be impacted if the digital formats you create your precious data in today could not be processed anymore.

Execution fidelity

For digital content like photographs, “fidelity” is straightforward concept to define: we want all of the pixel data preserved without any loss, in addition any metadata about the photograph like the location coordinates, date, etc. But when it comes to executable content, fidelity is much more difficult to define precisely. It could depend on many things: the computer hardware, the operating system, dynamically linked libraries, and so on.

Simply preserving the software code, or even the compiled binary (both are different types of digital text) is not sufficient – the tool chain to compile this software along with all of its dependencies also has to be preserved, and there is no guarantee that all of this will work a decade from now.

This problem is also different from that of data decay (aka bit rot), which is the degradation of the storage media on which the digital data is kept. Data decay is analogous to the degradation of ancient manuscripts that were printed before the invention of acid-free paper. We are talking about the content here, not the storage medium that content is kept in. The latter is a an orthogonal problem to the one we are examining here, though also a critical one from a historical preservation perspective.

VM images are ideal for encapsulating executable content with high enough fidelity that makes them practical for preservation of many useful executable environments. A VM is essentially a hardware instruction set emulator of such high accuracy that the OS and applications within it are unable to detect its presence. The VM’s emulated instruction set interface is tiny relative to the diversity of software that runs over it, and the diversity of hardware on which this interface can be efficiently emulated. This makes the VM a very durable abstraction for historical preservation of executable content, and a considerably more attractive alternative to mothballing the entire physical computer hardware.

There are of course scenarios where a VM is insufficient to reproduce a program’s execution fidelity. For example, if an application uses an external Web Service, like the Google Maps API, its execution dependencies cannot be fully encapsulated in the VM image. Still, there are enough scenarios where VMs offer sufficient execution fidelity for future generations to experience much of today’s executable content.

Olive: a public domain VM library

A collaboration between IBM Research and Carnegie Mellon University, supported by grants from IBM, Sloan Foundation and, is building Olive, a public domain library for preserving execution content as VMs.

The idea of using VMs for software preservation is not new. VMs are already used commercially for distributing pre-installed and pre-configured software environments. They have also been used in preservation efforts in the past.

What makes Olive different is that it aims to tackle three problems that are crucial for an online digital library to be practical and usable by the public. First, is the problem of how to “check out” a VM  from the library, without resorting to a long and slow download process. Second, is the problem of how to search for something in the library, without depending entirely on the VM metadata. And third is the problem of how to easily contribute new executable content to the library, without having to be an expert in VM creation tools. We have built a fully functional prototype that addresses the first problem; technologies to address the last two problems are works in progress.

To “check out” and run VM’s published in the Olive library, we have created the VMNetX application, that (currently) runs on Linux and uses the open-source KVM virtual machine monitor. VMNetX can execute VMs directly from any web server – no special server software is required. VMNetX can be installed on a user’s local laptop, or provided as a Cloud service where Olive VM’s are automatically executed, and the user interacts with the VM’s display over the Internet. VMNetX is developed on GitHub and released under a GPL2 license.

VMNetX is built on Internet Suspend Resume (ISR), a technique to “stream” VMs over the Internet, developed at CMU. The user experience is similar to playing a video from You Tube: a user clicks on a link, and the VM corresponding to that link is demand-paged to a machine where the VM executes. Demand paging allows the ISR system to move only the part of the VM’s state (disk and memory) that is required by the executing applications within it, resulting in a much faster and smoother experience for the user. This works because executable content tends to spend a lot of time within working sets, which are generally much smaller than the state of the entire VM. Once the pages that comprise a working set are locally cached, the VM’s execution is only touching this local state, and the execution fidelity is good.

When a VM is published into Olive, its file system contents can be introspected and indexed. This indexing process allows automatic inference of the contents of a VM by looking up a table of known content hashes. A technical challenge here is to index the contents of the file system within the VM image, (which have high semantic value) rather than the image’s disk blocks (which have low semantic value). This work is still ongoing, but such a capability would allow users to search for VMs by content, instead of relying solely on metadata associated with every VM to tell what it actually contains. It is also valuable in determining the provenance of the content, and in certifying it for security purposes.

Finally, Olive aims to enable anyone to contribute VMs to the library without having to install or understand complex VM image building tools. It does so through a process called dynamic VM synthesis. The diagram below is a brief overview of how this might work:

Screen Shot 2014-06-15 at 10.19.20 PM

There are three different clients of Olive in this diagram. Let us suppose that client 1 publishes a base OS image into Olive – our initial assumption is that this will be a carefully controlled process, so only members of the Olive team can perform this first step. Client 2 has an application (say a PacMan game) that requires that specific OS to run. Let us assume that the application binary is present on Client 2’s local machine – the details of how the binary was transferred from its original storage medium onto the client’s local machine are not relevant to this discussion. All that Client 2 needs to do is to retrieve and run the original VM containing just the OS using the VMNetX client. The Pac Man application binary can then be installed inside this locally running VM – the bits can be moved into the VM using either the network (which even early versions of Mac, Windows and Linux OSes support), or by exporting the guest OS’s file system to the host (which may require drivers that understand how to interpret the guest file system to be bundled into the VMNetX distribution). Client 2 then publishes the modified VM to Olive. Olive can maintain metadata passed via the VMNetX client, that allows it to determine that this new Pac Man image is a delta over the original OS image. It can then compute the delta, and only store the delta internally with a back pointer to the parent OS image. When Client 3 later retrieves the Pac Man VM, Olive can dynamically synthesize the VM from the original OS image and the Pac Man delta image, and stream it to her.

The Olive library prototype now has a number of VMs that contain historically significant executable environments. Examples include The Great American History Machine, Microsoft Office 6.0, NCSA Mosaic browser on Mac OS 7.5, Turbo Tax 1997, etc. Here are some screenshots of these VMs in action:

Screen Shot 2014-06-15 at 10.10.35 PM          Screen Shot 2014-06-15 at 10.10.47 PM


Screen Shot 2014-06-15 at 10.11.06 PM          Screen Shot 2014-06-15 at 10.11.14 PM



  1. Collaborating with Executable Content Across Space and Time. Mahadev Satyanarayanan, Vasanth Bala, Gloriana St Clair, Erika Linke. International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), Orlando, FL. 2011 .
  2. Olive Executable Archive (Olive project website)
  3. VMNetX client for running VMs published in the Olive library
  4. Virtual Machine Images as Structured Data. Glenn Ammons, Vasanth Bala, Todd Mummert, Darrell Reimer, Xiaolan Zhang. USENIX HotCloud 2011.


Locality sensitive hashing

A neat problem I’ve been working on for some time now is how to determine if two systems in a data center are “similar”. Before getting into what “similar” means for computer systems, let us first understand why this might be useful.

Many modern applications are deployed as identical clusters of VMs fronted by a load balancer – a common design pattern for scaling the application to handle high incoming request loads. At the time the cluster is first deployed and the application starts its lifecycle, the state of these VMs is indeed identical, as the developer expects. However, lots of things happen over time that can cause the state of one or more of these systems to deviate from the others. Some examples include patches not rolling out to all VMs in the cluster, an accidental or malicious disruption to one of the VMs, an environment or configuration change that causes one of the VMs to behave very differently, etc. In such situations, it would be useful if one could automatically infer when a collection of VMs that are expected to be similar, are in fact not anymore.

To sink our teeth into this problem, let us simplify it to two systems A and B. At time t1, we are told that “A is similar to B right now”.  At time t2, we are asked the question “Is A still similar to B?”. The key here is that “similarity” is defined by a reference point in time, and not by some elaborate rule. So, we must first infer what it is about A and B that is “similar” at time t1. Then we have to use this information to determine if any of the attributes we decided were similar at time t1 are still similar at time t2.

This turns out to be a very difficult problem to solve in practice, and it requires some sophisticated learning and clustering techniques that I’m not going to talk about here. But a building block necessary to solve this problem is the algorithm for determining the degree of similarity between two pieces of data, in this case the state of A and the state of B at some point in time.

Enter the locality sensitive hash, or LSH for short. Most people are familiar with secure hashes, like SHA-1, that are used to determine when two pieces of data are different. Secure hashes have the property that small differences in the data result in a very large divergence of the hashes of the data. This is a very useful property in security, as simply comparing the hashes of the data can determine if they are non-identical. Fuzzy hashes, like LSH, on the other hand have the property that small differences in the data cause small differences in the hash values of the data. Thus, if X and Y are two sets, LSH(X) and LSH(Y) would be very close in value if the contents of the two sets are also close to one another.

A pretty good measure of “similarity” is the Jaccard similarity  JS(X, Y) between two sets X and Y, which can be computed using this equation:

   JS(X, Y) = |X ∩ Y| / |X ∪ Y|

So the trick to understanding the similarity between two computer systems A and B, is to extract a set of attributes from the state of systems A and B, and treat these as the sets X and Y for purposes of similarity scoring.  For example, the names of running processes, the list of open connections, and the lstat metadata of the file system,  are some examples of system attributes that might make sense to extract. (Things are not so simple in reality – some attributes are more important than others, requiring a weighting heuristic to be introduced into the basic Jaccard similarity equation, but that is the subject of a research paper, not a short blog).

In an earlier blog post, I pointed out that the state of a system at some point in time can be transcoded as a document.  Jaccard similarity between two documents can be efficiently computed at scale using a neat technique known as “shingling”. Shingling gets its name from the manner in which shingles are laid on a roof in an overlapping pattern. A k-shingling is a grouping of every k consecutive words of the document into a new object. Each of these objects can now be treated as an attribute, instead of every word in the document being treated as an attribute (this is equivalent to a 1-shingling). This results in a substantial space reduction, with some loss of fidelity in the similarity score.

For example, consider two documents X and Y, where:

  X = Vas I am
  Y = I am Vas

A 2-shingling of each of the documents yields the following objects:

  X' = {Vas I} {I am}
  Y' = {I am} {am Vas}

The documents X’ and Y’ have a fewer “words” than the original documents X and Y. Now we can compute the Jaccard similarity between X’ and Y’ and use it as an approximation of the Jaccard similarity between the original X and Y:

  JS(X', Y') = 1/3 = 0.33

The similarity score between documents X and Y is thus approximately 0.33. Jeffrey Ullman’s free online book, “Mining of Massive Datasets” gives an excellent overview of similarity detection algorithms, including the Jaccard distance computation.

There are more sophisticated algorithms for similarity detection. One of them is the SSDEEP algorithm, used by the National Software Reference Library (NSRL). But wait – the NSRL is part of the US Department of Homeland Security, and its goal is to “promote efficient and effective use of computer technology in the investigation of crimes involving computers”. What is a fuzzy hash like SSDEEP doing here?

NSRL is really a giant white list of files that are included in known software products shipped by thousands of software publishers. Every known file has a corresponding SSDEEP hash. Now, here is the neat thing about SSDEEP. Given a random file you found on your computer, you can compute its SSDEEP (the algorithm and its implementation is open sourced by the US government cyber security office). Then you could check with NSRL if SSDEEP(F), where F is the file on your computer, is known to its white list. It will respond with one of 3 answers:

  1. Yes, F is a known file, and it belongs to product P
  2. No, F is an unknown file
  3. F is very similar to a known file

That third response is very interesting indeed! It suggests that the original file F may have been tampered with – and you may want to investigate this further.

How cool is that? Now lets go back to that neat system similarity problem I introduced at the beginning. What if you could build an NSRL-like white list of hashes for systems in your data center that are in their “desired state” (e.g. immediately after they begin their lifecycle)? Understanding deviation from that desired state could be accomplished quickly, by periodically computing an appropriate fuzzy hash of all the systems in your environment, and comparing the hashes with your white list.

Of course you would still need to solve the vexing details I mentioned briefly, like what attributes matter and to what degree. But you get the general idea.


  1. Mining of Massive Datasets. Jeffrey Ullman, Stanford University.
  2. Locality Sensitive Hashing. Wikipedia entry.
  3. SSDEEP algorithm for context-triggered piecewise hashes.

Query the data center like you query the Web

Say you want to query thousands of systems in your data center for something – e.g. the string “”. Maybe you want to know what systems might be impacted if you were to change this IP address somewhere, like in a firewall rule. How would you implement it?

Most people would send this query to agents running on each of the thousands of computers, have them execute this query locally by inspecting their machine’s state, and have the results shipped back. Not only is this a terribly clunky approach in practice, it also scales poorly as the number of systems grows. Your query latency is gated by the slowest machine in your data center – you have to wait until every machine responds before you have your answer. What if one of the machine’s is wedged and its response never comes back; how long should you wait?

Now let us change the context completely. Say you want to query millions of sites on the Web for the string “”. How would you implement it?

This is a no-brainer. You query a central index, not the individual web sites. The index is constantly fed by crawlers that scan every web site periodically to extract changes made to that site since the last crawl. And here is the key: the crawlers have no knowledge of what queries will be asked of the index. Your query latency is independent of the current state of every website.

This approach is not only scalable, it also enables a more intuitive human interface. It is scalable because (a) crawling is a non-intrusive task (unlike running an agent inside a machine), enabling web sites to be monitored frequently enough to keep the index continuously refreshed, and (b) the data extraction and indexing process is decoupled from the query handling process, enabling each to be optimized independently. By decoupling queries from the crawling, there is no requirement to tune the query format to suit the needs of the data crawler – which in turn allows the query interface to be designed for human consumption, and the crawler interface to be designed for machine consumption.

Search engines like Google, Bing, and Yahoo are able to keep the index remarkably close to the real-time state of billions of web sites, debunking the myth that such an approach risks having the index become too stale to support real-time situational awareness requirements.

So, how can we query the data center like we query the Web?

We must begin by re-thinking how systems are monitored. In an earlier post I talked about “introspection” as an alternative way to monitor the real-time state of a system without the use of in-system agents. Introspection provides the foundation for building a new kind of “crawler”, one that continuously indexes the state of systems in a data center, similar to the way a Web crawler works on documents and web sites. This is because introspection enables crawling systems without disrupting their operation in any way.

In essence, introspection enables us to think about a running system as a series of point-in-time snapshots, where each snapshot is a document containing the metadata about that system’s state extracted by the crawler at a particular point in time. If you think about the system as a movie, you can think about this document as a frame. Frames are literally just documents. You can imagine translating all sorts of useful system state into a simple JSON dictionary for example, that would look something like this:

  '_frame': {
    JSON entry with timestamp and other metadata
  'file': {
    one JSON entry per monitored file
  'process': {
    one JSON entry per running process
  'connection': {
    one JSON entry per open connection
  'package': {
    one JSON entry per installed package 

This is the “frame” output by every crawl of a system: it is the document you have to index, to provide a Google-like query interface. And yes, the query response can return faceted search results, rank ordered by various heuristics that make intuitive sense in a data center context. Your mind immediately jumps to abstractions that are familiar in the Web search domain. Few tools to manage data centers look anything like this today – they are made for consumption by skilled IT Ops people, not regular humans like the rest of us. Why must this be so?

The Origami project, that my team has been working on for the last couple of years, has been exploring this very question. Why can’t a systems in the Data Center be queried and indexed like documents in the Web? In fact, the state of many websites changes at rates faster than your typical production server, and yet we get reasonably good real-time query results from the index. There really is no good reason why these two worlds have to be so far apart.

Introspect, don’t monitor, your VMs

When you need to observe the state of a running computer, what do you do? If its just a few computers, you simply ssh into each one and inspect the state from the command line. If you have a lot of computers, you use an agent that locally monitors each computer’s state and periodically ships the results to you. In either case, you make a key assumption: that your inspection logic has to run within the OS context of the running machine in order to observe its state.

There are a number of problems with this approach, and most people are oblivious to it. Lets take the case where you need to monitor thousands of computers – a modest data center scale operational environment. Virtually anyone managing systems at this scale or higher uses some kind of in-system monitoring solution. Data center monitoring is a billion dollar business – it provides what Ops teams refer to as “situational awareness”, the lifeblood of data center operations.

Problem 1: When a system becomes unresponsive, so does the monitoring agent running within it

This is a far more common situation than you might think. Systems can become (intermittently) unresponsive for any number of reasons. A process may be thrashing the disk, or memory, or both and your monitoring agent is not getting sufficient cycles to function. A system update may have modified a library on which your monitoring agent depends, causing the agent itself to malfunction or crash.

Google had such an outage in 2011, when graphs showing the global health of Google vanished from internal dashboards, pagers fell silent, and ssh stopped working. Because the paging system was itself affected, many site reliability engineers had no idea a major outage was in progress. The root cause turned out to be an accidental change to the permission flags of file /lib/x86_64-linux-gnu/ (the Linux dynamic loader used to exec() user processes) from -rwxr-xr-x to -rw-r–r—. As a result, all user processes on these systems, including Google’s own monitoring agents, failed to start. That did not fail despite what was apparently an outage that impacted “between 15 and 20% of Google’s production serving machines” is a testament to its resilient design. But this highlights the problem of relying on in-system monitoring agents for understanding system health.

Problem 2: In-system monitoring agents are vulnerable to security attacks

Monitoring agents are the foundation of another billion dollar industry: data center security. Yet, in-system security monitoring agents are themselves vulnerable to malicious attacks and accidental disruption, like any other process running inside these systems. That does not give a warm and fuzzy feeling.

Problem 3: In-system monitoring agents impact system performance

Every vendor or individual that writes a “lightweight monitoring agent” promises that the agent’s own operation will not impact the performance of the system it is monitoring. But agents are after all just another piece of software, running in a complex system environment. It is not uncommon to encounter situations where the culprit for poor performance is the monitoring agent itself.

Ultimately, any monitoring logic (even the commands you type when you ssh into a running computer) have side-effects that you are not fully aware of. This is the classic Heisenberg effect: the very act of monitoring the system is affecting the state you are trying to monitor. Most people disregard this as a problem they need to even think about, until their systems become heavily loaded. It is usually under peak loads that the impact of monitoring agents become more noticeable. And that is just when the monitoring data they provide is most necessary.

Introspection: an alternative way to observe system health

Virtualization enables a different way to observe the state of a running system. Introspection refers to the process of inspecting the state of a VM from outside the guest OS context. This is fundamentally different from monitoring, in that there is no monitoring logic running within the VM.

Introspection of a VM’s file system and memory state is possible by leveraging the VMM (virtual machine monitor, aka hypervisor) that interposes between the VM’s guest state and its mapping to the underlying physical host state. The key question is: can out-of-VM introspection approach the robustness and fidelity of conventional in-VM monitoring?

There are 2 parts to this problem: (a) real-time introspection of guest file-system state, and (b) real-time introspection of guest memory. Each has different challenges. Note that if we removed the “real-time” requirement, many good solutions exist already. For instance, backup solutions are now available that use VM disk snapshots to perform continuous backup without the use of an in-system backup agent. A benefit of this approach is you do not have to schedule a downtime window for backing up your VMs (though your VMs may be briefly stunned when the guest OS quiesces any uncommitted state to disk, prior to taking the snapshot).

My team and I have been experimenting with a technique we refer to as “Near Field Monitoring”, that uses VM file-system and memory introspection as an alternative to in-system monitoring agents. After two years of R&D, and many dead-ends, I am proud to say we now have these techniques working in the IBM Research Compute Cloud (RC2) production environment.

The following publications give the technical details of our approach. The work was done with CMU and University of Toronto, originally started during summer internships done by Wolfgang Richter and Sahil Suneja with my team at IBM Watson Labs. In subsequent blogs, I will dive deeper into the technical intricacies of these two papers. The key result here is that introspection based “Near Field Monitoring” techniques can approach the robustness and fidelity of in-system monitoring agents for most common monitoring tasks we have studied. This makes them a viable contender to disrupt the in-system monitoring model that is in widespread use today.


[1] Agentless Cloud-wide Streaming of Guest File System Updates. Wolfgang Richter (Carnegie Mellon University), Canturk Isci (IBM Research), Jan Harkes and Benjamin Gilbert (Carnegie Mellon University), Vasanth Bala (IBM Research), and Mahadev Satyanarayan (Carnegie Mellon University). Best Paper Award, IEEE International Conference on Cloud Engineering, Boston, MA, March 2014.

[2] Non-intrusive, Out-of-band and Out-of-the-box Systems Monitoring in the CloudSahil Suneja (University of Toronto), Canturk Isci (IBM Research), Vasanth Bala (IBM Research), Eyal de Lara (University of Toronto), Todd Mummert (IBM Research). ACM SIGMETRICS, Austin TX, June 2014.



JIT acceleration with Dynamo

I stumbled into a presentation about Google’s Dalvik VM, which is the interpreter used to execute Java bytecode on the Android mobile platform. Like many modern interpreters, Dalvik uses a JIT (just-in-time compiler) to lower interpretive overhead by compiling frequently interpreted code to native binary code that can be directly executed by the hardware interpreter (aka CPU).

A crucial design decision in engineering a JIT is the heuristic that determines what to compile and when to do so. Early JITs compiled entire methods, and used method entry counts to determine when a method was “hot” enough to warrant compilation. The Dalvik JIT however uses a hybrid design that combines a method-based heuristic with a trace-based heuristic to identify and compile hot instruction traces. A “trace” is a dynamic scoping concept (unlike a method, which is a static scoping concept) – it is the sequence of instructions corresponding to some dynamically executed path in the program. Traces can cross method boundaries, and even program boundaries (e.g. it can extend into a shared library that your application uses).

This trace-based heuristic for speeding up interpreters was originally developed in the Dynamo project at HP Labs in the late 90s, a project that I spent much of my early career working on. Since its publication in 2000,  the Dynamo system has influenced a number of commercial interpreters that are in widespread use today. Mozilla’s Tracemonkey Javascript interpreter is one example – this is the Javascript engine built into the Firefox browser. But this was the first time I had seen Dynamo ideas being used in the context of a mobile platform. It got me wandering down memory lane, thinking how the original design tradeoffs explored in Dynamo might have changed if the target platform were a mobile device (a power and memory constrained computer).

How does JIT compilation lower interpreter overhead? A rough rule of thumb in program execution is that 20% of the code accounts for 80% of its execution. So, if the program is 1 GB in size, a mere 200 MB of it is where all the action is. Assuming your program is a reasonably long-running application, lowering the interpretive overhead of the 200 MB will determine whether your application is snappy and usable or sluggish and useless. The hard part is determining which 200 MB of the program is hot enough to pay the price of compiling.

Finding hot methods to compile is relatively straightforward: keep a table of counters, one per method that is interpreted; every time a method call is interpreted, bump the counter corresponding to that method; trigger a compilation if the counter exceeds some threshold. Finding hot traces is much harder: trace boundaries are not as well defined as method boundaries. Dynamo’s approach was to assume that any target of a backward-taken branch is a candidate for the start of a trace – the intuition being that this is likely to be the entry point of a loop, even if the loop body spans multiple methods and extends into external libraries. Similar counter based threshold heuristics could be used at these start-of-trace candidate instructions to trigger trace selection and compilation.

But how do we know what instructions are part of the hot trace? In the 90s, the topic of “path profiling” was a hot research area, and various techniques for determine hot traces through a combination of static program analysis and dynamic instruction counting were being proposed. Dynamo took a radically different approach that avoided static analysis and trial runs, yet was simple to engineer and effective on real code. If a candidate start-of-trace instructions is hot, then the sequence of interpreted instructions leading from that start-of-trace instruction is statistically likely to be a hot trace. We called it the Most Recently Executed Tail (or MRET) heuristic.

Here is the diagram taken from the original Dynamo paper, that shows how the interpreter / JIT flow works:

Screen Shot 2014-05-29 at 5.00.39 PM

The gray box labeled Fragment Cache is the in-memory cache of natively compiled trace fragments that are created by the Dynamo interpreter. The goal is to ensure that the program spends more time executing within the Fragment Cache than on any of the white boxes that correspond to the software interpreter. In addition, the resulting speedup should offset whatever time was spent in the interpreter (the white boxes). Handling synchronous and asynchronous interrupts during execution is among the many engineering challenges Dynamo had to address in this design – the paper contains these details.

Accomplishing the goal of speeding up the interpreter enough to recoup the overhead of trace selection and compilation is a tall order. If you look at this diagram closely, you will notice that the input to the Dynamo interpreter is itself native program code! In fact, the original Dynamo system we built was deliberately engineered as a degenerate JIT (native code to native code JIT), so we could understand its behavior in a an extreme scenario. Now here is the counter-intuitive result: even when interpreting native code, Dynamo would speed up many programs as much as 20%, and come close to breaking even in the worst case! This is true even when the input native code is emitted by a state of the art optimizing compiler.

How is this possible? Surely, a native code to native code software interpreter cannot deliver better performance than directly executing the original native code on the CPU (hardware interpreter). To understand why this is possible, we have to understand the performance bottlenecks on modern CPUs, and how Dynamo’s MRET trace selection algorithm can remove them.

The figure below shows a portion of some program’s control flow graph – think of each box as a sequence of straight line instructions (also referred to as a basic block). There is a branch instruction at the end of block A, and a call instruction at the end of block D:

Screen Shot 2014-05-29 at 5.08.13 PM

Modern CPUs use sophisticated branch prediction hardware to predict the target of branches in the native instruction stream, so the target instruction of the branch can be pre-fetched into the CPU’s instruction cache. This keeps the CPU operating at full speed by preventing expensive misses in the instruction memory hierarchy (I-cache and TLB). Predicting the target of a static branch (whose target offset is contained in the instruction itself) is easy. But predicting the target of a dynamic branch (whose target offset is computed during the branch instruction’s execution or the instruction just before to it) is difficult. So, if the call at the end of block D is a dynamic branch, the CPU’s branch prediction hardware would be of little help, and a significant performance penalty would be paid at that point to fetch block G.

But how often does this actually happen in practice? Turns out to be a lot more often than you might think. Many modern programming concepts involve dynamic branches: virtual function calls, switch statements, dynamically linked libraries, etc. The frequency of dynamic branches in program code has grown substantially over the last couple of decades as programming languages have shifted towards more dynamic binding concepts. Modern applications also use external libraries and modules much more today than 20 years ago, and these libraries are generally dynamically linked by the OS dynamic linker loader at runtime. This trend has made it harder for static compilers to optimize the program code to the degree it used to be possible, creating even further optimization opportunities for JITs that operate at program runtime rather than program compile time.

Now let us see how trace selection helps. Block A is a start-of-trace candidate, because when block E is interpreted, its return branch goes backwards in the address space to block A. Suppose block A becomes hot. Dynamo will now enter trace JIT mode (the white boxes labeled G-J in the interpreter flow diagram shown earlier) , and emit the very next sequence of interpreted instructions into the Fragment Cache. Suppose this was the trace ACDGHJE. This trace is emitted into the Fragment Cache memory:

Screen Shot 2014-05-29 at 5.08.56 PMWhen emitting the code for block D, Dynamo will notice that the call branch is redundant, because its target is the immediately following instruction in the trace. Therefore this call branch can be eliminated. This is a critical optimization. In practice, that call branch is very likely to be a dynamically computed branch, which would have incurred a performance penalty. But the very act of trace selection eliminated it, so this dynamic branch-free trace will very likely execute faster on the CPU than the original sequence of blocks in the input program code.

What happens if actual execution now flips to a different hot trace, say ABDGIJE? To identify this condition, Dynamo also treats trace exit points as start-of-trace candidates, and maintains counters to determine if they are getting hot. So, in this example, block B would start to get hot, and this triggers the selection of a new trace starting at block B: BDGIJE. Once this trace is emitted into the Fragment Cache, Dynamo patches the trace exit branch at the end of block A to go to the top of this new trace, and the Fragment Cache now contains two traces:

Screen Shot 2014-05-29 at 5.09.13 PM

You can start to sense one problem with this approach. The Fragment Cache could start to fill up pretty quickly, and once a hot trace becomes cold, there is no easy way to evict it from the cache. Because traces are linked together, evicting a subset of the traces could incur a substantial overhead cost in fixing up the remaining ones. Then there is the problem of determining when a trace gets cold, which is non-trivial because Dynamo does not instrument the native code generated into the Fragment Cache for performance reasons. For a JIT like Dalvik that runs on a memory-constrained mobile device, managing the Fragment Cache memory can be a nasty problem.

In circa 1998, when Dynamo was developed, desktops and laptops of that era had about as much memory as today’s mobile devices. Also, the HP printer division at the time was contemplating the use of Java (then a language still in its infancy) in an embedded environment, where memory constraints created many design challenges. So, early on in its development, we had to worry about Dynamo’s memory footprint.

We developed a very simple, yet surprisingly effective heuristic to manage the Fragment Cache memory. Whenever Dynamo detected a sharp increase in the trace creation rate, it would simply flush the entire Fragment Cache, deleting all traces in it. This works because such spikes in the trace creation rate are usually due to the formation of a new working set in the Fragment Cache. And because during the trace creation process time is being predominantly spent in the interpreter, such a flush is essentially “free”. The concept is illustrated below:

Screen Shot 2014-05-29 at 5.09.42 PM

And there you have it. A native instruction interpreter, that would often speed up a native program binary, even when produced by an optimizing compiler!

On a memory-constrained mobile device, interpreted programs offer a number of benefits over natively compiled programs. Bytecode for stack machines like Java can be more compact than compiled code for processors like x86. Many compiler optimizations also cause expansion of the compiled code, due to operations like code duplication, loop unrolling, etc. If only 20% of the code accounts for 80% of the execution, compiling and optimizing the code for the mobile device’s processor may not be such a good idea. A trace-based JIT interpreter could be a better choice, especially if it can deliver performance comparable to compiled native code.


  1. Dynamo: A Transparent Dynamic Binary Optimization System. Vasanth Bala, Evelyn Duesterwald, Sanjeev Benerjia. ACM Conference on Programming Language Design and Implementation (PLDI). Vancouver, 2000.
  2. A JIT Compiler for Android’s Dalvik VM. Ben Cheng, Bill Buzbee. Google IO, 2010.