With the onset of the digital revolution a few decades ago, preservation of digital content became a challenge. The process of archival, indexing, and curation that libraries and museums had used for centuries required a massive transformation in order to work for digital artifacts. The Library of Congress now archives digital media (text, audio and video), as do a number of libraries around the world.
Despite all this progress, however, we have overlooked one important category of digital content, whose preservation may matter even more than the text, audio, and video data we archive today.
An increasing portion of the world’s intellectual output is now in the form of executable content. Examples include simulations, education systems, expert systems, data visualization tools, interactive games, etc. Even content that appears static, such as a Web site, is often dynamically generated by code that customizes the content and appearance for individual readers at runtime.
Consider also the applications required to read the digital data we depend on today. We preserve important digital information in personal and Cloud-hosted backup systems without bothering to also preserve the applications we need to process it. How many of us can still read a WordPerfect document written in the 1980s, or a TurboTax income tax return created in the 1990s? Now roll the clock forward another ten years and ask yourself how you would be affected if the formats you entrust your precious data to today could no longer be processed.
For digital content like photographs, “fidelity” is a straightforward concept to define: we want all of the pixel data preserved without any loss, along with any metadata about the photograph, such as location coordinates and date. But when it comes to executable content, fidelity is much more difficult to define precisely. It can depend on many things: the computer hardware, the operating system, dynamically linked libraries, and so on.
Simply preserving the software code, or even the compiled binary (both are different forms of digital text), is not sufficient – the tool chain used to compile the software, along with all of its dependencies, must also be preserved, and there is no guarantee that all of this will still work a decade from now.
This problem is also different from that of data decay (also known as bit rot), which is the degradation of the storage media on which digital data is kept. Data decay is analogous to the degradation of old documents printed before the invention of acid-free paper. We are talking about the content here, not the storage medium the content is kept on. The latter is an orthogonal problem to the one we are examining here, though also a critical one from a historical preservation perspective.
VM images are ideal for encapsulating executable content with high enough fidelity to make them practical for preserving many useful executable environments. A VM is essentially a hardware instruction set emulator of such high accuracy that the OS and applications within it are unable to detect its presence. The VM’s emulated instruction set interface is tiny relative to the diversity of software that runs over it, and to the diversity of hardware on which this interface can be efficiently emulated. This makes the VM a very durable abstraction for historical preservation of executable content, and a considerably more attractive alternative than mothballing the entire physical computer hardware.
There are of course scenarios where a VM is insufficient to reproduce a program’s execution with full fidelity. For example, if an application uses an external Web Service, like the Google Maps API, its execution dependencies cannot be fully encapsulated in the VM image. Still, there are enough scenarios where VMs offer sufficient execution fidelity for future generations to experience much of today’s executable content.
Olive: a public domain VM library
A collaboration between IBM Research and Carnegie Mellon University, supported by grants from IBM, the Sloan Foundation, and IMLS.org, is building Olive, a public domain library for preserving executable content as VMs.
The idea of using VMs for software preservation is not new. VMs are already used commercially for distributing pre-installed and pre-configured software environments. They have also been used in preservation efforts in the past.
What makes Olive different is that it aims to tackle three problems that are crucial for an online digital library to be practical and usable by the public. First is the problem of how to “check out” a VM from the library without resorting to a long and slow download process. Second is the problem of how to search for something in the library without depending entirely on the VM metadata. And third is the problem of how to easily contribute new executable content to the library without having to be an expert in VM creation tools. We have built a fully functional prototype that addresses the first problem; technologies to address the other two are works in progress.
To “check out” and run VMs published in the Olive library, we have created the VMNetX application, which currently runs on Linux and uses the open-source KVM virtual machine monitor. VMNetX can execute VMs directly from any web server – no special server software is required. It can be installed on a user’s local laptop, or provided as a Cloud service where Olive VMs are automatically executed and the user interacts with the VM’s display over the Internet. VMNetX is developed on GitHub and released under a GPLv2 license.
VMNetX is built on Internet Suspend/Resume (ISR), a technique developed at CMU to “stream” VMs over the Internet. The user experience is similar to playing a video from YouTube: a user clicks on a link, and the VM corresponding to that link is demand-paged to a machine where the VM executes. Demand paging allows the ISR system to move only the part of the VM’s state (disk and memory) that is required by the applications executing within it, resulting in a much faster and smoother experience for the user. This works because executable content tends to spend a lot of time within working sets, which are generally much smaller than the state of the entire VM. Once the pages that comprise a working set are locally cached, the VM’s execution touches only this local state, and the execution fidelity is good.
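The demand-paging idea can be sketched in a few lines of Python. This is a minimal illustration, not ISR's actual implementation: the `DemandPagedImage` class, the 64 KB chunk size, and the `fetch_chunk` callback are all hypothetical stand-ins (in practice the fetch would be, say, an HTTP range request against the web server hosting the VM image).

```python
CHUNK = 64 * 1024  # illustrative page size; the real system's chunking may differ

class DemandPagedImage:
    """Serve reads of a remote VM image, fetching chunks only on first touch."""

    def __init__(self, fetch_chunk, image_size):
        self.fetch_chunk = fetch_chunk   # callable(offset, length) -> bytes
        self.image_size = image_size
        self.cache = {}                  # chunk index -> cached bytes
        self.fetches = 0                 # chunks that actually crossed the network

    def read(self, offset, length):
        data = bytearray()
        end = min(offset + length, self.image_size)
        while offset < end:
            idx = offset // CHUNK
            if idx not in self.cache:    # "page fault": go to the server once
                self.cache[idx] = self.fetch_chunk(idx * CHUNK, CHUNK)
                self.fetches += 1
            chunk = self.cache[idx]
            lo = offset - idx * CHUNK
            take = min(end - offset, CHUNK - lo)
            data += chunk[lo:lo + take]
            offset += take
        return bytes(data)

# Simulate a 1 MB "remote" disk image held in memory.
remote = bytes(range(256)) * 4096
img = DemandPagedImage(lambda off, n: remote[off:off + n], len(remote))

img.read(0, 4096)      # first touch: faults in one 64 KB chunk
img.read(100, 200)     # same working set: served entirely from the local cache
print(img.fetches)     # -> 1
```

Once the working set (here, a single chunk) is cached, further reads never leave the machine, which is exactly why the streamed VM feels local after a short warm-up.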
When a VM is published into Olive, its file system contents can be introspected and indexed. This indexing process allows automatic inference of the contents of a VM by looking up a table of known content hashes. A technical challenge here is to index the contents of the file system within the VM image (which have high semantic value) rather than the image’s disk blocks (which have low semantic value). This work is still ongoing, but such a capability would allow users to search for VMs by content, instead of relying solely on the metadata associated with each VM to tell what it actually contains. It is also valuable in determining the provenance of the content, and in certifying it for security purposes.
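The hash-lookup half of this is easy to illustrate. The sketch below assumes the hard part has already been done, i.e. the VM's file system has been extracted to an ordinary directory tree; `index_tree`, `KNOWN_HASHES`, and the single table entry are hypothetical, invented for this example rather than taken from Olive.

```python
import hashlib
import os
import tempfile

KNOWN_HASHES = {
    # Hypothetical catalog entry: SHA-1 of the one-byte file b"1".
    # A real table would be built from a catalog of known software releases.
    "356a192b7913b04c54574d18c28d46e6395428ab": "Pac-Man (example entry)",
}

def index_tree(root):
    """Map each file under root to a known-software label, by content hash."""
    found = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            if digest in KNOWN_HASHES:
                found[path] = KNOWN_HASHES[digest]
    return found

# Demo: a one-file "extracted VM file system".
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "game.bin"), "wb") as f:
        f.write(b"1")                # content whose hash is in the table
    print(index_tree(root))          # maps game.bin's path to the catalog label
```

Because the lookup keys on content rather than file names, a renamed or relocated binary is still identified, which is what makes the approach useful for provenance and certification.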
Finally, Olive aims to enable anyone to contribute VMs to the library without having to install or understand complex VM image building tools. It does so through a process called dynamic VM synthesis. The diagram below is a brief overview of how this might work:
There are three different clients of Olive in this diagram. Suppose Client 1 publishes a base OS image into Olive – our initial assumption is that this will be a carefully controlled process, so only members of the Olive team can perform this first step. Client 2 has an application (say, a Pac-Man game) that requires that specific OS to run. Assume the application binary is already present on Client 2’s local machine – how the binary was transferred from its original storage medium onto the client’s machine is not relevant to this discussion. All Client 2 needs to do is retrieve and run the original VM containing just the OS using the VMNetX client. The Pac-Man binary can then be installed inside this locally running VM – the bits can be moved into the VM either over the network (which even early versions of the Mac, Windows, and Linux OSes support), or by exporting the guest OS’s file system to the host (which may require bundling drivers that understand the guest file system into the VMNetX distribution). Client 2 then publishes the modified VM to Olive. Using metadata passed via the VMNetX client, Olive can determine that this new Pac-Man image is a delta over the original OS image. It can then compute the delta and store only the delta internally, with a back pointer to the parent OS image. When Client 3 later retrieves the Pac-Man VM, Olive dynamically synthesizes it from the original OS image and the Pac-Man delta, and streams it to her.
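The compute-delta and synthesize steps can be sketched at the block level. This is a simplified model under stated assumptions: images are byte strings rather than real disk images, the 4 KB block size is arbitrary, and `compute_delta` / `synthesize` are hypothetical names, not Olive's API.

```python
BLOCK = 4096  # illustrative block size

def compute_delta(base, modified):
    """Record only the blocks of `modified` that differ from `base`."""
    delta_blocks = {}
    nblocks = (len(modified) + BLOCK - 1) // BLOCK
    for i in range(nblocks):
        lo = i * BLOCK
        new = modified[lo:lo + BLOCK]
        if new != base[lo:lo + BLOCK]:
            delta_blocks[i] = new
    return {"size": len(modified), "blocks": delta_blocks}

def synthesize(base, delta):
    """Rebuild the published VM image from the parent image plus its delta."""
    out = bytearray(base[:delta["size"]].ljust(delta["size"], b"\0"))
    for i, block in delta["blocks"].items():
        out[i * BLOCK:i * BLOCK + len(block)] = block
    return bytes(out)

base_os = b"\0" * (1 << 20)                  # stand-in for the base OS image
with_game = bytearray(base_os)
with_game[8192:8192 + 6] = b"PACMAN"         # "install" the application
delta = compute_delta(base_os, bytes(with_game))

assert synthesize(base_os, delta) == bytes(with_game)
print(len(delta["blocks"]))                  # -> 1: only one block changed
```

The payoff is storage: the library keeps the full base OS image once, and each derived VM costs only its handful of changed blocks plus a back pointer to the parent.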
The Olive library prototype now holds a number of VMs that contain historically significant executable environments. Examples include The Great American History Machine, Microsoft Office 6.0, the NCSA Mosaic browser on Mac OS 7.5, and TurboTax 1997. Here are some screenshots of these VMs in action:
- Collaborating with Executable Content Across Space and Time. Mahadev Satyanarayanan, Vasanth Bala, Gloriana St Clair, Erika Linke. International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), Orlando, FL, 2011.
- Olive Executable Archive (Olive project website)
- VMNetX client for running VMs published in the Olive library
- Virtual Machine Images as Structured Data. Glenn Ammons, Vasanth Bala, Todd Mummert, Darrell Reimer, Xiaolan Zhang. USENIX HotCloud 2011.