Systems as Data

In an earlier post I asked why systems (aka executable state) cannot be indexed and queried just like documents. There is a much broader idea here, one that I have been obsessing over since around 1997. And that is the idea that systems can be treated as simply another form of data, so that familiar paradigms from the data domain such as indexing, fingerprinting, clustering, markup, tagging, etc. can apply to systems too.

The inception of this idea occurred in 1997, when I was trying to build a native binary interpreter called Dynamo at HP Labs. This was a degenerate interpreter in the sense that its input binary source was the same as its output binary target. You can find out more about how this was engineered in my earlier post on JIT acceleration. The kernel of the interpreter loop essentially involved rewriting segments of the input binary stream into a code cache, so that the code in that cache could be manipulated in ways that could boost the overall runtime performance of the interpreted program. It struck me that what I was really doing was manipulating executable content (the source binary image) as data.
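To make the code-cache idea concrete, here is a toy sketch (my own simplification, not Dynamo itself): an interpreter that counts how often each segment of its input program is entered, and once a segment gets hot, copies it into a cache where it can be rewritten before re-execution. Instructions are modeled as plain strings, and the "optimization" is just dropping no-ops.

```python
class CachingInterpreter:
    """Toy interpreter that rewrites hot segments into a code cache."""

    def __init__(self, program, hot_threshold=2):
        self.program = program      # the "binary image", as a list of ops
        self.code_cache = {}        # start pc -> rewritten segment
        self.exec_counts = {}       # start pc -> times entered
        self.hot_threshold = hot_threshold

    def rewrite(self, segment):
        # Stand-in for real optimization: drop no-ops from the segment.
        return [op for op in segment if op != "nop"]

    def run_segment(self, segment):
        # A real interpreter would dispatch each instruction here;
        # this sketch just returns the ops it would execute.
        return segment

    def enter(self, pc, length):
        self.exec_counts[pc] = self.exec_counts.get(pc, 0) + 1
        if pc in self.code_cache:
            return self.run_segment(self.code_cache[pc])
        segment = self.program[pc:pc + length]
        if self.exec_counts[pc] >= self.hot_threshold:
            # Hot segment: rewrite it into the code cache and run
            # the cached (manipulated) version from now on.
            self.code_cache[pc] = self.rewrite(segment)
            return self.run_segment(self.code_cache[pc])
        return self.run_segment(segment)
```

The point of the sketch is only the shape of the loop: the executable content flows through the interpreter as ordinary data that can be copied, transformed, and cached.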

This made me wonder if there were other things I could do with a binary image that one normally only does with non-executable data like audio, video, and documents.

For instance, could a binary image be streamed on demand over a network, and executed by a “player” on a remote machine without ever requiring a local install step? I presented this idea at the IBM TJ Watson Research Center in May 2001. The talk, titled “Software as Content,” was also my interview seminar. Although I got the job, I must admit the audience had this look of bewilderment on their faces. Clearly I had to articulate the commercial value of this vision to convince people of its power.

During my first two years there, with the help of a small team, I prototyped an application streaming service that eventually became the Progressive Deployment System (PDS), a new product shipped by IBM for streaming pre-installed and pre-configured apps to desktops within an enterprise. The end user experience was much like viewing a video clip via YouTube. You could click a link on a webpage, and that would trigger a locally installed software stream player to communicate with a remote streaming server to push a binary application image over the network, while simultaneously starting its execution on the local desktop.
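A minimal sketch of the on-demand idea (an assumed design for illustration, not the actual PDS protocol): the player fetches fixed-size blocks of the application image from the server only when execution first touches them, caching each block locally, so startup never waits for a full install.

```python
BLOCK_SIZE = 4096

class ImageServer:
    """Holds a full application image and serves it block by block."""

    def __init__(self, image_bytes):
        self.image = image_bytes

    def fetch_block(self, index):
        start = index * BLOCK_SIZE
        return self.image[start:start + BLOCK_SIZE]

class StreamPlayer:
    """Local player: pulls blocks lazily, caches them for reuse."""

    def __init__(self, server):
        self.server = server
        self.blocks = {}    # block index -> locally cached bytes

    def read(self, offset, length):
        # Fetch only the blocks this read touches.
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        out = bytearray()
        for idx in range(first, last + 1):
            if idx not in self.blocks:
                self.blocks[idx] = self.server.fetch_block(idx)
            out += self.blocks[idx]
        start = offset - first * BLOCK_SIZE
        return bytes(out[start:start + length])
```

In a real system the `read` calls would come from a filesystem driver intercepting the running application's I/O; the sketch only shows why execution can begin before the image has fully arrived.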

Software images could indeed be treated like video images. A number of video-related paradigms suddenly made sense in the software domain: streaming, compressed encoding, edge-of-network caching, etc.

Still, we had a hard time convincing people that viewing software and systems as data was a valuable thing. Our next attempt to make the case was to create Mirage, a system that would index and version VM images in the same way that a source control system like Git versions text files. As VM images started to proliferate in data centers, the value of such a solution became ever more obvious. For me the cool thing was that Mirage made system images feel like documents – pretty much anything you could do in a document version control system you could now do with VM images.
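The Git analogy can be sketched with content addressing (my own illustrative simplification, not Mirage's actual design): split an image into chunks, store each chunk once under its hash, and record a version as just the list of chunk hashes. Versions that share content automatically share storage, and diffing two versions reduces to comparing hash lists.

```python
import hashlib

CHUNK = 4096

class ImageStore:
    """Content-addressed store for versioned images."""

    def __init__(self):
        self.chunks = {}     # sha256 hex digest -> chunk bytes
        self.versions = {}   # version name -> ordered list of digests

    def commit(self, name, image_bytes):
        digests = []
        for i in range(0, len(image_bytes), CHUNK):
            chunk = image_bytes[i:i + CHUNK]
            h = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(h, chunk)   # dedup: store each chunk once
            digests.append(h)
        self.versions[name] = digests

    def checkout(self, name):
        # Reassemble the full image from its chunk list.
        return b"".join(self.chunks[h] for h in self.versions[name])

    def diff(self, a, b):
        # Chunk positions where the two versions differ.
        va, vb = self.versions[a], self.versions[b]
        return [i for i, (x, y) in enumerate(zip(va, vb)) if x != y]
```

Two image versions that differ in a single chunk cost only one extra chunk of storage, which is what makes versioning large VM images tractable at all.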

The Origami project, which followed Mirage, explored yet another dimension of the Systems as Data idea: we applied similarity detection algorithms, of the kind normally used to cluster documents, to the problem of clustering VM images.
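One simple way to see the document analogy (an assumed approach for illustration; not necessarily Origami's algorithm) is to represent each image by the set of hashes of its chunks and compare images with the Jaccard index, exactly as one might compare documents by their word sets, then cluster greedily on that similarity.

```python
import hashlib

def chunk_hashes(image_bytes, chunk=4096):
    """Signature of an image: the set of its chunk hashes."""
    return {
        hashlib.sha256(image_bytes[i:i + chunk]).hexdigest()
        for i in range(0, len(image_bytes), chunk)
    }

def jaccard(a, b):
    """Jaccard index of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster(images, threshold=0.5):
    """Greedy single-pass clustering by signature similarity.

    Each image joins the first cluster whose representative is
    similar enough, else it starts a new cluster.
    """
    clusters = []  # list of (representative signature, member names)
    for name, data in images.items():
        sig = chunk_hashes(data)
        for rep, members in clusters:
            if jaccard(sig, rep) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((sig, [name]))
    return [members for _, members in clusters]
```

Images built from the same base OS share most of their chunks and therefore land in the same cluster, which is the property that makes clustering VM images useful in a data center.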

Despite these demonstrations, most people remained unconvinced about the value and power of the Systems as Data idea. Over the years my personal obsession with this has only grown further. Perhaps some day I will get a chance to explore it again.