
Query the data center like you query the Web

Say you want to query thousands of systems in your data center for something – e.g. the string “”. Maybe you want to know what systems might be impacted if you were to change this IP address somewhere, like in a firewall rule. How would you implement it?

Most people would send this query to agents running on each of the thousands of computers, have each agent execute the query locally by inspecting its machine’s state, and have the results shipped back. Not only is this a terribly clunky approach in practice, it also scales poorly as the number of systems grows. Your query latency is gated by the slowest machine in your data center – you have to wait until every machine responds before you have your answer. What if one of the machines is wedged and its response never comes back; how long should you wait?
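To make the latency problem concrete, here is a minimal sketch in Python of that scatter-gather pattern. The `query_agent` function is a made-up stand-in for a real per-machine agent; the point is only that the answer is not complete until the slowest (or wedged) machine replies, which forces an arbitrary timeout.

```python
import concurrent.futures
import random
import time

def query_agent(host):
    """Simulated in-machine agent inspecting local state.

    Hypothetical placeholder: response times vary per machine,
    and a wedged machine might never reply at all.
    """
    time.sleep(random.uniform(0.01, 0.2))
    return (host, f"result from {host}")

hosts = [f"host{i:03d}" for i in range(100)]

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    futures = {pool.submit(query_agent, h): h for h in hosts}
    # Without a timeout, one wedged machine blocks the answer forever;
    # with one, we must guess how long is long enough.
    done, not_done = concurrent.futures.wait(futures, timeout=5.0)

results = [f.result() for f in done]
print(f"{len(results)} of {len(hosts)} answered in {time.time() - start:.2f}s")
```

The elapsed time is dominated by the slowest agent in the batch, no matter how fast the rest respond.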

Now let us change the context completely. Say you want to query millions of sites on the Web for the string “”. How would you implement it?

This is a no-brainer. You query a central index, not the individual web sites. The index is constantly fed by crawlers that scan every web site periodically to extract changes made to that site since the last crawl. And here is the key: the crawlers have no knowledge of what queries will be asked of the index. Your query latency is independent of the current state of every website.

This approach is not only scalable, it also enables a more intuitive human interface. It is scalable because (a) crawling is a non-intrusive task (unlike running an agent inside a machine), enabling web sites to be monitored frequently enough to keep the index continuously refreshed, and (b) the data extraction and indexing process is decoupled from the query handling process, enabling each to be optimized independently. By decoupling queries from the crawling, there is no requirement to tune the query format to suit the needs of the data crawler – which in turn allows the query interface to be designed for human consumption, and the crawler interface to be designed for machine consumption.
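As a rough illustration of that decoupling, here is a toy inverted index in Python. The `ingest` side is driven by a crawler on its own schedule and knows nothing about future queries; the `query` side touches only the index. Site names and contents are invented for the sketch.

```python
from collections import defaultdict

class Index:
    """Toy inverted index: maps each token to the set of sites containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def ingest(self, site, text):
        # Called by the crawler on its own schedule; has no
        # knowledge of what queries will be asked later.
        for token in text.split():
            self.postings[token].add(site)

    def query(self, token):
        # Latency depends only on the index, not on any site's current state.
        return self.postings.get(token, set())

index = Index()
# Crawler side: periodic, non-intrusive ingestion.
index.ingest("site-a.example", "firewall rule references 10.0.0.7")
index.ingest("site-b.example", "nothing interesting here")
# Query side: answered entirely from the index.
print(index.query("10.0.0.7"))  # {'site-a.example'}
```

Because the two sides meet only at the index, each can be tuned independently – exactly the property the paragraph above describes.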

Search engines like Google, Bing, and Yahoo are able to keep the index remarkably close to the real-time state of billions of web sites, debunking the myth that such an approach risks having the index become too stale to support real-time situational awareness requirements.

So, how can we query the data center like we query the Web?

We must begin by re-thinking how systems are monitored. In an earlier post I talked about “introspection” as an alternative way to monitor the real-time state of a system without the use of in-system agents. Introspection provides the foundation for building a new kind of “crawler”, one that continuously indexes the state of systems in a data center, similar to the way a Web crawler works on documents and web sites. This is because introspection enables crawling systems without disrupting their operation in any way.

In essence, introspection enables us to think about a running system as a series of point-in-time snapshots, where each snapshot is a document containing the metadata about that system’s state extracted by the crawler at a particular point in time. If you think about the system as a movie, you can think about this document as a frame. Frames are literally just documents. You can imagine translating all sorts of useful system state into a simple JSON dictionary, for example, which would look something like this:

  {
    '_frame': {
      timestamp and other crawl metadata
    },
    'file': {
      one entry per monitored file
    },
    'process': {
      one entry per running process
    },
    'connection': {
      one entry per open connection
    },
    'package': {
      one entry per installed package
    }
  }
This is the “frame” output by every crawl of a system: it is the document you have to index to provide a Google-like query interface. And yes, the query response can return faceted search results, rank-ordered by various heuristics that make intuitive sense in a data center context. Your mind immediately jumps to abstractions that are familiar in the Web search domain. Few tools to manage data centers look anything like this today – they are made for consumption by skilled IT Ops people, not regular humans like the rest of us. Why must this be so?
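As a back-of-the-envelope sketch of such a faceted query (all system names and frame contents below are invented), one could index frames by token and report, for each hit, which system matched and in which facet:

```python
from collections import defaultdict

def index_frames(frames):
    """Build a token -> [(system, facet)] posting list from crawled frames."""
    postings = defaultdict(list)
    for frame in frames:
        system = frame["_frame"]["system"]
        for facet in ("file", "process", "connection", "package"):
            for key, attrs in frame.get(facet, {}).items():
                # Tokenize both the entry name and its attribute values.
                text = " ".join([key] + [str(v) for v in attrs.values()])
                for token in text.split():
                    postings[token].append((system, facet))
    return postings

frames = [
    {"_frame": {"system": "host001"},
     "connection": {"tcp/443": {"raddr": "10.0.0.7"}}},
    {"_frame": {"system": "host002"},
     "file": {"/etc/firewall.conf": {"contains": "10.0.0.7"}}},
]
postings = index_frames(frames)
# Faceted result: which systems mention 10.0.0.7, and where?
print(postings["10.0.0.7"])
# [('host001', 'connection'), ('host002', 'file')]
```

The facet labels fall directly out of the frame structure, which is why a frame-shaped document lends itself to the faceted, search-engine-style interface described above.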

The Origami project, which my team has been working on for the last couple of years, has been exploring this very question. Why can’t systems in the data center be queried and indexed like documents on the Web? In fact, the state of many websites changes faster than that of your typical production server, and yet we get reasonably good real-time query results from the index. There really is no good reason why these two worlds have to be so far apart.