When you need to observe the state of a running computer, what do you do? If its just a few computers, you simply ssh into each one and inspect the state from the command line. If you have a lot of computers, you use an agent that locally monitors each computer’s state and periodically ships the results to you. In either case, you make a key assumption: that your inspection logic has to run within the OS context of the running machine in order to observe its state.
There are a number of problems with this approach, and most people are oblivious to it. Lets take the case where you need to monitor thousands of computers – a modest data center scale operational environment. Virtually anyone managing systems at this scale or higher uses some kind of in-system monitoring solution. Data center monitoring is a billion dollar business – it provides what Ops teams refer to as “situational awareness”, the lifeblood of data center operations.
Problem 1: When a system becomes unresponsive, so does the monitoring agent running within it
This is a far more common situation than you might think. Systems can become (intermittently) unresponsive for any number of reasons. A process may be thrashing the disk, or memory, or both and your monitoring agent is not getting sufficient cycles to function. A system update may have modified a library on which your monitoring agent depends, causing the agent itself to malfunction or crash.
Google had such an outage in 2011, when graphs showing the global health of Google vanished from internal dashboards, pagers fell silent, and ssh stopped working. Because the paging system was itself affected, many site reliability engineers had no idea a major outage was in progress. The root cause turned out to be an accidental change to the permission flags of file /lib/x86_64-linux-gnu/ld-2.15.so (the Linux dynamic loader used to exec() user processes) from -rwxr-xr-x to -rw-r–r—. As a result, all user processes on these systems, including Google’s own monitoring agents, failed to start. That google.com did not fail despite what was apparently an outage that impacted “between 15 and 20% of Google’s production serving machines” is a testament to its resilient design. But this highlights the problem of relying on in-system monitoring agents for understanding system health.
Problem 2: In-system monitoring agents are vulnerable to security attacks
Monitoring agents are the foundation of another billion dollar industry: data center security. Yet, in-system security monitoring agents are themselves vulnerable to malicious attacks and accidental disruption, like any other process running inside these systems. That does not give a warm and fuzzy feeling.
Problem 3: In-system monitoring agents impact system performance
Every vendor or individual that writes a “lightweight monitoring agent” promises that the agent’s own operation will not impact the performance of the system it is monitoring. But agents are after all just another piece of software, running in a complex system environment. It is not uncommon to encounter situations where the culprit for poor performance is the monitoring agent itself.
Ultimately, any monitoring logic (even the commands you type when you ssh into a running computer) have side-effects that you are not fully aware of. This is the classic Heisenberg effect: the very act of monitoring the system is affecting the state you are trying to monitor. Most people disregard this as a problem they need to even think about, until their systems become heavily loaded. It is usually under peak loads that the impact of monitoring agents become more noticeable. And that is just when the monitoring data they provide is most necessary.
Introspection: an alternative way to observe system health
Virtualization enables a different way to observe the state of a running system. Introspection refers to the process of inspecting the state of a VM from outside the guest OS context. This is fundamentally different from monitoring, in that there is no monitoring logic running within the VM.
Introspection of a VM’s file system and memory state is possible by leveraging the VMM (virtual machine monitor, aka hypervisor) that interposes between the VM’s guest state and its mapping to the underlying physical host state. The key question is: can out-of-VM introspection approach the robustness and fidelity of conventional in-VM monitoring?
There are 2 parts to this problem: (a) real-time introspection of guest file-system state, and (b) real-time introspection of guest memory. Each has different challenges. Note that if we removed the “real-time” requirement, many good solutions exist already. For instance, backup solutions are now available that use VM disk snapshots to perform continuous backup without the use of an in-system backup agent. A benefit of this approach is you do not have to schedule a downtime window for backing up your VMs (though your VMs may be briefly stunned when the guest OS quiesces any uncommitted state to disk, prior to taking the snapshot).
My team and I have been experimenting with a technique we refer to as “Near Field Monitoring”, that uses VM file-system and memory introspection as an alternative to in-system monitoring agents. After two years of R&D, and many dead-ends, I am proud to say we now have these techniques working in the IBM Research Compute Cloud (RC2) production environment.
The following publications give the technical details of our approach. The work was done with CMU and University of Toronto, originally started during summer internships done by Wolfgang Richter and Sahil Suneja with my team at IBM Watson Labs. In subsequent blogs, I will dive deeper into the technical intricacies of these two papers. The key result here is that introspection based “Near Field Monitoring” techniques can approach the robustness and fidelity of in-system monitoring agents for most common monitoring tasks we have studied. This makes them a viable contender to disrupt the in-system monitoring model that is in widespread use today.
 Agentless Cloud-wide Streaming of Guest File System Updates. Wolfgang Richter (Carnegie Mellon University), Canturk Isci (IBM Research), Jan Harkes and Benjamin Gilbert (Carnegie Mellon University), Vasanth Bala (IBM Research), and Mahadev Satyanarayan (Carnegie Mellon University). Best Paper Award, IEEE International Conference on Cloud Engineering, Boston, MA, March 2014.
 Non-intrusive, Out-of-band and Out-of-the-box Systems Monitoring in the Cloud. Sahil Suneja (University of Toronto), Canturk Isci (IBM Research), Vasanth Bala (IBM Research), Eyal de Lara (University of Toronto), Todd Mummert (IBM Research). ACM SIGMETRICS, Austin TX, June 2014.