
Locality sensitive hashing


A neat problem I’ve been working on for some time now is how to determine if two systems in a data center are “similar”. Before getting into what “similar” means for computer systems, let us first understand why this might be useful.

Many modern applications are deployed as identical clusters of VMs fronted by a load balancer – a common design pattern for scaling the application to handle high incoming request loads. At the time the cluster is first deployed and the application starts its lifecycle, the state of these VMs is indeed identical, as the developer expects. However, lots of things happen over time that can cause the state of one or more of these systems to deviate from the others. Some examples include patches not rolling out to all VMs in the cluster, an accidental or malicious disruption to one of the VMs, an environment or configuration change that causes one of the VMs to behave very differently, and so on. In such situations, it would be useful if one could automatically infer when a collection of VMs that is expected to be similar is, in fact, no longer similar.

To sink our teeth into this problem, let us simplify it to two systems A and B. At time t1, we are told that “A is similar to B right now”.  At time t2, we are asked the question “Is A still similar to B?”. The key here is that “similarity” is defined by a reference point in time, and not by some elaborate rule. So, we must first infer what it is about A and B that is “similar” at time t1. Then we have to use this information to determine if any of the attributes we decided were similar at time t1 are still similar at time t2.

This turns out to be a very difficult problem to solve in practice, and it requires some sophisticated learning and clustering techniques that I’m not going to talk about here. But a building block necessary to solve this problem is the algorithm for determining the degree of similarity between two pieces of data, in this case the state of A and the state of B at some point in time.

Enter the locality sensitive hash, or LSH for short. Most people are familiar with secure hashes, like SHA-1, that are used to determine when two pieces of data are different. Secure hashes have the property that small differences in the data result in a very large divergence of the hashes of the data. This is a very useful property in security, as simply comparing the hashes can determine whether two pieces of data are non-identical. Fuzzy hashes, like LSH, on the other hand, have the property that small differences in the data cause only small differences in the hash values. Thus, if X and Y are two sets, LSH(X) and LSH(Y) will be very close in value if the contents of the two sets are close to one another.
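
To make the contrast concrete, here is a minimal Python sketch (the inputs, the token sets, and the number of hash functions are arbitrary choices of mine for illustration). The SHA-1 digests of two nearly identical strings share nothing in common, while a deliberately simplified MinHash signature (one classic locality sensitive hashing scheme for set similarity) changes only slightly:

  import hashlib

  def sha1_hex(s):
      return hashlib.sha1(s.encode()).hexdigest()

  # Two nearly identical inputs produce completely unrelated secure hashes.
  print(sha1_hex("the quick brown fox"))
  print(sha1_hex("the quick brown fix"))

  def minhash_signature(tokens, num_perm=64):
      """Simplified MinHash: for each of num_perm salted hash functions,
      keep the minimum hash value observed over the token set."""
      return [
          min(int.from_bytes(hashlib.sha1(str(i).encode() + t.encode()).digest()[:8], "big")
              for t in tokens)
          for i in range(num_perm)
      ]

  def signature_similarity(a, b):
      """The fraction of matching signature positions estimates how similar the sets are."""
      return sum(p == q for p, q in zip(a, b)) / len(a)

  s1 = minhash_signature({"the", "quick", "brown", "fox"})
  s2 = minhash_signature({"the", "quick", "brown", "fix"})
  print(signature_similarity(s1, s2))   # high, since the two token sets share 3 of their 5 distinct tokens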

A pretty good measure of “similarity” is the Jaccard similarity  JS(X, Y) between two sets X and Y, which can be computed using this equation:

   JS(X, Y) = |X ∩ Y| / |X ∪ Y|
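
In Python, this equation is essentially a one-liner over sets (the process-name sets below are just hypothetical attribute values):

  def jaccard_similarity(x, y):
      """Jaccard similarity of two sets: |X ∩ Y| / |X ∪ Y|."""
      x, y = set(x), set(y)
      if not x and not y:
          return 1.0          # convention: two empty sets are considered identical
      return len(x & y) / len(x | y)

  print(jaccard_similarity({"sshd", "nginx", "cron"}, {"sshd", "nginx", "ntpd"}))   # 0.5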

So the trick to understanding the similarity between two computer systems A and B is to extract a set of attributes from the state of each system, and treat these as the sets X and Y for purposes of similarity scoring. The names of running processes, the list of open connections, and the lstat metadata of the file system are some examples of system attributes that might make sense to extract. (Things are not so simple in reality – some attributes are more important than others, requiring a weighting heuristic to be introduced into the basic Jaccard similarity equation, but that is the subject of a research paper, not a short blog.)
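
The actual weighting heuristic is out of scope here, but as a rough sketch of the idea, a weighted Jaccard score lets each attribute contribute its weight instead of 1. The attribute names and weights below are made up purely for illustration:

  def weighted_jaccard(x, y, weight):
      """Weighted Jaccard: each attribute contributes its weight rather than 1."""
      x, y = set(x), set(y)
      union = sum(weight(a) for a in x | y)
      return sum(weight(a) for a in x & y) / union if union else 1.0

  # Hypothetical weights: process attributes matter more than open ports.
  weights = {"proc:nginx": 3.0, "proc:sshd": 3.0, "port:443": 1.0, "port:22": 1.0}
  a = {"proc:nginx", "proc:sshd", "port:443", "port:22"}
  b = {"proc:nginx", "port:443", "port:22"}
  print(weighted_jaccard(a, b, lambda attr: weights.get(attr, 1.0)))   # 5/8 = 0.625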

In an earlier blog post, I pointed out that the state of a system at some point in time can be transcoded as a document.  Jaccard similarity between two documents can be efficiently computed at scale using a neat technique known as “shingling”. Shingling gets its name from the manner in which shingles are laid on a roof in an overlapping pattern. A k-shingling is a grouping of every k consecutive words of the document into a new object. Each of these objects can now be treated as an attribute, instead of every word in the document being treated as an attribute (this is equivalent to a 1-shingling). This results in a substantial space reduction, with some loss of fidelity in the similarity score.

For example, consider two documents X and Y, where:

  X = Vas I am
  Y = I am Vas

A 2-shingling of each of the documents yields the following objects:

  X' = {Vas I} {I am}
  Y' = {I am} {am Vas}

The documents X’ and Y’ have fewer “words” than the original documents X and Y. Now we can compute the Jaccard similarity between X’ and Y’ and use it as an approximation of the Jaccard similarity between the original X and Y:

  JS(X', Y') = 1/3 ≈ 0.33

The similarity score between documents X and Y is thus approximately 0.33. Jeffrey Ullman’s free online book, “Mining of Massive Datasets” gives an excellent overview of similarity detection algorithms, including the Jaccard distance computation.
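
The 2-shingling example above takes only a few lines of Python to reproduce (the helper below simply splits on whitespace, which is enough for this toy example):

  def k_shingles(text, k):
      """Group every k consecutive words of a document into a shingle."""
      words = text.split()
      return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

  x = k_shingles("Vas I am", 2)    # {"Vas I", "I am"}
  y = k_shingles("I am Vas", 2)    # {"I am", "am Vas"}
  print(len(x & y) / len(x | y))   # 1/3 ≈ 0.33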

There are more sophisticated algorithms for similarity detection. One of them is the SSDEEP algorithm, used by the National Software Reference Library (NSRL). But wait – the NSRL is a project supported by the US Department of Homeland Security, and its goal is to “promote efficient and effective use of computer technology in the investigation of crimes involving computers”. What is a fuzzy hash like SSDEEP doing here?

The NSRL is really a giant white list of files that are included in known software products shipped by thousands of software publishers. Every known file has a corresponding SSDEEP hash. Now, here is the neat thing about SSDEEP. Given a random file you found on your computer, you can compute its SSDEEP hash (the algorithm and its implementation are open source). Then you could check with the NSRL whether SSDEEP(F), where F is the file on your computer, matches anything in its white list. It will respond with one of three answers:

  1. Yes, F is a known file, and it belongs to product P
  2. No, F is an unknown file
  3. F is very similar to a known file

That third response is very interesting indeed! It suggests that the original file F may have been tampered with – and you may want to investigate this further.
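
To make this concrete, here is a rough sketch using the open source ssdeep Python bindings (that particular package, the file path, and the simulated tampering are all assumptions of mine; the actual NSRL lookup is not shown):

  import ssdeep   # assumption: the ssdeep Python bindings are installed

  original = open("/usr/bin/python3", "rb").read()   # a stand-in for some known file
  tampered = original[:-16] + b"\x00" * 16           # simulate a small modification

  h_known = ssdeep.hash(original)                    # what a white list like the NSRL would store
  h_found = ssdeep.hash(tampered)                    # what you compute for the file F you found

  score = ssdeep.compare(h_known, h_found)           # a 0-100 match score
  if h_found == h_known:
      print("F is a known file")
  elif score > 0:
      print(f"F is very similar to a known file (score {score})")
  else:
      print("F is an unknown file")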

How cool is that? Now let’s go back to that neat system similarity problem I introduced at the beginning. What if you could build an NSRL-like white list of hashes for systems in your data center that are in their “desired state” (e.g. immediately after they begin their lifecycle)? Deviation from that desired state could then be detected quickly, by periodically computing an appropriate fuzzy hash of all the systems in your environment and comparing the hashes against your white list.
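
As a hedged sketch of what that might look like (the “state document” here is deliberately simplistic, just sorted process names, and the ssdeep bindings and the drift threshold are assumptions on my part):

  import subprocess
  import ssdeep   # assumption: the ssdeep Python bindings are installed

  def system_state_document():
      """A deliberately simplistic 'state as a document': sorted running process names."""
      out = subprocess.run(["ps", "-eo", "comm="], capture_output=True, text=True).stdout
      return "\n".join(sorted(set(out.split())))

  # Record the white list entry while the system is in its desired state...
  baseline = ssdeep.hash(system_state_document())

  # ...then periodically re-hash and compare; a low match score suggests drift.
  def has_drifted(threshold=80):
      return ssdeep.compare(baseline, ssdeep.hash(system_state_document())) < threshold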

Of course you would still need to solve the vexing details I mentioned briefly, like what attributes matter and to what degree. But you get the general idea.

References

  1. Mining of Massive Datasets. Jeffrey Ullman, Stanford University.
  2. Locality Sensitive Hashing. Wikipedia entry.
  3. SSDEEP algorithm for context-triggered piecewise hashes.