
JIT acceleration with Dynamo


I stumbled into a presentation about Google’s Dalvik VM, the virtual machine that executes bytecode (compiled from Java) on the Android mobile platform. Like many modern interpreters, Dalvik uses a JIT (just-in-time compiler) to lower interpretive overhead by compiling frequently interpreted code into native binary code that can be directly executed by the hardware interpreter (aka the CPU).

A crucial design decision in engineering a JIT is the heuristic that determines what to compile and when to do so. Early JITs compiled entire methods, and used method entry counts to determine when a method was “hot” enough to warrant compilation. The Dalvik JIT, however, uses a hybrid design that combines a method-based heuristic with a trace-based heuristic to identify and compile hot instruction traces. A “trace” is a dynamic scoping concept (unlike a method, which is a static scoping concept) – it is the sequence of instructions corresponding to some dynamically executed path in the program. Traces can cross method boundaries, and even program boundaries (e.g. a trace can extend into a shared library that your application uses).

This trace-based heuristic for speeding up interpreters was originally developed in the Dynamo project at HP Labs in the late 90s, a project that I spent much of my early career working on. Since its publication in 2000, the Dynamo system has influenced a number of commercial interpreters that are in widespread use today. Mozilla’s TraceMonkey JavaScript engine is one example – it is the JavaScript engine built into the Firefox browser. But this was the first time I had seen Dynamo ideas being used in the context of a mobile platform. It got me wandering down memory lane, thinking about how the original design tradeoffs explored in Dynamo might have changed if the target platform were a mobile device (a power- and memory-constrained computer).

How does JIT compilation lower interpreter overhead? A rough rule of thumb in program execution is that 20% of the code accounts for 80% of its execution. So, if the program is 1 GB in size, a mere 200 MB of it is where all the action is. Assuming your program is a reasonably long-running application, lowering the interpretive overhead of the 200 MB will determine whether your application is snappy and usable or sluggish and useless. The hard part is determining which 200 MB of the program is hot enough to pay the price of compiling.

Finding hot methods to compile is relatively straightforward: keep a table of counters, one per interpreted method; every time a method call is interpreted, bump the counter corresponding to that method; trigger a compilation if the counter exceeds some threshold. Finding hot traces is much harder: trace boundaries are not as well defined as method boundaries. Dynamo’s approach was to assume that any target of a backward-taken branch is a candidate for the start of a trace – the intuition being that this is likely to be the entry point of a loop, even if the loop body spans multiple methods and extends into external libraries. Similar counter-based threshold heuristics can then be used at these start-of-trace candidate instructions to trigger trace selection and compilation.
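
To make the counter-based heuristic concrete, here is a minimal C++ sketch (not Dynamo’s actual code); the class name, the `onTakenBranch` hook, and the threshold are illustrative assumptions:

```cpp
// Illustrative sketch of counter-based hot-spot detection for a trace JIT.
// Targets of backward-taken branches are treated as start-of-trace
// candidates; each one gets a counter, and crossing a threshold triggers
// trace selection. Names and threshold handling are hypothetical.
#include <cstdint>
#include <unordered_map>

using Addr = std::uint64_t;

class HotSpotDetector {
public:
    explicit HotSpotDetector(unsigned threshold) : threshold_(threshold) {}

    // Called by the interpreter after it executes a taken branch.
    // Returns true when the branch target has become hot enough to
    // start selecting a trace from it.
    bool onTakenBranch(Addr branchPc, Addr targetPc) {
        if (targetPc >= branchPc)          // only backward-taken branches
            return false;                  // mark start-of-trace candidates
        return ++counters_[targetPc] >= threshold_;
    }

private:
    unsigned threshold_;                              // chosen empirically
    std::unordered_map<Addr, unsigned> counters_;     // one per candidate
};
```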

But how do we know which instructions are part of the hot trace? In the 90s, “path profiling” was a hot research topic, and various techniques for determining hot traces through a combination of static program analysis and dynamic instruction counting were being proposed. Dynamo took a radically different approach that avoided static analysis and trial runs, yet was simple to engineer and effective on real code: if a candidate start-of-trace instruction is hot, then the sequence of interpreted instructions leading from that start-of-trace instruction is statistically likely to be a hot trace. We called it the Most Recently Executed Tail (or MRET) heuristic.
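
A hedged sketch of the MRET idea follows; the stopping conditions (another backward-taken branch or a length cap) and the `TraceRecorder` interface are simplifications invented for this illustration, not the paper’s exact rules:

```cpp
// Illustrative Most Recently Executed Tail (MRET) recorder: once a
// start-of-trace candidate crosses its threshold, simply record the
// addresses of whatever instructions the interpreter executes next,
// and stop at a heuristic end-of-trace point.
#include <cstddef>
#include <cstdint>
#include <vector>

using Addr = std::uint64_t;

class TraceRecorder {
public:
    explicit TraceRecorder(std::size_t maxLen) : maxLen_(maxLen) {}

    void start(Addr startPc) {
        trace_.assign(1, startPc);
        recording_ = true;
    }

    // Called for each instruction interpreted while recording.
    // Returns true when the trace is complete and can be handed to
    // the trace compiler for emission into the fragment cache.
    bool record(Addr pc, bool isBackwardTakenBranch) {
        if (!recording_) return false;
        trace_.push_back(pc);
        if (isBackwardTakenBranch || trace_.size() >= maxLen_) {
            recording_ = false;
            return true;
        }
        return false;
    }

    const std::vector<Addr>& trace() const { return trace_; }

private:
    std::size_t maxLen_;              // cap on trace length (invented)
    bool recording_ = false;
    std::vector<Addr> trace_;         // the most recently executed tail
};
```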

Here is the diagram from the original Dynamo paper that shows how the interpreter / JIT flow works:

[Figure: Dynamo interpreter / JIT control flow, from the original Dynamo paper]

The gray box labeled Fragment Cache is the in-memory cache of natively compiled trace fragments that are created by the Dynamo interpreter. The goal is to ensure that the program spends more time executing within the Fragment Cache than on any of the white boxes that correspond to the software interpreter. In addition, the resulting speedup should offset whatever time was spent in the interpreter (the white boxes). Handling synchronous and asynchronous interrupts during execution is among the many engineering challenges Dynamo had to address in this design – the paper contains these details.

Accomplishing the goal of speeding up the interpreter enough to recoup the overhead of trace selection and compilation is a tall order. If you look at this diagram closely, you will notice that the input to the Dynamo interpreter is itself native program code! In fact, the original Dynamo system we built was deliberately engineered as a degenerate JIT (a native code to native code JIT), so we could understand its behavior in an extreme scenario. Now here is the counter-intuitive result: even when interpreting native code, Dynamo would speed up many programs by as much as 20%, and come close to breaking even in the worst case! This is true even when the input native code is emitted by a state-of-the-art optimizing compiler.

How is this possible? Surely, a native code to native code software interpreter cannot deliver better performance than directly executing the original native code on the CPU (hardware interpreter). To understand why this is possible, we have to understand the performance bottlenecks on modern CPUs, and how Dynamo’s MRET trace selection algorithm can remove them.

The figure below shows a portion of some program’s control flow graph – think of each box as a sequence of straight line instructions (also referred to as a basic block). There is a branch instruction at the end of block A, and a call instruction at the end of block D:

[Figure: portion of a program’s control flow graph; block A ends in a branch, block D ends in a call to block G]

Modern CPUs use sophisticated branch prediction hardware to predict the target of branches in the native instruction stream, so the target instruction of the branch can be pre-fetched into the CPU’s instruction cache. This keeps the CPU operating at full speed by preventing expensive misses in the instruction memory hierarchy (I-cache and TLB). Predicting the target of a static branch (whose target offset is contained in the instruction itself) is easy. But predicting the target of a dynamic branch (whose target offset is computed during the branch instruction’s execution, or by the instruction just before it) is difficult. So, if the call at the end of block D is a dynamic branch, the CPU’s branch prediction hardware would be of little help, and a significant performance penalty would be paid at that point to fetch block G.

But how often does this actually happen in practice? It turns out to be a lot more often than you might think. Many modern programming concepts involve dynamic branches: virtual function calls, switch statements, dynamically linked libraries, etc. The frequency of dynamic branches in program code has grown substantially over the last couple of decades as programming languages have shifted towards more dynamic binding concepts. Modern applications also use external libraries and modules much more than they did 20 years ago, and these libraries are generally dynamically linked by the OS’s dynamic linker/loader at runtime. This trend has made it harder for static compilers to optimize program code to the degree that used to be possible, creating further optimization opportunities for JITs, which operate at program runtime rather than at compile time.
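
As a small, generic C++ illustration (not taken from the paper), the virtual call below typically compiles to an indirect branch whose target is loaded from a vtable at runtime, which is exactly the kind of dynamically computed branch that branch predictors struggle with:

```cpp
#include <cstdio>

struct Shape {
    virtual double area() const = 0;   // call sites become indirect branches
    virtual ~Shape() = default;
};

struct Circle : Shape {
    double r;
    explicit Circle(double r) : r(r) {}
    double area() const override { return 3.14159 * r * r; }
};

struct Square : Shape {
    double s;
    explicit Square(double s) : s(s) {}
    double area() const override { return s * s; }
};

double totalArea(const Shape* const* shapes, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += shapes[i]->area();      // target fetched from the vtable on
    return sum;                        // each iteration: a dynamic branch
}

int main() {
    Circle c(1.0);
    Square q(2.0);
    const Shape* shapes[] = { &c, &q };
    std::printf("%f\n", totalArea(shapes, 2));
    return 0;
}
```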

Now let us see how trace selection helps. Block A is a start-of-trace candidate, because when block E is interpreted, its return branch goes backwards in the address space to block A. Suppose block A becomes hot. Dynamo will now enter trace JIT mode (the white boxes labeled G-J in the interpreter flow diagram shown earlier), and emit the very next sequence of interpreted instructions into the Fragment Cache. Suppose this was the trace ACDGHJE. This trace is emitted into the Fragment Cache memory:

[Figure: trace ACDGHJE laid out contiguously in the Fragment Cache]

When emitting the code for block D, Dynamo will notice that the call branch is redundant, because its target is the immediately following instruction in the trace. Therefore this call branch can be eliminated. This is a critical optimization. In practice, that call branch is very likely to be a dynamically computed branch, which would have incurred a performance penalty. But the very act of trace selection eliminated it, so this trace, now free of that dynamic branch, will very likely execute faster on the CPU than the original sequence of blocks in the input program code.
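
A minimal sketch of that elision step, under the simplifying assumption that a selected trace is just an ordered list of blocks, each recording the taken target of its terminating branch; the data structures and emission steps are hypothetical:

```cpp
// Illustrative trace emission that drops branches made redundant by trace
// layout: if a block's taken branch target is simply the next block on the
// trace, no branch needs to be emitted at all.
#include <cstddef>
#include <cstdint>
#include <vector>

using Addr = std::uint64_t;

struct TraceBlock {
    Addr start;         // address of the block's first instruction
    Addr takenTarget;   // taken target of the block's terminating branch
};

void emitTrace(const std::vector<TraceBlock>& trace) {
    for (std::size_t i = 0; i < trace.size(); ++i) {
        // ... emit the straight-line body of trace[i] into cache memory ...
        bool nextBlockIsTarget =
            i + 1 < trace.size() && trace[i].takenTarget == trace[i + 1].start;
        if (nextBlockIsTarget) {
            // The branch (e.g. the call at the end of block D, whose target
            // G follows it on the trace) is redundant here: elide it.
            continue;
        }
        // Otherwise emit an exit stub that returns control to the
        // interpreter (or jumps to another trace, once linked).
    }
}
```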

What happens if actual execution now flips to a different hot trace, say ABDGIJE? To identify this condition, Dynamo also treats trace exit points as start-of-trace candidates, and maintains counters to determine if they are getting hot. So, in this example, block B would start to get hot, and this triggers the selection of a new trace starting at block B: BDGIJE. Once this trace is emitted into the Fragment Cache, Dynamo patches the trace exit branch at the end of block A to go to the top of this new trace, and the Fragment Cache now contains two traces:

[Figure: Fragment Cache containing two linked traces, ACDGHJE and BDGIJE, with the exit at the end of block A patched to the top of the second trace]
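
A hedged sketch of that linking step: when a new trace is installed, existing traces whose exits target the new trace’s start address have those exits patched to jump directly to it instead of back to the interpreter. The `FragmentCache` structure and `patchExit` helper below are invented for illustration:

```cpp
// Illustrative fragment cache with trace linking. A real system rewrites the
// emitted native exit stubs in place; here that step is only indicated.
#include <cstdint>
#include <unordered_map>
#include <vector>

using Addr = std::uint64_t;

struct Fragment {
    Addr entry;                     // original program address of the trace head
    std::vector<Addr> exitTargets;  // original-program targets of its exit stubs
};

class FragmentCache {
public:
    void install(Addr entry, Fragment frag) {
        // Link existing traces to the new one: any exit stub that targets
        // `entry` can now jump straight to the new trace (e.g. the exit at
        // the end of block A is patched to the top of trace BDGIJE).
        for (auto& kv : fragments_)
            for (Addr target : kv.second.exitTargets)
                if (target == entry)
                    patchExit(kv.second, target, entry);
        fragments_[entry] = std::move(frag);
    }

private:
    void patchExit(Fragment& /*from*/, Addr /*oldTarget*/, Addr /*newEntry*/) {
        // Hypothetical: rewrite the exit stub's branch in cache memory so it
        // lands on the newly installed trace instead of the interpreter.
    }

    std::unordered_map<Addr, Fragment> fragments_;
};
```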

You can start to sense one problem with this approach. The Fragment Cache could start to fill up pretty quickly, and once a hot trace becomes cold, there is no easy way to evict it from the cache. Because traces are linked together, evicting a subset of the traces could incur a substantial overhead cost in fixing up the remaining ones. Then there is the problem of determining when a trace gets cold, which is non-trivial because, for performance reasons, Dynamo does not instrument the native code it generates into the Fragment Cache. For a JIT like Dalvik’s that runs on a memory-constrained mobile device, managing the Fragment Cache memory can be a nasty problem.

Circa 1998, when Dynamo was developed, desktops and laptops had about as much memory as today’s mobile devices. Also, the HP printer division at the time was contemplating the use of Java (then a language still in its infancy) in an embedded environment, where memory constraints created many design challenges. So, early on in its development, we had to worry about Dynamo’s memory footprint.

We developed a very simple, yet surprisingly effective heuristic to manage the Fragment Cache memory. Whenever Dynamo detected a sharp increase in the trace creation rate, it would simply flush the entire Fragment Cache, deleting all traces in it. This works because such spikes in the trace creation rate are usually due to the formation of a new working set in the Fragment Cache. And because time during trace creation is predominantly being spent in the interpreter anyway, such a flush is essentially “free”. The concept is illustrated below:

[Figure: preemptive flush of the Fragment Cache triggered by a spike in the trace creation rate]
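
A minimal sketch of that policy; the measurement interval, the moving average, and the spike threshold below are invented for this illustration:

```cpp
// Illustrative preemptive-flush policy: if the number of traces created in
// the latest interval spikes well above its recent average, assume a new
// working set is forming and flush the entire fragment cache.
class FlushPolicy {
public:
    // Called once per fixed interval of interpreted work, with the number
    // of traces created during that interval. Returns true if the whole
    // fragment cache should be flushed.
    bool shouldFlush(unsigned tracesThisInterval) {
        bool spike = average_ > 1.0 &&
                     tracesThisInterval > kSpikeFactor * average_;
        // simple exponential moving average of the trace creation rate
        average_ = 0.9 * average_ + 0.1 * tracesThisInterval;
        return spike;
    }

private:
    static constexpr double kSpikeFactor = 4.0;   // invented threshold
    double average_ = 0.0;
};
```

When the policy fires, every trace in the Fragment Cache is deleted and execution falls back to the interpreter until new hot traces form, which is why the flush is essentially free during a working-set change.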

And there you have it: a native instruction interpreter that can often speed up a native program binary, even one produced by an optimizing compiler!

On a memory-constrained mobile device, interpreted programs offer a number of benefits over natively compiled programs. Bytecode for a stack machine like the JVM can be more compact than compiled code for a processor like the x86. Many compiler optimizations also expand the compiled code, due to operations like code duplication, loop unrolling, etc. If only 20% of the code accounts for 80% of the execution, compiling and optimizing all of the code for the mobile device’s processor may not be such a good idea. An interpreter with a trace-based JIT could be a better choice, especially if it can deliver performance comparable to compiled native code.

References

  1. Dynamo: A Transparent Dynamic Optimization System. Vasanth Bala, Evelyn Duesterwald, Sanjeev Banerjia. ACM Conference on Programming Language Design and Implementation (PLDI), Vancouver, 2000.
  2. A JIT Compiler for Android’s Dalvik VM. Ben Cheng, Bill Buzbee. Google I/O, 2010.