Hardware/Software Co-designed Run-times for Managed and Dynamic Languages
Supported by NSF CAREER Award CCF 03-47260 and a gift from Intel.

  Not only are programmers increasingly shifting to managed languages like Java and C# for productivity reasons, but the advent of multicore sets the scene for these languages to outperform traditionally compiled languages like C/C++. In the past, languages like C/C++ have achieved efficiency by performing as much work as possible at compile time and by providing a programming interface very close to the hardware, with features like explicit memory management. In contrast, managed languages have historically incurred overheads for just-in-time compilation and automatic garbage collection (GC), which programmers tolerated because of the other advantages these languages offered.

As we shift to multicore machines, however, these tasks can be performed concurrently with execution, leaving application threads largely unencumbered by this work. In this context, delaying compilation and data layout until run time becomes an advantage, as decisions can be made (and freely changed) when application behavior and the machine's capabilities are known. Furthermore, the higher-level machine model targeted by these languages permits more flexibility in how language constructs are implemented. In this way, these languages permit a tighter coupling between the run-time and the hardware, enabling implementation-specific features to be exposed and more aggressive optimizations to be performed.

Using Hardware Atomicity to Facilitate Reliable Compiler Optimization
  This work is concerned with hardware support to make microprocessors better compiler targets, making software optimizations more effective, safer, and more reliable. Given that programs tend to spend much of their time executing a small fraction of the possible paths through the program, there is significant opportunity to improve performance and reduce power consumption by optimizing the code along these hot paths. Traditional approaches to speculative compiler optimizations (i.e., those that improve one path at the expense of other, colder paths) are significantly more complex than their non-speculative counterparts, because they must be path specific and must generate compensation code for exits from the hot path(s). We demonstrate that the benefit of these optimizations can be achieved, or even exceeded, without this complexity by providing hardware primitives for atomic execution and software-initiated rollback. These primitives give the compiler the ability to isolate the program's hot path for optimization, allowing the use of non-speculative formulations of optimization passes. When a speculation invariant does not hold, the checkpoint is restored and control is transferred to a non-speculative version of the code, relieving the compiler of the responsibility of generating compensation code.
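The checkpoint/rollback pattern can be sketched in software. In the following sketch (all names hypothetical, and the hardware checkpoint modeled by an exception rather than a real atomic region), the hot-path version is compiled under a speculation invariant, and a failed check transfers control to the unoptimized version in place of compiler-generated compensation code:

```java
// Sketch only: models hardware atomicity/rollback with a Java exception.
public class SpeculativeOptimization {
    static class SpeculationFailure extends RuntimeException {}

    // Hot-path version: optimized assuming divisor != 0 (the speculation
    // invariant). One up-front check replaces per-exit compensation code.
    static int hotPath(int[] data, int divisor) {
        if (divisor == 0) throw new SpeculationFailure(); // "rollback" trigger
        int sum = 0;
        for (int d : data) sum += d / divisor;            // freely optimized
        return sum;
    }

    // Non-speculative version used after a rollback; handles all cases.
    static int coldPath(int[] data, int divisor) {
        int sum = 0;
        for (int d : data) sum += (divisor == 0) ? 0 : d / divisor;
        return sum;
    }

    static int run(int[] data, int divisor) {
        try {
            return hotPath(data, divisor);   // would be an atomic region
        } catch (SpeculationFailure e) {
            return coldPath(data, divisor);  // checkpoint restored, slow path
        }
    }
}
```

With real hardware support, the rollback could occur anywhere inside the region; the up-front check here is a simplification.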

Hardware Atomicity for Reliable Software Speculation Naveen Neelakantam, Ravi Rajwar, Suresh Srinivas, Uma Srinivasan, and Craig Zilles (ISCA 2007)
This work was selected for IEEE Micro's Top Picks

We are also exploring how this work could improve the performance of dynamic languages like Python, Ruby, and Perl without requiring significant changes to their run-times. In initial work, we explore how a JVM could expose atomic execution through try/catch blocks that use special exceptions, to implement "fast-path" versions of the otherwise complex series of operations needed to implement dynamic language primitives.
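As a sketch of this idea (names and types hypothetical, not taken from any particular run-time), a dynamic language's "+" operator might have a fast path that speculates both operands are integers, with a special exception standing in for the hardware rollback to the generic implementation:

```java
// Sketch only: a JVM-level fast path for a dynamic language's "+".
// The try block would map to an atomic region; the special exception
// would trigger a hardware rollback to the generic (slow) path.
public class DynamicAdd {
    static class FastPathFailed extends RuntimeException {}

    // Fast path: speculate that both operands are boxed integers.
    static Object fastAdd(Object a, Object b) {
        if (!(a instanceof Integer) || !(b instanceof Integer))
            throw new FastPathFailed();
        return (Integer) a + (Integer) b;
    }

    // Generic path: full dynamic dispatch over the operand types.
    static Object genericAdd(Object a, Object b) {
        if (a instanceof Number && b instanceof Number)
            return ((Number) a).doubleValue() + ((Number) b).doubleValue();
        return String.valueOf(a) + String.valueOf(b);  // string concatenation
    }

    static Object add(Object a, Object b) {
        try {
            return fastAdd(a, b);      // speculative atomic region
        } catch (FastPathFailed e) {
            return genericAdd(a, b);   // rollback + generic dispatch
        }
    }
}
```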

Transactional Runtime Extensions for Dynamic Language Performance Nicholas Riley and Craig Zilles (EPHAM 2008)

Memory System Optimizations
  Java is frequently used in applications that manipulate textual data; as a result, character arrays represent a sizable fraction of all heap memory allocated (30-40% on average) across a broad set of applications. Because Java allocates 2 bytes per character even in locales dominated by ASCII text, there is an opportunity for a nearly 50% reduction in the memory allocated for these arrays. In our Accordion Array work, we propose a JVM technique that makes character arrays polymorphic (either compressed or uncompressed), achieving a roughly 15% reduction in heap memory footprint as well as a reduction in garbage collection time.
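A minimal sketch of the idea (illustrative only, not the paper's JVM implementation) keeps one byte per character for as long as every character fits in 8 bits, and inflates to a full char array on the first wide store:

```java
// Sketch only: a polymorphic (compressed-or-uncompressed) character array.
public class AccordionCharArray {
    private byte[] narrow;   // one byte per character while possible
    private char[] wide;     // null until a wide character is stored

    AccordionCharArray(int length) { narrow = new byte[length]; }

    void set(int i, char c) {
        if (wide != null) { wide[i] = c; return; }
        if (c < 256) { narrow[i] = (byte) c; return; }
        // Inflate: copy the compressed bytes into a full char array.
        wide = new char[narrow.length];
        for (int j = 0; j < narrow.length; j++)
            wide[j] = (char) (narrow[j] & 0xFF);
        narrow = null;
        wide[i] = c;
    }

    char get(int i) {
        return wide != null ? wide[i] : (char) (narrow[i] & 0xFF);
    }

    boolean isCompressed() { return wide == null; }
}
```

A real JVM would make the representation choice invisible to application code; here the two forms are explicit to show the space trade-off.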

Accordion Arrays: Selective Compression of Unicode Arrays in Java Craig Zilles (ISMM 2007)

Efficient Profiling
  We are exploring techniques to characterize program behavior. We are particularly interested in low-overhead techniques that can be employed in a dynamic optimizer.

To date, the only commercially successful profiling hardware has been performance counters, which are very low level. Because software developers often use software instrumentation to complement the information collected with performance counters, we developed branch-on-random to reduce the cost of instrumentation. In essence, branch-on-random is a very cheap hardware mechanism for sampling: software overhead is paid only when a sample is collected. The return on investment for branch-on-random is sufficiently compelling that two hardware vendors that care about Java performance are considering implementations for future processors.
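A software analog of the mechanism (hypothetical names; in hardware the randomly-taken branch would cost almost nothing, whereas here it is simulated with a random number generator) looks like:

```java
import java.util.Random;

// Sketch only: software analog of branch-on-random. At each
// instrumentation point, a cheap "randomly taken" branch decides whether
// to sample, so the expensive profiling code runs on only a small random
// fraction of executions.
public class BranchOnRandom {
    static final Random RNG = new Random(42);
    static final double SAMPLE_RATE = 1.0 / 1024;  // ~1 sample per 1024 events
    static long samples = 0;

    // In hardware this branch is nearly free; here it is simulated.
    static boolean branchOnRandom() {
        return RNG.nextDouble() < SAMPLE_RATE;
    }

    static void instrumentationPoint() {
        if (branchOnRandom()) {
            collectSample();   // expensive work, paid only when sampling
        }
    }

    static void collectSample() { samples++; }
}
```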

Branch-on-Random Edward S. Lee and Craig Zilles (CGO-2)

Another key idea we are exploring is profile-guided profiling, where a cheap-to-collect form of profile information is used to guide the collection of a more expensive-to-collect form. Our paper on Targeted Path Profiling applies this concept to a software-only implementation of acyclic path profiling to achieve a 50% reduction in overhead.
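One way to sketch the idea (hypothetical API, not the paper's algorithm): use cheap edge counts to decide where expensive path instrumentation is worthwhile. A routine whose branches are all strongly biased carries little path information beyond what the edge profile already reveals, so it can skip path instrumentation:

```java
import java.util.Map;

// Sketch only: profile-guided profiling. Cheap edge counts select which
// routines merit the more expensive path profiling.
public class TargetedProfiling {
    // A branch is "biased" if it goes one way nearly all the time.
    static boolean branchIsBiased(long taken, long total, double threshold) {
        double p = (double) taken / total;
        return p >= threshold || p <= 1.0 - threshold;
    }

    // Instrument a routine for (expensive) path profiling only if it
    // contains an unbiased branch, i.e., the edge profile alone cannot
    // predict which paths are hot.
    // edgeCounts maps branch name -> {takenCount, totalCount}.
    static boolean needsPathProfiling(Map<String, long[]> edgeCounts) {
        for (long[] c : edgeCounts.values())
            if (!branchIsBiased(c[0], c[1], 0.999)) return true;
        return false;
    }
}
```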

Targeted Path Profiling: Lower Overhead Path Profiling for Dynamic Optimization Systems, Rahul Joshi, Michael Bond and Craig Zilles (CGO-2)

Finally, we have shown how we can collect low-level profile information without specialized hardware support. We call the technique "Random Trace Construction." By collecting very basic profile information -- edge profiles and per-static instruction cache miss and branch misprediction profiles -- we can reconstruct rather accurate traces from the program's execution on which we can do detailed analysis. Particularly noteworthy is that profiling of memory dependences is not required because these can be reconstructed using a form of abstract interpretation. We show that this technique is almost as effective at constructing criticality information (used by a distributed microarchitecture) as dedicated critical path profiling hardware.
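A greatly simplified sketch of trace construction from an edge profile alone (hypothetical code, omitting the per-instruction miss/misprediction profiles and memory-dependence reconstruction): starting from a block, repeatedly choose a successor with probability proportional to its observed edge count, so that long walks concentrate on the hot paths:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch only: random-walk trace construction over a CFG edge profile.
public class RandomTrace {
    // edges.get(block) maps each successor block to its edge count.
    static List<String> constructTrace(Map<String, Map<String, Long>> edges,
                                       String start, int length, Random rng) {
        List<String> trace = new ArrayList<>();
        String cur = start;
        while (trace.size() < length && edges.containsKey(cur)) {
            trace.add(cur);
            Map<String, Long> succ = edges.get(cur);
            long total = succ.values().stream().mapToLong(Long::longValue).sum();
            // Pick a successor with probability proportional to its count.
            long pick = (long) (rng.nextDouble() * total);
            for (Map.Entry<String, Long> e : succ.entrySet()) {
                pick -= e.getValue();
                if (pick < 0) { cur = e.getKey(); break; }
            }
        }
        trace.add(cur);
        return trace;
    }
}
```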

Accurate Critical Path Analysis via Random Trace Construction Pierre Salverda, Charles Tucker and Craig Zilles (CGO 2008)