|
Hardware/Software Co-designed Run-times for Managed and Dynamic Languages
Supported by NSF CAREER Award CCF 03-47260 and a gift from Intel.
|
|
Not only are programmers increasingly shifting to managed languages
like Java and C# for productivity reasons, but the advent of multicore
sets the scene for these languages to outperform traditionally
compiled languages like C/C++. In the past, languages like C/C++ have
achieved efficiency by performing as much work as possible at compile
time and by providing a programming interface very close to the
hardware with features like explicit memory management. In contrast,
managed languages have historically incurred overheads for
just-in-time compilation and automatic garbage collection (GC), which
programmers tolerated because of the other advantages that these
languages offered.
As we shift to multicore machines, however, these tasks can be
performed concurrently with execution, leaving application threads
largely unencumbered by this work. In this context, delaying
compilation and data layout until run time becomes an advantage, as
decisions can be made (and freely changed) when application behavior
and the machine's capabilities are known. Furthermore, the
higher-level machine model targeted by these languages permits more
flexibility in how the language constructs are implemented. In this
way, these languages permit a tighter coupling between the run-time
and the hardware, enabling implementation-specific features to be
exposed and more aggressive optimizations.
|
|
Using Hardware Atomicity to Facilitate Reliable Compiler Optimization
|
This work is concerned with hardware support to make
microprocessors better compiler targets, making software
optimizations more effective, safer, and more reliable.
Given that programs tend to spend much of their time
executing a small fraction of the possible paths through the
program, there is significant opportunity to improve performance
and reduce power consumption by optimizing the code along these
hot paths. Traditional approaches to speculative compiler
optimizations (i.e., those that improve one path at the expense of
other colder paths) are significantly more complex than their
non-speculative counterparts, because they must be designed to be
path-specific and to generate compensation code for exits from the
hot path(s). We demonstrate that the benefit of
these optimizations can be achieved, or even exceeded, without
this complexity by providing hardware primitives for atomic execution
and software-initiated rollback. These primitives provide to the
compiler the ability to isolate the program's hot path for
optimization, allowing the use of non-speculative formulations of
optimization passes. When a speculation invariant does not hold,
the checkpoint is restored and control is transferred to a
non-speculative version of the code, relieving the compiler from
the responsibility of generating compensation code.
Hardware Atomicity for Reliable Software Speculation
Naveen Neelakantam, Ravi Rajwar, Suresh Srinivas, Uma Srinivasan, and Craig Zilles (ISCA 2007)
This work was selected for IEEE Micro's Top Picks
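As a software analogue of this structure (not the hardware mechanism itself), the pattern can be sketched in Java: a speculative version of a routine assumes every element is an Integer, and a violated assumption abandons the attempt and re-executes a non-speculative version. All names here are illustrative, and the exception-based fallback merely stands in for the hardware checkpoint/abort primitives.

```java
public class SpeculativeSum {
    // Generic, non-speculative version: handles any Number via a virtual call.
    static long sumGeneric(Object[] xs) {
        long sum = 0;
        for (Object x : xs) sum += ((Number) x).longValue();
        return sum;
    }

    // Speculative version: assumes every element is an Integer, replacing
    // the virtual longValue() call with a direct unboxing cast.
    static long sumSpeculative(Object[] xs) {
        long sum = 0;
        for (Object x : xs) sum += (Integer) x;  // throws on a non-Integer
        return sum;
    }

    // All speculative state lives in locals, so discarding the partial sum
    // and re-running the generic version plays the role of a rollback.
    static long sum(Object[] xs) {
        try {
            return sumSpeculative(xs);   // hot path
        } catch (ClassCastException e) {
            return sumGeneric(xs);       // fallback: no compensation code needed
        }
    }

    public static void main(String[] args) {
        System.out.println(sum(new Object[]{1, 2, 3}));  // all-Integer hot path
        System.out.println(sum(new Object[]{1, 2.5}));   // falls back to generic
    }
}
```

Because the speculative attempt writes no state that outlives it, the fallback simply starts over; in the hardware scheme, atomic execution provides this isolation for arbitrary code.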
We are also exploring how this work could improve the performance of
dynamic languages like Python, Ruby, and Perl without significantly
impacting their run-times. Our initial work explores how a JVM could
expose atomic execution through try/catch blocks that use special
exceptions to implement "fast-path" versions of the otherwise complex
series of operations necessary to implement dynamic language
primitives.
Transactional Runtime Extensions
for Dynamic Language Performance
Nicholas Riley and Craig Zilles (EPHAM 2008)
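A minimal sketch of that fast-path pattern for a dynamic-language "add" primitive, with an ordinary (hypothetical) marker exception standing in for the special exception that the JVM would map to a transaction abort:

```java
public class DynamicAdd {
    // Hypothetical marker exception; in the proposed scheme the JVM would
    // map an aborted hardware transaction onto a catch of this type.
    static class Abort extends RuntimeException {}

    // Fast path: speculate that both operands are boxed ints and that
    // the addition does not overflow.
    static Object addFast(Object a, Object b) {
        if (!(a instanceof Integer) || !(b instanceof Integer)) throw new Abort();
        try {
            return Math.addExact((Integer) a, (Integer) b);
        } catch (ArithmeticException overflow) {
            throw new Abort();
        }
    }

    // Slow path: the full dynamic-language semantics (sketched, not complete).
    static Object addGeneric(Object a, Object b) {
        if (a instanceof Integer && b instanceof Integer)
            return java.math.BigInteger.valueOf((Integer) a)
                    .add(java.math.BigInteger.valueOf((Integer) b));
        if (a instanceof String && b instanceof String)
            return (String) a + (String) b;    // e.g., string concatenation
        throw new IllegalArgumentException("unsupported operand types");
    }

    static Object add(Object a, Object b) {
        try { return addFast(a, b); }          // transactional fast path
        catch (Abort e) { return addGeneric(a, b); }
    }

    public static void main(String[] args) {
        System.out.println(add(1, 2));                    // fast path
        System.out.println(add("foo", "bar"));            // slow path
        System.out.println(add(Integer.MAX_VALUE, 1));    // overflow: slow path
    }
}
```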
|
|
Memory System Optimizations
|
Java is frequently used in applications involving the manipulation of
textual data; as a result, character arrays represent a sizable
fraction of all heap memory allocated (30-40% on average) across a
broad set of applications. Because Unicode allocates 2 bytes per
character even in locales dominated by ASCII text, there is an
opportunity to achieve a nearly 50% reduction in the memory
allocated for these arrays. In our Accordion Array work, we propose a
JVM technique that makes character arrays polymorphic (either
compressed or uncompressed) and achieves a roughly 15% reduction in
heap memory footprint as well as a reduction in garbage collection
time.
Accordion Arrays: Selective Compression of Unicode Arrays in Java
Craig Zilles (ISMM 2007)
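The core representation choice can be sketched as follows; the class and field names are illustrative, not from the paper:

```java
// Sketch of a polymorphic character array: a byte[] body when every
// character fits in one byte, a char[] body otherwise.
public class CompressibleChars {
    private byte[] compressed;   // used when all chars <= 0xFF
    private char[] expanded;     // used otherwise

    CompressibleChars(char[] src) {
        boolean fits = true;
        for (char c : src) if (c > 0xFF) { fits = false; break; }
        if (fits) {
            compressed = new byte[src.length];
            for (int i = 0; i < src.length; i++) compressed[i] = (byte) src[i];
        } else {
            expanded = src.clone();
        }
    }

    char charAt(int i) {
        return compressed != null ? (char) (compressed[i] & 0xFF) : expanded[i];
    }

    boolean isCompressed() { return compressed != null; }

    int footprintBytes() {   // rough payload size, ignoring object headers
        return compressed != null ? compressed.length : 2 * expanded.length;
    }

    public static void main(String[] args) {
        CompressibleChars ascii = new CompressibleChars("hello".toCharArray());
        System.out.println(ascii.isCompressed() + " " + ascii.footprintBytes());
        CompressibleChars greek = new CompressibleChars(new char[]{'\u03C0'});
        System.out.println(greek.isCompressed() + " " + greek.footprintBytes());
    }
}
```

In a real JVM this dispatch would be handled by the compiled array-access paths rather than a wrapper object, so the per-access cost can be kept small.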
|
Efficient Profiling
|
We are exploring techniques to characterize program
behavior. We are particularly interested in low-overhead techniques
that can be employed in a dynamic optimizer.
To date, the only commercially successful profiling hardware has been
performance counters, which are very low level. As software
developers often use software instrumentation to complement the
information collected with performance counters, we developed
branch-on-random to reduce the cost of instrumentation. In essence,
branch-on-random is a very cheap hardware mechanism for sampling, so
software overhead is paid only when a sample is collected. The
return on investment for branch-on-random is sufficiently compelling
that two hardware vendors that care about Java performance are
considering implementations for future processors.
Branch-on-Random
Edward S. Lee and Craig Zilles (CGO-2)
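A software-only analogue of the mechanism is shown below; the hardware version would make the sampling branch itself nearly free, whereas here a seeded random check stands in for it. Names and the sampling period are illustrative.

```java
import java.util.Random;

// Software analogue of branch-on-random: a cheap, rarely-taken branch
// gates the expensive instrumentation, so its cost is paid only on the
// sampled fraction of events.
public class SampledProfiler {
    final Random rng;
    final int period;          // sample ~1 out of 'period' events
    long events, samples;

    SampledProfiler(int period, long seed) {
        this.period = period;
        this.rng = new Random(seed);
    }

    void onEvent() {
        events++;
        if (rng.nextInt(period) == 0) {   // the "branch-on-random"
            samples++;                    // expensive profiling would go here
        }
    }

    public static void main(String[] args) {
        SampledProfiler p = new SampledProfiler(100, 42L);
        for (int i = 0; i < 100_000; i++) p.onEvent();
        System.out.println(p.samples + " samples out of " + p.events + " events");
    }
}
```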
Another key idea that we are
exploring is the concept of profile-guided profiling, where a
cheap-to-collect type of information is used to guide the collection
of a more expensive-to-collect form of profile information. Our paper
on Targeted Path Profiling applies this concept to a software-only
implementation of acyclic path profiling to achieve a 50% reduction of
overhead.
Targeted Path Profiling: Lower
Overhead Path Profiling for Dynamic Optimization Systems,
Rahul Joshi, Michael Bond and Craig Zilles (CGO-2)
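The idea can be sketched as a simple filter: a cheap edge profile decides which branches deserve the heavier path-profiling instrumentation, since a branch that almost always goes one way contributes no path information. The 0.999 bias threshold and all names are illustrative, not taken from the paper.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of profile-guided profiling: cheap edge counts select the
// targets of expensive path-profiling instrumentation.
public class TargetedInstrumentation {
    static boolean needsPathInstrumentation(long taken, long notTaken) {
        long total = taken + notTaken;
        if (total == 0) return false;        // never executed: skip entirely
        double bias = Math.max(taken, notTaken) / (double) total;
        return bias < 0.999;                 // near-deterministic branch: skip
    }

    public static void main(String[] args) {
        // branch id -> {takenCount, notTakenCount} from a cheap edge profile
        Map<String, long[]> edgeProfile = new HashMap<>();
        edgeProfile.put("b1", new long[]{5000, 4000}); // genuinely two-way
        edgeProfile.put("b2", new long[]{9999, 0});    // effectively one-way
        edgeProfile.put("b3", new long[]{0, 0});       // cold
        for (Map.Entry<String, long[]> e : edgeProfile.entrySet()) {
            System.out.println(e.getKey() + " instrument=" +
                needsPathInstrumentation(e.getValue()[0], e.getValue()[1]));
        }
    }
}
```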
Finally, we have shown how we can collect low-level profile
information without specialized hardware support. We call the
technique "Random Trace Construction." By collecting very basic
profile information -- edge profiles and per-static instruction cache
miss and branch misprediction profiles -- we can reconstruct rather
accurate traces from the program's execution on which we can do
detailed analysis. Particularly noteworthy is that profiling of
memory dependences is not required because these can be reconstructed
using a form of abstract interpretation. We show that this technique
is almost as effective at constructing criticality information (used
by a distributed microarchitecture)
as dedicated critical path profiling hardware.
Accurate Critical Path Analysis via Random Trace Construction
Pierre Salverda, Charles Tucker and Craig Zilles (CGO 2008)
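A toy version of the core step synthesizes a plausible trace by a weighted random walk over an edge profile; the graph and names are illustrative, and the real system additionally folds in per-instruction cache-miss and misprediction profiles and reconstructs memory dependences.

```java
import java.util.*;

// Sketch of the trace-construction idea: given only per-block successor
// frequencies (an edge profile), generate a statistically plausible trace
// by walking the control-flow graph, picking successors with probability
// proportional to their observed edge counts.
public class RandomTrace {
    // successors.get(b) maps each successor of block b to its edge count.
    static List<String> buildTrace(Map<String, Map<String, Long>> successors,
                                   String entry, int maxLen, Random rng) {
        List<String> trace = new ArrayList<>();
        String cur = entry;
        while (trace.size() < maxLen && successors.containsKey(cur)) {
            trace.add(cur);
            Map<String, Long> outs = successors.get(cur);
            long total = outs.values().stream().mapToLong(Long::longValue).sum();
            long pick = (long) (rng.nextDouble() * total);  // weighted choice
            String next = null;
            for (Map.Entry<String, Long> e : outs.entrySet()) {
                pick -= e.getValue();
                if (pick < 0) { next = e.getKey(); break; }
            }
            cur = next;
        }
        trace.add(cur);   // terminal block (no successors) or length cap hit
        return trace;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Long>> cfg = new HashMap<>();
        cfg.put("entry", Map.of("A", 9L, "B", 1L));  // A taken ~90% of the time
        cfg.put("A", Map.of("exit", 10L));
        cfg.put("B", Map.of("exit", 10L));
        System.out.println(buildTrace(cfg, "entry", 100, new Random(7)));
    }
}
```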
|
|
|