Efficient Microarchitecture

Efficient Microarchitecture supported by NSF Awards CCF-0311340 and CCF-0429561 and a gift from Intel.

The previous status quo of building increasingly inefficient processors for incremental gains of single-thread performance has become untenable. Processor efficiency has become paramount, both because in light of the energy crisis and because any area or power inefficiency in core design will lower the overall throughput of a chip via a reduction in core count or frequency.
In addition to exploring hardware support that enables improved code quality, which in turn improves overall system efficiency, we have explored techniques to improve the efficiency at the microarchitectural level.

Distributed Microarchitectures

Given that conventional micro-architectures scale poorly (e.g., super-linearly with superscalar width), researchers have explored distributed microarchitectures whose complexity scales closer to linear with superscalar width. In the modern context, such microarchitectures are interesting from two perspectives: 1) for large cores (for sequential parts of workloads) as part of a heterogeneous many-core, and 2) for the on-demand fusion of small cores into larger cores in a homogeneous many-core.
While there is a significant amount of research into such distributed microarchitectures, previous work had failed to match the IPC (instructions per cycle) of a monolithic microarchitecture. In spite of all of this work, we found that a number of fundamental questions remained un-answered in the literature:

How close can a distributed microarchitecture come to matching the IPC of a monolithic microarchitecture?
What symptoms are responsible for existing heuristics underperforming?
What policies enable achieving monolithic IPCs on distributed microarchitectures?
In our 2005 Micro paper, we answer these questions by: 1) devising a limit study to show that comparable IPCs are possible, 2) demonstrating by criticality analysis that a failure to properly steer critical instructions is at fault, and 3) proposing three policies: stall vs. steer (stalling dispatch instead of introducing a forwarding penalty for critical dataflow), likelihood of criticality (a fine-grain, yet practical means for determining the relative criticality of two instructions), and preemptive load balancing (pushing away dependent, but non-critical instructions from resources to make room for future critical instructions). We demonstrate that these policies enable achieving IPCs within a few percentage points of a monolithic machine of comparable width, even for a clustered machine consisting of 8 1-wide clusters.
A Criticality Analysis of Clustering in Superscalar Processors Pierre Salverda and Craig Zilles (Micro 2005)
In light of the diversity of workloads (in particularly with regards to available thread parallelism), a holy grail of computer architecture is a substrate that can efficiently (and continuously) shift between a large collection of small processors and a smaller set of large processors. In relation to research on "fusing" together small cores, we noticed that while researchers considered out-of-order cores, industry was building many-core machines using in-order cores. This lead us to try to answer the question:

To what extent can we aggregate or fuse a collection of in-order cores (whose executions can slip with respect to each other) and approach the performance of a monolithic out-of-order core comparable to modern superscalars?"
We find, as we discuss in our 2008 HPCA paper, that there are fundamental challanges to using a collection of truly in-order cores to achieve an IPC comparable to a conventional monolithic superscalar. In particular, most single-threaded programs consist of more independent chains of dataflow than cores can likely be efficiently aggregated, necessitating the interleaving of independent dataflow chains among a given in-order core. We further show that this correctly performing this interleaving (that is achieving a schedule comparable to an OOO machine) requires both more hardware complexity than OOO scheduling itself and requires predicting which loads will miss in the cache with unrealistic accuracy. We do show that the addition of a small bit of out-of-order execution (e.g., assigning dataflow chains to thread contexts in a multi-threaded in-order that selects which thread to execute at the execute stage) can greatly mitigate this problem. Nevertheless, this work leads us to believe that a heterogeneous machine consisting of a few large cores and many small cores is a more compelling approach.
Fundamental Performance Challenges in Horizontal Fusion of In-Order Cores Pierre Salverda and Craig Zilles (HPCA 2008)
Since these results contradict the previously published paper ``Complexity-Effective Superscalar Processors'' (ISCA 1997), we published a thorough analysis explaining why that paper's simulation model and baseline parameters lead to a conclusion that does hold for modern machines.
Dependence-based Scheduling Revisited: A Tale of Two Baselines Pierre Salverda and Craig Zilles (WDDD 2007)

Probabilistically-Updating Counters

Hardware counters are commonplace in modern processors. We have developed a technique to improve the efficiency of arrays of hardware counters and to enable them to be used for new purposes: probabilistic updates. The key idea is that rather than always incrementing (decrementing) on positive (negative) feedback, a random number generator is consulted, and the update is only completed a specified fraction of the time. In this way, we can provide probabilistic hysteresis (setting the probability of update low) to retain information over a larger range of time and probabilistic bias (setting the increment and decrement probabilities differently), two capabilities that typically require enlarging the counter in question.
Furthermore, it enables the construction of probabilistic stratifiers; a stratifier is a counter whose current value corresponds to the frequency of 1's in a binary stream that it is monitoring (i.e., the counter encoding 5 might mean roughly 60% 1s, while the encoding 6 might mean roughly 75% 1s).
Using these techniques we demonstrate that many counters can be reduced in size by a factor of 2 or 3 with little, if any, impact on their prediction accuracy. Furthermore, we demonstrate the viability of a 1-bit branch confidence predcitor, which might enable branch confidence to be a commonplace inclusion in future processors.
Probabilistic Counter Updates for Predictor Hysteresis and Stratification Nicholas Riley and Craig Zilles (HPCA 2006)