Microarchitecture supported by NSF Awards CCF-0311340
and CCF-0429561 and a gift from Intel.
The previous status quo of building increasingly inefficient
processors for incremental gains of single-thread performance has
become untenable. Processor efficiency has become paramount, both
because in light of the energy crisis and because any area or power
inefficiency in core design will lower the overall throughput of a
chip via a reduction in core count or frequency.
In addition to exploring hardware support that
enables improved code quality, which in turn improves overall
system efficiency, we have explored techniques to improve the
efficiency at the microarchitectural level.
Given that conventional micro-architectures scale poorly (e.g.,
super-linearly with superscalar width), researchers have explored
distributed microarchitectures whose complexity scales closer to linear
with superscalar width. In the modern context, such microarchitectures are
interesting from two perspectives: 1) for large cores (for sequential parts
of workloads) as part of a heterogeneous many-core, and 2) for the on-demand
fusion of small cores into larger cores in a homogeneous many-core.
While there is a significant amount of research into such distributed
microarchitectures, previous work had failed to match the IPC
(instructions per cycle) of a monolithic microarchitecture. In spite of
all of this work, we found that a number of fundamental questions remained
un-answered in the literature:
In our 2005 Micro paper, we answer these questions by: 1) devising a
limit study to show that comparable IPCs are possible, 2)
demonstrating by criticality analysis that a failure to properly steer
critical instructions is at fault, and 3) proposing three policies:
stall vs. steer (stalling dispatch instead of introducing a forwarding
penalty for critical dataflow), likelihood of criticality (a
fine-grain, yet practical means for determining the relative
criticality of two instructions), and preemptive load balancing
(pushing away dependent, but non-critical instructions from resources
to make room for future critical instructions). We demonstrate that
these policies enable achieving IPCs within a few percentage points of
a monolithic machine of comparable width, even for a clustered machine
consisting of 8 1-wide clusters.
- How close can a distributed microarchitecture come to matching the IPC
of a monolithic microarchitecture?
- What symptoms are responsible for existing heuristics underperforming?
- What policies enable achieving monolithic IPCs on distributed microarchitectures?
A Criticality Analysis of Clustering in Superscalar Processors
Pierre Salverda and Craig Zilles (Micro 2005)
In light of the diversity of workloads (in particularly with regards
to available thread parallelism), a holy grail of computer
architecture is a substrate that can efficiently (and continuously)
shift between a large collection of small processors and a smaller set
of large processors. In relation to research on "fusing" together
small cores, we noticed that while researchers considered out-of-order
cores, industry was building many-core machines using in-order cores.
This lead us to try to answer the question:
We find, as we discuss in our 2008 HPCA paper, that there are
fundamental challanges to using a collection of truly in-order
cores to achieve an IPC comparable to a conventional monolithic
superscalar. In particular, most single-threaded programs consist of
more independent chains of dataflow than cores can likely be
efficiently aggregated, necessitating the interleaving of independent
dataflow chains among a given in-order core. We further show that
this correctly performing this interleaving (that is achieving a
schedule comparable to an OOO machine) requires both more hardware
complexity than OOO scheduling itself and requires predicting which
loads will miss in the cache with unrealistic accuracy. We do show
that the addition of a small bit of out-of-order execution
(e.g., assigning dataflow chains to thread contexts in a
multi-threaded in-order that selects which thread to execute at
the execute stage) can greatly mitigate this problem. Nevertheless,
this work leads us to believe that a heterogeneous machine consisting
of a few large cores and many small cores is a more compelling
To what extent can we aggregate or fuse a collection of
in-order cores (whose executions can slip with respect to each other)
and approach the performance of a monolithic out-of-order core
comparable to modern superscalars?"
Fundamental Performance Challenges in Horizontal Fusion of In-Order Cores Pierre Salverda and Craig Zilles (HPCA 2008)
Since these results contradict the previously published paper
``Complexity-Effective Superscalar Processors'' (ISCA 1997), we
published a thorough analysis explaining why that paper's simulation
model and baseline parameters lead to a conclusion that does hold for
Dependence-based Scheduling Revisited: A Tale of Two Baselines
Pierre Salverda and Craig Zilles (WDDD 2007)
Hardware counters are commonplace in modern processors. We have
developed a technique to improve the efficiency of arrays of hardware
counters and to enable them to be used for new purposes: probabilistic
updates. The key idea is that rather than always incrementing
(decrementing) on positive (negative) feedback, a random number
generator is consulted, and the update is only completed a specified
fraction of the time. In this way, we can provide probabilistic
hysteresis (setting the probability of update low) to retain
information over a larger range of time and probabilistic bias
(setting the increment and decrement probabilities differently), two
capabilities that typically require enlarging the counter in question.
Furthermore, it enables the construction of probabilistic stratifiers;
a stratifier is a counter whose current value corresponds to the
frequency of 1's in a binary stream that it is monitoring
(i.e., the counter encoding 5 might mean roughly 60% 1s, while
the encoding 6 might mean roughly 75% 1s).
Using these techniques we demonstrate that many counters can be reduced
in size by a factor of 2 or 3 with little, if any, impact on their prediction
accuracy. Furthermore, we demonstrate the viability of a 1-bit branch
confidence predcitor, which might enable branch confidence to be a commonplace
inclusion in future processors.
Probabilistic Counter Updates for Predictor Hysteresis and Stratification
Nicholas Riley and Craig Zilles (HPCA 2006)