CSC 520 CH 3-5

Correlating Branches

(2,2) predictor: the behavior of the last two branches selects between, say, four predictions of the next branch, and only the selected prediction is updated

data dependencies Importance

1) indicates the possibility of a hazard 2) determines order in which results must be calculated 3) sets an upper bound on how much parallelism can possibly be exploited

Parallel Processing Challenges

1. Limited application parallelism ⇒ addressed primarily via new algorithms that have better parallel performance 2. Long remote-access latency ⇒ addressed both by the architect and by the programmer • For example, reduce the frequency of remote accesses either by - Caching shared data (HW) - Restructuring the data layout to make more accesses local (SW) • Here the focus is on the hardware solution

2 Models for Communication and Memory Architecture

1. Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors 2. Communication occurs implicitly through a shared address space (via loads and stores): shared-memory multiprocessors, either • UMA (Uniform Memory Access time) for a shared address space with centralized memory • NUMA (Non-Uniform Memory Access time) for a shared address space with distributed memory - Also called Distributed Shared Memory (DSM)

Limits to Loop Unrolling (3)

1. Decrease in the amount of overhead amortized with each extra unrolling • Amdahl's Law 2. Growth in code size • For larger loops, the concern is that it increases the instruction cache miss rate 3. Register pressure: potential shortfall in registers created by aggressive unrolling and scheduling • If it is not possible to allocate all live values to registers, the code may lose some or all of the advantage of unrolling. Loop unrolling reduces the impact of branches on the pipeline; another approach is branch prediction

Cache Coherence Protocols 2 Classes

1. Directory based—the sharing status of a block of physical memory is kept in just one location, the directory • Scales well - common for distributed shared-memory systems • However, a directory is also used in the Sun T1 (Niagara), a symmetric shared-memory system 2. Snooping—every cache with a copy of the data also has a copy of the block's sharing status, but no centralized state is kept • All caches are accessible via some broadcast medium (a bus or switch) • All cache controllers monitor or snoop on the medium to determine whether or not they have a copy of a block that is requested on a bus or switch access

Three Stages of Tomasulo Algorithm

1. Issue—get instruction from the FP Op Queue. If a reservation station is free (no structural hazard), control issues the instruction & sends operands (renames registers). 2. Execution—operate on operands (EX). When both operands are ready, execute; if not ready, watch the Common Data Bus for the result. 3. Write result—finish execution (WB). Write on the Common Data Bus to all awaiting units; mark the reservation station available. • Normal data bus: data + destination ("go to" bus) • Common data bus: data + source ("come from" bus) - Does the broadcast - 64 bits of data + 4 bits of Functional Unit source address - An FU snags the value on the CDB if the source matches the FU it expects (the one producing the result)

Four Steps of Speculative Tomasulo Algorithm

1. Issue—get instruction from the FP Op Queue. If a reservation station and a reorder buffer slot are free, issue the instruction & send operands & use the reorder buffer number for the destination (this stage is sometimes called "dispatch"). 2. Execution—operate on operands (EX). When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute; this checks RAW (sometimes called "issue"). 3. Write result—finish execution (WB). Write on the Common Data Bus to all awaiting FUs & the reorder buffer; mark the reservation station available. 4. Commit—update the register with the reorder result. When the instruction is at the head of the reorder buffer & the result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer. Correctly predicted branches are finished (sometimes called "graduation").

Coherent Memory System

1. Preserve program order: a read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P 2. Coherent view of memory: a read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses 3. Write serialization: two writes to the same location by any two processors are seen in the same order by all processors - If not, a processor could keep the value 1 forever because it saw that write last - For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1

Exploit ILP

1. Rely on hardware to help discover and exploit the parallelism dynamically (desktop and server markets, such as the Pentium 4, AMD Opteron, Intel Core) 2. Rely on software technology to find parallelism statically at compile time (specialized and embedded markets, such as the Itanium 2, ARM Cortex-A8)

Reservation Station Components

Op—Operation to perform in the unit (e.g., + or -) on S1 and S2. Vj, Vk—Values of the source operands - Either V or Q is valid for each operand. Qj, Qk—Reservation stations producing the source registers (value to be written) - Note: no ready flags as in the Scoreboard; Qj, Qk = 0 => ready - Store buffers only have Qi, for the RS producing the result. Busy—Indicates the reservation station or FU is busy. Register result status—Indicates which functional unit (Qi) will write each register, if one exists. Blank when no pending instruction will write that register.
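A minimal C sketch of how these fields might be laid out; the struct names, field widths, and the station count are illustrative assumptions, not the IBM 360/91's actual organization.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_RS 6                 /* illustrative: e.g., 3 add + 2 mul/div + 1 spare  */

    typedef struct {
        bool    busy;                /* Busy: station (and its FU) is in use             */
        uint8_t op;                  /* Op: operation to perform on S1 and S2            */
        double  vj, vk;              /* Vj, Vk: operand values (valid when qj/qk == 0)   */
        uint8_t qj, qk;              /* Qj, Qk: number of the RS producing the operand;  */
                                     /* 0 means the value is already in vj/vk (ready)    */
    } ReservationStation;

    /* Register result status: which RS (Qi) will write each register;
       0 means no pending instruction will write that register. */
    static uint8_t register_status_qi[32];

    static ReservationStation rs[NUM_RS];

    /* An instruction may begin execution when both operands are present. */
    static bool rs_ready(const ReservationStation *r) {
        return r->busy && r->qj == 0 && r->qk == 0;
    }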

Simultaneous Multithreading (SMT)

A version of multithreading that lowers the cost of multithreading by using the resources needed for a multiple-issue, dynamically scheduled microarchitecture. • Simultaneous multithreading (SMT): the insight that a dynamically scheduled processor already has many HW mechanisms to support multithreading - A large set of virtual registers that can be used to hold the register sets of independent threads - Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads - Out-of-order completion allows the threads to execute out of order and get better utilization of the HW • Just add a per-thread renaming table and keep separate PCs - Independent commitment can be supported by logically keeping a separate reorder buffer for each thread

Tournament Predictors

Alpha 21264 tournament predictor: 4K 2-bit counters indexed by the local branch address. Chooses between: • Global predictor - 4K entries indexed by the history of the last 12 branches (2^12 = 4K) - Each entry is a standard 2-bit predictor • Local predictor - Local history table: 1024 10-bit entries recording the last 10 branches, indexed by branch address - The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters

Control hazards

Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

Snooping

Every cache with a copy of data also has a copy of sharing status of block, but no centralized state is kept • All caches are accessible via some broadcast medium (a bus or switch) • All cache controllers monitor or snoop on the medium to determine whether or not they have a copy of a block that is requested on a bus or switch access

Loop-Level Parallelism

Exploit loop-level parallelism by "unrolling" the loop, either 1. dynamically, via branch prediction, or 2. statically, via loop unrolling by the compiler • Determining instruction dependence is critical to loop-level parallelism • If two instructions are: - parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls (assuming no structural hazards) - dependent, they are not parallel and must be executed in order, although they may often be partially overlapped

Flynn's Taxonomy

Flynn classified parallelism by data and control streams in 1966: Single Instruction, Single Data (SISD) (uniprocessor); Single Instruction, Multiple Data (SIMD) (single PC: Vector, CM-5); Multiple Instruction, Single Data (MISD) (????); Multiple Instruction, Multiple Data (MIMD) (clusters, multiprocessors, multicore) • SIMD ⇒ Data-Level Parallelism • MIMD ⇒ Thread-Level Parallelism - The current multiprocessor/multicore focus

Structural hazards

HW cannot support this combination of instructions

Register result status

Indicates which functional unit (Qi) will write each register, if one exists. Blank when no pending instructions that will write that register.

Multicores

Instead of pursuing more ILP, architects are increasingly focusing on TLP implemented with single-chip multiprocessors

Data hazards

Instruction depends on result of prior instruction still in the pipeline

Tomasulo Loop Example

    Loop: fld    f0, 0(x1)
          fmul.d f4, f0, f2
          fsd    f4, 0(x1)
          addi   x1, x1, -8
          bnez   x1, Loop
• Assume Multiply takes 4 clocks • Assume the first load takes 8 clocks (cache miss?) and the second load takes 4 clocks (hit) • Assume stores take 3 clocks • To be clear, clocks will be shown for addi and bnez

HW support for More ILP

Need a HW buffer for the results of uncommitted instructions: the reorder buffer - Sometimes called the Register Update Unit (RUU) - 4 fields: instruction type, destination, value, ready - The reorder buffer is also an operand source, just as reservation stations are - Use the reorder buffer number instead of the register file when execution completes - Supplies operands between execution complete & commit - Once an operand commits, the result is put into the register - Instructions commit in order - As a result, it's easy to undo speculated instructions on mispredicted branches or on exceptions
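A small C sketch of a reorder-buffer entry with the four fields listed above, plus the circular head/tail bookkeeping that in-order commit implies; the size and field types are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define ROB_SIZE 16                      /* illustrative size */

    typedef enum { ROB_ALU, ROB_LOAD, ROB_STORE, ROB_BRANCH } InstrType;

    typedef struct {
        InstrType type;                      /* instruction type                           */
        uint8_t   dest;                      /* destination register (or store target)    */
        uint64_t  value;                     /* result, filled in when execution completes */
        bool      ready;                     /* result present, safe to commit             */
    } ROBEntry;

    /* Instructions commit strictly in order from the head; a mispredicted
       branch is undone simply by discarding everything after it (reset tail). */
    typedef struct {
        ROBEntry entry[ROB_SIZE];
        int      head, tail;                 /* circular-buffer indices */
    } ReorderBuffer;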

Data Level Parallelism

Parallelism achieved by performing the same operation on independent data. Perform identical operations on data, and lots of data

Both ILP and TLP: Right balance is unclear today

Perhaps right choice for server market, which can exploit more TLP, may differ from desktop, where single-thread performance may continue to be a primary requirement

Tomasulo vs. Scoreboard (IBM 360/91 vs. CDC 6600)

Tomasulo (IBM 360/91) vs. Scoreboard (CDC 6600):
- Functional units: pipelined (6 load, 3 store, 3 +, 2 x/÷) vs. multiple (1 load/store, 1 +, 2 x, 1 ÷)
- Window size: ≤ 14 instructions vs. ≤ 5 instructions
- Structural hazard: no issue on structural hazard (same for both)
- WAR hazards: renaming avoids vs. stall completion
- WAW hazards: renaming avoids vs. stall completion
- Results: broadcast from FU vs. write/read registers
- Control: reservation stations vs. central scoreboard

dynamically

Rely on hardware to help discover and exploit the parallelism

statically

Rely on software technology to find parallelism, statically at compile-time

Loop Unrolling Decisions (5)

Requires understanding how one instruction depends on another and how the instructions can be changed or reordered given the dependences: 1. Determine that loop unrolling is useful by finding that the loop iterations are independent (except for loop maintenance code) 2. Use different registers to avoid unnecessary constraints forced by using the same registers for different computations 3. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code 4. Determine that loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent • The transformation requires analyzing memory addresses and finding that they do not refer to the same address 5. Schedule the code, preserving any dependences needed to yield the same result as the original code (see the sketch below)
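A hedged C illustration of these decisions applied to the independent loop x[i] = x[i] + y[i] used in the basic-block discussion: the iterations are verified independent, distinct temporaries stand in for distinct registers, and the test-and-branch overhead is paid once per four iterations. The unroll factor of 4 and the assumption that n is a multiple of 4 are illustrative simplifications.

    /* Original (rolled) loop: one add plus one test-and-branch per element. */
    void add_rolled(double *x, const double *y, int n) {
        for (int i = 0; i < n; i++)
            x[i] = x[i] + y[i];
    }

    /* Unrolled by 4 (assumes n is a multiple of 4):
       - iterations are independent, so their bodies can be interleaved
       - distinct temporaries t0..t3 mimic using different registers (decision 2)
       - one increment and one branch now cover four iterations (decision 3)
       - loads and stores from different iterations may be reordered freely (decision 4) */
    void add_unrolled4(double *x, const double *y, int n) {
        for (int i = 0; i < n; i += 4) {
            double t0 = x[i]     + y[i];
            double t1 = x[i + 1] + y[i + 1];
            double t2 = x[i + 2] + y[i + 2];
            double t3 = x[i + 3] + y[i + 3];
            x[i]     = t0;
            x[i + 1] = t1;
            x[i + 2] = t2;
            x[i + 3] = t3;
        }
    }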

Directory based

The sharing status of a block of physical memory is kept in just one location, the directory • Scales well - common for distributed shared-memory systems • However, a directory is also used in the Sun T1 (Niagara), a symmetric shared-memory system

Data hazards

These result when an instruction in the pipeline depends on the result of a previous instruction that is still in the pipeline and not yet complete: RAW, WAR, and WAW hazards

Latency Impact

Time it takes for a bit to travel from its sender to its receiver. • CPI = Base CPI + Remote request rate x Remote request cost • CPI = 0.5 + 0.2% x 400 = 0.5 + 0.8 = 1.3 • With no communication (CPI = 0.5), the machine is 1.3/0.5 = 2.6 times faster than when 0.2% of instructions involve a remote access
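A tiny C check of the arithmetic above; the numbers (0.5 base CPI, 0.2% remote rate, 400-cycle remote cost) come directly from the example.

    #include <stdio.h>

    int main(void) {
        double base_cpi    = 0.5;
        double remote_rate = 0.002;     /* 0.2% of instructions go remote  */
        double remote_cost = 400.0;     /* clock cycles per remote access  */

        double cpi = base_cpi + remote_rate * remote_cost;    /* 0.5 + 0.8 = 1.3 */
        printf("CPI with remote accesses: %.1f\n", cpi);
        printf("Slowdown vs. all-local:   %.1fx\n", cpi / base_cpi);   /* 2.6x */
        return 0;
    }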

control dependent

Two (obvious) constraints on control dependencies: - An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch. - An instruction that is not control dependent on a branch cannot be moved to after the branch so that its execution is controlled by the branch; see compiler perspectives on code movement

Anti-dependence

WAR if a hazard for HW; see compiler perspectives on code movement

Output dependence

WAW if a hazard for HW; see compiler perspectives on code movement

Control hazard

caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

Advantage of ILP

compiler techniques; branch prediction; static and dynamic scheduling; multiple issue and speculation

Pipeline CPI

Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

Basic Block ILP

Is quite small - BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit - Average dynamic branch frequency of 15% to 25% => 3 to 6 instructions execute between a pair of branches - Plus, instructions in a BB are likely to depend on each other. To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks. Simplest: loop-level parallelism, which exploits parallelism among iterations of a loop, e.g., for (i = 1; i <= 1000; i = i + 1) x[i] = x[i] + y[i];

Ideal Pipeline CPI

measure of the maximum performance attainable by the implementation

Instruction level parallelism (ILP)

overlap the execution of instructions to improve performance; more than just simple pipelining; 2 approaches to exploit ILP

Register pressure

potential shortfall in registers created by aggressive unrolling and scheduling • If not possible to allocate all live values to registers, may lose some or all of its advantage; see limits to loop unrolling

Thread Level Parallelism

A thread is a process with its own instructions and data • A thread may be part of a parallel program of multiple processes, or it may be an independent program • Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute

Parallel Computer: Back to Basics

• "A parallel computer is a collection of processing elements that cooperateand communicate to solve large problems fast." • Parallel Architecture = Computer Architecture + Communication Architecture • 2 classes of multiprocessors WRTmemory: 1. Centralized Memory Multiprocessor • <few dozen processor chips (and < 100 cores) in 2006 • Small enough to share single, centralized memory 2. Physically Distributed-Memory Multiprocessor • Larger number chips and cores than 1. • BW demands ⇒Memory distributed among processors

Branch Predictors (Two-Level Predictors)

• A 2-bit scheme only looks at a branch's own history to predict its behavior • What if we use other branches to predict it as well? if (aa == 2) aa = 0; if (bb == 2) bb = 0; if (aa != bb) { ... }; • Branch #3 depends on the outcomes of #1 and #2 • Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table • In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters - The old 2-bit BHT is then a (0,2) predictor
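A minimal C sketch of an (m,n) = (2,2) correlating predictor as described above: a 2-bit global history selects one of four 2-bit saturating counters per table entry, and only the selected counter is updated. The table size and the indexing by low PC bits are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define ENTRIES 1024                     /* illustrative table size                 */

    static uint8_t counters[ENTRIES][4];     /* 4 = 2^m two-bit counters per entry      */
    static uint8_t ghist;                    /* outcomes of the last m = 2 branches     */

    static bool predict(uint32_t pc) {
        uint32_t idx = (pc >> 2) % ENTRIES;  /* index by low bits of the branch address */
        return counters[idx][ghist] >= 2;    /* predict taken if the counter is 2 or 3  */
    }

    static void update(uint32_t pc, bool taken) {
        uint32_t idx = (pc >> 2) % ENTRIES;
        uint8_t *c = &counters[idx][ghist];  /* update only the selected counter        */
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        ghist = (uint8_t)(((ghist << 1) | (taken ? 1 : 0)) & 0x3);  /* shift in outcome */
    }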

Distributed Memory Multiprocessor

• Add local memory per processor, then share between processors as required - Also referred to as NUMAs (Non-Uniform Memory Access) • Pro: cost-effective way to scale memory bandwidth - If most accesses are to local memory • Pro: reduces the latency of local memory accesses • Con: communicating data between processors is more complex • Con: must change software to take advantage of the increased memory BW

Compiler Perspectives on Code Movement

• Again, name dependencies are hard for memory accesses - Does 100(x4) = 20(x6)? - From different loop iterations, does 20(x6) = 20(x6)? • Our example required the compiler to know that if x1 doesn't change, then 0(x1) ≠ -8(x1) ≠ -16(x1) ≠ -24(x1). There were no dependencies between some loads and stores, so they could be moved past each other • The final kind of dependence is called control dependence • Example: if p1 {S1;}; if p2 {S2;}; S1 is control dependent on p1 and S2 is control dependent on p2 but not on p1.

Centralized Memory Multiprocessor

• Also called symmetric multiprocessors (SMPs) because the single main memory has a symmetric relationship to all processors - Sometimes referred to as UMAs (Uniform Memory Access) • Large caches ⇒ a single memory can satisfy the memory demands of a small number of processors • Can scale to a few dozen processors by using a switch and many memory banks • Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing the centralized memory increases

Compiler Perspectives on Code Movement

• Another kind of dependence is called name dependence: two instructions use the same name (register or memory location) but don't exchange data • Anti-dependence (WAR if a hazard for HW) - Instruction j writes a register or memory location that instruction i reads from, and instruction i is executed first
    C = A + B      OR    fld x1, 12(x2)    [instruction i]
    B = D + E            fsd x3, 12(x2)    [instruction j]
• Output dependence (WAW if a hazard for HW) - Instruction i and instruction j write the same register or memory location; ordering between the instructions must be preserved.
    C = A + B      OR    fsd x1, 12(x2)    [instruction i]
    C = 2 * D            fsd x3, 12(x2)    [instruction j]

Preserving Exception Behavior

• Any changes in the ordering of instruction execution must not change how exceptions are raised in the program • We can reorder instructions if we can ignore exceptions that would not have occurred prior to the reordering • Moving the fld before the branch:
        add  x2, x3, x4
        beqz x2, L1
        nop
        fld  x1, 0(x2)
    L1:
• We could move the fld before the branch because no data dependences exist • However, the fld could cause a memory protection exception • Conditional instructions and speculation (covered later) will overcome this problem

BHT Accuracy

• A BHT mispredicts because either: - Wrong guess for that branch - Got the history of the wrong branch when indexing the table • For a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12% • 4096 entries is about as good as an infinite table (in the Alpha 21164)

Branch Target Buffers (BTB)

• Branch target calculation is costly and stalls the instruction fetch. • BTB stores PCs the same way as caches • The PC of a branch is sent to the BTB • When a match is found the corresponding Predicted PC is returned • If the branch was predicted taken, instruction fetch continues at the returned predicted PC
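A C sketch of the lookup described above: the branch PC indexes a small direct-mapped table, and on a tag match the stored predicted PC is returned so fetch can continue there without waiting for the target calculation. The table size and direct-mapped organization are assumptions for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_ENTRIES 512                    /* illustrative */

    typedef struct {
        bool     valid;
        uint32_t tag;                          /* branch PC, stored like a cache tag        */
        uint32_t predicted_pc;                 /* where fetch continues if predicted taken  */
    } BTBEntry;

    static BTBEntry btb[BTB_ENTRIES];

    /* Returns true on a hit and writes the predicted fetch PC to *next_pc;
       on a miss, fetch simply falls through to pc + 4. */
    static bool btb_lookup(uint32_t pc, uint32_t *next_pc) {
        BTBEntry *e = &btb[(pc >> 2) % BTB_ENTRIES];
        if (e->valid && e->tag == pc) {
            *next_pc = e->predicted_pc;
            return true;
        }
        *next_pc = pc + 4;
        return false;
    }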

Snooping Cache-Coherence Protocols

• The cache controller "snoops" all transactions on the shared medium (bus or switch) - A transaction is relevant if it is for a block the cache contains - Take action to ensure coherence • invalidate, update, or supply the value - Which action depends on the state of the block and the protocol • Either get exclusive access before a write, via write invalidate, or update all copies on a write

Architectural Building Blocks

• Cache block state transition diagram - An FSM specifying how the disposition of a block changes • invalid, valid, dirty • Broadcast medium transactions (e.g., bus) - Fundamental system design abstraction - Logically a single set of wires connects several devices - Protocol: arbitration, command/addr, data - Every device observes every transaction • The broadcast medium enforces serialization of read or write accesses ⇒ write serialization - The 1st processor to get the medium invalidates the others' copies - Implies a write cannot complete until it obtains the bus - All coherence schemes require serializing accesses to the same cache block • Also need to find the up-to-date copy of a cache block
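A hedged C sketch of the per-block FSM implied above, using the invalid / valid (shared) / dirty states and a write-invalidate policy; the transaction names and the simplified action set are assumptions for illustration, not any specific machine's protocol.

    typedef enum { INVALID, SHARED, DIRTY } BlockState;            /* invalid, valid, dirty */
    typedef enum { BUS_READ, BUS_WRITE_INVALIDATE } BusTransaction;
    typedef enum { NO_ACTION, SUPPLY_DATA, WRITE_BACK } SnoopAction;

    /* What this cache does when it snoops a transaction for a block it holds. */
    static SnoopAction snoop(BlockState *state, BusTransaction t) {
        switch (*state) {
        case DIRTY:
            if (t == BUS_READ) { *state = SHARED; return SUPPLY_DATA; }  /* supply value  */
            *state = INVALID;                                            /* other writer: */
            return WRITE_BACK;                                           /* give up copy  */
        case SHARED:
            if (t == BUS_WRITE_INVALIDATE) *state = INVALID;             /* copy is stale */
            return NO_ACTION;
        default:                                                         /* INVALID       */
            return NO_ACTION;
        }
    }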

Tomasulo Drawbacks

• Complexity • Many associative writebacks (CDB) at high speed - Big load for a single bus - Can only writeback one result per clock • Performance limited by Common Data Bus - Multiple CDBs => more FU logic for parallel assoc writebacks

Limits to ILP

• Conflicting studies of the amount - Benchmarks (vectorized Fortran FP vs. integer C programs) - Hardware sophistication - Compiler sophistication • How much ILP is available using existing mechanisms with increasing HW budgets? • Do we need to invent new HW/SW mechanisms to keep on the processor performance curve? - Intel MMX, SSE (Streaming SIMD Extensions): 64-bit ints - Intel SSE2: 128-bit, including 2 64-bit Fl. Pt. per clock - Motorola AltiVec: 128-bit ints and FPs - SuperSPARC multimedia ops, etc.

Compiler Perspectives on Code Movement

• Definitions: the compiler is concerned about dependencies in the program; whether or not a HW hazard exists for a dependency is a function of a given pipeline • Try to schedule instructions to avoid hazards • Data dependencies (RAW if a hazard for HW) - Instruction i produces a result used by instruction j, or - Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i • If truly dependent, they can't execute in parallel • Easy to determine dependencies for registers (fixed names) • Harder for memory locations: - Does 100(x4) = 20(x6)? - From different loop iterations, does 20(x6) = 20(x6)?

Advantages of Dynamic Scheduling

• Dynamic scheduling -hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior • It handles cases when dependences unknown at compile time - it allows the processor to tolerate unpredictable delays such as cache misses, by executing other code while waiting for the miss to resolve • It allows code that compiled for one pipeline to run efficiently on a different pipeline • It simplifies the compiler • Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling (next lecture)

Cache behavior in response to bus

• Every bus transaction must check the cache-address tags - could potentially interfere with processor cache accesses • A way to reduce interference is to duplicate the tags - One set for cache accesses, one set for bus accesses • Another way to reduce interference is to use the L2 tags - Since L2 is less heavily used than L1 ⇒ every entry in the L1 cache must be present in the L2 cache, called the inclusion property - If the snoop gets a hit in the L2 cache, it must arbitrate for the L1 cache to update the state and possibly retrieve the data, which usually requires a stall of the processor

Unroll Loops (When is it Safe)

• Example: Where are the data dependencies? (A, B, C distinct & non-overlapping)
    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + C[i];      /* S1 */
        B[i+1] = B[i] + A[i+1];    /* S2 */
    }
1. S2 uses the value A[i+1] computed by S1 in the same iteration. 2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1]. This is a "loop-carried dependence" between iterations • Implies that iterations are dependent and can't be executed in parallel • Not the case for our prior example; each iteration was distinct • It's possible that loops with a limited dependency may still be parallelized

Write Consistency

• For now assume 1. A write does not complete (and allow the next write to occur) until all processors have seen the effect of that write 2. The processor does not change the order of any write with respect to any other memory access ⇒ if a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A • These restrictions allow the processor to reorder reads, but forces the processor to finish writes in program order

Symmetric Shared-Memory Architectures

• From multiple boards on a shared bus to multiple processors inside a single chip • Caches both - Private data used by a single processor - Shared data used by multiple processors • Caching shared data ⇒ reduces latency to shared data, memory bandwidth for shared data, and interconnect bandwidth ⇒ cache coherence problem

Multiprocessors ⇒ Other Factors

• Growth in data-intensive applications - Data bases, file servers, ... • Growing interest in servers, server perf. • Increasing desktop perf. less important - Outside of graphics • Improved understanding in how to use multiprocessors effectively - Especially server where significant natural TLP • Advantage of leveraging design investment by replication - Rather than unique design

Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation

• HW advantages: - HW better at memory disambiguation since it knows actual addresses - HW better at branch prediction since lower overhead - HW maintains precise exception model - Same software works across multiple implementations • Binary compatibility across generations of hardware - Smaller code size (not as many nops filling blank instructions) • SW advantages: - Window of instructions that is examined for parallelism much higher - Much less hardware involved in VLIW - More involved types of speculation can be done more easily - Speculation can be based on large-scale program behavior, not just local information

ILP and Data Dependencies, Hazards

• HW/SW must preserve program order: the order in which instructions would execute if executed sequentially, as determined by the original source program - Dependences are a property of programs • The presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline • Importance of the data dependencies: 1) indicates the possibility of a hazard 2) determines the order in which results must be calculated 3) sets an upper bound on how much parallelism can possibly be exploited

Speculative Execution

• Hardware speculation: issue instructions based on branch predictions, but be ready to deal with the consequences of mispredicted branches, including exceptions occurring in mispredicted code ("HW undo") - called "boosting" • Combine branch prediction with dynamic scheduling to execute branches before they are resolved • Separate speculative bypassing of results from real bypassing of results - When an instruction is no longer speculative, write the boosted results (instruction commit) or discard them - Key: execute out-of-order but commit in-order to prevent irrevocable actions (updating state or raising an exception)

Tomasulo Algorithm

• Hardware will detect and preserve dependencies (within a limited window of the instruction stream) • Hardware will check for resource availability • Independent instructions will be issued to the correct functional units • Correctness of execution is guaranteed by hardware • Independent of compiler optimizations • Backward compatibility - Software scheduling: a different machine configuration necessitates recompilation (or at least rescheduling) • Control & buffers distributed with the Functional Units (FU) - FU buffers called "reservation stations"; they hold pending operands • Registers in instructions are replaced by values or pointers to reservation stations (RS); this is called register renaming - Renaming avoids WAR and WAW hazards - More reservation stations than registers, so can do optimizations compilers can't • Results go to FUs from RSs, not through registers, over a Common Data Bus that broadcasts results to all FUs - Avoids RAW hazards by executing an instruction only when its operands are available • Loads and stores are treated as FUs with RSs as well • Integer instructions can go past branches (predict taken), allowing FP ops beyond the basic block in the FP queue

Exceptions and Interrupts

• The IBM 360/91 invented "imprecise interrupts" - The computer stopped at this PC; it's likely close to this address - Not so popular with programmers - Also, what about virtual memory? (Not in the IBM 360) • Technique for both precise interrupts/exceptions and speculation: in-order completion and in-order commit - If we speculate and are wrong, we need to back up and restart execution at the point where we predicted incorrectly - This is exactly the same as what is needed for precise exceptions • Exceptions are handled by not recognizing the exception until the instruction that caused it is ready to commit in the ROB - If a speculated instruction raises an exception, the exception is recorded in the ROB - This is why reorder buffers are in all new processors

Thread Level Parallelism (TLP)

• ILP exploits implicit parallel operations within a loop or straight-line code segment • TLP explicitly represented by the use of multiple threads of execution that are inherently parallel • Goal: Use multiple instruction streams to improve 1. Throughput of computers that run many programs 2. Execution time of multi-threaded programs • TLP could be more cost-effective to exploit than ILP

Tomasulo/ Scoreboarding

• Implemented for the floating point unit of the IBM 360/91 - About 3 years after the CDC 6600 (1966) • Goal: high performance without special compilers • Differences between the IBM 360 & CDC 6600 ISAs - IBM has only 2 register specifiers/instruction vs. 3 in the CDC 6600 - IBM has 4 FP registers vs. 8 in the CDC 6600 - Pipelined rather than multiple functional units (but conceptually viewed as multiple functional units) • Adder supports 3 instructions • Multiplier supports 2 instructions • Why study? It led to the Alpha 21264, HP PA-8000, MIPS R10000, Pentium II, PowerPC 604, i7

SMT: Changes in Power5 to Support SMT

• Increased associativity of the L1 instruction cache and the instruction address translation buffers • Added per-thread load and store queues • Increased the size of the L2 (1.92 vs. 1.44 MB) and L3 caches • Added separate instruction prefetch and buffering per thread • Increased the number of virtual registers from 152 to 240 • Increased the size of several issue queues • The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support

Limits to Multi-Issue Machines

• Inherent limitations of ILP - 1 branch in 5: how to keep a 5-way VLIW busy? - Latencies of units: many operations must be scheduled - Need about Pipeline Depth x No. of Functional Units independent operations to keep the machine busy, e.g. 5 x 4 = 15-20 independent instructions? • Difficulties in building HW - Easy: duplicate FUs to get parallel execution - Hard: increase ports to the register file (bandwidth) • The VLIW example needs 7 read and 3 write ports for the Int. registers & 5 read and 3 write for the FP registers - Harder: increase ports to memory (bandwidth) • Limitations specific to either a Superscalar or VLIW implementation - Decode issue in Superscalar: how wide is practical? - VLIW code size: unrolled loops + wasted fields in VLIW (nops) - VLIW lock step -> 1 hazard & all instructions stall - VLIW & binary compatibility is a practical weakness as the number of FUs and latencies vary over time

Instruction Parallelism: HW Schemes

• Key idea: allow instructions behind a stall to proceed
        fdiv.d f0, f2, f4
        fadd.d f10, f0, f8
        fsub.d f12, f8, f14
• Enables out-of-order execution and allows out-of-order completion (e.g., the fsub.d) - In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue) • Out-of-order completion causes WAW hazards • Out-of-order execution causes WAR hazards

ILP Summary

• Leverage implicit parallelism for performance: instruction-level parallelism • Loop unrolling by the compiler to increase ILP • Branch prediction to increase ILP • Dynamic HW exploiting ILP - Works when dependences can't be known at compile time - Can hide L1 cache misses - Code for one machine runs well on another • Interest in multiple issue (superscalar, VLIW) because of the desire to improve performance without affecting the uniprocessor programming model • Taking advantage of ILP is conceptually simple, but the design problems are amazingly complex in practice • Conservative in ideas, just faster clocks and bigger structures • Recent processors (Pentium 4, IBM Power5, AMD Opteron) have the same basic structure and similar sustained issue rates (3 to 4 instructions per clock) as the 1st dynamically scheduled, multiple-issue processors announced in 1995 - Clocks 10 to 20X faster, caches 4 to 8X bigger, 2 to 4X as many renaming registers, and 2X as many load-store units ⇒ performance 8 to 16X • But the Brick Wall is slowing ILP - Remember the uniprocessor performance curve?

Tournament Predictors

• Multilevel branch predictor • Use n-bit saturating counter to choose between predictors • Usual choice between global and local predictors Advantage of tournament predictor is ability to select the right predictor for a particular branch
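A small C sketch of the selection mechanism described above: a 2-bit saturating "chooser" counter per entry picks between the global and local predictions and is nudged toward whichever component predictor turned out to be right. The entry count and the n = 2 chooser width are illustrative assumptions; the component predictors themselves are updated separately.

    #include <stdbool.h>
    #include <stdint.h>

    #define CHOOSERS 4096                       /* illustrative */

    static uint8_t chooser[CHOOSERS];           /* 0-1: prefer local, 2-3: prefer global */

    static bool tournament_predict(uint32_t idx, bool global_pred, bool local_pred) {
        return chooser[idx % CHOOSERS] >= 2 ? global_pred : local_pred;
    }

    static void tournament_update(uint32_t idx, bool global_pred, bool local_pred,
                                  bool taken) {
        uint8_t *c = &chooser[idx % CHOOSERS];
        bool global_ok = (global_pred == taken);
        bool local_ok  = (local_pred  == taken);
        if (global_ok && !local_ok && *c < 3) (*c)++;   /* move toward the global predictor */
        if (local_ok && !global_ok && *c > 0) (*c)--;   /* move toward the local predictor  */
    }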

Multithreaded Execution

• Multithreading: multiple threads share the functional units of one processor via overlapping - The processor must duplicate the independent state of each thread, e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table - Memory is shared through the virtual memory mechanisms, which already support multiple processes - HW for fast thread switch; much faster than a full context switch, which takes ≈100s to 1000s of clocks • When to switch? - Alternate instructions per thread (fine grain) - When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)

No Silver Bullet for ILP

• No obvious overall leader in performance • The AMD Athlon leads on SPECint performance, followed by the Pentium 4, Itanium 2, and Power5 • The Itanium 2 and Power5, which perform similarly on SPECfp, clearly dominate the Athlon and Pentium 4 on SPECfp • The Itanium 2 is the most inefficient processor for both Fl. Pt. and integer code on all but one efficiency measure (SPECfp/Watt) • The Athlon and Pentium 4 both make good use of transistors and area in terms of efficiency • The IBM Power5 is the most effective user of energy on SPECfp and essentially tied on SPECint

WB Snooping Cache Resources

• Normal cache tags can be used for snooping • A valid bit per block makes invalidation easy • Read misses are easy since they rely on snooping • Writes ⇒ need to know whether any other copies of the block are cached - No other copies ⇒ no need to place the write on the bus for WB - Other copies ⇒ need to place an invalidate on the bus • To track whether a cache block is shared, add an extra state bit associated with each cache block, like the valid bit and dirty bit - Write to a Shared block ⇒ need to place an invalidate on the bus and mark the cache block as private (if an option) - No further invalidations will be sent for that block - This processor is called the owner of the cache block - The owner then changes the state from shared to unshared (or exclusive)

Dynamic Branch Prediction Summary

• Prediction becoming important part of execution • Branch History Table: 2 bits for loop accuracy • Correlation: Recently executed branches correlated with next branch - Either different branches - Or different executions of same branches • Tournament predictors take insight to next level, by using multiple predictors - usually one based on global information and one based on local information, and combining them with a selector - In 2006, tournament predictors using ≈30K bits are in processors like the Power5 and Pentium 4 • Branch Target Buffer: include branch address & prediction

Dynamic Branch Prediction

• Prediction depends on behavior of branch at runtime and can change over time • Performance = ƒ(accuracy, cost of misprediction) • Branch History Table (BHT) is simplest - Small memory indexed by lower portion of address of branch instruction - Says whether or not branch taken last time - Lower bits of address indexes table of 1-bit values - No address (tag) check of upper bits • another branch could have set bit • Problem: in a loop, 1-bit BHT will cause two mispredictions: - End of loop case, when it exits instead of looping as before - First time through loop on next time through code, when it predicts exit instead of looping
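A minimal C sketch of the simplest scheme above: a 1-bit BHT indexed by the low bits of the branch address, with no tag check of the upper bits. The table size is an illustrative assumption; the comment notes the two-misprediction loop problem the card describes.

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 4096                    /* illustrative; no tag check of upper bits */

    static bool bht[BHT_ENTRIES];               /* 1 bit per entry: taken last time?        */

    static bool bht_predict(uint32_t pc) {
        return bht[(pc >> 2) % BHT_ENTRIES];    /* indexed by lower bits of the branch PC   */
    }

    static void bht_update(uint32_t pc, bool taken) {
        /* Loop problem: the bit flips on the final (not-taken) iteration and flips
           back on the next entry to the loop, giving two mispredictions per loop. */
        bht[(pc >> 2) % BHT_ENTRIES] = taken;
    }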

Enforcing Coherence Basic Schemes

• A program on multiple processors will normally have copies of the same data in several caches • Rather than trying to avoid sharing in SW, SMPs use a HW protocol to maintain coherent caches - Migration and replication are key to the performance of shared data • Migration: data can be moved to a local cache and used there in a transparent fashion - Reduces both the latency to access shared data that is allocated remotely and the bandwidth demand on the shared memory • Replication: for shared data being simultaneously read, caches make a copy of the data in the local cache - Reduces both the latency of access and contention for reading shared data

Intuitive Memory Model

• Reading an address should return the last value written to that address - Easy in uniprocessors, except for I/O • Too vague and simplistic; 2 issues: 1. Coherence defines the values returned by a read 2. Consistency determines when a written value will be returned by a read • Coherence defines behavior to the same location; consistency defines behavior to other locations

Tomasulo Summary

• Reservation stations: renaming to a larger set of registers + buffering of source operands - Prevents registers from being the bottleneck - Avoids the WAR and WAW hazards of the Scoreboard - Allows loop unrolling in HW • Not limited to basic blocks (integer units get ahead, beyond branches) • Lasting contributions - Dynamic scheduling - Register renaming - Load/store disambiguation • 360/91 descendants include the Pentium II, PowerPC 604, MIPS R10000, HP PA-8000, and Alpha 21264

Parallel Processing Challenges

• The second challenge is the long latency to remote memory • Suppose a 32-CPU MP, 2 GHz, 200 ns remote memory, all local accesses hit in the memory hierarchy, and base CPI is 0.5. (Remote access = 200 ns / 0.5 ns = 400 clock cycles.) • What is the performance impact if 0.2% of instructions involve a remote access, vs. all-local? a. 1.5X b. 2.0X c. 2.6X (answer: c; see Latency Impact above)

SMT Design Challenge

• Since SMT makes sense only with a fine-grained implementation, what is the impact of fine-grained scheduling on single-thread performance? - A preferred thread approach sacrifices neither throughput nor single-thread performance? - Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput when the preferred thread stalls • A larger register file is needed to hold multiple contexts • Not affecting the clock cycle time, especially in - Instruction issue - more candidate instructions need to be considered - Instruction completion - choosing which instructions to commit may be challenging • Ensuring that cache and TLB conflicts generated by SMT do not degrade performance

Preserving Data Flow

• Sometimes data flow is dynamic, or altered due to branches
        add  x1, x2, x3
        beqz x4, L1
        nop
        sub  x1, x5, x6
    L1: or   x7, x1, x8
• The result of the OR depends on x1, which is controlled by the branch (the value of x4) • Note that data dependence alone will not preserve correctness; there is also a control dependence that must be preserved (dynamic data flow) • Conditional instructions and speculation (covered later) will help overcome this problem too

Statically Scheduled Issue

• Superscalar MIPS: assume 2 instructions, 1 FP & 1 anything else - Fetch 64 bits/clock cycle; Int on the left, FP on the right - Can only issue the 2nd instruction if the 1st instruction issues - More ports for the FP registers to do an FP load & FP op in a pair
    Type                1   2   3   4    5    6    7
    Int. instruction    IF  ID  EX  MEM  WB
    FP  instruction     IF  ID  EX  MEM  WB
    Int. instruction        IF  ID  EX   MEM  WB
    FP  instruction         IF  ID  EX   MEM  WB
    Int. instruction            IF  ID   EX   MEM  WB
    FP  instruction             IF  ID   EX   MEM  WB

Issuing Multiple Instructions/Cycle

• Superscalar: varying no. instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo) • (Very) Long Instruction Words (V)LIW: fixed number of instructions (4-16) scheduled by the compiler; put ops into wide templates - Style: "Explicitly Parallel Instruction Computer (EPIC)" • Anticipated success led to use of Instructions Per Clock cycle (IPC) vs. CPI

Fine-Grained Multithreading

• Switches between threads on each instruction, causing the execution of multiple threads to be interleaved • Usually done in a round-robin fashion, skipping any stalled threads • The CPU must be able to switch threads every clock • Advantage: it can hide both short and long stalls, since instructions from other threads are executed when one thread stalls • Disadvantage: it slows down the execution of individual threads, since a thread ready to execute without stalls will be delayed by instructions from other threads • Used on Sun's Niagara

Coarse-Grained Multithreading

• Switches threads only on costly stalls, such as L2 cache misses • Advantages - Relieves the need for very fast thread switching - Doesn't slow down a thread, since instructions from other threads are issued only when the thread encounters a costly stall • Disadvantage: it is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs - Since the CPU issues instructions from 1 thread, when a stall occurs the pipeline must be emptied or frozen - The new thread must fill the pipeline before instructions can complete • Because of this start-up overhead, coarse-grained multithreading is better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time • Used in the IBM AS/400

Both ILP and TLP

• TLP and ILP exploit two different kinds of parallel structure in a program • Could a processor oriented at ILP exploit TLP? - functional units are often idle in data path designed for ILP because of either stalls or dependences in the code • Could the TLP be used as a source of independent instructions that might keep the processor busy during stalls? • Could TLP be used to employ the functional units that would otherwise lie idle when insufficient ILP exists?

Performance beyond single thread ILP

• There can be much higher natural parallelism in some applications (e.g., Database or Scientific codes) • Explicit Thread Level Parallelism or Data Level Parallelism - Thread: process with own instructions and data • thread may be a process part of a parallel program of multiple processes, or it may be an independent program • Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute - Data Level Parallelism: Perform identical operations on data, and lots of data

Dependency Types

• True Dependency - Part of the program data flow - Read after Write (RAW) • Name Dependency (Can Solve Name Dependencies via Register Renaming) - Anti-Dependence • Write After Read (WAR) - Output Dependency • Write After Write (WAW) • Control Dependency - Values dependent on branch outcomes must be maintained

Compiler Perspectives on Code Movement

• Two (obvious) constraints on control dependencies: - An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch. - An instruction that is not control dependent on a branch cannot be moved to after the branch so that its execution is controlled by the branch. • Control dependencies can be relaxed to get parallelism - must not affect the correctness of the program - need to preserve the order of exceptions and data flow (i.e., the values in registers that depend on the branch)

Avoiding Memory Hazards

• WAW and WAR hazards through memory are eliminated with speculation because actual updating of memory occurs in order, when a store is at head of the ROB, and hence, no earlier loads or stores can still be pending • RAW hazards through memory are maintained by two restrictions: 1. not allowing a load to initiate the second step of its execution if any active ROB entry occupied by a store has a Destination field that matches the value of the A field of the load, and 2. maintaining the program order for the computation of an effective address of a load with respect to all earlier stores. • These restrictions ensure that any load that accesses a memory location written to by an earlier store cannot perform the memory access until the store has written the data
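A hedged C sketch of restriction 1 above: before a load begins its memory access, the active ROB entries older than the load are scanned for a store whose destination address (A field) matches the load's address or is not yet known. The structure layout and helper name are assumptions for illustration; restriction 2 (computing effective addresses in program order) is assumed to hold already.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     active;
        bool     is_store;
        bool     addr_known;      /* has the store's effective address (A field) been computed? */
        uint64_t addr;            /* destination address, once known                            */
    } ROBSlot;

    /* May this load access memory, given the stores still active in the ROB?
       'older' holds the ROB slots that precede the load in program order.     */
    static bool load_may_access(uint64_t load_addr, const ROBSlot *older, int n) {
        for (int i = 0; i < n; i++) {
            if (!older[i].active || !older[i].is_store)
                continue;
            if (!older[i].addr_known || older[i].addr == load_addr)
                return false;     /* possible RAW through memory: wait for the store */
        }
        return true;
    }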

Multiple Issue Challenges

• While Integer/FP split is simple for the HW... - You can get CPI of 0.5 only for programs with: • Exactly 50% FP operations • No hazards • If more instructions issue at same time, greater difficulty of decode and issue - Even 2-scalar examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue • VLIW: tradeoff instruction space for simple decoding - The long instruction word has room for many operations - By definition, all the operations the compiler puts in the long instruction word are independent -> execute in parallel • e.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch • 16 to 24 bits per field -> 7*16 or 112 bits to 7*24 or 168 bits wide - Need compiling technique that schedules across several branches

Locate up-to-date copy of data

• Write-through: get up-to-date copy from memory - Write through simpler if enough memory BW • Write-back: harder - Most recent copy can be in a cache • Can use same snooping mechanism - Snoop every address placed on the bus - If a processor has dirty copy of requested cache block, it provides it in response to a read request and aborts the memory access - Complexity from retrieving cache block from a processor cache, which can take longer than retrieving it from memory • Write-back needs lower memory bandwidth - Support larger numbers of faster processors - Most multiprocessors use write-back

Limits to ILP (Energy Efficiency)

• Most techniques for increasing performance increase power consumption • The key question is whether a technique is energy efficient: does it increase power consumption faster than it increases performance? • Multiple-issue processor techniques are all energy inefficient: 1. Issuing multiple instructions incurs some overhead in logic that grows faster than the issue rate grows 2. Growing gap between peak issue rates and sustained performance • The number of transistors switching = f(peak issue rate) and performance = f(sustained rate); the growing gap between peak and sustained performance ⇒ increasing energy per unit of performance

