EEL6764 Computer Architecture


Vector Architecture (DLP and Vector Architecture - 3)

1. A vector is a one-dimensional array of numbers 2. An instruction operates on vectors, not scalar values (each instruction generates a lot of work) Basic idea: read vectors of data from memory into vector registers, have the vector functional units operate on those registers in a pipelined manner one element at a time, then disperse the results back into memory. Functional units are deeply pipelined; no intra-vector dependences can occur (no hardware interlocking within a vector), and there is no control flow within a vector. Vector registers are controlled by the compiler and are used to hide memory latency and leverage memory bandwidth. This delivers high performance without the energy/design complexity of an out-of-order superscalar processor. Each register holds 64 64-bit elements. The register file has 16 read ports and 8 write ports, plus special-purpose registers VLR/MVL.
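For reference, a minimal scalar sketch of DAXPY (Y = a*X + Y) in C is shown below; on a vector machine like the one described above, each group of up to MVL (here 64) iterations of this loop collapses into a handful of vector instructions (two vector loads, a multiply, an add, and a vector store) rather than one instruction per element.

```c
#include <stddef.h>

/* Scalar DAXPY: Y = a*X + Y. On a vector processor with 64-element
 * registers, each group of up to 64 iterations maps to a few vector
 * instructions instead of 64 separate scalar instructions. */
void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* one element per iteration in scalar code */
}
```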

Exploiting TLP (TLP and Multiprocessing - 1)

1. Applications should contain abundant concurrency; for n processors, there need to be at least n independent threads 2. Identified by programmers or the operating system 3. The amount of computation assigned to each thread = grain size 4. Thread grain size must be large to reduce the overheads associated with thread execution

Multithreading Approaches (Pipelining and ILP - 11)

1. Coarse-grained MT: switch to a different thread only on long stalls. Lower throughput, longer latency for an individual thread 2. Fine-grained MT: the processor switches between threads every cycle. It can waste cycles if individual threads have limited ILP and the issue logic is wide 3. Simultaneous MT: execute instructions from multiple threads using a multi-issue processor => fine-grained MT + multi-issue dynamic scheduling

Advantages of Vector Instructions (DLP and Vector Architecture - 3)

1. Compact - one short instruction encodes N operations 2. Expressive - tells hardware that these N operations are independent, use the same type of functional unit, access different portions of the registers, access registers in the same pattern as previous instructions, access a contiguous block of memory, and access memory in a known pattern 3. Scalable - can run the same code on multiple parallel pipelines A critical advantage of a vector instruction set is that it allows software to pass a large amount of parallel work to hardware using only a single short instruction. One vector instruction can include scores of independent operations yet be encoded in the same number of bits as a conventional scalar instruction

Why multiprocessors? (TLP and Multiprocessing - 3)

1. Diminishing returns from exploiting ILP with the rising cost of power and chip area 2. Single-thread performance is good enough 3. Easier to replicate cores, with a better return on investment

Multithreading (Pipelining and ILP - 11)

1. Exploiting thread-level parallelism (TLP) to improve uniprocessor throughput 2. Multithreading allows multiple threads to share a processor without process switch; each thread is independently controlled by the operating system with duplicate private states (registers, PC, etc.) for each thread but shared functional units and memory

Disadvantages of VLIW Processors (Pipelining and ILP - 10)

1. Finds parallelism in a static manner 2. Requires aggressive loop unrolling, which blows up the code size 3. No hazard detection hardware; relies on the compiler and can't detect hazards through memory, leading to low performance 4. No binary code compatibility; code must be recompiled for a new microarchitecture

How are certain conflicts resulting from pipelining avoided? (Final exam topic: Pipelining and ILP - 2)

1. Instruction and data memories are separate, each with its own cache. This eliminates the conflict for a single memory that would arise between IF and MEM. 2. The register file is used in two stages: it is read in ID and written in WB. To handle reads and writes to the same register, the write is performed in the first half of the clock cycle and the read in the second half. 3. Pipeline registers are introduced between successive stages of the pipeline so that instructions in different stages do not interfere with one another. The results at the end of each stage are stored in the pipeline registers to be used as input for the next stage.

Applications of DLP (DLP and Vector Architecture - 1)

1. Scientific research 2. Games 3. Oil exploration 4. Industrial design (car crash simulation) 5. Bioinformatics 6. Cryptography

What are the three architectures that issue instructions in parallel? (Final exam topic: Pipelining and ILP - 10)

1. Statically scheduled superscalar processors 2. Very Long Instruction Word (VLIW) processors 3. Dynamically scheduled superscalar processors 1 and 2 are similar since they both rely on the compiler to schedule the code.

What does a data dependence convey? (Final exam topic: Pipelining and ILP - 2)

1. The possibility of a hazard 2. The order in which results must be calculated 3. An upper bound on how much parallelism can possibly be exploited

Shortcoming of 1-bit Prediction (Final exam topic: Pipelining and ILP - 3)

A 1-bit prediction scheme has a performance shortcoming: even if a branch is almost always taken, it is likely to predict incorrectly twice when it is not taken. 2-bit prediction schemes are used to remedy this weakness. A prediction must be missed twice before it is changed.
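A minimal sketch of the 2-bit scheme in C, assuming the usual convention that counter values 2 and 3 predict taken:

```c
#include <stdbool.h>

/* 2-bit saturating counter: 0,1 predict not taken; 2,3 predict taken.
 * The prediction only flips after two consecutive mispredictions,
 * which fixes the double-mispredict weakness of the 1-bit scheme. */
typedef struct { unsigned counter; } Predictor2Bit;

bool predict(const Predictor2Bit *p) { return p->counter >= 2; }

void update(Predictor2Bit *p, bool taken) {
    if (taken  && p->counter < 3) p->counter++;
    if (!taken && p->counter > 0) p->counter--;
}
```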

Out-of-order Execution (Final exam topic: Pipelining and ILP - 8)

A pipeline that still uses in-order instruction issue but begins instruction execution as soon as its data operands are available is known as out-of-order execution. This introduces WAR and WAW hazards, which do not exist in a 5-stage in-order execution pipeline. Out-of-order completion also creates major complications in handling exceptions. Dynamic scheduling with out-of-order completion must preserve exception behavior exactly as if the program executed in strict program order.

VLIW Processors (Pipelining and ILP - 10)

A technique that packs multiple operations into the slots of one instruction. Example VLIW processor: one integer instruction (or branch), two independent floating-point operations, and two independent memory references. There must be enough parallelism in the code to fill the available slots, which can be accomplished by loop unrolling.

Loop Unrolling (Final exam topic: Pipelining and ILP - 7)

A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling. Unrolling simply replicates the loop body multiple times, adjusting the loop termination code.
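A small illustrative example in C (the array, scalar, and unroll factor of 4 are hypothetical, and the remainder loop is omitted by assuming n is a multiple of 4):

```c
/* Original loop: adds a scalar s to each element of x. */
for (int i = 0; i < n; i++)
    x[i] = x[i] + s;

/* Unrolled by 4 (assuming n is a multiple of 4): one branch and one
 * index update now amortize over four element operations, and the
 * four independent adds can be scheduled to hide latency. */
for (int i = 0; i < n; i += 4) {
    x[i]     = x[i]     + s;
    x[i + 1] = x[i + 1] + s;
    x[i + 2] = x[i + 2] + s;
    x[i + 3] = x[i + 3] + s;
}
```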

VL and MVL for vector register processors (DLP and Vector Architecture - 5)

A vector register processor has a natural vector length determined by the maximum vector length (MVL). This length is unlikely to match the real vector length of a program. Moreover, in a real program, the length of a particular vector operation is often unknown at compile time. In fact, a single piece of code may require different vector lengths. The solution to this problem is to add a vector-length register (VL) that controls the length of any vector operation, including vector load/store, and cannot be greater than the MVL.

Contrast between Static Scheduling and Dynamic Scheduling (Final exam topic: Pipelining and ILP - 8)

Although a dynamically scheduled processor cannot change the data flow, it tries to avoid stalling when dependencies are present. In contrast, static pipeline scheduling by the compiler tries to minimize stalls by separating dependent instructions so that they will not lead to hazards.

Correlating predictors (Final exam topic: Pipelining and ILP - 3)

Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors. Existing correlating predictors add information about the behavior of the most recent branches to decide how to predict a given branch. In the general case, an (m, n) predictor uses the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch.
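A rough sketch of an (m, n) = (2, 2) correlating predictor in C, with a hypothetical 1024-entry table; a global history register supplies the last m outcomes used to select among the 2^m counters in each entry:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of a (2, 2) correlating predictor with 1024 entries. Each
 * entry holds 2^m = 4 two-bit counters; the last m branch outcomes
 * (global history) select which counter predicts the current branch. */
#define M 2
#define ENTRIES 1024

static uint8_t table[ENTRIES][1 << M];  /* 2-bit counters, values 0..3 */
static unsigned history;                /* last M branch outcomes      */

bool predict_branch(uint32_t pc) {
    return table[(pc >> 2) % ENTRIES][history] >= 2;
}

void update_branch(uint32_t pc, bool taken) {
    uint8_t *c = &table[(pc >> 2) % ENTRIES][history];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    history = ((history << 1) | (taken ? 1u : 0u)) & ((1u << M) - 1);
}
```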

Chaining (DLP and Vector Architecture - 4)

Chaining allows a vector operation to start as soon as the individual elements of its vector source operands become available: the results from the first functional unit in the chain are "forwarded" to the second functional unit.

Control Hazard (Final exam topic: Pipelining and ILP - 2)

Control hazards arise from the pipelining of branches and other instructions that change the PC.

Data Hazard (Final exam topic: Pipelining and ILP - 2)

Data hazards arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.

Pipelined Functional Units (DLP and Vector Architecture - 4)

Deep pipelines are used to process data at a high throughput. The absence of intra-vector dependences leads to simple pipeline control. A new operation begins every cycle with fully pipelined functional units. The control unit detects data and structural hazards. The memory needs to have high bandwidth to support vector loads/stores.

Dynamic Scheduling (Final exam topic: Pipelining and ILP - 8)

Dynamic scheduling is a technique by which the hardware reorders the instruction execution to reduce the stalls while maintaining data flow and exception behavior.


Hardware-Based Speculation (Final exam topic: Pipelining and ILP - 9)

Execute instructions along predicted execution paths but only commit the results if the prediction was correct. Instruction commit: allowing an instruction to update the register/memory when it is no longer speculative, and thus correct. Instructions execute out of order but commit in order

Static Prediction (Final exam topic: Pipelining and ILP - 3)

Four simple compile-time schemes for dealing with pipeline stalls caused by branch delay: 1. The simplest scheme is to freeze or flush the pipeline, holding or deleting any instructions after the branch until the branch destination is known. 2. A higher-performance and slightly more complex scheme is to treat every branch as not taken, simply allowing the hardware to continue as if the branch were not executed. Here, care must be taken not to change the processor state until the branch outcome is known. For RISC-V, the predicted-untaken scheme simply continues fetching sequentially; if the branch is actually taken (wrong prediction), the fetched instructions are converted to no-ops and the fetch restarts at the target address. 3. An alternative scheme is to treat every branch as taken, which follows the same idea as (2). 4. The final scheme, called delayed branch, executes with a branch delay of one: branch instruction -> sequential successor -> branch target if taken.

What does it mean for a branch to be taken or untaken? (Final exam topic: Pipelining and ILP - 3)

If a branch changes the PC to its target address, it is a taken branch. If it falls through, it is not taken, also known as untaken.

Pipeline Speedup (Final exam topic: Pipelining and ILP - 2)

If the cycle time overhead of pipelining is ignored, while assuming that the stages are perfectly balanced, then the cycle times of the two processors can be equal. This leads to: Speedup = CPI_unpipelined / (1 + Pipeline_stall_cycles_per_instruction) If no pipeline stalls occur, performance is improved by a factor equal to the pipeline depth.
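As a hypothetical worked example, a 5-stage pipeline (so CPI_unpipelined = 5) with an average of 0.25 stall cycles per instruction gives Speedup = 5 / (1 + 0.25) = 4, rather than the ideal speedup of 5 that would be reached with no stalls.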

Non-parallel Instructions (Final exam topic: Pipelining and ILP - 2)

If two instructions are dependent, they are not parallel and must be executed in order (though possibly with some partial overlap).

Parallel Instructions (Final exam topic: Pipelining and ILP - 2)

If two instructions are in parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls, assuming the pipeline has sufficient resources.

Conditional Branches (Final exam topic: Pipelining and ILP - 6)

In RISC-V, conditional branches depend on comparing two register values, which is assumed to occur during the EX cycle and uses the ALU for this function. The branch target address also needs to be computed. Since the time to test the branch condition and determine the next PC determines the branch penalty, both possible PCs are computed early; the target-address calculation is done during the ID stage. Since the instruction has not yet been decoded at that point, every instruction ends up calculating a possible target address, even if it is not a branch instruction.

Predicate Registers (DLP and Vector Architecture - 6)

In RV64V, predicate registers hold a mask and essentially provide conditional execution of each element operation in a vector instruction. These registers use a Boolean vector to control the execution of a vector instruction. When the predicate register P0 is set, all following vector instructions operate only on the vector elements whose corresponding entries in the predicate register are 1.

Reservation Stations (Final exam topic: Pipelining and ILP - 8)

In Tomasulo's algorithm, register renaming is provided by reservation stations, which buffer the operands of instructions waiting to issue and are associated with the functional units. A reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from a register. In addition, pending instructions designate the reservation station that will provide their input. Finally, when successive writes to a register overlap in execution, only the last one is actually used to update the register. As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation stations, which provides register renaming. Because there can be more reservation stations than real registers, the technique can even eliminate hazards arising from name dependencies that could not be eliminated by a compiler. Each reservation station holds an instruction that has been issued and is awaiting execution at a functional unit. If the operand values for that instruction have been computed, they are also stored in that entry; otherwise, the reservation station entry keeps the names of the reservation stations that will provide the operand values.

Similarities between SMP and DSM (TLP and Multiprocessing - 4)

In both SMP and DSM architectures, communication among threads occurs through a shared address space, meaning that a memory reference can be made by any processor to any memory location, assuming it has the correct access rights. The term shared memory associated with SMP and DSM refers to the fact that the address space is shared.

Extended Pipelines

It is impractical to require that all RISC-V FP operations complete in 1 clock cycle. Doing so would mean a slower clock and an absurd increase in hardware complexity. FP pipelines instead allow longer latencies for operations, which introduces two important changes from the integer pipeline. More information at the end of the notes from the pipeline hazards slides

Benefits of loop unrolling (Final exam topic: Pipelining and ILP - 7)

Loop unrolling can be used to improve scheduling since it eliminates a branch to allow instructions from different iterations to be scheduled together. The stalls that come from data use can be eliminated by creating additional independent instructions within the loop body. Drawback: Simply replicating instructions requires the use of different registers for each iteration, and this increases the required number of registers.

How is loop unrolling done? (Final exam topic: Pipelining and ILP - 7)

Loop unrolling is normally done early in the compilation process so that redundant computations can be exposed and eliminated by the optimizer. To obtain the final unrolled code, the following decisions must be made: 1. Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for the loop maintenance code. 2. Use different registers to avoid any unnecessary constraints that would be forced by using the same registers for different computations. 3. Eliminate the extra test and branch instruction and adjust the loop termination and iteration code. 4. Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This transformation requires analyzing memory addresses and finding that they do not refer to the same address. 5. Schedule the code and preserve any dependencies needed to yield the same result as the original code.

Reasons for using Memory Banks (DLP and Vector Architecture - No number)

Most vector processors use memory banks, which allow several independent accesses rather than simple memory interleaving, for three reasons: 1. Many vector computers support multiple loads/stores per clock cycle, and the memory bank cycle time is several times larger than the processor cycle time, so multiple banks are required to support simultaneous accesses. 2. Most vector processors support the ability to load or store data words that are not sequential. 3. Most vector computers support multiple processors sharing the same memory system, so each processor will be generating its own separate stream of addresses.

The seven fields of a reservation station (Final exam topic: Pipelining and ILP - 8)

Op - the operation to perform on source operands S1 and S2. Qj, Qk - the reservation stations that will produce the corresponding source operands; a value of zero indicates that the source operand is already available or is unnecessary. Vj, Vk - the values of the source operands. A - used to hold information for the memory address calculation for a load or store; initially the immediate field is stored here, then the effective address after its calculation. Busy - indicates that the reservation station and its accompanying functional unit are occupied. The register file has an additional field, Qi - the number of the reservation station that contains the operation whose result should be stored in this register.
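A hypothetical C layout mirroring these fields (types and widths are illustrative only):

```c
#include <stdbool.h>

/* Sketch of one reservation station entry, mirroring the seven fields
 * above; the concrete types are illustrative, not an actual design. */
typedef struct {
    int    op;        /* operation to perform on S1 and S2                 */
    int    qj, qk;    /* producing reservation stations; 0 = value ready   */
    double vj, vk;    /* source operand values, once available             */
    long   a;         /* immediate, then effective address (loads/stores)  */
    bool   busy;      /* entry and its functional unit are occupied        */
} ReservationStation;

/* Per-register status field: the number of the station whose result
 * will be written to this register; 0 means no pending write. */
typedef struct { int qi; } RegisterStatus;
```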

Data Parallelism (DLP) (DLP and Vector Architecture - 1)

Parallelism that arises from performing the same operations on different pieces of data. Ex) Dot product of two vectors X * Y. Involves operations on large data sets. Contrast with ILP, where parallelism arises from executing different operations in parallel (in a data-driven manner), and with TLP, where parallelism arises from executing different threads in parallel.

Approaches to avoiding hazards (Final exam topic: Pipelining and ILP - 5)

Pipeline registers carry both data and control from one pipeline stage to the next. Any value needed in a later stage must be placed in such a register and copied from one register to the next until it is no longer needed. The process of letting an instruction move from the ID stage into the EX stage of the pipeline is called instruction issue. For RISC-V, all data hazards can be checked during the ID stage. If a data hazard exists, the instruction is stalled before it is issued. Detecting interlocks early in the pipeline reduces hardware complexity because the hardware never has to suspend an instruction that has updated the state of the processor unless the entire processor is stalled.

More about pipelining (Final exam topic: Pipelining and ILP - 1)

Pipelining increases the processor instruction throughput but does not reduce the execution time of an individual instruction. It slightly increases the execution time of each instruction because of the overhead of controlling the pipeline. Since the execution time of each instruction does not decrease, the practical depth of a pipeline is limited. Imbalance among pipe stages also reduces performance because the clock cannot run faster than the time needed for the slowest stage.

Cache Coherence Problem (TLP and Multiprocessing - 6)

Private data are used by a single processor, while shared data are used by multiple processors, essentially providing communication among the processors through reads and writes of the shared data. Unfortunately, caching shared data introduces a new problem. Because the view of memory held by two different processors is through their individual caches, the processors could end up seeing different values for the same memory location, which is known as the cache coherence problem.

How are hazards handled by Tomasulo's algorithm? (Final exam topic: Pipelining and ILP - 8)

RAW hazards are avoided by executing an instruction only when its operands are available. WAW and WAR hazards, which arise from name dependencies, are eliminated by register renaming. This hardware technique eliminates these hazards by renaming all destination registers, including those with a pending read or write for an earlier instruction, so that an out-of-order write does not affect any instruction that depends on an earlier value of an operand.

Snooping Based Protocol (TLP and Multiprocessing - 7)

Rather than keeping the state of sharing in a single directory, every cache that has a copy of the data from a block of physical memory could track the sharing status of the block. In an SMP, the caches are all typically accessible via some broadcast medium, and all cache controllers monitor or snoop the medium to determine whether they have a copy of a block that is requested. One of two write methods can be used: 1. Write invalidate where on a write all other copies of the data are invalidated. 2. Write update where on a write all other copies of the data are updated Each cache block can be in one of three states: 1. Invalid 2. Shared (clean block) 3. Modified (implies exclusivity and dirty)

SIMD Architecture/Processing (DLP and Vector Architecture - 2)

SIMD: single instruction operates on multiple data elements in time and in space. Demands high memory bandwidth, so it is necessary to fetch enough data to keep the processor busy. SIMD architecture exploits DLP and is more energy efficient than MIMD; fetch one instruction for many data operations without sophisticated hardware to extract ILP. This allows the programmer to continue to think sequentially unlike for MIMD.

Gather-scatter (DLP and Vector Architecture - 7)

Since sparse matrices are commonplace, it is important to have techniques to allow programs with sparse matrices to execute in vector mode. The primary mechanism for supporting sparse matrices is gather-scatter operations using index vectors. The goal is to support moving between a compressed representation and normal representation. A gather operation takes an index vector and fetches the vector whose elements are at the addresses given by adding a base address to the offset given in the index vector. The result is a dense vector in a vector register. After these elements are operated on in a dense form, the sparse vector can be stored in an expanded form by a scatter store, using the same index vector.
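A scalar sketch of the two operations in C, with k as the index vector (all names are illustrative); on a vector processor each loop corresponds to a single indexed vector load or store:

```c
/* Gather: fetch the elements of a at the offsets in index vector k
 * into a dense vector; scatter: store them back in expanded form. */
void gather(double *dense, const double *a, const int *k, int n) {
    for (int i = 0; i < n; i++)
        dense[i] = a[k[i]];        /* fetch element at base + offset */
}

void scatter(double *a, const double *dense, const int *k, int n) {
    for (int i = 0; i < n; i++)
        a[k[i]] = dense[i];        /* store result back in sparse form */
}
```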

Structural Hazard (Final exam topic: Pipelining and ILP - 2)

Structural hazards arise from resource conflicts when hardware cannot support all possible combinations of instructions simultaneously in overlapped execution.

Symmetric Multiprocessors (SMP) (TLP and Multiprocessing - 4)

Symmetric multiprocessors feature a small to moderate number of cores, typically 32 or fewer. With such small processor counts, it is possible to share a single centralized memory that all processors have equal access to. Also known as uniform memory access (UMA).

Memory Banks (DLP and Vector Architecture - No number)

The behavior of the load/store vector unit is significantly more complicated than that of the arithmetic functional units. The start-up time for a load is the time to get the first word from memory into a register. If the rest of the vector can be supplied without stalling, then the vector initiation rate is equal to the rate at which new words are fetched or stored. To maintain an initiation rate of one word fetched or stored per clock cycle, the memory system must be capable of producing or accepting that much data. Spreading accesses across multiple banks usually delivers this rate.
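As a rough worked example with hypothetical numbers: if each bank is busy for 6 processor clock cycles per access but the pipeline must initiate one memory access per clock cycle, then at least 6 / 1 = 6 independent banks are needed so that successive element accesses can rotate among the banks without stalling.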

Pipeline speedup with branch penalties (Final exam topic: Pipelining and ILP - 3)

The effective pipeline speedup with branch penalties, assuming ideal CPI is Pipeline_speedup = Pipeline_depth / (1 + Pipeline_stalls_from_branches) Pipeline_stalls_from_branches = Branch_frequency * Branch_penalty As pipelines get deeper and the potential penalty of branches increases, using delayed branches and similar schemes becomes insufficient.

Calculating CPI for a pipelined processor (Final exam topic: Pipelining and ILP - 2)

The value of the CPI for a pipelined processor is the sum of the base CPI and all contributions from stalls: Pipeline_CPI = CPI_ideal + Structural_stalls + Data_hazard_stalls + Control_stalls

Hazards through memory (Final exam topic: Pipelining and ILP - 9)

WAW and WAR hazards through memory are eliminated by in-order commit: a store updates memory at commit, while a load reads memory in execute. RAW hazards through memory can be avoided by maintaining the program order of the effective address computation of a load with respect to all earlier stores and by not allowing a load to read memory if its A (address) field matches the destination field of any active ROB entry for a store.

How to implement a directory based protocol (TLP and Multiprocessing - 7)

There are two primary operations that a directory protocol must implement: handling a read miss and handling a write to a shared, clean block (handling a write miss to a block that is currently shared is a simple combination of the two). To implement these operations, a directory must track the state of each cache block. In a simple protocol, these states can be: 1. Shared - one or more nodes have the block cached, and the value in memory is up to date (as well as in all caches) 2. Uncached - no node has a copy of the cache block 3. Modified - exactly one node has a copy of the cache block, so the memory copy is out of date; that node is called the owner of the block. In addition to tracking the state of each potentially shared memory block, we must track which nodes have copies of that block, because those copies will need to be invalidated on a write. The simplest way to do this is to keep a bit vector for each memory block.
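A minimal sketch of one directory entry in C, assuming a machine with at most 64 nodes (the field names and the vector width are illustrative):

```c
#include <stdint.h>

/* Sketch of one directory entry for a machine with up to 64 nodes. */
enum DirState { UNCACHED, SHARED, MODIFIED };

typedef struct {
    enum DirState state;   /* sharing state of this memory block         */
    uint64_t      sharers; /* bit vector: bit i set => node i has a copy */
} DirectoryEntry;

/* On a write to a shared block, every node whose bit is set in
 * 'sharers' must be sent an invalidate message. */
```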

Ideal Speedup for Multiprocessors (TLP and Multiprocessing - 3)

Threads are independent if they do not communicate. Suppose a program has M independent threads and the latency of one thread on a single processor is L: the latency of the program on a single processor is M * L, and the latency of the program on N processors is ceil(M / N) * L. So the maximum speedup is approximately N.
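With hypothetical numbers, M = 10 independent threads on N = 4 processors gives a single-processor latency of 10 * L, a parallel latency of ceil(10 / 4) * L = 3 * L, and therefore a speedup of 10/3 ≈ 3.3, which approaches N = 4 as M grows.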

What limits loop unrolling? (Final exam topic: Pipelining and ILP - 7)

Three effects limit the gains from loop unrolling: 1. A decrease in the amount of overhead amortized with each unroll 2. Code size limitations 3. Compiler limitations

Challenges for shared multiprocessing architectures (TLP and Multiprocessing - 5)

There are two important hurdles, both explainable by Amdahl's law, that make parallel processing challenging: 1. There is limited parallelism available in programs 2. There is a relatively high cost of communication. Overcoming these hurdles typically requires a comprehensive approach that addresses the choice of algorithm and its implementation, the underlying programming language and system, the operating system, and the architecture and hardware implementation.

What is throughput in terms of a pipeline? (Final exam topic: Pipelining and ILP - 1)

Throughput is how often an instruction exits the pipeline.

Two important properties of reservation stations (Final exam topic: Pipelining and ILP - 8)

1. Hazard detection and execution control are distributed. The information held in the reservation stations at each functional unit determines when an instruction can begin execution at that unit. 2. Results are passed directly to functional units from the reservation stations where they are buffered, rather than going through the registers. This bypassing is done with a common result bus that allows all units waiting for an operand to load it simultaneously. This bus is known as the Common Data Bus (CDB).

The three steps an instruction goes through in Tomasulo's algorithm (Final exam topic: Pipelining and ILP - 8)

1. Issue - Get the next instruction from the head of the instruction queue, which is maintained in FIFO order to ensure the maintenance of correct data flow. If there is a matching reservation station that is empty, issue the instruction to the station with the operand values, if they are currently in the registers. If there is not an empty reservation station, then there is a structural hazard, and the instruction issue stalls until a station or buffer is freed. If the operands are not in the registers, keep track of the functional units that will produce the operands. This step renames registers, eliminating WAR and WAW hazards. 2. Execute - If one or more of the operands is not yet available, monitor the common data bus while waiting for it to be computed. When an operand becomes available, it is placed into any reservation station awaiting it. When all the operands are available, the operation can be executed at the corresponding functional unit. By delaying instruction execution until the operands are available, RAW hazards are avoided. Several instructions can become ready in the same clock cycle for the same functional unit. Although independent functional units can begin execution in the same clock cycle for different instructions, a single unit will have to choose among them. For floating-point reservation stations, this choice can be made arbitrarily; however, loads and stores present an additional complication. Loads and stores require a two-step execution process: the first step computes the effective address when the base register is available, and the second step places the effective address in the load or store buffer. Loads in the load buffer execute as soon as the memory unit is available. Stores in the store buffer wait for the value to be stored before being sent to the memory unit. Loads and stores are maintained in program order through the effective address calculation, which helps prevent hazards through memory. To preserve exception behavior, no instruction is allowed to initiate execution until all branches that precede the instruction in program order have completed. This restriction guarantees that an instruction that causes an exception during execution really would have been executed. In a processor using branch prediction, this means the processor must know that the branch prediction was correct before allowing an instruction after the branch to begin execution. 3. Write result - When the result is available, write it on the CDB and from there into any reservation station waiting for this result. Stores are buffered in the store buffer until both the value to be stored and the store address are available; then the result is written as soon as the memory unit is free.

How Tomasulo's algorithm works with speculation (Final exam topic: Pipelining and ILP - 9)

1. Issue - issue an instruction if there is an empty reservation station and ROB; stall otherwise. Send the operands to the reservation stations if they are available or send an ROB# instead if they are not. 2. Execute - Start after both operands are available. Results are tagged with an ROB#. 3. Write result - Place the result on the CDB and send it to the ROB and/or reservation stations. Any reservation station or ROB entry expecting a result with the tagged ROB# will grab the value when placed on the CDB. 4. Commit - write the result of the instruction at the head of the ROB. If the instruction is not a branch, update register/memory with the value in the ROB. If the head of the ROB is a mispredicted branch, flush the ROB and restart from the branch target; otherwise, commit as normal.

Dynamic Scheduling + Multiple Issue + Speculation (Pipelining and ILP - 10)

1. Issue an instruction in half of a cycle => only supports two instructions/cycle 2. Logic handles any possible dependencies between the instructions 3. Issue logic can become bottleneck

Advantages of Dynamic Scheduling (Final exam topic: Pipelining and ILP - 8)

1. It allows code that was compiled with one pipeline in mind to run efficiently on a different pipeline, eliminating the need to have multiple binaries and recompile for a different microarchitecture. 2. It enables handling some cases when dependencies are unknown at compile time. 3. It allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve. The advantages of dynamic scheduling are gained at the cost of a significant increase in hardware complexity.

Multiple Issue (Pipelining and ILP - 10)

1. Limit the number of instructions of a given class that can be issued in a bundle, and pre-allocate reservation stations and ROB. Ex) one FP, one integer, one load, and one store. 2. Examine all the dependencies among the instructions in the bundle. The issue logic can blow up. 3. If dependencies exist in a bundle, encode them in reservation stations. 4. Multiple completion/commit

Two classes of aggressive schemes to handle branch prediction (Final exam topic: Pipelining and ILP - 3)

1. Low-cost static schemes that rely on information available at compile time 2. Strategies that predict branches dynamically based on program behavior

How can data dependence be overcome? (Final exam topic: Pipelining and ILP - 2)

1. Maintaining the dependence but avoiding the hazard 2. Eliminating the dependence by transforming the code (compiler/hardware)

Thread-Level Parallelism (TLP) (TLP and Multiprocessing - 1)

1. Multiple running threads with multiple program counters 2. Exploited through Multiple Instruction Multiple Data model 3. Targeted for tightly-coupled shared-memory multiprocessors

Limitations of ILP (Pipelining and ILP - 11)

1. Program structure can limit ILP 2. WAW/WAR hazards through memory 3. Memory bandwidth and latency: the pipeline cannot hide the latency of accessing off-chip cache/memory, and the memory wall problem still exists. 4. Impacts of a wide issue width => logic complexity increases, clock rate decreases, and power increases. 5. The sizes of the ROB and reservation stations add significant overhead

Types of nodes in a directory based protocol (TLP and Multiprocessing - 7)

1. The local node is the node where a request originates. 2. The home node is the node where the memory location and the directory entry of an address reside. The physical address space is statically distributed, so the node that contains the memory location and directory for a given physical address is known. 3. A remote node is the node that has a copy of a cache block, whether exclusive or shared. The local node may also be the home node. A remote node may be the same as either the local node or home node.

Types of TLP (TLP and Multiprocessing - 1)

1. Tightly coupled - threads collaborating for a single task 2. Loosely coupled - multiple programs running independently

What are multiprocessors? (TLP and Multiprocessing - 3)

1. Tightly coupled processors that are viewed as a single processor. They are controlled by a single operating system with a shared memory space, and the communication is done in hardware. 2. Clusters are processors connected by a network => communication among the different processors is coordinated by the individual operating systems 3. Supports MIMD execution 4. Includes single multicore chips and systems with multiple chips

Three types of dependences (Final exam topic: Pipelining and ILP - 2)

1. True Data dependencies 2. Name dependencies 3. Control dependencies

Dynamic Branch Prediction (Final exam topic: Pipelining and ILP - 3)

A key way to improve compile-time branch prediction is to use profile information collected from earlier runs. Essentially, static branch prediction uses either a predict-taken or predict-untaken scheme and is not very accurate. The simplest dynamic branch prediction scheme is a branch-prediction buffer or branch history table, which is a small memory indexed by the lower portion of the address of the branch instruction. This memory contains a bit indicating whether or not the branch was recently taken. Using this scheme, it is not known whether the prediction is correct, but it is assumed to be correct and fetching begins in the predicted direction. If the prediction turns out to be wrong, the prediction bit is inverted. This buffer is effectively a cache where every access is a hit, and the performance of the buffer depends on how often the prediction is for the branch of interest and how accurate the prediction is when it matches.

What is a coherent memory system? (TLP and Multiprocessing - 6)

A memory system is coherent if: 1. A read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P. 2. A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses. 3. Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1. The first property simply preserves program order. The second property defines the notion of what it means to have a coherent view of memory. For the third property, suppose processors P1 and P2 both write to the same location; serializing the writes ensures that every processor sees the two writes in the same order.

Two classes of protocols for cache coherence (TLP and Multiprocessing - 7)

A program running on multiple processors will normally have multiple copies of the same data in several caches. In a coherent multiprocessor, the caches provide both migration and replication of shared data items. The key to implementing a cache coherence protocol is tracking the state of any sharing of a data block. The state of any cache block is kept using status bits associated with the block, similar to the valid and dirty bits kept in a uniprocessor cache. There are two classes of protocols: 1. Directory based 2. Snooping

Distributed Shared Memory (DSM) (TLP and Multiprocessing - 4)

An alternative design consisting of multiprocessors with physically distributed memory. To support large processor counts, memory must be distributed rather than centralized; otherwise, the memory system would not be able to support the bandwidth demands of a large number of processors without incurring excessively long access latency. Distributing the memory among the nodes both increases the bandwidth and reduces the latency to local memory. A DSM is also called a NUMA (nonuniform memory access) because the access time depends on the location of a data word in memory.

True Data Dependence (Final exam topic: Pipelining and ILP - 2)

An instruction j is data dependent on instruction i if either of the following holds: 1. Instruction i produces a result that may be used by instruction j 2. Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i The second condition simply states that one instruction is dependent on another if there exists a chain of dependences of the first type between the two instructions

Classes of Share-memory multiprocessors (TLP and Multiprocessing - 4)

Existing shared-memory multiprocessors fall into two classes, depending on the number of processors involved, which in turn dictates the memory organization and interconnect strategy: 1. Symmetric Multiprocessors (SMP) 2. Distributed Shared Memory (DSM)

Name Dependence (Final exam topic: Pipelining and ILP - 2)

Name dependence occurs when two instructions use the same register or memory location, called name, but there is no flow of data between the instructions associated with that name.

Disadvantage of Tomasulo's algorithm (Final exam topic: Pipelining and ILP - 9)

No instruction after a branch is allowed to execute, even though it can be issued. Execution starts only after the preceding branch is resolved. This limits the exploitable ILP.

What is pipelining? (Final exam topic: Pipelining and ILP - 1)

Pipelining is an implementation technique whereby multiple instructions are overlapped in execution; it takes advantage of parallelism that exists among the actions needed to execute an instruction. The time required to move an instruction one step down the pipeline is a processor cycle. All stages must proceed at the same time since they are hooked together, so the length of a processor cycle is determined by the time required by the slowest pipe stage. Under ideal conditions, the speedup from pipelining is equal to the number of pipe stages. Pipelining involves some overhead, so a pipelined processor will not reach this ideal, but it can come close. Pipelining yields a reduction in the average execution time per instruction. It exploits parallelism among instructions in a sequential instruction stream.

Pipeline Hazards (Final exam topic: Pipelining and ILP - 2)

Situations where instruction dependences could be violated because instructions execute too close together or out of the intended order. Hazards in the pipeline can make it necessary to stall the pipeline. Avoiding hazards often requires that some instructions in the pipeline be allowed to proceed while others are delayed. A stall causes the pipeline performance to degrade from the ideal performance.

Three classifications for hazards (Final exam topic: Pipelining and ILP - 4)

Suppose instruction i occurs before instruction j and both instructions use register x; then there are three different types of hazards: 1. Read After Write (RAW) - the most common, this hazard occurs when the read of register x by instruction j happens before the write of register x by instruction i, so instruction j uses the wrong (stale) value of x. Corresponds to a true data dependence. 2. Write After Read (WAR) - this hazard occurs when the write of register x by instruction j happens before the read of register x by instruction i, so instruction i uses the wrong value of x. WAR hazards are impossible in a simple 5-stage pipeline, but they occur when instructions are reordered. Corresponds to an antidependence. 3. Write After Write (WAW) - this hazard occurs when the write of register x by instruction j happens before the write of register x by instruction i, leaving register x with the wrong value going forward. This is impossible in a simple 5-stage pipeline for the same reason as WAR. Corresponds to an output dependence.

What are the two types of name dependence? (Final exam topic: Pipelining and ILP - 2)

Suppose there exists a name dependence between instruction i, which precedes instruction j in program order. 1. Antidependence - occurs when instruction j writes a register or memory location that instruction i reads. The original ordering must be preserved to ensure that i reads the correct value. 2. Output dependence - occurs when instructions i and j write the same register or memory location. The ordering must be preserved to ensure that the value finally written corresponds to instruction j. In both cases no value is transmitted between the instructions. Since they are not true dependences, instructions involved in a name dependence can execute simultaneously or be reordered if the name used in the instructions is changed to avoid the conflict. This renaming is most easily done for register operands, where it is called register renaming, and it can be done statically by a compiler or dynamically by hardware.
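An illustrative sketch in C, using variables in place of registers (the names are hypothetical), showing how renaming the later destinations removes the WAR and WAW name dependences:

```c
/* Instruction i precedes instructions j and j' in program order. */
void with_name_dependences(double a, double b, double c, double *x, double *y) {
    *y = *x + 1.0;   /* i : reads x, writes y                              */
    *x = a + b;      /* j : writes x  -> antidependence (WAR) on x with i   */
    *y = c * 2.0;    /* j': writes y  -> output dependence (WAW) on y with i */
}

/* Renaming the later destinations (x -> x2, y -> y2) removes both name
 * dependences, so the three statements can be reordered or overlapped. */
void after_renaming(double a, double b, double c, double *x, double *y,
                    double *x2, double *y2) {
    *y  = *x + 1.0;  /* i : unchanged                        */
    *x2 = a + b;     /* j : new destination, no WAR with i   */
    *y2 = c * 2.0;   /* j': new destination, no WAW with i   */
}
```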

Lanes (DLP and Vector Architecture - 4)

The peak vector throughput can increase with more lanes. Each lane contains one portion of the vector register file and one execution pipeline from each vector functional unit. For multiple lanes to be advantageous, both the applications and the architectures must support long vectors; otherwise, they will execute quickly enough to run out of instruction bandwidth, requiring ILP techniques to supply enough vector instructions.

Memory addressing with strides (DLP and Vector Architecture - 7)

The position in memory of adjacent elements in a vector may not be sequential. When an array is allocated in memory, it is linearized and must be laid out in either row-major or column-major order. This linearization means that either the elements in the row or the elements in the column are not adjacent in memory. The distance separating elements to be gathered into a single vector register is called the stride. Once a vector is loaded into a vector register, it acts as if it had logically adjacent elements.
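A small C example of where strides arise (the matrix size N is hypothetical): C lays arrays out in row-major order, so reading a column walks memory with a stride of N elements.

```c
#define N 100

/* Walking down one column of a row-major matrix touches memory with a
 * stride of N elements (N * sizeof(double) bytes). A strided vector
 * load gathers such a column into one vector register, after which it
 * behaves like a vector of logically adjacent elements. */
double column_sum(double b[N][N], int j) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += b[i][j];   /* adjacent iterations are N elements apart */
    return sum;
}
```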

Conditional Execution (DLP and Vector Architecture - 5)

The presence of conditionals (IF statements) inside loops and the use of sparse matrices are the two main reasons for lower levels of vectorization. Programs that contain IF statements in loops cannot run in vector mode due to the introduction of control dependencies. The technique used for this issue is called vector-mask control.
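A scalar sketch in C of the kind of loop vector-mask control targets (names are illustrative): the compare produces the mask, and the subtract is applied only where the mask is set.

```c
/* Under vector-mask control, a vector compare of x[i] != 0 sets the
 * mask/predicate register, and the subsequent vector subtract updates
 * only the elements whose mask bit is 1. */
void masked_sub(double *x, const double *y, int n) {
    for (int i = 0; i < n; i++)
        if (x[i] != 0.0)         /* compare -> predicate/mask register  */
            x[i] = x[i] - y[i];  /* executed only where the mask is set */
}
```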

Reorder Buffer (Final exam topic: Pipelining and ILP - 9)

The reorder buffer (ROB) holds the results of instructions between completion and commit. Each entry has fields such as instruction type, destination field, value field, and ready field. Source operands are found in one of two places: the registers, for committed instructions, or the ROB, for instructions that have completed execution but not yet committed. The ROB and store buffer are merged, so values are not written to registers/memory until an instruction commits. On a misprediction, the speculated entries in the ROB are cleared. Exceptions are not recognized until the instruction is ready to commit.

Directory Based Protocol (TLP and Multiprocessing - 7)

The sharing status of a particular block of physical memory is kept in one location, called the directory. There are two types of directory based cache coherence: 1. In an SMP, one centralized directory associated with the memory can be used 2. In a DSM, there are multiple directories to avoid a single point of contention Memory bandwidth and interconnection bandwidth can be increased by distributing the memory to separate local memory traffic from remote memory traffic, reducing the memory bandwidth demands on the memory system and interconnection network.

How is out-of-order execution allowed? (Final exam topic: Pipelining and ILP - 8)

To allow out-of-order execution, the ID pipe stage is split into two: 1. Issue: decode instructions and check for structural hazards 2. Read operands: wait until there are no data hazards and then read operands. In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order issue); however, they can be stalled or can bypass each other in the second stage (read operands) and thus enter execution out of order (out-of-order execution).

Strip Mining (DLP and Vector Architecture - 5)

To tackle the issue of a vector being larger than the MVL, with the length not known at compile time, a technique known as strip mining is traditionally used. Strip mining is the generation of code such that each vector operation is done for a size less than or equal to the MVL. Traditionally, one loop handles any number of iterations that is a multiple of the MVL and another loop handles any remaining iterations, which must be fewer than the MVL. RISC-V has a better solution than simply using two loops for strip mining: the setvl instruction writes the smaller of the MVL and the loop variable n into VL. If the number of remaining elements n is larger than MVL, the most one pass of the loop can compute is MVL elements, so setvl sets VL to MVL. If n is smaller than MVL, the final iteration should compute only the last n elements, so setvl sets VL to n.
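A scalar C sketch of traditional strip mining for DAXPY (the names are illustrative); the inner loop processes one strip of at most MVL elements, which is exactly the length that setvl would place in VL.

```c
#include <stddef.h>

#define MVL 64  /* maximum vector length, as in the cards above */

/* Each pass through the outer loop handles one strip of at most MVL
 * elements; the vl computed here is what setvl automates in hardware. */
void daxpy_stripmined(size_t n, double a, const double *x, double *y) {
    for (size_t low = 0; low < n; low += MVL) {
        size_t vl = (n - low < MVL) ? (n - low) : MVL;  /* VL for this strip */
        for (size_t i = low; i < low + vl; i++)         /* one vector strip  */
            y[i] = a * x[i] + y[i];
    }
}
```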

Tomasulo's Algorithm (Final exam topic: Pipelining and ILP - 8)

Tomasulo's algorithm handles antidependencies and output dependencies by effectively renaming the registers dynamically. It can also be extended to handle speculation, a technique to reduce the effect of control dependencies by predicting the outcome of a branch, executing instructions at the predicted destination address, and taking corrective actions when the predictions are wrong. The goal of Tomasulo's scheme is to track when operands for instructions are available to minimize RAW hazards and introduce register renaming in hardware to minimize WAW and WAR hazards. Two principles: 1. Dynamically determine when an instruction is ready to execute 2. Rename registers to avoid unnecessary hazards

Static Scheduling (Final exam topic: Pipelining and ILP - 7)

Transforming and rearranging code while maintaining all dependencies in a program. Simple compiler technology can be used to enhance a processor's ability to exploit ILP, and the techniques involved are crucial for processors that use static issue or static scheduling. To keep a pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline. To avoid a pipeline stall, the execution of a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.

Two common extensions of Snoopy (TLP and Multiprocessing - 7)

Two common extensions of the snoopy protocol: 1. MESI - adds exclusive (E) to indicate that a cache block is resident in only a single cache but is clean. If a block is in the E state, it can be written without generating any invalidates. When a read miss to a block in the E state occurs, the block must be changed to the S state. 2. MOESI - adds owned (O) to indicate that the associated block is owned by that cache and out of date in memory. A block is changed from modified to owned without being written to memory. Only the original cache holds a block in the O state; it must supply the block on a miss, since memory is not up to date, and must write the block back to memory if it is replaced.

Data Forwarding (Final exam topic: Pipelining and ILP - 6)

When implementing forwarding, it is important to notice that pipeline registers contain both the data to be forwarded as well as the source and destination fields. All forwarding logically happens from the ALU or data memory to the ALU input, the data memory input, or the zero detection unit. Thus, forwarding is implemented by a comparison of the destination registers of the IR contained in the EX/MEM and MEM/WB stages against the source registers of the IR contained in the ID/EX and EX/MEM stages.
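A minimal sketch in C of one such comparison (the field and function names are hypothetical), checking whether the ALU result in EX/MEM should be forwarded to the first ALU input of the instruction in ID/EX:

```c
#include <stdbool.h>

/* Hypothetical slice of two pipeline registers: destination and source
 * register numbers plus a register-write control bit. */
typedef struct { int rd; int rs1; int rs2; bool reg_write; } PipeReg;

/* Forward the EX/MEM ALU result to ALU input A if the instruction in
 * EX/MEM writes the register that the instruction in ID/EX reads as rs1. */
bool forward_a_from_exmem(PipeReg ex_mem, PipeReg id_ex) {
    return ex_mem.reg_write
        && ex_mem.rd != 0          /* register x0 never needs forwarding */
        && ex_mem.rd == id_ex.rs1;
}
```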

Directory based protocol - When a block is in the exclusive state (TLP and Multiprocessing - 7)

When the block is in the exclusive state, the current value of the block is held in a cache on the node identified by the sharers set (the owner), so there are three possible directory requests: Read miss - The owner is sent a data fetch message, which causes the state of the block in the owner's cache to transition to shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting node is added to the sharers set, which still contains the identity of the processor that was the owner (since it still has a readable copy). Data write-back - The owner is replacing the block and therefore must write it back. The write-back makes the memory copy up to date (the home directory essentially becomes the owner), the block is now uncached, and the sharers set is empty. Write miss - The block has a new owner. A message is sent to the old owner, causing the cache to invalidate the block and send the value to the directory, from which it is sent to the requesting node, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block remains exclusive.

Directory based protocol - When a block is in the shared state (TLP and Multiprocessing - 7)

When the block is in the shared state, the memory value is up to date, so the same two requests can occur as in the uncached state: Read miss - The requesting node is sent the requested data from memory, and the requesting node is added to the sharing set. Write miss - The requesting node is sent the value. All nodes in the sharers set are sent an invalidate message, and the sharers set is set to contain only the identity of the requesting node. The state of the block is made exclusive.

Directory based protocol - When a block is in the uncached state (TLP and Multiprocessing - 7)

When the block is in the uncached state, the copy in memory is the current value, so the only possible requests for the block are: Read miss - The requesting node is sent the requested data from memory, and the requester is made the only sharing node. The state of the block is made shared (the requester is added to the sharing set). Write miss - The requesting node is sent the value and becomes the sharing node. The block is made exclusive to indicate that the only valid copy is cached. The sharers set indicates the identity of the owner.

