Advanced Computer Architecture - Exam 3
Flynn's Taxonomy: What is a *multiple instruction stream, multiple data stream (MIMD)* architecture?
Multiple control units (CU) each feed their own instruction stream (IS) into a distinct processor from a set of multiple processors. Each processor operates on its own data stream, drawn either from (1) a pool of memory that is shared among the processors in the set or (2) a local memory unit (LMU), with all LMUs connected via an interconnection network to form a distributed memory system among the processors in the set. SMPs, clusters, and NUMA systems fit this category.
What taxonomy is an SMP?
Multiple instruction stream, multiple data stream (MIMD)
How many procedure activations can an *N-window register file* hold?
N - 1 activations
In contemporary multiprocessor systems, it is customary to have one or two levels of cache associated with each processor. What problem did this introduce?
*Cache coherency*: multiple copies of the same data can exist in different caches simultaneously, and if processors are allowed to update their own copies freely, an inconsistent view of memory can result
RISC: what is the *delayed branch* technique? How does it work?
*Delayed Branch* is a technique for mitigating the damage of control hazards in RISC pipelining. In this technique, unconditional branches, calls, and returns do not take effect until after the execution of the following instruction (hence the term delayed). The compiler achieves this by swapping the unconditional branch / call / return with the independent instruction immediately before it, so that the instruction occupying the slot after the branch is useful work. The pipeline therefore never has to flush the instruction fetched after the branch once the branch is taken.
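To make the reordering concrete, below is a minimal Python sketch of how a compiler might fill the delay slot; the instruction names and the list-of-strings representation are illustrative assumptions, not any real instruction set.

```python
# A minimal sketch of delay-slot filling, assuming instructions are plain strings
# and the instruction just before the branch is independent of the branch itself.
def fill_delay_slot(instructions, branch_index):
    """Swap the unconditional branch with the instruction immediately before it,
    so the instruction that ends up after the branch (the delay slot) is useful
    work instead of one the pipeline would have to flush."""
    reordered = list(instructions)
    reordered[branch_index - 1], reordered[branch_index] = (
        reordered[branch_index],
        reordered[branch_index - 1],
    )
    return reordered

# Hypothetical three-instruction sequence: ADD now executes in JUMP's delay slot.
print(fill_delay_slot(["LOAD", "ADD", "JUMP"], 2))  # ['LOAD', 'JUMP', 'ADD']
```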
When does a *control hazard* occur?
Occurs when the pipeline makes the wrong decision on a *branch prediction* and therefore brings instructions into the pipeline that must subsequently be discarded
What is a *data hazard* in pipelining?
Occurs when there is a conflict in the access of an operand location, such that the result would differ from what strict sequential execution would produce. The pipeline must stall to ensure the result remains consistent
What is a *resource hazard* in pipelining?
Occurs when two or more instructions that are already in the pipeline need access to the same resource (main memory, ALU, etc.)
What is the simplest, most naive software-based approach to cache coherency?
Prevent all shared data variables from being cached. This is too conservative because shared data may be exclusively used during some periods and may be effectively read-only during other periods. Better solutions account for the actual usage of the data variable during the execution of the program
What does *RISC* stand for?
Reduced Instruction Set Computer
How do *register windows* allow for parameter passing?
Register windows for adjacent procedures are overlapped
In a *directory protocol*, before the centralized controller grants write access to a particular processor regarding a particular cache line, what does it do?
The centralized controller sends a message to all processors with a cached copy of this line, forcing each processor to invalidate its copy
In a *directory protocol*, what must a processor do before it can write to a local copy of a line?
The processor must request exclusive access to the line from the centralized controller
When using a *register file*, how do register references by machine instructions determine the actual physical register to read from?
The register references are offset by the CWP to determine the actual physical register
Since there are only a finite number of *register windows*, how does the *register window* technique handle an infinite number of procedure calls?
The register windows are used to hold the few most recent procedure activations, while older activations must be saved in memory and later restored when the nesting depth decreases
What is the primary difference between the *static* branch prediction methods and the *dynamic* ones?
The static branch prediction methods *do not* factor in the history of conditional branching in the life of the program, whereas the dynamic branch prediction methods *do*.
What is the *cycle time* of an instruction pipeline?
The time needed to advance a set of instructions *one stage* through the pipeline
What is the equation to calculate the *cycle time* of an instruction pipeline?
t = max[t_i] + d, where max[t_i] is the cycle time of the slowest stage of the pipeline and d is the time delay of the latch, needed to advance signals from one stage to the next
What is a *register window*?
A small subset of the total number of registers that is assigned to a particular procedure. Whenever another procedure is called, the processor automatically switches to use a different register window. At any given time, only one window of registers is visible and is addressable as if it were the only set of registers.
What is the *MESI protocol*?
A synonym for the *write-invalidate* approach to the *snoopy protocol*, where the state of every cache line is marked as *m*odified, *e*xclusive, *s*hared, or *i*nvalid by the use of two bits in each line's tag
What is *pipelining*?
A technique for implementing instruction-level parallelism in a single processor by dividing the instruction cycle into stages and overlapping the execution of successive instructions, so that different instructions occupy different stages at the same time *Figure 14.10*
How is a *register file* organized?
As a circular buffer of overlapping register windows
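A minimal Python sketch of how a register reference plus the CWP could be mapped to a physical register in such a circular buffer; the window size, overlap, and window count below are illustrative assumptions, not values from a particular processor.

```python
# Illustrative register-file geometry (assumed values, not from a real machine).
WINDOWS = 8          # number of register windows
WINDOW_SIZE = 24     # registers visible to one procedure (parameter + local + temporary)
OVERLAP = 8          # registers shared with the adjacent window for parameter passing
PHYSICAL_REGS = WINDOWS * (WINDOW_SIZE - OVERLAP)

def physical_register(cwp, logical_reg):
    """Offset the logical register number by the current-window pointer (CWP),
    wrapping around because the register file is a circular buffer."""
    return (cwp * (WINDOW_SIZE - OVERLAP) + logical_reg) % PHYSICAL_REGS

print(physical_register(cwp=0, logical_reg=20))  # 20 (window 0's temporary area)
print(physical_register(cwp=1, logical_reg=4))   # 20 (window 1's parameter area: same physical register)
```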
Which multiprocessor organization are *snoopy protocols* best suited for?
Bus-based multiprocessor organization
Hardware-based solutions to cache coherency are generally referred to as what?
Cache protocols
Software approaches to cache coherency transfer the detection of problems from *run time* to what?
Compile time
Since software approaches to cache coherency transfer the detection of problems to compile time, what problem can arise?
Compile-time software approaches generally make conservative decisions, leading to inefficient cache utilization
What does *CISC* stand for?
Complex Instruction Set Computer
What is the primary impediment to the effective use of an instruction pipeline?
Conditional branches
An instruction pipeline has three stages: S1, S2, and S3. The cycle times of the stages are as follows: t1 = 20, t2 = 40, t3 = 10. The time delay of the latch (d) is equal to 2. *What is the cycle time of the entire instruction pipeline?*
max[t_i] + d = max([20, 40, 10]) + 2 = 40 + 2 = 42
What are the two primary components of a *directory protocol*?
1. A centralized controller that is a part of the main memory controller for managing global state of the various local caches 2. A directory containing the global state of the various local caches, which resides in main memory
What are the three key elements of a *RISC architecture*?
1. A large number of general purpose registers, and/or the use of compiler technology to optimize register usage 2. A limited and simple instruction set 3. An emphasis on optimizing the instruction pipeline
What are the primary two drawbacks of a *directory protocol*?
1. Central bottleneck 2. Communication overhead
What are the two basic approaches to maximizing register usage?
1. Compiler-based register optimization (software) 2. Register File: use more registers so that more variables can be held in registers for longer periods of time (hardware)
What two categories can hardware-based cache coherency solutions fit into?
1. Directory protocols (centralized) 2. Snoopy protocols (decentralized)
What is the general approach of *compiler-based register optimization*?
1. Each program quantity that is a candidate for residing in a register is assigned to a symbolic/virtual register 2. The compiler maps the unlimited number of symbolic registers into a fixed number of real registers. 3. Symbolic registers whose usage does not overlap can share the same real register 4. If, in a particular portion of the program, there are more quantities to deal with than real registers, then some of the quantities are assigned to memory locations
What are the 7 stages of the *instruction cycle*?
1. Fetch instruction 2. Decode instruction 3. Calculate operands 4. Fetch operands 5. Execute instruction 6. Write operand 7. (Optional) Handle interrupt
What is the primary objective of any cache coherency protocol?
1. Let recently used local variables get into the appropriate cache and stay there through numerous reads and writes 2. Use the protocol to maintain consistency of shared variables that might be in multiple caches at the same time
What are the four characteristics of a *RISC architecture*?
1. One instruction per cycle 2. Register-to-register operations 3. Simple addressing modes 4. Simple instruction formats
*Register windows* are divided into what *three* areas?
1. Parameter registers: used for parameters 2. Local registers: used for local variables 3. Temporary registers: used to exchange parameters and results with the next lower level (procedure called by current procedure - physically overlaps with Parameter registers of next lower level)
What are the three types of *pipeline hazards*?
1. Resource hazard 2. Data hazard 3. Control hazard
What are the five key design issues of a *multiprocessor system*?
1. Simultaneous concurrent processes 2. Scheduling 3. Synchronization 4. Memory management 5. Reliability and fault tolerance
What are the *four* classifications defined in *Flynn's Taxonomy*?
1. Single Instruction Stream, Single Data Stream 2. Single Instruction Stream, Multiple Data Stream 3. Multiple Instruction Stream, Single Data Stream 4. Multiple Instruction Stream, Multiple Data Stream
In a *directory protocol*, what happens when a processor attempts to read a cache line that is exclusively granted to another processor?
1. The attempting processor will send a *miss notification* to the centralized controller 2. The centralized controller issues a command to the processor holding exclusivity of that cache line to release it to main memory, where it can then be read by all processors
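A toy Python sketch of this read-miss sequence; the class and method names (DirectoryController, handle_read_miss) are hypothetical, not taken from any real memory controller.

```python
# A minimal sketch of the directory-protocol read-miss case described above.
class DirectoryController:
    def __init__(self):
        self.exclusive_owner = {}   # cache line -> processor currently holding write access

    def handle_read_miss(self, line, requester, write_back):
        owner = self.exclusive_owner.get(line)
        if owner is not None:
            write_back(owner, line)   # force the owner to release the line to main memory
            del self.exclusive_owner[line]
        return f"{requester} may now read {line} from main memory"

controller = DirectoryController()
controller.exclusive_owner["L1"] = "P0"
print(controller.handle_read_miss("L1", "P1", lambda owner, line: None))
```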
What approach do most effective software-based approaches to cache coherency take?
1. The compiler analyzes the program code to determine safe periods for shared variables 2. The compiler inserts instructions into the generated code to enforce cache coherency during the critical periods
In general, how do compiler-based software approaches to cache coherency work?
1. The compiler performs an analysis on the code to determine which data items may become unsafe for caching and marks them accordingly 2. The operating system or hardware prevent those items from being cached
What are the two principal motivations for *CISC*?
1. The desire to simplify compilers 2. The desire to improve performance
Flynn's Taxonomy: What is a *single instruction stream, multiple data stream (SIMD)* architecture?
A single control unit (CU) feeds a single instruction stream (IS) into multiple processors, each of which is operating on a single distinct data stream (DS) from a single distinct local memory unit (LMU). Vector and array processors fall into this category
RISC: What are the two standard methods for dealing with *global variables*?
1. Variables declared as global are assigned memory locations by the compiler, and all machine instructions that reference these variables will use memory-reference (not register) operations (inefficient for frequently used variables) 2. Incorporate a set of global registers in the processor which is fixed in number and available to all procedures (increased hardware burden to accommodate the split in register addressing)
Parallel Processing: What is an SMP?
A *symmetric multiprocessor (SMP)* is a *standalone computer* with the following five characteristics: 1. Two or more processors of comparable capacity 2. All processors share the same main memory and I/O facilities 3. All processors share access to I/O devices 4. All processors can perform the same functions 5. System controlled by an integrated operating system
Dealing with Conditional Branches: what is the *branch prediction / branch history table* method?
A branch history table is a small cache memory associated with the instruction fetch stage of the pipeline. Each entry in the table consists of three elements: (1) the address of a branch instruction, (2) some number of history bits that record the state of use of that instruction, and (3) information about the target instruction (generally the address of the target instruction or even the actual target instruction).
What is *Flynn's Taxonomy*?
A classification of computer architectures developed by Michael Flynn in 1966, based on the number of concurrent instruction streams and data streams
What is a *snoopy protocol*?
A hardware-based solution to the cache coherency problem where: 1. All caches in the system must recognize what lines they share with other caches in the system 2. When an update action is performed on a shared cache line, it must be announced to all other caches by a broadcast mechanism 3. All caches are able to "snoop" on the network to observe these broadcasted notifications and react accordingly
Dealing with Conditional Branches: what is the *loop buffer* method?
A loop buffer is a small, very high-speed memory maintained by the instruction fetch stage of the pipeline and containing the n most recently fetched instructions, in sequence. If a branch is to be taken, the hardware first checks whether the branch target is within the buffer. If so, the next instruction is fetched from the buffer
Flynn's Taxonomy: What is a *multiple instruction stream, single data stream (MISD)* architecture?
A sequence of data is transmitted to a set of processors, each of which executes a different instruction sequence. Not commercially implemented
In *compiler-based register optimization*, which technique is most often used to map symbolic registers to real registers?
Graph Coloring 1. The program is analyzed to build a register interference graph, where the nodes of the graph are the symbolic registers 2. If two symbolic registers are "live" during the same program fragment, they are joined by an edge to depict interference 3. An attempt is then made to color the graph with n colors, where n is the number of registers. Adjacent nodes cannot have the same color 4. Nodes that share the same color can be assigned to the same register. If this process does not fully succeed, then those nodes that cannot be colored must be placed in memory
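A greedy Python sketch of the idea; real allocators choose the coloring order more carefully, and the interference graph below is a made-up example.

```python
# A minimal greedy sketch of graph-coloring register allocation, assuming the
# interference graph is given as an adjacency dict of symbolic registers.
def color_registers(interference, n_real_registers):
    """Assign each symbolic register a color (real register) different from its
    neighbors; symbolic registers that cannot be colored spill to memory."""
    assignment, spilled = {}, []
    for sym in interference:
        taken = {assignment[n] for n in interference[sym] if n in assignment}
        free = [c for c in range(n_real_registers) if c not in taken]
        if free:
            assignment[sym] = free[0]
        else:
            spilled.append(sym)
    return assignment, spilled

graph = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": set()}
print(color_registers(graph, 2))  # ({'a': 0, 'b': 1, 'c': 1, 'd': 0}, [])
```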
What is the *write-invalidate* approach to the *snoopy protocol*?
In the write-invalidate approach, there can be *multiple readers* but only *one writer* at a time, as so: 1. Initially, a line may be shared among several caches for reading purposes 2. When one of the caches wants to perform a write to the line, it first issues a notice that invalidates that line in the other caches, making the line exclusive to the writing cache 3. Once the line is exclusive, the owning processor can make cheap local writes until some other processor requires the same line
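A toy Python sketch of the multiple-readers / single-writer behavior; the Cache class and the state strings are illustrative assumptions, not the full MESI state machine.

```python
# A minimal sketch of write-invalidate: a write forces every other copy invalid.
class Cache:
    def __init__(self, name):
        self.name, self.lines = name, {}

    def read(self, line):
        self.lines[line] = "shared"

    def write(self, line, all_caches):
        # Broadcast an invalidate so this cache becomes the single writer.
        for other in all_caches:
            if other is not self and line in other.lines:
                other.lines[line] = "invalid"
        self.lines[line] = "exclusive"

c1, c2 = Cache("c1"), Cache("c2")
caches = [c1, c2]
c1.read("x"); c2.read("x")            # multiple readers share the line
c1.write("x", caches)                 # c1's write invalidates c2's copy
print(c1.lines["x"], c2.lines["x"])   # exclusive invalid
```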
What is the *write-update* approach to the *snoopy protocol*?
In the write-update approach, there can be *multiple writers* as well as *multiple readers*, as so: 1. When a processor wishes to update a shared line, the word to be updated is distributed to all others, and caches containing that line can update it
Dealing with Conditional Branches: what is the *branch prediction / taken / not taken switch* method?
In this technique, one or more bits are associated with each conditional branch to reflect the recent branch history of that instruction. These bits are typically kept in high-speed storage (not main memory). With a single bit, all that can be recorded is whether or not this branch was taken last time we came to it. This shortcoming is notable in loop instructions, where it will always fail in two cases: when entering the loop and when exiting. With two bits, more creativity can be applied. A popular algorithm in using two bits has the processor predicting that branches are taken until two in a row are not taken, and predicting that branches are not taken until two in a row are taken. *Figure 14.19*
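One common way to realize the two-bit scheme is a saturating counter, sketched below in Python; treating the two bits as a counter is an implementation assumption, since the card only describes the prediction behavior.

```python
# A minimal two-bit saturating-counter predictor: the prediction flips only
# after two consecutive outcomes in the opposite direction.
class TwoBitPredictor:
    def __init__(self):
        self.counter = 3          # 0-1 predict "not taken", 2-3 predict "taken"

    def predict(self):
        return self.counter >= 2  # True means "predict taken"

    def update(self, taken):
        self.counter = min(3, self.counter + 1) if taken else max(0, self.counter - 1)

p = TwoBitPredictor()
for outcome in [True, True, False, True]:   # e.g., a loop branch that falls through once
    print("predicted taken:", p.predict(), "actual:", outcome)
    p.update(outcome)
```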
Dealing with Conditional Branches: what is the *branch prediction / predict by opcode* method?
In this technique, predict whether or not to take the branch based on the instruction's *opcode*. A study from the textbook found success rates of greater than 75% with this strategy. This is a static method and easy to implement.
Dealing with Conditional Branches: what is the *prefetch branch target* method?
In this technique, when a conditional branch is recognized, the target of the branch is prefetched in addition to the instruction following the branch (i.e., the next instruction on both possible paths). The target is then saved until the branch is executed. If the branch is taken, the target has already been fetched
Dealing with Conditional Branches: what is the *branch prediction / predict always taken* method?
In this technique, whenever a branch is recognized, *always take it*. In a paged machine, prefetching the branch target is more likely to cause a page fault than prefetching from the sequential path. Thus, this method can incur some performance penalties associated with page faults. This is a static method and easy to implement.
Dealing with Conditional Branches: what is the *branch prediction / predict never taken* method?
In this technique, whenever a branch is recognized, *never take it*. This is the most popular branch prediction method because it mitigates the frequent page faults that are associated with the *predict always taken* method. It's static and simple to implement.
What is the equation for *the speedup of a processor with no pipelining versus a processor with pipelining?*
S_k = (nk) / (k + n - 1), where k = the number of stages in the pipeline and n = the number of instructions to execute
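A quick numeric check of the formula with illustrative values for k and n:

```python
# S_k = (n * k) / (k + n - 1): speedup of a k-stage pipeline over no pipelining.
def pipeline_speedup(k, n):
    return (n * k) / (k + n - 1)

print(pipeline_speedup(k=6, n=100))  # ~5.71; approaches k as n grows large
```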
What is the primary drawback of using the *time-shared bus* organization in SMP?
Since all memory references pass through the common bus, overall performance takes a hit. As a result, the bus cycle time limits the speed of the system. This can be reduced by using caches in each processor (which then introduces the problem of cache coherency)
Flynn's Taxonomy: What is a *single instruction stream, single data stream (SISD)* architecture?
Single processor executes a single instruction stream to operate on data stored in a single memory (no parallel processing). A single control unit (CU) feeds a single instruction stream (IS) into a single processor, which operates on a single data stream (DS) from a single memory unit (MU)
What is the equation for *the total amount of time required for a pipeline of k stages to execute n instructions?*
T_k,n = [k + (n - 1)]t, where k = the number of stages in the pipeline, n = the number of instructions to execute, and t = the cycle time of the instruction pipeline
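Reusing the cycle time computed in the three-stage example earlier (t = 42) with an illustrative instruction count:

```python
# T_k,n = [k + (n - 1)] * t: total time for a k-stage pipeline to run n instructions.
def pipeline_total_time(k, n, t):
    return (k + (n - 1)) * t

print(pipeline_total_time(k=3, n=10, t=42))  # (3 + 9) * 42 = 504
```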
When using a *register file*, what is the *CWP*?
The *current-window pointer (CWP)* points to the register window of the currently active procedure
When using a *register file*, what is the *SWP*?
The *saved-window pointer (SWP)* identifies the window most recently saved in memory
According to the instruction pipeline speedup equation, increasing the number of stages in the pipeline can increase the speedup. Why is this not true in reality?
The benefits of adding pipeline stages are countered by increases in: 1. Cost 2. Delays between stages 3. The fact that branches will be encountered, requiring the flushing of the pipeline
Dealing with Conditional Branches: what is the *multiple streams* method?
This technique replicates the initial portions of the pipeline so that instructions from both potential paths can be fetched. Upon completion of the branch instruction's execution, one of the two streams is selected based on the result and the other is discarded
RISC: Register File vs. Cache. Register files tend to save time over caches by retaining all local scalar variables. Caches, on the other hand, make more efficient use of space by dynamically adjusting to variable usage, whereas register files statically retain all local scalar variables regardless of their usage. What characteristic puts *register files* ahead of *caches*?
To reference a local scalar variable in a window-based register file, a simple calculation is performed to address the physical register. It's very quick. To reference a memory location in cache, a full-width memory address must be generated, the complexity of which depends on the addressing mode (associative, set-associative, or direct), all of which significantly slow down the reference.
What is the most common method of organizing an SMP system?
Using a single *time-shared bus* consisting of control, address, and data lines that is shared by all of the processors to access a single main memory and I/O subsystem.
What is a *pipeline hazard*?
When the pipeline, or some portion of the pipeline, must stall because conditions do not permit continued execution
Which of the two *snoopy protocol* approaches is most widely used in commercial multiprocessor systems?
Write-invalidate