CS 413 Final
Microarchitectures
*abstraction level of microprograms and microinstructions *the control unit at this level is described in terms of ALU circuits, registers, and buses. Register-Register Microarchitecture: *RISC-centric model *more complex datapath: generic ALU - sign extender - shifter - register file. The register file is a collection of registers that CPU vendors could purchase as a well-defined logic block, with s1, s2, and d1 ports that make the data paths easy to follow.
MIPS16
16 bit instructions -only 8 registers are directly visible -fewer operations -short literals
ARM Features: Registers
16 general-purpose registers, in practice more like 13 (or 12), r0 - r15. r15: PC (writing to the PC causes a branch). r14: subroutine return address, the Link Register (LR). r13: stack pointer (not hardwired, a convention). r11: frame pointer (also a convention). RISC-like because ARM only provides circuits for the most common operations.
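Because r15 is the PC, an ordinary data move transfers control. A minimal sketch of the common classic-ARM return idiom:
        MOV   pc, lr          ; copying the link register into r15 branches back to the caller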
I/O Notices: Programmed
A typical memory-mapped peripheral has a flag bit that is set by the peripheral when it is ready to take part in a data transfer. In programmed I/O, the computer interrogates the peripheral's status register and proceeds when the peripheral is ready. The processor issues an I/O command, on behalf of a process, to an I/O module; that process then busy-waits for the operation to complete before proceeding.
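A minimal polling sketch in ARM-style assembly, assuming a hypothetical peripheral whose memory-mapped status and data registers sit at invented addresses and whose ready flag is bit 0 (these details are illustrative, not a real device):
STATUS  EQU   0x10000000      ; hypothetical address of the peripheral's status register
DATA    EQU   0x10000004      ; hypothetical address of its data register
        LDR   r1, =STATUS     ; point r1 at the status register
poll    LDR   r0, [r1]        ; read the status register
        TST   r0, #1          ; test the ready flag (assumed to be bit 0)
        BEQ   poll            ; not ready yet: busy-wait by looping
        LDR   r2, =DATA
        LDR   r3, [r2]        ; ready: perform the data transfer
The processor does no useful work while it spins in the poll loop, which is exactly the inefficiency that interrupt-driven I/O and DMA address.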
Have RISC ISAs
ARM MIPS
Benchmark Types
Kernel benchmark - a part (or component) of a real program. Toy benchmark - a novelty program or measure, e.g., Quake frames per second (f.p.s.). Synthetic benchmark - an artificial program. Benchmark suite - a set of programs.
Memory Hierarchy: Capacity and Delay of: CACHE
L1: 128-256 KB, 0-4 clock cycles (ccs). L2: 256 KB - 16 MB, 0-10 ccs. L3: 0-24 MB, 2-20 ccs.
Procedures: The Call Stack
LIFO; local variables and function parameters are stored here. Our interest: the stack as the storage space that "holds" register values during an invoked function's execution. The "full" stack convention: the stack pointer points at the last item placed on the stack (as opposed to an "empty" stack, where it points at the next free location).
Pipeline Latches
Latches between stages • must latch both data and control • a latch holds results between two adjacent instructions in the pipe
Memory Hierarchy
Registers (on die) --> Cache --> Main Memory --> Secondary Storage --> Tertiary Storage (Offline Storage)
Cache
Stores frequently accessed items.
I/O Notices: DMA
The most sophisticated means of dealing with I/O uses direct memory access (DMA), in which data is transferred between a peripheral and memory without the active intervention of a processor. In effect, a dedicated processor performs the I/O transaction by taking control of the system buses and using them to move data directly between a peripheral and the memory. DMA offers a very efficient means of data transfer because the DMA logic is dedicated to I/O processing and a large quantity of data can be transferred in a burst (for example, 128 bytes of input).
Memory Management Unit
Virtual Memory Concept - gives a program access to data held in secondary storage and lets several programs run at once. Loader - places a program in memory; it can put it anywhere in main memory, not just at the first location. As a program runs, its location in memory can change.
Throughput
a measure of the amount of work a system performs per unit time.
Parallelism
a set of processors that are able to work cooperatively to solve a computational problem. This definition is broad enough to include parallel supercomputers that have hundreds or thousands of processors, networks of workstations, multiple-processor workstations, and embedded systems.
Misses
an access to an item not currently in the cache. On a miss, the access falls back to the baseline/default (main-memory) behavior, and the cache controller must load the missed item into the cache.
Caller-Saves Scheme
the calling function has a location (typically in its own stack frame) at which it saves its registers and variables; the caller has the responsibility of saving.
Classes of Misses
capacity miss - a miss caused by the limited size of the cache; occurs when all cache lines are in use. compulsory miss - the first access to a block always misses; even if the cache is full at that first access, it is still counted as a compulsory miss. conflict miss - a miss that occurs because the missed item maps to the same place as another item.
Latency
the delay between activating a process (for example, a memory write or a disk read, or a bus transaction) and the start of the operation; that is, latency is the waiting time.
Data Dependencies
*Precursor to Data Hazard 1. True Data Dependency (aka data dependency) - an instruction that consumes a data value produced by a prior instruction is an instance of a true data dependency (precursor to a RAW hazard). 2. Output Dependency - an instance of a false (name) dependency: an instruction that produces an item that a prior instruction also produced (precursor to a WAW hazard). example: MUL r0, r1, r2 SUB r0, r3, r4 3. Anti-Dependency - also a false (name) dependency: an instruction that produces an item that a prior instruction consumed (precursor to a WAR hazard). example: EOR r0, r1, r2 LDR r2, [r3]
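For completeness, a sketch of case 1 in the same style as the examples above (registers chosen arbitrarily):
ADD r0, r1, r2   ; produces r0
SUB r3, r0, r4   ; consumes the r0 just produced (true data dependency, precursor to a RAW hazard)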
"The Good" and "The Bad" Performance Measures
*clock cycle - the bad: too many factors involved; does not always allow an apples-to-apples comparison. *MIPS - the bad: "instruction" is under-defined (see the MIPS metric entry). *MFLOPS - the bad: "operation" is under-defined (see the MFLOPS metric entry). *Benchmarks - a program (or set of programs) whose execution properties are used to assess computer performance. The good: consumer desire - a benchmark can match the computational profile of your needs; common basis - execution time. The bad: benchmarks can become gamed or obsolete over time.
ARM Features: Instructions
- Pseudo-Instructions - an instruction that is available to the programmer but which is not part of the processor's ISA. A pseudo-instruction is a form of shorthand that allows a programmer to express an action simply and then let the assembler generate the appropriate code. - Control Flow - ARM supports while, for, and do-while loops by using branches to change control flow (see the loop sketch below). - Functions - subroutines are called with BL, which places the return address in the link register r14. - Shifts - no dedicated shift instructions in ARM; instead, many instructions can shift an operand as part of another operation. - Conditional Execution - a condition is attached as a suffix to an instruction, and the instruction only takes effect if the condition is satisfied. Conditions: LT, LE, GT, GE, EQ, NE, HI, LO, CC, CS. Ex. CMP r1, #0 RSBLT r1, r1, #0 ; reverse subtract only if the less-than condition is met.
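A minimal control-flow sketch, assuming the loop "while (r0 != 0) { r1 = r1 + r0; r0 = r0 - 1; }" (register roles chosen arbitrarily):
loop    CMP   r0, #0          ; test the loop condition (while r0 != 0)
        BEQ   done            ; condition false: leave the loop
        ADD   r1, r1, r0      ; loop body: r1 = r1 + r0
        SUB   r0, r0, #1      ;            r0 = r0 - 1
        B     loop            ; branch back and re-test
done                          ; execution continues here after the loop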
MIPS ISA
--Early RISC architecture. Registers: 32 32-bit registers (MIPS-32); r0 is always 0. More registers -> the advantage is fewer memory transfers. No link register. Instructions: ALL are 32 bits in length. - Basic operations. - No conditional execution. - No CCR (status register). - Has shift instructions. - Instruction encoding formats: R: register-register; J: jump (transfer-of-control commands); I: immediate - short literal. Addressing modes: direct/absolute addressing.
Memory Hierarchy: Capacity and Delay of: MAIN MEMORY
1-128GB, 20-200ccs
Memory Hierarchy: Capacity and Delay of: REGISTERS
1 to 64+ bits wide. No delay.
Performance Measure Goals
1. Easy to Measure 2. Repeatable: provides confidence, more than one person can do an experiment. 3. Reliable: valid for comparison 4. Linear: value is linearly correlated with actual performance, not a perfect line. 5. Consistency (universality) : applicable to many systems. 6. Understandable
Hazard Resolutions
1. Stall Pipe: the instruction that would generate the hazard is stalled, and instructions behind it are stalled as well until the hazard clears ("pipeline bubble"). Cost of stalling: overhead from the delayed operations. 2. Forwarding: also known as short-circuiting or bypassing; involves supplying operands directly from one part of the pipe to another. (Flushing the pipe is a separate mechanism, used when control hazards invalidate instructions already fetched.)
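A sketch of the classic load-use case (registers chosen arbitrarily): forwarding removes most RAW stalls, but in a simple pipeline a load's value is not ready until the memory stage, so one bubble typically remains:
LDR r0, [r1]      ; r0 is not produced until the memory-access stage
ADD r2, r0, r3    ; needs r0 at execute: forwarding supplies it, but one stall (bubble) is still typical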
Fetch-Decode-Execute
1.fetches a program instruction from its memory, 2.determines what the instruction wants to do, and 3.carries out those actions.
Memory Hierarchy: Capacity and Delay of: SECONDARY STORAGE
25GB, 20ms
Memory Hierarchy: Capacity and Delay of: TERTIARY STORAGE
3TB, 1hour
I/O Notices: Interrupt-Driven
A more efficient I/O strategy uses an interrupt-handling mechanism to deal with I/O transactions as they occur. The processor carries out other work until a peripheral requests attention; when the peripheral is ready, it interrupts the processor, the transaction is carried out, and the processor then returns to its pre-interrupt state. In a typical interrupt-driven system, each peripheral interface component is capable of requesting the processor's attention. Most peripherals have an active-low interrupt request (IRQ) output; the IRQ line runs from peripheral to peripheral and is connected to the processor's IRQ input. Active-low means that a low voltage indicates the interrupt-request state. The electrically low state is used as the active state entirely because of the behavior of transistors; it is an engineering consideration dating back to the era of the open-collector circuit, which could only pull a line down to zero.
Multi-threading
A technique in which a process, executing an application, is divided into threads that can run concurrently. • Thread: a dispatchable unit of work. It includes a processor context (which includes the program counter and stack pointer) and its own data area for a stack (to enable subroutine branching). A thread executes sequentially and is interruptible so that the processor can turn to another thread. • Process: a collection of one or more threads and associated system resources (such as memory containing both code and data, open files, and devices). This corresponds closely to the concept of a program in execution. By breaking a single application into multiple threads, the programmer gains great control over the modularity of the application and the timing of application-related events. You can't eliminate latency, but you can sometimes hide it: instead of waiting for data to be loaded, the processor can exploit its computing capacity by doing something else. This is the idea behind latency hiding.
ARM Features: Addressing
ADD r0,r1,#Q [r0] <-- [r1] + Q --->Literal: Add the integer Q to contents of register r1 LDR r0,Mem [r0] <-- [Mem] Absolute: Load contents of memory location Mem into register r0. This addressing mode is not supported by ARM but is supported by all CISC processors LDR r0,[r1] [r0] <-- [[r1]] Register Indirect: Load r0 with the contents of the memory location pointed at by r1
Scaled Speedup
Also known as variant speedup or Gustafson's speedup.
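For reference, a common statement of Gustafson's (scaled) speedup, with notation assumed here rather than taken from the notes: with serial fraction s and N processors, Speedup(N) = s + (1 - s) x N = N - s x (N - 1). Worked example: s = 0.1 and N = 32 gives 32 - 0.1 x 31 = 28.9.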
Performance Measure Obsolescence
Benchmarks can become obsolete over time.
The CPU-Memory processing mismatch and possible work-arounds
CPU performance increases about 55%/yr while memory performance increases only about 7%/yr, a much slower rate. This is the mismatch. Work-around #1: more registers -> register specifiers get longer (ceiling of log2 N bits to encode N registers) -> longer instruction word -> hazard detection must compare more bits. Work-around #2: put memory on the die (the piece of silicon that comprises the CPU) -> space limits: not practical, expensive -> physical limits: not practical, expensive -> so instead we put a limited amount of buffer memory on the die: the cache.
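A quick worked example of the encoding cost (register counts chosen for illustration): with N = 16 registers, each register specifier needs ceiling(log2 16) = 4 bits, so a three-operand instruction spends 12 bits on register fields; doubling to N = 32 raises that to 3 x 5 = 15 bits, all of which must fit inside the fixed 32-bit instruction word.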
DRAM vs. SRAM
DRAM - dynamic random access memory, "forgetful." Volatile memory: it loses its contents without power and requires a power source to retain its value. Each bit is stored as charge on a capacitor that decays, so the memory must be rewritten (refreshed) periodically. SRAM - static random access memory; transistor-based RAM used to build flip-flops (as in registers); fast, expensive, often used for L1 cache.
Cache Controller
Finds out whether an item we need is in the cache. * Its job is to determine whether the item is in the cache or in main memory. - Most CPUs act as if they can only access registers and the cache. * The controller loads the item into the cache if needed.
ARM like model: Program Counter (PC)
Holds address of NEXT instruction to execute, NOT the current executing instruction. -Program can sometimes explicitly access this. r15 in ARM
ARM like model: Instruction Register (IR)
Holds the INSTRUCTION that is currently being executed.
I-Cache vs. D-Cache
I-Cache: holds the program's instructions. D-Cache: holds the program's data. Sometimes these are combined into a single (unified) cache.
Instruction Set Architectures (ISAs)
Logical interface of machine. Our direct interface to the CPU. - programmer - visible instruction set - computer programmer's perspective/view of the CPU
MIPS metric
MIPS = millions of instructions per second. MIPS = (# instructions / 1 million) x (1 / time). Measuring MIPS with instruction frequencies uses CPI (clocks per instruction), where CPI = # clocks / # instructions. (FYI: DDR-4 has roughly a 12.5 nanosecond cycle time.) MIPS is a problematic measure: "instruction" is under-defined, and counts differ between assembly and high-level languages, so what counts as an instruction?
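A worked example with invented numbers: a program that executes 400 million instructions in 2 seconds gives MIPS = 400,000,000 / 1,000,000 x 1/2 = 200. Equivalently, with a 1 GHz clock and CPI = 5, MIPS = clock rate / (CPI x 10^6) = 10^9 / (5 x 10^6) = 200.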
MFLOPS Metric
Millions of floating-point operations per second. MFLOPS = (# floating-point operations / 1 million) x (1 / time). Limitation: "operation" is under-defined. Advantage: it is easy to understand.
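A worked example with invented numbers: 50 million floating-point operations completed in 0.25 seconds gives MFLOPS = 50,000,000 / 1,000,000 x 1/0.25 = 200.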
The Call Stack: Full Descent
One of the most popular stack conventions: the stack pointer points at the top item on the stack and the stack grows towards lower addresses. A push is implemented by first decrementing the pointer and then storing data at that address; a pull (pop) reads the data at the stack address and then increments the pointer. We can therefore write STMDB sp!,{r0,r1} and LDMIA sp!,{r0,r1} instead of STMFD sp!,{r0,r1} and LDMFD sp!,{r0,r1}; the FD forms are simply the full-descending aliases.
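A small worked example (starting address invented): if sp = 0x9000, then STMFD sp!,{r0,r1} stores r0 at 0x8FF8 and r1 at 0x8FFC and leaves sp = 0x8FF8, pointing at the last item pushed; the matching LDMFD sp!,{r0,r1} reloads both registers and restores sp to 0x9000.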
Efficiency
The percentage of time spent doing useful work.
Data Hazards
RAW (read after write) Hazard - the program says to read a register after it is written, but the pipeline reads it before the write completes, so the wrong value is used. WAW (write after write) Hazard - the program orders write #1 before write #2, but the observed order is write #2 then write #1. The ARM model given will not exhibit this except for a specific load command. WAR (write after read) Hazard - the program orders a read followed by a write, but the observed order is the write followed by the read; this does not happen in the ARM pipeline.
Compressed RISC ISAs
RISC for embedded applications Smaller instruction lengths: MIPS16 ARM Thumb
ARM instruction format
RISC; instructions use fixed 32-bit encodings. Ex. ADD r0,r1,r2 ---> 1010111...011 (its 32-bit encoding). FYI: just because an instruction set has a fixed-length encoding does not mean it is RISC.
RISC vs CISC
Reduced Instruction Set Computing's goal was fewer instructions, fewer addressing modes, simple instructions, and simplified decoding (Fetch -> Decode -> Execute), with fixed-length encodings. RISC is all about keeping the processor computing. Complex Instruction Set Computing's goal was, roughly, a gate for every operation: the idea was that if you had a square-root gate, square roots would be faster, but this caused an imbalance in processing time. CISC has many addressing modes and instruction types, sophisticated data moves, 32-bit registers, register-exchange operations, and data pack/unpack.
Replacement + Pollution
Replacement Strategies: (1) Random - the cache controller can replace any line at random. (2) FIFO - first in, first out (e.g., via a time stamp). (3) LRU - least recently used: replace the line that was accessed longest ago. Pollution (cache pollution) - occurs when a recently evicted block is needed again; more accurately, loading a block that will not be needed as often as the block it evicted.
Time and Speedup
S = Time(Old)/Time(New)
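Worked example with invented times: if the old system takes 10 s and the new system takes 4 s, then S = 10/4 = 2.5, i.e., the new system is 2.5x faster.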
Moore's Law
The empirical observation that the number of devices per chip doubles every 18 months. This law has allowed chip manufacturers to begin designing future processors even though the required manufacturing technology does not yet exist. The term "Moore's law" is widely used in articles on computing, although not always with its original meaning (the number of components per chip doubling every 18 months); it is often taken to imply that processor performance grows exponentially. Limits of Moore's law: the atomic limit (features cannot be made smaller than an atom) and the heat limit (transistors generate a lot of heat).
Memory Hierarchy Tradeoffs
The hierarchy creates the illusion of a fast system with plentiful storage. Advantage: low to moderate cost.
ISAs: Instructions
The most important single factor in the design of computer architectures is the number of operand addresses per instruction. For example, an instruction set that implemented ADD r1,r2,r3 would be a three-address machine, whereas a computer that implemented ADD r1,r2 would be a two-address machine. Here we introduce three-, two-, one-, and zero-address machines. Instructions can also be grouped according to the nature of their operation: data movement copies data from one location to another, data processing operates on data, and flow control modifies the order in which instructions are executed. A three-address instruction can be written operation destination, source1, source2, where operation defines the nature of the instruction, source1 is the location of the first operand, source2 is the location of the second operand, and destination is the location of the result. Microprocessors don't implement three-memory-address instructions for the reasons already stated. A typical RISC processor allows you to specify three register addresses in an instruction by providing three 5-bit operand address fields. We'll use the ADD instruction to add together the four values in registers r2, r3, r4, and r5. This code is typical of RISC processors like the ARM. ADD r1,r2,r3 ; r1 = r2 + r3 ADD r1,r1,r4 ; r1 = r1 + r4 ADD r1,r1,r5 ; r1 = r1 + r5 = r2 + r3 + r4 + r5
The MicroInstruction/Microprogram Concept
A microprogram is a sequence of microinstructions stored in a read-only memory called a control store. A microprogrammed control unit reads the microinstructions from the control store and uses them to interpret machine-level instructions.
Endianness
The storage ordering of a word. The big endian format stores a word with the most-significant byte (the "big" byte) at the lowest memory address, and the little endian format stores a word with the least-significant byte (the "little" byte) at the lowest address. In ARM, the endianness is selectable.
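A worked example (word value and address invented): storing the 32-bit word 0x12345678 at address 0x100: big endian places 0x12 at 0x100, 0x34 at 0x101, 0x56 at 0x102, 0x78 at 0x103; little endian places 0x78 at 0x100, 0x56 at 0x101, 0x34 at 0x102, 0x12 at 0x103.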
Locality of Reference
The tendency of a processor to access the same set of memory locations repetitively over a short period of time.
ISAs: Registers
Registers are on the CPU in register-set architectures (RSAs). We want registers to be very fast, since they may have to complete a task in less than 1/3 of a nanosecond, so they sit on the CPU, physically near the circuits that use them. Classes of register architecture: 1. Register-Memory Architectures (RMA) - one or two memory-based operands. For example, in ADD y, y is a memory-based operand; in ADD r1, r3, y, r1 and r3 are register-based operands. Many RMAs require the first two operands to be register based. Ex. Intel x86. FYI: older than RRA. 2. Register-Register Architectures (RRA) - arithmetic and logical instructions typically have ONLY register operands; sometimes called load-store architectures; faster than RMA. Operates on the following model: 1) load values into registers, 2) operate on the registers, 3) put the result back into memory (see the sketch below).
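A minimal load-store sketch in ARM-style assembly, assuming word variables X, Y, and Z declared elsewhere (e.g., with DCD); the labels and the LDR =label pseudo-instruction usage are illustrative:
        LDR   r0, =X          ; r0 = address of X
        LDR   r1, [r0]        ; 1) load X into a register
        LDR   r0, =Y
        LDR   r2, [r0]        ;    load Y into a register
        ADD   r3, r1, r2      ; 2) operate only on registers: r3 = X + Y
        LDR   r0, =Z
        STR   r3, [r0]        ; 3) store the result back to memory (Z)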
ARM Procedure Call Standard
Register values and variables are saved on a stack; r13 holds the stack address. ARM uses a hybrid caller-saves and callee-saves scheme: r0 - r3 <-- caller-saves; r4 - r15 <-- callee-saves. Conventions: r15 is the PC, r14 is the LR (holds the return address), r13 is the stack pointer, r11 is the frame pointer. A stack frame is also known as an activation frame or activation record. (See the call sketch below.)
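A minimal sketch of the hybrid scheme (the register choices and the helper routine are illustrative, not part of the standard itself):
caller  STMFD sp!, {r1}       ; caller-saves: preserve a live r0-r3 value before the call
        BL    helper          ; call; return address is placed in r14 (LR)
        LDMFD sp!, {r1}       ; restore it afterwards, then continue
helper  STMFD sp!, {r4, lr}   ; callee-saves: preserve r4 (and LR, in case helper calls further)
        ADD   r4, r4, #1      ; body may use r4 freely (illustrative work)
        LDMFD sp!, {r4, pc}   ; restore r4 and return by loading the saved LR into the PC
Loading the saved LR straight into the PC in the final LDMFD is the conventional ARM return.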
Basic ILP: Pipelining
a model of instruction execution that overlaps the execution of consecutive instructions: while one instruction executes, the next is being decoded and the one after that is being fetched.
Clusters
a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system
Hits
an access to an item that is already in the cache. Two classes: read hit (a load to an item already in the cache) and write hit (a store to an item already in the cache).
Processor Control: Forwarding
also known as bypassing, take an operand from one part of the pipe and supply it to another part.
Associativity: Full
any memory block can be mapped to any cache line. A microarchitecture-level design choice.
Processor Control: Delayed Branch
changes our concept of programming: the instruction immediately following the branch (the delay slot) is always executed, whether or not the branch is taken. The motivation is that every control instruction produces a control hazard.
ARM Thumb
embedded ARM. Instructions are 16 bits. 8 registers are available to the programmer. Fewer instructions: no conditional execution, and the _S suffix is not present because all ALU instructions set the CCR. - no shift actions - shorter literals - BX: entry/exit. Performance: about a 30% reduction in code size (in bytes) but more lines of code, and about 2% slower.
Conditional Instruction
A conditional instruction like BEQ either continues program execution normally with the next sequential instruction (at PC + 4) or loads the program counter with a new value, branching to another region of code.
Associativity: Set Associative
cache lines are grouped into "sets"; a memory block maps to exactly one set but may be placed in any line within that set.
Associativity: Direct-Mapped
a given memory block can be associated with one and only one line of the cache.
Cache Layout - Lines
a set of consecutive memory locations in the cache, starting at a location that is a multiple of the line size. Each line corresponds to a block and is managed as a unit. Block - a set of consecutive locations in memory.
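A quick worked example (line size invented): with 64-byte lines, address 0x1234 belongs to block 0x1234 / 64 = 0x48 with offset 0x1234 mod 64 = 0x34, so every address from 0x1200 to 0x123F falls in the same block and therefore the same cache line.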
ARM like model: MBR
stores data that has just been read from main memory or data to be immediately written to main memory.
ARM like model: MAR
stores the address of the location in main memory that is currently being accessed by a read or write operation.
ARM Features: CCR (Status Register, SR)
stores the various conditions that can be tested (e.g., zero, negative, positive). That is, when the ALU performs an operation, it updates the zero, carry, negative, and overflow bits in the CCR, Condition Code Register.
ARM Features: Assembler Directives
tell the assembler something about the environment. They tell the assembler where the code is to be located in memory, allocate storage space to variables, and set up initial data that the program might need during its execution. Ex. Value1 EQU 12 ;associate name Value1 with 12 Value2 EQU 45 Table DCD Value1 ;store the word 12 in memory DCD Value2 ;store the word 45 in memory
Callee-Saves Scheme
the called function has the responsibility of saving the registers and variables.
Associativity
ways to associate memory blocks with cache lines
Processor Control: Branch Prediction
predicts each branch as taken or not-taken. Two-bit prediction works via a Branch History Table: a 2-bit counter, ideally two bits per instruction, fetched along with the instruction. High counter value: predict taken (not useful for the ARM pipe). Low value: predict not-taken.
CISC
x86
ISAs: Addressing Modes
●Literal - also called immediate or constant addressing; the simplest form of addressing, where the operand is part of the instruction itself. It is called immediate addressing because the operand is immediately available (it doesn't have to be read from a register or memory). Ex. ADD r1,r1,#1 ---> #1 uses literal addressing. ●Direct - also called absolute addressing; this mode provides the address of an operand as part of the instruction. Ex. LDR r0,Mem (the address Mem is encoded in the instruction; as noted above, ARM itself does not support this mode). ●Indirect - also known as register indirect addressing or base addressing; mostly used for load and store instructions. In register indirect addressing, the instruction provides the register containing the address of the operand, so obtaining an operand requires three accesses: reading the instruction, reading the register containing the operand address, and finally reading the actual operand. Ex. LDR r1, [r0]