CMSC 411 Final
Calculate Binary Floating Point Representation by Hand
Given 3.625: the sign bit is 0 because the number is positive. 3 in binary is 11. 0.625 in binary is found by repeated doubling:
0.625 x 2 = 1.250 -> 1 (subtract the 1 from the result)
0.250 x 2 = 0.500 -> 0
0.500 x 2 = 1.000 -> 1 (subtract the 1 from the result)
0 -> stop
So 0.625 in binary is 0.101, and 3.625 in binary is 11.101.
Normalize: 11.101 = 1.1101 x 2^1, so the true exponent is 1.
The stored exponent is biased by 127 for single precision (32 bit) and by 1023 for double precision (64 bit):
32 bit: 127 + 1 = 128 -> 10000000
64 bit: 1023 + 1 = 1024 -> 10000000000
The mantissa is the bits after the leading 1, padded with zeros on the right to 23 bits (52 bits for double precision): 11010000000000000000000
Single precision result: 0 10000000 11010000000000000000000 (hex 0x40680000)
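A quick way to check a hand conversion (a minimal Python sketch, not part of the original notes):
import struct
# Pack 3.625 as a big-endian IEEE 754 single and print the bit pattern.
bits = struct.unpack('>I', struct.pack('>f', 3.625))[0]
print(format(bits, '032b'))  # 01000000011010000000000000000000
print(hex(bits))             # 0x40680000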
MTTF
Mean Time To Failure - improved by fault avoidance, fault tolerance, and fault forecasting
Two's Complement
Find the one's complement and add 1. -Ex: 10101010 becomes 01010110
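A minimal Python sketch of the same idea (the helper name is made up for illustration):
def twos_complement_8bit(x):
    # Invert the bits (one's complement), add 1, and keep only 8 bits.
    return (~x + 1) & 0xFF

print(format(twos_complement_8bit(0b10101010), '08b'))  # 01010110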
Loader
loads programs and libraries
Binary Division
Works like decimal long division. -Ex: 1001010 / 1000 = 1001 remainder 10 (74 / 8 = 9 r 2)
Binary Multiplication
Long multiplication, as in decimal: multiply by each bit of the multiplier, shifting one place left per bit, then add the partial products.
-Ex: 1000 x 1001
      1000
   x  1001
  --------
      1000
     0000
    0000
   1000
  --------
  01001000   (8 x 9 = 72)
Direct mapped
(1-way associative) • One choice for placement (Block address modulo number of blocks)
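-Ex: with 8 blocks in the cache, block address 12 maps to entry 12 mod 8 = 4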
MTBF
(Mean Time Between Failures) - is the predicted elapsed time between inherent failures of a system during operation.
MTTR
(Mean Time to Repair) - average time required to repair a failed component or device.
NUMA
(Non-Uniform Memory Access) - each physical processor has its own memory controller and, in turn, its own bank of memory
UMA
(Uniform Memory Access) - all of the processors in the system send their requests to a single memory controller
Cache Placement
Determined by associativity
MIPS v. MFLOPS
*Both are weak measures of performance*
MIPS (Million Instructions Per Second)
- Most popular performance metric
- Bad because it doesn't account for differences in ISAs between computers or for differences in complexity between instructions
- MIPS = Clock Rate / (CPI x 10^6)
MFLOPS (Million Floating-Point Operations Per Second)
- Fairer than MIPS because it is based on actual operations, not instructions
- Bad because the set of FP operations is not the same across machines, and the rating changes not only with the mix of FP and non-FP operations but also with the mix of fast and slow FP operations
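-Ex (numbers assumed for illustration): a 2 GHz machine with CPI = 2 executes 10^9 instructions per second, so MIPS = (2 x 10^9) / (2 x 10^6) = 1000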
Binary Logical Operators
*Know what they look like*
INV (Inverter / NOT) -> if x = 1 the output is 0, and vice versa
OR -> output is 1 if at least one input is 1
AND -> output is 1 if all inputs are 1
XOR (Exclusive OR) -> output is 1 if exactly one of x or y is 1
NOR (Not OR) -> output is 1 only if neither input is 1
NAND (Not AND) -> inverted output of an AND
XNOR (Exclusive NOR) -> inverted output of an XOR
MUX (Multiplexer) -> selects one of its data inputs to pass to the output, based on a select signal (a 2-to-1 MUX picks between two inputs using one select bit)
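These behaviors can be sanity-checked with 1-bit values in Python (a small sketch, not from the original card; the function names are illustrative):
def INV(x):        return 1 - x
def AND(x, y):     return x & y
def OR(x, y):      return x | y
def XOR(x, y):     return x ^ y
def NAND(x, y):    return 1 - (x & y)
def NOR(x, y):     return 1 - (x | y)
def XNOR(x, y):    return 1 - (x ^ y)
def MUX2(a, b, s): return b if s else a   # 2-to-1 mux: select bit s chooses between a and b

print(XOR(1, 1), NOR(0, 0), MUX2(0, 1, 1))  # prints: 0 1 1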
Pipeline Operation
*Pipelining is a form of computer organization in which successive steps of an instruction sequence are executed in turn by a sequence of modules able to operate concurrently, so that another instruction can be begun before the previous one is finished.*
IF (Instruction Fetch) - fetch the instruction from memory
ID (Instruction Decode) - decode the instruction and read the registers
EX (Execute) - execute the operation or calculate an address
MEM (Memory) - access the memory operand
WB (Write Back) - write the result back to a register
The Mill
- Operations execute in program order
- The compiler controls when ops issue and retire
- Short pipeline
- Not yet in silicon
- No rename registers to eliminate hazards
- No general registers; transient data lives on the Belt, which is a FIFO
Servers
-Design geared toward reliability Ex: HP Itanium
Desktops
-Device design is driven by cost Ex: Mac Pro
Embedded Computers
-Example: DSP -Cheap little mini-computers that do one task (driven by cost and unique application)
Custom
-Example: GPU -Generally geared towards one task
Supercomputers
-raw computation power -price is not a concern Ex: Watson
RAID Level
0 - striping across disks with no redundancy
1 - fault-tolerant configuration known as "disk mirroring": data is copied seamlessly and simultaneously from one disk to another, creating a replica, or mirror; if one disk fails, the other keeps working
5 - the most common RAID configuration: data and parity (additional data used for recovery) are striped across three or more disks; if a disk errors or starts to fail, data is recreated from the distributed data and parity blocks, seamlessly and automatically
Binary Addition
Add column by column, carrying as in decimal.
-Ex: 7 + 6
   000111   (7)
 + 000110   (6)
 --------
   001101   (13)
Binary Subtraction
Add the two's complement of the second operand.
-Ex: 7 - 6 = 7 + (-6)
   00000111   (7)
 + 11111010   (-6)
 ----------
   00000001   (1)   (the carry out of the top bit is discarded)
Fully associative
Any location
ALU
Arithmetic Logic Unit -> Does arithmetic and logic
GFLOPS
Billion Floating Point Operations per Second
Who Has the Most Powerful Computer:
CHINA
Valid Flag
Cache is loaded with data
CPU
Central Processing Unit (Has an ALU, Memory, Registers, and Program Counter)
Memory Misses (3 Cs)
Compulsory (cold-start): the first access to a block is a miss because the block is not yet in the cache, so it must be brought in
Capacity: the cache cannot contain all the blocks needed during execution of the program
Conflict (collision): competition for entries within a set; does not occur in a fully associative cache
DMA
Direct Memory Access a method that allows an input/output (I/O) device to send or receive data directly to or from the main memory, bypassing the CPU to speed up memory operations.
DRAM
Dynamic RAM - stores each bit of data in a separate capacitor within an integrated circuit.
GPU
Graphics Processor Unit - these are processors that are optimized for 2D and 3D graphics, video, visual computing, and display. They allow these tasks to be separated from the CPU and are designed specifically to perform these tasks. GPUs have a highly parallel structure that is extremely efficient at manipulating computer graphics. Examples of GPUs are add-on cards made by NVidia and AMD
Study CPU, Clock Cycles, Clock Time and CPI
I can't include the pictures here; see the formulas below.
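The standard relationships the card refers to (textbook formulas, stated here since the diagrams are missing):
CPU Time = Instruction Count x CPI x Clock Cycle Time = (Instruction Count x CPI) / Clock Rate
-Ex (numbers assumed for illustration): 10^9 instructions at CPI = 2 on a 2 GHz clock -> CPU Time = (10^9 x 2) / (2 x 10^9) = 1 second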
Carry Look Ahead
Improves speed by reducing the time it takes to determine carry bits. Calculates one or more carry bits before the sum, which reduces the wait time.
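In the usual generate/propagate formulation (standard textbook notation, not from the original card): g_i = a_i AND b_i (generate), p_i = a_i OR b_i (propagate), and c_(i+1) = g_i OR (p_i AND c_i). Expanding, c_2 = g_1 + p_1·g_0 + p_1·p_0·c_0, so each carry can be computed directly from the inputs without waiting for the previous carry to ripple through.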
Unsupervised Training:
In unsupervised training, the network is provided with inputs but not with desired outputs. The system itself must then decide what features it will use to group the input data. Example architectures: Kohonen, ART
Caches Hierarchy
L1 - focus on fast hit time; smaller than a single-level cache would be, with a block size smaller than the L2 cache's
L2 - focus on a low miss rate, to avoid going to main memory
L3, L4 - further levels, each larger and slower
*All cache levels are faster than RAM*
SRAM - fastest memory access; smaller (in capacity), lower power, more expensive
DRAM - bits stored as capacitor charge, which must be refreshed periodically
DISK - slowest memory access
Ideal memory - the speed of SRAM at the cost of DISK
LEGv8 Structure:
LEGv8 is a 32-bit machine: all instructions are 32 bits, and the registers, datapath, PC, and memory-address length are all 32 bits. Has 32 registers. *ARM's datapath and memory are 64 bits; ARM is a 64-bit system.*
Amdahl's Law:
Law of Diminishing Returns Improving an aspect of a computer does not necessarily give a proportional improvement to overall performance
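The usual statement (standard formula, not written out on the original card):
Speedup_overall = 1 / ((1 - f) + f / s), where f is the fraction of execution time that is improved and s is the speedup of that fraction
-Ex (numbers assumed for illustration): making 80% of a program 4x faster gives 1 / (0.2 + 0.8 / 4) = 2.5x overall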
LRU Replacement
Least Recently Used - replace the cache entry that has gone unused the longest. Becomes too hard to implement exactly beyond 4-way set associativity.
LSB
Least Significant Bit - is the bit position in a binary integer giving the units value, that is, determining whether the number is even or odd. Has the lowest value
Translation Look-aside Buffer (TLB)
Lists the physical address page number associated with each virtual address page number
Availability
MTTF / (MTTF + MTTR)
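-Ex (numbers assumed for illustration): MTTF = 1000 hours and MTTR = 1 hour -> availability = 1000 / 1001 ≈ 99.9%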
MMU
Memory Management Unit - a computer hardware unit through which all memory references pass, primarily performing the translation of virtual memory addresses to physical addresses
MSB
Most Significant Bit - the bit position in a binary number with the highest value (the leftmost bit)
MIMD
Multiple Instruction streams, Multiple Data streams. This is a multiprocessor where multiple instructions are applied to many data streams. Intel Xeon e5345 is an example.
MISD
Multiple instruction, single data. This is a type of parallel computing architecture where many functional units perform different operations on the same data. Pipeline architectures belong to this category. Not many instances of this architecture exist, but one prominent example of MISD is the Space Shuttle flight control computers.
North Bridge vs South Bridge
NB - connected directly to the CPU via the front-side bus (FSB) and is thus responsible for tasks that require the highest performance. SB - typically implements the slower capabilities of the motherboard in a northbridge/southbridge chipset computer architecture
Who Has the Largest Server Farm
NSA *Data is more valuable than power*
PC
Program Counter - holds the address of the current instruction and increments each time a new instruction is fetched
PLA
Programmable Logic Array - used to implement combinational logic circuits
Instruction Formats:
R: opcode|Rm|shamt|Rn|Rd -> Register (arithmetic) format: arithmetic on register operands
I: opcode|immediate|Rn|Rd -> Immediate format: arithmetic with an immediate operand (no load or store)
D: opcode|address|op2|Rn|Rt -> Data-transfer format: gets data out of memory or puts it into memory (load/store)
B: opcode|address -> Unconditional branch: branch to another location
CB: opcode|address|Rt -> Conditional branch: branch to another location if a condition is met
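For reference, the field widths in the textbook's LEGv8 encoding (added here; double-check against the course slides): R: opcode 11 | Rm 5 | shamt 6 | Rn 5 | Rd 5; I: opcode 10 | immediate 12 | Rn 5 | Rd 5; D: opcode 11 | address 9 | op2 2 | Rn 5 | Rt 5; B: opcode 6 | address 26; CB: opcode 8 | address 19 | Rt 5. Each format totals 32 bits.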
RISC v. CISC
RISC (Reduced Instruction Set Computing) - simpler instructions to provide better performance
-Ex: MIPS, ARM
CISC (Complex Instruction Set Computing) - single instructions can execute several low-level operations (such as a load from memory, an arithmetic operation, and a memory store) or are capable of multi-step operations or addressing modes
-Ex: Intel
RAM
Random Access Memory - is a form of computer data storage which stores frequently used program instructions to increase the general speed of a system. A random-access memory device allows data items to be read or written in almost the same amount of time irrespective of the physical location of data inside the memory.
SIMD
Single Instruction Multiple Data (GPUs are an example) -has multiple instances of data on which it performs same operation
SIMD
Single Instruction stream, Multiple Data streams. This is a multiprocessor, where the same instruction is applied to many data streams, as in a vector processor or array processor. An example of this type of implementation is the SSE instructions of the x86.
SISD
Single Instruction stream, Single Data stream. This is a conventional uniprocessor where a single processor executes a single instruction stream to operate on data stored in a single memory. An example of this type of processor is the Intel Pentium 4.
Binary Floating Point Representation
Single Precision (32 bit): Sign (1 bit) | Exponent (8 bits) | Mantissa (23 bits)
Double Precision (64 bit): Sign (1 bit) | Exponent (11 bits) | Mantissa (52 bits)
Layers of Code (Abstraction)
Software Application Layer - ex. Matlab, web browsers, office products
Programming Language - ex. C, C++, Fortran
Assembly
Binary
Microcode
Nanocode
SRAM
Static RAM - a type of semiconductor memory that uses bistable latching circuitry (flip-flop) to store each bit.
Strong Scaling vs. Weak Scaling:
Strong - how solution time varies with the number of processors for a fixed total problem size
Weak - how solution time varies with the number of processors for a fixed problem size per processor
Pipeline Hazards
Structural Hazard
-A required resource is busy
-Solved by stalling (and punching the sys. engineer)
Data Hazard
-Need to wait for a previous instruction to complete its data read/write
-Solved by bubbles (stalls) and data forwarding
Control Hazard
-Deciding on a control action depends on a previous instruction
-Solved by branch prediction and stalls
Types of Computers
Supercomputers Servers Embedded Computers Desktops Custom
Moore's Law
The number of transistors on an integrated circuit doubles roughly every 18 months to 2 years
TFLOPS
Trillion Floating Point Operations per Second
Pipeline
A set of data processing elements connected in series, where the output of one element is the input of the next. The elements of a pipeline are often executed in parallel.
Quantum Computers
Use quantum bits (qubits), which instead of being just 1 or 0 can be 1, 0, or both at once; this is called superposition.
VHDL
VHSIC Hardware Description Language a hardware description language used to design programmable gate arrays and integrated circuits
Flags
Valid Flag - the cache entry is loaded with valid data
Dirty Flag - the cache entry has changed since it was read from main memory
Reference Flag - LRU bits
VHSIC
Very High Speed Integrated Circuit
VLIW
Very Long Instruction Word - a VLIW processor allows programs to explicitly specify instructions that are to execute at the same time, in parallel
VM
Virtual Memory- memory that appears to exist in main storage although most of it is supported by data held in secondary storage
Assembler
a program for converting instructions written in low-level symbolic code into machine code.
Linker
a program used with a compiler or assembler to provide links to the libraries needed for an executable program.
Microcode
a very low-level instruction set that is stored permanently in a computer or peripheral controller and controls the operation of the device.
Tag
the upper portion of the block address, stored with a cache entry and compared against the requested address to tell whether the entry holds the desired data
Supervised Training:
both the inputs and the outputs are provided. The network then processes the inputs and compares its resulting outputs against the desired outputs. Errors are then propagated back through the system, causing the system to adjust the weights which control the network. This process occurs over and over as the weights are continually tweaked. Example architectures: Multilayer perceptrons
One's Complement
Change all 1's to 0's and all 0's to 1's. -Ex: 10101010 becomes 01010101
Index
determines which cache set the address should reside in
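Putting tag, index, and offset together (standard layout; the example numbers are assumed for illustration): address = tag | index | block offset. -Ex: with a 32-bit address, 64 sets (6 index bits), and 16-byte blocks (4 offset bits), the tag is 32 - 6 - 4 = 22 bits.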
Page Faults
is a type of interrupt, called trap, raised by computer hardware when a running program accesses a memory page that is mapped into the virtual address space, but not actually loaded into main memory.
Simultaneous Multithreading (SMT)
instructions from multiple threads are issued during a single clock cycle. The goal of SMT is to use the resources of a multiple-issue, dynamically scheduled processor to exploit thread-level parallelism at the same time it exploits instruction-level parallelism. It is the best form of multithreading, but is only applicable to superscalar CPUs.
Coarse Grained
multithreading, which is an alternative to fine-grained multithreading. Coarse-grained multithreading switches threads only when a costly stall is encountered. The advantage of this multithreading scheme is that individual threads are not slowed down since instructions from other threads will only be executed if the current thread encounters a costly stall.
Set associative
n choices within a set (Block address modulo number of sets in cache)
Higher associativity
reduces miss rate - Increases complexity, cost, and access time
Page Table
the data structure used by a virtual memory system in a computer operating system to store the mapping between virtual addresses and physical addresses
Fine Grained
the processor switches between threads on each instruction, resulting in multiple threads being interleaved. A major advantage of fine-grained multithreading is that throughput losses from both short and long stalls can be hidden since instructions from other threads can be executed when a thread stalls
Faster Multiplication
use multiple adders
-cost/performance tradeoff: sacrifice area (more hardware) for speed
-can be pipelined