Second Test CSE120


Exceptions and Interrupts

-"Unexpected" events requiring change in flow of control -Switch from user to privileged (kernel) mode -Example: Illegal or unmapped memory access -Exception -Arises within the CPU -E.g., undefined opcode, overflow, div by 0, syscall, ... -Interrupt -E.g., from an external I/O controller (network card) -Dealing with them without sacrificing performance is hard

Example: Intel P6 Processor

-32-bit address space with 4KB pages -2-level cache hierarchy -L1 I/D: 16KB, 4-way, 128 sets, 32B blocks -L2: 128KB - 2MB -TLBs -I-TLB: 32 entries, 4-way, 8 sets -D-TLB: 64 entries, 4-way, 16 sets

How to flush

-5-stage Pipeline -only one instruction to flush -Can do by "flushing" the output register of IF -Deeper Pipelines -Multiple cycles to determine branch outcome -Need to know how many instructions in the pipe -Flush all state changing actions -RegWrite, MemWrite, PCWrite (jmp, beq), overflow...

TLB Caveats: Limited Reach

-64 entry TLB with 4KB pages maps 256KB -Smaller than many L3 caches in most systems -TLB miss rate > L2 miss rate! -Potential solutions -Larger pages -Multilevel TLBs (just like multi-level caches)

AMAT example

-Access time = hit time + miss rate x miss penalty -Assume a one-level cache -Fetch from cache: 1 cycle -Fetch from DRAM: 101 cycles - in this case the 101 includes the 1 cycle above, so the miss penalty is 100 cycles -Cache miss rate: 25% -What is the AMAT? -AMAT = 1 + 0.25 x 100 = 26 cycles (worked out in the sketch below)
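
A minimal sketch of the arithmetic above in Python, using the 1-cycle hit time and the 100-cycle miss penalty derived from this example:

    # AMAT = hit_time + miss_rate * miss_penalty, using the values from this example
    hit_time = 1            # cycles to fetch from the cache
    miss_penalty = 101 - 1  # DRAM access time minus the cache cycle already counted
    miss_rate = 0.25
    amat = hit_time + miss_rate * miss_penalty
    print(amat)             # 26.0 cycles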

Average Memory Access Time

-Access time = hit time + miss rate x miss penalty -Average Memory Access Time (AMAT) -Formula can be applied to any level of the hierarchy -Access time for that level -Can be generalized for the entire hierarchy -Average access time that the processor sees for a reference

Exceptions in a Pipeline

-Another form of control hazard -Consider an illegal access in the MEM stage: ld $1, 0($2) -Prevent $1 from being written by the load -Complete previous instructions -Nullify subsequent instructions -Set Cause and EPC register values -Transfer control to handler -Similar to a mispredicted branch -Uses much of the same hardware

Associative Caches: Con

-Area overhead -More storage needed for tags (compared to same sized DM) -N comparators -Latency -Critical path = way access + comparator + logic to combine answers -Logic to OR hit signals and multiplex the data outputs -Cannot forward the data to processor immediately -Must first wait for selection and multiplexing -Direct mapped assumes a hit and recovers later if a miss -Complexity: dealing with replacement

Cache Block Example

-Assume a 2^n-byte direct mapped cache with 2^m-byte blocks -Byte select - The lower m bits -Cache index - The next (n-m) bits of the memory address -There are 2^(n-m) cache indices -Cache tag - The upper (32-n) bits of the memory address (see the sketch below) Pic in Folder: Cache Block Example
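
A small sketch of this address split; the values of n, m, and the address are made-up examples, and a 32-bit address is assumed:

    # Split a 32-bit address for a 2^n-byte direct-mapped cache with 2^m-byte blocks
    n, m = 12, 5                                       # assumed: 4KB cache, 32-byte blocks
    addr = 0x1234ABCD                                  # assumed example address
    byte_select = addr & ((1 << m) - 1)                # lower m bits
    cache_index = (addr >> m) & ((1 << (n - m)) - 1)   # next (n - m) bits
    cache_tag = addr >> n                              # upper (32 - n) bits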

How a Processor Handles a Miss

-Assume that cache access occurs in 1 cycle -A hit is great, and the basic pipeline is fine -CPI penalty = miss rate x miss penalty -For our processor, a miss stalls the pipeline (for an instruction or data miss) -Stall the pipeline (it doesn't have the data it needs) -Send the address that missed to the memory -Instruct main memory to perform a read and wait -When the access completes, return the data to the processor -Resume the instruction

TLB and Memory Hierarchies

-Basic process -Use TLB to get VA → PA -Use PA to access caches and DRAM -Question: can you ever access the TLB and the cache in parallel? Pic in Folder: TLB and Memory Hierarchies

Terminology

-Block - minimum unit of data that is present at any level of the hierarchy -Hit - Data found in the cache -Hit rate - Percent of accesses that hit -Hit time - Time to access on a hit -Miss - Data not found in the cache -Miss rate - Percent of misses (1 - Hit Rate) -Miss penalty - Overhead in getting data from a higher-numbered level -Miss penalty = higher-level access time + time to deliver to the lower level + cache replacement / forward-to-processor time -Miss penalty is usually much larger than the hit time -This is in addition to the hit time -These apply to each level of a multi-level cache -e.g., we may miss in the L1 cache and then hit in the L2

Dynamic Branch Prediction

-Branch History Table (BHT) -One entry for each branch PC -Taken/Not taken bit -Branch Target Buffer (BTB) -One entry for each branch PC -Target address -Increasingly important for long pipelines (IDx) -x86 vs. RISC-V instruction decode

Associativity example

-Compare 4-block caches -Direct mapped, 2-way set associative, fully associative -Block access sequence: 0, 8, 6, 8 (simulated in the sketch below) 3 pics in folder: Direct Mapped (index = address % 4), 2-way set associative (index = address % 2), Fully Associative (blocks can go anywhere)
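
A rough simulation of this access sequence, assuming LRU replacement; it just counts misses for a 4-block cache at each associativity:

    # Count misses for the block sequence 0, 8, 6, 8 in a 4-block cache (LRU assumed)
    def count_misses(seq, num_sets, ways):
        sets = [[] for _ in range(num_sets)]   # each set holds up to `ways` block numbers
        misses = 0
        for block in seq:
            s = sets[block % num_sets]
            if block in s:
                s.remove(block)                # hit: move block to the MRU end
            else:
                misses += 1                    # miss: evict LRU if the set is full
                if len(s) == ways:
                    s.pop(0)
            s.append(block)
        return misses

    seq = [0, 8, 6, 8]
    # direct mapped (4 sets x 1 way), 2-way (2 sets x 2 ways), fully associative (1 set x 4 ways)
    print(count_misses(seq, 4, 1), count_misses(seq, 2, 2), count_misses(seq, 1, 4))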

N-Way Set Associative Cache

-Compromise between direct-mapped and fully associative -Each memory block can go to one of N entries in the cache -Each "set" can store N blocks; a cache contains some number of sets -For fast access, all blocks in a set are searched in parallel -How to think of an N-way associative cache with X sets -1st view: N direct mapped caches each with X entries -Caches searched in parallel -Need to coordinate on data output and signaling hit/miss -2nd view: X fully associative caches each with N entries -Only one of the X caches is searched per access (selected by the index)

The 3 Cs of Cache Misses

-Compulsory - this is the first time you referenced this item -Capacity - not enough room in the cache to hold items -i.e., this miss would disappear if the cache were big enough -Conflict - item was replaced because of a conflict in its set -i.e., this miss would disappear with more associativity

Precise Exceptions

-Definition: precise exceptions -All previous instructions had completed -The faulting instruction was not started -None of the following instructions were started -No changes to the architecture state (registers, memory)

More than 5 stages in a pipeline?

-Desirable properties -Higher clock frequency -CPI < 1 -Avoid in-order stalls -When an instruction stalls, independent instructions behind it also stall -Avoid stalls due to branches

TLB Entries

-Each TLB entry stores a page table entry (PTE) -TLB entry data → PTE entry fields -Physical page number -Permission bits (Read/Write/Execute) -Other PTE information (dirty bit, etc.) -The TLB entry metadata -Tag: portion of the virtual page # not used to index the TLB -Depends on the TLB associativity -Valid bit -LRU bits -If the TLB is associative and LRU replacement is used

Exception Example

-Exception on the add in:
40 sub $11, $2, $4
44 and $12, $2, $5
48 or $13, $2, $6
4c add $1, $2, $1
50 slt $15, $6, $7
54 lw $16, 50($7)
...
-Handler:
80000180 sw $25, 1000($0)
80000184 sw $26, 1004($0)

Dynamic Scheduling

-Execute instructions out-of-order (OOO) -Fetch multiple instructions per cycle using branch prediction -Figure out which are independent and execute them in parallel -Example

Multiple Instruction Issue (superscalar)

-Fetch and execute multiple instructions per cycle => CPI < 1 -Example: fetch 2 instructions/clock cycle -Dynamically decide if they can go down the pipe together or not (hazard) -Resources: double amount of hardware (FUs, Register file ports) -Issues: hazards, branch delay, load delay -Modern Processors: Intel x86: 4-way, ARM: 6-way/8-way

L1 associativity

-For L1 you would probably want low associativity; as you go to the higher-numbered caches, you would want more associativity. This is because higher associativity raises hit time but lowers miss rate. L1 is supposed to be the fastest cache, and every higher-numbered level gets slower but is there to catch the misses from the levels below it. Because of this hierarchy, L1 is designed to be the fastest cache, then L2, then L3, then the DRAM itself.

Tag and Index with Set-Associative Cache

-Given a 2^n-byte cache with 2^m-byte blocks that is 2^a-way set-associative -Which bits of the address are the tag and which are the index? -The m least significant bits are the byte select within the block -Basic Idea -The cache contains 2^n / 2^m = 2^(n-m) blocks -The cache contains 2^(n-m) / 2^a = 2^(n-m-a) sets -Cache index: the (n-m-a) bits after the byte select -Same index used with all cache ways... -Observation -For a fixed size, the length of the tag increases with the associativity -Associative caches incur more overhead for tags Pic in Folder: Cache Line Tag Example

Modern Processors

-High clock frequency -> deep pipelines -10-20 stages -CPI < 1 -> superscalar pipelines -Launch 2, 3, or 4 instructions every cycle -Avoid in-order stalls -> out-of-order execution -Re-order instructions based on dependencies -Avoid stalls due to branches -> branch prediction -Keep history about direction and target of branches

Tags and Valid Bits

-How do we know which particular block is stored in a cache location? -Store block address as well as the data -Actually, only need the high-order bits -Called the tag -What if there is no data in a location? -Valid bit: 1 = present, 0 = not present -Initially 0

Virtual Memory

-How to share/use physical memory by multiple applications? -How to guarantee that applications can only access "their" memory (security)? -How to expose a virtually unlimited amount of main memory to applications? -HW translates addresses using an OS-managed lookup table -Indirection allows redirection and checks! Pic in Folder: Virtual Memory Layout

Thrashing

-If accesses alternate, one block will replace the other before reuse -in our previous example 18, 26, 18, 26, ... - every reference will miss -No benefit from caching -Conflicts and thrashing can happen quite often

Deeper Pipelining

-Increase number of pipeline stages -Fewer levels of logic per pipe stage -Higher clock frequency

Associative Caches: Pro

-Increased associativity decreases miss rate -Eliminates conflicts -But with diminishing returns -Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000 -1-way: 10.3% -2-way: 8.6% -4-way: 8.3% -8-way: 8.1% -Caveat: cache shared by multiple cores may need higher associativity

Page Size Tradeoff

-Larger pages -Pros: smaller page tables, fewer page faults and more efficient transfer for larger applications, improved TLB coverage -Cons: higher internal fragmentation (internal fragmentation is when, say, a 1KB program is given a whole large page; the rest of that page can only be used by that program, which is a waste) -Smaller pages -Pros: improved time to start small processes with fewer pages, internal fragmentation is low (important for small programs) -Cons: high overhead from large page tables -General trend toward larger pages -1978: 512 B, 1984: 4KB, 1990: 16 KB, 2000: 64KB, 2010: 4MB

Block Sizes

-Larger block sizes take advantage of spatial locality -But they also incur a larger miss penalty, since it takes longer to transfer the block into the cache -Large blocks can also increase the average access time or the miss rate -Tradeoff in selecting block size -Average Access Time = Hit Time + Miss Penalty x Miss Rate (MR)

Limits of Advanced Pipelining

-Limited ILP in real programs -Pipeline overhead -Branch and load delays exacerbated -Clock cycle timing limits -Limited branch prediction accuracy (85%-98%) -Even a few percent really hurts with long/wide pipes! -Memory inefficiency -Load delays + # of loads/cycle -Caches (next lecture) are not perfect -Complexity of implementation

Direct Mapped Cache

-Location in cache determined by (main) memory address -Direct mapped: only one choice -(Block address in memory) modulo (# blocks in cache) -Simplification -If # blocks in cache is power of 2 -Modulo is just using the low-order bits

Abbreviations

-MMU -Memory management unit: controls TLB, handles TLB misses -Components of the virtual address (VA) -TLBI: TLB Index -TLBT: TLB tag -VPO: virtual page offset -VPN: virtual page number -Components of the physical address (PA) -PPO: physical page offset (same as VPO) -PPN: physical page number -CO: byte offset within cache line -CI: cache index -CT: cache tag

Multiple Page Sizes

-Many machines support multiple page sizes -SPARC: 8KB, 64KB, 1MB, 4MB -MIPS R4000: 4KB - 16MB -x86: 4KB, 4MB, 1GB -Page size dependent upon application -OS kernel uses large pages -Most user applications use smaller pages -Issues -Software complexity -TLB complexity -How do you match the VPN if you are not sure about the page size?

Reducing Branch Delay

-Minimize "bubble" slots -Move branch computation earlier in the pipeline -branch outcome: add comparator to ID stage -branch target: add adder to ID stage -Predict branch not taken -if correct, no bubbles inserted -if wrong, flush pipe, inserting one bubble

Splitting Caches

-Most chips have separate caches for instructions and data -Often noted as $I and $D or I-cache and D-cache -L1 cache is the one closest to the CPU -Advantages -Extra access port, bandwidth -Low hit time -Customize to specific patterns (e.g., line size) -Disadvantages -Capacity utilization -Miss rate

Performance Impact of Branch Stalls

-Need to stall for one cycle on every branch -Consider the following case -The ideal CPI of the machine is 1 -The branch causes a stall -Effective CPI if 15% of the instructions are branches? -The new effective CPI is 1 + 1 x 0.15 = 1.15 -The old effective CPI was 1 + 3 x 0.15 = 1.45

Fully Associative Cache

-Opposite extreme in that it has no cache index to hash -Use any available entry to store memory elements -No conflict misses, only capacity misses -Must compare cache tags of all entries to find the desired one -Expensive and slow Pic in Folder

P6 2-Level Page Table Structure

-Page directory -One page directory per process -1024 4-byte page directory entries (PDEs) that point to page tables -Page directory must be in memory if process running -Always pointed to by PDBR -Page tables -1024 4-byte page table entries (PTEs) that point to pages -Page tables can be paged in and out Pic in Folder: 2 level Page Table Example

Page Faults

-Page fault → an exception case -HW indicates exception cause and problematic address -OS handler invoked to move data from disk into memory -Current process suspends, others can resume -OS has full control over replacement -When process resumes, repeat load or store Pic in Folder: Page Table Faults

Final Problem: Page Table Size

-Page table size is proportional to the size of the address space -x86 example: virtual addresses are 32 bits, pages are 4KB -Total number of pages: 2^32 / 2^12 = 1 million -Page Table Entries (PTEs) are 32 bits wide -Total page table size is therefore 1M x 4 bytes = 4MB (worked out below) -But only a small fraction of those pages are actually used! -Why is this a problem? -The page table must be resident in memory (why?) -What happens for the 64-bit version of x86? -What about running multiple programs?
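
The arithmetic from this example as a short sketch; the 32-bit addresses, 4KB pages, and 4-byte PTEs are the card's own parameters:

    # Flat page table size for the 32-bit x86 example
    virtual_address_bits = 32
    page_size = 4 * 1024        # 4KB pages -> 12 offset bits
    pte_size = 4                # 32-bit page table entries
    num_pages = 2**virtual_address_bits // page_size   # 2^20 = 1M pages
    table_size = num_pages * pte_size                  # 4MB per process
    print(num_pages, table_size)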

Multilevel Caches

-Primary (L1) caches attached to CPU -Small, but fast -Focusing on hit time rather than miss rate -Level-2 cache services misses from primary cache -Larger, slower, but still faster than main memory -Unified instruction and data (why?) -Focusing on low miss rate rather than low hit time (why?) -Main memory services L2 cache misses -Many chips include L3 cache

Locality

-Principle of locality -Programs work on a relatively small portion of data at any time -Can predict data accessed in near future by looking at recent accesses -Temporal locality -If an item has been referenced recently, it will probably be accessed again soon -Spatial locality -If an item has been accessed recently, nearby items will tend to be referenced soon

Handler Actions

-Read cause, and transfer to relevant handler -A software switch statement -May be necessary even if vectored interrupts are available -Determine action required: -If restartable -Take corrective action -Use exception PC (EPC) to return to program -Otherwise -Terminate program (e.g., segfault, ...) -Report error using EPC, cause, ...

Register Renaming

-Rename (map) architectural registers to physical registers in decode stage to get rid of false dependencies

Simple Scoreboarding

-Scoreboard: a bit-array, 1-bit for each GPR -if the bit is not set, the register has valid data -if the bit is set, the register has stale data -i.e. Some outstanding instruction is going to change it
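
A toy sketch of the scoreboard bit-array; the function names and the one-bit-per-register layout are illustrative assumptions, not any particular machine's design:

    # Scoreboard sketch: one "pending write" bit per general-purpose register
    NUM_GPRS = 32
    pending = [False] * NUM_GPRS   # False = register holds valid data

    def can_issue(src_regs, dst_reg):
        # Issue only if no source (or the destination) is awaiting an outstanding write
        return not any(pending[r] for r in src_regs + [dst_reg])

    def issue(src_regs, dst_reg):
        pending[dst_reg] = True    # destination is stale until writeback clears the bit

    def writeback(dst_reg):
        pending[dst_reg] = False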

Branch Prediction

-Static prediction vs. dynamic prediction -Static prediction schemes: -always predict taken -always predict not-taken -compiler/programmer hint -if (target < PC) predict taken, else predict not-taken -What is the rationale behind this? (Backward branches usually close loops, and loop branches are usually taken)

Memory Hierarchy

-Store everything on disk or Flash -Copy recently accessed and nearby data to smaller DRAM memory -DRAM is called main memory -Copy more recently accessed and nearby data to smaller SRAM memory -Called the cache Pic in folder

Translation Look-aside Buffer

-TLB = a hardware cache just for translation entries -Very similar to L1, very close to the processor -A hardware cache specializing in page table entries -Key idea: locality in accesses → locality in translations -TLB design: similar issues to all caches -Basic parameters: capacity, associativity, replacement policy -Basic optimizations: instruction/data TLBs, multi-level TLBs, ... -Misses may be handled by HW or SW -x86: hardware services misses -MIPS: software services misses through an exception (see the sketch below) Link for help: https://www.geeksforgeeks.org/translation-lookaside-buffer-tlb-in-paging/
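
A minimal lookup sketch, assuming 4KB pages and a made-up direct-mapped, 16-entry TLB (real TLBs are set-associative, as noted above); the entry layout is an assumption for illustration:

    # Direct-mapped TLB lookup sketch (4KB pages -> 12-bit page offset)
    TLB_ENTRIES = 16
    tlb = [{"valid": False, "tag": 0, "ppn": 0} for _ in range(TLB_ENTRIES)]

    def tlb_lookup(vaddr):
        vpn = vaddr >> 12                      # virtual page number
        index = vpn % TLB_ENTRIES              # low VPN bits pick the entry
        tag = vpn // TLB_ENTRIES               # remaining VPN bits are the tag
        entry = tlb[index]
        if entry["valid"] and entry["tag"] == tag:
            return (entry["ppn"] << 12) | (vaddr & 0xFFF)   # hit: physical address
        return None                            # miss: walk the page table (HW or SW)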

Virtually Indexed, Physically tagged Caches

-Translation and cache access in parallel -Start the cache access with the page offset -Tag check uses the physical address -Only works when -VPN bits are not needed for the cache lookup -Cache Size <= Page Size * Associativity -I.e., Set Size <= Page Size -OK, we want L1 to be small anyway Pic in Folder: Virtually Indexed, Physically tagged(indexed) Caches

Solution: Multi-Level Page Tables

-Use a hierarchical page table structure -Example: Two Levels -First Level: directory entries -Second Level: actual page table entries -Only top level must be resident in memory -Remaining levels can be in memory, on disk, or unallocated -Unallocated if corresponding ranges of the virtual address space are not used

Page table Performance?

-Virtual memory is great but -We just doubled the memory accesses -A load requires an access to the page table first -Then an access to the actual data -How can we do translation fast? -Without an additional memory access?

Cache Write Policy

-What happens on a cache write that misses? -It's actually two sub-questions -Do you allocate space in the cache for the address? -Write-allocate VS no-write allocate -Actions: select a cache entry, evict old contents, update tags, ... -Do you fetch the rest of the block contents from memory? -Of interest if you do write allocate -Remember a store updates up to 1 word from a wider block -Fetch-on-miss VS no-fetch-on-miss -For no-fetch-on-miss must remember which words are valid -Use fine-grain valid bits in each cache line

Cache Questions

-Where can I store a particular piece of data? (mapping) -Direct mapped (single location) -Fully associative (anywhere) -Set associative (anywhere in a set) -What do I throw out to make room? (replacement policy) -How much data do I move at a time? (block Size) -How do we handle writes? -Bypass cache -Write thru the cache -Write into the cache -- and then write back

What about writes?

-Where do we put the data we want to write -In the cache? -In main memory? -In both? -Caches have different policies for this question -Most systems store the data in the cache -Some also store the data in memory as well -Interesting Observation -Processor does not need to "wait" until the store completes

Replacement Methods

-Which line do you replace on a miss? -Direct mapped -Easy, you have only one choice -Replace the line at the index you need -N-way set associative -Need to choose which way to replace -Random (choose one at random) -Least Recently Used (LRU) (replace the one used least recently) -Keep an encoded permutation - for N ways, N! orderings -For a 4-way cache: how many orderings? How many bits to encode them? (worked out in the sketch below)
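
A sketch of LRU bookkeeping for one set, answering the 4-way question above; the list-based ordering is just one way to represent the permutation:

    # LRU for one 4-way set: keep the ways ordered from least to most recently used
    import math
    ways = 4
    print(math.factorial(ways))                        # 24 possible orderings
    print(math.ceil(math.log2(math.factorial(ways))))  # 5 bits to encode one ordering

    lru_order = [0, 1, 2, 3]    # way 0 is least recently used, way 3 most recently used
    def touch(way):             # on an access, move that way to the MRU end
        lru_order.remove(way)
        lru_order.append(way)
    def victim():               # on a miss, replace the least recently used way
        return lru_order[0]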

Typical write miss action choices

-Write-back caches -Write-allocate, fetch-on-miss (why?) -Write-through caches -Write-allocate, fetch-on-miss -Write-allocate, no-fetch-on-miss -No-write-allocate, write-around -Which program patterns match each policy? -Modern HW supports multiple policies -Selected by the OS at some coarse granularity (e.g., 4KB)

Write Policy Trade-offs

-Write-through -Misses are simpler and cheaper (no write-back to memory) -Easier to implement -But requires buffering to be practical (see following slide) -Uses a lot of bandwidth to the next level of memory -Every write goes to next level -Not power efficient! -Write-back -Writes are fast on a hit (no write to memory) -Multiple writes within a block require only one "writeback" later

Cache Write Policies: Major Options

-Write-through (write data go to both cache and memory) -Main memory is updated on each cache write -Replacing a cache entry is simple (just overwrite it with the new block) -Memory write causes significant delay if the pipeline must stall -Write-back (write data only goes to the cache) -Only the cache entry is updated on each cache write, so main memory and the cache data are inconsistent -Add a "dirty" bit to the cache entry to indicate whether the data in the cache entry must be committed to memory -Replacing a cache entry requires writing the data back to memory before replacing the entry if it is "dirty" (see the sketch below)
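
A rough sketch contrasting the two policies on a write hit; the line record and function names are made up for illustration:

    # Write-hit handling sketch: write-through vs. write-back
    line = {"data": bytearray(32), "dirty": False}

    def write_through(offset, value, memory_write):
        line["data"][offset] = value
        memory_write(offset, value)        # memory is updated on every write

    def write_back(offset, value):
        line["data"][offset] = value
        line["dirty"] = True               # memory is updated later, on eviction

    def evict_write_back(memory_write_block):
        if line["dirty"]:
            memory_write_block(line["data"])   # commit the whole block once
            line["dirty"] = False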

Clearing up Page Table Stuff

A page table is created when a program is launched; when that program closes, the page table vanishes. When the table is created, the valid bit of every entry is cleared, so that you don't get information from other programs. After the program actually starts needing access to memory, the table maps each page one at a time as the accesses come. So when you load a new page that is not mapped to memory yet, the system must do a swap and remove an element from memory; that element is put onto the disk. After all of this, the address can be linked between the page table and the new slot made in memory. The disk acts as temporary storage: you can store pages on the disk, but before you can access them again, you need to swap them back in.

Cache Definition

A cache is a limited amount of memory that stores a redundant copy of data that normally resides in the bigger memory. The cache can only hold so much at once. How do we locate data that may be in the cache? When we are replacing a value in the cache, where do we put it?

Branch Prediction Example

Branch Prediction 2 image

The Memory Problem

Build a big, fast, cheap memory -Big memories are slow -Even when built from fast components -Even if they are on-chip -Fast memories are expensive

Direct Mapped Problems

Conflict Misses -Two blocks are used concurrently and map to the same index -Only one can fit in the cache, regardless of cache size -No flexibility in placing the 2nd block elsewhere

Directly Mapped

Each of the millions of possible blocks in memory has exactly one slot in the cache where it can go. This is usually done by using a modulo to determine which index in the cache it uses: if the cache has 8 slots, compute the address modulo 8. Very easy to implement; just compute the modulo to see if the address is in the cache. When writing to the cache (inserting an element), it is easy to figure out where to put it.

Forwarding

Grab the operand from pipeline stage, rather than register file. So instead of waiting for the operand to be put into a register to be used, you can pull the result straight from ALU to be used in the next instruction. Forwarding is also called Bypassing.

Control Instructions

IF - Instruction Fetch phase - the phase where the PC is incremented and the next instruction is read from memory. ID - Instruction Decode and Register Fetch phase - the phase where the registers are read, where immediates are sign extended, and also the phase whose register file the WB stage writes back into. EX - Execute phase - the phase where all of the ALU action takes place. MEM - Memory Access phase - the phase where the ALU result is either used to access memory or passed on to WB. WB - Write Back phase - the phase where the result is written back into a register in the register file read during ID.

Set associative / Fully associative caches

In set associative / fully associative caches, multiple values that map to the same cache index can be kept with no issues. Set associative means multiple entries are allowed per slot (set). Fully associative means each value can be mapped arbitrarily anywhere in the cache. These are good because some of the conflicts seen in a directly mapped cache are avoided.

Multilevel On-Chip Caches

L2 and L3 take a significant amount of space, L1 is generally attached to processor and is tiny, together all 3 take up almost half of the chip.

Dynamic Scheduling in a modern OOO MIPS R10000

Look for the image in the file. You can check the scoreboard to tell if an instruction is independent and then perform those instructions first.

Locality Examples

Look in folder

Larger Block Size

Motivation: exploit spatial locality and amortize overheads -This example: 64 blocks, 16 bytes / block -To what block number does address 1200 map? -Block address = 1200 / 16 = 75 -Block index = 75 modulo 64 = 11 (see the sketch below) -What is the impact of larger blocks on tag / index size? -Larger blocks mean more byte-offset bits; for a fixed cache size the index gets smaller and the tag stays the same -What is the impact of larger blocks on the cache overhead? -Overhead = tags and valid bits
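
The mapping arithmetic from this example as a quick sketch:

    # Map byte address 1200 in a cache of 64 blocks with 16 bytes per block
    block_size, num_blocks = 16, 64
    addr = 1200
    block_address = addr // block_size        # 75
    block_index = block_address % num_blocks  # 11
    byte_offset = addr % block_size           # 0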

Reads and Writes in the Cache

On a read, you want to know if the address you are looking for is in the cache. On a write (inserting an element into the cache) you need to figure out where it needs to go. This is done with a modulo in a directly mapped cache.

Review: Cache Organization Options - 256 bytes, 16 byte block, 16 blocks

Pic in Folder: Cache Organization Options

Explaining the Last Homework a Bit

Pic in Folder: Homework Help Pic -Take the number of entries and divide by the associativity to get the number of sets -Find n such that 2^n = that result (n is the number of index bits) -Find m such that 2^m = block size (m is the number of offset bits) -Subtract n and m from the address width to get the tag size -Ex. (first line): 32 entries / 2-way associativity = 16 sets; 2^n = 16 so n = 4; 2^m = 4 so m = 2; 32 - 4 - 2 = 26, so the tag size is 26 bits (see the sketch below)
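
The same steps as a small sketch; the 32-bit address width is assumed, and the example line is the one worked above:

    # Tag width = address width - index bits - block-offset bits
    import math
    def tag_bits(entries, associativity, block_size, address_width=32):
        sets = entries // associativity
        n = int(math.log2(sets))          # index bits
        m = int(math.log2(block_size))    # block offset bits
        return address_width - n - m

    print(tag_bits(32, 2, 4))             # 32 - 4 - 2 = 26 bits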

Overview of P6 Address Translation

Pic in Folder: Overview of P6 Address Translation

P6 Page Directory Entry (PDE)

Pic in Folder: P6 Page Directory Entry

P6 Page Table Entry (PTE)

Pic in Folder: P6 Page Table Entry

P6 Translation

Pic in Folder: P6 Translation

Page Table

Pic in Folder: Page Table Definitions -One page table per process, stored in memory -Process == instance of a program -One entry per virtual page number Usually a page is 4KB in size. The lower 12 bits of an address are called the page offset; the 20 upper (most significant) bits represent the virtual page number. You use the 20 upper bits to choose which page you need to go to, and then the 12 lower bits to select the specific byte that you need on that page (see the sketch below). The memory that holds the page table is only accessible by the OS; the user cannot see it at all, it is basically invisible. The Page Table Base Register holds the address in memory of the start of the page table. The page table is an array in main memory, so you can just add an offset to this base register to select the entry you want, just like a regular array. Page Table Entries can hold more than just translations; they can hold access rights that define legal access (Read, Write, Execute).
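
A sketch of the VPN/offset split and lookup described above, assuming 32-bit virtual addresses, 4KB pages, and a plain Python list standing in for the in-memory table:

    # One-level page table walk sketch: 4KB pages -> 12-bit offset, 20-bit VPN
    PAGE_OFFSET_BITS = 12
    page_table = [None] * (1 << 20)     # one entry per virtual page number

    def translate(vaddr):
        vpn = vaddr >> PAGE_OFFSET_BITS                  # upper 20 bits pick the entry
        offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)   # lower 12 bits pick the byte
        pte = page_table[vpn]
        if pte is None or not pte["valid"]:
            raise Exception("page fault")                # OS handler loads the page, then retry
        return (pte["ppn"] << PAGE_OFFSET_BITS) | offset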

Set Associative Cache Design

Pic in Folder: Set associative cache design

TLB Organization

Pic in Folder: TLB Organization Diagram

TLB Translation

Pic in Folder: TLB Translation

Two-level Page Tables

Pic in Folder: Two-level Page Tables Example -Disadvantage: multiple page faults -Accessing a PTE page table can cause a page fault -Accessing the actual page can cause a second page fault -TLB plays an even more important role

Representation of Virtual Address Space

Pic in Folder: Virtual Address Space Example

Write Miss Actions

Pic in Folder: Write Miss Actions Table

Associative Cache Example

Pic in folder

Cache Organization

Pic in folder

Memory Hierarchy 2

Pic in folder

Avoiding the Stalls for Write-Through

Pic in folder: Avoiding Stalls example -Use Write Buffer between cache and memory -Processor writes data into the cache and the write buffer -Memory controller slowly "drains" buffer to memory -Write Buffer: a first-in-first-out buffer (FIFO) -Typically holds a small number of writes -Can absorb small bursts as long as the long-term rate of writing to the buffer does not exceed the maximum rate of writing to DRAM

Set Associative

So set associative allows more entries per set, allowing for more flexibility but a slightly slower search. Fully associative allows entries to go to any available block; however, every single block needs to be checked to see whether or not your address is present in the cache. In a 2-way set associative cache, each set has two entries.

Problem with directly mapped caches

The problem with directly mapped caches is that the modulo allows for conflicts. For instance, 0, 8, and 16 all map to the same slot so you can only have one in the cache at a time. You won't have efficient use of the cache in some cases because the whole cache will not be used. This can be addressed by associativity.

Branch CPI

Branch CPI Image

What is the tag for?

It is used because multiple addresses can be put into the same cache slot. You can use the tag to double check that the value you read really is the correct one. The tag is a part of the address, so by comparing the address used for the lookup with the stored tag you can find out whether it's the right one.

Instruction Level Parallelism

Look in Pics Folder

DRAM

Main Memory

