Memory Hierarchy
TLB Optimizations: Large Pages
+ Good TLB performance (fewer memory accesses) + Good for regular access patterns (streaming, sequential) - Internal fragmentation - Expensive data movement and longer access time
Reducing the Miss Rate: Higher Associativity
- Eight-way associativity is for practical purposes as good as full associativity - A direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2 - Drawback: increased hit time
Cache Optimization: Using Victim Caches
- A buffer to place data just evicted from cache - A small number of fully associative entries - Accessed in parallel with cache (no increase in hit time) - On a hit in the VC, swap blocks in VC and cache - When used with direct mapped caches, it has the effect of adding associativity to the most recently used cache blocks
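A rough sketch of the lookup-and-swap behavior described above, assuming a direct-mapped main cache and a 4-entry victim cache; the structure names and sizes are illustrative, and both structures store full block addresses for simplicity (a real cache keeps only the tag bits plus the data):

```c
#include <stdbool.h>
#include <stdint.h>

#define MAIN_SETS  256   /* direct-mapped main cache, illustrative size */
#define VC_ENTRIES 4     /* small fully associative victim cache        */

typedef struct { bool valid; uint32_t block; } Entry;

static Entry main_cache[MAIN_SETS];
static Entry victim[VC_ENTRIES];

/* Returns true on a hit in either the main cache or the victim cache. */
bool lookup(uint32_t block_addr) {
    uint32_t index = block_addr % MAIN_SETS;

    if (main_cache[index].valid && main_cache[index].block == block_addr)
        return true;                       /* ordinary cache hit */

    /* Probe the victim cache; in hardware this happens in parallel. */
    for (int i = 0; i < VC_ENTRIES; i++) {
        if (victim[i].valid && victim[i].block == block_addr) {
            Entry evicted     = main_cache[index];
            main_cache[index] = victim[i]; /* promote the victim entry     */
            victim[i]         = evicted;   /* evicted block becomes victim */
            return true;                   /* slow hit via the victim cache */
        }
    }
    return false;  /* miss in both: fetch from the next level (not shown) */
}
```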
Block
- A fixed-size collection of data containing the requested word - Also called a cache line - In virtual memory, the corresponding fixed-size unit is a page
Reducing the Miss Rate: Multi-level Cache
- Adding another level of cache between the original cache and memory - First-level cache can be small enough to match the clock cycle time of the fast processor - Second-level cache can be large enough to capture many accesses that would go to main memory, lessening the effective miss penalty
Cache Optimization: Way Prediction to Reduce Hit Time
- Combines the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache - Check one predicted way first (speed of a direct-mapped cache) - On a miss, check the other way; if it hits there, call it a pseudo-hit (slow hit) - Way prediction keeps extra bits to indicate which way to check first (updated dynamically) - The prediction can be extended beyond 2-way set-associative caches - Saves power - Drawback: pipelining the CPU is harder when a hit sometimes takes 1 cycle and sometimes 2
Cache Optimization: Small and Simple Caches to Reduce Hit Time
- Critical timing path: address the tag memory using the index, compare the tags, then select the correct data item - Lower associativity shortens this path - Direct-mapped caches can overlap the tag comparison with transmission of the data
Cache Optimization: Critical Word First and Early Restart to Reduce Miss Penalty
- Don't wait for the full block to be loaded before restarting the CPU - Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution - Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; generally useful only with large blocks - Beneficial when cache lines (blocks) are long - If the CPU wants the next sequential word, early restart may not be useful
What are the advantages of the write through policy?
- Easier to implement than write back - The cache is always clean, so read misses never result in writes to the lower level - The next lower level has the most current copy of the data, which simplifies data coherency (important for multiprocessors and for I/O) - Multilevel caches make write through more viable for the upper-level caches, as the writes need only propagate to the next lower level rather than all the way to main memory
TLB Optimization: Eager Paging
- Expose range information at allocation time instead of waiting for the first access (demand paging does not allow this optimization) - Allows the OS to aggressively optimize the allocated range
Direct Mapped Cache
- For each item (block) of data in memory, there is exactly one location in the cache where it might be - Index = (Block address) MOD (Number of blocks in cache) - The cache can be accessed using the low-order address bits - Data must be tagged using the high-order address bits - A valid bit is used for each block
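A minimal sketch of how an address splits into offset, index, and tag fields for a direct-mapped cache; the 64-byte blocks and 256-block capacity are assumed example parameters, not from the card:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters only: 64-byte blocks, 256 blocks (16 KiB cache). */
#define BLOCK_SIZE   64u
#define NUM_BLOCKS   256u
#define OFFSET_BITS  6u    /* log2(BLOCK_SIZE) */
#define INDEX_BITS   8u    /* log2(NUM_BLOCKS) */

int main(void) {
    uint32_t addr = 0x12345678;

    uint32_t offset = addr & (BLOCK_SIZE - 1);                  /* byte within the block        */
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1); /* block address mod num blocks */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);       /* high-order bits              */

    printf("offset=%u index=%u tag=0x%x\n", offset, index, tag);
    return 0;
}
```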
Cache Optimization: Non-Blocking Caches to Increase Bandwidth
- Hit under miss allows the data cache to continue to supply cache hits during a miss--useful only with OOO execution - Hit under multiple miss or miss under miss may further lower the effective miss penalty by overlapping multiple misses - Significantly increases the complexity of the cache controller (multiple outstanding memory accesses) - Requires multiple memory banks (otherwise multiple outstanding misses cannot be serviced) - The Pentium Pro allows 4 outstanding memory misses
Reducing the Miss Rate: Larger Block Size
- Increase the block size - Larger blocks exploit spatial locality - Larger block sizes reduce compulsory misses - Because they reduce the number of blocks in the cache, larger blocks may increase conflict misses and even capacity misses if the cache is small
Reducing the Miss Rate: Larger Caches
- Increase the capacity of the cache - The drawback is potentially longer hit time and higher cost and power - This technique is popular in off-chip caches
Cache Optimization: Pipeline Cache Access to Increase Bandwidth
- Splitting cache access across pipeline stages gives a fast clock cycle but increases the number of pipeline stages, which increases the branch misprediction penalty - Makes it easier to increase associativity
Cache Optimization: Multi-Bank Caches to Increase Bandwidth
- Individual memory controller for each bank - Each bank may have its own address and data lines - How blocks are interleaved affects performance
Cache Optimization: Merging Write Buffer to Reduce Miss Penalty
- Most useful with write-through caches - Combine writes to individual words of the same block into a single block write - Writing one block is faster than writing the words individually
TLB Optimization: Compression
- Exploits regular-stride page accesses (including consecutive pages) - Use fewer bits for the virtual and physical page numbers --> more translations fit in the TLB
Reducing the Hit Time: Avoiding Address Translation
- Send the virtual address to the cache (called a virtually addressed cache or just virtual cache) - On every context switch, flush the cache to avoid accessing data that belongs to another process (cost is the time to flush plus compulsory misses from the empty cache) - Does not support aliases (multiple virtual addresses mapping to the same physical address) - To avoid the cache flush, a process-identifier tag can be added to the cache blocks
TLB Optimization: Speculative TLB
- Similar to cache block prediction and prefetching (heuristics, machine learning) - Effectiveness depends on prediction accuracy and the page access pattern
Miss Penalty
- The cost per miss - How many cycles it takes to get the data into the cache
Why is there no need to compare more of the address than the tag?
- The offset should not be used in the comparison because the entire block is either present or not; therefore, all block offsets result in a match by definition - Checking the index is redundant, because it was used to select the set to be checked
Hit time
- The time to hit in the cache - Number of cycles it takes to access the cache
What are the advantages of the write back policy?
- Writes occur at the speed of the cache memory - Multiple writes within a block require only one write to the lower-level memory - Because some writes don't go to memory, write back uses less memory bandwidth, making write back attractive in multiprocessors - Write back uses the rest of the memory hierarchy and memory interconnect less than write through, meaning it also saves power, making it attractive for embedded applications
CPU Execution Time with Memory Stall Cycles Equation
= (CPU clock cycles + Memory stall cycles) x Clock cycle time = IC x (CPI(execution) + Memory stall clock cycles / Instruction) x Clock cycle time = IC x (CPI(execution) + (Misses / Instruction) x Miss penalty) x Clock cycle time = IC x (CPI(execution) + Miss rate x (Memory accesses / Instruction) x Miss penalty) x Clock cycle time
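A small worked example of the last form of the equation, with assumed values (1.0 base CPI, 1.5 memory accesses per instruction, 2% miss rate, 100-cycle miss penalty, 1 ns clock), not from the card:

```c
#include <stdio.h>

int main(void) {
    /* Assumed example values, not from the card. */
    double ic          = 1e9;   /* instruction count               */
    double cpi_exec    = 1.0;   /* CPI ignoring memory stalls      */
    double accesses_pi = 1.5;   /* memory accesses per instruction */
    double miss_rate   = 0.02;  /* 2% miss rate                    */
    double penalty     = 100;   /* miss penalty in clock cycles    */
    double cycle_time  = 1e-9;  /* 1 ns clock cycle                */

    double cpi_total = cpi_exec + miss_rate * accesses_pi * penalty;
    double cpu_time  = ic * cpi_total * cycle_time;

    printf("effective CPI = %.2f, CPU time = %.3f s\n", cpi_total, cpu_time);
    /* effective CPI = 1 + 0.02 * 1.5 * 100 = 4.00; CPU time = 4.000 s */
    return 0;
}
```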
Misses per Instruction
= (Miss rate x Memory accesses) / Instruction count = Miss rate x (Memory accesses / Instruction) Often reported as misses per 1000 instructions to show integers instead of fractions
Memory Stall Cycles / Instruction
= (Misses/Instruction) x (Total miss latency - Overlapped miss latency) = (Misses(L1) / Instruction) x Hit time(L2) + (Misses(L2) / Instruction) x Miss penalty(L2)
2 ^ index
= Cache size / (Block size x Set associativity)
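For example (assumed values, not from the card): a 32 KiB cache with 64-byte blocks and 4-way set associativity gives 2^index = 32768 / (64 x 4) = 128 sets, so the index field is 7 bits.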
Average Memory Stalls per Instruction
= Misses per instruction(L1) x Hit time(L2) + Misses per instruction(L2) x Miss penalty(L2)
Memory Stall Cycles Equation
= Number of Misses x Miss penalty = IC x (Misses/Instruction) x Miss penalty = IC x (Memory accesses/Instruction) x Miss rate x Miss penalty = IC x Reads per instruction x Read miss rate x Read miss penalty + IC x Writes per instruction x Write miss rate x Write miss penalty
Way
A bank in a set associative cache; there are four ways in a four-way set associative cache
Average Memory Access Time (AMAT)
A better measure of memory hierarchy performance. = Hit time + Miss rate x Miss penalty = Hit time(L1) + Miss rate(L1) x (Hit time(L2) + Miss rate(L2) x Miss penalty(L2))
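For example (assumed values): with Hit time(L1) = 1 cycle, Miss rate(L1) = 5%, Hit time(L2) = 10 cycles, Miss rate(L2) = 20%, and Miss penalty(L2) = 100 cycles, AMAT = 1 + 0.05 x (10 + 0.20 x 100) = 1 + 0.05 x 30 = 2.5 cycles.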
Translation Lookaside Buffer (TLB)
A cache that keeps track of recently used address mappings to try to avoid an access to the page table
Pseudo-LRU Replacement
A common approximation of LRU, where there is a set of bits for each set in the cache with each bit corresponding to a single way in the cache. When a set is accessed, the bit corresponding to the way containing the desired block is turned on; if all the bits associated with a set are turned on, they are reset with the exception of the most recently turned on bit. When a block must be replaced, the processor chooses a block from the way whose bit is turned off, often randomly if more than one choice is available.
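A sketch of this bit-per-way scheme for one set of a 4-way cache; the function names and the 4-way width are illustrative assumptions:

```c
#include <stdlib.h>

#define WAYS 4

static unsigned char used[WAYS];   /* one "recently used" bit per way */

/* Mark a way as recently used; if every bit is now set, clear all but this one. */
void touch(int way) {
    used[way] = 1;
    int all_set = 1;
    for (int i = 0; i < WAYS; i++)
        if (!used[i]) { all_set = 0; break; }
    if (all_set) {
        for (int i = 0; i < WAYS; i++) used[i] = 0;
        used[way] = 1;              /* keep only the most recently used bit */
    }
}

/* Pick a victim: any way whose bit is off, chosen randomly among candidates. */
int choose_victim(void) {
    int candidates[WAYS], n = 0;
    for (int i = 0; i < WAYS; i++)
        if (!used[i]) candidates[n++] = i;
    return candidates[rand() % n];  /* n >= 1: touch() never leaves all bits set */
}
```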
Write Buffer
A common optimization to reduce write stalls. This allows the processor to continue as soon as the data are written to the buffer, thereby overlapping processor execution with memory updating. Write stalls can still occur even with these
Set
A group of blocks in the cache
Reducing the Time to Hit in the Cache
Avoiding address translation when indexing the cache
First In, First Out (FIFO) Replacement
Because LRU can be complicated to calculate, this approximates LRU by determining the oldest block rather than the LRU
What does the selection of block size depend on?
Both the latency and the bandwidth of the lower-level memory. High latency and high bandwidth encourage large block size because the cache gets many more bytes per miss for a small increase in miss penalty. Conversely, low latency and low bandwidth encourage smaller block sizes because there is little time saved from a larger block
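For example (assumed numbers): with 100 ns latency and 16 bytes/ns of bandwidth, a 64-byte block costs 100 + 64/16 = 104 ns per miss while a 128-byte block costs 100 + 128/16 = 108 ns, so doubling the block size adds under 4% to the miss penalty; with the same latency but only 1 byte/ns of bandwidth, the same doubling raises the penalty from 164 ns to 228 ns.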
Address Tag
Each block frame in the cache has one of these giving the block address. The tag of every cache block that might contain the desired information is checked to see whether it matches the block address from the processor. All possible tags are searched in parallel
Tag Field
The portion of the address that is compared against the stored tags to determine whether the access is a hit
What type of miss is the easiest to deal with conceptually?
Conflict misses, because fully associative placement avoids all conflict misses; however, full associativity is expensive in hardware and may slow the processor clock rate, leading to lower overall performance
Cache Optimization: Compiler Prefetching to Reduce Miss Rate and Penalty
Data prefetch: load the data into a register - Cache prefetch: load the data only into the cache - Special prefetching instructions should not cause premature page faults - Issuing prefetch instructions takes time - Works only if prefetching can be overlapped with execution
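A minimal cache-prefetch sketch, assuming the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 16 elements is an illustrative tuning choice, not from the card:

```c
/* Prefetch b[i + 16] while computing on b[i], so the memory access
 * overlaps with execution of the current iterations. */
void scale(double *a, const double *b, long n) {
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&b[i + 16], /*rw=*/0, /*locality=*/1);
        a[i] = 3.0 * b[i];
    }
}
```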
How to deal with capacity misses?
Enlarge the cache
True or False: All objects referenced by a program need to reside in main memory
False
True or False: If the total cache size is kept the same, increasing associativity decreases the number of blocks per set, thereby decreasing the size of the index and increasing the size of the tag
False; increasing associativity increases the number of blocks per set, thereby decreasing the size of the index and increasing the size of the tag
Fully Associative Cache
If a block can be placed anywhere in the cache
Set Associative Cache
If a block can be placed in a restricted set of places in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within that set. The set is usually chosen by bit selection, that is, (Block address) MOD (Number of sets in the cache)
Conflict Miss
If the block placement strategy is set associative or direct mapped, these misses will occur because a block may be discarded and later retrieved if too many blocks map to its set. Also called collision misses. Occur going from fully associative to eight-way associative, four-way associative, and so on
Capacity Miss
If the cache cannot contain all the blocks needed during execution of a program, these misses will occur because of blocks being discarded and later retrieved. Occur in a fully associative cache
Thrash
If the upper-level memory is much smaller than what is needed for a program, and a significant percentage of the time is spent moving data between two levels in the hierarchy, the memory hierarchy is said to do this
N-Way Associative Cache
If there are n blocks in a set
Reducing the Miss Rate: Read Priority over Write on Miss
If write through: - Write buffers avoid stalling - May cause RAW conflicts with reads on cache misses - Waiting for the write buffer to empty would increase the read miss penalty (solution: check the write buffer contents before the read; if there are no conflicts, let the memory access continue) If write back: - Normal approach: write the dirty block to memory, then do the read - Instead, copy the dirty block to a write buffer, do the read, and then do the write - The CPU stalls less since it restarts as soon as the read completes
Cache Optimization: Hardware Prefetching to Reduce Miss Rate and Penalty
Instruction prefetching: bring the neighboring cache lines into the cache before they are accessed - Can fetch 2 (or more) blocks on a miss - Extra block placed in stream buffer - On miss, check stream buffer--if found, move to cache and prefetch next Data Prefetching: - May have multiple stream buffers beyond the cache, each prefetching at a different address - Relies on extra memory bandwidth that can be used without penalty
Cache Optimization: Compiler Optimizations to Reduce the Miss Rate
Instructions: - Reorder procedures in memory to reduce conflict misses - Align basic blocks with cache blocks (lines) - Use profiling to find conflicts Data: - Merging arrays: improve spatial locality by using a single array of compound elements instead of two separate arrays - Loop fusion: combine two independent loops that have the same loop bounds and overlapping variables - Loop interchange: change the nesting of loops to access data in the order it is stored in memory - Blocking: improve temporal locality by accessing blocks of data repeatedly instead of going down whole columns or rows
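As one illustration, a loop-interchange sketch (the array size and function names are assumed): because C stores arrays row-major, the interchanged version walks memory in the order it is stored.

```c
#define N 1024
double x[N][N];

void before(void) {                 /* strides through memory by N doubles */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = 2.0 * x[i][j];
}

void after(void) {                  /* accesses data in the order it is stored */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2.0 * x[i][j];
}
```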
What happens on a cache miss?
It is handled by hardware and causes processors using in-order execution to pause, or stall, until the data are available. With out-of-order execution, an instruction using the result must still wait, but other instructions may proceed during the miss
What is a virtue of random replacement?
It is simple to build in hardware
Reducing the Miss Rate
Larger block size, larger cache size, and higher associativity
What does the time required to handle a cache miss depend on?
The latency and bandwidth of the memory. Latency determines the time to retrieve the first word of the block, and bandwidth determines the time to retrieve the rest of the block
Coherency Miss
Misses due to cache flushes to keep multiple caches coherent in a multiprocessor
Reducing the Miss Penalty
Multilevel caches and giving reads priority over writes
What are the Four Memory Hierarchy Questions?
Q1: Where can a block be placed in the upper level? (block placement) Q2: How is a block found if it is in the upper level? (block identification) Q3: Which block should be replaced on a miss? (block replacement) Q4: What happens on a write? (write strategy)
What instruction type dominates processor cache accesses? Why?
Reads; all instruction accesses are reads, and most instructions don't write to memory
What is the classical approach to improving cache behavior?
Reducing miss rates
Valid Bit
Says whether or not this entry contains a valid address. If the bit is not set, there cannot be a match on this address
Block Offset Field
Selects the desired data from the block
Index Field
Selects the set
Temporal Locality
Tells us that we are likely to need this word again in the near future, so it is useful to place it in the cache where it can be accessed more quickly
Pages
The address space is usually broken into these fixed-size blocks
Write Allocate
The block is allocated on a write miss, followed by the preceding write hit actions. In this natural option, write misses act like read misses
Miss Rate
The fraction of cache accesses that result in a miss (i.e., the number of accesses that miss divided by the total number of accesses)
Cache
The highest or first level of the memory hierarchy encountered once the address leaves the processor
Write Back Policy
The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced
Write Through Policy
The information is written to both the block in the cache and to the block in the lower-level memory
Global Miss Rate
The number of misses in the cache divided by the total number of memory accesses generated by the processor
Compulsory Miss
The very first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold-start misses or first-reference misses. Occur in an infinite cache
Spatial Locality
There is a high probability that the other data in a block will be needed soon
No-Write Allocate
In this apparently unusual alternative, write misses do not affect the cache. Instead, the block is modified only in the lower-level memory
Local Miss Rate
This rate is the number of misses in a cache divided by the total number of memory accesses to this cache
Dirty Bit
This reduces the frequency of writing back blocks on replacement. It is a status bit that indicates whether the block is dirty (modified while in the cache) or clean (not modified). If it is clean, the block is not written back on a miss, because identical information to the cache is found in lower levels
Least Recently Used (LRU) Replacement
To reduce the chance of throwing out information that will be needed soon, accesses to blocks are recorded. Relying on the past to predict the future, the block replaced is the one that has been unused for the longest time. LRU relies on a corollary of locality: if recently used blocks are likely to be used again, then a good candidate for disposal is the least recently used block
Random Block Replacement
To spread allocation uniformly, candidate blocks are randomly selected for replacement. Some systems generate pseudorandom block numbers to get reproducible behavior, which is particularly useful when debugging hardware
True or False: At any time, each page resides either in main memory or on disk
True
True or False: Because page faults take so long, they are handled in software and the processor is not stalled
True
True or False: Blocks stay out of the cache in no-write allocate until the program tries to read the blocks, but even blocks that are only written will still be in the cache with write allocate
True
Cache Miss
When the processor does not find a data item it needs in the cache
Cache Hit
When the processor finds a requested data item in the cache
Write Stall
When the processor must wait for writes to complete during write through
Page Fault
When the processor references an item within a page that is not present in the cache or main memory, this occurs, then the entire page is moved from the disk to main memory