Memory Hierarchy

TLB Optimizations: Large Pages

+ Good TLB performance (fewer memory accesses)
+ Good for regular access patterns (streaming, sequential)
- Internal fragmentation
- Expensive data movement and longer access time

Reducing the Miss Rate: Higher Associativity

- Eight-way set associativity is, for practical purposes, as good as full associativity
- A direct-mapped cache of size N has about the same miss rate as a two-way set associative cache of size N/2
- Drawback: increased hit time

Cache Optimization: Using Victim Caches

- A buffer to hold blocks just evicted from the cache
- A small number of fully associative entries
- Accessed in parallel with the cache (no increase in hit time)
- On a hit in the victim cache, swap the blocks in the victim cache and the cache
- When used with direct-mapped caches, it has the effect of adding associativity to the most recently used cache blocks

Block

- A fixed-size collection of data containing the requested word - Also called cache line - Organized into pages

Reducing the Miss Penalty: Multi-level Cache

- Add another level of cache between the original cache and memory
- The first-level cache can be small enough to match the clock cycle time of the fast processor
- The second-level cache can be large enough to capture many accesses that would otherwise go to main memory, lessening the effective miss penalty

Cache Optimization: Way Prediction to Reduce Hit Time

- Combines the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set associative cache
- Check one way first (speed of a direct-mapped cache)
- On a miss, check the other way; if it hits, call it a pseudo-hit (slow hit)
- The way prediction is a bit that indicates which half to check first (changes dynamically)
- May be extended to more than 2-way set associative caches
- Saves power
- Drawback: the CPU pipeline is harder to design if a hit sometimes takes 1 cycle and sometimes 2

Cache Optimization: Small and Simple Caches to Reduce Hit Time

- Critical timing path: address the tag memory with the index, compare the tags, then select the correct data item
- Lower associativity reduces hit time
- Direct-mapped caches can overlap the tag comparison with the transmission of the data

Cache Optimization: Critical Word First and Early Restart to Reduce Miss Penalty

- Don't wait for the full block to be loaded before restarting the CPU
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; generally useful only with large blocks
- Beneficial when cache lines (blocks) are long
- If the CPU wants the next sequential word, early restart may not be useful

What are the advantages of the write through policy?

- Easier to implement than write back
- The cache is always clean, so read misses never result in writes to the lower level
- The next lower level has the most current copy of the data, which simplifies data coherency (important for multiprocessors and for I/O)
- Multilevel caches make write through more viable for the upper-level caches, as writes need only propagate to the next lower level rather than all the way to main memory

TLB Optimization: Eager Paging

- Expose range information at allocation time instead of waiting for the first access (demand paging does not allow for this optimization)
- Allows the OS to aggressively optimize the range

Direct Mapped Cache

- For each item (block) of data in memory, there is exactly one location in the cache where it might be
- Index = (block address) mod (number of blocks in the cache) (see the sketch below)
- The cache can be accessed using the low-order address bits
- Data must be tagged using the high-order address bits
- A valid bit is used for each block
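
A minimal sketch of how a direct-mapped cache splits an address into block offset, index, and tag. The cache size, block size, and the function name split_address are assumptions for illustration, not values from the source:

```c
#include <stdint.h>

#define BLOCK_SIZE 64              /* bytes per block (assumed)      */
#define CACHE_SIZE (32 * 1024)     /* total cache capacity (assumed) */
#define NUM_BLOCKS (CACHE_SIZE / BLOCK_SIZE)

/* Decompose a memory address the way a direct-mapped cache would. */
static inline void split_address(uint32_t addr,
                                 uint32_t *offset, uint32_t *index, uint32_t *tag)
{
    *offset = addr % BLOCK_SIZE;                 /* byte within the block    */
    *index  = (addr / BLOCK_SIZE) % NUM_BLOCKS;  /* which block frame        */
    *tag    = (addr / BLOCK_SIZE) / NUM_BLOCKS;  /* high-order bits to store */
}
```

On a lookup, the index selects a single block frame, the stored tag is compared against the computed tag, and the valid bit must be set for the access to be a hit.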

Cache Optimization: Non-Blocking Caches to Increase Bandwidth

- Hit under miss allows the data cache to continue supplying cache hits during a miss; useful only with out-of-order execution
- Hit under multiple miss or miss under miss may further lower the effective miss penalty by overlapping multiple misses
- Significantly increases the complexity of the cache controller (multiple outstanding memory accesses)
- Requires multiple memory banks (otherwise multiple outstanding misses cannot be serviced)
- The Pentium Pro allows 4 outstanding memory misses

Reducing the Miss Rate: Larger Block Size

- Increase the block size
- Larger blocks exploit spatial locality
- Larger block sizes reduce compulsory misses
- Because they reduce the number of blocks in the cache, larger blocks may increase conflict misses, and even capacity misses if the cache is small

Reducing the Miss Rate: Larger Caches

- Increase the capacity of the cache - The drawback is potentially longer hit time and higher cost and power - This technique is popular in off-chip caches

Cache Optimization: Pipeline Cache Access to Increase Bandwidth

- Pipelining cache access allows a faster clock cycle (higher bandwidth) at the cost of more cycles per hit
- Increases the branch misprediction penalty
- Makes it easier to increase associativity

Cache Optimization: Multi-Bank Caches to Increase Bandwidth

- Individual memory controller for each bank - Each bank may have its own address and data lines - How blocks are interleaved affects performance

Cache Optimization: Merging Write Buffer to Reduce Miss Penalty

- Most useful with write-through caches
- Merges writes to individual words of the same block into a single block write
- Writing one block is faster than writing individual words one at a time

TLB Optimization: Compression

- Exploits regular-stride page accesses (including consecutive pages)
- Uses fewer bits for the virtual and physical page numbers, so more translations fit in the TLB

Reducing the Hit Time: Avoiding Address Translation

- Send the virtual address to the cache (called a virtually addressed cache, or just a virtual cache)
- On every context switch, flush the cache to avoid accessing data that doesn't belong to the current process (the cost is the time to flush plus the compulsory misses from an empty cache)
- Does not support aliases
- To avoid cache flushes, a process identifier tag may be added to cache blocks

TLB Optimization: Speculative TLB

- Similar to cache block prediction and prefetching (heuristics, machine learning)
- Effectiveness depends on prediction accuracy and the page access pattern

Miss Penalty

- The cost per miss - How many cycles it takes to get the data into the cache

Why is there no need to compare more of the address than the tag?

- The offset should not be used in the comparison because the entire block is either present or not; therefore, all block offsets result in a match by definition - Checking the index is redundant, because it was used to select the set to be checked

Hit time

- The time to hit in the cache - Number of cycles it takes to access the cache

What are the advantages of the write back policy?

- Writes occur at the speed of the cache memory
- Multiple writes within a block require only one write to the lower-level memory
- Because some writes don't go to memory, write back uses less memory bandwidth, making it attractive in multiprocessors
- Write back uses the rest of the memory hierarchy and memory interconnect less than write through, so it also saves power, making it attractive for embedded applications

CPU Execution Time with Memory Stall Cycles Equation

= (CPU clock cycles + Memory stall cycles) x Clock cycle time
= IC x (CPI_execution + Memory stall clock cycles / Instruction) x Clock cycle time
= IC x (CPI_execution + (Misses / Instruction) x Miss penalty) x Clock cycle time
= IC x (CPI_execution + Miss rate x (Memory accesses / Instruction) x Miss penalty) x Clock cycle time
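
A small worked example under assumed values (CPI_execution = 1.0, 30 misses per 1000 instructions, 100-cycle miss penalty; the numbers are illustrative, not from the source):

```latex
\text{CPU time} = IC \times \Big(1.0 + \tfrac{30}{1000} \times 100\Big) \times \text{Clock cycle time}
               = 4.0 \times IC \times \text{Clock cycle time}
```

Memory stalls add 3.0 cycles per instruction here, quadrupling the effective CPI.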

Misses per Instruction

= (Miss rate x Memory accesses) / Instruction count
= Miss rate x (Memory accesses / Instruction)
Often reported as misses per 1000 instructions to show integers instead of fractions

Memory Stall Cycles / Instruction

= (Misses / Instruction) x (Total miss latency - Overlapped miss latency)
For a two-level cache:
= (Misses(L1) / Instruction) x Hit time(L2) + (Misses(L2) / Instruction) x Miss penalty(L2)

2 ^ index

= Cache size / (Block size x Set associativity)
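
A quick check with assumed parameters (32 KiB cache, 64-byte blocks, 4-way set associative; the values are illustrative):

```latex
2^{\text{index}} = \frac{32{,}768}{64 \times 4} = 128 \;\Rightarrow\; \text{index} = 7 \text{ bits}
```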

Average Memory Stalls per Instruction

= Misses per instruction(L1) x Hit time(L2) + Misses per instruction(L2) x Miss penalty(L2)

Memory Stall Cycles Equation

= Number of misses x Miss penalty
= IC x (Misses / Instruction) x Miss penalty
= IC x (Memory accesses / Instruction) x Miss rate x Miss penalty
= IC x Reads per instruction x Read miss rate x Read miss penalty + IC x Writes per instruction x Write miss rate x Write miss penalty

Way

A bank in a set associative cache; there are four ways in a four-way set associative cache

Average Memory Access Time (AMAT)

A better measure of memory hierarchy performance than miss rate alone.
= Hit time + Miss rate x Miss penalty
For a two-level cache:
= Hit time(L1) + Miss rate(L1) x (Hit time(L2) + Miss rate(L2) x Miss penalty(L2))
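
A worked two-level example with assumed values (1-cycle L1 hit time, 5% L1 miss rate, 10-cycle L2 hit time, 20% local L2 miss rate, 100-cycle L2 miss penalty; none of these numbers come from the source):

```latex
\text{AMAT} = 1 + 0.05 \times (10 + 0.20 \times 100) = 1 + 0.05 \times 30 = 2.5 \text{ cycles}
```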

Translation Lookaside Buffer (TLB)

A cache that keeps track of recently used address mappings to try to avoid an access to the page table

Pseudo-LRU Replacement

A common approximation of LRU, where there is a set of bits for each set in the cache, with each bit corresponding to a single way in the cache. When a set is accessed, the bit corresponding to the way containing the desired block is turned on; if all the bits associated with a set are turned on, they are reset, with the exception of the most recently turned-on bit. When a block must be replaced, the processor chooses a block from a way whose bit is turned off, often randomly if more than one choice is available. (A small sketch of this bookkeeping follows.)
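
A minimal sketch of that bookkeeping for one cache set, assuming a 4-way set associative cache; the names plru_touch and plru_victim are illustrative, not from any real hardware interface:

```c
#include <stdint.h>
#include <stdlib.h>

#define NUM_WAYS 4   /* assumed 4-way set associative cache */

/* One usage bit per way for a single set (the "set of bits" above). */
typedef struct {
    uint8_t used[NUM_WAYS];
} plru_set_t;

/* Called on every access: turn on the bit of the way holding the block. */
void plru_touch(plru_set_t *s, int way)
{
    s->used[way] = 1;

    /* If all bits are now on, reset them except the most recently set one. */
    int all_on = 1;
    for (int w = 0; w < NUM_WAYS; w++)
        if (!s->used[w]) { all_on = 0; break; }
    if (all_on) {
        for (int w = 0; w < NUM_WAYS; w++)
            s->used[w] = 0;
        s->used[way] = 1;
    }
}

/* Called on replacement: pick randomly among the ways whose bit is off. */
int plru_victim(const plru_set_t *s)
{
    int candidates[NUM_WAYS], n = 0;
    for (int w = 0; w < NUM_WAYS; w++)
        if (!s->used[w]) candidates[n++] = w;
    return candidates[rand() % n];   /* at least one bit is off by construction */
}
```

A set starts with all bits off (for example, plru_set_t set = {0};), so a victim is always available.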

Write Buffer

A common optimization to reduce write stalls. This allows the processor to continue as soon as the data are written to the buffer, thereby overlapping processor execution with memory updating. Write stalls can still occur even with these

Set

A group of blocks in the cache

Reducing the Time to Hit in the Cache

Avoiding address translation when indexing the cache

First In, First Out (FIFO) Replacement

Because LRU can be complicated to compute, this approximates LRU by tracking the oldest block rather than the least recently used one

What does the selection of block size depend on?

Both the latency and the bandwidth of the lower-level memory. High latency and high bandwidth encourage large block size because the cache gets many more bytes per miss for a small increase in miss penalty. Conversely, low latency and low bandwidth encourage smaller block sizes because there is little time saved from a larger block

Address Tag

Caches have one of these on each block frame; it gives the block address. The tag of every cache block that might contain the desired information is checked to see whether it matches the block address from the processor. All possible tags are searched in parallel

Tag Field

The portion of the address that is compared against the stored address tag to see whether the access is a hit

What type of miss is the easiest to deal with conceptually?

Conflict misses, because fully associative placement avoids all conflict misses; however, full associativity is expensive in hardware and may slow the processor clock rate, leading to lower overall performance

Cache Optimization: Compiler Prefetching to Reduce Miss Rate and Penalty

Data prefetch:
- Load data into a register
Cache prefetch:
- Load data into the cache
- Special prefetching instructions should not cause premature page faults
- Issuing prefetch instructions takes time
- Works only if prefetching can be overlapped with execution (see the sketch below)
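
A minimal sketch of compiler-style prefetching in C, using the GCC/Clang __builtin_prefetch intrinsic; the loop, the prefetch distance of 16 elements, and the function name are assumptions for illustration:

```c
#include <stddef.h>

/* Sum an array while prefetching data that will be needed a few
 * iterations ahead, so the prefetch overlaps with useful work. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /* read */, 1 /* low temporal locality */);
        sum += a[i];   /* useful work that hides the prefetch latency */
    }
    return sum;
}
```

The prefetch is only a hint and cannot fault, which matches the requirement above that prefetch instructions not cause premature page faults.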

How to deal with capacity misses?

Enlarge the cache

True or False: All objects referenced by a program need to reside in main memory

False

True or False: If the total cache size is kept the same, increasing associativity decreases the number of blocks per set, thereby decreasing the size of the index and increasing the size of the tag

False; increasing associativity increases the number of blocks per set, thereby decreasing the size of the index and increasing the size of the tag

Fully Associative Cache

If a block can be placed anywhere in the cache

Set Associative Cache

If a block can be placed in a restricted set of places in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within that set. The set is usually chosen by bit selection, that is, (Block address) MOD (Number of sets in the cache)

Conflict Miss

If the block placement strategy is set associative or direct mapped, these misses will occur because a block may be discarded and later retrieved if too many blocks map to its set. Also called collision misses. Occur going from fully associative to eight-way associative, four-way associative, and so on

Capacity Miss

If the cache cannot contain all the blocks needed during execution of a program, these misses will occur because of blocks being discarded and later retrieved. Occur in a fully associative cache

Thrash

If the upper-level memory is much smaller than what is needed for a program, and a significant percentage of the time is spent moving data between two levels in the hierarchy, the memory hierarchy is said to do this

N-Way Associative Cache

If there are n blocks in a set

Reducing the Miss Penalty: Read Priority over Write on Miss

If write through:
- Write buffers avoid stalling
- May cause RAW conflicts with reads on cache misses
- Waiting for the write buffer to empty increases the read miss penalty (Solution: check the write buffer contents before the read; if there are no conflicts, let the memory access continue)
If write back:
- Normally: write the dirty block to memory, and then do the read
- Instead, copy the dirty block to a write buffer, then do the read, and then do the write
- The CPU stalls less since it restarts as soon as the read completes

Cache Optimization: Hardware Prefetching to Reduce Miss Rate and Penalty

Instruction prefetching: bring neighboring cache lines into the cache before they are accessed
- Can fetch 2 (or more) blocks on a miss
- The extra block is placed in a stream buffer
- On a miss, check the stream buffer; if the block is found, move it to the cache and prefetch the next block
Data prefetching:
- May have multiple stream buffers beyond the cache, each prefetching at a different address
- Relies on extra memory bandwidth that can be used without penalty

Cache Optimization: Compiler Optimizations to Reduce the Miss Rate

Instructions:
- Reorder procedures in memory so as to reduce conflict misses
- Align basic blocks with cache blocks (lines)
- Use profiling to look at conflicts
Data:
- Merging arrays: improve spatial locality with a single array of compound elements instead of two separate arrays
- Loop fusion: combine two independent loops that have the same looping structure and overlapping variables
- Loop interchange: change the nesting of loops to access data in the order it is stored in memory (see the sketch below)
- Blocking: improve temporal locality by accessing blocks of data repeatedly instead of going down whole columns or rows
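
A small sketch of loop interchange in C; the array, its size, and the function names are illustrative:

```c
#define N 1024
static double x[N][N];   /* C stores this array row-major */

/* Original nesting: the inner loop strides through memory a whole row
 * apart on every iteration, touching a new cache block almost every time. */
void scale_column_order(double factor)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] *= factor;
}

/* After loop interchange: the same computation, but the inner loop walks
 * each row sequentially, in the order the data are stored, so each cache
 * block is fully used before it is evicted (better spatial locality). */
void scale_row_order(double factor)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] *= factor;
}
```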

What happens on a cache miss?

It is handled by hardware and causes processors using in-order execution to pause, or stall, until the data are available. With out-of-order execution, an instruction using the result must still wait, but other instructions may proceed during the miss

What is a virtue of random replacement?

It is simple to build in hardware

Reducing the Miss Rate

Larger block size, larger cache size, and higher associativity

What does the time required to handle a cache miss depend on?

Latency and bandwidth of the memory. Latency determines the time to retrieve the first word of the block, and bandwidth determines the time to retrieve the rest of this block

Coherency Miss

Misses due to cache flushes to keep multiple caches coherent in a multiprocessor

Reducing the Miss Penalty

Multilevel caches and giving reads priority over writes

What are the Four Memory Hierarchy Questions?

Q1: Where can a block be placed in the upper level? (block placement) Q2: How is a block found if it is in the upper level? (block identification) Q3: Which block should be replaced on a miss? (block replacement) Q4: What happens on a write? (write strategy)

What instruction type dominates processor cache accesses? Why?

Reads; all instruction accesses are reads, and most instructions don't write to memory

What is the classical approach to improving cache behavior?

Reducing miss rates

Valid Bit

Says whether or not this entry contains a valid address. If the bit is not set, there cannot be a match on this address

Block Offset Field

Selects the desired data from the block

Index Field

Selects the set

Temporal Locality

Tells us that we are likely to need this word again in the near future, so it is useful to place it in the cache where it can be accessed more quickly

Pages

The address space is usually broken into these fixed-size blocks

Write Allocate

The block is allocated on a write miss, followed by the preceding write hit actions. In this natural option, write misses act like read misses

Miss Rate

The fraction of cache accesses that result in a miss (i.e., the number of accesses that miss divided by the total number of accesses)

Cache

The highest or first level of the memory hierarchy encountered once the address leaves the processor

Write Back Policy

The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced

Write Through Policy

The information is written to both the block in the cache and to the block in the lower-level memory

Global Miss Rate

The number of misses in the cache divided by the total number of memory accesses generated by the processor

Compulsory Miss

The very first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold-start misses or first-reference misses. Occur in an infinite cache

Spatial Locality

There is a high probability that the other data in a block will be needed soon

No-Write Allocate

In this apparently unusual alternative, write misses do not affect the cache. Instead, the block is modified only in the lower-level memory

Local Miss Rate

This rate is the number of misses in a cache divided by the total number of memory accesses to this cache

Dirty Bit

This reduces the frequency of writing back blocks on replacement. It is a status bit that indicates whether the block is dirty (modified while in the cache) or clean (not modified). If it is clean, the block is not written back on a miss, because identical information to the cache is found in lower levels

Least Recently Used (LRU) Replacement

To reduce the chance of throwing out information that will be needed soon, accesses to blocks are recorded. Relying on the past to predict the future, the block replaced is the one that has been unused for the longest time. LRU relies on a corollary of locality: if recently used blocks are likely to be used again, then a good candidate for disposal is the least recently used block

Random Block Replacement

To spread allocation uniformly, candidate blocks are randomly selected for replacement. Some systems generate pseudorandom block numbers to get reproducible behavior, which is particularly useful when debugging hardware

True or False: At any time, each page resides either in main memory or on disk

True

True or False: Because page faults take so long, they are handled in software and the processor is not stalled

True

True or False: Blocks stay out of the cache in no-write allocate until the program tries to read the blocks, but even blocks that are only written will still be in the cache with write allocate

True

Cache Miss

When the processor does not find a data item it needs in the cache

Cache Hit

When the processor finds a requested data item in the cache

Write Stall

When the processor must wait for writes to complete during write through

Page Fault

When the processor references an item within a page that is not present in the cache or main memory, this occurs, and the entire page is then moved from disk to main memory

