Chap 5 PART 1
A page table keeps track of where each virtual page is placed in physical memory; the resulting physical address is
PA = (PPN, offset)
VA = 32 bits
(size of address space 2^32 = 4 GB)
Hit time is the time to access the upper level of the memory hierarchy
, which includes the time needed to determine whether the access is a hit or a miss (that is, the time needed to look through the books on the desk). Hit time will be much smaller than the time to access the next level in the hierarchy
Moreover, if it is, how do we find it? The simplest way to assign a location in the cache for each word in memory is to assign the cache location based on the address of the word in memory ---- DIRECT MAPPING: each block has only one choice:
- (Block address) modulo (# of blocks in cache). The number of blocks is a power of 2. Because there are eight words in the cache, an address X maps to the direct-mapped cache word X modulo 8. That is, the low-order log2(8) = 3 bits are used as the cache index (see the sketch below).
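A tiny sketch of this mapping rule (the 8-block cache is the one from the example; for a power-of-two number of blocks, the modulo is simply the low-order bits of the block address):

#include <stdio.h>

int main(void) {
    // Direct-mapped placement: (block address) modulo (# of blocks in cache).
    // With 8 blocks, the index is the low-order log2(8) = 3 bits.
    unsigned num_blocks = 8;
    unsigned block_addresses[] = {1, 9, 17, 29};

    for (int i = 0; i < 4; i++) {
        unsigned addr = block_addresses[i];
        unsigned index_mod  = addr % num_blocks;        // modulo form
        unsigned index_bits = addr & (num_blocks - 1);  // low-order-bits form (power of 2 only)
        printf("block %2u -> cache index %u (modulo) = %u (low 3 bits)\n",
               addr, index_mod, index_bits);
    }
    return 0;
}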
- 1 GB memory ➔ 1 GB = 2^30 bytes ➔ PA 30 bits
- 4 GB memory ➔ 4 GB = 2^32 bytes ➔ PA 32 bits - 512 MB memory ➔ 512 MB = 2^29 bytes ➔ PA 29 bits
Average memory access time (AMAT) - AMAT = Hit time + Miss rate × Miss penalty • Example - CPU with 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
- AMAT = 1 + 0.05 × 20 = 2 cycles = 2 ns per access
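A quick check of the arithmetic, using the numbers given in the example above:

#include <stdio.h>

int main(void) {
    // Values from the example: 1 ns clock, hit time = 1 cycle,
    // miss penalty = 20 cycles, I-cache miss rate = 5%.
    double clock_ns     = 1.0;
    double hit_cycles   = 1.0;
    double miss_rate    = 0.05;
    double miss_penalty = 20.0;

    double amat_cycles = hit_cycles + miss_rate * miss_penalty;  // = 2 cycles
    printf("AMAT = %.2f cycles = %.2f ns\n", amat_cycles, amat_cycles * clock_ns);
    return 0;
}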
Ideal memory
- Access time of SRAM - Capacity and cost/GB of disk
Primary cache
- Focus on minimal hit time (direct mapping minimizes hit time) - We also separate the I-cache and D-cache: the datapath accesses memory at least once (IF) and possibly twice (MEM) per instruction, so to avoid structural hazards we split the L1 cache into instruction and data caches
Try to minimize page fault rate
- Fully associative placement
Spatial locality
- Items with memory addresses near those accessed recently are likely to be accessed soon - E.g., sequential instruction access, sequential access to array data. Libraries put books on the same topic together on the same shelves to increase spatial locality. For example, with an array, accessing element i also brings in i+1, i+2, ...
Larger blocks ➔ fewer of them • But in a fixed-sized cache • More competition ➔ increased miss rate
- Larger blocks ➔ pollution • Larger miss penalty - Can override benefit of reduced miss rate - Early restart and critical-word-first can help
Page size 1 KB: 1 KB = 2^10 bytes ➔ page offset 10 bits
- Page size 4 KB: 4 KB = 2^12 bytes ➔ page offset 12 bits
- Virtual address space divided into pages
- Physical memory (RAM) divided into page frames - Each page can be brought into a page frame and will fit perfectly
The Principle of Locality:
- Programs access a relatively small portion of the address space at any instant of time
Access to a sector involves
- Queuing delay if other accesses are pending - Seek: move the R/W heads mechanically (long delay) - Rotational latency (mechanical, long delay): the time required for the desired sector of a disk to rotate under the read/write head; usually assumed to be half the rotation time - Data transfer - Controller overhead (accessing the data and making it available to the rest of the system). Lesson: once you locate your data, read as many bytes as you can! (Don't read individual bytes from disks; read whole blocks.)
Each sector records
- Sector ID - Data (512 bytes, 4096 bytes proposed) - Error correcting code (ECC) • Used to hide defects and recording errors
On page fault, the page must be fetched from disk
- Takes millions of clock cycles - Handled by OS code
What if there is no data in a location?
- Valid bit: 1 = present, 0 = not present - Initially 0
VA = (VPN, offset). Find the number of VPN bits: virtual address = 16 bits, offset = 12 bits
16 - 12 = 4 bits of VPN
How many entries (size of the page table)?
2^VPN entries
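A small sketch of this bit accounting, using the example's 16-bit virtual address and 12-bit offset:

#include <stdio.h>

int main(void) {
    // Example values: 16-bit virtual address, 12-bit page offset.
    unsigned va_bits     = 16;
    unsigned offset_bits = 12;

    unsigned vpn_bits = va_bits - offset_bits;   // 4 bits of virtual page number
    unsigned entries  = 1u << vpn_bits;          // page table has 2^VPN entries

    printf("VPN bits = %u, page table entries = %u\n", vpn_bits, entries);
    return 0;
}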
Fully Associative: A cache structure in which a block can be placed in any location in the cache; all the entries in the cache must be searched because a block can be placed in any one. To make the search practical, it is done in parallel, with a comparator associated with each cache entry.
8-way set associative ➔ 8 blocks/set. With 8 blocks total, 8/8 = 1 set.
Alternative to write-through: write-back
On a data-write hit, just update the block in the cache. Keep track of whether each block is dirty (i.e., modified) • When a dirty block is replaced - Write it back to memory - Can use a write buffer to allow the replacing block to be read first. Write-back schemes can improve performance.
WRITE-BACK AND WRITE-THROUGH, AS DESCRIBED ABOVE, ASSUME A WRITE HIT
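Below is a minimal, illustrative sketch (not from the book) of the write-back behavior described above, for a hypothetical direct-mapped cache with one word per block: a write hit only updates the cache and sets the dirty bit, and the dirty word is written back to memory when its block is replaced.

#include <stdio.h>

#define NUM_BLOCKS 8
#define MEM_WORDS  64

// Minimal sketch of a direct-mapped, one-word-per-block write-back cache.
typedef struct {
    int valid, dirty;
    unsigned tag;
    unsigned data;
} Line;

static Line cache[NUM_BLOCKS];
static unsigned memory[MEM_WORDS];

void write_word(unsigned addr, unsigned value) {
    unsigned index = addr % NUM_BLOCKS;
    unsigned tag   = addr / NUM_BLOCKS;
    Line *line = &cache[index];

    if (!(line->valid && line->tag == tag)) {        // miss: replace the line
        if (line->valid && line->dirty)              // write back the old dirty word
            memory[line->tag * NUM_BLOCKS + index] = line->data;
        line->valid = 1;
        line->tag   = tag;
        line->data  = memory[addr];                  // write-allocate: fetch first
    }
    line->data  = value;                             // write hit: update cache only
    line->dirty = 1;
}

int main(void) {
    write_word(3, 111);   // miss, allocate, mark dirty
    write_word(3, 222);   // hit, cache updated, memory still stale
    write_word(11, 333);  // maps to the same index: 222 is written back first
    printf("memory[3] = %u (written back on replacement)\n", memory[3]);
    return 0;
}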
■ A tag field, which is used to compare with the value of the tag field of the cache
■ A cache index, which is used to select the block. Cache blocks here are 1 word (4 bytes), so the data stored in each cache block is 4 bytes.
How frequently can you initiate an access?
A little kid can ask for money from Dad on Sunday, Mom on Monday, Grandma on Tuesday... All the family members are like memory banks. The banks have long cycle times, but you can initiate accesses to different banks more quickly. Memory banks are great for overlapping the access time of one bank with accesses to the others.
Alternatives for write-through (on a write miss)
Allocate on miss: fetch the block, then write. Write around: don't fetch the block, just write to memory • Useful since programs often write a whole block before reading it (e.g., initialization) • For write-back - Usually fetch the block
NAND Flash
Bit cell like a NAND gate. Denser (bits/area) but accessed a block at a time. Cheaper per GB.
NOR Flash
Bit cell like a NOR gate Random read/write access Use for instruction memory
Mapping an Address to a Multiword Cache Block: Consider a cache with 64 blocks and a block size of 16 bytes. To what block number does byte address 1200 map?
Block address = byte address / bytes per block = 1200 / 16 = block 75. Block number 75 modulo 64 = 11: this is the index in the cache to which block 75 maps.
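The same computation, spelled out:

#include <stdio.h>

int main(void) {
    // Example above: 64-block cache, 16-byte blocks, byte address 1200.
    unsigned byte_addr       = 1200;
    unsigned bytes_per_block = 16;
    unsigned num_blocks      = 64;

    unsigned block_addr  = byte_addr / bytes_per_block;   // 75
    unsigned cache_index = block_addr % num_blocks;       // 75 mod 64 = 11

    printf("block address = %u, cache index = %u\n", block_addr, cache_index);
    return 0;
}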
Sizes typically in powers of 2.
Can use higher order address bits as page numbers, and lower address bits as offsets within page.
Take Advantage of Locality: memory hierarchy - Store everything on disk
Copy recently accessed (and nearby) items from disk to smaller DRAM (main memory). Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory - Cache memory attached to CPU. Processor: cache (faster) -- main memory (slower). Cache memory: the level of the memory hierarchy closest to the CPU.
middle in size, speed, and price
DRAM
DRAM (Read/Write) Cycle Time [much longer]>>
DRAM (Read/Write) Access Time
3 APPROACHES TO MAPPING
Direct Mapped Cache -- most rigid / only one location. Set Associative Cache. Fully Associative Cache -- least rigid / most flexible. More flexibility decreases the miss rate and increases the hit rate.
Increasing the associativity increases the number of blocks per set
Each increase by a factor of 2 in associativity doubles the number of blocks per set and halves the number of sets
Each disk surface is divided into concentric circles, called tracks. There are typically tens of thousands of tracks per surface.
Each track is in turn divided into sectors that contain the information
To actually find the VPN:
VPN = FLOOR(VA / PAGE SIZE). Example: physical frame number = 6 • page size = 4096, page offset = 4 • Physical address = 6 × 4096 + 4 = 24576 + 4 = 24580
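A sketch of the translation arithmetic. The notes give the frame number (6), the page size (4096), and the offset (4) but not the virtual address, so the VA below is a hypothetical one chosen just to produce that offset:

#include <stdio.h>

int main(void) {
    // Numbers from the example: 4 KB pages, the page maps to physical frame 6,
    // and the offset within the page is 4. The VA is made up for illustration.
    unsigned page_size = 4096;
    unsigned va        = 3 * page_size + 4;      // hypothetical VA: VPN 3, offset 4
    unsigned frame     = 6;                      // from the page table lookup

    unsigned vpn    = va / page_size;            // floor(VA / page size)
    unsigned offset = va % page_size;            // unchanged by translation
    unsigned pa     = frame * page_size + offset;

    printf("VPN = %u, offset = %u, PA = %u\n", vpn, offset, pa);  // PA = 24580
    return 0;
}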
Magnetic disks are nonvolatile like flash, but unlike flash there is no write wear-out problem.
However, flash is much more rugged and hence a better match to the jostling inherent in personal mobile devices.
How do we know if a data item is in the cache?
If each word can go in exactly one place in the cache, then it is straightforward to find the word if it is in the cache
This second-level cache is normally on the same chip and is accessed whenever a miss occurs in the primary cache. If the second-level cache contains the desired data, the miss penalty for the first-level cache will be essentially the access time of the second-level cache, which will be much less than the access time of main memory.
If neither the primary nor the secondary cache contains the data, a main memory access is required, and a larger miss penalty is incurred.
all tags of all the blocks in the cache must be searched. Associativity can be increased while keeping the total cache size the same.
If the number of blocks is kept the same, increasing the associativity increases the number of blocks per set and decreases the number of sets. The advantage of increasing the degree of associativity is that it usually decreases the miss rate,
miss rate
If the data is not found in the upper level, the request is called a miss. The lower level in the hierarchy is then accessed to retrieve the block containing the requested data. The miss rate (1−hit rate) is the fraction of memory accesses not found in the upper level.
HIT
If the data requested by the processor appears in some block in the upper level, this is called a hit (analogous to your finding the information in one of the books on your desk).
Position and function of the MMU (Memory Management Unit): the CPU produces a virtual address that is fed to the MMU, which uses the TLB and page tables to generate a physical address (the TLB and page tables are maintained and updated by the OS). The resulting physical address is sent to memory. If the page is in memory, the data is sent back. If not, a page fault occurs and the request goes to the disk controller.
On a page fault, the OS suspends the faulting process while the page is brought in from disk and gives the CPU to another process; the CPU can't afford to just wait.
If the tag bits do not match: miss
Use the index bits to select a cache entry: CHECK the valid bit. If it is 1, then compare the tag bits; if they match it is a hit, otherwise it is a miss.
Temporal locality
Items accessed recently are likely to be accessed again soon - e.g., instructions in a loop, induction variables. In a loop, you bring one instruction into memory, put it in the cache, run the loop again, and get it from the cache. If you recently brought a book to your desk to look at, you will probably need to look at it again soon.
L2 cache - Focus on low miss rate to avoid main memory access (more associativity) - Hit time has less overall impact because the L2 cache is accessed less frequently - Unified (instructions and data together)
L1 cache is usually smaller than a single unified cache would be - L1 block size is smaller than L2 block size
Why have larger blocks?
Larger blocks should reduce miss rate - Due to spatial locality • But in a fixed-sized cache - Larger blocks ➔ fewer of them • More competition ➔ increased miss rate - Larger blocks ➔ pollution • Larger miss penalty - Can override benefit of reduced miss rate - Early restart and critical-word-first can help
Dynamic RAM
Larger capacity, slower, cheaper In a dynamic RAM (DRAM), the value kept in a cell is stored as a charge in a capacitor. A single transistor is then used to access this stored charge, either to read the value or to overwrite the charge stored there.
Memory Hierarchy
Level 1: closest to the processor. Multiple levels of cache: SRAM. Main memory: DRAM. Secondary memory: disks. A typical memory hierarchy takes advantage of the principle of locality: it can present the user with as much memory as is available in the cheapest technology.
Given - I-cache miss rate = 2% - D-cache miss rate = 4% - Miss penalty = 100 cycles - Base CPI (ideal cache) = 2 - Loads & stores are 36% of instructions
Miss cycles per instruction - I-cache: 0.02 × 100 = 2 - D-cache: 0.36 × 0.04 × 100 = 1.44. Actual CPI = 2 + 2 + 1.44 = 5.44 - The ideal CPU is 5.44/2 = 2.72 times faster
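The same CPI accounting as a small program, with the values from the example:

#include <stdio.h>

int main(void) {
    // Values from the example above.
    double base_cpi        = 2.0;
    double icache_missrate = 0.02;
    double dcache_missrate = 0.04;
    double miss_penalty    = 100.0;
    double loadstore_frac  = 0.36;

    double icache_stalls = icache_missrate * miss_penalty;                    // 2.00
    double dcache_stalls = loadstore_frac * dcache_missrate * miss_penalty;   // 1.44
    double actual_cpi    = base_cpi + icache_stalls + dcache_stalls;          // 5.44

    printf("actual CPI = %.2f, slowdown vs ideal = %.2fx\n",
           actual_cpi, actual_cpi / base_cpi);
    return 0;
}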
Cache Misses The control unit must detect a miss and process the miss by fetching the requested data from memory
Modifying the control of a processor to handle a hit is trivial; misses, however, require some extra work. On cache hit, CPU proceeds normally
fully associative cache: slower hit-determination time / must compare against multiple locations
Multiple memory locations map to the same block in the cache. Mod 8 ➔ take the lower 3 bits of the memory address (2^3 = 8)
FLASH MEMORY: With wear leveling, personal mobile devices are very unlikely to exceed the write limits of the flash. Such wear leveling lowers the potential performance of flash; it is not suitable as a direct RAM replacement.
Nonvolatile semiconductor storage; tends to be faster to access than magnetic disks; smaller in capacity, lower power, more robust.
Increase bandwidth by accessing more data from memory per transfer, OR by organizing memory into banks
Organize memory into banks: the processor can access words spread across the different banks, reading from whichever bank holds the data and bringing it to the processor. Once a bank has been accessed, you must wait for its cycle time before accessing it again.
E.g., 32-bit address, cache with 1024 blocks, each block 16 bytes.
Offset: determined by what power of 2 the block size is ➔ 4 bits. Index: determined by what power of 2 the number of blocks is ➔ 10 bits. Tag: the remaining bits of the 32. TAG 18 | INDEX 10 | OFFSET 4
E.g., 32-bit address, cache with 512 blocks, each block 64 bytes.
Offset ➔ 6 bits. Index ➔ 9 bits. Tag: remaining bits of the 32. TAG 17 | INDEX 9 | OFFSET 6
E.g., 1024 blocks, each block 4 bytes.
Offset ➔ 2 bits. Index ➔ 10 bits. Tag: remaining bits of the 32. TAG 20 | INDEX 10 | OFFSET 2
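A sketch that reproduces the three breakdowns above from the cache parameters (32-bit addresses assumed throughout, as in the examples):

#include <stdio.h>

// Integer log2 for power-of-two values (enough for these examples).
static unsigned log2u(unsigned x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    unsigned addr_bits = 32;
    struct { unsigned blocks, block_bytes; } cfg[] = {
        {1024, 16}, {512, 64}, {1024, 4}   // the three examples above
    };

    for (int i = 0; i < 3; i++) {
        unsigned offset = log2u(cfg[i].block_bytes);
        unsigned index  = log2u(cfg[i].blocks);
        unsigned tag    = addr_bits - index - offset;
        printf("%4u blocks x %2u B: tag=%2u index=%2u offset=%u\n",
               cfg[i].blocks, cfg[i].block_bytes, tag, index, offset);
    }
    return 0;
}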
Page size 1024 bytes (1 KB = 2^10 bytes) • Address = 1300 • Virtual page number = floor(1300/1024) = 1, held in the remaining upper 22 bits (of a 32-bit address) • Offset = 1300 mod 1024 = 276, held in the lower 10 bits
Offsets are the same in both the virtual and physical address spaces
To access data, the operating system must direct the disk through a three-stage process. The first step is to position the head over the proper track. This operation is called a seek, and the time to move the head to the desired track is called the seek time.
Once the head has reached the correct track, we must wait for the desired sector to rotate under the read/write head. This time is called the rotational latency or rotational delay. The last component of a disk access, transfer time, is the time to transfer a block of bits. The transfer time is a function of the sector size, the rotation speed, and the recording density of a track.
Increased Bandwidth
One-word-wide memory organization is slow
The disk heads for each surface are connected together and move in conjunction, so that every head is over the same track of every surface.
PACK more bits per area. The way information is stored: bits are recorded by changing the polarity of the magnetic flux on the material.
• If page is not present
PTE can refer to location in swap space on disk
translation using page table
Page table is in memory, in the OS kernel address space. The page table base register is loaded with the page table's base address. The OS maintains and updates all of this when the CPU is switched from one process to another.
Page Tables Stores placement information (1 PT/process) Array of page table entries, indexed by virtual page number
Page table register in CPU points to page table in physical memory
replacement policy: random works as well
Random - Gives approximately the same performance as LRU for high associativity - Non-usage-based. As associativity increases, implementing LRU gets harder.
fastest, smallest, most expensive memory
SRAM
Static RAM
Smaller, faster, more expensive, less dense (more area per bit). Used for caches.
Magnetic disk (compared with solid state drives)
Much cheaper, larger capacity, and much slower
write-through solution
Solution: write buffer - Holds data waiting to be written to memory (stores the data while it waits to be written) - The CPU continues immediately • The CPU stalls on a write only if the write buffer is already full, until there is an empty position in the write buffer
block (or line)
The minimum unit of information that can be either present or not present in a cache. It is basically one book in our library example.
In a fully associative cache, there is effectively only one set, and all the blocks must be checked in parallel.
Thus, there is no index, and the entire address, excluding the block offset, is compared against the tag of every block. In other words, we search the entire cache without any indexing.
Write Miss (data is not in the cache): what should happen on a write miss? Write allocate:
The block is fetched from memory and then the appropriate portion of the block is overwritten. An alternative strategy is to update the portion of the block in memory but not put it in the cache, called no-write allocate.
magnetic disk: The metal platters are covered with magnetic recording material on both sides, similar to the material found on a cassette or videotape. To read and write information on a hard disk, a movable arm containing a small electromagnetic coil called a read-write head is located just above each surface.
The entire drive is permanently sealed to control the environment inside the drive, which, in turn, allows the disk heads to be much closer to the drive surface.
DRAMs buffer rows for repeated access. The buffer acts like an SRAM; by changing the address, random bits can be accessed in the buffer until the next row access.
This capability improves the access time significantly, since the access time to bits in the row is much lower
Four-way Set Associative
Total blocks: 8. How many sets? 8/4 = 2 ➔ 2 sets with 4 blocks each.
Set Associative (e.g., 2-way set associative): a block maps to only one set but has a fixed number of locations (at least two) within that set where it can be placed; it can be in any location within the set.
Total blocks: 8. How many sets? 8/2 = 4 ➔ 4 sets with 2 blocks each. In a set-associative cache, the set containing a memory block is given by (Block number) modulo (Number of sets in the cache); all the tags of all the elements of the set must be searched.
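A small sketch of the set-selection rule with the 8-block, 2-way configuration from these notes (the block numbers are arbitrary examples):

#include <stdio.h>

int main(void) {
    // From the notes: 8 blocks total, 2-way set associative -> 8/2 = 4 sets.
    // Set containing a block = (block number) modulo (number of sets).
    unsigned total_blocks = 8;
    unsigned ways         = 2;
    unsigned num_sets     = total_blocks / ways;   // 4 sets

    for (unsigned block = 10; block <= 13; block++)
        printf("memory block %2u -> set %u (search both ways of that set)\n",
               block, block % num_sets);
    return 0;
}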
Virtual Address=
VPN, OFFSET
What happens on a write miss?
We first fetch the words of the block from memory. After the block is fetched and placed into the cache, we can overwrite the word that caused the miss into the cache block. We also write the word to main memory using the full address
Write-Through On data-write hit, could just update the block in cache - But then cache and memory would be inconsistent
Write through: update the cache and also update memory -- update both the cache and the next lower level of the memory hierarchy, ensuring that data is always consistent between the two. • But makes writes take longer
The miss penalty is the time to replace
a block in the upper level with the corresponding block from the lower level, plus the time to deliver this block to the processor (or the time to get another book from the shelves and place it on the desk).
On cache miss - Stall the CPU pipeline - Fetch block from next level of hierarchy - Instruction cache miss • Restart instruction fetch
a. If an instruction access results in a miss, then the content of the instruction register is invalid. b. Send the original PC value (current PC - 4) to the memory. c. Instruct main memory to perform a read and wait for the memory to complete its access. d. Write the cache entry, putting the data from memory in the data portion of the entry, writing the upper bits of the address (from the ALU) into the tag field, and turning the valid bit on. e. Restart the instruction execution at the first step, which will refetch the instruction, this time finding it in the cache.
Cache memory (managed in hardware)
acts as a cache for main memory. Virtual memory is for data on secondary storage: data moves slowly but capacity is large. Reads: secondary memory ➔ main memory ➔ second-level cache. Writes: opposite direction.
How else to improve cache performance?
Add associativity ➔ increase hit rate, reduce miss rate (hit rate + miss rate = 1)
Because each cache location can contain the contents of a number of different memory locations, how do we know whether the data in the cache corresponds to a requested word? (how do we know whether a requested word is in the cache or not? )
adding a set of tags to the cache. The tags contain the address information required to identify whether a word in the cache corresponds to the requested word.
CPU and OS translate virtual
addresses to physical addresses during execution time
Data is copied between only two
adjacent levels at a time The upper level—the one closer to the processor—is smaller and faster than the lower level, since the upper level uses technology that is more expensive.
fully associative cache
allows a given block to go anywhere in the cache; requires all entries to be searched at once; one tag comparator per entry
As DRAMs store the charge on a capacitor, it cannot be kept indefinitely
and must periodically be refreshed. To refresh the cell, we merely read its contents and write it back
Translation takes the virtual page number
and uses it as an index into this page table
You break the cache into a number of sets. Within each set are n blocks, into which data can be placed anywhere. Direct mapped ➔
maps a block address in memory to a single location in the upper level of the hierarchy. Maps to only one location.
VM managed jointly
by CPU hardware and operating system
virtual memory main memory is
cache for secondary storage(disk)
In VM, the equivalent of a miss is called a page fault. If a page is not in memory,
it is sitting on disk and needs to be brought from disk into main memory. The page is not in memory and not in the cache.
faster memory is
close to the processor and the slower, less expensive memory is below it.
flash memory: To cope with such limits, most flash products include a
controller to spread the writes by remapping blocks that have been written many times to less-trodden blocks. This technique is called wear leveling.
Programs share physical main memory(RAM)
each gets a private virtual address space holding its frequently used code and data
n-way set associative
each set contains n entries block number determines which set search all entries in a given set at once
Size of the page table is determined by how many bits the virtual page number
has. With 20 bits for the virtual page number, the page table has 2^20 entries.
Making the chip wider also
improves the memory bandwidth of the chip.
To avoid performance loss, the bandwidth of main memory is
increased to transfer cache blocks more efficiently
1024 blocks of 4 bytes each; with 4-way associativity the same total cache size gives 1024/4 = 256 sets. NOW THESE ARE THE SETS. Offset: what power of 2 the block size is ➔ 2 bits. Index: what power of 2 the number of sets (256) is ➔ 8 bits. Tag is the remaining 22 bits.
Index: how many powers of 2 for the set number
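The same accounting as a sketch (32-bit address assumed, as in the earlier examples):

#include <stdio.h>

int main(void) {
    // Example above: 1024 four-byte blocks, 4-way set associative, 32-bit address.
    unsigned addr_bits   = 32;
    unsigned blocks      = 1024;
    unsigned block_bytes = 4;
    unsigned ways        = 4;

    unsigned sets = blocks / ways;                                   // 256 sets
    unsigned offset_bits = 0, index_bits = 0;
    for (unsigned b = block_bytes; b > 1; b >>= 1) offset_bits++;    // log2(4)   = 2
    for (unsigned s = sets;        s > 1; s >>= 1) index_bits++;     // log2(256) = 8
    unsigned tag_bits = addr_bits - index_bits - offset_bits;        // 22

    printf("sets=%u  tag=%u  index=%u  offset=%u\n",
           sets, tag_bits, index_bits, offset_bits);
    return 0;
}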
- Data cache miss • Complete data access. The control of the cache on a data access
is essentially identical: on a miss, we simply stall the processor until the memory responds with the data
With eight blocks, an eight-way set associative cache
is the same as a fully associative cache.
If the miss penalty increased linearly with the block size,
larger blocks could easily lead to lower performance. To avoid performance loss, the bandwidth of main memory is increased to transfer cache blocks more efficiently
slowest speed , biggest, and cheapest
magnetic disk
increase associativity decreases
miss rate
The use of a larger block decreases the
miss rate and improves the efficiency of the cache by reducing the amount of tag storage relative to the amount of data storage in the cache.
The faster memories are
more expensive per bit than the slower memories and thus are smaller.
If the rate at which memory can complete writes is less than the rate at which the processor is generating writes,
no amount of buffering can help, because writes are being generated faster than the memory system can accept them.
• If page is present in memory PTE stores the physical page frame
number Plus other status bits (referenced, modified, ...)
As associativity increases, the number
of index bits decreases and the number of tag bits increases
The hit rate, or hit ratio, is the fraction
of memory accesses found in the upper level; it is often used as a measure of the performance of the memory hierarchy
The principle of locality states that programs access a relatively small portion
of their address space at any instant of time, just as you accessed a very small portion of the library's collection
A VM block is called a page:
page size is a power of 2
A virtual address is translated to
a physical address. Pages from the private address space of each process live on disk and are brought into frames (page frames) in physical memory. Each process is protected from other programs.
Increase bandwidth by accessing more data from memory per transfer:
read multiple words in parallel and put them in the cache. The processor will still access them one at a time, but much faster since they are sitting in the cache.
static RAMs are
simply integrated circuits that are memory arrays with (usually) a single access port that can provide either a read or a write.
as the distance from the processor increases, the
size of the memories and the access time both increase
SRAM have a fixed access time to any datum don't need to refresh and
so the access time is very close to the cycle time. SRAMs typically use six to eight transistors per bit to prevent the information from being disturbed when read.
Capacity increases from
static, dynamic to magnetic disk
virtual memory involves the operating
system unlike cache memory
VA = (VPN, offset): find the number of offset bits.
Take the page size, e.g. 2^12 bytes per page ➔ 12 is the number of bits in the offset
Looking at the matrix traversal example: there are many more misses scanning column-wise (up to 100%)
than row-wise (about 25%); a mixed order falls in the middle. Rows: far fewer misses; columns: many misses. The best case is scanning both matrices row by row (minimal miss rate).
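An illustrative C fragment of the two traversal orders: in C a row is contiguous in memory, so the row-wise loop walks consecutive addresses while the column-wise loop strides a whole row between accesses (the 4x4 size is arbitrary):

#include <stdio.h>

#define N 4

int main(void) {
    static int a[N][N];

    // Row-wise: consecutive addresses, good spatial locality.
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i + j;

    // Column-wise: stride of N ints between accesses, poor spatial locality.
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("sum = %ld\n", sum);
    return 0;
}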
Cache index:
the index field of any address of a cache block must be that block number
SO THE LOWER 3 BITS ARE THE BLOCK ADDRESS/INDEX AND THE UPPER 2 BITS ARE THE TAG. If two blocks map to the same location,
the last one survives.
Where to look in the cache for each possible address:
the low-order bits of an address can be used to find the unique cache entry to which the address could map
The total size of the cache in blocks is equal to
the number of sets times the associativity.
The index value is used to select the set containing the address of interest, and
the tags of all the blocks in the set must be searched.
The tag needs only to contain
the upper portion of the address, corresponding to the bits that are not used as an index into the cache. we need only have the upper 2 of the 5 address bits in the tag, since the lower 3-bit index field of the address selects the block.
Because DRAMs use only a single transistor per bit of storage,
they are much denser and cheaper per bit than SRAM.
Virtual Memory:
to allow efficient and safe sharing of memory among multiple programs, such as for the memory needed by multiple virtual machines for cloud computing, and to remove the programming burdens of a small, limited amount of main memory
SRAM needs only minimal power
to retain the charge in standby mode. As long as we have power, we have the value
Flash Memory
a type of electrically erasable programmable read-only memory (EEPROM), used in tiny storage cards; writes can wear out flash memory bits.
For a fixed cache size, increasing the associativity decreases the number of sets
while increasing the number of elements per set.
Disks have a slower access time because they are mechanical devices (flash is 1000 times as fast and DRAM is 100,000 times as fast),
yet they are cheaper per bit because they have very high storage capacity at a modest cost (disk is 10 to 100 times cheaper).
When CPU performance increases - Miss penalty becomes more significant (why?)
• Decreasing base CPI - Greater proportion of time spent on memory stalls • Increasing clock rate - Memory stalls account for more CPU cycles
Choosing Which Block to Replace
• Direct mapped: no choice • Set associative - Prefer non-valid entry, if there is one - Otherwise, choose among entries in the set • Least-recently used (LRU) - usage based - Choose the one unused for the longest time • Simple for 2-way, manageable for 4-way, too hard beyond that
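A minimal, illustrative sketch (not from the book) of LRU replacement for a 2-way set associative cache: one bit per set is enough to track which way was least recently used; on a miss an invalid way is preferred, otherwise the LRU way is replaced. The cache geometry and access trace are made up for illustration.

#include <stdio.h>

#define NUM_SETS 4

// One bit per set suffices for 2-way LRU.
typedef struct {
    int valid[2];
    unsigned tag[2];
    int lru;            // index of the least-recently-used way
} Set;

static Set sets[NUM_SETS];

void access_block(unsigned block) {
    Set *s = &sets[block % NUM_SETS];
    unsigned tag = block / NUM_SETS;

    for (int way = 0; way < 2; way++) {
        if (s->valid[way] && s->tag[way] == tag) {   // hit
            s->lru = 1 - way;                        // the other way is now LRU
            printf("block %2u: hit  in way %d\n", block, way);
            return;
        }
    }
    // Miss: prefer an invalid way, otherwise replace the LRU way.
    int victim = !s->valid[0] ? 0 : (!s->valid[1] ? 1 : s->lru);
    s->valid[victim] = 1;
    s->tag[victim]   = tag;
    s->lru           = 1 - victim;
    printf("block %2u: miss, placed in way %d\n", block, victim);
}

int main(void) {
    unsigned trace[] = {0, 4, 0, 8, 4};   // all map to set 0
    for (int i = 0; i < 5; i++)
        access_block(trace[i]);
    return 0;
}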
The GOAL: illusion of large, fast, cheap memory • How do we create a memory that is large, cheap and fast (most of the time)? 1. Hierarchy 2. Parallelism
• Fact: Large memories are slow (DRAM), fast memories are small (SRAM) • Fact: Large memories are cheap (disk), fast memories are expensive (DRAM)