Chap 5 PART 1


A page table keeps track of where each page is placed in our physical memory; the physical address (PA) is

(PPN, offset)

VA = 32 bits

(size of address space = 2^32 = 4 GB)

Hit time is the time to access the upper level of the memory hierarchy

, which includes the time needed to determine whether the access is a hit or a miss (that is, the time needed to look through the books on the desk). The hit time will be much smaller than the time to access the next level in the hierarchy.

Moreover, if it is, how do we find it? The simplest way to assign a location in the cache for each word in memory is to assign the cache location based on the address of the word in memory: DIRECT MAPPING. Each block has only one choice:

- (Block address) modulo (# of blocks in cache). The number of blocks is a power of 2. Because there are eight words in the cache, an address X maps to the direct-mapped cache word X modulo 8. That is, the low-order log2(8) = 3 bits are used as the cache index (see the sketch below).
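A minimal sketch of this index calculation in Python (the constant and function names are mine):

```python
NUM_BLOCKS = 8  # must be a power of 2

def direct_mapped_index(block_address):
    # (Block address) modulo (# of blocks in cache); since NUM_BLOCKS is a
    # power of 2, this keeps just the low-order log2(8) = 3 bits.
    return block_address % NUM_BLOCKS

assert direct_mapped_index(0b10101) == 0b101  # the low 3 bits survive
```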

- 1 GB memory ➔ 1 GB = 2^30 bytes ➔ PA 30 bits

- 4 GB memory ➔ 4 GB = 2^32 bytes ➔ PA 32 bits
- 512 MB memory ➔ 512 MB = 2^29 bytes ➔ PA 29 bits

Average memory access time (AMAT)
- AMAT = Hit time + Miss rate × Miss penalty
• Example: CPU with 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%

- AMAT = 1 + 0.05 × 20 = 2 ns • 2 cycles per access (see the sketch below)
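A minimal sketch of the AMAT arithmetic above (variable names are mine):

```python
hit_time = 1       # cycles
miss_rate = 0.05   # I-cache miss rate
miss_penalty = 20  # cycles

amat = hit_time + miss_rate * miss_penalty
print(amat)  # 2.0 cycles = 2 ns with a 1 ns clock
```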

Ideal memory

- Access time of SRAM
- Capacity and cost/GB of disk

Primary cache

- Focus on minimal hit time (direct mapping minimizes hit time)
- We also separate the I-cache and D-cache
- The datapath accesses memory at least once (IF) and maybe twice (MEM) for each instruction; to avoid structural hazards, we split the L1 cache into instruction and data caches

Try to minimize page fault rate

- Fully associative placement

Spatial locality

- Items with memory addresses near those accessed recently are likely to be accessed soon
- E.g., sequential instruction access, sequential access to array data. Libraries put books on the same topic together on the same shelves to increase spatial locality. In an array, for example, accessing element i will bring in i, i+1, i+2

Larger blocks mean fewer of them • But in a fixed-sized cache • More competition → increased miss rate

- Larger blocks → pollution • Larger miss penalty - Can override the benefit of reduced miss rate - Early restart and critical-word-first can help

Page size 1 KB: 1 KB = 2^10 bytes ➔ page offset 10 bits

- Page size 4 KB: 4 KB = 2^12 bytes ➔ page offset 12 bits

- Virtual address space divided into pages

- Physical memory (RAM) divided into page frames
- Each page can be brought into a page frame and will fit perfectly

The Principle of Locality:

- Programs access a relatively small portion of the address space at any instant of time

Access to a sector involves

- Queuing delay if other accesses are pending
- Seek: move the R/W heads mechanically (long delay)
- Rotational latency (mechanical delay, long): the time required for the desired sector of a disk to rotate under the read/write head; usually assumed to be half the rotation time
- Data transfer
- Controller overhead (the controller accesses the data and makes it available to the rest of the system)
Lesson: once you locate your data, read as many bytes as you can! (Don't read individual bytes from disks; read whole blocks.)

Each sector records

- Sector ID
- Data (512 bytes; 4096 bytes proposed)
- Error correcting code (ECC) • Used to hide defects and recording errors

On page fault, the page must be fetched from disk

- Takes millions of clock cycles - Handled by OS code

What if there is no data in a location?

- Valid bit: 1 = present, 0 = not present - Initially 0

VA = (VPN, offset): find the number of VPN bits. Virtual address = 16 bits, offset = 12 bits.

16 - 12 = 4 bits for the VPN

How many entries (the size of the page table)?

2^VPN = 2^4 = 16 entries (see the sketch below)
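A minimal sketch of the bit arithmetic above (variable names are mine):

```python
va_bits = 16                      # virtual address width
offset_bits = 12                  # page offset width

vpn_bits = va_bits - offset_bits  # 16 - 12 = 4 bits
num_entries = 2 ** vpn_bits       # 2^4 = 16 page-table entries
print(vpn_bits, num_entries)      # 4 16
```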

Fully Associative: a cache structure in which a block can be placed in any location in the cache. All the entries in the cache must be searched because a block can be placed in any one. To make the search practical, it is done in parallel with a comparator associated with each cache entry.

8-way set associative → 8 blocks/set. With 8 blocks total, 8/8 = 1 set.

Alternative to write-through: write-back

On a data-write hit, just update the block in the cache. Keep track of whether each block is dirty (i.e., modified). • When a dirty block is replaced, write it back to memory. A write buffer can be used to allow the replacing block to be read first. Write-back schemes can improve performance.

WRITE BACK AND WRITE THROUGH ASSUME A WRITE HIT

A tag field, which is used to compare with the value of the tag field of the cache

A cache index, which is used to select the block. Cache blocks here are 1 word / 4 bytes; the data stored in each cache entry is 4 bytes.

How frequently can you initiate an access?

A little kid can ask for money from Dad on Sunday, Mom on Monday, Grandma on Tuesday... All family members are like memory banks. Each bank has a long cycle time, but you can initiate accesses to different banks in quick succession. Memory banks are great for overlapping access time with cycle time.

Alternatives on a write miss for write-through

- Allocate on miss: fetch the block, then write
- Write around: don't fetch the block; write directly to memory • Useful since programs often write a whole block before reading it (e.g., initialization)
• For write-back
- Usually fetch the block

NAND Flash

Bit cell like a NAND gate. Denser (bits/area), but accessed a block at a time. Cheaper per GB.

NOR Flash

Bit cell like a NOR gate. Random read/write access. Used for instruction memory.

Mapping an Address to a Multiword Cache Block Consider a cache with 64 blocks and a block size of 16 bytes. To what block number does byte address 1200 map?

Block address = byte address / bytes per block = 1200/16 = block 75. Block number 75 modulo 64 = 11: this is the cache index to which block 75 maps (see the sketch below).
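A minimal sketch of this mapping (variable names are mine):

```python
byte_address = 1200
bytes_per_block = 16
num_blocks = 64

block_address = byte_address // bytes_per_block  # 1200 // 16 = 75
cache_index = block_address % num_blocks         # 75 % 64 = 11
print(block_address, cache_index)                # 75 11
```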

Sizes typically in powers of 2.

Can use higher order address bits as page numbers, and lower address bits as offsets within page.

Take Advantage of Locality: memory hierarchy. Store everything on disk.

Copy recently accessed (and nearby) items from disk to smaller DRAM (main memory). Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
- Cache memory attached to CPU. Processor: cache (faster) -> main memory (slower). Cache memory: the level of the memory hierarchy closest to the CPU

middle in size, speed, and price

DRAM

DRAM (read/write) cycle time >> DRAM (read/write) access time (the cycle time is much longer)

3 APPROACHES TO MAPPING

Direct Mapped Cache: most rigid, only one location
Set Associative Cache
Fully Associative Cache: least rigid, most flexible
More flexibility decreases the miss rate and increases the hit rate

Increasing the associativity increases the number of blocks per set

Each increase by a factor of 2 in associativity doubles the number of blocks per set and halves the number of sets

Each disk surface is divided into concentric circles, called tracks. There are typically tens of thousands of tracks per surface.

Each track is in turn divided into sectors that contain the information

To actually find the VPN:

VPN = FLOOR(VA / PAGE SIZE). Example: physical frame number = 6 • physical address = 6 × 4096 + 4 = 24576 + 4 = 24580 (see the sketch below)
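A minimal sketch of this translation (the page-table contents here are hypothetical, chosen just to reproduce the numbers above):

```python
PAGE_SIZE = 4096
page_table = {1: 6}  # hypothetical: VPN 1 -> physical frame 6

def translate(va):
    vpn = va // PAGE_SIZE     # FLOOR(VA / page size)
    offset = va % PAGE_SIZE   # the offset is unchanged by translation
    frame = page_table[vpn]   # a missing entry would be a page fault
    return frame * PAGE_SIZE + offset

print(translate(1 * PAGE_SIZE + 4))  # 6 * 4096 + 4 = 24580
```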

Magnetic disks are nonvolatile like flash, but unlike flash there is no write wear-out problem.

However, flash is much more rugged and hence a better match to the jostling inherent in personal mobile devices.

How do we know if a data item is in the cache?

If each word can go in exactly one place in the cache, then it is straightforward to find the word if it is in the cache

This second-level cache is normally on the same chip and is accessed whenever a miss occurs in the primary cache. If the second-level cache contains the desired data, the miss penalty for the first-level cache will be essentially the access time of the second-level cache, which will be much less than the access time of main memory.

If neither the primary nor the secondary cache contains the data, a main memory access is required, and a larger miss penalty is incurred.

In a fully associative cache, all tags of all the blocks in the cache must be searched. Associativity can be increased while keeping the size of the cache the same.

If the number of blocks stays the same, increasing associativity decreases the number of sets. The advantage of increasing the degree of associativity is that it usually decreases the miss rate.

miss rate

If the data is not found in the upper level, the request is called a miss. The lower level in the hierarchy is then accessed to retrieve the block containing the requested data. The miss rate (1−hit rate) is the fraction of memory accesses not found in the upper level.

HIT

If the data requested by the processor appears in some block in the upper level, this is called a hit (analogous to your finding the information in one of the books on your desk).

Position and function of the MMU (Memory Management Unit): the CPU issues a virtual address that is fed to the MMU, which uses the TLB and page tables (both maintained and updated by the OS) to generate a physical address. The physical address is sent to memory. If the page is in memory, the data is sent back; if not, there is a page fault and the request goes to the disk controller.

On a page fault, the OS swaps out the process that caused the fault and gives the CPU to another process; it can't wait.

If the tag bits do not match: miss

The index bits select an entry in the cache memory: CHECK the valid bit. If it is 1, then compare the tag bits; a match is a hit, otherwise a miss

Temporal locality

Items accessed recently are likely to be accessed again soon, e.g., instructions in a loop, induction variables. In a loop, you fetch an instruction from memory, put it in the cache, run the loop again, and get it from the cache. If you recently brought a book to your desk to look at, you will probably need to look at it again soon.

L-2 cache
- Focus on low miss rate to avoid main memory access (more associativity)
- Hit time has less overall impact because the L2 cache is accessed less frequently
- Unified

L-1 cache is usually smaller than a single cache would be
- L-1 block size is smaller than L-2 block size

Why have larger blocks?

Larger blocks should reduce the miss rate
- Due to spatial locality
• But in a fixed-sized cache
- Larger blocks mean fewer of them
• More competition → increased miss rate
- Larger blocks → pollution
• Larger miss penalty
- Can override the benefit of the reduced miss rate
- Early restart and critical-word-first can help

Dynamic Ram

Larger capacity, slower, cheaper In a dynamic RAM (DRAM), the value kept in a cell is stored as a charge in a capacitor. A single transistor is then used to access this stored charge, either to read the value or to overwrite the charge stored there.

Memory Hierarchy

Level 1: closest to the processor. Multiple levels of cache (SRAM), main memory (DRAM), secondary memory (disks). A typical memory hierarchy takes advantage of the principle of locality: it can present the user with as much memory as is available in the cheapest technology.

Given
- I-cache miss rate = 2%
- D-cache miss rate = 4%
- Miss penalty = 100 cycles
- Base CPI (ideal cache) = 2
- Loads & stores are 36% of instructions

Miss cycles per instruction
- I-cache: 0.02 × 100 = 2
- D-cache: 0.36 × 0.04 × 100 = 1.44
Actual CPI = 2 + 2 + 1.44 = 5.44
- The ideal CPU is 5.44/2 = 2.72 times faster (see the sketch below)
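A minimal sketch reproducing the CPI arithmetic above (variable names are mine):

```python
base_cpi = 2
miss_penalty = 100
icache_miss_rate = 0.02
dcache_miss_rate = 0.04
mem_access_frac = 0.36  # loads & stores per instruction

icache_stall = icache_miss_rate * miss_penalty                    # 2.0
dcache_stall = mem_access_frac * dcache_miss_rate * miss_penalty  # 1.44
actual_cpi = base_cpi + icache_stall + dcache_stall               # 5.44
print(actual_cpi, actual_cpi / base_cpi)  # 5.44, 2.72x slower than ideal
```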

Cache Misses: the control unit must detect a miss and process it by fetching the requested data from memory

Modifying the control of a processor to handle a hit is trivial; misses, however, require some extra work. On a cache hit, the CPU proceeds normally

fully associative cache: slower hit-determination time, since multiple locations must be compared

In a direct-mapped cache, multiple locations in memory map to the same block in the cache. Mod 8 ← the memory address's lower 3 bits (2^3 = 8)

FLASH MEMORY: with wear leveling, personal mobile devices are very unlikely to exceed the write limits of the flash. Such wear leveling lowers the potential performance of flash, which is not suitable as a direct RAM replacement

Nonvolatile semiconductor storage tends to be faster to access than magnetic disks; smaller in capacity, lower power, more robust

Increase bandwidth by widening memory access, OR organize memory into banks

With banks, the processor can access words spread across the banks and bring them to the processor. Once a bank has been accessed, you must wait out its cycle time before accessing it again, but other banks can be accessed in the meantime

E.g., a 32-bit address cache with 1024 blocks; each block is 16 bytes.

Offset: determined by the power of 2 of the block size in bytes = 4 bits. Index: determined by the power of 2 of the number of blocks = 10 bits. Tag is the remaining bits of the 32.
TAG 18 | INDEX 10 | OFFSET 4

E.g., a 32-bit address cache of 512 blocks; each block is 64 bytes.

Offset: 2^6 = 64 bytes → 6 bits. Index: 2^9 = 512 blocks → 9 bits. Tag is the remaining bits of the 32.
TAG 17 | INDEX 9 | OFFSET 6

E.g., 1024 blocks, each of 4 bytes.

Offset: 2^2 = 4 bytes → 2 bits. Index: 2^10 = 1024 blocks → 10 bits. Tag is the remaining bits of the 32.
TAG 20 | INDEX 10 | OFFSET 2
(A sketch covering all three examples follows.)
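A minimal sketch covering all three direct-mapped examples (the function name is mine):

```python
import math

def field_widths(addr_bits, num_blocks, block_bytes):
    # The offset selects the byte in the block; the index selects the
    # block; the tag is whatever remains of the address.
    offset = int(math.log2(block_bytes))
    index = int(math.log2(num_blocks))
    return addr_bits - index - offset, index, offset  # (tag, index, offset)

print(field_widths(32, 1024, 16))  # (18, 10, 4)
print(field_widths(32, 512, 64))   # (17, 9, 6)
print(field_widths(32, 1024, 4))   # (20, 10, 2)
```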

Page size 1024 bytes (1 KB = 2^10 bytes) • Address = 1300 • Virtual page number = floor(1300/1024) = 1, held in the remaining upper 22 bits of a 32-bit address • Offset = 1300 mod 1024 = 276, in the lower 10 bits

Offsets are the same in both virtual and physical address spaces

To access data, the operating system must direct the disk through a three-stage process. The first step is to position the head over the proper track. This operation is called a seek, and the time to move the head to the desired track is called the seek time.

Once the head has reached the correct track, we must wait for the desired sector to rotate under the read/write head. This time is called the rotational latency or rotational delay. The last component of a disk access, transfer time, is the time to transfer a block of bits. The transfer time is a function of the sector size, the rotation speed, and the recording density of a track.

Increased Bandwidth

One-word-wide memory organization is slow

The disk heads for each surface are connected together and move in conjunction, so that every head is over the same track of every surface.

Disks PACK more bits per area. The way to store a bit of information is to change the polarity of the magnetic flux on the material.

• If page is not present

PTE can refer to location in swap space on disk

Translation using the page table

The page table lives in memory, in the OS kernel address space. The page table base register is loaded with the page table's base address. The OS maintains and updates all of this when the CPU is switched from one process to another.

Page Tables: store placement information (one page table per process). An array of page table entries, indexed by virtual page number.

Page table register in CPU points to page table in physical memory

Replacement policy: random as well

Random
- Gives approximately the same performance as LRU for high associativity
- Non-usage-based
As associativity increases, implementing LRU gets harder

fastest, smallest, most expensive memory

SRAM

Static Ram

Small, faster, costs more, takes more area (less dense). Used for caches.

Magnetic disk

Compared with solid state drives: much cheaper, larger capacity, and much slower

Write-through solution

Solution: write buffer
- Holds data waiting to be written to memory; it stores the data while it waits to be written
- CPU continues immediately • Only stalls on a write if the write buffer is already full, until there is an empty position in the write buffer

Block (or line)

The minimum unit of information that can be either present or not present in a cache. It is basically one book in our example.

In a fully associative cache, there is effectively only one set, and all the blocks must be checked in parallel.

Thus, there is no index, and the entire address, excluding the block offset, is compared against the tag of every block. In other words, we search the entire cache without any indexing.

Write Miss (data is not in cache): what should happen on a write miss? Write Allocate:

The block is fetched from memory and then the appropriate portion of the block is overwritten. An alternative strategy is to update the portion of the block in memory but not put it in the cache, called no write allocate.

magnetic disk" The metal platters are covered with magnetic recording material on both sides, similar to the material found on a cassette or videotape. To read and write information on a hard disk, a movable arm containing a small electromagnetic coil called a read-write head is located just above each surface.

The entire drive is permanently sealed to control the environment inside the drive, which, in turn, allows the disk heads to be much closer to the drive surface.

DRAMs buffer rows for repeated access. The buffer acts like an SRAM; by changing the address, random bits can be accessed in the buffer until the next row access.

This capability improves the access time significantly, since the access time to bits in the row is much lower

Four-way Set Associative

Total blocks: 8. How many sets? 8/4 = 2. Two sets with 4 blocks each.

Set Associative (e.g., 2-Way Set Associative): a block maps to only one set but has a fixed number of locations (at least two) where it can be placed; it can be in any location within that set.

Total blocks: 8. How many sets? 8/2 = 4. Four sets with 2 blocks each. In a set-associative cache, the set containing a memory block is given by (Block number) modulo (Number of sets in the cache); all the tags of all the elements of the set must be searched (see the sketch below).
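A minimal sketch of the set selection (assuming the 8-block caches above; names are mine):

```python
def cache_set(block_number, num_sets):
    # (Block number) modulo (Number of sets in the cache)
    return block_number % num_sets

# 2-way: 8 blocks / 2 ways = 4 sets; 4-way: 8 blocks / 4 ways = 2 sets
print(cache_set(13, 4))  # block 13 -> set 1 in the 2-way cache
print(cache_set(13, 2))  # block 13 -> set 1 in the 4-way cache
```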

Virtual Address =

(VPN, OFFSET)

What happens to a write miss?

We first fetch the words of the block from memory. After the block is fetched and placed into the cache, we can overwrite the word that caused the miss into the cache block. We also write the word to main memory using the full address

Write-Through: on a data-write hit, we could just update the block in the cache
- But then cache and memory would be inconsistent

Write-through: update the cache and also update memory, i.e., update both the cache and the next lower level of the memory hierarchy, ensuring that data is always consistent between the two. • But this makes writes take longer

The miss penalty is the time to replace

a block in the upper level with the corresponding block from the lower level, plus the time to deliver this block to the processor (or the time to get another book from the shelves and place it on the desk).


On cache miss - Stall the CPU pipeline - Fetch block from next level of hierarchy - Instruction cache miss • Restart instruction fetch

a. If an instruction access results in a miss, then the content of the instruction register is invalid.
b. Send the original PC value (current PC - 4) to the memory.
c. Instruct main memory to perform a read and wait for the memory to complete its access.
d. Write the cache entry, putting the data from memory in the data portion of the entry, writing the upper bits of the address (from the ALU) into the tag field, and turning the valid bit on.
e. Restart the instruction execution at the first step, which will refetch the instruction, this time finding it in the cache.

Cache memory (managed in hardware)

acts as a cache for main memory. Virtual memory does the same for data on secondary storage: data moves slowly but capacity is large. Reads flow from secondary memory to main memory to the second-level cache; writes flow in the opposite direction.

How else to improve cache performance?

Add associativity: increase the hit rate and reduce the miss rate (hit rate + miss rate = 1).

Because each cache location can contain the contents of a number of different memory locations, how do we know whether the data in the cache corresponds to a requested word? (how do we know whether a requested word is in the cache or not? )

adding a set of tags to the cache. The tags contain the address information required to identify whether a word in the cache corresponds to the requested word.

CPU and OS translate virtual

addresses to physical addresses during execution time

Data is copied between only two

adjacent levels at a time The upper level—the one closer to the processor—is smaller and faster than the lower level, since the upper level uses technology that is more expensive.

fully associative cache

allows a given block to go anywhere in the cache; requires all entries to be searched at once; one tag comparator per entry

As DRAMs store the charge on a capacitor, it cannot be kept indefinitely

and must periodically be refreshed. To refresh a cell, we merely read its contents and write them back

Translation takes the virtual page number

and uses it as an index into this page table

Set associative: you break the cache into a number of sets; within each set there are n blocks into which data can be placed anywhere. Direct mapped →

maps a block address in memory to a single location in the upper level of the hierarchy; each block maps to only one location

VM managed jointly

by CPU hardware and operating system

In virtual memory, main memory is

cache for secondary storage(disk)

A VM translation miss is called a page fault. If the page is not in memory,

it is sitting on disk and needs to be brought from disk into main memory. The page is neither in memory nor in the cache.

faster memory is

close to the processor and the slower, less expensive memory is below it.

Flash memory: to cope with such limits, most flash products include a

controller to spread the writes by remapping blocks that have been written many times to less trodden blocks. This technique is called wear leveling.

Programs share physical main memory (RAM)

each gets a private virtual address space holding its frequently used code and data

n-way set associative

each set contains n entries; the block number determines which set; search all entries in a given set at once

The size of the page table is determined by how many bits the virtual page number

has: with 20 bits for the virtual page number, the page table has 2^20 entries

Making the chip wider also

improves the memory bandwidth of the chip.

To avoid performance loss, the bandwidth of main memory is

increased to transfer cache blocks more efficiently

1024 blocks of 4 bytes each, with 4-way associativity: 1024/4 = 256 sets. NOW THESE ARE THE SETS. Offset: how many powers of 2 for the byte size = 2. Index: how many powers of 2 for the set number (256) = 8 bits. Tag is 22 bits.

Index: how many powers of 2 for the set number (see the sketch below)
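A minimal sketch of the 4-way breakdown above (variable names are mine):

```python
import math

addr_bits, num_blocks, block_bytes, ways = 32, 1024, 4, 4
num_sets = num_blocks // ways              # 1024 / 4 = 256 sets
offset = int(math.log2(block_bytes))       # 2 bits
index = int(math.log2(num_sets))           # 8 bits (selects the set)
tag = addr_bits - index - offset           # 22 bits
print(tag, index, offset)                  # 22 8 2
```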

- Data cache miss • Complete the data access. The control of the cache on a data access

is essentially identical: on a miss, we simply stall the processor until the memory responds with the data

With eight blocks, an eight-way set associative cache

is the same as a fully associative cache.

If the miss penalty increased linearly with the block size,

larger blocks could easily lead to lower performance. To avoid performance loss, the bandwidth of main memory is increased to transfer cache blocks more efficiently

slowest speed , biggest, and cheapest

magnetic disk

increase associativity decreases

miss rate

The use of a larger block decreases the

miss rate and improves the efficiency of the cache by reducing the amount of tag storage relative to the amount of data storage in the cache.

The faster memories are

more expensive per bit than the slower memories and thus are smaller.

If memory can complete writes at a rate lower than the rate at which the processor is generating writes,

no amount of buffering can help, because writes are being generated faster than the memory system can accept them.

• If the page is present in memory, the PTE stores the physical page frame

number, plus other status bits (referenced, modified, ...)

As associativity increases, the number

of index bits decreases and the number of tag bits increases

The hit rate, or hit ratio, is the fraction

of memory accesses found in the upper level; it is often used as a measure of the performance of the memory hierarchy

The principle of locality states that programs access a relatively small portion

of their address space at any instant of time, just as you accessed a very small portion of the library's collection

A VM block is called a page:

page size is a power of 2

virtual address is translated to

a physical address. Pages from each process's private address space are brought from disk into page frames in physical memory, protected from other programs

Increase bandwidth by accessing data from memory in parallel:

read multiple words at once and put them in the cache. The processor will still access them one at a time, but much faster since they are sitting in the cache

Static RAMs (SRAMs) are

simply integrated circuits that are memory arrays with (usually) a single access port that can provide either a read or a write.

as the distance from the processor increases, the

size of the memories and the access time both increase

SRAMs have a fixed access time to any datum, don't need to be refreshed, and

so the access time is very close to the cycle time. SRAMs typically use six to eight transistors per bit to prevent the information from being disturbed when read.

Capacity increases from

static RAM to dynamic RAM to magnetic disk

Virtual memory involves the operating

system, unlike cache memory

VA = (VPN, offset): find the number of offset bits

Take the page size: with pages of 2^12 bytes, 12 is the number of bits in the offset

Looking at a lot more misses column-wise (100%)

than row-wise (25%); a fixed access pattern sits in the middle. Rows: far fewer misses; columns: many misses. Two best cases: scanning the rows of the two matrices (minimal miss rate) (see the sketch below)
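A minimal sketch of the two traversal orders (my own illustration; Python lists won't show real cache effects, the access patterns are the point):

```python
n = 256
matrix = [[0] * n for _ in range(n)]

def sum_row_wise(m):
    # Inner loop walks along a row: neighbors share a cache block,
    # so spatial locality keeps the miss rate low.
    return sum(m[i][j] for i in range(n) for j in range(n))

def sum_column_wise(m):
    # Inner loop walks down a column: each access lands in a different
    # block, so nearly every access misses.
    return sum(m[i][j] for j in range(n) for i in range(n))

assert sum_row_wise(matrix) == sum_column_wise(matrix)
```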



Cache index:

the index field of any address of a cache block must be that block's cache index

IN THE 8-BLOCK EXAMPLE, THE LOWER 3 BITS OF THE BLOCK ADDRESS ARE THE INDEX AND THE UPPER 2 BITS ARE THE TAG. If two blocks map to the same location,

the last one survives.

Where to look in the cache for each possible address:

the low-order bits of an address can be used to find the unique cache entry to which the address could map

The total size of the cache in blocks is equal to

the number of sets times the associativity.

The index value is used to select the set containing the address of interest, and

the tags of all the blocks in the set must be searched.

The tag needs only to contain

the upper portion of the address, corresponding to the bits that are not used as an index into the cache. we need only have the upper 2 of the 5 address bits in the tag, since the lower 3-bit index field of the address selects the block.

Because DRAMs use only a single transistor per bit of storage,

they are much denser and cheaper per bit than SRAM.

Virtual Memory:

to allow efficient and safe sharing of memory among multiple programs, such as for the memory needed by multiple virtual machines for cloud computing, and to remove the programming burdens of a small, limited amount of main memory

SRAM needs only minimal power

to retain the charge in standby mode. As long as we have power, we have the value

Flash Memory

A type of electrically erasable programmable read-only memory (EEPROM), used in tiny storage cards. Writes can wear out flash memory bits.


For a fixed cache size, increasing the associativity decreases the number of sets

while increasing the number of elements per set.

Disks have a slower access time because they are mechanical devices (flash is 1000 times as fast and DRAM is 100,000 times as fast),

yet they are cheaper per bit because they have very high storage capacity at a modest cost: disk is 10 to 100 times cheaper

When CPU performance increases, the miss penalty becomes more significant (why?)

• Decreasing base CPI
- Greater proportion of time spent on memory stalls
• Increasing clock rate
- Memory stalls account for more CPU cycles

Choosing Which Block to Replace

• Direct mapped: no choice
• Set associative
- Prefer a non-valid entry, if there is one
- Otherwise, choose among entries in the set
• Least-recently used (LRU): usage-based
- Choose the one unused for the longest time
• Simple for 2-way, manageable for 4-way, too hard beyond that (see the sketch below)
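A minimal sketch of LRU within one set (my own illustration, not from the text):

```python
from collections import OrderedDict

class LRUSet:
    # One set of an n-way set-associative cache with LRU replacement.
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> data; order tracks recency

    def access(self, tag):
        if tag in self.blocks:             # hit: mark most recently used
            self.blocks.move_to_end(tag)
            return "hit"
        if len(self.blocks) >= self.ways:  # set full: evict the LRU block
            self.blocks.popitem(last=False)
        self.blocks[tag] = None            # bring the block in
        return "miss"

s = LRUSet(ways=2)
print([s.access(t) for t in [1, 2, 1, 3, 2]])
# ['miss', 'miss', 'hit', 'miss', 'miss']
```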

The GOAL: the illusion of large, fast, cheap memory • How do we create a memory that is large, cheap, and fast (most of the time)? 1. Hierarchy 2. Parallelism

• Fact: large memories are slow (DRAM); fast memories are small (SRAM)
• Fact: large memories are cheap (disk); fast memories are expensive (DRAM)

