ECEN 350 Final Study

For the following code:
ADD X1, XZR, 4
OuterLoop: ADD X2, XZR, 3
InnerLoop: ADD X2, X2, -1
BR1: CBNZ X2, InnerLoop
ADD X1, X1, -1
BR2: CBNZ X1, OuterLoop
Calculate the prediction accuracy of a one-bit branch predictor for the CBNZ at BR1. Assume the predictor is initialized as taken (1).

0.416
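The 5/12 figure can be sanity-checked with a short simulation. A minimal Python sketch, assuming the BR1 outcome pattern T, T, N repeated for each of the four outer-loop iterations:

```python
# BR1 outcomes: the inner loop decrements X2 from 3, so CBNZ at BR1 is
# Taken, Taken, Not-taken in each of the 4 outer-loop iterations.
outcomes = [True, True, False] * 4

state = True   # 1-bit predictor initialized to taken (1)
correct = 0
for taken in outcomes:
    if state == taken:
        correct += 1
    state = taken  # a 1-bit predictor just remembers the last outcome

accuracy = correct / len(outcomes)
print(correct, len(outcomes))  # 5 12
```

Only the first outer iteration gets both of its taken branches right; every later iteration mispredicts twice, giving 5/12 ≈ 0.4167 (truncated to 0.416 above).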

For a particular branch with this observed behavior: N T N N N T N N What is the prediction accuracy of a 1-bit predictor? Assume the predictor starts in the 0 (N) state. The answer should be formatted as a decimal, so 20% accuracy should be represented as .2.

0.5

For the following code:
ADD X1, XZR, 4
OuterLoop: ADD X2, XZR, 3
InnerLoop: ADD X2, X2, -1
BR1: CBNZ X2, InnerLoop
ADD X1, X1, -1
BR2: CBNZ X1, OuterLoop
Calculate the prediction accuracy of a two-bit branch predictor for the CBNZ at BR1. Assume the predictor is initialized as weakly taken (10).

0.66

For the branch history in the previous question (N T N N N T N N), calculate the accuracy with a 2-bit predictor starting at state 01 (weakly N).

0.75

In the following instruction stream, assuming a static prediction of not-taken for all branches and no branch delay slots, how many control-flow related stall cycles will occur the first time through the loop (assume X2 is not zero)?
LOOP: CBZ X2, OUT
ADD X2, X2, -1
B LOOP
OUT:

1

For the following code:
1 ADD X1, X2, X3
2 SUB X2, X1, X5
3 LDUR X8, [X5, 0]
4 ADD X7, X8, X6
Assuming forwarding is available and that the compiler can reorder code (but only in ways that do not change the ultimate output), which of the following arrangements would achieve the best performance?

1 ADD X1, X2, X3
3 LDUR X8, [X5, 0]
2 SUB X2, X1, X5
4 ADD X7, X8, X6

For the address stream in the 16KB, 4-way set associative cache problem, how many references are conflict misses?

1, only reference #9 (0x21345002) is a conflict miss.

Assuming a standard page size of 4KB, a 42-bit address space, 4-bytes per page table entry and 256 separate processes managed by the OS at a given time, calculate the total size (in GB) of the required page tables assuming the page table is a single, unified page table for each process virtual memory space (you may ignore the OS's VM space).

1,024
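The arithmetic: with 4 KB pages, a 42-bit address space has 2^30 pages, so each flat per-process table holds 2^30 four-byte entries. A Python sketch of the calculation:

```python
va_bits, page_bits, pte_bytes, processes = 42, 12, 4, 256

entries = 2 ** (va_bits - page_bits)        # 2^30 PTEs per process
table_bytes = entries * pte_bytes           # 4 GB of page table per process
total_gb = processes * table_bytes // 2**30
print(total_gb)  # 1024
```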

For the following code:
1 ADD X1, X2, X3
2 SUB X2, X1, X5
3 LDUR X8, [X5, 0]
4 ADD X7, X8, X6
What would the CPI be with data forwarding/bypass?

1.25

If a disk drive spins at 5400 rpm, has an average seek time of 5ms and a transfer rate of 120MB/s, what is the average time to read a page of 4KB from disk? Round your answer to single digit after the decimal. Give your answer in ms.

10.9

What is the answer to question 6 if we need to add elements of a larger array of size 2048x2048?

100

You are given the following code to add all the even elements of an array:
for (i=0; i<2048; i++) k += A[2*i];
The cache is 32KB, direct-mapped, with a cache block size of 16 bytes. Each element of A is 8 bytes. What is the miss rate for accessing the array A? Give the answer as a percentage.

100
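The 16-byte stride of A[2*i] (two 8-byte elements) exactly matches the block size, so every access lands in a fresh block. A direct-mapped cache simulation confirming this (a Python sketch; the array is assumed to start at address 0, block-aligned):

```python
cache_bytes, block_bytes = 32 * 1024, 16
num_sets = cache_bytes // block_bytes      # 2048 direct-mapped sets

tags = [None] * num_sets
misses = 0
for i in range(2048):
    addr = (2 * i) * 8                     # byte address of A[2*i]
    block = addr // block_bytes
    idx = block % num_sets
    if tags[idx] != block:                 # direct-mapped: one block per set
        misses += 1
        tags[idx] = block

print(100 * misses // 2048)  # 100
```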

Given that a cache hit time is 1 cycle, cache miss penalty is 10 cycles, page fault miss penalty is 1000 cycles, TLB miss penalty is 20 cycles and that the page translation map resides in memory, what is the largest time a memory reference can potentially take in cycles? Remember, a memory reference is reissued after a miss is serviced.

1033

For a 32KB 4-way set-associative cache with a cache block of 256 bytes, what address bits would be used to index the cache. Provide your answer in the form of x-y, where x is the most significant bit location and y is the least significant bit location. Address is 32-bits long.

12-8

For the following addresses in hex, break them down into tag, index and offset (in hex) for this cache: a 16KB, 4-way set associative cache with 16-byte blocks (lines).
0x12345002: Tag=0x______, IDX=0x______, Offset=0x______
0x12345012: Tag=0x______, IDX=0x______, Offset=0x______
0x12344002: Tag=0x______, IDX=0x______, Offset=0x______

0x12345002: Tag=0x12345, IDX=0x00, Offset=0x2
0x12345012: Tag=0x12345, IDX=0x01, Offset=0x2
0x12344002: Tag=0x12344, IDX=0x00, Offset=0x2
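These splits can be checked mechanically: 16-byte blocks give 4 offset bits, and 16 KB / (4 ways × 16 B) = 256 sets give 8 index bits. A Python sketch (`split` is an illustrative helper, not course code):

```python
OFFSET_BITS, INDEX_BITS = 4, 8    # 16-byte blocks, 256 sets

def split(addr):
    """Break a 32-bit address into (tag, index, offset) hex strings."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return hex(tag), hex(index), hex(offset)

for a in (0x12345002, 0x12345012, 0x12344002):
    print(split(a))
# ('0x12345', '0x0', '0x2')
# ('0x12345', '0x1', '0x2')
# ('0x12344', '0x0', '0x2')
```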

For the following addresses, give the virtual page number (VPN) and page offset:
0x12345678: VPN=0x____, Page offset=0x____
0x1000010F: VPN=0x____, Page offset=0x____
0x11111234: VPN=0x____, Page offset=0x____
0xFFFFFF20: VPN=0x____, Page offset=0x____

0x12345678: VPN=0x12345, Page offset=0x678
0x1000010F: VPN=0x10000, Page offset=0x10F
0x11111234: VPN=0x11111, Page offset=0x234
0xFFFFFF20: VPN=0xFFFFF, Page offset=0xF20

For the following code:
1 ADD X1, X2, X3
2 SUB X2, X1, X5
3 LDUR X8, [X5, 0]
4 ADD X7, X8, X6
Assuming hardware hazard detection and no forwarding hardware, what is the CPI of this code? (You may assume that this is in the middle of a long string of instructions, so initial pipeline fill time is 0.)

2

For this cache: a 16KB, 4-way set associative cache with 16-byte blocks (lines). How many hits would it have on the following stream of addresses? Assume Least Recently Used (LRU) replacement.
1. 0x12345004
2. 0x21345000
3. 0x12345008
4. 0x21345010
5. 0x12346004
6. 0x2222200E
7. 0x11111003
8. 0x22222003
9. 0x21345002

2, Reference #3 and #8 are hits.

A program has a CPI of 1.5 with an ideal cache. The program has 40% memory access instructions. If the observed miss rate for instructions is 5% and for data is 2.5%, what is the overall CPI if the miss penalty is 10 cycles?

2.1
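The calculation: every instruction fetch can miss in the instruction cache, while only the 40% of instructions that access data memory can miss in the data cache. A Python sketch:

```python
cpi_ideal, mem_frac = 1.5, 0.40
i_miss, d_miss, penalty = 0.05, 0.025, 10

# Stalls = instruction-miss stalls (all instructions) + data-miss stalls (40%)
cpi = cpi_ideal + 1.0 * i_miss * penalty + mem_frac * d_miss * penalty
print(round(cpi, 2))  # 1.5 + 0.5 + 0.1 = 2.1
```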

What is the miss rate if we add the elements of the array in the previous question column by column i.e., we add all the elements in column0 first then column 1 and so on... Give your answer in percentage

25

You need to sum all the elements of a 128x128 array A. Array A is stored in row-major fashion and each element of A is 8 bytes. If the cache size is 32KB, direct-mapped and the cache block size is 32 bytes, what is the miss rate if we access the elements of A row by row? Give your answer in percentage.

25

Given that a page table entry is 4B and the page size is 4KB, you want to organize the page table in a multi-level hierarchy where each entry in the page table points to one page in memory. If the virtual address is 42 bits, how many levels are in the page table?

3
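Each 4 KB node of the tree holds 4 KB / 4 B = 1024 entries, i.e. 10 VPN bits translated per level; the 42 − 12 = 30-bit VPN therefore needs three levels. A Python sketch:

```python
import math

page_bits, pte_bytes, va_bits = 12, 4, 42
# Entries per page-table node -> VPN bits resolved at each level
bits_per_level = int(math.log2((1 << page_bits) // pte_bytes))  # 10
levels = math.ceil((va_bits - page_bits) / bits_per_level)
print(levels)  # ceil(30 / 10) = 3
```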

If a program has a cache miss rate of 2% and a cache miss penalty of 100 cycles, what is the average memory access latency, given that cache hit time is 1 cycle?

3
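This is the standard average memory access time formula, AMAT = hit time + miss rate × miss penalty. In Python:

```python
hit_time, miss_rate, miss_penalty = 1, 0.02, 100
amat = hit_time + miss_rate * miss_penalty
print(amat)  # 1 + 0.02 * 100 = 3.0 cycles
```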

Adding two ports to an SRAM means increasing each cell by _____ transistors. (how many?)

4

Given a 16KB, 4-way set associative cache with 16-byte blocks (lines). Fill in the associated number of bits for each component of the address given a 32-bit physical address: Bits for offset: ____ Bits for index: ____ Bits for tag: ____

4, 8, 20

Calculate the CPIstalls given the following:
CPIideal = 1 (this is the CPI of the processor if the memory system hit in the L1I and L1D caches)
Fraction of instructions which are Load/Store: 25%
L1I hit time = N/A (already included in CPIideal)
L1D hit time = N/A (already included in CPIideal)
L1I hit rate = 95%
L1D hit rate = 90%
L2 hit rate = 75%
L2 hit time = 20 cycles
DRAM access time = 100 cycles

4.375
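The calculation can be sketched in Python: every L1 miss pays the L2 hit time, and the 25% of L1 misses that also miss in L2 pay the DRAM access time on top. Instruction fetches happen on every instruction; data accesses on only 25% of them.

```python
l2_hit_time, dram_time = 20, 100
l2_miss_rate = 1 - 0.75
miss_cost = l2_hit_time + l2_miss_rate * dram_time   # 45 cycles per L1 miss

cpi_ideal = 1.0
ifetch_stalls = (1 - 0.95) * miss_cost           # 5% L1I miss rate, all instructions
data_stalls = 0.25 * (1 - 0.90) * miss_cost      # 10% L1D miss rate, 25% of instructions
cpi = cpi_ideal + ifetch_stalls + data_stalls
print(round(cpi, 3))  # 1 + 2.25 + 1.125 = 4.375
```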

What is the miss rate for the following code, if the cache is 32KB, direct-mapped with a block size of 32 bytes?
for (i=0; i<2048; i++) k += A[2*i];

50

Given a 40-bit virtual address, page size of 4KB and each page table entry of 2B, what is the size of the page table? Give your answer in MB.

512

A given cache has 1KB of (data) storage. Each entry of the cache is an 8-byte line. Assuming 64-bit addresses, fill in the blanks below: This cache will have ___ bits for tag, ____ bits for index and ____ bits for offset.

54, 7, 3

Given a 36-bit virtual address, what is the largest program that could be run on the machine? Give your answer in GB.

64

_______ memory has high capacity because each cell consists of a single transistor and a capacitor. As a result it must be regularly refreshed to maintain its state.

DRAM

(True or False) The following is the complete RTL description of the "AND" instruction: R[rd] = R[rs] & R[rt];

False

DRAM, SRAM, and CD-ROMs are random access memories.

False

Flash memory must be refreshed to retain its state.

False

For all D-Type instructions, Read Data 2 is connected to the ALU.

False

For the pipeline in class where branches are resolved in the Dec/Reg stage, statically predicting taken will produce better results than statically predicting not taken because taken branches are more common.

False

Hardware that is not in use for a given instruction (on the unified hardware) is removed until needed to save power.

False

Stalling the pipeline is a simple, performance-neutral way to deal with pipeline hazards.

False

The DM stage must always come after the ALU stage.

False

The STUR instruction uses every pipeline stage during its execution.

False

The clock period of a pipelined microarchitecture is set by the fastest stage.

False

The fastest path through the logic determines its critical path and hence its clock frequency.

False

The register file must have three ports (two read and one write) in order to directly support D-Type instructions.

False

The register file does not produce any output on Read Data 1 and Read Data 2 for the B instruction.

False

DRAM, SRAM and Flash are random access memories.

False, Flash is block access

Instructions which do not use a given pipeline stage may skip past it in the pipeline.

False. The instruction cannot skip the stage because it would overlap with instructions ahead of it in the next stage.

Increasing associativity of a cache decreases capacity misses.

False, it decreases conflict misses.

Memory access speed tends to be directly correlated to memory size.

False, it's inversely correlated with size; big memories are slower to access.

Multi-level page tables reduce the time to access the page table versus a single unified page table.

False, they reduce the space, though they increase the access time because more memory accesses are required.

Physical memory is typically much larger than the program's virtual memory space.

False, virtual memory can be as big as 2^(number of bits in the address); physical memory is typically much smaller.

Assuming Fetch has placed an instruction on the bus called I[31:0] (in Verilog notation), which wires should be connected to "Read Addr 2" on the register file for the proper execution of R-Type instructions.

I[20:16]

Which of the following is not a benefit of virtual memory?

It improves memory system performance.

Given an 8KB 2-way set-associative cache, with 16-byte cache blocks, and 32-bit addresses, classify each of the following references as a miss or a hit.
0x0000 0010
0x0000 0020
0x0000 0024
0x0000 0018
0x0000 1018
0x0000 4014
0x0000 0120
0x0000 012C

Miss 0x0000 0010
Miss 0x0000 0020
Hit 0x0000 0024
Hit 0x0000 0018
Miss 0x0000 1018
Miss 0x0000 4014
Miss 0x0000 0120
Hit 0x0000 012C

Which of the following is not a way to alleviate a structural hazard.

Muxing together different inputs to a given piece of hardware. Muxing the inputs does not fix the problem that both inputs must access the same hardware in the same cycle (the definition of a structural hazard).

During a "STUR" instruction execution, fill in the blanks below for the values of each control signal:

Reg2Loc = 1
Branch = 0
Uncondbranch = 0
MemRead = 0
MemtoReg = x
MemWrite = 1
ALUSrc = 1
RegWrite = 0

During the CBZ instruction execution, fill in the blanks below for the values of each control signal:

Reg2Loc = 1
Branch = 1
Uncondbranch = 0
MemRead = 0
MemtoReg = x
MemWrite = 0
ALUSrc = 0
RegWrite = 0

In the figure below, fill in the control unit signal outputs for the microarchitecture when the following instruction is executed: ADD X1, X2, 0x10

Reg2Loc = x
SignOp = I-Type
Uncondbranch = 0
Branch = 0
MemRead = 0
MemtoReg = 0
ALUOp = Add
MemWrite = 0
ALUSrc = 1
RegWrite = 1
The following may help you: for the multi-bit control signals, use one of the possible named values.
Possible SignOp values: I-Type, D-Type, B-Type, CB-Type, IW-Type
Possible ALUOp values: Add, Subtract, AND, OR, Pass-B, NOR
Denote "do not care" with "x" (use only one "x" even for multi-bit control signals).

A cache in which write misses are sent directly to the next level down in the hierarchy without a local store is a write-no-allocate cache.

True

Adding "nop" instructions (literally no operation), is one way to address a data hazard.

True

Caches can work because programs tend to access the same data over and over within a given time period (temporal locality) and they tend to access nearby data with high probability (spatial locality).

True

Device wear out is a problem in many non-volatile memory technologies, including Flash and Phase Change memory.

True

For all R-Type instruction, Read Data 2 is connected to the ALU.

True

Pipelining can be viewed as a more efficient use of the datapath and control hardware.

True

Pipelining overlaps the use of different hardware components simultaneously.

True

RTL defines the actions of an instruction in terms of operations performed on registers, with the outcome being placed in another register.

True

The B instruction does not require the register file at all.

True

The TLB (ideally) improves the performance of the system by decreasing memory access time.

True

The memory system hierarchy attempts to approximate the size of its largest component with the speed of its smallest.

True

The primary driving reason that memory system hierarchies have multiple levels in modern machines is the difference in speed between large main memories and fast processors.

True

There are many different possible implementations of a particular ISA in hardware.

True

With virtual memory, the operating system uses main memory as a cache for the disk.

True

Memory Management Unit (MMU) hardware and the Operating System (O/S) manage virtual memory together.

True, hardware does most of the common functions to speed things up.

The compiler can be used to address data hazards.

True, however, it may not be as good as some hardware techniques we will discuss.

A control hazard can be viewed as a data hazard upon the PC register.

True, it leads to using the wrong PC for instruction fetches on the instructions immediately following the branch.

In the LDUR and STUR instructions, the ALU is used for effective __________ calculation.

address

For the CBZ instruction, the ALU must be set to perform the "pass _________" operation.

b

A TLB is like a small _____ for page table entries with good locality.

cache

Fetch, _______ and Execute are the three phases of an instructions life-cycle.

decode

A page _____ occurs when a given virtual page is not currently in physical memory, requiring OS intervention to pull it from disk.

fault

In the class pipelined microarchitecture, control signals are decoded in the ________ stage and pipelined along with the instruction for the subsequent stages.

instruction decode/register read (ID)

Small caches work to cover large memories because of the principle of ________.

locality

Direct mapped cache indexing is performed via the ______ mathematical function.

modulo

_______ are used to select among different inputs to a given piece of hardware when different instructions require different inputs. For example, one is used on the register file's Write Address input to select between Rd and Rt depending on which instruction is executing.

multiplexers

Virtual address to physical address translation changes the virtual _____ number to its physical counterpart while the offset remains untouched.

page

Forcing the compiler to deal with pipeline hazards can cause binary compatibility problems down the road, when new microarchitectures have different ________ lengths.

pipeline

The ______ is used in two different stages, ID and WB, as evidenced by the fact that some control signals come directly from the Control unit in ID, while others are pipelined from WB.

register file

RTL stands for ______________________.

register transfer language

The page-______ is the structure used by the OS to maintain translations between Virtual and Physical pages.

table

Pipelining improves __ ___ at the potential cost of individual instruction __ ___.

throughput, latency

